Part of: Building an AI-Powered Local Newsroom

Data Warehouse & Ingestion Pipeline


Executive Summary

The Data Warehouse and Ingestion Pipeline forms the foundational infrastructure of the Bushwick Daily AI newsroom, combining PostgreSQL 17 with pgvector for unified, semantically searchable storage of 1.52 million news articles and 212,000+ emails. The system automates 95% of news article extraction from websites using intelligent heuristics while processing the complete editorial inbox through a multi-stage pipeline. This infrastructure replaces manual information gathering with a searchable, AI-ready knowledge base, letting journalists surface relevant context in milliseconds rather than hours.


Section 1: Technical Architecture

System Overview:

                    ARTICLE INGESTION                           EMAIL INGESTION
                          |                                            |
    [75 News Outlets] --> Scrapy Spider                   Gmail API --> Service Account
                          |                                            |
                    URL Cache Check                        Deduplication Check
                          |                                            |
                    Heuristic Triage                       Gemini Classification
                          |                                            |
                    PostgresPipeline                       Batch Processing
                          |                                            |
                    +-----+-----+                          +-----+-----+
                    |           |                          |           |
              articles    article_content            classified    email_chunks
                    |           |                    _emails            |
                    +-----+-----+                          +-----+-----+
                          |                                      |
                    Celery Task                            Celery Task
                          |                                      |
                    article_chunks                     enhanced_email_embeddings
                          |                                      |
                    +-----+-----------------------------------+--+
                                          |
                              PostgreSQL 17 + pgvector
                              (HNSW Vector Indexes)
                                          |
                              RAG Chatbot / Search API

Core Technologies:

| Component | Technology | Purpose |
| --- | --- | --- |
| Database | PostgreSQL 17 + pgvector | Unified storage with vector similarity search |
| Article Scraping | Scrapy CrawlSpider | Intelligent heuristic-based news extraction |
| Task Queue | Celery + Redis | Background embedding generation |
| Embeddings | SentenceTransformers (all-MiniLM-L6-v2) | 384-dimensional semantic vectors |
| Email Extraction | Gmail API + Service Account | Domain-wide email retrieval |
| Classification | Gemini 1.5 Flash + Heuristics | Hybrid AI/rule-based categorization |
| Vector Index | HNSW (vector_cosine_ops) | Sub-100ms similarity search at scale |
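The vector_cosine_ops operator class orders rows by cosine distance, which HNSW then approximates at scale. As a minimal illustration of the metric itself (pure Python, not the pgvector implementation):

```python
import math

def cosine_distance(a, b):
    # pgvector's vector_cosine_ops ranks rows by 1 - cosine similarity;
    # the HNSW index approximates nearest neighbors under this metric.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Identical vectors have distance 0; orthogonal vectors have distance 1.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # → 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # → 1.0
```

In production the same comparison runs inside PostgreSQL over 384-dimensional MiniLM vectors, so the Python loop above is purely didactic.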

Data Models & Schema:

The warehouse centers on six primary entity clusters:

- Articles: outlets -> articles -> article_content -> article_chunks (4.56M chunks with embeddings)
- Emails: classified_emails -> email_chunks (3.58M chunks) -> enhanced_email_embeddings
- Participants: email_participants -> sender_profiles (453K relationship records)
- Events: email_events (calendar extraction) -> event_flyers
- Entities: entities_v2 -> entity_mentions_v2 (cross-corpus entity tracking)
- Attachments: email_attachments -> Google Drive (streaming upload with SHA-256 dedup)

Key design decisions include CASCADE deletes for referential integrity, JSONB columns for flexible metadata, and explicit ON CONFLICT handling for idempotent upserts.
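The idempotent-upsert pattern can be sketched as follows. This uses SQLite's compatible ON CONFLICT syntax as a stand-in for PostgreSQL, and a hypothetical url-keyed table, not the production schema:

```python
import sqlite3

# Illustrative url-keyed table (hypothetical columns). SQLite shares
# PostgreSQL's INSERT ... ON CONFLICT syntax, so the pattern carries over.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (url TEXT PRIMARY KEY, title TEXT)")

def upsert_item(url, title):
    # Idempotent: re-running the same ingestion batch updates in place
    # instead of raising a duplicate-key error.
    conn.execute(
        "INSERT INTO items (url, title) VALUES (?, ?) "
        "ON CONFLICT(url) DO UPDATE SET title = excluded.title",
        (url, title),
    )

upsert_item("https://example.com/story", "First headline")
upsert_item("https://example.com/story", "Updated headline")
rows = conn.execute("SELECT url, title FROM items").fetchall()
print(rows)  # one row, carrying the updated title
```

Note that the articles table itself deliberately avoids the DO UPDATE branch (INSERT-only mode, described below under Loading) so that paywalled or deleted pages cannot overwrite previously captured content.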

Ingestion Pipeline Stages:

  1. Source Acquisition:
     - Articles: Scrapy LinkExtractor with domain-constrained crawling; respects robots.txt
     - Emails: Gmail API with service-account delegation, incremental sync via message ID tracking

  2. Cleaning & Validation:
     - UTF-8 sanitization removing null bytes that PostgreSQL rejects
     - URL normalization (lowercase, fragment removal, trailing-slash handling)
     - Content fingerprinting for deduplication

  3. Transformation:
     - OpenGraph metadata extraction for article identification
     - Readability-lxml for main-content extraction with fallback chains
     - RecursiveCharacterTextSplitter: 500-char chunks with 50-char overlap

  4. Loading:
     - SQLAlchemy QueuePool with pool_pre_ping for stale-connection detection
     - Atomic transactions via engine.begin() context managers
     - INSERT-only mode for articles prevents data loss from paywalled/deleted content

  5. Indexing & Optimization:
     - HNSW indexes on embedding columns (~6-7GB each) for approximate nearest-neighbor search
     - GIN indexes on tsvector columns for full-text search
     - URL hash indexes for O(1) deduplication checks
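The URL normalization rules in the cleaning stage can be sketched with the standard library. This is a simplified sketch; the production rules may differ in detail (for instance, in how query strings are handled):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    # Lowercase the scheme and host, drop the fragment, and trim the
    # trailing slash so equivalent URLs map to one deduplication key.
    # (Paths are left case-sensitive, since servers may treat them so.)
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )

print(normalize_url("HTTPS://Example.com/News/Story/#comments"))
# → https://example.com/News/Story
```

Feeding every URL through one normalizer before hashing is what makes the O(1) hash-index dedup check reliable.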
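The chunking step in the transformation stage can be approximated with a fixed-window splitter. This is a simplified stand-in for RecursiveCharacterTextSplitter, which prefers paragraph and sentence boundaries before falling back to raw characters:

```python
def chunk_text(text, size=500, overlap=50):
    # Sliding windows of `size` chars, each overlapping its neighbor by
    # `overlap` chars so a sentence split at a boundary appears in both
    # chunks and remains retrievable by semantic search.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 1000)
print(len(chunks), len(chunks[0]))  # → 3 500
```

The 500/50 parameters mirror the pipeline's settings; the 50-char overlap is the price paid in storage for not losing boundary-spanning context across 4.56M article chunks.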

Key Technical Achievements:

- ~27ms vector similarity queries across 3.58M email chunks (verified December 2024)
- ~90% network reduction via URL cache middleware that skips already-scraped articles before HTTP requests
- 0.001 false-positive-rate Bloom filter option for memory-constrained environments
- Background processing at scale: 500 emails/batch with the embedding model loaded once per session
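The 0.001 false-positive rate implies concrete memory bounds. The standard Bloom filter sizing formulas give roughly 14.4 bits and 10 hash functions per stored element:

```python
import math

def bloom_parameters(p):
    # Optimal Bloom filter sizing for false-positive probability p:
    #   bits per element  m/n = -ln(p) / (ln 2)^2
    #   hash functions    k   = (m/n) * ln 2
    bits_per_elem = -math.log(p) / (math.log(2) ** 2)
    k = round(bits_per_elem * math.log(2))
    return bits_per_elem, k

bits, k = bloom_parameters(0.001)
print(f"{bits:.1f} bits/element, {k} hash functions")
# → 14.4 bits/element, 10 hash functions
```

At that rate, a filter over all 1.52M article URLs would need only a few megabytes, which is why it suits memory-constrained deployments better than an in-memory set of full URL strings.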

Code Architecture Highlights:

# Heuristic article detection using OpenGraph and semantic containers
def triage_page(self, response):
    """Heuristically decides if a page is an article."""
    # Fast path: pages that declare themselves articles via OpenGraph.
    if response.css('meta[property="og:type"][content="article"]'):
        yield from self.parse_article(response)
        return

    # Fallback: look for a semantic container (e.g. <article>, <main>)
    # that holds enough text and is not dominated by markup.
    for selector in self.SEMANTIC_CONTAINER_TAGS:
        for container in response.css(selector):
            container_text = "".join(container.css("::text").getall())
            container_html = container.get()

            if len(container_text) < self.MIN_TEXT_LENGTH:
                continue
            if (len(container_text) / len(container_html)) < self.TEXT_TO_MARKUP_RATIO:
                continue

            yield from self.parse_article(response, pre_validated_container=container)
            return

Section 2: Features & Standards

Core Capabilities:

  1. Incremental Crawling System: The UrlCacheMiddleware loads all existing URLs at spider startup and rejects requests for known content before making HTTP calls. This transforms daily news monitoring from hours-long full crawls into efficient delta operations.

  2. Hybrid Classification Engine: Deterministic rules handle predictable emails (financial alerts, PR wire services, marketing) instantly, while Gemini 1.5 Flash processes nuanced cases. This achieves both cost efficiency and accuracy across 16 classification categories.

  3. Context-Enriched Embeddings: Email embeddings include sender history, thread context, pipeline classification, and related article references. A search for "housing crisis" surfaces emails where the sender has previously contributed housing stories.

  4. Participant Intelligence System: The email_participants table tracks 453K relationship records with MD5 fingerprints for cross-email linking, enabling queries like "show all communications from this person regardless of email address variations."
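The incremental crawling check behind capability 1 reduces to a set-membership test before any request is issued. Below is a framework-agnostic sketch with a hypothetical class name; the real UrlCacheMiddleware plugs into Scrapy's downloader-middleware chain and loads known URLs from PostgreSQL at startup:

```python
class UrlCache:
    """Simplified sketch of the startup-loaded URL cache (illustrative,
    not the production middleware)."""

    def __init__(self, known_urls):
        # Normalize lightly so trailing-slash variants collide.
        self._seen = {u.rstrip("/") for u in known_urls}

    def should_fetch(self, url):
        # Reject before any HTTP request is made; remember new URLs so
        # duplicates within the same crawl are also skipped.
        key = url.rstrip("/")
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

cache = UrlCache(["https://example.com/a/"])
print(cache.should_fetch("https://example.com/a"))  # → False (already scraped)
print(cache.should_fetch("https://example.com/b"))  # → True (new article)
```

Because the rejection happens before the request leaves the scheduler, a daily crawl pays network cost only for the day's delta, which is the source of the ~90% network reduction cited above.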

Standards & Best Practices Implemented:

Evolution Over Time:

The developer blog documents a clear progression from an ambitious initial architecture to production hardening.


Section 3: Impact on News Operations

Journalism Problems Solved:

  1. Competitive Intelligence Blind Spots: Previously, monitoring 75 NYC news outlets required manual checking. Now, journalists query "what have other outlets published about Bushwick rezoning?" and receive semantically ranked results in milliseconds.

  2. Lost Institutional Memory: Email threads with sources were siloed in individual inboxes. The participant intelligence system now tracks every interaction across 453K records, enabling "show me all communications with this city council member" queries.

  3. Event Calendar Bottleneck: Community board newsletters containing multiple events required manual transcription. The extraction pipeline identifies events, extracts structured data, and queues them for calendar publication with minimal human intervention.

Operational Improvements:

Mission Alignment:

Sustainable local journalism requires operational efficiency that large newsrooms take for granted. This data warehouse transforms a one-person newsroom's research capabilities, enabling the depth of coverage that typically requires a dedicated research staff. When a breaking story emerges, the journalist can instantly surface every prior article, email thread, and source interaction related to that topic.

Future Potential:

The HNSW-indexed vector architecture supports additional embedding models without schema changes. The participant intelligence system creates the foundation for automated source relationship management. The event extraction pipeline can extend to any structured data type (meeting agendas, budget documents, permit applications) using the same AI-extraction patterns.


Report compiled from 321 commits spanning July 2025 to March 2026. Core infrastructure files: /news_scraper_project/news_scraper_project/pipelines.py, /news_scraper_project/news_scraper_project/spiders/article_spider.py, /news_scraper_project/news_scraper_project/tasks.py, /news_scraper_project/news_scraper_project/middlewares.py, /news_scraper_project/scripts/create_chunks_table.py.