Data Warehouse & Ingestion Pipeline

Building an AI-Powered Local Newsroom (Overview)
Data Warehouse & Ingestion Pipeline (Current)
NYCNewsScanner & Publisher
Email Assistant

Executive Summary

The Data Warehouse and Ingestion Pipeline forms the foundational infrastructure of the Bushwick Daily AI newsroom, combining PostgreSQL 17 with pgvector for unified storage of 1.52 million news articles and 212,000+ emails with semantic search capabilities. The system automates 95% of news article extraction from websites using intelligent heuristics while processing the complete editorial inbox through a multi-stage pipeline. This infrastructure transforms the local newsroom from manual information gathering into a searchable, AI-ready knowledge base that enables journalists to surface relevant context in milliseconds rather than hours.

Section 1: Technical Architecture

System Overview:

                    ARTICLE INGESTION                           EMAIL INGESTION
                          |                                            |
    [75 News Outlets] --> Scrapy Spider                   Gmail API --> Service Account
                          |                                            |
                    URL Cache Check                        Deduplication Check
                          |                                            |
                    Heuristic Triage                       Gemini Classification
                          |                                            |
                    PostgresPipeline                       Batch Processing
                          |                                            |
                    +-----+-----+                          +-----+-----+
                    |           |                          |           |
              articles    article_content            classified    email_chunks
                    |           |                    _emails            |
                    +-----+-----+                          +-----+-----+
                          |                                      |
                    Celery Task                            Celery Task
                          |                                      |
                    article_chunks                     enhanced_email_embeddings
                          |                                      |
                    +-----+-----------------------------------+--+
                                          |
                              PostgreSQL 17 + pgvector
                              (HNSW Vector Indexes)
                                          |
                              RAG Chatbot / Search API

Core Technologies:

Component	Technology	Purpose
Database	PostgreSQL 17 + pgvector	Unified storage with vector similarity search
Article Scraping	Scrapy CrawlSpider	Intelligent heuristic-based news extraction
Task Queue	Celery + Redis	Background embedding generation
Embeddings	SentenceTransformers (all-MiniLM-L6-v2)	384-dimensional semantic vectors
Email Extraction	Gmail API + Service Account	Domain-wide email retrieval
Classification	Gemini 1.5 Flash + Heuristics	Hybrid AI/rule-based categorization
Vector Index	HNSW (vector_cosine_ops)	Sub-100ms similarity search at scale

Data Models & Schema:

The warehouse centers on six primary entity clusters: - Articles: outlets -> articles -> article_content -> article_chunks (4.56M chunks with embeddings) - Emails: classified_emails -> email_chunks (3.58M chunks) -> enhanced_email_embeddings - Participants: email_participants -> sender_profiles (453K relationship records) - Events: email_events (calendar extraction) -> event_flyers - Entities: entities_v2 -> entity_mentions_v2 (cross-corpus entity tracking) - Attachments: email_attachments -> Google Drive (streaming upload with SHA-256 dedup)

Key design decisions include CASCADE deletes for referential integrity, JSONB columns for flexible metadata, and explicit ON CONFLICT handling for idempotent upserts.

Ingestion Pipeline Stages:

Source Acquisition:
Articles: Scrapy LinkExtractor with domain-constrained crawling, respects robots.txt
Emails: Gmail API with service account delegation, incremental sync via message ID tracking
Cleaning & Validation:
UTF-8 sanitization removing null bytes PostgreSQL rejects
URL normalization (lowercase, fragment removal, trailing slash handling)
Content fingerprinting for deduplication
Transformation:
OpenGraph metadata extraction for article identification
Readability-lxml for main content extraction with fallback chains
RecursiveCharacterTextSplitter: 500-char chunks with 50-char overlap
Loading:
SQLAlchemy QueuePool with pool_pre_ping for stale connection detection
Atomic transactions via engine.begin() context managers
INSERT-only mode for articles prevents data loss from paywalled/deleted content
Indexing & Optimization:
HNSW indexes on embedding columns (~6-7GB each) for approximate nearest neighbor
GIN indexes on tsvector columns for full-text search
URL hash indexes for O(1) deduplication checks

Key Technical Achievements: - ~27ms vector similarity queries across 3.58M email chunks (verified December 2024) - ~90% network reduction via URL cache middleware that skips already-scraped articles before HTTP requests - 0.001 false positive rate Bloom filter option for memory-constrained environments - Background processing at scale: 500 emails/batch with model loaded once per session

Code Architecture Highlights:

# Heuristic article detection using OpenGraph and semantic containers
def triage_page(self, response):
    """Heuristically decides if a page is an article."""
    if response.css('meta[property="og:type"][content="article"]'):
        yield from self.parse_article(response)
        return

    for selector in self.SEMANTIC_CONTAINER_TAGS:
        container = response.css(selector)[0]
        container_text = "".join(container.css("::text").getall())

        if len(container_text) < self.MIN_TEXT_LENGTH:
            continue
        if (len(container_text) / len(container_html)) < self.TEXT_TO_MARKUP_RATIO:
            continue

        yield from self.parse_article(response, pre_validated_container=container)

Section 2: Features & Standards

Core Capabilities:

Incremental Crawling System: The UrlCacheMiddleware loads all existing URLs at spider startup and rejects requests for known content before making HTTP calls. This transforms daily news monitoring from hours-long full crawls into efficient delta operations.
Hybrid Classification Engine: Deterministic rules handle predictable emails (financial alerts, PR wire services, marketing) instantly, while Gemini 1.5 Flash processes nuanced cases. This achieves both cost efficiency and accuracy across 16 classification categories.
Context-Enriched Embeddings: Email embeddings include sender history, thread context, pipeline classification, and related article references. A search for "housing crisis" surfaces emails where the sender has previously contributed housing stories.
Participant Intelligence System: The email_participants table tracks 453K relationship records with MD5 fingerprints for cross-email linking, enabling queries like "show all communications from this person regardless of email address variations."

Standards & Best Practices Implemented:

Data Quality: COALESCE wrappers on all joined columns prevent NULL propagation; content fingerprinting with SHA-256 ensures deduplication accuracy
Scalability: Bloom filters support probabilistic URL filtering at minimal memory cost; batch processing with single model load eliminates subprocess overhead
Maintainability: Archived scripts directory with manifest preserves historical context; comprehensive logging at each pipeline step enables operational debugging
Security: Sandboxed iframe rendering for HTML emails; service account delegation with scoped permissions; no inline credentials

Evolution Over Time:

The developer blog documents a clear progression from ambitious initial architecture to production hardening:

Week 1 (July 1-8, 2025): Initial Scrapy pipeline encountered connection pool exhaustion and protocol corruption. Rebuilt on SQLAlchemy with pool_pre_ping after days of debugging psycopg3 edge cases.
Weeks 2-3: RAG system launch with vector search, followed by Gmail integration. Discovered that foreign key violations crashed the pipeline when emails were referenced before commit - added pre-validation patterns.
Month 2: Batch classification infrastructure emerged. The shift from subprocess-per-email to single-session processing reduced per-email time from 10 seconds to 6 seconds, saving 100+ hours on the 182K email backlog.
Month 3+: Operational maturity with monitoring dashboards, archive manifests, and graceful degradation patterns. The system learned to skip problematic articles after repeated failures rather than crashing entirely.

Section 3: Impact on News Operations

Journalism Problems Solved:

Competitive Intelligence Blind Spots: Previously, monitoring 75 NYC news outlets required manual checking. Now, journalists query "what have other outlets published about Bushwick rezoning?" and receive semantically-ranked results in milliseconds.
Lost Institutional Memory: Email threads with sources were siloed in individual inboxes. The participant intelligence system now tracks every interaction across 453K records, enabling "show me all communications with this city council member" queries.
Event Calendar Bottleneck: Community board newsletters containing multiple events required manual transcription. The extraction pipeline identifies events, extracts structured data, and queues them for calendar publication with minimal human intervention.

Operational Improvements:

Speed: Vector search returns relevant results in <100ms across 8.14M total chunks (articles + emails combined)
Depth: Journalists can surface historical context across 1.52M articles spanning years of coverage
Accuracy: Deduplication prevents the same content from fragmenting search results; 92.7% of emails link to enriched sender profiles
Scale: The system processes the entire editorial inbox automatically, classifying 200K+ emails with hybrid AI/rule engine

Mission Alignment:

Sustainable local journalism requires operational efficiency that large newsrooms take for granted. This data warehouse transforms a one-person newsroom's research capabilities, enabling the depth of coverage typically requiring a dedicated research staff. When a breaking story emerges, the journalist can instantly surface every prior article, email thread, and source interaction related to that topic.

Future Potential:

The HNSW-indexed vector architecture supports additional embedding models without schema changes. The participant intelligence system creates the foundation for automated source relationship management. The event extraction pipeline can extend to any structured data type (meeting agendas, budget documents, permit applications) using the same AI-extraction patterns.

Main Document

White Paper: Building an AI-Powered Local Newsroom

Section 4

NYCNewsScanner & Publisher

Section 5

Email Assistant

Core infrastructure files: /news_scraper_project/news_scraper_project/pipelines.py, /news_scraper_project/news_scraper_project/spiders/article_spider.py, /news_scraper_project/news_scraper_project/tasks.py, /news_scraper_project/news_scraper_project/middlewares.py, /news_scraper_project/scripts/create_chunks_table.py.

Data Warehouse & Ingestion Pipeline

Data Warehouse & Ingestion Pipeline

In This White Paper

Executive Summary

Section 1: Technical Architecture

Section 2: Features & Standards

Section 3: Impact on News Operations

Continue Reading