MEEKER <AJM> USER: AJM@NYC-01 SESSION 04-17-26 · 09:42:11 EST LIVE

SECTION 01 · SUBSYSTEM REPORT

Data Warehouse & Ingestion Pipeline

The foundation layer: 1.79M articles, 8.14M semantic chunks, 255 GB on a single SSD. PostgreSQL 17 with pgvector HNSW indices, Scrapy + Celery orchestration, atomic deduplication, and sub-100ms vector search across the full corpus.

Production PostgreSQL 17 pgvector Scrapy Published March 2026
1.79M Articles ingested across 75+ NYC outlets. 8.14 million embedded chunks. 255 GB. One SSD. Sub-100ms p95.

Data Warehouse & Ingestion Pipeline

In This White Paper

Executive Summary

The Data Warehouse and Ingestion Pipeline forms the foundational infrastructure of the Bushwick Daily AI newsroom, combining PostgreSQL 17 with pgvector for unified storage of 1.52 million news articles and 212,000+ emails with semantic search capabilities. The system automates 95% of news article extraction from websites using intelligent heuristics while processing the complete editorial inbox through a multi-stage pipeline. This infrastructure transforms the local newsroom from manual information gathering into a searchable, AI-ready knowledge base that enables journalists to surface relevant context in milliseconds rather than hours.


Section 1: Technical Architecture

System Overview:

                    ARTICLE INGESTION                           EMAIL INGESTION
                          |                                            |
    [75 News Outlets] --> Scrapy Spider                   Gmail API --> Service Account
                          |                                            |
                    URL Cache Check                        Deduplication Check
                          |                                            |
                    Heuristic Triage                       Gemini Classification
                          |                                            |
                    PostgresPipeline                       Batch Processing
                          |                                            |
                    +-----+-----+                          +-----+-----+
                    |           |                          |           |
              articles    article_content            classified    email_chunks
                    |           |                    _emails            |
                    +-----+-----+                          +-----+-----+
                          |                                      |
                    Celery Task                            Celery Task
                          |                                      |
                    article_chunks                     enhanced_email_embeddings
                          |                                      |
                    +-----+-----------------------------------+--+
                                          |
                              PostgreSQL 17 + pgvector
                              (HNSW Vector Indexes)
                                          |
                              RAG Chatbot / Search API

Core Technologies:

Component Technology Purpose
Database PostgreSQL 17 + pgvector Unified storage with vector similarity search
Article Scraping Scrapy CrawlSpider Intelligent heuristic-based news extraction
Task Queue Celery + Redis Background embedding generation
Embeddings SentenceTransformers (all-MiniLM-L6-v2) 384-dimensional semantic vectors
Email Extraction Gmail API + Service Account Domain-wide email retrieval
Classification Gemini 1.5 Flash + Heuristics Hybrid AI/rule-based categorization
Vector Index HNSW (vector_cosine_ops) Sub-100ms similarity search at scale

Data Models & Schema:

The warehouse centers on six primary entity clusters: - Articles: outlets -> articles -> article_content -> article_chunks (4.56M chunks with embeddings) - Emails: classified_emails -> email_chunks (3.58M chunks) -> enhanced_email_embeddings - Participants: email_participants -> sender_profiles (453K relationship records) - Events: email_events (calendar extraction) -> event_flyers - Entities: entities_v2 -> entity_mentions_v2 (cross-corpus entity tracking) - Attachments: email_attachments -> Google Drive (streaming upload with SHA-256 dedup)

Key design decisions include CASCADE deletes for referential integrity, JSONB columns for flexible metadata, and explicit ON CONFLICT handling for idempotent upserts.

Ingestion Pipeline Stages:

  1. Source Acquisition:
  2. Articles: Scrapy LinkExtractor with domain-constrained crawling, respects robots.txt
  3. Emails: Gmail API with service account delegation, incremental sync via message ID tracking

  4. Cleaning & Validation:

  5. UTF-8 sanitization removing null bytes PostgreSQL rejects
  6. URL normalization (lowercase, fragment removal, trailing slash handling)
  7. Content fingerprinting for deduplication

  8. Transformation:

  9. OpenGraph metadata extraction for article identification
  10. Readability-lxml for main content extraction with fallback chains
  11. RecursiveCharacterTextSplitter: 500-char chunks with 50-char overlap

  12. Loading:

  13. SQLAlchemy QueuePool with pool_pre_ping for stale connection detection
  14. Atomic transactions via engine.begin() context managers
  15. INSERT-only mode for articles prevents data loss from paywalled/deleted content

  16. Indexing & Optimization:

  17. HNSW indexes on embedding columns (~6-7GB each) for approximate nearest neighbor
  18. GIN indexes on tsvector columns for full-text search
  19. URL hash indexes for O(1) deduplication checks

Key Technical Achievements: - ~27ms vector similarity queries across 3.58M email chunks (verified December 2024) - ~90% network reduction via URL cache middleware that skips already-scraped articles before HTTP requests - 0.001 false positive rate Bloom filter option for memory-constrained environments - Background processing at scale: 500 emails/batch with model loaded once per session

Code Architecture Highlights:

# Heuristic article detection using OpenGraph and semantic containers
def triage_page(self, response):
    """Heuristically decides if a page is an article."""
    if response.css('meta[property="og:type"][content="article"]'):
        yield from self.parse_article(response)
        return

    for selector in self.SEMANTIC_CONTAINER_TAGS:
        container = response.css(selector)[0]
        container_text = "".join(container.css("::text").getall())

        if len(container_text) < self.MIN_TEXT_LENGTH:
            continue
        if (len(container_text) / len(container_html)) < self.TEXT_TO_MARKUP_RATIO:
            continue

        yield from self.parse_article(response, pre_validated_container=container)

Section 2: Features & Standards

Core Capabilities:

  1. Incremental Crawling System: The UrlCacheMiddleware loads all existing URLs at spider startup and rejects requests for known content before making HTTP calls. This transforms daily news monitoring from hours-long full crawls into efficient delta operations.

  2. Hybrid Classification Engine: Deterministic rules handle predictable emails (financial alerts, PR wire services, marketing) instantly, while Gemini 1.5 Flash processes nuanced cases. This achieves both cost efficiency and accuracy across 16 classification categories.

  3. Context-Enriched Embeddings: Email embeddings include sender history, thread context, pipeline classification, and related article references. A search for "housing crisis" surfaces emails where the sender has previously contributed housing stories.

  4. Participant Intelligence System: The email_participants table tracks 453K relationship records with MD5 fingerprints for cross-email linking, enabling queries like "show all communications from this person regardless of email address variations."

Standards & Best Practices Implemented:

  • Data Quality: COALESCE wrappers on all joined columns prevent NULL propagation; content fingerprinting with SHA-256 ensures deduplication accuracy
  • Scalability: Bloom filters support probabilistic URL filtering at minimal memory cost; batch processing with single model load eliminates subprocess overhead
  • Maintainability: Archived scripts directory with manifest preserves historical context; comprehensive logging at each pipeline step enables operational debugging
  • Security: Sandboxed iframe rendering for HTML emails; service account delegation with scoped permissions; no inline credentials

Evolution Over Time:

The developer blog documents a clear progression from ambitious initial architecture to production hardening:

  • Week 1 (July 1-8, 2025): Initial Scrapy pipeline encountered connection pool exhaustion and protocol corruption. Rebuilt on SQLAlchemy with pool_pre_ping after days of debugging psycopg3 edge cases.

  • Weeks 2-3: RAG system launch with vector search, followed by Gmail integration. Discovered that foreign key violations crashed the pipeline when emails were referenced before commit - added pre-validation patterns.

  • Month 2: Batch classification infrastructure emerged. The shift from subprocess-per-email to single-session processing reduced per-email time from 10 seconds to 6 seconds, saving 100+ hours on the 182K email backlog.

  • Month 3+: Operational maturity with monitoring dashboards, archive manifests, and graceful degradation patterns. The system learned to skip problematic articles after repeated failures rather than crashing entirely.


Section 3: Impact on News Operations

Journalism Problems Solved:

  1. Competitive Intelligence Blind Spots: Previously, monitoring 75 NYC news outlets required manual checking. Now, journalists query "what have other outlets published about Bushwick rezoning?" and receive semantically-ranked results in milliseconds.

  2. Lost Institutional Memory: Email threads with sources were siloed in individual inboxes. The participant intelligence system now tracks every interaction across 453K records, enabling "show me all communications with this city council member" queries.

  3. Event Calendar Bottleneck: Community board newsletters containing multiple events required manual transcription. The extraction pipeline identifies events, extracts structured data, and queues them for calendar publication with minimal human intervention.

Operational Improvements:

  • Speed: Vector search returns relevant results in <100ms across 8.14M total chunks (articles + emails combined)
  • Depth: Journalists can surface historical context across 1.52M articles spanning years of coverage
  • Accuracy: Deduplication prevents the same content from fragmenting search results; 92.7% of emails link to enriched sender profiles
  • Scale: The system processes the entire editorial inbox automatically, classifying 200K+ emails with hybrid AI/rule engine

Mission Alignment:

Sustainable local journalism requires operational efficiency that large newsrooms take for granted. This data warehouse transforms a one-person newsroom's research capabilities, enabling the depth of coverage typically requiring a dedicated research staff. When a breaking story emerges, the journalist can instantly surface every prior article, email thread, and source interaction related to that topic.

Future Potential:

The HNSW-indexed vector architecture supports additional embedding models without schema changes. The participant intelligence system creates the foundation for automated source relationship management. The event extraction pipeline can extend to any structured data type (meeting agendas, budget documents, permit applications) using the same AI-extraction patterns.


Core infrastructure files: /news_scraper_project/news_scraper_project/pipelines.py, /news_scraper_project/news_scraper_project/spiders/article_spider.py, /news_scraper_project/news_scraper_project/tasks.py, /news_scraper_project/news_scraper_project/middlewares.py, /news_scraper_project/scripts/create_chunks_table.py.