The Data Warehouse and Ingestion Pipeline form the foundational infrastructure of the Bushwick Daily AI newsroom, combining PostgreSQL 17 with pgvector for unified storage of 1.52 million news articles and 212,000+ emails with semantic search capabilities. The system automates 95% of news article extraction from websites using intelligent heuristics while processing the complete editorial inbox through a multi-stage pipeline. This infrastructure transforms the local newsroom from manual information gathering into a searchable, AI-ready knowledge base that lets journalists surface relevant context in milliseconds rather than hours.
System Overview:
```
ARTICLE INGESTION                              EMAIL INGESTION
        |                                             |
[75 News Outlets] --> Scrapy Spider       Gmail API --> Service Account
        |                                             |
  URL Cache Check                           Deduplication Check
        |                                             |
  Heuristic Triage                         Gemini Classification
        |                                             |
 PostgresPipeline                            Batch Processing
        |                                             |
  +-----+-----+                             +---------+--------+
  |           |                             |                  |
articles  article_content         classified_emails      email_chunks
        |                                             |
   Celery Task                                   Celery Task
        |                                             |
 article_chunks                       enhanced_email_embeddings
        |                                             |
        +----------------------+----------------------+
                               |
                  PostgreSQL 17 + pgvector
                   (HNSW Vector Indexes)
                               |
                  RAG Chatbot / Search API
```
Core Technologies:
| Component | Technology | Purpose |
|---|---|---|
| Database | PostgreSQL 17 + pgvector | Unified storage with vector similarity search |
| Article Scraping | Scrapy CrawlSpider | Intelligent heuristic-based news extraction |
| Task Queue | Celery + Redis | Background embedding generation |
| Embeddings | SentenceTransformers (all-MiniLM-L6-v2) | 384-dimensional semantic vectors |
| Email Extraction | Gmail API + Service Account | Domain-wide email retrieval |
| Classification | Gemini 1.5 Flash + Heuristics | Hybrid AI/rule-based categorization |
| Vector Index | HNSW (vector_cosine_ops) | Sub-100ms similarity search at scale |
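The HNSW index with `vector_cosine_ops` orders rows by cosine distance (1 minus cosine similarity). A minimal sketch of that metric in pure Python, alongside the shape of query the warehouse would run against pgvector (the `embedding` column name and the exact SELECT list are illustrative, not necessarily the production schema):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """The metric behind pgvector's vector_cosine_ops: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# <=> is pgvector's cosine-distance operator; HNSW makes this ORDER BY an
# approximate nearest-neighbor search instead of a full scan.
QUERY = """
SELECT chunk_id, chunk_text
FROM email_chunks
ORDER BY embedding <=> %(query_embedding)s
LIMIT 10;
"""
```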
Data Models & Schema:
The warehouse centers on six primary entity clusters:
- Articles: outlets -> articles -> article_content -> article_chunks (4.56M chunks with embeddings)
- Emails: classified_emails -> email_chunks (3.58M chunks) -> enhanced_email_embeddings
- Participants: email_participants -> sender_profiles (453K relationship records)
- Events: email_events (calendar extraction) -> event_flyers
- Entities: entities_v2 -> entity_mentions_v2 (cross-corpus entity tracking)
- Attachments: email_attachments -> Google Drive (streaming upload with SHA-256 dedup)
Key design decisions include CASCADE deletes for referential integrity, JSONB columns for flexible metadata, and explicit ON CONFLICT handling for idempotent upserts.
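These three decisions can be sketched in miniature with SQLite standing in for PostgreSQL (table names mirror the real clusters, but the columns are illustrative; JSONB becomes plain TEXT here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this for CASCADE

# Parent/child with CASCADE delete, mirroring outlets -> articles
conn.execute("CREATE TABLE outlets (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE articles (
        id INTEGER PRIMARY KEY,
        outlet_id INTEGER REFERENCES outlets(id) ON DELETE CASCADE,
        url TEXT UNIQUE,
        metadata TEXT  -- JSONB in PostgreSQL
    )
""")

conn.execute("INSERT INTO outlets VALUES (1, 'Bushwick Daily')")

# Idempotent upsert: re-running the same scrape updates instead of crashing
for _ in range(2):
    conn.execute(
        """INSERT INTO articles (outlet_id, url, metadata)
           VALUES (1, 'https://example.com/story', '{}')
           ON CONFLICT (url) DO UPDATE SET metadata = excluded.metadata"""
    )

# Referential integrity: deleting the outlet removes its articles too
conn.execute("DELETE FROM outlets WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```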
Ingestion Pipeline Stages:
- Extraction (emails): Gmail API with service account delegation; incremental sync via message ID tracking
- Cleaning & Validation: content fingerprinting for deduplication
- Transformation: RecursiveCharacterTextSplitter produces 500-character chunks with 50-character overlap
- Loading: transactional writes via engine.begin() context managers; INSERT-only mode for articles prevents data loss from paywalled or deleted content
- Indexing & Optimization: HNSW vector indexes (vector_cosine_ops) for similarity search at scale
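The transformation stage's chunking can be sketched in pure Python (the production code uses LangChain's RecursiveCharacterTextSplitter, which additionally prefers paragraph and sentence boundaries; this sliding-window version shows only the size/overlap mechanics):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap between neighbors.

    Overlap preserves context at chunk boundaries so a sentence straddling
    two chunks is still fully embedded in at least one of them.
    """
    step = chunk_size - overlap  # each chunk starts `step` chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```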
Key Technical Achievements:
- ~27ms vector similarity queries across 3.58M email chunks (verified December 2024)
- ~90% network reduction via URL cache middleware that skips already-scraped articles before HTTP requests
- Bloom filter option with a 0.001 false-positive rate for memory-constrained environments
- Background processing at scale: 500 emails per batch, with the embedding model loaded once per session
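The Bloom filter option trades a small false-positive rate for a large memory saving over an exact URL set. A self-contained sketch using the standard sizing formulas (the production implementation may differ in hashing and layout):

```python
import hashlib
import math

class BloomFilter:
    """Bloom filter sized for a target false-positive rate.

    Illustrative stand-in for the memory-constrained URL-dedup option:
    m bits and k hash functions are derived from the classic formulas
    m = -n*ln(p)/ln(2)^2 and k = (m/n)*ln(2).
    """

    def __init__(self, expected_items: int, fp_rate: float = 0.001):
        self.m = max(8, int(-expected_items * math.log(fp_rate) / (math.log(2) ** 2)))
        self.k = max(1, int(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from one MD5 digest
        digest = hashlib.md5(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] >> (pos % 8) & 1 for pos in self._positions(item))
```

At 0.001 false positives, each URL costs roughly 14.4 bits regardless of URL length, versus tens of bytes per entry for an exact in-memory set.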
Code Architecture Highlights:
```python
# Heuristic article detection using OpenGraph metadata and semantic containers
def triage_page(self, response):
    """Heuristically decide whether a page is an article."""
    # Fast path: pages that declare themselves articles via OpenGraph
    if response.css('meta[property="og:type"][content="article"]'):
        yield from self.parse_article(response)
        return
    # Fallback: look for a semantic container (<article>, <main>, ...) with
    # enough text and a high enough text-to-markup ratio
    for selector in self.SEMANTIC_CONTAINER_TAGS:
        containers = response.css(selector)
        if not containers:
            continue
        container = containers[0]
        container_html = container.get()
        container_text = "".join(container.css("::text").getall())
        if len(container_text) < self.MIN_TEXT_LENGTH:
            continue
        if (len(container_text) / len(container_html)) < self.TEXT_TO_MARKUP_RATIO:
            continue
        yield from self.parse_article(response, pre_validated_container=container)
```
Core Capabilities:
Incremental Crawling System: The UrlCacheMiddleware loads all existing URLs at spider startup and rejects requests for known content before making HTTP calls. This transforms daily news monitoring from hours-long full crawls into efficient delta operations.
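The core of that middleware can be sketched without Scrapy (the names and single-argument `process_request` here are illustrative; the real middleware hooks Scrapy's downloader-middleware interface and raises `scrapy.exceptions.IgnoreRequest`):

```python
class UrlCacheMiddleware:
    """Drop requests for URLs already stored in the warehouse.

    Self-contained stand-in: the production version loads URLs from
    PostgreSQL at spider startup and raises scrapy.exceptions.IgnoreRequest;
    here an in-memory set and a local exception keep the sketch runnable.
    """

    class IgnoreRequest(Exception):
        pass

    def __init__(self, known_urls):
        # Loaded once at spider startup, before any HTTP traffic
        self.seen = set(known_urls)

    def process_request(self, url: str) -> str:
        if url in self.seen:
            # Reject before the HTTP call -- this is the ~90% network saving
            raise self.IgnoreRequest(url)
        self.seen.add(url)
        return url
```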
Hybrid Classification Engine: Deterministic rules handle predictable emails (financial alerts, PR wire services, marketing) instantly, while Gemini 1.5 Flash processes nuanced cases. This achieves both cost efficiency and accuracy across 16 classification categories.
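A minimal sketch of the rules-first, LLM-fallback pattern (the patterns and category names below are illustrative, not the production rule set or its 16 categories; the Gemini call is injected so the sketch stays self-contained):

```python
import re

# Deterministic rules handle predictable mail instantly and for free;
# anything unmatched falls through to the LLM.
RULES = [
    (re.compile(r"@(prnewswire|businesswire)\.", re.I), "pr_wire"),
    (re.compile(r"invoice|receipt|payment", re.I), "financial"),
    (re.compile(r"unsubscribe", re.I), "marketing"),
]

def classify(sender: str, subject: str, body: str, llm=None) -> str:
    """Rules first; LLM (Gemini 1.5 Flash in production) only for nuanced cases."""
    for pattern, category in RULES:
        if pattern.search(sender) or pattern.search(subject) or pattern.search(body):
            return category
    return llm(sender, subject, body) if llm else "needs_llm"
```

Keeping the rules in front means the per-email LLM cost is only paid for the genuinely ambiguous messages.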
Context-Enriched Embeddings: Email embeddings include sender history, thread context, pipeline classification, and related article references. A search for "housing crisis" surfaces emails where the sender has previously contributed housing stories.
Participant Intelligence System: The email_participants table tracks 453K relationship records with MD5 fingerprints for cross-email linking, enabling queries like "show all communications from this person regardless of email address variations."
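The fingerprinting idea can be sketched as follows (the exact normalization rules are an assumption for illustration; only the MD5-of-canonical-address approach is from the source):

```python
import hashlib

def participant_fingerprint(email: str) -> str:
    """MD5 fingerprint for cross-email participant linking.

    The normalization here (lowercase, strip Gmail-style +tags and dots in
    the local part) is an illustrative guess at the scheme, not the exact
    production rules.
    """
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+", 1)[0].replace(".", "")
    return hashlib.md5(f"{local}@{domain}".encode("utf-8")).hexdigest()
```

Two address variants of the same person then collapse to one key, which is what makes "all communications from this person" queries possible.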
Evolution Over Time:
The developer blog documents a clear progression from ambitious initial architecture to production hardening:
Week 1 (July 1-8, 2025): Initial Scrapy pipeline encountered connection pool exhaustion and protocol corruption. Rebuilt on SQLAlchemy with pool_pre_ping after days of debugging psycopg3 edge cases.
Weeks 2-3: RAG system launch with vector search, followed by Gmail integration. Discovered that foreign key violations crashed the pipeline when emails were referenced before commit - added pre-validation patterns.
Month 2: Batch classification infrastructure emerged. The shift from subprocess-per-email to single-session processing reduced per-email time from 10 seconds to 6 seconds, saving 100+ hours on the 182K email backlog.
Month 3+: Operational maturity with monitoring dashboards, archive manifests, and graceful degradation patterns. The system learned to skip problematic articles after repeated failures rather than crashing entirely.
Journalism Problems Solved:
Competitive Intelligence Blind Spots: Previously, monitoring 75 NYC news outlets required manual checking. Now, journalists query "what have other outlets published about Bushwick rezoning?" and receive semantically-ranked results in milliseconds.
Lost Institutional Memory: Email threads with sources were siloed in individual inboxes. The participant intelligence system now tracks every interaction across 453K records, enabling "show me all communications with this city council member" queries.
Event Calendar Bottleneck: Community board newsletters containing multiple events required manual transcription. The extraction pipeline identifies events, extracts structured data, and queues them for calendar publication with minimal human intervention.
Mission Alignment:
Sustainable local journalism requires operational efficiency that large newsrooms take for granted. This data warehouse transforms a one-person newsroom's research capabilities, enabling the depth of coverage typically requiring a dedicated research staff. When a breaking story emerges, the journalist can instantly surface every prior article, email thread, and source interaction related to that topic.
Future Potential:
The HNSW-indexed vector architecture supports additional embedding models without schema changes. The participant intelligence system creates the foundation for automated source relationship management. The event extraction pipeline can extend to any structured data type (meeting agendas, budget documents, permit applications) using the same AI-extraction patterns.
Report compiled from 321 commits spanning July 2025 to March 2026. Core infrastructure files: /news_scraper_project/news_scraper_project/pipelines.py, /news_scraper_project/news_scraper_project/spiders/article_spider.py, /news_scraper_project/news_scraper_project/tasks.py, /news_scraper_project/news_scraper_project/middlewares.py, /news_scraper_project/scripts/create_chunks_table.py.