Building an AI-Powered Local Newsroom

A Technical White Paper on the Bushwick Daily System

Alec Meeker — March 2026


Abstract

The local news crisis has left communities across America without reliable journalism coverage. While national outlets consolidate and local newsrooms collapse, AI has emerged as a potential force multiplier that could enable small newsrooms to punch above their weight. This white paper documents the nine-month development of a comprehensive AI infrastructure built by Bushwick Daily, a hyperlocal Brooklyn publication. The system encompasses 255 GB of structured data including 1.79 million scraped news articles from 75+ outlets, 218,000 processed emails with participant intelligence, and 8 million searchable semantic chunks. The architecture integrates multiple AI subsystems: an intelligent news scanner that monitors NYC media and extracts stories from email intelligence, an adaptive memory system that learns editorial preferences through human feedback, and an email assistant that triages correspondence and generates voice-accurate draft responses. The human-in-the-loop design philosophy ensures AI augments rather than replaces editorial judgment, with explicit checkpoints requiring human approval before any automated action. Development proceeded through deliberate iteration, with significant architectural pivots driven by production experience rather than theoretical assumptions. The resulting system demonstrates that sophisticated AI infrastructure is achievable for small newsrooms, potentially transforming the economics of local journalism by enabling one-person operations to maintain coverage depth traditionally requiring dedicated research staff.


1. Introduction: The Local News Crisis and an AI Response

1.1 The Challenge

The collapse of local journalism represents one of the most significant information gaps in American civic life. Since 2004, over 2,500 newspapers have closed. The remaining local outlets operate with skeleton staffs, unable to provide the coverage depth that community accountability requires. Hyperlocal publications face an impossible equation: the reporting workload of a full newsroom with the resources of a small business.

For Bushwick Daily, this tension manifested in concrete operational challenges. Monitoring 75+ NYC news outlets for relevant coverage required hours of daily manual checking. An inbox processing 200+ emails daily created a triage bottleneck where important community communications were lost in promotional noise. Institutional knowledge about sources, organizations, and ongoing stories existed only in the publisher's memory, vulnerable to the cognitive limitations of a single operator.

1.2 The Hypothesis

The hypothesis driving this project was straightforward: AI could serve as a force multiplier for small newsrooms rather than a replacement for journalists. The goal was not automated content generation but automated information processing, freeing editorial attention for judgment and writing while AI handled collection, organization, and routine correspondence.

This required building AI infrastructure from scratch rather than adopting off-the-shelf solutions. Commercial AI tools optimize for general use cases; a newsroom requires domain-specific capabilities around source management, editorial voice preservation, and journalistic accuracy standards.

1.3 The Approach

Development proceeded through direct engagement with production requirements. The author — with a background in investment analysis ($200M portfolio management, PE fund due diligence), a B.S. in Finance and Business Information Systems, and eight years of company operations — built the system iteratively, applying the same analytical rigor to AI architecture decisions that he previously applied to portfolio analysis and fund evaluation. Prior experience with VBA/SQL automation for institutional reporting (reducing Emerging Market Debt reporting cycles by 70% at Erie Insurance) provided a foundation for the data pipeline engineering documented in this paper.

The approach prioritized:

- Human-in-the-loop design: AI proposes, humans approve. No automated actions without explicit human checkpoints.
- Editorial integrity: Voice preservation, source attribution, and accuracy verification built into system architecture.
- Practical utility: Features driven by real workflow pain points rather than theoretical capabilities.
- Sustainable architecture: Production-grade reliability over prototype impressiveness.

1.4 Scope of This Paper

This paper covers the complete technical architecture of the Bushwick Daily AI newsroom system as of March 2026: the data infrastructure (Section 3), the NYCNewsScanner and its memory system (Section 4), the Email Assistant (Section 5), and cross-system integration (Section 6).

The development journey itself is also documented, including architectural pivots, failed approaches, and lessons learned from nine months of iteration.


2. System Architecture Overview

2.1 Design Philosophy

Five principles guided architectural decisions:

1. Human-in-the-Loop by Default. Every automated action passes through human approval. The system proposes; humans decide. This applies to email responses, story extraction approvals, form submissions, and content publication. AI handles the mechanical work of drafting, organizing, and retrieving; humans retain editorial judgment.

2. Editorial Integrity as Architecture. Voice preservation is not a feature but a requirement. The system learns from human corrections rather than overriding editorial preferences. When AI-generated content is edited before sending, those edits become training data for future improvements.

3. Practical Over Impressive. Features address documented workflow problems rather than theoretical capabilities. The incremental crawling system exists because daily news monitoring was consuming hours; the email triage system exists because 200+ daily emails were unmanageable manually.

4. Production Reliability. Graceful degradation over brittle perfection. When individual articles cause parsing errors, the system logs and skips rather than crashing. When API calls fail, retry logic with exponential backoff ensures eventual completion. Background processing through Celery workers ensures main operations remain responsive.

5. Sustainable Operations. Infrastructure must run on available hardware within realistic budgets. The system operates on a Mac with an external SSD for PostgreSQL data, using commodity cloud APIs with cost tracking. AI operations are optimized for cost efficiency without sacrificing capability.

2.2 High-Level Architecture

                    ┌──────────────────────────────────────────────────────┐
                    │                  DATA SOURCES                         │
                    │  75+ News Outlets    Gmail Inbox    Reddit/Web        │
                    └────────────┬─────────────┬────────────┬──────────────┘
                                 │             │            │
                    ┌────────────▼─────────────▼────────────▼──────────────┐
                    │              INGESTION LAYER                          │
                    │  Scrapy Spider    Gmail API    Web Scraper            │
                    │  URL Cache        Deduplication  Rate Limiting        │
                    └────────────┬─────────────┬────────────┬──────────────┘
                                 │             │            │
                    ┌────────────▼─────────────▼────────────▼──────────────┐
                    │           PROCESSING LAYER                            │
                    │  Article Extraction    Email Classification           │
                    │  Entity Extraction     Participant Intelligence       │
                    │  Chunking              Embedding Generation           │
                    └────────────┬─────────────┬────────────┬──────────────┘
                                 │             │            │
                    ┌────────────▼─────────────▼────────────▼──────────────┐
                    │              DATA WAREHOUSE                           │
                    │         PostgreSQL 17 + pgvector                      │
                    │  1.79M Articles   218K Emails   8M+ Chunks            │
                    │  453K Participants   10K Sender Profiles              │
                    │            HNSW Vector Indexes                        │
                    └────────────┬─────────────┬────────────┬──────────────┘
                                 │             │            │
          ┌──────────────────────┼─────────────┼────────────┼──────────────────────┐
          │                      │             │            │                      │
          ▼                      ▼             ▼            ▼                      ▼
   ┌─────────────┐     ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
   │   RAG       │     │ NYCNews     │   │   Email     │   │  Memory     │   │  Publish    │
   │  Chatbot    │     │  Scanner    │   │  Assistant  │   │  System     │   │  Pipeline   │
   │             │     │             │   │             │   │             │   │             │
   │ Semantic    │     │ Multi-Agent │   │ Triage      │   │ Adaptive    │   │ WordPress   │
   │ Search      │     │ Research    │   │ Drafting    │   │ Learning    │   │ Integration │
   │ Q&A         │     │ Story       │   │ Task        │   │ Feedback    │   │ SEO Gen     │
   │             │     │ Extraction  │   │ Orchestr.   │   │ Loop        │   │ Image AI    │
   └─────────────┘     └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘
          │                      │             │            │                      │
          └──────────────────────┼─────────────┼────────────┼──────────────────────┘
                                 │             │            │
                    ┌────────────▼─────────────▼────────────▼──────────────┐
                    │               HUMAN INTERFACE                         │
                    │  CLI Commands    Web Dashboard    API Endpoints       │
                    │              Human-in-the-Loop Review                 │
                    └──────────────────────────────────────────────────────┘

2.3 Technology Stack Summary

Layer | Technologies | Rationale
Data Storage | PostgreSQL 17, pgvector, Redis | Unified SQL + vector search; HNSW indexes enable sub-100ms similarity queries across millions of chunks
Backend | Python 3.11, FastAPI, Celery | Async processing with background tasks for compute-intensive embedding generation
Scraping | Scrapy, BeautifulSoup, Readability | Intelligent heuristic spider with fallback extraction chains
AI/LLM | Claude (Anthropic), Gemini (Google), GPT-4 (OpenAI) | Multi-model strategy: Claude for generation quality, Gemini Flash for cost-efficient classification, GPT-4 for specialized tasks
Embeddings | SentenceTransformers (all-MiniLM-L6-v2) | 384-dimensional vectors; lightweight model runs locally without GPU
Browser Automation | Playwright | RSVP form filling, screenshot capture, web interaction
Workflow Engine | LangGraph | Multi-agent orchestration with state persistence
Frontend | React, Streamlit | Development speed with Streamlit; production UI with React
Infrastructure | macOS, Homebrew, External SSD | Commodity hardware sufficient for production workloads

3. Data Infrastructure

3.1 Data Warehouse Architecture

The data warehouse centers on PostgreSQL 17 with the pgvector extension, providing unified storage for both structured metadata and high-dimensional vector embeddings. This architecture eliminates the complexity of maintaining separate vector stores while enabling joins between semantic search results and relational data.

Primary Entity Clusters:

-- Article Storage
outlets (75 active)
  └── articles (1.79M records)
        └── article_content (full HTML/text with tsvector search)
              └── article_chunks (4.56M chunks with 384-dim embeddings)

-- Email Storage
classified_emails (218K records, 60 columns)
  ├── email_chunks (3.58M chunks with embeddings)
  ├── email_participants (453K relationship records)
  │     └── sender_profiles (10K with communication patterns)
  ├── email_attachments (Google Drive links, SHA-256 dedup)
  └── email_events (ICS calendar extraction)

-- Entity Intelligence
entities_v2 (2,580+ entities)
  └── entity_mentions_v2 (cross-corpus tracking)
        └── entity_aliases_v2 (disambiguation)

Vector Indexing Strategy:

HNSW (Hierarchical Navigable Small World) indexes provide approximate nearest neighbor search with sub-100ms query times across millions of vectors:

CREATE INDEX idx_email_chunks_embedding_hnsw
ON email_chunks USING hnsw (embedding vector_cosine_ops);

-- Index size: 5.8 GB for 3.58M email chunks
-- Query time: ~27ms average (verified December 2024)

The choice of vector_cosine_ops reflects the semantic nature of the search: cosine similarity captures meaning alignment regardless of vector magnitude, which is appropriate for text embeddings where we care about semantic direction rather than intensity.
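The magnitude-invariance property is easy to verify in a few lines of illustrative Python (not part of the production system, where pgvector's `<=>` operator does this work):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product normalized by both magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v = [0.5, 1.0, -0.25]
scaled = [x * 10 for x in v]  # same direction, 10x the magnitude

# Scaling never changes cosine similarity: only direction matters
assert abs(cosine_similarity(v, scaled) - 1.0) < 1e-9
```

Note that pgvector's `<=>` operator returns cosine *distance*, which is why the SQL above computes similarity as `1 - (embedding <=> query_embedding)`.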

3.2 Article Ingestion Pipeline

The article scraping system achieves 95% automation through intelligent heuristics, requiring manual configuration only for exceptional sites.

Article Detection Cascade:

def triage_page(self, response):
    """Multi-stage article detection."""

    # Stage 1: OpenGraph metadata
    if response.css('meta[property="og:type"][content="article"]'):
        yield from self.parse_article(response)
        return

    # Stage 2: Semantic container analysis
    for selector in ['article', 'main', 'div[role="main"]']:
        container = response.css(selector)
        if container:
            text_content = "".join(container.css("::text").getall())
            html_content = container.get()

            # Text-to-markup ratio filter
            if html_content and len(text_content) / len(html_content) > 0.6:
                if len(text_content) > 250:
                    yield from self.parse_article(response)
                    return

Incremental Crawling:

The UrlCacheMiddleware eliminates redundant network requests by loading all existing URLs at spider startup:

class UrlCacheMiddleware:
    def spider_opened(self, spider):
        with engine.connect() as conn:
            urls = conn.execute(
                text("SELECT url FROM articles WHERE outlet_id = :id"),
                {"id": outlet_id}
            ).fetchall()
            self.scraped_urls = {self.normalize_url(row[0]) for row in urls}

    def process_request(self, request, spider):
        if self.normalize_url(request.url) in self.scraped_urls:
            raise IgnoreRequest(f"Already scraped: {request.url}")
        return None

This reduces network traffic by approximately 90% on subsequent crawls, enabling hourly news checks without server overload.
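Cache hits depend on the `normalize_url` helper used above. A minimal sketch of the kind of canonicalization involved (the exact production rules may differ; the tracking-parameter list here is illustrative):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Common tracking parameters stripped before caching (illustrative list)
TRACKING_PARAMS = {'utm_source', 'utm_medium', 'utm_campaign', 'fbclid', 'gclid'}

def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different forms map to one cache key."""
    parts = urlparse(url.strip())
    query = sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    )
    return urlunparse((
        parts.scheme,
        parts.netloc.lower(),      # hostnames are case-insensitive
        parts.path.rstrip('/'),    # trailing slash treated as insignificant
        '',
        urlencode(query),
        '',                        # fragments never reach the server
    ))

# Two forms of the same article URL collapse to one cache key
assert (normalize_url('https://Example.com/story/?utm_source=tw')
        == normalize_url('https://example.com/story'))
```

Without this step, the same article reached via a newsletter link and via the homepage would be scraped twice.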

Site Override System:

For sites that defeat heuristic detection, YAML configuration provides escape hatches:

hellgatenyc.com:
  article_url_pattern: '/[a-z0-9-]+/$'
  body_fallback_chain:
    - "article.post-content"
    - "div.entry-content"
    - "div.article-body"

3.3 Email Ingestion Pipeline

Email processing follows a seven-step pipeline orchestrated by update_emails_v2.sh:

Step 1:   Gmail API extraction with service account delegation
Step 1.1: URL extraction from email content
Step 1.25: Deterministic triage (email/notify/no/spam)
Step 1.5: Attachment extraction → Google Drive with SHA-256 dedup
Step 2:   Semantic chunking with 500-char segments
Step 3:   Gemini Flash classification (16 categories)
Step 3.5: Structured event extraction (ICS parsing)
Step 4:   Participant intelligence (role extraction, fingerprinting)
Step 5:   Story extraction for journalism workflow
Step 6:   LangChain agent processing
Step 7:   Entity extraction (SpaCy NER)

Hybrid Classification System:

Deterministic rules handle predictable patterns before invoking LLM classification:

class DeterministicFilter:
    # Spam TLDs with >95% spam rate
    SPAM_TLDS = {'.xyz', '.top', '.gdn', '.click', '.loan'}

    # Known newsletter domains
    NEWSLETTER_DOMAINS = {'mailchimp.com', 'substack.com', 'constantcontact.com'}

    # Financial transaction patterns
    FINANCIAL_PATTERNS = [r'paypal.*receipt', r'venmo.*paid']

    def classify(self, email) -> Optional[str]:
        # Check spam indicators
        if any(email.sender.endswith(tld) for tld in self.SPAM_TLDS):
            return 'spam'

        # Check newsletter patterns
        sender_domain = email.sender.rsplit('@', 1)[-1]
        if sender_domain in self.NEWSLETTER_DOMAINS:
            return 'marketing_newsletter'

        # ... additional rules
        return None  # Fall through to LLM classification

This hybrid approach preserves LLM capacity for nuanced decisions while handling obvious cases instantly, achieving approximately 10x throughput improvement over LLM-only classification.

3.4 Data Quality Framework

Deduplication:

Content fingerprinting using SHA-256 prevents duplicate storage:

def generate_fingerprint(self, email):
    # Normalize content
    content = self.normalize_tracking_params(email.body_text)
    content = self.normalize_whitespace(content)

    # Generate hash
    return hashlib.sha256(content.encode('utf-8')).hexdigest()

Referential Integrity:

CASCADE deletes maintain consistency:

ALTER TABLE article_chunks
ADD CONSTRAINT fk_article
FOREIGN KEY (article_id) REFERENCES articles(id) ON DELETE CASCADE;

Encoding Safety:

UTF-8 sanitization prevents PostgreSQL errors from null bytes:

clean_text = text.encode('utf-8', 'ignore').decode('utf-8')
clean_text = clean_text.replace('\x00', '')  # Remove null bytes

3.5 Key Technical Achievements


4. NYCNewsScanner: Intelligent News Discovery

4.1 Scanning Architecture

The NYCNewsScanner employs a multi-agent architecture using the Claude Agent SDK to parallelize news research across diverse source types.

                         ┌─────────────────────┐
                         │  News Coordinator   │
                         │  (Orchestrator)     │
                         └──────────┬──────────┘
            ┌───────────────────────┼───────────────────────┐
            ▼                       ▼                       ▼
 ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
 │  DB Researcher   │    │ Email Researcher │    │  Web Researcher  │
 │  (7 outlets)     │    │  (Press/News)     │    │  (Reddit/Web)    │
 └────────┬─────────┘    └────────┬─────────┘    └────────┬─────────┘
          │                       │                       │
          └───────────────────────┴───────────────────────┘
                                  ▼
                       ┌──────────────────────┐
                       │  Story Extractor     │
                       │  (5W Analysis)       │
                       └──────────┬───────────┘
                                  ▼
                       ┌──────────────────────┐
                       │  Memory Integration  │←─── Quality Gate
                       │  (4-Level Scope)     │←─── Feedback Parser
                       └──────────┬───────────┘
                                  ▼
                       ┌──────────────────────┐
                       │  Pitch Generation    │
                       │  + Human Review      │
                       └──────────────────────┘

Source Coverage:

4.2 Relevance Scoring & Prioritization

Each extracted story receives a newsworthiness score based on multiple factors:

def calculate_newsworthiness(self, story, entities, email_context):
    score = 0.0

    # Geographic proximity to Bushwick
    bushwick_entities = ['Bushwick', 'Community Board 4', 'BK90']
    for entity in entities:
        if entity in bushwick_entities:
            score += 0.3

    # Known community figures
    if self.entity_db.is_known_figure(story.who):
        score += 0.2

    # Temporal urgency
    if story.when and story.when < datetime.now() + timedelta(days=7):
        score += 0.2

    # Source credibility
    if email_context.sender_profile.is_government:
        score += 0.15
    elif email_context.sender_profile.is_journalist:
        score += 0.1

    return min(score, 1.0)

Stories scoring below threshold are filtered before human review, focusing editorial attention on genuinely newsworthy content.

4.3 Memory System

The memory system implements a four-level scope hierarchy based on research into production memory architectures (Mem0, Letta/MemGPT, academic literature):

Scope Levels:

  1. Global: Applies across all contexts ("Route advertising inquiries to business team")
  2. Organization: Persists through staff turnover ("CB4 meets second Wednesday monthly")
  3. Project: Tracks ongoing investigations ("Bushwick rezoning - these stakeholders are connected")
  4. Sender: Individual preferences ("Maria prefers informal tone")
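Before retrieval, an incoming email's context resolves into the concrete set of scopes to query. A hypothetical sketch of that resolution (function name and shape are illustrative, not the production API; `scope_id` is NULL for global memories, matching the schema below):

```python
def build_scope_filter(sender=None, org=None, projects=()):
    """Resolve email context into (scope_type, scope_id) pairs to query.

    Global memories always apply; the narrower scopes are included
    only when the corresponding context is known.
    """
    scopes = [('global', None)]
    if org:
        scopes.append(('org', org))
    for project in projects:
        scopes.append(('project', project))
    if sender:
        scopes.append(('sender', sender))
    return scopes

scopes = build_scope_filter(sender='maria@nyc.gov', org='hpd',
                            projects=['bushwick-rezoning'])
assert ('global', None) in scopes
assert ('sender', 'maria@nyc.gov') in scopes
```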

Memory Storage:

CREATE TABLE am_memories (
    id SERIAL PRIMARY KEY,
    scope_type VARCHAR CHECK (scope_type IN ('global', 'org', 'project', 'sender')),
    scope_id VARCHAR,  -- NULL for global, email for sender, org_id for org
    content TEXT NOT NULL,
    embedding VECTOR(384),
    memory_type VARCHAR,  -- 'preference', 'fact', 'procedure', 'relationship'
    confidence FLOAT DEFAULT 1.0,
    times_applied INTEGER DEFAULT 0,
    times_led_to_acceptance INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    last_accessed TIMESTAMPTZ
);

Memory Retrieval:

Hybrid search using Reciprocal Rank Fusion combines vector similarity with full-text matching:

CREATE FUNCTION search_memories(query_text TEXT, query_embedding VECTOR, scope_filter JSONB)
RETURNS TABLE (memory_id INT, content TEXT, score FLOAT) AS $$
    -- Vector similarity component
    WITH vector_results AS (
        SELECT id, content, 1 - (embedding <=> query_embedding) as vscore,
               ROW_NUMBER() OVER (ORDER BY embedding <=> query_embedding) as vrank
        FROM am_memories
        WHERE scope_type = ANY(ARRAY(SELECT jsonb_array_elements_text(scope_filter->'scopes')))
    ),
    -- Full-text component
    text_results AS (
        SELECT id, content, ts_rank(search_vector, plainto_tsquery(query_text)) as tscore,
               ROW_NUMBER() OVER (
                   ORDER BY ts_rank(search_vector, plainto_tsquery(query_text)) DESC
               ) as trank
        FROM am_memories
        WHERE search_vector @@ plainto_tsquery(query_text)
    )
    -- RRF combination (k = 60; results missing from a list are penalized at rank 1000)
    SELECT COALESCE(v.id, t.id), COALESCE(v.content, t.content),
           (1.0/(60 + COALESCE(v.vrank, 1000)) + 1.0/(60 + COALESCE(t.trank, 1000))) as rrf_score
    FROM vector_results v FULL OUTER JOIN text_results t ON v.id = t.id
    ORDER BY rrf_score DESC;
$$ LANGUAGE sql;
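The RRF arithmetic is easier to see outside SQL. An equivalent Python sketch using the same constant k = 60 and the same rank-1000 penalty for results absent from one of the two lists:

```python
def rrf_merge(vector_ranked, text_ranked, k=60, missing_rank=1000):
    """Fuse two ranked ID lists via Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document; documents absent
    from a list get a large penalty rank, mirroring COALESCE(..., 1000).
    """
    vrank = {doc_id: i + 1 for i, doc_id in enumerate(vector_ranked)}
    trank = {doc_id: i + 1 for i, doc_id in enumerate(text_ranked)}
    scores = {}
    for doc_id in set(vector_ranked) | set(text_ranked):
        scores[doc_id] = (1.0 / (k + vrank.get(doc_id, missing_rank))
                          + 1.0 / (k + trank.get(doc_id, missing_rank)))
    return sorted(scores, key=scores.get, reverse=True)

# Memory 7 ranks first in both lists, so it tops the fused ranking
merged = rrf_merge(vector_ranked=[7, 3, 9], text_ranked=[7, 9, 4])
assert merged[0] == 7
```

The appeal of RRF is that it needs no score calibration: vector similarities and ts_rank values live on incomparable scales, but ranks are always comparable.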

4.3.1 Memory Formation

New memories form through the AUDN (Add/Update/Delete/Noop) cycle in reconciler.py:

  1. Trigger: Human edits an AI-generated draft
  2. Parse: FeedbackParser classifies edit type (tone, factual, intent, structural)
  3. Gate: QualityGate filters noise (minimum 10% magnitude, 60% confidence, 5-minute debounce)
  4. Extract: LLM extracts learnable preference from qualified signal
  5. Reconcile: Compare to existing memories; AUDN determines action

class QualityGate:
    def should_process(self, signal: FeedbackSignal) -> bool:
        # Minimum edit magnitude
        if signal.edit_ratio < 0.10:
            return False

        # Confidence threshold
        if signal.confidence < 0.60:
            return False

        # Debounce rapid edits
        if signal.seconds_since_last < 300:
            return False

        return True

4.3.2 Memory Retrieval

When generating content or responses, the system retrieves relevant memories using scope-based boosting:

def retrieve_memories(self, context):
    base_memories = self.search_memories(context.query)

    # Apply scope boosting
    for memory in base_memories:
        if memory.scope_type == 'sender' and memory.scope_id == context.sender:
            memory.score *= 1.3  # Sender memories most relevant
        elif memory.scope_type == 'org' and memory.scope_id == context.org:
            memory.score *= 1.2
        elif memory.scope_type == 'project' and memory.scope_id in context.projects:
            memory.score *= 1.1
        # Global memories use base score

    return sorted(base_memories, key=lambda m: m.score, reverse=True)[:10]

4.3.3 Memory Decay & Consolidation

The effectiveness tracker implements decay for underperforming memories:

def compute_effectiveness(self, memory):
    if memory.times_applied == 0:
        return 0.5  # Neutral default

    acceptance_rate = memory.times_led_to_acceptance / memory.times_applied
    edit_ratio = memory.total_edit_distance / memory.total_output_length

    # Weighted combination
    return 0.6 * acceptance_rate + 0.4 * (1 - edit_ratio)
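A quick worked example with hypothetical counts shows how the weighting plays out:

```python
# Hypothetical memory stats: applied 10 times, 7 led to acceptance,
# and edits changed 20% of the generated output on average.
acceptance_rate = 7 / 10
edit_ratio = 0.2

effectiveness = 0.6 * acceptance_rate + 0.4 * (1 - edit_ratio)
assert abs(effectiveness - 0.74) < 1e-9  # well above the neutral 0.5 default
```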

Memories with effectiveness below threshold become candidates for automatic pruning via the ea memory prune CLI command.

4.4 Reflection Engine

The reflection system was redesigned after research revealed that "reflection without external feedback degrades performance." The current architecture uses deterministic parsing BEFORE any LLM involvement:

The Anti-Pattern (Avoided):

# DON'T: Send raw edits directly to LLM for reflection
def learn_from_edit(original, edited):
    prompt = f"What preference does this edit reveal?\nOriginal: {original}\nEdited: {edited}"
    return llm.complete(prompt)  # Degrades over time

The Implemented Pattern:

# DO: Deterministic parsing, quality gate, then targeted LLM extraction
def learn_from_edit(original, edited):
    # Step 1: Deterministic classification
    diff = difflib.SequenceMatcher(a=original, b=edited)
    edit_type = classify_edit_type(diff)  # tone, factual, intent, etc.

    # Step 2: Quality gate
    if not quality_gate.should_process(edit_type, diff.ratio()):
        return None

    # Step 3: Targeted LLM extraction (only for qualified signals)
    return extract_learnable_preference(original, edited, edit_type)

The quality gate prevents several documented failure modes:

- Noise pollution: Typo fixes don't become "preferences"
- Debounce protection: Rapid-fire edits don't create duplicate memories
- Magnitude filtering: Only edits changing 10%+ of content pass
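The deterministic classifier in Step 1 can be approximated with stdlib difflib alone. A hypothetical sketch (thresholds and categories here are illustrative, not the production values):

```python
import difflib

def classify_edit_type(original: str, edited: str) -> str:
    """Coarse, fully deterministic edit classification: no LLM involved."""
    ratio = difflib.SequenceMatcher(a=original, b=edited).ratio()
    if ratio > 0.95:
        return 'typo'        # near-identical; filtered by the quality gate anyway
    if len(edited) < len(original) * 0.5:
        return 'structural'  # large deletion or wholesale rewrite
    if ratio < 0.60:
        return 'intent'      # mostly new content replacing the draft
    return 'tone'            # moderate in-place rewording

# Removing a duplicated word leaves near-identical text: classified as a typo
fixed = classify_edit_type(
    'Thanks for the the update on the rezoning vote.',
    'Thanks for the update on the rezoning vote.',
)
assert fixed == 'typo'
```

Because this step is cheap and deterministic, it can run on every edit; only signals that survive it ever reach the LLM.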

4.5 Publisher Integration

The pitch generation system produces structured outputs for editorial review:

@dataclass
class EditorPitch:
    pitch_id: str
    headline: str
    key_facts: List[str]
    source_attribution: str
    source_email_id: int
    newsworthiness_score: float
    suggested_angle: str
    related_articles: List[int]  # IDs of existing coverage

The HITL review interface displays pitches with full context:

┌─────────────────────────────────────────────────────────────┐
│ PITCH: City Announces New Affordable Housing Lottery        │
├─────────────────────────────────────────────────────────────┤
│ Source: HPD Press Release (maria.torres@nyc.gov)            │
│ Score: 0.87 (HIGH)                                          │
│                                                             │
│ KEY FACTS:                                                  │
│  • 150 units at 1234 Bushwick Ave                           │
│  • Income bands: 40%, 60%, 80% AMI                          │
│  • Applications open March 20                               │
│                                                             │
│ RELATED COVERAGE: 3 prior articles on this development      │
│                                                             │
│ [a]pprove  [r]eject  [e]dit  [n]ext  [q]uit                 │
└─────────────────────────────────────────────────────────────┘

5. Email Assistant

5.1 Processing Pipeline

The Email Assistant implements a complete email processing and response generation system:

Gmail API → Triage (deterministic filter) → HITL Review (pre-orchestrator checkpoint) → Orchestrator (task extraction) → Skills (skill dispatch) → Draft Review (human approval) → Send

Triage System:

Four-tier classification replaces the initial binary (YES/NO) approach:

Category | Description | Action
email | Requires response from publisher | → Orchestrator
notify | Awareness-only (press releases, FYI) | → Notification
no | Not relevant to newsroom | → Archive
spam | Blocked, flagged for filter learning | → Spam folder

Pre-Orchestrator HITL:

A filtering checkpoint before expensive LLM calls:

$ ea hitl review --limit 50

Thread: Re: Interview Request - Community Garden
From: reporter@brooklyneagle.com
Triage: EMAIL (needs response)
Preview: Hi Alec, I'm working on a story about...

Actions: [x]spam [m]mute [o]rchestrator [n]ext [q]uit

Each spam email caught here saves 3-5 LLM calls (task extraction, classification, skill dispatch, draft generation).

5.2 Context-Aware Response Generation

Draft responses leverage multiple context sources:

Sender Profile:

profile = sender_profiles.get(sender_email)
# Returns: organization, role, is_journalist, is_government,
#          communication_history, preferred_tone, last_interaction

Thread Context:

thread = email_threads.get_full_thread(thread_id)
# Returns: all messages in thread, participants, subject evolution

Adaptive Memories:

memories = memory_store.retrieve_for_context(
    sender=sender_email,
    org=profile.organization,
    projects=active_projects,
    query=email_subject
)
# Returns: ranked preferences, facts, procedures relevant to this email

Few-Shot Examples:

examples = few_shot_retriever.get_similar_responses(
    email_type=classification,
    sender_type=profile.classification,
    limit=3
)
# Returns: successful past responses to similar emails

The draft generation prompt assembles these contexts:

prompt = f"""
You are drafting a response for Alec Meeker, publisher of Bushwick Daily.

SENDER CONTEXT:
{sender_profile_summary}

THREAD CONTEXT:
{thread_history}

RELEVANT MEMORIES:
{format_memories(memories)}

SIMILAR PAST RESPONSES:
{format_examples(examples)}

WRITING GUIDELINES:
- Warm but direct tone
- Business development orientation
- NO em-dashes, NO "I hope this finds you well"
- Be specific about next steps

EMAIL TO RESPOND TO:
{current_email}

Draft a response:
"""

5.3 Human Oversight Design

Draft Review Interface:

┌─────────────────────────────────────────────────────────────┐
│ DRAFT RESPONSE                                              │
├─────────────────────────────────────────────────────────────┤
│ To: maria.torres@nyc.gov                                    │
│ Subject: Re: Press Credential Application                   │
│                                                             │
│ Hi Maria,                                                   │
│                                                             │
│ Thanks for following up on the credential application.      │
│ I've attached our updated circulation numbers and added     │
│ the publication schedule you requested.                     │
│                                                             │
│ Let me know if you need anything else for the review.       │
│                                                             │
│ Best,                                                       │
│ Alec                                                        │
│                                                             │
│ [s]end [e]dit [r]evise [a]nswer [d]iscard [n]ext [q]uit     │
└─────────────────────────────────────────────────────────────┘

Action Options:

Action | Effect | Learning
Send | Send as-is | Positive signal: draft was perfect
Edit | Open in editor, send after | Diff triggers memory extraction
Revise | Regenerate with notes | Notes inform next attempt
Answer | Continue multi-turn | Clarification needed
Discard | Reject draft | Strong negative signal

Two-Phase HITL for Automation:

For actions like form filling, the system implements two-phase approval:

Phase 1: Analyze form → Show proposed values → Await approval
Phase 2: Fill form → Capture screenshot → Confirm submission

This prevents automated form submission with incorrect data while reducing manual data entry.


6. Integration & Orchestration

6.1 System Interoperability

Components communicate through shared PostgreSQL tables and well-defined interfaces:

Shared Data Structures:

# All components use a consistent email representation
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

@dataclass
class EmailContext:
    email_id: int
    thread_id: str
    sender_email: str
    sender_profile: Optional[SenderProfile]
    subject: str
    body_text: str
    body_html: str
    received_at: datetime
    classification: str
    participant_roles: Dict[str, str]
    attachments: List[Attachment]

Event-Driven Updates:

Content lineage tracking creates audit trails across all operations:

CREATE TABLE content_lineage_events (
    id SERIAL PRIMARY KEY,
    content_type VARCHAR,  -- email, story, pitch, article, queue_item
    content_id INTEGER,
    event_type VARCHAR,    -- extracted, approved, rejected, published
    actor_type VARCHAR,    -- system, human
    actor_id VARCHAR,
    parent_content_type VARCHAR,
    parent_content_id INTEGER,
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
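Given the parent links in this schema, one way to trace a published piece back to its originating email is a recursive query. This is a sketch, not a query from the codebase; the `content_id = 42` is a placeholder, and it assumes lineage chains are acyclic:

```sql
-- Walk parent links from a published article back to its source email.
WITH RECURSIVE lineage AS (
    SELECT * FROM content_lineage_events
    WHERE content_type = 'article' AND content_id = 42
    UNION ALL
    SELECT e.*
    FROM content_lineage_events e
    JOIN lineage l
      ON e.content_type = l.parent_content_type
     AND e.content_id   = l.parent_content_id
)
SELECT content_type, content_id, event_type, actor_type, created_at
FROM lineage
ORDER BY created_at;
```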

6.2 State Management

Complex workflow state persists to PostgreSQL rather than memory:

Session Persistence:

import json

class SessionStore:
    def save_session(self, session_id, state):
        # Write-through: every update persists immediately
        with self.conn.cursor() as cur:
            cur.execute("""
                INSERT INTO nycnews_agent_sessions (session_id, state, updated_at)
                VALUES (%s, %s, NOW())
                ON CONFLICT (session_id) DO UPDATE
                    SET state = EXCLUDED.state, updated_at = NOW()
            """, (session_id, json.dumps(state)))
            self.conn.commit()

    def load_session(self, session_id):
        # Read-through: database fallback when not in memory
        with self.conn.cursor() as cur:
            cur.execute(
                "SELECT state FROM nycnews_agent_sessions WHERE session_id = %s",
                (session_id,)
            )
            result = cur.fetchone()
            return json.loads(result[0]) if result else None

Bidirectional Task-Draft Linking:

-- Draft knows its generating task
ALTER TABLE ea_email_drafts ADD COLUMN source_task_id INTEGER REFERENCES ea_tasks(id);

-- Task knows its output draft
ALTER TABLE ea_tasks ADD COLUMN generated_draft_id INTEGER REFERENCES ea_email_drafts(id);

This enables atomic state synchronization:

def sync_task_draft_status(task_id, draft_id, action):
    """Atomic status update for linked task and draft."""
    with conn.cursor() as cur:
        if action == 'send':
            cur.execute("""
                UPDATE ea_tasks SET status = 'completed', completed_at = NOW()
                WHERE id = %s;
                UPDATE ea_email_drafts SET status = 'sent', sent_at = NOW()
                WHERE id = %s;
            """, (task_id, draft_id))
        elif action == 'reject':
            cur.execute("""
                UPDATE ea_tasks SET status = 'cancelled'
                WHERE id = %s;
                UPDATE ea_email_drafts SET status = 'rejected'
                WHERE id = %s;
            """, (task_id, draft_id))
        conn.commit()

6.3 Error Handling & Recovery

Graceful Degradation:

def process_email_with_fallbacks(email):
    try:
        # Primary path: full context enrichment
        profile = get_sender_profile(email.sender)
        memories = get_relevant_memories(email, profile)
        examples = get_few_shot_examples(email)
    except ProfileServiceError:
        # Fallback: basic context only
        profile = None
        memories = []
        examples = get_generic_examples(email.classification)

    # Always proceed with available context
    return generate_draft(email, profile, memories, examples)

Retry Logic:

@celery.task(bind=True, max_retries=3, default_retry_delay=60)
def generate_embeddings(self, article_id):
    try:
        # Embedding generation
        chunks = chunk_article(article_id)
        embeddings = model.encode(chunks)
        store_embeddings(article_id, embeddings)
    except TransientError as e:
        # Retry for transient failures
        raise self.retry(exc=e, countdown=60 * (self.request.retries + 1))
    except PermanentError as e:
        # Log and skip for permanent failures
        log_problematic_article(article_id, str(e))
        return False

Recovery Scripts:

When issues occur at scale, recovery scripts identify and fix discrepancies:

def recover_missing_embeddings():
    """Find and regenerate embeddings that failed to persist."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT ce.id FROM classified_emails ce
            LEFT JOIN email_chunks ec ON ce.id = ec.email_id
            WHERE ec.email_id IS NULL
            AND ce.processed_at IS NOT NULL
        """)
        missing = cur.fetchall()

    for (email_id,) in missing:  # fetchall() returns 1-tuples; unpack the id
        regenerate_embeddings.delay(email_id)

7. Development Journey & Lessons Learned

7.1 The Learning Curve

This system was built iteratively over nine months, with the author applying analytical frameworks from investment analysis and institutional finance to system architecture decisions. The commit history reveals the development journey:

Week 1 (July 2025): Foundation and First Failures

The initial commit established an ambitious architecture: Scrapy spider, PostgreSQL with pgvector, Celery background tasks. Within 48 hours, production reality intervened:

Commit b6cce11: "lots of updates" - Connection pool exhaustion and protocol corruption under real-world load

Commit a98e976: "pooling set to 1" - Defensive retreat, implementing log_problematic_article() for graceful failure handling

The solution emerged in commit 2b52729:

"celery workers are stable and sql alchemy has been used to implement"

Lesson: Production stability often requires abandoning clever solutions for boring, proven ones. SQLAlchemy's connection pooling solved problems that took days to debug with manual psycopg3 management.

Month 2 (August 2025): Email Intelligence Integration

Adding email processing to the RAG system revealed deeper architectural challenges:

Commit 2b1155a: "email embedding system is now working with real data and is merged with Rag"

This required rebuilding the HNSW indexes (6+ hours) and ensuring that article and email chunks share the same embedding dimension.

Key Insight: Unifying data models early (same embedding dimension, same chunk size) enables system-wide search without constant adaptation layers.

Month 4 (October 2025): The Entity System

Building entity extraction revealed the complexity of real-world names:

Commit 382acca: "entity extraction system created"

The three-hash architecture (SHA-256 for exact match, MD5-16 for fast lookup, Double Metaphone for fuzzy matching) emerged from production requirements: "Johnn Smith" needs to match "John Smith" without false positives.
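The first two tiers can be sketched with the standard library; the Double Metaphone tier would come from a phonetic-matching library and is omitted here, and `entity_keys` is an illustrative name rather than the production function:

```python
import hashlib

def entity_keys(name: str) -> dict:
    """Compute the exact-match and fast-lookup keys for an entity name.

    Normalization (lowercase, collapsed whitespace) happens before hashing
    so trivially different spellings of the same name collide on purpose.
    The third tier, Double Metaphone for fuzzy/phonetic matching, is not
    implemented here.
    """
    norm = " ".join(name.lower().split())
    return {
        "exact": hashlib.sha256(norm.encode()).hexdigest(),  # exact match
        "fast":  hashlib.md5(norm.encode()).hexdigest()[:16],  # MD5-16 lookup
    }
```

The deliberate layering matters: the cheap keys resolve the common case, and only names that miss both fall through to the (more expensive, more false-positive-prone) phonetic tier.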

Month 6 (December 2025): Memory System Pivot

The initial reflection system sent raw edit diffs to Claude for preference extraction. Research revealed this approach degrades over time:

From research notes: "Reflection without external feedback degrades performance. When you see a reflection system that works, you're almost always looking at a verification system in disguise."

The redesign implemented deterministic parsing BEFORE LLM involvement, with quality gates preventing noise accumulation.
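A minimal sketch of that parse-then-gate idea, using stdlib `difflib` with illustrative thresholds (the production parser and quality gate are more elaborate):

```python
import difflib

def extract_edit_signal(draft: str, sent: str):
    """Deterministically parse an edit diff BEFORE any LLM sees it.

    Returns None when the edit carries no learnable signal (the quality
    gate), otherwise a structured summary of what changed. The 0.98
    similarity threshold is illustrative.
    """
    ratio = difflib.SequenceMatcher(None, draft, sent).ratio()
    if ratio > 0.98:
        return None  # trivial edit: gate it out, nothing to learn
    changed = [line for line in difflib.ndiff(draft.splitlines(),
                                              sent.splitlines())
               if line.startswith(("+ ", "- "))]
    if not changed:
        return None
    return {"similarity": round(ratio, 3), "changed_lines": changed}
```

Only the structured, gated output would then be handed to the LLM for preference extraction, which is what keeps noise out of the memory store.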

Lesson: Academic research is worth reading. The Mem0 and MemGPT papers predicted exactly the failure mode we encountered.

7.2 Iterative Development

Features evolved based on real editorial needs rather than theoretical requirements:

Email Triage Evolution:

  1. Initial: Binary YES/NO classification
  2. Week 2: User feedback demanded distinguishing "notify" from "no action"
  3. Week 4: Four-tier classification with deterministic pre-filtering
  4. Month 2: Backlog mode for efficient catchup processing
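The deterministic pre-filter from step 3 might look like the following; the patterns, tier names, and function names are illustrative, not the actual rules:

```python
TIERS = ("respond", "notify", "archive", "no_action")

def deterministic_prefilter(sender: str, subject: str):
    """Handle obvious cases with rules, reserving the LLM for ambiguity.
    Returns a tier, or None when the email needs the LLM classifier."""
    if "unsubscribe" in subject.lower():
        return "no_action"
    if sender.endswith("@calendar-notifications.google.com"):
        return "notify"
    return None  # ambiguous: fall through to the LLM

def classify(sender: str, subject: str, llm_classify):
    """Four-tier triage: deterministic rules first, LLM only on a miss."""
    tier = deterministic_prefilter(sender, subject)
    return tier if tier is not None else llm_classify(sender, subject)
```

On a 200K-email backlog, every email the rules catch is an LLM call saved, which is what made backlog mode economical.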

Draft Generation Evolution:

  1. Initial: Static prompts with generic voice
  2. Week 3: Few-shot retrieval added after inconsistent voice matching
  3. Month 2: Memory integration for learned preferences
  4. Month 4: Task orchestration for action extraction

Each iteration addressed a specific user complaint rather than anticipated requirements.

7.3 Key Technical Insights

1. Incremental Processing is Essential

Systems processing 200K+ items need incremental operation. The URL cache middleware (articles) and attachments_extracted flag (emails) enable processing only new content.
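A self-contained demonstration of the flag pattern, using sqlite3 for portability (production uses PostgreSQL, where placeholders are `%s` rather than `?`); `process_new` is an illustrative name:

```python
import sqlite3

def process_new(conn, extract):
    """Process only rows whose incremental flag is unset, then mark them.
    Re-running is a no-op for already-processed rows."""
    cur = conn.execute(
        "SELECT id FROM classified_emails WHERE attachments_extracted = 0")
    ids = [row[0] for row in cur.fetchall()]
    for email_id in ids:
        extract(email_id)  # the expensive work, done at most once per row
        conn.execute(
            "UPDATE classified_emails SET attachments_extracted = 1 WHERE id = ?",
            (email_id,))
    conn.commit()
    return ids
```

The same shape applies to the article side, where the URL cache middleware plays the role of the flag.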

2. Cost Tracking Enables Optimization

# Token usage tracking revealed opportunities
# Initial: ~10 seconds per email (subprocess per email)
# After single-session batch: ~6 seconds per email
# Savings: 100+ hours on 182K email backlog

3. Database Joins Beat Application Logic

The email classification bug that returned 0 results for newsletter queries was a JOIN issue (classified_emails.classification was always 'unclassified'; the real data was in email_pipeline_routes.pipeline_type). Fixing the JOIN unlocked 337 emails instantly.
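A hedged reconstruction of the corrected query; the join column `epr.email_id` is an assumption about the schema, not taken from the codebase:

```sql
-- Classification lives in email_pipeline_routes.pipeline_type,
-- not in classified_emails.classification (always 'unclassified').
SELECT ce.id, ce.subject
FROM classified_emails ce
JOIN email_pipeline_routes epr ON epr.email_id = ce.id
WHERE epr.pipeline_type = 'newsletter';
```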

4. Human-in-the-Loop Reduces Risk

Every automated action we considered implementing eventually got a HITL checkpoint. The two-phase form filling pattern emerged from a near-miss where test data would have been submitted to a real form.

5. Checkpoint Commits Enable Recovery

The commit messages "pre changes before X" and "pre-refactor" represent deliberate checkpoints before risky changes. When refactors go wrong, these enable clean rollback.

7.4 What Worked, What Didn't

Worked Well:

Didn't Work:

Still Evolving:


8. Impact & Results

8.1 Operational Improvements

Quantitative Metrics:

| Metric | Before | After | Improvement |
|---|---|---|---|
| News source monitoring | 2-3 hours/day | 15 minutes/day | 90% reduction |
| Email triage time | 1-2 hours/day | 20 minutes/day | 80% reduction |
| Article research time | 30-60 min/story | 5-10 min/story | 80% reduction |
| Response drafting | 10-15 min/email | 2-3 min/email | 80% reduction |
| Event calendar updates | Manual entry | Semi-automated | Hours saved weekly |

System Scale:

8.2 Editorial Capabilities

The system enables capabilities previously unavailable to a small newsroom:

Comprehensive Competitive Intelligence:

Institutional Memory:

Source Relationship Management:

Automated Event Tracking:

8.3 Sustainability Implications

For local news sustainability, the implications are significant:

Cost Structure:

Scalability:

Replicability:


9. Ethical Considerations

9.1 AI in Journalism

The use of AI in journalism raises legitimate concerns that this system addresses through architectural choices:

Human Judgment Preserved:

Accuracy Safeguards:

Voice Authenticity:

Transparency:

9.2 Data Privacy

Email Privacy:

Source Protection:

9.3 Transparency

AI Assistance Disclosure:


10. Future Directions

10.1 Planned Enhancements

Short-term (2026 Q2):

Medium-term (2026 H2):

Long-term:

10.2 Scaling Considerations

The architecture supports scaling in several dimensions:

Horizontal Scaling:

Multi-Newsroom:

Data Growth:

10.3 Open Questions

Technical:

Editorial:

Sustainability:


11. Conclusion

This white paper documents nine months of building AI infrastructure for a hyperlocal newsroom. The resulting system demonstrates that sophisticated AI capabilities are achievable for small news operations, potentially transforming the economics of local journalism.

The technical achievement is significant: 255 GB of structured news and email data, 8 million searchable semantic chunks, sub-100ms query performance, and integrated AI agents that monitor, extract, draft, and orchestrate across diverse information sources.

But the more important achievement is the design philosophy embedded in the architecture: human-in-the-loop by default, editorial integrity as a requirement, practical utility over impressive capabilities. AI serves as a force multiplier, not a replacement. The system proposes; humans decide.

For local journalism, the implications extend beyond individual newsroom efficiency. If AI can enable one publisher to maintain coverage depth previously requiring dedicated staff, similar systems could revive local news coverage in communities currently without any journalism presence. The technology is replicable, the costs are manageable, and the architecture is documented.

The local news crisis is real and deepening. This system represents one response: building tools that make small newsrooms more capable rather than waiting for business models that may never materialize. AI infrastructure won't solve every problem facing local journalism, but it can address the operational bottlenecks that consume editorial attention and prevent the deep community engagement that local news requires.

The code exists. The documentation exists. The path forward is clearer than it was nine months ago.


Appendices

A. Technology Reference

Core Technologies:

| Category | Technology | Version | Purpose |
|---|---|---|---|
| Database | PostgreSQL | 17 | Primary data store |
| Vector Search | pgvector | 0.7.0 | Embedding storage and similarity search |
| Cache | Redis | 7.x | Task queue broker, session cache |
| Backend | Python | 3.11 | Primary language |
| Web Framework | FastAPI | 0.128.0 | API endpoints |
| Task Queue | Celery | 5.x | Background processing |
| Scraping | Scrapy | 2.x | Web scraping framework |
| Browser Automation | Playwright | 1.x | Form filling, screenshots |
| Workflow | LangGraph | 0.x | Multi-agent orchestration |
| ML/NLP | SpaCy | 3.x | Named entity recognition |
| Embeddings | SentenceTransformers | 2.x | all-MiniLM-L6-v2 model |

LLM APIs:

| Provider | Model | Use Case |
|---|---|---|
| Anthropic | Claude Sonnet 4 | Draft generation, content analysis |
| Anthropic | Claude Haiku | Image analysis, fast classification |
| Google | Gemini 2.0 Flash Lite | Email classification (cost-efficient) |
| OpenAI | GPT-4 | Specialized tasks, comparison |

Infrastructure:

| Component | Specification |
|---|---|
| Hardware | Mac Mini M2 / MacBook Pro |
| Storage | Samsung T7 SSD (1TB) for PostgreSQL data |
| OS | macOS 14.x |
| Package Manager | Homebrew |
| Python Environment | venv |

B. Repository Structure

Master-Scrape-Rag-Pipe/
├── news_scraper_project/           # Article ingestion
│   ├── spiders/article_spider.py   # Intelligent heuristic spider
│   ├── pipelines.py                # PostgreSQL atomic writes
│   ├── tasks.py                    # Celery embedding tasks
│   └── middlewares.py              # URL cache middleware
├── email_assistant/                # Email processing CLI
│   ├── cli.py                      # Main entry point
│   ├── orchestration/              # Task extraction & routing
│   │   ├── orchestrator.py         # Central dispatcher
│   │   ├── task_extractor.py       # LLM-based extraction
│   │   └── task_classifier.py      # Capability-aware matching
│   ├── drafting/                   # Response generation
│   │   ├── draft_response.py       # Context-enriched drafting
│   │   └── context_enrichment.py   # RAG context injection
│   ├── memory/                     # Adaptive learning
│   │   ├── store.py                # Memory persistence
│   │   ├── feedback_parser.py      # Deterministic parsing
│   │   └── quality_gate.py         # Signal validation
│   ├── skills/                     # Automation capabilities
│   │   ├── rsvp_skill.py           # Form filling
│   │   ├── calendar_skill.py       # Event management
│   │   └── smart_form_skill.py     # Intelligent form analysis
│   └── rsvp/                       # Browser automation
│       └── playwright_controller.py
├── rag_chatbot/                    # RAG interface
│   ├── app.py                      # FastAPI application
│   ├── ai_generator.py             # Claude API integration
│   ├── vector_store_pg.py          # pgvector search
│   └── NYCNewsAgent/               # Research agent system
│       ├── research_agent/         # Multi-agent research
│       │   ├── agent.py            # Claude Agent SDK
│       │   └── email_story_extractor.py
│       ├── publish_pipeline/       # WordPress integration
│       │   ├── metadata_extractor.py
│       │   ├── seo_generator.py
│       │   └── image_analyzer.py
│       └── api/                    # Workflow APIs
│           ├── workflow_api.py
│           └── lineage/            # Content traceability
├── entity_extraction/              # Entity intelligence
│   ├── spacy_entity_pipeline.py    # NER pipeline
│   └── entity_disambiguation.py    # Alias management
├── attachment_extraction/          # Email attachments
│   ├── drive_uploader.py           # Google Drive streaming
│   └── calendar_parser.py          # ICS parsing
├── event_extraction/               # Calendar events
│   └── extract_events_from_emails.py
├── migrations/                     # Database schemas
│   └── 001-068_*.sql
├── update_emails_v2.sh             # Email pipeline orchestrator
├── run_crawls.py                   # Article scraping orchestrator
└── CLAUDE.md                       # AI assistant instructions

C. Commit Statistics

Development Phases:

| Phase | Period | Focus |
|---|---|---|
| Foundation | July 2025 | Scrapy, PostgreSQL, Celery stabilization |
| Email Intelligence | Aug-Sep 2025 | Gmail integration, entity extraction, participant tracking |
| Advanced Features | Oct-Dec 2025 | LangGraph workflows, memory system, reflection |
| Production Polish | Jan-Mar 2026 | HITL review, task orchestration, publishing pipeline |

Database Scale:

| Table | Records | Size |
|---|---|---|
| articles | 1.79M | 15 GB |
| article_chunks | 4.56M | 45 GB |
| classified_emails | 218K | 8 GB |
| email_chunks | 3.58M | 35 GB |
| sender_profiles | 10K | 200 MB |
| email_participants | 453K | 2 GB |
| HNSW indexes | 3 | 20 GB |
| Total | | ~255 GB |

D. Glossary

AUDN: Add/Update/Delete/Noop - memory reconciliation operations determining how new learning integrates with existing memories

Chunk: A semantic segment of text (typically 500 characters) with associated vector embedding for similarity search

Deterministic Filter: Rule-based classification that handles obvious cases without LLM invocation, preserving AI capacity for nuanced decisions

Few-Shot Learning: Providing examples of desired outputs in prompts to guide model behavior toward publication-specific patterns

HITL: Human-in-the-Loop - design pattern ensuring human approval before automated actions

HNSW: Hierarchical Navigable Small World - approximate nearest neighbor algorithm enabling fast vector similarity search

Memory Scope: The hierarchical level at which a learned preference applies (global, organization, project, sender)

Participant Intelligence: System for tracking email participants across the corpus, enabling relationship mapping and communication pattern analysis

pgvector: PostgreSQL extension adding vector data type and similarity search operators

Quality Gate: Validation layer that filters low-quality signals before they can influence learning systems

RAG: Retrieval-Augmented Generation - pattern combining information retrieval with LLM generation

RRF: Reciprocal Rank Fusion - technique for combining multiple ranking signals (e.g., vector similarity + text search)

Sender Profile: Aggregated intelligence about an email sender including organization, role, communication patterns, and relationship history

Semantic Search: Finding content by meaning rather than exact keyword matching, enabled by vector embeddings

tsvector: PostgreSQL's full-text search data type for efficient text matching


White paper compiled from 321 commits over 9 months of development. Data warehouse: 255 GB (1.79M articles, 218K emails, 8M+ chunks). Last updated: March 2026.


For questions about this system, potential collaboration, or career opportunities, contact:
Alec Meeker · alec@bushwickdaily.com · LinkedIn: linkedin.com/in/alecmeeker · GitHub: alecmeeeker.github.io