The local news crisis has left communities across America without reliable journalism coverage. While national outlets consolidate and local newsrooms collapse, AI has emerged as a potential force multiplier that could enable small newsrooms to punch above their weight. This white paper documents the nine-month development of a comprehensive AI infrastructure built by Bushwick Daily, a hyperlocal Brooklyn publication. The system encompasses 255 GB of structured data including 1.79 million scraped news articles from 75+ outlets, 218,000 processed emails with participant intelligence, and 8 million searchable semantic chunks. The architecture integrates multiple AI subsystems: an intelligent news scanner that monitors NYC media and extracts stories from email intelligence, an adaptive memory system that learns editorial preferences through human feedback, and an email assistant that triages correspondence and generates voice-accurate draft responses. The human-in-the-loop design philosophy ensures AI augments rather than replaces editorial judgment, with explicit checkpoints requiring human approval before any automated action. Development proceeded through deliberate iteration, with significant architectural pivots driven by production experience rather than theoretical assumptions. The resulting system demonstrates that sophisticated AI infrastructure is achievable for small newsrooms, potentially transforming the economics of local journalism by enabling one-person operations to maintain coverage depth traditionally requiring dedicated research staff.
The collapse of local journalism represents one of the most significant information gaps in American civic life. Since 2004, over 2,500 newspapers have closed. The remaining local outlets operate with skeleton staffs, unable to provide the coverage depth that community accountability requires. Hyperlocal publications face an impossible equation: the reporting workload of a full newsroom with the resources of a small business.
For Bushwick Daily, this tension manifested in concrete operational challenges. Monitoring 75+ NYC news outlets for relevant coverage required hours of daily manual checking. An inbox processing 200+ emails daily created a triage bottleneck where important community communications were lost in promotional noise. Institutional knowledge about sources, organizations, and ongoing stories existed only in the publisher's memory, vulnerable to the cognitive limitations of a single operator.
The hypothesis driving this project was straightforward: AI could serve as a force multiplier for small newsrooms rather than a replacement for journalists. The goal was not automated content generation but automated information processing, freeing editorial attention for judgment and writing while AI handled collection, organization, and routine correspondence.
This required building AI infrastructure from scratch rather than adopting off-the-shelf solutions. Commercial AI tools optimize for general use cases; a newsroom requires domain-specific capabilities around source management, editorial voice preservation, and journalistic accuracy standards.
Development proceeded through direct engagement with production requirements. The author — with a background in investment analysis ($200M portfolio management, PE fund due diligence), a B.S. in Finance and Business Information Systems, and eight years of company operations — built the system iteratively, applying the same analytical rigor to AI architecture decisions that he previously applied to portfolio analysis and fund evaluation. Prior experience with VBA/SQL automation for institutional reporting (reducing Emerging Market Debt reporting cycles by 70% at Erie Insurance) provided a foundation for the data pipeline engineering documented in this paper.
The approach prioritized:

- Human-in-the-loop design: AI proposes, humans approve. No automated actions without explicit human checkpoints.
- Editorial integrity: Voice preservation, source attribution, and accuracy verification built into system architecture.
- Practical utility: Features driven by real workflow pain points rather than theoretical capabilities.
- Sustainable architecture: Production-grade reliability over prototype impressiveness.
This paper covers the complete technical architecture of the Bushwick Daily AI newsroom system as of March 2026.
The development journey itself is documented, including architectural pivots, failed approaches, and lessons learned from nine months of iteration.
Five principles guided architectural decisions:
1. Human-in-the-Loop by Default: Every automated action passes through human approval. The system proposes; humans decide. This applies to email responses, story extraction approvals, form submissions, and content publication. AI handles the mechanical work of drafting, organizing, and retrieving; humans retain editorial judgment.
2. Editorial Integrity as Architecture: Voice preservation is not a feature but a requirement. The system learns from human corrections rather than overriding editorial preferences. When AI-generated content is edited before sending, those edits become training data for future improvements.
3. Practical Over Impressive: Features address documented workflow problems rather than theoretical capabilities. The incremental crawling system exists because daily news monitoring was consuming hours; the email triage system exists because 200+ daily emails were unmanageable manually.
4. Production Reliability: Graceful degradation over brittle perfection. When individual articles cause parsing errors, the system logs and skips rather than crashing. When API calls fail, retry logic with exponential backoff ensures eventual completion. Background processing through Celery workers ensures main operations remain responsive.
5. Sustainable Operations: Infrastructure must run on available hardware within realistic budgets. The system operates on a Mac with an external SSD for PostgreSQL data, using commodity cloud APIs with cost tracking. AI operations are optimized for cost efficiency without sacrificing capability.
┌──────────────────────────────────────────────────────┐
│ DATA SOURCES │
│ 75+ News Outlets Gmail Inbox Reddit/Web │
└────────────┬─────────────┬────────────┬──────────────┘
│ │ │
┌────────────▼─────────────▼────────────▼──────────────┐
│ INGESTION LAYER │
│ Scrapy Spider Gmail API Web Scraper │
│ URL Cache Deduplication Rate Limiting │
└────────────┬─────────────┬────────────┬──────────────┘
│ │ │
┌────────────▼─────────────▼────────────▼──────────────┐
│ PROCESSING LAYER │
│ Article Extraction Email Classification │
│ Entity Extraction Participant Intelligence │
│ Chunking Embedding Generation │
└────────────┬─────────────┬────────────┬──────────────┘
│ │ │
┌────────────▼─────────────▼────────────▼──────────────┐
│ DATA WAREHOUSE │
│ PostgreSQL 17 + pgvector │
│ 1.79M Articles 218K Emails 8M+ Chunks │
│ 453K Participants 10K Sender Profiles │
│ HNSW Vector Indexes │
└────────────┬─────────────┬────────────┬──────────────┘
│ │ │
┌──────────────────────┼─────────────┼────────────┼──────────────────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ RAG │ │ NYCNews │ │ Email │ │ Memory │ │ Publish │
│ Chatbot │ │ Scanner │ │ Assistant │ │ System │ │ Pipeline │
│ │ │ │ │ │ │ │ │ │
│ Semantic │ │ Multi-Agent │ │ Triage │ │ Adaptive │ │ WordPress │
│ Search │ │ Research │ │ Drafting │ │ Learning │ │ Integration │
│ Q&A │ │ Story │ │ Task │ │ Feedback │ │ SEO Gen │
│ │ │ Extraction │ │ Orchestr. │ │ Loop │ │ Image AI │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │ │
└──────────────────────┼─────────────┼────────────┼──────────────────────┘
│ │ │
┌────────────▼─────────────▼────────────▼──────────────┐
│ HUMAN INTERFACE │
│ CLI Commands Web Dashboard API Endpoints │
│ Human-in-the-Loop Review │
└──────────────────────────────────────────────────────┘
| Layer | Technologies | Rationale |
|---|---|---|
| Data Storage | PostgreSQL 17, pgvector, Redis | Unified SQL + vector search; HNSW indexes enable sub-100ms similarity queries across millions of chunks |
| Backend | Python 3.11, FastAPI, Celery | Async processing with background tasks for compute-intensive embedding generation |
| Scraping | Scrapy, BeautifulSoup, Readability | Intelligent heuristic spider with fallback extraction chains |
| AI/LLM | Claude (Anthropic), Gemini (Google), GPT-4 (OpenAI) | Multi-model strategy: Claude for generation quality, Gemini Flash for cost-efficient classification, GPT-4 for specialized tasks |
| Embeddings | SentenceTransformers (all-MiniLM-L6-v2) | 384-dimensional vectors; lightweight model runs locally without GPU |
| Browser Automation | Playwright | RSVP form filling, screenshot capture, web interaction |
| Workflow Engine | LangGraph | Multi-agent orchestration with state persistence |
| Frontend | React, Streamlit | Development speed with Streamlit; production UI with React |
| Infrastructure | macOS, Homebrew, External SSD | Commodity hardware sufficient for production workloads |
The data warehouse centers on PostgreSQL 17 with the pgvector extension, providing unified storage for both structured metadata and high-dimensional vector embeddings. This architecture eliminates the complexity of maintaining separate vector stores while enabling joins between semantic search results and relational data.
Primary Entity Clusters:
-- Article Storage
outlets (75 active)
└── articles (1.79M records)
└── article_content (full HTML/text with tsvector search)
└── article_chunks (4.56M chunks with 384-dim embeddings)
-- Email Storage
classified_emails (218K records, 60 columns)
├── email_chunks (3.58M chunks with embeddings)
├── email_participants (453K relationship records)
│ └── sender_profiles (10K with communication patterns)
├── email_attachments (Google Drive links, SHA-256 dedup)
└── email_events (ICS calendar extraction)
-- Entity Intelligence
entities_v2 (2,580+ entities)
└── entity_mentions_v2 (cross-corpus tracking)
└── entity_aliases_v2 (disambiguation)
Vector Indexing Strategy:
HNSW (Hierarchical Navigable Small World) indexes provide approximate nearest neighbor search with sub-100ms query times across millions of vectors:
CREATE INDEX idx_email_chunks_embedding_hnsw
ON email_chunks USING hnsw (embedding vector_cosine_ops);
-- Index size: 5.8 GB for 3.58M email chunks
-- Query time: ~27ms average (verified December 2025)
The choice of vector_cosine_ops reflects the semantic nature of the search: cosine similarity captures meaning alignment regardless of vector magnitude, which is appropriate for text embeddings where we care about semantic direction rather than intensity.
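The magnitude-invariance point can be made concrete with a small sketch in plain Python (no pgvector required): scaling a vector changes its Euclidean distance to a query but leaves its cosine similarity unchanged, which is why an operator class based on cosine distance suits text embeddings.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [0.1, 0.8, 0.3]
doc = [0.2, 1.6, 0.6]   # same direction as query, twice the magnitude

# Cosine similarity ignores magnitude: same direction => similarity 1.0
assert abs(cosine_similarity(query, doc) - 1.0) < 1e-9

# Euclidean distance does not: the scaled vector sits "far" from the query
assert euclidean(query, doc) > 0.5
```

In practice SentenceTransformers embeddings are compared the same way: two chunks phrased at different lengths but about the same topic point in a similar direction, and that direction is what the HNSW index searches over.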
The article scraping system achieves 95% automation through intelligent heuristics, requiring manual configuration only for exceptional sites.
Article Detection Cascade:
def triage_page(self, response):
    """Multi-stage article detection."""
    # Stage 1: OpenGraph metadata
    if response.css('meta[property="og:type"][content="article"]'):
        yield from self.parse_article(response)
        return
    # Stage 2: Semantic container analysis
    for selector in ['article', 'main', 'div[role="main"]']:
        container = response.css(selector)
        if container:
            text_content = "".join(container.css("::text").getall())
            html_content = container.get()
            # Text-to-markup ratio filter
            if len(text_content) / len(html_content) > 0.6:
                if len(text_content) > 250:
                    yield from self.parse_article(response)
                    return
Incremental Crawling:
The UrlCacheMiddleware eliminates redundant network requests by loading all existing URLs at spider startup:
class UrlCacheMiddleware:
    def spider_opened(self, spider):
        with engine.connect() as conn:
            urls = conn.execute(
                text("SELECT url FROM articles WHERE outlet_id = :id"),
                {"id": spider.outlet_id}
            ).fetchall()
        self.scraped_urls = {self.normalize_url(row[0]) for row in urls}

    def process_request(self, request, spider):
        if self.normalize_url(request.url) in self.scraped_urls:
            raise IgnoreRequest(f"Already scraped: {request.url}")
        return None
This reduces network traffic by approximately 90% on subsequent crawls, enabling hourly news checks without server overload.
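The `normalize_url` helper is referenced but not shown; the sketch below illustrates the kind of canonicalization such a cache needs (an assumption about the actual implementation, not a copy of it): strip tracking parameters, drop fragments and trailing slashes, and lowercase the host so cosmetically different URLs collapse to one cache entry.

```python
from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

# Common tracking parameters; the production list may differ
TRACKING_PARAMS = {'utm_source', 'utm_medium', 'utm_campaign',
                   'utm_term', 'utm_content', 'fbclid', 'gclid'}

def normalize_url(url: str) -> str:
    """Canonicalize a URL for duplicate detection (illustrative sketch)."""
    parts = urlparse(url)
    # Drop tracking parameters, keep the rest in a stable sorted order
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ))
    # Lowercase the host, strip the fragment and any trailing slash
    path = parts.path.rstrip('/') or '/'
    return urlunparse((parts.scheme, parts.netloc.lower(), path, '', query, ''))

a = normalize_url('https://Example.com/story/?utm_source=tw&utm_medium=social')
b = normalize_url('https://example.com/story')
assert a == b   # both collapse to the same cache key
```

Without this step, the same article shared via a newsletter link and a social link would be scraped twice and stored twice.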
Site Override System:
For sites that defeat heuristic detection, YAML configuration provides escape hatches:
hellgatenyc.com:
article_url_pattern: '/[a-z0-9-]+/$'
body_fallback_chain:
- "article.post-content"
- "div.entry-content"
- "div.article-body"
Email processing follows a seven-step pipeline orchestrated by update_emails_v2.sh:
Step 1: Gmail API extraction with service account delegation
Step 1.1: URL extraction from email content
Step 1.25: Deterministic triage (email/notify/no/spam)
Step 1.5: Attachment extraction → Google Drive with SHA-256 dedup
Step 2: Semantic chunking with 500-char segments
Step 3: Gemini Flash classification (16 categories)
Step 3.5: Structured event extraction (ICS parsing)
Step 4: Participant intelligence (role extraction, fingerprinting)
Step 5: Story extraction for journalism workflow
Step 6: LangChain agent processing
Step 7: Entity extraction (SpaCy NER)
Hybrid Classification System:
Deterministic rules handle predictable patterns before invoking LLM classification:
class DeterministicFilter:
    # Spam TLDs with >95% spam rate
    SPAM_TLDS = {'.xyz', '.top', '.gdn', '.click', '.loan'}
    # Known newsletter domains
    NEWSLETTER_DOMAINS = {'mailchimp.com', 'substack.com', 'constantcontact.com'}
    # Financial transaction patterns
    FINANCIAL_PATTERNS = [r'paypal.*receipt', r'venmo.*paid']

    def classify(self, email) -> Optional[str]:
        # Check spam indicators
        if any(email.sender.endswith(tld) for tld in self.SPAM_TLDS):
            return 'spam'
        # Check newsletter patterns
        sender_domain = email.sender.split('@')[-1]
        if sender_domain in self.NEWSLETTER_DOMAINS:
            return 'marketing_newsletter'
        # ... additional rules
        return None  # Fall through to LLM classification
This hybrid approach preserves LLM capacity for nuanced decisions while handling obvious cases instantly, achieving approximately 10x throughput improvement over LLM-only classification.
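The hand-off between the two stages can be sketched as a simple chain; the filter and LLM below are illustrative stubs (the real system uses the DeterministicFilter above and a Gemini Flash call), but the control flow is the point: a deterministic verdict short-circuits, and only unresolved emails spend API budget.

```python
class StubFilter:
    """Toy stand-in for DeterministicFilter: flags one obvious spam TLD."""
    def classify(self, email):
        return 'spam' if email['sender'].endswith('.xyz') else None

def classify_email(email, rules, llm_classify):
    """Deterministic rules first; only unresolved emails reach the LLM."""
    label = rules.classify(email)
    if label is not None:
        return label, 'deterministic'   # instant, zero API cost
    return llm_classify(email), 'llm'   # nuanced cases only

llm_calls = []
def fake_llm(email):
    llm_calls.append(email)             # track how often the LLM is invoked
    return 'community_event'

rules = StubFilter()
assert classify_email({'sender': 'promo@deals.xyz'}, rules, fake_llm) == ('spam', 'deterministic')
assert classify_email({'sender': 'maria@nyc.gov'}, rules, fake_llm) == ('community_event', 'llm')
assert len(llm_calls) == 1              # the obvious spam never touched the LLM
```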
Deduplication:
Content fingerprinting using SHA-256 prevents duplicate storage:
def generate_fingerprint(self, email):
    # Normalize content
    content = self.normalize_tracking_params(email.body_text)
    content = self.normalize_whitespace(content)
    # Generate hash
    return hashlib.sha256(content.encode('utf-8')).hexdigest()
Referential Integrity:
CASCADE deletes maintain consistency:
ALTER TABLE article_chunks
ADD CONSTRAINT fk_article
FOREIGN KEY (article_id) REFERENCES articles(id) ON DELETE CASCADE;
Encoding Safety:
UTF-8 sanitization prevents PostgreSQL errors from null bytes:
clean_text = text.encode('utf-8', 'ignore').decode('utf-8')
clean_text = clean_text.replace('\x00', '') # Remove null bytes
The NYCNewsScanner employs a multi-agent architecture using the Claude Agent SDK to parallelize news research across diverse source types.
┌─────────────────────┐
│ News Coordinator │
│ (Orchestrator) │
└──────────┬──────────┘
┌───────────────────────┼───────────────────────┐
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ DB Researcher │ │ Email Researcher │ │ Web Researcher │
│ (7 outlets) │ │ (Press/News) │ │ (Reddit/Web) │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │
└───────────────────────┴───────────────────────┘
▼
┌──────────────────────┐
│ Story Extractor │
│ (5W Analysis) │
└──────────┬───────────┘
▼
┌──────────────────────┐
│ Memory Integration │←─── Quality Gate
│ (4-Level Scope) │←─── Feedback Parser
└──────────┬───────────┘
▼
┌──────────────────────┐
│ Pitch Generation │
│ + Human Review │
└──────────────────────┘
Source Coverage:
Each extracted story receives a newsworthiness score based on multiple factors:
def calculate_newsworthiness(self, story, entities, email_context):
    score = 0.0
    # Geographic proximity to Bushwick
    bushwick_entities = ['Bushwick', 'Community Board 4', 'BK90']
    for entity in entities:
        if entity in bushwick_entities:
            score += 0.3
    # Known community figures
    if self.entity_db.is_known_figure(story.who):
        score += 0.2
    # Temporal urgency
    if story.when and story.when < datetime.now() + timedelta(days=7):
        score += 0.2
    # Source credibility
    if email_context.sender_profile.is_government:
        score += 0.15
    elif email_context.sender_profile.is_journalist:
        score += 0.1
    return min(score, 1.0)
Stories scoring below threshold are filtered before human review, focusing editorial attention on genuinely newsworthy content.
The memory system implements a four-level scope hierarchy based on research into production memory architectures (Mem0, Letta/MemGPT, academic literature):
Scope Levels:
Memory Storage:
CREATE TABLE am_memories (
id SERIAL PRIMARY KEY,
scope_type VARCHAR CHECK (scope_type IN ('global', 'org', 'project', 'sender')),
scope_id VARCHAR, -- NULL for global, email for sender, org_id for org
content TEXT NOT NULL,
embedding VECTOR(384),
memory_type VARCHAR, -- 'preference', 'fact', 'procedure', 'relationship'
confidence FLOAT DEFAULT 1.0,
times_applied INTEGER DEFAULT 0,
times_led_to_acceptance INTEGER DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW(),
last_accessed TIMESTAMPTZ
);
Memory Retrieval:
Hybrid search using Reciprocal Rank Fusion combines vector similarity with full-text matching:
CREATE FUNCTION search_memories(query_text TEXT, query_embedding VECTOR, scope_filter JSONB)
RETURNS TABLE (memory_id INT, content TEXT, score FLOAT) AS $$
-- Vector similarity component
WITH vector_results AS (
    SELECT id, content,
           ROW_NUMBER() OVER (ORDER BY embedding <=> query_embedding) AS vrank
    FROM am_memories
    WHERE scope_type IN (SELECT jsonb_array_elements_text(scope_filter->'scopes'))
),
-- Full-text component
text_results AS (
    SELECT id, content,
           ROW_NUMBER() OVER (
               ORDER BY ts_rank(search_vector, plainto_tsquery(query_text)) DESC
           ) AS trank
    FROM am_memories
    WHERE search_vector @@ plainto_tsquery(query_text)
)
-- RRF combination: reciprocal ranks with k=60; a 1000 sentinel discounts missing ranks
SELECT COALESCE(v.id, t.id),
       COALESCE(v.content, t.content),
       (1.0 / (60 + COALESCE(v.vrank, 1000)) + 1.0 / (60 + COALESCE(t.trank, 1000))) AS score
FROM vector_results v
FULL OUTER JOIN text_results t ON v.id = t.id
ORDER BY score DESC;
$$ LANGUAGE sql;
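The RRF arithmetic is easier to verify in plain Python. With the standard k=60 constant, an item ranked first by both retrievers scores 2/61, while an item found by only one list gets a sentinel rank (1000 here) for the missing side, heavily discounting it; agreement across retrievers therefore outranks a single strong signal.

```python
def rrf_score(vector_rank=None, text_rank=None, k=60, missing=1000):
    """Reciprocal Rank Fusion of two rankings (same formula as the SQL)."""
    v = vector_rank if vector_rank is not None else missing
    t = text_rank if text_rank is not None else missing
    return 1.0 / (k + v) + 1.0 / (k + t)

# Top-ranked in both lists beats top-ranked in just one
both_first = rrf_score(vector_rank=1, text_rank=1)     # 2/61 ≈ 0.0328
vector_only = rrf_score(vector_rank=1, text_rank=None) # ≈ 0.0173
assert both_first > vector_only

# Even ranked 3rd in both lists still beats 1st in only one
assert rrf_score(3, 3) > rrf_score(1, None)
```

This is why hybrid retrieval helps memory lookup: a memory that is both semantically close and contains the query's literal keywords rises above memories matching on only one axis.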
New memories form through the AUDN (Add/Update/Delete/Noop) cycle in reconciler.py:
1. FeedbackParser classifies edit type (tone, factual, intent, structural)
2. QualityGate filters noise (minimum 10% magnitude, 60% confidence, 5-minute debounce)

class QualityGate:
    def should_process(self, signal: FeedbackSignal) -> bool:
        # Minimum edit magnitude
        if signal.edit_ratio < 0.10:
            return False
        # Confidence threshold
        if signal.confidence < 0.60:
            return False
        # Debounce rapid edits
        if signal.seconds_since_last < 300:
            return False
        return True
When generating content or responses, the system retrieves relevant memories using scope-based boosting:
def retrieve_memories(self, context):
    base_memories = self.search_memories(context.query)
    # Apply scope boosting
    for memory in base_memories:
        if memory.scope_type == 'sender' and memory.scope_id == context.sender:
            memory.score *= 1.3  # Sender memories most relevant
        elif memory.scope_type == 'org' and memory.scope_id == context.org:
            memory.score *= 1.2
        elif memory.scope_type == 'project' and memory.scope_id in context.projects:
            memory.score *= 1.1
        # Global memories use base score
    return sorted(base_memories, key=lambda m: m.score, reverse=True)[:10]
The effectiveness tracker implements decay for underperforming memories:
def compute_effectiveness(self, memory):
    if memory.times_applied == 0:
        return 0.5  # Neutral default
    acceptance_rate = memory.times_led_to_acceptance / memory.times_applied
    edit_ratio = memory.total_edit_distance / memory.total_output_length
    # Weighted combination
    return 0.6 * acceptance_rate + 0.4 * (1 - edit_ratio)
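As a worked example, restating the formula as a standalone function: a memory applied 10 times, accepted 8 times, with edits touching 20% of generated output scores 0.6 × 0.8 + 0.4 × 0.8 = 0.80 and survives, while one accepted twice in 10 applications with 50% edit churn scores 0.6 × 0.2 + 0.4 × 0.5 = 0.32 and becomes a pruning candidate.

```python
def effectiveness(times_applied, times_accepted, edit_distance, output_length):
    """Standalone restatement of the effectiveness formula, for illustration."""
    if times_applied == 0:
        return 0.5  # neutral default for memories never yet applied
    acceptance_rate = times_accepted / times_applied
    edit_ratio = edit_distance / output_length
    return 0.6 * acceptance_rate + 0.4 * (1 - edit_ratio)

good = effectiveness(times_applied=10, times_accepted=8,
                     edit_distance=200, output_length=1000)
weak = effectiveness(times_applied=10, times_accepted=2,
                     edit_distance=500, output_length=1000)

assert abs(good - 0.80) < 1e-9   # 0.6*0.8 + 0.4*0.8
assert abs(weak - 0.32) < 1e-9   # 0.6*0.2 + 0.4*0.5
assert effectiveness(0, 0, 0, 1) == 0.5
```

The 0.6/0.4 weighting means a memory that keeps getting applied but always needs heavy rewriting decays even if drafts are nominally accepted.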
Memories with effectiveness below threshold become candidates for automatic pruning via the ea memory prune CLI command.
The reflection system was redesigned after research revealed that "reflection without external feedback degrades performance." The current architecture uses deterministic parsing BEFORE any LLM involvement:
The Anti-Pattern (Avoided):
# DON'T: Send raw edits directly to LLM for reflection
def learn_from_edit(original, edited):
    prompt = f"What preference does this edit reveal?\nOriginal: {original}\nEdited: {edited}"
    return llm.complete(prompt)  # Degrades over time
The Implemented Pattern:
# DO: Deterministic parsing, quality gate, then targeted LLM extraction
def learn_from_edit(original, edited):
    # Step 1: Deterministic classification
    diff = difflib.SequenceMatcher(a=original, b=edited)
    edit_type = classify_edit_type(diff)  # tone, factual, intent, etc.
    # Step 2: Quality gate
    if not quality_gate.should_process(edit_type, diff.ratio()):
        return None
    # Step 3: Targeted LLM extraction (only for qualified signals)
    return extract_learnable_preference(original, edited, edit_type)
The quality gate prevents several documented failure modes:

- Noise pollution: Typo fixes don't become "preferences"
- Debounce protection: Rapid-fire edits don't create duplicate memories
- Magnitude filtering: Only edits changing 10%+ of content pass
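The magnitude filter can be checked with difflib directly: `SequenceMatcher.ratio()` measures similarity, so 1 − ratio serves as edit magnitude. In this sketch (sample strings are invented), a punctuation tweak falls well under the 10% floor while a substantive rewrite clears it.

```python
import difflib

def edit_magnitude(original: str, edited: str) -> float:
    """Fraction of content changed, per difflib's similarity ratio."""
    return 1.0 - difflib.SequenceMatcher(a=original, b=edited).ratio()

typo_fix = edit_magnitude(
    "Thanks for reaching out about the garden story.",
    "Thanks for reaching out about the garden story!",
)
rewrite = edit_magnitude(
    "I can meet Tuesday.",
    "Unfortunately I'm fully booked this week. Could we do next Monday instead?",
)

assert typo_fix < 0.10   # noise: filtered, never becomes a "preference"
assert rewrite > 0.10    # signal: passes the gate for preference extraction
```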
The pitch generation system produces structured outputs for editorial review:
from dataclasses import dataclass
from typing import List

@dataclass
class EditorPitch:
    pitch_id: str
    headline: str
    key_facts: List[str]
    source_attribution: str
    source_email_id: int
    newsworthiness_score: float
    suggested_angle: str
    related_articles: List[int]  # IDs of existing coverage
The HITL review interface displays pitches with full context:
┌─────────────────────────────────────────────────────────────┐
│ PITCH: City Announces New Affordable Housing Lottery │
├─────────────────────────────────────────────────────────────┤
│ Source: HPD Press Release (maria.torres@nyc.gov) │
│ Score: 0.87 (HIGH) │
│ │
│ KEY FACTS: │
│ • 150 units at 1234 Bushwick Ave │
│ • Income bands: 40%, 60%, 80% AMI │
│ • Applications open March 20 │
│ │
│ RELATED COVERAGE: 3 prior articles on this development │
│ │
│ [a]pprove [r]eject [e]dit [n]ext [q]uit │
└─────────────────────────────────────────────────────────────┘
The Email Assistant implements a complete email processing and response generation system:
Gmail API → Triage → HITL Review → Orchestrator → Skills → Draft Review → Send
↓
Deterministic Pre-orchestrator Task Skill Human
Filter checkpoint Extraction Dispatch Approval
Triage System:
Four-tier classification replaces the initial binary (YES/NO) approach:
| Category | Description | Action |
|---|---|---|
| `email` | Requires response from publisher | → Orchestrator |
| `notify` | Awareness-only (press releases, FYI) | → Notification |
| `no` | Not relevant to newsroom | → Archive |
| `spam` | Blocked, flagged for filter learning | → Spam folder |
Pre-Orchestrator HITL:
A filtering checkpoint before expensive LLM calls:
$ ea hitl review --limit 50
Thread: Re: Interview Request - Community Garden
From: reporter@brooklyneagle.com
Triage: EMAIL (needs response)
Preview: Hi Alec, I'm working on a story about...
Actions: [x]spam [m]mute [o]rchestrator [n]ext [q]uit
Each spam email caught here saves 3-5 LLM calls (task extraction, classification, skill dispatch, draft generation).
Draft responses leverage multiple context sources:
Sender Profile:
profile = sender_profiles.get(sender_email)
# Returns: organization, role, is_journalist, is_government,
# communication_history, preferred_tone, last_interaction
Thread Context:
thread = email_threads.get_full_thread(thread_id)
# Returns: all messages in thread, participants, subject evolution
Adaptive Memories:
memories = memory_store.retrieve_for_context(
    sender=sender_email,
    org=profile.organization,
    projects=active_projects,
    query=email_subject
)
# Returns: ranked preferences, facts, procedures relevant to this email
Few-Shot Examples:
examples = few_shot_retriever.get_similar_responses(
    email_type=classification,
    sender_type=profile.classification,
    limit=3
)
# Returns: successful past responses to similar emails
The draft generation prompt assembles these contexts:
prompt = f"""
You are drafting a response for Alec Meeker, publisher of Bushwick Daily.
SENDER CONTEXT:
{sender_profile_summary}
THREAD CONTEXT:
{thread_history}
RELEVANT MEMORIES:
{format_memories(memories)}
SIMILAR PAST RESPONSES:
{format_examples(examples)}
WRITING GUIDELINES:
- Warm but direct tone
- Business development orientation
- NO em-dashes, NO "I hope this finds you well"
- Be specific about next steps
EMAIL TO RESPOND TO:
{current_email}
Draft a response:
"""
Draft Review Interface:
┌─────────────────────────────────────────────────────────────┐
│ DRAFT RESPONSE │
├─────────────────────────────────────────────────────────────┤
│ To: maria.torres@nyc.gov │
│ Subject: Re: Press Credential Application │
│ │
│ Hi Maria, │
│ │
│ Thanks for following up on the credential application. │
│ I've attached our updated circulation numbers and added │
│ the publication schedule you requested. │
│ │
│ Let me know if you need anything else for the review. │
│ │
│ Best, │
│ Alec │
│ │
│ [s]end [e]dit [r]evise [a]nswer [d]iscard [n]ext [q]uit │
└─────────────────────────────────────────────────────────────┘
Action Options:
| Action | Effect | Learning |
|---|---|---|
| Send | Send as-is | Positive signal: draft was perfect |
| Edit | Open in editor, send after | Diff triggers memory extraction |
| Revise | Regenerate with notes | Notes inform next attempt |
| Answer | Continue multi-turn | Clarification needed |
| Discard | Reject draft | Strong negative signal |
Two-Phase HITL for Automation:
For actions like form filling, the system implements two-phase approval:
Phase 1: Analyze form → Show proposed values → Await approval
Phase 2: Fill form → Capture screenshot → Confirm submission
This prevents automated form submission with incorrect data while reducing manual data entry.
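A minimal sketch of the two-phase gate (class and state names here are illustrative, not the production schema): execution is only reachable from an explicitly approved proposal, so nothing can be submitted on analysis alone.

```python
class TwoPhaseAction:
    """Two-phase HITL gate: propose -> human approval -> execute."""
    def __init__(self, proposed_values: dict):
        self.proposed_values = proposed_values
        self.state = 'proposed'     # Phase 1 output awaits human review

    def approve(self):
        if self.state != 'proposed':
            raise RuntimeError(f"cannot approve from state {self.state}")
        self.state = 'approved'

    def execute(self, submit):
        # Phase 2 refuses to run without an explicit human approval
        if self.state != 'approved':
            raise RuntimeError("refusing to execute unapproved action")
        submit(self.proposed_values)
        self.state = 'executed'

submitted = []
action = TwoPhaseAction({'name': 'Alec Meeker', 'outlet': 'Bushwick Daily'})

try:
    action.execute(submitted.append)    # skipping approval must fail
except RuntimeError:
    pass
assert submitted == []                  # nothing was submitted

action.approve()
action.execute(submitted.append)
assert action.state == 'executed' and len(submitted) == 1
```

Making the guard structural, rather than relying on the caller to remember a review step, is what prevents the test-data-into-a-real-form class of accident.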
Components communicate through shared PostgreSQL tables and well-defined interfaces:
Shared Data Structures:
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

# All components use a consistent email representation
# (SenderProfile and Attachment are defined elsewhere in the codebase)
@dataclass
class EmailContext:
    email_id: int
    thread_id: str
    sender_email: str
    sender_profile: Optional[SenderProfile]
    subject: str
    body_text: str
    body_html: str
    received_at: datetime
    classification: str
    participant_roles: Dict[str, str]
    attachments: List[Attachment]
Event-Driven Updates:
Content lineage tracking creates audit trails across all operations:
CREATE TABLE content_lineage_events (
id SERIAL PRIMARY KEY,
content_type VARCHAR, -- email, story, pitch, article, queue_item
content_id INTEGER,
event_type VARCHAR, -- extracted, approved, rejected, published
actor_type VARCHAR, -- system, human
actor_id VARCHAR,
parent_content_type VARCHAR,
parent_content_id INTEGER,
metadata JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
);
Complex workflow state persists to PostgreSQL rather than memory:
Session Persistence:
class SessionStore:
    def save_session(self, session_id, state):
        # Write-through: every update persists immediately
        with self.conn.cursor() as cur:
            cur.execute("""
                INSERT INTO nycnews_agent_sessions (session_id, state, updated_at)
                VALUES (%s, %s, NOW())
                ON CONFLICT (session_id) DO UPDATE SET state = %s, updated_at = NOW()
            """, (session_id, json.dumps(state), json.dumps(state)))
        self.conn.commit()

    def load_session(self, session_id):
        # Read-through: database fallback when not in memory
        with self.conn.cursor() as cur:
            cur.execute(
                "SELECT state FROM nycnews_agent_sessions WHERE session_id = %s",
                (session_id,)
            )
            result = cur.fetchone()
        return json.loads(result[0]) if result else None
Bidirectional Task-Draft Linking:
-- Draft knows its generating task
ALTER TABLE ea_email_drafts ADD COLUMN source_task_id INTEGER REFERENCES ea_tasks(id);
-- Task knows its output draft
ALTER TABLE ea_tasks ADD COLUMN generated_draft_id INTEGER REFERENCES ea_email_drafts(id);
This enables atomic state synchronization:
def sync_task_draft_status(task_id, draft_id, action):
    """Atomic status update for linked task and draft."""
    with conn.cursor() as cur:
        if action == 'send':
            cur.execute("""
                UPDATE ea_tasks SET status = 'completed', completed_at = NOW()
                WHERE id = %s;
                UPDATE ea_email_drafts SET status = 'sent', sent_at = NOW()
                WHERE id = %s;
            """, (task_id, draft_id))
        elif action == 'reject':
            cur.execute("""
                UPDATE ea_tasks SET status = 'cancelled'
                WHERE id = %s;
                UPDATE ea_email_drafts SET status = 'rejected'
                WHERE id = %s;
            """, (task_id, draft_id))
    conn.commit()
Graceful Degradation:
def process_email_with_fallbacks(email):
    try:
        # Primary path: full context enrichment
        profile = get_sender_profile(email.sender)
        memories = get_relevant_memories(email, profile)
        examples = get_few_shot_examples(email)
    except ProfileServiceError:
        # Fallback: basic context only
        profile = None
        memories = []
        examples = get_generic_examples(email.classification)
    # Always proceed with available context
    return generate_draft(email, profile, memories, examples)
Retry Logic:
@celery.task(bind=True, max_retries=3, default_retry_delay=60)
def generate_embeddings(self, article_id):
    try:
        # Embedding generation
        chunks = chunk_article(article_id)
        embeddings = model.encode(chunks)
        store_embeddings(article_id, embeddings)
    except TransientError as e:
        # Retry transient failures with exponential backoff: 60s, 120s, 240s
        raise self.retry(exc=e, countdown=60 * (2 ** self.request.retries))
    except PermanentError as e:
        # Log and skip for permanent failures
        log_problematic_article(article_id, str(e))
        return False
Recovery Scripts:
When issues occur at scale, recovery scripts identify and fix discrepancies:
def recover_missing_embeddings():
    """Find and regenerate embeddings that failed to persist."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT ce.id FROM classified_emails ce
            LEFT JOIN email_chunks ec ON ce.id = ec.email_id
            WHERE ec.email_id IS NULL
              AND ce.processed_at IS NOT NULL
        """)
        missing = cur.fetchall()
    for (email_id,) in missing:  # unpack single-column rows
        regenerate_embeddings.delay(email_id)
This system was built iteratively over nine months, with the author applying analytical frameworks from investment analysis and institutional finance to system architecture decisions. The commit history reveals the development journey:
Week 1 (July 2025): Foundation and First Failures
The initial commit established an ambitious architecture: Scrapy spider, PostgreSQL with pgvector, Celery background tasks. Within 48 hours, production reality intervened:
Commit b6cce11: "lots of updates" - Connection pool exhaustion and protocol corruption under real-world load
Commit a98e976: "pooling set to 1" - Defensive retreat, implementing log_problematic_article() for graceful failure handling
The solution emerged in commit 2b52729:
"celery workers are stable and sql alchemy has been used to implement"
Lesson: Production stability often requires abandoning clever solutions for boring, proven ones. SQLAlchemy's connection pooling solved problems that took days to debug with manual psycopg3 management.
Month 2 (August 2025): Email Intelligence Integration
Adding email processing to the RAG system revealed deeper architectural challenges:
Commit 2b1155a: "email embedding system is now working with real data and is merged with Rag"
This required HNSW index rebuilding (6+ hours) and careful handling of the embedding dimension matching between article and email chunks.
Key Insight: Unifying data models early (same embedding dimension, same chunk size) enables system-wide search without constant adaptation layers.
Month 4 (October 2025): The Entity System
Building entity extraction revealed the complexity of real-world names:
Commit 382acca: "entity extraction system created"
The three-hash architecture (SHA-256 for exact match, MD5-16 for fast lookup, Double Metaphone for fuzzy matching) emerged from production requirements: "Johnn Smith" needs to match "John Smith" without false positives.
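The fuzzy tier can be illustrated with a much-simplified phonetic key; this is a crude stand-in for Double Metaphone (which the production system would use via a dedicated library), but it shows the principle: collapse repeated letters and drop interior vowels so common misspellings converge on one key.

```python
import re

def simple_phonetic_key(name: str) -> str:
    """Crude stand-in for Double Metaphone: first letter plus
    deduplicated interior consonants, per word."""
    keys = []
    for word in name.lower().split():
        head, rest = word[0], word[1:]
        rest = re.sub(r'[aeiou]', '', rest)    # drop interior vowels
        rest = re.sub(r'(.)\1+', r'\1', rest)  # collapse repeated letters
        keys.append(head + rest)
    return ' '.join(keys)

# A doubled letter and the canonical spelling land on the same key...
assert simple_phonetic_key("Johnn Smith") == simple_phonetic_key("John Smith")
# ...while a genuinely different name does not
assert simple_phonetic_key("Jane Smith") != simple_phonetic_key("John Smith")
```

Layering this behind exact (SHA-256) and fast (MD5-16) hashes means fuzzy comparison only runs on the small candidate set that the cheaper hashes fail to resolve.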
Month 6 (December 2025): Memory System Pivot
The initial reflection system sent raw edit diffs to Claude for preference extraction. Research revealed this approach degrades over time:
From research notes: "Reflection without external feedback degrades performance. When you see a reflection system that works, you're almost always looking at a verification system in disguise."
The redesign implemented deterministic parsing BEFORE LLM involvement, with quality gates preventing noise accumulation.
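The two stages can be sketched as follows: a deterministic word-level diff extracts edit signals with no LLM involved, and a quality gate drops trivial edits before anything reaches the memory store. The signal structure and thresholds here are illustrative, not the production schema:

```python
import difflib

def parse_edit_signals(draft: str, final: str) -> list[dict]:
    """Deterministically diff the AI draft against the human-edited
    final text. No LLM is involved at this stage."""
    d_words, f_words = draft.split(), final.split()
    matcher = difflib.SequenceMatcher(None, d_words, f_words)
    signals = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        signals.append({
            "op": op,
            "removed": " ".join(d_words[i1:i2]),
            "added": " ".join(f_words[j1:j2]),
        })
    return signals

def quality_gate(signals: list[dict], min_chars: int = 3,
                 max_signals: int = 10) -> list[dict]:
    """Filter noise (tiny or excessive edits) before it can
    influence learning. Thresholds are illustrative."""
    kept = [s for s in signals
            if len(s["added"]) >= min_chars or len(s["removed"]) >= min_chars]
    return kept[:max_signals]
```

Only signals that survive the gate would be summarized by the LLM into candidate memories, keeping the reflection loop anchored to verified edits.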
Lesson: Academic research is worth reading. The Mem0 and MemGPT papers predicted exactly the failure mode we encountered.
Features evolved based on real editorial needs rather than theoretical requirements:
Email Triage Evolution:
Draft Generation Evolution:
Each iteration addressed a specific user complaint rather than anticipated requirements.
1. Incremental Processing is Essential
Systems processing 200K+ items need incremental operation. The URL cache middleware (articles) and attachments_extracted flag (emails) enable processing only new content.
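The URL cache idea reduces to a seen-set consulted before any fetch. A simplified sketch of the middleware's core logic (in production the seen-set would live in a durable store rather than process memory, and the normalization rules would be richer):

```python
class URLCache:
    """Fetch a URL only if its normalized form has not been seen before."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    @staticmethod
    def normalize(url: str) -> str:
        # Strip fragments and trailing slashes so trivially
        # different URLs dedupe to the same key.
        return url.split("#", 1)[0].rstrip("/")

    def should_fetch(self, url: str) -> bool:
        key = self.normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

The same pattern, applied per-email via the `attachments_extracted` flag, is what keeps 200K+ item pipelines incremental instead of full-rescan.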
2. Cost Tracking Enables Optimization
# Token usage tracking revealed opportunities
# Initial: ~10 seconds per email (subprocess per email)
# After single-session batch: ~6 seconds per email
# Savings: 100+ hours on 182K email backlog
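The tracking itself can be as small as a per-task ledger accumulating token counts against per-token rates. A minimal sketch (the rates below are illustrative placeholders, not actual API pricing):

```python
from dataclasses import dataclass

@dataclass
class TokenLedger:
    """Accumulates token usage per task. Rates are example values."""
    input_rate: float = 3.0 / 1_000_000    # $/input token (placeholder)
    output_rate: float = 15.0 / 1_000_000  # $/output token (placeholder)
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.input_tokens += prompt_tokens
        self.output_tokens += completion_tokens

    @property
    def cost_usd(self) -> float:
        return (self.input_tokens * self.input_rate
                + self.output_tokens * self.output_rate)
```

Totals like these are what made the subprocess-per-email overhead visible and motivated the single-session batch rewrite.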
3. Database Joins Beat Application Logic
The email classification bug that returned 0 results for newsletter queries was a JOIN issue (classified_emails.classification was always 'unclassified'; the real data was in email_pipeline_routes.pipeline_type). Fixing the JOIN unlocked 337 emails instantly.
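The bug can be reproduced in miniature. The schema below is simplified to the two columns at issue, with SQLite standing in for PostgreSQL: filtering on `classified_emails.classification` returns nothing because that column is always `'unclassified'`, while joining through `email_pipeline_routes.pipeline_type` finds the rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE classified_emails (id INTEGER PRIMARY KEY, classification TEXT);
CREATE TABLE email_pipeline_routes (email_id INTEGER, pipeline_type TEXT);
INSERT INTO classified_emails VALUES (1, 'unclassified'), (2, 'unclassified');
INSERT INTO email_pipeline_routes VALUES (1, 'newsletter'), (2, 'press_release');
""")

# Broken query: filters on the column that is always 'unclassified'.
broken = conn.execute(
    "SELECT id FROM classified_emails WHERE classification = 'newsletter'"
).fetchall()

# Fixed query: JOIN through the routing table where the real label lives.
fixed = conn.execute("""
    SELECT ce.id
    FROM classified_emails ce
    JOIN email_pipeline_routes r ON r.email_id = ce.id
    WHERE r.pipeline_type = 'newsletter'
""").fetchall()
```

Pushing the label lookup into the JOIN also means the database does the filtering, rather than application code re-deriving classifications it already stored.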
4. Human-in-the-Loop Reduces Risk
Every automated action we considered implementing eventually got a HITL checkpoint. The two-phase form filling pattern emerged from a near-miss where test data would have been submitted to a real form.
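The two-phase pattern can be sketched as: phase 1 renders a preview of exactly what would be submitted, and phase 2 (the real submission, via Playwright in production) runs only after explicit human approval. The `approve` callable here is hypothetical; in production it is an interactive review prompt:

```python
def submit_form_two_phase(fields: dict, approve) -> str:
    """Two-phase HITL form submission (sketch).

    `approve` is any callable taking the preview text and returning
    True/False (hypothetical stand-in for an interactive prompt).
    """
    # Phase 1: show the human exactly what would be submitted.
    preview = "\n".join(f"{name}: {value}" for name, value in fields.items())
    if not approve(preview):
        return "aborted"   # nothing has touched the real form
    # Phase 2: only now perform the actual browser submission.
    return "submitted"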
5. Checkpoint Commits Enable Recovery
The commit messages "pre changes before X" and "pre-refactor" represent deliberate checkpoints before risky changes. When refactors go wrong, these enable clean rollback.
Worked Well:
Didn't Work:
Still Evolving:
Quantitative Metrics:
| Metric | Before | After | Improvement |
|---|---|---|---|
| News source monitoring | 2-3 hours/day | 15 minutes/day | 90% reduction |
| Email triage time | 1-2 hours/day | 20 minutes/day | 80% reduction |
| Article research time | 30-60 min/story | 5-10 min/story | 80% reduction |
| Response drafting | 10-15 min/email | 2-3 min/email | 80% reduction |
| Event calendar updates | Manual entry | Semi-automated | Hours saved weekly |
System Scale:
The system enables capabilities previously unavailable to a small newsroom:
Comprehensive Competitive Intelligence:
Institutional Memory:
Source Relationship Management:
Automated Event Tracking:
For local news sustainability, the implications are significant:
Cost Structure:
Scalability:
Replicability:
The use of AI in journalism raises legitimate concerns that this system addresses through architectural choices:
Human Judgment Preserved:
Accuracy Safeguards:
Question tool lets the model ask for information rather than guess
Voice Authenticity:
Transparency:
Email Privacy:
Source Protection:
AI Assistance Disclosure:
Short-term (2026 Q2):
Medium-term (2026 H2):
Long-term:
The architecture supports scaling in several dimensions:
Horizontal Scaling:
Multi-Newsroom:
Data Growth:
Technical:
Editorial:
Sustainability:
This white paper documents nine months of building AI infrastructure for a hyperlocal newsroom. The resulting system demonstrates that sophisticated AI capabilities are achievable for small news operations, potentially transforming the economics of local journalism.
The technical achievement is significant: 255 GB of structured news and email data, 8 million searchable semantic chunks, sub-100ms query performance, and integrated AI agents that monitor, extract, draft, and orchestrate across diverse information sources.
But the more important achievement is the design philosophy embedded in the architecture: human-in-the-loop by default, editorial integrity as a requirement, and practical utility over impressive capability. AI serves as a force multiplier, not a replacement. The system proposes; humans decide.
For local journalism, the implications extend beyond individual newsroom efficiency. If AI can enable one publisher to maintain coverage depth previously requiring dedicated staff, similar systems could revive local news coverage in communities currently without any journalism presence. The technology is replicable, the costs are manageable, and the architecture is documented.
The local news crisis is real and deepening. This system represents one response: building tools that make small newsrooms more capable rather than waiting for business models that may never materialize. AI infrastructure won't solve every problem facing local journalism, but it can address the operational bottlenecks that consume editorial attention and prevent the deep community engagement that local news requires.
The code exists. The documentation exists. The path forward is clearer than it was nine months ago.
Core Technologies:
| Category | Technology | Version | Purpose |
|---|---|---|---|
| Database | PostgreSQL | 17 | Primary data store |
| Vector Search | pgvector | 0.7.0 | Embedding storage and similarity search |
| Cache | Redis | 7.x | Task queue broker, session cache |
| Backend | Python | 3.11 | Primary language |
| Web Framework | FastAPI | 0.128.0 | API endpoints |
| Task Queue | Celery | 5.x | Background processing |
| Scraping | Scrapy | 2.x | Web scraping framework |
| Browser Automation | Playwright | 1.x | Form filling, screenshots |
| Workflow | LangGraph | 0.x | Multi-agent orchestration |
| ML/NLP | SpaCy | 3.x | Named entity recognition |
| Embeddings | SentenceTransformers | 2.x | all-MiniLM-L6-v2 model |
LLM APIs:
| Provider | Model | Use Case |
|---|---|---|
| Anthropic | Claude Sonnet 4 | Draft generation, content analysis |
| Anthropic | Claude Haiku | Image analysis, fast classification |
| Google | Gemini 2.0 Flash Lite | Email classification (cost-efficient) |
| OpenAI | GPT-4 | Specialized tasks, comparison |
Infrastructure:
| Component | Specification |
|---|---|
| Hardware | Mac Mini M2 / MacBook Pro |
| Storage | Samsung T7 SSD (1TB) for PostgreSQL data |
| OS | macOS 14.x |
| Package Manager | Homebrew |
| Python Environment | venv |
Master-Scrape-Rag-Pipe/
├── news_scraper_project/ # Article ingestion
│ ├── spiders/article_spider.py # Intelligent heuristic spider
│ ├── pipelines.py # PostgreSQL atomic writes
│ ├── tasks.py # Celery embedding tasks
│ └── middlewares.py # URL cache middleware
│
├── email_assistant/ # Email processing CLI
│ ├── cli.py # Main entry point
│ ├── orchestration/ # Task extraction & routing
│ │ ├── orchestrator.py # Central dispatcher
│ │ ├── task_extractor.py # LLM-based extraction
│ │ └── task_classifier.py # Capability-aware matching
│ ├── drafting/ # Response generation
│ │ ├── draft_response.py # Context-enriched drafting
│ │ └── context_enrichment.py # RAG context injection
│ ├── memory/ # Adaptive learning
│ │ ├── store.py # Memory persistence
│ │ ├── feedback_parser.py # Deterministic parsing
│ │ └── quality_gate.py # Signal validation
│ ├── skills/ # Automation capabilities
│ │ ├── rsvp_skill.py # Form filling
│ │ ├── calendar_skill.py # Event management
│ │ └── smart_form_skill.py # Intelligent form analysis
│ └── rsvp/ # Browser automation
│ └── playwright_controller.py
│
├── rag_chatbot/ # RAG interface
│ ├── app.py # FastAPI application
│ ├── ai_generator.py # Claude API integration
│ ├── vector_store_pg.py # pgvector search
│ └── NYCNewsAgent/ # Research agent system
│ ├── research_agent/ # Multi-agent research
│ │ ├── agent.py # Claude Agent SDK
│ │ └── email_story_extractor.py
│ ├── publish_pipeline/ # WordPress integration
│ │ ├── metadata_extractor.py
│ │ ├── seo_generator.py
│ │ └── image_analyzer.py
│ └── api/ # Workflow APIs
│ ├── workflow_api.py
│ └── lineage/ # Content traceability
│
├── entity_extraction/ # Entity intelligence
│ ├── spacy_entity_pipeline.py # NER pipeline
│ └── entity_disambiguation.py # Alias management
│
├── attachment_extraction/ # Email attachments
│ ├── drive_uploader.py # Google Drive streaming
│ └── calendar_parser.py # ICS parsing
│
├── event_extraction/ # Calendar events
│ └── extract_events_from_emails.py
│
├── migrations/ # Database schemas
│ └── 001-068_*.sql
│
├── update_emails_v2.sh # Email pipeline orchestrator
├── run_crawls.py # Article scraping orchestrator
└── CLAUDE.md # AI assistant instructions
Development Phases:
| Phase | Period | Focus |
|---|---|---|
| Foundation | July 2025 | Scrapy, PostgreSQL, Celery stabilization |
| Email Intelligence | Aug-Sep 2025 | Gmail integration, entity extraction, participant tracking |
| Advanced Features | Oct-Dec 2025 | LangGraph workflows, memory system, reflection |
| Production Polish | Jan-Mar 2026 | HITL review, task orchestration, publishing pipeline |
Database Scale:
| Table | Records | Size |
|---|---|---|
| articles | 1.79M | 15 GB |
| article_chunks | 4.56M | 45 GB |
| classified_emails | 218K | 8 GB |
| email_chunks | 3.58M | 35 GB |
| sender_profiles | 10K | 200 MB |
| email_participants | 453K | 2 GB |
| HNSW indexes | 3 | 20 GB |
| Total | | ~255 GB |
AUDN: Add/Update/Delete/Noop - memory reconciliation operations determining how new learning integrates with existing memories
Chunk: A semantic segment of text (typically 500 characters) with associated vector embedding for similarity search
Deterministic Filter: Rule-based classification that handles obvious cases without LLM invocation, preserving AI capacity for nuanced decisions
Few-Shot Learning: Providing examples of desired outputs in prompts to guide model behavior toward publication-specific patterns
HITL: Human-in-the-Loop - design pattern ensuring human approval before automated actions
HNSW: Hierarchical Navigable Small World - approximate nearest neighbor algorithm enabling fast vector similarity search
Memory Scope: The hierarchical level at which a learned preference applies (global, organization, project, sender)
Participant Intelligence: System for tracking email participants across the corpus, enabling relationship mapping and communication pattern analysis
pgvector: PostgreSQL extension adding vector data type and similarity search operators
Quality Gate: Validation layer that filters low-quality signals before they can influence learning systems
RAG: Retrieval-Augmented Generation - pattern combining information retrieval with LLM generation
RRF: Reciprocal Rank Fusion - technique for combining multiple ranking signals (e.g., vector similarity + text search)
Sender Profile: Aggregated intelligence about an email sender including organization, role, communication patterns, and relationship history
Semantic Search: Finding content by meaning rather than exact keyword matching, enabled by vector embeddings
tsvector: PostgreSQL's full-text search data type for efficient text matching
White paper compiled from 321 commits over 9 months of development.
Data warehouse: 255 GB (1.79M articles, 218K emails, 8M+ chunks).
Last updated: March 2026
For questions about this system, potential collaboration, or career opportunities, contact:
Alec Meeker
alec@bushwickdaily.com
LinkedIn: linkedin.com/in/alecmeeker
GitHub: alecmeeeker.github.io