Abstract
The local news crisis has left communities across America without reliable journalism coverage. While national outlets consolidate and local newsrooms collapse, AI has emerged as a potential force multiplier that could enable small newsrooms to punch above their weight. This white paper documents the nine-month development of a comprehensive AI infrastructure built by Bushwick Daily, a hyperlocal Brooklyn publication. The system encompasses 255 GB of structured data including 1.79 million scraped news articles from 75+ outlets, 218,000 processed emails with participant intelligence, and 8 million searchable semantic chunks. The architecture integrates multiple AI subsystems: an intelligent news scanner that monitors NYC media and extracts stories from email intelligence, an adaptive memory system that learns editorial preferences through human feedback, and an email assistant that triages correspondence and generates voice-accurate draft responses. The human-in-the-loop design philosophy ensures AI augments rather than replaces editorial judgment, with explicit checkpoints requiring human approval before any automated action. Development proceeded through deliberate iteration, with significant architectural pivots driven by production experience rather than theoretical assumptions. The resulting system demonstrates that sophisticated AI infrastructure is achievable for small newsrooms, potentially transforming the economics of local journalism by enabling one-person operations to maintain coverage depth traditionally requiring dedicated research staff.
1. Introduction: The Local News Crisis and an AI Response
1.1 The Challenge
The collapse of local journalism represents one of the most significant information gaps in American civic life. Since 2004, over 2,500 newspapers have closed. The remaining local outlets operate with skeleton staffs, unable to provide the coverage depth that community accountability requires. Hyperlocal publications face an impossible equation: the reporting workload of a full newsroom with the resources of a small business.
For Bushwick Daily, this tension manifested in concrete operational challenges. Monitoring 75+ NYC news outlets for relevant coverage required hours of daily manual checking. An inbox processing 200+ emails daily created a triage bottleneck where important community communications were lost in promotional noise. Institutional knowledge about sources, organizations, and ongoing stories existed only in the publisher's memory, vulnerable to the cognitive limitations of a single operator.
1.2 The Hypothesis
The hypothesis driving this project was straightforward: AI could serve as a force multiplier for small newsrooms rather than a replacement for journalists. The goal was not automated content generation but automated information processing, freeing editorial attention for judgment and writing while AI handled collection, organization, and routine correspondence.
This required building AI infrastructure from scratch rather than adopting off-the-shelf solutions. Commercial AI tools optimize for general use cases; a newsroom requires domain-specific capabilities around source management, editorial voice preservation, and journalistic accuracy standards.
1.3 The Approach
Development proceeded through direct engagement with production requirements. The author, with a background in investment analysis ($200M portfolio management, PE fund due diligence), a B.S. in Finance and Business Information Systems, and eight years of company operations, built the system iteratively, applying the same analytical rigor to AI architecture decisions that he previously applied to portfolio analysis and fund evaluation. Prior experience with VBA/SQL automation for institutional reporting (reducing Emerging Market Debt reporting cycles by 70% at Erie Insurance) provided a foundation for the data pipeline engineering documented in this paper.
The approach prioritized:
- Human-in-the-loop design: AI proposes, humans approve. No automated actions without explicit human checkpoints.
- Editorial integrity: Voice preservation, source attribution, and accuracy verification built into system architecture.
- Practical utility: Features driven by real workflow pain points rather than theoretical capabilities.
- Sustainable architecture: Production-grade reliability over prototype impressiveness.
1.4 Scope of This Paper
This paper covers the complete technical architecture of the Bushwick Daily AI newsroom system as of March 2026, including:
- Data Infrastructure: PostgreSQL 17 with pgvector for unified article and email storage with semantic search
- NYCNewsScanner: Multi-agent news monitoring and story extraction from both published articles and email intelligence
- Email Assistant: Intelligent triage, context-aware response generation, and task orchestration with skill-based automation
- Memory System: Adaptive learning architecture that improves through human feedback
- Integration Layer: How these components communicate and share context
The development journey itself is documented, including architectural pivots, failed approaches, and lessons learned from nine months of iteration.
2. System Architecture Overview
2.1 Design Philosophy
Five principles guided architectural decisions:
1. Human-in-the-Loop by Default: Every automated action passes through human approval. The system proposes; humans decide. This applies to email responses, story extraction approvals, form submissions, and content publication. AI handles the mechanical work of drafting, organizing, and retrieving; humans retain editorial judgment.
2. Editorial Integrity as Architecture: Voice preservation is not a feature but a requirement. The system learns from human corrections rather than overriding editorial preferences. When AI-generated content is edited before sending, those edits become training data for future improvements.
3. Practical Over Impressive: Features address documented workflow problems rather than theoretical capabilities. The incremental crawling system exists because daily news monitoring was consuming hours; the email triage system exists because 200+ daily emails were unmanageable manually.
4. Production Reliability: Graceful degradation over brittle perfection. When individual articles cause parsing errors, the system logs and skips rather than crashing. When API calls fail, retry logic with exponential backoff ensures eventual completion. Background processing through Celery workers ensures main operations remain responsive.
5. Sustainable Operations: Infrastructure must run on available hardware within realistic budgets. The system operates on a Mac with an external SSD for PostgreSQL data, using commodity cloud APIs with cost tracking. AI operations are optimized for cost efficiency without sacrificing capability.
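The retry behavior named in principle 4 can be illustrated with a minimal sketch. The function name `retry_with_backoff`, the base delay, and the injectable `sleep` parameter are illustrative assumptions, not the system's actual code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, waiting base_delay * 2**attempt between failed attempts."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # Out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("max_retries must be non-negative")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production version would typically also restrict which exception types are retried.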
2.2 High-Level Architecture
┌──────────────────────────────────────────────────────┐
│ DATA SOURCES │
│ 75+ News Outlets Gmail Inbox Reddit/Web │
└────────────┬─────────────┬────────────┬──────────────┘
│ │ │
┌────────────▼─────────────▼────────────▼──────────────┐
│ INGESTION LAYER │
│ Scrapy Spider Gmail API Web Scraper │
│ URL Cache Deduplication Rate Limiting │
└────────────┬─────────────┬────────────┬──────────────┘
│ │ │
┌────────────▼─────────────▼────────────▼──────────────┐
│ PROCESSING LAYER │
│ Article Extraction Email Classification │
│ Entity Extraction Participant Intelligence │
│ Chunking Embedding Generation │
└────────────┬─────────────┬────────────┬──────────────┘
│ │ │
┌────────────▼─────────────▼────────────▼──────────────┐
│ DATA WAREHOUSE │
│ PostgreSQL 17 + pgvector │
│ 1.79M Articles 218K Emails 8M+ Chunks │
│ 453K Participants 10K Sender Profiles │
│ HNSW Vector Indexes │
└────────────┬─────────────┬────────────┬──────────────┘
│ │ │
┌──────────────────────┼─────────────┼────────────┼──────────────────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ RAG │ │ NYCNews │ │ Email │ │ Memory │ │ Publish │
│ Chatbot │ │ Scanner │ │ Assistant │ │ System │ │ Pipeline │
│ │ │ │ │ │ │ │ │ │
│ Semantic │ │ Multi-Agent │ │ Triage │ │ Adaptive │ │ WordPress │
│ Search │ │ Research │ │ Drafting │ │ Learning │ │ Integration │
│ Q&A │ │ Story │ │ Task │ │ Feedback │ │ SEO Gen │
│ │ │ Extraction │ │ Orchestr. │ │ Loop │ │ Image AI │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │ │
└──────────────────────┼─────────────┼────────────┼──────────────────────┘
│ │ │
┌────────────▼─────────────▼────────────▼──────────────┐
│ HUMAN INTERFACE │
│ CLI Commands Web Dashboard API Endpoints │
│ Human-in-the-Loop Review │
└──────────────────────────────────────────────────────┘
2.3 Technology Stack Summary
| Layer | Technologies | Rationale |
|---|---|---|
| Data Storage | PostgreSQL 17, pgvector, Redis | Unified SQL + vector search; HNSW indexes enable sub-100ms similarity queries across millions of chunks |
| Backend | Python 3.11, FastAPI, Celery | Async processing with background tasks for compute-intensive embedding generation |
| Scraping | Scrapy, BeautifulSoup, Readability | Intelligent heuristic spider with fallback extraction chains |
| AI/LLM | Claude (Anthropic), Gemini (Google), GPT-4 (OpenAI) | Multi-model strategy: Claude for generation quality, Gemini Flash for cost-efficient classification, GPT-4 for specialized tasks |
| Embeddings | SentenceTransformers (all-MiniLM-L6-v2) | 384-dimensional vectors; lightweight model runs locally without GPU |
| Browser Automation | Playwright | RSVP form filling, screenshot capture, web interaction |
| Workflow Engine | LangGraph | Multi-agent orchestration with state persistence |
| Frontend | React, Streamlit | Development speed with Streamlit; production UI with React |
| Infrastructure | macOS, Homebrew, External SSD | Commodity hardware sufficient for production workloads |
3. Data Infrastructure
3.1 Data Warehouse Architecture
The data warehouse centers on PostgreSQL 17 with the pgvector extension, providing unified storage for both structured metadata and high-dimensional vector embeddings. This architecture eliminates the complexity of maintaining separate vector stores while enabling joins between semantic search results and relational data.
Primary Entity Clusters:
-- Article Storage
outlets (75 active)
└── articles (1.79M records)
└── article_content (full HTML/text with tsvector search)
└── article_chunks (4.56M chunks with 384-dim embeddings)
-- Email Storage
classified_emails (218K records, 60 columns)
├── email_chunks (3.58M chunks with embeddings)
├── email_participants (453K relationship records)
│ └── sender_profiles (10K with communication patterns)
├── email_attachments (Google Drive links, SHA-256 dedup)
└── email_events (ICS calendar extraction)
-- Entity Intelligence
entities_v2 (2,580+ entities)
└── entity_mentions_v2 (cross-corpus tracking)
└── entity_aliases_v2 (disambiguation)
Vector Indexing Strategy:
HNSW (Hierarchical Navigable Small World) indexes provide approximate nearest neighbor search with sub-100ms query times across millions of vectors:
CREATE INDEX idx_email_chunks_embedding_hnsw
ON email_chunks USING hnsw (embedding vector_cosine_ops);
-- Index size: 5.8 GB for 3.58M email chunks
-- Query time: ~27ms average (verified December 2024)
The choice of vector_cosine_ops reflects the semantic nature of the search: cosine similarity captures meaning alignment regardless of vector magnitude, which is appropriate for text embeddings where we care about semantic direction rather than intensity.
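The magnitude-invariance claim can be checked with a small standalone sketch. It uses plain Python lists rather than the 384-dimensional embeddings, and the helper names are illustrative.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Same quantity as pgvector's <=> operator: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

v = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]   # Same direction, ten times the magnitude
orthogonal_a, orthogonal_b = [1.0, 0.0], [0.0, 1.0]
```

Here `cosine_distance(v, scaled)` is zero up to floating-point error, while the two orthogonal vectors sit at distance 1.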
3.2 Article Ingestion Pipeline
The article scraping system achieves 95% automation through intelligent heuristics, requiring manual configuration only for exceptional sites.
Article Detection Cascade:
def triage_page(self, response):
    """Multi-stage article detection."""
    # Stage 1: OpenGraph metadata
    if response.css('meta[property="og:type"][content="article"]'):
        yield from self.parse_article(response)
        return
    # Stage 2: Semantic container analysis
    for selector in ['article', 'main', 'div[role="main"]']:
        container = response.css(selector)
        if container:
            text_content = "".join(container.css("::text").getall())
            html_content = container.get()
            # Text-to-markup ratio filter
            if len(text_content) / len(html_content) > 0.6:
                if len(text_content) > 250:
                    yield from self.parse_article(response)
                    return
Incremental Crawling:
The UrlCacheMiddleware eliminates redundant network requests by loading all existing URLs at spider startup:
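from scrapy.exceptions import IgnoreRequest
from sqlalchemy import text

class UrlCacheMiddleware:
    def spider_opened(self, spider):
        # Load every URL already stored for this outlet into an in-memory set.
        # Assumes the spider exposes its outlet as `spider.outlet_id`.
        with engine.connect() as conn:
            urls = conn.execute(
                text("SELECT url FROM articles WHERE outlet_id = :id"),
                {"id": spider.outlet_id}
            ).fetchall()
        self.scraped_urls = {self.normalize_url(row[0]) for row in urls}

    def process_request(self, request, spider):
        # Drop the request before any network traffic if the URL is already known
        if self.normalize_url(request.url) in self.scraped_urls:
            raise IgnoreRequest(f"Already scraped: {request.url}")
        return None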
This reduces network traffic by approximately 90% on subsequent crawls, enabling hourly news checks without server overload.
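The cache depends on `normalize_url` mapping equivalent URLs to one key, but the paper does not show that method. The following is a hypothetical sketch of one plausible normalization, not the project's implementation; the `utm_` prefix list and the trailing-slash rule are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters with these prefixes never change page content
TRACKING_PREFIXES = ("utm_",)

def normalize_url(url: str) -> str:
    """Return a canonical cache key for a URL."""
    parts = urlsplit(url.strip())
    # Drop tracking parameters and sort the rest so parameter order is irrelevant
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    )
    # Treat "/story/" and "/story" as the same page; keep "/" for the site root
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, discard the fragment
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )
```

Because the result is used only as a set key, it does not need to remain a fetchable URL; it only needs to be stable across the variants a crawler is likely to encounter.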
Site Override System:
For sites that defeat heuristic detection, YAML configuration provides escape hatches:
hellgatenyc.com:
  article_url_pattern: '/[a-z0-9-]+/$'
  body_fallback_chain:
    - "article.post-content"
    - "div.entry-content"
    - "div.article-body"
3.3 Email Ingestion Pipeline
Email processing follows a pipeline of seven numbered steps, several with intermediate sub-steps, orchestrated by update_emails_v2.sh:
Step 1: Gmail API extraction with service account delegation
Step 1.1: URL extraction from email content
Step 1.25: Deterministic triage (email/notify/no/spam)
Step 1.5: Attachment extraction → Google Drive with SHA-256 dedup
Step 2: Semantic chunking with 500-char segments
Step 3: Gemini Flash classification (16 categories)
Step 3.5: Structured event extraction (ICS parsing)
Step 4: Participant intelligence (role extraction, fingerprinting)
Step 5: Story extraction for journalism workflow
Step 6: LangChain agent processing
Step 7: Entity extraction (SpaCy NER)
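Step 2 splits each email into segments of at most 500 characters. The paper does not show the chunker, so the sketch below is a simple stand-in that packs whole sentences into chunks; the function name and the sentence-splitting regex are assumptions.

```python
import re
from typing import List

def chunk_text(text: str, max_chars: int = 500) -> List[str]:
    """Pack whole sentences into chunks of at most `max_chars` characters."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate      # Sentence still fits in the open chunk
        else:
            chunks.append(current)   # Close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Keeping sentences intact means each embedded chunk carries a complete thought, which tends to matter more for retrieval quality than hitting the size limit exactly.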
Hybrid Classification System:
Deterministic rules handle predictable patterns before invoking LLM classification:
from typing import Optional

class DeterministicFilter:
    # Spam TLDs with >95% spam rate
    SPAM_TLDS = {'.xyz', '.top', '.gdn', '.click', '.loan'}
    # Known newsletter domains
    NEWSLETTER_DOMAINS = {'mailchimp.com', 'substack.com', 'constantcontact.com'}
    # Financial transaction patterns
    FINANCIAL_PATTERNS = [r'paypal.*receipt', r'venmo.*paid']

    def classify(self, email) -> Optional[str]:
        # Check spam indicators
        if any(email.sender.endswith(tld) for tld in self.SPAM_TLDS):
            return 'spam'
        # Check newsletter patterns (rsplit tolerates a sender with no '@')
        sender_domain = email.sender.rsplit('@', 1)[-1].lower()
        if sender_domain in self.NEWSLETTER_DOMAINS:
            return 'marketing_newsletter'
        # ... additional rules
        return None  # Fall through to LLM classification
This hybrid approach preserves LLM capacity for nuanced decisions while handling obvious cases instantly, achieving approximately 10x throughput improvement over LLM-only classification.
3.4 Data Quality Framework
Deduplication:
Content fingerprinting using SHA-256 prevents duplicate storage:
def generate_fingerprint(self, email):
    # Normalize content
    content = self.normalize_tracking_params(email.body_text)
    content = self.normalize_whitespace(content)
    # Generate hash
    return hashlib.sha256(content.encode('utf-8')).hexdigest()
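The two normalization helpers are not shown in the paper. A self-contained sketch of how they might behave follows; the regex and the decision to strip only `utm_*` parameters are assumptions made for illustration.

```python
import hashlib
import re

# Assumption: only utm_* query parameters are treated as tracking noise
_TRACKING_PARAM = re.compile(r"[?&]utm_\w+=[^&\s#]*", re.IGNORECASE)

def normalize_tracking_params(text: str) -> str:
    """Remove utm_* parameters from any URLs embedded in the text."""
    return _TRACKING_PARAM.sub("", text)

def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return " ".join(text.split())

def generate_fingerprint(body_text: str) -> str:
    """SHA-256 of the normalized body, as a hex string."""
    content = normalize_whitespace(normalize_tracking_params(body_text))
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

first = "Join us  at https://example.org/event?utm_source=news\n\nSee you there"
second = "Join us at https://example.org/event?utm_source=social See you there"
```

With this normalization, `first` and `second` hash to the same fingerprint, so the same announcement arriving through two mailing lists would be stored once.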
Referential Integrity:
CASCADE deletes maintain consistency:
ALTER TABLE article_chunks
ADD CONSTRAINT fk_article
FOREIGN KEY (article_id) REFERENCES articles(id) ON DELETE CASCADE;
Encoding Safety:
UTF-8 sanitization prevents PostgreSQL errors from null bytes:
clean_text = text.encode('utf-8', 'ignore').decode('utf-8')
clean_text = clean_text.replace('\x00', '') # Remove null bytes
3.5 Key Technical Achievements
- ~27ms vector similarity queries across 3.58M email chunks
- ~90% network reduction via URL cache middleware
- 4.56M article chunks and 3.58M email chunks with HNSW indexes
- 92.7% profile linkage for emails to enhanced sender profiles
- INSERT-only mode for articles prevents data loss from paywalled content changes
4. NYCNewsScanner: Intelligent News Discovery
4.1 Scanning Architecture
The NYCNewsScanner employs a multi-agent architecture using the Claude Agent SDK to parallelize news research across diverse source types.
┌─────────────────────┐
│ News Coordinator │
│ (Orchestrator) │
└──────────┬──────────┘
┌───────────────────────┼───────────────────────┐
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ DB Researcher │ │ Email Researcher │ │ Web Researcher │
│ (7 outlets) │ │ (Press/News) │ │ (Reddit/Web) │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │
└───────────────────────┴───────────────────────┘
▼
┌──────────────────────┐
│ Story Extractor │
│ (5W Analysis) │
└──────────┬───────────┘
▼
┌──────────────────────┐
│ Memory Integration │←─── Quality Gate
│ (4-Level Scope) │←─── Feedback Parser
└──────────┬───────────┘
▼
┌──────────────────────┐
│ Pitch Generation │
│ + Human Review │
└──────────────────────┘
Source Coverage:
- Database Researcher: Queries 7 major NYC outlets (Gothamist, Hell Gate, QNS, THE CITY, Bklyner, Brooklyn Paper, amNY) using deterministic SQL against the 1.79M article corpus
- Email Researcher: Scans 215+ newsletters and 122+ press releases classified via Gemini Flash
- Web Researcher: Monitors r/Bushwick and conducts targeted web searches for hyperlocal content not yet in the scraping pipeline
4.2 Relevance Scoring & Prioritization
Each extracted story receives a newsworthiness score based on multiple factors:
from datetime import datetime, timedelta

def calculate_newsworthiness(self, story, entities, email_context):
    score = 0.0
    # Geographic proximity to Bushwick
    bushwick_entities = ['Bushwick', 'Community Board 4', 'BK90']
    for entity in entities:
        if entity in bushwick_entities:
            score += 0.3
    # Known community figures
    if self.entity_db.is_known_figure(story.who):
        score += 0.2
    # Temporal urgency: only events within the coming week, not past ones
    now = datetime.now()
    if story.when and now <= story.when < now + timedelta(days=7):
        score += 0.2
    # Source credibility
    if email_context.sender_profile.is_government:
        score += 0.15
    elif email_context.sender_profile.is_journalist:
        score += 0.1
    return min(score, 1.0)
Stories scoring below threshold are filtered before human review, focusing editorial attention on genuinely newsworthy content.
4.3 Memory System
The memory system implements a four-level scope hierarchy based on research into production memory architectures (Mem0, Letta/MemGPT, academic literature):
Scope Levels:
- Global: Applies across all contexts ("Route advertising inquiries to business team")
- Organization: Persists through staff turnover ("CB4 meets second Wednesday monthly")
- Project: Tracks ongoing investigations ("Bushwick rezoning - these stakeholders are connected")
- Sender: Individual preferences ("Maria prefers informal tone")
Memory Storage:
CREATE TABLE am_memories (
id SERIAL PRIMARY KEY,
scope_type VARCHAR CHECK (scope_type IN ('global', 'org', 'project', 'sender')),
scope_id VARCHAR, -- NULL for global, email for sender, org_id for org
content TEXT NOT NULL,
embedding VECTOR(384),
memory_type VARCHAR, -- 'preference', 'fact', 'procedure', 'relationship'
confidence FLOAT DEFAULT 1.0,
times_applied INTEGER DEFAULT 0,
times_led_to_acceptance INTEGER DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW(),
last_accessed TIMESTAMPTZ
);
Memory Retrieval:
Hybrid search using Reciprocal Rank Fusion combines vector similarity with full-text matching:
CREATE FUNCTION search_memories(query_text TEXT, query_embedding VECTOR, scope_filter JSONB)
RETURNS TABLE (memory_id INT, content TEXT, score FLOAT) AS $$
  -- Assumes am_memories also carries a search_vector tsvector column
  -- Vector similarity component
  WITH vector_results AS (
    SELECT id, content, 1 - (embedding <=> query_embedding) as vscore,
           ROW_NUMBER() OVER (ORDER BY embedding <=> query_embedding) as vrank
    FROM am_memories
    -- scope_filter->'scopes' is a JSONB array, so expand it rather than using ANY()
    WHERE scope_type IN (SELECT jsonb_array_elements_text(scope_filter->'scopes'))
  ),
  -- Full-text component
  text_results AS (
    SELECT id, content, ts_rank(search_vector, plainto_tsquery(query_text)) as tscore,
           ROW_NUMBER() OVER (
             ORDER BY ts_rank(search_vector, plainto_tsquery(query_text)) DESC
           ) as trank
    FROM am_memories
    WHERE search_vector @@ plainto_tsquery(query_text)
  )
  -- RRF combination
  SELECT COALESCE(v.id, t.id), COALESCE(v.content, t.content),
         (1.0/(60 + COALESCE(vrank, 1000)) + 1.0/(60 + COALESCE(trank, 1000)))::float as rrf_score
  FROM vector_results v FULL OUTER JOIN text_results t ON v.id = t.id
  ORDER BY rrf_score DESC;
$$ LANGUAGE sql;
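The fusion step is easier to follow outside SQL. The sketch below implements plain Reciprocal Rank Fusion with the same constant of 60. Unlike the SQL function, which assigns a placeholder rank of 1000 to an item missing from one list, this version simply adds nothing for the missing list; the function name is illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[int]], k: int = 60
) -> List[Tuple[int, float]]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per item."""
    scores: Dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

vector_ranking = [1, 2, 3]   # Memory IDs ordered by embedding similarity
text_ranking = [3, 1, 4]     # Memory IDs ordered by full-text rank
fused = reciprocal_rank_fusion([vector_ranking, text_ranking])
```

Memory 1 ranks first in the fused list because it sits near the top of both inputs, ahead of memory 3, which tops only the text ranking. Because RRF works on ranks rather than raw scores, the cosine and `ts_rank` values never need to be put on a common scale.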
4.3.1 Memory Formation
New memories form through the AUDN (Add/Update/Delete/Noop) cycle in reconciler.py:
- Trigger: Human edits an AI-generated draft
- Parse: FeedbackParser classifies edit type (tone, factual, intent, structural)
- Gate: QualityGate filters noise (minimum 10% magnitude, 60% confidence, 5-minute debounce)
- Extract: LLM extracts learnable preference from qualified signal
- Reconcile: Compare to existing memories; AUDN determines action
class QualityGate:
    def should_process(self, signal: FeedbackSignal) -> bool:
        # Minimum edit magnitude
        if signal.edit_ratio < 0.10:
            return False
        # Confidence threshold
        if signal.confidence < 0.60:
            return False
        # Debounce rapid edits
        if signal.seconds_since_last < 300:
            return False
        return True
4.3.2 Memory Retrieval
When generating content or responses, the system retrieves relevant memories using scope-based boosting:
def retrieve_memories(self, context):
    base_memories = self.search_memories(context.query)
    # Apply scope boosting
    for memory in base_memories:
        if memory.scope_type == 'sender' and memory.scope_id == context.sender:
            memory.score *= 1.3  # Sender memories most relevant
        elif memory.scope_type == 'org' and memory.scope_id == context.org:
            memory.score *= 1.2
        elif memory.scope_type == 'project' and memory.scope_id in context.projects:
            memory.score *= 1.1
        # Global memories use base score
    return sorted(base_memories, key=lambda m: m.score, reverse=True)[:10]
4.3.3 Memory Decay & Consolidation
The effectiveness tracker implements decay for underperforming memories:
def compute_effectiveness(self, memory):
    if memory.times_applied == 0:
        return 0.5  # Neutral default
    acceptance_rate = memory.times_led_to_acceptance / memory.times_applied
    edit_ratio = memory.total_edit_distance / memory.total_output_length
    # Weighted combination
    return 0.6 * acceptance_rate + 0.4 * (1 - edit_ratio)
Memories with effectiveness below threshold become candidates for automatic pruning via the ea memory prune CLI command.
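A worked example makes the weighting concrete. The standalone sketch below restates the formula and adds a pruning filter. The `MemoryStats` container, the guard against empty output, and the 0.4 threshold are assumptions, since the paper does not state the actual pruning threshold.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class MemoryStats:
    times_applied: int
    times_led_to_acceptance: int
    total_edit_distance: int
    total_output_length: int

def compute_effectiveness(m: MemoryStats) -> float:
    """0.6 * acceptance rate + 0.4 * (1 - edit ratio); 0.5 when there is no data."""
    if m.times_applied == 0 or m.total_output_length == 0:
        return 0.5
    acceptance_rate = m.times_led_to_acceptance / m.times_applied
    edit_ratio = min(m.total_edit_distance / m.total_output_length, 1.0)
    return 0.6 * acceptance_rate + 0.4 * (1 - edit_ratio)

def prune_candidates(memories: Dict[int, MemoryStats], threshold: float = 0.4) -> List[int]:
    """IDs of memories whose effectiveness falls below the (hypothetical) threshold."""
    return [mid for mid, stats in memories.items() if compute_effectiveness(stats) < threshold]

memories = {
    1: MemoryStats(10, 7, 200, 1000),   # 0.6 * 0.7 + 0.4 * 0.8 = 0.74
    2: MemoryStats(10, 1, 900, 1000),   # 0.6 * 0.1 + 0.4 * 0.1 = 0.10
    3: MemoryStats(0, 0, 0, 0),         # Never applied: neutral 0.5
}
```

In this example only memory 2 becomes a pruning candidate. A memory that has never been applied keeps the neutral score, so new memories are not pruned before they have a chance to be used.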
4.4 Reflection Engine
The reflection system was redesigned after research revealed that "reflection without external feedback degrades performance." The current architecture uses deterministic parsing BEFORE any LLM involvement:
The Anti-Pattern (Avoided):
# DON'T: Send raw edits directly to LLM for reflection
def learn_from_edit(original, edited):
    prompt = f"What preference does this edit reveal?\nOriginal: {original}\nEdited: {edited}"
    return llm.complete(prompt)  # Degrades over time
The Implemented Pattern:
# DO: Deterministic parsing, quality gate, then targeted LLM extraction
import difflib

def learn_from_edit(original, edited):
    # Step 1: Deterministic classification
    diff = difflib.SequenceMatcher(a=original, b=edited)
    edit_type = classify_edit_type(diff)  # tone, factual, intent, etc.
    # Step 2: Quality gate. ratio() measures similarity, so the edit
    # magnitude passed to the gate is its complement.
    if not quality_gate.should_process(edit_type, 1 - diff.ratio()):
        return None
    # Step 3: Targeted LLM extraction (only for qualified signals)
    return extract_learnable_preference(original, edited, edit_type)
The quality gate prevents several documented failure modes:
- Noise pollution: Typo fixes don't become "preferences"
- Debounce protection: Rapid-fire edits don't create duplicate memories
- Magnitude filtering: Only edits changing 10%+ of content pass
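The 10% magnitude filter can be illustrated with the standard library alone. This sketch assumes edit magnitude is measured as one minus the `difflib` similarity ratio, which is one reasonable reading of the text rather than a confirmed detail of the system.

```python
import difflib

def edit_magnitude(original: str, edited: str) -> float:
    """Fraction of the text that changed: 0.0 for identical strings."""
    similarity = difflib.SequenceMatcher(a=original, b=edited).ratio()
    return 1.0 - similarity

MIN_MAGNITUDE = 0.10  # The 10% threshold described above

# A one-character typo fix
typo_fix = edit_magnitude(
    "Thanks for reaching out about the comunity garden story.",
    "Thanks for reaching out about the community garden story.",
)
# A rewrite that changes tone and content
rewrite = edit_magnitude(
    "Dear Sir or Madam, I hope this finds you well.",
    "Hi Maria, quick follow-up on the credentials.",
)
```

Under this measure the typo fix falls well below `MIN_MAGNITUDE` and would be ignored, while the rewrite clears it and would proceed to the confidence and debounce checks.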
4.5 Publisher Integration
The pitch generation system produces structured outputs for editorial review:
from dataclasses import dataclass
from typing import List

@dataclass
class EditorPitch:
    pitch_id: str
    headline: str
    key_facts: List[str]
    source_attribution: str
    source_email_id: int
    newsworthiness_score: float
    suggested_angle: str
    related_articles: List[int]  # IDs of existing coverage
The HITL review interface displays pitches with full context:
┌─────────────────────────────────────────────────────────────┐
│ PITCH: City Announces New Affordable Housing Lottery │
├─────────────────────────────────────────────────────────────┤
│ Source: HPD Press Release (maria.torres@nyc.gov) │
│ Score: 0.87 (HIGH) │
│ │
│ KEY FACTS: │
│ • 150 units at 1234 Bushwick Ave │
│ • Income bands: 40%, 60%, 80% AMI │
│ • Applications open March 20 │
│ │
│ RELATED COVERAGE: 3 prior articles on this development │
│ │
│ [a]pprove [r]eject [e]dit [n]ext [q]uit │
└─────────────────────────────────────────────────────────────┘
5. Email Assistant
5.1 Processing Pipeline
The Email Assistant implements a complete email processing and response generation system:
Gmail API → Triage → HITL Review → Orchestrator → Skills → Draft Review → Send
↓
Deterministic Pre-orchestrator Task Skill Human
Filter checkpoint Extraction Dispatch Approval
Triage System:
Four-tier classification replaces the initial binary (YES/NO) approach:
| Category | Description | Action |
|---|---|---|
| email | Requires response from publisher | → Orchestrator |
| notify | Awareness-only (press releases, FYI) | → Notification |
| no | Not relevant to newsroom | → Archive |
| spam | Blocked, flagged for filter learning | → Spam folder |
Pre-Orchestrator HITL:
A filtering checkpoint before expensive LLM calls:
$ ea hitl review --limit 50
Thread: Re: Interview Request - Community Garden
From: reporter@brooklyneagle.com
Triage: EMAIL (needs response)
Preview: Hi Alec, I'm working on a story about...
Actions: [x]spam [m]mute [o]rchestrator [n]ext [q]uit
Each spam email caught here saves 3-5 LLM calls (task extraction, classification, skill dispatch, draft generation).
5.2 Context-Aware Response Generation
Draft responses leverage multiple context sources:
Sender Profile:
profile = sender_profiles.get(sender_email)
# Returns: organization, role, is_journalist, is_government,
# communication_history, preferred_tone, last_interaction
Thread Context:
thread = email_threads.get_full_thread(thread_id)
# Returns: all messages in thread, participants, subject evolution
Adaptive Memories:
memories = memory_store.retrieve_for_context(
sender=sender_email,
org=profile.organization,
projects=active_projects,
query=email_subject
)
# Returns: ranked preferences, facts, procedures relevant to this email
Few-Shot Examples:
examples = few_shot_retriever.get_similar_responses(
email_type=classification,
sender_type=profile.classification,
limit=3
)
# Returns: successful past responses to similar emails
The draft generation prompt assembles these contexts:
prompt = f"""
You are drafting a response for Alec Meeker, publisher of Bushwick Daily.
SENDER CONTEXT:
{sender_profile_summary}
THREAD CONTEXT:
{thread_history}
RELEVANT MEMORIES:
{format_memories(memories)}
SIMILAR PAST RESPONSES:
{format_examples(examples)}
WRITING GUIDELINES:
- Warm but direct tone
- Business development orientation
- NO em-dashes, NO "I hope this finds you well"
- Be specific about next steps
EMAIL TO RESPOND TO:
{current_email}
Draft a response:
"""
5.3 Human Oversight Design
Draft Review Interface:
┌─────────────────────────────────────────────────────────────┐
│ DRAFT RESPONSE │
├─────────────────────────────────────────────────────────────┤
│ To: maria.torres@nyc.gov │
│ Subject: Re: Press Credential Application │
│ │
│ Hi Maria, │
│ │
│ Thanks for following up on the credential application. │
│ I've attached our updated circulation numbers and added │
│ the publication schedule you requested. │
│ │
│ Let me know if you need anything else for the review. │
│ │
│ Best, │
│ Alec │
│ │
│ [s]end [e]dit [r]evise [a]nswer [d]iscard [n]ext [q]uit │
└─────────────────────────────────────────────────────────────┘
Action Options:
| Action | Effect | Learning |
|---|---|---|
| Send | Send as-is | Positive signal: draft was perfect |
| Edit | Open in editor, send after | Diff triggers memory extraction |
| Revise | Regenerate with notes | Notes inform next attempt |
| Answer | Continue multi-turn | Clarification needed |
| Discard | Reject draft | Strong negative signal |
Two-Phase HITL for Automation:
For actions like form filling, the system implements two-phase approval:
Phase 1: Analyze form → Show proposed values → Await approval
Phase 2: Fill form → Capture screenshot → Confirm submission
This prevents automated form submission with incorrect data while reducing manual data entry.
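The two-phase gate can be sketched as a small state machine in which submission is impossible until a human has approved the proposed values. All class and method names here are hypothetical; the browser automation itself (Playwright in the stack above) is replaced by a callable so the sketch stays self-contained.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict

class Phase(Enum):
    PROPOSED = "proposed"     # Phase 1: values shown, awaiting approval
    APPROVED = "approved"     # Human accepted the proposed values
    SUBMITTED = "submitted"   # Phase 2 complete
    REJECTED = "rejected"

@dataclass
class FormAction:
    url: str
    proposed_values: Dict[str, str]
    phase: Phase = Phase.PROPOSED

    def approve(self) -> None:
        if self.phase is not Phase.PROPOSED:
            raise ValueError(f"Cannot approve from phase {self.phase.value}")
        self.phase = Phase.APPROVED

    def reject(self) -> None:
        self.phase = Phase.REJECTED

    def submit(self, fill_form: Callable[[str, Dict[str, str]], None]) -> None:
        # The checkpoint: nothing is filled in without prior human approval
        if self.phase is not Phase.APPROVED:
            raise PermissionError("Form submission requires prior human approval")
        fill_form(self.url, self.proposed_values)
        self.phase = Phase.SUBMITTED
```

Encoding the checkpoint in the object's state, rather than in the calling code, means a new skill cannot skip approval by accident.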
6. Integration & Orchestration
6.1 System Interoperability
Components communicate through shared PostgreSQL tables and well-defined interfaces:
Shared Data Structures:
# All components use consistent email representation
@dataclass
class EmailContext:
    email_id: int
    thread_id: str
    sender_email: str
    sender_profile: Optional[SenderProfile]
    subject: str
    body_text: str
    body_html: str
    received_at: datetime
    classification: str
    participant_roles: Dict[str, str]
    attachments: List[Attachment]
Event-Driven Updates:
Content lineage tracking creates audit trails across all operations:
CREATE TABLE content_lineage_events (
id SERIAL PRIMARY KEY,
content_type VARCHAR, -- email, story, pitch, article, queue_item
content_id INTEGER,
event_type VARCHAR, -- extracted, approved, rejected, published
actor_type VARCHAR, -- system, human
actor_id VARCHAR,
parent_content_type VARCHAR,
parent_content_id INTEGER,
metadata JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
);
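One use of the parent columns is tracing a published item back to its origin. The sketch below works on rows already fetched as dictionaries keyed by the column names above; the function name and the in-memory approach are illustrative assumptions, and a production version might use a recursive SQL query instead.

```python
from typing import Dict, List, Optional, Tuple

ContentKey = Tuple[str, int]  # (content_type, content_id)

def trace_lineage(events: List[dict], content_type: str, content_id: int) -> List[ContentKey]:
    """Follow parent links from one piece of content back to its root."""
    parents: Dict[ContentKey, Optional[ContentKey]] = {}
    for event in events:
        child = (event["content_type"], event["content_id"])
        if event.get("parent_content_type") is not None:
            parents[child] = (event["parent_content_type"], event["parent_content_id"])
        else:
            parents.setdefault(child, None)  # Root content has no parent
    chain: List[ContentKey] = []
    seen = set()
    current: Optional[ContentKey] = (content_type, content_id)
    while current is not None and current not in seen:  # `seen` guards against cycles
        chain.append(current)
        seen.add(current)
        current = parents.get(current)
    return chain

events = [
    {"content_type": "email", "content_id": 1, "event_type": "extracted"},
    {"content_type": "story", "content_id": 10, "event_type": "extracted",
     "parent_content_type": "email", "parent_content_id": 1},
    {"content_type": "pitch", "content_id": 5, "event_type": "approved",
     "parent_content_type": "story", "parent_content_id": 10},
    {"content_type": "article", "content_id": 99, "event_type": "published",
     "parent_content_type": "pitch", "parent_content_id": 5},
]
```

Calling `trace_lineage(events, "article", 99)` walks from the article through the pitch and story to the source email, which is the audit trail an editor would need to verify attribution.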
6.2 State Management
Complex workflow state persists to PostgreSQL rather than memory:
Session Persistence:
import json

class SessionStore:
    def save_session(self, session_id, state):
        # Write-through: every update persists immediately
        with self.conn.cursor() as cur:
            # EXCLUDED.state reuses the inserted value, so it is passed only once
            cur.execute("""
                INSERT INTO nycnews_agent_sessions (session_id, state, updated_at)
                VALUES (%s, %s, NOW())
                ON CONFLICT (session_id)
                DO UPDATE SET state = EXCLUDED.state, updated_at = NOW()
            """, (session_id, json.dumps(state)))
        self.conn.commit()

    def load_session(self, session_id):
        # Read-through: database fallback when not in memory
        with self.conn.cursor() as cur:
            cur.execute(
                "SELECT state FROM nycnews_agent_sessions WHERE session_id = %s",
                (session_id,)
            )
            result = cur.fetchone()
        return json.loads(result[0]) if result else None
Bidirectional Task-Draft Linking:
-- Draft knows its generating task
ALTER TABLE ea_email_drafts ADD COLUMN source_task_id INTEGER REFERENCES ea_tasks(id);
-- Task knows its output draft
ALTER TABLE ea_tasks ADD COLUMN generated_draft_id INTEGER REFERENCES ea_email_drafts(id);
This enables atomic state synchronization:
def sync_task_draft_status(task_id, draft_id, action):
    """Atomic status update for linked task and draft."""
    with conn.cursor() as cur:
        if action == 'send':
            cur.execute("""
                UPDATE ea_tasks SET status = 'completed', completed_at = NOW()
                WHERE id = %s;
                UPDATE ea_email_drafts SET status = 'sent', sent_at = NOW()
                WHERE id = %s;
            """, (task_id, draft_id))
        elif action == 'reject':
            cur.execute("""
                UPDATE ea_tasks SET status = 'cancelled'
                WHERE id = %s;
                UPDATE ea_email_drafts SET status = 'rejected'
                WHERE id = %s;
            """, (task_id, draft_id))
    conn.commit()
6.3 Error Handling & Recovery
Graceful Degradation:
def process_email_with_fallbacks(email):
    try:
        # Primary path: full context enrichment
        profile = get_sender_profile(email.sender)
        memories = get_relevant_memories(email, profile)
        examples = get_few_shot_examples(email)
    except ProfileServiceError:
        # Fallback: basic context only
        profile = None
        memories = []
        examples = get_generic_examples(email.classification)
    # Always proceed with available context
    return generate_draft(email, profile, memories, examples)
Retry Logic:
@celery.task(bind=True, max_retries=3, default_retry_delay=60)
def generate_embeddings(self, article_id):
    try:
        # Embedding generation
        chunks = chunk_article(article_id)
        embeddings = model.encode(chunks)
        store_embeddings(article_id, embeddings)
    except TransientError as e:
        # Retry for transient failures
        raise self.retry(exc=e, countdown=60 * (self.request.retries + 1))
    except PermanentError as e:
        # Log and skip for permanent failures
        log_problematic_article(article_id, str(e))
        return False
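The `countdown` expression above produces a linear backoff rather than an exponential one. As a small worked example, the sketch below reproduces the delay schedule outside Celery; the helper name `retry_countdowns` is hypothetical and exists only for illustration.

```python
def retry_countdowns(max_retries=3, base_delay=60):
    """Return the wait in seconds before each retry attempt.

    Mirrors countdown=60 * (self.request.retries + 1), where
    self.request.retries is 0 when the task fails for the first time.
    """
    return [base_delay * (attempt + 1) for attempt in range(max_retries)]


schedule = retry_countdowns()
total_wait = sum(schedule)
```

With the task's settings, a persistently failing article is retried after 60, 120, and 180 seconds, so it waits about six minutes in total before Celery gives up.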
Recovery Scripts:
When issues occur at scale, recovery scripts identify and fix discrepancies:
def recover_missing_embeddings():
    """Find and regenerate embeddings that failed to persist."""
    with conn.cursor() as cur:
        # Anti-join: processed emails with no rows in email_chunks
        cur.execute("""
            SELECT ce.id FROM classified_emails ce
            LEFT JOIN email_chunks ec ON ce.id = ec.email_id
            WHERE ec.email_id IS NULL
            AND ce.processed_at IS NOT NULL
        """)
        missing = cur.fetchall()
    # fetchall() returns one-element tuples, so unpack the id before queuing
    for (email_id,) in missing:
        regenerate_embeddings.delay(email_id)
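The recovery query is an anti-join: it keeps only processed emails that have no matching chunk rows. The sketch below runs the same SQL against an in-memory sqlite3 database so the behaviour can be verified without PostgreSQL. The sample rows and the reduced column set are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE classified_emails (id INTEGER PRIMARY KEY, processed_at TEXT);
    CREATE TABLE email_chunks (id INTEGER PRIMARY KEY, email_id INTEGER);
    -- Email 1: processed, has chunks. Email 2: processed, chunks missing.
    -- Email 3: not processed yet, so it should not be reported.
    INSERT INTO classified_emails VALUES
        (1, '2026-01-01'), (2, '2026-01-02'), (3, NULL);
    INSERT INTO email_chunks (email_id) VALUES (1), (1);
""")


def find_missing(conn):
    """Return ids of processed emails that have no chunk rows."""
    rows = conn.execute("""
        SELECT ce.id FROM classified_emails ce
        LEFT JOIN email_chunks ec ON ce.id = ec.email_id
        WHERE ec.email_id IS NULL
          AND ce.processed_at IS NOT NULL
        ORDER BY ce.id
    """).fetchall()
    # Each row is a one-element tuple; unpack to plain ids.
    return [email_id for (email_id,) in rows]


missing_ids = find_missing(conn)
```

Only email 2 is reported. Email 1 already has chunks, and email 3 is excluded because it has not been processed.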
7. Impact & Results
7.1 Operational Improvements
Quantitative Metrics:
| Metric | Before | After | Improvement |
|---|---|---|---|
| News source monitoring | 2-3 hours/day | 15 minutes/day | 90% reduction |
| Email triage time | 1-2 hours/day | 20 minutes/day | 80% reduction |
| Article research time | 30-60 min/story | 5-10 min/story | 80% reduction |
| Response drafting | 10-15 min/email | 2-3 min/email | 80% reduction |
| Event calendar updates | Manual entry | Semi-automated | Hours saved weekly |
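The Improvement column can be reproduced approximately from the Before and After ranges. The paper does not state how the percentages were derived, so the sketch below takes the midpoint of each range, which is an assumption; the helper names are hypothetical.

```python
def midpoint(low, high):
    """Midpoint of a range given in minutes."""
    return (low + high) / 2


def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return round(100 * (before - after) / before, 1)


# (before_low, before_high, after_low, after_high), all in minutes
rows = {
    "News source monitoring": (120, 180, 15, 15),
    "Email triage": (60, 120, 20, 20),
    "Article research": (30, 60, 5, 10),
    "Response drafting": (10, 15, 2, 3),
}

reductions = {
    name: percent_reduction(midpoint(b_lo, b_hi), midpoint(a_lo, a_hi))
    for name, (b_lo, b_hi, a_lo, a_hi) in rows.items()
}
```

Under the midpoint assumption the reductions come out at 90.0%, 77.8%, 83.3%, and 80.0%, which is consistent with the rounded figures in the table.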
System Scale:
- 1.79M articles searchable across 75+ NYC news outlets
- 218K emails processed with 16-category classification
- 8M+ semantic chunks indexed for sub-100ms search
- 10K sender profiles with communication pattern analysis
- 453K participant records for relationship intelligence
7.2 Editorial Capabilities
The system enables capabilities previously unavailable to a small newsroom:
Comprehensive Competitive Intelligence:
- Query "what have other outlets published about Bushwick rezoning?" and receive semantically ranked results across 1.79M articles
- Identify coverage gaps by comparing topic presence across outlets
- Track breaking stories across sources in near-real-time
Institutional Memory:
- Search all prior communications with any source: "show me all emails with this council member"
- Surface historical context for breaking stories automatically
- Preserve knowledge through staff transitions via organization-level memories
Source Relationship Management:
- Track 10K sender profiles with communication patterns
- Identify who provides information about which topics
- Detect relationship networks (who gets mentioned together)
Automated Event Tracking:
- Calendar events extracted from emails with structured data
- ICS attachments parsed into database records
- Event flyers uploaded to Google Drive with database tracking
7.3 Sustainability Implications
For local news sustainability, the implications are significant:
Cost Structure:
- AI API costs: ~$50-100/month for full system operation
- Infrastructure: Commodity Mac + external SSD
- No specialized engineering staff required for maintenance
Scalability:
- Architecture supports adding outlets to scraping pipeline with config changes
- Email volume can increase without architectural changes
- Memory system improves with use rather than requiring retraining
Replicability:
- All components use standard open-source technologies
- No proprietary systems or vendor lock-in
- Documentation enables adoption by other small newsrooms
8. Ethical Considerations
8.1 AI in Journalism
The use of AI in journalism raises legitimate concerns that this system addresses through architectural choices:
Human Judgment Preserved:
- AI proposes; humans decide. No content publishes without human approval.
- Draft responses require explicit send action after review.
- Story pitches require editorial approval before article generation.
Accuracy Safeguards:
- The Question tool lets the model ask for information rather than guess
- Date awareness prevents responses to stale emails
- Quality gates prevent learning from spurious signals
Voice Authenticity:
- Few-shot learning from actual publisher responses
- Anti-patterns explicitly blocked in writing instructions
- Continuous learning from human corrections
Transparency:
- All AI-generated content is reviewed before sending
- No automated social media posting
- Editorial decisions remain with humans
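The "AI proposes; humans decide" rule can be enforced in code rather than left to convention. The following is a minimal sketch of such an approval gate. The `Draft` class, the status names, and the `approve` and `send` helpers are hypothetical illustrations, not the system's actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class DraftStatus(Enum):
    PENDING_REVIEW = "pending_review"
    APPROVED = "approved"
    SENT = "sent"


class ApprovalRequired(Exception):
    """Raised when something tries to send a draft no human has approved."""


@dataclass
class Draft:
    body: str
    status: DraftStatus = DraftStatus.PENDING_REVIEW
    approved_by: Optional[str] = None


def approve(draft: Draft, reviewer: str) -> None:
    """Record the human reviewer's explicit approval."""
    draft.status = DraftStatus.APPROVED
    draft.approved_by = reviewer


def send(draft: Draft, deliver: Callable[[str], None]) -> None:
    """Deliver the draft only if a human has approved it."""
    if draft.status is not DraftStatus.APPROVED:
        raise ApprovalRequired(f"draft is {draft.status.value}, not approved")
    deliver(draft.body)
    draft.status = DraftStatus.SENT
```

Because `send` checks the status itself, no caller, automated or otherwise, can deliver a draft that has not passed through `approve`.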
8.2 Data Privacy
Email Privacy:
- All email data is stored on local infrastructure; message content leaves the machine only when it is passed to the LLM APIs listed in Appendix A for classification or drafting
- Service account authentication isolates Gmail access
- Sender profiles aggregate patterns without storing personal data
Source Protection:
- Participant intelligence tracks relationships but not conversation content
- Entity extraction identifies public figures, not private individuals
- Memory system stores preferences, not sensitive information
8.3 Transparency
AI Assistance Disclosure:
- Generated drafts are reviewed and edited by humans
- The publisher retains full editorial responsibility
- AI serves as research and drafting assistance, not authorship
9. Future Directions
9.1 Planned Enhancements
Short-term (2026 Q2):
- Expanded RSVP automation for additional form types
- Enhanced entity disambiguation using co-occurrence patterns
- Mobile-friendly review interfaces for on-the-go approval
Medium-term (2026 H2):
- Voice assistant integration for hands-free email triage
- Automated source relationship cultivation reminders
- Cross-publication collaboration features
Long-term:
- Multi-newsroom knowledge sharing (privacy-preserving)
- Investigative research automation
- Community engagement intelligence
9.2 Scaling Considerations
The architecture supports scaling in several dimensions:
Horizontal Scaling:
- Database-backed session management enables multiple workers
- Celery task distribution scales embedding generation
- Stateless API design supports load balancing
Multi-Newsroom:
- Tenant isolation through database partitioning
- Shared entity knowledge with newsroom-specific memories
- Cost sharing for infrastructure while preserving editorial independence
Data Growth:
- HNSW query time scales sub-linearly (roughly logarithmically) with data size
- Incremental processing prevents full-corpus reprocessing
- Archival strategies for historical data management
9.3 Open Questions
Technical:
- Optimal memory consolidation strategies for long-running systems
- Entity disambiguation at scale (millions of mentions)
- Multi-model orchestration for cost/quality optimization
Editorial:
- Appropriate boundaries for AI drafting vs. human writing
- Community transparency expectations for AI-assisted journalism
- Source relationship implications of AI-mediated communication
Sustainability:
- Economic models for AI infrastructure in local news
- Skills development for non-technical journalists
- Maintenance burden for sophisticated systems
10. Conclusion
This white paper documents nine months of building AI infrastructure for a hyperlocal newsroom. The resulting system demonstrates that sophisticated AI capabilities are achievable for small news operations, potentially transforming the economics of local journalism.
The technical achievement is significant: 255 GB of structured news and email data, 8 million searchable semantic chunks, sub-100ms query performance, and integrated AI agents that monitor, extract, draft, and orchestrate across diverse information sources.
But the more important achievement is the design philosophy embedded in the architecture: human-in-the-loop by default, editorial integrity as a requirement, practical utility over impressive capabilities. AI serves as a force multiplier, not a replacement. The system proposes; humans decide.
For local journalism, the implications extend beyond individual newsroom efficiency. If AI can enable one publisher to maintain coverage depth previously requiring dedicated staff, similar systems could revive local news coverage in communities currently without any journalism presence. The technology is replicable, the costs are manageable, and the architecture is documented.
The local news crisis is real and deepening. This system represents one response: building tools that make small newsrooms more capable rather than waiting for business models that may never materialize. AI infrastructure won't solve every problem facing local journalism, but it can address the operational bottlenecks that consume editorial attention and prevent the deep community engagement that local news requires.
The code exists. The documentation exists. The path forward is clearer than it was nine months ago.
Appendices
A. Technology Reference
Core Technologies:
| Category | Technology | Version | Purpose |
|---|---|---|---|
| Database | PostgreSQL | 17 | Primary data store |
| Vector Search | pgvector | 0.7.0 | Embedding storage and similarity search |
| Cache | Redis | 7.x | Task queue broker, session cache |
| Backend | Python | 3.11 | Primary language |
| Web Framework | FastAPI | 0.128.0 | API endpoints |
| Task Queue | Celery | 5.x | Background processing |
| Scraping | Scrapy | 2.x | Web scraping framework |
| Browser Automation | Playwright | 1.x | Form filling, screenshots |
| Workflow | LangGraph | 0.x | Multi-agent orchestration |
| ML/NLP | SpaCy | 3.x | Named entity recognition |
| Embeddings | SentenceTransformers | 2.x | all-MiniLM-L6-v2 model |
LLM APIs:
| Provider | Model | Use Case |
|---|---|---|
| Anthropic | Claude Sonnet 4 | Draft generation, content analysis |
| Anthropic | Claude Haiku | Image analysis, fast classification |
| Google | Gemini 2.0 Flash Lite | Email classification (cost-efficient) |
| OpenAI | GPT-4 | Specialized tasks, comparison |
Infrastructure:
| Component | Specification |
|---|---|
| Hardware | Mac Mini M2 / MacBook Pro |
| Storage | Samsung T7 SSD (1TB) for PostgreSQL data |
| OS | macOS 14.x |
| Package Manager | Homebrew |
| Python Environment | venv |
B. Repository Structure
Master-Scrape-Rag-Pipe/
├── news_scraper_project/ # Article ingestion
│ ├── spiders/article_spider.py # Intelligent heuristic spider
│ ├── pipelines.py # PostgreSQL atomic writes
│ ├── tasks.py # Celery embedding tasks
│ └── middlewares.py # URL cache middleware
│
├── email_assistant/ # Email processing CLI
│ ├── cli.py # Main entry point
│ ├── orchestration/ # Task extraction & routing
│ │ ├── orchestrator.py # Central dispatcher
│ │ ├── task_extractor.py # LLM-based extraction
│ │ └── task_classifier.py # Capability-aware matching
│ ├── drafting/ # Response generation
│ │ ├── draft_response.py # Context-enriched drafting
│ │ └── context_enrichment.py # RAG context injection
│ ├── memory/ # Adaptive learning
│ │ ├── store.py # Memory persistence
│ │ ├── feedback_parser.py # Deterministic parsing
│ │ └── quality_gate.py # Signal validation
│ ├── skills/ # Automation capabilities
│ │ ├── rsvp_skill.py # Form filling
│ │ ├── calendar_skill.py # Event management
│ │ └── smart_form_skill.py # Intelligent form analysis
│ └── rsvp/ # Browser automation
│ └── playwright_controller.py
│
├── rag_chatbot/ # RAG interface
│ ├── app.py # FastAPI application
│ ├── ai_generator.py # Claude API integration
│ ├── vector_store_pg.py # pgvector search
│ └── NYCNewsAgent/ # Research agent system
│ ├── research_agent/ # Multi-agent research
│ │ ├── agent.py # Claude Agent SDK
│ │ └── email_story_extractor.py
│ ├── publish_pipeline/ # WordPress integration
│ │ ├── metadata_extractor.py
│ │ ├── seo_generator.py
│ │ └── image_analyzer.py
│ └── api/ # Workflow APIs
│ ├── workflow_api.py
│ └── lineage/ # Content traceability
│
├── entity_extraction/ # Entity intelligence
│ ├── spacy_entity_pipeline.py # NER pipeline
│ └── entity_disambiguation.py # Alias management
│
├── attachment_extraction/ # Email attachments
│ ├── drive_uploader.py # Google Drive streaming
│ └── calendar_parser.py # ICS parsing
│
├── event_extraction/ # Calendar events
│ └── extract_events_from_emails.py
│
├── migrations/ # Database schemas
│ └── 001-068_*.sql
│
├── update_emails_v2.sh # Email pipeline orchestrator
├── run_crawls.py # Article scraping orchestrator
└── CLAUDE.md # AI assistant instructions
C. Commit Statistics
- Total commits: 321
- Development period: July 1, 2025 - March 16, 2026 (9 months)
- First commit: 73aae77 - "initial commit"
- Latest commit: 8fbdf9a - "feat(hitl): add archive action and show full email body in review"
Development Phases:
| Phase | Period | Focus |
|---|---|---|
| Foundation | July 2025 | Scrapy, PostgreSQL, Celery stabilization |
| Email Intelligence | Aug-Sep 2025 | Gmail integration, entity extraction, participant tracking |
| Advanced Features | Oct-Dec 2025 | LangGraph workflows, memory system, reflection |
| Production Polish | Jan-Mar 2026 | HITL review, task orchestration, publishing pipeline |
Database Scale:
| Table | Records | Size |
|---|---|---|
| articles | 1.79M | 15 GB |
| article_chunks | 4.56M | 45 GB |
| classified_emails | 218K | 8 GB |
| email_chunks | 3.58M | 35 GB |
| sender_profiles | 10K | 200 MB |
| email_participants | 453K | 2 GB |
| HNSW indexes | 3 | 20 GB |
| Total | | ~255 GB |
D. Glossary
AUDN: Add/Update/Delete/Noop - memory reconciliation operations determining how new learning integrates with existing memories
Chunk: A semantic segment of text (typically 500 characters) with associated vector embedding for similarity search
Deterministic Filter: Rule-based classification that handles obvious cases without LLM invocation, preserving AI capacity for nuanced decisions
Few-Shot Learning: Providing examples of desired outputs in prompts to guide model behavior toward publication-specific patterns
HITL: Human-in-the-Loop - design pattern ensuring human approval before automated actions
HNSW: Hierarchical Navigable Small World - approximate nearest neighbor algorithm enabling fast vector similarity search
Memory Scope: The hierarchical level at which a learned preference applies (global, organization, project, sender)
Participant Intelligence: System for tracking email participants across the corpus, enabling relationship mapping and communication pattern analysis
pgvector: PostgreSQL extension adding vector data type and similarity search operators
Quality Gate: Validation layer that filters low-quality signals before they can influence learning systems
RAG: Retrieval-Augmented Generation - pattern combining information retrieval with LLM generation
RRF: Reciprocal Rank Fusion - technique for combining multiple ranking signals (e.g., vector similarity + text search)
Sender Profile: Aggregated intelligence about an email sender including organization, role, communication patterns, and relationship history
Semantic Search: Finding content by meaning rather than exact keyword matching, enabled by vector embeddings
tsvector: PostgreSQL's full-text search data type for efficient text matching
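The RRF entry above can be made concrete with a short sketch. Each result list contributes 1 / (k + rank) to a document's score, and documents are ordered by their summed score. The constant k = 60 is the value commonly used in the literature; the source does not say which value this system uses, and the function name and sample result lists are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant; larger values flatten rank differences.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], str(doc_id)))


vector_hits = ["a", "b", "c"]  # e.g. ordered by vector similarity
text_hits = ["b", "d", "a"]    # e.g. ordered by full-text rank
fused = reciprocal_rank_fusion([vector_hits, text_hits])
```

Documents that appear in both lists ("a" and "b") rise above those found by only one signal, which is why the technique suits combining vector similarity with text search.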
Data warehouse: 255 GB (1.79M articles, 218K emails, 8M+ chunks).
Last updated: March 2026.
For questions about this system, potential collaboration, or career opportunities, contact:
Alec Meeker, alec@bushwickdaily.com
LinkedIn: linkedin.com/in/alecmeeker · GitHub: alecmeeeker.github.io