The local news crisis has left communities across America without reliable journalism coverage. While national outlets consolidate and local newsrooms collapse, AI has emerged as a potential force multiplier that could enable small newsrooms to punch above their weight. This white paper documents the nine-month development of a comprehensive AI infrastructure built by Bushwick Daily, a hyperlocal Brooklyn publication. The system encompasses 255 GB of structured data including 1.79 million scraped news articles from 75+ outlets, 218,000 processed emails with participant intelligence, and 8 million searchable semantic chunks. The architecture integrates multiple AI subsystems: an intelligent news scanner that monitors NYC media and extracts stories from email intelligence, an adaptive memory system that learns editorial preferences through human feedback, and an email assistant that triages correspondence and generates voice-accurate draft responses. The human-in-the-loop design philosophy ensures AI augments rather than replaces editorial judgment, with explicit checkpoints requiring human approval before any automated action. Development proceeded through deliberate iteration, with significant architectural pivots driven by production experience rather than theoretical assumptions. The resulting system demonstrates that sophisticated AI infrastructure is achievable for small newsrooms, potentially transforming the economics of local journalism by enabling one-person operations to maintain coverage depth traditionally requiring dedicated research staff.
The collapse of local journalism represents one of the most significant information gaps in American civic life. Since 2004, over 2,500 newspapers have closed. The remaining local outlets operate with skeleton staffs, unable to provide the coverage depth that community accountability requires. Hyperlocal publications face an impossible equation: the reporting workload of a full newsroom with the resources of a small business.
For Bushwick Daily, this tension manifested in concrete operational challenges. Monitoring 75+ NYC news outlets for relevant coverage required hours of daily manual checking. An inbox processing 200+ emails daily created a triage bottleneck where important community communications were lost in promotional noise. Institutional knowledge about sources, organizations, and ongoing stories existed only in the publisher's memory, vulnerable to the cognitive limitations of a single operator.
The hypothesis driving this project was straightforward: AI could serve as a force multiplier for small newsrooms rather than a replacement for journalists. The goal was not automated content generation but automated information processing, freeing editorial attention for judgment and writing while AI handled collection, organization, and routine correspondence.
This required building AI infrastructure from scratch rather than adopting off-the-shelf solutions. Commercial AI tools optimize for general use cases; a newsroom requires domain-specific capabilities around source management, editorial voice preservation, and journalistic accuracy standards.
Development proceeded through direct engagement with production requirements. The author — with a background in investment analysis ($200M portfolio management, PE fund due diligence), a B.S. in Finance and Business Information Systems, and eight years of company operations — built the system iteratively, applying the same analytical rigor to AI architecture decisions that he previously applied to portfolio analysis and fund evaluation. Prior experience with VBA/SQL automation for institutional reporting (reducing Emerging Market Debt reporting cycles by 70% at Erie Insurance) provided a foundation for the data pipeline engineering documented in this paper.
The approach prioritized:

- Human-in-the-loop design: AI proposes, humans approve. No automated actions without explicit human checkpoints.
- Editorial integrity: Voice preservation, source attribution, and accuracy verification built into system architecture.
- Practical utility: Features driven by real workflow pain points rather than theoretical capabilities.
- Sustainable architecture: Production-grade reliability over prototype impressiveness.
This paper covers the complete technical architecture of the Bushwick Daily AI newsroom system as of March 2026.
The development journey itself is documented, including architectural pivots, failed approaches, and lessons learned from nine months of iteration.
Five principles guided architectural decisions:
1. Human-in-the-Loop by Default: Every automated action passes through human approval. The system proposes; humans decide. This applies to email responses, story extraction approvals, form submissions, and content publication. AI handles the mechanical work of drafting, organizing, and retrieving; humans retain editorial judgment.
2. Editorial Integrity as Architecture: Voice preservation is not a feature but a requirement. The system learns from human corrections rather than overriding editorial preferences. When AI-generated content is edited before sending, those edits become training data for future improvements.
3. Practical Over Impressive: Features address documented workflow problems rather than theoretical capabilities. The incremental crawling system exists because daily news monitoring was consuming hours; the email triage system exists because 200+ daily emails were unmanageable manually.
4. Production Reliability: Graceful degradation over brittle perfection. When individual articles cause parsing errors, the system logs and skips rather than crashing. When API calls fail, retry logic with exponential backoff ensures eventual completion. Background processing through Celery workers ensures main operations remain responsive.
5. Sustainable Operations: Infrastructure must run on available hardware within realistic budgets. The system operates on a Mac with an external SSD for PostgreSQL data, using commodity cloud APIs with cost tracking. AI operations are optimized for cost efficiency without sacrificing capability.
┌──────────────────────────────────────────────────────┐
│ DATA SOURCES │
│ 75+ News Outlets Gmail Inbox Reddit/Web │
└────────────┬─────────────┬────────────┬──────────────┘
│ │ │
┌────────────▼─────────────▼────────────▼──────────────┐
│ INGESTION LAYER │
│ Scrapy Spider Gmail API Web Scraper │
│ URL Cache Deduplication Rate Limiting │
└────────────┬─────────────┬────────────┬──────────────┘
│ │ │
┌────────────▼─────────────▼────────────▼──────────────┐
│ PROCESSING LAYER │
│ Article Extraction Email Classification │
│ Entity Extraction Participant Intelligence │
│ Chunking Embedding Generation │
└────────────┬─────────────┬────────────┬──────────────┘
│ │ │
┌────────────▼─────────────▼────────────▼──────────────┐
│ DATA WAREHOUSE │
│ PostgreSQL 17 + pgvector │
│ 1.79M Articles 218K Emails 8M+ Chunks │
│ 453K Participants 10K Sender Profiles │
│ HNSW Vector Indexes │
└────────────┬─────────────┬────────────┬──────────────┘
│ │ │
┌──────────────────────┼─────────────┼────────────┼──────────────────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ RAG │ │ NYCNews │ │ Email │ │ Memory │ │ Publish │
│ Chatbot │ │ Scanner │ │ Assistant │ │ System │ │ Pipeline │
│ │ │ │ │ │ │ │ │ │
│ Semantic │ │ Multi-Agent │ │ Triage │ │ Adaptive │ │ WordPress │
│ Search │ │ Research │ │ Drafting │ │ Learning │ │ Integration │
│ Q&A │ │ Story │ │ Task │ │ Feedback │ │ SEO Gen │
│ │ │ Extraction │ │ Orchestr. │ │ Loop │ │ Image AI │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │ │
└──────────────────────┼─────────────┼────────────┼──────────────────────┘
│ │ │
┌────────────▼─────────────▼────────────▼──────────────┐
│ HUMAN INTERFACE │
│ CLI Commands Web Dashboard API Endpoints │
│ Human-in-the-Loop Review │
└──────────────────────────────────────────────────────┘
| Layer | Technologies | Rationale |
|---|---|---|
| Data Storage | PostgreSQL 17, pgvector, Redis | Unified SQL + vector search; HNSW indexes enable sub-100ms similarity queries across millions of chunks |
| Backend | Python 3.11, FastAPI, Celery | Async processing with background tasks for compute-intensive embedding generation |
| Scraping | Scrapy, BeautifulSoup, Readability | Intelligent heuristic spider with fallback extraction chains |
| AI/LLM | Claude (Anthropic), Gemini (Google), GPT-4 (OpenAI) | Multi-model strategy: Claude for generation quality, Gemini Flash for cost-efficient classification, GPT-4 for specialized tasks |
| Embeddings | SentenceTransformers (all-MiniLM-L6-v2) | 384-dimensional vectors; lightweight model runs locally without GPU |
| Browser Automation | Playwright | RSVP form filling, screenshot capture, web interaction |
| Workflow Engine | LangGraph | Multi-agent orchestration with state persistence |
| Frontend | React, Streamlit | Development speed with Streamlit; production UI with React |
| Infrastructure | macOS, Homebrew, External SSD | Commodity hardware sufficient for production workloads |
The data warehouse centers on PostgreSQL 17 with the pgvector extension, providing unified storage for both structured metadata and high-dimensional vector embeddings. This architecture eliminates the complexity of maintaining separate vector stores while enabling joins between semantic search results and relational data.
Primary Entity Clusters:
-- Article Storage
outlets (75 active)
└── articles (1.79M records)
└── article_content (full HTML/text with tsvector search)
└── article_chunks (4.56M chunks with 384-dim embeddings)
-- Email Storage
classified_emails (218K records, 60 columns)
├── email_chunks (3.58M chunks with embeddings)
├── email_participants (453K relationship records)
│ └── sender_profiles (10K with communication patterns)
├── email_attachments (Google Drive links, SHA-256 dedup)
└── email_events (ICS calendar extraction)
-- Entity Intelligence
entities_v2 (2,580+ entities)
└── entity_mentions_v2 (cross-corpus tracking)
└── entity_aliases_v2 (disambiguation)
Vector Indexing Strategy:
HNSW (Hierarchical Navigable Small World) indexes provide approximate nearest neighbor search with sub-100ms query times across millions of vectors:
CREATE INDEX idx_email_chunks_embedding_hnsw
ON email_chunks USING hnsw (embedding vector_cosine_ops);
-- Index size: 5.8 GB for 3.58M email chunks
-- Query time: ~27ms average (verified December 2025)
The choice of vector_cosine_ops reflects the semantic nature of the search: cosine similarity captures meaning alignment regardless of vector magnitude, which is appropriate for text embeddings where we care about semantic direction rather than intensity.
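The magnitude-invariance point can be made concrete with a small sketch in plain Python (no pgvector required): scaling a vector changes its Euclidean distance to a query but leaves its cosine similarity unchanged, which is why an operator class based on cosine distance suits text embeddings.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [0.1, 0.8, 0.3]
doc = [0.2, 1.6, 0.6]   # same direction as query, twice the magnitude

# Cosine similarity ignores magnitude: same direction => similarity 1.0
assert abs(cosine_similarity(query, doc) - 1.0) < 1e-9

# Euclidean distance does not: the scaled vector sits "far" from the query
assert euclidean(query, doc) > 0.5
```

In practice SentenceTransformers embeddings are compared the same way: two chunks phrased at different lengths but about the same topic point in a similar direction, and that direction is what the HNSW index searches over.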
The article scraping system achieves 95% automation through intelligent heuristics, requiring manual configuration only for exceptional sites.
Article Detection Cascade:
def triage_page(self, response):
    """Multi-stage article detection."""
    # Stage 1: OpenGraph metadata
    if response.css('meta[property="og:type"][content="article"]'):
        yield from self.parse_article(response)
        return
    # Stage 2: Semantic container analysis
    for selector in ['article', 'main', 'div[role="main"]']:
        container = response.css(selector)
        if container:
            text_content = "".join(container.css("::text").getall())
            html_content = container.get()
            # Text-to-markup ratio filter
            if len(text_content) / len(html_content) > 0.6:
                if len(text_content) > 250:
                    yield from self.parse_article(response)
                    return
Incremental Crawling:
The UrlCacheMiddleware eliminates redundant network requests by loading all existing URLs at spider startup:
class UrlCacheMiddleware:
    def spider_opened(self, spider):
        with engine.connect() as conn:
            urls = conn.execute(
                text("SELECT url FROM articles WHERE outlet_id = :id"),
                {"id": spider.outlet_id}
            ).fetchall()
        self.scraped_urls = {self.normalize_url(row[0]) for row in urls}

    def process_request(self, request, spider):
        if self.normalize_url(request.url) in self.scraped_urls:
            raise IgnoreRequest(f"Already scraped: {request.url}")
        return None
This reduces network traffic by approximately 90% on subsequent crawls, enabling hourly news checks without server overload.
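The `normalize_url` helper is referenced but not shown; the sketch below illustrates the kind of canonicalization such a cache needs (an assumption about the actual implementation, not a copy of it): strip tracking parameters, drop fragments and trailing slashes, and lowercase the host so cosmetically different URLs collapse to one cache entry.

```python
from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

# Common tracking parameters; the production list may differ
TRACKING_PARAMS = {'utm_source', 'utm_medium', 'utm_campaign',
                   'utm_term', 'utm_content', 'fbclid', 'gclid'}

def normalize_url(url: str) -> str:
    """Canonicalize a URL for duplicate detection (illustrative sketch)."""
    parts = urlparse(url)
    # Drop tracking parameters, keep the rest in a stable sorted order
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ))
    # Lowercase the host, strip the fragment and any trailing slash
    path = parts.path.rstrip('/') or '/'
    return urlunparse((parts.scheme, parts.netloc.lower(), path, '', query, ''))

a = normalize_url('https://Example.com/story/?utm_source=tw&utm_medium=social')
b = normalize_url('https://example.com/story')
assert a == b   # both collapse to the same cache key
```

Without this step, the same article shared via a newsletter link and a social link would be scraped twice and stored twice.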
Site Override System:
For sites that defeat heuristic detection, YAML configuration provides escape hatches:
hellgatenyc.com:
article_url_pattern: '/[a-z0-9-]+/$'
body_fallback_chain:
- "article.post-content"
- "div.entry-content"
- "div.article-body"
Email processing follows a seven-step pipeline orchestrated by update_emails_v2.sh:
Step 1: Gmail API extraction with service account delegation
Step 1.1: URL extraction from email content
Step 1.25: Deterministic triage (email/notify/no/spam)
Step 1.5: Attachment extraction → Google Drive with SHA-256 dedup
Step 2: Semantic chunking with 500-char segments
Step 3: Gemini Flash classification (16 categories)
Step 3.5: Structured event extraction (ICS parsing)
Step 4: Participant intelligence (role extraction, fingerprinting)
Step 5: Story extraction for journalism workflow
Step 6: LangChain agent processing
Step 7: Entity extraction (SpaCy NER)
Hybrid Classification System:
Deterministic rules handle predictable patterns before invoking LLM classification:
class DeterministicFilter:
    # Spam TLDs with >95% spam rate
    SPAM_TLDS = {'.xyz', '.top', '.gdn', '.click', '.loan'}
    # Known newsletter domains
    NEWSLETTER_DOMAINS = {'mailchimp.com', 'substack.com', 'constantcontact.com'}
    # Financial transaction patterns
    FINANCIAL_PATTERNS = [r'paypal.*receipt', r'venmo.*paid']

    def classify(self, email) -> Optional[str]:
        # Check spam indicators
        if any(email.sender.endswith(tld) for tld in self.SPAM_TLDS):
            return 'spam'
        # Check newsletter patterns
        sender_domain = email.sender.split('@')[-1]
        if sender_domain in self.NEWSLETTER_DOMAINS:
            return 'marketing_newsletter'
        # ... additional rules
        return None  # Fall through to LLM classification
This hybrid approach preserves LLM capacity for nuanced decisions while handling obvious cases instantly, achieving approximately 10x throughput improvement over LLM-only classification.
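The hand-off between the two stages can be sketched as a simple chain; the filter and LLM below are illustrative stubs (the real system uses the DeterministicFilter above and a Gemini Flash call), but the control flow is the point: a deterministic verdict short-circuits, and only unresolved emails spend API budget.

```python
class StubFilter:
    """Toy stand-in for DeterministicFilter: flags one obvious spam TLD."""
    def classify(self, email):
        return 'spam' if email['sender'].endswith('.xyz') else None

def classify_email(email, rules, llm_classify):
    """Deterministic rules first; only unresolved emails reach the LLM."""
    label = rules.classify(email)
    if label is not None:
        return label, 'deterministic'   # instant, zero API cost
    return llm_classify(email), 'llm'   # nuanced cases only

llm_calls = []
def fake_llm(email):
    llm_calls.append(email)             # track how often the LLM is invoked
    return 'community_event'

rules = StubFilter()
assert classify_email({'sender': 'promo@deals.xyz'}, rules, fake_llm) == ('spam', 'deterministic')
assert classify_email({'sender': 'maria@nyc.gov'}, rules, fake_llm) == ('community_event', 'llm')
assert len(llm_calls) == 1              # the obvious spam never touched the LLM
```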
Deduplication:
Content fingerprinting using SHA-256 prevents duplicate storage:
def generate_fingerprint(self, email):
    # Normalize content
    content = self.normalize_tracking_params(email.body_text)
    content = self.normalize_whitespace(content)
    # Generate hash
    return hashlib.sha256(content.encode('utf-8')).hexdigest()
Referential Integrity:
CASCADE deletes maintain consistency:
ALTER TABLE article_chunks
ADD CONSTRAINT fk_article
FOREIGN KEY (article_id) REFERENCES articles(id) ON DELETE CASCADE;
Encoding Safety:
UTF-8 sanitization prevents PostgreSQL errors from null bytes:
clean_text = text.encode('utf-8', 'ignore').decode('utf-8')
clean_text = clean_text.replace('\x00', '') # Remove null bytes
The NYCNewsScanner employs a multi-agent architecture using the Claude Agent SDK to parallelize news research across diverse source types.
┌─────────────────────┐
│ News Coordinator │
│ (Orchestrator) │
└──────────┬──────────┘
┌───────────────────────┼───────────────────────┐
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ DB Researcher │ │ Email Researcher │ │ Web Researcher │
│ (7 outlets) │ │ (Press/News) │ │ (Reddit/Web) │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │
└───────────────────────┴───────────────────────┘
▼
┌──────────────────────┐
│ Story Extractor │
│ (5W Analysis) │
└──────────┬───────────┘
▼
┌──────────────────────┐
│ Memory Integration │←─── Quality Gate
│ (4-Level Scope) │←─── Feedback Parser
└──────────┬───────────┘
▼
┌──────────────────────┐
│ Pitch Generation │
│ + Human Review │
└──────────────────────┘
Source Coverage:
Each extracted story receives a newsworthiness score based on multiple factors:
def calculate_newsworthiness(self, story, entities, email_context):
    score = 0.0
    # Geographic proximity to Bushwick
    bushwick_entities = ['Bushwick', 'Community Board 4', 'BK90']
    for entity in entities:
        if entity in bushwick_entities:
            score += 0.3
    # Known community figures
    if self.entity_db.is_known_figure(story.who):
        score += 0.2
    # Temporal urgency
    if story.when and story.when < datetime.now() + timedelta(days=7):
        score += 0.2
    # Source credibility
    if email_context.sender_profile.is_government:
        score += 0.15
    elif email_context.sender_profile.is_journalist:
        score += 0.1
    return min(score, 1.0)
Stories scoring below threshold are filtered before human review, focusing editorial attention on genuinely newsworthy content.
The memory system implements a four-level scope hierarchy based on research into production memory architectures (Mem0, Letta/MemGPT, academic literature):
Scope Levels:
Memory Storage:
CREATE TABLE am_memories (
id SERIAL PRIMARY KEY,
scope_type VARCHAR CHECK (scope_type IN ('global', 'org', 'project', 'sender')),
scope_id VARCHAR, -- NULL for global, email for sender, org_id for org
content TEXT NOT NULL,
embedding VECTOR(384),
memory_type VARCHAR, -- 'preference', 'fact', 'procedure', 'relationship'
confidence FLOAT DEFAULT 1.0,
times_applied INTEGER DEFAULT 0,
times_led_to_acceptance INTEGER DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW(),
last_accessed TIMESTAMPTZ
);
Memory Retrieval:
Hybrid search using Reciprocal Rank Fusion combines vector similarity with full-text matching:
CREATE FUNCTION search_memories(query_text TEXT, query_embedding VECTOR, scope_filter JSONB)
RETURNS TABLE (memory_id INT, content TEXT, score FLOAT) AS $$
-- Vector similarity component
WITH vector_results AS (
    SELECT id, content,
           ROW_NUMBER() OVER (ORDER BY embedding <=> query_embedding) AS vrank
    FROM am_memories
    WHERE scope_type IN (SELECT jsonb_array_elements_text(scope_filter->'scopes'))
),
-- Full-text component
text_results AS (
    SELECT id, content,
           ROW_NUMBER() OVER (
               ORDER BY ts_rank(search_vector, plainto_tsquery(query_text)) DESC
           ) AS trank
    FROM am_memories
    WHERE search_vector @@ plainto_tsquery(query_text)
)
-- RRF combination: reciprocal ranks with k=60; a 1000 sentinel discounts missing ranks
SELECT COALESCE(v.id, t.id),
       COALESCE(v.content, t.content),
       (1.0 / (60 + COALESCE(v.vrank, 1000)) + 1.0 / (60 + COALESCE(t.trank, 1000))) AS score
FROM vector_results v
FULL OUTER JOIN text_results t ON v.id = t.id
ORDER BY score DESC;
$$ LANGUAGE sql;
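The RRF arithmetic is easier to verify in plain Python. With the standard k=60 constant, an item ranked first by both retrievers scores 2/61, while an item found by only one list gets a sentinel rank (1000 here) for the missing side, heavily discounting it; agreement across retrievers therefore outranks a single strong signal.

```python
def rrf_score(vector_rank=None, text_rank=None, k=60, missing=1000):
    """Reciprocal Rank Fusion of two rankings (same formula as the SQL)."""
    v = vector_rank if vector_rank is not None else missing
    t = text_rank if text_rank is not None else missing
    return 1.0 / (k + v) + 1.0 / (k + t)

# Top-ranked in both lists beats top-ranked in just one
both_first = rrf_score(vector_rank=1, text_rank=1)     # 2/61 ≈ 0.0328
vector_only = rrf_score(vector_rank=1, text_rank=None) # ≈ 0.0173
assert both_first > vector_only

# Even ranked 3rd in both lists still beats 1st in only one
assert rrf_score(3, 3) > rrf_score(1, None)
```

This is why hybrid retrieval helps memory lookup: a memory that is both semantically close and contains the query's literal keywords rises above memories matching on only one axis.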
New memories form through the AUDN (Add/Update/Delete/Noop) cycle in reconciler.py:
1. FeedbackParser classifies edit type (tone, factual, intent, structural)
2. QualityGate filters noise (minimum 10% magnitude, 60% confidence, 5-minute debounce)

class QualityGate:
    def should_process(self, signal: FeedbackSignal) -> bool:
        # Minimum edit magnitude
        if signal.edit_ratio < 0.10:
            return False
        # Confidence threshold
        if signal.confidence < 0.60:
            return False
        # Debounce rapid edits
        if signal.seconds_since_last < 300:
            return False
        return True
When generating content or responses, the system retrieves relevant memories using scope-based boosting:
def retrieve_memories(self, context):
    base_memories = self.search_memories(context.query)
    # Apply scope boosting
    for memory in base_memories:
        if memory.scope_type == 'sender' and memory.scope_id == context.sender:
            memory.score *= 1.3  # Sender memories most relevant
        elif memory.scope_type == 'org' and memory.scope_id == context.org:
            memory.score *= 1.2
        elif memory.scope_type == 'project' and memory.scope_id in context.projects:
            memory.score *= 1.1
        # Global memories use base score
    return sorted(base_memories, key=lambda m: m.score, reverse=True)[:10]
The effectiveness tracker implements decay for underperforming memories:
def compute_effectiveness(self, memory):
    if memory.times_applied == 0:
        return 0.5  # Neutral default
    acceptance_rate = memory.times_led_to_acceptance / memory.times_applied
    edit_ratio = memory.total_edit_distance / memory.total_output_length
    # Weighted combination
    return 0.6 * acceptance_rate + 0.4 * (1 - edit_ratio)
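As a worked example, restating the formula as a standalone function: a memory applied 10 times, accepted 8 times, with edits touching 20% of generated output scores 0.6 × 0.8 + 0.4 × 0.8 = 0.80 and survives, while one accepted twice in 10 applications with 50% edit churn scores 0.6 × 0.2 + 0.4 × 0.5 = 0.32 and becomes a pruning candidate.

```python
def effectiveness(times_applied, times_accepted, edit_distance, output_length):
    """Standalone restatement of the effectiveness formula, for illustration."""
    if times_applied == 0:
        return 0.5  # neutral default for memories never yet applied
    acceptance_rate = times_accepted / times_applied
    edit_ratio = edit_distance / output_length
    return 0.6 * acceptance_rate + 0.4 * (1 - edit_ratio)

good = effectiveness(times_applied=10, times_accepted=8,
                     edit_distance=200, output_length=1000)
weak = effectiveness(times_applied=10, times_accepted=2,
                     edit_distance=500, output_length=1000)

assert abs(good - 0.80) < 1e-9   # 0.6*0.8 + 0.4*0.8
assert abs(weak - 0.32) < 1e-9   # 0.6*0.2 + 0.4*0.5
assert effectiveness(0, 0, 0, 1) == 0.5
```

The 0.6/0.4 weighting means a memory that keeps getting applied but always needs heavy rewriting decays even if drafts are nominally accepted.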
Memories with effectiveness below threshold become candidates for automatic pruning via the ea memory prune CLI command.
The reflection system was redesigned after research revealed that "reflection without external feedback degrades performance." The current architecture uses deterministic parsing BEFORE any LLM involvement:
The Anti-Pattern (Avoided):
# DON'T: Send raw edits directly to LLM for reflection
def learn_from_edit(original, edited):
    prompt = f"What preference does this edit reveal?\nOriginal: {original}\nEdited: {edited}"
    return llm.complete(prompt)  # Degrades over time
The Implemented Pattern:
# DO: Deterministic parsing, quality gate, then targeted LLM extraction
def learn_from_edit(original, edited):
    # Step 1: Deterministic classification
    diff = difflib.SequenceMatcher(a=original, b=edited)
    edit_type = classify_edit_type(diff)  # tone, factual, intent, etc.
    # Step 2: Quality gate
    if not quality_gate.should_process(edit_type, diff.ratio()):
        return None
    # Step 3: Targeted LLM extraction (only for qualified signals)
    return extract_learnable_preference(original, edited, edit_type)
The quality gate prevents several documented failure modes:

- Noise pollution: Typo fixes don't become "preferences"
- Debounce protection: Rapid-fire edits don't create duplicate memories
- Magnitude filtering: Only edits changing 10%+ of content pass
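The magnitude filter can be checked with difflib directly: `SequenceMatcher.ratio()` measures similarity, so 1 − ratio serves as edit magnitude. In this sketch (sample strings are invented), a punctuation tweak falls well under the 10% floor while a substantive rewrite clears it.

```python
import difflib

def edit_magnitude(original: str, edited: str) -> float:
    """Fraction of content changed, per difflib's similarity ratio."""
    return 1.0 - difflib.SequenceMatcher(a=original, b=edited).ratio()

typo_fix = edit_magnitude(
    "Thanks for reaching out about the garden story.",
    "Thanks for reaching out about the garden story!",
)
rewrite = edit_magnitude(
    "I can meet Tuesday.",
    "Unfortunately I'm fully booked this week. Could we do next Monday instead?",
)

assert typo_fix < 0.10   # noise: filtered, never becomes a "preference"
assert rewrite > 0.10    # signal: passes the gate for preference extraction
```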
The pitch generation system produces structured outputs for editorial review:
from dataclasses import dataclass
from typing import List

@dataclass
class EditorPitch:
    pitch_id: str
    headline: str
    key_facts: List[str]
    source_attribution: str
    source_email_id: int
    newsworthiness_score: float
    suggested_angle: str
    related_articles: List[int]  # IDs of existing coverage
The HITL review interface displays pitches with full context:
┌─────────────────────────────────────────────────────────────┐
│ PITCH: City Announces New Affordable Housing Lottery │
├─────────────────────────────────────────────────────────────┤
│ Source: HPD Press Release (maria.torres@nyc.gov) │
│ Score: 0.87 (HIGH) │
│ │
│ KEY FACTS: │
│ • 150 units at 1234 Bushwick Ave │
│ • Income bands: 40%, 60%, 80% AMI │
│ • Applications open March 20 │
│ │
│ RELATED COVERAGE: 3 prior articles on this development │
│ │
│ [a]pprove [r]eject [e]dit [n]ext [q]uit │
└─────────────────────────────────────────────────────────────┘
The Email Assistant implements a complete email processing and response generation system:
Gmail API → Triage → HITL Review → Orchestrator → Skills → Draft Review → Send
↓
Deterministic Pre-orchestrator Task Skill Human
Filter checkpoint Extraction Dispatch Approval
Triage System:
Four-tier classification replaces the initial binary (YES/NO) approach:
| Category | Description | Action |
|---|---|---|
| `email` | Requires response from publisher | → Orchestrator |
| `notify` | Awareness-only (press releases, FYI) | → Notification |
| `no` | Not relevant to newsroom | → Archive |
| `spam` | Blocked, flagged for filter learning | → Spam folder |
Pre-Orchestrator HITL:
A filtering checkpoint before expensive LLM calls:
$ ea hitl review --limit 50
Thread: Re: Interview Request - Community Garden
From: reporter@brooklyneagle.com
Triage: EMAIL (needs response)
Preview: Hi Alec, I'm working on a story about...
Actions: [x]spam [m]mute [o]rchestrator [n]ext [q]uit
Each spam email caught here saves 3-5 LLM calls (task extraction, classification, skill dispatch, draft generation).
Draft responses leverage multiple context sources:
Sender Profile:
profile = sender_profiles.get(sender_email)
# Returns: organization, role, is_journalist, is_government,
# communication_history, preferred_tone, last_interaction
Thread Context:
thread = email_threads.get_full_thread(thread_id)
# Returns: all messages in thread, participants, subject evolution
Adaptive Memories:
memories = memory_store.retrieve_for_context(
    sender=sender_email,
    org=profile.organization,
    projects=active_projects,
    query=email_subject
)
# Returns: ranked preferences, facts, procedures relevant to this email
Few-Shot Examples:
examples = few_shot_retriever.get_similar_responses(
    email_type=classification,
    sender_type=profile.classification,
    limit=3
)
# Returns: successful past responses to similar emails
The draft generation prompt assembles these contexts:
prompt = f"""
You are drafting a response for Alec Meeker, publisher of Bushwick Daily.
SENDER CONTEXT:
{sender_profile_summary}
THREAD CONTEXT:
{thread_history}
RELEVANT MEMORIES:
{format_memories(memories)}
SIMILAR PAST RESPONSES:
{format_examples(examples)}
WRITING GUIDELINES:
- Warm but direct tone
- Business development orientation
- NO em-dashes, NO "I hope this finds you well"
- Be specific about next steps
EMAIL TO RESPOND TO:
{current_email}
Draft a response:
"""
Draft Review Interface:
┌─────────────────────────────────────────────────────────────┐
│ DRAFT RESPONSE │
├─────────────────────────────────────────────────────────────┤
│ To: maria.torres@nyc.gov │
│ Subject: Re: Press Credential Application │
│ │
│ Hi Maria, │
│ │
│ Thanks for following up on the credential application. │
│ I've attached our updated circulation numbers and added │
│ the publication schedule you requested. │
│ │
│ Let me know if you need anything else for the review. │
│ │
│ Best, │
│ Alec │
│ │
│ [s]end [e]dit [r]evise [a]nswer [d]iscard [n]ext [q]uit │
└─────────────────────────────────────────────────────────────┘
Action Options:
| Action | Effect | Learning |
|---|---|---|
| Send | Send as-is | Positive signal: draft was perfect |
| Edit | Open in editor, send after | Diff triggers memory extraction |
| Revise | Regenerate with notes | Notes inform next attempt |
| Answer | Continue multi-turn | Clarification needed |
| Discard | Reject draft | Strong negative signal |
Two-Phase HITL for Automation:
For actions like form filling, the system implements two-phase approval:
Phase 1: Analyze form → Show proposed values → Await approval
Phase 2: Fill form → Capture screenshot → Confirm submission
This prevents automated form submission with incorrect data while reducing manual data entry.
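A minimal sketch of the two-phase gate (class and state names here are illustrative, not the production schema): execution is only reachable from an explicitly approved proposal, so nothing can be submitted on analysis alone.

```python
class TwoPhaseAction:
    """Two-phase HITL gate: propose -> human approval -> execute."""
    def __init__(self, proposed_values: dict):
        self.proposed_values = proposed_values
        self.state = 'proposed'     # Phase 1 output awaits human review

    def approve(self):
        if self.state != 'proposed':
            raise RuntimeError(f"cannot approve from state {self.state}")
        self.state = 'approved'

    def execute(self, submit):
        # Phase 2 refuses to run without an explicit human approval
        if self.state != 'approved':
            raise RuntimeError("refusing to execute unapproved action")
        submit(self.proposed_values)
        self.state = 'executed'

submitted = []
action = TwoPhaseAction({'name': 'Alec Meeker', 'outlet': 'Bushwick Daily'})

try:
    action.execute(submitted.append)    # skipping approval must fail
except RuntimeError:
    pass
assert submitted == []                  # nothing was submitted

action.approve()
action.execute(submitted.append)
assert action.state == 'executed' and len(submitted) == 1
```

Making the guard structural, rather than relying on the caller to remember a review step, is what prevents the test-data-into-a-real-form class of accident.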
Components communicate through shared PostgreSQL tables and well-defined interfaces:
Shared Data Structures:
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

# All components use a consistent email representation
# (SenderProfile and Attachment are defined elsewhere in the codebase)
@dataclass
class EmailContext:
    email_id: int
    thread_id: str
    sender_email: str
    sender_profile: Optional[SenderProfile]
    subject: str
    body_text: str
    body_html: str
    received_at: datetime
    classification: str
    participant_roles: Dict[str, str]
    attachments: List[Attachment]
Event-Driven Updates:
Content lineage tracking creates audit trails across all operations:
CREATE TABLE content_lineage_events (
id SERIAL PRIMARY KEY,
content_type VARCHAR, -- email, story, pitch, article, queue_item
content_id INTEGER,
event_type VARCHAR, -- extracted, approved, rejected, published
actor_type VARCHAR, -- system, human
actor_id VARCHAR,
parent_content_type VARCHAR,
parent_content_id INTEGER,
metadata JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
);
Complex workflow state persists to PostgreSQL rather than memory:
Session Persistence:
class SessionStore:
    def save_session(self, session_id, state):
        # Write-through: every update persists immediately
        with self.conn.cursor() as cur:
            cur.execute("""
                INSERT INTO nycnews_agent_sessions (session_id, state, updated_at)
                VALUES (%s, %s, NOW())
                ON CONFLICT (session_id) DO UPDATE SET state = %s, updated_at = NOW()
            """, (session_id, json.dumps(state), json.dumps(state)))
        self.conn.commit()

    def load_session(self, session_id):
        # Read-through: database fallback when not in memory
        with self.conn.cursor() as cur:
            cur.execute(
                "SELECT state FROM nycnews_agent_sessions WHERE session_id = %s",
                (session_id,)
            )
            result = cur.fetchone()
        return json.loads(result[0]) if result else None
Bidirectional Task-Draft Linking:
-- Draft knows its generating task
ALTER TABLE ea_email_drafts ADD COLUMN source_task_id INTEGER REFERENCES ea_tasks(id);
-- Task knows its output draft
ALTER TABLE ea_tasks ADD COLUMN generated_draft_id INTEGER REFERENCES ea_email_drafts(id);
This enables atomic state synchronization:
def sync_task_draft_status(task_id, draft_id, action):
    """Atomic status update for linked task and draft."""
    with conn.cursor() as cur:
        if action == 'send':
            cur.execute("""
                UPDATE ea_tasks SET status = 'completed', completed_at = NOW()
                WHERE id = %s;
                UPDATE ea_email_drafts SET status = 'sent', sent_at = NOW()
                WHERE id = %s;
            """, (task_id, draft_id))
        elif action == 'reject':
            cur.execute("""
                UPDATE ea_tasks SET status = 'cancelled'
                WHERE id = %s;
                UPDATE ea_email_drafts SET status = 'rejected'
                WHERE id = %s;
            """, (task_id, draft_id))
    conn.commit()
Graceful Degradation:
def process_email_with_fallbacks(email):
    try:
        # Primary path: full context enrichment
        profile = get_sender_profile(email.sender)
        memories = get_relevant_memories(email, profile)
        examples = get_few_shot_examples(email)
    except ProfileServiceError:
        # Fallback: basic context only
        profile = None
        memories = []
        examples = get_generic_examples(email.classification)
    # Always proceed with available context
    return generate_draft(email, profile, memories, examples)
Retry Logic:
@celery.task(bind=True, max_retries=3, default_retry_delay=60)
def generate_embeddings(self, article_id):
    try:
        # Embedding generation
        chunks = chunk_article(article_id)
        embeddings = model.encode(chunks)
        store_embeddings(article_id, embeddings)
    except TransientError as e:
        # Retry transient failures with exponential backoff: 60s, 120s, 240s
        raise self.retry(exc=e, countdown=60 * (2 ** self.request.retries))
    except PermanentError as e:
        # Log and skip for permanent failures
        log_problematic_article(article_id, str(e))
        return False
Recovery Scripts:
When issues occur at scale, recovery scripts identify and fix discrepancies:
def recover_missing_embeddings():
    """Find and regenerate embeddings that failed to persist."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT ce.id FROM classified_emails ce
            LEFT JOIN email_chunks ec ON ce.id = ec.email_id
            WHERE ec.email_id IS NULL
              AND ce.processed_at IS NOT NULL
        """)
        missing = cur.fetchall()
    for (email_id,) in missing:  # unpack single-column rows
        regenerate_embeddings.delay(email_id)
This system was built iteratively over nine months, with the author applying analytical frameworks from investment analysis and institutional finance to system architecture decisions. The commit history reveals the development journey:
Week 1 (July 2025): Foundation and First Failures
The initial commit established an ambitious architecture: Scrapy spider, PostgreSQL with pgvector, Celery background tasks. Within 48 hours, production reality intervened:
Commit b6cce11: "lots of updates" - Connection pool exhaustion and protocol corruption under real-world load
Commit a98e976: "pooling set to 1" - Defensive retreat, implementing log_problematic_article() for graceful failure handling
The solution emerged in commit 2b52729:
"celery workers are stable and sql alchemy has been used to implement"
Lesson: Production stability often requires abandoning clever solutions for boring, proven ones. SQLAlchemy's connection pooling solved problems that took days to debug with manual psycopg3 management.
Month 2 (August 2025): Email Intelligence Integration
Adding email processing to the RAG system revealed deeper architectural challenges:
Commit 2b1155a: "email embedding system is now working with real data and is merged with Rag"
This required HNSW index rebuilding (6+ hours) and careful handling of the embedding dimension matching between article and email chunks.
Key Insight: Unifying data models early (same embedding dimension, same chunk size) enables system-wide search without constant adaptation layers.
Month 4 (October 2025): The Entity System
Building entity extraction revealed the complexity of real-world names:
Commit 382acca: "entity extraction system created"
The three-hash architecture (SHA-256 for exact match, MD5-16 for fast lookup, Double Metaphone for fuzzy matching) emerged from production requirements: "Johnn Smith" needs to match "John Smith" without false positives.
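The fuzzy tier can be illustrated with a much-simplified phonetic key; this is a crude stand-in for Double Metaphone (which the production system would use via a dedicated library), but it shows the principle: collapse repeated letters and drop interior vowels so common misspellings converge on one key.

```python
import re

def simple_phonetic_key(name: str) -> str:
    """Crude stand-in for Double Metaphone: first letter plus
    deduplicated interior consonants, per word."""
    keys = []
    for word in name.lower().split():
        head, rest = word[0], word[1:]
        rest = re.sub(r'[aeiou]', '', rest)    # drop interior vowels
        rest = re.sub(r'(.)\1+', r'\1', rest)  # collapse repeated letters
        keys.append(head + rest)
    return ' '.join(keys)

# A doubled letter and the canonical spelling land on the same key...
assert simple_phonetic_key("Johnn Smith") == simple_phonetic_key("John Smith")
# ...while a genuinely different name does not
assert simple_phonetic_key("Jane Smith") != simple_phonetic_key("John Smith")
```

Layering this behind exact (SHA-256) and fast (MD5-16) hashes means fuzzy comparison only runs on the small candidate set that the cheaper hashes fail to resolve.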
Month 6 (December 2025): Memory System Pivot
The initial reflection system sent raw edit diffs to Claude for preference extraction. Research revealed this approach degrades over time:
From research notes: "Reflection without external feedback degrades performance. When you see a reflection system that works, you're almost always looking at a verification system in disguise."
The redesign implemented deterministic parsing BEFORE LLM involvement, with quality gates preventing noise accumulation.
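The two stages can be sketched as follows: a deterministic word-level diff extracts edit signals with no LLM involved, and a quality gate drops trivial edits before anything reaches the memory store. The signal structure and thresholds here are illustrative, not the production schema:

```python
import difflib

def parse_edit_signals(draft: str, final: str) -> list[dict]:
    """Deterministically diff the AI draft against the human-edited
    final text. No LLM is involved at this stage."""
    d_words, f_words = draft.split(), final.split()
    matcher = difflib.SequenceMatcher(None, d_words, f_words)
    signals = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        signals.append({
            "op": op,
            "removed": " ".join(d_words[i1:i2]),
            "added": " ".join(f_words[j1:j2]),
        })
    return signals

def quality_gate(signals: list[dict], min_chars: int = 3,
                 max_signals: int = 10) -> list[dict]:
    """Filter noise (tiny or excessive edits) before it can
    influence learning. Thresholds are illustrative."""
    kept = [s for s in signals
            if len(s["added"]) >= min_chars or len(s["removed"]) >= min_chars]
    return kept[:max_signals]
```

Only signals that survive the gate would be summarized by the LLM into candidate memories, keeping the reflection loop anchored to verified edits.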
Lesson: Academic research is worth reading. The Mem0 and MemGPT papers predicted exactly the failure mode we encountered.
Features evolved based on real editorial needs rather than theoretical requirements:
Email Triage Evolution:
Draft Generation Evolution:
Each iteration addressed a specific user complaint rather than anticipated requirements.
1. Incremental Processing is Essential
Systems processing 200K+ items need incremental operation. The URL cache middleware (articles) and attachments_extracted flag (emails) enable processing only new content.
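The URL cache idea reduces to a seen-set consulted before any fetch. A simplified sketch of the middleware's core logic (in production the seen-set would live in a durable store rather than process memory, and the normalization rules would be richer):

```python
class URLCache:
    """Fetch a URL only if its normalized form has not been seen before."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    @staticmethod
    def normalize(url: str) -> str:
        # Strip fragments and trailing slashes so trivially
        # different URLs dedupe to the same key.
        return url.split("#", 1)[0].rstrip("/")

    def should_fetch(self, url: str) -> bool:
        key = self.normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

The same pattern, applied per-email via the `attachments_extracted` flag, is what keeps 200K+ item pipelines incremental instead of full-rescan.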
2. Cost Tracking Enables Optimization
# Token usage tracking revealed opportunities
# Initial: ~10 seconds per email (subprocess per email)
# After single-session batch: ~6 seconds per email
# Savings: 100+ hours on 182K email backlog
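The tracking itself can be as small as a per-task ledger accumulating token counts against per-token rates. A minimal sketch (the rates below are illustrative placeholders, not actual API pricing):

```python
from dataclasses import dataclass

@dataclass
class TokenLedger:
    """Accumulates token usage per task. Rates are example values."""
    input_rate: float = 3.0 / 1_000_000    # $/input token (placeholder)
    output_rate: float = 15.0 / 1_000_000  # $/output token (placeholder)
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.input_tokens += prompt_tokens
        self.output_tokens += completion_tokens

    @property
    def cost_usd(self) -> float:
        return (self.input_tokens * self.input_rate
                + self.output_tokens * self.output_rate)
```

Totals like these are what made the subprocess-per-email overhead visible and motivated the single-session batch rewrite.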
3. Database Joins Beat Application Logic
The email classification bug that returned 0 results for newsletter queries was a JOIN issue (classified_emails.classification was always 'unclassified'; the real data was in email_pipeline_routes.pipeline_type). Fixing the JOIN unlocked 337 emails instantly.
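The bug can be reproduced in miniature. The schema below is simplified to the two columns at issue, with SQLite standing in for PostgreSQL: filtering on `classified_emails.classification` returns nothing because that column is always `'unclassified'`, while joining through `email_pipeline_routes.pipeline_type` finds the rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE classified_emails (id INTEGER PRIMARY KEY, classification TEXT);
CREATE TABLE email_pipeline_routes (email_id INTEGER, pipeline_type TEXT);
INSERT INTO classified_emails VALUES (1, 'unclassified'), (2, 'unclassified');
INSERT INTO email_pipeline_routes VALUES (1, 'newsletter'), (2, 'press_release');
""")

# Broken query: filters on the column that is always 'unclassified'.
broken = conn.execute(
    "SELECT id FROM classified_emails WHERE classification = 'newsletter'"
).fetchall()

# Fixed query: JOIN through the routing table where the real label lives.
fixed = conn.execute("""
    SELECT ce.id
    FROM classified_emails ce
    JOIN email_pipeline_routes r ON r.email_id = ce.id
    WHERE r.pipeline_type = 'newsletter'
""").fetchall()
```

Pushing the label lookup into the JOIN also means the database does the filtering, rather than application code re-deriving classifications it already stored.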
4. Human-in-the-Loop Reduces Risk
Every automated action we considered implementing eventually got a HITL checkpoint. The two-phase form filling pattern emerged from a near-miss where test data would have been submitted to a real form.
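The two-phase pattern can be sketched as: phase 1 renders a preview of exactly what would be submitted, and phase 2 (the real submission, via Playwright in production) runs only after explicit human approval. The `approve` callable here is hypothetical; in production it is an interactive review prompt:

```python
def submit_form_two_phase(fields: dict, approve) -> str:
    """Two-phase HITL form submission (sketch).

    `approve` is any callable taking the preview text and returning
    True/False (hypothetical stand-in for an interactive prompt).
    """
    # Phase 1: show the human exactly what would be submitted.
    preview = "\n".join(f"{name}: {value}" for name, value in fields.items())
    if not approve(preview):
        return "aborted"   # nothing has touched the real form
    # Phase 2: only now perform the actual browser submission.
    return "submitted"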
5. Checkpoint Commits Enable Recovery
The commit messages "pre changes before X" and "pre-refactor" represent deliberate checkpoints before risky changes. When refactors go wrong, these enable clean rollback.
Worked Well:
Didn't Work:
Still Evolving:
Quantitative Metrics:
| Metric | Before | After | Improvement |
|---|---|---|---|
| News source monitoring | 2-3 hours/day | 15 minutes/day | 90% reduction |
| Email triage time | 1-2 hours/day | 20 minutes/day | 80% reduction |
| Article research time | 30-60 min/story | 5-10 min/story | 80% reduction |
| Response drafting | 10-15 min/email | 2-3 min/email | 80% reduction |
| Event calendar updates | Manual entry | Semi-automated | Hours saved weekly |
System Scale:
The system enables capabilities previously unavailable to a small newsroom:
Comprehensive Competitive Intelligence:
Institutional Memory:
Source Relationship Management:
Automated Event Tracking:
For local news sustainability, the implications are significant:
Cost Structure:
Scalability:
Replicability:
The use of AI in journalism raises legitimate concerns that this system addresses through architectural choices:
Human Judgment Preserved:
Accuracy Safeguards:
Question tool lets the model ask for information rather than guess
Voice Authenticity:
Transparency:
Email Privacy:
Source Protection:
AI Assistance Disclosure:
Short-term (2026 Q2):
Medium-term (2026 H2):
Long-term:
The architecture supports scaling in several dimensions:
Horizontal Scaling:
Multi-Newsroom:
Data Growth:
Technical:
Editorial:
Sustainability:
This white paper documents nine months of building AI infrastructure for a hyperlocal newsroom. The resulting system demonstrates that sophisticated AI capabilities are achievable for small news operations, potentially transforming the economics of local journalism.
The technical achievement is significant: 255 GB of structured news and email data, 8 million searchable semantic chunks, sub-100ms query performance, and integrated AI agents that monitor, extract, draft, and orchestrate across diverse information sources.
But the more important achievement is the design philosophy embedded in the architecture: human-in-the-loop by default, editorial integrity as a requirement, and practical utility over impressive capability. AI serves as a force multiplier, not a replacement. The system proposes; humans decide.
For local journalism, the implications extend beyond individual newsroom efficiency. If AI can enable one publisher to maintain coverage depth previously requiring dedicated staff, similar systems could revive local news coverage in communities currently without any journalism presence. The technology is replicable, the costs are manageable, and the architecture is documented.
The local news crisis is real and deepening. This system represents one response: building tools that make small newsrooms more capable rather than waiting for business models that may never materialize. AI infrastructure won't solve every problem facing local journalism, but it can address the operational bottlenecks that consume editorial attention and prevent the deep community engagement that local news requires.
The code exists. The documentation exists. The path forward is clearer than it was nine months ago.
Core Technologies:
| Category | Technology | Version | Purpose |
|---|---|---|---|
| Database | PostgreSQL | 17 | Primary data store |
| Vector Search | pgvector | 0.7.0 | Embedding storage and similarity search |
| Cache | Redis | 7.x | Task queue broker, session cache |
| Backend | Python | 3.11 | Primary language |
| Web Framework | FastAPI | 0.128.0 | API endpoints |
| Task Queue | Celery | 5.x | Background processing |
| Scraping | Scrapy | 2.x | Web scraping framework |
| Browser Automation | Playwright | 1.x | Form filling, screenshots |
| Workflow | LangGraph | 0.x | Multi-agent orchestration |
| ML/NLP | SpaCy | 3.x | Named entity recognition |
| Embeddings | SentenceTransformers | 2.x | all-MiniLM-L6-v2 model |
LLM APIs:
| Provider | Model | Use Case |
|---|---|---|
| Anthropic | Claude Sonnet 4 | Draft generation, content analysis |
| Anthropic | Claude Haiku | Image analysis, fast classification |
| Google | Gemini 2.0 Flash Lite | Email classification (cost-efficient) |
| OpenAI | GPT-4 | Specialized tasks, comparison |
Infrastructure:
| Component | Specification |
|---|---|
| Hardware | Mac Mini M2 / MacBook Pro |
| Storage | Samsung T7 SSD (1TB) for PostgreSQL data |
| OS | macOS 14.x |
| Package Manager | Homebrew |
| Python Environment | venv |
Master-Scrape-Rag-Pipe/
├── news_scraper_project/ # Article ingestion
│ ├── spiders/article_spider.py # Intelligent heuristic spider
│ ├── pipelines.py # PostgreSQL atomic writes
│ ├── tasks.py # Celery embedding tasks
│ └── middlewares.py # URL cache middleware
│
├── email_assistant/ # Email processing CLI
│ ├── cli.py # Main entry point
│ ├── orchestration/ # Task extraction & routing
│ │ ├── orchestrator.py # Central dispatcher
│ │ ├── task_extractor.py # LLM-based extraction
│ │ └── task_classifier.py # Capability-aware matching
│ ├── drafting/ # Response generation
│ │ ├── draft_response.py # Context-enriched drafting
│ │ └── context_enrichment.py # RAG context injection
│ ├── memory/ # Adaptive learning
│ │ ├── store.py # Memory persistence
│ │ ├── feedback_parser.py # Deterministic parsing
│ │ └── quality_gate.py # Signal validation
│ ├── skills/ # Automation capabilities
│ │ ├── rsvp_skill.py # Form filling
│ │ ├── calendar_skill.py # Event management
│ │ └── smart_form_skill.py # Intelligent form analysis
│ └── rsvp/ # Browser automation
│ └── playwright_controller.py
│
├── rag_chatbot/ # RAG interface
│ ├── app.py # FastAPI application
│ ├── ai_generator.py # Claude API integration
│ ├── vector_store_pg.py # pgvector search
│ └── NYCNewsAgent/ # Research agent system
│ ├── research_agent/ # Multi-agent research
│ │ ├── agent.py # Claude Agent SDK
│ │ └── email_story_extractor.py
│ ├── publish_pipeline/ # WordPress integration
│ │ ├── metadata_extractor.py
│ │ ├── seo_generator.py
│ │ └── image_analyzer.py
│ └── api/ # Workflow APIs
│ ├── workflow_api.py
│ └── lineage/ # Content traceability
│
├── entity_extraction/ # Entity intelligence
│ ├── spacy_entity_pipeline.py # NER pipeline
│ └── entity_disambiguation.py # Alias management
│
├── attachment_extraction/ # Email attachments
│ ├── drive_uploader.py # Google Drive streaming
│ └── calendar_parser.py # ICS parsing
│
├── event_extraction/ # Calendar events
│ └── extract_events_from_emails.py
│
├── migrations/ # Database schemas
│ └── 001-068_*.sql
│
├── update_emails_v2.sh # Email pipeline orchestrator
├── run_crawls.py # Article scraping orchestrator
└── CLAUDE.md # AI assistant instructions
Development Phases:
| Phase | Period | Focus |
|---|---|---|
| Foundation | July 2025 | Scrapy, PostgreSQL, Celery stabilization |
| Email Intelligence | Aug-Sep 2025 | Gmail integration, entity extraction, participant tracking |
| Advanced Features | Oct-Dec 2025 | LangGraph workflows, memory system, reflection |
| Production Polish | Jan-Mar 2026 | HITL review, task orchestration, publishing pipeline |
Database Scale:
| Table | Records | Size |
|---|---|---|
| articles | 1.79M | 15 GB |
| article_chunks | 4.56M | 45 GB |
| classified_emails | 218K | 8 GB |
| email_chunks | 3.58M | 35 GB |
| sender_profiles | 10K | 200 MB |
| email_participants | 453K | 2 GB |
| HNSW indexes | 3 | 20 GB |
| Total | | ~255 GB |
AUDN: Add/Update/Delete/Noop - memory reconciliation operations determining how new learning integrates with existing memories
Chunk: A semantic segment of text (typically 500 characters) with associated vector embedding for similarity search
Deterministic Filter: Rule-based classification that handles obvious cases without LLM invocation, preserving AI capacity for nuanced decisions
Few-Shot Learning: Providing examples of desired outputs in prompts to guide model behavior toward publication-specific patterns
HITL: Human-in-the-Loop - design pattern ensuring human approval before automated actions
HNSW: Hierarchical Navigable Small World - approximate nearest neighbor algorithm enabling fast vector similarity search
Memory Scope: The hierarchical level at which a learned preference applies (global, organization, project, sender)
Participant Intelligence: System for tracking email participants across the corpus, enabling relationship mapping and communication pattern analysis
pgvector: PostgreSQL extension adding vector data type and similarity search operators
Quality Gate: Validation layer that filters low-quality signals before they can influence learning systems
RAG: Retrieval-Augmented Generation - pattern combining information retrieval with LLM generation
RRF: Reciprocal Rank Fusion - technique for combining multiple ranking signals (e.g., vector similarity + text search)
Sender Profile: Aggregated intelligence about an email sender including organization, role, communication patterns, and relationship history
Semantic Search: Finding content by meaning rather than exact keyword matching, enabled by vector embeddings
tsvector: PostgreSQL's full-text search data type for efficient text matching
White paper compiled from 321 commits over 9 months of development.
Data warehouse: 255 GB (1.79M articles, 218K emails, 8M+ chunks).
Last updated: March 2026
For questions about this system, potential collaboration, or career opportunities, contact:
Alec Meeker
alec@bushwickdaily.com
LinkedIn: linkedin.com/in/alecmeeker
GitHub: alecmeeeker.github.io