Part of: Building an AI-Powered Local Newsroom

NYCNewsScanner & Publisher System


Executive Summary

The NYCNewsScanner & Publisher System is an AI-powered news monitoring and content production pipeline that transforms passive data collection into active newsroom intelligence. The system employs a multi-agent architecture built on the Claude Agent SDK to scan 7+ major NYC news outlets and 200K+ emails, extracting newsworthy stories with structured 5W analysis (Who, What, Where, When, Why). Two components make the system particularly advanced: an adaptive memory layer, informed by Mem0 research, that enables institutional learning across four scope levels (sender, organization, project, global), and a reflection system whose quality gates prevent the well-documented degradation that occurs when AI systems learn from raw feedback. Together, these components create an AI newsroom assistant that improves over time while maintaining editorial control.


Section 1: Technical Architecture

System Overview:

The NYCNewsScanner operates as a hub-and-spoke architecture where a news coordinator agent orchestrates parallel research agents that query both internal databases (1.5M+ articles, 200K+ emails) and external web sources. Extracted stories flow through a pitch generation system, then into a publishing pipeline that produces WordPress-ready payloads. Throughout this workflow, the Memory System captures institutional knowledge about sources, organizations, and ongoing projects, while the Reflection Engine validates AI outputs through deterministic quality gates before any learning occurs.

                                 ┌─────────────────────┐
                                 │  News Coordinator   │
                                 │  (Orchestrator)     │
                                 └──────────┬──────────┘
                    ┌───────────────────────┼───────────────────────┐
                    ↓                       ↓                       ↓
         ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
          │  DB Researcher   │    │ Email Researcher │    │  Web Researcher  │
          │  (7 outlets)     │    │  (Press/News)    │    │  (Reddit/Web)    │
         └────────┬─────────┘    └────────┬─────────┘    └────────┬─────────┘
                  │                       │                       │
                  └───────────────────────┴───────────────────────┘
                                          ↓
                               ┌──────────────────────┐
                               │  Story Extractor     │
                               │  (5W Analysis)       │
                               └──────────┬───────────┘
                                          ↓
                               ┌──────────────────────┐
                               │  Memory Store        │←─── Quality Gate
                               │  (4-Level Scope)     │←─── Feedback Parser
                               └──────────┬───────────┘
                                          ↓
                               ┌──────────────────────┐
                               │  Publish Pipeline    │
                               │  (WordPress Ready)   │
                               └──────────────────────┘

Core Technologies:

Component           Technology                       Purpose
Scanner             Claude Agent SDK + LangGraph     Multi-agent orchestration for parallel news research
Memory System       PostgreSQL + pgvector (HNSW)     Adaptive memory with hybrid semantic/keyword search
Reflection Engine   Deterministic Python (difflib)   Quality gates and feedback parsing without LLM involvement
Publisher           FastAPI + WordPress REST API     Article assembly with SEO generation and S3 image handling

Scanning & Analysis Pipeline:

  1. Source Monitoring: The database researcher queries 7 major outlets (Gothamist, Hell Gate, QNS, THE CITY, Bklyner, Brooklyn Paper, amNY) using deterministic PostgreSQL queries against the article corpus. The email researcher scans 215+ newsletters and 122+ press releases classified via Gemini Flash. The web researcher monitors r/Bushwick and conducts targeted web searches for hyperlocal content.

  2. Content Extraction: Articles and emails are pre-chunked into ~500-byte semantic segments with 384-dimensional embeddings (all-MiniLM-L6-v2). The story extractor analyzes these chunks using Claude to identify the 5W's: Who (entities involved), What (the news event), Where (geographic relevance to Bushwick), When (timing/urgency), and Why (significance).

  3. Relevance Scoring: Each extracted story receives a newsworthiness score based on geographic proximity to Bushwick, entity recognition matches with known community figures, and temporal relevance. Stories scoring below threshold are filtered before human review.

  4. Entity Recognition: The system maintains a dynamic entity database of 2,580+ entities including email senders, government officials, and organizations. Entity extraction uses SpaCy NER combined with Aho-Corasick pattern matching for O(n) performance across thousands of patterns.

  5. Context Enrichment: When drafting content or responses, the Memory System retrieves relevant memories using Reciprocal Rank Fusion across four scopes, providing historical context about sources, organizational relationships, and ongoing project involvement.
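The pre-chunking in step 2 can be sketched as a sentence-packing splitter with a ~500-byte budget; the embedding model (all-MiniLM-L6-v2) would then run on each returned chunk. The function name and sentence-boundary heuristic below are illustrative, not taken from the codebase:

```python
import re

def chunk_text(text: str, max_bytes: int = 500) -> list[str]:
    """Pack whole sentences into segments of at most ~max_bytes (UTF-8),
    so each chunk stays a coherent semantic unit for embedding.
    A single sentence longer than the budget becomes its own chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        candidate = f"{current} {sent}".strip()
        if current and len(candidate.encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence boundaries rather than raw byte offsets keeps each embedded segment self-contained, which matters for retrieval quality downstream.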
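Step 4 names Aho-Corasick for scanning thousands of entity patterns in O(n). A minimal self-contained automaton in that spirit might look like the following; the class and method names are illustrative, and the real system pairs this with SpaCy NER rather than using pattern matching alone:

```python
from collections import deque

class AhoCorasick:
    """Multi-pattern matcher: a trie with failure links, so scanning a
    document costs O(len(text) + matches) regardless of pattern count."""

    def __init__(self, patterns):
        self.goto = [{}]   # per-node transition tables
        self.fail = [0]    # failure links
        self.out = [[]]    # patterns ending at each node
        for pat in patterns:
            self._insert(pat)
        self._build_failure_links()

    def _insert(self, pat):
        node = 0
        for ch in pat:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(pat)

    def _build_failure_links(self):
        # BFS guarantees a node's failure link is set before its children's.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit matches reachable through the failure link.
                self.out[child] += self.out[self.fail[child]]

    def search(self, text):
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pat in self.out[node]:
                hits.append((i - len(pat) + 1, pat))  # (start offset, pattern)
        return hits
```

The one-time trie build is what buys the linear scan: every character of the document advances the automaton exactly once, no matter how many entity names are loaded.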
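The Reciprocal Rank Fusion in step 5 reduces to a few lines. This sketch assumes one ranked result list per scope (sender, organization, project, global); k=60 is the conventional constant from the RRF literature, not necessarily the system's setting:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked result lists into one ranking.
    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked well in multiple scopes rise to the top."""
    scores = {}
    for results in ranked_lists:
        for rank, item in enumerate(results, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it fuses results from heterogeneous retrievers (semantic vs. keyword, different scopes) without any score normalization.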

Memory System Architecture:

Reflection Integration:

Publishing Pipeline:

The publish_pipeline/ module handles the transition from journalist-generated content to WordPress-ready payloads:

  1. MetadataExtractor: Parses article markdown, extracts structured metadata
  2. SEOGenerator: LLM-based SEO title, description, and tag generation
  3. ImageAnalyzer: Claude Vision integration for image captioning and alt text
  4. S3UploadService: Direct S3 uploads for CDN distribution
  5. InlineImageProcessor: Extracts inline images and converts to CDN URLs
  6. PublishAssembler: Final JSON assembly with validation (title <110 chars, slug <60 chars, 3-10 tags, content >100 words)
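The PublishAssembler's validation rules listed above reduce to a handful of deterministic checks. A sketch, with payload field names assumed rather than taken from the codebase:

```python
def validate_payload(payload: dict) -> list[str]:
    """Return validation errors for a WordPress-ready payload;
    an empty list means the payload passes the assembler's gates."""
    errors = []
    if len(payload.get("title", "")) >= 110:
        errors.append("title must be under 110 characters")
    if len(payload.get("slug", "")) >= 60:
        errors.append("slug must be under 60 characters")
    if not 3 <= len(payload.get("tags", [])) <= 10:
        errors.append("need between 3 and 10 tags")
    if len(payload.get("content", "").split()) <= 100:
        errors.append("content must exceed 100 words")
    return errors
```

Returning a list of errors rather than raising on the first failure lets an editor see every problem with a payload in one review pass.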

Key Technical Achievements:


Section 2: Features & Standards

Core Capabilities:

  1. Intelligent News Discovery: The parallel researcher architecture (database + email + web) surfaces stories that exist across multiple source types. A press release in the email corpus might be enriched by related Reddit discussion and prior coverage in the article database. This cross-referencing catches stories a single-source scan would miss.

  2. Contextual Understanding: When drafting responses to a source, the system retrieves not just sender-level memories ("Maria prefers informal tone") but also organization-level context ("HPD requires form XYZ for FOIL requests") and project-level notes ("Bushwick rezoning investigation - Maria is sympathetic"). This multi-scope retrieval provides depth that single-level systems lack.

  3. Self-Improving Analysis: The effectiveness tracker deprioritizes or prunes memories associated with poor outcomes. A tone preference that consistently leads to rejected drafts decays in confidence, while memories tied to successful drafts strengthen. This creates a virtuous cycle that requires no human intervention in memory management.

  4. Editorial Assistance: The system assists rather than replaces journalists. Story extraction produces structured pitches that require human approval. Draft generation surfaces learned preferences but presents them for editor review. The HITL (Human-in-the-Loop) signaling system blocks automated actions until human decisions are recorded.

Standards & Best Practices:

Evolution & Learning:

The developer blog reveals a critical pivot in the memory system design. The initial implementation (commit 990af0f) sent raw edit diffs directly to Claude for reflection. After a research phase that produced 15,000+ lines of documentation analyzing Mem0, Letta/MemGPT, and academic memory systems, the team discovered that "reflection without external feedback degrades performance."

The redesigned system (commit 3b7e2ae) implements deterministic parsing BEFORE any LLM involvement. The FeedbackParser uses Python's difflib.SequenceMatcher to classify edit types (tone, factual, intent, structural, complete_rewrite, minor) without any AI inference. Only signals that pass the quality gate proceed to LLM memory extraction.
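A deterministic classifier in this spirit can be built directly on difflib.SequenceMatcher. The thresholds and the reduced label set below are illustrative, not the FeedbackParser's actual values:

```python
import difflib

def classify_edit(draft: str, edited: str) -> str:
    """Classify a human edit by how much of the draft survived,
    using only string similarity -- no LLM involved."""
    ratio = difflib.SequenceMatcher(None, draft, edited).ratio()
    if ratio > 0.95:
        return "minor"
    if ratio > 0.6:
        return "structural"   # real thresholds/labels are assumptions
    return "complete_rewrite"
```

Because the classification is pure string comparison, it is cheap, reproducible, and immune to the feedback-loop degradation that motivated the redesign: only edits that pass this gate ever reach the LLM for memory extraction.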

This "ship, learn, rebuild better" pattern reflects mature engineering practice: the team established through systematic research that its initial intuition was wrong, then rebuilt on a production-grade architecture.


Section 3: Impact on News Operations

Journalism Capabilities Unlocked:

  1. Comprehensive Morning Briefings: Instead of manually checking 7 news websites, journalists receive a synthesized summary with hyperlinks to all relevant articles from the past 48 hours. The system organizes content by topic and flags stories unique to specific outlets.

  2. Email Intelligence Integration: Press releases from government agencies, newsletters from community organizations, and Google Alerts are now searchable alongside published articles. The dramatic improvement in email query results (from 0 to 342+ results with extractable URLs) transformed email from a broken feature into a valuable intelligence source.

  3. Source Relationship Continuity: When a new press contact takes over at a city agency, organization-level memories about how that agency operates persist. The system remembers FOIL request procedures, preferred communication channels, and historical interaction patterns even through staff turnover.

  4. Automated Event Pipeline: Calendar events extracted from emails flow into structured records with title, date, location, and description. What was a manual transcription process becomes a review-and-approve workflow.

The Memory Advantage:

Institutional memory matters for local news because community journalism depends on accumulated knowledge. When a journalist leaves, they take years of source relationships and procedural knowledge with them. The 4-level scope hierarchy (sender, organization, project, global) captures this knowledge at the appropriate level.

This memory layer means new journalists can draft appropriate responses to established sources because the system surfaces relevant context automatically.

The Reflection Advantage:

Self-assessment matters for quality journalism because AI assistance must improve over time without degrading accuracy. The quality gate guards against several documented failure modes, chief among them the degradation that occurs when a system learns from raw, unvalidated feedback.

The effectiveness tracker then validates whether stored memories actually improve outcomes. A memory about "use formal tone with government contacts" that consistently leads to approved drafts strengthens; one that leads to rejections weakens.
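The strengthen/weaken dynamic described above amounts to a bounded confidence update. One minimal sketch, where the update rule, learning rate, and pruning threshold are assumptions rather than the system's actual formula:

```python
def update_confidence(confidence: float, approved: bool,
                      lr: float = 0.1) -> float:
    """Move a memory's confidence toward 1.0 on an approved draft and
    toward 0.0 on a rejection. Repeated failures decay confidence
    geometrically until it crosses a pruning threshold (e.g. 0.2)."""
    target = 1.0 if approved else 0.0
    return confidence + lr * (target - confidence)
```

The exponential form keeps confidence in [0, 1] and makes recent outcomes weigh more than old ones, so a memory that was once useful but has gone stale is eventually pruned.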

Workflow Integration:

Journalists interact with the system through multiple interfaces:

  1. CLI Commands: ea memory list, ea memory build-profile <email>, ea memory org --create for direct memory management
  2. Web Dashboard: Real-time SSE progress tracking during story extraction workflows
  3. HITL Review: Click-to-select interface for approving extracted stories and pitches with visual feedback
  4. Publishing Pipeline: Structured JSON output compatible with existing WordPress workflow

The system integrates into the update_emails_v2.sh pipeline as Steps 5-7 (story extraction, LangChain agent processing, entity extraction), running automatically after new emails are processed.

Mission Alignment:

The NYCNewsScanner & Publisher System embodies the values of good local journalism while making it sustainable:

The system costs approximately $0.15-0.25 per comprehensive research report versus hours of human research time, making capabilities previously available only to well-resourced newsrooms accessible to a small local publication.


Report compiled from 321 commits spanning July 2025 to March 2026.
Sources: Developer Blog, Systems Manifest, Codebase Analysis