Part of: Building an AI-Powered Local Newsroom

NYCNewsScanner & Publisher System


Executive Summary

The NYCNewsScanner & Publisher System is an AI-powered news monitoring and content production pipeline that transforms passive data collection into active newsroom intelligence. The system employs a multi-agent architecture built on the Claude Agent SDK to scan 7+ major NYC news outlets and 200K+ emails, extracting newsworthy stories with structured 5W analysis (Who, What, Where, When, Why). Two components make the system particularly advanced: an adaptive memory layer, informed by Mem0 research, that enables institutional learning across four scope levels (sender, organization, project, global), and a reflection system whose quality gates prevent the well-documented degradation that occurs when AI systems learn from raw feedback. Together, these components create an AI newsroom assistant that improves over time while maintaining editorial control.


Section 1: Technical Architecture

System Overview:

The NYCNewsScanner operates as a hub-and-spoke architecture where a news coordinator agent orchestrates parallel research agents that query both internal databases (1.5M+ articles, 200K+ emails) and external web sources. Extracted stories flow through a pitch generation system, then into a publishing pipeline that produces WordPress-ready payloads. Throughout this workflow, the Memory System captures institutional knowledge about sources, organizations, and ongoing projects, while the Reflection Engine validates AI outputs through deterministic quality gates before any learning occurs.

                                 ┌─────────────────────┐
                                 │  News Coordinator   │
                                 │  (Orchestrator)     │
                                 └──────────┬──────────┘
                    ┌───────────────────────┼───────────────────────┐
                    ↓                       ↓                       ↓
         ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
          │  DB Researcher   │    │ Email Researcher │    │  Web Researcher  │
          │  (7 outlets)     │    │  (Press/News)    │    │  (Reddit/Web)    │
         └────────┬─────────┘    └────────┬─────────┘    └────────┬─────────┘
                  │                       │                       │
                  └───────────────────────┴───────────────────────┘
                                          ↓
                               ┌──────────────────────┐
                               │  Story Extractor     │
                               │  (5W Analysis)       │
                               └──────────┬───────────┘
                                          ↓
                               ┌──────────────────────┐
                               │  Memory Store        │←─── Quality Gate
                               │  (4-Level Scope)     │←─── Feedback Parser
                               └──────────┬───────────┘
                                          ↓
                               ┌──────────────────────┐
                               │  Publish Pipeline    │
                               │  (WordPress Ready)   │
                               └──────────────────────┘

Core Technologies:

Component           Technology                       Purpose
Scanner             Claude Agent SDK + LangGraph     Multi-agent orchestration for parallel news research
Memory System       PostgreSQL + pgvector (HNSW)     Adaptive memory with hybrid semantic/keyword search
Reflection Engine   Deterministic Python (difflib)   Quality gates and feedback parsing without LLM involvement
Publisher           FastAPI + WordPress REST API     Article assembly with SEO generation and S3 image handling

Scanning & Analysis Pipeline:

  1. Source Monitoring: The database researcher queries 7 major outlets (Gothamist, Hell Gate, QNS, THE CITY, Bklyner, Brooklyn Paper, amNY) using deterministic PostgreSQL queries against the article corpus. The email researcher scans 215+ newsletters and 122+ press releases classified via Gemini Flash. The web researcher monitors r/Bushwick and conducts targeted web searches for hyperlocal content.

  2. Content Extraction: Articles and emails are pre-chunked into ~500-byte semantic segments with 384-dimensional embeddings (all-MiniLM-L6-v2). The story extractor analyzes these chunks using Claude to identify the 5W's: Who (entities involved), What (the news event), Where (geographic relevance to Bushwick), When (timing/urgency), and Why (significance).

  3. Relevance Scoring: Each extracted story receives a newsworthiness score based on geographic proximity to Bushwick, entity recognition matches with known community figures, and temporal relevance. Stories scoring below threshold are filtered before human review.

  4. Entity Recognition: The system maintains a dynamic entity database of 2,580+ entities including email senders, government officials, and organizations. Entity extraction uses SpaCy NER combined with Aho-Corasick pattern matching for O(n) performance across thousands of patterns.

  5. Context Enrichment: When drafting content or responses, the Memory System retrieves relevant memories using Reciprocal Rank Fusion across four scopes, providing historical context about sources, organizational relationships, and ongoing project involvement.
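The pre-chunking in step 2 can be sketched as a sentence-packing splitter with a ~500-byte budget; the embedding model (all-MiniLM-L6-v2) would then run on each returned chunk. The function name and sentence-boundary heuristic below are illustrative, not taken from the codebase:

```python
import re

def chunk_text(text: str, max_bytes: int = 500) -> list[str]:
    """Pack whole sentences into segments of at most ~max_bytes (UTF-8),
    so each chunk stays a coherent semantic unit for embedding.
    A single sentence longer than the budget becomes its own chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        candidate = f"{current} {sent}".strip()
        if current and len(candidate.encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence boundaries rather than raw byte offsets keeps each embedded segment self-contained, which matters for retrieval quality downstream.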
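Step 4 names Aho-Corasick for scanning thousands of entity patterns in O(n). A minimal self-contained automaton in that spirit might look like the following; the class and method names are illustrative, and the real system pairs this with SpaCy NER rather than using pattern matching alone:

```python
from collections import deque

class AhoCorasick:
    """Multi-pattern matcher: a trie with failure links, so scanning a
    document costs O(len(text) + matches) regardless of pattern count."""

    def __init__(self, patterns):
        self.goto = [{}]   # per-node transition tables
        self.fail = [0]    # failure links
        self.out = [[]]    # patterns ending at each node
        for pat in patterns:
            self._insert(pat)
        self._build_failure_links()

    def _insert(self, pat):
        node = 0
        for ch in pat:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(pat)

    def _build_failure_links(self):
        # BFS guarantees a node's failure link is set before its children's.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit matches reachable through the failure link.
                self.out[child] += self.out[self.fail[child]]

    def search(self, text):
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pat in self.out[node]:
                hits.append((i - len(pat) + 1, pat))  # (start offset, pattern)
        return hits
```

The one-time trie build is what buys the linear scan: every character of the document advances the automaton exactly once, no matter how many entity names are loaded.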
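The Reciprocal Rank Fusion in step 5 reduces to a few lines. This sketch assumes one ranked result list per scope (sender, organization, project, global); k=60 is the conventional constant from the RRF literature, not necessarily the system's setting:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked result lists into one ranking.
    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked well in multiple scopes rise to the top."""
    scores = {}
    for results in ranked_lists:
        for rank, item in enumerate(results, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it fuses results from heterogeneous retrievers (semantic vs. keyword, different scopes) without any score normalization.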

Memory System Architecture:

Reflection Integration:

Publishing Pipeline:

The publish_pipeline/ module handles the transition from journalist-generated content to WordPress-ready payloads:

  1. MetadataExtractor: Parses article markdown, extracts structured metadata
  2. SEOGenerator: LLM-based SEO title, description, and tag generation
  3. ImageAnalyzer: Claude Vision integration for image captioning and alt text
  4. S3UploadService: Direct S3 uploads for CDN distribution
  5. InlineImageProcessor: Extracts inline images and converts to CDN URLs
  6. PublishAssembler: Final JSON assembly with validation (title <110 chars, slug <60 chars, 3-10 tags, content >100 words)
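The PublishAssembler's validation rules listed above reduce to a handful of deterministic checks. A sketch, with payload field names assumed rather than taken from the codebase:

```python
def validate_payload(payload: dict) -> list[str]:
    """Return validation errors for a WordPress-ready payload;
    an empty list means the payload passes the assembler's gates."""
    errors = []
    if len(payload.get("title", "")) >= 110:
        errors.append("title must be under 110 characters")
    if len(payload.get("slug", "")) >= 60:
        errors.append("slug must be under 60 characters")
    if not 3 <= len(payload.get("tags", [])) <= 10:
        errors.append("need between 3 and 10 tags")
    if len(payload.get("content", "").split()) <= 100:
        errors.append("content must exceed 100 words")
    return errors
```

Returning a list of errors rather than raising on the first failure lets an editor see every problem with a payload in one review pass.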

Key Technical Achievements:


Section 2: Features & Standards

Core Capabilities:

  1. Intelligent News Discovery: The parallel researcher architecture (database + email + web) surfaces stories that exist across multiple source types. A press release in the email corpus might be enriched by related Reddit discussion and prior coverage in the article database. This cross-referencing catches stories a single-source scan would miss.

  2. Contextual Understanding: When drafting responses to a source, the system retrieves not just sender-level memories ("Maria prefers informal tone") but also organization-level context ("HPD requires form XYZ for FOIL requests") and project-level notes ("Bushwick rezoning investigation - Maria is sympathetic"). This multi-scope retrieval provides depth that single-level systems lack.

  3. Self-Improving Analysis: The effectiveness tracker deprioritizes or prunes memories associated with poor outcomes. A tone preference that consistently leads to rejected drafts decays in confidence, while memories tied to successful drafts strengthen. This creates a virtuous cycle that requires no human intervention in memory management.

  4. Editorial Assistance: The system assists rather than replaces journalists. Story extraction produces structured pitches that require human approval. Draft generation surfaces learned preferences but presents them for editor review. The HITL (Human-in-the-Loop) signaling system blocks automated actions until human decisions are recorded.

Standards & Best Practices:

Evolution & Learning:

The developer blog reveals a critical pivot in the memory system design. The initial implementation (commit 990af0f) sent raw edit diffs directly to Claude for reflection. After a research phase that produced 15,000+ lines of documentation analyzing Mem0, Letta/MemGPT, and academic memory systems, the team discovered that "reflection without external feedback degrades performance."

The redesigned system (commit 3b7e2ae) implements deterministic parsing BEFORE any LLM involvement. The FeedbackParser uses Python's difflib.SequenceMatcher to classify edit types (tone, factual, intent, structural, complete_rewrite, minor) without any AI inference. Only signals that pass the quality gate proceed to LLM memory extraction.
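A deterministic classifier in this spirit can be built directly on difflib.SequenceMatcher. The thresholds and the reduced label set below are illustrative, not the FeedbackParser's actual values:

```python
import difflib

def classify_edit(draft: str, edited: str) -> str:
    """Classify a human edit by how much of the draft survived,
    using only string similarity -- no LLM involved."""
    ratio = difflib.SequenceMatcher(None, draft, edited).ratio()
    if ratio > 0.95:
        return "minor"
    if ratio > 0.6:
        return "structural"   # real thresholds/labels are assumptions
    return "complete_rewrite"
```

Because the classification is pure string comparison, it is cheap, reproducible, and immune to the feedback-loop degradation that motivated the redesign: only edits that pass this gate ever reach the LLM for memory extraction.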

This "ship, learn, rebuild better" pattern reflects mature engineering practice: the team established through systematic research that its initial intuition was wrong, then rebuilt on a production-grade architecture.


Section 3: Impact on News Operations

Journalism Capabilities Unlocked:

  1. Comprehensive Morning Briefings: Instead of manually checking 7 news websites, journalists receive a synthesized summary with hyperlinks to all relevant articles from the past 48 hours. The system organizes content by topic and flags stories unique to specific outlets.

  2. Email Intelligence Integration: Press releases from government agencies, newsletters from community organizations, and Google Alerts are now searchable alongside published articles. The dramatic improvement in email query results (from 0 to 342+ results with extractable URLs) transformed email from a broken feature into a valuable intelligence source.

  3. Source Relationship Continuity: When a new press contact takes over at a city agency, organization-level memories about how that agency operates persist. The system remembers FOIL request procedures, preferred communication channels, and historical interaction patterns even through staff turnover.

  4. Automated Event Pipeline: Calendar events extracted from emails flow into structured records with title, date, location, and description. What was a manual transcription process becomes a review-and-approve workflow.

The Memory Advantage:

Institutional memory matters for local news because community journalism depends on accumulated knowledge. When a journalist leaves, they take years of source relationships and procedural knowledge with them. The 4-level scope hierarchy (sender, organization, project, global) captures this knowledge at the appropriate level.

This memory layer means new journalists can draft appropriate responses to established sources because the system surfaces relevant context automatically.

The Reflection Advantage:

Self-assessment matters for quality journalism because AI assistance must improve over time without degrading accuracy. The quality gate guards against several documented failure modes, chief among them the degradation that occurs when a system learns from raw, unvalidated feedback.

The effectiveness tracker then validates whether stored memories actually improve outcomes. A memory about "use formal tone with government contacts" that consistently leads to approved drafts strengthens; one that leads to rejections weakens.
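The strengthen/weaken dynamic described above amounts to a bounded confidence update. One minimal sketch, where the update rule, learning rate, and pruning threshold are assumptions rather than the system's actual formula:

```python
def update_confidence(confidence: float, approved: bool,
                      lr: float = 0.1) -> float:
    """Move a memory's confidence toward 1.0 on an approved draft and
    toward 0.0 on a rejection. Repeated failures decay confidence
    geometrically until it crosses a pruning threshold (e.g. 0.2)."""
    target = 1.0 if approved else 0.0
    return confidence + lr * (target - confidence)
```

The exponential form keeps confidence in [0, 1] and makes recent outcomes weigh more than old ones, so a memory that was once useful but has gone stale is eventually pruned.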

Workflow Integration:

Journalists interact with the system through multiple interfaces:

  1. CLI Commands: ea memory list, ea memory build-profile <email>, ea memory org --create for direct memory management
  2. Web Dashboard: Real-time SSE progress tracking during story extraction workflows
  3. HITL Review: Click-to-select interface for approving extracted stories and pitches with visual feedback
  4. Publishing Pipeline: Structured JSON output compatible with existing WordPress workflow

The system integrates into the update_emails_v2.sh pipeline as Steps 5-7 (story extraction, LangChain agent processing, entity extraction), running automatically after new emails are processed.

Mission Alignment:

The NYCNewsScanner & Publisher System embodies the values of good local journalism while making it sustainable:

The system costs approximately $0.15-0.25 per comprehensive research report versus hours of human research time, making capabilities previously available only to well-resourced newsrooms accessible to a small local publication.


Report compiled from 321 commits spanning July 2025 to March 2026.
Sources: Developer Blog, Systems Manifest, Codebase Analysis