# Email Ingestion Data Warehouse

Originally developed for Bushwick Daily's AI-powered newsroom. This system serves as the upstream data source for the Local News Multiverse Agent Platform.

## Related Open Source Projects

- Email Ingestion Data Warehouse (this document)
- Multiverse Agent Platform
## Why This Exists

### The Problem

Local newsrooms receive hundreds of emails daily: story leads, press releases, event listings, advertising inquiries, and the inevitable flood of newsletters. Manually triaging this volume is unsustainable, yet buried in that noise are the tips and sources that become great stories.
### The Constraints

I built this on a MacBook Pro without dedicated ML infrastructure. Every architectural decision reflects this reality:

- CPU-first design: all-MiniLM-L6-v2 (384-dim) embeddings run quickly on CPU alone, no GPU required
- PostgreSQL as backbone: pgvector for embeddings, no external vector database needed
- Local-first philosophy: your emails stay on your servers, not in someone else's cloud
- Token efficiency: deterministic rules bypass the LLM for obvious cases, saving 30-50% on API costs
### The Solution

A seven-step pipeline that transforms raw Gmail data into a searchable, classified, relationship-aware knowledge base:

Gmail API → Deduplication → Chunking → Embedding → Classification → Participants → Profiles

## Architecture Overview
1. EXTRACTION
|
[Gmail API] --> [Service Account Auth] --> [Email Extraction] --> [Fingerprint Generation]
|
2. DEDUPLICATION
|
[Content Normalization] --> [Multi-Hash Fingerprints] --> [Duplicate Groups]
|
3. PROCESSING
|
[Semantic Chunking] --> [Vector Embeddings: all-MiniLM-L6-v2] --> [LLM Classification: Gemini Flash]
|
4. INTELLIGENCE
|
[Participant Extraction] --> [Sender Profiles] --> [Relationship Mapping]
|
5. STORAGE
|
[PostgreSQL + pgvector]
|
[HNSW Vector Index]
|
6. CONSUMPTION
|
[RAG Search API] | [Multiverse Agents] | [Email Assistant]

## Database Schema

### Core Tables
| Table | Purpose | Key Columns |
|---|---|---|
| classified_emails | Primary email storage | gmail_id, thread_id, body_text, content_fingerprint |
| email_chunks | Semantic chunks with embeddings | email_id, text, embedding VECTOR(384) |
| email_participants | Normalized sender/recipient records | participant_email, role, fingerprint |
| sender_profiles | Aggregated sender intelligence | email, is_journalist, influence_score |
| email_duplicate_groups | Deduplication clusters | content_fingerprint, member_count |
| email_fingerprints_v2 | Multi-hash fingerprints | new_content_hash, structure_hash |
### Vector Storage
-- email_chunks with HNSW index for fast semantic search
CREATE TABLE email_chunks (
id SERIAL PRIMARY KEY,
email_id INTEGER REFERENCES classified_emails(id),
chunk_index INTEGER,
text TEXT,
embedding VECTOR(384), -- all-MiniLM-L6-v2 output
metadata JSONB
);
-- HNSW index for approximate nearest neighbor search
CREATE INDEX idx_email_chunks_embedding_hnsw
ON email_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

### Sender Intelligence
-- sender_profiles aggregates sender behavior
CREATE TABLE sender_profiles (
fingerprint VARCHAR(32) PRIMARY KEY, -- MD5 of email
email VARCHAR(255) UNIQUE,
is_journalist BOOLEAN,
is_government BOOLEAN,
is_media BOOLEAN,
email_count INTEGER,
influence_score FLOAT,
topics TEXT[],
frequent_contacts JSONB
);

## Pipeline Stages
### Stage 1: Gmail API Extraction

python scripts/gmail_extractor.py --max-results 500

Uses a Google Cloud service account with domain-wide delegation to access Gmail. Extracts:
- Full email metadata (headers, labels, size)
- Body content (text and HTML)
- Threading information (message-id, in-reply-to, references)
- Attachment metadata (not content)
### Stage 2: Content Normalization & Fingerprinting
The deduplication system uses multi-hash fingerprinting:
# Content normalization replaces URLs and emails with placeholders
"Check out https://example.com for details"
→ "check out [URL] for details"
# Multiple hash types for different dedup strategies
{
"new_content_hash": "abc123...", # Fresh content only
"quoted_content_hash": "def456...", # Quoted text only
"full_content_hash": "ghi789...", # Complete message
"structure_hash": "jkl012...", # Email structure
}

**Gmail Alias Resolution**: The system normalizes email addresses according to provider rules:

- Gmail: ignores dots, removes plus suffixes (john.doe+test@gmail.com → johndoe@gmail.com)
- Yahoo: removes dash suffixes
- ProtonMail: preserves plus addressing (it is used for unique addresses)
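The alias rules above can be sketched as a small helper. This is illustrative only: the function name and the exact rule set are assumptions, not the repo's implementation.

```python
def resolve_alias(address: str) -> str:
    """Normalize an email address using per-provider alias rules (sketch)."""
    local, _, domain = address.lower().strip().partition("@")
    if domain in ("gmail.com", "googlemail.com"):
        local = local.split("+", 1)[0]   # drop +suffix
        local = local.replace(".", "")   # Gmail ignores dots
    elif domain == "yahoo.com":
        local = local.split("-", 1)[0]   # Yahoo uses -suffix for disposable aliases
    # ProtonMail (and unknown providers): leave plus addressing intact
    return f"{local}@{domain}"
```

Normalizing before fingerprinting is what lets the same human collapse into one participant record across their aliases.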
### Stage 3: Semantic Chunking

python scripts/batch_chunking.py

Emails are chunked using a word-based strategy optimized for email content:
- Chunk size: ~500 characters
- Overlap: 50 characters (preserves context)
- Short emails (<50 chars): Combined with subject as single chunk
- Long emails (>50k chars): Truncated with marker
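A minimal sketch of this chunking strategy. The `chunk_email` helper and its parameters are hypothetical; the real script's API may differ.

```python
def chunk_email(subject: str, body: str,
                chunk_size: int = 500, overlap: int = 50,
                max_chars: int = 50_000) -> list[str]:
    """Word-based chunking: ~500-char chunks with ~50-char trailing overlap."""
    if len(body) < 50:                    # short email: combine with subject
        return [f"{subject}\n{body}".strip()]
    if len(body) > max_chars:             # very long email: truncate with marker
        body = body[:max_chars] + " [TRUNCATED]"
    chunks, current = [], ""
    for word in body.split():
        if current and len(current) + 1 + len(word) > chunk_size:
            chunks.append(current)
            current = current[-overlap:]  # carry trailing context into next chunk
        current = f"{current} {word}".strip() if current else word
    if current:
        chunks.append(current)
    return chunks
```

Splitting on words (rather than fixed byte offsets) keeps tokens intact, and the overlap preserves context across chunk boundaries for retrieval.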
### Stage 4: Vector Embeddings
Uses sentence-transformers/all-MiniLM-L6-v2:
- 384 dimensions (efficient for CPU)
- ~80ms per embedding on an M1 MacBook
- Cosine similarity for semantic search
Why this model? It's the sweet spot for CPU-constrained environments:

- ~5x smaller than common 768-dim alternatives
- ~2x faster inference
- ~95% of the retrieval quality for email-length text
### Stage 5: LLM Classification

python scripts/batch_classifier.py --all --batch-size 50

16 classification categories tuned for newsroom workflows:
| Category | Example |
|---|---|
| story_lead_or_tip | Anonymous tips, source referrals |
| freelance_pitch | Article proposals from writers |
| press_release | Official announcements |
| community_event_listing | Local events |
| sales_or_advertising_inquiry | Ad space requests |
| financial_admin | Receipts, invoices |
**Deterministic Pre-filtering**: Obvious cases bypass the LLM to save tokens:

- @venmo.com → financial_admin (no LLM needed)
- globenewswire.com → press_release (no LLM needed)
### Stage 6: Participant Extraction

python scripts/participant_extractor.py --incremental

Extracts and normalizes participants from email headers:
- Sender: Who sent the email
- Recipients: TO field (tracks primary)
- CC: Carbon copy (tracks position)
- BCC: Blind carbon copy
Each participant gets an MD5 fingerprint enabling cross-email analysis.
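The fingerprint itself is a one-liner. This sketch assumes the address is lowercased before hashing; the repo may normalize differently.

```python
import hashlib

def participant_fingerprint(address: str) -> str:
    """MD5 of the normalized address -> 32 hex chars (fits VARCHAR(32))."""
    normalized = address.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

Because the fingerprint is deterministic, the same sender hashes to the same key in every email, which is what makes cross-email joins on participants cheap.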
### Stage 7: Sender Profile Enrichment
Aggregated intelligence about each sender:
- Classification flags: is_journalist, is_government, is_media
- Volume metrics: email_count, thread_count, response_rate
- Temporal data: first_seen, last_seen, days_active
- Network data: frequent_contacts, communication patterns
## Hashing Strategy

### Why Multiple Hashes?
Email deduplication is harder than document deduplication. The same content appears in:
- Direct sends
- Forwards (with "FYI" prefix)
- Replies (with quoted text)
- CC'd copies to different recipients
### Hash Types
| Hash | Purpose | Ignores |
|---|---|---|
| full_content_hash | Basic dedup | Nothing |
| new_content_hash | Forward detection | Quoted text |
| structure_hash | Layout matching | Content |
| thread_hash | Conversation grouping | Message body |
| recipient_set_hash | Audience matching | Message body |
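Two of these hashes can be sketched side by side: `full_content_hash` over the whole body versus `new_content_hash` over only the non-quoted lines. This is a simplification; the real pipeline also applies content normalization first, and treating `>`-prefixed lines as the quoted text is an assumption.

```python
import hashlib

def full_content_hash(body: str) -> str:
    """Hash the complete message body."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()

def new_content_hash(body: str) -> str:
    """Hash only the fresh (non-quoted) lines, so a reply with new text
    still matches other copies of that same new text."""
    fresh = "\n".join(l for l in body.splitlines() if not l.startswith(">"))
    return hashlib.md5(fresh.encode("utf-8")).hexdigest()
```

A reply and the bare message it quotes get different full hashes but the same new-content hash, which is how forwards and replies are detected without false-positive dedup.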
### Normalization Pipeline

- Remove tracking parameters (UTM parameters, tracking pixels)
- Replace URLs → [URL]
- Replace emails → [EMAIL]
- Remove zero-width characters
- Normalize whitespace
- Normalize Unicode (smart quotes → regular)
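A condensed sketch of these steps. The regex patterns are simplified stand-ins for the repo's actual rules, tracking-parameter removal is omitted, and lowercasing the text (as the Stage 2 example output suggests) is inferred rather than confirmed.

```python
import re

# Zero-width characters to strip (translate to None deletes them)
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def normalize(text: str) -> str:
    """Content normalization sketch: placeholders, cleanup, whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "[URL]", text)                 # URLs -> [URL]
    text = re.sub(r"[\w.+-]+@[\w.-]+\.\w+", "[EMAIL]", text)      # emails -> [EMAIL]
    text = text.translate(ZERO_WIDTH)                             # zero-width chars
    text = text.replace("\u2019", "'")                            # smart apostrophe
    text = text.replace("\u201c", '"').replace("\u201d", '"')     # smart quotes
    return re.sub(r"\s+", " ", text).strip()                      # whitespace
```

Replacing volatile tokens with placeholders is what lets two copies of the same message hash identically even when each copy carries per-recipient tracking links.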
## Integration with Multiverse Platform

This warehouse supplies the upstream data for the Local News Multiverse Agent Platform, which queries its tables directly.

### Agent Tools
The multiverse agents use these tables:
# lookup_sender_entity - queries sender_profiles
sender = lookup_sender_entity("john@example.com")
# Returns: is_journalist, influence_score, email_count, topics
# search_emails - vector search on email_chunks
results = search_emails("city council budget meeting", limit=10)
# Returns: similar emails with relevance scores

### Required Schema
For multiverse compatibility, ensure:

- classified_emails has: id, gmail_id, thread_id, sender_email, body_text
- email_chunks has: embedding VECTOR(384) with HNSW index
- sender_profiles has: fingerprint, is_journalist, is_government, is_media
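For illustration, here is roughly the SQL a search_emails-style tool could run against this schema, using pgvector's cosine-distance operator `<=>` (which the HNSW index with vector_cosine_ops accelerates). The helper, function name, and column choices are assumptions; it only builds the query, so you can execute it with any PostgreSQL driver.

```python
def build_search_sql(query_embedding: list[float], limit: int = 10) -> tuple[str, tuple]:
    """Build a parameterized pgvector cosine-distance search over email_chunks."""
    sql = (
        "SELECT e.gmail_id, c.text, "
        "       1 - (c.embedding <=> %s::vector) AS similarity "
        "FROM email_chunks c "
        "JOIN classified_emails e ON e.id = c.email_id "
        "ORDER BY c.embedding <=> %s::vector "
        "LIMIT %s"
    )
    # pgvector accepts a '[x,y,...]' literal cast to ::vector
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    return sql, (vec, vec, limit)
```

Ordering by `<=>` (cosine distance, ascending) and reporting `1 - distance` as similarity keeps the index usable while returning scores where higher means closer.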
## Quick Start

### 1. Prerequisites
# PostgreSQL with pgvector
brew install postgresql@17
psql -c "CREATE EXTENSION vector;"
# Python 3.10+
python --version # Should be 3.10+

### 2. Clone and Setup
git clone https://github.com/alecmeeeker/email-ingestion-data-warehouse.git
cd email-ingestion-data-warehouse
# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

### 3. Configure Environment
# Copy example config
cp .env.example .env
# Edit with your values
nano .env

Required variables:

- GOOGLE_SERVICE_ACCOUNT_FILE: path to the service account JSON
- GMAIL_DELEGATE_EMAIL: email address to impersonate
- GEMINI_API_KEY: key for LLM classification
### 4. Run Migrations
# Create database
createdb email_warehouse
# Run migrations in order
for f in migrations/*.sql; do psql -d email_warehouse -f "$f"; done

### 5. Run Pipeline
# Extract emails
python scripts/gmail_extractor.py --max-results 100
# Create chunks and embeddings
python scripts/batch_chunking.py
# Classify emails
python scripts/batch_classifier.py --all
# Extract participants
python scripts/participant_extractor.py

## Configuration Reference
### Database
| Variable | Default | Description |
|---|---|---|
| DB_NAME | email_warehouse | PostgreSQL database name |
| DB_USER | postgres | Database user |
| DB_HOST | localhost | Database host |
| DB_PASSWORD | (empty) | Database password |
| DB_PORT | 5432 | Database port |
### Gmail API
| Variable | Required | Description |
|---|---|---|
| GOOGLE_SERVICE_ACCOUNT_FILE | Yes | Path to service account JSON |
| GMAIL_DELEGATE_EMAIL | Yes | Email to impersonate |
### LLM
| Variable | Required | Description |
|---|---|---|
| GEMINI_API_KEY | Yes | Google Gemini API key |
### Model
| Variable | Default | Description |
|---|---|---|
| EMBEDDING_MODEL | sentence-transformers/all-MiniLM-L6-v2 | Embedding model |
| HF_HUB_OFFLINE | 0 | Use cached models only |
| BATCH_SIZE | 200 | Processing batch size |
## Performance Considerations

### Why These Choices Work on a MacBook
| Choice | Why |
|---|---|
| all-MiniLM-L6-v2 | 384-dim vs 768-dim = 2x faster, 50% memory |
| HNSW index | O(log n) search, no rebuild on insert |
| Deterministic rules | 30-50% of emails skip LLM entirely |
| Batch commits | Resume from failure without reprocessing |
| Incremental extraction | Only process new emails |
### Benchmarks (M1 MacBook Pro)
| Operation | Throughput |
|---|---|
| Email extraction | ~100 emails/min |
| Chunking + embedding | ~200 emails/min |
| LLM classification | ~30 emails/min |
| Participant extraction | ~500 emails/min |
### Scaling Notes

For larger deployments:

- Increase BATCH_SIZE for better throughput
- Use connection pooling (pgbouncer)
- Consider Gemini Flash Lite for classification
- Run embedding and classification in parallel
## Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
Areas of interest:
- Additional email providers (Outlook, IMAP)
- Alternative embedding models
- Classification category extensions
- Performance optimizations
## Related Projects
- Multiverse Agent Platform — Self-hosted MLOps for agentic AI workflows
- Data Warehouse Report — The production system this was extracted from
Built with care by Alec Meeker for Bushwick Daily.