# Architecture Overview
Egregora uses a functional pipeline architecture that processes conversations through pure transformations. This design keeps concerns cleanly separated and the pipeline easy to maintain.
## Pipeline Flow
```mermaid
graph LR
    A[Ingestion + Privacy] --> B[Enrichment]
    B --> C[Generation]
    C --> D[Publication]
    B -.-> E[RAG Index]
    E -.-> C
```

Egregora processes conversations through these stages:
- Ingestion: Input adapters parse exports into structured IR (Intermediate Representation) and apply privacy strategies.
- Enrichment: Optionally enrich URLs and media with LLM descriptions.
- Generation: Writer agent generates blog posts with RAG retrieval.
- Publication: Output adapters persist to MkDocs site.
Critical Invariant: Privacy stage runs WITHIN the adapter, BEFORE any data enters the pipeline or reaches LLMs.
## Three-Layer Functional Architecture
Key Pattern: No PipelineStage abstraction—all transforms are pure functions.
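A minimal sketch of what this means in practice. The composition below is illustrative: `create_windows` and `enrich_window` are named later on this page, but their exact signatures are assumptions.

```python
import ibis

# Names come from the Transformations section below; exact signatures are assumptions.
from egregora.transformations import create_windows, enrich_window


def run_pipeline(messages: ibis.Table) -> ibis.Table:
    """Hypothetical composition: each stage is a plain table -> table function,
    so the pipeline is ordinary function application rather than a class hierarchy."""
    windowed = create_windows(messages)   # windowing (pure transform)
    enriched = enrich_window(windowed)    # optional enrichment (pure transform)
    return enriched                       # handed on to the writer agent / output adapter
```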
## Code Structure
### Input Adapters
Purpose: Convert external data sources into the IR (Intermediate Representation) schema.
Available adapters:
- `WhatsAppAdapter`: Parse WhatsApp `.zip` exports
- `IperonTJROAdapter`: Ingest Brazilian judicial records
- `SelfReflectionAdapter`: Re-ingest past blog posts
Protocol:
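The original protocol snippet is not reproduced here; below is a hedged sketch of the shape it implies. The `InputAdapter` name and the `parse` signature are assumptions.

```python
from typing import Protocol

import ibis


class InputAdapter(Protocol):
    """Hypothetical adapter protocol: external export in, anonymized IR table out."""

    def parse(self, source_path: str) -> ibis.Table:
        """Return a table conforming to IR_MESSAGE_SCHEMA, already anonymized."""
        ...
```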
All adapters produce data conforming to IR_MESSAGE_SCHEMA.
### Privacy Layer
Module: egregora.privacy
Ensures real names never reach the LLM. Privacy logic is integrated into the input adapters.
Key components:
- `deterministic_author_uuid()`: Convert names to UUIDs
- `detect_pii()`: Scan for phone numbers, emails, addresses
- Namespace management: Scoped anonymity per chat/tenant
Process:
- Input adapter calls `deterministic_author_uuid()` during parsing
- Core pipeline receives anonymized IR
- LLM only sees UUIDs, never real names
- Reverse mapping stored locally (never sent to API)
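A sketch of how a deterministic, namespaced mapping can be built. The use of `uuid.uuid5` and the argument names below are illustrative assumptions, not the documented implementation.

```python
import uuid


def deterministic_author_uuid(author_name: str, chat_namespace: uuid.UUID) -> str:
    """Same name + same namespace always maps to the same UUID; different chats
    (namespaces) produce unrelated UUIDs, so identities cannot be correlated across tenants."""
    return str(uuid.uuid5(chat_namespace, author_name))


# The same person appears under different UUIDs in different chats.
chat_a, chat_b = uuid.uuid4(), uuid.uuid4()
assert deterministic_author_uuid("Alice", chat_a) != deterministic_author_uuid("Alice", chat_b)
```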
### Transformations
Module: egregora.transformations
Pure functional transformations on Ibis tables.
Key functions:
- `create_windows()`: Group messages into time/count-based windows
- `enrich_window()`: Add URL/media enrichments
Pattern:
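The original snippet is not reproduced here; below is a hedged sketch of the table-in, table-out shape. The body is a stand-in (it only shows a time-based bucket) and the real signature may differ.

```python
import ibis


def create_windows(messages: ibis.Table) -> ibis.Table:
    """Pure transform: table in, table out, no mutation and no I/O.
    A window is approximated here as an hourly bucket of the timestamp;
    the real implementation also supports count-based windows."""
    return messages.mutate(window_id=messages.timestamp.truncate("h").cast("string"))
```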
### RAG (Retrieval-Augmented Generation)
Module: egregora.rag
LanceDB-based vector storage with dual-queue embedding router.
Architecture:
```mermaid
graph LR
    A[Documents] --> B[Embedding Router]
    B --> C{Route}
    C -->|Single Query| D[Single Endpoint]
    C -->|Bulk Index| E[Batch Endpoint]
    D --> F[LanceDB]
    E --> F
    F --> G[Vector Search]
```

Key features:
- Synchronous API (`index_documents`, `search`)
- Dual-queue router: single endpoint (low-latency) + batch endpoint (high-throughput)
- Automatic rate limit handling with exponential backoff
- Asymmetric embeddings: `RETRIEVAL_DOCUMENT` vs `RETRIEVAL_QUERY`
- Configurable indexable document types
Configuration: the embedding endpoints and indexable document types are set in the project's YAML configuration.
API:
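The original snippet is not reproduced here; below is a hedged usage sketch. `index_documents` and `search` are named above, but the argument names, document format, and return type are assumptions.

```python
from egregora.rag import index_documents, search

# Bulk indexing routes through the batch endpoint with RETRIEVAL_DOCUMENT embeddings.
index_documents(["Post about the March retrospective", "Post about the API redesign"])

# A single query routes through the low-latency endpoint with a RETRIEVAL_QUERY embedding.
results = search("what changed in the API redesign?", top_k=3)
for hit in results:
    print(hit)
```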
### Agents
Egregora uses Pydantic-AI agents for LLM interactions.
#### Writer Agent
Module: egregora.agents.writer
Generates blog posts from conversation windows.
Input: Conversation window as XML (via conversation.xml.jinja)
Output: Markdown blog post with frontmatter
Tools: RAG search for retrieving similar past content
Caching: L3 cache with semantic hashing (zero-cost re-runs for unchanged windows)
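A sketch of the idea behind semantic hashing for the L3 cache; the hashing scheme below is illustrative, not the documented implementation.

```python
import hashlib
import json


def window_cache_key(window_messages: list[dict], prompt_version: str) -> str:
    """Hash exactly what the writer sees: if neither the window content nor the prompt
    changed, the key is identical and the cached post is reused at zero cost."""
    payload = json.dumps({"messages": window_messages, "prompt": prompt_version}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```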
#### Enricher Agent
Module: egregora.agents.enricher
Extracts and enriches URLs and media from messages.
Capabilities:
- URL enrichment: Extract title, description, context
- Media enrichment: Generate captions for images/videos
- Text enrichment: Extract key points from long text
Caching: L1 cache for enrichment results (asset-level)
#### Reader Agent
Module: egregora.agents.reader
Evaluates and ranks post quality using an ELO rating system.
Architecture:
- Pairwise post comparison (A vs B)
- ELO rating updates based on comparison outcomes
- Persistent ratings in SQLite database
- Comparison history tracking
Usage: posts are compared in pairs and their ELO ratings are updated after each outcome, as sketched below.
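A minimal sketch of the standard ELO update used for pairwise ranking. The K-factor and function names are illustrative; the real agent persists ratings and comparison history in SQLite.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that post A beats post B under the standard ELO model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return the updated (rating_a, rating_b) after one pairwise comparison."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```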
#### Banner Agent
Module: egregora.agents.banner
Generates cover images for blog posts using Gemini Imagen.
Input: Post title, summary, style instructions
Output: PNG image saved to docs/assets/banners/
### Output Adapters
Purpose: Persist generated documents to various formats.
Available adapters:
- `MkDocsAdapter`: Create MkDocs site structure
- `ParquetAdapter`: Export to Parquet files
Protocol:
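The original protocol snippet is not reproduced here; below is a hedged sketch of its likely shape. The `OutputAdapter` name and the `write` signature are assumptions.

```python
from pathlib import Path
from typing import Protocol


class OutputAdapter(Protocol):
    """Hypothetical adapter protocol: generated documents in, persisted files out."""

    def write(self, documents: list[str], output_dir: Path) -> None:
        """Persist the documents (e.g. Markdown posts with frontmatter) under output_dir."""
        ...
```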
### Database Management
Module: egregora.database
DuckDB for analytics and persistence.
Key components:
- `DuckDBStorageManager`: Context manager for database access
- `SQLManager`: Jinja2-based SQL template rendering
- View Registry: Reusable DuckDB views (e.g., `daily_aggregates_view`)
- Run Tracking: INSERT+UPDATE pattern for pipeline runs
Views:
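The original snippet is not reproduced here; below is a sketch of what a reusable view amounts to in plain DuckDB. Only the `daily_aggregates_view` name comes from this page; the table name and the aggregation are illustrative.

```python
import duckdb

con = duckdb.connect("egregora.duckdb")

# Register a reusable view over the (assumed) IR messages table.
con.execute("""
    CREATE OR REPLACE VIEW daily_aggregates_view AS
    SELECT date_trunc('day', timestamp) AS day,
           count(*)                     AS message_count,
           count(DISTINCT author_uuid)  AS active_authors
    FROM messages
    GROUP BY 1
""")
```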
### Tiered Caching
Three-tier cache system for cost reduction:
- L1 Cache: Asset enrichment results (URLs, media)
- L2 Cache: RAG search results with index metadata invalidation
- L3 Cache: Writer output with semantic hashing
CLI: the cache tiers can be managed from the command-line interface.
## Data Flow
### Ibis Everywhere
All data flows through Ibis DataFrames:
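The original snippet is not reproduced here; below is a sketch of the Ibis-first style. The database path and table name are assumptions.

```python
import ibis

con = ibis.duckdb.connect("egregora.duckdb")
messages = con.table("messages")

# Lazy expression: nothing executes until a boundary call such as to_pandas().
per_author = (
    messages.group_by("author_uuid")
    .agg(message_count=messages.count())
    .order_by(ibis.desc("message_count"))
)
df = per_author.to_pandas()  # pandas only at the boundary
```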
### Schema Validation
Central schema: database/ir_schema.py
All stages preserve IR_MESSAGE_SCHEMA core columns:
- `message_id`: Unique identifier
- `window_id`: Window assignment
- `timestamp`: UTC timestamp
- `author_uuid`: Anonymized author
- `content`: Message text
- `media_type`: Media type (if any)
- `media_path`: Media path (if any)
Validation:
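The original snippet is not reproduced here; below is a hedged sketch of a column-presence check against the core IR columns listed above. The helper name is hypothetical.

```python
import ibis

REQUIRED_IR_COLUMNS = {
    "message_id", "window_id", "timestamp", "author_uuid",
    "content", "media_type", "media_path",
}


def validate_ir(table: ibis.Table) -> None:
    """Hypothetical check: fail fast if a stage dropped a core IR column."""
    missing = REQUIRED_IR_COLUMNS - set(table.schema().names)
    if missing:
        raise ValueError(f"table is missing IR columns: {sorted(missing)}")
```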
## Design Principles
- ✅ Privacy-First: Anonymize BEFORE LLM (critical invariant)
- ✅ Ibis Everywhere: DuckDB tables, pandas only at boundaries
- ✅ Functional Transforms: Table → Table (no classes)
- ✅ Schemas as Contracts: All stages preserve IR_MESSAGE_SCHEMA
- ✅ Simple Default: Full rebuild (`--resume` for incremental)
- ✅ Alpha Mindset: Clean breaks, no backward compatibility
## Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| DataFrames | Ibis | Unified data manipulation API |
| Database | DuckDB | Analytics + storage |
| Vector Store | LanceDB | RAG vector search |
| LLM | Google Gemini | Content generation |
| Embeddings | Gemini Embeddings | Vector search |
| Site Generator | MkDocs | Static site generation |
| CLI | Typer | Command-line interface |
| Package Manager | uv | Modern Python tooling |
## Performance Characteristics
- Stateless Runs: Each run is independent (except RAG/annotations)
- Lazy Evaluation: Ibis defers execution until needed
- Batching: Embeddings and enrichments are batched
- Caching: Three-tier cache (L1: enrichment, L2: RAG, L3: writer)
- Vectorized: DuckDB enables fast analytics
- Concurrency: ThreadPoolExecutor for I/O bound tasks
## Next Steps
- Privacy Model - Deep dive on anonymization
- Knowledge Base - RAG and vector search
- Content Generation - LLM writer internals
- Project Structure - Detailed code organization