Configuration API

Configuration management for Egregora, including settings models and validation.

Overview

Egregora uses Pydantic V2 for type-safe configuration management. All settings are defined in .egregora/config.yml and validated at load time.

CLI Commands

Validate Configuration

Check your configuration file for errors:

Bash
egregora config validate
egregora config validate ./my-blog

Shows:

  • ✅ Validation success with summary
  • ❌ Detailed error messages for invalid fields
  • ⚠️ Warnings for unusual settings

Show Configuration

Display current configuration:

Bash
egregora config show
egregora config show ./my-blog

Settings Models

EgregoraConfig

Root configuration object loaded from .egregora/config.yml.

EgregoraConfig

Bases: BaseSettings

Root configuration for Egregora.

This model defines the complete .egregora/config.yml schema.

Supports environment variable overrides with the pattern: EGREGORA_SECTION__KEY (e.g., EGREGORA_MODELS__WRITER)
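
To make the naming pattern concrete, here is a minimal plain-Python sketch of how `EGREGORA_SECTION__KEY` names could map onto the nested config sections. It is not Egregora's loader: `collect_env_overrides` is a hypothetical helper, and the sketch assumes the prefix is `EGREGORA_` and the section delimiter is a double underscore, as the pattern above suggests.

```python
PREFIX = "EGREGORA_"
DELIMITER = "__"


def collect_env_overrides(environ):
    """Turn EGREGORA_SECTION__KEY variables into a nested dict of overrides."""
    overrides = {}
    for name, value in environ.items():
        if not name.startswith(PREFIX):
            continue
        # Strip the prefix, then split "MODELS__WRITER" into section and key.
        section, sep, key = name[len(PREFIX):].partition(DELIMITER)
        if not sep or not section or not key:
            continue  # not in SECTION__KEY form, so ignore it
        overrides.setdefault(section.lower(), {})[key.lower()] = value
    return overrides


example_env = {
    "EGREGORA_MODELS__WRITER": "google-gla:gemini-flash-latest",
    "EGREGORA_RAG__TOP_K": "10",
    "HOME": "/home/user",  # unrelated variables are skipped
}
overrides = collect_env_overrides(example_env)
print(overrides)
```

Passing `os.environ` instead of `example_env` would read the real environment. Values stay strings in this sketch; a Pydantic settings class would coerce them to the declared field types.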

Example config.yml:

YAML
models:
  writer: google-gla:gemini-flash-latest
  enricher: google-gla:gemini-flash-latest

rag:
  enabled: true
  top_k: 5
  min_similarity_threshold: 0.7

writer:
  custom_instructions: "Write in a casual, friendly tone"
  enable_banners: true

privacy:
  anonymization_enabled: true
  pii_detection_enabled: true

pipeline:
  step_size: 1
  step_unit: days

database:
  pipeline_db: duckdb:///./.egregora/pipeline.duckdb
  runs_db: duckdb:///./.egregora/runs.duckdb

output:
  format: mkdocs

Attributes
models class-attribute instance-attribute
Python
models: ModelSettings = Field(default_factory=ModelSettings, description='LLM model configuration')
rag class-attribute instance-attribute
Python
rag: RAGSettings = Field(default_factory=RAGSettings, description='RAG configuration')
writer class-attribute instance-attribute
Python
writer: WriterAgentSettings = Field(default_factory=WriterAgentSettings, description='Writer configuration')
privacy class-attribute instance-attribute
Python
privacy: PrivacySettings = Field(default_factory=PrivacySettings, description='Privacy settings')
enrichment class-attribute instance-attribute
Python
enrichment: EnrichmentSettings = Field(default_factory=EnrichmentSettings, description='Enrichment settings')
pipeline class-attribute instance-attribute
Python
pipeline: PipelineSettings = Field(default_factory=PipelineSettings, description='Pipeline settings')
paths class-attribute instance-attribute
Python
paths: PathsSettings = Field(default_factory=PathsSettings, description='Site directory paths (relative to site root)')
database class-attribute instance-attribute
Python
database: DatabaseSettings = Field(default_factory=DatabaseSettings, description='Database configuration (pipeline and run tracking)')
output class-attribute instance-attribute
Python
output: OutputSettings = Field(default_factory=OutputSettings, description='Output format settings')
features class-attribute instance-attribute
Python
features: FeaturesSettings = Field(default_factory=FeaturesSettings, description='Feature flags')
quota class-attribute instance-attribute
Python
quota: QuotaSettings = Field(default_factory=QuotaSettings, description='LLM usage quota tracking')

ModelSettings

LLM model configuration for different tasks.

ModelSettings

Bases: BaseModel

  • Pydantic-AI agents expect provider-prefixed IDs like google-gla:gemini-flash-latest
  • Direct Google GenAI SDK calls expect models/<name> identifiers
Functions
validate_pydantic_model_format classmethod
Python
validate_pydantic_model_format(v: str) -> str

Validate Pydantic-AI model name format.

validate_google_model_format classmethod
Python
validate_google_model_format(v: str) -> str

Validate Google GenAI SDK model name format.
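
The two identifier formats can be told apart with simple string checks. The sketch below reuses the validator names for readability, but the bodies are assumptions: the reference does not show the actual rules, so treat this as one plausible reading of "provider-prefixed" and "models/<name>".

```python
def validate_pydantic_model_format(v: str) -> str:
    """Accept provider-prefixed IDs such as 'google-gla:gemini-flash-latest'."""
    provider, sep, name = v.partition(":")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>:<model>', got {v!r}")
    return v


def validate_google_model_format(v: str) -> str:
    """Accept Google GenAI SDK IDs such as 'models/gemini-embedding-001'."""
    prefix = "models/"
    if not v.startswith(prefix) or len(v) == len(prefix):
        raise ValueError(f"expected 'models/<name>', got {v!r}")
    return v


# Valid identifiers are returned unchanged.
print(validate_pydantic_model_format("google-gla:gemini-flash-latest"))
print(validate_google_model_format("models/gemini-embedding-001"))

# A missing provider prefix is rejected.
try:
    validate_pydantic_model_format("gemini-flash-latest")
except ValueError as exc:
    print(f"rejected: {exc}")
```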

RAGSettings

RAG (Retrieval-Augmented Generation) configuration.

RAGSettings

Bases: BaseModel

Retrieval-Augmented Generation (RAG) configuration.

Uses LanceDB for vector storage and similarity search. The embedding API uses a dual-queue router for optimal throughput.

Functions
validate_top_k classmethod
Python
validate_top_k(v: int) -> int

Validate top_k is reasonable and warn if too high.
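
A validator that both rejects and warns might look like the sketch below. The thresholds `WARN_ABOVE` and `MAX_TOP_K` are hypothetical, picked only so that the values used in the Field Validators examples on this page behave as documented (5 passes, 20 warns, 100 fails); the real limits are not stated here.

```python
import warnings

# Hypothetical thresholds; the actual limits may differ.
WARN_ABOVE = 15
MAX_TOP_K = 50


def validate_top_k(v: int) -> int:
    """Reject out-of-range values and warn on unusually high ones."""
    if v < 1:
        raise ValueError(f"top_k must be at least 1, got {v}")
    if v > MAX_TOP_K:
        raise ValueError(f"top_k must be at most {MAX_TOP_K}, got {v}")
    if v > WARN_ABOVE:
        # A warning keeps the config usable while flagging the odd value.
        warnings.warn(
            f"top_k={v} is unusually high; retrieval may add noise and cost",
            UserWarning,
            stacklevel=2,
        )
    return v
```

Separating the hard limit from the soft one lets an unusual but legal setting load with a warning instead of failing outright.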

PrivacySettings

Privacy and data protection settings.

PrivacySettings

Bases: BaseModel

Privacy and data protection settings (YAML configuration).

Two-level privacy model:

  1. Structural (Level 1): Deterministic preprocessing of raw input data
  2. PII Prevention (Level 2): LLM-native PII understanding in agent outputs

Warning: Disabling privacy features should only be done for public datasets (e.g., judicial records, public archives, news articles). For private conversations, always keep privacy enabled to protect PII.

This Pydantic model (for YAML config) has the same name as the dataclass in egregora.privacy.config.PrivacySettings (for runtime policy). They serve different purposes:

  • This class: YAML configuration (persisted to config.yml)
  • privacy.config.PrivacySettings: Runtime policy with tenant isolation
Attributes
enabled property
Python
enabled: bool

Legacy: overall privacy enabled (checks structural privacy).

anonymize_authors property
Python
anonymize_authors: bool

Legacy: check if authors are being anonymized.

Functions
validate_privacy_settings
Python
validate_privacy_settings() -> PrivacySettings

Validate privacy settings and warn if disabled.
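
The "warn if disabled" behaviour and the legacy `enabled` accessor can be sketched with a plain dataclass. `PrivacySketch` is a hypothetical stand-in, not the real Pydantic model. The field names come from the example config.yml above; the defaults and the mapping of `enabled` to the structural switch are assumptions.

```python
import warnings
from dataclasses import dataclass


@dataclass
class PrivacySketch:
    # Field names taken from the example config.yml; defaults are assumed.
    anonymization_enabled: bool = True  # Level 1: structural preprocessing
    pii_detection_enabled: bool = True  # Level 2: PII prevention in outputs

    def __post_init__(self):
        # Warn rather than fail: disabling privacy is legal but risky.
        names = ("anonymization_enabled", "pii_detection_enabled")
        disabled = [name for name in names if not getattr(self, name)]
        if disabled:
            warnings.warn(
                "Privacy features disabled: " + ", ".join(disabled)
                + ". Only do this for public datasets.",
                UserWarning,
                stacklevel=2,
            )

    @property
    def enabled(self) -> bool:
        """Legacy-style accessor: reports the structural (Level 1) switch."""
        return self.anonymization_enabled


settings = PrivacySketch()
print(settings.enabled)
```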

PipelineSettings

Pipeline execution settings.

PipelineSettings

Bases: BaseModel

Configuration Examples

Minimal Configuration

YAML
# .egregora/config.yml
models:
  writer: google-gla:gemini-flash-latest

rag:
  enabled: true
  top_k: 5

Full Configuration

YAML
# .egregora/config.yml

# Model configuration
models:
  writer: google-gla:gemini-flash-latest
  enricher: google-gla:gemini-flash-latest
  enricher_vision: google-gla:gemini-flash-latest
  ranking: google-gla:gemini-flash-latest
  editor: google-gla:gemini-flash-latest
  reader: google-gla:gemini-flash-latest
  embedding: models/gemini-embedding-001
  banner: models/gemini-2.5-flash-image

# RAG configuration
rag:
  enabled: true
  top_k: 5
  min_similarity_threshold: 0.7
  indexable_types: ["POST"]
  embedding_max_batch_size: 100
  embedding_timeout: 60.0
  embedding_max_retries: 5

# Writer agent
writer:
  custom_instructions: |
    Write in a casual, friendly tone.
    Focus on practical examples.

# Privacy settings
privacy:
  enabled: true                    # Enable anonymization & PII detection
  pii_detection_enabled: true      # Warn about PII in content
  pii_action: warn                 # "warn", "redact", or "skip"
  anonymize_authors: true          # Replace names with UUIDs
  custom_pii_patterns: []          # Additional regex patterns

# Enrichment
enrichment:
  enabled: true
  enable_url: true
  enable_media: true
  max_enrichments: 50

# Pipeline
pipeline:
  step_size: 1
  step_unit: days                  # "days", "hours", "messages"
  overlap_ratio: 0.2               # 20% overlap
  max_windows: 1                   # Process 1 window (0 = all)
  checkpoint_enabled: false        # Enable incremental processing

# Paths (relative to site root)
paths:
  egregora_dir: .egregora
  rag_dir: .egregora/rag
  lancedb_dir: .egregora/lancedb
  cache_dir: .egregora/cache
  prompts_dir: .egregora/prompts
  docs_dir: docs
  posts_dir: docs/posts
  profiles_dir: docs/profiles
  media_dir: docs/assets/media

# Output format
output:
  format: mkdocs                   # "mkdocs" or "hugo"

# Reader agent
reader:
  enabled: false
  comparisons_per_post: 5
  k_factor: 32
  database_path: .egregora/reader.duckdb

# Quota limits
quota:
  daily_llm_requests: 220
  per_second_limit: 1
  concurrency: 1

Validation

Field Validators

Configuration fields are validated with custom validators:

YAML
# Model name format validation
models:
  writer: google-gla:gemini-flash-latest  # ✅ Valid
  writer: gemini-flash-latest             # ❌ Invalid (missing prefix)

# RAG top_k validation
rag:
  top_k: 5      # ✅ Good
  top_k: 20     # ⚠️  Warning (unusually high)
  top_k: 100    # ❌ Error (exceeds maximum)

Cross-Field Validation

The config validator checks dependencies between fields:

YAML
# ❌ Error: RAG enabled but lancedb_dir not set
rag:
  enabled: true
paths:
  lancedb_dir: ""  # Empty path

# ⚠️  Warning: Very high token limit
pipeline:
  max_prompt_tokens: 500000  # Exceeds most model limits
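
The two rules above can be expressed as a small function over the parsed config. This is a sketch of the idea rather than Egregora's validator: `check_cross_fields` and `TOKEN_WARN_LIMIT` are hypothetical names, the warning threshold is an assumed value, and the input is taken to be a plain dict with defaults already filled in.

```python
# Hypothetical warning threshold; the real limit is not documented here.
TOKEN_WARN_LIMIT = 200_000


def check_cross_fields(config: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for rules that span more than one field."""
    errors: list[str] = []
    warnings: list[str] = []

    rag = config.get("rag", {})
    paths = config.get("paths", {})
    # RAG needs somewhere to store vectors, so an empty path is an error.
    if rag.get("enabled") and not paths.get("lancedb_dir"):
        errors.append("rag.enabled is true but paths.lancedb_dir is not set")

    max_tokens = config.get("pipeline", {}).get("max_prompt_tokens")
    # A very large prompt budget is allowed but probably a mistake.
    if max_tokens is not None and max_tokens > TOKEN_WARN_LIMIT:
        warnings.append(
            f"pipeline.max_prompt_tokens={max_tokens} exceeds most model limits"
        )
    return errors, warnings


bad = {
    "rag": {"enabled": True},
    "paths": {"lancedb_dir": ""},
    "pipeline": {"max_prompt_tokens": 500000},
}
errors, warns = check_cross_fields(bad)
print(errors)
print(warns)
```

Returning errors and warnings separately lets a caller such as a CLI print both before deciding whether to exit with a failure.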

Programmatic Usage

Loading Configuration

Python
from pathlib import Path
from egregora.config.settings import load_egregora_config

# Load from site root
config = load_egregora_config(Path("./my-blog"))

# Access settings
print(config.models.writer)
print(config.rag.enabled)
print(config.pipeline.step_size)

Creating Configuration

Python
from pathlib import Path

from egregora.config.settings import EgregoraConfig, save_egregora_config

# Create with defaults
config = EgregoraConfig()

# Modify settings
config.rag.enabled = False
config.pipeline.step_size = 7

# Save to file
save_egregora_config(config, Path("./my-blog"))

Configuration Overrides

Python
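from pathlib import Path

from egregora.config.overrides import ConfigOverrideBuilder
from egregora.config.settings import load_egregora_config

# Load the base configuration that the overrides are applied to
base_config = load_egregora_config(Path("./my-blog"))

# Build with overrides
overrides = ConfigOverrideBuilder()
overrides.set_model("writer", "google-gla:gemini-pro-latest")
overrides.set_rag_enabled(False)

config = overrides.build(base_config)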

See Also