Settings

settings

Centralized configuration for Egregora (ALPHA VERSION).

This module consolidates ALL configuration code in one place:

  • Pydantic models for .egregora.toml (in root)
  • Loading and saving functions
  • Runtime dataclasses for function parameters
  • Model configuration utilities

Why TOML?

  • Native Python support in 3.11+ (tomllib)
  • Unambiguous date/time parsing (unlike YAML which can be ambiguous)
  • Clearer specification, avoiding "The Norway Problem"
  • Standard for Python ecosystem (matches pyproject.toml)

Benefits:

  • Single source of truth for all configuration
  • Backend independence (works with Hugo, Astro, etc.)
  • Type safety (Pydantic validation at load time)
  • No backward compatibility - clean alpha design

Strategy:

  • ONLY loads from .egregora.toml in root
  • Creates default config if missing in root

ModelSettings

Bases: BaseModel

LLM model configuration for different tasks.

  • Pydantic-AI agents expect provider-prefixed IDs like google-gla:gemini-flash-latest
  • Direct Google GenAI SDK calls expect models/<name> identifiers

validate_pydantic_model_format classmethod

validate_pydantic_model_format(v: str) -> str

Validate Pydantic-AI model name format.

Source code in src/egregora/config/settings.py
@field_validator("writer", "enricher", "enricher_vision", "ranking", "editor", "reader")
@classmethod
def validate_pydantic_model_format(cls, v: str) -> str:
    """Validate Pydantic-AI model name format."""
    if not v.startswith(("google-gla:", "openrouter:")):
        msg = (
            f"Invalid Pydantic-AI model format: {v!r}\n"
            f"Expected format: 'google-gla:<model-name>' or 'openrouter:<model-name>'\n"
            f"Examples:\n"
            f"  - google-gla:gemini-2.0-flash-exp\n"
            f"  - google-gla:gemini-flash-latest\n"
            f"  - google-gla:gemini-1.5-pro\n"
            f"  - openrouter:openai/gpt-4o"
        )
        raise ValueError(msg)
    return v

validate_google_model_format classmethod

validate_google_model_format(v: str) -> str

Validate Google GenAI SDK model name format.

Source code in src/egregora/config/settings.py
@field_validator("embedding", "banner")
@classmethod
def validate_google_model_format(cls, v: str) -> str:
    """Validate Google GenAI SDK model name format."""
    if not v.startswith("models/"):
        msg = (
            f"Invalid Google GenAI model format: {v!r}\n"
            f"Expected format: 'models/<model-name>'\n"
            f"Examples:\n"
            f"  - models/gemini-embedding-001\n"
            f"  - models/gemini-2.0-flash-exp"
        )
        raise ValueError(msg)
    return v
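The two validators above enforce different prefix conventions for different fields. The following stdlib-only sketch shows the same distinction outside Pydantic. The function name `model_id_kind` and its return labels are hypothetical and do not exist in Egregora.

```python
# Prefixes taken from the two validators above.
PYDANTIC_AI_PREFIXES = ("google-gla:", "openrouter:")
GOOGLE_SDK_PREFIX = "models/"


def model_id_kind(model_id: str) -> str:
    """Classify a model identifier by the prefix convention it follows."""
    # str.startswith accepts a tuple and matches any of its entries.
    if model_id.startswith(PYDANTIC_AI_PREFIXES):
        return "pydantic-ai"
    if model_id.startswith(GOOGLE_SDK_PREFIX):
        return "google-genai"
    return "unknown"
```

With this helper, `google-gla:gemini-flash-latest` is classified as a Pydantic-AI identifier and `models/gemini-embedding-001` as a Google GenAI SDK identifier. A bare name such as `gemini-1.5-pro` matches neither convention, which corresponds to the case both validators reject.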

ImageGenerationSettings

Bases: BaseModel

Configuration for image generation requests.

RAGSettings

Bases: BaseModel

Retrieval-Augmented Generation (RAG) configuration.

⭐ MAGICAL FEATURE: Contextual Memory

This is one of the three features that make Egregora special. Posts reference previous discussions, creating connected narratives.

Uses LanceDB for vector storage and similarity search. Embedding API uses dual-queue router for optimal throughput.

validate_top_k classmethod

validate_top_k(v: int) -> int

Validate top_k is reasonable and warn if too high.

Source code in src/egregora/config/settings.py
@field_validator("top_k")
@classmethod
def validate_top_k(cls, v: int) -> int:
    """Validate top_k is reasonable and warn if too high."""
    if v > RAG_TOP_K_WARNING_THRESHOLD:
        logger.warning(
            "RAG top_k=%s is unusually high. "
            "Consider values between 5-10 for better performance and relevance.",
            v,
        )
    return v
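Note that this validator warns but never rejects: the value is returned unchanged even when it exceeds the threshold. A minimal sketch of that pattern follows. The threshold value of 20 is an assumption for illustration, since the real `RAG_TOP_K_WARNING_THRESHOLD` is not shown on this page.

```python
import logging

logger = logging.getLogger("rag_settings_sketch")

# Assumed value; the actual RAG_TOP_K_WARNING_THRESHOLD is not shown above.
ASSUMED_TOP_K_WARNING_THRESHOLD = 20


def check_top_k(value: int) -> int:
    """Log a warning for unusually high values, but always return the value unchanged."""
    if value > ASSUMED_TOP_K_WARNING_THRESHOLD:
        logger.warning("RAG top_k=%s is unusually high.", value)
    return value
```

Because nothing is raised, a very large `top_k` still loads successfully. The warning is advisory only.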

WriterAgentSettings

Bases: BaseModel

Blog post writer configuration.

PrivacySettings

Bases: BaseModel

Privacy and PII configuration.

EnrichmentSettings

Bases: BaseModel

Enrichment settings for URLs and media.

PipelineSettings

Bases: BaseModel

Pipeline execution settings.

PathsSettings

Bases: BaseModel

Site directory paths configuration.

All paths are relative to site_root (output directory). Provides defaults that match the standard .egregora/ structure.

validate_safe_path classmethod

validate_safe_path(v: str) -> str

Validate path is relative and does not contain traversal sequences.

Source code in src/egregora/config/settings.py
@field_validator(
    "egregora_dir",
    "rag_dir",
    "lancedb_dir",
    "prompts_dir",
    "docs_dir",
    "posts_dir",
    "profiles_dir",
    "media_dir",
    "journal_dir",
    "cache_dir",
    mode="after",
)
@classmethod
def validate_safe_path(cls, v: str) -> str:
    """Validate path is relative and does not contain traversal sequences."""
    if not v:
        return v
    path = Path(v)

    if any(part == ".." for part in path.parts):
        msg = f"Path must not contain traversal sequences ('..'): {v}"
        raise ValueError(msg)
    return v
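The listing above rejects `..` components but, as written, does not test whether the path is absolute, although the docstring mentions relative paths. The sketch below shows how both checks could be combined. The function `is_safe_relative_path` is hypothetical, and `PurePosixPath` is used only so the example behaves the same on every platform.

```python
from pathlib import PurePosixPath


def is_safe_relative_path(value: str) -> bool:
    """Return True for empty values and for relative paths without '..' components."""
    if not value:
        return True  # empty values pass through, as in the validator above
    path = PurePosixPath(value)
    if path.is_absolute():
        return False  # stricter than the listing: absolute paths are rejected too
    # Compare whole components, so a name like "..hidden" is still allowed.
    return ".." not in path.parts
```

Checking `path.parts` rather than searching the raw string avoids false positives on file names that merely contain two dots.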

OutputAdapterConfig

Bases: BaseModel

Configuration for a single output adapter.

Each adapter represents a target format (e.g., MkDocs, Hugo) with its own configuration file.

OutputSettings

Bases: BaseModel

Output adapter registry.

Registers all output adapters used for this site. Each adapter has a type and optional config path.

SourceOverrideSettings

Bases: BaseModel

Per-source overrides for pipeline execution.

Allows fine-grained control of windowing, enrichment, and date ranges without requiring repeated CLI flags.

SourceSettings

Bases: BaseModel

Configuration for a single input source.

SiteSettings

Bases: BaseModel

Site-level configuration including configured sources.

DatabaseSettings

Bases: BaseModel

Database configuration for pipeline and observability.

All values must be valid Ibis connection URIs (e.g. DuckDB, Postgres, SQLite).

ReaderSettings

Bases: BaseModel

Reader agent configuration for post evaluation and ranking.

⭐ MAGICAL FEATURE: Content Discovery

This is one of the three features that make Egregora special. It should be enabled by default for 95% of users.

TaxonomySettings

Bases: BaseModel

Semantic taxonomy generation settings.

After posts are generated, clusters similar posts and assigns consistent tags using LLM analysis. Uses K-Means clustering.

FeaturesSettings

Bases: BaseModel

Feature flags for experimental or optional functionality.

QuotaSettings

Bases: BaseModel

Configuration for LLM usage budgets and concurrency.

ProfileSettings

Bases: BaseModel

Configuration for profile generation agent.

⭐ MAGICAL FEATURE: Author Profiles

This is one of the three features that make Egregora special. Creates loving portraits of people from their messages - storytelling, not analytics. Always enabled (no opt-out flag).

EgregoraConfig

Bases: BaseSettings

Root configuration for Egregora.

This model defines the complete .egregora.toml schema.

Supports environment variable overrides with the pattern: EGREGORA_SECTION__KEY (e.g., EGREGORA_MODELS__WRITER)

validate_cross_field

validate_cross_field() -> EgregoraConfig

Validate cross-field dependencies and warn about potential issues.

Source code in src/egregora/config/settings.py
@model_validator(mode="after")
def validate_cross_field(self) -> EgregoraConfig:
    """Validate cross-field dependencies and warn about potential issues."""
    # If RAG is enabled, ensure lancedb_dir is set
    if self.rag.enabled and not self.paths.lancedb_dir:
        msg = (
            "RAG is enabled (rag.enabled=true) but paths.lancedb_dir is not set. "
            "Set paths.lancedb_dir to a valid directory path."
        )
        raise ValueError(msg)

    # Warn about very high max_prompt_tokens
    if self.pipeline.max_prompt_tokens > MAX_PROMPT_TOKENS_WARNING_THRESHOLD:
        logger.warning(
            "pipeline.max_prompt_tokens=%s exceeds most model limits. "
            "Consider using pipeline.use_full_context_window=true instead of setting a high token limit.",
            self.pipeline.max_prompt_tokens,
        )

    # Log (at info level) when use_full_context_window is enabled
    if self.pipeline.use_full_context_window:
        logger.info(
            "pipeline.use_full_context_window=true. Using full model context window "
            "(overrides max_prompt_tokens setting)."
        )

    return self

from_cli_overrides classmethod

from_cli_overrides(
    base_config: EgregoraConfig, **cli_args: Any
) -> EgregoraConfig

Create a new config instance with CLI overrides applied.

Handles nested updates for pipeline, enrichment, rag, etc. CLI arguments are expected to be flat key-value pairs or dicts matching the argument structure of CLI commands.

Source code in src/egregora/config/settings.py
@classmethod
def from_cli_overrides(cls, base_config: EgregoraConfig, **cli_args: Any) -> EgregoraConfig:
    """Create a new config instance with CLI overrides applied.

    Handles nested updates for pipeline, enrichment, rag, etc.
    CLI arguments are expected to be flat key-value pairs or dicts
    matching the argument structure of CLI commands.
    """
    # Apply pipeline settings overrides
    pipeline_overrides = {}
    for key in [
        "step_size",
        "step_unit",
        "overlap_ratio",
        "max_prompt_tokens",
        "use_full_context_window",
    ]:
        if key in cli_args and cli_args[key] is not None:
            pipeline_overrides[key] = cli_args[key]

    if cli_args.get("timezone") is not None:
        pipeline_overrides["timezone"] = str(cli_args["timezone"])

    from_date = cli_args.get("from_date")
    if from_date:
        pipeline_overrides["from_date"] = from_date.isoformat()

    to_date = cli_args.get("to_date")
    if to_date:
        pipeline_overrides["to_date"] = to_date.isoformat()

    # Apply enrichment settings overrides
    enrichment_overrides = {}
    if "enable_enrichment" in cli_args and cli_args["enable_enrichment"] is not None:
        enrichment_overrides["enabled"] = cli_args["enable_enrichment"]

    # Apply rag settings overrides
    rag_overrides: dict[str, Any] = {}

    # Apply model overrides
    model_overrides = {}
    if cli_args.get("model"):
        model = cli_args["model"]
        model_overrides = {
            "writer": model,
            "enricher": model,
            "enricher_vision": model,
            "ranking": model,
            "editor": model,
        }

    # Construct updates
    updates = {}
    if pipeline_overrides:
        updates["pipeline"] = base_config.pipeline.model_copy(update=pipeline_overrides)
    if enrichment_overrides:
        updates["enrichment"] = base_config.enrichment.model_copy(update=enrichment_overrides)
    if rag_overrides:
        updates["rag"] = base_config.rag.model_copy(update=rag_overrides)
    if model_overrides:
        updates["models"] = base_config.models.model_copy(update=model_overrides)

    return base_config.model_copy(update=updates)
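The override logic above follows a copy-with-updates pattern: values that are `None` are ignored, and the base configuration is never mutated. The sketch below reproduces that pattern with standard-library dataclasses instead of Pydantic. `PipelineSketch`, `ConfigSketch`, `apply_cli_overrides`, and the default values are all hypothetical.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass(frozen=True)
class PipelineSketch:
    step_size: int = 1  # assumed default, for illustration only
    step_unit: str = "days"  # assumed default, for illustration only


@dataclass(frozen=True)
class ConfigSketch:
    pipeline: PipelineSketch = field(default_factory=PipelineSketch)


def apply_cli_overrides(base: ConfigSketch, **cli_args: Any) -> ConfigSketch:
    """Return a copy of base with non-None CLI values applied; base is left untouched."""
    overrides = {
        key: cli_args[key]
        for key in ("step_size", "step_unit")
        if cli_args.get(key) is not None  # None means "flag not given"
    }
    # replace() builds new objects, mirroring model_copy(update=...) in the listing.
    return replace(base, pipeline=replace(base.pipeline, **overrides))
```

One difference is worth keeping in mind. Per the Pydantic documentation, `model_copy(update=...)` does not validate the updated values, so overrides applied this way are not re-checked by the field validators shown earlier on this page.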

RuntimeContext dataclass

RuntimeContext(
    output_dir: Annotated[Path, "Directory for the generated site"],
    input_file: Annotated[Path | None, "Path to the chat export file"] = None,
    model_override: Annotated[str | None, "Model override from CLI"] = None,
    debug: Annotated[bool, "Enable debug logging"] = False,
)

Runtime-only context that cannot be persisted to config file.

This is the minimal set of fields that are truly runtime-specific:

  • Paths resolved at invocation time
  • Debug flags

API keys are read directly from environment variables by pydantic-ai/genai. All other configuration lives in EgregoraConfig (single source of truth).

input_path property

input_path: Path | None

Alias for input_file (source-agnostic naming).

WriterRuntimeConfig dataclass

WriterRuntimeConfig(
    posts_dir: Annotated[Path, "Directory to save posts"],
    profiles_dir: Annotated[Path, "Directory to save profiles"],
    rag_dir: Annotated[Path, "Directory for RAG data"],
    model_config: Annotated[object | None, "Model configuration"] = None,
    enable_rag: Annotated[bool, "Enable RAG"] = True,
)

Runtime configuration for post writing (not the Pydantic WriterAgentSettings model).

MediaEnrichmentContext dataclass

MediaEnrichmentContext(
    media_type: Annotated[str, "The type of media (e.g., 'image', 'video')"],
    media_filename: Annotated[str, "The filename of the media"],
    author: Annotated[str, "The author of the message containing the media"],
    timestamp: Annotated[str, "The timestamp of the message"],
    nearby_messages: Annotated[str, "Messages sent before and after the media"],
    ocr_text: Annotated[str, "Text extracted from the media via OCR"] = "",
    detected_objects: Annotated[str, "Objects detected in the media"] = "",
)

Context for media enrichment prompts.

EnrichmentRuntimeConfig dataclass

EnrichmentRuntimeConfig(
    client: Annotated[object, "The Gemini client"],
    output_dir: Annotated[Path, "The directory to save enriched data"],
    model: Annotated[
        str, "The Gemini model to use for enrichment"
    ] = ModelDefaults.ENRICHER,
)

Runtime configuration for enrichment operations.

PipelineEnrichmentConfig dataclass

PipelineEnrichmentConfig(
    batch_threshold: int = 10,
    max_enrichments: int = 500,
    enable_url: bool = True,
    enable_media: bool = True,
)

Extended enrichment configuration for pipeline operations.

Extends basic enrichment config with pipeline-specific settings.

__post_init__

__post_init__() -> None

Validate configuration after initialization.

Source code in src/egregora/config/settings.py
def __post_init__(self) -> None:
    """Validate configuration after initialization."""
    if self.batch_threshold < 1:
        msg = f"batch_threshold must be >= 1, got {self.batch_threshold}"
        raise InvalidEnrichmentConfigError(msg)
    if self.max_enrichments < 0:
        msg = f"max_enrichments must be >= 0, got {self.max_enrichments}"
        raise InvalidEnrichmentConfigError(msg)

from_cli_args classmethod

from_cli_args(**kwargs: Any) -> PipelineEnrichmentConfig

Create config from CLI arguments.

Source code in src/egregora/config/settings.py
@classmethod
def from_cli_args(cls, **kwargs: Any) -> PipelineEnrichmentConfig:
    """Create config from CLI arguments."""
    return cls(
        batch_threshold=int(kwargs.get("batch_threshold", 10)),
        max_enrichments=int(kwargs.get("max_enrichments", 500)),
        enable_url=bool(kwargs.get("enable_url", True)),
        enable_media=bool(kwargs.get("enable_media", True)),
    )
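`from_cli_args` coerces its flags with `bool()`, which is correct when the caller passes real booleans. If a caller ever passes strings (an assumption; a CLI framework would normally have converted them already), `bool("false")` evaluates to `True` because any non-empty string is truthy. The helper `coerce_flag` below is a hypothetical sketch of a stricter conversion and is not part of Egregora.

```python
from typing import Any

# Spellings treated as true; this set is a choice made for the sketch.
TRUE_STRINGS = {"1", "true", "yes", "on"}


def coerce_flag(value: Any) -> bool:
    """Interpret common string spellings of booleans; fall back to bool() otherwise."""
    if isinstance(value, str):
        return value.strip().lower() in TRUE_STRINGS
    return bool(value)
```

With this helper, `"false"` and `"0"` map to `False`, while real booleans pass through unchanged.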

find_egregora_config

find_egregora_config(start_dir: Path, *, site: str | None = None) -> Path

Search upward for .egregora.toml.

Parameters:

Name       Type        Description                                         Default
start_dir  Path        Starting directory for upward search                required
site       str | None  Optional site identifier (reserved for future use)  None

Returns:

Type  Description
Path  Path to config file if found

Raises:

Type                 Description
ConfigNotFoundError  If the config file cannot be found

Source code in src/egregora/config/settings.py
def find_egregora_config(start_dir: Path, *, site: str | None = None) -> Path:
    """Search upward for .egregora.toml.

    Args:
        start_dir: Starting directory for upward search
        site: Optional site identifier (reserved for future use)

    Returns:
        Path to config file if found

    Raises:
        ConfigNotFoundError: If the config file cannot be found

    """
    current = start_dir.expanduser().resolve()
    for candidate in (current, *current.parents):
        toml_path = candidate / ".egregora.toml"
        if toml_path.exists():
            return toml_path

    raise ConfigNotFoundError(start_dir)

load_egregora_config

load_egregora_config(
    site_root: Path | None = None, *, site: str | None = None
) -> EgregoraConfig

Load Egregora configuration from .egregora.toml.

Configuration priority (highest to lowest):

  1. CLI (applied via from_cli_overrides later)
  2. Environment variables (EGREGORA_SECTION__KEY)
  3. Config file (.egregora.toml)
  4. Defaults

Parameters:

Name       Type         Description                                                           Default
site_root  Path | None  Root directory of the site. If None, uses current working directory.  None
site       str | None   Optional site identifier to select from the sites mapping.            None

Returns:

Type            Description
EgregoraConfig  Validated EgregoraConfig instance

Raises:

Type                   Description
ConfigValidationError  If config file contains invalid data
ConfigNotFoundError    If the config file cannot be found and a default one is not created.

Source code in src/egregora/config/settings.py
def load_egregora_config(site_root: Path | None = None, *, site: str | None = None) -> EgregoraConfig:
    """Load Egregora configuration from .egregora.toml.

    Configuration priority (highest to lowest):
    1. CLI (applied via from_cli_overrides later)
    2. Environment variables (EGREGORA_SECTION__KEY)
    3. Config file (.egregora.toml)
    4. Defaults

    Args:
        site_root: Root directory of the site. If None, uses current working directory.
        site: Optional site identifier to select from the sites mapping.

    Returns:
        Validated EgregoraConfig instance

    Raises:
        ConfigValidationError: If config file contains invalid data
        ConfigNotFoundError: If the config file cannot be found and a default one is not created.

    """
    if site_root is None:
        site_root = Path.cwd()

    try:
        config_path = find_egregora_config(site_root, site=site)
        logger.info("Loading config from %s", config_path)
    except ConfigNotFoundError:
        logger.info("No configuration found, creating default config at %s", site_root / ".egregora.toml")
        return create_default_config(site_root, site=site or DEFAULT_SITE_NAME)

    try:
        raw_config = config_path.read_text(encoding="utf-8")
        file_data = tomllib.loads(raw_config)

        # Create base config with defaults (environment overrides applied)
        base_config = EgregoraConfig()
        base_dict = base_config.model_dump(mode="json")

        # Merge file config into base, skipping keys that are set in env vars
        env_override_paths = _collect_env_override_paths()
        selected_site, site_data = _normalize_sites_config(file_data, site=site)

        env_override_paths_with_site = set(env_override_paths)
        for path in env_override_paths:
            if path and path[0] == "sites":
                env_override_paths_with_site.add(path)
            else:
                env_override_paths_with_site.add(("sites", selected_site, *path))

        wrapped_base = {"sites": {selected_site: base_dict}}
        wrapped_override = {"sites": {selected_site: site_data}}
        merged = _merge_config(wrapped_base, wrapped_override, env_override_paths_with_site)

        selected_config = merged["sites"][selected_site]
        return EgregoraConfig.model_validate(selected_config)

    except ValidationError as e:
        logger.exception("Configuration validation failed for %s:", config_path)
        for error in e.errors():
            loc = " -> ".join(str(location_part) for location_part in error["loc"])
            logger.exception("  %s: %s", loc, error["msg"])
        raise ConfigValidationError(e.errors()) from e
    except (OSError, ValueError) as e:
        logger.exception("Failed to read or parse config from %s", config_path)
        msg = f"Failed to process config file: {e}"
        raise ConfigError(msg) from e
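The loader builds a base dictionary from defaults plus environment overrides, then overlays the file contents while skipping any key path that an environment variable has already set. The helper `_merge_config` is not shown on this page, so the sketch below is only one plausible way to implement such a merge. `merge_skipping` and the sample values are hypothetical.

```python
from typing import Any

KeyPath = tuple[str, ...]


def merge_skipping(
    base: dict[str, Any],
    override: dict[str, Any],
    protected: set[KeyPath],
    prefix: KeyPath = (),
) -> dict[str, Any]:
    """Recursively overlay override on base, leaving protected key paths at their base value."""
    merged = dict(base)
    for key, value in override.items():
        path = (*prefix, key)
        if path in protected:
            continue  # an environment variable already set this value
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_skipping(merged[key], value, protected, path)
        else:
            merged[key] = value
    return merged


# The base already contains defaults plus environment overrides.
base = {"models": {"writer": "from-env", "editor": "default"}}
file_data = {"models": {"writer": "from-file", "editor": "from-file"}}
result = merge_skipping(base, file_data, {("models", "writer")})
```

In this example the file value wins for `editor`, while `writer` keeps the environment value, which matches the documented priority of environment variables over the config file.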

create_default_config

create_default_config(
    site_root: Path, *, site: str = DEFAULT_SITE_NAME
) -> EgregoraConfig

Create default .egregora.toml and return it.

Parameters:

Name       Type  Description                                       Default
site_root  Path  Root directory of the site                        required
site       str   Site identifier to write under the sites mapping  DEFAULT_SITE_NAME

Returns:

Type            Description
EgregoraConfig  EgregoraConfig with all defaults

Source code in src/egregora/config/settings.py
def create_default_config(site_root: Path, *, site: str = DEFAULT_SITE_NAME) -> EgregoraConfig:
    """Create default .egregora.toml and return it.

    Args:
        site_root: Root directory of the site
        site: Site identifier to write under the sites mapping

    Returns:
        EgregoraConfig with all defaults

    """
    config = EgregoraConfig()  # All defaults from Pydantic
    save_egregora_config(config, site_root, site=site)
    logger.info("Created default config at %s/.egregora.toml", site_root)
    return config

save_egregora_config

save_egregora_config(
    config: EgregoraConfig, site_root: Path, *, site: str = DEFAULT_SITE_NAME
) -> Path

Save EgregoraConfig to .egregora.toml in site_root.

Parameters:

Name       Type            Description                                       Default
config     EgregoraConfig  EgregoraConfig instance to save                   required
site_root  Path            Root directory of the site                        required
site       str             Site identifier to write under the sites mapping  DEFAULT_SITE_NAME

Returns:

Type  Description
Path  Path to the saved config file

Source code in src/egregora/config/settings.py
def save_egregora_config(config: EgregoraConfig, site_root: Path, *, site: str = DEFAULT_SITE_NAME) -> Path:
    """Save EgregoraConfig to .egregora.toml in site_root.

    Args:
        config: EgregoraConfig instance to save
        site_root: Root directory of the site
        site: Site identifier to write under the sites mapping

    Returns:
        Path to the saved config file

    """
    config_path = site_root / ".egregora.toml"
    config_path.parent.mkdir(parents=True, exist_ok=True)

    # Export as dict under the selected site mapping
    data = {"sites": {site: config.model_dump(exclude_defaults=False, mode="json")}}

    # Remove None values as tomli_w doesn't support them
    def _clean_nones(d: dict[str, Any]) -> dict[str, Any]:
        cleaned = {}
        for k, v in d.items():
            if v is None:
                continue
            if isinstance(v, dict):
                v = _clean_nones(v)
            elif isinstance(v, list):
                v = [_clean_nones(item) if isinstance(item, dict) else item for item in v if item is not None]
            cleaned[k] = v
        return cleaned

    data = _clean_nones(data)

    # Write as TOML
    toml_str = tomli_w.dumps(data)

    config_path.write_text(toml_str, encoding="utf-8")
    logger.debug("Saved config to %s", config_path)

    return config_path
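TOML has no null value, which is why `None` entries are stripped before serialization. The standalone copy of the cleaning step below can be run without `tomli_w`. The sample dictionary is invented for the example and does not reflect the real schema.

```python
from typing import Any


def clean_nones(data: dict[str, Any]) -> dict[str, Any]:
    """Drop None values from nested dicts and lists, as the nested helper above does."""
    cleaned: dict[str, Any] = {}
    for key, value in data.items():
        if value is None:
            continue  # omit the key entirely
        if isinstance(value, dict):
            value = clean_nones(value)
        elif isinstance(value, list):
            value = [
                clean_nones(item) if isinstance(item, dict) else item
                for item in value
                if item is not None
            ]
        cleaned[key] = value
    return cleaned


# Hypothetical data shaped like the {"sites": {site: ...}} wrapper used when saving.
sample = {
    "sites": {
        "default": {
            "pipeline": {"from_date": None, "step_size": 1},
            "tags": ["a", None],
        }
    }
}
cleaned = clean_nones(sample)
```

A likely consequence is that a field set to `None` is simply absent from the saved file, so it falls back to its default when the file is loaded again.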

parse_date_arg

parse_date_arg(date_str: str, _arg_name: str = 'date') -> date

Parse a date string in YYYY-MM-DD format.

Parameters:

Name       Type  Description                                                             Default
date_str   str   Date string in YYYY-MM-DD format                                        required
_arg_name  str   Name of the argument (intended for error messages; currently unused)   'date'

Returns:

Type  Description
date  Parsed date object (timezone-naive; the UTC tag applied during parsing is dropped by .date())

Raises:

Type                    Description
InvalidDateFormatError  If date_str is not in YYYY-MM-DD format

Source code in src/egregora/config/settings.py
def parse_date_arg(date_str: str, _arg_name: str = "date") -> date:
    """Parse a date string in YYYY-MM-DD format.

    Args:
        date_str: Date string in YYYY-MM-DD format
        _arg_name: Name of the argument (intended for error messages; currently unused)

    Returns:
        Parsed date object (timezone-naive; the UTC tag is dropped by .date())

    Raises:
        InvalidDateFormatError: If date_str is not in YYYY-MM-DD format

    """
    try:
        return datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=UTC).date()
    except ValueError as e:
        raise InvalidDateFormatError(date_str) from e
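Two details of the parsing call are easy to miss. First, `strptime` treats the leading zero of `%m` and `%d` as optional, so `2024-3-9` is accepted as well as `2024-03-09`. Second, a `date` carries no timezone, so attaching UTC before calling `.date()` does not change the result. The sketch below uses the same format string. `parse_iso_day` is a hypothetical name, and it raises a plain `ValueError` rather than `InvalidDateFormatError`.

```python
from datetime import date, datetime


def parse_iso_day(date_str: str) -> date:
    """Parse with the same strptime pattern as the listing above."""
    # strptime raises ValueError when the string does not match the pattern.
    return datetime.strptime(date_str, "%Y-%m-%d").date()
```

If strictly zero-padded input is required, an additional length or pattern check would be needed on top of `strptime`.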

validate_timezone

validate_timezone(timezone_str: str) -> ZoneInfo

Validate timezone string and return ZoneInfo object.

Parameters:

Name          Type  Description                                            Default
timezone_str  str   Timezone identifier (e.g., 'America/New_York', 'UTC')  required

Returns:

Type      Description
ZoneInfo  ZoneInfo object for the specified timezone

Raises:

Type                  Description
InvalidTimezoneError  If timezone_str is not a valid timezone identifier

Source code in src/egregora/config/settings.py
def validate_timezone(timezone_str: str) -> ZoneInfo:
    """Validate timezone string and return ZoneInfo object.

    Args:
        timezone_str: Timezone identifier (e.g., 'America/New_York', 'UTC')

    Returns:
        ZoneInfo object for the specified timezone

    Raises:
        InvalidTimezoneError: If timezone_str is not a valid timezone identifier

    """
    try:
        return ZoneInfo(timezone_str)
    except Exception as e:
        raise InvalidTimezoneError(timezone_str, e) from e