Data Primitives API¶
The data primitives module provides the core data structures used throughout Egregora.
Overview¶
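All content produced by the Egregora pipeline is represented as Document instances. When no explicit ID or semantic slug applies, a document falls back to a content-addressed ID (a UUID v5 derived from a hash of its content), which gives it a deterministic identity and allows deduplication.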
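The content-addressed scheme can be illustrated with the standard library's `uuid.uuid5`. This is a minimal sketch rather than Egregora's actual code: the namespace constant, the choice of SHA-256, and the helper name `content_uuid` are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
import uuid

# Hypothetical namespace; the real pipeline may use a different one.
EXAMPLE_NAMESPACE = uuid.NAMESPACE_URL


def content_uuid(content: str | bytes) -> str:
    """Derive a deterministic UUID v5 from a hash of the content."""
    raw = content.encode("utf-8") if isinstance(content, str) else content
    digest = hashlib.sha256(raw).hexdigest()
    return str(uuid.uuid5(EXAMPLE_NAMESPACE, digest))


# Identical content yields the same ID; different content yields a different one.
first = content_uuid("# Hello\n")
second = content_uuid("# Hello\n")
other = content_uuid("# Goodbye\n")
```

Because the ID depends only on the content, two documents with identical bodies collapse to one identity, which is what makes deduplication possible.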
Document¶
Document dataclass ¶
Document(content: str | bytes, type: DocumentType, metadata: dict[str, Any] = dict(), id: str | None = None, parent_id: str | None = None, parent: Document | None = None, created_at: datetime = (lambda: datetime.now(UTC))(), source_window: str | None = None, suggested_path: str | None = None)
Content-addressed document produced by the pipeline.
Core abstraction for all generated content.
V3 CHANGE (2025-11-28): Adopts "Semantic Identity".

- Posts: ID = Slug
- Media: ID = Semantic Slug
- Others: ID = UUID (or specific logic)
Attributes:
| Name | Type | Description |
|---|---|---|
| `content` | `str \| bytes` | Markdown (str) or binary (bytes) content of the document |
| `type` | `DocumentType` | Type of document (post, profile, journal, enrichment, media) |
| `metadata` | `dict[str, Any]` | Format-agnostic metadata (title, date, author, etc.) |
| `id` | `str \| None` | Explicit ID override (Semantic Identity) |
| `parent_id` | `str \| None` | Document ID of parent (for enrichments) |
| `parent` | `Document \| None` | Optional in-memory parent Document reference |
| `created_at` | `datetime` | Timestamp when document was created |
| `source_window` | `str \| None` | Window label if from windowed pipeline |
| `suggested_path` | `str \| None` | Optional hint for output format (not authoritative) |
Attributes¶
document_id property ¶
Return the document's stable identifier.
Strategy (V3):

1. Explicit ID (`self.id`)
2. Semantic Slug (for POST/MEDIA if present)
3. Content-based UUIDv5 (Fallback)
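The resolution order can be sketched with a toy stand-in class. Everything here is hypothetical: `SketchDocument`, its `kind` field, the `"slug"` metadata key, and the hashing details are assumptions for illustration, not the real `Document` implementation.

```python
from __future__ import annotations

import hashlib
import uuid
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class SketchDocument:
    """Toy stand-in for Document with only the fields needed for ID resolution."""

    content: str
    kind: str  # e.g. "post", "media", "profile" (hypothetical plain-string type)
    metadata: dict[str, Any] = field(default_factory=dict)
    id: str | None = None

    @property
    def document_id(self) -> str:
        # 1. An explicit ID always wins.
        if self.id:
            return self.id
        # 2. Posts and media use a slug when one is present (assumed metadata key).
        slug = self.metadata.get("slug")
        if self.kind in {"post", "media"} and slug:
            return str(slug)
        # 3. Otherwise fall back to a UUID v5 derived from a hash of the content.
        digest = hashlib.sha256(self.content.encode("utf-8")).hexdigest()
        return str(uuid.uuid5(uuid.NAMESPACE_URL, digest))


explicit = SketchDocument("body", "post", {"slug": "my-post"}, id="custom-id")
slugged = SketchDocument("body", "post", {"slug": "my-post"})
fallback = SketchDocument("body", "profile", {"slug": "ignored"})
```

In this sketch `explicit` resolves to its override, `slugged` resolves to its slug, and `fallback` ignores the slug because it is neither a post nor media.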
Functions¶
with_parent ¶
Return new document with parent relationship.
Source code in src/egregora/data_primitives/document.py
with_metadata ¶
Return new document with updated metadata.
Source code in src/egregora/data_primitives/document.py
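Both helpers return a new document instead of mutating the original. One plausible way to get that behaviour is `dataclasses.replace` on a frozen dataclass. The sketch below assumes that approach; the class name, the keyword-argument signature of `with_metadata`, and the way `parent_id` is derived are illustrative guesses, not the code in `document.py`.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace
from typing import Any


@dataclass(frozen=True)
class SketchDocument:
    """Toy stand-in for Document; field names follow the attribute table."""

    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    id: str | None = None
    parent_id: str | None = None
    parent: SketchDocument | None = None

    def with_parent(self, parent: SketchDocument) -> SketchDocument:
        # Copy the document, recording both the reference and the parent's ID.
        return replace(self, parent=parent, parent_id=parent.id)

    def with_metadata(self, **updates: Any) -> SketchDocument:
        # Merge into a fresh dict so the original metadata stays untouched.
        return replace(self, metadata={**self.metadata, **updates})


post = SketchDocument("Post body", {"title": "Draft"}, id="my-post")
note = SketchDocument("Enrichment text").with_parent(post)
retitled = post.with_metadata(title="Final")
```

After these calls `post` still carries its original title, while `retitled` is a separate object with the updated metadata.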
DocumentType¶
DocumentType ¶
Bases: Enum
Types of documents in the Egregora pipeline.
Each document type represents a distinct kind of content that may have different storage conventions in different output formats.
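As a rough illustration, an enum of this kind might pair with a per-format lookup that decides where each type is stored. The member list below is taken from the type names mentioned in the `Document` attribute table and may not match the real enum; the string values and directory names are hypothetical.

```python
from enum import Enum


class SketchDocumentType(Enum):
    """Illustrative stand-in for DocumentType; the real members may differ."""

    POST = "post"
    PROFILE = "profile"
    JOURNAL = "journal"
    ENRICHMENT = "enrichment"
    MEDIA = "media"


# Hypothetical storage convention for a single output format.
OUTPUT_DIRS = {
    SketchDocumentType.POST: "posts",
    SketchDocumentType.PROFILE: "profiles",
    SketchDocumentType.JOURNAL: "journal",
    SketchDocumentType.ENRICHMENT: "enrichments",
    SketchDocumentType.MEDIA: "media",
}

# Look up a member by value, then find its storage directory.
post_dir = OUTPUT_DIRS[SketchDocumentType("post")]
```

Keeping the mapping outside the enum lets each output format define its own layout without changing the document types themselves.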
DocumentCollection¶
DocumentCollection dataclass ¶
Batch of documents produced by a single operation (e.g., one window).
MediaAsset¶
MediaAsset dataclass ¶
MediaAsset(content: str | bytes, type: DocumentType, metadata: dict[str, Any] = dict(), id: str | None = None, parent_id: str | None = None, parent: Document | None = None, created_at: datetime = (lambda: datetime.now(UTC))(), source_window: str | None = None, suggested_path: str | None = None)