Data Primitives API

The data primitives module provides the core data structures used throughout Egregora.

Overview

All content produced by the Egregora pipeline is represented as Document instances. Document IDs are deterministic: an explicit ID or a semantic slug is used when one is available, and otherwise the ID falls back to a UUID v5 of the content hash, which also deduplicates identical content.

Document

Document dataclass

Python
Document(content: str | bytes, type: DocumentType, metadata: dict[str, Any] = dict(), id: str | None = None, parent_id: str | None = None, parent: Document | None = None, created_at: datetime = (lambda: datetime.now(UTC))(), source_window: str | None = None, suggested_path: str | None = None)

Content-addressed document produced by the pipeline.

Core abstraction for all generated content.

V3 CHANGE (2025-11-28): Adopts "Semantic Identity".

- Posts: ID = Slug
- Media: ID = Semantic Slug
- Others: ID = UUID (or specific logic)

Examples:

Python Console Session
>>> # Create a post document with semantic ID
>>> doc = Document(
...     content="# My Post...",
...     type=DocumentType.POST,
...     metadata={"slug": "my-post"},
... )
>>> doc.document_id
'my-post'
Python Console Session
>>> # Profiles fall back to a UUID unless an explicit ID is given
>>> doc = Document(
...     content="...",
...     type=DocumentType.PROFILE,
...     id="abc-123", # Explicit ID
... )
>>> doc.document_id
'abc-123'

Attributes:

content (str | bytes): Markdown (str) or binary (bytes) content of the document

type (DocumentType): Type of document (post, profile, journal, enrichment, media)

metadata (dict[str, Any]): Format-agnostic metadata (title, date, author, etc.)

id (str | None): Explicit ID override (Semantic Identity)

parent_id (str | None): Document ID of parent (for enrichments)

parent (Document | None): Optional in-memory parent Document reference

created_at (datetime): Timestamp when document was created

source_window (str | None): Window label if from windowed pipeline

suggested_path (str | None): Optional hint for output format (not authoritative)

Attributes

document_id property
Python
document_id: str

Return the document's stable identifier.

Strategy (V3):

1. Explicit ID (self.id)
2. Semantic Slug (for POST/MEDIA if present)
3. Content-based UUIDv5 (Fallback)
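
The resolution order above can be sketched as a standalone function. This is a minimal illustration, not the library's implementation: `resolve_document_id`, `SKETCH_NAMESPACE`, the string type names, and the choice of SHA-256 for the content hash are all assumptions made for the example.

```python
from __future__ import annotations

import hashlib
import uuid

# Hypothetical namespace; the one Egregora actually uses is not shown in this page.
SKETCH_NAMESPACE = uuid.NAMESPACE_URL


def resolve_document_id(
    content: str | bytes,
    doc_type: str,
    metadata: dict,
    explicit_id: str | None = None,
) -> str:
    """Resolve an ID in the three-step order listed above (illustrative only)."""
    # 1. An explicit ID always wins.
    if explicit_id:
        return explicit_id
    # 2. Posts and media use their slug when one is present.
    slug = metadata.get("slug")
    if doc_type in {"post", "media"} and slug:
        return str(slug)
    # 3. Fallback: UUIDv5 of a content hash, so equal content yields an equal ID.
    raw = content.encode("utf-8") if isinstance(content, str) else content
    digest = hashlib.sha256(raw).hexdigest()
    return str(uuid.uuid5(SKETCH_NAMESPACE, digest))


post_id = resolve_document_id("# My Post...", "post", {"slug": "my-post"})
profile_id = resolve_document_id("...", "profile", {}, explicit_id="abc-123")
fallback_a = resolve_document_id("same text", "journal", {})
fallback_b = resolve_document_id("same text", "journal", {})
```

Because the fallback depends only on the content, `fallback_a` and `fallback_b` are identical, which is what makes deduplication possible for documents without a slug or explicit ID.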

slug property
Python
slug: str

Return a human-friendly identifier when available.

Functions

with_parent
Python
with_parent(parent: Document | str) -> Document

Return new document with parent relationship.

Source code in src/egregora/data_primitives/document.py
Python
def with_parent(self, parent: Document | str) -> Document:
    """Return new document with parent relationship."""
    parent_id = parent.document_id if isinstance(parent, Document) else parent
    parent_obj = parent if isinstance(parent, Document) else self.parent
    cls = self.__class__
    return cls(
        content=self.content,
        type=self.type,
        metadata=self.metadata.copy(),
        id=self.id,
        parent_id=parent_id,
        parent=parent_obj,
        created_at=self.created_at,
        source_window=self.source_window,
        suggested_path=self.suggested_path,
    )
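
The parent handling in `with_parent` can be tried in isolation with a reduced stand-in class. The sketch below is hedged: `SketchDoc` and its `doc_id` field are hypothetical, and making it a frozen dataclass is an assumption chosen to mirror the "return a new document" behaviour, not a statement about how `Document` itself is declared.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchDoc:
    """Hypothetical stand-in for Document with only the fields needed here."""

    doc_id: str
    parent_id: str | None = None
    parent: SketchDoc | None = None

    def with_parent(self, parent: SketchDoc | str) -> SketchDoc:
        # A document supplies both its ID and the object reference.
        if isinstance(parent, SketchDoc):
            return replace(self, parent_id=parent.doc_id, parent=parent)
        # A plain string supplies only the ID; any existing reference is kept.
        return replace(self, parent_id=parent)


media = SketchDoc(doc_id="photo-jpg")
enrichment = SketchDoc(doc_id="photo-description")

linked_by_object = enrichment.with_parent(media)
linked_by_id = enrichment.with_parent("photo-jpg")
```

Both calls set the same `parent_id`, but only the first keeps an in-memory reference to the parent. The original `enrichment` is left unchanged in either case.
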
with_metadata
Python
with_metadata(**updates: Any) -> Document

Return new document with updated metadata.

Source code in src/egregora/data_primitives/document.py
Python
def with_metadata(self, **updates: Any) -> Document:
    """Return new document with updated metadata."""
    new_metadata = self.metadata.copy()
    new_metadata.update(updates)
    cls = self.__class__
    return cls(
        content=self.content,
        type=self.type,
        metadata=new_metadata,
        id=self.id,
        parent_id=self.parent_id,
        parent=self.parent,
        created_at=self.created_at,
        source_window=self.source_window,
        suggested_path=self.suggested_path,
    )
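
Note that `self.metadata.copy()` is a shallow copy. The following sketch uses plain dictionaries (the keys and values are made up for illustration) to show what that means for callers: top-level keys of the original are protected, but nested mutable values are shared.

```python
original = {"title": "My Post", "tags": ["a"]}

# Same two steps as with_metadata: shallow-copy, then apply the updates.
updated = original.copy()
updated.update({"title": "Renamed", "draft": True})

# Top-level keys of the original are untouched by the update.
# Nested mutable values, such as the "tags" list, are still shared.
updated["tags"].append("b")
```

After this runs, `original["title"]` is still `"My Post"`, while `original["tags"]` has gained `"b"`. If nested metadata must stay isolated, one option is to pass a fresh list or dict in the update rather than mutating the existing one.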

DocumentType

DocumentType

Bases: Enum

Types of documents in the Egregora pipeline.

Each document type represents a distinct kind of content that may have different storage conventions in different output formats.

DocumentCollection

DocumentCollection dataclass

Python
DocumentCollection(documents: list[Document], window_label: str | None = None)

Batch of documents produced by a single operation (e.g., one window).

MediaAsset

MediaAsset dataclass

Python
MediaAsset(content: str | bytes, type: DocumentType, metadata: dict[str, Any] = dict(), id: str | None = None, parent_id: str | None = None, parent: Document | None = None, created_at: datetime = (lambda: datetime.now(UTC))(), source_window: str | None = None, suggested_path: str | None = None)

Bases: Document

Specialized Document for binary media assets managed by the pipeline.

Usage Examples

Creating Documents

Python
from egregora.data_primitives.document import Document, DocumentType

# Create a blog post
post = Document(
    content="# My Post\n\nContent here...",
    type=DocumentType.POST,
    metadata={
        "title": "My Post",
        "date": "2025-01-10",
        "slug": "my-post",
    }
)

# Document ID is deterministic: a post with a slug uses the slug as its ID
print(post.document_id)  # 'my-post'

Working with Collections

Python
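from egregora.data_primitives.document import (
    Document,
    DocumentCollection,
    DocumentType,
)

# Build a few documents to group together
post1 = Document(content="# First post", type=DocumentType.POST, metadata={"slug": "first-post"})
post2 = Document(content="# Second post", type=DocumentType.POST, metadata={"slug": "second-post"})
profile1 = Document(content="Profile text", type=DocumentType.PROFILE)

# Create a collection
docs = [post1, post2, profile1]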
collection = DocumentCollection(
    documents=docs,
    window_label="2025-01-10"
)

# Filter by type
posts = collection.by_type(DocumentType.POST)
profiles = collection.by_type(DocumentType.PROFILE)

# Find by ID
doc = collection.find_by_id("some-uuid")
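
The two lookups used above can be approximated with a small stand-in, which may help when reasoning about what to expect from them. Everything here is an assumption for illustration: `SketchDoc` and `SketchCollection` are hypothetical classes, and the choices to return a list from `by_type` and `None` from `find_by_id` on a miss are guesses, not documented behaviour of `DocumentCollection`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SketchDoc:
    """Hypothetical stand-in for Document."""

    document_id: str
    type: str


@dataclass
class SketchCollection:
    """Hypothetical stand-in for DocumentCollection."""

    documents: list[SketchDoc] = field(default_factory=list)
    window_label: str | None = None

    def by_type(self, doc_type: str) -> list[SketchDoc]:
        # Keep only documents whose type matches, preserving order.
        return [d for d in self.documents if d.type == doc_type]

    def find_by_id(self, document_id: str) -> SketchDoc | None:
        # Return the first match, or None when no document has that ID (assumed).
        return next((d for d in self.documents if d.document_id == document_id), None)


sketch = SketchCollection(
    documents=[
        SketchDoc("first-post", "post"),
        SketchDoc("second-post", "post"),
        SketchDoc("abc-123", "profile"),
    ],
    window_label="2025-01-10",
)
sketch_posts = sketch.by_type("post")
sketch_hit = sketch.find_by_id("abc-123")
sketch_miss = sketch.find_by_id("missing")
```

Both lookups here are linear scans, which is usually adequate for the batch of documents produced by a single window.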

Media Assets

Python
from egregora.data_primitives.document import MediaAsset, DocumentType

# Read image file
with open("photo.jpg", "rb") as f:
    image_data = f.read()

# Create media asset
media = MediaAsset(
    content=image_data,
    type=DocumentType.MEDIA,
    metadata={
        "filename": "photo.jpg",
        "mime_type": "image/jpeg",
    }
)

# Create enrichment linked to media
enrichment = Document(
    content="A sunset over the ocean",
    type=DocumentType.ENRICHMENT_MEDIA,
    metadata={"url": "media/photo.jpg"},
    parent_id=media.document_id
)