
Data Generator

Document Processing Orchestrator Data Generator Package.

This module provides various data generator implementations for different types of data processing.

Classes requiring optional extras are loaded lazily via the module-level __getattr__ so that the package can be imported without installing every extra:

- ImageCaptionDataGenerator, MultiModelImageCaptionDataGenerator: require the 'image' extra
- PIITextAnonymizationDataGenerator: requires the 'pii' extra

BaseDataGenerator

Bases: ABC

Base class for data generators.

generate(elements, **kwargs) abstractmethod

Generates data for a list of chunks.

Parameters:

Name Type Description Default
elements Any

The elements used to generate data or metadata, ideally formatted as List[Dict].

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Name Type Description
Any Any

The generated data, ideally formatted as List[Dict]. Each dictionary in the list should follow the structure of the 'Element' model, to ensure consistency and ease of use across Document Processing Orchestrator.
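As a minimal sketch of the interface described above, the following defines a stand-in BaseDataGenerator and a hypothetical WordCountDataGenerator subclass. The stand-in only mirrors the documented signature (the real class would be imported from the package, whose import path is not shown here), and the `text` and `metadata` keys and the `word_count` field are assumptions for illustration, not part of the documented 'Element' model.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseDataGenerator(ABC):
    """Stand-in mirroring the documented interface; not the package's own class."""

    @abstractmethod
    def generate(self, elements: Any, **kwargs: Any) -> Any:
        """Generate data for a list of elements."""


class WordCountDataGenerator(BaseDataGenerator):
    """Hypothetical generator that records a word count in each element's metadata."""

    def generate(self, elements: list[dict[str, Any]], **kwargs: Any) -> list[dict[str, Any]]:
        enriched = []
        for element in elements:
            # Copy the metadata so the caller's input is not mutated.
            metadata = dict(element.get("metadata", {}))
            metadata["word_count"] = len(element.get("text", "").split())
            enriched.append({**element, "metadata": metadata})
        return enriched


elements = [{"text": "hello brave new world", "metadata": {}}]
result = WordCountDataGenerator().generate(elements)
```

Returning new dictionaries rather than mutating the input is a design choice of this sketch; the reference does not say whether the built-in generators modify elements in place.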

ChunkRelationMetadataDataGenerator()

Bases: BaseDataGenerator

Data generator that assigns chunk relation metadata to elements.

Each element is treated as an atomic single chunk. The generator populates file_id, chunk_id, previous_chunk, next_chunk, parent_chunk, children_chunk, and order fields on each element's metadata.

Initialize the ChunkRelationMetadataDataGenerator.

generate(elements, **kwargs)

Enrich elements with chunk relation metadata.

Enriches each element with chunk-level relationship metadata: file_id, chunk_id, previous_chunk, next_chunk, parent_chunk, children_chunk, and order.

Processing is skipped if all chunk relation fields are already present on the first element.

The file_id is resolved in order of priority:

1. file_id kwarg
2. file_id on the first element's metadata
3. Generated UUID value

Parameters:

Name Type Description Default
elements list[dict[str, Any]]

Elements to enrich with chunk relation metadata.

required
**kwargs Any

Additional keyword arguments.

{}
Kwargs

file_id (str, optional): Explicit file ID. If omitted, falls back to the file_id on the first element's metadata, then to a generated UUID.

Returns:

Type Description
list[dict[str, Any]]

list[dict[str, Any]]: Elements enriched with chunk relation metadata.
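The skip check and the file_id priority order described above can be sketched roughly as follows. This is an illustration under assumptions, not the package's implementation: the helper names `already_enriched` and `resolve_file_id` are hypothetical, and so is the choice to read the fields from `element["metadata"]` as a plain dictionary.

```python
import uuid
from typing import Any

# Field names taken from the description above; storing them under
# element["metadata"] is an assumption of this sketch.
RELATION_FIELDS = (
    "file_id", "chunk_id", "previous_chunk", "next_chunk",
    "parent_chunk", "children_chunk", "order",
)


def already_enriched(elements: list[dict[str, Any]]) -> bool:
    """Return True when the first element already carries every relation field."""
    if not elements:
        return False
    metadata = elements[0].get("metadata", {})
    return all(field in metadata for field in RELATION_FIELDS)


def resolve_file_id(elements: list[dict[str, Any]], **kwargs: Any) -> str:
    """Resolve file_id: explicit kwarg, then first element's metadata, then a UUID."""
    explicit = kwargs.get("file_id")
    if explicit:
        return explicit
    if elements:
        from_metadata = elements[0].get("metadata", {}).get("file_id")
        if from_metadata:
            return from_metadata
    return str(uuid.uuid4())
```

For example, `resolve_file_id([{"metadata": {"file_id": "doc-1"}}], file_id="override")` returns `"override"`, while dropping the keyword argument returns `"doc-1"`.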

IDRegulationGraphDataGenerator()

Bases: BaseDataGenerator

Maps Indonesian regulation chunks to graph nodes and edges for Neo4j indexing.

Each chunk is enriched with nodes and edges in its metadata. No LLM calls are made; the mapping is purely structural, derived from fields already present in the chunk metadata produced by IDRegulationChunker.

The schema follows ONTOLOGY.md: each article is split into a stable Article identity node and one or more versioned ArticleVersion nodes that carry text and sub-structure (clauses).

Concepts and obligations (produced by the optional upstream IDRegulationComplianceDataGenerator) are included when present in the chunk metadata, otherwise skipped.

Initialize IDRegulationGraphDataGenerator.

generate(elements, **kwargs)

Enrich regulation chunks with graph nodes and edges.

Parameters:

Name Type Description Default
elements list[dict[str, Any]]

Regulation chunks produced by IDRegulationChunker.

required
**kwargs Any

Unused; accepted for interface compatibility.

{}

Returns:

Type Description
list[dict[str, Any]]

list[dict[str, Any]]: Same chunks with metadata["nodes"] and metadata["edges"] populated.
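The split into a stable Article node and a versioned ArticleVersion node can be pictured with a small sketch. Everything concrete in it is hypothetical: the metadata keys (`regulation_id`, `article_number`, `version`, `concepts`), the node and edge dictionary shapes, and the `HAS_VERSION` and `MENTIONS` edge types are placeholders, since the actual schema is defined in ONTOLOGY.md and is not reproduced in this reference.

```python
from typing import Any


def map_chunk_to_graph(chunk: dict[str, Any]) -> dict[str, Any]:
    """Derive nodes and edges purely from fields already in the chunk metadata."""
    meta = dict(chunk.get("metadata", {}))

    # Hypothetical identifiers: one stable Article id, one id per version.
    article_id = f"{meta['regulation_id']}:article:{meta['article_number']}"
    version_id = f"{article_id}:v{meta['version']}"

    nodes = [
        {"id": article_id, "label": "Article"},
        {"id": version_id, "label": "ArticleVersion", "text": chunk.get("text", "")},
    ]
    edges = [{"source": article_id, "target": version_id, "type": "HAS_VERSION"}]

    # Optional upstream enrichment: included when present, otherwise skipped.
    for concept in meta.get("concepts", []):
        concept_id = f"concept:{concept}"
        nodes.append({"id": concept_id, "label": "Concept"})
        edges.append({"source": version_id, "target": concept_id, "type": "MENTIONS"})

    meta["nodes"] = nodes
    meta["edges"] = edges
    return {**chunk, "metadata": meta}


# Made-up sample chunk for illustration only.
chunk = {
    "text": "Sample article text.",
    "metadata": {"regulation_id": "REG-X", "article_number": 5, "version": 1},
}
enriched = map_chunk_to_graph(chunk)
```

Keeping the text on the version node rather than the identity node means a later amendment would add a new ArticleVersion without changing the Article id, which is the point of the split described above.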

LLMTextRewriteDataGenerator(model_api_keys=None, default_model_id=DEFAULT_MODEL_ID, default_system_prompt=DEFAULT_SYSTEM_PROMPT, default_user_prompt=DEFAULT_USER_PROMPT, default_text_rewrite_enabled=True, default_structures_to_rewrite=None, default_hyperparameters=None, default_retry_config=None, default_include_media_images_as_context=True, max_concurrent_requests=DEFAULT_MAX_CONCURRENT_REQUESTS)

Bases: BaseDataGenerator

LLM-powered element text rewrite data generator with lazy initialization and batching.

This generator rewrites the text field of document elements using an LLM while preserving element structure and metadata. It supports dynamic model selection, configuration-based processor caching, concurrent batch processing, and optional image context for multimodal models.

Designed primarily for document normalization tasks such as OCR cleanup, formatting correction, and layout refinement.

Initialize the LLMTextRewriteDataGenerator.

Parameters:

Name Type Description Default
model_api_keys dict[str, str] | None

Dictionary mapping model IDs to their API keys. Defaults to None, in which case no API keys are passed during LMInvoker initialization.

None
default_model_id str

Default model ID used when model_id is not provided to generate(). Defaults to DEFAULT_MODEL_ID.

DEFAULT_MODEL_ID
default_system_prompt str

Default system prompt used when system_prompt is not provided to generate(). Defaults to DEFAULT_SYSTEM_PROMPT.

DEFAULT_SYSTEM_PROMPT
default_user_prompt str

Default user prompt used when user_prompt is not provided to generate(). Defaults to DEFAULT_USER_PROMPT.

DEFAULT_USER_PROMPT
default_text_rewrite_enabled bool

Default value for text_rewrite_enabled in generate(). Defaults to True.

True
default_structures_to_rewrite list[str] | None

Default value for structures_to_rewrite in generate(). If None, all structures are eligible for rewriting. Defaults to None.

None
default_hyperparameters dict[str, Any] | None

Default hyperparameters passed to the LMInvoker when default_hyperparameters is not provided in generate(). Defaults to None.

None
default_retry_config dict[str, Any] | None

Default retry configuration passed to the LMInvoker when retry_config is not provided in generate(). Defaults to None.

None
default_include_media_images_as_context bool

Default value for include_media_images_as_context in generate(). Defaults to True.

True
max_concurrent_requests int

Default maximum number of concurrent LLM requests per generate() call when max_concurrent_requests is not provided to generate(). Defaults to DEFAULT_MAX_CONCURRENT_REQUESTS.

DEFAULT_MAX_CONCURRENT_REQUESTS

Raises:

Type Description
ValueError

If max_concurrent_requests is less than 1.

generate(elements, **kwargs)

Rewrite element.text for elements matching structures_to_rewrite using an LLM.

Parameters:

Name Type Description Default
elements list[dict[str, Any]]

List of dictionaries containing elements to be processed.

required
**kwargs Any

Additional keyword arguments for the LLM text rewrite process.

{}
Kwargs

- text_rewrite_enabled (bool, optional): Whether to enable LLM text rewriting. Defaults to True or the value configured in __init__.
- structures_to_rewrite (list[str], optional): List of element structure types whose text field should be rewritten. Elements with structures not in this list are passed through unchanged. Defaults to the value configured in __init__ (applies rewriting to all structures if None).
- model_id (str, optional): The ID of the model to use for text rewriting. Defaults to DEFAULT_MODEL_ID or the value configured in __init__.
- system_prompt (str, optional): The system prompt to use for text rewriting. Defaults to DEFAULT_SYSTEM_PROMPT or the value configured in __init__.
- user_prompt (str, optional): The user prompt template for text rewriting. Must contain a {text} placeholder. Defaults to DEFAULT_USER_PROMPT or the value configured in __init__.
- default_hyperparameters (dict[str, Any], optional): Additional hyperparameters passed to the LMInvoker configuration. Defaults to {} or the value configured in __init__.
- retry_config (dict[str, Any], optional): Retry configuration passed to the LMInvoker. Defaults to {} or the value configured in __init__.
- include_media_images_as_context (bool, optional): Whether to attach associated media images from element.metadata.media as visual context when invoking the LLM. Defaults to True or the value configured in __init__.
- max_concurrent_requests (int, optional): Maximum number of LLM requests in flight at once for this generate() call. Defaults to the value configured in __init__.

Returns:

Type Description
list[dict[str, Any]]

list[dict[str, Any]]: List of dictionaries with text rewritten for matching elements. Non-matching elements are returned unchanged.

Raises:

Type Description
ValueError

If max_concurrent_requests is less than 1.
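The interaction between structures_to_rewrite, the {text} placeholder, and max_concurrent_requests can be sketched with plain asyncio. This is a rough illustration under assumptions rather than the generator's actual code: `rewrite_elements` and `fake_llm` are hypothetical names, the `structure` key on each element is assumed, and a local coroutine stands in for the LMInvoker call.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


async def rewrite_elements(
    elements: list[dict[str, Any]],
    rewrite_fn: Callable[[str], Awaitable[str]],
    user_prompt: str,
    structures_to_rewrite: Optional[list[str]] = None,
    max_concurrent_requests: int = 4,
) -> list[dict[str, Any]]:
    """Rewrite matching elements with at most max_concurrent_requests in flight."""
    if max_concurrent_requests < 1:
        raise ValueError("max_concurrent_requests must be at least 1")
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def process(element: dict[str, Any]) -> dict[str, Any]:
        # None means every structure is eligible; otherwise filter by structure.
        if structures_to_rewrite is not None and element.get("structure") not in structures_to_rewrite:
            return element  # passed through unchanged
        async with semaphore:
            prompt = user_prompt.format(text=element.get("text", ""))
            new_text = await rewrite_fn(prompt)
        return {**element, "text": new_text}

    # gather() preserves input order in its results.
    return list(await asyncio.gather(*(process(e) for e in elements)))


PROMPT = "Fix OCR errors: {text}"


async def fake_llm(prompt: str) -> str:
    """Stand-in for a model call: strips the instruction and stray whitespace."""
    await asyncio.sleep(0)
    return prompt.removeprefix("Fix OCR errors: ").strip()


elements = [
    {"structure": "paragraph", "text": "  hello  "},
    {"structure": "table", "text": "  raw  "},
]
rewritten = asyncio.run(
    rewrite_elements(elements, fake_llm, PROMPT, structures_to_rewrite=["paragraph"])
)
```

Here only the paragraph element is sent to the stand-in model; the table element comes back untouched, matching the pass-through behaviour described for non-matching structures.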

__getattr__(name)

Lazily import classes that depend on optional extras.

Parameters:

Name Type Description Default
name str

The attribute name being accessed on this module.

required

Returns:

Name Type Description
type type

The requested class, imported from its submodule on first access.

Raises:

Type Description
ImportError

If the class is found in _OPTIONAL_IMPORTS but its required extra is not installed.

AttributeError

If name is not a known attribute of this module.
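The lazy-loading behaviour documented above follows the module-level `__getattr__` pattern from PEP 562. The sketch below shows the general shape under assumptions: the contents of `_OPTIONAL_IMPORTS` here are made up (a standard-library class stands in for a class whose extra is installed, and a nonexistent module simulates a missing extra), and a `types.ModuleType` instance plays the role of the package so the example runs in a single file.

```python
import importlib
import types

# Hypothetical mapping: attribute name -> (module to import from, extra name).
_OPTIONAL_IMPORTS = {
    "OrderedDict": ("collections", "demo"),                # importable everywhere
    "MissingGenerator": ("no_such_module_xyz", "image"),   # simulates a missing extra
}


def _lazy_getattr(name: str) -> type:
    """Import the requested class from its module on first access."""
    if name not in _OPTIONAL_IMPORTS:
        raise AttributeError(f"module has no attribute {name!r}")
    module_path, extra = _OPTIONAL_IMPORTS[name]
    try:
        module = importlib.import_module(module_path)
    except ImportError as exc:
        raise ImportError(f"{name} requires the '{extra}' extra to be installed") from exc
    return getattr(module, name)


# In a real package this function would simply be named __getattr__ at module
# level; attaching it to a module object has the same effect for this demo.
demo_package = types.ModuleType("demo_package")
demo_package.__getattr__ = _lazy_getattr

ordered_dict_cls = demo_package.OrderedDict  # resolved lazily on first access
```

Because `__getattr__` is only consulted when normal attribute lookup fails, eagerly imported classes are unaffected and only names listed in the mapping pay the deferred import cost.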