Data Generator
Document Processing Orchestrator Data Generator Package.
This module provides various data generator implementations for different types of data processing.
Classes requiring optional extras are loaded lazily via __getattr__ so that the package can be imported without installing every extra:
- ImageCaptionDataGenerator, MultiModelImageCaptionDataGenerator: require the 'image' extra
- PIITextAnonymizationDataGenerator: requires the 'pii' extra
BaseDataGenerator
Bases: ABC
Base class for data generators.
generate(elements, **kwargs)
abstractmethod
Generates data for a list of chunks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| elements | Any | The elements to be used for generating data / metadata, ideally formatted as List[Dict]. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |
Returns:
| Name | Type | Description |
|---|---|---|
| Any | Any | The generated data, ideally formatted as List[Dict]. Each dictionary in the list is recommended to follow the structure of the 'Element' model, to ensure consistency and ease of use across Document Processing Orchestrator. |
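A minimal sketch of a custom generator built on this interface. It assumes a locally defined stand-in for BaseDataGenerator so that it runs without the package installed; the WordCountDataGenerator name and the word_count metadata key are hypothetical and not part of the package.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseDataGenerator(ABC):
    """Local stand-in mirroring the documented interface (assumption)."""

    @abstractmethod
    def generate(self, elements: Any, **kwargs: Any) -> Any:
        """Generate data for a list of elements."""


class WordCountDataGenerator(BaseDataGenerator):
    """Hypothetical generator that adds a word count to each element's metadata."""

    def generate(self, elements: list[dict[str, Any]], **kwargs: Any) -> list[dict[str, Any]]:
        enriched = []
        for element in elements:
            # Copy the metadata so the caller's input is not mutated.
            metadata = dict(element.get("metadata", {}))
            metadata["word_count"] = len(element.get("text", "").split())
            enriched.append({**element, "metadata": metadata})
        return enriched


result = WordCountDataGenerator().generate(
    [{"text": "hello data generator", "metadata": {}}]
)
```

Returning a list of dictionaries keeps the output in the recommended List[Dict] shape, so the result can be passed to another generator.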
ChunkRelationMetadataDataGenerator()
Bases: BaseDataGenerator
Data generator that assigns chunk relation metadata to elements.
Each element is treated as an atomic single chunk. The generator populates file_id, chunk_id, previous_chunk, next_chunk, parent_chunk, children_chunk, and order fields on each element's metadata.
Initialize the ChunkRelationMetadataDataGenerator.
generate(elements, **kwargs)
Enrich elements with chunk relation metadata.
Enriches each element with chunk-level relationship metadata: file_id, chunk_id, previous_chunk, next_chunk, parent_chunk, children_chunk, and order.
Processing is skipped if all chunk relation fields are already present on the first element.
The file_id is resolved in order of priority:
1. file_id kwarg
2. file_id on the first element's metadata
3. Generated UUID value
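The priority order above can be sketched as a small helper. This is an illustration only: the resolve_file_id name is hypothetical, and it assumes that element metadata lives under a "metadata" key and that the generated value is a version 4 UUID.

```python
import uuid
from typing import Any


def resolve_file_id(elements: list[dict[str, Any]], **kwargs: Any) -> str:
    """Hypothetical helper mirroring the documented priority order."""
    # 1. An explicit file_id kwarg wins.
    if kwargs.get("file_id"):
        return kwargs["file_id"]
    # 2. Otherwise use the file_id on the first element's metadata, if any.
    if elements:
        existing = elements[0].get("metadata", {}).get("file_id")
        if existing:
            return existing
    # 3. Fall back to a freshly generated UUID (version 4 is an assumption).
    return str(uuid.uuid4())
```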
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| elements | list[dict[str, Any]] | Elements to enrich with chunk relation metadata. | required |
| **kwargs | Any | Additional keyword arguments. | {} |
Kwargs
file_id (str, optional): Explicit file id. If omitted, falls back to the file_id on the first element's metadata, then to a generated UUID.
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] | list[dict[str, Any]]: Elements enriched with chunk relation metadata. |
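A rough sketch of how the relation fields might be populated when every element is a single atomic chunk. The link_chunks helper is hypothetical. The value conventions are assumptions, not the package's confirmed behaviour: UUID chunk ids, a zero-based order, None for a missing neighbour or parent, and an empty list for children.

```python
import uuid
from typing import Any

RELATION_FIELDS = (
    "file_id", "chunk_id", "previous_chunk", "next_chunk",
    "parent_chunk", "children_chunk", "order",
)


def link_chunks(elements: list[dict[str, Any]], file_id: str) -> list[dict[str, Any]]:
    """Hypothetical sketch: treat each element as one chunk and link neighbours."""
    if not elements:
        return elements
    first_meta = elements[0].get("metadata", {})
    # Skip when the first element already carries every relation field.
    if all(field in first_meta for field in RELATION_FIELDS):
        return elements

    chunk_ids = [str(uuid.uuid4()) for _ in elements]
    enriched = []
    for order, (element, chunk_id) in enumerate(zip(elements, chunk_ids)):
        metadata = dict(element.get("metadata", {}))
        metadata.update(
            file_id=file_id,
            chunk_id=chunk_id,
            previous_chunk=chunk_ids[order - 1] if order > 0 else None,
            next_chunk=chunk_ids[order + 1] if order + 1 < len(chunk_ids) else None,
            parent_chunk=None,  # assumption: atomic chunks have no parent
            children_chunk=[],  # assumption: and no children
            order=order,
        )
        enriched.append({**element, "metadata": metadata})
    return enriched


linked = link_chunks([{"text": "a"}, {"text": "b"}, {"text": "c"}], file_id="doc-1")
```

Calling link_chunks again on the result returns it untouched, because the first element already has all seven fields.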
IDRegulationGraphDataGenerator()
Bases: BaseDataGenerator
Maps Indonesian regulation chunks to graph nodes and edges for Neo4j indexing.
Each chunk is enriched with nodes and edges in its metadata. No
LLM calls are made; the mapping is purely structural, derived from fields
already present in the chunk metadata produced by IDRegulationChunker.
The schema follows ONTOLOGY.md: each article is split into a stable
Article identity node and one or more versioned ArticleVersion
nodes that carry text and sub-structure (clauses).
Concepts and obligations (produced by the optional upstream IDRegulationComplianceDataGenerator) are included when present in the chunk metadata, otherwise skipped.
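To illustrate the identity/version split described above, here is a small structural mapping. ONTOLOGY.md is not reproduced in this page, so everything in it is a hypothetical placeholder rather than the actual schema: the metadata field names (regulation_id, article_number, version, concepts), the node and edge dictionary shapes, and the HAS_VERSION and MENTIONS edge types.

```python
from typing import Any


def map_chunk_to_graph(chunk: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical structural mapping; field, label, and edge names are assumptions."""
    metadata = dict(chunk.get("metadata", {}))
    article_id = f"{metadata['regulation_id']}:article:{metadata['article_number']}"
    version_id = f"{article_id}:v{metadata['version']}"

    nodes = [
        # Stable identity node: carries no text, so it survives amendments.
        {"id": article_id, "label": "Article",
         "properties": {"number": metadata["article_number"]}},
        # Versioned node: carries the text of this particular version.
        {"id": version_id, "label": "ArticleVersion",
         "properties": {"text": chunk.get("text", "")}},
    ]
    edges = [{"source": article_id, "target": version_id, "type": "HAS_VERSION"}]

    # Optional upstream enrichment is mapped only when already present.
    for concept in metadata.get("concepts", []):
        concept_id = f"concept:{concept}"
        nodes.append({"id": concept_id, "label": "Concept",
                      "properties": {"name": concept}})
        edges.append({"source": version_id, "target": concept_id, "type": "MENTIONS"})

    metadata.update(nodes=nodes, edges=edges)
    return {**chunk, "metadata": metadata}


mapped = map_chunk_to_graph({
    "text": "Example article text.",
    "metadata": {"regulation_id": "reg-1", "article_number": "5", "version": 2},
})
```

Because the mapping reads only fields that are already on the chunk, it needs no LLM call and gives the same result on every run.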
Initialize IDRegulationGraphDataGenerator.
generate(elements, **kwargs)
Enrich regulation chunks with graph nodes and edges.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| elements | list[dict[str, Any]] | Regulation chunks produced by IDRegulationChunker. | required |
| **kwargs | Any | Unused; accepted for interface compatibility. | {} |
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] | list[dict[str, Any]]: Same chunks with |
LLMTextRewriteDataGenerator(model_api_keys=None, default_model_id=DEFAULT_MODEL_ID, default_system_prompt=DEFAULT_SYSTEM_PROMPT, default_user_prompt=DEFAULT_USER_PROMPT, default_text_rewrite_enabled=True, default_structures_to_rewrite=None, default_hyperparameters=None, default_retry_config=None, default_include_media_images_as_context=True, max_concurrent_requests=DEFAULT_MAX_CONCURRENT_REQUESTS)
Bases: BaseDataGenerator
LLM-powered element text rewrite data generator with lazy initialization and batching.
This generator rewrites the text field of document elements using an LLM while preserving element
structure and metadata. It supports dynamic model selection, configuration-based processor caching,
concurrent batch processing, and optional image context for multimodal models.
Designed primarily for document normalization tasks such as OCR cleanup, formatting correction, and layout refinement.
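The lazy initialization and configuration-based caching mentioned above could look roughly like the following. This is a sketch under assumptions: ProcessorCache and fake_factory are hypothetical names, the factory stands in for constructing a real LLM client, and keying the cache on a sorted JSON dump of the configuration is one possible choice, not necessarily what the class does internally.

```python
import json
from typing import Any, Callable


class ProcessorCache:
    """Hypothetical cache that builds one processor per distinct configuration."""

    def __init__(self, factory: Callable[..., Any]) -> None:
        self._factory = factory
        self._processors: dict[str, Any] = {}

    def get(self, **config: Any) -> Any:
        # Serialize with sorted keys so equal configs map to the same cache key.
        key = json.dumps(config, sort_keys=True, default=str)
        if key not in self._processors:
            # Lazy: nothing is built until a configuration is first requested.
            self._processors[key] = self._factory(**config)
        return self._processors[key]


built: list[dict[str, Any]] = []


def fake_factory(**config: Any) -> dict[str, Any]:
    """Stand-in for building an LLM client; records each construction."""
    built.append(config)
    return {"config": config}


cache = ProcessorCache(fake_factory)
first = cache.get(model_id="model-a", hyperparameters={"temperature": 0})
second = cache.get(hyperparameters={"temperature": 0}, model_id="model-a")
third = cache.get(model_id="model-b", hyperparameters={"temperature": 0})
```

Here first and second are the same object because their configurations are equal, while third triggers a second construction.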
Initialize the LLMTextRewriteDataGenerator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_api_keys | dict[str, str] \| None | Dictionary mapping model IDs to their API keys. Defaults to None, in which case no API keys are passed during LMInvoker initialization. | None |
| default_model_id | str | Default model ID used when | DEFAULT_MODEL_ID |
| default_system_prompt | str | Default system prompt used when | DEFAULT_SYSTEM_PROMPT |
| default_user_prompt | str | Default user prompt used when | DEFAULT_USER_PROMPT |
| default_text_rewrite_enabled | bool | Default value for | True |
| default_structures_to_rewrite | list[str] \| None | Default value for | None |
| default_hyperparameters | dict[str, Any] \| None | Default hyperparameters passed to the LMInvoker when | None |
| default_retry_config | dict[str, Any] \| None | Default retry configuration passed to the LMInvoker when | None |
| default_include_media_images_as_context | bool | Default value for | True |
| max_concurrent_requests | int | Default maximum number of concurrent LLM requests per generate() call when | DEFAULT_MAX_CONCURRENT_REQUESTS |
Raises:
| Type | Description |
|---|---|
| ValueError | If max_concurrent_requests is less than 1. |
generate(elements, **kwargs)
Rewrite element.text for elements matching structures_to_rewrite using an LLM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| elements | list[dict[str, Any]] | List of dictionaries containing elements to be processed. | required |
| **kwargs | Any | Additional keyword arguments for the LLM text rewrite process. | {} |
Kwargs
text_rewrite_enabled (bool, optional): Whether to enable LLM text rewriting. Defaults to True or the value configured in init.
structures_to_rewrite (list[str], optional): List of element structure types whose text field should be rewritten. Elements with structures not in this list are passed through unchanged. Defaults to the value configured in init (applies rewriting to all structures if None).
model_id (str, optional): The ID of the model to use for text rewriting. Defaults to DEFAULT_MODEL_ID or the value configured in init.
system_prompt (str, optional): The system prompt to use for text rewriting. Defaults to DEFAULT_SYSTEM_PROMPT or the value configured in init.
user_prompt (str, optional): The user prompt template for text rewriting. Must contain a {text} placeholder. Defaults to DEFAULT_USER_PROMPT or the value configured in init.
default_hyperparameters (dict[str, Any], optional): Additional hyperparameters passed to the LMInvoker configuration. Defaults to {} or the value configured in init.
retry_config (dict[str, Any], optional): Retry configuration passed to the LMInvoker. Defaults to {} or the value configured in init.
include_media_images_as_context (bool, optional): Whether to attach associated media images from element.metadata.media as visual context when invoking the LLM. Defaults to True or the value configured in init.
max_concurrent_requests (int, optional): Maximum number of LLM requests in flight at once for this generate() call. Defaults to the value configured in init.
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] | list[dict[str, Any]]: List of dictionaries with text rewritten for matching elements. Non-matching elements are returned unchanged. |
Raises:
| Type | Description |
|---|---|
| ValueError | If max_concurrent_requests is less than 1. |
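The structure filter, the {text} placeholder, and the concurrency limit can be illustrated together with asyncio. This is only a sketch of the general technique: rewrite_elements and fake_llm_rewrite are hypothetical, the fake call stands in for a real LLM request, and the assumption that each element exposes its type under a "structure" key is not confirmed by this page.

```python
from __future__ import annotations

import asyncio
from typing import Any


async def fake_llm_rewrite(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): upper-cases the prompt's last line."""
    await asyncio.sleep(0)
    return prompt.splitlines()[-1].upper()


async def rewrite_elements(
    elements: list[dict[str, Any]],
    structures_to_rewrite: list[str] | None = None,
    user_prompt: str = "Clean up this text:\n{text}",
    max_concurrent_requests: int = 2,
) -> list[dict[str, Any]]:
    if max_concurrent_requests < 1:
        raise ValueError("max_concurrent_requests must be at least 1")
    # The semaphore caps how many fake LLM calls are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def rewrite_one(element: dict[str, Any]) -> dict[str, Any]:
        # None means every structure is eligible; otherwise pass non-matches through.
        if (structures_to_rewrite is not None
                and element.get("structure") not in structures_to_rewrite):
            return element
        async with semaphore:
            new_text = await fake_llm_rewrite(user_prompt.format(text=element["text"]))
        return {**element, "text": new_text}

    # asyncio.gather returns results in the same order as the inputs.
    return list(await asyncio.gather(*(rewrite_one(e) for e in elements)))


rewritten = asyncio.run(rewrite_elements(
    [{"text": "hello", "structure": "paragraph"},
     {"text": "title", "structure": "header"}],
    structures_to_rewrite=["paragraph"],
))
```

Only the paragraph element is sent through the fake rewrite; the header element comes back unchanged and in its original position.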
__getattr__(name)
Lazily import classes that depend on optional extras.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The attribute name being accessed on this module. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| type | type | The requested class, imported from its submodule on first access. |
Raises:
| Type | Description |
|---|---|
| ImportError | If the class is found in |
| AttributeError | If |
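Module-level lazy loading of this kind is usually built on the module __getattr__ hook from PEP 562. The sketch below shows the general pattern on a throwaway in-memory package so that it runs without any files. The demo_pkg name, the submodule names, the lookup table, and the error messages are all hypothetical and do not reflect the package's actual layout.

```python
import importlib
import sys
import types

# Hypothetical mapping: public class name -> (submodule, extra that provides it).
_LAZY_CLASSES = {
    "ImageCaptionDataGenerator": ("demo_pkg.image_caption", "image"),
    "PIITextAnonymizationDataGenerator": ("demo_pkg.pii", "pii"),
}


def _lazy_getattr(name: str) -> type:
    """Module-level __getattr__ (PEP 562): runs only when normal lookup fails."""
    if name not in _LAZY_CLASSES:
        raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")
    submodule, extra = _LAZY_CLASSES[name]
    try:
        module = importlib.import_module(submodule)
    except ImportError as exc:
        raise ImportError(f"{name} requires the '{extra}' extra") from exc
    return getattr(module, name)


# Build a throwaway package in memory and attach the hook to it.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__path__ = []  # an empty path marks it as a package with no files
demo_pkg.__getattr__ = _lazy_getattr
sys.modules["demo_pkg"] = demo_pkg

# Simulate the 'image' extra being installed; the 'pii' submodule stays missing.
image_caption = types.ModuleType("demo_pkg.image_caption")
image_caption.ImageCaptionDataGenerator = type("ImageCaptionDataGenerator", (), {})
sys.modules["demo_pkg.image_caption"] = image_caption

caption_cls = demo_pkg.ImageCaptionDataGenerator
```

In this sketch, accessing demo_pkg.PIITextAnonymizationDataGenerator raises ImportError because its submodule cannot be imported, and any name outside the lookup table raises AttributeError.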