Data Generator

Document Processing Orchestrator Data Generator Package.

This module provides various data generator implementations for different types of data processing.

Classes requiring optional extras are loaded lazily via `__getattr__` so that the package can be imported without installing every extra:

- ImageCaptionDataGenerator, MultiModelImageCaptionDataGenerator: require the 'image' extra
- PIITextAnonymizationDataGenerator: requires the 'pii' extra

`BaseDataGenerator`

Bases: ABC

Base class for data generators.

`generate(elements, **kwargs)` `abstractmethod`

Generates data for a list of chunks.

Parameters:

Name	Type	Description	Default
`elements`	`Any`	The elements used to generate data or metadata, ideally formatted as List[Dict].	required
`**kwargs`	`Any`	Additional keyword arguments for customization.	`{}`

Returns:

Name	Type	Description
`Any`	`Any`	The generated data, ideally formatted as List[Dict]. Each dictionary in the list should follow the structure of the 'Element' model to ensure consistency and ease of use across the Document Processing Orchestrator.
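As a rough illustration of the contract above, the sketch below subclasses a stand-in base class and returns a list of dictionaries. The stand-in `BaseDataGenerator` is redefined locally so the example runs on its own, and `WordCountDataGenerator`, the `text` and `metadata` keys, and the `word_count` field are all hypothetical names chosen for illustration, not part of the package.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseDataGenerator(ABC):
    """Local stand-in for the package's base class (assumption: same method shape)."""

    @abstractmethod
    def generate(self, elements: Any, **kwargs: Any) -> Any:
        """Generate data for a list of elements."""


class WordCountDataGenerator(BaseDataGenerator):
    """Hypothetical generator that stores a word count in each element's metadata."""

    def generate(self, elements: list[dict[str, Any]], **kwargs: Any) -> list[dict[str, Any]]:
        enriched = []
        for element in elements:
            # Copy instead of mutating, so callers keep their original input.
            metadata = dict(element.get("metadata", {}))
            metadata["word_count"] = len(element.get("text", "").split())
            enriched.append({**element, "metadata": metadata})
        return enriched


result = WordCountDataGenerator().generate(
    [{"text": "hello brave new world", "metadata": {}}]
)
```

Returning new dictionaries rather than mutating the input is a design choice of this sketch only; the page does not say whether the real generators copy or mutate.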

`ChunkRelationMetadataDataGenerator()`

Bases: BaseDataGenerator

Data generator that assigns chunk relation metadata to elements.

Each element is treated as an atomic single chunk. The generator populates file_id, chunk_id, previous_chunk, next_chunk, parent_chunk, children_chunk, and order fields on each element's metadata.

Initialize the ChunkRelationMetadataDataGenerator.

`generate(elements, **kwargs)`

Enrich elements with chunk relation metadata.

Enriches each element with chunk-level relationship metadata: file_id, chunk_id, previous_chunk, next_chunk, parent_chunk, children_chunk, and order.

Processing is skipped if all chunk relation fields are already present on the first element.

The file_id is resolved in order of priority:

1. file_id kwarg
2. file_id on the first element's metadata
3. Generated UUID value
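The priority order above can be sketched as a small helper. This is a minimal sketch, not the package's implementation: `resolve_file_id` is a hypothetical name, and treating an empty string as "not provided" and using `uuid4` for the generated value are assumptions.

```python
import uuid
from typing import Any


def resolve_file_id(elements: list[dict[str, Any]], **kwargs: Any) -> str:
    """Resolve a file id: kwarg first, then first element's metadata, then a UUID."""
    explicit = kwargs.get("file_id")
    if explicit:  # assumption: empty or None means "not provided"
        return explicit
    if elements:
        from_metadata = elements[0].get("metadata", {}).get("file_id")
        if from_metadata:
            return from_metadata
    # Assumption: a random UUID4 string; the page only says "generated UUID".
    return str(uuid.uuid4())


from_kwarg = resolve_file_id([{"metadata": {"file_id": "meta-id"}}], file_id="kwarg-id")
from_meta = resolve_file_id([{"metadata": {"file_id": "meta-id"}}])
generated = resolve_file_id([{"metadata": {}}])
```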

Parameters:

Name	Type	Description	Default
`elements`	`list[dict[str, Any]]`	Elements to enrich with chunk relation metadata.	required
`**kwargs`	`Any`	Additional keyword arguments.	`{}`

Kwargs

file_id (str, optional): Explicit file ID. If omitted, falls back to the file_id on the first element's metadata, then to a generated UUID.

Returns:

Type	Description
`list[dict[str, Any]]`	Elements enriched with chunk relation metadata.
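To make the relation fields concrete, here is a hedged sketch of how a flat list of atomic chunks might be linked. The function name `link_chunks` is hypothetical, and several details are assumptions because the page does not specify them: relations are stored as chunk ids, `order` starts at 0, `parent_chunk` is `None`, and `children_chunk` is an empty list for atomic chunks.

```python
import uuid
from typing import Any

RELATION_FIELDS = (
    "file_id", "chunk_id", "previous_chunk", "next_chunk",
    "parent_chunk", "children_chunk", "order",
)


def link_chunks(elements: list[dict[str, Any]], file_id: str) -> list[dict[str, Any]]:
    """Attach chunk relation metadata to each element, treating each as one chunk."""
    # Skip when the first element already carries every relation field.
    if elements and all(f in elements[0].get("metadata", {}) for f in RELATION_FIELDS):
        return elements

    chunk_ids = [str(uuid.uuid4()) for _ in elements]
    last = len(elements) - 1
    linked = []
    for i, element in enumerate(elements):
        metadata = {
            **element.get("metadata", {}),
            "file_id": file_id,
            "chunk_id": chunk_ids[i],
            "previous_chunk": chunk_ids[i - 1] if i > 0 else None,
            "next_chunk": chunk_ids[i + 1] if i < last else None,
            "parent_chunk": None,   # assumption: atomic chunks have no parent
            "children_chunk": [],   # assumption: and no children
            "order": i,             # assumption: zero-based position
        }
        linked.append({**element, "metadata": metadata})
    return linked


linked = link_chunks([{"text": "first"}, {"text": "second"}], file_id="file-1")
relinked = link_chunks(linked, file_id="other-file")
```

Calling the helper a second time returns the input untouched, which mirrors the documented skip when the first element already has all relation fields.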

`IDRegulationGraphDataGenerator(index_name=None)`

Bases: BaseDataGenerator

Maps Indonesian regulation chunks to graph nodes and edges for Neo4j indexing.

Each chunk is enriched with nodes and edges in its metadata. No LLM calls are made; the mapping is purely structural, derived from fields already present in the chunk metadata produced by IDRegulationChunker.

The schema follows ONTOLOGY.md: each article is split into a stable Article identity node and one or more versioned ArticleVersion nodes that carry text and sub-structure (clauses).

Explanation (Penjelasan) chunks become Explanation nodes bound via HAS_EXPLANATION to the ArticleVersion they elucidate, or to the Regulation node for the general (Umum) elucidation.

Concepts and obligations (produced by the optional upstream IDRegulationComplianceDataGenerator) are included when present in the chunk metadata, otherwise skipped.

Initialize IDRegulationGraphDataGenerator.

Parameters:

Name	Type	Description	Default
`index_name`	`str \| None`	Default namespace prefix applied to every emitted node id (and every edge endpoint referencing those ids) when a `generate()` call does not supply its own `index_name` keyword argument. Two pipelines indexing the same regulation under different `index_name` values produce disjoint node ids, so they never collide on the Neo4j `MERGE (n:Label {id: $id})` upsert performed by the graph data store. When `None` (default) and no per-call override is given, node ids are emitted un-namespaced, matching the behavior of existing deployments that predate this argument. Defaults to None. Note that when set, semantic identities normally shared across namespaces (Concept name, Obligation hash) are also namespaced — cross-namespace concept merging is out of scope and would require a separate normalization pass.	`None`
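The namespacing behavior described above can be illustrated with a short sketch. Everything here is illustrative rather than the generator's real code: the `:` separator, the `source` and `target` edge keys, the sample ids, and the helper names are hypothetical, and letting a per-call `index_name` override the constructor default via `kwargs.get` is an assumption about how the fallback works.

```python
from __future__ import annotations

from typing import Any

SEPARATOR = ":"  # hypothetical; the real id format is not documented on this page


def resolve_index_name(default_index_name: str | None, **kwargs: Any) -> str | None:
    """Per-call `index_name` wins; otherwise use the constructor default."""
    return kwargs.get("index_name", default_index_name)


def namespaced(raw_id: str, index_name: str | None) -> str:
    """Prefix an id with the namespace, or leave it untouched when there is none."""
    return raw_id if index_name is None else f"{index_name}{SEPARATOR}{raw_id}"


def namespace_graph(nodes: list[dict], edges: list[dict], index_name: str | None):
    """Apply the same prefix to node ids and to the edge endpoints that reference them."""
    ns_nodes = [{**n, "id": namespaced(n["id"], index_name)} for n in nodes]
    ns_edges = [
        {
            **e,
            "source": namespaced(e["source"], index_name),
            "target": namespaced(e["target"], index_name),
        }
        for e in edges
    ]
    return ns_nodes, ns_edges


nodes = [{"id": "article-1-v1"}, {"id": "explanation-1"}]
edges = [{"source": "article-1-v1", "target": "explanation-1", "type": "HAS_EXPLANATION"}]

kb_a_nodes, kb_a_edges = namespace_graph(nodes, edges, resolve_index_name(None, index_name="kb_a"))
kb_b_nodes, _ = namespace_graph(nodes, edges, resolve_index_name("kb_b"))
plain_nodes, _ = namespace_graph(nodes, edges, resolve_index_name(None))
```

Because the two namespaces produce different id strings for the same regulation, an upsert keyed on `id` would treat them as separate nodes, which is the collision-avoidance property the parameter description refers to.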

`generate(elements, **kwargs)`

Enrich regulation chunks with graph nodes and edges.

Parameters:

Name	Type	Description	Default
`elements`	`list[dict[str, Any]]`	Regulation chunks produced by IDRegulationChunker.	required
`**kwargs`	`Any`	`index_name` (str \| None, optional) overrides the constructor's default namespace for this call only, so a single long-lived generator instance can process documents for different knowledge bases without namespace collisions. Falls back to the constructor's `index_name` when omitted. All other keys are unused; accepted for interface compatibility.	`{}`

Returns:

Type	Description
`list[dict[str, Any]]`	The same chunks with `metadata["nodes"]` and `metadata["edges"]` populated. If the document has no `regulation_id` (e.g. appendix or FAQ chunks from StructuredElementChunker), all chunks are returned unchanged with empty `nodes` and `edges`.

`LLMTextRewriteDataGenerator(model_api_keys=None, default_model_id=DEFAULT_MODEL_ID, default_system_prompt=DEFAULT_SYSTEM_PROMPT, default_user_prompt=DEFAULT_USER_PROMPT, default_text_rewrite_enabled=True, default_structures_to_rewrite=None, default_hyperparameters=None, default_retry_config=None, default_include_media_images_as_context=True, max_concurrent_requests=DEFAULT_MAX_CONCURRENT_REQUESTS)`

Bases: BaseDataGenerator

LLM-powered element text rewrite data generator with lazy initialization and batching.

This generator rewrites the text field of document elements using an LLM while preserving element structure and metadata. It supports dynamic model selection, configuration-based processor caching, concurrent batch processing, and optional image context for multimodal models.

Designed primarily for document normalization tasks such as OCR cleanup, formatting correction, and layout refinement.

Initialize the LLMTextRewriteDataGenerator.

Parameters:

Name	Type	Description	Default
`model_api_keys`	`dict[str, str] \| None`	Dictionary mapping model IDs to their API keys. Defaults to None, in which case no API keys are passed during LMInvoker initialization.	`None`
`default_model_id`	`str`	Default model ID used when `model_id` is not provided to generate(). Defaults to DEFAULT_MODEL_ID.	`DEFAULT_MODEL_ID`
`default_system_prompt`	`str`	Default system prompt used when `system_prompt` is not provided to generate(). Defaults to DEFAULT_SYSTEM_PROMPT.	`DEFAULT_SYSTEM_PROMPT`
`default_user_prompt`	`str`	Default user prompt used when `user_prompt` is not provided to generate(). Defaults to DEFAULT_USER_PROMPT.	`DEFAULT_USER_PROMPT`
`default_text_rewrite_enabled`	`bool`	Default value for `text_rewrite_enabled` in generate(). Defaults to True.	`True`
`default_structures_to_rewrite`	`list[str] \| None`	Default value for `structures_to_rewrite` in generate(). If None, all structures are eligible for rewriting. Defaults to None.	`None`
`default_hyperparameters`	`dict[str, Any] \| None`	Default hyperparameters passed to the LMInvoker when `default_hyperparameters` is not provided in generate(). Defaults to None.	`None`
`default_retry_config`	`dict[str, Any] \| None`	Default retry configuration passed to the LMInvoker when `retry_config` is not provided in generate(). Defaults to None.	`None`
`default_include_media_images_as_context`	`bool`	Default value for `include_media_images_as_context` in generate(). Defaults to True.	`True`
`max_concurrent_requests`	`int`	Default maximum number of concurrent LLM requests per generate() call when `max_concurrent_requests` is not provided to generate(). Defaults to DEFAULT_MAX_CONCURRENT_REQUESTS.	`DEFAULT_MAX_CONCURRENT_REQUESTS`

Raises:

Type	Description
`ValueError`	If max_concurrent_requests is less than 1.

`generate(elements, **kwargs)`

Rewrite element.text for elements matching structures_to_rewrite using an LLM.

Parameters:

Name	Type	Description	Default
`elements`	`list[dict[str, Any]]`	List of dictionaries containing elements to be processed.	required
`**kwargs`	`Any`	Additional keyword arguments for the LLM text rewrite process.	`{}`

Kwargs

text_rewrite_enabled (bool, optional): Whether to enable LLM text rewriting. Defaults to True or the value configured in init.

structures_to_rewrite (list[str], optional): List of element structure types whose text field should be rewritten. Elements with structures not in this list are passed through unchanged. Defaults to the value configured in init (applies rewriting to all structures if None).

model_id (str, optional): The ID of the model to use for text rewriting. Defaults to DEFAULT_MODEL_ID or the value configured in init.

system_prompt (str, optional): The system prompt to use for text rewriting. Defaults to DEFAULT_SYSTEM_PROMPT or the value configured in init.

user_prompt (str, optional): The user prompt template for text rewriting. Must contain a {text} placeholder. Defaults to DEFAULT_USER_PROMPT or the value configured in init.

default_hyperparameters (dict[str, Any], optional): Additional hyperparameters passed to the LMInvoker configuration. Defaults to {} or the value configured in init.

retry_config (dict[str, Any], optional): Retry configuration passed to the LMInvoker. Defaults to {} or the value configured in init.

include_media_images_as_context (bool, optional): Whether to attach associated media images from element.metadata.media as visual context when invoking the LLM. Defaults to True or the value configured in init.

max_concurrent_requests (int, optional): Maximum number of LLM requests in flight at once for this generate() call. Defaults to the value configured in init.

api_key (str, optional): API key for the model. Takes precedence over the init-time `model_api_keys` lookup by model_id. An explicit empty string is valid and does not fall back to init-time keys.
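The `api_key` precedence rule is easy to get wrong with a plain truthiness check, so a short sketch may help. `resolve_api_key` is a hypothetical helper, not the package's code, and treating an explicit `None` as "not provided" is an assumption; the page only states that an explicit empty string must not fall back.

```python
from __future__ import annotations

from typing import Any


def resolve_api_key(
    model_id: str,
    model_api_keys: dict[str, str] | None,
    **kwargs: Any,
) -> str | None:
    """Pick the per-call api_key when given, else look up the init-time key by model id."""
    api_key = kwargs.get("api_key")
    # Compare against None rather than using truthiness, so "" is kept as-is.
    if api_key is not None:
        return api_key
    return (model_api_keys or {}).get(model_id)


init_keys = {"model-a": "init-key"}  # hypothetical model id and key

per_call = resolve_api_key("model-a", init_keys, api_key="call-key")
empty = resolve_api_key("model-a", init_keys, api_key="")
fallback = resolve_api_key("model-a", init_keys)
missing = resolve_api_key("model-b", None)
```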

Returns:

Type	Description
`list[dict[str, Any]]`	list[dict[str, Any]]: List of dictionaries with text rewritten for matching elements. Non-matching elements are returned unchanged.

Raises:

Type	Description
`ValueError`	If max_concurrent_requests is less than 1.
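The combination of structure filtering and a cap on in-flight requests can be sketched with `asyncio.Semaphore`. This is an illustration under stated assumptions, not the generator's implementation: `fake_rewrite` stands in for the LLM call, the `structure` key on each element is a hypothetical field name, and the real class may use a different concurrency mechanism.

```python
from __future__ import annotations

import asyncio
from typing import Any


async def fake_rewrite(text: str) -> str:
    """Stand-in for an LLM call; it only uppercases the text."""
    await asyncio.sleep(0)
    return text.upper()


async def rewrite_elements(
    elements: list[dict[str, Any]],
    structures_to_rewrite: list[str] | None = None,
    max_concurrent_requests: int = 4,
) -> list[dict[str, Any]]:
    if max_concurrent_requests < 1:
        raise ValueError("max_concurrent_requests must be at least 1")
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def process(element: dict[str, Any]) -> dict[str, Any]:
        # None means every structure is eligible for rewriting.
        eligible = (
            structures_to_rewrite is None
            or element.get("structure") in structures_to_rewrite
        )
        if not eligible:
            return element  # passed through unchanged
        async with semaphore:  # at most N rewrites in flight at once
            new_text = await fake_rewrite(element.get("text", ""))
        return {**element, "text": new_text}

    # gather preserves input order, so results line up with elements.
    return list(await asyncio.gather(*(process(e) for e in elements)))


elements = [
    {"text": "teh title", "structure": "title"},
    {"text": "body text", "structure": "paragraph"},
]
rewritten = asyncio.run(
    rewrite_elements(elements, structures_to_rewrite=["title"], max_concurrent_requests=2)
)
```

Only the `title` element is sent through the rewrite; the `paragraph` element comes back as the same object, matching the documented pass-through for non-matching structures.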

`__getattr__(name)`

Lazily import classes that depend on optional extras.

Parameters:

Name	Type	Description	Default
`name`	`str`	The attribute name being accessed on this module.	required

Returns:

Name	Type	Description
`type`	`type`	The requested class, imported from its submodule on first access.

Raises:

Type	Description
`ImportError`	If the class is found in `_OPTIONAL_IMPORTS` but its required extra is not installed.
`AttributeError`	If `name` is not a known attribute of this module.
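The lazy-loading behavior described here follows the module-level `__getattr__` hook from PEP 562. The sketch below shows the general pattern on a throwaway module object so it can run standalone; the shape of `_OPTIONAL_IMPORTS` (name mapped to a submodule path and an extra name), the `demo_pkg` module, and the error messages are assumptions. In a real package the function would simply be defined as `__getattr__` at the top level of `__init__.py`.

```python
import importlib
import types

# Hypothetical mapping: attribute name -> (module to import from, required extra).
# Standard-library modules are used here so the sketch runs without extras.
_OPTIONAL_IMPORTS = {
    "OrderedDict": ("collections", "image"),
    "MissingClass": ("not_a_real_module_xyz", "pii"),
}


def _lazy_getattr(name: str):
    """Import an optional class on first access, with a helpful error otherwise."""
    if name not in _OPTIONAL_IMPORTS:
        raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")
    module_path, extra = _OPTIONAL_IMPORTS[name]
    try:
        module = importlib.import_module(module_path)
    except ImportError as exc:
        raise ImportError(
            f"{name} requires the '{extra}' extra to be installed"
        ) from exc
    return getattr(module, name)


# Attach the hook to a throwaway module; Python calls it only when normal
# attribute lookup on the module fails.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = _lazy_getattr

ordered_dict_cls = demo_pkg.OrderedDict  # triggers the lazy import
```

Importing `demo_pkg` itself never touches the optional modules; the import cost and any missing-dependency error are deferred until the specific attribute is accessed.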
