Schema

Unified schema exports for gllm_multimodal.

`AudioTranscript`

Bases: BaseModel

A class representing an audio transcript.

An audio transcript is a textual record of spoken content from audio or video sources, including timing information and optional language identification. It provides a structured way to store and manage transcribed audio data.

Attributes:

Name	Type	Description
`text`	`str`	The text of the transcript.
`start_time`	`float`	The start time of the transcript in seconds.
`end_time`	`float`	The end time of the transcript in seconds.
`lang_id`	`str \| None`	The language ID of the transcript.

`Caption`

Bases: BaseModel

Result class for captioning operations (image, video, etc.).

This class provides a structured format for captioning results, supporting:

- Multiple caption types (one-liner, detailed, domain-specific)
- Caption count tracking
- Metadata storage for processing details

Attributes:

Name	Type	Description
`text_one_liner`	`str`	Brief, single-sentence summary of the content. Defaults to empty string if not provided.
`text_context`	`str`	Detailed, multi-sentence description of the content. Defaults to empty string if not provided.
`domain_knowledge`	`str`	Domain-specific interpretation or context. Defaults to empty string if not provided.
`number_of_captions`	`int`	Total number of distinct captions generated. Defaults to 0 if no captions are generated.
`media_metadata`	`dict[str, Any]`	Additional information about the media such as location.
`multimodal_context`	`list[Attachment \| str]`	Optional list of external context objects (files, bytes, or pre-processed inputs) or raw strings that can enrich captioning results. Bytes are automatically converted into Attachment objects via `Attachment.from_bytes`.
`output_schema`	`str`	Output schema. Defaults to empty string if not provided.
`schema_description`	`str`	Schema description. Defaults to empty string if not provided.
`language`	`str`	Language of the captions. Defaults to "Indonesian" if not provided.

Deprecated

- `image_one_liner`: Use `text_one_liner` instead. Will be removed in 0.4.0.
- `image_description`: Use `text_context` instead. Will be removed in 0.4.0.
- `image_metadata`: Use `media_metadata` instead. Will be removed in 0.4.0.

`handle_deprecated_fields(values)` `classmethod`

Map deprecated field names to their replacements and emit warnings.

Deprecated

- `image_one_liner`: Use `text_one_liner` instead. Will be removed in 0.4.0.
- `image_description`: Use `text_context` instead. Will be removed in 0.4.0.
- `image_metadata`: Use `media_metadata` instead. Will be removed in 0.4.0.
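The renaming described above can be illustrated with a minimal, standard-library-only sketch. The helper name `remap_deprecated_fields` is hypothetical, and the rule that an explicitly supplied new field wins over its deprecated counterpart is an assumption; the actual validator may resolve conflicts differently.

```python
import warnings
from typing import Any

# Old name -> new name, as listed in the deprecation note.
DEPRECATED_FIELDS = {
    "image_one_liner": "text_one_liner",
    "image_description": "text_context",
    "image_metadata": "media_metadata",
}


def remap_deprecated_fields(values: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `values` with deprecated keys renamed (hypothetical helper)."""
    remapped = dict(values)
    for old_name, new_name in DEPRECATED_FIELDS.items():
        if old_name not in remapped:
            continue
        warnings.warn(
            f"{old_name} is deprecated; use {new_name} instead. "
            "It will be removed in 0.4.0.",
            DeprecationWarning,
            stacklevel=2,
        )
        old_value = remapped.pop(old_name)
        # Assumption: a value already given under the new name takes priority.
        remapped.setdefault(new_name, old_value)
    return remapped
```

Calling `remap_deprecated_fields({"image_one_liner": "A cat"})` in this sketch returns a dict keyed by `text_one_liner` and emits one `DeprecationWarning`.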

`handle_multimodal_context(multimodal_value)` `classmethod`

Normalize and validate multimodal_context.

This method ensures that the multimodal_context field is a list of Attachment objects or strings. It handles multiple input cases:

None -> returns an empty list
list[bytes] -> converts each item into an Attachment via Attachment.from_bytes
list[Attachment] -> keeps as-is
list[str] -> keeps each string as-is unless it is a valid image/binary source, in which case it is converted to an Attachment
list[mixed] -> normalizes supported types

Parameters:

Name	Type	Description	Default
`multimodal_value`	`Any`	Input value provided to `multimodal_context`.	required

Returns:

Type	Description
`Any`	A normalized list of `Attachment` objects or strings (`list[Attachment \| str]`).
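The normalization cases listed for `handle_multimodal_context` can be sketched as follows. This is an illustration only: the `Attachment` class here is a stand-in dataclass rather than the library's class, rejecting non-list input with `TypeError` is an assumption, and strings are always kept as-is because the check for "valid image/binary source" is not documented here.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Attachment:
    """Stand-in for the real Attachment class; it only holds raw bytes."""

    data: bytes

    @classmethod
    def from_bytes(cls, data: bytes) -> "Attachment":
        return cls(data=data)


def normalize_multimodal_context(value: Any) -> list:
    """Normalize input into a list of Attachment objects or strings."""
    if value is None:
        return []
    if not isinstance(value, list):
        # Assumption: anything other than a list or None is rejected.
        raise TypeError("multimodal_context must be a list or None")
    normalized = []
    for item in value:
        if isinstance(item, (bytes, bytearray)):
            normalized.append(Attachment.from_bytes(bytes(item)))
        elif isinstance(item, (Attachment, str)):
            # Strings stay unchanged in this sketch; the documented validator
            # also converts strings that are valid image/binary sources.
            normalized.append(item)
        else:
            raise TypeError(f"Unsupported item type: {type(item).__name__}")
    return normalized
```

A mixed input such as `[b"...", "a note", Attachment(b"x")]` comes back with the bytes wrapped in an `Attachment` and the other two items untouched.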

`handle_none_metadata(metadata_value)` `classmethod`

Handle None values for media_metadata by using empty dict.

`handle_none_number_of_captions(caption_value)` `classmethod`

Handle None values for number_of_captions by using default.

`handle_none_values(str_value)` `classmethod`

Handle None values by converting them to default values.
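Taken together, the three `handle_none_*` validators replace `None` with each field's documented default. A minimal sketch of that idea, using a plain dict instead of the model and a hypothetical helper name `fill_none_with_defaults`, might look like this. Treating a missing key the same as `None` is an assumption of the sketch.

```python
from typing import Any

# Defaults taken from the Caption attribute table; only some fields are shown.
FIELD_DEFAULTS: dict[str, Any] = {
    "text_one_liner": "",
    "text_context": "",
    "domain_knowledge": "",
    "number_of_captions": 0,
    "media_metadata": {},
}


def fill_none_with_defaults(values: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `values` where None (or missing) fields get defaults."""
    filled = dict(values)
    for name, default in FIELD_DEFAULTS.items():
        if filled.get(name) is None:
            # Copy mutable defaults so no two results share the same dict.
            filled[name] = dict(default) if isinstance(default, dict) else default
    return filled
```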

`CaptionResult`

Bases: Caption

Result of a caption operation.

Attributes:

Name	Type	Description
`captions`	`str \| list[str] \| dict[str, Any]`	The caption result.

`Keyframe`

Bases: BaseModel

Represents a keyframe extracted from a video segment.

Attributes:

Name	Type	Description
`time_offset`	`float`	Time within the segment where the keyframe occurs.
`caption`	`str \| None`	Text description of this specific keyframe.

`Mermaid`

Bases: BaseModel

Mermaid additional metadata.

Attributes:

Name	Type	Description
`diagram_type`	`str`	Type of the diagram to be generated.
`context`	`str`	Additional context used to generate the Mermaid diagram.

`OcrResult`

Bases: BaseModel

Structured result of an OCR operation.

Attributes:

Name	Type	Description
`text`	`str`	Full concatenated text extracted from the document. This value is mirrored in TextResult.result for API consistency.
`lines`	`list[str]`	Engine-native line units when available (e.g. from specialized OCR backends). LM-based implementations typically leave this empty. Defaults to an empty list.
`confidence`	`float \| None`	Overall confidence score of the extraction, expressed as a value between 0.0 and 1.0. Only populated by specialized OCR engines; LM-based implementations leave this as None. Defaults to None.
`page_count`	`int`	Number of pages processed. Populated by engines that support multi-page documents (e.g., Azure Document Intelligence). Defaults to 1.

`Segment`

Bases: BaseModel

Represents a video segment with its captions, transcripts, and keyframes.

Attributes:

Name	Type	Description
`start_time`	`float \| None`	The segment's starting time in seconds.
`end_time`	`float \| None`	The segment's ending time in seconds.
`transcripts`	`list[AudioTranscript]`	Optional list of transcripts for the segment.
`segment_caption`	`list[str]`	The single, rich description of the segment's action/plot.
`keyframes`	`list[Keyframe]`	Optional list of keyframes extracted from the segment.

`ensure_caption()`

Ensure the segment has a caption, falling back to keyframes or transcripts if needed.

`ensure_keyframes()`

Ensure every keyframe's time offset is non-negative.

`ensure_transcripts()`

Ensure every transcript's time offsets are non-negative.
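The caption fallback in `ensure_caption()` could work roughly as sketched below. The classes are stand-in dataclasses, not the library's models, and the fallback order (keyframe captions first, then transcript text) is an assumption; the documentation only states that keyframes and transcripts are used when no caption is present.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Keyframe:
    time_offset: float
    caption: Optional[str] = None


@dataclass
class AudioTranscript:
    text: str
    start_time: float
    end_time: float
    lang_id: Optional[str] = None


@dataclass
class Segment:
    segment_caption: list = field(default_factory=list)
    keyframes: list = field(default_factory=list)
    transcripts: list = field(default_factory=list)


def fallback_caption(segment: Segment) -> list:
    """Pick a caption list for the segment (hypothetical helper)."""
    if segment.segment_caption:
        return segment.segment_caption
    # Assumed order: keyframe captions are preferred over transcript text.
    keyframe_captions = [k.caption for k in segment.keyframes if k.caption]
    if keyframe_captions:
        return keyframe_captions
    return [t.text for t in segment.transcripts if t.text]
```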

`TextResult`

Bases: BaseModel

Base class for all image-to-text operation results.

This class provides the foundation for structured results from any image-to-text operation, including:

- Image Captioning
- Scene Text Detection

Attributes:

Name	Type	Description
`text`	`str`	The extracted or generated text from the image. This is the primary output of any image-to-text operation. May be empty if the operation fails or no text is found.
`metadata`	`dict[str, Any] \| BaseModel`	Additional metadata from the conversion process.

`VideoCaptionMetadata`

Bases: BaseModel

Metadata for video captioning results.

Attributes:

Name	Type	Description
`video_summary`	`str`	A high-level summary of the entire video's plot, topic, or main events.
`segments`	`list[Segment]`	List of video segments with their captions and metadata.

`ensure_segment_end_time_greater_than_start_time()`

Ensure segment end time is greater than start time.

If a segment's end time is equal to or lower than its start time, the next segment's start time minus 1 is used as the end time.
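The repair rule can be sketched with a stand-in `Segment` dataclass and a hypothetical helper `repair_end_times`. Leaving the last segment unchanged (it has no successor) and skipping segments with missing times are assumptions of this sketch, since the documentation does not describe those cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    start_time: Optional[float]
    end_time: Optional[float]


def repair_end_times(segments: list) -> list:
    """Fix end times in place where end_time <= start_time, then return the list."""
    for index, segment in enumerate(segments[:-1]):
        next_start = segments[index + 1].start_time
        if segment.start_time is None or segment.end_time is None or next_start is None:
            # Assumption: segments with missing times are left untouched.
            continue
        if segment.end_time <= segment.start_time:
            # Documented rule: next segment's start time minus 1.
            segment.end_time = next_start - 1
    return segments
```

For example, segments `(0, 0)` and `(10, 20)` become `(0, 9)` and `(10, 20)` under this rule.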

`VideoCaptionResult`

Bases: VideoCaptionMetadata

Backward-compatible model alias for video caption result payload.
