Schema

Unified schema exports for gllm_multimodal.

AudioTranscript

Bases: BaseModel

A class representing an audio transcript.

An audio transcript is a textual record of spoken content from audio or video sources, including timing information and optional language identification. It provides a structured way to store and manage transcribed audio data.

Attributes:

Name Type Description
text str

The text of the transcript.

start_time float

The start time of the transcript in seconds.

end_time float

The end time of the transcript in seconds.

lang_id str | None

The language ID of the transcript.
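
As a rough illustration of the shape described above, the following stdlib-only sketch mirrors the documented fields with a dataclass. `AudioTranscriptSketch` and its `duration` property are hypothetical names introduced here; the real class is a Pydantic `BaseModel`, and the `None` default for `lang_id` is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioTranscriptSketch:
    """Stdlib stand-in mirroring the documented AudioTranscript fields."""

    text: str
    start_time: float  # seconds
    end_time: float  # seconds
    lang_id: Optional[str] = None  # assumption: defaults to None when omitted

    @property
    def duration(self) -> float:
        # Hypothetical convenience helper, not part of the documented class.
        return self.end_time - self.start_time


transcript = AudioTranscriptSketch(
    text="Selamat pagi", start_time=1.5, end_time=3.0, lang_id="id"
)
```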

Caption

Bases: BaseModel

Result class for captioning operations (image, video, etc.).

This class provides a structured format for captioning results, supporting:

  • Multiple caption types (one-liner, detailed, domain-specific)
  • Caption count tracking
  • Metadata storage for processing details

Attributes:

Name Type Description
text_one_liner str

Brief, single-sentence summary of the content. Defaults to empty string if not provided.

text_context str

Detailed, multi-sentence description of the content. Defaults to empty string if not provided.

domain_knowledge str

Domain-specific interpretation or context. Defaults to empty string if not provided.

number_of_captions int

Total number of distinct captions generated. Defaults to 0 if no captions are generated.

media_metadata dict[str, Any]

Additional information about the media such as location.

multimodal_context list[Attachment | str]

Optional list of external context objects (files, bytes, or pre-processed inputs) or raw strings that can enrich captioning results. Bytes are automatically converted into Attachment objects via Attachment.from_bytes.

output_schema str

Output schema. Defaults to empty string if not provided.

schema_description str

Schema description. Defaults to empty string if not provided.

language str

Language of the captions. Defaults to "Indonesian" if not provided.

Deprecated

  • image_one_liner: Use text_one_liner instead. Will be removed in 0.4.0.
  • image_description: Use text_context instead. Will be removed in 0.4.0.
  • image_metadata: Use media_metadata instead. Will be removed in 0.4.0.

handle_deprecated_fields(values) classmethod

Map deprecated field names to their replacements and emit warnings.

Deprecated

  • image_one_liner: Use text_one_liner instead. Will be removed in 0.4.0.
  • image_description: Use text_context instead. Will be removed in 0.4.0.
  • image_metadata: Use media_metadata instead. Will be removed in 0.4.0.
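
The remapping described above could look roughly like the sketch below. The old-to-new mapping comes from the deprecation notice; the function name `remap_deprecated_fields`, the warning text, and the rule that an explicitly supplied new-style field wins over a deprecated one are assumptions, not the library's confirmed behavior.

```python
import warnings
from typing import Any

# Mapping taken from the deprecation notice above.
DEPRECATED_FIELDS = {
    "image_one_liner": "text_one_liner",
    "image_description": "text_context",
    "image_metadata": "media_metadata",
}


def remap_deprecated_fields(values: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: return a copy of values with deprecated keys renamed."""
    remapped = dict(values)
    for old_name, new_name in DEPRECATED_FIELDS.items():
        if old_name in remapped:
            warnings.warn(
                f"'{old_name}' is deprecated and will be removed in 0.4.0; "
                f"use '{new_name}' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            old_value = remapped.pop(old_name)
            # Assumption: a value already given under the new name takes priority.
            remapped.setdefault(new_name, old_value)
    return remapped
```

In a Pydantic model, logic of this kind would typically run before field validation so that callers still passing the old names keep working until 0.4.0.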

handle_multimodal_context(multimodal_value) classmethod

Normalize and validate multimodal_context.

This method ensures that the multimodal_context field is a list of Attachment objects or strings. It handles multiple input cases:

  • None -> returns an empty list
  • list[bytes] -> converts each item into an Attachment via Attachment.from_bytes
  • list[Attachment] -> keeps as-is
  • list[str] -> keeps as-is if it's not a valid image/binary source, otherwise converts to Attachment.
  • list[mixed] -> normalizes supported types

Parameters:

Name Type Description Default
multimodal_value Any

Input value provided to multimodal_context.

required

Returns:

Type Description
Any

list[Attachment | str]: A normalized list of Attachment objects or strings.
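
The input cases listed above can be sketched as follows. `AttachmentStub` is a hypothetical stand-in for the library's `Attachment` class, and `normalize_multimodal_context` is an illustrative name. The sketch keeps every string as-is (detecting strings that point to an image or binary source is omitted), and raising `TypeError` for unsupported items is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class AttachmentStub:
    """Hypothetical stand-in for the library's Attachment class."""

    data: bytes

    @classmethod
    def from_bytes(cls, data: bytes) -> "AttachmentStub":
        return cls(data=data)


def normalize_multimodal_context(value: Any) -> list[Union[AttachmentStub, str]]:
    """Normalize None, bytes, attachments, and strings into one list."""
    if value is None:
        return []
    if not isinstance(value, list):
        # Assumption: anything other than a list or None is rejected.
        raise TypeError("multimodal_context must be a list or None")

    normalized: list[Union[AttachmentStub, str]] = []
    for item in value:
        if isinstance(item, (bytes, bytearray)):
            normalized.append(AttachmentStub.from_bytes(bytes(item)))
        elif isinstance(item, (AttachmentStub, str)):
            # Strings are kept unchanged in this sketch; source detection is omitted.
            normalized.append(item)
        else:
            raise TypeError(f"Unsupported item type: {type(item).__name__}")
    return normalized
```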

handle_none_metadata(metadata_value) classmethod

Handle None values for media_metadata by using empty dict.

handle_none_number_of_captions(caption_value) classmethod

Handle None values for number_of_captions by using default.

handle_none_values(str_value) classmethod

Handle None values by converting them to default values.
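
The three None handlers above share one pattern: replace `None` with the field's documented default. A minimal sketch of that pattern, using a hypothetical helper named `none_to_default`:

```python
from typing import Any, Optional


def none_to_default(value: Optional[Any], default: Any) -> Any:
    """Return default when value is None, otherwise return value unchanged."""
    return default if value is None else value


# Defaults as documented for the Caption fields.
text_one_liner = none_to_default(None, "")
number_of_captions = none_to_default(None, 0)
media_metadata = none_to_default(None, {})
```

Only `None` is replaced here; other falsy values such as an explicit `0` or `""` pass through untouched.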

CaptionResult

Bases: Caption

Result of a caption operation.

Attributes:

Name Type Description
captions str | list[str] | dict[str, Any]

The caption result.

Keyframe

Bases: BaseModel

Represents a keyframe extracted from a video segment.

Attributes:

Name Type Description
time_offset float

Time within the segment where the keyframe occurs.

caption str | None

Text description of this specific keyframe.

Mermaid

Bases: BaseModel

Mermaid additional metadata.

Attributes:

Name Type Description
diagram_type str

Type of the diagram to be generated.

context str

Additional context used to generate the Mermaid diagram.

Segment

Bases: BaseModel

Represents a video segment with its captions, transcripts, and keyframes.

Attributes:

Name Type Description
start_time float | None

The segment's starting time in seconds.

end_time float | None

The segment's ending time in seconds.

transcripts list[AudioTranscript]

Optional list of transcripts for the segment.

segment_caption list[str]

The single, rich description of the segment's action/plot.

keyframes list[Keyframe]

Optional list of keyframes extracted from the segment.

ensure_caption()

Ensure the segment has a caption, falling back to keyframes or transcripts if needed.

ensure_keyframes()

Ensure every keyframe's time offset is non-negative.

ensure_transcripts()

Ensure every transcript's time offset is non-negative.
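
The caption fallback mentioned for ensure_caption() might work along the lines of the sketch below. The documentation does not state the fallback order or format, so trying keyframe captions first and transcript text second is an assumption, and all class and function names here are hypothetical stand-ins built on stdlib dataclasses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KeyframeSketch:
    time_offset: float
    caption: Optional[str] = None


@dataclass
class TranscriptSketch:
    text: str


@dataclass
class SegmentSketch:
    segment_caption: list[str] = field(default_factory=list)
    keyframes: list[KeyframeSketch] = field(default_factory=list)
    transcripts: list[TranscriptSketch] = field(default_factory=list)


def fallback_caption(segment: SegmentSketch) -> list[str]:
    """Return the segment caption, or a substitute built from other fields."""
    if segment.segment_caption:
        return segment.segment_caption
    # Assumed order: keyframe captions first, then transcript text.
    keyframe_captions = [k.caption for k in segment.keyframes if k.caption]
    if keyframe_captions:
        return keyframe_captions
    return [t.text for t in segment.transcripts if t.text]
```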

TextResult

Bases: BaseModel

Base class for all image-to-text operation results.

This class provides the foundation for structured results from any image-to-text operation, including:

  • Image Captioning
  • Scene Text Detection

Attributes:

Name Type Description
text str

The extracted or generated text from the image. This is the primary output of any image-to-text operation. May be empty if the operation fails or no text is found.

metadata dict[str, Any] | BaseModel

Additional metadata from the conversion process.

VideoCaptionMetadata

Bases: BaseModel

Metadata for video captioning results.

Attributes:

Name Type Description
video_summary str

A high-level summary of the entire video's plot, topic, or main events.

segments list[Segment]

List of video segments with their captions and metadata.

ensure_segment_end_time_greater_than_start_time()

Ensure segment end time is greater than start time.

If a segment's end time is equal to or lower than its start time, the next segment's start time minus 1 is used as the end time.
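
The repair rule can be sketched as below, using a hypothetical `SegmentTimes` dataclass in place of the real `Segment` model. How the library treats the last segment, or segments with missing times, is not documented here; this sketch leaves them unchanged, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SegmentTimes:
    """Hypothetical stand-in holding only a segment's timing fields."""

    start_time: Optional[float]
    end_time: Optional[float]


def repair_end_times(segments: list[SegmentTimes]) -> list[SegmentTimes]:
    """Return new segments whose invalid end times are derived from the next segment."""
    repaired = []
    for index, segment in enumerate(segments):
        end_time = segment.end_time
        is_invalid = (
            segment.start_time is not None
            and end_time is not None
            and end_time <= segment.start_time
        )
        has_next = index + 1 < len(segments)
        if is_invalid and has_next and segments[index + 1].start_time is not None:
            # Documented rule: use the next segment's start time minus 1.
            end_time = segments[index + 1].start_time - 1
        repaired.append(SegmentTimes(segment.start_time, end_time))
    return repaired


fixed = repair_end_times([SegmentTimes(0.0, 0.0), SegmentTimes(10.0, 20.0)])
```

In this usage example the first segment's end time becomes 9.0 (the next start time, 10.0, minus 1), while the second segment is already valid and stays as it was.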

VideoCaptionResult

Bases: VideoCaptionMetadata

Backward-compatible model alias for video caption result payload.