
Caption

Schema for captioning operations in Gen AI applications.

This module defines the data structures for representing results from captioning operations (image, video, etc.). It provides:

  1. Result class for captions
  2. Support for multiple caption types
  3. Metadata storage
  4. Domain knowledge integration
  5. External context support through attachments

Caption

Bases: BaseModel

Result class for captioning operations (image, video, etc.).

This class provides a structured format for captioning results, supporting:

  • Multiple caption types (one-liner, detailed, domain-specific)
  • Caption count tracking
  • Metadata storage for processing details

Attributes:

text_one_liner (str): Brief, single-sentence summary of the content. Defaults to empty string if not provided.

text_context (str): Detailed, multi-sentence description of the content. Defaults to empty string if not provided.

domain_knowledge (str): Domain-specific interpretation or context. Defaults to empty string if not provided.

number_of_captions (int): Total number of distinct captions generated. Defaults to 0 if no captions are generated.

media_metadata (dict[str, Any]): Additional information about the media such as location.

multimodal_context (list[Attachment | str]): Optional list of external context objects (files, bytes, or pre-processed inputs) or raw strings that can enrich captioning results. Bytes are automatically converted into Attachment objects via Attachment.from_bytes.

output_schema (str): Output schema. Defaults to empty string if not provided.

schema_description (str): Schema description. Defaults to empty string if not provided.

language (str): Language of the captions. Defaults to "Indonesian" if not provided.
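To make the field layout concrete, here is a minimal sketch that mirrors the documented attributes and defaults using a standard-library dataclass. It is not the library's class: the real Caption derives from BaseModel, the name CaptionSketch is hypothetical, multimodal_context is simplified to list[Any], and the empty-container defaults for media_metadata and multimodal_context are assumptions inferred from the documented None handling.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CaptionSketch:
    """Stand-in mirroring the documented Caption attributes (hypothetical name)."""

    text_one_liner: str = ""
    text_context: str = ""
    domain_knowledge: str = ""
    number_of_captions: int = 0
    # Assumption: empty containers are used when nothing is supplied.
    media_metadata: dict[str, Any] = field(default_factory=dict)
    multimodal_context: list[Any] = field(default_factory=list)
    output_schema: str = ""
    schema_description: str = ""
    language: str = "Indonesian"


# Only the fields that are known need to be passed; the rest use defaults.
caption = CaptionSketch(
    text_one_liner="A cat sleeping on a sofa.",
    number_of_captions=1,
)
print(caption.language)
```

Fields that are not passed fall back to the documented defaults, so caption.language is "Indonesian" in this sketch.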

Deprecated

  • image_one_liner: Use text_one_liner instead. Will be removed in 0.4.0.
  • image_description: Use text_context instead. Will be removed in 0.4.0.
  • image_metadata: Use media_metadata instead. Will be removed in 0.4.0.

handle_deprecated_fields(values) classmethod

Map deprecated field names to their replacements and emit warnings.

Deprecated

  • image_one_liner: Use text_one_liner instead. Will be removed in 0.4.0.
  • image_description: Use text_context instead. Will be removed in 0.4.0.
  • image_metadata: Use media_metadata instead. Will be removed in 0.4.0.
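The renaming behaviour described for handle_deprecated_fields can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: map_deprecated_fields and DEPRECATED_FIELDS are hypothetical names, and the rule that an explicitly supplied new-style field takes precedence over its deprecated counterpart is an assumption the source does not confirm.

```python
import warnings
from typing import Any

# Deprecated name -> replacement, as listed in the deprecation notice.
DEPRECATED_FIELDS = {
    "image_one_liner": "text_one_liner",
    "image_description": "text_context",
    "image_metadata": "media_metadata",
}


def map_deprecated_fields(values: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of values with deprecated keys renamed (hypothetical helper)."""
    result = dict(values)
    for old_name, new_name in DEPRECATED_FIELDS.items():
        if old_name not in result:
            continue
        warnings.warn(
            f"{old_name} is deprecated; use {new_name} instead. "
            "It will be removed in 0.4.0.",
            DeprecationWarning,
            stacklevel=2,
        )
        old_value = result.pop(old_name)
        # Assumption: an explicitly supplied new-style field wins over the old one.
        result.setdefault(new_name, old_value)
    return result
```

Calling map_deprecated_fields({"image_one_liner": "A cat."}) in this sketch returns {"text_one_liner": "A cat."} and emits one DeprecationWarning.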

handle_multimodal_context(multimodal_value) classmethod

Normalize and validate multimodal_context.

This method ensures that the multimodal_context field is a list of Attachment objects or strings. It handles multiple input cases:

  • None -> returns an empty list
  • list[bytes] -> converts each item into an Attachment via Attachment.from_bytes
  • list[Attachment] -> keeps as-is
  • list[str] -> keeps each string as-is unless it is a valid image/binary source, in which case it is converted to an Attachment
  • list[mixed] -> normalizes supported types

Parameters:

multimodal_value (Any): Input value provided to multimodal_context. Required.

Returns:

Any: list[Attachment | str], a normalized list of Attachment objects or strings.
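The normalization cases listed for handle_multimodal_context can be sketched as a plain function. Several parts are assumptions made for illustration: the Attachment class below is a stand-in with a hypothetical data field (only the name Attachment.from_bytes comes from the documentation), normalize_multimodal_context is a hypothetical name, strings are always kept as-is because the check for a valid image/binary source is not described here, and raising TypeError for unsupported input is a guess rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Attachment:
    """Stand-in for the library's Attachment class (fields are hypothetical)."""

    data: bytes

    @classmethod
    def from_bytes(cls, data: bytes) -> "Attachment":
        return cls(data=data)


def normalize_multimodal_context(value: Any) -> list[Any]:
    """Normalize input into a list of Attachment objects or strings."""
    if value is None:
        return []
    if not isinstance(value, list):
        # Assumption: anything other than None or a list is rejected.
        raise TypeError("multimodal_context must be a list or None")
    normalized: list[Any] = []
    for item in value:
        if isinstance(item, bytes):
            # Raw bytes are wrapped in an Attachment.
            normalized.append(Attachment.from_bytes(item))
        elif isinstance(item, (Attachment, str)):
            # Simplification: strings are always kept as-is in this sketch.
            normalized.append(item)
        else:
            raise TypeError(f"Unsupported item type: {type(item).__name__}")
    return normalized
```

A mixed list such as [b"abc", "note"] comes back as one Attachment followed by the unchanged string, and None comes back as an empty list.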

handle_none_metadata(metadata_value) classmethod

Replace a None value for media_metadata with an empty dict.

handle_none_number_of_captions(caption_value) classmethod

Replace a None value for number_of_captions with the field's default (0).

handle_none_values(str_value) classmethod

Convert None values for string fields to their default values.
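The three None handlers share one idea: substitute the field's default when the input is None. A minimal sketch of that idea follows. The helper name none_to_default and the raw dictionary are hypothetical, and the real validators are classmethods on the model rather than a free function.

```python
from typing import Any


def none_to_default(value: Any, default: Any) -> Any:
    """Return default when value is None, otherwise return value unchanged."""
    return default if value is None else value


# Hypothetical raw input in which every field arrived as None.
raw = {"text_one_liner": None, "number_of_captions": None, "media_metadata": None}

cleaned = {
    "text_one_liner": none_to_default(raw["text_one_liner"], ""),
    "number_of_captions": none_to_default(raw["number_of_captions"], 0),
    "media_metadata": none_to_default(raw["media_metadata"], {}),
}
print(cleaned)
```

Values that are not None pass through untouched, so a caller who supplies real data is never overridden.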