module pixeltable.functions.huggingface

Pixeltable UDFs that wrap various models from the Hugging Face transformers package. These UDFs will cause Pixeltable to invoke the relevant models locally. In order to use them, you must first pip install transformers (or in some cases, sentence-transformers, as noted in the specific UDFs).

UDFs

udf automatic_speech_recognition()

automatic_speech_recognition(
    audio: Audio,
    *,
    model_id: String,
    language: String | None = None,
    chunk_length_s: Int | None = None,
    return_timestamps: Bool = False
) -> String

Transcribes speech to text using a pretrained ASR model. model_id should be a reference to a pretrained automatic-speech-recognition model. This is a generic function that works with many ASR model families. For production use with specific models, consider specialized functions like whisper.transcribe() or speech2text_for_conditional_generation(). Requirements:

pip install torch transformers torchaudio

Recommended Models:

OpenAI Whisper: openai/whisper-tiny.en, openai/whisper-small, openai/whisper-base
Facebook Wav2Vec2: facebook/wav2vec2-base-960h, facebook/wav2vec2-large-960h-lv60-self
Microsoft SpeechT5: microsoft/speecht5_asr
Meta MMS (Multilingual): facebook/mms-1b-all

Parameters:

audio (Audio): The audio file(s) to transcribe.
model_id (String): The pretrained ASR model to use.
language (String | None): Language code for multilingual models (e.g., ‘en’, ‘es’, ‘fr’).
chunk_length_s (Int | None): Maximum length of audio chunks in seconds for long audio processing.
return_timestamps (Bool): Whether to return word-level timestamps (model dependent).

Returns:

String: The transcribed text.

Examples: Add a computed column that transcribes audio files:

tbl.add_computed_column(
    transcription=automatic_speech_recognition(
        tbl.audio_file,
        model_id='openai/whisper-tiny.en',  # Recommended
    )
)

Transcribe with language specification:

tbl.add_computed_column(
    transcription=automatic_speech_recognition(
        tbl.audio_file, model_id='facebook/mms-1b-all', language='en'
    )
)
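
The chunk_length_s parameter bounds how much audio is processed at once. As a rough illustration only, the sketch below splits a clip's duration into consecutive windows of at most chunk_length_s seconds. The helper name chunk_bounds is hypothetical, and the underlying speech-recognition pipeline may also overlap neighboring chunks, which this simplified version ignores.

```python
def chunk_bounds(duration_s: float, chunk_length_s: int) -> list[tuple[float, float]]:
    """Split [0, duration_s) into consecutive windows of at most chunk_length_s seconds.

    Hypothetical helper for illustration; not part of Pixeltable or transformers.
    """
    if chunk_length_s <= 0:
        raise ValueError('chunk_length_s must be positive')
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


# A 75-second clip with 30-second chunks yields three windows; the last is shorter.
windows = chunk_bounds(75.0, 30)
```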

udf clip()

# Signature 1:
clip(text: String, model_id: String) -> Array[(None,), Float]

# Signature 2:
clip(image: Image, model_id: String) -> Array[(None,), Float]

Computes a CLIP embedding for the specified text or image. model_id should be a reference to a pretrained CLIP Model. Requirements:

pip install torch transformers

Parameters:

text (String): The string to embed (signature 1).
image (Image): The image to embed (signature 2).
model_id (String): The pretrained model to use for the embedding.

Returns:

Array[(None,), Float]: An array containing the output of the embedding model.

Examples: Add a computed column that applies the model openai/clip-vit-base-patch32 to an existing Pixeltable column tbl.text of the table tbl:

tbl.add_computed_column(
    result=clip(tbl.text, model_id='openai/clip-vit-base-patch32')
)
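
Embeddings such as those returned by clip() are commonly compared with cosine similarity. The following is a minimal sketch in plain Python; the short vectors are made-up placeholders standing in for real embeddings, and cosine_similarity is a hypothetical helper rather than a Pixeltable function.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for one text embedding and two image embeddings.
text_emb = [1.0, 0.0, 1.0]
image_embs = {'img_a': [2.0, 0.0, 2.0], 'img_b': [0.0, 3.0, 0.0]}

# Rank the images by similarity to the text, most similar first.
ranked = sorted(
    image_embs,
    key=lambda name: cosine_similarity(text_emb, image_embs[name]),
    reverse=True,
)
```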

udf cross_encoder()

cross_encoder(
    sentences1: String,
    sentences2: String,
    *,
    model_id: String
) -> Float

Performs predictions on the given sentence pair. model_id should be a pretrained Cross-Encoder model, as described in the Cross-Encoder Pretrained Models documentation. Requirements:

pip install torch sentence-transformers

Parameters:

sentences1 (String): The first sentence to be paired.
sentences2 (String): The second sentence to be paired.
model_id (String): The identifier of the cross-encoder model to use.

Returns:

Float: The similarity score between the inputs.

Examples: Add a computed column that applies the model ms-marco-MiniLM-L-4-v2 to the sentences in columns tbl.sentence1 and tbl.sentence2:

tbl.add_computed_column(
    result=cross_encoder(
        tbl.sentence1, tbl.sentence2, model_id='ms-marco-MiniLM-L-4-v2'
    )
)

udf cross_encoder_list()

cross_encoder_list(
    sentence1: String,
    sentences2: Json,
    *,
    model_id: String
) -> Json

udf detr_for_object_detection()

detr_for_object_detection(
    image: Image,
    *,
    model_id: String,
    threshold: Float = 0.5,
    revision: String = 'no_timm'
) -> Json

Computes DETR object detections for the specified image. model_id should be a reference to a pretrained DETR Model. Requirements:

pip install torch transformers

Parameters:

image (Image): The image in which to detect objects.
model_id (String): The pretrained model to use for object detection.

Returns:

Json: A dictionary containing the output of the object detection model, in the following format:

{
    'scores': [0.99, 0.999],  # list of confidence scores for each detected object
    'labels': [25, 25],  # list of COCO class labels for each detected object
    'label_text': ['giraffe', 'giraffe'],  # corresponding text names of class labels
    'boxes': [[51.942, 356.174, 181.481, 413.975], [383.225, 58.66, 605.64, 361.346]]
        # list of bounding boxes for each detected object, as [x1, y1, x2, y2]
}

Examples: Add a computed column that applies the model facebook/detr-resnet-50 to an existing Pixeltable column image of the table tbl:

tbl.add_computed_column(
    detections=detr_for_object_detection(
        tbl.image, model_id='facebook/detr-resnet-50', threshold=0.8
    )
)
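
The lists in the returned dictionary are parallel: the entries at a given index in scores, labels, label_text, and boxes describe the same detected object. As a sketch of post-processing, the snippet below pairs them up and keeps only high-confidence detections. It reuses the sample values from the format shown above; the 0.995 cutoff is arbitrary.

```python
# Sample output in the documented format (values copied from the format example).
detections = {
    'scores': [0.99, 0.999],
    'labels': [25, 25],
    'label_text': ['giraffe', 'giraffe'],
    'boxes': [[51.942, 356.174, 181.481, 413.975], [383.225, 58.66, 605.64, 361.346]],
}

min_score = 0.995  # arbitrary cutoff chosen for this illustration

# Zip the parallel lists into one record per detected object, then filter by score.
confident = [
    {'score': score, 'label': text, 'box': box}
    for score, text, box in zip(
        detections['scores'], detections['label_text'], detections['boxes']
    )
    if score >= min_score
]
```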

udf detr_to_coco()

detr_to_coco(image: Image, detr_info: Json) -> Json

Converts the output of a DETR object detection model to COCO format. Parameters:

image (Image): The image for which detections were computed.
detr_info (Json): The output of a DETR object detection model, as returned by detr_for_object_detection.

Returns:

Json: A dictionary containing the data from detr_info, converted to COCO format.

Examples: Add a computed column that converts the output tbl.detections to COCO format, where tbl.image is the image for which detections were computed:

tbl.add_computed_column(
    detections_coco=detr_to_coco(tbl.image, tbl.detections)
)
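
The exact layout produced by detr_to_coco is not spelled out here. One part of such a conversion can be illustrated independently: DETR boxes are given as [x1, y1, x2, y2] corners, while COCO annotations describe a box as [x, y, width, height]. The helper below is a hypothetical sketch of that single step, not the function's actual implementation.

```python
def xyxy_to_xywh(box: list[float]) -> list[float]:
    """Convert [x1, y1, x2, y2] corners to COCO-style [x, y, width, height].

    Hypothetical helper for illustration only.
    """
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


# Made-up box: top-left corner (10, 20), bottom-right corner (50, 80).
coco_box = xyxy_to_xywh([10.0, 20.0, 50.0, 80.0])
```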

udf image_captioning()

image_captioning(
    image: Image,
    *,
    model_id: String,
    model_kwargs: Json | None = None
) -> String

Generates captions for images using a pretrained image captioning model. model_id should be a reference to a pretrained image-to-text model such as BLIP, GIT, or LLaVA. Requirements:

pip install torch transformers

Parameters:

image (Image): The image to caption.
model_id (String): The pretrained model to use for captioning.
model_kwargs (Json | None): Additional keyword arguments to pass to the model’s generate method, such as max_length.

Returns:

String: The generated caption text.

Examples: Add a computed column caption to an existing table tbl that generates captions using the Salesforce/blip-image-captioning-base model:

tbl.add_computed_column(
    caption=image_captioning(
        tbl.image,
        model_id='Salesforce/blip-image-captioning-base',
        model_kwargs={'max_length': 30},
    )
)

udf image_to_image()

image_to_image(
    image: Image,
    prompt: String,
    *,
    model_id: String,
    seed: Int | None = None,
    model_kwargs: Json | None = None
) -> Image

Transforms input images based on text prompts using a pretrained image-to-image model. model_id should be a reference to a pretrained image-to-image model. Requirements:

pip install torch transformers diffusers accelerate

Parameters:

image (Image): The input image to transform.
prompt (String): The text prompt describing the desired transformation.
model_id (String): The pretrained image-to-image model to use.
seed (Int | None): Random seed for reproducibility.
model_kwargs (Json | None): Additional keyword arguments to pass to the model, such as strength, guidance_scale, or num_inference_steps.

Returns:

Image: The transformed image.

Examples: Add a computed column that transforms images based on prompts:

tbl.add_computed_column(
    transformed=image_to_image(
        tbl.source_image,
        tbl.transformation_prompt,
        model_id='runwayml/stable-diffusion-v1-5',
    )
)

udf image_to_video()

image_to_video(
    image: Image,
    *,
    model_id: String,
    num_frames: Int = 25,
    fps: Int = 6,
    seed: Int | None = None,
    model_kwargs: Json | None = None
) -> Video

Generates videos from input images using a pretrained image-to-video model. model_id should be a reference to a pretrained image-to-video model. Requirements:

pip install torch transformers diffusers accelerate

Parameters:

image (Image): The input image to animate into a video.
model_id (String): The pretrained image-to-video model to use.
num_frames (Int): Number of video frames to generate.
fps (Int): Frames per second for the output video.
seed (Int | None): Random seed for reproducibility.
model_kwargs (Json | None): Additional keyword arguments to pass to the model, such as num_inference_steps, motion_bucket_id, or guidance_scale.

Returns:

Video: The generated video file.

Examples: Add a computed column that creates videos from images:

tbl.add_computed_column(
    video=image_to_video(
        tbl.input_image,
        model_id='stabilityai/stable-video-diffusion-img2vid-xt',
        num_frames=25,
        fps=7,
    )
)

udf question_answering()

question_answering(
    context: String,
    question: String,
    *,
    model_id: String
) -> Json

Answers questions based on provided context using a pretrained QA model. model_id should be a reference to a pretrained question answering model such as BERT or RoBERTa. Requirements:

pip install torch transformers

Parameters:

context (String): The context text containing the answer.
question (String): The question to answer.
model_id (String): The pretrained QA model to use.

Returns:

Json: A dictionary containing the answer, confidence score, and start/end positions.

Examples: Add a computed column that answers questions based on document context:

tbl.add_computed_column(
    answer=question_answering(
        tbl.document_text,
        tbl.question,
        model_id='deepset/roberta-base-squad2',
    )
)
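
The result includes character positions as well as the answer text. Assuming the dictionary uses the keys answer, score, start, and end (the key names are an assumption based on the description above, and the sample values are made up), the positions can be used to locate the answer inside the original context:

```python
context = 'The giraffe is the tallest living land animal.'

# Made-up result; the key names 'answer', 'score', 'start', 'end' are assumed.
result = {'answer': 'giraffe', 'score': 0.95, 'start': 4, 'end': 11}

# start/end are treated here as character offsets into the context (end exclusive).
span = context[result['start']:result['end']]
```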

udf sentence_transformer()

sentence_transformer(
    sentence: String,
    *,
    model_id: String,
    normalize_embeddings: Bool = False
) -> Array[(None,), Float]

Computes sentence embeddings. model_id should be a pretrained Sentence Transformers model, as described in the Sentence Transformers Pretrained Models documentation. Requirements:

pip install torch sentence-transformers

Parameters:

sentence (String): The sentence to embed.
model_id (String): The pretrained model to use for the encoding.
normalize_embeddings (Bool): If True, normalizes embeddings to length 1; see the Sentence Transformers API Docs for more details.

Returns:

Array[(None,), Float]: An array containing the output of the embedding model.

Examples: Add a computed column that applies the model all-mpnet-base-v2 to an existing Pixeltable column tbl.sentence of the table tbl:

tbl.add_computed_column(
    result=sentence_transformer(
        tbl.sentence, model_id='all-mpnet-base-v2'
    )
)
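
With normalize_embeddings=True, each embedding is scaled to unit length. The following plain-Python sketch shows what that scaling means; the vector is a made-up placeholder and l2_normalize is a hypothetical helper, not the library's code.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector so its Euclidean (L2) length is 1 (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError('cannot normalize a zero vector')
    return [x / norm for x in vec]


# The vector [3, 4] has length 5, so each component is divided by 5.
unit = l2_normalize([3.0, 4.0])
length = math.sqrt(sum(x * x for x in unit))
```

Unit-length embeddings are convenient because their dot product equals their cosine similarity.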

udf sentence_transformer_list()

sentence_transformer_list(
    sentences: Json,
    *,
    model_id: String,
    normalize_embeddings: Bool = False
) -> Json

udf speech2text_for_conditional_generation()

speech2text_for_conditional_generation(
    audio: Audio,
    *,
    model_id: String,
    language: String | None = None
) -> String

Transcribes or translates speech to text using a Speech2Text model. model_id should be a reference to a pretrained Speech2Text model. Requirements:

pip install torch torchaudio sentencepiece transformers

Parameters:

audio (Audio): The audio clip to transcribe or translate.
model_id (String): The pretrained model to use for the transcription or translation.
language (String | None): If using a multilingual translation model, the language code to translate to. If not provided, the model’s default language will be used. If the model is not a translation model, is not a multilingual model, or does not support the specified language, an error will be raised.

Returns:

String: The transcribed or translated text.

Examples: Add a computed column that applies the model facebook/s2t-small-librispeech-asr to an existing Pixeltable column audio of the table tbl:

tbl.add_computed_column(
    transcription=speech2text_for_conditional_generation(
        tbl.audio, model_id='facebook/s2t-small-librispeech-asr'
    )
)

Add a computed column that applies the model facebook/s2t-medium-mustc-multilingual-st to an existing Pixeltable column audio of the table tbl, translating the audio to French:

tbl.add_computed_column(
    translation=speech2text_for_conditional_generation(
        tbl.audio,
        model_id='facebook/s2t-medium-mustc-multilingual-st',
        language='fr',
    )
)

udf summarization()

summarization(
    text: String,
    *,
    model_id: String,
    model_kwargs: Json | None = None
) -> String

Summarizes text using a pretrained summarization model. model_id should be a reference to a pretrained summarization model such as BART, T5, or Pegasus. Requirements:

pip install torch transformers

Parameters:

text (String): The text to summarize.
model_id (String): The pretrained model to use for summarization.
model_kwargs (Json | None): Additional keyword arguments to pass to the model’s generate method, such as max_length.

Returns:

String: The generated summary text.

Examples: Add a computed column that summarizes documents:

tbl.add_computed_column(
    summary=summarization(
        tbl.document_text,
        model_id='facebook/bart-large-cnn',
        model_kwargs={'max_length': 100},
    )
)

udf text_classification()

text_classification(text: String, *, model_id: String, top_k: Int = 5) -> Json

Classifies text using a pretrained classification model. model_id should be a reference to a pretrained text classification model such as BERT, RoBERTa, or DistilBERT. Requirements:

pip install torch transformers

Parameters:

text (String): The text to classify.
model_id (String): The pretrained model to use for classification.
top_k (Int): The number of top predictions to return.

Returns:

Json: A dictionary containing classification results with scores, labels, and label text.

Examples: Add a computed column for sentiment analysis:

tbl.add_computed_column(
    sentiment=text_classification(
        tbl.review_text,
        model_id='cardiffnlp/twitter-roberta-base-sentiment-latest',
    )
)

udf text_generation()

text_generation(
    text: String,
    *,
    model_id: String,
    model_kwargs: Json | None = None
) -> String

Generates text using a pretrained language model. model_id should be a reference to a pretrained text generation model. Requirements:

pip install torch transformers

Parameters:

text (String): The input text to continue/complete.
model_id (String): The pretrained model to use for text generation.
model_kwargs (Json | None): Additional keyword arguments to pass to the model’s generate method, such as max_length, temperature, etc. See the Hugging Face text_generation documentation for details.

Returns:

String: The generated text completion.

Examples: Add a computed column that generates text completions using the Qwen/Qwen3-0.6B model:

tbl.add_computed_column(
    completion=text_generation(
        tbl.prompt,
        model_id='Qwen/Qwen3-0.6B',
        model_kwargs={'temperature': 0.5, 'max_length': 150},
    )
)

udf text_to_image()

text_to_image(
    prompt: String,
    *,
    model_id: String,
    height: Int = 512,
    width: Int = 512,
    seed: Int | None = None,
    model_kwargs: Json | None = None
) -> Image

Generates images from text prompts using a pretrained text-to-image model. model_id should be a reference to a pretrained text-to-image model such as Stable Diffusion or FLUX. Requirements:

pip install torch transformers diffusers accelerate

Parameters:

prompt (String): The text prompt describing the desired image.
model_id (String): The pretrained text-to-image model to use.
height (Int): Height of the generated image in pixels.
width (Int): Width of the generated image in pixels.
seed (Int | None): Optional random seed for reproducibility.
model_kwargs (Json | None): Additional keyword arguments to pass to the model, such as num_inference_steps, guidance_scale, or negative_prompt.

Returns:

Image: The generated image.

Examples: Add a computed column that generates images from text prompts:

tbl.add_computed_column(
    generated_image=text_to_image(
        tbl.prompt,
        model_id='stable-diffusion-v1.5/stable-diffusion-v1-5',
        height=512,
        width=512,
        model_kwargs={'num_inference_steps': 25},
    )
)

udf text_to_speech()

text_to_speech(
    text: String,
    *,
    model_id: String,
    speaker_id: Int | None = None,
    vocoder: String | None = None
) -> Audio

Converts text to speech using a pretrained TTS model. model_id should be a reference to a pretrained text-to-speech model. Requirements:

pip install torch transformers datasets soundfile

Parameters:

text (String): The text to convert to speech.
model_id (String): The pretrained TTS model to use.
speaker_id (Int | None): Speaker ID for multi-speaker models.
vocoder (String | None): Optional vocoder model for higher quality audio.

Returns:

Audio: The generated audio file.

Examples: Add a computed column that converts text to speech:

tbl.add_computed_column(
    audio=text_to_speech(
        tbl.text_content, model_id='microsoft/speecht5_tts', speaker_id=0
    )
)

udf token_classification()

token_classification(
    text: String,
    *,
    model_id: String,
    aggregation_strategy: String = 'simple'
) -> Json

Extracts named entities from text using a pretrained named entity recognition (NER) model. model_id should be a reference to a pretrained token classification model for NER. Requirements:

pip install torch transformers

Parameters:

text (String): The text to analyze for named entities.
model_id (String): The pretrained model to use.
aggregation_strategy (String): Method used to aggregate tokens.

Returns:

Json: A list of dictionaries containing entity information (text, label, confidence, start, end).

Examples: Add a computed column that extracts named entities:

tbl.add_computed_column(
    entities=token_classification(
        tbl.text,
        model_id='dbmdz/bert-large-cased-finetuned-conll03-english',
    )
)

udf translation()

translation(
    text: String,
    *,
    model_id: String,
    src_lang: String | None = None,
    target_lang: String | None = None
) -> String

Translates text using a pretrained translation model. model_id should be a reference to a pretrained translation model such as MarianMT or T5. Requirements:

pip install torch transformers sentencepiece

Parameters:

text (String): The text to translate.
model_id (String): The pretrained translation model to use.
src_lang (String | None): Source language code (optional, can be inferred from model).
target_lang (String | None): Target language code (optional, can be inferred from model).

Returns:

String: The translated text.

Examples: Add a computed column that translates text:

tbl.add_computed_column(
    french_text=translation(
        tbl.english_text,
        model_id='Helsinki-NLP/opus-mt-en-fr',
        src_lang='en',
        target_lang='fr',
    )
)

udf vit_for_image_classification()

vit_for_image_classification(image: Image, *, model_id: String, top_k: Int = 5) -> Json

Computes image classifications for the specified image using a Vision Transformer (ViT) model. model_id should be a reference to a pretrained ViT Model. Note: Be sure the model is a ViT model that is trained for image classification (that is, a model designed for use with the ViTForImageClassification class), such as google/vit-base-patch16-224. General feature-extraction models such as google/vit-base-patch16-224-in21k will not produce the desired results. Requirements:

pip install torch transformers

Parameters:

image (Image): The image to classify.
model_id (String): The pretrained model to use for the classification.
top_k (Int): The number of classes to return.

Returns:

Json: A dictionary containing the output of the image classification model, in the following format:

{
    'scores': [0.325, 0.198, 0.105],  # list of probabilities of the top-k most likely classes
    'labels': [340, 353, 386],  # list of class IDs for the top-k most likely classes
    'label_text': ['zebra', 'gazelle', 'African elephant, Loxodonta africana'],
        # corresponding text names of the top-k most likely classes
}

Examples: Add a computed column that applies the model google/vit-base-patch16-224 to an existing Pixeltable column image of the table tbl, returning the 10 most likely classes for each image:

tbl.add_computed_column(
    image_class=vit_for_image_classification(
        tbl.image, model_id='google/vit-base-patch16-224', top_k=10
    )
)
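
The scores, labels, and label_text lists line up by index. A small sketch, using the sample values from the format shown above, that pairs each class name with its probability and picks the most likely class:

```python
# Sample output in the documented format (values copied from the format example).
classification = {
    'scores': [0.325, 0.198, 0.105],
    'labels': [340, 353, 386],
    'label_text': ['zebra', 'gazelle', 'African elephant, Loxodonta africana'],
}

# Pair each class name with its probability.
by_name = dict(zip(classification['label_text'], classification['scores']))

# Pick the class name with the highest probability.
top_name = max(by_name, key=by_name.get)
```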
