transformers package.
These UDFs cause Pixeltable to invoke the relevant models locally. To use them, you must first `pip install transformers` (or, in some cases, `sentence-transformers`, as noted in the specific UDFs).
UDFs
udf automatic_speech_recognition()
model_id should be a reference to a
pretrained automatic-speech-recognition model.
This is a generic function that works with many ASR model families. For production use with specific models, consider specialized functions such as `whisper.transcribe()` or `speech2text_for_conditional_generation()`.
Requirements:
pip install torch transformers torchaudio
Supported model families include:
- OpenAI Whisper: `openai/whisper-tiny.en`, `openai/whisper-small`, `openai/whisper-base`
- Facebook Wav2Vec2: `facebook/wav2vec2-base-960h`, `facebook/wav2vec2-large-960h-lv60-self`
- Microsoft SpeechT5: `microsoft/speecht5_asr`
- Meta MMS (multilingual): `facebook/mms-1b-all`
Parameters:
- `audio` (Audio): The audio file(s) to transcribe.
- `model_id` (String): The pretrained ASR model to use.
- `language` (String | None): Language code for multilingual models (e.g., 'en', 'es', 'fr').
- `chunk_length_s` (Int | None): Maximum length of audio chunks in seconds, for processing long audio.
- `return_timestamps` (Bool): Whether to return word-level timestamps (model dependent).

Returns:
String: The transcribed text.
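A minimal usage sketch. The column name `transcription` and the import path `pixeltable.functions.huggingface` are assumptions (adjust the import to match your Pixeltable version); `tbl` is assumed to be a table with an `audio` column:

```python
from pixeltable.functions.huggingface import automatic_speech_recognition

# Transcribe each file in the audio column with a Whisper checkpoint
tbl.add_computed_column(
    transcription=automatic_speech_recognition(tbl.audio, model_id='openai/whisper-small')
)
```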
udf clip()
model_id should be a reference to a pretrained
CLIP Model.
Requirements:
pip install torch transformers
Parameters:
- `text` (String): The string to embed.
- `model_id` (String): The pretrained model to use for the embedding.

Returns:
Array[(None,), Float]: An array containing the output of the embedding model.
Example: add a computed column that applies the model `openai/clip-vit-base-patch32` to an existing Pixeltable column `tbl.text` of the table `tbl`:
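A sketch of that computed column (import path assumed, as above):

```python
from pixeltable.functions.huggingface import clip

tbl.add_computed_column(
    text_embedding=clip(tbl.text, model_id='openai/clip-vit-base-patch32')
)
```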
udf cross_encoder()
model_id should be a pretrained Cross-Encoder model, as described in the Cross-Encoder Pretrained Models documentation.
Requirements:
pip install torch sentence-transformers
Parameters:
- `sentences1` (String): The first sentence to be paired.
- `sentences2` (String): The second sentence to be paired.
- `model_id` (String): The identifier of the cross-encoder model to use.

Returns:
Float: The similarity score between the inputs.
Example: apply the model `ms-marco-MiniLM-L-4-v2` to the sentences in columns `tbl.sentence1` and `tbl.sentence2`:
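A sketch (import path assumed, as above; depending on your setup, the full Hub id `cross-encoder/ms-marco-MiniLM-L-4-v2` may be required):

```python
from pixeltable.functions.huggingface import cross_encoder

tbl.add_computed_column(
    similarity=cross_encoder(tbl.sentence1, tbl.sentence2, model_id='ms-marco-MiniLM-L-4-v2')
)
```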
udf cross_encoder_list()
List variant of `cross_encoder()` that operates on a list of sentences; the same model requirements apply.
udf detr_for_object_detection()
model_id should be a reference to a pretrained
DETR Model.
Requirements:
pip install torch transformers
Parameters:
- `image` (Image): The image in which to detect objects.
- `model_id` (String): The pretrained model to use for object detection.

Returns:
Json: A dictionary containing the output of the object detection model, in the following format:
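Based on DETR's standard outputs, the dictionary is likely of roughly this shape (an illustrative sketch; exact keys and values may differ by version):

```python
{
    'scores': [0.99, 0.999],                # per-detection confidence scores
    'labels': [25, 25],                     # numeric class ids
    'label_text': ['giraffe', 'giraffe'],   # human-readable class names
    'boxes': [[51.9, 356.8, 181.5, 413.2],  # bounding boxes as [x1, y1, x2, y2]
              [383.2, 58.1, 605.9, 361.3]]
}
```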
Example: add a computed column that applies the model `facebook/detr-resnet-50` to an existing Pixeltable column `image` of the table `tbl`:
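A sketch (import path assumed, as above):

```python
from pixeltable.functions.huggingface import detr_for_object_detection

tbl.add_computed_column(
    detections=detr_for_object_detection(tbl.image, model_id='facebook/detr-resnet-50')
)
```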
udf detr_to_coco()
Parameters:
- `image` (Image): The image for which detections were computed.
- `detr_info` (Json): The output of a DETR object detection model, as returned by `detr_for_object_detection()`.

Returns:
Json: A dictionary containing the data from `detr_info`, converted to COCO format.
Example: convert the detections in `tbl.detections` to COCO format, where `tbl.image` is the image for which the detections were computed:
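A sketch (import path assumed, as above):

```python
from pixeltable.functions.huggingface import detr_to_coco

tbl.add_computed_column(detections_coco=detr_to_coco(tbl.image, tbl.detections))
```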
udf image_captioning()
model_id should be a reference to a
pretrained image-to-text model such as BLIP, Git, or LLaVA.
Requirements:
pip install torch transformers
Parameters:
- `image` (Image): The image to caption.
- `model_id` (String): The pretrained model to use for captioning.
- `model_kwargs` (Json | None): Additional keyword arguments to pass to the model's `generate` method, such as `max_length`.

Returns:
String: The generated caption text.
Example: add a computed column `caption` to an existing table `tbl` that generates captions using the `Salesforce/blip-image-captioning-base` model:
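A sketch (import path assumed, as above; `tbl` is assumed to have an `image` column):

```python
from pixeltable.functions.huggingface import image_captioning

tbl.add_computed_column(
    caption=image_captioning(tbl.image, model_id='Salesforce/blip-image-captioning-base')
)
```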
udf image_to_image()
model_id should be a reference to a pretrained image-to-image model.
Requirements:
pip install torch transformers diffusers accelerate
Parameters:
- `image` (Image): The input image to transform.
- `prompt` (String): The text prompt describing the desired transformation.
- `model_id` (String): The pretrained image-to-image model to use.
- `seed` (Int | None): Random seed for reproducibility.
- `model_kwargs` (Json | None): Additional keyword arguments to pass to the model, such as `strength`, `guidance_scale`, or `num_inference_steps`.

Returns:
Image: The transformed image.
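A sketch (import path assumed, as above; the model id and prompt are arbitrary examples, not values the library prescribes):

```python
from pixeltable.functions.huggingface import image_to_image

# Transform each image according to a text prompt; any diffusers
# image-to-image model should work in place of the example model id
tbl.add_computed_column(
    stylized=image_to_image(
        tbl.image,
        prompt='a watercolor painting of the same scene',
        model_id='stabilityai/stable-diffusion-2-1',
        seed=42,
    )
)
```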
udf image_to_video()
model_id should be a reference to a pretrained image-to-video model.
Requirements:
pip install torch transformers diffusers accelerate
Parameters:
- `image` (Image): The input image to animate into a video.
- `model_id` (String): The pretrained image-to-video model to use.
- `num_frames` (Int): Number of video frames to generate.
- `fps` (Int): Frames per second for the output video.
- `seed` (Int | None): Random seed for reproducibility.
- `model_kwargs` (Json | None): Additional keyword arguments to pass to the model, such as `num_inference_steps`, `motion_bucket_id`, or `guidance_scale`.

Returns:
Video: The generated video file.
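A sketch (import path assumed, as above; the model id is an arbitrary example of an image-to-video diffusion model):

```python
from pixeltable.functions.huggingface import image_to_video

# Animate each image into a short clip
tbl.add_computed_column(
    animation=image_to_video(
        tbl.image,
        model_id='stabilityai/stable-video-diffusion-img2vid',
        num_frames=25,
        fps=7,
    )
)
```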
udf question_answering()
model_id should be a reference to a
pretrained question answering model such as BERT or RoBERTa.
Requirements:
pip install torch transformers
Parameters:
- `context` (String): The context text containing the answer.
- `question` (String): The question to answer.
- `model_id` (String): The pretrained QA model to use.

Returns:
Json: A dictionary containing the answer, confidence score, and start/end positions.
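A sketch (import path assumed, as above; the model id is an arbitrary example of a QA checkpoint, and `tbl` is assumed to have `context` and `question` columns):

```python
from pixeltable.functions.huggingface import question_answering

tbl.add_computed_column(
    answer=question_answering(
        tbl.context, tbl.question, model_id='deepset/roberta-base-squad2'
    )
)
```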
udf sentence_transformer()
model_id should be a pretrained Sentence Transformers model, as described
in the Sentence Transformers Pretrained Models documentation.
Requirements:
pip install torch sentence-transformers
Parameters:
- `sentence` (String): The sentence to embed.
- `model_id` (String): The pretrained model to use for the encoding.
- `normalize_embeddings` (Bool): If `True`, normalizes embeddings to length 1; see the Sentence Transformers API Docs for more details.

Returns:
Array[(None,), Float]: An array containing the output of the embedding model.
Example: add a computed column that applies the model `all-mpnet-base-v2` to an existing Pixeltable column `tbl.sentence` of the table `tbl`:
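A sketch (import path assumed, as above):

```python
from pixeltable.functions.huggingface import sentence_transformer

tbl.add_computed_column(
    embedding=sentence_transformer(tbl.sentence, model_id='all-mpnet-base-v2')
)
```

Embedding UDFs like this one are also commonly used when defining similarity indexes over a column.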
udf sentence_transformer_list()
List variant of `sentence_transformer()` that embeds a list of sentences; the same model requirements apply.
udf speech2text_for_conditional_generation()
model_id should be a reference to a
pretrained Speech2Text model.
Requirements:
pip install torch torchaudio sentencepiece transformers
Parameters:
- `audio` (Audio): The audio clip to transcribe or translate.
- `model_id` (String): The pretrained model to use for the transcription or translation.
- `language` (String | None): If using a multilingual translation model, the language code to translate to. If not provided, the model's default language is used. If the model is not a translation model, is not multilingual, or does not support the specified language, an error is raised.

Returns:
String: The transcribed or translated text.
Example: add a computed column that applies the model `facebook/s2t-small-librispeech-asr` to an existing Pixeltable column `audio` of the table `tbl`:
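A sketch (import path assumed, as above):

```python
from pixeltable.functions.huggingface import speech2text_for_conditional_generation

tbl.add_computed_column(
    transcription=speech2text_for_conditional_generation(
        tbl.audio, model_id='facebook/s2t-small-librispeech-asr'
    )
)
```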
Example: add a computed column that applies the model `facebook/s2t-medium-mustc-multilingual-st` to an existing Pixeltable column `audio` of the table `tbl`, translating the audio to French:
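A sketch (import path assumed, as above):

```python
from pixeltable.functions.huggingface import speech2text_for_conditional_generation

tbl.add_computed_column(
    translation_fr=speech2text_for_conditional_generation(
        tbl.audio, model_id='facebook/s2t-medium-mustc-multilingual-st', language='fr'
    )
)
```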
udf summarization()
model_id should be a reference to a pretrained
summarization model such as BART, T5, or Pegasus.
Requirements:
pip install torch transformers
Parameters:
- `text` (String): The text to summarize.
- `model_id` (String): The pretrained model to use for summarization.
- `model_kwargs` (Json | None): Additional keyword arguments to pass to the model's `generate` method, such as `max_length`.

Returns:
String: The generated summary text.
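A sketch (import path assumed, as above; the model id is an arbitrary example of a summarization checkpoint):

```python
from pixeltable.functions.huggingface import summarization

tbl.add_computed_column(
    summary=summarization(
        tbl.text,
        model_id='facebook/bart-large-cnn',
        model_kwargs={'max_length': 130},
    )
)
```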
udf text_classification()
model_id should be a reference to a pretrained
text classification model such as BERT, RoBERTa, or DistilBERT.
Requirements:
pip install torch transformers
Parameters:
- `text` (String): The text to classify.
- `model_id` (String): The pretrained model to use for classification.
- `top_k` (Int): The number of top predictions to return.

Returns:
Json: A dictionary containing classification results with scores, labels, and label text.
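A sketch (import path assumed, as above; the model id is an arbitrary example of a sentiment classifier):

```python
from pixeltable.functions.huggingface import text_classification

tbl.add_computed_column(
    sentiment=text_classification(
        tbl.text,
        model_id='distilbert-base-uncased-finetuned-sst-2-english',
        top_k=2,
    )
)
```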
udf text_generation()
model_id should be a reference to a pretrained
text generation model.
Requirements:
pip install torch transformers
Parameters:
- `text` (String): The input text to continue/complete.
- `model_id` (String): The pretrained model to use for text generation.
- `model_kwargs` (Json | None): Additional keyword arguments to pass to the model's `generate` method, such as `max_length` or `temperature`. See the Hugging Face text_generation documentation for details.

Returns:
String: The generated text completion.
Example: generate text completions with the `Qwen/Qwen3-0.6B` model:
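A sketch (import path assumed, as above; `tbl` is assumed to have a `prompt` column holding the input text):

```python
from pixeltable.functions.huggingface import text_generation

tbl.add_computed_column(
    completion=text_generation(
        tbl.prompt,
        model_id='Qwen/Qwen3-0.6B',
        model_kwargs={'max_length': 200, 'temperature': 0.7},
    )
)
```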
udf text_to_image()
model_id should be a reference to a
pretrained text-to-image model such as Stable Diffusion or FLUX.
Requirements:
pip install torch transformers diffusers accelerate
Parameters:
- `prompt` (String): The text prompt describing the desired image.
- `model_id` (String): The pretrained text-to-image model to use.
- `height` (Int): Height of the generated image in pixels.
- `width` (Int): Width of the generated image in pixels.
- `seed` (Int | None): Optional random seed for reproducibility.
- `model_kwargs` (Json | None): Additional keyword arguments to pass to the model, such as `num_inference_steps`, `guidance_scale`, or `negative_prompt`.

Returns:
Image: The generated image.
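A sketch (import path assumed, as above; the model id is an arbitrary example of a diffusers text-to-image model, and `tbl.prompt` is an assumed column):

```python
from pixeltable.functions.huggingface import text_to_image

tbl.add_computed_column(
    generated=text_to_image(
        tbl.prompt,
        model_id='stabilityai/stable-diffusion-2-1',
        height=512,
        width=512,
        seed=42,
    )
)
```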
udf text_to_speech()
model_id should be a reference to a
pretrained text-to-speech model.
Requirements:
pip install torch transformers datasets soundfile
Parameters:
- `text` (String): The text to convert to speech.
- `model_id` (String): The pretrained TTS model to use.
- `speaker_id` (Int | None): Speaker ID for multi-speaker models.
- `vocoder` (String | None): Optional vocoder model for higher quality audio.

Returns:
Audio: The generated audio file.
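A sketch (import path assumed, as above; the model id is an arbitrary example of a TTS checkpoint):

```python
from pixeltable.functions.huggingface import text_to_speech

tbl.add_computed_column(
    speech=text_to_speech(tbl.text, model_id='microsoft/speecht5_tts')
)
```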
udf token_classification()
model_id should be a reference to a pretrained token classification model for NER.
Requirements:
pip install torch transformers
Parameters:
- `text` (String): The text to analyze for named entities.
- `model_id` (String): The pretrained model to use.
- `aggregation_strategy` (String): Method used to aggregate tokens into entities.

Returns:
Json: A list of dictionaries containing entity information (text, label, confidence, start, end).
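A sketch (import path assumed, as above; the model id is an arbitrary example of an NER checkpoint):

```python
from pixeltable.functions.huggingface import token_classification

tbl.add_computed_column(
    entities=token_classification(
        tbl.text, model_id='dslim/bert-base-NER', aggregation_strategy='simple'
    )
)
```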
udf translation()
model_id should be a reference to a pretrained
translation model such as MarianMT or T5.
Requirements:
pip install torch transformers sentencepiece
Parameters:
- `text` (String): The text to translate.
- `model_id` (String): The pretrained translation model to use.
- `src_lang` (String | None): Source language code (optional; can be inferred from the model).
- `target_lang` (String | None): Target language code (optional; can be inferred from the model).

Returns:
String: The translated text.
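A sketch (import path assumed, as above; the model id is an arbitrary example of a MarianMT checkpoint):

```python
from pixeltable.functions.huggingface import translation

tbl.add_computed_column(
    text_fr=translation(
        tbl.text, model_id='Helsinki-NLP/opus-mt-en-fr', src_lang='en', target_lang='fr'
    )
)
```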
udf vit_for_image_classification()
model_id should be a reference to a pretrained ViT Model.
Note: Be sure the model is a ViT model that is trained for image classification (that is, a model designed for use with the ViTForImageClassification class), such as google/vit-base-patch16-224. General feature-extraction models such as google/vit-base-patch16-224-in21k will not produce the desired results.
Requirements:
pip install torch transformers
Parameters:
- `image` (Image): The image to classify.
- `model_id` (String): The pretrained model to use for the classification.
- `top_k` (Int): The number of classes to return.

Returns:
Json: A dictionary containing the output of the image classification model, in the following format:
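The dictionary is likely of roughly this shape (an illustrative sketch; exact keys and values may differ by version):

```python
{
    'scores': [0.325, 0.281, 0.044],   # class probabilities
    'labels': [340, 386, 101],         # numeric ImageNet class ids
    'label_text': ['zebra', 'African elephant, Loxodonta africana', 'tusker'],
}
```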
Example: add a computed column that applies the model `google/vit-base-patch16-224` to an existing Pixeltable column `image` of the table `tbl`, returning the 10 most likely classes for each image:
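A sketch (import path assumed, as above):

```python
from pixeltable.functions.huggingface import vit_for_image_classification

tbl.add_computed_column(
    classification=vit_for_image_classification(
        tbl.image, model_id='google/vit-base-patch16-224', top_k=10
    )
)
```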