module pixeltable.functions.vllm

Pixeltable UDFs for vLLM. Integrates vLLM's high-throughput inference engine for large language models, supporting chat completions and text generation with HuggingFace models.

udf chat_completions()

Signature

@pxt.udf
chat_completions(
    messages: pxt.Json[(Json, ...)],
    *,
    model: pxt.String,
    engine_args: pxt.Json | None = None,
    sampling_params: pxt.Json | None = None
) -> VllmRequestOutput

Generate a chat completion from a list of messages using vLLM. For additional details, see the vLLM documentation. Requirements:

pip install vllm

Parameters:

messages (pxt.Json[(Json, ...)]): A list of messages to generate a response for. Each message should be a dict with role and content keys, following the OpenAI chat format.
model (pxt.String): The HuggingFace model identifier (e.g., 'Qwen/Qwen2.5-0.5B-Instruct').
engine_args (pxt.Json | None): Additional keyword args for the vLLM LLM constructor, such as dtype, max_model_len, gpu_memory_utilization, tensor_parallel_size. For details, see the vLLM engine args documentation.
sampling_params (pxt.Json | None): Keyword args for vLLM SamplingParams, such as max_tokens, temperature, top_p, top_k. For details, see the vLLM sampling params documentation.

Returns:

VllmRequestOutput: A dict containing the vLLM RequestOutput in its native format.

Examples: Add a computed column that generates chat completions:

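from pixeltable.functions.vllm import chat_completions

# Assumes t is an existing table with a JSON column named messages.
t.add_computed_column(
    result=chat_completions(
        t.messages, model='Qwen/Qwen2.5-0.5B-Instruct'
    )
)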

With custom sampling parameters:

t.add_computed_column(
    result=chat_completions(
        t.messages,
        model='Qwen/Qwen2.5-0.5B-Instruct',
        sampling_params={'max_tokens': 256, 'temperature': 0.7},
    )
)

udf generate()

Signature

@pxt.udf
generate(
    prompt: pxt.String,
    *,
    model: pxt.String,
    engine_args: pxt.Json | None = None,
    sampling_params: pxt.Json | None = None
) -> VllmRequestOutput

Generate a text completion for a given prompt using vLLM's high-throughput inference engine for efficient local LLM serving. Models are loaded from HuggingFace and cached for reuse across calls. For additional details, see the vLLM documentation. Requirements:

pip install vllm

Parameters:

prompt (pxt.String): The text prompt to generate a completion for.
model (pxt.String): The HuggingFace model identifier (e.g., 'Qwen/Qwen2.5-0.5B-Instruct').
engine_args (pxt.Json | None): Additional keyword args for the vLLM LLM constructor, such as dtype, max_model_len, gpu_memory_utilization, tensor_parallel_size. For details, see the vLLM engine args documentation.
sampling_params (pxt.Json | None): Keyword args for vLLM SamplingParams, such as max_tokens, temperature, top_p, top_k. For details, see the vLLM sampling params documentation.

Returns:

VllmRequestOutput: A dict containing the vLLM RequestOutput in its native format.

Examples: Add a computed column that generates text completions:

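from pixeltable.functions.vllm import generate

# Assumes t is an existing table with a string column named prompt.
t.add_computed_column(
    result=generate(t.prompt, model='Qwen/Qwen2.5-0.5B-Instruct')
)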
