module pixeltable.functions.document

Pixeltable UDFs for DocumentType.

iterator document_splitter()

Signature

@pxt.iterator
document_splitter(
    document: pxt.Document,
    separators: pxt.String,
    *,
    elements: pxt.Json | None = None,
    limit: pxt.Int | None = None,
    overlap: pxt.Int | None = None,
    metadata: pxt.String = '',
    skip_tags: pxt.Json | None = None,
    spacy_model: pxt.String = 'en_core_web_sm',
    tiktoken_encoding: pxt.String | None = 'cl100k_base',
    tiktoken_target_model: pxt.String | None = None,
    image_dpi: pxt.Int = 300,
    image_format: pxt.String = 'png'
)

Iterator over chunks of a document. The document is chunked according to the specified separators. Chunked text is cleaned with ftfy.fix_text to fix common problems with Unicode sequences.

Outputs:

One row per chunk, with the following columns, depending on the specified elements and metadata:

text (pxt.String): The text of the chunk. Present if 'text' is specified in elements.
image (pxt.Image): The image extracted from the chunk. Present if 'image' is specified in elements.
title (pxt.String | None): The document title. Present if 'title' is specified in metadata.
heading (pxt.Json | None): The heading hierarchy at the start of the chunk (HTML and Markdown only). Present if 'heading' is specified in metadata.
sourceline (pxt.Int | None): The source line number of the start of the chunk (HTML only). Present if 'sourceline' is specified in metadata.
page (pxt.Int | None): The page number of the chunk (PDF only). Present if 'page' is specified in metadata.
bounding_box (pxt.Json | None): The bounding box of the chunk on the page, as an {x1, y1, x2, y2} dictionary (PDF only). Present if 'bounding_box' is specified in metadata.
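As a rough illustration of how the output columns follow from the elements and metadata arguments, the mapping above can be sketched in plain Python. The helper name expected_columns and the two lookup tables are hypothetical and exist only for this example; they restate the documented column list and are not part of Pixeltable or a description of its internals.

```python
# Illustrative sketch only: restates the documented column list.
# expected_columns, ELEMENT_COLUMNS and METADATA_COLUMNS are hypothetical names,
# not part of the Pixeltable API.
ELEMENT_COLUMNS = {
    'text': 'pxt.String',
    'image': 'pxt.Image',
}
METADATA_COLUMNS = {
    'title': 'pxt.String | None',
    'heading': 'pxt.Json | None',
    'sourceline': 'pxt.Int | None',
    'page': 'pxt.Int | None',
    'bounding_box': 'pxt.Json | None',
}


def expected_columns(elements=None, metadata=''):
    """Return a {column name: documented type} dict for the given arguments."""
    # elements defaults to ['text'] when not specified, as documented.
    chosen_elements = ['text'] if elements is None else list(elements)
    # metadata is a comma-separated string, e.g. 'title,heading'.
    chosen_metadata = [m.strip() for m in metadata.split(',') if m.strip()]

    columns = {}
    for name in chosen_elements:
        columns[name] = ELEMENT_COLUMNS[name]
    for name in chosen_metadata:
        columns[name] = METADATA_COLUMNS[name]
    return columns


print(expected_columns(metadata='title,heading'))
```

With the defaults, only the text column is present; each name listed in metadata adds one nullable column.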

Parameters:

separators (pxt.String): separators to use to chunk the document. Options are: 'heading', 'paragraph', 'sentence', 'token_limit', 'char_limit', 'page'. This may be a comma-separated string, e.g., 'heading,token_limit'.
elements (pxt.Json | None): list of elements to extract from the document. Options are: 'text', 'image'. Defaults to ['text'] if not specified. The 'image' element is only supported for the 'page' separator on PDF documents.
limit (pxt.Int | None): the maximum number of tokens or characters in each chunk, if 'token_limit' or 'char_limit' is specified.
metadata (pxt.String): additional metadata fields to include in the output. Options are: 'title', 'heading' (HTML and Markdown), 'sourceline' (HTML), 'page' (PDF), 'bounding_box' (PDF). The input may be a comma-separated string, e.g., 'title,heading,sourceline'.
skip_tags (pxt.Json | None): list of HTML tags to skip when processing HTML documents.
spacy_model (pxt.String): Name of the spaCy model to use for sentence segmentation. This parameter is ignored unless the 'sentence' separator is specified.
tiktoken_encoding (pxt.String | None): Name of the tiktoken encoding to use when counting tokens. This parameter is ignored unless the 'token_limit' separator is specified.
tiktoken_target_model (pxt.String | None): Name of the target model to use when counting tokens with tiktoken. If specified, this parameter overrides tiktoken_encoding. This parameter is ignored unless the 'token_limit' separator is specified.
image_dpi (pxt.Int): DPI to use when extracting images from PDFs. Defaults to 300.
image_format (pxt.String): format to use when extracting images from PDFs. Defaults to 'png'.
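The documented options for separators and elements can be checked before creating a view. The following is a minimal sketch in plain Python, assuming only the constraints stated above; check_splitter_args is a hypothetical helper, not a Pixeltable function, and the library's own validation may differ. In particular, the sketch only checks that 'page' is among the separators when 'image' is requested; whether 'page' may be combined with other separators in that case is not stated here.

```python
# Illustrative sketch only: check_splitter_args is a hypothetical helper,
# not part of the Pixeltable API.
VALID_SEPARATORS = {'heading', 'paragraph', 'sentence', 'token_limit', 'char_limit', 'page'}
VALID_ELEMENTS = {'text', 'image'}


def check_splitter_args(separators, elements=None):
    """Parse and validate separators/elements against the documented options."""
    # separators may be a comma-separated string, e.g. 'heading,token_limit'.
    seps = [s.strip() for s in separators.split(',') if s.strip()]
    unknown = [s for s in seps if s not in VALID_SEPARATORS]
    if unknown:
        raise ValueError(f'unknown separators: {unknown}')

    # elements defaults to ['text'] when not specified.
    elems = ['text'] if elements is None else list(elements)
    invalid = [e for e in elems if e not in VALID_ELEMENTS]
    if invalid:
        raise ValueError(f'unknown elements: {invalid}')

    # The 'image' element is documented as supported only for the 'page' separator.
    if 'image' in elems and 'page' not in seps:
        raise ValueError("the 'image' element requires the 'page' separator")
    return seps, elems


print(check_splitter_args('heading,token_limit'))
```

Running a check like this up front surfaces a misspelled option before any document is processed.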

Examples:

All these examples assume an existing table tbl with a column doc of type pxt.Document.

Create a view that splits all documents into chunks of up to 300 tokens:

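import pixeltable as pxt
from pixeltable.functions.document import document_splitter

# tbl is an existing table with a column doc of type pxt.Document
pxt.create_view(
    'chunks',
    tbl,
    iterator=document_splitter(
        tbl.doc, separators='token_limit', limit=300
    ),
)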

Create a view that splits all documents along sentence boundaries, including title and heading metadata:

pxt.create_view(
    'sentence_chunks',
    tbl,
    iterator=document_splitter(
        tbl.doc, separators='sentence', metadata='title,heading'
    ),
)
