
module  pixeltable.functions.document

Pixeltable UDFs for DocumentType.

iterator  document_splitter()

Signature
document_splitter(
    document: Any,
    separators: str,
    *,
    elements: list[typing.Literal['text', 'image']] | None = None,
    limit: int | None = None,
    overlap: int | None = None,
    metadata: str = '',
    skip_tags: list[str] | None = None,
    tiktoken_encoding: str | None = 'cl100k_base',
    tiktoken_target_model: str | None = None,
    image_dpi: int = 300,
    image_format: str = 'png'
) 
Iterator over chunks of a document. The document is chunked according to the specified separators. The iterator yields a text field containing the text of each chunk, along with any additional metadata fields requested via the metadata parameter, as explained below. Chunked text is cleaned with ftfy.fix_text to fix common problems with Unicode sequences.

Parameters:
  • separators (str): separators to use to chunk the document. Options are: 'heading', 'paragraph', 'sentence', 'token_limit', 'char_limit', 'page'. This may be a comma-separated string, e.g., 'heading,token_limit'.
  • elements (list[typing.Literal['text', 'image']] | None): list of elements to extract from the document. Options are: 'text', 'image'. Defaults to ['text'] if not specified. The 'image' element is only supported for the 'page' separator on PDF documents.
  • limit (int | None): the maximum number of tokens or characters in each chunk, if 'token_limit' or 'char_limit' is specified.
  • overlap (int | None): the number of tokens or characters of overlap between consecutive chunks, if 'token_limit' or 'char_limit' is specified.
  • metadata (str, default: ''): additional metadata fields to include in the output. Options are: 'title', 'heading' (HTML and Markdown), 'sourceline' (HTML), 'page' (PDF), 'bounding_box' (PDF). The input may be a comma-separated string, e.g., 'title,heading,sourceline'.
  • skip_tags (list[str] | None): list of HTML tags whose content should be skipped during text extraction.
  • tiktoken_encoding (str | None, default: 'cl100k_base'): the tiktoken encoding to use for tokenization, when 'token_limit' is specified.
  • tiktoken_target_model (str | None): if specified, the tiktoken encoding for this model is used instead of tiktoken_encoding.
  • image_dpi (int, default: 300): DPI to use when extracting images from PDFs.
  • image_format (str, default: 'png'): format to use when extracting images from PDFs.
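For intuition, the interaction of limit and overlap under 'char_limit' can be sketched in plain Python. This is an illustrative sketch only, not Pixeltable's implementation; 'token_limit' behaves analogously over tokens rather than characters:

```python
def char_limit_chunks(text: str, limit: int, overlap: int = 0) -> list[str]:
    """Illustrative sketch: split text into chunks of at most `limit` characters,
    with consecutive chunks sharing `overlap` characters."""
    if overlap >= limit:
        raise ValueError('overlap must be smaller than limit')
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + limit])
        if start + limit >= len(text):
            break
        # advance by limit - overlap, so the next chunk repeats the tail of this one
        start += limit - overlap
    return chunks

print(char_limit_chunks('abcdefghij', limit=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```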
Examples:
All these examples assume an existing table tbl with a column doc of type pxt.Document.

Create a view that splits all documents into chunks of up to 300 tokens:
pxt.create_view(
    'chunks',
    tbl,
    iterator=document_splitter(
        tbl.doc, separators='token_limit', limit=300
    ),
)
Create a view that splits all documents along sentence boundaries, including title and heading metadata:
pxt.create_view(
    'sentence_chunks',
    tbl,
    iterator=document_splitter(
        tbl.doc, separators='sentence', metadata='title,heading'
    ),
)
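Each row yielded by the iterator can be pictured as a record with a text field plus any requested metadata fields. The following is a conceptual sketch of that output shape (the field names mirror the metadata options above; the values here are hypothetical, whereas the real iterator derives them from the document):

```python
def emit_chunks(chunks, title, metadata=''):
    """Illustrative sketch of the iterator's output shape: one dict per chunk,
    with 'text' always present and metadata fields added only when requested."""
    fields = [f for f in metadata.split(',') if f]
    for chunk_text, heading in chunks:
        row = {'text': chunk_text}
        if 'title' in fields:
            row['title'] = title
        if 'heading' in fields:
            row['heading'] = heading
        yield row

rows = list(emit_chunks([('Intro text.', 'Introduction')], 'My Doc', metadata='title,heading'))
print(rows)
# → [{'text': 'Intro text.', 'title': 'My Doc', 'heading': 'Introduction'}]
```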