module pixeltable.functions.document
Pixeltable UDFs for DocumentType.
iterator document_splitter()
Signature
Iterator over chunks of a document. The document is chunked according to the specified separators.
Chunked text will be cleaned with ftfy.fix_text to fix up common problems with unicode sequences.
Outputs:
One row per chunk, with the following columns, depending on the specified elements and metadata:
- text (pxt.String): The text of the chunk. Present if 'text' is specified in elements.
- image (pxt.Image): The image extracted from the chunk. Present if 'image' is specified in elements.
- title (pxt.String | None): The document title. Present if 'title' is specified in metadata.
- heading (pxt.Json | None): The heading hierarchy at the start of the chunk (HTML and Markdown only). Present if 'heading' is specified in metadata.
- sourceline (pxt.Int | None): The source line number of the start of the chunk (HTML only). Present if 'sourceline' is specified in metadata.
- page (pxt.Int | None): The page number of the chunk (PDF only). Present if 'page' is specified in metadata.
- bounding_box (pxt.Json | None): The bounding box of the chunk on the page, as an {x1, y1, x2, y2} dictionary (PDF only). Present if 'bounding_box' is specified in metadata.
Parameters:
- separators (pxt.String): Separators to use to chunk the document. Options are: 'heading', 'paragraph', 'sentence', 'token_limit', 'char_limit', 'page'. This may be a comma-separated string, e.g., 'heading,token_limit'.
- elements (pxt.Json | None): List of elements to extract from the document. Options are: 'text', 'image'. Defaults to ['text'] if not specified. The 'image' element is only supported for the 'page' separator on PDF documents.
- limit (pxt.Int | None): The maximum number of tokens or characters in each chunk, if 'token_limit' or 'char_limit' is specified.
- metadata (pxt.String): Additional metadata fields to include in the output. Options are: 'title', 'heading' (HTML and Markdown), 'sourceline' (HTML), 'page' (PDF), 'bounding_box' (PDF). The input may be a comma-separated string, e.g., 'title,heading,sourceline'.
- skip_tags (pxt.Json | None): List of HTML tags to skip when processing HTML documents.
- spacy_model (pxt.String): Name of the spaCy model to use for sentence segmentation. This parameter is ignored unless the 'sentence' separator is specified.
- tiktoken_encoding (pxt.String | None): Name of the tiktoken encoding to use when counting tokens. This parameter is ignored unless the 'token_limit' separator is specified.
- tiktoken_target_model (pxt.String | None): Name of the target model to use when counting tokens with tiktoken. If specified, this parameter overrides tiktoken_encoding. This parameter is ignored unless the 'token_limit' separator is specified.
- image_dpi (pxt.Int): DPI to use when extracting images from PDFs. Defaults to 300.
- image_format (pxt.String): Format to use when extracting images from PDFs. Defaults to 'png'.
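Because separators and metadata are passed as comma-separated strings, a typo in an option name is easy to miss. The sketch below shows one way to check such a string against the documented options before calling document_splitter. The helper parse_options and the two constant sets are hypothetical and are not part of Pixeltable; they only restate the options listed here.

```python
# Hypothetical helper, not part of Pixeltable: validates comma-separated
# option strings against the options documented for document_splitter.
VALID_SEPARATORS = {'heading', 'paragraph', 'sentence', 'token_limit', 'char_limit', 'page'}
VALID_METADATA = {'title', 'heading', 'sourceline', 'page', 'bounding_box'}


def parse_options(value: str, allowed: set[str]) -> list[str]:
    """Split a comma-separated option string and reject unknown names."""
    # Tolerate whitespace around commas and ignore empty entries.
    names = [name.strip() for name in value.split(',') if name.strip()]
    unknown = [name for name in names if name not in allowed]
    if unknown:
        raise ValueError(f'unknown option(s) {unknown}; expected a subset of {sorted(allowed)}')
    return names


separators = parse_options('heading,token_limit', VALID_SEPARATORS)
metadata = parse_options('title,heading,sourceline', VALID_METADATA)
```

The validated lists can be joined back with ','.join(...) when building the arguments for document_splitter.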
Assume an existing table tbl with a column doc of type pxt.Document. Create a view that splits all documents into chunks of up to 300 tokens:
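A minimal sketch of such a view follows. It assumes that document_splitter takes the document column as its first argument and is passed to pxt.create_view through the iterator parameter; the view name 'chunks', the wrapper function create_chunk_view, and the SPLITTER_ARGS constant are illustrative and not part of Pixeltable. Check the call against the signature for your installed Pixeltable version.

```python
# Token-based chunking with at most 300 tokens per chunk (parameters as documented above).
SPLITTER_ARGS = {'separators': 'token_limit', 'limit': 300}


def create_chunk_view(tbl, view_name='chunks'):
    """Create a view with one row per chunk of tbl.doc (wrapper and view name are illustrative)."""
    # Imported inside the function so this sketch can be loaded without Pixeltable installed.
    import pixeltable as pxt
    from pixeltable.functions.document import document_splitter

    # Assumed call shape: document column first, then the splitter options.
    return pxt.create_view(
        view_name,
        tbl,
        iterator=document_splitter(tbl.doc, **SPLITTER_ARGS),
    )
```

Calling create_chunk_view(tbl) would then return the new view, whose text column holds the chunk text because elements defaults to ['text'].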