separators. The iterator yields a text field containing the text of the chunk, and it may also include additional metadata fields if specified in the metadata parameter, as explained below.
Chunked text will be cleaned with ftfy.fix_text to fix up common problems with Unicode sequences.
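The shape of the yielded rows can be illustrated with a minimal sketch. The function iter_chunks below is a hypothetical stand-in, not part of Pixeltable: it splits on a fixed character count only, skips the ftfy cleanup step, and attaches a title field when one is supplied, to show how a text field and optional metadata fields might sit side by side in each row.

```python
from typing import Any, Dict, Iterator, Optional


def iter_chunks(text: str, limit: int, title: Optional[str] = None) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in: yield rows with a 'text' field and optional metadata."""
    if limit <= 0:
        raise ValueError('limit must be positive')
    for start in range(0, len(text), limit):
        # Every row carries the chunk text under the 'text' key.
        row: Dict[str, Any] = {'text': text[start:start + limit]}
        # Metadata fields appear only when they were requested.
        if title is not None:
            row['title'] = title
        yield row


rows = list(iter_chunks('abcdefghij', limit=4, title='Demo'))
for row in rows:
    print(row)
```

Each row is a plain dictionary, so a consumer can read row['text'] unconditionally and treat any metadata keys as optional.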
Args:
    separators: separators to use to chunk the document. Options are: 'heading', 'paragraph', 'sentence', 'token_limit', 'char_limit', 'page'. This may be a comma-separated string, e.g., 'heading,token_limit'.
    limit: the maximum number of tokens or characters in each chunk, if 'token_limit' or 'char_limit' is specified.
    metadata: additional metadata fields to include in the output. Options are: 'title', 'heading' (HTML and Markdown), 'sourceline' (HTML), 'page' (PDF), 'bounding_box' (PDF). The input may be a comma-separated string, e.g., 'title,heading,sourceline'.
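As an illustration of the comma-separated option format, here is a minimal sketch of how such a string could be parsed and checked. The helper parse_separators is hypothetical and not the library's implementation. The set of valid names is taken from the list above, while raising an error when limit is missing is an assumption about reasonable behavior.

```python
from typing import List, Optional

# Option names as listed in the Args description.
VALID_SEPARATORS = {'heading', 'paragraph', 'sentence', 'token_limit', 'char_limit', 'page'}


def parse_separators(separators: str, limit: Optional[int] = None) -> List[str]:
    """Hypothetical helper: split a comma-separated option string and validate it."""
    names = [s.strip() for s in separators.split(',') if s.strip()]
    unknown = [s for s in names if s not in VALID_SEPARATORS]
    if unknown:
        raise ValueError(f'unknown separators: {unknown}')
    # Assumption: a limit-based separator is only meaningful with a limit.
    if ('token_limit' in names or 'char_limit' in names) and limit is None:
        raise ValueError("limit is required with 'token_limit' or 'char_limit'")
    return names


parsed = parse_separators('heading,token_limit', limit=300)
print(parsed)
```

The same pattern would apply to the metadata option string, with its own set of valid names.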
Methods
close()
Close the iterator and release all resources
Signature:
create()
Signature:
input_schema()
Provide the Pixeltable types of the __init__() parameters.
The keys must match the names of the __init__() parameters. This is equivalent to the parameters_types parameter of the @function decorator.
Signature:
output_schema()
Specify the dictionary returned by __next__() and a list of unstored column names.
Signature:
- tuple[dict[str, pixeltable.type_system.ColumnType], list[str]]: a dictionary that is turned into a list of columns in the output table, and a list of unstored column names