# document

> <a href="https://github.com/pixeltable/pixeltable/blob/main/pixeltable/functions/document.py#L0" id="viewSource" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/View%20Source%20on%20Github-blue?logo=github&labelColor=gray" alt="View Source on GitHub" style={{ display: 'inline', margin: '0px' }} noZoom /></a>

# <span style={{ 'color': 'gray' }}>module</span>  pixeltable.functions.document

Pixeltable UDFs for `DocumentType`.

## <span style={{ 'color': 'gray' }}>iterator</span>  document\_splitter()

```python Signature theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
@pxt.iterator
document_splitter(
    document: pxt.Document,
    separators: pxt.String,
    *,
    elements: pxt.Json[(String, ...)] | None = None,
    limit: pxt.Int | None = None,
    overlap: pxt.Int | None = None,
    metadata: pxt.String = '',
    skip_tags: pxt.Json[(String, ...)] | None = None,
    spacy_model: pxt.String = 'en_core_web_sm',
    tiktoken_encoding: pxt.String | None = 'cl100k_base',
    tiktoken_target_model: pxt.String | None = None,
    image_dpi: pxt.Int = 300,
    image_format: pxt.String = 'png'
)
```

Iterator over chunks of a document. The document is chunked according to the specified `separators`.

Chunk text is cleaned with `ftfy.fix_text`, which repairs common Unicode problems such as mis-decoded (mojibake) sequences.

**Outputs:**

One row per chunk, with the following columns, depending on the specified `elements` and `metadata`:

* `text` (`pxt.String`): The text of the chunk. Present if `'text'` is specified in `elements`.
* `image` (`pxt.Image`): The image extracted from the chunk. Present if `'image'` is specified in `elements`.
* `title` (`pxt.String | None`): The document title. Present if `'title'` is specified in `metadata`.
* `heading` (`pxt.Json | None`): The heading hierarchy at the start of the chunk (HTML and Markdown only).
  Present if `'heading'` is specified in `metadata`.
* `sourceline` (`pxt.Int | None`): The source line number of the start of the chunk (HTML only).
  Present if `'sourceline'` is specified in `metadata`.
* `page` (`pxt.Int | None`): The page number of the chunk (PDF only). Present if `'page'` is specified in
  `metadata`.
* `bounding_box` (`pxt.Json | None`): The bounding box of the chunk on the page, as an `{x1, y1, x2, y2}`
  dictionary (PDF only). Present if `'bounding_box'` is specified in `metadata`.
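The rule for which columns appear can be summarized in a few lines of plain Python. This is only a sketch of the rule stated above: `expected_columns` is a hypothetical helper, not a Pixeltable function, and the column order it returns is an assumption made for illustration.

```python
# Hypothetical helper, not part of Pixeltable: it mirrors the documented rule
# that output columns follow the requested `elements` and `metadata`.
def expected_columns(elements=None, metadata=''):
    # `elements` defaults to ['text'] when not specified.
    chosen = ['text'] if elements is None else list(elements)
    # `metadata` is a comma-separated string; drop empty entries.
    fields = [name.strip() for name in metadata.split(',') if name.strip()]
    # The column order here is an assumption made for illustration.
    return chosen + fields


print(expected_columns())                          # ['text']
print(expected_columns(metadata='title,heading'))  # ['text', 'title', 'heading']
print(expected_columns(elements=['image'], metadata='page,bounding_box'))
# ['image', 'page', 'bounding_box']
```

Metadata columns that do not apply to a given document format (for example `page` on an HTML document) are typed as nullable in the list above, so downstream queries should be prepared for `None` values.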

**Parameters:**

* **`separators`** (`pxt.String`): Separators to use to chunk the document. Options are:
  `'heading'`, `'paragraph'`, `'sentence'`, `'token_limit'`, `'char_limit'`, `'page'`.
  This may be a comma-separated string, e.g., `'heading,token_limit'`.
* **`elements`** (`pxt.Json[(String, ...)] | None`): List of elements to extract from the document. Options are:
  `'text'`, `'image'`. Defaults to `['text']` if not specified. The `'image'` element is only supported
  for the `'page'` separator on PDF documents.
* **`limit`** (`pxt.Int | None`): The maximum number of tokens or characters in each chunk, if `'token_limit'`
  or `'char_limit'` is specified.
* **`metadata`** (`pxt.String`): Additional metadata fields to include in the output. Options are:
  `'title'`, `'heading'` (HTML and Markdown), `'sourceline'` (HTML), `'page'` (PDF), `'bounding_box'`
  (PDF). The input may be a comma-separated string, e.g., `'title,heading,sourceline'`.
* **`skip_tags`** (`pxt.Json[(String, ...)] | None`): List of HTML tags to skip when processing HTML documents.
* **`spacy_model`** (`pxt.String`): Name of the spaCy model to use for sentence segmentation. This parameter is ignored unless
  the `'sentence'` separator is specified.
* **`tiktoken_encoding`** (`pxt.String | None`): Name of the tiktoken encoding to use when counting tokens. This parameter is ignored
  unless the `'token_limit'` separator is specified.
* **`tiktoken_target_model`** (`pxt.String | None`): Name of the target model to use when counting tokens with tiktoken. If specified,
  this parameter overrides `tiktoken_encoding`. This parameter is ignored unless the `'token_limit'`
  separator is specified.
* **`image_dpi`** (`pxt.Int`): DPI to use when extracting images from PDFs. Defaults to `300`.
* **`image_format`** (`pxt.String`): Format to use when extracting images from PDFs. Defaults to `'png'`.
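Some of the constraints listed above can be checked before a view is created. The sketch below is a hypothetical pre-flight check, not part of Pixeltable: `check_splitter_args` and the messages it returns are invented for illustration, and it reads "only supported for the `'page'` separator" as "`'page'` must be among the separators", which is an assumption.

```python
VALID_SEPARATORS = {'heading', 'paragraph', 'sentence', 'token_limit', 'char_limit', 'page'}
VALID_ELEMENTS = {'text', 'image'}


def check_splitter_args(separators, elements=None, limit=None):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    # `separators` is a comma-separated string such as 'heading,token_limit'.
    seps = [s.strip() for s in separators.split(',') if s.strip()]
    # `elements` defaults to ['text'] when not specified.
    elems = ['text'] if elements is None else list(elements)

    for sep in seps:
        if sep not in VALID_SEPARATORS:
            problems.append(f'unknown separator: {sep!r}')
    for elem in elems:
        if elem not in VALID_ELEMENTS:
            problems.append(f'unknown element: {elem!r}')
    # Assumption: the 'image' element needs 'page' among the separators.
    if 'image' in elems and 'page' not in seps:
        problems.append("'image' requires the 'page' separator")
    # `limit` is only documented for 'token_limit' and 'char_limit'.
    if limit is not None and not {'token_limit', 'char_limit'} & set(seps):
        problems.append("'limit' is only used with 'token_limit' or 'char_limit'")
    return problems


print(check_splitter_args('heading,token_limit', limit=300))  # []
print(check_splitter_args('sentence', elements=['image']))
# ["'image' requires the 'page' separator"]
```

Returning a list of messages rather than raising keeps the sketch easy to adapt; the actual validation performed by `document_splitter` may differ.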

**Examples:**

All these examples assume an existing table `tbl` with a column `doc` of type `pxt.Document`.

Create a view that splits all documents into chunks of up to 300 tokens:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
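import pixeltable as pxt
from pixeltable.functions.document import document_splitter

# One row per chunk; each chunk holds at most 300 tokens.
pxt.create_view(
    'chunks',
    tbl,
    iterator=document_splitter(
        tbl.doc, separators='token_limit', limit=300
    ),
)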
```

Create a view that splits all documents along sentence boundaries, including title and heading metadata:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
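import pixeltable as pxt
from pixeltable.functions.document import document_splitter

# One row per sentence, with `title` and `heading` columns added.
pxt.create_view(
    'sentence_chunks',
    tbl,
    iterator=document_splitter(
        tbl.doc, separators='sentence', metadata='title,heading'
    ),
)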
```
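
The `'image'` element and the PDF-only options can be combined in the same way. The following is a minimal sketch rather than an official example: it assumes the documents in `tbl.doc` are PDFs, and `create_page_image_view` and the view name `'page_images'` are hypothetical names chosen for illustration. Only parameters that appear in the signature above are used.

```python
def create_page_image_view(tbl, view_name='page_images'):
    """Create a view with one row per PDF page, holding that page's image."""
    # Imported inside the function so that defining this helper does not
    # itself require Pixeltable to be importable.
    import pixeltable as pxt
    from pixeltable.functions.document import document_splitter

    return pxt.create_view(
        view_name,
        tbl,
        iterator=document_splitter(
            tbl.doc,
            separators='page',     # 'image' is only supported with 'page'
            elements=['image'],    # emit an `image` column instead of `text`
            metadata='page',       # also emit the page number
            image_dpi=150,         # below the default of 300, for smaller images
            image_format='png',
        ),
    )
```

Calling `create_page_image_view(tbl)` would then be expected to produce `image` and `page` columns, following the output rules described under **Outputs**.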