Open in Kaggle  Open in Colab  Download Notebook
This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.
Identify and extract people, organizations, locations, dates, and other entities from text using LLMs.

Problem

You have unstructured text containing important information—names, companies, dates, locations—that you need to extract and structure for analysis, search, or integration with other systems.

Solution

What’s in this recipe:
  • Extract entities as structured JSON
  • Use OpenAI’s structured output for reliable parsing
  • Access extracted entities as queryable columns
You use structured output to get entities in a consistent JSON format. The entities are stored as JSON columns that you can query and filter.

Setup

%pip install -qU pixeltable openai
Note: you may need to restart the kernel to use updated packages.
import os
import getpass

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
import pixeltable as pxt
from pixeltable.functions.openai import chat_completions
import json
# Create a fresh directory
pxt.drop_dir('entities_demo', force=True)
pxt.create_dir('entities_demo')
Created directory `entities_demo`.
<pixeltable.catalog.dir.Dir at 0x306118050>

Define entity extraction schema

# Define the JSON schema for entity extraction
entity_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "entities",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "people": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Names of people mentioned"
                },
                "organizations": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Names of companies, institutions, or groups"
                },
                "locations": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Geographic locations (cities, countries, addresses)"
                },
                "dates": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Dates or time references"
                }
            },
            "required": ["people", "organizations", "locations", "dates"],
            "additionalProperties": False
        }
    }
}

Create extraction pipeline

# Create table for articles
articles = pxt.create_table(
    'entities_demo.articles',
    {'title': pxt.String, 'content': pxt.String}
)
Created table `articles`.
# Add entity extraction column
extraction_prompt = 'Extract all named entities from the following text:\n\n' + articles.content

articles.add_computed_column(
    extraction_response=chat_completions(
        messages=[{'role': 'user', 'content': extraction_prompt}],
        model='gpt-4o-mini',
        model_kwargs={'response_format': entity_schema}
    )
)
Added 0 column values with 0 errors.
No rows affected.
# Parse the entities JSON from the response content into a queryable JSON column
articles.add_computed_column(
    entities=articles.extraction_response.choices[0].message.content.apply(
        json.loads, col_type=pxt.Json
    )
)
Added 0 column values with 0 errors.
No rows affected.

Extract entities from text

# Insert sample articles
sample_articles = [
    {
        'title': 'Tech Acquisition',
        'content': 'Microsoft announced today that CEO Satya Nadella will lead the acquisition of a Seattle-based startup. The deal, expected to close in March 2024, is valued at $500 million.'
    },
    {
        'title': 'Sports Update',
        'content': 'LeBron James led the Los Angeles Lakers to victory against the Boston Celtics on Tuesday night at Staples Center. Coach Darvin Ham praised the team\'s performance.'
    },
    {
        'title': 'Research Breakthrough',
        'content': 'Dr. Sarah Chen at Stanford University published groundbreaking research on renewable energy. The study, funded by the National Science Foundation, was conducted in Palo Alto, California.'
    },
]

articles.insert(sample_articles)
Inserting rows into `articles`: 3 rows [00:00, 404.21 rows/s]
Inserted 3 rows with 0 errors.
3 rows inserted, 12 values computed.
# View extracted entities
articles.select(articles.title, articles.entities).collect()
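Because entities is a JSON column, you can select individual fields or combine them with row filters. A brief sketch (the field names come from entity_schema above, and the 'Tech Acquisition' filter simply reuses one of the sample titles):
# Select only the people mentioned in each article
articles.select(articles.title, articles.entities['people']).collect()

# Filter to a single article and inspect its organizations
articles.where(articles.title == 'Tech Acquisition').select(
    articles.entities['organizations']
).collect()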

Explanation

Structured output ensures reliable extraction: with OpenAI’s structured output (response_format with strict set to true), the model returns JSON that conforms to the schema, so the only post-processing needed is parsing the response string into the entities column.
Common entity types: the schema above covers people, organizations, locations, and dates, each returned as an array of strings.
Customizing the schema: modify entity_schema to extract domain-specific entities such as product SKUs, legal terms, or medical conditions (a sketch follows below).
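As an example of that kind of customization, the schema below swaps the entity fields for product-oriented ones; the field names (products, prices) are illustrative only and not part of this recipe:
# Hypothetical domain-specific schema: extract product names and prices instead
product_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "products",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "products": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Product names mentioned"
                },
                "prices": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Prices or monetary amounts"
                }
            },
            "required": ["products", "prices"],
            "additionalProperties": False
        }
    }
}
# Pass product_schema as the response_format in model_kwargs, exactly as with entity_schema above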

See also