Identify and extract people, organizations, locations, dates, and other
entities from text using LLMs.
Problem
You have unstructured text containing important information—names,
companies, dates, locations—that you need to extract and structure for
analysis, search, or integration with other systems.
Solution
What’s in this recipe:
- Extract entities as structured JSON
- Use OpenAI’s structured output for reliable parsing
- Access extracted entities as queryable columns
You use structured output to get entities in a consistent JSON format.
The entities are stored as JSON columns that you can query and filter.
Setup
%pip install -qU pixeltable openai
Note: you may need to restart the kernel to use updated packages.
import os
import getpass
if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
import pixeltable as pxt
from pixeltable.functions.openai import chat_completions
import json
# Create a fresh directory
pxt.drop_dir('entities_demo', force=True)
pxt.create_dir('entities_demo')
Created directory ‘entities_demo’.
<pixeltable.catalog.dir.Dir at 0x306118050>
# Define the JSON schema for entity extraction
entity_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "entities",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "people": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Names of people mentioned"
                },
                "organizations": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Names of companies, institutions, or groups"
                },
                "locations": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Geographic locations (cities, countries, addresses)"
                },
                "dates": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Dates or time references"
                }
            },
            "required": ["people", "organizations", "locations", "dates"],
            "additionalProperties": False
        }
    }
}
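To make the schema's constraints concrete, here is a minimal sketch in plain Python (no API call) that checks whether a JSON string has exactly the four required keys, each holding a list of strings. The helper name `matches_entity_shape` and the sample payloads are hypothetical illustrations and are not part of Pixeltable or the OpenAI SDK.

```python
import json

# Keys listed under "required" in entity_schema
REQUIRED_KEYS = {"people", "organizations", "locations", "dates"}


def matches_entity_shape(payload: str) -> bool:
    """Return True if payload is a JSON object with exactly the required
    keys, each mapping to a list of strings (hypothetical helper)."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # "additionalProperties": False means no extra keys are allowed,
    # and "required" means none of the four may be missing
    if set(data) != REQUIRED_KEYS:
        return False
    return all(
        isinstance(values, list) and all(isinstance(v, str) for v in values)
        for values in data.values()
    )


# Hand-written sample payloads, not model output
good = '{"people": ["Ada Lovelace"], "organizations": [], "locations": ["London"], "dates": []}'
missing_key = '{"people": ["Ada Lovelace"]}'
print(matches_entity_shape(good))         # True
print(matches_entity_shape(missing_key))  # False
```

A check like this is only a local sanity test; it does not replace the schema enforcement done by the API.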
# Create table for articles
articles = pxt.create_table(
    'entities_demo.articles',
    {'title': pxt.String, 'content': pxt.String}
)
Created table ‘articles’.
# Add entity extraction column
extraction_prompt = 'Extract all named entities from the following text:\n\n' + articles.content
articles.add_computed_column(
    extraction_response=chat_completions(
        messages=[{'role': 'user', 'content': extraction_prompt}],
        model='gpt-4o-mini',
        model_kwargs={'response_format': entity_schema}
    )
)
Added 0 column values with 0 errors.
No rows affected.
# Extract the entities JSON
articles.add_computed_column(
    entities=articles.extraction_response.choices[0].message.content
)
Added 0 column values with 0 errors.
No rows affected.
# Insert sample articles
sample_articles = [
    {
        'title': 'Tech Acquisition',
        'content': 'Microsoft announced today that CEO Satya Nadella will lead the acquisition of a Seattle-based startup. The deal, expected to close in March 2024, is valued at $500 million.'
    },
    {
        'title': 'Sports Update',
        'content': "LeBron James led the Los Angeles Lakers to victory against the Boston Celtics on Tuesday night at Staples Center. Coach Darvin Ham praised the team's performance."
    },
    {
        'title': 'Research Breakthrough',
        'content': 'Dr. Sarah Chen at Stanford University published groundbreaking research on renewable energy. The study, funded by the National Science Foundation, was conducted in Palo Alto, California.'
    },
]
articles.insert(sample_articles)
Inserting rows into `articles`: 3 rows [00:00, 404.21 rows/s]
Inserted 3 rows with 0 errors.
3 rows inserted, 12 values computed.
# View extracted entities
articles.select(articles.title, articles.entities).collect()
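Once the results are collected, ordinary Python is enough for quick lookups. The sketch below assumes the collected rows have been converted to plain dicts with a `title` and an `entities` value that is either a JSON string or an already-parsed dict. The rows shown are hand-written stand-ins rather than real model output, and `titles_mentioning` is a hypothetical helper.

```python
import json


def as_dict(entities):
    """Accept either a JSON string or an already-parsed dict."""
    return json.loads(entities) if isinstance(entities, str) else entities


def titles_mentioning(rows, entity_type, name):
    """Return titles of rows whose entities[entity_type] contains name."""
    return [
        row["title"]
        for row in rows
        if name in as_dict(row["entities"]).get(entity_type, [])
    ]


# Hand-written stand-in rows (assumed shape, not real model output)
rows = [
    {
        "title": "A",
        "entities": '{"people": ["Jane Doe"], "organizations": ["Acme"], "locations": [], "dates": []}',
    },
    {
        "title": "B",
        "entities": {"people": [], "organizations": ["Acme", "Globex"], "locations": [], "dates": []},
    },
]
print(titles_mentioning(rows, "organizations", "Acme"))  # ['A', 'B']
print(titles_mentioning(rows, "people", "Jane Doe"))     # ['A']
```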
Explanation
Structured output ensures reliable extraction:
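With OpenAI’s structured output (response_format with "strict": True), the
model’s reply is constrained to the schema, so the content parses as JSON
with all four keys present. You do not need regex-based post-processing,
though a refusal or a reply cut off by the token limit can still leave a
row without usable entities.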
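Two edge cases are still worth guarding against in production: the model can refuse a request, and a reply can be cut off by the token limit. The following is a minimal defensive sketch in plain Python. It assumes the response is a dict shaped like a Chat Completions response (`choices[0].finish_reason`, `choices[0].message.content`, `choices[0].message.refusal`). The function name `entities_from_response` and the sample responses are hypothetical.

```python
import json
from typing import Optional


def entities_from_response(response: dict) -> Optional[dict]:
    """Return the parsed entities, or None if the response is unusable."""
    choice = response["choices"][0]
    message = choice["message"]
    # A refusal carries an explanation instead of schema-conforming content
    if message.get("refusal"):
        return None
    # Anything other than "stop" (for example "length") may mean truncated JSON
    if choice.get("finish_reason") != "stop":
        return None
    try:
        return json.loads(message["content"])
    except (TypeError, json.JSONDecodeError):
        return None


# Hand-written sample responses (assumed shape, not real API output)
ok_response = {
    "choices": [{
        "finish_reason": "stop",
        "message": {
            "content": '{"people": [], "organizations": ["Acme"], "locations": [], "dates": []}',
            "refusal": None,
        },
    }]
}
truncated_response = {
    "choices": [{
        "finish_reason": "length",
        "message": {"content": '{"people": ["Jane', "refusal": None},
    }]
}
print(entities_from_response(ok_response)["organizations"])  # ['Acme']
print(entities_from_response(truncated_response))            # None
```

Returning None keeps the failure explicit, so a downstream step can skip or retry the row instead of crashing on malformed JSON.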
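Common entity types (as defined in the schema above):
- people: names of people mentioned
- organizations: companies, institutions, or groups
- locations: geographic locations (cities, countries, addresses)
- dates: dates or time references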
Customizing the schema:
Modify entity_schema to extract domain-specific entities such as product
SKUs, legal terms, or medical conditions.
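One way to avoid hand-editing the nested dict is to generate it from a mapping of entity type to description. This is a sketch under the assumption that every entity type is a list of strings, as in the schema used in this recipe. `build_entity_schema` and the example entity names are hypothetical.

```python
def build_entity_schema(entity_types: dict) -> dict:
    """Build a response_format dict from {entity_name: description}.

    Every entity type becomes a required array of strings, mirroring
    the structure of entity_schema in this recipe.
    """
    properties = {
        name: {
            "type": "array",
            "items": {"type": "string"},
            "description": description,
        }
        for name, description in entity_types.items()
    }
    return {
        "type": "json_schema",
        "json_schema": {
            "name": "entities",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": properties,
                # Strict mode expects every property to be listed as required
                "required": list(entity_types),
                "additionalProperties": False,
            },
        },
    }


# Hypothetical domain-specific entity types
product_schema = build_entity_schema({
    "product_skus": "Product SKU codes mentioned",
    "medical_conditions": "Names of medical conditions",
})
print(product_schema["json_schema"]["schema"]["required"])
# ['product_skus', 'medical_conditions']
```

The resulting dict can be passed as the 'response_format' value in model_kwargs in place of entity_schema.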
See also