Open in Kaggle  Open in Colab  Download Notebook
This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.
Identify and extract people, organizations, locations, dates, and other entities from text using LLMs.

Problem

You have unstructured text containing important information—names, companies, dates, locations—that you need to extract and structure for analysis, search, or integration with other systems.

Solution

What’s in this recipe:
  • Extract entities as structured JSON
  • Use OpenAI’s structured output for reliable parsing
  • Access extracted entities as queryable columns
You use structured output to get entities in a consistent JSON format. The entities are stored as JSON columns that you can query and filter.

Setup

%pip install -qU pixeltable openai
Note: you may need to restart the kernel to use updated packages.
import os
import getpass

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
import pixeltable as pxt
from pixeltable.functions.openai import chat_completions
import json
# Create a fresh directory
pxt.drop_dir('entities_demo', force=True)
pxt.create_dir('entities_demo')
Created directory `entities_demo`.
<pixeltable.catalog.dir.Dir at 0x306118050>

Define entity extraction schema

# Define the JSON schema for entity extraction
entity_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "entities",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "people": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Names of people mentioned"
                },
                "organizations": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Names of companies, institutions, or groups"
                },
                "locations": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Geographic locations (cities, countries, addresses)"
                },
                "dates": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Dates or time references"
                }
            },
            "required": ["people", "organizations", "locations", "dates"],
            "additionalProperties": False
        }
    }
}

Create extraction pipeline

# Create table for articles
articles = pxt.create_table(
    'entities_demo.articles',
    {'title': pxt.String, 'content': pxt.String}
)
Created table `articles`.
# Add entity extraction column
extraction_prompt = 'Extract all named entities from the following text:\n\n' + articles.content

articles.add_computed_column(
    extraction_response=chat_completions(
        messages=[{'role': 'user', 'content': extraction_prompt}],
        model='gpt-4o-mini',
        model_kwargs={'response_format': entity_schema}
    )
)
Added 0 column values with 0 errors.
No rows affected.
# Parse the entities JSON from the response content into a queryable JSON column
articles.add_computed_column(
    entities=articles.extraction_response.choices[0].message.content.apply(
        json.loads, col_type=pxt.Json
    )
)
Added 0 column values with 0 errors.
No rows affected.

Extract entities from text

# Insert sample articles
sample_articles = [
    {
        'title': 'Tech Acquisition',
        'content': 'Microsoft announced today that CEO Satya Nadella will lead the acquisition of a Seattle-based startup. The deal, expected to close in March 2024, is valued at $500 million.'
    },
    {
        'title': 'Sports Update',
        'content': 'LeBron James led the Los Angeles Lakers to victory against the Boston Celtics on Tuesday night at Staples Center. Coach Darvin Ham praised the team\'s performance.'
    },
    {
        'title': 'Research Breakthrough',
        'content': 'Dr. Sarah Chen at Stanford University published groundbreaking research on renewable energy. The study, funded by the National Science Foundation, was conducted in Palo Alto, California.'
    },
]

articles.insert(sample_articles)
Inserting rows into `articles`: 3 rows [00:00, 404.21 rows/s]
Inserted 3 rows with 0 errors.
3 rows inserted, 12 values computed.
# View extracted entities
articles.select(articles.title, articles.entities).collect()
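Because entities is a JSON column, you can select individual fields or combine them with row filters. A brief sketch (the field names come from entity_schema above, and the 'Tech Acquisition' filter simply reuses one of the sample titles):
# Select only the people mentioned in each article
articles.select(articles.title, articles.entities['people']).collect()

# Filter to a single article and inspect its organizations
articles.where(articles.title == 'Tech Acquisition').select(
    articles.entities['organizations']
).collect()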

Explanation

Structured output ensures reliable extraction: with OpenAI’s structured output (response_format with strict set to true), the model returns JSON that conforms to the schema, so the only post-processing needed is parsing the response string into the entities column.
Common entity types: the schema above covers people, organizations, locations, and dates, each returned as an array of strings.
Customizing the schema: modify entity_schema to extract domain-specific entities such as product SKUs, legal terms, or medical conditions (a sketch follows below).
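As an example of that kind of customization, the schema below swaps the entity fields for product-oriented ones; the field names (products, prices) are illustrative only and not part of this recipe:
# Hypothetical domain-specific schema: extract product names and prices instead
product_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "products",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "products": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Product names mentioned"
                },
                "prices": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Prices or monetary amounts"
                }
            },
            "required": ["products", "prices"],
            "additionalProperties": False
        }
    }
}
# Pass product_schema as the response_format in model_kwargs, exactly as with entity_schema above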

See also