Pandas

Integrating with Pandas: Extending Your Data Science Toolkit

Working with Pandas in Pixeltable

Pixeltable integrates with Pandas, so you can use both tools in a single workflow. The two serve different purposes and complement each other: Pixeltable excels at persistent storage, incremental updates, and multimodal data handling, while Pandas is well suited to one-off, in-memory analysis of the data you pull out.

Key Differences

Pixeltable dataframes differ from Pandas DataFrames in two important ways:

  • Pixeltable dataframes do not hold data in memory or allow direct updates (use insert/update/delete operations instead)
  • Queries are not executed until you explicitly request results (for example with collect() or head())

Importing Pandas Data into Pixeltable

The simplest way to import Pandas data into Pixeltable is to use the import_pandas function:

import pixeltable as pxt
import pandas as pd

# Create or load a Pandas DataFrame
df = pd.read_csv("my_data.csv")

# Create a Pixeltable table from the DataFrame 
table = pxt.io.import_pandas("my_table", df)

The import_pandas function:

  • Automatically creates a new Pixeltable table
  • Infers the column types from your Pandas DataFrame

Data Type Mapping

When importing from Pandas, Pixeltable automatically maps data types:

  • Numeric types (int, float) map to corresponding Pixeltable types
  • String/object types map to StringType
  • Datetime types map to TimestampType
  • Complex types (lists, dicts) map to JsonType
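
The mapping above can be sketched in plain Python to preview how the values of a single record would be typed. This is only an illustration of the rules in the list: the helper names `expected_pixeltable_type` and `preview_schema` are hypothetical, the code returns type names as strings rather than calling Pixeltable's own inference, and `IntType` and `FloatType` are assumed to be the "corresponding" numeric types.

```python
from datetime import datetime


def expected_pixeltable_type(value):
    """Return the Pixeltable type name the list above suggests for a Python value.

    Hypothetical helper for illustration only; it does not call Pixeltable.
    """
    if isinstance(value, bool):
        # bool is a subclass of int in Python, but booleans are not covered
        # by the list above, so no mapping is claimed for them here.
        return None
    if isinstance(value, int):
        return "IntType"
    if isinstance(value, float):
        return "FloatType"
    if isinstance(value, str):
        return "StringType"
    if isinstance(value, datetime):
        return "TimestampType"
    if isinstance(value, (list, dict)):
        return "JsonType"
    return None  # anything else falls outside the list above


def preview_schema(record):
    """Map each field of one record (a dict) to its expected type name."""
    return {name: expected_pixeltable_type(value) for name, value in record.items()}


record = {
    "id": 1,
    "score": 0.5,
    "label": "cat",
    "created": datetime(2024, 1, 1),
    "tags": ["a", "b"],
}
print(preview_schema(record))
```

A preview like this can help spot columns that would end up as JsonType when a plain string or number was intended, before the table is created.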

Extracting Data to Pandas

You can convert Pixeltable query results to Pandas DataFrames using the to_pandas() method:

# Query Pixeltable and convert to Pandas
result = table.select(table.column1, table.column2).collect()
df = result.to_pandas()

# Now use standard Pandas operations
print(df.describe())

Common Operations Comparison

Here's how common data operations translate between Pixeltable and Pandas:

# Computing a new feature
# Pandas:
df["test"] = df["col1"] - df["col2"]
df["test"].head(5)

# Pixeltable (the expression is evaluated at query time; no column is added to the table):
table.select(table.col1 - table.col2).head(5)

Best Practices

  1. Memory Management
    1. Use Pixeltable for persistent storage and large datasets
    2. Convert to Pandas only for the specific data segments you need to analyze
    3. Avoid calling collect() on a large table without first restricting the query, for example with where() or limit()
  2. Incremental Processing
    1. Let Pixeltable handle incremental updates through computed columns
    2. Use Pandas for one-off analytical tasks and exploratory data analysis
  3. Type Safety
    1. Leverage Pixeltable's type system for data validation
    2. Be aware of type conversions when moving between Pixeltable and Pandas
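
The memory-management advice above can be wrapped in a small helper that always bounds a query before materializing it. This is a sketch under assumptions: `to_pandas_bounded` and `max_rows` are hypothetical names, and the query object (as returned by select() or where()) is assumed to provide limit() and collect(), with the collected result offering to_pandas().

```python
def to_pandas_bounded(query, max_rows=1000):
    """Materialize at most `max_rows` rows of a query as a Pandas DataFrame.

    Hypothetical helper; `query` is assumed to be a Pixeltable query such as
    table.select(...).where(...), exposing limit() and collect().
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() restricts the query before it runs, collect() executes it,
    # and to_pandas() converts the collected result set.
    return query.limit(max_rows).collect().to_pandas()
```

For instance, `to_pandas_bounded(table.select(table.id, table.value).where(table.value > 5), max_rows=100)` would pull at most 100 matching rows into memory, regardless of how large the underlying table is.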

Example Workflow

Here's a complete example showing how to effectively combine Pixeltable and Pandas:

import pixeltable as pxt
import pandas as pd

# Create a Pixeltable table with some data
table = pxt.create_table('example', {
    'id': pxt.IntType(),
    'value': pxt.FloatType()
})

# Insert some data
table.insert([
    {'id': i, 'value': float(i)} 
    for i in range(10)
])

# Query specific data and convert to Pandas
df = table.select(
    table.id, 
    table.value
).where(
    table.value > 5
).collect().to_pandas()

# Perform Pandas operations
df['squared'] = df['value'] ** 2

# Create a new Pixeltable table from the results
results_table = pxt.io.import_pandas(
    'results', 
    df
)