# Changelog

Source: https://docs.pixeltable.com/changelog/changelog

Release history and updates for Pixeltable

## Contributors

Pixeltable is built by a vibrant community of contributors. We're grateful to everyone who has helped make Pixeltable better!

**Want to contribute?** Check out our [Contributing Guide](https://github.com/pixeltable/pixeltable/tree/main?tab=contributing-ov-file#readme) to get started.

**Top Contributors:** View our top contributors on [GitHub](https://github.com/pixeltable/pixeltable/graphs/contributors).

***

## Release History

View the complete release history for Pixeltable below. Each release includes detailed information about new features, bug fixes, and improvements.

For the latest release information, visit our [GitHub Releases page](https://github.com/pixeltable/pixeltable/releases).

***

### v0.5.20

**Released:** March 03, 2026\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.20](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.20)

#### What's Changed

* Perftest to log if it thinks that it's running in CI by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1163](https://github.com/pixeltable/pixeltable/pull/1163)
* \[PXT-1002] re-enable force replace view in random ops by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1166](https://github.com/pixeltable/pixeltable/pull/1166)
* \[PXT-1002] Fix table md caching when an insert finalizes view creation by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1138](https://github.com/pixeltable/pixeltable/pull/1138)
* Add missing %pip install to custom-iterators.ipynb by [@aaron-siegel](https://github.com/aaron-siegel) in [#1171](https://github.com/pixeltable/pixeltable/pull/1171)
* Add migration guides for new users coming from common stacks by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1167](https://github.com/pixeltable/pixeltable/pull/1167)

**Full Changelog**: 
[https://github.com/pixeltable/pixeltable/compare/v0.5.19...v0.5.20](https://github.com/pixeltable/pixeltable/compare/v0.5.19...v0.5.20)

***

### v0.5.19

**Released:** March 01, 2026\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.19](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.19)

#### What's Changed

* Add local docs serving instructions to contributing guide by [@apreshill](https://github.com/apreshill) in [#1054](https://github.com/pixeltable/pixeltable/pull/1054)
* TableOp refactoring so that TableVersion is not required for some ops by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1153](https://github.com/pixeltable/pixeltable/pull/1153)
* `@pxt.iterator` decorator by [@aaron-siegel](https://github.com/aaron-siegel) in [#1111](https://github.com/pixeltable/pixeltable/pull/1111)
* Docs: add missing integrations, SDK entries, and cookbook updates for v0.5.11–v0.5.18 by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1158](https://github.com/pixeltable/pixeltable/pull/1158)
* Quieter CI output by [@aaron-siegel](https://github.com/aaron-siegel) in [#1161](https://github.com/pixeltable/pixeltable/pull/1161)
* \[PXT-1002] Make non-transactional TableOps idempotent by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1139](https://github.com/pixeltable/pixeltable/pull/1139)
* \[PXT-1043] Support video embeddings in VoyageAI by [@aaron-siegel](https://github.com/aaron-siegel) in [#1160](https://github.com/pixeltable/pixeltable/pull/1160)
* PXT-877 Fixing if\_exists='replace' cannot be used to replace a Table with a View/Snapshot or vice-versa by [@christopherpestano](https://github.com/christopherpestano) in [#1150](https://github.com/pixeltable/pixeltable/pull/1150)
* PXT-1020: support for multi-threaded API calls by [@mkornacker](https://github.com/mkornacker) in [#1155](https://github.com/pixeltable/pixeltable/pull/1155)
* Fix 
TableVersion.is\_iterator\_column by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1159](https://github.com/pixeltable/pixeltable/pull/1159)
* PXT-933 Support videos in gemini generate\_content by [@amithadke](https://github.com/amithadke) in [#1152](https://github.com/pixeltable/pixeltable/pull/1152)
* \[PXT-1018] Add a "source" field to list of columns in t.describe() by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1135](https://github.com/pixeltable/pixeltable/pull/1135)
* uvloop compatibility by [@mkornacker](https://github.com/mkornacker) in [#1164](https://github.com/pixeltable/pixeltable/pull/1164)
* docs: update deployment guides for thread safety, sync endpoints, and uvloop by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1165](https://github.com/pixeltable/pixeltable/pull/1165)
* Add Bedrock API Key auth support and notebook outputs by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1146](https://github.com/pixeltable/pixeltable/pull/1146)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.18...v0.5.19](https://github.com/pixeltable/pixeltable/compare/v0.5.18...v0.5.19)

***

### v0.5.18

**Released:** February 24, 2026\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.18](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.18)

#### What's Changed

* misc improvements in the code by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1072](https://github.com/pixeltable/pixeltable/pull/1072)
* \[PXT-995] improve test migration coverage of literals of various types by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1128](https://github.com/pixeltable/pixeltable/pull/1128)
* Twelvelabs notebook update by [@apreshill](https://github.com/apreshill) in [#1117](https://github.com/pixeltable/pixeltable/pull/1117)
* \[PXT-1040] Temporarily disable twelvelabs nb test by 
[@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1140](https://github.com/pixeltable/pixeltable/pull/1140)
* Update contribution guidelines regarding AI-generated code by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1134](https://github.com/pixeltable/pixeltable/pull/1134)
* \[PXT-1007 + PXT-1010] Modifying add\_columns to support column metadata and introducing standard ColumnSpec by [@christopherpestano](https://github.com/christopherpestano) in [#1119](https://github.com/pixeltable/pixeltable/pull/1119)
* Adding negative\_prompt to img2img notebook by [@christopherpestano](https://github.com/christopherpestano) in [#1136](https://github.com/pixeltable/pixeltable/pull/1136)
* \[PXT-1040] disable all twelvelabs tests by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1142](https://github.com/pixeltable/pixeltable/pull/1142)
* PXT-1039: video\_splitter(mode='accurate') doesn't work by [@mkornacker](https://github.com/mkornacker) in [#1145](https://github.com/pixeltable/pixeltable/pull/1145)
* PXT-966: crop() udf for videos by [@mkornacker](https://github.com/mkornacker) in [#1144](https://github.com/pixeltable/pixeltable/pull/1144)
* dumps() udf for json by [@mkornacker](https://github.com/mkornacker) in [#1149](https://github.com/pixeltable/pixeltable/pull/1149)
* Fixes for recent versions of mintlify by [@aaron-siegel](https://github.com/aaron-siegel) in [#1151](https://github.com/pixeltable/pixeltable/pull/1151)
* \[PXT-1003] Add offset parameter to limit() queries for pagination by [@aaron-siegel](https://github.com/aaron-siegel) in [#1148](https://github.com/pixeltable/pixeltable/pull/1148)
* Add agentic patterns cookbook by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1141](https://github.com/pixeltable/pixeltable/pull/1141)
* PXT-985 + PXT-1041 - Adding custom\_metadata and comment for columns by [@christopherpestano](https://github.com/christopherpestano) in 
[#1132](https://github.com/pixeltable/pixeltable/pull/1132)
* Fix: Implement drop\_index() for BtreeIndex and EmbeddingIndex by [@KeeProMise](https://github.com/KeeProMise) in [#1133](https://github.com/pixeltable/pixeltable/pull/1133)
* Update OpenAI vision and image gen APIs to make proper use of images in dicts by [@aaron-siegel](https://github.com/aaron-siegel) in [#1147](https://github.com/pixeltable/pixeltable/pull/1147)
* \[PXT-995] Literal should serialize its entire type info by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1123](https://github.com/pixeltable/pixeltable/pull/1123)

#### New Contributors

* [@KeeProMise](https://github.com/KeeProMise) made their first contribution in [#1133](https://github.com/pixeltable/pixeltable/pull/1133)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.17...v0.5.18](https://github.com/pixeltable/pixeltable/compare/v0.5.17...v0.5.18)

***

### v0.5.17

**Released:** February 10, 2026\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.17](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.17)

#### What's Changed

* Standardize names for runner configs by [@aaron-siegel](https://github.com/aaron-siegel) in [#1122](https://github.com/pixeltable/pixeltable/pull/1122)
* Add Jina AI integration for embeddings and reranking by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1029](https://github.com/pixeltable/pixeltable/pull/1029)
* Add Microsoft Fabric Integration for Azure OpenAI by [@pawarbi](https://github.com/pawarbi) in [#1109](https://github.com/pixeltable/pixeltable/pull/1109)
* Switch away from gemini-2.0 models by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1115](https://github.com/pixeltable/pixeltable/pull/1115)
* PXT-985 Adding custom\_metadata and restricting comment field to string by [@christopherpestano](https://github.com/christopherpestano) in 
[#1102](https://github.com/pixeltable/pixeltable/pull/1102)
* Nightly CI fix by [@aaron-siegel](https://github.com/aaron-siegel) in [#1129](https://github.com/pixeltable/pixeltable/pull/1129)
* PXT-1033: handle min\_segment\_duration=None correctly in VideoSplitter by [@mkornacker](https://github.com/mkornacker) in [#1131](https://github.com/pixeltable/pixeltable/pull/1131)
* Apply ruff formatting to code snippets in docstrings by [@aaron-siegel](https://github.com/aaron-siegel) in [#1125](https://github.com/pixeltable/pixeltable/pull/1125)
* Improved treatment of stored UDFs by [@aaron-siegel](https://github.com/aaron-siegel) in [#1126](https://github.com/pixeltable/pixeltable/pull/1126)
* PXT-1023: Support for ragged arrays in export\_parquet() by [@mkornacker](https://github.com/mkornacker) in [#1124](https://github.com/pixeltable/pixeltable/pull/1124)

#### New Contributors

* [@pawarbi](https://github.com/pawarbi) made their first contribution in [#1109](https://github.com/pixeltable/pixeltable/pull/1109)
* [@christopherpestano](https://github.com/christopherpestano) made their first contribution in [#1102](https://github.com/pixeltable/pixeltable/pull/1102)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.16...v0.5.17](https://github.com/pixeltable/pixeltable/compare/v0.5.16...v0.5.17)

***

### v0.5.16

**Released:** February 04, 2026\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.16](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.16)

#### What's Changed

* PXT-898 Allow Pixeltable API key to change in the environment mid-stream in a Python session by [@amithadke](https://github.com/amithadke) in [#1060](https://github.com/pixeltable/pixeltable/pull/1060)
* various runwayml followups by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1095](https://github.com/pixeltable/pixeltable/pull/1095)
* Ensure progress bar stops on empty results and plan exit by 
[@amithadke](https://github.com/amithadke) in [#1097](https://github.com/pixeltable/pixeltable/pull/1097)
* Fix exception handling in catalog by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1101](https://github.com/pixeltable/pixeltable/pull/1101)
* Migrate docs to `uuid7()` UDF by [@apreshill](https://github.com/apreshill) in [#1093](https://github.com/pixeltable/pixeltable/pull/1093)
* Add retries to Python install in CI by [@aaron-siegel](https://github.com/aaron-siegel) in [#1094](https://github.com/pixeltable/pixeltable/pull/1094)
* fix: Make notebook outputs visible in dark mode by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1107](https://github.com/pixeltable/pixeltable/pull/1107)
* various improvements to random-ops script by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1084](https://github.com/pixeltable/pixeltable/pull/1084)
* Prep work for iterator refactor: Add media types and iterators to migration test by [@aaron-siegel](https://github.com/aaron-siegel) in [#1103](https://github.com/pixeltable/pixeltable/pull/1103)
* Add export media to s3 to io cookbooks in docs by [@apreshill](https://github.com/apreshill) in [#1088](https://github.com/pixeltable/pixeltable/pull/1088)
* Include audio\_splitter and video\_splitter in db dumps by [@aaron-siegel](https://github.com/aaron-siegel) in [#1110](https://github.com/pixeltable/pixeltable/pull/1110)
* PXT-965 Support http url and blob store uri for creating json/parquet/csv tables by [@amithadke](https://github.com/amithadke) in [#1104](https://github.com/pixeltable/pixeltable/pull/1104)
* Fixes for Pandas 3.0 by [@aaron-siegel](https://github.com/aaron-siegel) in [#1112](https://github.com/pixeltable/pixeltable/pull/1112)
* Upgrade ruff to latest by [@aaron-siegel](https://github.com/aaron-siegel) in [#1114](https://github.com/pixeltable/pixeltable/pull/1114)
* Fixes for Transformers 5 by [@aaron-siegel](https://github.com/aaron-siegel) in 
[#1113](https://github.com/pixeltable/pixeltable/pull/1113)
* Use a larger runner in merge queue for full tests on Python 3.10 by [@aaron-siegel](https://github.com/aaron-siegel) in [#1120](https://github.com/pixeltable/pixeltable/pull/1120)
* \[PXT-944] speech2text\_for\_conditional\_generation declares return type… by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1116](https://github.com/pixeltable/pixeltable/pull/1116)
* \[PXT-875] Fix openai perftest on github by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1062](https://github.com/pixeltable/pixeltable/pull/1062)
* PXT-973: additional\_columns doesn't evaluate as expected when creating a view by [@mkornacker](https://github.com/mkornacker) in [#1087](https://github.com/pixeltable/pixeltable/pull/1087)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.15...v0.5.16](https://github.com/pixeltable/pixeltable/compare/v0.5.15...v0.5.16)

***

### v0.5.15

**Released:** January 29, 2026\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.15](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.15)

#### What's Changed

* docs: update overview description and callout/footer styling by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1086](https://github.com/pixeltable/pixeltable/pull/1086)
* Fix HF datasets rotten\_tomatoes references in tests & notebook by [@aaron-siegel](https://github.com/aaron-siegel) in [#1089](https://github.com/pixeltable/pixeltable/pull/1089)
* Gemini UDFs to use "rate limits" scheduler by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1092](https://github.com/pixeltable/pixeltable/pull/1092)
* Allow dict/list config params to be specified as environment variables by [@aaron-siegel](https://github.com/aaron-siegel) in [#1091](https://github.com/pixeltable/pixeltable/pull/1091)
* Minor Gemini UDF followup for safer get\_retry\_delay() by 
[@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1098](https://github.com/pixeltable/pixeltable/pull/1098)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.14...v0.5.15](https://github.com/pixeltable/pixeltable/compare/v0.5.14...v0.5.15)

***

### v0.5.14

**Released:** January 24, 2026\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.14](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.14)

#### What's Changed

* Add RunwayML integration with UDFs for image and video generation by [@tiennguyentony](https://github.com/tiennguyentony) in [#1019](https://github.com/pixeltable/pixeltable/pull/1019)
* Deployment and Use Cases Docs by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1043](https://github.com/pixeltable/pixeltable/pull/1043)
* Transaction rollback by [@mkornacker](https://github.com/mkornacker) in [#1075](https://github.com/pixeltable/pixeltable/pull/1075)
* \[PXT-972] Bugfix: FrameIterator.set\_pos() on videos with start\_time > 0 by [@aaron-siegel](https://github.com/aaron-siegel) in [#1082](https://github.com/pixeltable/pixeltable/pull/1082)
* to\_string() method on UUIDType by [@aaron-siegel](https://github.com/aaron-siegel) in [#1078](https://github.com/pixeltable/pixeltable/pull/1078)
* CI and Makefile step to ensure notebooks have >= 50% of their cells with outputs by [@aaron-siegel](https://github.com/aaron-siegel) in [#1073](https://github.com/pixeltable/pixeltable/pull/1073)
* Regenerate all outputs for Reve integration notebook by [@apreshill](https://github.com/apreshill) in [#1071](https://github.com/pixeltable/pixeltable/pull/1071)
* Apply ruff formatting to all notebooks by [@aaron-siegel](https://github.com/aaron-siegel) in [#1074](https://github.com/pixeltable/pixeltable/pull/1074)

#### New Contributors

* [@tiennguyentony](https://github.com/tiennguyentony) made their first contribution in 
[#1019](https://github.com/pixeltable/pixeltable/pull/1019)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.13...v0.5.14](https://github.com/pixeltable/pixeltable/compare/v0.5.13...v0.5.14)

***

### v0.5.13

**Released:** January 22, 2026\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.13](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.13)

#### What's Changed

* rename reset\_db fixture to uses\_db by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1067](https://github.com/pixeltable/pixeltable/pull/1067)
* Use '/' as path delimiter by [@amithadke](https://github.com/amithadke) in [#1055](https://github.com/pixeltable/pixeltable/pull/1055)
* Temporarily disable progress reporting when verbosity \< 2 by [@aaron-siegel](https://github.com/aaron-siegel) in [#1079](https://github.com/pixeltable/pixeltable/pull/1079)
* Follow up fixes for Path delimiter change by [@amithadke](https://github.com/amithadke) in [#1076](https://github.com/pixeltable/pixeltable/pull/1076)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.12...v0.5.13](https://github.com/pixeltable/pixeltable/compare/v0.5.12...v0.5.13)

***

### v0.5.12

**Released:** January 17, 2026\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.12](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.12)

#### What's Changed

* Lint markdown in notebooks by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1033](https://github.com/pixeltable/pixeltable/pull/1033)
* Adjust down max connections on OpenAI client by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1058](https://github.com/pixeltable/pixeltable/pull/1058)
* \[PXT-915] Gemini embedding UDFs by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#986](https://github.com/pixeltable/pixeltable/pull/986)
* PXT-866 Add validation for version in pixeltable uri by 
[@amithadke](https://github.com/amithadke) in [#1048](https://github.com/pixeltable/pixeltable/pull/1048)
* uuid7() udf by [@mkornacker](https://github.com/mkornacker) in [#1059](https://github.com/pixeltable/pixeltable/pull/1059)
* \[PXT-875] Disable performance test until it reliably passes by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1061](https://github.com/pixeltable/pixeltable/pull/1061)
* Daemonize pgserver on Windows by [@aaron-siegel](https://github.com/aaron-siegel) in [#1057](https://github.com/pixeltable/pixeltable/pull/1057)
* PXT-954: assertion in recompute\_columns() for view column by [@mkornacker](https://github.com/mkornacker) in [#1064](https://github.com/pixeltable/pixeltable/pull/1064)
* Remove obsolete mkdocs by [@aaron-siegel](https://github.com/aaron-siegel) in [#1056](https://github.com/pixeltable/pixeltable/pull/1056)
* Working with blob storage nb by [@apreshill](https://github.com/apreshill) in [#977](https://github.com/pixeltable/pixeltable/pull/977)
* PXT-961: correct support for alpha in draw\_bounding\_boxes() by [@mkornacker](https://github.com/mkornacker) in [#1068](https://github.com/pixeltable/pixeltable/pull/1068)
* Notebook CI tweaks by [@aaron-siegel](https://github.com/aaron-siegel) in [#1069](https://github.com/pixeltable/pixeltable/pull/1069)
* PXT-943: Rectify all indices in TableRestorer, not just embedding indices by [@aaron-siegel](https://github.com/aaron-siegel) in [#1066](https://github.com/pixeltable/pixeltable/pull/1066)
* \[PXT-955] Skip UDA evaluation if a required parameter is None by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1070](https://github.com/pixeltable/pixeltable/pull/1070)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.11...v0.5.12](https://github.com/pixeltable/pixeltable/compare/v0.5.11...v0.5.12)

***

### v0.5.11

**Released:** January 13, 2026\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** 
[v0.5.11](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.11)

#### What's Changed

* \[PXT-916] Store embedding indexes as halfvecs by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1007](https://github.com/pixeltable/pixeltable/pull/1007)
* Add a "read only random ops" stress-tests job by [@aaron-siegel](https://github.com/aaron-siegel) in [#1047](https://github.com/pixeltable/pixeltable/pull/1047)
* Streamline dev installation by [@aaron-siegel](https://github.com/aaron-siegel) in [#1046](https://github.com/pixeltable/pixeltable/pull/1046)
* Add reruns by default to all cockroach test failures by [@aaron-siegel](https://github.com/aaron-siegel) in [#1053](https://github.com/pixeltable/pixeltable/pull/1053)
* PXT-938: export\_sql() by [@mkornacker](https://github.com/mkornacker) in [#1037](https://github.com/pixeltable/pixeltable/pull/1037)
* Add cookbooks: SQL and Segmentation by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1038](https://github.com/pixeltable/pixeltable/pull/1038)
* \[PXT-629] Update plan is incomplete by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1044](https://github.com/pixeltable/pixeltable/pull/1044)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.10...v0.5.11](https://github.com/pixeltable/pixeltable/compare/v0.5.10...v0.5.11)

***

### v0.5.10

**Released:** January 10, 2026\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.10](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.10)

#### What's Changed

* Adding ipywidgets to dev dependencies by [@mkornacker](https://github.com/mkornacker) in [#1027](https://github.com/pixeltable/pixeltable/pull/1027)
* Add a seed to TestSample.test\_sample\_basic\_f by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1040](https://github.com/pixeltable/pixeltable/pull/1040)
* Twelvelabs notebook by [@pierrebrunelle](https://github.com/pierrebrunelle) in 
[#1013](https://github.com/pixeltable/pixeltable/pull/1013)
* Readme Updates by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1041](https://github.com/pixeltable/pixeltable/pull/1041)
* Proper configurability for spaCy models by [@aaron-siegel](https://github.com/aaron-siegel) in [#1039](https://github.com/pixeltable/pixeltable/pull/1039)
* Various import fixes by [@aaron-siegel](https://github.com/aaron-siegel) in [#1042](https://github.com/pixeltable/pixeltable/pull/1042)
* PXT-875 Run perf tests on a dedicated larger runner by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1024](https://github.com/pixeltable/pixeltable/pull/1024)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.9...v0.5.10](https://github.com/pixeltable/pixeltable/compare/v0.5.9...v0.5.10)

***

### v0.5.9

**Released:** December 30, 2025\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.9](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.9)

#### What's Changed

* Bedrock invoke\_model() udf by [@mkornacker](https://github.com/mkornacker) in [#1018](https://github.com/pixeltable/pixeltable/pull/1018)
* \[PXT-765] Support for Office Formats as part of Document Type through MarkdownIT by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#960](https://github.com/pixeltable/pixeltable/pull/960)
* HF DetrForSegmentation by [@mkornacker](https://github.com/mkornacker) in [#1020](https://github.com/pixeltable/pixeltable/pull/1020)
* Image2Image: Updated HF.py to use AutoPipelineForImage2Image and Cookbook by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1025](https://github.com/pixeltable/pixeltable/pull/1025)
* Fixed broken tutorial links. 
by [@joerg84](https://github.com/joerg84) in [#1026](https://github.com/pixeltable/pixeltable/pull/1026)
* Allow `similarity(image=...)` to accept a filename or URL instead of a PIL image object by [@aaron-siegel](https://github.com/aaron-siegel) in [#1023](https://github.com/pixeltable/pixeltable/pull/1023)
* docs(cookbook): add MCP tool calling section to LLM tool calling guide by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1021](https://github.com/pixeltable/pixeltable/pull/1021)
* PXT-928: Export Json columns to parquet as pa.struct by [@mkornacker](https://github.com/mkornacker) in [#1017](https://github.com/pixeltable/pixeltable/pull/1017)
* removing psutil by [@mkornacker](https://github.com/mkornacker) in [#1031](https://github.com/pixeltable/pixeltable/pull/1031)
* Use head() instead of collect() in test\_add\_column\_to\_view by [@aaron-siegel](https://github.com/aaron-siegel) in [#1022](https://github.com/pixeltable/pixeltable/pull/1022)
* disable progress reporting in Jupyter if ipywidgets is not installed by [@mkornacker](https://github.com/mkornacker) in [#1032](https://github.com/pixeltable/pixeltable/pull/1032)

#### New Contributors

* [@joerg84](https://github.com/joerg84) made their first contribution in [#1026](https://github.com/pixeltable/pixeltable/pull/1026)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.8...v0.5.9](https://github.com/pixeltable/pixeltable/compare/v0.5.8...v0.5.9)

***

### v0.5.8

**Released:** December 20, 2025\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.8](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.8)

#### What's Changed

* Use high performance endpoint for Tigris by [@apreshill](https://github.com/apreshill) in [#1011](https://github.com/pixeltable/pixeltable/pull/1011)
* Merge Table.add\_embedding\_index examples by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in 
[#1014](https://github.com/pixeltable/pixeltable/pull/1014)
* Notebook fixes & some cleanup by [@aaron-siegel](https://github.com/aaron-siegel) in [#1010](https://github.com/pixeltable/pixeltable/pull/1010)
* Progress tracker by [@mkornacker](https://github.com/mkornacker) in [#956](https://github.com/pixeltable/pixeltable/pull/956)
* \[PXT-925] Fix spurious exception when `if_not_exists='ignore'` is used with a missing parent dir by [@aaron-siegel](https://github.com/aaron-siegel) in [#1015](https://github.com/pixeltable/pixeltable/pull/1015)
* Improve primary key error message by [@aaron-siegel](https://github.com/aaron-siegel) in [#1016](https://github.com/pixeltable/pixeltable/pull/1016)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.7...v0.5.8](https://github.com/pixeltable/pixeltable/compare/v0.5.7...v0.5.8)

***

### v0.5.7

**Released:** December 18, 2025\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.7](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.7)

#### What's Changed

* Fix a bug in rag-demo.ipynb by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#996](https://github.com/pixeltable/pixeltable/pull/996)
* Fixes the errant `/datastore/` url in the Reve docstrings by [@apreshill](https://github.com/apreshill) in [#999](https://github.com/pixeltable/pixeltable/pull/999)
* Remove custom-iterators.ipynb from docs for now, and clean up docs.json by [@aaron-siegel](https://github.com/aaron-siegel) in [#997](https://github.com/pixeltable/pixeltable/pull/997)
* \[PXT-921] Skip test\_create\_video\_table on cockroachdb by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#1002](https://github.com/pixeltable/pixeltable/pull/1002)
* Add iterators cookbook with all 6 built-in iterators by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1000](https://github.com/pixeltable/pixeltable/pull/1000)
* PXT 910 Add rerun options to presigned url tests by 
[@amithadke](https://github.com/amithadke) in [#1006](https://github.com/pixeltable/pixeltable/pull/1006)
* docs: add presigned\_url to S3 cookbook and update SDK docs by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1004](https://github.com/pixeltable/pixeltable/pull/1004)
* docs(providers): add Tigris example notebook by [@Xe](https://github.com/Xe) in [#998](https://github.com/pixeltable/pixeltable/pull/998)
* docs: update Mintlify theme colors and styling by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#1008](https://github.com/pixeltable/pixeltable/pull/1008)
* Add `pxt.Binary` type to type system; `bytes` support in JSON; working Gemini 3 Pro by [@aaron-siegel](https://github.com/aaron-siegel) in [#1001](https://github.com/pixeltable/pixeltable/pull/1001)
* Support audio and video embedding indices by [@aaron-siegel](https://github.com/aaron-siegel) in [#990](https://github.com/pixeltable/pixeltable/pull/990)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.6...v0.5.7](https://github.com/pixeltable/pixeltable/compare/v0.5.6...v0.5.7)

***

### v0.5.6

**Released:** December 15, 2025\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.6](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.6)

#### What's Changed

* \[PXT-892] Support variable framerate in FrameIterator by [@aaron-siegel](https://github.com/aaron-siegel) in [#961](https://github.com/pixeltable/pixeltable/pull/961)
* \[PXT-875] Define GRAFANA\_INSTANCE\_ID for the perf job by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#989](https://github.com/pixeltable/pixeltable/pull/989)
* \[PXT-399] Remove pymupdf as a dependency by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#981](https://github.com/pixeltable/pixeltable/pull/981)
* Docs Cleanup + Cookbooks + Versioning/Lineage + Production for Workshop by [@pierrebrunelle](https://github.com/pierrebrunelle) in 
[#964](https://github.com/pixeltable/pixeltable/pull/964)
* Iterators Refactor Part 1 by [@aaron-siegel](https://github.com/aaron-siegel) in [#992](https://github.com/pixeltable/pixeltable/pull/992)
* Update documentation for iterators and aggregate functions by [@aaron-siegel](https://github.com/aaron-siegel) in [#995](https://github.com/pixeltable/pixeltable/pull/995)
* PXT-910 Add presigned\_url udf by [@amithadke](https://github.com/amithadke) in [#991](https://github.com/pixeltable/pixeltable/pull/991)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.5...v0.5.6](https://github.com/pixeltable/pixeltable/compare/v0.5.5...v0.5.6)

***

### v0.5.5

**Released:** December 11, 2025\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.5](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.5)

#### What's Changed

* Multimodal support for Gemini `generate_content()` by [@aaron-siegel](https://github.com/aaron-siegel) in [#983](https://github.com/pixeltable/pixeltable/pull/983)
* PXT-903 Add UUID in pixeltable types by [@amithadke](https://github.com/amithadke) in [#979](https://github.com/pixeltable/pixeltable/pull/979)
* PXT-905/907: clean up handling of Huggingface datasets by [@mkornacker](https://github.com/mkornacker) in [#984](https://github.com/pixeltable/pixeltable/pull/984)
* Twelve Labs multimodal embeddings support by [@aaron-siegel](https://github.com/aaron-siegel) in [#987](https://github.com/pixeltable/pixeltable/pull/987)

**Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.4...v0.5.5](https://github.com/pixeltable/pixeltable/compare/v0.5.4...v0.5.5)

***

### v0.5.4

**Released:** December 09, 2025\
**Author:** [@aaron-siegel](https://github.com/aaron-siegel)\
**View on GitHub:** [v0.5.4](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.4)

#### What's Changed

* \[PXT-645] Support more numpy dtypes for Array by 
[@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#940](https://github.com/pixeltable/pixeltable/pull/940) * Add working-with-voyageai tutorial notebook by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#978](https://github.com/pixeltable/pixeltable/pull/978) * StringSplitter docstring fix plus test by [@mkornacker](https://github.com/mkornacker) in [#980](https://github.com/pixeltable/pixeltable/pull/980) * \[PXT-875] performance test for openai endpoints by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#963](https://github.com/pixeltable/pixeltable/pull/963) * Restructuring of docs site and repo by [@aaron-siegel](https://github.com/aaron-siegel) in [#982](https://github.com/pixeltable/pixeltable/pull/982) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.3...v0.5.4](https://github.com/pixeltable/pixeltable/compare/v0.5.3...v0.5.4) *** ### v0.5.3 **Released:** December 04, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.5.3](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.3) #### What's Changed * PXT-872 Support count() with sample and group by clause. 
by [@amithadke](https://github.com/amithadke) in [#955](https://github.com/pixeltable/pixeltable/pull/955) * Add VOYAGE\_API\_KEY to CI and configuration.mdx; update uv.lock doctools reference by [@aaron-siegel](https://github.com/aaron-siegel) in [#976](https://github.com/pixeltable/pixeltable/pull/976) * Fal.ai Integration by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#959](https://github.com/pixeltable/pixeltable/pull/959) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.2...v0.5.3](https://github.com/pixeltable/pixeltable/compare/v0.5.2...v0.5.3) *** ### v0.5.2 **Released:** December 03, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.5.2](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.2) #### What's Changed * Use database schemas and search\_path for test isolation in parallel runs by [@amithadke](https://github.com/amithadke) in [#953](https://github.com/pixeltable/pixeltable/pull/953) * Working CI for Cockroach by [@aaron-siegel](https://github.com/aaron-siegel) in [#906](https://github.com/pixeltable/pixeltable/pull/906) * Fix internal documentation links by [@aaron-siegel](https://github.com/aaron-siegel) in [#954](https://github.com/pixeltable/pixeltable/pull/954) * \[PXT-886] Fix a bug in RateLimitsScheduler's error handling by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#951](https://github.com/pixeltable/pixeltable/pull/951) * \[PXT-786] Development Guide by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#958](https://github.com/pixeltable/pixeltable/pull/958) * Add Reve integration notebook by [@apreshill](https://github.com/apreshill) in [#939](https://github.com/pixeltable/pixeltable/pull/939) * Adds support for Voyage AI embeddings and rerankers. 
by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#962](https://github.com/pixeltable/pixeltable/pull/962) * Some rough-edges features/improvements by [@mkornacker](https://github.com/mkornacker) in [#967](https://github.com/pixeltable/pixeltable/pull/967) * \[PXT-908] Ensure that generated Gemini videos have sound by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#973](https://github.com/pixeltable/pixeltable/pull/973) * PXT-904: add MIME type for object uploads by [@mkornacker](https://github.com/mkornacker) in [#971](https://github.com/pixeltable/pixeltable/pull/971) * Update uv.lock by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#974](https://github.com/pixeltable/pixeltable/pull/974) * Add uv.lock validation to the pr tests by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#975](https://github.com/pixeltable/pixeltable/pull/975) * Documentation and config updates by [@aaron-siegel](https://github.com/aaron-siegel) in [#972](https://github.com/pixeltable/pixeltable/pull/972) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.1...v0.5.2](https://github.com/pixeltable/pixeltable/compare/v0.5.1...v0.5.2) *** ### v0.5.1 **Released:** November 19, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.5.1](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.1) #### What's Changed * Add TableVersionMd.from\_dict and update publish to use objects instead of dicts by [@amithadke](https://github.com/amithadke) in [#944](https://github.com/pixeltable/pixeltable/pull/944) * Publishing existing version returns 201, 204 does not allow any content to be sent back in body. 
by [@amithadke](https://github.com/amithadke) in [#948](https://github.com/pixeltable/pixeltable/pull/948) * Replace StorageDestination with StorageTarget by [@amithadke](https://github.com/amithadke) in [#947](https://github.com/pixeltable/pixeltable/pull/947) * Missing converter for schema change in PR 932 by [@mkornacker](https://github.com/mkornacker) in [#949](https://github.com/pixeltable/pixeltable/pull/949) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.5.0...v0.5.1](https://github.com/pixeltable/pixeltable/compare/v0.5.0...v0.5.1) *** ### v0.5.0 **Released:** November 18, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.5.0](https://github.com/pixeltable/pixeltable/releases/tag/v0.5.0) #### What's Changed * Data sharing docs by [@apreshill](https://github.com/apreshill) in [#931](https://github.com/pixeltable/pixeltable/pull/931) * Numerous documentation fixes by [@aaron-siegel](https://github.com/aaron-siegel) in [#933](https://github.com/pixeltable/pixeltable/pull/933) * PXT-846: FrameIterator(keyframes\_only: bool) by [@mkornacker](https://github.com/mkornacker) in [#934](https://github.com/pixeltable/pixeltable/pull/934) * \[PXT-809] Improve OpenAI rate limiting by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#912](https://github.com/pixeltable/pixeltable/pull/912) * Multi-phase drop\_table() by [@mkornacker](https://github.com/mkornacker) in [#932](https://github.com/pixeltable/pixeltable/pull/932) * Streamline Makefile by [@aaron-siegel](https://github.com/aaron-siegel) in [#937](https://github.com/pixeltable/pixeltable/pull/937) * Changes to protocol to handle publishing existing version by [@amithadke](https://github.com/amithadke) in [#938](https://github.com/pixeltable/pixeltable/pull/938) * PXT-871: == None filter doesn't work correctly on an array column by [@mkornacker](https://github.com/mkornacker) in 
[#941](https://github.com/pixeltable/pixeltable/pull/941) * More documentation improvements by [@aaron-siegel](https://github.com/aaron-siegel) in [#936](https://github.com/pixeltable/pixeltable/pull/936) * Circularity detection in view creation with if\_exists='replace' by [@aaron-siegel](https://github.com/aaron-siegel) in [#942](https://github.com/pixeltable/pixeltable/pull/942) * Add Tigris integration by [@Xe](https://github.com/Xe) in [#935](https://github.com/pixeltable/pixeltable/pull/935) * Improvements to notebook documentation by [@aaron-siegel](https://github.com/aaron-siegel) in [#943](https://github.com/pixeltable/pixeltable/pull/943) * Improvements to retriable errors detection in RequestRateScheduler by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#922](https://github.com/pixeltable/pixeltable/pull/922) * Rename `DataFrame` to `Query` and `DataFrameResultSet` to `ResultSet` by [@aaron-siegel](https://github.com/aaron-siegel) in [#902](https://github.com/pixeltable/pixeltable/pull/902) * PXT-873: t.sample() fails on externalized array data by [@mkornacker](https://github.com/mkornacker) in [#945](https://github.com/pixeltable/pixeltable/pull/945) #### New Contributors * [@Xe](https://github.com/Xe) made their first contribution in [#935](https://github.com/pixeltable/pixeltable/pull/935) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.24...v0.5.0](https://github.com/pixeltable/pixeltable/compare/v0.4.24...v0.5.0) *** ### v0.4.24 **Released:** November 12, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.24](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.24) #### What's Changed * Update imagen model in tests and docs (3.0 is deprecated) by [@aaron-siegel](https://github.com/aaron-siegel) in [#929](https://github.com/pixeltable/pixeltable/pull/929) * Allow hyphens in table and dir names by [@aaron-siegel](https://github.com/aaron-siegel) in 
[#926](https://github.com/pixeltable/pixeltable/pull/926) * Skip download when replicating the same version of a table a second time by [@aaron-siegel](https://github.com/aaron-siegel) in [#927](https://github.com/pixeltable/pixeltable/pull/927) * Several fixes and improvements for data sharing by [@aaron-siegel](https://github.com/aaron-siegel) in [#928](https://github.com/pixeltable/pixeltable/pull/928) * PXT-862: bug fix for drop\_table() by [@mkornacker](https://github.com/mkornacker) in [#930](https://github.com/pixeltable/pixeltable/pull/930) * Various docs updates by [@aaron-siegel](https://github.com/aaron-siegel) in [#923](https://github.com/pixeltable/pixeltable/pull/923) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.23...v0.4.24](https://github.com/pixeltable/pixeltable/compare/v0.4.23...v0.4.24) *** ### v0.4.23 **Released:** November 11, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.23](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.23) #### What's Changed * Add PIXELTABLE\_API\_KEY to CI environment by [@aaron-siegel](https://github.com/aaron-siegel) in [#914](https://github.com/pixeltable/pixeltable/pull/914) * `create_store_tbls: bool` option in Catalog.create\_replica() by [@aaron-siegel](https://github.com/aaron-siegel) in [#916](https://github.com/pixeltable/pixeltable/pull/916) * \[PXT-380] Remove NamedFunction object and related code in named\_function.py by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#911](https://github.com/pixeltable/pixeltable/pull/911) * Switch to new random ops script in CI by [@aaron-siegel](https://github.com/aaron-siegel) in [#909](https://github.com/pixeltable/pixeltable/pull/909) * \[PXT-799] Allow setting `fps` greater than the framerate of the video in `FrameIterator` by [@aaron-siegel](https://github.com/aaron-siegel) in [#918](https://github.com/pixeltable/pixeltable/pull/918) * Intelligible error message 
when replicating a view of an existing original base table by [@aaron-siegel](https://github.com/aaron-siegel) in [#897](https://github.com/pixeltable/pixeltable/pull/897) * \[PXT-837] Support creating/inserting directly from an existing Table by [@aaron-siegel](https://github.com/aaron-siegel) in [#919](https://github.com/pixeltable/pixeltable/pull/919) * Add parameters to `make stresstest` by [@aaron-siegel](https://github.com/aaron-siegel) in [#920](https://github.com/pixeltable/pixeltable/pull/920) * Introduce "anchor tables" in TableVersion(Handle) for live replicas; working pull() by [@aaron-siegel](https://github.com/aaron-siegel) in [#917](https://github.com/pixeltable/pixeltable/pull/917) * Time travel for view over snapshot; replicas of view over snapshot by [@aaron-siegel](https://github.com/aaron-siegel) in [#924](https://github.com/pixeltable/pixeltable/pull/924) * Proper display of embeddings by [@aaron-siegel](https://github.com/aaron-siegel) in [#925](https://github.com/pixeltable/pixeltable/pull/925) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.22...v0.4.23](https://github.com/pixeltable/pixeltable/compare/v0.4.22...v0.4.23) *** ### v0.4.22 **Released:** November 04, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.22](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.22) #### What's Changed * Manage `additional_md` from Catalog, rather than TableVersion by [@aaron-siegel](https://github.com/aaron-siegel) in [#913](https://github.com/pixeltable/pixeltable/pull/913) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.21...v0.4.22](https://github.com/pixeltable/pixeltable/compare/v0.4.21...v0.4.22) *** ### v0.4.21 **Released:** November 03, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.21](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.21) #### What's Changed * Hotfix for bug when 
publishing older versions of a table by [@aaron-siegel](https://github.com/aaron-siegel) in [#910](https://github.com/pixeltable/pixeltable/pull/910) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.20...v0.4.21](https://github.com/pixeltable/pixeltable/compare/v0.4.20...v0.4.21) *** ### v0.4.20 **Released:** November 03, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.20](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.20) #### What's Changed * pyscenedetect udfs by [@mkornacker](https://github.com/mkornacker) in [#899](https://github.com/pixeltable/pixeltable/pull/899) * CockroachDB fixes + CI target by [@aaron-siegel](https://github.com/aaron-siegel) in [#900](https://github.com/pixeltable/pixeltable/pull/900) * Add protocol for replica operations. by [@amithadke](https://github.com/amithadke) in [#819](https://github.com/pixeltable/pixeltable/pull/819) * \[PXT-822, PXT-674] Fix for querying snapshots of tables with unstored columns by [@aaron-siegel](https://github.com/aaron-siegel) in [#895](https://github.com/pixeltable/pixeltable/pull/895) * Switch to using random\_tbl\_ops\_2 in stress-tests by [@aaron-siegel](https://github.com/aaron-siegel) in [#898](https://github.com/pixeltable/pixeltable/pull/898) * Fix nondeterminism in unit test by [@aaron-siegel](https://github.com/aaron-siegel) in [#905](https://github.com/pixeltable/pixeltable/pull/905) * \[PXT-817] UDFs for reve.com by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#901](https://github.com/pixeltable/pixeltable/pull/901) * \[PXT-826] Refactor index creation logic by [@aaron-siegel](https://github.com/aaron-siegel) in [#908](https://github.com/pixeltable/pixeltable/pull/908) * UV\_OPTS in Makefile by [@aaron-siegel](https://github.com/aaron-siegel) in [#896](https://github.com/pixeltable/pixeltable/pull/896) * Ignore additional\_mds when checking table or table version metadata by 
[@amithadke](https://github.com/amithadke) in [#903](https://github.com/pixeltable/pixeltable/pull/903) * \[PXT-786] push() and pull() implementations by [@amithadke](https://github.com/amithadke) in [#907](https://github.com/pixeltable/pixeltable/pull/907) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.19...v0.4.20](https://github.com/pixeltable/pixeltable/compare/v0.4.19...v0.4.20) *** ### v0.4.19 **Released:** October 29, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.19](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.19) #### What's Changed * Add image recipes to cookbook by [@apreshill](https://github.com/apreshill) in [#857](https://github.com/pixeltable/pixeltable/pull/857) * Add display-name to CI matrix (prep for testing global media destination) by [@aaron-siegel](https://github.com/aaron-siegel) in [#879](https://github.com/pixeltable/pixeltable/pull/879) * Enable all media destinations in CI by [@aaron-siegel](https://github.com/aaron-siegel) in [#876](https://github.com/pixeltable/pixeltable/pull/876) * \[PXT-814] UDF to encode a numpy array to an audio file by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#881](https://github.com/pixeltable/pixeltable/pull/881) * Convert notebooks to use YAML frontmatter and fix formatting issues by [@goodlux](https://github.com/goodlux) in [#880](https://github.com/pixeltable/pixeltable/pull/880) * Rename a public constant by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#884](https://github.com/pixeltable/pixeltable/pull/884) * Multi-phase create\_table() by [@mkornacker](https://github.com/mkornacker) in [#854](https://github.com/pixeltable/pixeltable/pull/854) * Initial integration of TwelveLabs Embed API by [@mkornacker](https://github.com/mkornacker) in [#885](https://github.com/pixeltable/pixeltable/pull/885) * Fix `pxt.__version__` by [@aaron-siegel](https://github.com/aaron-siegel) in 
[#887](https://github.com/pixeltable/pixeltable/pull/887) * Update many error messages for consistency by [@aaron-siegel](https://github.com/aaron-siegel) in [#869](https://github.com/pixeltable/pixeltable/pull/869) * Replace `Optional[T]` with `T | None` (Python 3.10 style) throughout the codebase by [@aaron-siegel](https://github.com/aaron-siegel) in [#888](https://github.com/pixeltable/pixeltable/pull/888) * Docs-related updates to Makefile and pyproject by [@aaron-siegel](https://github.com/aaron-siegel) in [#889](https://github.com/pixeltable/pixeltable/pull/889) * \[PXT-685] Add `recompute_columns()` to computed columns fundamentals notebook by [@aaron-siegel](https://github.com/aaron-siegel) in [#892](https://github.com/pixeltable/pixeltable/pull/892) * \[PXT-811, PXT-812] Improve two error messages with helpful hints by [@aaron-siegel](https://github.com/aaron-siegel) in [#891](https://github.com/pixeltable/pixeltable/pull/891) * Revert two uses of `Optional` in unit tests by [@aaron-siegel](https://github.com/aaron-siegel) in [#893](https://github.com/pixeltable/pixeltable/pull/893) * Dependency updates for Python 3.14 by [@aaron-siegel](https://github.com/aaron-siegel) in [#894](https://github.com/pixeltable/pixeltable/pull/894) * Azure support by [@aaron-siegel](https://github.com/aaron-siegel) in [#886](https://github.com/pixeltable/pixeltable/pull/886) * Default media destination as configuration parameter by [@aaron-siegel](https://github.com/aaron-siegel) in [#883](https://github.com/pixeltable/pixeltable/pull/883) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.18...v0.4.19](https://github.com/pixeltable/pixeltable/compare/v0.4.18...v0.4.19) *** ### v0.4.18 **Released:** October 22, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.18](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.18) #### What's Changed * Updates to nightly.yml by 
[@aaron-siegel](https://github.com/aaron-siegel) in [#866](https://github.com/pixeltable/pixeltable/pull/866) * Streamline CI configs on PRs by [@aaron-siegel](https://github.com/aaron-siegel) in [#858](https://github.com/pixeltable/pixeltable/pull/858) * Update WhisperX to >=3.7 and enable for Python 3.13 by [@aaron-siegel](https://github.com/aaron-siegel) in [#860](https://github.com/pixeltable/pixeltable/pull/860) * elements parameter for DocSplitter by [@mkornacker](https://github.com/mkornacker) in [#865](https://github.com/pixeltable/pixeltable/pull/865) * Fix examples docstring for add\_embedding\_index() by [@aaron-siegel](https://github.com/aaron-siegel) in [#871](https://github.com/pixeltable/pixeltable/pull/871) * Improvements to random\_tbl\_ops script by [@aaron-siegel](https://github.com/aaron-siegel) in [#868](https://github.com/pixeltable/pixeltable/pull/868) * Enforce `numpy>=2.2` by [@aaron-siegel](https://github.com/aaron-siegel) in [#872](https://github.com/pixeltable/pixeltable/pull/872) * Segmentation-related improvements by [@mkornacker](https://github.com/mkornacker) in [#873](https://github.com/pixeltable/pixeltable/pull/873) * Randomize the behavior of `sample()` in the case `seed=None` by [@aaron-siegel](https://github.com/aaron-siegel) in [#828](https://github.com/pixeltable/pixeltable/pull/828) * \[PXT-729] Documentation deploy scripts for Mintlify website and local development by [@goodlux](https://github.com/goodlux) in [#867](https://github.com/pixeltable/pixeltable/pull/867) * Properly reconstruct btree and vector indices when a replica is restored by [@aaron-siegel](https://github.com/aaron-siegel) in [#875](https://github.com/pixeltable/pixeltable/pull/875) * Fix various errors and typos in README and the notebooks by [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) in [#877](https://github.com/pixeltable/pixeltable/pull/877) * UDFs for Hugging Face Auto model integrations by 
[@aaron-siegel](https://github.com/aaron-siegel) in [#870](https://github.com/pixeltable/pixeltable/pull/870) #### New Contributors * [@sergey-mkhitaryan](https://github.com/sergey-mkhitaryan) made their first contribution in [#877](https://github.com/pixeltable/pixeltable/pull/877) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.17...v0.4.18](https://github.com/pixeltable/pixeltable/compare/v0.4.17...v0.4.18) *** ### v0.4.17 **Released:** October 16, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.17](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.17) #### What's Changed * Update model used by Together AI tests by [@aaron-siegel](https://github.com/aaron-siegel) in [#846](https://github.com/pixeltable/pixeltable/pull/846) * Fix broken links at the bottom of basics notebook by [@apreshill](https://github.com/apreshill) in [#844](https://github.com/pixeltable/pixeltable/pull/844) * Retry failed notebook tests once in CI by [@aaron-siegel](https://github.com/aaron-siegel) in [#830](https://github.com/pixeltable/pixeltable/pull/830) * feat(storage): add Backblaze B2 S3-compatible integration and tests by [@jeronimodeleon](https://github.com/jeronimodeleon) in [#840](https://github.com/pixeltable/pixeltable/pull/840) * cockroachDB: Set null\_ordered\_last on session start. by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#838](https://github.com/pixeltable/pixeltable/pull/838) * cockroachDB: Explicit coercions for arithmetic ops. 
by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#839](https://github.com/pixeltable/pixeltable/pull/839) * Fix for isolated NB tests in CI by [@aaron-siegel](https://github.com/aaron-siegel) in [#847](https://github.com/pixeltable/pixeltable/pull/847) * Notebook updates & OpenRouter notebook by [@aaron-siegel](https://github.com/aaron-siegel) in [#851](https://github.com/pixeltable/pixeltable/pull/851) * ffmpeg with libx264 by [@mkornacker](https://github.com/mkornacker) in [#855](https://github.com/pixeltable/pixeltable/pull/855) * Fixed incorrect documentation links by [@metadaddy](https://github.com/metadaddy) in [#859](https://github.com/pixeltable/pixeltable/pull/859) * Update pixeltable-pgserver dependency to 0.4.0 by [@aaron-siegel](https://github.com/aaron-siegel) in [#853](https://github.com/pixeltable/pixeltable/pull/853) * Support packaging of tables with embedding indices for data sharing by [@aaron-siegel](https://github.com/aaron-siegel) in [#841](https://github.com/pixeltable/pixeltable/pull/841) * mode 'accurate' for VideoSplitter and segment\_video() by [@mkornacker](https://github.com/mkornacker) in [#856](https://github.com/pixeltable/pixeltable/pull/856) * Added PDF-Page-Chunk-Extractor for image extraction (Issue 703) (PR 705) by [@kamir](https://github.com/kamir) in [#850](https://github.com/pixeltable/pixeltable/pull/850) * Formatting fixes by [@aaron-siegel](https://github.com/aaron-siegel) in [#862](https://github.com/pixeltable/pixeltable/pull/862) * Fix pyproject and mypy config by [@aaron-siegel](https://github.com/aaron-siegel) in [#863](https://github.com/pixeltable/pixeltable/pull/863) * Fixes for load\_replica\_md() with non-snapshot tables by [@aaron-siegel](https://github.com/aaron-siegel) in [#861](https://github.com/pixeltable/pixeltable/pull/861) * Correctly process cellmd in package/restore by [@aaron-siegel](https://github.com/aaron-siegel) in [#864](https://github.com/pixeltable/pixeltable/pull/864) #### New 
Contributors * [@jeronimodeleon](https://github.com/jeronimodeleon) made their first contribution in [#840](https://github.com/pixeltable/pixeltable/pull/840) * [@metadaddy](https://github.com/metadaddy) made their first contribution in [#859](https://github.com/pixeltable/pixeltable/pull/859) * [@kamir](https://github.com/kamir) made their first contribution in [#850](https://github.com/pixeltable/pixeltable/pull/850) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.16...v0.4.17](https://github.com/pixeltable/pixeltable/compare/v0.4.16...v0.4.17) *** ### v0.4.16 **Released:** October 08, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.16](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.16) #### What's Changed * Openrouter Integration by [@aaron-siegel](https://github.com/aaron-siegel) in [#825](https://github.com/pixeltable/pixeltable/pull/825) * Concurrency fixes & random\_tbl\_ops v2 by [@aaron-siegel](https://github.com/aaron-siegel) in [#814](https://github.com/pixeltable/pixeltable/pull/814) * Images and arrays in json structures, plus improved storage of array columns by [@mkornacker](https://github.com/mkornacker) in [#812](https://github.com/pixeltable/pixeltable/pull/812) * Minimal edits to docstrings. by [@goodlux](https://github.com/goodlux) in [#813](https://github.com/pixeltable/pixeltable/pull/813) * Add SDK documentation for Mintlify by [@goodlux](https://github.com/goodlux) in [#835](https://github.com/pixeltable/pixeltable/pull/835) * Fix for performance problem when importing HF datasets by [@mkornacker](https://github.com/mkornacker) in [#833](https://github.com/pixeltable/pixeltable/pull/833) * cockroachDB: div, mod operations SQL changed. 
Timestamp propagated through client stack by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#823](https://github.com/pixeltable/pixeltable/pull/823) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.15...v0.4.16](https://github.com/pixeltable/pixeltable/compare/v0.4.15...v0.4.16) *** ### v0.4.15 **Released:** October 01, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.15](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.15) #### What's Changed * Add a spot for the cookbook in docs/ by [@apreshill](https://github.com/apreshill) in [#815](https://github.com/pixeltable/pixeltable/pull/815) * Fixes for notebook tests resource cleanup by [@aaron-siegel](https://github.com/aaron-siegel) in [#827](https://github.com/pixeltable/pixeltable/pull/827) * Adding export\_lancedb() to API reference by [@mkornacker](https://github.com/mkornacker) in [#824](https://github.com/pixeltable/pixeltable/pull/824) * Replace `create_replica()` with separate `publish()` and `replicate()` methods by [@aaron-siegel](https://github.com/aaron-siegel) in [#816](https://github.com/pixeltable/pixeltable/pull/816) * PXT-638, PXT-675, PXT-682 Handle Keyboard exception by [@amithadke](https://github.com/amithadke) in [#803](https://github.com/pixeltable/pixeltable/pull/803) * PXT-772 Filling in missing docstrings by [@goodlux](https://github.com/goodlux) in [#822](https://github.com/pixeltable/pixeltable/pull/822) * with\_audio() udf by [@mkornacker](https://github.com/mkornacker) in [#826](https://github.com/pixeltable/pixeltable/pull/826) #### New Contributors * [@apreshill](https://github.com/apreshill) made their first contribution in [#815](https://github.com/pixeltable/pixeltable/pull/815) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.14...v0.4.15](https://github.com/pixeltable/pixeltable/compare/v0.4.14...v0.4.15) *** ### v0.4.14 **Released:** September 23, 2025\ **Author:** 
[@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.14](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.14) #### What's Changed * Proper implementation of package/restore for non-snapshot replicas by [@aaron-siegel](https://github.com/aaron-siegel) in [#797](https://github.com/pixeltable/pixeltable/pull/797) * Set up pydoclint by [@aaron-siegel](https://github.com/aaron-siegel) in [#805](https://github.com/pixeltable/pixeltable/pull/805) * upgrade mint.json -> docs.json by [@goodlux](https://github.com/goodlux) in [#809](https://github.com/pixeltable/pixeltable/pull/809) * Enable a destination parameter on stored computed columns. by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#766](https://github.com/pixeltable/pixeltable/pull/766) * Add support for running tests with cockroachdb as backend by [@amithadke](https://github.com/amithadke) in [#811](https://github.com/pixeltable/pixeltable/pull/811) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.13...v0.4.14](https://github.com/pixeltable/pixeltable/compare/v0.4.13...v0.4.14) *** ### v0.4.13 **Released:** September 19, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.13](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.13) #### What's Changed * Added pxt.io.export\_lancedb() by [@mkornacker](https://github.com/mkornacker) in [#795](https://github.com/pixeltable/pixeltable/pull/795) * Update README.md by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#801](https://github.com/pixeltable/pixeltable/pull/801) * Use raw\.githubusercontent.com instead of raw\.github.com in tests by [@aaron-siegel](https://github.com/aaron-siegel) in [#806](https://github.com/pixeltable/pixeltable/pull/806) * Simplify & generalize TableDataSource types by [@aaron-siegel](https://github.com/aaron-siegel) in [#804](https://github.com/pixeltable/pixeltable/pull/804) * Short Sample App: CLI Media Toolkit for 
Multimodal Data Processing by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#802](https://github.com/pixeltable/pixeltable/pull/802) * Table.get\_versions() by [@aaron-siegel](https://github.com/aaron-siegel) in [#800](https://github.com/pixeltable/pixeltable/pull/800) * Fixes for nightly CI by [@aaron-siegel](https://github.com/aaron-siegel) in [#807](https://github.com/pixeltable/pixeltable/pull/807) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.12...v0.4.13](https://github.com/pixeltable/pixeltable/compare/v0.4.12...v0.4.13) *** ### v0.4.12 **Released:** September 05, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.12](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.12) #### What's Changed * Update model used by groq tests and examples by [@aaron-siegel](https://github.com/aaron-siegel) in [#790](https://github.com/pixeltable/pixeltable/pull/790) * Clear TempStore, MediaStore, and HF cache after each test in CI by [@aaron-siegel](https://github.com/aaron-siegel) in [#792](https://github.com/pixeltable/pixeltable/pull/792) * Explicitly install pixeltable in run-isolated-nb-tests.sh by [@aaron-siegel](https://github.com/aaron-siegel) in [#794](https://github.com/pixeltable/pixeltable/pull/794) * Handle incomplete rate limit headers better by [@mkornacker](https://github.com/mkornacker) in [#788](https://github.com/pixeltable/pixeltable/pull/788) * SDK changes/fixes for data sharing by [@aaron-siegel](https://github.com/aaron-siegel) in [#791](https://github.com/pixeltable/pixeltable/pull/791) * Disable TestWhisperx on Linux w/ GPU by [@mkornacker](https://github.com/mkornacker) in [#789](https://github.com/pixeltable/pixeltable/pull/789) * recompute\_columns(): added where parameter by [@mkornacker](https://github.com/mkornacker) in [#787](https://github.com/pixeltable/pixeltable/pull/787) **Full Changelog**: 
[https://github.com/pixeltable/pixeltable/compare/v0.4.11...v0.4.12](https://github.com/pixeltable/pixeltable/compare/v0.4.11...v0.4.12) *** ### v0.4.11 **Released:** August 29, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.11](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.11) #### What's Changed * missing .md for VideoSplitter by [@mkornacker](https://github.com/mkornacker) in [#784](https://github.com/pixeltable/pixeltable/pull/784) * CI & dev environment enhancements by [@aaron-siegel](https://github.com/aaron-siegel) in [#785](https://github.com/pixeltable/pixeltable/pull/785) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.10...v0.4.11](https://github.com/pixeltable/pixeltable/compare/v0.4.10...v0.4.11) *** ### v0.4.10 **Released:** August 28, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.10](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.10) #### What's Changed * Fix local\_public\_names() to properly exclude private functions by [@goodlux](https://github.com/goodlux) in [#778](https://github.com/pixeltable/pixeltable/pull/778) * Add .DS\_Store to .gitignore by [@goodlux](https://github.com/goodlux) in [#779](https://github.com/pixeltable/pixeltable/pull/779) * More video built-ins by [@mkornacker](https://github.com/mkornacker) in [#768](https://github.com/pixeltable/pixeltable/pull/768) * Add missing `__all__` to gemini and whisper modules by [@aaron-siegel](https://github.com/aaron-siegel) in [#781](https://github.com/pixeltable/pixeltable/pull/781) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.9...v0.4.10](https://github.com/pixeltable/pixeltable/compare/v0.4.9...v0.4.10) *** ### v0.4.9 **Released:** August 27, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.9](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.9) #### What's
Changed * WhisperX Speaker Diarization by [@aaron-siegel](https://github.com/aaron-siegel) in [#770](https://github.com/pixeltable/pixeltable/pull/770) * Basic support for concurrent pixeltable metadata creation/upgrade by [@amithadke](https://github.com/amithadke) in [#769](https://github.com/pixeltable/pixeltable/pull/769) * Support for pydantic models in Table.insert() by [@mkornacker](https://github.com/mkornacker) in [#760](https://github.com/pixeltable/pixeltable/pull/760) * Add comments for concurrent pixeltable initialization changes by [@amithadke](https://github.com/amithadke) in [#772](https://github.com/pixeltable/pixeltable/pull/772) * Disable notebook tests that are failing in CI for unknown reasons by [@aaron-siegel](https://github.com/aaron-siegel) in [#777](https://github.com/pixeltable/pixeltable/pull/777) * Publish the existing mypy plugin under `pixeltable.mypy` module to make it accessible for external use. by [@amithadke](https://github.com/amithadke) in [#776](https://github.com/pixeltable/pixeltable/pull/776) * Remove `ext` package and fold contents into `functions` by [@aaron-siegel](https://github.com/aaron-siegel) in [#775](https://github.com/pixeltable/pixeltable/pull/775) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.8...v0.4.9](https://github.com/pixeltable/pixeltable/compare/v0.4.8...v0.4.9) *** ### v0.4.8 **Released:** August 20, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.8](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.8) #### What's Changed * Performance test for chat completion integrations by [@mkornacker](https://github.com/mkornacker) in [#746](https://github.com/pixeltable/pixeltable/pull/746) * Bugfixes related to missing dependencies by [@aaron-siegel](https://github.com/aaron-siegel) in [#747](https://github.com/pixeltable/pixeltable/pull/747) * Makefile and pytest improvements by [@aaron-siegel](https://github.com/aaron-siegel) in 
[#753](https://github.com/pixeltable/pixeltable/pull/753) * Update dev version of onnx by [@aaron-siegel](https://github.com/aaron-siegel) in [#755](https://github.com/pixeltable/pixeltable/pull/755) * Pytest configuration fix by [@aaron-siegel](https://github.com/aaron-siegel) in [#756](https://github.com/pixeltable/pixeltable/pull/756) * RequestRateScheduler improvements by [@mkornacker](https://github.com/mkornacker) in [#752](https://github.com/pixeltable/pixeltable/pull/752) * Update README.md by [@aaron-siegel](https://github.com/aaron-siegel) in [#754](https://github.com/pixeltable/pixeltable/pull/754) * Updating tutorial notebook to use Table.recompute\_columns(). by [@mkornacker](https://github.com/mkornacker) in [#757](https://github.com/pixeltable/pixeltable/pull/757) * Changes to pixeltable shared client for R2 support. by [@amithadke](https://github.com/amithadke) in [#653](https://github.com/pixeltable/pixeltable/pull/653) * Fix README spacing and linting issues by [@aaron-siegel](https://github.com/aaron-siegel) in [#759](https://github.com/pixeltable/pixeltable/pull/759) * Move stored\_img\_cols from ExecNode To RowBuilder, add stored\_media\_cols by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#749](https://github.com/pixeltable/pixeltable/pull/749) * Group local media file operations into a MediaStore or TempStore class by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#748](https://github.com/pixeltable/pixeltable/pull/748) * Correct construction of two row\_builder members. 
by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#761](https://github.com/pixeltable/pixeltable/pull/761) * PXT-661 PXT-662 Adding checks for dropping column used by view predicates by [@amithadke](https://github.com/amithadke) in [#751](https://github.com/pixeltable/pixeltable/pull/751) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.7...v0.4.8](https://github.com/pixeltable/pixeltable/compare/v0.4.7...v0.4.8) *** ### v0.4.7 **Released:** August 04, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.7](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.7) #### What's Changed * Consolidate ColumnMd operations into from\_md() and to\_md(). by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#715](https://github.com/pixeltable/pixeltable/pull/715) * Update README.md + Changelog by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#727](https://github.com/pixeltable/pixeltable/pull/727) * Consolidate all store\_table row prep into DataRow\.create\_store\_table\_row. 
by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#723](https://github.com/pixeltable/pixeltable/pull/723) * More rigor in UDF evolution tests by [@aaron-siegel](https://github.com/aaron-siegel) in [#728](https://github.com/pixeltable/pixeltable/pull/728) * Rerun tests that fail due to concurrency conflicts by [@aaron-siegel](https://github.com/aaron-siegel) in [#737](https://github.com/pixeltable/pixeltable/pull/737) * Replace most uses of `Union[]` with Python 3.10-style unions by [@aaron-siegel](https://github.com/aaron-siegel) in [#735](https://github.com/pixeltable/pixeltable/pull/735) * Extend FrameIterator to output all available frame attributes by [@mkornacker](https://github.com/mkornacker) in [#716](https://github.com/pixeltable/pixeltable/pull/716) * Clean up pytest output by [@aaron-siegel](https://github.com/aaron-siegel) in [#740](https://github.com/pixeltable/pixeltable/pull/740) * Introduce `TypedDict`s for user-facing table, dir, column, and index metadata by [@aaron-siegel](https://github.com/aaron-siegel) in [#739](https://github.com/pixeltable/pixeltable/pull/739) * get\_dir\_contents(), a more structured replacement for list\_tables() / list\_dirs() by [@aaron-siegel](https://github.com/aaron-siegel) in [#742](https://github.com/pixeltable/pixeltable/pull/742) * Test cleanup by [@aaron-siegel](https://github.com/aaron-siegel) in [#743](https://github.com/pixeltable/pixeltable/pull/743) * Prefer public API in tests by [@aaron-siegel](https://github.com/aaron-siegel) in [#744](https://github.com/pixeltable/pixeltable/pull/744) * Catching missing sqlalchemy transaction-related exceptions by [@mkornacker](https://github.com/mkornacker) in [#745](https://github.com/pixeltable/pixeltable/pull/745) * PXT-668: Remove unneeded test\_sample\_md5\_fraction. 
by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#750](https://github.com/pixeltable/pixeltable/pull/750) * PXT-671: fixes to RateLimitsScheduler by [@mkornacker](https://github.com/mkornacker) in [#741](https://github.com/pixeltable/pixeltable/pull/741) * make\_video API Doc by [@pierrebrunelle](https://github.com/pierrebrunelle) in [#736](https://github.com/pixeltable/pixeltable/pull/736) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.6...v0.4.7](https://github.com/pixeltable/pixeltable/compare/v0.4.6...v0.4.7) *** ### v0.4.6 **Released:** July 24, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.6](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.6) #### What's Changed * Migrate from `poetry` to `uv` by [@aaron-siegel](https://github.com/aaron-siegel) in [#722](https://github.com/pixeltable/pixeltable/pull/722) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.5...v0.4.6](https://github.com/pixeltable/pixeltable/compare/v0.4.5...v0.4.6) *** ### v0.4.5 **Released:** July 24, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.5](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.5) #### What's Changed * Consolidate more MediaStore operations - part 3 by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#701](https://github.com/pixeltable/pixeltable/pull/701) * Working Python 3.13 dev installation by [@aaron-siegel](https://github.com/aaron-siegel) in [#695](https://github.com/pixeltable/pixeltable/pull/695) * Replace uses of sql.text() in catalog.py with idiomatic SQLAlchemy by [@aaron-siegel](https://github.com/aaron-siegel) in [#707](https://github.com/pixeltable/pixeltable/pull/707) * Move some column summary information into RowBuilder. 
by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#711](https://github.com/pixeltable/pixeltable/pull/711) * DataFrameResultSet.to\_pydantic() by [@mkornacker](https://github.com/mkornacker) in [#713](https://github.com/pixeltable/pixeltable/pull/713) * PXT-667: Write media files to MediaStore with correct version. by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#714](https://github.com/pixeltable/pixeltable/pull/714) * Time travel by [@aaron-siegel](https://github.com/aaron-siegel) in [#710](https://github.com/pixeltable/pixeltable/pull/710) * Correct the table.history status report for newly created views. by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#719](https://github.com/pixeltable/pixeltable/pull/719) * Include all columns in packager data preview by [@aaron-siegel](https://github.com/aaron-siegel) in [#720](https://github.com/pixeltable/pixeltable/pull/720) * Communicate Column spec for all MediaStore save and move operations by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#718](https://github.com/pixeltable/pixeltable/pull/718) * Further simplify DataRowBatch. by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#724](https://github.com/pixeltable/pixeltable/pull/724) * Use the method plan.\_insert\_prefetch\_node everywhere. 
by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#721](https://github.com/pixeltable/pixeltable/pull/721) * Support Python 3.10 style union types by [@aaron-siegel](https://github.com/aaron-siegel) in [#726](https://github.com/pixeltable/pixeltable/pull/726) * Additional config parameters + more flexible rate limit parsing for Azure OpenAI support by [@aaron-siegel](https://github.com/aaron-siegel) in [#725](https://github.com/pixeltable/pixeltable/pull/725) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.4...v0.4.5](https://github.com/pixeltable/pixeltable/compare/v0.4.4...v0.4.5) *** ### v0.4.4 **Released:** July 16, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.4](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.4) #### What's Changed * Consolidate MediaStore file operations, including temp file name creation by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#694](https://github.com/pixeltable/pixeltable/pull/694) * Update google-genai dev dependency by [@aaron-siegel](https://github.com/aaron-siegel) in [#699](https://github.com/pixeltable/pixeltable/pull/699) * CI changes for random-tbl-ops by [@aaron-siegel](https://github.com/aaron-siegel) in [#697](https://github.com/pixeltable/pixeltable/pull/697) * schema\_overrides bugfixes by [@aaron-siegel](https://github.com/aaron-siegel) in [#700](https://github.com/pixeltable/pixeltable/pull/700) * Load replicas as views by [@aaron-siegel](https://github.com/aaron-siegel) in [#696](https://github.com/pixeltable/pixeltable/pull/696) * Multi-phase transactions by [@mkornacker](https://github.com/mkornacker) in [#692](https://github.com/pixeltable/pixeltable/pull/692) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.3...v0.4.4](https://github.com/pixeltable/pixeltable/compare/v0.4.3...v0.4.4) *** ### v0.4.3 **Released:** July 10, 2025\ **Author:** 
[@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.3](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.3) #### What's Changed * Allow config parameters to be specified in `pxt.init()` by [@aaron-siegel](https://github.com/aaron-siegel) in [#680](https://github.com/pixeltable/pixeltable/pull/680) * Prepare to report more status in table.history() by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#682](https://github.com/pixeltable/pixeltable/pull/682) * `pxt.ls()` command for pretty-printing all contents of a Pixeltable dir by [@aaron-siegel](https://github.com/aaron-siegel) in [#681](https://github.com/pixeltable/pixeltable/pull/681) * Handle 429 errors in RateLimitScheduler by [@mkornacker](https://github.com/mkornacker) in [#670](https://github.com/pixeltable/pixeltable/pull/670) * Support dicts and Sequences of dicts in HF datasets \[rough-edges PXT-640] by [@aaron-siegel](https://github.com/aaron-siegel) in [#684](https://github.com/pixeltable/pixeltable/pull/684) * Allow packaging of non-snapshot tables in TablePackager by [@aaron-siegel](https://github.com/aaron-siegel) in [#688](https://github.com/pixeltable/pixeltable/pull/688) * Use a JSON field xxx\_cellmd in place of xxx\_errortype and xxx\_errormsg by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#685](https://github.com/pixeltable/pixeltable/pull/685) * Consolidate media operations in the MediaStore module by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#691](https://github.com/pixeltable/pixeltable/pull/691) * Enhance UpdateStatus to subsume SyncStatus. Save user and UpdateStatus in a field in TableVersionMd. 
by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#689](https://github.com/pixeltable/pixeltable/pull/689) * Refactor create\_replica to conform to concurrency protocol by [@aaron-siegel](https://github.com/aaron-siegel) in [#690](https://github.com/pixeltable/pixeltable/pull/690) * Add additional packages & task configurations to nightly.yml by [@aaron-siegel](https://github.com/aaron-siegel) in [#693](https://github.com/pixeltable/pixeltable/pull/693) * Doc fixes for audio and video UDFs by [@aaron-siegel](https://github.com/aaron-siegel) in [#698](https://github.com/pixeltable/pixeltable/pull/698) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.2...v0.4.3](https://github.com/pixeltable/pixeltable/compare/v0.4.2...v0.4.3) *** ### v0.4.2 **Released:** June 27, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.2](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.2) #### What's Changed * Revert various accumulated workarounds in CI by [@aaron-siegel](https://github.com/aaron-siegel) in [#669](https://github.com/pixeltable/pixeltable/pull/669) * Use ColumnHandles in external stores by [@aaron-siegel](https://github.com/aaron-siegel) in [#664](https://github.com/pixeltable/pixeltable/pull/664) * Update versions of a few more libraries by [@aaron-siegel](https://github.com/aaron-siegel) in [#668](https://github.com/pixeltable/pixeltable/pull/668) * First part of additional status collection for table.history reporting. by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#654](https://github.com/pixeltable/pixeltable/pull/654) * Add table.history() method to return a user-readable list of known changes to a table. 
by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#640](https://github.com/pixeltable/pixeltable/pull/640) * Added Table.recompute\_columns() by [@mkornacker](https://github.com/mkornacker) in [#667](https://github.com/pixeltable/pixeltable/pull/667) * Collect more information on ins, del, upd operations. Freeze UpdateStatus. by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#673](https://github.com/pixeltable/pixeltable/pull/673) * Refactor SyncStatus for merge with UpdateStatus. by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#674](https://github.com/pixeltable/pixeltable/pull/674) * Adding recompute\_columns() to overview in table.md by [@mkornacker](https://github.com/mkornacker) in [#675](https://github.com/pixeltable/pixeltable/pull/675) * CI workflow for random table ops by [@aaron-siegel](https://github.com/aaron-siegel) in [#676](https://github.com/pixeltable/pixeltable/pull/676) * \~40% improvement in insert performance by [@aaron-siegel](https://github.com/aaron-siegel) in [#658](https://github.com/pixeltable/pixeltable/pull/658) * Skip whisperx on t4 instances by [@aaron-siegel](https://github.com/aaron-siegel) in [#678](https://github.com/pixeltable/pixeltable/pull/678) * Pretty-print update status in notebooks or IPython shells by [@aaron-siegel](https://github.com/aaron-siegel) in [#677](https://github.com/pixeltable/pixeltable/pull/677) * Performance improvements in add\_computed\_column by [@aaron-siegel](https://github.com/aaron-siegel) in [#679](https://github.com/pixeltable/pixeltable/pull/679) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.1...v0.4.2](https://github.com/pixeltable/pixeltable/compare/v0.4.1...v0.4.2) *** ### v0.4.1 **Released:** June 19, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.1](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.1) #### What's Changed * Docs/update model kwargs by 
[@jacobweiss2305](https://github.com/jacobweiss2305) in [#662](https://github.com/pixeltable/pixeltable/pull/662) * Fixes and improvements for nightly CI job by [@aaron-siegel](https://github.com/aaron-siegel) in [#665](https://github.com/pixeltable/pixeltable/pull/665) * Docs/changelog v0.4.0 by [@jacobweiss2305](https://github.com/jacobweiss2305) in [#663](https://github.com/pixeltable/pixeltable/pull/663) * Update dev versions of many libraries used by Pixeltable by [@aaron-siegel](https://github.com/aaron-siegel) in [#666](https://github.com/pixeltable/pixeltable/pull/666) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.0...v0.4.1](https://github.com/pixeltable/pixeltable/compare/v0.4.0...v0.4.1) *** ### v0.4.0 **Released:** June 16, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.0](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.0) #### Highlights * Support for concurrent insert/query and table/view operations * `sample()` operator for deterministic, pseudo-random samples of tables and data frames * More flexible API for optional LLM parameters * Groq integration * MCP integration * HEIC image support * Numerous bugfixes #### All Changes * Support for concurrent table operations by [@mkornacker](https://github.com/mkornacker) in [#611](https://github.com/pixeltable/pixeltable/pull/611) * New Deepseek notebook by [@aaron-siegel](https://github.com/aaron-siegel) in [#634](https://github.com/pixeltable/pixeltable/pull/634) * Re-enable 3 of the 4 disabled Labelstudio tests by [@mkornacker](https://github.com/mkornacker) in [#635](https://github.com/pixeltable/pixeltable/pull/635) * Implement `to_sql` for many string methods by [@aaron-siegel](https://github.com/aaron-siegel) in [#636](https://github.com/pixeltable/pixeltable/pull/636) * Remove extraneous reload\_catalog() in test\_packager by [@aaron-siegel](https://github.com/aaron-siegel) in 
[#637](https://github.com/pixeltable/pixeltable/pull/637) * fix building with llm link by [@jacobweiss2305](https://github.com/jacobweiss2305) in [#638](https://github.com/pixeltable/pixeltable/pull/638) * Allow HEIC images by [@aaron-siegel](https://github.com/aaron-siegel) in [#639](https://github.com/pixeltable/pixeltable/pull/639) * Include preview data in request when publishing a table by [@aaron-siegel](https://github.com/aaron-siegel) in [#631](https://github.com/pixeltable/pixeltable/pull/631) * WIP: stratified sampling operation on DataFrame by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#591](https://github.com/pixeltable/pixeltable/pull/591) * remove main reference and replace with release by [@jacobweiss2305](https://github.com/jacobweiss2305) in [#646](https://github.com/pixeltable/pixeltable/pull/646) * docs: add product updates changelog with version history and release notes by [@jacobweiss2305](https://github.com/jacobweiss2305) in [#645](https://github.com/pixeltable/pixeltable/pull/645) * remove print statement in gemini tool calls by [@jacobweiss2305](https://github.com/jacobweiss2305) in [#651](https://github.com/pixeltable/pixeltable/pull/651) * PXT-595: Raise error if attempting to access metadata from a future v… by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#642](https://github.com/pixeltable/pixeltable/pull/642) * Make TableVersion timestamps consistent across propagated changes. 
by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#643](https://github.com/pixeltable/pixeltable/pull/643) * Update RowBuilder.create\_table\_raw to save PIL image with the jpeg extension by [@Yann-CV](https://github.com/Yann-CV) in [#648](https://github.com/pixeltable/pixeltable/pull/648) * Fix bug in handling "nullary" JsonMapper expressions by [@aaron-siegel](https://github.com/aaron-siegel) in [#655](https://github.com/pixeltable/pixeltable/pull/655) * Update release.sh to handle pre-releases by [@aaron-siegel](https://github.com/aaron-siegel) in [#656](https://github.com/pixeltable/pixeltable/pull/656) * Refactor inference API integrations to use `model_kwargs` dicts instead of explicit parameters by [@aaron-siegel](https://github.com/aaron-siegel) in [#641](https://github.com/pixeltable/pixeltable/pull/641) * Refactor tool invocation unit tests \[techdebt] by [@aaron-siegel](https://github.com/aaron-siegel) in [#657](https://github.com/pixeltable/pixeltable/pull/657) * Concurrent view interactions by [@mkornacker](https://github.com/mkornacker) in [#652](https://github.com/pixeltable/pixeltable/pull/652) * Consolidate all SQL generation related to sampling inside of SqlSampleNode by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#649](https://github.com/pixeltable/pixeltable/pull/649) * Suppressing asyncio slow callback warnings by [@mkornacker](https://github.com/mkornacker) in [#660](https://github.com/pixeltable/pixeltable/pull/660) * Groq integration by [@aaron-siegel](https://github.com/aaron-siegel) in [#659](https://github.com/pixeltable/pixeltable/pull/659) * First cut at MCP integration by [@aaron-siegel](https://github.com/aaron-siegel) in [#661](https://github.com/pixeltable/pixeltable/pull/661) #### New Contributors * [@Yann-CV](https://github.com/Yann-CV) made their first contribution in [#648](https://github.com/pixeltable/pixeltable/pull/648) **Full Changelog**: 
[https://github.com/pixeltable/pixeltable/compare/v0.3.15...v0.4.0](https://github.com/pixeltable/pixeltable/compare/v0.3.15...v0.4.0) *** ### v0.4.0-pre.3 **Released:** June 10, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.0-pre.3](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.0-pre.3) #### What's Changed * Update release.sh to handle pre-releases by [@aaron-siegel](https://github.com/aaron-siegel) in [#656](https://github.com/pixeltable/pixeltable/pull/656) * Refactor inference API integrations to use `model_kwargs` dicts instead of explicit parameters by [@aaron-siegel](https://github.com/aaron-siegel) in [#641](https://github.com/pixeltable/pixeltable/pull/641) * Refactor tool invocation unit tests \[techdebt] by [@aaron-siegel](https://github.com/aaron-siegel) in [#657](https://github.com/pixeltable/pixeltable/pull/657) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.4.0-pre.2...v0.4.0-pre.3](https://github.com/pixeltable/pixeltable/compare/v0.4.0-pre.2...v0.4.0-pre.3) *** ### v0.4.0-pre.2 **Released:** June 07, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.0-pre.2](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.0-pre.2) #### What's Changed * fix building with llm link by [@jacobweiss2305](https://github.com/jacobweiss2305) in [#638](https://github.com/pixeltable/pixeltable/pull/638) * Allow HEIC images by [@aaron-siegel](https://github.com/aaron-siegel) in [#639](https://github.com/pixeltable/pixeltable/pull/639) * Include preview data in request when publishing a table by [@aaron-siegel](https://github.com/aaron-siegel) in [#631](https://github.com/pixeltable/pixeltable/pull/631) * WIP: stratified sampling operation on DataFrame by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#591](https://github.com/pixeltable/pixeltable/pull/591) * remove main reference and replace with release by 
[@jacobweiss2305](https://github.com/jacobweiss2305) in [#646](https://github.com/pixeltable/pixeltable/pull/646) * docs: add product updates changelog with version history and release notes by [@jacobweiss2305](https://github.com/jacobweiss2305) in [#645](https://github.com/pixeltable/pixeltable/pull/645) * remove print statement in gemini tool calls by [@jacobweiss2305](https://github.com/jacobweiss2305) in [#651](https://github.com/pixeltable/pixeltable/pull/651) * PXT-595: Raise error if attempting to access metadata from a future v… by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#642](https://github.com/pixeltable/pixeltable/pull/642) * Make TableVersion timestamps consistent across propagated changes. by [@jpeterson-pxt](https://github.com/jpeterson-pxt) in [#643](https://github.com/pixeltable/pixeltable/pull/643) * Update RowBuilder.create\_table\_raw to save PIL image with the jpeg extension by [@Yann-CV](https://github.com/Yann-CV) in [#648](https://github.com/pixeltable/pixeltable/pull/648) * Fix bug in handling "nullary" JsonMapper expressions by [@aaron-siegel](https://github.com/aaron-siegel) in [#655](https://github.com/pixeltable/pixeltable/pull/655) #### New Contributors * [@Yann-CV](https://github.com/Yann-CV) made their first contribution in [#648](https://github.com/pixeltable/pixeltable/pull/648) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.3.15...v0.4.0-pre.2](https://github.com/pixeltable/pixeltable/compare/v0.3.15...v0.4.0-pre.2) *** ### v0.4.0-pre.1 **Released:** May 28, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.4.0-pre.1](https://github.com/pixeltable/pixeltable/releases/tag/v0.4.0-pre.1) #### What's Changed * Support for concurrent table operations by [@mkornacker](https://github.com/mkornacker) in [#611](https://github.com/pixeltable/pixeltable/pull/611) * New Deepseek notebook by [@aaron-siegel](https://github.com/aaron-siegel) in 
[#634](https://github.com/pixeltable/pixeltable/pull/634) * Re-enable 3 of the 4 disabled Labelstudio tests by [@mkornacker](https://github.com/mkornacker) in [#635](https://github.com/pixeltable/pixeltable/pull/635) * Implement `to_sql` for many string methods by [@aaron-siegel](https://github.com/aaron-siegel) in [#636](https://github.com/pixeltable/pixeltable/pull/636) * Remove extraneous reload\_catalog() in test\_packager by [@aaron-siegel](https://github.com/aaron-siegel) in [#637](https://github.com/pixeltable/pixeltable/pull/637) **Full Changelog**: [https://github.com/pixeltable/pixeltable/compare/v0.3.15...v0.4.0-pre.1](https://github.com/pixeltable/pixeltable/compare/v0.3.15...v0.4.0-pre.1) *** ### v0.3.15 **Released:** May 25, 2025\ **Author:** [@aaron-siegel](https://github.com/aaron-siegel)\ **View on GitHub:** [v0.3.15](https://github.com/pixeltable/pixeltable/releases/tag/v0.3.15) #### What's Changed * Rename blueprint links to guides in pixelagent documentation by [@jacobweiss2305](https://github.com/jacobweiss2305) in [#628](https://github.com/pixeltable/pixeltable/pull/628) * Add documentation for embedding\_access feature by [@jacobweiss2305](https://github.com/jacobweiss2305) in [#626](https://github.com/pixeltable/pixeltable/pull/626) * Improve import documentation by [@jacobweiss2305](https://github.com/jacobweiss2305) in [#624](https://github.com/pixeltable/pixeltable/pull/624) * Update mint.json to use Kandinsky color theme by [@jacobweiss2305](https://github.com/jacobweiss2305) in [#633](https://github.com/pixeltable/pixeltable/pull/633) * Merge different versions of base tables consistently when pulling replicas by [@aaron-siegel](https://github.com/aaron-siegel) in [#625](https://github.com/pixeltable/pixeltable/pull/625) * Add UDFs for Google Imagen and Veo; Support Tool Calling in Gemini by [@aaron-siegel](https://github.com/aaron-siegel) in [#632](https://github.com/pixeltable/pixeltable/pull/632) **Full Changelog**: 
[https://github.com/pixeltable/pixeltable/compare/v0.3.14...v0.3.15](https://github.com/pixeltable/pixeltable/compare/v0.3.14...v0.3.15) ***

# Agentic Patterns

Source: https://docs.pixeltable.com/howto/cookbooks/agents/agentic-patterns

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation.

Two popular taxonomies describe the building blocks of agentic AI systems:

* **Cognitive / reasoning-oriented** (Taxonomy 1): Reflection, Tool Use, ReAct, Planning, Multi-Agent — asks *"how does the agent think?"*
* **Architectural / system-design-oriented** (Taxonomy 2): Prompt Chaining, Routing, Parallelization, Tool Use, Evaluator-Optimizer, Orchestrator-Worker — asks *"how do you wire LLM calls together?"*

(See [OpenAI's Practical Guide to Building Agents](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf), [Anthropic's multi-agent research system](https://www.anthropic.com/engineering/multi-agent-research-system), and [Pydantic AI's multi-agent delegation](https://ai.pydantic.dev/multi-agent-applications/#agent-delegation).) Mapping them against each other reveals substantial overlap.
The cleanest framing: **six architectural patterns** that describe how you structure LLM calls, plus **two cross-cutting reasoning strategies** (ReAct and Planning) that can be layered inside any of them. This cookbook implements all eight in Pixeltable, where your agent *is* a table.
## Setup

```python theme={null}
%pip install -qU pixeltable openai
```

```python theme={null}
import getpass
import os

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
```

```python theme={null}
import pixeltable as pxt
from pixeltable.functions import openai

pxt.drop_dir('agentic_patterns', force=True)
pxt.create_dir('agentic_patterns')
```
  Created directory 'agentic\_patterns'.
## Pattern 1: Prompt Chaining

Break a complex task into sequential steps, where each step's output feeds the next.

**Imperative approach:** a chain of function calls or an explicit pipeline object.

**Pixeltable approach:** each step is a computed column. The engine resolves dependencies automatically.
  input → step 1 (outline) → step 2 (draft) → step 3 (polish) → output
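For contrast, the imperative version of this chain might look like the following sketch. Everything here is illustrative: `llm` is a stub standing in for a real chat-completion call, and `write_article` is a hypothetical helper, not part of Pixeltable.

```python
def llm(prompt: str) -> str:
    # Stub standing in for a real chat-completion call.
    return f'[response to: {prompt.splitlines()[0]}]'

def write_article(topic: str) -> dict:
    # Each step's output feeds the next; intermediates must be
    # captured by hand if we want to keep them.
    outline = llm('Create a 3-point outline for a short article about: ' + topic)
    draft = llm('Write a short article based on this outline:\n\n' + outline)
    final = llm('Edit this article for clarity and conciseness:\n\n' + draft)
    return {'outline': outline, 'draft': draft, 'final_article': final}

result = write_article('the benefits of declarative AI pipelines')
```

In the imperative version, persistence, retries, and incremental reruns are the caller's problem; in the Pixeltable version below, each step becomes a stored computed column instead.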
```python theme={null}
# Create a table with a single input column
chain = pxt.create_table('agentic_patterns/chain', {'topic': pxt.String})
```
  Created table 'chain'.
```python theme={null}
# Step 1: generate an outline
chain.add_computed_column(
    outline_response=openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': 'Create a 3-point outline for a short article about: '
                + chain.topic,
            }
        ],
        model='gpt-4o-mini',
    )
)
chain.add_computed_column(
    outline=chain.outline_response.choices[0].message.content.astype(
        pxt.String
    )
)
```
  Added 0 column values with 0 errors in 0.00 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
# Step 2: write a draft from the outline
chain.add_computed_column(
    draft_response=openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': 'Write a short article (2-3 paragraphs) based on this outline:\n\n'
                + chain.outline,
            }
        ],
        model='gpt-4o-mini',
    )
)
chain.add_computed_column(
    draft=chain.draft_response.choices[0].message.content.astype(
        pxt.String
    )
)
```
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
# Step 3: polish the draft
chain.add_computed_column(
    polish_response=openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': 'Edit this article for clarity and conciseness. '
                'Return only the improved text:\n\n' + chain.draft,
            }
        ],
        model='gpt-4o-mini',
    )
)
chain.add_computed_column(
    final_article=chain.polish_response.choices[0].message.content.astype(
        pxt.String
    )
)
```
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
# Insert a topic — all three steps execute automatically
chain.insert([{'topic': 'the benefits of declarative AI pipelines'}])
chain.select(chain.topic, chain.outline, chain.draft, chain.final_article).collect()
```
  Inserted 1 row with 0 errors in 14.58 s (0.07 rows/s)
Every intermediate result (`outline`, `draft`, `final_article`) is persisted in the table. Inserting another topic reuses the same pipeline — no code changes needed. If the same topic is inserted again, cached results are returned instantly.

## Pattern 2: Routing

Classify an input and route it to a specialized handler. This is the agent equivalent of a switch/case statement.

**Imperative approach:** a triage agent that performs handoffs to specialized agents.

**Pixeltable approach:** one computed column classifies; a UDF selects the prompt; a second LLM call generates the response.
  input → classify intent → select specialized prompt → generate response
```python theme={null}
router = pxt.create_table('agentic_patterns/router', {'query': pxt.String})
```
  Created table 'router'.
```python theme={null}
# Step 1: classify the query intent
router.add_computed_column(
    classify_response=openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': 'Classify this customer query into exactly one category: '
                'technical, billing, or general. Reply with the single word only.\n\n'
                'Query: ' + router.query,
            }
        ],
        model='gpt-4o-mini',
    )
)
router.add_computed_column(
    intent=router.classify_response.choices[0].message.content.astype(pxt.String)
)
```
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
# Step 2: route to a specialized system prompt based on the classification
@pxt.udf
def route_prompt(intent: str, query: str) -> list[dict]:
    """Select a system prompt based on the classified intent."""
    system_prompts = {
        'technical': 'You are a senior technical support engineer. '
        'Provide precise, step-by-step troubleshooting guidance.',
        'billing': 'You are a billing specialist. '
        'Be empathetic and clear about charges, refunds, and payment options.',
        'general': 'You are a friendly customer service representative. '
        'Answer helpfully and concisely.',
    }
    # Default to general if classification is unexpected
    system = system_prompts.get(intent.strip().lower(), system_prompts['general'])
    return [
        {'role': 'system', 'content': system},
        {'role': 'user', 'content': query},
    ]

router.add_computed_column(
    routed_messages=route_prompt(router.intent, router.query)
)
```
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
# Step 3: generate the specialized response
router.add_computed_column(
    response_raw=openai.chat_completions(
        messages=router.routed_messages, model='gpt-4o-mini'
    )
)
router.add_computed_column(
    response=router.response_raw.choices[0].message.content.astype(pxt.String)
)
```
  Added 0 column values with 0 errors in 0.00 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
# Insert queries spanning different intents
router.insert(
    [
        {'query': 'My API calls are returning 429 errors since this morning'},
        {'query': 'I was charged twice for my subscription last month'},
        {'query': 'What programming languages do you support?'},
    ]
)
router.select(router.query, router.intent, router.response).collect()
```
  Inserted 3 rows with 0 errors in 6.93 s (0.43 rows/s)
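Because `intent` is stored as a regular column, routing decisions can also be audited with ordinary queries. A minimal sketch; the `intent_distribution` helper is hypothetical and assumes rows are collected as dicts with an `intent` key, as in the queries above:

```python theme={null}
from collections import Counter

def intent_distribution(rows: list[dict]) -> dict[str, int]:
    """Count how many queries were routed to each intent."""
    return dict(Counter(r['intent'].strip().lower() for r in rows))

# With the `router` table above, e.g.:
# print(intent_distribution(list(router.select(router.intent).collect())))
# router.where(router.intent == 'billing').select(router.query).collect()
```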
Each query was classified and then handled by a specialized system prompt. The `intent` column is inspectable for every row, making it easy to audit routing decisions.

## Pattern 3: Parallelization

Run multiple independent LLM calls on the same input simultaneously, then combine the results.

**Imperative approach:** `asyncio.gather` or thread pools.

**Pixeltable approach:** add independent computed columns. The engine parallelizes them automatically because they share no dependencies.
           ┌→ sentiment  ─┐
  input  ──┼→ entities   ──┼→ merge → combined output
           └→ summary    ─┘
```python theme={null}
parallel = pxt.create_table('agentic_patterns/parallel', {'text': pxt.String})
```
  Created table 'parallel'.
```python theme={null}
# Three independent LLM calls — Pixeltable runs them in parallel automatically
parallel.add_computed_column(
    sentiment_raw=openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': 'Analyze the sentiment of this text. '
                'Reply with: positive, negative, or neutral.\n\n' + parallel.text,
            }
        ],
        model='gpt-4o-mini',
    )
)
parallel.add_computed_column(
    sentiment=parallel.sentiment_raw.choices[0].message.content.astype(pxt.String)
)
parallel.add_computed_column(
    entities_raw=openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': 'Extract all named entities (people, companies, locations) '
                'from this text. Return a comma-separated list.\n\n' + parallel.text,
            }
        ],
        model='gpt-4o-mini',
    )
)
parallel.add_computed_column(
    entities=parallel.entities_raw.choices[0].message.content.astype(pxt.String)
)
parallel.add_computed_column(
    summary_raw=openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': 'Summarize this text in one sentence.\n\n' + parallel.text,
            }
        ],
        model='gpt-4o-mini',
    )
)
parallel.add_computed_column(
    summary=parallel.summary_raw.choices[0].message.content.astype(pxt.String)
)
```
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.00 s
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.00 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
# Merge the parallel results into a single structured report
@pxt.udf
def merge_analysis(sentiment: str, entities: str, summary: str) -> dict:
    """Combine parallel analysis results into one report."""
    return {
        'sentiment': sentiment.strip(),
        'entities': entities.strip(),
        'summary': summary.strip(),
    }

parallel.add_computed_column(
    report=merge_analysis(parallel.sentiment, parallel.entities, parallel.summary)
)
```
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
parallel.insert(
    [
        {
            'text': 'Apple announced record quarterly revenue of $124 billion, '
            'driven by strong iPhone sales in Europe and Asia. CEO Tim Cook '
            "expressed optimism about the company's AI initiatives, while "
            'some analysts remain cautious about increased R&D spending.'
        }
    ]
)
parallel.select(
    parallel.text, parallel.sentiment, parallel.entities, parallel.summary
).collect()
```

The three LLM calls (`sentiment`, `entities`, `summary`) have no dependency on each other, so Pixeltable dispatches them concurrently. The `merge_analysis` UDF waits for all three before combining the results. No async code required.

## Pattern 4: Tool Use

Give an LLM access to external functions it can call to gather information or take action.

**Imperative approach:** `@function_tool` decorator, tool loop that re-prompts until the LLM stops requesting tools.

**Pixeltable approach:** `pxt.tools()` bundles UDFs into tool definitions; `invoke_tools()` executes the LLM’s choices — both as computed columns.
  input → LLM (with tools) → invoke\_tools() → results
For a deeper walkthrough including MCP servers, see [Use tool calling with LLMs](/howto/cookbooks/agents/llm-tool-calling).

```python theme={null}
# Define tool functions as UDFs
@pxt.udf
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    weather_data = {
        'new york': 'Sunny, 72F',
        'london': 'Cloudy, 58F',
        'tokyo': 'Rainy, 65F',
        'paris': 'Partly cloudy, 68F',
    }
    return weather_data.get(city.lower(), f'Weather data not available for {city}')

@pxt.udf
def get_stock_price(symbol: str) -> str:
    """Get the current stock price for a ticker symbol."""
    prices = {'AAPL': '$178.50', 'GOOGL': '$141.25', 'MSFT': '$378.90'}
    return prices.get(symbol.upper(), f'Price not available for {symbol}')

# Bundle into a Tools object
tools = pxt.tools(get_weather, get_stock_price)
```

```python theme={null}
# Create the tool-calling pipeline
tool_agent = pxt.create_table('agentic_patterns/tool_agent', {'query': pxt.String})

# LLM decides which tool(s) to call
tool_agent.add_computed_column(
    response=openai.chat_completions(
        messages=[{'role': 'user', 'content': tool_agent.query}],
        model='gpt-4o-mini',
        tools=tools,
    )
)

# Execute the tool calls automatically
tool_agent.add_computed_column(
    tool_output=openai.invoke_tools(tools, tool_agent.response)
)
```
  Created table 'tool\_agent'.
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
tool_agent.insert(
    [
        {'query': "What's the weather in Tokyo?"},
        {'query': "What's Apple's stock price?"},
        {'query': "What's the weather in Paris and Microsoft's stock price?"},
    ]
)

for row in tool_agent.select(tool_agent.query, tool_agent.tool_output).collect():
    print(f'Query: {row["query"]}')
    for tool_name, results in (row['tool_output'] or {}).items():
        if results:
            print(f' -> {tool_name}: {results}')
    print()
```

The LLM chose which tools to invoke (including multiple tools for the last query). `invoke_tools()` executed them and stored results. The full LLM response is also persisted in the `response` column for debugging.

## Pattern 5: Evaluator-Optimizer

One LLM generates output, a second LLM evaluates it, and the results are used to decide whether to refine. This is the architectural cousin of the *Reflection* pattern from Taxonomy 1 — an agent critiques its own output and iteratively improves it.

**Imperative approach:** a while-loop that re-prompts until a quality threshold is met (see [Pixelagent’s reflection example](https://github.com/pixeltable/pixelagent/tree/main/examples/reflection)).

**Pixeltable approach:** chained computed columns — generate, evaluate, then conditionally refine. The evaluation score is stored alongside the content for analysis.
  input → generate → evaluate (score + feedback) → refine if needed → output
```python theme={null}
evaluator = pxt.create_table('agentic_patterns/evaluator', {'product_brief': pxt.String})
```
  Created table 'evaluator'.
```python theme={null}
# Step 1: generate initial marketing copy
evaluator.add_computed_column(
    gen_response=openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': 'Write a short marketing tagline (one sentence) for this product:\n\n'
                + evaluator.product_brief,
            }
        ],
        model='gpt-4o-mini',
    )
)
evaluator.add_computed_column(
    first_draft=evaluator.gen_response.choices[0].message.content.astype(pxt.String)
)
```
  Added 0 column values with 0 errors in 0.00 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
# Step 2: evaluate the draft with an LLM-as-judge
evaluator.add_computed_column(
    eval_response=openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': 'Rate this marketing tagline on a scale of 1-10 for clarity, '
                'creativity, and persuasiveness. Then provide one sentence of feedback '
                'for improvement.\n\n'
                'Tagline: ' + evaluator.first_draft + '\n\n'
                'Reply in this exact format:\n'
                'Score: \nFeedback: ',
            }
        ],
        model='gpt-4o-mini',
    )
)
evaluator.add_computed_column(
    evaluation=evaluator.eval_response.choices[0].message.content.astype(pxt.String)
)
```
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
# Step 3: refine using the feedback
evaluator.add_computed_column(
    refine_response=openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': 'Improve this marketing tagline based on the feedback below. '
                'Return only the improved tagline.\n\n'
                'Original: ' + evaluator.first_draft + '\n\n'
                'Feedback: ' + evaluator.evaluation,
            }
        ],
        model='gpt-4o-mini',
    )
)
evaluator.add_computed_column(
    refined=evaluator.refine_response.choices[0].message.content.astype(pxt.String)
)
```
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
evaluator.insert(
    [
        {
            'product_brief': 'A noise-canceling headphone designed for open-plan offices, '
            'with 30-hour battery life and a built-in microphone for calls.'
        },
        {
            'product_brief': 'An AI-powered code review tool that catches bugs, suggests '
            "improvements, and learns your team's coding style over time."
        },
    ]
)
evaluator.select(
    evaluator.product_brief,
    evaluator.first_draft,
    evaluator.evaluation,
    evaluator.refined,
).collect()
```
  Inserted 2 rows with 0 errors in 2.95 s (0.68 rows/s)
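Since the judge is instructed to reply in a `Score: ... / Feedback: ...` format, the numeric score can be parsed out and used to filter low-scoring rows. A hedged sketch; the `parse_score` helper and the threshold of 7 are illustrative, not part of the recipe above:

```python theme={null}
import re

def parse_score(evaluation: str) -> int:
    """Extract the integer after 'Score:' from the judge's reply (0 if absent)."""
    m = re.search(r'Score:\s*(\d+)', evaluation)
    return int(m.group(1)) if m else 0

# Wired into the pipeline, e.g.:
# evaluator.add_computed_column(score=pxt.udf(parse_score)(evaluator.evaluation))
# evaluator.where(evaluator.score < 7).select(
#     evaluator.first_draft, evaluator.evaluation
# ).collect()
```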
Both the first draft and the refined version are stored side-by-side with the evaluation. This makes it straightforward to compare outputs, audit the judge’s reasoning, or filter rows where the score fell below a threshold.

## Pattern 6: Orchestrator-Worker

A central agent decomposes a task, delegates sub-tasks to specialized worker agents, and synthesizes the results. This is the architectural cousin of the *Multi-Agent* pattern from Taxonomy 1, and the same structure Anthropic uses in their [multi-agent research system](https://www.anthropic.com/engineering/multi-agent-research-system) — a lead agent coordinates parallel subagents, each with their own context and tools.

**Imperative approach:** an orchestrator agent class that spawns worker agent instances and collects their outputs.

**Pixeltable approach:** each worker is a table with computed columns, wrapped as a callable function via `pxt.udf(table, return_value=...)`. The orchestrator table calls these functions as computed columns.
  input → decompose → worker A (summarizer)  ─┐
                    → worker B (fact-checker) ─┼→ synthesize → output
For more on table UDFs, see [Use a table pipeline as a reusable function](/howto/cookbooks/agents/pattern-table-as-udf).

### Build worker agents as tables

```python theme={null}
# Worker A: summarizer
summarizer_tbl = pxt.create_table('agentic_patterns/summarizer', {'text': pxt.String})
summarizer_tbl.add_computed_column(
    response=openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': 'Summarize this text in 2-3 sentences:\n\n' + summarizer_tbl.text,
            }
        ],
        model='gpt-4o-mini',
    )
)
summarizer_tbl.add_computed_column(
    summary=summarizer_tbl.response.choices[0].message.content.astype(pxt.String)
)

# Wrap as a callable function
summarize = pxt.udf(summarizer_tbl, return_value=summarizer_tbl.summary)
```
  Created table 'summarizer'.
  Added 0 column values with 0 errors in 0.10 s
  Added 0 column values with 0 errors in 0.06 s
```python theme={null}
# Worker B: fact-checker
checker_tbl = pxt.create_table('agentic_patterns/checker', {'claim': pxt.String})
checker_tbl.add_computed_column(
    response=openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': 'Assess whether this claim is plausible. '
                'Reply with: PLAUSIBLE or DUBIOUS, followed by a one-sentence explanation.\n\n'
                'Claim: ' + checker_tbl.claim,
            }
        ],
        model='gpt-4o-mini',
    )
)
checker_tbl.add_computed_column(
    assessment=checker_tbl.response.choices[0].message.content.astype(pxt.String)
)

# Wrap as a callable function
fact_check = pxt.udf(checker_tbl, return_value=checker_tbl.assessment)
```
  Created table 'checker'.
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.02 s
### Build the orchestrator

```python theme={null}
# Orchestrator table: delegates to workers, then synthesizes
orchestrator = pxt.create_table('agentic_patterns/orchestrator', {'article': pxt.String})

# Dispatch to worker A (summarizer) and worker B (fact-checker) in parallel
orchestrator.add_computed_column(summary=summarize(text=orchestrator.article))
orchestrator.add_computed_column(fact_check_result=fact_check(claim=orchestrator.article))
```
  Created table 'orchestrator'.
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
# Synthesize worker outputs into a final briefing
orchestrator.add_computed_column(
    synth_response=openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': 'Based on the summary and fact-check below, write a brief '
                'editorial note (2-3 sentences) about this article.\n\n'
                'Summary: ' + orchestrator.summary + '\n\n'
                'Fact-check: ' + orchestrator.fact_check_result,
            }
        ],
        model='gpt-4o-mini',
    )
)
orchestrator.add_computed_column(
    briefing=orchestrator.synth_response.choices[0].message.content.astype(pxt.String)
)
```
  Added 0 column values with 0 errors in 0.02 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
orchestrator.insert(
    [
        {
            'article': 'A recent study published in Nature found that global sea levels '
            'rose by 4.5 mm per year over the last decade, nearly double the rate observed '
            'in the 1990s. Researchers attribute the acceleration primarily to ice sheet '
            'loss in Greenland and Antarctica, compounded by thermal expansion of ocean '
            'water. The findings suggest coastal cities may face significant flooding risks '
            'by 2050 without aggressive mitigation strategies.'
        }
    ]
)
orchestrator.select(
    orchestrator.summary,
    orchestrator.fact_check_result,
    orchestrator.briefing,
).collect()
```
  Inserted 1 row with 0 errors in 4.69 s (0.21 rows/s)
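The worker recipe generalizes to new specialists. A hedged sketch of a hypothetical tone-analysis worker; the table name, prompt text, and column names are all assumptions. The prompt builder is plain Python, with the Pixeltable wiring shown as comments mirroring the summarizer worker above:

```python theme={null}
def tone_analysis_prompt(text: str) -> str:
    """Build the prompt for a hypothetical tone-analysis worker."""
    return (
        'Describe the overall tone of this text in one word, '
        'followed by a one-sentence justification:\n\n' + text
    )

# Wiring sketch (assumes the imports and tables defined above):
# tone_tbl = pxt.create_table('agentic_patterns/tone', {'text': pxt.String})
# tone_prompt = pxt.udf(tone_analysis_prompt)
# tone_tbl.add_computed_column(
#     response=openai.chat_completions(
#         messages=[{'role': 'user', 'content': tone_prompt(tone_tbl.text)}],
#         model='gpt-4o-mini',
#     )
# )
# tone_tbl.add_computed_column(
#     tone=tone_tbl.response.choices[0].message.content.astype(pxt.String)
# )
# analyze_tone = pxt.udf(tone_tbl, return_value=tone_tbl.tone)
# orchestrator.add_computed_column(tone=analyze_tone(text=orchestrator.article))
```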
The orchestrator table called two independent worker pipelines (`summarize` and `fact_check`), each backed by its own table with full intermediate-result persistence. The synthesis step consumed both outputs to produce the final briefing. Adding a new worker (e.g., a tone analyzer) requires only creating another table, wrapping it with `pxt.udf()`, and adding one more computed column to the orchestrator.

## Strategy A: ReAct

ReAct is not a wiring pattern — it is a **reasoning strategy** that can be applied inside any of the six patterns above. The agent alternates between reasoning about the next step and acting on it (typically via tools), observing the result before deciding what to do next.

**Imperative approach:** a while-loop that parses the LLM’s THOUGHT/ACTION output, calls tools, and feeds observations back (see [Pixelagent’s ReAct example](https://github.com/pixeltable/pixelagent/tree/main/examples/planning)).

**Pixeltable approach:** the reasoning loop is plain Python that inserts rows into a tool-calling table and reads back results. The table stores every thought-action-observation triple for full observability.
  question → \[THOUGHT → ACTION → OBSERVATION] × N → final answer
```python theme={null}
import re

# Define a tool for the ReAct agent
@pxt.udf
def lookup_population(country: str) -> str:
    """Look up the approximate population of a country."""
    populations = {
        'united states': '331 million',
        'china': '1.4 billion',
        'india': '1.4 billion',
        'germany': '84 million',
        'brazil': '214 million',
        'japan': '125 million',
    }
    return populations.get(country.lower(), f'Population data not available for {country}')

react_tools = pxt.tools(lookup_population)
```

```python theme={null}
# Build a tool-calling table that the ReAct loop will insert into
react_steps = pxt.create_table(
    'agentic_patterns/react_steps',
    {'step': pxt.Int, 'prompt': pxt.String, 'system_prompt': pxt.String},
)
react_steps.add_computed_column(
    response=openai.chat_completions(
        messages=[
            {'role': 'system', 'content': react_steps.system_prompt},
            {'role': 'user', 'content': react_steps.prompt},
        ],
        model='gpt-4o-mini',
        tools=react_tools,
    )
)
react_steps.add_computed_column(
    answer=react_steps.response.choices[0].message.content.astype(pxt.String)
)
react_steps.add_computed_column(
    tool_output=openai.invoke_tools(react_tools, react_steps.response)
)
```
  Created table 'react\_steps'.
  Added 0 column values with 0 errors in 0.00 s
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.00 s
  No rows affected.
```python theme={null}
# The ReAct loop: reason → act → observe, repeated until done
REACT_SYSTEM = (
    "You are a research assistant. Answer the user's question step by step.\n"
    'Available tools: lookup_population\n\n'
    'On each turn, respond in this exact format:\n'
    'THOUGHT: \n'
    'ACTION: \n\n'
    'When ACTION is FINAL, include your final answer after it.\n'
    'Current step: {step} of {max_steps}.'
)

question = 'Which country has a larger population, Brazil or Germany?'
max_steps = 4
history = []

for step in range(1, max_steps + 1):
    # Build prompt with accumulated observations
    prompt = question
    if history:
        prompt += '\n\nPrevious observations:\n' + '\n'.join(history)
    system = REACT_SYSTEM.format(step=step, max_steps=max_steps)
    react_steps.insert([{'step': step, 'prompt': prompt, 'system_prompt': system}])

    # Read back the result for this step
    row = (
        react_steps.where(react_steps.step == step)
        .select(react_steps.answer, react_steps.tool_output)
        .collect()
    )
    answer_text = row['answer'][0] or ''
    tool_out = row['tool_output'][0]

    # Record observation from tool output (if any)
    if tool_out:
        history.append(f'Step {step} tool result: {tool_out}')

    # Check if the agent decided to finalize
    if 'FINAL' in answer_text.upper():
        break

print(f'Completed in {step} steps')
for row in react_steps.select(
    react_steps.step, react_steps.answer, react_steps.tool_output
).collect():
    print(f'Step {row["step"]}:')
    if row['answer']:
        print(f' {row["answer"][:200]}')
    for tool_name, results in (row['tool_output'] or {}).items():
        if results:
            print(f' -> {tool_name}: {results}')
    print()
```

Every thought, action, and observation is persisted as a row in the `react_steps` table. The loop itself is plain Python; the LLM calls and tool execution happen declaratively via computed columns. This makes the reasoning trace fully queryable after the fact — useful for debugging or evaluation.

## Strategy B: Planning

Planning is the second cross-cutting reasoning strategy.
Instead of acting step-by-step (ReAct), the agent first generates a complete plan, then executes each step. This is especially effective for complex tasks where the structure of the solution can be determined upfront.

**Imperative approach:** an LLM generates a plan as structured JSON, then a loop executes each step (see [Pixelagent’s planning example](https://github.com/pixeltable/pixelagent/tree/main/examples/planning)).

**Pixeltable approach:** a prompt-chaining pipeline where the first column generates the plan and a UDF parses it into executable steps. Each step then feeds into subsequent computed columns.
  question → generate plan → execute step 1 → execute step 2 → ... → synthesize
```python theme={null}
import json as json_mod

planner = pxt.create_table('agentic_patterns/planner', {'question': pxt.String})

# Step 1: generate a plan as structured JSON
planner.add_computed_column(
    plan_response=openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': 'Break this question into 2-3 research steps. '
                'Return ONLY a JSON object like {"steps": ["sub-question 1", "sub-question 2"]}. '
                'No other text.\n\n'
                'Question: ' + planner.question,
            }
        ],
        model='gpt-4o-mini',
    )
)
planner.add_computed_column(
    plan_text=planner.plan_response.choices[0].message.content.astype(pxt.String)
)
```
  Created table 'planner'.
  Added 0 column values with 0 errors in 0.00 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
# Step 2: parse the plan into structured sub-questions
@pxt.udf
def execute_plan(plan_json: str, original_question: str) -> list[dict]:
    """Parse the plan JSON and return structured sub-questions."""
    try:
        data = json_mod.loads(plan_json)
        # Handle both {"steps": [...]} and direct [...]
        steps = (
            data
            if isinstance(data, list)
            else data.get('steps', data.get('questions', []))
        )
        return [{'step': i + 1, 'sub_question': q} for i, q in enumerate(steps)]
    except (json_mod.JSONDecodeError, TypeError):
        return [{'step': 1, 'sub_question': original_question}]

planner.add_computed_column(
    plan_steps=execute_plan(planner.plan_text, planner.question)
)
```
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
# Step 3: execute the plan — answer each sub-question, then synthesize
@pxt.udf
def format_plan_for_execution(plan_steps: list[dict], original_question: str) -> str:
    """Format the plan steps into a single execution prompt."""
    step_list = '\n'.join(f'{s["step"]}. {s["sub_question"]}' for s in plan_steps)
    return (
        f'Answer each of these research sub-questions briefly, '
        f'then provide a final synthesis that answers the original question.\n\n'
        f'Original question: {original_question}\n\n'
        f'Sub-questions:\n{step_list}'
    )

planner.add_computed_column(
    exec_prompt=format_plan_for_execution(planner.plan_steps, planner.question)
)
planner.add_computed_column(
    exec_response=openai.chat_completions(
        messages=[{'role': 'user', 'content': planner.exec_prompt}],
        model='gpt-4o-mini',
    )
)
planner.add_computed_column(
    final_answer=planner.exec_response.choices[0].message.content.astype(pxt.String)
)
```
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null}
planner.insert(
    [
        {
            'question': 'What are the economic and environmental trade-offs of electric vehicles vs hydrogen fuel cells?'
        }
    ]
)
row = planner.select(planner.question, planner.plan_text, planner.final_answer).collect()
print('Plan:', row['plan_text'][0])
print()
print('Answer:', row['final_answer'][0][:500])
```

The plan (stored in `plan_steps`) is fully inspectable. The execution step answers all sub-questions in a single LLM call, but this could also use parallelization (Pattern 3) to answer each sub-question independently and merge the results. Planning and ReAct compose naturally with any of the six architectural patterns.

## Choosing a Pattern

### Six architectural patterns
### Two cross-cutting reasoning strategies
Patterns compose naturally. An orchestrator-worker system might use routing in the orchestrator, tool use within a worker, and ReAct reasoning inside the tool-calling loop. Because each pattern is just a set of computed columns on a table, combining them requires no special glue code.

## See Also

**Pixeltable cookbooks:**

* [Use tool calling with LLMs](/howto/cookbooks/agents/llm-tool-calling) — deep dive into `pxt.tools()`, `invoke_tools()`, and MCP server integration
* [Build an agent with persistent memory](/howto/cookbooks/agents/pattern-agent-memory) — embedding indexes for semantic memory recall
* [Build a RAG pipeline](/howto/cookbooks/agents/pattern-rag-pipeline) — document chunking, embedding, and retrieval-augmented generation
* [Look up structured data with retrieval UDFs](/howto/cookbooks/agents/pattern-data-lookup) — `pxt.retrieval_udf()` for key-based lookups
* [Use a table pipeline as a reusable function](/howto/cookbooks/agents/pattern-table-as-udf) — `pxt.udf(table)` explained in depth

**Pixelagent examples** (imperative implementations of the same patterns):

* [Reflection loop](https://github.com/pixeltable/pixelagent/tree/main/examples/reflection) — main agent + critic agent with iterative refinement
* [ReAct / Planning](https://github.com/pixeltable/pixelagent/tree/main/examples/planning) — step-by-step reasoning with tool calls
* [Tool calling](https://github.com/pixeltable/pixelagent/tree/main/examples/tool-calling) — OpenAI, Anthropic, and Bedrock tool integration
* [Memory](https://github.com/pixeltable/pixelagent/tree/main/examples/memory) — persistent and semantic memory management

**External references:**

* [OpenAI’s Practical Guide to Building Agents](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf) — the six architectural patterns
* [Anthropic: How we built our multi-agent research system](https://www.anthropic.com/engineering/multi-agent-research-system) — orchestrator-worker at scale
* [Pydantic AI: Multi-agent applications](https://ai.pydantic.dev/multi-agent-applications/#agent-delegation) — agent delegation patterns

# Use tool calling and MCP servers with LLMs

Source: https://docs.pixeltable.com/howto/cookbooks/agents/llm-tool-calling

Open in Kaggle  Open in Colab  Download Notebook

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

Enable LLMs to call functions and tools, then execute the results automatically.

## Problem

You want an LLM to decide which functions to call based on user queries—for agents, chatbots, or automated workflows.
## Solution

**What’s in this recipe:**

* Define tools as Python functions
* Let LLMs decide which tool to call
* Automatically execute tool calls with `invoke_tools`
* Use MCP servers to load external tools

You define tools as Python UDFs; Pixeltable derives their JSON schemas, passes them to the LLM, and `invoke_tools` executes the resulting function calls.

### Setup

```python theme={null}
%pip install -qU pixeltable openai mcp
```

```python theme={null}
import getpass
import os

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
```

```python theme={null}
import pixeltable as pxt
from pixeltable.functions import openai
```

```python theme={null}
# Create a fresh directory
pxt.drop_dir('tools_demo', force=True)
pxt.create_dir('tools_demo')
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'tools\_demo'.
### Define tools as UDFs

```python theme={null}
# Define tool functions as Pixeltable UDFs
@pxt.udf
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    # In production, call a real weather API
    weather_data = {
        'new york': 'Sunny, 72°F',
        'london': 'Cloudy, 58°F',
        'tokyo': 'Rainy, 65°F',
        'paris': 'Partly cloudy, 68°F',
    }
    return weather_data.get(city.lower(), f'Weather data not available for {city}')

@pxt.udf
def get_stock_price(symbol: str) -> str:
    """Get the current stock price for a symbol."""
    # In production, call a real stock API
    prices = {
        'AAPL': '$178.50',
        'GOOGL': '$141.25',
        'MSFT': '$378.90',
        'AMZN': '$185.30',
    }
    return prices.get(symbol.upper(), f'Price not available for {symbol}')
```

```python theme={null}
# Create a Tools object with our functions
tools = pxt.tools(get_weather, get_stock_price)
```

### Create tool-calling pipeline

```python theme={null}
# Create table for queries
queries = pxt.create_table('tools_demo/queries', {'query': pxt.String})
```
  Created table 'queries'.
```python theme={null}
# Add LLM call with tools
queries.add_computed_column(
    response=openai.chat_completions(
        messages=[{'role': 'user', 'content': queries.query}],
        model='gpt-4o-mini',
        tools=tools,  # Pass tools to the LLM
    )
)
```
  Added 0 column values with 0 errors in 0.00 s
  No rows affected.
```python theme={null}
# Automatically execute tool calls and get results
queries.add_computed_column(
    tool_results=openai.invoke_tools(tools, queries.response)
)
```
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
### Run tool-enabled queries

```python theme={null}
# Insert queries that require tool calls
sample_queries = [
    {'query': "What's the weather in Tokyo?"},
    {'query': "What's the stock price of Apple?"},
    {'query': "What's the weather in Paris and the price of Microsoft stock?"},
]
queries.insert(sample_queries)
```
  Inserted 3 rows with 0 errors in 4.16 s (0.72 rows/s)
  3 rows inserted.
```python theme={null}
# View results
queries.select(queries.query, queries.tool_results).collect()
```
## Using MCP Servers as Tools

The [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) is an open protocol that standardizes how applications provide context to LLMs. Pixeltable can connect to MCP servers and use their exposed tools as UDFs.

### Why MCP?
### Create an MCP Server

First, create an MCP server with tools you want to expose. Save this as `mcp_server.py`:

```python theme={null}
from mcp.server.fastmcp import FastMCP

mcp = FastMCP('PixeltableDemo', stateless_http=True)

@mcp.tool()
def calculate_discount(price: float, discount_percent: float) -> float:
    """Calculate the discounted price."""
    return price * (1 - discount_percent / 100)

@mcp.tool()
def check_inventory(product_id: str) -> str:
    """Check inventory status for a product."""
    # In production, query your inventory database
    inventory = {
        'SKU001': 'In stock (42 units)',
        'SKU002': 'Low stock (3 units)',
        'SKU003': 'Out of stock',
    }
    return inventory.get(product_id, f'Unknown product: {product_id}')

if __name__ == '__main__':
    mcp.run(transport='streamable-http')
```

Run the server: `python mcp_server.py` (it will listen on `http://localhost:8000/mcp`)

### Connect to MCP Server and Use Tools

```python theme={null}
# Connect to the MCP server using pxt.mcp_udfs()
# This creates a Pixeltable UDF for each tool exposed by the server
# See: https://docs.pixeltable.com/platform/custom-functions#5-mcp-udfs
mcp_tools = pxt.mcp_udfs('https://docs.pixeltable.com/mcp')

# View available tools - each is now a callable Pixeltable function
for tool in mcp_tools:
    print(f'- {tool.name}: {tool.comment()}')
```
  - SearchPixeltableDocumentation: Search across the Pixeltable Documentation knowledge base to find relevant information, code examples, API references, and guides. Use this tool when you need to answer questions about Pixeltable Documentation, find specific documentation, understand how features work, or locate implementation details. The search returns contextual content with titles and direct links to the documentation pages.
```python theme={null} # Bundle MCP tools for LLM use mcp_toolset = pxt.tools(*mcp_tools) # Create a table with MCP tool-calling pipeline mcp_queries = pxt.create_table( 'tools_demo/mcp_queries', {'query': pxt.String} ) # Add LLM call with MCP tools mcp_queries.add_computed_column( response=openai.chat_completions( messages=[{'role': 'user', 'content': mcp_queries.query}], model='gpt-4o-mini', tools=mcp_toolset, ) ) # Execute MCP tool calls mcp_queries.add_computed_column( tool_results=openai.invoke_tools(mcp_toolset, mcp_queries.response) ) # View the schema - note that mcp_toolset is stored as persistent metadata # Every subsequent insert will use these same tools automatically mcp_queries.describe() ```
  Created table 'mcp\_queries'.
  Added 0 column values with 0 errors in 0.00 s
  Added 0 column values with 0 errors in 0.01 s
```python theme={null} # Test with documentation queries mcp_queries.insert( [ {'query': 'What is Pixeltable?'}, {'query': 'How to use OpenAI in Pixeltable?'}, ] ) mcp_queries.select(mcp_queries.query, mcp_queries.tool_results).collect() ```
```python theme={null} # Extract the search result with a named column mcp_queries.select( search_result=mcp_queries.tool_results[ 'SearchPixeltableDocumentation' ][0] ).collect() ```
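The JSON-path expression above assumes that `invoke_tools` stores its results as an object keyed by tool name, with one list entry per tool call. As a plain-Python sketch of that indexing (the data below is made up for illustration):

```python theme={null}
# Hypothetical shape of a tool_results cell: an object keyed by tool
# name, holding a list with one entry per call (illustrative data only)
tool_results = {
    'SearchPixeltableDocumentation': [
        {
            'title': 'Custom Functions',
            'content': 'UDFs let you extend Pixeltable with your own logic.',
        }
    ]
}

# The select() above corresponds to this plain indexing:
search_result = tool_results['SearchPixeltableDocumentation'][0]
print(search_result['title'])
```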
## Explanation **Tool calling flow:**
  Query → LLM decides tool → invoke\_tools executes → Results
**Key components:**
**MCP integration:**
  MCP Server → pxt.mcp\_udfs() → pxt.tools() → LLM tool calling
MCP servers expose tools via a standardized protocol. Pixeltable’s `mcp_udfs()` connects to any MCP server and returns the tools as callable UDFs that can be bundled with `pxt.tools()` for LLM use. **Supported providers:**
## See also * [Build a RAG pipeline](/howto/cookbooks/agents/pattern-rag-pipeline) - Retrieval-augmented generation * [Run local LLMs](/howto/providers/working-with-ollama) - Local model inference * [Multimodal MCP Servers](/libraries/mcp) - Pixeltable’s MCP server collection * [Custom Functions](/platform/custom-functions) - More about UDFs and MCP integration # Build an agent with memory Source: https://docs.pixeltable.com/howto/cookbooks/agents/pattern-agent-memory Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Create an AI agent that remembers important information across conversations. ## Problem You want to build an AI agent that can store and recall important information—user preferences, key facts, or context from previous conversations.
## Solution **What’s in this recipe:** * Store memories with embeddings for semantic search * Retrieve relevant memories based on conversation context * Use `@pxt.query` for retrieval functions This pattern is inspired by [Pixelbot](https://github.com/pixeltable/pixelbot) and [Pixelmemory](https://github.com/pixeltable/pixelmemory). ### Setup ```python theme={null} %pip install -qU pixeltable openai ``` ```python theme={null} import getpass import os from datetime import datetime if 'OPENAI_API_KEY' not in os.environ: os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ') ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions.openai import chat_completions, embeddings ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('agent_demo', force=True) pxt.create_dir('agent_demo') ```
  Created directory 'agent\_demo'.
### Create memory bank ```python theme={null} # Create memory bank table memories = pxt.create_table( 'agent_demo/memories', { 'content': pxt.String, # The memory content 'category': pxt.String, # Optional category (preference, fact, etc.) 'created_at': pxt.Timestamp, # When the memory was stored }, ) ```
  Created table 'memories'.
```python theme={null} # Add embedding index for semantic search on content memories.add_embedding_index( column='content', string_embed=embeddings.using(model='text-embedding-3-small'), ) ``` ### Define retrieval function ```python theme={null} # Define a query function to retrieve relevant memories @pxt.query def recall_memories(context: str, top_k: int = 3): """Retrieve memories relevant to the current context.""" sim = memories.content.similarity(string=context) return ( memories.where(sim > 0.5) .order_by(sim, asc=False) .limit(top_k) .select(content=memories.content, category=memories.category) ) ``` ### Store some memories ```python theme={null} # Store some initial memories initial_memories = [ { 'content': 'User prefers Python for data analysis', 'category': 'preference', 'created_at': datetime.now(), }, { 'content': 'The project deadline is March 15, 2024', 'category': 'fact', 'created_at': datetime.now(), }, { 'content': 'User works at a startup in San Francisco', 'category': 'fact', 'created_at': datetime.now(), }, { 'content': 'Budget for the ML project is $50,000', 'category': 'fact', 'created_at': datetime.now(), }, { 'content': 'User prefers concise explanations over detailed ones', 'category': 'preference', 'created_at': datetime.now(), }, ] memories.insert(initial_memories) ```
  Inserting rows into \`memories\`: 5 rows \[00:00, 590.53 rows/s]
  Inserted 5 rows with 0 errors.
  5 rows inserted, 15 values computed.
### Create conversation table with memory retrieval ```python theme={null} # Create conversation table conversations = pxt.create_table( 'agent_demo/conversations', {'user_message': pxt.String} ) ```
  Created table 'conversations'.
```python theme={null} # Add memory retrieval step conversations.add_computed_column( relevant_memories=recall_memories(conversations.user_message, top_k=3) ) ```
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Build prompt with memories @pxt.udf def build_memory_prompt( user_message: str, relevant_memories: list[dict] ) -> str: memory_text = '\n'.join( [f'- {m["content"]}' for m in relevant_memories] ) return f"""You are a helpful assistant with access to the following memories about the user: {memory_text} Use these memories to personalize your response when relevant. User: {user_message} Assistant:""" conversations.add_computed_column( prompt=build_memory_prompt( conversations.user_message, conversations.relevant_memories ) ) ```
  Added 0 column values with 0 errors.
  No rows affected.
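The `build_memory_prompt` UDF above is ordinary Python, so you can reason about the prompt it produces outside of Pixeltable. Here is the same logic as a standalone function, fed a hypothetical memory:

```python theme={null}
def build_memory_prompt(user_message, relevant_memories):
    # Same logic as the @pxt.udf above, as a plain function
    memory_text = '\n'.join(f'- {m["content"]}' for m in relevant_memories)
    return (
        'You are a helpful assistant with access to the following '
        f'memories about the user:\n\n{memory_text}\n\n'
        'Use these memories to personalize your response when relevant.\n\n'
        f'User: {user_message}\n\nAssistant:'
    )


# Made-up memory for illustration
prompt = build_memory_prompt(
    'What language should I use?',
    [{'content': 'User prefers Python for data analysis'}],
)
print(prompt)
```

The LLM sees the retrieved memories as a bulleted list ahead of the user's message, which is what lets it personalize the reply.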
```python theme={null} # Generate response with memory context conversations.add_computed_column( response=chat_completions( messages=[{'role': 'user', 'content': conversations.prompt}], model='gpt-4o-mini', ) ) conversations.add_computed_column( assistant_reply=conversations.response.choices[0].message.content ) ```
  Added 0 column values with 0 errors.
  Added 0 column values with 0 errors.
  No rows affected.
### Chat with memory-aware agent ```python theme={null} # Test the memory-aware agent test_messages = [ { 'user_message': 'What programming language should I use for this project?' }, {'user_message': 'When do I need to finish this?'}, {'user_message': 'How much can I spend on cloud resources?'}, ] conversations.insert(test_messages) ```
  Inserting rows into \`conversations\`: 3 rows \[00:00, 1047.88 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 18 values computed.
```python theme={null} # View conversations with memory conversations.select( conversations.user_message, conversations.relevant_memories, conversations.assistant_reply, ).collect() ```
## Explanation **Memory-aware agent architecture:**
  User Message → Retrieve Memories → Build Prompt → LLM Response
                      ↓
              Memory Bank (with embeddings)
**Key components:**
**Adding new memories:** ```python theme={null} memories.insert([{ 'content': 'New information to remember', 'category': 'fact', 'created_at': datetime.now() }]) ``` ## See also * [Build a RAG pipeline](/howto/cookbooks/agents/pattern-rag-pipeline) - Document retrieval * [Use tool calling](/howto/cookbooks/agents/llm-tool-calling) - Function calling with LLMs * [Pixelbot](https://github.com/pixeltable/pixelbot) - Full agent implementation # Look up structured data with retrieval UDFs Source: https://docs.pixeltable.com/howto/cookbooks/agents/pattern-data-lookup Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Create lookup functions that query tables by key—for customer records, product catalogs, or financial data. ## Problem You have structured data—customer records, product catalogs, financial data—and need to look up rows by key values. Common scenarios:
## Solution **What’s in this recipe:** * Create lookup functions from tables with `retrieval_udf` * Query by single or multiple keys * Use lookups in computed columns for data enrichment Use `pxt.retrieval_udf(table)` to automatically create a function that queries the table by its columns. ### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import pixeltable as pxt ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('lookup_demo', force=True) pxt.create_dir('lookup_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'lookup\_demo'.
### Create a product catalog table ```python theme={null} # Create a product catalog products = pxt.create_table( 'lookup_demo/products', { 'sku': pxt.String, 'name': pxt.String, 'price': pxt.Float, 'category': pxt.String, }, ) products.insert( [ { 'sku': 'LAPTOP-001', 'name': 'MacBook Pro 14"', 'price': 1999.00, 'category': 'electronics', }, { 'sku': 'LAPTOP-002', 'name': 'ThinkPad X1', 'price': 1499.00, 'category': 'electronics', }, { 'sku': 'PHONE-001', 'name': 'iPhone 15 Pro', 'price': 999.00, 'category': 'electronics', }, { 'sku': 'CHAIR-001', 'name': 'Ergonomic Office Chair', 'price': 449.00, 'category': 'furniture', }, { 'sku': 'DESK-001', 'name': 'Standing Desk', 'price': 699.00, 'category': 'furniture', }, ] ) products.collect() ```
  Created table 'products'.
  Inserting rows into \`products\`: 5 rows \[00:00, 502.31 rows/s]
  Inserted 5 rows with 0 errors.
### Create a lookup function with retrieval\_udf ```python theme={null} # Create a lookup function that searches by SKU get_product = pxt.retrieval_udf( products, name='get_product', description='Look up a product by its SKU code', parameters=['sku'], # Only use SKU as the lookup key limit=1, # Return at most 1 result ) # Check the function signature get_product ``` ```python theme={null} # Look up a product by SKU result = products.select(get_product(sku='LAPTOP-001')).limit(1).collect() result ``` ### Look up by category (multiple results) ```python theme={null} # Create a category lookup (returns multiple products) get_by_category = pxt.retrieval_udf( products, name='get_by_category', description='Get all products in a category', parameters=['category'], limit=10, # Return up to 10 products ) # Find all electronics products.select(get_by_category(category='electronics')).limit( 1 ).collect() ```
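Conceptually, `retrieval_udf` wraps the table in a keyed lookup. The two patterns above (unique-key lookup with `limit=1`, multi-result lookup with a larger limit) reduce to this plain-Python sketch, using hypothetical data rather than Pixeltable's actual implementation:

```python theme={null}
catalog = [
    {'sku': 'LAPTOP-001', 'name': 'MacBook Pro 14"', 'category': 'electronics'},
    {'sku': 'CHAIR-001', 'name': 'Ergonomic Office Chair', 'category': 'furniture'},
]


def lookup(rows, limit, **keys):
    # Return up to `limit` rows whose columns match all given key values
    hits = [r for r in rows if all(r[k] == v for k, v in keys.items())]
    return hits[:limit]


lookup(catalog, limit=1, sku='LAPTOP-001')       # unique-key lookup
lookup(catalog, limit=10, category='furniture')  # multi-result lookup
```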
### Use lookups for data enrichment ```python theme={null} # Create an orders table orders = pxt.create_table( 'lookup_demo/orders', { 'order_id': pxt.String, 'product_sku': pxt.String, 'quantity': pxt.Int, }, ) orders.insert( [ { 'order_id': 'ORD-001', 'product_sku': 'LAPTOP-001', 'quantity': 2, }, { 'order_id': 'ORD-002', 'product_sku': 'PHONE-001', 'quantity': 1, }, { 'order_id': 'ORD-003', 'product_sku': 'CHAIR-001', 'quantity': 4, }, ] ) ```
  Created table 'orders'.
  Inserting rows into \`orders\`: 3 rows \[00:00, 1186.28 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 6 values computed.
```python theme={null} # Add a computed column that enriches orders with product details orders.add_computed_column( product_info=get_product(sku=orders.product_sku) ) # View enriched orders orders.select( orders.order_id, orders.product_sku, orders.quantity, orders.product_info, ).collect() ```
  Added 3 column values with 0 errors.
## Explanation **`retrieval_udf` parameters:**
**Use cases:**
**Tips:** * Use `limit=1` for unique key lookups * Specify only needed columns in `parameters` for cleaner APIs * Add descriptions for LLM tool integration ## See also * [Use tool calling with LLMs](/howto/cookbooks/agents/llm-tool-calling) - Use retrieval UDFs as LLM tools * [Build a RAG pipeline](/howto/cookbooks/agents/pattern-rag-pipeline) - Semantic search with `@pxt.query` # Build a RAG pipeline Source: https://docs.pixeltable.com/howto/cookbooks/agents/pattern-rag-pipeline Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Create a retrieval-augmented generation system that answers questions using your documents as context. ## Problem You want an LLM to answer questions using your specific documents—not just its training data. You need to retrieve relevant context and include it in the prompt.
## Solution **What’s in this recipe:** * Embed and index documents for retrieval * Create a query function that retrieves context * Generate answers grounded in your documents You build a pipeline that: (1) embeds documents, (2) finds relevant chunks for a query, and (3) generates an answer using those chunks as context. ### Setup ```python theme={null} %pip install -qU pixeltable openai ``` ```python theme={null} import getpass import os if 'OPENAI_API_KEY' not in os.environ: os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ') ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions.openai import chat_completions, embeddings ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('rag_demo', force=True) pxt.create_dir('rag_demo') ```
  Created directory 'rag\_demo'.
### Step 1: create document store with embeddings ```python theme={null} # Create table for document chunks chunks = pxt.create_table( 'rag_demo/chunks', {'doc_id': pxt.String, 'chunk_text': pxt.String} ) ```
  Created table 'chunks'.
```python theme={null} # Add embedding index for semantic search chunks.add_embedding_index( column='chunk_text', string_embed=embeddings.using(model='text-embedding-3-small'), ) ``` ### Step 2: load documents ```python theme={null} # Sample knowledge base (in production, load from files/database) documents = [ { 'doc_id': 'password-reset', 'chunk_text': 'To reset your password, go to the login page and click "Forgot Password". Enter your email address and you will receive a reset link within 5 minutes. The link expires after 24 hours.', }, { 'doc_id': 'password-reset', 'chunk_text': 'Password requirements: minimum 8 characters, at least one uppercase letter, one number, and one special character. Passwords expire every 90 days for security.', }, { 'doc_id': 'account-settings', 'chunk_text': 'To update your profile, navigate to Settings > Account. You can change your display name, email address, and notification preferences. Changes take effect immediately.', }, { 'doc_id': 'billing', 'chunk_text': 'Billing occurs on the first of each month. You can view invoices under Settings > Billing. To change your payment method, click "Update Payment" and enter your new card details.', }, { 'doc_id': 'api-access', 'chunk_text': 'API keys can be generated in Settings > Developer. Each key has configurable permissions. Rate limits are 1000 requests per minute for standard plans, 10000 for enterprise.', }, ] chunks.insert(documents) ```
  Inserting rows into \`chunks\`: 5 rows \[00:00, 345.31 rows/s]
  Inserted 5 rows with 0 errors.
  5 rows inserted, 15 values computed.
### Step 3: create the RAG query function ```python theme={null} # Define a query function that retrieves context @pxt.query def retrieve_context(query: str, top_k: int = 3): """Retrieve the most relevant chunks for a query.""" sim = chunks.chunk_text.similarity(string=query) return ( chunks.where(sim > 0.5) .order_by(sim, asc=False) .limit(top_k) .select(doc_id=chunks.doc_id, text=chunks.chunk_text) ) ``` ```python theme={null} # View retrieved context for a query query = 'What are the key features?' context_chunks = retrieve_context(query) context_chunks ```
  retrieve\_context('What are the key features?')
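The `where`/`order_by`/`limit` chain in `retrieve_context` implements threshold-filtered top-k retrieval. Over precomputed similarity scores (the values below are illustrative), the same logic in plain Python is:

```python theme={null}
def top_chunks(scored_chunks, top_k=3, threshold=0.5):
    # Keep chunks above the similarity threshold, best first, at most top_k
    hits = [c for c in scored_chunks if c['sim'] > threshold]
    hits.sort(key=lambda c: c['sim'], reverse=True)
    return hits[:top_k]


scored = [
    {'doc_id': 'billing', 'sim': 0.82},
    {'doc_id': 'api-access', 'sim': 0.44},
    {'doc_id': 'password-reset', 'sim': 0.61},
]
top_chunks(scored)  # billing and password-reset pass; api-access is filtered out
```

Raising the threshold trades recall for precision; lowering `top_k` shrinks the prompt at the risk of dropping useful context.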
### Step 4: generate answers with context ```python theme={null} # Create a table for questions/answers qa = pxt.create_table('rag_demo/qa', {'question': pxt.String}) ```
  Created table 'qa'.
```python theme={null} # Add retrieval step qa.add_computed_column(context=retrieve_context(qa.question, top_k=3)) ```
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Build the RAG prompt @pxt.udf def build_rag_prompt(question: str, context: list[dict]) -> str: context_text = '\n\n'.join( [f'[{c["doc_id"]}]: {c["text"]}' for c in context] ) return f"""Answer the question based only on the provided context. If the context doesn't contain the answer, say "I don't have information about that." Context: {context_text} Question: {question} Answer:""" qa.add_computed_column(prompt=build_rag_prompt(qa.question, qa.context)) ```
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Generate answer qa.add_computed_column( response=chat_completions( messages=[{'role': 'user', 'content': qa.prompt}], model='gpt-4o-mini', ) ) qa.add_computed_column(answer=qa.response.choices[0].message.content) ```
  Added 0 column values with 0 errors.
  Added 0 column values with 0 errors.
  No rows affected.
### Ask questions ```python theme={null} # Insert questions questions = [ {'question': 'How do I reset my password?'}, {'question': 'What are the API rate limits?'}, {'question': 'When am I billed?'}, ] qa.insert(questions) ```
  Inserting rows into \`qa\`: 3 rows \[00:00, 872.12 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 18 values computed.
```python theme={null} # View answers qa.select(qa.question, qa.answer).collect() ```
## Explanation **RAG pipeline flow:**
  Question → Embed → Retrieve similar chunks → Build prompt with context → Generate answer
**Key components:**
**Scaling tips:** * Use `doc-chunk-for-rag` recipe to split long documents * Adjust `top_k` to balance context size vs. relevance * Consider metadata filtering for large knowledge bases ## See also * [Chunk documents for RAG](/howto/cookbooks/text/doc-chunk-for-rag) - Split documents into chunks * [Create text embeddings](/howto/cookbooks/search/embed-text-openai) - Embedding fundamentals * [Semantic text search](/howto/cookbooks/search/search-semantic-text) - Search patterns # Use a table pipeline as a reusable function Source: https://docs.pixeltable.com/howto/cookbooks/agents/pattern-table-as-udf Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Convert a table with computed columns into a callable function for multi-agent workflows and pipeline composition. ## Problem You have a table that runs a complex pipeline—LLM calls, tool use, post-processing—and you want to reuse that entire pipeline from other tables. Copy-pasting computed column definitions is error-prone and hard to maintain.
## Solution **What’s in this recipe:** * Create an “agent” table with computed columns * Convert the table to a callable UDF with `pxt.udf(table, return_value=...)` * Use the table UDF in other tables’ computed columns You wrap an entire table pipeline as a function. When you call this function from another table, it inserts a row into the agent table, runs all computed columns, and returns the specified output column. ### Setup ```python theme={null} %pip install -qU pixeltable openai ``` ```python theme={null} import getpass import os if 'OPENAI_API_KEY' not in os.environ: os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ') ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions.openai import chat_completions ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('table_udf_demo', force=True) pxt.create_dir('table_udf_demo') ```
  Created directory 'table\_udf\_demo'.
### Create an agent table with computed columns You create a table that encapsulates a complete pipeline. This example builds a summarization agent: ```python theme={null} # Create the agent table with input column summarizer = pxt.create_table( 'table_udf_demo/summarizer', {'text': pxt.String} ) ```
  Created table 'summarizer'.
```python theme={null} # Add the LLM call as a computed column summarizer.add_computed_column( response=chat_completions( messages=[ { 'role': 'user', 'content': 'Summarize this in one sentence:\n\n' + summarizer.text, } ], model='gpt-4o-mini', ) ) ```
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Extract the summary text summarizer.add_computed_column( summary=summarizer.response.choices[0].message.content ) ```
  Added 0 column values with 0 errors.
  No rows affected.
### Convert the table to a UDF You use `pxt.udf(table, return_value=...)` to convert the table into a callable function. The `return_value` specifies which column to return: ```python theme={null} # Convert the summarizer table into a callable UDF summarize = pxt.udf(summarizer, return_value=summarizer.summary) ``` ### Use the table UDF in another table You can now use `summarize()` as a computed column in any other table: ```python theme={null} # Create a table that uses the summarizer articles = pxt.create_table( 'table_udf_demo/articles', {'title': pxt.String, 'content': pxt.String}, ) ```
  Created table 'articles'.
```python theme={null} # Add the table UDF as a computed column articles.add_computed_column(summary=summarize(text=articles.content)) ```
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Insert articles - summaries are generated automatically articles.insert( [ { 'title': 'Climate Report', 'content': 'Global temperatures rose by 1.2 degrees Celsius above pre-industrial levels last year, marking the hottest year on record. Scientists attribute this to continued greenhouse gas emissions and a strong El Nino pattern. The report calls for immediate action to reduce carbon emissions.', }, { 'title': 'Tech Merger', 'content': 'Two major semiconductor companies announced a merger valued at $50 billion. The combined entity will control 30% of the global chip market. Regulators in multiple countries will review the deal over the next 18 months.', }, ] ) ```
  Inserting rows into \`articles\`: 2 rows \[00:00, 196.58 rows/s]
  Inserted 2 rows with 0 errors.
  2 rows inserted, 6 values computed.
```python theme={null} # View results articles.select(articles.title, articles.summary).collect() ```
## Explanation **How table UDFs work:**
  Consumer table row → Table UDF called → Agent table inserts row →
  Computed columns run → Return value extracted → Consumer gets result
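The flow above can be sketched as a toy simulation in plain Python (illustrative only, not Pixeltable's actual implementation; `str.upper()` stands in for the LLM call):

```python theme={null}
class AgentTable:
    """Toy stand-in for a Pixeltable table with computed columns."""

    def __init__(self, computed_columns):
        self.rows = []
        self.computed_columns = computed_columns  # column name -> fn(row)

    def insert(self, row):
        # Inserting a row triggers every computed column in order
        for name, fn in self.computed_columns.items():
            row[name] = fn(row)
        self.rows.append(row)
        return row


def as_udf(table, return_value):
    # Calling the "UDF" inserts a row, runs the pipeline,
    # and hands back the requested output column
    def udf(**inputs):
        return table.insert(dict(inputs))[return_value]

    return udf


summarizer = AgentTable({'summary': lambda r: r['text'].upper()})
summarize = as_udf(summarizer, return_value='summary')
result = summarize(text='hello')  # inserts a row and returns 'HELLO'
```

Note that, as in the real thing, every call leaves a row behind in the agent table, which is what makes intermediate results inspectable later.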
**When to use table UDFs vs `@pxt.query`:**
**Key benefits:** * **Encapsulation**: Hide complex pipeline details behind a simple function call * **Reusability**: Use the same agent from multiple consumer tables * **Persistence**: All intermediate results are stored in the agent table for debugging * **Composition**: Chain agents together for multi-stage workflows ## See also * [Look up structured data](/howto/cookbooks/agents/pattern-data-lookup) - Simple key-based lookups with `retrieval_udf` * [Build a RAG pipeline](/howto/cookbooks/agents/pattern-rag-pipeline) - Retrieval with `@pxt.query` * [Use tool calling with LLMs](/howto/cookbooks/agents/llm-tool-calling) - Add tools to agent tables # Extract audio from video Source: https://docs.pixeltable.com/howto/cookbooks/audio/audio-extract-from-video Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Extract the audio track from video files for transcription, analysis, or processing. ## Problem You have video files but need to work with just the audio track—for transcription, speaker analysis, or audio processing. Extracting audio manually with ffmpeg is tedious and doesn’t integrate with your data pipeline.
## Solution **What’s in this recipe:** * Extract audio from video as a computed column * Choose audio format (mp3, wav, flac) * Chain with transcription for automatic video-to-text You use the `extract_audio` function to create an audio column from video. This integrates seamlessly with transcription and other audio processing. ### Setup ```python theme={null} %pip install -qU pixeltable boto3 'numpy<2.4' ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions.video import extract_audio ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('audio_extract_demo', force=True) pxt.create_dir('audio_extract_demo') ```
  Created directory 'audio\_extract\_demo'.
### Extract audio from video ```python theme={null} # Create table for videos videos = pxt.create_table( 'audio_extract_demo/videos', {'title': pxt.String, 'video': pxt.Video} ) ```
  Created table 'videos'.
```python theme={null} # Add computed column to extract audio as MP3 videos.add_computed_column( audio=extract_audio(videos.video, format='mp3') ) ```
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Insert a sample video (from multimedia-commons with audio) video_url = 's3://multimedia-commons/data/videos/mp4/ffe/ffb/ffeffbef41bbc269810b2a1a888de.mp4' videos.insert([{'title': 'Sample Video', 'video': video_url}]) ```
  Inserting rows into \`videos\`: 1 rows \[00:00, 207.52 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 4 values computed.
```python theme={null} # View results videos.select(videos.title, videos.audio).collect() ```
### Chain with transcription Add transcription as a follow-up computed column: ```python theme={null} # Install whisper for transcription %pip install -qU openai-whisper ``` ```python theme={null} from pixeltable.functions import whisper # Add transcription of the extracted audio videos.add_computed_column( transcription=whisper.transcribe(videos.audio, model='base.en') ) ```
  Added 1 column value with 0 errors.
  1 row updated, 1 value computed.
```python theme={null} # Extract the transcript text videos.add_computed_column(transcript=videos.transcription.text) ```
  Added 1 column value with 0 errors.
  1 row updated, 1 value computed.
```python theme={null} # View the full pipeline results videos.select(videos.title, videos.transcript).collect() ```
## Explanation **Audio format options:**
**Pipeline flow:**
  Video → extract\_audio → Audio → whisper.transcribe → Transcript
Each step is a computed column. When you insert a new video: 1. Audio is extracted automatically 2. Whisper transcribes the audio 3. All results are cached for future queries ## See also * [Transcribe audio](/howto/cookbooks/audio/audio-transcribe) - Audio-only transcription * [Summarize podcasts](/howto/cookbooks/audio/audio-summarize-podcast) - Transcribe and summarize * [Extract video frames](/howto/cookbooks/video/video-extract-frames) - Work with video frames # Summarize podcasts and audio Source: https://docs.pixeltable.com/howto/cookbooks/audio/audio-summarize-podcast Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Transcribe audio files and generate summaries automatically using Whisper and LLMs. ## Problem You have podcast episodes, meeting recordings, or interviews that need both transcription and summarization. Doing this manually is time-consuming and doesn’t scale.
## Solution **What’s in this recipe:** * Transcribe audio with Whisper (runs locally) * Generate summaries with an LLM * Chain transcription → summarization automatically You create a pipeline where audio is transcribed first, then the transcript is summarized. Both steps run automatically when you insert new audio files. ### Setup ```python theme={null} %pip install -qU pixeltable openai-whisper openai ``` ```python theme={null} import getpass import os if 'OPENAI_API_KEY' not in os.environ: os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ') ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions import openai, whisper ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('podcast_demo', force=True) pxt.create_dir('podcast_demo') ```
  Created directory 'podcast\_demo'.
### Create the pipeline Create a table with audio input, then add computed columns for transcription and summarization: ```python theme={null} # Create table for audio files podcasts = pxt.create_table( 'podcast_demo/episodes', {'title': pxt.String, 'audio': pxt.Audio} ) ```
  Created table 'episodes'.
```python theme={null} # Step 1: Transcribe with local Whisper (uses GPU if available) podcasts.add_computed_column( transcription=whisper.transcribe(podcasts.audio, model='base.en') ) ```
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Extract the text from transcription result (cast to String for concatenation) podcasts.add_computed_column( transcript_text=podcasts.transcription.text.astype(pxt.String) ) ```
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Step 2: Summarize the transcript with OpenAI summary_prompt = ( """Summarize this transcript in 2-3 sentences, then list 3 key points. Transcript: """ + podcasts.transcript_text ) podcasts.add_computed_column( summary_response=openai.chat_completions( messages=[{'role': 'user', 'content': summary_prompt}], model='gpt-4o-mini', ) ) ```
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Extract summary text from response podcasts.add_computed_column( summary=podcasts.summary_response.choices[0].message.content ) ```
  Added 0 column values with 0 errors.
  No rows affected.
### Process audio files Insert audio files and watch the pipeline run automatically: ```python theme={null} # Insert sample audio audio_url = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/10-minute%20tour%20of%20Pixeltable.mp3' podcasts.insert([{'title': 'Pixeltable Tour', 'audio': audio_url}]) ```
  Inserting rows into \`episodes\`: 1 rows \[00:00, 185.18 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 8 values computed.
```python theme={null} # View transcript podcasts.select(podcasts.title, podcasts.transcript_text).collect() ```
```python theme={null} # View summary podcasts.select(podcasts.title, podcasts.summary).collect() ```
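Outside Pixeltable, the chain of computed columns can be modeled in plain Python. This is a conceptual sketch with hypothetical stand-in functions (`fake_transcribe`, `fake_summarize`), not Pixeltable's execution engine:

```python
# Each computed column is a function of earlier columns. Inserting a row
# evaluates the functions in dependency order, mirroring how Pixeltable
# runs the pipeline above.
def fake_transcribe(audio: str) -> dict:
    return {'text': f'transcript of {audio}'}

def fake_summarize(text: str) -> str:
    return f'summary of: {text}'

pipeline = [
    ('transcription', lambda row: fake_transcribe(row['audio'])),
    ('transcript_text', lambda row: row['transcription']['text']),
    ('summary', lambda row: fake_summarize(row['transcript_text'])),
]

def insert_row(row: dict) -> dict:
    for name, fn in pipeline:
        row[name] = fn(row)  # each step sees the outputs of earlier steps
    return row

row = insert_row({'title': 'Demo', 'audio': 'episode.mp3'})
print(row['summary'])
```

Each new row flows through every step automatically, which is exactly the behavior the computed columns give you.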
## Explanation

**Pipeline architecture:**
  Audio → Whisper transcription → Transcript text → LLM summarization → Summary
Each step is a computed column that depends on the previous one. When you insert a new audio file, all steps run automatically in sequence. **Whisper model options:** This example uses `base.en`, which balances speed and accuracy for English audio.
For production with varied audio quality, use `small.en` or larger. ## See also * [Transcribe audio](/howto/cookbooks/audio/audio-transcribe) - Basic audio transcription * [Summarize text](/howto/cookbooks/text/text-summarize) - Text summarization patterns # Convert text to speech Source: https://docs.pixeltable.com/howto/cookbooks/audio/audio-text-to-speech Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Generate natural-sounding audio from text using OpenAI’s text-to-speech models. ## Problem You need to convert text content into spoken audio—for accessibility, content repurposing, or voice applications.
## Solution **What’s in this recipe:** * Generate speech with OpenAI TTS * Choose from multiple voice options * Store text and audio together You add a computed column that converts text to audio. The audio is cached and only regenerated when the source text changes. ### Setup ```python theme={null} %pip install -qU pixeltable openai ``` ```python theme={null} import getpass import os if 'OPENAI_API_KEY' not in os.environ: os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ') ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions.openai import speech ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('tts_demo', force=True) pxt.create_dir('tts_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'tts\_demo'.
### Create text-to-speech pipeline ```python theme={null} # Create table for articles articles = pxt.create_table( 'tts_demo/articles', {'title': pxt.String, 'content': pxt.String} ) ```
  Created table 'articles'.
```python theme={null} # Add audio generation column articles.add_computed_column( audio=speech(articles.content, model='tts-1', voice='alloy') ) ```
  Added 0 column values with 0 errors.
  No rows affected.
### Generate audio ```python theme={null} # Insert sample articles sample_articles = [ { 'title': 'Welcome to AI', 'content': 'Artificial intelligence is transforming how we work and live. From smart assistants to autonomous vehicles, AI is becoming part of our daily lives.', }, { 'title': 'Getting Started', 'content': 'To begin your journey with machine learning, start by understanding the basics of data preparation and model training.', }, ] articles.insert(sample_articles) ```
  Inserting rows into \`articles\`: 2 rows \[00:00, 423.90 rows/s]
  Inserted 2 rows with 0 errors.
  2 rows inserted, 6 values computed.
```python theme={null} # View articles with generated audio articles.select( articles.title, articles.content, articles.audio ).collect() ```
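The caching behavior this recipe relies on (speech is generated once per text value and reused on later queries) can be sketched in plain Python. `cached_tts` is a hypothetical stand-in, not part of the Pixeltable or OpenAI APIs:

```python
calls = 0
cache: dict[str, str] = {}

def cached_tts(text: str) -> str:
    """Stand-in for the speech() computed column: one API call per distinct text."""
    global calls
    if text not in cache:
        calls += 1                      # only pay for new or changed text
        cache[text] = f'audio({text})'  # stand-in for the real TTS output
    return cache[text]

cached_tts('Welcome to AI')
cached_tts('Welcome to AI')  # served from cache, no second call
```

Querying the table re-reads stored audio; only changing the `content` value triggers regeneration.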
## Explanation **OpenAI TTS models:** `tts-1` is optimized for speed and low latency; `tts-1-hd` produces higher-quality audio at the cost of speed.
**Voice options:** The `voice` parameter selects one of OpenAI's built-in voices: `alloy` (used above), `echo`, `fable`, `onyx`, `nova`, or `shimmer`.
**Tips:** * Use `tts-1` for drafts and real-time applications * Use `tts-1-hd` for final production audio * Audio is cached—no regeneration on queries ## See also * [Transcribe audio](/howto/cookbooks/audio/audio-transcribe) - Convert audio to text * [Summarize podcasts](/howto/cookbooks/audio/audio-summarize-podcast) - Transcribe and summarize audio # Transcribe audio files with Whisper Source: https://docs.pixeltable.com/howto/cookbooks/audio/audio-transcribe Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Convert speech to text locally using OpenAI’s open-source Whisper model—no API key needed. ## Problem You have audio or video files that need transcription. Long files are memory-intensive to process at once, so you need to split them into manageable segments.
## Solution **What’s in this recipe:** * Transcribe audio files locally with Whisper (no API key) * Automatically segment long files * Extract and transcribe audio from videos You create a view with `audio_splitter` to break long files into segments, then add a computed column for transcription. Whisper runs locally on your machine—no API calls needed. ### Setup ```python theme={null} %pip install -qU pixeltable openai-whisper ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions import whisper from pixeltable.functions.audio import audio_splitter ``` ### Load audio files ```python theme={null} # Create a fresh directory pxt.drop_dir('audio_demo', force=True) pxt.create_dir('audio_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Converting metadata from version 45 to 46
  Created directory 'audio\_demo'.
```python theme={null} # Create table for audio files audio = pxt.create_table('audio_demo/files', {'audio': pxt.Audio}) ```
  Created table 'files'.
```python theme={null} # Insert a sample audio file (video files also work - audio is extracted automatically) audio.insert( [ { 'audio': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/audio-transcription-demo/Lex-Fridman-Podcast-430-Excerpt-0.mp4' } ] ) ```
  Inserted 1 row with 0 errors in 1.05 s (0.95 rows/s)
  1 row inserted.
### Split into segments Create a view that splits audio into 30-second segments with overlap: ```python theme={null} # Split audio into segments for transcription segments = pxt.create_view( 'audio_demo/segments', audio, iterator=audio_splitter( audio.audio, duration=30.0, # 30-second segments overlap=2.0, # 2-second overlap for context min_segment_duration=5.0, # Drop segments shorter than 5 seconds ), ) ``` ```python theme={null} # View the segments segments.select(segments.segment_start, segments.segment_end).collect() ```
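To see how `duration`, `overlap`, and `min_segment_duration` interact, here is a rough plan of the resulting segment boundaries in plain Python. This sketches the parameter semantics only, not the iterator's actual implementation:

```python
def plan_segments(
    total: float, duration: float, overlap: float, min_segment: float
) -> list[tuple[float, float]]:
    """Approximate (start, end) boundaries implied by the splitter parameters."""
    segments = []
    start = 0.0
    step = duration - overlap  # each segment starts `overlap` seconds early
    while start < total:
        end = min(start + duration, total)
        if end - start >= min_segment:  # drop a too-short trailing segment
            segments.append((start, end))
        start += step
    return segments

# A hypothetical 95-second file with the settings used above:
print(plan_segments(95.0, duration=30.0, overlap=2.0, min_segment=5.0))
```

Each segment overlaps its neighbor by two seconds, so words cut at a boundary still appear intact in one of the transcripts.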
### Transcribe with Whisper Add a computed column that transcribes each segment: ```python theme={null} # Add transcription column (runs locally - no API key needed) segments.add_computed_column( transcription=whisper.transcribe( audio=segments.audio_segment, model='base.en', # Options: tiny.en, base.en, small.en, medium.en, large ) ) ```
  Added 2 column values with 0 errors in 3.35 s (0.60 rows/s)
  2 rows updated.
```python theme={null} # Extract just the text segments.add_computed_column(text=segments.transcription.text) ```
  Added 2 column values with 0 errors in 0.06 s (31.82 rows/s)
  2 rows updated.
```python theme={null} # View transcriptions with timestamps segments.select( segments.segment_start, segments.segment_end, segments.text ).collect() ```
## Explanation **Whisper models:** Available sizes are `tiny`, `base`, `small`, `medium`, and `large`; larger models are more accurate but slower.
Models ending in `.en` are English-only and faster. Remove `.en` for multilingual support. **audio\_splitter parameters:** `duration` sets the segment length in seconds, `overlap` carries context across segment boundaries, and `min_segment_duration` drops a trailing segment shorter than the given length.
**Video files work too:** When you insert a video file, Pixeltable automatically extracts the audio track. ## See also * [Iterators documentation](/platform/iterators) * [Whisper library](https://github.com/openai/whisper) # Create custom aggregate functions (UDAs) Source: https://docs.pixeltable.com/howto/cookbooks/core/custom-aggregates-uda Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Build reusable aggregation logic for group-by queries and analytics. ## Problem You need aggregations beyond the built-in `sum`, `count`, `mean`, `min`, `max` — such as collecting values into a list, concatenating strings, or computing custom statistics. ## Solution **What’s in this recipe:** * Define a UDA (User-Defined Aggregate) with the `@pxt.uda` decorator * Use UDAs in `group_by` queries * Create UDAs with multiple inputs ### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import pixeltable as pxt pxt.drop_dir('uda_demo', force=True) pxt.create_dir('uda_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'uda\_demo'.
### Create sample data ```python theme={null} sales = pxt.create_table( 'uda_demo/sales', { 'region': pxt.String, 'product': pxt.String, 'amount': pxt.Float, 'quantity': pxt.Int, }, ) sales.insert( [ { 'region': 'North', 'product': 'Widget', 'amount': 100.0, 'quantity': 5, }, { 'region': 'North', 'product': 'Gadget', 'amount': 250.0, 'quantity': 2, }, { 'region': 'North', 'product': 'Widget', 'amount': 150.0, 'quantity': 8, }, { 'region': 'South', 'product': 'Widget', 'amount': 200.0, 'quantity': 10, }, { 'region': 'South', 'product': 'Gadget', 'amount': 175.0, 'quantity': 3, }, { 'region': 'East', 'product': 'Widget', 'amount': 125.0, 'quantity': 6, }, ] ) sales.collect() ```
  Created table 'sales'.

  Inserting rows into \`sales\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`sales\`: 6 rows \[00:00, 609.56 rows/s]
  Inserted 6 rows with 0 errors.
### Variance UDA (not built-in) ```python theme={null} # A UDA is a class that inherits from pxt.Aggregator # It must implement: __init__, update, and value @pxt.uda class variance(pxt.Aggregator): """Compute population variance using Welford's online algorithm.""" def __init__(self): self.count = 0 self.mean = 0.0 self.m2 = 0.0 # Sum of squared differences from mean def update(self, val: float) -> None: if val is not None: self.count += 1 delta = val - self.mean self.mean += delta / self.count delta2 = val - self.mean self.m2 += delta * delta2 def value(self) -> float: if self.count < 1: return 0.0 return self.m2 / self.count # Population variance ``` ```python theme={null} # Use like any built-in aggregate sales.select(variance(sales.amount)).collect() ```
```python theme={null} # Use in group_by queries sales.group_by(sales.region).select( sales.region, amount_variance=variance(sales.amount) ).collect() ```
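Outside Pixeltable, you can sanity-check the Welford recurrence used in the UDA against the direct two-pass variance formula:

```python
def welford_variance(xs: list[float]) -> float:
    """Single-pass population variance, same recurrence as the UDA above."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return m2 / count if count else 0.0

amounts = [100.0, 250.0, 150.0, 200.0, 175.0, 125.0]  # the sample amounts above
mu = sum(amounts) / len(amounts)
two_pass = sum((x - mu) ** 2 for x in amounts) / len(amounts)
assert abs(welford_variance(amounts) - two_pass) < 1e-9
```

Welford's algorithm matters here because `update()` sees one row at a time: it never has the whole column in memory, yet stays numerically stable.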
### String concatenation UDA ```python theme={null} @pxt.uda class string_agg(pxt.Aggregator): """Concatenate strings with a comma separator.""" def __init__(self): self.values = [] def update(self, val: str) -> None: if val is not None: self.values.append(val) def value(self) -> str: return ', '.join(self.values) ``` ```python theme={null} # List all products sold in each region sales.group_by(sales.region).select( sales.region, products=string_agg(sales.product) ).collect() ```
### Collect values into a list ```python theme={null} @pxt.uda class collect_list(pxt.Aggregator): """Collect all values into a list.""" def __init__(self): self.items = [] def update(self, val: float) -> None: if val is not None: self.items.append(val) def value(self) -> list[float]: return self.items ``` ```python theme={null} # Get all amounts per region as a list sales.group_by(sales.region).select( sales.region, amounts=collect_list(sales.amount) ).collect() ```
### Weighted average UDA ```python theme={null} @pxt.uda class weighted_avg(pxt.Aggregator): """Compute weighted average: sum(value * weight) / sum(weight).""" def __init__(self): self.weighted_sum = 0.0 self.weight_sum = 0.0 def update(self, value: float, weight: float) -> None: if value is not None and weight is not None: self.weighted_sum += value * weight self.weight_sum += weight def value(self) -> float: if self.weight_sum == 0: return 0.0 return self.weighted_sum / self.weight_sum ``` ```python theme={null} # Compute quantity-weighted average price per region sales.group_by(sales.region).select( sales.region, avg_price=weighted_avg(sales.amount, sales.quantity) ).collect() ```
### Mode UDA (most frequent value) ```python theme={null} from collections import Counter from typing import Optional @pxt.uda class mode(pxt.Aggregator): """Find the most frequent value in a group.""" def __init__(self): self.counts = Counter() def update(self, val: str) -> None: if val is not None: self.counts[val] += 1 def value(self) -> Optional[str]: if not self.counts: return None return self.counts.most_common(1)[0][0] ``` ```python theme={null} # Find most common product per region sales.group_by(sales.region).select( sales.region, top_product=mode(sales.product) ).collect() ```
## Explanation **UDA structure:** ```python theme={null} @pxt.uda class my_aggregate(pxt.Aggregator): def __init__(self): # Initialize state self.state = initial_value def update(self, val: InputType) -> None: # Called for each row # Update internal state with val def value(self) -> OutputType: # Called at the end return self.state ``` **Key points:** * Always handle `None` values in `update()` * Multiple parameters in `update()` enable multi-column aggregations (like `weighted_avg`) * Return type annotation on `value()` determines output column type ## See also * [UDFs in Pixeltable](../../../platform/udfs-in-pixeltable) - Complete guide to custom functions * [Join tables](/howto/cookbooks/core/query-join-tables) - Combine data before aggregating # Split data into multiple rows with iterators Source: https://docs.pixeltable.com/howto/cookbooks/core/data-split-rows Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Transform a single document, video, image, or audio file into multiple rows for granular processing. **What’s in this recipe:** * Split documents into text chunks for RAG * Extract frames or segments from videos * Tile images for high-resolution analysis * Chunk audio files for transcription ## Problem You have documents, videos, or text that you need to break into smaller pieces for processing. A PDF needs to be split into chunks for retrieval-augmented generation. A video needs individual frames for analysis. Text needs to be divided into sentences or sliding windows. You need a way to transform one source row into multiple output rows automatically. ## Solution You create views with iterator functions that split source data into multiple rows. Pixeltable provides built-in iterators for documents, videos, images, audio, and strings. 
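Conceptually, an iterator maps one source row to many view rows. A minimal plain-Python sketch (`sentence_iterator` is a hypothetical generator, not the Pixeltable iterator API):

```python
from typing import Iterator

def sentence_iterator(row: dict) -> Iterator[dict]:
    """One source row in, one output row per sentence out."""
    for pos, sentence in enumerate(row['content'].split('. ')):
        yield {'pos': pos, 'text': sentence.rstrip('.')}

source = {'content': 'Pipelines update incrementally. Views store the pieces. Queries stay simple.'}
view_rows = list(sentence_iterator(source))
print(len(view_rows))  # 3 view rows from 1 source row
```

A Pixeltable view built on an iterator works the same way: each row of the base table fans out into zero or more rows of the view, and the view rows are recomputed automatically when the source row changes.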
### Setup ```python theme={null} %pip install -qU pixeltable spacy tiktoken !python -m spacy download en_core_web_sm -q ``` ```python theme={null} import pixeltable as pxt ``` ### Split documents into chunks Use `document_splitter` to break documents (PDF, HTML, Markdown, TXT) into text chunks. ```python theme={null} from pixeltable.functions.document import document_splitter pxt.drop_dir('split_demo', force=True) pxt.create_dir('split_demo') docs = pxt.create_table('split_demo/docs', {'doc': pxt.Document}) docs.insert( [ { 'doc': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Jefferson-Amazon.pdf' } ] ) ```
  Inserted 1 row with 0 errors in 0.13 s (7.68 rows/s)
  1 row inserted.
```python theme={null} chunks = pxt.create_view( 'split_demo/doc_chunks', docs, iterator=document_splitter( docs.doc, separators='sentence,token_limit', limit=300 ), ) chunks.select(chunks.text).limit(3).collect() ```
**Available separators:** * `heading` — Split on HTML/Markdown headings * `sentence` — Split on sentence boundaries (requires spacy) * `token_limit` — Split by token count (requires tiktoken) * `char_limit` — Split by character count * `page` — Split by page (PDF only) [SDK Reference: document\_splitter](/sdk/latest/document) ### Extract frames from videos Use `frame_iterator` to extract frames at specified intervals. ```python theme={null} from pixeltable.functions.video import frame_iterator videos = pxt.create_table('split_demo/videos', {'video': pxt.Video}) videos.insert( [ { 'video': 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/bangkok.mp4' } ] ) ```
  Inserted 1 row with 0 errors in 1.28 s (0.78 rows/s)
  1 row inserted.
```python theme={null} frames = pxt.create_view( 'split_demo/frames', videos, iterator=frame_iterator(videos.video, fps=1.0), ) frames.select(frames.frame, frames.frame_attrs).limit(3).collect() ```
**frame\_iterator options:** * `fps` — Frames per second to extract * `num_frames` — Extract exact number of frames (evenly spaced) * `keyframes_only` — Extract only keyframes [SDK Reference: frame\_iterator](/sdk/latest/video) ### Split videos into segments Use `video_splitter` to divide videos into smaller clips. ```python theme={null} from pixeltable.functions.video import video_splitter segments = pxt.create_view( 'split_demo/segments', videos, iterator=video_splitter( videos.video, duration=5.0, min_segment_duration=1.0 ), ) segments.select( segments.segment_start, segments.segment_end, segments.video_segment ).limit(3).collect() ```
**video\_splitter options:** * `duration` — Duration of each segment in seconds * `overlap` — Overlap between segments in seconds * `min_segment_duration` — Drop last segment if shorter than this [SDK Reference: video\_splitter](/sdk/latest/video) ### Split strings into sentences Use `string_splitter` to divide text into sentences. ```python theme={null} from pixeltable.functions.string import string_splitter texts = pxt.create_table('split_demo/texts', {'content': pxt.String}) texts.insert( [ { 'content': 'AI data infrastructure simplifies ML workflows. Declarative pipelines update incrementally. This makes development faster and more maintainable.' } ] ) ```
  Inserted 1 row with 0 errors in 0.03 s (38.38 rows/s)
  1 row inserted.
```python theme={null} sentences = pxt.create_view( 'split_demo/sentences', texts, iterator=string_splitter(texts.content, separators='sentence'), ) sentences.select(sentences.text).collect() ```
[SDK Reference: string\_splitter](/sdk/latest/string) ### Tile images for analysis Use `tile_iterator` to divide large images into a grid of smaller tiles. This is useful for processing high-resolution images that are too large to analyze at once, or for running object detection on different regions. ```python theme={null} from pixeltable.functions.image import tile_iterator images = pxt.create_table('split_demo/images', {'image': pxt.Image}) images.insert( [ { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/pixeltable-logo-large.png' } ] ) ```
  Inserted 1 row with 0 errors in 0.09 s (11.69 rows/s)
  1 row inserted.
```python theme={null} tiles = pxt.create_view( 'split_demo/tiles', images, iterator=tile_iterator(images.image, tile_size=(100, 100)), ) ``` **tile\_iterator options:** * `tile_size` — Size of each tile as `(width, height)` * `overlap` — Overlap between adjacent tiles as `(width, height)` [SDK Reference: tile\_iterator](/sdk/latest/image) ```python theme={null} tiles.select(tiles.tile_coord, tiles.tile).sample(n=4).collect() ```
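A rough model of how `tile_size` determines the grid, sketched in plain Python. This covers the geometry only; the real iterator also handles `overlap` and partial edge tiles:

```python
def tile_origins(img_w: int, img_h: int, tile_w: int, tile_h: int) -> list[tuple[int, int]]:
    """Top-left corner of each tile in a non-overlapping grid."""
    return [
        (x, y)
        for y in range(0, img_h, tile_h)
        for x in range(0, img_w, tile_w)
    ]

# A hypothetical 600x400 image with 100x100 tiles yields a 6x4 grid:
origins = tile_origins(600, 400, 100, 100)
print(len(origins))  # 24 tiles
```

Adding `overlap` shrinks the stride between tile origins, so adjacent tiles share pixels; this is useful when objects might straddle a tile boundary.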
### Split audio into chunks Use `audio_splitter` to divide audio files into time-based segments for transcription or analysis. ```python theme={null} from pixeltable.functions.audio import audio_splitter audio = pxt.create_table('split_demo/audio', {'audio': pxt.Audio}) audio.insert( [ { 'audio': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/10-minute%20tour%20of%20Pixeltable.mp3' } ] ) ```
  Inserted 1 row with 0 errors in 0.67 s (1.50 rows/s)
  1 row inserted.
```python theme={null} audio_segments = pxt.create_view( 'split_demo/audio_chunks', audio, iterator=audio_splitter(audio.audio, duration=30.0, overlap=2.0), ) audio_segments.select( audio_segments.segment_start, audio_segments.segment_end ).limit(5).collect() ```
**audio\_splitter options:** * `duration` — Duration of each chunk in seconds * `overlap` — Overlap between chunks in seconds * `min_segment_duration` — Drop last chunk if shorter than this [SDK Reference: audio\_splitter](/sdk/latest/audio) ## See also * [Split documents for RAG](/howto/cookbooks/text/doc-chunk-for-rag) * [Extract frames from videos](/howto/cookbooks/video/video-extract-frames) * [Transcribe audio files](/howto/cookbooks/audio/audio-transcribe) # Get fast feedback on transformations Source: https://docs.pixeltable.com/howto/cookbooks/core/dev-iterative-workflow Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. ## Problem You need to iterate on transformation logic before running it on your entire dataset—especially for expensive operations like API calls or model inference. ## Solution **What’s in this recipe:** * Test transformations on sample rows before applying to your full dataset * Save expressions as variables to guarantee consistent logic * Apply the iterate-then-add workflow with built-in functions, expressions, and custom UDFs * Annotate columns with comments and custom metadata using `ColumnSpec` You test transformation logic on sample rows before processing your entire dataset using the iterate-then-add workflow. This lets you validate logic on a few rows before committing to your full table. You use `.select()` with `.collect()` to preview transformations—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Once you’re satisfied with the results, use `.add_computed_column()` with the same expression to persist the transformation across your full table. 
This workflow applies to any data type in Pixeltable: images, videos, audio files, documents, and structured tabular data. This recipe uses text data and shows three examples: 1. Testing built-in functions on sample data 2. Saving expressions as variables to ensure consistency 3. Iterating with custom user-defined functions (UDFs) ### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import pixeltable as pxt ``` ### Create sample data ```python theme={null} # Create a fresh directory (drop existing if present) pxt.drop_dir('demo_project', force=True) pxt.create_dir('demo_project') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'demo\_project'.
```python theme={null} t = pxt.create_table('demo_project/lyrics', {'text': pxt.String}) ```
  Created table 'lyrics'.
```python theme={null} t.insert( [ {'text': 'Tumble out of bed and I stumble to the kitchen'}, {'text': 'Pour myself a cup of ambition'}, {'text': 'And yawn and stretch and try to come to life'}, {'text': "Jump in the shower and the blood starts pumpin'"}, {'text': "Out on the street, the traffic starts jumpin'"}, {'text': 'With folks like me on the job from nine to five'}, ] ) ```
  Inserted 6 rows with 0 errors in 0.01 s (916.65 rows/s)
  6 rows inserted.
### Example 1: built-in functions Iterate with built-in functions, then add to the table. ```python theme={null} # Test uppercase transformation on subset t.select(t.text, uppercase=t.text.upper()).head(2) ```
```python theme={null} # Confirm the transformation was only in memory—table unchanged t.head(2) ```
```python theme={null} # Apply to all rows (same expression) t.add_computed_column(uppercase=t.text.upper()) ```
  Added 6 column values with 0 errors in 0.04 s (158.08 rows/s)
  6 rows updated.
```python theme={null} # View text with uppercase column t.collect() ```
### Example 2: save and reuse expressions Save an expression as a variable to guarantee the same logic in both iterate and add steps. ```python theme={null} # Define the expression once - no duplication char_count_expr = t.text.len() # Iterate: Test on subset t.select(t.text, char_count=char_count_expr).head(2) ```
```python theme={null} # Confirm the transformation was only in memory—table unchanged t.head(2) ```
```python theme={null} # Add: Use the SAME expression to persist t.add_computed_column(char_count=char_count_expr) ```
  Added 6 column values with 0 errors in 0.02 s (348.64 rows/s)
  6 rows updated.
```python theme={null} # View text with char_count column t.collect() ```
This pattern works with any expression: * Built-in functions: `resize_expr = t.image.resize((224, 224))` * UDFs: `watermark_expr = add_watermark(t.image, '© 2024')` * Chained operations: `processed_expr = t.image.resize((224, 224)).rotate(90)` Benefits: * Write the expression once, use it twice * No copy-paste—reuse the same logic * Easy to iterate: change in one place, test again ### Example 3: custom UDF Iterate with a user-defined function, then add to the table. ```python theme={null} # Define a custom transformation @pxt.udf def word_count(text: str) -> int: return len(text.split()) ``` ```python theme={null} # Iterate: Test UDF on subset t.select(t.text, word_count=word_count(t.text)).head(2) ```
```python theme={null} # Confirm the transformation was only in memory—table unchanged t.head(2) ```
```python theme={null} # Add: Apply to all rows (same expression) t.add_computed_column(word_count=word_count(t.text)) ```
  Added 6 column values with 0 errors in 0.02 s (312.11 rows/s)
  6 rows updated.
```python theme={null} # View text with word_count column t.collect() ```
### Example 4: annotate columns with metadata Use `ColumnSpec` to attach a comment or custom metadata when adding columns. Comments appear in `describe()` output, while `custom_metadata` stores arbitrary data (tags, version info, config) that you can retrieve with `get_metadata()`. ```python theme={null} from pixeltable.types import ColumnSpec # Add a column with a comment and custom metadata t.add_column( source=ColumnSpec( type=pxt.String, comment='Original source URL or file path', custom_metadata={'added_by': 'data_team', 'version': 2}, ) ) t.describe() ```
## Explanation **How the iterate-then-add workflow works:** Queries and computed columns serve different purposes. Queries let you test transformations on sample rows without storing anything. Once you’re satisfied with the results, you use the exact same expression with `.add_computed_column()` to persist it across your entire table. This workflow is especially valuable for expensive operations—API calls, model inference, complex image processing—where you want to validate logic before processing your full dataset. Test on 2-3 rows to catch errors early, then commit once. **To customize this workflow:** * **Sample size**: Use `.head(n)` to collect only the first n rows—`.head(1)` for single-row testing, `.head(10)` for broader validation, or `.collect()` to collect all rows * **Save expressions**: Store transformations as variables (Example 2) to guarantee identical logic in both iterate and add steps * **Chain transformations**: Test multiple operations together—`.select(t.text.upper().split())` works just like single operations * **Use with any data type**: This pattern works with images, videos, audio, documents—not just text. For multimodal data, visual inspection during iteration is especially valuable **The Pixeltable workflow:** In traditional databases, `.select()` just picks which columns to view. In Pixeltable, `.select()` also lets you compute new transformations on the fly—define new columns without storing them. This makes `.select()` perfect for testing transformations before you commit them. When you use `.select()`, you’re creating a query. Queries are temporary operations that retrieve and transform data from tables—they don’t store anything. Queries use lazy evaluation, meaning they don’t execute until you call `.collect()`. You must use `.collect()` to execute the query and return results. `.head(n)` is a convenience method that collects only the first n rows instead of all rows. 
Use `.head(n)` when iterating to get fast feedback without processing your entire dataset. Nothing is stored in your table when you run queries. You can test different approaches quickly without affecting your data. You can store query results in a Python variable to work with them in your session. ```python theme={null} # Store query results as a variable (in memory only) results = t.select( t.text, uppercase=t.text.upper() # Label the transformed column ).head(3) ``` These results are stored in memory and will not persist across sessions—only `.add_computed_column()` persists data to your table. Once you’re satisfied, `.add_computed_column()` uses the same expression but adds it as a persistent column in your table. Now the transformation runs on all rows and results are stored permanently. ## See also * [Transform images with PIL operations](/howto/cookbooks/images/img-pil-transforms) * [Convert RGB images to grayscale](/howto/cookbooks/images/img-rgb-to-grayscale) * [Apply filters to images](/howto/cookbooks/images/img-apply-filters) # Join tables to combine data Source: https://docs.pixeltable.com/howto/cookbooks/core/query-join-tables Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Combine data from multiple tables using inner, left, and cross joins. ## Problem You have related data in separate tables and need to combine them for analysis—customers with orders, products with inventory, or media with metadata.
## Solution **What’s in this recipe:** * Inner join to match rows from both tables * Left join to keep all rows from the first table * Cross join for Cartesian product (all combinations) * Join with filtering, aggregation, and saving results * Paginate results with `limit()` and `offset` Use `table1.join(table2, on=..., how=...)` to combine tables based on matching columns. ### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import pixeltable as pxt import pixeltable.functions as pxtf ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('join_demo', force=True) pxt.create_dir('join_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'join\_demo'.
### Create sample tables ```python theme={null} # Create a customers table customers = pxt.create_table( 'join_demo/customers', {'customer_id': pxt.Int, 'name': pxt.String, 'email': pxt.String}, ) customers.insert( [ {'customer_id': 1, 'name': 'Alice', 'email': 'alice@example.com'}, {'customer_id': 2, 'name': 'Bob', 'email': 'bob@example.com'}, { 'customer_id': 3, 'name': 'Charlie', 'email': 'charlie@example.com', }, ] ) customers.collect() ```
  Created table 'customers'.
  Inserted 3 rows with 0 errors in 0.01 s (385.68 rows/s)
```python theme={null} # Create an orders table orders = pxt.create_table( 'join_demo/orders', { 'order_id': pxt.Int, 'customer_id': pxt.Int, 'product': pxt.String, 'amount': pxt.Float, }, ) orders.insert( [ { 'order_id': 101, 'customer_id': 1, 'product': 'Laptop', 'amount': 999.00, }, { 'order_id': 102, 'customer_id': 1, 'product': 'Mouse', 'amount': 29.00, }, { 'order_id': 103, 'customer_id': 2, 'product': 'Keyboard', 'amount': 79.00, }, { 'order_id': 104, 'customer_id': 4, 'product': 'Monitor', 'amount': 299.00, }, # No matching customer ] ) orders.collect() ```
  Created table 'orders'.
  Inserted 4 rows with 0 errors in 0.01 s (657.81 rows/s)
### Inner join (matching rows only) ```python theme={null} # Inner join: only rows that match in both tables customers.join( orders, on=customers.customer_id == orders.customer_id, how='inner' ).select(customers.name, orders.product, orders.amount).collect() ```
### Left join (keep all from first table) ```python theme={null} # Left join: all customers, with order data where available # Charlie has no orders, so product/amount will be null customers.join( orders, on=customers.customer_id == orders.customer_id, how='left' ).select(customers.name, orders.product, orders.amount).collect() ```
### Join with filtering ```python theme={null} # Combine join with where clause to filter results customers.join( orders, on=customers.customer_id == orders.customer_id, how='inner' ).where(orders.amount > 50).select( customers.name, customers.email, orders.product, orders.amount ).collect() ```
### Join with aggregation ```python theme={null} # Join and aggregate: total spending per customer customers.join( orders, on=customers.customer_id == orders.customer_id, how='inner' ).group_by(customers.name).select( customers.name, total_spent=pxtf.sum(orders.amount), order_count=pxtf.count(orders.order_id), ).collect() ```
### Cross join (all combinations) ```python theme={null} # Cross join: every customer paired with every product (no 'on' condition) products = pxt.create_table( 'join_demo/products', {'product': pxt.String, 'price': pxt.Float} ) products.insert( [ {'product': 'Widget', 'price': 19.99}, {'product': 'Gadget', 'price': 29.99}, ] ) customers.join(products, how='cross').select( customers.name, products.product, products.price ).collect() ```
  Created table 'products'.
  Inserted 2 rows with 0 errors in 0.00 s (422.52 rows/s)
### Save join results to a new table ```python theme={null} # Build a join query and collect as DataFrame customer_orders_df = ( customers.join( orders, on=customers.customer_id == orders.customer_id, how='inner', ) .select( name=customers.name, email=customers.email, product=orders.product, amount=orders.amount, ) .collect() .to_pandas() ) customer_orders_df ```
```python theme={null} # Create a new table from the DataFrame orders_report = pxt.create_table( 'join_demo/orders_report', source=customer_orders_df ) orders_report.collect() ```
  Created table 'orders\_report'.
  Inserted 3 rows with 0 errors in 0.01 s (500.32 rows/s)
### Paginate results with limit and offset Use `limit(n, offset=k)` to retrieve results in pages. This is useful for displaying results incrementally or building paginated APIs. ```python theme={null} # Page 1: first 2 rows orders.order_by(orders.order_id).limit(2).collect() ```
```python theme={null} # Page 2: next 2 rows (skip the first 2) orders.order_by(orders.order_id).limit(2, offset=2).collect() ```
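For a paginated API, the two-page pattern above generalizes to a loop. Below is a minimal sketch that assumes only the `limit(n, offset=k)` / `collect()` interface shown in this recipe; `iter_pages` is a hypothetical helper, not part of Pixeltable:

```python
def iter_pages(query, page_size):
    """Yield successive pages from any query supporting limit(n, offset=k).collect()."""
    offset = 0
    while True:
        page = query.limit(page_size, offset=offset).collect()
        if len(page) == 0:  # past the last row: stop
            break
        yield page
        offset += page_size

# Usage with the orders table from this recipe:
# for page in iter_pages(orders.order_by(orders.order_id), page_size=2):
#     print(page)
```

Remember to apply `order_by()` to the query first, so each page is drawn from a deterministic ordering.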
## Explanation

**Join types:**

| `how` | Rows returned |
| --- | --- |
| `'inner'` | Only rows with a match in both tables |
| `'left'` | All rows from the first table, with `null`s where there is no match |
| `'cross'` | Every combination of rows from both tables (no `on` condition) |
**Join syntax:** ```python theme={null} # Simple: join on column by name t1.join(t2, on=t1.id) # Explicit predicate t1.join(t2, on=t1.customer_id == t2.customer_id) # Composite key t1.join(t2, on=(t1.pk1 == t2.pk1) & (t1.pk2 == t2.pk2)) ``` **Aggregation functions:** ```python theme={null} from pixeltable.functions import sum, count, mean, min, max # Use as functions, not methods total=sum(t.amount) num_rows=count(t.id) ``` **Saving join results:** ```python theme={null} # Collect as DataFrame, then create table df = query.select(name=t.col, ...).collect().to_pandas() new_table = pxt.create_table('path', source=df) ``` **Pagination:** ```python theme={null} # limit(n) returns at most n rows # limit(n, offset=k) skips the first k rows, then returns n query.order_by(t.id).limit(10) # rows 0-9 query.order_by(t.id).limit(10, offset=10) # rows 10-19 ``` **Tips:** * Use explicit predicates (`t1.col == t2.col`) for clarity * Chain `.where()` after join to filter results * Chain `.group_by()` for aggregations * Use `'left'` join when the first table is your “main” table * Use named columns in `.select(name=col)` for clean column names * Always use `.order_by()` with pagination to get deterministic page ordering ## See also * [Look up structured data](/howto/cookbooks/agents/pattern-data-lookup) - Use retrieval UDFs for lookups * [Sample data for training](/howto/cookbooks/data/data-sampling) - Sample from joined results # Time Zones Source: https://docs.pixeltable.com/howto/cookbooks/core/time-zones Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Because typical use cases involve datasets that span multiple time zones, Pixeltable strives to be precise in how it handles time zone arithmetic for datetimes. 
Timestamps are always stored in the Pixeltable database in UTC, to ensure consistency across datasets and deployments. Time zone considerations therefore apply during insertion and retrieval of timestamp data. ```python theme={null} %pip install -qU pixeltable ``` ### The default time zone Every Pixeltable deployment has a **default time zone**. The default time zone can be configured either by setting the `PIXELTABLE_TIME_ZONE` environment variable, or by adding a `time-zone` entry to the `[pixeltable]` section in `$PIXELTABLE_HOME/config.toml`. It must be a valid [IANA Time Zone](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones). (See the [Pixeltable Configuration](/platform/configuration) guide for more details on configuration options.) ```python theme={null} import os os.environ['PIXELTABLE_TIME_ZONE'] = 'America/Los_Angeles' ``` If no time zone is configured, then Pixeltable will fall back on the system time zone of the host on which it is running. **Because system time zone is deployment-dependent, it is recommended that production deployments configure a default time zone explicitly.** As outlined in the [Python datetime documentation](https://docs.python.org/3/library/datetime.html), a Python `datetime` object may be either **naive** (no time zone) or **aware** (equipped with an explicit time zone). Pixeltable will always interpret naive `datetime` objects as belonging to the configured default time zone. ### Insertion and retrieval When a `datetime` is inserted into the database, it will be converted to UTC and stored as an absolute timestamp. If the `datetime` has an explicit time zone, Pixeltable will use that time zone for the conversion; otherwise, Pixeltable will use the default time zone. When a `datetime` is retrieved, it will always be retrieved in the default time zone. To query in a different time zone, it is necessary to do an explicit conversion; we’ll give an example of this in a moment. 
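The naive/aware distinction, and the conversion to UTC that Pixeltable performs on insert, can be reproduced in plain Python with the standard library:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Naive: no tzinfo attached; Pixeltable would apply the default time zone
naive = datetime(2024, 8, 9, 23, 0, 0)
# Aware: carries an explicit IANA time zone
aware = datetime(2024, 8, 9, 23, 0, 0, tzinfo=ZoneInfo('America/Los_Angeles'))

print(naive.tzinfo)                    # None -> "naive"
print(aware.astimezone(timezone.utc))  # 2024-08-10 06:00:00+00:00 (PDT is UTC-7 in August)
```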
Let’s first walk through a few examples that illustrate the default behavior. ```python theme={null} import pixeltable as pxt pxt.drop_dir('tz_demo', force=True) pxt.create_dir('tz_demo') t = pxt.create_table( 'tz_demo/example', {'dt': pxt.Timestamp, 'note': pxt.String} ) ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'tz\_demo'.
  Created table 'example'.
```python theme={null} from datetime import datetime, timezone from zoneinfo import ZoneInfo naive_dt = datetime(2024, 8, 9, 23, 0, 0) explicit_dt = datetime( 2024, 8, 9, 23, 0, 0, tzinfo=ZoneInfo('America/Los_Angeles') ) other_dt = datetime( 2024, 8, 9, 23, 0, 0, tzinfo=ZoneInfo('America/New_York') ) t.insert( [ {'dt': naive_dt, 'note': 'No time zone specified (uses default)'}, { 'dt': explicit_dt, 'note': 'Time zone America/Los_Angeles was specified explicitly', }, { 'dt': other_dt, 'note': 'Time zone America/New_York was specified explicitly', }, ] ) ```
  Inserting rows into \`example\`: 3 rows \[00:00, 433.04 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 3 values computed.
On retrieval, all timestamps are normalized to the default time zone, regardless of how they were specified during insertion. ```python theme={null} t.collect() ```
To represent timestamps in a different time zone, use the `astimezone` method. ```python theme={null} t.select( t.dt, dt_new_york=t.dt.astimezone('America/New_York'), note=t.note ).collect() ```
### Timestamp methods and properties The Pixeltable API exposes all the standard `datetime` methods and properties from the Python library. Because retrieval uses the default time zone, they are all relative to the default time zone unless `astimezone` is used. ```python theme={null} t.select( t.dt, day_default=t.dt.day, day_eastern=t.dt.astimezone('America/New_York').day, ).collect() ```
Observe that the first two timestamps map to different dates depending on the time zone, as expected. # Track changes and revert to previous versions Source: https://docs.pixeltable.com/howto/cookbooks/core/version-control-history Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Undo mistakes, audit changes, and create point-in-time snapshots of your data. ## Problem You need to track what changed in your data pipeline, undo accidental modifications, or preserve a specific state for reproducibility. ## Solution **What’s in this recipe:** * View version history with `history()` and `get_versions()` * Access specific versions with `pxt.get_table('table:N')` * Undo changes with `revert()` * Create point-in-time snapshots with `pxt.create_snapshot()` ### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import pixeltable as pxt pxt.drop_dir('version_demo', force=True) pxt.create_dir('version_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'version\_demo'.
### Create a table and make some changes Every data or schema change creates a new version. ```python theme={null} # Create table (version 0) products = pxt.create_table( 'version_demo/products', {'name': pxt.String, 'price': pxt.Float, 'category': pxt.String}, ) ```
  Created table 'products'.
```python theme={null} # Insert data (version 1) products.insert( [ {'name': 'Widget', 'price': 9.99, 'category': 'Tools'}, {'name': 'Gadget', 'price': 24.99, 'category': 'Electronics'}, {'name': 'Gizmo', 'price': 14.99, 'category': 'Electronics'}, ] ) ```
  Inserting rows into \`products\`: 3 rows \[00:00, 432.95 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 6 values computed.
```python theme={null} # Add a computed column (version 2 - schema change) products.add_computed_column(price_with_tax=products.price * 1.08) ```
  Added 3 column values with 0 errors.
  3 rows updated, 6 values computed.
```python theme={null} # Update some data (version 3) products.update({'price': 19.99}, where=products.name == 'Widget') ```
  Inserting rows into \`products\`: 1 rows \[00:00, 297.47 rows/s]
  1 row updated, 3 values computed.
```python theme={null} # Insert more data (version 4) products.insert( [{'name': 'Thingamajig', 'price': 49.99, 'category': 'Tools'}] ) ```
  Inserting rows into \`products\`: 1 rows \[00:00, 661.46 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 3 values computed.
### View version history Use `history()` for a human-readable summary of all changes. ```python theme={null} # View full history (most recent first) products.history() ```
```python theme={null} # View only the last 3 versions products.history(n=3) ```
### Programmatic access to version metadata Use `get_versions()` to access version data programmatically. ```python theme={null} # Get version metadata as a list of dictionaries versions = products.get_versions() # Access specific version info latest = versions[0] latest['version'], latest['change_type'], latest['inserts'] ```
  (4, 'data', 1)
### Access a specific version Use `pxt.get_table('table_name:version')` to get a read-only handle to a specific version: ```python theme={null} # Get the table at version 1 (after initial insert, before computed column) products_v1 = pxt.get_table('version_demo/products:1') # This is a read-only view of the data at that point in time products_v1.collect() ```
```python theme={null} # Compare data at version 2 (after computed column added) vs version 1 # Note: version 1 doesn't have the price_with_tax column yet products_v2 = pxt.get_table('version_demo/products:2') products_v2.collect() ```
### Revert to previous version Use `revert()` to undo the most recent change. This is irreversible. ```python theme={null} # Current state: 4 products products.count() ```
  4
```python theme={null} # Revert the last insert (removes Thingamajig) products.revert() products.count() ```
  3
```python theme={null} # History now shows version 4 was reverted products.history() ```
```python theme={null} # Can revert multiple times (back to before the update) products.revert() # Check the Widget price is back to original products.where(products.name == 'Widget').select( products.name, products.price ).collect() ```
### Create point-in-time snapshots Snapshots freeze a table’s state for reproducibility. Unlike `revert()`, snapshots preserve the data indefinitely. ```python theme={null} # Create a snapshot of the current state snapshot_v1 = pxt.create_snapshot('version_demo/products_v1', products) snapshot_v1.collect() ```
```python theme={null} # Now make changes to the original table products.insert( [{'name': 'Doohickey', 'price': 99.99, 'category': 'Premium'}] ) products.update({'price': 29.99}, where=products.name == 'Gadget') products.collect() ```
  Inserting rows into \`products\`: 1 rows \[00:00, 535.67 rows/s]
  Inserted 1 row with 0 errors.
  Inserting rows into \`products\`: 1 rows \[00:00, 558.05 rows/s]
```python theme={null} # Snapshot remains unchanged - still shows original data snapshot_v1.collect() ```
## Explanation **What creates a new version:** * `insert()` - adding rows * `update()` - modifying rows * `delete()` - removing rows * `add_column()` / `add_computed_column()` - schema changes * `drop_column()` - schema changes * `rename_column()` - schema changes **Version history methods:** * `history()` - Human-readable DataFrame showing all changes * `get_versions()` - List of dictionaries for programmatic access **Accessing specific versions:** * `pxt.get_table('table_name:N')` - Get read-only handle to version N * Useful for comparing data across versions, auditing changes, or recovering specific values * Version handles are read-only—you cannot modify historical versions **Reverting:** * `revert()` undoes the most recent version * Can call multiple times to go back further * Cannot revert past version 0 * Cannot revert if a snapshot references that version **Snapshots vs revert:** * Snapshots are persistent, named, point-in-time copies * `revert()` permanently removes the latest version * Use snapshots when you need to preserve state for reproducibility * Use `revert()` to undo mistakes ## See also * [Data sharing](../../../platform/data-sharing) - Share tables between environments * [Iterative development](/howto/cookbooks/core/dev-iterative-workflow) - Fast feedback during development # Configure API keys for AI services Source: https://docs.pixeltable.com/howto/cookbooks/core/workflow-api-keys Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Set up API credentials for OpenAI, Anthropic, and other AI providers so Pixeltable can access them. ## Problem You need to call AI services (OpenAI, Anthropic, Gemini, etc.) from your data pipeline. These services require API keys, but you don’t want to hardcode credentials in your notebooks or scripts.
## Solution **What’s in this recipe:** * Set API keys using environment variables * Store keys in a config file for all projects * Use `getpass` for one-time session keys You configure API keys using one of three methods, depending on your needs. Pixeltable automatically discovers credentials from environment variables or config files—no code changes needed.
### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import os import pixeltable as pxt ``` ### Option 1: environment variables **Use when:** CI/CD pipelines, Docker containers, production deployments Set the environment variable in your shell before running Python: ```bash theme={null} # In your terminal (temporary, current session only) export OPENAI_API_KEY="sk-..." # Or add to ~/.bashrc or ~/.zshrc (permanent) echo 'export OPENAI_API_KEY="sk-..."' >> ~/.zshrc ``` You can also set it in Python (useful for testing): ```python theme={null} # Set in Python (current process only) # os.environ['OPENAI_API_KEY'] = 'sk-...' # Check if a key is set 'OPENAI_API_KEY' in os.environ ```
  True
### Option 2: config file **Use when:** Local development, want credentials available to all Pixeltable projects Create `~/.pixeltable/config.toml`: ```toml theme={null} # ~/.pixeltable/config.toml [openai] api_key = "sk-..." [anthropic] api_key = "sk-ant-..." [google] api_key = "AIza..." ``` You can check if the config file exists: ```python theme={null} # Check config file location home_dir = pxt.home() # Usually ~/.pixeltable config_file = home_dir / 'config.toml' print(config_file) config_file.exists() ```
  /Users/asiegel/.pixeltable/config.toml
  True
### Option 3: getpass (interactive) **Use when:** Shared notebooks, demos, one-time sessions Prompt for the key at runtime—it won’t be saved anywhere: ```python theme={null} import getpass # Uncomment to use interactively: # if 'OPENAI_API_KEY' not in os.environ: # os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ') ``` ### Verify your configuration Test that Pixeltable can access your credentials by checking the config: ```python theme={null} # Check which API keys are available services = [ 'OPENAI_API_KEY', 'ANTHROPIC_API_KEY', 'GOOGLE_API_KEY', 'MISTRAL_API_KEY', ] for svc in services: status = '✓' if svc in os.environ else '✗' print(f'{status} {svc}') ```
  ✓ OPENAI\_API\_KEY
  ✓ ANTHROPIC\_API\_KEY
  ✓ GOOGLE\_API\_KEY
  ✓ MISTRAL\_API\_KEY
## Explanation

**Discovery order:** Pixeltable checks for API keys in this order:

1. Environment variable (e.g., `OPENAI_API_KEY`)
2. Config file (`~/.pixeltable/config.toml`)
3. Raises an error if not found

**Supported services:**

| Service | Environment variable |
| --- | --- |
| OpenAI | `OPENAI_API_KEY` |
| Anthropic | `ANTHROPIC_API_KEY` |
| Google (Gemini) | `GOOGLE_API_KEY` |
| Mistral | `MISTRAL_API_KEY` |
**Config file is global:** All Pixeltable projects on your machine share the same config file. **Getpass is per-session:** The key only exists in memory for the current Python session. ## See also * [Pixeltable configuration reference](/platform/configuration) * [Working with OpenAI](/howto/providers/working-with-openai) # Extract fields from LLM JSON responses Source: https://docs.pixeltable.com/howto/cookbooks/core/workflow-json-extraction Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Parse and access specific fields from structured JSON responses returned by language models. ## Problem LLM APIs return nested JSON responses with metadata you don’t need. You want to extract just the text content or specific fields for downstream processing. ```json theme={null} { "id": "chatcmpl-123", "choices": [{ "message": { "content": "This is the actual response text" // ← You want this } }], "usage": {"tokens": 50} } ``` ## Solution **What’s in this recipe:** * Extract text content from chat completions * Access nested JSON fields * Create separate columns for different fields You use JSON path notation to extract specific fields from API responses and store them in computed columns. ### Setup ```python theme={null} %pip install -qU pixeltable openai import getpass import os if 'OPENAI_API_KEY' not in os.environ: os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ') ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions import openai ``` ### Create prompts table ```python theme={null} # Create a fresh directory pxt.drop_dir('json_demo', force=True) pxt.create_dir('json_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'json\_demo'.
```python theme={null} t = pxt.create_table('json_demo/prompts', {'prompt': pxt.String}) ```
  Created table 'prompts'.
```python theme={null} t.insert( [ {'prompt': 'What is the capital of France?'}, {'prompt': 'Write a haiku about coding'}, ] ) ```
  Inserting rows into \`prompts\`: 2 rows \[00:00, 325.83 rows/s]
  Inserted 2 rows with 0 errors.
  2 rows inserted, 2 values computed.
### Get LLM responses ```python theme={null} # Add computed column for API response (returns full JSON) t.add_computed_column( response=openai.chat_completions( messages=[{'role': 'user', 'content': t.prompt}], model='gpt-4o-mini', ) ) ```
  Added 2 column values with 0 errors.
  2 rows updated, 2 values computed.
### Extract specific fields Use dot notation to access nested JSON fields: ```python theme={null} # Extract just the text content t.add_computed_column(text=t.response.choices[0].message.content) # Extract token usage t.add_computed_column(tokens=t.response.usage.total_tokens) ```
  Added 2 column values with 0 errors.
  Added 2 column values with 0 errors.
  2 rows updated, 2 values computed.
```python theme={null} # View clean results t.select(t.prompt, t.text, t.tokens).collect() ```
## Explanation

**Common extraction patterns:**

| Field | Path expression |
| --- | --- |
| Response text | `t.response.choices[0].message.content` |
| Token usage | `t.response.usage.total_tokens` |
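These path expressions correspond to ordinary dict/list indexing on the raw JSON. A plain-Python equivalent, using the sample response shape from the Problem section:

```python
# Sample response shaped like the chat-completions JSON shown earlier
response = {
    'id': 'chatcmpl-123',
    'choices': [{'message': {'content': 'This is the actual response text'}}],
    'usage': {'total_tokens': 50},
}

# t.response.choices[0].message.content  <=>
text = response['choices'][0]['message']['content']
# t.response.usage.total_tokens  <=>
tokens = response['usage']['total_tokens']

print(text)    # This is the actual response text
print(tokens)  # 50
```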
**Accessing JSON fields:** * Use dot notation for object properties: `response.usage` * Use brackets for array elements: `choices[0]` * Chain them: `response.choices[0].message.content` **Extracted columns are computed:** Changes to the source data automatically update all extracted fields. ## See also * [Configure API keys](/howto/cookbooks/core/workflow-api-keys) * [Extract structured data from images](/howto/cookbooks/images/vision-structured-output) # Add unique identifiers to your tables Source: https://docs.pixeltable.com/howto/cookbooks/core/workflow-uuid-identity Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Generate UUIDs for automatic row identification. ## Problem You need unique identifiers for rows in your data pipeline. Maybe you’re building an API that returns specific records, tracking processing status across systems, or joining data from multiple sources.
## Solution **What’s in this recipe:** * Create tables with auto-generated UUID primary keys * Add UUID columns to existing tables * Generate UUIDs with `uuid7()` You use `uuid7()` to generate UUIDs for each row. Define it in the schema with `{'column_name': uuid7()}` syntax, or add it to existing tables with `add_computed_column()`. ### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions.uuid import uuid7 ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('uuid_demo', force=True) pxt.create_dir('uuid_demo') ``` ### Create a table with a UUID primary key Use `uuid7()` in your schema to create a column that auto-generates UUIDs: ```python theme={null} # Create table with auto-generated UUID primary key products = pxt.create_table( 'uuid_demo/products', { 'id': uuid7(), # Auto-generates UUID for each row 'name': pxt.String, 'price': pxt.Float, }, primary_key=['id'], ) ```
  Created table 'products'.
```python theme={null} # Insert data - no need to provide 'id', it's auto-generated products.insert( [ {'name': 'Laptop', 'price': 999.99}, {'name': 'Mouse', 'price': 29.99}, {'name': 'Keyboard', 'price': 79.99}, ] ) ```
  Inserted 3 rows with 0 errors in 0.02 s (191.21 rows/s)
  3 rows inserted.
```python theme={null} # View the data - each row has a unique UUID products.collect() ```
### Add a UUID column to an existing table You can add a UUID column to a table that already exists using `add_computed_column()`: ```python theme={null} # Create a table without a UUID column orders = pxt.create_table( 'uuid_demo/orders', {'customer': pxt.String, 'amount': pxt.Float} ) ```
  Created table 'orders'.
```python theme={null} # Insert some data orders.insert( [ {'customer': 'Alice', 'amount': 150.00}, {'customer': 'Bob', 'amount': 75.50}, ] ) ```
  Inserted 2 rows with 0 errors in 0.01 s (310.49 rows/s)
  2 rows inserted.
```python theme={null} # Add a UUID column to existing table orders.add_computed_column(order_id=uuid7()) ```
  Added 2 column values with 0 errors in 0.02 s (98.14 rows/s)
  2 rows updated.
```python theme={null} # View orders with their UUID column orders.collect() ```
## Explanation **Two ways to add UUIDs:**
Both use `uuid7()` which generates UUIDv7 (time-based) identifiers: * 128-bit values * Formatted as 32 hex digits with hyphens: `018e65c5-35e5-7c5d-8f37-f1c5b9c8a7b2` * Time-ordered for better database performance * Virtually guaranteed unique (collision probability is negligible) ## See also * [Tables and operations](/tutorials/tables-and-data-operations) * [Computed columns](/tutorials/computed-columns) # Export data for ML training Source: https://docs.pixeltable.com/howto/cookbooks/data/data-export-pytorch Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Convert Pixeltable data to PyTorch DataLoader format for model training. ## Problem You have prepared training data—images with labels, text with embeddings, or multimodal data—and need to export it for PyTorch model training.
## Solution **What’s in this recipe:** * Convert query results to PyTorch Dataset * Use with DataLoader for batch training * Export to Parquet for external tools You use `query.to_pytorch_dataset()` to create an iterable dataset compatible with PyTorch DataLoader. ### Setup ```python theme={null} %pip install -qU pixeltable torch torchvision ``` ```python theme={null} import pixeltable as pxt import torch from torch.utils.data import DataLoader ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('pytorch_demo', force=True) pxt.create_dir('pytorch_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'pytorch\_demo'.
### Create sample training data ```python theme={null} # Create table with images and labels training_data = pxt.create_table( 'pytorch_demo/training_data', {'image': pxt.Image, 'label': pxt.Int} ) ```
  Created table 'training\_data'.
```python theme={null} # Insert sample images with labels base_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images' samples = [ {'image': f'{base_url}/000000000036.jpg', 'label': 0}, # cat {'image': f'{base_url}/000000000090.jpg', 'label': 1}, # other {'image': f'{base_url}/000000000139.jpg', 'label': 1}, # other ] training_data.insert(samples) ```
  Inserting rows into \`training\_data\`: 3 rows \[00:00, 659.03 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 6 values computed.
### Export to PyTorch dataset ```python theme={null} # Add a resize step to ensure all images have the same size training_data.add_computed_column( image_resized=training_data.image.resize((224, 224)) ) # Convert to PyTorch dataset # 'pt' format returns images as CxHxW tensors with values in [0,1] pytorch_dataset = training_data.select( training_data.image_resized, training_data.label ).to_pytorch_dataset(image_format='pt') ```
  Added 3 column values with 0 errors.
```python theme={null} # Use with PyTorch DataLoader dataloader = DataLoader(pytorch_dataset, batch_size=2) # Get first batch to verify the shape batch = next(iter(dataloader)) batch[ 'image_resized' ].shape # Should be (2, 3, 224, 224) - batch_size x channels x height x width ```
  torch.Size(\[2, 3, 224, 224])
### Export to Parquet for external tools ```python theme={null} import tempfile from pathlib import Path # Export to Parquet for use with other ML tools export_path = Path(tempfile.mkdtemp()) / 'training_data' pxt.io.export_parquet( training_data.select(training_data.label), # Non-image columns parquet_path=export_path, ) ``` ## Explanation

**Export methods:**

| Method | Output |
| --- | --- |
| `to_pytorch_dataset()` | Iterable dataset for use with PyTorch `DataLoader` |
| `pxt.io.export_parquet()` | Parquet files for external tools |
**Image format options:**

| `image_format` | Images returned as |
| --- | --- |
| `'pt'` | `CxHxW` tensors with values in `[0, 1]` |
| `'np'` | NumPy arrays |
**DataLoader tips:** * Data is cached to disk for efficient repeated loading * Use `num_workers > 0` for parallel data loading * Filter/transform data before export to reduce size ## See also * [Sample data for training](/howto/cookbooks/data/data-sampling) - Stratified sampling * [Import Parquet files](/howto/cookbooks/data/data-import-parquet) - Parquet import/export # Upload media to S3 and other cloud storage Source: https://docs.pixeltable.com/howto/cookbooks/data/data-export-s3 Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. When Pixeltable generates media files (thumbnails, extracted frames, processed images), by default it stores them locally. For production workflows, you can configure Pixeltable to upload these files directly to cloud blob storage including Amazon S3, Google Cloud Storage, Azure Blob Storage, and S3-compatible services like Cloudflare R2, Backblaze B2, and Tigris. **Key features:** * Computed media (AI-generated outputs) automatically uploads to your bucket * Input media can optionally be persisted for durability * Files are cached locally and downloaded on-demand **Configuration options:** 1. **Global defaults** in `config.toml`: ```toml theme={null} [pixeltable] input_media_dest = "s3://my-bucket/input/" output_media_dest = "s3://my-bucket/output/" ``` 2. **Per-column destination** (computed columns only): ```python theme={null} t.add_computed_column( thumbnail=t.image.thumbnail((128, 128)), destination='s3://my-bucket/thumbnails/' ) ``` In this notebook, you’ll learn how to configure blob storage destinations for your media files. 
## What you’ll learn * Where Pixeltable stores files by default * How to specify destinations for individual columns * How to configure global destinations for all columns * How destination precedence works ## How it works Pixeltable decides where to store media files using this priority: 1. **Column destination** (highest priority) — `destination` parameter in `add_computed_column()` 2. **Global configuration** — `input_media_dest` / `output_media_dest` in [config file](/platform/configuration) 3. **Pixeltable’s default local storage** — Used if nothing else is configured ## Prerequisites For this notebook, you’ll need: * `pixeltable` and `boto3` installed * (Optional) Cloud storage credentials if you want to use a cloud provider ```python theme={null} %pip install -qU pixeltable boto3 ``` ## Setup Let’s set up our demo environment. We’ll create a Pixeltable directory for this demo, set up local destination paths, create a table, and insert a sample image. You can substitute cloud storage URIs (like `s3://my-bucket/path/`) anywhere you see a local destination path. ```python theme={null} import pixeltable as pxt from pathlib import Path ``` ```python theme={null} # Clean slate for this demo pxt.drop_dir('blob_storage_demo', force=True) pxt.create_dir('blob_storage_demo') ``` Now we’ll create a table with an image column and insert a sample image from the web. ```python theme={null} # Create table t = pxt.create_table( 'blob_storage_demo/media', {'source_image': pxt.Image}, if_exists='replace', ) ```
  Created table 'media'.
We can inspect the schema before adding images to our table: ```python theme={null} t ```
Let’s insert a single sample image. ```python theme={null} sample_image = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg' t.insert(source_image=sample_image) ```
  Inserted 1 row with 0 errors in 0.77 s (1.29 rows/s)
  1 row inserted.
And we can see the image in our table: ```python theme={null} t.collect() ```
## Default destinations By default, Pixeltable stores all media files in local storage under `~/.pixeltable/media`: * **Input files** (files you insert) — If you insert a URL, Pixeltable stores the URL and downloads it to cache on access. If you insert a local file path, Pixeltable just stores the path reference (the file stays where it is). * **Output files** (files Pixeltable generates) — Stored in `~/.pixeltable/media` This works out of the box with no configuration. You can change these defaults, which we’ll cover in the rest of this notebook. Let’s check where the source image is stored. Since we inserted a URL (not a local file), Pixeltable stores the URL reference and will download it to cache when we access it. ```python theme={null} # Let's see where the source_image is stored by default t.select(t.source_image.fileurl).collect() ```
Now let’s add a computed column without specifying a destination. This will show us where Pixeltable stores **output** files by default. ```python theme={null} # Add computed column with no destination specified - uses default t.add_computed_column( flipped=t.source_image.transpose(0), if_exists='replace' ) ```
  Added 1 column value with 0 errors in 0.02 s (45.44 rows/s)
  1 row updated.
Check the file URL - it points to `~/.pixeltable/media`, the default location for generated files. ```python theme={null} t.select(t.flipped, t.flipped.fileurl).collect() ```
## Per-column destinations When you create a computed column, you can specify exactly where to store generated files using the `destination=` parameter. This gives you fine-grained control over outputs, which may be costly and/or difficult to re-generate. We’ll create a destination directory for storing one of our processed images. For this demo, we’re using a local directory on your Desktop, but you can replace this path with a cloud storage URI (like `s3://my-bucket/rotated/`). ```python theme={null} # Create a local destination directory # For S3: dest_rotated = "s3://my-bucket/rotated/" # For GCS: dest_rotated = "gs://my-bucket/rotated/" base_path = Path.home() / 'Desktop' / 'pixeltable_outputs' base_path.mkdir(parents=True, exist_ok=True) dest_rotated = str(base_path / 'rotated') # Create directory (only needed for local paths) Path(dest_rotated).mkdir(exist_ok=True) ``` Now let’s add a computed column **with** an explicit destination to see the difference from the default behavior. ```python theme={null} # Add column WITH explicit destination t.add_computed_column( rotated=t.source_image.rotate(90), destination=dest_rotated, if_exists='replace', ) ```
  Added 1 column value with 0 errors in 0.02 s (48.98 rows/s)
  1 row updated.
Compare the file URLs. The `rotated` image uses our explicit destination, while `flipped` (created earlier) uses the default `~/.pixeltable/media` location. ```python theme={null} t.select(t.rotated, t.rotated.fileurl).collect() ```
```python theme={null} t.select(t.flipped, t.flipped.fileurl).collect() ```
## Changing global destinations Instead of setting `destination=` on every column, you can change the global default for ALL columns. ### Output and input destinations You can configure two types of global destinations: * **`output_media_dest`** — Changes the default for files Pixeltable generates (computed columns) * **`input_media_dest`** — Changes the default for files you insert into tables You can set them to the same bucket or different buckets depending on your needs. ### How to configure You have two options: **Option 1: Configuration file** (`~/.pixeltable/config.toml`) ```toml theme={null} [pixeltable] # Where files Pixeltable generates are stored output_media_dest = "s3://my-bucket/output/" # Where files you insert are stored input_media_dest = "s3://my-bucket/input/" ``` **Option 2: Environment variables** ```bash theme={null} export PIXELTABLE_OUTPUT_MEDIA_DEST="s3://my-bucket/output/" export PIXELTABLE_INPUT_MEDIA_DEST="s3://my-bucket/input/" ``` ### Supported providers and URI formats
For complete authentication and setup details, see the [Cloud Storage documentation](/integrations/cloud-storage). ## Overriding global destinations Even if you configure global destinations, you can still override them for specific columns using the `destination=` parameter in `add_computed_column()`. Let’s create a new destination directory and add a thumbnail column that uses it. ```python theme={null} # Create a different destination for thumbnails dest_thumbnails = str(base_path / 'thumbnails') Path(dest_thumbnails).mkdir(exist_ok=True) # Add column with explicit destination (overrides any global default) t.add_computed_column( thumbnail=t.source_image.thumbnail((128, 128)), destination=dest_thumbnails, if_exists='replace', ) ```
  Added 1 column value with 0 errors in 0.02 s (47.89 rows/s)
  1 row updated.
Let’s view the thumbnail and its file URL. The explicit `destination=` parameter always wins, regardless of global configuration. ```python theme={null} t.select(t.thumbnail, t.thumbnail.fileurl).collect() ```
## Getting URLs for your files When your files are in blob storage, you can get URLs that point directly to them. These URLs can be used in HTML, in APIs, or in any application that needs to serve media. The `.fileurl` property returns a direct URL for each stored file. ```python theme={null} t.select( source=t.source_image.fileurl, rotated=t.rotated.fileurl, flipped=t.flipped.fileurl, ).collect() ```
## Generating presigned URLs **Note:** This section only applies if you’re using cloud storage (S3, GCS, Azure, R2, B2, Tigris). If you’re following along with local destinations (as in the examples above), you can skip this section or configure cloud storage to try it out.
When your files are in cloud storage, the `.fileurl` property returns storage URIs like `s3://bucket/path/file.jpg`. These aren’t directly accessible over HTTP. For private buckets or when you need time-limited HTTP access, use **presigned URLs**. These are temporary, authenticated URLs that allow anyone with the URL to access your files for a limited time, without needing credentials. Presigned URLs are particularly useful for: * Sharing files from private buckets without making them public * Creating temporary download links with expiration * Serving media in web applications without exposing credentials * Providing time-limited access to sensitive content Use the `presigned_url` function from `pixeltable.functions.net`: ```python theme={null} # Use HTTPS URL format for Backblaze B2 b2_region = 'us-east-005' b2_bucket = 'pixeltable' cloud_destination = ( f'https://s3.{b2_region}.backblazeb2.com/{b2_bucket}/presigned-demo/' ) # Add the computed column t.add_computed_column( cloud_thumbnail=t.source_image.thumbnail((64, 64)), destination=cloud_destination, if_exists='replace', ) ```
  Added 1 column value with 0 errors in 0.22 s (4.46 rows/s)
  1 row updated.
```python theme={null} # Now generate presigned URLs for the cloud-stored files from pixeltable.functions import net t.select( cloud_thumbnail=t.cloud_thumbnail, storage_url=t.cloud_thumbnail.fileurl, presigned_url=net.presigned_url( t.cloud_thumbnail.fileurl, 3600 ), # 1-hour expiration ).collect() ```
The presigned URLs in the output are fully authenticated HTTP/HTTPS URLs that can be accessed directly in a browser or used in APIs without any credentials. ### Common expiration times
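Expiration times are plain seconds. A few typical values (these are generic durations for illustration, not Pixeltable-specific or provider-specific limits):

```python
# Typical expiration windows for presigned URLs, in seconds
ONE_HOUR = 60 * 60           # 3600
ONE_DAY = 24 * ONE_HOUR      # 86400
SEVEN_DAYS = 7 * ONE_DAY     # 604800 -- the maximum some providers allow

# Usage, following the example above:
# net.presigned_url(t.cloud_thumbnail.fileurl, ONE_DAY)
```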
**Note:** Different storage providers have different maximum expiration limits. For example, Google Cloud Storage has a maximum 7-day expiration for presigned URLs. ### Troubleshooting presigned URLs If `presigned_url()` isn’t working: 1. **Local files**: Presigned URLs only work with cloud storage (S3, GCS, Azure, R2, B2, Tigris). If your files are stored locally (default), you’ll get an error. Configure a cloud destination first. 2. **Already HTTP URLs**: If `.fileurl` returns an `http://` or `https://` URL (not a storage URI like `s3://`), the file is already publicly accessible and doesn’t need a presigned URL. 3. **Credentials**: Ensure your cloud storage credentials are properly configured. See the [Cloud Storage documentation](/integrations/cloud-storage) for provider-specific setup. ## Common patterns Here are a few real-world patterns you might use: ### Pattern 1: All media in one place If you want everything in the same bucket, configure both input and output destinations in `~/.pixeltable/config.toml`: ```toml theme={null} [pixeltable] input_media_dest = "s3://my-bucket/media/" output_media_dest = "s3://my-bucket/media/" ``` Or set environment variables: ```bash theme={null} export PIXELTABLE_INPUT_MEDIA_DEST="s3://my-bucket/media/" export PIXELTABLE_OUTPUT_MEDIA_DEST="s3://my-bucket/media/" ``` ### Pattern 2: Separate input and output Keep source files separate from processed files in `~/.pixeltable/config.toml`: ```toml theme={null} [pixeltable] input_media_dest = "s3://my-bucket/uploads/" output_media_dest = "s3://my-bucket/processed/" ``` ### Pattern 3: Override for specific columns Use a global default, but send some columns elsewhere. 
First, set a global default in your config: ```toml theme={null} [pixeltable] output_media_dest = "s3://my-bucket/processed/" ``` Then in your code, most columns use the global default, but you can override specific ones: ```python theme={null} # Uses global default (s3://my-bucket/processed/) t.add_computed_column( thumbnail=t.image.thumbnail((128, 128)) ) # Overrides global default - goes to different location t.add_computed_column( large_thumbnail=t.image.thumbnail((512, 512)), destination='s3://my-bucket/thumbnails/' ) ``` ## Where do my files go? Understanding how Pixeltable handles different types of input files helps you make better decisions about storage configuration.
When you configure a cloud destination, Pixeltable populates both the destination and the local cache efficiently during `insert()`. For URLs, this means downloading once and using that download for both the upload and cache—avoiding wasteful upload→download cycles. ## What you learned * Pixeltable uses local storage by default for all media files * You can override the default for specific columns with the `destination` parameter * You can change the global default with `input_media_dest` and `output_media_dest` * Precedence: column destination > global config > Pixeltable’s default local storage * Use `.fileurl` to get URLs for your stored files * Use `net.presigned_url()` to generate time-limited, authenticated HTTP URLs for cloud storage files * Pixeltable handles caching intelligently to avoid wasteful operations ## See also * [Load from S3](../../../howto/cookbooks/data/data-import-s3) - Import media from cloud storage * [Cloud Storage Integration](../../../integrations/cloud-storage) - Provider setup ## Next steps * See the [Cloud Storage documentation](/integrations/cloud-storage) for complete provider setup and authentication details * Check out [Pixeltable Configuration](/platform/configuration) for all config options * Join our [Discord community](https://pixeltable.com/discord) if you have questions # Export data to SQL databases Source: https://docs.pixeltable.com/howto/cookbooks/data/data-export-sql Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Send your Pixeltable data to PostgreSQL, SQLite, MySQL, TigerData, or Snowflake for use in external applications. 
**What’s in this recipe:** * Export entire tables or filtered queries to any SQL database * Select specific columns for export * Handle existing tables with replace or append options * Connect to cloud PostgreSQL services (e.g. TigerData) ## Problem You have processed data in your pipeline—cleaned text, generated embeddings, extracted metadata—and need to send it to a SQL database for use by other applications or teams.
## Solution You use `export_sql()` to export tables or queries to any SQL database via database connection strings. The function automatically maps Pixeltable types to appropriate SQL types for each database dialect. ### Setup ```python theme={null} %pip install -qU pixeltable snowflake-sqlalchemy ``` ```python theme={null} import pixeltable as pxt import tempfile from pathlib import Path from pixeltable.io.sql import export_sql ``` ### Create sample data ```python theme={null} # Create a fresh directory pxt.drop_dir('sql_export_demo', force=True) pxt.create_dir('sql_export_demo') ```
  Created directory 'sql\_export\_demo'.
```python theme={null} # Create a table with product data products = pxt.create_table( 'sql_export_demo/products', { 'name': pxt.String, 'price': pxt.Float, 'in_stock': pxt.Bool, 'metadata': pxt.Json, }, ) ```
  Created table 'products'.
```python theme={null} # Insert sample products products.insert( [ { 'name': 'Wireless Mouse', 'price': 29.99, 'in_stock': True, 'metadata': {'category': 'electronics', 'rating': 4.5}, }, { 'name': 'USB-C Hub', 'price': 49.99, 'in_stock': False, 'metadata': {'category': 'electronics', 'rating': 4.2}, }, { 'name': 'Mechanical Keyboard', 'price': 89.99, 'in_stock': True, 'metadata': {'category': 'electronics', 'rating': 4.8}, }, { 'name': 'Monitor Stand', 'price': 39.99, 'in_stock': True, 'metadata': {'category': 'accessories', 'rating': 4.0}, }, { 'name': 'Webcam', 'price': 59.99, 'in_stock': False, 'metadata': {'category': 'electronics', 'rating': 3.9}, }, ] ) ```
  Inserted 5 rows with 0 errors in 0.01 s (566.35 rows/s)
  5 rows inserted.
```python theme={null} # View the data products.collect() ```
### Export an entire table You pass a table and a SQLAlchemy connection string to export all rows and columns. ```python theme={null} # Create a SQLite database for this demo db_path = Path(tempfile.mkdtemp()) / 'products.db' connection_string = f'sqlite:///{db_path}' ``` ```python theme={null} # Export the full table export_sql(products, 'products', db_connect_str=connection_string) ``` ```python theme={null} # Verify the export with SQLAlchemy import sqlalchemy as sql engine = sql.create_engine(connection_string) with engine.connect() as conn: result = conn.execute(sql.text('SELECT * FROM products')).fetchall() result ```
  \[('Wireless Mouse', 29.99, 1, '\{"rating": 4.5, "category": "electronics"}'),
   ('USB-C Hub', 49.99, 0, '\{"rating": 4.2, "category": "electronics"}'),
   ('Mechanical Keyboard', 89.99, 1, '\{"rating": 4.8, "category": "electronics"}'),
   ('Monitor Stand', 39.99, 1, '\{"rating": 4.0, "category": "accessories"}'),
   ('Webcam', 59.99, 0, '\{"rating": 3.9, "category": "electronics"}')]
### Export a filtered query You can export any query result—filter rows, select specific columns, or apply transformations before export. ```python theme={null} # Export only in-stock products export_sql( products.where(products.in_stock == True), 'in_stock_products', db_connect_str=connection_string, ) ``` ```python theme={null} # Verify filtered export with engine.connect() as conn: result = conn.execute( sql.text('SELECT name, price FROM in_stock_products') ).fetchall() result ```
  \[('Wireless Mouse', 29.99),
   ('Mechanical Keyboard', 89.99),
   ('Monitor Stand', 39.99)]
### Export specific columns You select only the columns you need before exporting. You can also rename columns in the output. ```python theme={null} # Export only name and price columns export_sql( products.select(products.name, products.price), 'price_list', db_connect_str=connection_string, ) ``` ```python theme={null} # Export with renamed columns export_sql( products.select( product_name=products.name, unit_price=products.price ), 'renamed_columns', db_connect_str=connection_string, ) ``` ```python theme={null} # Verify column selection inspector = sql.inspect(engine) columns = [col['name'] for col in inspector.get_columns('price_list')] columns ```
  \['name', 'price']
### Handle existing tables You control what happens when the target table already exists using the `if_exists` parameter:
```python theme={null} # Append new data to existing table export_sql( products.where(products.price > 50), 'products', db_connect_str=connection_string, if_exists='insert', ) ``` ```python theme={null} # Check row count after insert with engine.connect() as conn: result = conn.execute( sql.text('SELECT COUNT(*) FROM products') ).fetchone() f'Total rows after insert: {result[0]}' ```
  'Total rows after insert: 7'
```python theme={null} # Replace with fresh data export_sql( products.select(products.name, products.price), 'products', db_connect_str=connection_string, if_exists='replace', ) ``` ```python theme={null} # Check that table was replaced inspector = sql.inspect(engine) columns = [col['name'] for col in inspector.get_columns('products')] with engine.connect() as conn: row_count = conn.execute( sql.text('SELECT COUNT(*) FROM products') ).fetchone()[0] f'Columns: {columns}, Row count: {row_count}' ```
  "Columns: \['name', 'price'], Row count: 5"
### Export to cloud PostgreSQL (TigerData) You can export directly to cloud-hosted PostgreSQL databases like [TigerData](https://www.timescale.com/cloud) (Timescale Cloud). Get your credentials from the TigerData dashboard after creating a service. ```python theme={null} import getpass import os # Skip interactive sections in CI environments SKIP_CLOUD_TESTS = os.environ.get('CI') or os.environ.get( 'GITHUB_ACTIONS' ) if not SKIP_CLOUD_TESTS: # Enter your TigerData credentials interactively tigerdata_host = input( 'TigerData host (e.g., abc123.tsdb.cloud.timescale.com): ' ) tigerdata_port = input('TigerData port (e.g., 38963): ') tigerdata_user = input('TigerData username (e.g., tsdbadmin): ') tigerdata_password = getpass.getpass('TigerData password: ') tigerdata_dbname = input('TigerData database name (e.g., tsdb): ') # Build the connection string (use postgresql+psycopg:// for SQLAlchemy compatibility) tigerdata_connection = f'postgresql+psycopg://{tigerdata_user}:{tigerdata_password}@{tigerdata_host}:{tigerdata_port}/{tigerdata_dbname}?sslmode=require' else: print('Skipping TigerData section (running in CI)') ``` ```python theme={null} if not SKIP_CLOUD_TESTS: # Export to TigerData export_sql( products, 'pixeltable_products', db_connect_str=tigerdata_connection, if_exists='replace', ) ``` ```python theme={null} if not SKIP_CLOUD_TESTS: # Verify the export in TigerData tigerdata_engine = sql.create_engine(tigerdata_connection) with tigerdata_engine.connect() as conn: result = conn.execute( sql.text('SELECT * FROM pixeltable_products') ).fetchall() result ```
  \[('Wireless Mouse', 29.99, True, \{'rating': 4.5, 'category': 'electronics'}),
   ('USB-C Hub', 49.99, False, \{'rating': 4.2, 'category': 'electronics'}),
   ('Mechanical Keyboard', 89.99, True, \{'rating': 4.8, 'category': 'electronics'}),
   ('Monitor Stand', 39.99, True, \{'rating': 4.0, 'category': 'accessories'}),
   ('Webcam', 59.99, False, \{'rating': 3.9, 'category': 'electronics'})]
### Export to Snowflake You can export directly to [Snowflake](https://www.snowflake.com/) data warehouses. Get your account identifier from the Snowflake web interface under **Admin → Accounts**. ```python theme={null} if not SKIP_CLOUD_TESTS: # Enter your Snowflake credentials interactively snowflake_account = input( 'Snowflake account identifier (e.g., WEZMMGC-AIB20064): ' ) snowflake_user = input('Snowflake username: ') snowflake_password = getpass.getpass('Snowflake password: ') snowflake_warehouse = input( 'Snowflake warehouse (e.g., COMPUTE_WH): ' ) snowflake_database = input('Snowflake database: ') snowflake_schema = input('Snowflake schema (e.g., PUBLIC): ') # Build the connection string snowflake_connection = f'snowflake://{snowflake_user}:{snowflake_password}@{snowflake_account}/{snowflake_database}/{snowflake_schema}?warehouse={snowflake_warehouse}' else: print('Skipping Snowflake section (running in CI)') ``` ```python theme={null} if not SKIP_CLOUD_TESTS: # Export to Snowflake (without JSON column) export_sql( products.select(products.name, products.price, products.in_stock), 'PIXELTABLE_PRODUCTS', db_connect_str=snowflake_connection, if_exists='replace', ) ``` ```python theme={null} if not SKIP_CLOUD_TESTS: # Verify the export in Snowflake snowflake_engine = sql.create_engine(snowflake_connection) with snowflake_engine.connect() as conn: result = conn.execute( sql.text('SELECT * FROM PIXELTABLE_PRODUCTS') ).fetchall() result ```
  \[('Wireless Mouse', 29.99, True, None),
   ('USB-C Hub', 49.99, False, None),
   ('Mechanical Keyboard', 89.99, True, None),
   ('Monitor Stand', 39.99, True, None),
   ('Webcam', 59.99, False, None)]
### Exporting media data For tables containing media types (`pxt.Image`, `pxt.Video`, `pxt.Audio`), you have two options: 1. **Extract metadata before export** - Select only the columns you need (paths, embeddings, extracted text, etc.) and export those to SQL. 2. **Use Pixeltable destinations** - For syncing media files to cloud storage, use Pixeltable’s built-in destination support with providers like [Tigris](/howto/providers/working-with-tigris). **Example: Export image metadata to SQL** ```python theme={null} # Create a table with images images = pxt.create_table( 'sql_export_demo/images', {'image': pxt.Image, 'label': pxt.String} ) # Add computed columns for metadata images.add_computed_column(width=images.image.width) images.add_computed_column(height=images.image.height) images.add_computed_column(mode=images.image.mode) ```
  Created table 'images'.
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Insert sample images base_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images' images.insert( [ {'image': f'{base_url}/000000000036.jpg', 'label': 'cat'}, {'image': f'{base_url}/000000000090.jpg', 'label': 'scene'}, ] ) ```
  Inserted 2 rows with 0 errors in 0.03 s (63.85 rows/s)
  2 rows inserted.
```python theme={null} # Export metadata (not the image itself) to SQL export_sql( images.select(images.label, images.width, images.height, images.mode), 'image_metadata', db_connect_str=connection_string, # or tigerdata_connection for cloud ) ``` ```python theme={null} # Verify the metadata export with engine.connect() as conn: result = conn.execute( sql.text('SELECT * FROM image_metadata') ).fetchall() result ```
  \[('cat', 481, 640, 'RGB'), ('scene', 640, 429, 'RGB')]
## Explanation **Connection strings:** The function uses SQLAlchemy connection strings. Common formats:
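For illustration, here is how such connection strings are typically assembled. Hostnames and credentials are placeholders, and the MySQL driver name is an assumption (this recipe itself only exercises SQLite, PostgreSQL, and Snowflake):

```python
# Placeholder credentials for illustration only
user, password, host, db = 'demo_user', 'demo_pass', 'db.example.com', 'demo_db'

# SQLite: file-based, no credentials (note the extra slash before an absolute path)
sqlite_url = 'sqlite:////tmp/products.db'

# PostgreSQL via psycopg (the format used for TigerData in this recipe)
postgres_url = f'postgresql+psycopg://{user}:{password}@{host}:5432/{db}'

# MySQL (pymysql is a common driver choice)
mysql_url = f'mysql+pymysql://{user}:{password}@{host}:3306/{db}'

# Snowflake: account identifier in place of a host, plus database/schema/warehouse
snowflake_url = (
    f'snowflake://{user}:{password}@MYORG-ACCOUNT123/{db}/PUBLIC'
    '?warehouse=COMPUTE_WH'
)
```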
**Type mapping:** Pixeltable types map to SQL types automatically:
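The exact DDL varies by dialect, but the SQLite and PostgreSQL outputs in this recipe suggest roughly the following correspondence (an illustration inferred from the examples above, not an authoritative spec):

```python
# Approximate Pixeltable -> SQL type correspondence, as observed
# in the SQLite and PostgreSQL outputs in this recipe
pxt_to_sql = {
    'pxt.String': 'TEXT / VARCHAR',
    'pxt.Int': 'INTEGER / BIGINT',
    'pxt.Float': 'FLOAT / DOUBLE PRECISION',
    'pxt.Bool': 'BOOLEAN (stored as 0/1 in SQLite)',
    'pxt.Json': 'JSON (serialized to text in SQLite)',
}
```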
**Unsupported types:** Media types like `pxt.Image`, `pxt.Video`, and `pxt.Audio` cannot be exported directly. Extract the data you need (paths, embeddings, metadata) before export. ## See also * [Working with Tigris](/howto/providers/working-with-tigris) - Sync media files to cloud storage * [Cloud Storage Integration](/integrations/cloud-storage) - S3, GCS, and Azure Blob storage * [Export to PyTorch](./data-export-pytorch) - Export for ML training # Import data from CSV files Source: https://docs.pixeltable.com/howto/cookbooks/data/data-import-csv Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Load data from CSV and Excel files into Pixeltable tables for processing and analysis. ## Problem You have data in CSV or Excel files that you want to process with AI models, add computed columns to, or combine with other data sources.
## Solution **What’s in this recipe:** * Import CSV files directly into tables * Import from Pandas DataFrames * Handle different data types You use `pxt.create_table()` with a `source` parameter to create a table from a CSV file, or insert DataFrame rows into an existing table. ### Setup ```python theme={null} %pip install -qU pixeltable pandas ``` ```python theme={null} import pandas as pd import pixeltable as pxt ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('import_demo', force=True) pxt.create_dir('import_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'import\_demo'.
### Import CSV directly Use `create_table` with `source` to create a table from a CSV file: ```python theme={null} # Import CSV from URL csv_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/world-population-data.csv' population = pxt.create_table('import_demo/population', source=csv_url) ```
  Created table 'population'.

  Inserting rows into \`population\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`population\`: 234 rows \[00:00, 9032.63 rows/s]
  Inserted 234 rows with 0 errors.
```python theme={null} # View the imported data population.head(5) ```
### Import from Pandas DataFrame You can also create a DataFrame first and insert it: ```python theme={null} # Create a DataFrame df = pd.DataFrame( { 'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35], 'city': ['NYC', 'LA', 'Chicago'], } ) # Create table and insert DataFrame users = pxt.create_table( 'import_demo/users', {'name': pxt.String, 'age': pxt.Int, 'city': pxt.String}, ) users.insert(df) ```
  Created table 'users'.

  Inserting rows into \`users\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`users\`: 3 rows \[00:00, 923.31 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 6 values computed.
```python theme={null} # View the data users.collect() ```
## Explanation **Source types supported:**
**Type inference:** Pixeltable automatically infers column types from CSV data. You can override types using `schema_overrides`. **Large files:** For very large CSV files, consider: * Using `create_table(source=...)` which streams data * Importing in batches if memory is limited ## See also * [Tables documentation](/tutorials/tables-and-data-operations) * [Bringing data guide](/howto/cookbooks/data/data-import-csv) # Import data from Excel files Source: https://docs.pixeltable.com/howto/cookbooks/data/data-import-excel Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Load data from Excel spreadsheets (.xlsx) into Pixeltable tables. ## Problem You have data in Excel format that needs to be loaded for AI processing—reports, inventory lists, or business data exported from other systems.
## Solution **What’s in this recipe:** * Import Excel files directly into tables * Handle multiple sheets * Override column types when needed You use `pxt.create_table()` with an Excel file path as the `source` parameter. Pixeltable infers column types automatically. ### Setup ```python theme={null} %pip install -qU pixeltable openpyxl pandas ``` ```python theme={null} import pandas as pd import pixeltable as pxt import tempfile from pathlib import Path ``` ### Create sample Excel file ```python theme={null} # Create sample Excel file for demo sample_data = pd.DataFrame( { 'order_id': [1001, 1002, 1003, 1004, 1005], 'customer': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'], 'product': [ 'Widget A', 'Gadget B', 'Widget A', 'Tool C', 'Gadget B', ], 'quantity': [2, 1, 5, 3, 2], 'price': [29.99, 149.99, 29.99, 79.99, 149.99], 'date': [ '2024-01-15', '2024-01-16', '2024-01-16', '2024-01-17', '2024-01-18', ], } ) # Save to temp Excel file temp_dir = tempfile.mkdtemp() excel_path = Path(temp_dir) / 'orders.xlsx' sample_data.to_excel(excel_path, index=False) sample_data ```
### Import Excel file ```python theme={null} # Create a fresh directory pxt.drop_dir('excel_demo', force=True) pxt.create_dir('excel_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'excel\_demo'.
```python theme={null} # Import Excel file directly orders = pxt.create_table( 'excel_demo/orders', source=str(excel_path), source_format='excel', # Hint for Excel format ) ```
  Created table 'orders'.

  Inserting rows into \`orders\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`orders\`: 5 rows \[00:00, 501.21 rows/s]
  Inserted 5 rows with 0 errors.
```python theme={null} # View imported data orders.collect() ```
### Add computed columns ```python theme={null} # Add computed column for order total orders.add_computed_column(total=orders.quantity * orders.price) ```
  Added 5 column values with 0 errors.
  5 rows updated, 10 values computed.
```python theme={null} # View with computed total orders.select( orders.order_id, orders.customer, orders.product, orders.quantity, orders.price, orders.total, ).collect() ```
## Explanation **Import methods:**
**Excel-specific options:** Pass Pandas `read_excel` arguments via `extra_args`: ```python theme={null} pxt.create_table( 'table_name', source='data.xlsx', source_format='excel', extra_args={'sheet_name': 'Sheet2', 'skiprows': 1} ) ``` **Common extra\_args:**
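A few standard `read_excel` options you might pass this way (these are regular pandas arguments, not Pixeltable-specific):

```python
# Standard pandas read_excel options, usable via extra_args
extra_args = {
    'sheet_name': 'Sheet2',  # sheet to read, by name or 0-based index
    'skiprows': 1,           # skip rows at the top of the sheet
    'header': 0,             # which row holds the column names
    'usecols': 'A:D',        # limit the columns read
    'nrows': 100,            # read only the first N rows
}
```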
## See also * [Import CSV files](/howto/cookbooks/data/data-import-csv) - CSV and tabular data * [Import Parquet files](/howto/cookbooks/data/data-import-parquet) - Columnar data # Import data from Hugging Face datasets Source: https://docs.pixeltable.com/howto/cookbooks/data/data-import-huggingface Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Load datasets from Hugging Face Hub into Pixeltable tables for processing with AI models. ## Problem You want to use a dataset from Hugging Face Hub—for fine-tuning, evaluation, or analysis. You need to load it into a format where you can add computed columns, embeddings, or AI transformations.
## Solution **What’s in this recipe:** * Import Hugging Face datasets directly into tables * Handle datasets with multiple splits (train/test/validation) * Work with image datasets You use `pxt.create_table()` with a Hugging Face dataset as the `source` parameter. Pixeltable automatically maps HF types to Pixeltable column types. ### Setup ```python theme={null} %pip install -qU pixeltable datasets ``` ```python theme={null} import pixeltable as pxt from datasets import load_dataset ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('hf_demo', force=True) pxt.create_dir('hf_demo') ```
  Created directory 'hf\_demo'.
### Import a single split Load a specific split from a dataset: ```python theme={null} # Load a small subset for demo (first 100 rows of rotten_tomatoes) hf_dataset = load_dataset( 'cornell-movie-review-data/rotten_tomatoes', split='train[:100]' ) ``` ```python theme={null} # Import into Pixeltable reviews = pxt.create_table('hf_demo/reviews', source=hf_dataset) ```
  Created table 'reviews'.
  Inserting rows into \`reviews\`: 100 rows \[00:00, 14781.69 rows/s]
  Inserted 100 rows with 0 errors.
```python theme={null} # View imported data reviews.head(5) ```
### Import multiple splits Load a DatasetDict with multiple splits and import each split into its own table: ```python theme={null} # Load dataset with multiple splits (small subset for demo) hf_dataset_dict = load_dataset( 'cornell-movie-review-data/rotten_tomatoes', split={'train': 'train[:50]', 'test': 'test[:50]'}, ) ``` ```python theme={null} # Import each split separately for clarity train_data = pxt.create_table( 'hf_demo/reviews_train', source=hf_dataset_dict['train'] ) test_data = pxt.create_table( 'hf_demo/reviews_test', source=hf_dataset_dict['test'] ) ```
  Created table 'reviews\_train'.
  Inserting rows into \`reviews\_train\`: 50 rows \[00:00, 10150.29 rows/s]
  Inserted 50 rows with 0 errors.
  Created table 'reviews\_test'.
  Inserting rows into \`reviews\_test\`: 50 rows \[00:00, 9883.37 rows/s]
  Inserted 50 rows with 0 errors.
```python theme={null} # View training data train_data.head(5) ```
```python theme={null} # View test data test_data.head(3) ```
### Add AI-powered computed columns Enrich the dataset with computed columns, from simple transformations like text length to AI model calls: ```python theme={null} # Add a computed column for text length reviews.add_computed_column( text_length=reviews.text.apply(len, col_type=pxt.Int) ) ```
  Added 100 column values with 0 errors.
  100 rows updated, 200 values computed.
```python theme={null} # View with computed column reviews.select(reviews.text, reviews.label, reviews.text_length).head(5) ```
### Type mapping Pixeltable automatically maps Hugging Face types to Pixeltable types:
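The inferred mapping can be adjusted at import time with `schema_overrides`, which maps column names to the Pixeltable types you want instead. A minimal sketch, assuming the `hf_dataset` loaded earlier in this recipe (`hf_demo/reviews_overridden` is a hypothetical table name):

```python theme={null}
# Hypothetical override: store the integer `label` column as Float
# rather than the inferred Int type.
reviews_f = pxt.create_table(
    'hf_demo/reviews_overridden',
    source=hf_dataset,
    schema_overrides={'label': pxt.Float},
)
```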
Use `schema_overrides` to customize type mapping when needed. ## Explanation **Why import Hugging Face datasets into Pixeltable:** 1. **Add computed columns** - Enrich data with embeddings, AI analysis, or transformations 2. **Incremental processing** - Add new rows without reprocessing existing data 3. **Persistent storage** - Keep processed results across sessions 4. **Query capabilities** - Filter, aggregate, and join with other tables **Working with large datasets:** For very large datasets, consider loading in batches or using streaming mode in the `datasets` library before importing. ## See also * [Import CSV files](/howto/cookbooks/data/data-import-csv) - For CSV and Excel imports * [Semantic text search](/howto/cookbooks/search/search-semantic-text) - Add embeddings to text data * [Hugging Face integration notebook](/howto/providers/working-with-hugging-face) - Full integration guide # Import data from JSON files Source: https://docs.pixeltable.com/howto/cookbooks/data/data-import-json Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Load structured data from JSON files into Pixeltable tables for processing and analysis. ## Problem You have data in JSON format—from APIs, exports, or application logs. You need to load this data for processing with AI models or combining with other data sources.
## Solution **What’s in this recipe:** * Import JSON files directly into tables * Import from URLs (APIs, remote files) * Handle nested JSON structures You use `pxt.create_table()` with a `source` parameter to create a table from a JSON file or URL. The JSON must be an array of objects, where each object becomes a row. ### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import json import pixeltable as pxt import tempfile from pathlib import Path ``` ### Create sample JSON file First, create a sample JSON file to demonstrate the import process: ```python theme={null} # Create sample JSON data (array of objects) sample_data = [ { 'id': 1, 'title': 'Introduction to ML', 'author': 'Alice', 'tags': ['ml', 'intro'], 'rating': 4.5, }, { 'id': 2, 'title': 'Deep Learning Basics', 'author': 'Bob', 'tags': ['dl', 'neural'], 'rating': 4.8, }, { 'id': 3, 'title': 'NLP Fundamentals', 'author': 'Carol', 'tags': ['nlp', 'text'], 'rating': 4.2, }, { 'id': 4, 'title': 'Computer Vision', 'author': 'Dave', 'tags': ['cv', 'images'], 'rating': 4.6, }, { 'id': 5, 'title': 'Reinforcement Learning', 'author': 'Eve', 'tags': ['rl', 'agents'], 'rating': 4.3, }, ] # Save to temporary JSON file temp_dir = tempfile.mkdtemp() json_path = Path(temp_dir) / 'articles.json' with open(json_path, 'w') as f: json.dump(sample_data, f, indent=2) ``` ### Import JSON file Use `create_table` with `source` to create a table directly from a JSON file: ```python theme={null} # Create a fresh directory pxt.drop_dir('json_demo', force=True) pxt.create_dir('json_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'json\_demo'.
```python theme={null} # Import JSON file into a new table articles = pxt.create_table( 'json_demo/articles', source=str(json_path), source_format='json', # Explicitly specify format when using local file paths ) ```
  Created table 'articles'.

  Inserting rows into \`articles\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`articles\`: 5 rows \[00:00, 538.52 rows/s]
  Inserted 5 rows with 0 errors.
```python theme={null} # View imported data articles.collect() ```
### Import from URL You can import JSON directly from a URL—useful for APIs and remote data: ```python theme={null} # Import from a public JSON URL # Using JSONPlaceholder API as an example posts = pxt.create_table( 'json_demo/posts', source='https://jsonplaceholder.typicode.com/posts', source_format='json', # Required for URL sources ) ```
  Created table 'posts'.

  Inserting rows into \`posts\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`posts\`: 100 rows \[00:00, 15623.57 rows/s]
  Inserted 100 rows with 0 errors.
```python theme={null} # View first few rows posts.head(5) ```
### Import from Python dictionaries Use `create_table` with a list of dictionaries as `source`—useful when you have data in memory: ```python theme={null} # Import from a list of dictionaries events = [ { 'event': 'page_view', 'user_id': 101, 'timestamp': '2024-01-15T10:30:00', }, { 'event': 'click', 'user_id': 101, 'timestamp': '2024-01-15T10:31:00', }, { 'event': 'purchase', 'user_id': 102, 'timestamp': '2024-01-15T10:32:00', }, ] event_table = pxt.create_table('json_demo/events', source=events) ```
  Created table 'events'.

  Inserting rows into \`events\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`events\`: 3 rows \[00:00, 988.06 rows/s]
  Inserted 3 rows with 0 errors.
```python theme={null} # View imported events event_table.collect() ```
### Add computed columns Once imported, you can enrich the data with computed columns: ```python theme={null} # Add a computed column combining title and author articles.add_computed_column( summary=articles.title + ' by ' + articles.author ) ```
  Added 5 column values with 0 errors.
  5 rows updated, 10 values computed.
```python theme={null} # View with computed column articles.select( articles.title, articles.author, articles.summary ).collect() ```
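Nested values, such as the `tags` list in the sample data, are stored as JSON columns and can be addressed with Pixeltable's JSON path syntax. A minimal sketch, using the column names from the sample articles above:

```python theme={null}
# `articles.tags[0]` selects the first element of each row's JSON `tags` list.
articles.select(
    articles.title,
    first_tag=articles.tags[0],
).collect()
```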
## Explanation **JSON format requirements:** The JSON file must contain an array of objects at the top level: ```json theme={null} [ {"col1": "value1", "col2": 123}, {"col1": "value2", "col2": 456} ] ``` **Source types supported:**
**Nested JSON handling:** Nested objects and arrays are stored as JSON columns. You can access nested fields using Pixeltable’s JSON path syntax in computed columns. ## See also * [Import CSV files](/howto/cookbooks/data/data-import-csv) - For CSV and Excel imports * [Import Parquet files](/howto/cookbooks/data/data-import-parquet) - For Parquet data * [Extract fields from JSON](/howto/cookbooks/core/workflow-json-extraction) - Parse LLM response fields # Import data from Parquet files Source: https://docs.pixeltable.com/howto/cookbooks/data/data-import-parquet Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Load columnar data from Parquet files into Pixeltable tables for processing and analysis. ## Problem You have data stored in Parquet format—a common format for analytics, data lakes, and ML pipelines. You need to load this data for processing with AI models or combining with other data sources.
## Solution **What’s in this recipe:** * Import Parquet files directly into tables * Export tables to Parquet for external tools * Handle schema type overrides You use `pxt.create_table()` with a `source` parameter to create a table from a Parquet file. Pixeltable infers column types from the Parquet schema automatically. ### Setup ```python theme={null} %pip install -qU pixeltable pyarrow pandas ``` ```python theme={null} import pandas as pd import pixeltable as pxt import tempfile from pathlib import Path ``` ### Create sample Parquet file First, create a sample Parquet file to demonstrate the import process: ```python theme={null} # Create sample data sample_data = pd.DataFrame( { 'product_id': [1, 2, 3, 4, 5], 'name': [ 'Widget A', 'Widget B', 'Gadget X', 'Gadget Y', 'Tool Z', ], 'price': [29.99, 39.99, 149.99, 199.99, 79.99], 'category': ['widgets', 'widgets', 'gadgets', 'gadgets', 'tools'], 'in_stock': [True, False, True, True, False], } ) # Save to temporary Parquet file temp_dir = tempfile.mkdtemp() parquet_path = Path(temp_dir) / 'products.parquet' sample_data.to_parquet(parquet_path, index=False) sample_data ```
### Import Parquet file Use `create_table` with the `source` parameter to create a table directly from the Parquet file: ```python theme={null} # Create a fresh directory pxt.drop_dir('parquet_demo', force=True) pxt.create_dir('parquet_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'parquet\_demo'.
```python theme={null} # Import Parquet file into a new table products = pxt.create_table( 'parquet_demo/products', source=str(parquet_path) ) ```
  Created table 'products'.

  Inserting rows into \`products\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`products\`: 5 rows \[00:00, 653.18 rows/s]
  Inserted 5 rows with 0 errors.
```python theme={null} # View imported data products.collect() ```
### Add computed columns Once imported, you can add computed columns like any other Pixeltable table: ```python theme={null} # Add a computed column for discounted price products.add_computed_column(sale_price=products.price * 0.9) ```
  Added 5 column values with 0 errors.
  5 rows updated, 10 values computed.
```python theme={null} # View with computed column products.select( products.name, products.price, products.sale_price ).collect() ```
### Import with primary key Specify a primary key when you need upsert behavior or unique constraints: ```python theme={null} # Import with a primary key products_pk = pxt.create_table( 'parquet_demo/products_with_pk', source=str(parquet_path), primary_key='product_id', ) ```
  Created table 'products\_with\_pk'.

  Inserting rows into \`products\_with\_pk\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`products\_with\_pk\`: 5 rows \[00:00, 1548.97 rows/s]
  Inserted 5 rows with 0 errors.
```python theme={null} # View the table products_pk.collect() ```
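The `source` argument can also point at a directory of Parquet files, such as a partitioned dataset. A sketch under the assumption that an explicit `source_format='parquet'` works for extension-less directory paths, mirroring the JSON recipe (`parquet_demo/products_from_dir` is a hypothetical table name):

```python theme={null}
# Import every .parquet file under a directory into one table.
products_dir = pxt.create_table(
    'parquet_demo/products_from_dir',
    source=str(Path(temp_dir)),  # directory holding products.parquet
    source_format='parquet',  # assumption: explicit format for a directory path
)
```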
### Export table to Parquet Export your processed data back to Parquet for use with other tools: ```python theme={null} # Export to Parquet (note: image columns require inline_images=True) export_path = Path(temp_dir) / 'exported_products' pxt.io.export_parquet( products.select(products.name, products.price, products.sale_price), parquet_path=export_path, ) ``` ```python theme={null} # Verify export by reading back import pyarrow.parquet as pq exported_table = pq.read_table(export_path) exported_table.to_pandas() ```
## Explanation **When to use Parquet import:**
**Key features:** * Automatic schema inference from Parquet metadata * Support for partitioned datasets (directory of files) * Export with `pxt.io.export_parquet` for interoperability * Primary key support for upsert workflows ## See also * [Import CSV files](/howto/cookbooks/data/data-import-csv) - For CSV and Excel imports * [Import JSON files](/howto/cookbooks/data/data-import-json) - For JSON data # Load media from S3 and other cloud storage Source: https://docs.pixeltable.com/howto/cookbooks/data/data-import-s3 Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Import images, videos, and audio files from S3, GCS, HTTP URLs, or local paths into Pixeltable tables. ## Problem You have media files stored in cloud storage (S3, GCS) or accessible via HTTP URLs. You need to process these files with AI models without downloading them all upfront.
## Solution **What’s in this recipe:** * Reference media files by URL (S3, HTTP, local paths) * Automatic caching of remote files on access * Process files lazily without bulk downloads You insert media URLs as references. Pixeltable stores the URLs and automatically downloads/caches files when you access them through queries or computed columns. ### Setup ```python theme={null} %pip install -qU pixeltable boto3 ``` ```python theme={null} import pixeltable as pxt ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('cloud_demo', force=True) pxt.create_dir('cloud_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'cloud\_demo'.
### Load images from HTTP URLs Reference images by URL—Pixeltable downloads them on demand: ```python theme={null} # Create a table with image column images = pxt.create_table('cloud_demo/images', {'image': pxt.Image}) ```
  Created table 'images'.
```python theme={null} # Insert images by URL (HTTP) image_urls = [ 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000106.jpg', ] images.insert([{'image': url} for url in image_urls]) ```
  Inserting rows into \`images\`: 3 rows \[00:00, 767.91 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 6 values computed.
```python theme={null} # View images - files are downloaded and cached on access images.collect() ```
### Load videos from S3 Reference videos in S3 buckets (using public Multimedia Commons bucket): ```python theme={null} # Create a table with video column videos = pxt.create_table('cloud_demo/videos', {'video': pxt.Video}) ```
  Created table 'videos'.
```python theme={null} # Insert videos by S3 URL (public bucket, no credentials needed) s3_prefix = 's3://multimedia-commons/' video_paths = [ 'data/videos/mp4/ffe/ffb/ffeffbef41bbc269810b2a1a888de.mp4', 'data/videos/mp4/ffe/feb/ffefebb41485539f964760e6115fbc44.mp4', ] videos.insert([{'video': s3_prefix + path} for path in video_paths]) ```
  Inserting rows into \`videos\`: 2 rows \[00:00, 1477.13 rows/s]
  Inserted 2 rows with 0 errors.
  2 rows inserted, 4 values computed.
```python theme={null} # View videos - downloaded and cached on access videos.collect() ```
### Add computed columns on remote media Process remote media with computed columns—files are fetched automatically: ```python theme={null} # Add computed columns for image properties images.add_computed_column(width=images.image.width) images.add_computed_column(height=images.image.height) ```
  Added 3 column values with 0 errors.
  Added 3 column values with 0 errors.
  3 rows updated, 6 values computed.
```python theme={null} # View with computed properties images.select(images.image, images.width, images.height).collect() ```
### Generate presigned URLs for serving media When you store media in private cloud storage, you need presigned URLs to serve files over HTTP. The `presigned_url` function converts storage URIs to time-limited, publicly accessible URLs: ```python theme={null} import pixeltable.functions as pxtf # Generate presigned URLs for videos (1-hour expiration) videos.select( videos.video, original_uri=videos.video.fileurl, http_url=pxtf.net.presigned_url(videos.video.fileurl, 3600), ).collect() ```
```python theme={null} # Store presigned URLs as computed column for API responses videos.add_computed_column( serving_url=pxtf.net.presigned_url( videos.video.fileurl, 86400 ) # 24-hour expiration ) ```
  Added 2 column values with 0 errors.
  2 rows updated, 4 values computed.
**Use cases for presigned URLs:** * Serve private media in web applications without exposing credentials * Generate download links for end users * Integrate with CDNs or video players that require HTTP URLs **Provider limitations:**
Note: HTTP/HTTPS URLs pass through unchanged (already publicly accessible). ### Supported URL formats Pixeltable supports multiple URL schemes for media files:
\*Configure AWS/GCP credentials via environment variables or config files. ## Explanation **How caching works:** 1. URLs are stored as references in the table 2. Files are downloaded on first access (query or computed column) 3. Downloaded files are cached in `~/.pixeltable/file_cache/` 4. Cache uses LRU eviction when space is needed **Benefits of URL-based storage:** * **Lazy loading** - Only download files when needed * **Deduplication** - Same URL is cached once * **Incremental processing** - Add files without bulk downloads * **Cloud-native** - Works directly with object storage **For private S3 buckets:** Configure AWS credentials using standard methods: * Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) * AWS credentials file (`~/.aws/credentials`) * IAM roles (when running on EC2/ECS) ## See also * [Upload to S3](../../../howto/cookbooks/data/data-export-s3) - Store generated media in S3/GCS * [Import from CSV](../../../howto/cookbooks/data/data-import-csv) - Load structured data * [Extract frames from videos](/howto/cookbooks/video/video-extract-frames) - Process video files * [Analyze images in batch](/howto/cookbooks/images/vision-batch-analysis) - AI vision on images * [Configure API keys](/howto/cookbooks/core/workflow-api-keys) - Set up credentials # Sample data for training and testing Source: https://docs.pixeltable.com/howto/cookbooks/data/data-sampling Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Create training, validation, and test splits with random or stratified sampling. ## Problem You have a large dataset and need to create subsets for ML training—random samples for quick experiments, stratified samples for balanced classes, or reproducible splits for benchmarking.
## Solution **What’s in this recipe:** * Random sampling with `sample(n=...)` * Percentage-based sampling with `sample(fraction=...)` * Stratified sampling with `stratify_by=` You use `query.sample()` to create random subsets, with optional stratification for balanced class distribution. ### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import pixeltable as pxt ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('sampling_demo', force=True) pxt.create_dir('sampling_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'sampling\_demo'.
### Create sample dataset ```python theme={null} # Create a dataset with labels data = pxt.create_table( 'sampling_demo/data', {'text': pxt.String, 'label': pxt.String, 'score': pxt.Float}, ) # Insert sample data with imbalanced classes samples = [ {'text': 'Great product!', 'label': 'positive', 'score': 0.9}, {'text': 'Love it', 'label': 'positive', 'score': 0.85}, {'text': 'Amazing quality', 'label': 'positive', 'score': 0.95}, {'text': 'Best purchase ever', 'label': 'positive', 'score': 0.88}, {'text': 'Highly recommend', 'label': 'positive', 'score': 0.92}, {'text': 'Fantastic!', 'label': 'positive', 'score': 0.91}, {'text': 'Terrible', 'label': 'negative', 'score': 0.1}, {'text': 'Waste of money', 'label': 'negative', 'score': 0.15}, {'text': 'It is okay', 'label': 'neutral', 'score': 0.5}, {'text': 'Average product', 'label': 'neutral', 'score': 0.55}, ] data.insert(samples) ```
  Created table 'data'.

  Inserting rows into \`data\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`data\`: 10 rows \[00:00, 857.13 rows/s]
  Inserted 10 rows with 0 errors.
  10 rows inserted, 20 values computed.
### Random sampling ```python theme={null} # Sample exactly N rows data.sample(n=5, seed=42).collect() ```
```python theme={null} # Sample a percentage of rows sample_50pct = data.sample(fraction=0.5, seed=42).collect() ``` ### Stratified sampling ```python theme={null} # Stratified sampling: 50% from each class data.sample(fraction=0.5, stratify_by=data.label, seed=42).collect() ```
```python theme={null} # Equal allocation: N rows from each class data.sample(n_per_stratum=1, stratify_by=data.label, seed=42).collect() ```
### Sampling from filtered data ```python theme={null} # Sample from filtered query (high-confidence predictions only) data.where(data.score > 0.8).sample(n=3, seed=42).collect() ```
### Persist samples as tables The two samples below are drawn independently with different seeds, so they can overlap; treat them as convenient dev/test subsets rather than a strict disjoint split. ```python theme={null} # Create a persistent table from a sample for dev/test train_sample = data.sample(fraction=0.8, seed=42) test_sample = data.sample(fraction=0.2, seed=43) # Persist as new tables train_table = pxt.create_table('sampling_demo/train', source=train_sample) test_table = pxt.create_table('sampling_demo/test', source=test_sample) ```
  Created table 'train'.

  Inserting rows into \`train\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`train\`: 9 rows \[00:00, 3080.27 rows/s]
  Created table 'test'.

  Inserting rows into \`test\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`test\`: 3 rows \[00:00, 1333.92 rows/s]
## Explanation **Sampling methods:**
**Stratification options:**
**Tips:** * Always set `seed` for reproducible experiments * Use stratified sampling for imbalanced datasets * Combine with `.where()` to sample from subsets ## See also * [Export for ML training](/howto/cookbooks/data/data-export-pytorch) - PyTorch DataLoader export * [Import Hugging Face datasets](/howto/cookbooks/data/data-import-huggingface) - Load pre-split datasets # Add watermarks to images Source: https://docs.pixeltable.com/howto/cookbooks/images/img-add-watermarks Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. ## Problem You need to add watermarks to hundreds of different images to protect copyright, add branding, or mark drafts. ## Solution **What’s in this recipe:** * Create simple text watermarks * Test transformations before applying * Apply to multiple images automatically You add watermarks to images using a custom UDF that wraps Pillow’s `ImageDraw` (relies on PIL/Pillow). This gives you full control over watermark placement, font, transparency, and color. You can iterate on transformations before adding them to your table. Use `.select()` with `.collect()` to preview results on sample images—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Once you’re satisfied, use `.add_computed_column()` to apply watermarks to all images in your table. For more on this workflow, see [Get fast feedback on transformations](/howto/cookbooks/core/dev-iterative-workflow). 
### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import pixeltable as pxt from PIL import Image, ImageDraw, ImageFont ``` ### Load images ```python theme={null} # Create a fresh directory (drop existing if present) pxt.drop_dir('image_demo', force=True) pxt.create_dir('image_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'image\_demo'.
```python theme={null} t = pxt.create_table('image_demo/watermarks', {'image': pxt.Image}) ```
  Created table 'watermarks'.
```python theme={null} t.insert( [ { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000049.jpg' }, { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg' }, { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg' }, ] ) ```
  Inserting rows into \`watermarks\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`watermarks\`: 3 rows \[00:00, 532.86 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 6 values computed.
### Iterate: add watermarks to a few images first ```python theme={null} @pxt.udf def add_watermark(img: Image.Image, text: str) -> Image.Image: """Add a watermark to bottom-right corner.""" img = img.copy().convert('RGBA') overlay = Image.new('RGBA', img.size, (0, 0, 0, 0)) draw = ImageDraw.Draw(overlay) # Draw white text in bottom-right corner font = ImageFont.load_default(size=40) position = (img.width - 150, img.height - 60) draw.text(position, text, font=font, fill=(255, 255, 255, 200)) result = Image.alpha_composite(img, overlay) return result.convert('RGB') ``` ```python theme={null} # Test on first image t.select(t.image, add_watermark(t.image, '© 2024')).head(1) ```
### Add: add watermarks to all images in your table ```python theme={null} # Add watermark to all images t.add_computed_column(watermarked=add_watermark(t.image, '© 2024')) ```
  Added 3 column values with 0 errors.
  3 rows updated, 3 values computed.
```python theme={null} # View all results t.collect() ```
## Explanation **How the watermark technique works:** The UDF creates a transparent overlay on top of each image. The overlay is created with the same dimensions as the image (`Image.new('RGBA', img.size, ...)`), so watermarks adapt automatically whether you’re processing small thumbnails or large photos. The function draws white text with semi-transparent fill (alpha=200, where 255 is fully opaque), composites the overlay onto the original image using `Image.alpha_composite()`, and converts back to RGB since most image formats don’t support transparency. **To customize the UDF:** * Position: Change the `(x, y)` coordinates in the `position` variable * Color: Modify the `(R, G, B, Alpha)` fill value (0-255 for each) * Size: Adjust the font size parameter in `ImageFont.load_default(size=40)` * Font: Use `ImageFont.truetype('path/to/font.ttf', size)` for custom fonts **The Pixeltable workflow:** In traditional databases, `.select()` just picks which columns to view. In Pixeltable, `.select()` also lets you compute new transformations on the fly—define new columns without storing them. This makes `.select()` perfect for testing transformations before you commit them. When you use `.select()`, you’re creating a query that doesn’t execute until you call `.collect()`. You must use `.collect()` to execute the query and return results—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()` to test on a subset before processing your full dataset. Once satisfied, use `.add_computed_column()` with the same expression to persist results permanently. For more on this workflow, see [Get fast feedback on transformations](/howto/cookbooks/core/dev-iterative-workflow). 
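Putting a few of those customizations together, here is a variant sketch of the same UDF with red text, top-left placement, and an optional TrueType font (the `DejaVuSans.ttf` path is hypothetical; substitute a font available on your system):

```python theme={null}
@pxt.udf
def add_watermark_topleft(img: Image.Image, text: str) -> Image.Image:
    """Add a semi-transparent red watermark to the top-left corner."""
    img = img.copy().convert('RGBA')
    overlay = Image.new('RGBA', img.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    try:
        font = ImageFont.truetype('DejaVuSans.ttf', 28)  # hypothetical font path
    except OSError:
        font = ImageFont.load_default(size=28)  # fall back to the bundled font
    draw.text((20, 20), text, font=font, fill=(255, 0, 0, 160))
    return Image.alpha_composite(img, overlay).convert('RGB')
```

Preview it the same way as before, e.g. `t.select(t.image, add_watermark_topleft(t.image, 'DRAFT')).head(1)`.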
## See also * [Test transformations with fast feedback loops](/howto/cookbooks/core/dev-iterative-workflow) * [Transform images with PIL operations](/howto/cookbooks/images/img-pil-transforms) * *Pillow techniques from [Real Python: Image Processing With the Python Pillow Library](https://realpython.com/image-processing-with-the-python-pillow-library/)* # Adjust image opacity Source: https://docs.pixeltable.com/howto/cookbooks/images/img-adjust-opacity Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. ## Problem You need to make hundreds of images semi-transparent for backgrounds, overlays, or watermarks. ## Solution **What’s in this recipe:** * Set image opacity (transparency level) * Test transformations before applying * Apply to multiple images automatically You adjust image transparency using a custom UDF that modifies alpha channels (relies on PIL/Pillow). This gives you precise control over transparency levels. You can iterate on transformations before adding them to your table. Use `.select()` with `.collect()` to preview results on sample images—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Once you’re satisfied, use `.add_computed_column()` to apply the opacity adjustment to all images in your table. For more on this workflow, see [Get fast feedback on transformations](/howto/cookbooks/core/dev-iterative-workflow). ### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import pixeltable as pxt from PIL import Image ``` ### Load images ```python theme={null} # Create a fresh directory (drop existing if present) pxt.drop_dir('image_demo', force=True) pxt.create_dir('image_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'image\_demo'.
```python theme={null} t = pxt.create_table('image_demo/opacity', {'image': pxt.Image}) ```
  Created table 'opacity'.
```python theme={null} t.insert( [ { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000776.jpg' }, { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000885.jpg' }, { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000016.jpg' }, ] ) ```
  Inserting rows into \`opacity\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`opacity\`: 3 rows \[00:00, 545.05 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 6 values computed.
### Iterate: adjust opacity for a few images first You define a custom function using the `@pxt.udf` decorator to make it available in Pixeltable. Inside the function, you use standard PIL (Pillow) operations to manipulate images. Pixeltable handles applying your function to every row in your table. **How it works:** * All image manipulation (`.convert()`, `.split()`, `.point()`, `.putalpha()`) comes from the PIL/Pillow library * These are standard Python image operations—see [Pillow docs](https://pillow.readthedocs.io/) for reference * The `@pxt.udf` decorator lets Pixeltable apply your function to table rows * The opacity parameter (0.0 = fully transparent, 1.0 = fully opaque) controls the alpha scaling ```python theme={null} @pxt.udf def set_opacity(img: Image.Image, opacity: float) -> Image.Image: """Set image opacity (0.0 = fully transparent, 1.0 = fully opaque).""" img = img.convert('RGBA') alpha = img.split()[3] # Get alpha channel alpha = alpha.point(lambda p: int(p * opacity)) # Scale alpha values img.putalpha(alpha) return img ``` ```python theme={null} # Test 25%, 50%, and 75% opacity t.select( t.image, alpha_25=set_opacity(t.image, 0.25), alpha_50=set_opacity(t.image, 0.5), alpha_75=set_opacity(t.image, 0.75), ).head(1) ```
### Add: adjust opacity for all images in your table ```python theme={null} # Create 50% opacity for backgrounds t.add_computed_column(semi_transparent=set_opacity(t.image, 0.5)) ```
  Added 3 column values with 0 errors.
  3 rows updated, 3 values computed.
```python theme={null} # View original and semi-transparent side by side t.collect() ```
## Explanation **How the opacity technique works:** The UDF modifies the alpha channel to control transparency. The function converts the image to RGBA mode (which includes an alpha channel for transparency), extracts the alpha channel with `.split()[3]`, scales all values by the desired opacity factor using `.point(lambda p: int(p * opacity))`, and applies it back with `.putalpha()`. This preserves the original image while adjusting only the transparency level. **To customize the UDF:** * **Opacity levels**: Use 0.25 for very faint backgrounds, 0.5 for standard transparency, 0.75 for subtle effects * **Selective transparency**: Modify the lambda function in `.point()` to apply different transparency to different pixel values * **Preserve regions**: Add conditional logic to keep certain areas fully opaque **The Pixeltable workflow:** In traditional databases, `.select()` just picks which columns to view. In Pixeltable, `.select()` also lets you compute new transformations on the fly—define new columns without storing them. This makes `.select()` perfect for testing transformations before you commit them. When you use `.select()`, you’re creating a query that doesn’t execute until you call `.collect()`. You must use `.collect()` to execute the query and return results—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()` to test on a subset before processing your full dataset. Once satisfied, use `.add_computed_column()` with the same expression to persist results permanently. For more on this workflow, see [Get fast feedback on transformations](/howto/cookbooks/core/dev-iterative-workflow). 
## See also

* [Test transformations with fast feedback loops](/howto/cookbooks/core/dev-iterative-workflow)
* [Add watermarks to images](/howto/cookbooks/images/img-add-watermarks)
* [Transform images with PIL operations](/howto/cookbooks/images/img-pil-transforms)

# Apply image filters

Source: https://docs.pixeltable.com/howto/cookbooks/images/img-apply-filters

Open in Kaggle  Open in Colab  Download Notebook

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

## Problem

You need to apply filters to hundreds of images—blur, sharpen, edge detection, and other enhancements.

## Solution

**What’s in this recipe:**

* Apply common image filters
* Test filters before applying
* Process multiple images in batch

You apply image filters (blur, sharpen, edge detection) to images in your table using custom UDFs that wrap Pillow’s `ImageFilter` module. This gives you control over filter parameters.

You can iterate on transformations before adding them to your table. Use `.select()` with `.collect()` to preview results on sample images—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Once you’re satisfied, use `.add_computed_column()` to apply the filter to all images in your table. For more on this workflow, see [Get fast feedback on transformations](/howto/cookbooks/core/dev-iterative-workflow).

### Setup

```python theme={null}
%pip install -qU pixeltable
```

```python theme={null}
import pixeltable as pxt
from PIL import ImageFilter
```

### Load images

```python theme={null}
# Create a fresh directory (drop existing if present)
pxt.drop_dir('image_demo', force=True)
pxt.create_dir('image_demo')
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'image_demo'.
```python theme={null}
t = pxt.create_table('image_demo/filters', {'image': pxt.Image})
```
  Created table 'filters'.
```python theme={null}
t.insert(
    [
        {
            'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg'
        },
        {
            'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000106.jpg'
        },
        {
            'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000285.jpg'
        },
    ]
)
```
  Inserting rows into `filters`: 3 rows [00:00, 538.79 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 6 values computed.
### Iterate: apply filters to a few images first

```python theme={null}
@pxt.udf
def apply_blur(img: pxt.Image) -> pxt.Image:
    """Apply blur filter."""
    return img.filter(ImageFilter.BLUR)


@pxt.udf
def apply_sharpen(img: pxt.Image) -> pxt.Image:
    """Apply sharpen filter."""
    return img.filter(ImageFilter.SHARPEN)


@pxt.udf
def apply_find_edges(img: pxt.Image) -> pxt.Image:
    """Apply edge detection filter."""
    return img.filter(ImageFilter.FIND_EDGES)


@pxt.udf
def apply_edge_enhance(img: pxt.Image) -> pxt.Image:
    """Apply edge enhancement filter."""
    return img.filter(ImageFilter.EDGE_ENHANCE)
```

```python theme={null}
# Test blur and sharpen
t.select(t.image, apply_blur(t.image), apply_sharpen(t.image)).head(1)
```
### Add: apply filters to all images in your table

```python theme={null}
# Add filter columns
t.add_computed_column(blurred=apply_blur(t.image))
t.add_computed_column(sharpened=apply_sharpen(t.image))
t.add_computed_column(edges=apply_find_edges(t.image))
t.add_computed_column(edge_enhanced=apply_edge_enhance(t.image))
```
  Added 3 column values with 0 errors.
  Added 3 column values with 0 errors.
  Added 3 column values with 0 errors.
  Added 3 column values with 0 errors.
  3 rows updated, 3 values computed.
### View results

Compare original and filtered images.

```python theme={null}
# Compare blur and sharpen
t.select(t.image, t.blurred, t.sharpened).collect()
```
```python theme={null}
# Compare edge detection filters
t.select(t.image, t.edges, t.edge_enhanced).collect()
```
## Explanation **How the filter technique works:** The UDFs wrap PIL’s `ImageFilter` module to apply convolution-based filters to images. Each filter uses a predefined kernel that processes pixel neighborhoods to achieve different effects. Blur averages surrounding pixels to reduce detail, Sharpen enhances pixel differences to increase detail, Find Edges detects boundaries between contrasting regions, and Edge Enhance strengthens edges while preserving the full image. You can apply multiple filters to the same image to create different versions for analysis or visual effects. **To customize the UDFs:** * **Blur intensity**: Use `ImageFilter.BoxBlur(radius)` or `ImageFilter.GaussianBlur(radius)` for adjustable blur strength * **Edge detection**: Combine with grayscale conversion for clearer edge maps * **Filter stacking**: Apply multiple filters in sequence for complex effects * **Custom kernels**: Use `ImageFilter.Kernel()` to define your own convolution filters **The Pixeltable workflow:** In traditional databases, `.select()` just picks which columns to view. In Pixeltable, `.select()` also lets you compute new transformations on the fly—define new columns without storing them. This makes `.select()` perfect for testing transformations before you commit them. When you use `.select()`, you’re creating a query that doesn’t execute until you call `.collect()`. You must use `.collect()` to execute the query and return results—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()` to test on a subset before processing your full dataset. Once satisfied, use `.add_computed_column()` with the same expression to persist results permanently. For more on this workflow, see [Get fast feedback on transformations](/howto/cookbooks/core/dev-iterative-workflow). 
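The custom-kernels bullet above can be sketched in plain Pillow. The 3x3 sharpen weights here are an illustrative example, not part of the recipe; as a Pixeltable UDF the function would carry the `@pxt.udf` decorator:

```python theme={null}
from PIL import Image, ImageFilter

# Example 3x3 sharpen kernel: boost the center pixel, subtract neighbors.
# Kernel() takes a size, a flat list of weights, and an optional scale
# (the result is divided by scale; it defaults to the sum of the weights).
custom_sharpen = ImageFilter.Kernel(
    size=(3, 3),
    kernel=[0, -1, 0, -1, 5, -1, 0, -1, 0],
    scale=1,
)

def apply_custom_kernel(img: Image.Image) -> Image.Image:
    """Apply the custom convolution kernel."""
    return img.filter(custom_sharpen)
```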
## See also * [Test transformations with fast feedback loops](/howto/cookbooks/core/dev-iterative-workflow) * [Adjust image brightness and contrast](/howto/cookbooks/images/img-brightness-contrast) * *Pillow techniques from [Real Python: Image Processing With the Python Pillow Library](https://realpython.com/image-processing-with-the-python-pillow-library/)* # Adjust image brightness and contrast Source: https://docs.pixeltable.com/howto/cookbooks/images/img-brightness-contrast Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. ## Problem You need to fix inconsistent lighting across hundreds of images—adjusting brightness, contrast, and color saturation. ## Solution **What’s in this recipe:** * Adjust brightness, contrast, and saturation * Test adjustments before applying * Process multiple images in batch You adjust brightness, contrast, and saturation for images in your table using custom UDFs that wrap Pillow’s `ImageEnhance` module (relies on PIL/Pillow). This lets you control enhancement levels to match your needs. You can iterate on transformations before adding them to your table. Use `.select()` with `.collect()` to preview results on sample images—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Once you’re satisfied, use `.add_computed_column()` to apply the adjustments to all images in your table. For more on this workflow, see [Get fast feedback on transformations](/howto/cookbooks/core/dev-iterative-workflow). 
### Setup

```python theme={null}
%pip install -qU pixeltable
```

```python theme={null}
import pixeltable as pxt
from PIL import ImageEnhance
```

### Load images

```python theme={null}
# Create a fresh directory (drop existing if present)
pxt.drop_dir('image_demo', force=True)
pxt.create_dir('image_demo')
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'image_demo'.
```python theme={null}
t = pxt.create_table('image_demo/enhancements', {'image': pxt.Image})
```
  Created table 'enhancements'.
```python theme={null}
t.insert(
    [
        {
            'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000016.jpg'
        },
        {
            'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000049.jpg'
        },
        {
            'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg'
        },
    ]
)
```
  Inserting rows into `enhancements`: 3 rows [00:00, 601.16 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 6 values computed.
### Iterate: adjust brightness and contrast for a few images first

```python theme={null}
@pxt.udf
def adjust_brightness(img: pxt.Image, factor: float) -> pxt.Image:
    """Adjust brightness. factor < 1 = darker, > 1 = brighter."""
    return ImageEnhance.Brightness(img).enhance(factor)


@pxt.udf
def adjust_contrast(img: pxt.Image, factor: float) -> pxt.Image:
    """Adjust contrast. factor < 1 = lower, > 1 = higher."""
    return ImageEnhance.Contrast(img).enhance(factor)


@pxt.udf
def adjust_saturation(img: pxt.Image, factor: float) -> pxt.Image:
    """Adjust saturation. factor < 1 = less saturated, > 1 = more saturated."""
    return ImageEnhance.Color(img).enhance(factor)
```

```python theme={null}
# Test brightness adjustments
t.select(
    t.image,
    adjust_brightness(t.image, 0.5),
    adjust_brightness(t.image, 1.5),
).head(1)
```
### Add: adjust brightness and contrast for all images in your table

```python theme={null}
# Brightness adjustments (1.0 = original)
t.add_computed_column(darker=adjust_brightness(t.image, 0.5))
t.add_computed_column(brighter=adjust_brightness(t.image, 1.5))

# Contrast adjustments
t.add_computed_column(low_contrast=adjust_contrast(t.image, 0.5))
t.add_computed_column(high_contrast=adjust_contrast(t.image, 2.0))

# Color saturation
t.add_computed_column(desaturated=adjust_saturation(t.image, 0.3))
t.add_computed_column(saturated=adjust_saturation(t.image, 2.0))
```
  Added 3 column values with 0 errors.
  Added 3 column values with 0 errors.
  Added 3 column values with 0 errors.
  Added 3 column values with 0 errors.
  Added 3 column values with 0 errors.
  Added 3 column values with 0 errors.
  3 rows updated, 3 values computed.
### View results

Compare different enhancement levels side-by-side.

```python theme={null}
# Compare brightness levels
t.select(t.image, t.darker, t.brighter).collect()
```
```python theme={null}
# Compare contrast levels
t.select(t.image, t.low_contrast, t.high_contrast).collect()
```
```python theme={null}
# Compare saturation levels
t.select(t.image, t.desaturated, t.saturated).collect()
```
## Explanation **How the enhancement technique works:** The UDFs wrap PIL’s `ImageEnhance` module to adjust visual properties of images. Each enhancement type creates an enhancer object for the image, then applies a multiplication factor. A factor of 1.0 leaves the image unchanged, values below 1.0 decrease the property (darker, less contrast, desaturated), and values above 1.0 increase it (brighter, more contrast, saturated). You can apply different factors to the same image to create multiple variations for comparison or different use cases. **To customize the UDFs:** * **Brightness factors**: Use 0.5 for darker images, 1.5 for brighter, or adjust to match your lighting needs * **Contrast factors**: Use 0.5 for lower contrast, 2.0 for higher contrast, or fine-tune for image clarity * **Saturation factors**: Use 0.3 for desaturated/muted colors, 2.0 for vibrant colors, or 0.0 for complete grayscale * **Combine adjustments**: Apply multiple enhancements to create complex transformations **The Pixeltable workflow:** In traditional databases, `.select()` just picks which columns to view. In Pixeltable, `.select()` also lets you compute new transformations on the fly—define new columns without storing them. This makes `.select()` perfect for testing transformations before you commit them. When you use `.select()`, you’re creating a query that doesn’t execute until you call `.collect()`. You must use `.collect()` to execute the query and return results—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()` to test on a subset before processing your full dataset. Once satisfied, use `.add_computed_column()` with the same expression to persist results permanently. For more on this workflow, see [Get fast feedback on transformations](/howto/cookbooks/core/dev-iterative-workflow). 
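The "combine adjustments" bullet can be sketched as one chained helper in plain Pillow. The `enhance` function is a hypothetical illustration, not part of the recipe; as a Pixeltable UDF it would carry the `@pxt.udf` decorator:

```python theme={null}
from PIL import Image, ImageEnhance

def enhance(
    img: Image.Image,
    brightness: float = 1.0,
    contrast: float = 1.0,
    saturation: float = 1.0,
) -> Image.Image:
    # Each enhancer returns a new image, so the adjustments chain cleanly;
    # a factor of 1.0 is a no-op for every step
    img = ImageEnhance.Brightness(img).enhance(brightness)
    img = ImageEnhance.Contrast(img).enhance(contrast)
    img = ImageEnhance.Color(img).enhance(saturation)
    return img
```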
## See also * [Test transformations with fast feedback loops](/howto/cookbooks/core/dev-iterative-workflow) * [Apply image filters](/howto/cookbooks/images/img-apply-filters) * *Pillow techniques from [Real Python: Image Processing With the Python Pillow Library](https://realpython.com/image-processing-with-the-python-pillow-library/)* # Detect objects in images Source: https://docs.pixeltable.com/howto/cookbooks/images/img-detect-objects Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Automatically identify and locate objects in images using YOLOX object detection models. ## Problem You have images that need object detection—identifying what objects are present and where they’re located. Manual labeling is slow and expensive.
## Solution

**What’s in this recipe:**

* Detect objects using YOLOX models (runs locally, no API needed)
* Get bounding boxes and class labels
* Filter detections by confidence threshold

You add a computed column that runs YOLOX on each image. Detection happens automatically when you insert new images.

### Setup

```python theme={null}
%pip install -qU pixeltable pixeltable-yolox
```

```python theme={null}
import pixeltable as pxt
from pixeltable.functions.yolox import yolox
```

### Load images

```python theme={null}
# Create a fresh directory
pxt.drop_dir('detection_demo', force=True)
pxt.create_dir('detection_demo')
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'detection_demo'.
```python theme={null}
# Create table for images
images = pxt.create_table('detection_demo/images', {'image': pxt.Image})
```
  Created table 'images'.
```python theme={null}
# Insert sample images (COCO dataset samples with common objects)
image_urls = [
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000106.jpg',
]
images.insert([{'image': url} for url in image_urls])
```
  Inserting rows into `images`: 3 rows [00:00, 523.85 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 6 values computed.
```python theme={null}
# View images
images.collect()
```
### Run object detection

Add a computed column that runs YOLOX on each image:

```python theme={null}
# Run YOLOX object detection
# model_id options: yolox_nano, yolox_tiny, yolox_s, yolox_m, yolox_l, yolox_x
images.add_computed_column(
    detections=yolox(images.image, model_id='yolox_m', threshold=0.5)
)
```
  Added 3 column values with 0 errors.
  3 rows updated, 3 values computed.
```python theme={null}
# View detection results
images.select(images.image, images.detections).collect()
```
### Extract detection details

Parse the detection output to get object counts and classes:

```python theme={null}
# Extract number of detections
@pxt.udf
def count_objects(detections: dict) -> int:
    """Count the number of detected objects."""
    return len(detections.get('labels', []))


images.add_computed_column(object_count=count_objects(images.detections))
```
  Added 3 column values with 0 errors.
  3 rows updated, 6 values computed.
```python theme={null}
# Extract unique object classes
@pxt.udf
def get_classes(detections: dict) -> list:
    """Get list of detected object classes."""
    return list(set(detections.get('labels', [])))


images.add_computed_column(object_classes=get_classes(images.detections))
```
  Added 3 column values with 0 errors.
  3 rows updated, 3 values computed.
```python theme={null}
# View summary
images.select(
    images.image, images.object_count, images.object_classes
).collect()
```
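Because the detection output is a plain JSON-like dict, you can also post-process it with ordinary Python. A minimal sketch, assuming parallel `labels`, `boxes`, and `scores` lists (the helper name is hypothetical):

```python theme={null}
def filter_detections(detections: dict, min_score: float) -> dict:
    """Keep only detections whose confidence meets the threshold."""
    # labels, boxes, and scores are parallel lists, so filter by index
    keep = [
        i for i, score in enumerate(detections.get('scores', []))
        if score >= min_score
    ]
    return {
        'labels': [detections['labels'][i] for i in keep],
        'boxes': [detections['boxes'][i] for i in keep],
        'scores': [detections['scores'][i] for i in keep],
    }
```

Registered with `@pxt.udf`, a helper like this could feed a computed column the same way `count_objects` does above.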
## Explanation

**YOLOX model sizes:** `yolox_nano`, `yolox_tiny`, `yolox_s`, `yolox_m`, `yolox_l`, `yolox_x`, ordered from smallest/fastest to largest/most accurate.
**Detection output format:**

The `detections` dictionary contains:

* `labels`: List of class names (e.g., “person”, “car”, “dog”)
* `boxes`: Bounding box coordinates [x1, y1, x2, y2]
* `scores`: Confidence scores (0-1)

**Adjusting threshold:**

* Higher threshold (0.7-0.9): Fewer detections, higher confidence
* Lower threshold (0.3-0.5): More detections, may include false positives

## See also

* [Extract frames from videos](/howto/cookbooks/video/video-extract-frames) - Detect objects in video frames
* [Analyze images in batch](/howto/cookbooks/images/vision-batch-analysis) - AI vision analysis
* [Find similar images](/howto/cookbooks/search/search-similar-images) - Visual similarity search

# Compare object detection and panoptic segmentation

Source: https://docs.pixeltable.com/howto/cookbooks/images/img-detection-vs-segmentation

Open in Kaggle  Open in Colab  Download Notebook

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

Understand when to use bounding boxes versus pixel-level masks for image analysis.

**What’s in this recipe:**

* Run object detection to get bounding boxes and labels
* Run panoptic segmentation to get pixel-level masks
* Visualize and compare outputs side-by-side

## Problem

You need to analyze objects in images, but there are two approaches:

* **Object detection** returns bounding boxes and class labels
* **Panoptic segmentation** returns pixel-level masks for every region
Which should you use? Detection is faster but approximate. Segmentation is slower but precise.

## Solution

Run both approaches on the same images using DETR models and compare the results.

### Setup

```python theme={null}
%pip install -qU pixeltable torch transformers timm
```

```python theme={null}
import numpy as np
import pixeltable as pxt
from pixeltable.functions.huggingface import (
    detr_for_object_detection,
    detr_for_segmentation,
)
from pixeltable.functions.vision import (
    draw_bounding_boxes,
    overlay_segmentation,
)
```

### Load images

```python theme={null}
pxt.drop_dir('detection_vs_seg', force=True)
pxt.create_dir('detection_vs_seg')
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'detection_vs_seg'.
```python theme={null}
images = pxt.create_table('detection_vs_seg/images', {'image': pxt.Image})

base_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images'
images.insert(
    [
        {'image': f'{base_url}/000000000034.jpg'},
        {'image': f'{base_url}/000000000049.jpg'},
    ]
)
```
  Created table 'images'.
  Inserted 2 rows with 0 errors in 0.22 s (9.21 rows/s)
  2 rows inserted.
### Run object detection

The `detr_for_object_detection` function returns bounding boxes, labels, and confidence scores.

**Parameters:**

* `model_id`: DETR variant (`facebook/detr-resnet-50` or `facebook/detr-resnet-101`)
* `threshold`: Confidence threshold (0.0-1.0). Higher = fewer but more confident detections

**Output:**

```python theme={null}
{'boxes': [[x1, y1, x2, y2], ...], 'scores': [0.98, ...], 'label_text': ['person', ...]}
```

```python theme={null}
images.add_computed_column(
    detections=detr_for_object_detection(
        images.image, model_id='facebook/detr-resnet-50', threshold=0.8
    )
)
```
  Added 2 column values with 0 errors in 4.09 s (0.49 rows/s)
  2 rows updated.
```python theme={null}
# View detection results
images.select(images.image, images.detections).collect()
```
### Visualize detections with bounding boxes

Use `draw_bounding_boxes` to overlay the detection results on the original image.

```python theme={null}
images.add_computed_column(
    detection_viz=draw_bounding_boxes(
        images.image,
        boxes=images.detections.boxes,
        labels=images.detections.label_text,
        fill=True,
        width=2,
    )
)
```
  Added 2 column values with 0 errors in 0.03 s (58.89 rows/s)
  2 rows updated.
```python theme={null}
images.select(images.detection_viz).collect()
```
### Run panoptic segmentation

The `detr_for_segmentation` function returns pixel-level masks and segment metadata.

**Parameters:**

* `model_id`: Segmentation model (`facebook/detr-resnet-50-panoptic`)
* `threshold`: Confidence threshold for filtering segments

**Output:**

```python theme={null}
{
    'segmentation': np.ndarray,  # (H, W) array where each pixel = segment ID
    'segments_info': [{'id': 1, 'label_text': 'person', 'score': 0.98}, ...]
}
```

> **Note:** The full segmentation output contains a numpy array that
> can’t be stored as JSON. We store just the `segments_info` metadata
> and compute the pixel-level visualization inline.

```python theme={null}
# Store just the segments_info (JSON-serializable) as a computed column
# The segmentation array will be computed inline for visualization
seg_expr = detr_for_segmentation(
    images.image,
    model_id='facebook/detr-resnet-50-panoptic',
    threshold=0.5,
)
images.add_computed_column(segments_info=seg_expr.segments_info)
```

```python theme={null}
# View stored segmentation info
images.select(images.image, images.segments_info).collect()
```
### Visualize segmentation with colored overlay

Use `overlay_segmentation` to visualize the pixel masks with colored regions and contours.

```python theme={null}
# Compute segmentation visualization inline
# Cast the segmentation array to the proper type for overlay_segmentation
seg_expr = detr_for_segmentation(
    images.image,
    model_id='facebook/detr-resnet-50-panoptic',
    threshold=0.5,
)
segmentation_map = seg_expr.segmentation.astype(
    pxt.Array[(None, None), np.int32]
)
images.select(
    segmentation_viz=overlay_segmentation(
        images.image,
        segmentation_map,
        alpha=0.5,
        draw_contours=True,
        contour_thickness=2,
    )
).collect()
```
### Compare side-by-side

```python theme={null}
# Side-by-side comparison: original, detection, segmentation
seg_expr = detr_for_segmentation(
    images.image,
    model_id='facebook/detr-resnet-50-panoptic',
    threshold=0.5,
)
segmentation_map = seg_expr.segmentation.astype(
    pxt.Array[(None, None), np.int32]
)
images.select(
    images.image,
    images.detection_viz,
    segmentation_viz=overlay_segmentation(
        images.image,
        segmentation_map,
        alpha=0.5,
        draw_contours=True,
        contour_thickness=2,
    ),
).collect()
```
### Count objects per image

```python theme={null}
# Count objects per image (using stored columns)
images.select(
    images.image,
    num_detections=images.detections.boxes.apply(len, col_type=pxt.Int),
    num_segments=images.segments_info.apply(len, col_type=pxt.Int),
).collect()
```
## Explanation Detection gives fast, approximate locations. Segmentation gives slower but precise boundaries. ### Capability comparison
### Performance tradeoffs
### When to use each **Choose detection when:** * You need to know *what* objects are present and *where* (approximately) * Speed matters (detection is 2x faster) * You need search, filtering, or counting * Bounding boxes suffice for visualization **Choose segmentation when:** * You need *exact* object boundaries (pixel-perfect masks) * You’re doing image editing, compositing, or AR * You need to measure actual object area/coverage * You want scene composition analysis (what % is sky vs buildings) ## See also * [Detect objects in images](./img-detect-objects) - Object detection with YOLOX * [Visualize detections](./img-visualize-detections) - Draw bounding boxes and labels * [DETR documentation](https://huggingface.co/docs/transformers/model_doc/detr) - Hugging Face model docs # Generate captions for images Source: https://docs.pixeltable.com/howto/cookbooks/images/img-generate-captions Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Automatically create descriptive captions for images using AI vision models. ## Problem You have a collection of images that need captions—for accessibility, SEO, content management, or searchability. Writing captions manually doesn’t scale.
## Solution

**What’s in this recipe:**

* Generate captions using OpenAI’s vision models
* Customize caption style (short, detailed, SEO-focused)
* Process images in batch automatically

You add a computed column that sends each image to a vision model with a captioning prompt. New images are captioned automatically on insert.

### Setup

```python theme={null}
%pip install -qU pixeltable openai
```

```python theme={null}
import getpass
import os

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
```

```python theme={null}
import pixeltable as pxt
from pixeltable.functions.openai import chat_completions
```

### Load images

```python theme={null}
# Create a fresh directory
pxt.drop_dir('caption_demo', force=True)
pxt.create_dir('caption_demo')
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'caption_demo'.
```python theme={null}
# Create table for images
images = pxt.create_table('caption_demo/images', {'image': pxt.Image})
```
  Created table 'images'.
```python theme={null}
# Insert sample images
image_urls = [
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000106.jpg',
]
images.insert([{'image': url} for url in image_urls])
```
  Inserted 3 rows with 0 errors in 0.12 s (25.17 rows/s)
  3 rows inserted.
```python theme={null}
# View images
images.collect()
```
### Generate captions

Add a computed column that generates captions using the vision model:

```python theme={null}
# Add caption using OpenAI vision
messages = [
    {
        'role': 'user',
        'content': [
            {
                'type': 'text',
                'text': 'Write a concise, descriptive caption for this image in one sentence.',
            },
            {'type': 'image_url', 'image_url': images.image},
        ],
    }
]
images.add_computed_column(
    caption=chat_completions(messages, model='gpt-4o-mini')
)
```
  Added 3 column values with 0 errors in 4.62 s (0.65 rows/s)
  3 rows updated.
```python theme={null}
# View images with captions
images.select(
    images.image, images.caption['choices'][0]['message']['content']
).collect()
```
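The bracketed path in that select mirrors plain-dict access on an OpenAI-style chat response. As a sketch (the helper name is hypothetical):

```python theme={null}
def extract_caption(response: dict) -> str:
    # Chat responses nest the generated text under choices -> message -> content
    return response['choices'][0]['message']['content']
```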
### Different caption styles

You can generate multiple caption styles for different uses:

```python theme={null}
# Add alt text for accessibility (brief)
messages = [
    {
        'role': 'user',
        'content': [
            {
                'type': 'text',
                'text': 'Write a brief alt text for this image (under 125 characters) for screen readers.',
            },
            {'type': 'image_url', 'image_url': images.image},
        ],
    }
]
images.add_computed_column(
    alt_text=chat_completions(messages, model='gpt-4o-mini')
)
```
  Added 3 column values with 0 errors in 3.51 s (0.85 rows/s)
  3 rows updated.
```python theme={null}
# Add detailed description
messages = [
    {
        'role': 'user',
        'content': [
            {
                'type': 'text',
                'text': 'Describe this image in detail, including objects, colors, setting, and mood.',
            },
            {'type': 'image_url', 'image_url': images.image},
        ],
    }
]
images.add_computed_column(
    description=chat_completions(messages, model='gpt-4o-mini')
)
```
  Added 3 column values with 0 errors in 11.28 s (0.27 rows/s)
  3 rows updated.
```python theme={null}
# View all caption types
images.select(
    images.image,
    images.caption['choices'][0]['message']['content'],
    images.alt_text['choices'][0]['message']['content'],
    images.description['choices'][0]['message']['content'],
).collect()
```
## Explanation **Caption prompt patterns:**
**Model selection:** * `gpt-4o-mini`: Fast and affordable, good for most captioning tasks * `gpt-4o`: Higher quality for complex images or detailed descriptions ## See also * [Analyze images in batch](/howto/cookbooks/images/vision-batch-analysis) - Run custom prompts on images * [Extract structured data from images](/howto/cookbooks/images/vision-structured-output) - Get JSON from images * [Find similar images](/howto/cookbooks/search/search-similar-images) - Visual similarity search # Transform images with AI-powered editing Source: https://docs.pixeltable.com/howto/cookbooks/images/img-image-to-image Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. ## Problem You have a batch of images that need AI-powered transformations—like turning photos into paintings, adding stylistic effects, or modifying content based on text prompts.
## Solution

**What’s in this recipe:**

* Transform images using text prompts with Hugging Face Stable Diffusion models
* Control transformation strength and quality settings
* Process batches of images automatically

You can iterate on transformations before adding them to your table. Use `.select()` with `.collect()` to preview results on sample images—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Once you’re satisfied, use `.add_computed_column()` to apply the transformation to all images in your table. For more on this workflow, see [Get fast feedback on transformations](/howto/cookbooks/core/dev-iterative-workflow).

### Setup

```python theme={null}
%pip install -qU pixeltable torch transformers diffusers accelerate
```

```python theme={null}
import pixeltable as pxt
from pixeltable.functions.huggingface import image_to_image
```

### Load images

```python theme={null}
# Create a fresh directory (drop existing if present)
pxt.drop_dir('img2img_demo', force=True)
pxt.create_dir('img2img_demo')
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/cpestano/.pixeltable/pgdata
  Created directory 'img2img_demo'.
```python theme={null}
t = pxt.create_table(
    'img2img_demo/images',
    {
        'image': pxt.Image,
        'prompt': pxt.String,
        'negative_prompt': pxt.String,
    },
)
```
  Created table 'images'.
```python theme={null}
t.insert(
    [
        {
            'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000285.jpg',
            'prompt': 'oil painting style, vibrant colors, brushstrokes visible',
            'negative_prompt': 'blurry, low quality, bad anatomy',
        },
        {
            'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000776.jpg',
            'prompt': 'watercolor painting, soft edges, artistic',
            'negative_prompt': 'blurry, low quality, bad anatomy',
        },
    ]
)
```
  Inserted 2 rows with 0 errors in 0.49 s (4.07 rows/s)
  2 rows inserted.
```python theme={null}
# View original images and prompts
t.collect()
```
### Iterate: test transformation on a single image

Use `.select()` to define the transformation, then `.head(n)` to preview results on a subset of images. Nothing is stored in your table.

The `image_to_image` function requires:

* `image`: The source image to transform
* `prompt`: Text describing the desired output
* `model_id`: A Hugging Face model ID that supports image-to-image (e.g., `stable-diffusion-v1-5/stable-diffusion-v1-5`)

```python theme={null}
# Preview transformation on first image
t.select(
    t.image,
    t.prompt,
    image_to_image(
        t.image,
        t.prompt,
        model_id='stable-diffusion-v1-5/stable-diffusion-v1-5',
    ),
).head(1)
```
### Iterate: adjust transformation strength You control how much the model modifies the original image using `strength` (0.0-1.0): * **Lower values** (0.3-0.5): Subtle changes, preserves more of the original * **Higher values** (0.7-1.0): Dramatic changes, more creative freedom You can pass additional parameters through `model_kwargs`. For example, `negative_prompt` takes text describing what you don’t want in the output. ```python theme={null} # Preview with lower strength (more preservation of original) t.select( t.image, t.prompt, t.negative_prompt, image_to_image( t.image, t.prompt, model_id='stable-diffusion-v1-5/stable-diffusion-v1-5', model_kwargs={ 'negative_prompt': t.negative_prompt, 'strength': 0.5, 'num_inference_steps': 30, }, ), ).head(1) ```
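As a rough intuition for how `strength` interacts with `num_inference_steps`: in diffusers image-to-image pipelines, `strength` determines how far back into the noise schedule the pipeline starts, so it effectively scales how many of the scheduled denoising steps actually run. The helper below is a simplified sketch of that arithmetic (it is not a Pixeltable or diffusers API, just an illustration):

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Approximate number of denoising steps an img2img pipeline runs.

    strength scales how far back into the noise schedule generation starts,
    which is why lower strength preserves more of the original image.
    """
    return min(int(num_inference_steps * strength), num_inference_steps)

# With the settings above (strength=0.5, 30 steps), roughly half the schedule runs.
print(effective_steps(30, 0.5))  # 15
print(effective_steps(30, 1.0))  # 30
```

So raising `num_inference_steps` at a fixed `strength` improves quality without changing how much of the original image is preserved.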
### Add: apply transformation to all images Once you’re satisfied with the results, use `.add_computed_column()` with the same expression. This processes all rows and stores the results permanently in your table. ```python theme={null} # Save as computed column t.add_computed_column( transformed=image_to_image( t.image, t.prompt, model_id='stable-diffusion-v1-5/stable-diffusion-v1-5', model_kwargs={ 'strength': 0.5, 'num_inference_steps': 25, 'negative_prompt': t.negative_prompt, }, ) ) ```
  Added 2 column values with 0 errors in 53.83 s (0.04 rows/s)
  2 rows updated.
```python theme={null} # View original and transformed images side by side t.select(t.image, t.prompt, t.negative_prompt, t.transformed).collect() ```
### Use reproducible results with seeds You set a `seed` parameter to get the same output every time you run the transformation. ```python theme={null} # Add reproducible transformation t.add_computed_column( transformed_seeded=image_to_image( t.image, t.prompt, model_id='stable-diffusion-v1-5/stable-diffusion-v1-5', seed=42, model_kwargs={ 'strength': 0.5, 'negative_prompt': t.negative_prompt, }, ) ) ```
  Added 2 column values with 0 errors in 96.24 s (0.02 rows/s)
  2 rows updated.
```python theme={null} # View results t.select(t.image, t.transformed_seeded).collect() ```
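The `seed` plays the same role as seeding any random number generator: the same seed yields the same sequence of random draws, so the diffusion process produces the same image. A plain-Python illustration of the principle:

```python
import random

# Two generators seeded identically produce identical sequences of draws.
gen_a = random.Random(42)
gen_b = random.Random(42)

draws_a = [gen_a.random() for _ in range(3)]
draws_b = [gen_b.random() for _ in range(3)]

assert draws_a == draws_b  # same seed -> same sequence -> reproducible output
```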
## Explanation **How image-to-image works:** Image-to-image diffusion models take an existing image and a text prompt, then generate a new image that blends the structure of the original with the guidance from the prompt. The `strength` parameter controls the balance—lower values preserve more of the original, while higher values allow more dramatic transformations. **Model compatibility:** The `image_to_image` UDF uses `AutoPipelineForImage2Image` from the diffusers library, which automatically detects the model type and selects the appropriate pipeline. You can use any compatible model: * `stable-diffusion-v1-5/stable-diffusion-v1-5` - General-purpose, runs on most hardware * `stabilityai/stable-diffusion-xl-base-1.0` - Higher quality, needs more VRAM **Key parameters:** * `strength` (0.0-1.0): How much to transform the image * `negative_prompt`: Text describing what to avoid in the generated image (e.g., “blurry, low quality”). * `num_inference_steps`: Quality vs speed tradeoff (more steps = better quality) * `guidance_scale`: How closely to follow the prompt (7-8 is typical) * `seed`: For reproducible results ## See also * [Apply filters to images](/howto/cookbooks/images/img-apply-filters) * [Generate captions for images](/howto/cookbooks/images/img-generate-captions) * [Hugging Face image-to-image models](https://huggingface.co/models?pipeline_tag=image-to-image) # Transform images with PIL operations Source: https://docs.pixeltable.com/howto/cookbooks/images/img-pil-transforms Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. ## Problem You need to resize, rotate, crop, or convert hundreds of images—and keep track of all the transformed versions. 
## Solution **What’s in this recipe:** * Basic image operations (resize, rotate, flip, crop) * Track image properties * Iterate on transformations before adding to your table You apply PIL transformations (resize, rotate, flip, crop) to images in your table using Pixeltable’s built-in image functions—common operations that work directly on image columns. You can iterate on transformations before adding them to your table. Use `.select()` with `.collect()` to preview results on sample images—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Once you’re satisfied, use `.add_computed_column()` to apply the transformation to all images in your table. For more on this workflow, see [Get fast feedback on transformations](/howto/cookbooks/core/dev-iterative-workflow). ### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import pixeltable as pxt ``` ### Load images ```python theme={null} # Create a fresh directory (drop existing if present) pxt.drop_dir('image_demo', force=True) pxt.create_dir('image_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'image_demo'.
```python theme={null} t = pxt.create_table('image_demo/images', {'image': pxt.Image}) ```
  Created table 'images'.
```python theme={null} t.insert( [ { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000285.jpg' }, { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000776.jpg' }, { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000885.jpg' }, ] ) ```
  Inserting rows into `images`: 0 rows [00:00, ? rows/s]
  Inserting rows into `images`: 3 rows [00:00, 708.38 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 6 values computed.
### Iterate: check image properties for a few images first Use `.select()` to define the transformation, then `.collect()` to execute and return results. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Nothing is stored in your table. Pixeltable includes these built-in functions for image properties: * `.height` - Get image height in pixels * `.width` - Get image width in pixels * `.mode` - Get color mode (RGB, RGBA, L for grayscale, etc.) ```python theme={null} # Preview the properties t.select(t.image, t.image.height, t.image.width, t.image.mode).collect() ```
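These properties map directly onto PIL attributes. A standalone sketch with an in-memory image (no Pixeltable table involved) shows what each one returns:

```python
from PIL import Image

# Create a small in-memory image to inspect its properties.
img = Image.new('RGB', (640, 480), color=(200, 100, 50))

print(img.width)   # 640
print(img.height)  # 480
print(img.mode)    # RGB
```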
### Add: check image properties for all images in your table ```python theme={null} # Save as computed columns t.add_computed_column(height=t.image.height) t.add_computed_column(width=t.image.width) t.add_computed_column(mode=t.image.mode) # RGB, RGBA, L (grayscale), etc. ```
  Added 3 column values with 0 errors.
  Added 3 column values with 0 errors.
  Added 3 column values with 0 errors.
  3 rows updated, 6 values computed.
```python theme={null} # View images with computed height, width, and mode columns t.collect() ```
### Iterate: resize a few images first Use `.select()` to define the transformation, then `.collect()` to execute and return results. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Nothing is stored in your table. Pixeltable includes a built-in function for resizing image files with PIL: * `.resize((width, height))` - Change image dimensions (takes a single `(width, height)` tuple) ```python theme={null} # Preview the resize operation t.select(t.image, t.image.resize((224, 224))).head(1) ```
### Add: resize all images in your table Once you’re satisfied with the results, use `.add_computed_column()` with the same expression. This processes all rows and stores the results permanently in your table. ```python theme={null} # Save as computed column t.add_computed_column(resized=t.image.resize((224, 224))) ```
  Added 3 column values with 0 errors.
  3 rows updated, 3 values computed.
```python theme={null} # View images with resized column t.collect() ```
### Iterate: rotate a few images first Use `.select()` to define the transformation, then `.collect()` to execute and return results. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Nothing is stored in your table. Pixeltable includes a built-in function for rotating image files with PIL: * `.rotate(degrees)` - Rotate image by specified degrees ```python theme={null} # Preview the rotation t.select(t.image, t.image.rotate(90)).head(1) ```
### Add: rotate all images in your table Once you’re satisfied with the results, use `.add_computed_column()` with the same expression. This processes all rows and stores the results permanently in your table. ```python theme={null} # Save as computed column t.add_computed_column(rotated=t.image.rotate(90)) ```
  Added 3 column values with 0 errors.
  3 rows updated, 3 values computed.
```python theme={null} # View images with rotated column t.collect() ```
### Iterate: flip a few images first Use `.select()` to define the transformation, then `.collect()` to execute and return results. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Nothing is stored in your table. Pixeltable includes a built-in function for transposing image files with PIL (note that for this transform you will need to import PIL’s `Image` module to access the `FLIP_*` constants): * `.transpose(Image.FLIP_TOP_BOTTOM)` - Flip image vertically * `.transpose(Image.FLIP_LEFT_RIGHT)` - Mirror image horizontally ```python theme={null} # Import PIL Image to access flip constants from PIL import Image # Preview both flip operations t.select( t.image, t.image.transpose(Image.FLIP_TOP_BOTTOM), t.image.transpose(Image.FLIP_LEFT_RIGHT), ).head(1) ```
### Add: flip all images in your table Once you’re satisfied with the results, use `.add_computed_column()` with the same expression. This processes all rows and stores the results permanently in your table. ```python theme={null} # Flip vertically (top to bottom) t.add_computed_column(flip_v=t.image.transpose(Image.FLIP_TOP_BOTTOM)) # Flip horizontally (left to right, mirror effect) t.add_computed_column(flip_h=t.image.transpose(Image.FLIP_LEFT_RIGHT)) ```
  Added 3 column values with 0 errors.
  Added 3 column values with 0 errors.
  3 rows updated, 3 values computed.
```python theme={null} # View original and flipped versions side by side t.select(t.image, t.flip_v, t.flip_h).collect() ```
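A useful property: each flip is its own inverse, so applying the same transpose twice restores the original image. A plain-PIL check, independent of any table:

```python
from PIL import Image

# Mark one corner so orientation is visible after flipping.
img = Image.new('RGB', (4, 2))
img.putpixel((0, 0), (255, 0, 0))

flipped = img.transpose(Image.FLIP_LEFT_RIGHT)
restored = flipped.transpose(Image.FLIP_LEFT_RIGHT)

assert flipped.getpixel((3, 0)) == (255, 0, 0)          # corner mirrored to the right edge
assert list(restored.getdata()) == list(img.getdata())  # flipping twice restores the image
```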
### Iterate: crop a few images first Use `.select()` to define the transformation, then `.collect()` to execute and return results. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Nothing is stored in your table. Pixeltable includes a built-in function for cropping image files with PIL: * `.crop(box)` - Extract a rectangular region from the image (box format: `(left, top, right, bottom)`) ```python theme={null} # Preview the center crop # Box format: (left, top, right, bottom) t.select( t.image, t.image.crop( ( t.image.width // 4, t.image.height // 4, 3 * t.image.width // 4, 3 * t.image.height // 4, ) ), ).head(1) ```
### Add: crop all images in your table Once you’re satisfied with the results, use `.add_computed_column()` with the same expression. This processes all rows and stores the results permanently in your table. ```python theme={null} # Save as computed column t.add_computed_column( center_crop=t.image.crop( ( t.image.width // 4, t.image.height // 4, 3 * t.image.width // 4, 3 * t.image.height // 4, ) ) ) ```
  Added 3 column values with 0 errors.
  3 rows updated, 3 values computed.
```python theme={null} # View center-cropped images t.select(t.center_crop).collect() ```
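The crop box above is plain integer arithmetic. Here is the same `// 4` computation as a standalone helper, so you can sanity-check the region for a given image size before adding the column:

```python
def center_crop_box(width: int, height: int) -> tuple:
    """Box covering the middle 50% of each dimension: (left, top, right, bottom)."""
    return (width // 4, height // 4, 3 * width // 4, 3 * height // 4)

# For a 640x480 image, the crop keeps the central 320x240 region.
print(center_crop_box(640, 480))  # (160, 120, 480, 360)
```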
## Explanation **How PIL transformations work in Pixeltable:** Pixeltable provides built-in functions that wrap PIL (Pillow) operations for image manipulation. These functions work directly on image columns in your table—no need to write loops or manage file I/O. When you call `.resize()`, `.rotate()`, or other methods on an image column, Pixeltable handles applying the transformation to each image automatically. All these transformations use standard PIL operations under the hood. For more details on PIL functionality, see the [Pillow documentation](https://pillow.readthedocs.io/). **To customize transformations:** * **Resize**: Change dimensions with `.resize((width, height))` - specify target size in pixels * **Rotate**: Rotate counterclockwise with `.rotate(degrees)` - use negative values for clockwise rotation * **Flip**: Use `.transpose(Image.FLIP_LEFT_RIGHT)` for horizontal mirror or `.transpose(Image.FLIP_TOP_BOTTOM)` for vertical flip * **Crop**: Extract regions with `.crop((left, top, right, bottom))` - coordinates are in pixels from top-left origin * **Properties**: Access `.width`, `.height`, and `.mode` to get image dimensions and color mode **The Pixeltable workflow:** In traditional databases, `.select()` just picks which columns to view. In Pixeltable, `.select()` also lets you compute new transformations on the fly—define new columns without storing them. This makes `.select()` perfect for testing transformations before you commit them. When you use `.select()`, you’re creating a query that doesn’t execute until you call `.collect()`. You must use `.collect()` to execute the query and return results—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()` to test on a subset before processing your full dataset. Once satisfied, use `.add_computed_column()` with the same expression to persist results permanently. 
For more on this workflow, see [Get fast feedback on transformations](/howto/cookbooks/core/dev-iterative-workflow). ## See also * [Convert RGB images to grayscale](/howto/cookbooks/images/img-rgb-to-grayscale) * [Apply filters to images](/howto/cookbooks/images/img-apply-filters) * [Test transformations with fast feedback loops](/howto/cookbooks/core/dev-iterative-workflow) # Convert color images to grayscale Source: https://docs.pixeltable.com/howto/cookbooks/images/img-rgb-to-grayscale Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. ## Problem You need to convert color images to grayscale for analysis, preprocessing, or model inputs that require single-channel images. Different conversion methods produce different results—you need to choose the right approach for your use case. ## Solution **What’s in this recipe:** * Simple conversion with PIL * Perceptually accurate grayscale (weighted RGB channels) * Custom UDF for advanced conversion You convert RGB images to grayscale in your table using either Pixeltable’s built-in `.convert()` method for standard conversion, or a custom UDF (relies on NumPy and PIL/Pillow) for gamma-corrected conversion when scientific accuracy matters. You can iterate on transformations before adding them to your table. Use `.select()` with `.collect()` to preview results on sample images—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Once you’re satisfied, use `.add_computed_column()` to apply the conversion to all images in your table. For more on this workflow, see [Get fast feedback on transformations](/howto/cookbooks/core/dev-iterative-workflow). **Conversion methods:**
The simple method uses PIL’s built-in conversion. The gamma-corrected method requires a custom UDF (not built into PIL) that applies perceptual weighting in linear color space. *For technical details on gamma correction and grayscale conversion, see [Wikipedia: Grayscale](https://en.wikipedia.org/wiki/Grayscale).* ### Setup ```python theme={null} %pip install -qU pixeltable numpy ``` ```python theme={null} import numpy as np import pixeltable as pxt from PIL import Image ``` ### Load images ```python theme={null} # Create a fresh directory (drop existing if present) pxt.drop_dir('image_demo', force=True) pxt.create_dir('image_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'image_demo'.
```python theme={null} t = pxt.create_table('image_demo/gray', {'image': pxt.Image}) ```
  Created table 'gray'.
```python theme={null} t.insert( [ { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg' }, { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg' }, { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000106.jpg' }, ] ) ```
  Inserting rows into `gray`: 0 rows [00:00, ? rows/s]
  Inserting rows into `gray`: 3 rows [00:00, 617.66 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 6 values computed.
```python theme={null} # View loaded images t.collect() ```
### Iterate: convert with linear approximation for a few images first ```python theme={null} # Query: Preview the conversion t.select(t.image, t.image.convert('L')).head(1) ```
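To see what `.convert('L')` does to actual pixel values, here is a standalone PIL check on a single pure-red pixel. PIL’s `'L'` mode applies the ITU-R 601 luma weights `L = R*299/1000 + G*587/1000 + B*114/1000`:

```python
from PIL import Image

red = Image.new('RGB', (1, 1), color=(255, 0, 0))
gray = red.convert('L')

# 255 * 299/1000 = 76.245, so pure red maps to roughly 76.
print(gray.getpixel((0, 0)))
```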
### Add: convert with linear approximation for all images in your table ```python theme={null} # Commit: Save as computed column (built-in PIL conversion - fast and good for most use cases) t.add_computed_column(grayscale=t.image.convert('L')) ```
  Added 3 column values with 0 errors.
  3 rows updated, 3 values computed.
```python theme={null} # View images with grayscale column t.collect() ```
### Iterate: convert with gamma decompression for a few images first ```python theme={null} @pxt.udf def rgb_to_gray_accurate(img: Image.Image) -> Image.Image: """Convert RGB to grayscale with full gamma correction. Most accurate but slower. Gamma-decompresses, applies perceptual weights in linear space, then re-compresses for display. """ rgb = np.array(img).astype(np.float32) / 255.0 # Gamma decompress: make pixel values perceptually linear rgb_lin = ((rgb + 0.055) / 1.055) ** 2.4 rgb_lin = np.where(rgb <= 0.04045, rgb / 12.92, rgb_lin) # Apply perceptual weights in linear space gray_lin = ( 0.2126 * rgb_lin[:, :, 0] + 0.7152 * rgb_lin[:, :, 1] + 0.0722 * rgb_lin[:, :, 2] ) # Gamma compress: make values display-ready gray = 1.055 * gray_lin ** (1 / 2.4) - 0.055 gray = np.where(gray_lin <= 0.0031308, 12.92 * gray_lin, gray) gray = (gray * 255).astype(np.uint8) return Image.fromarray(gray) ``` ```python theme={null} # Compare both methods on first image t.select(t.image, t.grayscale, rgb_to_gray_accurate(t.image)).head(1) ```
### Add: convert with gamma decompression for all images in your table ```python theme={null} t.add_computed_column(accurate=rgb_to_gray_accurate(t.image)) ```
  Added 3 column values with 0 errors.
  3 rows updated, 3 values computed.
```python theme={null} # View all results t.collect() ```
## Explanation **Two approaches:** 1. **Simple (`.convert('L')`):** PIL’s built-in. Fast, good for most use cases (model preprocessing, general analysis). 2. **Gamma-corrected (custom UDF):** Not built into PIL. Requires a custom UDF that: * Gamma-decompresses to linear space * Applies perceptual weights: 0.2126 × R + 0.7152 × G + 0.0722 × B * Gamma-compresses back for display * Slower but most perceptually accurate * Use for scientific imaging, professional photography **Why gamma matters:** Displays aren’t linear—doubling a pixel value doesn’t double perceived brightness. Gamma correction accounts for this. For best results, convert to linear space before weighting, then convert back. *The gamma-corrected method is based on [Brandon Rohrer’s explanation](https://brandonrohrer.com/convert_rgb_to_grayscale.html) of perceptually accurate RGB to grayscale conversion.* **The Pixeltable workflow:** In traditional databases, `.select()` just picks which columns to view. In Pixeltable, `.select()` also lets you compute new transformations on the fly—define new columns without storing them. This makes `.select()` perfect for testing transformations before you commit them. When you use `.select()`, you’re creating a query that doesn’t execute until you call `.collect()`. You must use `.collect()` to execute the query and return results—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()` to test on a subset before processing your full dataset. Once satisfied, use `.add_computed_column()` with the same expression to persist results permanently. For more on this workflow, see [Get fast feedback on transformations](/howto/cookbooks/core/dev-iterative-workflow). 
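As a worked example of why the two methods disagree, here is the single-pixel arithmetic for pure red in plain Python, using the same sRGB constants as the UDF above:

```python
def srgb_to_linear(c: float) -> float:
    # Gamma decompress one sRGB channel value in [0, 1].
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(c: float) -> float:
    # Gamma compress a linear value back to display sRGB.
    return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055

# Pure red (255, 0, 0), normalized to [0, 1].
r, g, b = 1.0, 0.0, 0.0
gray_lin = (
    0.2126 * srgb_to_linear(r)
    + 0.7152 * srgb_to_linear(g)
    + 0.0722 * srgb_to_linear(b)
)
gray = round(linear_to_srgb(gray_lin) * 255)

# Gamma-corrected result is ~127, far brighter than the naive
# weighted sum 0.2126 * 255 = 54 computed without gamma handling.
print(gray)  # 127
```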
## See also * [Transform images with PIL operations](/howto/cookbooks/images/img-pil-transforms) * [Test transformations with fast feedback loops](/howto/cookbooks/core/dev-iterative-workflow) # Visualize object detections Source: https://docs.pixeltable.com/howto/cookbooks/images/img-visualize-detections Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Draw bounding boxes on images to visualize object detection results. ## Problem You’ve run object detection on images but need to visualize the results—see where objects were detected and verify the model’s accuracy.
## Solution **What’s in this recipe:** * Run object detection with YOLOX * Draw bounding boxes on images * Color-code by object class You create a pipeline that detects objects and then draws the results on the original image. ### Setup ```python theme={null} %pip install -qU pixeltable pixeltable-yolox ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions.vision import draw_bounding_boxes from pixeltable.functions.yolox import yolox ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('viz_demo', force=True) pxt.create_dir('viz_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'viz_demo'.
### Create detection and visualization pipeline ```python theme={null} # Create table for images images = pxt.create_table('viz_demo/images', {'image': pxt.Image}) ```
  Created table 'images'.
```python theme={null} # Step 1: Run object detection images.add_computed_column( detections=yolox(images.image, model_id='yolox_m', threshold=0.5) ) ```
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Step 2: Draw bounding boxes on the image # Note: draw_bounding_boxes takes image, boxes, and labels (scores are not used for drawing) images.add_computed_column( annotated=draw_bounding_boxes( images.image, images.detections.bboxes, labels=images.detections.labels, ) ) ```
  Added 0 column values with 0 errors.
  No rows affected.
### Detect and visualize ```python theme={null} # Insert sample images base_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images' image_urls = [ f'{base_url}/000000000036.jpg', # cats f'{base_url}/000000000139.jpg', # elephants ] images.insert([{'image': url} for url in image_urls]) ```
  Inserting rows into `images`: 0 rows [00:00, ? rows/s]
  Inserting rows into `images`: 2 rows [00:00, 236.29 rows/s]
  Inserted 2 rows with 0 errors.
  2 rows inserted, 8 values computed.
```python theme={null} # View original vs annotated images side by side images.select(images.image, images.annotated).collect() ```
```python theme={null} # View detection details images.select(images.detections).collect() ```
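To illustrate the shape of what `detections` holds, here is how the parallel lists line up in plain Python. The values below are made up for illustration; index `i` across the three lists describes one detected object:

```python
# Hypothetical detection result in the dict format yolox returns.
detections = {
    'bboxes': [[54.0, 62.1, 350.4, 470.9], [301.2, 88.0, 598.7, 455.3]],
    'labels': ['cat', 'cat'],
    'scores': [0.92, 0.81],
}

# Pair the parallel lists up with zip to describe each detection.
summary = [
    f'{label} ({score:.0%}) at {[round(c) for c in box]}'
    for box, label, score in zip(
        detections['bboxes'], detections['labels'], detections['scores']
    )
]
print(summary[0])  # cat (92%) at [54, 62, 350, 471]
```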
## Explanation **Pipeline flow:**
  Image → YOLOX detection → Bounding boxes + labels → draw_bounding_boxes → Annotated image
**Detection output format:** The `yolox` function returns a dict with: * `bboxes` - List of [x1, y1, x2, y2] coordinates * `labels` - List of class names (e.g., “cat”, “dog”) * `scores` - List of confidence scores (0-1) **YOLOX model options:**
## See also * [Detect objects in images](/howto/cookbooks/images/img-detect-objects) - Object detection basics * [Extract video frames](/howto/cookbooks/video/video-extract-frames) - Detect objects in video # Analyze images in batch with AI vision Source: https://docs.pixeltable.com/howto/cookbooks/images/vision-batch-analysis Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Run the same AI prompt against multiple images automatically, without writing loops or managing API calls. ## Problem You have a collection of images that all need the same analysis—like “Describe this image”, “Is this product damaged?”, or “What objects are visible?”. Writing a loop to call an API for each image is tedious and error-prone. You need to handle rate limits, retries, and track which images succeeded or failed.
## Solution **What’s in this recipe:** * Analyze multiple images with a single prompt using `openai.vision()` * Get all results at once, stored in your table * No loops or manual API calls You add a computed column that applies `openai.vision()` to every image in your table. Pixeltable handles the API calls, retries, and result storage automatically. When you insert new images, the analysis runs automatically—no extra code needed. ### Setup ```python theme={null} %pip install -qU pixeltable openai import getpass import os if 'OPENAI_API_KEY' not in os.environ: os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ') ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions import openai ``` ### Load images ```python theme={null} # Create a fresh directory pxt.drop_dir('vision_demo', force=True) pxt.create_dir('vision_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'vision_demo'.
```python theme={null} t = pxt.create_table('vision_demo/images', {'image': pxt.Image}) ```
  Created table 'images'.
```python theme={null} # Insert sample images t.insert( [ { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg' }, { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg' }, { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000106.jpg' }, ] ) ```
  Inserted 3 rows with 0 errors in 0.03 s (88.80 rows/s)
  3 rows inserted.
```python theme={null} # View loaded images t.collect() ```
### Analyze images with AI Add a computed column using `openai.chat_completions()`. The prompt runs automatically on all images: ```python theme={null} # Define the prompt messages = [ { 'role': 'user', 'content': [ { 'type': 'text', 'text': 'Describe this image in one sentence.', }, {'type': 'image_url', 'image_url': t.image}, ], } ] # Add computed column for AI analysis using openai.chat_completions() t.add_computed_column( description=openai.chat_completions(messages, model='gpt-4o-mini') ) ```
  Added 3 column values with 0 errors in 4.84 s (0.62 rows/s)
  3 rows updated.
### View results `openai.chat_completions()` returns a JSON structure containing the output, which we can unpack in the usual way: ```python theme={null} # View results: image alongside its AI-generated description t.select( t.image, t.description, t.description['choices'][0]['message']['content'], ).collect() ```
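The bracket indexing above mirrors plain dict access on the response structure. A standalone sketch with a hypothetical response payload (trimmed to the fields used here):

```python
# Hypothetical chat_completions response value stored in the description column.
response = {
    'choices': [
        {'message': {'role': 'assistant', 'content': 'A cat sitting on a sofa.'}}
    ],
    'model': 'gpt-4o-mini',
}

# t.description['choices'][0]['message']['content'] performs exactly this lookup.
content = response['choices'][0]['message']['content']
print(content)  # A cat sitting on a sofa.
```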
### New images are analyzed automatically When you insert more images, the analysis runs without any extra code: ```python theme={null} # Insert a new image - analysis happens automatically t.insert( [ { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000139.jpg' } ] ) # View all results including the new image t.select( t.image, t.description, t.description['choices'][0]['message']['content'], ).collect() ```
## Explanation **How it works:** 1. Add images to your table 2. Define a computed column with `openai.chat_completions()` 3. Pixeltable executes the API call for each row automatically 4. Results are cached—rerunning won’t re-call the API 5. New rows trigger automatic computation **Changing the prompt:** To use a different prompt, add a new computed column with `if_exists='replace'`: ```python theme={null} messages = ... t.add_computed_column( description=openai.chat_completions(messages, model='gpt-4o-mini'), if_exists='replace' ) ``` **Using other providers:** Replace `openai.chat_completions` with: * `anthropic.messages` for Claude * `google.generate_content` for Gemini * `together.chat_completions` for Together AI ## See also * [Configure API keys](/howto/cookbooks/core/workflow-api-keys) * [Working with OpenAI](/howto/providers/working-with-openai) # Extract structured data from images Source: https://docs.pixeltable.com/howto/cookbooks/images/vision-structured-output Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Use AI vision to extract JSON data from receipts, forms, documents, and other images. ## Problem You have images containing structured information (receipts, forms, ID cards) and need to extract specific fields as JSON for downstream processing.
## Solution **What’s in this recipe:** * Extract structured JSON from images using GPT-4o * Use `openai.chat_completions()`, which handles images directly * Access individual fields from the extracted data You use Pixeltable’s `openai.chat_completions()` function, which automatically handles image encoding. Request JSON output via `response_format` in `model_kwargs`. ### Setup ```python theme={null} %pip install -qU pixeltable openai import getpass import os if 'OPENAI_API_KEY' not in os.environ: os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ') ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions import openai ``` ### Load images ```python theme={null} # Create a fresh directory pxt.drop_dir('extraction_demo', force=True) pxt.create_dir('extraction_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'extraction_demo'.
```python theme={null} t = pxt.create_table('extraction_demo/images', {'image': pxt.Image}) ```
  Created table 'images'.
```python theme={null} # Insert sample images t.insert( [ { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg' }, { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg' }, ] ) ```
  Inserted 2 rows with 0 errors in 0.03 s (60.43 rows/s)
  2 rows inserted.
### Extract structured data Use `openai.chat_completions()` to analyze images and get JSON output: ```python theme={null} # Add extraction column using openai.chat_completions (handles images directly) PROMPT = """Analyze this image and extract the following as JSON: - description: A brief description of the image - objects: List of objects visible in the image - dominant_colors: List of dominant colors - scene_type: Type of scene (indoor, outdoor, etc.)""" messages = [ { 'role': 'user', 'content': [ {'type': 'text', 'text': PROMPT}, {'type': 'image_url', 'image_url': t.image}, ], } ] t.add_computed_column( data=openai.chat_completions( messages, model='gpt-4o-mini', model_kwargs={'response_format': {'type': 'json_object'}}, ) ) ```
  Added 2 column values with 0 errors in 7.55 s (0.26 rows/s)
  2 rows updated.
```python theme={null}
# View extracted data
t.select(
    t.image, t.data, t.data['choices'][0]['message']['content']
).collect()
```
```python theme={null}
# You can also parse the JSON into individual columns if needed
import json

@pxt.udf
def parse_description(data: str) -> str:
    return json.loads(data).get('description', '')

t.select(
    t.image,
    description=parse_description(
        t.data['choices'][0]['message']['content']
    ),
).collect()
```
## Explanation

**Getting JSON output:** Pass `model_kwargs={'response_format': {'type': 'json_object'}}` to get structured JSON.
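The parsing step above can be sketched outside Pixeltable. Assuming a chat-completions-style response dict (the `choices[0].message.content` path shown above) whose content field holds the model's JSON string, extracting one field looks like:

```python
import json

# Minimal stand-in for the shape of a chat-completions response;
# a real response contains many more fields.
response = {
    'choices': [
        {'message': {'content': '{"description": "Two cats on a couch", "scene_type": "indoor"}'}}
    ]
}

# Navigate to the content string, then parse the JSON the model returned
content = response['choices'][0]['message']['content']
data = json.loads(content)
print(data.get('description', ''))  # -> Two cats on a couch
```

This is the same navigation the `parse_description` UDF performs on the `data` column.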
## See also

* [Analyze images in batch](/howto/cookbooks/images/vision-batch-analysis)
* [Configure API keys](/howto/cookbooks/core/workflow-api-keys)

# Create text embeddings with OpenAI

Source: https://docs.pixeltable.com/howto/cookbooks/search/embed-text-openai

Open in Kaggle  Open in Colab  Download Notebook

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

Generate vector embeddings for text data to enable semantic search and similarity matching.

## Problem

You need to convert text into vector embeddings for:

* Semantic search (find similar documents)
* RAG pipelines (retrieve relevant context)
* Clustering and classification
## Solution

**What’s in this recipe:**

* Generate embeddings with OpenAI’s models
* Store embeddings as computed columns
* Use embeddings for similarity queries

You add an embedding column that automatically generates vectors for new rows. The embeddings are cached and only recomputed when the source text changes.

### Setup

```python theme={null}
%pip install -qU pixeltable openai
```

```python theme={null}
import getpass
import os

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
```

```python theme={null}
import pixeltable as pxt
from pixeltable.functions.openai import embeddings
```

```python theme={null}
# Create a fresh directory
pxt.drop_dir('embed_demo', force=True)
pxt.create_dir('embed_demo')
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'embed\_demo'.
### Create table with embedding column

```python theme={null}
# Create table for documents
docs = pxt.create_table(
    'embed_demo/documents', {'title': pxt.String, 'content': pxt.String}
)
```
  Created table 'documents'.
```python theme={null}
# Add embedding column using OpenAI's text-embedding-3-small
docs.add_computed_column(
    embedding=embeddings(docs.content, model='text-embedding-3-small')
)
```
  Added 0 column values with 0 errors.
  No rows affected.
### Insert documents

```python theme={null}
# Insert sample documents
sample_docs = [
    {
        'title': 'Python Basics',
        'content': 'Python is a high-level programming language known for its clear syntax and readability.',
    },
    {
        'title': 'Machine Learning',
        'content': 'Machine learning is a subset of AI that enables systems to learn from data.',
    },
    {
        'title': 'Web Development',
        'content': 'Web development involves building websites and web applications using HTML, CSS, and JavaScript.',
    },
    {
        'title': 'Data Science',
        'content': 'Data science combines statistics, programming, and domain expertise to extract insights from data.',
    },
    {
        'title': 'Cloud Computing',
        'content': 'Cloud computing provides on-demand computing resources over the internet.',
    },
]
docs.insert(sample_docs)
```
  Inserting rows into \`documents\`: 5 rows \[00:00, 553.22 rows/s]
  Inserted 5 rows with 0 errors.
  5 rows inserted, 15 values computed.
```python theme={null}
# View documents with embeddings
result = docs.select(docs.title, docs.embedding).collect()
```

### Query by similarity

Find documents similar to a query by creating an embedding index:

```python theme={null}
# Add embedding index for semantic search
docs.add_embedding_index(
    column='content',
    string_embed=embeddings.using(model='text-embedding-3-small'),
)
```

```python theme={null}
# Search for similar documents
sim = docs.content.similarity(
    string='artificial intelligence applications'
)
results = (
    docs.where(sim > 0.2)
    .order_by(sim, asc=False)
    .limit(3)
    .select(docs.title, docs.content, sim=sim)
)
results.collect()
```
## Explanation

**Embedding model:** These examples use OpenAI’s `text-embedding-3-small`; other OpenAI embedding models can be substituted via the `model` parameter.

**Similarity metric:** Results are ranked by cosine similarity, so higher scores mean closer matches.
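To make the ranking concrete, cosine similarity can be computed in plain Python. The vectors below are illustrative 3-dimensional toys; real embeddings have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

ml_doc = [0.9, 0.1, 0.2]    # toy embedding for an ML document
ai_query = [0.8, 0.2, 0.1]  # toy embedding for an AI-related query
web_doc = [0.1, 0.9, 0.3]   # toy embedding for a web-dev document

# The ML document scores higher against the AI query than the web one
assert cosine_similarity(ml_doc, ai_query) > cosine_similarity(web_doc, ai_query)
```

The embedding index performs this comparison efficiently at scale instead of scoring every row one by one.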
**Key benefits of computed embedding columns:**

* Embeddings are generated automatically on insert
* Results are cached—no re-computation on subsequent queries
* Index enables fast similarity search at scale

## See also

* [Semantic text search](/howto/cookbooks/search/search-semantic-text) - Full semantic search patterns
* [Chunk documents for RAG](/howto/cookbooks/text/doc-chunk-for-rag) - Prepare documents for retrieval

# Build semantic search for text

Source: https://docs.pixeltable.com/howto/cookbooks/search/search-semantic-text

Open in Kaggle  Open in Colab  Download Notebook

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

Create a searchable knowledge base that finds content by meaning, not just keywords.

## Problem

You have a collection of text content (articles, notes, documentation) and need to find relevant items based on meaning. Keyword search fails when users phrase queries differently from the source text.
## Solution

**What’s in this recipe:**

* Create a text table with embeddings
* Search by semantic similarity
* Combine with metadata filters

You add an embedding index to your text column. Pixeltable automatically generates embeddings for each row and enables similarity search.

### Setup

```python theme={null}
%pip install -qU pixeltable sentence-transformers
```

```python theme={null}
import pixeltable as pxt
from pixeltable.functions.huggingface import sentence_transformer
```

### Create knowledge base

```python theme={null}
# Create a fresh directory
pxt.drop_dir('search_demo', force=True)
pxt.create_dir('search_demo')
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'search\_demo'.
```python theme={null}
# Create table with content and metadata
kb = pxt.create_table(
    'search_demo/articles',
    {'title': pxt.String, 'content': pxt.String, 'category': pxt.String},
)
```
  Created table 'articles'.
```python theme={null}
# Insert sample content
kb.insert(
    [
        {
            'title': 'Debugging best practices',
            'content': 'Use logging, breakpoints, and unit tests to identify and fix issues in your code.',
            'category': 'engineering',
        },
        {
            'title': 'Machine learning model optimization',
            'content': 'Improve training efficiency with batch normalization, learning rate schedules, and early stopping.',
            'category': 'ml',
        },
        {
            'title': 'Production infrastructure setup',
            'content': 'Deploy applications using containers, load balancers, and automated scaling.',
            'category': 'devops',
        },
        {
            'title': 'API design principles',
            'content': 'Create RESTful endpoints with proper versioning, authentication, and error handling.',
            'category': 'engineering',
        },
    ]
)
```
  Inserting rows into \`articles\`: 4 rows \[00:00, 577.69 rows/s]
  Inserted 4 rows with 0 errors.
  4 rows inserted, 12 values computed.
### Add semantic search

Create an embedding index on the content column:

```python theme={null}
# Add embedding index
kb.add_embedding_index(
    column='content',
    string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2'),
)
```

### Search by meaning

Find content semantically similar to your query:

```python theme={null}
# Search by meaning
query = 'how to fix bugs'
sim = kb.content.similarity(string=query)
results = (
    kb.order_by(sim, asc=False)
    .select(kb.title, kb.content, score=sim)
    .limit(2)
)
results.collect()
```
### Filter by metadata

Combine semantic search with metadata filters:

```python theme={null}
# Search within a specific category
query = 'best practices'
sim = kb.content.similarity(string=query)
results = (
    kb.where(kb.category == 'engineering')  # Filter first
    .order_by(sim, asc=False)
    .select(kb.title, kb.category, score=sim)
    .limit(2)
)
results.collect()
```
## Explanation

**How similarity search works:**

1. Your query is converted to an embedding vector
2. Pixeltable finds the most similar vectors in the index
3. Results are ranked by cosine similarity (0 to 1)

**Embedding model:** This recipe uses the local `all-MiniLM-L6-v2` sentence-transformer model, so no API key is required.
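Those three steps can be sketched in plain Python. The `embed` function below is a hypothetical stand-in for a real embedding model (it just counts a few keywords), but the embed-score-rank flow is the same:

```python
import math

def embed(text: str) -> list[float]:
    # Hypothetical stand-in: a real model maps text to a dense vector.
    keywords = ['bug', 'model', 'deploy']
    words = text.lower().split()
    return [float(words.count(k)) for k in keywords]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

docs = {
    'Debugging': 'fix the bug with a failing bug report',
    'ML': 'train the model and tune the model',
}

# 1. Embed the query  2. Score every document  3. Rank by similarity
query_vec = embed('how to fix a bug')
ranked = sorted(docs, key=lambda t: cosine(embed(docs[t]), query_vec), reverse=True)
print(ranked[0])  # -> Debugging
```

The embedding index replaces the exhaustive `sorted` scan with an approximate nearest-neighbor lookup.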
**New content is indexed automatically:** When you insert new rows, embeddings are generated without extra code.

## See also

* [Vector database documentation](/platform/embedding-indexes)
* [Split documents for RAG](/howto/cookbooks/text/doc-chunk-for-rag)

# Find similar images with CLIP

Source: https://docs.pixeltable.com/howto/cookbooks/search/search-similar-images

Open in Kaggle  Open in Colab  Download Notebook

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

Build visual similarity search to find images that look alike using OpenAI’s CLIP model.

## Problem

You have a collection of images and need to find visually similar ones—for duplicate detection, content recommendations, or visual search.
## Solution

**What’s in this recipe:**

* Create image embeddings with CLIP
* Search by image similarity
* Search by text description (cross-modal)

You add an embedding index using CLIP, which understands both images and text. This enables finding similar images or searching images by text description.

### Setup

```python theme={null}
%pip install -qU pixeltable sentence-transformers torch
```

```python theme={null}
import pixeltable as pxt
from pixeltable.functions.huggingface import clip
```

### Load images

```python theme={null}
# Create a fresh directory
pxt.drop_dir('image_search_demo', force=True)
pxt.create_dir('image_search_demo')
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'image\_search\_demo'.
```python theme={null}
images = pxt.create_table(
    'image_search_demo/images', {'image': pxt.Image}
)
```
  Created table 'images'.
```python theme={null}
# Insert sample images
images.insert(
    [
        {
            'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg'
        },
        {
            'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg'
        },
        {
            'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000106.jpg'
        },
        {
            'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000139.jpg'
        },
    ]
)
```
  Inserting rows into \`images\`: 4 rows \[00:00, 973.44 rows/s]
  Inserted 4 rows with 0 errors.
  4 rows inserted, 8 values computed.
### Create CLIP embedding index

Add an embedding index using CLIP for cross-modal search:

```python theme={null}
# Add CLIP embedding index (supports both image and text queries)
images.add_embedding_index(
    'image', embedding=clip.using(model_id='openai/clip-vit-base-patch32')
)
```

### Search by text description

Find images matching a text query:

```python theme={null}
# Search by text description
query = 'people eating food'
sim = images.image.similarity(string=query)
results = (
    images.order_by(sim, asc=False)
    .select(images.image, score=sim)
    .limit(2)
)
results.collect()
```
## Explanation

**Why CLIP:** CLIP (Contrastive Language-Image Pre-training) understands both images and text in the same embedding space. This enables:

* Image-to-image search (find similar photos)
* Text-to-image search (find photos matching a description)
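The shared embedding space is what makes cross-modal search possible: because CLIP maps a caption and the image it describes to nearby vectors, one distance function serves both query types. A toy illustration with made-up 2-dimensional vectors standing in for CLIP embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Made-up vectors standing in for CLIP embeddings in one shared space
text_query = [0.9, 0.1]    # "people eating food"
food_image = [0.85, 0.2]   # an image of a meal
street_image = [0.1, 0.95] # an unrelated street scene

# The food image sits closer to the text query, so it ranks first
assert cosine(text_query, food_image) > cosine(text_query, street_image)
```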
If you configure separate image and text embedding functions, **both must use the same model** for cross-modal search to work.

**New images are indexed automatically:** When you insert new images, embeddings are generated without extra code.

## See also

* [Semantic text search](/howto/cookbooks/search/search-semantic-text)
* [Vector database documentation](/platform/embedding-indexes)

# Split documents into chunks for RAG

Source: https://docs.pixeltable.com/howto/cookbooks/text/doc-chunk-for-rag

Open in Kaggle  Open in Colab  Download Notebook

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

Break PDFs and documents into searchable chunks for retrieval-augmented generation (RAG) pipelines.

## Problem

You have PDF documents or text files that you want to use for retrieval-augmented generation (RAG). Before you can search them, you need to:

1. Split documents into smaller chunks
2. Generate embeddings for each chunk
3. Store everything in a searchable index
## Solution

**What’s in this recipe:**

* Split PDFs into sentence-based chunks
* Control chunk size with token limits
* Add embeddings for semantic search

You create a view with a `document_splitter` iterator that automatically breaks documents into chunks. Then you add an embedding index for semantic search.

### Setup

```python theme={null}
%pip install -qU pixeltable sentence-transformers spacy tiktoken
!python -m spacy download en_core_web_sm -q
```

```python theme={null}
import pixeltable as pxt
from pixeltable.functions.document import document_splitter
from pixeltable.functions.huggingface import sentence_transformer
```

### Load documents

```python theme={null}
# Create a fresh directory
pxt.drop_dir('rag_demo', force=True)
pxt.create_dir('rag_demo')
```
  Created directory 'rag\_demo'.
```python theme={null}
# Create table for documents
docs = pxt.create_table('rag_demo/documents', {'document': pxt.Document})
```
  Created table 'documents'.
```python theme={null}
# Insert a sample PDF
docs.insert(
    [
        {
            'document': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Argus-Market-Digest-June-2024.pdf'
        }
    ]
)
```
  Inserting rows into \`documents\`: 1 rows \[00:00, 775.86 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 2 values computed.
### Split into chunks

Create a view that splits each document into sentences with a token limit:

```python theme={null}
# Create a view that splits documents into chunks
chunks = pxt.create_view(
    'rag_demo/chunks',
    docs,
    iterator=document_splitter(
        docs.document,
        separators='sentence,token_limit',  # Split by sentence with token limit
        limit=300,  # Max 300 tokens per chunk
    ),
)
```
  Inserting rows into \`chunks\`: 217 rows \[00:00, 42111.88 rows/s]
```python theme={null}
# View the chunks
chunks.select(chunks.text).head(5)
```
### Add semantic search

Create an embedding index on the chunks for similarity search:

```python theme={null}
# Add embedding index for semantic search
chunks.add_embedding_index(
    column='text',
    string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2'),
)
```

### Search your documents

Use similarity search to find relevant chunks:

```python theme={null}
# Search for relevant chunks
query = 'market trends'
sim = chunks.text.similarity(string=query)
results = (
    chunks.order_by(sim, asc=False)
    .select(chunks.text, score=sim)
    .limit(3)
)
results.collect()
```
## Explanation

**Separator options:** `document_splitter` supports separators such as `heading`, `paragraph`, `sentence`, `token_limit`, and `char_limit`.
You can combine separators: `separators='sentence,token_limit'`

**Chunk sizing:**

* `limit`: Maximum tokens per chunk (default: 500)
* `overlap`: Tokens to overlap between chunks (default: 0)

**New documents are processed automatically:** When you insert new documents, chunks and embeddings are generated without extra code.

## See also

* [Iterators documentation](/platform/iterators)
* [RAG demo notebook](/howto/use-cases/rag-demo)

# Extract text from PowerPoint, Word, and Excel files

Source: https://docs.pixeltable.com/howto/cookbooks/text/doc-extract-text-from-office-files

Open in Kaggle  Open in Colab  Download Notebook

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

Transform office documents into searchable, analyzable text data.

**What’s in this recipe:**

* Extract text from PPTX, DOCX, and XLSX files
* Split documents by headings, paragraphs, or custom limits
* Preserve document structure and metadata for analysis

## Problem

You have office documents—presentations, reports, spreadsheets—that contain valuable text data. You need to extract this text to analyze content, search across documents, or feed into AI models.

Manual extraction means opening each file, copying text, and losing structural information like headings and page boundaries. You need an automated way to process hundreds or thousands of office files while preserving their organization.

## Solution

You extract text from office documents using Pixeltable’s document type with Microsoft’s MarkItDown library. This converts PowerPoint, Word, and Excel files to structured text automatically.

You use `DocumentSplitter` to split documents by headings, paragraphs, or token limits. Each split creates a view where each row represents a chunk of the document with its metadata.
### Setup

```python theme={null}
%pip install -qU pixeltable 'markitdown[pptx,docx,xlsx]' mistune tiktoken
```

```python theme={null}
import pixeltable as pxt
from pixeltable.iterators.document import DocumentSplitter
```

### Load office documents

```python theme={null}
# Create a fresh directory (drop existing if present)
pxt.drop_dir('office_docs', force=True)
pxt.create_dir('office_docs')
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'office\_docs'.
```python theme={null}
# Create table for office documents
docs = pxt.create_table('office_docs/documents', {'doc': pxt.Document})
```
  Created table 'documents'.
```python theme={null}
# Sample PowerPoint from Pixeltable repo
# Replace with your own PPTX, DOCX, or XLSX files
sample_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/calpy.pptx'
docs.insert([{'doc': sample_url}])
```
  Inserting rows into \`documents\`: 1 rows \[00:00, 57.40 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 2 values computed.
### Extract full document text

You create a view with `DocumentSplitter` to extract text. Setting `separators=''` extracts the full document without splitting.

```python theme={null}
# Create a view to extract full document text
full_text = pxt.create_view(
    'office_docs/full_text',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='',  # No splitting - extract full document
    ),
)
```
  Inserting rows into \`full\_text\`: 1 rows \[00:00, 196.50 rows/s]
```python theme={null}
# Preview extracted text
full_text.select(full_text.doc, full_text.text).head(1)
```
### Split documents by headings

You split documents by headings to preserve their logical structure. Each section under a heading becomes a separate chunk.

```python theme={null}
# Create view that splits by headings
by_heading = pxt.create_view(
    'office_docs/by_heading',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='heading',
        metadata='heading',  # Preserve heading structure
    ),
)
```
  Inserting rows into \`by\_heading\`: 87 rows \[00:00, 10359.54 rows/s]
```python theme={null}
# View chunks with their headings
by_heading.select(by_heading.heading, by_heading.text).head(5)
```
### Split by token limit for AI models

You split documents by token count when feeding chunks to AI models. The `overlap` parameter ensures chunks share context at boundaries.

```python theme={null}
# Create view with token-based splitting
by_tokens = pxt.create_view(
    'office_docs/by_tokens',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='heading,token_limit',  # Split by heading first, then by tokens
        limit=512,  # Maximum tokens per chunk
        overlap=50,  # Overlap between chunks to preserve context
        metadata='heading',
    ),
)
```
  Inserting rows into \`by\_tokens\`: 2369 rows \[00:00, 9212.05 rows/s]
```python theme={null}
# Preview chunks with token limits
by_tokens.select(by_tokens.doc, by_tokens.heading, by_tokens.text).head(3)
```
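Conceptually, token-limit splitting with overlap is a sliding window over the token stream. A simplified sketch (a hypothetical helper using whitespace tokens instead of a real tokenizer; `DocumentSplitter`'s actual implementation may differ):

```python
def chunk_tokens(tokens: list[str], limit: int, overlap: int) -> list[list[str]]:
    """Slide a window of `limit` tokens, stepping back `overlap` tokens each time."""
    step = limit - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + limit])
        if start + limit >= len(tokens):
            break
    return chunks

tokens = [f'tok{i}' for i in range(10)]
chunks = chunk_tokens(tokens, limit=4, overlap=1)

# Each chunk starts with the last token of the previous chunk
print([len(c) for c in chunks])  # -> [4, 4, 4]
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighboring chunks.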
### Search across documents

You search across all document chunks using standard Pixeltable queries.

```python theme={null}
# Find chunks containing specific keywords
by_tokens.where(by_tokens.text.contains('Python')).select(
    by_tokens.doc, by_tokens.text
).head(3)
```
## Explanation

**Supported formats:**

* PowerPoint: `.pptx`, `.ppt`
* Word: `.docx`, `.doc`
* Excel: `.xlsx`, `.xls`

**Separator options:**

* `heading` - Split by document headings (preserves structure)
* `paragraph` - Split by paragraphs
* `sentence` - Split by sentences
* `token_limit` - Split by token count (requires `limit` parameter)
* `char_limit` - Split by character count (requires `limit` parameter)
* Multiple separators work together: `'heading,token_limit'` splits by heading first, then ensures no chunk exceeds token limit

**Metadata fields:**

* `heading` - Hierarchical heading structure (e.g., `{'h1': 'Introduction', 'h2': 'Overview'}`)
* `title` - Document title
* `sourceline` - Source line number (HTML and Markdown documents)

**Token overlap:** The `overlap` parameter ensures chunks share context at boundaries. This prevents sentences from being split mid-thought when feeding chunks to AI models.

## See also

* [Get fast feedback on transformations](/howto/cookbooks/core/dev-iterative-workflow)
* [Pixeltable Document API](/sdk/latest/document)

# Extract named entities from text

Source: https://docs.pixeltable.com/howto/cookbooks/text/text-extract-entities

Open in Kaggle  Open in Colab  Download Notebook

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

Identify and extract people, organizations, locations, dates, and other entities from text using LLMs.

## Problem

You have unstructured text containing important information—names, companies, dates, locations—that you need to extract and structure for analysis, search, or integration with other systems.
## Solution

**What’s in this recipe:**

* Extract entities as structured JSON
* Use OpenAI’s structured output for reliable parsing
* Access extracted entities as queryable columns

You use structured output to get entities in a consistent JSON format. The entities are stored as JSON columns that you can query and filter.

### Setup

```python theme={null}
%pip install -qU pixeltable openai
```
  Note: you may need to restart the kernel to use updated packages.
```python theme={null}
import getpass
import os

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
```

```python theme={null}
import json

import pixeltable as pxt
from pixeltable.functions.openai import chat_completions
```

```python theme={null}
# Create a fresh directory
pxt.drop_dir('entities_demo', force=True)
pxt.create_dir('entities_demo')
```
  Created directory 'entities\_demo'.
### Define entity extraction schema

```python theme={null}
# Define the JSON schema for entity extraction
entity_schema = {
    'type': 'json_schema',
    'json_schema': {
        'name': 'entities',
        'strict': True,
        'schema': {
            'type': 'object',
            'properties': {
                'people': {
                    'type': 'array',
                    'items': {'type': 'string'},
                    'description': 'Names of people mentioned',
                },
                'organizations': {
                    'type': 'array',
                    'items': {'type': 'string'},
                    'description': 'Names of companies, institutions, or groups',
                },
                'locations': {
                    'type': 'array',
                    'items': {'type': 'string'},
                    'description': 'Geographic locations (cities, countries, addresses)',
                },
                'dates': {
                    'type': 'array',
                    'items': {'type': 'string'},
                    'description': 'Dates or time references',
                },
            },
            'required': ['people', 'organizations', 'locations', 'dates'],
            'additionalProperties': False,
        },
    },
}
```

### Create extraction pipeline

```python theme={null}
# Create table for articles
articles = pxt.create_table(
    'entities_demo/articles', {'title': pxt.String, 'content': pxt.String}
)
```
  Created table 'articles'.
```python theme={null}
# Add entity extraction column
extraction_prompt = (
    'Extract all named entities from the following text:\n\n'
    + articles.content
)
articles.add_computed_column(
    extraction_response=chat_completions(
        messages=[{'role': 'user', 'content': extraction_prompt}],
        model='gpt-4o-mini',
        model_kwargs={'response_format': entity_schema},
    )
)
```
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null}
# Extract the entities JSON
articles.add_computed_column(
    entities=articles.extraction_response.choices[0].message.content
)
```
  Added 0 column values with 0 errors.
  No rows affected.
### Extract entities from text

```python theme={null}
# Insert sample articles
sample_articles = [
    {
        'title': 'Tech Acquisition',
        'content': 'Microsoft announced today that CEO Satya Nadella will lead the acquisition of a Seattle-based startup. The deal, expected to close in March 2024, is valued at $500 million.',
    },
    {
        'title': 'Sports Update',
        'content': "LeBron James led the Los Angeles Lakers to victory against the Boston Celtics on Tuesday night at Staples Center. Coach Darvin Ham praised the team's performance.",
    },
    {
        'title': 'Research Breakthrough',
        'content': 'Dr. Sarah Chen at Stanford University published groundbreaking research on renewable energy. The study, funded by the National Science Foundation, was conducted in Palo Alto, California.',
    },
]
articles.insert(sample_articles)
```
  Inserting rows into \`articles\`: 3 rows \[00:00, 404.21 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 12 values computed.
```python theme={null}
# View extracted entities
articles.select(articles.title, articles.entities).collect()
```
## Explanation

**Structured output ensures reliable extraction:** By using OpenAI’s structured output (`response_format`), the model always returns valid JSON matching the schema. No post-processing or error handling needed.

**Common entity types:** the schema above covers people, organizations, locations, and dates.
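Because `strict: true` guarantees the response matches the schema, downstream code can access every required key directly. A minimal sketch with a hypothetical content string shaped like the schema above:

```python
import json

# Hypothetical content string, shaped as the strict schema guarantees
content = (
    '{"people": ["Satya Nadella"], "organizations": ["Microsoft"], '
    '"locations": ["Seattle"], "dates": ["March 2024"]}'
)

entities = json.loads(content)

# Every required key is present, so no defensive .get() calls are needed
for key in ('people', 'organizations', 'locations', 'dates'):
    assert key in entities
print(entities['organizations'])  # -> ['Microsoft']
```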
**Customizing the schema:** Modify the `entity_schema` to extract domain-specific entities—product SKUs, legal terms, medical conditions, etc.

## See also

* [Extract structured data from images](/howto/cookbooks/images/vision-structured-output) - JSON extraction from images
* [Extract fields from JSON](/howto/cookbooks/core/workflow-json-extraction) - Parse LLM response fields

# Summarize text with LLMs

Source: https://docs.pixeltable.com/howto/cookbooks/text/text-summarize

Open in Kaggle  Open in Colab  Download Notebook

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

Generate concise summaries of long text, articles, or documents using large language models.

## Problem

You have long text content—articles, transcripts, documents—that needs to be summarized. Processing each piece manually is time-consuming and inconsistent.
## Solution

**What’s in this recipe:**

* Summarize text using OpenAI GPT models
* Customize summary style with prompts
* Process multiple documents automatically

You add a computed column that calls an LLM to generate summaries. When you insert new text, summaries are generated automatically.

### Setup

```python theme={null}
%pip install -qU pixeltable openai
```
  Note: you may need to restart the kernel to use updated packages.
```python theme={null}
import getpass
import os

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
```

```python theme={null}
import pixeltable as pxt
from pixeltable.functions import openai
```

### Load sample text

```python theme={null}
# Create a fresh directory
pxt.drop_dir('summarize_demo', force=True)
pxt.create_dir('summarize_demo')
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'summarize\_demo'.
```python theme={null}
# Create table for articles
articles = pxt.create_table(
    'summarize_demo/articles',
    {'title': pxt.String, 'content': pxt.String},
)
```
  Created table 'articles'.
```python theme={null}
# Sample articles to summarize
sample_articles = [
    {
        'title': 'The Rise of Electric Vehicles',
        'content': """Electric vehicles (EVs) have seen unprecedented growth in recent years, transforming the automotive industry. Sales increased by 60% globally in 2023, with China leading the market followed by Europe and North America. Major automakers like Tesla, BYD, and traditional manufacturers have invested billions in EV technology. Battery costs have dropped significantly, making EVs more affordable for consumers. Government incentives and stricter emissions regulations continue to drive adoption. Charging infrastructure is expanding rapidly, with new fast-charging networks being deployed across major highways. Despite challenges like range anxiety and charging times, consumer acceptance is growing steadily.""",
    },
    {
        'title': 'Advances in Renewable Energy',
        'content': """Solar and wind power capacity reached record levels in 2023, accounting for over 30% of global electricity generation. The cost of solar panels has fallen by 90% over the past decade, making renewable energy competitive with fossil fuels. Offshore wind farms are being built at scale, with turbines now reaching heights of over 250 meters. Energy storage solutions, particularly lithium-ion batteries, are addressing intermittency challenges. Countries like Denmark and Scotland have achieved periods of 100% renewable electricity. Corporate power purchase agreements are accelerating the transition, with tech giants committing to carbon-neutral operations.""",
    },
]
articles.insert(sample_articles)
```
  Inserting rows into \`articles\`: 2 rows \[00:00, 316.21 rows/s]
  Inserted 2 rows with 0 errors.
  2 rows inserted, 4 values computed.
```python theme={null}
# View articles
articles.select(articles.title, articles.content).collect()
```
### Generate summaries

Add a computed column that generates summaries using GPT:

```python theme={null}
# Create prompt template for summarization
prompt = (
    'Summarize the following article in 2-3 sentences:\n\n'
    + articles.content
)

# Add computed column for LLM response
articles.add_computed_column(
    response=openai.chat_completions(
        messages=[{'role': 'user', 'content': prompt}],
        model='gpt-4o-mini',
    )
)
```
  Added 2 column values with 0 errors.
  2 rows updated, 2 values computed.
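Conceptually, the computed column applies the same prompt template to each row's `content` value. A plain-Python sketch of that expansion (the function name `build_summary_prompt` is illustrative, not part of the Pixeltable API):

```python
# Illustrative sketch: the template the computed column applies per row.
def build_summary_prompt(content: str) -> str:
    return 'Summarize the following article in 2-3 sentences:\n\n' + content

prompt = build_summary_prompt('Electric vehicles have seen unprecedented growth...')
print(prompt.splitlines()[0])
# Summarize the following article in 2-3 sentences:
```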
```python theme={null} # Extract the summary text from the response articles.add_computed_column( summary=articles.response.choices[0].message.content ) ```
  Added 2 column values with 0 errors.
  2 rows updated, 2 values computed.
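The path `response.choices[0].message.content` mirrors the shape of an OpenAI-style chat completion response. A plain-Python sketch of the same navigation on a sample dict (the sample content is illustrative):

```python
# Illustrative OpenAI-style response shape and the path extracted above.
response = {
    'choices': [
        {'message': {'role': 'assistant', 'content': 'EVs grew 60% in 2023.'}}
    ]
}
summary = response['choices'][0]['message']['content']
print(summary)
# EVs grew 60% in 2023.
```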
```python theme={null} # View titles and summaries articles.select(articles.title, articles.summary).collect() ```
### Custom summary styles You can customize the summary format by changing the prompt: ```python theme={null} # Add bullet-point summary bullet_prompt = ( 'List the 3 key points from this article as bullet points:\n\n' + articles.content ) articles.add_computed_column( bullet_response=openai.chat_completions( messages=[{'role': 'user', 'content': bullet_prompt}], model='gpt-4o-mini', ) ) articles.add_computed_column( key_points=articles.bullet_response.choices[0].message.content ) ```
  Added 2 column values with 0 errors.
  Added 2 column values with 0 errors.
  2 rows updated, 2 values computed.
```python theme={null} # View bullet-point summaries articles.select(articles.title, articles.key_points).collect() ```
### Automatic processing New articles are automatically summarized when inserted: ```python theme={null} # Insert a new article - summaries are generated automatically articles.insert( [ { 'title': 'AI in Healthcare', 'content': """Artificial intelligence is revolutionizing healthcare diagnostics and treatment planning. Machine learning models can now detect diseases from medical images with accuracy matching or exceeding human specialists. AI-powered drug discovery is accelerating the development of new treatments. Natural language processing is being used to extract insights from clinical notes and research papers.""", } ] ) ```
  Inserting rows into \`articles\`: 1 rows \[00:00, 411.57 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 6 values computed.
```python theme={null} # View all summaries including the new article articles.select(articles.title, articles.summary).collect() ```
## Explanation **Prompt engineering for summaries:**
**Cost optimization:** * Use `gpt-4o-mini` for most summarization tasks (fast and affordable) * Use `gpt-4o` for complex documents requiring deeper understanding * Summaries are cached—you only pay once per article ## See also * [Split documents for RAG](/howto/cookbooks/text/doc-chunk-for-rag) - Process long documents * [Extract fields from JSON](/howto/cookbooks/core/workflow-json-extraction) - Parse structured LLM output * [Configure API keys](/howto/cookbooks/core/workflow-api-keys) - Set up OpenAI credentials # Translate text between languages Source: https://docs.pixeltable.com/howto/cookbooks/text/text-translate Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Automatically translate content into multiple languages using LLMs. ## Problem You have content that needs to be available in multiple languages—product descriptions, documentation, user-generated content. Manual translation is slow and expensive.
## Solution **What’s in this recipe:** * Translate text using OpenAI models * Create multiple language columns from one source * Handle batch translation efficiently You add computed columns for each target language. Translations are generated automatically when you insert new content and cached for future queries. ### Setup ```python theme={null} %pip install -qU pixeltable openai ``` ```python theme={null} import getpass import os if 'OPENAI_API_KEY' not in os.environ: os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ') ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions.openai import chat_completions ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('translate_demo', force=True) pxt.create_dir('translate_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'translate\_demo'.

### Create translation pipeline ```python theme={null} # Create table for content content = pxt.create_table( 'translate_demo/content', {'title': pxt.String, 'text_en': pxt.String} ) ```
  Created table 'content'.
```python theme={null} # Add Spanish translation column spanish_prompt = ( 'Translate the following text to Spanish. Return only the translation, no explanations:\n\n' + content.text_en ) content.add_computed_column( response_es=chat_completions( messages=[{'role': 'user', 'content': spanish_prompt}], model='gpt-4o-mini', ) ) content.add_computed_column( text_es=content.response_es.choices[0].message.content ) ```
  Added 0 column values with 0 errors.
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Add French translation column french_prompt = ( 'Translate the following text to French. Return only the translation, no explanations:\n\n' + content.text_en ) content.add_computed_column( response_fr=chat_completions( messages=[{'role': 'user', 'content': french_prompt}], model='gpt-4o-mini', ) ) content.add_computed_column( text_fr=content.response_fr.choices[0].message.content ) ```
  Added 0 column values with 0 errors.
  Added 0 column values with 0 errors.
  No rows affected.
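The Spanish and French columns above repeat one pattern: a per-language prompt template feeding a `chat_completions` call. A plain-Python sketch of that templating (the helper `translation_prompt` is illustrative, not part of the Pixeltable API):

```python
# Illustrative sketch: one prompt template applied per target language.
def translation_prompt(language: str, text: str) -> str:
    return (
        f'Translate the following text to {language}. '
        f'Return only the translation, no explanations:\n\n{text}'
    )

for lang in ['Spanish', 'French']:
    print(translation_prompt(lang, 'Welcome to our platform!').split('.')[0])
```

In Pixeltable, each entry in such a list becomes its own pair of computed columns, as shown in the preceding cells.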
### Translate content ```python theme={null} # Insert sample content sample_content = [ { 'title': 'Welcome Message', 'text_en': 'Welcome to our platform! We are excited to have you here.', }, { 'title': 'Product Description', 'text_en': 'This lightweight laptop features a 14-inch display and all-day battery life.', }, { 'title': 'Support Article', 'text_en': 'To reset your password, click the forgot password link on the login page.', }, ] content.insert(sample_content) ```
  Inserting rows into \`content\`: 3 rows \[00:00, 198.43 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 18 values computed.
```python theme={null} # View all translations content.select( content.title, content.text_en, content.text_es, content.text_fr ).collect() ```
```python theme={null} # Pretty print one example row = content.where(content.title == 'Welcome Message').collect()[0] print(row['text_es']) print(row['text_fr']) ``` ## Explanation **How it works:** Each target language is a computed column with a translation prompt. When you insert new content: 1. The English text is processed 2. Translation prompts are generated for each language 3. All translations run in parallel 4. Results are cached—no re-translation needed **Adding more languages:** ```python theme={null} # Add German translation german_prompt = 'Translate to German:\n\n' + content.text_en content.add_computed_column( response_de=chat_completions(messages=[{'role': 'user', 'content': german_prompt}], model='gpt-4o-mini') ) content.add_computed_column(text_de=content.response_de.choices[0].message.content) ``` **Cost optimization:**
## See also * [Summarize text](/howto/cookbooks/text/text-summarize) - Text summarization with LLMs * [Extract structured data](/howto/cookbooks/images/vision-structured-output) - Get JSON from LLM responses # Add text overlays to videos Source: https://docs.pixeltable.com/howto/cookbooks/video/video-add-text-overlay Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Burn text, captions, or watermarks directly into video files. ## Problem You need to add text to videos—captions, watermarks, titles, or dynamic labels. Manual video editing doesn’t scale for batch processing.
## Solution **What’s in this recipe:** * Add simple text overlays * Create styled captions with backgrounds * Position text with alignment options * Crop a rectangular region from a video Use `video.overlay_text()` to burn text into videos with full control over styling and position, and `video.crop()` to extract a rectangular region. ### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import pixeltable as pxt ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('overlay_demo', force=True) pxt.create_dir('overlay_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'overlay\_demo'.
### Load sample videos ```python theme={null} # Create a video table videos = pxt.create_table( 'overlay_demo/videos', {'video': pxt.Video, 'title': pxt.String} ) # Insert a sample video videos.insert( [ { 'video': 's3://multimedia-commons/data/videos/mp4/ffe/ffb/ffeffbef41bbc269810b2a1a888de.mp4', 'title': 'Sample Video', } ] ) ```
  Created table 'videos'.
  Inserted 1 row with 0 errors in 3.21 s (0.31 rows/s)
  1 row inserted.
### Add a simple text overlay ```python theme={null} # Add a simple watermark in the corner videos.add_computed_column( watermarked=videos.video.overlay_text( 'My Brand', font_size=24, color='white', opacity=0.7, horizontal_align='right', horizontal_margin=20, vertical_align='top', vertical_margin=20, ) ) ```
  Added 1 column value with 0 errors in 1.25 s (0.80 rows/s)
  1 row updated.
### Add YouTube-style captions ```python theme={null} # Add a caption with a semi-transparent background box videos.add_computed_column( captioned=videos.video.overlay_text( 'This is a sample caption', font_size=32, color='white', box=True, # Add background box box_color='black', box_opacity=0.8, box_border=[6, 14], # Padding: [top/bottom, left/right] horizontal_align='center', vertical_align='bottom', vertical_margin=70, # Distance from bottom ) ) ```
  Added 1 column value with 0 errors in 1.08 s (0.92 rows/s)
  1 row updated.
### Add dynamic titles from table columns ```python theme={null} # Add video title as an overlay (dynamic per video) videos.add_computed_column( titled=videos.video.overlay_text( videos.title, # Use the title column! font_size=48, color='yellow', opacity=1.0, horizontal_align='center', vertical_align='top', vertical_margin=30, ) ) ```
  Added 1 column value with 0 errors in 1.15 s (0.87 rows/s)
  1 row updated.
```python theme={null} # View all versions videos.select( videos.title, videos.video, videos.watermarked, videos.captioned, videos.titled, ).collect() ```
### Crop a region from a video Use `video.crop()` to extract a rectangular region from a video. This is useful for focusing on a specific area of interest, removing borders, or preparing clips for object-specific analysis. ```python theme={null} # Crop using xywh format (default): [x, y, width, height] videos.add_computed_column(cropped=videos.video.crop([100, 50, 320, 240])) # Crop using xyxy format (common in object detection pipelines): # videos.add_computed_column( # cropped_xyxy=videos.video.crop([100, 50, 420, 290], bbox_format='xyxy') # ) ```
  Added 1 column value with 0 errors in 0.56 s (1.78 rows/s)
  1 row updated.
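The two bounding-box conventions accepted by `crop()` above differ only in how the last two values are interpreted: `xywh` is `[x, y, width, height]`, while `xyxy` is `[left, top, right, bottom]`. A plain-Python converter (the helper name is illustrative):

```python
# Convert an xywh box to the equivalent xyxy box.
def xywh_to_xyxy(box: list[int]) -> list[int]:
    x, y, w, h = box
    return [x, y, x + w, y + h]

print(xywh_to_xyxy([100, 50, 320, 240]))
# [100, 50, 420, 290]
```

This matches the two equivalent crop calls shown above: `[100, 50, 320, 240]` in `xywh` corresponds to `[100, 50, 420, 290]` in `xyxy`.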
## Explanation **Positioning options:**
**Styling options:**
**Background box options:**
**Requirements:** * FFmpeg must be installed and in PATH ## See also * [Generate thumbnails](/howto/cookbooks/video/video-generate-thumbnails) - Create preview images * [Detect scene changes](/howto/cookbooks/video/video-scene-detection) - Find cuts and transitions # Extract frames from videos Source: https://docs.pixeltable.com/howto/cookbooks/video/video-extract-frames Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pull frames from video files at specified intervals for analysis, thumbnails, or training data. ## Problem You have video files and need to extract frames for: * Object detection on video content * Creating thumbnails or previews * Building training datasets * Scene analysis and classification
## Solution **What’s in this recipe:** * Extract frames at a fixed rate (FPS) * Extract a specific number of frames * Extract only keyframes for efficiency You create a view with a `frame_iterator` that automatically extracts frames from each video. New videos are processed without extra code. ### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions.video import frame_iterator ``` ### Load videos ```python theme={null} # Create a fresh directory pxt.drop_dir('video_demo', force=True) pxt.create_dir('video_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'video\_demo'.
```python theme={null} # Create table for videos videos = pxt.create_table('video_demo/videos', {'video': pxt.Video}) ```
  Created table 'videos'.
```python theme={null} # Insert a sample video videos.insert( [ { 'video': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/bangkok.mp4' } ] ) ```
  Inserting rows into \`videos\`: 1 rows \[00:00, 212.90 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 2 values computed.
### Extract frames at fixed rate Create a view that extracts 1 frame per second: ```python theme={null} # Extract 1 frame per second frames = pxt.create_view( 'video_demo/frames', videos, iterator=frame_iterator( videos.video, fps=1.0, # 1 frame per second ), ) ```
  Inserting rows into \`frames\`: 19 rows \[00:00, 8687.65 rows/s]
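The 19 rows above are what fixed-rate sampling produces for a clip of roughly 18.5 seconds at 1 fps. A rough plain-Python model of the sampling times (assuming a frame at t=0 and then one every 1/fps seconds; the iterator's exact behavior at the end of the clip may differ):

```python
# Rough model of fixed-rate frame sampling; not the actual iterator logic.
def sample_times(duration: float, fps: float) -> list[float]:
    return [i / fps for i in range(int(duration * fps) + 1)]

print(len(sample_times(18.5, 1.0)))
# 19
```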
```python theme={null} # View extracted frames frames.select(frames.frame, frames.pos).head(3) ```
### Extract keyframes only For faster processing, extract only keyframes (I-frames): ```python theme={null} # Extract only keyframes (much faster for long videos) keyframes = pxt.create_view( 'video_demo/keyframes', videos, iterator=frame_iterator(videos.video, keyframes_only=True), ) keyframes.select(keyframes.frame).head(3) ```
  Inserting rows into \`keyframes\`: 7 rows \[00:00, 3277.53 rows/s]
## Explanation **Extraction options:**
Only one of `fps`, `num_frames`, or `keyframes_only` can be specified. **When to use keyframes:** * Quick video scanning and thumbnails * Initial content classification * Processing very long videos **Frame metadata:** Each frame includes: * `frame`: The extracted image * `pos`: Frame position in the video * `pts`: Presentation timestamp ## See also * [Iterators documentation](/platform/iterators) * [Analyze images in batch](/howto/cookbooks/images/vision-batch-analysis) # Generate videos with AI Source: https://docs.pixeltable.com/howto/cookbooks/video/video-generate-ai Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Create videos from text prompts or animate images using Google’s Veo model. ## Problem You need to generate video content programmatically—for social media, product demos, or creative applications.
## Solution **What’s in this recipe:** * Generate videos from text prompts * Animate existing images into videos * Store prompts and generated videos together Use Google’s Veo model to generate videos. Videos are cached—regeneration only happens if the prompt changes. ### Setup ```python theme={null} %pip install -qU pixeltable google-genai ``` ```python theme={null} import getpass import os if 'GEMINI_API_KEY' not in os.environ: os.environ['GEMINI_API_KEY'] = getpass.getpass( 'Google AI Studio API Key: ' ) ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions import gemini ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('video_gen_demo', force=True) pxt.create_dir('video_gen_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'video\_gen\_demo'.
### Generate videos from text prompts ```python theme={null} # Create a table for text-to-video generation videos = pxt.create_table( 'video_gen_demo/text_to_video', {'prompt': pxt.String} ) # Add computed column that generates videos videos.add_computed_column( video=gemini.generate_videos( videos.prompt, model='veo-2.0-generate-001' ) ) ```
  Created table 'text\_to\_video'.
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Generate a video from a text prompt videos.insert( [ { 'prompt': 'A serene mountain lake at sunrise with mist rising from the water' } ] ) # View the result videos.select(videos.prompt, videos.video).collect() ```
  Inserting rows into \`text\_to\_video\`: 1 rows \[00:00, 190.68 rows/s]
  Inserted 1 row with 0 errors.
### Animate images into videos ```python theme={null} # Create a table for image-to-video generation animated = pxt.create_table( 'video_gen_demo/image_to_video', {'image': pxt.Image, 'description': pxt.String}, ) # Add computed column that animates images animated.add_computed_column( video=gemini.generate_videos( image=animated.image, model='veo-2.0-generate-001' ) ) ```
  Created table 'image\_to\_video'.
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Animate an image base_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images' animated.insert( [ { 'image': f'{base_url}/000000000030.jpg', 'description': 'Beach scene', } ] ) # View the animated result animated.select(animated.image, animated.video).collect() ```
  Inserting rows into \`image\_to\_video\`: 1 rows \[00:00, 291.88 rows/s]
  Inserted 1 row with 0 errors.
## Explanation **Generation modes:**
**Veo model options:**
**Tips:** * Prompts work best when descriptive and specific * Generated videos are cached - same prompt returns cached result * Image-to-video preserves the composition of the input image * New rows automatically generate videos on insert **Requirements:** * Google AI Studio API key (set `GEMINI_API_KEY`) * `pip install google-genai` ## See also * [Extract frames from videos](/howto/cookbooks/video/video-extract-frames) - Pull frames from generated videos * [Add text overlays](/howto/cookbooks/video/video-add-text-overlay) - Add captions to videos # Generate thumbnails from videos Source: https://docs.pixeltable.com/howto/cookbooks/video/video-generate-thumbnails Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Automatically create preview thumbnails from video files at specific timestamps or intervals. ## Problem You have video files that need preview thumbnails for galleries, search results, or video players. Manually extracting frames doesn’t scale.
## Solution **What’s in this recipe:** * Extract thumbnail at a specific timestamp * Generate multiple thumbnails per video * Resize thumbnails to standard dimensions You add computed columns that extract frames from videos. Thumbnails are generated automatically when you insert new videos. ### Setup ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import pixeltable as pxt import pixeltable.functions as pxtf ``` ### Load videos ```python theme={null} # Create a fresh directory pxt.drop_dir('thumbnail_demo', force=True) pxt.create_dir('thumbnail_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'thumbnail\_demo'.
```python theme={null} # Create table for videos videos = pxt.create_table('thumbnail_demo/videos', {'video': pxt.Video}) ```
  Created table 'videos'.
```python theme={null} # Insert sample videos from public S3 bucket s3_prefix = 's3://multimedia-commons/' video_paths = [ 'data/videos/mp4/ffe/ffb/ffeffbef41bbc269810b2a1a888de.mp4', 'data/videos/mp4/ffe/feb/ffefebb41485539f964760e6115fbc44.mp4', ] videos.insert([{'video': s3_prefix + path} for path in video_paths]) ```
  Inserting rows into \`videos\`: 2 rows \[00:00, 382.20 rows/s]
  Inserted 2 rows with 0 errors.
  2 rows inserted, 4 values computed.
```python theme={null} # View videos videos.collect() ```
### Extract thumbnail at timestamp Extract a single frame at a specific time (e.g., 1 second into the video): ```python theme={null} # Extract frame at 1 second as thumbnail videos.add_computed_column( thumbnail=pxtf.video.extract_frame(videos.video, timestamp=1.0) ) ```
  Added 2 column values with 0 errors.
  2 rows updated, 2 values computed.
```python theme={null} # View thumbnails videos.select(videos.video, videos.thumbnail).collect() ```
### Resize thumbnails Create standard-sized thumbnails for consistent display: ```python theme={null} # Resize thumbnail to 320x180 (16:9 aspect ratio) videos.add_computed_column( thumbnail_small=videos.thumbnail.resize((320, 180)) ) ```
  Added 2 column values with 0 errors.
  2 rows updated, 2 values computed.
```python theme={null} # View resized thumbnails with dimensions videos.select( videos.thumbnail_small, videos.thumbnail_small.width, videos.thumbnail_small.height, ).collect() ```
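The 320x180 target above is just a 16:9 width/height pair. A small helper (illustrative, not part of the Pixeltable API) for deriving the height from a target width and aspect ratio:

```python
# Derive a thumbnail size from a target width and aspect ratio.
def thumb_size(width: int, aspect: float = 16 / 9) -> tuple[int, int]:
    return (width, round(width / aspect))

print(thumb_size(320))
# (320, 180)
```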
### Multiple thumbnails with `frame_iterator` For preview strips or timeline thumbnails, use `frame_iterator` to extract multiple frames: ```python theme={null} # Create a view with frames extracted at 0.5 fps (one frame every 2 seconds) frames = pxt.create_view( 'thumbnail_demo/frames', videos, iterator=pxtf.video.frame_iterator(videos.video, fps=0.5), ) ```
  Inserting rows into \`frames\`: 17 rows \[00:00, 9736.88 rows/s]
```python theme={null} # View extracted frames (multiple per video) frames.select(frames.frame, frames.pos).head(10) ```
## Explanation **Thumbnail extraction methods:**
**Common thumbnail sizes:**
## See also * [Extract frames from videos](/howto/cookbooks/video/video-extract-frames) - Detailed frame extraction guide * [Load media from S3](/howto/cookbooks/data/data-import-s3) - Import videos from cloud storage * [Transform images with PIL](/howto/cookbooks/images/img-pil-transforms) - Resize and crop images # Detect scene changes in videos Source: https://docs.pixeltable.com/howto/cookbooks/video/video-scene-detection Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Automatically find scene cuts, transitions, and fades in video files. ## Problem You have video files and need to identify scene boundaries for:
## Solution **What’s in this recipe:** * Detect hard cuts with `scene_detect_content()` * Find fade transitions with `scene_detect_threshold()` * Use adaptive detection with `scene_detect_adaptive()` Three built-in detection methods handle different transition types using PySceneDetect. ### Setup ```python theme={null} %pip install -qU pixeltable scenedetect opencv-python ``` ```python theme={null} import pixeltable as pxt ``` ```python theme={null} # Create a fresh directory pxt.drop_dir('scene_demo', force=True) pxt.create_dir('scene_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'scene\_demo'.
### Load sample videos ```python theme={null} # Create a video table videos = pxt.create_table( 'scene_demo/videos', {'video': pxt.Video, 'title': pxt.String} ) # Insert sample videos from S3 videos.insert( [ { 'video': 's3://multimedia-commons/data/videos/mp4/ffe/ffb/ffeffbef41bbc269810b2a1a888de.mp4', 'title': 'Sample video 1', } ] ) ```
  Created table 'videos'.
  Inserting rows into \`videos\`: 1 rows \[00:00, 200.53 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 3 values computed.
### Detect scenes with content-based detection ```python theme={null} # Detect scenes using content-based detection (best for hard cuts) videos.add_computed_column( scenes_content=videos.video.scene_detect_content( threshold=27.0, # Lower = more sensitive min_scene_len=15, # Minimum frames between cuts ) ) # View detected scenes videos.select(videos.title, videos.scenes_content).collect() ```
  Added 1 column value with 0 errors.
### Detect fade transitions ```python theme={null} # Detect fade-to-black/white transitions videos.add_computed_column( scenes_fade=videos.video.scene_detect_threshold( threshold=12.0, # Brightness threshold for fades min_scene_len=15, ) ) # View fade-detected scenes videos.select(videos.title, videos.scenes_fade).collect() ```
  Added 1 column value with 0 errors.
### Adaptive detection for complex videos ```python theme={null} # Adaptive detection adjusts to video content dynamically videos.add_computed_column( scenes_adaptive=videos.video.scene_detect_adaptive( adaptive_threshold=3.0, # Lower = more scenes detected min_scene_len=15, fps=2.0, # Analyze at 2 FPS for speed ) ) # View adaptively-detected scenes videos.select(videos.title, videos.scenes_adaptive).collect() ```
  Added 1 column value with 0 errors.
## Explanation **Detection methods:**
**Output format:** Each method returns a list of scene dictionaries: ```python theme={null} { 'start_time': 5.2, # Scene start in seconds 'start_pts': 156, # Presentation timestamp 'duration': 3.8 # Scene duration in seconds } ``` **Tuning tips:**
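Downstream code can work with the scene list as ordinary Python. A sketch using sample data in the output format shown above (the values are illustrative); each scene's end time is `start_time + duration`:

```python
# Sample scene list in the documented output format (illustrative values).
scenes = [
    {'start_time': 0.0, 'start_pts': 0, 'duration': 5.2},
    {'start_time': 5.2, 'start_pts': 156, 'duration': 3.8},
]
# Compute where each scene ends, e.g. to extract a frame at each boundary.
cut_points = [s['start_time'] + s['duration'] for s in scenes]
print(len(scenes), [round(t, 1) for t in cut_points])
```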
## See also * [Extract frames from videos](/howto/cookbooks/video/video-extract-frames) - Get frames at scene boundaries * [Generate thumbnails](/howto/cookbooks/video/video-generate-thumbnails) - Create preview images # Infrastructure Setup Source: https://docs.pixeltable.com/howto/deployment/infrastructure Code organization and storage architecture for Pixeltable deployments ## Code Organization Both deployment strategies require separating schema definition from application code. **Schema Definition (`setup_pixeltable.py`):** * Defines directories, tables, views, computed columns, indexes * Acts as Infrastructure-as-Code for Pixeltable entities * Version controlled in Git * Executed during initial deployment and schema migrations **Application Code (`app.py`, `endpoints.py`, `functions.py`):** * Assumes Pixeltable infrastructure exists * Interacts with tables via `pxt.get_table()` and `@pxt.udf` * Handles missing tables/views gracefully **Configuration (`config.py`):** * Externalizes model IDs, API keys, thresholds, connection strings * Uses environment variables (`.env` + `python-dotenv`) or secrets management * Never hardcodes secrets ```python theme={null} # setup_pixeltable.py import pixeltable as pxt import config pxt.create_dir(config.APP_NAMESPACE, if_exists='ignore') pxt.create_table( f'{config.APP_NAMESPACE}/documents', { 'document': pxt.Document, 'metadata': pxt.Json, 'timestamp': pxt.Timestamp }, if_exists='ignore' # Idempotent: safe for repeated execution ) # --- # app.py import pixeltable as pxt import config docs_table = pxt.get_table(f'{config.APP_NAMESPACE}/documents') if docs_table is None: raise RuntimeError( f"Table '{config.APP_NAMESPACE}/documents' not found. " "Run setup_pixeltable.py first." 
) ``` ## Project Structure ``` project/ ├── config.py # Environment variables, model IDs, API keys ├── functions.py # Custom UDFs (imported as modules) ├── setup_pixeltable.py # Schema definition (tables, views, indexes) ├── app.py # Application endpoints (FastAPI/Flask) ├── requirements.txt # Pinned dependencies └── .env # Secrets (gitignored) ``` ```python theme={null} import os ENV = os.getenv('ENVIRONMENT', 'dev') APP_NAMESPACE = f'{ENV}_myapp' # Model Configuration EMBEDDING_MODEL = os.getenv('EMBEDDING_MODEL', 'intfloat/e5-large-v2') OPENAI_MODEL = os.getenv('OPENAI_MODEL', 'gpt-4o-mini') # Storage MEDIA_STORAGE_BUCKET = os.getenv('MEDIA_STORAGE_BUCKET') # Prompts RAG_SYSTEM_PROMPT = """You are a helpful assistant. Use the provided context to answer questions.""" ``` ```python theme={null} import pixeltable as pxt @pxt.udf def format_prompt(context: list, question: str) -> str: """Format RAG prompt with context.""" context_str = "\n".join([doc['text'] for doc in context]) return f"Context:\n{context_str}\n\nQuestion: {question}" @pxt.udf(resource_pool='request-rate:my_service') async def call_custom_model(prompt: str) -> dict: """Call self-hosted model endpoint.""" # Your custom logic here return {"response": "..."} ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions.huggingface import sentence_transformer import config from functions import format_prompt # Import module UDFs # Create namespace pxt.create_dir(config.APP_NAMESPACE, if_exists='ignore') # Define base table docs = pxt.create_table( f'{config.APP_NAMESPACE}/documents', {'document': pxt.Document, 'metadata': pxt.Json, 'timestamp': pxt.Timestamp}, if_exists='ignore' ) # Add computed columns docs.add_computed_column( embedding=sentence_transformer(docs.document, model_id=config.EMBEDDING_MODEL), if_exists='ignore' ) # Add embedding index for similarity search docs.add_embedding_index('embedding', metric='cosine', if_not_exists=True) # Define retrieval query function 
@pxt.query def search_documents(query_text: str, limit: int = 5): """RAG retrieval query.""" sim = docs.embedding.similarity(string=query_text) return docs.order_by(sim, asc=False).limit(limit).select(docs.document, sim) ``` ```python theme={null} from pydantic import BaseModel from fastapi import FastAPI import pixeltable as pxt from setup_pixeltable import search_documents import config app = FastAPI() docs_table = pxt.get_table(f'{config.APP_NAMESPACE}/documents') class SearchResult(BaseModel): document: str sim: float @app.get("/search") def search(query: str, limit: int = 5) -> list[SearchResult]: results = search_documents(query, limit).collect() return list(results.to_pydantic(SearchResult)) ``` **Key Principles:** * **Module UDFs** (`functions.py`): Update when code changes; improve testability. [Learn more](/platform/udfs-in-pixeltable) * **Retrieval Queries** (`@pxt.query`): Encapsulate complex retrieval logic as reusable functions. * **Idempotency:** Use `if_exists='ignore'` to make `setup_pixeltable.py` safely re-runnable. See this structure in action — a ready-to-clone FastAPI + React app with `setup_pixeltable.py`, `config.py`, `functions.py`, and endpoint routers already wired up. ## Storage Architecture Pixeltable is an OLTP database built on embedded PostgreSQL. It uses multiple storage mechanisms: ```mermaid theme={null} flowchart LR subgraph Home[~/.pixeltable/] direction TB PG[(pgdata
PostgreSQL)] Media[media
Generated Files] Cache[file_cache
LRU Cache] Tmp[tmp
Temporary] end Cloud[Cloud Storage
S3/GCS] Media -.->|Optional| Cloud Cache <-->|Downloads| Cloud ``` **Important Concept:** Pixeltable directories (`pxt.create_dir`) are logical namespaces in the catalog, NOT filesystem directories. **How Media is Stored:** * PostgreSQL stores only file paths/URLs, never raw media data. * Inserted local files: path stored, original file remains in place. * Inserted URLs: URL stored, file downloaded to File Cache on first access. * Generated media (computed columns): saved to Media Store (default: local, configurable to S3/GCS/Azure per-column). * File Cache size: configure via `file_cache_size_g` in `~/.pixeltable/config.toml`. [See configuration guide](/platform/configuration) For large datasets with remote media, consider increasing file cache size to avoid repeated downloads (default is 20% of available disk): ```toml theme={null} # ~/.pixeltable/config.toml file_cache_size_g = 50 # 50 GB cache ``` ### References, Not Copies Unlike vector databases that require ingesting data into their own storage format, Pixeltable stores **references** to external files. Your original media stays in S3/GCS/Azure; only computed results (embeddings, metadata, generated media) are stored locally or in configured cloud buckets. ```mermaid theme={null} flowchart LR S3[S3 / GCS / Azure] -. reference .-> PXT[Pixeltable] PXT --> Meta[Computed Results] PXT -. lazy load .-> S3 ``` This means: * **No data duplication** — you don't pay for storage twice. * **Schema changes don't require re-upload** — add a column, not a migration script. * **Works with existing storage** — point Pixeltable at your current buckets. **Deployment-Specific Storage Patterns:** *Approach 1 (Orchestration Layer):* * Pixeltable storage can be ephemeral (re-computable). * Processing results exported to external RDBMS and blob storage. * Reference input media from S3/GCS/Azure URIs. *Approach 2 (Full Backend):* * Pixeltable IS the RDBMS (embedded PostgreSQL, not replaceable). 
* Requires persistent volume at `~/.pixeltable` (pgdata, media, file\_cache). * Media Store configurable to S3/GCS/Azure buckets for generated files. ## Dependency Management **Virtual Environments:** Use `venv`, `conda`, or `uv` to isolate dependencies. **Requirements:** ```txt theme={null} # requirements.txt pixeltable==0.4.6 fastapi==0.115.0 uvicorn[standard]==0.32.0 pydantic==2.9.0 python-dotenv==1.0.1 sentence-transformers==3.3.0 # If using embedding indexes ``` * Pin versions: `package==X.Y.Z` * Include integration packages (e.g., `openai`, `sentence-transformers`) * Test updates in staging before production ## Data Interoperability Pixeltable integrates with existing data pipelines via import/export capabilities. See the [Import/Export SDK reference](/sdk/latest/io) for full details. **Import:** * CSV, Excel, JSON: [`pxt.io.import_csv()`](/sdk/latest/io#func-import_csv), [`pxt.io.import_excel()`](/sdk/latest/io#func-import_excel), [`pxt.io.import_json()`](/sdk/latest/io#func-import_json) * Parquet: [`pxt.io.import_parquet()`](/sdk/latest/io#func-import_parquet) * Pandas DataFrames: [`table.insert(df)`](/sdk/latest/table#method-insert) or [`pxt.create_table(source=df)`](/sdk/latest/pixeltable#func-create_table) * Hugging Face Datasets: [`pxt.io.import_huggingface_dataset()`](/sdk/latest/io#func-import_huggingface_dataset) **Export:** * Parquet: [`pxt.io.export_parquet(table, path)`](/sdk/latest/io#func-export_parquet) for data warehousing * LanceDB: [`pxt.io.export_lancedb(table, db_uri, table_name)`](/sdk/latest/io#func-export_lancedb) for vector databases * PyTorch: [`table.to_pytorch_dataset()`](/sdk/latest/query#method-to_pytorch_dataset) for ML training pipelines * COCO: [`table.to_coco_dataset()`](/sdk/latest/query#method-to_coco_dataset) for computer vision * Pandas: [`table.collect().to_pandas()`](/sdk/latest/query#method-collect) for analysis ```python theme={null} # Export query results to Parquet import pixeltable as pxt docs_table = 
pxt.get_table('myapp/documents') results = docs_table.where(docs_table.timestamp > '2024-01-01') pxt.io.export_parquet(results, '/data/exports/recent_docs.parquet') ``` # Monitoring & Performance Source: https://docs.pixeltable.com/howto/deployment/monitoring Logging, resource monitoring, optimization, and rate limiting ## Logging * Implement Python logging in UDFs and application endpoints * Track execution time, errors, API call latency * Use structured logging (JSON) for log aggregation ```python theme={null} import logging import time import pixeltable as pxt logger = logging.getLogger(__name__) @pxt.udf def process_video(video: pxt.Video) -> pxt.Json: start = time.time() try: # Your processing logic here result = {'processed': True} logger.info(f"Processed in {time.time() - start:.2f}s") return result except Exception as e: logger.error(f"Processing failed: {e}") raise ``` ## Resource Monitoring * Monitor CPU, RAM, Disk I/O, Network on Pixeltable host * Track UDF execution time and model inference latency * Alert on resource exhaustion **Key Metrics to Track:** | Metric | What to Watch | | -------- | ------------------------------------- | | CPU | Sustained high usage during inference | | Memory | Growth over time (potential leaks) | | Disk I/O | Bottlenecks during media processing | | Network | API call latency to external services | ## Optimization ### Batch Operations Use batch processing for better throughput: ```python theme={null} # Batch UDF execution for GPU models @pxt.udf(batch_size=32) def embed_batch(texts: pxt.Batch[str]) -> pxt.Batch[list[float]]: # Process multiple items at once return model.encode(texts) # Batch inserts (more efficient than individual inserts) table.insert([row1, row2, row3, ...]) ``` ### Performance Tips * **Batch Operations:** Use `@pxt.udf(batch_size=32)` for GPU model inference * **Batch Inserts:** Insert multiple rows at once: `table.insert([row1, row2, ...])` * **Profile UDFs:** Add execution time logging to identify 
bottlenecks * **Embedding Indexes:** Use pgvector for efficient similarity search ## Rate Limiting ### Built-In Provider Limits Automatic rate limiting for OpenAI, Anthropic, Gemini, etc. is configured per-model in `config.toml`: ```toml theme={null} # ~/.pixeltable/config.toml [openai] requests_per_minute = 500 tokens_per_minute = 90000 ``` ### Custom API Rate Limiting Use `resource_pool` to throttle calls to self-hosted models or custom endpoints: ```python theme={null} # Default: 600 requests per minute @pxt.udf(resource_pool='request-rate:my_service') async def call_custom_api(prompt: str) -> dict: # Your logic to call custom endpoint return await custom_api_call(prompt) # Example: Custom rate-limited UDF for self-hosted model @pxt.udf(resource_pool='request-rate:my_ray_cluster') async def call_ray_model(prompt: str, model: str) -> dict: # Your logic to call FastAPI + Ray cluster return await custom_api_call(prompt, model) ``` ## Advanced Features Build complex agent workflows as computed columns with tool calling, MCP integration, and persistent state. Publish and replicate tables across Pixeltable instances for team collaboration. Create immutable point-in-time copies for reproducible ML experiments. Sync tables with annotation projects for human-in-the-loop workflows. 
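The Logging section above recommends structured (JSON) logs for aggregation. A minimal sketch of a JSON formatter for Python's standard `logging` module (the class and field names are illustrative; libraries such as `python-json-logger` offer a fuller implementation):

```python theme={null}
import json
import logging

# Illustrative JSON formatter: emits one JSON object per log line
class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger('pixeltable.app')
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info('processed video in %.2fs', 1.23)
```

Log aggregators can then parse each line as a JSON object instead of applying regexes to free-form text.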
# Production Operations Source: https://docs.pixeltable.com/howto/deployment/operations Concurrency, error handling, schema evolution, and deployment patterns ## Concurrent Access & Scaling | Aspect | Details | | ----------------- | ---------------------------------------------------------------------------------- | | **Thread Safety** | Each thread gets its own database connection and transaction context automatically | | **Locking** | Automatic table-level locking for schema changes | | **Isolation** | PostgreSQL `SERIALIZABLE` isolation prevents data race conditions | | **Retries** | Built-in retry logic handles transient serialization failures | | Scaling Dimension | Current Approach | Limitation | | --------------------- | ----------------------------------------------- | ---------------------------------------- | | **Metadata Storage** | Single embedded PostgreSQL instance | Vertical scaling (larger EC2/VM) | | **Compute** | Multiple API workers connected to same instance | Shared access to storage volume required | | **High Availability** | Single attached storage volume | Failover requires volume detach/reattach | Multi-node HA and horizontal scaling planned for Pixeltable Cloud (2026). ## Web Framework Concurrency Pixeltable is thread-safe and works with FastAPI, Flask, Django, and other web frameworks out of the box. The key rule: **use sync (`def`) endpoint handlers**, not `async def`. ### Why Sync Endpoints FastAPI (and Starlette) dispatches sync (`def`) handlers to a thread pool. Each concurrent request gets its own thread, and Pixeltable automatically creates an isolated database connection per thread. This gives you true parallel request handling with no extra configuration. 
```python theme={null} from pydantic import BaseModel from fastapi import FastAPI import pixeltable as pxt app = FastAPI() class SearchResult(BaseModel): text: str score: float @app.post("/ingest") def ingest(text: str): t = pxt.get_table('myapp/documents') status = t.insert([{'text': text}]) return {'inserted': status.num_rows} @app.get("/search") def search(query: str, limit: int = 10) -> list[SearchResult]: t = pxt.get_table('myapp/documents') sim = t.text.similarity(string=query) results = ( t.order_by(sim, asc=False) .limit(limit) .select(t.text, score=sim) .collect() ) return list(results.to_pydantic(SearchResult)) ``` **Do not use `async def` for endpoints that call Pixeltable.** Pixeltable's API is synchronous. Inside an `async def` handler, Pixeltable calls block the event loop, serializing all requests and starving other coroutines. With `def` handlers, FastAPI's thread pool handles concurrency for you. ### Returning Query Results `table.select(...).collect()` returns a `ResultSet` object, which Pydantic cannot serialize directly. You have two options: **Option 1: `to_pydantic()` (recommended for FastAPI)** Define a Pydantic model and let Pixeltable validate and convert each row. FastAPI serializes these natively. ```python theme={null} class Item(BaseModel): name: str score: float @app.get("/rows") def get_rows() -> list[Item]: t = pxt.get_table('myapp/items') return list(t.select(t.name, t.score).collect().to_pydantic(Item)) ``` **Option 2: `to_pandas()` + `to_dict()`** Convert via pandas when you don't need a Pydantic model. ```python theme={null} @app.get("/rows") def get_rows(): t = pxt.get_table('myapp/items') df = t.select(t.name, t.score).collect().to_pandas() return {'rows': df.to_dict(orient='records')} ``` ### uvloop Compatibility Pixeltable is compatible with [uvloop](https://github.com/MagicStack/uvloop), the high-performance event loop used by default in many production deployments. 
No special configuration is needed — sync endpoints work identically whether the server uses the default asyncio loop or uvloop. ```bash theme={null} # uvicorn with uvloop (the default when uvloop is installed) uvicorn app:app --host 0.0.0.0 --port 8000 --workers 1 ``` ## GPU Acceleration * **Automatic GPU Detection:** Pixeltable uses CUDA GPUs for local models (Hugging Face, Ollama) when available. * **CPU Fallback:** Models run on CPU if no GPU detected (functional but slower). * **Configuration:** Control via `CUDA_VISIBLE_DEVICES` environment variable. ## Error Handling | Error Type | Mode | Behavior | | -------------------------- | --------------------------------------- | ------------------------------------------------------- | | **Computed Column Errors** | `on_error='abort'` (default) | Fails entire operation if any row errors | | | `on_error='ignore'` | Continues processing; stores `None` with error metadata | | **Media Validation** | `media_validation='on_write'` (default) | Validates media during insert (catches errors early) | | | `media_validation='on_read'` | Defers validation until media accessed (faster inserts) | Access error details via `table.column.errortype` and `table.column.errormsg`. ```python theme={null} # Example: Graceful error handling in production table.add_computed_column( analysis=llm_analyze(table.document), on_error='ignore' # Continue processing despite individual failures ) # Query for errors errors = table.where(table.analysis.errortype != None).collect() ``` ## Testing Transformations Before Deployment When you add a computed column, Pixeltable executes it immediately for all existing rows. For expensive operations (LLM calls, model inference), validate your logic on a sample first using `select()`; nothing is stored until you commit with `add_computed_column()`. ```python theme={null} # 1. 
Test transformation on sample rows (nothing stored) table.select( table.text, summary=summarize_with_llm(table.text) ).head(3) # Only processes 3 rows # 2. Once satisfied, persist to table (processes all rows) table.add_computed_column(summary=summarize_with_llm(table.text)) ``` This "iterate-then-add" workflow lets you catch errors early without wasting API calls or compute on your full dataset. **Pro tip:** Save expressions as variables to guarantee identical logic in both steps: ```python theme={null} summary_expr = summarize_with_llm(table.text) table.select(table.text, summary=summary_expr).head(3) # Test table.add_computed_column(summary=summary_expr) # Commit ``` Step-by-step guide with examples for built-in functions, expressions, and custom UDFs ## Schema Evolution | Operation Type | Examples | Impact | | --------------- | -------------------------------------------------------------------------- | ------------------------------- | | **Safe** | Add columns, Add computed columns, Add indexes | Incremental computation only | | **Destructive** | Modify computed columns (`if_exists='replace'`), Drop columns/tables/views | Full recomputation or data loss | **Production Safety:** ```python theme={null} # Use if_exists='ignore' for idempotent schema migrations import pixeltable as pxt import config docs_table = pxt.get_table(f'{config.APP_NAMESPACE}/documents') docs_table.add_computed_column( embedding=embed_model(docs_table.document), if_exists='ignore' # No-op if column exists ) ``` * Version control `setup_pixeltable.py` like database migration scripts. * Rollback via `table.revert()` (single operation) or Git revert (complex changes). ### Updating Models The most common schema evolution is switching an embedding or LLM model. In a traditional stack this requires a migration script, a compute cluster, reprocessing every row, and a maintenance window. In Pixeltable it's one line — the old column keeps working while the new one backfills. 
**Traditional approach:** ```python theme={null} # 1. Write migration script # 2. Spin up compute to re-embed all rows (hours of downtime) # 3. Swap the column in application code # 4. Deploy during maintenance window # 5. Monitor for consistency issues data = db.query("SELECT id, content FROM documents") for row in data: new_vec = new_model.encode(row["content"]) db.execute("UPDATE documents SET embedding = %s WHERE id = %s", (new_vec, row["id"])) ``` **Pixeltable approach:** ```python theme={null} # Add a new computed column. Old column still serves queries — zero downtime. docs.add_computed_column( embedding_v2=sentence_transformer(docs.text, model_id='intfloat/e5-large-v2'), if_exists='ignore' ) # Pixeltable backfills in batches, rate-limited, with automatic retries. # Switch your queries to embedding_v2 when ready. ``` Because both columns coexist, you can A/B test retrieval quality before cutting over — no rollback plan needed. ## Deployment Patterns **Web Applications:** * Execute `setup_pixeltable.py` during deployment initialization * Web server processes connect to Pixeltable instance * Pixeltable uses connection pooling internally * Use sync (`def`) endpoint handlers for concurrent request support Clone a working FastAPI + React app with multimodal upload, search, and agent endpoints already configured. **Batch Processing:** * Schedule via `cron`, Airflow, AWS EventBridge, GCP Cloud Scheduler * Isolate batch workloads from real-time serving (separate containers/instances) * Use Pixeltable's incremental computation to process only new data **Containers:** * Docker provides reproducible builds across environments * **Full Backend:** Mount persistent volume at `~/.pixeltable` * **Kubernetes:** Use `ReadWriteOnce` PVC (single-pod write access) * Docker Compose or Kubernetes for multi-container deployments ```dockerfile theme={null} # Dockerfile for Pixeltable application FROM python:3.11-slim WORKDIR /app COPY requirements.txt . 
RUN pip install --no-cache-dir -r requirements.txt COPY . . # Initialize schema and start application CMD python setup_pixeltable.py && uvicorn app:app --host 0.0.0.0 ``` ## Environment Management ### Multi-Tenancy and Isolation | Isolation Type | Implementation | Use Case | Overhead | | -------------- | ------------------------------------------------------------------------------------------ | ------------------------------------------------ | -------- | | **Logical** | Single Pixeltable instance with directory namespaces (`pxt.create_dir(f"user_{user_id}")`) | Dev/staging environments, simple multi-user apps | Low | | **Physical** | Separate container instances per tenant | SaaS with strict data isolation | High | **Logical Isolation Example:** ```python theme={null} # Per-user isolation via namespaces pxt.create_dir(f"user_{user_id}", if_exists='ignore') user_table = pxt.create_table(f"user_{user_id}/chat_history", schema={...}) ``` ### High Availability Constraints | Configuration | Status | Details | | ------------------------------------------- | --------------- | ----------------------------------------------------------------------------------------------------------------------- | | **Single Pod + ReadWriteOnce PVC** | ✅ Supported | One active pod writes to dedicated volume. Failover requires volume detach/reattach. | | **Multiple Pods + Shared Volume (NFS/EFS)** | ❌ Not Supported | **Will cause database corruption.** Do not mount same `pgdata` to multiple pods. | | **Multi-Node HA** | 🔜 Coming 2026 | Available in Pixeltable Cloud (serverless scaling, API endpoints). [Join waitlist](https://www.pixeltable.com/waitlist) | **Single-Writer Limitation:** Pixeltable's storage layer uses an embedded PostgreSQL instance. **Only one process can write to `~/.pixeltable/pgdata` at a time**. 
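For the per-user namespaces shown under Logical Isolation above, raw tenant identifiers usually need sanitizing before they become Pixeltable path components. A minimal sketch, assuming alphanumeric-plus-underscore names (the helper name and sanitization rule are illustrative, not part of Pixeltable):

```python theme={null}
import re

# Hypothetical helper: build a per-tenant table path for the pattern
# pxt.create_table(f"user_{user_id}/chat_history", ...).
# The sanitization rule here is illustrative, not a Pixeltable API.
def tenant_table_path(user_id: str, table: str = "chat_history") -> str:
    safe = re.sub(r"\W", "_", user_id).lower()
    return f"user_{safe}/{table}"

print(tenant_table_path("Alice@example.com"))
# e.g. user_alice_example_com/chat_history
```

Centralizing path construction like this keeps tenant names valid and makes it harder to accidentally read or write across namespaces.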
## Troubleshooting ### Reset Database (Development Only) To completely reset Pixeltable's local state during development: ```bash theme={null} # Stop all Pixeltable processes first, then: rm -rf ~/.pixeltable/pgdata ~/.pixeltable/media ~/.pixeltable/file_cache ``` **This deletes all data.** Only use in development. For production, use backups and `table.revert()` or snapshots instead. ### Common Issues | Symptom | Cause | Solution | | ---------------------------- | -------------------------- | ---------------------------------------------------------------------------- | | "Cannot connect to database" | Stale lock file | Remove `~/.pixeltable/pgdata/postmaster.pid` if no process is running | | Slow first query | File cache miss | Files download on first access; subsequent queries are fast | | "Table not found" | Wrong namespace | Check `pxt.list_tables()` and verify `config.APP_NAMESPACE` | | OOM on large media | Full file loaded to memory | Use iterators (`FrameIterator`, `DocumentSplitter`) to process incrementally | ### Environment Separation Use environment-specific namespaces to manage dev/staging/prod configurations: ```python theme={null} # config.py import os ENV = os.getenv('ENVIRONMENT', 'dev') APP_NAMESPACE = f'{ENV}_myapp' # Creates: dev_myapp, staging_myapp, prod_myapp # Model and API configuration EMBEDDING_MODEL = os.getenv('EMBEDDING_MODEL', 'intfloat/e5-large-v2') OPENAI_MODEL = os.getenv('OPENAI_MODEL', 'gpt-4o-mini') # Optional: Cloud storage for generated media MEDIA_STORAGE_BUCKET = os.getenv('MEDIA_STORAGE_BUCKET') ``` ## Testing **Staging Environment:** * Mirror production configuration. * Test schema changes, UDF updates, application code changes. * Use representative data (anonymized or synthetic). 
```python theme={null} # Test environment with isolated namespace import pixeltable as pxt TEST_NS = 'test_myapp' pxt.create_dir(TEST_NS, if_exists='replace') # Run setup targeting test namespace # Execute tests # pxt.drop_dir(TEST_NS, force=True) # Cleanup ``` # Deployment Overview Source: https://docs.pixeltable.com/howto/deployment/overview Choose the right deployment strategy for your Pixeltable application ## What Pixeltable Replaces Most multimodal AI stacks look like this: blob storage for media, a relational database for metadata, a vector database for embeddings, an orchestrator for scheduling, and custom glue code holding it all together. ```mermaid theme={null} flowchart LR S3[S3 / GCS] --> Orch[Airflow / Prefect] Orch --> PG[(PostgreSQL)] Orch --> VDB[(Vector DB)] PG --- Cache[Redis] PG --- Glue[Glue Code] VDB --- Glue ``` **5+ services to deploy and maintain:** blob storage, orchestrator, relational DB, vector DB, cache — plus custom retry logic, rate limiting, sync scripts, and error handling to wire them together. ```mermaid theme={null} flowchart LR Refs[S3 / GCS] -->|references| CC[Computed Columns] CC --> Query[Query + Search] ``` **1 Python import.** Storage, orchestration, caching, vector indexing, rate limiting, and retry logic are built in. The infrastructure you don't deploy is infrastructure you don't maintain. ## Deployment Decision Guide Pixeltable supports two production deployment patterns. Choose based on your constraints: | Question | Answer | Recommendation | | ---------------------------------------------- | ------ | ----------------------------------------------- | | Existing production DB that must stay? | Yes | **Orchestration Layer** | | Building new multimodal app? | Yes | **Full Backend** | | Need semantic search (RAG)? | Yes | **Full Backend** | | Only ETL/transformation? | Yes | **Orchestration Layer** | | Expose Pixeltable as MCP server for LLM tools? 
| Yes | **Full Backend** + [MCP Server](/libraries/mcp) | ### Technical Capabilities (Both) Regardless of deployment mode, you get: * **[Multimodal Types](/platform/type-system):** Native handling of Video, Document, Audio, Image, JSON. * **[Computed Columns](/tutorials/computed-columns):** Automatic incremental updates and dependency tracking. * **[Views & Iterators](/platform/views):** Built-in logic for chunking documents, extracting frames, etc. * **[Model Orchestration](/integrations/frameworks):** Rate-limited API calls to OpenAI, Anthropic, Gemini, local models. * **[Data Interoperability](/sdk/latest/io):** Import/export Parquet, PyTorch, LanceDB, pandas. * **[Configurable Media Storage](/platform/configuration):** Per-column destination (local or cloud bucket). ### Use Case Comparison | Capability | [ML Data Wrangling](/use-cases/ml-data-wrangling) | [AI Applications](/use-cases/ai-applications) | | --------------------- | ------------------------------------------------- | --------------------------------------------- | | **Multimodal Types** | ✅ Video, Audio, Image, Document | ✅ Video, Audio, Image, Document | | **Computed Columns** | ✅ Enrichment & pre-annotation | ✅ Pipeline orchestration | | **Embedding Indexes** | ✅ Curation & similarity search | ✅ RAG & retrieval | | **Versioning** | ✅ Dataset snapshots | ✅ Data lineage | | **Data Sharing** | ✅ Publish datasets | ✅ Team collaboration | *** ## Deployment Strategies ### Approach 1: Pixeltable as Orchestration Layer Use Pixeltable for multimodal data orchestration while retaining your existing data infrastructure. ```mermaid theme={null} flowchart TB App[Application Layer] subgraph Existing[Your Existing Infrastructure] DB[(RDBMS)] Blob[Blob Storage] end subgraph PXT[Pixeltable] Process[Process Media
Generate Embeddings
Run LLM Calls] end PXT -->|Export Results| DB PXT -->|Export Media| Blob App --> DB App --> Blob ``` * Existing RDBMS (PostgreSQL, MySQL) and blob storage (S3, GCS, Azure Blob) must remain * Application already queries a separate data layer * Incremental adoption required with minimal stack changes * Deploy Pixeltable in Docker container or dedicated compute instance * Define tables, views, computed columns, and UDFs for multimodal processing * Process videos, documents, audio, images within Pixeltable * Export structured outputs (embeddings, metadata, classifications) to RDBMS * Export generated media to blob storage * Application queries existing data layer, not Pixeltable * Native multimodal type system (Video, Document, Audio, Image, JSON) * Declarative computed columns eliminate orchestration boilerplate * Incremental computation automatically handles new data * UDFs encapsulate transformation logic * LLM call orchestration with automatic rate limiting * Iterators for chunking documents, extracting frames, splitting audio ```python theme={null} # Example: Orchestrate in Pixeltable, export to external systems import pixeltable as pxt from pixeltable.functions.video import extract_audio from pixeltable.functions.openai import transcriptions from pixeltable.functions.video import frame_iterator import psycopg2 from datetime import datetime # Setup: Define Pixeltable orchestration pipeline pxt.create_dir('video_processing', if_exists='ignore') videos = pxt.create_table( 'video_processing/videos', {'video': pxt.Video, 'uploaded_at': pxt.Timestamp} ) # Computed columns for orchestration videos.add_computed_column( audio=extract_audio(videos.video, format='mp3') ) videos.add_computed_column( transcript=transcriptions(audio=videos.audio, model='whisper-1') ) # Optional: Add LLM-based summary from pixeltable.functions.openai import chat_completions videos.add_computed_column( summary=chat_completions( messages=[{'role': 'user', 'content': f"Summarize: 
{videos.transcript.text}"}], model='gpt-4o-mini' ) ) # Extract frames for analysis frames = pxt.create_view( 'video_processing/frames', videos, iterator=frame_iterator(video=videos.video, fps=1.0) ) # Insert video for processing videos.insert([{'video': 's3://bucket/video.mp4', 'uploaded_at': datetime.now()}]) # Export structured results to external RDBMS conn = psycopg2.connect("postgresql://...") cursor = conn.cursor() for row in videos.select(videos.video, videos.transcript).collect(): cursor.execute( "INSERT INTO video_metadata (video_url, transcript_json) VALUES (%s, %s)", (row['video'], row['transcript']) ) conn.commit() ``` ### Approach 2: Pixeltable as Full Backend Use Pixeltable for both orchestration and storage as your primary data backend. ```mermaid theme={null} flowchart TB Frontend[Frontend App] API[FastAPI / Flask / Django] subgraph Pixeltable[Pixeltable Full Backend] PG[(PostgreSQL
Metadata & Data)] Media[Media Storage
S3/GCS/Local] Compute[Computed Columns
Embeddings & LLMs] PG --- Media PG --- Compute end Frontend --> API API --> Pixeltable ``` * Building new multimodal AI application * Semantic search and vector similarity required * Storage and ML pipeline need tight integration * Stack consolidation preferred over separate storage/orchestration layers * Deploy Pixeltable on persistent instance (EC2 with EBS, EKS with persistent volumes, VM) * Build API endpoints (FastAPI, Flask, Django) that interact with Pixeltable tables * Frontend calls endpoints to insert data and retrieve results * Query using Pixeltable's semantic search, filters, joins, and aggregations * All data stored in Pixeltable: metadata, media references, computed column results * Unified storage, computation, and retrieval in single system * Native semantic search via embedding indexes (pgvector) * No synchronization layer between storage and orchestration * Automatic versioning and lineage tracking * Incremental computation propagates through views * LLM/agent orchestration * Data export to PyTorch, Parquet, LanceDB ```python theme={null} # Example: FastAPI endpoints backed by Pixeltable from pydantic import BaseModel from fastapi import FastAPI, UploadFile from datetime import datetime import pixeltable as pxt app = FastAPI() docs_table = pxt.get_table('myapp/documents') # Has computed columns: embedding, summary class SearchResult(BaseModel): document: str summary: str | None similarity: float @app.post("/documents/upload") def upload_document(file: UploadFile): status = docs_table.insert([{ 'document': file.filename, 'uploaded_at': datetime.now() }]) return {"rows_inserted": status.num_rows} @app.get("/documents/search") def search_documents(query: str, limit: int = 10) -> list[SearchResult]: sim = docs_table.embedding.similarity(string=query) results = docs_table.select( docs_table.document, docs_table.summary, similarity=sim ).order_by(sim, asc=False).limit(limit).collect() return list(results.to_pydantic(SearchResult)) 
@app.get("/documents/{doc_id}") def get_document(doc_id: int): result = docs_table.where(docs_table._rowid == doc_id).collect() return result[0] if len(result) > 0 else {"error": "Not found"} ``` **Use sync (`def`) endpoints, not `async def`.** FastAPI dispatches sync endpoints to a thread pool, giving each request its own thread. Pixeltable is thread-safe and handles concurrent requests automatically. Using `async def` would block the event loop and serialize all requests. See [Production Operations](/howto/deployment/operations) for details. ## Get Started A ready-to-clone skeleton app with a FastAPI backend and React frontend — multimodal upload, cross-modal search, and a tool-calling agent, all wired through Pixeltable computed columns. ## Next Steps Code organization and storage architecture Concurrency, error handling, and schema evolution Backup strategies and security best practices # Security & Backup Source: https://docs.pixeltable.com/howto/deployment/security Backup strategies, recovery procedures, and security best practices ## Backup Strategies | Deployment Approach | Backup Strategy | Recovery Method | | ----------------------- | ------------------------------------------------------- | ------------------------------- | | **Orchestration Layer** | External RDBMS + Blob Storage backups | Re-run transformation pipelines | | **Full Backend** | `pg_dump` of `~/.pixeltable/pgdata` + S3/GCS versioning | Restore `pgdata` + media files | ### Full Backend Backup For deployments using Pixeltable as the full backend: ```bash theme={null} # Backup PostgreSQL data pg_dump -h ~/.pixeltable/pgdata -U postgres pixeltable > backup.sql # Backup media files (if stored locally) tar -czf media_backup.tar.gz ~/.pixeltable/media/ # For cloud media storage, ensure S3/GCS versioning is enabled ``` ### Orchestration Layer Backup For orchestration-only deployments: * Primary data lives in your external RDBMS and blob storage * Pixeltable state can be rebuilt by re-running 
transformation pipelines * Back up your `setup_pixeltable.py` and UDF code in version control ## Recovery Procedures ### Full Backend Recovery 1. Stop the Pixeltable application 2. Restore PostgreSQL data: `psql -f backup.sql` 3. Restore media files to `~/.pixeltable/media/` 4. Restart the application ### Orchestration Layer Recovery 1. Deploy fresh Pixeltable instance 2. Run `setup_pixeltable.py` to recreate schema 3. Re-process data through computed columns (incremental) ## Security Best Practices | Security Layer | Recommendation | Implementation | | --------------------- | ---------------------------------- | ------------------------------------------------ | | **Network** | Deploy within private VPC | Do not expose PostgreSQL port (5432) to internet | | **Authentication** | Application layer (FastAPI/Django) | Pixeltable does not manage end-user accounts | | **Cloud Credentials** | IAM Roles / Workload Identity | Avoid long-lived keys in `config.toml` | ### Network Security ```yaml theme={null} # Example: Kubernetes NetworkPolicy apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: pixeltable-network-policy spec: podSelector: matchLabels: app: pixeltable policyTypes: - Ingress ingress: - from: - podSelector: matchLabels: app: api-server ports: - protocol: TCP port: 8000 ``` ### Secrets Management **Never hardcode secrets.** Use environment variables or secrets managers: ```python theme={null} # config.py - Load from environment import os OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') AWS_ACCESS_KEY_ID = os.getenv('AWS_ACCESS_KEY_ID') AWS_SECRET_ACCESS_KEY = os.getenv('AWS_SECRET_ACCESS_KEY') # Or use python-dotenv for local development from dotenv import load_dotenv load_dotenv() ``` For production, use: * **AWS:** Secrets Manager, Parameter Store * **GCP:** Secret Manager * **Kubernetes:** Secrets, External Secrets Operator ### Cloud Storage Credentials For S3/GCS/Azure media storage: ```python theme={null} # Prefer IAM roles over long-lived 
credentials # AWS: Use EC2 instance profile or EKS IRSA # GCP: Use Workload Identity # If credentials required, set via environment variables: # AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY # GOOGLE_APPLICATION_CREDENTIALS ``` ## Audit and Compliance ### Data Lineage Pixeltable automatically tracks: * Table versions and schema changes * Computed column definitions and dependencies * Insert/update/delete operations ```python theme={null} # View table history table.history() # Get specific version old_version = pxt.get_table('myapp/documents:5') # Version 5 ``` ### Access Logging Implement application-level access logging: ```python theme={null} from fastapi import FastAPI, Request import logging logger = logging.getLogger("audit") @app.middleware("http") async def audit_log(request: Request, call_next): logger.info(f"User: {request.user} Action: {request.method} {request.url}") response = await call_next(request) return response ``` ## Disaster Recovery ### Recovery Time Objectives | Deployment | RTO | Strategy | | ------------------- | ------- | -------------------------------------------- | | Orchestration Layer | Minutes | Spin up new instance, re-run pipelines | | Full Backend | Hours | Restore from backup, validate data integrity | ### Recommendations 1. **Regular backups:** Daily for production workloads 2. **Test recovery:** Quarterly disaster recovery drills 3. **Multi-region:** Store backups in different region than primary 4. **Immutable backups:** Use S3 Object Lock or GCS retention policies # Working with Anthropic in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-anthropic Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. 
Pixeltable’s Anthropic integration enables you to access Anthropic’s Claude LLM via the Anthropic API. ### Prerequisites * An Anthropic account with an API key ([https://docs.anthropic.com/en/api/getting-started](https://docs.anthropic.com/en/api/getting-started)) ### Important notes * Anthropic usage may incur costs based on your Anthropic plan. * Be mindful of sensitive data and consider security measures when integrating with external services. First you’ll need to install required libraries and enter an Anthropic API key. ```python theme={null} %pip install -qU pixeltable anthropic ``` ```python theme={null} import getpass import os if 'ANTHROPIC_API_KEY' not in os.environ: os.environ['ANTHROPIC_API_KEY'] = getpass.getpass( 'Anthropic API Key:' ) ``` Now let’s create a Pixeltable directory to hold the tables for our demo. ```python theme={null} import pixeltable as pxt # Remove the 'anthropic_demo' directory and its contents, if it exists pxt.drop_dir('anthropic_demo', force=True) pxt.create_dir('anthropic_demo') ```
  Created directory 'anthropic\_demo'.
## Messages Create a Table: In Pixeltable, create a table with columns to represent your input data and the columns where you want to store the results from Anthropic. ```python theme={null} from pixeltable.functions import anthropic # Create a table in Pixeltable and add a computed column that calls Anthropic t = pxt.create_table('anthropic_demo/chat', {'input': pxt.String}) msgs = [{'role': 'user', 'content': t.input}] t.add_computed_column( output=anthropic.messages( messages=msgs, model='claude-haiku-4-5-20251001', max_tokens=300, model_kwargs={ # Optional dict with parameters for the Anthropic API 'system': 'Respond to the prompt with detailed historical information.', 'temperature': 0.7, }, ) ) ```
  Created table 'chat'.
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Parse the response into a new column t.add_computed_column(response=t.output.content[0].text) ```
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Start a conversation t.insert( [ { 'input': 'What was the outcome of the 1904 US Presidential election?' } ] ) t.select(t.input, t.response).show() ```
  Inserting rows into \`chat\`: 1 rows \[00:00, 203.87 rows/s]
  Inserted 1 row with 0 errors.
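To see what the computed column `response=t.output.content[0].text` is doing, here is the equivalent lookup in plain Python on a mock payload. The dict below is illustrative only (not real output), assuming the `output` column stores a response in the Anthropic Messages API shape:

```python theme={null}
# Hypothetical payload mirroring the Anthropic Messages API response shape;
# values are placeholders, not actual model output.
sample_output = {
    'id': 'msg_123',
    'model': 'claude-haiku-4-5-20251001',
    'content': [{'type': 'text', 'text': 'Theodore Roosevelt won the 1904 election.'}],
    'usage': {'input_tokens': 18, 'output_tokens': 12},
}

# Equivalent of the computed column t.output.content[0].text:
response_text = sample_output['content'][0]['text']

# The same JSON also carries usage metadata you could extract into a column:
total_tokens = (
    sample_output['usage']['input_tokens']
    + sample_output['usage']['output_tokens']
)
print(response_text)  # Theodore Roosevelt won the 1904 election.
print(total_tokens)   # 30
```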
### Learn More To learn more about advanced techniques like RAG operations in Pixeltable, check out the [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) tutorial. If you have any questions, don’t hesitate to reach out. # Working with Bedrock in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-bedrock Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s Bedrock integration enables you to access AWS Bedrock foundation models directly from your tables. ### Prerequisites * Activate Bedrock in your AWS account. * Request access to your desired models (e.g. Claude Sonnet 3.7, Amazon Nova Pro). * Obtain a **Bedrock API Key** from the AWS console (under Bedrock > API keys), or configure standard AWS IAM credentials. ### Important notes * Bedrock usage may incur costs based on your Bedrock plan. * Be mindful of sensitive data and consider security measures when integrating with external services. First you’ll need to install required libraries and configure your Bedrock credentials. ```python theme={null} %pip install -qU pixeltable boto3 ``` ```python theme={null} import getpass import os if 'BEDROCK_API_KEY' not in os.environ: os.environ['BEDROCK_API_KEY'] = getpass.getpass( 'Enter your Bedrock API Key: ' ) # Optional: set the region if your Bedrock endpoint is not in us-east-1 # os.environ['BEDROCK_REGION_NAME'] = 'us-west-2' ``` Now let’s create a Pixeltable directory to hold the tables for our demo. ```python theme={null} import pixeltable as pxt # Remove the `bedrock_demo` directory and its contents, if it exists pxt.drop_dir('bedrock_demo', force=True) pxt.create_dir('bedrock_demo') ```
  Created directory 'bedrock\_demo'.
## Messages Create a Table: In Pixeltable, create a table with columns to represent your input data and the columns where you want to store the results from Bedrock. ```python theme={null} from pixeltable.functions import bedrock # Create a table in Pixeltable and add a computed column that calls Bedrock t = pxt.create_table('bedrock_demo/chat', {'input': pxt.String}) t.add_computed_column( output=bedrock.converse( model_id='amazon.nova-pro-v1:0', messages=[{'role': 'user', 'content': [{'text': t.input}]}], ) ) ```
  Created table 'chat'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Parse the response into a new column t.add_computed_column(response=t.output.output.message.content[0].text) ```
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Start a conversation t.insert( [ { 'input': 'What was the outcome of the 1904 US Presidential election?' } ] ) t.select(t.input, t.response).show() ```
  Inserted 1 row with 0 errors in 2.75 s (0.36 rows/s)
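The nested path in `t.output.output.message.content[0].text` follows the Bedrock Converse response structure. A minimal sketch with a mock payload (illustrative values, assuming the stored `output` column follows the Converse API shape):

```python theme={null}
# Mock payload in the Bedrock Converse API response shape; values are
# placeholders, not a real API response.
sample_output = {
    'output': {
        'message': {
            'role': 'assistant',
            'content': [{'text': 'Theodore Roosevelt defeated Alton B. Parker.'}],
        }
    },
    'stopReason': 'end_turn',
}

# Equivalent of the computed column t.output.output.message.content[0].text:
response_text = sample_output['output']['message']['content'][0]['text']
```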
### Learn more To learn more about advanced techniques like RAG operations in Pixeltable, check out the [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) tutorial. If you have any questions, don’t hesitate to reach out. # Working with Deepseek in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-deepseek Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s Deepseek integration enables you to access Deepseek’s LLM via the Deepseek API. ### Prerequisites * A Deepseek account with an API key ([https://api-docs.deepseek.com/](https://api-docs.deepseek.com/)) ### Important notes * Deepseek usage may incur costs based on your Deepseek plan. * Be mindful of sensitive data and consider security measures when integrating with external services. First you’ll need to install the required libraries and enter a Deepseek API key. Deepseek uses the OpenAI SDK as its Python API, so we need to install it in addition to Pixeltable. ```python theme={null} %pip install -qU pixeltable openai ``` ```python theme={null} import getpass import os if 'DEEPSEEK_API_KEY' not in os.environ: os.environ['DEEPSEEK_API_KEY'] = getpass.getpass('Deepseek API Key:') ``` Now let’s create a Pixeltable directory to hold the tables for our demo. ```python theme={null} import pixeltable as pxt # Remove the 'deepseek_demo' directory and its contents, if it exists pxt.drop_dir('deepseek_demo', force=True) pxt.create_dir('deepseek_demo') ```
  Created directory 'deepseek\_demo'.
## Messages Create a Table: In Pixeltable, create a table with columns to represent your input data and the columns where you want to store the results from Deepseek. ```python theme={null} from pixeltable.functions import deepseek # Create a table in Pixeltable and add a computed column that calls Deepseek t = pxt.create_table('deepseek_demo/chat', {'input': pxt.String}) msgs = [{'role': 'user', 'content': t.input}] t.add_computed_column( output=deepseek.chat_completions(messages=msgs, model='deepseek-chat') ) ```
  Created table 'chat'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Parse the response into a new column t.add_computed_column(response=t.output.choices[0].message.content) ```
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Start a conversation t.insert( [ { 'input': 'What was the outcome of the 1904 US Presidential election?' } ] ) t.select(t.input, t.response).show() ```
  Inserted 1 row with 0 errors in 18.72 s (0.05 rows/s)
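The `msgs` template above binds a single user turn to the `input` column. In plain Python, a chat history in this (OpenAI-compatible) format is just a list of role/content dicts, so extending a conversation is a matter of appending turns. A sketch with hypothetical turn contents:

```python theme={null}
# A chat history is a list of role/content dicts; contents here are
# illustrative placeholders.
history = [
    {'role': 'user', 'content': 'What was the outcome of the 1904 US Presidential election?'}
]
history.append({'role': 'assistant', 'content': 'Theodore Roosevelt won.'})
history.append({'role': 'user', 'content': 'Who was his running mate?'})

roles = [m['role'] for m in history]
print(roles)  # ['user', 'assistant', 'user']
```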
### Learn more To learn more about advanced techniques like RAG operations in Pixeltable, check out the [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) tutorial. If you have any questions, don’t hesitate to reach out. # Working with Microsoft Fabric Source: https://docs.pixeltable.com/howto/providers/working-with-fabric Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s Microsoft Fabric integration enables you to access Azure OpenAI models within Microsoft Fabric notebook environments with automatic authentication. ## Prerequisites * A Microsoft Fabric workspace with access to AI services * Running in a Microsoft Fabric notebook environment ## Important notes * This integration only works within Microsoft Fabric notebook environments * Authentication is handled automatically - no API keys required * Azure OpenAI usage in Fabric is subject to your organization’s Fabric capacity and policies For more information about Fabric AI services, see the [Microsoft Fabric AI Services documentation](https://learn.microsoft.com/en-us/fabric/data-science/ai-services/ai-services-overview). First, install Pixeltable in your Fabric notebook: ```python theme={null} %pip install -qU pixeltable ``` Now let’s create a Pixeltable directory to hold the tables for our demo. ```python theme={null} import pixeltable as pxt # Remove the 'fabric_demo' directory and its contents, if it exists pxt.drop_dir('fabric_demo', force=True) pxt.create_dir('fabric_demo') ``` ## Chat Completions with Standard Models Let’s start by using a standard chat model (gpt-4.1) for a simple Q\&A application. 
Create a table in Pixeltable with a computed column that calls Azure OpenAI via Fabric: ```python theme={null} from pixeltable.functions import fabric # Create a table for customer support tickets tickets = pxt.create_table( 'fabric_demo.support_tickets', { 'ticket_id': pxt.Int, 'customer_message': pxt.String, 'priority': pxt.String, }, ) # Add a computed column that automatically generates AI responses # No API keys needed - Fabric handles authentication! messages = [ { 'role': 'system', 'content': 'You are a helpful customer support agent. Be concise and professional.', }, {'role': 'user', 'content': tickets.customer_message}, ] tickets.add_computed_column( ai_response=fabric.chat_completions( messages, model='gpt-4.1', model_kwargs={'max_tokens': 200, 'temperature': 0.7}, ) ) ``` ```python theme={null} # Parse the response to extract just the message content tickets.add_computed_column( response_text=tickets.ai_response.choices[0].message.content ) ``` ```python theme={null} # Insert data - AI responses are generated automatically tickets.insert( [ { 'ticket_id': 1, 'customer_message': 'How do I reset my password?', 'priority': 'low', }, { 'ticket_id': 2, 'customer_message': "My order hasn't arrived after 2 weeks", 'priority': 'high', }, { 'ticket_id': 3, 'customer_message': 'Can I change my subscription plan?', 'priority': 'medium', }, ] ) # Query results with AI-generated responses tickets.select( tickets.ticket_id, tickets.customer_message, tickets.response_text ).show() ``` ## Chat Completions with Reasoning Models Fabric also supports reasoning models like gpt-5, which are optimized for complex reasoning tasks. 
**Note:** Reasoning models have different parameter requirements: * Use `max_completion_tokens` instead of `max_tokens` * Don’t support the `temperature` parameter ```python theme={null} # Create a table for complex reasoning tasks reasoning_tasks = pxt.create_table( 'fabric_demo.reasoning', {'task_id': pxt.Int, 'problem': pxt.String} ) messages = [{'role': 'user', 'content': reasoning_tasks.problem}] reasoning_tasks.add_computed_column( reasoning_output=fabric.chat_completions( messages, model='gpt-5', # Reasoning model model_kwargs={ 'max_completion_tokens': 1000 # Note: max_completion_tokens, not max_tokens }, ) ) reasoning_tasks.add_computed_column( solution=reasoning_tasks.reasoning_output.choices[0].message.content ) ``` ```python theme={null} # Insert a complex reasoning task reasoning_tasks.insert( [ { 'task_id': 1, 'problem': 'Explain how to implement a binary search tree with self-balancing capabilities. Include time complexity analysis.', } ] ) reasoning_tasks.select( reasoning_tasks.problem, reasoning_tasks.solution ).show() ``` ## Embeddings for Semantic Search Fabric also supports embedding models for semantic search and similarity operations. 
Let’s create a knowledge base with semantic search capabilities: ```python theme={null} # Create a knowledge base table knowledge_base = pxt.create_table( 'fabric_demo.knowledge', {'doc_id': pxt.Int, 'content': pxt.String, 'category': pxt.String}, ) # Add embeddings column knowledge_base.add_computed_column( embedding=fabric.embeddings( knowledge_base.content, model='text-embedding-ada-002' ) ) # Insert some documents knowledge_base.insert( [ { 'doc_id': 1, 'content': 'Pixeltable is a Python library for AI data workflows with built-in versioning.', 'category': 'product', }, { 'doc_id': 2, 'content': 'Microsoft Fabric provides a unified analytics platform for data engineering and AI.', 'category': 'platform', }, { 'doc_id': 3, 'content': 'Azure OpenAI Service offers powerful language models through REST APIs.', 'category': 'service', }, ] ) ``` ```python theme={null} # Add an embedding index for fast similarity search knowledge_base.add_embedding_index( 'content', embedding=fabric.embeddings.using(model='text-embedding-ada-002'), ) ``` ```python theme={null} # Perform similarity search sim = knowledge_base.content.similarity('AI platform for data science') knowledge_base.select( knowledge_base.content, knowledge_base.category, sim=sim ).order_by(sim, asc=False).limit(2).show() ``` ## Combining Chat and Embeddings: RAG Pattern Let’s combine embeddings and chat completions to build a simple Retrieval-Augmented Generation (RAG) system: ```python theme={null} # Create a table for questions questions = pxt.create_table( 'fabric_demo.questions', {'question_id': pxt.Int, 'question': pxt.String}, ) # Find similar documents using similarity search @pxt.query def retrieve_context(question: str, top_k: int = 2) -> list[dict]: sim = knowledge_base.content.similarity(question) return ( knowledge_base.select(knowledge_base.content) .order_by(sim, asc=False) .limit(top_k) .collect()['content'] ) # Add context retrieval questions.add_computed_column( 
context=retrieve_context(questions.question, top_k=2) ) # Build RAG prompt with retrieved context questions.add_computed_column( rag_messages=[ { 'role': 'system', 'content': "Answer the question based on the provided context. If the context doesn't contain relevant information, say so.", }, { 'role': 'user', 'content': f'Context: {questions.context}\n\nQuestion: {questions.question}', }, ] ) # Generate answer using gpt-4.1 questions.add_computed_column( answer_response=fabric.chat_completions( questions.rag_messages, model='gpt-4.1', model_kwargs={'max_tokens': 300}, ) ) questions.add_computed_column( answer=questions.answer_response.choices[0].message.content ) ``` ```python theme={null} # Ask a question questions.insert( [{'question_id': 1, 'question': 'What is Microsoft Fabric used for?'}] ) questions.select( questions.question, questions.context, questions.answer ).show() ``` ## Available Models in Fabric The following models are currently available in Microsoft Fabric: **Chat Models:** * `gpt-5` (reasoning model) * `gpt-4.1` * `gpt-4.1-mini` **Embedding Models:** * `text-embedding-ada-002` * `text-embedding-3-small` * `text-embedding-3-large` For the latest information on available models, see the [Fabric AI Services documentation](https://learn.microsoft.com/en-us/fabric/data-science/ai-services/ai-services-overview). 
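The similarity search used throughout the examples above ranks rows by vector similarity between the query embedding and each stored embedding. A minimal pure-Python sketch of cosine similarity on toy 2-D vectors illustrates the idea (the embedding index itself uses the real, high-dimensional vectors):

```python theme={null}
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy example: rank two "documents" against a query vector.
query = [1.0, 0.0]
docs = {'doc_1': [0.9, 0.1], 'doc_2': [0.0, 1.0]}
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # ['doc_1', 'doc_2'] -- doc_1 points nearly the same direction
```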
## Key Features * **Automatic Authentication**: No API keys required - authentication is handled by Fabric * **Rate Limiting**: Pixeltable automatically handles rate limiting based on Azure OpenAI response headers * **Batching**: Embedding requests are automatically batched for efficiency (up to 32 inputs per request) * **Incremental Processing**: Computed columns only run on new or updated data * **Versioning**: All data and transformations are automatically versioned ### Learn More To learn more about advanced techniques in Pixeltable: * [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) * [Working with Embeddings](/platform/embedding-indexes) * [Microsoft Fabric AI Services](https://learn.microsoft.com/en-us/fabric/data-science/ai-services/ai-services-overview) If you have any questions, don’t hesitate to reach out on our [Discord community](https://discord.gg/QPyqFYx2UN). # Working with fal.ai in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-fal Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s fal.ai integration enables you to access fal.ai’s fast inference models via the fal.ai API. ### Prerequisites * A fal.ai account with an API key ([https://fal.ai/dashboard/keys](https://fal.ai/dashboard/keys)) ### Important notes * fal.ai usage may incur costs based on your fal.ai plan. * Be mindful of sensitive data and consider security measures when integrating with external services. First you’ll need to install required libraries and enter a fal.ai API key. 
```python theme={null} %pip install -qU pixeltable fal-client ``` ```python theme={null} import getpass import os if 'FAL_API_KEY' not in os.environ: os.environ['FAL_API_KEY'] = getpass.getpass('fal.ai API Key: ') ``` Now let’s create a Pixeltable directory to hold the tables for our demo. ```python theme={null} import pixeltable as pxt # Remove the 'fal_demo' directory and its contents, if it exists pxt.drop_dir('fal_demo', force=True) pxt.create_dir('fal_demo') ```
  Created directory 'fal\_demo'.
## Text-to-image generation with FLUX Schnell Let’s start by using fal.ai’s FLUX Schnell model, which is optimized for fast image generation. We’ll create a table to store prompts and generated images. ```python theme={null} from pixeltable.functions import fal # Create a table for image generation t = pxt.create_table('fal_demo/images', {'prompt': pxt.String}) # Add a computed column that calls the FLUX Schnell model t.add_computed_column( response=fal.run( input={'prompt': t.prompt}, app='fal-ai/flux/schnell' ) ) ```
  Created table 'images'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
Now let’s insert some prompts and see the results: ```python theme={null} # Insert a few prompts t.insert( [ { 'prompt': 'A serene mountain landscape at sunset with a crystal clear lake' }, { 'prompt': 'A friendly robot teaching a class of kittens to code' }, {'prompt': 'An underwater city with bioluminescent architecture'}, ] ) ```
  Inserted 3 rows with 0 errors in 1.77 s (1.70 rows/s)
  3 rows inserted.
Let’s examine the structure of the response: ```python theme={null} t.select(t.prompt, t.response).head(1) ```
We can see that fal.ai returns a JSON response with an `images` array. Each image has a `url` field. Let’s extract and display the images: ```python theme={null} # Add a computed column to extract the image URL and convert it to an Image type t.add_computed_column( image=t.response['images'][0]['url'].astype(pxt.Image) ) # Display the prompts and images t.select(t.prompt, t.image).head() ```
  Added 3 column values with 0 errors in 0.04 s (85.38 rows/s)
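The extraction above is plain JSON navigation. Here is the same lookup on a mock response dict (placeholder values, assuming the `images`-array shape described above):

```python theme={null}
# Mock of the fal.ai response shape: an `images` array where each entry
# carries a `url`. Values are placeholders, not a real response.
sample_response = {
    'images': [
        {'url': 'https://example.com/image-0.png', 'width': 1024, 'height': 1024}
    ],
    'seed': 42,
}

# Equivalent of t.response['images'][0]['url'] in the computed column above;
# Pixeltable's .astype(pxt.Image) then fetches and stores the image itself.
image_url = sample_response['images'][0]['url']
```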
## Advanced image generation with Fast SDXL fal.ai also offers Fast SDXL, which provides more control over image generation parameters. Let’s create a new table to explore these capabilities. ```python theme={null} # Create a table with more parameters sdxl_t = pxt.create_table( 'fal_demo/sdxl_images', { 'prompt': pxt.String, 'negative_prompt': pxt.String, 'steps': pxt.Int, }, ) # Add a computed column with more parameters sdxl_t.add_computed_column( response=fal.run( input={ 'prompt': sdxl_t.prompt, 'negative_prompt': sdxl_t.negative_prompt, 'image_size': 'square_hd', # 1024x1024 'num_inference_steps': sdxl_t.steps, }, app='fal-ai/fast-sdxl', ) ) # Extract the image sdxl_t.add_computed_column( image=sdxl_t.response['images'][0]['url'].astype(pxt.Image) ) ```
  Created table 'sdxl\_images'.
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Insert prompts with different parameters sdxl_t.insert( [ { 'prompt': 'A majestic lion in a savanna at golden hour, photorealistic', 'negative_prompt': 'cartoon, illustration, drawing', 'steps': 25, }, { 'prompt': 'A futuristic cityscape with flying cars and neon lights', 'negative_prompt': 'blurry, low quality', 'steps': 30, }, ] ) ```
  Inserted 2 rows with 0 errors in 5.23 s (0.38 rows/s)
  2 rows inserted.
```python theme={null} # Display the results sdxl_t.select(sdxl_t.prompt, sdxl_t.image).head() ```
## Generating multiple images per prompt You can also generate multiple variations of the same prompt in a single request: ```python theme={null} # Create a table for multiple image generation multi_t = pxt.create_table( 'fal_demo/multi_images', {'prompt': pxt.String} ) # Generate 3 variations of each prompt multi_t.add_computed_column( response=fal.run( input={'prompt': multi_t.prompt, 'num_images': 3}, app='fal-ai/flux/schnell', ) ) # Extract each of the three images into its own column multi_t.add_computed_column( image_1=multi_t.response['images'][0]['url'].astype(pxt.Image) ) multi_t.add_computed_column( image_2=multi_t.response['images'][1]['url'].astype(pxt.Image) ) multi_t.add_computed_column( image_3=multi_t.response['images'][2]['url'].astype(pxt.Image) ) ```
  Created table 'multi\_images'.
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Insert a prompt multi_t.insert( [{'prompt': 'A steampunk mechanical butterfly on a brass flower'}] ) ```
  Inserted 1 row with 0 errors in 1.14 s (0.88 rows/s)
  1 row inserted.
```python theme={null} # Display all three variations multi_t.select(multi_t.image_1, multi_t.image_2, multi_t.image_3).head() ```
## Using Higher Quality Models For higher quality generation, you can use models like `fal-ai/flux/dev` which produce better results but take more time: ```python theme={null} # Create a table using FLUX Dev dev_t = pxt.create_table('fal_demo/flux_dev', {'prompt': pxt.String}) # Use FLUX Dev model for higher quality dev_t.add_computed_column( response=fal.run( input={'prompt': dev_t.prompt}, app='fal-ai/flux/dev' ) ) dev_t.add_computed_column( image=dev_t.response['images'][0]['url'].astype(pxt.Image) ) ```
  Created table 'flux\_dev'.
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Insert a prompt (note: FLUX Dev may take longer but produces higher quality results) dev_t.insert( [ { 'prompt': 'A highly detailed oil painting of a wizard casting a spell in an ancient library' } ] ) ```
  Inserted 1 row with 0 errors in 1.74 s (0.58 rows/s)
  1 row inserted.
```python theme={null} # Display the result dev_t.select(dev_t.prompt, dev_t.image).head() ```
## Exploring Available Models fal.ai offers a wide variety of models. Here are some popular ones you can try: ### Image Generation Models * `fal-ai/flux/schnell` - Fast FLUX model for quick image generation * `fal-ai/flux/dev` - Higher quality FLUX model (slower) * `fal-ai/fast-sdxl` - Fast Stable Diffusion XL * `fal-ai/stable-diffusion-v3-medium` - Stable Diffusion 3 Medium ### Other Models * `fal-ai/fast-lightning-sdxl` - Ultra-fast SDXL variant * `fal-ai/recraft-v3` - Recraft V3 for design-focused generation To use a different model, simply change the `app` parameter in your `fal.run()` call. ## Working with Batch Processing Pixeltable’s computed columns make it easy to process multiple images in batch. Let’s create a larger dataset: ```python theme={null} # Create a batch processing table batch_t = pxt.create_table( 'fal_demo/batch', {'category': pxt.String, 'description': pxt.String} ) # Create a prompt by combining category and description batch_t.add_computed_column( prompt=pxt.functions.string.format( 'A {} that is {}', batch_t.category, batch_t.description ) ) # Generate images batch_t.add_computed_column( response=fal.run( input={'prompt': batch_t.prompt}, app='fal-ai/flux/schnell' ) ) batch_t.add_computed_column( image=batch_t.response['images'][0]['url'].astype(pxt.Image) ) ```
  Created table 'batch'.
  Added 0 column values with 0 errors in 0.02 s
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Insert a batch of prompts batch_t.insert( [ {'category': 'landscape', 'description': 'peaceful and zen-like'}, { 'category': 'portrait', 'description': 'mysterious and ethereal', }, { 'category': 'abstract art', 'description': 'colorful and energetic', }, { 'category': 'architecture', 'description': 'modern and minimalist', }, {'category': 'animal', 'description': 'cute and fluffy'}, ] ) ```
  Inserted 5 rows with 0 errors in 1.69 s (2.96 rows/s)
  5 rows inserted.
```python theme={null} # View all results batch_t.select( batch_t.category, batch_t.description, batch_t.image ).show() ```
## Tips and Best Practices 1. **Rate Limiting**: fal.ai has rate limits. Pixeltable respects these limits by default. You can configure custom rate limits in your Pixeltable config. 2. **Model Selection**: * Use `flux/schnell` for fast prototyping and when speed is critical * Use `flux/dev` when you need higher quality and can afford longer generation times * Use `fast-sdxl` for a good balance of speed and quality 3. **Prompt Engineering**: Good prompts lead to better results. Be specific and descriptive. 4. **Negative Prompts**: Use negative prompts to exclude unwanted elements from your images. 5. **Caching**: Pixeltable automatically caches results, so re-running the same prompt won’t incur additional costs. ### Learn more * fal.ai Documentation: [https://fal.ai/docs](https://fal.ai/docs) * Pixeltable Documentation: [https://docs.pixeltable.com](https://docs.pixeltable.com) * To learn more about advanced techniques like RAG operations in Pixeltable, check out the [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) tutorial. If you have any questions, don’t hesitate to reach out on our [Discord community](https://pixeltable.com/discord)! # Working with Fireworks AI in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-fireworks Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s Fireworks integration enables you to access LLMs hosted on the Fireworks platform. ### Prerequisites * A Fireworks account with an API key ([https://fireworks.ai/api-keys](https://fireworks.ai/api-keys)) ### Important notes * Fireworks usage may incur costs based on your Fireworks plan. * Be mindful of sensitive data and consider security measures when integrating with external services. 
First you’ll need to install required libraries and enter a Fireworks API key. ```python theme={null} %pip install -qU pixeltable fireworks-ai ``` ```python theme={null} import getpass import os if 'FIREWORKS_API_KEY' not in os.environ: os.environ['FIREWORKS_API_KEY'] = getpass.getpass( 'Fireworks API Key:' ) ``` Now let’s create a Pixeltable directory to hold the tables for our demo. ```python theme={null} import pixeltable as pxt # Remove the 'fireworks_demo' directory and its contents, if it exists pxt.drop_dir('fireworks_demo', force=True) pxt.create_dir('fireworks_demo') ```
  Created directory 'fireworks\_demo'.
## Completions Create a Table: In Pixeltable, create a table with columns to represent your input data and the columns where you want to store the results from Fireworks. ```python theme={null} from pixeltable.functions.fireworks import chat_completions # Create a table in Pixeltable and pick a model hosted on Fireworks with some parameters t = pxt.create_table('fireworks_demo/chat', {'input': pxt.String}) messages = [{'role': 'user', 'content': t.input}] t.add_computed_column( output=chat_completions( messages=messages, model='accounts/fireworks/models/llama-v3p3-70b-instruct', model_kwargs={ # Optional dict with parameters for the Fireworks API 'max_tokens': 300, 'top_k': 40, 'top_p': 0.9, 'temperature': 0.7, }, ) ) ```
  Created table 'chat'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Parse the bot_response into a new column t.add_computed_column(response=t.output.choices[0].message.content) ```
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Start a conversation t.insert( [{'input': 'Can you tell me who was President of the US in 1961?'}] ) t.select(t.input, t.response).show() ```
  Inserted 1 row with 0 errors in 2.15 s (0.47 rows/s)
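Conceptually, `model_kwargs` is an optional dict of sampling parameters that sits alongside the required `model` and `messages` fields in the request. This is a hedged sketch of that merge in plain Python, not Fireworks' actual client code; the `build_request` helper is hypothetical:

```python theme={null}
# Hypothetical helper illustrating how optional parameters merge into a
# chat-completion request body alongside the required fields.
def build_request(messages, model, model_kwargs=None):
    return {'model': model, 'messages': messages, **(model_kwargs or {})}

req = build_request(
    [{'role': 'user', 'content': 'Can you tell me who was President of the US in 1961?'}],
    'accounts/fireworks/models/llama-v3p3-70b-instruct',
    {'max_tokens': 300, 'top_k': 40, 'top_p': 0.9, 'temperature': 0.7},
)
print(sorted(req))  # ['max_tokens', 'messages', 'model', 'temperature', 'top_k', 'top_p']
```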
### Learn more To learn more about advanced techniques like RAG operations in Pixeltable, check out the [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) tutorial. If you have any questions, don’t hesitate to reach out. # Working with Gemini in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-gemini Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s Gemini integration enables you to access the Gemini LLM via the Google Gemini API. ### Prerequisites * A Google AI Studio account with an API key ([https://aistudio.google.com/app/apikey](https://aistudio.google.com/app/apikey)) ### Important notes * Google AI Studio usage may incur costs based on your plan. * Be mindful of sensitive data and consider security measures when integrating with external services. First you’ll need to install required libraries and enter a Gemini API key obtained via Google AI Studio. ```python theme={null} %pip install -qU pixeltable google-genai ``` ```python theme={null} import getpass import os if 'GEMINI_API_KEY' not in os.environ: os.environ['GEMINI_API_KEY'] = getpass.getpass( 'Google AI Studio API Key:' ) ``` Now let’s create a Pixeltable directory to hold the tables for our demo. ```python theme={null} import pixeltable as pxt # Remove the 'gemini_demo' directory and its contents, if it exists pxt.drop_dir('gemini_demo', force=True) pxt.create_dir('gemini_demo') ```
  Created directory 'gemini\_demo'.
## Generate content Create a Table: In Pixeltable, create a table with columns to represent your input data and the columns where you want to store the results from Gemini. ```python theme={null} from google.genai.types import GenerateContentConfigDict from pixeltable.functions import gemini # Create a table in Pixeltable and pick a model hosted on Google AI Studio with some parameters t = pxt.create_table('gemini_demo/text', {'input': pxt.String}) config = GenerateContentConfigDict( stop_sequences=['\n'], max_output_tokens=300, temperature=1.0, top_p=0.95, top_k=40, ) t.add_computed_column( output=gemini.generate_content( t.input, model='gemini-2.5-flash', config=config ) ) ```
  Created table 'text'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Ask Gemini to generate some content based on the input t.insert( [ {'input': 'Write a story about a magic backpack.'}, {'input': 'Tell me a science joke.'}, ] ) ```
  Inserted 2 rows with 0 errors in 1.43 s (1.39 rows/s)
  2 rows inserted.
```python theme={null} # Parse the response into a new column t.add_computed_column( response=t.output['candidates'][0]['content']['parts'][0]['text'] ) t.select(t.input, t.response).head() ```
  Added 2 column values with 0 errors in 0.03 s (62.79 rows/s)
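The path expression above mirrors the JSON structure returned by the Gemini API. As plain Python, the same extraction (run on a hypothetical response dict shaped like the one the computed column stores, not real API output) looks like:

```python theme={null}
# Minimal sketch: extract the first candidate's text from a
# Gemini-style response dict, using the same path as the computed
# column above. The sample dict is illustrative only.
def first_candidate_text(response: dict) -> str:
    return response['candidates'][0]['content']['parts'][0]['text']

sample = {
    'candidates': [
        {'content': {'parts': [{'text': 'Once upon a time...'}]}}
    ]
}
print(first_candidate_text(sample))  # Once upon a time...
```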
## Generate images with Imagen ```python theme={null} from google.genai.types import GenerateImagesConfigDict images_t = pxt.create_table('gemini_demo/images', {'prompt': pxt.String}) config = GenerateImagesConfigDict(aspect_ratio='16:9') images_t.add_computed_column( generated_image=gemini.generate_images( images_t.prompt, model='imagen-4.0-generate-001', config=config ) ) ```
  Created table 'images'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} images_t.insert( [{'prompt': 'A friendly dinosaur playing tennis in a cornfield'}] ) ```
  Inserted 1 row with 0 errors in 9.41 s (0.11 rows/s)
  1 row inserted.
```python theme={null} images_t.head() ```
## Generate video with Veo ```python theme={null} videos_t = pxt.create_table('gemini_demo/videos', {'prompt': pxt.String}) videos_t.add_computed_column( generated_video=gemini.generate_videos( videos_t.prompt, model='veo-2.0-generate-001' ) ) ```
  Created table 'videos'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} videos_t.insert( [ { 'prompt': 'A giant pixel floating over the open ocean in a sea of data' } ] ) ```
  Inserted 1 row with 0 errors in 46.23 s (0.02 rows/s)
  1 row inserted.
```python theme={null} videos_t.head() ```
## Generate video from an existing image We’ll add a computed column to our existing `images_t` table to animate the generated images. ```python theme={null} images_t.add_computed_column( generated_video=gemini.generate_videos( image=images_t.generated_image, model='veo-2.0-generate-001' ) ) ```
  Added 1 column value with 0 errors in 40.00 s (0.03 rows/s)
  1 row updated.
```python theme={null} images_t.head() ```
### Learn more To learn more about advanced techniques like RAG operations in Pixeltable, check out the [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) tutorial. If you have any questions, don’t hesitate to reach out. # Working with Groq in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-groq Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s Groq integration enables you to access Groq models via the Groq API. ### Prerequisites * A Groq account with an API key ([https://console.groq.com/docs/quickstart](https://console.groq.com/docs/quickstart)) ### Important notes * Groq usage may incur costs based on your Groq plan. * Be mindful of sensitive data and consider security measures when integrating with external services. First you’ll need to install required libraries and enter your Groq API key. ```python theme={null} %pip install -qU pixeltable groq ``` ```python theme={null} import getpass import os if 'GROQ_API_KEY' not in os.environ: os.environ['GROQ_API_KEY'] = getpass.getpass( 'Enter your Groq API key:' ) ``` Now let’s create a Pixeltable directory to hold the tables for our demo. ```python theme={null} import pixeltable as pxt # Remove the 'groq_demo' directory and its contents, if it exists pxt.drop_dir('groq_demo', force=True) pxt.create_dir('groq_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'groq\_demo'.
## Chat Completions Create a Table: In Pixeltable, create a table with columns to represent your input data and the columns where you want to store the results from Groq. ```python theme={null} from pixeltable.functions import groq # Create a table in Pixeltable and add a computed column that calls Groq t = pxt.create_table('groq_demo/chat', {'input': pxt.String}) messages = [{'role': 'user', 'content': t.input}] t.add_computed_column( output=groq.chat_completions( messages=messages, model='llama-3.3-70b-versatile', model_kwargs={ # Optional dict with parameters for the Groq API 'max_tokens': 300, 'top_p': 0.9, 'temperature': 0.7, }, ) ) ```
  Created table 'chat'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Parse the response into a new column t.add_computed_column(response=t.output.choices[0].message.content) ```
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Start a conversation t.insert( [{'input': 'How many islands are in the Aleutian island chain?'}] ) t.select(t.input, t.response).head() ```
  Inserted 1 row with 0 errors in 1.16 s (0.86 rows/s)
### Learn more To learn more about advanced techniques like RAG operations in Pixeltable, check out the [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) tutorial. If you have any questions, don’t hesitate to reach out. # Working with Hugging Face Source: https://docs.pixeltable.com/howto/providers/working-with-hugging-face Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable provides seamless integration with Hugging Face datasets and models. This tutorial covers: * Importing datasets directly into Pixeltable tables * Working with dataset splits (train/test/validation) * Streaming large datasets with `IterableDataset` * Type mappings from Hugging Face to Pixeltable * Using Hugging Face models for embeddings ## Setup ```python theme={null} %pip install -qU pixeltable datasets torch transformers sentence-transformers ``` ## Import a Hugging Face Dataset Use `pxt.create_table()` with the `source=` parameter to import a Hugging Face dataset directly. Pixeltable automatically maps Hugging Face feature types to Pixeltable column types. ```python theme={null} import datasets import pixeltable as pxt pxt.drop_dir('hf_demo', force=True) pxt.create_dir('hf_demo') # Load a dataset with images padoru = datasets.load_dataset( 'not-lain/padoru', split='train' ).select_columns(['Image', 'ImageSize', 'Name', 'ImageSource']) # Import into Pixeltable images = pxt.create_table('hf_demo/images', source=padoru) ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'hf\_demo'.
  Created table 'images'.
  Inserting rows into \`images\`: 100 rows \[00:00, 310.24 rows/s]
  Inserting rows into \`images\`: 100 rows \[00:00, 353.22 rows/s]
  Inserting rows into \`images\`: 100 rows \[00:00, 368.40 rows/s]
  Inserting rows into \`images\`: 82 rows \[00:00, 567.89 rows/s]
  Inserted 382 rows with 0 errors.
```python theme={null} images.head(3) ```
## Working with Dataset Splits When importing a `DatasetDict` (which contains multiple splits like train/test), use `extra_args={'column_name_for_split': 'split'}` to preserve split information in a column. ```python theme={null} # Load a dataset with multiple splits imdb = datasets.load_dataset('stanfordnlp/imdb') # Import all splits, storing split info in 'split' column reviews = pxt.create_table( 'hf_demo/reviews', source=imdb, extra_args={'column_name_for_split': 'split'}, ) ``` ```python theme={null} # Query by split reviews.where(reviews.split == 'train').limit(3).select( reviews.text, reviews.label, reviews.split ).collect() ```
```python theme={null} # Count rows per split reviews.group_by(reviews.split).select( reviews.split, count=pxt.functions.count(reviews.text) ).collect() ```
## Using `schema_overrides` for Embeddings When importing datasets with pre-computed embeddings (common in RAG), use `schema_overrides` to specify the exact array shape: ```python theme={null} # Wikipedia with pre-computed embeddings - specify array shape wiki_ds = ( datasets.load_dataset( 'Cohere/wikipedia-2023-11-embed-multilingual-v3', 'simple', split='train', streaming=True, ) .select_columns(['url', 'title', 'text', 'emb']) .take(50) ) wiki = pxt.create_table( 'hf_demo/wiki_embeddings', source=wiki_ds, schema_overrides={'emb': pxt.Array[(1024,), pxt.Float]}, ) ``` ```python theme={null} wiki.select(wiki.title, wiki.emb).limit(2).collect() ```
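Declaring the exact shape matters because Pixeltable validates stored values against the declared array type. Before choosing the override, you can sanity-check the embedding dimensionality of a few rows; a minimal sketch with numpy (the `rows` below are hypothetical stand-ins for dataset rows):

```python theme={null}
import numpy as np

# Hypothetical embedding rows; in practice, iterate over a few rows
# of the Hugging Face dataset instead.
rows = [
    {'emb': [0.1] * 1024},
    {'emb': [0.2] * 1024},
]
dims = {np.asarray(r['emb']).shape for r in rows}
print(dims)  # {(1024,)} -> use pxt.Array[(1024,), pxt.Float]
```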
## Streaming Large Datasets For very large datasets, use `streaming=True` to filter and sample before importing: ```python theme={null} # Stream, filter, and sample before importing streaming_ds = datasets.load_dataset( 'stanfordnlp/imdb', split='train', streaming=True ) positive_stream = streaming_ds.filter(lambda x: x['label'] == 1).take(50) ``` ```python theme={null} positive_samples = pxt.create_table( 'hf_demo/positive_samples', source=positive_stream ) ``` ```python theme={null} positive_samples.select( positive_samples.text, positive_samples.label ).limit(2).collect() ```
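The filter-then-take pattern is efficient because a streaming dataset evaluates lazily: nothing is fetched beyond what the pipeline consumes. The same idea in plain Python, with a generator standing in for the dataset (illustrative only, no download involved):

```python theme={null}
from itertools import islice

# Stand-in for a streaming dataset: a lazy source of records.
records = ({'text': f'review {i}', 'label': i % 2} for i in range(10_000))

# Filter for positive labels, then lazily take the first 3 matches.
positives = (r for r in records if r['label'] == 1)
sample = list(islice(positives, 3))
print([r['text'] for r in sample])  # ['review 1', 'review 3', 'review 5']
```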
## Importing Audio Datasets Audio datasets work seamlessly - Pixeltable stores audio files locally: ```python theme={null} # Import a small audio dataset audio_ds = datasets.load_dataset( 'hf-internal-testing/librispeech_asr_dummy', 'clean', split='validation', ) audio_table = pxt.create_table('hf_demo/audio_samples', source=audio_ds) audio_table.select(audio_table.audio, audio_table.text).limit(2).collect() ```
  Created table 'audio\_samples'.
  Inserting rows into \`audio\_samples\`: 73 rows \[00:00, 3960.27 rows/s]
  Inserted 73 rows with 0 errors.
## Inserting More Data Use `table.insert()` to add more data from a HuggingFace dataset to an existing table: ```python theme={null} # Insert more data from the same or similar dataset more_audio = datasets.load_dataset( 'hf-internal-testing/librispeech_asr_dummy', 'clean', split='validation', ).select(range(5)) audio_table.insert(more_audio) audio_table.count() ```
  Inserting rows into \`audio\_samples\`: 5 rows \[00:00, 3186.68 rows/s]
  Inserted 5 rows with 0 errors.
  78
## Type Mappings Reference
## Using Hugging Face Models Pixeltable integrates with Hugging Face models for embeddings and inference, running locally without API keys. ### Image Embeddings with CLIP ```python theme={null} from pixeltable.functions.huggingface import clip # Add CLIP embedding index for cross-modal image search images.add_embedding_index( 'Image', embedding=clip.using(model_id='openai/clip-vit-base-patch32') ) # Search images using text sim = images.Image.similarity(string='anime character with red clothes') images.order_by(sim, asc=False).limit(3).select( images.Image, images.Name, sim=sim ).collect() ```
### Text Embeddings with Sentence Transformers ```python theme={null} from pixeltable.functions.huggingface import sentence_transformer # Create table with text embedding index sample_reviews = pxt.create_table( 'hf_demo/sample_reviews', source=datasets.load_dataset('stanfordnlp/imdb', split='test').select( range(100) ), ) sample_reviews.add_embedding_index( 'text', string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2'), ) # Semantic search query = 'great acting and cinematography' sim = sample_reviews.text.similarity(string=query) sample_reviews.order_by(sim, asc=False).limit(3).select( sample_reviews.text, sim=sim ).collect() ```
  Created table 'sample\_reviews'.
  Inserting rows into \`sample\_reviews\`: 100 rows \[00:00, 21625.70 rows/s]
  Inserted 100 rows with 0 errors.
### More Hugging Face Models Pixeltable supports many more HuggingFace models including: * **ASR**: `automatic_speech_recognition()` - transcribe audio * **Translation**: `translation()` - translate between languages * **Text Generation**: `text_generation()` - generate text completions * **Image Classification**: `vit_for_image_classification()` - classify images * **Object Detection**: `detr_for_object_detection()` - detect objects in images See the SDK reference below for the complete list. ## See Also * [HuggingFace SDK Reference](/sdk/latest/huggingface) - Full list of models: ASR, translation, text generation, image classification, etc. * [Working with embedding indexes](../../platform/embedding-indexes) - Index configuration # Working with Jina AI in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-jina Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s Jina AI integration enables you to access state-of-the-art embedding and reranker models via the Jina AI API. ### Prerequisites * A Jina AI account with an API key ([https://jina.ai/](https://jina.ai/)) ### Important notes * Jina AI usage may incur costs based on your Jina AI plan. * Be mindful of sensitive data and consider security measures when integrating with external services. First you’ll need to install Pixeltable and set up your Jina AI API key. ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import os import getpass if 'JINA_API_KEY' not in os.environ: os.environ['JINA_API_KEY'] = getpass.getpass( 'Enter your Jina AI API key: ' ) ``` Now let’s create a Pixeltable directory to hold the tables for our demo. 
```python theme={null} import pixeltable as pxt # Remove the 'jina_demo' directory and its contents, if it exists pxt.drop_dir('jina_demo', force=True) pxt.create_dir('jina_demo') ```
  Created directory 'jina\_demo'.
## Text Embeddings Jina AI provides frontier multilingual embedding models for semantic search and RAG applications. The `jina-embeddings-v3` model supports 89+ languages and achieves state-of-the-art performance. ```python theme={null} from pixeltable.functions import jina # Create a table for document embeddings docs_t = pxt.create_table('jina_demo.documents', {'text': pxt.String}) # Add computed column with Jina embeddings # task='retrieval.passage' optimizes embeddings for documents to be searched docs_t.add_computed_column( embedding=jina.embeddings( docs_t.text, model='jina-embeddings-v3', task='retrieval.passage' ) ) ```
  Created table 'documents'.
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Insert some sample documents documents = [ 'The Mediterranean diet emphasizes fish, olive oil, and vegetables, believed to reduce chronic diseases.', 'Photosynthesis in plants converts light energy into glucose and produces essential oxygen.', '20th-century innovations, from radios to smartphones, centered on electronic advancements.', 'Rivers provide water, irrigation, and habitat for aquatic species, vital for ecosystems.', "Apple's conference call to discuss fourth fiscal quarter results is scheduled for Thursday, November 2, 2023.", "Shakespeare's works, like 'Hamlet' and 'A Midsummer Night's Dream,' endure in literature.", ] docs_t.insert({'text': doc} for doc in documents) ```
  Inserting rows into \`documents\`: 6 rows \[00:00, 1394.00 rows/s]
  Inserted 6 rows with 0 errors.
  6 rows inserted, 12 values computed.
```python theme={null} # View the embeddings docs_t.select(docs_t.text, docs_t.embedding).head(3) ```
## Multilingual Embeddings Jina AI models excel at multilingual text. The same model can embed text in different languages into the same semantic space. ```python theme={null} # Create a table for multilingual content multilingual_t = pxt.create_table( 'jina_demo.multilingual', {'text': pxt.String, 'language': pxt.String} ) multilingual_t.add_computed_column( embedding=jina.embeddings( multilingual_t.text, model='jina-embeddings-v3', task='text-matching', ) ) # Insert texts in different languages (all about organic skincare) multilingual_t.insert( [ { 'text': 'Organic skincare for sensitive skin with aloe vera and chamomile.', 'language': 'English', }, { 'text': 'Bio-Hautpflege für empfindliche Haut mit Aloe Vera und Kamille.', 'language': 'German', }, { 'text': 'Cuidado de la piel orgánico para piel sensible con aloe vera y manzanilla.', 'language': 'Spanish', }, { 'text': '针对敏感肌专门设计的天然有机护肤产品', 'language': 'Chinese', }, ] ) multilingual_t.select( multilingual_t.language, multilingual_t.text ).collect() ```
  Created table 'multilingual'.
  Added 0 column values with 0 errors.
  Inserting rows into \`multilingual\`: 4 rows \[00:00, 736.23 rows/s]
  Inserted 4 rows with 0 errors.
## Embedding Index for Similarity Search You can use Jina AI embeddings with Pixeltable’s embedding index for efficient similarity search. ```python theme={null} # Create a table with an embedding index search_t = pxt.create_table('jina_demo.search', {'text': pxt.String}) # Add embedding index for similarity search embed_fn = jina.embeddings.using( model='jina-embeddings-v3', task='retrieval.passage' ) search_t.add_embedding_index('text', string_embed=embed_fn) # Insert documents search_t.insert({'text': doc} for doc in documents) ```
  Created table 'search'.
  Inserting rows into \`search\`: 6 rows \[00:00, 565.03 rows/s]
  Inserted 6 rows with 0 errors.
  6 rows inserted, 12 values computed.
```python theme={null} # Perform similarity search sim = search_t.text.similarity( string='What are the health benefits of Mediterranean food?' ) search_t.order_by(sim, asc=False).limit(3).select( search_t.text, score=sim ).collect() ```
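Under the hood, a similarity search like this scores the query's embedding against each stored vector and ranks by the result; cosine similarity is the typical metric. A minimal numpy sketch of that ranking step, using toy 3-dimensional vectors rather than real Jina embeddings:

```python theme={null}
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = [1.0, 0.0, 0.0]
docs = {'diet': [0.9, 0.1, 0.0], 'rivers': [0.0, 1.0, 0.2]}
ranked = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)
print(ranked)  # ['diet', 'rivers']
```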
## Reranking Jina AI’s reranker models can improve search relevance by reordering results based on semantic similarity to the query. ```python theme={null} # Create a table for reranking queries rerank_t = pxt.create_table( 'jina_demo.rerank', {'query': pxt.String, 'documents': pxt.Json}, if_exists='replace', ) # Add computed column for reranking rerank_t.add_computed_column( reranked=jina.rerank( rerank_t.query, rerank_t.documents, model='jina-reranker-v2-base-multilingual', top_n=3, return_documents=True, ) ) # Insert a query with candidate documents rerank_t.insert( query="When is Apple's conference call scheduled?", documents=documents, ) ```
  Created table 'rerank'.
  Added 0 column values with 0 errors.
  Inserting rows into \`rerank\`: 1 rows \[00:00, 543.16 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 2 values computed.
```python theme={null} # View the reranked results result = rerank_t.select(rerank_t.reranked).collect() result['reranked'][0] ```
  \{'usage': \{'total\_tokens': 221},
   'results': \[\{'index': 4,
     'document': "Apple's conference call to discuss fourth fiscal quarter results is scheduled for Thursday, November 2, 2023.",
     'relevance\_score': 0.64511991},
    \{'index': 2,
     'document': '20th-century innovations, from radios to smartphones, centered on electronic advancements.',
     'relevance\_score': 0.03846619},
    \{'index': 5,
     'document': "Shakespeare's works, like 'Hamlet' and 'A Midsummer Night's Dream,' endure in literature.",
     'relevance\_score': 0.02517884}]}
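Because the reranked output is plain JSON, downstream filtering is straightforward. For instance, keeping only results above a relevance threshold; a sketch over a dict shaped like the output above (documents abbreviated, values illustrative):

```python theme={null}
# Illustrative rerank result; field names follow the output shown above.
reranked = {
    'results': [
        {'index': 4, 'document': 'Apple earnings call...', 'relevance_score': 0.645},
        {'index': 2, 'document': '20th-century innovations...', 'relevance_score': 0.038},
        {'index': 5, 'document': "Shakespeare's works...", 'relevance_score': 0.025},
    ]
}

def confident_hits(result: dict, threshold: float = 0.5) -> list:
    # Keep only documents whose relevance score clears the threshold.
    return [
        r['document']
        for r in result['results']
        if r['relevance_score'] >= threshold
    ]

print(confident_hits(reranked))  # ['Apple earnings call...']
```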
## Learn More * [Jina AI Documentation](https://jina.ai/) * [Jina Embeddings](https://jina.ai/embeddings/) * [Jina Reranker](https://jina.ai/reranker/) * [API Rate Limits](https://jina.ai/api-dashboard/rate-limit) # Working with llama.cpp in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-llama-cpp Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. This tutorial demonstrates how to use Pixeltable’s built-in `llama.cpp` integration to run local LLMs efficiently. ### Important notes * Models are automatically downloaded from Hugging Face and cached locally * Different quantization levels are available for performance/quality tradeoffs * Consider memory usage when choosing models and quantizations ## Set up environment First, let’s install Pixeltable with llama.cpp support: ```python theme={null} %pip install -qU pixeltable llama-cpp-python huggingface-hub ``` ## Create a table for chat completions Now let’s create a table that will contain our inputs and responses. ```python theme={null} import pixeltable as pxt from pixeltable.functions import llama_cpp pxt.drop_dir('llama_demo', force=True) pxt.create_dir('llama_demo') t = pxt.create_table('llama_demo/chat', {'input': pxt.String}) ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'llama\_demo'.
  Created table 'chat'.
Next, we add a computed column that calls the Pixeltable `create_chat_completion` UDF, which adapts the corresponding llama.cpp API call. In our examples, we’ll use pretrained models from the Hugging Face repository. llama.cpp makes it easy to do this by specifying a repo\_id (from the URL of the model) and filename from the model repo; the model will then be downloaded and cached automatically. (If this is your first time using Pixeltable, the Pixeltable Fundamentals tutorial contains more details about table creation, computed columns, and UDFs.) For this demo we’ll use `Qwen2.5-0.5B`, a very small (0.5-billion parameter) model that still produces decent results. We’ll use `Q5_K_M` (5-bit) quantization, which gives an excellent balance of quality and efficiency. ```python theme={null} # Add a computed column that uses llama.cpp for chat completion # against the input. messages = [ {'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': t.input}, ] t.add_computed_column( result=llama_cpp.create_chat_completion( messages, repo_id='Qwen/Qwen2.5-0.5B-Instruct-GGUF', repo_filename='*q5_k_m.gguf', ) ) # Extract the output content from the JSON structure returned # by llama_cpp. t.add_computed_column(output=t.result.choices[0].message.content) ```
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
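The quantization level largely determines the model's memory footprint: weights dominate, at roughly (parameter count × bits per weight) / 8 bytes. A back-of-the-envelope helper (a rough estimate only; it ignores the KV cache and runtime overhead):

```python theme={null}
def approx_weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight memory in GiB: params * (bits / 8) bytes each."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return round(bytes_total / 2**30, 2)

# Qwen2.5-0.5B at ~5 bits (Q5_K_M) vs. a 1B model at the same quantization.
print(approx_weight_gib(0.5, 5))  # 0.29 (GiB)
print(approx_weight_gib(1.0, 5))  # 0.58 (GiB)
```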
## Test chat completion Let’s try a simple query: ```python theme={null} # Test with a simple question t.insert( [ {'input': 'What is the capital of France?'}, {'input': 'What are some edible species of fish?'}, {'input': 'Who are the most prominent classical composers?'}, ] ) ```
  Inserted 3 rows with 0 errors in 6.74 s (0.44 rows/s)
  3 rows inserted.
```python theme={null} t.select(t.input, t.output).collect() ```
## Comparing models Local model frameworks like `llama.cpp` make it easy to compare the output of different models. Let’s try comparing the output from `Qwen` against a somewhat larger model, `Llama-3.2-1B`. As always, when we add a new computed column to our table, it’s automatically evaluated against the existing table rows. ```python theme={null} t.add_computed_column( result_l3=llama_cpp.create_chat_completion( messages, repo_id='bartowski/Llama-3.2-1B-Instruct-GGUF', repo_filename='*Q5_K_M.gguf', ) ) t.add_computed_column(output_l3=t.result_l3.choices[0].message.content) t.select(t.input, t.output, t.output_l3).collect() ```
  Added 3 column values with 0 errors in 6.32 s (0.47 rows/s)
  Added 3 column values with 0 errors in 0.03 s (113.79 rows/s)
Just for fun, let’s try running against a different system prompt with a different persona. ```python theme={null} messages_teacher = [ { 'role': 'system', 'content': 'You are a patient school teacher. ' 'Explain concepts simply and clearly.', }, {'role': 'user', 'content': t.input}, ] t.add_computed_column( result_teacher=llama_cpp.create_chat_completion( messages_teacher, repo_id='bartowski/Llama-3.2-1B-Instruct-GGUF', repo_filename='*Q5_K_M.gguf', ) ) t.add_computed_column( output_teacher=t.result_teacher.choices[0].message.content ) t.select(t.input, t.output_teacher).collect() ```
  Added 3 column values with 0 errors in 7.70 s (0.39 rows/s)
  Added 3 column values with 0 errors in 0.02 s (143.54 rows/s)
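Since the `messages` list is ordinary Python data, swapping personas amounts to building the list with a different system prompt. A small helper capturing the pattern used above (a hypothetical convenience, not part of the Pixeltable API; in the table workflow, `user_content` would be a column reference like `t.input`):

```python theme={null}
def persona_messages(system_prompt: str, user_content):
    """Build a chat-completion messages list for a given persona."""
    return [
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': user_content},
    ]

msgs = persona_messages(
    'You are a patient school teacher. '
    'Explain concepts simply and clearly.',
    'What is gravity?',
)
print(msgs[1])  # {'role': 'user', 'content': 'What is gravity?'}
```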
## Additional Resources * [Pixeltable Documentation](https://docs.pixeltable.com/) * [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp) # Working with Mistral AI in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-mistralai Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s Mistral AI integration enables you to access Mistral’s LLM and other models via the Mistral AI API. ### Prerequisites * A Mistral AI account with an API key ([https://console.mistral.ai/api-keys/](https://console.mistral.ai/api-keys/)) ### Important notes * Mistral AI usage may incur costs based on your Mistral AI plan. * Be mindful of sensitive data and consider security measures when integrating with external services. First you’ll need to install required libraries and enter a Mistral AI API key. ```python theme={null} %pip install -qU pixeltable mistralai ``` ```python theme={null} import getpass import os if 'MISTRAL_API_KEY' not in os.environ: os.environ['MISTRAL_API_KEY'] = getpass.getpass('Mistral AI API Key:') ``` Now let’s create a Pixeltable directory to hold the tables for our demo. ```python theme={null} import pixeltable as pxt # Remove the 'mistralai_demo' directory and its contents, if it exists pxt.drop_dir('mistralai_demo', force=True) pxt.create_dir('mistralai_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'mistralai\_demo'.
## Messages Create a Table: In Pixeltable, create a table with columns to represent your input data and the columns where you want to store the results from Mistral. ```python theme={null} from pixeltable.functions.mistralai import chat_completions # Create a table in Pixeltable and add a computed column that calls Mistral AI t = pxt.create_table('mistralai_demo/chat', {'input': pxt.String}) messages = [{'role': 'user', 'content': t.input}] t.add_computed_column( output=chat_completions( messages=messages, model='mistral-small-latest', model_kwargs={ # Optional dict with parameters for the Mistral API 'max_tokens': 300, 'top_p': 0.9, 'temperature': 0.7, }, ) ) ```
  Created table 'chat'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Parse the response into a new column t.add_computed_column(response=t.output.choices[0].message.content) ```
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Start a conversation t.insert( [ { 'input': 'What three species of fish have the highest mercury content?' } ] ) t.select(t.input, t.response).show() ```
  Inserted 1 row with 0 errors in 2.31 s (0.43 rows/s)
### Learn more To learn more about advanced techniques like RAG operations in Pixeltable, check out the [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) tutorial. If you have any questions, don’t hesitate to reach out. # Working with Ollama in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-ollama Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Ollama is a popular platform for local serving of LLMs. In this tutorial, we’ll show how to integrate Ollama models into a Pixeltable workflow. ## Install Ollama You’ll need to have an Ollama server instance to query. There are several ways to do this. ### Running on a local machine If you’re running this notebook on your own machine, running Windows, Mac OS, or Linux, you can install Ollama at: [https://ollama.com/download](https://ollama.com/download) ### Running on Google Colab * OR, if you’re running on Colab, you can install Ollama by uncommenting and running the following code. ```python theme={null} # To install Ollama on colab, uncomment and run the following # three lines (this will also work on a local Linux machine # if you don't already have Ollama installed). # !curl -fsSL https://ollama.com/install.sh | sh # import subprocess # ollama_process = subprocess.Popen(['ollama', 'serve'], stderr=subprocess.PIPE) ``` ### Running on a remote Ollama server * OR, if you have access to an Ollama server running remotely, you can uncomment and run the following line, replacing the default URL with the URL of your remote Ollama instance. ```python theme={null} # To run the notebook against an instance of Ollama running on a # remote server, uncomment the following line and specify the URL. 
# os.environ['OLLAMA_HOST'] = 'http://127.0.0.1:11434' ``` Once you’ve completed the installation, run the following commands to verify that it’s been successfully installed. This may result in an LLM being downloaded, so it may take some time. ```python theme={null} %pip install -qU ollama ``` ```python theme={null} import ollama ollama.pull('qwen2.5:0.5b') ollama.generate('qwen2.5:0.5b', 'What is the capital of Missouri?')[ 'response' ] ```
  'The capital of Missouri is Jefferson City. Jefferson City was originally named after the French explorer Pierre-Jacques Houget and the American statesman Thomas Jefferson, who lived in this city from 1764 to 1805. It became the seat of government for most of Jefferson County when it was established in 1836. In more recent times, the name has changed several times due to various political changes and legal changes.'
## Install Pixeltable Now, let’s install Pixeltable and create a table for the demo. ```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import pixeltable as pxt from pixeltable.functions.ollama import chat pxt.drop_dir('ollama_demo', force=True) pxt.create_dir('ollama_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'ollama\_demo'.
```python theme={null} t = pxt.create_table('ollama_demo/chat', {'input': pxt.String}) messages = [{'role': 'user', 'content': t.input}] # Add a computed column that runs the model to generate responses t.add_computed_column( output=chat( messages=messages, model='qwen2.5:0.5b', # These parameters are optional and can be used to tune model behavior: options={'max_tokens': 300, 'top_p': 0.9, 'temperature': 0.5}, ) ) ```
  Created table 'chat'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Extract the message content into a separate column t.add_computed_column(response=t.output.message.content) ```
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
We can insert our input prompts into the table now. As always, Pixeltable automatically updates the computed columns by calling the relevant Ollama endpoint. ```python theme={null} # Start a conversation t.insert(input='What are the most popular services for LLM inference?') t.select(t.input, t.response).show() ```
  Inserted 1 row with 0 errors in 1.28 s (0.78 rows/s)
### Learn More To learn more about advanced techniques like RAG operations in Pixeltable, check out the [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) tutorial. If you have any questions, don’t hesitate to reach out. # Working with OpenAI in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-openai Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s OpenAI integration enables you to access OpenAI models via the OpenAI API. ### Prerequisites * An OpenAI account with an API key ([https://openai.com/index/openai-api/](https://openai.com/index/openai-api/)) ### Important notes * OpenAI usage may incur costs based on your OpenAI plan. * Be mindful of sensitive data and consider security measures when integrating with external services. First you’ll need to install required libraries and enter your OpenAI API key. ```python theme={null} %pip install -qU pixeltable openai ``` ```python theme={null} import getpass import os if 'OPENAI_API_KEY' not in os.environ: os.environ['OPENAI_API_KEY'] = getpass.getpass( 'Enter your OpenAI API key:' ) ``` Now let’s create a Pixeltable directory to hold the tables for our demo. ```python theme={null} import pixeltable as pxt # Remove the 'openai_demo' directory and its contents, if it exists pxt.drop_dir('openai_demo', force=True) pxt.create_dir('openai_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'openai\_demo'.
## Chat completions Create a Table: In Pixeltable, create a table with columns to represent your input data and the columns where you want to store the results from OpenAI. ```python theme={null} from pixeltable.functions import openai # Create a table in Pixeltable and add a computed column that calls OpenAI t = pxt.create_table('openai_demo/chat', {'input': pxt.String}) messages = [{'role': 'user', 'content': t.input}] t.add_computed_column( output=openai.chat_completions( messages=messages, model='gpt-4o-mini', model_kwargs={ # Optional dict with parameters for the OpenAI API 'max_tokens': 300, 'top_p': 0.9, 'temperature': 0.7, }, ) ) ```
  Created table 'chat'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Parse the response into a new column t.add_computed_column(response=t.output.choices[0].message.content) ```
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
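The JSON path expression `t.output.choices[0].message.content` mirrors ordinary dictionary access on the stored response. A minimal sketch on a hand-written, abbreviated payload (the content string here is made up, not a real API result):

```python
# Abbreviated chat-completions response with a made-up answer
output = {
    'choices': [
        {
            'index': 0,
            'message': {'role': 'assistant', 'content': 'There are about 14 large islands.'},
        }
    ]
}

# t.output.choices[0].message.content corresponds to:
response = output['choices'][0]['message']['content']
print(response)
```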
```python theme={null} # Start a conversation t.insert( [{'input': 'How many islands are in the Aleutian island chain?'}] ) t.select(t.input, t.response).head() ```
  Inserted 1 row with 0 errors in 3.39 s (0.29 rows/s)
## Embeddings Note: the OpenAI Embeddings API is not available with free-tier API keys. ```python theme={null} emb_t = pxt.create_table('openai_demo/embeddings', {'input': pxt.String}) emb_t.add_computed_column( embedding=openai.embeddings( input=emb_t.input, model='text-embedding-3-small' ) ) ```
  Created table 'embeddings'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} emb_t.insert( [{'input': 'OpenAI provides a variety of embeddings models.'}] ) ```
  Inserted 1 row with 0 errors in 1.03 s (0.97 rows/s)
  1 row inserted.
```python theme={null} emb_t.head() ```
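Embeddings come back as numeric vectors, and a common downstream step is comparing them with cosine similarity. A stdlib-only sketch on toy 3-dimensional vectors (real `text-embedding-3-small` vectors have 1536 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embedding outputs
v1 = [0.1, 0.2, 0.7]
v2 = [0.1, 0.25, 0.65]
print(round(cosine_similarity(v1, v2), 4))
```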
## Image generations ```python theme={null} image_t = pxt.create_table('openai_demo/images', {'input': pxt.String}) image_t.add_computed_column( img=openai.image_generations(image_t.input, model='dall-e-2') ) ```
  Created table 'images'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} image_t.insert( [ { 'input': 'A giant Pixel floating in the open ocean in a sea of data' } ] ) ```
  Inserted 1 row with 0 errors in 11.59 s (0.09 rows/s)
  1 row inserted.
```python theme={null} image_t ```
```python theme={null} image_t.head() ```
## Audio Transcription ```python theme={null} audio_t = pxt.create_table('openai_demo/audio', {'input': pxt.Audio}) audio_t.add_computed_column( result=openai.transcriptions( audio_t.input, model='whisper-1', model_kwargs={ 'language': 'en', 'prompt': 'Transcribe the contents of this recording.', }, ) ) ```
  Created table 'audio'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} url = ( 'https://github.com/pixeltable/pixeltable/raw/release/tests/data/audio/' 'jfk_1961_0109_cityuponahill-excerpt.flac' ) audio_t.insert([{'input': url}]) ```
  Inserted 1 row with 0 errors in 5.42 s (0.18 rows/s)
  1 row inserted.
```python theme={null} audio_t.head() ```
```python theme={null} audio_t.head()[0]['result']['text'] ```
  'Allow me to illustrate. During the last 60 days, I have been at the task of constructing an administration. It has been a long and deliberate process. Some have counseled greater speed. Others have counseled more expedient tests. But I have been guided by the standard John Winthrop set before his shipmates on the flagship Arabella 331 years ago, as they too faced the task of building a new government on a perilous frontier. We must always consider, he said, that we shall be as a city upon a hill. The eyes of all peoples are upon us. Today the eyes of all people are truly upon us. And our governments, in every branch, at every level,'
### Learn more To learn more about advanced techniques like RAG operations in Pixeltable, check out the [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) tutorial. If you have any questions, don’t hesitate to reach out. # Working with OpenRouter in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-openrouter Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s OpenRouter integration enables you to access multiple LLM providers through a unified API via OpenRouter. ### Prerequisites * An OpenRouter account with an API key ([https://openrouter.ai](https://openrouter.ai)) ### Important notes * OpenRouter usage may incur costs based on the models you use and your usage volume. * Be mindful of sensitive data and consider security measures when integrating with external services. First you’ll need to install required libraries and enter your OpenRouter API key. ```python theme={null} %pip install -qU pixeltable openai ``` ```python theme={null} import getpass import os if 'OPENROUTER_API_KEY' not in os.environ: os.environ['OPENROUTER_API_KEY'] = getpass.getpass( 'Enter your OpenRouter API key:' ) ``` Now let’s create a Pixeltable directory to hold the tables for our demo. ```python theme={null} import pixeltable as pxt # Remove the 'openrouter_demo' directory and its contents, if it exists pxt.drop_dir('openrouter_demo', force=True) pxt.create_dir('openrouter_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'openrouter\_demo'.
## Chat completions Create a Table: In Pixeltable, create a table with columns to represent your input data and the columns where you want to store the results from OpenRouter. ```python theme={null} from pixeltable.functions import openrouter # Create a table in Pixeltable and add a computed column that calls OpenRouter t = pxt.create_table('openrouter_demo/chat', {'input': pxt.String}) messages = [{'role': 'user', 'content': t.input}] t.add_computed_column( output=openrouter.chat_completions( messages=messages, model='anthropic/claude-sonnet-4', model_kwargs={ # Optional dict with parameters compatible with the model 'max_tokens': 300, 'temperature': 0.7, }, ) ) ```
  Created table 'chat'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Parse the response into a new column t.add_computed_column(response=t.output.choices[0].message.content) ```
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Start a conversation t.insert( [ {'input': 'How many species of felids have been classified?'}, {'input': 'Can you make me a coffee?'}, ] ) t.select(t.input, t.response).head() ```
  Inserted 2 rows with 0 errors in 7.59 s (0.26 rows/s)
## Using different models One of OpenRouter’s key benefits is easy access to models from multiple providers. Let’s create a table that compares responses from Anthropic Claude, OpenAI GPT-4, and Meta Llama. ```python theme={null} # Create a table to compare different models compare_t = pxt.create_table( 'openrouter_demo/compare_models', {'prompt': pxt.String} ) messages = [{'role': 'user', 'content': compare_t.prompt}] # Add responses from different models compare_t.add_computed_column( claude=openrouter.chat_completions( messages=messages, model='anthropic/claude-sonnet-4', model_kwargs={'max_tokens': 150}, ) .choices[0] .message.content ) compare_t.add_computed_column( gpt4=openrouter.chat_completions( messages=messages, model='openai/gpt-4o-mini', model_kwargs={'max_tokens': 150}, ) .choices[0] .message.content ) compare_t.add_computed_column( llama=openrouter.chat_completions( messages=messages, model='meta-llama/llama-3.3-70b-instruct', model_kwargs={'max_tokens': 150}, ) .choices[0] .message.content ) ```
  Created table 'compare\_models'.
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Insert a prompt and compare responses compare_t.insert( [{'prompt': 'Explain quantum entanglement in one sentence.'}] ) compare_t.select( compare_t.prompt, compare_t.claude, compare_t.gpt4, compare_t.llama ).head() ```
  Inserted 1 row with 0 errors in 1.27 s (0.79 rows/s)
## Advanced features: provider routing OpenRouter allows you to specify provider preferences for fallback behavior and cost optimization. ```python theme={null} # Create a table with provider routing routing_t = pxt.create_table( 'openrouter_demo/routing', {'input': pxt.String} ) messages = [{'role': 'user', 'content': routing_t.input}] routing_t.add_computed_column( output=openrouter.chat_completions( messages=messages, model='anthropic/claude-sonnet-4', model_kwargs={'max_tokens': 300}, # Specify provider preferences provider={ 'order': [ 'Anthropic', 'OpenAI', ], # Try Anthropic first, then OpenAI 'allow_fallbacks': True, }, ) ) routing_t.add_computed_column( response=routing_t.output.choices[0].message.content ) ```
  Created table 'routing'.
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} routing_t.insert([{'input': 'What are the primary colors?'}]) routing_t.select(routing_t.input, routing_t.response).head() ```
  Inserted 1 row with 0 errors in 3.97 s (0.25 rows/s)
## Advanced Features: Context Window Optimization OpenRouter supports transforms like ‘middle-out’ to optimize handling of long contexts. ```python theme={null} # Create a table with transforms for long context optimization transform_t = pxt.create_table( 'openrouter_demo/transforms', {'long_context': pxt.String} ) messages = [{'role': 'user', 'content': transform_t.long_context}] transform_t.add_computed_column( output=openrouter.chat_completions( messages=messages, model='openai/gpt-4o-mini', model_kwargs={'max_tokens': 200}, # Apply middle-out transform for better long context handling transforms=['middle-out'], ) ) transform_t.add_computed_column( response=transform_t.output.choices[0].message.content ) ```
  Created table 'transforms'.
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Example with longer context long_text = """ Artificial intelligence has transformed many industries. Machine learning algorithms can now detect patterns in data that humans might miss. Deep learning has revolutionized computer vision and natural language processing. The future of AI looks promising with developments in areas like reinforcement learning and generative models. Question: What are the main AI developments mentioned? """ transform_t.insert([{'long_context': long_text}]) transform_t.select(transform_t.response).head() ```
  Inserted 1 row with 0 errors in 1.82 s (0.55 rows/s)
### Learn more To learn more about advanced techniques like RAG operations in Pixeltable, check out the [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) tutorial. For more information about OpenRouter’s features and available models, visit: * [OpenRouter Documentation](https://openrouter.ai/docs) * [Available Models](https://openrouter.ai/models) If you have any questions, don’t hesitate to reach out. # Working with Pydantic in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-pydantic Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s Pydantic integration enables type-safe data insertion using Pydantic models. Instead of inserting raw dictionaries, you can define structured models with validation and insert them directly into Pixeltable tables. ### Benefits * **Type Safety**: Pydantic validates data before insertion * **IDE Support**: Autocomplete and type hints for your data * **Self-Documenting**: Models serve as schema documentation * **Validation**: Built-in data validation via Pydantic ### Important notes * Pydantic model fields map to Pixeltable columns by name * Computed columns are automatically skipped during insertion * Nested Pydantic models map to JSON columns ```python theme={null} %pip install -qU pixeltable pydantic ``` ```python theme={null} import pixeltable as pxt pxt.drop_dir('pydantic_demo', force=True) pxt.create_dir('pydantic_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'pydantic\_demo'.
## Basic usage: scalar types Define a Pydantic model with fields that match your table columns. Pixeltable automatically maps Python types to Pixeltable types:
```python theme={null} import datetime import pydantic from enum import Enum from typing import Literal # Define an enum for product categories class Category(Enum): ELECTRONICS = 1 CLOTHING = 2 BOOKS = 3 # Define a Pydantic model class Product(pydantic.BaseModel): name: str price: float in_stock: bool category: Category rating: Literal['poor', 'average', 'good', 'excellent'] created_at: datetime.datetime description: str | None = None # Optional field ``` ```python theme={null} # Create a table with matching schema products = pxt.create_table( 'pydantic_demo/products', { 'name': pxt.Required[pxt.String], 'price': pxt.Required[pxt.Float], 'in_stock': pxt.Required[pxt.Bool], 'category': pxt.Required[pxt.Int], # Enum values are integers 'rating': pxt.Required[pxt.String], # Literal values 'created_at': pxt.Required[pxt.Timestamp], 'description': pxt.String, # Nullable }, ) ```
  Created table 'products'.
```python theme={null} # Create Pydantic model instances now = datetime.datetime.now() product_data = [ Product( name='Wireless Headphones', price=79.99, in_stock=True, category=Category.ELECTRONICS, rating='excellent', created_at=now, description='High-quality wireless headphones with noise cancellation', ), Product( name='Python Cookbook', price=49.99, in_stock=True, category=Category.BOOKS, rating='good', created_at=now, ), Product( name='Running Shoes', price=129.99, in_stock=False, category=Category.CLOTHING, rating='average', created_at=now, description='Lightweight running shoes', ), ] # Insert Pydantic models directly products.insert(product_data) products.collect() ```
  Inserted 3 rows with 0 errors in 0.02 s (146.18 rows/s)
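The "validation before insertion" benefit is easy to see in isolation. A minimal sketch using a trimmed-down `ProductSketch` model (a hypothetical model independent of any Pixeltable table): Pydantic rejects a malformed record at construction time, before it could ever reach `insert()`.

```python
import pydantic

class ProductSketch(pydantic.BaseModel):
    name: str
    price: float
    in_stock: bool

# A bad record is rejected when the model instance is constructed
try:
    ProductSketch(name='Widget', price='not-a-number', in_stock=True)
except pydantic.ValidationError as err:
    print(f'rejected: {len(err.errors())} validation error(s)')
```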
## Nested models and JSON columns Nested Pydantic models automatically map to Pixeltable JSON columns. This is useful for storing structured metadata. ```python theme={null} # Define nested models class Address(pydantic.BaseModel): street: str city: str country: str zip_code: str class ContactInfo(pydantic.BaseModel): email: str phone: str | None = None address: Address class Customer(pydantic.BaseModel): customer_id: str name: str contact: ContactInfo # Nested model → JSON column ``` ```python theme={null} # Create table with JSON column for nested data customers = pxt.create_table( 'pydantic_demo/customers', { 'customer_id': pxt.Required[pxt.String], 'name': pxt.Required[pxt.String], 'contact': pxt.Required[pxt.Json], # Nested model stored as JSON }, ) ```
  Created table 'customers'.
```python theme={null} # Insert nested data customer_data = [ Customer( customer_id='C001', name='Alice Johnson', contact=ContactInfo( email='alice@example.com', phone='+1-555-0101', address=Address( street='123 Main St', city='San Francisco', country='USA', zip_code='94102', ), ), ), Customer( customer_id='C002', name='Bob Smith', contact=ContactInfo( email='bob@example.com', address=Address( street='456 Oak Ave', city='New York', country='USA', zip_code='10001', ), ), ), ] customers.insert(customer_data) customers.collect() ```
  Inserted 2 rows with 0 errors in 0.01 s (227.55 rows/s)
```python theme={null} # Query nested JSON fields using Pixeltable's JSON path syntax customers.select( customers.name, email=customers.contact.email, city=customers.contact.address.city, ).collect() ```
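The path expressions `customers.contact.email` and `customers.contact.address.city` read the serialized JSON the same way plain dictionary traversal would. A sketch on the stored value for the first row:

```python
# The 'contact' column stores the nested ContactInfo model as JSON, e.g.:
contact = {
    'email': 'alice@example.com',
    'phone': '+1-555-0101',
    'address': {
        'street': '123 Main St',
        'city': 'San Francisco',
        'country': 'USA',
        'zip_code': '94102',
    },
}

# customers.contact.email         -> contact['email']
# customers.contact.address.city  -> contact['address']['city']
print(contact['email'], contact['address']['city'])
```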
## Media files with Pydantic For media columns (Image, Video, Audio, Document), use `str` or `Path` fields in your Pydantic model to specify file paths or URLs. ```python theme={null} from pathlib import Path class ImageRecord(pydantic.BaseModel): title: str image_url: str # URLs or file paths as strings tags: list[str] # Create table with Image column images = pxt.create_table( 'pydantic_demo/images', { 'title': pxt.Required[pxt.String], 'image_url': pxt.Required[pxt.Image], # Media column 'tags': pxt.Required[pxt.Json], }, ) ```
  Created table 'images'.
```python theme={null} # Insert image records with URLs base_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images' image_data = [ ImageRecord( title='Sample Image', image_url=f'{base_url}/000000000036.jpg', tags=['sample', 'test', 'image'], ) ] images.insert(image_data) images.select(images.title, images.image_url, images.tags).collect() ```
  Inserted 1 row with 0 errors in 0.27 s (3.74 rows/s)
## Working with Computed Columns Pydantic models work seamlessly with computed columns. Simply omit computed-column fields from your model; Pixeltable will skip them during insertion. ```python theme={null} # Model only includes input columns class Article(pydantic.BaseModel): title: str content: str # Create table with computed column articles = pxt.create_table( 'pydantic_demo/articles', { 'title': pxt.Required[pxt.String], 'content': pxt.Required[pxt.String], }, ) # Add a computed column articles.add_computed_column( word_count=articles.content.apply( lambda x: len(x.split()), col_type=pxt.Int ) ) ```
  Created table 'articles'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Insert data - computed columns are automatically calculated article_data = [ Article( title='Getting Started with Pixeltable', content='Pixeltable is a powerful tool for building AI applications. It provides automatic versioning and incremental computation.', ), Article( title='Type Safety in Python', content='Using Pydantic with Pixeltable provides type safety and validation for your data pipelines.', ), ] articles.insert(article_data) articles.select(articles.title, articles.word_count).collect() ```
  Inserted 2 rows with 0 errors in 0.01 s (186.43 rows/s)
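The `word_count` computed column counts whitespace-separated words. The same logic in plain Python, applied to one of the articles above:

```python
def word_count(text: str) -> int:
    # Same logic as the computed column: split on whitespace and count
    return len(text.split())

content = (
    'Using Pydantic with Pixeltable provides type safety and '
    'validation for your data pipelines.'
)
print(word_count(content))
```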
## Optional Fields and Defaults Pydantic’s optional fields with defaults work naturally with Pixeltable’s nullable columns. ```python theme={null} class Task(pydantic.BaseModel): title: str priority: int = 1 # Default value due_date: datetime.datetime | None = None # Optional notes: str | None = None # Optional tasks = pxt.create_table( 'pydantic_demo/tasks', { 'title': pxt.Required[pxt.String], 'priority': pxt.Required[pxt.Int], 'due_date': pxt.Timestamp, # Nullable 'notes': pxt.String, # Nullable }, ) # Insert with and without optional fields tasks.insert( [ Task( title='Complete project', priority=3, due_date=datetime.datetime(2025, 12, 31), ), Task( title='Review code' ), # Uses default priority=1, None for optionals Task(title='Write docs', notes='Include examples'), ] ) tasks.collect() ```
  Created table 'tasks'.
  Inserted 3 rows with 0 errors in 0.01 s (408.88 rows/s)
## Type Mapping Reference Here’s the mapping between Pydantic/Python types and Pixeltable types, as demonstrated in the examples above:

| Python / Pydantic type | Pixeltable type |
| --- | --- |
| `str` | `pxt.String` |
| `int` | `pxt.Int` |
| `float` | `pxt.Float` |
| `bool` | `pxt.Bool` |
| `datetime.datetime` | `pxt.Timestamp` |
| `Enum` | `pxt.Int` (the enum's value) |
| `Literal[...]` | the type of the literal values (e.g. `pxt.String`) |
| nested `pydantic.BaseModel`, `list`, `dict` | `pxt.Json` |
| `str` or `pathlib.Path` (file path or URL) | media columns (`pxt.Image`, `pxt.Video`, `pxt.Audio`, `pxt.Document`) |
| `X \| None = None` | nullable column |
## Learn More For more information about working with Pydantic in Pixeltable: * [Pixeltable Documentation](https://docs.pixeltable.com) * [Pydantic Documentation](https://docs.pydantic.dev) * [Type Safety Blog Post](https://www.pixeltable.com/blog/pydantic-integration-type-safety) If you have any questions, don’t hesitate to reach out on [Discord](https://discord.com/invite/QPyqFYx2UN). # Working with Replicate in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-replicate Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s Replicate integration enables you to access Replicate’s models via the Replicate API. ### Prerequisites * A Replicate account with an API token. ### Important notes * Replicate usage may incur costs based on your Replicate plan. * Be mindful of sensitive data and consider security measures when integrating with external services. First you’ll need to install required libraries and enter a Replicate API token. ```python theme={null} %pip install -qU pixeltable replicate ``` ```python theme={null} import getpass import os if 'REPLICATE_API_TOKEN' not in os.environ: os.environ['REPLICATE_API_TOKEN'] = getpass.getpass( 'Replicate API Token:' ) ``` Now let’s create a Pixeltable directory to hold the tables for our demo. ```python theme={null} import pixeltable as pxt # Remove the `replicate_demo` directory and its contents, if it exists pxt.drop_dir('replicate_demo', force=True) pxt.create_dir('replicate_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'replicate\_demo'.
## Chat completions Create a Table: In Pixeltable, create a table with columns to represent your input data and the columns where you want to store the results from Replicate. ```python theme={null} from pixeltable.functions.replicate import run # Create a table in Pixeltable and pick a model hosted on Replicate with some parameters t = pxt.create_table('replicate_demo/chat', {'prompt': pxt.String}) input = { 'system_prompt': 'You are a helpful assistant.', 'prompt': t.prompt, # These parameters are optional and can be used to tune model behavior: 'max_tokens': 300, 'top_p': 0.9, 'temperature': 0.8, } t.add_computed_column( output=run(input, ref='meta/meta-llama-3-8b-instruct') ) ```
  Created table 'chat'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} # Parse the response into a new column t.add_computed_column(response=pxt.functions.string.join('', t.output)) ```
  Added 0 column values with 0 errors in 0.02 s
  No rows affected.
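Replicate streams LLM output as a list of text chunks, which is why the `response` column joins them into a single string. The equivalent plain-Python operation, on made-up chunks:

```python
# Hypothetical streamed chunks as returned by a Replicate language model
chunks = ['Foods rich in selenium include ', 'Brazil nuts, seafood, ', 'and eggs.']

# pxt.functions.string.join('', t.output) corresponds to:
response = ''.join(chunks)
print(response)
```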
```python theme={null} # Start a conversation t.insert([{'prompt': 'What foods are rich in selenium?'}]) t.select(t.prompt, t.response).show() ```
  Inserted 1 row with 0 errors in 4.45 s (0.22 rows/s)
## Image generation Here’s an example that shows how to use Replicate’s image generation models with Pixeltable. We’ll use the FLUX Schnell model. ```python theme={null} t = pxt.create_table('replicate_demo/images', {'prompt': pxt.String}) input = {'prompt': t.prompt, 'go_fast': True, 'megapixels': '1'} t.add_computed_column( output=run(input, ref='black-forest-labs/flux-schnell') ) ```
  Created table 'images'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} t.insert( [ { 'prompt': 'Draw a pencil sketch of a friendly dinosaur playing tennis in a cornfield.' } ] ) ```
  Inserted 1 row with 0 errors in 0.99 s (1.01 rows/s)
  1 row inserted.
```python theme={null} t.select(t.prompt, t.output).collect() ```
We see that Replicate returns our image as an array containing a single URL. To turn it into an actual image, we cast the string to type `pxt.Image` in a new computed column: ```python theme={null} t.add_computed_column(image=t.output[0].astype(pxt.Image)) t.select(t.image).collect() ```
  Added 1 column value with 0 errors in 0.02 s (53.36 rows/s)
### Learn more To learn more about advanced techniques like RAG operations in Pixeltable, check out the [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) tutorial. If you have any questions, don’t hesitate to reach out. # Working with Reve in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-reve Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s Reve integration lets you call Reve’s `create`, `edit`, and `remix` endpoints directly from tables so you can iterate on visuals without leaving your data workflows. ## What is Reve? Reve is an image generation/editing service with three API endpoints: * **`create`**: Generate new images from text prompts * **`edit`**: Edit existing images with natural language instructions * **`remix`**: Blend multiple images together ### Documentation * [Pixeltable Reve Functions](/sdk/latest/reve#module-pixeltable-functions-reve) * [Reve API Reference](https://api.reve.com/console/docs) ## Prerequisites * A Reve account with an API key (see [https://api.reve.com/](https://api.reve.com/) for instructions) **Important:** Reve API calls consume credits based on your plan—monitor your usage to avoid unexpected charges. Images sent to Reve are processed on Reve’s servers outside your environment, so do not upload sensitive, private, or confidential images. We’ll start by installing Pixeltable, configuring your API key, creating a directory, and setting up a table. Then we’ll walk through each Reve endpoint—`create`, `edit`, and `remix`—one at a time. 
```python theme={null} %pip install -qU pixeltable ``` ```python theme={null} import getpass import os if 'REVE_API_KEY' not in os.environ: os.environ['REVE_API_KEY'] = getpass.getpass('Reve API Key: ') ``` To read more about working with API keys in Pixeltable, see [Configuration](/platform/configuration). ## Setup ```python theme={null} import pixeltable as pxt ``` Create a Pixeltable directory to keep the tables for this demo separate from anything else you’re working on. ```python theme={null} pxt.create_dir('reve_demo', if_exists='replace_force') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/alison-pxt/.pixeltable/pgdata
  Created directory 'reve\_demo'.
We’ll create a Pixeltable table that starts with a prompt and a source image, and ends with a final scene. The finished table requires two inputs per row: 1. A prompt for a background scene image, which Reve will render with `reve.create()`. 2. An existing source image, which Reve will edit with `reve.edit()` to produce the foreground. Finally, we’ll remix the two with `reve.remix()`, placing the foreground from step 2 into the background scene from step 1. ```python theme={null} spunk_t = pxt.create_table( 'reve_demo/solarpunk_scenes', {'prompt': pxt.String, 'source_image': pxt.Image}, ) ```
  Created table 'solarpunk\_scenes'.
To read more about creating tables, see [Tables and Data Operations](/tutorials/tables-and-data-operations). You can look at the schema for this table: ```python theme={null} spunk_t.describe() ```
Now, we’ll insert values for our first row. We need to provide a text prompt for the `reve.create()` function and a source image for the `reve.edit()` function. ```python theme={null} scene_prompt = ( 'Create a scene of a lush solarpunk metropolis in the desert ' 'with urban agriculture and an oasis theme. ' 'It should not look like an office park, corporate campus, or an outdoor mall.' ) image_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg' ``` ```python theme={null} spunk_t.insert([{'prompt': scene_prompt, 'source_image': image_url}]) ```
  Inserted 1 row with 0 errors in 0.03 s (39.10 rows/s)
  1 row inserted.
To read more about inserting data, see [Bringing Data](/howto/cookbooks/data/data-import-csv). And we can peek at our starter table with a single row: ```python theme={null} spunk_t.collect() ```
## Generate new imagery with Reve Create Use `reve.create()` when you want Reve to synthesize an entirely new image from a prompt. In Pixeltable, we place this function call inside a computed column. We’ll generate fresh imagery from the prompt first in this section. Feel free to change the prompt. Here we ask for a solarpunk oasis city. ```python theme={null} from pixeltable.functions import reve spunk_t.add_computed_column( new_image=reve.create(spunk_t.prompt), if_exists='replace' ) ```
  Added 1 column value with 0 errors in 6.16 s (0.16 rows/s)
  1 row updated.
To read more about computed columns in Pixeltable, see [Computed Columns](/tutorials/computed-columns). ```python theme={null} spunk_t.select(spunk_t.prompt, spunk_t.new_image).collect() ```
By default, Pixeltable saves all generated media outputs to a media directory. We can see the file path by using the `fileurl` property. ```python theme={null} spunk_t.select(spunk_t.new_image.fileurl).collect() ```
### Add Reve parameters All Reve functions accept optional parameters to customize the output: * `aspect_ratio`: desired image aspect ratio, e.g. ‘3:2’, ‘16:9’, ‘1:1’, etc. (available for `reve.create()` and `reve.remix()`) * `version`: specific model version to use (optional; defaults to latest if not specified). Available for all Reve functions (`reve.create()`, `reve.edit()`, and `reve.remix()`) The following adds a second image column that uses the same prompt but renders in a square frame. ```python theme={null} spunk_t.add_computed_column( new_image_sq=reve.create(spunk_t.prompt, aspect_ratio='1:1'), if_exists='replace', ) ```
  Added 1 column value with 0 errors in 6.22 s (0.16 rows/s)
  1 row updated.
```python theme={null} spunk_t.select( spunk_t.prompt, spunk_t.new_image, spunk_t.new_image_sq ).collect() ```
To read more about `reve.create()`, see [reve.create UDF](/sdk/latest/reve#udf-create). ## Edit an existing photo with Reve Edit `reve.edit()` takes an existing image plus natural-language instructions and returns an edited version. We already have a `source_image` column in our table from the initial setup. ```python theme={null} spunk_t.select(spunk_t.source_image).collect() ```
We can now add a computed column that calls `reve.edit()` to modify the source image. To read more about `reve.edit()`, see [reve.edit UDF](/sdk/latest/reve#udf-edit). This editing prompt is embedded in the computed-column logic itself, unlike the create example above, where we stored the prompt in its own column. This means the same editing prompt applies to any new rows we insert into this table, so we’ll phrase it to reflect the table’s solarpunk theme but otherwise keep it general; that way, we don’t need to write a specific prompt for every new row. ```python theme={null} # Uncomment the line below if you have not already imported the Reve functions # from pixeltable.functions import reve spunk_t.add_computed_column( edited_subject=reve.edit( spunk_t.source_image, 'Remove any existing background. Focus on the closest person in the foreground. ' 'Keep the person and props, but make the lighting and colors vibrant and fit with a solarpunk theme. ' 'Make the background behind the person blank.', ), if_exists='replace', ) ```
  Added 1 column value with 0 errors in 16.54 s (0.06 rows/s)
  1 row updated.
We can use `collect()` to see the new image: ```python theme={null} spunk_t.select(spunk_t.source_image, spunk_t.edited_subject).collect() ```
## Remix multiple references with Reve Remix `reve.remix()` blends multiple reference images. Inside the prompt string, you reference each image with a numbered placeholder: * `0` refers to `images[0]` * `1` refers to `images[1]` * etc. You can optionally specify `aspect_ratio` and `version` parameters (both default to latest/auto if not specified). In the next cell we place the edited subject from `0` (the first entry in the images list) into the scene from `1` (the second entry). ```python theme={null} # Uncomment the below line to use a Reve function, if you have not already done so # from pixeltable.functions import reve spunk_t.add_computed_column( solarpunk_remix=reve.remix( 'Place the person in 0 in the foreground of the scene from 1. ' 'Make the background clear and detailed so it feels like a complete "day in the life" in solarpunk city scene.', images=[spunk_t.edited_subject, spunk_t.new_image], aspect_ratio='16:9', ), if_exists='replace', ) ```
  Added 1 column value with 0 errors in 18.58 s (0.05 rows/s)
  1 row updated.
To read more about `reve.remix()`, see [reve.remix UDF](/sdk/latest/reve#udf-remix). ```python theme={null} spunk_t.select(spunk_t.solarpunk_remix).collect() ```
## Insert a new row So far, we have been building up our table schema with a single row. Now we’ll insert a new row, with two fresh input values: 1. A text prompt to create the scene image with `reve.create()` and 2. A source image to edit with `reve.edit()` and remix into that scene with `reve.remix()`. Pixeltable will then automatically make the desired Reve API calls and populate the computed columns. ```python theme={null} spunk_t.insert( [ { 'prompt': 'Create an indoor tennis court scene, with clay courts inside a lush solarpunk greenhouse filled with bougainvillea, terraced gardens, and an oasis theme.', 'source_image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000885.jpg', } ] ) ```
  Inserted 1 row with 0 errors in 34.30 s (0.03 rows/s)
  1 row inserted.
Now we can inspect both outputs: `insert()` triggers the computed columns to populate any missing values for the new row, while previously generated images are left unchanged, because Pixeltable updates incrementally. For example, here is our inserted image and our edited image: ```python theme={null} spunk_t.select(spunk_t.source_image, spunk_t.edited_subject).collect() ```
Here are our two remixed images created by Reve: ```python theme={null} spunk_t.select(spunk_t.solarpunk_remix).collect() ```
All together, we created a new scene image, edited an existing image of a person, then remixed both together to reimagine an existing person in our new scene. ```python theme={null} spunk_t.select( spunk_t.new_image, spunk_t.edited_subject, spunk_t.solarpunk_remix ).collect() ```
## Review Reve in Pixeltable Below is a quick recap of how each Reve function maps inputs to outputs inside Pixeltable tables. Each function reads input parameters and writes its results into computed columns. ### Reve Create * **Input parameter:** A prompt inserted as a row in a Pixeltable table ```python theme={null} spunk_t.select( spunk_t.prompt, spunk_t.new_image, spunk_t.new_image_sq ).collect() ```
### Reve Edit * **Input parameter:** A source image of type `pxt.Image` * **Usage reminder:** The edit instructions live inline inside the `add_computed_column()` call ```python theme={null} spunk_t.select(spunk_t.source_image, spunk_t.edited_subject).collect() ```
### Reve Remix * **Input parameters:** We started with two image columns * **How the prompt references them:** * `images=[my_table.image00, my_table.image01]` * Inside the prompt, `0` points at `images[0]` and `1` points at `images[1]` * **Usage reminder:** Always keep the placeholders and the order of the `images` list in sync; use additional numbered placeholders if you pass more reference images. ```python theme={null} spunk_t.select( spunk_t.new_image, spunk_t.edited_subject, spunk_t.solarpunk_remix ).collect() ```
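Keeping placeholders and the `images` list aligned is easy to get wrong as prompts grow. Below is a small illustrative helper (our own sketch, not part of Pixeltable or the Reve API) that checks a remix prompt against the number of images supplied, assuming the bare-digit placeholder convention used in this tutorial:

```python theme={null}
import re

def check_remix_prompt(prompt: str, num_images: int) -> list[str]:
    """Return a list of problems: placeholders with no matching image,
    or images never referenced in the prompt."""
    referenced = {int(n) for n in re.findall(r'\b(\d+)\b', prompt)}
    problems = []
    for idx in sorted(referenced):
        if idx >= num_images:
            problems.append(f'placeholder {idx} has no image (only {num_images} given)')
    for idx in range(num_images):
        if idx not in referenced:
            problems.append(f'images[{idx}] is never referenced in the prompt')
    return problems

# The prompt references 0 and 1, but only one image is supplied:
print(check_remix_prompt('Place the person in 0 into the scene from 1.', num_images=1))
# → ['placeholder 1 has no image (only 1 given)']
```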
## Learn more * Reve API reference: [https://api.reve.com/console/docs](https://api.reve.com/console/docs) * Pixeltable documentation: [https://docs.pixeltable.com/sdk/latest/reve#module-pixeltable-functions-reve](/sdk/latest/reve#module-pixeltable-functions-reve) If you build something with Reve, let us know! # Working with Tigris in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-tigris Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. This tutorial demonstrates how to configure Pixeltable to use [Tigris](https://tigrisdata.com) for storage. This lets you store any number of images in Tigris’ global data plane, so your images load fast everywhere. ## Prerequisites * A Tigris account, bucket, and access keypair ([https://storage.new](https://storage.new)) ## Important notes * Tigris usage may incur costs based on your plan. * Be mindful of sensitive data and consider security measures when integrating with external services. First you need to install the required libraries and enter a Tigris access keypair obtained via the Tigris Admin Console. 
## Set up environment First, let’s install Pixeltable: ```python theme={null} %pip install -qU pixeltable boto3 datasets ``` ## Configure authentication Enter your Tigris credentials: ```python theme={null} import os from getpass import getpass os.environ['AWS_ACCESS_KEY_ID'] = getpass('Tigris access key ID') os.environ['AWS_SECRET_ACCESS_KEY'] = getpass('Tigris secret access key') bucket_name = getpass('Tigris bucket name') os.environ['AWS_ENDPOINT_URL_S3'] = 'https://t3.storage.dev' os.environ['AWS_REGION'] = 'auto' os.environ['PIXELTABLE_INPUT_MEDIA_DEST'] = f's3://{bucket_name}/input/' os.environ['PIXELTABLE_OUTPUT_MEDIA_DEST'] = f's3://{bucket_name}/output/' ``` ## Create a table for images Now let’s create a table that will contain images from the [XeIaso/botw-screenshots-captioned](https://huggingface.co/datasets/XeIaso/botw-screenshots-captioned) dataset: ```python theme={null} import pixeltable as pxt from datasets import load_dataset # Create directory for this demo pxt.drop_dir('tigris', force=True) pxt.create_dir('tigris', if_exists='replace') # Load the dataset ds = load_dataset('XeIaso/botw-screenshots-captioned') # Import it into Pixeltable as 'screenshots' pxt.drop_table('tigris/screenshots', force=True) screenshots = pxt.create_table( 'tigris/screenshots', source=ds, if_exists='replace' ) ```
  Created directory 'tigris'.
  Created table 'screenshots'.
  Inserting rows into \`screenshots\`: 100 rows \[00:01, 51.72 rows/s]
  Inserting rows into \`screenshots\`: 100 rows \[00:01, 55.57 rows/s]
  Inserting rows into \`screenshots\`: 100 rows \[00:01, 52.74 rows/s]
  Inserting rows into \`screenshots\`: 100 rows \[00:02, 33.96 rows/s]
  Inserting rows into \`screenshots\`: 100 rows \[00:02, 42.64 rows/s]
  Inserting rows into \`screenshots\`: 100 rows \[00:02, 39.65 rows/s]
  Inserting rows into \`screenshots\`: 100 rows \[00:02, 47.36 rows/s]
  Inserting rows into \`screenshots\`: 28 rows \[00:00, 6786.12 rows/s]
  Inserted 728 rows with 0 errors.
Once the import is done, you can create thumbnails with a [computed column](/tutorials/computed-columns): ```python theme={null} # Add a computed column for thumbnails # Uses output_media_dest by default, or specify a custom destination screenshots.add_computed_column( thumbnail=screenshots.image.resize((256, 256)), destination=f's3://{bucket_name}/botw-screenshots/thumbnails/', ) ```
  Added 728 column values with 0 errors.
  728 rows updated, 728 values computed.
Then inspect the result with the `collect()` method: ```python theme={null} results = screenshots.limit(1).collect() results ```
## Getting URLs for your files When your files are in object storage, you can get URLs that point directly to them. These URLs work in HTML, APIs, or any application you need to serve media with. Fetch them with the `.fileurl` property: ```python theme={null} screenshots.select( image=screenshots.image, image_url=screenshots.image.fileurl, thumbnail=screenshots.thumbnail, thumbnail_url=screenshots.thumbnail.fileurl, ).limit(1).collect() ```
## Generating Presigned URLs For private buckets or when you need time-limited access to files, use presigned URLs. These are temporary, authenticated URLs that allow anyone to access your files for a limited time without needing credentials. Use the `presigned_url` function from `pixeltable.functions.net`: ```python theme={null} from pixeltable.functions import net # Generate presigned URLs with 1-hour expiration (3600 seconds) screenshots.select( image=screenshots.image, image_url=screenshots.image.fileurl, image_presigned=net.presigned_url(screenshots.image.fileurl, 3600), thumbnail=screenshots.thumbnail, thumbnail_url=screenshots.thumbnail.fileurl, thumbnail_presigned=net.presigned_url( screenshots.thumbnail.fileurl, 3600 ), ).limit(1).collect() ```
### Common expiration times Typical values for the expiration argument, in seconds: * `3600` (1 hour) * `86400` (1 day) * `604800` (7 days, the maximum allowed for SigV4-presigned URLs)
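Rather than repeating magic numbers in each query, you can name the common durations. A minimal sketch (the constants are our own; in the queries above they would be passed as the second argument to `net.presigned_url`):

```python theme={null}
# Common presigned-URL expirations, in seconds
ONE_HOUR = 60 * 60        # 3600
ONE_DAY = 24 * ONE_HOUR   # 86400
SEVEN_DAYS = 7 * ONE_DAY  # 604800

print(ONE_HOUR, ONE_DAY, SEVEN_DAYS)
# → 3600 86400 604800
```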
## What you learned * When you configure Pixeltable to use Tigris to store images, adding images transparently uploads them into Tigris for global distribution. * You can override where images are stored in Tigris using the `destination=` kwarg when creating computed columns. * Use the `.fileurl` property in queries to get URLs for your stored files. * Use `net.presigned_url()` to generate time-limited, authenticated URLs for private bucket access. Pixeltable handles everything else for you. ## Next steps * See the [Cloud Storage documentation](/integrations/cloud-storage) for complete provider setup and authentication details. * Check out [Pixeltable Configuration](/platform/configuration) for all config options. * Join our [Discord community](https://pixeltable.com/discord) if you have questions. ## Additional Resources * [Pixeltable Documentation](/) * [Tigris Documentation](https://www.tigrisdata.com/docs/) # Working with Together AI in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-together Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. ### Prerequisites * A Together AI account with an API key ([https://api.together.ai/settings/api-keys](https://api.together.ai/settings/api-keys)) ### Important notes * Together.ai usage may incur costs based on your Together.ai plan. * Be mindful of sensitive data and consider security measures when integrating with external services. First you’ll need to install required libraries and enter your Together API key. 
```python theme={null} %pip install -qU pixeltable together ``` ```python theme={null} import getpass import os if 'TOGETHER_API_KEY' not in os.environ: os.environ['TOGETHER_API_KEY'] = getpass.getpass('Together API Key: ') ``` Now let’s create a Pixeltable directory to hold the tables for our demo. ```python theme={null} import pixeltable as pxt # Remove the 'together_demo' directory and its contents, if it exists pxt.drop_dir('together_demo', force=True) pxt.create_dir('together_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'together\_demo'.
## Chat completions Create a Table: In Pixeltable, create a table with columns to represent your input data and the columns where you want to store the results from Together AI. ```python theme={null} from pixeltable.functions import together chat_t = pxt.create_table('together_demo/chat', {'input': pxt.String}) messages = [{'role': 'user', 'content': chat_t.input}] chat_t.add_computed_column( output=together.chat_completions( messages=messages, model='meta-llama/Llama-3.3-70B-Instruct-Turbo', model_kwargs={ # Optional dict with parameters for the Together API 'max_tokens': 300, 'stop': ['\n'], 'temperature': 0.7, 'top_p': 0.9, }, ) ) chat_t.add_computed_column( response=chat_t.output.choices[0].message.content ) ```
  Created table 'chat'.
  Added 0 column values with 0 errors in 0.01 s
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
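The expression `chat_t.output.choices[0].message.content` is a JSON path over the stored API response; it navigates the payload the same way you would index a plain Python object. As an illustration with a mocked response shape (not a live API call):

```python theme={null}
# A minimal mock of a chat-completions response payload
response = {
    'choices': [
        {'message': {'role': 'assistant', 'content': 'Around 40 felid species have been classified.'}}
    ]
}

# The computed-column JSON path output.choices[0].message.content
# corresponds to this plain-Python navigation:
content = response['choices'][0]['message']['content']
print(content)
```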
```python theme={null} # Start a conversation chat_t.insert( [ {'input': 'How many species of felids have been classified?'}, {'input': 'Can you make me a coffee?'}, ] ) chat_t.select(chat_t.input, chat_t.response).head() ```
  Inserted 2 rows with 0 errors in 1.58 s (1.27 rows/s)
## Embeddings ```python theme={null} emb_t = pxt.create_table( 'together_demo/embeddings', {'input': pxt.String} ) emb_t.add_computed_column( embedding=together.embeddings( input=emb_t.input, model='BAAI/bge-base-en-v1.5' ) ) ```
  Created table 'embeddings'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} emb_t.insert( [{'input': 'Together AI provides a variety of embeddings models.'}] ) ```
  Inserted 1 row with 0 errors in 0.54 s (1.86 rows/s)
  1 row inserted.
```python theme={null} emb_t.head() ```
## Image generations ```python theme={null} image_t = pxt.create_table('together_demo/images', {'input': pxt.String}) image_t.add_computed_column( img=together.image_generations( image_t.input, model='black-forest-labs/FLUX.1-schnell', model_kwargs={'steps': 5}, ) ) ```
  Created table 'images'.
  Added 0 column values with 0 errors in 0.01 s
  No rows affected.
```python theme={null} image_t.insert( [{'input': 'A friendly dinosaur playing tennis in a cornfield'}] ) ```
  Inserted 1 row with 0 errors in 1.35 s (0.74 rows/s)
  1 row inserted.
```python theme={null} image_t ```
```python theme={null} image_t.head() ```
### Learn more To learn more about advanced techniques like RAG operations in Pixeltable, check out the [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) tutorial. If you have any questions, don’t hesitate to reach out. # Working with Twelve Labs in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-twelvelabs Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Twelve Labs provides multimodal embeddings that project text, images, audio, and video into the **same semantic space**. This enables true **cross-modal search** - the most powerful feature of this integration. **What makes this special?** You can search a video index using *any* modality: text, images, audio, or other videos.
This notebook demonstrates this cross-modal capability with video, then shows how to apply the same embeddings to other modalities. ### Prerequisites * A Twelve Labs account with an API key ([playground.twelvelabs.io](https://playground.twelvelabs.io/)) * Audio and video must be at least 4 seconds long ## Setup ```python theme={null} %pip install -qU pixeltable twelvelabs ``` ```python theme={null} import getpass import os if 'TWELVELABS_API_KEY' not in os.environ: os.environ['TWELVELABS_API_KEY'] = getpass.getpass( 'Enter your Twelve Labs API key: ' ) ``` ```python theme={null} import pixeltable as pxt import pixeltable.functions as pxtf # Create a fresh directory for our demo pxt.drop_dir('twelvelabs_demo', force=True) pxt.create_dir('twelvelabs_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'twelvelabs\_demo'.
## Cross-Modal Video Search Let’s index a video and search it using text, images, audio, and other videos - all against the same index. ### Create Video Table and Index ```python theme={null} # Create a table for videos video_t = pxt.create_table('twelvelabs_demo/videos', {'video': pxt.Video}) # Insert a sample video video_url = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness.mp4' video_t.insert([{'video': video_url}]) ```
  Created table 'videos'.
  Inserted 1 row with 0 errors in 1.60 s (0.63 rows/s)
  1 row inserted.
```python theme={null} # Create a view that segments the video into searchable chunks # Twelve Labs requires minimum 4 second segments video_chunks = pxt.create_view( 'twelvelabs_demo/video_chunks', video_t, iterator=pxtf.video.video_splitter( video=video_t.video, duration=5.0, min_segment_duration=4.0 ), ) # Add embedding index for cross-modal search video_chunks.add_embedding_index( 'video_segment', embedding=pxtf.twelvelabs.embed.using(model_name='marengo3.0'), ) ``` Let’s look at the index we just added in the table metadata: ```python theme={null} video_chunks ```
The iterator turned our single video into a view with many rows, one per chunk: ```python theme={null} video_chunks.count() ```
  51
### Text to Video Search Find video segments matching a text description. ```python theme={null} sim = video_chunks.video_segment.similarity(string='pink') video_chunks.order_by(sim, asc=False).limit(3).select( video_chunks.video_segment, score=sim ).collect() ```
### Image to Video Search Find video segments similar to an image. ```python theme={null} image_query = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness-Screenshot.png' sim = video_chunks.video_segment.similarity(image=image_query) video_chunks.order_by(sim, asc=False).limit(2).select( video_chunks.video_segment, score=sim ).collect() ```
### Video to Video Search Find video segments similar to another video clip. ```python theme={null} video_query = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness-Video-Extract.mp4' sim = video_chunks.video_segment.similarity(video=video_query) video_chunks.order_by(sim, asc=False).limit(2).select( video_chunks.video_segment, score=sim ).collect() ```
### Audio to Video Search Find video segments with similar audio/speech content. ```python theme={null} audio_query = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness-Audio-Extract.m4a' sim = video_chunks.video_segment.similarity(audio=audio_query) video_chunks.order_by(sim, asc=False).limit(2).select( video_chunks.video_segment, score=sim ).collect() ```
## Embedding Options For video embeddings, you can focus on specific aspects: * `'visual'` - Focus on what you see * `'audio'` - Focus on what you hear * `'transcription'` - Focus on what is said ```python theme={null} # Add a visual-only embedding column video_chunks.add_computed_column( visual_embedding=pxtf.twelvelabs.embed( video_chunks.video_segment, model_name='marengo3.0', embedding_option=['visual'], ) ) video_chunks.select( video_chunks.video_segment, video_chunks.visual_embedding ).limit(2).collect() ```
  Added 51 column values with 0 errors in 19.81 s (2.57 rows/s)
## Other Modalities: Text, Images, and Documents Twelve Labs embeddings also work for text, images, and documents. Here’s a compact example showing **multiple embedding indexes on a single table**. ```python theme={null} # Create a multimodal content table content_t = pxt.create_table( 'twelvelabs_demo/content', { 'title': pxt.String, 'description': pxt.String, 'thumbnail': pxt.Image, }, ) # Add computed column combining title and description content_t.add_computed_column( text_content=content_t.title + '. ' + content_t.description ) # Add embedding index on combined text column content_t.add_embedding_index( 'text_content', embedding=pxtf.twelvelabs.embed.using(model_name='marengo3.0'), ) # Add embedding index on image column content_t.add_embedding_index( 'thumbnail', embedding=pxtf.twelvelabs.embed.using(model_name='marengo3.0'), ) ```
  Created table 'content'.
  Added 0 column values with 0 errors in 0.01 s
```python theme={null} # Insert sample content content_t.insert( [ { 'title': 'Beach Sunset', 'description': 'A beautiful sunset over the ocean with palm trees.', 'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000025.jpg', }, { 'title': 'Mountain Hiking', 'description': 'Hikers climbing a steep mountain trail with scenic views.', 'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000139.jpg', }, { 'title': 'City Street', 'description': 'Busy urban street with cars and pedestrians.', 'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000042.jpg', }, { 'title': 'Wildlife Safari', 'description': 'Elephants and zebras on the African savanna.', 'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000061.jpg', }, ] ) ```
  Inserted 4 rows with 0 errors in 1.97 s (2.03 rows/s)
  4 rows inserted.
We can see the two indexes we added in the schema: ```python theme={null} content_t ```
```python theme={null} # Search by text description sim = content_t.text_content.similarity(string='outdoor nature adventure') content_t.order_by(sim, asc=False).limit(2).select( content_t.title, content_t.text_content, score=sim ).collect() ```
```python theme={null} # Search by image similarity query_image = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000001.jpg' sim = content_t.thumbnail.similarity(image=query_image) content_t.order_by(sim, asc=False).limit(2).select( content_t.title, content_t.thumbnail, score=sim ).collect() ```
```python theme={null} # Cross-modal: Search images using text! sim = content_t.thumbnail.similarity(string='shoe rack') content_t.order_by(sim, asc=False).limit(2).select( content_t.title, content_t.thumbnail, score=sim ).collect() ```
## Summary **Twelve Labs + Pixeltable enables:** * **Cross-modal search**: Query video with text, images, audio, or other videos * **Multiple indexes per table**: Add embedding indexes on different columns * **Embedding options**: Focus on visual, audio, or transcription aspects * **All modalities**: Text, images, audio, video, and documents ### Learn More * [Twelve Labs Documentation](https://docs.twelvelabs.io/) * [Pixeltable Documentation](/) # Working with Voyage AI in Pixeltable Source: https://docs.pixeltable.com/howto/providers/working-with-voyageai Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable’s Voyage AI integration enables you to access state-of-the-art embedding and reranker models via the Voyage AI API. ### Prerequisites * A Voyage AI account with an API key ([https://www.voyageai.com/](https://www.voyageai.com/)) ### Important notes * Voyage AI usage may incur costs based on your Voyage AI plan. * Be mindful of sensitive data and consider security measures when integrating with external services. First you’ll need to install required libraries and enter your Voyage AI API key. ```python theme={null} %pip install -qU pixeltable voyageai ``` ```python theme={null} import getpass import os if 'VOYAGE_API_KEY' not in os.environ: os.environ['VOYAGE_API_KEY'] = getpass.getpass( 'Enter your Voyage AI API key:' ) ``` Now let’s create a Pixeltable directory to hold the tables for our demo. ```python theme={null} import pixeltable as pxt # Remove the 'voyageai_demo' directory and its contents, if it exists pxt.drop_dir('voyageai_demo', force=True) pxt.create_dir('voyageai_demo') ```
  Created directory 'voyageai\_demo'.
## Text embeddings Voyage AI provides state-of-the-art embedding models for semantic search and RAG applications. ```python theme={null} from pixeltable.functions import voyageai # Create a table for document embeddings docs_t = pxt.create_table('voyageai_demo/documents', {'text': pxt.String}) # Add computed column with Voyage embeddings docs_t.add_computed_column( embedding=voyageai.embeddings( docs_t.text, model='voyage-3.5', input_type='document' ) ) ```
  Created table 'documents'.
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Insert some sample documents documents = [ 'The Mediterranean diet emphasizes fish, olive oil, and vegetables, believed to reduce chronic diseases.', 'Photosynthesis in plants converts light energy into glucose and produces essential oxygen.', '20th-century innovations, from radios to smartphones, centered on electronic advancements.', 'Rivers provide water, irrigation, and habitat for aquatic species, vital for ecosystems.', "Apple's conference call to discuss fourth fiscal quarter results is scheduled for Thursday, November 2, 2023.", "Shakespeare's works, like 'Hamlet' and 'A Midsummer Night's Dream,' endure in literature.", ] docs_t.insert({'text': doc} for doc in documents) ```
  Inserting rows into \`documents\`: 6 rows \[00:00, 2561.67 rows/s]
  Inserted 6 rows with 0 errors.
  6 rows inserted, 12 values computed.
```python theme={null} # View the embeddings docs_t.select(docs_t.text, docs_t.embedding).head(3) ```
## Embedding index for similarity search You can use Voyage AI embeddings with Pixeltable’s embedding index for efficient similarity search. ```python theme={null} # Create a table with an embedding index search_t = pxt.create_table('voyageai_demo/search', {'text': pxt.String}) # Add embedding index for similarity search embed_fn = voyageai.embeddings.using( model='voyage-3.5', input_type='document' ) search_t.add_embedding_index('text', string_embed=embed_fn) ```
  Created table 'search'.
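The `.using(...)` call pre-binds keyword arguments to the UDF, so the index can later invoke it with just the string to embed. Conceptually it behaves like `functools.partial` (an analogy for intuition, not Pixeltable’s actual implementation):

```python theme={null}
from functools import partial

def embeddings(text: str, *, model: str, input_type: str) -> str:
    # Stand-in for an embeddings call; returns a description instead of a vector
    return f'embed({text!r}, model={model}, input_type={input_type})'

# Analogous to voyageai.embeddings.using(model='voyage-3.5', input_type='document')
embed_fn = partial(embeddings, model='voyage-3.5', input_type='document')
print(embed_fn('hello'))
```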
```python theme={null} # Insert documents search_t.insert({'text': doc} for doc in documents) ```
  Inserting rows into \`search\`: 6 rows \[00:00, 973.68 rows/s]
  Inserted 6 rows with 0 errors.
  6 rows inserted, 12 values computed.
```python theme={null} # Perform similarity search sim = search_t.text.similarity( string='What are the health benefits of Mediterranean food?' ) search_t.order_by(sim, asc=False).limit(3).select( search_t.text, score=sim ).collect() ```
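The score produced by `similarity()` is a vector similarity between the query embedding and each stored embedding, typically cosine similarity. A toy sketch of the underlying computation (illustrative only, using stdlib math rather than the index internals):

```python theme={null}
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
doc = [1.0, 0.0, 0.5]
print(round(cosine_similarity(query, doc), 4))
# → 0.9487
```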
## Reranking Voyage AI’s rerankers can refine search results by providing more accurate relevance scores. ```python theme={null} # Create a table for reranking rerank_t = pxt.create_table( 'voyageai_demo/rerank', {'query': pxt.String, 'documents': pxt.Json} ) # Add computed column with reranking results rerank_t.add_computed_column( reranked=voyageai.rerank( rerank_t.query, rerank_t.documents, model='rerank-2.5', top_k=3 ) ) ```
  Created table 'rerank'.
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Insert query and documents to rerank rerank_t.insert( [ { 'query': "When is Apple's conference call scheduled?", 'documents': documents, } ] ) ```
  Inserting rows into \`rerank\`: 1 rows \[00:00, 343.65 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 2 values computed.
```python theme={null} # Add computed column to extract top results using JSON path rerank_t.add_computed_column(top_results=rerank_t.reranked['results']) ```
  Added 1 column value with 0 errors.
  1 row updated, 1 value computed.
```python theme={null} # Extract the top result's document and score rerank_t.select( rerank_t.query, top_document=rerank_t.top_results[0]['document'], top_score=rerank_t.top_results[0]['relevance_score'], ).collect() ```
```python theme={null} # View reranking results rerank_t.select(rerank_t.query, rerank_t.top_results).collect() ```
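The rerank response’s `results` list is ordered by descending relevance, and each entry carries the document text and its score, which is why `top_results[0]` gives the best match. A mocked payload mirroring the fields used above (illustrative values, not real API output):

```python theme={null}
# Mocked rerank payload with the same fields the tutorial extracts
reranked = {
    'results': [
        {'document': "Apple's conference call is scheduled for November 2, 2023.", 'relevance_score': 0.95},
        {'document': 'Rivers provide water and habitat for aquatic species.', 'relevance_score': 0.12},
    ]
}

# top_results[0]['document'] / ['relevance_score'] correspond to:
top = reranked['results'][0]
print(top['document'], top['relevance_score'])
```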
## Multimodal Embeddings Voyage AI’s multimodal embedding models can embed both images and text into the same vector space, enabling cross-modal similarity search. ```python theme={null} # Create a table for multimodal embeddings mm_t = pxt.create_table( 'voyageai_demo/multimodal', {'image': pxt.Image, 'caption': pxt.String}, if_exists='replace', ) # Add computed columns for image and text embeddings # multimodal_embed can embed either images or text independently mm_t.add_computed_column( image_embedding=voyageai.multimodal_embed( mm_t.image, model='voyage-multimodal-3.5', input_type='document' ) ) mm_t.add_computed_column( text_embedding=voyageai.multimodal_embed( mm_t.caption, model='voyage-multimodal-3.5', input_type='document' ) ) ```
  Created table 'multimodal'.
  Added 0 column values with 0 errors.
  Added 0 column values with 0 errors.
  No rows affected.
```python theme={null} # Insert a sample image with caption mm_t.insert( [ { 'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000139.jpg', 'caption': 'A person standing next to an elephant', } ] ) ```
  Inserting rows into \`multimodal\`: 1 rows \[00:00, 520.00 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 5 values computed.
```python theme={null} # View the multimodal embeddings mm_t.select( mm_t.image, mm_t.caption, mm_t.image_embedding, mm_t.text_embedding ).head() ```
### Learn more To learn more about RAG operations in Pixeltable, check out the [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) tutorial. For more information about Voyage AI models and features, visit: * [Voyage AI Documentation](https://docs.voyageai.com/) * [Text Embeddings](https://docs.voyageai.com/docs/embeddings) * [Multimodal Embeddings](https://docs.voyageai.com/docs/multimodal-embeddings) * [Rerankers](https://docs.voyageai.com/docs/reranker) If you have any questions, don’t hesitate to reach out. # Transcribing and Indexing Audio and Video in Pixeltable Source: https://docs.pixeltable.com/howto/use-cases/audio-transcriptions Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. In this tutorial, we’ll build an end-to-end workflow for creating and indexing audio transcriptions of video data. We’ll demonstrate how Pixeltable can be used to: 1. Extract audio data from video files; 2. Transcribe the audio using OpenAI Whisper; 3. Build a semantic index of the transcriptions, using the Huggingface sentence\_transformers models; 4. Search this index. The tutorial assumes you’re already somewhat familiar with Pixeltable. If this is your first time using Pixeltable, the [10-Minute Tour](/overview/ten-minute-tour) tutorial is a great place to start. ## Create a Table for Video Data Let’s first install the Python packages we’ll need for the demo. We’re going to use the popular Whisper library, running locally. Later in the demo, we’ll see how to use the OpenAI API endpoints as an alternative. ```python theme={null} %pip install -q pixeltable openai openai-whisper sentence-transformers spacy !python -m spacy download en_core_web_sm -q ``` Now we create a Pixeltable table to hold our videos. 
```python theme={null} import pixeltable as pxt pxt.drop_dir( 'transcription_demo', force=True ) # Ensure a clean slate for the demo pxt.create_dir('transcription_demo') # Create a table to store our videos and workflow video_table = pxt.create_table( 'transcription_demo/video_table', {'video': pxt.Video} ) video_table ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'transcription\_demo'.
  Created table 'video\_table'.
Next let’s insert some video files into the table. In this demo, we’ll be using one-minute excerpts from a Lex Fridman podcast. We’ll begin by inserting two of them into our new table. In this demo, our videos are given as `https` links, but Pixeltable also accepts local files and S3 URLs as input. ```python theme={null} videos = [ 'https://github.com/pixeltable/pixeltable/raw/release/docs/resources/audio-transcription-demo/' f'Lex-Fridman-Podcast-430-Excerpt-{n}.mp4' for n in range(3) ] video_table.insert({'video': video} for video in videos[:2]) video_table.show() ```
  Inserted 2 rows with 0 errors in 2.04 s (0.98 rows/s)
Now we’ll add another column to hold extracted audio from our videos. The new column is an example of a *computed column*: it’s updated automatically based on the contents of another column (or columns). In this case, the value of the `audio` column is defined to be the audio track extracted from whatever’s in the `video` column. ```python theme={null} from pixeltable.functions.video import extract_audio video_table.add_computed_column( audio=extract_audio(video_table.video, format='mp3') ) video_table.show() ```
  Added 2 column values with 0 errors in 0.91 s (2.19 rows/s)
If we look at the structure of the video table, we see that the new column is a computed column. ```python theme={null} video_table ```
We can also add another computed column to extract metadata from the audio streams. ```python theme={null} from pixeltable.functions.audio import get_metadata video_table.add_computed_column(metadata=get_metadata(video_table.audio)) video_table.show() ```
  Added 2 column values with 0 errors in 0.02 s (95.47 rows/s)
## Create Transcriptions

Now we’ll add a step to create transcriptions of our videos. As mentioned above, we’re going to use the Whisper library for this, running locally. Pixeltable has a built-in function, `whisper.transcribe`, that serves as an adapter for the Whisper library’s transcription capability. All we have to do is add a computed column that calls this function:

```python theme={null}
from pixeltable.functions import whisper

video_table.add_computed_column(
    transcription=whisper.transcribe(
        audio=video_table.audio, model='base.en'
    )
)
video_table.select(
    video_table.video, video_table.transcription.text
).show()
```
  Added 2 column values with 0 errors in 4.63 s (0.43 rows/s)
In order to index the transcriptions, we’ll first need to split them into sentences. We can do this using Pixeltable’s built-in `string_splitter` iterator.

```python theme={null}
from pixeltable.functions.string import string_splitter

sentences_view = pxt.create_view(
    'transcription_demo/sentences_view',
    video_table,
    iterator=string_splitter(
        video_table.transcription.text, separators='sentence'
    ),
)
```

The `string_splitter` creates a new view, with the audio transcriptions broken into individual, one-sentence chunks.

```python theme={null}
sentences_view.select(sentences_view.pos, sentences_view.text).show(8)
```
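For intuition, the `sentence` separator amounts to splitting on sentence boundaries. Here's a minimal pure-Python sketch of the idea — a naive regex splitter, not the actual iterator, which handles many more edge cases (abbreviations, quotations, and so on):

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

# Each resulting chunk would become one row in the view,
# indexed by its position within the transcription.
chunks = split_sentences('What is happiness? It is a good question. Let us discuss.')
```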
## Add an Embedding Index

Next, let’s use the Huggingface `sentence_transformers` library to create an embedding index of our sentences, attaching it to the `text` column of our `sentences_view`.

```python theme={null}
from pixeltable.functions.huggingface import sentence_transformer

sentences_view.add_embedding_index(
    'text',
    embedding=sentence_transformer.using(model_id='intfloat/e5-large-v2'),
)
```

We can do a simple lookup to test our new index. The following snippet
returns the results of a nearest-neighbor search on the input “What is
happiness?”

```python theme={null}
sim = sentences_view.text.similarity(string='What is happiness?')

(
    sentences_view.order_by(sim, asc=False)
    .limit(10)
    .select(sentences_view.text, similarity=sim)
    .collect()
)
```
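Conceptually, a lookup like this ranks rows by the similarity between each row's embedding vector and the query's embedding. A toy illustration of the ranking logic using cosine similarity, with made-up 3-dimensional vectors (the real index uses E5 embeddings with far more dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings" for three sentences and a query
sentences = {
    'Happiness is a state of mind.': [0.9, 0.1, 0.2],
    'The stock closed higher today.': [0.1, 0.8, 0.3],
    'Joy and contentment matter.': [0.8, 0.2, 0.1],
}
query = [0.85, 0.15, 0.15]

# order_by(sim, asc=False).limit(k) amounts to sorting by similarity
ranked = sorted(sentences, key=lambda s: cosine(sentences[s], query), reverse=True)
top_2 = ranked[:2]
```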

## Incremental Updates

*Incremental updates* are a key feature of Pixeltable. Whenever a new video is added to the original table, all of its downstream computed columns are updated automatically. Let’s demonstrate this by adding a third video to the table and seeing how the updates propagate through to the index.

```python theme={null}
video_table.insert([{'video': videos[2]}])
```
  Inserted 10 rows with 0 errors in 4.20 s (2.38 rows/s)
  10 rows inserted.
```python theme={null}
video_table.select(
    video_table.video,
    video_table.metadata,
    video_table.transcription.text,
).show()
```
```python theme={null}
sim = sentences_view.text.similarity(string='What is happiness?')

(
    sentences_view.order_by(sim, asc=False)
    .limit(20)
    .select(sentences_view.text, similarity=sim)
    .collect()
)
```
We can see the new results showing up in `sentences_view`.

## Using the OpenAI API

This concludes our tutorial using the locally installed Whisper library. Sometimes, it may be preferable to use the OpenAI API rather than a locally installed library. In this section we’ll show how this can be done in Pixeltable, simply by using a different function to construct our computed columns.

Since this section relies on calling out to the OpenAI API, you’ll need to have an API key, which you can enter below.

```python theme={null}
import getpass
import os

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
```

```python theme={null}
from pixeltable.functions import openai

video_table.add_computed_column(
    transcription_from_api=openai.transcriptions(
        video_table.audio, model='whisper-1'
    )
)
```
  Added 3 column values with 0 errors in 6.49 s (0.46 rows/s)
  3 rows updated.
Now let’s compare the results from the local model and the API side-by-side.

```python theme={null}
video_table.select(
    video_table.video,
    video_table.transcription.text,
    video_table.transcription_from_api.text,
).show()
```
They look pretty similar, which isn’t surprising, since the OpenAI transcriptions endpoint runs on Whisper. One difference is that the local library outputs considerably more information about the internal behavior of the model.

Note that we’ve been selecting `video_table.transcription.text` in the preceding queries, which pulls out just the `text` field of the transcription results. The actual results are a sizable JSON structure that includes a lot of metadata. To see everything, we can select `video_table.transcription` instead, to get the full JSON struct. Here’s what it looks like (we’ll select just one row, since it’s a lot of output):

```python theme={null}
video_table.select(
    video_table.transcription, video_table.transcription_from_api
).show(1)
```
# Object Detection in Videos

Source: https://docs.pixeltable.com/howto/use-cases/object-detection-in-videos

Open in Kaggle  Open in Colab  Download Notebook

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

In this tutorial, we’ll demonstrate how to use Pixeltable to do frame-by-frame object detection, made simple through Pixeltable’s video-related functionality:

* automatic frame extraction
* running complex functions against frames (in this case, the YOLOX object detection models)
* reassembling frames back into videos

We’ll be working with a single video file from Pixeltable’s test data repository. This tutorial assumes you’re at least somewhat familiar with Pixeltable; a good place to learn more is the [Pixeltable Documentation](/overview/pixeltable).

## Creating a tutorial directory and table

First, let’s make sure the packages we need for this tutorial are installed: Pixeltable itself, PyTorch, and the YOLOX object detection library.

```python theme={null}
%pip install -qU pixeltable pixeltable-yolox
```

All data in Pixeltable is stored in tables, which in turn reside in directories. We’ll begin by creating a `detection_demo` directory and a table to hold our videos, with a single column of type `pxt.Video`.

```python theme={null}
import pixeltable as pxt

pxt.create_dir('detection_demo', if_exists='replace_force')
videos_table = pxt.create_table(
    'detection_demo/videos', {'video': pxt.Video}
)
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'detection_demo'.
  Created table 'videos'.
In order to interact with the frames, we take advantage of Pixeltable’s component view concept: we create a “view” of our video table that contains one row for each frame of each video in the table. Pixeltable provides the built-in `frame_iterator` for this.

```python theme={null}
from pixeltable.functions.video import frame_iterator

frames_view = pxt.create_view(
    'detection_demo/frames',
    videos_table,
    iterator=frame_iterator(videos_table.video),
)
```

You’ll see that neither the `videos` table nor the `frames` view has any actual data yet, because we haven’t yet added any videos to the table. However, the `frames` view is now configured to automatically track the `videos` table as new data shows up.

The new view is automatically configured with six columns:

* `pos` - a system column that is part of every component view
* `video` - the column inherited from our base table (all base table columns are visible in any of its views)
* `frame_idx`, `pos_msec`, `pos_frame`, `frame` - these four columns are created by the `frame_iterator`.

Let’s have a look at the new view:

```python theme={null}
frames_view
```
We’ll now insert a single row into the videos table, containing a video of a busy intersection in Bangkok.

```python theme={null}
videos_table.insert(
    [
        {
            'video': 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/bangkok.mp4'
        }
    ]
)
```
  Inserted 462 rows with 0 errors in 4.35 s (106.25 rows/s)
  462 rows inserted.
Notice that both the `videos` table and `frames` view were automatically updated, expanding the single video into 461 rows in the view. Let’s have a look at `videos` first.

```python theme={null}
videos_table.show()
```
Now let’s peek at the first five rows of `frames`:

```python theme={null}
frames_view.select(
    frames_view.pos,
    frames_view.frame,
    frames_view.frame.width,
    frames_view.frame.height,
).show(5)
```
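As a side note, the positional columns are related by simple arithmetic: a frame's timestamp is determined by its index and the video's frame rate. A quick sketch of that relationship (the 30 fps figure is illustrative, not necessarily this video's actual frame rate):

```python
def frame_position_ms(frame_idx: int, fps: float) -> float:
    # A frame's timestamp in milliseconds is its index divided by the frame rate
    return frame_idx * 1000.0 / fps

t = frame_position_ms(30, fps=30.0)  # the 31st frame of a 30 fps video
```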
One advantage of using Pixeltable’s component view mechanism is that Pixeltable does not physically store the frames. Instead, Pixeltable re-extracts the frames on retrieval using the frame index, which can be done very efficiently and avoids any storage overhead (which can be quite substantial for video frames).

## Object Detection with Pixeltable

Now let’s apply an object detection model to our frames. Pixeltable includes built-in support for a number of models; we’re going to use the YOLOX family of models, which are lightweight models with solid performance. We first import the `yolox` Pixeltable function.

```python theme={null}
from pixeltable.functions.yolox import yolox
```

Pixeltable functions operate on columns and expressions using standard Python function call syntax. Here’s an example that shows how we might experiment with applying one of the YOLOX models to the first few frames in our video, using Pixeltable’s powerful `select` comprehension.

```python theme={null}
# Show the results of applying the `yolox_tiny` model
# to the first few frames in the table.
frames_view.select(
    frames_view.frame, yolox(frames_view.frame, model_id='yolox_tiny')
).head(3)
```
It may appear that we just ran the YOLOX inference over the entire view of 461 frames, but remember that Pixeltable evaluates expressions lazily: in this case, it only ran inference over the 3 frames that we actually displayed.

The inference output looks like what we’d expect, so let’s add a *computed column* that runs inference over the entire view (computed columns are discussed in detail in the [Computed Columns](https://github.com/pixeltable/pixeltable/blob/release/docs/tutorials/computed-columns.ipynb) tutorial). Remember that once a computed column is created, Pixeltable will update it incrementally any time new rows are added to the view. This is a convenient way to incorporate inference (and other operations) into data workflows. This *will* cause Pixeltable to run inference over all 461 frames, so please be patient.

```python theme={null}
# Create a computed column to compute detections using the `yolox_tiny` model.
# We'll adjust the confidence threshold down a bit (the default is 0.5)
# to pick up even more bounding boxes.
frames_view.add_computed_column(
    detections_tiny=yolox(
        frames_view.frame, model_id='yolox_tiny', threshold=0.25
    )
)
```
  Added 461 column values with 0 errors in 15.09 s (30.55 rows/s)
  461 rows updated.
The new column is now part of the schema of the `frames` view:

```python theme={null}
frames_view
```
The data in the computed column is now stored for fast retrieval.

```python theme={null}
frames_view.select(frames_view.frame, frames_view.detections_tiny).show(3)
```
Now let’s create a new set of images, in which we superimpose the detected bounding boxes on top of the original images. We’ll use the handy built-in `draw_bounding_boxes` UDF for this. We could create a new computed column to hold the superimposed images, but we don’t have to; sometimes it’s easier just to use a `select` comprehension, as we did when we were first experimenting with the detection model.

```python theme={null}
import pixeltable.functions as pxtf

frames_view.select(
    frames_view.frame,
    pxtf.vision.draw_bounding_boxes(
        frames_view.frame, frames_view.detections_tiny.bboxes, width=4
    ),
).show(1)
```
Our `select` comprehension ranged over the entire table, but just as before, Pixeltable computes the output lazily: image operations are performed at retrieval time, so in this case, Pixeltable drew the annotations just for the one frame that we actually displayed.

Looking at individual frames gives us some idea of how well our detection algorithm works, but it would be more instructive to turn the visualization output back into a video. We do that with the built-in function `make_video()`, which is an aggregation function that takes a frame index (actually: any expression that can be used to order the frames; a timestamp would also work) and an image, and then assembles the sequence of images into a video.

```python theme={null}
frames_view.group_by(videos_table).select(
    pxt.functions.video.make_video(
        frames_view.pos,
        pxtf.vision.draw_bounding_boxes(
            frames_view.frame, frames_view.detections_tiny.bboxes, width=4
        ),
    )
).show(1)
```
## Comparing Object Detection Models

The detections that we get out of `yolox_tiny` are passable, but a little choppy. Suppose we want to experiment with a more powerful object detection model, to see if there is any improvement in detection quality. We can create an additional column to hold the new inferences. The larger model takes longer to download and run, so please be patient.

```python theme={null}
# Here we use the larger `yolox_m` (medium) model.
frames_view.add_computed_column(
    detections_m=yolox(
        frames_view.frame, model_id='yolox_m', threshold=0.25
    )
)
```
  Added 461 column values with 0 errors in 65.94 s (6.99 rows/s)
  461 rows updated.
Let’s see the results of the two models side-by-side.

```python theme={null}
frames_view.group_by(videos_table).select(
    pxt.functions.video.make_video(
        frames_view.pos,
        pxtf.vision.draw_bounding_boxes(
            frames_view.frame, frames_view.detections_tiny.bboxes, width=4
        ),
    ),
    pxt.functions.video.make_video(
        frames_view.pos,
        pxtf.vision.draw_bounding_boxes(
            frames_view.frame, frames_view.detections_m.bboxes, width=4
        ),
    ),
).show(1)
```
Running the videos side-by-side, we can see that the larger model produces higher-quality output: less flickering, with more stable boxes from frame to frame.

## Evaluating Models Against a Ground Truth

In order to do a quantitative evaluation of model performance, we need a ground truth to compare them against. Let’s generate some (synthetic) “ground truth” data by running the largest YOLOX model available. It will take even longer to cache and evaluate this model.

```python theme={null}
frames_view.add_computed_column(
    detections_x=yolox(
        frames_view.frame, model_id='yolox_x', threshold=0.25
    )
)
```
  Added 461 column values with 0 errors in 156.55 s (2.94 rows/s)
  461 rows updated.
Let’s have a look at our enlarged view, now with three `detections` columns.

```python theme={null}
frames_view
```
We’re going to evaluate the generated detections with the commonly used [mean average precision](https://learnopencv.com/mean-average-precision-map-object-detection-model-evaluation-metric/) metric (mAP). The mAP metric is based on per-frame metrics, such as true and false positives per detected class, which are then aggregated into a single (per-class) number. In Pixeltable, this functionality is available via the `eval_detections()` and `mean_ap()` built-in functions.

```python theme={null}
from pixeltable.functions.vision import eval_detections, mean_ap

frames_view.add_computed_column(
    eval_yolox_tiny=eval_detections(
        pred_bboxes=frames_view.detections_tiny.bboxes,
        pred_labels=frames_view.detections_tiny.labels,
        pred_scores=frames_view.detections_tiny.scores,
        gt_bboxes=frames_view.detections_x.bboxes,
        gt_labels=frames_view.detections_x.labels,
    )
)
frames_view.add_computed_column(
    eval_yolox_m=eval_detections(
        pred_bboxes=frames_view.detections_m.bboxes,
        pred_labels=frames_view.detections_m.labels,
        pred_scores=frames_view.detections_m.scores,
        gt_bboxes=frames_view.detections_x.bboxes,
        gt_labels=frames_view.detections_x.labels,
    )
)
```
  Added 461 column values with 0 errors in 0.29 s (1589.38 rows/s)
  Added 461 column values with 0 errors in 0.31 s (1475.98 rows/s)
  461 rows updated.
Let’s take a look at the output.

```python theme={null}
frames_view.select(
    frames_view.eval_yolox_tiny, frames_view.eval_yolox_m
).show(1)
```
The computation of the mAP metric is now simply a query over the evaluation output, aggregated with the `mean_ap()` function.

```python theme={null}
frames_view.select(
    mean_ap(frames_view.eval_yolox_tiny),
    mean_ap(frames_view.eval_yolox_m),
).show()
```
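To build intuition for what `mean_ap()` aggregates, here is a simplified, pure-Python computation of average precision for a single class, given detections ranked by descending confidence and marked as true or false positives. (This is only a sketch: the built-in functions additionally handle IoU-based matching, multiple classes, and confidence thresholds.)

```python
def average_precision(matches: list[bool], num_gt: int) -> float:
    """matches: detections sorted by descending confidence;
    True = matched a ground-truth box (TP), False = unmatched (FP)."""
    tp = fp = 0
    ap = 0.0
    for is_tp in matches:
        if is_tp:
            tp += 1
            # Accumulate precision at each recall step (each TP adds 1/num_gt recall)
            ap += (tp / (tp + fp)) / num_gt
        else:
            fp += 1
    return ap

# 3 ground-truth boxes; ranked detections come out as TP, FP, TP
ap = average_precision([True, False, True], num_gt=3)
```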
This two-step process allows you to compute mAP at every granularity: over your entire dataset, only for specific videos, only for videos that pass a certain filter, etc. Moreover, you can compute this metric any time, not just during training, and use it to guide your understanding of your dataset and how it affects the quality of your models.

# Document Indexing and RAG

Source: https://docs.pixeltable.com/howto/use-cases/rag-demo

Open in Kaggle  Open in Colab  Download Notebook

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

In this tutorial, we’ll demonstrate how RAG operations can be implemented in Pixeltable. In particular, we’ll develop a RAG application that summarizes a collection of PDF documents and uses ChatGPT to answer questions about them. In a traditional RAG workflow, such operations might be implemented as a Python script that runs on a periodic schedule or in response to certain events. In Pixeltable, they are implemented as persistent tables that are updated automatically and incrementally as new data becomes available.

We first set up our OpenAI API key:

```python theme={null}
import getpass
import os

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
```

We then install the packages we need for this tutorial and set up our environment.

```python theme={null}
%pip install -q pixeltable sentence-transformers tiktoken openai openpyxl
```
  Note: you may need to restart the kernel to use updated packages.
```python theme={null}
import pixeltable as pxt

# Ensure a clean slate for the demo
pxt.drop_dir('rag_demo', force=True)
pxt.create_dir('rag_demo')
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/sergeymkhitaryan/.pixeltable/pgdata
  Created directory 'rag_demo'.
Next we’ll create a table containing the sample questions we want to answer. The questions are stored in an Excel spreadsheet, along with a set of “ground truth” answers to help evaluate our model pipeline. We can use `create_table()` with the `source` parameter to load them. Note that we can pass the URL of the spreadsheet directly.

```python theme={null}
base = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/rag-demo/'
qa_url = base + 'Q-A-Rag.xlsx'
queries_t = pxt.create_table('rag_demo/queries', source=qa_url)
```
  Created table 'queries'.
  Inserting rows into `queries`: 8 rows [00:00, 2469.96 rows/s]
  Inserted 8 rows with 0 errors.
```python theme={null}
queries_t.head()
```
## Outline

There are two major parts to our RAG application:

1. Document Indexing: Load the documents, split them into chunks, and index them using a vector embedding.
2. Querying: For each question on our list, do a top-k lookup for the most relevant chunks, use them to construct a ChatGPT prompt, and send the enriched prompt to an LLM.

We’ll implement both parts in Pixeltable.

## Document Indexing

All data in Pixeltable, including documents, resides in tables. Tables are persistent containers that can serve as the store of record for your data. Since we are starting from scratch, we will start with an empty table `rag_demo.documents` with a single column, `document`.

```python theme={null}
documents_t = pxt.create_table(
    'rag_demo/documents', {'document': pxt.Document}
)
documents_t
```
  Created table 'documents'.
Next, we’ll insert our first few source documents into the new table. We’ll leave the rest for later, in order to show how to update the indexed document base incrementally.

```python theme={null}
document_urls = [
    base + 'Argus-Market-Digest-June-2024.pdf',
    base + 'Argus-Market-Watch-June-2024.pdf',
    base + 'Company-Research-Alphabet.pdf',
    base + 'Jefferson-Amazon.pdf',
    base + 'Mclean-Equity-Alphabet.pdf',
    base + 'Zacks-Nvidia-Report.pdf',
]
```

```python theme={null}
documents_t.insert({'document': url} for url in document_urls[:3])
documents_t.show()
```
  Inserting rows into `documents`: 3 rows [00:00, 491.31 rows/s]
  Inserted 3 rows with 0 errors.
In RAG applications, we often decompose documents into smaller units, or chunks, rather than treating each document as a single entity. In this example, we’ll use Pixeltable’s built-in `document_splitter`, but in general the chunking methodology is highly customizable. `document_splitter` has a variety of options for controlling the chunking behavior, and it’s also possible to replace it entirely with a user-defined iterator (or an adapter for a third-party document splitter).

In Pixeltable, operations such as chunking can be automated by creating **views** of the base `documents` table. A view is a virtual derived table: rather than adding data directly to the view, we define it via a computation over the base table. In this example, the view is defined by iteration over the chunks of a `document_splitter`.

```python theme={null}
from pixeltable.functions.document import document_splitter

chunks_t = pxt.create_view(
    'rag_demo/chunks',
    documents_t,
    iterator=document_splitter(
        documents_t.document, separators='token_limit', limit=300
    ),
)
```
  Inserting rows into `chunks`: 41 rows [00:00, 20799.04 rows/s]
Our `chunks` view now has 3 columns:

```python theme={null}
chunks_t
```
* `text` is the chunk text produced by the `document_splitter`
* `pos` is a system-generated integer column, starting at 0, that provides a sequence number for each row
* `document` is simply the `document` column from the base table `documents`. We won’t need it here, but having access to the base table’s columns (in effect a parent-child join) can be quite useful.

Notice that as soon as we created it, `chunks` was automatically populated with data from the existing documents in our base table. We can select the first 2 chunks from each document using common query operations, in order to get a feel for what was extracted:

```python theme={null}
chunks_t.where(chunks_t.pos < 2).show()
```
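To see what `token_limit` chunking does conceptually, here's a naive pure-Python version that uses whitespace-separated words as stand-in "tokens" (the real `document_splitter` counts model tokens, e.g. via tiktoken, and respects document structure):

```python
def chunk_by_token_limit(text: str, limit: int) -> list[str]:
    # Greedily pack up to `limit` "tokens" (here: words) into each chunk
    words = text.split()
    return [
        ' '.join(words[i:i + limit]) for i in range(0, len(words), limit)
    ]

chunks = chunk_by_token_limit('one two three four five', limit=2)
```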
Now let’s compute vector embeddings for the document chunks and store them in a vector index. Pixeltable has built-in support for vector indexing using a variety of embedding model families, and it’s easy for users to add new ones via UDFs. In this demo, we’re going to use the E5 model from the Huggingface `sentence_transformers` library, which runs locally.

The following command creates a vector index on the `text` column in the `chunks` table, using the E5 embedding model. (For details on index creation, see the [Embedding and Vector Indices](https://github.com/pixeltable/pixeltable/blob/release/docs/platform/embedding-indexes.ipynb) guide.) Note that defining the index is sufficient to load it with the existing data (and also to update it when the underlying data changes, as we’ll see later).

```python theme={null}
from pixeltable.functions.huggingface import sentence_transformer

chunks_t.add_embedding_index(
    'text',
    embedding=sentence_transformer.using(model_id='intfloat/e5-large-v2'),
)
```

This completes the first part of our application, creating an indexed document base. Next, we’ll use it to run some queries.

## Querying

In order to express a top-k lookup against our index, we use Pixeltable’s `similarity` operator in combination with the standard `order_by` and `limit` operations. Before building this into our application, let’s run a sample query to make sure it works.

```python theme={null}
query_text = 'What is the expected EPS for Nvidia in Q1 2026?'
sim = chunks_t.text.similarity(string=query_text)
nvidia_eps_query = (
    chunks_t.order_by(sim, asc=False)
    .select(similarity=sim, text=chunks_t.text)
    .limit(5)
)
nvidia_eps_query.collect()
```
We perform this context retrieval for each row of our `queries` table by adding it as a computed column. In this case, the operation is a top-k similarity lookup against the data in the `chunks` table. To implement this operation, we’ll use Pixeltable’s `@query` decorator to enhance the capabilities of the `chunks` table.

```python theme={null}
# A @query is essentially a reusable, parameterized query that is attached
# to a table (or view); it's a modular way of getting data from that table.
@pxt.query
def top_k(query_text: str):
    sim = chunks_t.text.similarity(string=query_text)
    return (
        chunks_t.order_by(sim, asc=False)
        .select(chunks_t.text, sim=sim)
        .limit(5)
    )

# Now add a computed column to `queries_t`, calling the query
# `top_k` that we just defined.
queries_t.add_computed_column(question_context=top_k(queries_t.Question))
```
  Added 8 column values with 0 errors.
  8 rows updated, 8 values computed.
Our `queries` table now looks like this:

```python theme={null}
queries_t
```
The new column `question_context` now contains the result of executing the query for each row, formatted as a list of dictionaries:

```python theme={null}
queries_t.select(queries_t.question_context).head(1)
```
### Asking the LLM

Now it’s time for the final step in our application: feeding the document chunks and questions to an LLM for resolution. In this demo, we’ll use OpenAI for this, but any other inference cloud or local model could be used instead.

We start by defining a UDF that takes a top-k list of context chunks and a question and turns them into a ChatGPT prompt.

```python theme={null}
# Define a UDF to create an LLM prompt given a top-k list of
# context chunks and a question.
@pxt.udf
def create_prompt(top_k_list: list[dict], question: str) -> str:
    concat_top_k = '\n\n'.join(
        elt['text'] for elt in reversed(top_k_list)
    )
    return f"""
PASSAGES:

{concat_top_k}

QUESTION:

{question}"""
```

We then add that as a computed column to `queries`:

```python theme={null}
queries_t.add_computed_column(
    prompt=create_prompt(queries_t.question_context, queries_t.Question)
)
```
  Added 8 column values with 0 errors.
  8 rows updated, 16 values computed.
We now have a new string column containing the prompt:

```python theme={null}
queries_t
```
```python theme={null}
queries_t.select(queries_t.prompt).head(1)
```
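The prompt-assembly logic itself is ordinary Python and easy to sanity-check outside of Pixeltable. Here's a standalone copy of the same logic with made-up chunks; note that `reversed` puts the most similar chunk (the first element of the top-k list) closest to the question:

```python
def build_prompt(top_k_list: list[dict], question: str) -> str:
    # Same logic as the create_prompt UDF: concatenate chunks,
    # least similar first, then append the question
    concat_top_k = '\n\n'.join(elt['text'] for elt in reversed(top_k_list))
    return f"""
PASSAGES:

{concat_top_k}

QUESTION:

{question}"""

# 'chunk B' is the most similar chunk, so it lands nearest the question
p = build_prompt([{'text': 'chunk B'}, {'text': 'chunk A'}], 'What is EPS?')
```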
We now add another computed column to call OpenAI. For the `chat_completions()` call, we need to construct two messages, containing the instructions to the model and the prompt. For the latter, we can simply reference the `prompt` column we just added.

```python theme={null}
from pixeltable.functions import openai

# Assemble the prompt and instructions into OpenAI's message format
messages = [
    {
        'role': 'system',
        'content': 'Please read the following passages and answer the question based on their contents.',
    },
    {'role': 'user', 'content': queries_t.prompt},
]

# Add a computed column that calls OpenAI
queries_t.add_computed_column(
    response=openai.chat_completions(
        model='gpt-4o-mini', messages=messages
    )
)
```
  Added 8 column values with 0 errors.
  8 rows updated, 8 values computed.
Our `queries` table now contains a JSON-structured column `response`, which holds the entire API response structure. At the moment, we’re only interested in the response content, which we can extract easily into another computed column:

```python theme={null}
queries_t.add_computed_column(
    answer=queries_t.response.choices[0].message.content
)
```
  Added 8 column values with 0 errors.
  8 rows updated, 8 values computed.
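The path expression `queries_t.response.choices[0].message.content` mirrors plain dict/list indexing on the stored JSON. A mock response (shape abbreviated to just the fields we use, with made-up content) shows the equivalence:

```python
# A minimal mock of an OpenAI chat-completions response (abbreviated)
mock_response = {
    'choices': [
        {'message': {'role': 'assistant', 'content': 'forty-two'}}
    ]
}

# The computed column's path expression corresponds to:
answer = mock_response['choices'][0]['message']['content']
```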
We now have the following `queries` schema:

```python theme={null}
queries_t
```
Let’s take a look at what we got back:

```python theme={null}
queries_t.select(
    queries_t.Question, queries_t.correct_answer, queries_t.answer
).show()
```
The application works, but, as expected, a few questions couldn’t be answered due to the missing documents. As a final step, let’s add the remaining documents to our document base, and run the queries again.

## Incremental Updates

Pixeltable’s views and computed columns update automatically in response to new data. We can see this when we add the remaining documents to our `documents` table. Watch how the `chunks` view is updated to stay in sync with `documents`:

```python theme={null}
documents_t.insert({'document': p} for p in document_urls[3:])
```
  Inserting rows into `documents`: 3 rows [00:00, 569.05 rows/s]
  Inserting rows into `chunks`: 67 rows [00:00, 325.91 rows/s]
  Inserted 70 rows with 0 errors.
  70 rows inserted, 6 values computed.
```python theme={null}
documents_t.show()
```
(Note: although Pixeltable updates `documents` and `chunks`, it **does not** automatically update the `queries` table. This is by design: we don’t want every row in `queries` to be re-executed each time a single new document is added to the document base. However, newly added rows will run against the new, incrementally updated index.)

To confirm that the `chunks` index got updated, we’ll re-run the chunk-retrieval query for the question `What is the expected EPS for Nvidia in Q1 2026?` Previously, our most similar chunk had a similarity score of ~0.8. Let’s see what we get now:

```python theme={null}
nvidia_eps_query.collect()
```
Our most similar chunk now has a score of ~0.855 and pulls in more relevant chunks from the newly inserted documents. Let’s recompute the `question_context` column of the `queries_t` table, which will automatically recompute the `answer` column as well.

```python theme={null}
queries_t.recompute_columns('question_context')
```
  Inserting rows into `queries`: 8 rows [00:00, 580.60 rows/s]
  8 rows updated, 40 values computed.
As a final step, let’s confirm that all the queries now have answers:

```python theme={null}
queries_t.select(
    queries_t.Question, queries_t.correct_answer, queries_t.answer
).show()
```
# RAG Operations in Pixeltable

Source: https://docs.pixeltable.com/howto/use-cases/rag-operations

Open in Kaggle  Open in Colab  Download Notebook

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

In this tutorial, we’ll explore Pixeltable’s flexible handling of RAG operations on unstructured text. In a traditional AI workflow, such operations might be implemented as a Python script that runs on a periodic schedule or in response to certain events. In Pixeltable, as with everything else, they are implemented as persistent table operations that update incrementally as new data becomes available. In our tutorial workflow, we’ll chunk PDF documents in various ways with a document splitter, then apply several kinds of embeddings to the chunks.

## Set Up the Table Structure

We start by installing the necessary dependencies, creating a fresh Pixeltable directory `rag_ops_demo`, and setting up the table structure for our new workflow.

```python theme={null}
%pip install -qU pixeltable sentence-transformers spacy tiktoken
!python -m spacy download en_core_web_sm -q
```

```python theme={null}
import pixeltable as pxt

# Ensure a clean slate for the demo
pxt.drop_dir('rag_ops_demo', force=True)

# Create the Pixeltable workspace
pxt.create_dir('rag_ops_demo')
```

## Creating Tables and Views

Now we’ll create the tables that represent our workflow, starting with a table to hold references to source documents. The table contains a single column `source_doc` whose elements have type `pxt.Document`, representing a general document instance. In this tutorial, we’ll be working with PDF documents, but Pixeltable supports a range of other document types, such as Markdown and HTML.

```python theme={null}
docs = pxt.create_table('rag_ops_demo/docs', {'source_doc': pxt.Document})
```
  Created table 'docs'.
If we take a peek at the `docs` table, we see its very simple structure. ```python theme={null} docs ```
Next we create a view to represent chunks of our PDF documents. A Pixeltable view is a virtual table, which is dynamically derived from a source table by applying a transformation and/or selecting a subset of data. In this case, our view represents a one-to-many transformation from source documents into individual sentences. This is achieved using Pixeltable’s built-in `document_splitter` class. Note that the `docs` table is currently empty, so creating this view doesn’t actually *do* anything yet: it simply defines an operation that we want Pixeltable to execute when it sees new data. ```python theme={null} from pixeltable.functions.document import document_splitter sentences = pxt.create_view( 'rag_ops_demo/sentences', # Name of the view docs, # Table from which the view is derived iterator=document_splitter( docs.source_doc, separators='sentence', # Chunk docs into sentences metadata='title,heading,sourceline', ), ) ``` Let’s take a peek at the new `sentences` view. ```python theme={null} sentences ```
We see that `sentences` inherits the `source_doc` column from `docs`, together with some new fields: * `pos`: The position in the source document where the sentence appears. * `text`: The text of the sentence. * `title`, `heading`, and `sourceline`: The metadata we requested when we set up the view. ## Data Ingestion Ok, now it’s time to insert some data into our workflow. A document in Pixeltable is just a URL; the following command inserts a single row into the `docs` table with the `source_doc` field set to the specified URL: ```python theme={null} docs.insert( [ { 'source_doc': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Argus-Market-Digest-June-2024.pdf' } ] ) ```
  Inserting rows into \`docs\`: 1 rows \[00:00, 292.76 rows/s]
  Inserting rows into \`sentences\`: 217 rows \[00:00, 42910.00 rows/s]
  Inserted 218 rows with 0 errors.
  218 rows inserted, 2 values computed.
We can see that two things happened. First, a single row was inserted into `docs`, containing the URL representing our source PDF. Then, the view `sentences` was incrementally updated by applying the `document_splitter` according to the definition of the view. This illustrates an important principle in Pixeltable: by default, anytime Pixeltable sees new data, the update is incrementally propagated to any downstream views or computed columns. We can see the effect of the insertion with the `select` command. There’s a single row in `docs`: ```python theme={null} docs.select(docs.source_doc.fileurl).show() ```
And here are the first 20 rows in `sentences`. The content of the PDF is broken into individual sentences, as expected. ```python theme={null} sentences.select(sentences.text, sentences.heading).show(20) ```
## Experimenting with Chunking Of course, chunking into sentences isn’t the only way to split a document. Perhaps we want to experiment with different chunking methodologies, in order to see which one performs best in a particular application. Pixeltable makes it easy to do this, by creating several views of the same source table. Here are a few examples. Notice that as each new view is created, it is initially populated from the data already in `docs`. ```python theme={null} chunks = pxt.create_view( 'rag_ops_demo/chunks', docs, iterator=document_splitter( docs.source_doc, separators='sentence,token_limit', limit=2048, overlap=0, metadata='title,heading,sourceline', ), ) ```
  Inserting rows into \`chunks\`: 217 rows \[00:00, 47827.85 rows/s]
```python theme={null} short_chunks = pxt.create_view( 'rag_ops_demo/short_chunks', docs, iterator=document_splitter( docs.source_doc, separators='sentence,token_limit', limit=72, overlap=0, metadata='title,heading,sourceline', ), ) ```
  Inserting rows into \`short\_chunks\`: 219 rows \[00:00, 49104.70 rows/s]
```python theme={null} short_char_chunks = pxt.create_view( 'rag_ops_demo/short_char_chunks', docs, iterator=document_splitter( docs.source_doc, separators='sentence,char_limit', limit=72, overlap=0, metadata='title,heading,sourceline', ), ) ```
  Inserting rows into \`short\_char\_chunks\`: 459 rows \[00:00, 63241.10 rows/s]
```python theme={null} chunks.select(chunks.text, chunks.heading).show(20) ```
```python theme={null} short_chunks.select(short_chunks.text, short_chunks.heading).show(20) ```
```python theme={null} short_char_chunks.select( short_char_chunks.text, short_char_chunks.heading ).show(20) ```
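Conceptually, the `separators='sentence,token_limit'` configuration splits text at sentence boundaries and then packs consecutive sentences into chunks that stay under the token budget. Here is a rough pure-Python sketch of that packing logic (whitespace-separated words stand in for real tokenizer tokens, so the counts are only illustrative; this is not the implementation of `document_splitter`):

```python
def pack_sentences(sentences: list[str], limit: int) -> list[str]:
    # Pack consecutive sentences into chunks of at most `limit` "tokens".
    # A single sentence longer than the limit still becomes its own chunk.
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sent in sentences:
        n = len(sent.split())  # crude stand-in for a real token count
        if current and count + n > limit:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(' '.join(current))
    return chunks

sents = ['One two three.', 'Four five.', 'Six seven eight nine.']
print(pack_sentences(sents, limit=5))
# -> ['One two three. Four five.', 'Six seven eight nine.']
```

This also shows why `short_chunks` (limit=72 tokens) produced slightly more rows than `sentences`: whenever a single sentence exceeds the budget, it gets split further.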
Now let’s add a few more documents to our workflow. Notice how all of the downstream views are updated incrementally, processing just the new documents as they are inserted. ```python theme={null} urls = [ 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Argus-Market-Watch-June-2024.pdf', 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Company-Research-Alphabet.pdf', 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Zacks-Nvidia-Report.pdf', ] docs.insert({'source_doc': url} for url in urls) ```
  Inserting rows into \`docs\`: 3 rows \[00:00, 1969.77 rows/s]
  Inserting rows into \`chunks\`: 742 rows \[00:00, 61926.41 rows/s]
  Inserting rows into \`short\_chunks\`: 747 rows \[00:00, 67743.68 rows/s]
  Inserting rows into \`sentences\`: 742 rows \[00:00, 67949.90 rows/s]
  Inserting rows into \`short\_char\_chunks\`: 1165 rows \[00:00, 3603.41 rows/s]
  Inserted 3399 rows with 0 errors.
  3399 rows inserted, 6 values computed.
## Further Experiments This is a good time to mention another important guiding principle of Pixeltable. The preceding examples all used the built-in `document_splitter` class with various configurations. That’s probably fine as a first cut or to prototype an application quickly, and it might be sufficient for some applications. But other applications might want to do more sophisticated kinds of chunking, implementing their own specialized logic or leveraging third-party tools. Pixeltable imposes no constraints on the AI or RAG operations a workflow uses: the iterator interface is highly general, and it’s easy to implement new operations or adapt existing code or third-party tools into the Pixeltable workflow. ## Computing Embeddings Next, let’s look at how embedding indices can be added seamlessly to existing Pixeltable workflows. To compute our embeddings, we’ll use the Huggingface `sentence_transformer` package, running it over the `chunks` view that broke our documents up into sentence-based chunks. Pixeltable has a built-in `sentence_transformer` adapter, and all we have to do is add a new column that leverages it. Pixeltable takes care of the rest, applying the new column to all existing data in the view. ```python theme={null} from pixeltable.functions.huggingface import sentence_transformer chunks.add_computed_column( minilm_embed=sentence_transformer( chunks.text, model_id='paraphrase-MiniLM-L6-v2' ) ) ```
  Added 959 column values with 0 errors.
  959 rows updated, 959 values computed.
The new column is a *computed column*: it is defined as a function on top of existing data and updated incrementally as new data are added to the workflow. Let’s have a look at how the new column affected the `chunks` view. ```python theme={null} chunks ```
```python theme={null} chunks.select(chunks.text, chunks.heading, chunks.minilm_embed).head() ```
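The incremental behavior of computed columns can be pictured with a toy model. The sketch below is not Pixeltable internals, just an illustration of the contract: when new rows arrive, the computed function runs only over the new rows, and previously computed values are left untouched:

```python
class ToyTable:
    """Toy model of a table with one computed column (illustrative only)."""

    def __init__(self, fn):
        self.rows: list[str] = []
        self.computed: list[int] = []
        self.fn = fn

    def insert(self, new_rows: list[str]) -> int:
        # Only the newly inserted rows are passed through fn;
        # existing computed values are never recomputed.
        self.rows.extend(new_rows)
        new_values = [self.fn(r) for r in new_rows]
        self.computed.extend(new_values)
        return len(new_values)  # number of values computed by this call

t = ToyTable(len)
print(t.insert(['alpha', 'beta']))  # computes 2 values
print(t.insert(['gamma']))          # computes only 1 more
print(t.computed)                   # [5, 4, 5]
```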
Similarly, we might want to add a CLIP embedding to our workflow; once again, it’s just another computed column: ```python theme={null} from pixeltable.functions.huggingface import clip chunks.add_computed_column( clip_embed=clip(chunks.text, model_id='openai/clip-vit-base-patch32') ) ```
  Added 959 column values with 0 errors.
  959 rows updated, 959 values computed.
```python theme={null} chunks ```
```python theme={null} chunks.select(chunks.text, chunks.heading, chunks.clip_embed).head() ```
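Once embeddings like `minilm_embed` are in place, retrieval reduces to vector similarity between a query embedding and the stored chunk embeddings. As a toy, pure-Python illustration of cosine similarity over made-up 3-dimensional vectors (Pixeltable's embedding indices handle this for you at scale; this is not the Pixeltable API, and the chunk names are invented):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

query = [1.0, 0.0, 0.0]
chunk_embeddings = {
    'market digest': [0.9, 0.1, 0.0],
    'nvidia report': [0.0, 1.0, 0.2],
}
# Rank chunks by similarity to the query vector.
best = max(chunk_embeddings, key=lambda k: cosine(query, chunk_embeddings[k]))
print(best)  # -> 'market digest'
```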
# Using Label Studio for Annotations with Pixeltable Source: https://docs.pixeltable.com/howto/using-label-studio-with-pixeltable Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. This tutorial demonstrates how to integrate Pixeltable with Label Studio, in order to provide seamless management of annotations data across the annotation workflow. We’ll assume that you’re at least somewhat familiar with Pixeltable and have read the [10-Minute Tour](/overview/ten-minute-tour) tutorial. **This tutorial can only be run in a local Pixeltable installation, not in Colab or Kaggle**, since it relies on spinning up a locally running Label Studio instance. See the [Quick Start](/overview/quick-start) guide for instructions on how to set up a local Pixeltable instance. To begin, let’s ensure the requisite dependencies are installed. ```python theme={null} %pip install -qU pixeltable label-studio label-studio-sdk torch transformers ``` ## Set up Label Studio Now let’s spin up a Label Studio server process. (If you’re already running Label Studio, you can choose to skip this step, and instead enter your existing Label Studio URL and access token in the subsequent step.) Be patient, as it may take a minute or two to start. This will open a new browser window containing the Label Studio interface. If you’ve never run Label Studio before, you’ll need to create an account; a link to create one will appear in the Label Studio browser window. **Everything is running locally in this tutorial, so the account will exist only on your local system.** ```python theme={null} import subprocess ls_process = subprocess.Popen(['label-studio'], stderr=subprocess.PIPE) ```
  January 23, 2026 - 01:41:50
  Django version 5.1.15, using settings 'label\_studio.core.settings.label\_studio'
  Starting development server at [http://0.0.0.0:8080/](http://0.0.0.0:8080/)
  Quit the server with CONTROL-C.
If for some reason the Label Studio browser window failed to open, you can always access it at: [http://localhost:8080/](http://localhost:8080/) Once you’ve created an account in Label Studio, you’ll need to locate your API key. In the Label Studio browser window, log in, click “Organization”, “API Tokens Settings”, and enable “Legacy Tokens”. Then click on “Account & Settings” in the top right, click “Legacy Token”, and copy the Access Token from the interface. ## Configure Pixeltable Next, we configure Pixeltable to communicate with Label Studio. Run the following command, pasting in the API key that you copied from the Label Studio interface. ```python theme={null} import getpass import os if 'LABEL_STUDIO_URL' not in os.environ: os.environ['LABEL_STUDIO_URL'] = 'http://localhost:8080/' if 'LABEL_STUDIO_API_KEY' not in os.environ: os.environ['LABEL_STUDIO_API_KEY'] = getpass.getpass( 'Label Studio API key: ' ) ``` ## Create a Table to Store Videos Now we create the master table that will hold our videos to be annotated. This only needs to be done once, when we initially set up the workflow. ```python theme={null} import pixeltable as pxt schema = {'video': pxt.Video, 'date': pxt.Timestamp} # Before creating the table, we drop the `ls_demo` dir and all its contents, # in order to ensure a clean environment for the demo. pxt.drop_dir('ls_demo', force=True) pxt.create_dir('ls_demo') videos_table = pxt.create_table('ls_demo/videos', schema) ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'ls\_demo'.
  Created table 'videos'.
## Populate It with Data Now let’s add some videos to the table to populate it. For this tutorial, we’ll use some randomly selected videos from the Multimedia Commons archive. The table also contains a `date` field, for which we’ll use a fixed date (but in a production setting, it would typically be the date on which the video was imported). ```python theme={null} from datetime import datetime url_prefix = 'http://multimedia-commons.s3-website-us-west-2.amazonaws.com/data/videos/mp4/' files = [ '122/8ff/1228ff94bf742242ee7c88e4769ad5d5.mp4', '2cf/a20/2cfa205eae979b31b1144abd9fa4e521.mp4', 'ffe/ff3/ffeff3c6bf57504e7a6cecaff6aefbc9.mp4', ] today = datetime(2024, 4, 22) videos_table.insert( {'video': url_prefix + file, 'date': today} for file in files ) ```
  Inserted 3 rows with 0 errors in 1.07 s (2.81 rows/s)
  3 rows inserted.
Let’s have a look at the table now. ```python theme={null} videos_table.head() ```
## Create a Label Studio project Next we’ll create a new Label Studio project and link it to a new view on the Pixeltable table. You can link a Label Studio project to either a table or a view. For tables that are expecting a lot of input data, it’s often easier to link to views. In this example, we’ll create a view that filters the table down by date. ```python theme={null} # Create a view to filter on the specified date v = pxt.create_view( 'ls_demo/videos_2024_04_22', videos_table.where(videos_table.date == today), ) # Create a new Label Studio project and link it to the view. The # configuration uses Label Studio's standard XML format. This only # needs to be done once: after the view and project are linked, # the relationship is stored indefinitely in Pixeltable's metadata. label_config = """
<View>
  <Video name="video" value="$video"/>
  <Choices name="video-category" toName="video" showInLine="true">
    <Choice value="city"/>
    <Choice value="food"/>
    <Choice value="sports"/>
  </Choices>
</View>
""" pxt.io.create_label_studio_project(v, label_config) ```
  Added 3 column values with 0 errors in 0.01 s (355.10 rows/s)
  Added 3 column values with 0 errors in 0.02 s (146.19 rows/s)
  Linked external store 'ls\_project\_0' to table 'videos\_2024\_04\_22'.
  Created 3 new task(s) in LabelStudioProject \`videos\_2024\_04\_22\`.
  No rows affected.
If you look in the Label Studio UI now, you’ll see that there’s a new project with the name `videos_2024_04_22`, with three tasks, one for each of the videos in the view. If you want to create the project without populating it with tasks (yet), you can set `sync_immediately=False` in the call to `create_label_studio_project()`. You can always sync the table and project by calling `v.sync()`. Note also that we didn’t have to specify an explicit mapping between Pixeltable columns and Label Studio data fields. This is because, by default, Pixeltable assumes the Pixeltable and Label Studio field names coincide. The data field in the Label Studio project has the name `$video`, which Pixeltable maps, by default, to the column in `ls_demo.videos_2024_04_22` that is also called `video`. If you want to override this behavior to specify an explicit mapping of columns to fields, you can do that with the `col_mapping` parameter of `create_label_studio_project()`. Inspecting the view, we also see that Pixeltable created an additional column on the view, `annotations`, which will hold the output of our annotations workflow. The name of the output column can also be overridden by specifying a dict entry in `col_mapping` of the form `{'my_col_name': 'annotations'}`. ```python theme={null} v ```
## Add Some Annotations Now, let’s add some annotations to our Label Studio project to simulate a human-in-the-loop workflow. In the Label Studio UI, click on the new `videos_2024_04_22` project, and click on any of the three tasks. Select the appropriate category (“city”, “food”, or “sports”), and click “Submit”. ## Import the Annotations Back To Pixeltable Now let’s try importing annotations from Label Studio back to our view. ```python theme={null} v = pxt.get_table('ls_demo/videos_2024_04_22') v.sync() ```
  Created 0 new task(s) in LabelStudioProject \`videos\_2024\_04\_22\`.
  Updated annotation(s) from 3 task(s) in LabelStudioProject \`videos\_2024\_04\_22\`.
  3 rows updated.
Let’s see what effect that had. You’ll see that any videos that you annotated now have their `annotations` field populated in the view. ```python theme={null} v.select(v.video, v.annotations).head() ```
## Parse Annotations with a Computed Column Pixeltable pulls in all sorts of metadata from Label Studio during a sync: everything that Label Studio reports back about the annotations, including things like the user account that created the annotations. Let’s say that all we care about is the annotation value. We can add a computed column to our table to pull it out. ```python theme={null} v.add_computed_column( video_category=v.annotations[0].result[0].value.choices[0] ) v.select(v.video, v.annotations, v.video_category).head() ```
  Added 3 column values with 0 errors in 0.02 s (143.55 rows/s)
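The expression `v.annotations[0].result[0].value.choices[0]` walks the annotation JSON that Label Studio returns. The same navigation in plain Python, over a simplified payload (real Label Studio annotations carry additional metadata such as the annotating user and timestamps):

```python
# Simplified shape of one row's `annotations` value after a sync.
annotations = [
    {
        'result': [
            {'value': {'choices': ['food']}}
        ]
    }
]

# Mirrors the computed-column path: annotations[0].result[0].value.choices[0]
video_category = annotations[0]['result'][0]['value']['choices'][0]
print(video_category)  # -> 'food'
```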
Another useful operation is the `get_metadata` function, which returns information about the video itself, such as the resolution and codec (independent of Label Studio). Let’s add another computed column to hold such metadata. ```python theme={null} from pixeltable.functions.video import get_metadata v.add_computed_column(video_metadata=get_metadata(v.video)) v.select( v.video, v.annotations, v.video_category, v.video_metadata ).head() ```
  Added 3 column values with 0 errors in 0.03 s (115.36 rows/s)
## Preannotations with Pixeltable and Label Studio Frame extraction is another common operation in labeling workflows. In this example, we’ll extract frames from our videos into a view, then use an object detection model to generate preannotations for each frame. The following code uses a Pixeltable `frame_iterator` to automatically extract frames into a new view, which we’ll call `frames_2024_04_22`. ```python theme={null} from datetime import datetime from pixeltable.functions.video import frame_iterator today = datetime(2024, 4, 22) videos_table = pxt.get_table('ls_demo/videos') # Create the view, using a `frame_iterator` to extract frames with a sample rate # of `fps=0.25`, or 1 frame per 4 seconds of video. Setting `fps=0` would use the # native framerate of the video, extracting every frame. frames = pxt.create_view( 'ls_demo/frames_2024_04_22', videos_table.where(videos_table.date == today), iterator=frame_iterator(videos_table.video, fps=0.25), ) ``` ```python theme={null} # Show just the first 3 frames in the table, to avoid cluttering the notebook frames.select(frames.frame).head(3) ```
Now we’ll use the Resnet-50 object detection model to generate preannotations. We do this by creating a new computed column. ```python theme={null} from pixeltable.functions.huggingface import detr_for_object_detection # Run the Resnet-50 object detection model against each frame to generate bounding boxes frames.add_computed_column( detections=detr_for_object_detection( frames.frame, model_id='facebook/detr-resnet-50', threshold=0.95 ) ) frames.select(frames.frame, frames.detections).head(3) ```
  Added 11 column values with 0 errors in 9.71 s (1.13 rows/s)
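The Huggingface DETR adapter reports bounding boxes as absolute `(x1, y1, x2, y2)` corner coordinates, while COCO-style bounding boxes use `[x, y, width, height]`. The arithmetic of that conversion looks like this (illustrative only, not the implementation of Pixeltable's `detr_to_coco`):

```python
def corners_to_coco(box: list[float]) -> list[float]:
    # (x1, y1, x2, y2) absolute corners -> [x, y, width, height]
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]

print(corners_to_coco([10.0, 20.0, 110.0, 70.0]))  # -> [10.0, 20.0, 100.0, 50.0]
```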
We’d like to send these detections to Label Studio as preannotations, but they’re not quite ready. Label Studio expects preannotations in standard COCO format, but the Huggingface library outputs them in its own custom format. We can use Pixeltable’s handy `detr_to_coco` function to do the conversion, using another computed column. ```python theme={null} from pixeltable.functions.huggingface import detr_to_coco frames.add_computed_column( preannotations=detr_to_coco(frames.frame, frames.detections) ) frames.select( frames.frame, frames.detections, frames.preannotations ).head(3) ``` ## Create a Label Studio Project for Frames With our data workflow set up and the COCO preannotations prepared, all that’s left is to create a corresponding Label Studio project. Note how Pixeltable automatically maps `RectangleLabels` preannotation fields to columns, just like it does with data fields. Here, Pixeltable interprets the `name="preannotations"` attribute in `RectangleLabels` to mean, “map these rectangle labels to the `preannotations` column in my linked table or view”. The Label values `car`, `person`, and `train` are standard COCO object identifiers used by many off-the-shelf object detection models. You can find the complete list of them here, and include as many as you wish: [https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/coco-categories.csv](https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/coco-categories.csv) ```python theme={null} frames_config = """
<View>
  <Image name="frame" value="$frame"/>
  <RectangleLabels name="preannotations" toName="frame">
    <Label value="car"/>
    <Label value="person"/>
    <Label value="train"/>
  </RectangleLabels>
</View>
""" pxt.io.create_label_studio_project(frames, frames_config) ``` If you go into Label Studio and open up the new project, you can see the effect of adding the preannotations from Resnet-50 to our workflow. ## Incremental Updates As we saw in the [10-Minute Tour](/overview/ten-minute-tour) tutorial, adding new data to Pixeltable results in incremental updates of everything downstream. 
We can see this by inserting a new video into our base videos table: all of the downstream views and computed columns are updated automatically, including the video metadata, frames, and preannotations. The update may take some time, so please be patient (it involves a sequence of operations, including frame extraction and object detection). ```python theme={null} videos_table.insert( video=url_prefix + '22a/948/22a9487a92956ac453a9c15e0fc4dd4.mp4', date=today, ) ``` Note that the incremental updates do *not* automatically sync the `Table` with the remote Label Studio projects. To issue a sync, we have to call the `sync()` methods separately. Note that tasks will be created only for the *newly added* rows in the videos and frames views, not the existing ones. ```python theme={null} v.sync() frames.sync() ``` ## Deleting a Project To remove a Label Studio project from a table or view, use `unlink_external_stores()`, as demonstrated by the following example. If you specify `delete_external_data=True`, then the Label Studio project will also be deleted, along with all existing data and annotations (be careful!) If `delete_external_data=False`, then the Label Studio project will be unlinked from Pixeltable, but the project and data will remain in Label Studio (so you’ll need to delete the project manually if you later want to get rid of it). ```python theme={null} v.external_stores # Get a list of all external stores for `v` ``` ```python theme={null} v.unlink_external_stores('ls_project_0', delete_external_data=True) ``` ## Configuring `media_import_method` All of the examples so far in this tutorial use HTTP file uploads to send media data to Label Studio. This is the simplest method and the easiest to configure, but it’s undesirable for complex projects or projects with a lot of data. 
In fact, the Label Studio documentation includes this specific warning: “Uploading data works fine for proof of concept projects, but it is not recommended for larger projects.” In Pixeltable, you can configure linked Label Studio projects to use URLs for media data (instead of file uploads) by specifying the `media_import_method='url'` argument in `create_label_studio_project`. This is recommended for all production applications, and is mandatory for projects whose input configuration is more complex than a single media file (in the Label Studio parlance, projects with more than one “data key”). If `media_import_method='url'`, then Pixeltable will simply pass the media data URLs directly to Label Studio. If the URLs are `http://` or `https://` URLs, then nothing more needs to be done. Label Studio also supports `s3://` URLs with credentialed access. To use them, you’ll need to configure access to your bucket in the project configuration. The simplest way to do this is by specifying an `s3_configuration` in `create_label_studio_project`. Here’s an example, though it won’t work directly in this demo notebook, since it relies on having an access key. (If your AWS credentials are stored in `~/.aws/credentials`, then you can omit the access key and secret, and Pixeltable will fill them in automatically.) 
```python theme={null} pxt.io.create_label_studio_project( v, label_config, media_import_method='url', s3_configuration={ 'bucket': 'pxt-test', 'aws_access_key_id': my_key, 'aws_secret_access_key': my_secret, }, ) ``` Before you can set up credentialed S3 access, you’ll need to configure your S3 bucket to work with Label Studio; the details on how to do this are described here: * [Label Studio Docs: Amazon S3](https://labelstud.io/guide/storage.html#Amazon-S3) For the full documentation on `create_label_studio_project` usage, see: * [Pixeltable SDK Docs: create\_label\_studio\_project()](/sdk/latest/io#func-create_label_studio_project) ## Notebook Cleanup That’s the end of the tutorial! To conclude, let’s terminate the running Label Studio process. (Of course, feel free to leave it running if you want to play around with it some more.) ```python theme={null} ls_process.kill() ``` # Working with Voxel51 for Visualization in Pixeltable Source: https://docs.pixeltable.com/howto/working-with-fiftyone Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Pixeltable can export data directly from tables and views to the popular [Voxel51](https://voxel51.com/) frontend, providing a way to visualize and explore image and video datasets. In this tutorial, we’ll learn how to: * Export data from Pixeltable to Voxel51 * Apply labels from image classification and object detection models to exported data We begin by installing the necessary libraries for this tutorial. ```python theme={null} %pip install -qU pixeltable fiftyone torch transformers ``` ## Example 1: An Image Dataset ```python theme={null} import fiftyone as fo import pixeltable as pxt # Create a Pixeltable directory for the demo. 
# We first drop the directory if it exists, in order to ensure a clean environment. pxt.drop_dir('fo_demo', force=True) pxt.create_dir('fo_demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'fo\_demo'.
```python theme={null} # Create a Pixeltable table for our dataset and insert some sample images. url_prefix = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images' urls = [ 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000019.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000025.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000030.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000034.jpg', ] t = pxt.create_table('fo_demo/images', {'image': pxt.Image}) t.insert({'image': url} for url in urls) t.head() ```
  Created table 'images'.
  Inserted 4 rows with 0 errors in 0.71 s (5.60 rows/s)
Now we export our new table to a Voxel51 dataset and load it into a new Voxel51 session within our demo notebook. Once it’s been loaded, the images can be interactively navigated as with any other Voxel51 dataset. ```python theme={null} fo_dataset = pxt.io.export_images_as_fo_dataset(t, t.image) session = fo.launch_app(fo_dataset) ```
  You are running the oldest supported major version of MongoDB. Please refer to [https://deprecation.voxel51.com](https://deprecation.voxel51.com) for deprecation notices. You can suppress this exception by setting your \`database\_validation\` config parameter to \`False\`. See [https://docs.voxel51.com/user\_guide/config.html#configuring-a-mongodb-connection](https://docs.voxel51.com/user_guide/config.html#configuring-a-mongodb-connection) for more information.
   28 \[31.4ms elapsed, ? remaining, 890.5 samples/s] 
## Adding Labels We’ll now show how Voxel51 labels can be attached to the exported dataset. Currently, Pixeltable supports only classification and detection labels; other Voxel51 label types may be added in the future. First, let’s generate some labels by applying two models from the Huggingface `transformers` library: A ViT model for image classification and a DETR model for object detection. ```python theme={null} from pixeltable.functions.huggingface import ( detr_for_object_detection, vit_for_image_classification, ) t.add_computed_column( classifications=vit_for_image_classification( t.image, model_id='google/vit-base-patch16-224' ) ) t.add_computed_column( detections=detr_for_object_detection( t.image, model_id='facebook/detr-resnet-50' ) ) ```
  Added 4 column values with 0 errors in 4.17 s (0.96 rows/s)
  Added 4 column values with 0 errors in 2.72 s (1.47 rows/s)
  4 rows updated.
Both models output JSON containing the model results. Let’s peek at the contents of our table now: ```python theme={null} t.head() ```
Now we need to transform our model data into the format the Voxel51 API expects (see the Pixeltable documentation for [pxt.io.export\_images\_as\_fo\_dataset](/sdk/latest/io#func-export_images_as_fo_dataset) for details). We’ll use Pixeltable UDFs to do the appropriate conversions. ```python theme={null} @pxt.udf def vit_to_fo(vit_labels: list) -> list: return [ {'label': label, 'confidence': score} for label, score in zip( vit_labels['label_text'], vit_labels['scores'] ) ] @pxt.udf def detr_to_fo(img: pxt.Image, detr_labels: dict) -> list: result = [] for label, box, score in zip( detr_labels['label_text'], detr_labels['boxes'], detr_labels['scores'], ): # DETR gives us bounding boxes in (x1,y1,x2,y2) absolute (pixel) coordinates. # Voxel51 expects (x,y,w,h) relative (fractional) coordinates. # So we need to do a conversion. fo_box = [ box[0] / img.width, box[1] / img.height, (box[2] - box[0]) / img.width, (box[3] - box[1]) / img.height, ] result.append( {'label': label, 'bounding_box': fo_box, 'confidence': score} ) return result ``` We can test that our UDFs are working as expected with a `select()` statement. ```python theme={null} t.select( t.image, t.classifications, vit_to_fo(t.classifications), t.detections, detr_to_fo(t.image, t.detections), ).head() ```
Now we pass the modified structures to `export_images_as_fo_dataset`. ```python theme={null} fo_dataset = pxt.io.export_images_as_fo_dataset( t, t.image, classifications=vit_to_fo(t.classifications), detections=detr_to_fo(t.image, t.detections), ) session = fo.launch_app(fo_dataset) ```
   28 \[41.8ms elapsed, ? remaining, 669.2 samples/s] 
## Adding Multiple Label Sets You can include multiple label sets of the same type in the same dataset by passing a `list` or `dict` of expressions to the `classifications` and/or `detections` parameters. If a `list` is specified, default names will be assigned to the label sets; if a `dict` is specified, the label sets will be named according to its keys. As an example, let’s try recomputing our detections using the more powerful ResNet-101 variant of DETR, and then load them into the same Voxel51 dataset as the earlier detections in order to compare them side-by-side. ```python theme={null} t.add_computed_column( detections_101=detr_for_object_detection( t.image, model_id='facebook/detr-resnet-101' ) ) ```
  Added 4 column values with 0 errors in 21.91 s (0.18 rows/s)
  4 rows updated.
```python theme={null} fo_dataset = pxt.io.export_images_as_fo_dataset( t, t.image, classifications=vit_to_fo(t.classifications), detections={ 'detections_50': detr_to_fo(t.image, t.detections), 'detections_101': detr_to_fo(t.image, t.detections_101), }, ) session = fo.launch_app(fo_dataset) ```
   28 \[44.2ms elapsed, ? remaining, 633.4 samples/s] 
Exploring the resulting images, we can see that the results are not much different between the two models, at least on our small sample dataset. # Cloud Storage Source: https://docs.pixeltable.com/integrations/cloud-storage Store and manage media files in cloud storage providers like S3, GCS, Azure, and more Pixeltable supports storing media files (images, videos, audio, documents) in external cloud storage providers instead of local disk. This is essential for production deployments, enabling scalable storage, team collaboration, and integration with existing data infrastructure. ## Supported providers * **Amazon S3**: Native S3 storage with full feature support * **Google Cloud Storage**: GCS buckets with gs\:// URI scheme * **Azure Blob Storage**: Azure containers with wasb\:// or abfs\:// schemes * **Cloudflare R2**: S3-compatible storage with zero egress fees * **Backblaze B2**: Cost-effective S3-compatible storage * Globally distributed S3-compatible storage ## How it works When you configure a storage destination, Pixeltable automatically: 1. **Uploads computed media** — AI-generated images, extracted video frames, and other computed media files are stored in your bucket 2. **Copies input media** — Optionally persists referenced media files for durability 3. **Manages file lifecycle** — Cleans up files when table data is deleted 4. **Handles caching** — Downloads files on-demand with intelligent local caching ## Configuration There are two ways to configure cloud storage destinations: ### Global default destinations Set default destinations for all media columns in your `config.toml` (see [Configuration](/platform/configuration) for details): ```toml theme={null} [pixeltable] # For input media (inserted/referenced files) input_media_dest = "s3://my-bucket/input/" # For computed media (AI-generated outputs) output_media_dest = "s3://my-bucket/output/" ``` Or via environment variables: ```bash theme={null} export PIXELTABLE_INPUT_MEDIA_DEST="s3://my-bucket/input/" export PIXELTABLE_OUTPUT_MEDIA_DEST="s3://my-bucket/output/" ``` Configure these before creating tables. 
All media columns will automatically use the configured destinations.

### Per-column destination (computed columns only)

For **computed columns**, you can override the default with a specific destination:

```python theme={null}
import pixeltable as pxt

# Create a table with input media column
# (uses global input_media_dest if configured)
t = pxt.create_table('my_app/images', {'image': pxt.Image})

# Add computed column with explicit destination
t.add_computed_column(
    thumbnail=t.image.resize((128, 128)),
    destination='s3://my-bucket/thumbnails/'
)
```

The `destination` parameter only applies to **stored computed columns**. For input columns, use the global `input_media_dest` configuration.

### Precedence rules

Destinations are resolved in this order:

1. **Explicit column destination** — highest priority (computed columns only)
2. **Global default** — `input_media_dest` for input columns, `output_media_dest` for computed columns
3. **Local storage** — fallback if no destination is configured

## Provider configuration

### Amazon S3

```
s3://bucket-name/optional/prefix/
```

Uses standard AWS credential chain:

* Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
* AWS credentials file (`~/.aws/credentials`)
* IAM role (when running on AWS)

Optionally specify a profile in `config.toml`:

```toml theme={null}
[pixeltable]
s3_profile = "my-aws-profile"
```

```python theme={null}
import pixeltable as pxt

# With global config: output_media_dest = "s3://my-bucket/output/"
t = pxt.create_table('app/images', {'photo': pxt.Image})

# Or set destination per computed column
t.add_computed_column(
    thumbnail=t.photo.resize((256, 256)),
    destination='s3://my-production-bucket/thumbnails/'
)
```

### Google Cloud Storage

```
gs://bucket-name/optional/prefix/
```

Uses Google Cloud Application Default Credentials:

* Service account key file (`GOOGLE_APPLICATION_CREDENTIALS`)
* gcloud CLI authentication
* GCE metadata service (when running on GCP)

```bash theme={null}
pip install google-cloud-storage
```

```python theme={null}
# With global config: output_media_dest = "gs://my-gcs-bucket/output/"
t = pxt.create_table('app/videos', {'video': pxt.Video})

# Or set destination per computed column
t.add_computed_column(
    audio=t.video.extract_audio(format='mp3'),
    destination='gs://my-gcs-bucket/audio/'
)
```

### Azure Blob Storage

Azure supports multiple URI schemes:

```
wasbs://container@account.blob.core.windows.net/prefix/
abfss://container@account.dfs.core.windows.net/prefix/
```

Configure in `config.toml`:

```toml theme={null}
[azure]
storage_account_name = "myaccount"
storage_account_key = "your-key-here"
```

Or via environment variables:

```bash theme={null}
export AZURE_STORAGE_ACCOUNT_NAME="myaccount"
export AZURE_STORAGE_ACCOUNT_KEY="your-key-here"
```

```bash theme={null}
pip install azure-storage-blob
```

```python theme={null}
# With global config: output_media_dest = "wasbs://mycontainer@myaccount.blob.core.windows.net/output/"
t = pxt.create_table('app/photos', {'photo': pxt.Image})

# Or set destination per computed column
t.add_computed_column(
    preview=t.photo.resize((512, 512)),
    destination='wasbs://mycontainer@myaccount.blob.core.windows.net/previews/'
)
```

### Cloudflare R2

```
https://account-id.r2.cloudflarestorage.com/bucket-name/prefix/
```

Create an R2 API token and configure AWS-style credentials.

In `~/.aws/credentials`:

```ini theme={null}
[r2]
aws_access_key_id = your-r2-access-key
aws_secret_access_key = your-r2-secret-key
```

In `config.toml`:

```toml theme={null}
[pixeltable]
r2_profile = "r2"
```

```python theme={null}
t = pxt.create_table('app/images', {'image': pxt.Image})
t.add_computed_column(
    rotated=t.image.rotate(90),
    destination='https://abc123.r2.cloudflarestorage.com/my-bucket/processed/'
)
```

### Backblaze B2

```
https://s3.region.backblazeb2.com/bucket-name/prefix/
```

Create B2 application keys and configure AWS-style credentials.
In `~/.aws/credentials`:

```ini theme={null}
[b2]
aws_access_key_id = your-b2-key-id
aws_secret_access_key = your-b2-application-key
```

In `config.toml`:

```toml theme={null}
[pixeltable]
b2_profile = "b2"
```

```python theme={null}
t = pxt.create_table('app/photos', {'photo': pxt.Image})
t.add_computed_column(
    rotated=t.photo.rotate(180),
    destination='https://s3.us-west-004.backblazeb2.com/my-bucket/processed/'
)
```

### Tigris

```
https://t3.storage.dev/bucket-name/prefix/
```

Configure AWS-style credentials for Tigris.

In `~/.aws/credentials`:

```ini theme={null}
[tigris]
aws_access_key_id = your-tigris-access-key
aws_secret_access_key = your-tigris-secret-key
```

In `config.toml`:

```toml theme={null}
[pixeltable]
tigris_profile = "tigris"
```

```python theme={null}
t = pxt.create_table('app/media', {'file': pxt.Image})
t.add_computed_column(
    thumbnail=t.file.resize((128, 128)),
    destination='https://t3.storage.dev/my-bucket/thumbnails/'
)
```

## Complete example

Here's a full example using S3 for both input and computed media.
First, configure your global destinations in `~/.pixeltable/config.toml`:

```toml theme={null}
[pixeltable]
input_media_dest = "s3://my-app-bucket/uploads/"
output_media_dest = "s3://my-app-bucket/generated/"
s3_profile = "my-aws-profile"  # optional, uses default credentials if not set
```

Then create your table and add computed columns:

```python theme={null}
import pixeltable as pxt
from pixeltable.functions import openai

# Create a table — input media automatically goes to input_media_dest
t = pxt.create_table('production/photos', {'photo': pxt.Image})

# Add a computed column for thumbnails
# Uses output_media_dest by default, or specify a custom destination
t.add_computed_column(
    thumbnail=t.photo.resize((256, 256)),
    destination='s3://my-app-bucket/thumbnails/'  # override default
)

# Add AI-generated descriptions (uses output_media_dest)
t.add_computed_column(
    description=openai.vision(
        prompt="Describe this image briefly.",
        image=t.photo,
        model='gpt-4o-mini'
    )
)

# Insert data — Pixeltable handles all uploads automatically
t.insert([
    {'photo': 'https://example.com/image1.jpg'},
    {'photo': '/local/path/to/image2.png'},
])

# Query as usual — files are streamed/cached as needed
t.select(t.photo, t.thumbnail, t.description).collect()
```

## Best practices

Structure your bucket with prefixes that reflect your application:

```
s3://my-bucket/
├── production/
│   ├── uploads/
│   └── generated/
└── staging/
    ├── uploads/
    └── generated/
```

Use different prefixes or buckets for input vs computed media:

* Easier to set different retention policies
* Clearer cost attribution
* Simpler backup strategies

Set up bucket lifecycle policies to automatically:

* Transition old data to cheaper storage tiers
* Delete temporary/staging data after a period
* Enable versioning for critical data

When running on cloud infrastructure, use IAM roles instead of access keys:

* More secure (no key rotation needed)
* Automatic credential refresh
* Better audit trails

## Troubleshooting

Verify your
credentials have the necessary permissions:

* `s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`
* `s3:ListBucket` for the bucket

For GCS: `storage.objects.create`, `storage.objects.get`, `storage.objects.delete`

* Ensure the bucket exists and the name is spelled correctly
* Check the region matches your credential configuration
* For S3-compatible providers, verify the endpoint URL is correct
* Pixeltable uses connection pooling and parallel uploads automatically
* Consider using a bucket in the same region as your compute
* Check your network bandwidth and latency

See the complete list of storage configuration options including profiles for S3, R2, B2, Tigris, and Azure.

Need help setting up cloud storage? Join our [Discord community](https://discord.com/invite/QPyqFYx2UN) for support.

# Embedding Models

Source: https://docs.pixeltable.com/integrations/embedding-model

Learn how to integrate custom embedding models with Pixeltable

Pixeltable provides extensive built-in support for popular embedding models, but you can also easily integrate your own custom embedding models. This guide shows you how to create and use custom embedding functions for any model architecture.
## Quick start

Here's a simple example using a custom BERT model:

```python theme={null}
import tensorflow as tf
import tensorflow_hub as hub
import pixeltable as pxt

@pxt.udf
def custom_bert_embed(text: str) -> pxt.Array[(512,), pxt.Float]:
    """Basic BERT embedding function"""
    preprocessor = hub.load('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3')
    model = hub.load('https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2')
    tensor = tf.constant([text])
    result = model(preprocessor(tensor))['pooled_output']
    return result.numpy()[0, :]

# Create table and add embedding index
docs = pxt.create_table('documents', {'text': pxt.String})
docs.add_embedding_index('text', string_embed=custom_bert_embed)
```

## Production best practices

The quick start example works but isn't production-ready. Below we'll cover how to optimize your custom embedding UDFs.

### Model caching

Always cache your model instances to avoid reloading on every call:

```python theme={null}
@pxt.udf
def optimized_bert_embed(text: str) -> pxt.Array[(512,), pxt.Float]:
    """BERT embedding function with model caching"""
    if not hasattr(optimized_bert_embed, 'model'):
        # Load models once
        optimized_bert_embed.preprocessor = hub.load(
            'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
        )
        optimized_bert_embed.model = hub.load(
            'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2'
        )
    tensor = tf.constant([text])
    result = optimized_bert_embed.model(
        optimized_bert_embed.preprocessor(tensor)
    )['pooled_output']
    return result.numpy()[0, :]
```

### Batch processing

Use Pixeltable's batching capabilities for better performance:

```python theme={null}
from pixeltable.func import Batch

@pxt.udf(batch_size=32)
def batched_bert_embed(texts: Batch[str]) -> Batch[pxt.Array[(512,), pxt.Float]]:
    """BERT embedding function with batching"""
    if not hasattr(batched_bert_embed, 'model'):
        batched_bert_embed.preprocessor = hub.load(
            'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
        )
        batched_bert_embed.model = hub.load(
            'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2'
        )

    # Process entire batch at once
    tensor = tf.constant(list(texts))
    results = batched_bert_embed.model(
        batched_bert_embed.preprocessor(tensor)
    )['pooled_output']
    return [r for r in results.numpy()]
```

## Error handling

Always implement proper error handling in production UDFs:

```python theme={null}
import logging

logger = logging.getLogger(__name__)

@pxt.udf
def robust_bert_embed(text: str) -> pxt.Array[(512,), pxt.Float]:
    """BERT embedding with error handling"""
    try:
        if not text or len(text.strip()) == 0:
            raise ValueError("Empty text input")
        if not hasattr(robust_bert_embed, 'model'):
            # Model initialization...
            pass
        tensor = tf.constant([text])
        result = robust_bert_embed.model(
            robust_bert_embed.preprocessor(tensor)
        )['pooled_output']
        return result.numpy()[0, :]
    except Exception as e:
        logger.error(f"Embedding failed: {str(e)}")
        raise
```

## Additional resources

Complete UDF documentation

More embedding examples

Find embedding models

# Ecosystem

Source: https://docs.pixeltable.com/integrations/frameworks

Explore Pixeltable's ecosystem of built-in integrations for AI/ML workflows

From language models to computer vision frameworks, Pixeltable integrates with the entire ecosystem. All integrations are available out-of-the-box with Pixeltable installation. No additional setup required unless specified.

If you have a framework that you want us to integrate with, please reach out, and you can also leverage Pixeltable's [UDFs](/platform/udfs-in-pixeltable) to build your own.
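As a sketch of what a custom integration looks like: a Pixeltable UDF is just a plain Python function, so wrapping any provider SDK follows the same pattern. The `sentiment_score` function and its keyword lexicon below are hypothetical stand-ins for a real API call; the `@pxt.udf` registration shown in the comment follows the same decorator pattern used throughout these docs.

```python
# A minimal custom "integration": any Python function can become a
# Pixeltable UDF. The scoring logic here is a toy stand-in -- in a real
# integration you would call your provider's SDK inside the function body.

def sentiment_score(text: str) -> float:
    """Score text in [-1, 1] using a tiny keyword lexicon (toy logic)."""
    positive = {'good', 'great', 'excellent', 'love'}
    negative = {'bad', 'poor', 'terrible', 'hate'}
    words = text.lower().split()
    raw = sum(w in positive for w in words) - sum(w in negative for w in words)
    # Normalize by length and clamp to [-1, 1]
    return max(-1.0, min(1.0, raw / max(len(words), 1) * 5))

# With Pixeltable installed, registering it is one decorator:
#
#   import pixeltable as pxt
#
#   @pxt.udf
#   def sentiment(text: str) -> float:
#       return sentiment_score(text)
#
#   t.add_computed_column(sentiment=sentiment(t.prompt))

print(sentiment_score('this product is great'))
```

Once registered, the UDF behaves like any built-in integration: it runs automatically in computed columns, and its results are stored and cached.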
## Cloud LLM providers

* Integrate Claude models for advanced language understanding and generation with multimodal capabilities
* Access Google's Gemini models for state-of-the-art multimodal AI capabilities
* Leverage GPT models for text generation, embeddings, and image analysis
* Use OpenAI models via Azure with enterprise security and compliance
* Use Mistral's efficient language models for various NLP tasks
* Access a variety of open-source models through Together AI's platform
* Use Fireworks.ai's optimized model inference infrastructure
* Leverage DeepSeek's powerful language and code models for text and code generation
* Access a variety of AI models through AWS Bedrock's unified API
* Access Groq's models for text generation
* Unified access to 100+ LLMs from various providers through a single API

## Embeddings & Reranking

High-quality embeddings and reranking for RAG applications

## Video Understanding

Multimodal video understanding, search, and analysis with state-of-the-art foundation models

## Media Generation

* Fast image generation with Flux, Stable Diffusion, and other models
* AI-powered video generation and editing capabilities

## Local LLM runtimes

* High-performance C++ implementation for running LLMs on CPU and GPU
* Easy-to-use toolkit for running and managing open-source models locally

## Computer vision

* State-of-the-art object detection with YOLOX models
* Advanced video and image dataset management with Voxel51

## Annotation tools

Comprehensive platform for data annotation and labeling workflows

## Audio processing

High-quality speech recognition and transcription using OpenAI's Whisper models

## Data wrangling

Import from and export to Pandas DataFrames as needed

## Usage examples

```python theme={null}
import pixeltable as pxt
from pixeltable.functions import openai

# Create a table with computed column for OpenAI completion
table = pxt.create_table('responses', {'prompt': pxt.String})
table.add_computed_column(
    response=openai.chat_completions(
        messages=[{'role': 'user', 'content': table.prompt}],
        model='gpt-4'
    )
)
```

```python theme={null}
from pixeltable.functions.yolox import yolox

# Add object detection to video frames
frames_view.add_computed_column(
    detections=yolox(
        frames_view.frame,
        model_id='yolox_l'
    )
)
```

```python theme={null}
from pixeltable.functions import openai

# Transcribe audio files
audio_table.add_computed_column(
    transcription=openai.transcriptions(
        audio=audio_table.file,
        model='whisper-1'
    )
)
```

## Integration features

* Most integrations work out-of-the-box with simple API configuration
* Use integrations directly in computed columns for automated processing
* Efficient handling of batch operations with automatic optimization

Check our [Github](https://github.com/pixeltable/pixeltable/tree/main/docs/howto/providers) for detailed usage instructions for each integration.

Need help setting up integrations? Join our [Discord community](https://discord.com/invite/QPyqFYx2UN) for support.

# Model Hub & Repositories

Source: https://docs.pixeltable.com/integrations/models

Explore pre-trained models and integrations available in Pixeltable

## Model hubs

* Access thousands of pre-trained models across vision, text, and audio domains
* Deploy and run ML models through Replicate's cloud infrastructure

## Hugging Face models

Pixeltable provides seamless integration with Hugging Face's transformers library through built-in UDFs. These functions allow you to use state-of-the-art models directly in your data workflows.

Requirements: Install required dependencies with `pip install transformers`. Some models may require additional packages like `sentence-transformers` or `torch`.
### CLIP models

```python theme={null}
from pixeltable.functions.huggingface import clip

# For text embedding
t.add_computed_column(
    text_embedding=clip(
        t.text_column,
        model_id='openai/clip-vit-base-patch32'
    )
)

# For image embedding
t.add_computed_column(
    image_embedding=clip(
        t.image_column,
        model_id='openai/clip-vit-base-patch32'
    )
)
```

Perfect for multimodal applications combining text and image understanding.

### Cross-encoders

```python theme={null}
from pixeltable.functions.huggingface import cross_encoder

t.add_computed_column(
    similarity_score=cross_encoder(
        t.sentence1,
        t.sentence2,
        model_id='cross-encoder/ms-marco-MiniLM-L-4-v2'
    )
)
```

Ideal for semantic similarity tasks and sentence pair classification.

### DETR object detection

```python theme={null}
from pixeltable.functions.huggingface import detr_for_object_detection, detr_to_coco

t.add_computed_column(
    detections=detr_for_object_detection(
        t.image,
        model_id='facebook/detr-resnet-50',
        threshold=0.8
    )
)

# Convert to COCO format if needed
t.add_computed_column(
    coco_format=detr_to_coco(t.image, t.detections)
)
```

Powerful object detection with end-to-end transformer architecture.

### Sentence transformers

```python theme={null}
from pixeltable.functions.huggingface import sentence_transformer

t.add_computed_column(
    embeddings=sentence_transformer(
        t.text,
        model_id='sentence-transformers/all-mpnet-base-v2',
        normalize_embeddings=True
    )
)
```

State-of-the-art sentence and document embeddings for semantic search and similarity.
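These embedding columns pair with Pixeltable's `add_embedding_index()` for semantic search, which conceptually ranks rows by cosine similarity between vectors. A minimal stdlib sketch of that ranking, with toy 3-dimensional vectors standing in for real model outputs (document names and values are hypothetical):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: the angle-based metric behind similarity lookups."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" for three documents and a query
docs = {
    'doc_a': [0.9, 0.1, 0.0],
    'doc_b': [0.1, 0.9, 0.0],
    'doc_c': [0.7, 0.3, 0.0],
}
query = [1.0, 0.0, 0.0]

# Rank documents by similarity to the query, highest first,
# mirroring an order-by-similarity query in Pixeltable
ranked = sorted(docs, key=lambda d: cosine_similarity(docs[d], query), reverse=True)
print(ranked)
```

In Pixeltable, the index replaces the brute-force `sorted()` with an HNSW structure, so the same ranking stays fast as the table grows.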
### Speech2Text models

```python theme={null}
from pixeltable.functions.huggingface import speech2text_for_conditional_generation

# Basic transcription
t.add_computed_column(
    transcript=speech2text_for_conditional_generation(
        t.audio,
        model_id='facebook/s2t-small-librispeech-asr'
    )
)

# Multilingual translation
t.add_computed_column(
    translation=speech2text_for_conditional_generation(
        t.audio,
        model_id='facebook/s2t-medium-mustc-multilingual-st',
        language='fr'
    )
)
```

Support for both transcription and translation of audio content.

### Vision Transformer (ViT)

```python theme={null}
from pixeltable.functions.huggingface import vit_for_image_classification

t.add_computed_column(
    classifications=vit_for_image_classification(
        t.image,
        model_id='google/vit-base-patch16-224',
        top_k=5
    )
)
```

Modern image classification using transformer architecture.

## Integration features

All models can be used directly in computed columns for automated processing:

```python theme={null}
# Example: Combine CLIP embeddings with ViT classification
t.add_computed_column(
    image_features=clip(t.image, model_id='openai/clip-vit-base-patch32')
)
t.add_computed_column(
    classifications=vit_for_image_classification(t.image, model_id='google/vit-base-patch16-224')
)
```

Pixeltable automatically handles batch processing and optimization:

```python theme={null}
# Pixeltable efficiently processes large datasets
t.add_computed_column(
    embeddings=sentence_transformer(
        t.text,
        model_id='all-mpnet-base-v2'
    )
)
```

```python theme={null}
# Object Detection Output
{
    'scores': [0.99, 0.98],          # confidence scores
    'labels': [25, 30],              # class labels
    'label_text': ['cat', 'dog'],    # human-readable labels
    'boxes': [[x1, y1, x2, y2], ...] # bounding boxes
}

# Image Classification Output
{
    'scores': [0.8, 0.15],                  # class probabilities
    'labels': [340, 353],                   # class IDs
    'label_text': ['zebra', 'gazelle']      # class names
}
```

## Model selection guide

Select the appropriate model family based on your task:

* Text/Image Similarity → CLIP
* Object Detection → DETR
* Text Embeddings → Sentence Transformers
* Speech Processing → Speech2Text
* Image Classification → ViT

Install necessary dependencies:

```bash theme={null}
pip install transformers torch sentence-transformers
```

Import and use the model in your Pixeltable workflow:

```python theme={null}
from pixeltable.functions.huggingface import clip, sentence_transformer
```

Need help choosing the right model? Check our [example notebooks](https://github.com/pixeltable/pixeltable/tree/main/docs/howto/providers) or join our [Discord community](https://discord.com/invite/QPyqFYx2UN).

# Agent Frameworks

Source: https://docs.pixeltable.com/migrate/from-agent-frameworks

How AI agent concepts map from LangGraph, CrewAI, and similar frameworks to Pixeltable

If you've been building AI agents with LangGraph or CrewAI — defining state graphs, tool nodes, conditional edges, and bolting on separate memory stores — this guide shows how Pixeltable replaces the graph DSL with declarative tables.
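The core idea behind "replacing the graph DSL": each computed column declares its inputs, and the system derives the execution order from those dependencies. The toy resolver below sketches that dependency ordering with stdlib Python only; the column names are hypothetical and chosen to resemble a tool-calling agent:

```python
def eval_order(deps: dict[str, set[str]]) -> list[str]:
    """Topologically order columns so each input is computed before its dependents."""
    order: list[str] = []
    resolved: set[str] = set()
    pending = dict(deps)
    while pending:
        # A column is ready once all of its inputs have been computed
        ready = [c for c, d in pending.items() if d <= resolved]
        if not ready:
            raise ValueError('cyclic dependency')
        for c in sorted(ready):
            order.append(c)
            resolved.add(c)
            del pending[c]
    return order

# Hypothetical agent columns: message -> response -> tool_output -> ...
deps = {
    'message': set(),
    'response': {'message'},
    'tool_output': {'response'},
    'followup': {'message', 'tool_output'},
    'final': {'followup'},
    'answer': {'final'},
}
print(eval_order(deps))
```

This is the work that nodes and edges do explicitly in a graph framework; in Pixeltable the column references carry the same information, so no separate wiring step is needed.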
**Related use case:** [Agents & MCP](/use-cases/agents-mcp)

***

## Concept Mapping

| Agent Framework | Pixeltable Equivalent |
| --- | --- |
| `StateGraph` / `AgentExecutor` | [`pxt.create_table()`](/tutorials/tables-and-data-operations) with [computed columns](/tutorials/computed-columns) |
| Graph nodes (functions) | Computed columns — dependencies resolved automatically |
| Graph edges / conditional routing | Column references — Pixeltable infers the DAG |
| `ToolNode` / `@tool` | [`pxt.tools()` + `invoke_tools()`](/howto/cookbooks/agents/llm-tool-calling) |
| `MemorySaver` / checkpointer | Tables are persistent by default |
| Separate vector DB for RAG | [`add_embedding_index()`](/platform/embedding-indexes) + [`@pxt.query`](/platform/udfs-in-pixeltable) |
| LangSmith for observability | `t.select()` on any column — every step is [queryable](/tutorials/queries-and-expressions) |

***

## Side by Side: Tool-Calling Agent

An agent that picks tools, calls them, and answers based on the results.
```python theme={null}
from typing import Annotated, Sequence, TypedDict
from langchain_core.messages import BaseMessage, HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END, add_messages
from langgraph.prebuilt import ToolNode
from langchain_core.tools import tool

class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], add_messages]

@tool
def get_weather(city: str) -> str:
    """Get current weather for a city."""
    return f'Weather in {city}: 72°F, sunny'

@tool
def search_docs(query: str) -> str:
    """Search internal documents."""
    return f'Results for: {query}'

tools = [get_weather, search_docs]
model = ChatOpenAI(model='gpt-4o-mini').bind_tools(tools)

def call_model(state):
    return {'messages': [model.invoke(state['messages'])]}

def should_continue(state):
    last = state['messages'][-1]
    return 'tools' if last.tool_calls else END

workflow = StateGraph(AgentState)
workflow.add_node('agent', call_model)
workflow.add_node('tools', ToolNode(tools))
workflow.set_entry_point('agent')
workflow.add_conditional_edges(
    'agent', should_continue, {'tools': 'tools', END: END})
workflow.add_edge('tools', 'agent')
graph = workflow.compile()

result = graph.invoke(
    {'messages': [HumanMessage(content='Weather in SF?')]})
print(result['messages'][-1].content)
```

**Packages:** `langgraph`, `langchain-openai`, `langchain-core`, plus a vector DB client for RAG

```python theme={null}
import pixeltable as pxt
from pixeltable.functions.openai import chat_completions, invoke_tools

@pxt.udf
def get_weather(city: str) -> str:
    """Get current weather for a city."""
    return f'Weather in {city}: 72°F, sunny'

@pxt.udf
def search_docs(query: str) -> str:
    """Search internal documents."""
    return f'Results for: {query}'

tools = pxt.tools(get_weather, search_docs)

agent = pxt.create_table('agents.assistant', {'message': pxt.String})
agent.add_computed_column(response=chat_completions(
    messages=[{'role': 'user', 'content': agent.message}],
    model='gpt-4o-mini', tools=tools))
agent.add_computed_column(
    tool_output=invoke_tools(tools, agent.response))

@pxt.udf
def build_followup(message: str, tool_output: dict) -> list[dict]:
    results = [
        str(r) for vals in (tool_output or {}).values()
        if vals for r in vals
    ]
    return [
        {'role': 'user', 'content': message},
        {'role': 'assistant', 'content': '\n'.join(results)},
        {'role': 'user', 'content':
            'Answer my original question using that information.'},
    ]

agent.add_computed_column(
    followup=build_followup(agent.message, agent.tool_output))
agent.add_computed_column(
    final=chat_completions(messages=agent.followup, model='gpt-4o-mini'))
agent.add_computed_column(
    answer=agent.final.choices[0].message.content)

agent.insert([{'message': 'What is the weather in SF?'}])
agent.select(agent.message, agent.answer).collect()
```

**Packages:** `pixeltable`, `openai`

### What Changes

| | LangGraph / CrewAI | Pixeltable |
| --- | --- | --- |
| **State** | Ephemeral — lost when the process ends | Persistent — every row survives restarts |
| **Caching** | No built-in caching of tool results | Same input returns cached result |
| **Observability** | LangSmith (separate service + API key) | `agent.select(agent.tool_output).collect()` |
| **Adding RAG** | Separate vector DB integration | `add_embedding_index()` + `@pxt.query` — no extra service |
| **Graph definition** | Nodes, edges, conditional routing DSL | Computed columns — Pixeltable infers the DAG |
| **MCP tools** | Custom integration | `pxt.mcp_udfs()` loads tools from any MCP server |

***

## Common Patterns

### Adding persistent memory

```python theme={null}
from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()
graph = workflow.compile(checkpointer=checkpointer)
# In-process only — lost on restart
```

```python theme={null}
from pixeltable.functions.openai import embeddings

memories = pxt.create_table('agents.memories', {
    'content': pxt.String, 'timestamp': pxt.Timestamp})
memories.add_embedding_index('content',
    string_embed=embeddings.using(model='text-embedding-3-small'))

@pxt.query
def recall(query: str, top_k: int = 5) -> pxt.Query:
    sim = memories.content.similarity(string=query)
    return memories.order_by(sim, asc=False) \
        .limit(top_k).select(memories.content)
```

### Adding RAG to an agent

```python theme={null}
from langchain_pinecone import PineconeVectorStore

vector_store = PineconeVectorStore(
    index_name='docs', embedding=embeddings)

@tool
def search_kb(query: str) -> str:
    """Search the knowledge base."""
    docs = vector_store.as_retriever() \
        .get_relevant_documents(query)
    return '\n'.join(d.page_content for d in docs)
# Must add tool to graph, re-compile...
```

```python theme={null}
@pxt.query
def search_kb(query: str) -> pxt.Query:
    """Search the knowledge base."""
    sim = chunks.text.similarity(string=query)
    return chunks.order_by(sim, asc=False) \
        .limit(5).select(chunks.text)

tools = pxt.tools(get_weather, search_kb)
```

### Inspecting agent behavior

```python theme={null}
# Requires LangSmith: set LANGSMITH_API_KEY,
# LANGSMITH_PROJECT, then view traces in dashboard
```

```python theme={null}
agent.select(
    agent.message,
    agent.tool_output,
    agent.answer
).collect()
```

***

## Next Steps

Full use case walkthrough

All 8 agentic patterns as Pixeltable tables

Register UDFs and queries as LLM tools

Lightweight agent framework built on Pixeltable

# DIY Data Pipeline

Source: https://docs.pixeltable.com/migrate/from-diy-data-pipeline

Replace custom scripts, DVC, Airflow, and manual processing with declarative tables

If you've been wrangling multimodal data with custom Python scripts, DVC for versioning, Airflow for scheduling, and manual processing loops — this guide shows how Pixeltable replaces that plumbing with declarative tables.
**Related use case:** [Data Wrangling for ML](/use-cases/ml-data-wrangling)

***

## Concept Mapping

| Your DIY Stack | Pixeltable Equivalent |
| --- | --- |
| S3 buckets for media files | [`pxt.Image`, `pxt.Video`, `pxt.Audio`](/platform/type-system) columns — can still [read from S3](/integrations/cloud-storage) |
| DVC for data versioning | Built-in [`history()`, `revert()`, `create_snapshot()`](/platform/version-control) |
| Airflow / cron for scheduling | [Computed columns](/tutorials/computed-columns) — run automatically on insert |
| Custom scripts with OpenCV / PIL | [`@pxt.udf`](/platform/udfs-in-pixeltable) functions as computed columns |
| `cv2.VideoCapture()` + frame loops | [`frame_iterator`](/platform/iterators) via `create_view()` |
| Manual retry logic (`tenacity`) | Automatic retries with result caching |
| Embeddings as numpy / Parquet | [`add_embedding_index()`](/platform/embedding-indexes) with HNSW search |
| `torch.utils.data.Dataset` boilerplate | [`to_pytorch_dataset()`](/howto/cookbooks/data/data-export-pytorch) — one line |
| Re-run pipeline when data changes | Incremental — only new rows are processed |

***

## Side by Side: Image Processing Pipeline

Process images: generate thumbnails, caption with an LLM, embed for search, version everything.
```python theme={null}
import pandas as pd
import numpy as np
from PIL import Image
from openai import OpenAI
from pathlib import Path
import base64, time

client = OpenAI()

# Load metadata
image_dir = Path('dataset/images/')
df = pd.DataFrame([
    {'filename': f.name, 'path': str(f), 'category': 'unknown'}
    for f in image_dir.glob('*.jpg')
])

# Generate thumbnails (manual loop)
thumb_dir = Path('dataset/thumbnails/')
thumb_dir.mkdir(exist_ok=True)
for idx, row in df.iterrows():
    img = Image.open(row['path'])
    img.thumbnail((256, 256))
    img.save(thumb_dir / row['filename'])
    df.at[idx, 'thumbnail'] = str(thumb_dir / row['filename'])

# Caption images (manual retry, one at a time)
def caption_image(path, max_retries=3):
    with open(path, 'rb') as f:
        b64 = base64.b64encode(f.read()).decode()
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model='gpt-4o-mini',
                messages=[{'role': 'user', 'content': [
                    {'type': 'text', 'text':
                        'Describe this image in one sentence.'},
                    {'type': 'image_url', 'image_url': {
                        'url': f'data:image/jpeg;base64,{b64}'}}
                ]}],
            )
            return resp.choices[0].message.content
        except Exception:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                return None

df['caption'] = [caption_image(row['path'])
                 for _, row in df.iterrows()]

# Generate embeddings (batch manually, store as numpy)
valid = df.dropna(subset=['caption'])
resp = client.embeddings.create(
    input=valid['caption'].tolist(),
    model='text-embedding-3-small')
np.save('dataset/embeddings.npy',
        [e.embedding for e in resp.data])

# Persist and version
df.to_csv('dataset/metadata.csv', index=False)
# Then: dvc add dataset/ && dvc push && git add && git commit
```

```python theme={null}
import pixeltable as pxt
from pixeltable.functions.openai import chat_completions, embeddings
from pathlib import Path

images = pxt.create_table('ml.images', {
    'image': pxt.Image, 'category': pxt.String})

images.add_computed_column(thumbnail=images.image.resize((256, 256)))

messages = [{'role':
    'user', 'content': [
    {'type': 'text', 'text': 'Describe this image in one sentence.'},
    {'type': 'image_url', 'image_url': images.image},
]}]
images.add_computed_column(response=chat_completions(
    messages=messages, model='gpt-4o-mini'))
images.add_computed_column(
    caption=images.response.choices[0].message.content)

images.add_embedding_index('caption',
    string_embed=embeddings.using(model='text-embedding-3-small'))

images.insert([{'image': str(f), 'category': 'unknown'}
               for f in Path('dataset/images/').glob('*.jpg')])

sim = images.caption.similarity(string='a dog playing in the park')
images.order_by(sim, asc=False).limit(5) \
    .select(images.image, images.caption).collect()
```

### What Changes

| | Custom Scripts | Pixeltable |
| --- | --- | --- |
| **New images** | Re-run the entire pipeline | `images.insert([...])` — everything downstream runs |
| **Change model** | Re-run everything; DVC tracks snapshots, not transforms | Drop and re-add the column — only that column recomputes |
| **Versioning** | `dvc add` + `git commit` ceremony | Automatic — `images.history()`, `pxt.create_snapshot()` |
| **Scheduling** | Airflow, cron, or manual re-runs | Not needed — computed columns run on insert |
| **Retries** | `try/except` with backoff in every function | Built-in; successful results are cached |
| **Search** | Brute-force numpy, or set up a vector DB | `add_embedding_index()` with HNSW |
| **PyTorch export** | Custom `Dataset` class | `images.to_pytorch_dataset()` |

***

## Common Patterns

### Video frame extraction

```python theme={null}
import cv2
from PIL import Image

cap = cv2.VideoCapture('demo.mp4')
fps = cap.get(cv2.CAP_PROP_FPS)
frames, idx = [], 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    if idx % int(fps) == 0:
        frames.append(Image.fromarray(
            cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    idx += 1
cap.release()
```

```python theme={null}
from pixeltable.functions.video import frame_iterator

videos = pxt.create_table('ml.videos', {'video': pxt.Video})
frames = pxt.create_view('ml.frames', videos,
    iterator=frame_iterator(videos.video, fps=1))

videos.insert([{'video': 'demo.mp4'}])
frames.select(frames.frame).head(10)
```

### Data versioning

```bash theme={null}
dvc add dataset/
git add dataset.dvc .gitignore
git commit -m "update dataset v3"
dvc push

# Revert
git checkout HEAD~1 -- dataset.dvc
dvc checkout
```

```python theme={null}
images.history()
pxt.create_snapshot('ml.images_before_relabeling', images)
images.revert()
```

### PyTorch export

```python theme={null}
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class ImageDataset(Dataset):
    def __init__(self, df, transform=None):
        self.df = df.reset_index(drop=True)
        self.transform = transform
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        img = Image.open(self.df.at[idx, 'path'])
        if self.transform:
            img = self.transform(img)
        return img, self.df.at[idx, 'category']

loader = DataLoader(ImageDataset(df, transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()])), batch_size=32)
```

```python theme={null}
from torch.utils.data import DataLoader

ds = images.select(images.image, images.category) \
    .to_pytorch_dataset()
loader = DataLoader(ds, batch_size=32)
```

***

## Next Steps

Full use case walkthrough

Frame extraction with FPS control

Convert tables to DataLoaders

S3, GCS, Azure, R2, Tigris

# RDBMS & Vector DBs

Source: https://docs.pixeltable.com/migrate/from-rdbms-vectordbs

Replace Postgres + Pinecone + LangChain RAG stacks with a single declarative system

If you're running a RAG application with Postgres for metadata, a vector database like Pinecone or Weaviate for embeddings, and LangChain for orchestration — this guide shows how Pixeltable unifies all three.
**Related use case:** [Backend for AI Apps](/use-cases/ai-applications)

***

## Concept Mapping

| Your Database Stack | Pixeltable Equivalent |
| --- | --- |
| Postgres / MySQL for metadata | [`pxt.create_table()`](/tutorials/tables-and-data-operations) with typed columns |
| Pinecone / Weaviate / Chroma for embeddings | [`add_embedding_index()`](/platform/embedding-indexes) — built-in HNSW search |
| S3 for media files (referenced by URL) | [`pxt.Image`, `pxt.Video`, `pxt.Document`](/platform/type-system) native types |
| ORM (SQLAlchemy, Prisma) | [`.select()`, `.where()`, `.order_by()`](/tutorials/queries-and-expressions) |
| LangChain `DocumentLoader` | `insert()`, [`import_csv()`](/howto/cookbooks/data/data-import-csv), [import from S3](/integrations/cloud-storage) |
| `RecursiveCharacterTextSplitter` | [`document_splitter`](/platform/iterators) iterator via `create_view()` |
| `retriever.get_relevant_documents()` | [`.similarity()`](/platform/embedding-indexes) + `.order_by()` |
| `create_retrieval_chain()` | [Computed column](/tutorials/computed-columns) with LLM call |
| Keeping Postgres and Pinecone in sync | Automatic — derived columns can't go stale |

***

## Side by Side: RAG Pipeline

Load documents, chunk, embed, retrieve, and generate answers.
```python theme={null}
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts import PromptTemplate

# Load and chunk
documents = PyPDFLoader('report.pdf').load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(documents)

# Embed and store in Pinecone
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
vector_store = PineconeVectorStore.from_documents(
    chunks, embeddings, index_name='my-index')
retriever = vector_store.as_retriever(search_kwargs={'k': 5})

# Build chain
prompt = PromptTemplate.from_template(
    'Answer based on context:\n{context}\n\nQuestion: {input}')
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)
rag_chain = create_retrieval_chain(
    retriever, create_stuff_documents_chain(llm, prompt))

result = rag_chain.invoke({'input': 'What were the key findings?'})
print(result['answer'])
```

**Packages:** `langchain`, `langchain-openai`, `langchain-pinecone`, `pinecone-client`, `sqlalchemy`

```python theme={null}
import pixeltable as pxt
from pixeltable.functions.openai import chat_completions, embeddings
from pixeltable.functions.document import document_splitter

docs = pxt.create_table('rag.docs', {
    'pdf': pxt.Document, 'source': pxt.String})
chunks = pxt.create_view('rag.chunks', docs,
    iterator=document_splitter(
        docs.pdf, separators='sentence,token_limit', limit=300))
chunks.add_embedding_index('text',
    string_embed=embeddings.using(model='text-embedding-3-small'))

@pxt.query
def retrieve(question: str, top_k: int = 5) -> pxt.Query:
    sim = chunks.text.similarity(string=question)
    return chunks.order_by(sim, asc=False) \
        .limit(top_k).select(chunks.text)

qa = pxt.create_table('rag.qa', {'question': pxt.String})
qa.add_computed_column(context=retrieve(qa.question))

@pxt.udf
def build_prompt(question: str, context: list[dict]) -> str:
    ctx = '\n\n'.join(c['text'] for c in context)
    return f'Answer based on context:\n{ctx}\n\nQuestion: {question}'

qa.add_computed_column(prompt=build_prompt(qa.question, qa.context))
qa.add_computed_column(response=chat_completions(
    messages=[{'role': 'user', 'content': qa.prompt}],
    model='gpt-4o-mini'))
qa.add_computed_column(
    answer=qa.response.choices[0].message.content)

docs.insert([{'pdf': 'report.pdf', 'source': 'annual_report'}])
qa.insert([{'question': 'What were the key findings?'}])
qa.select(qa.question, qa.answer).collect()
```

**Packages:** `pixeltable`, `openai`

### What Changes

| | LangChain + Pinecone | Pixeltable |
| --- | --- | --- |
| **New documents** | Re-run chunking, embedding, and Pinecone upsert | `docs.insert([...])` — chunks, embeddings, and index update automatically |
| **Infrastructure** | Postgres + Pinecone account + API keys | Single local system, no external services |
| **Sync issues** | Postgres metadata and Pinecone vectors can drift | Impossible — derived columns are always consistent |
| **Intermediate results** | Ephemeral unless you add logging | Every column is stored and queryable: `qa.select(qa.context).collect()` |
| **Versioning** | Not built-in | `t.history()`, `pxt.create_snapshot()` |
| **Swap providers** | Rewrite chain with new provider classes | Change the model string — same pipeline |

***

## Common Patterns

### Adding new documents

```python theme={null}
new_docs = PyPDFLoader('new_report.pdf').load()
new_chunks = splitter.split_documents(new_docs)
vector_store.add_documents(new_chunks)
# Also update Postgres metadata...
```

```python theme={null}
docs.insert([{'pdf': 'new_report.pdf', 'source': 'quarterly'}])
```

### Filtering by metadata

```python theme={null}
retriever = vector_store.as_retriever(
    search_kwargs={'k': 5, 'filter': {'source': 'annual_report'}})
```

```python theme={null}
sim = chunks.text.similarity(string=query)
results = (chunks
    .where((chunks.source == 'annual_report') & (sim > 0.3))
    .order_by(sim, asc=False).limit(5).collect())
```

### Inspecting what was retrieved

```python theme={null}
result = rag_chain.invoke({'input': query})
print(result['context'])  # if available
```

```python theme={null}
qa.select(qa.question, qa.context, qa.answer).collect()
```

***

## Next Steps

Full use case walkthrough

Complete RAG system with chunking and retrieval

Control chunk size, overlap, and splitting strategies

Search patterns and similarity queries

# Building with LLMs

Source: https://docs.pixeltable.com/overview/building-pixeltable-with-llms

Use AI coding tools to build Pixeltable applications faster

## Why Pixeltable Is Easy to Vibe-Code

Pixeltable's API is declarative — you say *what* you want, not *how* to wire it up. That means LLMs get it right on the first try. Ask your AI tool to "summarize articles with GPT-4o-mini" and you get working code:

```python theme={null}
import pixeltable as pxt
from pixeltable.functions.openai import chat_completions

t = pxt.create_table('app.articles', {'title': pxt.String, 'body': pxt.String})
t.add_computed_column(response=chat_completions(
    messages=[{'role': 'user', 'content': t.body}],
    model='gpt-4o-mini'))
t.add_computed_column(summary=t.response.choices[0].message.content)

t.insert([{'title': 'Climate Report', 'body': 'Global temperatures rose 1.2°C ...'}])
t.select(t.title, t.summary).collect()
```

Ten lines of code — and the result is **persistent**, **versioned**, **traceable**, and **incrementally optimized**. Every output is stored, every transformation is replayable, and new rows only recompute what changed.
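The incremental behavior described above can be sketched as a toy model in plain Python. This is not Pixeltable's actual implementation — the `computed` cache and the `summarize` stand-in are invented purely to illustrate the "each row is evaluated exactly once" contract:

```python
# Toy model of a computed column: rows already evaluated are never recomputed,
# so a second insert only pays for the rows it hasn't seen before.
computed: dict[int, str] = {}   # row_id -> stored summary
calls = 0

def summarize(body: str) -> str:
    """Stand-in for the expensive LLM call."""
    global calls
    calls += 1
    return body.split('.')[0]

def insert(rows: dict[int, str]) -> None:
    for row_id, body in rows.items():
        if row_id not in computed:   # only new rows trigger computation
            computed[row_id] = summarize(body)

insert({1: 'Temperatures rose. More detail.', 2: 'Rainfall fell. More detail.'})
insert({2: 'Rainfall fell. More detail.', 3: 'Sea levels rose. More detail.'})
assert calls == 3   # row 2 was cached, not recomputed
```

The same bookkeeping — which rows a computed column has already produced — is what Pixeltable persists for you, so "re-running the pipeline" stops being a thing you do by hand.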
The same pattern scales to [RAG pipelines](/howto/cookbooks/agents/pattern-rag-pipeline), [video frame extraction](/howto/cookbooks/video/video-extract-frames), [tool-calling agents](/howto/cookbooks/agents/llm-tool-calling), and [semantic search](/howto/cookbooks/search/search-semantic-text).

***

## Set Up Your AI Tool

Pick the setup that matches your editor. These aren't mutually exclusive — use whichever combination helps.

Drop our [AGENTS.md](https://github.com/pixeltable/pixeltable/blob/main/AGENTS.md) into your project root. Cursor, Windsurf, and similar agents pick it up automatically and use it as context for code generation.

```bash theme={null}
curl -o AGENTS.md https://raw.githubusercontent.com/pixeltable/pixeltable/main/AGENTS.md
```

For Claude-based editors, the same file is also available as [CLAUDE.md](https://github.com/pixeltable/pixeltable/blob/main/CLAUDE.md).

Install the [Pixeltable Skill](https://github.com/pixeltable/pixeltable-skill) — Claude discovers it automatically when you ask about Pixeltable. It loads a concise `SKILL.md` first, then pulls in the full API reference only when needed.

```bash theme={null}
# Plugin install (recommended — auto-updates)
/plugin marketplace add pixeltable/pixeltable-skill
/plugin install pixeltable-skill@pixeltable-skill

# Or manual install
git clone https://github.com/pixeltable/pixeltable-skill.git /tmp/pxt-skill
cp -r /tmp/pxt-skill/skills/pixeltable-skill ~/.claude/skills/
```

Append `.md` to any docs URL to get a plain-text version optimized for LLMs. Paste it straight into your chat.
| Resource | URL |
| --- | --- |
| Any docs page as markdown | `https://docs.pixeltable.com/.md` — e.g., [this page](https://docs.pixeltable.com/overview/building-pixeltable-with-llms.md) |
| Site index for LLMs | [llms.txt](https://docs.pixeltable.com/llms.txt) ([standard](https://llmstxt.org/)) |
| Full site map with metadata | [llms-full.txt](https://docs.pixeltable.com/llms-full.txt) |

***

## MCP Servers

Connect your AI tool to Pixeltable directly via the [Model Context Protocol](https://modelcontextprotocol.io). We ship two servers — or you can build your own using [`pxt.mcp_udfs()`](/libraries/mcp).

Search the full documentation from Claude Desktop, Cursor, or Windsurf:

```
https://docs.pixeltable.com/mcp
```

Exposes a `SearchPixeltableDocumentation` tool that returns relevant content, code examples, and direct links.

32 tools for creating tables, running queries, managing dependencies, and executing Python — all from your AI editor. Experimental; great for prototyping.

```bash theme={null}
# Install
uv tool install --from git+https://github.com/pixeltable/mcp-server-pixeltable-developer.git mcp-server-pixeltable-developer

# Add to Claude Code
claude mcp add pixeltable mcp-server-pixeltable-developer
```

See [configuration for Cursor, Claude Desktop, and more](https://github.com/pixeltable/mcp-server-pixeltable-developer) in the repo README.

Any Pixeltable UDF or query function can be exposed as an MCP tool with a single call:

```python theme={null}
import pixeltable as pxt

@pxt.udf
def lookup_customer(name: str) -> str:
    """Look up customer info by name."""
    t = pxt.get_table('app.customers')
    return t.where(t.name == name).select(t.info).collect()[0]['info']

tools = pxt.tools(lookup_customer)
```

`pxt.tools()` wraps your functions so any MCP-compatible client can call them.
See the [MCP integration guide](/libraries/mcp) for the full setup.

***

## Start Building

Use the app template to scaffold a full-stack project. It wires up a FastAPI backend and React frontend on top of Pixeltable — document upload, cross-modal search, and a tool-calling agent, all powered by computed columns. Ask your AI tool to customize it from there.

Full-stack skeleton: FastAPI + React + Pixeltable for multimodal AI workloads

***

## Next Steps

Install and run your first pipeline in 5 minutes

The core pattern LLMs generate — learn how it works

Build agents with UDFs, queries, and MCP tools

Full use case walkthrough for AI agents

# What is Pixeltable?

Source: https://docs.pixeltable.com/overview/pixeltable

Data Infrastructure providing a declarative, incremental approach for multimodal AI

**The only open source Python library providing declarative data infrastructure for building multimodal AI applications, enabling incremental storage, transformation, indexing, retrieval, and orchestration of data.**

With Pixeltable, you define your entire data processing and AI workflow declaratively using computed columns on tables. Focus on your application logic, not the data plumbing.

## Before Pixeltable

AI teams are building on images, video, audio, and text, but the infrastructure is broken:

Data lives across object stores, vector DBs, SQL, and ad-hoc pipelines. No single source of truth.

Every model change requires reprocessing. Pipelines are brittle and hard to reproduce.

This creates high engineering cost, slow iteration, and production risk.

**Pixeltable solves this.** One system for storage, orchestration, and retrieval. Transactions, incremental updates, and automatic dependency tracking built in.

## With Pixeltable

All data and computed results are automatically stored and versioned.

Data transformations run automatically on new data. No orchestration code needed.

Images, video, audio, and documents integrate seamlessly with structured data.
Built-in support for OpenAI, Anthropic, Gemini, Hugging Face, and dozens more.

## Get started

Install Pixeltable and run your first pipeline in 5 minutes.

See Pixeltable in action with a hands-on image workflow.

Learn about tables, computed columns, views, and the type system.

Complete API reference for the Pixeltable Python SDK.

Many documentation pages are interactive notebooks (marked in the sidebar). Open them in Colab, Kaggle, or locally to follow along.

## Core Primitives

Pixeltable provides a small set of primitives that compose into any multimodal AI workflow:

**Create tables with native multimodal types**

```python theme={null}
t = pxt.create_table('myapp.media', {
    'video': pxt.Video,
    'image': pxt.Image,
    'audio': pxt.Audio,
    'document': pxt.Document,
    'metadata': pxt.Json
})
```

Create, insert, update, delete

All supported types

**Declarative computed columns: API calls, LLM inference, local models, vision**

```python theme={null}
# LLM API call
t.add_computed_column(summary=openai.chat_completions(
    messages=[{'role': 'user', 'content': 'Summarize: ' + t.text}]
))

# Local model inference
t.add_computed_column(objects=yolox(t.image, model_id='yolox_s'))

# Vision analysis
t.add_computed_column(desc=openai.vision(prompt="Describe", image=t.image))
```

Incremental transforms

OpenAI, Anthropic, Gemini, HuggingFace...
**Explode rows: video→frames, doc→chunks, audio→segments**

```python theme={null}
# Extract frames from video at 1 fps
frames = pxt.create_view('myapp.frames', t,
    iterator=FrameIterator(t.video, fps=1))

# Chunk documents for RAG
chunks = pxt.create_view('myapp.chunks', t,
    iterator=DocumentSplitter(t.document))
```

Virtual tables

Frame, Document, Audio splitters

**Add embedding indexes for semantic search**

```python theme={null}
t.add_embedding_index('text', embedding=openai.embeddings())

# Search by similarity
results = t.order_by(t.text.similarity('find relevant docs'),
                     asc=False).limit(10)
```

Vector search with automatic maintenance

**Write custom functions with `@pxt.udf` and `@pxt.query`**

```python theme={null}
@pxt.udf
def extract_entities(text: str) -> list[str]:
    # Your custom logic
    return entities

@pxt.query
def search_by_topic(topic: str):
    return t.where(t.category == topic).select(t.title, t.summary)
```

Custom Python functions

**Tool calling for AI agents and MCP integration**

```python theme={null}
# Load tools from MCP server, UDFs, and queries
mcp_tools = pxt.mcp_udfs('http://localhost:8000/mcp')
tools = pxt.tools(search_by_topic, extract_entities, *mcp_tools)

# LLM decides which tool to call; Pixeltable executes it
t.add_computed_column(response=openai.chat_completions(
    messages=[{'role': 'user', 'content': t.question}],
    tools=tools
))
t.add_computed_column(result=openai.invoke_tools(tools, t.response))
```

Build agents with tools

MCP servers, memory, Pixelbot

**SQL-like queries + test transformations before committing**

```python theme={null}
# Query data with familiar syntax
results = t.where(t.score > 0.8).order_by(t.timestamp).limit(10).collect()

# Test transformations on sample rows BEFORE adding to table
t.select(t.text, summary=summarize(t.text)).head(3)  # Nothing stored yet
t.add_computed_column(summary=summarize(t.text))     # Now commit to all rows
```

Select, filter, aggregate

Test before commit

**Time travel and automatic versioning**
```python theme={null}
t.history()                                 # View all versions
t.revert(version=5)                         # Rollback changes
old_data = pxt.get_table('myapp.media:3')   # Query past version
```

History, snapshots, lineage

**Load from any source, export to ML formats**

```python theme={null}
# Import from files, URLs, S3, Hugging Face
t.insert(pxt.io.import_csv('data.csv'))
t.insert(pxt.io.import_huggingface_dataset(dataset))

# Export to ML/analytics formats
pxt.io.export_parquet(t, 'output.parquet')
loader = DataLoader(t.to_pytorch_dataset(), batch_size=32)
coco_path = t.to_coco_dataset()
```

CSV, JSON, Parquet, S3, HF

PyTorch, Parquet, COCO, LanceDB

**Publish and replicate datasets via Pixeltable Cloud**

```python theme={null}
pxt.publish(t, 'my-dataset')             # Share publicly
pxt.replicate('user/dataset', 'local')   # Pull to local
```

Publish, replicate, collaborate

## Use Cases

Pixeltable's primitives are **use-case agnostic**. They compose into any multimodal AI workflow:

Curate, augment, export training datasets. Pre-annotate with models, integrate Label Studio, export PyTorch.

Build RAG systems, semantic search, and multimodal APIs. Pixeltable handles storage, retrieval, and orchestration.

Tool-calling agents with persistent memory, MCP server integration, and automatic conversation history.

Start with the **[Quick Start](/overview/quick-start)** to get running in 5 minutes, or explore **[Cookbooks](/howto/cookbooks/agents/pattern-rag-pipeline)** for hands-on examples covering RAG, video analysis, audio transcription, and more.

## Choose How You Run Pixeltable

Open-source Python library. Install with `pip install pixeltable` and run locally. Same APIs scale to production.

Data sharing available now. Managed endpoints and live tables coming soon.

Schedule a call to discuss your use case and see how Pixeltable can help.
## Next steps

Get help, share projects, and connect with other developers

Star the repo, report issues, and contribute

# Quick Start

Source: https://docs.pixeltable.com/overview/quick-start

The fastest way to get started using Pixeltable

## System requirements

Before installing, ensure your system meets these requirements:

* Python 3.10 or higher
* Linux, MacOS, or Windows

## Installation

It is recommended that you install Pixeltable in a virtual environment.

```bash theme={null}
python -m venv .venv
```

```bash Linux/MacOS theme={null}
source .venv/bin/activate
```

```bash Windows theme={null}
.venv\Scripts\activate
```

```bash theme={null}
pip install pixeltable
```

Install uv from the [Installing uv](https://docs.astral.sh/uv/getting-started/installation/) guide.

```bash theme={null}
uv venv --python 3.12
```

```bash Linux/MacOS theme={null}
source .venv/bin/activate
```

```bash Windows theme={null}
.venv\Scripts\activate
```

```bash theme={null}
uv pip install pixeltable
```

Download and install from the [Miniconda Installation](https://www.anaconda.com/docs/getting-started/miniconda/main) guide.

```bash theme={null}
conda create --name pxt python=3.12
conda activate pxt
```

```bash theme={null}
pip install pixeltable
```

## Getting help

* Join our [Discord Community](https://discord.com/invite/QPyqFYx2UN)
* Report issues on [GitHub](https://github.com/pixeltable/pixeltable/issues)
* Contact [support@pixeltable.com](mailto:support@pixeltable.com)

## Build an image analysis app

This guide will help you spin up a functioning AI workload in 5 minutes.

Pixeltable requires only a minimal set of Python packages by default. To use AI models, you'll need to install additional dependencies.
```bash theme={null}
pip install torch transformers openai
```

```python theme={null}
import pixeltable as pxt

# Create a namespace and table
pxt.create_dir('quickstart', if_exists='replace_force')
t = pxt.create_table('quickstart/images', {'image': pxt.Image})
```

Tables are persistent: your data survives restarts and can be queried anytime.

```python theme={null}
from pixeltable.functions import huggingface

# Add DETR object detection as a computed column
t.add_computed_column(
    detections=huggingface.detr_for_object_detection(
        t.image, model_id='facebook/detr-resnet-50'
    )
)

# Extract labels from detections
t.add_computed_column(labels=t.detections.label_text)
```

Computed columns run automatically whenever new data is inserted.

```python theme={null}
# Insert a few images
t.insert([
    {'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000001.jpg'},
    {'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000025.jpg'}
])
```

You can insert images from URLs and/or local paths in any combination.

```python theme={null}
# Query results
t.select(t.image, t.labels).collect()
```

**Expected output:**

| image | labels |
| --- | --- |
| \[Image] | \[car, parking meter, truck, car, car, truck] |
| \[Image] | \[giraffe, giraffe] |

You'll need an OpenAI API key to use this step. If you don't have one, you can safely skip this step.
```python theme={null}
import os
from pixeltable.functions import openai

# Set your API key
os.environ['OPENAI_API_KEY'] = 'your-key-here'

t.add_computed_column(
    description=openai.vision(
        prompt="Describe this image in one sentence.",
        image=t.image,
        model='gpt-4o-mini'
    )
)

t.select(t.image, t.labels, t.description).collect()
```

```python theme={null}
# See the full text of the description in row 0
t.select(t.description).collect()[0]
```

Pixeltable orchestrates LLM calls for optimized throughput, handling rate limiting, retries, and caching automatically.

Insert a few more images.

```python theme={null}
t.insert([
    {'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000034.jpg'},
    {'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000057.jpg'}
])
t.select(t.image, t.labels, t.description).collect()
```

When new data is inserted into tables, Pixeltable incrementally runs all computed columns against the new data, ensuring the table is up to date.

Pixeltable automatically:

1. Created a persistent multimodal table
2. Downloaded and cached the DETR model
3. Ran inference on your image
4. Stored all results (including computed columns) for instant retrieval
5. Will incrementally process any new images you insert

# 10-Minute Tour

Source: https://docs.pixeltable.com/overview/ten-minute-tour

Open in Kaggle  Open in Colab  Download Notebook

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

Welcome to Pixeltable! In this tutorial, we’ll survey how to create tables, populate them with data, and enhance them with built-in and user-defined transformations and AI operations.

## Install Python packages

First run the following command to install Pixeltable and related libraries needed for this tutorial.
```python theme={null}
%pip install -qU torch transformers openai pixeltable
```

## Creating a table

Let’s begin by creating a `demo` directory (if it doesn’t already exist) and a table that can hold image data, `demo/first`. The table will initially have just a single column to hold our input images, which we’ll call `input_image`. We also need to specify a type for the column: `pxt.Image`.

```python theme={null}
import pixeltable as pxt

# Create the directory `demo`, dropping it first (if it exists)
# to ensure a clean environment.
pxt.drop_dir('demo', force=True)
pxt.create_dir('demo')

# Create the table `demo/first` with a single column `input_image`
t = pxt.create_table('demo/first', {'input_image': pxt.Image})
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'demo'.
  Created table 'first'.
We can use `t.describe()` to examine the table schema. We see that it now contains a single column, as expected.

```python theme={null}
t.describe()
```
The new table is initially empty, with no rows:

```python theme={null}
t.count()
```
  0
Now let’s put an image into it! We can add images simply by giving Pixeltable their URLs. The example images in this demo come from the [COCO dataset](https://cocodataset.org/), and we’ll be referencing copies of them in the Pixeltable github repo. But in practice, the images can come from anywhere: an S3 bucket, say, or the local file system.

When we add the image, we see that Pixeltable gives us some useful status updates indicating that the operation was successful.

```python theme={null}
t.insert(
    [
        {
            'input_image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000025.jpg'
        }
    ]
)
```
  Inserted 1 row with 0 errors in 0.21 s (4.86 rows/s)
  1 row inserted.
We can use `t.head()` to examine the contents of the table.

```python theme={null}
t.head()
```
## Adding computed columns

Great! Now we have a table containing some data. Let’s add an object detection model to our workflow. Specifically, we’re going to use the ResNet-50 object detection model, which runs using the Huggingface DETR (“DEtection TRansformer”) model class. Pixeltable contains a built-in adapter for this model family, so all we have to do is call the `detr_for_object_detection` Pixeltable function. A nice thing about the Huggingface models is that they run locally, so you don’t need an account with a service provider in order to use them.

This is our first example of a **computed column**, a key concept in Pixeltable. Recall that when we created the `input_image` column, we specified a type, `pxt.Image`, indicating our intent to populate it with data in the future. When we create a *computed* column, we instead specify a function that operates on other columns of the table.

By default, when we add the new computed column, Pixeltable immediately evaluates it against all existing data in the table - in this case, by calling the `detr_for_object_detection` function on the image. Depending on your setup, it may take a minute for the function to execute. In the background, Pixeltable is downloading the model from Huggingface (if necessary), instantiating it, and caching it for later use.

```python theme={null}
from pixeltable.functions import huggingface

t.add_computed_column(
    detections=huggingface.detr_for_object_detection(
        t.input_image, model_id='facebook/detr-resnet-50'
    )
)
```
  Added 1 column value with 0 errors in 3.26 s (0.31 rows/s)
  1 row updated.
Let’s examine the results.

```python theme={null}
t.head()
```
We see that the model returned a JSON structure containing a lot of information. In particular, it has the following fields:

* `label_text`: Descriptions of the objects detected
* `boxes`: Bounding boxes for each detected object
* `scores`: Confidence scores for each detection
* `labels`: The DETR model’s internal IDs for the detected objects

Perhaps this is more than we need, and all we really want are the text labels. We could add another computed column to extract `label_text` from the JSON struct:

```python theme={null}
t.add_computed_column(detections_text=t.detections.label_text)
t.head()
```
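For intuition, here is what a detection result shaped like the fields above looks like as an ordinary Python dict (the values are invented for illustration, not real model output). The expression `t.detections.label_text` mirrors plain dict access:

```python
# Hypothetical detection result with the four fields described above
detections = {
    'label_text': ['giraffe', 'giraffe'],
    'boxes': [[51.5, 99.2, 607.0, 466.9], [215.5, 154.5, 386.5, 426.0]],
    'scores': [0.999, 0.998],
    'labels': [25, 25],
}

# The computed column pulls out just this field, once per row:
assert detections['label_text'] == ['giraffe', 'giraffe']
```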
If we inspect the table schema now, we see how Pixeltable distinguishes between ordinary and computed columns.

```python theme={null}
t.describe()
```
Now let’s add some more images to our table. This demonstrates another important feature of computed columns: by default, they update incrementally any time new data shows up on their inputs. In this case, Pixeltable will run the ResNet-50 model against each new image that is added, then extract the labels into the `detections_text` column. Pixeltable will orchestrate the execution of any sequence (or DAG) of computed columns.

Note how we can pass multiple rows to `t.insert` with a single statement, which will insert them more efficiently.

```python theme={null}
more_images = [
    'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000030.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000034.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000042.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000061.jpg',
]
t.insert({'input_image': image} for image in more_images)
```
  Inserted 4 rows with 0 errors in 1.51 s (2.65 rows/s)
  4 rows inserted.
Let’s see what the model came up with. We’ll use `t.select` to suppress the display of the `detections` column, since right now we’re only interested in the text labels.

```python theme={null}
t.select(t.input_image, t.detections_text).head()
```
## Pixeltable is persistent

An important feature of Pixeltable is that *everything is persistent*. Unlike in-memory Python libraries such as Pandas, Pixeltable is a database: all your data, transformations, and computed columns are stored and preserved between sessions.

To see this, let’s clear all the variables in our notebook and start fresh. You can optionally restart your notebook kernel at this point, to demonstrate how Pixeltable data persists across sessions.

```python theme={null}
# Clear all variables in the notebook
%reset -f

# Instantiate a new client object
import pixeltable as pxt

t = pxt.get_table('demo/first')

# Display just the first two rows, to avoid cluttering the tutorial
t.select(t.input_image, t.detections_text).head(2)
```
## GPT-4o

For comparison, let’s try running our examples through a generative model, OpenAI’s `gpt-4o-mini`. For this section, you’ll need an OpenAI account with an API key. You can use the following command to add your API key to the environment (just enter your API key when prompted):

```python theme={null}
import getpass
import os

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass(
        'Enter your OpenAI API key:'
    )
```

Now we can connect to OpenAI through Pixeltable. This may take some time, depending on how long OpenAI takes to process the query.

```python theme={null}
from pixeltable.functions import openai

# Construct a message dict for OpenAI. It follows the same pattern
# as the OpenAI SDK, except that in place of an image URL, we can
# put a reference to our image column, and Pixeltable will do the
# substitution once for each row of the table.
messages = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': "What's in this image?"},
            {'type': 'image_url', 'image_url': t.input_image},
        ],
    }
]
t.add_computed_column(
    vision=openai.chat_completions(messages, model='gpt-4o-mini')
)
```
  Added 5 column values with 0 errors in 6.98 s (0.72 rows/s)
  5 rows updated.
Let’s see how GPT-4o-mini’s responses compare to the traditional discriminative (DETR) model. ```python theme={null} t.select(t.input_image, t.detections_text, t.vision).head() ```
It looks like OpenAI returned a whole range of context information along with the image descriptions. Let’s pluck out just the response content from inside those JSON structures, so that it’s easier to see in the table. Note that we can unpack JSON columns in Pixeltable the same way we would with ordinary Python dicts and lists. ```python theme={null} t.select( t.input_image, t.detections_text, t.vision['choices'][0]['message']['content'], ).head() ```
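Outside of Pixeltable, the same path-style extraction works on an ordinary Python dict. Here’s a minimal standalone sketch, using a hypothetical response dict shaped like an OpenAI chat-completions result (the content string is made up for illustration):

```python
# Hypothetical response dict, shaped like an OpenAI chat-completions result
response = {
    'choices': [
        {
            'message': {
                'role': 'assistant',
                'content': 'A kitchen counter with vegetables and utensils.',
            }
        }
    ]
}

# The Pixeltable expression t.vision['choices'][0]['message']['content']
# mirrors this plain-dict indexing, applied once per row of the table.
content = response['choices'][0]['message']['content']
print(content)  # -> A kitchen counter with vegetables and utensils.
```

The only difference is that in Pixeltable the indexing is applied lazily, as part of a query expression, rather than eagerly on a single dict.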
In addition to adapters for local models and inference APIs, Pixeltable can perform a range of more basic image operations. These image operations can be seamlessly chained with API calls, and Pixeltable will keep track of the sequence of operations, constructing new images and caching when necessary to keep things running smoothly. Just for fun (and to demonstrate the power of computed columns), let’s see what OpenAI thinks of our sample images when we rotate them by 180 degrees. ```python theme={null} t.add_computed_column(rot_image=t.input_image.rotate(180)) # This is identical to the preceding messages dict, but with # `t.rot_image` in place of `t.input_image`. messages = [ { 'role': 'user', 'content': [ {'type': 'text', 'text': "What's in this image?"}, {'type': 'image_url', 'image_url': t.rot_image}, ], } ] t.add_computed_column( rot_vision=openai.chat_completions(messages, model='gpt-4o-mini') ) ```
  Added 5 column values with 0 errors in 6.19 s (0.81 rows/s)
  5 rows updated.
```python theme={null} t.select( t.rot_image, t.rot_vision['choices'][0]['message']['content'] ).head() ```
## UDFs: Enhancing Pixeltable’s capabilities Another important principle of Pixeltable is that, although Pixeltable has a built-in library of useful operations and adapters, it will never prescribe a particular way of doing things. Pixeltable is built from the ground up to be extensible. Let’s take a specific example. Recall our use of the DETR (ResNet-50) detection model, in which the `detections` column contains a JSON blob with bounding boxes, scores, and labels. Suppose we want to create a column containing the single label with the highest confidence score. There’s no built-in Pixeltable function to do this, but it’s easy to write our own. In fact, all we have to do is write a Python function that does the thing we want, and mark it with the `@pxt.udf` decorator. ```python theme={null} @pxt.udf def top_detection(detect: dict) -> str: scores = detect['scores'] label_text = detect['label_text'] # Get the index of the object with the highest confidence i = scores.index(max(scores)) # Return the corresponding label return label_text[i] ``` ```python theme={null} t.add_computed_column(top=top_detection(t.detections)) ```
  Added 5 column values with 0 errors in 0.11 s (45.52 rows/s)
  5 rows updated.
```python theme={null} t.select(t.detections_text, t.top).show() ```
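Stripped of the decorator, `top_detection` is ordinary Python, so its logic can be checked directly on a hand-built detection dict (the scores and labels below are illustrative, not real model output):

```python
def top_detection(detect: dict) -> str:
    # Same body as the UDF above: argmax over the confidence scores
    scores = detect['scores']
    label_text = detect['label_text']
    i = scores.index(max(scores))
    return label_text[i]

# Illustrative detection result with three candidate labels
sample = {
    'scores': [0.31, 0.94, 0.72],
    'label_text': ['cat', 'dog', 'bench'],
}
print(top_detection(sample))  # -> dog
```

This is the general pattern for UDFs: test the function as plain Python first, then add `@pxt.udf` to make it usable inside computed columns.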
Congratulations! You’ve reached the end of the tutorial. Hopefully, this gives a good overview of the capabilities of Pixeltable, but there’s much more to explore. As a next step, you might check out one of the other tutorials, depending on your interests: * [Object Detection in Videos](/howto/use-cases/object-detection-in-videos) * [RAG Operations in Pixeltable](/howto/use-cases/rag-operations) * [Working with OpenAI in Pixeltable](/howto/providers/working-with-openai) # Configuration Source: https://docs.pixeltable.com/platform/configuration Complete guide to configuring Pixeltable ## Configuration options Pixeltable can be configured through: * Environment variables * System configuration file (`~/.pixeltable/config.toml` on Linux/macOS or `C:\Users\<username>\.pixeltable\config.toml` on Windows) Example `config.toml`: ```toml theme={null} [pixeltable] file_cache_size_g = 250 time_zone = "America/Los_Angeles" hide_warnings = true verbosity = 2 [openai] api_key = 'my-openai-api-key' [openai.rate_limits] tts-1 = 500 # OpenAI uses a per-model rate limit configuration (see below for details) [mistral] api_key = 'my-mistral-api-key' rate_limit = 600 # Mistral uses a single rate limit for all models [label_studio] url = 'http://localhost:8080/' api_key = 'my-label-studio-api-key' ``` ## System settings
| Environment Variable | Config File | Meaning |
| --- | --- | --- |
| PIXELTABLE\_HOME | | (string) Pixeltable user directory; default is \~/.pixeltable |
| PIXELTABLE\_CONFIG | | (string) Pixeltable config file; default is \$PIXELTABLE\_HOME/config.toml |
| PIXELTABLE\_PGDATA | | (string) Directory where Pixeltable DB is stored; default is \$PIXELTABLE\_HOME/pgdata |
| PIXELTABLE\_DB | | (string) Pixeltable database name; default is pixeltable |
| PIXELTABLE\_FILE\_CACHE\_SIZE\_G | \[pixeltable]<br>file\_cache\_size\_g | (float) Maximum size of the Pixeltable file cache, in GiB; required |
| PIXELTABLE\_TIME\_ZONE | \[pixeltable]<br>time\_zone | (string) Default time zone in [IANA format](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones); defaults to the system time zone |
| PIXELTABLE\_HIDE\_WARNINGS | \[pixeltable]<br>hide\_warnings | (bool) Suppress warnings generated by various libraries used by Pixeltable; default is false |
| PIXELTABLE\_VERBOSITY | \[pixeltable]<br>verbosity | (int) Verbosity for Pixeltable console logging (0: minimum, 1: normal, 2: maximum); default is 1 |
| PIXELTABLE\_API\_KEY | \[pixeltable]<br>api\_key | (string) API key for Pixeltable Cloud |
| PIXELTABLE\_INPUT\_MEDIA\_DEST | \[pixeltable]<br>input\_media\_dest | (string) Default destination URI for media files that are inserted into tables |
| PIXELTABLE\_OUTPUT\_MEDIA\_DEST | \[pixeltable]<br>output\_media\_dest | (string) Default destination URI for media files that are generated by Pixeltable operations |
| PIXELTABLE\_R2\_PROFILE | \[pixeltable]<br>r2\_profile | (string) Name of AWS config profile to use when accessing Cloudflare R2 resources. If not specified, default AWS credentials will be used. |
| PIXELTABLE\_S3\_PROFILE | \[pixeltable]<br>s3\_profile | (string) Name of AWS config profile to use when accessing Amazon S3 resources. If not specified, default AWS credentials will be used. |
| PIXELTABLE\_B2\_PROFILE | \[pixeltable]<br>b2\_profile | (string) Name of an S3-compatible profile for accessing Backblaze B2. Defaults to the standard AWS credential chain if not set. |
| PIXELTABLE\_TIGRIS\_PROFILE | \[pixeltable]<br>tigris\_profile | (string) Name of an S3-compatible profile for accessing Tigris. Defaults to the standard AWS credential chain if not set. |
## API configuration
| Environment Variable | Config File | Meaning |
| --- | --- | --- |
| ANTHROPIC\_API\_KEY | \[anthropic]<br>api\_key | (string) API key to use for Anthropic services |
| AZURE\_STORAGE\_ACCOUNT\_NAME | \[azure]<br>storage\_account\_name | (string) Azure Storage account name for use with Azure Blob Storage |
| AZURE\_STORAGE\_ACCOUNT\_KEY | \[azure]<br>storage\_account\_key | (string) Azure Storage account key for use with Azure Blob Storage |
| BEDROCK\_API\_KEY | \[bedrock]<br>api\_key | (string) API key to use for AWS Bedrock services |
| DEEPSEEK\_API\_KEY | \[deepseek]<br>api\_key | (string) API key to use for Deepseek services |
| FAL\_API\_KEY | \[fal]<br>api\_key | (string) API key to use for fal.ai services |
| FIREWORKS\_API\_KEY | \[fireworks]<br>api\_key | (string) API key to use for Fireworks AI services |
| GEMINI\_API\_KEY | \[gemini]<br>api\_key | (string) API key to use for Google Gemini services |
| GROQ\_API\_KEY | \[groq]<br>api\_key | (string) API key to use for Groq AI services |
| HF\_AUTH\_TOKEN | \[hf]<br>auth\_token | (string) Hugging Face auth token for use with Hugging Face services |
| LABEL\_STUDIO\_API\_KEY | \[label\_studio]<br>api\_key | (string) API key to use for Label Studio |
| LABEL\_STUDIO\_URL | \[label\_studio]<br>url | (string) URL of the Label Studio server to use |
| MISTRAL\_API\_KEY | \[mistral]<br>api\_key | (string) API key to use for Mistral AI services |
| OPENAI\_API\_KEY | \[openai]<br>api\_key | (string) API key to use for OpenAI services |
| OPENAI\_BASE\_URL | \[openai]<br>base\_url | (string, optional) Base URL to use for OpenAI services |
| OPENAI\_API\_VERSION | \[openai]<br>api\_version | (string) API version for use with Azure OpenAI; must be `'latest'` or `'preview'` |
| OPENROUTER\_API\_KEY | \[openrouter]<br>api\_key | (string) API key to use for OpenRouter services |
| OPENROUTER\_SITE\_URL | \[openrouter]<br>site\_url | (string) Application URL (optional, for OpenRouter analytics) |
| OPENROUTER\_APP\_NAME | \[openrouter]<br>app\_name | (string) Application name (optional, for OpenRouter analytics) |
| REPLICATE\_API\_TOKEN | \[replicate]<br>api\_token | (string) API token to use for Replicate services |
| REVE\_API\_KEY | \[reve]<br>api\_key | (string) API key to use for Reve Image services |
| TOGETHER\_API\_KEY | \[together]<br>api\_key | (string) API key to use for Together AI services |
| TWELVELABS\_API\_KEY | \[twelvelabs]<br>api\_key | (string) API key to use for TwelveLabs services |
| VOYAGE\_API\_KEY | \[voyage]<br>api\_key | (string) API key to use for Voyage AI services |
## Rate limit configuration Pixeltable supports two patterns for configuring API rate limits in `config.toml`. Refer to the docstring of the relevant UDF in the [SDK Reference](/sdk/latest) for details on the rate limiting pattern used by that UDF. ### Single rate limit per provider For providers with a single rate limit across all models, add a `rate_limit` key to the provider's config section: ```toml theme={null} [mistral] api_key = 'my-mistral-api-key' rate_limit = 600 # requests per minute [fireworks] api_key = 'my-fireworks-api-key' rate_limit = 300 ``` ### Per-model rate limits For providers that support different rate limits for different models, add a `.rate_limits` section and list the rate limits for each model: ```toml theme={null} [openai] api_key = 'my-openai-api-key' [openai.rate_limits] gpt-4o = 500 gpt-4o-mini = 1000 tts-1 = 50 dall-e-3 = 10 [gemini.rate_limits] gemini-pro = 600 gemini-pro-vision = 300 ``` If no rate limit is configured, Pixeltable uses a default of 600 requests per minute. ## Configuration best practices ### Security considerations When configuring API keys and sensitive information: * Avoid hardcoding API keys in your code * Use environment variables for temporary access * Use the config file for persistent configuration * Ensure your config.toml file has appropriate permissions (readable only by you) ### Performance tuning * Adjust `file_cache_size_g` based on your available disk space * For large datasets, increase the cache size to improve performance * Set appropriate verbosity level based on your debugging needs ## Applying configuration changes Configuration changes take effect when you restart your Python session. Return to the installation guide for setup instructions. # Data Sharing Source: https://docs.pixeltable.com/platform/data-sharing Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. 
You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Learn how to publish datasets to Pixeltable Cloud and replicate datasets from the cloud to your local environment. ## Overview Pixeltable Cloud enables you to: * **Publish** your datasets for sharing with teams or the public * **Replicate** datasets from the cloud to your local environment * Share multimodal AI datasets (images, videos, audio, documents) without managing infrastructure This guide demonstrates both publishing and replicating datasets. ## Setup Data sharing functionality requires Pixeltable version 0.4.24 or later. ```python theme={null} %pip install -qU pixeltable ``` ## Replicating datasets You can replicate any public dataset from Pixeltable Cloud to your local environment without needing an account or API key. ### Replicate a public dataset Let’s replicate a mini-version of the COCO-2017 dataset from Pixeltable Cloud. You can find this dataset at [pixeltable.com/t/pixeltable:fiftyone/coco\_mini\_2017](https://www.pixeltable.com/t/pixeltable:fiftyone/coco_mini_2017), or browse for other [public datasets](https://www.pixeltable.com/data-products). When calling `replicate()`: * **`remote_uri`** (required): The URI of the cloud dataset you want to replicate * **`local_path`** (your choice): The local directory/table name where you want to store the replica * **Variable name** (your choice): The Python variable in your session/script to reference the table (e.g., `coco_copy`) See the [replicate() SDK reference](/sdk/latest/pixeltable#func-replicate) for full documentation. 
```python theme={null} import pixeltable as pxt pxt.drop_dir('sharing-demo', force=True) pxt.create_dir('sharing-demo') # The remote_uri is the specific cloud dataset you want to replicate # The local_path and variable name are yours to choose coco_copy = pxt.replicate( remote_uri='pxt://pixeltable:fiftyone/coco_mini_2017', local_path='sharing-demo.coco-copy', ) ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'sharing-demo'.
  Output()
  Extracting table data into: /Users/asiegel/.pixeltable/tmp/acad78b1-4a62-483e-a0b1-728ccb5603cf
  Created directory '\_system'.
  Created local replica 'sharing-demo/coco-copy' from URI: pxt://pixeltable:fiftyone/coco\_mini\_2017
You can check that the replica exists at the local path with `list_tables()`. ```python theme={null} pxt.list_tables('sharing-demo') ```
  \['sharing-demo/coco-copy']
To see the structure of the replicated table: ```python theme={null} coco_copy ```
### Working with replicas Replicated datasets are read-only locally, but you can query, explore, and use them in powerful ways: **1. Query and explore the data** ```python theme={null} # View the replicated data coco_copy.limit(3).collect() ```
**2. Perform similarity searches** Replicas include embedding indexes, so you can immediately perform similarity searches: ```python theme={null} # Get a sample image to search with sample_img = ( coco_copy.select(coco_copy.image).limit(1).collect()[0]['image'] ) sample_img ``` ```python theme={null} # Perform image-based similarity search sim = coco_copy.image.similarity(image=sample_img) results = ( coco_copy.order_by(sim, asc=False) .limit(5) .select(coco_copy.image, sim) .collect() ) results ```
Because the COCO dataset uses CLIP embeddings (which are multimodal), you can also search using text queries: ```python theme={null} # Perform text-based similarity search sim = coco_copy.image.similarity(string='surfing') results = ( coco_copy.order_by(sim, asc=False) .limit(4) .select(coco_copy.image, sim) .collect() ) results ```
**3. Access replicas in new sessions** In a new Python session, use `list_tables()` and `get_table()` to access your replicas: ```python theme={null} # List all tables to see your replica pxt.list_tables('sharing-demo') ```
  \['sharing-demo/coco-copy']
```python theme={null} # Assign a handle to the replica coco_copy = pxt.get_table('sharing-demo.coco-copy') ``` **4. Create an independent copy** To work with the data in new ways, create an independent table with the replica as the source: ```python theme={null} # Create a fresh table with values only my_coco = pxt.create_table('sharing-demo.my-coco-table', source=coco_copy) ```
  Created table 'my-coco-table'.
This copies the values in the source, but drops the computational definitions and cannot be updated if the source table changes. ### Updating replicas with pull If the upstream table changes, you can update your local replica using `pull()`: ```python theme={null} # Update your local replica with changes from the cloud coco_copy.pull() ```
  Replica 'sharing-demo/coco-copy' is already up to date with source: pxt://pixeltable:fiftyone/d699317b-23a4-404b-8f71-6531fd8dc462
This synchronizes your local replica with any updates made to the source dataset. ## Publishing datasets **Requirements:** * A Pixeltable Cloud account (Community Edition includes 1TB storage - see [pricing](https://www.pixeltable.com/pricing)) * Your API key from the [account dashboard](https://pixeltable.com/dashboard) Publishing allows you to share your datasets with your team or make them publicly available. ### Configure your API key Pixeltable looks for your API key in the `PIXELTABLE_API_KEY` environment variable. Choose one of these methods: **Option 1: In your notebook (secure and convenient)** Run this cell to securely enter your API key (get it from [pixeltable.com/dashboard](https://pixeltable.com/dashboard)): ```python theme={null} import os from getpass import getpass os.environ['PIXELTABLE_API_KEY'] = getpass('Pixeltable API Key:') ``` **Option 2: Environment variable** Add to your `~/.zshrc` or `~/.bashrc`: ```bash theme={null} export PIXELTABLE_API_KEY='your-api-key-here' ``` **Option 3: Config file** Add to `~/.pixeltable/config.toml`: ```toml theme={null} [pixeltable] api_key = 'your-api-key-here' ``` See the [Configuration Guide](/platform/configuration) for details. ### Create a sample dataset Let’s create a table with images from this repository to publish. The `comment` parameter provides a description that will be visible on Pixeltable Cloud: ```python theme={null} t = pxt.create_table( 'sharing-demo.photos', schema={'image': pxt.Image, 'description': pxt.String}, comment='Sample image dataset for demonstrating Pixeltable Cloud publishing', ) ```
  Created table 'photos'.
```python theme={null} base_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images' t.insert( [ { 'image': f'{base_url}/000000000009.jpg', 'description': 'Kitchen scene', }, { 'image': f'{base_url}/000000000025.jpg', 'description': 'Street view', }, { 'image': f'{base_url}/000000000042.jpg', 'description': 'Indoor setting', }, ] ) ```
  Inserted 3 rows with 0 errors in 0.02 s (169.05 rows/s)
  3 rows inserted.
### Publish your dataset Publish your table to Pixeltable Cloud. When calling `publish()`: * **`source`** (required): An existing local table - either a table path string (e.g., `'sample-images.photos'`) or table handle (e.g., `t`) * If you use a local table path string, it must match a table in your local database (you can verify with `pxt.list_tables()`) * **`destination_uri`** (required): The cloud URI where you want to publish, in the format `pxt://orgname/dataset` * Pixeltable automatically creates any directory structure in the cloud based on this URI * Your local directory structure doesn’t need to match the cloud structure See the [publish() SDK reference](/sdk/latest/pixeltable#func-publish) for full documentation. ```python theme={null} # Option 1: Publish using table path (string) pxt.publish( source='sharing-demo.photos', # Table path from list_tables() destination_uri='pxt://your-orgname/sample-images', ) # Option 2: Publish using table handle # pxt.publish( # source=t, # Table handle you assigned # destination_uri='pxt://your-orgname/sample-images' # ) ``` ### Understanding destination URIs The `destination_uri` in `publish()` uses the format: `pxt://org:database/path` **URI components:** * **`org`** (required): Your organization name * **`database`** (optional): Database name - defaults to `main` if omitted * **`path`** (required): Directory and table path in the cloud **Examples:** * `pxt://orgname/my-dataset` → Uses the default `main` database * `pxt://orgname:main/my-dataset` → Explicitly specifies the `main` database * `pxt://orgname:analytics/my-dataset` → Uses the `analytics` database **About databases:** * Every Pixeltable Cloud account includes a `main` database by default * Each database has its own storage bucket * You can create additional databases in your [Pixeltable dashboard](https://pixeltable.com/dashboard) ### Updating published datasets with push After you’ve published a dataset, you can update the cloud replica with local changes 
using `push()`: ```python theme={null} # Make some changes to your local table t.insert( [ { 'image': f'{base_url}/000000000049.jpg', 'description': 'Outdoor scene', } ] ) # Push the changes to your published dataset t.push() ``` This updates the published dataset on Pixeltable Cloud with your local changes. Your dataset is now published and can be replicated by others using: ```python theme={null} import pixeltable as pxt sample_images = pxt.replicate( remote_uri='pxt://your-orgname/sample-images', local_path='sample-images-copy' ) ``` **Note:** If you are the owner of a published table, you cannot use `replicate()` to create a replica of your own table. This is because the table already exists in your Pixeltable database. The `replicate()` function is intended for pulling datasets published by others into your environment. ### Access control The `access` parameter in `publish()` controls who can replicate your dataset: * **`access='private'`** (default): Only your team members can access the dataset * **`access='public'`**: Anyone can replicate your dataset You can set access control either at the time of publish using the `access` parameter, or change it later in the [Pixeltable Cloud UI](https://pixeltable.com/dashboard). You can also manage team members and permissions in your dashboard. ### Deleting published tables If you want to delete a published table, you have two options: **Option 1: Using the Pixeltable SDK** Use `drop_table()` with your table’s destination URI (the same `pxt://` URI you used when publishing): ```python theme={null} pxt.drop_table('pxt://your-orgname/sample-images') ``` **Option 2: Using the Pixeltable Cloud dashboard** Navigate to your [Pixeltable Cloud dashboard](https://pixeltable.com/dashboard) and delete the table from the UI. ## Get help Have questions or need support? 
Join our community: * **[Discord Community](https://discord.com/invite/QPyqFYx2UN)**: Ask questions, get community support, and share what you build with Pixeltable * **[YouTube](https://www.youtube.com/@PixeltableHQ)**: Watch tutorials, demos, and feature walkthroughs * **[GitHub Issues](https://github.com/pixeltable/pixeltable/issues)**: Report bugs or request features ## Resources * [Pixeltable Cloud Dashboard](https://www.pixeltable.com/dashboard) * [Pixeltable Public Datasets](https://www.pixeltable.com/data-products) * [Pixeltable SDK Reference](/sdk/latest/) # Embedding Indices Source: https://docs.pixeltable.com/platform/embedding-indexes Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Main takeaways: * Indexing in Pixeltable is declarative * you create an index on a column and supply the embedding functions you want to use (for inserting data into the index as well as lookups) * Pixeltable maintains the index in response to any kind of update of the indexed table (i.e., `insert()`/`update()`/`delete()`) * Perform index lookups with the `similarity()` pseudo-function, in combination with the `order_by()` and `limit()` clauses To make this concrete, let’s create a table of images with the [`create_table()`](/sdk/latest/pixeltable#func-create_table) function. We’re also going to add some columns to demonstrate combining similarity search with other predicates. 
```python theme={null} %pip install -qU pixeltable transformers sentence_transformers ``` ```python theme={null} import pixeltable as pxt # Delete the `indices_demo` directory and its contents, if it exists pxt.drop_dir('indices_demo', force=True) # Create the directory and table to use for the demo pxt.create_dir('indices_demo') schema = {'id': pxt.Int, 'img': pxt.Image} imgs = pxt.create_table('indices_demo/img_tbl', schema) ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory \`indices\_demo\`.
  Created table \`img\_tbl\`.
We start out by inserting 10 rows: ```python theme={null} img_urls = [ 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000030.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000034.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000042.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000049.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000057.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000061.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000063.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000064.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000069.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000071.jpg', ] imgs.insert({'id': i, 'img': url} for i, url in enumerate(img_urls)) ```
  Computing cells:  80%|█████████████████████████████████▌        | 16/20 \[00:01\<00:00, 14.67 cells/s]
  Inserting rows into \`img\_tbl\`: 10 rows \[00:00, 3589.17 rows/s]
  Computing cells: 100%|██████████████████████████████████████████| 20/20 \[00:01\<00:00, 18.16 cells/s]
  Inserted 10 rows with 0 errors.
  UpdateStatus(num\_rows=10, num\_computed\_values=20, num\_excs=0, updated\_cols=\[], cols\_with\_excs=\[])
For the sake of convenience, we’re storing the images as external URLs, which are cached transparently by Pixeltable. For details on working with external media files, see [Working with External Files](/platform/external-files). ## Creating an index To create and populate an index, we call [`Table.add_embedding_index()`](/sdk/latest/table#method-add_embedding_index) and tell it which UDF or UDFs to use to create embeddings. That definition is persisted as part of the table’s metadata, which allows Pixeltable to maintain the index in response to updates to the table. Any embedding UDF can be used for the index. For this example, we’re going to use a [CLIP](https://huggingface.co/docs/transformers/en/model_doc/clip) model, which has built-in support in Pixeltable under the [`pixeltable.functions.huggingface`](/sdk/latest/huggingface) package. As an alternative, you could use an online service such as OpenAI (see [`pixeltable.functions.openai`](/sdk/latest/openai)), or create your own embedding UDF with custom code (we’ll see how to do this below). Because we’re adding an index to an image column, the UDF we specify *must* be able to handle images. In fact, CLIP models are multimodal: they can handle both text and images, which is useful for doing lookups against the index. ```python theme={null} import PIL.Image from pixeltable.functions.huggingface import clip # create embedding index on the 'img' column imgs.add_embedding_index( 'img', embedding=clip.using(model_id='openai/clip-vit-base-patch32') ) ```
  Computing cells: 100%|██████████████████████████████████████████| 10/10 \[00:04\<00:00,  2.50 cells/s]
The first parameter of `add_embedding_index()` is the name of the column being indexed; the `embedding` parameter specifies the relevant embedding function. Notice the notation we used: ```python theme={null} clip.using(model_id='openai/clip-vit-base-patch32') ``` `clip` is a general-purpose UDF that can accept any CLIP model available in the Hugging Face model repository. To define an embedding, however, we need to provide a specific embedding function to `add_embedding_index()`: a function that is *not* parameterized on `model_id`. The `.using(model_id=...)` syntax tells Pixeltable to specialize the `clip` UDF by fixing the `model_id` parameter to the specific value `'openai/clip-vit-base-patch32'`. If you’re familiar with functional programming concepts, you might recognize `.using()` as a partial function operator. It’s a general operator that can be applied to any UDF (not just embedding functions), transforming a UDF with *n* parameters into one with *k* parameters by fixing the values of *n − k* of its arguments. Python has something similar in the `functools` package: the `functools.partial()` operator. `add_embedding_index()` provides a few other optional parameters: * `idx_name`: optional name for the index, which needs to be unique for the table; a default name is created if this isn’t provided explicitly * `metric`: the metric to use to compute the similarity of two embedding vectors; one of: * `'cosine'`: cosine distance (default) * `'ip'`: inner product * `'l2'`: L2 distance If desired, you can create multiple indexes on the same column, using different embedding functions. This can be useful to evaluate the effectiveness of different embedding functions side-by-side, or to use embedding functions tailored to specific use cases. In that case, you can provide explicit names for those indexes and then reference them during queries. We’ll illustrate that later with an example. 
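The analogy with `functools.partial()` can be made concrete with a small standalone sketch. This is plain Python, independent of Pixeltable; the `embed` function below is a stand-in for a parameterized UDF like `clip`, returning a string instead of an embedding vector so the effect of fixing a parameter is easy to see:

```python
from functools import partial

def embed(model_id: str, text: str) -> str:
    # Stand-in for a UDF parameterized on model_id; a real embedding
    # function would return a vector rather than a string.
    return f'{model_id}:{text}'

# Fix the model_id parameter, leaving a function of one parameter --
# analogous to clip.using(model_id='openai/clip-vit-base-patch32')
clip_b32 = partial(embed, model_id='openai/clip-vit-base-patch32')

print(clip_b32(text='a photo of a dog'))
# -> openai/clip-vit-base-patch32:a photo of a dog
```

Like `.using()`, `partial()` returns a new callable with the fixed argument baked in; the difference is that `.using()` produces a specialized UDF that Pixeltable can persist as part of the index definition.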
## Using the index in queries To take advantage of an embedding index when querying a table, we use the `similarity()` pseudo-function, which is invoked as a method on the indexed column, in combination with the [`order_by()`](/sdk/latest/query#method-order_by) and [`limit()`](/sdk/latest/query#method-limit) clauses. First, we’ll get a sample image from the table: ```python theme={null} # retrieve the 'img' column of some row as a PIL.Image.Image sample_img = imgs.select(imgs.img).collect()[6]['img'] sample_img ``` We then call the `similarity()` pseudo-function as a method on the indexed column and apply `order_by()` and `limit()`. We used the default cosine distance when we created the index, so we’re going to order by descending similarity (`order_by(..., asc=False)`): ```python theme={null} sim = imgs.img.similarity(image=sample_img) res = ( imgs.order_by(sim, asc=False) # Order by descending similarity .limit(2) # Limit number of results to 2 .select(imgs.id, imgs.img, sim) .collect() # Retrieve results now ) res ```
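For reference, the three `metric` options correspond to standard vector comparisons. The following pure-Python sketch (not part of Pixeltable's API) shows what each metric computes over a pair of embedding vectors:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product of the two vectors, normalized
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def inner_product(a, b):
    # 'ip': unnormalized dot product
    return sum(x * y for x, y in zip(a, b))

def l2(a, b):
    # 'l2': Euclidean distance; smaller means more similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

u, v = [1.0, 0.0], [1.0, 0.0]
print(cosine(u, v), inner_product(u, v), l2(u, v))
# identical vectors -> 1.0 1.0 0.0
```

Note the direction of each metric: for cosine and inner product, larger values mean more similar (hence `order_by(sim, asc=False)` in the queries above), while for L2 distance smaller values mean more similar.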
We can combine nearest-neighbor/similarity search with standard predicates. Here’s the same query, but filtering out the selected `sample_img` (which we already know has perfect similarity with itself): ```python theme={null} res = ( imgs.order_by(sim, asc=False) .where(imgs.id != 6) # Additional clause .limit(2) .select(imgs.id, imgs.img, sim) .collect() ) res ```
## Index updates In Pixeltable, each index is kept up-to-date automatically in response to changes to the indexed table. To illustrate this, let’s insert a few more rows: ```python theme={null} more_img_urls = [ 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000080.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000090.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000106.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000108.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000139.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000285.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000632.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000724.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000776.jpg', 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000785.jpg', ] imgs.insert( {'id': 10 + i, 'img': url} for i, url in enumerate(more_img_urls) ) ```
  Computing cells:  33%|██████████████                            | 10/30 \[00:01\<00:02,  8.90 cells/s]
  Inserting rows into \`img\_tbl\`: 10 rows \[00:00, 1337.60 rows/s]
  Computing cells: 100%|██████████████████████████████████████████| 30/30 \[00:01\<00:00, 24.55 cells/s]
  Inserted 10 rows with 0 errors.
  UpdateStatus(num\_rows=10, num\_computed\_values=30, num\_excs=0, updated\_cols=\[], cols\_with\_excs=\[])
When we now re-run the initial similarity query, we get a different result: ```python theme={null} sim = imgs.img.similarity(image=sample_img) res = ( imgs.order_by(sim, asc=False) .limit(2) .select(imgs.id, imgs.img, sim) .collect() ) res ```
## Similarity search on different types Because CLIP models are multimodal, we can also do lookups by text. ```python theme={null} sim = imgs.img.similarity(string='train') # String lookup res = ( imgs.order_by(sim, asc=False) .limit(2) .select(imgs.id, imgs.img, sim) .collect() ) res ```
## Creating multiple indexes on a single column We can create multiple embedding indexes on the same column, using different embedding models. To use a specific index in a query, we assign it a name and then refer to it by name in the query. To illustrate this, let’s create a table with text (taken from the Wikipedia article on [Pablo Picasso](https://en.wikipedia.org/wiki/Pablo_Picasso)): ```python theme={null} txts = pxt.create_table('indices_demo/text_tbl', {'text': pxt.String}) sentences = [ 'Pablo Ruiz Picasso (25 October 1881 – 8 April 1973) was a Spanish painter, sculptor, printmaker, ceramicist, and theatre designer who spent most of his adult life in France.', 'One of the most influential artists of the 20th century, he is known for co-founding the Cubist movement, the invention of constructed sculpture,[8][9] the co-invention of collage, and for the wide variety of styles that he helped develop and explore.', "Among his most famous works are the proto-Cubist Les Demoiselles d'Avignon (1907) and the anti-war painting Guernica (1937), a dramatic portrayal of the bombing of Guernica by German and Italian air forces during the Spanish Civil War.", 'Picasso demonstrated extraordinary artistic talent in his early years, painting in a naturalistic manner through his childhood and adolescence.', 'During the first decade of the 20th century, his style changed as he experimented with different theories, techniques, and ideas.', 'After 1906, the Fauvist work of the older artist Henri Matisse motivated Picasso to explore more radical styles, beginning a fruitful rivalry between the two artists, who subsequently were often paired by critics as the leaders of modern art.', "Picasso's output, especially in his early career, is often periodized.", 'While the names of many of his later periods are debated, the most commonly accepted periods in his work are the Blue Period (1901–1904), the Rose Period (1904–1906), the African-influenced Period (1907–1909), Analytic Cubism (1909–1912), and Synthetic Cubism (1912–1919), also referred to as the Crystal period.', "Much of Picasso's work of the late 1910s and early 1920s is in a neoclassical style, and his work in the mid-1920s often has characteristics of Surrealism.", 'His later work often combines elements of his earlier styles.', ] txts.insert({'text': s} for s in sentences) ```
  Created table \`text\_tbl\`.
  Inserting rows into \`text\_tbl\`: 10 rows \[00:00, 3599.64 rows/s]
  Inserted 10 rows with 0 errors.
  UpdateStatus(num\_rows=10, num\_computed\_values=10, num\_excs=0, updated\_cols=\[], cols\_with\_excs=\[])
When calling [`add_embedding_index()`](/sdk/latest/table#method-add_embedding_index), we now specify the index name (`idx_name`) directly. If it is not specified, Pixeltable will assign a name (such as `idx0`). ```python theme={null} from pixeltable.functions.huggingface import sentence_transformer txts.add_embedding_index( 'text', idx_name='minilm_idx', embedding=sentence_transformer.using( model_id='sentence-transformers/all-MiniLM-L12-v2' ), ) txts.add_embedding_index( 'text', idx_name='e5_idx', embedding=sentence_transformer.using(model_id='intfloat/e5-large-v2'), ) ```
  Computing cells: 100%|██████████████████████████████████████████| 10/10 \[00:01\<00:00,  6.86 cells/s]
  Computing cells: 100%|██████████████████████████████████████████| 10/10 \[00:01\<00:00,  6.35 cells/s]
To do a similarity query, we now call `similarity()` with the `idx` parameter: ```python theme={null} sim = txts.text.similarity('cubism', idx='minilm_idx') res = ( txts.order_by(sim, asc=False) .limit(2) .select(txts.text, sim) .collect() ) res ```
## Using a UDF for a custom embedding The above examples show how to use any model in the Hugging Face `CLIP` or `sentence_transformer` model families, and essentially the same pattern can be used for any other embedding with built-in Pixeltable support, such as OpenAI embeddings. But what if you want to adapt a new model family that doesn’t have built-in support in Pixeltable? This can be done by writing a custom Pixeltable UDF. In the following example, we’ll write a simple UDF to use the [BERT](https://www.kaggle.com/models/tensorflow/bert/tensorFlow2/en-uncased-preprocess/3) model built on TensorFlow. First we install the necessary dependencies. ```python theme={null} %pip install -qU tensorflow tensorflow-hub tensorflow-text ``` Text embedding UDFs must always take a string as input, and return a 1-dimensional numpy array of fixed dimension (512 in the case of `small_bert`, the variant we’ll be using). If we were writing an image embedding UDF, the `input` would have type `PIL.Image.Image` rather than `str`. The UDF is straightforward, loading the model and evaluating it against the input, with a minor data conversion on either side of the model invocation. ```python theme={null} import pixeltable as pxt import tensorflow as tf import tensorflow_hub as hub import tensorflow_text # Necessary to ensure BERT dependencies are loaded @pxt.udf def bert(input: str) -> pxt.Array[(512,), pxt.Float]: """Computes text embeddings using the small_bert model.""" preprocessor = hub.load( 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3' ) bert_model = hub.load( 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2' ) tensor = tf.constant([input]) # Convert the string to a tensor result = bert_model(preprocessor(tensor))['pooled_output'] return result.numpy()[0, :] ``` ```python theme={null} txts.add_embedding_index('text', idx_name='bert_idx', embedding=bert) ```
  Computing cells: 100%|██████████████████████████████████████████| 10/10 \[00:17\<00:00,  1.72s/ cells]
Here’s the output of our sample query run against `bert_idx`. ```python theme={null} sim = txts.text.similarity('cubism', idx='bert_idx') res = ( txts.order_by(sim, asc=False) .limit(2) .select(txts.text, sim) .collect() ) res ```
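Note that the `bert` UDF above reloads the model on every invocation. A common fix is to load lazily once and reuse the cached object on later calls. Here's a generic sketch of that caching pattern with a dummy loader standing in for `hub.load()` (the names here are illustrative, not Pixeltable or TensorFlow API):

```python
from functools import lru_cache

load_count = 0  # track how often the expensive loader actually runs

def expensive_load(name: str) -> dict:
    """Stand-in for hub.load(): pretend to fetch and build a model."""
    global load_count
    load_count += 1
    return {'name': name}

@lru_cache(maxsize=None)
def get_model(name: str) -> dict:
    # First call loads the model; subsequent calls return the cached instance
    return expensive_load(name)

def embed(text: str) -> list:
    model = get_model('small_bert')  # cached after the first invocation
    return [len(text), model['name']]

embed('hello')
embed('world')
print(load_count)  # the loader ran only once
```

The same idea applies directly inside a UDF body: look the model up through a cached accessor instead of calling `hub.load()` on every row.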
Our example UDF is very simple, but it would perform poorly in a production setting. To make our UDF production-ready, we’d want to do two things: * Cache the model: the current version calls `hub.load()` on every UDF invocation. In a real application, we’d want to instantiate the model just once, then reuse it on subsequent UDF calls. * Batch our inputs: we’d use Pixeltable’s batching capability to ensure we’re making efficient use of the model. Batched UDFs are described in depth in the [User-Defined Functions](/platform/udfs-in-pixeltable) how-to guide. You might have noticed that the updates to `bert_idx` seem sluggish; that’s why! ## Deleting an index To delete an index, call [`Table.drop_embedding_index()`](/sdk/latest/table#method-drop_embedding_index): * specify the `idx_name` parameter if you have multiple indices * otherwise the `column_name` parameter is sufficient Given that we have several embedding indices, we’ll specify which index to drop: ```python theme={null} txts.drop_embedding_index(idx_name='e5_idx') ``` # External Files Source: https://docs.pixeltable.com/platform/external-files Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. In Pixeltable, all media data (videos, images, audio) resides in external files, and Pixeltable stores references to those. The files can be local or remote (e.g., in S3). For the latter, Pixeltable automatically caches the files locally on access. 
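To build intuition for the local caching of remote files, here is a rough sketch of a URL-keyed file cache that downloads only on a miss. This is purely illustrative; Pixeltable's actual cache layout and keying scheme are implementation details:

```python
import hashlib
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())

def cached_local_path(url: str, fetch) -> Path:
    """Return a local copy of `url`, calling `fetch` only on a cache miss."""
    # Key the cache entry by a hash of the URL, preserving the file extension
    key = hashlib.sha256(url.encode()).hexdigest()
    path = CACHE_DIR / f'{key}{Path(url).suffix}'
    if not path.exists():
        path.write_bytes(fetch(url))  # miss: download and store locally
    return path

calls = []
def fake_fetch(url: str) -> bytes:
    calls.append(url)
    return b'video-bytes'

p1 = cached_local_path('s3://bucket/clip.mp4', fake_fetch)
p2 = cached_local_path('s3://bucket/clip.mp4', fake_fetch)
print(p1 == p2, len(calls))
```

The second lookup hits the cache, so the remote object is fetched only once.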
When interacting with media data via Pixeltable, either through queries or UDFs, the user sees the following Python types: * `ImageType`: `PIL.Image.Image` * `VideoType`: `str` (local path) * `AudioType`: `str` (local path) Let’s create a table and load some data to see what that looks like: ```python theme={null} %pip install -qU pixeltable boto3 ``` ```python theme={null} import pixeltable as pxt import random import shutil import tempfile # First drop the `external_data` directory if it exists, to ensure # a clean environment for the demo pxt.drop_dir('external_data', force=True) pxt.create_dir('external_data') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory \`external\_data\`.
```python theme={null} v = pxt.create_table('external_data/videos', {'video': pxt.Video}) prefix = 's3://multimedia-commons/' paths = [ 'data/videos/mp4/ffe/ffb/ffeffbef41bbc269810b2a1a888de.mp4', 'data/videos/mp4/ffe/feb/ffefebb41485539f964760e6115fbc44.mp4', 'data/videos/mp4/ffe/f73/ffef7384d698b5f70d411c696247169.mp4', ] v.insert({'video': prefix + p} for p in paths) ```
  Created table \`videos\`.

We just inserted 3 rows with video files residing in S3. When we now
query these, we are presented with their locally cached counterparts.

(Note: we don’t simply display the output of `collect()` here, because
that is formatted as an HTML table with a media player and so would
obscure the file path.)

```python theme={null}
rows = list(v.select(v.video).collect())
rows[0]
```

  \{'video': '/Users/asiegel/.pixeltable/file\_cache/682f022a704d4459adb2f29f7fe9577c\_0\_1fcfcb221263cff76a2853250fbbb2e90375dd495454c0007bc6ff4430c9a4a7.mp4'}
Let’s make a local copy of the first file and insert that separately. First, the copy: ```python theme={null} local_path = tempfile.mktemp(suffix='.mp4') shutil.copyfile(rows[0]['video'], local_path) local_path ```
  '/var/folders/hb/qd0dztsj43j\_mdb6hbl1gzyc0000gn/T/tmp1jo4a7ca.mp4'
Now the insert: ```python theme={null} v.insert([{'video': local_path}]) ```

When we query this again, we see that local paths are preserved:

```python theme={null}
rows = list(v.select(v.video).collect())
rows
```

  \[\{'video': '/Users/asiegel/.pixeltable/file\_cache/682f022a704d4459adb2f29f7fe9577c\_0\_1fcfcb221263cff76a2853250fbbb2e90375dd495454c0007bc6ff4430c9a4a7.mp4'},
   \{'video': '/Users/asiegel/.pixeltable/file\_cache/682f022a704d4459adb2f29f7fe9577c\_0\_fc11428b32768ae782193a57ebcbad706f45bbd9fa13354471e0bcd798fee3ea.mp4'},
   \{'video': '/Users/asiegel/.pixeltable/file\_cache/682f022a704d4459adb2f29f7fe9577c\_0\_b9fb0d9411bc9cd183a36866911baa7a8834f22f665bce47608566b38485c16a.mp4'},
   \{'video': '/var/folders/hb/qd0dztsj43j\_mdb6hbl1gzyc0000gn/T/tmp1jo4a7ca.mp4'}]
UDFs also see local paths: ```python theme={null} @pxt.udf def f(v: pxt.Video) -> int: print(f'{type(v)}: {v}') return 1 ``` ```python theme={null} v.select(f(v.video)).show() ```
  \<class 'str'>: /Users/asiegel/.pixeltable/file\_cache/682f022a704d4459adb2f29f7fe9577c\_0\_1fcfcb221263cff76a2853250fbbb2e90375dd495454c0007bc6ff4430c9a4a7.mp4
  \<class 'str'>: /Users/asiegel/.pixeltable/file\_cache/682f022a704d4459adb2f29f7fe9577c\_0\_fc11428b32768ae782193a57ebcbad706f45bbd9fa13354471e0bcd798fee3ea.mp4
  \<class 'str'>: /Users/asiegel/.pixeltable/file\_cache/682f022a704d4459adb2f29f7fe9577c\_0\_b9fb0d9411bc9cd183a36866911baa7a8834f22f665bce47608566b38485c16a.mp4
  \<class 'str'>: /var/folders/hb/qd0dztsj43j\_mdb6hbl1gzyc0000gn/T/tmp1jo4a7ca.mp4
## Dealing with errors When interacting with media data in Pixeltable, the user can assume that the underlying files exist, are local, and are valid for their respective data types. In other words, the user doesn’t need to consider error conditions. To that end, Pixeltable validates media data on ingest. The default behavior is to reject invalid media files: ```python theme={null} v.insert([{'video': prefix + 'bad_path.mp4'}]) ```
  Error: Failed to download s3://multimedia-commons/bad\_path.mp4: An error occurred (404) when calling the HeadObject operation: Not Found
The same happens for corrupted files: ```python theme={null} # create invalid .mp4 with tempfile.NamedTemporaryFile( mode='wb', suffix='.mp4', delete=False ) as temp_file: temp_file.write(random.randbytes(1024)) corrupted_path = temp_file.name v.insert([{'video': corrupted_path}]) ```
  Computing cells: 100%|██████████████████████████████████████████| 2/2 \[00:00\<00:00, 1084.64 cells/s]
  Error: Not a valid video: /var/folders/hb/qd0dztsj43j\_mdb6hbl1gzyc0000gn/T/tmp3djgfyjp.mp4
Alternatively, Pixeltable can be instructed to record error conditions and continue the ingest, via the `on_error` parameter (default: `'abort'`): ```python theme={null} v.insert( [{'video': prefix + 'bad_path.mp4'}, {'video': corrupted_path}], on_error='ignore', ) ```
  Computing cells: 100%|████████████████████████████████████████████| 4/4 \[00:00\<00:00, 20.98 cells/s]
  Inserting rows into \`videos\`: 2 rows \[00:00, 671.63 rows/s]
  Computing cells: 100%|████████████████████████████████████████████| 4/4 \[00:00\<00:00, 20.13 cells/s]
  Inserted 2 rows with 4 errors across 2 columns (videos.video, videos.None).
  UpdateStatus(num\_rows=2, num\_computed\_values=4, num\_excs=4, updated\_cols=\[], cols\_with\_excs=\['videos.video', 'videos.None'])
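Conceptually, `on_error='ignore'` stores a null cell value plus error metadata instead of raising. A pure-Python sketch of that bookkeeping (illustrative only, not Pixeltable internals):

```python
def ingest(paths, validate, on_error='abort'):
    """Validate each path; record (value, errortype, errormsg) per row."""
    rows = []
    for p in paths:
        try:
            validate(p)
            rows.append({'video': p, 'errortype': None, 'errormsg': None})
        except Exception as exc:
            if on_error == 'abort':
                raise
            # Record the failure and keep going; the cell value becomes None
            rows.append({
                'video': None,
                'errortype': type(exc).__name__,
                'errormsg': str(exc),
            })
    return rows

def validate(p):
    if not p.endswith('.mp4'):
        raise ValueError(f'Not a valid video: {p}')

rows = ingest(['a.mp4', 'bad.txt'], validate, on_error='ignore')
print(rows[1]['errortype'], rows[1]['errormsg'])
```

The recorded metadata corresponds to the per-column error properties described next.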
Every media column has properties `errortype` and `errormsg` (both containing `string` data) that indicate whether the column value is valid. Invalid values show up as `None` and have non-null `errortype`/`errormsg`: ```python theme={null} v.select(v.video == None, v.video.errortype, v.video.errormsg).collect() ```
Errors can now be inspected (and corrected) after the ingest: ```python theme={null} v.where(v.video.errortype != None).select(v.video.errormsg).collect() ```
## Accessing the original file paths In some cases, it will be necessary to access file paths (not, say, the `PIL.Image.Image`), and Pixeltable provides the column properties `fileurl` and `localpath` for that purpose: ```python theme={null} v.select(v.video.fileurl, v.video.localpath).collect() ```
Note that for local media files, the `fileurl` property still returns a parsable URL. # Iterators Source: https://docs.pixeltable.com/platform/iterators Learn about iterators for processing documents, videos, audio, and images ## What are iterators? Iterators in Pixeltable are specialized tools for processing and transforming media content. They efficiently break down large files into manageable chunks, enabling analysis at different granularities. Iterators work seamlessly with views to create virtual derived tables without duplicating storage. In Pixeltable, iterators: * Process media files incrementally to manage memory efficiently * Transform single records into multiple output records * Support various media types including documents, videos, images, and audio * Integrate with the view system for automated processing pipelines * Provide configurable parameters for fine-tuning output Iterators are particularly useful when: * Working with large media files that can't be processed at once * Building retrieval systems that require chunked content * Creating analysis pipelines for multimedia data * Implementing feature extraction workflows ```python theme={null} import pixeltable as pxt from pixeltable.functions.document import document_splitter # Create a view using an iterator chunks = pxt.create_view( 'docs/chunks', documents_table, iterator=document_splitter( document=documents_table.document, separators='sentence,token_limit', limit=300 ) ) ``` ## Core concepts Split documents into chunks by headings, sentences, or token limits Extract frames at specified intervals or counts Divide images into overlapping or non-overlapping tiles Split audio files into time-based chunks with configurable overlap Iterators are powerful tools for processing large media files. They work seamlessly with Pixeltable's computed columns and versioning system. 
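The one-record-to-many expansion that an iterator performs can be sketched as a plain Python generator. This toy sentence splitter with a word limit is illustrative only; it is not the real `document_splitter`:

```python
from typing import Iterator

def split_document(text: str, limit: int) -> Iterator[dict]:
    """Yield one output record per chunk of at most `limit` words."""
    words = text.split()
    for pos, start in enumerate(range(0, len(words), limit)):
        # Each yielded dict becomes one row in the derived view
        yield {'pos': pos, 'text': ' '.join(words[start:start + limit])}

chunks = list(split_document('one two three four five', limit=2))
print([c['text'] for c in chunks])  # ['one two', 'three four', 'five']
```

A view built over such an iterator materializes one row per yielded record, without copying the underlying media.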
## Available iterators

```python theme={null}
from pixeltable.functions.document import document_splitter

# Create view with document chunks
chunks_view = pxt.create_view(
    'docs/chunks',
    docs_table,
    iterator=document_splitter(
        document=docs_table.document,
        separators='sentence,token_limit',
        limit=500,
        metadata='title,heading'
    )
)
```

### Parameters

* `separators`: Choose from `'heading'`, `'sentence'`, `'token_limit'`, `'char_limit'`, `'page'`
* `limit`: Maximum tokens/characters per chunk
* `metadata`: Optional fields like `'title'`, `'heading'`, `'sourceline'`, `'page'`, `'bounding_box'`
* `overlap`: Optional overlap between chunks

```python theme={null}
from pixeltable.functions.video import frame_iterator

# Extract frames at 1 FPS
frames_view = pxt.create_view(
    'videos/frames',
    videos_table,
    iterator=frame_iterator(
        video=videos_table.video,
        fps=1.0
    )
)

# Extract exact number of frames (evenly spaced)
frames_view = pxt.create_view(
    'videos/sampled',
    videos_table,
    iterator=frame_iterator(
        video=videos_table.video,
        num_frames=10  # Extract 10 evenly-spaced frames
    )
)

# Extract only keyframes (I-frames) for efficient processing
keyframes_view = pxt.create_view(
    'videos/keyframes',
    videos_table,
    iterator=frame_iterator(
        video=videos_table.video,
        keyframes_only=True
    )
)
```

### Parameters

* `fps`: Frames per second to extract (can be fractional)
* `num_frames`: Exact number of frames to extract
* `keyframes_only`: Extract only keyframes (I-frames) - efficient for quick video scanning
* Only one of `fps`, `num_frames`, or `keyframes_only` can be specified

```python theme={null}
from pixeltable.functions.video import video_splitter

# Split video into 10-second segments
segments_view = pxt.create_view(
    'videos/segments',
    videos_table,
    iterator=video_splitter(
        video=videos_table.video,
        duration=10.0,
        min_segment_duration=1.0
    )
)
```

### Parameters

* `duration`: Duration of each segment in seconds
* `overlap`: Overlap between segments in seconds
* `min_segment_duration`: Drop the last segment
if it is shorter than this value

### Returns

For each segment, yields:

* `segment_start`: Start time of the segment in seconds
* `segment_end`: End time of the segment in seconds
* `video_segment`: The video segment file

```python theme={null}
from pixeltable.functions.string import string_splitter

# Split text into sentences
sentences_view = pxt.create_view(
    'texts/sentences',
    texts_table,
    iterator=string_splitter(
        text=texts_table.content,
        separators='sentence'
    )
)
```

### Parameters

* `separators`: Choose from `'sentence'` (requires spaCy)

### Returns

For each chunk, yields:

* `text`: The text chunk

```python theme={null}
from pixeltable.functions.image import tile_iterator

# Create tiles with overlap
tiles_view = pxt.create_view(
    'images/tiles',
    images_table,
    iterator=tile_iterator(
        image=images_table.image,
        tile_size=(224, 224),  # Width, Height
        overlap=(32, 32)       # Horizontal, Vertical overlap
    )
)
```

### Parameters

* `tile_size`: Tuple of (width, height) for each tile
* `overlap`: Optional tuple for overlap between tiles

```python theme={null}
from pixeltable.functions.audio import audio_splitter

# Split audio into chunks
chunks_view = pxt.create_view(
    'audio/chunks',
    audio_table,
    iterator=audio_splitter(
        audio=audio_table.audio,
        duration=30.0,            # Split into 30-second chunks
        overlap=2.0,              # 2-second overlap between chunks
        min_segment_duration=5.0  # Drop last chunk if < 5 seconds
    )
)
```

### Parameters

* `duration` (float): Duration of each audio chunk in seconds
* `overlap` (float, default: 0.0): Overlap duration between consecutive chunks in seconds
* `min_segment_duration` (float, default: 0.0): Minimum duration threshold; the last chunk will be dropped if it's shorter than this value

### Returns

For each chunk, yields:

* `start_time_sec`: Start time of the chunk in seconds
* `end_time_sec`: End time of the chunk in seconds
* `audio_chunk`: The audio chunk as `pxt.Audio`

### Notes

* If the input contains no audio, no chunks are yielded
* The audio file is processed
efficiently with proper codec handling
* Supports various audio formats including MP3, AAC, Vorbis, Opus, FLAC

## Common use cases

Split documents for:

* RAG systems
* Text analysis
* Content extraction

Extract frames for:

* Object detection
* Scene classification
* Activity recognition

Create tiles for:

* High-resolution analysis
* Object detection
* Segmentation tasks

Split audio for:

* Speech recognition
* Sound classification
* Audio feature extraction

## Example workflows

```python theme={null}
# Create document chunks
chunks = pxt.create_view(
    'rag/chunks',
    docs_table,
    iterator=document_splitter(
        document=docs_table.document,
        separators='sentence,token_limit',
        limit=500
    )
)

# Add embeddings
chunks.add_embedding_index(
    'text',
    string_embed=sentence_transformer.using(
        model_id='all-mpnet-base-v2'
    )
)
```

```python theme={null}
# Extract frames at 1 FPS
frames = pxt.create_view(
    'detection/frames',
    videos_table,
    iterator=frame_iterator(
        video=videos_table.video,
        fps=1.0
    )
)

# Add object detection
frames.add_computed_column(detections=detect_objects(frames.frame))
```

```python theme={null}
# Split long audio files
chunks = pxt.create_view(
    'audio/chunks',
    audio_table,
    iterator=audio_splitter(
        audio=audio_table.audio,
        duration=30.0
    )
)

# Add transcription
chunks.add_computed_column(text=whisper_transcribe(chunks.audio_chunk))
```

```python theme={null}
from pixeltable.functions.video import make_video

# Extract frames at 1 FPS
frames = pxt.create_view(
    'video/frames',
    videos_table,
    iterator=frame_iterator(
        video=videos_table.video,
        fps=1.0
    )
)

# Process frames (e.g., apply a filter)
frames.add_computed_column(processed=frames.frame.filter('BLUR'))

# Create new videos from processed frames
processed_videos = frames.select(
    frames.video_id,
    make_video(frames.pos, frames.processed)  # Default fps is 25
).group_by(frames.video_id).collect()
```

## Best practices

* Use appropriate chunk sizes
* Consider overlap requirements
* Monitor memory usage with large files
*
Balance chunk size vs. processing time
* Use batch processing when possible
* Cache intermediate results

## Tips & tricks

When using `token_limit` with `document_splitter`, make sure the limit leaves room for the context windows of any models downstream in your pipeline.

## Additional resources

* All built-in iterators
* Chunk documents for RAG
* Extract video frames

# Multimodal Type System

Source: https://docs.pixeltable.com/platform/type-system

Understanding Pixeltable types for structured data, media, and ML workflows

Pixeltable provides a rich type system designed for multimodal AI applications. Every column and expression has an associated type that determines what data it can hold and what operations are available.

## Type overview

| Pixeltable Type | Python Type | Description |
| --------------- | --------------------------------------------- | -------------------------------------- |
| `pxt.String` | `str` | Text data |
| `pxt.Int` | `int` | Integer numbers |
| `pxt.Float` | `float` | Decimal numbers |
| `pxt.Bool` | `bool` | Boolean values |
| `pxt.Timestamp` | `datetime.datetime` | Timestamp values |
| `pxt.Date` | `datetime.date` | Date values |
| `pxt.UUID` | `uuid.UUID` | Unique identifiers |
| `pxt.Array` | `np.ndarray` | Numerical arrays (embeddings, tensors) |
| `pxt.Json` | `dict`, `list`, `str`, `int`, `float`, `bool` | Flexible JSON data |
| `pxt.Image` | `PIL.Image.Image` | Image data |
| `pxt.Video` | `str` (file path) | Video files |
| `pxt.Audio` | `str` (file path) | Audio files |
| `pxt.Document` | `str` (file path) | Documents (PDFs, markdown, HTML, etc.) |

`pxt.Audio`, `pxt.Video`, and `pxt.Document` return file paths when queried. Pixeltable automatically downloads and caches remote media locally. Use `.fileurl` to get the original URL.
## Basic types ```python theme={null} import pixeltable as pxt table = pxt.create_table('example/basic_types', { 'text': pxt.String, # Text data 'count': pxt.Int, # Integer numbers 'score': pxt.Float, # Decimal numbers 'active': pxt.Bool, # Boolean values 'created': pxt.Timestamp # Date/time values }) ``` ### Auto-generated UUIDs Use `uuid7()` to create columns that auto-generate unique identifiers: ```python theme={null} from pixeltable.functions.uuid import uuid7 # UUID as primary key - auto-generated for each row products = pxt.create_table('example/products', { 'id': uuid7(), # Auto-generates UUID 'name': pxt.String, 'price': pxt.Float }, primary_key=['id']) # Insert without providing 'id' - it's generated automatically products.insert([{'name': 'Laptop', 'price': 999.99}]) ``` You can also add UUIDs to existing tables: ```python theme={null} # Add UUID column to existing table orders.add_computed_column(order_id=uuid7()) ``` By default, `stored=True` for all computed columns—values compute once and persist. For UUIDs, this ensures stable identifiers. Setting `stored=False` would regenerate UUIDs on every query (almost never what you want). See the [UUID cookbook](/howto/cookbooks/core/workflow-uuid-identity) for more examples of working with unique identifiers. ## Media types Pixeltable natively supports images, video, audio, and documents as first-class column types. 
```python theme={null} media = pxt.create_table('example/media', { 'image': pxt.Image, # Any image 'video': pxt.Video, # Video reference 'audio': pxt.Audio, # Audio file 'document': pxt.Document # PDF/text document }) ``` ### Image specialization Images can be constrained by resolution and/or color mode: ```python theme={null} # Constrain by resolution thumbnails = pxt.create_table('example/thumbnails', { 'thumb': pxt.Image[(224, 224)] # Width 224, height 224 }) # Constrain by color mode grayscale = pxt.create_table('example/grayscale', { 'img': pxt.Image['L'] # Grayscale (1-channel) }) # Constrain both rgb_fixed = pxt.create_table('example/rgb_fixed', { 'img': pxt.Image[(300, 200), 'RGB'] # 300x200 RGB images }) ``` See the [PIL Documentation](https://pillow.readthedocs.io/en/stable/handbook/concepts.html) for the full list of image modes (`'RGB'`, `'RGBA'`, `'L'`, etc.). ## Array types (embeddings & tensors) Arrays are used for embeddings, feature vectors, and tensor data. They must always specify a shape and dtype. ```python theme={null} ml_data = pxt.create_table('example/ml_features', { # Fixed-size embedding (e.g., from CLIP or OpenAI) 'embedding': pxt.Array[(768,), pxt.Float], # Variable first dimension (batch of 512-dim vectors) 'features': pxt.Array[(None, 512), pxt.Float], # 3D tensor with flexible dimensions 'tensor': pxt.Array[(None, None, 3), pxt.Float] }) ``` Array shapes follow NumPy conventions. Use `None` for unconstrained dimensions: * `(512,)` — fixed 512-element vector * `(None, 768)` — variable-length sequence of 768-dim vectors * `(64, 64, 3)` — fixed 64×64×3 tensor ### Working with arrays ```python theme={null} # Arrays can be sliced like NumPy arrays t.select( t.embedding[0], # First element t.embedding[5:10], # Slice t.embedding[-3:] # Last 3 elements ).collect() ``` ## JSON type The `Json` type stores flexible structured data—dictionaries, lists, or primitives. 
```python theme={null} logs = pxt.create_table('example/logs', { 'event': pxt.Json }) logs.insert([ {'event': {'type': 'click', 'x': 100, 'y': 200}}, {'event': {'type': 'scroll', 'delta': 50}}, {'event': ['tag1', 'tag2', 'tag3']} ]) ``` ### JSON path access Access nested data using dictionary or attribute syntax: ```python theme={null} # Dictionary syntax t.select(t.event['type']).collect() # Attribute syntax (JSONPath) t.select(t.event.type).collect() # List indexing t.select(t.event.tags[0]).collect() # Slicing t.select(t.event.tags[:2]).collect() ``` Pixeltable handles missing keys gracefully—you'll get `None` instead of an exception. ### JSON schema validation Validate JSON columns against a schema to ensure data integrity: ```python theme={null} # Define a JSON schema movie_schema = { 'type': 'object', 'properties': { 'title': {'type': 'string'}, 'year': {'type': 'integer'}, 'rating': {'type': 'number'} }, 'required': ['title', 'year'] } # Create table with validated JSON column movies = pxt.create_table('example/validated_movies', { 'data': pxt.Json[movie_schema] }) # Valid insert movies.insert(data={'title': 'Inception', 'year': 2010, 'rating': 8.8}) # Invalid insert raises error (missing required 'year') # movies.insert(data={'title': 'Movie'}) # Error! 
``` ### Using Pydantic models ```python theme={null} from pydantic import BaseModel class Movie(BaseModel): title: str year: int rating: float | None = None # Use the model's JSON schema for validation movies = pxt.create_table('example/pydantic_movies', { 'data': pxt.Json[Movie.model_json_schema()] }) ``` ## Type conversion Use `astype()` to convert string file paths or URLs to media types: ```python theme={null} # String file paths → Media types media = pxt.create_table('media_table', {'path': pxt.String}) media.insert([{'path': '/path/to/image.jpg'}]) # Convert string path to Image media.select(img=media.path.astype(pxt.Image)).collect() ``` **Primary use case:** Converting string columns containing file paths or URLs to media types (`Image`, `Video`, `Audio`, `Document`). For other type conversions, use built-in functions from the [`string`](/sdk/latest/string), [`json`](/sdk/latest/json), or [`math`](/sdk/latest/math) modules. For example, use `string.len()` to get string length as an integer, or access JSON fields directly. ## Column properties ### Media column properties Media columns (`Image`, `Video`, `Audio`, `Document`) have special properties: ```python theme={null} # Local file path (Pixeltable ensures this is on local filesystem) t.select(t.image.localpath).collect() # Original URL where the media resides t.select(t.image.fileurl).collect() ``` ### Error properties Computed columns have `errortype` and `errormsg` properties for debugging: ```python theme={null} # Create a computed column that might fail t.add_computed_column( result=some_function(t.input), on_error='ignore' # Continue on errors ) # Query error information for failed rows t.where(t.result == None).select( t.input, t.result.errortype, # Exception class name t.result.errormsg # Error message ).collect() ``` ## Best practices Prefer `pxt.Image[(224,224), 'RGB']` over `pxt.Image` when you know the constraints. This enables optimizations and catches errors early. 
Use JSON schema validation or Pydantic models for structured data to ensure consistency across your pipeline.

Always specify array shapes and dtypes. Use `None` for variable dimensions: `pxt.Array[(None, 768), pxt.Float]`.

Use `on_error='ignore'` in production pipelines, then query `.errortype` and `.errormsg` to debug failures.

## See also

* Creating and managing tables
* Transform data with computed columns
* Complete type reference

# UDFs in Pixeltable

Source: https://docs.pixeltable.com/platform/udfs-in-pixeltable

This documentation page is also available as an interactive notebook that you can launch in Kaggle or Colab, or download for use with an IDE or local Jupyter installation.

Pixeltable comes with a library of built-in functions and integrations, but sooner or later, you’ll want to introduce some customized logic into your workflow. This is where Pixeltable’s rich UDF (User-Defined Function) capability comes in. Pixeltable UDFs let you write code in Python, then directly insert your custom logic into Pixeltable expressions and computed columns. In this how-to guide, we’ll show how to define UDFs, extend their capabilities, and use them in computed columns.

To start, we’ll install the necessary dependencies, create a Pixeltable directory and table to experiment with, and add some sample data.

```python theme={null}
%pip install -qU pixeltable
```

```python theme={null}
import pixeltable as pxt

# Create the directory and table
pxt.drop_dir('udf_demo', force=True)  # Ensure a clean slate for the demo
pxt.create_dir('udf_demo')
t = pxt.create_table('udf_demo/strings', {'input': pxt.String})

# Add some sample data
t.insert(
    [
        {'input': 'Hello, world!'},
        {'input': 'You can do a lot with Pixeltable UDFs.'},
    ]
)
t.show()
```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory `udf_demo`.
  Created table `strings`.
  Inserting rows into `strings`: 2 rows [00:00, 763.99 rows/s]
  Inserted 2 rows with 0 errors.
## What is a UDF? A Pixeltable UDF is just a Python function that is marked with the `@pxt.udf` decorator. ```python theme={null} @pxt.udf def add_one(n: int) -> int: return n + 1 ``` It’s as simple as that! Without the decorator, `add_one` would be an ordinary Python function that operates on integers. Adding `@pxt.udf` converts it into a Pixeltable function that operates on *columns* of integers. The decorated function can then be used directly to define computed columns; Pixeltable will orchestrate its execution across all the input data. For our first working example, let’s do something slightly more interesting: write a function to extract the longest word from a sentence. (If there are ties for the longest word, we choose the first word among those ties.) In Python, that might look something like this: ```python theme={null} import numpy as np def longest_word(sentence: str, strip_punctuation: bool = False) -> str: words = sentence.split() if ( strip_punctuation ): # Remove non-alphanumeric characters from each word words = [''.join(filter(str.isalnum, word)) for word in words] i = np.argmax([len(word) for word in words]) return words[i] ``` ```python theme={null} longest_word("Let's check that it works.", strip_punctuation=True) ```
  'check'
The `longest_word` Python function isn’t a Pixeltable UDF (yet); it operates on individual strings, not columns of strings. Adding the decorator turns it into a UDF: ```python theme={null} @pxt.udf def longest_word(sentence: str, strip_punctuation: bool = False) -> str: words = sentence.split() if ( strip_punctuation ): # Remove non-alphanumeric characters from each word words = [''.join(filter(str.isalnum, word)) for word in words] i = np.argmax([len(word) for word in words]) return words[i] ``` Now we can use it to create a computed column. Pixeltable orchestrates the computation like it does with any other function, applying the UDF in turn to each existing row of the table, then updating incrementally each time a new row is added. ```python theme={null} t.add_computed_column(longest_word=longest_word(t.input)) t.show() ```
  Computing cells: 100%|███████████████████████████████████████████| 2/2 [00:00<00:00, 138.28 cells/s]
  Added 2 column values with 0 errors.
```python theme={null} t.insert([{'input': 'Pixeltable updates tables incrementally.'}]) t.show() ```
  Computing cells:   0%|                                                    | 0/3 [00:00]
Oops, those trailing punctuation marks are kind of annoying. Let’s add another column, this time using the handy `strip_punctuation` parameter from our UDF. (We could alternatively drop the first column before adding the new one, but for purposes of this tutorial it’s convenient to see how Pixeltable executes both variants side-by-side.) Note how *columns* such as `t.input` and *constants* such as `True` can be freely intermixed as arguments to the UDF. ```python theme={null} t.add_computed_column( longest_word_2=longest_word(t.input, strip_punctuation=True) ) t.show() ```
  Computing cells: 100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 252.91 cells/s]
  Added 3 column values with 0 errors.
## Types in UDFs You might have noticed that the `longest_word` UDF has *type hints* in its signature. ```python theme={null} def longest_word(sentence: str, strip_punctuation: bool = False) -> str: ... ``` The `sentence` parameter, `strip_punctuation` parameter, and return value all have explicit types (`str`, `bool`, and `str` respectively). In general Python code, type hints are usually optional. But Pixeltable is a database system: *everything* in Pixeltable must have a type. And since Pixeltable is also an orchestrator - meaning it sets up workflows and computed columns *before* executing them - these types need to be known in advance. That’s the reasoning behind a fundamental principle of Pixeltable UDFs: * Type hints are *required*. You can turn almost any Python function into a Pixeltable UDF, provided that it has type hints, and provided that Pixeltable supports the types that it uses. The most familiar types that you’ll use in UDFs are: * `int` * `float` * `str` * `list` (can optionally be parameterized, e.g., `list[str]`) * `dict` (can optionally be parameterized, e.g., `dict[str, int]`) * `PIL.Image.Image` In addition to these standard Python types, Pixeltable also recognizes various kinds of arrays, audio and video media, and documents. ## Local and module UDFs The `longest_word` UDF that we defined above is a *local* UDF: it was defined directly in our notebook, rather than in a module that we imported. Many other UDFs, including all of Pixeltable’s built-in functions, are defined in modules. We encountered a few of these in the 10-Minute Tour tutorial: the `huggingface.detr_for_object_detection` and `openai.vision` functions. (Although these are built-in functions, they behave the same way as UDFs, and in fact they’re defined the same way under the covers.) There is an important difference between the two. When you add a module UDF such as `openai.vision` to a table, Pixeltable stores a *reference* to the corresponding Python function in the module. 
If you later restart your Python runtime and reload Pixeltable, then Pixeltable will re-import the module UDF when it loads the computed column. This means that any code changes made to the UDF will be picked up at that time, and the new version of the UDF will be used in any future execution. Conversely, when you add a local UDF to a table, the *entire code* for the UDF is serialized and stored in the table. This ensures that if you restart your notebook kernel (say), or even delete the notebook entirely, the UDF will continue to function. However, it also means that if you modify the UDF code, the updated logic will not be reflected in any existing Pixeltable columns. To see how this works in practice, let’s modify our `longest_word` UDF so that if `strip_punctuation` is `True`, then we remove only a single punctuation mark from the *end* of each word. ```python theme={null} @pxt.udf def longest_word(sentence: str, strip_punctuation: bool = False) -> str: words = sentence.split() if strip_punctuation: words = [ word if word[-1].isalnum() else word[:-1] for word in words ] i = np.argmax([len(word) for word in words]) return words[i] ``` Now we see that Pixeltable continues to use the *old* definition, even as new rows are added to the table. ```python theme={null} t.insert([{'input': "Let's check that it still works."}]) t.show() ```
  Computing cells:   0%|                                                    | 0/5 [00:00]
But if we add a new *column* that references the `longest_word` UDF, Pixeltable will use the updated version. ```python theme={null} t.add_computed_column( longest_word_3=longest_word(t.input, strip_punctuation=True) ) t.show() ```
  Computing cells: 100%|███████████████████████████████████████████| 4/4 [00:00<00:00, 348.89 cells/s]
  Added 4 column values with 0 errors.
The general rule is: changes to module UDFs will affect any future execution; changes to local UDFs will only affect *new columns* that are defined using the new version of the UDF. ## Batching Pixeltable provides several ways to optimize UDFs for better performance. One of the most common is *batching*, which is particularly important for UDFs that involve GPU operations. Ordinary UDFs process one row at a time, meaning the UDF will be invoked exactly once per row processed. Conversely, a batched UDF processes several rows at a time; the specific number is user-configurable. As an example, let’s modify our `longest_word` UDF to take a batched parameter. Here’s what it looks like: ```python theme={null} from pixeltable.func import Batch @pxt.udf(batch_size=16) def longest_word( sentences: Batch[str], strip_punctuation: bool = False ) -> Batch[str]: results = [] for sentence in sentences: words = sentence.split() if strip_punctuation: words = [ word if word[-1].isalnum() else word[:-1] for word in words ] i = np.argmax([len(word) for word in words]) results.append(words[i]) return results ``` There are several changes: * The parameter `batch_size=16` has been added to the `@pxt.udf` decorator, specifying the batch size; * The `sentences` parameter has changed from `str` to `Batch[str]`; * The return type has also changed from `str` to `Batch[str]`; and * Instead of processing a single sentence, the UDF is processing a `Batch` of sentences and returning the result `Batch`. What exactly is a `Batch[str]`? Functionally, it’s simply a `list[str]`, and you can use it exactly like a `list[str]` in any Python code. The only difference is in the type hint; a type hint of `Batch[str]` tells Pixeltable, “My data consists of individual strings that I want you to process in batches”. Conversely, a type hint of `list[str]` would mean, “My data consists of *lists* of strings that I want you to process one at a time”. 
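To see why batching matters, consider a hypothetical cost model in which every UDF invocation pays a fixed dispatch overhead on top of the per-row work. All numbers below are made up for illustration:

```python theme={null}
import math

num_rows = 10_000
batch_size = 16

# Hypothetical costs, in milliseconds: a fixed per-invocation overhead
# (e.g., model dispatch) plus a per-row compute cost
per_call_overhead_ms = 5
per_row_cost_ms = 1

# Unbatched: one invocation per row, so the overhead is paid 10,000 times
unbatched_ms = num_rows * (per_call_overhead_ms + per_row_cost_ms)

# Batched: one invocation per 16-row batch, so the overhead is paid 625 times
num_batches = math.ceil(num_rows / batch_size)
batched_ms = num_batches * per_call_overhead_ms + num_rows * per_row_cost_ms

print(unbatched_ms, batched_ms)  # 60000 13125
```

The per-row work is unchanged; batching only amortizes the fixed overhead, which is why it matters most for GPU-bound UDFs with expensive dispatch.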
Notice that the `strip_punctuation` parameter is *not* wrapped in a `Batch` type. This is because `strip_punctuation` controls the behavior of the UDF, rather than being part of the input data. When we use the batched `longest_word` UDF, the `strip_punctuation` parameter will always be a constant, not a column.

Let’s put the new, batched UDF to work.

```python theme={null}
t.add_computed_column(
    longest_word_3_batched=longest_word(t.input, strip_punctuation=True)
)
t.show()
```
  Computing cells: 100%|███████████████████████████████████████████| 4/4 [00:00<00:00, 353.90 cells/s]
  Added 4 column values with 0 errors.
As expected, the output of the `longest_word_3_batched` column is identical to the `longest_word_3` column. Under the covers, though, Pixeltable is orchestrating execution in batches of 16. That probably won’t have much performance impact on our toy example, but for GPU-bound computations such as text or image embeddings, it can make a substantial difference. ## UDAs (aggregate UDFs) Ordinary UDFs are always one-to-one on rows: each row of input generates one UDF output value. Functions that aggregate data, conversely, are many-to-one, and in Pixeltable they are represented by a related abstraction, the UDA (User-Defined Aggregate). Pixeltable has a number of built-in UDAs; if you’ve worked through the Fundamentals tutorial, you’ll have already encountered a few of them, such as `sum` and `count`. In this section, we’ll show how to define your own custom UDAs. For demonstration purposes, let’s start by creating a table containing all the integers from 0 to 49. ```python theme={null} import pixeltable as pxt t = pxt.create_table('udf_demo/values', {'val': pxt.Int}) t.insert({'val': n} for n in range(50)) ```
  Created table `values`.
  Inserting rows into `values`: 50 rows [00:00, 9267.95 rows/s]
  Inserted 50 rows with 0 errors.
  UpdateStatus(num_rows=50, num_computed_values=0, num_excs=0, updated_cols=[], cols_with_excs=[])
If we wanted to compute their sum using the built-in `sum` aggregate, we’d do it like this: ```python theme={null} import pixeltable.functions as pxtf t.select(pxtf.sum(t.val)).collect() ```
Or perhaps we want to group them by `n // 10` (corresponding to the tens digit of each integer) and sum each group: ```python theme={null} t.group_by(t.val // 10).order_by(t.val // 10).select( t.val // 10, pxtf.sum(t.val) ).collect() ```
Now let’s define a new aggregate to compute the sum of squares of a set of numbers. To define an aggregate, we implement a subclass of the `pxt.Aggregator` Python class and decorate it with the `@pxt.uda` decorator, similar to what we did for UDFs. The subclass must implement three methods: * `__init__()` - initializes the aggregator; can be used to parameterize aggregator behavior * `update()` - updates the internal state of the aggregator with a new value * `value()` - retrieves the current value held by the aggregator In our example, the class will have a single member `cur_sum`, which holds a running total of the squares of all the values we’ve seen. ```python theme={null} @pxt.uda class sum_of_squares(pxt.Aggregator): def __init__(self): # No data yet; initialize `cur_sum` to 0 self.cur_sum = 0 def update(self, val: int) -> None: # Update the value of `cur_sum` with the new datapoint self.cur_sum += val * val def value(self) -> int: # Retrieve the current value of `cur_sum` return self.cur_sum ``` ```python theme={null} t.select(sum_of_squares(t.val)).collect() ```
```python theme={null} t.group_by(t.val // 10).order_by(t.val // 10).select( t.val // 10, sum_of_squares(t.val) ).collect() ```
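Because `__init__()` takes ordinary arguments, aggregators are easy to parameterize. Below is a sketch of a hypothetical `sum_of_powers` aggregator, written as a plain Python class so the logic is runnable on its own; in Pixeltable it would subclass `pxt.Aggregator` and carry the `@pxt.uda` decorator, exactly like `sum_of_squares` above.

```python theme={null}
# Sketch only: the name `sum_of_powers` and its exponent parameter are
# hypothetical, not part of Pixeltable's built-in library.
class sum_of_powers:
    def __init__(self, p: int = 2):
        # Aggregator behavior is parameterized via __init__ arguments
        self.p = p
        self.cur_sum = 0

    def update(self, val: int) -> None:
        # Fold each new datapoint into the running total
        self.cur_sum += val ** self.p

    def value(self) -> int:
        # Report the current aggregate
        return self.cur_sum

agg = sum_of_powers(p=3)
for v in [1, 2, 3]:
    agg.update(v)
print(agg.value())  # 1 + 8 + 27 = 36
```

With the decorator restored, it would be used in queries just like `sum_of_squares` above.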
# Version Control and Lineage Source: https://docs.pixeltable.com/platform/version-control Automatic versioning, time travel queries, and full data lineage tracking Pixeltable automatically tracks every change to your tables—data insertions, updates, deletions, and schema modifications. Query any point in history, undo mistakes, and maintain full reproducibility without manual version management. ## How it works Every operation that modifies a table creates a new version: ```python theme={null} import pixeltable as pxt # Version 0: Table created products = pxt.create_table('demo/products', { 'name': pxt.String, 'price': pxt.Float }) # Version 1: Data inserted products.insert([ {'name': 'Widget', 'price': 9.99}, {'name': 'Gadget', 'price': 24.99} ]) # Version 2: Schema changed products.add_computed_column(price_with_tax=products.price * 1.08) # Version 3: Data updated products.update({'price': 19.99}, where=products.name == 'Widget') ``` No configuration required—versioning is always on. 
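The version bookkeeping above can be pictured with a toy model (purely illustrative, not Pixeltable's storage format): each mutating operation appends one entry to an ordered history, and the version number is just the entry's position.

```python theme={null}
# Toy model of version numbering; real Pixeltable metadata is richer
history = []

def record(change_type: str) -> int:
    """Append a version entry and return the new version number."""
    version = len(history)
    history.append({'version': version, 'change_type': change_type})
    return version

record('create')         # version 0: table created
record('insert')         # version 1: data inserted
record('schema_change')  # version 2: computed column added
record('update')         # version 3: data updated

print(history[-1]['version'])  # 3
```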
## Viewing history ### Human-readable history ```python theme={null} products.history() ``` Returns a DataFrame showing all versions with timestamps, change types, and row counts: | version | created\_at | change\_type | inserts | updates | deletes | schema\_change | | ------- | ------------------- | ------------ | ------- | ------- | ------- | ----------------------- | | 3 | 2025-01-15 10:30:00 | data | 0 | 1 | 0 | None | | 2 | 2025-01-15 10:29:00 | schema | 0 | 2 | 0 | Added: price\_with\_tax | | 1 | 2025-01-15 10:28:00 | data | 2 | 0 | 0 | None | | 0 | 2025-01-15 10:27:00 | schema | 0 | 0 | 0 | Initial Version | ### Programmatic access ```python theme={null} versions = products.get_versions() # List of dictionaries latest = versions[0] print(f"Version {latest['version']}: {latest['inserts']} inserts") ``` ## Time travel queries Query any historical version using the `table_name:version` syntax: ```python theme={null} # Get the table at version 1 (before computed column) products_v1 = pxt.get_table('demo/products:1') products_v1.collect() # Returns data as it was at version 1 # Compare with current state products.collect() # Returns current data ``` Version handles are **read-only**—you cannot modify historical data. ### Use cases * **Debugging**: Compare data before and after a problematic update * **Auditing**: Track who changed what and when * **Recovery**: Find and extract accidentally deleted or modified data * **Reproducibility**: Query exact data used for a specific model training run ## Reverting changes Undo the most recent change with `revert()`: ```python theme={null} # Oops, wrong update products.update({'price': 0.00}, where=products.name == 'Widget') # Undo it products.revert() # Removes version N, table is now at version N-1 ``` `revert()` permanently removes the latest version. This cannot be undone. You can call `revert()` multiple times to go back further, but cannot revert past version 0 or past a version referenced by a snapshot. 
## Snapshots Create named, persistent point-in-time copies for long-term preservation: ```python theme={null} # Freeze current state before a major data update baseline = pxt.create_snapshot('demo/products_baseline', products) # Later: source table changes, but snapshot remains unchanged products.insert([{'name': 'NewItem', 'price': 99.99}]) products.count() # 3 rows (updated) baseline.count() # 2 rows (frozen) ``` **Snapshots vs Time Travel:** * Time travel (`pxt.get_table('table:N')`) queries historical versions in place * Snapshots create a named, independent copy that persists even if the source table is modified or deleted ## Data lineage Pixeltable tracks the complete lineage of your data: ### Schema lineage Every computed column records its dependencies: ```python theme={null} products.add_computed_column( discounted=products.price * 0.9 ) products.add_computed_column( discounted_with_tax=products.discounted * 1.08 ) # Pixeltable knows: discounted_with_tax → discounted → price ``` ### View lineage Views automatically track their source tables: ```python theme={null} expensive = pxt.create_view( 'demo/expensive_products', products.where(products.price > 20) ) # View lineage: expensive_products → products ``` ### What's tracked | Change Type | Tracked Information | | ----------------------- | --------------------------------------------------------- | | `insert()` | Row count, timestamp, computed values generated | | `update()` | Rows affected, old vs new values (via version comparison) | | `delete()` | Row count removed | | `add_column()` | Column name, type, dependencies | | `add_computed_column()` | Column name, expression, dependencies | | `drop_column()` | Column removed | | `rename_column()` | Old name → new name | ## Best practices * Create snapshots before major data loads, model training runs, or production deployments. * Log table version numbers alongside model artifacts: `products.get_versions()[0]['version']`. * Use `revert()` immediately after mistakes; for older issues, use time travel to identify the problem. * Use directories like `dev/products`, `staging/products` to isolate versioning across environments. ## Comparison with other systems | Feature | Pixeltable | Git | Delta Lake | | ----------------------- | --------------------- | --------------- | ------------------ | | Automatic versioning | ✅ Every operation | Manual commits | ✅ Every operation | | Time travel queries | ✅ `table:N` syntax | Checkout commit | ✅ `VERSION AS OF` | | Schema versioning | ✅ Tracked | File-based | ✅ Schema evolution | | Computed column lineage | ✅ Automatic | N/A | N/A | | Revert | ✅ `revert()` | `git revert` | `RESTORE` | | Named snapshots | ✅ `create_snapshot()` | Tags/branches | N/A | ## Next steps * Step-by-step cookbook with runnable examples * Publish and replicate tables across environments # Views Source: https://docs.pixeltable.com/platform/views Learn how to create and use virtual derived tables in Pixeltable through views # When to Use Views Views in Pixeltable are best used when you need to: 1. **Transform Data**: When you need to process or reshape data from a base table (e.g., splitting documents into chunks, extracting features from images) 2. **Filter Data**: When you frequently need to work with a specific subset of your data 3. **Create Virtual Tables**: When you want to avoid storing redundant data and automatically keep derived data in sync 4. **Build Data Workflows**: When you need to chain multiple data transformations together 5. **Save Storage**: When you want to compute data on demand rather than storing it permanently Choose views over tables when your data is derived from other base tables and needs to stay synchronized with its source. Use regular tables when you need to store original data or when the computation cost of deriving data on demand is too high.
## Phase 1: Define your base table and view structure ```python theme={null} import pixeltable as pxt from pixeltable.functions.document import document_splitter # Create a directory to organize data (optional) pxt.drop_dir('documents', force=True) pxt.create_dir('documents') # Define your base table first documents = pxt.create_table( "documents/collection", {"document": pxt.Document} ) # Create a view that splits documents into chunks chunks = pxt.create_view( 'documents/chunks', documents, iterator=document_splitter( document=documents.document, separators='token_limit', limit=300 ) ) ``` ## Phase 2: Use your application ```python theme={null} import pixeltable as pxt # Connect to your base table and view documents = pxt.get_table("documents/collection") chunks = pxt.get_table("documents/chunks") # Insert data into base table - view updates automatically documents.insert([{ "document": "path/to/document.pdf" }]) # Query the view print(chunks.collect()) ``` ## View types Views created using iterators to transform data: ```python theme={null} # Document splitting view chunks = pxt.create_view( 'docs/chunks', documents, iterator=document_splitter( document=documents.document ) ) ``` Views created from query operations: ```python theme={null} # Filtered view of high-budget movies blockbusters = pxt.create_view( 'movies/blockbusters', movies.where(movies.budget >= 100.0) ) ``` ## View operations Query views like regular tables: ```python theme={null} # Basic filtering on view chunks.where(chunks.text.contains('specific topic')).collect() # Select specific columns chunks.select(chunks.text, chunks.pos).collect() # Order results chunks.order_by(chunks.pos).limit(5).collect() ``` Add computed columns to views: ```python theme={null} # Add embeddings to chunks chunks.add_computed_column( embedding=sentence_transformer.using( model_id='intfloat/e5-large-v2' )(chunks.text) ) ``` Create views based on other views: ```python theme={null} # Create a view of embedded chunks 
embedded_chunks = pxt.create_view( 'docs/embedded_chunks', chunks.where(chunks.text.len() > 100) ) ``` ## Key features * Views automatically update when base tables change * Views compute data on demand, saving storage * Views can be part of larger data workflows # anthropic Source: https://docs.pixeltable.com/sdk/latest/anthropic View Source on GitHub # module  pixeltable.functions.anthropic Pixeltable UDFs that wrap various endpoints from the Anthropic API. In order to use them, you must first `pip install anthropic` and configure your Anthropic credentials, as described in the [Working with Anthropic](https://docs.pixeltable.com/notebooks/integrations/working-with-anthropic) tutorial. ## func  invoke\_tools() ```python Signature theme={null} invoke_tools( tools: pixeltable.func.tools.Tools, response: pixeltable.exprs.expr.Expr ) -> pixeltable.exprs.inline_expr.InlineDict ``` Converts an Anthropic response dict to Pixeltable tool invocation format and calls `tools._invoke()`. ## udf  messages() ```python Signature theme={null} @pxt.udf messages( messages: pxt.Json, *, model: pxt.String, max_tokens: pxt.Int, model_kwargs: pxt.Json | None = None, tools: pxt.Json | None = None, tool_choice: pxt.Json | None = None ) -> pxt.Json ``` Create a Message. Equivalent to the Anthropic `messages` API endpoint. For additional details, see: [https://docs.anthropic.com/en/api/messages](https://docs.anthropic.com/en/api/messages) Request throttling: Uses the rate limit-related headers returned by the API to throttle requests adaptively, based on available request and token capacity. No configuration is necessary. **Requirements:** * `pip install anthropic` **Parameters:** * **`messages`** (`pxt.Json`): Input messages. * **`model`** (`pxt.String`): The model that will complete your prompt. * **`max_tokens`** (`pxt.Int`): The maximum number of tokens to generate before stopping. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the Anthropic `messages` API.
For details on the available parameters, see: [https://docs.anthropic.com/en/api/messages](https://docs.anthropic.com/en/api/messages) * **`tools`** (`pxt.Json | None`): An optional list of Pixeltable tools to use for the request. * **`tool_choice`** (`pxt.Json | None`): An optional tool choice configuration. **Returns:** * `pxt.Json`: A dictionary containing the response and other metadata. **Examples:** Add a computed column that applies the model `claude-3-5-sonnet-20241022` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} msgs = [{'role': 'user', 'content': tbl.prompt}] tbl.add_computed_column( response=messages(msgs, model='claude-3-5-sonnet-20241022', max_tokens=1024) ) ``` # audio Source: https://docs.pixeltable.com/sdk/latest/audio View Source on GitHub # module  pixeltable.functions.audio Pixeltable UDFs for `AudioType`. ## iterator  audio\_splitter() ```python Signature theme={null} @pxt.iterator audio_splitter( audio: pxt.Audio, duration: pxt.Float, *, overlap: pxt.Float = 0.0, min_segment_duration: pxt.Float = 0.0 ) ``` Iterator over segments of an audio file. The audio file is split into smaller segments, where the duration of each segment is determined by `duration`. If the input contains no audio, no segments are yielded. **Outputs**: One row per audio segment, with the following columns: * `segment_start` (`pxt.Float`): Start time of the audio segment in seconds * `segment_end` (`pxt.Float`): End time of the audio segment in seconds * `audio_segment` (`pxt.Audio | None`): The audio content of the segment **Parameters:** * **`duration`** (`pxt.Float`): Audio segment duration in seconds * **`overlap`** (`pxt.Float`): Overlap between consecutive segments in seconds * **`min_segment_duration`** (`pxt.Float`): Drop the last segment if it is smaller than `min_segment_duration` **Examples:** This example assumes an existing table `tbl` with a column `audio` of type `pxt.Audio`.
Create a view that splits all audio files into segments of 30 seconds with 5 seconds overlap: ```python theme={null} pxt.create_view( 'audio_segments', tbl, iterator=audio_splitter(tbl.audio, duration=30.0, overlap=5.0), ) ``` ## udf  encode\_audio() ```python Signature theme={null} @pxt.udf encode_audio( audio_data: pxt.Array[float32], *, input_sample_rate: pxt.Int, format: pxt.String, output_sample_rate: pxt.Int | None = None ) -> pxt.Audio ``` Encodes an audio clip represented as an array into a specified audio format. **Parameters:** * **`audio_data`** (`pxt.Array[float32]`): An array of sampled amplitudes. The accepted array shapes are `(N,)` or `(1, N)` for mono audio or `(2, N)` for stereo. * **`input_sample_rate`** (`pxt.Int`): The sample rate of the input audio data. * **`format`** (`pxt.String`): The desired output audio format. The supported formats are 'wav', 'mp3', 'flac', and 'mp4'. * **`output_sample_rate`** (`pxt.Int | None`): The desired sample rate for the output audio. Defaults to the input sample rate if unspecified. **Examples:** Add a computed column with encoded FLAC audio files to a table with audio data (as arrays of floats) and sample rates: ```python theme={null} t.add_computed_column( audio_file=encode_audio( t.audio_data, input_sample_rate=t.sample_rate, format='flac' ) ) ``` ## udf  get\_metadata() ```python Signature theme={null} @pxt.udf get_metadata(audio: pxt.Audio) -> pxt.Json ``` Gets various metadata associated with an audio file and returns it as a dictionary. **Parameters:** * **`audio`** (`pxt.Audio`): The audio to get metadata for. 
**Returns:** * `pxt.Json`: A `dict` such as the following: ```python theme={null} { 'size': 2568827, 'streams': [ { 'type': 'audio', 'frames': 0, 'duration': 2646000, 'metadata': {}, 'time_base': 2.2675736961451248e-05, 'codec_context': { 'name': 'flac', 'profile': None, 'channels': 1, 'codec_tag': '\x00\x00\x00\x00', }, 'duration_seconds': 60.0, } ], 'bit_rate': 342510, 'metadata': {'encoder': 'Lavf61.1.100'}, 'bit_exact': False, } ``` **Examples:** Extract metadata for files in the `audio_col` column of the table `tbl`: ```python theme={null} tbl.select(tbl.audio_col.get_metadata()).collect() ``` # bedrock Source: https://docs.pixeltable.com/sdk/latest/bedrock View Source on GitHub # module  pixeltable.functions.bedrock Pixeltable UDFs for AWS Bedrock AI models. Provides integration with AWS Bedrock for accessing various foundation models including Anthropic Claude, Amazon Titan, and other providers. ## func  invoke\_tools() ```python Signature theme={null} invoke_tools( tools: pixeltable.func.tools.Tools, response: pixeltable.exprs.expr.Expr ) -> pixeltable.exprs.inline_expr.InlineDict ``` Converts an Anthropic response dict to Pixeltable tool invocation format and calls `tools._invoke()`. ## udf  converse() ```python Signature theme={null} @pxt.udf converse( messages: pxt.Json, *, model_id: pxt.String, system: pxt.Json | None = None, inference_config: pxt.Json | None = None, additional_model_request_fields: pxt.Json | None = None, tool_config: pxt.Json | None = None ) -> pxt.Json ``` Generate a conversation response. Equivalent to the AWS Bedrock `converse` API endpoint. For additional details, see: [https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-runtime/client/converse.html](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-runtime/client/converse.html) **Requirements:** * `pip install boto3` **Parameters:** * **`messages`** (`pxt.Json`): Input messages.
* **`model_id`** (`pxt.String`): The model that will complete your prompt. * **`system`** (`pxt.Json | None`): An optional system prompt. * **`inference_config`** (`pxt.Json | None`): Base inference parameters to use. * **`additional_model_request_fields`** (`pxt.Json | None`): Additional inference parameters to use. * **`tool_config`** (`pxt.Json | None`): An optional tool configuration to use for the request. **Returns:** * `pxt.Json`: A dictionary containing the response and other metadata. **Examples:** Add a computed column that applies the model `anthropic.claude-3-haiku-20240307-v1:0` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} msgs = [{'role': 'user', 'content': [{'text': tbl.prompt}]}] tbl.add_computed_column( response=converse( msgs, model_id='anthropic.claude-3-haiku-20240307-v1:0' ) ) ``` # ColumnMetadata Source: https://docs.pixeltable.com/sdk/latest/columnmetadata View Source on GitHub # class  pixeltable.ColumnMetadata Metadata for a column of a Pixeltable table. ## attr  comment ``` comment: str ``` User-provided column comment. ## attr  computed\_with ``` computed_with: str | None ``` Expression used to compute this column; `None` if this is not a computed column. ## attr  custom\_metadata ``` custom_metadata: Any ``` User-defined JSON metadata for this column, if any. ## attr  defined\_in ``` defined_in: str | None ``` Name of the table where this column was originally defined. If the current table is a view, then `defined_in` may differ from the current table name. ## attr  is\_primary\_key ``` is_primary_key: bool ``` `True` if this column is part of the table's primary key. ## attr  is\_stored ``` is_stored: bool ``` `True` if this is a stored column; `False` if it is dynamically computed. ## attr  media\_validation ``` media_validation: Literal['on_read', 'on_write'] | None ``` The media validation policy for this column. ## attr  name ``` name: str ``` The name of the column. ## attr  type\_ ``` type_: str ``` The type specifier of the column.
## attr  version\_added ``` version_added: int ``` The table version when this column was added. # ColumnSpec Source: https://docs.pixeltable.com/sdk/latest/columnspec View Source on GitHub # class  pixeltable.types.ColumnSpec Column specification, a dictionary representation of a column's schema. Exactly one of `type` or `value` must be included in the dictionary. ## attr  comment ``` comment: str ``` Optional comment for the column. Displayed in .describe() output. ## attr  custom\_metadata ``` custom_metadata: Any ``` User-defined metadata to associate with the column. ## attr  destination ``` destination: str | Path ``` Destination for storing computed output files. Only applicable for computed columns. Can be: * A local pathname (such as `path/to/outputs/`), or * The URI of an object store (such as `s3://my-bucket/outputs/`). ## attr  media\_validation ``` media_validation: Literal['on_read', 'on_write'] ``` When to validate media; `'on_read'` or `'on_write'`. ## attr  primary\_key ``` primary_key: bool ``` Whether this column is part of the primary key. Defaults to `False`. ## attr  stored ``` stored: bool ``` Whether to store the column data. Defaults to `True`. ## attr  type ``` type: type ``` The column type (e.g., `pxt.Image`, `str`). Required unless `value` is specified. ## attr  value ``` value: exprs.Expr ``` A Pixeltable expression for computed columns. Mutually exclusive with `type`. # date Source: https://docs.pixeltable.com/sdk/latest/date View Source on GitHub # module  pixeltable.functions.date Pixeltable UDFs for `DateType`. Usage example: ```python theme={null} import pixeltable as pxt t = pxt.get_table(...) t.select(t.date_col.year, t.date_col.weekday()).collect() ``` ## udf  add\_days() ```python Signature theme={null} @pxt.udf add_days(self: pxt.Date, n: pxt.Int) -> pxt.Date ``` Add `n` days to the date. Equivalent to [`date + timedelta(days=n)`](https://docs.python.org/3/library/datetime.html#datetime.timedelta). 
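Since `add_days` is defined as `date + timedelta(days=n)`, you can preview its behavior with ordinary Python before applying it to a column — this is plain `datetime` arithmetic, not a Pixeltable expression:

```python
from datetime import date, timedelta

# Plain-Python preview of what add_days computes
d = date(2025, 1, 15)
print(d + timedelta(days=30))   # 2025-02-14
print(d + timedelta(days=-7))   # negative n steps backwards: 2025-01-08
```

In a Pixeltable query, the same computation would be written against a date column, e.g. `t.select(t.date_col.add_days(30)).collect()`.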
## udf  day() ```python Signature theme={null} @pxt.udf day(self: pxt.Date) -> pxt.Int ``` Between 1 and the number of days in the given month of the given year. Equivalent to [`date.day`](https://docs.python.org/3/library/datetime.html#datetime.date.day). ## udf  isocalendar() ```python Signature theme={null} @pxt.udf isocalendar(self: pxt.Date) -> pxt.Json ``` Return a dictionary with three entries: `'year'`, `'week'`, and `'weekday'`. Equivalent to [`date.isocalendar()`](https://docs.python.org/3/library/datetime.html#datetime.date.isocalendar). ## udf  isoformat() ```python Signature theme={null} @pxt.udf isoformat( self: pxt.Date, sep: pxt.String = 'T', timespec: pxt.String = 'auto' ) -> pxt.String ``` Return a string representing the date and time in ISO 8601 format. Equivalent to [`date.isoformat()`](https://docs.python.org/3/library/datetime.html#datetime.date.isoformat). **Parameters:** * **`sep`** (`pxt.String`): Separator between date and time. * **`timespec`** (`pxt.String`): The number of additional terms in the output. See the [`date.isoformat()`](https://docs.python.org/3/library/datetime.html#datetime.date.isoformat) documentation for more details. ## udf  isoweekday() ```python Signature theme={null} @pxt.udf isoweekday(self: pxt.Date) -> pxt.Int ``` Return the day of the week as an integer, where Monday is 1 and Sunday is 7. Equivalent to [`date.isoweekday()`](https://docs.python.org/3/library/datetime.html#datetime.date.isoweekday). ## udf  make\_date() ```python Signature theme={null} @pxt.udf make_date(year: pxt.Int, month: pxt.Int, day: pxt.Int) -> pxt.Date ``` Create a date. Equivalent to [`date(year, month, day)`](https://docs.python.org/3/library/datetime.html#datetime.date). ## udf  month() ```python Signature theme={null} @pxt.udf month(self: pxt.Date) -> pxt.Int ``` Between 1 and 12 inclusive. Equivalent to [`date.month`](https://docs.python.org/3/library/datetime.html#datetime.date.month).
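These accessors mirror Python's `datetime.date` semantics exactly, so the ISO numbering conventions can be checked with ordinary Python (again, plain `datetime` code rather than a Pixeltable expression):

```python
from datetime import date

d = date(2025, 1, 15)          # a Wednesday
print(d.weekday())             # 2 (Monday is 0)
print(d.isoweekday())          # 3 (Monday is 1)
print(tuple(d.isocalendar()))  # (2025, 3, 3): ISO year, week, weekday
```

Note that mid-January can fall in ISO week 3 because ISO week 1 is the week containing the year's first Thursday.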
## udf  strftime() ```python Signature theme={null} @pxt.udf strftime(self: pxt.Date, format: pxt.String) -> pxt.String ``` Return a string representing the date and time, controlled by an explicit format string. Equivalent to [`date.strftime()`](https://docs.python.org/3/library/datetime.html#datetime.date.strftime). **Parameters:** * **`format`** (`pxt.String`): The format string to control the output. For a complete list of formatting directives, see [`strftime()` and `strptime()` Behavior](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior). ## udf  toordinal() ```python Signature theme={null} @pxt.udf toordinal(self: pxt.Date) -> pxt.Int ``` Return the proleptic Gregorian ordinal of the date, where January 1 of year 1 has ordinal 1. Equivalent to [`date.toordinal()`](https://docs.python.org/3/library/datetime.html#datetime.date.toordinal). ## udf  weekday() ```python Signature theme={null} @pxt.udf weekday(self: pxt.Date) -> pxt.Int ``` Between 0 (Monday) and 6 (Sunday) inclusive. Equivalent to [`date.weekday()`](https://docs.python.org/3/library/datetime.html#datetime.date.weekday). ## udf  year() ```python Signature theme={null} @pxt.udf year(self: pxt.Date) -> pxt.Int ``` Between 1 and 9999 inclusive. (Between [`MINYEAR`](https://docs.python.org/3/library/datetime.html#datetime.MINYEAR) and [`MAXYEAR`](https://docs.python.org/3/library/datetime.html#datetime.MAXYEAR) as defined by the Python `datetime` library). Equivalent to [`date.year`](https://docs.python.org/3/library/datetime.html#datetime.date.year). # deepseek Source: https://docs.pixeltable.com/sdk/latest/deepseek View Source on GitHub # module  pixeltable.functions.deepseek Pixeltable UDFs for Deepseek AI models. Provides integration with Deepseek's language models for chat completions and other AI capabilities. 
## udf  chat\_completions() ```python Signature theme={null} @pxt.udf chat_completions( messages: pxt.Json, *, model: pxt.String, model_kwargs: pxt.Json | None = None, tools: pxt.Json | None = None, tool_choice: pxt.Json | None = None ) -> pxt.Json ``` Creates a model response for the given chat conversation. Equivalent to the Deepseek `chat/completions` API endpoint. For additional details, see: [https://api-docs.deepseek.com/api/create-chat-completion](https://api-docs.deepseek.com/api/create-chat-completion) Deepseek uses the OpenAI SDK, so you will need to install the `openai` package to use this UDF. Request throttling: Applies the rate limit set in the config (section `deepseek`, key `rate_limit`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install openai` **Parameters:** * **`messages`** (`pxt.Json`): A list of messages to use for chat completion, as described in the Deepseek API documentation. * **`model`** (`pxt.String`): The model to use for chat completion. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the Deepseek `chat/completions` API. For details on the available parameters, see: [https://api-docs.deepseek.com/api/create-chat-completion](https://api-docs.deepseek.com/api/create-chat-completion) * **`tools`** (`pxt.Json | None`): An optional list of Pixeltable tools to use for the request. * **`tool_choice`** (`pxt.Json | None`): An optional tool choice configuration. **Returns:** * `pxt.Json`: A dictionary containing the response and other metadata. 
**Examples:** Add a computed column that applies the model `deepseek-chat` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} messages = [ {'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': tbl.prompt}, ] tbl.add_computed_column( response=chat_completions(messages, model='deepseek-chat') ) ``` # DirContents Source: https://docs.pixeltable.com/sdk/latest/dircontents View Source on GitHub # class  pixeltable.DirContents Represents the contents of a Pixeltable directory. ## attr  dirs ``` dirs: list[str] ``` List of directory paths contained in this directory. ## attr  tables ``` tables: list[str] ``` List of table paths contained in this directory. # document Source: https://docs.pixeltable.com/sdk/latest/document View Source on GitHub # module  pixeltable.functions.document Pixeltable UDFs for `DocumentType`. ## iterator  document\_splitter() ```python Signature theme={null} @pxt.iterator document_splitter( document: pxt.Document, separators: pxt.String, *, elements: pxt.Json | None = None, limit: pxt.Int | None = None, overlap: pxt.Int | None = None, metadata: pxt.String = '', skip_tags: pxt.Json | None = None, spacy_model: pxt.String = 'en_core_web_sm', tiktoken_encoding: pxt.String | None = 'cl100k_base', tiktoken_target_model: pxt.String | None = None, image_dpi: pxt.Int = 300, image_format: pxt.String = 'png' ) ``` Iterator over chunks of a document. The document is chunked according to the specified `separators`. Chunked text will be cleaned with `ftfy.fix_text` to fix up common problems with unicode sequences. **Outputs**: One row per chunk, with the following columns, depending on the specified `elements` and `metadata`: * `text` (`pxt.String`): The text of the chunk. Present if `'text'` is specified in `elements`. * `image` (`pxt.Image`): The image extracted from the chunk. Present if `'image'` is specified in `elements`. * `title` (`pxt.String | None`): The document title. 
Present if `'title'` is specified in `metadata`. * `heading` (`pxt.Json | None`): The heading hierarchy at the start of the chunk (HTML and Markdown only). Present if `'heading'` is specified in `metadata`. * `sourceline` (`pxt.Int | None`): The source line number of the start of the chunk (HTML only). Present if `'sourceline'` is specified in `metadata`. * `page` (`pxt.Int | None`): The page number of the chunk (PDF only). Present if `'page'` is specified in `metadata`. * `bounding_box` (`pxt.Json | None`): The bounding box of the chunk on the page, as an `{x1, y1, x2, y2}` dictionary (PDF only). Present if `'bounding_box'` is specified in `metadata`. **Parameters:** * **`separators`** (`pxt.String`): separators to use to chunk the document. Options are: `'heading'`, `'paragraph'`, `'sentence'`, `'token_limit'`, `'char_limit'`, `'page'`. This may be a comma-separated string, e.g., `'heading,token_limit'`. * **`elements`** (`pxt.Json | None`): list of elements to extract from the document. Options are: `'text'`, `'image'`. Defaults to `['text']` if not specified. The `'image'` element is only supported for the `'page'` separator on PDF documents. * **`limit`** (`pxt.Int | None`): the maximum number of tokens or characters in each chunk, if `'token_limit'` or `'char_limit'` is specified. * **`metadata`** (`pxt.String`): additional metadata fields to include in the output. Options are: `'title'`, `'heading'` (HTML and Markdown), `'sourceline'` (HTML), `'page'` (PDF), `'bounding_box'` (PDF). The input may be a comma-separated string, e.g., `'title,heading,sourceline'`. * **`skip_tags`** (`pxt.Json | None`): list of HTML tags to skip when processing HTML documents. * **`spacy_model`** (`pxt.String`): Name of the spaCy model to use for sentence segmentation. This parameter is ignored unless the `'sentence'` separator is specified. * **`tiktoken_encoding`** (`pxt.String | None`): Name of the tiktoken encoding to use when counting tokens. 
This parameter is ignored unless the `'token_limit'` separator is specified. * **`tiktoken_target_model`** (`pxt.String | None`): Name of the target model to use when counting tokens with tiktoken. If specified, this parameter overrides `tiktoken_encoding`. This parameter is ignored unless the `'token_limit'` separator is specified. * **`image_dpi`** (`pxt.Int`): DPI to use when extracting images from PDFs. Defaults to 300. * **`image_format`** (`pxt.String`): format to use when extracting images from PDFs. Defaults to 'png'. **Examples:** All these examples assume an existing table `tbl` with a column `doc` of type `pxt.Document`. Create a view that splits all documents into chunks of up to 300 tokens: ```python theme={null} pxt.create_view( 'chunks', tbl, iterator=document_splitter( tbl.doc, separators='token_limit', limit=300 ), ) ``` Create a view that splits all documents along sentence boundaries, including title and heading metadata: ```python theme={null} pxt.create_view( 'sentence_chunks', tbl, iterator=document_splitter( tbl.doc, separators='sentence', metadata='title,heading' ), ) ``` # fabric Source: https://docs.pixeltable.com/sdk/latest/fabric View Source on GitHub # module  pixeltable.functions.fabric Pixeltable UDFs that wrap Azure OpenAI endpoints via Microsoft Fabric. These functions provide seamless access to Azure OpenAI models within Microsoft Fabric notebook environments. Authentication and endpoint discovery are handled automatically using Fabric's built-in service discovery and token utilities. **Note:** These functions only work within Microsoft Fabric notebook environments. 
For more information on Fabric AI services, see: [https://learn.microsoft.com/en-us/fabric/data-science/ai-services/ai-services-overview](https://learn.microsoft.com/en-us/fabric/data-science/ai-services/ai-services-overview) ## udf  chat\_completions() ```python Signature theme={null} @pxt.udf chat_completions( messages: pxt.Json, *, model: pxt.String, api_version: pxt.String | None = None, model_kwargs: pxt.Json | None = None ) -> pxt.Json ``` Creates a model response for the given chat conversation using Azure OpenAI in Fabric. Equivalent to the Azure OpenAI `chat/completions` API endpoint. For additional details, see: [https://learn.microsoft.com/en-us/azure/ai-services/openai/reference](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference) **Automatic authentication:** Authentication is handled automatically in Fabric notebooks using token-based authentication. No API keys are required. **Supported models in Fabric:** * `gpt-5` (reasoning model) * `gpt-4.1` * `gpt-4.1-mini` Request throttling: Applies the rate limit set in the config (section `fabric.rate_limits`, key `chat`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * Microsoft Fabric notebook environment * `synapse-ml-fabric` package (pre-installed in Fabric) **Parameters:** * **`messages`** (`pxt.Json`): A list of message dicts with 'role' and 'content' keys, as described in the Azure OpenAI API documentation. * **`model`** (`pxt.String`): The deployment name to use (e.g., 'gpt-5', 'gpt-4.1', 'gpt-4.1-mini'). * **`api_version`** (`pxt.String | None`): Optional API version override. If not specified, defaults to '2025-04-01-preview' for reasoning models (gpt-5) and '2024-02-15-preview' for standard models. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the Azure OpenAI `chat/completions` API. 
For details on available parameters, see: [https://learn.microsoft.com/en-us/azure/ai-services/openai/reference](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference) **Note:** Reasoning models (gpt-5) use `max_completion_tokens` instead of `max_tokens` and do not support the `temperature` parameter. **Returns:** * `pxt.Json`: A dictionary containing the response and other metadata. **Examples:** Add a computed column that applies the model `gpt-4.1` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} from pixeltable.functions import fabric messages = [ {'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': tbl.prompt}, ] tbl.add_computed_column( response=fabric.chat_completions(messages, model='gpt-4.1') ) ``` Using a reasoning model (gpt-5): ```python theme={null} tbl.add_computed_column( reasoning_response=fabric.chat_completions( messages, model='gpt-5', model_kwargs={'max_completion_tokens': 5000}, ) ) ``` ## udf  embeddings() ```python Signature theme={null} @pxt.udf embeddings( input: pxt.String, *, model: pxt.String = 'text-embedding-ada-002', api_version: pxt.String = '2024-02-15-preview', model_kwargs: pxt.Json | None = None ) -> pxt.Array[(None,), float32] ``` Creates an embedding vector representing the input text using Azure OpenAI in Fabric. Equivalent to the Azure OpenAI `embeddings` API endpoint. For additional details, see: [https://learn.microsoft.com/en-us/azure/ai-services/openai/reference](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference) **Automatic authentication:** Authentication is handled automatically in Fabric notebooks using token-based authentication. No API keys are required. **Supported models in Fabric:** * `text-embedding-ada-002` * `text-embedding-3-small` * `text-embedding-3-large` Request throttling: Applies the rate limit set in the config (section `fabric.rate_limits`, key `embeddings`). 
If no rate limit is configured, uses a default of 600 RPM. Batches up to 32 inputs per request for efficiency. **Requirements:** * Microsoft Fabric notebook environment * `synapse-ml-fabric` package (pre-installed in Fabric) **Parameters:** * **`input`** (`pxt.String`): The text to embed (automatically batched). * **`model`** (`pxt.String`): The embedding model deployment name (default: 'text-embedding-ada-002'). * **`api_version`** (`pxt.String`): The API version to use (default: '2024-02-15-preview'). * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the Azure OpenAI `embeddings` API. For details on available parameters, see: [https://learn.microsoft.com/en-us/azure/ai-services/openai/reference](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference) **Returns:** * `pxt.Array[(None,), float32]`: An array representing the embedding vector for the input text. **Examples:** Add a computed column that applies the model `text-embedding-ada-002` to an existing Pixeltable column `tbl.text` of the table `tbl`: ```python theme={null} from pixeltable.functions import fabric tbl.add_computed_column(embed=fabric.embeddings(tbl.text)) ``` Add an embedding index to an existing column `text`: ```python theme={null} tbl.add_embedding_index( 'text', embedding=fabric.embeddings.using(model='text-embedding-ada-002'), ) ``` # fal Source: https://docs.pixeltable.com/sdk/latest/fal View Source on GitHub # module  pixeltable.functions.fal Pixeltable UDFs that wrap various endpoints from the fal.ai API. In order to use them, you must first `pip install fal-client` and configure your fal.ai credentials, as described in the [Working with fal.ai](https://docs.pixeltable.com/notebooks/integrations/working-with-fal) tutorial. ## udf  run() ```python Signature theme={null} @pxt.udf run(input: pxt.Json, *, app: pxt.String) -> pxt.Json ``` Run a model on fal.ai. Uses fal's queue-based subscribe mechanism for reliable execution. 
For additional details, see: [https://fal.ai/docs](https://fal.ai/docs) Request throttling: Applies the rate limit set in the config (section `fal`, key `rate_limit`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install fal-client` **Parameters:** * **`input`** (`pxt.Json`): The input parameters for the model. * **`app`** (`pxt.String`): The name or ID of the fal.ai application to run (e.g., 'fal-ai/flux/schnell'). **Returns:** * `pxt.Json`: The output of the model as a JSON object. **Examples:** Add a computed column that applies the model `fal-ai/flux/schnell` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} input = {'prompt': tbl.prompt} tbl.add_computed_column(response=run(input, app='fal-ai/flux/schnell')) ``` Add a computed column that uses the model `fal-ai/fast-sdxl` to generate images from an existing Pixeltable column `tbl.prompt`: ```python theme={null} input = { 'prompt': tbl.prompt, 'image_size': 'square', 'num_inference_steps': 25, } tbl.add_computed_column(response=run(input, app='fal-ai/fast-sdxl')) tbl.add_computed_column( image=tbl.response['images'][0]['url'].astype(pxt.Image) ) ``` # fireworks Source: https://docs.pixeltable.com/sdk/latest/fireworks View Source on GitHub # module  pixeltable.functions.fireworks Pixeltable UDFs that wrap various endpoints from the Fireworks AI API. In order to use them, you must first `pip install fireworks-ai` and configure your Fireworks AI credentials, as described in the [Working with Fireworks](https://docs.pixeltable.com/notebooks/integrations/working-with-fireworks) tutorial. ## udf  chat\_completions() ```python Signature theme={null} @pxt.udf chat_completions( messages: pxt.Json, *, model: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.Json ``` Creates a model response for the given chat conversation. Equivalent to the Fireworks AI `chat/completions` API endpoint. 
For additional details, see: [https://docs.fireworks.ai/api-reference/post-chatcompletions](https://docs.fireworks.ai/api-reference/post-chatcompletions) Request throttling: Applies the rate limit set in the config (section `fireworks`, key `rate_limit`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install fireworks-ai` **Parameters:** * **`messages`** (`pxt.Json`): A list of messages comprising the conversation so far. * **`model`** (`pxt.String`): The name of the model to use. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the Fireworks `chat_completions` API. For details on the available parameters, see: [https://docs.fireworks.ai/api-reference/post-chatcompletions](https://docs.fireworks.ai/api-reference/post-chatcompletions) **Returns:** * `pxt.Json`: A dictionary containing the response and other metadata. **Examples:** Add a computed column that applies the model `accounts/fireworks/models/mixtral-8x22b-instruct` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} messages = [{'role': 'user', 'content': tbl.prompt}] tbl.add_computed_column( response=chat_completions( messages, model='accounts/fireworks/models/mixtral-8x22b-instruct' ) ) ``` # functions Source: https://docs.pixeltable.com/sdk/latest/functions View Source on GitHub # module  pixeltable.functions General Pixeltable UDFs. This parent module contains general-purpose UDFs that apply to multiple data types. ## func  map() ```python Signature theme={null} map( expr: pixeltable.exprs.expr.Expr, fn: Callable[[pixeltable.exprs.expr.Expr], Any] ) -> pixeltable.exprs.expr.Expr ``` Applies a mapping function to each element of a list. **Parameters:** * **`expr`** (`pixeltable.exprs.expr.Expr`): The list expression to map over; must be an expression of type `pxt.Json`. 
* **`fn`** (`typing.Callable[[pixeltable.exprs.expr.Expr], typing.Any]`): An operation on Pixeltable expressions that will be applied to each element of the JSON array. **Examples:** Given a table `tbl` with a column `data` of type `pxt.Json` containing lists of integers, add a computed column that produces new lists with each integer doubled: ```python theme={null} tbl.add_computed_column( doubled=pxt.functions.map(tbl.data, lambda x: x * 2) ) ``` ## uda  count() ```python Signatures theme={null} # Signature 1: @pxt.uda count(val: pxt.String | None) -> pxt.Int # Signature 2: @pxt.uda count(val: pxt.Bool | None) -> pxt.Int # Signature 3: @pxt.uda count(val: pxt.Int | None) -> pxt.Int # Signature 4: @pxt.uda count(val: pxt.Float | None) -> pxt.Int # Signature 5: @pxt.uda count(val: pxt.Timestamp | None) -> pxt.Int # Signature 6: @pxt.uda count(val: pxt.Json | None) -> pxt.Int # Signature 7: @pxt.uda count(val: pxt.Array | None) -> pxt.Int # Signature 8: @pxt.uda count(val: pxt.Image | None) -> pxt.Int # Signature 9: @pxt.uda count(val: pxt.Video | None) -> pxt.Int # Signature 10: @pxt.uda count(val: pxt.Audio | None) -> pxt.Int # Signature 11: @pxt.uda count(val: pxt.Document | None) -> pxt.Int # Signature 12: @pxt.uda count(val: pxt.Date | None) -> pxt.Int # Signature 13: @pxt.uda count(val: pxt.UUID | None) -> pxt.Int # Signature 14: @pxt.uda count(val: pxt.Binary | None) -> pxt.Int ``` Aggregate function that counts the number of non-null values in a column or grouping. **Parameters:** * **`val`** (`String | None`): The value to count. **Returns:** * `pxt.Int`: The count of non-null values.
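The null-skipping semantics are the same across all signatures. As a plain-Python illustration of what `count` computes over a column's values (Pixeltable evaluates aggregates in its own engine; this sketch is not its implementation):

```python theme={null}
# Illustrative only: count() tallies non-null values, like SQL COUNT(col).
def count_non_null(values):
    return sum(1 for v in values if v is not None)

count_non_null([3, None, 7, None, 1])  # 3
```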
**Examples:** Count the number of non-null values in the `value` column of the table `tbl`: ```python theme={null} tbl.select(pxt.functions.count(tbl.value)).collect() ``` Group by the `category` column and compute the count of non-null values in the `value` column for each category, assigning the name `'category_count'` to the new column: ```python theme={null} tbl.group_by(tbl.category).select( tbl.category, category_count=pxt.functions.count(tbl.value) ).collect() ``` ## uda  max() ```python Signatures theme={null} # Signature 1: @pxt.uda max(val: pxt.String | None) -> pxt.String | None # Signature 2: @pxt.uda max(val: pxt.Int | None) -> pxt.Int | None # Signature 3: @pxt.uda max(val: pxt.Float | None) -> pxt.Float | None # Signature 4: @pxt.uda max(val: pxt.Bool | None) -> pxt.Bool | None # Signature 5: @pxt.uda max(val: pxt.Timestamp | None) -> pxt.Timestamp | None ``` Aggregate function that computes the maximum value in a column or grouping. **Parameters:** * **`val`** (`String | None`): The value to compare. **Returns:** * `pxt.String | None`: The maximum value, or `None` if there are no non-null values. **Examples:** Compute the maximum value in the `value` column of the table `tbl`: ```python theme={null} tbl.select(pxt.functions.max(tbl.value)).collect() ``` Group by the `category` column and compute the maximum value in the `value` column for each category, assigning the name `'category_max'` to the new column: ```python theme={null} tbl.group_by(tbl.category).select( tbl.category, category_max=pxt.functions.max(tbl.value) ).collect() ``` ## uda  mean() ```python Signatures theme={null} # Signature 1: @pxt.uda mean(val: pxt.Int | None) -> pxt.Float | None # Signature 2: @pxt.uda mean(val: pxt.Float | None) -> pxt.Float | None ``` Aggregate function that computes the mean (average) of non-null values of a numeric column or grouping. **Parameters:** * **`val`** (`Int | None`): The numeric value to include in the mean. 
**Returns:** * `pxt.Float | None`: The mean of the non-null values, or `None` if there are no non-null values. **Examples:** Compute the mean of the values in the `value` column of the table `tbl`: ```python theme={null} tbl.select(pxt.functions.mean(tbl.value)).collect() ``` Group by the `category` column and compute the mean of the `value` column for each category, assigning the name `'category_mean'` to the new column: ```python theme={null} tbl.group_by(tbl.category).select( tbl.category, category_mean=pxt.functions.mean(tbl.value) ).collect() ``` ## uda  min() ```python Signatures theme={null} # Signature 1: @pxt.uda min(val: pxt.String | None) -> pxt.String | None # Signature 2: @pxt.uda min(val: pxt.Int | None) -> pxt.Int | None # Signature 3: @pxt.uda min(val: pxt.Float | None) -> pxt.Float | None # Signature 4: @pxt.uda min(val: pxt.Bool | None) -> pxt.Bool | None # Signature 5: @pxt.uda min(val: pxt.Timestamp | None) -> pxt.Timestamp | None ``` Aggregate function that computes the minimum value in a column or grouping. **Parameters:** * **`val`** (`String | None`): The value to compare. **Returns:** * `pxt.String | None`: The minimum value, or `None` if there are no non-null values. **Examples:** Compute the minimum value in the `value` column of the table `tbl`: ```python theme={null} tbl.select(pxt.functions.min(tbl.value)).collect() ``` Group by the `category` column and compute the minimum value in the `value` column for each category, assigning the name `'category_min'` to the new column: ```python theme={null} tbl.group_by(tbl.category).select( tbl.category, category_min=pxt.functions.min(tbl.value) ).collect() ``` ## uda  sum() ```python Signatures theme={null} # Signature 1: @pxt.uda sum(val: pxt.Int | None) -> pxt.Int | None # Signature 2: @pxt.uda sum(val: pxt.Float | None) -> pxt.Float | None ``` Aggregate function that computes the sum of non-null values of a numeric column or grouping. 
**Parameters:** * **`val`** (`Int | None`): The numeric value to add to the sum. **Returns:** * `pxt.Int | None`: The sum of the non-null values, or `None` if there are no non-null values. **Examples:** Sum the values in the `value` column of the table `tbl`: ```python theme={null} tbl.select(pxt.functions.sum(tbl.value)).collect() ``` Group by the `category` column and compute the sum of the `value` column for each category, assigning the name `'category_total'` to the new column: ```python theme={null} tbl.group_by(tbl.category).select( tbl.category, category_total=pxt.functions.sum(tbl.value) ).collect() ``` # gemini Source: https://docs.pixeltable.com/sdk/latest/gemini View Source on GitHub # module  pixeltable.functions.gemini Pixeltable UDFs that wrap various endpoints from the Google Gemini API. In order to use them, you must first `pip install google-genai` and configure your Gemini credentials, as described in the [Working with Gemini](https://docs.pixeltable.com/notebooks/integrations/working-with-gemini) tutorial. ## func  invoke\_tools() ```python Signature theme={null} invoke_tools( tools: pixeltable.func.tools.Tools, response: pixeltable.exprs.expr.Expr ) -> pixeltable.exprs.inline_expr.InlineDict ``` Converts a Gemini response dict to Pixeltable tool invocation format and calls `tools._invoke()`. ## udf  generate\_content() ```python Signature theme={null} @pxt.udf generate_content( contents: pxt.Json, *, model: pxt.String, config: pxt.Json | None = None, tools: pxt.Json | None = None ) -> pxt.Json ``` Generate content from the specified model. Request throttling: Applies the rate limit set in the config (section `gemini.rate_limits`; use the model id as the key). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install google-genai` **Parameters:** * **`contents`** (`pxt.Json`): The input content to generate from.
Can be a prompt, or a list containing images and text prompts, as described in: [https://ai.google.dev/gemini-api/docs/text-generation](https://ai.google.dev/gemini-api/docs/text-generation) * **`model`** (`pxt.String`): The name of the model to use. * **`config`** (`pxt.Json | None`): Configuration for generation, corresponding to keyword arguments of `genai.types.GenerateContentConfig`. For details on the parameters, see: [https://googleapis.github.io/python-genai/genai.html#genai.types.GenerateContentConfig](https://googleapis.github.io/python-genai/genai.html#genai.types.GenerateContentConfig) * **`tools`** (`pxt.Json | None`): An optional list of Pixeltable tools to use. It is also possible to specify tools manually via the `config['tools']` parameter, but at most one of `config['tools']` or `tools` may be used. **Returns:** * `pxt.Json`: A dictionary containing the response and other metadata. **Examples:** Add a computed column that applies the model `gemini-2.5-flash` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} tbl.add_computed_column( response=generate_content(tbl.prompt, model='gemini-2.5-flash') ) ``` ## udf  generate\_embedding() ```python Signature theme={null} @pxt.udf generate_embedding( input: pxt.String, *, model: pxt.String, config: pxt.Json | None = None, use_batch_api: pxt.Bool = False ) -> pxt.Array[(None,), float32] ``` Generate embeddings for the input strings. For more information on Gemini embeddings API, see: [https://ai.google.dev/gemini-api/docs/embeddings](https://ai.google.dev/gemini-api/docs/embeddings) **Requirements:** * `pip install google-genai` **Parameters:** * **`input`** (`pxt.String`): The strings to generate embeddings for. * **`model`** (`pxt.String`): The Gemini model to use. * **`config`** (`pxt.Json | None`): Configuration for embedding generation, corresponding to keyword arguments of `genai.types.EmbedContentConfig`. 
For details on the parameters, see: [https://googleapis.github.io/python-genai/genai.html#genai.types.EmbedContentConfig](https://googleapis.github.io/python-genai/genai.html#genai.types.EmbedContentConfig) * **`use_batch_api`** (`pxt.Bool`): If True, use [Gemini's Batch API](https://ai.google.dev/gemini-api/docs/batch-api), which provides higher throughput at lower cost, at the expense of higher latency. **Returns:** * `pxt.Array[(None,), float32]`: The generated embeddings. **Examples:** Add a computed column with embeddings to an existing table with a `text` column: ```python theme={null} t.add_computed_column( embedding=generate_embedding(t.text, model='gemini-embedding-001') ) ``` Add an embedding index on the `text` column: ```python theme={null} t.add_embedding_index( t.text, embedding=generate_embedding.using( model='gemini-embedding-001', config={'output_dimensionality': 3072} ), ) ``` ## udf  generate\_images() ```python Signature theme={null} @pxt.udf generate_images( prompt: pxt.String, *, model: pxt.String, config: pxt.Json | None = None ) -> pxt.Image ``` Generates images based on a text description and configuration. For additional details, see: [https://ai.google.dev/gemini-api/docs/image-generation](https://ai.google.dev/gemini-api/docs/image-generation) Request throttling: Applies the rate limit set in the config (section `imagen.rate_limits`; use the model id as the key). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install google-genai` **Parameters:** * **`prompt`** (`pxt.String`): A text description of the images to generate. * **`model`** (`pxt.String`): The model to use. * **`config`** (`pxt.Json | None`): Configuration for generation, corresponding to keyword arguments of `genai.types.GenerateImagesConfig`.
For details on the parameters, see: [https://googleapis.github.io/python-genai/genai.html#genai.types.GenerateImagesConfig](https://googleapis.github.io/python-genai/genai.html#genai.types.GenerateImagesConfig) **Returns:** * `pxt.Image`: The generated image. **Examples:** Add a computed column that applies the model `imagen-4.0-generate-001` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} tbl.add_computed_column( response=generate_images(tbl.prompt, model='imagen-4.0-generate-001') ) ``` ## udf  generate\_videos() ```python Signature theme={null} @pxt.udf generate_videos( prompt: pxt.String | None = None, image: pxt.Image | None = None, *, model: pxt.String, config: pxt.Json | None = None ) -> pxt.Video ``` Generates videos based on a text description and configuration. For additional details, see: [https://ai.google.dev/gemini-api/docs/video](https://ai.google.dev/gemini-api/docs/video) At least one of `prompt` or `image` must be provided. Request throttling: Applies the rate limit set in the config (section `veo.rate_limits`; use the model id as the key). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install google-genai` **Parameters:** * **`prompt`** (`pxt.String | None`): A text description of the videos to generate. * **`image`** (`pxt.Image | None`): An image to use as the first frame of the video. * **`model`** (`pxt.String`): The model to use. * **`config`** (`pxt.Json | None`): Configuration for generation, corresponding to keyword arguments of `genai.types.GenerateVideosConfig`. For details on the parameters, see: [https://googleapis.github.io/python-genai/genai.html#genai.types.GenerateVideosConfig](https://googleapis.github.io/python-genai/genai.html#genai.types.GenerateVideosConfig) **Returns:** * `pxt.Video`: The generated video. 
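The 600 RPM default mentioned in the request-throttling notes above works out to at most one request every 0.1 seconds. Pixeltable applies these limits internally; purely as an illustration of the arithmetic, a minimal client-side throttle under that assumption might look like:

```python theme={null}
import time

class SimpleThrottle:
    """Illustrative minimum-interval throttle: at most `rpm` requests per minute."""

    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm  # 600 RPM -> 0.1 s between requests
        self._last = 0.0

    def wait(self) -> None:
        # Sleep just long enough to keep the request rate under the limit.
        now = time.monotonic()
        delay = self.min_interval - (now - self._last)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
```

This is a sketch of interval-based throttling only; Pixeltable's internal scheduler also handles batching and retries.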
**Examples:** Add a computed column that applies the model `veo-3.0-generate-001` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} tbl.add_computed_column( response=generate_videos(tbl.prompt, model='veo-3.0-generate-001') ) ``` # groq Source: https://docs.pixeltable.com/sdk/latest/groq View Source on GitHub # module  pixeltable.functions.groq Pixeltable UDFs that wrap various endpoints from the Groq API. In order to use them, you must first `pip install groq` and configure your Groq credentials, as described in the [Working with Groq](https://docs.pixeltable.com/notebooks/integrations/working-with-groq) tutorial. ## func  invoke\_tools() ```python Signature theme={null} invoke_tools( tools: pixeltable.func.tools.Tools, response: pixeltable.exprs.expr.Expr ) -> pixeltable.exprs.inline_expr.InlineDict ``` Converts an OpenAI response dict to Pixeltable tool invocation format and calls `tools._invoke()`. ## udf  chat\_completions() ```python Signature theme={null} @pxt.udf chat_completions( messages: pxt.Json, *, model: pxt.String, model_kwargs: pxt.Json | None = None, tools: pxt.Json | None = None, tool_choice: pxt.Json | None = None ) -> pxt.Json ``` Chat Completion API. Equivalent to the Groq `chat/completions` API endpoint. For additional details, see: [https://console.groq.com/docs/api-reference#chat-create](https://console.groq.com/docs/api-reference#chat-create) Request throttling: Applies the rate limit set in the config (section `groq`, key `rate_limit`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install groq` **Parameters:** * **`messages`** (`pxt.Json`): A list of messages comprising the conversation so far. * **`model`** (`pxt.String`): ID of the model to use. (See overview here: [https://console.groq.com/docs/models](https://console.groq.com/docs/models)) * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the Groq `chat/completions` API. 
For details on the available parameters, see: [https://console.groq.com/docs/api-reference#chat-create](https://console.groq.com/docs/api-reference#chat-create) **Returns:** * `pxt.Json`: A dictionary containing the response and other metadata. **Examples:** Add a computed column that applies the model `llama-3.1-8b-instant` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} messages = [{'role': 'user', 'content': tbl.prompt}] tbl.add_computed_column( response=chat_completions(messages, model='llama-3.1-8b-instant') ) ``` # huggingface Source: https://docs.pixeltable.com/sdk/latest/huggingface View Source on GitHub # module  pixeltable.functions.huggingface Pixeltable UDFs that wrap various models from the Hugging Face `transformers` package. These UDFs will cause Pixeltable to invoke the relevant models locally. In order to use them, you must first `pip install transformers` (or in some cases, `sentence-transformers`, as noted in the specific UDFs). ## UDFs ## udf  automatic\_speech\_recognition() ```python Signature theme={null} @pxt.udf automatic_speech_recognition( audio: pxt.Audio, *, model_id: pxt.String, language: pxt.String | None = None, chunk_length_s: pxt.Int | None = None, return_timestamps: pxt.Bool = False ) -> pxt.String ``` Transcribes speech to text using a pretrained ASR model. `model_id` should be a reference to a pretrained [automatic-speech-recognition model](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition). This is a **generic function** that works with many ASR model families. For production use with specific models, consider specialized functions like `whisper.transcribe()` or `speech2text_for_conditional_generation()`. 
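For long recordings, the `chunk_length_s` parameter splits the audio into fixed-length windows that are transcribed separately. As a rough sketch of the window boundaries this implies (assumed behavior for illustration; the underlying pipeline handles chunk overlap and stitching internally):

```python theme={null}
def chunk_bounds(duration_s: float, chunk_length_s: float):
    """Illustrative: fixed-length chunk boundaries for a recording."""
    bounds, start = [], 0.0
    while start < duration_s:
        bounds.append((start, min(start + chunk_length_s, duration_s)))
        start += chunk_length_s
    return bounds

chunk_bounds(95.0, 30.0)  # [(0.0, 30.0), (30.0, 60.0), (60.0, 90.0), (90.0, 95.0)]
```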
**Requirements:** * `pip install torch transformers torchaudio` **Recommended Models:** * **OpenAI Whisper**: `openai/whisper-tiny.en`, `openai/whisper-small`, `openai/whisper-base` * **Facebook Wav2Vec2**: `facebook/wav2vec2-base-960h`, `facebook/wav2vec2-large-960h-lv60-self` * **Microsoft SpeechT5**: `microsoft/speecht5_asr` * **Meta MMS (Multilingual)**: `facebook/mms-1b-all` **Parameters:** * **`audio`** (`pxt.Audio`): The audio file(s) to transcribe. * **`model_id`** (`pxt.String`): The pretrained ASR model to use. * **`language`** (`pxt.String | None`): Language code for multilingual models (e.g., 'en', 'es', 'fr'). * **`chunk_length_s`** (`pxt.Int | None`): Maximum length of audio chunks in seconds for long audio processing. * **`return_timestamps`** (`pxt.Bool`): Whether to return word-level timestamps (model dependent). **Returns:** * `pxt.String`: The transcribed text. **Examples:** Add a computed column that transcribes audio files: ```python theme={null} tbl.add_computed_column( transcription=automatic_speech_recognition( tbl.audio_file, model_id='openai/whisper-tiny.en', # Recommended ) ) ``` Transcribe with language specification: ```python theme={null} tbl.add_computed_column( transcription=automatic_speech_recognition( tbl.audio_file, model_id='facebook/mms-1b-all', language='en' ) ) ``` ## udf  clip() ```python Signatures theme={null} # Signature 1: @pxt.udf clip( text: pxt.String, model_id: pxt.String ) -> pxt.Array[(None,), float32] # Signature 2: @pxt.udf clip( image: pxt.Image, model_id: pxt.String ) -> pxt.Array[(None,), float32] ``` Computes a CLIP embedding for the specified text or image. `model_id` should be a reference to a pretrained [CLIP Model](https://huggingface.co/docs/transformers/model_doc/clip). **Requirements:** * `pip install torch transformers` **Parameters:** * **`text`** (`String`): The string to embed. * **`model_id`** (`String`): The pretrained model to use for the embedding. 
**Returns:** * `pxt.Array[(None,), float32]`: An array containing the output of the embedding model. **Examples:** Add a computed column that applies the model `openai/clip-vit-base-patch32` to an existing Pixeltable column `tbl.text` of the table `tbl`: ```python theme={null} tbl.add_computed_column( result=clip(tbl.text, model_id='openai/clip-vit-base-patch32') ) ``` ## udf  cross\_encoder() ```python Signature theme={null} @pxt.udf cross_encoder( sentences1: pxt.String, sentences2: pxt.String, *, model_id: pxt.String ) -> pxt.Float ``` Predicts a similarity score for the given sentence pair. `model_id` should be a pretrained Cross-Encoder model, as described in the [Cross-Encoder Pretrained Models](https://www.sbert.net/docs/cross_encoder/pretrained_models.html) documentation. **Requirements:** * `pip install torch sentence-transformers` **Parameters:** * **`sentences1`** (`pxt.String`): The first sentence to be paired. * **`sentences2`** (`pxt.String`): The second sentence to be paired. * **`model_id`** (`pxt.String`): The identifier of the cross-encoder model to use. **Returns:** * `pxt.Float`: The similarity score between the inputs. **Examples:** Add a computed column that applies the model `ms-marco-MiniLM-L-4-v2` to the sentences in columns `tbl.sentence1` and `tbl.sentence2`: ```python theme={null} tbl.add_computed_column( result=cross_encoder( tbl.sentence1, tbl.sentence2, model_id='ms-marco-MiniLM-L-4-v2' ) ) ``` ## udf  detr\_for\_object\_detection() ```python Signature theme={null} @pxt.udf detr_for_object_detection( image: pxt.Image, *, model_id: pxt.String, threshold: pxt.Float = 0.5, revision: pxt.String = 'no_timm' ) -> pxt.Json ``` Computes DETR object detections for the specified image. `model_id` should be a reference to a pretrained [DETR Model](https://huggingface.co/docs/transformers/model_doc/detr). **Requirements:** * `pip install torch transformers` **Parameters:** * **`image`** (`pxt.Image`): The image to compute detections for.
* **`model_id`** (`pxt.String`): The pretrained model to use for object detection. * **`threshold`** (`pxt.Float`): Confidence threshold for filtering detections. * **`revision`** (`pxt.String`): The model revision to use. **Returns:** * `pxt.Json`: A dictionary containing the output of the object detection model, in the following format: ```python theme={null} { # list of confidence scores for each detected object 'scores': [0.99, 0.999], # list of COCO class labels for each detected object 'labels': [25, 25], # corresponding text names of class labels 'label_text': ['giraffe', 'giraffe'], # list of bounding boxes for each detected object, as [x1, y1, x2, y2] 'boxes': [ [51.942, 356.174, 181.481, 413.975], [383.225, 58.66, 605.64, 361.346], ], } ``` **Examples:** Add a computed column that applies the model `facebook/detr-resnet-50` to an existing Pixeltable column `image` of the table `tbl`: ```python theme={null} tbl.add_computed_column( detections=detr_for_object_detection( tbl.image, model_id='facebook/detr-resnet-50', threshold=0.8 ) ) ``` ## udf  detr\_for\_segmentation() ```python Signature theme={null} @pxt.udf detr_for_segmentation( image: pxt.Image, *, model_id: pxt.String, threshold: pxt.Float = 0.5 ) -> pxt.Json ``` Computes DETR panoptic segmentation for the specified image. `model_id` should be a reference to a pretrained [DETR Model](https://huggingface.co/docs/transformers/model_doc/detr) with a segmentation head. **Requirements:** * `pip install torch transformers timm` **Parameters:** * **`image`** (`pxt.Image`): The image to segment. * **`model_id`** (`pxt.String`): The pretrained model to use for segmentation (e.g., 'facebook/detr-resnet-50-panoptic'). * **`threshold`** (`pxt.Float`): Confidence threshold for filtering segments.
**Returns:** * `pxt.Json`: A dictionary containing the output of the segmentation model, in the following format: ```python theme={null} { 'segmentation': np.ndarray, # (H, W) array where each pixel value is a segment ID 'segments_info': [ { 'id': 1, # segment ID (matches pixel values in segmentation array) 'label_id': 0, # class label index 'label_text': 'person', # human-readable class name 'score': 0.98, # confidence score 'was_fused': False, # whether segment was fused from multiple instances }, ..., ], } ``` **Examples:** Add a computed column that applies the model `facebook/detr-resnet-50-panoptic` to an existing Pixeltable column `image` of the table `tbl`: ```python theme={null} tbl.add_computed_column( segmentation=detr_for_segmentation( tbl.image, model_id='facebook/detr-resnet-50-panoptic', threshold=0.5, ) ) ``` ## udf  detr\_to\_coco() ```python Signature theme={null} @pxt.udf detr_to_coco(image: pxt.Image, detr_info: pxt.Json) -> pxt.Json ``` Converts the output of a DETR object detection model to COCO format. **Parameters:** * **`image`** (`pxt.Image`): The image for which detections were computed. * **`detr_info`** (`pxt.Json`): The output of a DETR object detection model, as returned by `detr_for_object_detection`. **Returns:** * `pxt.Json`: A dictionary containing the data from `detr_info`, converted to COCO format. **Examples:** Add a computed column that converts the output `tbl.detections` to COCO format, where `tbl.image` is the image for which detections were computed: ```python theme={null} tbl.add_computed_column( detections_coco=detr_to_coco(tbl.image, tbl.detections) ) ``` ## udf  image\_captioning() ```python Signature theme={null} @pxt.udf image_captioning( image: pxt.Image, *, model_id: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.String ``` Generates captions for images using a pretrained image captioning model. 
`model_id` should be a reference to a pretrained [image-to-text model](https://huggingface.co/models?pipeline_tag=image-to-text) such as BLIP, Git, or LLaVA. **Requirements:** * `pip install torch transformers` **Parameters:** * **`image`** (`pxt.Image`): The image to caption. * **`model_id`** (`pxt.String`): The pretrained model to use for captioning. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword arguments to pass to the model's `generate` method, such as `max_length`. **Returns:** * `pxt.String`: The generated caption text. **Examples:** Add a computed column `caption` to an existing table `tbl` that generates captions using the `Salesforce/blip-image-captioning-base` model: ```python theme={null} tbl.add_computed_column( caption=image_captioning( tbl.image, model_id='Salesforce/blip-image-captioning-base', model_kwargs={'max_length': 30}, ) ) ``` ## udf  image\_to\_image() ```python Signature theme={null} @pxt.udf image_to_image( image: pxt.Image, prompt: pxt.String, *, model_id: pxt.String, seed: pxt.Int | None = None, model_kwargs: pxt.Json | None = None ) -> pxt.Image ``` Transforms input images based on text prompts using a pretrained image-to-image model. `model_id` should be a reference to a pretrained [image-to-image model](https://huggingface.co/models?pipeline_tag=image-to-image) such as Stable Diffusion. **Requirements:** * `pip install torch transformers diffusers accelerate` **Parameters:** * **`image`** (`pxt.Image`): The input image to transform. * **`prompt`** (`pxt.String`): The text prompt describing the desired transformation. * **`model_id`** (`pxt.String`): The pretrained image-to-image model to use. * **`seed`** (`pxt.Int | None`): Random seed for reproducibility. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword arguments to pass to the model, such as `strength`, `guidance_scale`, or `num_inference_steps`. **Returns:** * `pxt.Image`: The transformed image. 
**Examples:** Add a computed column that transforms images based on prompts: ```python theme={null} tbl.add_computed_column( transformed=image_to_image( tbl.source_image, tbl.transformation_prompt, model_id='stable-diffusion-v1-5/stable-diffusion-v1-5', ) ) ``` With custom transformation strength: ```python theme={null} tbl.add_computed_column( transformed=image_to_image( tbl.source_image, tbl.transformation_prompt, model_id='stable-diffusion-v1-5/stable-diffusion-v1-5', model_kwargs={'strength': 0.75, 'num_inference_steps': 50}, ) ) ``` ## udf  image\_to\_video() ```python Signature theme={null} @pxt.udf image_to_video( image: pxt.Image, *, model_id: pxt.String, num_frames: pxt.Int = 25, fps: pxt.Int = 6, seed: pxt.Int | None = None, model_kwargs: pxt.Json | None = None ) -> pxt.Video ``` Generates videos from input images using a pretrained image-to-video model. `model_id` should be a reference to a pretrained [image-to-video model](https://huggingface.co/models?pipeline_tag=image-to-video). **Requirements:** * `pip install torch transformers diffusers accelerate` **Parameters:** * **`image`** (`pxt.Image`): The input image to animate into a video. * **`model_id`** (`pxt.String`): The pretrained image-to-video model to use. * **`num_frames`** (`pxt.Int`): Number of video frames to generate. * **`fps`** (`pxt.Int`): Frames per second for the output video. * **`seed`** (`pxt.Int | None`): Random seed for reproducibility. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword arguments to pass to the model, such as `num_inference_steps`, `motion_bucket_id`, or `guidance_scale`. **Returns:** * `pxt.Video`: The generated video file. 
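The `num_frames` and `fps` parameters together determine the output clip length: duration is simply the frame count divided by the frame rate. A quick sanity check of this arithmetic, useful when budgeting generation time (illustrative only):

```python theme={null}
def clip_duration_s(num_frames: int, fps: int) -> float:
    # Output duration in seconds: frame count divided by frame rate.
    return num_frames / fps

clip_duration_s(25, 6)  # ~4.17 s with the defaults (num_frames=25, fps=6)
clip_duration_s(25, 7)  # ~3.57 s
```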
**Examples:** Add a computed column that creates videos from images: ```python theme={null} tbl.add_computed_column( video=image_to_video( tbl.input_image, model_id='stabilityai/stable-video-diffusion-img2vid-xt', num_frames=25, fps=7, ) ) ``` ## udf  question\_answering() ```python Signature theme={null} @pxt.udf question_answering( context: pxt.String, question: pxt.String, *, model_id: pxt.String ) -> pxt.Json ``` Answers questions based on provided context using a pretrained QA model. `model_id` should be a reference to a pretrained [question answering model](https://huggingface.co/models?pipeline_tag=question-answering) such as BERT or RoBERTa. **Requirements:** * `pip install torch transformers` **Parameters:** * **`context`** (`pxt.String`): The context text containing the answer. * **`question`** (`pxt.String`): The question to answer. * **`model_id`** (`pxt.String`): The pretrained QA model to use. **Returns:** * `pxt.Json`: A dictionary containing the answer, confidence score, and start/end positions. **Examples:** Add a computed column that answers questions based on document context: ```python theme={null} tbl.add_computed_column( answer=question_answering( tbl.document_text, tbl.question, model_id='deepset/roberta-base-squad2', ) ) ``` ## udf  sentence\_transformer() ```python Signature theme={null} @pxt.udf sentence_transformer( sentence: pxt.String, *, model_id: pxt.String, normalize_embeddings: pxt.Bool = False ) -> pxt.Array[(None,), float32] ``` Computes sentence embeddings. `model_id` should be a pretrained Sentence Transformers model, as described in the [Sentence Transformers Pretrained Models](https://sbert.net/docs/sentence_transformer/pretrained_models.html) documentation. **Requirements:** * `pip install torch sentence-transformers` **Parameters:** * **`sentence`** (`pxt.String`): The sentence to embed. * **`model_id`** (`pxt.String`): The pretrained model to use for the encoding. 
* **`normalize_embeddings`** (`pxt.Bool`): If `True`, normalizes embeddings to length 1; see the [Sentence Transformers API Docs](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html) for more details. **Returns:** * `pxt.Array[(None,), float32]`: An array containing the output of the embedding model. **Examples:** Add a computed column that applies the model `all-mpnet-base-v2` to an existing Pixeltable column `tbl.sentence` of the table `tbl`: ```python theme={null} tbl.add_computed_column( result=sentence_transformer( tbl.sentence, model_id='all-mpnet-base-v2' ) ) ``` ## udf  speech2text\_for\_conditional\_generation() ```python Signature theme={null} @pxt.udf speech2text_for_conditional_generation( audio: pxt.Audio, *, model_id: pxt.String, language: pxt.String | None = None ) -> pxt.String ``` Transcribes or translates speech to text using a Speech2Text model. `model_id` should be a reference to a pretrained [Speech2Text](https://huggingface.co/docs/transformers/en/model_doc/speech_to_text) model. **Requirements:** * `pip install torch torchaudio sentencepiece transformers` **Parameters:** * **`audio`** (`pxt.Audio`): The audio clip to transcribe or translate. * **`model_id`** (`pxt.String`): The pretrained model to use for the transcription or translation. * **`language`** (`pxt.String | None`): If using a multilingual translation model, the language code to translate to. If not provided, the model's default language will be used. If the model is not a translation model, is not a multilingual model, or does not support the specified language, an error will be raised. **Returns:** * `pxt.String`: The transcribed or translated text. 
**Examples:** Add a computed column that applies the model `facebook/s2t-small-librispeech-asr` to an existing Pixeltable column `audio` of the table `tbl`: ```python theme={null} tbl.add_computed_column( transcription=speech2text_for_conditional_generation( tbl.audio, model_id='facebook/s2t-small-librispeech-asr' ) ) ``` Add a computed column that applies the model `facebook/s2t-medium-mustc-multilingual-st` to an existing Pixeltable column `audio` of the table `tbl`, translating the audio to French: ```python theme={null} tbl.add_computed_column( translation=speech2text_for_conditional_generation( tbl.audio, model_id='facebook/s2t-medium-mustc-multilingual-st', language='fr', ) ) ``` ## udf  summarization() ```python Signature theme={null} @pxt.udf summarization( text: pxt.String, *, model_id: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.String ``` Summarizes text using a pretrained summarization model. `model_id` should be a reference to a pretrained [summarization model](https://huggingface.co/models?pipeline_tag=summarization) such as BART, T5, or Pegasus. **Requirements:** * `pip install torch transformers` **Parameters:** * **`text`** (`pxt.String`): The text to summarize. * **`model_id`** (`pxt.String`): The pretrained model to use for summarization. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword arguments to pass to the model's `generate` method, such as `max_length`. **Returns:** * `pxt.String`: The generated summary text. **Examples:** Add a computed column that summarizes documents: ```python theme={null} tbl.add_computed_column( summary=summarization( tbl.document_text, model_id='facebook/bart-large-cnn', model_kwargs={'max_length': 100}, ) ) ``` ## udf  text\_classification() ```python Signature theme={null} @pxt.udf text_classification( text: pxt.String, *, model_id: pxt.String, top_k: pxt.Int = 5 ) -> pxt.Json ``` Classifies text using a pretrained classification model. 
`model_id` should be a reference to a pretrained [text classification model](https://huggingface.co/models?pipeline_tag=text-classification) such as BERT, RoBERTa, or DistilBERT. **Requirements:** * `pip install torch transformers` **Parameters:** * **`text`** (`pxt.String`): The text to classify. * **`model_id`** (`pxt.String`): The pretrained model to use for classification. * **`top_k`** (`pxt.Int`): The number of top predictions to return. **Returns:** * `pxt.Json`: A dictionary containing classification results with scores, labels, and label text. **Examples:** Add a computed column for sentiment analysis: ```python theme={null} tbl.add_computed_column( sentiment=text_classification( tbl.review_text, model_id='cardiffnlp/twitter-roberta-base-sentiment-latest', ) ) ``` ## udf  text\_generation() ```python Signature theme={null} @pxt.udf text_generation( text: pxt.String, *, model_id: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.String ``` Generates text using a pretrained language model. `model_id` should be a reference to a pretrained [text generation model](https://huggingface.co/models?pipeline_tag=text-generation). **Requirements:** * `pip install torch transformers` **Parameters:** * **`text`** (`pxt.String`): The input text to continue/complete. * **`model_id`** (`pxt.String`): The pretrained model to use for text generation. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword arguments to pass to the model's `generate` method, such as `max_length`, `temperature`, etc. See the [Hugging Face text\_generation documentation](https://huggingface.co/docs/inference-providers/en/tasks/text-generation) for details. **Returns:** * `pxt.String`: The generated text completion. 
**Examples:** Add a computed column that generates text completions using the `Qwen/Qwen3-0.6B` model: ```python theme={null} tbl.add_computed_column( completion=text_generation( tbl.prompt, model_id='Qwen/Qwen3-0.6B', model_kwargs={'temperature': 0.5, 'max_length': 150}, ) ) ``` ## udf  text\_to\_image() ```python Signature theme={null} @pxt.udf text_to_image( prompt: pxt.String, *, model_id: pxt.String, height: pxt.Int = 512, width: pxt.Int = 512, seed: pxt.Int | None = None, model_kwargs: pxt.Json | None = None ) -> pxt.Image ``` Generates images from text prompts using a pretrained text-to-image model. `model_id` should be a reference to a pretrained [text-to-image model](https://huggingface.co/models?pipeline_tag=text-to-image) such as Stable Diffusion. **Requirements:** * `pip install torch transformers diffusers accelerate` **Parameters:** * **`prompt`** (`pxt.String`): The text prompt describing the desired image. * **`model_id`** (`pxt.String`): The pretrained text-to-image model to use. * **`height`** (`pxt.Int`): Height of the generated image in pixels. * **`width`** (`pxt.Int`): Width of the generated image in pixels. * **`seed`** (`pxt.Int | None`): Optional random seed for reproducibility. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword arguments to pass to the model, such as `num_inference_steps`, `guidance_scale`, or `negative_prompt`. **Returns:** * `pxt.Image`: The generated Image. 
**Examples:** Add a computed column that generates images from text prompts: ```python theme={null} tbl.add_computed_column( generated_image=text_to_image( tbl.prompt, model_id='stable-diffusion-v1-5/stable-diffusion-v1-5', height=512, width=512, model_kwargs={'num_inference_steps': 25}, ) ) ``` ## udf  text\_to\_speech() ```python Signature theme={null} @pxt.udf text_to_speech( text: pxt.String, *, model_id: pxt.String, speaker_id: pxt.Int | None = None, vocoder: pxt.String | None = None ) -> pxt.Audio ``` Converts text to speech using a pretrained TTS model. `model_id` should be a reference to a pretrained [text-to-speech model](https://huggingface.co/models?pipeline_tag=text-to-speech). **Requirements:** * `pip install torch transformers datasets soundfile` **Parameters:** * **`text`** (`pxt.String`): The text to convert to speech. * **`model_id`** (`pxt.String`): The pretrained TTS model to use. * **`speaker_id`** (`pxt.Int | None`): Speaker ID for multi-speaker models. * **`vocoder`** (`pxt.String | None`): Optional vocoder model for higher quality audio. **Returns:** * `pxt.Audio`: The generated audio file. **Examples:** Add a computed column that converts text to speech: ```python theme={null} tbl.add_computed_column( audio=text_to_speech( tbl.text_content, model_id='microsoft/speecht5_tts', speaker_id=0 ) ) ``` ## udf  token\_classification() ```python Signature theme={null} @pxt.udf token_classification( text: pxt.String, *, model_id: pxt.String, aggregation_strategy: pxt.String = 'simple' ) -> pxt.Json ``` Extracts named entities from text using a pretrained named entity recognition (NER) model. `model_id` should be a reference to a pretrained [token classification model](https://huggingface.co/models?pipeline_tag=token-classification) for NER. **Requirements:** * `pip install torch transformers` **Parameters:** * **`text`** (`pxt.String`): The text to analyze for named entities. * **`model_id`** (`pxt.String`): The pretrained model to use. 
* **`aggregation_strategy`** (`pxt.String`): Method used to aggregate tokens. **Returns:** * `pxt.Json`: A list of dictionaries containing entity information (text, label, confidence, start, end). **Examples:** Add a computed column that extracts named entities: ```python theme={null} tbl.add_computed_column( entities=token_classification( tbl.text, model_id='dbmdz/bert-large-cased-finetuned-conll03-english', ) ) ``` ## udf  translation() ```python Signature theme={null} @pxt.udf translation( text: pxt.String, *, model_id: pxt.String, src_lang: pxt.String | None = None, target_lang: pxt.String | None = None ) -> pxt.String ``` Translates text using a pretrained translation model. `model_id` should be a reference to a pretrained [translation model](https://huggingface.co/models?pipeline_tag=translation) such as MarianMT or T5. **Requirements:** * `pip install torch transformers sentencepiece` **Parameters:** * **`text`** (`pxt.String`): The text to translate. * **`model_id`** (`pxt.String`): The pretrained translation model to use. * **`src_lang`** (`pxt.String | None`): Source language code (optional, can be inferred from model). * **`target_lang`** (`pxt.String | None`): Target language code (optional, can be inferred from model). **Returns:** * `pxt.String`: The translated text. **Examples:** Add a computed column that translates text: ```python theme={null} tbl.add_computed_column( french_text=translation( tbl.english_text, model_id='Helsinki-NLP/opus-mt-en-fr', src_lang='en', target_lang='fr', ) ) ``` ## udf  vit\_for\_image\_classification() ```python Signature theme={null} @pxt.udf vit_for_image_classification( image: pxt.Image, *, model_id: pxt.String, top_k: pxt.Int = 5 ) -> pxt.Json ``` Computes image classifications for the specified image using a Vision Transformer (ViT) model. `model_id` should be a reference to a pretrained [ViT Model](https://huggingface.co/docs/transformers/en/model_doc/vit). 
**Note:** Be sure the model is a ViT model that is trained for image classification (that is, a model designed for use with the [ViTForImageClassification](https://huggingface.co/docs/transformers/en/model_doc/vit#transformers.ViTForImageClassification) class), such as `google/vit-base-patch16-224`. General feature-extraction models such as `google/vit-base-patch16-224-in21k` will not produce the desired results. **Requirements:** * `pip install torch transformers` **Parameters:** * **`image`** (`pxt.Image`): The image to classify. * **`model_id`** (`pxt.String`): The pretrained model to use for the classification. * **`top_k`** (`pxt.Int`): The number of classes to return. **Returns:** * `pxt.Json`: A dictionary containing the output of the image classification model, in the following format: ```python theme={null} { 'scores': [0.325, 0.198, 0.105], # list of probabilities of the top-k most likely classes 'labels': [340, 353, 386], # list of class IDs for the top-k most likely classes 'label_text': ['zebra', 'gazelle', 'African elephant, Loxodonta africana'], # corresponding text names of the top-k most likely classes } ``` **Examples:** Add a computed column that applies the model `google/vit-base-patch16-224` to an existing Pixeltable column `image` of the table `tbl`, returning the 10 most likely classes for each image: ```python theme={null} tbl.add_computed_column( image_class=vit_for_image_classification( tbl.image, model_id='google/vit-base-patch16-224', top_k=10 ) ) ``` # image Source: https://docs.pixeltable.com/sdk/latest/image View Source on GitHub # module  pixeltable.functions.image Pixeltable UDFs for `ImageType`. Example: ```python theme={null} import pixeltable as pxt t = pxt.get_table(...) t.select(t.img_col.convert('L')).collect() ``` ## iterator  tile\_iterator() ```python Signature theme={null} @pxt.iterator tile_iterator( image: pxt.Image, tile_size: pxt.Json, *, overlap: pxt.Json = (0, 0) ) ``` Iterator over tiles of an image. 
Each image will be divided into tiles of size `tile_size`, and the tiles will be iterated over in row-major order (left-to-right, then top-to-bottom). An optional `overlap` parameter may be specified. If the tiles do not exactly cover the image, then the rightmost and bottommost tiles will be padded with black pixels, so that the output images all have the exact size `tile_size`. **Outputs**: One row per tile, with the following columns: * `tile` (`pxt.Image`): The image tile * `tile_coord` (`pxt.Json`): The (x, y) coordinates of the tile in the grid of tiles * `tile_box` (`pxt.Json`): The (x1, y1, x2, y2) pixel coordinates of the tile in the original image **Parameters:** * **`image`** (`pxt.Image`): Image to split into tiles. * **`tile_size`** (`pxt.Json`): Size of each tile, as a pair of integers `(width, height)`. * **`overlap`** (`pxt.Json`): Amount of overlap between adjacent tiles, as a pair of integers `(width, height)`. **Examples:** This example assumes an existing table `tbl` with a column `img` of type `pxt.Image`. Create a view that splits all images into 256x256 tiles with 32 pixels of overlap: ```python theme={null} pxt.create_view( 'image_tiles', tbl, iterator=tile_iterator( tbl.img, tile_size=(256, 256), overlap=(32, 32) ), ) ``` ## udf  alpha\_composite() ```python Signature theme={null} @pxt.udf alpha_composite(im1: pxt.Image, im2: pxt.Image) -> pxt.Image ``` Alpha composite `im2` over `im1`. Equivalent to [`PIL.Image.alpha_composite()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.alpha_composite) ## udf  b64\_encode() ```python Signature theme={null} @pxt.udf b64_encode( img: pxt.Image, image_format: pxt.String = 'png' ) -> pxt.String ``` Convert image to a base64-encoded string. 
**Parameters:** * **`img`** (`pxt.Image`): image * **`image_format`** (`pxt.String`): image format [supported by PIL](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#fully-supported-formats) ## udf  blend() ```python Signature theme={null} @pxt.udf blend( im1: pxt.Image, im2: pxt.Image, alpha: pxt.Float ) -> pxt.Image ``` Return a new image by interpolating between two input images, using a constant alpha. Equivalent to [`PIL.Image.blend()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.blend) ## udf  composite() ```python Signature theme={null} @pxt.udf composite( image1: pxt.Image, image2: pxt.Image, mask: pxt.Image ) -> pxt.Image ``` Return a composite image by blending two images using a mask. Equivalent to [`PIL.Image.composite()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.composite) ## udf  convert() ```python Signature theme={null} @pxt.udf convert(self: pxt.Image, mode: pxt.String) -> pxt.Image ``` Convert the image to a different mode. Equivalent to [`PIL.Image.Image.convert()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.convert). **Parameters:** * **`mode`** (`pxt.String`): The mode to convert to. See the [Pillow documentation](https://pillow.readthedocs.io/en/stable/handbook/concepts.html#concept-modes) for a list of supported modes. ## udf  crop() ```python Signature theme={null} @pxt.udf crop(self: pxt.Image, box: pxt.Json) -> pxt.Image ``` Return a rectangular region from the image. The box is a 4-tuple defining the left, upper, right, and lower pixel coordinates. Equivalent to [`PIL.Image.Image.crop()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.crop) ## udf  effect\_spread() ```python Signature theme={null} @pxt.udf effect_spread(self: pxt.Image, distance: pxt.Int) -> pxt.Image ``` Randomly spread pixels in an image. 
Equivalent to [`PIL.Image.Image.effect_spread()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.effect_spread) **Parameters:** * **`distance`** (`pxt.Int`): The distance to spread pixels. ## udf  entropy() ```python Signature theme={null} @pxt.udf entropy( self: pxt.Image, mask: pxt.Image | None = None, extrema: pxt.Json | None = None ) -> pxt.Float ``` Returns the entropy of the image, optionally using a mask and extrema. Equivalent to [`PIL.Image.Image.entropy()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.entropy) **Parameters:** * **`mask`** (`pxt.Image | None`): An optional mask image. * **`extrema`** (`pxt.Json | None`): An optional list of extrema. ## udf  get\_metadata() ```python Signature theme={null} @pxt.udf get_metadata(self: pxt.Image) -> pxt.Json ``` Return metadata for the image. ## udf  getbands() ```python Signature theme={null} @pxt.udf getbands(self: pxt.Image) -> pxt.Json ``` Return a tuple containing the names of the image bands. Equivalent to [`PIL.Image.Image.getbands()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.getbands) ## udf  getbbox() ```python Signature theme={null} @pxt.udf getbbox( self: pxt.Image, *, alpha_only: pxt.Bool = True ) -> pxt.Json | None ``` Return a bounding box for the non-zero regions of the image. Equivalent to [`PIL.Image.Image.getbbox()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.getbbox) **Parameters:** * **`alpha_only`** (`pxt.Bool`): If `True`, and the image has an alpha channel, trim transparent pixels. Otherwise, trim pixels when all channels are zero. ## udf  getchannel() ```python Signature theme={null} @pxt.udf getchannel(self: pxt.Image, channel: pxt.Int) -> pxt.Image ``` Return an L-mode image containing a single channel of the original image. 
Equivalent to [`PIL.Image.Image.getchannel()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.getchannel) **Parameters:** * **`channel`** (`pxt.Int`): The channel to extract. This is a 0-based index. ## udf  getcolors() ```python Signature theme={null} @pxt.udf getcolors(self: pxt.Image, maxcolors: pxt.Int = 256) -> pxt.Json ``` Return a list of colors used in the image, up to a maximum of `maxcolors`. Equivalent to [`PIL.Image.Image.getcolors()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.getcolors) **Parameters:** * **`maxcolors`** (`pxt.Int`): The maximum number of colors to return. ## udf  getextrema() ```python Signature theme={null} @pxt.udf getextrema(self: pxt.Image) -> pxt.Json ``` Return a 2-tuple containing the minimum and maximum pixel values of the image. Equivalent to [`PIL.Image.Image.getextrema()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.getextrema) ## udf  getpalette() ```python Signature theme={null} @pxt.udf getpalette( self: pxt.Image, mode: pxt.String | None = None ) -> pxt.Json | None ``` Return the palette of the image, optionally converting it to a different mode. Equivalent to [`PIL.Image.Image.getpalette()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.getpalette) **Parameters:** * **`mode`** (`pxt.String | None`): The mode to convert the palette to. ## udf  getpixel() ```python Signature theme={null} @pxt.udf getpixel(self: pxt.Image, xy: pxt.Json) -> pxt.Json ``` Return the pixel value at the given position. The position `xy` is a tuple containing the x and y coordinates. Equivalent to [`PIL.Image.Image.getpixel()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.getpixel) **Parameters:** * **`xy`** (`pxt.Json`): The coordinates, given as (x, y). 
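Since these accessor UDFs are thin wrappers over Pillow, their behavior can be checked with PIL directly. The following standalone sketch (plain Pillow, not Pixeltable code) illustrates what `getpixel()`, `getbands()`, and `getextrema()` return:

```python theme={null}
from PIL import Image

# A 2x2 RGB image: red, green / blue, white
img = Image.new('RGB', (2, 2))
img.putpixel((0, 0), (255, 0, 0))
img.putpixel((1, 0), (0, 255, 0))
img.putpixel((0, 1), (0, 0, 255))
img.putpixel((1, 1), (255, 255, 255))

print(img.getpixel((0, 0)))   # (255, 0, 0)
print(img.getbands())         # ('R', 'G', 'B')
print(img.getextrema())       # ((0, 255), (0, 255), (0, 255)) -- per-band (min, max)
```

In Pixeltable, the same calls are applied to an image column, e.g. `tbl.select(tbl.img.getpixel([0, 0])).collect()`.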
## udf  getprojection() ```python Signature theme={null} @pxt.udf getprojection(self: pxt.Image) -> pxt.Json ``` Return two sequences representing the horizontal and vertical projection of the image. Equivalent to [`PIL.Image.Image.getprojection()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.getprojection) ## udf  height() ```python Signature theme={null} @pxt.udf height(self: pxt.Image) -> pxt.Int ``` Return the height of the image. ## udf  histogram() ```python Signature theme={null} @pxt.udf histogram( self: pxt.Image, mask: pxt.Image | None = None, extrema: pxt.Json | None = None ) -> pxt.Json ``` Return a histogram for the image. Equivalent to [`PIL.Image.Image.histogram()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.histogram) **Parameters:** * **`mask`** (`pxt.Image | None`): An optional mask image. * **`extrema`** (`pxt.Json | None`): An optional list of extrema. ## udf  mode() ```python Signature theme={null} @pxt.udf mode(self: pxt.Image) -> pxt.String ``` Return the image mode. ## udf  point() ```python Signature theme={null} @pxt.udf point( self: pxt.Image, lut: pxt.Json, mode: pxt.String | None = None ) -> pxt.Image ``` Map image pixels through a lookup table. Equivalent to [`PIL.Image.Image.point()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.point) **Parameters:** * **`lut`** (`pxt.Json`): A lookup table. ## udf  quantize() ```python Signature theme={null} @pxt.udf quantize( self: pxt.Image, colors: pxt.Int = 256, method: pxt.Int | None = None, kmeans: pxt.Int = 0, palette: pxt.Image | None = None, dither: pxt.Int = 3 ) -> pxt.Image ``` Convert the image to 'P' mode with the specified number of colors. Equivalent to [`PIL.Image.Image.quantize()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.quantize) **Parameters:** * **`colors`** (`pxt.Int`): The number of colors to quantize to. 
* **`method`** (`pxt.Int | None`): The quantization method. See the [Pillow documentation](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.quantize) for a list of supported methods. * **`kmeans`** (`pxt.Int`): The number of k-means clusters to use. * **`palette`** (`pxt.Image | None`): The palette to use. * **`dither`** (`pxt.Int`): The dithering method. See the [Pillow documentation](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.quantize) for a list of supported methods. ## udf  reduce() ```python Signature theme={null} @pxt.udf reduce( self: pxt.Image, factor: pxt.Int, box: pxt.Json | None = None ) -> pxt.Image ``` Reduce the image by the given factor. Equivalent to [`PIL.Image.Image.reduce()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.reduce) **Parameters:** * **`factor`** (`pxt.Int`): The reduction factor. * **`box`** (`pxt.Json | None`): An optional 4-tuple of ints providing the source image region to be reduced. The values must be within (0, 0, width, height) rectangle. If omitted or None, the entire source is used. ## udf  resize() ```python Signature theme={null} @pxt.udf resize(self: pxt.Image, size: pxt.Json) -> pxt.Image ``` Return a resized copy of the image. The size parameter is a tuple containing the width and height of the new image. Equivalent to [`PIL.Image.Image.resize()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.resize) ## udf  rotate() ```python Signature theme={null} @pxt.udf rotate(self: pxt.Image, angle: pxt.Int) -> pxt.Image ``` Return a copy of the image rotated by the given angle. Equivalent to [`PIL.Image.Image.rotate()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.rotate) **Parameters:** * **`angle`** (`pxt.Int`): The angle to rotate the image, in degrees. Positive angles are counter-clockwise. 
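Because `resize()` and `rotate()` also delegate to Pillow, their geometry can be illustrated with plain PIL (a standalone sketch, independent of Pixeltable):

```python theme={null}
from PIL import Image

img = Image.new('RGB', (100, 50))

# resize() produces exactly the requested dimensions
resized = img.resize((64, 64))
print(resized.size)   # (64, 64)

# rotate() keeps the original canvas size; rotated corners are clipped
rotated = img.rotate(45)
print(rotated.size)   # (100, 50)
```

In a Pixeltable query, the equivalent would be `tbl.select(tbl.img.resize([64, 64])).collect()`.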
## udf  thumbnail() ```python Signature theme={null} @pxt.udf thumbnail( self: pxt.Image, size: pxt.Json, resample: pxt.Int = 3, reducing_gap: pxt.Float | None = 2.0 ) -> pxt.Image ``` Create a thumbnail of the image. Equivalent to [`PIL.Image.Image.thumbnail()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.thumbnail) **Parameters:** * **`size`** (`pxt.Json`): The size of the thumbnail, as a tuple of (width, height). * **`resample`** (`pxt.Int`): The resampling filter to use. See the [Pillow documentation](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.thumbnail) for a list of supported filters. * **`reducing_gap`** (`pxt.Float | None`): The reducing gap to use. ## udf  transpose() ```python Signature theme={null} @pxt.udf transpose(self: pxt.Image, method: pxt.Int) -> pxt.Image ``` Transpose the image. Equivalent to [`PIL.Image.Image.transpose()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transpose) **Parameters:** * **`method`** (`pxt.Int`): The transpose method. See the [Pillow documentation](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transpose) for a list of supported methods. ## udf  width() ```python Signature theme={null} @pxt.udf width(self: pxt.Image) -> pxt.Int ``` Return the width of the image. # IndexMetadata Source: https://docs.pixeltable.com/sdk/latest/indexmetadata View Source on GitHub # class  pixeltable.IndexMetadata Metadata for a column of a Pixeltable table. ## attr  columns ``` columns: list[str] ``` The table columns that are indexed. ## attr  index\_type ``` index_type: Literal['embedding'] ``` The type of index (currently only `'embedding'` is supported, but others will be added in the future). ## attr  name ``` name: str ``` The name of the index. ## attr  parameters ``` parameters: EmbeddingIndexParams ``` Parameters specific to the index type. 
# io Source: https://docs.pixeltable.com/sdk/latest/io View Source on GitHub # module  pixeltable.io Functions for importing and exporting Pixeltable data. ## func  create\_label\_studio\_project() ```python Signature theme={null} create_label_studio_project( t: Table, label_config: str, name: str | None = None, title: str | None = None, media_import_method: Literal['post', 'file', 'url'] = 'post', col_mapping: dict[str, str] | None = None, sync_immediately: bool = True, s3_configuration: dict[str, Any] | None = None, **kwargs: Any ) -> UpdateStatus ``` Create a new Label Studio project and link it to the specified [`Table`](./table). * A tutorial notebook with fully worked examples can be found here: [Using Label Studio for Annotations with Pixeltable](https://docs.pixeltable.com/notebooks/integrations/using-label-studio-with-pixeltable) The required parameter `label_config` specifies the Label Studio project configuration, in XML format, as described in the Label Studio documentation. The linked project will have one column for each data field in the configuration; for example, if the configuration has an entry ``` <Image name="image" value="$image"/> ``` then the linked project will have a column named `image`. In addition, the linked project will always have a JSON-typed column `annotations` representing the output. By default, Pixeltable will link each of these columns to a column of the specified [`Table`](./table) with the same name. If any of the data fields are missing, an exception will be raised. If the `annotations` column is missing, it will be created. The default names can be overridden by specifying an optional `col_mapping`, with Pixeltable column names as keys and Label Studio field names as values. In all cases, the Pixeltable columns must have types that are consistent with their corresponding Label Studio fields; otherwise, an exception will be raised. The API key and URL for a valid Label Studio server must be specified in Pixeltable config. 
Either: * Set the `LABEL_STUDIO_API_KEY` and `LABEL_STUDIO_URL` environment variables; or * Specify `api_key` and `url` fields in the `label-studio` section of `$PIXELTABLE_HOME/config.toml`. **Requirements:** * `pip install label-studio-sdk` * `pip install boto3` (if using S3 import storage) **Parameters:** * **`t`** (`Table`): The table to link to. * **`label_config`** (`str`): The Label Studio project configuration, in XML format. * **`name`** (`str | None`): An optional name for the new project in Pixeltable. If specified, must be a valid Pixeltable identifier and must not be the name of any other external data store linked to `t`. If not specified, a default name will be used of the form `ls_project_0`, `ls_project_1`, etc. * **`title`** (`str | None`): An optional title for the Label Studio project. This is the title that annotators will see inside Label Studio. Unlike `name`, it does not need to be an identifier and does not need to be unique. If not specified, the table name `t.name` will be used. * **`media_import_method`** (`Literal['post', 'file', 'url']`, default: `'post'`): The method to use when transferring media files to Label Studio: * `post`: Media will be sent to Label Studio via HTTP post. This should generally only be used for prototyping; due to restrictions in Label Studio, it can only be used with projects that have just one data field, and does not scale well. * `file`: Media will be sent to Label Studio as a file on the local filesystem. This method can be used if Pixeltable and Label Studio are running on the same host. * `url`: Media will be sent to Label Studio as externally accessible URLs. This method cannot be used with local media files or with media generated by computed columns. The default is `post`. * **`col_mapping`** (`dict[str, str] | None`): An optional mapping of local column names to Label Studio fields. 
* **`sync_immediately`** (`bool`, default: `True`): If `True`, immediately perform an initial synchronization by exporting all rows of the table as Label Studio tasks. * **`s3_configuration`** (`dict[str, Any] | None`): If specified, S3 import storage will be configured for the new project. This can only be used with `media_import_method='url'`, and if `media_import_method='url'` and any of the media data is referenced by `s3://` URLs, then it must be specified in order for such media to display correctly in the Label Studio interface. The items in the `s3_configuration` dictionary correspond to kwarg parameters of the Label Studio `connect_s3_import_storage` method, as described in the [Label Studio connect\_s3\_import\_storage docs](https://labelstud.io/sdk/project.html#label_studio_sdk.project.Project.connect_s3_import_storage). `bucket` must be specified; all other parameters are optional. If credentials are not specified explicitly, Pixeltable will attempt to retrieve them from the environment (such as from `~/.aws/credentials`). If a title is not specified, Pixeltable will use the default `'Pixeltable-S3-Import-Storage'`. All other parameters use their Label Studio defaults. * **`kwargs`** (`Any`): Additional keyword arguments are passed to the `start_project` method in the Label Studio SDK, as described in the [Label Studio start\_project docs](https://labelstud.io/sdk/project.html#label_studio_sdk.project.Project.start_project). **Returns:** * `UpdateStatus`: An `UpdateStatus` representing the status of any synchronization operations that occurred. 
**Examples:** Create a Label Studio project whose tasks correspond to videos stored in the `video_col` column of the table `tbl`: ```python theme={null} config = """ <View> <Video name="video" value="$video_col"/> <Choices name="video-category" toName="video"> <Choice value="city"/> <Choice value="sports"/> </Choices> </View> """ create_label_studio_project(tbl, config) ``` Create a Label Studio project with the same configuration, using `media_import_method='url'`, whose media are stored in an S3 bucket: ```python theme={null} create_label_studio_project( tbl, config, media_import_method='url', s3_configuration={'bucket': 'my-bucket', 'region_name': 'us-east-2'}, ) ``` ## func  export\_images\_as\_fo\_dataset() ```python Signature theme={null} export_images_as_fo_dataset( tbl: pxt.Table, images: exprs.Expr, image_format: str = 'webp', classifications: exprs.Expr | list[exprs.Expr] | dict[str, exprs.Expr] | None = None, detections: exprs.Expr | list[exprs.Expr] | dict[str, exprs.Expr] | None = None ) -> fo.Dataset ``` Export images from a Pixeltable table as a Voxel51 dataset. The data must consist of a single column (or expression) containing image data, along with optional additional columns containing labels. Currently, only classification and detection labels are supported. The [Working with Voxel51 in Pixeltable](https://docs.pixeltable.com/examples/vision/voxel51) tutorial contains a fully worked example showing how to export data from a Pixeltable table and load it into Voxel51. Images in the dataset that already exist on disk will be exported directly, in whatever format they are stored in. Images that are not already on disk (such as frames extracted using a [`frame_iterator`](./video#iterator-frame_iterator)) will first be written to disk in the specified `image_format`. The label parameters accept one or more sets of labels of each type. If a single `Expr` is provided, then it will be exported as a single set of labels with a default name such as `classifications`. (The single set of labels may still contain multiple individual labels; see below.) 
If a list of `Expr`s is provided, then each one will be exported as a separate set of labels with a default name such as `classifications`, `classifications_1`, etc. If a dictionary of `Expr`s is provided, then each entry will be exported as a set of labels with the specified name. **Requirements:** * `pip install fiftyone` **Parameters:** * **`tbl`** (`pxt.Table`): The table from which to export data. * **`images`** (`exprs.Expr`): A column or expression that contains the images to export. * **`image_format`** (`str`, default: `'webp'`): The format to use when writing out images for export. * **`classifications`** (`exprs.Expr | list[exprs.Expr] | dict[str, exprs.Expr] | None`): Optional image classification labels. If a single `Expr` is provided, it must be a table column or an expression that evaluates to a list of dictionaries. Each dictionary in the list corresponds to an image class and must have the following structure: ```python theme={null} {'label': 'zebra', 'confidence': 0.325} ``` If multiple `Expr`s are provided, each one must evaluate to a list of such dictionaries. * **`detections`** (`exprs.Expr | list[exprs.Expr] | dict[str, exprs.Expr] | None`): Optional image detection labels. If a single `Expr` is provided, it must be a table column or an expression that evaluates to a list of dictionaries. Each dictionary in the list corresponds to an image detection, and must have the following structure: ```python theme={null} { 'label': 'giraffe', 'confidence': 0.99, # [x, y, w, h], fractional coordinates 'bounding_box': [0.081, 0.836, 0.202, 0.136], } ``` If multiple `Expr`s are provided, each one must evaluate to a list of such dictionaries. **Returns:** * `fo.Dataset`: A Voxel51 dataset. 
**Examples:** Export the images in the `image` column of the table `tbl` as a Voxel51 dataset, using classification labels from `tbl.classifications`: ```python theme={null} export_images_as_fo_dataset( tbl, tbl.image, classifications=tbl.classifications ) ``` ## func  export\_lancedb() ```python Signature theme={null} export_lancedb( table_or_query: pxt.Table | pxt.Query, db_uri: Path, table_name: str, batch_size_bytes: int = 134217728, if_exists: Literal['error', 'overwrite', 'append'] = 'error' ) -> None ``` Exports a table's or query's data to a LanceDB table. This utilizes LanceDB's streaming interface for efficient table creation, via a sequence of in-memory pyarrow `RecordBatches`, the size of which can be controlled with the `batch_size_bytes` parameter. **Requirements:** * `pip install lancedb` **Parameters:** * **`table_or_query`** (`pxt.Table | pxt.Query`): The table or query to export. * **`db_uri`** (`Path`): Local path to the LanceDB database. * **`table_name`** (`str`): Name of the table in the LanceDB database. * **`batch_size_bytes`** (`int`, default: `134217728`): Maximum size in bytes for each batch. * **`if_exists`** (`Literal['error', 'overwrite', 'append']`, default: `'error'`): Determines the behavior if the table already exists. Must be one of the following: * `'error'`: raise an error * `'overwrite'`: overwrite the existing table * `'append'`: append to the existing table ## func  export\_parquet() ```python Signature theme={null} export_parquet( table_or_query: pxt.Table | pxt.Query, parquet_path: Path, partition_size_bytes: int = 100000000, inline_images: bool = False, _write_md: bool = False ) -> None ``` Exports a query result or table to one or more Parquet files. Requires pyarrow to be installed. 
Pixeltable column types are mapped to Parquet types as follows: * String: string * Int: int64 * Float: float32 * Bool: bool * Timestamp: timestamp\[us, tz=UTC] * Date: date32 * UUID: uuid * Binary: binary * Image: binary (when `inline_images=True`) * Audio, Video, Document: string (file paths) * Array (requires shape to be known): * fixed\_shape\_tensor for fixed-shape arrays * list for ragged arrays (one or more dimensions are None) * Json: struct * Schema is inferred from data via `pyarrow.infer_type()` * Fields that contain empty dicts cannot be mapped to a Parquet type and will result in an exception **Parameters:** * **`table_or_query`** (`pxt.Table | pxt.Query`): The table or query to export. * **`parquet_path`** (`Path`): Path to the directory to write the Parquet files to. * **`partition_size_bytes`** (`int`, default: `100000000`): The maximum target size in bytes for each chunk. * **`inline_images`** (`bool`, default: `False`): If `True`, images are stored inline in the Parquet file. This is useful for small images that will be imported as a PyTorch dataset, but it can be inefficient for large images, and inlined images cannot be imported back into Pixeltable. If `False`, an error is raised if the query contains any image column. ## func  import\_csv() ```python Signature theme={null} import_csv( tbl_name: str, filepath_or_buffer: str | os.PathLike, schema_overrides: dict[str, typing.Any] | None = None, primary_key: str | list[str] | None = None, num_retained_versions: int = 10, comment: str = '', **kwargs: Any ) -> pixeltable.catalog.table.Table ``` Creates a new base table from a CSV file. This is a convenience method and is equivalent to calling `import_pandas(table_path, pd.read_csv(filepath_or_buffer, **kwargs), schema=schema)`. See the Pandas documentation for [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for more details. **Returns:** * `pixeltable.catalog.table.Table`: A handle to the newly created [`Table`](./table). 
## func  import\_excel() ```python Signature theme={null} import_excel( tbl_name: str, io: str | os.PathLike, *, schema_overrides: dict[str, typing.Any] | None = None, primary_key: str | list[str] | None = None, num_retained_versions: int = 10, comment: str = '', **kwargs: Any ) -> pixeltable.catalog.table.Table ``` Creates a new base table from an Excel (.xlsx) file. This is a convenience method and is equivalent to calling `import_pandas(table_path, pd.read_excel(io, *args, **kwargs), schema=schema)`. See the Pandas documentation for [`read_excel`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) for more details. **Returns:** * `pixeltable.catalog.table.Table`: A handle to the newly created [`Table`](./table). ## func  import\_huggingface\_dataset() ```python Signature theme={null} import_huggingface_dataset( table_path: str, dataset: datasets.Dataset | datasets.DatasetDict | datasets.IterableDataset | datasets.IterableDatasetDict, *, schema_overrides: dict[str, Any] | None = None, primary_key: str | list[str] | None = None, **kwargs: Any ) -> pxt.Table ``` Create a new base table from a Huggingface dataset, or dataset dict with multiple splits. Requires `datasets` library to be installed. HuggingFace feature types are mapped to Pixeltable column types as follows: * `Value(bool)`: `Bool`
* `Value(int*/uint*)`: `Int`
* `Value(float*)`: `Float`
* `Value(string/large_string)`: `String`
* `Value(timestamp*)`: `Timestamp`
* `Value(date*)`: `Date` * `ClassLabel`: `String` (converted to label names) * `Sequence`/`LargeList` of numeric types: `Array` * `Sequence`/`LargeList` of string: `Json` * `Sequence`/`LargeList` of dicts: `Json` * `Array2D`-`Array5D`: `Array` (preserves shape) * `Image`: `Image` * `Audio`: `Audio` * `Video`: `Video` * `Translation`/`TranslationVariableLanguages`: `Json` **Parameters:** * **`table_path`** (`str`): Path to the table. * **`dataset`** (`datasets.Dataset | datasets.DatasetDict | datasets.IterableDataset | datasets.IterableDatasetDict`): An instance of any of the Huggingface dataset classes: [`datasets.Dataset`](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset), [`datasets.DatasetDict`](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.DatasetDict), [`datasets.IterableDataset`](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.IterableDataset), [`datasets.IterableDatasetDict`](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.IterableDatasetDict) * **`schema_overrides`** (`dict[str, Any] | None`): If specified, then for each (name, type) pair in `schema_overrides`, the column with name `name` will be given type `type`, instead of being inferred from the `Dataset` or `DatasetDict`. The keys in `schema_overrides` should be the column names of the `Dataset` or `DatasetDict` (whether or not they are valid Pixeltable identifiers). * **`primary_key`** (`str | list[str] | None`): The primary key of the table (see [`create_table()`](./pixeltable#func-create_table)). * **`kwargs`** (`Any`): Additional arguments to pass to `create_table`. A `column_name_for_split` argument may be provided if the source is a `DatasetDict`; the named column will contain each row's split information. If `None`, no split information will be stored. **Returns:** * `pxt.Table`: A handle to the newly created [`Table`](./table). 
## func  import\_json() ```python Signature theme={null} import_json( tbl_path: str, filepath_or_url: str, *, schema_overrides: dict[str, Any] | None = None, primary_key: str | list[str] | None = None, num_retained_versions: int = 10, comment: str = '', **kwargs: Any ) -> pxt.Table ``` Creates a new base table from a JSON file. This is a convenience method and is equivalent to calling `import_data(table_path, json.loads(file_contents, **kwargs), ...)`, where `file_contents` is the contents of the specified `filepath_or_url`. **Parameters:** * **`tbl_path`** (`str`): The name of the table to create. * **`filepath_or_url`** (`str`): The path or URL of the JSON file. * **`schema_overrides`** (`dict[str, Any] | None`): If specified, then columns in `schema_overrides` will be given the specified types (see [`import_rows()`](./io#func-import_rows)). * **`primary_key`** (`str | list[str] | None`): The primary key of the table (see [`create_table()`](./pixeltable#func-create_table)). * **`num_retained_versions`** (`int`, default: `10`): The number of retained versions of the table (see [`create_table()`](./pixeltable#func-create_table)). * **`comment`** (`str`, default: `''`): A comment to attach to the table (see [`create_table()`](./pixeltable#func-create_table)). * **`kwargs`** (`Any`): Additional keyword arguments to pass to `json.loads`. **Returns:** * `pxt.Table`: A handle to the newly created [`Table`](./table). ## func  import\_pandas() ```python Signature theme={null} import_pandas( tbl_name: str, df: pandas.core.frame.DataFrame, *, schema_overrides: dict[str, typing.Any] | None = None, primary_key: str | list[str] | None = None, num_retained_versions: int = 10, comment: str = '' ) -> pixeltable.catalog.table.Table ``` Creates a new base table from a Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), with the specified name. The schema of the table will be inferred from the DataFrame. 
The column names of the new table will be identical to those in the DataFrame, as long as they are valid Pixeltable identifiers. If a column name is not a valid Pixeltable identifier, it will be normalized according to the following procedure: * first replace any non-alphanumeric characters with underscores; * then, preface the result with the letter 'c' if it begins with a number or an underscore; * then, if there are any duplicate column names, suffix the duplicates with '\_2', '\_3', etc., in column order. **Parameters:** * **`tbl_name`** (`str`): The name of the table to create. * **`df`** (`pandas.core.frame.DataFrame`): The Pandas `DataFrame`. * **`schema_overrides`** (`dict[str, typing.Any] | None`): If specified, then for each (name, type) pair in `schema_overrides`, the column with name `name` will be given type `type`, instead of being inferred from the `DataFrame`. The keys in `schema_overrides` should be the column names of the `DataFrame` (whether or not they are valid Pixeltable identifiers). **Returns:** * `pixeltable.catalog.table.Table`: A handle to the newly created [`Table`](./table). ## func  import\_parquet() ```python Signature theme={null} import_parquet( table: str, *, parquet_path: str, schema_overrides: dict[str, Any] | None = None, primary_key: str | list[str] | None = None, **kwargs: Any ) -> pxt.Table ``` Creates a new base table from a Parquet file or set of files. Requires pyarrow to be installed. **Parameters:** * **`table`** (`str`): Fully qualified name of the table to import the data into. * **`parquet_path`** (`str`): Path to an individual Parquet file or directory of Parquet files. * **`schema_overrides`** (`dict[str, Any] | None`): If specified, then for each (name, type) pair in `schema_overrides`, the column with name `name` will be given type `type`, instead of being inferred from the Parquet dataset. 
The keys in `schema_overrides` should be the column names of the Parquet dataset (whether or not they are valid Pixeltable identifiers). * **`primary_key`** (`str | list[str] | None`): The primary key of the table (see [`create_table()`](./pixeltable#func-create_table)). * **`kwargs`** (`Any`): Additional arguments to pass to `create_table`. **Returns:** * `pxt.Table`: A handle to the newly created table. ## func  import\_rows() ```python Signature theme={null} import_rows( tbl_path: str, rows: list[dict[str, Any]], *, schema_overrides: dict[str, Any] | None = None, primary_key: str | list[str] | None = None, num_retained_versions: int = 10, comment: str = '' ) -> pxt.Table ``` Creates a new base table from a list of dictionaries. The dictionaries must be of the form `{column_name: value, ...}`. Pixeltable will attempt to infer the schema of the table from the supplied data, using the most specific type that can represent all the values in a column. If `schema_overrides` is specified, then for each entry `(column_name, type)` in `schema_overrides`, Pixeltable will force the specified column to the specified type (and will not attempt any type inference for that column). All column types of the new table will be nullable unless explicitly specified as non-nullable in `schema_overrides`. **Parameters:** * **`tbl_path`** (`str`): The qualified name of the table to create. * **`rows`** (`list[dict[str, Any]]`): The list of dictionaries to import. * **`schema_overrides`** (`dict[str, Any] | None`): If specified, then columns in `schema_overrides` will be given the specified types as described above. * **`primary_key`** (`str | list[str] | None`): The primary key of the table (see [`create_table()`](./pixeltable#func-create_table)). * **`num_retained_versions`** (`int`, default: `10`): The number of retained versions of the table (see [`create_table()`](./pixeltable#func-create_table)). 
* **`comment`** (`str`, default: `''`): A comment to attach to the table (see [`create_table()`](./pixeltable#func-create_table)). **Returns:** * `pxt.Table`: A handle to the newly created [`Table`](./table). # jina Source: https://docs.pixeltable.com/sdk/latest/jina View Source on GitHub # module  pixeltable.functions.jina Pixeltable [UDFs](https://docs.pixeltable.com/platform/udfs-in-pixeltable) that wrap [Jina AI](https://jina.ai/) APIs for embeddings and reranking. In order to use them, the API key must be specified either with `JINA_API_KEY` environment variable, or as `api_key` in the `jina` section of the Pixeltable config file. ## udf  embeddings() ```python Signature theme={null} @pxt.udf embeddings( input: pxt.String, *, model: pxt.String, task: pxt.String | None = None, dimensions: pxt.Int | None = None, late_chunking: pxt.Bool | None = None ) -> pxt.Array[(None,), float32] ``` Creates embedding vectors for the input text using Jina AI embedding models. Equivalent to the Jina AI embeddings API endpoint. For additional details, see: [https://jina.ai/embeddings/](https://jina.ai/embeddings/) Request throttling: Applies the rate limit set in the config (section `jina`, key `rate_limit`). If no rate limit is configured, uses a default of 600 RPM. **Parameters:** * **`input`** (`pxt.String`): The text to embed. * **`model`** (`pxt.String`): The Jina embedding model to use. See available models at [https://jina.ai/embeddings/](https://jina.ai/embeddings/). * **`task`** (`pxt.String | None`): Task-specific embedding optimization. Options: * `retrieval.query`: For search queries * `retrieval.passage`: For documents/passages to be searched * `separation`: For clustering/separation tasks * `classification`: For classification tasks * `text-matching`: For semantic similarity * **`dimensions`** (`pxt.Int | None`): Output embedding dimensions (optional). If not specified, uses the model's default dimension. 
* **`late_chunking`** (`pxt.Bool | None`): Enable late chunking for long documents. **Returns:** * `pxt.Array[(None,), float32]`: An array representing the embedding of `input`. **Examples:** Add a computed column that applies jina-embeddings-v3 to an existing column: ```python theme={null} tbl.add_computed_column( embed=jina.embeddings( tbl.text, model='jina-embeddings-v3', task='retrieval.passage' ) ) ``` Add an embedding index: ```python theme={null} tbl.add_embedding_index( 'text', string_embed=jina.embeddings.using(model='jina-embeddings-v3') ) ``` ## udf  rerank() ```python Signature theme={null} @pxt.udf rerank( query: pxt.String, documents: pxt.Json, *, model: pxt.String, top_n: pxt.Int | None = None, return_documents: pxt.Bool | None = None ) -> pxt.Json ``` Reranks documents based on their relevance to a query using Jina AI reranker models. Equivalent to the Jina AI rerank API endpoint. For additional details, see: [https://jina.ai/reranker/](https://jina.ai/reranker/) Request throttling: Applies the rate limit set in the config (section `jina`, key `rate_limit`). If no rate limit is configured, uses a default of 600 RPM. **Parameters:** * **`query`** (`pxt.String`): The query string to rank documents against. * **`documents`** (`pxt.Json`): The list of documents to rerank. * **`model`** (`pxt.String`): The Jina reranker model to use. See available models at [https://jina.ai/reranker/](https://jina.ai/reranker/). * **`top_n`** (`pxt.Int | None`): Number of top results to return. If not specified, returns all documents. * **`return_documents`** (`pxt.Bool | None`): Whether to include the original document text in results. 
**Returns:** * `pxt.Json`: A dictionary containing: * `results`: List of reranking results with `index` and `relevance_score` (and `document` if `return_documents=True`) * `usage`: Token usage information **Examples:** Rerank search results for better relevance: ```python theme={null} tbl.add_computed_column( reranked=jina.rerank( tbl.query, tbl.candidate_docs, model='jina-reranker-v2-base-multilingual', top_n=5, ) ) ``` # json Source: https://docs.pixeltable.com/sdk/latest/json View Source on GitHub # module  pixeltable.functions.json Pixeltable UDFs for `JsonType`. Example: ```python theme={null} import pixeltable as pxt import pixeltable.functions as pxtf t = pxt.get_table(...) t.select(pxtf.json.make_list(t.json_col)).collect() ``` ## udf  dumps() ```python Signature theme={null} @pxt.udf dumps(obj: pxt.Json) -> pxt.String ``` Serialize a JSON object to a string. Equivalent to [`json.dumps()`](https://docs.python.org/3/library/json.html#json.dumps). **Parameters:** * **`obj`** (`pxt.Json`): A JSON-serializable object (dict, list, or scalar). **Returns:** * `pxt.String`: A JSON-formatted string. ## udf  make\_list() ```python Signature theme={null} @pxt.udf make_list(*args, **kwargs) -> pxt.Json ``` Collects arguments into a list. # llama_cpp Source: https://docs.pixeltable.com/sdk/latest/llama_cpp View Source on GitHub # module  pixeltable.functions.llama\_cpp Pixeltable UDFs for llama.cpp models. Provides integration with llama.cpp for running quantized language models locally, supporting chat completions and embeddings with GGUF format models. ## udf  create\_chat\_completion() ```python Signature theme={null} @pxt.udf create_chat_completion( messages: pxt.Json, *, model_path: pxt.String | None = None, repo_id: pxt.String | None = None, repo_filename: pxt.String | None = None, model_kwargs: pxt.Json | None = None ) -> pxt.Json ``` Generate a chat completion from a list of messages. 
The model can be specified either as a local path, or as a repo\_id and repo\_filename that reference a pretrained model on the Hugging Face model hub. Exactly one of `model_path` or `repo_id` must be provided; if `repo_id` is provided, then an optional `repo_filename` can also be specified. For additional details, see the [llama\_cpp create\_chat\_completion documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion). **Parameters:** * **`messages`** (`pxt.Json`): A list of messages to generate a response for. * **`model_path`** (`pxt.String | None`): Path to the model (if using a local model). * **`repo_id`** (`pxt.String | None`): The Hugging Face model repo id (if using a pretrained model). * **`repo_filename`** (`pxt.String | None`): A filename or glob pattern to match the model file in the repo (optional, if using a pretrained model). * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the llama\_cpp `create_chat_completion` API, such as `max_tokens`, `temperature`, `top_p`, and `top_k`. For details, see the [llama\_cpp create\_chat\_completion documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion). # math Source: https://docs.pixeltable.com/sdk/latest/math View Source on GitHub # module  pixeltable.functions.math Pixeltable UDFs for mathematical operations. Example: ```python theme={null} import pixeltable as pxt t = pxt.get_table(...) t.select(t.float_col.floor()).collect() ``` ## udf  abs() ```python Signature theme={null} @pxt.udf abs(self: pxt.Float) -> pxt.Float ``` Return the absolute value of the given number. Equivalent to Python [`builtins.abs()`](https://docs.python.org/3/library/functions.html#abs). ## udf  bitwise\_and() ```python Signature theme={null} @pxt.udf bitwise_and(self: pxt.Int, other: pxt.Int) -> pxt.Int ``` Bitwise AND of two integers. 
Equivalent to Python [`self & other`](https://docs.python.org/3/library/stdtypes.html#bitwise-operations-on-integer-types). ## udf  bitwise\_or() ```python Signature theme={null} @pxt.udf bitwise_or(self: pxt.Int, other: pxt.Int) -> pxt.Int ``` Bitwise OR of two integers. Equivalent to Python [`self | other`](https://docs.python.org/3/library/stdtypes.html#bitwise-operations-on-integer-types). ## udf  bitwise\_xor() ```python Signature theme={null} @pxt.udf bitwise_xor(self: pxt.Int, other: pxt.Int) -> pxt.Int ``` Bitwise XOR of two integers. Equivalent to Python [`self ^ other`](https://docs.python.org/3/library/stdtypes.html#bitwise-operations-on-integer-types). ## udf  ceil() ```python Signature theme={null} @pxt.udf ceil(self: pxt.Float) -> pxt.Float ``` Return the ceiling of the given number. Equivalent to Python [`float(math.ceil(self))`](https://docs.python.org/3/library/math.html#math.ceil) if `self` is finite, or `self` itself if `self` is infinite. (This is slightly different from the default behavior of `math.ceil(self)`, which always returns an `int` and raises an error if `self` is infinite. The behavior in Pixeltable generalizes the Python operator and is chosen to align with the SQL standard.) ## udf  floor() ```python Signature theme={null} @pxt.udf floor(self: pxt.Float) -> pxt.Float ``` Return the floor of the given number. Equivalent to Python [`float(math.floor(self))`](https://docs.python.org/3/library/math.html#math.floor) if `self` is finite, or `self` itself if `self` is infinite. (This is slightly different from the default behavior of `math.floor(self)`, which always returns an `int` and raises an error if `self` is infinite. The behavior in Pixeltable generalizes the Python operator and is chosen to align with the SQL standard.) ## udf  pow() ```python Signature theme={null} @pxt.udf pow(self: pxt.Int, other: pxt.Int) -> pxt.Float ``` Raise `self` to the power of `other`. 
Equivalent to Python [`self ** other`](https://docs.python.org/3/library/functions.html#pow). ## udf  round() ```python Signature theme={null} @pxt.udf round( self: pxt.Float, digits: pxt.Int | None = None ) -> pxt.Float ``` Round a number to a given precision in decimal digits. Equivalent to Python [`builtins.round(self, digits or 0)`](https://docs.python.org/3/library/functions.html#round). Note that if `digits` is not specified, the behavior matches `builtins.round(self, 0)` rather than `builtins.round(self)`; this ensures that the return type is always `float` (as in SQL) rather than `int`. # mistralai Source: https://docs.pixeltable.com/sdk/latest/mistralai View Source on GitHub # module  pixeltable.functions.mistralai Pixeltable UDFs that wrap various endpoints from the Mistral AI API. In order to use them, you must first `pip install mistralai` and configure your Mistral AI credentials, as described in the [Working with Mistral AI](https://docs.pixeltable.com/notebooks/integrations/working-with-mistralai) tutorial. ## udf  chat\_completions() ```python Signature theme={null} @pxt.udf chat_completions( messages: pxt.Json, *, model: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.Json ``` Chat Completion API. Equivalent to the Mistral AI `chat/completions` API endpoint. For additional details, see: [https://docs.mistral.ai/api/#tag/chat](https://docs.mistral.ai/api/#tag/chat) Request throttling: Applies the rate limit set in the config (section `mistral`, key `rate_limit`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install mistralai` **Parameters:** * **`messages`** (`pxt.Json`): The prompt(s) to generate completions for. * **`model`** (`pxt.String`): ID of the model to use. (See overview here: [https://docs.mistral.ai/getting-started/models/](https://docs.mistral.ai/getting-started/models/)) * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the Mistral `chat/completions` API. 
For details on the available parameters, see: [https://docs.mistral.ai/api/#tag/chat](https://docs.mistral.ai/api/#tag/chat) **Returns:** * `pxt.Json`: A dictionary containing the response and other metadata. **Examples:** Add a computed column that applies the model `mistral-small-latest` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} messages = [{'role': 'user', 'content': tbl.prompt}] tbl.add_computed_column( response=chat_completions(messages, model='mistral-small-latest') ) ``` ## udf  embeddings() ```python Signature theme={null} @pxt.udf embeddings( input: pxt.String, *, model: pxt.String ) -> pxt.Array[(None,), float32] ``` Embeddings API. Equivalent to the Mistral AI `embeddings` API endpoint. For additional details, see: [https://docs.mistral.ai/api/#tag/embeddings](https://docs.mistral.ai/api/#tag/embeddings) Request throttling: Applies the rate limit set in the config (section `mistral`, key `rate_limit`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install mistralai` **Parameters:** * **`input`** (`pxt.String`): Text to embed. * **`model`** (`pxt.String`): ID of the model to use. (See overview here: [https://docs.mistral.ai/getting-started/models/](https://docs.mistral.ai/getting-started/models/)) **Returns:** * `pxt.Array[(None,), float32]`: An array representing the embedding of `input`. ## udf  fim\_completions() ```python Signature theme={null} @pxt.udf fim_completions( prompt: pxt.String, *, model: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.Json ``` Fill-in-the-middle Completion API. Equivalent to the Mistral AI `fim/completions` API endpoint. For additional details, see: [https://docs.mistral.ai/api/#tag/fim](https://docs.mistral.ai/api/#tag/fim) Request throttling: Applies the rate limit set in the config (section `mistral`, key `rate_limit`). If no rate limit is configured, uses a default of 600 RPM. 
**Requirements:** * `pip install mistralai` **Parameters:** * **`prompt`** (`pxt.String`): The text/code to complete. * **`model`** (`pxt.String`): ID of the model to use. (See overview here: [https://docs.mistral.ai/getting-started/models/](https://docs.mistral.ai/getting-started/models/)) * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the Mistral `fim/completions` API. For details on the available parameters, see: [https://docs.mistral.ai/api/#tag/fim](https://docs.mistral.ai/api/#tag/fim) **Returns:** * `pxt.Json`: A dictionary containing the response and other metadata. **Examples:** Add a computed column that applies the model `codestral-latest` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} tbl.add_computed_column( response=fim_completions(tbl.prompt, model='codestral-latest') ) ``` # net Source: https://docs.pixeltable.com/sdk/latest/net View Source on GitHub # module  pixeltable.functions.net Pixeltable UDF for converting media file URIs to presigned HTTP URLs. ## udf  presigned\_url() ```python Signature theme={null} @pxt.udf presigned_url(uri: pxt.String, expiration_seconds: pxt.Int) -> pxt.String ``` Convert a blob storage URI to a presigned HTTP URL for direct access. Generates a time-limited, publicly accessible URL from cloud storage URIs (S3, GCS, Azure, etc.) that can be used to serve media files over HTTP. Note: This function uses presigned URLs from storage providers. 
Provider-specific limitations apply: * Google Cloud Storage: maximum 7-day expiration * AWS S3: requires proper region configuration * Azure: subject to storage account access policies **Parameters:** * **`uri`** (`pxt.String`): The media file URI (e.g., `s3://bucket/path`, `gs://bucket/path`, `azure://container/path`) * **`expiration_seconds`** (`pxt.Int`): How long the URL remains valid **Returns:** * `pxt.String`: A presigned HTTP URL for accessing the file **Examples:** Generate a presigned URL for a video column with 1-hour expiration: ```python theme={null} tbl.select( original_url=tbl.video.fileurl, presigned_url=pxtf.net.presigned_url(tbl.video.fileurl, 3600), ).collect() ``` # ollama Source: https://docs.pixeltable.com/sdk/latest/ollama View Source on GitHub # module  pixeltable.functions.ollama Pixeltable UDFs for Ollama local models. Provides integration with Ollama for running large language models locally, including chat completions and embeddings. ## udf  chat() ```python Signature theme={null} @pxt.udf chat( messages: pxt.Json, *, model: pxt.String, tools: pxt.Json | None = None, format: pxt.String | None = None, options: pxt.Json | None = None ) -> pxt.Json ``` Generate the next message in a chat with a provided model. **Parameters:** * **`messages`** (`pxt.Json`): The messages of the chat. * **`model`** (`pxt.String`): The model name. * **`tools`** (`pxt.Json | None`): Tools for the model to use. * **`format`** (`pxt.String | None`): The format of the response; must be one of `'json'` or `None`. * **`options`** (`pxt.Json | None`): Additional options to pass to the `chat` call, such as `max_tokens`, `temperature`, `top_p`, and `top_k`. For details, see the [Valid Parameters and Values](https://github.com/ollama/ollama/blob/main/docs/modelfile.mdx#valid-parameters-and-values) section of the Ollama documentation. 
## udf  embed() ```python Signature theme={null} @pxt.udf embed( input: pxt.String, *, model: pxt.String, truncate: pxt.Bool = True, options: pxt.Json | None = None ) -> pxt.Array[(None,), float32] ``` Generate embeddings from a model. **Parameters:** * **`input`** (`pxt.String`): The input text to generate embeddings for. * **`model`** (`pxt.String`): The model name. * **`truncate`** (`pxt.Bool`): Truncates the end of each input to fit within the context length. If `False`, an error is returned when the context length is exceeded. * **`options`** (`pxt.Json | None`): Additional options to pass to the `embed` call. For details, see the [Valid Parameters and Values](https://github.com/ollama/ollama/blob/main/docs/modelfile.mdx#valid-parameters-and-values) section of the Ollama documentation. ## udf  generate() ```python Signature theme={null} @pxt.udf generate( prompt: pxt.String, *, model: pxt.String, suffix: pxt.String = '', system: pxt.String = '', template: pxt.String = '', context: pxt.Json | None = None, raw: pxt.Bool = False, format: pxt.String | None = None, options: pxt.Json | None = None ) -> pxt.Json ``` Generate a response for a given prompt with a provided model. **Parameters:** * **`prompt`** (`pxt.String`): The prompt to generate a response for. * **`model`** (`pxt.String`): The model name. * **`suffix`** (`pxt.String`): The text after the model response. * **`format`** (`pxt.String | None`): The format of the response; must be one of `'json'` or `None`. * **`system`** (`pxt.String`): System message. * **`template`** (`pxt.String`): Prompt template to use. * **`context`** (`pxt.Json | None`): The context parameter returned from a previous call to `generate()`. * **`raw`** (`pxt.Bool`): If `True`, no formatting will be applied to the prompt. * **`options`** (`pxt.Json | None`): Additional options for the Ollama `generate` call, such as `max_tokens`, `temperature`, `top_p`, and `top_k`. 
For details, see the [Valid Parameters and Values](https://github.com/ollama/ollama/blob/main/docs/modelfile.mdx#valid-parameters-and-values) section of the Ollama documentation. # openai Source: https://docs.pixeltable.com/sdk/latest/openai View Source on GitHub # module  pixeltable.functions.openai Pixeltable UDFs that wrap various endpoints from the OpenAI API. In order to use them, you must first `pip install openai` and configure your OpenAI credentials, as described in the [Working with OpenAI](https://docs.pixeltable.com/notebooks/integrations/working-with-openai) tutorial. ## func  invoke\_tools() ```python Signature theme={null} invoke_tools( tools: pixeltable.func.tools.Tools, response: pixeltable.exprs.expr.Expr ) -> pixeltable.exprs.inline_expr.InlineDict ``` Converts an OpenAI response dict to Pixeltable tool invocation format and calls `tools._invoke()`. ## udf  chat\_completions() ```python Signature theme={null} @pxt.udf chat_completions( messages: pxt.Json, *, model: pxt.String, model_kwargs: pxt.Json | None = None, tools: pxt.Json | None = None, tool_choice: pxt.Json | None = None ) -> pxt.Json ``` Creates a model response for the given chat conversation. Equivalent to the OpenAI `chat/completions` API endpoint. For additional details, see: [https://platform.openai.com/docs/guides/chat-completions](https://platform.openai.com/docs/guides/chat-completions) Request throttling: Uses the rate limit-related headers returned by the API to throttle requests adaptively, based on available request and token capacity. No configuration is necessary. **Requirements:** * `pip install openai` **Parameters:** * **`messages`** (`pxt.Json`): A list of messages to use for chat completion, as described in the OpenAI API documentation. * **`model`** (`pxt.String`): The model to use for chat completion. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the OpenAI `chat/completions` API. 
For details on the available parameters, see: [https://platform.openai.com/docs/api-reference/chat/create](https://platform.openai.com/docs/api-reference/chat/create) **Returns:** * `pxt.Json`: A dictionary containing the response and other metadata. **Examples:** Add a computed column that applies the model `gpt-4o-mini` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} messages = [ {'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': tbl.prompt}, ] tbl.add_computed_column( response=chat_completions(messages, model='gpt-4o-mini') ) ``` You can also include images in the messages list by passing image data directly in the input dictionary, in the `'image_url'` field of the message content, as in this example: ```python theme={null} messages = [ { 'role': 'user', 'content': [ {'type': 'text', 'text': "What's in this image?"}, {'type': 'image_url', 'image_url': tbl.image}, ], } ] tbl.add_computed_column( response=chat_completions(messages, model='gpt-4o-mini') ) ``` ## udf  embeddings() ```python Signature theme={null} @pxt.udf embeddings( input: pxt.String, *, model: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.Array[(None,), float32] ``` Creates an embedding vector representing the input text. Equivalent to the OpenAI `embeddings` API endpoint. For additional details, see: [https://platform.openai.com/docs/guides/embeddings](https://platform.openai.com/docs/guides/embeddings) Request throttling: Uses the rate limit-related headers returned by the API to throttle requests adaptively, based on available request and token capacity. No configuration is necessary. **Requirements:** * `pip install openai` **Parameters:** * **`input`** (`pxt.String`): The text to embed. * **`model`** (`pxt.String`): The model to use for the embedding. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the OpenAI `embeddings` API. 
For details on the available parameters, see: [https://platform.openai.com/docs/api-reference/embeddings](https://platform.openai.com/docs/api-reference/embeddings) **Returns:** * `pxt.Array[(None,), float32]`: An array representing the application of the given embedding to `input`. **Examples:** Add a computed column that applies the model `text-embedding-3-small` to an existing Pixeltable column `tbl.text` of the table `tbl`: ```python theme={null} tbl.add_computed_column( embed=embeddings(tbl.text, model='text-embedding-3-small') ) ``` Add an embedding index to an existing column `text`, using the model `text-embedding-3-small`: ```python theme={null} tbl.add_embedding_index( embedding=embeddings.using(model='text-embedding-3-small') ) ``` ## udf  image\_generations() ```python Signature theme={null} @pxt.udf image_generations( prompt: pxt.String, *, model: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.Json ``` Creates an image given a prompt. Equivalent to the OpenAI `images/generations` API endpoint. For additional details, see: [https://platform.openai.com/docs/guides/images](https://platform.openai.com/docs/guides/images) Request throttling: Applies the rate limit set in the config (section `openai.rate_limits`; use the model id as the key). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install openai` **Parameters:** * **`prompt`** (`pxt.String`): Prompt for the image. * **`model`** (`pxt.String`): The model to use for the generations. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the OpenAI `images/generations` API. For details on the available parameters, see: [https://platform.openai.com/docs/api-reference/images/create](https://platform.openai.com/docs/api-reference/images/create) **Returns:** * `pxt.Json`: A dictionary containing the generated image data. 
Images will be deserialized into `PIL.Image.Image` objects, and the result dictionary will have the following form: ```json theme={null} { "created": 1234567890, "data": [ PIL.Image.Image(...), PIL.Image.Image(...), ... ], "usage": {...} } ``` **Examples:** Add a computed column that applies the model `dall-e-2` to an existing Pixeltable column `tbl.text` of the table `tbl`: ```python theme={null} tbl.add_computed_column( gen_image=image_generations(tbl.text, model='dall-e-2') ) ``` ## udf  moderations() ```python Signature theme={null} @pxt.udf moderations( input: pxt.String, *, model: pxt.String = 'omni-moderation-latest' ) -> pxt.Json ``` Classifies if text is potentially harmful. Equivalent to the OpenAI `moderations` API endpoint. For additional details, see: [https://platform.openai.com/docs/guides/moderation](https://platform.openai.com/docs/guides/moderation) Request throttling: Applies the rate limit set in the config (section `openai.rate_limits`; use the model id as the key). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install openai` **Parameters:** * **`input`** (`pxt.String`): Text to analyze with the moderations model. * **`model`** (`pxt.String`): The model to use for moderations. **Returns:** * `pxt.Json`: Details of the moderations results. **Examples:** Add a computed column that applies the model `text-moderation-stable` to an existing Pixeltable column `tbl.text` of the table `tbl`: ```python theme={null} tbl.add_computed_column( moderations=moderations(tbl.text, model='text-moderation-stable') ) ``` ## udf  speech() ```python Signature theme={null} @pxt.udf speech( input: pxt.String, *, model: pxt.String, voice: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.Audio ``` Generates audio from the input text. Equivalent to the OpenAI `audio/speech` API endpoint. 
For additional details, see: [https://platform.openai.com/docs/guides/text-to-speech](https://platform.openai.com/docs/guides/text-to-speech) Request throttling: Applies the rate limit set in the config (section `openai.rate_limits`; use the model id as the key). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install openai` **Parameters:** * **`input`** (`pxt.String`): The text to synthesize into speech. * **`model`** (`pxt.String`): The model to use for speech synthesis. * **`voice`** (`pxt.String`): The voice profile to use for speech synthesis. Supported options include: `alloy`, `echo`, `fable`, `onyx`, `nova`, and `shimmer`. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the OpenAI `audio/speech` API. For details on the available parameters, see: [https://platform.openai.com/docs/api-reference/audio/createSpeech](https://platform.openai.com/docs/api-reference/audio/createSpeech) **Returns:** * `pxt.Audio`: An audio file containing the synthesized speech. **Examples:** Add a computed column that applies the model `tts-1` to an existing Pixeltable column `tbl.text` of the table `tbl`: ```python theme={null} tbl.add_computed_column( audio=speech(tbl.text, model='tts-1', voice='nova') ) ``` ## udf  transcriptions() ```python Signature theme={null} @pxt.udf transcriptions( audio: pxt.Audio, *, model: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.Json ``` Transcribes audio into the input language. Equivalent to the OpenAI `audio/transcriptions` API endpoint. For additional details, see: [https://platform.openai.com/docs/guides/speech-to-text](https://platform.openai.com/docs/guides/speech-to-text) Request throttling: Applies the rate limit set in the config (section `openai.rate_limits`; use the model id as the key). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install openai` **Parameters:** * **`audio`** (`pxt.Audio`): The audio to transcribe. 
* **`model`** (`pxt.String`): The model to use for speech transcription. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the OpenAI `audio/transcriptions` API. For details on the available parameters, see: [https://platform.openai.com/docs/api-reference/audio/createTranscription](https://platform.openai.com/docs/api-reference/audio/createTranscription) **Returns:** * `pxt.Json`: A dictionary containing the transcription and other metadata. **Examples:** Add a computed column that applies the model `whisper-1` to an existing Pixeltable column `tbl.audio` of the table `tbl`: ```python theme={null} tbl.add_computed_column( transcription=transcriptions( tbl.audio, model='whisper-1', model_kwargs={'language': 'en'} ) ) ``` ## udf  translations() ```python Signature theme={null} @pxt.udf translations( audio: pxt.Audio, *, model: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.Json ``` Translates audio into English. Equivalent to the OpenAI `audio/translations` API endpoint. For additional details, see: [https://platform.openai.com/docs/guides/speech-to-text](https://platform.openai.com/docs/guides/speech-to-text) Request throttling: Applies the rate limit set in the config (section `openai.rate_limits`; use the model id as the key). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install openai` **Parameters:** * **`audio`** (`pxt.Audio`): The audio to translate. * **`model`** (`pxt.String`): The model to use for speech transcription and translation. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the OpenAI `audio/translations` API. For details on the available parameters, see: [https://platform.openai.com/docs/api-reference/audio/createTranslation](https://platform.openai.com/docs/api-reference/audio/createTranslation) **Returns:** * `pxt.Json`: A dictionary containing the translation and other metadata. 
**Examples:** Add a computed column that applies the model `whisper-1` to an existing Pixeltable column `tbl.audio` of the table `tbl`: ```python theme={null} tbl.add_computed_column( translation=translations(tbl.audio, model='whisper-1') ) ``` ## udf  vision() ```python Signature theme={null} @pxt.udf vision( prompt: pxt.String, image: pxt.Image, *, model: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.String ``` Analyzes an image with the OpenAI vision capability. This is a convenience function that takes an image and prompt, and constructs a chat completion request that utilizes OpenAI vision. For additional details, see: [https://platform.openai.com/docs/guides/vision](https://platform.openai.com/docs/guides/vision) Request throttling: Uses the rate limit-related headers returned by the API to throttle requests adaptively, based on available request and token capacity. No configuration is necessary. **Requirements:** * `pip install openai` **Parameters:** * **`prompt`** (`pxt.String`): A prompt for the OpenAI vision request. * **`image`** (`pxt.Image`): The image to analyze. * **`model`** (`pxt.String`): The model to use for OpenAI vision. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the underlying OpenAI `chat/completions` API. **Returns:** * `pxt.String`: The text of the model's response. **Examples:** Add a computed column that applies the model `gpt-4o-mini` to an existing Pixeltable column `tbl.image` of the table `tbl`: ```python theme={null} tbl.add_computed_column( response=vision( "What's in this image?", tbl.image, model='gpt-4o-mini' ) ) ``` # openrouter Source: https://docs.pixeltable.com/sdk/latest/openrouter View Source on GitHub # module  pixeltable.functions.openrouter Pixeltable UDFs that wrap the OpenRouter API. OpenRouter provides a unified interface to multiple LLM providers. In order to use it, you must first sign up at [https://openrouter.ai](https://openrouter.ai), create an API key, and configure it as described in the Working with OpenRouter tutorial. 
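As a minimal sketch of that setup step, assuming the key is supplied via the conventional `OPENROUTER_API_KEY` environment variable (the variable name is an assumption here; see the Working with OpenRouter tutorial for the authoritative configuration options):

```python theme={null}
import os

# Assumption: the OpenRouter credential is read from this environment
# variable, following the usual provider-key convention.
# 'sk-or-your-key' is a placeholder, not a real key.
os.environ.setdefault('OPENROUTER_API_KEY', 'sk-or-your-key')
```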
## udf  chat\_completions() ```python Signature theme={null} @pxt.udf chat_completions( messages: pxt.Json, *, model: pxt.String, model_kwargs: pxt.Json | None = None, tools: pxt.Json | None = None, tool_choice: pxt.Json | None = None, provider: pxt.Json | None = None, transforms: pxt.Json | None = None ) -> pxt.Json ``` Chat Completion API via OpenRouter. OpenRouter provides access to multiple LLM providers through a unified API. For additional details, see: [https://openrouter.ai/docs](https://openrouter.ai/docs) Supported models can be found at: [https://openrouter.ai/models](https://openrouter.ai/models) Request throttling: Applies the rate limit set in the config (section `openrouter`, key `rate_limit`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install openai` **Parameters:** * **`messages`** (`pxt.Json`): A list of messages comprising the conversation so far. * **`model`** (`pxt.String`): ID of the model to use (e.g., `'anthropic/claude-3.5-sonnet'`, `'openai/gpt-4'`). * **`model_kwargs`** (`pxt.Json | None`): Additional OpenAI-compatible parameters. * **`tools`** (`pxt.Json | None`): List of tools available to the model. * **`tool_choice`** (`pxt.Json | None`): Controls which (if any) tool is called by the model. * **`provider`** (`pxt.Json | None`): OpenRouter-specific provider preferences (e.g., `{'order': ['Anthropic', 'OpenAI']}`). * **`transforms`** (`pxt.Json | None`): List of message transforms to apply (e.g., `['middle-out']`). **Returns:** * `pxt.Json`: A dictionary containing the response in OpenAI format. 
**Examples:** Basic chat completion: ```python theme={null} messages = [{'role': 'user', 'content': tbl.prompt}] tbl.add_computed_column( response=chat_completions( messages, model='anthropic/claude-3.5-sonnet' ) ) ``` With provider routing: ```python theme={null} tbl.add_computed_column( response=chat_completions( messages, model='anthropic/claude-3.5-sonnet', provider={'require_parameters': True, 'order': ['Anthropic']}, ) ) ``` With transforms: ```python theme={null} tbl.add_computed_column( response=chat_completions( messages, model='openai/gpt-4', transforms=['middle-out'], # Optimize for long contexts ) ) ``` # pixeltable Source: https://docs.pixeltable.com/sdk/latest/pixeltable View Source on GitHub # module  pixeltable Core Pixeltable API for table operations, data processing, and UDF management. ## func  create\_dir() ```python Signature theme={null} create_dir( path: str, *, if_exists: Literal['error', 'ignore', 'replace', 'replace_force'] = 'error', parents: bool = False ) -> catalog.Dir | None ``` Create a directory. **Parameters:** * **`path`** (`str`): Path to the directory. * **`if_exists`** (`Literal['error', 'ignore', 'replace', 'replace_force']`, default: `'error'`): Directive regarding how to handle if the path already exists. Must be one of the following: * `'error'`: raise an error * `'ignore'`: do nothing and return the existing directory handle * `'replace'`: if the existing directory is empty, drop it and create a new one * `'replace_force'`: drop the existing directory and all its children, and create a new one * **`parents`** (`bool`, default: `False`): Create missing parent directories. **Returns:** * `catalog.Dir | None`: A handle to the newly created directory, or to an already existing directory at the path when `if_exists='ignore'`. Please note the existing directory may not be empty. 
**Examples:** ```python theme={null} pxt.create_dir('my_dir') ``` Create a subdirectory: ```python theme={null} pxt.create_dir('my_dir/sub_dir') ``` Create a subdirectory only if it does not already exist, otherwise do nothing: ```python theme={null} pxt.create_dir('my_dir/sub_dir', if_exists='ignore') ``` Create a directory and replace if it already exists: ```python theme={null} pxt.create_dir('my_dir', if_exists='replace_force') ``` Create a subdirectory along with its ancestors: ```python theme={null} pxt.create_dir('parent1/parent2/sub_dir', parents=True) ``` ## func  create\_snapshot() ```python Signature theme={null} create_snapshot( path_str: str, base: catalog.Table | Query, *, additional_columns: Mapping[str, type | ColumnSpec | exprs.Expr] | None = None, iterator: func.GeneratingFunctionCall | None = None, num_retained_versions: int = 10, comment: str | None = None, custom_metadata: Any = None, media_validation: Literal['on_read', 'on_write'] = 'on_write', if_exists: Literal['error', 'ignore', 'replace', 'replace_force'] = 'error' ) -> catalog.Table | None ``` Create a snapshot of an existing table object (which itself can be a view or a snapshot or a base table). **Parameters:** * **`path_str`** (`str`): A name for the snapshot; can be either a simple name such as `my_snapshot`, or a pathname such as `dir1/my_snapshot`. * **`base`** (`catalog.Table | Query`): [`Table`](./table) (i.e., table or view or snapshot) or [`Query`](./query) to base the snapshot on. * **`additional_columns`** (`Mapping[str, type | ColumnSpec | exprs.Expr] | None`): If specified, will add these columns to the snapshot once it is created. The format of the `additional_columns` parameter is identical to the format of the `schema` parameter in [`create_table`](./pixeltable#func-create_table). * **`iterator`** (`func.GeneratingFunctionCall | None`): The iterator to use for this snapshot. If specified, then this snapshot will be a one-to-many view of the base table. 
* **`num_retained_versions`** (`int`, default: `10`): Number of versions of the view to retain. * **`comment`** (`str | None`): Optional comment for the snapshot. * **`custom_metadata`** (`Any`): Optional user-defined JSON metadata to associate with the snapshot. * **`media_validation`** (`Literal['on_read', 'on_write']`, default: `'on_write'`): Media validation policy for the snapshot. * `'on_read'`: validate media files at query time * `'on_write'`: validate media files during insert/update operations * **`if_exists`** (`Literal['error', 'ignore', 'replace', 'replace_force']`, default: `'error'`): Directive regarding how to handle if the path already exists. Must be one of the following: * `'error'`: raise an error * `'ignore'`: do nothing and return the existing snapshot handle * `'replace'`: if the existing snapshot has no dependents, drop and replace it with a new one * `'replace_force'`: drop the existing snapshot and all its dependents, and create a new one **Returns:** * `catalog.Table | None`: A handle to the [`Table`](./table) representing the newly created snapshot. If the path already exists and `if_exists='ignore'`, returns a handle to the existing snapshot. Please note the schema or base of the existing snapshot may not match those provided in the call. 
**Examples:** Create a snapshot `my_snapshot` of a table `my_table`: ```python theme={null} tbl = pxt.get_table('my_table') snapshot = pxt.create_snapshot('my_snapshot', tbl) ``` Create a snapshot `my_snapshot` of a view `my_view` with additional int column `col3`, if `my_snapshot` does not already exist: ```python theme={null} view = pxt.get_table('my_view') snapshot = pxt.create_snapshot( 'my_snapshot', view, additional_columns={'col3': pxt.Int}, if_exists='ignore', ) ``` Create a snapshot `my_snapshot` on a table `my_table`, and replace any existing snapshot named `my_snapshot`: ```python theme={null} tbl = pxt.get_table('my_table') snapshot = pxt.create_snapshot( 'my_snapshot', tbl, if_exists='replace_force' ) ``` ## func  create\_table() ```python Signature theme={null} create_table( path: str, schema: Mapping[str, type | ColumnSpec | exprs.Expr] | None = None, *, source: TableDataSource | None = None, source_format: Literal['csv', 'excel', 'parquet', 'json'] | None = None, schema_overrides: dict[str, Any] | None = None, create_default_idxs: bool = True, on_error: Literal['abort', 'ignore'] = 'abort', primary_key: str | list[str] | None = None, num_retained_versions: int = 10, comment: str | None = None, custom_metadata: Any = None, media_validation: Literal['on_read', 'on_write'] = 'on_write', if_exists: Literal['error', 'ignore', 'replace', 'replace_force'] = 'error', extra_args: dict[str, Any] | None = None ) -> catalog.Table ``` Create a new base table. Exactly one of `schema` or `source` must be provided. If a `schema` is provided, then an empty table will be created with the specified schema. If a `source` is provided, then Pixeltable will attempt to infer a data source format and table schema from the contents of the specified data, and the data will be imported from the specified source into the new table. The source format and/or schema can be specified directly via the `source_format` and `schema_overrides` parameters. 
**Parameters:** * **`path`** (`str`): Pixeltable path (qualified name) of the table, such as `'my_table'` or `'my_dir/my_subdir/my_table'`. * **`schema`** (`Mapping[str, type | ColumnSpec | exprs.Expr] | None`): Schema for the new table, mapping column names to Pixeltable types. * **`source`** (`TableDataSource | None`): A data source (file, URL, Table, Query, or list of rows) to import from. * **`source_format`** (`Literal['csv', 'excel', 'parquet', 'json'] | None`): Must be used in conjunction with a `source`. If specified, then the given format will be used to read the source data. (Otherwise, Pixeltable will attempt to infer the format from the source data.) * **`schema_overrides`** (`dict[str, Any] | None`): Must be used in conjunction with a `source`. If specified, then columns in `schema_overrides` will be given the specified types. (Pixeltable will attempt to infer the types of any columns not specified.) * **`create_default_idxs`** (`bool`, default: `True`): If True, creates a B-tree index on every scalar and media column that is not computed, except for boolean columns. * **`on_error`** (`Literal['abort', 'ignore']`, default: `'abort'`): Determines the behavior if an error occurs while evaluating a computed column or detecting an invalid media file (such as a corrupt image) for one of the inserted rows. * If `on_error='abort'`, then an exception will be raised and the rows will not be inserted. * If `on_error='ignore'`, then execution will continue and the rows will be inserted. Any cells with errors will have a `None` value for that cell, with information about the error stored in the corresponding `tbl.col_name.errortype` and `tbl.col_name.errormsg` fields. * **`primary_key`** (`str | list[str] | None`): An optional column name or list of column names to use as the primary key(s) of the table. * **`num_retained_versions`** (`int`, default: `10`): Number of versions of the table to retain. 
* **`comment`** (`str | None`): An optional comment; its meaning is user-defined. * **`custom_metadata`** (`Any`): Optional user-defined metadata to associate with the table. Must be a valid JSON-serializable object \[str, int, float, bool, dict, list]. * **`media_validation`** (`Literal['on_read', 'on_write']`, default: `'on_write'`): Media validation policy for the table. * `'on_read'`: validate media files at query time * `'on_write'`: validate media files during insert/update operations * **`if_exists`** (`Literal['error', 'ignore', 'replace', 'replace_force']`, default: `'error'`): Determines the behavior if a table already exists at the specified path location. * `'error'`: raise an error * `'ignore'`: do nothing and return the existing table handle * `'replace'`: if the existing table has no views or snapshots, drop and replace it with a new one; raise an error if the existing table has views or snapshots * `'replace_force'`: drop the existing table and all its views and snapshots, and create a new one * **`extra_args`** (`dict[str, Any] | None`): Must be used in conjunction with a `source`. If specified, then additional arguments will be passed along to the source data provider. **Returns:** * `catalog.Table`: A handle to the newly created table, or to an already existing table at the path when `if_exists='ignore'`. Please note the schema of the existing table may not match the schema provided in the call. 
**Examples:** Create a table with an int and a string column: ```python theme={null} tbl = pxt.create_table( 'my_table', schema={'col1': pxt.Int, 'col2': pxt.String} ) ``` Create a table from a select statement over an existing table `orig_table` (this will create a new table containing the exact contents of the query): ```python theme={null} tbl1 = pxt.get_table('orig_table') tbl2 = pxt.create_table( 'new_table', tbl1.where(tbl1.col1 < 10).select(tbl1.col2) ) ``` Create a table if it does not already exist, otherwise get the existing table: ```python theme={null} tbl = pxt.create_table( 'my_table', schema={'col1': pxt.Int, 'col2': pxt.String}, if_exists='ignore', ) ``` Create a table with an int and a float column, and replace any existing table: ```python theme={null} tbl = pxt.create_table( 'my_table', schema={'col1': pxt.Int, 'col2': pxt.Float}, if_exists='replace', ) ``` Create a table from a CSV file: ```python theme={null} tbl = pxt.create_table('my_table', source='data.csv') ``` Create a table with an auto-generated UUID primary key: ```python theme={null} tbl = pxt.create_table( 'my_table', schema={'id': pxt.functions.uuid.uuid4(), 'data': pxt.String}, primary_key=['id'], ) ``` ## func  create\_view() ```python Signature theme={null} create_view( path: str, base: catalog.Table | Query, *, additional_columns: Mapping[str, type | ColumnSpec | exprs.Expr] | None = None, is_snapshot: bool = False, create_default_idxs: bool = False, iterator: func.GeneratingFunctionCall | None = None, num_retained_versions: int = 10, comment: str | None = None, custom_metadata: Any = None, media_validation: Literal['on_read', 'on_write'] = 'on_write', if_exists: Literal['error', 'ignore', 'replace', 'replace_force'] = 'error' ) -> catalog.Table | None ``` Create a view of an existing table object (which itself can be a view or a snapshot or a base table). 
**Parameters:** * **`path`** (`str`): A name for the view; can be either a simple name such as `my_view`, or a pathname such as `dir1/my_view`. * **`base`** (`catalog.Table | Query`): [`Table`](./table) (i.e., table or view or snapshot) or [`Query`](./query) to base the view on. * **`additional_columns`** (`Mapping[str, type | ColumnSpec | exprs.Expr] | None`): If specified, will add these columns to the view once it is created. The format of the `additional_columns` parameter is identical to the format of the `schema` parameter in [`create_table`](./pixeltable#func-create_table). * **`is_snapshot`** (`bool`, default: `False`): Whether the view is a snapshot. Setting this to `True` is equivalent to calling [`create_snapshot`](./pixeltable#func-create_snapshot). * **`create_default_idxs`** (`bool`, default: `False`): Whether to create default indexes on the view's columns (the base's columns are excluded). Cannot be `True` for snapshots. * **`iterator`** (`func.GeneratingFunctionCall | None`): The iterator to use for this view. If specified, then this view will be a one-to-many view of the base table. * **`num_retained_versions`** (`int`, default: `10`): Number of versions of the view to retain. * **`comment`** (`str | None`): Optional comment for the view. * **`custom_metadata`** (`Any`): Optional user-defined JSON metadata to associate with the view. * **`media_validation`** (`Literal['on_read', 'on_write']`, default: `'on_write'`): Media validation policy for the view. * `'on_read'`: validate media files at query time * `'on_write'`: validate media files during insert/update operations * **`if_exists`** (`Literal['error', 'ignore', 'replace', 'replace_force']`, default: `'error'`): Directive regarding how to handle if the path already exists. 
Must be one of the following: * `'error'`: raise an error * `'ignore'`: do nothing and return the existing view handle * `'replace'`: if the existing view has no dependents, drop and replace it with a new one * `'replace_force'`: drop the existing view and all its dependents, and create a new one **Returns:** * `catalog.Table | None`: A handle to the [`Table`](./table) representing the newly created view. If the path already exists and `if_exists='ignore'`, returns a handle to the existing view. Please note the schema or the base of the existing view may not match those provided in the call. **Examples:** Create a view `my_view` of an existing table `my_table`, filtering on rows where `col1` is greater than 10: ```python theme={null} tbl = pxt.get_table('my_table') view = pxt.create_view('my_view', tbl.where(tbl.col1 > 10)) ``` Create a view `my_view` of an existing table `my_table`, filtering on rows where `col1` is greater than 10, if it does not already exist; otherwise, get the existing view named `my_view`: ```python theme={null} tbl = pxt.get_table('my_table') view = pxt.create_view( 'my_view', tbl.where(tbl.col1 > 10), if_exists='ignore' ) ``` Create a view `my_view` of an existing table `my_table`, filtering on rows where `col1` is greater than 100, and replace any existing view named `my_view`: ```python theme={null} tbl = pxt.get_table('my_table') view = pxt.create_view( 'my_view', tbl.where(tbl.col1 > 100), if_exists='replace_force' ) ``` ## func  drop\_dir() ```python Signature theme={null} drop_dir( path: str, force: bool = False, if_not_exists: Literal['error', 'ignore'] = 'error' ) -> None ``` Remove a directory. **Parameters:** * **`path`** (`str`): Name or path of the directory. * **`force`** (`bool`, default: `False`): If `True`, will also drop all tables and subdirectories of this directory, recursively, along with any views or snapshots that depend on any of the dropped tables. 
* **`if_not_exists`** (`Literal['error', 'ignore']`, default: `'error'`): Directive regarding how to handle if the path does not exist. Must be one of the following: * `'error'`: raise an error * `'ignore'`: do nothing and return **Examples:** Remove a directory, if it exists and is empty: ```python theme={null} pxt.drop_dir('my_dir') ``` Remove a subdirectory: ```python theme={null} pxt.drop_dir('my_dir/sub_dir') ``` Remove an existing directory if it is empty, but do nothing if it does not exist: ```python theme={null} pxt.drop_dir('my_dir/sub_dir', if_not_exists='ignore') ``` Remove an existing directory and all its contents: ```python theme={null} pxt.drop_dir('my_dir', force=True) ``` ## func  drop\_table() ```python Signature theme={null} drop_table( table: str | catalog.Table, force: bool = False, if_not_exists: Literal['error', 'ignore'] = 'error' ) -> None ``` Drop a table, view, snapshot, or replica. **Parameters:** * **`table`** (`str | catalog.Table`): Fully qualified name or table handle of the table to be dropped; or a remote URI of a cloud replica to be deleted. * **`force`** (`bool`, default: `False`): If `True`, will also drop all views and sub-views of this table. * **`if_not_exists`** (`Literal['error', 'ignore']`, default: `'error'`): Directive regarding how to handle if the path does not exist. 
Must be one of the following: * `'error'`: raise an error * `'ignore'`: do nothing and return **Examples:** Drop a table by its fully qualified name: ```python theme={null} pxt.drop_table('subdir/my_table') ``` Drop a table by its handle: ```python theme={null} t = pxt.get_table('subdir/my_table') pxt.drop_table(t) ``` Drop a table if it exists, otherwise do nothing: ```python theme={null} pxt.drop_table('subdir/my_table', if_not_exists='ignore') ``` Drop a table and all its dependents: ```python theme={null} pxt.drop_table('subdir/my_table', force=True) ``` ## func  get\_dir\_contents() ```python Signature theme={null} get_dir_contents(dir_path: str = '', recursive: bool = True) -> DirContents ``` Get the contents of a Pixeltable directory. **Parameters:** * **`dir_path`** (`str`, default: `''`): Path to the directory. Defaults to the root directory. * **`recursive`** (`bool`, default: `True`): If `False`, returns only those tables and directories that are directly contained in specified directory; if `True`, returns all tables and directories that are descendants of the specified directory, recursively. **Returns:** * `'DirContents'`: A [`DirContents`](./dircontents) object representing the contents of the specified directory. **Examples:** Get contents of top-level directory: ```python theme={null} pxt.get_dir_contents() ``` Get contents of 'dir1': ```python theme={null} pxt.get_dir_contents('dir1') ``` ## func  get\_table() ```python Signature theme={null} get_table( path: str, if_not_exists: Literal['error', 'ignore'] = 'error' ) -> catalog.Table | None ``` Get a handle to an existing table, view, or snapshot. **Parameters:** * **`path`** (`str`): Path to the table. * **`if_not_exists`** (`Literal['error', 'ignore']`, default: `'error'`): Directive regarding how to handle if the path does not exist. 
Must be one of the following: * `'error'`: raise an error * `'ignore'`: do nothing and return `None` **Returns:** * `catalog.Table | None`: A handle to the [`Table`](./table). **Examples:** Get handle for a table in the top-level directory: ```python theme={null} tbl = pxt.get_table('my_table') ``` For a table in a subdirectory: ```python theme={null} tbl = pxt.get_table('subdir/my_table') ``` Handles to views and snapshots are retrieved in the same way: ```python theme={null} tbl = pxt.get_table('my_snapshot') ``` Get a handle to a specific version of a table: ```python theme={null} tbl = pxt.get_table('my_table:722') ``` ## func  home() ```python Signature theme={null} home() -> Path ``` Get the path to the user's home directory in Pixeltable. **Returns:** * `Path`: The path to the user's home directory. ## func  init() ```python Signature theme={null} init(config_overrides: dict[str, Any] | None = None) -> None ``` Initializes the Pixeltable environment. ## func  ls() ```python Signature theme={null} ls(path: str = '') -> pd.DataFrame ``` List the contents of a Pixeltable directory. This function returns a Pandas DataFrame representing a human-readable listing of the specified directory, including various attributes such as version and base table, as appropriate. To get a programmatic list of the directory's contents, use [get\_dir\_contents()](./pixeltable#func-get_dir_contents) instead. ## func  move() ```python Signature theme={null} move( path: str, new_path: str, *, if_exists: Literal['error', 'ignore'] = 'error', if_not_exists: Literal['error', 'ignore'] = 'error' ) -> None ``` Move a schema object to a new directory and/or rename a schema object. **Parameters:** * **`path`** (`str`): absolute path to the existing schema object. * **`new_path`** (`str`): absolute new path for the schema object. * **`if_exists`** (`Literal['error', 'ignore']`, default: `'error'`): Directive regarding how to handle if a schema object already exists at the new path. 
Must be one of the following: * `'error'`: raise an error * `'ignore'`: do nothing and return * **`if_not_exists`** (`Literal['error', 'ignore']`, default: `'error'`): Directive regarding how to handle if the source path does not exist. Must be one of the following: * `'error'`: raise an error * `'ignore'`: do nothing and return **Examples:** Move a table to a different directory: ```python theme={null} pxt.move('dir1/my_table', 'dir2/my_table') ``` Rename a table: ```python theme={null} pxt.move('dir1/my_table', 'dir1/new_name') ``` ## func  publish() ```python Signature theme={null} publish( source: str | catalog.Table, destination_uri: str, bucket_name: str | None = None, access: Literal['public', 'private'] = 'private' ) -> None ``` Publishes a replica of a local Pixeltable table to Pixeltable cloud. A given table can be published to at most one URI per Pixeltable cloud database. **Parameters:** * **`source`** (`str | catalog.Table`): Path or table handle of the local table to be published. * **`destination_uri`** (`str`): Remote URI where the replica will be published, such as `'pxt://org_name/my_dir/my_table'`. * **`bucket_name`** (`str | None`): The name of the bucket to use to store the replica's data. The bucket must be registered with Pixeltable cloud. If no `bucket_name` is provided, the default storage bucket for the destination database will be used. * **`access`** (`Literal['public', 'private']`, default: `'private'`): Access control for the replica. * `'public'`: Anyone can access this replica. * `'private'`: Only the host organization can access it. ## func  replicate() ```python Signature theme={null} replicate(remote_uri: str, local_path: str) -> catalog.Table ``` Retrieve a replica from Pixeltable cloud as a local table. This will create a full local copy of the replica in a way that preserves the table structure of the original source data. Once replicated, the local table can be queried offline just as any other Pixeltable table. 
**Parameters:** * **`remote_uri`** (`str`): Remote URI of the table to be replicated, such as `'pxt://org_name/my_dir/my_table'` or `'pxt://org_name/my_dir/my_table:5'` (with version 5). * **`local_path`** (`str`): Local table path where the replica will be created, such as `'my_new_dir/my_new_tbl'`. It can be the same or different from the cloud table name. **Returns:** * `catalog.Table`: A handle to the newly created local replica table. # Query Source: https://docs.pixeltable.com/sdk/latest/query View Source on GitHub # class  pixeltable.Query Represents a query for retrieving and transforming data from Pixeltable tables. ## method  collect() ```python Signature theme={null} collect() -> ResultSet ``` ## method  distinct() ```python Signature theme={null} distinct() -> Query ``` Remove duplicate rows from this Query. Note that grouping will be applied to the rows based on the select clause of this Query. In the absence of a select clause, by default, all columns are selected in the grouping. **Examples:** Select unique addresses from table `addresses`. ```python theme={null} results = addresses.distinct() ``` Select unique cities in table `addresses` ```python theme={null} results = addresses.city.distinct() ``` Select unique locations (street, city) in the state of `CA` ```python theme={null} results = ( addresses.select(addresses.street, addresses.city) .where(addresses.state == 'CA') .distinct() ) ``` ## method  group\_by() ```python Signature theme={null} group_by(*grouping_items: Any) -> Query ``` Add a group-by clause to this Query. Variants: * group\_by(base\_tbl): group a component view by their respective base table rows * group\_by(expr1, expr2, expr3): group by the given expressions Note that grouping will be applied to the rows and take effect when used with an aggregation function like sum(), count() etc. **Parameters:** * **`grouping_items`** (`Any`): expressions to group by **Returns:** * `Query`: A new Query with the specified group-by clause. 
**Examples:** Given the Query book from a table t with all its columns and rows: ```python theme={null} book = t.select() ``` Group the above Query book by the 'genre' column (referenced in table t): ```python theme={null} query = book.group_by(t.genre) ``` Use the above Query grouped by genre to count the number of books for each 'genre': ```python theme={null} query = ( book.group_by(t.genre).select(t.genre, count=count(t.genre)).show() ) ``` Use the above Query grouped by genre to compute the total price of books for each 'genre': ```python theme={null} query = book.group_by(t.genre).select(t.genre, total=sum(t.price)).show() ``` ## method  head() ```python Signature theme={null} head(n: int = 10) -> ResultSet ``` Return the first n rows of the Query, in insertion order of the underlying Table. head() is not supported for joins. **Parameters:** * **`n`** (`int`, default: `10`): Number of rows to select. Default is 10. **Returns:** * `ResultSet`: A ResultSet with the first n rows of the Query. ## method  join() ```python Signature theme={null} join( other: catalog.Table, on: exprs.Expr | Sequence[exprs.ColumnRef] | None = None, how: plan.JoinType.LiteralType = 'inner' ) -> Query ``` Join this Query with a table. **Parameters:** * **`other`** (`catalog.Table`): the table to join with * **`on`** (`exprs.Expr | Sequence[exprs.ColumnRef] | None`): the join condition, which can be either a) references to one or more columns or b) a boolean expression. * column references: implies an equality predicate that matches columns in both this Query and `other` by name. * column in `other`: A column with that same name must be present in this Query, and **it must be unique** (otherwise the join is ambiguous). * column in this Query: A column with that same name must be present in `other`. * boolean expression: The expressions must be valid in the context of the joined tables. * **`how`** (`plan.JoinType.LiteralType`, default: `'inner'`): the type of join to perform. 
* `'inner'`: only keep rows that have a match in both * `'left'`: keep all rows from this Query and only matching rows from the other table * `'right'`: keep all rows from the other table and only matching rows from this Query * `'full_outer'`: keep all rows from both this Query and the other table * `'cross'`: Cartesian product; no `on` condition allowed **Returns:** * `Query`: A new Query. **Examples:** Perform an inner join between t1 and t2 on the column id: ```python theme={null} join1 = t1.join(t2, on=t2.id) ``` Perform a left outer join of join1 with t3, also on id (note that we can't specify `on=t3.id` here, because that would be ambiguous, since both t1 and t2 have a column named id): ```python theme={null} join2 = join1.join(t3, on=t2.id, how='left') ``` Do the same, but now with an explicit join predicate: ```python theme={null} join2 = join1.join(t3, on=t2.id == t3.id, how='left') ``` Join t with d, which has a composite primary key (columns pk1 and pk2, with corresponding foreign key columns d1 and d2 in t): ```python theme={null} query = t.join(d, on=(t.d1 == d.pk1) & (t.d2 == d.pk2), how='left') ``` ## method  limit() ```python Signature theme={null} limit(n: int, offset: int | None = None) -> Query ``` Limit the number of rows in the Query, optionally skipping rows for pagination. **Parameters:** * **`n`** (`int`): Number of rows to select. * **`offset`** (`int | None`): Number of rows to skip before returning results. Default is None (no offset). **Returns:** * `Query`: A new Query with the specified limited rows. **Examples:** ```python theme={null} query = t.select() ``` Get the first 10 rows: ```python theme={null} query.limit(10).collect() ``` Get rows 21-30 (skip first 20, return next 10): ```python theme={null} query.limit(10, offset=20).collect() ``` ## method  order\_by() ```python Signature theme={null} order_by(*expr_list: exprs.Expr, asc: bool = True) -> Query ``` Add an order-by clause to this Query. 
**Parameters:** * **`expr_list`** (`exprs.Expr`): expressions to order by * **`asc`** (`bool`, default: `True`): whether to order in ascending order (True) or descending order (False). Default is True. **Returns:** * `Query`: A new Query with the specified order-by clause. **Examples:** Given the Query book from a table t with all its columns and rows: ```python theme={null} book = t.select() ``` Order the above Query book by two columns (price, pages) in descending order: ```python theme={null} query = book.order_by(t.price, t.pages, asc=False) ``` Order the above Query book by price in descending order, but order the pages in ascending order: ```python theme={null} query = book.order_by(t.price, asc=False).order_by(t.pages) ``` ## method  sample() ```python Signature theme={null} sample( n: int | None = None, n_per_stratum: int | None = None, fraction: float | None = None, seed: int | None = None, stratify_by: Any = None ) -> Query ``` Return a new Query specifying a sample of rows from the Query, considered in a shuffled order. The size of the sample can be specified in three ways: * `n`: the total number of rows to produce as a sample * `n_per_stratum`: the number of rows to produce per stratum as a sample * `fraction`: the fraction of available rows to produce as a sample The sample can be stratified by one or more columns, which means that the sample will be selected from each stratum separately. The data is shuffled before creating the sample. **Parameters:** * **`n`** (`int | None`): Total number of rows to produce as a sample. * **`n_per_stratum`** (`int | None`): Number of rows to produce per stratum as a sample. This parameter is only valid if `stratify_by` is specified. Only one of `n` or `n_per_stratum` can be specified. * **`fraction`** (`float | None`): Fraction of available rows to produce as a sample. This parameter is not usable with `n` or `n_per_stratum`. The fraction must be between 0.0 and 1.0. 
* **`seed`** (`int | None`): Random seed for reproducible shuffling * **`stratify_by`** (`Any`): If specified, the sample will be stratified by these values. **Returns:** * `Query`: A new Query which specifies the sampled rows **Examples:** Given the Table `person` containing the field 'age', we can create samples of the table in various ways: Sample 100 rows from the above Table: ```python theme={null} query = person.sample(n=100) ``` Sample 10% of the rows from the above Table: ```python theme={null} query = person.sample(fraction=0.1) ``` Sample 10% of the rows from the above Table, stratified by the column 'age': ```python theme={null} query = person.sample(fraction=0.1, stratify_by=person.age) ``` Equal allocation sampling: Sample 2 rows from each age present in the above Table: ```python theme={null} query = person.sample(n_per_stratum=2, stratify_by=person.age) ``` Sampling is compatible with the where clause, so we can also sample from a filtered Query: ```python theme={null} query = person.where(person.age > 30).sample(n=100) ``` ## method  select() ```python Signature theme={null} select(*items: Any, **named_items: Any) -> Query ``` Select columns or expressions from the Query. **Parameters:** * **`items`** (`Any`): expressions to be selected * **`named_items`** (`Any`): named expressions to be selected **Returns:** * `Query`: A new Query with the specified select list. 
**Examples:** Given the Query person from a table t with all its columns and rows: ```python theme={null} person = t.select() ``` Select the columns 'name' and 'age' (referenced in table t) from the Query person: ```python theme={null} query = person.select(t.name, t.age) ``` Select the columns 'name' (referenced in table t) from the Query person, and a named column 'is\_adult' from the expression `age >= 18` where 'age' is another column in table t: ```python theme={null} query = person.select(t.name, is_adult=(t.age >= 18)) ``` ## method  show() ```python Signature theme={null} show(n: int = 20) -> ResultSet ``` ## method  tail() ```python Signature theme={null} tail(n: int = 10) -> ResultSet ``` Return the last n rows of the Query, in insertion order of the underlying Table. tail() is not supported for joins. **Parameters:** * **`n`** (`int`, default: `10`): Number of rows to select. Default is 10. **Returns:** * `ResultSet`: A ResultSet with the last n rows of the Query. ## method  to\_coco\_dataset() ```python Signature theme={null} to_coco_dataset() -> Path ``` Convert the Query to a COCO dataset. This Query must return a single json-typed output column in the following format: ```python theme={null} { 'image': PIL.Image.Image, 'annotations': [ { 'bbox': [x: int, y: int, w: int, h: int], 'category': str | int, }, ... ], } ``` **Returns:** * `Path`: Path to the COCO dataset file. ## method  to\_pytorch\_dataset() ```python Signature theme={null} to_pytorch_dataset(image_format: str = 'pt') -> torch.utils.data.IterableDataset ``` Convert the Query to a pytorch IterableDataset suitable for parallel loading with torch.utils.data.DataLoader. This method requires pyarrow >= 13, torch and torchvision to work. This method serializes data so it can be read from disk efficiently and repeatedly without re-executing the query. This data is cached to disk for future re-use. **Parameters:** * **`image_format`** (`str`, default: `'pt'`): format of the images. 
Can be 'pt' (pytorch tensor) or 'np' (numpy array). 'np' means image columns return as an RGB uint8 array of shape HxWxC. 'pt' means image columns return as a CxHxW tensor with values in \[0,1] and type torch.float32. (the format output by torchvision.transforms.ToTensor()) **Returns:** * `'torch.utils.data.IterableDataset'`: A pytorch IterableDataset: Columns become fields of the dataset, where rows are returned as a dictionary compatible with torch.utils.data.DataLoader default collation. ## method  where() ```python Signature theme={null} where(pred: exprs.Expr) -> Query ``` Filter rows based on a predicate. **Parameters:** * **`pred`** (`exprs.Expr`): the predicate to filter rows **Returns:** * `Query`: A new Query with the specified predicates replacing the where-clause. **Examples:** Given the Query person from a table t with all its columns and rows: ```python theme={null} person = t.select() ``` Filter the above Query person to only include rows where the column 'age' (referenced in table t) is greater than 30: ```python theme={null} query = person.where(t.age > 30) ``` ## attr  schema ``` schema: dict[str, ColumnType] ``` Column names and types in this Query. # replicate Source: https://docs.pixeltable.com/sdk/latest/replicate View Source on GitHub # module  pixeltable.functions.replicate Pixeltable UDFs that wrap various endpoints from the Replicate API. In order to use them, you must first `pip install replicate` and configure your Replicate credentials, as described in the [Working with Replicate](https://docs.pixeltable.com/notebooks/integrations/working-with-replicate) tutorial. ## udf  run() ```python Signature theme={null} @pxt.udf run(input: pxt.Json, *, ref: pxt.String) -> pxt.Json ``` Run a model on Replicate. For additional details, see: [https://replicate.com/docs/topics/models/run-a-model](https://replicate.com/docs/topics/models/run-a-model) Request throttling: Applies the rate limit set in the config (section `replicate`, key `rate_limit`). 
If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install replicate` **Parameters:** * **`input`** (`pxt.Json`): The input parameters for the model. * **`ref`** (`pxt.String`): The name of the model to run. **Returns:** * `pxt.Json`: The output of the model. **Examples:** Add a computed column that applies the model `meta/meta-llama-3-8b-instruct` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} input = { 'system_prompt': 'You are a helpful assistant.', 'prompt': tbl.prompt, } tbl.add_computed_column( response=run(input, ref='meta/meta-llama-3-8b-instruct') ) ``` Add a computed column that uses the model `black-forest-labs/flux-schnell` to generate images from an existing Pixeltable column `tbl.prompt`: ```python theme={null} input = {'prompt': tbl.prompt, 'go_fast': True, 'megapixels': '1'} tbl.add_computed_column( response=run(input, ref='black-forest-labs/flux-schnell') ) tbl.add_computed_column(image=tbl.response.output[0].astype(pxt.Image)) ``` # reve Source: https://docs.pixeltable.com/sdk/latest/reve View Source on GitHub # module  pixeltable.functions.reve Pixeltable [UDFs](https://docs.pixeltable.com/platform/udfs-in-pixeltable) that wrap the [Reve](https://app.reve.com/) image generation API. In order to use them, the API key must be specified either with the `REVE_API_KEY` environment variable, or as `api_key` in the `reve` section of the Pixeltable config file. ## udf  create() ```python Signature theme={null} @pxt.udf create( prompt: pxt.String, *, aspect_ratio: pxt.String | None = None, version: pxt.String | None = None ) -> pxt.Image ``` Creates an image from a text prompt. This UDF wraps the `https://api.reve.com/v1/image/create` endpoint. For more information, refer to the official [API documentation](https://api.reve.com/console/docs/create). 
**Parameters:** * **`prompt`** (`pxt.String`): prompt describing the desired image * **`aspect_ratio`** (`pxt.String | None`): desired image aspect ratio, e.g. '3:2', '16:9', '1:1', etc. * **`version`** (`pxt.String | None`): specific model version to use. Latest if not specified. **Returns:** * `pxt.Image`: A generated image **Examples:** Add a computed column with generated square images to a table with text prompts: ```python theme={null} t.add_computed_column(img=reve.create(t.prompt, aspect_ratio='1:1')) ``` ## udf  edit() ```python Signature theme={null} @pxt.udf edit( image: pxt.Image, edit_instruction: pxt.String, *, version: pxt.String | None = None ) -> pxt.Image ``` Edits images based on a text prompt. This UDF wraps the `https://api.reve.com/v1/image/edit` endpoint. For more information, refer to the official [API documentation](https://api.reve.com/console/docs/edit) **Parameters:** * **`image`** (`pxt.Image`): image to edit * **`edit_instruction`** (`pxt.String`): text prompt describing the desired edit * **`version`** (`pxt.String | None`): specific model version to use. Latest if not specified. **Returns:** * `pxt.Image`: A generated image **Examples:** Add a computed column with catalog-ready images to the table with product pictures: ```python theme={null} t.add_computed_column( catalog_img=reve.edit( t.product_img, 'Remove background and distractions from the product picture, improve lighting.', ) ) ``` ## udf  remix() ```python Signature theme={null} @pxt.udf remix( prompt: pxt.String, images: pxt.Json, *, aspect_ratio: pxt.String | None = None, version: pxt.String | None = None ) -> pxt.Image ``` Creates images based on a text prompt and reference images. The prompt may include `0`, `1`, etc. tags to refer to the images in the `images` argument. This UDF wraps the `https://api.reve.com/v1/image/remix` endpoint. 
For more information, refer to the official [API documentation](https://api.reve.com/console/docs/remix) **Parameters:** * **`prompt`** (`pxt.String`): prompt describing the desired image * **`images`** (`pxt.Json`): list of reference images * **`aspect_ratio`** (`pxt.String | None`): desired image aspect ratio, e.g. '3:2', '16:9', '1:1', etc. * **`version`** (`pxt.String | None`): specific model version to use. Latest by default. **Returns:** * `pxt.Image`: A generated image **Examples:** Add a computed column with promotional collages to a table with original images: ```python theme={null} t.add_computed_column( promo_img=( reve.remix( 'Generate a product promotional image by combining the image of the product' ' from 0 with the landmark scene from 1', images=[t.product_img, t.local_landmark_img], aspect_ratio='16:9', ) ) ) ``` # runwayml Source: https://docs.pixeltable.com/sdk/latest/runwayml View Source on GitHub # module  pixeltable.functions.runwayml Pixeltable UDFs that wrap various endpoints from the RunwayML API. In order to use them, you must first `pip install runwayml` and configure your RunwayML credentials by setting the `RUNWAYML_API_SECRET` environment variable. ## udf  image\_to\_video() ```python Signature theme={null} @pxt.udf image_to_video( prompt_image: pxt.Image, model: pxt.String, ratio: pxt.String, *, prompt_text: pxt.String | None = None, duration: pxt.Int | None = None, seed: pxt.Int | None = None, audio: pxt.Bool | None = None, model_kwargs: pxt.Json | None = None ) -> pxt.Json ``` Generate videos from images. For additional details, see: [Image to video](https://docs.dev.runwayml.com/api/#tag/Start-generating/paths/~1v1~1image_to_video/post) **Requirements:** * `pip install runwayml` **Parameters:** * **`prompt_image`** (`pxt.Image`): Input image to use as the first frame. * **`model`** (`pxt.String`): The model to use. * **`ratio`** (`pxt.String`): Aspect ratio of the generated video. 
* **`prompt_text`** (`pxt.String | None`): Text description to guide generation. * **`duration`** (`pxt.Int | None`): Duration in seconds. * **`seed`** (`pxt.Int | None`): Seed for reproducibility. * **`audio`** (`pxt.Bool | None`): Whether to generate audio. * **`model_kwargs`** (`pxt.Json | None`): Additional API parameters. **Returns:** * `pxt.Json`: A dictionary containing the response and metadata. **Examples:** Add a computed column that generates videos from images: ```python theme={null} tbl.add_computed_column( response=image_to_video( tbl.image, model='gen4', ratio='16:9', prompt_text='Slow motion', duration=5, ) ) tbl.add_computed_column(video=tbl.response['output'].astype(pxt.Video)) ``` ## udf  text\_to\_image() ```python Signature theme={null} @pxt.udf text_to_image( prompt_text: pxt.String, reference_images: pxt.Json, model: pxt.String, ratio: pxt.String, *, seed: pxt.Int | None = None, model_kwargs: pxt.Json | None = None ) -> pxt.Json ``` Generate images from text prompts and reference images. For additional details, see: [Text/Image to Image](https://docs.dev.runwayml.com/api/#tag/Start-generating/paths/~1v1~1text_to_image/post) **Requirements:** * `pip install runwayml` **Parameters:** * **`prompt_text`** (`pxt.String`): Text description of the image to generate. * **`reference_images`** (`pxt.Json`): List of 1-3 reference images. * **`model`** (`pxt.String`): The model to use. * **`ratio`** (`pxt.String`): Aspect ratio of the generated image. * **`seed`** (`pxt.Int | None`): Seed for reproducibility. * **`model_kwargs`** (`pxt.Json | None`): Additional API parameters. **Returns:** * `pxt.Json`: A dictionary containing the response and metadata. 
**Examples:** Add a computed column that generates images from prompts: ```python theme={null} tbl.add_computed_column( response=text_to_image( tbl.prompt, [tbl.ref_image], model='gen4_image', ratio='16:9' ) ) tbl.add_computed_column(image=tbl.response['output'][0].astype(pxt.Image)) ``` ## udf  text\_to\_video() ```python Signature theme={null} @pxt.udf text_to_video( prompt_text: pxt.String, model: pxt.String, ratio: pxt.String, *, duration: pxt.Int | None = None, audio: pxt.Bool | None = None, model_kwargs: pxt.Json | None = None ) -> pxt.Json ``` Generate videos from text prompts. For additional details, see: [Text to video](https://docs.dev.runwayml.com/api/#tag/Start-generating/paths/~1v1~1text_to_video/post) **Requirements:** * `pip install runwayml` **Parameters:** * **`prompt_text`** (`pxt.String`): Text description of the video to generate. * **`model`** (`pxt.String`): The model to use. * **`ratio`** (`pxt.String`): Aspect ratio of the generated video. * **`duration`** (`pxt.Int | None`): Duration in seconds. * **`audio`** (`pxt.Bool | None`): Whether to generate audio. * **`model_kwargs`** (`pxt.Json | None`): Additional API parameters. **Returns:** * `pxt.Json`: A dictionary containing the response and metadata. **Examples:** Add a computed column that generates videos from prompts: ```python theme={null} tbl.add_computed_column( response=text_to_video( tbl.prompt, model='veo3.1', ratio='16:9', duration=4 ) ) tbl.add_computed_column(video=tbl.response['output'].astype(pxt.Video)) ``` ## udf  video\_to\_video() ```python Signature theme={null} @pxt.udf video_to_video( video_uri: pxt.String, prompt_text: pxt.String, model: pxt.String, ratio: pxt.String, *, seed: pxt.Int | None = None, model_kwargs: pxt.Json | None = None ) -> pxt.Json ``` Transform videos with text guidance. 
For additional details, see: [Video to video](https://docs.dev.runwayml.com/api/#tag/Start-generating/paths/~1v1~1video_to_video/post) **Requirements:** * `pip install runwayml` **Parameters:** * **`video_uri`** (`pxt.String`): HTTPS URL to the input video. * **`prompt_text`** (`pxt.String`): Text description of the transformation. * **`model`** (`pxt.String`): The model to use. * **`ratio`** (`pxt.String`): Aspect ratio of the output video. * **`seed`** (`pxt.Int | None`): Seed for reproducibility. * **`model_kwargs`** (`pxt.Json | None`): Additional API parameters. **Returns:** * `pxt.Json`: A dictionary containing the response and metadata. **Examples:** Add a computed column that transforms videos: ```python theme={null} tbl.add_computed_column( response=video_to_video( tbl.video_url, 'Anime style', model='gen4_aleph', ratio='16:9' ) ) tbl.add_computed_column(video=tbl.response['output'].astype(pxt.Video)) ``` # string Source: https://docs.pixeltable.com/sdk/latest/string View Source on GitHub # module  pixeltable.functions.string Pixeltable UDFs for `StringType`. It closely follows the Pandas `pandas.Series.str` API. Example: ```python theme={null} import pixeltable as pxt t = pxt.get_table(...) t.select(t.str_col.capitalize()).collect() ``` ## iterator  string\_splitter() ```python Signature theme={null} @pxt.iterator string_splitter( text: pxt.String, separators: pxt.String, *, spacy_model: pxt.String = 'en_core_web_sm' ) ``` Iterator over chunks of a string. The string is chunked according to the specified `separators`. **Outputs**: One row per chunk, with the following columns: * `text` (`pxt.String`): The text of the chunk. **Parameters:** * **`separators`** (`pxt.String`): Separators to use to chunk the document. Currently the only supported option is `'sentence'`. * **`spacy_model`** (`pxt.String`): Name of the spaCy model to use for sentence segmentation. 
**Examples:** This example assumes an existing table `tbl` with a column `text` of type `pxt.String`. Create a view that splits all strings on sentence boundaries: ```python theme={null} pxt.create_view( 'sentence_chunks', tbl, iterator=string_splitter(tbl.text, separators='sentence'), ) ``` ## udf  capitalize() ```python Signature theme={null} @pxt.udf capitalize(self: pxt.String) -> pxt.String ``` Return string with its first character capitalized and the rest lowercased. Equivalent to [`str.capitalize()`](https://docs.python.org/3/library/stdtypes.html#str.capitalize). ## udf  casefold() ```python Signature theme={null} @pxt.udf casefold(self: pxt.String) -> pxt.String ``` Return a casefolded copy of string. Equivalent to [`str.casefold()`](https://docs.python.org/3/library/stdtypes.html#str.casefold). ## udf  center() ```python Signature theme={null} @pxt.udf center( self: pxt.String, width: pxt.Int, fillchar: pxt.String = ' ' ) -> pxt.String ``` Return a centered string of length `width`. Equivalent to [`str.center()`](https://docs.python.org/3/library/stdtypes.html#str.center). **Parameters:** * **`width`** (`pxt.Int`): Total width of the resulting string. * **`fillchar`** (`pxt.String`): Character used for padding. ## udf  contains() ```python Signature theme={null} @pxt.udf contains( self: pxt.String, substr: pxt.String, case: pxt.Bool = True ) -> pxt.Bool ``` Test if string contains a substring. **Parameters:** * **`substr`** (`pxt.String`): string literal or regular expression * **`case`** (`pxt.Bool`): if False, ignore case ## udf  contains\_re() ```python Signature theme={null} @pxt.udf contains_re( self: pxt.String, pattern: pxt.String, flags: pxt.Int = 0 ) -> pxt.Bool ``` Test if string contains a regular expression pattern. 
**Parameters:** * **`pattern`** (`pxt.String`): regular expression pattern * **`flags`** (`pxt.Int`): [flags](https://docs.python.org/3/library/re.html#flags) for the `re` module ## udf  count() ```python Signature theme={null} @pxt.udf count( self: pxt.String, pattern: pxt.String, flags: pxt.Int = 0 ) -> pxt.Int ``` Count occurrences of pattern or regex. **Parameters:** * **`pattern`** (`pxt.String`): string literal or regular expression * **`flags`** (`pxt.Int`): [flags](https://docs.python.org/3/library/re.html#flags) for the `re` module ## udf  endswith() ```python Signature theme={null} @pxt.udf endswith(self: pxt.String, substr: pxt.String) -> pxt.Bool ``` Return `True` if the string ends with the specified suffix, otherwise return `False`. Equivalent to [`str.endswith()`](https://docs.python.org/3/library/stdtypes.html#str.endswith). **Parameters:** * **`substr`** (`pxt.String`): string literal ## udf  fill() ```python Signature theme={null} @pxt.udf fill(self: pxt.String, width: pxt.Int, **kwargs) -> pxt.String ``` Wraps the single paragraph in string, and returns a single string containing the wrapped paragraph. Equivalent to [`textwrap.fill()`](https://docs.python.org/3/library/textwrap.html#textwrap.fill). **Parameters:** * **`width`** (`pxt.Int`): Maximum line width. * **`kwargs`** (`Any`): Additional keyword arguments to pass to `textwrap.fill()`. ## udf  find() ```python Signature theme={null} @pxt.udf find( self: pxt.String, substr: pxt.String, start: pxt.Int = 0, end: pxt.Int | None = None ) -> pxt.Int ``` Return the lowest index in string where `substr` is found within the slice `s[start:end]`. Equivalent to [`str.find()`](https://docs.python.org/3/library/stdtypes.html#str.find). 
**Parameters:** * **`substr`** (`pxt.String`): substring to search for * **`start`** (`pxt.Int`): slice start * **`end`** (`pxt.Int | None`): slice end ## udf  findall() ```python Signature theme={null} @pxt.udf findall( self: pxt.String, pattern: pxt.String, flags: pxt.Int = 0 ) -> pxt.Json ``` Find all occurrences of a regular expression pattern in string. Equivalent to [`re.findall()`](https://docs.python.org/3/library/re.html#re.findall). **Parameters:** * **`pattern`** (`pxt.String`): regular expression pattern * **`flags`** (`pxt.Int`): [flags](https://docs.python.org/3/library/re.html#flags) for the `re` module ## udf  format() ```python Signature theme={null} @pxt.udf format(self: pxt.String, *args, **kwargs) -> pxt.String ``` Perform string formatting. Equivalent to [`str.format()`](https://docs.python.org/3/library/stdtypes.html#str.format). ## udf  fullmatch() ```python Signature theme={null} @pxt.udf fullmatch( self: pxt.String, pattern: pxt.String, case: pxt.Bool = True, flags: pxt.Int = 0 ) -> pxt.Bool ``` Determine if string fully matches a regular expression. Equivalent to [`re.fullmatch()`](https://docs.python.org/3/library/re.html#re.fullmatch). **Parameters:** * **`pattern`** (`pxt.String`): regular expression pattern * **`case`** (`pxt.Bool`): if False, ignore case * **`flags`** (`pxt.Int`): [flags](https://docs.python.org/3/library/re.html#flags) for the `re` module ## udf  index() ```python Signature theme={null} @pxt.udf index( self: pxt.String, substr: pxt.String, start: pxt.Int = 0, end: pxt.Int | None = None ) -> pxt.Int ``` Return the lowest index in string where `substr` is found within the slice `[start:end]`. Raises ValueError if `substr` is not found. Equivalent to [`str.index()`](https://docs.python.org/3/library/stdtypes.html#str.index). 
**Parameters:** * **`substr`** (`pxt.String`): substring to search for * **`start`** (`pxt.Int`): slice start * **`end`** (`pxt.Int | None`): slice end ## udf  isalnum() ```python Signature theme={null} @pxt.udf isalnum(self: pxt.String) -> pxt.Bool ``` Return `True` if all characters in the string are alphanumeric and there is at least one character, `False` otherwise. Equivalent to [`str.isalnum()`](https://docs.python.org/3/library/stdtypes.html#str.isalnum). ## udf  isalpha() ```python Signature theme={null} @pxt.udf isalpha(self: pxt.String) -> pxt.Bool ``` Return `True` if all characters in the string are alphabetic and there is at least one character, `False` otherwise. Equivalent to [`str.isalpha()`](https://docs.python.org/3/library/stdtypes.html#str.isalpha). ## udf  isascii() ```python Signature theme={null} @pxt.udf isascii(self: pxt.String) -> pxt.Bool ``` Return `True` if the string is empty or all characters in the string are ASCII, `False` otherwise. Equivalent to [`str.isascii()`](https://docs.python.org/3/library/stdtypes.html#str.isascii). ## udf  isdecimal() ```python Signature theme={null} @pxt.udf isdecimal(self: pxt.String) -> pxt.Bool ``` Return `True` if all characters in the string are decimal characters and there is at least one character, `False` otherwise. Equivalent to [`str.isdecimal()`](https://docs.python.org/3/library/stdtypes.html#str.isdecimal). ## udf  isdigit() ```python Signature theme={null} @pxt.udf isdigit(self: pxt.String) -> pxt.Bool ``` Return `True` if all characters in the string are digits and there is at least one character, `False` otherwise. Equivalent to [`str.isdigit()`](https://docs.python.org/3/library/stdtypes.html#str.isdigit). ## udf  isidentifier() ```python Signature theme={null} @pxt.udf isidentifier(self: pxt.String) -> pxt.Bool ``` Return `True` if the string is a valid identifier according to the language definition, `False` otherwise. 
Equivalent to [`str.isidentifier()`](https://docs.python.org/3/library/stdtypes.html#str.isidentifier) ## udf  islower() ```python Signature theme={null} @pxt.udf islower(self: pxt.String) -> pxt.Bool ``` Return `True` if all cased characters in the string are lowercase and there is at least one cased character, `False` otherwise. Equivalent to [`str.islower()`](https://docs.python.org/3/library/stdtypes.html#str.islower) ## udf  isnumeric() ```python Signature theme={null} @pxt.udf isnumeric(self: pxt.String) -> pxt.Bool ``` Return `True` if all characters in the string are numeric characters, `False` otherwise. Equivalent to [`str.isnumeric()`](https://docs.python.org/3/library/stdtypes.html#str.isnumeric) ## udf  isspace() ```python Signature theme={null} @pxt.udf isspace(self: pxt.String) -> pxt.Bool ``` Return `True` if there are only whitespace characters in the string and there is at least one character, `False` otherwise. Equivalent to [`str.isspace()`](https://docs.python.org/3/library/stdtypes.html#str.isspace) ## udf  istitle() ```python Signature theme={null} @pxt.udf istitle(self: pxt.String) -> pxt.Bool ``` Return `True` if the string is a titlecased string and there is at least one character, `False` otherwise. Equivalent to [`str.istitle()`](https://docs.python.org/3/library/stdtypes.html#str.istitle) ## udf  isupper() ```python Signature theme={null} @pxt.udf isupper(self: pxt.String) -> pxt.Bool ``` Return `True` if all cased characters in the string are uppercase and there is at least one cased character, `False` otherwise. Equivalent to [`str.isupper()`](https://docs.python.org/3/library/stdtypes.html#str.isupper) ## udf  join() ```python Signature theme={null} @pxt.udf join(sep: pxt.String, elements: pxt.Json) -> pxt.String ``` Return a string which is the concatenation of the strings in `elements`. 
Equivalent to [`str.join()`](https://docs.python.org/3/library/stdtypes.html#str.join) ## udf  len() ```python Signature theme={null} @pxt.udf len(self: pxt.String) -> pxt.Int ``` Return the number of characters in the string. Equivalent to [`len(str)`](https://docs.python.org/3/library/functions.html#len) ## udf  ljust() ```python Signature theme={null} @pxt.udf ljust( self: pxt.String, width: pxt.Int, fillchar: pxt.String = ' ' ) -> pxt.String ``` Return the string left-justified in a string of length `width`. Equivalent to [`str.ljust()`](https://docs.python.org/3/library/stdtypes.html#str.ljust) **Parameters:** * **`width`** (`pxt.Int`): Minimum width of resulting string; additional characters will be filled with character defined in `fillchar`. * **`fillchar`** (`pxt.String`): Additional character for filling. ## udf  lower() ```python Signature theme={null} @pxt.udf lower(self: pxt.String) -> pxt.String ``` Return a copy of the string with all the cased characters converted to lowercase. Equivalent to [`str.lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower) ## udf  lstrip() ```python Signature theme={null} @pxt.udf lstrip( self: pxt.String, chars: pxt.String | None = None ) -> pxt.String ``` Return a copy of the string with leading characters removed. The `chars` argument is a string specifying the set of characters to be removed. If omitted or `None`, whitespace characters are removed. Equivalent to [`str.lstrip()`](https://docs.python.org/3/library/stdtypes.html#str.lstrip) **Parameters:** * **`chars`** (`pxt.String | None`): The set of characters to be removed. 
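Because each of these UDFs is documented as equivalent to a built-in `str` method, their semantics can be previewed with plain Python (a sketch of the underlying behavior; the UDF versions operate on Pixeltable column expressions rather than plain strings):

```python theme={null}
# isdecimal ⊂ isdigit ⊂ isnumeric: each predicate accepts a wider character set.
for ch in ['5', '²', 'Ⅷ']:  # plain digit, superscript two, Roman numeral
    print(ch, ch.isdecimal(), ch.isdigit(), ch.isnumeric())

# join() takes the separator first, mirroring str.join(sep, elements).
print(', '.join(['a', 'b', 'c']))

# ljust() pads on the right; lstrip() treats chars as a set, not a prefix.
print('42'.ljust(5, '0'))
print('www.example.com'.lstrip('w.'))
```

The last line illustrates a common pitfall: `chars` is a set of characters to remove, so `lstrip('w.')` strips every leading `w` and `.`, not the literal prefix `'w.'`.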
## udf  match() ```python Signature theme={null} @pxt.udf match( self: pxt.String, pattern: pxt.String, case: pxt.Bool = True, flags: pxt.Int = 0 ) -> pxt.Bool ``` Determine if string starts with a match of a regular expression **Parameters:** * **`pattern`** (`pxt.String`): regular expression pattern * **`case`** (`pxt.Bool`): if False, ignore case * **`flags`** (`pxt.Int`): [flags](https://docs.python.org/3/library/re.html#flags) for the `re` module ## udf  normalize() ```python Signature theme={null} @pxt.udf normalize(self: pxt.String, form: pxt.String) -> pxt.String ``` Return the Unicode normal form. Equivalent to [`unicodedata.normalize()`](https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize) **Parameters:** * **`form`** (`pxt.String`): Unicode normal form (`'NFC'`, `'NFKC'`, `'NFD'`, `'NFKD'`) ## udf  pad() ```python Signature theme={null} @pxt.udf pad( self: pxt.String, width: pxt.Int, side: pxt.String = 'left', fillchar: pxt.String = ' ' ) -> pxt.String ``` Pad string up to width **Parameters:** * **`width`** (`pxt.Int`): Minimum width of resulting string; additional characters will be filled with character defined in `fillchar`. * **`side`** (`pxt.String`): Side from which to fill resulting string (`'left'`, `'right'`, `'both'`) * **`fillchar`** (`pxt.String`): Additional character for filling ## udf  partition() ```python Signature theme={null} @pxt.udf partition(self: pxt.String, sep: pxt.String = ' ') -> pxt.Json ``` Splits string at the first occurrence of `sep`, and returns 3 elements containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return 3 elements containing string itself, followed by two empty strings. ## udf  removeprefix() ```python Signature theme={null} @pxt.udf removeprefix(self: pxt.String, prefix: pxt.String) -> pxt.String ``` Remove prefix. If the prefix is not present, returns string. 
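The semantics of `normalize()` and `partition()` follow their stdlib counterparts; a plain-Python sketch (not Pixeltable code) of the behavior documented above:

```python theme={null}
import unicodedata

# Precomposed 'é' vs. 'e' plus a combining accent: they render identically
# but only compare equal after Unicode normalization.
composed, decomposed = '\u00e9', 'e\u0301'
print(composed == decomposed)
print(unicodedata.normalize('NFC', decomposed) == composed)

# partition() always yields exactly three parts, even when sep is absent.
print('key=value'.partition('='))
print('no-separator'.partition('='))
```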
## udf  removesuffix() ```python Signature theme={null} @pxt.udf removesuffix(self: pxt.String, suffix: pxt.String) -> pxt.String ``` Remove suffix. If the suffix is not present, returns string. ## udf  repeat() ```python Signature theme={null} @pxt.udf repeat(self: pxt.String, n: pxt.Int) -> pxt.String ``` Repeat string `n` times. ## udf  replace() ```python Signature theme={null} @pxt.udf replace( self: pxt.String, substr: pxt.String, repl: pxt.String, n: pxt.Int | None = None ) -> pxt.String ``` Replace occurrences of `substr` with `repl`. Equivalent to [`str.replace()`](https://docs.python.org/3/library/stdtypes.html#str.replace). **Parameters:** * **`substr`** (`pxt.String`): string literal * **`repl`** (`pxt.String`): replacement string * **`n`** (`pxt.Int | None`): number of replacements to make (if `None`, replace all occurrences) ## udf  replace\_re() ```python Signature theme={null} @pxt.udf replace_re( self: pxt.String, pattern: pxt.String, repl: pxt.String, n: pxt.Int | None = None, flags: pxt.Int = 0 ) -> pxt.String ``` Replace occurrences of a regular expression pattern with `repl`. Equivalent to [`re.sub()`](https://docs.python.org/3/library/re.html#re.sub). **Parameters:** * **`pattern`** (`pxt.String`): regular expression pattern * **`repl`** (`pxt.String`): replacement string * **`n`** (`pxt.Int | None`): number of replacements to make (if `None`, replace all occurrences) * **`flags`** (`pxt.Int`): [flags](https://docs.python.org/3/library/re.html#flags) for the `re` module ## udf  reverse() ```python Signature theme={null} @pxt.udf reverse(self: pxt.String) -> pxt.String ``` Return a reversed copy of the string. Equivalent to `str[::-1]`. ## udf  rfind() ```python Signature theme={null} @pxt.udf rfind( self: pxt.String, substr: pxt.String, start: pxt.Int | None = 0, end: pxt.Int | None = None ) -> pxt.Int ``` Return the highest index where `substr` is found, such that `substr` is contained within `[start:end]`. 
Equivalent to [`str.rfind()`](https://docs.python.org/3/library/stdtypes.html#str.rfind). **Parameters:** * **`substr`** (`pxt.String`): substring to search for * **`start`** (`pxt.Int | None`): slice start * **`end`** (`pxt.Int | None`): slice end ## udf  rindex() ```python Signature theme={null} @pxt.udf rindex( self: pxt.String, substr: pxt.String, start: pxt.Int | None = 0, end: pxt.Int | None = None ) -> pxt.Int ``` Return the highest index where `substr` is found, such that `substr` is contained within `[start:end]`. Raises ValueError if `substr` is not found. Equivalent to [`str.rindex()`](https://docs.python.org/3/library/stdtypes.html#str.rindex). ## udf  rjust() ```python Signature theme={null} @pxt.udf rjust( self: pxt.String, width: pxt.Int, fillchar: pxt.String = ' ' ) -> pxt.String ``` Return the string right-justified in a string of length `width`. Equivalent to [`str.rjust()`](https://docs.python.org/3/library/stdtypes.html#str.rjust). **Parameters:** * **`width`** (`pxt.Int`): Minimum width of resulting string. * **`fillchar`** (`pxt.String`): Additional character for filling. ## udf  rpartition() ```python Signature theme={null} @pxt.udf rpartition(self: pxt.String, sep: pxt.String = ' ') -> pxt.Json ``` This method splits string at the last occurrence of `sep`, and returns a list containing the part before the separator, the separator itself, and the part after the separator. ## udf  rstrip() ```python Signature theme={null} @pxt.udf rstrip( self: pxt.String, chars: pxt.String | None = None ) -> pxt.String ``` Return a copy of string with trailing characters removed. Equivalent to [`str.rstrip()`](https://docs.python.org/3/library/stdtypes.html#str.rstrip). **Parameters:** * **`chars`** (`pxt.String | None`): The set of characters to be removed. If omitted or `None`, whitespace characters are removed. 
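The right-to-left variants above behave like their `str` counterparts; a plain-Python preview of the documented semantics:

```python theme={null}
path = 'a/b/c.txt'
# rfind() returns the index of the LAST occurrence; rpartition() splits there.
print(path.rfind('/'))
print(path.rpartition('/'))

# rstrip() removes any trailing characters drawn from the given set.
print('1.2.3.0.0'.rstrip('.0'))
```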
## udf  slice() ```python Signature theme={null} @pxt.udf slice( self: pxt.String, start: pxt.Int | None = None, stop: pxt.Int | None = None, step: pxt.Int | None = None ) -> pxt.String ``` Return a slice. **Parameters:** * **`start`** (`pxt.Int | None`): slice start * **`stop`** (`pxt.Int | None`): slice end * **`step`** (`pxt.Int | None`): slice step ## udf  slice\_replace() ```python Signature theme={null} @pxt.udf slice_replace( self: pxt.String, start: pxt.Int | None = None, stop: pxt.Int | None = None, repl: pxt.String | None = None ) -> pxt.String ``` Replace a positional slice with another value. **Parameters:** * **`start`** (`pxt.Int | None`): slice start * **`stop`** (`pxt.Int | None`): slice end * **`repl`** (`pxt.String | None`): replacement value ## udf  startswith() ```python Signature theme={null} @pxt.udf startswith(self: pxt.String, substr: pxt.String) -> pxt.Bool ``` Return `True` if string starts with `substr`, otherwise return `False`. Equivalent to [`str.startswith()`](https://docs.python.org/3/library/stdtypes.html#str.startswith). **Parameters:** * **`substr`** (`pxt.String`): string literal ## udf  strip() ```python Signature theme={null} @pxt.udf strip( self: pxt.String, chars: pxt.String | None = None ) -> pxt.String ``` Return a copy of string with leading and trailing characters removed. Equivalent to [`str.strip()`](https://docs.python.org/3/library/stdtypes.html#str.strip). **Parameters:** * **`chars`** (`pxt.String | None`): The set of characters to be removed. If omitted or `None`, whitespace characters are removed. ## udf  swapcase() ```python Signature theme={null} @pxt.udf swapcase(self: pxt.String) -> pxt.String ``` Return a copy of string with uppercase characters converted to lowercase and vice versa. Equivalent to [`str.swapcase()`](https://docs.python.org/3/library/stdtypes.html#str.swapcase). 
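A plain-Python sketch of the slicing behavior documented above. The `slice_replace` helper here is hypothetical (not part of Pixeltable), written only to illustrate the documented splice-into-a-slice semantics:

```python theme={null}
def slice_replace(s: str, start=None, stop=None, repl=''):
    # Hypothetical illustration: replace the slice s[start:stop] with repl.
    return s[:start] + repl + s[stop:]

print(slice_replace('abcdef', start=1, stop=4, repl='XY'))
print('  padded  '.strip())
print('Hello World'.swapcase())
```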
## udf  title() ```python Signature theme={null} @pxt.udf title(self: pxt.String) -> pxt.String ``` Return a titlecased version of string, i.e. words start with uppercase characters, all remaining cased characters are lowercase. Equivalent to [`str.title()`](https://docs.python.org/3/library/stdtypes.html#str.title). ## udf  upper() ```python Signature theme={null} @pxt.udf upper(self: pxt.String) -> pxt.String ``` Return a copy of string converted to uppercase. Equivalent to [`str.upper()`](https://docs.python.org/3/library/stdtypes.html#str.upper). ## udf  wrap() ```python Signature theme={null} @pxt.udf wrap(self: pxt.String, width: pxt.Int, **kwargs) -> pxt.Json ``` Wraps the single paragraph in string so every line is at most `width` characters long. Returns a list of output lines, without final newlines. Equivalent to [`textwrap.wrap()`](https://docs.python.org/3/library/textwrap.html#textwrap.wrap). **Parameters:** * **`width`** (`pxt.Int`): Maximum line width. * **`kwargs`** (`Any`): Additional keyword arguments to pass to `textwrap.wrap()`. ## udf  zfill() ```python Signature theme={null} @pxt.udf zfill(self: pxt.String, width: pxt.Int) -> pxt.String ``` Pad a numeric string with ASCII `0` on the left to a total length of `width`. Equivalent to [`str.zfill()`](https://docs.python.org/3/library/stdtypes.html#str.zfill). **Parameters:** * **`width`** (`pxt.Int`): Minimum width of resulting string. # Table Source: https://docs.pixeltable.com/sdk/latest/table View Source on GitHub # class  pixeltable.Table A handle to a table, view, or snapshot. This class is the primary interface through which table operations (queries, insertions, updates, etc.) are performed in Pixeltable. ## method  add\_column() ```python Signature theme={null} add_column( *, if_exists: Literal['error', 'ignore', 'replace', 'replace_force'] = 'error', **kwargs: type | ColumnSpec ) -> UpdateStatus ``` Adds an ordinary (non-computed) column to the table. 
**Parameters:** * **`kwargs`** (`type | ColumnSpec`): Exactly one keyword argument of the form `col_name=type` or `col_name=col_spec_dict`, where `col_spec_dict` is a [`ColumnSpec`](./columnspec) dict. * **`if_exists`** (`Literal['error', 'ignore', 'replace', 'replace_force']`, default: `'error'`): Determines the behavior if the column already exists. Must be one of the following: * `'error'`: an exception will be raised. * `'ignore'`: do nothing and return. * `'replace'` or `'replace_force'`: drop the existing column and add the new column, if it has no dependents. **Returns:** * `UpdateStatus`: Information about the execution status of the operation. **Examples:** Add an int column: ```python theme={null} tbl.add_column(new_col=pxt.Int) ``` Add a column with column metadata using a dict: ```python theme={null} tbl.add_column( img_col={ 'type': pxt.Image, 'stored': True, 'media_validation': 'on_write', } ) ``` Alternatively, adding a column can also be expressed using `add_columns`: ```python theme={null} tbl.add_columns({'new_col': pxt.Int}) ``` As well as with column metadata: ```python theme={null} tbl.add_columns( { 'img_col': { 'type': pxt.Image, 'stored': True, 'media_validation': 'on_write', } } ) ``` ## method  add\_columns() ```python Signature theme={null} add_columns( schema: Mapping[str, type | ColumnSpec], if_exists: Literal['error', 'ignore', 'replace', 'replace_force'] = 'error' ) -> UpdateStatus ``` Adds multiple columns to the table. The columns must be concrete (non-computed) columns; to add computed columns, use [`add_computed_column()`](./table#method-add_computed_column) instead. The format of the `schema` argument is a dict mapping column names to their types. **Parameters:** * **`schema`** (`Mapping[str, type | ColumnSpec]`): A dictionary mapping column names to a `type` or a [`ColumnSpec`](./columnspec) dict. 
* **`if_exists`** (`Literal['error', 'ignore', 'replace', 'replace_force']`, default: `'error'`): Determines the behavior if a column already exists. Must be one of the following: * `'error'`: an exception will be raised. * `'ignore'`: do nothing and return. * `'replace'` or `'replace_force'`: drop the existing column and add the new column, if it has no dependents. Note that the `if_exists` parameter is applied to all columns in the schema. To apply different behaviors to different columns, please use [`add_column()`](./table#method-add_column) for each column. **Returns:** * `UpdateStatus`: Information about the execution status of the operation. **Examples:** Add multiple columns to the table `my_table`: ```python theme={null} tbl = pxt.get_table('my_table') schema = {'new_col_1': pxt.Int, 'new_col_2': pxt.String} tbl.add_columns(schema) ``` It is also possible to specify column metadata using a dict: ```python theme={null} tbl = pxt.get_table('my_table') schema = { 'new_col_1': { 'type': pxt.Image, 'stored': True, 'media_validation': 'on_write', }, 'new_col_2': pxt.String, } tbl.add_columns(schema) ``` ## method  add\_computed\_column() ```python Signature theme={null} add_computed_column( *, stored: bool | None = None, destination: str | Path | None = None, print_stats: bool = False, on_error: Literal['abort', 'ignore'] = 'abort', if_exists: Literal['error', 'ignore', 'replace'] = 'error', **kwargs: exprs.Expr ) -> UpdateStatus ``` Adds a computed column to the table. **Parameters:** * **`kwargs`** (`exprs.Expr`): Exactly one keyword argument of the form `col_name=expression`. * **`stored`** (`bool | None`): Whether the column is materialized and stored or computed on demand. * **`destination`** (`str | Path | None`): An object store reference for persisting computed files. * **`print_stats`** (`bool`, default: `False`): If `True`, print execution metrics during evaluation. 
* **`on_error`** (`Literal['abort', 'ignore']`, default: `'abort'`): Determines the behavior if an error occurs while evaluating the column expression for at least one row. * `'abort'`: an exception will be raised and the column will not be added. * `'ignore'`: execution will continue and the column will be added. Any rows with errors will have a `None` value for the column, with information about the error stored in the corresponding `tbl.col_name.errormsg` and `tbl.col_name.errortype` fields. * **`if_exists`** (`Literal['error', 'ignore', 'replace']`, default: `'error'`): Determines the behavior if the column already exists. Must be one of the following: * `'error'`: an exception will be raised. * `'ignore'`: do nothing and return. * `'replace'`: drop the existing column and add the new column, if it has no dependents. **Returns:** * `UpdateStatus`: Information about the execution status of the operation. **Examples:** For a table with an image column `frame`, add an image column `rotated` that rotates the image by 90 degrees: ```python theme={null} tbl.add_computed_column(rotated=tbl.frame.rotate(90)) ``` Do the same, but now the column is unstored: ```python theme={null} tbl.add_computed_column(rotated=tbl.frame.rotate(90), stored=False) ``` ## method  add\_embedding\_index() ```python Signature theme={null} add_embedding_index( column: str | ColumnRef, *, idx_name: str | None = None, embedding: pxt.Function | None = None, string_embed: pxt.Function | None = None, image_embed: pxt.Function | None = None, metric: Literal['cosine', 'ip', 'l2'] = 'cosine', precision: Literal['fp16', 'fp32'] = 'fp16', if_exists: Literal['error', 'ignore', 'replace', 'replace_force'] = 'error' ) -> None ``` Add an embedding index to the table. Once the index is created, it will be automatically kept up-to-date as new rows are inserted into the table. To add an embedding index, one must specify, at minimum, the column to be indexed and an embedding UDF. 
Only `String` and `Image` columns are currently supported. **Parameters:** * **`column`** (`str | ColumnRef`): The name of, or reference to, the column to be indexed; must be a `String` or `Image` column. * **`idx_name`** (`str | None`): An optional name for the index. If not specified, a name such as `'idx0'` will be generated automatically. If specified, the name must be unique for this table and a valid pixeltable column name. * **`embedding`** (`pxt.Function | None`): The UDF to use for the embedding. Must be a UDF that accepts a single argument of type `String` or `Image` (as appropriate for the column being indexed) and returns a fixed-size 1-dimensional array of floats. * **`string_embed`** (`pxt.Function | None`): An optional UDF to use for the string embedding component of this index. Can be used in conjunction with `image_embed` to construct multimodal embeddings manually, by specifying different embedding functions for different data types. * **`image_embed`** (`pxt.Function | None`): An optional UDF to use for the image embedding component of this index. Can be used in conjunction with `string_embed` to construct multimodal embeddings manually, by specifying different embedding functions for different data types. * **`metric`** (`Literal['cosine', 'ip', 'l2']`, default: `'cosine'`): Distance metric to use for the index; one of `'cosine'`, `'ip'`, or `'l2'`. The default is `'cosine'`. * **`precision`** (`Literal['fp16', 'fp32']`, default: `'fp16'`): level of precision for the embeddings; one of `'fp16'` or `'fp32'`. * **`if_exists`** (`Literal['error', 'ignore', 'replace', 'replace_force']`, default: `'error'`): Directive for handling an existing index with the same name. Must be one of the following: * `'error'`: raise an error if an index with the same name already exists. * `'ignore'`: do nothing if an index with the same name already exists. * `'replace'` or `'replace_force'`: replace the existing index with the new one. 
**Examples:** Add an index to the `img` column of the table `my_table`: ```python theme={null} from pixeltable.functions.huggingface import clip tbl = pxt.get_table('my_table') embedding_fn = clip.using(model_id='openai/clip-vit-base-patch32') tbl.add_embedding_index(tbl.img, embedding=embedding_fn) ``` Alternatively, the `img` column may be specified by name: ```python theme={null} tbl.add_embedding_index('img', embedding=embedding_fn) ``` Once the index is created, similarity lookups can be performed using the `similarity` pseudo-function: ```python theme={null} sim = tbl.img.similarity( image='/path/to/my-image.jpg' # can also be a URL or a PIL image ) tbl.select(tbl.img, sim).order_by(sim, asc=False).limit(5) ``` If the embedding UDF is a multimodal embedding (supporting more than one data type), then lookups may be performed using any of its supported modalities. In our example, CLIP supports both text and images, so we can also search for images using a text description: ```python theme={null} sim = tbl.img.similarity(string='a picture of a train') tbl.select(tbl.img, sim).order_by(sim, asc=False).limit(5) ``` Audio and video lookups would look like this: ```python theme={null} sim = tbl.img.similarity(audio='/path/to/audio.flac') sim = tbl.img.similarity(video='/path/to/video.mp4') ``` Multiple indexes can be defined on each column. 
Add a second index to the `img` column, using the inner product as the distance metric, and with a specific name: ```python theme={null} tbl.add_embedding_index( tbl.img, idx_name='ip_idx', embedding=embedding_fn, metric='ip' ) ``` Add an index using separately specified string and image embeddings: ```python theme={null} tbl.add_embedding_index( tbl.img, string_embed=string_embedding_fn, image_embed=image_embedding_fn, ) ``` ## method  batch\_update() ```python Signature theme={null} batch_update( rows: Iterable[dict[str, Any]], cascade: bool = True, if_not_exists: Literal['error', 'ignore', 'insert'] = 'error' ) -> UpdateStatus ``` Update rows in this table. **Parameters:** * **`rows`** (`Iterable[dict[str, Any]]`): an Iterable of dictionaries containing values for the updated columns plus values for the primary key columns. * **`cascade`** (`bool`, default: `True`): if True, also update all computed columns that transitively depend on the updated columns. * **`if_not_exists`** (`Literal['error', 'ignore', 'insert']`, default: `'error'`): Specifies the behavior if a row to update does not exist: * `'error'`: Raise an error. * `'ignore'`: Skip the row silently. * `'insert'`: Insert the row. **Examples:** Update the `name` and `age` columns for the rows with ids 1 and 2 (assuming `id` is the primary key). If either row does not exist, this raises an error: ```python theme={null} tbl.batch_update( [ {'id': 1, 'name': 'Alice', 'age': 30}, {'id': 2, 'name': 'Bob', 'age': 40}, ] ) ``` Update the `name` and `age` columns for the row with `id` 1 (assuming `id` is the primary key) and insert the row with new `id` 3 (assuming this key does not exist): ```python theme={null} tbl.batch_update( [ {'id': 1, 'name': 'Alice', 'age': 30}, {'id': 3, 'name': 'Bob', 'age': 40}, ], if_not_exists='insert', ) ``` ## method  collect() ```python Signature theme={null} collect() -> pxt._query.ResultSet ``` Return rows from this table. 
## method  columns() ```python Signature theme={null} columns() -> list[str] ``` Return the names of the columns in this table. ## method  count() ```python Signature theme={null} count() -> int ``` Return the number of rows in this table. ## method  delete() ```python Signature theme={null} delete(where: exprs.Expr | None = None) -> UpdateStatus ``` Delete rows in this table. **Parameters:** * **`where`** (`'exprs.Expr' | None`): a predicate to filter rows to delete. **Examples:** Delete all rows in a table: ```python theme={null} tbl.delete() ``` Delete all rows in a table where column `a` is greater than 5: ```python theme={null} tbl.delete(tbl.a > 5) ``` ## method  describe() ```python Signature theme={null} describe() -> None ``` Print the table schema. ## method  distinct() ```python Signature theme={null} distinct() -> pxt.Query ``` Remove duplicate rows from table. ## method  drop\_column() ```python Signature theme={null} drop_column( column: str | ColumnRef, if_not_exists: Literal['error', 'ignore'] = 'error' ) -> None ``` Drop a column from the table. **Parameters:** * **`column`** (`str | ColumnRef`): The name or reference of the column to drop. * **`if_not_exists`** (`Literal['error', 'ignore']`, default: `'error'`): Directive for handling a non-existent column. Must be one of the following: * `'error'`: raise an error if the column does not exist. * `'ignore'`: do nothing if the column does not exist. 
**Examples:** Drop the column `col` from the table `my_table` by column name: ```python theme={null} tbl = pxt.get_table('my_table') tbl.drop_column('col') ``` Drop the column `col` from the table `my_table` by column reference: ```python theme={null} tbl = pxt.get_table('my_table') tbl.drop_column(tbl.col) ``` Drop the column `col` from the table `my_table` if it exists, otherwise do nothing: ```python theme={null} tbl = pxt.get_table('my_table') tbl.drop_column(tbl.col, if_not_exists='ignore') ``` ## method  drop\_embedding\_index() ```python Signature theme={null} drop_embedding_index( *, column: str | ColumnRef | None = None, idx_name: str | None = None, if_not_exists: Literal['error', 'ignore'] = 'error' ) -> None ``` Drop an embedding index from the table. Either a column name or an index name (but not both) must be specified. If a column name or reference is specified, it must be a column containing exactly one embedding index; otherwise the specific index name must be provided instead. **Parameters:** * **`column`** (`str | ColumnRef | None`): The name of, or reference to, the column from which to drop the index. The column must have only one embedding index. * **`idx_name`** (`str | None`): The name of the index to drop. * **`if_not_exists`** (`Literal['error', 'ignore']`, default: `'error'`): Directive for handling a non-existent index. Must be one of the following: * `'error'`: raise an error if the index does not exist. * `'ignore'`: do nothing if the index does not exist. Note that the `if_not_exists` parameter is only applicable when an `idx_name` is specified and it does not exist, or when `column` is specified and it has no index. `if_not_exists` does not apply to a non-existent column. 
**Examples:** Drop the embedding index on the `img` column of the table `my_table` by column name: ```python theme={null} tbl = pxt.get_table('my_table') tbl.drop_embedding_index(column='img') ``` Drop the embedding index on the `img` column of the table `my_table` by column reference: ```python theme={null} tbl = pxt.get_table('my_table') tbl.drop_embedding_index(column=tbl.img) ``` Drop the embedding index `idx1` of the table `my_table` by index name: ```python theme={null} tbl = pxt.get_table('my_table') tbl.drop_embedding_index(idx_name='idx1') ``` Drop the embedding index `idx1` of the table `my_table` by index name, if it exists, otherwise do nothing: ```python theme={null} tbl = pxt.get_table('my_table') tbl.drop_embedding_index(idx_name='idx1', if_not_exists='ignore') ``` ## method  drop\_index() ```python Signature theme={null} drop_index( *, column: str | ColumnRef | None = None, idx_name: str | None = None, if_not_exists: Literal['error', 'ignore'] = 'error' ) -> None ``` Drop an index from the table. Either a column name or an index name (but not both) must be specified. If a column name or reference is specified, it must be a column containing exactly one index; otherwise the specific index name must be provided instead. **Parameters:** * **`column`** (`str | ColumnRef | None`): The name of, or reference to, the column from which to drop the index. The column must have only one index. * **`idx_name`** (`str | None`): The name of the index to drop. * **`if_not_exists`** (`Literal['error', 'ignore']`, default: `'error'`): Directive for handling a non-existent index. Must be one of the following: * `'error'`: raise an error if the index does not exist. * `'ignore'`: do nothing if the index does not exist. Note that the `if_not_exists` parameter is only applicable when an `idx_name` is specified and it does not exist, or when `column` is specified and it has no index. `if_not_exists` does not apply to a non-existent column. 
**Examples:** Drop the index on the `img` column of the table `my_table` by column name:

```python theme={null}
tbl = pxt.get_table('my_table')
tbl.drop_index(column='img')
```

Drop the index on the `img` column of the table `my_table` by column reference:

```python theme={null}
tbl = pxt.get_table('my_table')
tbl.drop_index(column=tbl.img)
```

Drop the index `idx1` of the table `my_table` by index name:

```python theme={null}
tbl = pxt.get_table('my_table')
tbl.drop_index(idx_name='idx1')
```

Drop the index `idx1` of the table `my_table` by index name, if it exists, otherwise do nothing:

```python theme={null}
tbl = pxt.get_table('my_table')
tbl.drop_index(idx_name='idx1', if_not_exists='ignore')
```

## method  get\_metadata()

```python Signature theme={null}
get_metadata() -> TableMetadata
```

Retrieves metadata associated with this table.

**Returns:**

* `'TableMetadata'`: A [TableMetadata](./tablemetadata) instance containing this table's metadata.

## method  get\_versions()

```python Signature theme={null}
get_versions(n: int | None = None) -> list[VersionMetadata]
```

Returns information about versions of this table, most recent first. `get_versions()` is intended for programmatic access to version metadata; for human-readable output, use [`history()`](./table#method-history) instead.

**Parameters:**

* **`n`** (`int | None`): If specified, at most `n` versions will be returned.

**Returns:**

* `list[VersionMetadata]`: A list of [VersionMetadata](./versionmetadata) dictionaries, one per version retrieved, most recent first.

**Examples:** Retrieve metadata about all versions of the table `tbl`:

```python theme={null}
tbl.get_versions()
```

Retrieve metadata about the most recent 5 versions of the table `tbl`:

```python theme={null}
tbl.get_versions(n=5)
```

## method  group\_by()

```python Signature theme={null}
group_by(*items: exprs.Expr) -> pxt.Query
```

Group the rows of this table based on the expression.
See [`Query.group_by`](./query#method-group_by) for more details.

## method  head()

```python Signature theme={null}
head(*args: Any, **kwargs: Any) -> pxt._query.ResultSet
```

Return the first n rows inserted into this table.

## method  history()

```python Signature theme={null}
history(n: int | None = None) -> pd.DataFrame
```

Returns a human-readable report about versions of this table. `history()` is intended for human-readable output of version metadata; for programmatic access, use [`get_versions()`](./table#method-get_versions) instead.

**Parameters:**

* **`n`** (`int | None`): If specified, at most `n` versions will be reported.

**Returns:**

* `pd.DataFrame`: A report with information about each version, one per row, most recent first.

**Examples:** Report all versions of the table:

```python theme={null}
tbl.history()
```

Report only the most recent 5 changes to the table:

```python theme={null}
tbl.history(n=5)
```

## method  insert()

```python Signatures theme={null}
# Signature 1:
insert(
    source: TableDataSource,
    /,
    *,
    source_format: Literal['csv', 'excel', 'parquet', 'json'] | None = None,
    schema_overrides: dict[str, ts.ColumnType] | None = None,
    on_error: Literal['abort', 'ignore'] = 'abort',
    print_stats: bool = False,
    **kwargs: Any
) -> UpdateStatus

# Signature 2:
insert(
    *,
    on_error: Literal['abort', 'ignore'] = 'abort',
    print_stats: bool = False,
    **kwargs: Any
) -> UpdateStatus
```

Inserts rows into this table. There are two mutually exclusive call patterns: To insert multiple rows at a time:

```python theme={null}
insert(
    source: TableDataSource,
    /,
    *,
    on_error: Literal['abort', 'ignore'] = 'abort',
    print_stats: bool = False,
    **kwargs: Any,
)
```

To insert just a single row, you can use the more concise syntax:

```python theme={null}
insert(
    *,
    on_error: Literal['abort', 'ignore'] = 'abort',
    print_stats: bool = False,
    **kwargs: Any
)
```

**Parameters:**

* **`source`** (`TableDataSource | None`): A data source from which data can be imported.
* **`kwargs`** (`Any`): (if inserting a single row) Keyword-argument pairs representing column names and values. (if inserting multiple rows) Additional keyword arguments are passed to the data source. * **`source_format`** (`Literal['csv', 'excel', 'parquet', 'json'] | None`): A hint about the format of the source data * **`schema_overrides`** (`dict[str, ts.ColumnType] | None`): If specified, then columns in `schema_overrides` will be given the specified types * **`on_error`** (`Literal['abort', 'ignore']`, default: `'abort'`): Determines the behavior if an error occurs while evaluating a computed column or detecting an invalid media file (such as a corrupt image) for one of the inserted rows. * If `on_error='abort'`, then an exception will be raised and the rows will not be inserted. * If `on_error='ignore'`, then execution will continue and the rows will be inserted. Any cells with errors will have a `None` value for that cell, with information about the error stored in the corresponding `tbl.col_name.errortype` and `tbl.col_name.errormsg` fields. * **`print_stats`** (`bool`, default: `False`): If `True`, print statistics about the cost of computed columns. **Returns:** * `UpdateStatus`: An [`UpdateStatus`](./updatestatus) object containing information about the update. **Examples:** Insert two rows into the table `my_table` with three int columns `a`, `b`, and `c`. 
Column `c` is nullable:

```python theme={null}
tbl = pxt.get_table('my_table')
tbl.insert([{'a': 1, 'b': 1, 'c': 1}, {'a': 2, 'b': 2}])
```

Insert a single row using the alternative syntax:

```python theme={null}
tbl.insert(a=3, b=3, c=3)
```

Insert rows from a CSV file:

```python theme={null}
tbl.insert('path/to/file.csv')
```

Insert Pydantic model instances into a table with two `pxt.Int` columns `a` and `b`:

```python theme={null}
import pydantic

class MyModel(pydantic.BaseModel):
    a: int
    b: int

models = [MyModel(a=1, b=2), MyModel(a=3, b=4)]
tbl.insert(models)
```

## method  join()

```python Signature theme={null}
join(
    other: Table,
    *,
    on: exprs.Expr | None = None,
    how: pixeltable.plan.JoinType.LiteralType = 'inner'
) -> pxt.Query
```

Join this table with another table.

## method  limit()

```python Signature theme={null}
limit(n: int, offset: int | None = None) -> pxt.Query
```

Select a limited number of rows from the Table, optionally skipping rows for pagination.

**Parameters:**

* **`n`** (`int`): Number of rows to select.
* **`offset`** (`int | None`): Number of rows to skip before returning results. Default is None (no offset).

**Returns:**

* `'pxt.Query'`: A Query with the specified limited rows.

**Examples:** Get the first 10 rows:

```python theme={null}
t.limit(10).collect()
```

Get rows 21-30 (skip first 20, return next 10):

```python theme={null}
t.limit(10, offset=20).collect()
```

## method  list\_views()

```python Signature theme={null}
list_views(*, recursive: bool = True) -> list[str]
```

Returns a list of all views and snapshots of this `Table`.

**Parameters:**

* **`recursive`** (`bool`, default: `True`): If `False`, returns only the immediate successor views of this `Table`. If `True`, returns all sub-views (including views of views, etc.)

**Returns:**

* `list[str]`: A list of view paths.
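The `limit`/`offset` semantics follow standard pagination: skip `offset` rows, then take the next `n`. A minimal plain-Python sketch of the same arithmetic (illustrative only, not Pixeltable internals):

```python
# Stand-in for 100 table rows in insertion order
rows = list(range(1, 101))

n, offset = 10, 20
# Analog of t.limit(10, offset=20): skip 20 rows, take the next 10
page = rows[offset:offset + n]
print(page)  # rows 21 through 30
```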
## method  order\_by() ```python Signature theme={null} order_by(*items: exprs.Expr, asc: bool = True) -> pxt.Query ``` Order the rows of this table based on the expression. See [`Query.order_by`](./query#method-order_by) for more details. ## method  pull() ```python Signature theme={null} pull() -> None ``` ## method  push() ```python Signature theme={null} push() -> None ``` ## method  recompute\_columns() ```python Signature theme={null} recompute_columns( *columns: str | ColumnRef, where: exprs.Expr | None = None, errors_only: bool = False, cascade: bool = True ) -> UpdateStatus ``` Recompute the values in one or more computed columns of this table. **Parameters:** * **`columns`** (`str | ColumnRef`): The names or references of the computed columns to recompute. * **`where`** (`'exprs.Expr' | None`): A predicate to filter rows to recompute. * **`errors_only`** (`bool`, default: `False`): If True, only run the recomputation for rows that have errors in the column (ie, the column's `errortype` property indicates that an error occurred). Only allowed for recomputing a single column. * **`cascade`** (`bool`, default: `True`): if True, also update all computed columns that transitively depend on the recomputed columns. 
**Examples:** Recompute computed columns `c1` and `c2` for all rows in this table, and everything that transitively depends on them:

```python theme={null}
tbl.recompute_columns('c1', 'c2')
```

Recompute computed columns `c1` and `c2` for all rows in this table, but don't recompute other columns that depend on them:

```python theme={null}
tbl.recompute_columns(tbl.c1, tbl.c2, cascade=False)
```

Recompute column `c1` and its dependents, but only for rows with `c2` == 0:

```python theme={null}
tbl.recompute_columns('c1', where=tbl.c2 == 0)
```

Recompute column `c1` and its dependents, but only for rows that have errors in it:

```python theme={null}
tbl.recompute_columns('c1', errors_only=True)
```

## method  rename\_column()

```python Signature theme={null}
rename_column(old_name: str, new_name: str) -> None
```

Rename a column.

**Parameters:**

* **`old_name`** (`str`): The current name of the column.
* **`new_name`** (`str`): The new name of the column.

**Examples:** Rename the column `col1` to `col2` of the table `my_table`:

```python theme={null}
tbl = pxt.get_table('my_table')
tbl.rename_column('col1', 'col2')
```

## method  revert()

```python Signature theme={null}
revert() -> None
```

Reverts the table to the previous version.

**Warning:** This operation is irreversible.

## method  sample()

```python Signature theme={null}
sample(
    n: int | None = None,
    n_per_stratum: int | None = None,
    fraction: float | None = None,
    seed: int | None = None,
    stratify_by: Any = None
) -> pxt.Query
```

Choose a shuffled sample of rows. See [`Query.sample`](./query#method-sample) for more details.

## method  select()

```python Signature theme={null}
select(*items: Any, **named_items: Any) -> pxt.Query
```

Select columns or expressions from this table. See [`Query.select`](./query#method-select) for more details.

## method  show()

```python Signature theme={null}
show(*args: Any, **kwargs: Any) -> pxt._query.ResultSet
```

Return rows from this table.
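The `seed` parameter of `sample()` makes a sample reproducible: the same seed over the same data yields the same rows. The idea can be illustrated with plain-Python random sampling (an analogy only; Pixeltable's sampling algorithm may differ):

```python
import random

rows = list(range(100))

# With a fixed seed, repeated samples of the same data are identical
sample_a = random.Random(42).sample(rows, 10)
sample_b = random.Random(42).sample(rows, 10)
assert sample_a == sample_b

# A different seed generally yields a different sample
sample_c = random.Random(7).sample(rows, 10)
```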
## method  sync() ```python Signature theme={null} sync( stores: str | list[str] | None = None, *, export_data: bool = True, import_data: bool = True ) -> UpdateStatus ``` Synchronizes this table with its linked external stores. **Parameters:** * **`stores`** (`str | list[str] | None`): If specified, will synchronize only the specified named store or list of stores. If not specified, will synchronize all of this table's external stores. * **`export_data`** (`bool`, default: `True`): If `True`, data from this table will be exported to the external stores during synchronization. * **`import_data`** (`bool`, default: `True`): If `True`, data from the external stores will be imported to this table during synchronization. ## method  tail() ```python Signature theme={null} tail(*args: Any, **kwargs: Any) -> pxt._query.ResultSet ``` Return the last n rows inserted into this table. ## method  unlink\_external\_stores() ```python Signature theme={null} unlink_external_stores( stores: str | list[str] | None = None, *, delete_external_data: bool = False, ignore_errors: bool = False ) -> None ``` Unlinks this table's external stores. **Parameters:** * **`stores`** (`str | list[str] | None`): If specified, will unlink only the specified named store or list of stores. If not specified, will unlink all of this table's external stores. * **`ignore_errors`** (`bool`, default: `False`): If `True`, no exception will be thrown if a specified store is not linked to this table. * **`delete_external_data`** (`bool`, default: `False`): If `True`, then the external data store will also be deleted. WARNING: This is a destructive operation that will delete data outside Pixeltable, and cannot be undone. ## method  update() ```python Signature theme={null} update( value_spec: dict[str, Any], where: exprs.Expr | None = None, cascade: bool = True ) -> UpdateStatus ``` Update rows in this table. 
**Parameters:** * **`value_spec`** (`dict[str, Any]`): a dictionary mapping column names to literal values or Pixeltable expressions. * **`where`** (`'exprs.Expr' | None`): a predicate to filter rows to update. * **`cascade`** (`bool`, default: `True`): if True, also update all computed columns that transitively depend on the updated columns. **Returns:** * `UpdateStatus`: An [`UpdateStatus`](./updatestatus) object containing information about the update. **Examples:** Set column `int_col` to 1 for all rows: ```python theme={null} tbl.update({'int_col': 1}) ``` Set column `int_col` to 1 for all rows where `int_col` is 0: ```python theme={null} tbl.update({'int_col': 1}, where=tbl.int_col == 0) ``` Set `int_col` to the value of `other_int_col` + 1: ```python theme={null} tbl.update({'int_col': tbl.other_int_col + 1}) ``` Increment `int_col` by 1 for all rows where `int_col` is 0: ```python theme={null} tbl.update({'int_col': tbl.int_col + 1}, where=tbl.int_col == 0) ``` ## method  where() ```python Signature theme={null} where(pred: exprs.Expr) -> pxt.Query ``` Filter rows from this table based on the expression. See [`Query.where`](./query#method-where) for more details. # TableMetadata Source: https://docs.pixeltable.com/sdk/latest/tablemetadata View Source on GitHub # class  pixeltable.TableMetadata Metadata for a Pixeltable table. ## attr  base ``` base: str | None ``` If this table is a view or snapshot, the full path of its base table; otherwise `None`. ## attr  columns ``` columns: dict[str, ColumnMetadata] ``` Column metadata for all of the visible columns of the table. ## attr  comment ``` comment: str | None ``` User-provided table comment, if one exists. ## attr  custom\_metadata ``` custom_metadata: Any ``` User-defined JSON metadata for this table, if any. ## attr  indices ``` indices: dict[str, IndexMetadata] ``` Index metadata for all of the indices of the table. 
## attr  is\_replica ``` is_replica: bool ``` `True` if this table is a replica of another (shared) table. ## attr  is\_snapshot ``` is_snapshot: bool ``` `True` if this table is a snapshot. ## attr  is\_view ``` is_view: bool ``` `True` if this table is a view. ## attr  media\_validation ``` media_validation: Literal['on_read', 'on_write'] ``` The media validation policy for this table. ## attr  name ``` name: str ``` The name of the table (ex: `'my_table'`). ## attr  path ``` path: str ``` The full path of the table (ex: `'my_dir.my_subdir.my_table'`). ## attr  schema\_version ``` schema_version: int ``` The current schema version of the table. ## attr  version ``` version: int ``` The current version of the table. ## attr  version\_created ``` version_created: datetime.datetime ``` The timestamp when this table version was created. # timestamp Source: https://docs.pixeltable.com/sdk/latest/timestamp View Source on GitHub # module  pixeltable.functions.timestamp Pixeltable UDFs for `TimestampType`. Usage example: ```python theme={null} import pixeltable as pxt t = pxt.get_table(...) t.select(t.timestamp_col.year, t.timestamp_col.weekday()).collect() ``` ## udf  astimezone() ```python Signature theme={null} @pxt.udf astimezone(self: pxt.Timestamp, tz: pxt.String) -> pxt.Timestamp ``` Convert the datetime to the given time zone. **Parameters:** * **`tz`** (`pxt.String`): The time zone to convert to. Must be a valid time zone name from the [IANA Time Zone Database](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones). ## udf  day() ```python Signature theme={null} @pxt.udf day(self: pxt.Timestamp) -> pxt.Int ``` Between 1 and the number of days in the given month of the given year. Equivalent to [`datetime.day`](https://docs.python.org/3/library/datetime.html#datetime.datetime.day). ## udf  hour() ```python Signature theme={null} @pxt.udf hour(self: pxt.Timestamp) -> pxt.Int ``` Between 0 and 23 inclusive. 
Equivalent to [`datetime.hour`](https://docs.python.org/3/library/datetime.html#datetime.datetime.hour). ## udf  isocalendar() ```python Signature theme={null} @pxt.udf isocalendar(self: pxt.Timestamp) -> pxt.Json ``` Return a dictionary with three entries: `'year'`, `'week'`, and `'weekday'`. Equivalent to [`datetime.isocalendar()`](https://docs.python.org/3/library/datetime.html#datetime.datetime.isocalendar). ## udf  isoformat() ```python Signature theme={null} @pxt.udf isoformat( self: pxt.Timestamp, sep: pxt.String = 'T', timespec: pxt.String = 'auto' ) -> pxt.String ``` Return a string representing the date and time in ISO 8601 format. Equivalent to [`datetime.isoformat()`](https://docs.python.org/3/library/datetime.html#datetime.datetime.isoformat). **Parameters:** * **`sep`** (`pxt.String`): Separator between date and time. * **`timespec`** (`pxt.String`): The number of additional terms in the output. See the [`datetime.isoformat()`](https://docs.python.org/3/library/datetime.html#datetime.datetime.isoformat) documentation for more details. ## udf  isoweekday() ```python Signature theme={null} @pxt.udf isoweekday(self: pxt.Timestamp) -> pxt.Int ``` Return the day of the week as an integer, where Monday is 1 and Sunday is 7. Equivalent to [`datetime.isoweekday()`](https://docs.python.org/3/library/datetime.html#datetime.datetime.isoweekday). ## udf  make\_timestamp() ```python Signature theme={null} @pxt.udf make_timestamp( year: pxt.Int, month: pxt.Int, day: pxt.Int, hour: pxt.Int = 0, minute: pxt.Int = 0, second: pxt.Int = 0, microsecond: pxt.Int = 0 ) -> pxt.Timestamp ``` Create a timestamp. Equivalent to [`datetime()`](https://docs.python.org/3/library/datetime.html#datetime.datetime). ## udf  microsecond() ```python Signature theme={null} @pxt.udf microsecond(self: pxt.Timestamp) -> pxt.Int ``` Between 0 and 999999 inclusive. Equivalent to [`datetime.microsecond`](https://docs.python.org/3/library/datetime.html#datetime.datetime.microsecond). 
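Each UDF in this module mirrors a `datetime` attribute or method, so the expected values can be checked directly against the standard library. For example (plain Python, no Pixeltable required):

```python
from datetime import datetime

ts = datetime(2024, 3, 1, 15, 30, 45, 123456)

# isocalendar(): ISO year, week number, and weekday (Mon=1 .. Sun=7)
iso = ts.isocalendar()
print(iso.year, iso.week, iso.weekday)  # 2024 9 5

# isoweekday(): March 1, 2024 is a Friday
print(ts.isoweekday())  # 5

# isoformat() with a custom separator and second precision
print(ts.isoformat(sep=' ', timespec='seconds'))  # 2024-03-01 15:30:45
```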
## udf  minute() ```python Signature theme={null} @pxt.udf minute(self: pxt.Timestamp) -> pxt.Int ``` Between 0 and 59 inclusive. Equivalent to [`datetime.minute`](https://docs.python.org/3/library/datetime.html#datetime.datetime.minute). ## udf  month() ```python Signature theme={null} @pxt.udf month(self: pxt.Timestamp) -> pxt.Int ``` Between 1 and 12 inclusive. Equivalent to [`datetime.month`](https://docs.python.org/3/library/datetime.html#datetime.datetime.month). ## udf  posix\_timestamp() ```python Signature theme={null} @pxt.udf posix_timestamp(self: pxt.Timestamp) -> pxt.Float ``` Return POSIX timestamp corresponding to the datetime instance. Equivalent to [`datetime.timestamp()`](https://docs.python.org/3/library/datetime.html#datetime.datetime.timestamp). ## udf  replace() ```python Signature theme={null} @pxt.udf replace( self: pxt.Timestamp, year: pxt.Int | None = None, month: pxt.Int | None = None, day: pxt.Int | None = None, hour: pxt.Int | None = None, minute: pxt.Int | None = None, second: pxt.Int | None = None, microsecond: pxt.Int | None = None ) -> pxt.Timestamp ``` Return a datetime with the same attributes, except for those attributes given new values by whichever keyword arguments are specified. Equivalent to [`datetime.replace()`](https://docs.python.org/3/library/datetime.html#datetime.datetime.replace). ## udf  second() ```python Signature theme={null} @pxt.udf second(self: pxt.Timestamp) -> pxt.Int ``` Between 0 and 59 inclusive. Equivalent to [`datetime.second`](https://docs.python.org/3/library/datetime.html#datetime.datetime.second). ## udf  strftime() ```python Signature theme={null} @pxt.udf strftime(self: pxt.Timestamp, format: pxt.String) -> pxt.String ``` Return a string representing the date and time, controlled by an explicit format string. Equivalent to [`datetime.strftime()`](https://docs.python.org/3/library/datetime.html#datetime.datetime.strftime). 
**Parameters:** * **`format`** (`pxt.String`): The format string to control the output. For a complete list of formatting directives, see [`strftime()` and `strptime()` Behavior](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior). ## udf  toordinal() ```python Signature theme={null} @pxt.udf toordinal(self: pxt.Timestamp) -> pxt.Int ``` Return the proleptic Gregorian ordinal of the date, where January 1 of year 1 has ordinal 1. Equivalent to [`datetime.toordinal()`](https://docs.python.org/3/library/datetime.html#datetime.datetime.toordinal). ## udf  weekday() ```python Signature theme={null} @pxt.udf weekday(self: pxt.Timestamp) -> pxt.Int ``` Between 0 (Monday) and 6 (Sunday) inclusive. Equivalent to [`datetime.weekday()`](https://docs.python.org/3/library/datetime.html#datetime.datetime.weekday). ## udf  year() ```python Signature theme={null} @pxt.udf year(self: pxt.Timestamp) -> pxt.Int ``` Between [`MINYEAR`](https://docs.python.org/3/library/datetime.html#datetime.MINYEAR) and [`MAXYEAR`](https://docs.python.org/3/library/datetime.html#datetime.MAXYEAR) inclusive. Equivalent to [`datetime.year`](https://docs.python.org/3/library/datetime.html#datetime.datetime.year). # together Source: https://docs.pixeltable.com/sdk/latest/together View Source on GitHub # module  pixeltable.functions.together Pixeltable UDFs that wrap various endpoints from the Together AI API. In order to use them, you must first `pip install together` and configure your Together AI credentials, as described in the [Working with Together AI](https://docs.pixeltable.com/notebooks/integrations/working-with-together-ai) tutorial. ## udf  chat\_completions() ```python Signature theme={null} @pxt.udf chat_completions( messages: pxt.Json, *, model: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.Json ``` Generate chat completions based on a given prompt using a specified model. Equivalent to the Together AI `chat/completions` API endpoint. 
For additional details, see: [https://docs.together.ai/reference/chat-completions-1](https://docs.together.ai/reference/chat-completions-1) Request throttling: Applies the rate limit set in the config (section `together.rate_limits`, key `chat`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install together` **Parameters:** * **`messages`** (`pxt.Json`): A list of messages comprising the conversation so far. * **`model`** (`pxt.String`): The name of the model to query. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword arguments for the Together `chat/completions` API. For details on the available parameters, see: [https://docs.together.ai/reference/chat-completions-1](https://docs.together.ai/reference/chat-completions-1) **Returns:** * `pxt.Json`: A dictionary containing the response and other metadata. **Examples:** Add a computed column that applies the model `mistralai/Mixtral-8x7B-v0.1` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} messages = [{'role': 'user', 'content': tbl.prompt}] tbl.add_computed_column( response=chat_completions( messages, model='mistralai/Mixtral-8x7B-v0.1' ) ) ``` ## udf  completions() ```python Signature theme={null} @pxt.udf completions( prompt: pxt.String, *, model: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.Json ``` Generate completions based on a given prompt using a specified model. Equivalent to the Together AI `completions` API endpoint. For additional details, see: [https://docs.together.ai/reference/completions-1](https://docs.together.ai/reference/completions-1) Request throttling: Applies the rate limit set in the config (section `together.rate_limits`, key `chat`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install together` **Parameters:** * **`prompt`** (`pxt.String`): A string providing context for the model to complete. 
* **`model`** (`pxt.String`): The name of the model to query. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword arguments for the Together `completions` API. For details on the available parameters, see: [https://docs.together.ai/reference/completions-1](https://docs.together.ai/reference/completions-1) **Returns:** * `pxt.Json`: A dictionary containing the response and other metadata. **Examples:** Add a computed column that applies the model `mistralai/Mixtral-8x7B-v0.1` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} tbl.add_computed_column( response=completions(tbl.prompt, model='mistralai/Mixtral-8x7B-v0.1') ) ``` ## udf  embeddings() ```python Signature theme={null} @pxt.udf embeddings( input: pxt.String, *, model: pxt.String ) -> pxt.Array[(None,), float32] ``` Query an embedding model for a given string of text. Equivalent to the Together AI `embeddings` API endpoint. For additional details, see: [https://docs.together.ai/reference/embeddings-2](https://docs.together.ai/reference/embeddings-2) Request throttling: Applies the rate limit set in the config (section `together.rate_limits`, key `embeddings`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install together` **Parameters:** * **`input`** (`pxt.String`): A string providing the text for the model to embed. * **`model`** (`pxt.String`): The name of the embedding model to use. **Returns:** * `pxt.Array[(None,), float32]`: An array representing the application of the given embedding to `input`. 
**Examples:** Add a computed column that applies the model `togethercomputer/m2-bert-80M-8k-retrieval` to an existing Pixeltable column `tbl.text` of the table `tbl`: ```python theme={null} tbl.add_computed_column( response=embeddings( tbl.text, model='togethercomputer/m2-bert-80M-8k-retrieval' ) ) ``` ## udf  image\_generations() ```python Signature theme={null} @pxt.udf image_generations( prompt: pxt.String, *, model: pxt.String, model_kwargs: pxt.Json | None = None ) -> pxt.Image ``` Generate images based on a given prompt using a specified model. Equivalent to the Together AI `images/generations` API endpoint. For additional details, see: [https://docs.together.ai/reference/post\_images-generations](https://docs.together.ai/reference/post_images-generations) Request throttling: Applies the rate limit set in the config (section `together.rate_limits`, key `images`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install together` **Parameters:** * **`prompt`** (`pxt.String`): A description of the desired images. * **`model`** (`pxt.String`): The model to use for image generation. * **`model_kwargs`** (`pxt.Json | None`): Additional keyword args for the Together `images/generations` API. For details on the available parameters, see: [https://docs.together.ai/reference/post\_images-generations](https://docs.together.ai/reference/post_images-generations) **Returns:** * `pxt.Image`: The generated image. **Examples:** Add a computed column that applies the model `stabilityai/stable-diffusion-xl-base-1.0` to an existing Pixeltable column `tbl.prompt` of the table `tbl`: ```python theme={null} tbl.add_computed_column( response=image_generations( tbl.prompt, model='stabilityai/stable-diffusion-xl-base-1.0' ) ) ``` # twelvelabs Source: https://docs.pixeltable.com/sdk/latest/twelvelabs View Source on GitHub # module  pixeltable.functions.twelvelabs Pixeltable UDFs that wrap various endpoints from the TwelveLabs API. 
In order to use them, you must first `pip install twelvelabs` and configure your TwelveLabs credentials, as described in the [Working with TwelveLabs](https://docs.pixeltable.com/howto/providers/working-with-twelvelabs) tutorial. ## udf  embed() ```python Signatures theme={null} # Signature 1: @pxt.udf embed( text: pxt.String, image: pxt.Image | None, model_name: pxt.String ) -> pxt.Array[float32] | None # Signature 2: @pxt.udf embed( image: pxt.Image, model_name: pxt.String ) -> pxt.Array[float32] | None # Signature 3: @pxt.udf embed( audio: pxt.Audio, model_name: pxt.String, start_sec: pxt.Float | None, end_sec: pxt.Float | None, embedding_option: pxt.Json | None ) -> pxt.Array[float32] | None # Signature 4: @pxt.udf embed( video: pxt.Video, model_name: pxt.String, start_sec: pxt.Float | None, end_sec: pxt.Float | None, embedding_option: pxt.Json | None ) -> pxt.Array[float32] | None ``` Creates an embedding vector for the given text, audio, image, or video input. Each UDF signature corresponds to one of the four supported input types. If text is specified, it is possible to specify an image as well, corresponding to the `text_image` embedding type in the TwelveLabs API. This is (currently) the only way to include more than one input type at a time. Equivalent to the TwelveLabs Embed API: [https://docs.twelvelabs.io/v1.3/docs/guides/create-embeddings](https://docs.twelvelabs.io/v1.3/docs/guides/create-embeddings) Request throttling: Applies the rate limit set in the config (section `twelvelabs`, key `rate_limit`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install twelvelabs` **Parameters:** * **`model_name`** (`String`): The name of the model to use. Check [the TwelveLabs documentation](https://docs.twelvelabs.io/v1.3/sdk-reference/python/create-embeddings-v-1/create-text-image-and-audio-embeddings) for available models. * **`text`** (`String`): The text to embed. 
* **`image`** (`Image | None`, default: `Literal(None)`): If specified, the embedding will be created from both the text and the image. **Returns:** * `pxt.Array[float32] | None`: The embedding. **Examples:** Add a computed column `embed` for an embedding of a string column `input`: ```python theme={null} tbl.add_computed_column( embed=embed(model_name='marengo3.0', text=tbl.input) ) ``` # UpdateStatus Source: https://docs.pixeltable.com/sdk/latest/updatestatus View Source on GitHub # class  pixeltable.UpdateStatus Information about changes to table data or table schema ## attr  ext\_num\_rows ``` ext_num_rows: int ``` Total number of rows affected in an external store. ## attr  external\_rows\_created ``` external_rows_created: int ``` Number of rows created in an external store. ## attr  external\_rows\_deleted ``` external_rows_deleted: int ``` Number of rows deleted from an external store. ## attr  external\_rows\_updated ``` external_rows_updated: int ``` Number of rows updated in an external store. ## attr  num\_computed\_values ``` num_computed_values: int ``` Total number of computed values affected (including cascaded changes). ## attr  num\_excs ``` num_excs: int ``` Total number of exceptions encountered (including cascaded changes). ## attr  num\_rows ``` num_rows: int ``` Total number of rows affected (including cascaded changes). ## attr  pxt\_rows\_updated ``` pxt_rows_updated: int ``` Returns the number of Pixeltable rows that were updated as a result of the operation. # uuid Source: https://docs.pixeltable.com/sdk/latest/uuid View Source on GitHub # module  pixeltable.functions.uuid Pixeltable UDFs for `UUID`. ## udf  to\_string() ```python Signature theme={null} @pxt.udf to_string(u: pxt.UUID) -> pxt.String ``` Convert a UUID to its string representation. **Parameters:** * **`u`** (`pxt.UUID`): The UUID to convert. **Returns:** * `pxt.String`: The string representation of the UUID, in the form `xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`. 
**Examples:** Convert the UUID column `id` in an existing table `tbl` to a string: ```python theme={null} tbl.add_computed_column(id_string=to_string(tbl.id)) ``` ## udf  uuid4() ```python Signature theme={null} @pxt.udf uuid4() -> pxt.UUID ``` Generate a random UUID (version 4). Equivalent to [`uuid.uuid4()`](https://docs.python.org/3/library/uuid.html#uuid.uuid4). ## udf  uuid7() ```python Signature theme={null} @pxt.udf uuid7() -> pxt.UUID ``` Generate a time-based UUID. Equivalent to [`uuid.uuid7()`](https://docs.python.org/3/library/uuid.html#uuid.uuid7). # VersionMetadata Source: https://docs.pixeltable.com/sdk/latest/versionmetadata View Source on GitHub # class  pixeltable.VersionMetadata Metadata for a specific version of a Pixeltable table. ## attr  change\_type ``` change_type: Literal['data', 'schema'] ``` The type of table transformation that this version represents (`'data'` or `'schema'`). ## attr  created\_at ``` created_at: datetime.datetime ``` The timestamp when this version was created. ## attr  deletes ``` deletes: int ``` The number of rows deleted in this version. ## attr  errors ``` errors: int ``` The number of errors encountered during this version. ## attr  inserts ``` inserts: int ``` The number of rows inserted in this version. ## attr  schema\_change ``` schema_change: str | None ``` A description of the schema change that occurred in this version, if any. ## attr  updates ``` updates: int ``` The number of rows updated in this version. ## attr  user ``` user: str | None ``` The user who created this version, if defined. ## attr  version ``` version: int ``` The version number. # video Source: https://docs.pixeltable.com/sdk/latest/video View Source on GitHub # module  pixeltable.functions.video Pixeltable UDFs for `VideoType`. 
## iterator  frame\_iterator() ```python Signature theme={null} @pxt.iterator frame_iterator( video: pxt.Video, *, fps: pxt.Float | None = None, num_frames: pxt.Int | None = None, keyframes_only: pxt.Bool = False ) ``` Iterator over frames of a video. At most one of `fps`, `num_frames` or `keyframes_only` may be specified. If `fps` is specified, then frames will be extracted at the specified rate (frames per second). If `num_frames` is specified, then the exact number of frames will be extracted. If neither is specified, then all frames will be extracted. **Outputs:** One row per extracted frame, with the following columns: * `frame` (`pxt.Image`): The extracted video frame * `frame_attrs` (`pxt.Json`): A dictionary containing the following attributes (for more information, see `pyav`'s documentation on [VideoFrame](https://pyav.org/docs/develop/api/video.html#module-av.video.frame) and [Frame](https://pyav.org/docs/develop/api/frame.html)): * `index` (`int`): The index of the frame in the video stream * `pts` (`int | None`): The presentation timestamp of the frame * `dts` (`int | None`): The decoding timestamp of the frame * `time` (`float | None`): The timestamp of the frame in seconds * `is_corrupt` (`bool`): `True` if the frame is corrupt * `key_frame` (`bool`): `True` if the frame is a keyframe * `pict_type` (`int`): The picture type of the frame * `interlaced_frame` (`bool`): `True` if the frame is interlaced **Parameters:** * **`fps`** (`pxt.Float | None`): Number of frames to extract per second of video. This may be a fractional value, such as 0.5. If omitted, or if greater than the native framerate of the video, then the framerate of the video will be used (all frames will be extracted). * **`num_frames`** (`pxt.Int | None`): Exact number of frames to extract. The frames will be spaced as evenly as possible. If `num_frames` is greater than the number of frames in the video, all frames will be extracted. 
* **`keyframes_only`** (`pxt.Bool`): If True, only extract keyframes. **Examples:** All these examples assume an existing table `tbl` with a column `video` of type `pxt.Video`. Create a view that extracts all frames from all videos: ```python theme={null} pxt.create_view('all_frames', tbl, iterator=frame_iterator(tbl.video)) ``` Create a view that extracts only keyframes from all videos: ```python theme={null} pxt.create_view( 'keyframes', tbl, iterator=frame_iterator(tbl.video, keyframes_only=True), ) ``` Create a view that extracts frames from all videos at a rate of 1 frame per second: ```python theme={null} pxt.create_view( 'one_fps_frames', tbl, iterator=frame_iterator(tbl.video, fps=1.0) ) ``` Create a view that extracts exactly 10 frames from each video: ```python theme={null} pxt.create_view( 'ten_frames', tbl, iterator=frame_iterator(tbl.video, num_frames=10) ) ``` ## iterator  video\_splitter() ```python Signature theme={null} @pxt.iterator video_splitter( video: pxt.Video, *, duration: pxt.Float | None = None, overlap: pxt.Float | None = None, min_segment_duration: pxt.Float | None = None, segment_times: pxt.Json | None = None, mode: pxt.String = 'accurate', video_encoder: pxt.String | None = None, video_encoder_args: pxt.Json | None = None ) ``` Iterator over segments of a video file. The video is split into segments specified either by a fixed duration or by a list of split points. **Parameters:** * **`duration`** (`pxt.Float | None`): Video segment duration in seconds * **`overlap`** (`pxt.Float | None`): Overlap between consecutive segments in seconds. Only available for `mode='fast'`. * **`min_segment_duration`** (`pxt.Float | None`): Drop the last segment if it is shorter than `min_segment_duration`. * **`segment_times`** (`pxt.Json | None`): List of timestamps (in seconds) in video where segments should be split. Note that these are not segment durations.
If all segment times are less than the duration of the video, produces exactly `len(segment_times) + 1` segments. An argument of `[]` will produce a single segment containing the entire video. * **`mode`** (`pxt.String`): Segmentation mode: * `'fast'`: Quick segmentation using stream copy (splits only at keyframes, approximate durations) * `'accurate'`: Precise segmentation with re-encoding (exact durations, slower) * **`video_encoder`** (`pxt.String | None`): Video encoder to use. If not specified, uses the default encoder for the current platform. Only available for `mode='accurate'`. * **`video_encoder_args`** (`pxt.Json | None`): Additional arguments to pass to the video encoder. Only available for `mode='accurate'`. **Examples:** All these examples assume an existing table `tbl` with a column `video` of type `pxt.Video`. Create a view that splits each video into 10-second segments: ```python theme={null} pxt.create_view( 'ten_second_segments', tbl, iterator=video_splitter(tbl.video, duration=10.0), ) ``` Create a view that splits each video into segments at specified fixed times: ```python theme={null} split_times = [5.0, 15.0, 30.0] pxt.create_view( 'custom_segments', tbl, iterator=video_splitter(tbl.video, segment_times=split_times), ) ``` Create a view that splits each video into segments at times specified by a column `split_times` of type `pxt.Json`, containing a list of timestamps in seconds: ```python theme={null} pxt.create_view( 'custom_segments', tbl, iterator=video_splitter(tbl.video, segment_times=tbl.split_times), ) ``` ## uda  make\_video() ```python Signature theme={null} @pxt.uda make_video(*args, **kwargs) -> pxt.Video ``` Aggregate function that creates a video from a sequence of images, using the default video encoder and yuv420p pixel format. **Parameters:** * **`fps`** (`pxt.Int`): Frames per second for the output video. **Returns:** * `pxt.Video`: The video obtained by combining the input frames at the specified `fps`. 
**Examples:** Combine the images in the `img` column of the table `tbl` into a video: ```python theme={null} tbl.select(make_video(tbl.img, fps=30)).collect() ``` Combine a sequence of rotated images into a video: ```python theme={null} tbl.select(make_video(tbl.img.rotate(45), fps=30)).collect() ``` ## udf  clip() ```python Signature theme={null} @pxt.udf clip( video: pxt.Video, *, start_time: pxt.Float, end_time: pxt.Float | None = None, duration: pxt.Float | None = None, mode: pxt.String = 'accurate', video_encoder: pxt.String | None = None, video_encoder_args: pxt.Json | None = None ) -> pxt.Video | None ``` Extract a clip from a video, specified by `start_time` and either `end_time` or `duration` (in seconds). If `start_time` is beyond the end of the video, returns None. Only one of `end_time` and `duration` may be specified. If both `end_time` and `duration` are None, the clip goes to the end of the video. **Requirements:** * `ffmpeg` needs to be installed and in PATH **Parameters:** * **`video`** (`pxt.Video`): Input video file * **`start_time`** (`pxt.Float`): Start time in seconds * **`end_time`** (`pxt.Float | None`): End time in seconds * **`duration`** (`pxt.Float | None`): Duration of the clip in seconds * **`mode`** (`pxt.String`): Clipping mode: * `'fast'`: avoids re-encoding, but starts the clip at the nearest keyframe; as a result, the clip duration may be slightly longer than requested * `'accurate'`: extracts a frame-accurate clip, but requires re-encoding * **`video_encoder`** (`pxt.String | None`): Video encoder to use. If not specified, uses the default encoder for the current platform. Only available for `mode='accurate'`. * **`video_encoder_args`** (`pxt.Json | None`): Additional arguments to pass to the video encoder. Only available for `mode='accurate'`. **Returns:** * `pxt.Video | None`: New video containing only the specified time range, or None if `start_time` is beyond the end of the video.
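**Examples:** A usage sketch (not from the original reference; it assumes, as in the other examples, an existing table `tbl` with a column `video` of type `pxt.Video`). Extract a 10-second clip starting 30 seconds into each video:

```python theme={null}
# Frame-accurate by default (mode='accurate'); returns None for videos
# shorter than 30 seconds, since start_time would be past the end.
tbl.select(tbl.video.clip(start_time=30.0, duration=10.0)).collect()
```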
## udf  concat\_videos() ```python Signature theme={null} @pxt.udf concat_videos(videos: pxt.Json) -> pxt.Video ``` Merge multiple videos into a single video. **Requirements:** * `ffmpeg` needs to be installed and in PATH **Parameters:** * **`videos`** (`pxt.Json`): List of videos to merge. **Returns:** * `pxt.Video`: A new video containing the merged videos. ## udf  crop() ```python Signature theme={null} @pxt.udf crop( video: pxt.Video, bbox: pxt.Json, *, bbox_format: pxt.String = 'xywh', video_encoder: pxt.String | None = None, video_encoder_args: pxt.Json | None = None ) -> pxt.Video ``` Crop a rectangular region from a video using ffmpeg's crop filter. **Requirements:** * `ffmpeg` needs to be installed and in PATH **Parameters:** * **`video`** (`pxt.Video`): Input video. * **`bbox`** (`pxt.Json`): Crop region as a list of 4 integers. * **`bbox_format`** (`pxt.String`): Format of the `bbox` coordinates: * `'xyxy'`: `[x1, y1, x2, y2]` where (x1, y1) is top-left and (x2, y2) is bottom-right * `'xywh'`: `[x, y, width, height]` where (x, y) is top-left corner * `'cxcywh'`: `[cx, cy, width, height]` where (cx, cy) is the center * **`video_encoder`** (`pxt.String | None`): Video encoder to use. If not specified, uses the default encoder. * **`video_encoder_args`** (`pxt.Json | None`): Additional arguments to pass to the video encoder. **Returns:** * `pxt.Video`: Video containing the cropped region. 
**Examples:** Crop using default xywh format: ```python theme={null} tbl.select(tbl.video.crop([100, 50, 320, 240])).collect() ``` Crop using xyxy format (common in object detection): ```python theme={null} tbl.select( tbl.video.crop([100, 50, 420, 290], bbox_format='xyxy') ).collect() ``` Crop using center format: ```python theme={null} tbl.select( tbl.video.crop([260, 170, 320, 240], bbox_format='cxcywh') ).collect() ``` Use with yolox object detection output: ```python theme={null} tbl.add_computed_column( cropped=tbl.video.crop(tbl.detections.bboxes[0], bbox_format='xyxy') ) ``` ## udf  extract\_audio() ```python Signature theme={null} @pxt.udf extract_audio( video_path: pxt.Video, stream_idx: pxt.Int = 0, format: pxt.String = 'wav', codec: pxt.String | None = None ) -> pxt.Audio ``` Extract an audio stream from a video. **Parameters:** * **`stream_idx`** (`pxt.Int`): Index of the audio stream to extract. * **`format`** (`pxt.String`): The target audio format (`'wav'`, `'mp3'`, or `'flac'`). * **`codec`** (`pxt.String | None`): The codec to use for the audio stream. If not provided, a default codec will be used. **Returns:** * `pxt.Audio`: The extracted audio. **Examples:** Add a computed column to a table `tbl` that extracts audio from an existing column `video_col`: ```python theme={null} tbl.add_computed_column( extracted_audio=tbl.video_col.extract_audio(format='flac') ) ``` ## udf  extract\_frame() ```python Signature theme={null} @pxt.udf extract_frame( video: pxt.Video, *, timestamp: pxt.Float ) -> pxt.Image | None ``` Extract a single frame from a video at a specific timestamp. **Parameters:** * **`video`** (`pxt.Video`): The video from which to extract the frame. * **`timestamp`** (`pxt.Float`): Extract frame at this timestamp (in seconds). **Returns:** * `pxt.Image | None`: The extracted frame as a PIL Image, or None if the timestamp is beyond the video duration.
**Examples:** Extract the first frame from each video in the `video` column of the table `tbl`: ```python theme={null} tbl.select(tbl.video.extract_frame(timestamp=0.0)).collect() ``` Extract a frame close to the end of each video in the `video` column of the table `tbl`: ```python theme={null} tbl.select( tbl.video.extract_frame( timestamp=tbl.video.get_metadata().streams[0].duration_seconds - 0.1 ) ).collect() ``` ## udf  get\_duration() ```python Signature theme={null} @pxt.udf get_duration(video: pxt.Video) -> pxt.Float | None ``` Get video duration in seconds. **Parameters:** * **`video`** (`pxt.Video`): The video for which to get the duration. **Returns:** * `pxt.Float | None`: The duration in seconds, or None if the duration cannot be determined. ## udf  get\_metadata() ```python Signature theme={null} @pxt.udf get_metadata(video: pxt.Video) -> pxt.Json ``` Gets various metadata associated with a video file and returns it as a dictionary. **Parameters:** * **`video`** (`pxt.Video`): The video for which to get metadata.
**Returns:** * `pxt.Json`: A `dict` such as the following: ```python theme={null} { 'bit_exact': False, 'bit_rate': 967260, 'size': 2234371, 'metadata': { 'encoder': 'Lavf60.16.100', 'major_brand': 'isom', 'minor_version': '512', 'compatible_brands': 'isomiso2avc1mp41', }, 'streams': [ { 'type': 'video', 'width': 640, 'height': 360, 'frames': 462, 'time_base': 1.0 / 12800, 'duration': 236544, 'duration_seconds': 236544.0 / 12800, 'average_rate': 25.0, 'base_rate': 25.0, 'guessed_rate': 25.0, 'metadata': { 'language': 'und', 'handler_name': 'L-SMASH Video Handler', 'vendor_id': '[0][0][0][0]', 'encoder': 'Lavc60.31.102 libx264', }, 'codec_context': {'name': 'h264', 'codec_tag': 'avc1', 'profile': 'High', 'pix_fmt': 'yuv420p'}, } ], } ``` **Examples:** Extract metadata for files in the `video_col` column of the table `tbl`: ```python theme={null} tbl.select(tbl.video_col.get_metadata()).collect() ``` ## udf  overlay\_text() ```python Signature theme={null} @pxt.udf overlay_text( video: pxt.Video, text: pxt.String, *, font: pxt.String | None = None, font_size: pxt.Int = 24, color: pxt.String = 'white', opacity: pxt.Float = 1.0, horizontal_align: pxt.String = 'center', horizontal_margin: pxt.Int = 0, vertical_align: pxt.String = 'center', vertical_margin: pxt.Int = 0, box: pxt.Bool = False, box_color: pxt.String = 'black', box_opacity: pxt.Float = 1.0, box_border: pxt.Json | None = None ) -> pxt.Video ``` Overlay text on a video with customizable positioning and styling. **Requirements:** * `ffmpeg` needs to be installed and in PATH **Parameters:** * **`video`** (`pxt.Video`): Input video to overlay text on. * **`text`** (`pxt.String`): The text string to overlay on the video. * **`font`** (`pxt.String | None`): Font family or path to font file. If None, uses the system default. * **`font_size`** (`pxt.Int`): Size of the text in points. * **`color`** (`pxt.String`): Text color (e.g., `'white'`, `'red'`, `'#FF0000'`).
* **`opacity`** (`pxt.Float`): Text opacity from 0.0 (transparent) to 1.0 (opaque). * **`horizontal_align`** (`pxt.String`): Horizontal text alignment (`'left'`, `'center'`, `'right'`). * **`horizontal_margin`** (`pxt.Int`): Horizontal margin in pixels from the alignment edge. * **`vertical_align`** (`pxt.String`): Vertical text alignment (`'top'`, `'center'`, `'bottom'`). * **`vertical_margin`** (`pxt.Int`): Vertical margin in pixels from the alignment edge. * **`box`** (`pxt.Bool`): Whether to draw a background box behind the text. * **`box_color`** (`pxt.String`): Background box color as a string. * **`box_opacity`** (`pxt.Float`): Background box opacity from 0.0 to 1.0. * **`box_border`** (`pxt.Json | None`): Padding around text in the box in pixels. * `[10]`: 10 pixels on all sides * `[10, 20]`: 10 pixels on top/bottom, 20 on left/right * `[10, 20, 30]`: 10 pixels on top, 20 on left/right, 30 on bottom * `[10, 20, 30, 40]`: 10 pixels on top, 20 on right, 30 on bottom, 40 on left **Returns:** * `pxt.Video`: A new video with the text overlay applied. 
**Examples:** Add a simple text overlay to videos in a table: ```python theme={null} tbl.select(tbl.video.overlay_text('Sample Text')).collect() ``` Add a YouTube-style caption: ```python theme={null} tbl.select( tbl.video.overlay_text( 'Caption text', font_size=32, color='white', opacity=1.0, box=True, box_color='black', box_opacity=0.8, box_border=[6, 14], horizontal_margin=10, vertical_align='bottom', vertical_margin=70, ) ).collect() ``` Add text with a semi-transparent background box: ```python theme={null} tbl.select( tbl.video.overlay_text( 'Important Message', font_size=32, color='yellow', box=True, box_color='black', box_opacity=0.6, box_border=[20, 10], ) ).collect() ``` ## udf  scene\_detect\_adaptive() ```python Signature theme={null} @pxt.udf scene_detect_adaptive( video: pxt.Video, *, fps: pxt.Float | None = None, adaptive_threshold: pxt.Float = 3.0, min_scene_len: pxt.Int = 15, window_width: pxt.Int = 2, min_content_val: pxt.Float = 15.0, delta_hue: pxt.Float = 1.0, delta_sat: pxt.Float = 1.0, delta_lum: pxt.Float = 1.0, delta_edges: pxt.Float = 0.0, luma_only: pxt.Bool = False, kernel_size: pxt.Int | None = None ) -> pxt.Json ``` Detect scene cuts in a video using PySceneDetect's [AdaptiveDetector](https://www.scenedetect.com/docs/latest/api/detectors.html#scenedetect.detectors.adaptive_detector.AdaptiveDetector). **Requirements:** * `pip install scenedetect` **Parameters:** * **`video`** (`pxt.Video`): The video to analyze for scene cuts. * **`fps`** (`pxt.Float | None`): Number of frames to extract per second for analysis. If None or 0, analyzes all frames. Lower values process faster but may miss exact scene cuts. * **`adaptive_threshold`** (`pxt.Float`): Threshold that the score ratio must exceed to trigger a new scene cut. Lower values will detect more scenes (more sensitive), higher values will detect fewer scenes. 
* **`min_scene_len`** (`pxt.Int`): Once a cut is detected, this many frames must pass before a new one can be added to the scene list. * **`window_width`** (`pxt.Int`): Size of window (number of frames) before and after each frame to average together in order to detect deviations from the mean. Must be at least 1. * **`min_content_val`** (`pxt.Float`): Minimum threshold (float) that the content\_val must exceed in order to register as a new scene. This is calculated the same way that `scene_detect_content()` calculates frame score based on weights/luma\_only/kernel\_size. * **`delta_hue`** (`pxt.Float`): Weight for hue component changes. Higher values make hue changes more important. * **`delta_sat`** (`pxt.Float`): Weight for saturation component changes. Higher values make saturation changes more important. * **`delta_lum`** (`pxt.Float`): Weight for luminance component changes. Higher values make brightness changes more important. * **`delta_edges`** (`pxt.Float`): Weight for edge detection changes. Higher values make edge changes more important. Edge detection can help detect cuts in scenes with similar colors but different content. * **`luma_only`** (`pxt.Bool`): If True, only analyzes changes in the luminance (brightness) channel of the video, ignoring color information. This can be faster and may work better for grayscale content. * **`kernel_size`** (`pxt.Int | None`): Size of kernel to use for post edge detection filtering. If None, automatically set based on video resolution. **Returns:** * `pxt.Json`: A list of dictionaries, one for each detected scene, with the following keys: * `start_time` (float): The start time of the scene in seconds. * `start_pts` (int): The pts of the start of the scene. * `duration` (float): The duration of the scene in seconds. The list is ordered chronologically. Returns the full duration of the video if no scenes are detected. 
**Examples:** Detect scene cuts with default parameters: ```python theme={null} tbl.select(tbl.video.scene_detect_adaptive()).collect() ``` Detect more scenes by lowering the threshold: ```python theme={null} tbl.select( tbl.video.scene_detect_adaptive(adaptive_threshold=1.5) ).collect() ``` Use luminance-only detection with a longer minimum scene length: ```python theme={null} tbl.select( tbl.video.scene_detect_adaptive(luma_only=True, min_scene_len=30) ).collect() ``` Add scene cuts as a computed column: ```python theme={null} tbl.add_computed_column( scene_cuts=tbl.video.scene_detect_adaptive(adaptive_threshold=2.0) ) ``` Analyze at a lower frame rate for faster processing: ```python theme={null} tbl.select(tbl.video.scene_detect_adaptive(fps=2.0)).collect() ``` ## udf  scene\_detect\_content() ```python Signature theme={null} @pxt.udf scene_detect_content( video: pxt.Video, *, fps: pxt.Float | None = None, threshold: pxt.Float = 27.0, min_scene_len: pxt.Int = 15, delta_hue: pxt.Float = 1.0, delta_sat: pxt.Float = 1.0, delta_lum: pxt.Float = 1.0, delta_edges: pxt.Float = 0.0, luma_only: pxt.Bool = False, kernel_size: pxt.Int | None = None, filter_mode: pxt.String = 'merge' ) -> pxt.Json ``` Detect scene cuts in a video using PySceneDetect's [ContentDetector](https://www.scenedetect.com/docs/latest/api/detectors.html#scenedetect.detectors.content_detector.ContentDetector). **Requirements:** * `pip install scenedetect` **Parameters:** * **`video`** (`pxt.Video`): The video to analyze for scene cuts. * **`fps`** (`pxt.Float | None`): Number of frames to extract per second for analysis. If None, analyzes all frames. Lower values process faster but may miss exact scene cuts. * **`threshold`** (`pxt.Float`): Threshold that the weighted sum of component changes must exceed to trigger a scene cut. Lower values detect more scenes (more sensitive), higher values detect fewer scenes. 
* **`min_scene_len`** (`pxt.Int`): Once a cut is detected, this many frames must pass before a new one can be added to the scene list. * **`delta_hue`** (`pxt.Float`): Weight for hue component changes. Higher values make hue changes more important. * **`delta_sat`** (`pxt.Float`): Weight for saturation component changes. Higher values make saturation changes more important. * **`delta_lum`** (`pxt.Float`): Weight for luminance component changes. Higher values make brightness changes more important. * **`delta_edges`** (`pxt.Float`): Weight for edge detection changes. Higher values make edge changes more important. Edge detection can help detect cuts in scenes with similar colors but different content. * **`luma_only`** (`pxt.Bool`): If True, only analyzes changes in the luminance (brightness) channel, ignoring color information. This can be faster and may work better for grayscale content. * **`kernel_size`** (`pxt.Int | None`): Size of kernel for expanding detected edges. Must be odd integer greater than or equal to 3. If None, automatically set using video resolution. * **`filter_mode`** (`pxt.String`): How to handle fast cuts/flashes. 'merge' combines quick cuts, 'suppress' filters them out. **Returns:** * `pxt.Json`: A list of dictionaries, one for each detected scene, with the following keys: * `start_time` (float): The start time of the scene in seconds. * `start_pts` (int): The pts of the start of the scene. * `duration` (float): The duration of the scene in seconds. The list is ordered chronologically. Returns the full duration of the video if no scenes are detected. 
**Examples:** Detect scene cuts with default parameters: ```python theme={null} tbl.select(tbl.video.scene_detect_content()).collect() ``` Detect more scenes by lowering the threshold: ```python theme={null} tbl.select(tbl.video.scene_detect_content(threshold=15.0)).collect() ``` Use luminance-only detection: ```python theme={null} tbl.select(tbl.video.scene_detect_content(luma_only=True)).collect() ``` Emphasize edge detection for scenes with similar colors: ```python theme={null} tbl.select( tbl.video.scene_detect_content( delta_edges=1.0, delta_hue=0.5, delta_sat=0.5 ) ).collect() ``` Add scene cuts as a computed column: ```python theme={null} tbl.add_computed_column( scene_cuts=tbl.video.scene_detect_content(threshold=20.0) ) ``` ## udf  scene\_detect\_hash() ```python Signature theme={null} @pxt.udf scene_detect_hash( video: pxt.Video, *, fps: pxt.Float | None = None, threshold: pxt.Float = 0.395, size: pxt.Int = 16, lowpass: pxt.Int = 2, min_scene_len: pxt.Int = 15 ) -> pxt.Json ``` Detect scene cuts in a video using PySceneDetect's [HashDetector](https://www.scenedetect.com/docs/latest/api/detectors.html#scenedetect.detectors.hash_detector.HashDetector). HashDetector uses perceptual hashing for very fast scene detection. It computes a hash of each frame at reduced resolution and compares hash distances. **Requirements:** * `pip install scenedetect` **Parameters:** * **`video`** (`pxt.Video`): The video to analyze for scene cuts. * **`fps`** (`pxt.Float | None`): Number of frames to extract per second for analysis. If None, analyzes all frames. Lower values process faster but may miss exact scene cuts. * **`threshold`** (`pxt.Float`): Value between 0.0 and 1.0 representing the relative Hamming distance between the perceptual hashes of adjacent frames. A distance of 0 means the images are the same, and 1 means no correlation. Smaller threshold values thus require more correlation, making the detector more sensitive.
The Hamming distance is divided by size x size before comparing to threshold for normalization. Lower values detect more scenes (more sensitive), higher values detect fewer scenes. * **`size`** (`pxt.Int`): Size of square of low frequency data to use for the DCT. Larger values are more precise but slower. Common values are 8, 16, or 32. * **`lowpass`** (`pxt.Int`): How much high frequency information to filter from the DCT. A value of 2 means keep lower 1/2 of the frequency data, 4 means only keep 1/4, etc. Larger values make the detector less sensitive to high-frequency details and noise. * **`min_scene_len`** (`pxt.Int`): Once a cut is detected, this many frames must pass before a new one can be added to the scene list. **Returns:** * `pxt.Json`: A list of dictionaries, one for each detected scene, with the following keys: * `start_time` (float): The start time of the scene in seconds. * `start_pts` (int): The pts of the start of the scene. * `duration` (float): The duration of the scene in seconds. The list is ordered chronologically. Returns the full duration of the video if no scenes are detected. 
**Examples:** Detect scene cuts with default parameters: ```python theme={null} tbl.select(tbl.video.scene_detect_hash()).collect() ``` Detect more scenes by lowering the threshold: ```python theme={null} tbl.select(tbl.video.scene_detect_hash(threshold=0.3)).collect() ``` Use larger hash size for more precision: ```python theme={null} tbl.select(tbl.video.scene_detect_hash(size=32)).collect() ``` Use for fast processing with lower frame rate: ```python theme={null} tbl.select(tbl.video.scene_detect_hash(fps=1.0, threshold=0.4)).collect() ``` Add scene cuts as a computed column: ```python theme={null} tbl.add_computed_column(scene_cuts=tbl.video.scene_detect_hash()) ``` ## udf  scene\_detect\_histogram() ```python Signature theme={null} @pxt.udf scene_detect_histogram( video: pxt.Video, *, fps: pxt.Float | None = None, threshold: pxt.Float = 0.05, bins: pxt.Int = 256, min_scene_len: pxt.Int = 15 ) -> pxt.Json ``` Detect scene cuts in a video using PySceneDetect's [HistogramDetector](https://www.scenedetect.com/docs/latest/api/detectors.html#scenedetect.detectors.histogram_detector.HistogramDetector). HistogramDetector compares frame histograms on the Y (luminance) channel after YUV conversion. It detects scenes based on relative histogram differences and is more robust to gradual lighting changes than content-based detection. **Requirements:** * `pip install scenedetect` **Parameters:** * **`video`** (`pxt.Video`): The video to analyze for scene cuts. * **`fps`** (`pxt.Float | None`): Number of frames to extract per second for analysis. If None or 0, analyzes all frames. Lower values process faster but may miss exact scene cuts. * **`threshold`** (`pxt.Float`): Maximum relative difference, between 0.0 and 1.0, by which the histograms may differ. Histograms are calculated on the Y channel after converting the frame to YUV, and normalized based on the number of bins. Higher differences imply greater change in content, so larger threshold values are less sensitive to cuts.
Lower values detect more scenes (more sensitive), higher values detect fewer scenes. * **`bins`** (`pxt.Int`): Number of bins to use for histogram calculation (typically 16-256). More bins provide finer granularity but may be more sensitive to noise. * **`min_scene_len`** (`pxt.Int`): Once a cut is detected, this many frames must pass before a new one can be added to the scene list. **Returns:** * `pxt.Json`: A list of dictionaries, one for each detected scene, with the following keys: * `start_time` (float): The start time of the scene in seconds. * `start_pts` (int): The pts of the start of the scene. * `duration` (float): The duration of the scene in seconds. The list is ordered chronologically. Returns the full duration of the video if no scenes are detected. **Examples:** Detect scene cuts with default parameters: ```python theme={null} tbl.select(tbl.video.scene_detect_histogram()).collect() ``` Detect more scenes by lowering the threshold: ```python theme={null} tbl.select(tbl.video.scene_detect_histogram(threshold=0.03)).collect() ``` Use fewer bins for faster processing: ```python theme={null} tbl.select(tbl.video.scene_detect_histogram(bins=64)).collect() ``` Use with a longer minimum scene length: ```python theme={null} tbl.select(tbl.video.scene_detect_histogram(min_scene_len=30)).collect() ``` Add scene cuts as a computed column: ```python theme={null} tbl.add_computed_column( scene_cuts=tbl.video.scene_detect_histogram(threshold=0.04) ) ``` ## udf  scene\_detect\_threshold() ```python Signature theme={null} @pxt.udf scene_detect_threshold( video: pxt.Video, *, fps: pxt.Float | None = None, threshold: pxt.Float = 12.0, min_scene_len: pxt.Int = 15, fade_bias: pxt.Float = 0.0, add_final_scene: pxt.Bool = False, method: pxt.String = 'floor' ) -> pxt.Json ``` Detect fade-in and fade-out transitions in a video using PySceneDetect's 
[ThresholdDetector](https://www.scenedetect.com/docs/latest/api/detectors.html#scenedetect.detectors.threshold_detector.ThresholdDetector). ThresholdDetector identifies scenes by detecting when pixel brightness falls below or rises above a threshold value, suitable for detecting fade-to-black, fade-to-white, and similar transitions. **Requirements:** * `pip install scenedetect` **Parameters:** * **`video`** (`pxt.Video`): The video to analyze for fade transitions. * **`fps`** (`pxt.Float | None`): Number of frames to extract per second for analysis. If None or 0, analyzes all frames. Lower values process faster but may miss exact transition points. * **`threshold`** (`pxt.Float`): 8-bit intensity value that each pixel value (R, G, and B) must be less than or equal to in order to trigger a fade in/out. * **`min_scene_len`** (`pxt.Int`): Once a cut is detected, this many frames must pass before a new one can be added to the scene list. * **`fade_bias`** (`pxt.Float`): Float between -1.0 and +1.0 representing the percentage of timecode skew for the start of a scene (-1.0 causing a cut at the fade-to-black, 0.0 in the middle, and +1.0 causing the cut to be right at the position where the threshold is passed). * **`add_final_scene`** (`pxt.Bool`): Boolean indicating if the video ends on a fade-out to generate an additional scene at this timecode. * **`method`** (`pxt.String`): How to treat threshold when detecting fade events * 'ceiling': Fade out happens when frame brightness rises above threshold. * 'floor': Fade out happens when frame brightness falls below threshold. **Returns:** * `pxt.Json`: A list of dictionaries, one for each detected scene, with the following keys: * `start_time` (float): The start time of the scene in seconds. * `start_pts` (int): The pts of the start of the scene. * `duration` (float): The duration of the scene in seconds. The list is ordered chronologically. Returns the full duration of the video if no scenes are detected. 
**Examples:** Detect fade-to-black transitions with default parameters: ```python theme={null} tbl.select(tbl.video.scene_detect_threshold()).collect() ``` Use a lower threshold to detect darker fades: ```python theme={null} tbl.select(tbl.video.scene_detect_threshold(threshold=8.0)).collect() ``` Detect fade-to-white transitions using the ceiling method: ```python theme={null} tbl.select(tbl.video.scene_detect_threshold(method='ceiling')).collect() ``` Add final scene boundary: ```python theme={null} tbl.select( tbl.video.scene_detect_threshold(add_final_scene=True) ).collect() ``` Add fade transitions as a computed column: ```python theme={null} tbl.add_computed_column( fade_cuts=tbl.video.scene_detect_threshold(threshold=15.0) ) ``` ## udf  segment\_video() ```python Signature theme={null} @pxt.udf segment_video( video: pxt.Video, *, duration: pxt.Float | None = None, segment_times: pxt.Json | None = None, mode: pxt.String = 'accurate', video_encoder: pxt.String | None = None, video_encoder_args: pxt.Json | None = None ) -> pxt.Json ``` Split a video into segments. **Requirements:** * `ffmpeg` needs to be installed and in PATH **Parameters:** * **`video`** (`pxt.Video`): Input video file to segment * **`duration`** (`pxt.Float | None`): Duration of each segment (in seconds). For `mode='fast'`, this is approximate; for `mode='accurate'`, segments will have exact durations. Cannot be specified together with `segment_times`. * **`segment_times`** (`pxt.Json | None`): List of timestamps (in seconds) in video where segments should be split. Note that these are not segment durations. If all segment times are less than the duration of the video, produces exactly `len(segment_times) + 1` segments. Cannot be empty or be specified together with `duration`.
* **`mode`** (`pxt.String`): Segmentation mode: * `'fast'`: Quick segmentation using stream copy (splits only at keyframes, approximate durations) * `'accurate'`: Precise segmentation with re-encoding (exact durations, slower) * **`video_encoder`** (`pxt.String | None`): Video encoder to use. If not specified, uses the default encoder for the current platform. Only available for `mode='accurate'`. * **`video_encoder_args`** (`pxt.Json | None`): Additional arguments to pass to the video encoder. Only available for `mode='accurate'`. **Returns:** * `pxt.Json`: List of file paths for the generated video segments. **Examples:** Split a video at 1-minute intervals using fast mode: ```python theme={null} tbl.select( segment_paths=tbl.video.segment_video( duration=60, mode='fast' ) ).collect() ``` Split video into exact 10-second segments with default accurate mode, using the libx264 encoder with a CRF of 23 and slow preset (for smaller output files): ```python theme={null} tbl.select( segment_paths=tbl.video.segment_video( duration=10, video_encoder='libx264', video_encoder_args={'crf': 23, 'preset': 'slow'}, ) ).collect() ``` Split video into two parts at the midpoint: ```python theme={null} duration = tbl.video.get_duration() tbl.select( segment_paths=tbl.video.segment_video(segment_times=[duration / 2]) ).collect() ``` ## udf  with\_audio() ```python Signature theme={null} @pxt.udf with_audio( video: pxt.Video, audio: pxt.Audio, *, video_start_time: pxt.Float = 0.0, video_duration: pxt.Float | None = None, audio_start_time: pxt.Float = 0.0, audio_duration: pxt.Float | None = None ) -> pxt.Video ``` Creates a new video that combines the video stream from `video` and the audio stream from `audio`. The `start_time` and `duration` parameters can be used to select a specific time range from each input. If the audio input (or selected time range) is longer than the video, the audio will be truncated. 
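The interplay of the `start_time` and `duration` parameters can be summarized with a small plain-Python sketch of the selection and truncation rules described above (illustrative only, not the UDF's actual implementation):

```python
def effective_durations(video_len, audio_len,
                        video_start=0.0, video_duration=None,
                        audio_start=0.0, audio_duration=None):
    """Mirror the documented selection and truncation rules."""
    # The output video duration: explicit duration, or the remainder after the start time
    out_video = video_duration if video_duration is not None else video_len - video_start
    # The selected audio range...
    out_audio = audio_duration if audio_duration is not None else audio_len - audio_start
    # ...truncated if it outlasts the output video
    return out_video, min(out_audio, out_video)
```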
**Requirements:** * `ffmpeg` needs to be installed and in PATH **Parameters:** * **`video`** (`pxt.Video`): Input video. * **`audio`** (`pxt.Audio`): Input audio. * **`video_start_time`** (`pxt.Float`): Start time in the video input (in seconds). * **`video_duration`** (`pxt.Float | None`): Duration of video segment (in seconds). If None, uses the remainder of the video after `video_start_time`. `video_duration` determines the duration of the output video. * **`audio_start_time`** (`pxt.Float`): Start time in the audio input (in seconds). * **`audio_duration`** (`pxt.Float | None`): Duration of audio segment (in seconds). If None, uses the remainder of the audio after `audio_start_time`. If the audio is longer than the output video, it will be truncated. **Returns:** * `pxt.Video`: A new video file with the audio track added. **Examples:** Add background music to a video: ```python theme={null} tbl.select(tbl.video.with_audio(tbl.music_track)).collect() ``` Add audio starting 5 seconds into both files: ```python theme={null} tbl.select( tbl.video.with_audio( tbl.music_track, video_start_time=5.0, audio_start_time=5.0 ) ).collect() ``` Use a 10-second clip from the middle of both files: ```python theme={null} tbl.select( tbl.video.with_audio( tbl.music_track, video_start_time=30.0, video_duration=10.0, audio_start_time=15.0, audio_duration=10.0, ) ).collect() ``` # vision Source: https://docs.pixeltable.com/sdk/latest/vision View Source on GitHub # module  pixeltable.functions.vision Pixeltable UDFs for Computer Vision. Example: ```python theme={null} import pixeltable as pxt from pixeltable.functions import vision as pxtv t = pxt.get_table(...) 
t.select( pxtv.draw_bounding_boxes(t.img, boxes=t.boxes, label=t.labels) ).collect() ``` ## udf  draw\_bounding\_boxes() ```python Signature theme={null} @pxt.udf draw_bounding_boxes( img: pxt.Image, boxes: pxt.Json, *, labels: pxt.Json | None = None, color: pxt.String | None = None, box_colors: pxt.Json | None = None, alpha: pxt.Float | None = None, fill: pxt.Bool = False, fill_alpha: pxt.Float | None = None, width: pxt.Int = 1, font: pxt.String | None = None, font_size: pxt.Int | None = None ) -> pxt.Image ``` Draws bounding boxes on the given image. Labels can be any type that supports `str()` and is hashable (e.g., strings, ints, etc.). Colors can be specified as common HTML color names (e.g., 'red') supported by PIL's [`ImageColor`](https://pillow.readthedocs.io/en/stable/reference/ImageColor.html#imagecolor-module) module or as RGB/RGBA hex codes (e.g., '#FF0000', '#FF0000FF'). If opacity isn't specified in the color string and `alpha`/`fill_alpha` is `None`, defaults to 1.0 for box borders and 0.5 for filled boxes. If no colors are specified, this function randomly assigns each label a specific color based on a hash of the label. **Parameters:** * **`img`** (`pxt.Image`): The image on which to draw the bounding boxes. * **`boxes`** (`pxt.Json`): List of bounding boxes, each represented as \[xmin, ymin, xmax, ymax]. * **`labels`** (`pxt.Json | None`): List of labels for each bounding box. * **`color`** (`pxt.String | None`): Single color to be used for all bounding boxes and labels. * **`box_colors`** (`pxt.Json | None`): List of colors, one per bounding box. * **`alpha`** (`pxt.Float | None`): Opacity (0-1) of the bounding box borders and labels. If non-`None`, overrides any alpha in `color`/`box_colors`. * **`fill`** (`pxt.Bool`): Whether to fill the bounding boxes with color. * **`fill_alpha`** (`pxt.Float | None`): Opacity (0-1) of the bounding box fill. If non-`None`, overrides any alpha in `color`/`box_colors`. 
* **`width`** (`pxt.Int`): Width of the bounding box borders. * **`font`** (`pxt.String | None`): Name of a system font or path to a TrueType font file, as required by [`PIL.ImageFont.truetype()`](https://pillow.readthedocs.io/en/stable/reference/ImageFont.html#PIL.ImageFont.truetype). If `None`, uses the default provided by [`PIL.ImageFont.load_default()`](https://pillow.readthedocs.io/en/stable/reference/ImageFont.html#PIL.ImageFont.load_default). * **`font_size`** (`pxt.Int | None`): Size of the font used for labels in points. Only used in conjunction with non-`None` `font` argument. **Returns:** * `pxt.Image`: The image with bounding boxes drawn on it. ## udf  eval\_detections() ```python Signature theme={null} @pxt.udf eval_detections( pred_bboxes: pxt.Json, pred_labels: pxt.Json, pred_scores: pxt.Json, gt_bboxes: pxt.Json, gt_labels: pxt.Json, min_iou: pxt.Float = 0.5 ) -> pxt.Json ``` Evaluates the performance of a set of predicted bounding boxes against a set of ground truth bounding boxes. **Parameters:** * **`pred_bboxes`** (`pxt.Json`): List of predicted bounding boxes, each represented as \[xmin, ymin, xmax, ymax]. * **`pred_labels`** (`pxt.Json`): List of predicted labels. * **`pred_scores`** (`pxt.Json`): List of predicted scores. * **`gt_bboxes`** (`pxt.Json`): List of ground truth bounding boxes, each represented as \[xmin, ymin, xmax, ymax]. * **`gt_labels`** (`pxt.Json`): List of ground truth labels. * **`min_iou`** (`pxt.Float`): Minimum intersection-over-union (IoU) threshold for a predicted bounding box to be considered a true positive. 
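For context on the `min_iou` threshold, intersection-over-union for the `[xmin, ymin, xmax, ymax]` box format can be computed as follows (a plain-Python sketch, not Pixeltable's internal implementation):

```python
def iou(a, b):
    """Intersection-over-union of two [xmin, ymin, xmax, ymax] boxes."""
    # Width and height of the intersection rectangle (zero if the boxes don't overlap)
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```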
**Returns:** * `pxt.Json`: A list of dictionaries, one per label class, with the following structure: ```python theme={null} { 'min_iou': float, # The value of `min_iou` used for the detections 'class': int, # The label class # List of 1's and 0's indicating true positives for each # predicted bounding box of this class 'tp': list[int], # List of 1's and 0's indicating false positives for each # predicted bounding box of this class; `fp[n] == 1 - tp[n]` 'fp': list[int], # List of predicted scores for each bounding box of this class 'scores': list[float], 'num_gts': int, # Number of ground truth bounding boxes of this class } ``` # voyageai Source: https://docs.pixeltable.com/sdk/latest/voyageai View Source on GitHub # module  pixeltable.functions.voyageai Pixeltable UDFs that wrap various endpoints from the Voyage AI API. In order to use them, you must first `pip install voyageai` and configure your Voyage AI credentials, as described in the [Working with Voyage AI](https://docs.pixeltable.com/notebooks/integrations/working-with-voyageai) tutorial. ## udf  embeddings() ```python Signature theme={null} @pxt.udf embeddings( input: pxt.String, *, model: pxt.String, input_type: pxt.String | None = None, truncation: pxt.Bool | None = None, output_dimension: pxt.Int | None = None, output_dtype: pxt.String | None = None ) -> pxt.Array[(None,), float32] ``` Creates an embedding vector representing the input text. Equivalent to the Voyage AI `embeddings` API endpoint. For additional details, see: [https://docs.voyageai.com/docs/embeddings](https://docs.voyageai.com/docs/embeddings) Request throttling: Applies the rate limit set in the config (section `voyageai`, key `rate_limit`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install voyageai` **Parameters:** * **`input`** (`pxt.String`): The text to embed. * **`model`** (`pxt.String`): The model to use for the embedding. 
Recommended options: `voyage-3-large`, `voyage-3.5`, `voyage-3.5-lite`, `voyage-code-3`, `voyage-finance-2`, `voyage-law-2`. * **`input_type`** (`pxt.String | None`): Type of the input text. Options: `None`, `query`, `document`. When `input_type` is `None`, the embedding model directly converts the inputs into numerical vectors. For retrieval/search purposes, we recommend setting this to `query` or `document` as appropriate. * **`truncation`** (`pxt.Bool | None`): Whether to truncate the input texts to fit within the context length. Defaults to `True`. * **`output_dimension`** (`pxt.Int | None`): The number of dimensions for resulting output embeddings. Most models only support a single default dimension. Models `voyage-3-large`, `voyage-3.5`, `voyage-3.5-lite`, and `voyage-code-3` support: 256, 512, 1024 (default), and 2048. * **`output_dtype`** (`pxt.String | None`): The data type for the embeddings to be returned. Options: `float`, `int8`, `uint8`, `binary`, `ubinary`. Only `float` is currently supported in Pixeltable. **Returns:** * `pxt.Array[(None,), float32]`: An array representing the application of the given embedding to `input`. 
**Examples:** Add a computed column that applies the model `voyage-3.5` to an existing Pixeltable column `tbl.text` of the table `tbl`: ```python theme={null} tbl.add_computed_column( embed=embeddings(tbl.text, model='voyage-3.5', input_type='document') ) ``` Add an embedding index to an existing column `text`, using the model `voyage-3.5`: ```python theme={null} tbl.add_embedding_index( 'text', string_embed=embeddings.using(model='voyage-3.5') ) ``` ## udf  multimodal\_embed() ```python Signatures theme={null} # Signature 1: @pxt.udf multimodal_embed( text: pxt.String, model: pxt.String, input_type: pxt.String | None, truncation: pxt.Bool ) -> pxt.Array[(1024,), float32] # Signature 2: @pxt.udf multimodal_embed( image: pxt.Image, model: pxt.String, input_type: pxt.String | None, truncation: pxt.Bool ) -> pxt.Array[(1024,), float32] # Signature 3: @pxt.udf multimodal_embed( video: pxt.Video, model: pxt.String, input_type: pxt.String | None, truncation: pxt.Bool ) -> pxt.Array[(1024,), float32] ``` Creates an embedding vector for text, images, or video using Voyage AI's multimodal model. Equivalent to the Voyage AI `multimodal_embed` API endpoint. For additional details, see: [https://docs.voyageai.com/docs/multimodal-embeddings](https://docs.voyageai.com/docs/multimodal-embeddings) Request throttling: Applies the rate limit set in the config (section `voyageai`, key `rate_limit`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install voyageai` **Parameters:** * **`text`** (`String`): The text to embed. * **`image`** (`Image`): The image to embed. * **`video`** (`Video`): The video to embed. * **`model`** (`String`): The model to use. Currently only `voyage-multimodal-3` is supported. * **`input_type`** (`String | None`, default: `Literal(None)`): Type of the input. Options: `None`, `query`, `document`. For retrieval/search, set to `query` or `document` as appropriate. 
* **`truncation`** (`Bool`, default: `Literal(True)`): Whether to truncate inputs to fit within context length. Defaults to `True`. **Returns:** * `pxt.Array[(1024,), float32]`: An array of 1024 floats representing the embedding. **Examples:** Embed a text column `description`: ```python theme={null} tbl.add_computed_column( embed=multimodal_embed(tbl.description, input_type='document') ) ``` Add an embedding index for column `description`: ```python theme={null} tbl.add_embedding_index( 'description', embed=multimodal_embed.using(model='voyage-multimodal-3'), ) ``` ## udf  rerank() ```python Signature theme={null} @pxt.udf rerank( query: pxt.String, documents: pxt.Json, *, model: pxt.String, top_k: pxt.Int | None = None, truncation: pxt.Bool = True ) -> pxt.Json ``` Reranks documents based on their relevance to a query. Equivalent to the Voyage AI `rerank` API endpoint. For additional details, see: [https://docs.voyageai.com/docs/reranker](https://docs.voyageai.com/docs/reranker) Request throttling: Applies the rate limit set in the config (section `voyageai`, key `rate_limit`). If no rate limit is configured, uses a default of 600 RPM. **Requirements:** * `pip install voyageai` **Parameters:** * **`query`** (`pxt.String`): The query as a string. * **`documents`** (`pxt.Json`): The documents to be reranked as a list of strings. * **`model`** (`pxt.String`): The model to use for reranking. Recommended options: `rerank-2.5`, `rerank-2.5-lite`. * **`top_k`** (`pxt.Int | None`): The number of most relevant documents to return. If not specified, all documents will be reranked and returned. * **`truncation`** (`pxt.Bool`): Whether to truncate the input to satisfy context length limits. Defaults to `True`. **Returns:** * `pxt.Json`: A dictionary containing: * `results`: List of reranking results with `index`, `document`, and `relevance_score` * `total_tokens`: The total number of tokens used **Examples:** Rerank similarity search results for better relevance. 
First, create a table with an embedding index, then use a query function to retrieve candidates and rerank them: ```python theme={null} docs = pxt.create_table('docs', {'text': pxt.String}) docs.add_computed_column(embed=embeddings(docs.text, model='voyage-3.5')) docs.add_embedding_index('text', embed=docs.embed) @pxt.query def get_candidates(query_text: str): sim = docs.text.similarity( query_text, embed=embeddings.using(model='voyage-3.5') ) return docs.order_by(sim, asc=False).limit(20).select(docs.text) queries = pxt.create_table('queries', {'query': pxt.String}) queries.add_computed_column(candidates=get_candidates(queries.query)) queries.add_computed_column( reranked=rerank( queries.query, queries.candidates.text, model='rerank-2.5', top_k=5, ) ) ``` # whisper Source: https://docs.pixeltable.com/sdk/latest/whisper View Source on GitHub # module  pixeltable.functions.whisper Pixeltable UDFs that wrap the OpenAI Whisper library. These UDFs cause Pixeltable to invoke the relevant model locally. In order to use them, you must first `pip install openai-whisper`. ## udf  transcribe() ```python Signature theme={null} @pxt.udf transcribe( audio: pxt.Audio, *, model: pxt.String, temperature: pxt.Json | None = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0), compression_ratio_threshold: pxt.Float | None = 2.4, logprob_threshold: pxt.Float | None = -1.0, no_speech_threshold: pxt.Float | None = 0.6, condition_on_previous_text: pxt.Bool = True, initial_prompt: pxt.String | None = None, word_timestamps: pxt.Bool = False, prepend_punctuations: pxt.String = '"\'“¿([{-', append_punctuations: pxt.String = '"\'.。,,!!??::”)]}、', decode_options: pxt.Json | None = None ) -> pxt.Json ``` Transcribe an audio file using Whisper. This UDF runs a transcription model *locally* using the Whisper library, equivalent to the Whisper `transcribe` function, as described in the [Whisper library documentation](https://github.com/openai/whisper). 
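The result dictionary follows the Whisper library's format; in particular it contains a `text` field and a list of timestamped `segments`. A plain-Python sketch of post-processing such a result (the sample values below are made up for illustration):

```python
# Hypothetical transcription result, in the Whisper library's documented shape
result = {
    'text': ' Hello world.',
    'language': 'en',
    'segments': [
        {'id': 0, 'start': 0.0, 'end': 1.2, 'text': ' Hello world.'},
    ],
}

# Pull out the full transcript and a (start, end, text) timeline
transcript = result['text'].strip()
timeline = [(s['start'], s['end'], s['text'].strip()) for s in result['segments']]
```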
**Requirements:** * `pip install openai-whisper` **Parameters:** * **`audio`** (`pxt.Audio`): The audio file to transcribe. * **`model`** (`pxt.String`): The name of the model to use for transcription. **Returns:** * `pxt.Json`: A dictionary containing the transcription and various other metadata. **Examples:** Add a computed column that applies the model `base.en` to an existing Pixeltable column `tbl.audio` of the table `tbl`: ```python theme={null} tbl.add_computed_column(result=transcribe(tbl.audio, model='base.en')) ``` # whisperx Source: https://docs.pixeltable.com/sdk/latest/whisperx View Source on GitHub # module  pixeltable.functions.whisperx WhisperX audio transcription and diarization functions. ## udf  transcribe() ```python Signature theme={null} @pxt.udf transcribe( audio: pxt.Audio, *, model: pxt.String, diarize: pxt.Bool = False, compute_type: pxt.String | None = None, language: pxt.String | None = None, task: pxt.String | None = None, chunk_size: pxt.Int | None = None, alignment_model_name: pxt.String | None = None, interpolate_method: pxt.String | None = None, return_char_alignments: pxt.Bool | None = None, diarization_model_name: pxt.String | None = None, num_speakers: pxt.Int | None = None, min_speakers: pxt.Int | None = None, max_speakers: pxt.Int | None = None ) -> pxt.Json ``` Transcribe an audio file using WhisperX. This UDF runs a transcription model *locally* using the WhisperX library, equivalent to the WhisperX `transcribe` function, as described in the [WhisperX library documentation](https://github.com/m-bain/whisperX). If `diarize=True`, then speaker diarization will also be performed. Several of the UDF parameters are only valid if `diarize=True`, as documented in the parameters list below. **Requirements:** * `pip install whisperx` **Parameters:** * **`audio`** (`pxt.Audio`): The audio file to transcribe. * **`model`** (`pxt.String`): The name of the model to use for transcription. 
* **`diarize`** (`pxt.Bool`): Whether to perform speaker diarization. * **`compute_type`** (`pxt.String | None`): The compute type to use for the model (e.g., `'int8'`, `'float16'`). If `None`, defaults to `'float16'` on CUDA devices and `'int8'` otherwise. * **`language`** (`pxt.String | None`): The language code for the transcription (e.g., `'en'` for English). * **`task`** (`pxt.String | None`): The task to perform (e.g., `'transcribe'` or `'translate'`). Defaults to `'transcribe'`. * **`chunk_size`** (`pxt.Int | None`): The size of the audio chunks to process, in seconds. Defaults to `30`. * **`alignment_model_name`** (`pxt.String | None`): The name of the alignment model to use. If `None`, uses the default model for the given language. Only valid if `diarize=True`. * **`interpolate_method`** (`pxt.String | None`): The method to use for interpolation of the alignment results. If not specified, uses the WhisperX default (`'nearest'`). Only valid if `diarize=True`. * **`return_char_alignments`** (`pxt.Bool | None`): Whether to return character-level alignments. Defaults to `False`. Only valid if `diarize=True`. * **`diarization_model_name`** (`pxt.String | None`): The name of the diarization model to use. Defaults to `pyannote/speaker-diarization-3.1`. Only valid if `diarize=True`. * **`num_speakers`** (`pxt.Int | None`): The number of speakers to expect in the audio. By default, the model will try to detect the number of speakers. Only valid if `diarize=True`. * **`min_speakers`** (`pxt.Int | None`): If specified, the minimum number of speakers to expect in the audio. Only valid if `diarize=True`. * **`max_speakers`** (`pxt.Int | None`): If specified, the maximum number of speakers to expect in the audio. Only valid if `diarize=True`. **Returns:** * `pxt.Json`: A dictionary containing the audio transcription, diarization (if enabled), and various other metadata. 
**Examples:** Add a computed column that applies the model `tiny.en` to an existing Pixeltable column `tbl.audio` of the table `tbl`: ```python theme={null} tbl.add_computed_column(result=transcribe(tbl.audio, model='tiny.en')) ``` Add a computed column that applies the model `tiny.en` to an existing Pixeltable column `tbl.audio` of the table `tbl`, with speaker diarization enabled, expecting at least 2 speakers: ```python theme={null} tbl.add_computed_column( result=transcribe( tbl.audio, model='tiny.en', diarize=True, min_speakers=2 ) ) ``` # yolox Source: https://docs.pixeltable.com/sdk/latest/yolox View Source on GitHub # module  pixeltable.functions.yolox YOLOX object detection functions. ## udf  yolo\_to\_coco() ```python Signature theme={null} @pxt.udf yolo_to_coco(detections: pxt.Json) -> pxt.Json ``` Converts the output of a YOLOX object detection model to COCO format. **Parameters:** * **`detections`** (`pxt.Json`): The output of a YOLOX object detection model, as returned by `yolox`. **Returns:** * `pxt.Json`: A dictionary containing the data from `detections`, converted to COCO format. **Examples:** Add a computed column that converts the output `tbl.detections` to COCO format, where `tbl.image` is the image for which detections were computed: ```python theme={null} tbl.add_computed_column( detections=yolox(tbl.image, model_id='yolox_m', threshold=0.8) ) tbl.add_computed_column(detections_coco=yolo_to_coco(tbl.detections)) ``` ## udf  yolox() ```python Signature theme={null} @pxt.udf yolox( images: pxt.Image, *, model_id: pxt.String, threshold: pxt.Float = 0.5 ) -> pxt.Json ``` Computes YOLOX object detections for the specified image. `model_id` should reference one of the models defined in the [YOLOX documentation](https://github.com/Megvii-BaseDetection/YOLOX). 
**Requirements:** * `pip install pixeltable-yolox` **Parameters:** * **`images`** (`pxt.Image`): The image for which to compute detections * **`model_id`** (`pxt.String`): one of: `yolox_nano`, `yolox_tiny`, `yolox_s`, `yolox_m`, `yolox_l`, `yolox_x` * **`threshold`** (`pxt.Float`): the threshold for object detection **Returns:** * `pxt.Json`: A dictionary containing the output of the object detection model. **Examples:** Add a computed column that applies the model `yolox_m` to an existing Pixeltable column `tbl.image` of the table `tbl`: ```python theme={null} tbl.add_computed_column( detections=yolox(tbl.image, model_id='yolox_m', threshold=0.8) ) ``` # Computed Columns Source: https://docs.pixeltable.com/tutorials/computed-columns Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. This guide introduces one of Pixeltable’s most essential and powerful concepts: computed columns. You’ll learn how to: * Add computed columns to a table * Use computed columns for complex operations such as image processing and model inference ## Prerequisites This guide assumes you’re familiar with: * Creating and managing tables * Inserting and querying data * Basic table operations If you’re new to Pixeltable, start with the [Tables and Data Operations](/tutorials/tables-and-data-operations) guide. First, let’s ensure the Pixeltable library is installed in your environment, along with the Hugging Face `transformers` library. ```python theme={null} %pip install -qU pixeltable torch transformers ``` ### Computed Columns Let’s start with a simple example that illustrates the basic concepts behind computed columns. We’ll use a table of world population data for our example. Remember that you can import datasets into a Pixeltable table by using `pxt.create_table()` with the `source` parameter. 
```python theme={null} import pixeltable as pxt pxt.create_dir('fundamentals', if_exists='ignore') pop_t = pxt.create_table( 'fundamentals/population', source='https://github.com/pixeltable/pixeltable/raw/release/docs/resources/world-population-data.csv', if_exists='replace', ) ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'fundamentals'.
  Created table 'population'.
  Inserting rows into \`population\`: 234 rows \[00:00, 6850.71 rows/s]
  Inserted 234 rows with 0 errors.
Also recall that `pop_t.head()` returns the first few rows of a table, and typing the table name `pop_t` by itself gives the schema. ```python theme={null} pop_t.head(5) ```
```python theme={null} pop_t ```
Now let’s suppose we want to add a new column for the year-over-year population change from 2022 to 2023. You can `select()` such a quantity into a Pixeltable `Query`, giving it the name `yoy_change` (year-over-year change): ```python theme={null} pop_t.select( pop_t.country, yoy_change=(pop_t.pop_2023 - pop_t.pop_2022) ).head(5) ```
A **computed column** is a way of turning such a selection into a new, permanent column of the table. Here’s how it works: ```python theme={null} pop_t.add_computed_column(yoy_change=(pop_t.pop_2023 - pop_t.pop_2022)) ```
  Added 234 column values with 0 errors.
  234 rows updated, 468 values computed.
As soon as the column is added, Pixeltable will (by default) automatically compute its value for all rows in the table, storing the results in the new column. If we now inspect the schema of `pop_t`, we see the new column and its definition. ```python theme={null} pop_t ```
The new column can be queried in the usual manner. ```python theme={null} pop_t.select(pop_t.country, pop_t.yoy_change).head(5) ```
The output is identical to the previous example, but now we’re retrieving the computed output from the database, instead of computing it on-the-fly. Computed columns can be “chained” with other computed columns. Here’s an example that expresses population change as a percentage: ```python theme={null} pop_t.add_computed_column( yoy_percent_change=(100 * pop_t.yoy_change / pop_t.pop_2022) ) ```
  Added 234 column values with 0 errors.
  234 rows updated, 468 values computed.
```python theme={null} pop_t ```
```python theme={null} pop_t.select( pop_t.country, pop_t.yoy_change, pop_t.yoy_percent_change ).head(5) ```
Although computed columns appear superficially similar to Queries, there is a key difference. Because computed columns are a permanent part of the table, they will be automatically updated any time new data is added to the table. These updates will propagate through any other computed columns that are “downstream” of the new data, ensuring that the entire dataset is kept up-to-date. In traditional data workflows, it is commonplace to recompute entire pipelines when the input dataset is changed or enlarged. In Pixeltable, by contrast, all updates are applied incrementally. When new data appear in a table or existing data are altered, Pixeltable will recompute only those rows that are dependent on the changed data. Let’s see how this works in practice. For purposes of illustration, we’ll add an entry for California to the table, as if it were a country. ```python theme={null} pop_t.insert(country='California', pop_2023=39110000, pop_2022=39030000) ```
  Inserting rows into \`population\`: 1 rows \[00:00, 228.35 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 5 values computed.
Observe that the computed columns `yoy_change` and `yoy_percent_change` have been automatically updated in response to the new data. ```python theme={null} pop_t.tail(5) ```
Remember that all tables in Pixeltable are persistent. This includes computed columns: when you create a computed column, its definition is stored in the database. You can think of computed columns as setting up a persistent compute workflow: if you close your notebook or restart your Python instance, computed columns (along with the relationships between them, and any data contained in them) will be preserved. ### Recomputing Columns From time to time you might need to recompute the data in an existing computed column. Perhaps the *code* for one of your UDFs has changed, and you want to recompute a column that uses that UDF in order to pick up the new logic. Or perhaps you want to re-run a nondeterministic computation such as model inference. The command to do this is `recompute_columns()`. It won’t do much in the current example, because all our computations are simple and deterministic, but for demonstration purposes here’s what it looks like: ```python theme={null} pop_t.recompute_columns(pop_t.yoy_change, pop_t.yoy_percent_change) ```
  Inserting rows into \`population\`: 235 rows \[00:00, 8795.92 rows/s]
  235 rows updated, 940 values computed.
```python theme={null} pop_t.tail(5) ```
As expected, it looks the same. If you modify the data that a computed column depends on, Pixeltable will recompute automatically; so `recompute_columns()` is primarily useful when the input data remains the same, but your UDF business logic changes. ### A More Complex Example: Image Processing Pixeltable supports media data such as images alongside traditional structured data. Let’s explore an example that uses computed columns for image processing operations. In this example, we’ll create the table directly by providing a schema, rather than importing it from a CSV. ```python theme={null} t = pxt.create_table('fundamentals/image_ops', {'source': pxt.Image}) ```
  Created table 'image\_ops'.
```python theme={null} url_prefix = 'https://github.com/pixeltable/pixeltable/raw/release/docs/resources/images' images = ['000000000139.jpg', '000000000632.jpg', '000000000872.jpg'] t.insert({'source': f'{url_prefix}/{image}'} for image in images) ```
  Inserting rows into \`image\_ops\`: 3 rows \[00:00, 1133.39 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 6 values computed.
```python theme={null} t.collect() ```
What are some things we might want to do with these images? A fairly basic one is to extract metadata. Pixeltable provides the built-in UDF `get_metadata()`, which returns a dictionary with various metadata about the image. Let’s go ahead and make this a computed column. “UDF” is standard terminology in databases, meaning “User-Defined Function”. Technically speaking, the `get_metadata()` function isn’t user-defined; it’s built into the Pixeltable library. But we’ll consistently refer to Pixeltable functions as “UDFs” in order to clearly distinguish them from ordinary Python functions. Later in this guide, we’ll see how to turn (almost) any Python function into a Pixeltable UDF. ```python theme={null} t.add_computed_column(metadata=t.source.get_metadata()) t.collect() ```
  Added 3 column values with 0 errors.
Image operations, of course, can also return new images. ```python theme={null} t.add_computed_column(rotated=t.source.rotate(10)) ```
  Added 3 column values with 0 errors.
  3 rows updated, 3 values computed.
```python theme={null} t.collect() ```
Or, perhaps we want to rotate our images and fill them in with a transparent background rather than black. We can do this by chaining image operations, adding a transparency layer before doing the rotation. ```python theme={null} t.add_computed_column( rotated_transparent=t.source.convert('RGBA').rotate(10) ) t.collect() ```
  Added 3 column values with 0 errors.
In addition to `get_metadata()`, `convert()`, and `rotate()`, Pixeltable has a sizable library of other common image operations that can be used as UDFs in computed columns. For the most part, the image UDFs are analogs of the operations provided by the Pillow library (in fact, Pixeltable is just using Pillow under the covers). You can read more about the provided image (and other) UDFs in the Pixeltable SDK Documentation. Let’s have a look at our table schema. ```python theme={null} t ```
### Image Detection In addition to simple operations like `rotate()` and `convert()`, the Pixeltable API includes UDFs for various off-the-shelf image models. Let’s look at one example: object detection using the DETR model with a ResNet-50 backbone. Model inference is a UDF too, and it can be inserted into a computed column like any other. This one may take a little more time to compute, since it involves first downloading the DETR model (if it isn’t already cached), then running inference on the images in our table. ```python theme={null} from pixeltable.functions.huggingface import detr_for_object_detection t.add_computed_column( detections=detr_for_object_detection( t.source, model_id='facebook/detr-resnet-50', threshold=0.8 ) ) ```
  Added 3 column values with 0 errors.
  3 rows updated, 3 values computed.
```python theme={null} t.select(t.source, t.detections).collect() ```
It’s great that the DETR model gave us so much information about the images, but it’s not exactly in human-readable form. Those are JSON structures that encode bounding boxes, confidence scores, and categories for each detected object. Let’s do something more useful with them: we’ll use Pixeltable’s `draw_bounding_boxes()` API to superimpose bounding boxes on the images, using different colors to distinguish different object categories. ```python theme={null} from pixeltable.functions.vision import draw_bounding_boxes t.add_computed_column( image_with_bb=draw_bounding_boxes( t.source, t.detections.boxes, labels=t.detections.label_text, fill=True, ) ) t.select(t.source, t.image_with_bb).collect() ```
  Added 3 column values with 0 errors.
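For a sense of what a call like this does, here is a minimal, self-contained sketch of drawing labeled boxes with Pillow directly. The helper name, palette, and box layout are ours for illustration; this is not the Pixeltable implementation of `draw_bounding_boxes()`:

```python
from PIL import Image, ImageDraw

def draw_boxes(img: Image.Image, boxes, labels) -> Image.Image:
    """Draw labeled rectangles on a copy of `img`, one color per distinct label."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    palette = ['red', 'lime', 'deepskyblue', 'orange']
    color_for: dict[str, str] = {}
    for (x1, y1, x2, y2), label in zip(boxes, labels):
        # Assign each distinct label a stable color, cycling through the palette
        color = color_for.setdefault(label, palette[len(color_for) % len(palette)])
        draw.rectangle((x1, y1, x2, y2), outline=color, width=2)
        draw.text((x1 + 3, y1 + 3), label, fill=color)
    return out

base = Image.new('RGB', (100, 100), 'black')
boxed = draw_boxes(base, [(10, 10, 60, 60), (30, 30, 90, 90)], ['cat', 'dog'])
```

Distinguishing categories by color, as here, is the same idea the Pixeltable UDF applies per row.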
It can be a little hard to see what’s going on, so let’s zoom in on just one image. If you select a single image in a notebook, Pixeltable will enlarge its display: ```python theme={null} t.select(t.image_with_bb).head(1) ```
Let’s check in on our schema. We now have five computed columns, all derived from the single source column. ```python theme={null} t ```
And as always, when we add new data to the table, its computed columns are updated automatically. Let’s try this on a few more images. ```python theme={null} more_images = ['000000000108.jpg', '000000000885.jpg'] t.insert({'source': f'{url_prefix}/{image}'} for image in more_images) ```
  Inserting rows into \`image\_ops\`: 2 rows \[00:00, 944.77 rows/s]
  Inserted 2 rows with 0 errors.
  2 rows inserted, 14 values computed.
```python theme={null} t.select( t.source, t.image_with_bb, t.detections.label_text, t.metadata ).tail(2) ```
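Conceptually, each inserted row flows through the computed-column definitions in dependency order. Here is a toy model of that propagation, not Pixeltable's actual engine (which also handles batching, error tracking, and persistence):

```python
def compute_row(source, computed_cols):
    """Build a full row by applying each computed-column function in order."""
    row = {'source': source}
    for name, fn in computed_cols.items():
        # Each function may reference the source column or earlier computed columns
        row[name] = fn(row)
    return row

computed = {
    'doubled': lambda r: r['source'] * 2,
    'plus_one': lambda r: r['doubled'] + 1,  # depends on another computed column
}
row = compute_row(5, computed)
# {'source': 5, 'doubled': 10, 'plus_one': 11}
```

This is why inserting 2 rows above computed many values at once: every computed column is evaluated for every new row.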
It bears repeating that Pixeltable is persistent! Anything you put into a table, including computed columns, will be saved in persistent storage. This includes inference outputs such as `t.detections`, as well as generated images such as `t.image_with_bb`. (Later we’ll see how to tune this behavior in cases where it might be undesirable to store everything, but the default behavior is that computed column output is always persisted.) ### Expressions Let’s have a closer look at that call to `draw_bounding_boxes()` in the last example. ```python theme={null} draw_bounding_boxes(t.source, t.detections.boxes, labels=t.detections.label_text, fill=True) ``` There are a couple of things going on. `draw_bounding_boxes()` is, of course, a UDF, and its first argument is a column reference of the sort we’ve used many times now: `t.source`, the source image. The other two arguments are more than simple column references, though: they’re compound expressions that include the column reference `t.detections` along with a suffix (`.boxes` or `.label_text`) that tells Pixeltable to look inside the dictionary stored in `t.detections`. These are all examples of Pixeltable expressions. In fact, we’ve seen other types of Pixeltable expressions as well, without explicitly calling them out: * Calls to a UDF are expressions, such as `t.source.rotate(10)`, or the `draw_bounding_boxes()` example above; * Arithmetic operations are expressions, such as the year-over-year calculation in our first example: `100 * pop_t.yoy_change / pop_t.pop_2022`. ## Next Steps Learn more about working with Pixeltable: * [Queries and Expressions](/tutorials/queries-and-expressions) * [Tables and Data Operations](/tutorials/tables-and-data-operations) # Queries and Expressions Source: https://docs.pixeltable.com/tutorials/queries-and-expressions Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. 
You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. Expressions are the basic building blocks of Pixeltable. This guide explores how to use queries and expressions, including: * Different types of Pixeltable expressions * Column references and arithmetic operations * Function calls and media operations * The Pixeltable type system ## Prerequisites This guide assumes you’re familiar with: * Creating and managing tables * Basic table operations and queries * Computed columns If you’re new to these concepts, start with: * [Tables and Data Operations](/tutorials/tables-and-data-operations) * [Computed Columns](/tutorials/computed-columns) ## Understanding Expressions You can use Pixeltable expressions in queries: ```python theme={null} pop_t.select(yoy_change=(pop_t.pop_2023 - pop_t.pop_2022)).collect() ``` Or as computed columns that update automatically: ```python theme={null} pop_t.add_column(yoy_change=(pop_t.pop_2023 - pop_t.pop_2022)) ``` Both examples use the expression `pop_t.pop_2023 - pop_t.pop_2022`. You can also chain operations: ```python theme={null} t.source.convert('RGBA').rotate(10) ``` Or invoke models: ```python theme={null} detr_for_object_detection( t.source, model_id='facebook/detr-resnet-50', threshold=0.8 ) ``` You can include an expression in a `select()` statement to evaluate it dynamically, or in an `add_column()` statement to add it to the table schema as a computed column. To get started, let’s import the necessary libraries and set up a demo directory. ```python theme={null} %pip install -qU pixeltable datasets torch transformers ``` ```python theme={null} import pixeltable as pxt pxt.drop_dir('demo', force=True) pxt.create_dir('demo') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'demo'.
In this guide we’ll work with a subset of the MNIST dataset, a classic reference database of handwritten digits. A copy of the MNIST dataset is hosted on the Hugging Face datasets repository, so we can use `create_table()` with the `source` parameter to load it into a Pixeltable table. ```python theme={null} import datasets # Download the first 50 images of the MNIST dataset ds = datasets.load_dataset('ylecun/mnist', split='train[:50]') # Import them into a Pixeltable table t = pxt.create_table('demo/mnist', source=ds) ```
  Created table 'mnist'.
  Inserting rows into \`mnist\`: 50 rows \[00:00, 7516.67 rows/s]
  Inserted 50 rows with 0 errors.
```python theme={null} t.head(5) ```
### Column References The most basic type of expression is a **column reference**: that’s what you get when you type, say, `t.image`. An expression such as `t.image` by itself is just a Python object; it doesn’t contain any actual data, and no data will be loaded until you use the expression in a `select()` query or `add_column()` statement. Here’s what we get if we type `t.image` by itself: ```python theme={null} t.image ```
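A column reference behaves like a small description of work to do later. As a mental model only (not Pixeltable's actual implementation), you can picture an object that merely records which column you mean and touches data only when evaluated:

```python
class ColumnRef:
    """Toy stand-in for a Pixeltable column reference: records intent, loads nothing."""

    def __init__(self, name: str):
        self.name = name

    def __repr__(self) -> str:
        return f'ColumnRef({self.name!r})'

    def evaluate(self, rows: list[dict]) -> list:
        # Data is only touched here, at "query" time
        return [row[self.name] for row in rows]

expr = ColumnRef('image')    # no data loaded yet
rows = [{'image': 'img0.png'}, {'image': 'img1.png'}]
result = expr.evaluate(rows)
# ['img0.png', 'img1.png']
```

Here `evaluate()` plays the role that `collect()` plays in a real query: constructing the expression is free, and rows are only read at evaluation time.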
This is true of all Pixeltable expressions: we can freely create them and manipulate them in various ways, but no actual data will be loaded until we use them in a query. ### JSON Collections (Dicts and Lists) Data is commonly presented in JSON format: for example, API responses and model output often take the shape of JSON dictionaries or lists of dictionaries. Pixeltable has native support for JSON accessors. To demonstrate this, let’s add a computed column that runs an image classification model against the images in our dataset. ```python theme={null} from pixeltable.functions.huggingface import vit_for_image_classification t.add_computed_column( classification=vit_for_image_classification( t.image, model_id='farleyknight-org-username/vit-base-mnist' ) ) ```
  Added 50 column values with 0 errors.
  50 rows updated, 50 values computed.
```python theme={null} t.select(t.image, t.classification).head(3) ```
We see that the output is returned as a dict containing three lists: the five most likely labels (classes) for the image, the corresponding text labels (in this case, just the string form of the class number), and the scores (confidences) of each prediction. The Pixeltable type of the `classification` column is `pxt.Json`: ```python theme={null} t ```
Pixeltable provides a range of operators on `Json`-typed output that behave just as you’d expect. To look up a key in a dictionary, use the syntax `t.classification['labels']`: ```python theme={null} t.select(t.classification['labels']).head(3) ```
You can also use a convenient “attribute” syntax for dictionary lookups. This follows the standard [JSONPath](https://en.wikipedia.org/wiki/JSONPath) expression syntax. ```python theme={null} t.select(t.classification.labels).head(3) ```
The “attribute” syntax isn’t fully general (it won’t work for dictionary keys that are not valid Python identifiers), but it’s handy when it works. `t.classification.labels` is another Pixeltable expression; you can think of it as saying, “do the `'labels'` lookup from every dictionary in the column `t.classification`, and return the result as a new column.” As before, the expression by itself contains no data; it’s the query that does the actual work of retrieving data. Here’s what we see if we just give the expression by itself, without a query: ```python theme={null} t.classification.labels ```
  classification.labels
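Conceptually, the attribute syntax works because the expression object intercepts attribute access and records it as a key lookup. A toy model of the mechanism (not Pixeltable's implementation):

```python
class JsonRef:
    """Toy model of attribute-style JSON lookup: `.labels` records a key, loads nothing."""

    def __init__(self, path: tuple = ()):
        self._path = path

    def __getattr__(self, key: str) -> 'JsonRef':
        # Called only for attributes Python can't find normally:
        # extend the recorded lookup path instead of failing
        return JsonRef(self._path + (key,))

    def extract(self, data):
        # Walk the recorded path through an actual JSON value
        for key in self._path:
            data = data[key]
        return data

ref = JsonRef().labels                    # records ('labels',); no data touched
value = ref.extract({'labels': [4, 9]})   # [4, 9]
```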
Similarly, one can pull out a specific item in a list (for this model, we’re probably mostly interested in the first item anyway): ```python theme={null} t.select(t.classification.labels[0]).head(3) ```
Or slice a list in the usual manner: ```python theme={null} t.select(t.classification.labels[:2]).head(3) ```
Pixeltable is resilient against out-of-bounds indices or dictionary keys. If an index or key doesn’t exist for a particular row, you’ll get a `None` output for that row. ```python theme={null} t.select(t.classification.not_a_key).head(3) ```
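This forgiving behavior matches what you'd get from a hand-written safe lookup in plain Python; a sketch of the equivalent semantics:

```python
def json_lookup(value, key):
    """Return value[key], or None if the key or index doesn't exist."""
    try:
        return value[key]
    except (KeyError, IndexError, TypeError):
        return None

row = {'labels': [4, 9], 'scores': [0.98, 0.01]}
a = json_lookup(row, 'labels')       # [4, 9]
b = json_lookup(row, 'not_a_key')    # None
c = json_lookup(row['labels'], 5)    # None (out-of-bounds index)
```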
As always, any expression can be used to create a computed column. ```python theme={null} # Use label_text to be consistent with t.label, which was given # to us as a string t.add_computed_column(pred_label=t.classification.label_text[0]) t ```
  Added 50 column values with 0 errors.
Finally, just as it’s possible to extract items from lists and dictionaries using Pixeltable expressions, you can also construct new lists and dictionaries: just package them up in the usual way. ```python theme={null} custom_dict = { # Keys must be strings; values can be any expressions 'ground_truth': t.label, 'prediction': t.pred_label, 'is_correct': t.label == t.pred_label, # You can also use constants as values 'engine': 'pixeltable', } t.select(t.image, custom_dict).head(5) ```
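Per row, you can think of such a dict as a template: expression-valued entries are evaluated against that row, while constants pass through unchanged. A toy model using plain Python callables in place of Pixeltable expressions:

```python
def eval_dict_expr(template: dict, row: dict) -> dict:
    """Evaluate a dict template: callables act as expressions, other values as constants."""
    return {k: (v(row) if callable(v) else v) for k, v in template.items()}

template = {
    'ground_truth': lambda r: r['label'],
    'prediction': lambda r: r['pred_label'],
    'is_correct': lambda r: r['label'] == r['pred_label'],
    'engine': 'pixeltable',  # constant value, same for every row
}
row = {'label': '5', 'pred_label': '5'}
result = eval_dict_expr(template, row)
# {'ground_truth': '5', 'prediction': '5', 'is_correct': True, 'engine': 'pixeltable'}
```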
### UDF Calls UDF calls are another common type of expression. For example, we used one earlier when we added a model invocation to our workload: ```python theme={null} vit_for_image_classification( t.image, model_id='farleyknight-org-username/vit-base-mnist' ) ``` This calls the `vit_for_image_classification` UDF in the `pxt.functions.huggingface` module. Note that `vit_for_image_classification` is a Pixeltable UDF, not an ordinary Python function. You can think of a Pixeltable UDF as a function that operates on columns of data, iteratively applying an underlying operation to each row in the column (or columns). In this case, `vit_for_image_classification` operates on `t.image`, running the model against every image in the column. Notice that in addition to the column `t.image`, this call to `vit_for_image_classification` also takes a constant argument specifying the `model_id`. Any UDF call argument may be a constant, and the constant value simply means “use this value for every row being evaluated”. You can always compose Pixeltable expressions to form more complicated ones; here’s an example that runs the model against a 90-degree rotation of every image in the sample and extracts the label. Not surprisingly, the model doesn’t perform as well on the rotated images. ```python theme={null} rot_model_result = vit_for_image_classification( t.image.rotate(90), model_id='farleyknight-org-username/vit-base-mnist', ) t.select(t.image, rot_label=rot_model_result.labels[0]).head(5) ```
Note that we employed a useful trick here: we assigned an expression to the variable `rot_model_result` for later reuse. Every Pixeltable expression is a Python object, so you can freely assign them to variables, reuse them, compose them, and so on. Remember that nothing actually happens until the expression is used in a query - so in this example, setting the variable `rot_model_result` doesn’t itself result in any data being retrieved; that only happens later, when we actually use it in the `select()` query. There are a large number of built-in UDFs that ship with Pixeltable; you can always refer back to the [SDK Documentation](/sdk/latest/) for details. ### Method Calls Many built-in UDFs allow a convenient alternate syntax. The following two expressions are exactly equivalent: ```python theme={null} a = t.image.rotate(90) b = pxt.functions.image.rotate(t.image, 90) ``` `a` and `b` can always be used interchangeably in queries, with identical results. Just like in standard Python classes, whenever Pixeltable sees the **method call** `t.image.rotate(90)`, it interprets it as a **function call** `pxt.functions.image.rotate(self, 90)`, with (in this case) `self` equal to `t.image`. Any method call can also be written as a function call, but (just like in standard Python) not every function call can be written as a method call. For example, the following won’t work: ```python theme={null} t.image.vit_for_image_classification( model_id='farleyknight-org-username/vit-base-mnist' ) ``` That’s because `vit_for_image_classification` is part of the `pxt.functions.huggingface` module, not the core module `pxt.functions.image`. Most Pixeltable types have a corresponding **core module** of UDFs that can be used as method calls (`pxt.functions.image` for `Image`; `pxt.functions.string` for `String`; and so on), described fully in the [SDK Documentation](/sdk/latest/). 
### Arithmetic and Boolean Operations Expressions can also be combined using standard arithmetic and boolean operators. As with everything else, arithmetic and boolean expressions are operations on columns that (when used in a query) are applied to every row. ```python theme={null} t.select(t.image, t.label, t.label == '4', t.label < '5').head(5) ```
When you use a `where` clause in a query, you’re giving it a Pixeltable expression, too (a boolean-valued one). ```python theme={null} t.where(t.label == '4').select(t.image).show() ```
The following example shows how boolean expressions can be assigned to variables and used to form more complex expressions. ```python theme={null} # Reuse `rot_model_result` from above, extracting # the dominant label as a new expression rot_label = rot_model_result.label_text[0] # Select all the rows where the ground truth label is '5', # and the "rotated" version of the model got it wrong # (by returning something other than a '5') t.where((t.label == '5') & (rot_label != '5')).select( t.image, t.label, rot_label=rot_label ).show() ```
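The `&` in that query works through Python operator overloading: boolean-valued expression objects implement the special methods `__and__`, `__or__`, and `__invert__`, so combining them builds a new expression instead of producing a plain `bool`. A toy sketch of the mechanism (not Pixeltable's implementation):

```python
class Predicate:
    """Toy boolean expression: combining predicates builds a tree instead of evaluating."""

    def __init__(self, desc: str):
        self.desc = desc

    def __and__(self, other: 'Predicate') -> 'Predicate':   # called for `p & q`
        return Predicate(f'({self.desc} AND {other.desc})')

    def __or__(self, other: 'Predicate') -> 'Predicate':    # called for `p | q`
        return Predicate(f'({self.desc} OR {other.desc})')

    def __invert__(self) -> 'Predicate':                    # called for `~p`
        return Predicate(f'(NOT {self.desc})')

p = Predicate("label == '5'")
q = Predicate("rot_label != '5'")
combined = (p & ~q).desc
# "(label == '5' AND (NOT rot_label != '5'))"
```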
Notice that to form a logical “and”, we wrote ```python theme={null} (t.label == '5') & (rot_label != '5') ``` using the operator `&` rather than `and`. Likewise, to form a logical “or”, we’d use `|` rather than `or`: ```python theme={null} (t.label == '5') | (rot_label != '5') ``` For logical negation: ```python theme={null} ~(t.label == '5') ``` This follows the convention used by other popular data-manipulation frameworks such as Pandas, and it’s necessary because the Python language does not allow the meanings of `and`, `or`, and `not` to be customized. There is one more instance of this to be aware of: to check whether an expression is `None`, it’s necessary to write (say) ```python theme={null} t.label == None ``` rather than `t.label is None`, for the same reason. ### Arrays In addition to lists and dicts, Pixeltable also has built-in support for numerical arrays. A typical place where arrays show up is as the output of an embedding. ```python theme={null} from pixeltable.functions.huggingface import clip # Add a computed column that computes a CLIP embedding for each image t.add_computed_column( clip=clip(t.image, model_id='openai/clip-vit-base-patch32') ) t.select(t.image, t.clip).head(5) ```
  Added 50 column values with 0 errors.
The underlying Python type of `pxt.Array` is an ordinary NumPy array (`np.ndarray`), so that an array-typed column is a column of NumPy arrays (in this example, representing the embedding output of each image in the table). As with lists, arrays can be sliced in all the usual ways. ```python theme={null} t.select(t.clip[0], t.clip[5:10], t.clip[-3:]).head(5) ```
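Because each cell of an array column holds a `np.ndarray`, these slicing expressions correspond directly to ordinary NumPy indexing applied to a single row's value:

```python
import numpy as np

emb = np.arange(512, dtype=np.float32)  # stand-in for one 512-dim embedding

first = emb[0]       # first component
middle = emb[5:10]   # components 5 through 9
tail = emb[-3:]      # last three components
```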
### Ad hoc UDFs with `apply` We’ve now seen the most commonly encountered Pixeltable expression types. There are a few other less commonly encountered expressions that are occasionally useful. You can use `apply` to map any Python function onto a column of data. You can think of `apply` as a quick way of constructing an “on-the-fly” UDF for one-off use. ```python theme={null} import numpy as np t.select(t.clip.apply(np.ndarray.dumps, col_type=pxt.String)).head(2) ``` Note, however, that if the function you’re `apply`ing doesn’t have type hints (as in the example here), you’ll need to specify the output column type explicitly. ### Type Conversion with `astype` Sometimes it’s useful to transform an expression of one type into a different type. For example, you can use `astype` to turn an expression of type `pxt.Json` into one of type `pxt.String`. This assumes that the value being converted is actually a string; otherwise, you’ll get an exception. Here’s an example: ```python theme={null} # Select the text in position 0 of `t.classification.label_text`; since # `t.classification.label_text` has type `pxt.Json`, so does # `t.classification.label_text[0]` t.classification.label_text[0].col_type ```
  Optional\[Json]
```python theme={null} # Select the text in position 0 of `t.classification.label_text`, this time # cast as a `pxt.String` t.classification.label_text[0].astype(pxt.String).col_type ```
  Optional\[String]
### Column Properties Some `ColumnRef` expressions have additional useful properties. A media column (image, video, audio, or document) has the following two properties: * `localpath`: the media location on the local filesystem * `fileurl`: the original URL where the media resides (could be the same as `localpath`) ```python theme={null} t.select(t.image, t.image.localpath).head(5) ```
Any computed column will have two additional properties, `errortype` and `errormsg`. These properties will usually be `None`. However, if the computed column was created with `on_error='ignore'` and an exception was encountered during column execution, then the properties will contain additional information about the exception. To demonstrate this feature, we’re going to deliberately trigger an exception in a computed column. The images in our example table are black and white, meaning they have only one color channel. If we try to extract a channel other than channel number `0`, we’ll get an exception. Ordinarily when we call `add_computed_column`, the exception is raised and the `add_computed_column` operation is aborted. ```python theme={null} t.add_computed_column(channel=t.image.getchannel(1)) ```
  Error: Error while evaluating computed column 'channel':
  band index out of range
  ---------------------------------------------------------------------------
  ValueError                                Traceback (most recent call last)
  File \~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/exec/expr\_eval/evaluators.py:225, in FnCallEvaluator.eval(self, call\_args\_batch)
      224 try:
  --> 225     item.row\[self.fn\_call.slot\_idx] = self.scalar\_py\_fn(\*item.args, \*\*item.kwargs)
      226 except Exception as exc:

  File /opt/miniconda3/envs/pxt/lib/python3.10/site-packages/PIL/Image.py:2682, in Image.getchannel(self, channel)
     2680         raise ValueError(msg) from e
  -> 2682 return self.\_new(self.im.getband(channel))

  ValueError: band index out of range

  The above exception was the direct cause of the following exception:

  Error                                     Traceback (most recent call last)
  Cell In\[27], line 1
  ----> 1 t.add\_computed\_column(channel=t.image.getchannel(1))

  File \~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/catalog/table.py:697, in Table.add\_computed\_column(self, stored, destination, print\_stats, on\_error, if\_exists, \*\*kwargs)
      695 self.\_verify\_column(new\_col)
      696 assert self.\_tbl\_version is not None
  --> 697 result += self.\_tbl\_version.get().add\_columns(\[new\_col], print\_stats=print\_stats, on\_error=on\_error)
      698 FileCache.get().emit\_eviction\_warnings()
      699 return result

  File \~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/catalog/table\_version.py:666, in TableVersion.add\_columns(self, cols, print\_stats, on\_error)
      664         all\_cols.append(undo\_col)
      665 # Add all columns
  --> 666 status = self.\_add\_columns(all\_cols, print\_stats=print\_stats, on\_error=on\_error)
      667 # Create indices and their md records
      668 for col, (idx, val\_col, undo\_col) in index\_cols.items():

  File \~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/catalog/table\_version.py:732, in TableVersion.\_add\_columns(self, cols, print\_stats, on\_error)
      730 plan.open()
      731 try:
  --> 732     excs\_per\_col = self.store\_tbl.load\_column(col, plan, on\_error == 'abort')
      733 except sql\_exc.DBAPIError as exc:
      734     Catalog.get().convert\_sql\_exc(exc, self.id, self.handle, convert\_db\_excs=True)

  File \~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/store.py:247, in StoreBase.load\_column(self, col, exec\_plan, abort\_on\_exc)
      245 if abort\_on\_exc and row.has\_exc():
      246     exc = row.get\_first\_exc()
  --> 247     raise excs.Error(f'Error while evaluating computed column \{col.name!r}:\n\{exc}') from exc
      248 table\_row, num\_row\_exc = row\_builder.create\_store\_table\_row(row, None, row.pk)
      249 num\_excs += num\_row\_exc

  Error: Error while evaluating computed column 'channel':
  band index out of range
But if we use `on_error='ignore'`, the exception will be logged in the column properties instead. ```python theme={null} t.add_computed_column(channel=t.image.getchannel(1), on_error='ignore') ```
  Added 50 column values with 50 errors.
  50 rows updated, 50 values computed, 50 exceptions.
Notice that the update status informs us that there were 50 errors. If we query the table, we see that the column contains only `None` values, but the `errortype` and `errormsg` fields contain details of the error. ```python theme={null} t.select( t.image, t.channel, t.channel.errortype, t.channel.errormsg ).head(5) ```
More details on Pixeltable’s error handling can be found in the [External Files](/platform/external-files) guide. ## The Pixeltable Type System We’ve seen that every column and every expression in Pixeltable has an associated **Pixeltable type**. In this section, we’ll briefly survey the various Pixeltable types and their uses. Here are all the supported types and their corresponding Python types:

| Pixeltable type | Python type |
| --------------- | ----------- |
| `pxt.String` | `str` |
| `pxt.Int` | `int` |
| `pxt.Float` | `float` |
| `pxt.Bool` | `bool` |
| `pxt.Timestamp` | `datetime.datetime` |
| `pxt.Date` | `datetime.date` |
| `pxt.Json` | `str`, `int`, `float`, `bool`, `list`, or `dict` |
| `pxt.Array` | `numpy.ndarray` |
| `pxt.Image` | `PIL.Image.Image` |
| `pxt.Video` | `str` |
| `pxt.Audio` | `str` |
| `pxt.Document` | `str` |
The Python type is what you’ll get back if you query an expression of the given Pixeltable type. For `pxt.Json`, it can be any of `str`, `int`, `float`, `bool`, `list`, or `dict`. `pxt.Audio`, `pxt.Video`, and `pxt.Document` all correspond to the Python type `str`. This is because those types are represented by file paths that reference the media in question. When you query for, say, `t.select(t.video_col)`, you’re guaranteed to get a file path on the local filesystem (Pixeltable will download and cache a local copy of the video if necessary to ensure this). If you want the original URL, use `t.video_col.fileurl` instead. Several types can be **specialized** to constrain the allowable data in a column. * `pxt.Image` can be specialized with a resolution and/or an image mode: * `pxt.Image[(300,200)]` - images with width 300 and height 200 * `pxt.Image['RGB']` - images with mode `'RGB'`; see the [PIL Documentation](https://pillow.readthedocs.io/en/stable/handbook/concepts.html) for the full list * `pxt.Image[(300,200), 'RGB']` - combines the above constraints * `pxt.Array` can be specialized with a shape and/or a dtype: * `pxt.Array[pxt.Float]` - arrays with dtype `pxt.Float` * `pxt.Array[(64,64,3), pxt.Float]` - 3-dimensional arrays with dtype `pxt.Float` and 64x64x3 shape If we look at the structure of our table now, we see examples of specialized image and array types. ```python theme={null} t ```
`t.clip` has type `pxt.Array[(512,), pxt.Float]`, since the output of the embedding is always a 512-dimensional vector. `t.channel` has type `Image['L']`, since it’s always an `'L'` mode (1-channel) image. You can freely use `pxt.Image` by itself to mean “any image, without constraints”, but numerical arrays must always specify a shape and a dtype; `pxt.Array` by itself will raise an error. Array shapes follow standard numpy conventions: a shape is a tuple of integers, such as `(512,)` or `(64,64,3)`. A `None` may be used in place of an integer to indicate an unconstrained size for that dimension, as in `(None,None,3)` (3-dimensional array with two unconstrained dimensions), or simply `(None,)` (unconstrained 1-dimensional array). # Tables and Data Operations Source: https://docs.pixeltable.com/tutorials/tables-and-data-operations Open in Kaggle  Open in Colab  Download Notebook This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links. This guide shows you how to: * Create and manage tables: Understand Pixeltable’s table structure, create and modify tables, and work with table schemas * Manipulate data: Insert, update, and delete data within tables, and retrieve data from tables into Python variables * Filter and select data: Use `where()`, `select()`, and `order_by()` to query for specific rows and columns * Import data from CSV files and other file types First, let’s ensure the Pixeltable library is installed in your environment. ```python theme={null} %pip install -qU pixeltable ``` ### Tables All data in Pixeltable is stored in tables. At a high level, a Pixeltable table behaves similarly to an ordinary SQL database table, but with many additional capabilities to support complex AI workflows. 
We’ll introduce those advanced capabilities gradually throughout this tutorial; in this section, the focus is on basic table and data operations. Tables in Pixeltable are grouped into **directories**, which are simply user-defined namespaces. The following command creates a new directory, `fundamentals`, which we’ll use to store the tables in our tutorial. ```python theme={null} import pixeltable as pxt # First we delete the `fundamentals` directory and all its contents (if # it exists), in order to ensure a clean environment for the tutorial. pxt.drop_dir('fundamentals', force=True) # Now we create the directory. pxt.create_dir('fundamentals') ```
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
  Created directory 'fundamentals'.
Now let’s create our first table. To create a table, we must give it a name and a **schema** that describes the table structure. Note that prefacing the name with `fundamentals` causes it to be placed in our newly-created directory. ```python theme={null} films_t = pxt.create_table( 'fundamentals/films', {'film_name': pxt.String, 'year': pxt.Int, 'revenue': pxt.Float}, ) ```
  Created table 'films'.
To insert data into a table, we use the `insert()` method, passing it a list of Python dicts. ```python theme={null} films_t.insert( [ {'film_name': 'Jurassic Park', 'year': 1993, 'revenue': 1037.5}, {'film_name': 'Titanic', 'year': 1997, 'revenue': 2257.8}, { 'film_name': 'Avengers: Endgame', 'year': 2019, 'revenue': 2797.5, }, ] ) ```
  Inserting rows into \`films\`: 3 rows \[00:00, 572.84 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 3 values computed.
If you’re inserting just a single row, you can use an alternate keyword-argument syntax that is sometimes more convenient. ```python theme={null} films_t.insert(film_name='Inside Out 2', year=2024, revenue=1462.7) ```
  Inserting rows into \`films\`: 1 rows \[00:00, 318.76 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 1 value computed.
We can peek at the data in our table with the `collect()` method, which retrieves all the rows in the table. ```python theme={null} films_t.collect() ```
Pixeltable also provides `update()` and `delete()` methods for modifying and removing data from a table; we’ll see examples of them shortly. ### Filtering and Selecting Data Often you want to select only certain rows and/or certain columns in a table. You can do this with the `where()` and `select()` methods. ```python theme={null} films_t.where(films_t.revenue >= 2000.0).collect() ```
```python theme={null} films_t.select(films_t.film_name, films_t.year).collect() ```
Note the expressions that appear inside the calls to `where()` and `select()`, such as `films_t.year`. These are **column references** that point to specific columns within a table. In place of `films_t.year`, you can also use dictionary syntax and type `films_t['year']`, which means exactly the same thing but is sometimes more convenient. ```python theme={null} films_t.select(films_t['film_name'], films_t['year']).collect() ```
In addition to selecting columns directly, you can use column references inside various kinds of expressions. For example, our `revenue` numbers are given in millions of dollars. Let’s say we wanted to select revenue in thousands of dollars instead; we could do that as follows: ```python theme={null} films_t.select(films_t.film_name, films_t.revenue * 1000).collect() ```
Note that since we selected an abstract expression rather than a specific column, Pixeltable gave it the generic name `col_1`. You can assign it a more informative name with Python keyword syntax: ```python theme={null} films_t.select( films_t.film_name, revenue_thousands=films_t.revenue * 1000 ).collect() ```
### Tables are Persistent This is a good time to mention a few key differences between Pixeltable tables and other familiar datastructures, such as Python dicts or Pandas dataframes. First, **Pixeltable is persistent. Unlike in-memory Python libraries such as Pandas, Pixeltable is a database**. When you reset a notebook kernel or start a new Python session, you’ll have access to all the data you’ve stored previously in Pixeltable. Let’s demonstrate this by using the IPython `%reset -f` command to clear out all our notebook variables, so that `films_t` is no longer defined. ```python theme={null} %reset -f films_t.collect() # Throws an exception now ```
  NameError: name 'films\_t' is not defined
  \[0;31m---------------------------------------------------------------------------\[0m
  \[0;31mNameError\[0m                                 Traceback (most recent call last)
  Cell \[0;32mIn\[11], line 2\[0m
  \[1;32m      1\[0m get\_ipython()\[38;5;241m.\[39mrun\_line\_magic(\[38;5;124m'\[39m\[38;5;124mreset\[39m\[38;5;124m'\[39m, \[38;5;124m'\[39m\[38;5;124m-f\[39m\[38;5;124m'\[39m)
  \[0;32m----> 2\[0m \[43mfilms\_t\[49m\[38;5;241m.\[39mcollect()  \[38;5;66;03m# Throws an exception now\[39;00m

  \[0;31mNameError\[0m: name 'films\_t' is not defined
The `films_t` variable (along with all other variables in our Python session) has been cleared out - but that’s ok, because it wasn’t the source of record for our data. The `films_t` variable is just a reference to the underlying database table. We can recover it with the `get_table` command, referencing the `films` table by name. ```python theme={null} import pixeltable as pxt films_t = pxt.get_table('fundamentals/films') films_t.collect() ```
You can always get a list of existing tables with the Pixeltable `pxt.ls()` command. Let’s use it to see the contents of the `fundamentals` directory. ```python theme={null} pxt.ls(path='fundamentals') ```
Note that if you’re running Pixeltable on colab or kaggle, the database will persist only for as long as your colab/kaggle session remains active. If you’re running it locally or on your own server, then your database will persist indefinitely (until you actively delete it). ### Tables are Typed The second major difference is that **Pixeltable is strongly typed**. Because Pixeltable is a database, every column has a data type: that’s why we specified `String`, `Int`, and `Float` for the three columns when we created the table. These **type specifiers** are *mandatory* when creating tables, and they become part of the table schema. You can always see the table schema with the `describe()` method. ```python theme={null} films_t.describe() ```
In a notebook, you can also just type `films_t` to see the schema; its output is identical to `films_t.describe()`. ```python theme={null} films_t ```
In addition to String, Int, and Float, Pixeltable provides several additional data types:
  • Bool, whose values are True or False;
  • Array for numerical arrays;
  • Json, for lists or dicts that correspond to valid JSON structures; and
  • The media types Image, Video, Audio, and Document.
  • We’ll see examples of each of these types later in this guide. Besides the column names and types, there’s a third element to the schema, `Computed With`. To learn more about this, see the [Computed Columns](/tutorials/computed-columns) guide. All of the methods we’ve discussed so far, such as `insert()` and `get_table()`, are documented in the [Pixeltable SDK](/sdk/latest/) Documentation. The following pages are particularly relevant: * [pixeltable](/sdk/latest/pixeltable) package reference * [pxt.Table](/sdk/latest/table) class reference ### A Real-World Example: Earthquake Data Now let’s dive a little deeper into Pixeltable’s data operations. To showcase all the features, it’ll be helpful to have a real-world dataset, rather than our toy dataset with four movies. The dataset we’ll be using consists of Earthquake data drawn from the US Geological Survey: all recorded Earthquakes that occurred within 100 km of Seattle, Washington, between January 1, 2023 and June 30, 2024. The dataset is in CSV format, and we can load it into Pixeltable by using `create_table()` with the `source` parameter, which creates a new Pixeltable table from the contents of a CSV file. ```python theme={null} eq_t = pxt.create_table( 'fundamentals/earthquakes', # Name for the new table source='https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/earthquakes.csv', primary_key='id', # Column 'id' is the primary key schema_overrides={ 'timestamp': pxt.Timestamp }, # Interpret column 3 as a timestamp ) ```
      Created table 'earthquakes'.
      Inserting rows into \`earthquakes\`: 1823 rows \[00:00, 19554.24 rows/s]
      Inserted 1823 rows with 0 errors.
    
    In Pixeltable, you can always import external data by giving a URL instead of a local file path. This applies to CSV datasets, media files (such images and video), and other types of content. The URL will often be an http\:// URL, but it can also be an s3:// URL referencing an S3 bucket. Pixeltable’s create\_table() function with the source parameter can import data from various formats including CSV, Excel, and Hugging Face datasets. You can also use source to import from a Pandas dataframe. For more details, see the pixeltable.io package reference. Let’s have a peek at our new dataset. The dataset contains 1823 rows, and we probably don’t want to display them all at once. We can limit our query to fewer rows with the `limit()` method. ```python theme={null} eq_t.limit(5).collect() ```
    A different way of achieving something similar is to use the `head()` and `tail()` methods. Pixeltable keeps track of the insertion order of all its data, and `head()` and `tail()` will always return the *earliest inserted* and *most recently inserted* rows in a table, respectively. ```python theme={null} eq_t.head(5) ```
    ```python theme={null} eq_t.tail(5) ```
    head(n) and limit(n).collect() appear similar in this example. But head() always returns the earliest rows in a table, whereas limit() makes no promises about the ordering of its results (unless you specify an order\_by() clause - more on this below). Let’s also peek at the schema: ```python theme={null} eq_t.describe() ```
    Note that while specifying a schema is mandatory when *creating* a table, it’s not always required when *importing* data. This is because Pixeltable uses the structure of the imported data to infer the column types, when feasible. You can always override the inferred column types with the `schema_overrides` parameter of `import_csv()`. The following examples showcase some common data operations. ```python theme={null} eq_t.count() # Number of rows in the table ```
      1823
    
    ```python theme={null} # 5 highest-magnitude earthquakes eq_t.order_by(eq_t.magnitude, asc=False).limit(5).collect() ```
    ```python theme={null} from datetime import datetime # 5 highest-magnitude earthquakes in Q3 2023 eq_t.where( (eq_t.timestamp >= datetime(2023, 6, 1)) & (eq_t.timestamp < datetime(2023, 10, 1)) ).order_by(eq_t.magnitude, asc=False).limit(5).collect() ```
    Note that Pixeltable uses Pandas-like operators for filtering data: the expression ```python theme={null} (eq_t.timestamp >= datetime(2023, 6, 1)) & (eq_t.timestamp < datetime(2023, 10, 1)) ``` means *both* conditions must be true; similarly (say), ```python theme={null} (eq_t.timestamp < datetime(2023, 6, 1)) | (eq_t.timestamp >= datetime(2023, 10, 1)) ``` would mean *either* condition must be true. You can also use the special `isin` operator to select just those values that appear within a particular list: ```python theme={null} # Earthquakes with specific ids eq_t.where(eq_t.id.isin([123, 456, 789])).collect() ```
    In addition to basic operators like `>=` and `isin`, a Pixeltable `where` clause can also contain more complex operations. For example, the `location` column in our dataset is a string that contains a lot of information, but in a relatively unstructured way. Suppose we wanted to see all Earthquakes in the vicinity of Rainier, Washington; one way to do this is with the `contains()` method: ```python theme={null} # All earthquakes in the vicinity of Rainier eq_t.where(eq_t.location.contains('Rainier')).collect() ```
    Pixeltable also supports various **aggregators**; here’s an example showcasing two fairly simple ones, `max()` and `min()`: ```python theme={null} # Min and max ids eq_t.select( min=pxt.functions.min(eq_t.id), max=pxt.functions.max(eq_t.id) ).collect() ```
    To learn more about Pixeltable functions and expressions, see the [Computed Columns](/tutorials/computed-columns) guide. They’re also exhaustively documented in the [Pixeltable SDK Documentation](/sdk/latest). ### Extracting Data from Tables into Python/Pandas Sometimes it’s handy to pull out data from a table into a Python object. We’ve actually already done this; the call to `collect()` returns an in-memory result set, which we can then dereference in various ways. For example: ```python theme={null} result = eq_t.limit(5).collect() result[0] # Get the first row of the results as a dict ```
      \{'id': 0,
       'magnitude': 1.15,
       'location': '10 km NW of Belfair, Washington',
       'timestamp': datetime.datetime(2023, 1, 1, 8, 10, 37, 50000, tzinfo=zoneinfo.ZoneInfo(key='America/Los\_Angeles')),
       'longitude': -122.93,
       'latitude': 47.51}
    
    ```python theme={null} result[ 'timestamp' ] # Get a list of the `timestamp` field of all the rows that were queried ```
      \[datetime.datetime(2023, 1, 1, 8, 10, 37, 50000, tzinfo=zoneinfo.ZoneInfo(key='America/Los\_Angeles')),
       datetime.datetime(2023, 1, 2, 1, 2, 43, 950000, tzinfo=zoneinfo.ZoneInfo(key='America/Los\_Angeles')),
       datetime.datetime(2023, 1, 2, 12, 5, 1, 420000, tzinfo=zoneinfo.ZoneInfo(key='America/Los\_Angeles')),
       datetime.datetime(2023, 1, 2, 12, 45, 14, 220000, tzinfo=zoneinfo.ZoneInfo(key='America/Los\_Angeles')),
       datetime.datetime(2023, 1, 2, 13, 19, 27, 200000, tzinfo=zoneinfo.ZoneInfo(key='America/Los\_Angeles'))]
    
    ```python theme={null} df = result.to_pandas() # Convert the result set into a Pandas dataframe df['magnitude'].describe() ```
      count    5.000000
      mean     0.744000
      std      0.587988
      min      0.200000
      25%      0.290000
      50%      0.520000
      75%      1.150000
      max      1.560000
      Name: magnitude, dtype: float64
    
    `collect()` without a preceding `limit()` returns the entire contents of a query or table. Be careful! For very large tables, this could result in out-of-memory errors. In this example, the 1823 rows in the table fit comfortably into a dataframe. ```python theme={null} df = eq_t.collect().to_pandas() df['magnitude'].describe() ```
      count    1823.000000
      mean        0.900378
      std         0.625492
      min        -0.830000
      25%         0.420000
      50%         0.850000
      75%         1.310000
      max         4.300000
      Name: magnitude, dtype: float64
    
    ### Adding Columns Like other database tables, Pixeltable tables aren’t fixed entities: they’re meant to evolve over time. Suppose we want to add a new column to hold user-specified comments about particular earthquake events. We can do this with the `add_column()` method: ```python theme={null} eq_t.add_column(note=pxt.String) ```
      Added 1823 column values with 0 errors.
      1823 rows updated, 1823 values computed.
    
    Here, `note` is the column name, and `pxt.String` specifies the type of the new column. ```python theme={null} eq_t.add_column(contact_email=pxt.String) ```
      Added 1823 column values with 0 errors.
      1823 rows updated, 1823 values computed.
    
    Let’s have a look at the revised schema. ```python theme={null} eq_t.describe() ```
    ### Updating Rows in a Table Table rows can be modified and deleted with the SQL-like `update()` and `delete()` commands. ```python theme={null} # Add a comment to records with IDs 123 and 127 ( eq_t.where(eq_t.id.isin([121, 123])).update( { 'note': 'Still investigating.', 'contact_email': 'contact@pixeltable.com', } ) ) ```
      Inserting rows into \`earthquakes\`: 2 rows \[00:00, 366.84 rows/s]
      2 rows updated, 4 values computed.
    
    ```python theme={null} eq_t.where(eq_t.id >= 120).select( eq_t.id, eq_t.magnitude, eq_t.note, eq_t.contact_email ).head(5) ```
    `update()` can also accept an expression, rather than a constant value. For example, suppose we wanted to shorten the location strings by replacing every occurrence of `Washington` with `WA`. One way to do this is with an `update()` clause, using a Pixeltable expression with the `replace()` method. ```python theme={null} eq_t.update({'location': eq_t.location.replace('Washington', 'WA')}) ```
      Inserting rows into \`earthquakes\`: 1823 rows \[00:00, 21494.07 rows/s]
      1823 rows updated, 1823 values computed.
    
    ```python theme={null} eq_t.head(5) ```
    Notice that in all cases, the `update()` clause takes a Python dictionary, but its values can be either constants such as `'contact@pixeltable.com'`, or more complex expressions such as `eq_t.location.replace('Washington', 'WA')`. Also notice that if `update()` appears without a `where()` clause, then every row in the table will be updated, as in the preceding example. ### Batch Updates The `batch_update()` method provides an alternative way to update multiple rows with different values. With a `batch_update()`, the contents of each row are specified by individual `dict`s, rather than according to a formula. Here’s a toy example that shows `batch_update()` in action. ```python theme={null} updates = [ {'id': 500, 'note': 'This is an example note.'}, {'id': 501, 'note': 'This is a different note.'}, {'id': 502, 'note': 'A third note, unrelated to the others.'}, ] eq_t.batch_update(updates) ```
      Inserting rows into \`earthquakes\`: 3 rows \[00:00, 984.58 rows/s]
      3 rows updated, 3 values computed.
    
    ```python theme={null} eq_t.where(eq_t.id >= 500).select( eq_t.id, eq_t.magnitude, eq_t.note, eq_t.contact_email ).head(5) ```
    ### Deleting Rows To delete rows from a table, use the `delete()` method. ```python theme={null} # Delete all rows in 2024 eq_t.where(eq_t.timestamp >= datetime(2024, 1, 1)).delete() ```
      587 rows deleted.
    
    ```python theme={null} eq_t.count() # How many are left after deleting? ```
      1236
    
    Don’t forget to specify a `where()` clause when using `delete()`! If you run `delete()` without a `where()` clause, the entire contents of the table will be deleted. ```python theme={null} eq_t.delete() ```
      1236 rows deleted.
    
    ```python theme={null} eq_t.count() ```
      0
    
    ### Table Versioning Every table in Pixeltable is versioned: some or all of its modification history is preserved. We’ve seen a reference to this already; `pxt.ls()` will show the most recent version along with each table it lists. ```python theme={null} pxt.ls('fundamentals') ```
    To see the version history of a particular table: ```python theme={null} eq_t.history() ```
    If you ever make a mistake, you can always call `revert()` to undo the most recent change to a table and roll back to the previous version. Let’s try it out: we’ll use it to revert the successive `delete()` calls that we just executed. ```python theme={null} eq_t.revert() ``` ```python theme={null} eq_t.count() ```
      1236
    
    ```python theme={null} eq_t.revert() ``` ```python theme={null} eq_t.count() ```
      1823
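One way to picture this behavior is as a stack of table versions: each mutation pushes a new version, and `revert()` pops back to the previous one. A minimal plain-Python sketch (the row counts mirror the deletes and reverts above):

```python
# Each entry is the table's row count at one version.
versions = [1823]        # initial state
versions.append(1236)    # after deleting the 2024 rows
versions.append(0)       # after deleting everything

def revert(history):
    """Drop the most recent version (this itself cannot be undone)."""
    if len(history) > 1:
        history.pop()

revert(versions)
print(versions[-1])  # 1236
revert(versions)
print(versions[-1])  # 1823
```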
    
    Be aware: calling `revert()` cannot be undone! ### Multimodal Data In addition to the structured data we’ve been exploring so far, Pixeltable has native support for **media types**: images, video, audio, and unstructured documents such as PDFs. Media support is one of Pixeltable’s core capabilities. Here’s an example showing how media data lives side-by-side with structured data in Pixeltable. ```python theme={null} # Add a new column of type `Image` eq_t.add_column(map_image=pxt.Image) eq_t.describe() ```
      Added 1823 column values with 0 errors.
    
    ```python theme={null} # Update the row with id == 1002, adding an image to the `map_image` column eq_t.where(eq_t.id == 1002).update( { 'map_image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/port-townsend-map.jpeg' } ) ```
      Inserting rows into \`earthquakes\`: 1 rows \[00:00, 192.79 rows/s]
      1 row updated, 1 value computed.
    
    Note that in Pixeltable, you can always insert images into a table by giving the file path or URL of the image (as a string). It’s not necessary to load the image first; Pixeltable will manage the loading and caching of images in the background. The same applies to other media data such as documents and videos. Pixeltable will also embed image thumbnails in your notebook when you do a query: ```python theme={null} eq_t.where(eq_t.id >= 1000).select( eq_t.id, eq_t.magnitude, eq_t.location, eq_t.map_image ).head(5) ```
    ### Directory Hierarchies So far we’ve only seen an example of a single directory with a table inside it, but one can also put directories inside other directories, in whatever fashion makes the most sense for a given application. ```python theme={null} pxt.create_dir('fundamentals/subdir') pxt.create_dir('fundamentals/subdir/subsubdir') pxt.create_table( 'fundamentals/subdir/subsubdir/my_table', {'my_col': pxt.String} ) ```
      Created directory 'fundamentals/subdir'.
      Created directory 'fundamentals/subdir/subsubdir'.
      Created table 'my\_table'.
    
    ### Deleting Columns, Tables, and Directories `drop_column()`, `drop_table()`, and `drop_dir()` are used to delete columns, tables, and directories, respectively. ```python theme={null} # Delete the `contact_email` column eq_t.drop_column('contact_email') ``` ```python theme={null} eq_t.describe() ```
    ```python theme={null} # Delete the entire table (cannot be reverted!) pxt.drop_table('fundamentals/earthquakes') ``` ```python theme={null} # Delete the entire directory and all its contents, including any nested # subdirectories (cannot be reverted) pxt.drop_dir('fundamentals', force=True) ``` ## Next Steps Learn more about working with Pixeltable: * [Computed Columns](/tutorials/computed-columns) * [Queries and Expressions](/tutorials/queries-and-expressions) # Agents & MCP Source: https://docs.pixeltable.com/use-cases/agents-mcp Build AI agents with tool calling, persistent memory, and MCP server integration **Who:** Agent Builders, AI Engineers\ **Output:** Autonomous AI agents with memory and tool use Build AI agents that can call tools, remember context, and integrate with MCP servers—all backed by Pixeltable's persistent storage and orchestration. **Declarative Agents:** Instead of imperative control flow, define your agent as a table with computed columns. Each row is a user query; computed columns define the reasoning chain (tool selection → execution → context retrieval → response). Pixeltable handles orchestration, caching, and persistence automatically. 
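A minimal plain-Python sketch of this declarative idea (not Pixeltable’s implementation, and the column names here are hypothetical): each "computed column" is a function of the columns before it, evaluated automatically when a row is inserted:

```python
# Hypothetical two-step "reasoning chain" as ordered computed columns:
# each function sees the row's earlier columns and adds its own.
computed_columns = {
    'tool': lambda row: 'web_search' if '?' in row['prompt'] else 'none',
    'answer': lambda row: f"(via {row['tool']}) {row['prompt']}",
}

table = []

def insert(row):
    # Evaluate each computed column in definition order, like a pipeline.
    for name, fn in computed_columns.items():
        row[name] = fn(row)
    table.append(row)

insert({'prompt': 'What is Pixeltable?'})
print(table[0]['answer'])  # (via web_search) What is Pixeltable?
```

In Pixeltable the same idea is expressed with `add_computed_column()`, and the engine additionally handles caching, persistence, and incremental recomputation.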
*** ## Agent Capabilities Register UDFs and queries as tools that LLMs can invoke Store conversation history and retrieved context in tables Connect to Model Context Protocol servers for external tools Semantic search over documents, images, and more *** ## Data Lifecycle Wrap any Python code as `@pxt.udf` tools—API calls, web scraping, database queries ```python theme={null} import os import pixeltable as pxt import requests import yfinance as yf @pxt.udf def get_latest_news(topic: str) -> str: """Fetch latest news using NewsAPI.""" response = requests.get( "https://newsapi.org/v2/everything", params={"q": topic, "apiKey": os.environ["NEWS_API_KEY"]} ) articles = response.json().get("articles", [])[:3] return "\n".join(f"- {a['title']}" for a in articles) @pxt.udf def fetch_financial_data(ticker: str) -> str: """Fetch stock data using yfinance.""" stock = yf.Ticker(ticker) info = stock.info return f"{info['shortName']}: ${info['currentPrice']}" ``` Writing custom functions Turn semantic search into callable tools with `@pxt.query` ```python theme={null} @pxt.query def search_documents(query_text: str, user_id: str): """Search documents by semantic similarity.""" sim = chunks.text.similarity(query_text) return ( chunks.where((chunks.user_id == user_id) & (sim > 0.5)) .order_by(sim, asc=False) .select(chunks.text, source_doc=chunks.document, sim=sim) .limit(20) ) @pxt.query def search_video_transcripts(query_text: str): """Search video transcripts by text.""" sim = transcript_sentences.text.similarity(query_text) return ( transcript_sentences.where(sim > 0.7) .order_by(sim, asc=False) .select(transcript_sentences.text, source_video=transcript_sentences.video) .limit(20) ) ``` Combine UDFs, queries, and MCP tools into a single registry [`pxt.tools()`](/howto/cookbooks/agents/llm-tool-calling) ```python theme={null} # Register tools from multiple sources tools = pxt.tools( # UDFs - External API Calls get_latest_news, fetch_financial_data, # Query Functions - Agentic RAG
search_documents, search_video_transcripts, ) ``` Complete tool calling walkthrough Define the workflow as a table with computed columns ```python theme={null} # Main workflow table - rows trigger the agent pipeline agent = pxt.create_table('agents.workflow', { 'prompt': pxt.String, 'timestamp': pxt.Timestamp, 'user_id': pxt.String, 'system_prompt': pxt.String, 'max_tokens': pxt.Int, 'temperature': pxt.Float, }) ``` First LLM call decides which tool to use ```python theme={null} from pixeltable.functions.anthropic import messages, invoke_tools # Step 1: LLM selects which tool to call agent.add_computed_column( initial_response=messages( model='claude-sonnet-4-20250514', messages=[{'role': 'user', 'content': agent.prompt}], max_tokens=agent.max_tokens, tools=tools, # Available tools tool_choice=tools.choice(required=True), # Force tool selection model_kwargs={'system': agent.system_prompt} ) ) ``` Pixeltable executes the selected tool automatically [`invoke_tools()`](/howto/cookbooks/agents/llm-tool-calling) ```python theme={null} # Step 2: Execute the tool the LLM chose agent.add_computed_column( tool_output=invoke_tools(tools, agent.initial_response) ) ``` Combine tool output with retrieved context ```python theme={null} # Parallel context retrieval (Pixeltable handles this) agent.add_computed_column(doc_context=search_documents(agent.prompt, agent.user_id)) agent.add_computed_column(image_context=search_images(agent.prompt, agent.user_id)) agent.add_computed_column(memory_context=search_memory(agent.prompt, agent.user_id)) # Assemble everything into final context agent.add_computed_column( final_context=assemble_context( agent.prompt, agent.tool_output, agent.doc_context, agent.memory_context, ) ) ``` Second LLM call generates the answer with full context ```python theme={null} # Step 3: Generate final answer with all context agent.add_computed_column( final_response=messages( model='claude-sonnet-4-20250514', messages=agent.final_context, 
max_tokens=agent.max_tokens, model_kwargs={'system': agent.system_prompt} ) ) # Extract answer text agent.add_computed_column( answer=agent.final_response.content[0].text ) ``` Complete walkthrough Load tools from any MCP-compatible server [`pxt.mcp_udfs()`](/howto/cookbooks/agents/llm-tool-calling) ```python theme={null} # Load tools from MCP server mcp_tools = pxt.mcp_udfs('http://localhost:8000/mcp') # Combine with local tools all_tools = pxt.tools( get_latest_news, fetch_financial_data, search_documents, *mcp_tools # Add MCP tools ) ``` MCP server for Claude, Cursor, and AI IDEs Expose Pixeltable tables as MCP tools for AI IDEs ```python theme={null} # Example: JFK Files MCP Server # Exposes document search to Claude Desktop, Cursor, etc. from mcp.server import Server import pixeltable as pxt server = Server("jfk-files") @server.tool() def search_jfk_documents(query: str) -> str: """Search declassified JFK documents.""" docs = pxt.get_table('jfk.documents') sim = docs.content.similarity(query) results = docs.order_by(sim, asc=False).limit(5).collect() return "\n".join(r['content'] for r in results) ``` Example MCP server with document search Store conversation turns with semantic search ```python theme={null} # Chat history with embedding index chat_history = pxt.create_table('agents.chat_history', { 'role': pxt.String, # 'user' or 'assistant' 'content': pxt.String, 'timestamp': pxt.Timestamp, 'user_id': pxt.String }) chat_history.add_embedding_index( 'content', string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2') ) # Recent history query @pxt.query def get_recent_chat_history(user_id: str, limit: int = 4): return ( chat_history.where(chat_history.user_id == user_id) .order_by(chat_history.timestamp, asc=False) .select(role=chat_history.role, content=chat_history.content) .limit(limit) ) # Semantic search over all history @pxt.query def search_chat_history(query_text: str, user_id: str): sim = chat_history.content.similarity(query_text) return ( 
chat_history.where((chat_history.user_id == user_id) & (sim > 0.8)) .order_by(sim, asc=False) .select(role=chat_history.role, content=chat_history.content, sim=sim) .limit(10) ) ``` Persistent conversation context Store user-saved snippets (code, text, facts) for recall ```python theme={null} # Selective memory - things the user explicitly saves memory_bank = pxt.create_table('agents.memory_bank', { 'content': pxt.String, 'type': pxt.String, # 'code', 'text', 'fact' 'language': pxt.String, # For code: 'python', 'javascript', etc. 'context_query': pxt.String, # What triggered this save 'timestamp': pxt.Timestamp, 'user_id': pxt.String }) memory_bank.add_embedding_index('content', string_embed=embed_fn) @pxt.query def search_memory(query_text: str, user_id: str): sim = memory_bank.content.similarity(query_text) return ( memory_bank.where((memory_bank.user_id == user_id) & (sim > 0.8)) .order_by(sim, asc=False) .select( content=memory_bank.content, type=memory_bank.type, language=memory_bank.language, context_query=memory_bank.context_query, ) .limit(10) ) ``` Index documents, images, video, and audio for retrieval ```python theme={null} # Documents with chunking documents = pxt.create_table('agents.collection', { 'document': pxt.Document, 'uuid': pxt.String, 'user_id': pxt.String }) chunks = pxt.create_view('agents.chunks', documents, iterator=DocumentSplitter.create( document=documents.document, separators='paragraph', metadata='title, heading, page' ) ) chunks.add_embedding_index('text', string_embed=embed_fn) # Images with CLIP images = pxt.create_table('agents.images', { 'image': pxt.Image, 'user_id': pxt.String }) images.add_embedding_index('image', embedding=clip.using(model_id='openai/clip-vit-large-patch14')) # Video frames videos = pxt.create_table('agents.videos', {'video': pxt.Video, 'user_id': pxt.String}) video_frames = pxt.create_view('agents.video_frames', videos, iterator=FrameIterator.create(video=videos.video, fps=1) ) 
video_frames.add_embedding_index('frame', embedding=clip.using(model_id='openai/clip-vit-large-patch14')) ``` Document retrieval patterns Expose your agent via HTTP API ```python theme={null} from flask import Flask, request from datetime import datetime import pixeltable as pxt app = Flask(__name__) agent = pxt.get_table('agents.workflow') chat_history = pxt.get_table('agents.chat_history') @app.route("/chat", methods=["POST"]) def chat(): data = request.json user_id = data["user_id"] prompt = data["message"] # Store user message chat_history.insert([{ "role": "user", "content": prompt, "timestamp": datetime.now(), "user_id": user_id }]) # Trigger agent workflow (computed columns run automatically) agent.insert([{ "prompt": prompt, "timestamp": datetime.now(), "user_id": user_id, "system_prompt": "You are a helpful assistant.", "max_tokens": 1024, "temperature": 0.7, }]) # Get the answer (already computed) result = agent.order_by(agent.timestamp, asc=False).limit(1).collect() answer = result[0]["answer"] # Store assistant response chat_history.insert([{ "role": "assistant", "content": answer, "timestamp": datetime.now(), "user_id": user_id }]) return {"response": answer} ``` Production deployment patterns One-command deployment with `pxt serve` and `pxt deploy` Learn about upcoming Endpoints and Live Tables *** ## Built with Pixeltable Multimodal AI agent with infinite memory, file search, and image generation Lightweight agent framework with built-in memory and tool orchestration Persistent memory layer for AI applications Model Context Protocol server for Claude, Cursor, and AI IDEs *** ## Related Cookbooks Complete guide to `pxt.tools()` and `invoke_tools()` Persistent conversation context patterns Retrieval-augmented generation workflow Use tables as callable functions # Backend for AI Apps Source: https://docs.pixeltable.com/use-cases/ai-applications Build pipelines that add multimodal intelligence to applications **Who:** AI/App Developers **Output:** 
AI-powered application Add multimodal intelligence to applications with two deployment patterns. **Same foundation, different intent:** This workflow uses the same Pixeltable capabilities as [Data Wrangling for ML](/use-cases/ml-data-wrangling) — tables, multimodal types, computed columns, iterators. The difference is the output: training datasets vs. live application intelligence. *** ## Data Lifecycle Define schema with native multimodal types — Pixeltable handles storage and references [`create_table()`](/tutorials/tables-and-data-operations), [`pxt.Image`](/platform/type-system), [`pxt.Video`](/platform/type-system), [`pxt.Audio`](/platform/type-system), [`pxt.Document`](/platform/type-system), [`pxt.Json`](/platform/type-system) ```python theme={null} import pixeltable as pxt # Native multimodal types t = pxt.create_table('app.docs', { 'pdf': pxt.Document, 'metadata': pxt.Json }) ``` Create tables and manage data Image, Video, Audio, Document, JSON & more Load from any source — local files, URLs, cloud storage, or databases [`insert()`](/tutorials/tables-and-data-operations), [`import_csv()`](/sdk/latest/io), [S3/GCS/Azure](/integrations/cloud-storage) ```python theme={null} # Insert with URLs, local paths, or direct upload t.insert([ {'pdf': 'https://example.com/report.pdf'}, {'pdf': '/local/path/to/doc.pdf'}, {'pdf': 's3://bucket/documents/spec.pdf'} ]) ``` Load from cloud storage S3, GCS, Azure, R2 configuration Create UDFs and computed columns — they auto-update when data changes [`@pxt.udf`](/platform/udfs-in-pixeltable), [`@pxt.query`](/platform/udfs-in-pixeltable), [`add_computed_column()`](/tutorials/computed-columns) Write custom functions in Python Auto-update derived data Extract frames, transcribe audio, chunk documents [`FrameIterator`](/platform/iterators), [`DocumentSplitter`](/platform/iterators), [`AudioSplitter`](/platform/iterators) Process video into searchable frames Audio to text with Whisper Add embedding indexes with **incremental 
sync** — only new/changed rows are embedded

[`add_embedding_index()`](/platform/embedding-indexes)

```python theme={null}
# Add index once — auto-updates on insert
docs.add_embedding_index('content', string_embed=e5_embed)
```

Configure and query indexes

Use OpenAI embedding models

Define `@pxt.query` functions that return data from your tables

[`@pxt.query`](/platform/udfs-in-pixeltable)

```python theme={null}
import PIL.Image

@pxt.query
def get_image(image_id: str) -> PIL.Image.Image:
    return (
        images.where(images.uuid == image_id)
        .select(images.image)
        .limit(1)
    )

# Use in computed columns or API endpoints
t.add_computed_column(thumbnail=get_image(t.image_id))
```

Reusable parameterized queries

Find relevant content by meaning, not keywords

[`.similarity()`](/platform/embedding-indexes), `.order_by()`, `.where()`, `.collect()`

```python theme={null}
sim = images.image.similarity(query)
results = images.order_by(sim, asc=False).select(
    uuid=images.uuid,
    url=images.image.fileurl
).limit(10).collect()
```

Search documents by meaning

Find visually similar images

Expose Pixeltable functions as LLM tools for agents

[`pxt.tools()`](/howto/cookbooks/agents/llm-tool-calling), [`invoke_tools()`](/howto/cookbooks/agents/llm-tool-calling)

LLM agents with function calling

Persistent conversation context

Integrate with Flask, FastAPI, or any Python web framework

`pxt.get_table()`, `.insert()`, `.select()`, `.collect()`

```python theme={null}
from flask import Flask, request

import pixeltable as pxt

app = Flask(__name__)
images = pxt.get_table("app.images")

@app.route("/api/search", methods=["POST"])
def search():
    query = request.form.get("q")
    sim = images.image.similarity(query)
    return images.order_by(sim, asc=False).limit(10).collect()

@app.route("/api/upload", methods=["POST"])
def upload():
    images.insert([{"image": request.files["file"]}])
    return {"status": "ok"}
```

Production deployment patterns

Full Flask app with file upload & search

Get pre-signed URLs for media files stored in cloud
storage

`.fileurl`, pre-signed URLs for S3/GCS/Tigris

```python theme={null}
# Get file URL from Pixeltable
url = row["image"].fileurl

# Generate pre-signed URL for client access
presigned = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": bucket, "Key": key},
    ExpiresIn=3600
)
```

S3, GCS, Azure, R2, Tigris configuration

***

## Deployment Patterns

**When:** Keep existing RDBMS + blob storage

Pixeltable processes media, runs models, then exports results to your existing systems.

```python theme={null}
# Process in Pixeltable with media stored directly to S3/GCS/Azure
videos.add_computed_column(
    thumbnail=videos.frame.resize((256, 256)),
    destination='s3://my-bucket/thumbnails/'  # Direct to blob storage
)

# Export metadata to external RDBMS
df = videos.select(videos.video, videos.transcript).collect().to_pandas()
df.to_sql('video_metadata', engine, if_exists='append')  # SQLAlchemy
```

Process → Export to your existing infrastructure

**When:** Need versioning, lineage, and retrieval (RAG) from the same system

Pixeltable persists everything—use it as your primary data backend with automatic versioning.

```python theme={null}
# Everything in one place: storage + compute + retrieval
chunks = pxt.create_view(
    'app.doc_chunks', docs,
    iterator=DocumentSplitter.create(document=docs.pdf, separators='paragraph')
)
chunks.add_embedding_index('text', string_embed=e5_embed)

# Query with full lineage
sim = chunks.text.similarity(query)
results = chunks.order_by(sim, asc=False).limit(10).collect()
```

Versioning, lineage, and retrieval in one system

***

## End-to-End Examples

Multimodal AI agent with memory, file search, and image generation

Next.js + FastAPI app for text & image search

Retrieval-augmented generation workflow

**More sample apps:** Check out the [sample-apps directory](https://github.com/pixeltable/pixeltable/tree/main/docs/sample-apps) for chat applications, multimodal search, and more.
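The ranking pattern used throughout this page — `.similarity()` followed by `.order_by(sim, asc=False)` — is, conceptually, cosine similarity over embedding vectors. A minimal plain-Python sketch of that idea (no Pixeltable required; the toy vectors and row ids are invented for illustration):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product normalized by vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy stand-in for an embedding index: row id -> embedding vector
index = {
    'doc1': [1.0, 0.0, 0.0],
    'doc2': [0.7, 0.7, 0.0],
    'doc3': [0.0, 1.0, 0.0],
}

query = [1.0, 0.1, 0.0]

# Equivalent of order_by(sim, asc=False).limit(2): rank by similarity, descending.
ranked = sorted(index, key=lambda k: cosine(index[k], query), reverse=True)[:2]
print(ranked)  # doc1 ranks first: it is closest to the query
```

An embedding index avoids scoring every row like this sketch does; the point is only the ranking semantics of `asc=False`.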
# Get Started with Data Sharing

Source: https://docs.pixeltable.com/use-cases/get-started

Explore and share multimodal AI datasets with Pixeltable Cloud

## Overview

Build and share multimodal AI datasets without managing infrastructure. Work with your images, videos, audio, and documents through a unified Python API: process them with AI models, create embeddings, and publish your results for team collaboration or public research.

## Quick Start

**Requirements:** Pixeltable >= 0.4.24

**Replicate a dataset:**

```python theme={null}
import pixeltable as pxt

coco_copy = pxt.replicate(
    remote_uri='pxt://pixeltable:fiftyone/coco_mini_2017',
    local_path='coco-copy'
)
```

Replicas are read-only locally, but you can query them, perform similarity searches, update them with `pull()`, or create independent copies.

**Publish your datasets** (requires account and API key from [pixeltable.com](https://pixeltable.com/)):

```python theme={null}
pxt.publish(
    source='my-table',
    destination_uri='pxt://username/my-dataset'
)
```

After publishing, use `push()` to update the cloud replica with local changes. Access defaults to private; add `access='public'` to make it publicly accessible.

Learn more in the [Data Sharing Guide](/platform/data-sharing).

## Resources

Get real-time help from our community

Report issues and contribute code

Browse our documentation

Schedule time with our team

# Data Wrangling for ML

Source: https://docs.pixeltable.com/use-cases/ml-data-wrangling

Process video, audio, documents, and images into training-ready datasets

**Who:** ML Engineers, Data Scientists
**Output:** Training/evaluation datasets

**Pixeltable is your system of record**—all data, cached results, and references stay in sync.
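The `push()`/`pull()` model above is version-based: a replica remembers the last remote version it has seen and only transfers what it is missing. This is a conceptual plain-Python sketch of that sync idea, not Pixeltable's actual wire protocol; the `Remote` and `Replica` classes are invented for illustration:

```python
class Remote:
    """Toy cloud table: an append-only, versioned operation log."""
    def __init__(self):
        self.version = 0
        self.ops = []  # (version, row) pairs

    def push(self, rows):
        for row in rows:
            self.version += 1
            self.ops.append((self.version, row))

class Replica:
    """Toy local replica: read-only, updated via pull()."""
    def __init__(self, remote):
        self.remote = remote
        self.version = 0  # last remote version applied locally
        self.rows = []

    def pull(self):
        # Apply only operations newer than what we already have.
        for v, row in self.remote.ops:
            if v > self.version:
                self.rows.append(row)
                self.version = v

remote = Remote()
remote.push([{'id': 1}, {'id': 2}])

replica = Replica(remote)
replica.pull()            # transfers rows 1 and 2
remote.push([{'id': 3}])
replica.pull()            # transfers only the new row

print(replica.version, len(replica.rows))  # 3 3
```

Tracking a version watermark is what makes repeated `pull()` calls cheap: an up-to-date replica transfers nothing.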
***

## Data Lifecycle

Load from any source: [`import_csv()`](/sdk/latest/io#func-import_csv), [`import_parquet()`](/sdk/latest/io#func-import_parquet), [HuggingFace](/howto/cookbooks/data/data-import-huggingface), [S3/GCS/Azure](/integrations/cloud-storage), RDBMS via Python DB API

Load images/videos from cloud storage

Load datasets from HuggingFace Hub

Statistics & sampling: [`select()`](/tutorials/queries-and-expressions), [`.sample()`](/howto/cookbooks/data/data-sampling), `.head()`

Sample and filter large datasets efficiently

Transform & extract: [`add_computed_column()`](/tutorials/computed-columns), [`FrameIterator`](/platform/iterators), [`DocumentSplitter`](/platform/iterators)

Process video into frame-level data

Audio to text with Whisper

**Model-in-the-loop:** Auto-generate labels with AI models

* **Object Detection:** [`yolox.yolox()`](/sdk/latest/yolox), [`huggingface.detr_for_object_detection()`](/sdk/latest/huggingface)
* **Vision LLMs:** [`openai.vision()`](/sdk/latest/openai), [`anthropic.messages()`](/sdk/latest/anthropic), [`gemini.messages()`](/sdk/latest/gemini)
* **Classification:** [`huggingface.image_classification()`](/sdk/latest/huggingface)

Run YOLOX detection on images

Analyze images with GPT-4o

**Human-in-the-loop:** Refine labels with human annotators

[Label Studio](/howto/using-label-studio-with-pixeltable) sync, [FiftyOne](/howto/working-with-fiftyone) export, [`add_embedding_index()`](/platform/embedding-indexes) for curation search

Sync annotations bidirectionally

Visualize and curate datasets

**Model-in-the-loop vs. Human-in-the-loop:** Use pre-annotation to generate initial labels with AI models, then refine with human annotators. Pixeltable keeps both in sync—model outputs and human corrections live in the same table.
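Because model outputs and human corrections live in the same table, resolving a "final" label is just a per-row precedence rule. A minimal plain-Python sketch of that rule (not the Pixeltable API; the column names `model_label` and `human_label` are invented for illustration):

```python
# Each row carries the model's pre-annotation and an optional human correction.
rows = [
    {'id': 1, 'model_label': 'cat', 'human_label': None},
    {'id': 2, 'model_label': 'dog', 'human_label': 'wolf'},  # human corrected this one
    {'id': 3, 'model_label': 'car', 'human_label': None},
]

def final_label(row: dict) -> str:
    # Human correction wins; otherwise fall back to the model output.
    if row['human_label'] is not None:
        return row['human_label']
    return row['model_label']

labels = [final_label(r) for r in rows]
print(labels)  # ['cat', 'wolf', 'car']
```

In Pixeltable this precedence rule would naturally live in a computed column, so it re-evaluates automatically whenever an annotator's correction lands.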
Find similar examples with embedding search, filter by quality metrics

[`add_embedding_index()`](/platform/embedding-indexes), [`.similarity()`](/platform/embedding-indexes), `.where()`, `.order_by()`

Find visually similar samples

Search by meaning, not keywords

**Test transformations before committing:** Run a `select()` to preview results on samples before adding computed columns

```python theme={null}
# Test on 5 rows first (no storage cost)
t.select(t.image, new_label=my_classifier(t.image)).head(5)

# Happy? Commit to full dataset
t.add_computed_column(new_label=my_classifier(t.image))
```

Test UDFs and expressions before committing

Version control: [`create_snapshot()`](/platform/version-control), [`create_view()`](/platform/views), [`history()`](/platform/version-control), lineage tracking

Track changes and revert to previous states

**Why curate?** ML models are only as good as their training data. Use Pixeltable's search and filtering to find edge cases, remove duplicates, balance classes, and iterate on your data quality before export.
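The preview-then-commit pattern above works because a `select()` over a few rows only evaluates the transform on those rows, while a computed column materializes it for the whole table. A conceptual plain-Python sketch of the cost difference (not the Pixeltable API; `my_classifier`, `preview`, and `commit` are invented stand-ins):

```python
rows = [{'image': f'img_{i}.jpg'} for i in range(1000)]

calls = {'count': 0}

def my_classifier(image: str) -> str:
    # Stand-in for an expensive model call; counts invocations.
    calls['count'] += 1
    return 'label_for_' + image

def preview(rows, fn, n=5):
    # Like t.select(...).head(n): evaluate on the first n rows only.
    return [fn(r['image']) for r in rows[:n]]

def commit(rows, fn):
    # Like add_computed_column: materialize the result for every row.
    for r in rows:
        r['new_label'] = fn(r['image'])
    return rows

sample = preview(rows, my_classifier)   # cheap: 5 model calls
cheap_calls = calls['count']

committed = commit(rows, my_classifier)  # full: 1000 more model calls
print(cheap_calls, calls['count'])       # 5 1005
```

Previewing first means a buggy or mistuned transform costs 5 model calls instead of 1000, and nothing is written until you commit.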
Publish to cloud: [`publish()`](/platform/data-sharing), [`replicate()`](/platform/data-sharing), `push()`, `pull()`

Collaborate with your team via cloud replicas

Training formats: [`to_pytorch_dataset()`](/sdk/latest/query#method-to_pytorch_dataset), [`export_parquet()`](/sdk/latest/io#func-export_parquet), [`to_coco_dataset()`](/sdk/latest/query#method-to_coco_dataset), [`export_lancedb()`](/sdk/latest/io#func-export_lancedb)

Convert to PyTorch DataLoader format

All import/export formats

***

## End-to-End Examples

Complete workflow: ingest video → extract frames → detect objects → export

Transcribe and analyze audio at scale

Extract structured data from images with GPT-4o

Auto-generate image descriptions

# Cloud Offering

Source: https://docs.pixeltable.com/use-cases/services

Data sharing, endpoints, and collaboration via Pixeltable Cloud

Pixeltable Cloud extends the local SDK with team collaboration and production deployment capabilities.

***

## 1. Publish & Replicate ✅ Available Now

| Feature            | API                                                               |
| ------------------ | ----------------------------------------------------------------- |
| Publish datasets   | [`pxt.publish(source, destination_uri)`](/platform/data-sharing)  |
| Replicate datasets | [`pxt.replicate(remote_uri, local_path)`](/platform/data-sharing) |
| Sync updates       | `push()`, `pull()`                                                |
| Access control     | `access='public'` or `'private'`                                  |

```python theme={null}
# Publish your curated dataset
pxt.publish(source='my-table', destination_uri='pxt://myorg/my-dataset')

# Anyone can replicate public datasets (no account required)
coco = pxt.replicate(remote_uri='pxt://pixeltable:fiftyone/coco_mini_2017', local_path='coco-copy')
```

Full documentation on publish, replicate, push, and pull

***

## 2. Endpoints 🔜 Coming Soon

One-command API deployment with managed hosting.
| Feature         | What                               |
| --------------- | ---------------------------------- |
| `pxt serve`     | Local development server           |
| `pxt deploy`    | Cloud deployment with auto-scaling |
| Pre-signed URLs | Media access without proxying      |

[Join the waitlist](https://www.pixeltable.com/waitlist) to get early access to Endpoints.

***

## 3. Live Tables 🔜 Coming Soon

Multi-writer collaboration and serverless compute.

| Feature            | What                                   |
| ------------------ | -------------------------------------- |
| Multi-writer       | Team collaboration on shared tables    |
| Serverless compute | Auto-scaling without infrastructure    |
| UDF versioning     | Safe experimentation with code changes |
| RBAC + audit       | Governance and compliance              |

***

## Unified: Where It's All Going

When all three cloud services are available, the two use cases converge:

* **Data wrangling + AI pipelines + endpoints = one system**
* **Orchestration + storage + retrieval unified**
* **The table becomes the endpoint**

Your training datasets and production APIs share the same infrastructure—versioning, lineage, and retrieval in the serving path.

Schedule time with our team to discuss your use case