# Bundlebase

> Docker for data. Bundle, version, and query datasets without database infrastructure.

Bundlebase packages data files (CSV, Parquet, JSON) from any source (local, S3, HTTP, SFTP) into versioned bundles with a built-in SQL query engine, persistent data hygiene rules, custom connectors, custom SQL functions, and Python/CLI/MCP interfaces. Consumers access bundles via Python, CLI REPL, Arrow Flight SQL server, or MCP tool server — no database required.

## Docs

- [Why Bundlebase?](https://nvoxland.github.io/bundlebase/getting-started/why-bundlebase/): Comparisons to DVC, Delta Lake, plain files, databases
- [Python Quick Start](https://nvoxland.github.io/bundlebase/getting-started/python/quick-start/): Create a bundle, attach data, query with SQL
- [CLI Quick Start](https://nvoxland.github.io/bundlebase/getting-started/cli/quick-start/): Interactive REPL usage
- [Basic Concepts](https://nvoxland.github.io/bundlebase/getting-started/basic-concepts/): Read-only vs mutable, lazy evaluation, sources, versioning
- [SQL Reference](https://nvoxland.github.io/bundlebase/sql-reference/): Full command syntax
- [Python API — Sync](https://nvoxland.github.io/bundlebase/api/python/sync-api/): Full sync API reference
- [Python API — Async](https://nvoxland.github.io/bundlebase/api/python/async-api/): Full async API reference
- [CLI REPL Guide](https://nvoxland.github.io/bundlebase/guide/cli-repl/): REPL flags and meta-commands
- [SQL Server Guide](https://nvoxland.github.io/bundlebase/guide/cli-serve/): Arrow Flight SQL server for BI tools
- [MCP Server Guide](https://nvoxland.github.io/bundlebase/guide/cli-mcp/): MCP tool server for AI assistants
- [Sources Guide](https://nvoxland.github.io/bundlebase/guide/sources/): HTTP, S3, SFTP, fetch modes
- [Custom Connectors](https://nvoxland.github.io/bundlebase/guide/custom-connectors/): Python, Go, Java, native connector SDK
- [Custom Functions](https://nvoxland.github.io/bundlebase/guide/functions/): Scalar and aggregate
UDFs
- [Examples](https://nvoxland.github.io/bundlebase/examples/): Copy-paste examples by task
- [Use Cases](https://nvoxland.github.io/bundlebase/use-cases/): End-to-end scenarios by role

## Quick Reference

### Installation

```bash
pip install bundlebase               # base (PyArrow only)
pip install "bundlebase[pandas]"     # with pandas
pip install "bundlebase[jupyter]"    # with pandas + Jupyter support
```

CLI binary (no Python required):

```bash
# macOS / Linux — always downloads latest release
curl -L https://github.com/nvoxland/bundlebase/releases/latest/download/bundlebase-aarch64-apple-darwin.tar.gz | tar xz
sudo mv bundlebase /usr/local/bin/
```

### Python API (bundlebase.sync)

```python
import bundlebase.sync as bb

# Factory functions
bundle = bb.create("s3://bucket/path")    # new bundle — mutable
bundle = bb.create("memory:///")          # in-memory (no persistence)
bundle = bb.open("s3://bucket/path")      # existing bundle — read-only

# Make a read-only bundle mutable
bundle = bb.open("s3://bucket/path").extend()

# Data operations (all return self for chaining)
bundle.attach("data.parquet")
bundle.attach("s3://other/data.csv")
bundle.attach("https://example.com/feed.json")
bundle.filter("amount > 0")                                  # WHERE clause string
bundle.filter("amount > $1 AND status = $2", [0, "active"])  # parameterized
bundle.join("orders", on="id = order_id", location="orders.parquet", how="inner")
bundle.drop_column("ssn")
bundle.rename_column("fname", "first_name")
bundle.add_column("full_name", "first_name || ' ' || last_name")
bundle.cast_column("amount", "float64")
bundle.normalize_column_names()           # "Customer Id" → "customer_id"
bundle.set_name("My Bundle")
bundle.set_description("Description")

# Persistent data hygiene rules (fire on every future attach)
bundle.always_delete("WHERE amount < 0")
bundle.always_delete("WHERE status = 'test'")
bundle.always_update("SET currency = UPPER(currency)")
bundle.always_update("SET region = 'EMEA' WHERE region IN ('Europe', 'EU')")

# Sources and fetch
bundle.create_source("http", {"url": "https://api.example.com/data", "format": "json", "json_record_path": "records"})
bundle.create_source("remote_dir", {"url": "s3://bucket/data/", "patterns": "**/*.parquet"})
bundle.create_source("sftp_directory", {"url": "sftp://host/path/", "patterns": "*.csv", "username": "user", "key_path": "/key"})
bundle.fetch("base", "add")       # add new files only
bundle.fetch("ref", "update")     # add new + refresh changed
bundle.fetch("snap", "sync")      # full replacement

# Custom functions
bundle.import_temp_function("acme.score", "python::my_module:score_fn")  # Python (session-only)
bundle.import_function("acme.score", "ipc::./score_binary")              # persistent (non-Python runtimes)
bundle.drop_function("acme.score")

# Custom connectors
bundle.import_temp_connector("acme.salesforce", "python::sf_connector:SalesforceSource")
bundle.import_connector("acme.weather", "ipc::./weather_binary")
bundle.drop_connector("acme.weather")

# Views
bundle.create_view("active_users", "SELECT * FROM bundle WHERE active = true")

# Versioning
bundle.commit("Added customer data")
bundle.reset()     # discard uncommitted changes

# Querying
result = bundle.query("SELECT region, SUM(amount) FROM bundle GROUP BY region")
df = result.to_pandas()
df = result.to_polars()

# Direct export
df = bundle.to_pandas()
df = bundle.to_polars()
d = bundle.to_dict()
a = bundle.to_numpy()
for batch in bundle.stream_batches():   # constant memory, any dataset size
    process(batch)

# Introspection
bundle.name          # str or None (property)
bundle.description   # str or None (property)
bundle.version       # 12-char hex string (property)
bundle.url           # storage path (property)
bundle.num_rows      # int (property)
bundle.schema        # PySchema — .fields, .field(name)
bundle.history()     # list of commits
bundle.status()      # list of uncommitted operations
```

### SQL Commands (REPL / CLI)

```sql
-- Bundle lifecycle
CREATE '<url>'
OPEN '<url>'
EXTEND

-- Data
ATTACH '<path>' [TO <name>]
DETACH '<name>'
FILTER WITH SELECT * FROM bundle WHERE <condition>

-- Schema
JOIN '<location>' AS <name>
ON <condition>
DROP JOIN <name>
DROP COLUMN <name>
RENAME COLUMN <old> TO <new>
CAST COLUMN <name> AS <type>
NORMALIZE COLUMN NAMES
ADD COLUMN <name> AS <expr>
CREATE VIEW <name> AS <query>
DROP VIEW <name>

-- Persistent hygiene rules
ALWAYS DELETE WHERE <condition>
ALWAYS UPDATE SET <column> = <expr> [WHERE <condition>]
SHOW ALWAYS RULES

-- Sources
CREATE SOURCE <name> [FOR <connector>] USING <type> [WITH (<key> = '<value>', ...)]
FETCH <name> [DRY RUN]
FETCH ALL [DRY RUN]

-- Functions
IMPORT TEMP FUNCTION <name> FROM '<runtime>::<path>'
IMPORT FUNCTION <name> FROM '<runtime>::<path>'
DROP FUNCTION <name>

-- Connectors
IMPORT TEMP CONNECTOR <name> FROM '<runtime>::<path>'
IMPORT CONNECTOR <name> FROM '<runtime>::<path>'
DROP CONNECTOR <name>

-- Indexes
CREATE INDEX <name>
DROP INDEX <name>

-- Versioning
COMMIT '<message>'
RESET
UNDO [LAST <n>]

-- Metadata
SET NAME '<name>'
SET DESCRIPTION '<text>'

-- Inspection
SHOW STATUS
SHOW SCHEMA
SHOW HISTORY
SHOW ALWAYS RULES
EXPLAIN [ANALYZE] [VERBOSE] <query>
SELECT * FROM search('', '')
```

Runtimes for functions/connectors: `python` (temp/session-only), `ipc`, `ffi`, `java`, `docker`.

### CLI

```bash
# Set up AI assistant integration (configures MCP + installs agent skills)
bundlebase setup-agent                   # project-level
bundlebase setup-agent --scope global    # user-level

# Interactive REPL
bundlebase repl --bundle ./my-bundle
bundlebase repl --bundle s3://bucket/path

# Non-interactive (agent/script friendly)
bundlebase query --bundle ./sales --sql "SELECT * FROM bundle LIMIT 5" --format json
bundlebase create --bundle ./sales
bundlebase extend --bundle ./sales --execute "ATTACH 'data.csv'" --execute "COMMIT 'load'"

# SQL server (Arrow Flight — connect from Metabase, DBeaver, R, Julia, Go)
bundlebase serve --bundle s3://bucket/path --port 32010

# MCP tool server (for AI assistants)
bundlebase mcp --bundle s3://bucket/path

# Report generation
bundlebase generate-report --input report.md --output report.pdf
```

### Custom Connector (Python)

```python
from bundlebase_sdk.source import Connector
from bundlebase_sdk.types import Location

class MyConnector(Connector):
    def discover(self, attached_locations: list[str], **kwargs: str) -> list[Location]:
        """Return list of available data locations."""
        return [Location(location="data.parquet",
                        must_copy=True, format="parquet")]

    def data(self, location: Location, **kwargs: str):
        """Return pa.Table, pa.RecordBatch, list[dict], or None."""
        ...
```

Register: `bundle.import_temp_connector("acme.my_conn", "python::my_module:MyConnector")`
Use: `bundle.create_source("acme.my_conn", {"key": "value"})`

### Custom SQL Function (Python scalar)

```python
import pyarrow as pa
import pyarrow.compute as pc

def double_val(col: pa.Array) -> pa.Array:
    return pc.multiply(col, 2)

def bundlebase_metadata():
    return [{"name": "double_val",
             "input_types": ["Int64"],
             "return_type": "Int64",
             "kind": "scalar"}]
```

Register: `bundle.import_temp_function("acme.double_val", "python::my_module:double_val")`
Use in SQL: `SELECT acme.double_val(amount) FROM bundle`