Your team wants the data you've been cleaning. Email a CSV? Stale by Friday. Run the pipeline yourself? Every time. Stand up a database? That's a quarter of work.
Bundlebase packages data into a versioned, self-describing container. Point anyone at the path — they get the data, the schema, the transformation history, and the provenance. Query with SQL, pull into pandas, connect Metabase. No server, no README, no repeating yourself.
Share a dataset¶
import bundlebase.sync as bb
bundle = (bb.create("s3://company-data/sales-q1")
          .attach("exports/jan.csv")
          .attach("exports/feb.csv")
          .drop_column("ssn")
          .filter("status = 'closed_won'")
          .set_name("Q1 Sales — Closed Won"))
bundle.commit("Initial Q1 export")
# Anyone with the path
bundle = bb.open("s3://company-data/sales-q1")
df = bundle.to_pandas()
ATTACH 'exports/jan.csv';
ATTACH 'exports/feb.csv';
DROP COLUMN ssn;
FILTER WITH SELECT * FROM bundle WHERE status = 'closed_won';
SET NAME 'Q1 Sales — Closed Won';
COMMIT 'Initial Q1 export';
-- Anyone with the path
OPEN 's3://company-data/sales-q1';
SELECT region, SUM(amount) FROM bundle GROUP BY region;
Commit once. Share the path. The consumer gets a DataFrame, a SQL connection, or a CLI session — their choice. The commit history is the changelog. No "which file is the latest?" conversations.
What makes a bundle different from a Parquet file →
Durable storage for LLM agents¶
Agents lose context between sessions. A bundle doesn't. At startup the agent reads name, num_rows, schema, and history() — it knows exactly what it has without re-fetching anything. Crash mid-run? Next session opens the last committed state, clean.
# New session — reconstruct from the bundle itself
bundle = bb.open("s3://agent-workspace/product-reviews")
print(bundle.name)      # "Product Reviews"
print(bundle.num_rows)  # 14302

for e in bundle.history():
    print(e)  # what ran and when

results = bundle.query(
    "SELECT * FROM bundle WHERE review_date >= '2026-01-01'"
).to_pandas()
-- REPL: new session
OPEN 's3://agent-workspace/product-reviews';
SHOW STATUS; -- name, rows, version
SHOW HISTORY; -- full commit log
SELECT * FROM bundle WHERE review_date >= '2026-01-01';
Or expose as an MCP tool server — the agent gets query, schema, history, and sample as native tools, no code required.
Data hygiene rules that travel with the bundle¶
Your source data is dirty. You clean it every time — until you forget. always_delete and always_update encode cleanup rules into the bundle itself. They fire on every future attach, regardless of who runs it.
No shared cleanup script. No "oops, I forgot the filter" incidents.
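A conceptual sketch of that pattern in plain Python — this is not the Bundlebase API, just an illustration of rules being stored with the container and re-applied on every future attach (the `Bundle` class, its fields, and the example rule are all hypothetical):

```python
# Conceptual sketch only — NOT the Bundlebase API. It illustrates the
# pattern described above: cleanup rules live *with* the data container,
# so they fire on every future attach, regardless of who runs it.

class Bundle:
    def __init__(self):
        self.rows = []
        self.rules = []  # rules travel with the bundle, not with a script

    def always_delete(self, predicate):
        """Record a rule: rows matching predicate are dropped on every attach."""
        self.rules.append(predicate)

    def attach(self, new_rows):
        # Every attach replays every stored rule — no one can forget the filter.
        for rule in self.rules:
            new_rows = [r for r in new_rows if not rule(r)]
        self.rows.extend(new_rows)

bundle = Bundle()
bundle.always_delete(lambda r: r["status"] == "test")  # hypothetical cleanup rule

bundle.attach([{"id": 1, "status": "closed_won"},
               {"id": 2, "status": "test"}])  # the test row never lands
print(len(bundle.rows))  # 1
```

The point of the design is that the rule is part of the bundle's state, so a teammate attaching `mar.csv` next month gets the same cleanup without knowing the rule exists.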
How it works¶
%%{init: {'flowchart': {'curve': 'stepAfter'}}}%%
flowchart LR
A["CSV · Parquet · JSON\nHTTP · S3 · SFTP · Kaggle"] --> B["Bundle\nversioned · self-describing"]
B --> C1["Python — bb.open()"]
B --> C2["CLI — bundlebase REPL"]
B --> C3["SQL server — Metabase · R · DBeaver"]
B --> C4["MCP server — AI assistants"]
classDef bundle fill:#0e7490,color:#fff,stroke:#22d3ee,stroke-width:1.5px
classDef output fill:#1e293b,color:#cbd5e1,stroke:#334155,stroke-width:1px
classDef source fill:#1e293b,color:#cbd5e1,stroke:#334155,stroke-width:1px
class B bundle
class C1,C2,C3,C4 output
class A source
Works with your stack¶
| Python | pandas · polars · numpy · pyarrow, sync + async |
| CLI | Interactive REPL, scriptable commands — no Python required |
| BI tools | Metabase · DBeaver · R · Julia · Go — anything with Arrow Flight support, native or via a driver |
| Storage | S3 · GCS · Azure Blob · local paths |
| Formats | Parquet · CSV · JSON — mixed sources union automatically |
| Custom connectors | Pull from any source — Salesforce, internal APIs, custom databases — by writing a connector in Python, Go, Java, or any IPC-compatible language |
| Custom SQL functions | Register Python callables as scalar or aggregate UDFs and call them directly in SQL queries |
| SQL | Full Apache DataFusion syntax — SELECT * FROM bundle WHERE ... |
| Scale | Streaming execution — datasets larger than RAM, constant memory |
| Core | Rust + Apache Arrow — columnar, fast |
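The "Scale" row promises constant memory on larger-than-RAM data. A minimal sketch of that execution model in plain Python (the real engine is Rust + Arrow; `batches` and `streaming_sum` here are illustrative names, not Bundlebase functions): process the input in fixed-size batches so only one batch is resident at a time.

```python
from itertools import islice

def batches(rows, size=1024):
    """Yield fixed-size batches; only one batch is in memory at a time."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch

def streaming_sum(rows, key):
    # Running aggregate over batches: memory is bounded by the batch size,
    # not the dataset size — the property streaming execution gives you.
    total = 0
    for batch in batches(rows):
        total += sum(r[key] for r in batch)
    return total

# The input can be a generator that never materializes the full dataset:
data = ({"amount": i} for i in range(10_000))
print(streaming_sum(data, "amount"))  # 49995000
```

Columnar engines do the same thing with Arrow record batches instead of Python lists, which is why a `SELECT ... GROUP BY` over a 100 GB bundle can run in a fixed memory budget.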
Get started¶
- Why Bundlebase? — comparisons to DVC, Delta Lake, plain files, databases
- Python — pip install, then build your first bundle
- CLI — download the binary, query interactively