
Bundle your data.
Ship it anywhere.

Bundlebase

Versioned, self-describing data containers. Access via Python, SQL, CLI, or any BI tool — no server required. No database, no pipeline docs, no infrastructure to maintain.

import bundlebase.sync as bb

bundle = (bb.create("s3://data/sales-q1")
    .attach("exports/jan.csv")
    .attach("exports/feb.csv")
    .drop_column("ssn")
    .filter("status = 'closed_won'"))
bundle.commit("Q1 export, PII removed")

# --- anyone with the path ---
bundle = bb.open("s3://data/sales-q1")
df = bundle.to_pandas()

-- The same steps in SQL
ATTACH 'exports/jan.csv';
ATTACH 'exports/feb.csv';
DROP COLUMN ssn;
FILTER WITH SELECT * FROM bundle
  WHERE status = 'closed_won';
COMMIT 'Q1 export, PII removed';

-- anyone with the path
OPEN 's3://data/sales-q1';
SELECT region, SUM(amount)
FROM bundle GROUP BY region;

Your team wants the data you've been cleaning. Email a CSV and it's stale by Friday. Re-run the pipeline yourself and you're on the hook every time. Stand up a database and that's a quarter of work.

Bundlebase packages data into a versioned, self-describing container. Point anyone at the path and they get the data, the schema, the transformation history, and the provenance. Query with SQL, pull into pandas, connect Metabase. No server, no README, no repeating yourself.

Share a dataset

import bundlebase.sync as bb

bundle = (bb.create("s3://company-data/sales-q1")
    .attach("exports/jan.csv")
    .attach("exports/feb.csv")
    .drop_column("ssn")
    .filter("status = 'closed_won'")
    .set_name("Q1 Sales -- Closed Won"))
bundle.commit("Initial Q1 export")

# Anyone with the path
bundle = bb.open("s3://company-data/sales-q1")
df = bundle.to_pandas()

-- The same steps in SQL
ATTACH 'exports/jan.csv';
ATTACH 'exports/feb.csv';
DROP COLUMN ssn;
FILTER WITH SELECT * FROM bundle WHERE status = 'closed_won';
SET NAME 'Q1 Sales -- Closed Won';
COMMIT 'Initial Q1 export';

-- Anyone with the path
OPEN 's3://company-data/sales-q1';
SELECT region, SUM(amount) FROM bundle GROUP BY region;

Commit once. Share the path. The consumer gets a DataFrame, a SQL connection, or a CLI session, whichever they prefer. The commit history is the changelog. No "which file is the latest?" conversations.
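
To make the three transformation steps concrete, here is a plain-Python sketch of what they amount to: union the attached files, drop a column, keep the matching rows. It uses only the standard library and is not how Bundlebase executes them. The inline CSV text and the helper names (read_rows, build_export, total_by_region) are hypothetical stand-ins invented for this illustration.

```python
import csv
import io

# Hypothetical stand-ins for exports/jan.csv and exports/feb.csv.
JAN_CSV = """region,amount,status,ssn
west,100,closed_won,111-11-1111
east,50,open,222-22-2222
"""
FEB_CSV = """region,amount,status,ssn
west,25,closed_won,333-33-3333
east,75,closed_won,444-44-4444
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def build_export(sources, drop, keep_status):
    """Union the sources, drop one column, keep rows with the given status."""
    rows = []
    for text in sources:  # attach: rows from every source are unioned
        rows.extend(read_rows(text))
    # drop_column: remove the named column from every row
    rows = [{k: v for k, v in r.items() if k != drop} for r in rows]
    # filter: keep only rows whose status matches
    return [r for r in rows if r["status"] == keep_status]


def total_by_region(rows):
    """Plain-Python equivalent of SELECT region, SUM(amount) ... GROUP BY region."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + int(r["amount"])
    return totals


export = build_export([JAN_CSV, FEB_CSV], drop="ssn", keep_status="closed_won")
totals = total_by_region(export)
print(totals)
```

The difference with a bundle is that these steps are recorded once at commit time, so the consumer never has to re-run or even know about them.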

What makes a bundle different from a Parquet file ->

Durable storage for LLM agents

Agents lose context between sessions. A bundle doesn't. At startup the agent reads name, num_rows, schema, and history(), so it knows what it has without re-fetching anything. If a session crashes mid-run, the next one opens the last committed state.

import bundlebase.sync as bb

# New session -- reconstruct from the bundle itself
bundle = bb.open("s3://agent-workspace/product-reviews")
print(bundle.name)      # "Product Reviews"
print(bundle.num_rows)  # 14302
for e in bundle.history():
    print(e)            # what ran and when

results = bundle.query(
    "SELECT * FROM bundle WHERE review_date >= '2026-01-01'"
).to_pandas()

-- REPL: new session
OPEN 's3://agent-workspace/product-reviews';
SHOW STATUS;    -- name, rows, version
SHOW HISTORY;   -- full commit log

SELECT * FROM bundle WHERE review_date >= '2026-01-01';
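
One way an agent might use that startup read is to fold it into a short text summary for its context window. The sketch below assumes only the members named above (name, num_rows, schema, history()) and that each history entry is printable. The describe_bundle helper and the stand-in object are hypothetical and exist only so the example runs without a real bundle.

```python
from types import SimpleNamespace


def describe_bundle(bundle):
    """Build a short text summary an agent can load at session start.

    Reads only name, num_rows, schema, and history() from the bundle.
    """
    lines = [
        f"name: {bundle.name}",
        f"rows: {bundle.num_rows}",
        f"schema: {bundle.schema}",
        "history:",
    ]
    for entry in bundle.history():
        lines.append(f"  - {entry}")
    return "\n".join(lines)


# Hypothetical stand-in with made-up values; in practice you would pass
# the object returned by bb.open(...).
fake_bundle = SimpleNamespace(
    name="Product Reviews",
    num_rows=14302,
    schema="review_id: int, review_date: date, text: str",
    history=lambda: ["Initial import", "Added January reviews"],
)

summary = describe_bundle(fake_bundle)
print(summary)
```

Because the summary is rebuilt from the committed bundle each time, a crashed session leaves nothing to recover beyond the last commit.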

Or expose the bundle as an MCP tool server. The agent gets query, schema, history, and sample as native tools, with no code required:

bundlebase mcp --bundle s3://agent-workspace/product-reviews

See the agent use case ->

Data hygiene rules that travel with the bundle

Your source data is dirty. You clean it every time, until you forget. always_delete and always_update encode cleanup rules into the bundle itself. They fire on every future attach, regardless of who runs it.

import bundlebase.sync as bb

bundle = bb.create("s3://analytics/crm-export")

# Define once
bundle.always_delete("WHERE amount < 0")
bundle.always_delete("WHERE status = 'test'")

# Attach dirty data -- rules fire automatically
bundle.attach("jan.csv").commit("January")
bundle.extend().attach("feb.csv").commit("February")

-- The same rules in SQL
OPEN 's3://analytics/crm-export';
ALWAYS DELETE WHERE amount < 0;
ALWAYS DELETE WHERE status = 'test';

-- attach dirty data -- rules fire automatically
ATTACH 'jan.csv';
COMMIT 'January';

No shared cleanup script. No "oops, I forgot the filter" incidents.
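
The behavior described here can be modeled in a few lines: rules live with the container and are re-applied after every attach. The toy below uses an in-memory SQLite table purely as a stand-in. ToyBundle and its methods are hypothetical, written for this illustration, and say nothing about how Bundlebase stores or evaluates rules internally.

```python
import sqlite3


class ToyBundle:
    """Toy model of a container that keeps cleanup rules next to its data.

    Not Bundlebase's implementation; it only mimics the described behavior
    so the rule semantics can be checked in isolation.
    """

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE bundle (amount REAL, status TEXT)")
        self.rules = []  # WHERE clauses, stored with the container

    def always_delete(self, where_clause):
        """Register a rule such as "WHERE amount < 0" (trusted input only)."""
        self.rules.append(where_clause)
        return self

    def attach(self, rows):
        """Add rows, then apply every stored rule to the table."""
        self.db.executemany("INSERT INTO bundle VALUES (?, ?)", rows)
        for where_clause in self.rules:
            self.db.execute(f"DELETE FROM bundle {where_clause}")
        return self

    def rows(self):
        """Return the surviving rows, ordered by amount."""
        return self.db.execute(
            "SELECT amount, status FROM bundle ORDER BY amount"
        ).fetchall()


bundle = ToyBundle()
bundle.always_delete("WHERE amount < 0").always_delete("WHERE status = 'test'")

# Made-up "January" and "February" data, both containing dirty rows.
bundle.attach([(120.0, "closed_won"), (-5.0, "refund"), (40.0, "test")])
bundle.attach([(75.0, "open"), (-1.0, "closed_won")])

remaining = bundle.rows()
print(remaining)
```

The second attach is cleaned without anyone restating the rules, which is the property the section describes.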

See the ETL use case ->

How it works

%%{init: {'flowchart': {'curve': 'stepAfter'}}}%%
flowchart LR
    A["CSV · Parquet · JSON\nHTTP · S3 · SFTP · Kaggle"] --> B["Bundle\nversioned · self-describing"]
    B --> C1["Python -- bb.open()"]
    B --> C2["CLI -- bundlebase REPL"]
    B --> C3["SQL server -- Metabase · R · DBeaver"]
    B --> C4["MCP server -- AI assistants"]

    classDef bundle fill:#0e7490,color:#fff,stroke:#22d3ee,stroke-width:1.5px
    classDef output fill:#1e293b,color:#cbd5e1,stroke:#334155,stroke-width:1px
    classDef source fill:#1e293b,color:#cbd5e1,stroke:#334155,stroke-width:1px

    class B bundle
    class C1,C2,C3,C4 output
    class A source

Works with your stack

Python: pandas · polars · numpy · pyarrow, sync + async
CLI: Interactive REPL, scriptable commands -- no Python required
BI tools: Metabase · DBeaver · R · Julia · Go -- anything with an Arrow Flight driver or native Arrow Flight support
Storage: S3 · GCS · Azure Blob · local paths
Formats: Parquet · CSV · JSON -- mixed sources union automatically
Custom connectors: Pull from any source (Salesforce, internal APIs, custom databases) by writing a connector in Python, Go, Java, or any IPC-compatible language
Custom SQL functions: Register Python callables as scalar or aggregate UDFs and call them directly in SQL queries
SQL: Full Apache DataFusion syntax -- SELECT * FROM bundle WHERE ...
Scale: Streaming execution -- datasets larger than RAM, constant memory
Core: Rust + Apache Arrow -- columnar, fast

Get started

  1. Why Bundlebase? -- comparisons to DVC, Delta Lake, plain files, databases
  2. Python -- pip install, then build your first bundle
  3. CLI -- download the binary, query interactively