Your team wants the data you've been cleaning. Email a CSV? Stale by Friday. Run the pipeline yourself? Every time. Stand up a database? That's a quarter of work.
Bundlebase packages data into a versioned, self-describing container. Point anyone at the path — they get the data, the schema, the transformation history, and the provenance. Query with SQL, pull into pandas, connect Metabase. No server, no README, no repeating yourself.
Share a dataset¶
import bundlebase.sync as bb
bundle = (bb.create("s3://company-data/sales-q1")
          .attach("exports/jan.csv")
          .attach("exports/feb.csv")
          .drop_column("ssn")
          .filter("status = 'closed_won'")
          .set_name("Q1 Sales — Closed Won"))
bundle.commit("Initial Q1 export")
# Anyone with the path
bundle = bb.open("s3://company-data/sales-q1")
df = bundle.to_pandas()
ATTACH 'exports/jan.csv';
ATTACH 'exports/feb.csv';
DROP COLUMN ssn;
FILTER WITH SELECT * FROM bundle WHERE status = 'closed_won';
SET NAME 'Q1 Sales — Closed Won';
COMMIT 'Initial Q1 export';
-- Anyone with the path
OPEN 's3://company-data/sales-q1';
SELECT region, SUM(amount) FROM bundle GROUP BY region;
Commit once. Share the path. The consumer gets a DataFrame, a SQL connection, or a CLI session — their choice. The commit history is the changelog. No "which file is the latest?" conversations.
What makes a bundle different from a Parquet file →
Durable storage for LLM agents¶
Agents lose context between sessions. A bundle doesn't. At startup the agent reads name, num_rows, schema, and history() — it knows exactly what it has without re-fetching anything. Crash mid-run? Next session opens the last committed state, clean.
# New session — reconstruct from the bundle itself
bundle = bb.open("s3://agent-workspace/product-reviews")
print(bundle.name)      # "Product Reviews"
print(bundle.num_rows)  # 14302

for e in bundle.history():
    print(e)  # what ran and when

results = bundle.query(
    "SELECT * FROM bundle WHERE review_date >= '2026-01-01'"
).to_pandas()
-- REPL: new session
OPEN 's3://agent-workspace/product-reviews';
SHOW STATUS; -- name, rows, version
SHOW HISTORY; -- full commit log
SELECT * FROM bundle WHERE review_date >= '2026-01-01';
Or expose as an MCP tool server — the agent gets query, schema, history, and sample as native tools, no code required.
Data hygiene rules that travel with the bundle¶
Your source data is dirty. You clean it every time — until you forget. always_delete and always_update encode cleanup rules into the bundle itself. They fire on every future attach, regardless of who runs it.
No shared cleanup script. No "oops, I forgot the filter" incidents.
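A conceptual sketch of that pattern in plain Python — this is not the Bundlebase API, just an illustration of rules being stored with the container and re-applied on every future attach (the `Bundle` class, its fields, and the example rule are all hypothetical):

```python
# Conceptual sketch only — NOT the Bundlebase API. It illustrates the
# pattern described above: cleanup rules live *with* the data container,
# so they fire on every future attach, regardless of who runs it.

class Bundle:
    def __init__(self):
        self.rows = []
        self.rules = []  # rules travel with the bundle, not with a script

    def always_delete(self, predicate):
        """Record a rule: rows matching predicate are dropped on every attach."""
        self.rules.append(predicate)

    def attach(self, new_rows):
        # Every attach replays every stored rule — no one can forget the filter.
        for rule in self.rules:
            new_rows = [r for r in new_rows if not rule(r)]
        self.rows.extend(new_rows)

bundle = Bundle()
bundle.always_delete(lambda r: r["status"] == "test")  # hypothetical cleanup rule

bundle.attach([{"id": 1, "status": "closed_won"},
               {"id": 2, "status": "test"}])  # the test row never lands
print(len(bundle.rows))  # 1
```

The point of the design is that the rule is part of the bundle's state, so a teammate attaching `mar.csv` next month gets the same cleanup without knowing the rule exists.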
How it works¶
%%{init: {'flowchart': {'curve': 'stepAfter'}}}%%
flowchart LR
A["CSV · Parquet · JSON\nHTTP · S3 · SFTP · Kaggle"] --> B["Bundle\nversioned · self-describing"]
B --> C1["Python — bb.open()"]
B --> C2["CLI — bundlebase REPL"]
B --> C3["SQL server — Metabase · R · DBeaver"]
B --> C4["MCP server — AI assistants"]
classDef bundle fill:#0e7490,color:#fff,stroke:#22d3ee,stroke-width:1.5px
classDef output fill:#1e293b,color:#cbd5e1,stroke:#334155,stroke-width:1px
classDef source fill:#1e293b,color:#cbd5e1,stroke:#334155,stroke-width:1px
class B bundle
class C1,C2,C3,C4 output
class A source
Works with your stack¶
| Python | pandas · polars · numpy · pyarrow, sync + async |
| CLI | Interactive REPL, scriptable commands — no Python required |
| BI tools | Metabase · DBeaver · R · Julia · Go — anything with Arrow Flight support, native or via a driver |
| Storage | S3 · GCS · Azure Blob · local paths |
| Formats | Parquet · CSV · JSON — mixed sources union automatically |
| Custom connectors | Pull from any source — Salesforce, internal APIs, custom databases — by writing a connector in Python, Go, Java, or any IPC-compatible language |
| Custom SQL functions | Register Python callables as scalar or aggregate UDFs and call them directly in SQL queries |
| SQL | Full Apache DataFusion syntax — SELECT * FROM bundle WHERE ... |
| Scale | Streaming execution — datasets larger than RAM, constant memory |
| Core | Rust + Apache Arrow — columnar, fast |
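The "Scale" row promises constant memory on larger-than-RAM data. A minimal sketch of that execution model in plain Python (the real engine is Rust + Arrow; `batches` and `streaming_sum` here are illustrative names, not Bundlebase functions): process the input in fixed-size batches so only one batch is resident at a time.

```python
from itertools import islice

def batches(rows, size=1024):
    """Yield fixed-size batches; only one batch is in memory at a time."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch

def streaming_sum(rows, key):
    # Running aggregate over batches: memory is bounded by the batch size,
    # not the dataset size — the property streaming execution gives you.
    total = 0
    for batch in batches(rows):
        total += sum(r[key] for r in batch)
    return total

# The input can be a generator that never materializes the full dataset:
data = ({"amount": i} for i in range(10_000))
print(streaming_sum(data, "amount"))  # 49995000
```

Columnar engines do the same thing with Arrow record batches instead of Python lists, which is why a `SELECT ... GROUP BY` over a 100 GB bundle can run in a fixed memory budget.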
Get started¶
- Why Bundlebase? — comparisons to DVC, Delta Lake, plain files, databases
- Python — pip install, then build your first bundle
- CLI — download the binary, query interactively