# Use Case: Data Scientist
## The problem
You have data from multiple places — CSVs from a vendor, Parquet exports from a database, maybe a direct API pull. You've cleaned it up and done your analysis. Now someone asks: "can you share that dataset?" And you realize: the "dataset" is actually a Python script, three files, and a notebook with a dozen transformation steps embedded in it.
Bundlebase gives you a place to land the cleaned, combined data so that anyone with the path can pick up exactly what you worked with — including what you filtered out, what you renamed, and when things changed.
## The scenario
You're analyzing customer support tickets. The data lives in three places:
- Monthly CSV exports from your ticketing system (inconsistent column names between months)
- A product lookup table in Parquet on S3
- A filter: only tickets from enterprise customers, resolved status
You want to share a stable, versioned view of this with the engineering team that's building a dashboard.
## Step 1: Combine and clean the raw data
```python
import bundlebase.sync as bb

bundle = (
    bb.create("s3://team-data/support-tickets")
    .attach("exports/2026-01.csv")
    .attach("exports/2026-02.csv")
    .attach("exports/2026-03.csv")
    .normalize_column_names()  # fixes "Ticket ID", "ticket_id", "TicketId" → consistent
    .filter("customer_tier = 'enterprise' AND status = 'resolved'")
    .drop_column("internal_notes")   # strip content you don't want shared
    .drop_column("assignee_email")
)
```

The same operations in SQL:

```sql
CREATE 's3://team-data/support-tickets';
ATTACH 'exports/2026-01.csv';
ATTACH 'exports/2026-02.csv';
ATTACH 'exports/2026-03.csv';
NORMALIZE COLUMN NAMES;
FILTER WITH SELECT * FROM bundle WHERE customer_tier = 'enterprise' AND status = 'resolved';
DROP COLUMN internal_notes;
DROP COLUMN assignee_email;
```
**Why normalize first**

CSVs from the same system often have inconsistent column casing between exports. `normalize_column_names()` handles this in one step, so your filter and drop operations work against predictable names.
## Step 2: Enrich with a lookup table
The tickets have a product_id column but the dashboard needs the product name. You have a product lookup table in S3:
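A left join against the Parquet lookup might look like the sketch below. The `join()` call, its `on`/`how` parameters, and the lookup-table path are assumptions extrapolated from the chained style in Step 1, not confirmed API:

```python
# Hypothetical sketch: join() and its parameters are assumed,
# following the chained-call pattern from Step 1.
bundle = bundle.join(
    "s3://team-data/products.parquet",  # assumed lookup-table location
    on="product_id",
    how="left",  # keep every ticket, matched or not
)
```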
The left join keeps all tickets — even the ones where the product ID doesn't match anything in the lookup, which is worth knowing about.
## Step 3: Document and commit
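A sketch of what documenting and committing could look like; `describe()` and `commit()` are assumed method names, not confirmed API:

```python
# Hypothetical sketch: describe() and commit() are assumed names.
bundle = (
    bundle
    .describe(
        "Enterprise resolved support tickets, Jan-Mar 2026. "
        "Column names normalized; internal_notes and assignee_email removed."
    )
    .commit("Initial cleaned dataset: Jan-Mar 2026 exports")
)
```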
The description travels with the data. Six months from now, someone opening this bundle doesn't need to find you to understand what's in it.
## Step 4: Share it
Send the path: `s3://team-data/support-tickets`
The engineering team opens it:
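Opening by path might look like this; `bb.open()` and `to_pandas()` are assumed counterparts to the `bb.create()` shown above, not confirmed API:

```python
import bundlebase.sync as bb

# Hypothetical sketch: bb.open() and to_pandas() are assumed names.
bundle = bb.open("s3://team-data/support-tickets")
df = bundle.to_pandas()  # load the latest committed state
```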
They can also query it directly without loading everything into memory:
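A direct query could look like the following sketch; `query()` is an assumed method, with the SQL dialect mirroring the `FILTER WITH` example in Step 1:

```python
# Hypothetical sketch: query() is an assumed method name.
result = bundle.query(
    "SELECT product_name, COUNT(*) AS tickets "
    "FROM bundle GROUP BY product_name"
)
```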
## Step 5: Update it when data changes
When April's export arrives, you don't create a new bundle — you extend the existing one:
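Extending the bundle might look like this sketch, reusing the Step 1 calls; `bb.open()` and `commit()` are assumed names:

```python
# Hypothetical sketch: bb.open() and commit() are assumed names;
# attach/normalize/filter follow the Step 1 pattern.
bundle = (
    bb.open("s3://team-data/support-tickets")
    .attach("exports/2026-04.csv")
    .normalize_column_names()  # idempotent, safe to re-run
    .filter("customer_tier = 'enterprise' AND status = 'resolved'")
    .commit("Add April 2026 export")
)
```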
Anyone who already has the path gets the update automatically next time they open it. The history is preserved:
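Viewing that history could look like the sketch below; `history()` is an assumed accessor, not confirmed API:

```python
# Hypothetical sketch: history() is an assumed accessor that
# yields the bundle's commits, newest first.
for entry in bundle.history():
    print(entry)
```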
## What you've avoided
- No "which file is the latest?" confusion — the bundle is always the latest committed state
- No "what did you filter?" questions — the commit history and description capture it
- No re-running your cleaning script to hand off data — commit once, share the path
- No schema drift between months — `normalize_column_names()` is idempotent and committed
## Next steps
- Attaching data — full options for attach, including column type casting
- Joins — joining with other bundles or files
- Versioning — reset, undo, and viewing history
- Querying — SQL syntax and output formats