Basic Operations¶
Creating a bundle¶
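The example for this section did not survive extraction. A minimal sketch, using the `bb.create()` call shown in the Versioning section below (the S3 path is illustrative):

```python
import bundlebase.sync as bb

# Create a new, empty bundle at a storage URI
bundle = bb.create("s3://my-bucket/sales")
```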
Opening an existing bundle¶
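The example for this section did not survive extraction. A minimal sketch, using the `bb.open()` call shown in the Versioning section below; whether an opened bundle is read-only until `.extend()` is called is an assumption inferred from that section:

```python
import bundlebase.sync as bb

# Open an existing bundle; call .extend() (see Versioning) to get a mutable copy
bundle = bb.open("s3://my-bucket/sales")
```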
Attaching data¶
Multiple files union together automatically, even across formats:
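The code that followed this sentence is missing. A sketch based on the `attach()` calls used throughout this document; the specific non-CSV formats shown are assumptions, chosen only to illustrate the cross-format union described above:

```python
# Attach several files; their rows union into a single table
bundle.attach("jan.csv")
bundle.attach("feb.parquet")  # format assumed supported, for illustration
bundle.attach("mar.csv")
```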
> **Note:** CSV columns import as text. Use `cast_column()` to convert types after attaching.
Filtering rows¶
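The example for this section is missing. A sketch using the `filter()` method as it appears in the Method chaining section, which passes a SQL-style predicate string; the second predicate is an illustrative assumption:

```python
# filter() takes a SQL-style predicate string
bundle.filter("status = 'closed_won'")
bundle.filter("amount > 1000")  # predicates can be applied successively
```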
Column operations¶
```python
# Remove columns
bundle.drop_column("ssn")
bundle.drop_column("credit_card")

# Rename
bundle.rename_column("fname", "first_name")
bundle.rename_column("lname", "last_name")

# Normalize messy names: "Customer Id" → "customer_id", "Phone 1" → "phone_1"
bundle.normalize_column_names()

# Cast CSV text to typed columns
bundle.cast_column("amount", "float64")
bundle.cast_column("created_at", "timestamp")
```
Querying with SQL¶
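The example for this section is missing, and this document shows no SQL entry point. The sketch below is illustration only: the `query()` method name and the `bundle` table alias are both hypothetical, not confirmed by this document.

```python
# Hypothetical API: method name and table alias are assumptions
result = bundle.query(
    "SELECT region, SUM(amount) AS total "
    "FROM bundle GROUP BY region"
)
```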
Exporting¶
```python
# pandas
df = bundle.to_pandas()

# polars
df = bundle.to_polars()

# numpy (returns a dict of arrays keyed by column name)
arrays = bundle.to_numpy()
x = arrays["revenue"]

# Streaming batches — constant memory regardless of dataset size
for batch in bundle.stream_batches():
    process(batch)  # each batch is a PyArrow RecordBatch
```
Versioning¶
```python
import bundlebase.sync as bb

# Create and commit
bundle = bb.create("s3://my-bucket/sales")
bundle.attach("jan.csv")
bundle.commit("January data")

# Extend (mutable copy of an existing bundle)
bundle = bb.open("s3://my-bucket/sales").extend()
bundle.attach("feb.csv")
bundle.commit("Added February")

# View history
bundle = bb.open("s3://my-bucket/sales")
for entry in bundle.history():
    print(entry)

# Roll back uncommitted changes
bundle = bb.open("s3://my-bucket/sales").extend()
bundle.attach("bad-data.csv")
bundle.reset()  # back to last committed state
```
Indexes¶
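This section's content did not survive extraction, and no index-related call appears elsewhere in this document. As illustration only, assuming a `create_index()` method (a hypothetical name, not confirmed by this document):

```python
# Hypothetical API: method name is an assumption
bundle.create_index("customer_id")
```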
Method chaining¶
All mutation methods return `self`, so operations can be chained:
```python
import bundlebase.sync as bb

bundle = (bb.create("s3://my-bucket/sales-q1")
          .attach("jan.csv")
          .attach("feb.csv")
          .attach("mar.csv")
          .normalize_column_names()
          .cast_column("amount", "float64")
          .drop_column("internal_id")
          .filter("status = 'closed_won'")
          .set_name("Q1 Sales — Closed Won")
          .commit("Initial Q1 export"))
```