Use Case: Backend Developer¶
The problem¶
Your service has data in it. Analysts and data scientists keep asking you to export it for them. You write a one-off script, they run it, they find an edge case, they ask again. Or you expose an internal API endpoint and they write their own export logic — and now there are three slightly different versions of "the same data" floating around.
Bundlebase lets you publish your data once in a form that anyone can consume directly from Python, without you being in the loop for every request. The data is versioned, the schema is self-documenting, and consumers get pandas DataFrames in one line.
The scenario¶
You're a backend developer at a SaaS company. You have a REST API that returns customer usage metrics. The data science team wants this data weekly. You'd rather not write a custom ETL job, manage a shared database, or email CSV files.
Option A: Use your existing API as a source (no extra code)¶
If your API already returns JSON or CSV, Bundlebase can pull from it directly. You just configure a source pointing at your endpoint:
```python
import bundlebase.sync as bb

bundle = (bb.create("s3://team-data/usage-metrics")
    .set_name("Customer Usage Metrics")
    .set_description("Weekly pull from /api/v2/metrics. Covers all accounts, all event types.")
    .create_source("http", {
        "url": "https://api.yourservice.com/v2/metrics/export",
        "headers": "Authorization: Bearer YOUR_TOKEN\nAccept: application/json",
        "json_record_path": "data"
    })
    .fetch("base", "replace"))  # replace = full refresh each time

bundle.commit("Weekly usage metrics — 2026-04-04")
```
The same pipeline in the SQL-style syntax:

```sql
CREATE 's3://team-data/usage-metrics';
SET NAME 'Customer Usage Metrics';
SET DESCRIPTION 'Weekly pull from /api/v2/metrics. Covers all accounts, all event types.';
CREATE SOURCE FOR base USING http WITH (
    url = 'https://api.yourservice.com/v2/metrics/export',
    headers = 'Authorization: Bearer YOUR_TOKEN\nAccept: application/json',
    json_record_path = 'data'
);
FETCH base SYNC;
COMMIT 'Weekly usage metrics — 2026-04-04';
```
`fetch("base", "replace")` does a full refresh: it replaces the previous data with the current API response. Use `"add"` instead if your API returns only new records and you want to accumulate them.
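The difference between the two modes can be pictured with plain Python lists (purely a mental model, not Bundlebase code):

```python
# "replace": each fetch overwrites the previous snapshot with the
# full API response, so the bundle always mirrors the source.
snapshot = ["week1-row1", "week1-row2"]     # first fetch
snapshot = ["week2-row1", "week2-row2"]     # second fetch replaces it
assert "week1-row1" not in snapshot

# "add": each fetch appends, so the bundle accumulates history.
accumulated = ["week1-row1", "week1-row2"]  # first fetch
accumulated += ["week2-row1"]               # API returned only new records
assert len(accumulated) == 3
```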
Authentication

The `headers` field takes `Name: Value` pairs, one per line. For tokens that rotate, you can template this via environment variables or a wrapper script that sets the header before calling `fetch()`.
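For example, one way to build the headers string from an environment variable so the token never lands in source control (the variable name `SERVICE_API_TOKEN` is just an illustration):

```python
import os

# Read the current token from the environment at publish time,
# so rotating it never requires a code change.
token = os.environ.get("SERVICE_API_TOKEN", "YOUR_TOKEN")

# The headers field is Name: Value pairs separated by newlines.
headers = f"Authorization: Bearer {token}\nAccept: application/json"

print(headers.splitlines()[0])  # e.g. "Authorization: Bearer <token>"
```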
Option B: Parameterized pulls¶
If your API takes date ranges or other parameters, use POST:
```python
import json
from datetime import date, timedelta

import bundlebase.sync as bb

week_ago = (date.today() - timedelta(days=7)).isoformat()

bundle = bb.open("s3://team-data/usage-metrics").extend()
bundle.create_source("http", {
    "url": "https://api.yourservice.com/v2/metrics/export",
    "method": "POST",
    "body": json.dumps({"since": week_ago, "format": "json"}),
    "headers": "Authorization: Bearer YOUR_TOKEN\nContent-Type: application/json",
    "json_record_path": "results"
})
bundle.fetch("base", "add")
bundle.commit(f"Usage metrics week of {week_ago}")
```
And in the SQL-style syntax, with `<week_ago>` as a placeholder to fill in before the script runs:

```sql
OPEN 's3://team-data/usage-metrics';
EXTEND;
CREATE SOURCE FOR base USING http WITH (
    url = 'https://api.yourservice.com/v2/metrics/export',
    method = 'POST',
    body = '{"since": "<week_ago>", "format": "json"}',
    headers = 'Authorization: Bearer YOUR_TOKEN\nContent-Type: application/json',
    json_record_path = 'results'
);
FETCH base ADD;
COMMIT 'Usage metrics week of <week_ago>';
```
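If you drive the SQL-style script from a scheduler, something has to substitute `<week_ago>` before execution. A minimal sketch of that step (the templating shown is plain Python, not a Bundlebase feature; the source-creation lines are omitted for brevity):

```python
from datetime import date, timedelta

week_ago = (date.today() - timedelta(days=7)).isoformat()

# The SQL-style script as a template; <week_ago> is replaced before running.
template = (
    "OPEN 's3://team-data/usage-metrics';\n"
    "EXTEND;\n"
    "FETCH base ADD;\n"
    "COMMIT 'Usage metrics week of <week_ago>';"
)

script = template.replace("<week_ago>", week_ago)
print(script.splitlines()[-1])
```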
Option C: Export from your database (for non-Python stacks)¶
If you're running Go, Java, Node, or anything else, you don't need the Python API at all to publish data. Export to Parquet or CSV and write it to a location Bundlebase can read:
```go
// Go example: write a Parquet file to S3, then let Bundlebase pick it up.
// writeParquetToS3 is your own helper; any Parquet writer works.
rows, err := db.Query("SELECT account_id, event_type, count, ts FROM usage WHERE ts > ?", lastWeek)
if err != nil {
    log.Fatal(err)
}
defer rows.Close()
writeParquetToS3(rows, "s3://team-data/raw/usage-2026-04-04.parquet")
```
Then a small Python script (or cron job) publishes the Parquet file as a bundle, following the same pattern as Option A but with a source pointing at the file instead of the API.
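A minimal sketch of that glue script. The connector name `"file"` and its options are assumptions for illustration; check the Sources page for the connector that matches your storage:

```python
from datetime import date


def parquet_key(day: date) -> str:
    """Date-stamped key matching the Go exporter's naming convention."""
    return f"s3://team-data/raw/usage-{day.isoformat()}.parquet"


def publish(day: date) -> None:
    # Imported here so parquet_key stays usable without Bundlebase installed.
    import bundlebase.sync as bb

    bundle = bb.open("s3://team-data/usage-metrics").extend()
    # "file" is an assumed connector name -- see the Sources docs
    # for the real file/Parquet source options.
    bundle.create_source("file", {"url": parquet_key(day)})
    bundle.fetch("base", "add")
    bundle.commit(f"Usage metrics for {day.isoformat()}")


print(parquet_key(date(2026, 4, 4)))  # s3://team-data/raw/usage-2026-04-04.parquet
```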
The data team gets a consistently structured, versioned dataset. You control the export format from your side; they consume it from theirs.
What the data team gets¶
Once you've committed the bundle, the data science team consumes it without involving you:
```python
import bundlebase.sync as bb

bundle = bb.open("s3://team-data/usage-metrics")

# See what's in it before loading
print(bundle.name)         # "Customer Usage Metrics"
print(bundle.description)  # what it covers, how it's sourced
print(bundle.num_rows)     # current row count
print(bundle.version)      # which commit they're on

# Pull into pandas
df = bundle.to_pandas()

# Or query directly — no need to load 10M rows to answer one question
top_accounts = bundle.query("""
    SELECT account_id, SUM(count) AS total_events
    FROM bundle
    WHERE event_type = 'api_call'
    GROUP BY account_id
    ORDER BY total_events DESC
    LIMIT 20
""").to_pandas()
```
They can also check what changed between updates through the bundle's commit history (see Versioning under Next steps).
Automating the weekly publish¶
Wrap the publish step in a script and run it from a cron job or CI pipeline:
```python
#!/usr/bin/env python3
# scripts/publish_usage_bundle.py
import os
from datetime import date

import bundlebase.sync as bb

API_TOKEN = os.environ["API_TOKEN"]  # supplied by the cron/CI environment

bundle = bb.open("s3://team-data/usage-metrics").extend()
bundle.create_source("http", {
    "url": "https://api.yourservice.com/v2/metrics/export",
    "headers": f"Authorization: Bearer {API_TOKEN}"
})
bundle.fetch("base", "replace")
bundle.commit(f"Weekly usage metrics — {date.today()}")
print(f"Published: {bundle.num_rows:,} rows at version {bundle.version}")
```
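A matching crontab entry might look like this (the schedule, paths, and inline token are illustrative; in practice, pull the token from a secrets manager rather than hard-coding it in the crontab):

```shell
# Every Monday at 06:00: publish last week's usage metrics.
0 6 * * 1  API_TOKEN=YOUR_TOKEN /usr/bin/python3 /opt/scripts/publish_usage_bundle.py >> /var/log/usage_bundle.log 2>&1
```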
Option D: Expose the bundle as a SQL server¶
If your consumers don't use Python at all (R, Julia, Metabase, DBeaver, any JDBC/ODBC tool), you can run an Arrow Flight SQL server directly on top of the bundle with a single command.

Once the server is running, any SQL client connects to localhost:32010 and queries `bundle` as a table. The data is read-only: consumers can't change what's committed.
From R:
```r
library(arrow)
conn <- flight_connect("localhost", 32010)
df <- flight_get(conn, "SELECT account_id, SUM(count) AS events FROM bundle GROUP BY account_id")
```
From Metabase, DBeaver, or any JDBC tool:
Use the Arrow Flight SQL JDBC driver. Connection URL: `jdbc:arrow-flight-sql://localhost:32010`. Query `bundle` as a table; the SQL syntax is the same as everywhere else.
From Go:
```go
// Arrow Flight SQL client from the Apache Arrow Go module
client, _ := flightsql.NewClient("localhost:32010", nil, nil,
    grpc.WithTransportCredentials(insecure.NewCredentials()))
info, _ := client.Execute(ctx, "SELECT * FROM bundle WHERE event_type = 'api_call' LIMIT 1000")
```
The bundle on S3 doesn't move. You run the server where the data lives — on a bastion host with S3 access, in a Docker container, in a GitHub Actions job — and point consumers at the port. No ETL, no data duplication, no new infrastructure to maintain.
Why this beats "just expose a CSV endpoint"¶
| Approach | Schema docs | Version history | Queryable | Self-service |
|---|---|---|---|---|
| Email CSV | No | No | No | No |
| CSV endpoint | No | No | Requires pandas | Barely |
| Shared database | Requires setup | Depends | Yes | Requires credentials |
| Bundlebase | Yes | Yes | Yes | `bb.open(path)` |
Next steps¶
- Sources — full HTTP connector options including POST/PUT and auth headers
- Versioning — commit history and how consumers track updates
- Custom Connectors — if you want deeper integration from Go, Java, or Rust