# Bundlebase

> Docker for data. Bundle, version, and query datasets without database infrastructure.

Bundlebase packages data files (CSV, Parquet, JSON) from any source (local, S3, HTTP, SFTP) into versioned bundles with a built-in SQL query engine, persistent data hygiene rules, custom connectors, custom SQL functions, and Python/CLI/MCP interfaces. Consumers access bundles via Python, CLI REPL, Arrow Flight SQL server, or MCP tool server — no database required.

## Docs

- [Why Bundlebase?](https://nvoxland.github.io/bundlebase/getting-started/why-bundlebase/): Comparisons to DVC, Delta Lake, plain files, databases
- [Python Quick Start](https://nvoxland.github.io/bundlebase/getting-started/python/quick-start/): Create a bundle, attach data, query with SQL
- [CLI Quick Start](https://nvoxland.github.io/bundlebase/getting-started/cli/quick-start/): Interactive REPL usage
- [Basic Concepts](https://nvoxland.github.io/bundlebase/getting-started/basic-concepts/): Read-only vs mutable, lazy evaluation, sources, versioning
- [SQL Reference](https://nvoxland.github.io/bundlebase/sql-reference/): Full command syntax
- [Python API — Sync](https://nvoxland.github.io/bundlebase/api/python/sync-api/): Full sync API reference
- [Python API — Async](https://nvoxland.github.io/bundlebase/api/python/async-api/): Full async API reference
- [CLI REPL Guide](https://nvoxland.github.io/bundlebase/guide/cli-repl/): REPL flags and meta-commands
- [SQL Server Guide](https://nvoxland.github.io/bundlebase/guide/cli-serve/): Arrow Flight SQL server for BI tools
- [MCP Server Guide](https://nvoxland.github.io/bundlebase/guide/cli-mcp/): MCP tool server for AI assistants
- [Sources Guide](https://nvoxland.github.io/bundlebase/guide/sources/): HTTP, S3, SFTP, fetch modes
- [Custom Connectors](https://nvoxland.github.io/bundlebase/guide/custom-connectors/): Python, Go, Java, native connector SDK
- [Custom Functions](https://nvoxland.github.io/bundlebase/guide/functions/): Scalar and aggregate
UDFs
- [Examples](https://nvoxland.github.io/bundlebase/examples/): Copy-paste examples by task
- [Use Cases](https://nvoxland.github.io/bundlebase/use-cases/): End-to-end scenarios by role

## Quick Reference

### Installation

```bash
pip install bundlebase               # base (PyArrow only)
pip install "bundlebase[pandas]"     # with pandas
pip install "bundlebase[jupyter]"    # with pandas + Jupyter support
```

CLI binary (no Python required):

```bash
# macOS / Linux — always downloads latest release
curl -L https://github.com/nvoxland/bundlebase/releases/latest/download/bundlebase-aarch64-apple-darwin.tar.gz | tar xz
sudo mv bundlebase /usr/local/bin/
```

### Python API (bundlebase.sync)

```python
import bundlebase.sync as bb

# Factory functions
bundle = bb.create("s3://bucket/path")    # new bundle — mutable
bundle = bb.create("memory:///")          # in-memory (no persistence)
bundle = bb.open("s3://bucket/path")      # existing bundle — read-only

# Make a read-only bundle mutable
bundle = bb.open("s3://bucket/path").extend()

# Data operations (all return self for chaining)
bundle.attach("data.parquet")
bundle.attach("s3://other/data.csv")
bundle.attach("https://example.com/feed.json")
bundle.filter("amount > 0")                                  # WHERE clause string
bundle.filter("amount > $1 AND status = $2", [0, "active"])  # parameterized
bundle.join("orders", on="id = order_id", location="orders.parquet", how="inner")
bundle.drop_column("ssn")
bundle.rename_column("fname", "first_name")
bundle.add_column("full_name", "first_name || ' ' || last_name")
bundle.cast_column("amount", "float64")
bundle.normalize_column_names()           # "Customer Id" → "customer_id"
bundle.set_name("My Bundle")
bundle.set_description("Description")

# Persistent data hygiene rules (fire on every future attach)
bundle.always_delete("WHERE amount < 0")
bundle.always_delete("WHERE status = 'test'")
bundle.always_update("SET currency = UPPER(currency)")
bundle.always_update("SET region = 'EMEA' WHERE region IN ('Europe', 'EU')")

# Sources and fetch
bundle.create_source("http", {"url": "https://api.example.com/data", "format": "json", "json_record_path": "records"})
bundle.create_source("remote_dir", {"url": "s3://bucket/data/", "patterns": "**/*.parquet"})
bundle.create_source("sftp_directory", {"url": "sftp://host/path/", "patterns": "*.csv", "username": "user", "key_path": "/key"})
bundle.fetch("base", "add")       # add new files only
bundle.fetch("ref", "update")     # add new + refresh changed
bundle.fetch("snap", "sync")      # full replacement

# Custom functions
bundle.import_temp_function("acme.score", "python::my_module:score_fn")  # Python (session-only)
bundle.import_function("acme.score", "ipc::./score_binary")              # persistent (non-Python runtimes)
bundle.drop_function("acme.score")

# Custom connectors
bundle.import_temp_connector("acme.salesforce", "python::sf_connector:SalesforceSource")
bundle.import_connector("acme.weather", "ipc::./weather_binary")
bundle.drop_connector("acme.weather")

# Views
bundle.create_view("active_users", "SELECT * FROM bundle WHERE active = true")

# Versioning
bundle.commit("Added customer data")
bundle.reset()     # discard uncommitted changes

# Querying
result = bundle.query("SELECT region, SUM(amount) FROM bundle GROUP BY region")
df = result.to_pandas()
df = result.to_polars()

# Direct export
df = bundle.to_pandas()
df = bundle.to_polars()
d = bundle.to_dict()
a = bundle.to_numpy()
for batch in bundle.stream_batches():   # constant memory, any dataset size
    process(batch)

# Introspection
bundle.name          # str or None (property)
bundle.description   # str or None (property)
bundle.version       # 12-char hex string (property)
bundle.url           # storage path (property)
bundle.num_rows      # int (property)
bundle.schema        # PySchema — .fields, .field(name)
bundle.history()     # list of commits
bundle.status()      # list of uncommitted operations
```

### SQL Commands (REPL / CLI)

```sql
-- Bundle lifecycle
CREATE '<url>'
OPEN '<url>'
EXTEND

-- Data
ATTACH '<path>' [TO <name>]
DETACH '<name>'
FILTER WITH SELECT * FROM bundle WHERE <condition>

-- Schema
JOIN '<location>' AS <name>
ON <condition>
DROP JOIN <name>
DROP COLUMN <name>
RENAME COLUMN <old> TO <new>
CAST COLUMN <name> AS <type>
NORMALIZE COLUMN NAMES
ADD COLUMN <name> AS <expr>
CREATE VIEW <name> AS <query>
DROP VIEW <name>

-- Persistent hygiene rules
ALWAYS DELETE WHERE <condition>
ALWAYS UPDATE SET <column> = <expr> [WHERE <condition>]
SHOW ALWAYS RULES

-- Sources
CREATE SOURCE <name> [FOR <connector>] USING <type> [WITH (<key> = '<value>', ...)]
FETCH <name> [DRY RUN]
FETCH ALL [DRY RUN]

-- Functions
IMPORT TEMP FUNCTION <name> FROM '<runtime>::<path>'
IMPORT FUNCTION <name> FROM '<runtime>::<path>'
DROP FUNCTION <name>

-- Connectors
IMPORT TEMP CONNECTOR <name> FROM '<runtime>::<path>'
IMPORT CONNECTOR <name> FROM '<runtime>::<path>'
DROP CONNECTOR <name>

-- Indexes
CREATE INDEX <name>
DROP INDEX <name>

-- Versioning
COMMIT '<message>'
RESET
UNDO [LAST <n>]

-- Metadata
SET NAME '<name>'
SET DESCRIPTION '<text>'

-- Inspection
SHOW STATUS
SHOW SCHEMA
SHOW HISTORY
SHOW ALWAYS RULES
EXPLAIN [ANALYZE] [VERBOSE] <query>
SELECT * FROM search('', '')
```

Runtimes for functions/connectors: `python` (temp/session-only), `ipc`, `ffi`, `java`, `docker`.

### CLI

```bash
# Set up AI assistant integration (configures MCP + installs agent skills)
bundlebase setup-agent                   # project-level
bundlebase setup-agent --scope global    # user-level

# Interactive REPL
bundlebase repl --bundle ./my-bundle
bundlebase repl --bundle s3://bucket/path

# Non-interactive (agent/script friendly)
bundlebase query --bundle ./sales --sql "SELECT * FROM bundle LIMIT 5" --format json
bundlebase create --bundle ./sales
bundlebase extend --bundle ./sales --execute "ATTACH 'data.csv'" --execute "COMMIT 'load'"

# SQL server (Arrow Flight — connect from Metabase, DBeaver, R, Julia, Go)
bundlebase serve --bundle s3://bucket/path --port 32010

# MCP tool server (for AI assistants)
bundlebase mcp --bundle s3://bucket/path

# Report generation
bundlebase generate-report --input report.md --output report.pdf
```

### Custom Connector (Python)

```python
from bundlebase_sdk.source import Connector
from bundlebase_sdk.types import Location

class MyConnector(Connector):
    def discover(self, attached_locations: list[str], **kwargs: str) -> list[Location]:
        """Return list of available data locations."""
        return [Location(location="data.parquet",
                        must_copy=True, format="parquet")]

    def data(self, location: Location, **kwargs: str):
        """Return pa.Table, pa.RecordBatch, list[dict], or None."""
        ...
```

Register: `bundle.import_temp_connector("acme.my_conn", "python::my_module:MyConnector")`
Use: `bundle.create_source("acme.my_conn", {"key": "value"})`

### Custom SQL Function (Python scalar)

```python
import pyarrow as pa
import pyarrow.compute as pc

def double_val(col: pa.Array) -> pa.Array:
    return pc.multiply(col, 2)

def bundlebase_metadata():
    return [{"name": "double_val",
             "input_types": ["Int64"],
             "return_type": "Int64",
             "kind": "scalar"}]
```

Register: `bundle.import_temp_function("acme.double_val", "python::my_module:double_val")`
Use in SQL: `SELECT acme.double_val(amount) FROM bundle`