Joins¶

You can create a bundle built up of data in different files joined together, even files of different types.

When joining, you must specify a unique name for the join and an on expression defining how the data relates. Use bundle. to reference columns from the bundle's existing data in the join expression.

You can optionally specify a location for the initial data and a how for the join type ("inner", "left", "right", or "full").

Basic Usage¶

Async APISync APISQL

import bundlebase as bb

bundle = await bb.create("my/data")
await bundle.attach("customers.parquet")

# Join with orders data
await bundle.join("orders", on="bundle.customer_id = orders.id",
             location="s3://external/orders.parquet")
print(await bundle.to_pandas())

import bundlebase.sync as bb

bundle = bb.create("my/data")
bundle.attach("customers.parquet")

# Join with orders data
bundle.join("orders", on="bundle.customer_id = orders.id",
       location="s3://external/orders.parquet")

print(bundle.to_pandas())

JOIN 'orders.parquet' AS orders ON bundle.customer_id = orders.id

Joining With Another Bundle¶

You can join with the query output of another committed bundle using a bundle:// URL. This reads the target bundle's full query output — including any filters, column operations, and joins that have been applied.

For filesystem bundles, use bundle:// followed by the path:

SQLAsync APISync API

JOIN 'bundle:///path/to/other/bundle' AS other ON bundle.id = other.id

await bundle.join("other", on="bundle.id = other.id",
             location="bundle:///path/to/other/bundle")

bundle.join("other", on="bundle.id = other.id",
       location="bundle:///path/to/other/bundle")

For remote bundles (S3, etc.), use the compound scheme bundle+<scheme>://:

SQLAsync APISync API

JOIN 'bundle+s3://bucket/path/to/bundle' AS other ON bundle.id = other.id

await bundle.join("other", on="bundle.id = other.id",
             location="bundle+s3://bucket/path/to/bundle")

bundle.join("other", on="bundle.id = other.id",
       location="bundle+s3://bucket/path/to/bundle")

Note

The target bundle must be committed. The joined data reflects the target's full query output at read time — including any filters, column operations, and joins that have been applied.

Join Without Initial Data¶

Create a join point without initial data. Data can be attached later using attach() with the pack parameter or via sources.

Async APISync APISQL

await bundle.join("orders", on="bundle.customer_id = orders.id")

# Attach data to the join later
await bundle.attach("orders.parquet", pack="orders")

bundle.join("orders", on="bundle.customer_id = orders.id")

# Attach data to the join later
bundle.attach("orders.parquet", pack="orders")

JOIN AS orders ON bundle.customer_id = orders.id

Join Types¶

Async APISync APISQL

# Left join (keep all customers, even without orders)
await bundle.join("orders", on="bundle.customer_id = orders.id",
             location="orders.parquet", how="left")

# Full outer join
await bundle.join("all_data", on="bundle.id = all_data.id",
             location="other.parquet", how="full")

bundle.join("orders", on="bundle.customer_id = orders.id",
       location="orders.parquet", how="left")

bundle.join("all_data", on="bundle.id = all_data.id",
       location="other.parquet", how="full")

LEFT JOIN 'orders.parquet' AS orders ON bundle.customer_id = orders.id

Drop Join¶

Remove an existing join from the bundle.

Async APISync APISQL

await bundle.drop_join("orders")

bundle.drop_join("orders")

DROP JOIN orders

Rename Join¶

Rename an existing join.

Async APISync APISQL

await bundle.rename_join("orders", new_name="customer_orders")

bundle.rename_join("orders", new_name="customer_orders")

RENAME JOIN orders TO customer_orders

Note

Until you commit, join changes will not be used when the bundle is reopened.