Text Search¶

Bundlebase includes full-text search powered by Tantivy with BM25 relevance ranking. Text indexes can span one or more columns, and search results include a _score column for ranking.

Creating a Text Index¶

Use create_index() with one or more columns and index_type="text". You can optionally provide a name for the index (used when querying). If no name is provided, indexes are auto-named as idx_{columns joined by _}.

Async APISync API

# Single-column text index (auto-named "idx_description")
await bundle.create_index("description", "text")

# Multi-column text index (auto-named "idx_title_description")
await bundle.create_index(["title", "description"], "text")

# With an explicit name
await bundle.create_index(["title", "description"], "text", name="product_search")

# With a custom tokenizer
await bundle.create_index("content", "text", args={"tokenizer": "en_stem"})

await bundle.commit("Created text indexes")

bundle.create_index("description", "text")
bundle.create_index(["title", "description"], "text")
bundle.create_index(["title", "description"], "text", name="product_search")
bundle.create_index("content", "text", args={"tokenizer": "en_stem"})

bundle.commit("Created text indexes")

Note

Until you commit, the index will not be used when the bundle is reopened.

Multi-Column Indexes¶

A multi-column index stores all specified columns in a single search index. This lets you search across columns in a single query using field-specific syntax, and get a unified relevance score.

await bundle.create_index(["title", "description", "tags"], "text", name="product_search")

Querying¶

Use the search() table function to perform full-text search. It replaces FROM bundle in your SQL. You can call it with one or two arguments:

search('query') — When only one text index exists, you can omit the index name.
search('index_name', 'query') — Specify which index to search when multiple exist.

Results include all columns from the bundle plus a _score column with the BM25 relevance score.

SQLAsync APISync API

-- Basic search (when only one text index exists)
SELECT * FROM search('machine learning')

-- Specify index name
SELECT * FROM search('desc_search', 'machine learning')

-- Order by relevance
SELECT title, description, _score
FROM search('product_search', 'machine learning')
ORDER BY _score DESC
LIMIT 10

-- Additional WHERE filters on top of search results
SELECT * FROM search('product_search', 'machine learning')
WHERE category = 'AI'

results = await bundle.query(
    "SELECT title, _score FROM search('product_search', 'machine learning') ORDER BY _score DESC"
)
df = await results.to_polars()

results = bundle.query(
    "SELECT title, _score FROM search('product_search', 'machine learning') ORDER BY _score DESC"
)
df = results.to_polars()

Query Syntax¶

The query string uses Tantivy query syntax:

Syntax	Example	Description
Terms	`machine learning`	Matches rows containing any of the words (OR by default)
Required terms	`+machine +learning`	Matches rows containing both words
Phrases	`"machine learning"`	Matches the exact phrase
Field-specific	`title:learning`	Searches only the specified column
Combined fields	`+title:machine +description:learning`	Different terms in different columns

Field-Specific Queries¶

With multi-column indexes, you can target specific columns using the column:term syntax. The column names match the columns specified when creating the index.

-- Search only the title column
SELECT * FROM search('product_search', 'title:learning')

-- Require matches in both title and description
SELECT * FROM search('product_search', '+title:machine +description:learning')

When no field is specified, all columns in the index are searched.

Result Limit¶

By default, search() returns up to 10,000 results when no SQL LIMIT clause is specified. If your query matches more documents than this, results are truncated to the top 10,000 by relevance score. Always use LIMIT for best performance and predictable results:

SELECT * FROM search('product_search', 'common term')
ORDER BY _score DESC
LIMIT 100

The Score Column¶

Every search() query returns a _score column (Float64) with the BM25 relevance score. Higher scores indicate better matches. Use ORDER BY _score DESC to rank results by relevance.

SELECT title, _score
FROM search('product_search', 'machine learning')
ORDER BY _score DESC
LIMIT 10

Tokenizers¶

When creating a text index, you can specify a tokenizer to control how text is split and normalized. The default is simple.

Tokenizer	Aliases	Description
`simple`		Splits on whitespace and lowercases. Good general-purpose default.
`raw`		No tokenization — treats the entire field value as a single token. Use for exact-match fields like IDs or codes.
`en_stem`	`english_stem`, `english`	English stemming with stop word removal. Matches word variants (e.g., "running" matches "run").
`de_stem`	`german_stem`, `german`	German stemming with stop word removal.
`fr_stem`	`french_stem`, `french`	French stemming with stop word removal.
`es_stem`	`spanish_stem`, `spanish`	Spanish stemming with stop word removal.
`it_stem`	`italian_stem`, `italian`	Italian stemming with stop word removal.
`pt_stem`	`portuguese_stem`, `portuguese`	Portuguese stemming with stop word removal.
`nl_stem`	`dutch_stem`, `dutch`	Dutch stemming with stop word removal.
`sv_stem`	`swedish_stem`, `swedish`	Swedish stemming with stop word removal.
`no_stem`	`norwegian_stem`, `norwegian`	Norwegian stemming with stop word removal.
`da_stem`	`danish_stem`, `danish`	Danish stemming with stop word removal.
`fi_stem`	`finnish_stem`, `finnish`	Finnish stemming with stop word removal.
`ru_stem`	`russian_stem`, `russian`	Russian stemming with stop word removal.