Data Generation

GenDB uses an LLM to generate all synthetic data values directly. The LLM receives your schema (table names, column names, data types, constraints) and produces realistic, semantically coherent data in JSON batches.

How It Works

1. Tables are processed in topological order (referenced tables first).
2. For each table, the LLM receives the full schema context, column types, constraints, and any config instructions.
3. Data is generated in batches of up to 50 rows per LLM call.
4. Generated values are type-coerced and validated before insertion.
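
The ordering and batching steps above can be sketched as follows. This is a minimal illustration, not GenDB's actual code: the function names, the `fk_deps` mapping, and the example tables are hypothetical, and the sketch assumes the dependency graph has no cycles.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

BATCH_SIZE = 50  # upper bound on rows per LLM call, as stated above


def generation_order(fk_deps: dict[str, set[str]]) -> list[str]:
    """Return table names so that every referenced table precedes its dependents.

    fk_deps maps each table to the set of tables it references (hypothetical shape).
    """
    return list(TopologicalSorter(fk_deps).static_order())


def batch_sizes(total_rows: int, batch_size: int = BATCH_SIZE) -> list[int]:
    """Split a row target into per-call batch sizes of at most batch_size."""
    full, rest = divmod(total_rows, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


# Hypothetical schema: orders references users, order_items references orders.
deps = {"users": set(), "orders": {"users"}, "order_items": {"orders"}}
order = generation_order(deps)
print(order)             # users first, order_items last
print(batch_sizes(120))  # two full batches and one partial batch
```

A request for 120 rows would therefore take three LLM calls per table under these assumptions.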

Config Overrides

You can provide per-column instructions in gendb.yaml to guide the LLM:

generation:
  tables:
    users:
      columns:
        status:
          generator: one_of
          values: [active, inactive, pending]
        bio:
          prompt: "Write a short professional bio for a tech company employee"
        sku:
          generator: regex
          format: "[A-Z]{3}-[0-9]{6}"

Override Types

Type	Config	LLM instruction
`one_of`	`generator: one_of` + `values`	"must be one of: [v1, v2, v3]"
`regex`	`generator: regex` + `format`	"must match the regex pattern: ..."
`skip`	`generator: skip`	Column excluded from generation
`prompt`	`prompt: "..."`	Direct instruction to the LLM
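
One way to picture the mapping in the table is a small function that turns a column's config into the instruction text. This is a hedged sketch: `is_skipped` and `instruction_for` are hypothetical names, and the exact wording GenDB sends to the LLM may differ from the strings shown in the table.

```python
from typing import Optional


def is_skipped(column_cfg: dict) -> bool:
    """True when the column is excluded from generation."""
    return column_cfg.get("generator") == "skip"


def instruction_for(column_cfg: dict) -> Optional[str]:
    """Return the instruction text for one column, or None if there is none."""
    generator = column_cfg.get("generator")
    if generator == "skip":
        return None  # skipped columns never reach the prompt
    if generator == "one_of":
        values = ", ".join(str(v) for v in column_cfg["values"])
        return f"must be one of: [{values}]"
    if generator == "regex":
        return f"must match the regex pattern: {column_cfg['format']}"
    # A free-form prompt is passed through unchanged.
    return column_cfg.get("prompt")


status_cfg = {"generator": "one_of", "values": ["active", "inactive", "pending"]}
sku_cfg = {"generator": "regex", "format": "[A-Z]{3}-[0-9]{6}"}
print(instruction_for(status_cfg))
print(instruction_for(sku_cfg))
```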

Column Rules

Column rules apply instructions across all tables based on column name patterns:

generation:
  column_rules:
    - pattern: "*_sku"
      generator: regex
      format: "[A-Z]{3}-[0-9]{6}"
    - pattern: "*_status"
      generator: skip

Patterns use glob-style matching:

Pattern	Matches
`*_email`	`user_email`, `admin_email`
`phone*`	`phone`, `phone_number`
`*name*`	`first_name`, `company_name`
`status`	`status` (exact match)
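
Glob matching of this kind can be tried out with Python's standard `fnmatch` module. The sketch below assumes GenDB's patterns behave like `fnmatch` globs and that the first matching rule wins; neither is confirmed by this page, and `first_matching_rule` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Rules mirroring the column_rules example above.
RULES = [
    {"pattern": "*_sku", "generator": "regex", "format": "[A-Z]{3}-[0-9]{6}"},
    {"pattern": "*_status", "generator": "skip"},
]


def first_matching_rule(column: str, rules: list[dict]) -> Optional[dict]:
    """Return the first rule whose glob pattern matches the column name."""
    for rule in rules:
        if fnmatchcase(column, rule["pattern"]):
            return rule
    return None  # no rule applies to this column


print(first_matching_rule("product_sku", RULES))
print(first_matching_rule("order_status", RULES))
# A bare "status" column has no "_status" suffix, so "*_status" does not match it.
print(first_matching_rule("status", RULES))
```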

Foreign Key Resolution

GenDB automatically resolves foreign key relationships:

1. Tables are generated in topological order: referenced tables first, then dependent tables.
2. FK values from parent tables are included in the LLM prompt so it can pick valid references.
3. Post-processing validates FK values and replaces any invalid ones.
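
The post-processing step might look roughly like the sketch below. How GenDB chooses a replacement is not stated here, so this example assumes a random pick from the parent table's keys. `repair_foreign_keys` is a hypothetical name, and the sketch treats a missing or `None` value like any other invalid reference.

```python
import random
from typing import Optional


def repair_foreign_keys(
    rows: list[dict],
    fk_column: str,
    parent_keys: list,
    rng: Optional[random.Random] = None,
) -> int:
    """Replace FK values that do not exist in the parent table, in place.

    Returns the number of values replaced. Values are assumed to be hashable
    scalars such as ints or strings.
    """
    if not parent_keys:
        raise ValueError("parent table has no keys to reference")
    rng = rng or random.Random()
    valid = set(parent_keys)
    replaced = 0
    for row in rows:
        if row.get(fk_column) not in valid:
            # Assumption: any valid parent key is an acceptable substitute.
            row[fk_column] = rng.choice(parent_keys)
            replaced += 1
    return replaced


orders = [{"user_id": 1}, {"user_id": 99}, {"user_id": 3}]
fixed = repair_foreign_keys(orders, "user_id", [1, 2, 3], random.Random(0))
print(fixed, orders)
```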

UNIQUE Constraint Handling

Unique constraints are communicated to the LLM in the prompt: it is instructed to generate unique values for every column with a UNIQUE index. Because the LLM can still repeat itself, a post-processing step then checks uniqueness with a tracker of the values already generated.
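
A tracker of this kind can be as simple as one set of seen values per column. The sketch below is illustrative only: `UniqueTracker` and `find_duplicates` are hypothetical names, and it only reports duplicates, because this page does not say what GenDB does with them.

```python
class UniqueTracker:
    """Remembers the values seen so far for each unique column."""

    def __init__(self) -> None:
        self._seen: dict[str, set] = {}

    def check(self, column: str, value) -> bool:
        """Record value and return True if it is new, False if it is a duplicate."""
        seen = self._seen.setdefault(column, set())
        if value in seen:
            return False
        seen.add(value)
        return True


def find_duplicates(
    rows: list[dict], unique_columns: list[str], tracker: UniqueTracker
) -> list[tuple[int, str]]:
    """Return (row index, column) pairs whose value was already seen."""
    duplicates = []
    for index, row in enumerate(rows):
        for column in unique_columns:
            if not tracker.check(column, row.get(column)):
                duplicates.append((index, column))
    return duplicates


tracker = UniqueTracker()
batch_1 = [{"email": "a@x.io"}, {"email": "b@x.io"}, {"email": "a@x.io"}]
dupes_1 = find_duplicates(batch_1, ["email"], tracker)
# The tracker persists across batches, so a repeat in a later batch is also caught.
dupes_2 = find_duplicates([{"email": "b@x.io"}], ["email"], tracker)
print(dupes_1, dupes_2)
```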