CLI Reference
The CLOE Synthetic Data Generator provides a command-line interface with five commands covering data generation, connection testing, configuration listing and validation, and table discovery.
Command Overview
| Command | Description |
|---|---|
| `generate` | Generate synthetic data from YAML configuration |
| `test-connection` | Test Databricks Connect connectivity |
| `list-configs` | List all configurations in a directory |
| `validate-config` | Validate a YAML configuration file |
| `discover` | Discover existing tables and generate configs |
Global Options
All commands support these global options:
| Option | Short | Description |
|---|---|---|
| `--help` | `-h` | Show help message |
| `--verbose` | `-v` | Enable verbose (debug) logging |
Commands
generate
Generate synthetic data from YAML configuration files and write it to Unity Catalog.
Usage
`cloe-synthetic-data-generator generate [OPTIONS]`
Options
| Option | Short | Type | Description | Default |
|---|---|---|---|---|
| `--config` | `-c` | Path | Path to YAML configuration file | None |
| `--config-dir` | `-d` | Path | Directory containing YAML files | None |
| `--num-records` | `-n` | Integer | Override number of records from config | None |
| `--verbose` | `-v` | Flag | Enable verbose logging | False |
Mutual Exclusivity
You must specify either --config OR --config-dir, but not both.
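In a wrapper script, the rule above can be checked before the CLI is invoked. The following is a minimal sketch: `check_config_args` is a hypothetical helper, not part of the tool, and it only looks for the `--config`/`-c` and `--config-dir`/`-d` options from the table above.

```shell
# Hypothetical pre-flight check (not part of the CLI): succeeds only when
# exactly one of --config/-c or --config-dir/-d appears in the arguments.
check_config_args() {
  has_config=0
  has_config_dir=0
  for arg in "$@"; do
    case "$arg" in
      -c|--config) has_config=1 ;;
      -d|--config-dir) has_config_dir=1 ;;
    esac
  done
  if [ "$has_config" -eq 1 ] && [ "$has_config_dir" -eq 1 ]; then
    echo "error: use either --config or --config-dir, not both" >&2
    return 1
  fi
  if [ "$has_config" -eq 0 ] && [ "$has_config_dir" -eq 0 ]; then
    echo "error: one of --config or --config-dir is required" >&2
    return 1
  fi
  return 0
}

# Example use inside a script (requires the CLI to be installed):
#   check_config_args "$@" && cloe-synthetic-data-generator generate "$@"
```

The check is deliberately loose: it does not validate the option values, only which of the two options were given.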
Examples
Generate from single configuration:
Generate from single configuration with record override:
Generate from all configurations in directory:
Generate with verbose logging:
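The four cases above differ only in which options from the table are passed. A minimal sketch, assuming the CLI is installed and on `PATH`; the wrapper function names are hypothetical, and the paths passed to them are your own:

```shell
# Hypothetical wrappers around the documented `generate` options.
# Assumption: cloe-synthetic-data-generator is installed and on PATH.

# Single configuration ($1 = path to a YAML file)
generate_single() {
  cloe-synthetic-data-generator generate --config "$1"
}

# Single configuration with a record-count override ($2 = number of records)
generate_single_n() {
  cloe-synthetic-data-generator generate --config "$1" --num-records "$2"
}

# Every configuration in a directory ($1 = directory path)
generate_dir() {
  cloe-synthetic-data-generator generate --config-dir "$1"
}

# Single configuration with verbose logging
generate_verbose() {
  cloe-synthetic-data-generator generate --config "$1" --verbose
}
```

For instance, `generate_single_n ./configs/user_data.yaml 10` (a hypothetical path) would run `generate` with `--num-records 10` against that file.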
Sample Output
```text
Configuration: User Data Generation
┌──────────────┬──────────────────────┐
│ Property     │ Value                │
├──────────────┼──────────────────────┤
│ Target Table │ main.test_data.users │
│ Columns      │ 6                    │
│ Records      │ 1000                 │
│ Write Mode   │ overwrite            │
└──────────────┴──────────────────────┘
⠋ Connecting to Databricks...
✅ Connected to Databricks!
Sample generated data:
+----------------------+----------+---------+--------------------+----+--------------------+
|               user_id|first_name|last_name|               email| age|          created_at|
+----------------------+----------+---------+--------------------+----+--------------------+
|550e8400-e29b-41d4-...|      John|      Doe|  john.doe@email.com|  34| 2023-05-15 14:30:22|
+----------------------+----------+---------+--------------------+----+--------------------+
⠋ Writing to Unity Catalog...
⠋ Verifying write...
✅ Completed successfully!
🎉 Successfully generated 1000 records and wrote to main.test_data.users
```
test-connection
Test the connection to your Databricks workspace using Databricks Connect.
Usage
`cloe-synthetic-data-generator test-connection`
Examples
`cloe-synthetic-data-generator test-connection`
Sample Output
```text
⠋ Testing Databricks Connect connection...
✅ Connected! Reading sample data...
✅ Connection test completed!
✅ Successfully connected to Databricks!
Sample table 'samples.nyctaxi.trips' contains 1,547,741 rows
Sample data:
+--------+--------------------+-----+---+-----+
|vendor_id|pickup_datetime |...|
+--------+--------------------+-----+---+-----+
| 2|2016-12-31 15:15:00 |...|
+--------+--------------------+-----+---+-----+
```
Troubleshooting Connection Issues
If the connection fails, check:
- Databricks Connect configuration
- Workspace URL and access token
- Network connectivity
- Unity Catalog access permissions
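The second item can be checked from the shell when credentials are supplied through environment variables. This sketch assumes the standard Databricks variables `DATABRICKS_HOST` and `DATABRICKS_TOKEN`; if you authenticate with a profile in `~/.databrickscfg` or another method, the check does not apply. `check_databricks_env` is a hypothetical helper, not part of the tool.

```shell
# Hypothetical helper: reports which of the assumed credential variables
# are unset or empty. Profiles in ~/.databrickscfg are not inspected.
check_databricks_env() {
  missing=0
  for name in DATABRICKS_HOST DATABRICKS_TOKEN; do
    # Indirect lookup that works in POSIX sh: read the variable named by $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example use before a connection test (requires the CLI to be installed):
#   check_databricks_env && cloe-synthetic-data-generator test-connection
```

The helper only confirms that the variables are present; whether the values are valid is still decided by the connection test itself.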
list-configs
List all YAML configuration files in a directory with summary information.
Usage
`cloe-synthetic-data-generator list-configs [OPTIONS] DIRECTORY`
Arguments
| Argument | Type | Description |
|---|---|---|
| `DIRECTORY` | Path | Directory containing YAML configuration files |

Options
| Option | Short | Description | Default |
|---|---|---|---|
| `--verbose` | `-v` | Show detailed configuration information | False |
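Examples
Basic listing: `cloe-synthetic-data-generator list-configs ./configs/`
Detailed listing: `cloe-synthetic-data-generator list-configs ./configs/ --verbose`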
Sample Output
```text
Configurations in ./configs/
┌─────────────────┬─────────────────────────┬─────────┬─────────┐
│ Name            │ Target Table            │ Records │ Columns │
├─────────────────┼─────────────────────────┼─────────┼─────────┤
│ User Data Gen   │ main.test_data.users    │ 1000    │ 6       │
│ Employee Data   │ main.hr_data.employees  │ 5000    │ 12      │
│ Product Catalog │ main.inventory.products │ 10000   │ 8       │
└─────────────────┴─────────────────────────┴─────────┴─────────┘
Found 3 configuration(s)
```
```text
Detailed Configurations in ./configs/
┌───────────┬──────────────────────┬─────────┬─────────┬────────────┬──────────────────────┐
│ Name      │ Target               │ Records │ Columns │ Write Mode │ Column Names         │
├───────────┼──────────────────────┼─────────┼─────────┼────────────┼──────────────────────┤
│ User Data │ main.test_data.users │ 1000    │ 6       │ overwrite  │ user_id, first_name, │
│           │                      │         │         │            │ last_name, email...  │
└───────────┴──────────────────────┴─────────┴─────────┴────────────┴──────────────────────┘
```
validate-config
Validate a YAML configuration file for syntax errors and configuration completeness.
Usage
`cloe-synthetic-data-generator validate-config CONFIG_FILE`
Arguments
| Argument | Type | Description |
|---|---|---|
| `CONFIG_FILE` | Path | Path to YAML configuration file to validate |
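Examples
For example, to validate a file named `user_data.yaml` in the current directory: `cloe-synthetic-data-generator validate-config user_data.yaml`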
Sample Output
```text
✅ Configuration 'user_data.yaml' is valid!
Configuration Details
┌──────────────┬──────────────────────┐
│ Property     │ Value                │
├──────────────┼──────────────────────┤
│ Name         │ User Data Generation │
│ Target Table │ main.test_data.users │
│ Write Mode   │ overwrite            │
│ Records      │ 1000                 │
│ Batch Size   │ 1000                 │
│ Columns      │ 6                    │
└──────────────┴──────────────────────┘
Column Definitions
┌────────────┬───────────┬──────────┬───────────────────┐
│ Name       │ Type      │ Nullable │ Faker Function    │
├────────────┼───────────┼──────────┼───────────────────┤
│ user_id    │ string    │ No       │ uuid4             │
│ first_name │ string    │ No       │ first_name        │
│ last_name  │ string    │ No       │ last_name         │
│ email      │ string    │ No       │ email             │
│ age        │ integer   │ Yes      │ random_int        │
│ created_at │ timestamp │ No       │ date_time_between │
└────────────┴───────────┴──────────┴───────────────────┘
```
discover
Discover existing tables in a Databricks catalog/schema and automatically generate YAML configuration files.
Usage
`cloe-synthetic-data-generator discover [OPTIONS]`
Required Options
| Option | Short | Type | Description |
|---|---|---|---|
| `--catalog` | `-c` | String | Target catalog name in Unity Catalog |
| `--schema` | `-s` | String | Target schema name within the catalog |

Optional Options
| Option | Short | Type | Description | Default |
|---|---|---|---|---|
| `--table-regex` | `-t` | String | Regex pattern to filter table names | None (all tables) |
| `--output-dir` | `-o` | Path | Directory to write generated YAML files | `./discovered_configs` |
| `--num-records` | `-n` | Integer | Records to generate per table | 1000 |
| `--write-mode` | `-w` | String | Write mode for tables | `"overwrite"` |
| `--verbose` | `-v` | Flag | Enable verbose logging | False |
Examples
Discover all tables in a schema: `cloe-synthetic-data-generator discover --catalog main --schema hr_data`
Discover with custom output directory:
```bash
cloe-synthetic-data-generator discover \
  --catalog main \
  --schema hr_data \
  --output-dir ./my_configs/
```
Discover tables matching a pattern:
```bash
cloe-synthetic-data-generator discover \
  --catalog main \
  --schema hr_data \
  --table-regex "employee.*"
```
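To preview which table names a pattern would select, the expression can be tried locally first. This sketch assumes the pattern has to match the whole table name, and it uses `grep -E` (POSIX extended regular expressions), whose syntax may differ in places from the regex flavor the tool applies. `filter_tables` is a hypothetical helper and the table names are illustrative.

```shell
# Hypothetical helper: reads table names on stdin (one per line) and prints
# those matching the pattern in $1.
# Assumption: the pattern must match the whole name, hence the ^( )$ anchors.
filter_tables() {
  grep -E "^($1)\$" || true
}

# Prints "employees" and "employee_history", one per line.
printf '%s\n' employees employee_history departments salaries |
  filter_tables 'employee.*'
```

If the tool matches patterns anywhere in the name instead, drop the anchors in the helper to get the equivalent preview.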
Discover with custom record count and write mode:
```bash
cloe-synthetic-data-generator discover \
  --catalog main \
  --schema hr_data \
  --num-records 5000 \
  --write-mode append
```
Sample Output
```text
⠋ Connecting to Databricks...
✅ Connected to Databricks!
⠋ Found 3 tables...
Discovered Tables in main.hr_data
┌─────────────┬─────────┬──────────────────────────┐
│ Table Name  │ Columns │ Full Path                │
├─────────────┼─────────┼──────────────────────────┤
│ employees   │ 12      │ main.hr_data.employees   │
│ departments │ 4       │ main.hr_data.departments │
│ salaries    │ 6       │ main.hr_data.salaries    │
└─────────────┴─────────┴──────────────────────────┘
⠋ Generating YAML configurations...
⠋ Writing YAML files...
✅ Discovery completed!
Discovery Results
┌───────────────────┬──────────────────────┐
│ Property          │ Value                │
├───────────────────┼──────────────────────┤
│ Tables Discovered │ 3                    │
│ Configs Generated │ 3                    │
│ Output Directory  │ ./discovered_configs │
│ Records per Table │ 1000                 │
│ Write Mode        │ overwrite            │
└───────────────────┴──────────────────────┘
Generated Configuration Files
┌──────────────────────────────────────┬──────────────────────────┐
│ File                                 │ Table                    │
├──────────────────────────────────────┼──────────────────────────┤
│ main_hr_data_employees_config.yaml   │ main.hr_data.employees   │
│ main_hr_data_departments_config.yaml │ main.hr_data.departments │
│ main_hr_data_salaries_config.yaml    │ main.hr_data.salaries    │
└──────────────────────────────────────┴──────────────────────────┘
🎉 Successfully discovered 3 tables and generated 3 YAML configuration files in ./discovered_configs
```
Generated File Names
The discover command generates files using this naming pattern: `{catalog}_{schema}_{table}_config.yaml`
Examples:
- main_hr_data_employees_config.yaml
- dev_test_users_config.yaml
- prod_analytics_events_config.yaml
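The example names above all join catalog, schema, and table with underscores and end in `_config.yaml`. A minimal sketch of that shape, useful when a script needs to locate the file for a given table; `config_filename` is a hypothetical helper, and it does not account for any extra character handling the tool may apply to unusual names:

```shell
# Hypothetical helper mirroring the file names listed above:
# <catalog>_<schema>_<table>_config.yaml
config_filename() {
  printf '%s_%s_%s_config.yaml\n' "$1" "$2" "$3"
}

config_filename main hr_data employees   # prints main_hr_data_employees_config.yaml
```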
Command Combinations
You can combine commands in workflows:
Development Workflow
```bash
# 1. Discover existing tables
cloe-synthetic-data-generator discover --catalog main --schema test_data
# 2. List the generated configurations
cloe-synthetic-data-generator list-configs ./discovered_configs --verbose
# 3. Validate a specific configuration
cloe-synthetic-data-generator validate-config ./discovered_configs/main_test_data_users_config.yaml
# 4. Generate a small test dataset
cloe-synthetic-data-generator generate \
  --config ./discovered_configs/main_test_data_users_config.yaml \
  --num-records 10
# 5. Generate the full dataset
cloe-synthetic-data-generator generate \
  --config-dir ./discovered_configs
```
Production Workflow
```bash
# 1. Test connection
cloe-synthetic-data-generator test-connection
# 2. Validate all configurations
for config in configs/*.yaml; do
  cloe-synthetic-data-generator validate-config "$config"
done
# 3. Generate data with verbose logging
cloe-synthetic-data-generator generate \
  --config-dir ./configs \
  --verbose
```
Error Handling
On failure the CLI prints an error message and exits with a non-zero code:
| Exit Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error (configuration, connection, etc.) |
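Because every failure maps to exit code 1, scripts can only distinguish success from failure, not the kind of failure. A minimal sketch of a wrapper that surfaces the code and stops a script at the first failing step; `run_step` is a hypothetical helper, not part of the tool:

```shell
# Hypothetical wrapper: runs a command, reports a non-zero exit code on
# stderr, and passes that code back to the caller.
run_step() {
  status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "failed (exit code $status): $*" >&2
  fi
  return "$status"
}

# Example use in a script (requires the CLI to be installed):
#   run_step cloe-synthetic-data-generator test-connection || exit 1
#   run_step cloe-synthetic-data-generator generate --config-dir ./configs || exit 1
```

To tell a configuration error from a connection error, the message printed by the CLI still has to be read, since the exit code is the same for both.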
Common Error Messages
Configuration Errors
Solution: Check YAML syntax and ensure all required fields are present.
Connection Errors
Solution: Verify Databricks Connect setup and credentials.
Permission Errors
Solution: Check Unity Catalog permissions for the target location.
File Not Found
Solution: Verify file path and ensure file exists.
Performance Tips
Large Datasets
For generating large datasets:
```bash
# Use larger batch sizes (reduce memory pressure)
# Edit your config file to set: batch_size: 10000
# Monitor progress with verbose logging
cloe-synthetic-data-generator generate \
  --config large_dataset.yaml \
  --verbose
```
Multiple Configurations
For processing many configurations:
```bash
# Place all configs in a directory
cloe-synthetic-data-generator generate --config-dir ./configs/
# This is more efficient than running generate multiple times
```
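Resource Management
Monitor resource usage:
```sql
-- Check generated tables after a run (Databricks SQL).
-- system.information_schema.tables has no row-count column,
-- so list the tables here and count rows per table separately.
SELECT
  table_catalog,
  table_schema,
  table_name,
  last_altered
FROM system.information_schema.tables
WHERE table_catalog = 'main'
  AND table_schema = 'your_schema';

-- Row count for one table (replace your_table with a real table name)
SELECT COUNT(*) AS row_count FROM main.your_schema.your_table;
```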
Next Steps
- 📚 Configuration Guide - Learn about YAML configuration options
- 🔍 Table Discovery - Deep dive into auto-discovery features
- 🎭 Faker Integration - Explore advanced Faker capabilities
- 📊 Examples - See real-world usage examples