Snowflake Crawler

Introduction

The Snowflake crawler is a Python-based tool that crawls a Snowflake account or database to retrieve metadata about databases, schemas, tables, and columns. It saves this metadata in a CLOE-compliant database metadata format (JSON). The crawler can also update existing metadata repositories, making it ideal for maintaining up-to-date documentation of your Snowflake data structures.

Key Features:

  • Crawls databases, schemas, tables, and columns from Snowflake
  • Supports selective crawling (e.g., skip columns or tables)
  • Filters databases using wildcard patterns
  • Merges with existing metadata repositories while preserving IDs and custom attributes
  • Replaces database name patterns (useful for multi-environment setups)
  • Excludes system databases (SNOWFLAKE, SNOWFLAKE_SAMPLE_DATA) and system schemas (PUBLIC, INFORMATION_SCHEMA) automatically

Use Cases:

  • Creating initial metadata documentation for existing Snowflake databases
  • Maintaining synchronized metadata as database structures evolve
  • Working with CLOE modules that require Snowflake database metadata (e.g. RBAC module)
  • Multi-environment setups where database names include environment suffixes (DEV, TEST, PROD)

Requirements

  • Execution Environment:
    • DevOps Pipeline (cloud execution) OR Python 3.11+ (local execution)
  • Snowflake Access:
    • Snowflake technical user with appropriate privileges
    • Permissions to access databases, schemas, and tables
    • If retrieving column metadata: READ access to tables
  • DevOps (if applicable):
    • See the CLOE Util Snowflake Connector documentation for Azure DevOps integration

How It Works

The crawler operates in the following sequence (an illustrative code sketch follows the list):

  1. Connect to Snowflake: Establishes connection using provided credentials and warehouse
  2. Retrieve Databases: Fetches all databases (optionally filtered by pattern), excluding system databases
  3. Retrieve Schemas: For each database, fetches all schemas (excluding PUBLIC and INFORMATION_SCHEMA)
  4. Retrieve Tables: For each schema, fetches all tables (optionally filtered by table type)
  5. Retrieve Columns: For each table, fetches column metadata including data types and constraints (if not ignored)
  6. Transform Data: Applies name replacements and converts to CLOE metadata format
  7. Merge (if applicable): Merges with existing repository, preserving IDs and custom attributes
  8. Output: Writes the final metadata to disk as JSON
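
The sequence can be sketched as a short standalone script. The snippet below is illustrative only and is not the crawler's implementation: it uses the off-the-shelf snowflake-connector-python package (rather than cloe-util-snowflake-connector) and plain SHOW / INFORMATION_SCHEMA queries to mimic steps 1-5 with password authentication.

# Illustrative sketch of steps 1-5, not the crawler's own code.
import os

import snowflake.connector

EXCLUDED_DATABASES = {"SNOWFLAKE", "SNOWFLAKE_SAMPLE_DATA"}
EXCLUDED_SCHEMAS = {"PUBLIC", "INFORMATION_SCHEMA"}

# Step 1: connect with credentials and warehouse from the environment.
conn = snowflake.connector.connect(
    user=os.environ["CLOE_SNOWFLAKE_USER"],
    password=os.environ["CLOE_SNOWFLAKE_PASSWORD"],
    account=os.environ["CLOE_SNOWFLAKE_ACCOUNT"],
    warehouse=os.environ.get("CLOE_SNOWFLAKE_WAREHOUSE"),
)
cur = conn.cursor(snowflake.connector.DictCursor)

# Step 2: databases, minus system databases.
cur.execute("SHOW DATABASES")
databases = [r["name"] for r in cur if r["name"] not in EXCLUDED_DATABASES]

metadata = {"databases": []}
for db in databases:
    # Step 3: schemas, minus PUBLIC and INFORMATION_SCHEMA.
    cur.execute(f"SHOW SCHEMAS IN DATABASE {db}")
    schemas = [r["name"] for r in cur if r["name"] not in EXCLUDED_SCHEMAS]
    db_entry = {"name": db, "schemas": []}
    for schema in schemas:
        # Step 4: base tables only (drop the table_type filter to include views).
        cur.execute(
            f"SELECT table_name FROM {db}.INFORMATION_SCHEMA.TABLES "
            "WHERE table_schema = %s AND table_type = 'BASE TABLE'",
            (schema,),
        )
        tables = [r["TABLE_NAME"] for r in cur]
        # Step 5 would fetch column metadata per table from
        # {db}.INFORMATION_SCHEMA.COLUMNS in the same way.
        db_entry["schemas"].append({"name": schema, "tables": tables})
    metadata["databases"].append(db_entry)
conn.close()

The transform, merge, and output steps (6-8) are described under Output Format and Behavior Notes below.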

Flow Diagram

The exact crawler flow might differ based on use case and infrastructure.

Figure: snowflake-crawler-flow.png

Configuration

Snowflake Connection

The crawler uses the cloe-util-snowflake-connector module (~1.0.5) for all Snowflake connections. Connection parameters are configured via environment variables with the CLOE_SNOWFLAKE_ prefix.

Required Environment Variables:

  • CLOE_SNOWFLAKE_USER
  • CLOE_SNOWFLAKE_ACCOUNT
  • One of: CLOE_SNOWFLAKE_PASSWORD (deprecated by Snowflake) or CLOE_SNOWFLAKE_PRIVATE_KEY / CLOE_SNOWFLAKE_PRIVATE_KEY_FILE (recommended)

Optional Environment Variables:

  • CLOE_SNOWFLAKE_WAREHOUSE
  • CLOE_SNOWFLAKE_ROLE
  • CLOE_SNOWFLAKE_AUTOCOMMIT

For detailed connection setup, authentication methods (password vs. key pair), and Azure DevOps integration, see the CLOE Util Snowflake Connector documentation.
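
For quick local testing, the same variables can also be set from a small Python wrapper before invoking the crawler. This is only a convenience sketch: the values are placeholders, and key-pair authentication via CLOE_SNOWFLAKE_PRIVATE_KEY_FILE is assumed.

# Convenience wrapper: export the connector's environment variables,
# then invoke the crawler module. All values below are placeholders.
import os
import subprocess
import sys

os.environ.update(
    {
        "CLOE_SNOWFLAKE_USER": "CRAWLER_USER",
        "CLOE_SNOWFLAKE_ACCOUNT": "xy12345.eu-central-1",
        "CLOE_SNOWFLAKE_PRIVATE_KEY_FILE": "/path/to/rsa_key.p8",  # key-pair auth
        "CLOE_SNOWFLAKE_WAREHOUSE": "WH_XS",
        "CLOE_SNOWFLAKE_ROLE": "CRAWLER_ROLE",
    }
)

subprocess.run(
    [sys.executable, "-m", "cloe_snowflake_crawler", "crawl", "output/metadata.json"],
    check=True,
)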

Command-Line Arguments

  • output-json-path (positional, required): Path where the final JSON output will be saved
  • --ignore-columns (default: False): Skip column metadata retrieval; only retrieve database, schema, and table information
  • --ignore-tables (default: False): Skip table metadata retrieval; only retrieve database and schema information
  • --existing-model-path (default: None): Path to an existing CLOE repository JSON. If provided, the crawler merges new data with the existing data, preserving IDs and custom attributes
  • --database-filter (default: None): Filter databases using Snowflake wildcard patterns (e.g., COMPANY_% or %_PROD). If not set, all databases except system databases are retrieved
  • --database-name-replace (default: None): Regex pattern to replace parts of database names with the CLOE environment placeholder {{ CLOE_BUILD_CRAWLER_DB_REPLACEMENT }}. Useful for removing environment suffixes (e.g., _DEV, _PROD)
  • --delete-old-databases (default: False): When merging with an existing model, remove databases that no longer exist in Snowflake. If False, old databases are preserved in the output
  • --include-only-base-tables (default: True): Restrict the crawler to base tables (regular tables); when True, views and other table types are excluded

Note on Boolean Flags:

  • For flags that default to False (like --ignore-columns, --ignore-tables, --delete-old-databases), simply include the flag to set it to True. Example: --ignore-columns
  • For flags that default to True (like --include-only-base-tables), Typer automatically creates a --no- prefix version. Use --no-include-only-base-tables to set it to False.

Installation

Install the package using uv (recommended) or pip:

# Using uv
uv pip install cloe-snowflake-crawler

# Using pip
pip install cloe-snowflake-crawler

Usage

Basic Usage

Crawl all databases and save to a JSON file:

# Set environment variables
export CLOE_SNOWFLAKE_USER="your_user"
export CLOE_SNOWFLAKE_PASSWORD="your_password"
export CLOE_SNOWFLAKE_ACCOUNT="xy12345.eu-central-1"
export CLOE_SNOWFLAKE_WAREHOUSE="WH_XS"

# Run crawler
python -m cloe_snowflake_crawler crawl output/metadata.json

Or using uv:

uv run python -m cloe_snowflake_crawler crawl output/metadata.json

Example: Filter Specific Databases

Crawl only databases matching a pattern:

python -m cloe_snowflake_crawler crawl \
    output/metadata.json \
    --database-filter "COMPANY_%"

This will crawl only databases starting with COMPANY_ (e.g., COMPANY_SALES, COMPANY_HR).

Example: Ignore Column Details

Retrieve only database, schema, and table names without column details (faster):

python -m cloe_snowflake_crawler crawl \
    output/metadata.json \
    --ignore-columns

Example: Update Existing Repository

Merge new metadata with an existing repository file:

python -m cloe_snowflake_crawler crawl \
    output/updated_metadata.json \
    --existing-model-path existing/metadata.json

This preserves:

  • Existing database, schema, and table IDs
  • Custom display names
  • Custom table levels or other attributes

Example: Multi-Environment Setup

For environments where database names include suffixes (e.g., SALES_DEV, SALES_PROD):

python -m cloe_snowflake_crawler crawl \
    output/metadata.json \
    --database-filter "%_PROD" \
    --database-name-replace "_PROD$"

This will:

  1. Crawl only databases ending with _PROD
  2. Replace the _PROD suffix with the {{ CLOE_BUILD_CRAWLER_DB_REPLACEMENT }} placeholder

Example: Lightweight Schema-Only Crawl

Retrieve only database and schema information (no tables or columns):

python -m cloe_snowflake_crawler crawl \
    output/metadata.json \
    --ignore-tables

Example: Include Views and Materialized Views

By default, only base tables are included. To include all table types:

python -m cloe_snowflake_crawler crawl \
    output/metadata.json \
    --no-include-only-base-tables

Example: Full Refresh with Cleanup

Update an existing repository and remove databases that no longer exist:

python -m cloe_snowflake_crawler crawl \
    output/metadata.json \
    --existing-model-path existing/metadata.json \
    --delete-old-databases

Output Format

The crawler generates a JSON file following the CLOE metadata format:

{
  "databases": [
    {
      "id": "uuid-here",
      "name": "DATABASE_NAME",
      "display_name": "Database Name",
      "schemas": [
        {
          "id": "uuid-here",
          "name": "SCHEMA_NAME",
          "tables": [
            {
              "id": "uuid-here",
              "name": "TABLE_NAME",
              "columns": [
                {
                  "name": "COLUMN_NAME",
                  "ordinal_position": 1,
                  "is_nullable": true,
                  "data_type": "VARCHAR",
                  "data_type_length": 255
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
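
A downstream consumer can walk this structure with a few lines of Python. The sketch below simply prints fully qualified table names, using the field names shown above:

# Walk the crawler output and print DATABASE.SCHEMA.TABLE for every table.
import json

with open("output/metadata.json", encoding="utf-8") as fh:
    repo = json.load(fh)

for database in repo.get("databases", []):
    for schema in database.get("schemas", []):
        for table in schema.get("tables", []):
            print(f'{database["name"]}.{schema["name"]}.{table["name"]}')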

Behavior Notes

Automatic Exclusions

The crawler automatically excludes:

  • System databases: SNOWFLAKE, SNOWFLAKE_SAMPLE_DATA
  • System schemas: PUBLIC, INFORMATION_SCHEMA

Data Type Handling

The crawler extracts and parses Snowflake data types:

  • String types: VARCHAR(n), CHAR(n), STRING, TEXT → captures length
  • Numeric types: NUMBER(p,s), DECIMAL(p,s) → captures precision and scale
  • Other types: DATE, TIMESTAMP, BOOLEAN, etc. → captures type name
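
How such type strings split into a type name, a length, or a precision/scale pair is easiest to see with a small regex. The parser below is purely illustrative (it is not the crawler's code, and only data_type and data_type_length correspond to fields in the output format above):

# Illustrative parser for Snowflake type strings such as
# "VARCHAR(255)", "NUMBER(10,2)", or "TIMESTAMP_NTZ".
import re

TYPE_PATTERN = re.compile(r"^(?P<name>\w+)(?:\((?P<p1>\d+)(?:,\s*(?P<p2>\d+))?\))?$")

def parse_type(raw: str) -> dict:
    match = TYPE_PATTERN.match(raw.strip())
    if match is None:
        return {"data_type": raw}
    name, p1, p2 = match.group("name"), match.group("p1"), match.group("p2")
    parsed = {"data_type": name}
    if p2 is not None:        # NUMBER(p,s) / DECIMAL(p,s): precision and scale
        parsed["precision"] = int(p1)
        parsed["scale"] = int(p2)
    elif p1 is not None:      # VARCHAR(n) / CHAR(n): length
        parsed["data_type_length"] = int(p1)
    return parsed

print(parse_type("VARCHAR(255)"))  # {'data_type': 'VARCHAR', 'data_type_length': 255}
print(parse_type("NUMBER(10,2)"))  # {'data_type': 'NUMBER', 'precision': 10, 'scale': 2}
print(parse_type("DATE"))          # {'data_type': 'DATE'}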

ID Preservation

When using --existing-model-path:

  • Database, schema, and table IDs are preserved if they exist in the old repository
  • New objects receive new UUIDs
  • This ensures stable references in downstream CLOE modules
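
Conceptually, the merge keeps the old ID whenever an object with the same name is already present in the existing repository and only mints a UUID for new objects. Below is a simplified sketch of that rule; the crawler's actual merge also covers schemas, tables, and custom attributes.

# Simplified ID-preservation rule for the database level only.
import uuid

def merge_database_ids(crawled: list[dict], existing: list[dict]) -> list[dict]:
    existing_ids = {db["name"]: db["id"] for db in existing}
    for db in crawled:
        # Reuse the known ID, otherwise assign a fresh UUID.
        db["id"] = existing_ids.get(db["name"], str(uuid.uuid4()))
    return crawled

old = [{"id": "11111111-1111-1111-1111-111111111111", "name": "SALES"}]
new = [{"name": "SALES"}, {"name": "HR"}]
print(merge_database_ids(new, old))
# SALES keeps its old ID; HR receives a newly generated UUID.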

Name Replacement

The --database-name-replace parameter accepts Python regex patterns:

  • _DEV$ → matches _DEV at the end of the name
  • ^TEST_ → matches TEST_ at the start of the name
  • _(DEV|TEST|PROD)$ → matches any of these suffixes

Matched portions are replaced with the {{ CLOE_BUILD_CRAWLER_DB_REPLACEMENT }} placeholder.
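
In plain Python terms the replacement behaves like a re.sub over each database name. The snippet below only illustrates the pattern semantics (it is not the crawler's code); note that an underscore included in the pattern is consumed by the match, while a pattern without it leaves the separator in place:

# Illustration of --database-name-replace semantics.
import re

PLACEHOLDER = "{{ CLOE_BUILD_CRAWLER_DB_REPLACEMENT }}"

def apply_name_replace(database_name: str, pattern: str) -> str:
    return re.sub(pattern, PLACEHOLDER, database_name)

print(apply_name_replace("SALES_PROD", r"_PROD$"))
# SALES{{ CLOE_BUILD_CRAWLER_DB_REPLACEMENT }}   (underscore was part of the match)
print(apply_name_replace("SALES_PROD", r"PROD$"))
# SALES_{{ CLOE_BUILD_CRAWLER_DB_REPLACEMENT }}  (underscore kept outside the match)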

About the CLOE_BUILD_CRAWLER_DB_REPLACEMENT Placeholder:

This is a Jinja2-style template placeholder used in multi-environment deployments. When you have databases with environment-specific names (e.g., SALES_DEV, SALES_TEST, SALES_PROD), the crawler can replace the environment suffix with this placeholder. Later, during deployment via DevOps pipelines or CLOE build processes, the placeholder is substituted with the actual target environment name. This allows you to:

  1. Maintain a single metadata definition across environments
  2. Automatically adapt database references during deployment
  3. Keep environment-agnostic configuration in version control

Example flow:

  • Source database: SALES_PROD
  • After the crawler: SALES_{{ CLOE_BUILD_CRAWLER_DB_REPLACEMENT }}
  • After deployment to DEV: SALES_DEV

Troubleshooting

Connection Issues

Problem: Unable to connect to Snowflake

Solutions:

  • Verify the environment variables are set correctly
  • Check that the account identifier includes the region (e.g., xy12345.eu-central-1)
  • Ensure the warehouse is running and accessible

Permission Errors

Problem: Insufficient privileges to access database/schema/table

Solutions:

  • Verify the Snowflake role has the appropriate READ permissions
  • Check that the user can access the specified warehouse
  • If crawling specific databases, ensure the role has access to them

Missing Data

Problem: Expected databases/schemas/tables are missing from output

Solutions:

  • Check that the --database-filter pattern matches the database names
  • Verify the --include-only-base-tables setting (set it to False to include views)
  • Ensure the Snowflake role has visibility into the objects
  • Check that the missing databases aren't system databases (these are excluded automatically)

Performance

Problem: Crawler is slow

Solutions:

  • Use --ignore-columns to skip column metadata (significantly faster)
  • Use --database-filter to limit the scope
  • Use a larger warehouse for better performance
  • Use --ignore-tables if you only need database and schema information

Best Practices

  1. Use Database Filters: Limit crawling to relevant databases using --database-filter to improve performance
  2. Incremental Updates: Use --existing-model-path to preserve IDs and custom metadata
  3. Environment Separation: Use --database-name-replace with filters for multi-environment setups
  4. Appropriate Permissions: Use a dedicated Snowflake role with minimal required permissions
  5. Regular Refresh: Schedule regular crawls to keep metadata synchronized with database changes
  6. Start Small: Test with --database-filter on a few databases before crawling the entire account

Integration with CLOE

The Snowflake crawler generates metadata that can be consumed by other CLOE modules:

  • CLOE Data Catalog: Import metadata for data discovery and lineage
  • CLOE Documentation: Generate automatic database documentation
  • CLOE Access Control (RBAC): Map database structures to role-based access controls
  • CLOE ETL: Use metadata for schema validation and mapping

Environment Placeholder in DevOps

The {{ CLOE_BUILD_CRAWLER_DB_REPLACEMENT }} placeholder enables environment-agnostic metadata definitions. In DevOps pipelines (Azure DevOps, Jenkins, etc.), CLOE build tools automatically replace this placeholder with the target environment identifier during deployment. This means you can:

  • Crawl production metadata once
  • Use the same metadata JSON across all environments
  • Let the deployment pipeline inject the correct environment-specific database names
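
Conceptually, the deploy-time substitution is a template render over the placeholder. The snippet below illustrates the idea with Jinja2; in practice the replacement is performed by the CLOE build tooling inside the pipeline:

# Conceptual illustration of the deploy-time substitution; the real
# replacement is done by CLOE build tooling, not by this snippet.
from jinja2 import Template

crawled_name = "SALES_{{ CLOE_BUILD_CRAWLER_DB_REPLACEMENT }}"
target_environment = "DEV"  # injected by the pipeline for the target stage

rendered = Template(crawled_name).render(
    CLOE_BUILD_CRAWLER_DB_REPLACEMENT=target_environment
)
print(rendered)  # SALES_DEV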

This approach is particularly valuable when combined with Snowflake cloning or database replication strategies across DEV/TEST/PROD environments.