# Common Issues
This page contains solutions to frequently encountered problems when working with Cloe Nessy pipelines.
## Table and Database Issues

### "Table not found" Error
Problem: Pipeline fails with a message saying the table doesn't exist.
Solutions:
- Verify your table names are correct and fully qualified: `catalog_name.schema_name.table_name`
- Check that you have the right catalog and schema names for your environment
- Ensure the source table creation step completed successfully
- Test table access manually with `spark.table("your_table_name").count()` (see the snippet below)
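A quick manual check (a minimal sketch; the table name is a placeholder, and three-part names require Unity Catalog support in your runtime):

```python
# Check whether the fully qualified table is visible to your session
table_name = "your_catalog.your_schema.your_table"  # placeholder
if spark.catalog.tableExists(table_name):
    print(f"{table_name}: {spark.table(table_name).count()} rows")
else:
    print(f"{table_name} not found -- check catalog/schema names and permissions")
```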
"Permission denied" Error¶
Problem: Access denied when reading from or writing to Unity Catalog tables.
Solutions:

- Confirm you have read/write permissions to the specified catalog and schema (see the check below)
- Try using a different catalog/schema that you have access to
- Contact your administrator to verify your Unity Catalog permissions
- Check if you're connected to the correct workspace/environment
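You can inspect the grants on a table directly with SQL (a sketch; the table name is a placeholder):

```python
# List the privileges granted on the table in Unity Catalog
spark.sql("SHOW GRANTS ON TABLE your_catalog.your_schema.your_table").show(truncate=False)
```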
## Environment Variables

### "Environment variable not found" Error
Problem: Pipeline can't find the table names you set as environment variables.
Solutions:

- Make sure you ran the environment variable setup completely:

```python
import os

os.environ["TABLE_SOURCE"] = "your_source_table"
os.environ["TABLE_RESULT"] = "your_result_table"
```

- Check that the variable names match exactly between your setup code and your YAML configuration (`TABLE_SOURCE` and `TABLE_RESULT`)
- Verify variables are set correctly:

```python
# Check environment variables
import os

required_vars = ["TABLE_SOURCE", "TABLE_RESULT"]
for var in required_vars:
    print(f"{var}: {os.environ.get(var, 'NOT SET')}")
```
## Pipeline Configuration Issues

### YAML Parsing Errors
Problem: Pipeline fails to parse the YAML configuration.
Solutions:

- Check YAML indentation - use spaces consistently, not tabs
- Ensure all quotes and brackets match properly
- Verify special characters in strings are properly escaped
- Test your YAML syntax with an online YAML validator, or locally as shown after the example below
Example of correct YAML formatting:
```yaml
name: my_pipeline
steps:
  step_name:
    action: READ_CATALOG_TABLE
    options:
      table_identifier: "{{env:TABLE_SOURCE}}"
```
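To validate locally with PyYAML (a sketch; assumes PyYAML is installed and the config is saved at a placeholder path):

```python
# Parse the pipeline config locally to surface YAML syntax errors early
import yaml

with open("pipeline.yaml") as f:  # placeholder path
    try:
        yaml.safe_load(f)
        print("✓ YAML parses cleanly")
    except yaml.YAMLError as err:
        print(f"YAML error: {err}")
```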
### Step Execution Failures
Problem: Individual pipeline steps fail during execution.
Solutions:
- Review step names in YAML - they must be unique within the pipeline
- Ensure action names are spelled correctly (e.g., `READ_CATALOG_TABLE`)
- Check that all required options are provided for each action
- Verify data flows correctly between steps - each step uses the output of the previous step (see the sketch below)
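To illustrate these rules, a sketch of a two-step pipeline; the `filter_rows` step and its `FILTER` action name are hypothetical, so check the cloe_nessy action reference for the real names and options:

```yaml
name: my_pipeline
steps:
  read_source:  # step names must be unique within the pipeline
    action: READ_CATALOG_TABLE
    options:
      table_identifier: "{{env:TABLE_SOURCE}}"
  filter_rows:  # receives the output of read_source
    action: FILTER  # hypothetical action name -- verify against your action reference
    options:
      condition: "status == 'Legendary'"
```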
## Installation and Import Issues

### "Module not found" Error

Problem: Python can't find the `cloe_nessy` module.
Solutions:
- Ensure cloe_nessy is installed: `%pip install cloe_nessy`
- Restart your Python kernel after installation: `%restart_python`
- Verify installation worked:
```python
# Verify installation
try:
    import cloe_nessy
    print(f"✓ cloe_nessy version: {cloe_nessy.__version__}")
except ImportError:
    print("❌ cloe_nessy not installed. Run: %pip install cloe_nessy")
```
## Data and Filter Issues

### No Results After Filtering
Problem: Pipeline runs successfully but result table is empty or has fewer rows than expected.
Solutions:

- Check your filter condition syntax and logic
- Test the filter condition directly on your data:
```python
# Test your filter condition
test_df = spark.table("your_source_table")
filtered_df = test_df.filter("status == 'Mysterious' or status == 'Legendary'")
print(f"Filtered rows: {filtered_df.count()}")
filtered_df.show()
```
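If the count is zero, inspect the actual values in the filter column - a spelling or casing mismatch is a common cause:

```python
# Inspect the distinct values of the column used in the filter
test_df.select("status").distinct().show(truncate=False)
```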
### Pipeline Runs But Creates No Output
Problem: Pipeline appears to complete but no result table is created.
Solutions:
- Check the write mode - using `ignore` mode will skip writing if the table exists (the snippet below shows how to confirm whether anything was written)
- Verify you have write permissions to the target location
- Look for error messages in the pipeline output logs
- Try using `overwrite` mode to ensure the table gets created
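A quick existence check (a sketch; the table name is a placeholder, and `DESCRIBE HISTORY` applies to Delta tables):

```python
# Check whether the result table exists and, if so, when it was last written
table_name = "your_catalog.your_schema.your_result_table"  # placeholder
if spark.catalog.tableExists(table_name):
    spark.sql(f"DESCRIBE HISTORY {table_name}").select("timestamp", "operation").show(5)
else:
    print(f"{table_name} does not exist -- the write step never ran or was skipped")
```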
## Streaming-Specific Issues

### Checkpoint Location Errors
Problem: Streaming pipeline fails with checkpoint-related errors.
Solutions:

- Ensure the checkpoint location path is accessible by your Spark environment
- Make sure the checkpoint directory is empty for first-time runs
- Verify you have write permissions to the checkpoint location
- For cloud storage, ensure proper authentication and access policies

Example checkpoint setup (a minimal sketch for Databricks notebooks, where `dbutils` is available; the path is a placeholder):
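```python
# Create a dedicated, empty checkpoint directory before the first run
checkpoint_path = "/Volumes/my_catalog/my_schema/checkpoints/my_pipeline"  # placeholder
dbutils.fs.mkdirs(checkpoint_path)       # no-op if the directory already exists
display(dbutils.fs.ls(checkpoint_path))  # should be empty for a first-time run
```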
### Streaming Pipeline Reprocesses All Data
Problem: Expected incremental processing but pipeline processes all data every time.
Solutions:
- Verify `stream: true` is set in your `READ_CATALOG_TABLE` action (see the example below)
- Check that the checkpoint location remains the same between runs
- Ensure the checkpoint directory isn't being deleted between runs
- Confirm you're using the same pipeline name and configuration
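For reference, a streaming read might look like this in the YAML configuration; the exact placement of the `stream` key is an assumption, so verify it against the cloe_nessy action reference:

```yaml
steps:
  read_source:
    action: READ_CATALOG_TABLE
    options:
      table_identifier: "{{env:TABLE_SOURCE}}"
      stream: true
```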
### No New Data Processed in Streaming
Problem: Added new data to source table but streaming pipeline doesn't pick it up.
Solutions:
- Verify new data was actually added to the source table: `spark.table("source_table").count()`
- Check that you're using `mode: append` when adding new data to the source (see the snippet below)
- Ensure the checkpoint location hasn't been modified or deleted
- For testing, verify the new data has different timestamps or commit versions
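For example, appending test rows to the source table with PySpark (a sketch; the table name and columns are placeholders):

```python
# Append new rows to the source table so the stream has something to pick up
new_rows = spark.createDataFrame(
    [("Nessie", "Mysterious")],  # placeholder data
    ["name", "status"],          # placeholder schema
)
new_rows.write.mode("append").saveAsTable("your_catalog.your_schema.your_source_table")
```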
### Checkpoint Recovery Issues
Problem: Pipeline fails to restart from checkpoint after interruption.
Solutions:

- Delete the entire checkpoint directory and restart - this will reprocess all data (see the snippet below)
- Check if checkpoint files are corrupted or incomplete
- Ensure consistent Spark and Delta versions between runs
- Verify checkpoint location permissions haven't changed
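Removing the checkpoint directory in a Databricks notebook (a sketch; the path is a placeholder):

```python
# WARNING: deleting the checkpoint discards all streaming progress;
# the next run will reprocess the source table from the beginning
checkpoint_path = "/Volumes/my_catalog/my_schema/checkpoints/my_pipeline"  # placeholder
dbutils.fs.rm(checkpoint_path, recurse=True)
```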