# Common Issues
This page contains solutions to frequently encountered problems when working with Cloe Nessy pipelines.
## Table and Database Issues

### "Table not found" Error
Problem: Pipeline fails with a message saying the table doesn't exist.
Solutions:
- Verify your table names are correct and fully qualified: `catalog_name.schema_name.table_name`
- Check that you have the right catalog and schema names for your environment
- Ensure the source table creation step completed successfully
- Test table access manually with `spark.table("your_table_name").count()` (see the snippet below)
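A quick manual check (a minimal sketch; the table name is a placeholder, and three-part names require Unity Catalog support in your runtime):

```python
# Check whether the fully qualified table is visible to your session
table_name = "your_catalog.your_schema.your_table"  # placeholder
if spark.catalog.tableExists(table_name):
    print(f"{table_name}: {spark.table(table_name).count()} rows")
else:
    print(f"{table_name} not found -- check catalog/schema names and permissions")
```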
"Permission denied" Error¶
Problem: Access denied when reading from or writing to Unity Catalog tables.
Solutions:

- Confirm you have read/write permissions to the specified catalog and schema (see the check below)
- Try using a different catalog/schema that you have access to
- Contact your administrator to verify your Unity Catalog permissions
- Check if you're connected to the correct workspace/environment
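You can inspect the grants on a table directly with SQL (a sketch; the table name is a placeholder):

```python
# List the privileges granted on the table in Unity Catalog
spark.sql("SHOW GRANTS ON TABLE your_catalog.your_schema.your_table").show(truncate=False)
```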
## Environment Variables

### "Environment variable not found" Error
Problem: Pipeline can't find the table names you set as environment variables.
Solutions:

- Make sure you ran the environment variable setup completely:

```python
import os

os.environ["TABLE_SOURCE"] = "your_source_table"
os.environ["TABLE_RESULT"] = "your_result_table"
```

- Check that the variable names match exactly between your setup code and your YAML configuration (`TABLE_SOURCE` and `TABLE_RESULT`)
- Verify variables are set correctly:

```python
# Check environment variables
import os

required_vars = ["TABLE_SOURCE", "TABLE_RESULT"]
for var in required_vars:
    print(f"{var}: {os.environ.get(var, 'NOT SET')}")
```
## Pipeline Configuration Issues

### YAML Parsing Errors
Problem: Pipeline fails to parse the YAML configuration.
Solutions:

- Check YAML indentation - use spaces consistently, not tabs
- Ensure all quotes and brackets match properly
- Verify special characters in strings are properly escaped
- Test your YAML syntax with an online YAML validator, or locally as shown after the example below
Example of correct YAML formatting:
```yaml
name: my_pipeline
steps:
  step_name:
    action: READ_CATALOG_TABLE
    options:
      table_identifier: "{{env:TABLE_SOURCE}}"
```
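To validate locally with PyYAML (a sketch; assumes PyYAML is installed and the config is saved at a placeholder path):

```python
# Parse the pipeline config locally to surface YAML syntax errors early
import yaml

with open("pipeline.yaml") as f:  # placeholder path
    try:
        yaml.safe_load(f)
        print("✓ YAML parses cleanly")
    except yaml.YAMLError as err:
        print(f"YAML error: {err}")
```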
### Step Execution Failures
Problem: Individual pipeline steps fail during execution.
Solutions:
- Review step names in YAML - they must be unique within the pipeline
- Ensure action names are spelled correctly (e.g., `READ_CATALOG_TABLE`)
- Check that all required options are provided for each action
- Verify data flows correctly between steps - each step uses the output of the previous step (see the sketch below)
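To illustrate these rules, a sketch of a two-step pipeline; the `filter_rows` step and its `FILTER` action name are hypothetical, so check the cloe_nessy action reference for the real names and options:

```yaml
name: my_pipeline
steps:
  read_source:  # step names must be unique within the pipeline
    action: READ_CATALOG_TABLE
    options:
      table_identifier: "{{env:TABLE_SOURCE}}"
  filter_rows:  # receives the output of read_source
    action: FILTER  # hypothetical action name -- verify against your action reference
    options:
      condition: "status == 'Legendary'"
```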
## Installation and Import Issues

### "Module not found" Error

Problem: Python can't find the `cloe_nessy` module.
Solutions:
- Ensure cloe_nessy is installed: `%pip install cloe_nessy`
- Restart your Python kernel after installation: `%restart_python`
- Verify installation worked:
```python
# Verify installation
try:
    import cloe_nessy
    print(f"✓ cloe_nessy version: {cloe_nessy.__version__}")
except ImportError:
    print("❌ cloe_nessy not installed. Run: %pip install cloe_nessy")
```
## Data and Filter Issues

### No Results After Filtering
Problem: Pipeline runs successfully but result table is empty or has fewer rows than expected.
Solutions:

- Check your filter condition syntax and logic
- Test the filter condition directly on your data:
```python
# Test your filter condition
test_df = spark.table("your_source_table")
filtered_df = test_df.filter("status == 'Mysterious' or status == 'Legendary'")
print(f"Filtered rows: {filtered_df.count()}")
filtered_df.show()
```
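If the count is zero, inspect the actual values in the filter column - a spelling or casing mismatch is a common cause:

```python
# Inspect the distinct values of the column used in the filter
test_df.select("status").distinct().show(truncate=False)
```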
### Pipeline Runs But Creates No Output
Problem: Pipeline appears to complete but no result table is created.
Solutions:
- Check the write mode - using `ignore` mode will skip writing if the table exists (the snippet below shows how to confirm whether anything was written)
- Verify you have write permissions to the target location
- Look for error messages in the pipeline output logs
- Try using `overwrite` mode to ensure the table gets created
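A quick existence check (a sketch; the table name is a placeholder, and `DESCRIBE HISTORY` applies to Delta tables):

```python
# Check whether the result table exists and, if so, when it was last written
table_name = "your_catalog.your_schema.your_result_table"  # placeholder
if spark.catalog.tableExists(table_name):
    spark.sql(f"DESCRIBE HISTORY {table_name}").select("timestamp", "operation").show(5)
else:
    print(f"{table_name} does not exist -- the write step never ran or was skipped")
```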
## Streaming-Specific Issues

### Checkpoint Location Errors
Problem: Streaming pipeline fails with checkpoint-related errors.
Solutions:

- Ensure the checkpoint location path is accessible by your Spark environment
- Make sure the checkpoint directory is empty for first-time runs
- Verify you have write permissions to the checkpoint location
- For cloud storage, ensure proper authentication and access policies

Example checkpoint setup (a minimal sketch for Databricks notebooks, where `dbutils` is available; the path is a placeholder):
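```python
# Create a dedicated, empty checkpoint directory before the first run
checkpoint_path = "/Volumes/my_catalog/my_schema/checkpoints/my_pipeline"  # placeholder
dbutils.fs.mkdirs(checkpoint_path)       # no-op if the directory already exists
display(dbutils.fs.ls(checkpoint_path))  # should be empty for a first-time run
```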
### Streaming Pipeline Reprocesses All Data
Problem: Expected incremental processing but pipeline processes all data every time.
Solutions:
- Verify `stream: true` is set in your `READ_CATALOG_TABLE` action (see the example below)
- Check that the checkpoint location remains the same between runs
- Ensure the checkpoint directory isn't being deleted between runs
- Confirm you're using the same pipeline name and configuration
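For reference, a streaming read might look like this in the YAML configuration; the exact placement of the `stream` key is an assumption, so verify it against the cloe_nessy action reference:

```yaml
steps:
  read_source:
    action: READ_CATALOG_TABLE
    options:
      table_identifier: "{{env:TABLE_SOURCE}}"
      stream: true
```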
### No New Data Processed in Streaming
Problem: Added new data to source table but streaming pipeline doesn't pick it up.
Solutions:
- Verify new data was actually added to the source table: `spark.table("source_table").count()`
- Check that you're using `mode: append` when adding new data to the source (see the snippet below)
- Ensure the checkpoint location hasn't been modified or deleted
- For testing, verify the new data has different timestamps or commit versions
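For example, appending test rows to the source table with PySpark (a sketch; the table name and columns are placeholders):

```python
# Append new rows to the source table so the stream has something to pick up
new_rows = spark.createDataFrame(
    [("Nessie", "Mysterious")],  # placeholder data
    ["name", "status"],          # placeholder schema
)
new_rows.write.mode("append").saveAsTable("your_catalog.your_schema.your_source_table")
```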
### Checkpoint Recovery Issues
Problem: Pipeline fails to restart from checkpoint after interruption.
Solutions:

- Delete the entire checkpoint directory and restart - this will reprocess all data (see the snippet below)
- Check if checkpoint files are corrupted or incomplete
- Ensure consistent Spark and Delta versions between runs
- Verify checkpoint location permissions haven't changed
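Removing the checkpoint directory in a Databricks notebook (a sketch; the path is a placeholder):

```python
# WARNING: deleting the checkpoint discards all streaming progress;
# the next run will reprocess the source table from the beginning
checkpoint_path = "/Volumes/my_catalog/my_schema/checkpoints/my_pipeline"  # placeholder
dbutils.fs.rm(checkpoint_path, recurse=True)
```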