Data Quality
CLOE differentiate three level of DQ
- BK is unique and not empty
- Conversion of a cell to a datatype is successfull
- Values returned as part of a lookup are not empty
CLOE offers maximum flexibility. DQ and DQ Logging are optional and can be configured down to the column level. However, it is important to pay attention to logical consistency. Examples:
- If DQ2 is enabled but no conversions are done on column level, an error will occur.
- If DQ2 logging is enabled but no columns are flagged for logging that also contain conversions, an error occurs.
Logging
CLOE expects one table per layer (DQ1, DQ2, DQ3) for dataquality logging. These tables need to be created on the sink database before the ETL is started.
| DQ Tables |
|---|
| CREATE SCHEMA cloe_dwh
CREATE TABLE [cloe_dwh].[DQ1_Log]
(
[ID] INT IDENTITY(1, 1) NOT NULL
,[Content_JSON] NVARCHAR(MAX) NULL
,[_Sourceobject] NVARCHAR(255) NULL
,[Error_Count] INT NULL
,[_DCR] DATETIME2(2) NULL
)
CREATE TABLE [cloe_dwh].[DQ2_Log]
(
[ID] INT IDENTITY(1, 1) NOT NULL
,[Content_JSON] NVARCHAR(MAX) NULL
,[_Sourceobject] NVARCHAR(255) NULL
,[Error_Count] INT NULL
,[_DCR] DATETIME2(2) NULL
)
CREATE TABLE [cloe_dwh].[DQ3_Log]
(
[ID] INT IDENTITY(1, 1) NOT NULL
,[Content_JSON] NVARCHAR(MAX) NULL
,[_Sourceobject] NVARCHAR(255) NULL
,[Error_Count] INT NULL
,[_DCR] DATETIME2(2) NULL
)
|
| DQ Tables |
|---|
| CREATE SCHEMA cloe_dwh
CREATE OR REPLACE TABLE cloe_dwh.DQ1_Log
(
"ID" INT AUTOINCREMENT START 1 INCREMENT 1 NOT NULL
,"Content_JSON" VARCHAR(16777216) NULL
,"_Sourceobject" VARCHAR(255) NULL
,"Error_Count" NUMBER NULL
,"_DCR" timestamp_ntz NULL
)
CREATE OR REPLACE TABLE cloe_dwh.DQ2_Log
(
"ID" INT AUTOINCREMENT START 1 INCREMENT 1 NOT NULL
,"Content_JSON" VARCHAR(16777216) NULL
,"_Sourceobject" VARCHAR(255) NULL
,"Error_Count" NUMBER NULL
,"_DCR" timestamp_ntz NULL
)
CREATE OR REPLACE TABLE cloe_dwh.DQ3_Log
(
"ID" INT AUTOINCREMENT START 1 INCREMENT 1 NOT NULL
,"Content_JSON" VARCHAR(16777216) NULL
,"_Sourceobject" VARCHAR(255) NULL
,"Error_Count" NUMBER NULL
,"_DCR" timestamp_ntz NULL
)
|
In the name, accordingly, replace the DQ3 with DQ1 and DQ2 and create the table three times.
Logging limitations
The current method uses conversion of erroneous rows to JSONs. Rows are effectively compressed to a list of object which can be stored in single row/single column. Example:
| DML DQ2 Logging query |
|---|
| DECLARE @dq2_error_json NVARCHAR(MAX) = (
SELECT
MY_COLUMNS
FROM MY_TABLES s
FOR JSON AUTO, INCLUDE_NULL_VALUES
)
|
The following limitations apply to NVARCHAR(MAX) in MSSQL, a maximum of 2GB of data can be stored. If more errors occur than it can fit in 2GB, errors may occur. In regular logging, such masses of errors should not occur and consideration should be given to excluding columns with high numbers of errors from logging(if errors are expected) or fixing errors.