Metadata guide¶
Introduction¶
This guide presents the different forms of metadata used by nessy. It is structured by - object: the object which the metadata describes. This is currently tables, schemas, and views. - the necessity of the metadata: either required or optional.
In the following, metadata and necessity are presented by object.
Schema metadata¶
Schema metadata reference¶
| Field | Required? | Metadata Type | Description | Example |
|---|---|---|---|---|
| catalog | true | string | The catalog in which the schema is located. | my_catalog |
| name | true | string | Name of schema within a catalog. | my_schema |
| storage_path | false | string | Storage path of the schema. Acts as a fallback for table storage paths. | "abfss://container@storageaccount.company.core.windows.net/schema_folder/schemas/" |
| business_properties | false | dictionary | A dictionary containing case-specific metadata required by business or other stakeholders. | |
| comment | false | string | A comment added to the schema. | What a wonderful schema to store data. |
Example Schema metadata¶
catalog: "my_catalog"
name: "raw"
storage_path: "abfss://mycontainer@mystorageaccount.blob.core.windows.net/schemas/"
Info
The definition of the schema allows for dynamic values by addressing them in the format of {{env:NAME_OF_ENV_VARIABLE}}.
This allows for the definition of environment-specific values, such as the storage path or catalog name.
Table metadata¶
Table metadata reference¶
| Field | Required? | Metadata Type | Description |
|---|---|---|---|
| name | true | string | Name of the table. |
| columns | true | list of columns | A list of columns containing column metadata for each column. |
| is_external | false | boolean | Indicates whether the table is managed in DBX or external. |
| partition_by | false | list | List of columns by which the table is partitioned or clustered (depending on the attribute liquid_clustering). |
| liquid_clustering | false | boolean | Defines whether to to cluster the table by the partition_by columns (liquid, multi-dimensional clustering for a Delta Lake table) or to partition by the partition_by columns. Required if partition_by is not None. |
| composite_primary_key | false | list | List of columns which denote unique rows. |
| properties | false | dictionary | DBX delta table properties as described here |
| constraints | false | dictionary | One or more constraint definitions as dictionary of constraint names, SQL code to filter on one or multiple columns, and a description. |
| foreign_keys | false | list | A list of dictionaries containing foreign_key_columns, parent_table, parent_columns, and foreign_key_options for each foreign key in the table. |
| storage_path | false | string | Storage path of the table. Acts as overwrite for schema storage path. |
| business_properties | false | dictionary | A dictionary containing case-specific metadata required by business or other stakeholders. |
| comment | false | string | A comment on the table. |
| data_source_format | false | string | The format of the data, e.g. delta or parquet. |
Foreign key metadata reference¶
| Field | Required? | Metadata Type | Description |
|---|---|---|---|
| foreign_key_columns | false | string | The name of the columns in the current table that are foreign keys referencing columns in another table. |
| parent_table | false | string | The name of the parent table containing the column being referenced by the foreign keys. |
| parent_columns | false | string | The name of the columns in the parent table being referenced by the foreign keys. |
| foreign_key_option | false | list of strings | The options for the foreign key constraint, such as "MATCH FULL", "ON UPDATE NO ACTION" or "ON DELETE NO ACTION". For details, see the Databricks documentation. |
Column metadata reference¶
| Field | Required? | Metadata Type | Description |
|---|---|---|---|
| name | true | string | Physical column name. |
| data_type | true | string | The data types supported in Azure Databricks. |
| nullable | true | boolean | Whether the column allows null values. By default it should not. |
| default_value | false | In case of non-nullable columns, the default value according to data modeling standards. | |
| generated | false | string | Optional generated expression for auto-generation of column values. |
| business_properties | false | dictionary | A dictionary containing case-specific metadata required by business or other stakeholders. |
| comment | false | string | A comment on the column. |
Example table metadata¶
name: customers
is_external : false
partition_by:
- "region"
liquid_clustering: true
data_source_format: "delta"
properties:
"delta.autoOptimize.optimizeWrite": "true"
"delta.enableDeletionVectors": "true"
constraints:
realisticBirthConstraint:
expression: "(birthDate > '1900-01-01')"
description: "Ensures birthDate is realistic"
foreign_keys:
- foreign_key_columns:
- "service_level"
parent_table: "service_level_lookup"
parent_columns:
- "service_level_name"
foreign_key_option:
- "MATCH FULL"
columns:
- name: "customer_id"
data_type: "long"
nullable: false
generated: "GENERATED ALWAYS AS IDENTITY"
- name: "customer_name"
data_type: "string"
nullable: false
default_value: "Max Mustermann"
business_properties:
responsible_manager: "Dr. Michaela Mustermensch"
External vs. Managed Tables¶
Managed and external tables can be configured in multiple ways in the metadata. The following table provides an overview of the different possibilities:
| is_external | table_storage_path | Result |
|---|---|---|
false |
not set |
managed table |
true |
not set |
external table derived from schema_storage_path |
true |
set |
external table based on table_storage_path |
false |
set |
no valid table |