The sections below describe most of standards we have for Data Vault objects in the Optimal Data Engine. Whatever is on this web page is our current version.
ODE Data Vault solution will have at least three databases, configuration, stage and vault itself. Semantic layer, i.e. Data Marts, could be implemented in a separate database.
ODE databases names:
- ODE_Config for configuration database
- ODE_Vault, ODE_Data_Vault or ODE_Business_Vault for the Data Vault
- For staging database, replace word “vault” with “stage” in the ODE Data Vault database name, e.g. ODE_Stage, ODE_Business_Stage etc.
Raw and Business Vaults objects get separated by the naming convention.
|Raw Vault||Business Vault|
|Hub schema||RawHub (hubs that don’t represent a business concept on its own, but data is used in calculations)||Hub|
|Hub name||<source_system>__<source_schema>__<source_table_name>. E.g. CRM__Sales__Order_Header||Aligned with the business concept, pascal case. Make it short. E.g. Customer, Product.|
|Hub key||Aligned with the source column name||Two fields. First one is source system identifier, char(4). Second on is a unique business key (concatenated in case of multipart key), varchar(128). Name same as the hub name.|
|Satellite name||<source_system>__<source_schema>__<source_table_name>||Start with the hub name, then meaningful name for the group of attributes represented by satellite. E.g. Customer_DeliveryAddress.|
|Satellite fields||Aligned with the source field names||Meaningful names, pascal case. E.g. FirstName.|
|Link schema||RawLnk (Links in Raw Vault are not actually required)||Lnk|
|Link name||Source system first, then source tables and relationship between them. E.g. CRM__Order_Header__Customer_Payer||Meaningful name describing the relationship, ideally noun + verb + noun separated by underscore, e.g. Customer_Make_Order, Order_Include_Product|
|Link fields||Aligned with hub keys||Aligned with hub keys. In case of SAL or HAL, use adjectives, e.g. MasterCustomerKey, BaseCustomerKey|
Each configuration record must have a release key. Release number is composed from the Sprint start date (YYYYMMDD) and the next available sequence number. E.g. 2016082201, 2016082202. Release could include different types of configuration records. Make release as small as possible. By the end of the Sprint all the finished releases should be pushed to the next environment, e.g. from DEV to TEST. Unfinished releases could be backed out or renamed with the new Sprint sequence number.
Full release number is Sprint start date and 00 as a sequence number or without any sequence number to make it outstanding among incremental release numbers, e.g. 2018031900 or 20180304. Full releases also start from DEV environment for release keys consistency. Full release process is described here.
Although any ETL tool is acceptable, generally business logic is implemented via the stored procedures in Stage area. Don’t use views. Stage procedure has the same name as satellite with the “usp_” prefix. Stored procedure populates the stage table of the same name. Stored procedures code is versioned by Git.
Typically stored procedure includes:
- Drop the stage table code.
- Common table expressions with all the source tables for the query.
- Select statement that writes data into the stage table. All the column transformations happen here.
- Stage table could have one input parameter in the case where the same procedure supports Full and Delta load.
- Convert all source fields to the required data type explicitly.
Business stage table includes only those source fields which are required to be in the Data Vault. All the stage tables have a field dv_stage_datetime defaulted to the stage table load timestamp. This field is required for the audit purpose only and is excluded from the configuration.
If business rule has been changed significantly, stored procedure and staging table get versioned within the database. New copy of the staging object gets the version number suffix, e.g. “_v001”. Dependent Vault objects are switched to the new version of stage table. In case if objects require to be switched off, flag it as “retired” in configuration and change the stored procedure to stop populating the object with data.
Scheduling the Load
Stage tables should be scheduled to update the Data Vault. Scheduled tables are loaded in parallel. Most of the tables are supposed to run in the queue “001”; queue “002” is for heavy tables with complicate business logic. There could be a one or multiple schedules. For example, load all the Raw tables in the first schedule and start applying business rules after that. However, schedule hierarchy guarantees the correct load order within the load.
Schedule itself should be initiated by SQL Agent or any external trigger.
Always convert hub key field to a string data type.
All the satellites should be configured to have a columnstore index. However, this feature is only available for SQL Server Enterprise Edition.
Every link has a satellite. Satellite on a link contain hub keys only, no context is allowed.
Good old date dimension is represented by the Date reference ensemble. Hub key is date (not an integer).