When doing a data conversion between applications, reconciliation testing is simple. Everything that was in the source application should be in the destination, and everything in the destination should be in the source.

A data warehouse is more complicated to reconcile because, at some point, the data is transformed.

Challenges

  1. The first kind of transformation is augmenting data with pre-calculated attributes. For example, predictive scores and percentile rankings might be added to every record.
  2. The other kind of transformation alters the number of records by de-duplicating, filtering, or aggregating.

Vaults

Raw Vault

The Raw Vault can be fully reconciled to its source.

  1. “Except” queries against the Hubs to check that every expected key is present and that no spurious keys are present.
  2. “Except” queries between the Satellites and their parents (Hub or Link) to check referential integrity.
  3. “Except” queries over all Satellite attributes.
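As a minimal T-SQL sketch, assuming hypothetical names (stage.customer for the source extract, rv.h_customer for the Hub, rv.s_customer_details for the Satellite), the three checks might look like this. Every query should return zero rows:

    -- 1. Hub keys: no keys missing from the Hub ...
    SELECT customer_id FROM stage.customer
    EXCEPT
    SELECT customer_id FROM rv.h_customer;

    -- ... and no spurious keys in the Hub.
    SELECT customer_id FROM rv.h_customer
    EXCEPT
    SELECT customer_id FROM stage.customer;

    -- 2. Referential integrity: every Satellite row has a parent Hub key.
    SELECT h_customer_key FROM rv.s_customer_details
    EXCEPT
    SELECT h_customer_key FROM rv.h_customer;

    -- 3. Satellite attributes match the source values (filtering to the
    --    current Satellite rows is omitted here for brevity).
    SELECT customer_id, name, email FROM stage.customer
    EXCEPT
    SELECT h.customer_id, s.name, s.email
    FROM rv.s_customer_details AS s
    JOIN rv.h_customer AS h ON h.h_customer_key = s.h_customer_key;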

Business Vault

  1. If the only transformations are augmenting data then we can reconcile counts and all of the non-transformed fields back to the source vault (see the first sketch after this list). The transformed fields are not reconciled; the transformations are covered by unit tests and acceptance tests instead. Reconciling the transformed fields would only be possible by sharing or duplicating the transformation code, which would be pointless and bad practice respectively.
  2. If the records are being filtered or de-duplicated then reconciliation becomes one-directional (see the second sketch after this list). We can still check that the records in the destination are all in the source, and verify their untransformed attributes. However, we cannot reconcile the source to the destination, because we have intentionally removed records.
    1. First we verify that all records in the destination are in the source.
    2. Next we confirm that each missing source record either:
      1. matches one of the filters that should have been applied, or
      2. is marked as a non-surviving record in a same-as Link table.
  3. When aggregating, reconciliation is again different. In ODE our aggregates always match a non-leaf level of a hierarchy, so we can identify the aggregation level and check that the source aggregates to the number of rows in the destination. We can also apply the appropriate aggregation functions to the untransformed attributes (see the third sketch after this list).
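For the augmentation case, a minimal sketch, assuming a hypothetical Raw Vault Satellite rv.s_customer_details and a Business Vault Satellite bv.s_customer_scored that adds a churn_score column:

    -- Counts must match exactly: augmentation adds columns, not rows.
    SELECT (SELECT COUNT(*) FROM rv.s_customer_details)
         - (SELECT COUNT(*) FROM bv.s_customer_scored) AS row_count_diff;

    -- Untransformed fields reconcile in both directions; churn_score is
    -- deliberately excluded, as it is covered by unit and acceptance tests.
    SELECT h_customer_key, name, email FROM rv.s_customer_details
    EXCEPT
    SELECT h_customer_key, name, email FROM bv.s_customer_scored;

    SELECT h_customer_key, name, email FROM bv.s_customer_scored
    EXCEPT
    SELECT h_customer_key, name, email FROM rv.s_customer_details;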
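For the filtered or de-duplicated case, a sketch of the two steps, assuming a hypothetical destination table bv.s_customer_clean, a hypothetical filter rule on is_test_record, and a same-as Link rv.l_same_as_customer that records non-surviving keys:

    -- Step 1: every destination record must exist in the source.
    SELECT h_customer_key FROM bv.s_customer_clean
    EXCEPT
    SELECT h_customer_key FROM rv.s_customer_details;

    -- Step 2: every source record missing from the destination must be
    -- explained by a filter or by de-duplication. Any rows returned here
    -- are unexplained losses and fail the test.
    SELECT s.h_customer_key
    FROM rv.s_customer_details AS s
    WHERE NOT EXISTS (SELECT 1 FROM bv.s_customer_clean AS c
                      WHERE c.h_customer_key = s.h_customer_key)
      AND s.is_test_record = 0   -- not excluded by the (hypothetical) filter
      AND NOT EXISTS (SELECT 1 FROM rv.l_same_as_customer AS sa
                      WHERE sa.non_surviving_key = s.h_customer_key);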
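For the aggregation case, a sketch assuming a hypothetical daily Satellite rv.s_sales_daily rolled up to the month level of the calendar hierarchy in bv.s_sales_monthly:

    -- Grouping the source to the aggregation level must reproduce the
    -- destination exactly: the untransformed measure (amount) must agree
    -- under the matching aggregate function. The reverse EXCEPT completes
    -- the row-count check.
    SELECT product_key,
           DATEFROMPARTS(YEAR(sale_date), MONTH(sale_date), 1) AS sale_month,
           SUM(amount) AS amount
    FROM rv.s_sales_daily
    GROUP BY product_key, DATEFROMPARTS(YEAR(sale_date), MONTH(sale_date), 1)
    EXCEPT
    SELECT product_key, sale_month, amount
    FROM bv.s_sales_monthly;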

All of these checks can be automated and applied automatically after each load.

We could possibly also track row count trends and key (Hub and Link) creation rates over time to highlight anomalies, as sketched below.
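A sketch of the key-creation-rate feed, assuming a standard load-date column (here called dv_load_date) on a hypothetical Hub:

    -- New Hub keys created per load day; charting this over time (or
    -- comparing each day to a moving average) highlights anomalous loads.
    SELECT CAST(dv_load_date AS date) AS load_day,
           COUNT(*) AS new_keys
    FROM rv.h_customer
    GROUP BY CAST(dv_load_date AS date)
    ORDER BY load_day;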