Sunday, August 03, 2008

Data Validation for Business Intelligence

Recently, there have been several exciting products released in the BI space around data visualization and optimization to handle sophisticated information challenges. However, I have been surprised that there are no software products or technologies to help with something as fundamental as data validation for data warehouses. Most of the projects that I have encountered end up spending an inordinate amount of time at the tail end in trying to get the data to tie in the data warehouse. There is usually a subset of transactions missing, that were not thoroughly understood as relevant to a particular calculation and must be replicated from the source ERP system to the data warehouse.

There are some products that help with data quality, and information profiling but they fall short in semantically walking the lienage of the metric that you may see on one report to a metric that you may see on another.
We can argue that the root cause of the issue is related to multiple versions of the truth, however, I belive that multiple versions will always exist because some events are relevant to certain decisions vs. others. eg. If I am trying to understand customer sales, I am primarily interested in product sales vs. for financial reporting shipping and handling charges will usually be included as part of a sales number. Unfortunately, human beings has neither the patience nor the presence to use all their words so when there is a conversation going on, both those numbers may be reported and referred to as sales.

In my 12 years in this domain, this has been an ongoing issue, how can we "tie" the data in the data warehouse to an existing "known" source. It is irrelevant whether the known source is right or wrong, the information has to link to the source system for it to gain acceptance in the data warehouse.

I also find it fascinating that most of the business people can look at a number and state that it doesn't tie, even though they have no idea how that number is calculated in the first place.

The reason this problem has not been solved because understanding the context and semantics are very important and most technologies stil fall short in trying to "understand" relevance. Maybe this is a search problem after all and the new semantic web will help us conquer this fundamental BI problem at some point in the near future.