A huge problem with collecting data is that platforms change data retroactively.
Let’s consider a couple of examples.
A client purchased a product online, and the payment was processed by Stripe. The transaction data was downloaded to be used in a report.
After a little while, the transaction turned out to be fraudulent: Stripe removed it from revenue, but the stale copy remained in the report. Now the report no longer matches the bank account statement.
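A common mitigation is to re-download a rolling window of recent transactions and replace the stale rows instead of only appending new ones. Here is a minimal sketch; the report structure and field names (`id`, `date`) are assumptions for illustration, not Stripe's actual schema:

```python
from datetime import date

def refresh_window(report, fresh_rows, window_start):
    """Replace every row in the report that falls inside the reload
    window with the platform's current data, so rows the platform
    deleted retroactively disappear from the report too."""
    kept = {tid: row for tid, row in report.items()
            if row["date"] < window_start}
    for row in fresh_rows:
        kept[row["id"]] = row
    return kept

# Stale report still contains a transaction Stripe later removed.
report = {
    "txn_1": {"id": "txn_1", "date": date(2023, 1, 10), "amount": 100},
    "txn_2": {"id": "txn_2", "date": date(2023, 1, 20), "amount": 50},
}
# Fresh download of the same window: the fraudulent txn_2 is gone.
fresh = [{"id": "txn_1", "date": date(2023, 1, 10), "amount": 100}]
report = refresh_window(report, fresh, window_start=date(2023, 1, 1))
```

After the refresh, `txn_2` no longer appears in the report, matching the bank statement.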
Another example we can look at is from marketing. An advertiser is conducting an ad campaign on Facebook. At the end of each day, the advertiser downloads a report with ad costs.
Under Facebook's attribution rules, if a user clicks on the advertiser's product within seven days of the ad impression, the advertiser has to pay Facebook for that impression. In other words, Facebook changes data retroactively. These changes never make it into the report, because the advertiser didn't reload data for past days.
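The practical consequence is that any day whose attribution window is still open may change, so its costs must be re-downloaded rather than only appended. A minimal sketch of computing which dates to reload (the seven-day window is the figure from the example above):

```python
from datetime import date, timedelta

ATTRIBUTION_WINDOW_DAYS = 7  # clicks up to 7 days after an impression

def dates_to_reload(today):
    """Every day whose attribution window is still open can change
    retroactively, so its ad costs must be re-downloaded."""
    return [today - timedelta(days=d)
            for d in range(ATTRIBUTION_WINDOW_DAYS + 1)]

days = dates_to_reload(date(2023, 5, 10))
```

This yields today plus the seven preceding days; anything older is stable and can be loaded once.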
99% of analytical use cases require business rules built into report generation. Let me give you an example.
A supplier contract may include special provisions, e.g. prices that vary by region, volume, or season.
For example, an advertiser works with an advertising agency that runs its ads on Facebook and charges a 10% fee. The advertiser also receives an invoice directly from Facebook, so to reconcile costs it has to add the 10% fee on top of Facebook's figures. The advertiser has contracts with several agencies; the contracts, the fees, and the timeframes all vary. At the same time, Facebook does not indicate in any way which spend belongs to which agency. In the end, the invoices issued by the agencies do not match the marketing reports.
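This kind of business rule can be encoded as a small contract table applied at report-generation time. The agency names, fee values, and date ranges below are hypothetical placeholders:

```python
from datetime import date

# Hypothetical contract table: each agency has its own fee and
# validity period (contracts, fees, and timeframes all vary).
CONTRACTS = [
    {"agency": "agency_a", "fee": 0.10,
     "start": date(2023, 1, 1), "end": date(2023, 12, 31)},
    {"agency": "agency_b", "fee": 0.12,
     "start": date(2023, 6, 1), "end": date(2023, 12, 31)},
]

def gross_spend(agency, day, facebook_spend):
    """Add the agency fee on top of the spend Facebook reports,
    so marketing reports reconcile with agency invoices."""
    for c in CONTRACTS:
        if c["agency"] == agency and c["start"] <= day <= c["end"]:
            return facebook_spend * (1 + c["fee"])
    return facebook_spend  # no contract in force: no markup
```

The hard part the source points out remains: Facebook itself does not say which spend belongs to which agency, so the `agency` attribution has to come from somewhere else (e.g. account structure or UTM tags).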
Data always contains gaps and garbage, and there are many reasons why this happens.
Our experience with past products has shown that every client has an incident almost every day. Over three years we logged 300,000 incidents that required intervention by the quality control system. Many of these incidents are invisible: the platform returns no error code and reports that everything is fine.
To catch these silent errors, we developed an automated system that detects irregularities in the data and tries to fix them; if it fails, the issue is escalated to the Technical Support Team and the client.
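The detection side of such a system boils down to sanity checks on each load: the API said "OK", but is the payload plausible? A minimal sketch, with thresholds and field names chosen for illustration only:

```python
def check_daily_load(rows, history_avg):
    """Flag silent failures: the API returned success, but the
    payload is empty or suspiciously small versus recent history."""
    issues = []
    if not rows:
        issues.append("no rows returned")
    elif history_avg and len(rows) < 0.5 * history_avg:
        issues.append(f"row count dropped: {len(rows)} vs avg {history_avg}")
    if any(r.get("amount") is None for r in rows):
        issues.append("null amounts present")
    return issues  # non-empty -> attempt auto-fix, then escalate
```

A real system would add per-metric anomaly detection and retry logic, but even these checks catch the "empty response with status 200" class of incident.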
Today’s ETL tools can’t prepare data for analysts.
Marketing directors who face this problem give up and hire data engineers to build a data processing pipeline from scratch.
Data modeling is the process of restructuring raw data—cleansing, de-normalizing, pre-aggregating, and re-shaping it—so that it supports analytical use cases.
Data modeling is hard, and we believe it’s the most important piece of the analytics stack.