A huge problem with collecting data is that platforms change data retroactively.
Let’s consider a couple of examples.
A client purchased a product online, and the payment was processed by Stripe. The transaction data was downloaded to be used in a report.
After a little while, the transaction turned out to be fraudulent: Stripe removed it from revenue, but the stale copy remained in the report. Now the report no longer matches the bank account statement.
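A common mitigation is to re-download a rolling window of recent transactions and replace the stale rows instead of only appending new ones. Here is a minimal sketch; the report structure and field names (`id`, `date`) are assumptions for illustration, not Stripe's actual schema:

```python
from datetime import date

def refresh_window(report, fresh_rows, window_start):
    """Replace every row in the report that falls inside the reload
    window with the platform's current data, so rows the platform
    deleted retroactively disappear from the report too."""
    kept = {tid: row for tid, row in report.items()
            if row["date"] < window_start}
    for row in fresh_rows:
        kept[row["id"]] = row
    return kept

# Stale report still contains a transaction Stripe later removed.
report = {
    "txn_1": {"id": "txn_1", "date": date(2023, 1, 10), "amount": 100},
    "txn_2": {"id": "txn_2", "date": date(2023, 1, 20), "amount": 50},
}
# Fresh download of the same window: the fraudulent txn_2 is gone.
fresh = [{"id": "txn_1", "date": date(2023, 1, 10), "amount": 100}]
report = refresh_window(report, fresh, window_start=date(2023, 1, 1))
```

After the refresh, `txn_2` no longer appears in the report, matching the bank statement.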
Another example we can look at is from marketing. An advertiser is conducting an ad campaign on Facebook. At the end of each day, the advertiser downloads a report with ad costs.
Under Facebook's attribution rules, if a user clicks on the advertiser's product within seven days of the ad impression, the advertiser has to pay Facebook for that impression. In other words, Facebook changes data retroactively. These changes never make it into the report, because the advertiser didn't reload data for past days.
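The practical consequence is that any day whose attribution window is still open may change, so its costs must be re-downloaded rather than only appended. A minimal sketch of computing which dates to reload (the seven-day window is the figure from the example above):

```python
from datetime import date, timedelta

ATTRIBUTION_WINDOW_DAYS = 7  # clicks up to 7 days after an impression

def dates_to_reload(today):
    """Every day whose attribution window is still open can change
    retroactively, so its ad costs must be re-downloaded."""
    return [today - timedelta(days=d)
            for d in range(ATTRIBUTION_WINDOW_DAYS + 1)]

days = dates_to_reload(date(2023, 5, 10))
```

This yields today plus the seven preceding days; anything older is stable and can be loaded once.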
99% of analytical use cases require business rules built into report generation. Let me give you an example.
A supplier contract may include special provisions, e.g. prices that vary by region, volume, or season.
For example, an advertiser works with an advertising agency that runs its ads on Facebook and charges a 10% fee. The advertiser also receives an invoice directly from Facebook, so to reconcile costs it has to add the 10% fee on top of Facebook's figures. The advertiser has contracts with several agencies; the contracts, the fees, and the timeframes all vary. At the same time, Facebook does not indicate in any way which spend belongs to which agency. In the end, the invoices issued by the agencies do not match the marketing reports.
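This kind of business rule can be encoded as a small contract table applied at report-generation time. The agency names, fee values, and date ranges below are hypothetical placeholders:

```python
from datetime import date

# Hypothetical contract table: each agency has its own fee and
# validity period (contracts, fees, and timeframes all vary).
CONTRACTS = [
    {"agency": "agency_a", "fee": 0.10,
     "start": date(2023, 1, 1), "end": date(2023, 12, 31)},
    {"agency": "agency_b", "fee": 0.12,
     "start": date(2023, 6, 1), "end": date(2023, 12, 31)},
]

def gross_spend(agency, day, facebook_spend):
    """Add the agency fee on top of the spend Facebook reports,
    so marketing reports reconcile with agency invoices."""
    for c in CONTRACTS:
        if c["agency"] == agency and c["start"] <= day <= c["end"]:
            return facebook_spend * (1 + c["fee"])
    return facebook_spend  # no contract in force: no markup
```

The hard part the source points out remains: Facebook itself does not say which spend belongs to which agency, so the `agency` attribution has to come from somewhere else (e.g. account structure or UTM tags).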
Data always contains gaps and garbage, and there are many reasons why this happens.
Our experience with past products has shown that every client has an incident almost every day. Over three years we logged 300,000 incidents that required intervention by the quality control system. Many of these incidents are invisible: the platform returns no error code and reports that everything is fine.
To catch these silent errors, we developed an automated system that detects irregularities in the data and tries to fix them; if it fails, the issue is escalated to the Technical Support Team and the client.
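The detection side of such a system boils down to sanity checks on each load: the API said "OK", but is the payload plausible? A minimal sketch, with thresholds and field names chosen for illustration only:

```python
def check_daily_load(rows, history_avg):
    """Flag silent failures: the API returned success, but the
    payload is empty or suspiciously small versus recent history."""
    issues = []
    if not rows:
        issues.append("no rows returned")
    elif history_avg and len(rows) < 0.5 * history_avg:
        issues.append(f"row count dropped: {len(rows)} vs avg {history_avg}")
    if any(r.get("amount") is None for r in rows):
        issues.append("null amounts present")
    return issues  # non-empty -> attempt auto-fix, then escalate
```

A real system would add per-metric anomaly detection and retry logic, but even these checks catch the "empty response with status 200" class of incident.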
Today’s ETL tools can’t prepare data for analysts.
Marketing directors who face this problem give up and hire data engineers to build a data processing pipeline from scratch.
Data modeling is the process of restructuring raw data—cleansing, de-normalizing, pre-aggregating, and re-shaping it—so that it supports analytical use cases.
Data modeling is hard, and we believe it’s the most important piece of the analytics stack.