Bad data is costly: three ways to improve data quality
The cost of bad data is high. Inaccurate data, or accurate data run through faulty algorithms, can have catastrophic consequences for businesses everywhere, from one person's lost reputation on social media to entire industries losing out in the marketplace, never finding the success that others achieved before them with similar products and services.
The saying "garbage in, garbage out" holds true here: inaccurate information leads us down false paths that ultimately leave the company no wiser than if we had never collected any data at all.
Bad data prevented Gan Ying, a Chinese ambassador of the first century CE, from reaching Rome.
Gan Ying was a Chinese diplomat, explorer, and military official sent to Rome by the general Ban Chao.
Gan Ying never made it to Rome; he got only as far as the "western sea," in present-day Iran. To protect their trading monopoly, the Anxi (Parthian) merchants fed him false information, claiming the remaining journey could take up to two years. Ying turned back because the wait was too long, and China never made direct contact with the Roman Empire.
NASA lost millions in 1999 due to inaccurate data.
NASA lost the Mars Climate Orbiter in 1999, costing the agency $125 million. The engineering team that developed the orbiter worked in English (imperial) units, whereas NASA's navigation software expected the metric system. The two teams' data was inconsistent, and the mismatch resulted in a costly and disastrous error.
But what exactly is good quality data?
There is no universally accepted yardstick for what constitutes good data. In general, though, data is considered to be of high quality when it is fit for the purpose for which it was collected. That fitness is usually judged against five primary criteria, sketched in code after the list:
- Accuracy — all data correctly reflects the object or event in the real world
- Completeness — all data that should be present is present
- Relevance — all data meets the requirements for intended use
- Timeliness — all data reflects the correct point in time
- Consistency — values and records are represented in the same way within/across datasets
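To make these criteria concrete, here is a minimal sketch of how four of them might translate into automated checks, using Python and pandas on a hypothetical `orders` table. The column names, sample values, and checks are assumptions for illustration, not a prescribed implementation; relevance is omitted because it is a judgment about intended use rather than something a query can measure.

```python
import pandas as pd

# Hypothetical sample data; column names and values are invented.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount_usd": [19.99, None, 42.50, -5.00],
    "country": ["US", "us", "DE", "DE"],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2099-01-01", "2024-01-07"]),
})

# Completeness: every required field should be present.
missing_amounts = orders["amount_usd"].isna().sum()

# Accuracy: values should be plausible for the real-world quantity.
negative_amounts = (orders["amount_usd"] < 0).sum()

# Timeliness: records should not claim a point in time that has not happened.
future_dates = (orders["order_date"] > pd.Timestamp.now()).sum()

# Consistency: the same value should be represented the same way everywhere
# ("US" vs "us" is one country but two representations).
inconsistent_codes = (orders["country"].nunique()
                      - orders["country"].str.upper().nunique())

print(missing_amounts, negative_amounts, future_dates, inconsistent_codes)
# expected: 1 1 1 1
```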
How do we ensure that we have high-quality data?
The challenge with improving data quality is that it is a marathon, not a sprint. There is a widespread misconception that all we need is to run a single magic query and, voila, the data is clean. That may work for one dataset, until you obtain a new one and have to start over; it is not scalable in the long run.
Define
Establish data quality standards early on. These standards anchor everything that follows: they let you set concrete objectives and visualize how improving the quality of your data will contribute to your business's growth.
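As a sketch of what "defining standards" can look like in practice, the rules below encode a few hypothetical standards declaratively so that every pipeline can share them. The field names and constraints are invented for illustration.

```python
# Hypothetical, declarative data quality standards. Encoding them as data
# (rather than burying them in ad-hoc queries) makes them easy to review,
# version, and reuse across pipelines.
QUALITY_STANDARDS = {
    "customer_email": {"required": True, "pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
    "signup_date":    {"required": True, "not_in_future": True},
    "country_code":   {"required": True, "allowed": {"US", "DE", "FR", "JP"}},
}
```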
Collect
Gather and categorize all data quality issues. Once the problems are identified and documented, it is much easier to create a framework, and a data literacy and governance program, to address them.
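One way to gather and categorize issues is to record each finding in a small, uniform structure and then count the findings by category. The sketch below uses invented dataset names and issue descriptions purely to illustrate the shape of such a catalogue.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class QualityIssue:
    dataset: str
    column: str
    category: str  # e.g. "completeness", "accuracy", "consistency"
    detail: str

# Invented findings, shown only to illustrate the structure.
issues = [
    QualityIssue("orders", "amount_usd", "completeness", "missing in 3% of rows"),
    QualityIssue("orders", "country", "consistency", "mixed-case country codes"),
    QualityIssue("users", "email", "accuracy", "malformed addresses"),
]

# Counting issues per category shows where a governance program should focus.
print(Counter(issue.category for issue in issues))
```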
Monitor and maintain
Quality data is the result of ongoing effort; a one-time cleansing will not suffice. The data you collect may become obsolete as your needs change. Regular data cleansing should be performed by a single group of people following a shared set of rules, so results stay consistent.
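In practice, monitoring usually means re-running the same checks on a schedule and failing loudly when a metric drifts past an agreed tolerance. The sketch below assumes a daily CSV extract and a hypothetical 5% missing-value tolerance; in a real pipeline it would run under a scheduler such as cron or Airflow.

```python
import pandas as pd

# Hypothetical tolerance, agreed on in the standards from the Define step.
MAX_MISSING_RATIO = 0.05

def completeness_check(df: pd.DataFrame, column: str) -> bool:
    """Return True if the share of missing values is within tolerance."""
    return df[column].isna().mean() <= MAX_MISSING_RATIO

# "orders.csv" is a stand-in for whatever extract the pipeline produces.
orders = pd.read_csv("orders.csv")
if not completeness_check(orders, "amount_usd"):
    raise ValueError("amount_usd failed the scheduled completeness check")
```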
In conclusion
High-quality data serves as a foundation in the long run. As data quality improves, your foundation becomes more stable, allowing more to be built on top of it and multiplying the potential uses of your data.