The Problem of Mostly Right Data

Aug 19, 2020

What do you do when your system data is 99% perfect?

Perfection is so hard to attain. Heck, I haven’t even gotten there myself yet! Sometimes when I start to ponder exactly when I will get there, I start thinking about business systems and what perfection looks like for them. System perfection is often judged by whether a system is highly performant, highly flexible and maintainable, and, most importantly, highly accurate. You might be excused for one or two shortcomings on the maintainability and performance side, but in general, accuracy is a pretty important factor in our evaluations of systems. We have to trust what comes out of them or else there’s no point in having them online in the company’s IT ecosystem.

In my experience, business systems do tend to be pretty accurate, if only because they have to be. But they cannot possibly be perfect, both because they were designed and built by humans and because they generally depend on data coming from other (imperfect) systems, from (imperfect) humans who have to enter data into the system, or both. For example, a colleague of mine had her first name misspelled by the HR team when they onboarded her to the company we worked for at the time. Simple problem with a simple fix, right? Except that by the time anyone noticed it, the mistake had rippled through every other business system in the company that knew of her existence (we were very efficient that way). Her name was even spelled wrong in her company email address. I asked her why they didn’t correct it, and she said that everyone she mentioned it to had thought it would be too much trouble to update every system that had it wrong, especially since some of those systems didn’t have mechanisms for making corrections to data. We joked that maybe she should legally change her name to reflect how our company knew her.

Sometimes mistakes don’t matter much, like my friend’s email address. That would have no impact on company profitability, nor on her quality of life. Or when you are training an artificial intelligence engine on a huge data set, maybe a record or two being wrong won’t have a material impact on the results because the weight of the correct data might make the mistakes mostly irrelevant. But sometimes they do matter — say, for tax filings, or if checks are being written based on the data in the system. Aye, and there’s the rub — because perfection is so hard to attain.

In my experience, technologists tend to think in terms of “zero”, “one”, or an undefined “many” instances of a particular condition when they develop data or process models. But operations people tend to approach the world in terms of “none”, “a couple”, “a few”, or “tons”. Zero data corrections expected per month (or per feed, or other unit)? Great, it’s easy to build a system to support that. One or two corrections generally needed, especially if we know they’re likely to be there and what kind of mistakes they might be? Not bad, just build in a simple manual correction option, and you’re probably good to go. Tons of corrections needed? That’s going to need bulletproof data extraction, management, and editing techniques, and you will want to make them a core part of the system architecture.

The more annoying condition, though, is the “a few” problem. Maybe you have a system that handles 1M rows of data per period, and historically there have been around 300 bad rows per feed. The system is not perfectly accurate, but it’s not known to be a major data manipulation monster either. It’s just slightly wrong. The big huge expensive data management techniques for the “many” or “tons” scenario could be built in, but sometimes that’s like swatting a fly with a sledgehammer. The manual correction route might be the right way to go, but it might not. Sometimes the mistakes aren’t expected to be caught until after checks have been written. And sometimes there are just too many problems to correct manually, especially when the mistakes are not expected to be obvious enough for anyone to know what they are or how to correct them.
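To put a number on “slightly wrong”, here is a minimal Python sketch that works out the error rate for the 300-in-1M example and maps a bad-row count onto the operations vocabulary from earlier. The function names and the cutoffs (`couple_max`, `few_max`) are my own assumptions for illustration; where “a few” turns into “tons” is a judgment call for your own operation.

```python
def error_rate(bad_rows: int, total_rows: int) -> float:
    """Fraction of rows in a feed that are known to be wrong."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return bad_rows / total_rows


def correction_bucket(bad_rows: int, couple_max: int = 2, few_max: int = 1000) -> str:
    """Map a bad-row count to "none", "a couple", "a few", or "tons".

    The default thresholds are hypothetical, not a standard.
    """
    if bad_rows == 0:
        return "none"
    if bad_rows <= couple_max:
        return "a couple"
    if bad_rows <= few_max:
        return "a few"
    return "tons"


rate = error_rate(300, 1_000_000)
print(f"{rate:.2%} of rows are bad")  # 0.03% of rows are bad
print(correction_bucket(300))         # a few
```

At 0.03% bad rows the feed is 99.97% right, which sounds great until you remember that it still means roughly 300 rows per feed that somebody has to find and fix.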

When you are designing a business system, this is a situation that you must look out for. Proactively ask the users of the system you are replacing or upgrading how often data consumed or created by the system is known to be incorrect, and what the operational impacts of that are. Can you head off the inaccuracies at the pass by preventing them from occurring? Absolutely, try to do that. But if they’re based on data from upstream systems over which you have minimal control, or on flawed business processes wrapped around the system (“We have a lot of manual overrides because the VP doesn’t like the results we report!”), that might be a fix with tentacles that need to reach deep into the ecosystem. You might not be able to systematically prevent bad data from getting in or bad processes from invalidating the good data you have.

And the next question to ask is whether there are going to be downstream impacts from bad data in your system. And further, if and when you correct the data in your system, will that have downstream implications too? Will the downstream system be aware of the change so that it can be brought in line with the more accurate data once you find it? Will it have generated meaningful results that must be corrected and communicated further downstream? If the system gets audited, will you be able to justify the trail of corrections? If the answer to any of these questions is, “aw man…”, then you need to incorporate the system data correction requirements into your solution.
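One way to picture those correction requirements is a store that never silently overwrites a value: every fix is recorded with who made it, when, and why, and anything downstream that has registered interest is told about it. The Python sketch below is only an illustration under my own assumptions; `Correction`, `CorrectableStore`, the in-memory dictionaries, and the sample employee record are all hypothetical, and a real system would persist the log and deliver notifications through whatever integration mechanism it already has.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable


@dataclass(frozen=True)
class Correction:
    """One immutable entry in the trail of corrections."""
    record_id: str
    field_name: str
    old_value: Any
    new_value: Any
    reason: str
    corrected_by: str
    corrected_at: datetime


@dataclass
class CorrectableStore:
    """In-memory stand-in for a system that supports audited corrections."""
    records: dict[str, dict[str, Any]] = field(default_factory=dict)
    audit_log: list[Correction] = field(default_factory=list)
    subscribers: list[Callable[[Correction], None]] = field(default_factory=list)

    def correct(self, record_id: str, field_name: str, new_value: Any,
                reason: str, corrected_by: str) -> Correction:
        record = self.records[record_id]  # KeyError if the record is unknown
        correction = Correction(
            record_id=record_id,
            field_name=field_name,
            old_value=record.get(field_name),
            new_value=new_value,
            reason=reason,
            corrected_by=corrected_by,
            corrected_at=datetime.now(timezone.utc),
        )
        record[field_name] = new_value      # apply the fix
        self.audit_log.append(correction)   # keep the justification for auditors
        for notify in self.subscribers:     # let downstream systems catch up
            notify(correction)
        return correction


# Usage with made-up data: fix a misspelled first name and tell "payroll".
store = CorrectableStore(records={"E1001": {"first_name": "Kathrine"}})
payroll_inbox: list[Correction] = []
store.subscribers.append(payroll_inbox.append)

store.correct("E1001", "first_name", "Katherine",
              reason="Misspelled at onboarding", corrected_by="hr_admin")
print(store.records["E1001"]["first_name"])  # Katherine
print(len(store.audit_log), len(payroll_inbox))
```

The design choice worth noticing is that the old value travels with the correction, so a downstream system can tell what it previously consumed and decide whether its own results need to be redone.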

Sometimes that’s really hard, by the way. But dealing with a system that cannot be corrected without adjacent systems failing is hard too. The lesson here is that one measure of system perfection is in how it deals with imperfection in this imperfect world.

Written by David Kelly