The System Data Life Cycle: The Care and Feeding of the Facts and Figures Living in Your Enterprise
It is a truth universally acknowledged that a row of data pulled from a company’s source systems — especially a system of record for a given domain — ought to be the truth, ideally. And so it is, more or less. Eventually. System interactions, operations, decision making, compliance — these all depend on the trustworthiness of the data in relevant systems. And mostly, the trust in the data is not misplaced. Mostly.
My thesis is this: data has a life cycle. It is created by whatever processes are brought to bear — a person is hired, or some widgets are sold, or some other event of importance to your company occurs — and it gets entered somehow into the correct system for managing events of its type. Over time, some of the records of the events (but likely not all, nor even a majority) will be discovered to be incorrect in some way. Perhaps someone fat-fingered the data entry from a hand-written form, or the person who filled out the form wasn't entirely certain of the meaning of a field, or circumstances changed that made an attribute wrong in retrospect. So that event's data must be changed in whatever manner the system supports. And the fastest, cheapest, easiest way to change the data is to overwrite it in the system — making it a stateless change. With this kind of change, we know what the world looks like right now, but we cannot tell you what it looked like yesterday.
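To make that distinction concrete, here is a minimal sketch of a stateless correction. The record shape, IDs, and field names are hypothetical illustrations, not any particular system's schema:

```python
# A stateless correction: the sale record is overwritten in place.
# Fast and cheap -- but the prior value becomes unrecoverable.
# (Record shape, IDs, and field names are hypothetical.)

sales = {
    1001: {"model": "X19N", "region": "Great Lakes", "qty": 12},
}

def correct_stateless(sale_id: int, **fixes) -> None:
    """Overwrite fields in place; no record of what they used to be."""
    sales[sale_id].update(fixes)

correct_stateless(1001, model="X19M")

print(sales[1001]["model"])  # the system now says "X19M"...
# ...but nothing records that it ever said "X19N", or when it changed.
```

The system is "accurate" the moment the update lands, and that is all it can tell you.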
Systems are designed to solve the problems and support the processes they are supposed to address — obviously. But they are not always implemented in such a way that they can address other systems' problems — again, not surprisingly. I mean, how would you even do that? You work with the requirements you are given, and if you are building an HR system or an accounting system, your requirements tend to go deep into the specifics of your particular domain, while only much more superficial requirements are provided around how other systems, users, and reports might access whatever data is living in the system you are building. We all do the best we can to get along with each other, of course, but in a crunch, my known requirements take priority over your hypothetical ones when designing or implementing a system. That's just a matter of human nature and the constraints of tight budgets and schedules.
Then add to that the design paradigms we've inherited from earlier times, when processing and storage were much more expensive than they are today. Maximum efficiency in operations and storage was a prime consideration, while the subtler nuances of data interactions were not always considered worth the cost of building in.
And that’s not necessarily a bad thing. If all you need to know is that the Great Lakes region sold some X19N HypoWidgets — oops, make that X19M HypoWidgets — then having the right data in the system is the happy outcome. You might not care about the history of our misunderstanding of the event — all that matters is that we have it right at this moment.
But here's the rub. The system of record for widget sales is deemed 'accurate' as of this (or any) moment, and that's pretty much all it cares about. But anyone who uses data from the system to manage other operational or analytical processes might feel differently about it. The commission system cares because X19N widgets are commissioned at 4.5% of revenue, while X19M widgets get only a 3% commission rate. We just overpaid the account rep for the sale, and good luck getting that money back without a lot of backtalk from the salesperson. And the supply chain system just ordered a bunch of Nano-Rugulators to enable the assembly of replacement X19Ns, whereas the X19Ms use Micro-Rugulators. So the order we placed is for the wrong part, and now we'll have an excess of Nanos and a shortage of Micros, and that could cut into our X19M production and sales. And don't even get me started on what you've done to the analytics boffin who is trying to use predictive analytics to forecast sales and revenue.
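The financial stake in that one-character model-number error is simple arithmetic. The 4.5% and 3% rates come from the example above; the revenue figure is invented for illustration:

```python
# Commission impact of paying on the wrong model number.
# Rates are from the example above; the revenue figure is hypothetical.

RATES = {"X19N": 0.045, "X19M": 0.03}

revenue = 10_000.00                 # hypothetical revenue for the sale
paid = revenue * RATES["X19N"]      # what we paid, based on the bad data
owed = revenue * RATES["X19M"]      # what we actually owed

print(f"paid ${paid:.2f}, owed ${owed:.2f}, overpaid ${paid - owed:.2f}")
# paid $450.00, owed $300.00, overpaid $150.00
```

Multiply that by every affected sale in the period and the "minor" data correction starts to look like a line item.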
The biggest challenge downstream systems face is that there might be no systematic way to learn that the data they depend on has changed, nor which particular attributes of the data have changed. Maybe a change to the data is meaningless, but how do they know unless they are informed about what's changed? If the HR system changes an employee's shirt size from "XL" to "L" (gotta give them the right-sized tee-shirt at the company picnic), maybe the commission system won't care. But if the job title changes from "Assistant Manager" to "Manager-in-Training", that could have a financial impact on someone somewhere. Either way, how does the downstream system even learn about the correction?
Data movement between systems can be designed to look for changes in state between yesterday's snapshot of the world and today's, and pass the deltas on to the systems that care, but that's not always a cheap or easy thing to do. And what if it's the users of the downstream system who discover that the upstream data is wrong? Back to our commission example: sales reps know perfectly well how much commission they should earn for the sale of widgets. If you overpay, they might even tell you that you got it wrong. If you underpay, they absolutely will inform you. So now the commission system is discovering that the sales data is incorrect. How should it handle that? Communicate the change back upstream in the hopes that the data will be corrected and reloaded? Some systems can be made to do that, but some can't. Or overwrite the sales data it pays on? That leaves us with two trusted systems that are now out of sync with each other. And depending on how the system interactions have been defined, it's absolutely possible that the downstream system will get an even later feed from the upstream system that overwrites the corrected data back to the bad condition — say, changing the color of the sold widgets, but restoring the X19N model number that was wrong in the first place. Now the commission system will scramble to re-right the re-wronged calculation.
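The snapshot-comparison pattern described above can be sketched in a few lines. This is a minimal field-level diff with hypothetical record shapes and keys, not a production change-data-capture pipeline:

```python
# A minimal snapshot-diff sketch: compare yesterday's extract with today's,
# keyed on a business key, and report which attributes changed so each
# downstream system can decide whether it cares.
# (Record shapes and keys are hypothetical.)

def diff_snapshots(yesterday: dict, today: dict) -> dict:
    """Return {key: {field: (old, new)}} for records whose fields changed."""
    deltas = {}
    for key, new_rec in today.items():
        old_rec = yesterday.get(key)
        if old_rec is None:
            # Brand-new record: every field counts as a change from nothing.
            deltas[key] = {f: (None, v) for f, v in new_rec.items()}
            continue
        changed = {f: (old_rec.get(f), v)
                   for f, v in new_rec.items() if old_rec.get(f) != v}
        if changed:
            deltas[key] = changed
    return deltas

yesterday = {1001: {"model": "X19N", "color": "blue"}}
today     = {1001: {"model": "X19M", "color": "blue"}}

print(diff_snapshots(yesterday, today))
# {1001: {'model': ('X19N', 'X19M')}}
```

The expensive parts in practice are not the diff itself but retaining the snapshots, agreeing on the business key, and deciding which consumers are notified of which fields.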
Is there a prescription here? Yes, but not a very precise one. When pondering the requirements and design of downstream systems — systems that are dependent on source systems for some or all of their data — you must look beyond the rules and calculations that your new system is designed to perform. Think about what happens when the source systems get it wrong. Ask yourself a couple of questions: how will the downstream system know a correction was made, and what should happen the second, the third, the nth time it sees the record of an event? Whenever possible, implement systems statefully, with a historical record of the known changes identified by effective dates.
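The stateful prescription can be sketched as an append-only, effective-dated history: corrections add a new version rather than overwriting the old one, so the system can answer both "what is true now?" and "what did we believe on date D?" The table layout and field names here are hypothetical:

```python
from datetime import date

# An append-only, effective-dated history: each correction adds a new
# version instead of overwriting. (Field names are hypothetical.)

history: list[dict] = []

def record(sale_id: int, model: str, effective: date) -> None:
    """Append a new version of the fact, stamped with its effective date."""
    history.append({"sale_id": sale_id, "model": model, "effective": effective})

def as_of(sale_id: int, when: date):
    """What did we believe about this sale on a given date?"""
    versions = [h for h in history
                if h["sale_id"] == sale_id and h["effective"] <= when]
    if not versions:
        return None
    return max(versions, key=lambda h: h["effective"])["model"]

record(1001, "X19N", date(2024, 3, 1))   # the original (wrong) entry
record(1001, "X19M", date(2024, 3, 8))   # the correction

print(as_of(1001, date(2024, 3, 5)))   # X19N -- what we believed back then
print(as_of(1001, date(2024, 3, 9)))   # X19M -- the current truth
```

With this shape, a late-arriving feed that re-asserts the old value simply becomes another dated version to reconcile, rather than a silent destruction of the correction.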
Many of these ideas are really about operational policies and procedures, not data or processing. But when money might change hands based on what each system does, it’s well worth the exercise of documenting the new future state of operations, bearing changes to data firmly in mind. The system you’re designing might not care, but I promise that all the related systems in the enterprise certainly will. It’s not enough to get the right answer the first time — you must get it right every time.