By Christian Prokopp on 2022-08-08
There is one simple thing most companies miss about their data. It has been instrumental in my work as a data professional ever since.
Regarding data, we regularly cannot see the wood for the trees. It happened to me at a startup a decade ago when the brilliant Ted Dunning visited us. We did the dog and pony show to impress him with the cool stuff we had built. He listened, and the comment that stuck with me was, "All the successful startups I have seen measure themselves."
It was powerful. Ted did not say much about what we did wrong or what we should do. At least not directly, but we all got the message, and it hit home.
We were too comfortable with our work: we knew every line of code and every datastore. But humans are terrible at spotting patterns in data without some (graphical) reporting. I mined millions and millions of products daily but had only basic observability built in because I knew every process and system involved. That is neither good enough nor sustainable.
We heard Ted loud and clear. We started collecting and visualising operational data and metadata to help us manage our platform and processes. Eventually, we exited to Google, so things did work out.
The one simple thing most companies miss about their data is measuring it, and measuring the right thing, which is tough. Many companies use data to measure (other) things, but few measure the data itself: its quality, its processes and its metadata. Is the data you use trustworthy? What are the trends in data quality, size, speed, and so on?
Since Ted's memorable visit, I have developed this further in every role, and it keeps paying dividends. Recently, for example, I spent a good week trying to figure out a weird shape in a curve measuring operational metrics on processing throughput and queuing in my staging data platform at Bold Data. The platform's overall performance was good, and so was the data quality. Still, the curve looked wonky, with jumps where there should have been smooth lines. After extensive digging and thinking, I found an overlapping, interfering behaviour between multiple data mining processes that only occurs under specific circumstances and could subtly but significantly impact SLA and cost in the long term. It was something hard to predict, test for proactively, or even detect later on.
In previous roles, I always used a two-pronged approach when dealing with legacy systems and data: quality and quantity. That is, manually dive into the raw data and systems to understand what is what, and automate metrics at a larger scale. Firstly, do not believe ancient data dictionary documents, third-party consultants, or system integrators with underpaid and overworked staff.
I found plenty of cautionary tales. There were innocent columns in tables containing PII because a broken process or a user had entered it in the wrong field. Debug logs spilt sensitive financial data from secured processes into unsecured areas. Unicode characters broke ancient SQL and processing scripts, leading to strange behaviour. A data synchronisation process missed a few financial records every time it ran, just small enough to slip through as reconciliation errors for the finance department to deal with, but monetary errors nonetheless. Or the supply chain system that dropped records when overloaded and never reconciled them with the source system, mysteriously losing packages.
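Even the manual dive can be partly automated. A minimal sketch of the kind of audit that catches PII hiding in "innocent" columns: scan free-text fields for email-like values. The column names, sample rows and regex here are illustrative assumptions, not from any real system.

```python
import re

# Loose email pattern for illustration only; real PII scans would cover
# more patterns (phone numbers, card numbers, national IDs, ...).
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def scan_for_pii(rows, columns):
    """Return a mapping of column name -> count of suspicious values."""
    hits = {}
    for row in rows:
        for col in columns:
            value = str(row.get(col, ""))
            if EMAIL_RE.search(value):
                hits[col] = hits.get(col, 0) + 1
    return hits

# Hypothetical rows: "order_note" should never contain contact details.
rows = [
    {"order_note": "leave at door"},
    {"order_note": "contact jane@example.com"},  # PII in a free-text field
]
print(scan_for_pii(rows, ["order_note"]))  # one suspicious value found
```

Run regularly against columns that should be PII-free, a scan like this turns the one-off manual discovery into a repeatable check.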
Secondly, start measuring things. It can be simple, like how many records go into and come out of a pipeline. What defines good for a field, what is the ratio of good vs bad, and how does it change over time? Look at basic stats, e.g. NULL might be acceptable in your schema, but is it acceptable that all records are NULL? How long do processes take? How are they interconnected? You get the idea.
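To make the idea concrete, here is a minimal sketch of such pipeline metrics: records in, records out, drop ratio, and per-field NULL ratios. The record shapes and the validity rule are made-up assumptions; the point is only how little code a first measurement needs.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineMetrics:
    """Counts records entering and leaving a pipeline, plus NULLs per field."""
    records_in: int = 0
    records_out: int = 0
    null_counts: dict = field(default_factory=dict)

    def observe_in(self, record: dict) -> None:
        self.records_in += 1
        for key, value in record.items():
            if value is None:
                self.null_counts[key] = self.null_counts.get(key, 0) + 1

    def observe_out(self, record: dict) -> None:
        self.records_out += 1

    def report(self) -> dict:
        in_count = self.records_in
        return {
            "records_in": in_count,
            "records_out": self.records_out,
            "drop_ratio": 1 - self.records_out / in_count if in_count else 0.0,
            "null_ratio": {k: v / in_count for k, v in self.null_counts.items()},
        }

metrics = PipelineMetrics()
for rec in [{"id": 1, "price": 9.99}, {"id": 2, "price": None}]:
    metrics.observe_in(rec)
    if rec["price"] is not None:  # hypothetical validity rule for this field
        metrics.observe_out(rec)

print(metrics.report())
```

Emit a report like this per batch, store it, and the time series answers the "how does it change over time" question almost for free.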
The interesting part is that in the examples I mentioned, the risk was either unknown or unrealised, e.g. leaking financial data, or the error was small enough that people and processes started to work around it, e.g. expecting to lose X% of shipped packages or Y% in monetary errors. People accepted that these systems and processes were too complex and demanding to fix instead of addressing the issues.
However, processes in an organisation are deterministic. They may not be understood because they are complex or unmeasured, but there is no magic in them. Not knowing why things go wrong should be an alarm bell. You may ignore an issue because it is economical to do so, but not measuring and understanding it is dangerous.
My version of Ted's advice today is: "Companies that do well measure, understand and can trust their data all the way from origin to consumption."
Christian Prokopp, PhD, is an experienced data and AI advisor and founder who has worked with Cloud Computing, Data and AI for decades, from hands-on engineering in startups to senior executive positions in global corporations. You can contact him at firstname.lastname@example.org for inquiries.