One simple thing companies miss about their data

August 8th, 2022
There is one simple thing most companies miss about their data. I was lucky to have someone respected and experienced giving me this advice, or else I may have ignored it as unimportant. It has been instrumental in my work as a data professional ever since.
When you can't see the wood for the trees

Regarding data, we regularly cannot see the wood for the trees. It happened to me a decade ago at a startup, when the brilliant Ted Dunning visited us. We put on the dog and pony show to impress him with the cool stuff we had built. He listened, and the comment that stuck with me was, "All the successful startups I have seen measure themselves."

Know thyself

It was powerful. Ted did not say much about what we did wrong or what we should do. At least not directly, but we all got the message, and it hit home.

We were too comfortable with our work; we knew every line of code and every datastore. But humans are terrible at spotting patterns in data without some (graphical) reporting. I mined millions and millions of products daily but had only basic observability built in, because I knew every process and system involved. That is neither good enough nor sustainable.

We heard Ted loud and clear. We started collecting and visualising operational data and metadata to help us manage our platform and processes. Eventually, we exited to Google, so things did work out.

Define and measure data quality and processing

The one simple thing most companies miss about their data is measuring it, and measuring the right thing, which is tough. Many companies use data to measure (other) things, but few measure the data itself: its quality, processes and metadata. Is the data you use trustworthy? What are the trends in data quality, size, speed, and so on?

Since Ted's memorable visit, I have developed this further in every role, and it keeps paying dividends. Recently, for example, I spent a good week trying to figure out a weird shape in a curve tracking operational metrics on processing throughput and queuing in my staging data platform at Bold Data. The platform's overall performance was good, and so was the data quality. Still, the curve looked wonky, with jumps where there should have been smooth lines. After extensive digging and thinking, I traced it to overlapping, interfering behaviour between multiple data mining processes that only occurs under specific circumstances, and that could subtly but significantly impact SLAs and cost in the long term. Something hard to predict, test for proactively, or even detect later on.

Do not trust what you have not checked yourself

In previous roles, I always used a two-pronged approach when dealing with legacy systems and data: qualitative and quantitative. Dive into the raw data and systems manually to understand what is what, and automate metrics at a larger scale. Firstly, do not believe ancient data dictionary documents, third-party consultants, or system integrators with underpaid and overworked staff.

I found plenty of cautionary tales. Innocent-looking columns in tables contained PII because of a broken process or because users manually entered it in the wrong field. Debug logs spilt sensitive financial data from secured processes into unsecured areas. Unicode characters broke ancient SQL and processing scripts, leading to strange behaviour. A data synchronisation process missed a few financial records every time it ran, just small enough to slip through as reconciliation errors for the finance department to deal with, but monetary errors nonetheless. Or the supply chain system that dropped records when overloaded and never reconciled them with the source system, mysteriously losing packages.
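The manual dive into raw data can be bootstrapped with crude automation. A minimal sketch of what such a first pass might look like, assuming free-text columns arrive as lists of strings; the patterns and issue names are illustrative, and real PII detection needs far more than a couple of regexes:

```python
import re
import unicodedata

# Hypothetical patterns for a first pass over "innocent" text columns.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def scan_value(value: str) -> list:
    """Return a list of issues found in a single free-text field."""
    issues = []
    if EMAIL_RE.search(value):
        issues.append("possible email address")
    if CARD_RE.search(value):
        issues.append("possible card number")
    # Control characters are a common cause of legacy SQL scripts misbehaving.
    if any(unicodedata.category(c) == "Cc" for c in value if c not in "\n\t"):
        issues.append("control character")
    return issues


def scan_column(rows: list) -> dict:
    """Count issue types across a column of supposedly innocent text."""
    counts = {}
    for row in rows:
        for issue in scan_value(row):
            counts[issue] = counts.get(issue, 0) + 1
    return counts
```

A scan like this will not catch everything, but it turns "we believe this column is clean" into a number you can track per run.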

Secondly, start measuring things. It can be simple, like counting how many records go into and come out of a pipeline. What defines a good value for a field, what is the ratio of good to bad values, and how does it change over time? Look at basic stats, e.g. NULL might be acceptable in your schema, but is it acceptable that all records are NULL? How long do processes take? How are they interconnected? You get the idea.
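The record counts and NULL ratios above can be captured in a few lines. A minimal sketch, assuming each batch arrives as a list of dicts; the field names and `BatchMetrics` structure are made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class BatchMetrics:
    """Per-run pipeline metrics: record counts in/out and per-field NULL share."""
    rows_in: int
    rows_out: int
    null_ratio: dict


def measure_batch(rows_in: list, rows_out: list) -> BatchMetrics:
    # Collect every field seen in the output records.
    fields = set().union(*(r.keys() for r in rows_out)) if rows_out else set()
    # Share of None values per field; NULL may be valid, but 100% NULL rarely is.
    null_ratio = {
        f: sum(1 for r in rows_out if r.get(f) is None) / len(rows_out)
        for f in fields
    }
    return BatchMetrics(len(rows_in), len(rows_out), null_ratio)
```

Emit these numbers per run, chart them, and the trends (a sudden drop in `rows_out`, a NULL ratio creeping towards 1.0) become visible long before anyone files a ticket.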

"We do not know" is not good enough

The interesting part is that in the examples I mentioned, the risk either was not known or not appreciated, e.g. leaking financial data. Or the error was small enough that people and processes started working around it, e.g. expecting to lose X% of shipped packages or Y% in monetary errors. People came to accept that these systems and processes were too complex and demanding to fix instead of addressing the issues.

However, processes in an organisation are deterministic. They may not be understood, because they are complex or not measured, but there is no magic in them. Not knowing why things go wrong should be an alarm bell. You may ignore an issue because it is economical to do so, but not measuring and understanding it is dangerous.

My version of Ted's advice today is "Companies that do well measure, understand and can trust their data all the way from origin to consumption".


By Christian Prokopp, Founder of Bold Data

