Your data lies: Be a data-driven Luddite

June 29th, 2022
I have worked with data for decades. There are two key lessons I share with every customer, stakeholder and beginner in the field. Firstly, follow the data, not (only) your gut or experience. Secondly, never trust the data. Why, and what does this have to do with Luddites?

Data validates ideas

The first point is well-examined and self-evident to anyone who has worked with data, so I will not dwell on it. In short, experience and gut are helpful to spark ideas, explore new avenues or unearth causalities, but metrics and hard data validate and steer us to valuable, measurable outcomes. So why should you not trust the data?

Everybody lies, including data

The problem with data is that it is rarely correct, especially if it comes from data mining or external parties, contains user input, or passes through legacy and complex systems. Such data suffers from broken processes, late arrivals, fat-finger input, broken Unicode characters, localisation issues, and sometimes simple lies.

For example, if you mine a marketplace like Amazon, you can expect innocent and malicious inputs from users. Which reviews are genuine, and which are bought? Are the GTINs (product identifiers) correct, merely copied from another (original?) product, or faked to fill a required field?
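One cheap first check for made-up identifiers is the GTIN check digit. The following is a minimal sketch in Python, assuming the standard GS1 mod-10 scheme for GTIN-8, GTIN-12, GTIN-13 and GTIN-14; the function name is only illustrative. Note its limits: it catches typos and random filler, but a GTIN copied from another product will pass, so it cannot tell you whether the identifier belongs to the item.

```python
def gtin_check_digit_ok(gtin):
    """Return True if gtin is 8, 12, 13 or 14 ASCII digits with a valid check digit."""
    if not (gtin.isascii() and gtin.isdigit()) or len(gtin) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in gtin]
    body, check = digits[:-1], digits[-1]
    # Weights alternate 3, 1, 3, ... starting at the digit next to the check digit.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check


# A well-formed EAN-13, and the same number with a wrong last digit.
well_formed = gtin_check_digit_ok("4006381333931")
typo = gtin_check_digit_ok("4006381333932")
```

Here `well_formed` is `True` and `typo` is `False`. Anything that passes still deserves a manual look if it shows up on suspiciously many different products.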

Be a Luddite, occasionally

I have countless stories of data assumed to be correct that turned out to be untrue. Big data from legacy or complex systems is a common source. I remember building a new analysis with a team on top of a data source, only to discover that some of the transaction data was missing. It was a small mistake in how the batch processes synchronised at night, overlooked by inexperienced staff: a silent error.

The true source of the issue was that the people in charge of moving the data were not the same as those preparing the analysis, who in turn were not the same as those using the data. It is also an excellent example in favour of cross-functional agile product development, but that is another topic. Only when we started checking the data and the results in detail, diving into the data 'manually', slicing it in all possible ways and going up and down the processing chain, did the issue and its pattern become obvious. A small number of transactions got lost, so sales were underreported, every day!
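A silent loss like this tends to show up when you compare row counts per day at both ends of the processing chain. The sketch below is not what we ran at the time; it is a minimal Python illustration, and the function name and the sample dates are made up for the example.

```python
from collections import Counter
from datetime import date


def daily_count_gaps(source_days, target_days):
    """Return {day: (source_count, target_count)} for every day where the counts differ."""
    source = Counter(source_days)
    target = Counter(target_days)
    return {
        day: (source[day], target[day])
        for day in sorted(set(source) | set(target))
        if source[day] != target[day]
    }


# Illustrative data: one transaction date per row, one row lost on 2 June.
source_rows = [date(2022, 6, 1)] * 5 + [date(2022, 6, 2)] * 4
target_rows = [date(2022, 6, 1)] * 5 + [date(2022, 6, 2)] * 3
gaps = daily_count_gaps(source_rows, target_rows)
```

With this sample, `gaps` contains a single entry for 2 June with counts `(4, 3)`. Counts are only a starting point; matching sums of sales amounts per day would be the obvious next slice.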

Another example is a product feed that a retailer sent to a service company, which turned it into a data feed for advertising on a search platform. As part of the work at Bold Data, I dug through the final feed 'manually' and found some issues immediately. The retailer was paying another company to ingest and prepare its data, yet the result was broken in various ways, likely impacting the final advertising performance: a potentially costly error from a paid-for service.

But before judging these companies, think about the number of processes, data sources, feeds, stores and humans involved in your data processes. Can you confidently say that they all have automated metrics and robust checks to validate them technically and logically? Do you have competent staff with the time and incentives to check these data flows to harden them or find issues?
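To make 'technically and logically' a little more concrete, here is a minimal Python sketch of a per-row feed check. The field names (`id`, `title`, `price`) and the rules are assumptions for illustration, not the schema or the issues from the feed mentioned above.

```python
def check_feed_row(row):
    """Return a list of problems found in one feed row (a dict); empty if none."""
    problems = []

    # Technical check: required fields must be present and non-blank.
    for field in ("id", "title", "price"):
        value = row.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing {field}")

    # Technical check: U+FFFD usually means text was decoded with the wrong encoding.
    if "\ufffd" in str(row.get("title") or ""):
        problems.append("title contains U+FFFD replacement character")

    # Logical check: a price must parse as a number and be positive.
    price_raw = row.get("price")
    if price_raw is not None and str(price_raw).strip():
        try:
            price = float(price_raw)
        except (TypeError, ValueError):
            problems.append("price is not a number")
        else:
            if not price > 0:
                problems.append("price is not positive")

    return problems


# A decimal comma is a typical localisation issue and fails the number check.
issues = check_feed_row({"id": "2", "title": "Caf\ufffd mug", "price": "9,99"})
```

Checks like these belong in automation, but someone still has to read the rejected rows and, now and then, the accepted ones.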

Get your hands dirty

Here are a few tips to improve your situation. Nowadays, with the plethora of tools and services offering shiny UIs and automation, it is tempting to plug and play, but it is more plug and pray if you can't pop the hood and check for yourself. Pay extra to get skilled staff at all levels: people who can drop onto a console or a query engine and dig through logs, databases, and feeds, or who can slice and dice the data in varying ways.

Have clear communication and collaboration between technical and business staff (again, cross-functional teams are fantastic) to ensure data is also logically correct. Most importantly, do not trust data that a data-driven Luddite has not validated. That is someone who does not just watch automated tools or high-level metrics, which are essential for smooth operations, but who also gets in there and occasionally digs around.

By Christian Prokopp, Founder of Bold Data
