Bad Data: Nameless Amazon bestsellers

By Christian Prokopp on 2022-05-03

Many Amazon marketplace customers know that its huge product catalogue has data quality issues. However, they might expect its top sellers, which they frequently see and buy, to be accurate. Bold Data processes hundreds of millions of products daily, which puts it in a unique position to find hidden insights and issues. One example is active Amazon bestsellers whose names are the result of data processing errors.

Amazon serves data on hundreds of millions of products on its websites. Marketplace sellers provide much of it, and many products are rarely seen or sold. Over the last decade, Bold Data's founder has seen a lot of poor data in this long tail. This included gems like Amazon test data that somehow made it onto the public website.

Last week, while processing the bestsellers of amazon.co.uk, amazon.de, and amazon.com, something peculiar surfaced. Bestsellers are products ranked in the top 100 in at least one product category. We found that four bestsellers had no names. Based on how the Amazon bestseller website presents its data, that should not be possible. What happened?

Error in error out

Looking at the Amazon web pages for the nameless products, what happened quickly becomes apparent to people familiar with data or software engineering.

But even if you are not an engineer, you can see that the names in the images look strange. Computer systems use values such as NULL, NaN, and NA to indicate that a field or attribute has no data. In simple terms, and making some inferences, the upload into Amazon contained such a no-data marker, e.g. NULL. Instead of causing the upload to fail, the marker was converted into text and wrongly stored as the product name in Amazon's product catalogue. When Bold Data analysed the data, the error resurfaced.
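A minimal Python sketch can illustrate how this kind of error may arise. It is not Amazon's actual pipeline, which is unknown to us. The function names and the placeholder list are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical list of strings that usually mean "no data" rather than a name.
# A real catalogue would need care: "NA" could also be a legitimate product name.
PLACEHOLDER_NAMES = {"", "null", "none", "nan", "na", "n/a", "undefined"}


def naive_name(value):
    """Convert whatever arrives into text, as a careless import step might."""
    return str(value)


def is_placeholder_name(name):
    """Return True if the name looks like a missing-value marker."""
    if name is None:
        return True
    if isinstance(name, float) and math.isnan(name):
        return True
    return str(name).strip().lower() in PLACEHOLDER_NAMES


# The missing value silently becomes an ordinary-looking string.
print(naive_name(None))          # None
print(naive_name(float("nan")))  # nan

# A validation step could flag such names before they are stored.
print(is_placeholder_name(naive_name(None)))    # True
print(is_placeholder_name("Wireless Earbuds"))  # False
```

The point of the sketch is that the conversion itself never fails, so the bad value only becomes visible if something downstream checks for it.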

You may know the saying "garbage in, garbage out", which computer scientists use. Data engineers and data scientists in particular use it to highlight that if your foundation, the data, is flawed, the outcome will be too. Therefore, experienced data professionals prioritise sourcing and processing accurate data over applying increasingly complex analytics or machine learning algorithms.

One in a million error

It is unfortunate and surprising that these items made it into the bestsellers without names, but it is not as bad as it seems. The dataset analysed comprised 4.88 million products from three Amazon websites: the United States, Great Britain, and Germany. Four nameless products out of 4.88 million is close to one in a million, a small number. However, it demonstrates that our systems have to expect and accommodate data errors. Mined data is only as good as its source system.
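For readers who want to check the arithmetic, here is a short Python sketch using only the two figures from this post, four nameless bestsellers and 4.88 million products. The variable names are ours.

```python
nameless_products = 4        # nameless bestsellers found
total_products = 4_880_000   # products in the analysed dataset

# Errors per million products.
errors_per_million = nameless_products / total_products * 1_000_000

# Equivalent "one in N" figure.
one_in_n = total_products / nameless_products

print(f"{errors_per_million:.2f} errors per million products")
print(f"roughly one in {one_in_n:,.0f} products")
```

This works out to about 0.82 errors per million products, or one in 1.22 million.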

While the error rate is low, the product name is a prominent attribute, and other attributes receive less scrutiny. We will publish datasets, analyses and more findings in the future. Be sure to subscribe to our email list so as not to miss these updates.

Bold Data

The challenge described in this post is one of many that data mining and data engineering face daily. The Internet has an abundance of valuable data. Mining and processing data at scale, at low cost, and with high confidence and quality is complex and requires decades of experience. This is precisely what Bold Data focuses on for our customers: creating affordable, reliable datasets and decision support analytics so you can make better decisions daily.

Christian Prokopp, PhD, is an experienced data and AI advisor and founder who has worked with Cloud Computing, Data and AI for decades, from hands-on engineering in startups to senior executive positions in global corporations. You can contact him at christian@bolddata.biz for inquiries.
