Bad data: Nameless Amazon bestsellers

May 3rd, 2022
Many Amazon marketplace customers know that its huge product catalogue has data quality issues. However, they might expect its top sellers, which they frequently see and buy, to be accurate. Bold Data, which processes hundreds of millions of products daily, has a unique ability to find hidden insights and issues. One example: active Amazon bestsellers whose names are the result of data processing errors.
Bestsellers missing names

Amazon serves data on hundreds of millions of products on its websites. Much of it is provided by marketplace sellers, and many of the products are rarely seen or sold. Over the last decade, Bold Data's founder has seen a lot of poor data in this long tail, including gems like Amazon test data that somehow made it onto the public website.

Last week, while we were processing the bestsellers of Amazon's websites for the United States, Great Britain, and Germany, something peculiar surfaced. Bestsellers are products ranked in the top 100 in at least one product category. We found that four bestsellers had no names. Based on how the Amazon bestseller website presents its data, that should not be possible. What happened?

Error in, error out

To people familiar with data or software engineering, a look at the Amazon web pages for the nameless products quickly makes apparent what happened.

But even if you are not an engineer, you can see that the names in the images sound strange. Computer systems use markers such as NULL, NaN, and NA to indicate that a field or attribute has no data. In simple terms, and making some inferences, the upload into Amazon contained such a no-data marker, e.g. NULL. Instead of the upload failing, the marker was converted into text and wrongly stored as the product name in Amazon's product catalogue. When Bold Data analysed the data, the error reappeared.
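
To make the inferred mechanism concrete, here is a minimal Python sketch. It is not Amazon's or Bold Data's actual code. The function names and the list of placeholder strings are assumptions made purely for illustration. The first function shows how a missing value can silently turn into text. The second shows one way an ingestion step might catch it.

```python
import math

# Hypothetical list of strings that commonly stand in for "no data".
PLACEHOLDER_NAMES = {"", "null", "none", "nan", "na", "n/a"}


def naive_name(value):
    """Mimic the suspected bug: any value, even a missing one, becomes text."""
    return str(value)


def clean_name(value):
    """Return a usable product name, or None if the value signals missing data."""
    if value is None:
        return None
    # A float NaN is a frequent "no data" marker in numeric tooling.
    if isinstance(value, float) and math.isnan(value):
        return None
    text = str(value).strip()
    # Treat placeholder text as missing, regardless of letter case.
    if text.lower() in PLACEHOLDER_NAMES:
        return None
    return text
```

With the naive conversion, a missing name (None in Python) is stored as the four-letter text "None", which looks like a valid name to every later system. The stricter version rejects it. The trade-off is that a product genuinely named "NA" would also be rejected, so such a check is better used to flag records for review than to drop them.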

You may know the saying "garbage in, garbage out", which computer scientists use. Data engineers and data scientists in particular use it to highlight that if your foundation, the data, is flawed, then so is the outcome. Experienced data professionals therefore prioritise sourcing and processing accurate data over applying increasingly complex analytics or machine learning algorithms.

One in a million error

While it is unfortunate and surprising that these items made it into the bestsellers without names, it is not as bad as it seems. The dataset analysed comprised 4.88 million products from three Amazon websites: the United States, Great Britain, and Germany. So close to one in a million was wrong, a small number. However, it demonstrates that our systems have to expect and accommodate data errors. Mined data is only as good as its source system.
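
The arithmetic behind "close to one in a million" can be checked in a few lines of Python. This sketch assumes that the four nameless bestsellers are the only affected products among the 4.88 million analysed. The function name is hypothetical.

```python
def errors_per_million(errors, total):
    """Return the number of erroneous records per one million records."""
    return errors / total * 1_000_000


nameless = 4            # nameless bestsellers found
products = 4_880_000    # products in the analysed dataset

rate = errors_per_million(nameless, products)
one_in = products / nameless

print(f"{rate:.2f} errors per million products")
print(f"one error in every {one_in:,.0f} products")
```

Under that assumption, the rate is about 0.82 errors per million products, or one error in every 1.22 million.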

While the error rate is low, the product name is a prominent attribute, and other attributes do not receive as much scrutiny. We will publish datasets, analyses and more findings in the future. Be sure to subscribe to our email list so as not to miss these updates.

Bold Data

The challenge described in this post is one of many that data mining and data engineering face daily. The Internet has an abundance of valuable data. Mining and processing data at scale, at low cost and with high confidence and quality, is complex and requires decades of experience. This is precisely what Bold Data focuses on for its customers: creating affordable, reliable datasets and decision support analytics so you can make better decisions daily.

    Let's talk

    Do you have a business problem in need of data and analysis? Send us an email.

    Subscribe to updates

    Join Bold Data's email list to receive free data and updates.