Amazon serves data on hundreds of millions of products on its websites. Much of it is provided by marketplace sellers, and many of these products are rarely viewed or sold. Over the last decade, Bold Data's founder has seen a lot of poor data in this long tail, including gems like Amazon test data that somehow made it onto the public website.
Last week, while processing the bestsellers of amazon.co.uk, amazon.de, and amazon.com, we noticed something peculiar. Bestsellers are products ranked in the top 100 in at least one product category. We found that four bestsellers had no names. Based on how the Amazon bestseller website presents its data, that should not be possible. What happened?
For people familiar with data or software engineering, a look at the Amazon web pages for the nameless products quickly makes the cause apparent.
But even if you are not an engineer, you can see that the names in the images sound strange. Computer systems use NULL, NaN, NA, and similar markers to indicate that a field or attribute has no data. In simple terms, and making some inferences, the upload into Amazon contained such a no-data marker, e.g. NULL. Instead of the upload failing, the marker was converted into text and wrongly stored as the product name in Amazon's product catalogue. When Bold Data analysed the data, the error reappeared.
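To make the failure mode concrete, here is a minimal Python sketch of how a no-data marker can turn into text, and how a defensive check might catch it. It is an illustration only, not Amazon's or Bold Data's actual code; the function names and the list of placeholder tokens are assumptions chosen for this example.

```python
# Hypothetical illustration; names and token list are assumptions.
PLACEHOLDER_TOKENS = {"", "null", "none", "nan", "na", "n/a"}


def naive_name(value):
    """Convert whatever arrived into text, as a careless pipeline might."""
    return str(value)


def clean_name(value):
    """Return a usable product name, or None if the value is only a placeholder."""
    if value is None:
        return None
    text = str(value).strip()
    if text.lower() in PLACEHOLDER_TOKENS:
        return None
    return text


# A missing value silently becomes the text "None" in the naive version.
print(naive_name(None))
# The defensive version treats placeholder text as missing data.
print(clean_name("NULL"))
print(clean_name("Garden Hose"))
```

A check like this has a cost: a product that is legitimately called "NA" would be rejected too, so flagged records are better sent for review than silently dropped.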
You may know the saying "garbage in, garbage out", popular among computer scientists. Data engineers and data scientists in particular use it to highlight that if your foundation, the data, is flawed, the outcome will be too. Experienced data professionals therefore prioritise sourcing and processing accurate data over applying increasingly complex analytics or machine learning algorithms.
While it is unfortunate and surprising that these items made it into the bestsellers without names, it is not as bad as it seems. The dataset analysed comprised 4.88 million products from three Amazon websites: the United States, Great Britain, and Germany. With four nameless products, close to one in a million was wrong, a small number. However, it demonstrates that our systems have to expect and accommodate data errors. Mined data is only as good as its source system.
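The "close to one in a million" figure can be checked with a few lines of Python, using the four nameless bestsellers and the 4.88 million products mentioned in this post. The variable names are ours, chosen for this example.

```python
# Figures taken from this post; variable names are illustrative.
nameless_products = 4
total_products = 4_880_000

rate_per_million = nameless_products / total_products * 1_000_000
print(f"{rate_per_million:.2f} nameless products per million")  # 0.82 nameless products per million
```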
The error rate is low, but the name is a product's most prominent attribute. Other attributes receive less scrutiny and may therefore contain more errors. We will publish datasets, analyses and more findings in the future. Be sure to subscribe to our email list so you do not miss these updates.
The challenge described in this post is one of many that data mining and data engineering face daily. The Internet has an abundance of valuable data. Mining and processing it at scale, at low cost, and with high confidence and quality is complex and requires decades of experience. This is precisely what Bold Data focuses on for our customers: creating affordable, reliable datasets and decision support analytics so you can make better decisions daily.