Hello redditors,
Food-101N is a computer vision dataset, a variant of Food-101 with extra images and added label noise. I spent some time using an automated data correction platform to quantify how much noise this dataset actually contains. With over 100k examples, manual inspection isn’t an option.
To my surprise, I didn’t just find label noise; I also found outliers, ambiguous examples, and duplicates. It was an eye-opener to see thousands of issues that the authors’ “disclaimer” about added label noise does not cover.
Here’s a quick breakdown of what I found:
27,488 Mislabeled Examples
8,519 Outliers
13,538 Ambiguous Examples
17,510 (Near) Duplicate Examples
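For anyone wondering how mislabeled examples can be flagged automatically at all, here is a minimal sketch of one common idea: compare each given label against a model's predicted probabilities, using a per-class confidence threshold. This is a simplified illustration and not necessarily how Cleanlab Studio works internally. The function name `flag_label_issues` and the toy data are made up for the example, and it assumes you already have out-of-sample predicted probabilities from some classifier.

```python
from collections import defaultdict


def flag_label_issues(labels, pred_probs):
    """Return indices of examples whose given label disagrees with a
    confident model prediction (hypothetical helper, simplified sketch)."""
    n_classes = len(pred_probs[0])

    # Per-class threshold: the average probability the model assigns to
    # class k across the examples that are labeled k.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for y, probs in zip(labels, pred_probs):
        sums[y] += probs[y]
        counts[y] += 1
    thresholds = [
        sums[k] / counts[k] if counts[k] else 1.0 for k in range(n_classes)
    ]

    flagged = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        # Classes the model is "confident" about for this example.
        confident = [k for k in range(n_classes) if probs[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: probs[k])
            # Flag when the confident class differs from the given label.
            if best != y:
                flagged.append(i)
    return flagged


# Toy data: six examples, two classes (made-up numbers).
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model strongly prefers class 1
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
print(flag_label_issues(labels, pred_probs))
```

On the toy data only the third example (index 2) gets flagged. On a real dataset the predicted probabilities should come from cross-validation, so the model never scores an example it was trained on.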
If you’d like to read and see more, you can check out the article. It has many visuals of these errors that I wish I could upload here.
* Disclaimer: I am a data scientist at Cleanlab, which builds Cleanlab Studio, the automated data correction platform that I used to find these issues.
submitted by /u/cmauck10