We experimented with MNIST and BDD100K and noticed two recurring issues: about 2–4% of samples were noisy or confusing, and there was significant redundancy in the datasets.
On MNIST we achieved ~87% accuracy with only 10 training samples (1 per class), and on BDD100K we matched baseline performance using under 40% of the dataset after removing obvious redundancies and very low-quality samples.
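For context, here is a minimal sketch of the kind of redundancy pruning we mean, assuming per-sample feature embeddings from some pretrained encoder are available. The threshold and function names are illustrative, not our exact pipeline:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def prune_redundant(embeddings: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Greedily keep samples whose embedding is not too similar to any
    already-kept sample; returns indices of the retained subset.
    Note: O(N^2) similarity matrix, so this only scales to modest N."""
    sims = cosine_similarity(embeddings)
    kept = []
    for i in range(len(embeddings)):
        if all(sims[i, j] < threshold for j in kept):
            kept.append(i)
    return np.array(kept, dtype=int)

# Hypothetical usage:
# X = encoder(images)                      # shape (N, D), any pretrained encoder
# keep_idx = prune_redundant(X, threshold=0.95)
# pruned_dataset = [dataset[i] for i in keep_idx]
```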
This made us wonder why we don’t see more “dataset goldifying” approaches, where a dataset is split into tiers like the following (a rough sketch of what we mean follows the list):
- Gold subset (very clean, ~1%)
- Silver subset (medium, ~5%)
- Full dataset
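A toy sketch of such a tiering, assuming some per-sample quality score already exists (the scoring signal and the tier fractions here are placeholders, e.g. model confidence, loss under a trained model, or annotator agreement):

```python
import numpy as np

def goldify(quality_scores: np.ndarray, gold_frac: float = 0.01, silver_frac: float = 0.05):
    """Split sample indices into gold/silver/full tiers by a per-sample
    quality score (higher = cleaner). Gold is a subset of silver."""
    order = np.argsort(-quality_scores)        # best samples first
    n = len(quality_scores)
    gold = order[: int(gold_frac * n)]
    silver = order[: int(silver_frac * n)]
    full = order
    return gold, silver, full
```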
Are there any canonical methods or open-source efforts for creating curated gold/silver subsets of datasets?
submitted by /u/taranpula39