We experimented with MNIST and BDD100K and noticed two recurring issues: about 2–4% of samples were noisy or confusing, and there was significant redundancy in the datasets.
On MNIST we achieved ~87% accuracy with only 10 training samples (1 per class), and on BDD100K we matched baseline performance using under 40% of the dataset after removing obvious redundancies and very low-quality samples.
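For context, here is a minimal sketch of the kind of redundancy pruning we mean, assuming per-sample feature embeddings from some pretrained encoder are available. The threshold and function names are illustrative, not our exact pipeline:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def prune_redundant(embeddings: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Greedily keep samples whose embedding is not too similar to any
    already-kept sample; returns indices of the retained subset.
    Note: O(N^2) similarity matrix, so this only scales to modest N."""
    sims = cosine_similarity(embeddings)
    kept = []
    for i in range(len(embeddings)):
        if all(sims[i, j] < threshold for j in kept):
            kept.append(i)
    return np.array(kept, dtype=int)

# Hypothetical usage:
# X = encoder(images)                      # shape (N, D), any pretrained encoder
# keep_idx = prune_redundant(X, threshold=0.95)
# pruned_dataset = [dataset[i] for i in keep_idx]
```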
This made us wonder why we don’t see more “dataset goldifying” approaches, where a dataset is split into tiers like the following (a rough sketch of what we mean follows the list):
- Gold subset (very clean, ~1%)
- Silver subset (medium, ~5%)
- Full dataset
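A toy sketch of such a tiering, assuming some per-sample quality score already exists (the scoring signal and the tier fractions here are placeholders, e.g. model confidence, loss under a trained model, or annotator agreement):

```python
import numpy as np

def goldify(quality_scores: np.ndarray, gold_frac: float = 0.01, silver_frac: float = 0.05):
    """Split sample indices into gold/silver/full tiers by a per-sample
    quality score (higher = cleaner). Gold is a subset of silver."""
    order = np.argsort(-quality_scores)        # best samples first
    n = len(quality_scores)
    gold = order[: int(gold_frac * n)]
    silver = order[: int(silver_frac * n)]
    full = order
    return gold, silver, full
```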
Are there any canonical methods or open-source efforts for creating curated gold/silver subsets of datasets?
submitted by /u/taranpula39