Datalab: Automatically Detect Common Real-World Issues In Your Datasets

Hello Redditors!

I’m excited to share Datalab — a linter for datasets.

I recently published a blog introducing Datalab and an open-source Python implementation that is easy-to-use for all data types (image, text, tabular, audio, etc). For data scientists, I’ve made a quick Jupyter tutorial to run Datalab on your own data.

All of us that have dealt with real-world data know it’s full of various issues like label errors, outliers, (near) duplicates, drift, etc. One line of open-source code datalab.find_issues() automatically detects all of these issues.

In Software 2.0, data is the new code, models are the new compiler, and manually-defined data validation is the new unit test. Datalab combines any ML model with novel data quality algorithms to provide a linter for this Software 2.0 stack that automatically analyzes a dataset for “bugs”. Unlike data validation, which runs checks that you manually define via domain knowledge, Datalab adaptively checks for the issues that most commonly occur in real-world ML datasets without you having to specify their potential form. Whereas traditional dataset checks are based on simple statistics/histograms, Datalab’s checks consider all the pertinent information learned by your trained ML model.

Hope Datalab helps you automatically check your dataset for issues that may negatively impact subsequent modeling — it’s so easy to use you have no excuse not to 😛

Let me know your thoughts!

submitted by /u/jonas__m
[link] [comments]

Leave a Reply

Your email address will not be published. Required fields are marked *