{"id":18148,"date":"2023-05-16T20:27:19","date_gmt":"2023-05-16T18:27:19","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/datalab-automatically-detect-common-real-world-issues-in-your-datasets\/"},"modified":"2023-05-16T20:27:19","modified_gmt":"2023-05-16T18:27:19","slug":"datalab-automatically-detect-common-real-world-issues-in-your-datasets","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/datalab-automatically-detect-common-real-world-issues-in-your-datasets\/","title":{"rendered":"Datalab: Automatically Detect Common Real-World Issues In Your Datasets"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>Hello Redditors!<\/p>\n<p>I&#8217;m excited to share <strong>Datalab<\/strong> \u2014 a <em>linter<\/em> for datasets.<\/p>\n<p>I recently published a <a href=\"https:\/\/cleanlab.ai\/blog\/datalab\/\">blog<\/a> introducing <strong>Datalab<\/strong> and an <a href=\"https:\/\/github.com\/cleanlab\/cleanlab\">open-source<\/a> Python implementation that is easy to use for all data types (image, text, tabular, audio, etc.). For data scientists, I\u2019ve made a quick <a href=\"https:\/\/docs.cleanlab.ai\/stable\/tutorials\/datalab\/datalab_quickstart.html\">Jupyter tutorial<\/a> to run <strong>Datalab<\/strong> on your own data.<\/p>\n<p>All of us who have dealt with real-world data know it\u2019s full of various issues like label errors, outliers, (near) duplicates, drift, etc. One line of open-source code, datalab.find_issues(), automatically detects all of these issues.<\/p>\n<p>In Software 2.0, data is the new code, models are the new compiler, and manually defined data validation is the new unit test. <strong>Datalab<\/strong> combines any ML model with novel data quality algorithms to provide a <em>linter<\/em> for this Software 2.0 stack that automatically analyzes a dataset for \u201cbugs\u201d. 
Unlike <em>data validation<\/em>, which runs checks that you manually define via domain knowledge, Datalab adaptively checks for the issues that most commonly occur in real-world ML datasets without you having to specify their potential form. Whereas traditional dataset checks are based on simple statistics\/histograms, Datalab\u2019s checks consider all the pertinent information learned by your trained ML model.<\/p>\n<p>Hope Datalab helps you automatically check your dataset for issues that may negatively impact subsequent modeling &#8212; it&#8217;s so easy to use you have no excuse not to \ud83d\ude1b<\/p>\n<p>Let me know your thoughts!<\/p>\n<\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/jonas__m\"> \/u\/jonas__m <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/13jcage\/datalab_automatically_detect_common_realworld\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/13jcage\/datalab_automatically_detect_common_realworld\/\">[comments]<\/a><\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>Hello Redditors! I&#8217;m excited to share Datalab \u2014 a linter for datasets. 
I recently published a blog&#8230;<\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-18148","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/18148","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=18148"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/18148\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=18148"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=18148"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=18148"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}