{"id":17108,"date":"2023-04-07T12:27:38","date_gmt":"2023-04-07T10:27:38","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/how-to-represent-large-categorical-data\/"},"modified":"2023-04-07T12:27:38","modified_gmt":"2023-04-07T10:27:38","slug":"how-to-represent-large-categorical-data","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/how-to-represent-large-categorical-data\/","title":{"rendered":"How To Represent Large Categorical Data?"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>I have 10 large numerical datasets, each with 3 generic categories. Each row contains unique data. The last column of each dataset contains the category label for that row. Categories are not unique to a row, so any row may belong to any of the 3 categories.<\/p>\n<p>For example:<\/p>\n<table>\n<thead>\n<tr><th>Date<\/th><th>Value<\/th><th>Category<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>1\/1\/2010<\/td><td>1.11111<\/td><td>Alpha<\/td><\/tr>\n<tr><td>2\/1\/2010<\/td><td>2.11111<\/td><td>Beta<\/td><\/tr>\n<tr><td>3\/1\/2010<\/td><td>2.00009<\/td><td>Alpha<\/td><\/tr>\n<tr><td>4\/1\/2010<\/td><td>0.00000<\/td><td>Charlie<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>The 10 datasets differ in size: dataset A may have 10K rows, dataset B around 100K, dataset C 1 million, and so on.<\/p>\n<p>I can&#8217;t process all the data because it is too large.<\/p>\n<p>What would be the best way to sample each dataset? 
I&#8217;d like each sample to contain a fair representation of the 3 categories.<\/p>\n<\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/runnersgo\"> \/u\/runnersgo <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/12egh51\/how_to_represent_large_categorical_data\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/12egh51\/how_to_represent_large_categorical_data\/\">[comments]<\/a><\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>I have 10 large numerical datasets, each with 3 generic categories. 
Each row contains unique data&#8230;.<\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-17108","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/17108","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=17108"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/17108\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=17108"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=17108"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=17108"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}