{"id":36018,"date":"2025-10-14T14:27:23","date_gmt":"2025-10-14T12:27:23","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/how-to-improve-and-refine-categorization-for-a-large-dataset-with-26000-unique-categories\/"},"modified":"2025-10-14T14:27:23","modified_gmt":"2025-10-14T12:27:23","slug":"how-to-improve-and-refine-categorization-for-a-large-dataset-with-26000-unique-categories","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/how-to-improve-and-refine-categorization-for-a-large-dataset-with-26000-unique-categories\/","title":{"rendered":"How To Improve And Refine Categorization For A Large Dataset With 26,000 Unique Categories"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>I have got a beast of a dataset with about 2M business names and its got like 26000 categories some of the categories are off like zomato is categorized as a tech startup which is correct but on consumer basis it should be food and beverages and some are straight wrong and alot of them are confusing too But some of them are subcategories like 26000 is a whole number but on the ground it has a couple 100 categories which still is a shit load Any way that i can fix this mess as key word based cleaning aint working it will be a real help<\/p>\n<\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/Existing_Pay8831\"> \/u\/Existing_Pay8831 <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1nykh9c\/how_to_improve_and_refine_categorization_for_a\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1nykh9c\/how_to_improve_and_refine_categorization_for_a\/\">[comments]<\/a><\/span><\/p><div class='watch-action'><div class='watch-position align-right'><div class='action-like'><a class='lbg-style1 like-36018 jlk' href='javascript:void(0)' data-task='like' data-post_id='36018' data-nonce='65e0e39b87' rel='nofollow'><img class='wti-pixel' src='https:\/\/www.graviton.at\/letterswaplibrary\/wp-content\/plugins\/wti-like-post\/images\/pixel.gif' title='Like' \/><span class='lc-36018 lc'>0<\/span><\/a><\/div><\/div> <div class='status-36018 status align-right'><\/div><\/div><div class='wti-clear'><\/div>","protected":false},"excerpt":{"rendered":"<p>I have got a beast of a dataset with about 2M business names and its got like&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-36018","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/36018","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=36018"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/36018\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=36018"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=36018"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=36018"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}