{"id":33787,"date":"2025-05-07T12:27:18","date_gmt":"2025-05-07T10:27:18","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/how-to-analyze-a-large-unstructured-data\/"},"modified":"2025-05-07T12:27:18","modified_gmt":"2025-05-07T10:27:18","slug":"how-to-analyze-a-large-unstructured-data","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/how-to-analyze-a-large-unstructured-data\/","title":{"rendered":"How To Analyze A Large Unstructured Data"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>Hi guys!<\/p>\n<p>I&#8217;ve been assigned a task by my project lead to instruction tune an open source LLM on text-based data. The problem is that this text based dataset is highly unstructured- no folder structure, no consistent structure in JSONs, sometimes even the JSONs are missing and its just plain txt file. The thing is, its super difficult to analyze this data. Its super huge- so many directories with a total space of 15GBs occupied on the disk. That&#8217;s a lot of text data. I&#8217;m not able to understand how should I parse such a large dataset. How do you guys handle such vast unstructured data? Also, I&#8217;m open to buying any paid services if they exist.<\/p>\n<\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/bugbaiter\"> \/u\/bugbaiter <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1kgtjo2\/how_to_analyze_a_large_unstructured_data\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1kgtjo2\/how_to_analyze_a_large_unstructured_data\/\">[comments]<\/a><\/span><\/p><div class='watch-action'><div class='watch-position align-right'><div class='action-like'><a class='lbg-style1 like-33787 jlk' href='javascript:void(0)' data-task='like' data-post_id='33787' data-nonce='bc39e8310e' rel='nofollow'><img class='wti-pixel' src='https:\/\/www.graviton.at\/letterswaplibrary\/wp-content\/plugins\/wti-like-post\/images\/pixel.gif' title='Like' \/><span class='lc-33787 lc'>0<\/span><\/a><\/div><\/div> <div class='status-33787 status align-right'><\/div><\/div><div class='wti-clear'><\/div>","protected":false},"excerpt":{"rendered":"<p>Hi guys! I&#8217;ve been assigned a task by my project lead to instruction tune an open source&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-33787","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/33787","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=33787"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/33787\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=33787"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=33787"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=33787"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}