{"id":40684,"date":"2026-04-28T16:27:33","date_gmt":"2026-04-28T14:27:33","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/the-dr-duke-database-of-phytochemicals-contains-40-years-of-data-on-plant-compounds-and-is-virtually-unusable-for-machine-learning-i-rebuilt-it\/"},"modified":"2026-04-28T16:27:33","modified_gmt":"2026-04-28T14:27:33","slug":"the-dr-duke-database-of-phytochemicals-contains-40-years-of-data-on-plant-compounds-and-is-virtually-unusable-for-machine-learning-i-rebuilt-it","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/the-dr-duke-database-of-phytochemicals-contains-40-years-of-data-on-plant-compounds-and-is-virtually-unusable-for-machine-learning-i-rebuilt-it\/","title":{"rendered":"The Dr. Duke Database Of Phytochemicals Contains 40 Years Of Data On Plant Compounds And Is Virtually Unusable For Machine Learning &#8211; I Rebuilt It"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>The USDA Dr. Duke Database of Phytochemicals and Ethnobotany is one of the most comprehensive collections of plant-compound relationships in existence. Over 76,000 records. Decades of work. It includes notes on bioactivity, concentration ranges, and ethnobotanical uses for thousands of plant species.<\/p>\n<p>The user interface hasn\u2019t changed in about twenty years. There is no bulk export. The compounds have no standardized identifiers. SMILES strings do not exist. If your workflow requires PubChem CIDs, you have to start from scratch.<\/p>\n<p>Every team working in the field of machine learning for natural products ultimately has to preprocess the same raw data independently. I know this because I\u2019ve spoken with people who\u2019ve done it, and the same problems came up every time.<\/p>\n<p>So I rebuilt it.<\/p>\n<p>The current version: 76,907 records. 9,098 unique compounds with PubChem CID mappings. SMILES via CID lookup. USPTO patent numbers starting in 2020. 
Intervention data from ClinicalTrials.gov. Classification of compounds into discrete phytochemicals, complex mixtures, substance classes, and generic ambiguities.<\/p>\n<p>The most time-consuming part was not the data enrichment. It was the question of how to handle records where the compound name is ambiguous. RESIN has no CID. ALKALOID FRACTION has no CID. Assigning one would be incorrect. Leaving them null without documenting why leaves the next researcher in the dark. That is why I added a \u201ccompound_type\u201d column that classifies each record and documents the classification logic.<\/p>\n<p>The dataset underwent an external CID review this month. A chemistry consultant manually reviewed 13,206 compound assignments and compared them with PubChem, COCONUT, and InChI keys. One confirmed error was found and corrected. 1,534 previously null CIDs were resolved by matching them with IUPAC names. The number of null CIDs has decreased by 8%.<\/p>\n<p>The dataset is provided as Parquet and JSON. Queryable in less than five minutes using DuckDB.<\/p>\n<p>Available on HuggingFace (wirthal1990-tech\/USDA-Phytochemical-Database-JSON). 
The GitHub repository (wirthal1990-tech\/USDA-Phytochemical-Database-JSON) contains the complete MANIFEST and the methodology documentation.<\/p>\n<\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/DoubleReception2962\"> \/u\/DoubleReception2962 <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1sy2p8v\/the_dr_duke_database_of_phytochemicals_contains\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1sy2p8v\/the_dr_duke_database_of_phytochemicals_contains\/\">[comments]<\/a><\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>The USDA Dr. 
Duke Database of Phytochemicals and Ethnobotany is one of the most comprehensive collections of&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-40684","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/40684","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=40684"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/40684\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=40684"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=40684"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=40684"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}