The original Dr. Duke database is a treasure trove of plant compounds, but it remains largely untapped because it cannot be easily integrated into modern machine learning pipelines.
My partner and I have spent the last few weeks manually cleaning and structurally validating 76,907 records from it. We assigned PubChem CIDs to them, verified the SMILES strings, and added bioactivity values from ChEMBL v35. We also built a query bridge to 1.55 million PubMed abstracts. The core dataset itself is now a strictly typed flat file.
I have uploaded a public 400-row sample with all 16 columns to GitHub and Zenodo so you can test the schema in Pandas or DuckDB.
GitHub: github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON
Zenodo DOI: 10.5281/zenodo.19660107
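
If you want a quick sanity check before digging into the schema, here is a minimal sketch in pandas. It assumes the sample is a JSON array of flat records saved locally; the file name `sample.json` and the helper names are hypothetical, so adjust them to whatever the repo actually ships. It only checks the row and column counts mentioned above (400 rows, 16 columns) and makes no assumptions about column names.

```python
import json
from pathlib import Path

import pandas as pd

# Counts taken from the post; everything else here is an assumption.
EXPECTED_ROWS = 400
EXPECTED_COLS = 16


def load_sample(path):
    """Load a JSON array of flat records into a DataFrame.

    Assumes the file is a single JSON array of objects (hypothetical layout).
    """
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return pd.DataFrame.from_records(records)


def check_shape(df, rows=EXPECTED_ROWS, cols=EXPECTED_COLS):
    """Return a list of human-readable problems; an empty list means the shape matches."""
    problems = []
    if len(df) != rows:
        problems.append(f"expected {rows} rows, found {len(df)}")
    if df.shape[1] != cols:
        problems.append(f"expected {cols} columns, found {df.shape[1]}")
    return problems
```

Usage would look something like `df = load_sample("sample.json")` followed by `print(check_shape(df) or "shape OK")` and `print(df.dtypes)` to inspect the inferred types. The same file should also be readable from DuckDB if you prefer SQL.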
submitted by /u/DoubleReception2962