I’m working on a system that processes large medical record packets and generates a chronological timeline with evidence citations (think: turning hundreds or thousands of pages of medical records into a structured chronology).
Right now I’m trying to find datasets that resemble real world medical record packets so I can test robustness. Most of the datasets I’ve found so far are either:
• purely structured EHR tables (diagnoses, labs, etc.)
• small sets of individual clinical notes
• synthetic datasets
What I’m ideally looking for:
• Long clinical documents (discharge summaries, physician notes, operative reports)
• Multi-document patient records
• Collections of clinical PDFs or reports
• Narrative-heavy hospital documentation
• Anything resembling actual chart records rather than isolated notes
Datasets I already know about:
• MIMIC-IV / MIMIC-IV-Note (waiting for credentials, anyone have a mirror?)
• i2b2 / n2c2 clinical NLP datasets (registration to download it is closed?)
• MTSamples medical transcription dataset
submitted by /u/deputy1389
[link] [comments]