Hi everyone! I wanted to share a benchmark we developed for testing our LLM-based symptom checker app. We built it because existing static benchmarks (like MedQA and PubMedQA) don't fully capture the real-world utility of our app; with no suitable alternative available, we created our own and are open-sourcing it in the spirit of transparency.
GitHub: https://github.com/medaks/symptomcheck-bench
Quick Summary:
We call it SymptomCheck Bench because it tests the core functionality of symptom checker apps—extracting symptoms through text-based conversations and generating possible diagnoses. It’s designed to evaluate how well an LLM-based agent can perform this task in a simulated setting.
The benchmark has three main components (a rough sketch of how they fit together follows this list):

- Patient Simulator: responds to the agent's questions based on a clinical vignette.
- Symptom Checker Agent: gathers information (limited to 12 questions) to form a diagnosis.
- Evaluator Agent: compares the symptom checker's diagnoses against the ground-truth diagnosis.
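For concreteness, here is a minimal sketch of how the three components interact. This is an illustration, not the repo's actual code: the model name, prompts, and the final "top 5" cutoff are assumptions; only the 12-question budget comes from the post.

```python
# Minimal sketch of the benchmark loop (illustration only, not the repo's code):
# a symptom-checker agent questions a simulated patient built from a clinical
# vignette, stops after at most 12 questions, then produces its diagnoses.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MAX_QUESTIONS = 12  # question budget mentioned in the post

def chat(system_prompt: str, messages: list[dict]) -> str:
    """One chat-completion call; the model name is an assumption."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system_prompt}] + messages,
    )
    return response.choices[0].message.content

def run_case(vignette: str) -> str:
    patient_prompt = (
        "You are a patient. Answer the clinician's questions using only "
        f"this case description:\n{vignette}"
    )
    agent_prompt = (
        "You are a symptom checker. Ask one focused question at a time to "
        "gather symptoms; you will be asked for your diagnoses at the end."
    )
    transcript: list[dict] = []  # conversation from the agent's point of view
    for _ in range(MAX_QUESTIONS):
        question = chat(agent_prompt, transcript + [
            {"role": "user", "content": "Ask your next question."}])
        # Simplification: the simulated patient only sees the latest question.
        answer = chat(patient_prompt, [{"role": "user", "content": question}])
        transcript += [{"role": "assistant", "content": question},
                       {"role": "user", "content": answer}]
    # Once the question budget is spent, force a differential diagnosis.
    return chat(agent_prompt, transcript + [
        {"role": "user", "content": "List your top 5 most likely diagnoses."}])
```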
Key Features:
- 400 clinical vignettes from a study comparing commercial symptom checkers
- Support for multiple LLMs (GPT series, Mistral, Claude, DeepSeek)
- Auto-evaluation system validated against human medical experts (see the sketch below)
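To make the auto-evaluation idea concrete, here is a hedged sketch of an LLM-as-judge check. The judge model and prompt wording are assumptions for illustration, not the benchmark's actual implementation:

```python
# Hypothetical sketch of the evaluation step (illustration only): an LLM judge
# is asked whether the ground-truth diagnosis appears among the agent's answers.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def matches_ground_truth(predicted_diagnoses: str, ground_truth: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model choice is an assumption
        messages=[
            {"role": "system",
             "content": "You are a medical expert. Answer strictly 'yes' or 'no'."},
            {"role": "user",
             "content": (f"Ground-truth diagnosis: {ground_truth}\n"
                         f"Candidate diagnoses:\n{predicted_diagnoses}\n"
                         "Does any candidate refer to the same condition as the "
                         "ground truth, allowing for synonyms?")},
        ],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```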
We know it’s not perfect, but we believe it’s a step in the right direction for more realistic medical AI evaluation. Would love to hear your thoughts and suggestions for improvement!