[self-promotion] Introducing SymptomCheck Bench: An Open-Source Benchmark For Testing Diagnostic Accuracy Of Medical LLM Agents

Hi everyone! I wanted to share a benchmark we developed for testing our LLM-based symptom checker app. We built this because existing static benchmarks (like MedQA, PubMedQA) didn’t fully capture the real-world utility of our app. With no suitable benchmark available, we created our own and are open-sourcing it in the spirit of transparency.

GitHub: https://github.com/medaks/symptomcheck-bench

Quick Summary:

We call it SymptomCheck Bench because it tests the core functionality of symptom checker apps—extracting symptoms through text-based conversations and generating possible diagnoses. It’s designed to evaluate how well an LLM-based agent can perform this task in a simulated setting.

The benchmark has three main components:

- Patient Simulator: Responds to agent questions based on clinical vignettes.
- Symptom Checker Agent: Gathers information (limited to 12 questions) to form a diagnosis.
- Evaluator Agent: Compares the symptom checker's diagnoses against the ground-truth diagnosis.
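To make the interaction between the three components concrete, here's a minimal sketch of how such a simulation loop could look, assuming an OpenAI-compatible chat API. The prompts, vignette format, and the `ask`/`run_case` helpers are illustrative assumptions, not the benchmark's actual code (see the repo for the real implementation).

```python
# Minimal sketch of the three-agent loop, assuming an OpenAI-compatible chat API.
# Prompts and helper names are illustrative, not the repo's actual code.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # any supported chat model

def ask(system_prompt, messages):
    """Single chat completion under a fixed system prompt."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system_prompt}, *messages],
    )
    return resp.choices[0].message.content

def run_case(vignette, max_questions=12):
    """Symptom checker interviews the simulated patient, then lists likely diagnoses."""
    patient_prompt = (
        "You are a patient. Answer questions truthfully, using only the "
        f"information in this clinical vignette:\n{vignette}"
    )
    checker_prompt = (
        "You are a symptom checker. Ask one question at a time to narrow down "
        "the diagnosis. After the interview, list the most likely diagnoses."
    )
    dialogue = []
    for _ in range(max_questions):
        # Checker asks its next question based on the conversation so far.
        question = ask(checker_prompt, dialogue + [{"role": "user", "content": "Ask your next question."}])
        # Simulated patient answers using only the vignette.
        answer = ask(patient_prompt, [{"role": "user", "content": question}])
        dialogue += [
            {"role": "assistant", "content": question},
            {"role": "user", "content": answer},
        ]
    return ask(checker_prompt, dialogue + [{"role": "user", "content": "List your top 5 most likely diagnoses."}])
```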

Key Features:

- 400 clinical vignettes from a study comparing commercial symptom checkers
- Multiple LLM support (GPT series, Mistral, Claude, DeepSeek)
- Auto-evaluation system validated against human medical experts
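For the auto-evaluation step, a rough illustration of what an LLM-judge check could look like, reusing the hypothetical `ask()` helper from the sketch above; the judge prompt and yes/no matching rule are assumptions, not the benchmark's exact logic.

```python
# Hedged sketch of an LLM-based evaluator; prompt and matching rule are assumptions.
def evaluate(predicted_diagnoses: str, ground_truth: str) -> bool:
    """Ask a judge model whether the ground-truth diagnosis appears among the predictions."""
    verdict = ask(
        "You are a medical evaluator. Answer only 'yes' or 'no'.",
        [{"role": "user", "content": (
            f"Ground-truth diagnosis: {ground_truth}\n"
            f"Predicted diagnoses:\n{predicted_diagnoses}\n"
            "Does any predicted diagnosis match the ground truth "
            "(same condition, synonyms allowed)?"
        )}],
    )
    return verdict.strip().lower().startswith("yes")
```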

We know it’s not perfect, but we believe it’s a step in the right direction for more realistic medical AI evaluation. Would love to hear your thoughts and suggestions for improvement!

submitted by /u/Significant-Pair-275
