Sharing a dataset I work on. REFUTE is an Apache-2.0 benchmark for testing whether models can critique recent science summaries with calibrated, evidence-grounded judgment.
Configs:
– refute_soundness: judge-free split (no LLM judge needed to score)
– refute_hard_60 / refute_120: harder vignettes
Each item is a paper summary (some with planted flaws, overclaims, or missing evidence) plus gold labels and confidence targets. Confidences are scored with the Brier score (a strictly proper scoring rule), so calibration is measured rather than just accuracy.
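For anyone unfamiliar with Brier scoring, here is a minimal sketch of the idea. It assumes binary gold labels (1 = the claim holds, 0 = it does not) and one confidence in [0, 1] per item. The function name and the toy numbers are made up for illustration, and the benchmark's actual scoring code may differ.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome.

    Lower is better: 0.0 is perfect, and always answering 0.5 scores 0.25.
    """
    if len(confidences) != len(outcomes) or len(confidences) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((p - y) ** 2 for p, y in zip(confidences, outcomes))
    return total / len(confidences)


# Toy data (hypothetical, not taken from the dataset).
gold = [1, 0, 1, 1]
graded = [0.9, 0.2, 0.8, 0.7]      # confident where right, cautious elsewhere
always_sure = [1.0, 1.0, 1.0, 1.0]  # claims certainty on every item

print("graded:", brier_score(graded, gold))
print("always sure:", brier_score(always_sure, gold))
```

Because the rule is strictly proper, a model minimizes its expected score only by reporting its true belief, so inflating confidence on shaky items costs points even when the hard label happens to be right.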
License: Apache-2.0
Load: load_dataset("BGPT-OFFICIAL/refute", "refute_soundness")
Dataset: https://huggingface.co/datasets/BGPT-OFFICIAL/refute
Leaderboard: https://huggingface.co/spaces/BGPT-OFFICIAL/refute-leaderboard
Happy to answer questions about how it was constructed and labeled.
submitted by /u/connerpro