[Dataset] REFUTE — Scientific Critique & Epistemic Calibration On Recent Paper Summaries (Apache-2.0)

Sharing a dataset I work on. REFUTE is an Apache-2.0 benchmark for testing whether models can critique recent science summaries with calibrated, evidence-grounded judgment.

Configs:
– refute_soundness: judge-free split (no LLM judge needed to score)
– refute_hard_60 / refute_120: harder vignettes

Each item is a paper summary (some with planted flaws, overclaims, or missing evidence) plus gold labels. The confidence targets are scored with the Brier score (a strictly proper scoring rule), so calibration is measured, not just accuracy.
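To make the calibration point concrete, here is a minimal sketch of the Brier score for the binary case. It assumes each gold label is 0 or 1 and each prediction is a probability that the label is 1; the function name and the toy numbers are made up for illustration, and REFUTE's actual scoring script may differ (for example in how it handles multi-class labels).

```python
def brier_score(confidences, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes.

    Lower is better: 0.0 is perfect, and always answering 0.5 scores 0.25.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    squared_errors = [(p - y) ** 2 for p, y in zip(confidences, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical toy data: four items, gold labels 1/0 (illustrative only).
gold = [1, 0, 1, 1]

# Both models get the same 3 of 4 items right when thresholded at 0.5,
# but the second one is maximally confident on the item it gets wrong.
hedged_model = [0.9, 0.6, 0.9, 0.9]
overconfident_model = [1.0, 1.0, 1.0, 1.0]

print(f"hedged:        {brier_score(hedged_model, gold):.4f}")
print(f"overconfident: {brier_score(overconfident_model, gold):.4f}")
```

The two toy models have identical accuracy, yet the overconfident one gets the worse (higher) Brier score, which is the behavior a strictly proper rule is meant to reward and punish.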

License: Apache-2.0
Load: load_dataset("BGPT-OFFICIAL/refute", "refute_soundness")
Dataset: https://huggingface.co/datasets/BGPT-OFFICIAL/refute
Leaderboard: https://huggingface.co/spaces/BGPT-OFFICIAL/refute-leaderboard
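For anyone who wants to look at the data before reading the card, a small sketch using the Hugging Face `datasets` library follows. The repo id and config name are copied from the post; the helper name `describe_config` is made up here, and the split and column names are not assumed, the helper just reports whatever the config contains. It needs `pip install datasets` and network access.

```python
REPO_ID = "BGPT-OFFICIAL/refute"  # repo id as given in the post
CONFIG = "refute_soundness"       # judge-free config as given in the post


def describe_config(repo_id=REPO_ID, config=CONFIG):
    """Load one config and return {split_name: (num_rows, column_names)}.

    Hypothetical helper for illustration; it downloads the data on first call.
    """
    # Imported here so the module can be read without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(repo_id, config)  # DatasetDict when no split is given
    return {
        split_name: (split.num_rows, split.column_names)
        for split_name, split in dataset.items()
    }
```

Running `print(describe_config())` should list each split with its row count and columns; passing `config="refute_hard_60"` or `config="refute_120"` would do the same for the harder vignettes, assuming those are the exact config names.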

Happy to answer questions about how it was constructed and labeled.

submitted by /u/connerpro