Boredom Central - [Dataset] REFUTE — scientific critique & epistemic calibration on recent paper summaries (Apache-2.0)

Sharing a dataset I work on. REFUTE is an Apache-2.0 benchmark for testing whether models can critique recent science summaries with calibrated, evidence-grounded judgment.

Configs:
– refute_soundness: judge-free split (no LLM judge needed to score)
– refute_hard_60 / refute_120: harder vignettes

Each item is a paper summary (some with planted flaws, overclaims, or missing evidence) plus gold labels. Model confidences are scored against those labels with the Brier score (a strictly proper scoring rule), so calibration is measured rather than just accuracy.
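To make the calibration point concrete, here is a minimal sketch of a binary Brier score. It is not the benchmark's actual scoring code, which isn't shown in this post. The function name `brier_score`, the 0/1 label encoding (1 = the summary is flawed), and the toy numbers are all assumptions for illustration.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Lower is better; 0.0 means every prediction was both correct and certain.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    return sum((p - y) ** 2 for p, y in zip(confidences, outcomes)) / len(confidences)


# Hypothetical toy data: 1 = summary is flawed, 0 = summary is sound.
outcomes = [1, 1, 1, 0]

# Both models call every item "flawed" (same accuracy, 3 of 4 correct),
# but they report very different confidence.
overconfident = [0.99, 0.99, 0.99, 0.99]
hedged = [0.75, 0.75, 0.75, 0.75]

print("overconfident:", brier_score(overconfident, outcomes))
print("hedged:       ", brier_score(hedged, outcomes))
```

Both toy models have identical accuracy, but the hedged one gets the lower (better) Brier score because its stated 75% confidence matches how often it is right. That gap is what an accuracy-only metric would miss.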

License: Apache-2.0
Load: load_dataset("BGPT-OFFICIAL/refute", "refute_soundness")
Dataset: https://huggingface.co/datasets/BGPT-OFFICIAL/refute
Leaderboard: https://huggingface.co/spaces/BGPT-OFFICIAL/refute-leaderboard
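For anyone who wants to poke at it, a small sketch of loading one config with the Hugging Face `datasets` library. The post doesn't list split or column names, so this only prints whatever comes back instead of assuming a schema. The helper name `load_refute` is made up for this example, and it needs `pip install datasets` plus network access.

```python
REPO_ID = "BGPT-OFFICIAL/refute"
CONFIGS = ("refute_soundness", "refute_hard_60", "refute_120")


def load_refute(config: str = "refute_soundness"):
    """Load one REFUTE config from the Hugging Face Hub (hypothetical helper)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config)


if __name__ == "__main__":
    ds = load_refute("refute_soundness")
    # Printing the result shows the available splits and their columns.
    print(ds)
```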

Happy to answer questions about how it was constructed and labeled.

submitted by /u/connerpro