Disclosure: this is our own dataset.
Our dataset consists of 362 translation segments annotated by 48 professional linguists (not crowdsourced) across 16 language pairs.
MT systems evaluated: EuroLLM-22B, Qwen3-235B, TranslateGemma-12B.
Language pairs (all from English): Arabic (MSA, Egyptian, Moroccan, Saudi), Belarusian, French, German, Hmong, Italian, Japanese, Korean, Polish, Portuguese (Brazilian and European), Russian, Ukrainian.
Each segment includes full MQM error annotations:
- error category (accuracy, fluency, terminology, etc.)
- severity level (minor, major, critical)
- exact error span in the text
- multiple annotators per segment for inter-annotator agreement analysis
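To make the structure above concrete, here is what a single annotation record might look like. Field names and values are purely illustrative, not the dataset's actual schema:

```python
# Hypothetical MQM annotation record -- field names are illustrative,
# not the dataset's actual schema.
annotation = {
    "segment_id": 42,
    "source": "The contract must be signed by Friday.",
    "target": "Le contrat doit être signé avant vendredi.",
    "category": "accuracy/mistranslation",  # MQM error category
    "severity": "minor",                    # minor / major / critical
    "span": (25, 31),                       # character offsets of the error span
    "annotator_id": "A07",                  # multiple annotators per segment
}

# A severity value should always be one of the three MQM levels.
assert annotation["severity"] in {"minor", "major", "critical"}
```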
Methodology follows WMT guidelines. Inter-annotator agreement: Kendall's τ = 0.317, roughly 2.6x what typical WMT campaigns report.
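For anyone wanting to reproduce the agreement numbers on their own annotations, here is a minimal sketch of Kendall's τ (the simple τ-a variant, without tie correction; WMT-style evaluations often use tie-aware variants) computed between two annotators' hypothetical severity scores:

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / (n choose 2).
    No correction for ties -- a simplified sketch, not the exact
    WMT/MQM agreement formula."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical severity rankings from two annotators over four segments:
annotator_a = [1, 2, 3, 4]
annotator_b = [1, 3, 2, 4]
print(round(kendall_tau_a(annotator_a, annotator_b), 3))  # 0.667
```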
The dataset may be useful for MT evaluation research and for benchmarking translation quality.
Dataset: https://huggingface.co/datasets/alconost/mqm-translation-gold
Happy to answer questions about the annotation process!
submitted by /u/ritis88