Professional MQM-annotated Machine Translation Dataset – 16 Language Pairs, 48 Annotators

Disclosure: this is our own dataset.

Our dataset consists of 362 translation segments annotated by 48 professional linguists (not crowdsourced) across 16 language pairs.

MT systems evaluated: EuroLLM-22B, Qwen3-235B, TranslateGemma-12B.

Language pairs (all from English): Arabic (MSA, Egyptian, Moroccan, Saudi), Belarusian, French, German, Hmong, Italian, Japanese, Korean, Polish, Portuguese (Brazilian and European), Russian, Ukrainian.

Each segment includes full MQM error annotations:

  • error category (accuracy, fluency, terminology, etc.)
  • severity level (minor, major, critical)
  • exact error span in the text
  • multiple annotators per segment for inter-annotator agreement analysis
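With annotations in that shape, segment-level quality is typically computed as a sum of severity-weighted penalties. A minimal sketch, assuming hypothetical field names and the common WMT-style weights (minor = 1, major = 5, critical = 10; the dataset's actual schema and weighting may differ):

```python
# Hypothetical severity weights; check the dataset card for the real scheme.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_score(errors):
    """Sum severity-weighted penalties for one segment (lower is better)."""
    return sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)

# Illustrative annotation records (field names are assumptions, not the
# dataset's documented schema).
segment_errors = [
    {"category": "accuracy/mistranslation", "severity": "major", "span": (12, 19)},
    {"category": "fluency/grammar", "severity": "minor", "span": (34, 40)},
]
print(mqm_score(segment_errors))  # 5 + 1 = 6
```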

Methodology follows WMT guidelines. Inter-annotator agreement is Kendall’s τ = 0.317, roughly 2.6× the agreement typically reported in WMT evaluation campaigns.
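For readers unfamiliar with the metric: Kendall’s τ measures agreement between two rankings as (concordant pairs − discordant pairs) / total pairs. A self-contained sketch of the τ-a variant on made-up annotator scores (the dataset’s IAA figure may use a different variant or pairing protocol):

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.
    Tied pairs count as neither concordant nor discordant."""
    assert len(x) == len(y)
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical segment-level scores from two annotators:
# one swapped pair out of 10 gives tau = (9 - 1) / 10 = 0.8.
print(kendall_tau_a([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))  # 0.8
```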

The dataset may be useful for MT evaluation research and for benchmarking translation quality.

Dataset: https://huggingface.co/datasets/alconost/mqm-translation-gold

Happy to answer questions about the annotation process!

submitted by /u/ritis88
