Translation systems powered by LLMs have become so capable that they can, in some cases, surpass human translators. As LLMs improve, especially on complex tasks such as document-level or literary translation, it becomes increasingly hard both to make further progress and to measure that progress accurately. Traditional automatic metrics such as BLEU are still widely used, but they cannot explain why a particular score was given. As translation quality approaches human level, users need assessments that go beyond a single number and instead provide reasoning along key dimensions such as accuracy, terminology, and audience appropriateness. This transparency lets users audit the evaluations themselves, identify errors, and make better-informed decisions.
Although BLEU has long been the standard for evaluating machine translation (MT), its usefulness is fading as modern systems rival or outperform human translators. Newer metrics such as BLEURT, COMET, and MetricX fine-tune powerful language models to assess translation quality more accurately. Large models such as GPT and PaLM-2 can now provide zero-shot or structured evaluations, and can even produce MQM-style feedback. Techniques such as pairwise comparison further improve agreement with human judgment. Recent research shows that asking models to explain their choices can improve the quality of their decisions. Nevertheless, such reasoning-based approaches remain underused in MT evaluation, despite their growing potential.
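To make the contrast concrete, here is a minimal sketch (in Python) of a surface-overlap metric next to a reasoning-based LLM judge. The prompt wording and dimension list are illustrative assumptions, not the exact prompts used in any of the systems above.

```python
# A minimal sketch contrasting a surface-overlap metric (BLEU) with an
# MQM-style reasoning prompt. The prompt wording and dimension names are
# illustrative assumptions.
import sacrebleu

reference = "The medication should be taken twice daily with food."
hypothesis = "Take the medicine two times a day together with meals."

# BLEU returns a single number with no explanation of *why*.
bleu = sacrebleu.sentence_bleu(hypothesis, [reference])
print(f"BLEU: {bleu.score:.1f}")  # low overlap score despite an adequate translation

# An LLM judge can instead be asked for dimension-wise, MQM-style reasoning.
mqm_prompt = f"""You are a professional translation evaluator.
Reference translation: {reference}
Candidate translation: {hypothesis}

For each dimension -- accuracy, terminology, audience appropriateness,
clarity -- explain any issues you find, then give a 1-5 Likert score.
Finish with an overall 1-5 score."""
# `mqm_prompt` would be sent to any chat-completion LLM of choice.
print(mqm_prompt)
```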
Researchers at Sakana.ai have developed TransEvalnia, a translation evaluation and ranking system that uses prompting-based reasoning to assess translation quality. It evaluates translations along selected MQM dimensions, assigns 5-point Likert scores (including an overall score), and provides detailed feedback. The system is competitive with, and in some cases better than, state-of-the-art baselines across several language pairs and tasks, including English-Japanese and Chinese-English. The LLMs tested, such as Claude 3.5 Sonnet and Qwen-2.5, produce judgments that align well with human ratings. The team also addressed position bias and released all data, reasoning outputs, and code for public use.
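The article does not spell out an output schema, so the structure below is a hypothetical sketch of how such span-level, dimension-wise Likert feedback might be represented; all field names are assumptions made for illustration.

```python
# Hypothetical representation of a TransEvalnia-style evaluation result:
# per-span, per-dimension Likert scores plus free-text reasoning and an
# overall score. Field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class SpanEvaluation:
    span_text: str   # the evaluated span of the translation
    dimension: str   # e.g. "accuracy", "terminology"
    score: int       # 1-5 Likert score
    reasoning: str   # the judge's explanation for the score

@dataclass
class TranslationEvaluation:
    source: str
    translation: str
    span_evaluations: list = field(default_factory=list)
    overall_score: int = 0  # overall 1-5 Likert score

    def mean_dimension_score(self, dimension: str) -> float:
        scores = [s.score for s in self.span_evaluations if s.dimension == dimension]
        return sum(scores) / len(scores) if scores else float("nan")

# Example usage
ev = TranslationEvaluation(
    source="薬は1日2回、食事と一緒に服用してください。",
    translation="Take the medicine twice a day with meals.",
    overall_score=5,
)
ev.span_evaluations.append(
    SpanEvaluation("twice a day", "accuracy", 5, "Correctly renders 1日2回.")
)
print(ev.mean_dimension_score("accuracy"))
```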
The approach evaluates translations along key quality dimensions, including accuracy, terminology, audience appropriateness, and clarity. For poetic texts such as haiku, emotional tone replaces standard grammar checks. Translations are broken into spans, each span is evaluated and scored on a 1-5 scale, and the candidates are then ranked. To reduce bias, three evaluation strategies were compared: single-step, two-step, and a more reliable interleaved method (sketched below). A no-reasoning approach was also tested but lacked transparency and proved prone to bias. Finally, human experts reviewed selected translations to compare their judgments with the system's, providing insight into its consistency with professional standards.
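As a rough illustration of the interleaved strategy and the order-swapping used to probe position bias, here is a sketch. The prompt text is an assumption, and the `judge` callable is a stand-in for a real LLM call, shown here as a deliberately biased mock.

```python
# Sketch of interleaved pairwise ranking with an order swap to expose
# position bias. Prompt wording is an illustrative assumption; `judge`
# stands in for an actual LLM call.
from typing import Callable

DIMENSIONS = ["accuracy", "terminology", "audience appropriateness", "clarity"]

def interleaved_prompt(source: str, first: str, second: str) -> str:
    """Interleave the two candidates dimension by dimension, rather than
    presenting one full evaluation after the other."""
    lines = [f"Source text: {source}", ""]
    for dim in DIMENSIONS:
        lines.append(f"Compare the two translations on {dim}:")
        lines.append(f"  Translation 1: {first}")
        lines.append(f"  Translation 2: {second}")
        lines.append("  Explain which is better on this dimension and why.")
    lines.append("Finally, state which translation is better overall: 1 or 2.")
    return "\n".join(lines)

def rank_pair(source: str, a: str, b: str, judge: Callable[[str], int]) -> dict:
    """Judge the pair in both presentation orders and report consistency."""
    verdict_ab = judge(interleaved_prompt(source, a, b))  # 1 => a wins
    verdict_ba = judge(interleaved_prompt(source, b, a))  # 1 => b wins
    winner_ab = "a" if verdict_ab == 1 else "b"
    winner_ba = "b" if verdict_ba == 1 else "a"
    return {"winner_ab": winner_ab, "winner_ba": winner_ba,
            "consistent": winner_ab == winner_ba}

# Mock judge that always prefers the first-shown translation: a worst case
# of position bias, which the swapped run immediately reveals.
biased_judge = lambda prompt: 1
print(rank_pair("こんにちは", "Hello", "Hi there", biased_judge))
# {'winner_ab': 'a', 'winner_ba': 'b', 'consistent': False}
```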
The researchers evaluated the system on datasets with human scores, comparing their TransEvalnia models (Qwen and Sonnet) against MT-Ranker, COMET-22/23, XCOMET-XXL, and MetricX-XXL. On WMT-2024 EN-ES, MT-Ranker performs best, likely owing to abundant training data. On most other datasets, however, TransEvalnia matches or outperforms MT-Ranker; for example, Qwen's no-reasoning variant wins on WMT-2023 EN-DE. Position bias was analyzed using inconsistency scores, with the interleaved method usually showing the lowest bias (e.g., 1.04 on the hard EN-JA set). Human evaluators gave Sonnet's outputs overall Likert scores of 4.37-4.61, and Sonnet's evaluations correlated well with human judgment (Spearman's r ≈ 0.51-0.54).
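The article does not define the inconsistency score exactly, so the snippet below is one hedged interpretation: the average shift in the Likert score a judge assigns to the same translation across the two presentation orders, alongside a Spearman correlation against human scores. All numbers are toy data, not results from the paper.

```python
# Hedged sketch: one plausible reading of the position-bias "inconsistency"
# number is the mean absolute difference between the Likert scores assigned
# to the same translation in the two presentation orders. The Spearman
# correlation mirrors the reported agreement with human judgment. Toy data.
from scipy.stats import spearmanr

# Overall Likert scores the judge gave the same translations when shown
# first vs. when shown second (toy numbers).
scores_shown_first  = [4, 5, 3, 4, 5, 2]
scores_shown_second = [3, 4, 3, 2, 4, 1]
position_bias = sum(abs(a - b) for a, b in
                    zip(scores_shown_first, scores_shown_second)) / len(scores_shown_first)
print(f"Mean absolute score shift across orders: {position_bias:.2f}")

# Agreement with human judgment, as in the reported Spearman's r.
system_scores = [4.5, 3.0, 5.0, 2.5, 4.0, 3.5]
human_scores  = [4.0, 3.5, 5.0, 2.0, 4.5, 3.0]
rho, p_value = spearmanr(system_scores, human_scores)
print(f"Spearman's r = {rho:.2f} (p = {p_value:.3f})")
```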
In summary, TransEvalnia is a prompting-based system for evaluating and ranking translations using LLMs such as Claude 3.5 Sonnet and Qwen. It provides detailed scores along key quality dimensions inspired by the MQM framework and selects the better translation among candidates. Even against fine-tuned metrics such as MetricX-XXL and MT-Ranker, it matches or outperforms them on several WMT language pairs. Human evaluators judged Sonnet's outputs reliable, and its scores correlated strongly with human judgment. Fine-tuning Qwen further improved performance significantly. The team also explored remedies for position bias, an ongoing challenge in ranking systems, and shared all evaluation data and code.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a strong interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
