chrF++ is an automatic machine translation evaluation metric that measures similarity between a system-generated translation and a human reference. It extends the character-based chrF metric by adding word n-grams (unigrams and bigrams by default) to the character n-gram comparison. Unlike purely word-based metrics such as BLEU, chrF++ relies mainly on character n-grams, which makes it especially effective for languages with rich morphology or complex inflection. By analysing similarity at the character level, chrF++ gives partial credit for matching stems and affixes that exact word matching would miss.
How chrF++ works
chrF++ evaluates translation quality using precision, recall, and an F-score computed over n-gram matches between the MT output and the reference. It considers both:
- character n-gram overlap (n = 1 to 6 by default), which captures sub-word similarity such as shared stems and affixes
- word n-gram overlap (unigrams and bigrams by default), which rewards exact word and word-pair matches
This hybrid approach lets chrF++ give partial credit for variations in spelling and morphology while remaining sensitive to exact lexical matches.
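The combination of character and word n-grams described above can be sketched in a few lines of Python. This is a simplified illustration, not a reference implementation: the function names are hypothetical, whitespace is simply dropped before counting character n-grams, words are split on whitespace only, and precision and recall are averaged over n-gram orders before the F-score is computed. Established tools differ in tokenisation and smoothing details, so scores from this sketch will not match them exactly.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after dropping whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def word_ngrams(text: str, n: int) -> Counter:
    """Count word n-grams using plain whitespace splitting (a simplification)."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def chrf_pp_sketch(hypothesis: str, reference: str,
                   char_order: int = 6, word_order: int = 2,
                   beta: float = 2.0) -> float:
    """Simplified chrF++-style score on a 0-100 scale."""
    extractors = [(char_ngrams, n) for n in range(1, char_order + 1)]
    extractors += [(word_ngrams, n) for n in range(1, word_order + 1)]

    precisions, recalls = [], []
    for extract, n in extractors:
        hyp, ref = extract(hypothesis, n), extract(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either text
        # Counter intersection keeps the minimum count of each shared n-gram.
        matches = sum((hyp & ref).values())
        precisions.append(matches / sum(hyp.values()))
        recalls.append(matches / sum(ref.values()))

    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(chrf_pp_sketch("the cats sat on the mat", "the cat sat on the mat"))
```

An identical hypothesis and reference score 100, texts with no shared characters score 0, and a near miss such as the one printed above lands in between.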
Key components
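- Character n-gram precision: the proportion of character n-grams in the MT output that also appear in the reference
- Character n-gram recall: the proportion of character n-grams in the reference that also appear in the MT output
- F-score: an F-beta score combining precision and recall; with the default beta of 2, recall is weighted more heavily than precision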
Because chrF++ is less dependent on word segmentation, it performs well on agglutinative languages and languages with case endings, diacritics, or script variations.
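The precision, recall, and F-score components listed above reduce to simple arithmetic once the n-gram matches have been counted. The sketch below uses made-up counts purely for illustration and a hypothetical helper name, f_beta. It shows how the beta parameter shifts the balance towards recall.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for one n-gram order (illustrative only).
matches, hyp_total, ref_total = 8, 10, 16
precision = matches / hyp_total  # share of output n-grams found in the reference
recall = matches / ref_total     # share of reference n-grams found in the output

print(f_beta(precision, recall, beta=1.0))  # balanced harmonic mean
print(f_beta(precision, recall, beta=2.0))  # recall-weighted
```

With these counts recall is the weaker of the two values, so the recall-weighted score (beta = 2) comes out lower than the balanced one (beta = 1). An output that covers less of the reference is penalised more than one that merely adds extra material.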
Strengths of chrF++
1. Morphology-sensitive evaluation
chrF++ excels with languages like Turkish, Finnish, Arabic, and Russian, where meaning is often encoded through morphological changes rather than separate words.
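2. Tolerant of small surface variations
Because it operates at the character level, chrF++ does not require exact word matches and gives partial credit when a translation differs from the reference in inflection, spelling, or compounding. Genuine paraphrases that use different vocabulary still score low, because they share few character sequences with the reference.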
3. Effective for diverse language families
Because the character component needs no language-specific tokeniser, chrF++ can be applied across Latin, Cyrillic, Indic, Arabic, and East Asian scripts, and it is widely used in multilingual research.
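The morphology point can be made concrete with a small comparison. The sketch below contrasts exact word matching with a character trigram F-score on two Turkish word forms that share a stem and most suffixes. The helper names are hypothetical and the scoring is deliberately reduced to a single n-gram order, so the numbers only illustrate the effect.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after dropping whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_f_score(hyp: str, ref: str, n: int = 3, beta: float = 2.0) -> float:
    """Single-order character n-gram F-beta score in the range 0-1."""
    h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
    matches = sum((h & r).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(h.values())
    recall = matches / sum(r.values())
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


def word_match_rate(hyp: str, ref: str) -> float:
    """Share of reference words that appear verbatim in the hypothesis."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    return sum((h & r).values()) / max(sum(r.values()), 1)


reference = "evlerimizde"    # Turkish: "in our houses"
hypothesis = "evlerimizden"  # Turkish: "from our houses" (wrong case suffix)

print(word_match_rate(hypothesis, reference))  # no exact word match
print(char_f_score(hypothesis, reference))     # most trigrams are shared
```

Exact word matching gives the hypothesis no credit at all, while the character-level score stays high because only the final suffix differs. That partial credit is what makes the metric useful for inflected languages, and it is also why a near miss is not treated as a complete failure.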
Limitations of chrF++
- It does not capture semantic meaning the way embedding-based or learned metrics such as BERTScore and COMET aim to
- It may overestimate similarity when translations are structurally similar but semantically incorrect
- It is less reliable for long-context or discourse-level evaluation
For these reasons, chrF++ is typically used alongside other metrics to provide a balanced view of translation quality.
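The overestimation problem is easy to reproduce. In the sketch below, a hypothesis that inserts "not" reverses the meaning of the reference yet shares almost all of its character trigrams, while an acceptable rewording scores much lower. The sentences and helper names are invented for illustration, and the scorer is a single-order simplification rather than full chrF++.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after dropping whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_f_score(hyp: str, ref: str, n: int = 3, beta: float = 2.0) -> float:
    """Single-order character n-gram F-beta score in the range 0-1."""
    h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
    matches = sum((h & r).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(h.values())
    recall = matches / sum(r.values())
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


reference = "the patient should take the medication"
negated = "the patient should not take the medication"  # opposite meaning
reworded = "the patient ought to take the medicine"     # acceptable rewording

print(char_f_score(negated, reference))
print(char_f_score(reworded, reference))
```

The negated sentence scores higher than the reworded one, which is the opposite of how a human evaluator would rank them. Cases like this are why a surface-overlap score needs to be read alongside a semantic metric or human review.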
chrF++ in machine translation workflows
Because chrF++ provides a detailed view at the character level, it is often used to assess:
- MT performance on morphologically complex languages
- quality across multilingual datasets
- improvements in segmentation, tokenisation, or model architecture
- accuracy of output in settings where word order is flexible
Its sensitivity to linguistic structure makes chrF++ a valuable complement to traditional metrics.
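In a workflow like the one outlined above, scores are usually compared per language pair across a whole test set rather than sentence by sentence. The following sketch pools n-gram counts over all segments before computing one score per system. The data is toy data made up for illustration, the system labels "baseline" and "candidate" are hypothetical, and the scorer again uses a single character n-gram order for brevity.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after dropping whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def corpus_char_f(hypotheses, references, n: int = 3, beta: float = 2.0) -> float:
    """Pool n-gram counts over all segments, then compute one 0-100 score."""
    matches = hyp_total = ref_total = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        matches += sum((h & r).values())
        hyp_total += sum(h.values())
        ref_total += sum(r.values())
    if matches == 0:
        return 0.0
    precision, recall = matches / hyp_total, matches / ref_total
    return 100 * (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


# Toy test sets keyed by language pair (invented for illustration).
references = {"en-fi": ["talossani on kissa"], "en-tr": ["evlerimizde kaldık"]}
baseline = {"en-fi": ["talossa on kissa"], "en-tr": ["evlerde kaldık"]}
candidate = {"en-fi": ["talossani on kissa"], "en-tr": ["evlerimizde kaldık"]}

for pair, refs in references.items():
    old = corpus_char_f(baseline[pair], refs)
    new = corpus_char_f(candidate[pair], refs)
    print(f"{pair}: baseline={old:.1f} candidate={new:.1f} delta={new - old:+.1f}")
```

Reporting the difference per language pair makes it easier to see whether a change helps morphologically rich targets specifically or only shifts the overall average.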
How chrF++ is used within Trad AI
Trad AI incorporates chrF++ into its multi-metric evaluation framework, combining it with other surface-level metrics such as BLEU and with semantic metrics such as BERTScore and COMET. By analysing character-level similarity, Trad AI can measure improvements in morphology handling, terminology consistency, and script-specific accuracy across language pairs. This helps verify that updates to the system reflect genuine quality gains and align with Trad AI's goal of delivering high-accuracy, domain-aware, and context-reliable translations.