BERTScore

A metric comparing contextual embeddings of a machine translation and a human reference.

BERTScore is an advanced machine translation evaluation metric that measures the semantic similarity between a system-generated translation and a human reference. Unlike surface-level metrics such as BLEU or chrF++, BERTScore evaluates meaning by comparing contextual embeddings, allowing it to capture nuance, paraphrasing, and long-distance dependencies that traditional methods often miss.

Why BERTScore matters

BERTScore is built on large pre-trained language models capable of understanding context, syntax, and semantic relationships. Because translation is not only about matching words but about conveying meaning accurately, BERTScore offers a more comprehensive view of quality than n-gram-based metrics. This makes it especially useful in professional environments where accuracy, terminology consistency, and contextual understanding are critical.

How BERTScore works

1. Embedding extraction

Each token in both the machine-generated translation and the human reference is converted into high-dimensional contextual embeddings using a transformer-based model.

2. Token-level comparison

Each pair of tokens is compared using cosine similarity, the cosine of the angle between their two embedding vectors, which measures how closely their contextual meanings align.
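A minimal sketch of this comparison, using toy low-dimensional vectors (real BERT embeddings have hundreds of dimensions, and the values below are made up for illustration):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1.0 means
    the embeddings point in nearly the same direction (similar meaning);
    values near 0.0 indicate unrelated meanings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings" for three tokens
cat = [0.9, 0.1, 0.3]
feline = [0.85, 0.15, 0.35]
car = [0.1, 0.9, 0.2]

print(cosine_similarity(cat, feline))  # high: similar meaning
print(cosine_similarity(cat, car))     # noticeably lower
```

Because the vectors are contextual, the same word gets a different embedding in different sentences, which is what lets the metric distinguish, say, "bank" of a river from a financial "bank".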

3. Precision, recall, and F1 scoring

Precision matches each token in the system output to its most similar reference token; recall matches each reference token to its most similar output token; F1 is the harmonic mean of the two. Together they capture both how faithful the output is to the reference and how completely it covers the reference meaning.

This deep semantic scoring allows BERTScore to recognise high-quality paraphrases, alternative phrasings, and linguistically valid variations that simplistic metrics tend to penalise.
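The three steps above can be sketched as greedy matching over a token-similarity matrix. This is a simplified illustration over pre-computed toy embeddings: the real metric uses contextual embeddings from a transformer and adds refinements such as IDF weighting and baseline rescaling, which are omitted here.

```python
import math

def cos(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def bertscore(candidate_embs, reference_embs):
    """Greedy-matching BERTScore over token embeddings.

    Precision: average best-match similarity for each candidate token.
    Recall: average best-match similarity for each reference token.
    F1: harmonic mean of the two.
    """
    sims = [[cos(c, r) for r in reference_embs] for c in candidate_embs]
    precision = sum(max(row) for row in sims) / len(candidate_embs)
    recall = sum(max(sims[i][j] for i in range(len(candidate_embs)))
                 for j in range(len(reference_embs))) / len(reference_embs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up 2-D token embeddings for a candidate and a reference sentence
candidate = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
reference = [[0.88, 0.12], [0.25, 0.75]]

p, r, f1 = bertscore(candidate, reference)
print(f"P={p:.3f}  R={r:.3f}  F1={f1:.3f}")
```

In practice, teams do not implement this by hand: the official `bert-score` Python package wraps model loading, tokenisation, and scoring behind a single call.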

Where BERTScore excels

  • Languages with free word order (e.g., Russian, Arabic)
  • Languages with rich morphology
  • Long sentences where context affects meaning
  • Domain-specific texts requiring precise terminology
  • Evaluations of context-aware translation and document-level translation

Because it focuses on meaning rather than surface structure, BERTScore serves as a reliable indicator of whether an AI translation system truly understands the source content.

BERTScore in professional MT workflows

  • Benchmarking MT engines
  • Evaluating changes in prompting or model versions
  • Detecting semantic drift
  • Complementing human evaluation
  • Quality assurance in continuous localisation pipelines

Because it correlates strongly with human judgments, BERTScore helps teams understand whether an MT system is improving in semantic accuracy, not just string matching.

How BERTScore is used within Trad AI

Trad AI applies BERTScore in internal MT evaluation pipelines to measure semantic improvements across extended context windows, domain-adapted prompting, terminology-controlled translation, and multilingual context-aware workflows. By combining BERTScore with metrics like BLEU, COMET, and TER, Trad AI ensures that enhancements are meaningful, measurable, and aligned with professional expectations.

BERTScore directly supports Trad AI’s mission to provide context-aware, domain-consistent, and high-reliability AI-powered translation for linguists and LSPs.

#BERTScore #TranslationQuality #MTEvaluation #TradAI
