Vocabulary

In language AI, vocabulary is token-based and directly affects how models handle rare terms, domain language, and multilingual variation.

In everyday language, vocabulary means the set of words a person knows. In AI language systems, the idea is similar but more technical: a model’s vocabulary is the set of units it can recognise and process internally. These units are often not full words, but tokens.

Understanding this difference helps translators and localisation teams interpret model behaviour more realistically. A language model does not read text the way humans do. It processes text as a sequence of tokens and predicts likely continuations within a limited context window, which means vocabulary design influences quality, fluency, and terminology consistency.

Vocabulary in AI is token-based, not dictionary-based

Traditional dictionaries organise complete words and meanings. Model vocabularies organise token units chosen for computational efficiency and coverage. A token might be:

  • a whole word
  • part of a word
  • a punctuation mark
  • a short character sequence

This token-level approach allows the same vocabulary to handle many writing patterns across languages, including compounds, inflections, abbreviations, and domain-specific strings.
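
As a rough illustration of the token types listed above, the following minimal sketch splits text with a greedy longest-match rule. The vocabulary and the function name are hypothetical and chosen only for this example; production tokenisers use learned vocabularies with tens of thousands of entries and more refined algorithms.

```python
# Hypothetical toy vocabulary, invented for illustration only.
TOY_VOCAB = {"the", "token", "s", "ation", " ", ",", "."}


def greedy_tokenise(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


print(greedy_tokenise("the tokens.", TOY_VOCAB))
```

In this sketch "the" comes out as a whole word, "token" and "s" as parts of a word, and "." as a punctuation mark, which mirrors the categories in the list.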

Vocabulary size and model capability

Vocabulary size affects how efficiently a model represents text. A larger vocabulary can include more whole words or common sequences, which may reduce token fragmentation and improve handling of frequent expressions. However, every token needs its own entry in the model's embedding and output layers, so very large vocabularies also increase model size and memory demands.

A smaller vocabulary can be efficient but may split more words into multiple tokens, which can make some patterns harder to learn. In practice, model developers balance vocabulary size with architecture, training data, and target language coverage.
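
The memory side of this trade-off can be estimated with simple arithmetic: an embedding table stores one vector per token, so its size grows linearly with the vocabulary. The sketch below assumes a hidden dimension of 4096 and 2 bytes per weight; both values, and the two vocabulary sizes, are illustrative assumptions rather than figures for any particular model.

```python
def embedding_parameters(vocab_size, hidden_dim):
    """Weights in one embedding table: one row of hidden_dim values per token."""
    return vocab_size * hidden_dim


def table_size_mib(vocab_size, hidden_dim, bytes_per_weight=2):
    """Approximate table size in MiB (bytes_per_weight=2 assumes 16-bit weights)."""
    return embedding_parameters(vocab_size, hidden_dim) * bytes_per_weight / (1024 ** 2)


# Hypothetical vocabulary sizes, compared at the same assumed hidden dimension.
for vocab_size in (32_000, 256_000):
    params = embedding_parameters(vocab_size, 4096)
    print(vocab_size, params, table_size_mib(vocab_size, 4096))
```

Under these assumptions, an eightfold increase in vocabulary size means an eightfold increase in the size of the table, before any gain from shorter token sequences is taken into account.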

For multilingual translation, this balance is especially important. The token inventory must support different scripts, morphology patterns, and terminology domains without becoming impractical.

Out-of-vocabulary (OOV) problems

An out-of-vocabulary (OOV) issue occurs when input contains forms the system cannot represent well with its available tokens. In older word-based systems, this could cause hard failures or placeholders for unknown words. Modern token-based models are more robust, but OOV-like behaviour still appears in practice as awkward handling of rare terms, names, codes, or newly emerging language.
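
The placeholder behaviour of older word-based systems can be sketched in a few lines. The vocabulary, the `<unk>` marker, and the product name "Zyloprex" are all invented for this example.

```python
UNK = "<unk>"  # placeholder used here for any word missing from the vocabulary

# Hypothetical word-level vocabulary.
WORD_VOCAB = {"take", "one", "tablet", "daily"}


def encode_words(sentence, vocab):
    """Word-level lookup: any word not in the vocabulary collapses to UNK."""
    return [word if word in vocab else UNK for word in sentence.lower().split()]


print(encode_words("Take Zyloprex daily", WORD_VOCAB))
```

Once the product name has been replaced by the placeholder, nothing downstream can recover it, which is why terminology was especially fragile in such systems.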

In translation workflows, this can affect product names, legal identifiers, medical terms, and client-specific terminology. Even if output looks fluent, terminology precision may still drift when rare forms are poorly represented.

How subword tokenisation reduces rare-word risk

Modern models typically use subword tokenisation methods, such as byte-pair encoding (BPE) or WordPiece, that split unfamiliar words into smaller known pieces. This allows the model to process new or rare forms without requiring every full word to exist in the vocabulary.

For example, a technical term may be decomposed into recognisable parts, allowing the model to infer structure and meaning from context. This is one reason current language models cope better with evolving terminology than older fixed-word systems.
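
One common family of subword methods is byte-pair encoding (BPE), which builds tokens by merging adjacent pieces according to an ordered list of learned rules. The sketch below applies such rules to a single word. The merge list is hand-written for illustration and is an assumption; real merge tables are learned from large corpora and contain thousands of rules.

```python
# Hypothetical merge rules in priority order; real tables are learned from data.
MERGES = [
    ("u", "p"), ("l", "o"), ("a", "d"), ("lo", "ad"),
    ("up", "load"), ("e", "d"), ("r", "e"),
]


def bpe_encode(word, merges):
    """Start from single characters and apply each merge rule in order."""
    pieces = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(pieces):
            # Join the pair wherever it occurs side by side.
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces


print(bpe_encode("reuploaded", MERGES))
```

Here "reuploaded" is never stored as a whole word, yet it is represented by the pieces "re", "upload", and "ed". A word that matches no rule simply stays as individual characters instead of becoming an unknown placeholder.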

Even so, subword handling is not a guarantee of accuracy. Rare terms can still be mistranslated, normalised incorrectly, or rendered inconsistently across documents.

Why terminology management and human review remain essential

Professional translation quality depends on more than broad fluency. Organisations need consistent terminology, legal precision, brand alignment, and domain-appropriate phrasing. Token vocabularies support language generation, but they do not replace governance.

That is why terminology management remains central in AI-assisted workflows. Glossaries, termbases, and style guides provide explicit rules that can be supplied to the model during generation or applied during post-editing. Human reviewers then verify whether the output is accurate in context and suitable for publication.

In practical terms, the strongest workflow combines model vocabulary strengths with linguistic control:

  • maintain curated termbases for domain-critical language
  • enforce approved naming and brand terms
  • review high-risk content with expert linguists
  • feed validated terminology back into future projects

This approach keeps AI output useful and scalable while preserving the precision required in professional localisation.
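
A termbase check of the kind described above can be partly automated before human review. The following minimal sketch flags segments where a source term appears but its approved translation does not. The termbase entries, the sentences, and the function name are hypothetical, and the plain substring matching is a simplifying assumption that ignores inflection and word boundaries.

```python
# Hypothetical English-to-German termbase: source term -> approved target term.
TERMBASE = {
    "power of attorney": "Vollmacht",
    "workspace": "Arbeitsbereich",
}


def find_term_violations(source, target, termbase):
    """Return (source term, approved term) pairs missing from the target text."""
    violations = []
    source_lower, target_lower = source.lower(), target.lower()
    for source_term, approved in termbase.items():
        # Flag only terms that occur in the source but lack the approved rendering.
        if source_term.lower() in source_lower and approved.lower() not in target_lower:
            violations.append((source_term, approved))
    return violations


source = "Open the workspace to sign the power of attorney."
target = "Öffnen Sie den Arbeitsplatz, um die Vollmacht zu unterzeichnen."
print(find_term_violations(source, target, TERMBASE))
```

A check like this only narrows down where a reviewer should look. Whether a flagged segment is actually wrong in context remains a judgement for an expert linguist.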

Subword tokenisation improves rare-word coverage, but professional translation still depends on terminology governance and human review.
