Token

A unit of text processed by a model, such as a word, subword, or punctuation mark.

A token is a fundamental unit of text used by AI models during processing. Tokens may represent whole words, subwords, characters, or punctuation marks, depending on the model’s tokenisation method. Large language models interpret text as a sequence of tokens rather than raw characters, which determines how much content the model can handle at once and how it interprets linguistic structure.

How tokenisation works

Tokenisation is the process of breaking text into smaller units so the model can analyse and generate language. Tokenisers use rules or statistical patterns to determine token boundaries. Depending on the language and tokeniser type, a single word may become:

  • one token
  • several subword units
  • a combination of word and punctuation tokens

Compound words, rare terms, and inflected forms are often split into multiple tokens.
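The splitting behaviour described above can be illustrated with a minimal sketch. It assumes a tiny hand-written vocabulary (`TOY_VOCAB`, hypothetical) and a greedy longest-match rule; real tokenisers learn their vocabularies from data and may use different algorithms, so treat this as an illustration rather than any specific model's method.

```python
import re

# Hypothetical toy vocabulary; real tokenisers learn far larger ones from data.
TOY_VOCAB = {"token", "is", "ation", "un", "translat", "able", "the"}

def split_words(text):
    """Split text into word and punctuation pieces (whitespace is dropped)."""
    return re.findall(r"\w+|[^\w\s]", text)

def subword_split(word, vocab=TOY_VOCAB):
    """Greedy longest-match split; unknown spans fall back to single characters."""
    pieces, i = [], 0
    lower = word.lower()
    while i < len(lower):
        # Try the longest candidate first, shrinking until a vocabulary hit.
        for j in range(len(lower), i, -1):
            if lower[i:j] in vocab:
                pieces.append(lower[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit one character.
            pieces.append(lower[i])
            i += 1
    return pieces

def tokenise(text):
    """Word/punctuation split followed by subword splitting of each piece."""
    return [p for w in split_words(text) for p in subword_split(w)]

print(tokenise("The tokenisation."))
print(subword_split("untranslatable"))
```

With this toy vocabulary, a common word such as "the" stays whole, while "tokenisation" and "untranslatable" are each split into three subword units.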

Types of tokens

1. Word tokens

Common in older NLP systems or languages with simple morphology.

2. Subword tokens

Used by most modern AI models to handle rare words and multilingual text efficiently.

3. Character tokens

Less common, but useful for languages with complex scripts or high variability.

4. Punctuation tokens

Symbols such as commas, brackets, or question marks are treated as independent units.
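To compare these granularities, the sketch below splits one short string at word level, at word-plus-punctuation level, and at character level. The regular expression is an illustrative assumption, not the rule used by any particular tokeniser.

```python
import re

text = "Hello, world?"

# Word tokens: split on whitespace only, so punctuation stays attached.
word_tokens = text.split()

# Word and punctuation tokens: punctuation marks become independent units.
word_punct_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character tokens: every non-space character is its own unit.
char_tokens = [ch for ch in text if not ch.isspace()]

print(len(word_tokens), word_tokens)
print(len(word_punct_tokens), word_punct_tokens)
print(len(char_tokens), char_tokens)
```

The finer the granularity, the longer the token sequence for the same text, which is one reason subword schemes are a common middle ground.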

Why tokens matter

Token counts directly affect:

  • context window length
  • translation cost and billing
  • latency and processing speed
  • segmentation and formatting accuracy
  • model behaviour in long documents

Understanding tokens is essential for managing workloads, budgets, and performance in AI-assisted translation.

Tokens in machine translation

In translation workflows, token counts affect:

  • how much content can be sent in a single request
  • whether a model retains coherence across long passages
  • the cost associated with API usage
  • how accurately the model interprets complex phrases

Large documents require careful batching or context window management to avoid truncation.
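One way to batch a long document is to group paragraphs until a token budget is reached. The sketch below is a simplified illustration: `estimate_tokens` uses a rough rule of about four characters per token, which is an assumption and not a provider guarantee, and `batch_paragraphs` is a hypothetical helper name. In practice the count should come from the tokeniser of the model being used.

```python
def estimate_tokens(text):
    """Crude estimate: assumes roughly 4 characters per token (an assumption)."""
    return max(1, (len(text) + 3) // 4)

def batch_paragraphs(paragraphs, max_tokens):
    """Group consecutive paragraphs so each batch stays within max_tokens.

    A single paragraph larger than max_tokens still gets its own batch,
    so callers may need to split such paragraphs further.
    """
    batches, current, used = [], [], 0
    for para in paragraphs:
        cost = estimate_tokens(para)
        # Start a new batch when adding this paragraph would exceed the budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        batches.append(current)
    return batches

paragraphs = ["a" * 40, "b" * 40, "c" * 40]  # about 10 estimated tokens each
for batch in batch_paragraphs(paragraphs, max_tokens=20):
    print(len(batch), "paragraph(s)")
```

Batching on paragraph boundaries, rather than at an arbitrary character offset, keeps sentences intact within each request.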

Tokenisation challenges

Tokenisation is not always straightforward. Challenges include:

  • inconsistent splits across languages
  • fragmentation of technical terminology
  • difficulties with named entities
  • errors caused by emoji or special characters
  • inflated token counts due to formatting artefacts

These issues can affect quality and increase computational load.
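Part of the difficulty with emoji and special characters is that one visible symbol can consist of several code points and many bytes. The sketch below uses only Python's standard library to show this, and applies Unicode NFC normalisation as one possible pre-processing step. Whether normalisation changes the token count depends on the tokeniser in use, so this is an illustration of the underlying cause, not a guaranteed fix.

```python
import unicodedata

# A thumbs-up emoji followed by a skin-tone modifier: one visible symbol.
emoji = "\U0001F44D\U0001F3FD"
print(len(emoji))                  # number of code points
print(len(emoji.encode("utf-8")))  # number of UTF-8 bytes

# "é" written as a base letter plus a combining accent (decomposed form).
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))

# The two forms look identical on screen but are different strings.
print(decomposed == composed)
```

Here the emoji occupies two code points and eight UTF-8 bytes, and the accented letter shrinks from two code points to one after normalisation.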

Tokens and context windows

AI models process a finite number of tokens per request. This limit determines:

  • how much of the document the model can see
  • whether cross-sentence context is preserved
  • how well long-form content is translated

Token limits therefore influence document-level coherence and terminology consistency.
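The arithmetic behind a context window check can be sketched as follows. It assumes that instructions, document text, and the generated reply all share one window, which is common but should be confirmed for the provider in use. The function names and the figures in the example are hypothetical.

```python
def input_budget(window, reserved_output, instruction_tokens):
    """Tokens left for document text after reserving room for the reply and instructions."""
    budget = window - reserved_output - instruction_tokens
    if budget <= 0:
        raise ValueError("window too small for the reserved output and instructions")
    return budget

def fits(document_tokens, window, reserved_output, instruction_tokens):
    """True if the document fits in the remaining input budget."""
    return document_tokens <= input_budget(window, reserved_output, instruction_tokens)

# Illustrative figures only: an 8,000-token window, 2,000 tokens reserved
# for the translation, and 500 tokens of instructions.
print(input_budget(8000, 2000, 500))
print(fits(5000, 8000, 2000, 500))
print(fits(6000, 8000, 2000, 500))
```

In this example 5,500 tokens remain for source text, so a 5,000-token document fits in one request while a 6,000-token document would need to be split.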

Tokens and cost optimisation

Since many AI providers charge per token, trimming prompts and formatting lowers cost. Common targets include:

  • unnecessary whitespace
  • redundant instructions
  • repeated context
  • overly verbose structures

Token-efficient workflows scale more easily.
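As a small illustration of two of these savings, the sketch below collapses redundant whitespace and estimates cost from a token count. The price used in the example is a placeholder and not any provider's actual rate, and both helper names are hypothetical.

```python
import re

def collapse_whitespace(text):
    """Collapse runs of spaces/tabs and reduce repeated blank lines to one."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()

def estimate_cost(token_count, price_per_million):
    """Cost of token_count tokens at a given price per million tokens."""
    return token_count / 1_000_000 * price_per_million

raw = "Hello    world\n\n\n\nBye  "
clean = collapse_whitespace(raw)
print(len(raw), len(clean))

# Placeholder price: 2.0 currency units per million tokens.
print(estimate_cost(250_000, 2.0))
```

Whitespace that carries meaning, such as indentation in code samples or preformatted tables, should be excluded from this kind of clean-up.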

How Trad AI manages tokens

Trad AI is designed to optimise token usage through extended context windows, structured prompting, and document-level batching. The platform reduces redundant instructions and processes only necessary content while preserving quality. Because all translation occurs through user-owned API keys, users maintain full control over token consumption and billing. This approach supports efficient, secure, and compliant workflows that align with GDPR and the EU AI Act.

#Tokens #NLP #AITranslation #TradAI
