Tokenisation in Natural Language Processing

How text is segmented into machine-readable units for NLP pipelines and large language models.

Tokenisation is a core preprocessing step that converts raw text into model-readable units. It underpins modern natural language processing, neural translation, and large language models by defining how text is segmented before inference or training.

What Is Tokenisation?

Tokenisation is the process of splitting text into smaller units called tokens. Depending on the algorithm, a token can be a full word, a subword fragment, punctuation, or even a byte sequence.
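For instance, a naive word-level tokeniser can be sketched in a few lines of Python; the regular expression here is purely illustrative, not the rule any particular library uses:

```python
import re

text = "Tokenisation underpins NLP."

# Illustrative word-level split: runs of word characters, or single
# punctuation marks, each become one token.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['Tokenisation', 'underpins', 'NLP', '.']
```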

Why Tokenisation Is Important for AI Models

AI models do not read characters as humans do. They map token IDs to embeddings and process these vectors through neural layers. Better tokenisation improves representational efficiency, lowers unknown-token risk, and stabilises quality in multilingual and domain-specific tasks.
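A minimal sketch of that lookup step, using a randomly initialised embedding table and made-up token IDs (real models learn the table during training):

```python
import numpy as np

vocab_size, embed_dim = 50_000, 8          # toy sizes for illustration
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embed_dim))

token_ids = [101, 2023, 2003, 102]          # hypothetical IDs from a tokeniser
vectors = embedding_table[token_ids]        # one embedding row per token
print(vectors.shape)                        # (4, 8): four tokens, eight dims each
```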

Types of Tokenisation (word, subword, byte-level)

  • Word tokenisation: splits on word boundaries; simple but brittle for rare or unseen words.
  • Subword tokenisation: decomposes words into reusable pieces, handling out-of-vocabulary (OOV) terms more gracefully.
  • Byte-level tokenisation: operates over raw bytes for robust multilingual coverage (all three are contrasted in the sketch below).
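To make the trade-offs concrete, here is an illustrative comparison on a single rare word; the subword segmentation shown is hypothetical, not the output of any specific tokeniser:

```python
text = "untokenisable"

# Word level: one opaque unit; a rare word like this may be out-of-vocabulary.
word_level = [text]

# Subword level (hypothetical segmentation for illustration only).
subword_level = ["un", "token", "is", "able"]

# Byte level: always representable, at the cost of longer sequences.
byte_level = list(text.encode("utf-8"))

print(word_level, subword_level, byte_level[:5], sep="\n")
```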

Byte Pair Encoding and SentencePiece

Byte Pair Encoding (BPE) iteratively merges the most frequent pair of adjacent symbols into a single new symbol, building a compact subword vocabulary. SentencePiece generalises the idea with language-agnostic segmentation and a unigram-model alternative; both are widely used in contemporary NLP pipelines.
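A toy sketch of the BPE merge loop, starting from a character-level corpus; production implementations additionally track end-of-word markers and record the merge order so new text can be encoded later:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word as a tuple of characters, mapped to its frequency.
corpus = {tuple("lower"): 5, tuple("lowest"): 2,
          tuple("newer"): 6, tuple("wider"): 3}

for _ in range(3):                     # three merge steps for illustration
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged:", pair)             # first merge is ('e', 'r') -> 'er'
```

The vocabulary grows by exactly one symbol per merge, so the number of merge steps directly controls the final vocabulary size.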

Tokenisation in Large Language Models

In LLM systems, tokenisation controls prompt packing, cost, context utilisation, and output fluency. It directly affects how models interpret prompts, preserve terminology, and perform tasks such as neural machine translation.
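Token counts can be inspected directly, for example with OpenAI's open-source tiktoken library (assuming it is installed; other model families ship their own tokenisers):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models
prompt = "Translate to French: The cat sat on the mat."
token_ids = enc.encode(prompt)

print(len(token_ids), "tokens")          # billing and context budgeting count these
print(enc.decode(token_ids) == prompt)   # lossless round-trip back to the text
```

Because pricing and context limits are denominated in tokens, counting before sending a request helps estimate cost and avoid silent truncation.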

Token Limits and Their Impact on AI Systems

Every model has a finite context window. Input and output tokens share this budget, so long documents may require chunking or retrieval strategies to avoid truncation and quality loss.
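One common mitigation is to window the token sequence with some overlap so no passage is lost at a chunk boundary. A minimal sketch, with illustrative budget numbers:

```python
def chunk_by_tokens(token_ids, max_tokens, overlap=50):
    """Split a long token sequence into windows that fit a context budget.

    `max_tokens` should leave headroom for the prompt template and the
    model's output, since input and output share the context window.
    """
    step = max_tokens - overlap
    return [token_ids[i:i + max_tokens] for i in range(0, len(token_ids), step)]

# e.g. an 8,192-token context: reserve roughly 2,000 for instructions and output.
chunks = chunk_by_tokens(list(range(20_000)), max_tokens=6_000, overlap=200)
print([len(c) for c in chunks])  # [6000, 6000, 6000, 2600]
```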

Tokenisation Challenges Across Languages

Languages with rich morphology, script variation, or no whitespace between words are harder to tokenise well. Cross-language fairness depends on balanced vocabularies and multilingual evaluation, not only tokeniser speed.
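A quick illustration of the whitespace problem: naive splitting works for English but leaves a Japanese sentence as a single undivided token (the Japanese text is an approximate translation used only for illustration):

```python
english = "The cat sat on the mat"
japanese = "猫がマットの上に座った"  # roughly: "The cat sat on the mat"

print(english.split())                 # six word tokens
print(japanese.split())                # one token: no whitespace between words
print(len(japanese.encode("utf-8")))   # byte-level cost: 3 bytes per character here
```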
