# MTP

> **TL;DR.** MTP (Multi-Token Prediction) is an inference acceleration technique in which a lightweight drafter model proposes several future tokens at once and a larger target model verifies them in a single forward pass, for a reported 2–3x higher throughput with no loss in output quality.

- **Category:** AI / Developer Tools / Inference Optimization
- **Stage:** validating
- **Age:** 42 days
- **Origin date:** 2026-05-05
- **First detected:** 2026-05-07
- **Canonical URL:** https://earlyterms.com/term/mtp
- **Sources:** 6 primary URLs

## Definition

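MTP (Multi-Token Prediction) is an inference acceleration technique and a form of speculative decoding. A lightweight drafter model proposes several future tokens at once. The larger target model then checks all of them in a single forward pass, keeps the tokens it agrees with, and replaces the first one it rejects. Because every emitted token is confirmed by the target model, output quality is unchanged, while reported throughput rises 2–3x.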

The technique dates to [Meta FAIR's April 2024 paper](https://arxiv.org/abs/2404.19737) and was built into DeepSeek-V3's architecture in December 2024. On [May 5, 2026, Google released open-source MTP drafters for Gemma 4](https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/) under Apache 2.0, with support across Hugging Face, vLLM, SGLang, MLX, and Ollama. The release drew a 678-point Hacker News thread and pushed the technique toward mainstream adoption.
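The draft-then-verify loop in the definition above can be illustrated with a minimal sketch. It is not Gemma 4's or DeepSeek-V3's implementation: `target_model` and `drafter_model` are hypothetical toy functions standing in for real networks, decoding is greedy, and the single batched verification pass of a real system is simulated here with one call per position.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token context to its greedy next token.
Model = Callable[[List[int]], int]


def target_model(context: List[int]) -> int:
    """Hypothetical large model: a toy deterministic rule, not a real LLM."""
    return (sum(context) * 31 + len(context)) % 50


def drafter_model(context: List[int]) -> int:
    """Hypothetical small drafter: matches the target except at every 4th position."""
    guess = target_model(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def speculative_step(
    context: List[int], target: Model, drafter: Model, k: int
) -> Tuple[List[int], int]:
    """Draft k tokens, verify them, and return (emitted tokens, number accepted)."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target checks every drafted position. A real system would score
    #    all positions in one batched forward pass; this loop only simulates it.
    emitted: List[int] = []
    for i in range(k):
        expected = target(context + draft[:i])
        if draft[i] != expected:
            # First mismatch: emit the target's own token and drop the rest.
            emitted.append(expected)
            return emitted, i
        emitted.append(draft[i])

    # 3. All k drafts accepted: the same pass also yields one extra target token.
    emitted.append(target(context + draft))
    return emitted, k


def generate(
    prompt: List[int], n_new: int, target: Model, drafter: Model, k: int = 3
) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return how many verification passes were used."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        emitted, _ = speculative_step(out, target, drafter, k)
        out.extend(emitted)
        passes += 1
    return out[: len(prompt) + n_new], passes


def generate_baseline(prompt: List[int], n_new: int, target: Model) -> List[int]:
    """Ordinary decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec, passes = generate(prompt, 20, target_model, drafter_model, k=3)
    base = generate_baseline(prompt, 20, target_model)
    print("identical output:", spec == base)
    print("verification passes:", passes, "vs baseline target calls:", 20)
```

Since a drafted token is emitted only when it equals the target's own greedy choice, the sketch produces the same sequence as target-only decoding while using fewer verification passes. Production systems that sample rather than decode greedily typically use a probabilistic acceptance rule instead of exact matching.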

## Analogy

Think of it as a fast stenographer who drafts the next few sentences so that the editor can approve or correct them all in one read-through, instead of writing each sentence personally.

## Why it's emerging now

On May 5, 2026, Google released Apache 2.0 MTP drafters for Gemma 4, reporting up to 3x faster inference across vLLM, SGLang, MLX, and Ollama with no quality loss. SemiAnalysis data attributes a 14x throughput gap on B300 GPUs running DeepSeek R1 to MTP alone, which would make it one of the highest-leverage software optimizations currently available.
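To see how a figure like "up to 3x" can arise, a back-of-the-envelope estimate helps. The sketch below assumes, purely for illustration, that each drafted token is accepted independently with a constant probability `alpha` and that the drafter's own cost is negligible. Neither assumption comes from the sources above, and the function name is hypothetical. Under those assumptions, the expected number of tokens emitted per target pass with `k` drafted tokens is the geometric sum 1 + alpha + ... + alpha^k.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha (an idealization, not a measured property of any model).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    # Position i is emitted only if all i earlier drafts were accepted,
    # which happens with probability alpha**i; position 0 is always emitted.
    return sum(alpha ** i for i in range(k + 1))


if __name__ == "__main__":
    # Illustrative values only; real acceptance rates depend on model and workload.
    for alpha in (0.5, 0.7, 0.8, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, k=3), 3))
```

With three drafted tokens, an acceptance rate of 0.8 yields roughly 2.95 tokens per pass under this model. The figure is an upper bound on speedup, since real drafters add latency of their own.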

## Related terms

- *parent:* speculative decoding
- *related:* EAGLE (speculative decoding)
- *child:* MTPLX
- *related:* MLX
- *related:* DeepSeek-V3
- *related:* Qwen3
- *related:* Gemma 4
- *related:* token-maxxing
- *related:* quantization
- *related:* flash attention

## Sources

1. [Google — Gemma 4 MTP drafters announcement](https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/)
2. [Hacker News — Gemma 4 MTP thread (678 pts)](https://news.ycombinator.com/item?id=48024540)
3. [Meta FAIR — Better & Faster LLMs via Multi-token Prediction (arXiv 2404.19737)](https://arxiv.org/abs/2404.19737)
4. [AMD ROCm Blog — MTP + SGLang on DeepSeek-V3](https://rocm.blogs.amd.com/software-tools-optimization/mtp/README.html)
5. [GitHub — youssofal/MTPLX: native MTP for Apple Silicon](https://github.com/youssofal/MTPLX)
6. [MarkTechPost — Google MTP Drafters for Gemma 4](https://www.marktechpost.com/2026/05/06/google-ai-releases-multi-token-prediction-mtp-drafters-for-gemma-4-delivering-up-to-3x-faster-inference-without-quality-loss/)

---
_Generated by EarlyTerms · https://earlyterms.com/term/mtp_
