Gemma 4 verifies parallel token proposals inside an accelerated inference flow. (Image: AI-generated / TECH&SPACE)
- Gemma 4 uses MTP draft models to propose multiple tokens in parallel.
- Speculative decoding lets the main model verify drafted tokens in one pass.
- InfoQ reports speedups of up to roughly 3x without output quality loss.
The most interesting part of the Gemma 4 news is not another contest over model size, but an engineering attack on the part users actually feel: waiting for tokens. According to InfoQ, Gemma 4 can be paired with multi-token prediction (MTP) draft models, which propose several future tokens in parallel as part of a speculative decoding setup. The main model then no longer has to produce every token on a fully serial path; it can verify a drafted sequence in a single pass.
That may sound like a runtime detail, but the consequence is concrete. If the draft model’s proposals often match what the main model would have generated anyway, the system can deliver output faster without changing the user-visible answer; proposals that do not match are discarded and the main model’s own token is used instead. InfoQ reports speedups of up to roughly 3x without quality loss. In practice, that difference can decide whether an AI assistant feels immediate or sluggish, and whether the same hardware can serve more requests.
The important point is that MTP is not a shortcut that replaces the larger model with a smaller one. The draft model acts as a fast proposer, while Gemma 4 remains the authoritative verifier. That division is why speculative decoding is attractive for production inference: the goal is not a different answer, but an answer of the same quality with less sequential waiting. Google’s broader Gemma documentation already frames the family as a developer-facing model line, and this runtime layer shifts attention from training alone to delivery.
The MTP draft model proposes a continuation, while the main model decides what passes. (Image: AI-generated / TECH&SPACE)
Technically, the bottleneck is that language generation is naturally sequential. A model usually predicts the next token, then the next, then the next, with each step depending on the previous one. The MTP draft approach tries to shorten that loop: instead of producing one token per step, an auxiliary draft mechanism proposes a short run of candidate tokens, and the main model decides in one pass how many of those candidates can be accepted.
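The propose-and-verify loop can be made concrete with a toy sketch. This is not Gemma 4's implementation: `main_next` and `draft_next` are hypothetical stand-ins for the main and draft models, decoding is assumed to be greedy, and the verification step is emulated token by token, where a real runtime would score all drafted positions in one batched forward pass.

```python
from typing import List, Tuple

Token = int


def main_next(prefix: List[Token]) -> Token:
    """Hypothetical main model: a deterministic greedy next-token rule."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[Token]) -> Token:
    """Hypothetical draft model: agrees with main_next except at every 4th position."""
    guess = main_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def baseline_decode(prompt: List[Token], n_new: int) -> Tuple[List[Token], int]:
    """Plain serial decoding: one main-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(main_next(out))
    return out, n_new


def speculative_decode(prompt: List[Token], n_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding: draft k tokens, then verify them in one pass."""
    out = list(prompt)
    target_len = len(prompt) + n_new
    passes = 0
    while len(out) < target_len:
        # 1) The draft model proposes k tokens, conditioning on its own guesses.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))

        # 2) One verification pass of the main model (emulated per position here).
        passes += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = main_next(out + accepted)
            if tok == expected:
                accepted.append(tok)        # draft matched: keep it
            else:
                accepted.append(expected)   # mismatch: use the main model's token
                break                       # and drop the rest of the draft
        else:
            # Every drafted token matched, so the same pass yields one extra token.
            accepted.append(main_next(out + accepted))
        out.extend(accepted)
    return out[:target_len], passes


prompt = [1, 2, 3]
base_out, base_passes = baseline_decode(prompt, 20)
spec_out, spec_passes = speculative_decode(prompt, 20, k=4)
print("identical output:", spec_out == base_out)
print("main-model passes:", base_passes, "vs", spec_passes)
```

Under these assumptions every emitted token is one the main model would have chosen itself, so the output matches serial decoding while the number of main-model passes drops. How far it drops depends entirely on how often the draft agrees.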
For large-scale services, this matters more than another polished demo. Latency shapes user experience, but it also shapes inference economics: fewer model passes can mean less accelerator time per response, if the reported gains hold under real workloads. That makes the technique relevant for chat systems, agent workflows, coding tools and any application where users notice every pause between tokens.
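The link between model passes and accelerator time can be sketched as back-of-envelope arithmetic. Every number below is a hypothetical placeholder rather than a measurement of Gemma 4, and the function name `accelerator_seconds` is introduced here only for illustration. The sketch also assumes that a verification pass costs slightly more than a single-token pass and that the draft step adds a small fixed cost per pass.

```python
import math


def accelerator_seconds(
    response_tokens: int,
    tokens_per_pass: float,
    main_pass_seconds: float,
    draft_seconds_per_pass: float = 0.0,
) -> float:
    """Rough accelerator time for one response.

    tokens_per_pass is the average number of tokens emitted per main-model pass
    (1.0 for plain serial decoding, higher when drafted tokens are accepted).
    """
    if tokens_per_pass < 1.0:
        raise ValueError("each main-model pass emits at least one token")
    passes = math.ceil(response_tokens / tokens_per_pass)
    return passes * (main_pass_seconds + draft_seconds_per_pass)


# Hypothetical figures: a 500-token response, 30 ms per serial pass,
# 33 ms per verification pass plus 4 ms of drafting, 3 tokens per pass.
baseline = accelerator_seconds(500, 1.0, 0.030)
speculative = accelerator_seconds(500, 3.0, 0.033, 0.004)
print(f"baseline: {baseline:.2f} s, speculative: {speculative:.2f} s")
print(f"ratio: {baseline / speculative:.2f}x")
```

The point of the sketch is the shape of the calculation, not the figures: the saving comes from fewer passes, and it shrinks as the per-pass overhead of drafting and verifying grows.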
The claim still needs to be read precisely. “Up to roughly 3x” is not a universal guarantee for every prompt, response length or hardware setup. The gain depends on how accurate the draft proposals are, how expensive verification is, and whether the application can actually benefit from a faster output stream. But the architecture is sound: rather than trading quality away for speed, it tries to extract speed from more parallel decoding.
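The dependence on draft accuracy can be illustrated with a standard simplified model of speculative decoding. It assumes each drafted token is accepted independently with probability `alpha`, that `k` tokens are drafted per pass, and that one draft step costs a fraction `draft_cost_ratio` of a main-model pass. None of these values come from the InfoQ report or from Gemma 4 measurements; they are placeholders for illustration.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass.

    Sum of alpha**i for i in 0..k: the accepted prefix of the draft plus the
    one token the main model always contributes itself.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a probability")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per pass divided by the relative cost of one draft-and-verify cycle."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost_ratio + 1.0)


# Hypothetical sweep: k = 4 drafted tokens, draft step at 5% of a main-model pass.
for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {estimated_speedup(alpha, 4, 0.05):.2f}x")
```

In this model a draft that is right only half the time yields a modest gain, while a highly accurate and cheap draft pushes the estimate up. That pattern matches the "up to" phrasing in the reported figure.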
If Gemma 4 makes this MTP path reliable for developers, the conversation about open models becomes less abstract. It is not enough to have a model that performs well on a benchmark; it also has to be served quickly, steadily and at a reasonable cost. That is where the Gemma developer ecosystem gains a sharper edge: inference optimization becomes part of the product, not a footnote after the model announcement.

