Fastest AI model yet, but there’s a catch

TL;DR

DiffusionGemma writes a whole chunk of text in one go and then keeps polishing it rather than building it word by word.
Google says it can be up to 4x faster, hitting 1,000+ tokens per second on NVIDIA H100 and around 700 on an RTX 5090, thanks to parallel processing.
Output quality is still inferior to Gemma 4, so it’s more of an experimental tool than a finished product.

Google has released DiffusionGemma, an experimental AI model that takes a very different approach to how most chatbots generate text today. Instead of writing one word after another in a strict sequence, it generates a whole block of text at once and then keeps refining it until it becomes readable. The idea is to push for speed and hardware efficiency, even if it means giving up some polish in the final output.

DiffusionGemma compared with other Gemma models

This new AI model is open-sourced under the Apache 2.0 license and is aimed at developers and researchers rather than everyday users. To understand why this matters, it helps to look at how most large language models work. Systems like Google’s Gemma 4 generate text step by step, one token at a time. Each new word depends on what came before it, which makes the process inherently sequential and harder to speed up.

DiffusionGemma, on the other hand, starts with a full canvas of random tokens, essentially noisy, unreadable text, and then repeatedly cleans it up in multiple passes. With each pass, the output becomes more structured and coherent until it settles into a final response. A simple way to picture it is that traditional models write, while DiffusionGemma drafts and edits everything at once.
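To make the draft-and-edit idea concrete, here is a minimal toy sketch of the control flow only. It is not Google's algorithm. The function name `toy_denoise`, the step count, and the "oracle" that knows the target text are all assumptions made for illustration. In a real diffusion language model, a neural network would predict every position on each pass and keep its most confident guesses.

```python
import random

def toy_denoise(target, steps=4, seed=0):
    """Toy stand-in for block-wise iterative refinement.

    Hypothetical illustration: an oracle that already knows `target`
    plays the role of the model's predictions.
    """
    rng = random.Random(seed)
    vocab = sorted(set(target))
    # Start from a full canvas of random tokens (here: characters).
    canvas = [rng.choice(vocab) for _ in target]
    history = ["".join(canvas)]

    # Positions that still need to be finalised, in random order.
    pending = list(range(len(target)))
    rng.shuffle(pending)
    per_step = -(-len(target) // steps)  # ceiling division

    for _ in range(steps):
        # Each pass fixes many positions at once, anywhere in the block.
        for pos in pending[:per_step]:
            canvas[pos] = target[pos]
        pending = pending[per_step:]
        history.append("".join(canvas))
    return history

for i, snapshot in enumerate(toy_denoise("hello world")):
    print(i, repr(snapshot))
```

The point of the sketch is the pass count: an 11-character block settles in 4 passes here, whereas a strictly sequential writer would need one step per character.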


That shift has a direct impact on performance. Per Google’s claims, DiffusionGemma can be up to four times faster than standard autoregressive models in low-concurrency scenarios, where a single user or process uses the GPU. On high-end hardware, the company claims more than 1,000 tokens per second on an NVIDIA H100 and over 700 tokens per second on an RTX 5090.
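As a rough sense of scale, the quoted throughput figures can be turned into wait times. This is back-of-the-envelope arithmetic only. The 500-token reply length is a hypothetical choice, and the baseline line assumes the "up to 4x" claim applies to the same setup as the H100 figure, which Google's numbers as reported here do not confirm.

```python
def generation_time_seconds(num_tokens, tokens_per_second):
    """Time to emit num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second

REPLY_TOKENS = 500  # hypothetical reply length

h100 = generation_time_seconds(REPLY_TOKENS, 1000)   # claimed H100 floor
rtx5090 = generation_time_seconds(REPLY_TOKENS, 700)  # claimed RTX 5090 figure
# Assumption: a baseline running at one quarter of the H100 figure.
baseline = generation_time_seconds(REPLY_TOKENS, 1000 / 4)

print(f"H100:     {h100:.2f} s")      # 0.50 s
print(f"RTX 5090: {rtx5090:.2f} s")   # 0.71 s
print(f"Baseline: {baseline:.2f} s")  # 2.00 s
```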

Under the hood, DiffusionGemma is a 26-billion-parameter Mixture-of-Experts model, but it does not activate all of those parameters at once. Only about 3.8 billion parameters are used during inference, which helps keep compute requirements manageable. Google says this makes it possible to run the model on high-end consumer GPUs when quantized, with a memory footprint of around 18GB of VRAM.
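The parameter counts above allow a quick sanity check on memory. The sketch below estimates weight storage only. The article does not say which quantization bit width Google has in mind or what else is counted in the 18GB figure, so the bit widths here are assumptions. Note that a Mixture-of-Experts model typically still has to keep all of its experts in memory, even though only a fraction is active at a time.

```python
TOTAL_PARAMS = 26e9    # total parameters, from the article
ACTIVE_PARAMS = 3.8e9  # parameters used during inference, from the article

def weight_gb(params, bits_per_param):
    """Weight storage in decimal gigabytes, ignoring activations and caches."""
    return params * bits_per_param / 8 / 1e9

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active share of parameters: {active_fraction:.1%}")

# Assumed bit widths for illustration; not stated in the article.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_gb(TOTAL_PARAMS, bits):.1f} GB")

# Average bits per parameter that would fill 18 GB with weights alone.
implied_bits = 18e9 * 8 / TOTAL_PARAMS
print(f"18 GB implies about {implied_bits:.1f} bits per parameter")
```

Under these assumptions, 16-bit weights would need about 52 GB and 4-bit weights about 13 GB, so the quoted 18GB is at least consistent with a low-bit quantized model plus some working memory.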

Where things get more interesting is how the model actually generates text. It can produce up to 256 tokens in parallel in a single step, and each token can attend to every other token in the block. That gives the model a global view of the output instead of a strictly linear one.
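The difference between a linear view and a global view can be shown with attention masks. This is a generic sketch, not DiffusionGemma's code, and the function names are made up for illustration. A conventional autoregressive model uses a causal mask, where each position sees only itself and earlier positions. A block in which every token attends to every other token corresponds to a full mask.

```python
BLOCK_SIZE = 256  # tokens per block, from the article

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def full_block_mask(n):
    """Every position may attend to every other position in the block."""
    return [[True] * n for _ in range(n)]

def visible_pairs(mask):
    return sum(sum(row) for row in mask)

causal = visible_pairs(causal_mask(BLOCK_SIZE))
full = visible_pairs(full_block_mask(BLOCK_SIZE))

print("causal:", causal)  # 32896
print("full:  ", full)    # 65536
```

With a causal mask, the first token in the block never sees what follows it. With a full mask, it sees all 255 other positions, which is what lets early tokens be revised in light of later ones.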

This makes it better suited for structured or rule-based tasks. For example, it can help fill in missing sections of code, complete structured formats like JSON, work through logic-heavy problems such as Sudoku-style puzzles, or handle mathematical patterns where consistency across the whole output matters more than sentence-by-sentence flow. Because it sees the entire block at once, it can also correct contradictions within the same generation cycle, rather than waiting for a later token to fix them.
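A numeric analogy may help with the infilling case. The sketch below is not a language model. It is a Jacobi-style relaxation, used here only to illustrate the pattern of fixed context on both sides with every gap updated in parallel on each pass. The function name `infill_parallel` and the pass count are assumptions for illustration.

```python
def infill_parallel(values, passes=200):
    """Fill None gaps so each gap equals the average of its two neighbours.

    Known values stay fixed. All gaps are updated together on each pass,
    using only the previous pass's state. Gaps must not sit at either end.
    """
    holes = [i for i, v in enumerate(values) if v is None]
    current = [0.0 if v is None else float(v) for v in values]
    for _ in range(passes):
        # Compute every update from the old state, then apply them at once.
        updates = {i: (current[i - 1] + current[i + 1]) / 2 for i in holes}
        for i, new_value in updates.items():
            current[i] = new_value
    return current

filled = infill_parallel([2, None, None, 8])
print([round(v, 3) for v in filled])  # [2.0, 4.0, 6.0, 8.0]
```

Neither gap can be filled correctly from the left side alone. The answer only emerges because both ends of the sequence constrain the middle, which is the kind of whole-output consistency the article describes.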

But there is a catch, and Google is upfront about it. DiffusionGemma does not match the output quality of its standard Gemma 4 models. The writing can be less stable, less refined, and not as reliable for complex or nuanced responses. So, you get speed but lose some polish.

That is why Google is positioning it as an experimental tool. It is designed for scenarios where responsiveness matters more than perfection, such as real-time AI tools, inline writing or coding assistants, and fast iterative workflows where users care more about instant feedback than final-quality text.

DiffusionGemma is therefore not meant to replace existing Gemini or Gemma models. It is a speed-first experiment that trades output quality for efficiency and responsiveness. But it also hints at a different direction for AI text generation, where models do not just predict the next word, but generate and refine entire blocks of text simultaneously.
