Gemma 4’s real trick: Squeezing more IQ per byte

simonwillison.net · Mountain View, United States · 3 weeks ago

[Image: A miniature processor die the size of a thumbnail, stamped ‘E2B’, denting and lifting a massive cast-iron weight block. Photo by Tech&Space]

  • 2B and 4B models relabeled as ‘Effective’ parameter sizes
  • Per-Layer Embeddings trade off memory for on-device speed
  • MoE variant hints at Google’s bet on sparse expertise

Google DeepMind’s Gemma 4 arrives with a familiar playbook: smaller models, bigger claims. This time, the hook isn’t raw scale but intelligence-per-parameter, a metric that sounds scientific until you ask how it’s measured. The four models—2B, 4B, 31B, plus a 26B Mixture-of-Experts (MoE) variant—lean hard into efficiency, with the two smallest relabeled as E2B and E4B (‘Effective’ parameters) to signal they punch above their weight.

The technical gambit here is Per-Layer Embeddings (PLE), a tweak that gives each decoder layer its own tiny embedding table per token. It’s a memory-for-speed tradeoff, optimized for on-device deployments where every megabyte counts. Early system card details suggest this isn’t just compression but a rethink of how parameters are allocated—though whether that translates to real-world latency improvements remains untested.
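
The system card is light on specifics, so here is a minimal sketch of what a per-layer embedding scheme could look like, assuming each decoder layer keeps its own small per-token table that gets projected up to model width and added to the hidden state. PerLayerEmbedding, ple_dim, and the wiring are illustrative assumptions, not Gemma's actual implementation.

```python
import torch
import torch.nn as nn

class PerLayerEmbedding(nn.Module):
    """Hypothetical sketch: one small embedding table per decoder layer,
    added to the hidden state on top of the usual shared input embedding.
    Not Gemma's actual code."""

    def __init__(self, vocab_size: int, n_layers: int, ple_dim: int, d_model: int):
        super().__init__()
        # One tiny table per layer; ple_dim << d_model is what keeps the
        # memory overhead small enough for on-device use.
        self.tables = nn.ModuleList(
            [nn.Embedding(vocab_size, ple_dim) for _ in range(n_layers)]
        )
        # Project each small per-layer vector up to the model width.
        self.proj = nn.ModuleList(
            [nn.Linear(ple_dim, d_model, bias=False) for _ in range(n_layers)]
        )

    def forward(self, token_ids: torch.Tensor, layer_idx: int) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> (batch, seq_len, d_model)
        return self.proj[layer_idx](self.tables[layer_idx](token_ids))

# Rough usage inside a decoder stack:
#   h = shared_embed(token_ids)
#   for i, block in enumerate(blocks):
#       h = block(h + ple(token_ids, layer_idx=i))
```

The appeal of this layout is that the per-layer tables depend only on token ids, not on activations, so they can in principle sit in cheaper memory and be fetched on demand, which is presumably where the on-device pitch comes from.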

Google’s framing—‘unprecedented intelligence-per-parameter’—is classic benchmark theater. The Apache 2.0 license and vision capabilities sweeten the pot, but the core question lingers: Are these models better, or just packaged to look efficient in synthetic tests?

The efficiency arms race just got a new benchmark—if you trust the benchmarks

[Image: A sparse green circuit board with 26 processor sockets in a grid, only 4 slots filled with glowing chips while 22 sit empty and dark. Photo by Tech&Space]

The MoE variant (26B-A4B) is the wild card. Sparse models like these have long promised specialist performance without monolithic costs, but Google’s vague ‘A4B’ naming leaves critical details—like expert utilization ratios—unanswered. If this is a play for edge devices, the tradeoffs (e.g., memory spikes during expert switching) could outweigh the gains.
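
Google has not published what ‘A4B’ pins down, but the conventional reading of such names is roughly 4B parameters active per token out of 26B total. The sketch below shows the standard top-k routing that produces that kind of split; TopKMoE, the expert count, and k are illustrative assumptions, not Gemma's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a sparse Mixture-of-Experts FFN: every expert exists in
    memory (total parameters), but each token only runs through the top-k
    experts the router picks for it (active parameters)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                           nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        logits = self.router(x)                      # (n_tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)   # choose k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

With the illustrative numbers here (8 experts, top-2 routing), each token exercises a quarter of the FFN weights, which is how a model can hold ~26B parameters while activating only ~4B per token. The flip side is the article's caveat: all 26B still have to live somewhere, and paging experts in and out on a memory-constrained edge device is exactly where those spikes during expert switching would come from.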

Developer reaction on GitHub and forums like Hacker News has been cautiously optimistic, with praise for the license but skepticism about the ‘Effective’ parameter branding. One comment thread noted the PLE approach mirrors earlier efficiency techniques from models like Qwen2 and Phi-3, raising the question: Is this innovation or iterative tuning?

The real signal isn’t the models themselves but the strategic pivot. Google’s betting that the future of AI isn’t just bigger models, but smarter allocation—a direct shot at Meta’s Llama 3 and Mistral’s Mixtral in the open-weight arms race. For startups and device makers, the message is clear: Efficiency is the new scale.
