Trillion-parameter models now fit in laptops. So what?

Simon Willison · Menlo Park, CA
Published: Mar 24, 2026 at 12:00 UTC

  • 1T-parameter model runs on 96GB MacBook RAM
  • iPhone demo hits 0.6 tokens/sec—with caveats
  • Streaming experts: clever hack or deployment dead-end?

Five days ago, running a 397-billion-parameter model on 48GB of RAM was a neat parlor trick. Today, the goalposts moved: @seikixtc squeezed the 1-trillion-parameter Kimi K2.5, a Mixture-of-Experts (MoE) model with just 32B active weights per token, onto a 96GB M2 Max MacBook Pro. Meanwhile, @anemll ported that original 397B model, Qwen3.5-397B-A17B, to an iPhone, churning out 0.6 tokens/second. The technique? Streaming experts: swapping expert weights in and out of RAM like a DJ cueing vinyl, with SSDs for crates and MoE expert blocks for records.
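
To make "streaming experts" concrete, here's a minimal sketch of the idea. The ExpertCache class, the one-file-per-expert layout, and every size in it are illustrative assumptions, not details from either project: the router picks a handful of experts per token, and only those experts' weights get pulled from SSD into a small RAM-resident LRU cache.

```python
from collections import OrderedDict
import numpy as np

class ExpertCache:
    """Keep only recently used experts resident in RAM; stream the
    rest from SSD on demand. Illustrative sketch, not either project."""

    def __init__(self, weight_dir, max_resident=8):
        self.weight_dir = weight_dir      # hypothetical: one .npy file per expert
        self.max_resident = max_resident  # RAM budget, counted in whole experts
        self.cache = OrderedDict()        # expert_id -> weight matrix

    def get(self, expert_id):
        if expert_id in self.cache:               # hit: already in RAM
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        # Miss: this SSD read is where the per-token latency comes from.
        w = np.load(f"{self.weight_dir}/expert_{expert_id}.npy")
        self.cache[expert_id] = w
        if len(self.cache) > self.max_resident:
            self.cache.popitem(last=False)        # evict least recently used
        return w

def moe_layer(x, router_logits, cache, top_k=2):
    """Route a token through its top-k experts, streaming weights as needed."""
    top = np.argsort(router_logits)[-top_k:]      # indices of the chosen experts
    gates = np.exp(router_logits[top])
    gates /= gates.sum()                          # softmax over selected logits
    # Only these top_k experts' weights ever need to be resident in RAM.
    return sum(g * (x @ cache.get(int(e))) for g, e in zip(gates, top))
```

A real implementation would add quantization, memory-mapped I/O, and prefetching on top of this, but the sketch shows the core trade: resident memory shrinks from all experts to a handful, and every cache miss bills you an SSD read, per layer, per token.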

Hype filter engaged. This isn’t about ‘AI on your phone’—it’s about how far we’ll stretch ‘runs’ before admitting deployment is a different sport. The iPhone demo logs its limitations openly: no batch processing, glacial speeds, and a ‘proof of concept’ label the size of a billboard. Even the MacBook feat, while impressive, trades latency for RAM savings. Real-world inference? Still a pipe dream for anything beyond toy examples.

The numbers sound wild until you benchmark them against, say, Llama 3.1’s optimized 405B running on actual servers with actual throughput. Streaming experts is a clever hack for edge cases—but edge cases don’t ship products. They make demos.

The gap between ‘it boots’ and ‘it ships’ just got wider

So who benefits? Open-source tinkerers and cloud-averse researchers, for now. The technique cuts cloud costs by letting teams iterate on massive models without renting A100 clusters. But the trade-offs are brutal: SSD wear, token-by-token latency, and a workflow that assumes you’ve got hours to spare. For startups, this is a ‘maybe later’—not a ‘drop everything’.
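
The latency wall is easy to sanity-check with napkin math. Everything below (quantization level, SSD bandwidth, cache hit rate) is an illustrative assumption, not a measurement from either demo:

```python
# Napkin math: token-rate ceiling when expert weights stream from SSD.
# Every figure here is an assumption for illustration, not a benchmark.
active_params = 32e9      # active weights per token (Kimi K2.5-class MoE)
bytes_per_param = 0.5     # assumes ~4-bit quantization
ssd_read_bw = 5e9         # bytes/sec, a fast NVMe drive
cache_hit_rate = 0.9      # fraction of expert reads already resident in RAM

ssd_bytes_per_token = active_params * bytes_per_param * (1 - cache_hit_rate)
tokens_per_sec = ssd_read_bw / ssd_bytes_per_token
print(f"{ssd_bytes_per_token / 1e9:.1f} GB/token from SSD "
      f"-> {tokens_per_sec:.1f} tokens/sec ceiling")
# 1.6 GB/token -> ~3.1 tokens/sec. Drop the hit rate or the bandwidth
# (phone-class storage) and the rate craters toward the reported 0.6.
```

Note that compute never enters the equation: the ceiling is set entirely by storage bandwidth and cache behavior, which is why faster chips alone won't fix it.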

The developer signal is mixed. GitHub stars for the iOS repo are pouring in, but the pull requests? Mostly bug fixes for making it run at all. The community’s excitement is proportional to the novelty, not the utility. And while Dan Woods’ autoresearch loops hunt for optimizations, the core tension remains: this is a RAM hack, not a compute breakthrough.

The real story isn’t ‘bigger models on smaller hardware.’ It’s that MoE architectures—once dismissed as unwieldy—are now the duct tape holding together the ‘run it anywhere’ fantasy. That, and the quiet admission that we’re still measuring AI progress in what fits where, not what works well.

Tags: MacBook, Streaming, Large Language Models