Editorial visual for "Trillion-parameter models now fit in laptops. So what?", focused on the article's core system and stakes. 📷 AI-generated / Tech&Space editorial composite
- 1T-parameter model runs on 96GB MacBook RAM
- iPhone demo hits 0.6 tokens/sec, with caveats
Five days ago, running a 397-billion-parameter model on 48GB of RAM was a neat parlor trick. Today the goalposts moved: @seikixtc squeezed Kimi K2.5, a 1-trillion-parameter Mixture-of-Experts (MoE) model with 32B active parameters, into a 96GB M2 Max MacBook Pro. Meanwhile, @anemll ported that same 397B model, Qwen3.5-397B-A17B, to an iPhone, where it churns out 0.6 tokens/second. The technique? Streaming experts: swapping expert weights in and out of RAM like a DJ cueing vinyl, but with SSDs and MoE architectures.
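For readers who want the DJ metaphor in code, here is a minimal sketch of the general idea: keep only a few experts resident in RAM and pull the rest from storage when the router asks for them. Everything in it is an assumption made for illustration, including the `ExpertCache` class, the least-recently-used eviction policy, and the `fake_ssd_read` stand-in for a real SSD read. Neither demo's actual implementation is described in this article, and it may work quite differently.

```python
from collections import OrderedDict
from typing import Callable


class ExpertCache:
    """Hypothetical RAM cache: holds at most `capacity` experts, evicts the least recently used."""

    def __init__(self, capacity: int, load_from_disk: Callable[[int], bytes]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_from_disk = load_from_disk
        self.resident: "OrderedDict[int, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> bytes:
        # Fast path: the expert is already in RAM, so mark it as recently used.
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)
            self.hits += 1
            return self.resident[expert_id]
        # Slow path: fetch from storage, then evict the oldest expert if over budget.
        self.misses += 1
        weights = self.load_from_disk(expert_id)
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)
        return weights


def fake_ssd_read(expert_id: int) -> bytes:
    # Stand-in for a real SSD read (assumption): returns 4 placeholder bytes.
    return bytes([expert_id % 256]) * 4


def run_demo() -> ExpertCache:
    cache = ExpertCache(capacity=2, load_from_disk=fake_ssd_read)
    # Made-up sequence of expert ids, as a router might pick them token by token.
    for expert_id in [0, 1, 0, 2, 1]:
        cache.get(expert_id)
    return cache


if __name__ == "__main__":
    demo = run_demo()
    print(f"hits={demo.hits} misses={demo.misses} resident={list(demo.resident)}")
```

Every miss in this sketch stands for a trip to the SSD, which is where the latency discussed below comes from.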
Hype filter engaged. This isn't about "AI on your phone"; it's about how far we'll stretch "runs" before admitting deployment is a different sport. The iPhone demo logs its limitations openly: no batch processing, glacial speeds, and a "proof of concept" label the size of a billboard. Even the MacBook feat, while impressive, pays for its RAM savings in latency. Real-world inference? Still a pipe dream for anything beyond toy examples.
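To put "glacial" in concrete terms, a quick back-of-the-envelope calculation helps. The 0.6 tokens/second figure comes from the iPhone demo; the response lengths are assumptions picked for illustration, and the function name is hypothetical.

```python
def generation_time_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit `num_tokens` at a steady decode rate, ignoring prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    iphone_rate = 0.6  # tokens/second, as reported for the iPhone demo
    for num_tokens in (50, 500):  # assumed response lengths, not measured
        seconds = generation_time_seconds(num_tokens, iphone_rate)
        print(f"{num_tokens} tokens: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

At that rate, a 500-token answer takes roughly 14 minutes, before counting any time spent processing the prompt.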
The numbers sound wild until you benchmark them against, say, an optimized Llama 3.1 405B running on actual servers with actual throughput. Streaming experts is a clever hack for edge cases, but edge cases don't ship products. They make demos.
The gap between "it boots" and "it ships" just got wider
Secondary visual angle showing the practical mechanism behind "The gap between 'it boots' and 'it ships' just got wider". 📷 AI-generated / Tech&Space editorial composite
So who benefits? Open-source tinkerers and cloud-averse researchers, for now. The technique cuts cloud costs by letting teams iterate on massive models without renting A100 clusters. But the trade-offs are brutal: SSD wear, token-by-token latency, and a workflow that assumes you've got hours to spare. For startups, this is a "maybe later", not a "drop everything".
The developer signal is mixed. GitHub stars for the iOS repo are pouring in, but the pull requests? Mostly bug fixes for making it run at all. The community's excitement is proportional to the novelty, not the utility. And while Dan Woods' autoresearch loops hunt for optimizations, the core tension remains: this is a RAM hack, not a compute breakthrough.
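The "RAM hack" framing is easy to check with rough arithmetic. The sketch below estimates weight storage from parameter count and bits per weight. The parameter counts and the 96GB figure come from the MacBook demo; the 4-bit quantization is purely an assumption, since the article does not say what precision either demo used, and the function name is hypothetical.

```python
def weight_footprint_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring KV cache and overhead."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    ram_gb = 96                        # MacBook Pro RAM from the demo
    total_params = 1_000_000_000_000   # 1T total parameters
    active_params = 32_000_000_000     # 32B active parameters per token
    bits = 4                           # assumed quantization level, not stated in the article

    total_gb = weight_footprint_gb(total_params, bits)
    active_gb = weight_footprint_gb(active_params, bits)
    print(f"all weights:    {total_gb:.0f} GB (fits in {ram_gb} GB RAM: {total_gb <= ram_gb})")
    print(f"active weights: {active_gb:.0f} GB (fits in {ram_gb} GB RAM: {active_gb <= ram_gb})")
```

Under that assumption the full model needs about 500 GB, far beyond 96 GB, while the weights touched for any one token come to about 16 GB. The active set changes from token to token, which is why the weights have to be streamed rather than loaded once, and why none of this reduces the compute per token.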
The real story isn't "bigger models on smaller hardware." It's that MoE architectures, once dismissed as unwieldy, are now the duct tape holding together the "run it anywhere" fantasy. That, and the quiet admission that we're still measuring AI progress in what fits where, not what works well.

