- ★Decouples data ratios from pretraining via post-hoc merging
- ★Bayesian optimization replaces guesswork in mixture weights
- ★Gemma 27B tests show gains—but real-world gaps remain
Google’s Gemma 27B just became the guinea pig for a method that flips continual pretraining on its head. Instead of agonizing over dataset mixture ratios before training—where a bad guess can torch weeks of GPU time—OptiMer trains one model per dataset, extracts their ‘distribution vectors’ (a fancy term for ‘how each dataset warped the weights’), then optimizes the blend afterward using Bayesian search.
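The recipe is simple to sketch in code. The following toy illustration is an assumption-laden reconstruction, not the paper's implementation: the `merge` and `score` functions are hypothetical stand-ins, the weights are tiny NumPy arrays rather than a 27B model, and plain random search stands in for the Bayesian optimization so the sketch stays dependency-free.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # toy weight dimension, not real model size

# Base checkpoint plus one continually-pretrained checkpoint per dataset
# (shapes and the perturbation scale are illustrative assumptions).
base = rng.normal(size=dim)
specialists = {d: base + rng.normal(scale=0.1, size=dim)
               for d in ["japanese", "chinese", "math", "code"]}

# "Distribution vectors": how each dataset shifted the weights.
vectors = {d: w - base for d, w in specialists.items()}

def merge(alphas):
    """Blend the base model with weighted distribution vectors."""
    merged = base.copy()
    for d, a in zip(vectors, alphas):
        merged = merged + a * vectors[d]
    return merged

def score(model):
    # Hypothetical stand-in for a real validation metric (higher is
    # better): distance to an arbitrary synthetic target.
    target = base + 0.05 * sum(vectors.values())
    return -np.linalg.norm(model - target)

# The method optimizes the blend *after* training; random search
# stands in here for the Bayesian optimization described above.
best_alphas, best_score = None, -np.inf
for _ in range(200):
    alphas = rng.uniform(0.0, 1.0, size=len(vectors))
    s = score(merge(alphas))
    if s > best_score:
        best_alphas, best_score = alphas, s
```

The key property the sketch shows: each candidate mixture is just a cheap weighted sum over precomputed vectors, so searching hundreds of blends costs nothing compared to rerunning pretraining with a new data ratio.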
The irony? This works because pretraining is still more alchemy than science. Current methods force engineers to commit to data ratios upfront, like baking a cake without tasting the batter. OptiMer’s post-hoc merging isn’t just efficient—it’s an admission that we’re still groping for the right knobs.
Early benchmarks on Japanese, Chinese, math, and code domains show it outperforms data mixing and model averaging, but let’s not confuse ‘better than a bad baseline’ with ‘solved.’ The real test isn’t synthetic metrics—it’s whether this holds when models hit production chaos.
The rare case where ‘after the fact’ beats ‘plan ahead’
The competitive angle is sharp: this hands an advantage to teams with the resources to train multiple specialized models upfront. If you’re a startup scraping together one pretraining run, OptiMer’s ‘train now, optimize later’ approach is a luxury. For Google? It’s a force multiplier—another way to squeeze more performance from existing compute.
Developer reaction on Hacker News and GitHub skews cautious but intrigued. The method's elegance is undeniable, but as one commenter noted, 'Bayesian optimization over distribution vectors sounds like a band-aid for not knowing what your data's doing.' That's the reality gap: OptiMer doesn't eliminate the need for domain expertise; it just defers some of the pain from before training to after it.
The bigger question is whether this becomes a standard tool or another niche trick. Right now, it’s a clever hack for teams drowning in hyperparameter tuning. But if it scales, it could turn pretraining from a rigid pipeline into something more adaptive—assuming the overhead of training N models doesn’t cancel out the gains.