- Reinforcement learning replaces kNN for demo selection
- Dueling DQN with Transformer Decoder optimizes output range
- No performance numbers yet, just a March 2026 arXiv abstract
Multimodal Large Language Models (MLLMs) have spent the last two years drowning in their own demo debt. The standard fix, k-Nearest Neighbor (kNN) search, prioritizes similarity over substance, churning out redundant examples that flatten the output range of complex tasks like factual regression. Enter Learning to Select Demonstrations (LSD), a reinforcement learning approach that reframes demo selection as a sequential decision problem.
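To make the redundancy complaint concrete, here is a minimal sketch of plain kNN demo selection by cosine similarity. The pool, the 2-D embeddings, and the regression targets are invented for illustration and do not come from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_select(query, pool, k):
    """Return indices of the k pool items most similar to the query."""
    ranked = sorted(
        range(len(pool)),
        key=lambda i: cosine(query, pool[i]["emb"]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy pool: three near-duplicates plus two distant examples.
pool = [
    {"emb": [1.0, 0.0], "target": 2.0},
    {"emb": [0.99, 0.05], "target": 2.1},
    {"emb": [0.98, 0.1], "target": 1.9},
    {"emb": [0.5, 0.8], "target": 7.5},
    {"emb": [0.0, 1.0], "target": 9.0},
]
query = [1.0, 0.02]

chosen = knn_select(query, pool, k=3)
chosen_targets = [pool[i]["target"] for i in chosen]
target_spread = max(chosen_targets) - min(chosen_targets)
print(chosen, chosen_targets, target_spread)
```

In this toy setup the three selected demos are the near-duplicates, so their targets span only a narrow band of the pool's full output range. That is the flattening effect described above.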
Instead of letting kNN lazily grab the nearest neighbors, LSD trains a Dueling Deep Q-Network (DQN) with a query-centric Transformer Decoder to construct optimal sets. The goal isn't just to pick similar examples; it's to pick the ones that actually teach the model something new.
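The abstract names the components but gives no implementation details, so the following is only a sketch of the general idea. It uses the standard dueling aggregation, Q = V + (A - mean(A)), together with a greedy loop that builds the demo set one pick at a time. The `toy_score_fn`, its relevance scores, and its novelty bonus are hand-written stand-ins for a learned network, not the paper's method.

```python
def dueling_q(value, advantages):
    """Standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean(A)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

def select_sequentially(score_fn, num_candidates, budget):
    """Build a demo set step by step, taking the highest-Q candidate each time."""
    chosen = []
    for _ in range(budget):
        remaining = [i for i in range(num_candidates) if i not in chosen]
        value, advantages = score_fn(chosen, remaining)
        q_values = dueling_q(value, advantages)
        best = max(range(len(remaining)), key=lambda j: q_values[j])
        chosen.append(remaining[best])
    return chosen

# Hypothetical candidate pool: similarity to the query and regression targets.
relevance = [0.9, 0.85, 0.8, 0.5, 0.3]
targets = [2.0, 2.1, 1.9, 7.5, 9.0]

def toy_score_fn(chosen, remaining):
    """Stand-in for a learned network: relevance plus a bonus for new targets."""
    def novelty(i):
        if not chosen:
            return 0.0
        return min(abs(targets[i] - targets[c]) for c in chosen) / 10.0
    advantages = [relevance[i] + novelty(i) for i in remaining]
    value = 0.0  # placeholder state value; a trained model would predict this
    return value, advantages

picked = select_sequentially(toy_score_fn, num_candidates=5, budget=3)
print(picked, [targets[i] for i in picked])
```

Because each step conditions on what has already been chosen, the second pick here skips the remaining near-duplicates in favor of an example with a very different target. A fixed similarity ranking cannot do that.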
The paper's abstract, posted in March 2026, reads like a direct critique of the status quo. kNN's redundancy isn't just inefficient; it's actively harmful for tasks where output diversity matters. LSD's RL-based policy aims to maximize downstream performance, but the abstract stops short of sharing any numbers. That's the first red flag, or at least the first question mark. For all the talk of "optimal" demo sets, we're still in the realm of theoretical improvement, not benchmarked gains. The original kNN approach it's replacing was never designed for multimodal complexity, so the bar for "better" isn't exactly high.
The technical community has already started poking at the gaps. On GitHub discussions, developers note that RL-based selection isn't new (it's been tried in text-based ICL for years), but the multimodal twist is what's drawing attention. The real test will be whether LSD can scale beyond visual tasks. The paper's title hints at "visual in-context demonstrations," but the method's architecture doesn't seem tied to images. If it works, it could become a drop-in replacement for kNN across modalities.
The hype says 'smarter demos,' but the reality is still a research abstract
So who stands to gain?
The obvious winners are the teams already invested in MLLMs for complex regression tasks: think autonomous systems, medical imaging, or any domain where output range matters more than raw similarity. Companies like Google DeepMind and Meta have been vocal about the limitations of kNN, but neither has shipped a production-ready alternative. LSD's RL approach could fill that gap, assuming the performance claims hold up under scrutiny.
The competitive pressure isn't just on the model developers, though. The entire "in-context learning" narrative has been built on the back of cheap, unsupervised demo selection. If LSD proves that smarter selection leads to better performance, it could force a reckoning: either invest in RL-based curation or admit that your model's "learning" is just memorization in disguise. The Hugging Face community has already started debating whether this is a "nice-to-have" or a "must-have" for future MLLM architectures.
There's also the question of implementation cost. kNN is fast and cheap; RL is neither. The paper's Dueling DQN with a Transformer Decoder isn't exactly lightweight, and training a policy to select demos adds another layer of complexity to an already expensive pipeline. For now, the trade-off is theoretical. Until someone runs the numbers on real-world tasks, and shares them publicly, LSD remains an intriguing idea, not a proven upgrade.
The real signal here isn't the method itself, but the shift in thinking. Demo selection isn't just a preprocessing step anymore; it's a first-class problem. That's the kind of reframing that often precedes real progress, even if the first attempt is more hype than substance.

