Alibaba tests Qwen3.7-Max as an AI engineer built to work for hours
Qwen3.7-Max was framed around long-running agent work on custom-chip code. 📷 AI-generated image / TECH&SPACE
- ★Qwen3.7-Max targets long-running autonomous agent tasks, not just short chatbot responses.
- ★Alibaba cites a 35-hour autonomous run optimizing code for its own custom chip.
- ★The model was compared with Claude Opus 4.6, DeepSeek V4 Pro and Kimi K2.6, and shown in a robotics demo.
Alibaba's Qwen team has introduced Qwen3.7-Max, a proprietary AI model aimed at one of the hardest practical tests for today's agent systems: whether a model can remain useful after many hours of work without constant human steering. According to The Decoder, the model ran autonomously for 35 hours while optimizing code for Alibaba's custom chip.
That detail matters more than the model name. Much of the public AI conversation is still built around short prompts, leaderboard tables and polished demos. This claim points at a different operating mode: sustained execution, context retention and the ability to keep a complex technical task from collapsing after the first few steps. If the described run holds up, Qwen3.7-Max is being positioned less as a one-shot assistant and more as a component in an AI work pipeline.
Alibaba is not starting from zero. Qwen is already a visible model family, and public material around the Qwen ecosystem shows how aggressively Chinese AI labs are building tooling, model infrastructure and application layers around their systems. In this case, however, Qwen3.7-Max is described as proprietary, which limits what outsiders can verify directly: the architecture, training mix, evaluation method and exact conditions of the 35-hour agent run.
The Qwen team says its new proprietary model autonomously optimized code for Alibaba's custom chip for 35 hours, matched Claude Opus 4.6 on benchmarks and steered a four-legged robot in a demo.
The robotics demo extends the story from code optimization to physical control. 📷 AI-generated image / TECH&SPACE
The benchmark claim also deserves a sober reading. The Decoder reports that Qwen3.7-Max matches Claude Opus 4.6 on benchmarks and beats Chinese rivals DeepSeek V4 Pro and Kimi K2.6. That comparison has commercial weight because Anthropic's Claude remains a key reference point for advanced reasoning and coding, while DeepSeek and Kimi represent strong domestic pressure on Alibaba in China.
But a benchmark is not the same thing as production work. The more interesting part of this release is the pairing of a benchmark claim with a specific agent scenario: code optimization for a custom chip. That kind of task demands more than a clean code snippet. The model has to track the objective, evaluate changes, respect hardware constraints and avoid drifting into confident improvisation. The available reporting does not say how much human oversight was involved, which tools were connected to the model or how the final optimization quality was measured. Those remain the important missing pieces.
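To make those four requirements concrete, here is a minimal Python sketch of a propose, evaluate and accept loop. It is an illustration only and says nothing about how Qwen3.7-Max or Alibaba's tooling actually works: every name in it (`optimize`, `RunLog`, the toy tile-size problem and its limit of 64) is a hypothetical stand-in.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RunLog:
    """Audit trail: one (step, best_cost, accepted) entry per iteration."""
    entries: List[Tuple[int, float, bool]] = field(default_factory=list)


def optimize(
    initial: float,
    propose: Callable[[float, int], float],
    cost: Callable[[float], float],
    within_limits: Callable[[float], bool],
    steps: int,
) -> Tuple[float, RunLog]:
    """Keep a candidate only if it respects the limits AND lowers the cost."""
    best = initial
    best_cost = cost(best)          # the objective is tracked explicitly
    log = RunLog()
    for step in range(steps):
        candidate = propose(best, step)
        # A change is evaluated, never trusted: it must pass the hard
        # constraint and measurably improve on the current best.
        accepted = within_limits(candidate) and cost(candidate) < best_cost
        if accepted:
            best, best_cost = candidate, cost(candidate)
        log.entries.append((step, best_cost, accepted))
    return best, log


# Hypothetical toy problem: tune a single "tile size". The unconstrained
# optimum (70) lies beyond what the pretend hardware allows (at most 64).
def toy_propose(current: float, step: int) -> float:
    # Stand-in for the model's suggestion: a big jump, then a small step back.
    return current + (16 if step % 2 == 0 else -4)


def toy_cost(size: float) -> float:
    return abs(size - 70)


def toy_limits(size: float) -> bool:
    return 0 < size <= 64


best, log = optimize(8, toy_propose, toy_cost, toy_limits, steps=10)
print(best, sum(1 for _, _, ok in log.entries if ok))
```

In this toy run the loop settles on a tile size of 56 after three accepted changes. A later proposal of 72 would score better, but it is rejected because it breaks the limit. The log is the part that matters for a multi-hour run: it lets an outsider check afterwards which steps actually improved the result.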
The robotics demo adds another signal. The team also showed the model steering a four-legged robot. That does not prove readiness for industrial autonomy, but it does reveal the intended direction: Qwen3.7-Max is not framed only as a text model, but as an agent core that can touch software, hardware and physical action. That is where the next stage of the AI race will be decided. Models that write a good answer are no longer enough; the valuable systems will be the ones that can work for hours in a way that is verifiable, correctable and resistant to quality collapse.

