Meta tag

SWE-bench Verified

2 articles

Half of AI Code Fails Real-World Review Despite Benchmark Glory

AI code is winning benchmarks and losing the review that actually matters

A METR study reveals that nearly half of AI-generated code passing the SWE-bench benchmark would be rejected by actual developers in production environments.

11 Mar 2026

AI’s latest safety trick: Behavior trees over black-box hype

OpenHands’ new paper distills LLM execution logs into verifiable behavior trees—a rare case of safety designed *before* the demo.

09 Mar 2026