HSCO-Bench Tests LLM Agents on Real HW-SW Chip Co-Design
Image: HSCO-Bench frames the LLM agent as part of the full SoC co-design flow. (AI-generated image / TECH&SPACE)
- HSCO-Bench joins software and hardware evaluations that AI benchmarks usually measure separately.
- The benchmark targets heterogeneous SoCs, where code, accelerator and platform decisions affect one another.
- The work comes from Columbia University and IBM Research and was reported by Semiconductor Engineering on May 25, 2026.
Researchers from Columbia University and IBM Research have released “HSCO-Bench: An Agent-Driven End-to-End Hardware-Software Co-design Benchmark for Systems-on-Chip,” according to the abstract highlighted by Semiconductor Engineering. The subject is technical, but the gap is easy to state: large language models are increasingly used in software and hardware design, while the benchmarks around them still tend to evaluate those domains separately.
That is a weak model for heterogeneous systems-on-chip. In a modern SoC, performance does not depend only on whether the software is clean or whether a hardware block is fast. It depends on how work is partitioned across processor cores, accelerators, memory, interconnects and the software layer that drives the whole platform. If a software benchmark assumes a fixed hardware target, and a hardware benchmark only optimizes a component, the central decision is missing: co-design.
HSCO-Bench is aimed at that missing layer. It evaluates LLM agents in an end-to-end hardware-software co-design flow rather than treating them only as code generators or isolated hardware optimization assistants. For the semiconductor industry, that shift matters. AI tools should not be judged only by their ability to solve neat standalone tasks. They need to be measured against connected engineering tradeoffs.
Columbia University and IBM Research propose a benchmark that stops separating software from hardware and measures whether an agent can optimize a full heterogeneous SoC flow.
Image: The benchmark measures linked decisions across code, accelerators and the SoC platform. (AI-generated image / TECH&SPACE)
This is where the difference between a polished demo and a useful benchmark becomes visible. An LLM can look capable when it writes a function, suggests an optimization, or describes a microarchitecture. A heterogeneous SoC asks for a chain of interdependent decisions: what should move into hardware, what should remain flexible in software, where the memory path becomes a bottleneck, and how performance, power and complexity stay in balance. If an agent cannot see that chain, its answer can be locally elegant and systemically wrong.
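A toy cost model makes the "locally elegant and systemically wrong" failure concrete. The sketch below is purely illustrative and is not taken from HSCO-Bench: the kernel names, the timings and the helper functions (`Kernel`, `end_to_end_ms`, `local_choice`, `system_choice`) are all hypothetical. It assumes each kernel runs either on a processor core or on an accelerator, and that offloading adds a fixed data-transfer cost on the memory path.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Kernel:
    """Hypothetical workload stage; all timings are made-up illustration values."""
    name: str
    cpu_ms: float       # runtime if the kernel stays in software on a core
    accel_ms: float     # compute time if the kernel moves to an accelerator
    transfer_ms: float  # extra memory-path cost paid only when offloaded


def end_to_end_ms(kernels, placement):
    """Total runtime for a placement, where True means 'run on the accelerator'."""
    total = 0.0
    for kernel, on_accel in zip(kernels, placement):
        if on_accel:
            total += kernel.accel_ms + kernel.transfer_ms
        else:
            total += kernel.cpu_ms
    return total


def local_choice(kernels):
    """Component-only view: offload whenever the accelerator computes faster."""
    return tuple(k.accel_ms < k.cpu_ms for k in kernels)


def system_choice(kernels):
    """System view: try every placement and keep the fastest end-to-end one."""
    candidates = product((False, True), repeat=len(kernels))
    return min(candidates, key=lambda p: end_to_end_ms(kernels, p))


KERNELS = [
    Kernel("fft", cpu_ms=10.0, accel_ms=2.0, transfer_ms=1.0),
    Kernel("filter", cpu_ms=4.0, accel_ms=1.0, transfer_ms=5.0),
]

if __name__ == "__main__":
    for label, chooser in (("local", local_choice), ("system", system_choice)):
        placement = chooser(KERNELS)
        print(label, placement, end_to_end_ms(KERNELS, placement))
```

With these invented numbers, the component-only rule offloads both kernels because each accelerator is faster in isolation, yet the end-to-end time is worse than keeping the `filter` kernel in software, where it avoids the transfer cost. A real SoC flow involves far more dimensions (power, area, interconnect contention), but the same pattern applies: a decision that is optimal for one block can be wrong for the system.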
The Columbia and IBM Research pairing is also notable. Columbia brings the academic benchmark framing, while IBM Research has a deep history in computing systems, chips and AI tooling. The supplied abstract does not provide enough detail to support claims about measured scores or specific benchmark results, so none are made here. The stronger signal is the problem definition itself: AI evaluation for chip design has to move from components to workflows.
For readers outside EDA and SoC engineering, this may sound narrow, but the implications are broader. If LLM agents are going to help design computing systems, they need to be tested where decisions carry cost. A heterogeneous SoC is not a text puzzle. It is a negotiation between software portability, hardware specialization and manufacturing reality. HSCO-Bench matters not because it promises magical automation, but because it asks a sharper question: can an agent reason about the system, not just the easiest slice of it?

