HSCO-Bench: An Agent-Driven End-to-End Hardware-Software Co-design Benchmark for Systems-on-Chip
Pith reviewed 2026-05-20 02:24 UTC · model grok-4.3
The pith
A new benchmark shows frontier LLMs rarely complete end-to-end hardware-software co-design for heterogeneous SoCs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HSCO-Bench is presented as the first benchmark (to the authors' knowledge) that requires LLMs to jointly reason about and modify both software and hardware stacks to generate complete, deployable heterogeneous SoC prototypes. Results show end-to-end integration remains difficult: only two of five evaluated models produce valid designs on the target FPGA platform, and these designs reach a peak speedup of 16.22X while adding at most 23.67% resource utilization. The work shows that models can identify acceleration opportunities but still heavily underutilize the available hardware capacity.
What carries the argument
HSCO-Bench, an end-to-end benchmark built on an open-source SoC platform with curated repository structure that evaluates LLM agents on generating and deploying accelerator-rich heterogeneous SoC prototypes to an AMD Virtex-7 FPGA.
If this is right
- LLM agents must improve at joint hardware-software reasoning to produce usable accelerator-rich SoCs.
- Current models identify some acceleration kernels but leave substantial hardware capacity unused.
- The benchmark provides a concrete way to measure progress in agent-driven co-design over time.
- Design flows that separate hardware and software evaluation miss the integration failures observed here.
Where Pith is reading between the lines
- Extending the benchmark to additional FPGA targets or ASIC flows would reveal whether the observed limitations are platform-specific.
- The gap between achieved and possible resource utilization suggests LLMs need stronger explicit cost or area models in their reasoning.
- Successful co-design may require hybrid human-AI loops rather than fully autonomous agents in the near term.
Load-bearing premise
The chosen open-source SoC platform and specific AMD Virtex-7 FPGA target form a representative testbed for real-world end-to-end hardware-software co-design.
What would settle it
A new model that consistently produces valid SoC prototypes achieving near-maximal resource utilization and higher speedups than 16.22X on the same platform and tasks would contradict the reported challenges.
Original abstract
Large language models (LLMs) are adopted for software and hardware design, yet these domains are still evaluated separately. Software benchmarks typically assume fixed hardware targets, while hardware benchmarks focus on component-level optimization without considering the full hardware-software stack. Consequently, no existing benchmark evaluates whether an LLM agent can perform end-to-end, system-level hardware-software co-design. Such a process requires: 1) analyzing applications to identify kernels requiring acceleration, 2) designing and integrating heterogeneous accelerators into a System-on-Chip (SoC) under resource constraints, and 3) mapping kernels onto the generated accelerators. We present HSCO-Bench, an end-to-end hardware-software co-design benchmark for accelerator-rich heterogeneous SoC generation. Built upon an open-source SoC platform with a curated repository structure, HSCO-Bench evaluates the ability of LLMs to jointly optimize software and hardware stacks, producing SoC prototypes deployed on the AMD Virtex-7 FPGA VC707 Evaluation Kit. Experimental results show that end-to-end integration remains challenging for current models. Among the five frontier models evaluated, only two of them could successfully generate valid SoC prototypes. Yet, even in these successful instances, the generated designs are far from optimal. While we observe a promising peak speedup of 16.22X, the maximum additional resource utilization reaches only 23.67%. This highlights that while state-of-the-art models demonstrate an emerging capability for hardware acceleration, they still heavily underutilize the available hardware capacity, leaving room for future optimization. To the best of our knowledge, HSCO-Bench is the first benchmark targeting this complete co-design flow, enabling LLMs to jointly reason about and modify both the software and hardware stacks of heterogeneous SoCs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HSCO-Bench, the first benchmark for evaluating LLM agents on complete end-to-end hardware-software co-design of accelerator-rich heterogeneous SoCs. Built on a curated open-source SoC platform targeting the AMD Virtex-7 VC707 FPGA, the benchmark requires agents to identify acceleration kernels, design and integrate heterogeneous accelerators under resource constraints, and map software kernels. Experiments on five frontier models show only two produce valid SoC prototypes; even these achieve a peak speedup of 16.22X but only 23.67% additional resource utilization, indicating substantial underutilization of hardware capacity.
Significance. If the benchmark and testbed prove representative, the work provides concrete evidence of current limitations in joint hardware-software reasoning by LLMs and supplies a reproducible starting point for measuring progress in agent-driven SoC generation. The explicit success rates, speedup, and utilization metrics are useful for the community.
major comments (1)
- The central experimental claim—that only two of five models produce valid prototypes and that even successful designs heavily underutilize hardware—depends on the chosen open-source platform and Virtex-7 VC707 target being a fair proxy for real heterogeneous SoC flows. The manuscript should explicitly discuss how the platform’s peripheral set, repository structure, and tool flow compare to commercial or more complex SoC design scenarios; without this, the observed failure modes and optimality gaps may not generalize.
minor comments (2)
- The results section should specify the exact prompt templates, evaluation criteria for “valid SoC prototypes,” and whether multiple runs or statistical tests were performed to support the reported success rates and metrics.
- Clarify the precise definition of “additional resource utilization” and how the 23.67% figure is computed relative to the baseline SoC.
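One plausible reading of "additional resource utilization", offered only as an assumption since the paper's exact definition is what this comment asks for, is the accelerator-enabled usage minus the baseline SoC usage, expressed as a share of device capacity. The sketch below uses the VC707 capacities quoted in the benchmark's evaluation criteria; the function name and the sample usage figures are hypothetical.

```python
# VC707 capacities as quoted in the benchmark's evaluation criteria.
VC707_CAPACITY = {"logic_cells": 485_760, "bram_kb": 37_080, "dsp_slices": 2_800}


def additional_utilization(baseline, with_accel, capacity=VC707_CAPACITY):
    """Extra usage over the baseline SoC, in percent of device capacity.

    Assumed definition: the paper may compute its 23.67% figure differently.
    """
    return {
        name: 100.0 * (with_accel[name] - baseline[name]) / capacity[name]
        for name in capacity
    }


# Hypothetical usage numbers, chosen only to exercise the function.
baseline = {"logic_cells": 100_000, "bram_kb": 5_000, "dsp_slices": 100}
with_accel = {"logic_cells": 148_576, "bram_kb": 8_708, "dsp_slices": 380}

extra = additional_utilization(baseline, with_accel)
for name, pct in extra.items():
    print(f"{name}: +{pct:.2f}% of capacity")
```

Under this reading, a single headline number would be the maximum over the resource types, which is one more choice the manuscript would need to state.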
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We have reviewed the major comment carefully and provide a point-by-point response below. We agree that additional context on the platform will strengthen the paper and plan to incorporate revisions accordingly.
Point-by-point responses
-
Referee: The central experimental claim—that only two of five models produce valid prototypes and that even successful designs heavily underutilize hardware—depends on the chosen open-source platform and Virtex-7 VC707 target being a fair proxy for real heterogeneous SoC flows. The manuscript should explicitly discuss how the platform’s peripheral set, repository structure, and tool flow compare to commercial or more complex SoC design scenarios; without this, the observed failure modes and optimality gaps may not generalize.
Authors: We agree that an explicit discussion of the platform's representativeness is valuable for interpreting the results. In the revised manuscript, we will add a dedicated paragraph in Section 3 (or a new subsection) that compares the open-source SoC platform to commercial flows. Specifically, we will describe that the platform targets the AMD Virtex-7 VC707 with a standard set of peripherals (AXI interconnect, DDR3, Ethernet, UART, and GPIO), uses a curated repository structure that mirrors typical open-source SoC repositories to enable agent modifications, and relies on the Xilinx Vivado tool flow for synthesis and implementation. We will note that this FPGA-based setup captures essential aspects of heterogeneous accelerator integration and resource-constrained co-design but does not fully replicate ASIC tape-out complexities, advanced verification suites, or larger-scale commercial SoCs with proprietary IP blocks and multi-die packaging. This addition will clarify the scope of the observed failure modes and utilization gaps while preserving the benchmark's focus on reproducible, agent-driven end-to-end flows.
Revision: yes
Circularity Check
No circularity in empirical benchmark paper with no derivations or fitted predictions
full rationale
This paper introduces HSCO-Bench, a new empirical benchmark for LLM agents performing end-to-end hardware-software co-design on an open-source SoC platform targeting the AMD Virtex-7 FPGA. The central claims consist of direct experimental results (e.g., only 2/5 models produce valid prototypes, peak speedup 16.22X with max 23.67% additional resource utilization). No mathematical derivations, first-principles predictions, parameter fitting, self-definitional loops, or load-bearing self-citations are present. The work is self-contained as an empirical evaluation against external model performance on the introduced benchmark, with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The open-source SoC platform and AMD Virtex-7 FPGA target are representative of real heterogeneous SoC design challenges under resource constraints.
invented entities (1)
- HSCO-Bench: no independent evidence
Reference graph
Works this paper leans on
-
[1]
URL https://deepmind.google/models/model-cards/gemini-3-1-pro/. Ce Guo and Tong Zhao. ResBench: A resource-aware benchmark for LLM-generated FPGA designs. In Proceedings of the 15th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies, HEART ’25, page 25–34, New York, NY, USA, 2025. Association for Computing Machiner...
-
[2]
doi: 10.1109/ISSCC.2014.6757323. Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66. Kimi Team. Kimi k2.5: Visual...
-
[3]
load the bitstream onto the FPGA
-
[4]
load the data onto the FPGA
-
[5]
Please take a look at the README.md file and finish the job
execute the program
B Codebase Structure of Each Testcase
Each test case is located in a separate directory, and the files are listed in Figure 6.
test_case/
    <accelerator>_workspace/: accelerator-related folder for each accelerator, including accelerator SystemC code, simulation scripts, and software driver
    data_files/: ...
-
[6]
**Copy the workspace and replace `dummy`/`DUMMY` strings inside files** (case-preserving):
```bash
cp -r dummy_workspace myaccel_workspace
cd myaccel_workspace
find . -type f -exec sed -i 's/dummy/myaccel/g; s/DUMMY/MYACCEL/g' {} +
```
-
[7]
**Rename files and directories** whose names contain `dummy` - only substitute the basename, not the full path (use `-depth` so inner paths are renamed before their parents):
```bash
find . -depth -name '*dummy*' | while read f; do
  d=$(dirname "$f"); b=$(basename "$f")
  mv "$f" "$d/$(echo "$b" | sed 's/dummy/myaccel/g')"
done
```
-
[8]
**Assign a unique device ID** in both the XML file and the driver config header. Pick a value from this project's reserved range `0x201`–`0x27F` (the reference `dummy` already occupies `0x200`). See **Device ID Allocation** below for the full convention.
```bash
sed -i 's/device_id="200"/device_id="201"/' \
  myaccel_sysc_catapult/myaccel_sysc_catapult.xml
...
```
-
[9]
Modify `src/<accel>.cpp` - implement your kernel in the `compute()` function. The `load()` and `store()` functions handle DMA data movement
-
[10]
Modify `inc/<accel>_specs.hpp` - adjust PLM sizes (e.g. `A_PLM_IN_WORD`, `B_PLM_IN_WORD`, `O_PLM_OUT_WORD`) and `MEM_SIZE` to match your data dimensions
-
[11]
If you change the accelerator's configuration parameters, update `inc/<accel>_conf_info.hpp` **and** the XML `<param>` list (register offsets are derived from the XML)
-
[12]
Update the testbench (`tb/testbench.cpp`, `tb/testbench.hpp`) to match your accelerator's interface.
### Step 4: Write the software driver
Adapt the driver (from `dummy_sw_driver/dummy_driver.h`) so that it:
-
[13]
Probes for the accelerator via `probe(&devs, VENDOR_SLD, DEV_ID, DEV_NAME)`
-
[14]
Provides an `exec()` function that:
  - Converts float inputs to fixed-point and writes to `accel_shared_mem`
  - Configures accelerator registers via `iowrite32()`
  - Starts accelerator and polls `STATUS_REG` for completion
  - Converts fixed-point outputs back to float
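The float/fixed-point conversion steps of `exec()` can be sketched as follows. This is a minimal Python illustration, not the benchmark's driver code: the 16 fractional bits, the 32-bit word width, and both function names are assumptions, since the excerpt does not state the number format.

```python
FRAC_BITS = 16   # hypothetical Q-format; not given in the source
WORD_BITS = 32   # hypothetical word width


def float_to_fixed(x, frac_bits=FRAC_BITS, width=WORD_BITS):
    """Scale a float to a signed fixed-point integer, saturating at the word limits."""
    scaled = int(round(x * (1 << frac_bits)))
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    return max(lo, min(hi, scaled))


def fixed_to_float(v, frac_bits=FRAC_BITS):
    """Recover the float value represented by a fixed-point integer."""
    return v / (1 << frac_bits)


inputs = [1.5, -0.25, 0.1]
words = [float_to_fixed(x) for x in inputs]      # what would be written to shared memory
outputs = [fixed_to_float(w) for w in words]     # what would be read back
```

The round trip is exact only for values representable in the chosen format; other values pick up a quantization error of at most half a least-significant bit, which matters for workloads whose correctness check is an error threshold.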
-
[15]
The driver must use the shared memory from `accel_common.h` (do not allocate separate DMA buffers)
### Step 5: Integrate into software
-
[16]
Add `#include` for your driver in `accel_drivers.h`
-
[17]
Add init/cleanup calls in `accel_init_all()` / `accel_cleanup_all()`
-
[18]
Modify the workload's core computation (the file called from `systest.c`) to invoke your accelerator's `exec()` function for the appropriate operations (guarded by `#ifdef USE_ACCELERATOR`)
-
[19]
Enable `USE_ACCELERATOR` in `systest.c`
-
[20]
Register your accelerator(s) in `esp_xilinx-vc707-xc7vx485t_defconfig` (see **SoC Tile Configuration** below)
## Key ESP Platform Conventions
### SoC Tile Configuration (`esp_defconfig`)
The ESP SoC is organized as a **NoC (Network-on-Chip) grid** of tiles. Each tile can be a CPU, memory, I/O, or accelerator. The tile layout is defined in `esp_xilinx-vc7...
... `200"` in XML corresponds to `0x200` in C code. **It is NOT decimal.** For example `device_id=...
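The device-ID convention in these excerpts (an XML `device_id="200"` denotes `0x200`, and new accelerators take a value from `0x201`–`0x27F`) can be checked with a short script. This is a sketch under assumptions: the XML element layout in `sample` is hypothetical, as the real file's structure is not shown, and the helper names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Reserved range quoted in the excerpt; 0x200 belongs to the reference `dummy`.
RESERVED_RANGE = range(0x201, 0x27F + 1)


def device_id_from_xml(xml_text):
    """Return the first device_id attribute found, parsed as hexadecimal."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if "device_id" in elem.attrib:
            return int(elem.attrib["device_id"], 16)  # hex, NOT decimal
    raise ValueError("no device_id attribute found")


def is_valid_new_id(dev_id, taken=()):
    """True if dev_id lies in the reserved range and is not already in use."""
    return dev_id in RESERVED_RANGE and dev_id not in taken


# Hypothetical XML fragment, for illustration only.
sample = '<sld><accelerator name="myaccel" device_id="201"/></sld>'
dev_id = device_id_from_xml(sample)
print(hex(dev_id), is_valid_new_id(dev_id))
```

Parsing with base 16 is the step that is easy to get wrong: read as decimal, `"201"` would no longer match the `0x201` constant in the driver header.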
-
[21]
You are **NOT allowed** to access (read or write) files **outside the workspace directory**
-
[22]
`dummy_workspace/` is a **reference-only** structural template. You must **not use it as-is** - its kernel is an identity pass-through and cannot perform any useful computation. Create your own accelerator workspace(s) by copying it and redesigning the HLS source, testbench, driver, and configuration to match your design
-
[23]
Each accelerator must have a **unique device ID** in its XML file (`<your_workspace>/<accel_name>/<accel_name>.xml`).
## Evaluation Criteria
-
[24]
**Execution time**: Minimize total workload execution time (measured in cycles by `rdcycle64()`)
-
[25]
**Resource constraints** (VC707 FPGA):
  - Logic Cells: 485,760
  - Block RAM (Kb): 37,080
  - DSP Slices: 2,800
-
[26]
**Correctness**: The accelerated pipeline must **pass** whatever correctness check `systest.c` performs (e.g., output-vs-gold comparison, top-1 match, MSE threshold, etc.). The exact check and its threshold are workload-specific - look at what `systest.c` reports at the end of a run.
## Deliverables
Your job is done when you have completed all of the following:
-
[27]
Created accelerator HLS source(s) (new workspace directories with complete `src/`, `inc/`, `tb/`, and XML)
-
[28]
Written corresponding software driver(s) (header files with init/exec/cleanup functions)
-
[29]
Integrated drivers into `accel_drivers.h` and modified the workload's core computation to use your accelerators
-
[30]
Updated `esp_xilinx-vc707-xc7vx485t_defconfig` with accelerator tile entries (expanding NoC grid if needed)
-
[31]
Enabled `USE_ACCELERATOR` in `systest.c`
D Performance Measurement
We measure the cycle count of the application by inserting the following function:
    static inline uint64_t rdcycle64(void) {
        uint64_t val;
        __asm__ __volatile__("csrr %0, mcycle" : "=r"(val));
        return val;
    }
Listing 1: Cycle measurement on RISC-...
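Given two readings of such a cycle counter, elapsed time is their difference and speedup is the baseline cycle count divided by the accelerated one. A minimal Python sketch of that arithmetic follows; the helper names and the cycle counts are hypothetical, picked only so the result has the same form as the reported 16.22X.

```python
COUNTER_MODULUS = 1 << 64  # the counter is read as a 64-bit value


def elapsed_cycles(start, end):
    """Cycles between two counter reads, tolerating a single 64-bit wraparound."""
    return (end - start) % COUNTER_MODULUS


def speedup(baseline_cycles, accelerated_cycles):
    """Ratio of software-only cycles to accelerated cycles."""
    if accelerated_cycles <= 0:
        raise ValueError("accelerated cycle count must be positive")
    return baseline_cycles / accelerated_cycles


# Hypothetical counter readings, for illustration only.
sw_cycles = elapsed_cycles(1_000, 1_623_000)   # software-only run
hw_cycles = elapsed_cycles(5_000, 105_000)     # accelerated run
print(f"speedup: {speedup(sw_cycles, hw_cycles):.2f}X")
```

Both runs need to bracket the same region of the workload, including data conversion and DMA setup in the accelerated case, for the ratio to reflect end-to-end gains.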
-
[32]
...
                /* Accumulate into output */
                for (int i = 0; i < r_tile; i++) {
                    for (int j = 0; j < c_tile; j++) {
                        output[(r_start + i) * out_features + (c_start + j)] += bert_tiny_vc707_opus_matmul_tile_buf[i * c_tile + j];
                    }
                }
            }
        }
    }
Listing 2: Software-managed tiling logic genera...
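The accumulation pattern in this fragment, where each tile-sized partial product is added into the full output at its row and column offset, can be sketched in Python as below. This is an illustration under assumptions rather than the generated code: `matmul_tile` stands in for an accelerator call, and the tiling over the inner dimension is inferred from the `+=` accumulation rather than shown in the excerpt.

```python
def matmul_tile(a, b, r0, c0, k0, r_tile, c_tile, k_tile, k_dim, n_cols):
    """Stand-in for one accelerator call: flat r_tile x c_tile partial product."""
    buf = [0.0] * (r_tile * c_tile)
    for i in range(r_tile):
        for j in range(c_tile):
            acc = 0.0
            for k in range(k_tile):
                acc += a[(r0 + i) * k_dim + (k0 + k)] * b[(k0 + k) * n_cols + (c0 + j)]
            buf[i * c_tile + j] = acc
    return buf


def tiled_matmul(a, b, rows, k_dim, cols, tile):
    """Row-major (rows x k_dim) times (k_dim x cols), computed tile by tile."""
    out = [0.0] * (rows * cols)
    for r0 in range(0, rows, tile):
        r_tile = min(tile, rows - r0)          # edge tiles may be smaller
        for c0 in range(0, cols, tile):
            c_tile = min(tile, cols - c0)
            for k0 in range(0, k_dim, tile):
                k_tile = min(tile, k_dim - k0)
                buf = matmul_tile(a, b, r0, c0, k0, r_tile, c_tile, k_tile, k_dim, cols)
                # Accumulate the partial tile into the output at its offset.
                for i in range(r_tile):
                    for j in range(c_tile):
                        out[(r0 + i) * cols + (c0 + j)] += buf[i * c_tile + j]
    return out


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]        # 2 x 3
b = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]     # 3 x 2
result = tiled_matmul(a, b, rows=2, k_dim=3, cols=2, tile=2)
```

Tiling of this kind lets a fixed-size accelerator buffer serve matrices larger than its local memory, at the cost of extra software loops and data movement per tile.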