
arxiv: 2605.19399 · v1 · pith:TDEM7B2Unew · submitted 2026-05-19 · 💻 cs.AR

HSCO-Bench: An Agent-Driven End-to-End Hardware-Software Co-design Benchmark for Systems-on-Chip

Pith reviewed 2026-05-20 02:24 UTC · model grok-4.3

classification 💻 cs.AR
keywords hardware-software co-design · LLM agents · System-on-Chip · FPGA prototyping · accelerator integration · heterogeneous computing · benchmark · end-to-end design

The pith

A new benchmark shows frontier LLMs rarely complete end-to-end hardware-software co-design for heterogeneous SoCs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HSCO-Bench to test whether LLM agents can handle the full hardware-software co-design flow for accelerator-rich Systems-on-Chip. This flow includes analyzing applications to find kernels for acceleration, designing and integrating heterogeneous accelerators into an SoC under resource limits, and mapping software kernels onto those accelerators. The benchmark uses an open-source SoC platform with a structured repository and targets deployment on an AMD Virtex-7 FPGA. Experiments with five frontier models reveal that only two produce valid SoC prototypes, and even those achieve limited resource utilization despite some speedups. This establishes that current models have emerging but incomplete capability for joint hardware and software optimization.

Core claim

HSCO-Bench is presented as the first benchmark that requires LLMs to jointly reason about and modify both the software and hardware stacks to generate complete, deployable heterogeneous SoC prototypes. Results show that end-to-end integration remains difficult: only two of the five evaluated models produce valid designs on the target FPGA platform, and those designs reach a peak speedup of 16.22X while adding at most 23.67% in resource utilization. The work shows that models can identify acceleration opportunities but still heavily underutilize the available hardware capacity.
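
For concreteness, the two headline numbers can be read as simple ratios. The sketch below is illustrative only: the function names, the sample cycle counts and resource figures, and the definition of additional utilization (resources added on top of the baseline SoC, divided by device capacity) are assumptions, since this page does not give the paper's exact formula. The capacity values are the VC707 limits quoted in the benchmark's task README excerpt further down this page.

```python
# VC707 resource limits as quoted in the benchmark's task README excerpt.
VC707_CAPACITY = {"logic_cells": 485_760, "bram_kb": 37_080, "dsp_slices": 2_800}


def speedup(baseline_cycles: int, accelerated_cycles: int) -> float:
    """Speedup of the accelerated run over the software-only baseline."""
    if baseline_cycles <= 0 or accelerated_cycles <= 0:
        raise ValueError("cycle counts must be positive")
    return baseline_cycles / accelerated_cycles


def additional_utilization(baseline: dict, design: dict,
                           capacity: dict = VC707_CAPACITY) -> dict:
    """Percent of device capacity added beyond the baseline SoC.

    This definition is an assumption made for illustration.
    """
    return {
        name: 100.0 * (design[name] - baseline[name]) / capacity[name]
        for name in capacity
    }


# Hypothetical numbers chosen only to exercise the functions.
example_speedup = speedup(baseline_cycles=1_622_000, accelerated_cycles=100_000)
example_extra = additional_utilization(
    baseline={"logic_cells": 100_000, "bram_kb": 5_000, "dsp_slices": 100},
    design={"logic_cells": 148_576, "bram_kb": 8_708, "dsp_slices": 380},
)
```

Under this assumed definition, a maximum of 23.67% would mean that the agent-added logic consumed less than a quarter of the device.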

What carries the argument

HSCO-Bench, an end-to-end benchmark built on an open-source SoC platform with curated repository structure that evaluates LLM agents on generating and deploying accelerator-rich heterogeneous SoC prototypes to an AMD Virtex-7 FPGA.

If this is right

  • LLM agents must improve at joint hardware-software reasoning to produce usable accelerator-rich SoCs.
  • Current models identify some acceleration kernels but leave substantial hardware capacity unused.
  • The benchmark provides a concrete way to measure progress in agent-driven co-design over time.
  • Design flows that separate hardware and software evaluation miss the integration failures observed here.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the benchmark to additional FPGA targets or ASIC flows would reveal whether the observed limitations are platform-specific.
  • The gap between achieved and possible resource utilization suggests LLMs need stronger explicit cost or area models in their reasoning.
  • Successful co-design may require hybrid human-AI loops rather than fully autonomous agents in the near term.

Load-bearing premise

The chosen open-source SoC platform and specific AMD Virtex-7 FPGA target form a representative testbed for real-world end-to-end hardware-software co-design.

What would settle it

A new model that consistently produces valid SoC prototypes achieving near-maximal resource utilization and higher speedups than 16.22X on the same platform and tasks would contradict the reported challenges.

Figures

Figures reproduced from arXiv: 2605.19399 by Kuan-Lin Chiu, Luca P. Carloni, Pei-Huan Tsai, Pin-Yu Chen, William Baisi.

Figure 1: Agent-powered autonomous SoC generation and hardware-software co-design.
Figure 2: HSCO-Bench flow.
Figure 3: Evaluation results across 10 applications.
Figure 4: Distribution of failure modes across eval…
Figure 5: Cost efficiency (η) comparison across applications. While Opus 4.6 provides a steady baseline with fewer failures, GPT-5.4 exhibits superior value-for-money on several successful runs. To evaluate the economic viability of utilizing LLMs for SoC design, we assess the cost efficiency of the two successful models (Opus 4.6 and GPT-5.4). We define a specialized metric, Cost Efficiency (η), calculated as: η …
Figure 6: The testcase structure of our proposed benchmark.
Original abstract

Large language models (LLMs) are adopted for software and hardware design, yet these domains are still evaluated separately. Software benchmarks typically assume fixed hardware targets, while hardware benchmarks focus on component-level optimization without considering the full hardware-software stack. Consequently, no existing benchmark evaluates whether an LLM agent can perform end-to-end, system-level hardware-software co-design. Such a process requires: 1) analyzing applications to identify kernels requiring acceleration, 2) designing and integrating heterogeneous accelerators into a System-on-Chip (SoC) under resource constraints, and 3) mapping kernels onto the generated accelerators. We present HSCO-Bench, an end-to-end hardware-software co-design benchmark for accelerator-rich heterogeneous SoC generation. Built upon an open-source SoC platform with a curated repository structure, HSCO-Bench evaluates the ability of LLMs to jointly optimize software and hardware stacks, producing SoC prototypes deployed on the AMD Virtex-7 FPGA VC707 Evaluation Kit. Experimental results show that end-to-end integration remains challenging for current models. Among the five frontier models evaluated, only two of them could successfully generate valid SoC prototypes. Yet, even in these successful instances, the generated designs are far from optimal. While we observe a promising peak speedup of 16.22X, the maximum additional resource utilization reaches only 23.67%. This highlights that while state-of-the-art models demonstrate an emerging capability for hardware acceleration, they still heavily underutilize the available hardware capacity, leaving room for future optimization. To the best of our knowledge, HSCO-Bench is the first benchmark targeting this complete co-design flow, enabling LLMs to jointly reason about and modify both the software and hardware stacks of heterogeneous SoCs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces HSCO-Bench, the first benchmark for evaluating LLM agents on complete end-to-end hardware-software co-design of accelerator-rich heterogeneous SoCs. Built on a curated open-source SoC platform targeting the AMD Virtex-7 VC707 FPGA, the benchmark requires agents to identify acceleration kernels, design and integrate heterogeneous accelerators under resource constraints, and map software kernels. Experiments on five frontier models show only two produce valid SoC prototypes; even these achieve a peak speedup of 16.22X but only 23.67% additional resource utilization, indicating substantial underutilization of hardware capacity.

Significance. If the benchmark and testbed prove representative, the work provides concrete evidence of current limitations in joint hardware-software reasoning by LLMs and supplies a reproducible starting point for measuring progress in agent-driven SoC generation. The explicit success rates, speedup, and utilization metrics are useful for the community.

major comments (1)
  1. The central experimental claim—that only two of five models produce valid prototypes and that even successful designs heavily underutilize hardware—depends on the chosen open-source platform and Virtex-7 VC707 target being a fair proxy for real heterogeneous SoC flows. The manuscript should explicitly discuss how the platform’s peripheral set, repository structure, and tool flow compare to commercial or more complex SoC design scenarios; without this, the observed failure modes and optimality gaps may not generalize.
minor comments (2)
  1. The results section should specify the exact prompt templates, evaluation criteria for “valid SoC prototypes,” and whether multiple runs or statistical tests were performed to support the reported success rates and metrics.
  2. Clarify the precise definition of “additional resource utilization” and how the 23.67% figure is computed relative to the baseline SoC.
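
On the statistical point in minor comment 1, a success rate drawn from five models carries wide uncertainty. The sketch below is one possible way to quantify that, using a Wilson score interval. It treats each evaluated model as an independent trial, which is a simplifying assumption made here for illustration; the function name is hypothetical, and repeated runs per model and application would likely be the more informative unit.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Two of the five evaluated models produced valid prototypes.
low, high = wilson_interval(successes=2, trials=5)
```

With these inputs the interval spans roughly 12% to 77%, so the observed 2/5 rate alone says little about how models would rank on a larger sample.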

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have reviewed the major comment carefully and provide a point-by-point response below. We agree that additional context on the platform will strengthen the paper and plan to incorporate revisions accordingly.

  1. Referee: The central experimental claim—that only two of five models produce valid prototypes and that even successful designs heavily underutilize hardware—depends on the chosen open-source platform and Virtex-7 VC707 target being a fair proxy for real heterogeneous SoC flows. The manuscript should explicitly discuss how the platform’s peripheral set, repository structure, and tool flow compare to commercial or more complex SoC design scenarios; without this, the observed failure modes and optimality gaps may not generalize.

    Authors: We agree that an explicit discussion of the platform's representativeness is valuable for interpreting the results. In the revised manuscript, we will add a dedicated paragraph in Section 3 (or a new subsection) that compares the open-source SoC platform to commercial flows. Specifically, we will describe that the platform targets the AMD Virtex-7 VC707 with a standard set of peripherals (AXI interconnect, DDR3, Ethernet, UART, and GPIO), uses a curated repository structure that mirrors typical open-source SoC repositories to enable agent modifications, and relies on the Xilinx Vivado tool flow for synthesis and implementation. We will note that this FPGA-based setup captures essential aspects of heterogeneous accelerator integration and resource-constrained co-design but does not fully replicate ASIC tape-out complexities, advanced verification suites, or larger-scale commercial SoCs with proprietary IP blocks and multi-die packaging. This addition will clarify the scope of the observed failure modes and utilization gaps while preserving the benchmark's focus on reproducible, agent-driven end-to-end flows. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical benchmark paper with no derivations or fitted predictions


This paper introduces HSCO-Bench, a new empirical benchmark for LLM agents performing end-to-end hardware-software co-design on an open-source SoC platform targeting the AMD Virtex-7 FPGA. The central claims consist of direct experimental results (e.g., only 2/5 models produce valid prototypes, peak speedup 16.22X with max 23.67% additional resource utilization). No mathematical derivations, first-principles predictions, parameter fitting, self-definitional loops, or load-bearing self-citations are present. The work is self-contained as an empirical evaluation against external model performance on the introduced benchmark, with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper contributes a new evaluation framework rather than new physical models or fitted constants; it relies on the suitability of an existing open-source platform and standard FPGA deployment as the test environment.

axioms (1)
  • Domain assumption: The open-source SoC platform and AMD Virtex-7 FPGA target are representative of real heterogeneous SoC design challenges under resource constraints.
    The benchmark's ability to measure meaningful co-design progress depends on this platform capturing the relevant complexities of accelerator integration and kernel mapping.
invented entities (1)
  • HSCO-Bench (no independent evidence)
    purpose: To evaluate LLM agents on the complete end-to-end hardware-software co-design flow for accelerator-rich SoCs
    Newly created benchmark framework introduced in this work without independent external validation beyond the reported experiments.

pith-pipeline@v0.9.0 · 5870 in / 1495 out tokens · 56959 ms · 2026-05-20T02:24:07.847077+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

  1. [1]

    Ce Guo and Tong Zhao

    URL https://deepmind.google/models/model-cards/gemini-3-1-pro/. Ce Guo and Tong Zhao. ResBench: A resource-aware benchmark for llm-generated fpga designs. In Proceedings of the 15th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies, HEART ’25, page 25–34, New York, NY, USA, 2025. Association for Computing Machiner...

  2. [2]

    Andrew G

    doi: 10.1109/ISSCC.2014.6757323. Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66. Kimi Team. Kimi k2.5: Visual...

  3. [3]

    load the bitstream onto the FPGA

  4. [4]

    load the data onto the FPGA

  5. [5]

    Please take a look at the README.md file and finish the job

    execute the program

    B Codebase Structure of Each Testcase

    Each test case is located in a separate directory, and the files are listed in Figure 6.

    test_case/
      <accelerator>_workspace/: accelerator-related folder for each accelerator, including accelerator SystemC code, simulation scripts, and software driver
      data_files/ ...

  6. [6]

    -type f -exec sed -i 's/dummy/myaccel/g; s/DUMMY/MYACCEL/g' {} + ```

    **Copy the workspace and replace `dummy`/`DUMMY` strings inside files** (case-preserving):

    ```bash
    cp -r dummy_workspace myaccel_workspace
    cd myaccel_workspace
    find . -type f -exec sed -i 's/dummy/myaccel/g; s/DUMMY/MYACCEL/g' {} +
    ```

  7. [7]

    $f"); b=$(basename

    **Rename files and directories** whose names contain `dummy` - only substitute the basename, not the full path (use `-depth` so inner paths are renamed before their parents):

    ```bash
    find . -depth -name '*dummy*' | while read f; do
      d=$(dirname "$f"); b=$(basename "$f")
      mv "$f" "$d/$(echo "$b" | sed 's/dummy/myaccel/g')"
    done
    ```

  8. [8]

    200"/device_id=

    **Assign a unique device ID** in both the XML file and the driver config header. Pick a value from this project’s reserved range `0x201`–`0x27F` (the reference `dummy` already occupies `0x200`). See **Device ID Allocation** below for the full convention.

    ```bash
    sed -i 's/device_id="200"/device_id="201"/' \
      myaccel_sysc_catapult/myaccel_sysc_catapult.xml ...
    ```

  9. [9]

    The `load()` and `store()` functions handle DMA data movement

    Modify `src/<accel>.cpp` - implement your kernel in the `compute()` function. The `load()` and `store()` functions handle DMA data movement

  10. [10]

    `A_PLM_IN_WORD`, `B_PLM_IN_WORD`, `O_PLM_OUT_WORD`) and `MEM_SIZE` to match your data dimensions

    Modify `inc/<accel>_specs.hpp` - adjust PLM sizes (e.g. `A_PLM_IN_WORD`, `B_PLM_IN_WORD`, `O_PLM_OUT_WORD`) and `MEM_SIZE` to match your data dimensions

  11. [11]

    If you change the accelerator’s configuration parameters, update `inc/<accel>_conf_info.hpp` **and** the XML `<param>` list (register offsets are derived from the XML)

  12. [12]

    ### Step 4: Write the software driver Adapt the driver (from `dummy_sw_driver/dummy_driver.h`) so that it:

    Update the testbench (`tb/testbench.cpp`, `tb/testbench.hpp`) to match your accelerator’s interface.

    ### Step 4: Write the software driver

    Adapt the driver (from `dummy_sw_driver/dummy_driver.h`) so that it:

  13. [13]

    Probes for the accelerator via `probe(&devs, VENDOR_SLD, DEV_ID, DEV_NAME)`

  14. [14]

    Provides an `exec()` function that:
    - Converts float inputs to fixed-point and writes to `accel_shared_mem`
    - Configures accelerator registers via `iowrite32()`
    - Starts accelerator and polls `STATUS_REG` for completion
    - Converts fixed-point outputs back to float

  15. [15]

    The driver must use the shared memory from `accel_common.h` (do not allocate separate DMA buffers)

    ### Step 5: Integrate into software

  16. [16]

    Add `#include` for your driver in `accel_drivers.h`

  17. [17]

    Add init/cleanup calls in `accel_init_all()` / `accel_cleanup_all()`

  18. [18]

    Modify the workload’s core computation (the file called from `systest.c`) to invoke your accelerator’s `exec()` function for the appropriate operations (guarded by `#ifdef USE_ACCELERATOR`)

  19. [19]

    Enable `USE_ACCELERATOR` in `systest.c`

  20. [20]

    200"` in XML corresponds to `0x200` in C code. **It is NOT decimal.** For example `device_id=

    Register your accelerator(s) in `esp_xilinx-vc707-xc7vx485t_defconfig` (see **SoC Tile Configuration** below)

    ## Key ESP Platform Conventions

    ### SoC Tile Configuration (`esp_defconfig`)

    The ESP SoC is organized as a **NoC (Network-on-Chip) grid** of tiles. Each tile can be a CPU, memory, I/O, or accelerator. The tile layout is defined in `esp_xilinx-vc7...

  21. [21]

    You are **NOT allowed** to access (read or write) files **outside the workspace directory**

  22. [22]

    You must **not use it as-is** - its kernel is an identity pass-through and cannot perform any useful computation

    `dummy_workspace/` is a **reference-only** structural template. You must **not use it as-is** - its kernel is an identity pass-through and cannot perform any useful computation. Create your own accelerator workspace(s) by copying it and redesigning the HLS source, testbench, driver, and configuration to match your design

  23. [23]

    ## Evaluation Criteria

    Each accelerator must have a **unique device ID** in its XML file (`<your_workspace>/<accel_name>/<accel_name>.xml`).

    ## Evaluation Criteria

  24. [24]

    **Execution time**: Minimize total workload execution time (measured in cycles by `rdcycle64()`)

  25. [25]

    **Resource constraints** (VC707 FPGA):
    - Logic Cells: 485,760
    - Block RAM (Kb): 37,080
    - DSP Slices: 2,800

  26. [26]

    The exact check and its threshold are workload-specific - look at what `systest.c` reports at the end of a run

    **Correctness**: The accelerated pipeline must **pass** whatever correctness check `systest.c` performs (e.g., output-vs-gold comparison, top-1 match, MSE threshold, etc.). The exact check and its threshold are workload-specific - look at what `systest.c` reports at the end of a run.

    ## Deliverables

    Your job is done when you have completed all of the following:

  27. [27]

    Created accelerator HLS source(s) (new workspace directories with complete `src/`, `inc/`, `tb/`, and XML)

  28. [28]

    Written corresponding software driver(s) (header files with init/exec/cleanup functions)

  29. [29]

    Integrated drivers into `accel_drivers.h` and modified the workload’s core computation to use your accelerators

  30. [30]

    Updated `esp_xilinx-vc707-xc7vx485t_defconfig` with accelerator tile entries (expanding NoC grid if needed)

  31. [31]

    csrr %0, mcycle

    Enabled `USE_ACCELERATOR` in `systest.c`

    D Performance Measurement

    We measure the cycle count of the application by inserting the following function:

        static inline uint64_t rdcycle64(void) {
            uint64_t val;
            __asm__ __volatile__("csrr %0, mcycle" : "=r"(val));
            return val;
        }

    Listing 1: Cycle measurement on RISC-...

  32. [32]

        /* Accumulate into output */
        for (int i = 0; i < r_tile; i++) {
            for (int j = 0; j < c_tile; j++) {
                output[(r_start + i) * out_features + (c_start + j)] += bert_tiny_vc707_opus_matmul_tile_buf[i * c_tile + j];
            }
        }
        } /* closes an enclosing scope that starts before this excerpt */
        }
        }

    Listing 2: Software-managed tiling logic genera...
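
The device-ID rules quoted in the README excerpts above (hexadecimal `device_id` values in the XML, a reserved range of 0x201–0x27F, 0x200 already taken by `dummy`, and one unique ID per accelerator) lend themselves to a small pre-flight check. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the regular expression assumes the `device_id="..."` attribute shape seen in the quoted `sed` command, and the sample XML fragments are invented for illustration rather than taken from the platform.

```python
import re

RESERVED_RANGE = range(0x201, 0x27F + 1)  # reserved range quoted in the README excerpt

# Assumed attribute shape, based on the quoted sed command: device_id="201"
DEVICE_ID_RE = re.compile(r'device_id="([0-9A-Fa-f]+)"')


def parse_device_id(xml_text: str) -> int:
    """Read the device_id attribute and interpret it as hexadecimal, not decimal."""
    match = DEVICE_ID_RE.search(xml_text)
    if match is None:
        raise ValueError("no device_id attribute found")
    return int(match.group(1), 16)


def check_device_ids(xml_by_accel: dict[str, str]) -> dict[str, int]:
    """Map accelerator name to device ID, rejecting out-of-range or duplicate IDs."""
    ids: dict[str, int] = {}
    for name, xml_text in xml_by_accel.items():
        dev_id = parse_device_id(xml_text)
        if dev_id not in RESERVED_RANGE:
            raise ValueError(f"{name}: 0x{dev_id:X} is outside 0x201-0x27F")
        if dev_id in ids.values():
            raise ValueError(f"{name}: 0x{dev_id:X} is already in use")
        ids[name] = dev_id
    return ids


# Hypothetical XML fragments; real accelerator XML files carry more fields.
ids = check_device_ids({
    "myaccel": '<accelerator name="myaccel" device_id="201">',
    "other": '<accelerator name="other" device_id="27F">',
})
```

Parsing with base 16 is the key design choice here: `device_id="200"` yields 0x200 (512 in decimal), so reusing the `dummy` ID is rejected as out of range rather than silently accepted.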