HSCO-Bench: An Agent-Driven End-to-End Hardware-Software Co-design Benchmark for Systems-on-Chip

Kuan-Lin Chiu; Luca P. Carloni; Pei-Huan Tsai; Pin-Yu Chen; William Baisi

arxiv: 2605.19399 · v1 · pith:TDEM7B2Unew · submitted 2026-05-19 · 💻 cs.AR

Pei-Huan Tsai, Kuan-Lin Chiu, William Baisi, Pin-Yu Chen, Luca P. Carloni

Pith reviewed 2026-05-20 02:24 UTC · model grok-4.3

classification 💻 cs.AR

keywords hardware-software co-design · LLM agents · System-on-Chip · FPGA prototyping · accelerator integration · heterogeneous computing · benchmark · end-to-end design

The pith

A new benchmark shows frontier LLMs rarely complete end-to-end hardware-software co-design for heterogeneous SoCs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HSCO-Bench to test whether LLM agents can handle the full hardware-software co-design flow for accelerator-rich Systems-on-Chip. This flow includes analyzing applications to find kernels for acceleration, designing and integrating heterogeneous accelerators into an SoC under resource limits, and mapping software kernels onto those accelerators. The benchmark uses an open-source SoC platform with a structured repository and targets deployment on an AMD Virtex-7 FPGA. Experiments with five frontier models reveal that only two produce valid SoC prototypes, and even those achieve limited resource utilization despite some speedups. This establishes that current models have emerging but incomplete capability for joint hardware and software optimization.

Core claim

HSCO-Bench is the first benchmark that requires LLMs to jointly reason about and modify both software and hardware stacks to generate complete, deployable heterogeneous SoC prototypes. Results show end-to-end integration remains difficult: only two of five evaluated models succeed in producing valid designs on the target FPGA platform, and these designs reach a peak speedup of 16.22X while adding at most 23.67% resource utilization. The work demonstrates that models can identify acceleration opportunities but still heavily underutilize available hardware capacity.
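The two headline metrics can be illustrated with a minimal sketch. It assumes that speedup is the ratio of baseline to accelerated cycle counts and that "additional resource utilization" is the extra usage over the baseline SoC expressed as a percentage of device capacity; the paper's exact definition is not given here, so this is an assumption. The capacity figures are those the benchmark's task description lists for the VC707 FPGA. The `Run` type, its field names, and all sample numbers are hypothetical.

```python
from dataclasses import dataclass

# VC707 capacities as listed in the benchmark's task description.
CAPACITY = {"logic_cells": 485_760, "bram_kb": 37_080, "dsp": 2_800}


@dataclass(frozen=True)
class Run:
    """One measured design point (hypothetical structure, not from the paper)."""
    cycles: int            # total workload execution time in cycles
    used: dict[str, int]   # absolute usage per resource type


def speedup(baseline: Run, accelerated: Run) -> float:
    """Baseline cycles divided by accelerated cycles."""
    return baseline.cycles / accelerated.cycles


def additional_utilization(baseline: Run, accelerated: Run) -> dict[str, float]:
    """Extra usage over the baseline, as a percentage of device capacity.

    Assumed definition; the paper may compute this differently.
    """
    return {
        name: 100.0 * (accelerated.used[name] - baseline.used[name]) / cap
        for name, cap in CAPACITY.items()
    }


def fits(run: Run) -> bool:
    """True if every resource stays within the device capacity."""
    return all(run.used[name] <= cap for name, cap in CAPACITY.items())


# Made-up numbers, chosen only to make the arithmetic easy to follow.
baseline = Run(8_000_000, {"logic_cells": 150_000, "bram_kb": 9_000, "dsp": 100})
accelerated = Run(1_000_000, {"logic_cells": 198_576, "bram_kb": 12_708, "dsp": 380})

print(speedup(baseline, accelerated))
print(additional_utilization(baseline, accelerated))
print(fits(accelerated))
```

Under this reading, a design can report a large speedup while its additional utilization stays low, which is the pattern the core claim describes.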

What carries the argument

HSCO-Bench, an end-to-end benchmark built on an open-source SoC platform with curated repository structure that evaluates LLM agents on generating and deploying accelerator-rich heterogeneous SoC prototypes to an AMD Virtex-7 FPGA.

If this is right

LLM agents must improve at joint hardware-software reasoning to produce usable accelerator-rich SoCs.
Current models identify some acceleration kernels but leave substantial hardware capacity unused.
The benchmark provides a concrete way to measure progress in agent-driven co-design over time.
Design flows that separate hardware and software evaluation miss the integration failures observed here.
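Why kernel identification matters for the reported speedup can be sketched with Amdahl's law, which this summary does not invoke; it is used here only as a standard back-of-the-envelope model. Assuming the 16.22X figure is an end-to-end workload speedup, the sketch computes the smallest fraction of runtime that must have been offloaded for such a speedup to be possible at all. The function names are hypothetical.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of runtime is accelerated by factor s."""
    if not 0.0 <= p <= 1.0 or s <= 0.0:
        raise ValueError("p must be in [0, 1] and s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def min_accelerated_fraction(overall: float) -> float:
    """Smallest p that can reach `overall`, even with an infinitely fast accelerator."""
    if overall < 1.0:
        raise ValueError("overall speedup must be at least 1")
    return 1.0 - 1.0 / overall


# Lower bound on the offloaded runtime fraction for a 16.22X overall speedup.
print(round(min_accelerated_fraction(16.22), 4))
```

Under this model, an overall speedup of 16.22X requires that roughly 94% or more of the original runtime sat in the accelerated kernels, so picking the right kernels is a precondition for the result rather than a detail.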

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the benchmark to additional FPGA targets or ASIC flows would reveal whether the observed limitations are platform-specific.
The gap between achieved and possible resource utilization suggests LLMs need stronger explicit cost or area models in their reasoning.
Successful co-design may require hybrid human-AI loops rather than fully autonomous agents in the near term.

Load-bearing premise

The chosen open-source SoC platform and specific AMD Virtex-7 FPGA target form a representative testbed for real-world end-to-end hardware-software co-design.

What would settle it

A new model that consistently produces valid SoC prototypes achieving near-maximal resource utilization and higher speedups than 16.22X on the same platform and tasks would contradict the reported challenges.

Figures

Figures reproduced from arXiv: 2605.19399 by Kuan-Lin Chiu, Luca P. Carloni, Pei-Huan Tsai, Pin-Yu Chen, William Baisi.

**Figure 2.** HSCO-Bench flow.

**Figure 3.** Evaluation results across 10 applications.

**Figure 4.** Distribution of failure modes across eval…

**Figure 5.** Cost efficiency (η) comparison across applications. While Opus 4.6 provides a steady baseline with fewer failures, GPT-5.4 exhibits superior value-for-money on several successful runs. To evaluate the economic viability of utilizing LLMs for SoC design, we assess the cost efficiency of the two successful models (Opus 4.6 and GPT-5.4). We define a specialized metric, Cost Efficiency (η), calculated as: η …

**Figure 6.** The testcase structure of our proposed benchmark.

Original abstract

Large language models (LLMs) are adopted for software and hardware design, yet these domains are still evaluated separately. Software benchmarks typically assume fixed hardware targets, while hardware benchmarks focus on component-level optimization without considering the full hardware-software stack. Consequently, no existing benchmark evaluates whether an LLM agent can perform end-to-end, system-level hardware-software co-design. Such a process requires: 1) analyzing applications to identify kernels requiring acceleration, 2) designing and integrating heterogeneous accelerators into a System-on-Chip (SoC) under resource constraints, and 3) mapping kernels onto the generated accelerators. We present HSCO-Bench, an end-to-end hardware-software co-design benchmark for accelerator-rich heterogeneous SoC generation. Built upon an open-source SoC platform with a curated repository structure, HSCO-Bench evaluates the ability of LLMs to jointly optimize software and hardware stacks, producing SoC prototypes deployed on the AMD Virtex-7 FPGA VC707 Evaluation Kit. Experimental results show that end-to-end integration remains challenging for current models. Among the five frontier models evaluated, only two of them could successfully generate valid SoC prototypes. Yet, even in these successful instances, the generated designs are far from optimal. While we observe a promising peak speedup of 16.22X, the maximum additional resource utilization reaches only 23.67%. This highlights that while state-of-the-art models demonstrate an emerging capability for hardware acceleration, they still heavily underutilize the available hardware capacity, leaving room for future optimization. To the best of our knowledge, HSCO-Bench is the first benchmark targeting this complete co-design flow, enabling LLMs to jointly reason about and modify both the software and hardware stacks of heterogeneous SoCs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HSCO-Bench is the first integrated benchmark for LLM agents handling full end-to-end SoC hardware-software co-design; early results show that only two of five models produce valid prototypes, and those underuse the available hardware.

Colleague, the main takeaway is that this paper introduces HSCO-Bench as the first benchmark forcing LLMs through the complete co-design loop: kernel identification, accelerator design, SoC integration under constraints, and kernel mapping. Earlier work kept software and hardware evaluations separate, so this joint setup is genuinely new. The authors run five frontier models on an open-source platform targeting the AMD Virtex-7 FPGA and report that only two produce valid prototypes, with a peak 16.22X speedup but just 23.67% additional resource utilization. That concrete gap in optimization is useful to see.

The setup itself is a strength because it uses a curated repository structure and real FPGA deployment, which moves beyond simulation-only scores and gives reproducible prototypes. The paper does a clean job framing why separate benchmarks miss the integration step.

The soft spots are proportionate. The entire evaluation sits on one open-source SoC platform and the Virtex-7 board, so the stress-test concern holds: if that platform has unusually clean documentation or simpler peripherals than typical commercial flows, the failure modes and low utilization numbers may not generalize. The abstract also leaves out prompt engineering details and exact validity criteria, which weakens confidence in the statistical robustness of the results until the methods section is checked. No circular math or fitted parameters appear, just empirical runs on a new testbed.

This work is for researchers building LLM agents for design automation or hardware teams testing AI-assisted co-design flows. A reader focused on systems benchmarks or agentic hardware generation would extract value from the framework and baseline numbers. It deserves a serious referee because the benchmark fills a clear gap and the initial findings are worth detailed scrutiny on methodology and platform choice. I would send it out for review rather than desk reject.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces HSCO-Bench, the first benchmark for evaluating LLM agents on complete end-to-end hardware-software co-design of accelerator-rich heterogeneous SoCs. Built on a curated open-source SoC platform targeting the AMD Virtex-7 VC707 FPGA, the benchmark requires agents to identify acceleration kernels, design and integrate heterogeneous accelerators under resource constraints, and map software kernels. Experiments on five frontier models show only two produce valid SoC prototypes; even these achieve a peak speedup of 16.22X but only 23.67% additional resource utilization, indicating substantial underutilization of hardware capacity.

Significance. If the benchmark and testbed prove representative, the work provides concrete evidence of current limitations in joint hardware-software reasoning by LLMs and supplies a reproducible starting point for measuring progress in agent-driven SoC generation. The explicit success rates, speedup, and utilization metrics are useful for the community.

major comments (1)

The central experimental claim—that only two of five models produce valid prototypes and that even successful designs heavily underutilize hardware—depends on the chosen open-source platform and Virtex-7 VC707 target being a fair proxy for real heterogeneous SoC flows. The manuscript should explicitly discuss how the platform’s peripheral set, repository structure, and tool flow compare to commercial or more complex SoC design scenarios; without this, the observed failure modes and optimality gaps may not generalize.

minor comments (2)

The results section should specify the exact prompt templates, evaluation criteria for “valid SoC prototypes,” and whether multiple runs or statistical tests were performed to support the reported success rates and metrics.
Clarify the precise definition of “additional resource utilization” and how the 23.67% figure is computed relative to the baseline SoC.
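The concern about statistical support can be made concrete with a small sketch. It computes a Wilson score interval for a success rate, here applied to 2 successes out of 5 while treating each model as one independent trial purely for illustration; the paper's actual run counts and any statistical treatment are not known from this summary. The function name is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half, center + half


low, high = wilson_interval(2, 5)
print(f"{low:.3f} to {high:.3f}")
```

With only five trials the interval spans roughly 0.12 to 0.77, which is why repeated runs per model would be needed before the reported success rate can be read as more than a point observation.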

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have reviewed the major comment carefully and provide a point-by-point response below. We agree that additional context on the platform will strengthen the paper and plan to incorporate revisions accordingly.

Referee: The central experimental claim—that only two of five models produce valid prototypes and that even successful designs heavily underutilize hardware—depends on the chosen open-source platform and Virtex-7 VC707 target being a fair proxy for real heterogeneous SoC flows. The manuscript should explicitly discuss how the platform’s peripheral set, repository structure, and tool flow compare to commercial or more complex SoC design scenarios; without this, the observed failure modes and optimality gaps may not generalize.

Authors: We agree that an explicit discussion of the platform's representativeness is valuable for interpreting the results. In the revised manuscript, we will add a dedicated paragraph in Section 3 (or a new subsection) that compares the open-source SoC platform to commercial flows. Specifically, we will describe that the platform targets the AMD Virtex-7 VC707 with a standard set of peripherals (AXI interconnect, DDR3, Ethernet, UART, and GPIO), uses a curated repository structure that mirrors typical open-source SoC repositories to enable agent modifications, and relies on the Xilinx Vivado tool flow for synthesis and implementation. We will note that this FPGA-based setup captures essential aspects of heterogeneous accelerator integration and resource-constrained co-design but does not fully replicate ASIC tape-out complexities, advanced verification suites, or larger-scale commercial SoCs with proprietary IP blocks and multi-die packaging. This addition will clarify the scope of the observed failure modes and utilization gaps while preserving the benchmark's focus on reproducible, agent-driven end-to-end flows.

revision: yes

Circularity Check

0 steps flagged

No circularity in empirical benchmark paper with no derivations or fitted predictions

This paper introduces HSCO-Bench, a new empirical benchmark for LLM agents performing end-to-end hardware-software co-design on an open-source SoC platform targeting the AMD Virtex-7 FPGA. The central claims consist of direct experimental results (e.g., only 2/5 models produce valid prototypes, peak speedup 16.22X with max 23.67% additional resource utilization). No mathematical derivations, first-principles predictions, parameter fitting, self-definitional loops, or load-bearing self-citations are present. The work is self-contained as an empirical evaluation against external model performance on the introduced benchmark, with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper contributes a new evaluation framework rather than new physical models or fitted constants; it relies on the suitability of an existing open-source platform and standard FPGA deployment as the test environment.

axioms (1)

domain assumption: The open-source SoC platform and AMD Virtex-7 FPGA target are representative of real heterogeneous SoC design challenges under resource constraints.
The benchmark's ability to measure meaningful co-design progress depends on this platform capturing the relevant complexities of accelerator integration and kernel mapping.

invented entities (1)

HSCO-Bench (no independent evidence)
purpose: To evaluate LLM agents on the complete end-to-end hardware-software co-design flow for accelerator-rich SoCs
Newly created benchmark framework introduced in this work without independent external validation beyond the reported experiments.

pith-pipeline@v0.9.0 · 5870 in / 1495 out tokens · 56959 ms · 2026-05-20T02:24:07.847077+00:00 · methodology

