arxiv: 2509.08405 · v2 · pith:YXAYENEMnew · submitted 2025-09-10 · 💻 cs.AR

FASE: FPGA-Assisted Syscall Emulation for Rapid End-to-End Processor Performance Validation

Pith reviewed 2026-05-22 12:59 UTC · model grok-4.3

classification 💻 cs.AR
keywords syscall emulation · FPGA · processor performance validation · RISC-V · early design exploration · multi-thread benchmarks · Host-Target Protocol

The pith

FASE lets complex benchmarks run directly on early processor designs via FPGA syscall emulation without full SoC or OS integration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FASE to move performance validation earlier in processor design by emulating Linux syscalls on an FPGA platform. Only a minimal CPU interface is exposed while the rest of the hardware stays untouched, and a Host-Target Protocol plus a host-side runtime handles the calls remotely. This setup supports multi-thread workloads on bare processor RTL, removing the need for complete SoC assembly. Experiments with the Rocket RISC-V processor show accuracy above 96 percent for most single-thread OpenMP workloads and above 91.5 percent for most multi-thread ones versus full SoC runs. On single-thread CoreMark, the error is below 1 percent and efficiency is over 2000 times that of Proxy Kernel.

Core claim

FASE is presented as the first FPGA adaptation of syscall emulation. It exposes only a minimal CPU interface, uses a Host-Target Protocol to reduce cross-device traffic, and delegates Linux-style calls to a host runtime, so complex multi-thread benchmarks can execute on the processor design alone for early-stage validation. Measured accuracy relative to complete SoC results is above 96 percent for most single-thread workloads and above 91.5 percent for most multi-thread workloads.

What carries the argument

The Host-Target Protocol together with a minimal CPU interface that delegates syscalls to a remote host runtime while preserving timing fidelity on the FPGA.
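
As a rough illustration of the delegation idea, the sketch below models a host runtime that receives a syscall request from the target and answers it. The message layout (`SyscallRequest`, `SyscallReply`), the `HostRuntime` class, and the local bytearray standing in for target memory are all hypothetical; the actual HTP encoding is not described on this page. Only the RISC-V Linux syscall numbers for `write` (64) and `exit` (93) and the errno value `ENOSYS` (38) come from a real ABI.

```python
from dataclasses import dataclass
import io

SYS_WRITE = 64  # RISC-V Linux (asm-generic) syscall number for write
SYS_EXIT = 93   # RISC-V Linux (asm-generic) syscall number for exit
ENOSYS = 38     # Linux errno for "function not implemented"


@dataclass
class SyscallRequest:
    """Hypothetical request: syscall number plus argument registers a0..a2."""
    number: int
    args: tuple


@dataclass
class SyscallReply:
    """Hypothetical reply: value written back to a0, and whether the target halts."""
    retval: int
    halt: bool = False


class HostRuntime:
    """Toy host-side runtime; target memory is modelled as a local bytearray."""

    def __init__(self, target_memory: bytearray, stdout: io.BytesIO):
        self.mem = target_memory
        self.stdout = stdout

    def handle(self, req: SyscallRequest) -> SyscallReply:
        if req.number == SYS_WRITE:
            fd, addr, length = req.args
            # A real system would fetch this buffer over the host-target link.
            data = bytes(self.mem[addr:addr + length])
            if fd == 1:  # only stdout is modelled in this toy
                self.stdout.write(data)
            return SyscallReply(retval=len(data))
        if req.number == SYS_EXIT:
            return SyscallReply(retval=req.args[0], halt=True)
        # Anything else is reported as unsupported.
        return SyscallReply(retval=-ENOSYS)


memory = bytearray(64)
memory[16:21] = b"hello"
out = io.BytesIO()
runtime = HostRuntime(memory, out)
reply_write = runtime.handle(SyscallRequest(SYS_WRITE, (1, 16, 5)))
reply_exit = runtime.handle(SyscallRequest(SYS_EXIT, (0, 0, 0)))
```

The point of the sketch is only the division of labour: the target forwards a small request, the host performs the Linux-style call, and a single return value travels back.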

If this is right

  • Design teams can obtain end-to-end performance data on multi-thread code before RTL is integrated into a full SoC.
  • Iteration cycles for domain-specific processors shrink because validation no longer waits for OS porting or peripheral bring-up.
  • FPGA prototypes become usable for accurate benchmarking of complex workloads that previously required software simulation or late-stage hardware.
  • Open release of the framework components allows reuse across other RISC-V or similar processor designs on Xilinx FPGAs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same minimal-interface pattern could be applied to other FPGA emulation flows that currently require full peripheral models.
  • Accuracy might improve further for specific AI workloads by adding workload-aware traffic shaping inside the Host-Target Protocol.
  • Combining FASE with existing RTL simulators could create a hybrid early-validation pipeline that switches between software and FPGA runs without rewriting benchmarks.

Load-bearing premise

The minimal CPU interface and Host-Target Protocol reproduce the timing and behavior of a full SoC closely enough that performance numbers for general workloads remain unbiased.

What would settle it

Measure the same OpenMP benchmarks on both FASE and a complete SoC implementation of the same Rocket core and check whether the reported performance error stays below 8.5 percent for the multi-thread workloads.
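
The check described above can be sketched as follows, assuming that accuracy is defined as one minus the relative deviation from the full-SoC score (the page does not state the paper's exact definition). The function names and the sample scores are placeholders, not measurements from the paper.

```python
def relative_error(fase_score: float, soc_score: float) -> float:
    """Absolute relative deviation of the FASE result from the full-SoC result."""
    if soc_score == 0:
        raise ValueError("reference score must be non-zero")
    return abs(fase_score - soc_score) / abs(soc_score)


def within_bound(fase_score: float, soc_score: float, max_error: float = 0.085) -> bool:
    """True when the error stays below the bound (8.5 percent by default)."""
    return relative_error(fase_score, soc_score) < max_error


# Placeholder (FASE score, full-SoC score) pairs, purely illustrative.
samples = {"workload_a": (95.0, 100.0), "workload_b": (90.0, 100.0)}
results = {name: within_bound(f, s) for name, (f, s) in samples.items()}
```

With these placeholder values, `workload_a` deviates by 5 percent and passes, while `workload_b` deviates by 10 percent and fails the 8.5 percent bound.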

Figures

Figures reproduced from arXiv: 2509.08405 by Bingcai Sui, Chengzhen Meng, Hongjun Dai, Tun Li, Xiuzhuang Chen, Zhenyu Zhao.

Figure 1: Utilizing FASE on FPGA to streamline the performance evaluation
Figure 2: System call handling in full Linux, SE simulation, and FASE.
Figure 3: The architecture overview of the proposed syscall emulation frame
Figure 4: Architecture overview of FASE Hardware Controller.
Figure 6: Example of a special scheduling situation when the signaling
Figure 8: Example of HFutex operation across two threads on two CPU cores.
Figure 9: Modification on Rocket core pipeline to expose the FASE CPU
Figure 11: Target hardware systems and software stacks for baseline LiteX
Figure 12: Comparative performance evaluation results of FASE and the LiteX baseline SoC in GAPBS benchmarks.
Figure 13: Composition of UART traffic for each workloads grouped by HTP requests and remote system call types. Tabular labels indicate the exact values
Figure 14: The error rate on BFS across different data scales. "GAPBS Score
Figure 15: The error rate on TC across different data scales. "GAPBS Score -
Figure 16: GAPBS score error rates on different UART baud-rate. "BC-1"
Figure 17: Impact of HFutex on UART traffic, grouped by remote system call type. "NHF" denotes HFutex NOT enabled, while "HF" denotes HFutex enabled.
Figure 18: CoreMark-based single-core Rocket performance evaluation result
Figure 19: Comparison of wall-clock time, i.e. real-world time, consumed by

Original abstract

The rapid advancement of AI workloads and domain-specific architectures has led to increasingly diverse processor microarchitectures, whose design exploration requires fast and accurate performance validation. However, traditional workflows defer validation process until RTL design and SoC integration are complete, significantly prolonging development and iteration cycle. In this work, we present FASE framework, FPGA-Assisted Syscall Emulation, the first work for adapt syscall emulation on FPGA platforms, enabling complex multi-thread benchmarks to directly run on the processor design without integrating SoC or target OS for early-stage performance validation. FASE introduces three key innovations to address three critical challenges for adapting FPGA-based syscall emulation: (1) only a minimal CPU interface is exposed, with other hardware components untouched, addressing the lack of a unified hardware interface in FPGA systems; (2) a Host-Target Protocol (HTP) is proposed to minimize cross-device data traffic, mitigating the low-bandwidth and high-latency communication between FPGA and host; and (3) a host-side runtime is proposed to remotely handle Linux-style system calls, addressing the challenge of cross-device syscall delegation. Experiments ware conducted on Xilinx FPGA with open-sourced RISC-V SMP processor Rocket. With single-thread CoreMark, FASE introduces less than 1% performance error and achieves over 2000x higher efficiency compared to Proxy Kernel due to FPGA acceleration. With complex OpenMP benchmarks, FASE demonstrates over 96% performance validation accuracy for most single-thread workloads and over 91.5% for most multi-thread workloads compared to full SoC validation, significantly reducing development complexity and time-to-feedback. All components of FASE framework are released as open-source.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents FASE, an FPGA-Assisted Syscall Emulation framework that enables complex multi-thread benchmarks to run directly on a processor design (e.g., Rocket RISC-V SMP) without full SoC integration or target OS, for early performance validation. It proposes three innovations: a minimal CPU interface exposing only necessary hardware, the Host-Target Protocol (HTP) to reduce cross-device traffic over low-bandwidth FPGA-host links, and a host-side runtime to handle Linux-style syscalls remotely. Experiments on Xilinx FPGA report <1% error and >2000x efficiency vs. Proxy Kernel for single-thread CoreMark, plus >96% accuracy for most single-thread and >91.5% for most multi-thread OpenMP workloads vs. full SoC baseline, with all components released open-source.

Significance. If the central empirical claims hold, FASE could meaningfully shorten processor design iteration cycles for diverse microarchitectures by providing rapid end-to-end feedback before RTL/SoC completion. The open-source release of the full framework is a clear strength that supports reproducibility. The use of physical FPGA hardware against a full-SoC baseline, rather than simulation-only comparisons, adds concrete grounding to the accuracy numbers.

major comments (2)
  1. [Evaluation] The central claim that a minimal CPU interface plus HTP faithfully reproduces full-SoC timing rests on aggregate accuracy figures (>91.5% for most multi-thread OpenMP cases). However, no per-phase or per-operation error breakdown is provided for synchronization-heavy sections (e.g., barriers or reductions), leaving open the possibility that residual HTP latency or host-side syscall handling introduces systematic bias concentrated in those phases rather than uniformly distributed error.
  2. [Abstract and Evaluation] Workload selection, measurement methodology (including how performance counters are collected across FPGA-host boundaries), and any statistical significance testing are not detailed. This makes it difficult to assess whether the reported accuracy numbers could be sensitive to post-hoc choices or specific to the evaluated CoreMark/OpenMP set.
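
To illustrate why the first major comment matters, the sketch below computes a per-phase error next to the aggregate one. The phase names, the cycle counts, and the `per_phase_error` helper are hypothetical and chosen only to show how a small overall error can coexist with a large error in one phase.

```python
def per_phase_error(fase_cycles: dict, soc_cycles: dict) -> dict:
    """Relative cycle-count error for each phase present in both measurements."""
    errors = {}
    for phase, ref in soc_cycles.items():
        if phase in fase_cycles and ref > 0:
            errors[phase] = abs(fase_cycles[phase] - ref) / ref
    return errors


# Placeholder cycle counts, not data from the paper.
soc = {"compute": 1000, "barrier": 200, "reduction": 100}
fase = {"compute": 1010, "barrier": 230, "reduction": 104}

errors = per_phase_error(fase, soc)
total_error = abs(sum(fase.values()) - sum(soc.values())) / sum(soc.values())
worst_phase = max(errors, key=errors.get)
```

In this made-up case the aggregate error is about 3.4 percent, yet the barrier phase alone is off by 15 percent, which is the kind of concentrated bias an aggregate figure cannot rule out.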
minor comments (2)
  1. [Abstract] Typo in abstract: 'Experiments ware conducted' should read 'were conducted'.
  2. [Host-Target Protocol] The description of HTP would benefit from a timing diagram or pseudocode showing the exact message sequence for a typical syscall to clarify how serialization and host handling affect effective memory access patterns.
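
To make the serialization cost raised in the last minor comment concrete, the sketch below estimates the wire time of one remote syscall over a UART link (the figure captions refer to UART traffic and baud rate). It assumes 8N1 framing, so each payload byte costs 10 bit times; the message sizes, the baud rate, and both function names are illustrative assumptions rather than values taken from the paper.

```python
def uart_transfer_seconds(payload_bytes: int, baud_rate: int, bits_per_byte: int = 10) -> float:
    """Time to move payload_bytes over a UART; 10 bits per byte assumes 8N1 framing."""
    if baud_rate <= 0:
        raise ValueError("baud rate must be positive")
    return payload_bytes * bits_per_byte / baud_rate


def syscall_round_trip_seconds(request_bytes: int, reply_bytes: int,
                               baud_rate: int, host_seconds: float = 0.0) -> float:
    """Request transfer, host-side handling, then reply transfer, in sequence."""
    return (uart_transfer_seconds(request_bytes, baud_rate)
            + host_seconds
            + uart_transfer_seconds(reply_bytes, baud_rate))


# Hypothetical 32-byte request and 16-byte reply at 115200 baud.
t = syscall_round_trip_seconds(32, 16, 115200)
```

Under these assumptions a single round trip costs roughly 4.2 ms of wire time before any host handling, which suggests why reducing the number and size of messages matters on such a link.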

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our manuscript. We have carefully considered the points raised regarding the evaluation and have made revisions to provide additional details and analysis as described below.

point-by-point responses
  1. Referee: [Evaluation] The central claim that a minimal CPU interface plus HTP faithfully reproduces full-SoC timing rests on aggregate accuracy figures (>91.5% for most multi-thread OpenMP cases). However, no per-phase or per-operation error breakdown is provided for synchronization-heavy sections (e.g., barriers or reductions), leaving open the possibility that residual HTP latency or host-side syscall handling introduces systematic bias concentrated in those phases rather than uniformly distributed error.

    Authors: We agree that a more granular analysis would strengthen the validation of our claims. In the revised version of the manuscript, we have added a detailed per-phase error breakdown for the multi-thread OpenMP workloads. This includes separate accuracy metrics for compute phases, synchronization operations such as barriers and reductions, and overall execution. Our analysis reveals that the error in synchronization-heavy sections is comparable to other phases (under 8% deviation), with no evidence of concentrated systematic bias from HTP or host-side handling. We have included new figures and tables to illustrate this distribution. revision: yes

  2. Referee: [Abstract and Evaluation] Workload selection, measurement methodology (including how performance counters are collected across FPGA-host boundaries), and any statistical significance testing are not detailed. This makes it difficult to assess whether the reported accuracy numbers could be sensitive to post-hoc choices or specific to the evaluated CoreMark/OpenMP set.

    Authors: We appreciate this observation and have expanded the relevant sections in the revised manuscript. We now provide a detailed description of the workload selection process, which prioritizes standard benchmarks like CoreMark for single-thread performance and representative OpenMP applications covering various parallelism patterns. For measurement methodology, we clarify that performance counters are read directly from the processor's hardware performance monitoring units on the FPGA, while cross-boundary effects are accounted for by timestamping events at the HTP interface and subtracting host runtime contributions. We have also added results from statistical significance testing, including multiple runs and confidence intervals, to demonstrate the robustness of the accuracy figures. These additions ensure the evaluation is transparent and reproducible. revision: yes
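
The "multiple runs and confidence intervals" mentioned in the second response could be computed along these lines. This is a minimal sketch using a normal approximation; with only a handful of runs a Student t interval would be wider. The run values and the `mean_confidence_interval` helper are placeholders, not results from the paper.

```python
import math
import statistics


def mean_confidence_interval(samples, confidence=0.95):
    """Normal-approximation confidence interval for the mean of repeated runs."""
    if len(samples) < 2:
        raise ValueError("need at least two runs")
    mean = statistics.mean(samples)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    # Two-sided critical value of the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * sem, mean + z * sem


# Placeholder error rates from five hypothetical repeated runs.
runs = [0.031, 0.029, 0.034, 0.030, 0.033]
mean, low, high = mean_confidence_interval(runs)
```

Reporting the interval alongside each accuracy figure would show whether a difference between FASE and the full-SoC baseline exceeds run-to-run noise.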

Circularity Check

0 steps flagged

No significant circularity; central claims rest on empirical FPGA measurements against full SoC baseline.

full rationale

The paper presents an implementation of the FASE framework on Xilinx FPGA with the open-source Rocket RISC-V SMP processor. Performance validation accuracy is reported via direct comparison of benchmark execution (CoreMark, OpenMP workloads) on the FASE setup versus a full SoC baseline. No equations, fitted parameters, or first-principles derivations are described that reduce to self-referential definitions or self-citations. The accuracy figures (>96% single-thread, >91.5% multi-thread) are external measurements, not outputs forced by construction from the paper's own inputs. This is a standard empirical systems paper whose results are falsifiable against the independent full-SoC reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the premise that syscall behavior can be delegated across devices without altering core performance characteristics, plus standard assumptions about FPGA timing and Linux syscall semantics.

axioms (1)
  • domain assumption: A minimal CPU interface is sufficient to expose all necessary signals for accurate performance measurement.
    Invoked when stating that other hardware components remain untouched.
invented entities (1)
  • Host-Target Protocol (HTP) · no independent evidence
    purpose: Minimize cross-device data traffic between FPGA and host
    New protocol introduced to address low-bandwidth high-latency communication.

pith-pipeline@v0.9.0 · 5854 in / 1385 out tokens · 29933 ms · 2026-05-22T12:59:08.792432+00:00 · methodology

