arxiv: 2511.22267 · v2 · submitted 2025-11-27 · 💻 cs.AR

Aquas: Enhancing Domain Specialization through Holistic Hardware-Software Co-Optimization based on MLIR

Pith reviewed 2026-05-17 05:04 UTC · model grok-4.3

classification 💻 cs.AR
keywords ASIP · RISC-V · hardware-software co-design · MLIR · custom instructions · domain acceleration · memory interface modeling · e-graph compiler

The pith

Aquas offers an MLIR-based co-design framework that models memory interfaces with cache awareness and uses e-graph compilation to automate custom instruction offloading for RISC-V ASIPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing ASIP design flows struggle with memory bottlenecks and complex custom instructions, which limits automation in emerging domains. Aquas addresses this through a joint memory interface model and an e-graph-based retargetable compiler that maps and offloads instructions. If the approach holds, it would enable substantial performance gains across applications while keeping hardware costs low and frequency intact. A sympathetic reader would care because this reduces reliance on manual tuning and fixed extension interfaces, making domain specialization more practical as applications grow in complexity.

Core claim

Aquas proposes a memory interface model that jointly considers interface characteristics and cache effects, along with an interface-aware synthesis flow that optimizes the input specification and generates efficient hardware. It also introduces an e-graph-based retargetable compiler with a novel matching engine for robust instruction mapping and offloading. Case studies in four domains demonstrate up to 15.61x speedup with 14.5 percent area overhead and zero frequency degradation, remaining competitive against stronger general-purpose cores and vector extensions.

What carries the argument

The memory interface model that accounts for both interface traits and cache effects, paired with an e-graph-based retargetable compiler featuring a novel matching engine for automated instruction offloading.

If this is right

  • Memory access can be optimized progressively during synthesis rather than treated as an afterthought.
  • Custom instructions with non-trivial control and memory behavior become viable for automated offloading.
  • Domain accelerators achieve high speedups while using less area than more powerful general cores.
  • The framework supports multiple diverse applications through a single retargetable compilation approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This style of co-optimization might shorten the iteration cycle between hardware generation and software adaptation in RISC-V ecosystems.
  • If the matching engine proves robust, similar e-graph techniques could apply to other compiler targets beyond MLIR dialects.
  • The low area overhead suggests the approach could scale to systems with multiple specialized extensions without compounding hardware costs.

Load-bearing premise

The memory interface model and e-graph compiler will generalize to new domains without needing extensive manual tuning or post-hoc fixes.

What would settle it

Applying Aquas to a fifth domain with highly irregular memory patterns and complex control logic, then measuring whether speedups remain above 5x without developer adjustments to the models or engine.

Figures

Figures reproduced from arXiv: 2511.22267 by Chenyun Yin, Renze Chen, Ruifan Xu, Yansong Xu, Yitian Sun, Youwei Xiao, Yuhao Luo, Yun Liang, Yuyang Zou.

Figure 1: Overview of the unified toolchain in Aquas.
… to build the cross-level flow for both the hardware synthesis and compiler support. Specifically, Aquas introduces the aquas dialect, which extends existing features of [25] with operations for fast memory access and optimization directives (Section 4.1). At the microarchitecture level, Aquas introduces a burst DMA engine based on TileLink-UH for data transfers …
Figure 2: Synthesis flow of gemv using Aquas. It consists of (a) CADL input, (b) MLIR parsed from CADL including aquas dialect, and (c) synthesized hardware including DMA engine, scratchpad memory, and main execution pipeline.
Optimization Directives. To exploit data-level parallelism enabled by efficient memory access, we introduce optimization directives to guide hardware synthesis. Available directives include …
Figure 3: End-to-end workflow of the Aquas retargetable compiler. ❶ to ❽ correspond to the steps for compiling an application.
… expressions, which blocks MLIR's loop transformations. An internal rewrite first canonicalizes the index computation (i << 2 ⇒ i * 4). Subsequently, the compiler leverages a cost model to extract a program variant that avoids non-affine (e.g., "<<") access within loops, enabling more aggressive aff…
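The canonicalization described in the Figure 3 excerpt can be pictured with a toy rewrite-and-extract pass. The sketch below is not the Aquas compiler and not a real e-graph: it simply enumerates equivalent forms of an index expression under one rewrite rule (shift-left by a constant becomes a multiplication) and picks the cheapest form under a made-up cost model that penalizes non-affine operators. The names `variants`, `cost`, and `extract`, the tuple encoding, and the cost values are all assumptions for illustration.

```python
# Toy illustration only: expressions are nested tuples, e.g. ("shl", "i", 2).
# A real e-graph shares equivalent subterms; here we just enumerate variants.

def variants(expr):
    """Return every form of `expr` reachable with the shl -> mul rule."""
    if not isinstance(expr, tuple):
        return {expr}
    op, lhs, rhs = expr
    out = set()
    for l in variants(lhs):
        for r in variants(rhs):
            out.add((op, l, r))
            # Rewrite rule: x << c  ==>  x * 2**c, for a constant shift c.
            if op == "shl" and isinstance(r, int):
                out.add(("mul", l, 2 ** r))
    return out

# Hypothetical cost model: add/mul count as affine-friendly and cheap,
# any other operator (here: shl) is treated as non-affine and expensive.
AFFINE_COST = {"add": 1, "mul": 1}
NON_AFFINE_COST = 100

def cost(expr):
    """Sum operator costs over the whole expression tree."""
    if not isinstance(expr, tuple):
        return 0
    op, lhs, rhs = expr
    return AFFINE_COST.get(op, NON_AFFINE_COST) + cost(lhs) + cost(rhs)

def extract(expr):
    """Pick the cheapest equivalent form under the cost model."""
    return min(variants(expr), key=cost)

index = ("add", "base", ("shl", "i", 2))
best = extract(index)
print(best)
```

Under these assumed costs the shift form is rejected in favor of the multiplication form, which mirrors the excerpt's point that extraction can steer the program toward a variant that later affine loop transformations can handle.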
Original abstract

Application-Specific Instruction-Set Processors (ASIPs) built on the RISC-V architecture offer specialization opportunities for various applications. Existing frameworks are largely designed around fixed instruction extension interfaces and rely on manual software adaptation. However, as emerging domains scale up in complexity, two major challenges arise. First, memory access remains a primary bottleneck as existing design flows lack architectural awareness of memory interfaces, leading to suboptimal interface selection and orchestration. Second, the semantic complexity of custom instruction extensions, characterized by non-trivial control logic and irregular memory behaviors, hinders the ability of conventional compilers to perform automated and comprehensive offloading. We present Aquas, a holistic hardware-software co-design framework built upon MLIR. Aquas proposes a memory interface model that jointly considers interface characteristics and cache effects, along with an interface-aware synthesis flow guided by this model that progressively optimizes the input specification and generates efficient hardware implementations. We also propose an e-graph-based retargetable compiler approach with a novel matching engine for efficient instruction mapping and offloading, enabling robust and effective utilization of custom instruction capabilities. Case studies across four diverse domains show that Aquas delivers substantial acceleration, achieving up to 15.61x speedup with 14.5% area overhead and zero frequency degradation, proving highly competitive in domain acceleration against more powerful general-purpose cores and vector extensions.
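To make the idea of instruction mapping and offloading concrete, here is a minimal sketch of pattern-based offloading. It uses plain syntactic tree matching, not the e-graph matching engine the abstract describes, so it will miss equivalent forms (for example, commuted operands) that an e-graph approach is meant to catch. The multiply-accumulate pattern, the instruction name `custom.mac`, and the helper names `match` and `offload` are hypothetical.

```python
# Toy pattern matcher: patterns and expressions are nested tuples.
# Strings starting with "?" are pattern variables.

def match(pattern, expr, bindings=None):
    """Return variable bindings if `expr` matches `pattern`, else None."""
    bindings = dict(bindings or {})
    if isinstance(pattern, str) and pattern.startswith("?"):
        # A variable must bind consistently everywhere it appears.
        if pattern in bindings and bindings[pattern] != expr:
            return None
        bindings[pattern] = expr
        return bindings
    if isinstance(pattern, tuple) and isinstance(expr, tuple):
        if len(pattern) != len(expr) or pattern[0] != expr[0]:
            return None
        for p, e in zip(pattern[1:], expr[1:]):
            bindings = match(p, e, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == expr else None

def offload(expr, pattern, instr_name, operands):
    """Rewrite bottom-up, replacing each match with a custom-instruction node."""
    if isinstance(expr, tuple):
        expr = (expr[0],) + tuple(
            offload(e, pattern, instr_name, operands) for e in expr[1:]
        )
    found = match(pattern, expr)
    if found is None:
        return expr
    return (instr_name,) + tuple(found[v] for v in operands)

# Hypothetical custom instruction: acc + a * b
MAC = ("add", "?acc", ("mul", "?a", "?b"))
dot = ("add", ("add", "c", ("mul", "x0", "w0")), ("mul", "x1", "w1"))
result = offload(dot, MAC, "custom.mac", ("?acc", "?a", "?b"))
print(result)
```

Both additions in the example collapse into nested `custom.mac` nodes. Writing the second product on the left of the addition would defeat this matcher, which is the kind of brittleness that motivates matching over equivalence classes instead of a single syntax tree.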

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Aquas, an MLIR-based holistic hardware-software co-design framework for RISC-V ASIPs. It introduces a memory interface model that jointly accounts for interface characteristics and cache effects, an interface-aware synthesis flow for hardware generation, and an e-graph-based retargetable compiler with a novel matching engine for automated instruction offloading. Case studies in four domains report up to 15.61x speedup, 14.5% area overhead, and zero frequency degradation, claiming competitiveness versus general-purpose cores and vector extensions.

Significance. If the central performance claims are substantiated by detailed ablations and reproducible methodology, the work could advance automated domain specialization by unifying memory-aware synthesis with compiler retargeting in a single MLIR infrastructure. The e-graph matching approach for irregular custom instructions represents a technically interesting direction that could reduce manual effort in ASIP flows.

major comments (2)
  1. [Evaluation / Case Studies] Evaluation section (case studies): the headline results (up to 15.61x speedup, 14.5% area, zero frequency loss) are presented without ablations that isolate the contribution of the joint memory-interface/cache model versus the novel e-graph matcher, nor any count of manually added patterns per domain. This leaves open whether the reported gains derive from the automated co-optimization framework or from domain-specific manual tuning in the synthesis and matching rules.
  2. [Memory Interface Model] Memory interface model description: the claim that the model 'jointly considers interface characteristics and cache effects' is central to addressing the stated memory bottleneck, yet the manuscript provides no quantitative comparison against prior interface-only models or sensitivity analysis on cache-effect parameters, making it impossible to verify that the model itself drives the observed interface selection improvements.
minor comments (2)
  1. [Abstract / Introduction] The abstract and introduction use 'zero frequency degradation' without specifying the synthesis tool, target process node, or timing constraints under which this holds; add a sentence clarifying the experimental setup.
  2. [Compiler Approach] Notation for the e-graph matching engine (e.g., cost functions or rewrite rules) is introduced without a compact summary table; a small table listing the novel matching heuristics would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and describe the revisions we will incorporate to improve the manuscript.

point-by-point responses
  1. Referee: [Evaluation / Case Studies] Evaluation section (case studies): the headline results (up to 15.61x speedup, 14.5% area, zero frequency loss) are presented without ablations that isolate the contribution of the joint memory-interface/cache model versus the novel e-graph matcher, nor any count of manually added patterns per domain. This leaves open whether the reported gains derive from the automated co-optimization framework or from domain-specific manual tuning in the synthesis and matching rules.

    Authors: We agree that explicit ablations would strengthen the evaluation by isolating component contributions. In the revised manuscript we will add ablations that separately disable the joint memory-interface/cache model and the e-graph matcher to quantify their individual effects on the reported speedups. We will also include a table reporting the number of manually added patterns per domain; these are limited to a small set of domain-specific edge cases, as the majority of instruction patterns are automatically discovered and matched by the e-graph engine. revision: yes

  2. Referee: [Memory Interface Model] Memory interface model description: the claim that the model 'jointly considers interface characteristics and cache effects' is central to addressing the stated memory bottleneck, yet the manuscript provides no quantitative comparison against prior interface-only models or sensitivity analysis on cache-effect parameters, making it impossible to verify that the model itself drives the observed interface selection improvements.

    Authors: We acknowledge the value of direct quantitative validation. The revised version will include a new subsection with comparisons of the joint model against prior interface-only models on the same benchmarks, plus sensitivity analysis varying cache-effect parameters (e.g., hit rates and latency multipliers) to show their influence on interface selection and overall performance. revision: yes
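The sensitivity analysis promised in this response can be pictured with a toy model. The sketch below is not the paper's memory interface model, whose equations do not appear on this page. It assumes two hypothetical interfaces: a cached load/store port whose average per-word latency depends on the hit rate, and a burst DMA path with a fixed setup cost plus a per-word streaming cost. Every class name, function name, and parameter value is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class CachedPort:
    hit_latency: float   # cycles per word on a cache hit (assumed value)
    miss_latency: float  # cycles per word on a cache miss (assumed value)

    def cycles(self, words: int, hit_rate: float) -> float:
        # Average per-word latency weighted by the hit rate.
        per_word = hit_rate * self.hit_latency + (1.0 - hit_rate) * self.miss_latency
        return words * per_word

@dataclass
class BurstDMA:
    setup: float     # fixed cycles to start a transfer (assumed value)
    per_word: float  # cycles per word once streaming (assumed value)

    def cycles(self, words: int) -> float:
        return self.setup + words * self.per_word

def select_interface(words, hit_rate, port, dma):
    """Pick the interface with the lower estimated transfer time."""
    port_cycles = port.cycles(words, hit_rate)
    dma_cycles = dma.cycles(words)
    if port_cycles <= dma_cycles:
        return ("cached_port", port_cycles)
    return ("burst_dma", dma_cycles)

port = CachedPort(hit_latency=1.0, miss_latency=20.0)
dma = BurstDMA(setup=50.0, per_word=1.0)

# Sweep the hit rate to see where the preferred interface flips.
for hit_rate in (0.99, 0.9, 0.5):
    choice, cycles = select_interface(256, hit_rate, port, dma)
    print(f"hit_rate={hit_rate:.2f} -> {choice} ({cycles:.1f} cycles)")
```

With these made-up numbers the cached port wins only at the highest hit rate, and the burst DMA path wins once the hit rate drops to 0.9. A sweep of this shape is one way to show whether interface selection is driven by the cache term or by the interface term alone.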

Circularity Check

0 steps flagged

No significant circularity; framework and results are independent of each other.

full rationale

The paper describes a hardware-software co-design framework using MLIR, a memory interface model, and an e-graph retargetable compiler, then reports empirical speedups from case studies on four domains. No equations, fitted parameters, or self-citations are shown that would make the performance numbers reduce to the inputs by construction. The claimed acceleration is presented as an outcome of applying the described components rather than a tautological renaming or load-bearing self-reference. The derivation chain for the models and compiler remains self-contained against the external benchmarks of the evaluated domains.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the memory interface model and e-graph engine are described at high level without stated assumptions or new postulated components.

pith-pipeline@v0.9.0 · 5566 in / 1178 out tokens · 26305 ms · 2026-05-17T05:04:15.795285+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

  1. [1]

    AMD. 2025. AMD Vitis™ HLS. https://www.amd.com/en/products/software/adaptive-socs-and-fpgas/vitis/vitis-hls.html

  2. [2]

    Giorgos Armeniakos, Alexis Maras, Sotirios Xydis, and Dimitrios Soudris. 2025. Mixed-precision Neural Networks on RISC-V Cores: ISA extensions for Multi-Pumped Soft SIMD Operations. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design. Association for Computing Machinery, New York, NY, USA, 1–9. https://doi.org/10.1145/367...

  3. [3]

    Krste Asanović, Rimas Avižienis, and Jonathan Bachrach. 2016. The Rocket Chip Generator. Technical Report UCB/EECS-2016-17. Berkeley, CA. http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-17.html

  4. [4]

    Yaohui Cai, Kaixin Yang, Chenhui Deng, Cunxi Yu, and Zhiru Zhang. 2025. SmoothE: Differentiable E-Graph Extraction. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (ASPLOS ’25). Association for Computing Machinery, New York, NY, USA, 1020–1034. doi:10.1145/3669940.3707262

  5. [5]

    Hao Cheng, Georgios Fotiadis, Johann Großschädl, Daniel Page, Thinh H. Pham, and Peter Y. A. Ryan. 2024. RISC-V Instruction Set Extensions for Multi-Precision Integer Arithmetic: A Case Study on Post-Quantum Key Exchange Using CSIDH-

  6. [6]

    In Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC ’24). Association for Computing Machinery, New York, NY, USA, 1–6. doi:10.1145/3649329.3657347

  7. [7]

    Codasip. 2025. CodAL RISC-V architecture description language - Codasip. https://codasip.com/products/codasip-studio/codal/

  8. [8]

    Schuyler Eldridge, Prithayan Barua, Aliaksei Chapyzhenka, Adam Izraelevitz, Jack Koenig, Chris Lattner, Andrew Lenharth, George Leontiev, Fabian Schuiki, Ram Sunder, Andrew Young, and Richard Xia. 2021. MLIR as Hardware Compiler Infrastructure. In Workshop on Open-Source EDA Technology (WOSET)

  9. [9]

    Gerasimos Gerogiannis, Stijn Eyerman, Evangelos Georganas, Wim Heirman, and Josep Torrellas. 2025. DECA: A Near-Core LLM Decompression Accelerator Grounded on a 3D Roofline Model. In Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO ’25). Association for Computing Machinery, New York, NY, USA, 184–200. doi:10.1145/37258...

  10. [10]

    James C. Hoe and Arvind. 2000. Synthesis of operation-centric hardware descriptions. In Proceedings of the 2000 IEEE/ACM international conference on Computer-aided design (ICCAD ’00). IEEE Press, San Jose, California, 511–519

  11. [11]

    Lana Josipović. 2025. dynamatic: DHLS (Dynamic High-Level Synthesis) compiler based on MLIR. https://github.com/EPFL-LAP/dynamatic

  12. [12]

    Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2021. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 2–14. doi:10.1109/CGO51591.2021.9370308

  13. [13]

    Huimin Li, Nele Mentens, and Stjepan Picek. 2022. A scalable SIMD RISC-V based processor with customized vector extensions for CRYSTALS-kyber. In Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC ’22). Association for Computing Machinery, New York, NY, USA, 733–738. doi:10.1145/3489517.3530552

  14. [14]

    Jules Merckx, Alexandre Lopoukhine, Samuel Coward, Jianyi Cheng, Bjorn De Sutter, and Tobias Grosser. 2025. eqsat: An Equality Saturation Dialect for Non-destructive Rewriting. doi:10.48550/arXiv.2505.09363 arXiv:2505.09363 [cs]

  15. [15]

    Julian Oppermann, Brindusa Mihaela Damian-Kosterhon, Florian Meisel, Tammo Mürmann, Eyck Jentzsch, and Andreas Koch. 2024. Longnail: High-Level Synthesis of Portable Custom Instruction Set Extensions for RISC-V Processors from Descriptions in the Open-Source CoreDSL Language. In Proceedings of the 29th ACM International Conference on Architectural Suppor...

  16. [16]

    Tianwei Pan, Tianao Dai, Jianlei Yang, Hongbin Jing, Yang Su, Zeyu Hao, Xiaotao Jia, Chunming Hu, and Weisheng Zhao. 2025. Finesse: An Agile Design Framework for Pairing-based Cryptography via Software/Hardware Co-Design. In Proceedings of the 52nd Annual International Symposium on Computer Architecture. ACM, Tokyo Japan, 65–77. doi:10.1145/3695053.3731033

  17. [17]

    Radu Bogdan Rusu and Steve Cousins. 2011. 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation (ICRA). IEEE, Shanghai, China

  18. [18]

    Paul Scheffler, Luca Colagrande, and Luca Benini. 2024. SARIS: Accelerating Stencil Computations on Energy-Efficient RISC-V Compute Clusters with Indirect Stream Registers. In Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC ’24). Association for Computing Machinery, New York, NY, USA, 1–6. doi:10.1145/3649329.3658494

  19. [19]

    Synopsys, Inc. 2025. ASIP Designer. https://www.synopsys.com/dw/ipdir.php?ds=asip-designer

  20. [20]

    Blaise Tine, Varun Saxena, Santosh Srivatsan, Joshua R. Simpson, Fadi Alzammar, Liam Cooper, and Hyesoon Kim. 2023. Skybox: Open-Source Graphic Rendering on Programmable RISC-V GPUs. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS 2023). Association for Compu...

  21. [21]

    Blaise Tine, Krishna Praveen Yalamarthy, Fares Elsabbagh, and Kim Hyesoon

  22. [22]

    Vortex: Extending the RISC-V ISA for GPGPU and 3D-Graphics. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’21). Association for Computing Machinery, New York, NY, USA, 754–766. doi:10.1145/3466752.3480128

  23. [23]

    Johan Van Praet, Dirk Lanneer, Werner Geurts, and Gert Goossens. 2008. Chapter 4 - nML: A Structural Processor Modeling Language for Retargetable Compilation and ASIP Design. In Processor Description Languages. Systems on Silicon, Vol. 1. Morgan Kaufmann, Burlington, 65–93. doi:10.1016/B978-012374287-2.50007-0

  24. [24]

    Veripool. 2025. Veripool. https://www.veripool.org/verilator/

  25. [25]

    Andrew Waterman, Yunsup Lee, David A. Patterson, and Krste Asanović. 2014. The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Version 2.0. Technical Report UCB/EECS-2014-54. EECS Department, University of California, Berkeley, Berkeley, CA. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-54.html

  26. [26]

    Max Willsey, Chandrakana Nandi, Yisu Remy Wang, Oliver Flatt, Zachary Tatlock, and Pavel Panchekha. 2021. egg: Fast and extensible equality saturation. Artifact for "Fast and Extensible Equality Saturation" 5, POPL (Jan. 2021), 23:1–23:29. doi:10.1145/3434304

  27. [27]

    Youwei Xiao, Yuyang Zou, Yansong Xu, Yuhao Luo, Yitian Sun, Chenyun Yin, Ruifan Xu, Renze Chen, and Yun Liang. 2025. APS: Open-Source Hardware-Software Co-Design Framework for Agile Processor Specialization. In 2025 IEEE/ACM International Conference On Computer Aided Design (ICCAD). Munich, Germany

  28. [28]

    Ruifan Xu, Youwei Xiao, Jin Luo, and Yun Liang. 2022. HECTOR: A Multi-level Intermediate Representation for Hardware Synthesis Methodologies. In 2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD). San Diego, California, USA, 1–9. https://ieeexplore.ieee.org/document/10068908 ISSN: 1558-2434

  29. [29]

    En-Yu Yang, Tianyu Jia, David Brooks, and Gu-Yeon Wei. 2021. FlexACC: A Programmable Accelerator with Application-Specific ISA for Flexible Deep Neural Network Inference. In 2021 IEEE 32nd International Conference on Application-specific Systems, Architectures and Processors (ASAP). 266–273. doi:10.1109/ASAP52443.2021.00046 ISSN: 2160-052X

  30. [30]

    Hanchen Ye, HyeGang Jun, Hyunmin Jeong, Stephen Neuendorffer, and Deming Chen. 2022. ScaleHLS: a scalable high-level synthesis framework with multi-level transformations and optimizations: invited. In Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC ’22). Association for Computing Machinery, New York, NY, USA, 1355–1358. doi:10.1145/34895...

  31. [31]

    Jiaqi Yin, Zhan Song, Chen Chen, Yaohui Cai, Zhiru Zhang, and Cunxi Yu. 2025. e-boost: Boosted E-Graph Extraction with Adaptive Heuristics and Exact Solving. In 2025 IEEE/ACM International Conference On Computer Aided Design (ICCAD). Munich, Germany

  32. [32]

    Abd-El-Aziz Zayed and Christophe Dubach. 2025. DialEgg: Dialect-Agnostic MLIR Optimizer using Equality Saturation with Egglog. In Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization (CGO ’25). Association for Computing Machinery, New York, NY, USA, 271–283. doi:10.1145/3696443.3708957

  33. [33]

    Yihong Zhang, Yisu Remy Wang, Oliver Flatt, David Cao, Philip Zucker, Eli Rosenthal, Zachary Tatlock, and Max Willsey. 2023. Better Together: Unifying Datalog and Equality Saturation. Ar...

  34. [34]

    Jerry Zhao, Ben Korpan, Abraham Gonzalez, and Krste Asanovic. 2020. SonicBOOM: The 3rd Generation Berkeley Out-of-Order Machine. In The Fourth Workshop on RISC-V for Computer Architecture Research (CARRV). 8