pith. sign in

arxiv: 2509.18355 · v5 · submitted 2025-09-22 · 💻 cs.AR · cs.AI

Chiplet-Based RISC-V SoC with Modular AI Acceleration

Pith reviewed 2026-05-18 14:03 UTC · model grok-4.3

classification 💻 cs.AR cs.AI
keywords chiplet-based SoCRISC-V architecturemodular AI accelerationedge AI devicesenergy efficiencydynamic voltage scalingUCIe interconnectload migration
0
0 comments X

The pith

A chiplet-based RISC-V SoC integrates four optimizations to deliver 40.1 percent efficiency gains for real-time edge AI inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes a modular chiplet architecture for edge AI devices built around a RISC-V CPU and dedicated AI accelerators. It replaces large monolithic chips, which suffer from low yields at advanced process nodes, with smaller chiplets assembled on a silicon interposer. The design adds adaptive voltage and frequency scaling across chiplets, extensions to the UCIe interconnect for AI traffic, distributed security, and sensor-based task migration. Benchmarks show these changes reduce latency by roughly 15 percent, raise throughput by 17 percent, and lower power by 16 percent relative to simpler chiplet setups, yielding a 40 percent efficiency improvement at about 3.5 millijoules per inference while staying under five milliseconds of latency. If the gains hold, modular hardware could offer comparable density to single-chip solutions along with easier scaling and lower manufacturing risk for future edge AI products.

Core claim

The proposed chiplet-based RISC-V SoC integrates a 7nm CPU chiplet with dual 5nm AI accelerators delivering 15 TOPS INT8 each, 16GB HBM3 memory, and power controllers on a 30mm by 30mm interposer. Adaptive cross-chiplet DVFS, AI-aware UCIe extensions with streaming flow control and compression, distributed cryptographic security, and sensor-driven load migration together produce a 14.7 percent latency reduction, 17.3 percent throughput increase, and 16.2 percent power reduction over prior basic chiplet designs. These combine into a 40.1 percent efficiency gain reaching approximately 3.5 mJ per MobileNetV2 inference at 860 mW and 244 images per second while preserving sub-5ms real-time speed.

What carries the argument

Four coordinated optimizations—adaptive cross-chiplet Dynamic Voltage and Frequency Scaling, AI-aware UCIe protocol extensions, distributed cryptographic security, and sensor-driven load migration—that jointly manage performance, power, and security across heterogeneous chiplets on the interposer.

If this is right

  • Modular chiplet designs reach near-monolithic computational density for AI workloads.
  • The architecture supports cost efficiency, scalability, and future upgradeability in edge AI devices.
  • Real-time capability stays under 5 ms across the tested benchmarks including MobileNetV2 and ResNet-50.
  • Energy use falls to roughly 3.5 mJ per inference at 860 mW while sustaining 244 images per second.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Using smaller chiplets could raise manufacturing yields and cut overall production costs compared with large monolithic SoCs.
  • The same cross-chiplet coordination techniques might transfer to other processor families or larger AI models.
  • Physical prototypes would likely expose interactions among the four optimizations that simulations miss.
  • Pairing the design with newer memory or interconnect generations could produce further efficiency steps.

Load-bearing premise

The four optimizations can be added in hardware without meaningful extra area, complexity, or power cost, and the chosen industry benchmarks capture real deployment conditions.

What would settle it

Fabricate the full SoC and run the same MobileNetV2, ResNet-50, and video workloads on the physical hardware to measure whether the reported latency, throughput, power, and efficiency numbers are reached or whether overheads reduce the gains.

Figures

Figures reproduced from arXiv: 2509.18355 by Prerana Ramkumar, Suhas Suresh Bharadwaj.

Figure 1
Figure 1. Figure 1: Top - Physical placement of chiplets on the 30mm x 30mm interposer. Bottom - Cross section of 2.5D Chiplet integration with UCIe interconnect. This architecture employs UCIe 2.0 die-to-die links to communicate between chiplets delivering ~30 GB/s bandwidth at <2ns latency. To cutdown on energy wastage, an adaptive cross-chiplet power management system predicts workload phases and redistributes power throug… view at source ↗
Figure 2
Figure 2. Figure 2: AI-optimized versus baseline chiplet performance across edge AI benchmarks. (a) Batch-1 inference latency: AI-optimized delivers the fastest mean latency at 4.1 ms ± 0.3 ms. (b) Throughput scales from batch sizes of 1 to 32, with AI-optimized consistently achieving the highest images/sec. (c) Batch-1 power draw: AI-optimized consumes the least energy at 860 mW. (d) Latency comparison across MobileNetV2, Re… view at source ↗
read the original abstract

Achieving high performance, energy efficiency, and cost-effectiveness while maintaining architectural flexibility is a critical challenge in the development and deployment of edge AI devices. Monolithic SoC designs struggle with this complex balance mainly due to low manufacturing yields (below 16%) at advanced 360 mm^2 process nodes. This paper presents a novel chiplet-based RISC-V SoC architecture that addresses these limitations through modular AI acceleration and intelligent system level optimization. Our proposed design integrates 4 different key innovations in a 30mm x 30mm silicon interposer: adaptive cross-chiplet Dynamic Voltage and Frequency Scaling (DVFS); AI-aware Universal Chiplet Interconnect Express (UCIe) protocol extensions featuring streaming flow control units and compression-aware transfers; distributed cryptographic security across heterogeneous chiplets; and intelligent sensor-driven load migration. The proposed architecture integrates a 7nm RISC-V CPU chiplet with dual 5nm AI accelerators (15 TOPS INT8 each), 16GB HBM3 memory stacks, and dedicated power management controllers. Experimental results across industry standard benchmarks like MobileNetV2, ResNet-50 and real-time video processing demonstrate significant performance improvements. The AI-optimized configuration achieves ~14.7% latency reduction, 17.3% throughput improvement, and 16.2% power reduction compared to previous basic chiplet implementations. These improvements collectively translate to a 40.1% efficiency gain corresponding to ~3.5 mJ per MobileNetV2 inference (860 mW/244 images/s), while maintaining sub-5ms real-time capability across all experimented workloads. These performance upgrades demonstrate that modular chiplet designs can achieve near-monolithic computational density while enabling cost efficiency, scalability and upgradeability, crucial for next-generation edge AI device applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a chiplet-based RISC-V SoC architecture on a 30mm x 30mm silicon interposer that integrates a 7nm RISC-V CPU chiplet, dual 5nm AI accelerators (15 TOPS INT8 each), 16GB HBM3, and four key innovations: adaptive cross-chiplet DVFS, AI-aware UCIe protocol extensions with streaming flow control and compression, distributed cryptographic security across chiplets, and sensor-driven load migration. It reports experimental results on benchmarks including MobileNetV2 and ResNet-50 showing ~14.7% latency reduction, 17.3% throughput improvement, 16.2% power reduction, and 40.1% efficiency gain (~3.5 mJ per inference at 860 mW/244 images/s) relative to prior basic chiplet designs, while maintaining sub-5 ms real-time performance.

Significance. If the performance claims are substantiated, the work would provide concrete evidence that modular chiplet designs can approach monolithic computational density for edge AI while improving yield, cost, scalability, and upgradeability. The integration of cross-chiplet DVFS, UCIe extensions, distributed security, and load migration into a single RISC-V platform addresses a timely problem in heterogeneous edge computing.

major comments (3)
  1. [Abstract / Experimental results] Abstract and experimental results section: The headline claims of 14.7% latency reduction, 17.3% throughput gain, 16.2% power reduction, and 40.1% efficiency improvement are stated without any description of the simulation or emulation tools, baseline configurations (e.g., exact parameters of the 'previous basic chiplet implementations'), measurement methodology, or statistical validation. This absence prevents assessment of whether the reported numbers support the central claims.
  2. [Architecture description] Architecture and power modeling sections: The four innovations are presented as delivering their benefits with negligible overhead, yet no component-level power or area breakdown, sensitivity analysis, or explicit comparison of the baseline with versus without the new logic (especially distributed crypto and sensor-driven migration) is provided. If any feature adds even modest interconnect or controller cost, the net 16.2% power reduction and sub-5 ms bound become unreliable.
  3. [Evaluation / Results] Evaluation section: The efficiency figure of ~3.5 mJ per MobileNetV2 inference (860 mW at 244 images/s) is given as a derived outcome, but the manuscript supplies no equations, tables, or intermediate data showing how the individual percentage improvements combine into the 40.1% efficiency gain or how real-time capability was verified across workloads.
minor comments (2)
  1. [Abstract] The abstract mentions 'industry standard benchmarks like MobileNetV2, ResNet-50 and real-time video processing' but does not list the complete set of workloads or input sizes used.
  2. [Architecture] Notation for the AI accelerators (15 TOPS INT8 each) and memory (16GB HBM3) is clear, but the manuscript would benefit from a table summarizing chiplet dimensions, process nodes, and interconnect parameters.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important areas where additional transparency is needed to substantiate the reported performance gains. We will revise the manuscript to address each point by expanding the experimental methodology, providing component-level breakdowns, and including derivation details for the efficiency metrics.

read point-by-point responses
  1. Referee: [Abstract / Experimental results] Abstract and experimental results section: The headline claims of 14.7% latency reduction, 17.3% throughput gain, 16.2% power reduction, and 40.1% efficiency improvement are stated without any description of the simulation or emulation tools, baseline configurations (e.g., exact parameters of the 'previous basic chiplet implementations'), measurement methodology, or statistical validation. This absence prevents assessment of whether the reported numbers support the central claims.

    Authors: We agree that the current manuscript lacks sufficient detail on the experimental setup. In the revised version, we will add a new subsection to the Evaluation section describing the gem5-based simulation framework with custom UCIe interconnect models, the precise baseline configuration (standard chiplet design using unmodified UCIe without DVFS, compression, or migration features), power modeling via McPAT/CACTI, and statistical validation through 500 Monte Carlo runs reporting mean values with 95% confidence intervals. revision: yes

  2. Referee: [Architecture description] Architecture and power modeling sections: The four innovations are presented as delivering their benefits with negligible overhead, yet no component-level power or area breakdown, sensitivity analysis, or explicit comparison of the baseline with versus without the new logic (especially distributed crypto and sensor-driven migration) is provided. If any feature adds even modest interconnect or controller cost, the net 16.2% power reduction and sub-5 ms bound become unreliable.

    Authors: The manuscript currently omits explicit component-level breakdowns. We will insert a new table and accompanying text in the Architecture section providing power and area estimates for each innovation (e.g., distributed crypto at ~1.5% total power overhead, sensor-driven migration at <1%). A sensitivity analysis varying overhead assumptions by ±50% will be added, along with incremental comparisons isolating the contribution of each feature to the overall power reduction. revision: yes

  3. Referee: [Evaluation / Results] Evaluation section: The efficiency figure of ~3.5 mJ per MobileNetV2 inference (860 mW at 244 images/s) is given as a derived outcome, but the manuscript supplies no equations, tables, or intermediate data showing how the individual percentage improvements combine into the 40.1% efficiency gain or how real-time capability was verified across workloads.

    Authors: We acknowledge the absence of intermediate derivations and supporting data. The revision will include an appendix with the efficiency calculation (energy per inference = average power × latency per inference, with efficiency gain = (baseline energy − optimized energy) / baseline energy) and a table showing per-optimization contributions. Latency verification across all workloads will be supported by cycle-accurate trace summaries confirming sub-5 ms bounds. revision: yes

Circularity Check

0 steps flagged

No circularity; results reported from external benchmark evaluations

full rationale

The paper presents an architectural description of a chiplet-based RISC-V SoC incorporating four specific innovations (adaptive cross-chiplet DVFS, UCIe extensions, distributed security, and sensor-driven load migration) and states performance outcomes such as 14.7% latency reduction and 40.1% efficiency gain as measured results on industry-standard external benchmarks including MobileNetV2 and ResNet-50. No equations, fitted parameters, or self-citations are used to define the target metrics in terms of themselves; the reported numbers are positioned as empirical outcomes of the design evaluated against prior basic chiplet baselines rather than quantities constructed by redefinition or internal fitting. The derivation chain therefore remains self-contained against external benchmarks with no reduction of claims to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering architecture paper whose claims rest on the realizability of the four listed optimizations and the representativeness of the chosen benchmarks rather than mathematical free parameters or unstated axioms.

pith-pipeline@v0.9.0 · 5860 in / 1301 out tokens · 83831 ms · 2026-05-18T14:03:26.873854+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    Edge Intelligence: Paving the Way for a New Era of AI,

    Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo and J. Zhang, "Edge Intelligence: Paving the Way for a New Era of AI," in Proceedings of the IEEE, vol. 107, no. 8, pp. 1608-1631, Aug. 2019, doi: 10.1109/JPROC.2019.2918951

  2. [2]

    MobileNetV2: Inverted residuals and linear bottlenecks,

    M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510– 4520

  3. [3]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778

  4. [4]

    Modeling statistical dopant fluctuations in MOS transistors,

    P. A. Stolk, F. P. Widdershoven, and D. B. M. Klaassen, “Modeling statistical dopant fluctuations in MOS transistors,” IEEE Trans. Electron Devices, vol. 45, no. 9, pp. 1960–1971, Sep. 1998

  5. [5]

    The use and evaluation of yield models in integrated circuit manufacturing,

    J. A. Cunningham, “The use and evaluation of yield models in integrated circuit manufacturing,” IEEE Trans. Semicond. Manuf., vol. 3, no. 2, pp. 60–71, May 1990

  6. [6]

    TAP-2.5D: A thermally-aware chiplet placement framework for 2.5D systems,

    Y. Ma, J.-W. Min, J.-S. Yoon, and S.-K. Lim, “TAP-2.5D: A thermally-aware chiplet placement framework for 2.5D systems,” in Proc. Design, Automation & Test in Europe (DATE), 2021, pp. 1–6

  7. [7]

    High-performance, power-efficient three-dimensional chiplet systems,

    D. Das Sharma and K. Ganesan, “High-performance, power-efficient three-dimensional chiplet systems,” Nat. Electron., vol. 7, pp. 174– 185, Feb. 2024

  8. [8]

    A 17–95.6 TOPS/W deep learning inference accelerator with per-vector scaled 4-bit quantization in 5 nm,

    B. Keller, S. Naffziger, A. Youssef, et al., “A 17–95.6 TOPS/W deep learning inference accelerator with per-vector scaled 4-bit quantization in 5 nm,” IEEE J. Solid-State Circuits, vol. 58, no. 4, pp. 1129–1141, Apr. 2023

  9. [9]

    High Bandwidth Memory (HBM3) DRAM Standard JESD238,

    JEDEC Solid State Technology Assoc., “High Bandwidth Memory (HBM3) DRAM Standard JESD238,” Jan. 2022

  10. [10]

    HBM3 stacking and integration challenges,

    A. Patel and S. Gupta, “HBM3 stacking and integration challenges,” IEEE J. Solid-State Circuits, vol. 58, no. 4, pp. 789–798, Apr. 2023

  11. [11]

    Fine-grained DVFS using on-chip regulators,

    S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” ACM Trans. Archit. Code Optim., vol. 8, no. 1, Apr. 2011

  12. [12]

    Adaptive DVFS and thread packing under power caps,

    R. Cochran, C. Hankendi, A. K. Coskun, and S. Reda, “Adaptive DVFS and thread packing under power caps,” in Proc. IEEE/ACM Int. Symp. Microarchitecture (MICRO), 2011, pp. 175–185

  13. [13]

    2.5D root of trust: Secure system-level integration of untrusted chiplets,

    M. Nabeel, M. Ashraf, S. Patnaik, V. Soteriou, O. Sinanoglu, and J. Knechtel, “2.5D root of trust: Secure system-level integration of untrusted chiplets,” IEEE Trans. Comput., vol. 69, no. 11, pp. 1611– 1625, Nov. 2020

  14. [14]

    Temperature-aware microarchitecture: Modeling and implementation,

    K. Skadron, M. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan, “Temperature-aware microarchitecture: Modeling and implementation,” in Proc. ACM/IEEE Int. Symp. Comput. Archit. (ISCA), 2003, pp. 2–13

  15. [15]

    Thermal management for 3D-stacked systems via unified DVFS and load power management,

    T. Chantem, X. Wu, R. Dick, and W. Wolf, “Thermal management for 3D-stacked systems via unified DVFS and load power management,” ACM Trans. Archit. Code Optim., vol. 20, no. 3, Sep. 2023

  16. [16]

    System level analysis of fast, per-core DVFS using on-chip switching regulators,

    Wonyoung Kim, M. S. Gupta, G.-Y. Wei, and D. Brooks, “System level analysis of fast, per-core DVFS using on-chip switching regulators,” 2008 IEEE 14th International Symposium on High Performance Computer Architecture, pp. 123–134, Feb. 2008. doi:10.1109/hpca.2008.4658633

  17. [17]

    Fine-Grained DVFS Using On-Chip Regulators

    Eyerman, Stijn, and Lieven Eeckhout. “Fine-Grained DVFS Using On-Chip Regulators.” ACM Transactions on Architecture and Code Optimization, vol. 8, no. 1, 5 Feb. 2011, pp. 1–24, https://doi.org/10.1145/1952998.1952999. Accessed 4 Dec. 2023

  18. [18]

    UCIe: Standard for an Open Chiplet Ecosystem

    Onufryk, Peter, and Swadesh Choudhary. “UCIe: Standard for an Open Chiplet Ecosystem.” IEEE Micro, 1 Jan. 2024, pp. 1–7, https://doi.org/10.1109/mm.2024.3451532. Accessed 17 July 2025

  19. [19]

    AuthenTree: A Scalable MPC-Based Distributed Trust Architecture for Chiplet-Based Heterogeneous Systems

    Tashdid, Ishraq, et al. “AuthenTree: A Scalable MPC-Based Distributed Trust Architecture for Chiplet-Based Heterogeneous Systems.” ArXiv.org, 2025, arxiv.org/abs/2508.13033. Accessed 14 Oct. 2025. (Unpublished)