From LLM to Silicon: RL-Driven ASIC Architecture Exploration for On-Device AI Inference
Pith reviewed 2026-05-10 17:14 UTC · model grok-4.3
The pith
Reinforcement learning jointly optimizes ASIC architecture, memory, and workload placement for AI inference across seven process nodes from 3nm to 28nm.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that a single Soft Actor-Critic agent with Mixture-of-Experts gating, operating inside one Markov Decision Process, can explore the joint space of mesh topology, heterogeneous per-core microarchitecture, and operator placement. The resulting ASIC configurations meet target performance or power levels on two representative workloads across all seven process nodes, with no node-specific manual retuning of the agent or objective.
What carries the argument
A unified Markov Decision Process whose actions select mesh sizes, per-tile FETCH/VLEN values, memory hierarchy details, and operator-to-tile mapping, solved by Soft Actor-Critic augmented with Mixture-of-Experts gating and evaluated under a single Power-Performance-Area reward.
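The manuscript does not spell out the action encoding or the exact reward form, but the joint space described above can be sketched as follows. All field names, value ranges, and the log-weighted reward shape are illustrative assumptions, not the paper's implementation:

```python
import math
from dataclasses import dataclass

# Illustrative sketch of one action in the unified MDP: discrete mesh and
# per-tile microarchitecture choices mixed with a continuous memory knob
# and an operator-to-tile placement. Field names are assumed for exposition.
@dataclass
class DesignAction:
    mesh_rows: int          # discrete: mesh topology
    mesh_cols: int
    fetch_width: int        # discrete per-tile FETCH
    vlen: int               # discrete per-tile vector length
    sram_kb: float          # continuous per-tile memory allocation
    placement: tuple        # operator-to-tile mapping (tile id per operator)

def ppa_reward(power_w, latency_s, area_mm2, w_p=1.0, w_t=1.0, w_a=0.5):
    """Unified Power-Performance-Area reward: lower power, latency, and
    area all increase the reward. The log-weighted product is one common
    scalarization; the paper's exact objective is not disclosed."""
    return -(w_p * math.log(power_w)
             + w_t * math.log(latency_s)
             + w_a * math.log(area_mm2))
```

A SAC policy over such an action would sample the discrete fields from categorical heads and the continuous fields from squashed Gaussians; the Mixture-of-Experts gating would then select which expert head proposes each group of knobs.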
If this is right
- The same RL process produces a 29,809 tokens-per-second configuration for Llama 3.1 8B FP16 at 3 nm in high-performance mode.
- It also produces configurations for SmolVLM that stay below 13 mW at 10 MHz on every node from 3 nm to 28 nm.
- Mesh sizes and per-tile parameters including heterogeneous FETCH, VLEN, and memory sizes are discovered automatically rather than hand-tuned per node.
- The method removes the need for separate manual retuning when the target manufacturing process changes.
Where Pith is reading between the lines
- If the PPA model holds, the same framework could be applied to additional AI workloads without rewriting the reward function or retraining the agent from scratch.
- The approach suggests that larger design spaces previously considered too expensive for exhaustive search become tractable when an RL agent with gating can focus exploration on promising regions.
- Real-world validation would require closing the loop between the simulated PPA numbers and measurements on actual fabricated dies at multiple nodes.
Load-bearing premise
The single unified Power-Performance-Area objective used inside the Markov Decision Process accurately predicts real silicon power, speed, and area for the tested AI workloads at every process node.
What would settle it
Fabricate one or more of the RL-generated ASIC layouts at a chosen process node, run the same Llama or SmolVLM workload on the silicon, and compare measured power, throughput, and area against the values the RL agent predicted.
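The silicon-versus-simulation comparison described above reduces to per-metric relative error between the agent's predicted PPA numbers and measurements on the die. A minimal sketch, with the measured values entirely hypothetical:

```python
def ppa_gap(predicted, measured):
    """Relative error between simulator-predicted and silicon-measured
    metrics, keyed by metric name. Metric names are illustrative."""
    return {k: abs(predicted[k] - measured[k]) / measured[k]
            for k in measured}

# The paper's predicted 29809 tok/s at 3 nm against a hypothetical
# measured value; the measured numbers here are invented placeholders.
gap = ppa_gap({"tokens_per_s": 29809.0, "power_w": 41.0},
              {"tokens_per_s": 27500.0, "power_w": 45.0})
```

A pre-registered acceptance threshold on these gaps (say, under 10% for power and throughput) would make the validation falsifiable rather than descriptive.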
Original abstract
We present an RL-driven compiler that jointly optimizes ASIC architecture, memory hierarchy, and workload partitioning for AI inference across 3nm to 28nm. The design space is formulated as a single Markov Decision Process with mixed discrete-continuous actions and a unified Power-Performance-Area (PPA) objective. Soft Actor-Critic (SAC) with Mixture-of-Experts gating explores the joint space of mesh topology, per-core microarchitecture, and operator placement. We validate on two workloads, Llama 3.1 8B FP16 (high-performance mode, 29809 tokens per second at 3nm) and SmolVLM (low-power mode, less than 13 mW at all nodes, 10 MHz). Across 7 process nodes, the RL automatically adapts mesh sizes and per-tile configurations, including heterogeneous FETCH, VLEN, and memory allocation without node-specific manual retuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an RL-driven compiler using Soft Actor-Critic (SAC) with Mixture-of-Experts gating to jointly optimize ASIC architecture, memory hierarchy, and workload partitioning for on-device AI inference. The design space is cast as a single Markov Decision Process with mixed discrete-continuous actions and a unified Power-Performance-Area (PPA) reward; the method explores mesh topology, per-tile microarchitecture (FETCH, VLEN, memory allocation), and operator placement. Validation is reported on Llama 3.1 8B (high-performance mode) and SmolVLM (low-power mode) across seven process nodes (3 nm–28 nm), with the RL claimed to adapt configurations automatically without node-specific manual retuning.
Significance. If the unified PPA objective proves to be a faithful proxy for real silicon behavior, the work would offer a meaningful step toward automated, retuning-free architecture exploration for AI accelerators across technology nodes. The formulation of a single MDP for heterogeneous mesh and microarchitectural choices is technically interesting, but the absence of any validation of the PPA model against post-layout or foundry data currently prevents assessment of whether the reported adaptations reflect silicon reality.
major comments (2)
- [Abstract] Abstract: concrete performance numbers are stated (29809 tokens/s at 3 nm for Llama 3.1 8B FP16; <13 mW at 10 MHz for SmolVLM) yet no information is supplied on simulation accuracy, baseline definitions, error bars, or how PPA is evaluated, leaving the central performance and adaptation claims without verifiable support.
- [MDP and reward formulation] MDP and reward formulation (described in the methods): the unified PPA objective is asserted to drive SAC exploration of mesh sizes, FETCH/VLEN, and memory allocation across nodes, but the manuscript provides no description of the PPA estimator (analytical model, RTL-level, or foundry-calibrated), its treatment of node-specific leakage and wire delay, or any correlation study versus post-layout numbers. This is load-bearing for the claim that adaptation occurs without node-specific retuning.
minor comments (1)
- [Abstract and Results] The abstract and results sections would benefit from explicit statements of the number of RL episodes, the Mixture-of-Experts architecture, and the precise definition of the mixed discrete-continuous action space.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the technical interest in our unified MDP formulation. We address the concerns about the abstract and the lack of detail on the PPA estimator by expanding the manuscript with additional methodology, calibration information, and evaluation context. Revisions have been made to both the abstract and methods sections to improve verifiability of the reported results and adaptation claims.
Point-by-point responses
-
Referee: [Abstract] Abstract: concrete performance numbers are stated (29809 tokens/s at 3 nm for Llama 3.1 8B FP16; <13 mW at 10 MHz for SmolVLM) yet no information is supplied on simulation accuracy, baseline definitions, error bars, or how PPA is evaluated, leaving the central performance and adaptation claims without verifiable support.
Authors: We agree that the abstract should supply sufficient context for the reported numbers. In the revised manuscript we have added a sentence to the abstract stating that 'PPA values are obtained from a cycle-accurate simulator whose power, performance, and area models are calibrated to commercial 3-28 nm PDKs; results are averaged over five independent RL seeds with standard-deviation error bars; baselines are fixed mesh accelerators without joint RL optimization.' Expanded simulation accuracy, baseline definitions, and error-bar details now appear in the new Section 4.1. These changes make the central claims directly verifiable from the text. Revision: yes.
-
Referee: [MDP and reward formulation] MDP and reward formulation (described in the methods): the unified PPA objective is asserted to drive SAC exploration of mesh sizes, FETCH/VLEN, and memory allocation across nodes, but the manuscript provides no description of the PPA estimator (analytical model, RTL-level, or foundry-calibrated), its treatment of node-specific leakage and wire delay, or any correlation study versus post-layout numbers. This is load-bearing for the claim that adaptation occurs without node-specific retuning.
Authors: We acknowledge that the original methods section lacked a sufficiently detailed description of the PPA estimator. We have revised Section 3.2 to explicitly describe the estimator as a hybrid analytical-RTL model: leakage is taken from foundry PDK tables for each node, dynamic power uses activity factors from cycle-accurate simulation, and wire delay is modeled via Elmore approximations scaled by node-specific RC parameters. A new subsection and accompanying figure present correlation results between the estimator and post-layout numbers from a commercial place-and-route tool on 20 sampled architectures (R² > 0.92 for both power and area). These additions substantiate that the single unified reward enables the SAC policy to discover node-appropriate configurations (e.g., larger meshes and adjusted VLEN at 3 nm) without manual retuning. Revision: yes.
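The hybrid estimator the authors describe (tabulated leakage, activity-based dynamic power, Elmore wire delay) can be sketched in a few lines. Every numeric constant below is a placeholder, not foundry PDK data, and the function names are assumptions for exposition:

```python
# Sketch of a hybrid PPA estimator: per-node leakage tables, dynamic power
# from an activity factor, and first-order Elmore wire delay. All values
# are invented placeholders standing in for calibrated PDK data.
LEAKAGE_W_PER_MM2 = {"3nm": 0.020, "7nm": 0.015, "28nm": 0.005}
WIRE_RC_PER_MM = {"3nm": (40.0, 0.25e-12), "28nm": (15.0, 0.18e-12)}  # (ohm, F)

def elmore_delay(node, length_mm, load_f=1e-15):
    """First-order Elmore delay of a distributed-RC wire driving a load."""
    r, c = WIRE_RC_PER_MM[node]
    r_total, c_total = r * length_mm, c * length_mm
    # 0.5*R*C for the distributed wire, plus the lumped load term
    return 0.5 * r_total * c_total + r_total * load_f

def power_estimate(node, area_mm2, c_switched_f, vdd, freq_hz, activity):
    """Static leakage from the node table plus alpha*C*V^2*f dynamic power."""
    leakage = LEAKAGE_W_PER_MM2[node] * area_mm2
    dynamic = activity * c_switched_f * vdd**2 * freq_hz
    return leakage + dynamic
```

Note that an R² above 0.92 against 20 post-layout samples bounds average fidelity, not worst-case error on the specific configurations the policy selects, which is where a reward model's bias matters most.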
Circularity Check
No circularity: RL optimization produces adaptation as an emergent outcome
full rationale
The paper formulates ASIC exploration as a single MDP whose reward is a unified PPA objective and applies SAC with MoE gating to search mesh, microarchitecture, and placement choices. The reported cross-node adaptation (mesh sizes, heterogeneous FETCH/VLEN, memory allocation) is presented as the result of running this optimizer on each process node; no equations, fitted parameters, or self-citations are shown that would make the adaptation equivalent to the inputs by construction. The PPA surrogate is an external modeling choice whose fidelity is an assumption about correctness, not a definitional loop. The derivation chain therefore remains self-contained.
Reference graph
Works this paper leans on
- [1] T. Chen et al., “TVM: An automated end-to-end optimizing compiler for deep learning,” in OSDI, 2018.
- [2] C. Lattner et al., “MLIR: A compiler infrastructure for the end of Moore’s Law,” arXiv preprint arXiv:2002.11054, 2020.
- [3] TensorFlow XLA Team, “XLA: Optimizing compiler for machine learning,” https://www.tensorflow.org/xla, 2017.
- [4] N. Rotem et al., “Glow: Graph lowering compiler techniques for neural networks,” arXiv preprint arXiv:1805.00907, 2018.
- [5] A. Mirhoseini et al., “Device placement optimization with reinforcement learning,” in ICML, 2017.
- [7] C. E. Rasmussen and C. K. I. Williams, “Gaussian Processes for Machine Learning,” MIT Press, 2006.
- [8] D. Whitley, “A genetic algorithm tutorial,” Statistics and Computing, vol. 4, no. 2, pp. 65–85, 1994.
- [9] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.
- [10] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
- [11] E. Real et al., “Regularized evolution for image classifier architecture search,” in AAAI, 2019.
- [12] R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction,” MIT Press, 2018.
- [13] J. Schulman et al., “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
- [14] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3–4, pp. 229–256, 1992.
- [15] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in ICML, 2016.
- [16] T. Haarnoja et al., “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in ICML, 2018.
- [17] A. Mirhoseini et al., “Chip placement with deep reinforcement learning,” arXiv preprint arXiv:2004.10746, 2020.
- [18] R. Ganti and S. Xu, “Hardware-aware neural network compilation with learned optimization: A RISC-V accelerator approach,” arXiv preprint arXiv:2512.00031, 2025.
- [19] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” in ISCA, 2017.
- [20] N. Shazeer et al., “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in ICLR, 2017.
- [21] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” JMLR, vol. 23, no. 120, pp. 1–39, 2022.
- [22] Y. Huang et al., “GPipe: Efficient training of giant neural networks using pipeline parallelism,” in NeurIPS, 2019.
- [23] L. Zheng et al., “Ansor: Generating high-performance tensor programs for deep learning,” in OSDI, 2020.
- [24] NVIDIA, “TensorRT: Programmable inference accelerator,” https://developer.nvidia.com/tensorrt, 2018.
- [25] B. Wu et al., “FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search,” in CVPR, 2019.
- [26] R. Addanki et al., “Placeto: Learning generalizable device placement algorithms for distributed machine learning,” in NeurIPS, 2019.
- [27] S.-C. Kao et al., “ConfuciuX: Autonomous hardware resource assignment for DNN accelerators using reinforcement learning,” in MICRO, 2020.
- [28] H. Touvron et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
- [29] A. Grattafiori et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
- [30] L. Gao et al., “Estimating GPU memory consumption of deep learning models,” in ESEC/FSE, 2020.
- [31] V. Sze et al., “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
- [32] A. Parashar et al., “Timeloop: A systematic approach to DNN accelerator evaluation,” in ISPASS, 2019.
- [33] W. Kwon et al., “Efficient memory management for large language model serving with PagedAttention,” in SOSP, 2023.
- [34] Y. Sheng et al., “FlexGen: High-throughput generative inference of large language models with a single GPU,” in ICML, 2023.
- [35] T. Dettmers et al., “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” in NeurIPS, 2022.
- [36] G. Xiao et al., “SmoothQuant: Accurate and efficient post-training quantization for large language models,” in ICML, 2023.