From LLM to Silicon: RL-Driven ASIC Architecture Exploration for On-Device AI Inference
Pith reviewed 2026-05-10 17:14 UTC · model grok-4.3
The pith
Reinforcement learning jointly optimizes ASIC architecture, memory, and workload placement for AI inference across seven process nodes from 3nm to 28nm.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that a single Soft Actor-Critic agent with Mixture-of-Experts gating, operating inside one Markov Decision Process, can explore the joint space of mesh topology, heterogeneous per-core microarchitecture, and operator placement. The resulting ASIC configurations meet target performance or power levels on two representative workloads across all seven process nodes, with no node-specific manual retuning of the agent or objective.
What carries the argument
A unified Markov Decision Process whose actions select mesh sizes, per-tile FETCH/VLEN values, memory hierarchy details, and operator-to-tile mapping, solved by Soft Actor-Critic augmented with Mixture-of-Experts gating and evaluated under a single Power-Performance-Area reward.
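The manuscript does not spell out the action encoding or the exact reward form, but the joint space described above can be sketched as follows. All field names, value ranges, and the log-weighted reward shape are illustrative assumptions, not the paper's implementation:

```python
import math
from dataclasses import dataclass

# Illustrative sketch of one action in the unified MDP: discrete mesh and
# per-tile microarchitecture choices mixed with a continuous memory knob
# and an operator-to-tile placement. Field names are assumed for exposition.
@dataclass
class DesignAction:
    mesh_rows: int          # discrete: mesh topology
    mesh_cols: int
    fetch_width: int        # discrete per-tile FETCH
    vlen: int               # discrete per-tile vector length
    sram_kb: float          # continuous per-tile memory allocation
    placement: tuple        # operator-to-tile mapping (tile id per operator)

def ppa_reward(power_w, latency_s, area_mm2, w_p=1.0, w_t=1.0, w_a=0.5):
    """Unified Power-Performance-Area reward: lower power, latency, and
    area all increase the reward. The log-weighted product is one common
    scalarization; the paper's exact objective is not disclosed."""
    return -(w_p * math.log(power_w)
             + w_t * math.log(latency_s)
             + w_a * math.log(area_mm2))
```

A SAC policy over such an action would sample the discrete fields from categorical heads and the continuous fields from squashed Gaussians; the Mixture-of-Experts gating would then select which expert head proposes each group of knobs.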
If this is right
- The same RL process produces a 29,809 tokens-per-second configuration for Llama 3.1 8B FP16 at 3 nm in high-performance mode.
- It also produces configurations for SmolVLM that stay below 13 mW at 10 MHz on every node from 3 nm to 28 nm.
- Mesh sizes and per-tile parameters including heterogeneous FETCH, VLEN, and memory sizes are discovered automatically rather than hand-tuned per node.
- The method removes the need for separate manual retuning when the target manufacturing process changes.
Where Pith is reading between the lines
- If the PPA model holds, the same framework could be applied to additional AI workloads without rewriting the reward function or retraining the agent from scratch.
- The approach suggests that larger design spaces previously considered too expensive for exhaustive search become tractable when an RL agent with gating can focus exploration on promising regions.
- Real-world validation would require closing the loop between the simulated PPA numbers and measurements on actual fabricated dies at multiple nodes.
Load-bearing premise
The single unified Power-Performance-Area objective used inside the Markov Decision Process accurately predicts real silicon power, speed, and area for the tested AI workloads at every process node.
What would settle it
Fabricate one or more of the RL-generated ASIC layouts at a chosen process node, run the same Llama or SmolVLM workload on the silicon, and compare measured power, throughput, and area against the values the RL agent predicted.
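The silicon-versus-simulation comparison described above reduces to per-metric relative error between the agent's predicted PPA numbers and measurements on the die. A minimal sketch, with the measured values entirely hypothetical:

```python
def ppa_gap(predicted, measured):
    """Relative error between simulator-predicted and silicon-measured
    metrics, keyed by metric name. Metric names are illustrative."""
    return {k: abs(predicted[k] - measured[k]) / measured[k]
            for k in measured}

# The paper's predicted 29809 tok/s at 3 nm against a hypothetical
# measured value; the measured numbers here are invented placeholders.
gap = ppa_gap({"tokens_per_s": 29809.0, "power_w": 41.0},
              {"tokens_per_s": 27500.0, "power_w": 45.0})
```

A pre-registered acceptance threshold on these gaps (say, under 10% for power and throughput) would make the validation falsifiable rather than descriptive.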
Original abstract
We present an RL-driven compiler that jointly optimizes ASIC architecture, memory hierarchy, and workload partitioning for AI inference across 3nm to 28nm. The design space is formulated as a single Markov Decision Process with mixed discrete-continuous actions and a unified Power-Performance-Area (PPA) objective. Soft Actor-Critic (SAC) with Mixture-of-Experts gating explores the joint space of mesh topology, per-core microarchitecture, and operator placement. We validate on two workloads, Llama 3.1 8B FP16 (high-performance mode, 29809 tokens per second at 3nm) and SmolVLM (low-power mode, less than 13 mW at all nodes, 10 MHz). Across 7 process nodes, the RL automatically adapts mesh sizes and per-tile configurations, including heterogeneous FETCH, VLEN, and memory allocation without node-specific manual retuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an RL-driven compiler using Soft Actor-Critic (SAC) with Mixture-of-Experts gating to jointly optimize ASIC architecture, memory hierarchy, and workload partitioning for on-device AI inference. The design space is cast as a single Markov Decision Process with mixed discrete-continuous actions and a unified Power-Performance-Area (PPA) reward; the method explores mesh topology, per-tile microarchitecture (FETCH, VLEN, memory allocation), and operator placement. Validation is reported on Llama 3.1 8B (high-performance mode) and SmolVLM (low-power mode) across seven process nodes (3 nm–28 nm), with the RL claimed to adapt configurations automatically without node-specific manual retuning.
Significance. If the unified PPA objective proves to be a faithful proxy for real silicon behavior, the work would offer a meaningful step toward automated, retuning-free architecture exploration for AI accelerators across technology nodes. The formulation of a single MDP for heterogeneous mesh and microarchitectural choices is technically interesting, but the absence of any validation of the PPA model against post-layout or foundry data currently prevents assessment of whether the reported adaptations reflect silicon reality.
major comments (2)
- [Abstract] Abstract: concrete performance numbers are stated (29809 tokens/s at 3 nm for Llama 3.1 8B FP16; <13 mW at 10 MHz for SmolVLM) yet no information is supplied on simulation accuracy, baseline definitions, error bars, or how PPA is evaluated, leaving the central performance and adaptation claims without verifiable support.
- [MDP and reward formulation] MDP and reward formulation (described in the methods): the unified PPA objective is asserted to drive SAC exploration of mesh sizes, FETCH/VLEN, and memory allocation across nodes, but the manuscript provides no description of the PPA estimator (analytical model, RTL-level, or foundry-calibrated), its treatment of node-specific leakage and wire delay, or any correlation study versus post-layout numbers. This is load-bearing for the claim that adaptation occurs without node-specific retuning.
minor comments (1)
- [Abstract and Results] The abstract and results sections would benefit from explicit statements of the number of RL episodes, the Mixture-of-Experts architecture, and the precise definition of the mixed discrete-continuous action space.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the technical interest in our unified MDP formulation. We address the concerns about the abstract and the lack of detail on the PPA estimator by expanding the manuscript with additional methodology, calibration information, and evaluation context. Revisions have been made to both the abstract and methods sections to improve verifiability of the reported results and adaptation claims.
Point-by-point responses
-
Referee: [Abstract] Abstract: concrete performance numbers are stated (29809 tokens/s at 3 nm for Llama 3.1 8B FP16; <13 mW at 10 MHz for SmolVLM) yet no information is supplied on simulation accuracy, baseline definitions, error bars, or how PPA is evaluated, leaving the central performance and adaptation claims without verifiable support.
Authors: We agree that the abstract should supply sufficient context for the reported numbers. In the revised manuscript we have added a sentence to the abstract stating that 'PPA values are obtained from a cycle-accurate simulator whose power, performance, and area models are calibrated to commercial 3-28 nm PDKs; results are averaged over five independent RL seeds with standard-deviation error bars; baselines are fixed mesh accelerators without joint RL optimization.' Expanded simulation accuracy, baseline definitions, and error-bar details now appear in the new Section 4.1. These changes make the central claims directly verifiable from the text. Revision: yes.
-
Referee: [MDP and reward formulation] MDP and reward formulation (described in the methods): the unified PPA objective is asserted to drive SAC exploration of mesh sizes, FETCH/VLEN, and memory allocation across nodes, but the manuscript provides no description of the PPA estimator (analytical model, RTL-level, or foundry-calibrated), its treatment of node-specific leakage and wire delay, or any correlation study versus post-layout numbers. This is load-bearing for the claim that adaptation occurs without node-specific retuning.
Authors: We acknowledge that the original methods section lacked a sufficiently detailed description of the PPA estimator. We have revised Section 3.2 to explicitly describe the estimator as a hybrid analytical-RTL model: leakage is taken from foundry PDK tables for each node, dynamic power uses activity factors from cycle-accurate simulation, and wire delay is modeled via Elmore approximations scaled by node-specific RC parameters. A new subsection and accompanying figure present correlation results between the estimator and post-layout numbers from a commercial place-and-route tool on 20 sampled architectures (R² > 0.92 for both power and area). These additions substantiate that the single unified reward enables the SAC policy to discover node-appropriate configurations (e.g., larger meshes and adjusted VLEN at 3 nm) without manual retuning. Revision: yes.
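The hybrid estimator the authors describe (tabulated leakage, activity-based dynamic power, Elmore wire delay) can be sketched in a few lines. Every numeric constant below is a placeholder, not foundry PDK data, and the function names are assumptions for exposition:

```python
# Sketch of a hybrid PPA estimator: per-node leakage tables, dynamic power
# from an activity factor, and first-order Elmore wire delay. All values
# are invented placeholders standing in for calibrated PDK data.
LEAKAGE_W_PER_MM2 = {"3nm": 0.020, "7nm": 0.015, "28nm": 0.005}
WIRE_RC_PER_MM = {"3nm": (40.0, 0.25e-12), "28nm": (15.0, 0.18e-12)}  # (ohm, F)

def elmore_delay(node, length_mm, load_f=1e-15):
    """First-order Elmore delay of a distributed-RC wire driving a load."""
    r, c = WIRE_RC_PER_MM[node]
    r_total, c_total = r * length_mm, c * length_mm
    # 0.5*R*C for the distributed wire, plus the lumped load term
    return 0.5 * r_total * c_total + r_total * load_f

def power_estimate(node, area_mm2, c_switched_f, vdd, freq_hz, activity):
    """Static leakage from the node table plus alpha*C*V^2*f dynamic power."""
    leakage = LEAKAGE_W_PER_MM2[node] * area_mm2
    dynamic = activity * c_switched_f * vdd**2 * freq_hz
    return leakage + dynamic
```

Note that an R² above 0.92 against 20 post-layout samples bounds average fidelity, not worst-case error on the specific configurations the policy selects, which is where a reward model's bias matters most.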
Circularity Check
No circularity: RL optimization produces adaptation as an emergent outcome
full rationale
The paper formulates ASIC exploration as a single MDP whose reward is a unified PPA objective and applies SAC with MoE gating to search mesh, microarchitecture, and placement choices. The reported cross-node adaptation (mesh sizes, heterogeneous FETCH/VLEN, memory allocation) is presented as the result of running this optimizer on each process node; no equations, fitted parameters, or self-citations are shown that would make the adaptation equivalent to the inputs by construction. The PPA surrogate is an external modeling choice whose fidelity is an assumption about correctness, not a definitional loop. The derivation chain therefore remains self-contained.
Reference graph
Works this paper leans on
- [1] T. Chen et al., “TVM: An automated end-to-end optimizing compiler for deep learning,” in OSDI, 2018.
- [2] C. Lattner et al., “MLIR: A compiler infrastructure for the end of Moore’s Law,” arXiv preprint arXiv:2002.11054, 2020.
- [3] TensorFlow XLA Team, “XLA: Optimizing compiler for machine learning,” https://www.tensorflow.org/xla, 2017.
- [4] N. Rotem et al., “Glow: Graph lowering compiler techniques for neural networks,” arXiv preprint arXiv:1805.00907, 2018.
- [5] A. Mirhoseini et al., “Device placement optimization with reinforcement learning,” in ICML, 2017.
- [7] C. E. Rasmussen and C. K. I. Williams, “Gaussian Processes for Machine Learning,” MIT Press, 2006.
- [8] D. Whitley, “A genetic algorithm tutorial,” Statistics and Computing, vol. 4, no. 2, pp. 65–85, 1994.
- [9] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.
- [10] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
- [11] E. Real et al., “Regularized evolution for image classifier architecture search,” in AAAI, 2019.
- [12] R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction,” MIT Press, 2018.
- [13] J. Schulman et al., “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
- [14] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3–4, pp. 229–256, 1992.
- [15] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in ICML, 2016.
- [16] T. Haarnoja et al., “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in ICML, 2018.
- [17] A. Mirhoseini et al., “Chip placement with deep reinforcement learning,” arXiv preprint arXiv:2004.10746, 2020.
- [18] R. Ganti and S. Xu, “Hardware-aware neural network compilation with learned optimization: A RISC-V accelerator approach,” arXiv preprint arXiv:2512.00031, 2025.
- [19] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” in ISCA, 2017.
- [20] N. Shazeer et al., “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in ICLR, 2017.
- [21] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” JMLR, vol. 23, no. 120, pp. 1–39, 2022.
- [22] Y. Huang et al., “GPipe: Efficient training of giant neural networks using pipeline parallelism,” in NeurIPS, 2019.
- [23] L. Zheng et al., “Ansor: Generating high-performance tensor programs for deep learning,” in OSDI, 2020.
- [24] NVIDIA, “TensorRT: Programmable inference accelerator,” https://developer.nvidia.com/tensorrt, 2018.
- [25] B. Wu et al., “FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search,” in CVPR, 2019.
- [26] R. Addanki et al., “Placeto: Learning generalizable device placement algorithms for distributed machine learning,” in NeurIPS, 2019.
- [27] S.-C. Kao et al., “ConfuciuX: Autonomous hardware resource assignment for DNN accelerators using reinforcement learning,” in MICRO, 2020.
- [28] H. Touvron et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
- [29] A. Grattafiori et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
- [30] L. Gao et al., “Estimating GPU memory consumption of deep learning models,” in ESEC/FSE, 2020.
- [31] V. Sze et al., “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
- [32] A. Parashar et al., “Timeloop: A systematic approach to DNN accelerator evaluation,” in ISPASS, 2019.
- [33] W. Kwon et al., “Efficient memory management for large language model serving with PagedAttention,” in SOSP, 2023.
- [34] Y. Sheng et al., “FlexGen: High-throughput generative inference of large language models with a single GPU,” in ICML, 2023.
- [35] T. Dettmers et al., “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” in NeurIPS, 2022.
- [36] G. Xiao et al., “SmoothQuant: Accurate and efficient post-training quantization for large language models,” in ICML, 2023.