
arXiv: 2604.25191 · v1 · submitted 2026-04-28 · cs.AR · cs.AI · cs.LG

How Can Reinforcement Learning Achieve Expert-level Placement?

Pith reviewed 2026-05-07 14:38 UTC · model grok-4.3

keywords: reinforcement learning · chip placement · reward modeling · expert demonstrations · trajectory inference · physical design automation · preference learning

The pith

Reinforcement learning reaches expert chip placement quality by learning a reward model directly from final expert layouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that RL methods for chip placement underperform experts mainly because their rewards focus narrowly on wirelength instead of capturing full expert goals. Rather than engineering complex reward functions, the approach infers step-by-step expert trajectories from the final layouts alone and uses those trajectories as demonstrations or preferences to train a model of the latent rewards implicit in expert results. This reward model then guides RL training. A sympathetic reader would care because chip placement is a bottleneck in hardware design, and closing the gap to human experts could speed up the overall physical design flow without requiring extensive manual reward tuning. Experiments indicate the method learns efficiently from a single design and generalizes to new cases.

Core claim

By beginning with final expert layouts, inferring the step-by-step trajectories that produced them, and training a reward model on those trajectories treated as demonstrations or preferences, RL agents can learn the implicit objectives that experts optimize and thereby produce layouts of expert quality.

What carries the argument

Inference of step-by-step trajectories from final layouts alone, which then serve as data to train a reward model capturing latent expert preferences.
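The idea of turning a final layout into a step-by-step trajectory can be sketched as follows. This is a minimal illustration, not the paper's procedure: the abstract does not say how the placement order is chosen, so the largest-area-first ordering, the `infer_trajectory` name, and the toy macro data are all assumptions made here for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
# One step: (layout before the move, macro being placed, position it is placed at)
Step = Tuple[Dict[str, Point], str, Point]


def infer_trajectory(final_layout: Dict[str, Point],
                     areas: Dict[str, float]) -> List[Step]:
    """Replay a final layout as a sequence of single-macro placements.

    ASSUMPTION: macros are ordered by decreasing area (ties broken by name).
    The final layout alone does not determine this order.
    """
    order = sorted(final_layout, key=lambda m: (-areas[m], m))
    placed: Dict[str, Point] = {}
    steps: List[Step] = []
    for macro in order:
        # Snapshot the partial layout so later steps do not mutate earlier states.
        steps.append((dict(placed), macro, final_layout[macro]))
        placed[macro] = final_layout[macro]
    return steps


# Hypothetical toy data: three macros with made-up positions and areas.
final_layout = {"m1": (0.0, 0.0), "m2": (5.0, 2.0), "m3": (1.0, 7.0)}
areas = {"m1": 4.0, "m2": 9.0, "m3": 1.0}
trajectory = infer_trajectory(final_layout, areas)
placement_order = [macro for _, macro, _ in trajectory]
```

Each step pairs a partial layout with the action taken from it, which is the form of data that demonstration-based or preference-based reward learning would consume.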

If this is right

  • RL agents trained on the learned reward model will produce placements closer to expert quality than agents trained only on wirelength rewards.
  • The framework requires only a single expert design to learn an effective reward model that generalizes.
  • Reward engineering for placement can be replaced by automated extraction of implicit rewards from existing expert results.
  • The same trajectory-inference plus preference-learning pipeline can be applied to other sequential physical-design tasks.
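For contrast with a learned reward, the wirelength-only reward mentioned in the first bullet is commonly approximated by half-perimeter wirelength (HPWL). The sketch below assumes point-like macros and a netlist given as lists of macro names; the `hpwl` function and the toy data are illustrative and not taken from the paper.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def hpwl(positions: Dict[str, Point], nets: List[List[str]]) -> float:
    """Sum, over all nets, of the half-perimeter of the net's bounding box."""
    total = 0.0
    for net in nets:
        xs = [positions[m][0] for m in net]
        ys = [positions[m][1] for m in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


# Hypothetical toy layout and netlist.
positions = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 1.0)}
nets = [["a", "b"], ["a", "b", "c"], ["b", "c"]]

wirelength = hpwl(positions, nets)
reward = -wirelength  # a wirelength-only reward: shorter total wirelength is better
```

A reward of this form says nothing about congestion, timing, or regularity, which is the gap the paper attributes to narrowly engineered rewards.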

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If trajectory inference proves reliable, the method could be extended to domains where only final expert artifacts are archived and process traces are missing.
  • Combining the learned reward model with other RL advances such as better exploration or hierarchical policies might further reduce the number of required expert examples.
  • A practical test would be whether the inferred rewards also respect hard constraints like design-rule violations that experts implicitly avoid but that are not explicit in the final layout data.
  • Success here raises the question of whether similar final-outcome-to-reward extraction can improve RL in other engineering optimization settings beyond chip design.

Load-bearing premise

Step-by-step expert trajectories can be accurately recovered from final layouts without any extra information on the expert's intermediate choices or constraints.

What would settle it

Train the reward model on trajectories inferred from one expert layout, then evaluate the resulting RL agent on multiple unseen designs. If its layouts are measurably worse than the expert baselines on standard metrics such as wirelength, congestion, and timing, the claim fails.

Figures

Figures reproduced from arXiv: 2604.25191 by Chao Qian, Chengrui Gao, Ke Xue, Mingxuan Yuan, Peng Xie, Ruo-Tong Chen, Siyuan Xu, Tian Xu, Yunqi Shi, Zhi-Hua Zhou.

Figure 1: Illustration of our proposed framework. Instead of manually formalizing intricate expert knowledge, we circumvent this by directly learning from the final expert layouts to derive a reward model. Visualizations of design superblue18 produced by MaskPlace [17], EfficientPlace [10], DREAMPlace 4.1.0 [18], EIM-D, EIM-P, and the expert are shown. The reward models of EIM-D and EIM-P are trained on design superblue…
Original abstract

Chip placement is a critical step in physical design. While reinforcement learning (RL)-based methods have recently emerged, their training primarily focuses on wirelength optimization, and they therefore often fail to achieve expert-quality layouts. We identify the reward design as the primary cause of the performance gap with experts, and instead of formalizing intricate processes, we circumvent this by directly learning from expert layouts to derive a reward model. Our approach starts from the final expert layouts to infer step-by-step expert trajectories. Using these trajectories as demonstrations or preferences, we train a model that captures the latent implicit rewards in expert results. Experiments show that our framework can efficiently learn from even a single design and generalize well to unseen cases.
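One common way to train a reward model from trajectory preferences is a Bradley-Terry objective, in which the preferred trajectory should receive the higher total reward. The abstract does not state which objective the paper uses, so the sketch below is an assumption throughout: the linear reward, the two per-step features, and the toy "expert" and "perturbed" trajectories are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Features = Sequence[float]       # per-step feature vector (hypothetical features)
Trajectory = Sequence[Features]


def trajectory_return(w: Sequence[float], traj: Trajectory) -> float:
    """Total reward of a trajectory under a linear reward r(s) = w . phi(s)."""
    return sum(sum(wi * fi for wi, fi in zip(w, step)) for step in traj)


def feature_sums(traj: Trajectory, dim: int) -> List[float]:
    return [sum(step[k] for step in traj) for k in range(dim)]


def bt_loss_and_grad(w: Sequence[float],
                     pairs: Sequence[Tuple[Trajectory, Trajectory]]):
    """Mean Bradley-Terry negative log-likelihood and its gradient in w."""
    loss, grad = 0.0, [0.0] * len(w)
    for preferred, other in pairs:
        diff = trajectory_return(w, preferred) - trajectory_return(w, other)
        loss += math.log1p(math.exp(-diff))      # -log sigmoid(diff)
        p = 1.0 / (1.0 + math.exp(-diff))        # P(preferred beats other)
        fp = feature_sums(preferred, len(w))
        fo = feature_sums(other, len(w))
        for k in range(len(w)):
            grad[k] -= (1.0 - p) * (fp[k] - fo[k])
    n = len(pairs)
    return loss / n, [g / n for g in grad]


# Toy data: each step has two made-up cost-like features.
expert = [(1.0, 0.0), (1.5, 0.1)]
perturbed = [(2.0, 0.5), (2.5, 0.4)]
pairs = [(expert, perturbed)]

w = [0.0, 0.0]
initial_loss, _ = bt_loss_and_grad(w, pairs)
for _ in range(200):                              # plain gradient descent
    _, grad = bt_loss_and_grad(w, pairs)
    w = [wi - 0.1 * gi for wi, gi in zip(w, grad)]
final_loss, _ = bt_loss_and_grad(w, pairs)
```

After training, the learned weights assign a higher return to the expert trajectory than to the perturbed one. A practical reward model would replace the linear form with a neural network over layout states.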

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes an RL framework for chip placement that addresses the reward design gap with experts by learning an implicit reward model directly from expert layouts. It infers step-by-step trajectories from final expert configurations to serve as demonstrations or preferences, then trains a model to capture latent expert rewards. The central claim is that this enables efficient learning from even a single design with good generalization to unseen cases.

Significance. If the trajectory inference is reliable and the resulting rewards demonstrably encode expert decision quality rather than reconstruction artifacts, the approach could meaningfully advance RL-based physical design by reducing dependence on hand-crafted objectives and supporting data-efficient training. This would be a notable contribution given the scarcity of expert placement data.

major comments (1)
  1. [Methods (trajectory inference from final layouts)] The trajectory inference step (described in the methods following the abstract) is load-bearing for the central claim yet under-specified. Final layouts do not uniquely determine placement order or intermediate states, as many sequences can produce identical wirelengths and positions. The manuscript must detail the exact reconstruction procedure, any heuristics or constraints applied, and provide validation (e.g., comparison to recorded expert traces or sensitivity analysis) showing that inferred paths align with expert behavior rather than introducing artifacts that the reward model then learns.
minor comments (1)
  1. [Abstract] The abstract states that 'experiments show' efficient learning and generalization but provides no quantitative metrics, baselines, or details on reward-model training (e.g., inverse RL formulation, preference optimization objective, or evaluation protocol). The full manuscript should include these to allow verification of the performance claims.
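The non-uniqueness raised in the major comment can be made concrete with a toy case. The sketch below assumes a placement step simply fixes one macro at its final position; the layout data and the `replay` helper are hypothetical.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]

# Hypothetical final layout with three macros.
final_layout: Dict[str, Point] = {"m1": (0, 0), "m2": (5, 2), "m3": (1, 7)}


def replay(order: Sequence[str], layout: Dict[str, Point]) -> Dict[str, Point]:
    """Place macros one at a time in the given order and return the end state."""
    placed: Dict[str, Point] = {}
    for macro in order:
        placed[macro] = layout[macro]
    return placed


orders = list(permutations(final_layout))
end_states = [replay(order, final_layout) for order in orders]

num_orders = len(orders)                                   # n! candidate orders
all_same = all(state == final_layout for state in end_states)
```

With n macros there are n! orders that all end in the same layout, so the final layout by itself cannot identify which sequence an expert followed.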

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The point raised about the trajectory inference procedure is well-taken and central to the validity of our claims. We provide a point-by-point response below and will revise the manuscript to address the concerns.

Point-by-point responses
  1. Referee: The trajectory inference step (described in the methods following the abstract) is load-bearing for the central claim yet under-specified. Final layouts do not uniquely determine placement order or intermediate states, as many sequences can produce identical wirelengths and positions. The manuscript must detail the exact reconstruction procedure, any heuristics or constraints applied, and provide validation (e.g., comparison to recorded expert traces or sensitivity analysis) showing that inferred paths align with expert behavior rather than introducing artifacts that the reward model then learns.

    Authors: We agree that the trajectory inference procedure is under-specified in the current manuscript and requires elaboration for reproducibility and to mitigate concerns about potential artifacts. The original submission describes the high-level approach of inferring step-by-step trajectories from final expert layouts but does not provide the full algorithmic details. In the revised manuscript, we will expand the Methods section to include the exact reconstruction procedure, the specific heuristics and constraints used, and additional validation experiments. This will include a sensitivity analysis varying key inference parameters to demonstrate robustness of the learned reward model. While our dataset consists only of final layouts and does not include recorded expert placement traces for direct comparison, we will add ablation studies contrasting our inferred trajectories against those from alternative inference strategies (e.g., random or connectivity-based orderings) to show alignment with expert-quality outcomes. revision: yes
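The alternative orderings named in the rebuttal (random or connectivity-based) could be generated along these lines for an ablation. This is a sketch under assumptions: connectivity is measured here as the number of nets a macro appears in, and the function names and toy netlist are illustrative rather than the authors' implementation.

```python
import random
from collections import Counter
from typing import List, Sequence


def random_order(macros: Sequence[str], seed: int = 0) -> List[str]:
    """Baseline ordering: a seeded random shuffle of the macros."""
    rng = random.Random(seed)
    order = list(macros)
    rng.shuffle(order)
    return order


def connectivity_order(macros: Sequence[str],
                       nets: Sequence[Sequence[str]]) -> List[str]:
    """Place the most connected macros first (ties broken by name).

    ASSUMPTION: connectivity is the count of nets containing the macro.
    """
    degree = Counter(m for net in nets for m in net)
    return sorted(macros, key=lambda m: (-degree[m], m))


# Hypothetical toy netlist.
macros = ["m1", "m2", "m3", "m4"]
nets = [["m1", "m2"], ["m2", "m3"], ["m2", "m4"], ["m3", "m4"]]

rand = random_order(macros, seed=0)
conn = connectivity_order(macros, nets)
```

Training one reward model per ordering and comparing the resulting agents on held-out designs would show how sensitive the learned reward is to the inferred order.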

Circularity Check

0 steps flagged

No circularity detected; derivation relies on external expert data without self-referential reduction

full rationale

The paper's core claim is that a reward model can be derived by starting from final expert layouts, inferring trajectories, and training on them as demonstrations or preferences. No equations, derivations, or self-citations are shown that reduce this process to a fitted parameter defined in terms of the target expert-level outcome. The method treats expert layouts as independent external input rather than constructing the result from its own predictions or prior fitted values, leaving the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated assumption that expert layouts encode recoverable step-wise decision sequences and that a learned model from those sequences will produce transferable rewards. No free parameters, axioms, or invented entities are explicitly listed in the abstract.



Reference graph

Works this paper leans on

  [1] Saurabh N. Adya, Mehmet Can Yildiz, Igor L. Markov, Paul Villarrubia, Phiroze N. Parakh, and Patrick H. Madden. 2004. Benchmarking for large-scale placement and beyond. Transactions on Computer-Aided Design of Integrated Circuits and Systems 23, 4 (2004).

  [2] Anthony Agnesina, Puranjay Rajvanshi, Tian Yang, Geraldo Pradipta, Austin Jiao, Ben Keller, Brucek Khailany, and Haoxing Ren. 2023. AutoDMP: Automated DREAMPlace-based macro placement. In Proceedings of the 2023 International Symposium on Physical Design.

  [3] Tutu Ajayi, Vidya A Chhabria, Mateus Fogaça, Soheil Hashemi, Abdelrahman Hosny, Andrew B Kahng, Minsoo Kim, Jeongsup Lee, Uday Mallappa, Marina Neseem, et al. 2019. Toward an open-source digital flow: First learnings from the openroad project. In Proceedings of the 56th Design Automation Conference.

  [4] Chin-Hao Chang, Yao-Wen Chang, and Tung-Chieh Chen. 2017. A novel damped-wave framework for macro placement. In Proceedings of the 36th International Conference on Computer-Aided Design.

  [5] Yifan Chen, Zaiwen Wen, Yun Liang, and Yibo Lin. 2023. Stronger mixed-size placement backbone considering second-order information. In Proceedings of the 42nd International Conference on Computer-Aided Design.

  [6] Chung-Kuan Cheng, Andrew B Kahng, Ilgweon Kang, and Lutong Wang. 2018. Replace: Advancing solution quality and routability validation in global placement. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 38, 9 (2018).

  [7] Ruoyu Cheng and Junchi Yan. 2021. On joint learning for solving placement and routing in chip design. In Advances in Neural Information Processing Systems 34.

  [8] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30.

  [9] Divyansh Garg, Shuvam Chakraborty, Chris Cundy, Jiaming Song, and Stefano Ermon. 2021. IQ-Learn: Inverse soft-Q learning for imitation. In Advances in Neural Information Processing Systems 34.

  [10] Zijie Geng, Jie Wang, Ziyan Liu, Siyuan Xu, Zhentao Tang, Mingxuan Yuan, Jianye Hao, Yongdong Zhang, and Feng Wu. 2024. Reinforcement learning within tree search for fast macro placement. In Proceedings of the 41st International Conference on Machine Learning.

  [11] Anna Goldie, Azalia Mirhoseini, and Jeff Dean. 2024. That chip has sailed: A critique of unfounded skepticism around AI for chip design. arXiv:2411.10053 (2024).

  [12] Anna Goldie, Azalia Mirhoseini, Mustafa Yazgan, Joe Wenjie Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Azade Nova, et al. 2024. Addendum: A graph placement methodology for fast chip design. Nature 634, 8034 (2024).

  [13] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning.

  [14] Andrew B. Kahng, Ravi Varadarajan, and Zhiang Wang. 2022. RTL-MP: Toward practical, human-quality chip planning and macro placement. In Proceedings of the 2022 International Symposium on Physical Design.

  [15] Andrew B Kahng, Ravi Varadarajan, and Zhiang Wang. 2023. Hier-RTLMP: A hierarchical automatic macro placer for large-scale complex IP blocks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43, 5 (2023).

  [16] Myung-Chul Kim, Jin Hu, Jiajia Li, and Natarajan Viswanathan. 2015. ICCAD-2015 CAD contest in incremental timing-driven placement and benchmark suite. In Proceedings of the 34th International Conference on Computer-Aided Design.

  [17] Yao Lai, Yao Mu, and Ping Luo. 2022. MaskPlace: Fast chip placement via reinforced visual representation learning. In Advances in Neural Information Processing Systems 35.

  [18] Peiyu Liao, Dawei Guo, Zizheng Guo, Siting Liu, Yibo Lin, and Bei Yu. 2023. DREAMPlace 4.0: Timing-driven placement with momentum-based net weighting and Lagrangian-based refinement. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42, 10 (2023).

  [19] Jai-Ming Lin, You-Lun Deng, Szu-Ting Li, Bo-Heng Yu, Li-Yen Chang, and Te-Wei Peng. 2019. Regularity-aware routability-driven macro placement methodology for mixed-size circuits with obstacles. IEEE Transactions on Very Large Scale Integration Systems 27, 1 (2019), 57–68.

  [20] Jai-Ming Lin, You-Lun Deng, Ya-Chu Yang, Jia-Jian Chen, and Po-Chen Lu. 2021. Dataflow-aware macro placement based on simulated evolution algorithm for mixed-size designs. IEEE Transactions on Very Large Scale Integration Systems 29, 5 (2021).

  [21] Yibo Lin, Zixuan Jiang, Jiaqi Gu, Wuxi Li, Shounak Dhar, Haoxing Ren, Brucek Khailany, and David Z Pan. 2020. DREAMPlace: Deep learning toolkit-enabled GPU acceleration for modern VLSI placement. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 40, 4 (2020).

  [22] Jingwei Lu, Pengwen Chen, Chin-Chih Chang, Lu Sha, Dennis Jen-Hsin Huang, Chin-Chi Teng, and Chung-Kuan Cheng. 2015. ePlace: Electrostatics-based placement using fast Fourier transform and Nesterov’s method. ACM Transactions on Design Automation of Electronic Systems 20, 2 (2015).

  [23] Alberto Maria Metelli, Giorgia Ramponi, Alessandro Concetti, and Marcello Restelli. 2021. Provably efficient learning of transferable rewards. In Proceedings of the 38th International Conference on Machine Learning.

  [24] Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Wenjie Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Azade Nazi, et al. 2021. A graph placement methodology for fast chip design. Nature 594, 7862 (2021).

  [25] Hiroshi Murata, Kunihiro Fujiyoshi, Shigetoshi Nakatake, and Yoji Kajitani. 1996. VLSI module placement based on rectangle-packing by the sequence-pair. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 15, 12 (1996).

  [26] Andrew Y. Ng and Stuart Russell. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning.

  [27] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and Jan Peters. 2018. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics (2018).

  [28] Yuan Pu, Tinghuan Chen, Zhuolun He, Chen Bai, Haisheng Zheng, Yibo Lin, and Bei Yu. 2024. IncreMacro: Incremental macro placement refinement. In Proceedings of the 2024 International Symposium on Physical Design.

  [29] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv:1707.06347 (2017).

  [30] Yunqi Shi, Ke Xue, Lei Song, and Chao Qian. 2023. Macro placement by wire-mask-guided black-box optimization. In Advances in Neural Information Processing Systems 36.

  [31] Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. 2017. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research (2017).

  [32] Ke Xue, Ruo-Tong Chen, Xi Lin, Yunqi Shi, Shixiong Kai, Siyuan Xu, and Chao Qian. 2024. Reinforcement learning policy as macro regulator rather than macro placer. In Advances in Neural Information Processing Systems 37.

  [33] Rui Yu, Shenghua Wan, Yucen Wang, Chen-Xiao Gao, Le Gan, Zongzhang Zhang, and De-Chuan Zhan. 2025. Reward models in deep reinforcement learning: A survey. In Proceedings of the 34th International Joint Conference on Artificial Intelligence.

  [34] Zhi-Hua Zhou and Yu-Xuan Huang. 2021. Abductive Learning. In Neuro-Symbolic Artificial Intelligence: The State of the Art. Vol. 342.