How Can Reinforcement Learning Achieve Expert-level Placement?
Pith reviewed 2026-05-07 14:38 UTC · model grok-4.3
The pith
Reinforcement learning reaches expert chip placement quality by learning a reward model directly from final expert layouts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By beginning with final expert layouts, inferring the step-by-step trajectories that produced them, and training a reward model on those trajectories treated as demonstrations or preferences, RL agents can learn the implicit objectives that experts optimize and thereby produce layouts of expert quality.
What carries the argument
Inference of step-by-step trajectories from final layouts alone, which then serve as data to train a reward model capturing latent expert preferences.
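The abstract does not say how trajectories are recovered, so the following is only a minimal sketch of what "replaying" a final layout as a placement sequence could look like. The `Macro` type, the `infer_trajectory` function, and the largest-area-first ordering are all assumptions made for illustration; they are not the paper's procedure.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Position = Tuple[float, float]


@dataclass(frozen=True)
class Macro:
    """Hypothetical macro record: a name plus its footprint."""
    name: str
    width: float
    height: float


def infer_trajectory(
    macros: List[Macro],
    final_layout: Dict[str, Position],
) -> List[Tuple[Dict[str, Position], Tuple[str, Position]]]:
    """Replay a final layout as a sequence of (state, action) pairs.

    Assumption: macros are placed largest-area first (name breaks ties).
    The final layout alone does not determine this order.
    """
    order = sorted(macros, key=lambda m: (-m.width * m.height, m.name))
    placed: Dict[str, Position] = {}
    trajectory = []
    for macro in order:
        state = dict(placed)  # snapshot of what was placed before this step
        action = (macro.name, final_layout[macro.name])
        trajectory.append((state, action))
        placed[macro.name] = final_layout[macro.name]
    return trajectory
```

Each pair records the partial layout an agent would have seen and the position the expert layout assigns next. Any other ordering rule would yield a different trajectory from the same layout, which is why the choice of rule matters for everything downstream.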
If this is right
- RL agents trained on the learned reward model will produce placements closer to expert quality than agents trained only on wirelength rewards.
- The framework requires only a single expert design to learn an effective reward model that generalizes.
- Reward engineering for placement can be replaced by automated extraction of implicit rewards from existing expert results.
- The same trajectory-inference plus preference-learning pipeline can be applied to other sequential physical-design tasks.
Where Pith is reading between the lines
- If trajectory inference proves reliable, the method could be extended to domains where only final expert artifacts are archived and process traces are missing.
- Combining the learned reward model with other RL advances such as better exploration or hierarchical policies might further reduce the number of required expert examples.
- A practical test would be whether the inferred rewards also respect hard constraints like design-rule violations that experts implicitly avoid but that are not explicit in the final layout data.
- Success here raises the question of whether similar final-outcome-to-reward extraction can improve RL in other engineering optimization settings beyond chip design.
Load-bearing premise
Step-by-step expert trajectories can be accurately recovered from final layouts without any extra information on the expert's intermediate choices or constraints.
What would settle it
Train the reward model on trajectories inferred from one expert layout, then run the resulting RL agent on multiple unseen designs. If its layouts are measurably worse than the expert baselines on standard metrics such as wirelength, congestion, and timing, the claim fails.
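One way to make "measurably worse" operational is a per-metric relative gap against the expert baseline. The sketch below assumes every metric is expressed so that lower is better, and the 5% tolerance, the metric names, and both function names are hypothetical choices, not values taken from the paper.

```python
from typing import Dict, List, Tuple

Metrics = Dict[str, float]


def relative_gaps(agent: Metrics, expert: Metrics) -> Dict[str, float]:
    """Per-metric (agent - expert) / expert; positive means the agent is worse.

    Assumption: every metric is lower-is-better and the expert value is nonzero.
    """
    return {name: (agent[name] - expert[name]) / expert[name] for name in expert}


def claim_survives(results: List[Tuple[Metrics, Metrics]], tolerance: float = 0.05) -> bool:
    """True if, on every unseen design, no metric exceeds the expert by more than `tolerance`.

    `results` holds one (agent_metrics, expert_metrics) pair per design.
    The default tolerance is an arbitrary illustrative threshold.
    """
    return all(
        gap <= tolerance
        for agent, expert in results
        for gap in relative_gaps(agent, expert).values()
    )
```

A stricter test would also require statistical significance across random seeds; this sketch only checks a fixed threshold.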
Original abstract
Chip placement is a critical step in physical design. While reinforcement learning (RL)-based methods have recently emerged, their training primarily focuses on wirelength optimization, and therefore often fail to achieve expert-quality layouts. We identify the reward design as the primary cause for the performance gap with experts, and instead of formalizing intricate processes, we circumvent this by directly learning from expert layouts to derive a reward model. Our approach starts from the final expert layouts to infer step-by-step expert trajectories. Using these trajectories as demonstrations or preferences, we train a model that captures the latent implicit rewards in expert results. Experiments show that our framework can efficiently learn from even a single design and generalize well to unseen cases.
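The abstract does not state the training objective for the reward model. A common choice when trajectories are used as preferences is a Bradley-Terry style loss, sketched below as a swapped-in illustration rather than the paper's confirmed formulation. The names `trajectory_return` and `preference_loss`, and the idea that `reward_fn` scores one step at a time, are assumptions.

```python
import math
from typing import Callable, Sequence, TypeVar

Step = TypeVar("Step")


def trajectory_return(reward_fn: Callable[[Step], float], trajectory: Sequence[Step]) -> float:
    """Sum of per-step rewards along one trajectory."""
    return sum(reward_fn(step) for step in trajectory)


def preference_loss(
    reward_fn: Callable[[Step], float],
    preferred: Sequence[Step],
    rejected: Sequence[Step],
) -> float:
    """Bradley-Terry negative log-likelihood: -log sigmoid(R(preferred) - R(rejected)).

    Computed as log(1 + exp(-margin)) in a form that avoids overflow
    for large positive or negative margins.
    """
    margin = trajectory_return(reward_fn, preferred) - trajectory_return(reward_fn, rejected)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this setting the "preferred" trajectory would plausibly be the one inferred from the expert layout and the "rejected" one a trajectory from a weaker policy. Minimizing the loss pushes the reward model to score expert steps higher, which is the sense in which implicit expert objectives could be captured.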
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an RL framework for chip placement that addresses the reward design gap with experts by learning an implicit reward model directly from expert layouts. It infers step-by-step trajectories from final expert configurations to serve as demonstrations or preferences, then trains a model to capture latent expert rewards. The central claim is that this enables efficient learning from even a single design with good generalization to unseen cases.
Significance. If the trajectory inference is reliable and the resulting rewards demonstrably encode expert decision quality rather than reconstruction artifacts, the approach could meaningfully advance RL-based physical design by reducing dependence on hand-crafted objectives and supporting data-efficient training. This would be a notable contribution given the scarcity of expert placement data.
major comments (1)
- [Methods (trajectory inference from final layouts)] The trajectory inference step (described in the methods following the abstract) is load-bearing for the central claim yet under-specified. Final layouts do not uniquely determine placement order or intermediate states, as many sequences can produce identical wirelengths and positions. The manuscript must detail the exact reconstruction procedure, any heuristics or constraints applied, and provide validation (e.g., comparison to recorded expert traces or sensitivity analysis) showing that inferred paths align with expert behavior rather than introducing artifacts that the reward model then learns.
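The non-uniqueness the referee describes can be shown on a toy case. The sketch below, with hypothetical macro names, positions, and nets, replays one three-macro layout under every possible placement order and confirms that each order reaches the same half-perimeter wirelength (HPWL), so the final layout carries no information about which order the expert used.

```python
from itertools import permutations
from typing import Dict, Iterable, Sequence, Tuple

Position = Tuple[float, float]


def hpwl(layout: Dict[str, Position], nets: Iterable[Sequence[str]]) -> float:
    """Half-perimeter wirelength: sum of bounding-box half-perimeters over all nets."""
    total = 0.0
    for net in nets:
        xs = [layout[name][0] for name in net]
        ys = [layout[name][1] for name in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def replay(order: Sequence[str], final_layout: Dict[str, Position]) -> Dict[str, Position]:
    """Place macros one at a time in `order`, each at its final position."""
    placed: Dict[str, Position] = {}
    for name in order:
        placed[name] = final_layout[name]
    return placed


# Toy data (illustrative only).
final_layout = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 2.0)}
nets = [("a", "b"), ("b", "c"), ("a", "b", "c")]

orders = list(permutations(final_layout))  # all 3! = 6 placement orders
values = {hpwl(replay(order, final_layout), nets) for order in orders}
```

Here `values` collapses to a single number even though `orders` has six entries. With n macros there are n! candidate orders, so any inference rule is selecting one trajectory from a very large set that the end state cannot distinguish.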
minor comments (1)
- [Abstract] The abstract states that 'experiments show' efficient learning and generalization but provides no quantitative metrics, baselines, or details on reward-model training (e.g., inverse RL formulation, preference optimization objective, or evaluation protocol). The full manuscript should include these to allow verification of the performance claims.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The point raised about the trajectory inference procedure is well-taken and central to the validity of our claims. We provide a point-by-point response below and will revise the manuscript to address the concerns.
- Referee: The trajectory inference step (described in the methods following the abstract) is load-bearing for the central claim yet under-specified. Final layouts do not uniquely determine placement order or intermediate states, as many sequences can produce identical wirelengths and positions. The manuscript must detail the exact reconstruction procedure, any heuristics or constraints applied, and provide validation (e.g., comparison to recorded expert traces or sensitivity analysis) showing that inferred paths align with expert behavior rather than introducing artifacts that the reward model then learns.
  Authors: We agree that the trajectory inference procedure is under-specified in the current manuscript and requires elaboration for reproducibility and to mitigate concerns about potential artifacts. The original submission describes the high-level approach of inferring step-by-step trajectories from final expert layouts but does not provide the full algorithmic details. In the revised manuscript, we will expand the Methods section to include the exact reconstruction procedure, the specific heuristics and constraints used, and additional validation experiments. This will include a sensitivity analysis varying key inference parameters to demonstrate robustness of the learned reward model. While our dataset consists only of final layouts and does not include recorded expert placement traces for direct comparison, we will add ablation studies contrasting our inferred trajectories against those from alternative inference strategies (e.g., random or connectivity-based orderings) to show alignment with expert-quality outcomes.
  Revision: yes
Circularity Check
No circularity detected; derivation relies on external expert data without self-referential reduction
The paper's core claim is that a reward model can be derived by starting from final expert layouts, inferring trajectories, and training on them as demonstrations or preferences. No equations, derivations, or self-citations are shown that reduce this process to a fitted parameter defined in terms of the target expert-level outcome. The method treats expert layouts as independent external input rather than constructing the result from its own predictions or prior fitted values, leaving the derivation self-contained.