arxiv: 2605.13638 · v1 · submitted 2026-05-13 · 🪐 quant-ph · cs.LG

Recognition: unknown

CO-MAP: A Reinforcement Learning Approach to the Qubit Allocation Problem

Ankit Kulshrestha , Xiaoyuan Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-14 17:47 UTC · model grok-4.3

classification 🪐 quant-ph cs.LG

keywords qubit mappingreinforcement learningquantum compilationSWAP overheadcombinatorial optimizationMQTBenchQueko circuitslocal search

0 comments

The pith

A reinforcement learning policy trained on a combinatorial formulation cuts SWAP overhead by 65-85 percent on standard quantum circuit benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats the qubit allocation step in quantum compilation as a combinatorial optimization problem whose objective is to minimize the number of additional SWAP gates needed to respect hardware connectivity. Instead of using random or heuristic assignments, the authors train a reinforcement learning policy that learns to produce high-quality mappings directly from circuit features. They add a local-search post-processing step that further trims the remaining overhead. When evaluated on MQTBench and Queko circuits, the resulting mappings require 65-85 percent fewer SWAPs than those produced by current compilers. A reader would care because every avoided SWAP reduces both runtime and noise accumulation on present-day quantum processors.

Core claim

We formulate the qubit mapping problem with a combinatorial optimization objective. We then present a method to find a solution to the CO problem by training a reinforcement learning policy. We also propose a local search based post-processing algorithm to further reduce the overhead. Our results show a dramatic improvement over conventional techniques in reducing the number of SWAPs. On different real world datasets like MQTBench and Queko circuits, our trained policy achieves a 65-85 percent reduction in SWAP overhead when compared to existing quantum compilers.

What carries the argument

A reinforcement learning policy that learns to solve a combinatorial optimization objective for logical-to-physical qubit assignment, followed by local-search refinement to minimize SWAP count.

If this is right

Fewer SWAPs allow deeper or wider circuits to execute before decoherence dominates.
The learned policy can replace the mapping heuristic inside any standard quantum compiler stack.
Training once on representative circuits yields reusable mappings across similar hardware topologies.
The post-processing local search can be applied independently to polish any initial mapping.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the policy scales without retraining, compiler designers could shift effort from hand-tuning heuristics to curating diverse training sets.
The same RL framing might be applied to related compilation subproblems such as gate scheduling or initial layout selection.
Hardware-specific connectivity graphs could be encoded as part of the state so that one policy works across different quantum processors.
Combining the reduced SWAP count with error-mitigation techniques could compound the effective fidelity gain on NISQ devices.

Load-bearing premise

The trained reinforcement learning policy generalizes to new circuits without overfitting to the specific training benchmarks.

What would settle it

Measure the SWAP reduction on a fresh collection of circuits drawn from the same distribution as MQTBench and Queko but withheld from training and validation; the claim holds only if the reduction remains in the 65-85 percent range against the same baselines.

Figures

Figures reproduced from arXiv: 2605.13638 by Ankit Kulshrestha, Xiaoyuan Liu.

**Figure 2.** Figure 2: The policy network encoder accepts either a program or a coupling graph and produces [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: The decoder accepts program and coupling graph node embeddings from the encoder. At [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Our results are obtained by running the policy network on the test set using a single NVIDIA [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 4.** Figure 4: Time taken to produce mapping with our method vs Qiskit SABRE compiler. A) shows [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: The 65-qubit IBM “heavy hex” lattice device graph topology. [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Improvements in SWAPs obtained with our solver and various optimization level settings [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Number of SWAPs produced by our proposed method vs Qiskit SABRE compiler baseline [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Number of SWAPs produced for different number of CNOT gates in program qubits. [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

read the original abstract

A quantum compiler is a critical piece in the quantum computing pipeline since it allows an abstract quantum circuit to be run on a physical quantum computer. One extremely important subproblem in quantum compilation is the generation of a logical to physical qubit mapping. Typically in quantum compilers this step is either implemented as a random or a heuristic based assignment that aims to minimize additional (SWAP) gate overhead in the quantum circuit. In this paper, we present an alternative approach to solving the qubit mapping problem. Specifically, we formulate the qubit mapping problem with a combinatorial optimization (CO) objective. We then present a method to find a solution to the CO problem by training a reinforcement learning (RL) policy. We also propose a local search based post-processing algorithm to further reduce the overhead. Our results show a dramatic improvement over conventional techniques in reducing the number of SWAPs. On different real world datasets like MQTBench and Queko circuits, our trained policy achieves a \textbf{65-85\%} reduction in SWAP overhead when compared to existing quantum compilers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RL plus local search for qubit mapping claims big SWAP cuts but the abstract leaves the training and evaluation details too thin to judge if the gains are real or benchmark-specific.

read the letter

The new piece is framing qubit allocation as a combinatorial optimization task and training an RL policy to produce the mapping, followed by a local search cleanup. That is a step past the usual random or heuristic assignments that most compilers still use. The reported 65-85% SWAP reduction on MQTBench and Queko circuits is the headline result, and if it survives proper controls it would be a useful practical improvement for NISQ compilation overhead.

Referee Report

3 major / 1 minor

Summary. This paper presents CO-MAP, a reinforcement learning method for the qubit allocation problem in quantum compilation. It formulates the problem as a combinatorial optimization task, trains an RL policy to find mappings, and applies a local search post-processing algorithm. The main result is a reported 65-85% reduction in SWAP overhead on MQTBench and Queko circuit datasets compared to existing compilers.

Significance. Should the empirical results prove robust under proper validation protocols, the approach could significantly enhance the efficiency of quantum circuit compilation by minimizing SWAP insertions, which is critical for NISQ-era devices. The integration of RL with local search represents a potentially scalable alternative to traditional heuristics.

major comments (3)

[Abstract] The headline claim of 65-85% SWAP reduction lacks any mention of training procedures, baseline definitions, statistical tests, or circuit details, rendering the soundness of the empirical comparison unevaluable from the provided text.
[Evaluation] No held-out test split or cross-family evaluation (e.g., on circuits with different depth or entanglement) is described, so it remains unclear whether the RL policy generalizes to unseen circuits or overfits to the MQTBench/Queko training distribution.
[Results] The absence of an ablation removing the local-search post-processor means the improvements cannot be confidently attributed to the RL component alone, as required to support the central claim.

minor comments (1)

[Abstract] The term 'real world datasets' is vague; specific details on circuit selection, sizes, and any preprocessing should be clarified.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We agree that the current presentation of the empirical results requires clarification and additional validation to strengthen the claims. We will revise the manuscript accordingly, expanding the abstract, adding explicit generalization tests, and including an ablation study. Our point-by-point responses follow.

read point-by-point responses

Referee: [Abstract] The headline claim of 65-85% SWAP reduction lacks any mention of training procedures, baseline definitions, statistical tests, or circuit details, rendering the soundness of the empirical comparison unevaluable from the provided text.

Authors: We agree that the abstract is too concise and omits key experimental details. In the revision we will expand it to state: (i) the RL policy is trained via PPO on a combinatorial optimization formulation of qubit mapping using a graph neural network encoder; (ii) baselines are the default mappers in Qiskit and t|ket>; (iii) results are averaged over 5 independent training seeds with standard deviation reported; and (iv) the 65-85% figure is the range of average SWAP reductions observed across the MQTBench and Queko suites. Full circuit lists, hyperparameters, and statistical tests will remain in the main text and appendix. revision: yes
Referee: [Evaluation] No held-out test split or cross-family evaluation (e.g., on circuits with different depth or entanglement) is described, so it remains unclear whether the RL policy generalizes to unseen circuits or overfits to the MQTBench/Queko training distribution.

Authors: The manuscript currently reports results on the full MQTBench and Queko collections after training on a random 70% subset, but does not explicitly label a held-out partition or perform cross-family tests. We will add a dedicated generalization section that (a) reports performance on the remaining 30% held-out circuits from the same families, (b) evaluates on additional circuit families with systematically varied depth and entanglement (e.g., random Clifford circuits and QAOA instances), and (c) includes a learning-curve plot showing that test SWAP overhead stabilizes without degradation, supporting that the policy does not overfit the training distribution. revision: yes
Referee: [Results] The absence of an ablation removing the local-search post-processor means the improvements cannot be confidently attributed to the RL component alone, as required to support the central claim.

Authors: We acknowledge that the current results combine the RL policy with the local-search post-processor, making it impossible to isolate the RL contribution. In the revised manuscript we will add an ablation table that reports SWAP overhead for (1) the RL policy alone, (2) the RL policy plus local search, (3) local search applied to random mappings, and (4) the original baselines. This will allow readers to quantify the incremental benefit of each stage and directly support the claim that the learned policy is the primary driver of the observed reductions. revision: yes

Circularity Check

0 steps flagged

No derivation chain or self-referential predictions present

full rationale

The paper formulates qubit allocation as a combinatorial optimization problem solved via RL training plus local search post-processing, then reports empirical SWAP reductions (65-85%) on MQTBench and Queko circuits versus external compilers. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Performance claims rest on direct comparison to independent baselines rather than any internal reduction to the training inputs. This is a standard empirical ML result with external benchmarks, so no circularity is present.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. Standard RL training hyperparameters are implicitly present but unspecified.

pith-pipeline@v0.9.0 · 5482 in / 1014 out tokens · 51905 ms · 2026-05-14T17:47:24.750136+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 13 canonical work pages · 11 internal anchors

[1]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

Neural Combinatorial Optimization with Reinforcement Learning

Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combina- torial optimization with reinforcement learning.arXiv preprint arXiv:1611.09940, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[3]

Machine learning for combinatorial optimization: a methodological tour d’horizon.European Journal of Operational Research, 290(2):405–421, 2021

Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon.European Journal of Operational Research, 290(2):405–421, 2021

2021
[4]

RL4CO: an Extensive Reinforcement Learning for Combinatorial Optimization Benchmark

Federico Berto, Chuanbo Hua, Junyoung Park, Laurin Luttmann, Yining Ma, Fanchen Bu, Jiarui Wang, Haoran Ye, Minsu Kim, Sanghyeok Choi, Nayeli Gast Zepeda, André Hottung, Jianan Zhou, Jieyi Bi, Yu Hu, Fei Liu, Hyeonah Kim, Jiwoo Son, Haeyeon Kim, Davide Angioni, Wouter Kool, Zhiguang Cao, Jie Zhang, Kijung Shin, Cathy Wu, Sungsoo Ahn, Guojie Song, Changhyu...

2025
[5]

Quantum Compiler Optimizations

Jeffrey Booth Jr. Quantum compiler optimizations.arXiv preprint arXiv:1206.3348, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012
[6]

Variational quantum algorithms.Nature Reviews Physics, 3(9):625–644, 2021

Marco Cerezo, Andrew Arrasmith, Ryan Babbush, Simon C Benjamin, Suguru Endo, Keisuke Fujii, Jarrod R McClean, Kosuke Mitarai, Xiao Yuan, Lukasz Cincio, et al. Variational quantum algorithms.Nature Reviews Physics, 3(9):625–644, 2021

2021
[7]

Programming languages and compiler design for realistic quantum hardware.Nature, 549(7671):180–187, 2017

Frederic T Chong, Diana Franklin, and Margaret Martonosi. Programming languages and compiler design for realistic quantum hardware.Nature, 549(7671):180–187, 2017

2017
[8]

A Quantum Approximate Optimization Algorithm

Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A quantum approximate optimization algorithm.arXiv preprint arXiv:1411.4028, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[9]

Winner takes it all: Training performant rl populations for combinatorial optimization.Advances in Neural Information Processing Systems, 36:48485–48509, 2023

Nathan Grinsztajn, Daniel Furelos-Blanco, Shikha Surana, Clément Bonnet, and Tom Barrett. Winner takes it all: Training performant rl populations for combinatorial optimization.Advances in Neural Information Processing Systems, 36:48485–48509, 2023

2023
[10]

Reinforcement learning and dear framework for solving the qubit mapping problem

Ching-Yao Huang, Chi-Hsiang Lien, and Wai-Kei Mak. Reinforcement learning and dear framework for solving the qubit mapping problem. InProceedings of the 41st IEEE/ACM international conference on computer-aided design, pages 1–9, 2022

2022
[11]

Batch normalization: Accelerating deep network training by reducing internal covariate shift

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. InInternational conference on machine learning, pages 448–456. pmlr, 2015

2015
[12]

Quantum computing with Qiskit

Ali Javadi-Abhari, Matthew Treinish, Kevin Krsulich, Christopher J Wood, Jake Lishman, Julien Gacon, Simon Martiel, Paul D Nation, Lev S Bishop, Andrew W Cross, et al. Quantum computing with qiskit.arXiv preprint arXiv:2405.08810, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Evidence for the utility of quantum computing before fault tolerance.Nature, 618(7965):500–505, 2023

Youngseok Kim, Andrew Eddins, Sajant Anand, Ken Xuan Wei, Ewout Van Den Berg, Sami Rosenblatt, Hasan Nayfeh, Yantao Wu, Michael Zaletel, Kristan Temme, et al. Evidence for the utility of quantum computing before fault tolerance.Nature, 618(7965):500–505, 2023

2023
[14]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[15]

Quantum measurements and the Abelian Stabilizer Problem

A Yu Kitaev. Quantum measurements and the abelian stabilizer problem.arXiv preprint quant-ph/9511026, 1995

work page internal anchor Pith review Pith/arXiv arXiv 1995
[16]

Number 47

Alexei Yu Kitaev, Alexander Shen, and Mikhail N Vyalyi.Classical and quantum computation. Number 47. American Mathematical Soc., 2002

2002
[17]

Attention, Learn to Solve Routing Problems!

Wouter Kool, Herke Van Hoof, and Max Welling. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475, 2018. 10

work page internal anchor Pith review Pith/arXiv arXiv 2018
[18]

Practical and eﬃcient quantum circuit synthesis and transpiling with rei nforcement learning,

David Kremer, Victor Villar, Hanhee Paik, Ivan Duran, Ismael Faro, and Juan Cruz-Benito. Practical and efficient quantum circuit synthesis and transpiling with reinforcement learning. arXiv preprint arXiv:2405.13196, 2024

work page arXiv 2024
[19]

On the practical usefulness of the hardware efficient ansatz.Quantum, 8:1395, 2024

Lorenzo Leone, Salvatore FE Oliviero, Lukasz Cincio, and Marco Cerezo. On the practical usefulness of the hardware efficient ansatz.Quantum, 8:1395, 2024

2024
[20]

Tackling the qubit mapping problem for nisq-era quantum devices

Gushu Li, Yufei Ding, and Yuan Xie. Tackling the qubit mapping problem for nisq-era quantum devices. InProceedings of the twenty-fourth international conference on architectural support for programming languages and operating systems, pages 1001–1014, 2019

2019
[21]

Graph normalizing flows.Advances in Neural Information Processing Systems, 32, 2019

Jenny Liu, Aviral Kumar, Jimmy Ba, Jamie Kiros, and Kevin Swersky. Graph normalizing flows.Advances in Neural Information Processing Systems, 32, 2019

2019
[22]

Leveraging special-purpose hardware for local search heuristics.Computational Optimization and Applications, 82(1):1–29, 2022

Xiaoyuan Liu, Hayato Ushijima-Mwesigwa, Avradip Mandal, Sarvagya Upadhyay, Ilya Safro, and Arnab Roy. Leveraging special-purpose hardware for local search heuristics.Computational Optimization and Applications, 82(1):1–29, 2022

2022
[23]

Generating compilers for qubit mapping and routing.Proceedings of the ACM on Programming Languages, 10(POPL):2265–2294, 2026

Abtin Molavi, Amanda Xu, Ethan Cecchetti, Swamit Tannu, and Aws Albarghouthi. Generating compilers for qubit mapping and routing.Proceedings of the ACM on Programming Languages, 10(POPL):2265–2294, 2026

2026
[24]

Ionq quantum computers: clear to scale

Christopher Monroe. Ionq quantum computers: clear to scale. InAPS March Meeting Abstracts, volume 2021, pages P10–002, 2021

2021
[25]

The ibm quantum heavy hex lattice, 2021

Paul Nation. The ibm quantum heavy hex lattice, 2021

2021
[26]

Cambridge university press, 2010

Michael A Nielsen and Isaac L Chuang.Quantum computation and quantum information. Cambridge university press, 2010

2010
[27]

The pagerank citation ranking: Bringing order to the web

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford infolab, 1999

1999
[28]

Peruzzo, J

A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O’Brien. A variational eigenvalue solver on a photonic quantum processor.Nature Communications, 5:4213, 2014

2014
[29]

Attention-Based Deep Reinforcement Learning for Qubit Allocation in Modular Quantum Architectures

Enrico Russo, Maurizio Palesi, Davide Patti, Giuseppe Ascia, and Vincenzo Catania. Attention- based deep reinforcement learning for qubit allocation in modular quantum architectures.arXiv preprint arXiv:2406.11452, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[31]

Hill-climbing search.Encyclopedia of cognitive science, 81(333-335):10, 2006

Bart Selman and Carla P Gomes. Hill-climbing search.Encyclopedia of cognitive science, 81(333-335):10, 2006

2006
[32]

Qubit allocation

Marcos Yukio Siraichi, Vinícius Fernandes dos Santos, Caroline Collange, and Fernando Magno Quintão Pereira. Qubit allocation. InProceedings of the 2018 international symposium on code generation and optimization, pages 113–125, 2018

2018
[33]

An open-source, industrial-strength optimizing compiler for quantum programs.Quantum Science & Technology, 5(4):044001, 2020

Robert S Smith, Eric C Peterson, Mark G Skilbeck, and Erik J Davis. An open-source, industrial-strength optimizing compiler for quantum programs.Quantum Science & Technology, 5(4):044001, 2020

2020
[34]

Optimality study of existing quantum computing layout synthesis tools.IEEE Transactions on Computers, July 2020

Bochen Tan and Jason Cong. Optimality study of existing quantum computing layout synthesis tools.IEEE Transactions on Computers, July 2020

2020
[35]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

2017
[36]

Graph Attention Networks

Petar Veliˇckovi´c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks.arXiv preprint arXiv:1710.10903, 2017. 11

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

Pointer networks.Advances in neural information processing systems, 28, 2015

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks.Advances in neural information processing systems, 28, 2015

2015
[38]

The MQT handbook: A summary of design automation tools and software for quantum computing

Robert Wille, Lucas Berent, Tobias Forster, Jagatheesan Kunasaikaran, Kevin Mato, Tom Peham, Nils Quetschlich, Damian Rovara, Aaron Sander, Ludwig Schmid, Daniel Schoenberger, Yannick Stade, and Lukas Burgholzer. The MQT handbook: A summary of design automation tools and software for quantum computing. InIEEE International Conference on Quantum Software (...

2024
[39]

Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine learning, 8(3):229–256, 1992

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine learning, 8(3):229–256, 1992

1992
[40]

Graph pointer neural networks

Tianmeng Yang, Yujing Wang, Zhihan Yue, Yaming Yang, Yunhai Tong, and Jing Bai. Graph pointer neural networks. InProceedings of the AAAI conference on artificial intelligence, volume 36, pages 8832–8839, 2022

2022
[41]

Learning local search heuristics for boolean satisfiability

Emre Yolcu and Barnabás Póczos. Learning local search heuristics for boolean satisfiability. Advances in Neural Information Processing Systems, 32, 2019

2019
[42]

influence

Henry Zou, Matthew Treinish, Kevin Hartman, Alexander Ivrii, and Jake Lishman. Lightsabre: A lightweight and enhanced sabre algorithm.arXiv preprint arXiv:2409.08368, 2024. 12 A Pre-Processing Circuit Features for Qubit Allocation Our algorithm does not depend on having access to input node features during placement. However, having access to good input p...

work page arXiv 2024