pith. machine review for the scientific record.

arxiv: 2605.10910 · v1 · submitted 2026-05-11 · 🪐 quant-ph · cs.LG

Recognition: no theorem link

Equivariant Reinforcement Learning for Clifford Quantum Circuit Synthesis

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:52 UTC · model grok-4.3

classification 🪐 quant-ph cs.LG
keywords Clifford circuit synthesis · reinforcement learning · equivariant neural networks · quantum circuit compilation · symplectic matrices · gate optimization

The pith

An equivariant reinforcement learning agent synthesizes Clifford circuits for up to thirty qubits after training only on smaller instances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames Clifford circuit synthesis as a reinforcement learning task in which an agent applies elementary gates to drive a symplectic matrix representation down to the identity. A novel neural network architecture ensures the policy remains equivariant under qubit relabelings and works unchanged for any qubit count. After a curriculum of random walks on six-qubit instances and continued training on ten-qubit instances, the same policy is applied directly to thirty-qubit tableaus generated from circuits containing over a thousand gates. It returns sequences whose average two-qubit gate count is lower than that produced by Qiskit's Aaronson-Gottesman algorithm and by greedy synthesis methods.
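The framing above can be sketched with a minimal NumPy model of the state space. This is a hedged illustration, not the authors' code: it assumes one common [X | Z] column convention for the binary symplectic representation (the paper's exact layout may differ), and all names here are hypothetical.

```python
import numpy as np

def generators(n):
    """Elementary Clifford gates as 2n x 2n binary symplectic matrices,
    columns ordered [X | Z] (one common convention)."""
    gates = {}
    for i in range(n):
        h = np.eye(2 * n, dtype=np.uint8)
        h[[i, n + i]] = h[[n + i, i]]        # H_i swaps qubit i's X and Z columns
        gates[f"H{i}"] = h
        s = np.eye(2 * n, dtype=np.uint8)
        s[i, n + i] = 1                      # S_i adds the X column into the Z column
        gates[f"S{i}"] = s
    for i in range(n):
        for j in range(i + 1, n):
            cz = np.eye(2 * n, dtype=np.uint8)
            cz[i, n + j] = 1                 # CZ_{i,j} cross-couples X into Z
            cz[j, n + i] = 1
            gates[f"CZ{i}{j}"] = cz
    return gates

def step(state, gate):
    """One environment step: right-multiply by a generator, mod 2.
    The episode's goal state is the identity matrix."""
    return (state @ gate) % 2

def random_walk_target(n, length, rng):
    """Curriculum instance: a random walk of elementary gates from the identity.
    Mod 2, each generator here is its own inverse, so replaying the walk in
    reverse is one (generally non-optimal) solution the agent must beat."""
    gates = list(generators(n).values())
    m = np.eye(2 * n, dtype=np.uint8)
    for _ in range(length):
        m = step(m, gates[rng.integers(len(gates))])
    return m

rng = np.random.default_rng(0)
target = random_walk_target(6, length=40, rng=rng)  # a six-qubit training target
```

The agent's task is then to pick a gate sequence that drives `target` back to the identity; the chosen sequence, reversed, is the synthesized circuit.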

Core claim

A single size-agnostic, equivariant policy learned through reinforcement learning on random-walk curricula from the identity can be applied without modification to unseen Clifford tableaus at much larger qubit counts, where it produces circuits with fewer two-qubit gates on average than classical synthesis algorithms.

What carries the argument

An equivariant neural network that processes the symplectic matrix representation of a Clifford operation equivariantly under qubit permutations, allowing one learned policy to be reused across qubit counts.
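Concretely, a qubit relabeling acts on the 2n × 2n representation by conjugation with a block permutation, and it carries each elementary gate to its relabeled counterpart; an equivariant policy is one whose gate preferences transform the same way. A small sketch under a hypothetical [X | Z] column convention (illustrative names, not the paper's code):

```python
import numpy as np

def cz(n, i, j):
    """CZ_{i,j} as a binary symplectic matrix, columns ordered [X | Z]."""
    m = np.eye(2 * n, dtype=np.uint8)
    m[i, n + j] = 1
    m[j, n + i] = 1
    return m

def relabel(m, perm):
    """Apply a qubit permutation to a tableau: conjugate by P (+) P,
    where P is the permutation matrix with P e_k = e_perm[k]."""
    n = m.shape[0] // 2
    p = np.zeros((n, n), dtype=np.uint8)
    for k, pk in enumerate(perm):
        p[pk, k] = 1
    q = np.kron(np.eye(2, dtype=np.uint8), p)  # block-diagonal [P, 0; 0, P]
    return (q @ m @ q.T) % 2

# Relabeling qubits maps the CZ(0, 1) generator to CZ(perm[0], perm[1]);
# an equivariant policy must therefore rank relabeled gates identically.
perm = [2, 0, 3, 1]
lhs = relabel(cz(4, 0, 1), perm)
rhs = cz(4, perm[0], perm[1])
```

Because this action never depends on n, the same equivariant network can in principle consume tableaus of any size, which is what makes the single size-agnostic policy possible.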

If this is right

  • On six-qubit instances the agent reaches circuits within one two-qubit gate of optimality in milliseconds and finds provably optimal circuits in 99.2 percent of cases within seconds.
  • After training on ten-qubit instances the same policy scales without retraining or splicing to thirty-qubit targets generated from circuits longer than one thousand gates.
  • The resulting circuits exhibit lower average two-qubit gate counts than both the Aaronson-Gottesman algorithm and greedy Clifford synthesizers implemented in Qiskit.
  • The architecture requires no circuit splicing and no reparameterization when qubit count changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same representation and training recipe could be tested on synthesis problems for other discrete gate sets that admit matrix representations.
  • If the generalization holds for still larger qubit numbers, the method could supply short Clifford subroutines inside larger variational or fault-tolerant circuits.
  • One could measure how performance degrades when the target tableaus are drawn from distributions far from random walks, such as those arising from specific quantum algorithms.

Load-bearing premise

The learned policies continue to generalize reliably from the distribution of random-walk-generated circuits at six to ten qubits to arbitrary tableaus at thirty qubits.

What would settle it

Apply the trained agent to a collection of thirty-qubit Clifford tableaus whose minimal two-qubit gate counts are independently known or computable by exhaustive search on smaller subproblems, then check whether the agent's average gate count matches or undercuts those minima.
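One ingredient of such a test is computable exactly for very small systems. A hedged sketch (illustrative convention and names, not the paper's method): a 0-1 breadth-first search over Sp(4, F₂), the two-qubit symplectic group with 720 elements, with single-qubit gates free and CZ costing 1, yields the true minimal two-qubit gate count for every two-qubit tableau.

```python
import numpy as np
from collections import deque

def generators(n):
    """Single-qubit gates (cost 0) and CZ gates (cost 1) as binary
    symplectic matrices, columns ordered [X | Z]."""
    free, costly = [], []
    for i in range(n):
        h = np.eye(2 * n, dtype=np.uint8)
        h[[i, n + i]] = h[[n + i, i]]  # Hadamard: swap X/Z columns
        free.append(h)
        s = np.eye(2 * n, dtype=np.uint8)
        s[i, n + i] = 1                # phase gate: add X column into Z column
        free.append(s)
    for i in range(n):
        for j in range(i + 1, n):
            c = np.eye(2 * n, dtype=np.uint8)
            c[i, n + j] = 1
            c[j, n + i] = 1
            costly.append(c)
    return free, costly

def min_cz_counts(n):
    """0-1 BFS from the identity: exact minimal number of two-qubit gates
    needed to build each reachable symplectic matrix."""
    free, costly = generators(n)
    start = np.eye(2 * n, dtype=np.uint8)
    dist = {start.tobytes(): 0}
    dq = deque([(start, 0)])
    while dq:
        m, d = dq.popleft()
        if d > dist[m.tobytes()]:
            continue  # stale queue entry
        for gates, cost in ((free, 0), (costly, 1)):
            for g in gates:
                nxt = (m @ g) % 2
                key = nxt.tobytes()
                if key not in dist or d + cost < dist[key]:
                    dist[key] = d + cost
                    # cost-0 edges go to the front of the deque, cost-1 to the back
                    (dq.appendleft if cost == 0 else dq.append)((nxt, d + cost))
    return dist

counts = min_cz_counts(2)  # |Sp(4, F_2)| = 720 tableaus, each with an exact CZ floor
```

Beyond two or three qubits this enumeration explodes, which is exactly why the settling test above has to fall back on known minima or exhaustive search over subproblems.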

Figures

Figures reproduced from arXiv: 2605.10910 by Aleks Kissinger, Richie Yeung, Rob Cornish.

Figure 1. A four-qubit Clifford circuit with 5 CZ gates and its corresponding stabilizer tableau, shown …

Figure 2. Symplectic generator matrices Hi, Si, and CZi,j. Right-multiplying any symplectic matrix by one of these generators applies the highlighted column operations, leaving the other columns untouched. In general, there are many different sequences G1 ··· Gk of different lengths that produce the same overall tableau Mtarget. As such, for practical purposes, it is desirable to solve (2) in a way that is …

Figure 3. Architecture of the permutation-equivariant policy used for Clifford synthesis. The input …

Figure 4. CZ-count comparison for the ten-qubit-trained model at 10, 15, and 24 evaluation qubits.
read the original abstract

We consider the problem of synthesizing Clifford quantum circuits for devices with all-to-all qubit connectivity. We approach this task as a reinforcement learning problem in which an agent learns to discover a sequence of elementary Clifford gates that reduces a given symplectic matrix representation of a Clifford circuit to the identity. This formulation permits a simple learning curriculum based on random walks from the identity. We introduce a novel neural network architecture that is equivariant to qubit relabelings of the symplectic matrix representation, and which is size-agnostic, allowing a single learned policy to be applied across different qubit counts without circuit splicing or network reparameterization. On six-qubit Clifford circuits, the largest regime for which optimal references are available, our agent finds circuits within one two-qubit gate of optimality in milliseconds per instance, and finds optimal circuits in 99.2% of instances within seconds per instance. After continued training on ten-qubit instances, the agent scales to unseen Clifford tableaus with up to thirty qubits, including targets generated from circuits with over a thousand Clifford gates, where it achieves lower average two-qubit gate counts than Qiskit's Aaronson-Gottesman and greedy Clifford synthesizers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a reinforcement learning approach to Clifford circuit synthesis for all-to-all qubit connectivity. An agent learns a policy that applies elementary Clifford gates to reduce a symplectic matrix representation of a target Clifford operation to the identity. Training employs a random-walk curriculum starting from the identity on 6-10 qubit instances. The policy network is equivariant under qubit relabelings and size-agnostic, enabling application across qubit counts without reparameterization. On 6-qubit benchmarks the agent reaches circuits within one two-qubit gate of optimality in milliseconds and finds optimal circuits in 99.2% of cases within seconds. After continued training on 10-qubit instances, the same policy is applied to unseen 30-qubit tableaus (including targets generated by circuits with >1000 gates), where it achieves lower average two-qubit gate counts than Qiskit's Aaronson-Gottesman and greedy synthesizers.

Significance. If the reported scaling from 6-10 qubit training to 30-qubit test instances is robust, the work would constitute a practical advance in automated Clifford synthesis by demonstrating a single learned policy that generalizes across system sizes. The equivariant, size-agnostic architecture is a clear technical strength that avoids circuit splicing or retraining. Concrete performance numbers on standard benchmarks and the scaling demonstration are provided, though the absence of error bars, ablation studies, and distribution diagnostics limits the strength of the generalization claim. The approach could influence quantum compilation pipelines if the empirical results hold under broader validation.

major comments (2)
  1. [Abstract] Abstract: The headline scaling result—that a policy trained exclusively on random walks from the identity on 6-10 qubits produces shorter two-qubit-gate sequences on 30-qubit symplectic tableaus generated by circuits with >1000 gates—depends on reliable generalization. No comparison of matrix invariants (e.g., distribution of 2×2 blocks, symplectic rank after partial reduction, or Hamming weight of off-diagonal blocks) between the training distribution and the 30-qubit test set is described. This omission is load-bearing because random walks on small n generate bounded-support matrices while long walks on large n can produce qualitatively different block statistics; without such diagnostics the reported improvement over Aaronson-Gottesman and greedy baselines could reflect distribution mismatch rather than learned generalization.
  2. [Scaling experiments] Scaling experiments: The performance claims on 6-qubit optimality (99.2% within seconds) and the 30-qubit average-gate-count improvements are presented without error bars, number of independent trials, or ablation studies isolating the contribution of the equivariant layers versus the curriculum. These details are necessary to evaluate the reliability of the generalization claim and the specific benefit of the proposed architecture.
minor comments (2)
  1. The abstract states that the agent 'achieves lower average two-qubit gate counts' on 30-qubit instances but does not report the numerical averages, standard deviations, or the exact number of test instances used.
  2. Clarify the precise distribution of random-walk lengths and gate choices used in the curriculum for the 6-qubit and 10-qubit training phases.
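The distribution question in major comment 1 is cheap to probe. A sketch of one such diagnostic (illustrative convention, walk lengths, and sample sizes, not the paper's or the referee's exact setup): compare the off-diagonal block Hamming weight of short-walk training-style targets against long-walk targets, which approach uniformly random symplectic matrices with roughly half their entries set.

```python
import numpy as np

def generators(n):
    """Elementary Clifford gates as binary symplectic matrices ([X | Z] columns)."""
    gates = []
    for i in range(n):
        h = np.eye(2 * n, dtype=np.uint8)
        h[[i, n + i]] = h[[n + i, i]]  # Hadamard
        gates.append(h)
        s = np.eye(2 * n, dtype=np.uint8)
        s[i, n + i] = 1                # phase gate
        gates.append(s)
    for i in range(n):
        for j in range(i + 1, n):
            c = np.eye(2 * n, dtype=np.uint8)
            c[i, n + j] = 1            # CZ
            c[j, n + i] = 1
            gates.append(c)
    return gates

def random_walk(n, length, rng):
    gates = generators(n)
    m = np.eye(2 * n, dtype=np.uint8)
    for _ in range(length):
        m = (m @ gates[rng.integers(len(gates))]) % 2
    return m

def offdiag_weight(m):
    """Hamming weight of the two off-diagonal n x n blocks: one of the
    invariants the report suggests comparing across distributions."""
    n = m.shape[0] // 2
    return int(m[:n, n:].sum()) + int(m[n:, :n].sum())

rng = np.random.default_rng(1)
n = 6
short_mean = np.mean([offdiag_weight(random_walk(n, 10, rng)) for _ in range(50)])
long_mean = np.mean([offdiag_weight(random_walk(n, 1000, rng)) for _ in range(50)])
# A large gap between short_mean and long_mean signals distribution shift
# between curriculum instances and deep-circuit targets.
```

Histograms of this statistic for the actual training and 30-qubit test sets would directly address whether the reported gains reflect generalization or mismatch.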

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below and agree that the requested additions will strengthen the manuscript's claims regarding generalization and reliability. We plan to incorporate the suggested analyses and statistical details in the revised version.

read point-by-point responses
  1. Referee: The headline scaling result depends on reliable generalization. No comparison of matrix invariants (e.g., distribution of 2×2 blocks, symplectic rank after partial reduction, or Hamming weight of off-diagonal blocks) between the training distribution and the 30-qubit test set is described. This omission is load-bearing because random walks on small n generate bounded-support matrices while long walks on large n can produce qualitatively different block statistics.

    Authors: We agree that explicit comparison of matrix invariants would provide stronger support for the generalization claim. The training curriculum uses random walks starting from the identity on 6-10 qubits, while test instances include tableaus from circuits with >1000 gates on 30 qubits. Although the policy is applied zero-shot after training on 10-qubit instances, we did not report invariant statistics in the original manuscript. In revision, we will add an appendix with quantitative comparisons (e.g., histograms of off-diagonal block Hamming weights, average symplectic rank, and sparsity measures) for both the training distribution and the 30-qubit test set to address potential distribution mismatch. revision: yes

  2. Referee: The performance claims on 6-qubit optimality (99.2% within seconds) and the 30-qubit average-gate-count improvements are presented without error bars, number of independent trials, or ablation studies isolating the contribution of the equivariant layers versus the curriculum.

    Authors: The referee is correct that error bars, trial counts, and ablations are absent. The 99.2% optimality figure and 30-qubit averages are computed over the respective test sets from single training runs. In the revision, we will rerun training with multiple random seeds (at least 5) and report means with standard deviations for all key metrics. We will also include an ablation comparing the equivariant policy network to a non-equivariant baseline (with otherwise identical architecture and curriculum) to quantify the benefit of equivariance. The curriculum's role will be discussed explicitly, though full isolation of every component may require additional experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical RL results stand on external baselines

full rationale

The paper describes an RL training procedure (random-walk curriculum on symplectic matrices) and an equivariant size-agnostic network architecture, then reports empirical performance on held-out 6-qubit and 30-qubit instances against independent external synthesizers (Qiskit Aaronson-Gottesman and greedy). No derivation, uniqueness theorem, or ansatz is invoked that reduces by construction to a fitted parameter, self-citation, or renamed input; the scaling claim is a direct experimental comparison rather than a mathematical prediction forced by the training distribution. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard mathematical representations of Clifford circuits and standard RL assumptions; beyond the two axioms listed below, no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • standard math Clifford operations admit a faithful symplectic matrix representation over GF(2)
    Standard fact from quantum information theory used to define the state space.
  • domain assumption A policy trained via RL on random-walk curricula will generalize to arbitrary target tableaus
    Core modeling choice that enables the scaling claim.

pith-pipeline@v0.9.0 · 5500 in / 1403 out tokens · 57772 ms · 2026-05-12T03:52:55.210326+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 3 internal anchors
