Equivariant Reinforcement Learning for Clifford Quantum Circuit Synthesis
Pith reviewed 2026-05-12 03:52 UTC · model grok-4.3
The pith
An equivariant reinforcement learning agent synthesizes Clifford circuits for up to thirty qubits after training only on smaller instances.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A single size-agnostic, equivariant policy learned through reinforcement learning on random-walk curricula from the identity can be applied without modification to unseen Clifford tableaus at much larger qubit counts, where it produces circuits with fewer two-qubit gates on average than classical synthesis algorithms.
What carries the argument
A size-agnostic neural network that processes the symplectic matrix representation of a Clifford operation equivariantly under qubit relabelings, allowing one learned policy to be reused across qubit counts.
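The symplectic picture this claim relies on is concrete enough to sketch. Below is a minimal NumPy rendering of the GF(2) symplectic matrices of elementary Clifford gates and of the qubit-relabeling group action an equivariant network must respect; the (x_1..x_n, z_1..z_n) row convention and all function names are our assumptions, not the paper's.

```python
import numpy as np

def eye(n):
    return np.eye(2 * n, dtype=np.uint8)

def h_gate(n, i):
    # Hadamard on qubit i swaps the x- and z-rows of that qubit.
    m = eye(n)
    m[[i, n + i]] = m[[n + i, i]]
    return m

def s_gate(n, i):
    # Phase gate S: X -> Y = XZ, so the z-row picks up the x-row (mod 2).
    m = eye(n)
    m[n + i, i] = 1
    return m

def cnot_gate(n, c, t):
    # CNOT: X_c -> X_c X_t and Z_t -> Z_c Z_t.
    m = eye(n)
    m[t, c] = 1
    m[n + c, n + t] = 1
    return m

def compose(*mats):
    # compose(g1, g2, ...) applies g1 first: result = ... @ g2 @ g1 (mod 2).
    out = mats[0]
    for m in mats[1:]:
        out = (m @ out) % 2
    return out

def is_symplectic(m, n):
    # M is symplectic iff M^T Omega M = Omega over GF(2),
    # with Omega the block anti-diagonal form for this row ordering.
    z = np.zeros((n, n), dtype=np.uint8)
    i = np.eye(n, dtype=np.uint8)
    omega = np.block([[z, i], [i, z]])
    return np.array_equal((m.T @ omega @ m) % 2, omega)

def permute_qubits(m, n, perm):
    # Qubit relabeling acts by the same permutation on the x- and z-blocks:
    # M -> P M P^T with P induced by perm.
    idx = np.concatenate([np.asarray(perm), n + np.asarray(perm)])
    return m[np.ix_(idx, idx)]
```

Equivariance of a policy then means the policy's action distribution on the permuted matrix is the correspondingly permuted distribution; the sketch only demonstrates the group action on inputs, e.g. relabeling the two qubits of CNOT(0,1) yields exactly the matrix of CNOT(1,0).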
If this is right
- On six-qubit instances the agent reaches circuits within one two-qubit gate of optimality in milliseconds and finds provably optimal circuits in 99.2 percent of cases within seconds.
- After continued training on ten-qubit instances, the same policy transfers without retraining or circuit splicing to thirty-qubit targets generated from circuits longer than one thousand gates.
- The resulting circuits exhibit lower average two-qubit gate counts than both the Aaronson-Gottesman algorithm and greedy Clifford synthesizers implemented in Qiskit.
- The architecture requires no circuit splicing and no reparameterization when qubit count changes.
Where Pith is reading between the lines
- The same representation and training recipe could be tested on synthesis problems for other discrete gate sets that admit matrix representations.
- If the generalization holds for still larger qubit numbers, the method could supply short Clifford subroutines inside larger variational or fault-tolerant circuits.
- One could measure how performance degrades when the target tableaus are drawn from distributions far from random walks, such as those arising from specific quantum algorithms.
Load-bearing premise
The learned policies continue to generalize reliably from the distribution of random-walk-generated circuits at six to ten qubits to arbitrary tableaus at thirty qubits.
What would settle it
Apply the trained agent to a collection of thirty-qubit Clifford tableaus whose minimal two-qubit gate counts are independently known or computable by exhaustive search on smaller subproblems, then check whether the agent's average gate count matches or undercuts those minima.
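For two qubits this check can be made fully concrete: Sp(4, F2) has only 720 elements, so exact minimal two-qubit-gate counts are computable by a 0-1 breadth-first search that treats single-qubit gates as free. A self-contained sketch (gate encodings, cost model, and function names are ours; the paper's six-qubit optimal references come from an external database, not this construction):

```python
from collections import deque
import numpy as np

def _eye(n): return np.eye(2 * n, dtype=np.uint8)

def _h(n, i):
    m = _eye(n); m[[i, n + i]] = m[[n + i, i]]; return m

def _s(n, i):
    m = _eye(n); m[n + i, i] = 1; return m

def _cnot(n, c, t):
    m = _eye(n); m[t, c] = 1; m[n + c, n + t] = 1; return m

def _cz(n, a, b):
    # CZ: X_a -> X_a Z_b and X_b -> X_b Z_a; Z-rows unchanged.
    m = _eye(n); m[n + a, b] = 1; m[n + b, a] = 1; return m

def min_two_qubit_costs(n=2):
    """Exact minimal two-qubit-gate count for every element of Sp(2n, F2),
    with single-qubit gates (H, S) treated as free: a 0-1 BFS where free
    gates are weight-0 edges and CNOT/CZ are weight-1 edges."""
    free = [_h(n, i) for i in range(n)] + [_s(n, i) for i in range(n)]
    paid = [_cnot(n, c, t) for c in range(n) for t in range(n) if c != t]
    paid += [_cz(n, a, b) for a in range(n) for b in range(a + 1, n)]
    start = _eye(n)
    dist = {start.tobytes(): 0}
    dq = deque([start])
    while dq:
        m = dq.popleft()
        d = dist[m.tobytes()]
        for g in free:           # weight-0 edges go to the front of the deque
            nxt = (g @ m) % 2
            k = nxt.tobytes()
            if k not in dist or dist[k] > d:
                dist[k] = d
                dq.appendleft(nxt)
        for g in paid:           # weight-1 edges go to the back
            nxt = (g @ m) % 2
            k = nxt.tobytes()
            if k not in dist or dist[k] > d + 1:
                dist[k] = d + 1
                dq.append(nxt)
    return dist
```

For n = 2 the search visits all 720 symplectic matrices; the 36 cost-0 elements are exactly the local subgroup Sp(2, F2) × Sp(2, F2), and the maximum cost is 3 (the SWAP class). The same idea does not scale to thirty qubits, which is why the premise above has to be tested statistically rather than exhaustively.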
Original abstract
We consider the problem of synthesizing Clifford quantum circuits for devices with all-to-all qubit connectivity. We approach this task as a reinforcement learning problem in which an agent learns to discover a sequence of elementary Clifford gates that reduces a given symplectic matrix representation of a Clifford circuit to the identity. This formulation permits a simple learning curriculum based on random walks from the identity. We introduce a novel neural network architecture that is equivariant to qubit relabelings of the symplectic matrix representation, and which is size-agnostic, allowing a single learned policy to be applied across different qubit counts without circuit splicing or network reparameterization. On six-qubit Clifford circuits, the largest regime for which optimal references are available, our agent finds circuits within one two-qubit gate of optimality in milliseconds per instance, and finds optimal circuits in 99.2% of instances within seconds per instance. After continued training on ten-qubit instances, the agent scales to unseen Clifford tableaus with up to thirty qubits, including targets generated from circuits with over a thousand Clifford gates, where it achieves lower average two-qubit gate counts than Qiskit's Aaronson-Gottesman and greedy Clifford synthesizers.
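The random-walk curriculum described in the abstract can be sketched as a minimal environment: the state is the target's symplectic matrix over GF(2), actions compose elementary-gate matrices onto it, and an episode succeeds when the state reaches the identity. Everything below (gate set, reward shaping, class name) is our illustrative assumption, not the paper's implementation; note that over GF(2) each generator matrix is an involution, so length-1 walks are solvable in a single step.

```python
import numpy as np

# Elementary-gate symplectic matrices, row convention (x_1..x_n, z_1..z_n).
def _eye(n): return np.eye(2 * n, dtype=np.uint8)

def _h(n, i):
    m = _eye(n); m[[i, n + i]] = m[[n + i, i]]; return m

def _s(n, i):
    m = _eye(n); m[n + i, i] = 1; return m

def _cnot(n, c, t):
    m = _eye(n); m[t, c] = 1; m[n + c, n + t] = 1; return m

class CliffordSynthesisEnv:
    """Minimal sketch of the paper's setup: reduce a symplectic matrix to the
    identity by applying elementary gates; targets come from random walks."""

    def __init__(self, n, walk_len, rng=None):
        self.n = n
        self.walk_len = walk_len
        self.rng = rng or np.random.default_rng(0)
        self.gates = ([_h(n, i) for i in range(n)]
                      + [_s(n, i) for i in range(n)]
                      + [_cnot(n, c, t) for c in range(n) for t in range(n) if c != t])

    def reset(self):
        # Random-walk curriculum: the target is walk_len random gates applied
        # to the identity, so short walks give easy instances and the walk
        # length can be grown as training progresses.
        m = _eye(self.n)
        for _ in range(self.walk_len):
            g = self.gates[self.rng.integers(len(self.gates))]
            m = (g @ m) % 2
        self.state = m
        return m

    def step(self, action):
        # Applying a gate composes its symplectic matrix onto the state.
        self.state = (self.gates[action] @ self.state) % 2
        done = bool(np.array_equal(self.state, _eye(self.n)))
        reward = 0.0 if done else -1.0  # per-step cost favors short circuits
        return self.state, reward, done
```

A size-agnostic policy would consume `env.state` for any `n`, which is what lets the same network be queried at thirty qubits after training at ten.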
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a reinforcement learning approach to Clifford circuit synthesis for all-to-all qubit connectivity. An agent learns a policy to apply elementary Clifford gates that reduce a symplectic matrix representation of a target Clifford operation to the identity. Training employs a random-walk curriculum starting from the identity on 6-10 qubit instances. The policy network is equivariant under qubit relabelings and size-agnostic, enabling application across qubit counts without reparameterization. On 6-qubit benchmarks the agent reaches circuits within one two-qubit gate of optimality in milliseconds and finds optimal circuits in 99.2% of cases within seconds. After continued training on 10-qubit instances the same policy is applied to unseen 30-qubit tableaus (including those generated by circuits with >1000 gates) and reports lower average two-qubit gate counts than Qiskit's Aaronson-Gottesman and greedy synthesizers.
Significance. If the reported scaling from 6-10 qubit training to 30-qubit test instances is robust, the work would constitute a practical advance in automated Clifford synthesis by demonstrating a single learned policy that generalizes across system sizes. The equivariant, size-agnostic architecture is a clear technical strength that avoids circuit splicing or retraining. Concrete performance numbers on standard benchmarks and the scaling demonstration are provided, though the absence of error bars, ablation studies, and distribution diagnostics limits the strength of the generalization claim. The approach could influence quantum compilation pipelines if the empirical results hold under broader validation.
major comments (2)
- [Abstract] Abstract: The headline scaling result—that a policy trained exclusively on random walks from the identity on 6-10 qubits produces shorter two-qubit-gate sequences on 30-qubit symplectic tableaus generated by circuits with >1000 gates—depends on reliable generalization. No comparison of matrix invariants (e.g., distribution of 2×2 blocks, symplectic rank after partial reduction, or Hamming weight of off-diagonal blocks) between the training distribution and the 30-qubit test set is described. This omission is load-bearing because random walks on small n generate bounded-support matrices while long walks on large n can produce qualitatively different block statistics; without such diagnostics the reported improvement over Aaronson-Gottesman and greedy baselines could reflect distribution mismatch rather than learned generalization.
- [Scaling experiments] Scaling experiments: The performance claims on 6-qubit optimality (99.2% within seconds) and the 30-qubit average-gate-count improvements are presented without error bars, number of independent trials, or ablation studies isolating the contribution of the equivariant layers versus the curriculum. These details are necessary to evaluate the reliability of the generalization claim and the specific benefit of the proposed architecture.
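The invariant comparison requested in the first major comment could start from per-matrix summary statistics; a small sketch, with the particular invariants taken from the referee's own examples (off-diagonal-block Hamming weight, GF(2) block rank) and the implementation ours:

```python
import numpy as np

def gf2_rank(m):
    # Rank over GF(2) by Gaussian elimination with XOR row operations.
    m = m.copy() % 2
    rank = 0
    rows, cols = m.shape
    for c in range(cols):
        pivot = next((r for r in range(rank, rows) if m[r, c]), None)
        if pivot is None:
            continue
        m[[rank, pivot]] = m[[pivot, rank]]
        for r in range(rows):
            if r != rank and m[r, c]:
                m[r] ^= m[rank]
        rank += 1
    return rank

def tableau_diagnostics(mats, n):
    """Per-matrix invariants for comparing tableau distributions:
    Hamming weight of the two off-diagonal (x-z coupling) blocks and
    GF(2) rank of the upper-left (x-x) block."""
    stats = []
    for m in mats:
        xz, zx = m[:n, n:], m[n:, :n]
        stats.append({
            "offdiag_weight": int(xz.sum() + zx.sum()),
            "xx_rank": gf2_rank(m[:n, :n]),
        })
    return stats
```

Histograms of these quantities over the 6-10 qubit training walks versus the 30-qubit test set would directly expose the kind of distribution mismatch the report warns about.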
minor comments (2)
- The abstract states that the agent 'achieves lower average two-qubit gate counts' on 30-qubit instances but does not report the numerical averages, standard deviations, or the exact number of test instances used.
- Clarify the precise distribution of random-walk lengths and gate choices used in the curriculum for the 6-qubit and 10-qubit training phases.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below and agree that the requested additions will strengthen the manuscript's claims regarding generalization and reliability. We plan to incorporate the suggested analyses and statistical details in the revised version.
Point-by-point responses
- Referee: The headline scaling result depends on reliable generalization. No comparison of matrix invariants (e.g., distribution of 2×2 blocks, symplectic rank after partial reduction, or Hamming weight of off-diagonal blocks) between the training distribution and the 30-qubit test set is described. This omission is load-bearing because random walks on small n generate bounded-support matrices while long walks on large n can produce qualitatively different block statistics.
  Authors: We agree that explicit comparison of matrix invariants would provide stronger support for the generalization claim. The training curriculum uses random walks starting from the identity on 6-10 qubits, while test instances include tableaus from circuits with >1000 gates on 30 qubits. Although the policy is applied zero-shot after training on 10-qubit instances, we did not report invariant statistics in the original manuscript. In revision, we will add an appendix with quantitative comparisons (e.g., histograms of off-diagonal block Hamming weights, average symplectic rank, and sparsity measures) for both the training distribution and the 30-qubit test set to address potential distribution mismatch.
- Referee: The performance claims on 6-qubit optimality (99.2% within seconds) and the 30-qubit average-gate-count improvements are presented without error bars, number of independent trials, or ablation studies isolating the contribution of the equivariant layers versus the curriculum.
  Authors: The referee is correct that error bars, trial counts, and ablations are absent. The 99.2% optimality figure and 30-qubit averages are computed over the respective test sets from single training runs. In the revision, we will rerun training with multiple random seeds (at least 5) and report means with standard deviations for all key metrics. We will also include an ablation comparing the equivariant policy network to a non-equivariant baseline (with otherwise identical architecture and curriculum) to quantify the benefit of equivariance. The curriculum's role will be discussed explicitly, though full isolation of every component may require additional experiments.
Circularity Check
No significant circularity; empirical RL results stand on external baselines
Full rationale
The paper describes an RL training procedure (random-walk curriculum on symplectic matrices) and an equivariant size-agnostic network architecture, then reports empirical performance on held-out 6-qubit and 30-qubit instances against independent external synthesizers (Qiskit Aaronson-Gottesman and greedy). No derivation, uniqueness theorem, or ansatz is invoked that reduces by construction to a fitted parameter, self-citation, or renamed input; the scaling claim is a direct experimental comparison rather than a mathematical prediction forced by the training distribution. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- (standard math) Clifford operations admit a faithful symplectic matrix representation over GF(2).
- (domain assumption) A policy trained via RL on random-walk curricula will generalize to arbitrary target tableaus.
Reference graph
Works this paper leans on
- [1] Scott Aaronson and Daniel Gottesman. Improved simulation of stabilizer circuits. Physical Review A, 70(5), 2004. doi:10.1103/PhysRevA.70.052328.
- [2] Forest Agostinelli, Stephen McAleer, Alexander Shmakov, and Pierre Baldi. Solving the Rubik's cube with deep reinforcement learning and search. Nature Machine Intelligence, 1(8):356–363, 2019. doi:10.1038/s42256-019-0070-z.
- [3] Matthew Amy and Michele Mosca. T-count optimization and Reed–Muller codes. IEEE Transactions on Information Theory, 65(8):4771–4784, 2019. doi:10.1109/TIT.2019.2906374.
- [4] Unai Aseguinolaza, Nahual Sobrino, Gabriel Sobrino, Joaquim Jornet-Somoza, and Juan Borge. Error estimation in current noisy quantum computers. Quantum Information Processing, 23(5), 2024. doi:10.1007/s11128-024-04384-z.
- [5] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009. doi:10.1145/1553374.1553380.
- [6] P. Oscar Boykin, Tal Mor, Matthew Pulver, Vwani Roychowdhury, and Farrokh Vatan. A new universal and fault-tolerant quantum basis. Information Processing Letters, 75(3):101–107, 2000. doi:10.1016/S0020-0190(00)00084-3.
- [7] Sergey Bravyi and Alexei Kitaev. Universal quantum computation with ideal Clifford gates and noisy ancillas. Physical Review A, 71(2), 2005. doi:10.1103/PhysRevA.71.022316.
- [8] Sergey Bravyi, Ruslan Shaydulin, Shaohan Hu, and Dmitri Maslov. Clifford circuit optimization with templates and symbolic Pauli gates. Quantum, 5:580, 2021. doi:10.22331/q-2021-11-16-580.
- [9] Sergey Bravyi, Ruslan Shaydulin, Shaohan Hu, and Dmitri Maslov. Data to accompany "Clifford Circuit Optimization with Templates and Symbolic Pauli Gates". GitHub repository, accessed 2026-05-03. URL https://github.com/rsln-s/Clifford_Circuit_Optimization_with_Templates_and_Symbolic_Pauli_Gates.
- [11] Sergey Bravyi, Joseph A. Latone, and Dmitri Maslov. 6-qubit optimal Clifford circuits. npj Quantum Information, 8(1), 2022. doi:10.1038/s41534-022-00583-7.
- [12] Francois Charton, Alexandre Krajenbrink, Konstantinos Meichanetzidis, and Richie Yeung. Teaching small transformers to rewrite ZX diagrams. In 3rd MATH-AI Workshop at NeurIPS'23. URL https://mathai2023.github.io/papers/34.pdf.
- [14] Jacopo Cossio, Daniele Lizzio Bosco, Riccardo Romanello, Giuseppe Serra, and Carla Piazza. AlphaCNOT: Learning CNOT minimization with model-based planning, 2026. URL https://arxiv.org/abs/2604.13812v1.
- [15] Niel de Beaudrap, Xiaoning Bian, and Quanlong Wang. Fast and effective techniques for T-count reduction via spider nest identities. In 15th Conference on the Theory of Quantum Computation, Communication and Cryptography (TQC 2020), volume 158, pages 11:1–11:23. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.TQC.2020.11.
- [17] Ayushi Dubal, David Kremer, Simon Martiel, Victor Villar, Derek Wang, and Juan Cruz-Benito. Pauli network circuit synthesis with reinforcement learning, 2025. URL https://arxiv.org/abs/2503.14448v1.
- [18] Ross Duncan, Aleks Kissinger, Simon Perdrix, and John van de Wetering. Graph-theoretic simplification of quantum circuits with the ZX-calculus. Quantum, 4:279, 2020. doi:10.22331/q-2020-06-04-279.
- [19] Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, 2022.
- [20] Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. In Conference on Robot Learning, pages 482–495, 2017. URL http://proceedings.mlr.press/v78/florensa17a/florensa17a.pdf.
- [21] Craig Gidney. Inverting Clifford tableaus, 2020. Blog post, accessed 2026-04-07. URL https://algassert.com/post/2002.
- [22] Craig Gidney, Noah Shutty, and Cody Jones. Magic state cultivation: growing T states as cheap as CNOT gates, 2024. URL https://arxiv.org/abs/2409.17595v1.
- [23] Daniel Gottesman. The Heisenberg representation of quantum computers, 1998. URL https://arxiv.org/abs/quant-ph/9807006v1.
- [24] Luke E. Heyfron and Earl T. Campbell. An efficient quantum compiler that reduces T count. Quantum Science and Technology, 4(1):015004, 2018. doi:10.1088/2058-9565/aad604.
- [25] IBM Quantum and Qiskit contributors. GreedySynthesisClifford. Qiskit 2.2 API documentation, accessed 2026-05-07. URL https://quantum.cloud.ibm.com/docs/en/api/qiskit/2.2/qiskit.transpiler.passes.synthesis.hls_plugins.GreedySynthesisClifford.
- [26] IBM Quantum and Qiskit contributors. DefaultSynthesisClifford. Qiskit 2.2 API documentation, accessed 2026-05-07. URL https://quantum.cloud.ibm.com/docs/en/api/qiskit/2.2/qiskit.transpiler.passes.synthesis.hls_plugins.DefaultSynthesisClifford.
- [27] IBM Quantum and Qiskit contributors. random_clifford. Qiskit 2.2 API documentation, accessed 2026-05-07. URL https://quantum.cloud.ibm.com/docs/en/api/qiskit/2.2/quantum_info.
- [28] Aleks Kissinger and John van de Wetering. Reducing the number of non-Clifford gates in quantum circuits. Physical Review A, 102(2), 2020. doi:10.1103/PhysRevA.102.022406.
- [29] David Kremer, Victor Villar, Hanhee Paik, Ivan Duran, Ismael Faro, and Juan Cruz-Benito. Practical and efficient quantum circuit synthesis and transpiling with reinforcement learning.
- [31] Daniel Litinski. Magic state distillation: Not as costly as you think. Quantum, 3:205, 2019. doi:10.22331/q-2019-12-02-205.
- [32] Daniel Litinski and Felix von Oppen. Lattice surgery with a twist: Simplifying Clifford gates of surface codes. Quantum, 2:62, 2018. doi:10.22331/q-2018-05-04-62.
- [33] Alexander Mattick, Maniraman Periyasamy, Christian Ufrecht, Abhishek Y. Dubey, Christopher Mutschler, Axel Plinge, and Daniel D. Scherer. Optimizing quantum circuits via ZX diagrams using reinforcement learning and graph neural networks, 2025. URL https://arxiv.org/abs/2504.03429v1.
- [34] Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E. Taylor, and Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey. Journal of Machine Learning Research, 21(181):1–50, 2020. URL http://jmlr.org/papers/volume21/20-212/20-212.pdf.
- [35] Maximilian Nägele and Florian Marquardt. Optimizing ZX-diagrams with deep reinforcement learning. Machine Learning: Science and Technology, 5(3):035077, 2024. doi:10.1088/2632-2153/ad76f7.
- [36] Tom Peham, Nina Brandl, Richard Kueng, Robert Wille, and Lukas Burgholzer. Depth-optimal synthesis of Clifford circuits with SAT solvers. In 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), pages 802–813, 2023. doi:10.1109/QCE57702.2023.00095.
- [37] Jordi Riu, Jan Nogué, Gerard Vilaplana, Artur Garcia-Saez, and Marta P. Estarellas. Reinforcement learning based quantum circuit optimization via ZX-calculus. Quantum, 9:1758, 2025. doi:10.22331/q-2025-05-28-1758.
- [38] Francisco J. R. Ruiz, Tuomas Laakkonen, Johannes Bausch, Matej Balog, Mohammadamin Barekatain, Francisco J. H. Heras, Alexander Novikov, Nathan Fitzpatrick, Bernardino Romera-Paredes, John van de Wetering, Alhussein Fawzi, Konstantinos Meichanetzidis, and Pushmeet Kohli. Quantum circuit optimization with AlphaTensor. Nature Machine Intelligence, 7(3):374–…
- [39] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347v2.
- [40] Peter Selinger. Efficient Clifford+T approximation of single-qubit operators. Quantum Information and Computation, 15(1&2):159–180, 2015. doi:10.26421/qic15.1-2-10.
- [41] Irfansha Shaik and Jaco van de Pol. CNOT-optimal Clifford synthesis as SAT, 2025. URL https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SAT.2025.28.
- [42] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana, 2018. Association for Computational Linguistics.
- [43] Will Simmons. Relating measurement patterns to circuits via Pauli flow. Electronic Proceedings in Theoretical Computer Science, 343:50–101, 2021. doi:10.4204/EPTCS.343.4.
- [44] Joseph Suarez. PufferLib: Making reinforcement learning libraries and environments play nice.
- [46] Arianne van de Griend. Constrained quantum CNOT circuit re-synthesis using deep reinforcement learning. Master's thesis, Radboud University, 2019. URL https://theses.ubn.ru.nl/handle/123456789/10713.
- [47] John van de Wetering, Richie Yeung, Tuomas Laakkonen, and Aleks Kissinger. Optimal compilation of parametrised quantum circuits. Quantum, 9:1828, 2025. doi:10.22331/q-2025-08-27-1828.
- [48] Elise van der Pol, Daniel E. Worrall, Herke van Hoof, Frans A. Oliehoek, and Max Welling. MDP homomorphic networks: Group symmetries in reinforcement learning. In Advances in Neural Information Processing Systems, volume 33, pages 4199–4210, 2020. URL https://proceedings.neurips.cc/paper/2020/file/2be5f9c2e3620eb73c2972d7552b6cb5-Paper.pdf.
- [49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
- [50] Mark Webster, Stergios Koutsioumpas, and Dan E. Browne. Heuristic and optimal synthesis of CNOT and Clifford circuits, 2025. URL https://arxiv.org/abs/2503.14660v1.
- [51] Remmy Zen, Jan Olle, Luis Colmenarez, Matteo Puviani, Markus Müller, and Florian Marquardt. Quantum circuit discovery for fault-tolerant logical state preparation with reinforcement learning. Physical Review X, 15(4):041012, 2025. doi:10.1103/gqpr-dgz7.
- [52] Martin Zinkevich and Tucker R. Balch. Symmetry in Markov decision processes and its implications for single agent and multiagent learning. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 632–640. Morgan Kaufmann, 2001. URL https://www.cs.cmu.edu/~maz/publications/symmetry7.pdf.