Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX
Pith reviewed 2026-05-21 05:31 UTC · model grok-4.3
The pith
A JAX-vectorized simulator for Riichi Mahjong runs up to 2 million game steps per second on eight NVIDIA A100 GPUs (1 million under the red rules) to support tabula rasa reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Mahjax, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on GPUs. The simulator achieves throughputs of up to 2 million steps per second on eight NVIDIA A100 GPUs under the no-red rules and 1 million steps per second under the red rules. We also provide a high-quality visualization tool. Experimental results demonstrate that agents can be trained effectively to improve their rank against baseline policies.
What carries the argument
The fully vectorized JAX implementation of the complete Riichi Mahjong rule set that executes many independent games simultaneously on GPU hardware.
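The vectorization pattern described here can be sketched in a few lines of JAX. This is a minimal illustration under stated assumptions, not Mahjax's actual API: the functions `init` and `draw` and the toy state (a shuffled wall plus a draw index) are hypothetical stand-ins for a real game step. Only `jax.vmap`, `jax.jit`, and `jax.random` are real library calls.

```python
import jax
import jax.numpy as jnp

NUM_TILE_TYPES = 34   # Riichi Mahjong has 34 distinct tile kinds
COPIES_PER_TYPE = 4   # four copies of each kind, 136 tiles in total


def init(key):
    """Shuffle one wall; the toy state is a (wall, next draw index) pair."""
    tiles = jnp.repeat(jnp.arange(NUM_TILE_TYPES), COPIES_PER_TYPE)
    wall = jax.random.permutation(key, tiles)
    return wall, jnp.int32(0)


def draw(state):
    """Draw the next tile from a single wall (hypothetical stand-in for a game step)."""
    wall, pos = state
    return (wall, pos + 1), wall[pos]


# vmap lifts the single-game functions to a batch of independent games;
# jit compiles the batched functions for the accelerator.
batched_init = jax.jit(jax.vmap(init))
batched_draw = jax.jit(jax.vmap(draw))

keys = jax.random.split(jax.random.PRNGKey(0), 1024)  # one RNG key per game
states = batched_init(keys)
states, tiles = batched_draw(states)
print(tiles.shape)  # one drawn tile per game
```

Writing the logic for a single game and letting `jax.vmap` add the batch dimension keeps the rule code readable, at the cost of requiring every branch to be expressed with array operations rather than Python control flow.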
If this is right
- Reinforcement-learning agents can be trained on the full game without any supervised pre-training from human play logs.
- Batch sizes large enough for modern policy-gradient or value-based methods become practical on multi-GPU hardware such as the eight A100s benchmarked in the paper.
- Both the no-red and red rule variants are supported, allowing direct comparison of learned policies across rule sets.
- The supplied visualization tool makes it feasible to inspect and debug the internal decision process of trained agents.
Where Pith is reading between the lines
- The same vectorization pattern could be applied to other imperfect-information multi-player games such as poker or bridge.
- High-throughput simulation opens the door to systematic testing of new algorithms that explicitly handle stochasticity and hidden states.
- Because the environment is written in JAX, it can be combined with differentiable components for hybrid model-based or planning-augmented learning.
Load-bearing premise
The JAX code exactly reproduces every stochastic transition, hidden-information rule, and legal move of standard Riichi Mahjong.
What would settle it
A side-by-side comparison in which Mahjax generates a sequence of states or moves that an established, independently verified Mahjong engine would reject as illegal or assign a different probability.
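One way such a side-by-side comparison might be set up is to replay the same seeded game through both engines and compare their legal-action masks step by step. The sketch below assumes each engine can export a boolean array of shape (num_steps, num_actions); the function name `first_disagreement` and the example arrays are hypothetical, and no real reference engine's API is used.

```python
import jax.numpy as jnp


def first_disagreement(masks_a, masks_b):
    """Return the first step index where two legal-action masks differ, or -1 if none."""
    differs = jnp.any(masks_a != masks_b, axis=-1)        # shape (num_steps,)
    first = jnp.argmax(differs.astype(jnp.int32))          # index of the first True
    return jnp.where(jnp.any(differs), first, -1)


# Synthetic masks for three steps and three actions (illustrative values only).
engine_a = jnp.array([[1, 0, 1], [1, 1, 0], [0, 1, 1]], dtype=bool)
engine_b = engine_a.at[1, 1].set(False)   # inject one disagreement at step 1

print(int(first_disagreement(engine_a, engine_a)))
print(int(first_disagreement(engine_a, engine_b)))
```

A mask comparison only covers legality. Differences in transition probabilities would need a distributional test over many sampled games.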
Original abstract
Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of challenges that mirror complex real-world decision-making problems in reinforcement learning. While prior research has heavily relied on supervised learning from human play logs to pre-train the policy, algorithms capable of learning tabula rasa (from scratch) offer greater potential for general applicability, as evidenced by the AlphaZero lineage. To facilitate such research, we introduce Mahjax, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on Graphics Processing Units (GPUs). We also provide a high-quality visualization tool to streamline debugging and interaction with trained agents. Experimental results demonstrate that Mahjax achieves throughputs of up to 2 million and 1 million steps per second on eight NVIDIA A100 GPUs under the no-red and red rules, respectively. Furthermore, we validate the environment's utility for reinforcement learning by showing that agents can be trained effectively to improve their rank against baseline policies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Mahjax, a fully vectorized JAX implementation of a Riichi Mahjong simulator designed for large-scale GPU-parallel rollouts in reinforcement learning. It reports peak throughputs of 2 million steps/sec (no-red rules) and 1 million steps/sec (red rules) on 8 A100 GPUs, provides a visualization tool, and validates utility by training agents that improve rank against baseline policies.
Significance. If the simulator faithfully implements the complete Riichi rule set, stochastic wall draws, hidden information, yaku scoring, and multi-player interactions, the work would provide a valuable scalable environment for tabula-rasa RL in imperfect-information games, extending the AlphaZero-style paradigm to a new domain. The explicit throughput measurements and training curves constitute direct, reproducible empirical validation rather than fitted claims.
major comments (1)
- [Experimental Results] Experimental Results section: The central claim that agents improve rank against baselines validates the environment's utility for RL, yet the manuscript contains no unit tests, cross-validation against reference Mahjong libraries, or statistical comparison of game-outcome distributions (e.g., win rates, yaku frequencies, or discard patterns) to confirm that stochastic transitions and hidden-information mechanics are reproduced exactly. This verification step is load-bearing for the claim that observed policy improvement occurs in the true Riichi MDP rather than an approximation.
minor comments (2)
- [Abstract] The abstract states throughput figures but does not specify the exact batch size, number of parallel environments, or whether the measurements include or exclude host-device transfers; adding these details would improve reproducibility.
- [Implementation] The visualization tool is mentioned but no description of its interface, supported actions, or how it handles hidden information for debugging is provided; a short usage example or figure would clarify its utility.
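The first minor comment, on what the throughput figures include, matters because JAX dispatches work asynchronously: a timer that stops before the device finishes will overstate throughput. The sketch below shows one way a device-side measurement might be taken. `toy_step` is a hypothetical placeholder for a batched environment step, and `measure_throughput` is an illustrative helper, not the paper's benchmark code.

```python
import time

import jax
import jax.numpy as jnp


@jax.jit
def toy_step(state):
    # Hypothetical placeholder for one batched environment step.
    return (state + 1) % 136


def measure_throughput(batch_size, num_steps):
    """Return environment steps per second, excluding compilation time."""
    state = jnp.zeros(batch_size, dtype=jnp.int32)
    toy_step(state).block_until_ready()      # warm-up call triggers compilation
    start = time.perf_counter()
    for _ in range(num_steps):
        state = toy_step(state)              # state stays on the device
    state.block_until_ready()                # wait for all queued work to finish
    elapsed = time.perf_counter() - start
    return batch_size * num_steps / elapsed


print(f"{measure_throughput(4096, 100):.0f} steps/sec")
```

Because the state never leaves the device inside the loop, this figure excludes host-device transfers. Copying each result back to the host (for example with `jax.device_get`) inside the loop would give the transfer-inclusive number, and reporting `batch_size` alongside either figure addresses the reproducibility point.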
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comment regarding verification of the simulator below.
Point-by-point responses
- Referee: The central claim that agents improve rank against baselines validates the environment's utility for RL, yet the manuscript contains no unit tests, cross-validation against reference Mahjong libraries, or statistical comparison of game-outcome distributions (e.g., win rates, yaku frequencies, or discard patterns) to confirm that stochastic transitions and hidden-information mechanics are reproduced exactly. This verification step is load-bearing for the claim that observed policy improvement occurs in the true Riichi MDP rather than an approximation.
  Authors: We agree that explicit verification of the simulator's fidelity to the full Riichi rule set, including stochastic draws and hidden information, is important for substantiating the RL results. The current manuscript does not present unit tests, cross-validation against reference libraries, or statistical comparisons of outcome distributions. In the revised manuscript we will add a dedicated validation subsection in the Experimental Results section that reports win rates, yaku frequencies, and discard-pattern statistics benchmarked against publicly available reference Mahjong implementations. These additions will directly address the concern that policy improvement may be occurring in an approximation rather than the true MDP.
  Revision: yes
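The outcome-distribution comparison discussed in this exchange could take the form of a goodness-of-fit statistic between event counts from the simulator and frequencies from a reference implementation. The sketch below computes Pearson's chi-square statistic. The function name and the numbers are synthetic placeholders chosen for illustration, not measurements from Mahjax or any reference engine.

```python
import jax.numpy as jnp


def chi_square_statistic(observed_counts, reference_probs):
    """Pearson chi-square statistic of observed counts against reference probabilities."""
    observed = jnp.asarray(observed_counts, dtype=jnp.float32)
    probs = jnp.asarray(reference_probs, dtype=jnp.float32)
    expected = probs * observed.sum()        # expected counts under the reference
    return jnp.sum((observed - expected) ** 2 / expected)


# Synthetic example: three outcome categories, 100 simulated games.
reference = [0.5, 0.3, 0.2]
print(float(chi_square_statistic([50, 30, 20], reference)))   # perfect agreement
print(float(chi_square_statistic([60, 20, 20], reference)))   # shifted distribution
```

The statistic would then be compared against a chi-square distribution with (number of categories minus one) degrees of freedom. A large value indicates that the simulator's outcome frequencies are unlikely under the reference distribution.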
Circularity Check
No circularity: empirical validation of simulator utility stands independently
The paper introduces Mahjax as a JAX-based vectorized simulator and supports its utility claim solely through direct empirical results: measured throughput (up to 2M steps/sec) and observed rank improvement when RL agents are trained against baseline policies. No equations, fitted parameters, or predictions are presented that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation chain consists of implementation followed by external benchmarking against baselines, which is self-contained and falsifiable outside any internal fit. Implementation correctness is a separate verification issue, not a circularity in the reported results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard rules and stochastic mechanics of Riichi Mahjong
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We introduce Mahjax, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on GPUs... achieving up to 2 million and 1 million steps per second... agents can be trained effectively to improve their rank against baseline policies."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We employed two primary optimization techniques: Vectorized Logic... Caching... pre-computed the relevant statistics for all possible suit combinations and encoded them into a bitmask."
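The caching idea in the quoted passage can be illustrated with a lookup table over every possible tile-count vector of one numbered suit. The passage elides which statistics Mahjax caches and how they are laid out, so everything below is an assumption made for illustration: the three flags (`HAS_PAIR`, `HAS_TRIPLET`, `HAS_RUN`), the base-5 encoding, and the function names are hypothetical.

```python
import jax.numpy as jnp

NUM_RANKS = 9          # ranks 1..9 within one numbered suit
MAX_COPIES = 4         # at most four copies of any tile
BASE = MAX_COPIES + 1  # each rank count becomes one base-5 digit

# Illustrative bit flags (assumption, not the statistics used by Mahjax).
HAS_PAIR, HAS_TRIPLET, HAS_RUN = 1, 2, 4

POWERS = BASE ** jnp.arange(NUM_RANKS)   # place values 5^0 .. 5^8


def build_table():
    """Precompute one bitmask per possible 9-rank count vector (5^9 entries)."""
    codes = jnp.arange(BASE ** NUM_RANKS)
    counts = (codes[:, None] // POWERS) % BASE      # decode digits, shape (5^9, 9)
    pair = jnp.any(counts >= 2, axis=1)
    triplet = jnp.any(counts >= 3, axis=1)
    present = counts >= 1
    # A run exists if three consecutive ranks are all present.
    run = jnp.any(present[:, :-2] & present[:, 1:-1] & present[:, 2:], axis=1)
    mask = pair * HAS_PAIR + triplet * HAS_TRIPLET + run * HAS_RUN
    return mask.astype(jnp.uint8)


TABLE = build_table()


def suit_features(counts):
    """Look up the cached bitmask for one suit's count vector."""
    return TABLE[jnp.dot(counts, POWERS)]


hand = jnp.array([1, 1, 1, 0, 0, 0, 0, 2, 0])   # ranks 1-2-3 plus a pair of 8s
print(int(suit_features(hand)))
```

The table trades memory (about two million one-byte entries per suit) for replacing per-step pattern checks with a single indexed read, which is the kind of operation that batches well on a GPU.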
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.