
arxiv: 2605.20577 · v1 · pith:VQDWMVXFnew · submitted 2026-05-20 · 💻 cs.AI · cs.LG

Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

Pith reviewed 2026-05-21 05:31 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords Riichi Mahjong · Reinforcement Learning · JAX · GPU Simulation · Imperfect Information · Multi-Agent Environment · Vectorized Rollouts · Tabula Rasa Learning

The pith

A JAX vectorized simulator for Riichi Mahjong runs millions of game steps per second on GPUs to support tabula rasa reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Riichi Mahjong combines multiple players, hidden information, and stochastic draws in a high-dimensional state space that mirrors difficult real-world decision problems. The paper introduces Mahjax, a fully vectorized implementation of the game written in JAX, so that many independent games can execute in parallel on GPUs. On eight A100 GPUs this design delivers throughputs of up to two million steps per second under the no-red rules and one million under the red rules, easing the computational cost of learning from scratch in a domain where prior work has relied heavily on supervised pre-training from human logs. The authors further show that reinforcement-learning agents trained inside Mahjax improve their rank against baseline policies, which indicates that the environment can support end-to-end learning from scratch. A companion visualization tool is supplied for debugging and for interacting with trained agents.

Core claim

We introduce Mahjax, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on GPUs. The simulator achieves throughputs of up to 2 million steps per second on eight NVIDIA A100 GPUs under the no-red rules and 1 million steps per second under the red rules. We also provide a high-quality visualization tool. Experimental results demonstrate that agents can be trained effectively to improve their rank against baseline policies.

What carries the argument

The fully vectorized JAX implementation of the complete Riichi Mahjong rule set that executes many independent games simultaneously on GPU hardware.
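To make the vectorization pattern concrete, here is a minimal sketch, not the Mahjax API. The `init` and `step` functions and the toy state (a shuffled wall, a draw pointer, and a 34-entry tile-count hand) are assumptions invented for illustration; only `jax.vmap`, `jax.jit`, and `jax.random` are real JAX calls. The sketch writes the logic for one game and lets `vmap` run 1024 games in lockstep.

```python
import jax
import jax.numpy as jnp

NUM_TILE_TYPES = 34   # Riichi Mahjong has 34 distinct tile types
COPIES_PER_TYPE = 4   # four copies of each type, 136 tiles in total


def init(key):
    """Hypothetical toy state for ONE game: shuffled wall, draw pointer, empty hand."""
    tiles = jnp.repeat(jnp.arange(NUM_TILE_TYPES), COPIES_PER_TYPE)
    wall = jax.random.permutation(key, tiles)
    hand = jnp.zeros(NUM_TILE_TYPES, jnp.int32)
    return {"wall": wall, "next": jnp.int32(0), "hand": hand}


def step(state):
    """Hypothetical toy transition: draw the next wall tile into the hand."""
    tile = state["wall"][state["next"]]
    hand = state["hand"].at[tile].add(1)
    return {"wall": state["wall"], "next": state["next"] + 1, "hand": hand}


# vmap lifts the single-game functions to a batch of games; jit compiles them.
batched_init = jax.jit(jax.vmap(init))
batched_step = jax.jit(jax.vmap(step))

keys = jax.random.split(jax.random.PRNGKey(0), 1024)  # one key per game
states = batched_init(keys)
for _ in range(13):  # deal 13 tiles in every game at once
    states = batched_step(states)
```

After the loop, `states["hand"]` has shape `(1024, 34)` and every row sums to 13. A real environment would also need branch-free handling of calls, wins, and round ends, which this sketch omits.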

If this is right

  • Reinforcement-learning agents can be trained on the full game without any supervised pre-training from human play logs.
  • Batch sizes large enough for modern policy-gradient or value-based methods become practical on commodity GPU hardware.
  • Both the no-red and red rule variants are supported, allowing direct comparison of learned policies across rule sets.
  • The supplied visualization tool makes it feasible to inspect and debug the internal decision process of trained agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same vectorization pattern could be applied to other imperfect-information multi-player games such as poker or bridge.
  • High-throughput simulation opens the door to systematic testing of new algorithms that explicitly handle stochasticity and hidden states.
  • Because the environment is written in JAX, it can be combined with differentiable components for hybrid model-based or planning-augmented learning.

Load-bearing premise

The JAX code exactly reproduces every stochastic transition, hidden-information rule, and legal move of standard Riichi Mahjong.

What would settle it

A side-by-side comparison in which Mahjax generates a sequence of states or moves that an established, independently verified Mahjong engine would reject as illegal or assign a different probability.
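One way such a comparison could be organized is differential testing: run a vectorized candidate and an independently written reference on the same inputs and report any disagreement. The sketch below is an assumption-laden toy, not a test from the paper. Both `legal_discards_fast` and `legal_discards_reference` are hypothetical functions, and the rule checked (a tile type can be discarded only if it is held) ignores restrictions such as those after a riichi declaration.

```python
import jax
import jax.numpy as jnp

NUM_TILE_TYPES = 34


def legal_discards_fast(hand_counts):
    """Hypothetical vectorized candidate: a tile type is discardable if held."""
    return hand_counts > 0


def legal_discards_reference(hand_counts):
    """Hypothetical reference, written independently in plain Python."""
    return [int(count) >= 1 for count in hand_counts]


def random_hand(key):
    """Deal 14 tiles from a shuffled 136-tile wall into a count vector."""
    tiles = jnp.repeat(jnp.arange(NUM_TILE_TYPES), 4)
    drawn = jax.random.permutation(key, tiles)[:14]
    return jnp.zeros(NUM_TILE_TYPES, jnp.int32).at[drawn].add(1)


def find_disagreements(hands):
    """Return indices of hands on which the two implementations differ."""
    fast = jax.vmap(legal_discards_fast)(hands)
    bad = []
    for i, hand in enumerate(hands.tolist()):
        if fast[i].tolist() != legal_discards_reference(hand):
            bad.append(i)
    return bad


keys = jax.random.split(jax.random.PRNGKey(0), 256)
hands = jax.vmap(random_hand)(keys)
disagreements = find_disagreements(hands)  # an empty list means no mismatch found
```

An empty result only shows agreement on the sampled hands. In practice the reference side would be an established engine, and the inputs would be full game states rather than isolated hands.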

Figures

Figures reproduced from arXiv: 2605.20577 by Eason Yu, Keigo Habara, Masashi Sugiyama, Shinri Okano, Soichiro Nishimori, Sotetsu Koyamada.

Figure 1: Example code snippet demonstrating the Mahjax API.
Figure 2: The SVG-based visualization of the Mahjax game state.
Figure 3: Throughput comparison (steps per second) between Mahjax (red and ...
Original abstract

Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of challenges that mirror complex real-world decision-making problems in reinforcement learning. While prior research has heavily relied on supervised learning from human play logs to pre-train the policy, algorithms capable of learning tabula rasa (from scratch) offer greater potential for general applicability, as evidenced by the AlphaZero lineage. To facilitate such research, we introduce Mahjax, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on Graphics Processing Units (GPUs). We also provide a high-quality visualization tool to streamline debugging and interaction with trained agents. Experimental results demonstrate that Mahjax achieves throughputs of up to 2 million and 1 million steps per second on eight NVIDIA A100 GPUs under the no-red and red rules, respectively. Furthermore, we validate the environment's utility for reinforcement learning by showing that agents can be trained effectively to improve their rank against baseline policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Mahjax, a fully vectorized JAX implementation of a Riichi Mahjong simulator designed for large-scale GPU-parallel rollouts in reinforcement learning. It reports peak throughputs of 2 million steps/sec (no-red rules) and 1 million steps/sec (red rules) on 8 A100 GPUs, provides a visualization tool, and validates utility by training agents that improve rank against baseline policies.

Significance. If the simulator faithfully implements the complete Riichi rule set, stochastic wall draws, hidden information, yaku scoring, and multi-player interactions, the work would provide a valuable scalable environment for tabula-rasa RL in imperfect-information games, extending the AlphaZero-style paradigm to a new domain. The explicit throughput measurements and training curves constitute direct, reproducible empirical validation rather than fitted claims.

major comments (1)
  1. [Experimental Results] Experimental Results section: The central claim that agents improve rank against baselines validates the environment's utility for RL, yet the manuscript contains no unit tests, cross-validation against reference Mahjong libraries, or statistical comparison of game-outcome distributions (e.g., win rates, yaku frequencies, or discard patterns) to confirm that stochastic transitions and hidden-information mechanics are reproduced exactly. This verification step is load-bearing for the claim that observed policy improvement occurs in the true Riichi MDP rather than an approximation.
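As an illustration of the kind of statistical check the major comment asks for, the sketch below tests one stochastic transition only: whether the first tile of a shuffled wall is uniformly distributed over the 34 tile types. It is a toy under stated assumptions. `first_draw` and `chi_square_uniform` are hypothetical helpers, the shuffle is plain `jax.random.permutation` rather than Mahjax code, and the samples are generated on the spot rather than taken from the paper.

```python
import jax
import jax.numpy as jnp

NUM_TILE_TYPES = 34


def first_draw(key):
    """Shuffle a 136-tile wall and return the type of its first tile."""
    tiles = jnp.repeat(jnp.arange(NUM_TILE_TYPES), 4)
    return jax.random.permutation(key, tiles)[0]


def chi_square_uniform(samples, num_bins):
    """Pearson chi-square statistic against a uniform distribution."""
    observed = jnp.bincount(samples, length=num_bins)
    expected = samples.shape[0] / num_bins
    return jnp.sum((observed - expected) ** 2 / expected)


keys = jax.random.split(jax.random.PRNGKey(0), 34_000)
samples = jax.vmap(first_draw)(keys)
stat = float(chi_square_uniform(samples, NUM_TILE_TYPES))
# Under uniformity the statistic follows a chi-square law with 33 degrees of
# freedom (mean 33); a value far above that would suggest a biased shuffle.
```

The same statistic could be applied to win rates or yaku frequencies, with the expected counts taken from a reference implementation instead of a uniform law.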
minor comments (2)
  1. [Abstract] The abstract states throughput figures but does not specify the exact batch size, number of parallel environments, or whether the measurements include or exclude host-device transfers; adding these details would improve reproducibility.
  2. [Implementation] The visualization tool is mentioned but no description of its interface, supported actions, or how it handles hidden information for debugging is provided; a short usage example or figure would clarify its utility.
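The first minor comment concerns how throughput is measured. A minimal sketch of one common convention follows, assuming a hypothetical `toy_step` in place of a real environment step: compile once, run a warm-up call so compilation time is excluded, block until the device finishes, and report batch size times steps divided by wall-clock time. Because the state stays on the device throughout, this figure excludes host-device transfers.

```python
import time

import jax
import jax.numpy as jnp


def toy_step(state):
    """Hypothetical stand-in for one environment step (a simple counter)."""
    return state + 1


def rollout(state, num_steps):
    """Advance every environment num_steps times entirely on the device."""
    body = lambda _, s: jax.vmap(toy_step)(s)
    return jax.lax.fori_loop(0, num_steps, body, state)


rollout_jit = jax.jit(rollout, static_argnums=1)


def measure_steps_per_second(batch_size, num_steps):
    state = jnp.zeros(batch_size, jnp.int32)
    # Warm-up call: triggers compilation so it is not counted in the timing.
    rollout_jit(state, num_steps).block_until_ready()
    start = time.perf_counter()
    # block_until_ready waits for asynchronous dispatch to actually finish.
    out = rollout_jit(state, num_steps).block_until_ready()
    elapsed = time.perf_counter() - start
    return batch_size * num_steps / elapsed, out


sps, final_state = measure_steps_per_second(batch_size=4096, num_steps=100)
```

To include transfer cost instead, the timed region would also have to cover copying results back to the host, for example with `jax.device_get(out)`. Reporting which convention was used, together with `batch_size`, is what makes a steps-per-second figure reproducible.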

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment regarding verification of the simulator below.

  1. Referee: The central claim that agents improve rank against baselines validates the environment's utility for RL, yet the manuscript contains no unit tests, cross-validation against reference Mahjong libraries, or statistical comparison of game-outcome distributions (e.g., win rates, yaku frequencies, or discard patterns) to confirm that stochastic transitions and hidden-information mechanics are reproduced exactly. This verification step is load-bearing for the claim that observed policy improvement occurs in the true Riichi MDP rather than an approximation.

    Authors: We agree that explicit verification of the simulator's fidelity to the full Riichi rule set, including stochastic draws and hidden information, is important for substantiating the RL results. The current manuscript does not present unit tests, cross-validation against reference libraries, or statistical comparisons of outcome distributions. In the revised manuscript we will add a dedicated validation subsection in the Experimental Results section that reports win rates, yaku frequencies, and discard-pattern statistics benchmarked against publicly available reference Mahjong implementations. These additions will directly address the concern that policy improvement may be occurring in an approximation rather than the true MDP.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical validation of simulator utility stands independently


The paper introduces Mahjax as a JAX-based vectorized simulator and supports its utility claim solely through direct empirical results: measured throughput (up to 2M steps/sec) and observed rank improvement when RL agents are trained against baseline policies. No equations, fitted parameters, or predictions are presented that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation chain consists of implementation followed by external benchmarking against baselines, which is self-contained and falsifiable outside any internal fit. Implementation correctness is a separate verification issue, not a circularity in the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is an implementation of established game rules rather than new mathematics or fitted parameters. No free parameters, invented entities, or non-standard axioms are introduced.

axioms (1)
  • domain assumption: Standard rules and stochastic mechanics of Riichi Mahjong
    The simulator must faithfully implement the game to support valid RL training.



