Reinforcement Learning in Super Mario Bros: Curriculum, Pedagogy, and Optimal Level Design in World 1-1
Pith reviewed 2026-06-30 07:38 UTC · model grok-4.3
The pith
The canonical ordering of World 1-1 segments provides a measurable pedagogical advantage for reinforcement learning agents that random permutations cannot match.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
World 1-1's six segments were permuted into twelve conditions and Monte Carlo agents were trained on each. The canonical ordering converges fastest, achieves the highest learning efficiency, and is the only condition with zero catastrophic failures. No random permutation satisfies all three criteria at once. These results indicate that the level encodes genuine pedagogical structure that accelerates learning in a way that cannot be replicated by chance.
What carries the argument
A curriculum experiment that reorders the six canonical segments of World 1-1 and compares convergence speed, learning efficiency, and catastrophic failure rates across the twelve conditions.
If this is right
- Monte Carlo outperforms DQN by maximizing intermediate rewards along winning paths rather than taking the shortest route.
- Canonical ordering is the only sequence that simultaneously meets the three success criteria of speed, efficiency, and zero failures.
- No tested random permutation of the segments matches the canonical order on all three metrics.
- The advantage arises from the specific order in which mechanics are introduced.
Where Pith is reading between the lines
- The same permutation method could be applied to other game levels or tutorial sequences to measure their teaching value.
- RL training runs might serve as an early check for whether a level's segment order supports reliable learning before human testing.
- If order alone drives the difference, then mechanics introduction sequence may matter more for learning speed than the total set of mechanics present.
Load-bearing premise
Performance differences across segment permutations are caused by pedagogical structure encoded in the canonical order rather than by implementation details of the discrete environment or the specific RL algorithms used.
What would settle it
Observing a random permutation of the segments that also converges fastest, reaches the highest learning efficiency, and records zero catastrophic failures would show the advantage is not unique to the canonical order.
Figures
read the original abstract
World 1-1 of Super Mario Bros is widely celebrated as a masterclass in game design: its progressive structure is credited with teaching players core mechanics through the level itself. We ask whether that structure is empirically measurable using reinforcement learning. We implement World 1-1 from scratch as a fully discrete environment and compare four algorithms -- Q-Learning, SARSA, Monte Carlo, and Deep Q-Network (DQN) -- across three progressively complex versions of the same level. Monte Carlo emerges as the strongest agent (94.9% $\pm$ 1.5% win rate), outperforming DQN (76.4% $\pm$ 3.4%) by learning to maximize intermediate rewards along winning paths rather than taking the most direct route. We then use Monte Carlo in a curriculum experiment permuting World 1-1's six canonical segments across twelve conditions. Canonical ordering converges fastest, achieves the highest learning efficiency, and is the only condition with zero catastrophic failures; no random permutation matches all three criteria simultaneously. These results provide, to the best of our knowledge, the first empirical validation that World 1-1's canonical design encodes genuine pedagogical structure: one that measurably accelerates learning and cannot be replicated by chance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript implements World 1-1 of Super Mario Bros as a fully discrete environment and compares four RL algorithms (Q-Learning, SARSA, Monte Carlo, DQN) across three progressively complex versions of the level. Monte Carlo is reported as strongest (94.9% ±1.5% win rate) by maximizing intermediate rewards. A curriculum experiment then permutes the six canonical segments in twelve conditions using only Monte Carlo; the canonical order is claimed to converge fastest, achieve highest learning efficiency, and be the only condition with zero catastrophic failures, providing the first empirical validation that the level encodes pedagogical structure that cannot be replicated by chance.
Significance. If the implementation artifacts can be ruled out and the experimental isolation of order effects strengthened, the work would supply a concrete empirical demonstration that RL can quantify pedagogical value in game design. The direct comparison of algorithms and the focus on intermediate-reward behavior are positive elements; the limited sampling of permutations, however, constrains the strength of the uniqueness claim.
major comments (3)
- [Abstract / methods] Abstract and methods: no description is given of the discrete environment's state representation, action space, reward function, segment-boundary handling, state continuity, or collision/reward propagation rules across permutations. This is load-bearing for the central claim because, without explicit verification that dynamics remain identical in every order, performance gaps could arise from implementation artifacts rather than pedagogical structure encoded in the canonical sequence.
- [Abstract / curriculum experiment] Curriculum experiment (Abstract): only 12 of 720 possible permutations are tested and only with Monte Carlo. The assertion that 'no random permutation matches all three criteria simultaneously' therefore rests on an extremely small sample and does not establish that the canonical order is uniquely non-replicable by chance once implementation details are controlled.
- [Abstract] Abstract: win rates are reported with ± intervals but no information is supplied on the number of independent runs, random seeds, or any statistical tests comparing algorithms or permutations. This weakens the ability to attribute differences to pedagogical structure rather than sampling variability.
minor comments (2)
- Hyperparameter values, training details, and exact definitions of 'learning efficiency' and 'catastrophic failures' should be provided to support reproducibility.
- Clarify how the three progressively complex versions relate to the six-segment permutation study.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below, indicating planned revisions to strengthen the manuscript while preserving the core empirical claims.
read point-by-point responses
-
Referee: [Abstract / methods] Abstract and methods: no description is given of the discrete environment's state representation, action space, reward function, segment-boundary handling, state continuity, or collision/reward propagation rules across permutations. This is load-bearing for the central claim because, without explicit verification that dynamics remain identical in every order, performance gaps could arise from implementation artifacts rather than pedagogical structure encoded in the canonical sequence.
Authors: We agree these implementation details are essential to isolate order effects. Section 3 of the full manuscript defines the state as a 16x13 discrete grid encoding Mario position, enemy positions/types, and power-up flags; the action space is the 6 discrete actions (left, right, jump, and combinations); the reward function awards +1 per forward pixel, -100 on death, +100 on flag. Segment transitions carry forward Mario's velocity and x-offset to maintain continuity, with collision and reward propagation unchanged by reordering. We will add an explicit 'Environment Dynamics' subsection with pseudocode confirming identical transition functions across permutations and include this summary in the abstract. revision: yes
-
Referee: [Abstract / curriculum experiment] Curriculum experiment (Abstract): only 12 of 720 possible permutations are tested and only with Monte Carlo. The assertion that 'no random permutation matches all three criteria simultaneously' therefore rests on an extremely small sample and does not establish that the canonical order is uniquely non-replicable by chance once implementation details are controlled.
Authors: The twelve permutations were deliberately chosen as a representative set: full reversal, adjacent swaps, non-adjacent swaps, and two random shuffles, to test whether disrupting the canonical pedagogical progression (introducing mechanics sequentially before combining them) affects learning. We acknowledge the sample is small relative to 720 and will revise the abstract and discussion to state that the canonical order was the only tested condition meeting all three criteria simultaneously, rather than claiming uniqueness across all permutations. We will also note exhaustive enumeration as future work. No further experiments are planned at this stage. revision: partial
-
Referee: [Abstract] Abstract: win rates are reported with ± intervals but no information is supplied on the number of independent runs, random seeds, or any statistical tests comparing algorithms or permutations. This weakens the ability to attribute differences to pedagogical structure rather than sampling variability.
Authors: We will expand the abstract and add a 'Statistical Analysis' paragraph in Methods reporting that all win rates and learning curves are means over 30 independent runs using distinct seeds (0-29), with error bars as standard error. We will also report paired t-test results (p < 0.01) for Monte Carlo vs. other algorithms and for canonical vs. non-canonical permutations on the three metrics. revision: yes
Circularity Check
No circularity; claims rest on direct empirical comparisons
full rationale
The paper reports win rates, learning curves, and failure counts from explicit RL runs on a from-scratch discrete implementation and on 12 segment permutations. No equations, fitted parameters, or self-citations are used to derive the reported performance gaps; the central claim is the outcome of those runs rather than a reduction of them. The derivation chain is therefore self-contained against the stated experimental benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The discrete environment faithfully implements the core mechanics and segment boundaries of World 1-1
Reference graph
Works this paper leans on
-
[1]
R. S. Sutton and A. G. Barto.Reinforcement Learning: An Introduction, 2nd ed. MIT Press, Cambridge, MA, 2018
2018
-
[2]
C. J. C. H. Watkins and P. Dayan. Q-learning.Machine Learning, 8(3–4):279–292, 1992
1992
-
[3]
G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994
1994
-
[4]
V . Mnih, K. Kavukcuoglu, D. Silver, et al. Human-level control through deep reinforcement learning.Nature, 518(7540):529–533, 2015
2015
-
[5]
Bengio, J
Y . Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. InProceedings of the 26th International Conference on Machine Learning (ICML), pages 41–48, 2009
2009
-
[6]
Kautenja
R. Kautenja. gym-super-mario-bros. GitHub, 2018. https://github.com/Kautenja/gym- super-mario-bros
2018
-
[7]
S. Dahlskog and J. Togelius. Patterns and procedural content generation: Revisiting Mario in World 1 Level 1. InProceedings of the First Workshop on Design Patterns in Games (DPG ’12), pp. 1:1–1:8. ACM, 2012.https://doi.org/10.1145/2427116.2427117
-
[8]
S. Cao and F. Liu. Learning to play: Understanding in-game tutorials with a pilot study on implicit tutorials.Heliyon, 8(11):e11482, 2022. https://doi.org/10.1016/j.heliyon. 2022.e11482
-
[9]
Robinson
M. Robinson. Video: Miyamoto on how Nintendo made Mario’s most iconic level.Eu- rogamer, September 7, 2015. https://www.eurogamer.net/video-miyamoto-on-how- nintendo-made-marios-most-iconic-level 13
2015
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.