Automatic Generation of High-Performance RL Environments

Chi Jin; Rahul Dev Appapogu; Seth Karten

arXiv: 2603.12145 · v2 · pith:UL3H575Ynew · submitted 2026-03-12 · cs.LG · cs.AI · cs.SE

Pith reviewed 2026-05-21 10:59 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.SE

keywords: reinforcement learning, environment generation, high-performance implementation, verification tests, sim-to-sim gap, automatic translation, JAX, Rust

The pith

A closed-loop methodology that pairs a generic prompt with verification tests generates equivalent high-performance RL environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a method to automatically convert complex reinforcement learning environments into fast implementations that run on high-performance backends. A generic prompt generates initial code, which then undergoes iterative checks through property tests, interaction tests, and rollout tests, plus cross-backend policy transfer to confirm a behavioral match. The process is shown in three workflows on five environments, from emulator translations to a newly created Pokemon TCG environment drawn from a web-extracted specification. With 200-million-parameter models, the generated environments add less than 4 percent overhead to training time. The verification confirms equivalence for all five environments, with no detectable sim-to-sim gap.

Core claim

A closed-loop methodology that produces equivalent high-performance environments for minimal compute cost. Our method uses a generic prompt template, hierarchical verification (property, interaction, and rollout tests), iterative repair, and cross-backend policy transfer to verify no sim-to-sim gap. We demonstrate three distinct workflows across five environments, and our closed-loop methodology confirms equivalence for all five environments.

What carries the argument

A closed-loop methodology: a generic prompt template, followed by hierarchical verification, iterative repair, and cross-backend policy transfer.
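A minimal sketch of how such a closed loop could be wired together, assuming the generator is a callable that accepts the previous failure message and each verification level is a callable that returns a failure description or None. The names closed_loop, toy_generate, and toy_levels are hypothetical and do not come from the paper; the toy generator simply "fixes" one bug per round.

```python
from typing import Callable, List, Optional, Tuple

# Each verification level returns None on success, or a short failure
# description that is fed back to the generator on the next round.
Check = Callable[[str], Optional[str]]


def closed_loop(
    generate: Callable[[Optional[str]], str],
    levels: List[Tuple[str, Check]],
    max_repairs: int = 5,
) -> Tuple[str, int]:
    """Generate a candidate, verify it level by level, repair on failure.

    Returns the accepted candidate and the number of repair rounds used.
    Raises RuntimeError if the repair budget is exhausted.
    """
    feedback: Optional[str] = None
    for round_idx in range(max_repairs + 1):
        candidate = generate(feedback)
        feedback = None
        for name, check in levels:
            failure = check(candidate)
            if failure is not None:
                # Stop at the first failing level; cheaper levels run first.
                feedback = f"{name} test failed: {failure}"
                break
        if feedback is None:
            return candidate, round_idx
    raise RuntimeError(f"not verified after {max_repairs} repairs: {feedback}")


# Toy stand-in for the code generator: each call yields a new version.
def toy_generate(feedback: Optional[str]) -> str:
    toy_generate.version += 1
    return f"env_v{toy_generate.version}"


toy_generate.version = 0

# Toy stand-ins for the three verification levels named in the paper.
toy_levels = [
    ("property", lambda c: None),
    ("interaction", lambda c: "timer drift" if c == "env_v1" else None),
    ("rollout", lambda c: "state mismatch at step 12" if c == "env_v2" else None),
]

accepted, repairs = closed_loop(toy_generate, toy_levels)
```

Ordering the levels from cheapest to most expensive means a failing property test short-circuits the costlier rollout comparison, which is one plausible reading of "hierarchical" here.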

If this is right

Complex RL environments can be translated to high-performance versions without months of manual engineering.
Environment overhead drops below 4 percent of training time at 200 million parameter scale.
New environments such as TCGJax can be created from web-extracted specifications for research use.
Equivalence can be established even when no prior high-performance reference implementation exists.
The same verification process works for direct translations, parity checks against existing fast versions, and novel environment synthesis.
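The "below 4 percent" figure above is a ratio of wall-clock times. A small sketch of that arithmetic, assuming overhead is defined as environment time divided by total iteration time (the page does not spell out the definition) and using made-up timings:

```python
def env_overhead_fraction(env_seconds: float, model_seconds: float) -> float:
    """Fraction of wall-clock training time spent stepping the environment."""
    total = env_seconds + model_seconds
    if total <= 0:
        raise ValueError("total time must be positive")
    return env_seconds / total


# Hypothetical timings for one training iteration (not taken from the paper).
fraction = env_overhead_fraction(env_seconds=0.03, model_seconds=0.97)
below_threshold = fraction < 0.04
```

Under this definition the fraction shrinks as the model grows, since model time increases while environment time per step stays roughly fixed.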

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could speed up experimentation by letting researchers prototype custom environments quickly before committing to manual optimization.
Generated environments might serve as controlled variants for studying how implementation details affect agent learning.
Using a private reference for TCGJax offers a practical way to create test environments free from public training data contamination.
The verification stack could be adapted to check equivalence in other simulation-based domains beyond RL.

Load-bearing premise

The combination of property, interaction, and rollout tests plus cross-backend policy transfer is sufficient to guarantee no sim-to-sim gap between original and generated environments.

What would settle it

Finding a measurable difference in policy behavior or throughput when the same agent is run on an original environment versus its generated counterpart in any of the five tested cases.
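One way to look for such a difference is to replay the same action sequence in both environments and report the first step where they disagree. The sketch below assumes deterministic environments exposed as pure step functions; first_divergence and the two toy counters are hypothetical names used only for illustration.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]
Step = Callable[[State, int], Tuple[State, float, bool]]


def first_divergence(
    ref_step: Step,
    gen_step: Step,
    initial_state: State,
    actions: List[int],
) -> Optional[int]:
    """Replay one action sequence in both environments.

    Returns the index of the first step whose (state, reward, done)
    differs, or None if the two rollouts agree on every step.
    """
    ref_state, gen_state = initial_state, initial_state
    for t, action in enumerate(actions):
        ref_out = ref_step(ref_state, action)
        gen_out = gen_step(gen_state, action)
        if ref_out != gen_out:
            return t
        ref_state, gen_state = ref_out[0], gen_out[0]
        if ref_out[2]:  # outputs matched, so the episode ended in both
            break
    return None


# Toy reference environment: a counter that terminates at 5.
def ref_counter(state: State, action: int) -> Tuple[State, float, bool]:
    value = state[0] + action
    return (value,), float(action), value >= 5


# Toy "generated" environment with an off-by-one termination bug.
def buggy_counter(state: State, action: int) -> Tuple[State, float, bool]:
    value = state[0] + action
    return (value,), float(action), value > 5


actions = [1, 1, 1, 1, 1, 1]
same = first_divergence(ref_counter, ref_counter, (0,), actions)
diff = first_divergence(ref_counter, buggy_counter, (0,), actions)
```

A passing replay only covers the states that the chosen action sequences reach, which is the limitation the referee report raises for large state spaces.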

Original abstract

Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a closed-loop methodology that produces equivalent high-performance environments for minimal compute cost. Our method uses a generic prompt template, hierarchical verification (property, interaction, and rollout tests), iterative repair, and cross-backend policy transfer to verify no sim-to-sim gap. We demonstrate three distinct workflows across five environments: (1) Direct translation (no prior performance implementation exists) from Game Boy emulator PyBoy to our EmuRust (via Rust IPC) and from Pokemon Showdown to our PokeJAX (via JAX); (2) Translation verified against existing performance implementations via throughput parity with Puffer Pong, MJX and Brax at matched GPU batch sizes; and (3) New environment creation: TCGJax, the first Pokemon TCG Pocket environment, created from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. Our closed-loop methodology confirms equivalence for all five environments. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for agent pretraining data concerns.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a closed-loop LLM pipeline for generating high-performance RL environments with verification, but equivalence claims rest on tests that may miss issues in large state spaces.

The main thing to know is that this paper describes a closed-loop method that uses prompts and verification to generate high-performance RL environments from descriptions or existing code, demonstrated on five cases including a new TCG environment. What is new is the combination of a generic prompt, hierarchical verification (property, interaction, and rollout tests), iterative repair, and cross-backend policy transfer to check that there is no sim-to-sim gap. Using TCGJax, built from a private specification, as a contamination control is also fresh.

The authors do well in showing throughput parity with established backends such as Brax and MJX, and in handling both translation and new-creation workflows. This directly addresses the engineering time sink of building custom simulators for RL.

The soft spots are in the verification strength. Although the checks are independent, the abstract gives no numbers on test coverage or failure modes, and in environments with large state spaces the tests could miss discrepancies, as noted in the stress test. That makes the equivalence confirmation feel preliminary rather than definitive.

This paper is for RL practitioners and researchers who need faster environment prototyping, especially those using JAX or similar frameworks for scaling. A reader focused on automating scientific code would also find the pipeline details useful. I would send it for peer review because the core idea is sound and the results point to a workable system, though referees will likely ask for more quantitative validation of the verification suite.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a closed-loop methodology for automatically generating high-performance RL environments from descriptions or existing implementations. It relies on a generic prompt template, hierarchical verification (property, interaction, and rollout tests), iterative repair, and cross-backend policy transfer to establish equivalence with no sim-to-sim gap. Demonstrations cover five environments: direct translation from PyBoy to EmuRust and Pokemon Showdown to PokeJAX, throughput parity verification against Puffer Pong/MJX/Brax, and creation of the new TCGJax environment from a web-extracted specification. The central claim is that this process confirms equivalence for all five environments at low compute cost, with environment overhead below 4% of training time at 200M parameters.

Significance. If the verification approach robustly guarantees equivalence, the work could meaningfully reduce the specialized engineering required for high-performance RL environments, enabling faster development and broader experimentation. The explicit use of cross-backend transfer and a contamination-control example (TCGJax) are constructive elements. The significance is tempered by the need for stronger quantitative support for verification completeness in complex environments.

major comments (2)

[§4.2] (Verification suite): The claim that property, interaction, and rollout tests plus cross-backend policy transfer jointly confirm equivalence for all five environments lacks quantitative failure rates, edge-case coverage statistics, or coverage analysis for combinatorially large state spaces (e.g., Pokemon Showdown and TCG). Finite test suites and transfer of a small number of policies can leave regions of the transition function unexamined, which directly undermines the 'no sim-to-sim gap' guarantee.

[§4.3] (Cross-backend policy transfer): The manuscript reports successful policy transfer but does not specify the number of policies tested, rollout horizons, or statistical measures of behavioral match. This detail is load-bearing for the equivalence claim in environments with long-horizon or high-dimensional dynamics.

minor comments (2)

[Abstract] The statement of 'throughput parity at matched GPU batch sizes' would be clearer with an explicit reference to the corresponding table or figure containing the measured values and variances.

[§3] (Methodology): The exact content of the 'generic prompt template' is not reproduced; including it (or a representative example) would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and have revised the manuscript to incorporate additional quantitative details where feasible while honestly acknowledging the inherent limitations of finite testing.

Referee: [§4.2] (Verification suite): The claim that property, interaction, and rollout tests plus cross-backend policy transfer jointly confirm equivalence for all five environments lacks quantitative failure rates, edge-case coverage statistics, or coverage analysis for combinatorially large state spaces (e.g., Pokemon Showdown and TCG). Finite test suites and transfer of a small number of policies can leave regions of the transition function unexamined, which directly undermines the 'no sim-to-sim gap' guarantee.

Authors: We acknowledge that no finite test suite can exhaustively cover combinatorially large state spaces and thus cannot mathematically guarantee the absence of any sim-to-sim gap in unexamined regions. Our hierarchical verification targets critical invariants, action-induced transitions, and multi-step behaviors most relevant to RL training. After iterative repair, all tests passed with zero failures across the five environments. We have revised §4.2 to report quantitative failure rates before repair (averaging 2.1% across environments), total test counts (exceeding 40,000 per environment), and concrete edge cases such as rare card combinations in TCG and boundary battle states in Pokemon Showdown. A complete coverage analysis remains intractable, but the practical equivalence is further supported by the cross-backend results.
Revision: partial

Referee: [§4.3] (Cross-backend policy transfer): The manuscript reports successful policy transfer but does not specify the number of policies tested, rollout horizons, or statistical measures of behavioral match. This detail is load-bearing for the equivalence claim in environments with long-horizon or high-dimensional dynamics.

Authors: We agree that these experimental details are necessary to substantiate the equivalence claim. The revised manuscript now specifies that four policies were transferred per environment, using rollout horizons of 100–1000 steps. Behavioral equivalence was quantified via Pearson correlation on cumulative rewards (r > 0.97) and Jensen-Shannon divergence on state visitation distributions (< 0.06). These metrics and the number of policies are now reported in the updated §4.3.
Revision: yes
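The two statistics named in the simulated rebuttal can be computed as follows. This is a sketch under stated assumptions: the rebuttal does not give the logarithm base for the Jensen-Shannon divergence (base 2 is assumed here, which bounds the value by 1), and all sample numbers below are invented for illustration rather than taken from the paper.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two discrete distributions (base 2)."""

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical cumulative rewards per policy on each backend.
returns_ref = [10.0, 12.0, 9.0, 15.0]
returns_gen = [10.5, 11.5, 9.0, 15.5]

# Hypothetical state-visitation frequencies over three coarse state bins.
visits_ref = [0.5, 0.3, 0.2]
visits_gen = [0.45, 0.35, 0.2]

r = pearson_r(returns_ref, returns_gen)
jsd = js_divergence(visits_ref, visits_gen)
passes = r > 0.97 and jsd < 0.06  # thresholds quoted in the rebuttal
```

With only four policies per environment, a correlation estimate has wide uncertainty, so the thresholds are better read as a sanity check than as a statistical guarantee.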

Circularity Check

0 steps flagged

Verification relies on external implementations and independent policy transfer rather than self-referential fitting or definition

The paper's central claim is that a closed-loop methodology (prompt template, hierarchical tests, iterative repair, cross-backend transfer) produces equivalent high-performance environments, confirmed empirically for five cases against external references (PyBoy, Pokemon Showdown, Puffer Pong, MJX, Brax). No derivation chain, equations, or predictions reduce to inputs by construction; equivalence is established via independent benchmarks and transfer checks, not tautological redefinition or fitted parameters renamed as results. One minor self-citation risk exists in methodology description but is not load-bearing for the equivalence claim, which remains externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM-generated code plus the described verification loop can achieve behavioral equivalence without systematic biases or missed edge cases; no new physical entities are introduced.

free parameters (1)

Verification thresholds: implicit cutoffs for passing property, interaction, and rollout tests that determine when repair stops.

axioms (1)

Domain assumption: hierarchical verification (property, interaction, rollout) plus cross-backend transfer suffices to detect any sim-to-sim gap. Invoked when claiming equivalence for all five environments.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence... cross-backend policy transfer confirms zero sim-to-sim gap"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 5 internal anchors

[1] C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem. Brax – a differentiable physics engine for large scale rigid body simulation. arXiv preprint arXiv:2106.13281,

[2] J. Grigsby, L. Fan, and Y. Zhu. Amago: Scalable in-context reinforcement learning for adaptive agents. arXiv preprint arXiv:2310.09971, 2023. 2

[3] J. Grigsby, Y. Xie, J. Sasek, S. Zheng, and Y. Zhu. Human-level competitive pokémon via scalable offline reinforcement learning with transformers. In Reinforcement Learning Conference (RLC), 2025. arXiv:2504.04395. 1, A.1

[4] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770,

[5] S. Karten, J. Grigsby, S. Milani, K. Vodrahalli, A. Zhang, F. Fang, Y. Zhu, and C. Jin. The pokéagent challenge: Competitive and long-context learning at scale. NeurIPS Competition Track, 2025. 1, A.1

[6] S. Karten, A. L. Nguyen, and C. Jin. Pokéchamp: an expert-level minimax language agent. arXiv preprint arXiv:2503.04094, 2025. 1, A.1

[7] S. Koyamada, S. Okano, S. Nishimori, Y. Murata, K. Habara, H. Kita, and S. Ishii. Pgx: Hardware-accelerated parallel game simulators for reinforcement learning. Advances in Neural Information Processing Systems, 36:45716–45743, 2023. 1, 2

[8] M.-A. Lachaux, B. Roziere, L. Chanussot, and G. Lample. Unsupervised translation of programming languages. arXiv preprint arXiv:2006.03511, 2020. 2

[9] R. T. Lange. gymnax: A jax-based reinforcement learning environment library. Version 0.0, 4, 2022. 1, 2, A.20

[10] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022. 2

[11] C. Lu, J. Kuba, A. Letcher, L. Metz, C. Schroeder de Witt, and J. Foerster. Discovered policy optimisation. Advances in Neural Information Processing Systems, 35:16455–16468,

[12] G. Luo. Pokémon showdown. https://github.com/smogon/pokemon-showdown, 2011. A.1

[13] Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023. 2

[14] M. Matthews, M. Beukman, B. Ellis, M. Samvelyan, M. Jackson, S. Coward, and J. Foerster. Craftax: A lightning-fast benchmark for open-ended reinforcement learning. arXiv preprint arXiv:2402.16801, 2024. 1, 2

[15] A. Petrenko, Z. Huang, T. Kumar, G. Sukhatme, and V. Koltun. Sample factory: Egocentric 3d control from pixels at 100000 fps with asynchronous reinforcement learning. In International Conference on Machine Learning, pages 7652–7662. PMLR, 2020. 2

[16] S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022. 2

[17] A. Rutherford, B. Ellis, M. Gallici, J. Cook, A. Lupu, G. Ingvarsson Juto, T. Willi, R. Hammond, A. Khan, C. Schroeder de Witt, et al. Jaxmarl: Multi-agent rl environments and algorithms in jax. Advances in Neural Information Processing Systems, 37:50925–50951,

[18] D. J. Schuirmann. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of pharmacokinetics and biopharmaceutics, 15(6):657–680, 1987. 4.3

[19] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 4

[20] J. Suarez. Pufferlib: Making reinforcement learning libraries and environments play nice. arXiv preprint arXiv:2406.12905, 2024. 1, 2, A.1

[21] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012. 2

[22] M. Towers, A. Kwiatkowski, J. Terry, J. U. Balis, G. De Cola, T. Deleu, M. Goulão, A. Kallinteris, M. Krimmel, A. KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2024. 2

[23] J. Weng, M. Lin, S. Huang, B. Liu, D. Makoviichuk, V. Makoviychuk, Z. Liu, Y. Song, T. Luo, Y. Jiang, et al. Envpool: A highly parallel reinforcement learning environment execution engine. Advances in Neural Information Processing Systems, 35:22409–22421,

[24] T. Xie, S. Zhao, C. H. Wu, Y. Liu, Q. Luo, V. Zhong, Y. Yang, and T. Yu. Text2reward: Reward shaping with language models for reinforcement learning. arXiv preprint arXiv:2309.11489, 2023. 2

[25] M. Ynddal. Pyboy: Game boy emulator written in python. https://github.com/Baekalfen/PyBoy, 2018. A.1

[26] C. Ziftci, S. Nikolov, A. Sjövall, B. Kim, D. Codecasa, and M. Kim. Migrating code at scale with llms at google. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, pages 162–173, 2025. 2

A Supplementary Details

This appendix provides additional tabl...
[27] Set up initial state across multiple modules

[28] Execute a sequence of operations

[29] Assert state changes in all affected modules

Focus on interactions where bugs were found during initial translation (timing drift between CPU and PPU was the most common failure mode).

B.4 Bug Repair Prompt

When Level 3 rollout comparison detects a divergence, this prompt structure feeds the failure back to the agent for root-cause analysis. Level 3 rollo...

[30] Check what instruction executes at PC=0x0267 in both implementations

[31] Compare memory writes in the PPU scanline that produces line 91

[32] Check if the VRAM divergence affects tile data or the background map

After identifying the bug, fix it and add a targeted Level 1 or Level 2 test that would have caught this failure.

C Performance Optimization Guide

After a translated environment passes all three verification levels, the next step is performance optimization. This appendix provides concre...
[33] Fixed-size state arrays. JAX requires array shapes to be known at compile time. Replace all dynamic-length data structures (lists, dicts with varying keys, variable-length arrays) with fixed-size jnp.ndarray fields padded to maximum capacity. Use a sentinel value (e.g., -1 or NO_CARD_ID) for unused slots. In TCG Pocket, this reduced card zone storage from P...
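The padding idea in item [33] can be shown without any JAX dependency. The sketch below uses plain Python lists as a stand-in for jnp.ndarray fields; NO_CARD_ID is the sentinel named in the text, while MAX_HAND_SIZE and the helper functions pad_zone, zone_count, and draw_card are hypothetical names chosen for illustration.

```python
from typing import List

NO_CARD_ID = -1      # sentinel for an unused slot (name from the text above)
MAX_HAND_SIZE = 10   # hypothetical capacity, chosen only for this sketch


def pad_zone(cards: List[int], capacity: int = MAX_HAND_SIZE) -> List[int]:
    """Return a fixed-length zone: real card IDs first, then sentinels."""
    if len(cards) > capacity:
        raise ValueError("zone exceeds its fixed capacity")
    return cards + [NO_CARD_ID] * (capacity - len(cards))


def zone_count(zone: List[int]) -> int:
    """Number of occupied slots, recovered from the sentinel padding."""
    return sum(1 for card in zone if card != NO_CARD_ID)


def draw_card(zone: List[int], card_id: int) -> List[int]:
    """Write a card into the first free slot without changing the length."""
    new_zone = list(zone)  # copy, mirroring JAX's functional updates
    for i, card in enumerate(new_zone):
        if card == NO_CARD_ID:
            new_zone[i] = card_id
            return new_zone
    raise ValueError("zone is full")


hand = pad_zone([17, 42, 5])
hand_after_draw = draw_card(hand, 99)
```

Because the length never changes, the same shape can be declared once at compile time; in a real JAX port the loop in draw_card would itself need a branchless formulation.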
[34] Branchless conditionals with jnp.where. Replace Python if/else with jnp.where(condition, true_val, false_val). Both branches are computed and the result is selected by mask; this is faster on GPU because it avoids warp divergence. For multi-way branches, use nested jnp.where or jax.lax.switch. Reserve jax.lax.cond for unbatched cases where one branch is signifi- ...

[35] vmap for batch parallelism. Write environment logic for a single instance, then apply jax.vmap to vectorize across the batch dimension. This generates fused GPU kernels that process all environments in one call. Mark shared constants (terrain maps, card databases) with in_axes=None so they are broadcast rather than duplicated: ...

[36] JIT the outer interface. Apply jax.jit to the vmapped step and reset functions so the entire batch operation compiles to a single GPU kernel. Pre-compile during initialization to avoid first-call latency during training:

    self._step_jit = jax.jit(step_batch)
    self._reset_jit = jax.jit(reset_batch)
    # Warmup: call once with dummy data
    _ = self._step_jit(dummy_s...

[37] lax.scan for multi-step fusion. When the training loop calls env.step inside a rollout loop, fuse the loop with jax.lax.scan to compile the entire rollout into one kernel. This eliminates per-step CPU→GPU dispatch overhead. In CartPole, this improved throughput by 3.2× over a Python loop calling jitted steps:

    def scan_body(states, actions_t):
        states, rewards, t...

[38] Minimize data types. Use int8 for categorical state (entity types, directions, flags) and float32 only for values requiring arithmetic. For example, using int8 for categorical entity fields can reduce per-environment state significantly, improving memory bandwidth utilization.

[39] Pre-allocate reward and observation buffers. Initialize all output arrays (rewards, terminals, observations) as zeros in the state. Update in-place with .at[].set() rather than creating new arrays. Avoid jnp.concatenate or jnp.stack in the hot path.

[40] Normalize observations at the source. Compute normalized observations inside the JIT-compiled step function rather than in a separate Python post-processing step. Pre-compute constant denominators:

    PADDLE_RANGE = MAX_PADDLE_Y - MIN_PADDLE_Y  # constant
    obs_paddle = (state.paddle_y - MIN_PADDLE_Y) / PADDLE_RANGE

C.2 Rust Optimization Checklist
[41] Rayon par_iter for environment parallelism. Use par_iter_mut (from rayon::prelude) to step all environments in parallel across CPU cores. Each environment is independent, making this embarrassingly parallel:

    self.emulators.par_iter_mut()
        .zip(actions.par_iter())
        .for_each(|(emu, &action)| emu.step(action));

This typically provides near-linear scaling up to the numbe...

[42] Pre-allocate observation buffers. Allocate observation, reward, and terminal buffers once at initialization, then reuse every step via slice copies. Avoid Vec::push or allocation in the step loop:

    let mut obs_buffer = vec![0u8; num_envs * OBS_SIZE];
    // In step(): copy directly into pre-allocated slice
    obs_buffer[i*OBS_SIZE..(i+1)*OBS_SIZE]
        .copy_from_slice(&emu...

[43] Frame skip without rendering. For emulator environments, implement a fast path that skips PPU/rendering for intermediate frames. Only render the final frame that produces the observation. In EmuRust, this saved ∼60% of per-step time at frame skip 24:

    emu.run_frames_no_render(frame_skip - 1); // fast path
    emu.run_frame(); // render last frame
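To make the frame-skip fast path in item [43] concrete, here is a toy model in Python (kept in the same language as the other added sketches, not the Rust of the original). The method names run_frames_no_render and run_frame come from the snippet above; ToyEmulator, its counters, and step_with_frame_skip are hypothetical.

```python
from typing import Optional


class ToyEmulator:
    """Stand-in emulator that counts how many frames were rendered."""

    def __init__(self) -> None:
        self.frame = 0
        self.renders = 0
        self.framebuffer: Optional[str] = None

    def _tick(self) -> None:
        self.frame += 1  # advance CPU/game logic by one frame

    def run_frames_no_render(self, n: int) -> None:
        """Fast path: advance n frames without touching the framebuffer."""
        for _ in range(n):
            self._tick()

    def run_frame(self) -> None:
        """Slow path: advance one frame and render it."""
        self._tick()
        self.renders += 1
        self.framebuffer = f"frame-{self.frame}"


def step_with_frame_skip(emu: ToyEmulator, frame_skip: int) -> Optional[str]:
    """One environment step: skip rendering except on the observed frame."""
    emu.run_frames_no_render(frame_skip - 1)
    emu.run_frame()
    return emu.framebuffer


emu = ToyEmulator()
obs = step_with_frame_skip(emu, frame_skip=24)
```

With frame skip 24, only one of the 24 frames is rendered per step, which is where the reported per-step savings would come from if rendering dominates frame cost.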
[44] Lookup tables for game mechanics. Replace computed game logic with pre-computed const arrays. For example, element-type effectiveness matrices, passability checks, and noise gradients can all be pre-computed as static lookup tables:

    const EFFECT_MATRIX: [[i32; 5]; 5] = [[1,1,1,1,1], ...];
    let damage_mult = EFFECT_MATRIX[atk_type][def_type];
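A small Python sketch of the same lookup-table pattern. The three-type chart and its multipliers are hypothetical values made up for this example (they are not the paper's EFFECT_MATRIX, whose contents are elided above); multipliers are stored as integer percentages so the arithmetic stays in integers.

```python
# Hypothetical type indices for a three-type chart.
FIRE, WATER, GRASS = 0, 1, 2

# EFFECT_TABLE[attacker][defender], built once at import time.
# Values are percentages: 200 = double damage, 50 = half damage.
EFFECT_TABLE = (
    (100, 50, 200),   # FIRE attacking FIRE, WATER, GRASS
    (200, 100, 50),   # WATER attacking FIRE, WATER, GRASS
    (50, 200, 100),   # GRASS attacking FIRE, WATER, GRASS
)


def damage(base: int, atk_type: int, def_type: int) -> int:
    """Scale base damage by a table lookup instead of branching logic."""
    return base * EFFECT_TABLE[atk_type][def_type] // 100


super_effective = damage(40, WATER, FIRE)
resisted = damage(40, FIRE, WATER)
neutral = damage(40, GRASS, GRASS)
```

Indexing a constant table replaces a chain of if/else comparisons with a single memory read, which is also the form that vectorizes cleanly on batched backends.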
[45] #[inline(always)] on hot functions. Mark observation writing, single-step physics, and reward computation as #[inline(always)] to eliminate function call overhead in tight loops. Profile first: only inline functions called millions of times per second.

[46] Arc<Vec<>> for shared immutable data. When each environment instance needs access to large immutable data (ROM images, card databases, terrain maps), wrap it in Arc and clone the reference:

    let rom = Arc::new(rom_data);
    let emulators: Vec<_> = (0..num_envs)
        .map(|_| Emulator::new(rom.clone()))
        .collect();

One copy in memory regardless of batch size.

[47] Compact struct layout. Separate hot data (accessed every step) from cold data (accessed occasionally). Keep entity structs small: use i32 instead of i64, and pack booleans into bitfields or i32 flags. This improves L1/L2 cache utilization.

[48] Efficient PyO3 bindings. For the Python ↔ Rust boundary: accept NumPy arrays via PyReadonlyArrayN (zero-copy read), return observations by writing directly into a pre-allocated NumPy array via PyArrayN::as_slice_mut(). Minimize the number of Python→Rust calls per step (one call for all environments, not one per environment).

C.3 Optimization Agent Prompt

The...
[49] Replace any remaining Python if/else on JAX values with jnp.where or jax.lax.cond

[50] Ensure all state arrays have static shapes (no dynamic allocation)

[51] Apply jax.vmap for batch parallelism over a single-instance step function

[52] Wrap the vmapped function with jax.jit

[53] Reduce data types: use int8 for categorical fields, float32 only for arithmetic

[54] Pre-compute observation normalization constants

[55] Profile with jax.profiler and eliminate remaining bottlenecks

[For Rust environments] Apply these optimizations in order:

[56] Add rayon dependency and parallelize step/reset with par_iter_mut

[57] Pre-allocate all output buffers (obs, rewards, terminals) at initialization

[58] Add #[inline(always)] to step, observation, and reward functions

[59] Replace computed game logic with const lookup tables where applicable

[60] Implement frame-skip fast path (skip rendering for intermediate frames)

[61] Use Arc<Vec<>> for shared immutable data across environments

[62] Profile with cargo flamegraph and eliminate remaining bottlenecks

After each optimization:

[63] Run the full test suite to verify correctness

[64] Measure SPS at batch sizes [32, 128, 512, 2048, 8192]

[65] Report the speedup from each change

Begin with a profiling analysis to identify the current bottleneck, then apply optimizations targeting that bottleneck first.

Algorithm 1 Hierarchical translation and verification. Require: Reference environment E_ref, modules {m_1, ..., m_K} in dependency order, test...
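The "after each optimization" steps in the prompt above (measure SPS at the listed batch sizes, then report the speedup) could be scripted along these lines. This is a sketch, assuming SPS means environment steps per second summed over the batch; measure_sps, toy_step_batch, and speedup_report are hypothetical helpers, and the toy step function stands in for a real batched environment.

```python
import time
from typing import Callable, Dict, List

BATCH_SIZES = [32, 128, 512, 2048, 8192]  # batch sizes listed in the prompt


def measure_sps(
    step_batch: Callable[[List[int]], List[int]],
    batch_size: int,
    num_steps: int = 20,
) -> float:
    """Steps per second = (batch_size * num_steps) / elapsed wall-clock time."""
    states = [0] * batch_size
    states = step_batch(states)  # warm-up call, excluded from timing
    start = time.perf_counter()
    for _ in range(num_steps):
        states = step_batch(states)
    elapsed = time.perf_counter() - start
    return batch_size * num_steps / max(elapsed, 1e-9)


def toy_step_batch(states: List[int]) -> List[int]:
    """Hypothetical batched step: increments every environment's counter."""
    return [s + 1 for s in states]


def speedup_report(
    before: Dict[int, float], after: Dict[int, float]
) -> Dict[int, float]:
    """Per-batch-size speedup of an optimized build over a baseline."""
    return {batch: after[batch] / before[batch] for batch in before}


sps = {batch: measure_sps(toy_step_batch, batch) for batch in BATCH_SIZES}
```

The warm-up call matters for JIT-compiled backends, where the first invocation includes compilation time and would otherwise distort the measurement.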

A Supplementary Details

This appendix provides additional tabl...

[27] Set up initial state across multiple modules

[28] Execute a sequence of operations

[29] Assert state changes in all affected modules

Focus on interactions where bugs were found during initial translation (timing drift between CPU and PPU was the most common failure mode).

B.4 Bug Repair Prompt

When Level 3 rollout comparison detects a divergence, this prompt structure feeds the failure back to the agent for root-cause analysis. Level 3 rollo...

[30] Check what instruction executes at PC=0x0267 in both implementations

[31] Compare memory writes in the PPU scanline that produces line 91

[32] Check if the VRAM divergence affects tile data or the background map

After identifying the bug, fix it and add a targeted Level 1 or Level 2 test that would have caught this failure.

C Performance Optimization Guide

After a translated environment passes all three verification levels, the next step is performance optimization. This appendix provides concre...

[33] Fixed-size state arrays. JAX requires array shapes to be known at compile time. Replace all dynamic-length data structures (lists, dicts with varying keys, variable-length arrays) with fixed-size jnp.ndarray fields padded to maximum capacity. Use a sentinel value (e.g., -1 or NO_CARD_ID) for unused slots. In TCG Pocket, this reduced card zone storage from P...
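The padding-plus-sentinel idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: `MAX_HAND`, `make_hand`, `draw_card`, and `hand_size` are hypothetical names, and only `NO_CARD_ID` is taken from the text above.

```python
import jax
import jax.numpy as jnp

MAX_HAND = 8      # hypothetical maximum capacity of the card zone
NO_CARD_ID = -1   # sentinel marking an unused slot

def make_hand(card_ids):
    """Pad a Python list of card ids into a fixed-size int32 array."""
    hand = jnp.full((MAX_HAND,), NO_CARD_ID, dtype=jnp.int32)
    return hand.at[: len(card_ids)].set(jnp.asarray(card_ids, dtype=jnp.int32))

@jax.jit
def draw_card(hand, card_id):
    """Write card_id into the first free slot; the array shape never changes."""
    free = hand == NO_CARD_ID
    slot = jnp.argmax(free)      # index of the first free slot (0 if none is free)
    has_room = jnp.any(free)
    return jnp.where(has_room, hand.at[slot].set(card_id), hand)

@jax.jit
def hand_size(hand):
    """Count occupied slots by comparing against the sentinel."""
    return jnp.sum(hand != NO_CARD_ID)
```

Because the array always has shape `(MAX_HAND,)`, the jitted functions compile once and are reused for every hand size.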

[34] Branchless conditionals with jnp.where. Replace Python if/else with jnp.where(condition, true_val, false_val). Both branches are computed and the result is selected by mask; this is faster on GPU because it avoids warp divergence. For multi-way branches, use nested jnp.where or jax.lax.switch. Reserve jax.lax.cond for unbatched cases where one branch is signifi- ...
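A small sketch of both styles, assuming a toy one-dimensional environment with positions 0 to 9. The names `apply_move` and `apply_move_switch` and the action encoding are illustrative assumptions, not taken from the paper.

```python
import jax
import jax.numpy as jnp

def apply_move(pos, action):
    """Single-environment update: action 0 = stay, 1 = left, 2 = right."""
    # Two-way branches: clamp at the walls without a Python if/else.
    left = jnp.where(pos > 0, pos - 1, pos)
    right = jnp.where(pos < 9, pos + 1, pos)
    # Multi-way branch: nested jnp.where selects among precomputed results.
    return jnp.where(action == 1, left, jnp.where(action == 2, right, pos))

def apply_move_switch(pos, action):
    """Same logic expressed with jax.lax.switch over a list of branches."""
    branches = [
        lambda p: p,
        lambda p: jnp.maximum(p - 1, 0),
        lambda p: jnp.minimum(p + 1, 9),
    ]
    return jax.lax.switch(action, branches, pos)

batched = jax.jit(jax.vmap(apply_move))
batched_switch = jax.jit(jax.vmap(apply_move_switch))
```

Both versions trace without Python control flow on JAX values, so they can be vmapped and jitted over a batch of environments.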

[35] vmap for batch parallelism. Write environment logic for a single instance, then apply jax.vmap to vectorize across the batch dimension. This generates fused GPU kernels that process all environments in one call. Mark shared constants (terrain maps, card databases) with in_axes=None so they are broadcast rather than duplicated: ...
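As a hedged illustration of the in_axes=None pattern, the sketch below vectorizes a toy single-instance step over a batch while sharing one terrain array. `step_single`, the 1-D track, and the cost-based reward are assumptions made for the example.

```python
import jax
import jax.numpy as jnp

def step_single(pos, action, terrain):
    """One environment: move along a 1-D track and read the terrain cost."""
    new_pos = jnp.clip(pos + action, 0, terrain.shape[0] - 1)
    reward = -terrain[new_pos]
    return new_pos, reward

# pos and action are batched on axis 0; terrain is shared (in_axes=None),
# so it is broadcast to every environment instead of being copied per instance.
step_batch = jax.jit(jax.vmap(step_single, in_axes=(0, 0, None)))
```

Calling `step_batch(pos, act, terrain)` with `pos` and `act` of shape `(num_envs,)` returns batched positions and rewards, while `terrain` keeps its unbatched shape.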

[36] JIT the outer interface. Apply jax.jit to the vmapped step and reset functions so the entire batch operation compiles to a single GPU kernel. Pre-compile during initialization to avoid first-call latency during training:

    self._step_jit = jax.jit(step_batch)
    self._reset_jit = jax.jit(reset_batch)
    # Warmup: call once with dummy data
    _ = self._step_jit(dummy_s...

[37] lax.scan for multi-step fusion. When the training loop calls env.step inside a rollout loop, fuse the loop with jax.lax.scan to compile the entire rollout into one kernel. This eliminates per-step CPU→GPU dispatch overhead. In CartPole, this improved throughput by 3.2× over a Python loop calling jitted steps:

    def scan_body(states, actions_t):
        states, rewards, t...
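Since the listing above is cut off, here is a self-contained sketch of the same pattern under simplifying assumptions: the toy `step_batch` (a counter state with the action as reward) and the `rollout` wrapper are hypothetical stand-ins, not the paper's CartPole code.

```python
import jax
import jax.numpy as jnp

def step_batch(states, actions):
    """Toy batched step: state is a counter, reward is the action taken."""
    new_states = states + actions
    rewards = actions.astype(jnp.float32)
    return new_states, rewards

def scan_body(states, actions_t):
    # Carry = environment states; per-step output = rewards for this timestep.
    states, rewards = step_batch(states, actions_t)
    return states, rewards

@jax.jit
def rollout(init_states, actions):
    """actions has shape (T, num_envs); the loop over T is compiled once."""
    final_states, rewards = jax.lax.scan(scan_body, init_states, actions)
    return final_states, rewards  # rewards has shape (T, num_envs)
```

The Python interpreter dispatches a single call to `rollout` rather than one call per timestep, which is where the dispatch savings described above would come from.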

[38] Minimize data types. Use int8 for categorical state (entity types, directions, flags) and float32 only for values requiring arithmetic. For example, using int8 for categorical entity fields can reduce per-environment state significantly, improving memory bandwidth utilization.
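To make the size difference concrete, the sketch below compares the byte footprint of a hypothetical state with int32 versus int8 categorical fields. The field names and `NUM_ENTITIES` are illustrative assumptions.

```python
import jax.numpy as jnp

NUM_ENTITIES = 64  # hypothetical entity count per environment

def make_state(dtype):
    """Build a state whose categorical fields use the given integer dtype."""
    return {
        "entity_type": jnp.zeros((NUM_ENTITIES,), dtype=dtype),
        "direction": jnp.zeros((NUM_ENTITIES,), dtype=dtype),
        "alive": jnp.ones((NUM_ENTITIES,), dtype=dtype),
        # Continuous quantities stay float32 because they need arithmetic.
        "position": jnp.zeros((NUM_ENTITIES, 2), dtype=jnp.float32),
    }

def state_bytes(state):
    """Total bytes occupied by all arrays in the state."""
    return sum(x.nbytes for x in state.values())

wide = state_bytes(make_state(jnp.int32))   # categorical fields as int32
narrow = state_bytes(make_state(jnp.int8))  # categorical fields as int8
```

In this toy layout the three categorical fields shrink from 768 bytes to 192 bytes per environment, while the float32 position array is unchanged.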

[39] Pre-allocate reward and observation buffers. Initialize all output arrays (rewards, terminals, observations) as zeros in the state. Update them with .at[].set(), which XLA can apply in place inside a jitted function, rather than creating new arrays. Avoid jnp.concatenate or jnp.stack in the hot path.
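A minimal sketch of this buffer pattern, assuming a hypothetical multi-agent state with `scores` and `rewards` fields; none of these names come from the paper.

```python
import jax
import jax.numpy as jnp

NUM_AGENTS = 4  # hypothetical number of agents

def init_state():
    """Allocate every output buffer once, as zeros, inside the state."""
    return {
        "scores": jnp.zeros((NUM_AGENTS,), dtype=jnp.int32),
        "rewards": jnp.zeros((NUM_AGENTS,), dtype=jnp.float32),
    }

@jax.jit
def step(state, agent, points):
    """Credit `points` to one agent by indexed updates on existing buffers."""
    scores = state["scores"].at[agent].add(points)
    # Overwrite one slot of the existing buffer instead of rebuilding the
    # array with jnp.stack or jnp.concatenate.
    rewards = state["rewards"].at[agent].set(points.astype(jnp.float32))
    return {"scores": scores, "rewards": rewards}
```

The buffers keep a fixed shape across steps, so the jitted `step` is compiled once and reused.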

[40] Normalize observations at the source. Compute normalized observations inside the JIT-compiled step function rather than in a separate Python post-processing step. Pre-compute constant denominators:

    PADDLE_RANGE = MAX_PADDLE_Y - MIN_PADDLE_Y  # constant
    obs_paddle = (state.paddle_y - MIN_PADDLE_Y) / PADDLE_RANGE

C.2 Rust Optimization Checklist

[41] Rayon par_iter for environment parallelism. Use par_iter_mut (from rayon::prelude) to step all environments in parallel across CPU cores. Each environment is independent, making this embarrassingly parallel:

    self.emulators.par_iter_mut()
        .zip(actions.par_iter())  // zip needs a parallel iterator, not .iter()
        .for_each(|(emu, &action)| emu.step(action));

This typically provides near-linear scaling up to the numbe...

[42] Pre-allocate observation buffers. Allocate observation, reward, and terminal buffers once at initialization, then reuse every step via slice copies. Avoid Vec::push or allocation in the step loop:

    let mut obs_buffer = vec![0u8; num_envs * OBS_SIZE];
    // In step(): copy directly into pre-allocated slice
    obs_buffer[i*OBS_SIZE..(i+1)*OBS_SIZE]
        .copy_from_slice(&emu...

[43] Frame skip without rendering. For emulator environments, implement a fast path that skips PPU/rendering for intermediate frames. Only render the final frame that produces the observation. In EmuRust, this saved ∼60% of per-step time at frame skip 24:

    emu.run_frames_no_render(frame_skip - 1); // fast path
    emu.run_frame();                          // render last frame

[44] Lookup tables for game mechanics. Replace computed game logic with pre-computed const arrays. For example, element-type effectiveness matrices, passability checks, and noise gradients can all be pre-computed as static lookup tables:

    const EFFECT_MATRIX: [[i32; 5]; 5] = [[1,1,1,1,1], ...];
    let damage_mult = EFFECT_MATRIX[atk_type][def_type];

[45] #[inline(always)] on hot functions. Mark observation writing, single-step physics, and reward computation as #[inline(always)] to eliminate function call overhead in tight loops. Profile first: only inline functions called millions of times per second.

[46] Arc<Vec<T>> for shared immutable data. When each environment instance needs access to large immutable data (ROM images, card databases, terrain maps), wrap it in Arc and clone the reference:

    let rom = Arc::new(rom_data);
    let emulators: Vec<_> = (0..num_envs)
        .map(|_| Emulator::new(rom.clone()))
        .collect();

One copy in memory regardless of batch size.

[47] Compact struct layout. Separate hot data (accessed every step) from cold data (accessed occasionally). Keep entity structs small: use i32 instead of i64, and pack booleans into bitfields or i32 flags. This improves L1/L2 cache utilization.

[48] Efficient PyO3 bindings. For the Python ↔ Rust boundary: accept NumPy arrays via PyReadonlyArrayN (zero-copy read), return observations by writing directly into a pre-allocated NumPy array via PyArrayN::as_slice_mut(). Minimize the number of Python→Rust calls per step (one call for all environments, not one per environment).

C.3 Optimization Agent Prompt

The...

[49] Replace any remaining Python if/else on JAX values with jnp.where or jax.lax.cond

[50] Ensure all state arrays have static shapes (no dynamic allocation)

[51] Apply jax.vmap for batch parallelism over a single-instance step function

[52] Wrap the vmapped function with jax.jit

[53] Reduce data types: use int8 for categorical fields, float32 only for arithmetic

[54] Pre-compute observation normalization constants

[55] Profile with jax.profiler and eliminate remaining bottlenecks

[For Rust environments] Apply these optimizations in order:

[56] Add rayon dependency and parallelize step/reset with par_iter_mut

[57] Pre-allocate all output buffers (obs, rewards, terminals) at initialization

[58] Add #[inline(always)] to step, observation, and reward functions

[59] Replace computed game logic with const lookup tables where applicable

[60] Implement frame-skip fast path (skip rendering for intermediate frames)

[61] Use Arc<Vec<T>> for shared immutable data across environments

[62] Profile with cargo flamegraph and eliminate remaining bottlenecks

After each optimization:

[63] Run the full test suite to verify correctness

[64] Measure SPS at batch sizes [32, 128, 512, 2048, 8192]

[65] Report the speedup from each change

Begin with a profiling analysis to identify the current bottleneck, then apply optimizations targeting that bottleneck first.

Algorithm 1 Hierarchical translation and verification.

Require: Reference environment $E_{\text{ref}}$, modules $\{m_1, \ldots, m_K\}$ in dependency order, test...
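The measurement step of the optimization prompt (SPS at batch sizes [32, 128, 512, 2048, 8192]) could be scripted roughly as below. This is a sketch under assumptions: `step_single` is a placeholder for a real environment step, and `measure_sps` and `num_steps` are hypothetical names, not part of the paper's tooling.

```python
import time
import jax
import jax.numpy as jnp

def step_single(state, action):
    """Placeholder single-environment step (hypothetical dynamics)."""
    return state + action

step_batch = jax.jit(jax.vmap(step_single))

def measure_sps(batch_size, num_steps=100):
    """Return environment steps per second for one batch size."""
    states = jnp.zeros((batch_size,), dtype=jnp.float32)
    actions = jnp.ones((batch_size,), dtype=jnp.float32)
    step_batch(states, actions).block_until_ready()  # warmup: compile first
    start = time.perf_counter()
    for _ in range(num_steps):
        states = step_batch(states, actions)
    states.block_until_ready()  # wait for asynchronous dispatch to finish
    elapsed = time.perf_counter() - start
    return batch_size * num_steps / elapsed

results = {b: measure_sps(b) for b in [32, 128, 512, 2048, 8192]}
```

Running the same script before and after each optimization gives the per-change speedup that the prompt asks the agent to report. The warmup call and `block_until_ready` keep compilation time and asynchronous dispatch from distorting the timing.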