Position: RL Researchers Need to Distinguish Between Solving Simulators and Using Simulators as a Proxy

Esraa Elelimy; Martha White; Matthew Vandergrift

arxiv: 2606.28433 · v1 · pith:2PJ7YCWOnew · submitted 2026-06-26 · 💻 cs.LG

Position: RL Researchers Need to Distinguish Between Solving Simulators and Using Simulators as a Proxy

Matthew Vandergrift , Esraa Elelimy , Martha White This is my paper

Pith reviewed 2026-06-30 01:26 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learningsimulatorsbenchmarksdeploymentproxy evaluationresearch practicessequential decision making

0 comments

The pith

RL researchers need to distinguish between solving simulators and using simulators as a proxy for learning in deployment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that reinforcement learning experiments often blur two separate goals when using simulators. One goal is to achieve high performance inside the simulator environment itself. The other is to develop agents whose learned behavior will transfer to deployment settings where the simulator is no longer available. Conflating these goals leads to algorithms, constraints, and metrics that fit one purpose but produce misleading results for the other. The authors illustrate the problems with examples and simple experiments, and call for researchers to state explicitly which use case they intend.

Core claim

The central claim is that RL researchers need to distinguish between two use cases of simulators: solving simulators and using simulators as a proxy for learning in deployment. These two settings differ in the constraints placed on how the agent may interact with the simulator, in the algorithms that are appropriate, and in the evaluation metrics that make sense. Failing to keep the distinction clear allows solutions meant only for the simulator to be presented as progress toward deployable agents and produces misleading conclusions about research progress, as shown by examples and simple experiments. The work is a call for the community to begin stating clearly how simulators are being used

What carries the argument

The distinction between solving simulators (optimizing performance inside the given environment) and using simulators as a proxy (developing policies intended for use outside the simulator in deployment).

If this is right

Algorithms developed under one use case may violate the access constraints of the other.
Evaluation metrics chosen for simulator solving can overstate progress toward deployment.
Research conclusions about general-purpose sequential decision making can rest on methods that only work inside the simulator.
Published work would become clearer if authors stated which use case they address.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmark suites could add separate tracks or reporting requirements for each use case to reduce mismatched comparisons.
The distinction highlights why some high simulator scores fail to translate when the simulator is removed at test time.
Similar separation of goals may be useful in other simulation-heavy fields such as robotics control.

Load-bearing premise

That failing to distinguish the two simulator use cases produces issues and misleading conclusions in RL research.

What would settle it

An experiment in which methods developed under one use case are applied to the other and no differences appear in outcomes, constraints, or conclusions.

Figures

Figures reproduced from arXiv: 2606.28433 by Esraa Elelimy, Martha White, Matthew Vandergrift.

**Figure 1.** Figure 1: Example of a learning curve that presents a reasonable solution when solving a simulator but presents a bad solution when using simulators as a proxy. The solid line is the average online performance across 20 runs and the shaded area is the standard error. There is a lot of room for nuance in this evaluation of learning in deployment, given different use cases. It can be reasonable to focus on performanc… view at source ↗

**Figure 2.** Figure 2 [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Demonstration of how resetting can lead to a better performing final policy for PPO compared to PPO without resets. Notably the policy maintains increased performance when evaluated without resets. We can observe the performance left on the table by not using resets in an example simulator, MountainCar-v0 (Lange, 2022). For simplicity, we use a naive resetting mechanism, resetting the agent to a random s… view at source ↗

**Figure 4.** Figure 4: Contrasting the performance of SAC and DQN when using online evaluation versus evaluating deterministic policy in evaluation episodes. The solid lines are the average across 30 runs for SAC and 50 runs for DQN and the shaded region is the standard error. Most RL papers, however, use evaluation rollouts, including in most papers using SAC (Haarnoja et al., 2018), DQN (Mnih et al., 2015), DDPG (Lillicrap et… view at source ↗

read the original abstract

One goal in reinforcement learning (RL) research is to understand general-purpose sequential decision-making, using benchmark simulators as a proxy for learning in deployment settings. When running experiments, however, the goal of achieving high performance in the simulator can mutate into focusing exclusively on solving the simulator. To achieve high scores, researchers may adopt solutions exclusively meant for solving simulators, rather than learning while the agent is deployed outside a simulator. Solving simulators is also worthy of investigation, but it is a fundamentally different RL research question. In this paper, we argue that RL researchers need to distinguish between two use cases of simulators: solving simulators and using simulators as a proxy for learning in deployment. We first discuss how these two use-cases are importantly different, in terms of constraints on how the agent can use the simulator, which algorithms are appropriate, and which evaluation metrics are appropriate. We then highlight several issues and misleading conclusions that can occur by not making the distinction between these two settings clear, supported with examples and simple experiments. This work is a call to the community to begin clearly distinguishing how they are using simulators in their work, hopefully sparking further discussion on which empirical practices work best in each setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper usefully pushes RL work to label whether a simulator run is about beating that sim or about learning for later deployment, and the distinction holds up.

read the letter

The main takeaway is that RL experiments often slide from using a simulator as a stand-in for real deployment into just solving the simulator itself, and the authors want papers to state which one they are doing. They lay out the differences clearly: solving the simulator permits heavy overfitting, unlimited resets, and metrics that reward simulator-specific tricks, while the proxy case restricts those and needs different evaluation. The examples they give show how mixing the two can produce misleading claims about generalization or efficiency.

The paper does this without overclaiming. The argument is definitional and practical rather than a new theorem, and the simple experiments serve as illustrations rather than proof. That keeps the tone proportionate.

The soft spot is that the work stays at the level of examples and does not measure how common the conflation is across recent RL papers or how often it actually changes conclusions. A reader could agree with the distinction yet still wonder whether the problem is large enough to change reviewing standards. No internal contradictions appear in the logic.

This is for RL researchers who run simulator benchmarks and for reviewers who evaluate those papers. Someone already thinking about experimental hygiene will get value from the framing. It deserves peer review because the issue affects a large part of the field and a published version could influence norms even if it remains a position piece.

Referee Report

0 major / 0 minor

Summary. This position paper claims that reinforcement learning (RL) research often conflates two distinct uses of simulators: solving the simulator (achieving high performance within it) versus using it as a proxy for learning deployable policies. The authors argue these use cases differ in terms of simulator access constraints, appropriate algorithms, and evaluation metrics. They support this by discussing the differences and highlighting issues from not distinguishing them, backed by examples and simple experiments, calling for clearer practices in the community.

Significance. If adopted, the proposed distinction would improve clarity in RL experimental design and reduce the risk of misleading conclusions from simulator-specific optimizations that do not transfer. The manuscript earns credit for its explicit enumeration of differences across constraints, algorithms, and metrics, as well as for grounding the normative argument in illustrative examples rather than unsubstantiated assertion.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their thoughtful review and positive recommendation to accept the manuscript. Their summary accurately captures the core argument of the position paper.

Circularity Check

0 steps flagged

No circularity: position paper with no derivation chain

full rationale

This is a normative position paper arguing that RL researchers should distinguish 'solving simulators' from 'using simulators as a proxy for deployment.' It enumerates differences in constraints, algorithms, and metrics, then illustrates issues with examples and simple experiments. No equations, fitted parameters, theorems, or quantitative predictions are asserted that could reduce to inputs by construction. The load-bearing content is definitional and argumentative rather than deductive, so no self-definitional, fitted-input, or self-citation circularity applies. The paper is self-contained against external benchmarks as a call for clearer terminology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on domain assumptions about RL research practices rather than new mathematical axioms or entities.

axioms (1)

domain assumption Current RL research practices often conflate solving simulators with using them as proxies.
This is the premise the position builds upon, extracted from the abstract's discussion of issues arising from not distinguishing.

pith-pipeline@v0.9.1-grok · 5746 in / 987 out tokens · 31585 ms · 2026-06-30T01:26:46.795745+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 6 canonical work pages · 2 internal anchors

[1]

Akhilan Boopathy, Aneesh Muppidi, Peggy Yang, Abhiram Iyer, William Yue, and Ila Fiete

Bonnet, C., Luo, D., Byrne, D., Surana, S., Abramowitz, S., Duckworth, P., Coyette, V ., Midgley, L. I., Tegegn, E., Kalloniatis, T., Mahjoub, O., Macfarlane, M., Smit, A. P., Grinsztajn, N., Boige, R., Waters, C. N., Mimouni, M. A., Sob, U. A. M., de Kock, R., Singh, S., Furelos-Blanco, D., Le, V ., Pretorius, A., and Laterre, A. Jumanji: a diverse suite...

work page arXiv
[2]

Castro, P. S. The formalism-implementation gap in reinforcement learning research.arXiv preprint arXiv:2510.16175,

work page arXiv
[3]

Go-explore: a new approach for hard-exploration problems.arXiv preprint arXiv:1901.10995, 2019

Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., and Clune, J. Go-explore: a new approach for hard- exploration problems.arXiv preprint arXiv:1901.10995,

work page arXiv 1901
[4]

Jayawardana, V ., Tang, C., Li, S., Suo, D., and Wu, C

URL https: //iclr-blog-track.github.io/2022/03/ 25/ppo-implementation-details/. Jayawardana, V ., Tang, C., Li, S., Suo, D., and Wu, C. The impact of task underspecification in evaluating deep rein- forcement learning. InAdvances in Neural Information Processing Systems,

2022
[5]

Q., Wu, N., et al

Luo, J., Paduraru, C., V oicu, O., Chervonyi, Y ., Munns, S., Li, J., Qian, C., Dutta, P., Davis, J. Q., Wu, N., et al. Con- trolling commercial cooling systems using reinforcement learning.arXiv preprint arXiv:2211.07357,

work page arXiv
[6]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

MinAtar: An Atari-Inspired Testbed for Thorough and Reproducible Reinforcement Learning Experiments

Young, K. and Tian, T. Minatar: An atari-inspired testbed for thorough and reproducible reinforcement learning experiments.arXiv preprint arXiv:1903.03176,

work page internal anchor Pith review Pith/arXiv arXiv 1903
[8]

The hyper- parameters used are the same hyperparameters as (Dohare et al., 2023). A.2. PQN Experiment To produce figure 2 we first ran PQN using the default hy- perparameters as specified in the configuration files on its official Github page (Gallici et al., 2025). We then changed only the number of environments to 1 within the configu- ration to produce...

2023
[9]

For both algorithms we report the online return. A.4. Evaluation Rollouts Experiment For this experiment we ran SAC and DQN using both online and evaluation rollouts/episodes. For SAC our evaluation episodes consisted of deterministic action selection rather Hyperparameter Name Value Observation Normalization False Reward Normalization False Advantage Nor...

2048
[10]

We used the Half-Cheetah MuJoCo environ- ment (Todorov et al., 2012)

than sampling. We used the Half-Cheetah MuJoCo environ- ment (Todorov et al., 2012). We run for five million steps over 30 seeds, plotting both the evaluation episode returns and the online return. The hyperparameters used for the SAC experiments in Figure 4 are shown in Table

2012
[11]

The hyperparameters used are listed in table

in Gymnax (Lange, 2022), running for 4,000,000 steps, plotting the evaluation and online return. The hyperparameters used are listed in table

2022
[12]

13 Position: RL Researchers Need to Distinguish Between Solving Simulators and Using Simulators as a Proxy Hyperparameter Name Value Observation Normalization False Reward Normalization True Advantage Normalization True Gradient Clipping False Value Loss Clipping False Rollout Length 2048 Epochs 10 Minibatch size 256 λ0.95 Discount factor,γ0.99 Clip Coeff...

2048

[1] [1]

Akhilan Boopathy, Aneesh Muppidi, Peggy Yang, Abhiram Iyer, William Yue, and Ila Fiete

Bonnet, C., Luo, D., Byrne, D., Surana, S., Abramowitz, S., Duckworth, P., Coyette, V ., Midgley, L. I., Tegegn, E., Kalloniatis, T., Mahjoub, O., Macfarlane, M., Smit, A. P., Grinsztajn, N., Boige, R., Waters, C. N., Mimouni, M. A., Sob, U. A. M., de Kock, R., Singh, S., Furelos-Blanco, D., Le, V ., Pretorius, A., and Laterre, A. Jumanji: a diverse suite...

work page arXiv

[2] [2]

Castro, P. S. The formalism-implementation gap in reinforcement learning research.arXiv preprint arXiv:2510.16175,

work page arXiv

[3] [3]

Go-explore: a new approach for hard-exploration problems.arXiv preprint arXiv:1901.10995, 2019

Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., and Clune, J. Go-explore: a new approach for hard- exploration problems.arXiv preprint arXiv:1901.10995,

work page arXiv 1901

[4] [4]

Jayawardana, V ., Tang, C., Li, S., Suo, D., and Wu, C

URL https: //iclr-blog-track.github.io/2022/03/ 25/ppo-implementation-details/. Jayawardana, V ., Tang, C., Li, S., Suo, D., and Wu, C. The impact of task underspecification in evaluating deep rein- forcement learning. InAdvances in Neural Information Processing Systems,

2022

[5] [5]

Q., Wu, N., et al

Luo, J., Paduraru, C., V oicu, O., Chervonyi, Y ., Munns, S., Li, J., Qian, C., Dutta, P., Davis, J. Q., Wu, N., et al. Con- trolling commercial cooling systems using reinforcement learning.arXiv preprint arXiv:2211.07357,

work page arXiv

[6] [6]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

MinAtar: An Atari-Inspired Testbed for Thorough and Reproducible Reinforcement Learning Experiments

Young, K. and Tian, T. Minatar: An atari-inspired testbed for thorough and reproducible reinforcement learning experiments.arXiv preprint arXiv:1903.03176,

work page internal anchor Pith review Pith/arXiv arXiv 1903

[8] [8]

The hyper- parameters used are the same hyperparameters as (Dohare et al., 2023). A.2. PQN Experiment To produce figure 2 we first ran PQN using the default hy- perparameters as specified in the configuration files on its official Github page (Gallici et al., 2025). We then changed only the number of environments to 1 within the configu- ration to produce...

2023

[9] [9]

For both algorithms we report the online return. A.4. Evaluation Rollouts Experiment For this experiment we ran SAC and DQN using both online and evaluation rollouts/episodes. For SAC our evaluation episodes consisted of deterministic action selection rather Hyperparameter Name Value Observation Normalization False Reward Normalization False Advantage Nor...

2048

[10] [10]

We used the Half-Cheetah MuJoCo environ- ment (Todorov et al., 2012)

than sampling. We used the Half-Cheetah MuJoCo environ- ment (Todorov et al., 2012). We run for five million steps over 30 seeds, plotting both the evaluation episode returns and the online return. The hyperparameters used for the SAC experiments in Figure 4 are shown in Table

2012

[11] [11]

The hyperparameters used are listed in table

in Gymnax (Lange, 2022), running for 4,000,000 steps, plotting the evaluation and online return. The hyperparameters used are listed in table

2022

[12] [12]

13 Position: RL Researchers Need to Distinguish Between Solving Simulators and Using Simulators as a Proxy Hyperparameter Name Value Observation Normalization False Reward Normalization True Advantage Normalization True Gradient Clipping False Value Loss Clipping False Rollout Length 2048 Epochs 10 Minibatch size 256 λ0.95 Discount factor,γ0.99 Clip Coeff...

2048