Drowning in Routine: Signal Dilution in Multi-Turn Agent Training

arxiv: 2606.22164 · v1 · pith:SOXVMNK6new · submitted 2026-06-20 · 💻 cs.LG

Yann Pernot (1, 2), Vi Retault (2) ((1) Mila - Québec AI Institute, (2) Polytechnique Montréal)

Pith reviewed 2026-06-26 12:15 UTC · model grok-4.3

classification 💻 cs.LG

keywords: multi-turn agents, signal dilution, decision density, trajectory-level estimators, credit assignment, reinforcement learning, GRPO

The pith

Routine turns dilute training signals in multi-turn agents: the turn-level to trajectory-level signal-to-noise ratio scales as the inverse square root of decision density.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the cost of credit assignment in multi-turn agent training is driven by decision density rather than horizon length alone. Decision density ρ is the fraction of turns whose actions change the return distribution; the remaining turns are routine and reward-equivalent. At low density, routine turns add gradient variance to trajectory-level estimators such as GRPO without contributing signal. Under explicit assumptions, and with critic error controlled, the turn-level to trajectory-level signal-to-noise ratio scales as ρ^{-1/2}. The analysis also identifies the high-density regime in which trajectory-level methods can compete without a critic, and an experiment with exactly tunable density recovers the predicted scaling with R² = 0.999.
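A heuristic variance count makes the ρ^{-1/2} scaling plausible. The sketch below is not the paper's derivation. It assumes T turns of which ρT are decision turns, an expected gradient of norm μ shared by both estimators, an exact turn-level advantage that vanishes on routine turns, and a common per-turn noise scale σ². The symbols T, μ, and σ² are introduced here purely for illustration.

```latex
% Illustrative assumptions (not taken from the paper): T turns, \rho T decision
% turns, shared expected gradient of norm \mu, per-turn noise scale \sigma^2.
\begin{align*}
\operatorname{tr}\operatorname{Cov}(\hat g_{\mathrm{traj}}) &\approx \sigma^2 T
  && \text{(every turn's score term is weighted by the noisy return)} \\
\operatorname{tr}\operatorname{Cov}(\hat g_{\mathrm{turn}}) &\approx \sigma^2 \rho T
  && \text{(routine turns have zero advantage and drop out)} \\
\mathrm{SNR} &= \frac{\mu}{\sqrt{\operatorname{tr}\operatorname{Cov}(\hat g)}} \\
\frac{\mathrm{SNR}_{\mathrm{turn}}}{\mathrm{SNR}_{\mathrm{traj}}}
  &\approx \sqrt{\frac{\sigma^2 T}{\sigma^2 \rho T}} = \rho^{-1/2}
\end{align*}
```

Under these assumptions the horizon T cancels and only the density ρ survives, which is the sense in which density rather than length sets the cost.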

Core claim

Multi-turn agents interleave consequential decisions with routine execution: some actions change the downstream return distribution, while others are necessary but reward-equivalent. The cost of trajectory-level credit assignment is governed by decision density ρ, the fraction of turns whose actions affect the return. When decision density is low, routine turns create signal dilution by adding gradient variance to trajectory-level estimators such as GRPO without adding expected signal. Under explicit assumptions, the resulting turn-level to trajectory-level signal-to-noise ratio scales as ρ^{-1/2}, provided critic error remains controlled. The same analysis identifies the complementary regime: at high decision density, trajectory-level methods can remain competitive while avoiding the cost of a critic.

What carries the argument

Decision density ρ, the fraction of turns whose actions affect the return distribution, which determines the degree of signal dilution in trajectory-level estimators.

If this is right

At low decision density, trajectory-level methods suffer from signal dilution due to added variance from routine turns.
At high decision density, trajectory-level methods can remain competitive without requiring a value critic.
The turn-level to trajectory-level signal-to-noise ratio scales as ρ^{-1/2} when critic error is controlled.
In environments where decision density is tunable, the predicted scaling relation is observed with R² = 0.999.
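The consequences listed above can be illustrated with a small Monte Carlo experiment. The sketch below is a toy of our own construction, not the paper's environment. It assumes a fixed number of decision turns padded with routine turns, an independent Bernoulli(p) policy at every turn, a return that counts decision turns with action 1, an exact mean baseline for the trajectory-level estimator (a simplification of GRPO's group baseline), and an exact advantage for the turn-level estimator. The function name `snr_pair` and all parameters are hypothetical.

```python
import math
import random


def snr_pair(num_decision, num_routine, p=0.3, samples=4000, seed=0):
    """Monte Carlo estimate of (SNR_traj, SNR_turn) in the toy model.

    SNR is ||mean gradient|| / sqrt(sum of per-coordinate variances).
    """
    rng = random.Random(seed)
    total = num_decision + num_routine
    sums = {"traj": [0.0] * total, "turn": [0.0] * total}
    squares = {"traj": [0.0] * total, "turn": [0.0] * total}
    for _ in range(samples):
        # Score of a Bernoulli(p) policy with respect to its logit: a_t - p.
        scores = [(1.0 if rng.random() < p else 0.0) - p for _ in range(total)]
        # Centred return R - E[R]; routine turns never enter it.
        centred_return = sum(scores[:num_decision])
        for t, score in enumerate(scores):
            g_traj = centred_return * score  # same weight on every turn
            g_turn = score * score if t < num_decision else 0.0  # exact advantage
            sums["traj"][t] += g_traj
            squares["traj"][t] += g_traj * g_traj
            sums["turn"][t] += g_turn
            squares["turn"][t] += g_turn * g_turn

    def snr(kind):
        means = [s / samples for s in sums[kind]]
        variances = [q / samples - m * m for q, m in zip(squares[kind], means)]
        return math.sqrt(sum(m * m for m in means)) / math.sqrt(sum(variances))

    return snr("traj"), snr("turn")


NUM_DECISION = 4
ratios = {}
for num_routine in (0, 12, 60):  # rho = 1, 1/4, 1/16
    rho = NUM_DECISION / (NUM_DECISION + num_routine)
    snr_traj, snr_turn = snr_pair(NUM_DECISION, num_routine)
    ratios[rho] = snr_turn / snr_traj
    print(f"rho={rho:.4f}  SNR_turn/SNR_traj={ratios[rho]:.2f}  rho^-0.5={rho ** -0.5:.2f}")
```

In this toy the ratio grows by roughly a factor of four when ρ falls from 1 to 1/16, in line with ρ^{-1/2}. The constant offset at ρ = 1 is a property of the toy and carries no significance for the paper's environment.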

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Environments or agent designs that increase the fraction of impactful turns could reduce reliance on separate critics during training.
The dilution analysis may extend to other sequential tasks with sparse impactful actions, such as long-horizon planning with many maintenance steps.
Hybrid estimators could monitor estimated decision density to decide dynamically between trajectory-level and turn-level updates.
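The last extension, a hybrid that switches on estimated density, could look roughly like the sketch below. Everything in it is hypothetical: the names `estimate_decision_density` and `choose_estimator`, the advantage threshold, and the density cutoff are illustrative choices, not quantities from the paper. It also assumes per-turn advantage estimates are already available, which in practice requires some form of critic.

```python
from typing import Sequence


def estimate_decision_density(advantages: Sequence[float], threshold: float = 0.05) -> float:
    """Fraction of turns whose estimated advantage magnitude exceeds a threshold."""
    if not advantages:
        raise ValueError("need at least one turn")
    impactful = sum(1 for a in advantages if abs(a) > threshold)
    return impactful / len(advantages)


def choose_estimator(advantages: Sequence[float],
                     density_cutoff: float = 0.5,
                     threshold: float = 0.05) -> str:
    """Pick trajectory-level updates at high estimated density, turn-level otherwise."""
    rho_hat = estimate_decision_density(advantages, threshold)
    return "trajectory" if rho_hat >= density_cutoff else "turn"


# Example: two impactful turns out of eight gives an estimated density of 0.25.
sparse = [0.0, 0.0, 0.3, 0.0, -0.4, 0.0, 0.0, 0.0]
dense = [0.2, -0.3, 0.1, 0.4]
print(choose_estimator(sparse), choose_estimator(dense))
```

The cutoff would have to be tuned against the cost of maintaining a critic, which this sketch does not model.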

Load-bearing premise

Routine turns are reward-equivalent and critic error remains controlled throughout the derivation.
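To see why the critic-error condition matters, consider a toy model of our own (not the paper's environment). It has a fixed number of decision turns, each followed by the same number of routine turns, an independent Bernoulli(p) policy at every turn, a return that counts decision turns with action 1, and a turn-level advantage corrupted on every turn by independent zero-mean noise with standard deviation `critic_std`. Under those assumptions the variances have a closed form, sketched below; the function name and parameters are hypothetical.

```python
import math


def snr_ratio(num_decision, routine_per_decision, p=0.3, critic_std=0.0):
    """Closed-form SNR_turn / SNR_traj for the toy model described above.

    Note: p = 0.5 with critic_std = 0 makes the turn-level estimator
    deterministic (zero variance), so the ratio is undefined there.
    """
    total = num_decision * (1 + routine_per_decision)
    q = p * (1 - p)                                # variance of the score a_t - p
    v4 = q * ((1 - p) ** 3 + p ** 3) - q * q       # variance of (a_t - p)^2
    var_traj = num_decision * (q * q * (total - 1) + v4)
    var_turn = num_decision * v4 + total * critic_std ** 2 * q
    # Both estimators share the same expected gradient, so it cancels in the ratio.
    return math.sqrt(var_traj / var_turn)


for critic_std in (0.0, 0.1, 0.5):
    row = [snr_ratio(4, routine_len, critic_std=critic_std)
           for routine_len in (0, 10, 100, 1000)]
    print(f"critic_std={critic_std}: {[round(r, 2) for r in row]}")
```

With zero critic noise the ratio keeps growing as routine turns are added. With nonzero noise it levels off near sqrt(K p (1 - p)) / critic_std for K decision turns, because the critic's error also accumulates over routine turns.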

What would settle it

An experiment that varies decision density in a new environment and measures the turn-to-trajectory signal-to-noise ratio, finding that it fails to follow the inverse square root scaling.
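Such a check amounts to a log-log regression of the measured ratio on ρ. The sketch below fits the slope and R² with ordinary least squares and flags slopes that stray from -1/2. The data are synthetic placeholders generated from an exact inverse-square-root law with a small hand-picked wobble. They are not measurements from the paper, and the tolerance of 0.05 is an arbitrary illustrative choice.

```python
import math


def fit_power_law(densities, ratios):
    """OLS fit of log(ratio) = intercept + slope * log(density).

    Returns (slope, intercept, r_squared).
    """
    xs = [math.log(d) for d in densities]
    ys = [math.log(r) for r in ratios]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


def consistent_with_inverse_sqrt(slope, tolerance=0.05):
    """True if the fitted exponent is within `tolerance` of -1/2."""
    return abs(slope + 0.5) <= tolerance


# Synthetic placeholder data: exact rho^(-1/2) times a small fixed wobble.
densities = [1.0, 0.5, 0.25, 0.125, 0.0625]
wobble = [1.02, 0.98, 1.01, 0.99, 1.00]
ratios = [w * d ** -0.5 for w, d in zip(wobble, densities)]

slope, intercept, r_squared = fit_power_law(densities, ratios)
print(f"slope={slope:.3f}  R^2={r_squared:.4f}  "
      f"consistent={consistent_with_inverse_sqrt(slope)}")
```

A fitted exponent far from -1/2 in a new environment, with the stated assumptions verified to hold, would count against the claim; a high R² alone would not settle it, since any clean power law fits well.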

Figures

Figures reproduced from arXiv: 2606.22164 by Yann Pernot and Vi Retault.

**Figure 1.** Paired initialization SNR ratio SNR_turn/SNR_traj vs. decision density, log scale. Points are per-L IQMs.

**Figure 2.** Threshold speedup τ_traj/τ_turn vs. decision density, log scale.

**Figure 4.** Association between the initialization SNR ratio and AUC ratio on the shared L values. Each point is one value of L.

**Figure 5.** Iterations-to-threshold for each estimator separately, complementing the speedup ratio in the main text. At ρ = 1, both estimators reach the threshold in roughly 76–78 iterations; at ρ = 1/51, the trajectory-level estimator requires 225.25 iterations ([186.0, 254.0]) while the turn-level estimator requires 102.0 ([94.75, 118.0]).

**Figure 6.** Initialization SNR ratio SNR_{turn,ϵ}/SNR_traj vs. L at several critic-noise levels. Horizontal asymptotes form as we raise the critic error scale.

**Figure 7.** Initialization SNR ratio SNR_{turn,ϵ}/SNR_traj vs. critic noise at several values of L.

**Figure 9.** AUC ratio AUC_turn/AUC_traj vs. decision density.
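The iteration counts quoted in the Figure 5 caption allow a quick arithmetic check of the training-step gap at ρ = 1/51. The snippet below divides the two quoted point estimates and prints ρ^{-1/2} alongside for scale. It assumes the ratio of point estimates is a fair stand-in for the speedup statistic plotted in Figure 2, which the paper may aggregate differently (for example per seed).

```python
# Point estimates quoted in the Figure 5 caption at rho = 1/51.
ITERS_TRAJ = 225.25   # trajectory-level estimator, iterations to threshold
ITERS_TURN = 102.0    # turn-level estimator, iterations to threshold
RHO = 1 / 51

speedup = ITERS_TRAJ / ITERS_TURN
inverse_sqrt_rho = RHO ** -0.5

print(f"tau_traj / tau_turn = {speedup:.2f}")        # prints 2.21
print(f"rho^(-1/2)          = {inverse_sqrt_rho:.2f}")  # prints 7.14
```

The ρ^{-1/2} law is stated for the SNR ratio at initialization, so the smaller training-step ratio does not by itself contradict it; the two quantities measure different things.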

Original abstract

Multi-turn agents interleave consequential decisions with routine execution: some actions change the downstream return distribution, while others are necessary but reward-equivalent. The cost of trajectory-level credit assignment, often attributed to long horizons, is in fact governed by decision density $\rho$: the fraction of turns whose actions affect the return. When decision density is low, routine turns create signal dilution: they add gradient variance to trajectory-level estimators such as GRPO without adding expected signal. Under explicit assumptions, the resulting turn-level to trajectory-level signal-to-noise ratio scales as $\rho^{-1/2}$, provided critic error remains controlled. The same analysis identifies the complementary regime: at high decision density, trajectory-level methods can remain competitive while avoiding the cost of a critic. In a controlled environment where $\rho$ is exactly tunable, the predicted scaling is recovered with $R^2 = 0.999$, and the training-step gap widens significantly as $\rho \to 0$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper derives a ρ^{-1/2} SNR scaling for signal dilution in low-decision-density multi-turn RL and recovers it with R²=0.999 in a tunable synthetic environment.

The main point here is a derivation showing that when only a fraction ρ of turns actually affect returns, trajectory-level estimators pick up extra variance from the routine turns, so the ratio of turn-level to trajectory-level signal-to-noise grows as ρ^{-1/2} as ρ shrinks (assuming critic error stays controlled and routine turns are reward-equivalent). They then build an environment where ρ can be set exactly and recover the predicted scaling almost perfectly (R² = 0.999).

What stands out is that the scaling is presented as a derived consequence rather than a fit, and the experiment is set up to test it directly instead of just showing improvement on a benchmark. That gives the claim some independent footing.

The obvious limitation is that the result lives inside the stated assumptions. If real tasks have routine turns that are not cleanly reward-equivalent or if critic error grows with horizon, the scaling will not hold in the same clean way. The environment is fully controllable by design, so it does not yet speak to how often low ρ actually occurs in the agents people care about. No head-to-head against other credit-assignment tricks appears in the abstract either.

This is useful for anyone choosing between GRPO-style methods and critic-based ones on long-horizon tasks; the ρ framing gives a concrete knob to turn. It is worth sending to referees because the central prediction is falsifiable and the empirical match is tight under the conditions they set, even if the scope is narrower than the title suggests.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that in multi-turn agents, the cost of trajectory-level credit assignment is governed by decision density ρ (fraction of turns whose actions affect the return distribution), rather than horizon length alone. Routine turns that are reward-equivalent create signal dilution in estimators such as GRPO by adding gradient variance without expected signal. Under explicit assumptions on the environment, reward-equivalent routine turns, and controlled critic error, the turn-level to trajectory-level signal-to-noise ratio scales as ρ^{-1/2}. The analysis also identifies the complementary high-ρ regime where trajectory-level methods remain competitive. In a controlled environment with exactly tunable ρ, the predicted scaling is recovered with R² = 0.999, and the training-step gap widens as ρ → 0.

Significance. If the central scaling holds, the work supplies a precise, parameter-free characterization of signal dilution in multi-turn RL and clarifies when trajectory-level methods suffice versus when a critic is required. The explicit-assumption derivation combined with near-perfect empirical recovery (R² = 0.999) in a tunable-ρ setting constitutes a falsifiable prediction that could guide algorithm design; the absence of free parameters and the independent empirical grounding are particular strengths.

major comments (2)

[Abstract] The central derivation of the ρ^{-1/2} scaling is presented as holding under explicit assumptions on reward-equivalent routine turns and controlled critic error, yet the abstract (and the provided material) does not supply the full derivation, the precise statement of those assumptions, or the intermediate steps; without these, the claim cannot be independently verified even though the empirical recovery is reported.
The experimental section is described only at the level of 'a controlled environment where ρ is exactly tunable' with R² = 0.999; the manuscript must include the precise definition of the tunable environment, the implementation of ρ, the number of runs, and the exact estimator (GRPO or otherwise) used, as these details are load-bearing for assessing whether the empirical result independently confirms the scaling rather than being an artifact of the construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below.

Referee: [Abstract] The central derivation of the ρ^{-1/2} scaling is presented as holding under explicit assumptions on reward-equivalent routine turns and controlled critic error, yet the abstract (and the provided material) does not supply the full derivation, the precise statement of those assumptions, or the intermediate steps; without these, the claim cannot be independently verified even though the empirical recovery is reported.

Authors: The full derivation with explicit assumptions and intermediate steps appears in Section 3. To improve verifiability from the abstract, we will revise the abstract to state the key assumptions concisely and reference Section 3 for the complete derivation. revision: yes
Referee: [—] The experimental section is described only at the level of 'a controlled environment where ρ is exactly tunable' with R² = 0.999; the manuscript must include the precise definition of the tunable environment, the implementation of ρ, the number of runs, and the exact estimator (GRPO or otherwise) used, as these details are load-bearing for assessing whether the empirical result independently confirms the scaling rather than being an artifact of the construction.

Authors: We agree these details are required for independent assessment. The revised manuscript will expand the experimental section with the precise definition of the tunable environment, the implementation of ρ, the number of runs, and confirmation that GRPO was the estimator employed. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained with independent empirical grounding

The paper states the ρ^{-1/2} SNR scaling under explicit assumptions on reward-equivalent routine turns and controlled critic error. It then reports empirical recovery of the exact scaling (R²=0.999) in a controlled environment where ρ is stated to be exactly tunable. This constitutes independent validation rather than a fitted input renamed as prediction or a self-definitional reduction. No self-citation chains, uniqueness theorems, or ansatzes smuggled via prior work are referenced in the provided material as load-bearing. The central claim retains independent content from the derivation-plus-validation structure.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the definition of decision density as the fraction of return-affecting turns, the assumption that routine turns are reward-equivalent, and the condition that critic error remains controlled; these are domain assumptions rather than derived quantities.

axioms (2)

domain assumption: Routine turns are necessary but reward-equivalent and do not affect the return distribution.
Stated directly in the abstract as the basis for signal dilution.
domain assumption: Critic error remains controlled.
Explicit condition required for the ρ^{-1/2} scaling to hold.

invented entities (1)

decision density ρ (no independent evidence)
purpose: Quantifies the fraction of turns whose actions affect the return to explain signal dilution.
Newly defined quantity used to derive the scaling; no independent evidence outside the paper's controlled environment.
