Pith · machine review for the scientific record

arxiv: 2605.06869 · v2 · submitted 2026-05-07 · 💻 cs.AI

Recognition: no theorem link

Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:47 UTC · model grok-4.3

classification 💻 cs.AI
keywords benchmark · sequential decision making · reinforcement learning · large language model agents · agent evaluation · procedurally generated tasks · multi-agent systems

The pith

Agentick benchmark shows no single approach dominates sequential decision-making across 37 tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Agentick, a unified benchmark that evaluates RL agents, LLM agents, VLMs, hybrids, and humans on the same sequential decision-making tasks. It comprises 37 procedurally generated tasks in six categories with multiple difficulty levels and observation types, plus tooling such as oracle policies and a composable agent harness. Across 27 configurations and over 90,000 episodes, it finds that GPT-5 mini has the best overall performance while other agents, such as PPO, lead on certain tasks; a reasoning harness greatly improves LLM results, and ASCII observations outperform natural language. This matters because it gives a common yardstick for measuring progress toward general agents and points to where each type of agent can improve.

Core claim

Agentick is a benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human agents on common ground through 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities. Evaluations across 27 configurations reveal that no single approach dominates: GPT-5 mini leads at 0.309 oracle-normalized score, while PPO dominates planning and multi-agent tasks.

What carries the argument

The Agentick benchmark consisting of 37 procedurally generated tasks exposed through a single Gymnasium-compatible interface, along with oracle reference policies, pre-built SFT datasets, and a composable agent harness.
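The "common ground" claim boils down to every agent, from a PPO policy to an LLM harness, being driven through the same reset/step loop. A minimal sketch of such a loop, using a hypothetical toy environment with the Gymnasium-style signature rather than Agentick's actual API:

```python
class ToyGridEnv:
    """Hypothetical stand-in with the Gymnasium reset/step signature.

    Not Agentick's API: a 1-D 'walk right to the goal' task used only
    to illustrate the shared interface."""

    def __init__(self, size=5):
        self.size = size
        self.pos = 0

    def reset(self, *, seed=None):
        # seed accepted for interface parity; this toy env is deterministic
        self.pos = 0
        return self.pos, {}  # observation, info

    def step(self, action):  # action: 0 = stay, 1 = move right
        self.pos = min(self.size - 1, self.pos + action)
        terminated = self.pos == self.size - 1
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated, False, {}  # obs, r, term, trunc, info


def evaluate(env, policy, episodes=10, max_steps=100):
    """Average undiscounted return of `policy`, any callable obs -> action."""
    returns = []
    for ep in range(episodes):
        obs, _ = env.reset(seed=ep)
        total, steps = 0.0, 0
        terminated = truncated = False
        while not (terminated or truncated) and steps < max_steps:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            steps += 1
        returns.append(total)
    return sum(returns) / len(returns)


print(evaluate(ToyGridEnv(), lambda obs: 1))  # 1.0: always-right policy reaches the goal
```

Because the loop only assumes `obs, reward, terminated, truncated, info = env.step(action)`, an RL policy, an LLM wrapper that parses text into an action, or a human input function can all be plugged in as `policy` without changing the evaluation code.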

Load-bearing premise

The 37 procedurally generated tasks across six capability categories are representative of the fundamental challenges of sequential decision-making for general agents.

What would settle it

A new agent configuration that consistently outperforms all current leaders across every capability category, difficulty level, and observation modality would show that a dominant approach exists.

Figures

Figures reproduced from arXiv: 2605.06869 by Glen Berseth, Pablo Samuel Castro, Roger Creus Castanyer.

Figure 1. Two observation modalities for KeyDoorPuzzle at medium difficulty.

Figure 2. Overall ONS for all evaluated agents. Among the initial set of agents, GPT-5 mini and PPO (2M) lead at 0.309 and 0.287 respectively, with substantial room for improvement across all paradigms. (Accompanying panel, "Capability Profiles: Per-Category ONS", shows Navigation, Reasoning, Planning, Memory, Generalization, and Multi-Agent scores for GPT-5 mini, PPO Dense (2M), Qwen3.5-4B, and Gemini 2.5 Flash Lite.)

Figure 4. Per-category ONS for the top five agents; best agent varies by category: PPO leads planning.

Figure 5. Per-task success rates at hard difficulty for three frontier LLMs. No single model dominates.

Figure 6. Agentick ONS for the Qwen model family across observation modes and reasoning.

Figure 7. Three tasks at all four difficulty levels (columns: easy, medium, hard, expert).

Figure 8. All 37 Agentick tasks in isometric view at medium difficulty, spanning six categories.

Figure 9. All five observation modalities for the same KeyDoorPuzzle state (medium, seed 42).
Original abstract

AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3-10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Agentick, a unified benchmark for sequential decision-making agents spanning RL, LLM, VLM, hybrid, and human paradigms. It consists of 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface, along with oracle reference policies, pre-built SFT datasets, a composable reasoning harness, and a live leaderboard. An evaluation of 27 agent configurations over more than 90,000 episodes reports that no single approach dominates, with GPT-5 mini achieving the top overall oracle-normalized score of 0.309, PPO excelling on planning and multi-agent tasks, the reasoning harness multiplying LLM performance by 3-10x, and ASCII observations outperforming natural language.

Significance. If the evaluation protocols prove robust and reproducible, the benchmark fills a clear gap by enabling direct cross-paradigm comparisons in truly sequential environments. The scale of the reported experiments, the provision of oracles and datasets for both evaluation and RL post-training, and the empirical demonstration that no paradigm dominates are concrete strengths that could accelerate research on general agents.

major comments (2)
  1. [Evaluation] Evaluation section: The central claim that no approach dominates rests on oracle-normalized scores across 27 configurations. The manuscript does not specify the exact normalization formula, how oracle policies are applied per task, or rules for episode termination and data inclusion, making it impossible to verify whether post-hoc choices influence the reported rankings (e.g., GPT-5 mini at 0.309) or the no-dominance conclusion.
  2. [Benchmark Design] Benchmark Design section: The assumption that the 37 procedurally generated tasks adequately represent fundamental challenges of sequential decision-making is load-bearing for generalizing the results, yet no ablation or coverage analysis is provided to support this representativeness across the six capability categories.
minor comments (2)
  1. [Abstract] Abstract: Clarify the reference to 'GPT-5 mini' (model version, release date, or citation) to avoid ambiguity in the reported scores.
  2. Results presentation: A summary table listing all 27 configurations, their paradigms, and key scores would improve readability of the no-dominance finding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and will make the corresponding revisions to improve clarity and strengthen the justification of our claims.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The central claim that no approach dominates rests on oracle-normalized scores across 27 configurations. The manuscript does not specify the exact normalization formula, how oracle policies are applied per task, or rules for episode termination and data inclusion, making it impossible to verify whether post-hoc choices influence the reported rankings (e.g., GPT-5 mini at 0.309) or the no-dominance conclusion.

    Authors: We agree that the Evaluation section requires greater specificity on these protocols to support verification of the results. The oracle-normalized score is computed per task as (agent_return - random_return) / (oracle_return - random_return), where oracle policies are the reference policies shipped with the benchmark and executed identically to agent policies. Episodes terminate on goal achievement or after a hard maximum of 1000 steps, and all episodes are included in the reported averages with no post-hoc exclusion. In the revised manuscript we will add a dedicated subsection detailing the exact formula, per-task oracle application, termination rules, and inclusion criteria. This will enable full reproducibility and independent verification of the rankings and the no-dominance conclusion. revision: yes
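The normalization the authors describe can be written down in a few lines. This is a minimal sketch of the stated formula only; the function and variable names are ours, not Agentick's:

```python
def oracle_normalized_score(agent_return, random_return, oracle_return):
    """Oracle-normalized score as stated in the rebuttal:
    0 corresponds to the random policy, 1 to the oracle policy.
    Values can fall outside [0, 1] if an agent does worse than
    random or better than the oracle."""
    span = oracle_return - random_return
    if span == 0:
        raise ValueError("oracle and random returns coincide; ONS undefined")
    return (agent_return - random_return) / span


# e.g. an agent scoring 3.1 on a task where random gets 1.0 and the oracle 10.0
print(round(oracle_normalized_score(3.1, 1.0, 10.0), 3))  # 0.233
```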

  2. Referee: [Benchmark Design] Benchmark Design section: The assumption that the 37 procedurally generated tasks adequately represent fundamental challenges of sequential decision-making is load-bearing for generalizing the results, yet no ablation or coverage analysis is provided to support this representativeness across the six capability categories.

    Authors: We acknowledge that an explicit coverage analysis would strengthen the claim of representativeness. The 37 tasks were constructed via procedural generation to target the six capability categories through controlled variation of parameters such as planning horizon, agent count, and observation type. In the revision we will add a new subsection in Benchmark Design that maps each task to its primary capability categories, illustrates coverage via parameter ranges, and discusses potential gaps. A full ablation study across all categories is computationally intensive and left for future work; we will note this limitation explicitly. This addition will better justify generalization while remaining honest about the current evidence. revision: partial
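To make the parameter-controlled generation described above concrete, here is a hedged sketch of seeded procedural instantiation. The difficulty table, parameter names, and ranges are hypothetical illustrations, not Agentick's actual settings:

```python
import random

# Hypothetical difficulty schedule; Agentick's real parameter ranges are
# not given in this review, so these values are purely illustrative.
DIFFICULTY = {
    "easy":   {"grid": 5,  "horizon": 50,  "agents": 1},
    "medium": {"grid": 8,  "horizon": 100, "agents": 1},
    "hard":   {"grid": 12, "horizon": 200, "agents": 2},
    "expert": {"grid": 16, "horizon": 400, "agents": 4},
}


def generate_task(category, difficulty, seed):
    """Deterministically instantiate a task variant from (category, difficulty, seed)."""
    params = dict(DIFFICULTY[difficulty])
    # string seeding of random.Random is deterministic across runs
    rng = random.Random(f"{category}:{difficulty}:{seed}")
    n = params["grid"]
    cells = [(r, c) for r in range(n) for c in range(n)]
    start, goal = rng.sample(cells, 2)  # distinct start and goal positions
    return {"category": category, "seed": seed, "start": start, "goal": goal, **params}


# Same (category, difficulty, seed) triple always yields the same instance.
assert generate_task("navigation", "medium", 42) == generate_task("navigation", "medium", 42)
```

Mapping every instance to a `(category, difficulty, seed)` triple is what makes per-category coverage auditable: one can enumerate seeds and report the realized parameter ranges per category, which is essentially the coverage analysis the referee asks for.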

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with direct measurements

Full rationale

The paper presents an empirical benchmark (37 procedurally generated tasks, Gymnasium interface, oracle policies, and evaluation across 27 configurations and >90k episodes) rather than any derivation chain. Reported outcomes such as GPT-5 mini leading at 0.309 oracle-normalized score, PPO dominance in specific categories, and the 3-10x harness multiplier are direct empirical results from the provided datasets and oracles, not reductions of fitted parameters or self-referential equations. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The central claims rest on explicit evaluation scale and oracle normalization, which are externally verifiable and independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions from the RL and Gymnasium literature plus the premise that the chosen tasks capture fundamental sequential challenges; no free parameters are explicitly fitted in the abstract and no new entities are postulated.

axioms (1)
  • Domain assumption: a Gymnasium-compatible environment interface provides a standard and fair evaluation ground for all agent types.
    The benchmark exposes all tasks through a single Gymnasium interface, assuming this interface is neutral across RL, LLM, and hybrid agents.
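What the axiom asserts can be illustrated concretely: the same underlying state can be served to an RL agent as an integer array and to an LLM agent as an ASCII string through one interface. The grid symbols and integer encodings below are hypothetical, not Agentick's actual observation formats:

```python
GRID = [
    ["A", ".", "K"],
    [".", "#", "."],
    [".", "D", "G"],
]  # hypothetical symbols: A agent, K key, D door, G goal, # wall, . floor


def as_array(grid):
    """Integer encoding suitable for an RL agent (mapping is illustrative)."""
    codes = {".": 0, "#": 1, "A": 2, "K": 3, "D": 4, "G": 5}
    return [[codes[c] for c in row] for row in grid]


def as_ascii(grid):
    """ASCII observation suitable for an LLM prompt."""
    return "\n".join("".join(row) for row in grid)


print(as_array(GRID)[0])  # [2, 0, 3]
print(as_ascii(GRID))
```

The neutrality question the axiom glosses over is whether these renderings are equally informative per paradigm; the paper's own finding that ASCII outperforms natural language for LLMs suggests the choice of rendering, not just the interface, carries weight.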

pith-pipeline@v0.9.0 · 5544 in / 1359 out tokens · 54751 ms · 2026-05-14T20:47:56.976346+00:00 · methodology


Reference graph

Works this paper leans on

51 extracted references · 15 canonical work pages · 11 internal anchors
