
arxiv: 2605.24423 · v1 · pith:N4LROOLBnew · submitted 2026-05-23 · 💻 cs.AI

Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

Pith reviewed 2026-06-30 13:38 UTC · model grok-4.3

classification 💻 cs.AI
keywords ad-hoc teamwork · in-context reinforcement learning · Overcooked · multi-agent coordination · partial observability · test-time adaptation · benchmark

The pith

In-context RL methods frequently underperform random baselines on both unseen teammates and unseen layouts in ad-hoc teamwork.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark called ICRL4AHT on a fast JAX version of Overcooked-V2 to test whether history-conditioned in-context reinforcement learning can let agents coordinate with unknown partners. It generates a large set of teammate policies, collects learning histories, and runs controlled train-test splits on both teammates and layouts. Evaluation of representative methods such as Algorithm Distillation and Decision-Pretrained Transformer across millions of steps shows they often score below random policies and exhibit no reliable improvement as interaction length grows. The results indicate that current in-context approaches do not yet solve the strategic inference problem posed by partial observability in multi-agent settings. The benchmark supplies an end-to-end, reproducible pipeline for measuring future progress.

Core claim

History-conditioned ICRL algorithms fail to exhibit robust test-time adaptation in multi-agent ad-hoc teamwork, frequently underperforming random baselines on both unseen teammate and unseen layout tracks with no clear in-context improvement over long horizons.

What carries the argument

The ICRL4AHT benchmark, consisting of a diverse teammate suite spanning RL and heuristic policies together with a reproducible multi-episode evaluation protocol on Overcooked-V2 under controlled distribution shifts.
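The two controlled distribution shifts can be pictured as task lists built from layouts and teammate pools. The following is a minimal sketch, not the benchmark's actual manifest format: the `EvalTask` class, the builder functions, and all layout and teammate names are hypothetical. The counts (six training layouts, 80 RL-trained policies, four heuristic families of five policies each) are taken from the figure captions on this page; which teammate pool is used on the unseen-layout track is an assumption made here for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EvalTask:
    """One evaluation cell: which track, which layout, which teammate."""
    track: str
    layout: str
    teammate: str


def build_track1(train_layouts, test_teammates):
    """Unseen-teammate track: familiar layouts crossed with held-out teammates."""
    return [EvalTask("track1", layout, mate)
            for layout, mate in product(train_layouts, test_teammates)]


def build_track2(layout_pairs, teammates):
    """Unseen-layout track: evaluate on the held-out member of each layout pair."""
    return [EvalTask("track2", test_layout, mate)
            for (_train_layout, test_layout) in layout_pairs
            for mate in teammates]


# Hypothetical names; only the counts follow the figure captions.
train_layouts = [f"layout_{i}" for i in range(6)]
rl_teammates = [f"rl_{i}" for i in range(80)]
heuristic_teammates = [f"H{fam}_{i}" for fam in range(1, 5) for i in range(5)]
layout_pairs = [("pair1_train", "pair1_test")]  # assumed single pair for brevity

track1 = build_track1(train_layouts, heuristic_teammates)
# Assumption: track 2 reuses the RL teammate pool on the unseen layout.
track2 = build_track2(layout_pairs, rl_teammates)
```

Keeping the split as explicit data makes it easy to verify that no held-out teammate or layout leaks into the training side.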

If this is right

  • Existing ICRL techniques require new mechanisms to handle partial observability and teammate inference in multi-agent environments.
  • The benchmark supplies a standardized testbed for measuring whether future coordination algorithms overcome the observed failure modes.
  • No clear scaling benefit from longer interaction histories appears under the current AHT protocol.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Methods that maintain explicit models of possible teammate strategies may be needed before in-context adaptation succeeds.
  • Testing the same algorithms on other partially observable multi-agent domains would clarify whether the limits are specific to Overcooked-V2.
  • Hybrid systems that combine in-context learning with lightweight planning could be a direct next step to evaluate.

Load-bearing premise

The generated teammate suite and the multi-episode evaluation protocol in Overcooked-V2 accurately capture the strategic inference challenges of real ad-hoc teamwork under partial observability.
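The Figure 9 caption below reports teammate diversity as pairwise Hamming distances between policy families. One way such a matrix could be computed is sketched here, assuming policies are compared by the discrete actions they choose on a shared list of probe states. That comparison procedure, the function names, and the toy policies are assumptions for illustration; this page does not give the paper's exact method.

```python
def hamming_distance(actions_a, actions_b):
    """Fraction of positions at which two equal-length action sequences differ."""
    if len(actions_a) != len(actions_b):
        raise ValueError("action sequences must have the same length")
    if not actions_a:
        raise ValueError("action sequences must be non-empty")
    differing = sum(a != b for a, b in zip(actions_a, actions_b))
    return differing / len(actions_a)


def pairwise_hamming(policies, probe_states):
    """Return {(name_p, name_q): distance} over all ordered policy pairs."""
    actions = {name: [policy(s) for s in probe_states]
               for name, policy in policies.items()}
    return {(p, q): hamming_distance(actions[p], actions[q])
            for p in actions for q in actions}


# Toy deterministic policies over integer "states" (hypothetical stand-ins).
toy_policies = {
    "always_stay": lambda s: 0,
    "cycle": lambda s: s % 6,
    "parity": lambda s: s % 2,
}
dist = pairwise_hamming(toy_policies, probe_states=list(range(12)))
```

A distance of 0 means two policies agree on every probe state, and 1 means they never agree, so high off-diagonal values between families indicate a large behavioral shift between train and test teammates.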

What would settle it

An in-context method that consistently outperforms random baselines on the unseen teammate track and the unseen layout track of the ICRL4AHT benchmark would falsify the reported limitations.
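The criterion above can be made operational as a simple check over per-task mean returns. This is a hedged sketch: the function names, the track keys, and the `min_win_rate` threshold used to interpret "consistently" are all assumptions, and the numbers in the example are made up for illustration rather than taken from the paper.

```python
def beats_random(method_returns, random_returns, min_win_rate=1.0):
    """True if the method's mean return exceeds random on enough shared tasks."""
    tasks = method_returns.keys() & random_returns.keys()
    if not tasks:
        raise ValueError("no shared tasks to compare")
    wins = sum(method_returns[t] > random_returns[t] for t in tasks)
    return wins / len(tasks) >= min_win_rate


def settles_it(results, min_win_rate=1.0):
    """Require the win-rate criterion on BOTH tracks, as the text demands."""
    return all(
        beats_random(results[track]["method"], results[track]["random"], min_win_rate)
        for track in ("unseen_teammate", "unseen_layout")
    )


# Made-up numbers, for illustration only.
example = {
    "unseen_teammate": {"method": {"a": 12.0, "b": 9.0}, "random": {"a": 5.0, "b": 4.0}},
    "unseen_layout": {"method": {"c": 3.0}, "random": {"c": 6.0}},
}
verdict = settles_it(example)  # False: the method loses on the unseen-layout task
```

A stricter version would compare confidence intervals across seeds rather than point estimates.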

Figures

Figures reproduced from arXiv: 2605.24423 by Jiajun Zhang, Jian Cheng, Jiaxi Yang, Jinmin He, Junliang Xing, Kai Li, Lei Zhang, Yuheng Jing, Zeyao Ma, Zhe Wu, Ziwen Zhang.

Figure 1: Benchmark Pipeline Overview (ICRL4AHT). (1) A benchmark manifest specifies layouts, tasks, and other properties, as well as defines two evaluation tracks that disentangle generalization over teammates and layouts. (2) A diverse teammate policy pool is constructed from both RL-trained policies and heuristic policies, where teammate policies π_{−i} can be sampled for either training or testing. (3) Using Overc…
Figure 2: Representative OvercookedV2 layouts used in the ICRL4AHT benchmark. Each layout features distinct spatial configurations, imposing diverse coordination challenges. Two agents (blue-ego agent and red-partner) must collaborate under partial observability to prepare and deliver dynamically changing recipes. We supplement OvercookedV2’s full environment documentation in Sec. A
Figure 3: Track 1: Online Adaptation Curves (L_train × Π_test). Episode-wise return trajectories of AD, DPT, and random baseline across six training layouts and four heuristic teammate families. Unlike single-agent ICRL settings where returns typically increase over the interaction horizon, the studied baselines exhibit flat adaptation profiles with no observable in-context improvement, even as more episodes are accum…
Figure 4: Ablation: Context Length. Effect of Transformer context length on ICRL performance across two Track 1 layouts (test wide and demo wide). Each curve represents a different heuristic teammate family. Contrary to expectations from single-agent ICRL, where longer context windows typically enable better in-context adaptation, we observe no consistent improvement as context length increases. Performance remains …
Figure 5: Model Scale and Training Budget Sweep. Performance of AD and DPT under Small, Medium, and Large configurations across both evaluation tracks. Scaling yields only modest gains and does not restore robust within-context adaptation.
Figure 6: Track 1 Layouts. Six layouts used for teammate generalization evaluation. The two agents are rendered as chibi-style chefs with blue (ego) and red (partner) colored bodies; wooden crates with ingredients represent dispensers; metallic pots sit on counters; dark conveyor-belt hatches are serving areas; stacked white plates form plate piles; and chalkboard-style menus or industrial buttons denote recipe indi…
Figure 7: Track 2 Layout Pairs. Training and testing layout pairs for layout generalization evaluation. Each pair shares geometric structure but differs in the spatial arrangement of key elements, requiring adaptation of coordination strategies. Layout Pair 1: Asymmetric Advantages. These layouts (9×5 grid) are adapted from the classic Overcooked-AI asymmetric advantages design, featuring two distinct regions separa…
Figure 8: Source Algorithm Training Curves. Evaluation returns of PPO-based best-response policies trained against fixed teammates during learning-history dataset construction. We compare raw trajectories (before filtering) with filtered trajectories selected based on quality scores that reward high final performance and large improvement. The filtered curves exhibit steeper improvement gradients and higher final pe…
Figure 9: Teammate Policy Diversity. Pairwise Hamming distance heatmaps among five policy families: RL-trained policies (|F_RL| = 80) and four heuristic families H1–H4 (|F_Hi| = 5 each). High inter-family distances, particularly between RL and heuristic families, confirm that our train-test split induces substantial behavioral distribution shift, ensuring that generalization to held-out heuristic families requires gen…
Figure 10: Teammate Action Conditioning: Online Adaptation Curves (L_train × Π_test). Episode-wise return trajectories of AD+TA, DPT+TA, and random baseline across six training layouts and four heuristic teammate families. Despite conditioning on ground-truth teammate actions, the +TA variants exhibit adaptation profiles qualitatively similar to their unconditional counterparts (
Figure 11: Held-Out RL Teammate Evaluation. AD and DPT evaluated on held-out RL-trained teammates on familiar layouts. Performance is moderately better than under cross-family heuristic shift, but within-context adaptation remains weak. F.4.2. Extended Context Lengths. Our main evaluation uses context windows up to K=2,000 transitions. To investigate whether longer context enables adaptation, we extend the evaluation…
Figure 12: Extended Context Length Evaluation. Performance across K ∈ {500, 1,000, 2,000, 5,000, 10,000}. Longer context helps somewhat on moderately difficult settings but does not qualitatively reverse the negative result on the hardest configurations.
Figure 13: Auxiliary Loss Variants. Performance comparison of +TA, +next obs, and mixed auxiliary objectives. Auxiliary losses provide modest stabilization on familiar settings but do not resolve the layout generalization failure. F.4.4. Within-Rollout Adaptation Gain. To directly quantify the degree of online adaptation, we compute the adaptation gain: the difference between the mean return over the last 20 episodes…
Figure 14: Within-Rollout Adaptation Gain (Δ = R̄_last20 − R̄_first20). Adaptation gains are small on average across all baselines and both tracks, confirming weak within-rollout online adaptation.
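
The adaptation gain named in the Figure 14 caption is the mean return over the last 20 episodes minus the mean return over the first 20. A minimal sketch of that computation follows. The function name and the requirement that the two windows not overlap are choices made here, and the example return sequences are synthetic.

```python
from statistics import mean


def adaptation_gain(episode_returns, window=20):
    """Mean return of the last `window` episodes minus that of the first `window`."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(episode_returns) < 2 * window:
        # Assumption: require non-overlapping windows so the gain is meaningful.
        raise ValueError("need at least 2 * window episodes")
    return mean(episode_returns[-window:]) - mean(episode_returns[:window])


# Synthetic rollouts: a flat profile and a steadily improving one.
flat_gain = adaptation_gain([3.0] * 100)           # no in-context improvement
rising_gain = adaptation_gain(list(range(100)))    # returns 0, 1, ..., 99
```

A flat return profile gives a gain of zero, which is the pattern the captions describe for the evaluated baselines; a learner that improves within the rollout would show a clearly positive value.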
Original abstract

In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT), where coordination with unknown partners is required, remains unexplored. To rigorously evaluate this, we introduce a large-scale benchmark ICRL4AHT, built upon a high-throughput JAX implementation of Overcooked-V2. Our benchmark includes a large, diverse teammate suite spanning both RL and heuristic policies, enabling controlled train-test shifts, and provides a reproducible end-to-end pipeline for teammate generation, learning-history collection, dataset construction, and online multi-episode evaluation. We evaluate representative history-conditioned ICRL algorithms, including Algorithm Distillation (AD) and Decision-Pretrained Transformer (DPT), across millions of transitions. Results reveal notable limitations: contrary to their success in single-agent domains, these baselines fail to exhibit robust test-time adaptation in multi-agent settings. Specifically, these methods frequently underperform random baselines across both unseen teammate and unseen layout tracks, with no clear in-context improvement over long horizons. These findings highlight the challenges of strategic inference under partial observability within the OvercookedV2 AHT protocol, establishing our benchmark as a critical testbed for next-generation coordination algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces the ICRL4AHT benchmark on a JAX-based Overcooked-V2 environment to evaluate in-context RL for ad-hoc teamwork. It generates a diverse teammate suite spanning RL and heuristic policies, provides a reproducible pipeline for generation, history collection, and multi-episode evaluation, and reports that representative methods (AD, DPT) frequently underperform random baselines on both unseen-teammate and unseen-layout tracks, with no evident in-context gains over long horizons.

Significance. If the benchmark protocol is robust, the negative results are significant because they expose limits of ICRL approaches that succeed in single-agent settings when applied to multi-agent coordination under partial observability. The reproducible JAX implementation, large-scale evaluation across millions of transitions, and controlled train-test shifts constitute concrete strengths that position the benchmark as a useful testbed for future coordination algorithms.

major comments (1)
  1. [Evaluation Protocol] The multi-episode evaluation protocol (abstract and §4) is load-bearing for the central claim of no in-context improvement over long horizons. The manuscript does not specify how history length is varied across episodes, how partial-observability observations are tokenized for the transformer-based models, or the exact mechanism by which the random baseline receives equivalent information, making it impossible to isolate whether underperformance is method-intrinsic or protocol-driven.
minor comments (1)
  1. [Abstract] The abstract states results are obtained 'across millions of transitions' but provides no breakdown by track or number of independent seeds; adding these quantities (with confidence intervals) in the results section would strengthen verifiability without altering the central claim.
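
The per-track breakdown with confidence intervals requested in the minor comment could be produced along the following lines. This is a sketch under stated assumptions: it uses a percentile bootstrap over seeds (one of several reasonable interval methods, not one the paper is known to use), the function name is hypothetical, and the per-seed returns are made-up placeholders.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap over seeds."""
    if len(values) < 2:
        raise ValueError("need at least two seeds")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    resampled_means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(values), lower, upper


# Made-up per-seed mean returns, for illustration only.
per_track_seed_returns = {
    "unseen_teammate": [4.0, 6.0, 5.0, 7.0, 3.0],
    "unseen_layout": [1.0, 2.0, 1.5, 0.5, 2.5],
}
summary = {track: bootstrap_ci(vals) for track, vals in per_track_seed_returns.items()}
```

Reporting the number of seeds next to each interval lets readers judge how much weight a "below random" comparison can bear.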

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments. We address the major comment on the evaluation protocol below and will revise the manuscript to incorporate the requested clarifications.

Point-by-point responses
  1. Referee: [Evaluation Protocol] The multi-episode evaluation protocol (abstract and §4) is load-bearing for the central claim of no in-context improvement over long horizons. The manuscript does not specify how history length is varied across episodes, how partial-observability observations are tokenized for the transformer-based models, or the exact mechanism by which the random baseline receives equivalent information, making it impossible to isolate whether underperformance is method-intrinsic or protocol-driven.

    Authors: We agree with the referee that additional details on the evaluation protocol are necessary for full reproducibility and to substantiate the claims. In the revised manuscript, we will expand the description in Section 4 to specify how history length is varied across episodes, the tokenization of partial-observability observations for the transformer-based models, and the exact mechanism by which the random baseline receives equivalent information. These additions will clarify that the comparison is fair and that the underperformance reflects limitations of the ICRL methods. revision: yes
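
To make the open question about history length concrete, the sketch below shows one common convention for history-conditioned evaluation: a context that persists across episode boundaries and is truncated to the most recent K transitions. This is an assumption for illustration, not the paper's documented protocol. The class, field, and function names are hypothetical, and the transitions are dummy placeholders rather than environment output.

```python
from collections import deque
from typing import NamedTuple


class Transition(NamedTuple):
    obs: tuple
    action: int
    reward: float
    done: bool


class CrossEpisodeContext:
    """Sliding window over the most recent K transitions, kept across episodes."""

    def __init__(self, max_transitions):
        if max_transitions <= 0:
            raise ValueError("max_transitions must be positive")
        self._buffer = deque(maxlen=max_transitions)  # oldest entries drop first

    def append(self, transition):
        self._buffer.append(transition)

    def window(self):
        return list(self._buffer)

    def __len__(self):
        return len(self._buffer)


def rollout(context, n_episodes, steps_per_episode):
    """Fill the context over several episodes WITHOUT clearing it in between.

    Returns the history length available at the start of each episode.
    """
    lengths_at_start = []
    for episode in range(n_episodes):
        lengths_at_start.append(len(context))
        for step in range(steps_per_episode):
            done = step == steps_per_episode - 1
            # Dummy transition; a real loop would query the policy and environment.
            context.append(Transition((episode, step), 0, 0.0, done))
    return lengths_at_start


context = CrossEpisodeContext(max_transitions=5)
lengths = rollout(context, n_episodes=3, steps_per_episode=3)
```

Under this convention the available history grows until it reaches K and then stays fixed, so any in-context improvement beyond that point has to come from what the window contains rather than from its length.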

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper is an empirical benchmark study introducing ICRL4AHT on Overcooked-V2 and reporting that representative ICRL methods (AD, DPT) underperform random baselines on unseen teammates and layouts. No derivation chain, equations, fitted parameters, or self-citation load-bearing premises exist; the central claims rest on direct experimental comparisons against an external random baseline within the defined protocol. This is self-contained against external benchmarks and receives the default non-circularity outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that Overcooked-V2 with the described teammate policies constitutes a faithful test of ad-hoc teamwork; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Overcooked-V2 with partial observability and the generated RL/heuristic teammate suite is representative of ad-hoc teamwork challenges.
    Invoked in the abstract when stating that the benchmark reveals challenges of strategic inference under partial observability.


