The Interplay of Harness Design and Post-Training in LLM Agents

Dongwoo Kim; Kyungmin Kim; Sangdon Park; Seoyeon Lee; Suhyeon Jun; Youngbin Choi

arXiv: 2606.25447 · v1 · pith:MNSERLYOnew · submitted 2026-06-24 · cs.LG · cs.CL

Pith reviewed 2026-06-25 20:55 UTC · model grok-4.3

keywords: LLM agents · harness design · post-training · out-of-distribution · tool integration · ALFWorld · agent scaffolding · environment shifts

The pith

Treating the harness as a variable during post-training improves LLM agent performance under tool and task shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats the harness—the scaffolding that exposes tools, describes them, and adds per-step information—as a controllable factor rather than a fixed detail. By extending ALFWorld to vary harness designs and introduce task and tool-environment shifts, the authors compare post-training that accounts for harness variation against post-training under a minimal harness. Harness-aware post-training raises in-distribution results and preserves performance when environments change after deployment. Minimal-harness post-training produces sharp drops once shifts become stronger. The work therefore reframes harness design as part of the training process rather than an inference-time engineering choice.

Core claim

Harness-aware post-training not only improves in-distribution performance but also enables agents to robustly adapt to OOD settings. Under a harness with minimal design effort, post-training suffers a drastic performance drop under stronger tool environment shifts.

What carries the argument

The harness: the scaffolding that determines which tools are exposed, how they are described, and what auxiliary information accompanies each per-step observation.
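The three parts of that definition can be made concrete with a small data structure. The following is a minimal sketch, not the paper's implementation: the class names `ToolSpec` and `Harness`, the tool names `goto` and `take`, and the `inventory` field are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ToolSpec:
    """One exposed tool (hypothetical structure, not from the paper)."""
    name: str
    description: str = ""


@dataclass(frozen=True)
class Harness:
    """Bundles the three harness choices named in the text."""
    tools: Tuple[ToolSpec, ...]      # which tools are exposed
    describe_tools: bool = True      # how they are described
    auxiliary: Tuple[str, ...] = ()  # extra fields added to each observation

    def render_tools(self) -> str:
        """Build the tool listing shown to the agent."""
        lines = []
        for tool in self.tools:
            if self.describe_tools and tool.description:
                lines.append(f"{tool.name}: {tool.description}")
            else:
                lines.append(tool.name)
        return "\n".join(lines)

    def render_observation(self, observation: str, extras: Dict[str, str]) -> str:
        """Append the configured auxiliary fields to a per-step observation."""
        parts = [observation]
        for key in self.auxiliary:
            if key in extras:
                parts.append(f"[{key}] {extras[key]}")
        return "\n".join(parts)


# Illustrative configurations (assumed, not the paper's harness variants).
minimal = Harness(
    tools=(ToolSpec("goto"), ToolSpec("take")),
    describe_tools=False,
)
informative = Harness(
    tools=(
        ToolSpec("goto", "move to a receptacle"),
        ToolSpec("take", "pick up an object"),
    ),
    auxiliary=("inventory",),
)
```

Under this sketch, the same environment step yields different agent inputs depending on which harness wraps it, which is the sense in which the harness can act as a controllable variable.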

If this is right

Post-training routines should vary harness configurations rather than treat the harness as fixed.
Minimal harness design during post-training leaves agents vulnerable to stronger environment shifts.
Harness-aware training produces agents that maintain performance across both seen and unseen task and tool conditions.
The extended ALFWorld benchmark enables controlled measurement of harness effects on post-training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Harness variation during training may reduce reliance on post-deployment prompt or scaffolding changes.
Similar harness controls could be added to other agent environments to test whether the same pattern holds.
The result implies that choices about tool presentation belong in the training loop, not only at inference time.

Load-bearing premise

The harness designs and task/tool shifts tested in the extended ALFWorld benchmark represent the variations and changes that matter in real deployments.

What would settle it

If agents post-trained under harness-aware conditions show no advantage or identical drops under new tool-environment shifts compared with minimal-harness agents, the claim would be falsified.
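One way to make this criterion operational is a small check over success rates. This is a sketch under stated assumptions: the function name `is_falsified`, the use of absolute drops in success rate, and the tolerance `tol` are illustrative choices, not definitions taken from the paper.

```python
from typing import Tuple

Rates = Tuple[float, float]  # (in-distribution success, success under shift)


def absolute_drop(rates: Rates) -> float:
    """Success rate lost when moving from in-distribution to shifted evaluation."""
    id_rate, ood_rate = rates
    for value in (id_rate, ood_rate):
        if not 0.0 <= value <= 1.0:
            raise ValueError("success rates must lie in [0, 1]")
    return id_rate - ood_rate


def is_falsified(aware: Rates, minimal: Rates, tol: float = 0.01) -> bool:
    """Apply the criterion in the text (tolerance is an assumed parameter).

    The claim counts as falsified when the harness-aware agent shows no
    advantage under shift, or when both agents drop by the same amount.
    """
    no_advantage = (aware[1] - minimal[1]) <= tol
    identical_drops = abs(absolute_drop(aware) - absolute_drop(minimal)) <= tol
    return no_advantage or identical_drops
```

For example, with placeholder rates, `is_falsified((0.9, 0.8), (0.7, 0.2))` is `False`, because the harness-aware agent is both better under shift and drops less. A real test would also need the uncertainty of each rate, which this check ignores.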

Figures

Figures reproduced from arXiv: 2606.25447 by Dongwoo Kim, Kyungmin Kim, Sangdon Park, Seoyeon Lee, Suhyeon Jun, Youngbin Choi.

Figure 2: Zero-shot success rates without post-training:

Figure 4: Training-time vs. post-hoc harness applica…

Figure 6: In-depth analysis of post training under the…

Figure 7: Post-training success rates of Qwen2.5-7B-Instruct under task shift: agents are post-trained on $D^{\mathrm{tr}}_{\mathrm{med}}$ (top row) and $D^{\mathrm{tr}}_{\mathrm{hard}}$ (bottom row) under each harness with tool schema fixed to v1.0, and evaluated on $D^{\mathrm{te}}_{\mathrm{all}}$ under the same harness and schema. In-distribution task categories are highlighted in bold: Clean, Heat, Cool for $D^{\mathrm{tr}}_{\mathrm{med}}$, and Pick 2 for $D^{\mathrm{tr}}_{\mathrm{hard}}$. Post-training under informative harnesses t…
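The in-distribution versus out-of-distribution split described in the Figure 7 caption can be sketched as a simple aggregation over per-category success rates. The in-distribution sets below follow the caption (Clean, Heat, Cool for the medium training split; Pick 2 for the hard one). Everything else is an assumption: the keys `med` and `hard`, the function name, the remaining category names (Pick, Look), and the placeholder rates, which are not results from the paper.

```python
import math
from typing import Dict, Tuple

# In-distribution categories per training split, as stated in the caption.
ID_CATEGORIES = {
    "med": {"Clean", "Heat", "Cool"},
    "hard": {"Pick 2"},
}


def split_id_ood(per_category: Dict[str, float], train_split: str) -> Tuple[float, float]:
    """Return (mean ID success, mean OOD success) for one trained agent."""
    id_set = ID_CATEGORIES[train_split]
    id_rates = [rate for cat, rate in per_category.items() if cat in id_set]
    ood_rates = [rate for cat, rate in per_category.items() if cat not in id_set]

    def mean(values):
        # NaN signals that no category fell on this side of the split.
        return sum(values) / len(values) if values else math.nan

    return mean(id_rates), mean(ood_rates)


# Placeholder rates for illustration only (category names partly assumed).
example = {"Pick": 0.5, "Look": 0.5, "Clean": 1.0, "Heat": 1.0, "Cool": 1.0, "Pick 2": 0.0}
id_med, ood_med = split_id_ood(example, "med")
id_hard, ood_hard = split_id_ood(example, "hard")
```

This averages categories with equal weight; weighting by the number of evaluation episodes per category would be an equally reasonable choice.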

Abstract

Tool-integrated LLM agents are often wrapped within a harness: the scaffolding that determines which tools are exposed, how they are described, and what auxiliary information accompanies each per-step observation. While agents are routinely post-trained, this scaffolding is typically treated as a fixed engineering detail, with design effort limited to the training-free regime. Moreover, existing post-training algorithms assume a static environment, even though tool environments and tasks often shift upon deployment. To address this gap, we extend $\texttt{ALFWorld}$ (i) to treat the harness as a controllable design dimension and (ii) to support evaluation under task and tool environment shifts. Building on this, we systematically analyze how the harness design influences post-training in both in-distribution and out-of-distribution (OOD) settings. We empirically show that harness-aware post-training not only improves in-distribution performance but also enables agents to robustly adapt to OOD settings. Under a harness with minimal design effort, post-training suffers a drastic performance drop under stronger tool environment shifts, further highlighting the importance of harness-aware post-training under such shifts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Harness design during post-training improves OOD robustness on their ALFWorld extension, but the practical payoff depends on how representative the shifts are.

The central finding is that post-training which accounts for harness design yields better in-distribution results and holds up better when tools and tasks shift, while a minimal harness leads to sharp drops under stronger changes.

The paper extends ALFWorld so the harness becomes a controllable variable and adds explicit OOD evaluation under task and tool shifts. This lets them run a direct comparison of training regimes. The empirical pattern they report is that harness-aware post-training confers robustness that static-harness training lacks. That is the new piece: prior agent post-training work mostly fixes the scaffolding and focuses on the model updates.

The experiments appear to be the main contribution, and the abstract presents the performance differences cleanly. If the full methods include proper controls and the shifts are not contrived, the result is useful for anyone building deployable agents.

The main soft spot is whether the specific harness variations and tool shifts they chose reflect the distribution shifts that matter in practice. The abstract treats the ALFWorld extension as representative, but that assumption needs checking against the actual task and tool changes they implemented. Without seeing the tables and implementation details, it is also hard to judge statistical reporting or data exclusion.

This is worth sending to peer review. People working on agent training pipelines will want to see the full setup and decide how much weight to give the benchmark extension. It is not a load-bearing theoretical claim, just an empirical demonstration that the interaction exists in their setup.

Referee Report

0 major / 3 minor

Summary. The paper extends the ALFWorld benchmark to treat harness design (tool exposure, descriptions, and auxiliary observations) as a controllable variable and to support evaluation under task and tool-environment shifts. It empirically compares post-training regimes and reports that harness-aware post-training improves in-distribution performance while also conferring robustness to OOD shifts; conversely, minimal-effort harnesses produce drastic performance drops under stronger tool shifts.

Significance. If the empirical findings hold under detailed controls, the work demonstrates that harness design is not merely an engineering detail but interacts with post-training to determine both ID gains and OOD robustness in tool-using LLM agents. The ALFWorld extension supplies a concrete testbed for studying this interaction, which could inform more integrated training pipelines for deployed agents.

minor comments (3)

The abstract states that harness-aware post-training 'enables agents to robustly adapt to OOD settings,' but does not specify the statistical tests, number of runs, or confidence intervals used to support the robustness claim; adding these details in the results section would strengthen the central empirical assertion.
The description of the ALFWorld extension (abstract) leaves the precise definition of 'minimal design effort' harness versus harness-aware variants implicit; a table or figure enumerating the differences in tool exposure, descriptions, and auxiliary information would improve reproducibility.
No mention is made of whether the post-training algorithms were run with identical hyperparameters across harness conditions or whether hyperparameter search was performed separately; clarifying this in the experimental setup would rule out a potential confound.
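The first comment asks for uncertainty estimates on success rates. As one possible form this could take, the sketch below computes a percentile bootstrap confidence interval over per-episode outcomes. It is an illustration only: the review does not say which method the authors use, and the function name, resample count, and seed are assumptions.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    outcomes: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for a success rate.

    `outcomes` holds one 0/1 entry per evaluation episode.
    Returns (point estimate, lower bound, upper bound).
    """
    if not outcomes:
        raise ValueError("need at least one episode outcome")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(outcomes)
    # Resample episodes with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(outcomes) / n, lower, upper


# Placeholder episode outcomes (30 successes, 20 failures), not paper data.
rate, low, high = bootstrap_ci([1] * 30 + [0] * 20)
```

Resampling episodes captures evaluation noise only. Variation across training runs, which the same comment raises, would need separate seeds of post-training and is not covered by this sketch.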

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and recommendation of minor revision. The report accurately captures our extension of ALFWorld to treat harness design as a controllable variable and our empirical findings on the interaction between harness design and post-training for both ID performance and OOD robustness. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity: purely empirical benchmark study

The paper is an empirical comparison of post-training regimes under varying harness designs on an extended ALFWorld benchmark, reporting performance in ID and OOD settings. No mathematical derivations, first-principles predictions, or fitted parameters are claimed; all central claims rest on direct experimental measurements rather than any reduction to self-defined quantities, self-citations, or ansatzes. The work contains no equations or uniqueness theorems that could trigger the enumerated circularity patterns, making the derivation chain self-contained by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on standard empirical ML assumptions about benchmark validity and the representativeness of tested harness designs.

[1] [1]

arXiv preprint arXiv:2604.25850 , year=

Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses , author=. arXiv preprint arXiv:2604.25850 , year=

Pith/arXiv arXiv

[2] [2]

International Conference on Learning Representations , volume=

Agentsquare: Automatic llm agent search in modular design space , author=. International Conference on Learning Representations , volume=

[3] [3]

Guanting Dong and Hangyu Mao and Kai Ma and Licheng Bao and Yifei Chen and Zhongyuan Wang and Zhongxia Chen and Jiazhen Du and Huiyang Wang and Fuzheng Zhang and Guorui Zhou and Yutao Zhu and Ji-Rong Wen and Zhicheng Dou , booktitle=. Agentic. 2026 , url=

2026

[4] [4]

Liu, Zichen and Sims, Anya and Duan, Keyu and Chen, Changyu and Yu, Simon and Zhou, Xiangxin and Xu, Haotian and Xiong, Shaopan and Liu, Bo and Tan, Chenmien and others , journal=. G

[5] [5]

arXiv preprint arXiv:2510.08558 , year=

Agent Learning via Early Experience , author=. arXiv preprint arXiv:2510.08558 , year=

Pith/arXiv arXiv

[6] [6]

Bowen Jin and Hansi Zeng and Zhenrui Yue and Jinsung Yoon and Sercan O Arik and Dong Wang and Hamed Zamani and Jiawei Han , booktitle=. Search-. 2025 , url=

2025

[7] [7]

Yang, Yushi and Padarha, Shreyansh and Lee, Andrew and Mahdi, Adam , journal=. Agentic

[8] [8]

Kim, Kyungmin and Choi, Youngbin and Kim, Hyounghun and Kim, Dongwoo and Park, Sangdon , booktitle=. Chrono

[9] [9]

Ding, Zifeng and Yan, Sikuan and Yuan, Moy and Hu, Xianglong and Lin, Fangru and Vlachos, Andreas , booktitle=. T

[10] [10]

Efficient

Jiralerspong, Thomas and Chen, Xiaoyin and More, Yash and Shah, Vedant and Bengio, Yoshua , booktitle=. Efficient

[11] [11]

Xinji Mai and Haotian Xu and Xing W and Weinong Wang and Yingying Zhang and Wenqiang Zhang , booktitle=. Agentic. 2025 , url=

2025

[12] [12]

Xue, Zhenghai and Zheng, Longtao and Liu, Qian and Li, Yingru and Zheng, Xiaosen and Ma, Zejun and An, Bo , booktitle=. Simple. 2026 , url=

2026

[13] [13]

Ziyu Wan and Yunxiang LI and Xiaoyu Wen and Yan Song and Hanjing Wang and Linyi Yang and Mark Schmidt and Jun Wang and Weinan Zhang and Shuyue Hu and Ying Wen , booktitle=. Re. 2025 , url=

2025

[14] [14]

2025 , url=

Tianzhe Chu and Yuexiang Zhai and Jihan Yang and Shengbang Tong and Saining Xie and Dale Schuurmans and Quoc V Le and Sergey Levine and Yi Ma , booktitle=. 2025 , url=

2025

[15] [15]

Paying Less Generalization Tax:

Liu, Zhihan and Guan, Lin and Nie, Yixin and Zhang, Kai and Hao, Zhuoqun and Chen, Lin and Celikyilmaz, Asli and Wang, Zhaoran and Zhang, Na , journal=. Paying Less Generalization Tax:

[16] [16]

The Eleventh International Conference on Learning Representations , year=

Progress measures for grokking via mechanistic interpretability , author=. The Eleventh International Conference on Learning Representations , year=

[17] [17]

s3: You Don ' t Need That Much Data to Train a Search Agent via RL

Jiang, Pengcheng and Xu, Xueqiang and Lin, Jiacheng and Xiao, Jinfeng and Wang, Zifeng and Sun, Jimeng and Han, Jiawei. s3: You Don ' t Need That Much Data to Train a Search Agent via RL. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025

2025

[18] [18]

arXiv preprint arXiv:2510.04678 , year=

Multi-Agent Tool-Integrated Policy Optimization , author=. arXiv preprint arXiv:2510.04678 , year=

arXiv

[19] [19]

Cheng, Mingyue and Ouyang, Jie and Yu, Shuo and Yan, Ruiran and Luo, Yucong and Liu, Zirui and Wang, Daoyu and Liu, Qi and Chen, Enhong , journal=. Agent-

[20] [20]

2024 , url =

Xie, Jian and Zhang, Kai and Chen, Jiangjie and Zhu, Tinghui and Lou, Renze and Tian, Yuandong and Xiao, Yanghua and Su, Yu , booktitle =. 2024 , url =

2024

[21] [21]

Proceedings of the International Conference on Learning Representations (ICLR) , year =

Shridhar, Mohit and Yuan, Xingdi and C. Proceedings of the International Conference on Learning Representations (ICLR) , year =

[22] [22]

International Conference on Learning Representations (ICLR) , year =

Learning Evolving Tools for Large Language Models , author =. International Conference on Learning Representations (ICLR) , year =

[23] [23]

arXiv preprint arXiv:2603.05910 , year =

The World Won't Stay Still: Programmable Evolution for Agent Benchmarks , author=. arXiv preprint arXiv:2603.05910 , year =

Pith/arXiv arXiv

[24] [24]

arXiv preprint arXiv:2603.28052 , year =

Meta-Harness: End-to-End Optimization of Model Harnesses , author=. arXiv preprint arXiv:2603.28052 , year =

Pith/arXiv arXiv

[25] [25]

The Eleventh International Conference on Learning Representations , year=

ReAct: Synergizing Reasoning and Acting in Language Models , author=. The Eleventh International Conference on Learning Representations , year=

[26] [26]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Reflexion: language agents with verbal reinforcement learning , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

[27] [27]

Transactions on Machine Learning Research , issn=

Voyager: An Open-Ended Embodied Agent with Large Language Models , author=. Transactions on Machine Learning Research , issn=. 2024 , url=

2024

[28] [28]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Agentic Reasoning: A streamlined framework for enhancing LLM reasoning with agentic tools , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[29] [29]

Advances in Neural Information Processing Systems , volume=

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering , author =. Advances in Neural Information Processing Systems , volume=

[30] [30]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Language Models can Solve Computer Tasks , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

[31] [31]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Showui: One vision-language-action model for gui visual agent , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

[32] Mohamed, Amr and Assi, Maram and Guizani, Mariam. 2026. doi:10.1145/3809494.

[33] Shakked Noy and Whitney Zhang. Science, 2023.

[34] Sirui Hong and Mingchen Zhuge and Jonathan Chen and Xiawu Zheng and Yuheng Cheng and Jinlin Wang and Ceyao Zhang and Zili Wang and Steven Ka Shing Yau and Zijuan Lin and Liyang Zhou and Chenyu Ran and Lingfeng Xiao and Chenglin Wu and J. Meta. The Twelfth International Conference on Learning Representations.

[35] Yujia Qin and Shihao Liang and Yining Ye and Kunlun Zhu and Lan Yan and Yaxi Lu and Yankai Lin and Xin Cong and Xiangru Tang and Bill Qian and Sihan Zhao and Lauren Hong and Runchu Tian and Ruobing Xie and Jie Zhou and Mark Gerstein and dahai li and Zhiyuan Liu and Maosong Sun. Tool. 2024.

[36] Toolformer: Language Models Can Teach Themselves to Use Tools. Thirty-seventh Conference on Neural Information Processing Systems.

[37] Shishir G Patil and Tianjun Zhang and Xin Wang and Joseph E. Gonzalez. Gorilla: Large Language Model Connected with Massive. 2024.

[38] Liu, Zuxin and Hoang, Thai and Zhang, Jianguo and Zhu, Ming and Lan, Tian and Kokane, Shirley and Tan, Juntao and Yao, Weiran and Liu, Zhiwei and Feng, Yihao and Murthy, Rithesh and Yang, Liangwei and Savarese, Silvio and Niebles, Juan Carlos and Wang, Huan and Heinecke, Shelby and Xiong, Caiming. 2024.

[39] Zhibin Gou and Zhihong Shao and Yeyun Gong and yelong shen and Yujiu Yang and Minlie Huang and Nan Duan and Weizhu Chen. To. 2024.

[40] PAL: Program-aided language models. International Conference on Machine Learning, 2023.

[41] Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Transactions on Machine Learning Research, 2023.

[42] Self-RAG: Learning to retrieve, generate, and critique through self-reflection. International Conference on Learning Representations.

[43] Active retrieval augmented generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

[44] Generative agents: Interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology.

[45] Personal LLM agents: Insights and survey about the capability, efficiency and security. arXiv preprint arXiv:2401.05459.

[46] Data Interpreter: An LLM agent for data science. Findings of the Association for Computational Linguistics: ACL 2025.

[47] Guo, Siyuan and Deng, Cheng and Wen, Ying and Chen, Hechang and Chang, Yi and Wang, Jun. 2024.

[48] Antonis Antoniades and Albert. The Thirteenth International Conference on Learning Representations.

[49] LLM experiments with simulation: Large language model multi-agent system for simulation model parametrization in digital twins. 2024 IEEE 29th International Conference on Emerging Technologies and Factory Automation (ETFA).

[50] AI Agents and Agentic AI: navigating a plethora of concepts for future manufacturing. 2025. doi:10.1016/j.jmsy.2025.08.017.

[51] TradingAgents: Multi-agents LLM financial trading framework. arXiv preprint arXiv:2412.20138.

[52] A multimodal foundation agent for financial trading: Tool-augmented, diversified, and generalist. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

[53] Yubin Kim and Chanwoo Park and Hyewon Jeong and Yik Siu Chan and Xuhai Xu and Daniel McDuff and Hyeonhoon Lee and Marzyeh Ghassemi and Cynthia Breazeal and Hae Won Park. 2024.

[54] L-MARS: Legal multi-agent workflow with orchestrated reasoning and agentic search. arXiv preprint arXiv:2509.00761.

[55] Harness Engineering: Leveraging. 2026.

[56] Effective Harnesses for Long-Running Agents.

[57] Qian, Cheng and Acikgoz, Emre Can and He, Qi and WANG, Hongru and Chen, Xiusi and Hakkani-Tur, Dilek and Tur, Gokhan and Ji, Heng. Tool. 2025.

[58] Lang Feng and Zhenghai Xue and Tingcong Liu and Bo An. Group-in-Group Policy Optimization for. 2026.

[59] WILDS: A Benchmark of in-the-Wild Distribution Shifts. Proceedings of the 38th International Conference on Machine Learning, 2021.

[60] Lifan Yuan and Yangyi Chen and Ganqu Cui and Hongcheng Gao and FangYuan Zou and Xingyi Cheng and Heng Ji and Zhiyuan Liu and Maosong Sun. Revisiting Out-of-distribution Robustness in. 2023.

[61] Out-of-distribution generalization in natural language processing: Past, present, and future. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

[62] DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[63] Proximity-Based Multi-Turn Optimization: Practical Credit Assignment for LLM Agent Training. arXiv preprint arXiv:2602.19225.

[64] Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186.

[65] 2025.

[66] Xingyao Wang and Boxuan Li and Yufan Song and Frank F. Xu and Xiangru Tang and Mingchen Zhuge and Jiayi Pan and Yueqi Song and Bowen Li and Jaskirat Singh and Hoang H. Tran and Fuqiang Li and Ren Ma and Mingzhang Zheng and Bill Qian and Yanjun Shao and Niklas Muennighoff and Yizhe Zhang and Binyuan Hui and Junyang Lin and Robert Brennan and Hao Peng and H... 2025.

[67] SWE-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents. Advances in Neural Information Processing Systems, 2025.

[68] SWE-Search: Enhancing software agents with Monte Carlo tree search and iterative refinement. International Conference on Learning Representations.

[69] ReTool: Reinforcement learning for strategic tool use in LLMs. arXiv preprint arXiv:2504.11536.