LemonHarness Technical Report

Congli Yin; Fubo Sun; Hongke Zhao; Jiachen Liu; Jianping Fan; Jiawei Liu; Jiaying Li; Kailong Ren; Lei Zhang; Likang Wu

arxiv: 2606.24311 · v1 · pith:GJIDKN5Cnew · submitted 2026-06-23 · 💻 cs.AI

LemonHarness Technical Report

Kailong Ren , Fubo Sun , Jiachen Liu , Liu Yang , Zimo Yin , Jiaying Li , Congli Yin , Ming He

show 13 more authors

Yu Huo Jiawei Liu Zeping Chen Yubin Huangfu Ronghua Li Yixuan Wu Xing Su Yanzhi Xu Likang Wu Hongke Zhao Lei Zhang Xiaohui Geng Jianping Fan

This is my paper

Pith reviewed 2026-06-26 00:04 UTC · model grok-4.3

classification 💻 cs.AI

keywords LemonHarnessLLM agentsworkspace boundaryrule knowledge basetime-aware executionlong-horizon tasksTerminal-Benchstate tracking

0 comments

The pith

LemonHarness unifies runtime boundaries, callable rules, and time awareness to stabilize long-horizon LLM agent execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LemonHarness to address scattered state changes that accumulate when LLM agents modify files and artifacts over many iterations while seeing only tool outputs. It creates an explicit workspace boundary that routes all state-changing operations through structured tool interfaces, records their effects as observations, and folds model calls, tool runs, and rule knowledge into one controlled loop. A reusable rule knowledge base converts recurring acceptance criteria into callable runtime facts, while a time-aware mechanism surfaces elapsed and remaining budget so the model can rebalance exploration, implementation, and checks. On Terminal-Bench 2.0 the framework reaches 84.49 percent accuracy with one backbone and 86.52 percent with a stronger one across hundreds of trials. These mechanisms are presented as the source of improved stability for extended agent tasks.

Core claim

LemonHarness establishes an explicit execution boundary by constraining state-changing operations within a clearly defined workspace and bringing model invocation, tool execution, and rule knowledge within a single controlled boundary. State-changing operations, including file writes, dependency installation, and temporary artifact creation, are executed through structured tool interfaces, with execution feedback recorded as observations available to subsequent model decisions. The system also introduces a reusable rule knowledge base, which turns recurring execution rules and acceptance criteria into runtime knowledge. LemonHarness further adds a time-aware execution mechanism that exposes

What carries the argument

LemonHarness integrated execution framework that enforces a unified runtime boundary, exposes callable rule knowledge, and supplies time-budget signals to the model.

If this is right

State changes become observable and reversible because every write or artifact creation returns structured feedback.
Recurring rules no longer need to be re-prompted each round because they live in a callable knowledge base.
Models can shift effort among exploration, coding, and validation as remaining time shrinks, reducing timeout failures.
Long-horizon runs accumulate fewer untracked file modifications across multiple agent rounds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same boundary pattern could be applied to non-agent tool-use loops that also suffer from invisible side effects.
Time-budget exposure might be generalized to token or memory limits in other agent scaffolds.
Rule knowledge bases could be versioned and shared across teams to standardize acceptance criteria for similar tasks.

Load-bearing premise

Accuracy gains are produced by the LemonHarness mechanisms rather than by the choice of LLM backbone or other unstated experimental factors.

What would settle it

Run identical Terminal-Bench tasks on the same backbone model once inside LemonHarness and once without its boundary, rule base, or time signals, then compare accuracy and state-tracking error rates.

Figures

Figures reproduced from arXiv: 2606.24311 by Congli Yin, Fubo Sun, Hongke Zhao, Jiachen Liu, Jianping Fan, Jiawei Liu, Jiaying Li, Kailong Ren, Lei Zhang, Likang Wu, Liu Yang, Ming He, Ronghua Li, Xiaohui Geng, Xing Su, Yanzhi Xu, Yixuan Wu, Yubin Huangfu, Yu Huo, Zeping Chen, Zimo Yin.

**Figure 1.** Figure 1: Overview of the LemonHarness framework. A task is processed through a unified runtime boundary that integrates model reasoning, tool execution, workspace state, and execution logs. General rule knowledge provides reusable execution rules, criteria, and task priors, while time-aware control adjusts exploration, implementation, and validation according to elapsed and remaining budget. The final task state … view at source ↗

read the original abstract

As large language model (LLM) agents are applied to longer tasks, they increasingly modify workspace state across multiple rounds of iteration. However, agents typically observe only tool outputs and log fragments, while the actual state changes occur in the file system. Without explicit workspace boundaries, state-changing operations such as file writes and temporary artifact generation may scatter changes across paths. Over time, these weakly constrained changes accumulate, making states such as modified files difficult to track. This paper presents LemonHarness, an integrated execution framework for long-horizon agents. LemonHarness establishes an explicit execution boundary by constraining state-changing operations within a clearly defined workspace and bringing model invocation, tool execution, and rule knowledge within a single controlled boundary. State-changing operations, including file writes, dependency installation, and temporary artifact creation, are executed through structured tool interfaces, with execution feedback recorded as observations available to subsequent model decisions. The system also introduces a reusable rule knowledge base, which turns recurring execution rules and acceptance criteria into runtime knowledge. LemonHarness further adds a time-aware execution mechanism that exposes elapsed and remaining budget to the model, so it can rebalance exploration, implementation, and validation effort as time pressure shifts and avoid timeouts from long waits or excessive verification. On Terminal-Bench 2.0, LemonHarness_GPT-5.3-CodeX reached 84.49% accuracy over 445 trials; pairing the same framework with the stronger GPT-5.5 backbone raised the average accuracy to 86.52% across five jobs. The results suggest that a unified runtime boundary, callable rule knowledge, and time-aware execution can improve the stability of long-horizon agent execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LemonHarness bundles workspace boundaries, rule knowledge, and time exposure into an agent runtime but reports accuracies only for the full system with no baselines or ablations.

read the letter

The main takeaway is that LemonHarness is a technical report describing an integrated runtime that constrains agent state changes to a defined workspace, routes modifications through structured tool interfaces, keeps a reusable rule knowledge base, and surfaces elapsed and remaining time to the model. The authors give absolute accuracies of 84.49% and 86.52% on Terminal-Bench 2.0 for two GPT backbones.

The framework description is clear about the problem of scattered file-system changes over long horizons and how the boundary plus rule base plus time signal can keep execution contained. The combination of those pieces into one controlled loop is presented as a practical pattern, and the writing focuses on engineering details that matter for implementation.

The limitation is direct: the numbers come only from the complete LemonHarness setup. No runs appear for the same models without the workspace constraints or rule base, no component ablations are shown, and no variance or control information is supplied for the 445 trials. That leaves the central suggestion—that the three mechanisms improve stability—unsupported by evidence that isolates their effect from the underlying models.

The paper shows straightforward thinking about agent execution issues and avoids internal contradictions or circular claims. It is an honest engineering note rather than a controlled experiment.

Practitioners tuning long-horizon coding agents could find the runtime ideas worth trying in their own code. Readers who need evidence that the mechanisms cause the reported gains will not get it here. I would not cite the accuracy figures.

It does not merit peer review as presented because the main result cannot be evaluated without the missing controls.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces LemonHarness, an execution framework for long-horizon LLM agents. It enforces an explicit workspace boundary for state-changing operations, supplies a callable reusable rule knowledge base, and exposes elapsed/remaining time to the model. On Terminal-Bench 2.0 the framework reports 84.49% accuracy (GPT-5.3-CodeX, 445 trials) and 86.52% accuracy (GPT-5.5).

Significance. If the performance gains can be causally attributed to the three mechanisms, the work would address a practical obstacle in agent reliability by reducing uncontrolled state drift and timeout failures. The absolute numbers are high, but without isolating experiments their contribution remains unverified.

major comments (1)

[Abstract] Abstract: the central claim—that unified runtime boundary, callable rule knowledge, and time-aware execution improve long-horizon stability—rests solely on absolute accuracies for the integrated system. No control runs, ablations, or baseline results are reported for the identical LLM backbones under a standard agent loop without the workspace constraints, rule KB, or time exposure. Consequently the numbers cannot be attributed to the proposed mechanisms rather than model choice or other unstated factors.

Simulated Author's Rebuttal

1 responses · 1 unresolved

Thank you for the review. We address the major comment on attribution of results below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim—that unified runtime boundary, callable rule knowledge, and time-aware execution improve long-horizon stability—rests solely on absolute accuracies for the integrated system. No control runs, ablations, or baseline results are reported for the identical LLM backbones under a standard agent loop without the workspace constraints, rule KB, or time exposure. Consequently the numbers cannot be attributed to the proposed mechanisms rather than model choice or other unstated factors.

Authors: We agree that the manuscript reports absolute accuracies only for the complete LemonHarness system and provides no ablation studies, control runs, or baselines with the same backbones under a standard agent loop. The abstract uses 'suggest' rather than claiming definitive causation. The contribution is the integrated framework and its benchmark results. We will revise the abstract to state explicitly that performance figures apply to the full system without isolated controls or attribution experiments. revision: partial

standing simulated objections not resolved

Absence of ablation studies or control experiments isolating the contributions of workspace boundary, rule knowledge base, and time awareness.

Circularity Check

0 steps flagged

No derivation chain or equations; empirical benchmark report only

full rationale

The paper presents an engineering framework for LLM agents and reports absolute benchmark accuracies on Terminal-Bench 2.0. No equations, derivations, fitted parameters, or mathematical claims appear anywhere in the text. The central suggestion that the three mechanisms improve stability is supported solely by the reported numbers for the integrated system; there are no self-referential definitions, predictions derived from fitted inputs, or load-bearing self-citations that reduce any result to its own inputs by construction. The absence of any derivation chain makes circularity analysis inapplicable; the work is self-contained as an empirical technical report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5893 in / 1108 out tokens · 21263 ms · 2026-06-26T00:04:18.918625+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 12 linked inside Pith

[1]

(2025) SABER: Small actions, big errors– safeguarding mutating steps in LLM agents

Cuadron, A., Y u, P ., Liu, Y ., & Gupta, A. (2025) SABER: Small actions, big errors– safeguarding mutating steps in LLM agents. arXiv preprint arXiv:2512.07850

arXiv 2025
[2]

E., Lee, N., Kim, S., Moon, S., Furuta, H., Anumanchipalli, G., Keutzer, K., & Gholami, A

Erdogan, L. E., Lee, N., Kim, S., Moon, S., Furuta, H., Anumanchipalli, G., Keutzer, K., & Gholami, A. (2025) Plan-and-act: Improving planning of agents for long-horizon tasks. In F orty- second International Conference on Machine Learning (ICML)

2025
[3]

(2023) PAL: Program-aided language models

Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P ., Y ang, Y ., Callan, J., & Neubig, G. (2023) PAL: Program-aided language models. In International Conference on Machine Learning , pages 10764–10799. PMLR

2023
[4]

(2025) Enforcing tem- poral constraints for LLM agents

Kamath, A., Zhang, S., Xu, C., Ugare, S., Singh, G., & Misailovic, S. (2025) Enforcing tem- poral constraints for LLM agents. arXiv preprint arXiv:2512.23738

arXiv 2025
[5]

(2026) Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses

Lin, J., Liu, S., Pan, C., Lin, L., Dou, S., Huang, X., Y an, H., Han, Z., & Gui, T. (2026) Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses. arXiv preprint arXiv:2604.25850

Pith/arXiv arXiv 2026
[6]

(2025) UltraHorizon: Benchmarking agent capabilities in ultra long-horizon scenarios

Luo, H., Zhang, H., Zhang, X., Wang, H., Qin, Z., Lu, W., Ma, G., He, H., Xie, Y ., Zhou, Q., et al. (2025) UltraHorizon: Benchmarking agent capabilities in ultra long-horizon scenarios. arXiv preprint arXiv:2509.21766

arXiv 2025
[7]

A., Shaw, A

Merrill, M. A., Shaw, A. G., Carlini, N., Li, B., Raj, H., Bercovich, I., Shi, L., Shin, J. Y ., Walshe, T., Buchanan, E. K., et al. (2026) Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868

Pith/arXiv arXiv 2026
[8]

(2026) Code as agent harness

Ning, X., Tieu, K., Fu, D., Wei, T., Li, Z., Bei, Y ., Zou, J., Ai, M., Liu, Z., Li, T.-W., et al. (2026) Code as agent harness. arXiv preprint arXiv:2605.18747

Pith/arXiv arXiv 2026
[9]

N., Parker-Holder, J., & Rocktäschel, T

Paglieri, D., Cupiał, B., Cook, J., Piterbarg, U., Tuyls, J., Grefenstette, E., Foerster, J. N., Parker-Holder, J., & Rocktäschel, T. (2025) Learning when to plan: Efﬁciently allocating test-time compute for LLM agents. arXiv preprint arXiv:2509.03581

arXiv 2025
[10]

(2023) Toolformer: Language models can teach themselves to use tools

Schick, T., Dwivedi-Y u, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023) Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems , 36:68539–68551

2023
[11]

(2026) EvoCode-Bench: Evaluating coding agents in multi-turn iterative interactions

Shen, H., Chen, X., Xu, W., Ma, Y ., Chen, L., & Li, K. (2026) EvoCode-Bench: Evaluating coding agents in multi-turn iterative interactions. arXiv preprint arXiv:2605.24110

Pith/arXiv arXiv 2026
[12]

(2023) Reﬂexion: Language agents with verbal reinforcement learning

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Y ao, S. (2023) Reﬂexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems , 36:8634–8652

2023
[13]

(2025) OpenAI GPT-5 system card

Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al. (2025) OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267

Pith/arXiv arXiv 2025
[14]

(2026) EnvScaler: Scaling tool-interactive environments for llm agent via programmatic synthesis

Song, X., Chang, H., Dong, G., Zhu, Y ., Wen, J.-R., & Dou, Z. (2026) EnvScaler: Scaling tool-interactive environments for llm agent via programmatic synthesis. arXiv preprint arXiv:2601.05808

Pith/arXiv arXiv 2026
[15]

(2023) V oyager: An open-ended embodied agent with large language models

Wang, G., Xie, Y ., Jiang, Y ., Mandlekar, A., Xiao, C., Zhu, Y ., Fan, L., & Anandkumar, A. (2023) V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291. 8

Pith/arXiv arXiv 2023
[16]

J., Bai, H., Sun, Y ., Wang, H., Zhang, S., Hu, W., Schroder, M., Mutlu, B., Song, D., & Nowak, R

Wang, X. J., Bai, H., Sun, Y ., Wang, H., Zhang, S., Hu, W., Schroder, M., Mutlu, B., Song, D., & Nowak, R. D. (2026) The long-horizon task mirage? Diagnosing where and why agentic systems break. arXiv preprint arXiv:2604.11978

Pith/arXiv arXiv 2026
[17]

E., Peng, B., Qin, S., Nath, S., Lin, Q., et al

Wang, Z., Wu, Q., Zhang, X., Zhang, C., Y ao, W., Faisal, F. E., Peng, B., Qin, S., Nath, S., Lin, Q., et al. (2026) WebXSkill: Skill learning for autonomous web agents. arXiv preprint arXiv:2604.13318

Pith/arXiv arXiv 2026
[18]

(2026) Adapting the interface, not the model: Runtime harness adaptation for deterministic LLM agents

Xu, T., Wen, H., & Li, M. (2026) Adapting the interface, not the model: Runtime harness adaptation for deterministic LLM agents. arXiv preprint arXiv:2605.22166

Pith/arXiv arXiv 2026
[19]

(2026) SkillOpt: Executive strategy for self-evolving agent skills

Y ang, Y ., Gong, Z., Huang, W., Y ang, Q., Zhou, Z., Huang, Z., Li, Y ., Gao, X., Dai, Q., Liu, B., et al. (2026) SkillOpt: Executive strategy for self-evolving agent skills. arXiv preprint arXiv:2605.23904

Pith/arXiv arXiv 2026
[20]

(2022) ReAct: Syner- gizing reasoning and acting in language models

Y ao, S., Zhao, J., Y u, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y . (2022) ReAct: Syner- gizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629

Pith/arXiv arXiv 2022
[21]

(2024) ExpeL: LLM agents are experiential learners

Zhao, A., Huang, D., Xu, Q., Lin, M., Liu, Y .-J., & Huang, G. (2024) ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence 38, pages 19632–19642. 9

2024

[1] [1]

(2025) SABER: Small actions, big errors– safeguarding mutating steps in LLM agents

Cuadron, A., Y u, P ., Liu, Y ., & Gupta, A. (2025) SABER: Small actions, big errors– safeguarding mutating steps in LLM agents. arXiv preprint arXiv:2512.07850

arXiv 2025

[2] [2]

E., Lee, N., Kim, S., Moon, S., Furuta, H., Anumanchipalli, G., Keutzer, K., & Gholami, A

Erdogan, L. E., Lee, N., Kim, S., Moon, S., Furuta, H., Anumanchipalli, G., Keutzer, K., & Gholami, A. (2025) Plan-and-act: Improving planning of agents for long-horizon tasks. In F orty- second International Conference on Machine Learning (ICML)

2025

[3] [3]

(2023) PAL: Program-aided language models

Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P ., Y ang, Y ., Callan, J., & Neubig, G. (2023) PAL: Program-aided language models. In International Conference on Machine Learning , pages 10764–10799. PMLR

2023

[4] [4]

(2025) Enforcing tem- poral constraints for LLM agents

Kamath, A., Zhang, S., Xu, C., Ugare, S., Singh, G., & Misailovic, S. (2025) Enforcing tem- poral constraints for LLM agents. arXiv preprint arXiv:2512.23738

arXiv 2025

[5] [5]

(2026) Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses

Lin, J., Liu, S., Pan, C., Lin, L., Dou, S., Huang, X., Y an, H., Han, Z., & Gui, T. (2026) Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses. arXiv preprint arXiv:2604.25850

Pith/arXiv arXiv 2026

[6] [6]

(2025) UltraHorizon: Benchmarking agent capabilities in ultra long-horizon scenarios

Luo, H., Zhang, H., Zhang, X., Wang, H., Qin, Z., Lu, W., Ma, G., He, H., Xie, Y ., Zhou, Q., et al. (2025) UltraHorizon: Benchmarking agent capabilities in ultra long-horizon scenarios. arXiv preprint arXiv:2509.21766

arXiv 2025

[7] [7]

A., Shaw, A

Merrill, M. A., Shaw, A. G., Carlini, N., Li, B., Raj, H., Bercovich, I., Shi, L., Shin, J. Y ., Walshe, T., Buchanan, E. K., et al. (2026) Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868

Pith/arXiv arXiv 2026

[8] [8]

(2026) Code as agent harness

Ning, X., Tieu, K., Fu, D., Wei, T., Li, Z., Bei, Y ., Zou, J., Ai, M., Liu, Z., Li, T.-W., et al. (2026) Code as agent harness. arXiv preprint arXiv:2605.18747

Pith/arXiv arXiv 2026

[9] [9]

N., Parker-Holder, J., & Rocktäschel, T

Paglieri, D., Cupiał, B., Cook, J., Piterbarg, U., Tuyls, J., Grefenstette, E., Foerster, J. N., Parker-Holder, J., & Rocktäschel, T. (2025) Learning when to plan: Efﬁciently allocating test-time compute for LLM agents. arXiv preprint arXiv:2509.03581

arXiv 2025

[10] [10]

(2023) Toolformer: Language models can teach themselves to use tools

Schick, T., Dwivedi-Y u, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023) Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems , 36:68539–68551

2023

[11] [11]

(2026) EvoCode-Bench: Evaluating coding agents in multi-turn iterative interactions

Shen, H., Chen, X., Xu, W., Ma, Y ., Chen, L., & Li, K. (2026) EvoCode-Bench: Evaluating coding agents in multi-turn iterative interactions. arXiv preprint arXiv:2605.24110

Pith/arXiv arXiv 2026

[12] [12]

(2023) Reﬂexion: Language agents with verbal reinforcement learning

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Y ao, S. (2023) Reﬂexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems , 36:8634–8652

2023

[13] [13]

(2025) OpenAI GPT-5 system card

Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al. (2025) OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267

Pith/arXiv arXiv 2025

[14] [14]

(2026) EnvScaler: Scaling tool-interactive environments for llm agent via programmatic synthesis

Song, X., Chang, H., Dong, G., Zhu, Y ., Wen, J.-R., & Dou, Z. (2026) EnvScaler: Scaling tool-interactive environments for llm agent via programmatic synthesis. arXiv preprint arXiv:2601.05808

Pith/arXiv arXiv 2026

[15] [15]

(2023) V oyager: An open-ended embodied agent with large language models

Wang, G., Xie, Y ., Jiang, Y ., Mandlekar, A., Xiao, C., Zhu, Y ., Fan, L., & Anandkumar, A. (2023) V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291. 8

Pith/arXiv arXiv 2023

[16] [16]

J., Bai, H., Sun, Y ., Wang, H., Zhang, S., Hu, W., Schroder, M., Mutlu, B., Song, D., & Nowak, R

Wang, X. J., Bai, H., Sun, Y ., Wang, H., Zhang, S., Hu, W., Schroder, M., Mutlu, B., Song, D., & Nowak, R. D. (2026) The long-horizon task mirage? Diagnosing where and why agentic systems break. arXiv preprint arXiv:2604.11978

Pith/arXiv arXiv 2026

[17] [17]

E., Peng, B., Qin, S., Nath, S., Lin, Q., et al

Wang, Z., Wu, Q., Zhang, X., Zhang, C., Y ao, W., Faisal, F. E., Peng, B., Qin, S., Nath, S., Lin, Q., et al. (2026) WebXSkill: Skill learning for autonomous web agents. arXiv preprint arXiv:2604.13318

Pith/arXiv arXiv 2026

[18] [18]

(2026) Adapting the interface, not the model: Runtime harness adaptation for deterministic LLM agents

Xu, T., Wen, H., & Li, M. (2026) Adapting the interface, not the model: Runtime harness adaptation for deterministic LLM agents. arXiv preprint arXiv:2605.22166

Pith/arXiv arXiv 2026

[19] [19]

(2026) SkillOpt: Executive strategy for self-evolving agent skills

Y ang, Y ., Gong, Z., Huang, W., Y ang, Q., Zhou, Z., Huang, Z., Li, Y ., Gao, X., Dai, Q., Liu, B., et al. (2026) SkillOpt: Executive strategy for self-evolving agent skills. arXiv preprint arXiv:2605.23904

Pith/arXiv arXiv 2026

[20] [20]

(2022) ReAct: Syner- gizing reasoning and acting in language models

Y ao, S., Zhao, J., Y u, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y . (2022) ReAct: Syner- gizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629

Pith/arXiv arXiv 2022

[21] [21]

(2024) ExpeL: LLM agents are experiential learners

Zhao, A., Huang, D., Xu, Q., Lin, M., Liu, Y .-J., & Huang, G. (2024) ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence 38, pages 19632–19642. 9

2024