Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
Pith reviewed 2026-05-07 08:00 UTC · model grok-4.3
The pith
Comet-H is an iterative prompt automaton that keeps code, theory, benchmarks, and documentation aligned in research software projects as specifications evolve.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Comet-H is an iterative prompt automaton that orchestrates ideation, implementation, evaluation, grounding, and paper-writing as coupled coordinates of a single workspace state. At each step, a controller selects the next prompt by scoring it against what the workspace currently lacks, carries unfinished follow-up work forward with a half-life, and re-checks the paper and README against the code and benchmarks whenever documentation changes. Prompt selection is framed as a small contextual bandit problem over prompt families, with prompts as arms, workspace deficits as context, and a hand-weighted linear score.
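The selection rule described above can be sketched as a small argmax over hand-weighted deficit scores. The deficit names, prompt families, and weights below are illustrative placeholders, not the paper's actual values.

```python
# Minimal sketch of a deficit-driven prompt selector, assuming a
# hand-weighted linear score. All names and weights are illustrative.

PROMPT_FAMILIES = {
    "ideate":    {"missing_theory": 1.0},
    "implement": {"failing_tests": 1.0, "missing_code": 0.8},
    "evaluate":  {"stale_benchmarks": 1.0},
    "ground":    {"claim_support_gap": 1.2},
    "write":     {"doc_drift": 0.9},
}

def select_prompt(deficits: dict[str, float]) -> str:
    """Return the prompt family (bandit arm) whose hand-weighted linear
    score against the current workspace deficits (context) is largest."""
    def score(weights: dict[str, float]) -> float:
        return sum(w * deficits.get(name, 0.0) for name, w in weights.items())
    return max(PROMPT_FAMILIES, key=lambda fam: score(PROMPT_FAMILIES[fam]))

# A workspace whose claims outrun their support triggers a grounding pass:
print(select_prompt({"claim_support_gap": 0.7, "doc_drift": 0.3}))  # ground
```

Because the score is a plain weighted sum, each choice is legible directly from the workspace state, which is the transparency property the paper emphasizes.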
What carries the argument
The controller that selects prompts by scoring them against workspace deficits using a hand-weighted linear score and a half-life mechanism to bound unfinished work.
If this is right
- Audit-and-contraction passes dominate the later phases of every successful project trajectory.
- The approach sustains development across approximately 400 commits while maintaining alignment.
- One detailed case reaches an F1 score of 0.768 on a 90-case benchmark, outperforming the next-best baseline.
- The method applies to a portfolio of 46 research-software repositories across two dozen domains.
- The transparent scorer makes prompt choices legible directly from the workspace state without a learned policy.
Where Pith is reading between the lines
- The success of a hand-weighted linear scorer suggests that complex reinforcement learning policies may not be necessary for orchestrating long-horizon AI research tasks.
- This framework could extend to other AI-assisted scientific workflows where theory and implementation must co-evolve, such as mathematical proofs or experimental design.
- Applying the half-life mechanism more broadly might help prevent context bloat in any multi-turn language model interaction.
- Future tests on larger or more complex projects would clarify the limits of the current controller design.
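The half-life mechanism mentioned above can be sketched as an exponentially fading backlog of follow-ups. The half-life value, drop threshold, and item names are assumptions for illustration, not the paper's parameters.

```python
HALF_LIFE = 8     # steps until an unfinished item's weight halves (assumed)
DROP_BELOW = 0.1  # forget items once their weight decays past this (assumed)

def decay(backlog: dict[str, float], steps: int = 1) -> dict[str, float]:
    """Fade every unfinished follow-up by 2**(-steps / HALF_LIFE) and drop
    items below the threshold, bounding how much stale work accumulates."""
    factor = 2 ** (-steps / HALF_LIFE)
    return {item: w * factor
            for item, w in backlog.items() if w * factor >= DROP_BELOW}

backlog = {"tighten Lemma 2 proof": 1.0, "add edge-case test": 0.15}
backlog = decay(backlog)           # both items survive one step
backlog = decay(backlog, steps=8)  # a full half-life later, the weak item is gone
```

The same geometric fade would apply in any multi-turn setting where old follow-ups should lose priority unless re-raised, which is what makes the context-bloat extension plausible.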
Load-bearing premise
The hand-weighted linear score over workspace deficits and the half-life mechanism for unfinished work can reliably prevent hallucination accumulation and desynchronization in long-running projects without learned policies or extensive human intervention.
What would settle it
A run of the system on a research software project in which claims in the paper or README repeatedly exceed the support from the code and benchmarks, or where code, theory, and documentation become inconsistent over the course of hundreds of commits.
Figures
original abstract
Large language models can now generate substantial code and draft research text, but research-software projects require more than either artifact alone. The mathematical thesis, executable system, benchmark surface, and public claims must mature together, yet often drift apart. We identify two LM-specific failure modes: hallucination accumulation, in which claims exceed what code or theory supports and unsupported assertions propagate across sessions; and desynchronization, in which code, theory, or the model's own world model fall out of alignment. We propose Comet-H, an iterative prompt automaton that orchestrates ideation, implementation, evaluation, grounding, and paper-writing as coupled coordinates of a single workspace state. At each step, a controller selects the next prompt by scoring it against what the workspace currently lacks, carries unfinished follow-up work forward with a half-life, and re-checks the paper and README against the code and benchmarks whenever documentation changes. We frame prompt selection as a small contextual bandit problem over prompt families, with prompts as arms, workspace deficits as context, and a hand-weighted linear score. This transparent scorer, paired with a fading record of unfinished work, bounds long-horizon follow-ups, requires no learned policy, and makes each prompt choice legible from the workspace. We created a portfolio of 46 research-software repositories across two dozen domains. We study A3 in depth, a Python static-analysis tool built entirely within the loop, which reaches (F1 = 0.768) on a 90-case benchmark, compared with a next-best baseline of 0.364. Across approximately 400 commits, we find that audit-and-contraction passes dominate the later phases of every successful trajectory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Comet-H, an iterative prompt automaton for orchestrating large language models in the development of research software projects where the specification evolves over time. It identifies hallucination accumulation and desynchronization as key LM-specific failure modes and addresses them by maintaining a single workspace state that couples ideation, implementation, evaluation, grounding, and paper-writing. Prompt selection is framed as a contextual bandit problem using a hand-weighted linear score based on workspace deficits, with a half-life mechanism for unfinished work. The approach is evaluated through a portfolio of 46 research-software repositories, with a detailed case study on A3, a Python static-analysis tool developed entirely within the loop, achieving an F1 score of 0.768 on a 90-case benchmark compared to a baseline of 0.364. The study observes that audit-and-contraction passes dominate later phases of successful trajectories across approximately 400 commits.
Significance. If the central claims hold, this work would be significant for software engineering and AI-assisted research by providing a transparent, non-learned-policy method for long-horizon orchestration of LMs in complex, evolving projects. The emphasis on coupled evolution of code, benchmarks, and documentation, along with empirical results from a constructed system like A3, offers a practical demonstration of managing common LM pitfalls. The portfolio approach adds breadth, and the transparent scorer design is a strength that allows for legibility of decisions. However, the reliance on hand-weighted parameters without reported robustness checks limits the strength of the conclusions regarding reliability without human intervention.
major comments (2)
- [Comet-H Controller] The claim that the hand-weighted linear score over workspace deficits, combined with the half-life mechanism, reliably bounds hallucination accumulation and desynchronization across long trajectories without learned policies or extensive intervention is load-bearing for the contribution. However, no sensitivity analysis, ablation studies, or perturbation tests on the weights or half-life parameter are reported (see the controller description). Given that these are free parameters and results derive from a single detailed trajectory (~400 commits on A3), the weights could have been selected to fit the observed outcomes, which would undermine the assertions of transparency and reliability without learned policies. A concrete test such as re-running trajectories with perturbed weights and reporting variance in success metrics is needed to support the claim.
- [Evaluation and Case Study] The key empirical result (F1 = 0.768 on the 90-case benchmark for A3 vs. next-best baseline of 0.364) lacks details on experimental controls, the exact implementation of the baseline, the composition of the benchmark cases, statistical analysis (e.g., confidence intervals or significance tests), or controls for prompt variations. This information is load-bearing for assessing whether the performance improvement is attributable to the Comet-H orchestration (see the A3 case study and evaluation sections).
minor comments (2)
- [Abstract] The abstract phrasing 'reaches (F1 = 0.768)' uses unnecessary parentheses; consider rephrasing to 'reaches an F1 score of 0.768' for improved readability.
- [Portfolio Construction] The portfolio of 46 repositories across two dozen domains is mentioned, but additional details on selection criteria, diversity metrics, or aggregate success rates across the full portfolio (beyond the A3 focus) would strengthen the presentation of the general observation that audit-and-contraction passes dominate successful trajectories.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments correctly identify areas where additional analysis and documentation would strengthen the claims regarding the Comet-H controller's reliability and the attribution of empirical results. We respond point-by-point below and will incorporate revisions to address the concerns while preserving the manuscript's core contributions on transparent, non-learned orchestration for evolving research software.
point-by-point responses
Referee: [Comet-H Controller] The claim that the hand-weighted linear score over workspace deficits, combined with the half-life mechanism, reliably bounds hallucination accumulation and desynchronization across long trajectories without learned policies or extensive intervention is load-bearing for the contribution. However, no sensitivity analysis, ablation studies, or perturbation tests on the weights or half-life parameter are reported (see the controller description). Given that these are free parameters and results derive from a single detailed trajectory (~400 commits on A3), the weights could have been selected to fit the observed outcomes, which would undermine the assertions of transparency and reliability without learned policies. A concrete test such as re-running trajectories with perturbed weights and reporting variance in success metrics is needed to support the claim.
Authors: We agree that the lack of sensitivity analysis on the hand-weighted parameters represents a genuine limitation in the current version, as the weights were iteratively refined during development of the A3 trajectory and the broader portfolio. The linear scorer was designed for legibility rather than optimality, with weights reflecting observed deficit priorities (e.g., higher weight on code-benchmark alignment than on documentation in early phases). To strengthen this, we will revise the controller section to explicitly document the weight selection rationale and add a limited sensitivity analysis: we will perturb weights by ±20% and re-evaluate on a representative subset of shorter trajectories from the 46-repository portfolio (approximately 10-20 commits each), reporting variance in success rate, final F1, and trajectory length. The half-life parameter will be specified with justification tied to typical research task durations. While full re-execution of the complete ~400-commit A3 trajectory under multiple perturbations is not feasible due to resource constraints, the proposed analysis will demonstrate robustness and support the transparency claim over black-box learned policies. revision: partial
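A ±20% perturbation study of the kind promised here might be organized as follows. `run_trajectory` and the weight names are hypothetical stand-ins for the authors' actual pipeline.

```python
import itertools

BASE_WEIGHTS = {"claim_support_gap": 1.2, "doc_drift": 0.9}  # illustrative

def perturbation_grid(weights: dict[str, float], delta: float = 0.2):
    """Yield one weight vector per corner of the +/-delta hypercube
    around the hand-chosen weights."""
    names = sorted(weights)
    for signs in itertools.product((-1, 1), repeat=len(names)):
        yield {n: weights[n] * (1 + s * delta) for n, s in zip(names, signs)}

# With a hypothetical run_trajectory(weights) -> final F1, the reported
# variance would be:
#   f1s = [run_trajectory(w) for w in perturbation_grid(BASE_WEIGHTS)]
#   print(statistics.mean(f1s), statistics.pstdev(f1s))
```

Note the grid grows as 2**k in the number of weights, which is one reason restricting the analysis to short trajectories, as the authors propose, is a reasonable compromise.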
Referee: [Evaluation and Case Study] The key empirical result (F1 = 0.768 on the 90-case benchmark for A3 vs. next-best baseline of 0.364) lacks details on experimental controls, the exact implementation of the baseline, the composition of the benchmark cases, statistical analysis (e.g., confidence intervals or significance tests), or controls for prompt variations. This information is load-bearing for assessing whether the performance improvement is attributable to the Comet-H orchestration (see the A3 case study and evaluation sections).
Authors: We concur that expanded experimental details are required for reproducibility and to isolate the contribution of the orchestration mechanism. In the revised evaluation section, we will add: (1) a precise description of the baseline as direct, non-orchestrated LLM prompting using the same model and core prompt templates but without deficit scoring, half-life tracking, or coupled workspace state; (2) the benchmark composition, consisting of 90 manually verified cases spanning type errors, security vulnerabilities, linting issues, and edge-case false positives/negatives drawn from standard Python static analysis scenarios; (3) statistical support including 95% bootstrap confidence intervals on F1 scores and a significance test (e.g., McNemar's test) comparing Comet-H to the baseline; and (4) prompt variation controls, with results from minor rephrasings within the same prompt families to confirm that gains derive from the controller's deficit-driven selection and audit passes rather than wording. These additions will clarify attribution to the coupled evolution process across the ~400 commits. revision: yes
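The promised statistical additions could take roughly this shape. The per-case count triples are invented placeholders, not the real 90-case results.

```python
import random

def f1(cases):
    """cases: list of (tp, fp, fn) counts, one triple per benchmark case."""
    tp = sum(c[0] for c in cases)
    fp = sum(c[1] for c in cases)
    fn = sum(c[2] for c in cases)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bootstrap_ci(cases, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval on F1, resampling cases."""
    rng = random.Random(seed)
    stats = sorted(f1(rng.choices(cases, k=len(cases))) for _ in range(n_boot))
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2)) - 1]

def mcnemar_chi2(b, c):
    """Continuity-corrected McNemar statistic for paired per-case outcomes:
    b = cases only the baseline got right, c = cases only Comet-H got right.
    Values above 3.84 indicate p < 0.05 at one degree of freedom."""
    return (abs(b - c) - 1) ** 2 / (b + c) if b + c else 0.0
```

Resampling whole cases (rather than individual predictions) respects the benchmark's unit of evaluation, and McNemar's test is the standard paired test when both systems are scored on the same 90 cases.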
- Declined: full re-execution of the complete ~400-commit A3 trajectory under multiple perturbed weight sets, owing to prohibitive computational and time costs.
Circularity Check
No significant circularity: empirical system proposal with transparent design choices
full rationale
The paper proposes Comet-H as an engineering system (iterative prompt automaton with hand-weighted linear scorer over workspace deficits plus half-life for unfinished work) and reports observed empirical outcomes on a constructed portfolio of 46 repositories, including F1=0.768 on the A3 benchmark after ~400 commits. No mathematical derivation chain, first-principles prediction, or uniqueness theorem is claimed. The hand-weighted score is explicitly presented as a transparent, non-learned design choice rather than a fitted parameter renamed as a prediction. No self-citations, ansatzes smuggled via prior work, or self-definitional reductions appear in the abstract or described claims. The central results (audit-and-contraction dominance, hallucination bounding) are measured outcomes of running the constructed system, not quantities that reduce to the inputs by construction. This is a standard empirical systems paper whose claims stand or fall on the reported runs and are not circular.
Axiom & Free-Parameter Ledger
free parameters (2)
- weights in linear score
- half-life parameter
axioms (1)
- domain assumption: LLMs can effectively perform tasks such as ideation, implementation, evaluation, grounding, and paper-writing when prompted appropriately in an iterative loop
invented entities (1)
- Comet-H (no independent evidence)