Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
Pith reviewed 2026-05-07 08:00 UTC · model grok-4.3
The pith
Comet-H is an iterative prompt automaton that keeps code, theory, benchmarks, and documentation aligned in research software projects as specifications evolve.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Comet-H is an iterative prompt automaton that orchestrates ideation, implementation, evaluation, grounding, and paper-writing as coupled coordinates of a single workspace state. At each step, a controller selects the next prompt by scoring it against what the workspace currently lacks, carries unfinished follow-up work forward with a half-life, and re-checks the paper and README against the code and benchmarks whenever documentation changes. Prompt selection is framed as a small contextual bandit problem over prompt families, with prompts as arms, workspace deficits as context, and a hand-weighted linear score.
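The selection rule described above can be sketched as a small argmax over hand-weighted deficit scores. The deficit names, prompt families, and weights below are illustrative placeholders, not the paper's actual values.

```python
# Minimal sketch of a deficit-driven prompt selector, assuming a
# hand-weighted linear score. All names and weights are illustrative.

PROMPT_FAMILIES = {
    "ideate":    {"missing_theory": 1.0},
    "implement": {"failing_tests": 1.0, "missing_code": 0.8},
    "evaluate":  {"stale_benchmarks": 1.0},
    "ground":    {"claim_support_gap": 1.2},
    "write":     {"doc_drift": 0.9},
}

def select_prompt(deficits: dict[str, float]) -> str:
    """Return the prompt family (bandit arm) whose hand-weighted linear
    score against the current workspace deficits (context) is largest."""
    def score(weights: dict[str, float]) -> float:
        return sum(w * deficits.get(name, 0.0) for name, w in weights.items())
    return max(PROMPT_FAMILIES, key=lambda fam: score(PROMPT_FAMILIES[fam]))

# A workspace whose claims outrun their support triggers a grounding pass:
print(select_prompt({"claim_support_gap": 0.7, "doc_drift": 0.3}))  # ground
```

Because the score is a plain weighted sum, each choice is legible directly from the workspace state, which is the transparency property the paper emphasizes.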
What carries the argument
The controller that selects prompts by scoring them against workspace deficits using a hand-weighted linear score and a half-life mechanism to bound unfinished work.
If this is right
- Audit-and-contraction passes dominate the later phases of every successful project trajectory.
- The approach sustains development across approximately 400 commits while maintaining alignment.
- One detailed case reaches an F1 score of 0.768 on a 90-case benchmark, outperforming the next-best baseline.
- The method applies to a portfolio of 46 research-software repositories across two dozen domains.
- The transparent scorer makes prompt choices legible directly from the workspace state without a learned policy.
Where Pith is reading between the lines
- The success of a hand-weighted linear scorer suggests that complex reinforcement learning policies may not be necessary for orchestrating long-horizon AI research tasks.
- This framework could extend to other AI-assisted scientific workflows where theory and implementation must co-evolve, such as mathematical proofs or experimental design.
- Applying the half-life mechanism more broadly might help prevent context bloat in any multi-turn language model interaction.
- Future tests on larger or more complex projects would clarify the limits of the current controller design.
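The half-life mechanism mentioned above can be sketched as an exponentially fading backlog of follow-ups. The half-life value, drop threshold, and item names are assumptions for illustration, not the paper's parameters.

```python
HALF_LIFE = 8     # steps until an unfinished item's weight halves (assumed)
DROP_BELOW = 0.1  # forget items once their weight decays past this (assumed)

def decay(backlog: dict[str, float], steps: int = 1) -> dict[str, float]:
    """Fade every unfinished follow-up by 2**(-steps / HALF_LIFE) and drop
    items below the threshold, bounding how much stale work accumulates."""
    factor = 2 ** (-steps / HALF_LIFE)
    return {item: w * factor
            for item, w in backlog.items() if w * factor >= DROP_BELOW}

backlog = {"tighten Lemma 2 proof": 1.0, "add edge-case test": 0.15}
backlog = decay(backlog)           # both items survive one step
backlog = decay(backlog, steps=8)  # a full half-life later, the weak item is gone
```

The same geometric fade would apply in any multi-turn setting where old follow-ups should lose priority unless re-raised, which is what makes the context-bloat extension plausible.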
Load-bearing premise
The hand-weighted linear score over workspace deficits and the half-life mechanism for unfinished work can reliably prevent hallucination accumulation and desynchronization in long-running projects without learned policies or extensive human intervention.
What would settle it
A run of the system on a research software project in which claims in the paper or README repeatedly exceed the support from the code and benchmarks, or where code, theory, and documentation become inconsistent over the course of hundreds of commits.
Figures
original abstract
Large language models can now generate substantial code and draft research text, but research-software projects require more than either artifact alone. The mathematical thesis, executable system, benchmark surface, and public claims must mature together, yet often drift apart. We identify two LM-specific failure modes: hallucination accumulation, in which claims exceed what code or theory supports and unsupported assertions propagate across sessions; and desynchronization, in which code, theory, or the model's own world model fall out of alignment. We propose Comet-H, an iterative prompt automaton that orchestrates ideation, implementation, evaluation, grounding, and paper-writing as coupled coordinates of a single workspace state. At each step, a controller selects the next prompt by scoring it against what the workspace currently lacks, carries unfinished follow-up work forward with a half-life, and re-checks the paper and README against the code and benchmarks whenever documentation changes. We frame prompt selection as a small contextual bandit problem over prompt families, with prompts as arms, workspace deficits as context, and a hand-weighted linear score. This transparent scorer, paired with a fading record of unfinished work, bounds long-horizon follow-ups, requires no learned policy, and makes each prompt choice legible from the workspace. We created a portfolio of 46 research-software repositories across two dozen domains. We study A3 in depth, a Python static-analysis tool built entirely within the loop, which reaches (F1 = 0.768) on a 90-case benchmark, compared with a next-best baseline of 0.364. Across approximately 400 commits, we find that audit-and-contraction passes dominate the later phases of every successful trajectory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Comet-H, an iterative prompt automaton for orchestrating large language models in the development of research software projects where the specification evolves over time. It identifies hallucination accumulation and desynchronization as key LM-specific failure modes and addresses them by maintaining a single workspace state that couples ideation, implementation, evaluation, grounding, and paper-writing. Prompt selection is framed as a contextual bandit problem using a hand-weighted linear score based on workspace deficits, with a half-life mechanism for unfinished work. The approach is evaluated through a portfolio of 46 research-software repositories, with a detailed case study on A3, a Python static-analysis tool developed entirely within the loop, achieving an F1 score of 0.768 on a 90-case benchmark compared to a baseline of 0.364. The study observes that audit-and-contraction passes dominate later phases of successful trajectories across approximately 400 commits.
Significance. If the central claims hold, this work would be significant for software engineering and AI-assisted research by providing a transparent, non-learned-policy method for long-horizon orchestration of LMs in complex, evolving projects. The emphasis on coupled evolution of code, benchmarks, and documentation, along with empirical results from a constructed system like A3, offers a practical demonstration of managing common LM pitfalls. The portfolio approach adds breadth, and the transparent scorer design is a strength that allows for legibility of decisions. However, the reliance on hand-weighted parameters without reported robustness checks limits the strength of the conclusions regarding reliability without human intervention.
major comments (2)
- [Comet-H Controller] The claim that the hand-weighted linear score over workspace deficits, combined with the half-life mechanism, reliably bounds hallucination accumulation and desynchronization across long trajectories without learned policies or extensive intervention is load-bearing for the contribution. However, no sensitivity analysis, ablation studies, or perturbation tests on the weights or half-life parameter are reported (see the controller description). Given that these are free parameters and results derive from a single detailed trajectory (~400 commits on A3), the weights could have been selected to fit the observed outcomes, which would undermine the assertions of transparency and reliability without learned policies. A concrete test such as re-running trajectories with perturbed weights and reporting variance in success metrics is needed to support the claim.
- [Evaluation and Case Study] The key empirical result (F1 = 0.768 on the 90-case benchmark for A3 vs. next-best baseline of 0.364) lacks details on experimental controls, the exact implementation of the baseline, the composition of the benchmark cases, statistical analysis (e.g., confidence intervals or significance tests), or controls for prompt variations. This information is load-bearing for assessing whether the performance improvement is attributable to the Comet-H orchestration (see the A3 case study and evaluation sections).
minor comments (2)
- [Abstract] The abstract phrasing 'reaches (F1 = 0.768)' uses unnecessary parentheses; consider rephrasing to 'reaches an F1 score of 0.768' for improved readability.
- [Portfolio Construction] The portfolio of 46 repositories across two dozen domains is mentioned, but additional details on selection criteria, diversity metrics, or aggregate success rates across the full portfolio (beyond the A3 focus) would strengthen the presentation of the general observation that audit-and-contraction passes dominate successful trajectories.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments correctly identify areas where additional analysis and documentation would strengthen the claims regarding the Comet-H controller's reliability and the attribution of empirical results. We respond point-by-point below and will incorporate revisions to address the concerns while preserving the manuscript's core contributions on transparent, non-learned orchestration for evolving research software.
point-by-point responses
Referee: [Comet-H Controller] The claim that the hand-weighted linear score over workspace deficits, combined with the half-life mechanism, reliably bounds hallucination accumulation and desynchronization across long trajectories without learned policies or extensive intervention is load-bearing for the contribution. However, no sensitivity analysis, ablation studies, or perturbation tests on the weights or half-life parameter are reported (see the controller description). Given that these are free parameters and results derive from a single detailed trajectory (~400 commits on A3), the weights could have been selected to fit the observed outcomes, which would undermine the assertions of transparency and reliability without learned policies. A concrete test such as re-running trajectories with perturbed weights and reporting variance in success metrics is needed to support the claim.
Authors: We agree that the lack of sensitivity analysis on the hand-weighted parameters represents a genuine limitation in the current version, as the weights were iteratively refined during development of the A3 trajectory and the broader portfolio. The linear scorer was designed for legibility rather than optimality, with weights reflecting observed deficit priorities (e.g., higher weight on code-benchmark alignment than on documentation in early phases). To strengthen this, we will revise the controller section to explicitly document the weight selection rationale and add a limited sensitivity analysis: we will perturb weights by ±20% and re-evaluate on a representative subset of shorter trajectories from the 46-repository portfolio (approximately 10-20 commits each), reporting variance in success rate, final F1, and trajectory length. The half-life parameter will be specified with justification tied to typical research task durations. While full re-execution of the complete ~400-commit A3 trajectory under multiple perturbations is not feasible due to resource constraints, the proposed analysis will demonstrate robustness and support the transparency claim over black-box learned policies. revision: partial
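A ±20% perturbation study of the kind promised here might be organized as follows. `run_trajectory` and the weight names are hypothetical stand-ins for the authors' actual pipeline.

```python
import itertools

BASE_WEIGHTS = {"claim_support_gap": 1.2, "doc_drift": 0.9}  # illustrative

def perturbation_grid(weights: dict[str, float], delta: float = 0.2):
    """Yield one weight vector per corner of the +/-delta hypercube
    around the hand-chosen weights."""
    names = sorted(weights)
    for signs in itertools.product((-1, 1), repeat=len(names)):
        yield {n: weights[n] * (1 + s * delta) for n, s in zip(names, signs)}

# With a hypothetical run_trajectory(weights) -> final F1, the reported
# variance would be:
#   f1s = [run_trajectory(w) for w in perturbation_grid(BASE_WEIGHTS)]
#   print(statistics.mean(f1s), statistics.pstdev(f1s))
```

Note the grid grows as 2**k in the number of weights, which is one reason restricting the analysis to short trajectories, as the authors propose, is a reasonable compromise.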
Referee: [Evaluation and Case Study] The key empirical result (F1 = 0.768 on the 90-case benchmark for A3 vs. next-best baseline of 0.364) lacks details on experimental controls, the exact implementation of the baseline, the composition of the benchmark cases, statistical analysis (e.g., confidence intervals or significance tests), or controls for prompt variations. This information is load-bearing for assessing whether the performance improvement is attributable to the Comet-H orchestration (see the A3 case study and evaluation sections).
Authors: We concur that expanded experimental details are required for reproducibility and to isolate the contribution of the orchestration mechanism. In the revised evaluation section, we will add: (1) a precise description of the baseline as direct, non-orchestrated LLM prompting using the same model and core prompt templates but without deficit scoring, half-life tracking, or coupled workspace state; (2) the benchmark composition, consisting of 90 manually verified cases spanning type errors, security vulnerabilities, linting issues, and edge-case false positives/negatives drawn from standard Python static analysis scenarios; (3) statistical support including 95% bootstrap confidence intervals on F1 scores and a significance test (e.g., McNemar's test) comparing Comet-H to the baseline; and (4) prompt variation controls, with results from minor rephrasings within the same prompt families to confirm that gains derive from the controller's deficit-driven selection and audit passes rather than wording. These additions will clarify attribution to the coupled evolution process across the ~400 commits. revision: yes
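The promised statistical additions could take roughly this shape. The per-case count triples are invented placeholders, not the real 90-case results.

```python
import random

def f1(cases):
    """cases: list of (tp, fp, fn) counts, one triple per benchmark case."""
    tp = sum(c[0] for c in cases)
    fp = sum(c[1] for c in cases)
    fn = sum(c[2] for c in cases)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bootstrap_ci(cases, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval on F1, resampling cases."""
    rng = random.Random(seed)
    stats = sorted(f1(rng.choices(cases, k=len(cases))) for _ in range(n_boot))
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2)) - 1]

def mcnemar_chi2(b, c):
    """Continuity-corrected McNemar statistic for paired per-case outcomes:
    b = cases only the baseline got right, c = cases only Comet-H got right.
    Values above 3.84 indicate p < 0.05 at one degree of freedom."""
    return (abs(b - c) - 1) ** 2 / (b + c) if b + c else 0.0
```

Resampling whole cases (rather than individual predictions) respects the benchmark's unit of evaluation, and McNemar's test is the standard paired test when both systems are scored on the same 90 cases.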
- Declined: full re-execution of the complete ~400-commit A3 trajectory under multiple perturbed weight sets, owing to prohibitive computational and time costs.
Circularity Check
No significant circularity: empirical system proposal with transparent design choices
full rationale
The paper proposes Comet-H as an engineering system (iterative prompt automaton with hand-weighted linear scorer over workspace deficits plus half-life for unfinished work) and reports observed empirical outcomes on a constructed portfolio of 46 repositories, including F1=0.768 on the A3 benchmark after ~400 commits. No mathematical derivation chain, first-principles prediction, or uniqueness theorem is claimed. The hand-weighted score is explicitly presented as a transparent, non-learned design choice rather than a fitted parameter renamed as a prediction. No self-citations, ansatzes smuggled via prior work, or self-definitional reductions appear in the abstract or described claims. The central results (audit-and-contraction dominance, hallucination bounding) are measured outcomes of running the constructed system, not quantities that reduce to the inputs by construction. This is a standard empirical systems paper whose claims stand or fall on the reported runs and are not circular.
Axiom & Free-Parameter Ledger
free parameters (2)
- weights in linear score
- half-life parameter
axioms (1)
- domain assumption: LLMs can effectively perform tasks such as ideation, implementation, evaluation, grounding, and paper-writing when prompted appropriately in an iterative loop
invented entities (1)
- Comet-H (no independent evidence)