SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?

Bingheng Yan; Chao Peng; Dong Zhang; Fan Wu; Wei Zhou; Xuanhe Zhou; Yingjie Xiong; Yukai Wu; Zihang Zhou; Ziqian Ren

arXiv: 2605.26186 · v2 · pith:APDQK4F3new · submitted 2026-05-25 · 💻 cs.SE · cs.AI · cs.CL · cs.LG

Zihang Zhou, Ziqian Ren, Yukai Wu, Yingjie Xiong, Wei Zhou, Chao Peng, Dong Zhang, Bingheng Yan, Xuanhe Zhou, Fan Wu

Pith reviewed 2026-06-29 20:42 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.LG

keywords LLM agents · repository setup · experiential learning · Docker snapshots · verification protocol · setup failures · multi-repository coordination

The pith

SetupX lets LLM agents transfer setup fixes across repositories using experience units, safe rollbacks, and separated verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SetupX to solve the problem of getting code repositories into a state where their documented features run correctly. Existing LLM agents fail at moving experience from one repository to another, recovering from irreversible setup steps, and telling setup problems apart from actual code bugs. SetupX builds transferable dual-modality experience records, adds a stack of Docker snapshots for reversible trial fixes, and splits evidence collection from final judgment in verification. If these components work as described, agents should reach 92 percent success and handle coordinated multi-container setups that defeat prior methods.

Core claim

SetupX is an experiential learning framework whose three components are the Self-Evolving Experience Representation (XPU) that stores and transfers setup signals, guidance, and actions; Experience-Augmented Speculative Execution that uses a LIFO Docker snapshot stack for safe trial-and-repair; and the Prosecutor-Judge Verification Protocol that separates evidence gathering from judgment. On carefully-crafted benchmarks the system records a 92 percent pass rate and exceeds the strongest baseline by more than 19 percent, with the largest gains appearing in multi-repository tasks that require coordinating interconnected services across containers.

What carries the argument

Self-Evolving Experience Representation (XPU), a dual-modality knowledge unit that encodes setup signals, textual guidance, and executable actions to transfer verified environment fixes to unseen repositories.

If this is right

Verified fixes from one repository can be applied directly to new repositories without starting from scratch.
Non-invertible state changes during setup can be explored safely because any trial can be rolled back to a prior known-good snapshot.
Setup outcomes can be judged more reliably because evidence collection is kept separate from the final decision.
Complex setups that coordinate multiple interconnected services across containers become feasible at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same transfer-and-rollback pattern could be reused for other agent tasks that involve persistent external state, such as database migration or cloud resource configuration.
A growing collection of XPU units might eventually serve as shared community knowledge rather than remaining private to each agent run.
Pairing SetupX with code-generation agents would create a pipeline that both writes and deploys working software without manual environment work.

Load-bearing premise

The carefully-crafted benchmarks represent the distribution of real-world repository setup failures and the Prosecutor-Judge protocol produces judgments that generalize beyond the test cases.

What would settle it

Run SetupX on a fresh collection of repositories drawn independently from the original benchmarks and measure whether the pass rate remains above 80 percent across multi-container and single-container tasks.

Figures

Figures reproduced from arXiv: 2605.26186 by Bingheng Yan, Chao Peng, Dong Zhang, Fan Wu, Wei Zhou, Xuanhe Zhou, Yingjie Xiong, Yukai Wu, Zihang Zhou, Ziqian Ren.

**Figure 2.** Overview of SETUPX. We perform the environment setup in a ReAct loop with XPU-guided speculative execution and in-loop verification, where the Prosecutor-Judge verification independently audits the result and transferable lessons are ingested into the XPU Database.

2 Problem Definition. Prior works [7, 8] primarily verify the configured environment by executing the project’s test suite, leaving documented user-…

**Figure 3.** Delayed audit feedback loop. When a new retrieval is triggered, the Retriever Agent first …

**Figure 4.** Statistics of the Setup-Specific Benchmark.

**Figure 5.** Main results on the single-repository track. (a) Setup pass rates by repository difficulty; (b) …

**Figure 6.** (a) juxtaposes the self-claim signals of three specialized setup tools (Dockerfile build success, install command exit 0, and runner status code) with the unified prosecutor–judge verdict. The results show a systematic 23–52 pp gap between traditional tools’ self-claims and our verdict. The root cause is that self-claim only checks that the pipeline did not crash: the Dockerfile built, the install command …

Original abstract

Functionality-correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features. It presents significant challenges due to diverse, repository-specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification-strategy mismatches. Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross-repository experience transfer, (2) multi-step trial-and-repair under non-invertible state changes, and (3) robust verification of setup outcomes to distinguish setup-induced failures from repository bugs. To address this, we introduce SetupX, an experiential learning-based setup framework. First, we construct a Self-Evolving Experience Representation (XPU), a dual-modality knowledge unit encoding setup signals, textual guidance, executable actions to dynamically transfer verified environment fixes to unseen repositories. Second, we employ Experience-Augmented Speculative Execution backed by a LIFO Docker snapshot stack, enabling the agent to proactively trial fixes and safely roll back to known-good states. Third, we introduce a Prosecutor-Judge Verification Protocol that separates evidence collection from final judgment, enabling more reliable setup verification beyond superficial build-time metrics. Evaluation results on carefully-crafted benchmarks show SetupX achieves highest performance (e.g., 92% pass rate) and outperforms the strongest baseline by over 19%. Crucially, SetupX excels in complex multi-repository setup requiring coordinating multiple interconnected services across different containers. The code repository is available at https://github.com/OpenDataBox/SetupX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SetupX assembles XPU experience units, LIFO Docker snapshots, and a Prosecutor-Judge split into a pipeline for LLM repo setup, but the 92% pass rate and 19% gain rest on benchmarks whose selection and scale are not described.

Colleague,

The core of this paper is a framework that tries to let LLM agents reuse fixes across repositories by storing dual-modality experience units, rolling back safely with a stack of Docker snapshots, and separating evidence gathering from final setup judgment. Those three pieces—XPU representation, LIFO snapshot stack, and Prosecutor-Judge protocol—are the concrete engineering choices that are not directly copied from the cited prior work.

The approach targets a practical bottleneck: getting repositories to a state where their documented features actually run, which involves dependency conflicts, non-reversible changes, and distinguishing setup errors from code bugs. The multi-repository case, where services must coordinate across containers, is called out as a strength, and linking the code is useful.

The soft spot is the evaluation. The abstract reports a 92% pass rate and a 19% lift over the strongest baseline on “carefully-crafted benchmarks,” with better results on complex multi-repo setups. No numbers appear on how many repositories were used, how failure modes were sampled, whether prompt sensitivity was controlled, or what statistical tests were run. Without those details the improvement cannot be assessed, and the phrase “carefully-crafted” leaves open the possibility that the test distribution favors the proposed components. The free parameters around experience retrieval are also left unspecified.

This is for groups working on agent tooling for software engineering who need concrete patterns for experience transfer and verification. A reader looking for design ideas rather than settled performance numbers could extract value from the pipeline description.

I would send it for peer review. The problem is real, the components are described at a level that can be implemented and checked, and referees can ask for the missing benchmark protocol and raw counts. The work shows straightforward engagement with the failure modes it names.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SetupX, an experiential learning framework for LLM agents performing functionality-correct repository setup. It defines three components: a Self-Evolving Experience Representation (XPU) that encodes setup signals, guidance, and actions for cross-repository transfer; Experience-Augmented Speculative Execution using a LIFO Docker snapshot stack for safe trial-and-repair under non-invertible changes; and a Prosecutor-Judge Verification Protocol that separates evidence collection from judgment. The central claim is that SetupX attains a 92% pass rate on carefully-crafted benchmarks, outperforming the strongest baseline by more than 19%, with particular advantage on complex multi-repository setups that coordinate services across containers. The code is released at https://github.com/OpenDataBox/SetupX.

Significance. If the performance numbers are supported by transparent benchmark construction, statistical controls, and representative failure distributions, the work would offer a concrete advance in LLM-agent reliability for software-engineering tasks by showing how retrieved experience and reversible execution can mitigate common setup failures. The public code release is a clear strength that supports reproducibility and extension. The current absence of those experimental details, however, prevents any assessment of whether the result generalizes beyond the reported test cases.

major comments (2)

[Abstract] Abstract (evaluation paragraph): the headline claim of a 92% pass rate and >19% improvement supplies no information on the number of repositories, the selection or generation process for the 'carefully-crafted benchmarks,' the distribution of failure modes (dependency incompatibilities, non-invertible state changes, verification mismatches), statistical significance, or controls for prompt sensitivity; these omissions are load-bearing for interpreting whether the result supports the stated advantages over baselines.
[Prosecutor-Judge Verification Protocol section] Section describing the Prosecutor-Judge Verification Protocol: the protocol is presented only at a high level with no accompanying ablation, false-positive analysis, or comparison against standard build-time metrics that would demonstrate its ability to distinguish setup-induced failures from repository bugs; without such evidence the reliability of the 92% figure cannot be assessed.

minor comments (2)

[Abstract] The term 'carefully-crafted benchmarks' is used without a definition or concrete examples of how multi-repository cases were constructed.
[Abstract] The abstract states that SetupX 'excels in complex multi-repository setup' but does not quantify this advantage with separate metrics or a dedicated table row.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in our evaluation reporting and verification protocol. We address each major comment below and commit to revisions that improve clarity without altering the core contributions.

Referee: [Abstract] Abstract (evaluation paragraph): the headline claim of a 92% pass rate and >19% improvement supplies no information on the number of repositories, the selection or generation process for the 'carefully-crafted benchmarks,' the distribution of failure modes (dependency incompatibilities, non-invertible state changes, verification mismatches), statistical significance, or controls for prompt sensitivity; these omissions are load-bearing for interpreting whether the result supports the stated advantages over baselines.

Authors: We agree that the abstract evaluation paragraph would benefit from additional context to help readers assess the headline claims. The manuscript body (Section 4) already details the benchmark construction, repository count, failure mode distributions, and controls for variance including multiple runs. In the revision we will expand the abstract's evaluation paragraph to concisely report the number of repositories, summarize the benchmark selection/generation process, note the primary failure mode distribution, and indicate that results incorporate statistical controls and averaging over runs to address prompt sensitivity.

Revision: yes

Referee: [Prosecutor-Judge Verification Protocol section] Section describing the Prosecutor-Judge Verification Protocol: the protocol is presented only at a high level with no accompanying ablation, false-positive analysis, or comparison against standard build-time metrics that would demonstrate its ability to distinguish setup-induced failures from repository bugs; without such evidence the reliability of the 92% figure cannot be assessed.

Authors: We acknowledge that the current presentation of the Prosecutor-Judge Verification Protocol remains largely descriptive. While the protocol's design rationale is explained in the section, we agree that supporting empirical evidence would strengthen the claim. In the revised manuscript we will add an ablation comparing the full protocol against standard build-time metrics, report false-positive rates observed in our experiments, and include concrete examples illustrating how the separation of evidence collection and judgment helps distinguish setup failures from repository bugs.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements on held-out benchmarks

The paper introduces SetupX components (XPU representation, LIFO snapshots, Prosecutor-Judge protocol) and reports measured pass rates (92%) on external benchmarks. No equations, fitted parameters, or self-citations are present in the provided text that reduce any claimed result to its inputs by construction. Performance figures are direct empirical observations rather than predictions derived from the framework itself. This matches the default case of a self-contained empirical systems paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The framework rests on the assumption that past setup successes can be encoded in a transferable dual-modality format and that Docker snapshots provide safe rollback; no free parameters are explicitly named in the abstract, but the experience retrieval mechanism and verification thresholds are likely tuned on development data.

free parameters (1)

experience retrieval threshold or similarity metric
Likely tuned to decide which past XPU units are applied to a new repository.

axioms (1)

domain assumption: Docker container snapshots can be created and restored without side effects on the host system
Invoked by the LIFO snapshot stack mechanism described in the abstract.

invented entities (2)

XPU (Self-Evolving Experience Representation) · no independent evidence
purpose: Dual-modality unit that encodes setup signals, textual guidance, and executable actions for cross-repository transfer
New data structure introduced by the paper; no independent evidence outside the framework itself.
Prosecutor-Judge Verification Protocol · no independent evidence
purpose: Separates evidence collection from final judgment to improve setup verification reliability
New procedural split introduced by the paper; no independent evidence outside the framework itself.

pith-pipeline@v0.9.1-grok · 5845 in / 1596 out tokens · 22784 ms · 2026-06-29T20:42:45.587238+00:00 · methodology

Reference graph

Works this paper leans on

40 extracted references · 14 canonical work pages · 4 internal anchors

[1] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber, “MetaGPT: Meta programming for a multi-agent collaborative framework,” 2024. [Online]. Available: https://arxiv.org/abs/2308.00352

[2] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “Swe-agent: Agent-computer interfaces enable automated software engineering,” in Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, A. Globersons,...

[3] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “Swe-bench: Can language models resolve real-world github issues?” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=VTF8yNQM66

[4] J. Li, G. Li, Y. Zhao, Y. Li, H. Liu, H. Zhu, L. Wang, K. Liu, Z. Fang, L. Wang, J. Ding, X. Zhang, Y. Zhu, Y. Dong, Z. Jin, B. Li, F. Huang, and Y. Li, “Deveval: A manually-annotated code generation benchmark aligned with real-world code repositories,” 2024. [Online]. Available: https://arxiv.org/abs/2405.19856

[5] Anthropic, “Claude code,” https://claude.com/product/claude-code, 2025, accessed: 2025-05-22

[6] X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, et al., “Openhands: An open platform for AI software developers as generalist agents,” in The Thirteenth International Conference on Learning Representations, ICLR 20...

[7] R. Hu, C. Peng, Xinchen Wang, J. Xu, and C. Gao, “Repo2run: Automated building executable environment for code repository at scale,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=fZsd3KLMje

[8] I. Bouzenia and M. Pradel, “You name it, I run it: An LLM agent to execute tests of arbitrary projects,” Proc. ACM Softw. Eng., vol. 2, no. ISSTA, pp. 1054–1076, 2025. [Online]. Available: https://doi.org/10.1145/3728922

[9] A. Eliseeva, A. Kovrigin, I. Kholkin, E. Bogomolov, and Y. Zharov, “Envbench: A benchmark for automated environment setup,” CoRR, vol. abs/2503.14443, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2503.14443

[10] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: language agents with verbal reinforcement learning,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Ha...

[11] A. Zhao, D. Huang, Q. Xu, M. Lin, Y. Liu, and G. Huang, “Expel: LLM agents are experiential learners,” in Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February... doi:10.1609/aaai.v38i17.29936

[12] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An open-ended embodied agent with large language models,” Trans. Mach. Learn. Res., vol. 2024, 2024. [Online]. Available: https://openreview.net/forum?id=ehfRiF0R3a

[13] S. Chen, J. Gai, R. Zhou, J. Zhang, T. Zhu, J. Li, K. Wang, Z. Wang, Z. Chen, K. Kaleb, N. Miao, S. Gao, C. Lu, M. Li, J. He, and Y. W. Teh, “Skillcraft: Can LLM agents learn to use tools skillfully?” CoRR, vol. abs/2603.00718, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2603.00718

[14] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” in The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. [Online]. Available: https://openreview.net/forum?id=WE_vluYUL-X

[16] E. Bommarito and M. J. B. II, “An empirical analysis of the python package index (pypi),” CoRR, vol. abs/1907.11073, 2019. [Online]. Available: http://arxiv.org/abs/1907.11073

[17] Y. Wang, M. Wen, Y. Liu, Y. Wang, Z. Li, C. Wang, H. Yu, S. Cheung, C. Xu, and Z. Zhu, “Watchman: monitoring dependency conflicts for python library ecosystem,” in ICSE ’20: 42nd International Conference on Software Engineering, Seoul, South Korea, 27 June - 19 July, 2020, G. Rothermel and D. Bae, Eds. ACM, 2020, pp. 125–135. [Online]. Available: https:... doi:10.1145/3377811.3380426

[18] Y. Zhang, H. Ruan, Z. Fan, and A. Roychoudhury, “Autocoderover: Autonomous program improvement,” in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, Vienna, Austria, September 16-20, 2024, M. Christakis and M. Pradel, Eds. ACM, 2024, pp. 1592–1604. [Online]. Available: https://doi.org/10.1145/365021... doi:10.1145/3650212.3680384

[19] B. Bogin, K. Yang, S. Gupta, K. Richardson, E. Bransom, P. Clark, A. Sabharwal, and T. Khot, “SUPER: evaluating agents on setting up and executing tasks from research repositories,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, Y. Al-Onaizan, M. Bansal, and Y. ... doi:10.18653/v1/2024.emnlp-main.702

[20] X. Tang, T. Qin, T. Peng, Z. Zhou, D. Shao, T. Du, X. Wei, P. Xia, F. Wu, H. Zhu, G. Zhang, J. Liu, X. Wang, S. Hong, C. Wu, H. Cheng, C. Wang, and W. Zhou, “Agent KB: leveraging cross-domain experience for agentic problem solving,” CoRR, vol. abs/2507.06229, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2507.06229

[21] R. Wu, X. Wang, J. Mei, P. Cai, D. Fu, C. Yang, L. Wen, X. Yang, Y. Shen, Y. Wang, and B. Shi, “EvolveR: Self-evolving LLM agents through an experience-driven lifecycle,” CoRR, vol. abs/2510.16079, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2510.16079

[22] R. Fang, Y. Liang, X. Wang, J. Wu, S. Qiao, P. Xie, F. Huang, H. Chen, and N. Zhang, “Memp: Exploring agent procedural memory,” CoRR, vol. abs/2508.06433, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.06433

[23] S. Kumar, A. A. Nargund, and V. Sridhar, “CourtEval: A courtroom-based multi-agent evaluation framework,” in Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 25875–25887. [Online]. Available: https://acl...

[24] R. You, H. Cai, C. Zhang, Q. Xu, M. Liu, T. Yu, Y. Li, and W. Li, “Agent-as-a-judge,” 2026. [Online]. Available: https://arxiv.org/abs/2601.05111

A XPU and Retrieval Implementation

A.1 Example XPU

Example XPU Entry

{ "id": "xpu_poetry_lock_conflict", "signals": { "keywords": ["poetry.lock", "pyproject.toml", "dependency conflict"], "regex": ["Because ...
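
The example XPU entry above lists `signals` made of keywords and regular expressions. As a rough illustration of how such signals might be checked against setup log output, here is a minimal Python sketch. The `Xpu` class, the `retrieve` function, and the second entry are hypothetical names introduced only for this example; the paper's retrieval is performed by a Retriever Agent and is not reproduced here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Xpu:
    """Hypothetical in-memory view of one XPU entry (assumed field set)."""
    id: str
    keywords: list[str] = field(default_factory=list)  # signals.keywords
    regex: list[str] = field(default_factory=list)     # signals.regex
    advice_nl: str = ""                                 # textual guidance

    def matches(self, log_text: str) -> bool:
        """True if any keyword (case-insensitive) or regex occurs in the log."""
        lowered = log_text.lower()
        if any(k.lower() in lowered for k in self.keywords):
            return True
        return any(re.search(p, log_text) for p in self.regex)


def retrieve(xpus: list[Xpu], log_text: str) -> list[str]:
    """Return the ids of all XPUs whose signals fire, preserving input order."""
    return [x.id for x in xpus if x.matches(log_text)]


# Keywords copied from the example entry; its regex list is truncated in the
# source, so it is left empty here rather than guessed.
poetry_xpu = Xpu(
    id="xpu_poetry_lock_conflict",
    keywords=["poetry.lock", "pyproject.toml", "dependency conflict"],
)
# Entirely hypothetical second entry, included only to show a non-match.
other_xpu = Xpu(id="xpu_hypothetical_node", regex=[r"npm ERR!"])

matched = retrieve([poetry_xpu, other_xpu], "Error: poetry.lock is out of date")
```

A plain signal match like this would at most serve as a cheap prefilter; ranking the candidates is a separate step.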
[25] Anchor. At retrieval time, record the current trajectory length and the identifiers of the recommended XPU entries, establishing a temporal reference point.

[26] Extract. On the next retrieval call, extract up to five subsequent steps from the main agent’s trajectory after the anchor point.

[27] Judge. An LLM judge determines whether each recommended XPU contributed to resolving the observed problem. The verdict is success if the advice was adopted and the problem was resolved or improved; failure if the advice was adopted but the problem persisted or worsened; and neutral if adoption or causal contribution cannot be determined.

[28] Update. The telemetry counters of each recommended XPU are updated atomically according to the verdict:

$$
(\text{successes},\ \text{failures}) \mathrel{+}= \begin{cases} (1, 0), & \text{if } v = \text{success},\\ (0, 1), & \text{if } v = \text{failure},\\ (0, 0), & \text{if } v = \text{neutral}. \end{cases} \tag{4}
$$

B Agent Prompt Excerpts

This appendix provides selected excerpts from the role-specific prompts used by SETUPX. We include only the parts that define each agent’s ro...
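
The Anchor, Extract, Judge, and Update steps of the delayed audit can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: `Telemetry`, `extract_window`, and `DELTAS` are hypothetical names, the lock merely stands in for the "atomic" update, and the LLM judge is replaced by hand-supplied verdicts.

```python
import threading
from dataclasses import dataclass, field

# Increments from Eq. (4): verdict -> (delta successes, delta failures).
DELTAS = {"success": (1, 0), "failure": (0, 1), "neutral": (0, 0)}


@dataclass
class Telemetry:
    """Hypothetical per-XPU counters."""
    successes: int = 0
    failures: int = 0
    _lock: threading.Lock = field(
        default_factory=threading.Lock, repr=False, compare=False
    )

    def update(self, verdict: str) -> None:
        """Apply one verdict; the lock approximates the atomic update."""
        d_succ, d_fail = DELTAS[verdict]
        with self._lock:
            self.successes += d_succ
            self.failures += d_fail


def extract_window(trajectory: list, anchor_len: int, max_steps: int = 5) -> list:
    """Extract step: up to `max_steps` steps recorded after the anchor."""
    return trajectory[anchor_len:anchor_len + max_steps]


# Anchor: the trajectory had 3 steps when the XPUs were recommended.
trajectory = [f"step{i}" for i in range(10)]
window = extract_window(trajectory, anchor_len=3)

# Judge is an LLM call in the paper; here the verdicts are supplied by hand.
t = Telemetry()
for verdict in ["success", "neutral", "failure", "success"]:
    t.update(verdict)
```

Note that a neutral verdict leaves both counters unchanged, so uncertain attributions do not count against an XPU.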
[29] Exact match first: XPUs whose advice_nl directly addresses the current problem rank highest.

[30] Telemetry as reference: pay attention to each XPU's historical hit / success / failure counts, but do not discard one merely because it has many failures; judge whether the previous failure scenarios are similar to the current one. If they are not, the XPU may still be effective.

[31] Drop the irrelevant: if an XPU's advice is completely unrelated to the current problem, do not pick it.

[32] Pick at most {k}.

## Selection rules
1. Exact match first: XPUs whose advice_nl directly solves the current problem rank first.
2. Telemetry as reference: pay attention to each XPU's historical hits / successes / failures, but do not exclude an XPU just because it has many failures.
3. Exclude irrelevant items outright: if an XPU's advice has nothing ...
[33] Structure reconnaissance: `ls` the project root, locate pyproject.toml / setup.cfg / pytest.ini / tox.ini, etc.

[34] Locate the test suite: confirm the test directory and framework (pytest / unittest / tox, ...).

[35] Run the tests in the project's native way and collect results.

[36] Analyze failure causes and make a judgment.

[37] If the project has no tests at all, write a smoke test under /tmp/ to verify basic environment usability.

## Hard constraints (violation invalidates the verdict)
- Install no packages.
- Modify no environment configuration.
- Modify no file under /workspace/repo.
- write_file may only write into /tmp/.

Prosecutor and Judge Prompt Excerpts

You are the pros...
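
The hard constraints in the verifier prompt above restrict `write_file` to `/tmp/`. The prompt states the rule in natural language only; the following Python sketch shows one way a tool wrapper could enforce it. The function name `is_write_allowed` is hypothetical, and the check is purely lexical (it does not resolve symlinks), which is an assumption of this example rather than something the source describes.

```python
import posixpath

ALLOWED_WRITE_ROOT = "/tmp"  # from the prompt: write_file may only write into /tmp/


def is_write_allowed(path: str) -> bool:
    """Return True only for absolute paths that normalize to a location under /tmp/."""
    if not posixpath.isabs(path):
        return False
    # normpath collapses "..", "." and repeated slashes without touching the filesystem.
    normalized = posixpath.normpath(path)
    return normalized.startswith(ALLOWED_WRITE_ROOT + "/")


smoke_ok = is_write_allowed("/tmp/smoke_test.py")
repo_ok = is_write_allowed("/workspace/repo/setup.py")
escape_ok = is_write_allowed("/tmp/../workspace/repo/setup.py")
```

Normalizing before the prefix check is what rejects paths that start with `/tmp/` but climb out of it with `..`.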
[38] Checkpoint: Snapshot the current container via docker commit, pushing the image onto the LIFO stack S.

[39] Adapt: The agent reads the XPU’s advice_nl, leverages its full conversation context (recent history, observed versions, repository structure), and generates concrete commands tailored to the current repository. If the agent produces no command, the system falls back to rendering the XPU’s atoms via a type-aware Atom Rendering Engine that maps 12 predefined atom...
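
The Checkpoint step above pushes a `docker commit` image onto a LIFO stack so that a failed trial can be rolled back to the most recent known-good state. A minimal Python sketch of that idea follows. The class `SnapshotStack`, the `setup-snap:<n>` tag scheme, and the restore procedure (remove the container, then start a fresh one from the popped image with `sleep infinity`) are assumptions made for illustration; the visible text does not describe how SetupX restores a snapshot. The Docker runner is injectable so the logic can be exercised without a Docker daemon.

```python
import subprocess
from typing import Callable

Runner = Callable[[list[str]], None]


def run_docker(args: list[str]) -> None:
    """Default runner: invoke the docker CLI and raise on a non-zero exit."""
    subprocess.run(["docker", *args], check=True)


class SnapshotStack:
    """Hypothetical LIFO stack of container snapshots."""

    def __init__(self, container: str, runner: Runner = run_docker) -> None:
        self.container = container
        self.runner = runner
        self._images: list[str] = []

    def checkpoint(self) -> str:
        """Commit the container to an image and push its tag onto the stack."""
        tag = f"setup-snap:{len(self._images)}"  # assumed tag scheme
        self.runner(["commit", self.container, tag])
        self._images.append(tag)
        return tag

    def rollback(self) -> str:
        """Pop the newest snapshot and recreate the container from it."""
        tag = self._images.pop()  # LIFO: most recent snapshot first
        self.runner(["rm", "-f", self.container])
        self.runner(["run", "-d", "--name", self.container, tag, "sleep", "infinity"])
        return tag


# Dry run with a recording runner instead of a real Docker daemon.
calls: list[list[str]] = []
stack = SnapshotStack("setup-env", runner=calls.append)
first = stack.checkpoint()
second = stack.checkpoint()
restored = stack.rollback()
```

Because snapshots are popped in reverse order of creation, a sequence of speculative fixes can be unwound one step at a time.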
[40] host repository is cloned but core dependencies are not installed; import fails immediately,

… and Redis siblings providing persistence and cache. The latest source’s pyproject.toml requires Python >=3.14,<3.15 (the source already uses PEP 695 type aliases such as type ConfType = ...); the container’s default base image ships Python 3.11.

SetupX+XPU trajectory (53 steps). The agent inspects the source and identifies the Python version constraint, th...

[1] [1]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

S. Hong, M. Zhuge, J. Chen, X. Zheng, Y . Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber, “Metagpt: Meta programming for a multi-agent collaborative framework,” 2024. [Online]. Available: https://arxiv.org/abs/2308.00352

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Swe-agent: Agent-computer interfaces enable automated software engineering,

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “Swe-agent: Agent-computer interfaces enable automated software engineering,” inAdvances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, A. Globersons,...

2024

[3]
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “Swe-bench: Can language models resolve real-world github issues?” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=VTF8yNQM66

[4]
J. Li, G. Li, Y. Zhao, Y. Li, H. Liu, H. Zhu, L. Wang, K. Liu, Z. Fang, L. Wang, J. Ding, X. Zhang, Y. Zhu, Y. Dong, Z. Jin, B. Li, F. Huang, and Y. Li, “Deveval: A manually-annotated code generation benchmark aligned with real-world code repositories,” 2024. [Online]. Available: https://arxiv.org/abs/2405.19856

[5]
Anthropic, “Claude code,” https://claude.com/product/claude-code, 2025, accessed: 2025-05-22.

[6]
X. Wang, B. Li, Y . Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y . Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y . Shao, N. Muennighoff, Y . Zhang, B. Hui, J. Lin, and et al., “Openhands: An open platform for AI software developers as generalist agents,” inThe Thirteenth International Conference on Learning Representations, ICLR 20...

[7]
R. Hu, C. Peng, Xinchen Wang, J. Xu, and C. Gao, “Repo2run: Automated building executable environment for code repository at scale,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=fZsd3KLMje

[8]
I. Bouzenia and M. Pradel, “You name it, I run it: An LLM agent to execute tests of arbitrary projects,” Proc. ACM Softw. Eng., vol. 2, no. ISSTA, pp. 1054–1076, 2025. [Online]. Available: https://doi.org/10.1145/3728922

[9]
A. Eliseeva, A. Kovrigin, I. Kholkin, E. Bogomolov, and Y. Zharov, “Envbench: A benchmark for automated environment setup,” CoRR, vol. abs/2503.14443, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2503.14443

[10]
N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: language agents with verbal reinforcement learning,” inAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Ha...

[11]
A. Zhao, D. Huang, Q. Xu, M. Lin, Y . Liu, and G. Huang, “Expel: LLM agents are experiential learners,” inThirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February...

[12]
G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An open-ended embodied agent with large language models,” Trans. Mach. Learn. Res., vol. 2024, 2024. [Online]. Available: https://openreview.net/forum?id=ehfRiF0R3a

[13]
S. Chen, J. Gai, R. Zhou, J. Zhang, T. Zhu, J. Li, K. Wang, Z. Wang, Z. Chen, K. Kaleb, N. Miao, S. Gao, C. Lu, M. Li, J. He, and Y. W. Teh, “Skillcraft: Can LLM agents learn to use tools skillfully?” CoRR, vol. abs/2603.00718, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2603.00718

[14]
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” in The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. [Online]. Available: https://openreview.net/forum?id=WE_vluYUL-X

[16]
E. Bommarito and M. J. B. II, “An empirical analysis of the python package index (pypi),” CoRR, vol. abs/1907.11073, 2019. [Online]. Available: http://arxiv.org/abs/1907.11073

[17]
Y . Wang, M. Wen, Y . Liu, Y . Wang, Z. Li, C. Wang, H. Yu, S. Cheung, C. Xu, and Z. Zhu, “Watchman: monitoring dependency conflicts for python library ecosystem,” inICSE ’20: 42nd International Conference on Software Engineering, Seoul, South Korea, 27 June - 19 July, 2020, G. Rothermel and D. Bae, Eds. ACM, 2020, pp. 125–135. [Online]. Available: https:...

[18]
Y . Zhang, H. Ruan, Z. Fan, and A. Roychoudhury, “Autocoderover: Autonomous program improvement,” inProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, Vienna, Austria, September 16-20, 2024, M. Christakis and M. Pradel, Eds. ACM, 2024, pp. 1592–1604. [Online]. Available: https://doi.org/10.1145/365021...

[19]
B. Bogin, K. Yang, S. Gupta, K. Richardson, E. Bransom, P. Clark, A. Sabharwal, and T. Khot, “SUPER: evaluating agents on setting up and executing tasks from research repositories,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, Y . Al-Onaizan, M. Bansal, and Y ....

[20]
X. Tang, T. Qin, T. Peng, Z. Zhou, D. Shao, T. Du, X. Wei, P. Xia, F. Wu, H. Zhu, G. Zhang, J. Liu, X. Wang, S. Hong, C. Wu, H. Cheng, C. Wang, and W. Zhou, “Agent KB: leveraging cross-domain experience for agentic problem solving,” CoRR, vol. abs/2507.06229, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2507.06229

[21]
R. Wu, X. Wang, J. Mei, P. Cai, D. Fu, C. Yang, L. Wen, X. Yang, Y. Shen, Y. Wang, and B. Shi, “EvolveR: Self-evolving LLM agents through an experience-driven lifecycle,” CoRR, vol. abs/2510.16079, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2510.16079

[22]
R. Fang, Y. Liang, X. Wang, J. Wu, S. Qiao, P. Xie, F. Huang, H. Chen, and N. Zhang, “Memp: Exploring agent procedural memory,” CoRR, vol. abs/2508.06433, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.06433

[23]
S. Kumar, A. A. Nargund, and V . Sridhar, “CourtEval: A courtroom-based multi-agent evaluation framework,” inFindings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 25 875–25 887. [Online]. Available: https://acl...

[24]
R. You, H. Cai, C. Zhang, Q. Xu, M. Liu, T. Yu, Y . Li, and W. Li, “Agent-as-a-judge,” 2026. [Online]. Available: https://arxiv.org/abs/2601.05111 11 A XPU and Retrieval Implementation A.1 Example XPU Example XPU Entry { "id": "xpu_poetry_lock_conflict", "signals": { "keywords": ["poetry.lock", "pyproject.toml", "dependency conflict"], "regex": ["Because ...

Anchor. At retrieval time, record the current trajectory length and the identifiers of the recommended XPU entries, establishing a temporal reference point.

Extract. On the next retrieval call, extract up to five subsequent steps from the main agent's trajectory after the anchor point.

Judge. An LLM judge determines whether each recommended XPU contributed to resolving the observed problem. The verdict is success if the advice was adopted and the problem was resolved or improved; failure if the advice was adopted but the problem persisted or worsened; and neutral if adoption or causal contribution cannot be determined.

Update.The telemetry counters of each recommendedXPUare updated atomically according to the verdict: (successes,failures) +=    (1,0),ifv=success, (0,1),ifv=failure, (0,0),ifv=neutral. (4) B Agent Prompt Excerpts This appendix provides selected excerpts from the role-specific prompts used by SETUPX. We include only the parts that define each agent’s ro...
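The Anchor, Extract, Judge, and Update steps of the telemetry loop in Appendix A can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `Anchor`, `Telemetry`, `TelemetryStore`, and `extract_window` are hypothetical, the LLM judge is assumed to have already produced a verdict string per XPU, and a `threading.Lock` stands in for whatever mechanism the system uses to make the update atomic.

```python
import threading
from dataclasses import dataclass

# Verdict -> (delta_successes, delta_failures), mirroring Eq. (4).
VERDICT_DELTAS = {"success": (1, 0), "failure": (0, 1), "neutral": (0, 0)}


@dataclass
class Anchor:
    """Temporal reference point recorded at retrieval time (hypothetical type)."""
    trajectory_len: int
    xpu_ids: list


@dataclass
class Telemetry:
    """Per-XPU counters (hypothetical type)."""
    successes: int = 0
    failures: int = 0


def extract_window(trajectory, anchor, max_steps=5):
    """Return up to `max_steps` trajectory steps recorded after the anchor."""
    start = anchor.trajectory_len
    return trajectory[start:start + max_steps]


class TelemetryStore:
    """Holds counters and applies verdicts under a lock (an assumption)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.counters = {}

    def update(self, verdicts):
        """Apply one judged verdict per recommended XPU id."""
        with self._lock:
            for xpu_id, verdict in verdicts.items():
                d_succ, d_fail = VERDICT_DELTAS[verdict]
                entry = self.counters.setdefault(xpu_id, Telemetry())
                entry.successes += d_succ
                entry.failures += d_fail
```

A neutral verdict leaves both counters unchanged, so an XPU whose contribution cannot be determined is neither rewarded nor penalized.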

1. Exact match first: XPUs whose advice_nl directly addresses the current problem rank highest.
2. Telemetry as reference: pay attention to each XPU's historical hit / success / failure counts, but do not discard one merely because it has many failures; judge whether the previous failure scenarios are similar to the current one. If they are not, the XPU may still be effective.
3. Drop the irrelevant: if an XPU's advice is completely unrelated to the current problem, do not pick it.

Pick at most {k}. %## Selection rules %1. Exact match first: XPUs whose advice_nl directly solves the current problem rank first. %2. Telemetry as reference: pay attention to each XPU's historical hits / successes / failures, but do not exclude an XPU just because it has many failures. %3. Exclude irrelevant items outright: if an XPU's advice has nothing ...

1. Structure reconnaissance: `ls` the project root, locate pyproject.toml / setup.cfg / pytest.ini / tox.ini, etc.
2. Locate the test suite: confirm the test directory and framework (pytest / unittest / tox, ...).
3. Run the tests in the project's native way and collect results.
4. Analyze failure causes and make a judgment.

If the project has no tests at all, write a smoke test under /tmp/ to verify basic environment usability. ## Hard constraints (violation invalidates the verdict) - Install no packages. - Modify no environment configuration. - Modify no file under /workspace/repo. - write_file may only write into /tmp/. Prosecutor and Judge Prompt Excerpts You are the pros...

Checkpoint: Snapshot the current container via docker commit, pushing the image onto the LIFO stack S.

Adapt: The agent reads theXPU’sadvice_nl, leverages its full conversation context (recent history, observed versions, repository structure), and generates concrete commands tailored to the current repository. If the agent produces no command, the system falls back to rendering the XPU’satomsvia a type-awareAtom Rendering Enginethat maps 12 predefined atom...
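The Checkpoint step, which snapshots the container with docker commit and pushes the image onto a LIFO stack, can be sketched as below. This is a hedged illustration under stated assumptions, not the paper's code: the class `SnapshotStack`, the tag scheme `setupx-snap:<n>`, and the injectable `runner` are hypothetical, and the sketch only returns the image tag on rollback, leaving container re-creation to the caller because the source does not specify how the container is restored.

```python
import subprocess


def run_docker(args):
    """Run a docker CLI command and return its stripped stdout."""
    result = subprocess.run(
        ["docker", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


class SnapshotStack:
    """LIFO stack of image tags created with `docker commit` (hypothetical)."""

    def __init__(self, container, runner=run_docker):
        self.container = container
        self.runner = runner      # injectable so the sketch can be tested offline
        self._stack = []
        self._counter = 0         # keeps tags unique across rollbacks

    def checkpoint(self):
        """Commit the container to a new image and push its tag."""
        tag = f"setupx-snap:{self._counter}"  # hypothetical naming scheme
        self._counter += 1
        self.runner(["commit", self.container, tag])
        self._stack.append(tag)
        return tag

    def rollback(self):
        """Pop the most recent snapshot and return its tag."""
        if not self._stack:
            raise RuntimeError("no snapshot to roll back to")
        return self._stack.pop()

    def __len__(self):
        return len(self._stack)
```

Because snapshots are popped in reverse order of creation, a failed speculative fix only discards the most recent state and leaves earlier checkpoints available.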

host repository is cloned but core dependencies are not installed; import fails immediately,

and Redis siblings providing persistence and cache. The latest source’s pyproject.toml requires Python >=3.14,<3.15 (the source already uses PEP 695 type aliases such as type ConfType = ...); the container’s default base image ships Python 3.11. SetupX+XPU trajectory (53 steps).The agent inspects the source and identifies the Python version constraint, th...