pith. machine review for the scientific record.

arxiv: 2604.05159 · v1 · submitted 2026-04-06 · 💻 cs.SE · cs.AI · cs.CL

Recognition: no theorem link

Planning to Explore: Curiosity-Driven Planning for LLM Test Generation


Pith reviewed 2026-05-10 18:47 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL
keywords LLM test generation · curiosity-driven planning · Q-value estimation · branch coverage · coverage-guided testing · exploration in programs · automated software testing

The pith

Feeding coverage maps back to LLMs and selecting plans by estimated Q-values raises branch coverage by 51-77% over greedy immediate-gain methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM test generators that always pick the action with the largest immediate coverage gain stall on branches whose discovery requires preparatory setup steps. The paper treats the program's branch structure as an unknown environment and maintains the coverage map as a running probabilistic summary of what has been learned so far. CovQValue asks the LLM to generate several diverse plans in parallel, then uses the same LLM to assign Q-values that estimate long-term branch reachability, and executes the highest-scoring plan. On TestGenEval Lite this yields 51-77% more branch coverage across three LLMs and wins on 77-84% of targets; on the new RepoExploreBench it reaches 40-74% coverage. The core demonstration is that explicit lookahead planning guided by coverage feedback improves sequential exploration of program behavior.

Core claim

CovQValue maintains an evolving coverage map as a proxy posterior over the unknown branch structure, generates diverse candidate plans, and selects the plan whose LLM-estimated Q-value balances immediate branch discovery with future reachability; this planning step produces 51-77% higher branch coverage than greedy selection on TestGenEval Lite and 40-74% on RepoExploreBench.

What carries the argument

CovQValue is the selection procedure that converts the current coverage map into a set of parallel LLM-generated plans and ranks them by LLM-computed Q-values to pick the most informative next test.
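As an editorial illustration (not code from the paper), one round of this selection procedure can be sketched as: generate candidates, score them, execute the winner, update the map. The names `generate_plan`, `estimate_q`, and `execute` are hypothetical stand-ins for the LLM and test-runner calls.

```python
def covqvalue_step(coverage_map, generate_plan, estimate_q, execute, k=4):
    """One hypothetical CovQValue-style round: sample k diverse plans
    conditioned on the coverage map, score each with an LLM-estimated
    Q-value, execute the best, and fold the new branches into the map."""
    plans = [generate_plan(coverage_map, seed=i) for i in range(k)]
    scores = [estimate_q(coverage_map, plan) for plan in plans]
    best = plans[scores.index(max(scores))]
    newly_covered = execute(best)          # set of branch ids the test hit
    return coverage_map | newly_covered, best

# Deterministic toy stand-ins, just to exercise the loop:
cov = frozenset({"b1"})
gen = lambda cov, seed: f"plan{seed}"
q = lambda cov, plan: int(plan[-1])        # toy scorer: ranks plan3 highest
run = lambda plan: frozenset({"b2"}) if plan == "plan3" else frozenset()
new_cov, chosen = covqvalue_step(cov, gen, q, run, k=4)
```

A real instantiation would prompt the model for diverse plans and scalar Q-estimates; the essential point is that selection ranks by estimated long-term value rather than by immediate gain.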

If this is right

  • Greedy maximization of immediate coverage is suboptimal once programs contain branches whose discovery requires non-rewarding setup sequences.
  • An LLM can serve as its own value estimator for planning without task-specific fine-tuning.
  • Iterative test generation on repository-scale code benefits from explicit lookahead rather than myopic action selection.
  • Coverage maps function as compact, sufficient state representations for guiding exploration in code.
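The greedy-stall failure mode behind the first bullet can be made concrete with a toy class (our example, not the paper's): the interesting branch is guarded by a setup call that by itself yields no new coverage once its body has been seen.

```python
class Pipeline:
    """Toy greedy-stall example: process()'s error branch is unreachable
    until configure() has run, but re-calling configure() adds zero new
    coverage, so an immediate-gain strategy never pairs the two calls."""
    def __init__(self):
        self.configured = False

    def configure(self):
        self.configured = True            # pure setup: no branchy payoff

    def process(self, x):
        if not self.configured:
            return None                   # greedy tests keep landing here
        if x < 0:
            raise ValueError("negative input")  # deep branch: setup plus x < 0
        return x * 2
```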

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coverage-to-Q-value loop could be applied to other LLM-driven sequential tasks such as automated debugging or API exploration.
  • Replacing the LLM Q-value estimator with a learned model trained on execution traces might further improve selection accuracy.
  • The approach implies that providing richer feedback signals (beyond raw coverage) to the planner could accelerate discovery of even deeper program behavior.

Load-bearing premise

An LLM can produce Q-value estimates from a coverage map that reliably predict which plan will unlock the largest number of additional branches later.

What would settle it

Run CovQValue and a pure greedy baseline on a small program whose full branch structure is known in advance; if the plans chosen by the Q-value step do not reach more branches than the immediate-gain step, the estimation mechanism adds no value.
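A minimal harness for that controlled check might look like the following (hypothetical sketch; `select_greedy`, `select_q`, and `run_plan` are stand-ins for the two strategies and the test executor):

```python
def compare_strategies(select_greedy, select_q, run_plan, all_branches, rounds=10):
    """Run both selection strategies for a fixed budget on a program whose
    full branch set is known in advance, returning each strategy's final
    coverage fraction. If the Q-value strategy does not pull ahead, the
    estimation step adds no value over greedy selection."""
    results = {}
    for name, select in [("greedy", select_greedy), ("qvalue", select_q)]:
        covered = set()
        for _ in range(rounds):
            covered |= run_plan(select(frozenset(covered)))
        results[name] = len(covered) / len(all_branches)
    return results

# Toy instance: the "deep" plan only becomes selectable once "a" is covered.
greedy = lambda cov: "shallow"
qval = lambda cov: "deep" if "a" in cov else "shallow"
run = lambda plan: {"a"} if plan == "shallow" else {"a", "b", "c"}
res = compare_strategies(greedy, qval, run, {"a", "b", "c"}, rounds=3)
```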

Figures

Figures reproduced from arXiv: 2604.05159 by Alfonso Amayuelas, Firas Laakom, Jürgen Schmidhuber, Piotr Piękos, Wenyi Wang, William Wang, Yifan Xu, Yuhui Wang.

Figure 1: Overview of CovQValue. At each round, the LLM generates K diverse candidate plans conditioned on the coverage map. Each plan is scored by its estimated Q-value, information gain plus discounted future reachability (Equation 2), and the highest-scoring plan is executed. The resulting branch coverage updates the coverage map for the next round.

Figure 2: Cumulative branch coverage over execution steps, averaged across three models.

Figure 3: Mean branch coverage by repository on TestGenEval Lite, averaged across three models.

Figure 4: Trade-off between test pass rate and branch coverage. Large dots represent the mean across models, while individual models and benchmark results are represented with transparent dots. Coverage-map strategies discover more branches by generating larger tests at a small cost in pass rate.

Figure 5: Ablation studies on RepoExploreBench (93 targets, Gemini 3 Flash). (a) CovQValue …

Figure 6: Corridor navigation on four RepoExploreBench modules (Gemini 3 Flash).

Figure 7: Mean branch coverage by repository on RepoExploreBench, averaged for all models.

Figure 9: Cumulative branch coverage over execution steps on TestGenEval Lite …

Figure 11: Cumulative branch coverage over execution steps on RepoExploreBench …
Original abstract

The use of LLMs for code generation has naturally extended to code testing and evaluation. As codebases grow in size and complexity, so does the need for automated test generation. Current approaches for LLM-based test generation rely on strategies that maximize immediate coverage gain, a greedy approach that plateaus on code where reaching deep branches requires setup steps that individually yield zero new coverage. Drawing on principles of Bayesian exploration, we treat the program's branch structure as an unknown environment, and an evolving coverage map as a proxy probabilistic posterior representing what the LLM has discovered so far. Our method, CovQValue, feeds the coverage map back to the LLM, generates diverse candidate plans in parallel, and selects the most informative plan by LLM-estimated Q-values, seeking actions that balance immediate branch discovery with future reachability. Our method outperforms greedy selection on TestGenEval Lite, achieving 51-77% higher branch coverage across three popular LLMs and winning on 77-84% of targets. In addition, we build a benchmark for iterative test generation, RepoExploreBench, where they achieve 40-74%. These results show the potential of curiosity-driven planning methods for LLM-based exploration, enabling more effective discovery of program behavior through sequential interaction

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CovQValue, a curiosity-driven planning approach for LLM-based test generation. It models the program's branch structure as an unknown environment with the coverage map serving as a proxy posterior, generates multiple diverse candidate plans in parallel via the LLM, and selects the plan with the highest LLM-estimated Q-value to balance immediate branch coverage with long-term reachability. The method is evaluated on TestGenEval Lite, where it reports 51-77% higher branch coverage than greedy selection across three LLMs and wins on 77-84% of targets, plus 40-74% gains on the newly introduced RepoExploreBench for iterative test generation.

Significance. If the core mechanism holds after validation, the work would be a meaningful contribution to LLM test generation by moving beyond greedy immediate-coverage strategies to incorporate planning and exploration principles drawn from Bayesian RL. The introduction of RepoExploreBench as a benchmark for iterative, sequential test generation is a clear positive that could support future research. The approach addresses a known limitation of greedy methods on deep branches but requires stronger evidence that the Q-value component drives the gains.

major comments (2)
  1. [Experimental Evaluation] The central empirical claim (outperformance via Q-value selection) is load-bearing but unsupported by necessary controls: the manuscript provides no ablation comparing Q-value selection against random selection among the generated candidate plans or against pure diversity-based selection. Without this, it is impossible to determine whether gains arise from the curiosity/Q-value mechanism or simply from generating multiple plans in parallel (see strongest claim and skeptic note on lack of per-plan correlation or calibration between estimated Q and realized coverage).
  2. [Results on TestGenEval Lite] Reported performance numbers (51-77% higher branch coverage, 77-84% win rate) lack statistical details such as number of runs, standard deviations, confidence intervals, or significance tests. This makes it difficult to assess robustness, especially given the stochastic nature of LLMs and the claim of consistent wins across targets.

minor comments (2)
  1. [Abstract] The abstract introduces 'CovQValue' without a brief expansion or one-sentence definition of the method name, which would aid readability for readers unfamiliar with the Q-value framing.
  2. [Method] The description of how the coverage map is encoded and fed back to the LLM for Q-value estimation could be clarified with a short example or pseudocode to make the LLM prompt construction reproducible.
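To make the second request concrete, one plausible encoding of the coverage map for the prompt is a per-branch hit/miss listing. This is an illustrative guess, not the paper's actual format.

```python
def coverage_map_to_prompt(branches):
    """Render a coverage map as prompt text; the paper does not specify
    its actual encoding, so this is a hypothetical sketch. Each branch id
    is marked covered or NOT covered so the model can target the gaps."""
    lines = [f"{bid}: {'covered' if hit else 'NOT covered'}"
             for bid, hit in sorted(branches.items())]
    return "Branch coverage so far:\n" + "\n".join(lines)
```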

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of our approach and the new benchmark. We address each major comment below, providing clarifications and committing to revisions where appropriate.

Point-by-point responses
  1. Referee: [Experimental Evaluation] The central empirical claim (outperformance via Q-value selection) is load-bearing but unsupported by necessary controls: the manuscript provides no ablation comparing Q-value selection against random selection among the generated candidate plans or against pure diversity-based selection. Without this, it is impossible to determine whether gains arise from the curiosity/Q-value mechanism or simply from generating multiple plans in parallel (see strongest claim and skeptic note on lack of per-plan correlation or calibration between estimated Q and realized coverage).

    Authors: We agree that additional ablations are valuable to isolate the effect of Q-value-based selection. The current evaluation compares CovQValue against greedy baselines, which select plans based on immediate coverage without generating multiple candidates or using planning. However, to directly address whether the gains stem from the Q-value mechanism rather than parallel plan generation, we will add ablations in the revised manuscript: (1) random selection among the generated candidate plans, and (2) diversity-based selection (e.g., maximizing the number of unique predicted branches across plans). Additionally, we will include an analysis of the correlation between LLM-estimated Q-values and actual coverage improvements to assess calibration. These additions will strengthen the evidence for the curiosity-driven component. revision: yes

  2. Referee: [Results on TestGenEval Lite] Reported performance numbers (51-77% higher branch coverage, 77-84% win rate) lack statistical details such as number of runs, standard deviations, confidence intervals, or significance tests. This makes it difficult to assess robustness, especially given the stochastic nature of LLMs and the claim of consistent wins across targets.

    Authors: We acknowledge the importance of statistical rigor given the stochasticity of LLMs. In the revised version, we will report the number of independent runs performed for each experiment, along with standard deviations, 95% confidence intervals, and results from appropriate statistical tests (e.g., paired t-tests or non-parametric tests) to evaluate the significance of the observed improvements. This will provide a clearer picture of the robustness and consistency of the results across targets. revision: yes
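The calibration analysis promised in the first response could be as simple as a rank correlation between estimated Q-values and realized coverage gains. A dependency-free sketch (ties not averaged):

```python
def rank_correlation(q_estimates, realized_gains):
    """Spearman-style rank correlation (no tie handling) between the
    LLM's Q-value estimates and the branch-coverage gain each plan
    actually realized. Values near 1 would support the load-bearing
    premise that Q-estimates predict future reachability."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rq, rg = ranks(q_estimates), ranks(realized_gains)
    n = len(rq)
    mq, mg = sum(rq) / n, sum(rg) / n
    cov = sum((a - mq) * (b - mg) for a, b in zip(rq, rg))
    spread = (sum((a - mq) ** 2 for a in rq)
              * sum((b - mg) ** 2 for b in rg)) ** 0.5
    return cov / spread
```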

Circularity Check

0 steps flagged

No circularity; empirical method with independent benchmark evaluation

full rationale

The paper describes an empirical LLM-based test generation procedure (CovQValue) that feeds coverage maps into an LLM to produce and rank plans via estimated Q-values, then reports aggregate coverage gains on TestGenEval Lite and RepoExploreBench. No equations, fitted parameters, or self-referential definitions appear in the provided text; the central claims rest on experimental comparisons to greedy baselines rather than any derivation that reduces outputs to inputs by construction. External Bayesian exploration principles are invoked without load-bearing self-citations or ansatz smuggling from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on treating the coverage map as a proxy posterior over an unknown program environment and assuming the LLM can perform useful Q-value estimation for plan selection. No free parameters or new invented entities are described in the abstract.

axioms (2)
  • domain assumption The program's branch structure can be treated as an unknown environment and the evolving coverage map as a proxy probabilistic posterior.
    Explicitly stated in the abstract as the foundation for applying Bayesian exploration principles.
  • domain assumption LLM-estimated Q-values provide a reliable signal for selecting plans that balance immediate and future coverage gains.
    Core mechanism of CovQValue as described.

pith-pipeline@v0.9.0 · 5548 in / 1299 out tokens · 38299 ms · 2026-05-10T18:47:04.540067+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 8 canonical work pages · 3 internal anchors

  1. [1]

    Coverup: Effective high coverage test generation for python

    Juan Altmayer Pizzorno and Emery D Berger. Coverup: Effective high coverage test generation for python. Proceedings of the ACM on Software Engineering, 2(FSE): 2897-2919, 2025.

  2. [2]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

  3. [3]

    Chatunitest: A framework for llm-based test generation

    Yinghao Chen, Zehao Hu, Chen Zhi, Junxiao Han, Shuiguang Deng, and Jianwei Yin. Chatunitest: A framework for llm-based test generation. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pp. 572-576, 2024.

  4. [4]

    BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

    Deepro Choudhury, Sinead Williamson, Adam Goliński, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, and Tom Rainforth. BED-LLM: Intelligent information gathering with LLMs and Bayesian experimental design. arXiv preprint arXiv:2508.21184, 2025.

  5. [5]

    The curious language model: Strategic test-time information acquisition

    Michael Cooper, Rohan Wadhawan, John Michael Giorgi, Chenhao Tan, and Davis Liang. The curious language model: Strategic test-time information acquisition. arXiv preprint arXiv:2506.09173, 2025.

  6. [6]

    Evosuite: automatic test suite generation for object-oriented software

    Gordon Fraser and Andrea Arcuri. Evosuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pp. 416-419, 2011.

  7. [7]

    Uncertainty of thoughts: Uncertainty-aware planning enhances information seeking in llms

    Zhiyuan Hu, Chumin Liu, Xidong Feng, Yilun Zhao, See-Kiong Ng, Anh T Luu, Junxian He, Pang W Koh, and Bryan Hooi. Uncertainty of thoughts: Uncertainty-aware planning enhances information seeking in llms. Advances in Neural Information Processing Systems, 37: 24181-24215, 2024.

  8. [8]

    Testforge: Feedback-driven, agentic test suite generation

    Kush Jain and Claire Le Goues. Testforge: Feedback-driven, agentic test suite generation. arXiv preprint arXiv:2503.14713, 2025.

  9. [9]

    Testgeneval: A real world unit test generation and test completion benchmark

    Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. Testgeneval: A real world unit test generation and test completion benchmark. arXiv preprint arXiv:2410.00752, 2024.

  10. [10]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.

  11. [11]

    Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models

    Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 919-931. IEEE, 2023.

  12. [12]

    Pynguin: Automated unit test generation for python

    Stephan Lukasczyk and Gordon Fraser. Pynguin: Automated unit test generation for python. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, pp. 168-172, 2022.

  13. [13]

    An empirical evaluation of using large language models for automated unit test generation

    Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 50(1): 85-105, 2023.

  14. [14]

    Curious model-building control systems

    Jürgen Schmidhuber. Curious model-building control systems. In Proc. international joint conference on neural networks, pp. 1458-1463, 1991a.

  15. [15]

    A possibility for implementing curiosity and boredom in model-building neural controllers

    Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pp. 222-227, 1991b.

  16. [16]

    Formal theory of creativity, fun, and intrinsic motivation (1990-2010)

    Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3): 230-247, 2010.

  17. [17]

    Intrinsically motivated reinforcement learning: An evolutionary perspective

    Satinder Singh, Richard L Lewis, Andrew G Barto, and Jonathan Sorg. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2): 70-82, 2010.

  18. [18]

    Reinforcement driven information acquisition in non-deterministic environments

    Jan Storck, Sepp Hochreiter, Jürgen Schmidhuber, et al. Reinforcement driven information acquisition in non-deterministic environments. In Proceedings of the international conference on artificial neural networks, Paris, volume 2, pp. 159-164, 1995.

  19. [19]

    Planning to be surprised: Optimal bayesian exploration in dynamic environments

    Yi Sun, Faustino Gomez, and Jürgen Schmidhuber. Planning to be surprised: Optimal bayesian exploration in dynamic environments. In International conference on artificial general intelligence, pp. 41-51. Springer, 2011.

  20. [20]

    Enhancing llm-based test generation by eliminating covered code

    WeiZhe Xu, Mengyu Liu, and Fanxin Kong. Enhancing llm-based test generation by eliminating covered code. arXiv preprint arXiv:2602.21997, 2026.

  21. [21]

    Advancing code coverage: Incorporating program analysis with large language models

    Chen Yang, Junjie Chen, Bin Lin, Ziqi Wang, and Jianyi Zhou. Advancing code coverage: Incorporating program analysis with large language models. ACM Transactions on Software Engineering and Methodology, 2024.

  22. [22]

    Evaluating and improving chatgpt for unit test generation

    Zhiqiang Yuan, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng, and Yiling Lou. Evaluating and improving chatgpt for unit test generation. Proceedings of the ACM on Software Engineering, 1(FSE): 1703-1726, 2024.
