ConcoLixir: Reactive LLM Discovery Oracles for Python Concolic Testing

Chih-Duo Hong; Dong Chen; Fang Yu

arxiv: 2606.26545 · v1 · pith:W4P3GITTnew · submitted 2026-06-25 · 💻 cs.SE

ConcoLixir: Reactive LLM Discovery Oracles for Python Concolic Testing

Dong Chen , Chih-Duo Hong , Fang Yu This is my paper

Pith reviewed 2026-06-26 04:27 UTC · model grok-4.3

classification 💻 cs.SE

keywords concolic testingLLM oraclePythonsymbolic executiontest coveragereactive discoverylibrary calls

0 comments

The pith

A reactive LLM discovery oracle raises line coverage in Python concolic testing by 8.6 to 17 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ConcoLixir adds an LLM that serves strictly as a discovery oracle inside Python concolic execution. The oracle supplies initial seeds, concrete inputs after solver failures on semantic operations, and new targets when coverage stops advancing. Every candidate is run through the concolic engine so that only measured coverage and collected constraints steer the next step. The result is higher line coverage on synthetic, real-world, and library targets, with the largest lifts near semantic barriers and library boundaries.

Core claim

The paper claims that inserting a bounded LLM as a reactive discovery oracle into concolic execution lets it generate useful concrete inputs precisely where the symbolic solver stalls on library calls, regular expressions, checksums, and parsers, producing measured coverage gains without the LLM ever replacing the solver or serving as a correctness oracle.

What carries the argument

The reactive LLM discovery oracle that proposes seeds, post-failure inputs, and stalled-coverage targets, each validated by full concolic execution using only observed coverage and path constraints.

If this is right

Gains concentrate near semantic barriers and library boundaries.
Improvements appear on synthetic, real-world, and library code.
Total API cost for the full evaluation stays at $1.63.
Bounded LLM discovery complements symbolic reasoning instead of replacing it.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same oracle pattern could be attached to concolic engines for other languages that face similar semantic barriers.
Manual seed selection might become less critical if the oracle reliably supplies diverse starting points.
Prompts tuned to specific hard operations such as checksums could produce further targeted gains.

Load-bearing premise

Coverage gains are produced by the LLM-generated candidates themselves rather than by differences in experimental setup or seeding between the baseline and augmented runs.

What would settle it

Run the baseline concolic tester on exactly the same set of LLM-generated inputs and check whether its final line coverage matches the augmented version.

Figures

Figures reproduced from arXiv: 2606.26545 by Chih-Duo Hong, Dong Chen, Fang Yu.

**Figure 1.** Figure 1: Baseline concolic testing loop driven by coverage. Starting from a current input, the engine executes the target and collects path constraints, selects and negates an unexplored path constraint, invokes the SMT solver to generate a new input when satisfiable, and repeats until full coverage is reached or the constraint pool or exploration budget is exhausted. PyExZ3 [2,6,25], adapt this technique through… view at source ↗

**Figure 2.** Figure 2 [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Abstract structure of a seed generation prompt. Live prompts include the full source code, complete CFG, and extracted boundary values. Once the static analysis completes, the LLM generates seed inputs that target all reachable lines. The prompt uses five context elements from the static analysis output. These are the function’s source code with line numbers, type context with parameter types and custom ty… view at source ↗

read the original abstract

Concolic testing combines concrete execution with symbolic constraint solving, but Python programs expose recurring limits. Library calls can cause symbolic variables to downgrade to concrete values. Regular expressions, checksums, parsers, and other semantic operations can be hard to solve, and exploration can plateau on already covered paths. We present ConcoLixir, a reactive LLM extension for Python concolic execution. The LLM acts as a discovery oracle, not a replacement for the solver or a correctness oracle. It generates initial seeds, proposes concrete inputs after solver failures, and targets uncovered code when coverage stalls. Each candidate is executed concolically, and only observed coverage and collected path constraints guide later exploration. Across synthetic, real-world, and library targets, ConcoLixir improves mean line coverage over the baseline concolic tester without an LLM oracle by 8.6, 15.1, and 17.0 percentage points. The gains are strongest near semantic barriers and library boundaries, and the full evaluation costs \$1.63 in API charges. These results show that bounded LLM discovery can complement symbolic reasoning without replacing it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ConcoLixir shows coverage gains from an LLM acting as a reactive discovery oracle in Python concolic testing, but the abstract leaves open whether those gains are due to the LLM or unmatched proposal budgets.

read the letter

The main takeaway is that this paper puts an LLM to work as a bounded discovery oracle inside a Python concolic tester. It generates seeds, suggests concrete inputs when the solver fails, and steers toward uncovered code once exploration stalls. Each proposal is then run through the actual concolic engine so that only real coverage and path constraints drive the next step. That framing is clear and avoids overclaiming the LLM as a solver or correctness checker.

The reported numbers are the concrete contribution: mean line coverage up 8.6 pp on synthetic targets, 15.1 pp on real-world code, and 17.0 pp on libraries, all at $1.63 in API cost. The gains are said to appear most where semantic barriers and library boundaries normally stop progress. That is a usable data point for anyone working on test generation for Python.

The soft spot is exactly the one flagged in the stress test. The abstract gives no numbers on how many proposals the plain concolic baseline received, how many solver calls it made, or whether total wall-clock or input budgets were matched. Without those controls it is impossible to tell whether the deltas come from the quality of the LLM suggestions or simply from running more concrete executions. The paper would be stronger if it showed the baseline was given an equivalent exploration quota.

This is aimed at researchers and tool builders in automated testing who already use concolic or symbolic methods and are looking for lightweight ways to handle the parts that defeat the solver. A reader who wants ideas for LLM-assisted testing pipelines can extract the mechanism even if the numbers need verification.

It deserves a serious referee to check the experimental design and see whether the controls hold up in the full paper.

Referee Report

2 major / 1 minor

Summary. The manuscript presents ConcoLixir, a reactive LLM extension for Python concolic testing where the LLM serves as a discovery oracle to generate initial seeds, propose concrete inputs after solver failures, and target uncovered code upon coverage stalls. Each LLM candidate is executed concolically, with only observed coverage and path constraints guiding further exploration. The paper reports that this approach improves mean line coverage over a baseline concolic tester without an LLM oracle by 8.6, 15.1, and 17.0 percentage points across synthetic, real-world, and library targets, with the evaluation costing $1.63 in API charges.

Significance. If the empirical results hold under controlled conditions, the work would demonstrate a practical, low-cost way to integrate LLMs as bounded discovery oracles that complement rather than replace symbolic execution in Python testing, with particular value near semantic barriers and library boundaries.

major comments (2)

[Abstract] Abstract: the reported coverage improvements of 8.6, 15.1, and 17.0 percentage points are presented without any information on experimental setup, number of runs, statistical significance, error bars, baseline implementation, or controls for total concrete inputs, solver invocations, or wall-clock budget; this information is load-bearing for attributing the deltas to the reactive LLM oracle mechanism.
[Abstract] Abstract: the description states that LLM candidates are executed concolically and that only observed coverage/path constraints guide later steps, but supplies no explicit statement that the baseline receives an identical number of proposals or equivalent exploration quota; without this control the measured gains cannot be attributed to the oracle rather than unmatched exploration effort.

minor comments (1)

[Abstract] Abstract: the phrase 'the full evaluation costs $1.63 in API charges' is unclear about whether this figure covers only LLM API calls or includes other experimental costs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in the abstract regarding experimental controls and setup. We will revise the abstract to incorporate the requested details while preserving its conciseness, and we address each comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the reported coverage improvements of 8.6, 15.1, and 17.0 percentage points are presented without any information on experimental setup, number of runs, statistical significance, error bars, baseline implementation, or controls for total concrete inputs, solver invocations, or wall-clock budget; this information is load-bearing for attributing the deltas to the reactive LLM oracle mechanism.

Authors: We agree the abstract should be more self-contained on methodology. The full manuscript (Section 4) details that results are means over 10 independent runs per target, with paired t-tests for significance (p<0.05 reported), error bars in Figure 3, and explicit matching of total concrete executions and solver calls between ConcoLixir and baseline. We will revise the abstract to include a one-sentence summary of these controls and the baseline (PyExZ3 extended with the same input budget). revision: yes
Referee: [Abstract] Abstract: the description states that LLM candidates are executed concolically and that only observed coverage/path constraints guide later steps, but supplies no explicit statement that the baseline receives an identical number of proposals or equivalent exploration quota; without this control the measured gains cannot be attributed to the oracle rather than unmatched exploration effort.

Authors: We will add an explicit clause to the abstract stating that the baseline receives an identical exploration quota (same total number of concrete proposals and solver invocations) as ConcoLixir. This matching is already enforced in the evaluation harness described in Section 4.2; the revision will make the control visible at the abstract level. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results with no derivation chain

full rationale

The paper reports experimental coverage gains from running ConcoLixir versus a baseline concolic tester on synthetic, real-world, and library targets. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or described mechanism. The central claims rest on observed execution outcomes rather than any reduction of outputs to inputs by construction. The skeptic concern addresses experimental controls (budget matching) but does not constitute circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the empirical results from integrating the LLM oracle into concolic execution. No free parameters are mentioned. The main supporting element is a domain assumption about the feasibility of LLM suggestions in the testing loop.

axioms (1)

domain assumption Concolic testing can be extended with external oracles for input generation without compromising the validity of path constraints
Invoked implicitly when describing the reactive use of LLM proposals that are validated through concolic execution.

invented entities (1)

Reactive LLM discovery oracle no independent evidence
purpose: Generates initial seeds, proposes concrete inputs after solver failures, and targets uncovered code when coverage stalls
Introduced as the key extension to the baseline concolic tester

pith-pipeline@v0.9.1-grok · 5728 in / 1467 out tokens · 72635 ms · 2026-06-26T04:27:52.884648+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 20 canonical work pages · 3 internal anchors

[1]

IEEE Security & Privacy 16(2), 52–60 (2018).https://doi.org/10.1109/MSP.2018.1870873

Avgerinos, T., Brumley, D., Davis, J., Goulden, R., Nighswander, T., Rebert, A., Williamson, N.: The Mayhem cyber reasoning system. IEEE Security & Privacy 16(2), 52–60 (2018).https://doi.org/10.1109/MSP.2018.1870873

work page doi:10.1109/msp.2018.1870873 2018
[2]

In: Dependable Software Systems Engineering, NATO Science for Peace and Security Series D, vol

Ball, T., Daniel, J.: Deconstructing dynamic symbolic execution. In: Dependable Software Systems Engineering, NATO Science for Peace and Security Series D, vol. 40, pp. 26–41. IOS Press (2015).https://doi.org/10.3233/978-1-61499-495-4-26

work page doi:10.3233/978-1-61499-495-4-26 2015
[3]

In: OSDI

Cadar, C., Dunbar, D., Engler, D.: KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In: OSDI. USENIX Association (2008)

2008
[4]

Cadar, C., Ganesh, V., Pawlowski, P.M., Dill, D.L., Engler, D.R.: EXE: Au- tomatically generating inputs of death. In: CCS. pp. 322–335 (2006).https: //doi.org/10.1145/1180405.1180445

work page doi:10.1145/1180405.1180445 2006
[5]

Communications of the ACM56(2), 82–90 (2013).https://doi.org/10.1145/ 2408776.2408795

Cadar, C., Sen, K.: Symbolic execution for software testing: three decades later. Communications of the ACM56(2), 82–90 (2013).https://doi.org/10.1145/ 2408776.2408795

arXiv 2013
[6]

In: APLAS

Chen, Y.F., Tsai, W.L., Wu, W.C., Yen, D.D., Yu, F.: PyCT: A Python concolic tester. In: APLAS. pp. 38–46 (2021).https://doi.org/10.1007/978-3-030-89051-3_3

work page doi:10.1007/978-3-030-89051-3_3 2021
[7]

In: TACAS

De Moura, L., Bjørner, N.: Z3: An efficient SMT solver. In: TACAS. pp. 337–340. Springer (2008)

2008
[8]

arXiv preprint arXiv:2212.14834 (2023).https://doi.org/10.48550/arXiv.2212.14834, accepted at ISSTA 2023

Deng, Y., Xia, C.S., Peng, H., Yang, C., Zhang, L.: Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. arXiv preprint arXiv:2212.14834 (2023).https://doi.org/10.48550/arXiv.2212.14834, accepted at ISSTA 2023

work page doi:10.48550/arxiv.2212.14834 2023
[9]

In: SBFT

Erni, N., Al-Ameen, M., Birchler, C., Derakhshanfar, P., Lukasczyk, S., Panichella, S.: SBFT tool competition 2024 – Python test case generation track. In: SBFT. pp. 37–40 (2024).https://doi.org/10.1145/3643659.3643930

work page doi:10.1145/3643659.3643930 2024
[10]

In: WOOT

Fioraldi, A., Maier, D., Eißfeldt, H., Heuse, M.: AFL++: Combining incremental steps of fuzzing research. In: WOOT. USENIX Association (2020)

2020
[11]

In: ESEC/FSE

Fraser, G., Arcuri, A.: EvoSuite: Automatic test suite generation for object- oriented software. In: ESEC/FSE. pp. 416–419 (2011).https://doi.org/10.1145/ 2025113.2025179

arXiv 2011
[12]

In: PLDI

Godefroid, P., Klarlund, N., Sen, K.: DART: Directed automated random testing. In: PLDI. pp. 213–223 (2005)

2005
[13]

In: NDSS

Godefroid, P., Levin, M.Y., Molnar, D.A.: Automated whitebox fuzz testing. In: NDSS. The Internet Society (2008)

2008
[14]

Hong, C.D., Jiang, H., Lin, A.W., Markgraf, O., Parsert, J., Tan, T.: Extracting robust register automata from neural networks over data sequences (2025).https: //doi.org/10.48550/arXiv.2511.19100

work page doi:10.48550/arxiv.2511.19100 2025
[15]

Hong, C.D., Yang, C.C., Wang, Y., Yu, F.: Influence-guided concolic testing of transformer robustness (2025).https://doi.org/10.48550/arXiv.2509.23806, accepted at the International Conference on Software Quality, Reliability, and Security ConcoLixir21

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.23806 2025
[16]

Huang, M.I., Hong, C.D., Yu, F.: Concolic testing on individual fairness of neural network models (2025).https://doi.org/10.48550/arXiv.2509.06864

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.06864 2025
[17]

In: Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (2023)

Lemieux, C., Inala, J.P., Lahiri, S., Sen, S.: CodaMOSA: Escaping coverage plateaus in test generation with pre-trained large language models. In: Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (2023)

2023
[18]

Li, Y., Meng, R., Duck, G.J.: Large language model powered symbolic execu- tion.Proc.ACMProgram.Lang.9(OOPSLA2),3148–3176(2025).https://doi.org/ 10.1145/3763163

work page doi:10.1145/3763163 2025
[19]

Lin, L.J., Hong, C.D.: Robustness verification of recurrent neural networks with abstraction refinement (2026).https://doi.org/10.48550/arXiv.2606.12490

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2606.12490 2026
[20]

In: ICSE Companion

Lukasczyk, S., Fraser, G.: Pynguin: Automated unit test generation for Python. In: ICSE Companion. pp. 168–172 (2022).https://doi.org/10.1145/3510454.3516829

work page doi:10.1145/3510454.3516829 2022
[21]

In: IEEE S&P

Luo, Z., Zhao, H., Wolff, D., Cadar, C., Roychoudhury, A.: Agentic concolic exe- cution. In: IEEE S&P. pp. 1–19. IEEE Computer Society (2026)

2026
[22]

Journal of Open Source Software4(43), 1891 (2019).https://doi.org/10.21105/joss.01891

MacIver, D.R., Hatfield-Dodds, Z., Contributors, M.O.: Hypothesis: A new ap- proach to property-based testing. Journal of Open Source Software4(43), 1891 (2019).https://doi.org/10.21105/joss.01891

work page doi:10.21105/joss.01891 2019
[23]

Pizzorno, J.A., Berger, E.D.: CoverUp: Effective high coverage test generation for Python (2025).https://doi.org/10.1145/3729398, to appear at FSE 2025

work page doi:10.1145/3729398 2025
[24]

Poeplau, S., Francillon, A.: Symbolic execution with SymCC: Don’t interpret, com- pile! In: USENIX Security. pp. 181–198. USENIX Association (2020)

2020
[25]

Schanely, P.: CrossHair: An analysis tool for Python.https://github.com/pschanely/ CrossHair(2017)

2017
[26]

Improving change prediction with fine-grained source code mining,

Sen, K.: Concolic testing. In: Proceedings of the 22nd IEEE/ACM Interna- tional Conference on Automated Software Engineering. pp. 571–572 (2007).https: //doi.org/10.1145/1321631.1321746

work page doi:10.1145/1321631.1321746 2007
[27]

Cute: a concolic unit testing engine for c

Sen, K., Marinov, D., Agha, G.: CUTE: A concolic unit testing engine for C. In: ESEC/FSE. pp. 263–272 (2005).https://doi.org/10.1145/1081706.1081750

work page doi:10.1145/1081706.1081750 2005
[28]

In: NDSS (2016)

Stephens, N., Grosen, J., Salls, C., Dutcher, A., Wang, R., Corbetta, J., Shoshi- taishvili, Y., Krügel, C., Vigna, G.: Driller: Augmenting fuzzing through selective symbolic execution. In: NDSS (2016)

2016
[29]

arXiv preprint arXiv:2504.17542 (2025).https://doi.org/10.48550/arXiv.2504.17542, to appear at the 2026 IEEE Symposium on Security and Privacy

Tu, H., Lee, S., Li, Y., Chen, P., Jiang, L., Böhme, M.: COTTONTAIL: Large lan- guage model-driven concolic execution for highly structured test input generation. arXiv preprint arXiv:2504.17542 (2025).https://doi.org/10.48550/arXiv.2504.17542, to appear at the 2026 IEEE Symposium on Security and Privacy

work page doi:10.48550/arxiv.2504.17542 2025
[30]

Available: https://doi.org/10.3102/10769986025002101

Vargha, A., Delaney, H.D.: A critique and improvement of the CL common lan- guage effect size statistics of McGraw and Wong. Journal of Educational and Be- havioral Statistics25(2), 101–132 (2000).https://doi.org/10.3102/10769986025002101

work page doi:10.3102/10769986025002101 2000
[31]

arXiv preprint arXiv:2409.09271 (2024).https://doi.org/10.48550/arXiv.2409.09271

Wang, W., Liu, K., Chen, A.R., Li, G., Jin, Z., Huang, G., Ma, L.: Python symbolic execution with LLM-powered code generation. arXiv preprint arXiv:2409.09271 (2024).https://doi.org/10.48550/arXiv.2409.09271

work page doi:10.48550/arxiv.2409.09271 2024
[32]

Individual Comparisons by Ranking Methods , volume=

Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1(6), 80–83 (1945).https://doi.org/10.2307/3001968

work page doi:10.2307/3001968 1945
[33]

In: USENIX Security

Yun, I., Lee, S., Xu, M., Jang, Y., Kim, T.: QSYM: A practical concolic execution engine tailored for hybrid fuzzing. In: USENIX Security. pp. 745–761. USENIX Association (2018)

2018

[1] [1]

IEEE Security & Privacy 16(2), 52–60 (2018).https://doi.org/10.1109/MSP.2018.1870873

Avgerinos, T., Brumley, D., Davis, J., Goulden, R., Nighswander, T., Rebert, A., Williamson, N.: The Mayhem cyber reasoning system. IEEE Security & Privacy 16(2), 52–60 (2018).https://doi.org/10.1109/MSP.2018.1870873

work page doi:10.1109/msp.2018.1870873 2018

[2] [2]

In: Dependable Software Systems Engineering, NATO Science for Peace and Security Series D, vol

Ball, T., Daniel, J.: Deconstructing dynamic symbolic execution. In: Dependable Software Systems Engineering, NATO Science for Peace and Security Series D, vol. 40, pp. 26–41. IOS Press (2015).https://doi.org/10.3233/978-1-61499-495-4-26

work page doi:10.3233/978-1-61499-495-4-26 2015

[3] [3]

In: OSDI

Cadar, C., Dunbar, D., Engler, D.: KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In: OSDI. USENIX Association (2008)

2008

[4] [4]

Cadar, C., Ganesh, V., Pawlowski, P.M., Dill, D.L., Engler, D.R.: EXE: Au- tomatically generating inputs of death. In: CCS. pp. 322–335 (2006).https: //doi.org/10.1145/1180405.1180445

work page doi:10.1145/1180405.1180445 2006

[5] [5]

Communications of the ACM56(2), 82–90 (2013).https://doi.org/10.1145/ 2408776.2408795

Cadar, C., Sen, K.: Symbolic execution for software testing: three decades later. Communications of the ACM56(2), 82–90 (2013).https://doi.org/10.1145/ 2408776.2408795

arXiv 2013

[6] [6]

In: APLAS

Chen, Y.F., Tsai, W.L., Wu, W.C., Yen, D.D., Yu, F.: PyCT: A Python concolic tester. In: APLAS. pp. 38–46 (2021).https://doi.org/10.1007/978-3-030-89051-3_3

work page doi:10.1007/978-3-030-89051-3_3 2021

[7] [7]

In: TACAS

De Moura, L., Bjørner, N.: Z3: An efficient SMT solver. In: TACAS. pp. 337–340. Springer (2008)

2008

[8] [8]

arXiv preprint arXiv:2212.14834 (2023).https://doi.org/10.48550/arXiv.2212.14834, accepted at ISSTA 2023

Deng, Y., Xia, C.S., Peng, H., Yang, C., Zhang, L.: Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. arXiv preprint arXiv:2212.14834 (2023).https://doi.org/10.48550/arXiv.2212.14834, accepted at ISSTA 2023

work page doi:10.48550/arxiv.2212.14834 2023

[9] [9]

In: SBFT

Erni, N., Al-Ameen, M., Birchler, C., Derakhshanfar, P., Lukasczyk, S., Panichella, S.: SBFT tool competition 2024 – Python test case generation track. In: SBFT. pp. 37–40 (2024).https://doi.org/10.1145/3643659.3643930

work page doi:10.1145/3643659.3643930 2024

[10] [10]

In: WOOT

Fioraldi, A., Maier, D., Eißfeldt, H., Heuse, M.: AFL++: Combining incremental steps of fuzzing research. In: WOOT. USENIX Association (2020)

2020

[11] [11]

In: ESEC/FSE

Fraser, G., Arcuri, A.: EvoSuite: Automatic test suite generation for object- oriented software. In: ESEC/FSE. pp. 416–419 (2011).https://doi.org/10.1145/ 2025113.2025179

arXiv 2011

[12] [12]

In: PLDI

Godefroid, P., Klarlund, N., Sen, K.: DART: Directed automated random testing. In: PLDI. pp. 213–223 (2005)

2005

[13] [13]

In: NDSS

Godefroid, P., Levin, M.Y., Molnar, D.A.: Automated whitebox fuzz testing. In: NDSS. The Internet Society (2008)

2008

[14] [14]

Hong, C.D., Jiang, H., Lin, A.W., Markgraf, O., Parsert, J., Tan, T.: Extracting robust register automata from neural networks over data sequences (2025).https: //doi.org/10.48550/arXiv.2511.19100

work page doi:10.48550/arxiv.2511.19100 2025

[15] [15]

Hong, C.D., Yang, C.C., Wang, Y., Yu, F.: Influence-guided concolic testing of transformer robustness (2025).https://doi.org/10.48550/arXiv.2509.23806, accepted at the International Conference on Software Quality, Reliability, and Security ConcoLixir21

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.23806 2025

[16] [16]

Huang, M.I., Hong, C.D., Yu, F.: Concolic testing on individual fairness of neural network models (2025).https://doi.org/10.48550/arXiv.2509.06864

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.06864 2025

[17] [17]

In: Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (2023)

Lemieux, C., Inala, J.P., Lahiri, S., Sen, S.: CodaMOSA: Escaping coverage plateaus in test generation with pre-trained large language models. In: Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (2023)

2023

[18] [18]

Li, Y., Meng, R., Duck, G.J.: Large language model powered symbolic execu- tion.Proc.ACMProgram.Lang.9(OOPSLA2),3148–3176(2025).https://doi.org/ 10.1145/3763163

work page doi:10.1145/3763163 2025

[19] [19]

Lin, L.J., Hong, C.D.: Robustness verification of recurrent neural networks with abstraction refinement (2026).https://doi.org/10.48550/arXiv.2606.12490

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2606.12490 2026

[20] [20]

In: ICSE Companion

Lukasczyk, S., Fraser, G.: Pynguin: Automated unit test generation for Python. In: ICSE Companion. pp. 168–172 (2022).https://doi.org/10.1145/3510454.3516829

work page doi:10.1145/3510454.3516829 2022

[21] [21]

In: IEEE S&P

Luo, Z., Zhao, H., Wolff, D., Cadar, C., Roychoudhury, A.: Agentic concolic exe- cution. In: IEEE S&P. pp. 1–19. IEEE Computer Society (2026)

2026

[22] [22]

Journal of Open Source Software4(43), 1891 (2019).https://doi.org/10.21105/joss.01891

MacIver, D.R., Hatfield-Dodds, Z., Contributors, M.O.: Hypothesis: A new ap- proach to property-based testing. Journal of Open Source Software4(43), 1891 (2019).https://doi.org/10.21105/joss.01891

work page doi:10.21105/joss.01891 2019

[23] [23]

Pizzorno, J.A., Berger, E.D.: CoverUp: Effective high coverage test generation for Python (2025).https://doi.org/10.1145/3729398, to appear at FSE 2025

work page doi:10.1145/3729398 2025

[24] [24]

Poeplau, S., Francillon, A.: Symbolic execution with SymCC: Don’t interpret, com- pile! In: USENIX Security. pp. 181–198. USENIX Association (2020)

2020

[25] [25]

Schanely, P.: CrossHair: An analysis tool for Python.https://github.com/pschanely/ CrossHair(2017)

2017

[26] [26]

Improving change prediction with fine-grained source code mining,

Sen, K.: Concolic testing. In: Proceedings of the 22nd IEEE/ACM Interna- tional Conference on Automated Software Engineering. pp. 571–572 (2007).https: //doi.org/10.1145/1321631.1321746

work page doi:10.1145/1321631.1321746 2007

[27] [27]

Cute: a concolic unit testing engine for c

Sen, K., Marinov, D., Agha, G.: CUTE: A concolic unit testing engine for C. In: ESEC/FSE. pp. 263–272 (2005).https://doi.org/10.1145/1081706.1081750

work page doi:10.1145/1081706.1081750 2005

[28] [28]

In: NDSS (2016)

Stephens, N., Grosen, J., Salls, C., Dutcher, A., Wang, R., Corbetta, J., Shoshi- taishvili, Y., Krügel, C., Vigna, G.: Driller: Augmenting fuzzing through selective symbolic execution. In: NDSS (2016)

2016

[29] [29]

arXiv preprint arXiv:2504.17542 (2025).https://doi.org/10.48550/arXiv.2504.17542, to appear at the 2026 IEEE Symposium on Security and Privacy

Tu, H., Lee, S., Li, Y., Chen, P., Jiang, L., Böhme, M.: COTTONTAIL: Large lan- guage model-driven concolic execution for highly structured test input generation. arXiv preprint arXiv:2504.17542 (2025).https://doi.org/10.48550/arXiv.2504.17542, to appear at the 2026 IEEE Symposium on Security and Privacy

work page doi:10.48550/arxiv.2504.17542 2025

[30] [30]

Available: https://doi.org/10.3102/10769986025002101

Vargha, A., Delaney, H.D.: A critique and improvement of the CL common lan- guage effect size statistics of McGraw and Wong. Journal of Educational and Be- havioral Statistics25(2), 101–132 (2000).https://doi.org/10.3102/10769986025002101

work page doi:10.3102/10769986025002101 2000

[31] [31]

arXiv preprint arXiv:2409.09271 (2024).https://doi.org/10.48550/arXiv.2409.09271

Wang, W., Liu, K., Chen, A.R., Li, G., Jin, Z., Huang, G., Ma, L.: Python symbolic execution with LLM-powered code generation. arXiv preprint arXiv:2409.09271 (2024).https://doi.org/10.48550/arXiv.2409.09271

work page doi:10.48550/arxiv.2409.09271 2024

[32] [32]

Individual Comparisons by Ranking Methods , volume=

Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1(6), 80–83 (1945).https://doi.org/10.2307/3001968

work page doi:10.2307/3001968 1945

[33] [33]

In: USENIX Security

Yun, I., Lee, S., Xu, M., Jang, Y., Kim, T.: QSYM: A practical concolic execution engine tailored for hybrid fuzzing. In: USENIX Security. pp. 745–761. USENIX Association (2018)

2018