MUCOCO: Automated Consistency Testing of Code LLMs

Chua Jin Chou; Ezekiel Soremekun; Khant That Lwin

arxiv: 2604.19086 · v1 · submitted 2026-04-21 · 💻 cs.SE

Pith reviewed 2026-05-10 02:53 UTC · model grok-4.3

keywords consistency testing · code LLMs · mutation analysis · semantic preservation · automated testing · inconsistent behaviors · LLM evaluation · program mutants

The pith

MUCOCO automatically generates semantically equivalent program mutants to expose inconsistent behaviors in code LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MUCOCO as a method that starts with a coding query and its program, then applies semantic-preserving mutations to create equivalent variants called mutants. It runs each mutant and the original through a code LLM and flags inconsistencies such as differing outputs or unexpected test failures. Existing benchmarks are mostly static and hand-crafted, so they miss this consistency property; MUCOCO offers an automated way to surface it at scale. Evaluation across four coding tasks and seven LLMs shows the approach finds inconsistencies in roughly one in seven generated inputs and beats the prior TURBULENCE baseline. The work argues that such testing should become part of routine evaluation for code-generating models.

Core claim

MUCOCO employs semantic-preserving mutation analysis to transform a given coding query's program into multiple semantically equivalent mutants, then detects cases where a code LLM produces different outputs or test results on the mutants versus the original program.
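The detection step in this claim can be sketched as a simple comparison oracle. This is a minimal illustration, not MUCOCO's implementation: the `Model` interface, `find_inconsistencies`, and the toy model are all hypothetical names introduced here, and a real setup would call an actual code LLM and might compare test verdicts rather than raw strings.

```python
from typing import Callable, Sequence

# Hypothetical interface: a model maps program source text to a response
# string (for example, a predicted output). It stands in for a real code LLM.
Model = Callable[[str], str]


def find_inconsistencies(model: Model, original: str, mutants: Sequence[str]) -> list[int]:
    """Return the indices of mutants whose response differs from the original's."""
    baseline = model(original).strip()
    return [i for i, mutant in enumerate(mutants) if model(mutant).strip() != baseline]


def toy_model(source: str) -> str:
    """Toy stand-in for an LLM: answers "3" unless it sees the identifier `tmp`."""
    return "0" if "tmp" in source else "3"


original = "def f(a, b):\n    return a + b\nprint(f(1, 2))"
mutants = [
    "def f(x, y):\n    return x + y\nprint(f(1, 2))",      # renamed parameters
    "def f(tmp, b):\n    return tmp + b\nprint(f(1, 2))",  # renamed, trips the toy model
]

flagged = find_inconsistencies(toy_model, original, mutants)
print(flagged)  # indices of mutants that exposed an inconsistency
```

Because every mutant is assumed to be equivalent to the original, any flagged index is attributed to the model; that attribution is only as sound as the equivalence assumption.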

What carries the argument

Semantic-preserving mutation analysis that creates equivalent program variants to probe for behavioral differences in LLM responses to the same underlying logic.
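As an illustration of what a semantic-preserving mutation can look like, the sketch below renames local identifiers with Python's standard `ast` module. The class `RenameVariables`, the helper `rename_mutant`, and the sample program are assumptions made for this example; the paper's actual operators are not described in this summary, and a production operator would also need to avoid name collisions, builtins, and keyword arguments at call sites.

```python
import ast


class RenameVariables(ast.NodeTransformer):
    """Rename identifiers according to a fixed mapping (hypothetical operator)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes inside the function body.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Function parameters in the signature.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def rename_mutant(source, mapping):
    """Parse the source, apply the renaming, and return the mutant's source text."""
    tree = RenameVariables(mapping).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


def run(source):
    """Execute a program that defines `total` and call it on a fixed input."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"]([1, 2, 3])


original = (
    "def total(values):\n"
    "    acc = 0\n"
    "    for v in values:\n"
    "        acc += v\n"
    "    return acc\n"
)
mutant = rename_mutant(original, {"values": "a", "acc": "b", "v": "c"})
print(mutant)
print(run(original), run(mutant))
```

The original and the mutant differ only in surface names, so a model that reasons about program semantics should treat them the same way.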

If this is right

MUCOCO can be applied to any coding task to generate test inputs that target consistency without hand-crafting new cases.
Roughly 15 percent of the inputs it produces expose inconsistencies across the tested LLMs and tasks.
The method outperforms the closest prior baseline TURBULENCE in the number of inconsistencies detected.
Routine use of automated consistency testing would reveal reliability gaps that static benchmarks currently overlook.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mutation-driven approach could be adapted to test other LLM properties such as robustness to minor input changes or resistance to prompt injection.
Embedding MUCOCO-style checks into developer workflows might reduce the risk of deploying code assistants that behave unpredictably on equivalent problems.
Widespread inconsistencies suggest that current LLMs do not maintain stable internal representations of program semantics across surface variations.

Load-bearing premise

The generated mutants must be truly semantically equivalent to the original, so any observed differences are caused by the LLM rather than by the mutation process or test setup.
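A small Python example shows why this premise needs checking rather than assuming. The rewrite below is a De Morgan-style transformation written for illustration only (it is not claimed to be one of the paper's operators): it agrees with the original on booleans but not on arbitrary operands, because Python's `and` returns one of its operands while `not` always returns a `bool`.

```python
def original(a, b):
    return a and b


def rewritten(a, b):
    # De Morgan-style rewrite of `a and b`.
    return not ((not a) or (not b))


# On boolean inputs the two versions agree.
bool_cases = [(x, y) for x in (True, False) for y in (True, False)]
same_on_bools = all(original(x, y) == rewritten(x, y) for x, y in bool_cases)

# On non-boolean operands they diverge: `[] and 1` evaluates to `[]`,
# while the rewritten form evaluates to `False`.
left, right = original([], 1), rewritten([], 1)
print(same_on_bools, left, right)
```

If a mutation operator of this kind were applied where the result is used as a value rather than as a condition, the mutant would not be equivalent, and any resulting difference would be wrongly charged to the LLM.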

What would settle it

A manual audit of the mutants that shows a substantial fraction are not semantically equivalent, or that inconsistencies trace to harness errors instead of the LLM, would undermine the claim that MUCOCO reliably exposes LLM inconsistencies.
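Part of such an audit could be automated with randomized differential testing of each mutant against its original. The sketch below is one possible shape for that check, under assumptions: `differential_audit`, `outcome`, the input generator, and the three `midpoint` functions are hypothetical, and agreement on random inputs is evidence of equivalence, not proof.

```python
import random


def outcome(fn, args):
    """Capture a call's observable behavior, including the exception type if it raises."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:
        return ("error", type(exc).__name__)


def differential_audit(original, mutant, gen_input, trials=1000, seed=0):
    """Return the first input on which the two callables disagree, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = gen_input(rng)
        if outcome(original, args) != outcome(mutant, args):
            return args
    return None


def midpoint(lo, hi):
    return (lo + hi) // 2


def midpoint_equiv(lo, hi):
    # Equivalent for integers: floor((lo + hi) / 2) == lo + floor((hi - lo) / 2).
    return lo + (hi - lo) // 2


def midpoint_broken(lo, hi):
    # Not equivalent: true division differs whenever hi - lo is odd.
    return lo + (hi - lo) / 2


def gen_pair(rng):
    return (rng.randint(-100, 100), rng.randint(-100, 100))


print(differential_audit(midpoint, midpoint_equiv, gen_pair))   # None: no disagreement found
print(differential_audit(midpoint, midpoint_broken, gen_pair))  # a counterexample pair
```

The broken variant agrees with the original on every input where `hi - lo` is even, which is why a small fixed test suite could miss it while random sampling finds it quickly.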

Figures

Figures reproduced from arXiv: 2604.19086 by Chua Jin Chou, Ezekiel Soremekun, Khant That Lwin.

**Figure 1.** MUCOCO workflow. An in-depth explanation of our correctness and inconsistency oracles is provided in Section 3.2. Key Insight: To automatically discover consistency errors, we propose an automated testing technique called MUCOCO. The main idea of our technique is to employ mutational analysis and metamorphic testing to detect consistency errors in Code LLMs. The key insight of MUCOCO is that a pair of co…

**Figure 2.** MUCOCO's lexical mutations (random and sequential renaming) and logical mutations (constant unfolding and DeMorgan) induce the most consistency errors across all settings.

Original abstract

Code LLMs often portray inconsistent program behaviors. Developers typically employ benchmarks to assess Code LLMs, but most benchmarks are hand-crafted, static and do not target consistency property. In this work, we pose the scientific question: how can we automatically discover inconsistent program behaviors in Code LLMs? To address this challenge, we propose an automated consistency testing method, called MUCOCO, which employs semantic-preserving mutation analysis to expose inconsistent behaviors in code LLMs. Given a coding query, MUCOCO automatically transforms its program into semantically equivalent programs (aka mutants) and detects inconsistencies between the mutants and the original program (e.g., different output or test failure). We evaluate MUCOCO using four (4) coding tasks and seven (7) LLMs. Results show that MUCOCO is effective in exposing inconsistency and outperforms the closest baseline (TURBULENCE). About one in seven (15%) inputs generated by MUCOCO exposed inconsistencies. Our work motivates the need to test Code LLMs for consistency property.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MUCOCO offers an automated mutation-based approach to test consistency in code LLMs but assumes mutant equivalence without strong validation.

MUCOCO is a new automated testing method that applies semantic-preserving mutation analysis to find inconsistent behaviors in code LLMs. The headline result is that it exposes inconsistencies in about 15% of the generated inputs and outperforms the TURBULENCE baseline across the evaluated tasks and models.

The paper does well in identifying a gap in existing benchmarks, which are mostly static and hand-crafted, and proposing a way to generate varied but equivalent programs to probe for consistency. The evaluation setup with multiple coding tasks and LLMs provides a reasonable starting point for seeing how the method performs in practice.

The main soft spot is the lack of rigorous validation for the semantic equivalence of the mutants. Without details on mutation operators, how equivalence is ensured, or any checks against confounding factors like test harness issues, it is difficult to be confident that the observed differences are solely due to the LLMs. This could inflate the inconsistency rate if some mutants are not truly equivalent.

Overall, this paper is for researchers in AI-assisted software engineering who are concerned with model reliability. A reader looking for empirical methods to test properties like consistency would find it relevant and worth considering for their own work. It deserves a serious referee because it introduces a novel technique with concrete results, even if the validation aspects need strengthening. I would recommend sending it for peer review.

Referee Report

3 major / 2 minor

Summary. The paper proposes MUCOCO, a method that applies semantic-preserving mutation analysis to automatically generate program variants (mutants) from coding queries and detect behavioral inconsistencies (e.g., differing outputs or test failures) in Code LLMs. It evaluates the approach on four coding tasks and seven LLMs, reporting that approximately 15% of MUCOCO-generated inputs expose inconsistencies and that the method outperforms the TURBULENCE baseline.

Significance. If the core assumption of semantic equivalence holds and the mutants are independently validated, MUCOCO would offer a practical, automated technique for probing consistency properties in code-generating LLMs, an area where most existing benchmarks are static and hand-crafted. The reported 15% inconsistency rate and outperformance of the baseline would provide concrete motivation for consistency testing in LLM-based software engineering tools.

major comments (3)

Abstract and evaluation description: The headline result (15% of inputs expose inconsistencies, outperforming TURBULENCE) is load-bearing on the claim that every generated mutant is behaviorally identical to the original on all inputs. The manuscript provides no formal semantics, equivalence proofs, differential testing on unseen inputs, or post-generation audits of the mutation operators (renaming, restructuring, constant folding, etc.). Without such validation, observed differences cannot be confidently attributed to the LLM rather than mutation artifacts or test-harness issues.
Evaluation section: No details are given on mutation validity checks, statistical significance tests for the 15% rate, or controls for confounding factors such as floating-point semantics, edge-case behavior, or LLM-specific code interpretations. This absence directly affects the soundness of the superiority claim over TURBULENCE and the generalizability of the inconsistency rate.
Method description: The paper relies on 'semantic-preserving mutation analysis' without specifying how equivalence is enforced or measured (e.g., via formal verification, exhaustive differential testing, or oracle-based checks). This is a central methodological gap for an empirical consistency-testing framework.

minor comments (2)

The abstract and introduction could more clearly distinguish between syntactic mutations and guaranteed semantic equivalence to avoid reader confusion about the mutation process.
Results tables or figures (if present) should include confidence intervals or p-values for the inconsistency rates and baseline comparisons to strengthen the empirical claims.
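The confidence intervals requested in the minor comments could be computed along the following lines. This sketch uses the Wilson score interval for a binomial proportion; the function name and, in particular, the counts are hypothetical, chosen only to match the "one in seven" ratio (about 14.3 percent, which the abstract rounds to 15 percent). The paper's raw counts are not given in this summary.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts for illustration only: 100 inconsistent inputs out of 700.
low, high = wilson_interval(successes=100, n=700)
print(f"rate = {100 / 700:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Reporting an interval of this kind for each model and task would show whether differences from the baseline exceed sampling noise.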

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive feedback. We agree that additional details on semantic equivalence validation and methodological rigor are needed. We address each major comment below and indicate planned revisions.

Referee: Abstract and evaluation description: The headline result (15% of inputs expose inconsistencies, outperforming TURBULENCE) is load-bearing on the claim that every generated mutant is behaviorally identical to the original on all inputs. The manuscript provides no formal semantics, equivalence proofs, differential testing on unseen inputs, or post-generation audits of the mutation operators (renaming, restructuring, constant folding, etc.). Without such validation, observed differences cannot be confidently attributed to the LLM rather than mutation artifacts or test-harness issues.

Authors: We appreciate this concern. Our operators follow standard semantic-preserving transformations from mutation testing (e.g., variable renaming and safe restructurings that maintain data/control flow). We will revise to add a dedicated subsection with operator examples, results from differential testing on held-out inputs, and post-generation audits confirming mutants pass original tests. We will explicitly note that formal proofs are outside the empirical scope of this work and constitute a separate theoretical contribution. These changes will better support attributing inconsistencies to the LLMs.
Revision: partial
Referee: Evaluation section: No details are given on mutation validity checks, statistical significance tests for the 15% rate, or controls for confounding factors such as floating-point semantics, edge-case behavior, or LLM-specific code interpretations. This absence directly affects the soundness of the superiority claim over TURBULENCE and the generalizability of the inconsistency rate.

Authors: We agree and will revise the Evaluation section to include: explicit mutation validity checks (compilation and original test passage), statistical significance testing for inconsistency rates and baseline comparisons, and discussion of confounding factors with mitigations (e.g., deterministic LLM sampling and fixed-precision environments). We will also expand the TURBULENCE comparison to clarify methodological differences supporting the outperformance claim.
Revision: yes
Referee: Method description: The paper relies on 'semantic-preserving mutation analysis' without specifying how equivalence is enforced or measured (e.g., via formal verification, exhaustive differential testing, or oracle-based checks). This is a central methodological gap for an empirical consistency-testing framework.

Authors: We acknowledge the gap. We will expand the Method section to detail equivalence enforcement: syntactic validity checks, oracle-based verification via execution on the original test suite, and manual review of sampled mutants. Specific operators and their semantic-preservation rationale (in supported languages) will be described. Formal verification is impractical for arbitrary code and will be noted as a limitation.
Revision: yes

standing simulated objections not resolved

Formal semantics, equivalence proofs, and exhaustive differential testing on unseen inputs for mutation operators, as these require a major theoretical extension beyond the current empirical study.

Circularity Check

0 steps flagged

No circularity: empirical method proposal with independent evaluation

The paper describes an empirical technique (MUCOCO) that applies mutation operators to generate program variants, runs them against LLMs, and measures observed behavioral differences. No equations, fitted parameters, uniqueness theorems, or predictive derivations appear in the provided text. The 15% inconsistency rate is reported as a direct experimental outcome rather than a quantity derived from or forced by the method's own inputs. Self-citations, if present, are not load-bearing for any central claim. The semantic-equivalence premise is a methodological assumption open to external validation or falsification and does not constitute a self-definitional or fitted-input reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that semantic equivalence can be automatically preserved through code mutations; no free parameters or new physical entities are introduced.

axioms (1)

Domain assumption: Semantic equivalence can be preserved through automated mutations in code.
Required for the mutants to serve as valid test cases for consistency.

invented entities (1)

MUCOCO (no independent evidence)
purpose: Automated consistency testing framework
The proposed method itself, with no independent evidence outside the paper's evaluation.
