arxiv: 2604.19086 · v1 · submitted 2026-04-21 · 💻 cs.SE

MUCOCO: Automated Consistency Testing of Code LLMs

Pith reviewed 2026-05-10 02:53 UTC · model grok-4.3

classification 💻 cs.SE
keywords: consistency testing · code LLMs · mutation analysis · semantic preservation · automated testing · inconsistent behaviors · LLM evaluation · program mutants

The pith

MUCOCO automatically generates semantically equivalent program mutants to expose inconsistent behaviors in code LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MUCOCO as a method that starts with a coding query and its program, then applies semantic-preserving mutations to create equivalent variants called mutants. It runs each mutant and the original through a code LLM and flags inconsistencies such as differing outputs or unexpected test failures. Existing benchmarks are mostly static and hand-crafted, so they miss this consistency property; MUCOCO offers an automated way to surface it at scale. Evaluation across four coding tasks and seven LLMs shows the approach finds inconsistencies in roughly one in seven generated inputs and beats the prior TURBULENCE baseline. The work argues that such testing should become part of routine evaluation for code-generating models.

Core claim

MUCOCO employs semantic-preserving mutation analysis to transform a given coding query's program into multiple semantically equivalent mutants, then detects cases where a code LLM produces different outputs or test results on the mutants versus the original program.
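The detection step in this claim can be sketched in a few lines of Python. This is a minimal illustration and not the paper's implementation: `find_inconsistencies` and `toy_model` are hypothetical names, and `toy_model` is a deliberately flawed stand-in for a code LLM on an output-prediction task.

```python
from typing import Callable, Iterable, List, Tuple


def find_inconsistencies(
    original: str,
    mutants: Iterable[str],
    ask_model: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (mutant, baseline_answer, mutant_answer) for every disagreement."""
    baseline = ask_model(original).strip()
    flagged = []
    for mutant in mutants:
        answer = ask_model(mutant).strip()
        # Equivalent programs should receive the same answer from the model.
        if answer != baseline:
            flagged.append((mutant, baseline, answer))
    return flagged


# Hypothetical stand-in for a code LLM: it "predicts" the printed output
# as 6, except when the program happens to use the identifier `tmp`.
def toy_model(program: str) -> str:
    return "5" if "tmp" in program else "6"


original = "def f(a, b):\n    return a * b\nprint(f(2, 3))"
mutants = [
    "def f(x, y):\n    return x * y\nprint(f(2, 3))",
    "def f(tmp, b):\n    return tmp * b\nprint(f(2, 3))",
]
report = find_inconsistencies(original, mutants, toy_model)
```

Here only the second mutant is flagged, because the toy model changes its answer after a pure renaming. A real harness would replace `toy_model` with a call to the model under test and would likely need a task-specific comparison instead of string equality.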

What carries the argument

Semantic-preserving mutation analysis that creates equivalent program variants to probe for behavioral differences in LLM responses to the same underlying logic.
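One lexical operator of this kind, identifier renaming, might look like the following sketch built on Python's standard `ast` module (`ast.unparse` needs Python 3.9 or later). The names `SequentialRenamer` and `rename_mutant` are illustrative assumptions, and the caller is assumed to pass only names that are safe to rename (no globals, builtins, or keyword arguments used by callers, and no collisions with the generated names).

```python
import ast


class SequentialRenamer(ast.NodeTransformer):
    """Rename the given identifiers to v0, v1, ... in the order supplied."""

    def __init__(self, names_to_rename):
        self.mapping = {name: f"v{i}" for i, name in enumerate(names_to_rename)}

    def visit_arg(self, node):
        # Function parameters are `ast.arg` nodes.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        # Every other use of a variable is an `ast.Name` node.
        node.id = self.mapping.get(node.id, node.id)
        return node


def rename_mutant(source, names_to_rename):
    """Return the source text with the chosen identifiers renamed."""
    tree = SequentialRenamer(names_to_rename).visit(ast.parse(source))
    return ast.unparse(tree)


source = "def area(width, height):\n    total = width * height\n    return total\n"
mutant = rename_mutant(source, ["width", "height", "total"])

# Sanity check: the original and the mutant compute the same result.
ns_original, ns_mutant = {}, {}
exec(source, ns_original)
exec(mutant, ns_mutant)
same = ns_original["area"](3, 4) == ns_mutant["area"](3, 4)
```

The function name `area` is left untouched so that callers and tests still resolve it; only the parameters and the local variable change.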

If this is right

  • MUCOCO can be applied to any coding task to generate test inputs that target consistency without hand-crafting new cases.
  • Roughly 15 percent of the inputs it produces expose inconsistencies across the tested LLMs and tasks.
  • The method outperforms the closest prior baseline TURBULENCE in the number of inconsistencies detected.
  • Routine use of automated consistency testing would reveal reliability gaps that static benchmarks currently overlook.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mutation-driven approach could be adapted to test other LLM properties such as robustness to minor input changes or resistance to prompt injection.
  • Embedding MUCOCO-style checks into developer workflows might reduce the risk of deploying code assistants that behave unpredictably on equivalent problems.
  • Widespread inconsistencies suggest that current LLMs do not maintain stable internal representations of program semantics across surface variations.

Load-bearing premise

The generated mutants must be truly semantically equivalent to the original, so any observed differences are caused by the LLM rather than by the mutation process or test setup.

What would settle it

A manual audit of the mutants that shows a substantial fraction are not semantically equivalent, or that inconsistencies trace to harness errors instead of the LLM, would undermine the claim that MUCOCO reliably exposes LLM inconsistencies.
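Part of such an audit could be automated with randomized differential testing, sketched below. This is an assumption about how an audit might be run, not a procedure described in the paper; `outcome`, `differential_audit`, and the example functions are hypothetical.

```python
import random


def outcome(fn, args):
    """Capture either the return value or the type of exception raised."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def differential_audit(original_fn, mutant_fn, gen_input, trials=1000, seed=0):
    """Return the first input on which the two functions differ, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = gen_input(rng)
        if outcome(original_fn, args) != outcome(mutant_fn, args):
            return args
    return None


def gen_pair(rng):
    return (rng.randint(-10, 10), rng.randint(-10, 10))


def original(a, b):
    return a // b


def renamed(x, y):          # a genuinely equivalent mutant
    return x // y


def truncating(a, b):       # NOT equivalent: truncates instead of flooring
    return int(a / b)


clean = differential_audit(original, renamed, gen_pair)
counterexample = differential_audit(original, truncating, gen_pair)
```

A `None` result only means that no difference was observed on the sampled inputs; it is evidence of equivalence, not a proof. Any counterexample, by contrast, shows that the mutant is not semantically equivalent and that an inconsistency flagged on it cannot be blamed on the LLM.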

Figures

Figures reproduced from arXiv: 2604.19086 by Chua Jin Chou, Ezekiel Soremekun, Khant That Lwin.

Figure 1
Figure 1: MUCOCO Workflow. An in-depth explanation of our correctness and inconsistency oracles is provided in Section 3.2. Key Insight: To automatically discover consistency errors, we propose an automated testing technique called MUCOCO. The main idea of our technique is to employ mutational analysis and metamorphic testing to detect consistency errors in Code LLMs. The key insight of MUCOCO is that a pair of co…
Figure 2
Figure 2 shows that MUCOCO’s lexical mutations (random and sequential renaming) and logical mutations (constant unfolding and DeMorgan) induce the most consistency errors across all settings.
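To make the DeMorgan operator concrete, here is a minimal sketch of how such a logical mutation could be written with Python's `ast` module. It is an illustration under assumptions, not the authors' code: `DeMorgan` and `demorgan_mutant` are hypothetical names, and `ast.unparse` needs Python 3.9 or later.

```python
import ast


class DeMorgan(ast.NodeTransformer):
    """Rewrite `not (a and b)` as `(not a) or (not b)`, and the dual for `or`."""

    def visit_UnaryOp(self, node):
        self.generic_visit(node)  # rewrite nested expressions first
        if isinstance(node.op, ast.Not) and isinstance(node.operand, ast.BoolOp):
            inner = node.operand
            flipped = ast.Or() if isinstance(inner.op, ast.And) else ast.And()
            negated = [ast.UnaryOp(op=ast.Not(), operand=v) for v in inner.values]
            return ast.BoolOp(op=flipped, values=negated)
        return node


def demorgan_mutant(source):
    """Return the source text with De Morgan's laws applied to negated and/or."""
    tree = DeMorgan().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


source = "def in_range(x, lo, hi):\n    return not (x < lo or x > hi)\n"
mutant = demorgan_mutant(source)

# Compare the original and the mutant on a small range of inputs.
ns_original, ns_mutant = {}, {}
exec(source, ns_original)
exec(mutant, ns_mutant)
agree = all(
    ns_original["in_range"](x, 0, 5) == ns_mutant["in_range"](x, 0, 5)
    for x in range(-2, 8)
)
```

Because `not` always yields a boolean and both forms evaluate their operands left to right with the same short-circuit points, the rewrite preserves results and side-effect order in Python.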
Original abstract

Code LLMs often portray inconsistent program behaviors. Developers typically employ benchmarks to assess Code LLMs, but most benchmarks are hand-crafted, static and do not target consistency property. In this work, we pose the scientific question: how can we automatically discover inconsistent program behaviors in Code LLMs? To address this challenge, we propose an automated consistency testing method, called MUCOCO, which employs semantic-preserving mutation analysis to expose inconsistent behaviors in code LLMs. Given a coding query, MUCOCO automatically transforms its program into semantically equivalent programs (aka mutants) and detects inconsistencies between the mutants and the original program (e.g., different output or test failure). We evaluate MUCOCO using four (4) coding tasks and seven (7) LLMs. Results show that MUCOCO is effective in exposing inconsistency and outperforms the closest baseline (TURBULENCE). About one in seven (15%) inputs generated by MUCOCO exposed inconsistencies. Our work motivates the need to test Code LLMs for consistency property.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MUCOCO, a method that applies semantic-preserving mutation analysis to automatically generate program variants (mutants) from coding queries and detect behavioral inconsistencies (e.g., differing outputs or test failures) in Code LLMs. It evaluates the approach on four coding tasks and seven LLMs, reporting that approximately 15% of MUCOCO-generated inputs expose inconsistencies and that the method outperforms the TURBULENCE baseline.

Significance. If the core assumption of semantic equivalence holds and the mutants are independently validated, MUCOCO would offer a practical, automated technique for probing consistency properties in code-generating LLMs, an area where most existing benchmarks are static and hand-crafted. The reported 15% inconsistency rate and outperformance of the baseline would provide concrete motivation for consistency testing in LLM-based software engineering tools.

major comments (3)
  1. Abstract and evaluation description: The headline result (15% of inputs expose inconsistencies, outperforming TURBULENCE) rests on the claim that every generated mutant is behaviorally identical to the original on all inputs. The manuscript provides no formal semantics, equivalence proofs, differential testing on unseen inputs, or post-generation audits of the mutation operators (renaming, restructuring, constant unfolding, etc.). Without such validation, observed differences cannot be confidently attributed to the LLM rather than mutation artifacts or test-harness issues.
  2. Evaluation section: No details are given on mutation validity checks, statistical significance tests for the 15% rate, or controls for confounding factors such as floating-point semantics, edge-case behavior, or LLM-specific code interpretations. This absence directly affects the soundness of the superiority claim over TURBULENCE and the generalizability of the inconsistency rate.
  3. Method description: The paper relies on 'semantic-preserving mutation analysis' without specifying how equivalence is enforced or measured (e.g., via formal verification, exhaustive differential testing, or oracle-based checks). This is a central methodological gap for an empirical consistency-testing framework.
minor comments (2)
  1. The abstract and introduction could more clearly distinguish between syntactic mutations and guaranteed semantic equivalence to avoid reader confusion about the mutation process.
  2. Results tables or figures (if present) should include confidence intervals or p-values for the inconsistency rates and baseline comparisons to strengthen the empirical claims.
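As an illustration of the interval estimate requested in the second minor comment, the sketch below computes a Wilson score interval for a proportion. The counts are hypothetical placeholders chosen only to match a 15% rate; the page does not report the paper's actual sample sizes, and `wilson_interval` is an illustrative helper name.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# HYPOTHETICAL counts: 150 inconsistency-exposing inputs out of 1000 generated.
low, high = wilson_interval(150, 1000)
print(f"observed rate 15.0%, 95% interval {low:.1%} to {high:.1%}")
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves better for small samples and for rates near 0 or 1, which matters if per-task or per-model rates are reported separately.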

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive feedback. We agree that additional details on semantic equivalence validation and methodological rigor are needed. We address each major comment below and indicate planned revisions.

  1. Referee: Abstract and evaluation description: The headline result (15% of inputs expose inconsistencies, outperforming TURBULENCE) rests on the claim that every generated mutant is behaviorally identical to the original on all inputs. The manuscript provides no formal semantics, equivalence proofs, differential testing on unseen inputs, or post-generation audits of the mutation operators (renaming, restructuring, constant unfolding, etc.). Without such validation, observed differences cannot be confidently attributed to the LLM rather than mutation artifacts or test-harness issues.

    Authors: We appreciate this concern. Our operators follow standard semantic-preserving transformations from mutation testing (e.g., variable renaming and safe restructurings that maintain data/control flow). We will revise to add a dedicated subsection with operator examples, results from differential testing on held-out inputs, and post-generation audits confirming mutants pass original tests. We will explicitly note that formal proofs are outside the empirical scope of this work and constitute a separate theoretical contribution. These changes will better support attributing inconsistencies to the LLMs. revision: partial

  2. Referee: Evaluation section: No details are given on mutation validity checks, statistical significance tests for the 15% rate, or controls for confounding factors such as floating-point semantics, edge-case behavior, or LLM-specific code interpretations. This absence directly affects the soundness of the superiority claim over TURBULENCE and the generalizability of the inconsistency rate.

    Authors: We agree and will revise the Evaluation section to include: explicit mutation validity checks (compilation and original test passage), statistical significance testing for inconsistency rates and baseline comparisons, and discussion of confounding factors with mitigations (e.g., deterministic LLM sampling and fixed-precision environments). We will also expand the TURBULENCE comparison to clarify methodological differences supporting the outperformance claim. revision: yes

  3. Referee: Method description: The paper relies on 'semantic-preserving mutation analysis' without specifying how equivalence is enforced or measured (e.g., via formal verification, exhaustive differential testing, or oracle-based checks). This is a central methodological gap for an empirical consistency-testing framework.

    Authors: We acknowledge the gap. We will expand the Method section to detail equivalence enforcement: syntactic validity checks, oracle-based verification via execution on the original test suite, and manual review of sampled mutants. Specific operators and their semantic-preservation rationale (in supported languages) will be described. Formal verification is impractical for arbitrary code and will be noted as a limitation. revision: yes
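The validity checks the simulated authors describe (compilation plus passage of the original tests) could take roughly the following shape. This is a sketch under assumptions: `passes_validity_checks` and the example test cases are hypothetical, and a real harness would run untrusted code in a sandbox instead of calling `exec` in-process.

```python
def passes_validity_checks(mutant_source, entry_point, test_cases):
    """Return True if the mutant compiles and passes every original test case."""
    try:
        code = compile(mutant_source, "<mutant>", "exec")
    except SyntaxError:
        return False  # the mutation produced syntactically invalid code
    namespace = {}
    try:
        exec(code, namespace)
        fn = namespace[entry_point]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # a runtime error or missing entry point also fails


# Hypothetical original test suite: (arguments, expected result) pairs.
tests = [((2, 3), 6), ((0, 5), 0), ((-1, 4), -4)]

good_mutant = "def mul(v0, v1):\n    return v0 * v1\n"
bad_mutant = "def mul(v0, v1):\n    return v0 + v1\n"
broken_mutant = "def mul(v0, v1)\n    return v0 * v1\n"

results = [
    passes_validity_checks(src, "mul", tests)
    for src in (good_mutant, bad_mutant, broken_mutant)
]
```

Passing the original tests filters out mutants that are visibly broken, but it is a weaker guarantee than semantic equivalence: a mutant can pass a small test suite and still differ on inputs the suite never exercises.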

standing simulated objections not resolved
  • Formal semantics, equivalence proofs, and exhaustive differential testing on unseen inputs for mutation operators, as these require a major theoretical extension beyond the current empirical study.

Circularity Check

0 steps flagged

No circularity: empirical method proposal with independent evaluation

The paper describes an empirical technique (MUCOCO) that applies mutation operators to generate program variants, runs them against LLMs, and measures observed behavioral differences. No equations, fitted parameters, uniqueness theorems, or predictive derivations appear in the provided text. The 15% inconsistency rate is reported as a direct experimental outcome rather than a quantity derived from or forced by the method's own inputs. Self-citations, if present, are not load-bearing for any central claim. The semantic-equivalence premise is a methodological assumption open to external validation or falsification and does not constitute a self-definitional or fitted-input reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that semantic equivalence can be automatically preserved through code mutations; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption: Semantic equivalence can be preserved through automated mutations in code.
    Required for the mutants to serve as valid test cases for consistency.
invented entities (1)
  • MUCOCO (no independent evidence)
    purpose: Automated consistency testing framework
    The proposed method itself, with no independent evidence outside the paper's evaluation.

pith-pipeline@v0.9.0 · 5476 in / 1190 out tokens · 50986 ms · 2026-05-10T02:53:16.793020+00:00 · methodology

