Quality Assurance of LLM-generated Code: Addressing Non-Functional Quality Characteristics

Christoph Kessler; Daniel Ståhl; Kristian Sandahl; Xin Sun

arxiv: 2511.10271 · v2 · submitted 2025-11-13 · 💻 cs.SE · cs.AI

Xin Sun, Daniel Ståhl, Kristian Sandahl, Christoph Kessler

Pith reviewed 2026-05-17 22:48 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords LLM code generation · non-functional quality characteristics · ISO/IEC 25010 · software quality assurance · maintainability · security · performance efficiency · technical debt

The pith

LLM-generated code can pass functional tests yet still fall short on non-functional qualities such as maintainability and security.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper combines a literature review of 109 papers, two industry workshops, and experiments that patch real-world issues with three LLMs to examine non-functional quality characteristics, guided by the ISO/IEC 25010 model. Academic work centers on security, performance efficiency, and maintainability, while practitioners stress maintainability and readability to prevent the buildup of technical debt. The empirical results indicate that prompt-based adjustments fail to deliver stable improvements in these areas during practical use. The core finding is a gap between research focus, industry needs, and actual model outputs, which calls for built-in quality assurance in generation pipelines.

Core claim

Guided by the ISO/IEC 25010 quality model, the multi-methods study shows existing research primarily emphasizes security, performance efficiency, and maintainability; practitioners instead prioritize maintainability and readability; and empirical patching of real-world issues with three LLMs reveals instability when attempting to optimize these attributes through prompts.

What carries the argument

Multi-methods evaluation that combines a review of 109 papers, practitioner workshops, and empirical tests of LLM-generated patches on real software issues, centered on the attributes of security, maintainability, and performance efficiency.

If this is right

Quality assurance mechanisms need to be integrated into LLM code generation pipelines so that outputs meet quality criteria, not only functional tests.
Prompt engineering alone proves unstable for reliably improving non-functional qualities in practical software engineering tasks.
Generated code risks accelerating technical debt accumulation if maintainability and readability remain unaddressed.
Research attention should broaden to understudied quality attributes beyond the current emphasis on security, performance, and maintainability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Teams adopting LLMs for code may require new post-generation review tools focused on maintainability metrics.
Future model training could benefit from datasets weighted toward readable and maintainable examples to reduce the observed gaps.
Widespread use of current LLMs for development might increase long-term maintenance costs unless quality controls are added.

Load-bearing premise

The three chosen LLMs, the selected real-world issues, and the specific metrics for security, maintainability, and performance efficiency are representative enough to support general claims about LLM-generated code quality.

What would settle it

A larger study with additional LLMs, more varied real-world issues, and broader quality metrics that finds stable high scores across non-functional attributes would indicate the observed misalignment does not hold generally.

Figures

Figures reproduced from arXiv: 2511.10271 by Christoph Kessler, Daniel Ståhl, Kristian Sandahl, Xin Sun.

**Figure 1.** Search and selection process. Adjacent paper text (Section 3.2, Workshop Design and Execution): To complement and validate the findings of our literature review, we held two interactive workshops with industry experts from several organizations. These organizations are actively exploring the integration of LLMs into large-scale software systems and seeking to ensure the reliability of LLMs’ outputs. The workshops are designed to answer t…

**Figure 2.** Overview of the experiment procedure. The workflow is divided into three main…

**Figure 3.** SWE-bench Lite Instance Structure. Adjacent paper text: …itself focuses on functional correctness, its automated patch verification system provides a reliable foundation on which we build our NFQC evaluation pipeline. This enables us to analyze NFQCs of functionally correct patches without modifying the benchmark itself.

**Figure 4.** SWE-agent Configurations. Adjacent paper text: The first component focused on functional verification and performance measurement. For each instance, we used Docker to create an isolated environment, cloned the corresponding repository, applied the generated patch, and executed the test cases provided by SWE-bench Lite. This step verified whether the patch was functionally correct and resolved the given issue. For patches that…

**Figure 5.** The year distribution of papers.

**Figure 6.** Word cloud of the identified literature.

**Figure 7.** The distribution of papers by NFQCs.

**Figure 8.** Patch generation results of different models in the baseline evaluation. The total…

**Figure 9.** Patch generation results under different prompt strategies optimized for different…

**Figure 10.** Boxplots of test runtime (s) and memory usage (MB) on a logarithmic scale for…

**Figure 11.** Per-instance comparison across models (GPT-4o, DeepSeek, Claude-Sonnet…

**Figure 12.** Difference heatmaps for GPT-4o under four NFQC-specific optimizations. Each…

**Figure 13.** Difference heatmaps for DeepSeek-Reasoner under four NFQC-specific op…

**Figure 14.** Difference heatmaps for Claude-Sonnet-4 under four NFQC-specific optimiza…

**Figure 15.** Example of malformed patch error during patch application.
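
The text accompanying Figure 4 describes the functional verification step as: create an isolated Docker environment, clone the repository, apply the generated patch, and run the tests. A minimal sketch of that sequence might look as follows. It is not the paper's or SWE-bench's actual harness: the function names, the `python:3.11-slim` image, the `/tmp/instance` working directory, and the choice to clone on the host and mount the checkout into the container are all assumptions made for illustration.

```python
import subprocess


def build_verification_commands(repo_url, base_commit, patch_file, test_command,
                                image="python:3.11-slim", workdir="/tmp/instance"):
    """Return the command sequence for one instance without running anything.

    All defaults are hypothetical; patch_file should be an absolute path
    because `git -C` changes the directory before the path is resolved.
    """
    return [
        # 1. Clone the repository the issue belongs to.
        ["git", "clone", repo_url, workdir],
        # 2. Check out the commit the issue was reported against.
        ["git", "-C", workdir, "checkout", base_commit],
        # 3. Apply the LLM-generated patch.
        ["git", "-C", workdir, "apply", patch_file],
        # 4. Run the tests in a throwaway container with the checkout mounted.
        ["docker", "run", "--rm", "-v", f"{workdir}:/repo", "-w", "/repo",
         image, "sh", "-c", test_command],
    ]


def run_verification(commands):
    """Run the commands in order; return True only if every step exits with 0."""
    for command in commands:
        if subprocess.run(command).returncode != 0:
            return False
    return True
```

Separating command construction from execution keeps the sequence inspectable and testable without Docker installed. In this sketch, a malformed patch of the kind shown in Figure 15 would surface as a non-zero exit from the `git apply` step.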

Original abstract

In recent years, large language models have been widely integrated into software engineering workflows, supporting tasks like code generation. While prior evaluations focus on functional correctness, there is still a limited understanding of the non-functional quality characteristics of generated code. Guided by the ISO/IEC 25010 quality model, this study adopts a multi-methods approach comprising three complementary elements: a literature review of 109 papers, two industry workshops with practitioners from multiple organizations, and an empirical analysis of patching real-world software issues using three LLMs. Motivated by insights from both the literature and practitioners, the empirical study examined the quality of generated patches regarding security, maintainability, and performance efficiency, which were identified as critical code-level quality attributes. Our results indicate that existing research primarily emphasizes security, performance efficiency, and maintainability, while other quality attributes are understudied. In contrast, practitioners prioritize maintainability and readability, warning that generated code may accelerate the accumulation of technical debt. The empirical evaluation demonstrates the instability of optimizing NFQCs through prompts in practical software engineering settings. Overall, our findings expose a misalignment between academic focus, industry priorities, and observed model behavior, highlighting the need to integrate quality assurance mechanisms into LLM code generation pipelines to ensure that future generated code not only passes tests but truly passes with quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows a practical mismatch in priorities for non-functional qualities in LLM-generated code, but the experiments, with three models and a selected set of issues, limit how far the misalignment claim travels.

The main thing here is that prompting LLMs for better security, maintainability, and performance efficiency turns out to be unstable in the experiments, while academics and practitioners are looking at different parts of the quality picture. That gap is the clearest signal in the work.

The literature review of 109 papers lines up academic attention mostly on security, performance efficiency, and maintainability. The two workshops surface practitioner concerns about readability and the risk of building up technical debt faster with generated code. The patching experiments then test three LLMs on real-world issues and report that prompt tweaks do not reliably improve those attributes. This multi-method setup is straightforward and connects the dots without overclaiming a new theory. The ISO 25010 framing keeps the quality attributes concrete and relevant to software engineering practice. The instability result adds a useful empirical note to discussions about putting LLMs into real code pipelines.

The soft spot is scope. Three LLMs and a limited set of issues make it hard to know whether the observed instability and the claimed misalignment hold more broadly. Newer models or different codebases could shift the picture, and the abstract leaves the exact measurement steps for the quality attributes unclear. Without those details it is difficult to judge how robust the patch evaluations are.

This paper is aimed at software engineering researchers and tool developers who work on LLM-assisted coding. It gives them a grounded view of where current generation approaches fall short on qualities beyond functional tests. I would send it for peer review so referees can examine the experimental protocol and check how far the findings extend.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLM-generated code exhibits a misalignment between academic research priorities (emphasizing security, performance efficiency, and maintainability per a review of 109 papers), industry/practitioner priorities (maintainability and readability to avoid technical debt, from two workshops), and observed LLM behavior (instability when optimizing these NFQCs via prompts in an empirical patching study using three LLMs on real-world issues, guided by ISO/IEC 25010). It concludes that quality assurance mechanisms must be integrated into LLM code generation pipelines.

Significance. If the empirical instability results and misalignment hold under broader testing, the work would be significant for SE practice by motivating QA integration beyond functional correctness testing. The multi-methods design (literature synthesis + practitioner input + direct experiments) is a strength that provides contextual grounding and falsifiable observations about prompt-based NFQC optimization.

major comments (2)

[Empirical evaluation] Empirical evaluation section: the abstract and empirical component provide no detail on concrete metrics, tools, or procedures used to assess security, maintainability, and performance efficiency of the generated patches, nor on inter-rater reliability or statistical tests. This directly affects the load-bearing claim of 'instability of optimizing NFQCs through prompts' and the resulting misalignment conclusion.
[Empirical study] Empirical study (patching real-world issues): the representativeness of the three chosen LLMs and the selected real-world issues is not justified or tested via sensitivity analysis. Because the central misalignment claim and call for QA mechanisms rest on generalizing from these specific observations, limited sampling risks making the instability finding artifactual rather than general.

minor comments (2)

[Abstract] Abstract: the summary of results could preview one or two concrete observations from the empirical patching (e.g., specific instability patterns) to better orient readers before the full methods are described.
[Literature review] Ensure the literature review section explicitly states the search strategy, inclusion/exclusion criteria, and any coding scheme used to categorize the 109 papers' focus on NFQCs, to support the academic-vs-industry comparison.
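
The inter-rater reliability that the first major comment asks about is commonly reported as Cohen's kappa, which corrects the observed agreement between two raters for the agreement expected by chance. The sketch below is a plain-Python illustration of that statistic. The label values and the function name are hypothetical, and nothing here is taken from the paper's own assessment procedure.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)

    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for everything.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, if two reviewers rate four patches as `["ok", "ok", "bad", "bad"]` and `["ok", "bad", "bad", "bad"]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.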

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the transparency and generalizability of our empirical component. We address each major comment below, indicating planned revisions to the manuscript.

Referee: [Empirical evaluation] Empirical evaluation section: the abstract and empirical component provide no detail on concrete metrics, tools, or procedures used to assess security, maintainability, and performance efficiency of the generated patches, nor on inter-rater reliability or statistical tests. This directly affects the load-bearing claim of 'instability of optimizing NFQCs through prompts' and the resulting misalignment conclusion.

Authors: We agree that additional methodological detail is needed to support the instability claim. In the revised manuscript we will expand the empirical evaluation section with: (1) explicit metrics for each NFQC (e.g., maintainability via cyclomatic complexity, code duplication, and cognitive complexity measured with SonarQube; security via static analysis with CodeQL and Bandit for common vulnerabilities; performance efficiency via execution time and memory profiling on standardized benchmarks); (2) the exact prompting templates and patch-generation procedure; (3) inter-rater reliability statistics (Cohen’s kappa) for any manual quality assessments performed by the authors; and (4) the statistical tests applied (Wilcoxon signed-rank tests with effect sizes and p-values) to compare prompt-optimization outcomes. These additions will make the evidence for prompt instability more transparent and reproducible.

revision: yes

Referee: [Empirical study] Empirical study (patching real-world issues): the representativeness of the three chosen LLMs and the selected real-world issues is not justified or tested via sensitivity analysis. Because the central misalignment claim and call for QA mechanisms rest on generalizing from these specific observations, limited sampling risks making the instability finding artifactual rather than general.

Authors: We acknowledge the sampling limitation. The three LLMs were chosen to represent distinct model families (closed-source frontier, hybrid, and open-source) that were publicly accessible during the study period; the issues were selected from actively maintained GitHub repositories with real developer-reported bugs. A full sensitivity analysis across additional models and issue corpora was not feasible within the original resource constraints. In the revision we will (a) add an explicit justification subsection for the model and issue selection criteria, (b) include a dedicated “Threats to Validity and Limitations” paragraph that discusses the risks of limited sampling and the exploratory nature of the study, and (c) moderate the generalization language while retaining the core observation that prompt-based NFQC optimization proved unstable in the tested settings. We view this as a partial revision because a comprehensive sensitivity study would require a follow-up experiment beyond the scope of the current work.

revision: partial
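
The first response above names cyclomatic complexity as one maintainability metric. Because the rebuttal is simulated, the tools it lists are not confirmed as the paper's actual setup, and the sketch below does not reproduce what SonarQube computes. It is only a rough, standard-library approximation of the McCabe idea (one plus the number of decision points) for Python source, with a hypothetical function name and a deliberately small set of counted node types.

```python
import ast

# Node types counted as one decision point each (a simplifying assumption).
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approximate_cyclomatic_complexity(source):
    """Approximate cyclomatic complexity of a Python source string.

    Starts at 1 and adds 1 per branching construct; each extra operand in
    an `and`/`or` chain also adds 1, since it introduces another path.
    """
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            complexity += len(node.values) - 1
    return complexity
```

Applied to the source of a file before and after a patch, the difference between the two values gives a crude signal of whether the patch made the control flow more or less branchy. Comparing such per-patch values across prompt strategies is the kind of paired data the Wilcoxon signed-rank test mentioned in the response would operate on.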

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent empirical sources

The paper's central claim of misalignment between academic focus, industry priorities, and LLM behavior is derived from three independent elements: a review of 109 external papers, two practitioner workshops, and direct experiments patching real-world issues with three LLMs while measuring security, maintainability, and performance efficiency per ISO 25010. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations reduce any result to the paper's own inputs by construction. The instability observation and call for QA mechanisms follow directly from the reported empirical outcomes rather than from renaming or smuggling prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical study in software engineering with no mathematical derivations, so the ledger contains no free parameters, axioms, or invented entities beyond standard methodological choices.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Our results indicate that existing research primarily emphasizes security, performance efficiency, and maintainability... improvements in one quality dimension often come at the cost of others.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear

risk_score = Σ W_severity × W_precision × trigger_count
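
The quoted formula sums, over static-analysis findings, the product of a severity weight, a precision weight, and the number of times a rule was triggered. A small sketch of that arithmetic is shown below. The weight tables, the label names, and the `RuleFinding` type are hypothetical, since the quoted passage gives only the formula and not the values the paper uses.

```python
from dataclasses import dataclass

# Hypothetical weights; the quoted passage does not state the actual values.
SEVERITY_WEIGHT = {"error": 3.0, "warning": 2.0, "note": 1.0}
PRECISION_WEIGHT = {"high": 1.0, "medium": 0.5, "low": 0.25}


@dataclass(frozen=True)
class RuleFinding:
    """One analysis rule and how often it fired on a patch."""
    rule_id: str
    severity: str
    precision: str
    trigger_count: int


def risk_score(findings):
    """risk_score = sum of W_severity * W_precision * trigger_count."""
    return sum(
        SEVERITY_WEIGHT[f.severity] * PRECISION_WEIGHT[f.precision] * f.trigger_count
        for f in findings
    )
```

With these illustrative weights, a high-precision error rule firing twice contributes 3.0 × 1.0 × 2 = 6.0, so frequent, severe, reliable findings dominate the score while low-precision notes barely move it.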

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Contract Based Verification of Non-functional Requirements for Embedded Automotive C Code
cs.PL · 2026-05 · unverdicted · novelty 6.0
The authors define general non-functional rules for C modules, propose an interface contract language, implement a Frama-C checker plugin, and demonstrate verification on two Scania truck codebases alongside ACSL func...

"Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution
cs.SE · 2026-05 · unverdicted · novelty 5.0
Empirical study finds coding agents produce fewer and less intense tangled refactorings than humans on Multi-SWE-bench; a refactoring-aware refinement improves compilability from 19.34% to 38.33% and resolves 2.79% mo...

Sustainable Code Generation Using Large Language Models: A Systematic Literature Review
cs.SE · 2026-03 · unverdicted · novelty 3.0
A systematic review finds research on the sustainability of LLM-generated code to be limited, fragmented, and without accepted frameworks for measurement or benchmarking.

