Precision or Peril: A PoC of Python Code Quality from Quantized Large Language Models

Adam J. Torek; Casey Kennington; Eric L. Melin; Nasir U. Eisty

arxiv: 2411.10656 · v2 · submitted 2024-11-16 · 💻 cs.SE

Pith reviewed 2026-05-23 17:28 UTC · model grok-4.3

classification 💻 cs.SE

keywords large language models · code generation · quantization · python · code quality · static analysis · software engineering

The pith

Smaller LLMs generate functional Python code yet show limited benchmark scores, inconsistent quantization effects, and recurring quality issues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests four open-source LLMs on Python code generation tasks and measures how 8-bit and 4-bit quantization changes the output. It applies code similarity metrics plus static analysis to detect maintainability problems. Results indicate that the models produce working code but fall short on benchmarks, that quantization effects differ across models, and that the code frequently triggers quality warnings. A reader would care because smaller quantized models promise lower hardware costs while still risking code that needs extra review before use in real projects.
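To make the 8-bit versus 4-bit trade-off concrete, the sketch below applies a simple symmetric round-to-nearest quantizer to a short list of numbers and measures the round-trip error. It is an illustration only: the summary does not say which quantization scheme the paper used, and `quantize_symmetric`, `mean_abs_error`, and the weight values are hypothetical names and numbers introduced here.

```python
def quantize_symmetric(weights, bits):
    """Round floats to signed integers of the given width, then map them back."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    # Clamp to the representable integer range before restoring to floats.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Hypothetical weight values, chosen only for illustration (not from the paper).
weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.10, -0.66, 0.48]
error_8bit = mean_abs_error(weights, quantize_symmetric(weights, 8))
error_4bit = mean_abs_error(weights, quantize_symmetric(weights, 4))
print(f"8-bit mean abs error: {error_8bit:.5f}")
print(f"4-bit mean abs error: {error_4bit:.5f}")
```

Fewer bits leave fewer representable levels, so the 4-bit round trip loses more information on this toy input. Whether that loss shows up in generated code is the empirical question the paper examines, and its reported answer is that the effect varies by model.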

Core claim

Four smaller open-source LLMs produce functional Python code on standard benchmarks, although performance stays limited. Eight-bit and four-bit quantization yield variable changes in that performance. Static analysis of the generated code identifies repeated quality and maintainability shortfalls, supporting the conclusion that LLM output requires careful validation before integration into software projects.
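One way to read "functional" here is that a generated function passes a set of input/output checks. The sketch below shows that idea in its simplest form. It is an assumption about how such a check might look, not the paper's harness, and `passes_tests`, the sample snippets, and the test cases are hypothetical.

```python
def passes_tests(candidate_source, entry_point, test_cases):
    """Execute candidate code and compare its results with (args, expected) pairs."""
    namespace = {}
    try:
        # Executing model output is unsafe outside a sandbox; this is a toy check.
        exec(candidate_source, namespace)
        function = namespace[entry_point]
        return all(function(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing entry points, and runtime errors all count as failures.
        return False


# Hypothetical generated snippets and test cases, for illustration only.
good = "def add(a, b):\n    return a + b\n"
broken = "def add(a, b):\n    return a - b\n"
cases = [((1, 2), 3), ((-1, 1), 0)]

print(passes_tests(good, "add", cases))
print(passes_tests(broken, "add", cases))
```

A production harness would normally run each candidate in an isolated process with a timeout. Passing such checks says nothing about maintainability, which is why the paper adds static analysis on top.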

What carries the argument

Code similarity metrics paired with static code quality assessment applied to LLM outputs across full-precision, 8-bit, and 4-bit quantized versions.
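As a rough picture of what a code similarity metric measures, the sketch below computes clipped token n-gram precision between a generated snippet and a reference solution, using only the standard library. It is a deliberately simplified stand-in and not CodeBLEU or whichever metric the paper applied; `code_tokens`, `ngram_precision`, and the two snippets are hypothetical.

```python
import io
import tokenize
from collections import Counter

# Token types that carry layout or comments rather than program content.
SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
           tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def code_tokens(source):
    """Return the lexical tokens of a Python snippet, ignoring layout and comments."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in SKIPPED]


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n=2):
    """Fraction of candidate n-grams that also occur in the reference (clipped)."""
    cand = ngram_counts(code_tokens(candidate), n)
    ref = ngram_counts(code_tokens(reference), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical reference and generated snippets, for illustration only.
reference = "def add(a, b):\n    return a + b\n"
candidate = "def add(x, y):\n    # sum two numbers\n    return x + y\n"
score = ngram_precision(candidate, reference)
print(f"bigram precision: {score:.2f}")
```

The two snippets behave identically, yet renaming the parameters lowers the score. Surface similarity and functional correctness can therefore diverge, which is one reason the proxy question raised later in the referee report matters. Note that `tokenize` raises an error on syntactically incomplete input, so real model output would need guarding.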

If this is right

Smaller LLMs remain usable for code generation but cannot replace larger models on demanding benchmarks.
Quantization decisions must be tested per model instead of treated as uniformly safe or harmful.
Static analysis tools flag maintainability issues often enough that they become a required post-generation step.
Projects that adopt LLM code must add explicit validation stages before deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams could pair the models with automated linters or refactoring passes to reduce the flagged maintainability problems.
Real-world project codebases may expose different quality patterns than the chosen academic benchmarks.
The compute savings from quantization might be partly offset by extra human review time on the generated code.
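The first extension above, an automated lint pass over generated code, could be prototyped with Python's built-in `ast` module. The sketch below flags three simple smells. The rules, the `max_branches` threshold, and the `maintainability_warnings` name are assumptions made for illustration; they are not the analyzers or rule set used in the paper.

```python
import ast


def maintainability_warnings(source, max_branches=5):
    """Return sorted (line, message) pairs for a few simple maintainability smells."""
    warnings = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                warnings.append(
                    (node.lineno, f"function '{node.name}' has no docstring"))
            # Count branching statements anywhere inside the function body.
            branches = sum(
                isinstance(child, (ast.If, ast.For, ast.While, ast.Try))
                for child in ast.walk(node))
            if branches > max_branches:
                warnings.append(
                    (node.lineno,
                     f"function '{node.name}' has {branches} branching statements"))
        elif isinstance(node, ast.ExceptHandler) and node.type is None:
            warnings.append((node.lineno, "bare 'except:' hides unexpected errors"))
    return sorted(warnings)


# Hypothetical generated snippet, for illustration only.
generated = '''
def parse(value):
    try:
        return int(value)
    except:
        return None
'''

for line, message in maintainability_warnings(generated):
    print(f"line {line}: {message}")
```

The snippet works for valid input, but it has no docstring and swallows every exception. Functional tests alone would not surface either issue.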

Load-bearing premise

The selected Python benchmarks, similarity metrics, and static analysis tools give a representative picture of the code quality and maintainability that would matter in actual software projects.

What would settle it

A study that inserts the generated code into an existing open-source repository, runs the project's full test suite, and tracks the developer hours needed to reach merge-ready quality would falsify the quality-concern claim if those hours prove near zero.

Original abstract

Context: Large Language Models (LLMs) like GPT-5 and LLaMA-405b exhibit advanced code generation abilities, but their deployment demands substantial computation resources and energy. Quantization can reduce memory footprint and hardware requirements, yet may degrade code quality. Objective: This study investigates code generation performance of smaller LLMs, examines the effect of quantization, and identifies common code quality issues as a proof of concepts (PoC). Method: Four open-source LLMs are evaluated on Python benchmarks using code similarity metrics, with an analysis on 8-bit and 4-bit quantization, alongside static code quality assessment. Results: While smaller LLMs can generate functional code, benchmark performance is limited. Quantization impacts are variable, and generated code exhibits quality and maintainability concerns. Conclusions: LLM-generated code should be carefully validated before integration into software projects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a narrow PoC that benchmarks quantized small LLMs on Python code and reports the usual quality shortfalls, adding incremental data without new methods or strong claims.

The paper checks four open-source LLMs on Python tasks, measures how 8-bit and 4-bit quantization changes output, and flags quality problems via similarity scores and static analyzers. It finds that the models produce working code but score low on benchmarks, quantization effects vary, and the generated code raises maintainability flags. That is the core result, and it lines up with what the abstract states.

The work is useful as a data point for anyone running smaller models locally for code tasks, since it documents the trade-offs in one concrete setting. The authors stick to standard evaluation tools and report the outcomes plainly, which keeps the claims grounded. No circular math or invented metrics appear.

The main limitation is that code similarity and static analysis serve as stand-ins for actual project quality. Those proxies can miss integration problems, long-term readability, or team-specific standards that matter in real codebases. The study is labeled a proof-of-concept, so it does not overreach, but readers still need to judge how far the numbers travel beyond the chosen benchmarks. Sample sizes, exact model versions, and statistical tests are not visible from the abstract, which leaves some uncertainty about how stable the differences are.

This paper fits a software engineering workshop or short conference track where people track LLM tooling limits. It will not reshape the field, but the empirical record is worth having on file. I would send it for peer review so the methods and data can be checked in full; the setup is straightforward enough that referees can evaluate it quickly.

Referee Report

2 major / 2 minor

Summary. The paper presents a proof-of-concept (PoC) empirical study evaluating four smaller open-source LLMs on Python code generation tasks using standard benchmarks. It measures performance via code similarity metrics, assesses the variable effects of 8-bit and 4-bit quantization, and applies static analysis to identify quality and maintainability issues in the generated code. The central claim is that while functional code can be produced, benchmark results are limited, quantization impacts are inconsistent, and the outputs raise quality concerns that necessitate careful validation before use in projects.

Significance. If the reported benchmark runs and static analyses hold under scrutiny, the work supplies timely preliminary evidence on the practical trade-offs of deploying quantized smaller LLMs for code generation, underscoring maintainability risks that are often overlooked in favor of functional correctness. As an explicitly scoped PoC it does not claim generalizability, but it can usefully inform follow-on studies that validate the proxy metrics against real project outcomes.

major comments (2)

[Method] Method section: the evaluation rests on the assumption that the chosen Python benchmarks, similarity metrics, and static analyzers constitute a reliable proxy for real-world code quality and maintainability. No validation or correlation study against actual software projects is described, which directly limits the strength of the claim that generated code 'exhibits quality and maintainability concerns.'
[Results] Results: without reported details on exact model variants, benchmark selection criteria, sample sizes per task, statistical tests, or data exclusion rules, it is not possible to assess whether the stated performance limitations and variable quantization effects are robustly supported by the underlying runs.

minor comments (2)

[Abstract] Abstract: 'proof of concepts (PoC)' should read 'proof of concept (PoC)'.
The manuscript would benefit from an explicit limitations subsection that discusses the scope of the PoC and the unvalidated status of the quality proxies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. As the work is scoped as an explicit PoC, we have revised the manuscript to strengthen reporting of experimental details and to more clearly delimit the claims made about proxy metrics. Both major comments are addressed below.

Referee: [Method] Method section: the evaluation rests on the assumption that the chosen Python benchmarks, similarity metrics, and static analyzers constitute a reliable proxy for real-world code quality and maintainability. No validation or correlation study against actual software projects is described, which directly limits the strength of the claim that generated code 'exhibits quality and maintainability concerns.'

Authors: We agree that the study relies on established proxy metrics without a direct correlation analysis to real project outcomes. This is an inherent limitation of the PoC scope. In the revision we have (1) added an explicit Limitations subsection that states the proxy nature of the metrics and the absence of real-world validation, (2) softened the phrasing of the central claim from 'exhibits quality and maintainability concerns' to 'suggests potential quality and maintainability issues that warrant further investigation', and (3) referenced prior literature that has similarly used these proxies while noting the need for follow-on studies.

Revision: yes

Referee: [Results] Results: without reported details on exact model variants, benchmark selection criteria, sample sizes per task, statistical tests, or data exclusion rules, it is not possible to assess whether the stated performance limitations and variable quantization effects are robustly supported by the underlying runs.

Authors: We accept that the original submission omitted several reproducibility details. The revised manuscript now includes: exact model identifiers and quantization configurations from the Hugging Face hub, explicit rationale for benchmark selection (HumanEval, MBPP, and a custom maintainability subset), per-task sample sizes, a statement that no data points were excluded, and clarification that, given the exploratory PoC design, only descriptive statistics and qualitative observations were used rather than formal hypothesis tests. These additions appear in the updated Method and Results sections.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical PoC evaluation study that runs four open-source LLMs on Python benchmarks, applies standard code similarity metrics and static analyzers, and reports observed performance, quantization effects, and quality issues. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. All central claims rest on direct experimental outputs from external benchmarks and tools rather than any internal reduction or self-citation chain, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation paper with no mathematical axioms, free parameters, or invented entities; all claims rest on the choice of benchmarks and analysis tools rather than derivations.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model
cs.CL · 2025-04 · unverdicted · novelty 4.0

QM-ToT applies Tree of Thoughts decomposition and evaluator layers to quantized LLMs, reporting accuracy gains from 34% to 50% on MedQA-USMLE for LLaMA2-70b and from 58.77% to 69.49% for LLaMA-3.1-8b, plus an 86.27% im...

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1] Fan, A., Gokkaya, B., Harman, M., Lyubarskiy, M., Sengupta, S., Yoo, S., Zhang, J.M.: Large language models for software engineering: Survey and open problems (arXiv:2310.03533) (2023). arXiv:2310.03533 [cs]

[3] Liu, J., Xia, C.S., Wang, Y., Zhang, L.: Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2024)

[4] Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., Sutton, C.: Program synthesis with large language models (arXiv:2108.07732) (2021). arXiv:2108.07732 [cs]

[5] Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., Narasimhan, K.: Swe-bench: Can language models resolve real-world github issues? (arXiv:2310.06770) (2024). arXiv:2310.06770 [cs]

[6] Ren, S., Guo, D., Lu, S., Zhou, L., Liu, S., Tang, D., Sundaresan, N., Zhou, M., Blanco, A., Ma, S.: CodeBLEU: a Method for Automatic Evaluation of Code Synthesis (2020)

[7] SonarSource: SonarQube Documentation. https://docs.sonarsource.com/sonarqube/latest/user-guide/clean-code/definition/

[8] Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., Gan, C., Han, S.: Awq: Activation-aware weight quantization for llm compression and acceleration (arXiv:2306.00978) (2023). arXiv:2306.00978 [cs]

[9] Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D.: Gptq: Accurate post-training quantization for generative pre-trained transformers (arXiv:2210.17323) (2023). arXiv:2210.17323 [cs]

[11] Siddiq, M.L., Santos, J.C.S.: Generate and pray: Using sallms to evaluate the security of llm generated code (arXiv:2311.00889) (2023). arXiv:2311.00889 [cs]

[12] Jin, M., Shahriar, S., Tufano, M., Shi, X., Lu, S., Sundaresan, N., Svyatkovskiy, A.: Inferfix: End-to-end program repair with llms (arXiv:2303.07263) (2023). arXiv:2303.07263 [cs]

[13] Kantek, B.P.: Ai-driven software development source code quality (2023)

[14] Pan, R., Ibrahimzada, A.R., Krishna, R., Sankar, D., Wassi, L.P., Merler, M., Sobolev, B., Pavuluri, R., Sinha, S., Jabbarvand, R.: Lost in translation: A study of bugs introduced by large language models while translating code (2024) https://doi.org/10.1145/3597503.3639226 . arXiv:2308.03109 [cs]

[15] Du, M., Luu, A.T., Ji, B., Ng, S.-K.: Mercury: An efficiency benchmark for llm code synthesis (arXiv:2402.07844) (2024) https://doi.org/10.48550/arXiv.2402.07844 . arXiv:2402.07844 [cs]

[16] Bhatt, M., Chennabasappa, S., Nikolaidis, C., Wan, S., Evtimov, I., Gabi, D., Song, D., Ahmad, F., Aschermann, C., Fontana, L., Frolov, S., Giri, R.P., Kapil, D., Kozyrakis, Y., LeBlanc, D., Milazzo, J., Straumann, A., Synnaeve, G., Vontimitta, V., Whitman, S., Saxe, J.: Purple llama cyberseceval: A secure coding benchmark for language models (arXi...

[17] Zhong, L., Wang, Z.: Can chatgpt replace stackoverflow? a study on robustness and reliability of large language model code generation (arXiv:2308.10335) (2024) https://doi.org/10.48550/arXiv.2308.10335 . arXiv:2308.10335 [cs]

[18] Vaithilingam, P., Zhang, T., Glassman, E.L.: Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In: Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems. CHI EA ’22. Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3491101.351...

[19] Nguyen, S., Babe, H.M., Zi, Y., Guha, A., Anderson, C.J., Feldman, M.Q.: How beginning programmers and code llms (mis)read each other (arXiv:2401.15232) (2024) https://doi.org/10.48550/arXiv.2401.15232 . arXiv:2401.15232 [cs]

[20] Spiess, C., Gros, D., Pai, K.S., Pradel, M., Rabin, M.R.I., Jha, S., Devanbu, P., Ahmed, T.: Quality and trust in llm-generated code. arXiv preprint arXiv:2402.02047 (2024)

[21] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.D.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al.: Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

[22] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radfor...

[23] Zhu, X., Li, J., Liu, Y., Ma, C., Wang, W.: A survey on model compression for large language models (arXiv:2308.07633) (2023). arXiv:2308.07633 [cs]

[24] Li, S., Ning, X., Wang, L., Liu, T., Shi, X., Yan, S., Dai, G., Yang, H., Wang, Y.: Evaluating quantized large language models. arXiv preprint arXiv:2402.18158 (2024)

[25] Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., Han, S.: Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems 6, 87–100 (2024)

[26] Egashira, K., Vero, M., Staab, R., He, J., Vechev, M.: Exploiting llm quantization. arXiv preprint arXiv:2405.18137 (2024)

[27] Simões, I.R.d.S., Venson, E.: Evaluating source code quality with large languagem models: a comparative study. arXiv preprint arXiv:2408.07082 (2024)

[28] Luo, Z., Xu, C., Zhao, P., Sun, Q., Geng, X., Hu, W., Tao, C., Ma, J., Lin, Q., Jiang, D.: Wizardcoder: Empowering code large language models with evol-instruct (arXiv:2306.08568) (2023). arXiv:2306.08568 [cs]

[29] Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., Casas, D.d.l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L.R., Lachaux, M.-A., Stock, P., Scao, T.L., Lavril, T., Wang, T., Lacroix, T., Sayed, W.E.: Mistral 7b (arXiv:2310.06825) (2023). arXiv:2310.06825 [cs]

[30] Lozhkov, A., Li, R., Allal, L.B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., Wei, Y., Liu, T., Tian, M., Kocetkov, D., Zucker, A., Belkada, Y., Wang, Z., Liu, Q., Abulkhanov, D., Paul, I., Li, Z., Li, W.-D., Risdal, M., Li, J., Zhu, J., Zhuo, T.Y., Zheltonozhskii, E., Dade, N.O.O., Yu, W., Krauß, L., Jain, N., Su, Y., He,...

[31] Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Sauvestre, R., Remez, T., Rapin, J., Kozhevnikov, A., Evtimov, I., Bitton, J., Bhatt, M., Ferrer, C.C., Grattafiori, A., Xiong, W., Défossez, A., Copet, J., Azhar, F., Touvron, H., Martin, L., Usunier, N., Scialom, T., Synnaeve, G.: Code llama: Open foundation m...

[32] Bender, E.M., Koller, A.: Climbing towards NLU: On meaning, form, and understanding in the age of data. In: Association for Computational Linguistics, pp. 5185–5198 (2020)
