pith. machine review for the scientific record

arxiv: 2604.06755 · v1 · submitted 2026-04-08 · 💻 cs.SE

Recognition: 2 Lean theorem links

Babbling Suppression: Making LLMs Greener One Token at a Time

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:22 UTC · model grok-4.3

classification 💻 cs.SE
keywords babbling suppression · LLM code generation · energy efficiency · token reduction · software engineering · sustainable AI · code completion · AI assistants

The pith

Babbling Suppression stops LLM code generation once tests pass, cutting energy use by up to 65% without losing accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Babbling Suppression, a method that runs tests on code produced so far by an LLM and halts further token generation as soon as a correct solution appears. This targets the extra output LLMs often continue to produce after the problem is solved. Tests across Python and Java benchmarks with 3-7B parameter models show energy savings in most cases. The technique leaves the model's final answer quality unchanged because it only terminates on outputs that already pass all tests.

Core claim

Babbling Suppression integrates test execution into the LLM generation process by evaluating intermediate outputs and terminating generation once a solution passes all tests. This prevents excessive token generation while having no impact on model accuracy. An empirical study was conducted across two Python and two Java benchmarks, targeting four 3-4B parameter models and six 6-7B parameter models, with babbling observed across all models and at higher frequency in Java.

What carries the argument

Babbling Suppression (BS), a model-agnostic technique that evaluates intermediate code outputs against test suites during generation to decide when to stop producing more tokens.
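The decision loop this describes can be sketched in a few lines. The sketch below uses hypothetical `generate_next_line` and `run_tests` stubs and is not the authors' implementation; it only illustrates the stop-on-pass idea:

```python
def generate_with_bs(generate_next_line, run_tests, max_lines=64):
    """Minimal sketch of the BS decision loop (hypothetical stubs, not the
    authors' implementation): after each newly generated line, run the
    test suite on the code so far and stop as soon as every test passes."""
    code = ""
    for _ in range(max_lines):
        line = generate_next_line(code)  # one decoding step; None means EOS
        if line is None:
            break
        code += line
        if run_tests(code):  # all tests pass: any further tokens would be babbling
            break
    return code
```

Because termination only fires on a prefix that already passes the full suite, the returned code is by construction at least as correct as the unsuppressed output.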

Load-bearing premise

That running tests on intermediate outputs adds negligible overhead relative to the token savings and that the chosen benchmarks' test suites are representative enough to detect correct solutions without false positives or negatives that would alter termination behavior.
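That premise reduces to a simple energy balance. The figures below are illustrative joule values, not measurements from the paper:

```python
def net_energy_saving_j(e_token_j, tokens_saved, e_test_j, test_runs):
    """The premise as an energy balance (all inputs illustrative, not from
    the paper): BS yields a net saving only when the energy of the avoided
    tokens exceeds the energy spent on repeated test execution."""
    return e_token_j * tokens_saved - e_test_j * test_runs

# Hypothetical numbers: 0.5 J per generated token, 100 tokens suppressed,
# 2 J per test invocation, tests run 10 times -> 50 J saved vs 20 J overhead.
```

If the suite is slow or invoked very frequently, the second term can dominate and the headline savings flip sign, which is exactly the scenario the next section proposes to test.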

What would settle it

Compare total energy and final accuracy on a new set of problems where the test suites are deliberately weak enough to accept incorrect early outputs or where test execution time exceeds the saved generation time.

Figures

Figures reproduced from arXiv: 2604.06755 by Fernando Castor, Lola Solovyeva.

Figure 1
Figure 1. Token likelihood in test-passing solutions, together with histograms showing the lengths of the generated solutions. The bottom x-axis shows the token index, and the left y-axis shows the probability that a token appears in a correctly generated solution. The top x-axis shows the length of the generated solutions, and the right y-axis shows the number of instances with that length.
Figure 2
Figure 2. Step-by-step example of applying babbling suppression to an LLM-generated Python function that computes the square of a number. For the generated output, blue indicates a newly generated line, and the line delimiter is highlighted in red. The generated output is shown with a white background, which turns red (with a cross) if a check fails and green (with a check mark) if all tests pass.
Original abstract

Context: Large Language Models (LLMs) are increasingly used in modern software development, aiding in code generation, code completion, and refactoring through AI-powered assistants. While they accelerate development workflows, they often produce extraneous output, referred to as "babbling", which incurs additional cognitive, economic, and energy costs. Objective: This work investigates the problem of babbling in LLM-based code generation and proposes a practical, model-agnostic approach to reduce unnecessary output without compromising solution accuracy. Method: We introduce Babbling Suppression (BS), a method that integrates test execution into the LLM generation process by evaluating intermediate outputs and terminating generation once a solution passes all tests. This prevents excessive token generation while having no impact on model accuracy. An empirical study was conducted across two Python and two Java benchmarks, targeting four 3-4B parameter models and six 6-7B parameter models. Results: Our findings show that babbling occurs across all tested models, with higher frequency in Java than in Python. Applying BS significantly reduces energy consumption by up to 65% for Python and 62% for Java in models prone to babbling. Across 40 model-benchmark pairs, 29 showed reduced mean energy consumption, with reductions exceeding 20% in 22 cases. Generated token count decreased in 35 pairs, while the GPU energy-per-token overhead of BS remained below 10% for 26 pairs, decreased for 2, and reached a maximum of 24%, yielding net energy savings in most cases. Implications: BS can make AI-assisted programming more efficient and sustainable by reducing energy consumption and minimizing cognitive effort by developers. Its model-agnostic design allows easy integration, suggesting broad applicability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Babbling Suppression (BS), a model-agnostic technique that interleaves execution of benchmark test suites on partial LLM-generated code during decoding and terminates generation as soon as an intermediate output passes all tests. It reports that babbling occurs in all tested models (higher in Java), that BS reduces mean energy consumption in 29 of 40 model-benchmark pairs (exceeding 20% in 22 cases), with peak reductions of 65% (Python) and 62% (Java), while claiming no accuracy loss and GPU energy-per-token overhead below 10% in most pairs.

Significance. If the net energy savings survive full accounting for test-execution overhead and test-suite reliability, the work supplies a practical, immediately deployable intervention that directly addresses the energy footprint of LLM-based code generation. The scale of the study (40 pairs across 3-7B models and two languages) and the explicit quantification of overhead provide a useful empirical baseline for follow-on work on sustainable code assistants.

major comments (3)
  1. [Energy Measurement and Results] The central net-energy claim rests on the assumption that repeated test-suite execution adds negligible overhead relative to avoided token generation. The manuscript reports GPU energy-per-token overhead reaching 24 % in some pairs but does not state whether CPU energy consumed by test execution is measured and subtracted from the reported savings, nor does it specify the exact frequency of test invocations (every token, every k tokens, or on EOS). Without these details the headline reductions cannot be verified as net positive.
  2. [Method and Accuracy Evaluation] The claim of “no impact on model accuracy” requires that test suites never produce false-positive passes on incomplete or incorrect prefixes. The paper does not report any diagnostic on partial-solution false positives, nor does it describe controls (e.g., minimum test coverage thresholds or manual inspection of early-termination cases) that would rule out premature termination on non-solutions.
  3. [Results] Across the 40 pairs the manuscript states that token count decreased in 35 cases and energy decreased in 29, yet it does not provide a per-pair breakdown showing that the observed energy reduction exceeds the measured overhead in every case where savings are claimed. A single table or figure that nets overhead against gross savings is needed to substantiate the “net energy savings in most cases” statement.
minor comments (3)
  1. [Introduction] The term “babbling” is introduced without a precise operational definition (e.g., tokens generated after a passing solution appears). A short formal definition would improve reproducibility.
  2. [Abstract and Results] The abstract and results text use “up to 65 %” and “up to 62 %” without clarifying whether these maxima are achieved on the same model-benchmark pair or are the single largest observed values; a parenthetical note would remove ambiguity.
  3. [Method] No mention is made of the exact test-framework harness used (pytest, JUnit, etc.) or of any timeout or resource limits placed on test execution; these parameters affect both overhead and reliability.
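For the first minor comment, one possible operational definition (our illustration, not the paper's) counts as babbling every token emitted after the earliest prefix that already passes all tests:

```python
def babbling_tokens(tokens, passes):
    """One possible operational definition (our illustration, not the
    paper's): babbling is every token emitted after the earliest prefix
    that already passes all tests."""
    for i in range(1, len(tokens) + 1):
        if passes(tokens[:i]):
            return len(tokens) - i  # tokens generated past the first passing prefix
    return 0  # no prefix ever passes: nothing counts as babbling here
```

A definition of this shape would make the reported babbling frequencies reproducible from the generation logs alone.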

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We value the constructive criticism provided, which helps improve the clarity and rigor of our work on Babbling Suppression. Below, we address each major comment in detail, indicating where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Energy Measurement and Results] The central net-energy claim rests on the assumption that repeated test-suite execution adds negligible overhead relative to avoided token generation. The manuscript reports GPU energy-per-token overhead reaching 24 % in some pairs but does not state whether CPU energy consumed by test execution is measured and subtracted from the reported savings, nor does it specify the exact frequency of test invocations (every token, every k tokens, or on EOS). Without these details the headline reductions cannot be verified as net positive.

    Authors: We agree that additional details on the energy measurement protocol are required to fully substantiate the net savings claims. The current manuscript emphasizes GPU energy for the decoding process, which constitutes the bulk of the energy use. We will revise the Methods and Results sections to specify that test invocations occur after every token (following an initial minimum length to prevent overhead on trivial prefixes). CPU energy for test execution was not measured or subtracted, as our focus was on GPU metrics for LLM inference; we will explicitly note this limitation and provide an estimate that CPU overhead is small relative to the avoided GPU token generation. A new paragraph will explain the net calculation as total GPU energy with BS versus without, incorporating the per-token overhead. revision: yes

  2. Referee: [Method and Accuracy Evaluation] The claim of “no impact on model accuracy” requires that test suites never produce false-positive passes on incomplete or incorrect prefixes. The paper does not report any diagnostic on partial-solution false positives, nor does it describe controls (e.g., minimum test coverage thresholds or manual inspection of early-termination cases) that would rule out premature termination on non-solutions.

    Authors: We clarify that the accuracy metric is the proportion of problems solved correctly, where correctness is defined by passage of the full test suite. Since BS only stops generation upon test passage, the output is always a verified correct solution, preserving the accuracy exactly as in the baseline (no incorrect solutions are accepted). To address potential concerns about false positives, we will add to the revised manuscript a discussion of our verification: experimental logs showed no instances of non-solutions passing tests early, consistent with the design of the benchmarks where tests require complete functionality. We will include a brief description of the test suites' coverage and note that no additional thresholds were applied beyond full passage. revision: yes

  3. Referee: [Results] Across the 40 pairs the manuscript states that token count decreased in 35 cases and energy decreased in 29, yet it does not provide a per-pair breakdown showing that the observed energy reduction exceeds the measured overhead in every case where savings are claimed. A single table or figure that nets overhead against gross savings is needed to substantiate the “net energy savings in most cases” statement.

    Authors: We concur that a detailed per-pair net analysis is necessary for transparency. We will incorporate into the revised Results section a supplementary table listing for all 40 model-benchmark pairs the average token count, gross energy consumption, overhead percentage, and net energy savings. This will clearly show that in the 29 pairs with reduced energy, the savings surpass the overhead. The underlying data from our experiments supports this presentation and will be used to create the table. revision: yes
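The promised per-pair breakdown could be as simple as the following sketch (column names and all numbers hypothetical; the with-BS energy figure is assumed to already include test-execution overhead, so the percentage is a net saving):

```python
def per_pair_net(rows):
    """Sketch of the promised per-pair table. Each row is (model, benchmark,
    baseline GPU energy in J, energy with BS in J) -- all values hypothetical.
    The with-BS column is assumed to already include test-execution overhead,
    so the computed percentage is a net saving."""
    table = []
    for model, bench, e_base_j, e_bs_j in rows:
        pct_saved = round(100.0 * (e_base_j - e_bs_j) / e_base_j, 1)
        table.append((model, bench, e_base_j, e_bs_j, pct_saved))
    return table
```

A table of this form would let readers verify directly that the 29 pairs claimed as savings are net of overhead.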

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential reductions

full rationale

The paper describes an empirical method (Babbling Suppression) that runs tests on intermediate LLM outputs to terminate generation early, then reports measured token counts and GPU energy across 40 model-benchmark pairs. No equations, fitted parameters, or predictions appear; results are direct experimental outcomes on fixed benchmarks. No self-citations are invoked as load-bearing premises, and the central claims rest on observed deltas rather than any definitional or fitted-input reduction. This matches the default case of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The method depends on the existence of usable test suites for the generated code and on the assumption that early termination based on those tests does not discard superior solutions that would appear later.

axioms (1)
  • domain assumption: test suites provided with the benchmarks are sufficient to determine functional correctness of partial generations.
    The BS method terminates generation only when all tests pass; this requires the tests to be reliable indicators of solution quality.
invented entities (1)
  • Babbling (no independent evidence)
    purpose: label for extraneous token output beyond a correct solution.
    Introduced to name the unnecessary generation that incurs extra costs; no independent physical or formal definition supplied.

pith-pipeline@v0.9.0 · 5610 in / 1318 out tokens · 30917 ms · 2026-05-10T18:22:55.526797+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 36 canonical work pages · 4 internal anchors

  1. [1]

    Negar Alizadeh, Boris Belchev, Nishant Saurabh, Patricia Kelbert, and Fernando Castor. 2025. Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy. In2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). 725–736. doi:10.1109/MSR66628.2025. 00109

  2. [2]

    Anonymous Anonymous. 2026. Greening AI-Assisted Code Generation by Re- ducing Babbling. doi:10.5281/zenodo.19237762

  3. [3]

    Radu Apsan, Vincenzo Stoico, Michel Albonico, Rudra Dhar, Karthik Vaid- hyanathan, and Ivano Malavolta. 2025. Generating Energy-Efficient Code via Large-Language Models – Where are we now? arXiv:2509.10099 [cs.SE] https://arxiv.org/abs/2509.10099

  4. [4]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL] https://arxiv.org/abs/2108.07732

  5. [5]

    BigCode. 2025. https://huggingface.co/spaces/bigcode/bigcode-models- leaderboard Accessed: 2025-10-23

  6. [6]

    Jialun Cao, Zhiyong Chen, Jiarong Wu, Shing-Chi Cheung, and Chang Xu

  7. [7]

    InProceedings of the 39th IEEE/ACM Interna- tional Conference on Automated Software Engineering(Sacramento, CA, USA) (ASE ’24)

    JavaBench: A Benchmark of Object-Oriented Code Generation for Eval- uating Large Language Models. InProceedings of the 39th IEEE/ACM Interna- tional Conference on Automated Software Engineering(Sacramento, CA, USA) (ASE ’24). Association for Computing Machinery, New York, NY, USA, 870–882. doi:10.1145/3691620.3695470

  8. [8]

    Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps- Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. 2023. MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation.IEEE Trans. Softw. Eng.49, 7 (July 2023), 3675–3691....

  9. [9]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, et al. 2021. Evaluating Large Lan- guage Models Trained on Code.CoRRabs/2107.03374 (2021)

  10. [10]

    Jonathan Cordeiro, Shayan Noei, and Ying Zou. 2026. An Empirical Study on the Code Refactoring Capability of Large Language Models.ACM Trans. Softw. Eng. Methodol.(March 2026). doi:10.1145/3801158 Just Accepted

  11. [11]

    Nicole Davila, Igor Wiese, Igor Steinmacher, Lucas Lucio da Silva, Andre Kawamoto, Gilson Jose Peres Favaro, and Ingrid Nunes. 2024. An Industry Case Study on Adoption of AI-based Programming Assistants. InProceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice(Lisbon, Portugal)(ICSE-SEIP ’24). Associatio...

  12. [12]

    Christof Ebert and Panos Louridas. 2023. Generative AI for Software Practitioners. IEEE Software40, 4 (2023), 30–38. doi:10.1109/MS.2023.3265877

  13. [13]

    GitHub. 2025. GitHub Copilot. https://copilot.github.com/ Accessed: 2025-10-23

  14. [14]

    Lianghong Guo, Yanlin Wang, Ensheng Shi, Wanjun Zhong, Hongyu Zhang, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng. 2024. When to Stop? Towards Efficient Code Generation in LLMs with Excess Token Prevention. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Test- ing and Analysis(Vienna, Austria)(ISSTA 2024). Association fo...

  15. [15]

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Ja- cob Steinhardt. 2021. Measuring Coding Challenge Competence With APPS. arXiv:2105.09938 [cs.SE] https://arxiv.org/abs/2105.09938

  16. [16]

    Rasha Ahmad Husein, Hala Aburajouh, and Cagatay Catal. 2025. Large language models for code completion: A systematic literature review.Computer Standards & Interfaces92 (2025), 103917. doi:10.1016/j.csi.2024.103917

  17. [17]

    Shashikant Ilager, Lukas Florian Briem, and Ivona Brandic. 2025. GREEN- CODE: Learning to Optimize Energy Efficiency in LLM-based Code Generation. arXiv:2501.11006 [cs.DC] https://arxiv.org/abs/2501.11006

  18. [18]

    Jasmin Jahić and Ashkan Sami. 2024. State of Practice: LLMs in Software En- gineering and Software Architecture. In2024 IEEE 21st International Confer- ence on Software Architecture Companion (ICSA-C). 311–318. doi:10.1109/ICSA- C63560.2024.00059

  19. [19]

    Mohammad Talal Jamil, Shamsa Abid, and Shafay Shamail. 2025. Can LLMs Generate Higher Quality Code Than Humans? An Empirical Study. In2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). 478–489. doi:10.1109/MSR66628.2025.00081

  20. [20]

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2026. A Survey on Large Language Models for Code Generation.ACM Trans. Softw. Eng. Methodol.35, 2, Article 58 (Jan. 2026), 72 pages. doi:10.1145/3747588

  21. [21]

    Nurminen, and Zhonghong Ou

    Kashif Nizam Khan, Mikael Hirki, Tapio Niemi, Jukka K. Nurminen, and Zhonghong Ou. 2018. RAPL in Action: Experiences in Using RAPL for Power Measurements.ACM Trans. Model. Perform. Eval. Comput. Syst.3, 2, Article 9 (March 2018), 26 pages

  22. [22]

    Kharma, Soohyeon Choi, Mohammad Alkhanafseh, and David Mohaisen

    Mohammed F. Kharma, Soohyeon Choi, Mohammad Alkhanafseh, and David Mohaisen. 5555. Security and Quality in LLM-Generated Code: a Multi-Language, Multi-Model Analysis .IEEE Transactions on Dependable and Secure Computing 01 (March 5555), 1–15. doi:10.1109/TDSC.2026.3672745

  23. [23]

    Will I be replaced?

    Mohammad Amin Kuhail, Sujith Samuel Mathew, Ashraf Khalil, Jose Berengueres, and Syed Jawad Hussain Shah. 2024. “Will I be replaced?” Assessing ChatGPT’s effect on software development and programmer perceptions of AI tools.Science of Computer Programming235 (2024), 103111. doi:10.1016/j.scico.2024.103111

  24. [24]

    Sasha Luccioni, Yacine Jernite, and Emma Strubell. 2024. Power Hungry Process- ing: Watts Driving the Cost of AI Deployment?. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency(Rio de Janeiro, Brazil) (FAccT ’24). Association for Computing Machinery, New York, NY, USA, 85–99. doi:10.1145/3630106.3658542

  25. [25]

    Noble Saji Mathews and Meiyappan Nagappan. 2024. Test-Driven Development and LLM-based Code Generation. InProceedings of the 39th IEEE/ACM Interna- tional Conference on Automated Software Engineering(Sacramento, CA, USA) (ASE ’24). Association for Computing Machinery, New York, NY, USA, 1583–1594. doi:10.1145/3691620.3695527

  26. [26]

    Mohammadjavad Mehditabar, Saurabhsingh Rajput, Antonio Mastropaolo, and Tushar Sharma. 2025. Smart but Costly? Benchmarking LLMs on Functional Accuracy and Energy Efficiency. arXiv:2511.07698 [cs.SE] https://arxiv.org/abs/ 2511.07698

  27. [27]

    Merriam-Webster. 2026. babbling. https://www.merriam-webster.com/ dictionary/babbling Accessed: 2026-01-22

  28. [28]

    Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. 2024. Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Program- ming. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems(Honolulu, HI, USA)(CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 142, 16 pages. doi:10.114...

  29. [29]

    Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, ChenXue Wang, Shichao Liu, and Qing Wang. 2024. ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification.Proc. ACM Softw. Eng.1, FSE, Article 103 (July 2024), 23 pages. doi:10.1145/3660810

  30. [30]

    NVIDIA Corporation. 2025. NVIDIA Management Library (NVML). https://developer.nvidia.com/management-library-nvml. Last accessed October 22nd, 2025

  31. [31]

    Dangfeng Pan, Zhensu Sun, Cenyuan Zhang, David Lo, and Xiaoning Du. 2025. The Hidden Cost of Readability: How Code Formatting Silently Consumes Your LLM Budget. arXiv:2508.13666 [cs.SE] https://arxiv.org/abs/2508.13666

  32. [32]

    ANTHROPIC PBC. 2026. Claude. https://claude.ai/ Accessed: 2026-03-23

  33. [33]

    Sanyogita Piya and Allison Sullivan. 2024. LLM4TDD: Best Practices for Test Driven Development Using Large Language Models. InProceedings of the 1st International Workshop on Large Language Models for Code(Lisbon, Portugal) (LLM4Code ’24). Association for Computing Machinery, New York, NY, USA, 14–21. doi:10.1145/3643795.3648382

  34. [34]

    PyPI Contributors. 2024. pynvml: Python bindings for NVML. https://pypi.org/ project/pynvml/ Accessed: 2025-10-23

  35. [35]

    Sanka Rasnayaka, Guanlin Wang, Ridwan Shariffdeen, and Ganesh Neelakanta Iyer. 2024. An Empirical Study on Usage and Perceptions of LLMs in a Software Engineering Project. InProceedings of the 1st International Workshop on Large Language Models for Code(Lisbon, Portugal)(LLM4Code ’24). Association for Com- puting Machinery, New York, NY, USA, 111–118. doi...

  36. [36]

    Jaswanth Revuri, Rakesh Kumar Sakthivel, and Gayathri Nagasubramanian. 2026. Chapter Five - Artificial intelligence (AI) technologies and tools for accelerated software development. InCloud-native Architecture (CNA) and Artificial Intelli- gence (AI) for the Future of Software Engineering: The Principles, Patterns, Platforms and Practices, Pethuru Raj, Ma...

  37. [37]

    Agnia Sergeyuk, Yaroslav Golubev, Timofey Bryksin, and Iftekhar Ahmed. 2025. Using AI-based coding assistants in practice: State of affairs, perceptions, and ways forward.Information and Software Technology178 (2025), 107610. doi:10. 1016/j.infsof.2024.107610

  38. [38]

    Lola Solovyeva and Fernando Castor. 2026. Towards Green AI: Decoding the Energy of LLM Inference in Software Development. arXiv:2602.05712 [cs.SE] https://arxiv.org/abs/2602.05712 Conference’17, July 2017, Washington, DC, USA Lola Solovyeva and Fernando Castor

  39. [39]

    Lola Solovyeva, Sophie Weidmann, and Fernando Castor. 2025. AI-Powered, But Power-Hungry? Energy Efficiency of LLM-Generated Code. In2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge). 49–60. doi:10.1109/Forge66646.2025.00012

  40. [40]

    Ningzhi Tang, Meng Chen, Zheng Ning, Aakash Bansal, Yu Huang, Collin McMil- lan, and Toby Jia-Jun Li. 2024. Developer Behaviors in Validating and Repairing LLM-Generated Code Using IDE and Eye Tracking. In2024 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). 40–46

  41. [41]

    The Economist. 2025. OpenAI’s latest model will change the economics of soft- ware. https://www.economist.com/business/2025/01/20/openais-latest-model- will-change-the-economics-of-software

  42. [42]

    The Economist. 2025. Will OpenAI ever make real money? https://www. economist.com/business/2025/05/15/will-openai-ever-make-real-money

  43. [43]

    Jianxun Wang and Yixiang Chen. 2023. A Review on Code Generation with LLMs: Application and Evaluation. In2023 IEEE International Conference on Medical Artificial Intelligence (MedAI). 284–289. doi:10.1109/MedAI59581.2023.00044

  44. [44]

    Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, and Kai Shen. 2025. Code- Contests+: High-Quality Test Case Generation for Competitive Programming. arXiv:2506.05817 [cs.SE] https://arxiv.org/abs/2506.05817

  45. [45]

    Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh- Agrawal, Sandeep Singh Sandha, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. 2025. LiveBench: A Challenging, Contamination-Limited LLM Benchmark. arXiv:2...

  46. [46]

    Yusen Zhang, Sarkar Snigdha Sarathi Das, and Rui Zhang. 2024. Verbosity ≠ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models. arXiv:2411.07858 [cs.CL] https://arxiv.org/abs/2411.07858

  47. [47]

    Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, and Zibin Zheng. 2025. LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation.Proc. ACM Softw. Eng.2, ISSTA, Article ISSTA022 (June 2025), 23 pages. doi:10.1145/3728894