Code Generation by Differential Test Time Scaling
Pith reviewed 2026-05-21 06:39 UTC · model grok-4.3
The pith
DiffCodeGen selects the best code candidate by clustering execution behaviors on fuzzing-generated inputs without any extra LLM calls.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiffCodeGen generates diverse code candidates using various sampling and prompting strategies, applies coverage-guided fuzzing to synthesize inputs without requiring existing tests or large language models, executes all candidates on these inputs to capture dynamic behavior, clusters candidates by behavioral similarity, and selects the medoid of the largest cluster as the final output. Unlike prior methods, this selection uses no extra model calls and therefore incurs little to no additional token consumption; the process is fully asynchronous and naturally suited to agentic coding. Evaluations across four large language models show consistent gains over baselines and competitive or superior performance relative to state-of-the-art test-time scaling methods, at a fraction of the time and tokens.
What carries the argument
Coverage-guided differential analysis that synthesizes inputs via fuzzing, executes candidates to record behaviors, clusters by behavioral similarity, and selects the medoid of the largest cluster.
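The selection step can be illustrated with a minimal sketch. It assumes each candidate's behavior has already been recorded as a tuple of outputs, one per synthesized input. The normalized Hamming distance, the `threshold` parameter, and the single-link grouping are illustrative assumptions, not details confirmed by the paper, and the function names are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Fraction of inputs on which two behavior vectors disagree (assumed metric)."""
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def select_candidate(behaviors, threshold=0.0):
    """Return (medoid index, member indices) for the largest behavioral cluster.

    Candidates whose distance is <= threshold are linked into the same cluster
    (single-link grouping via union-find). This is an assumed clustering rule.
    """
    n = len(behaviors)
    dist = [[hamming(behaviors[i], behaviors[j]) for j in range(n)] for i in range(n)]

    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(j)] = find(i)

    clusters = defaultdict(list)
    for i in range(n):
        clusters[find(i)].append(i)

    largest = max(clusters.values(), key=len)
    # The medoid minimizes the total distance to the other cluster members.
    medoid = min(largest, key=lambda i: sum(dist[i][j] for j in largest))
    return medoid, largest


# Hypothetical behavior vectors: outputs of five candidates on three inputs.
behaviors = [(1, 4, 9), (1, 4, 9), (1, 4, 9), (1, 4, 0), (0, 0, 0)]
print(select_candidate(behaviors, threshold=0.34))  # (0, [0, 1, 2, 3])
```

With `threshold=0.0` only behaviorally identical candidates are grouped, so the largest cluster shrinks to the three candidates that agree on every input.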
If this is right
- Performance improves consistently across four different large language models without model-specific tuning.
- Token and time costs remain a small fraction of those required by test-time scaling methods that use public tests or extra LLM inference for selection.
- The method can be combined with reasoning models to produce further gains.
- Because selection requires no additional model calls, the approach scales naturally to large numbers of candidates in asynchronous agentic coding setups.
Where Pith is reading between the lines
- The reliance on automatically generated inputs could allow the technique to work in domains where public test suites are scarce or nonexistent.
- Behavioral clusters might expose systematic error patterns shared across many generated solutions, offering a new diagnostic for code-generation failures.
- The same differential-execution idea could be adapted to select among outputs in other generative tasks such as text or proof synthesis.
Load-bearing premise
The largest behavioral cluster identified from executions on the synthesized inputs reliably contains the correct or best code solution.
What would settle it
Held-out problems on which the medoid of the largest cluster fails the ground-truth tests while a candidate from a smaller cluster passes would show that the selection rule does not reliably pick the best solution.
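One way to quantify this check is sketched below. It assumes that for every held-out problem we know which candidates pass the ground-truth tests, which candidates form the largest cluster, and which candidate was selected. The record layout and the name `selection_miss_rate` are hypothetical, and the data is made up purely for illustration.

```python
def selection_miss_rate(problems):
    """Fraction of problems where the selected candidate fails the ground-truth
    tests although some candidate outside the largest cluster passes."""
    misses = 0
    for p in problems:
        selected_ok = p["passes"][p["selected"]]
        in_largest = set(p["largest_cluster"])
        outside = [i for i in range(len(p["passes"])) if i not in in_largest]
        if not selected_ok and any(p["passes"][i] for i in outside):
            misses += 1
    return misses / len(problems)


# Made-up records: "passes" holds one ground-truth verdict per candidate.
problems = [
    # The majority shares a wrong behavior; the lone candidate 2 is correct.
    {"passes": [False, False, True], "largest_cluster": [0, 1], "selected": 0},
    # The majority is correct and the selected medoid passes.
    {"passes": [True, True, False], "largest_cluster": [0, 1], "selected": 1},
]
print(selection_miss_rate(problems))  # 0.5
```

A miss rate close to zero on held-out problems would support the premise; a high rate would indicate that shared failure modes dominate the largest cluster.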
Original abstract
Test-time scaling has emerged as a promising approach for improving code generation by exploring large solution spaces at inference time. However, existing methods often rely on public test cases that are unavailable in practice, or require extensive LLM inference for candidate selection, leading to significant token consumption and time overhead. We present DiffCodeGen, a novel test-time scaling method for code generation based on coverage-guided differential analysis. DiffCodeGen generates diverse code candidates using various sampling and prompting strategies, then applies coverage-guided fuzzing to synthesize inputs without requiring any existing tests or large language models. By executing all candidates on these inputs, DiffCodeGen captures their dynamic behavior and clusters candidates based on behavioral similarity. DiffCodeGen selects the medoid of the largest cluster as the final output. Unlike prior test-time scaling methods that invoke additional LLM inference for candidate selection, DiffCodeGen performs selection without any extra model calls, incurring little to no additional token consumption. DiffCodeGen is fully asynchronous, naturally suited to the current trend of agentic coding, and is thus efficient and highly scalable. We evaluate DiffCodeGen across 4 large language models, demonstrating consistent improvements over baselines. Compared to state-of-the-art test-time scaling methods, DiffCodeGen achieves competitive or superior performance while using only a fraction of time and tokens. DiffCodeGen is model-agnostic and can be combined with reasoning models to further boost performance.
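As a rough illustration of how executing candidates on shared inputs captures dynamic behavior, the sketch below records one observation per input for each candidate. It assumes candidates are plain Python callables and uses hand-written inputs; in DiffCodeGen the inputs come from coverage-guided fuzzing, which this sketch does not implement. The helper `behavior_vector` and the three candidates are hypothetical.

```python
def behavior_vector(candidate, inputs):
    """Run one candidate on every input and record its output or error type."""
    trace = []
    for args in inputs:
        try:
            trace.append(("ok", repr(candidate(*args))))
        except Exception as exc:
            # A crash is also behavior: record the exception type, not the message.
            trace.append(("error", type(exc).__name__))
    return tuple(trace)


# Three hypothetical candidates for "integer division of a by b".
def floor_div(a, b):
    return a // b


def truncating_div(a, b):
    return int(a / b)


def guarded_div(a, b):
    return a // b if b else 0


# Hand-written stand-ins for fuzzer-synthesized inputs.
inputs = [(7, 2), (-7, 2), (1, 0)]
vectors = [behavior_vector(c, inputs) for c in (floor_div, truncating_div, guarded_div)]
for v in vectors:
    print(v)
```

The three candidates agree on `(7, 2)` but diverge on the negative operand and on division by zero, which is the kind of behavioral difference the clustering step relies on. A real deployment would presumably run untrusted candidates in isolated processes with timeouts rather than in the host interpreter.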
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DiffCodeGen, a test-time scaling method for code generation. It generates diverse code candidates via sampling and prompting strategies, synthesizes inputs using coverage-guided fuzzing without public tests or extra LLM calls, executes all candidates on these inputs to capture dynamic behavior, clusters candidates by behavioral similarity, and selects the medoid of the largest cluster as output. The paper claims consistent improvements over baselines across four LLMs, competitive or superior performance versus state-of-the-art test-time scaling methods while using only a fraction of the time and tokens, and emphasizes its model-agnostic, asynchronous design suitable for agentic coding.
Significance. If the central claims hold, this could represent a meaningful advance in efficient test-time scaling for code generation by eliminating reliance on public tests and additional model inference for selection. The coverage-guided fuzzing approach for differential behavior analysis is a notable technical choice that enables low-overhead selection. The model-agnostic property and potential combination with reasoning models are positive aspects. Reproducible evaluation across multiple LLMs would strengthen the contribution if detailed metrics confirm the efficiency gains.
major comments (2)
- [Abstract] Abstract: The claims of 'consistent improvements over baselines' and 'competitive or superior performance' are stated without any quantitative metrics, effect sizes, statistical significance tests, benchmark details, or evaluation protocol. This absence is load-bearing for assessing support of the central performance and efficiency claims.
- [Method] Method section (clustering and selection): The assumption that the largest behavioral cluster from coverage-guided fuzzing inputs reliably contains the correct or best solution is central to the no-extra-LLM-call efficiency argument. The manuscript should provide targeted analysis or counterexample experiments for cases where incorrect candidates share similar failure modes on the synthesized inputs, as this directly risks degrading accuracy while still claiming token/time savings.
minor comments (2)
- [Method] Provide explicit details on the coverage-guided fuzzing parameters, clustering distance metric, and candidate generation strategies to support reproducibility.
- [Evaluation] Ensure all baselines and comparison methods are clearly defined with references in the evaluation section.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below and indicate the revisions we will make.
- Referee: [Abstract] Abstract: The claims of 'consistent improvements over baselines' and 'competitive or superior performance' are stated without any quantitative metrics, effect sizes, statistical significance tests, benchmark details, or evaluation protocol. This absence is load-bearing for assessing support of the central performance and efficiency claims.
  Authors: We agree that the abstract would be strengthened by the inclusion of quantitative details. In the revised version, we will update the abstract to report key metrics from our evaluations, such as average pass rate improvements over baselines, token and runtime reductions relative to state-of-the-art test-time scaling methods, the specific benchmarks employed, and a brief note on the evaluation protocol. This will make the central claims more concrete and directly address the concern.
  Revision: yes
- Referee: [Method] Method section (clustering and selection): The assumption that the largest behavioral cluster from coverage-guided fuzzing inputs reliably contains the correct or best solution is central to the no-extra-LLM-call efficiency argument. The manuscript should provide targeted analysis or counterexample experiments for cases where incorrect candidates share similar failure modes on the synthesized inputs, as this directly risks degrading accuracy while still claiming token/time savings.
  Authors: This is a fair and important point about the robustness of the clustering assumption. Our approach uses coverage-guided fuzzing to generate diverse inputs that aim to expose behavioral differences, and our multi-model experiments indicate that the largest cluster frequently aligns with correct solutions. To directly respond, we will add a targeted analysis subsection in the revised manuscript that examines cases of shared failure modes among incorrect candidates, reports observed frequencies, and discusses any impact on accuracy. We will include relevant examples and maintain an honest assessment of limitations.
  Revision: yes
Circularity Check
No circularity: procedural pipeline with independent empirical evaluation
The paper describes DiffCodeGen as a sequence of steps—candidate generation via sampling/prompting, coverage-guided fuzzing for input synthesis, execution to capture behaviors, similarity-based clustering, and medoid selection of the largest cluster—without any equations, fitted parameters, or derivations that reduce the output to inputs by construction. Performance claims rest on external empirical results across four LLMs rather than self-referential definitions or self-citation chains. The core heuristic (largest cluster contains the best solution) is an explicit assumption open to falsification, not a tautology or renamed fit. This leaves the method self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- candidate generation parameters
- fuzzing and clustering parameters
axioms (1)
- Domain assumption: Synthesized fuzzing inputs suffice to expose behavioral differences that correlate with code correctness or quality.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "DIFFCODEGEN clusters the generated code candidates based on their dynamic behavior and selects the candidate with the shortest relative distance to all other candidates in the largest cluster as the final output."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose DIFFCODEGEN, a novel test-time scaling method combining differential testing and dynamic software analysis."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Inbal Shani and GitHub Staff.Survey reveals AI’s impact on the developer experience. 2024.URL:https : / / github . blog/news-insights/research/survey-reveals-ais-impact-on-the-developer-experience/
work page 2024
-
[2]
2024.URL:https: //github.blog/news-insights/research/survey-ai-wave-grows/
Kyle Daigle and GitHub Staff.Survey: The AI wave continues to grow on software development teams. 2024.URL:https: //github.blog/news-insights/research/survey-ai-wave-grows/
work page 2024
-
[3]
Mark Chen et al.Evaluating Large Language Models Trained on Code. 2021. arXiv:2107.03374 [cs.LG].URL:https: //arxiv.org/abs/2107.03374. 16 Code Generation by Differential Test Time Scaling
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[4]
Evaluating Large Language Models in Class-Level Code Generation
Xueying Du et al. “Evaluating Large Language Models in Class-Level Code Generation”. In:Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ICSE ’24. Lisbon, Portugal: Association for Computing Machinery, 2024.ISBN: 9798400702174.DOI:10 . 1145 / 3597503 . 3639219.URL:https : / / doi . org / 10 . 1145 / 3597503 . 3639219
work page 2024
-
[5]
DeepSeek-AI et al.DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025. arXiv: 2501.12948 [cs.CL].URL:https://arxiv.org/abs/2501.12948
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Niklas Muennighoff et al. “s1: Simple test-time scaling”. In:Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 20275–20321. ISBN: 979-8-89176-332-6.DOI:10.18653/v1/2025.emnlp-main.1025.URL:https://aclanthology.org/2025. emnlp-main.1025/
work page doi:10.18653/v1/2025.emnlp-main.1025.url:https://aclanthology.org/2025 2025
-
[7]
Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning
Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. “Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning”. In:The Thirteenth International Conference on Learning Repre- sentations. 2025.URL:https://openreview.net/forum?id=4FWAwZtd2n
work page 2025
-
[8]
ACECODER: Acing Coder RL via Automated Test-Case Synthesis
Huaye Zeng et al. “ACECODER: Acing Coder RL via Automated Test-Case Synthesis”. In:Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Com- putational Linguistics, July 2025, pp. 12023–12040.ISBN: 979-8-89176-251-0.DOI:10.18653/v1/2025.acl-long.587. URL:https://a...
-
[9]
Dacheng Li et al. “S*: Test Time Scaling for Code Generation”. In:Findings of the Association for Computational Lin- guistics: EMNLP 2025. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 15964–15978.ISBN: 979-8-89176-335-7.DOI:10.18653/v1/2025.findings- emnlp.865.URL:https://aclanthology.org/2025. findings-emnlp.865/
-
[10]
Xiancai Chen et al. “Revisit Self-Debugging with Self-Generated Tests for Code Generation”. In:Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Computational Linguistics, July 2025, pp. 18003–18023.ISBN: 979-8-89176-251-0.DOI:10.18653/v1/2025.acl- long.881.URL...
-
[11]
Differential Testing for Software
William M. McKeeman. “Differential Testing for Software”. In:Digit. Tech. J.10 (1998), pp. 100–107.URL:https : //api.semanticscholar.org/CorpusID:14018070
work page 1998
-
[12]
Differential testing: a new approach to change detection
Robert B. Evans and Alberto Savoia. “Differential testing: a new approach to change detection”. In:The 6th Joint Meeting on European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engi- neering: Companion Papers. ESEC-FSE companion ’07. Dubrovnik, Croatia: Association for Computing Machinery, 2007, pp. 549–552...
work page doi:10.1145/1295014.1295038.url:https://doi.org/10.1145/1295014 2007
-
[13]
Hunting for bugs in code coverage tools via randomized differential testing
Yibiao Yang et al. “Hunting for bugs in code coverage tools via randomized differential testing”. In:Proceedings of the 41st International Conference on Software Engineering. ICSE ’19. Montreal, Quebec, Canada: IEEE Press, 2019, pp. 488–499. DOI:10.1109/ICSE.2019.00061.URL:https://doi.org/10.1109/ICSE.2019.00061
work page doi:10.1109/icse.2019.00061.url:https://doi.org/10.1109/icse.2019.00061 2019
-
[14]
Shaohua Li and Zhendong Su. “Finding Unstable Code via Compiler-Driven Differential Testing”. In:Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. ASPLOS 2023. Vancouver, BC, Canada: Association for Computing Machinery, 2023, pp. 238–251.ISBN: 9781450399180.DOI:10.1145/...
work page doi:10.1145/3582016.3582053.url:https://doi.org/10.1145/3582016.3582053 2023
-
[15]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain et al. “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code”. In: The Thirteenth International Conference on Learning Representations. 2025.URL:https://openreview.net/forum? id=chfJJYC3iL
work page 2025
-
[16]
Finding and understanding bugs in C compilers
Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. “Finding and understanding bugs in C compilers”. In:SIGPLAN Not.46.6 (June 2011), pp. 283–294.ISSN: 0362-1340.DOI:10.1145/1993316.1993532.URL:https://doi.org/10. 1145/1993316.1993532
work page doi:10.1145/1993316.1993532.url:https://doi.org/10 2011
-
[17]
Compiler validation via equivalence modulo inputs
Vu Le, Mehrdad Afshari, and Zhendong Su. “Compiler validation via equivalence modulo inputs”. In:SIGPLAN Not.49.6 (June 2014), pp. 216–226.ISSN: 0362-1340.DOI:10.1145/2666356.2594334.URL:https://doi.org/10.1145/ 2666356.2594334
work page doi:10.1145/2666356.2594334.url:https://doi.org/10.1145/ 2014
-
[18]
Finding compiler bugs via live code mutation
Chengnian Sun, Vu Le, and Zhendong Su. “Finding compiler bugs via live code mutation”. In:SIGPLAN Not.51.10 (Oct. 2016), pp. 849–863.ISSN: 0362-1340.DOI:10.1145/3022671.2984038.URL:https://doi.org/10.1145/3022671. 2984038
work page doi:10.1145/3022671.2984038.url:https://doi.org/10.1145/3022671 2016
-
[19]
Testing Database Engines via Pivoted Query Synthesis
Manuel Rigger and Zhendong Su. “Testing Database Engines via Pivoted Query Synthesis”. In:Proc. ACM Program. Lang. 4.OOPSLA (Nov. 2020).DOI:10.1145/3428279.URL:https://doi.org/10.1145/3428279
work page doi:10.1145/3428279.url:https://doi.org/10.1145/3428279 2020
-
[20]
Evaluating Program Semantics Reasoning with Type Inference in System $F$
Yifeng He, Luning Yang, Christopher Castro Gaw Gonzalo, and Hao Chen. “Evaluating Program Semantics Reasoning with Type Inference in System $F$”. In:The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track. 2025.URL:https://openreview.net/forum?id=IA9RmaP0aw
work page 2025
-
[21]
Coverage-based Greybox Fuzzing as Markov Chain
Marcel B ¨ohme, Van-Thuan Pham, and Abhik Roychoudhury. “Coverage-based Greybox Fuzzing as Markov Chain”. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. CCS ’16. Vienna, Austria: Association for Computing Machinery, 2016, pp. 1032–1043.ISBN: 9781450341394.DOI:10.1145/2976749.2978428. URL:https://doi.org/10.1145/...
-
[22]
AFL++ : Combining Incremental Steps of Fuzzing Research
Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. “AFL++ : Combining Incremental Steps of Fuzzing Research”. In:14th USENIX Workshop on Offensive Technologies (WOOT 20). USENIX Association, Aug. 2020.URL: https://www.usenix.org/conference/woot20/presentation/fioraldi
work page 2020
-
[23]
Andreas Zeller, Rahul Gopinath, Marcel B ¨ohme, Gordon Fraser, and Christian Holler. “The Fuzzing Book”. In: (Jan. 2019). DOI:10.60882/cispa.24614928.v1.URL:https://publications.cispa.de/articles/book/The_Fuzzing_ Book/24614928
work page doi:10.60882/cispa.24614928.v1.url:https://publications.cispa.de/articles/book/the_fuzzing_ 2019
-
[24]
Matryoshka: Fuzzing Deeply Nested Branches
Peng Chen, Jianzhong Liu, and Hao Chen. “Matryoshka: Fuzzing Deeply Nested Branches”. In:Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. CCS ’19. London, United Kingdom: Association for Computing Machinery, 2019, pp. 499–513.ISBN: 9781450367479.DOI:10.1145/3319535.3363225.URL:https: //doi.org/10.1145/3319535.3363225
-
[25]
Coverage-directed differential testing of JVM implementations
Yuting Chen, Ting Su, Chengnian Sun, Zhendong Su, and Jianjun Zhao. “Coverage-directed differential testing of JVM implementations”. In:SIGPLAN Not.51.6 (June 2016), pp. 85–99.ISSN: 0362-1340.URL:https://doi.org/10.1145/ 2980983.2908095
-
[26]
NEZHA: Efficient Domain- Independent Differential Testing
Theofilos Petsios, Adrian Tang, Salvatore Stolfo, Angelos D. Keromytis, and Suman Jana. “NEZHA: Efficient Domain- Independent Differential Testing”. In:Proceedings of the 2017 IEEE Symposium on Security and Privacy. SP ’17. San Jose, CA, USA: IEEE Press, 2017, pp. 615–632.ISBN: 9781509049318.DOI:10 . 1109 / SP . 2017 . 27.URL:https : //doi.org/10.1109/SP.2017.27
-
[27]
DifFuzz: differential fuzzing for side-channel analysis
Shirin Nilizadeh, Yannic Noller, and Corina S. P ˘as˘areanu. “DifFuzz: differential fuzzing for side-channel analysis”. In: Proceedings of the 41st International Conference on Software Engineering. ICSE ’19. Montreal, Quebec, Canada: IEEE Press, 2019, pp. 176–187.ISBN: 9781728108698.DOI:10.1109/ICSE.2019.00122.URL:https://doi.org/10. 1109/ICSE.2019.00122
work page doi:10.1109/icse.2019.00122.url:https://doi.org/10 2019
-
[28]
In: Proceedings of the 38th International Conference on Software Engineering
Junjie Chen et al. “An empirical comparison of compiler testing techniques”. In:Proceedings of the 38th International Conference on Software Engineering. ICSE ’16. Austin, Texas: Association for Computing Machinery, 2016, pp. 180–190. ISBN: 9781450342056.DOI:10.1145/2884781.2884878.URL:https://doi.org/10.1145/2884781.2884878
work page doi:10.1145/2884781.2884878.url:https://doi.org/10.1145/2884781.2884878 2016
-
[29]
CodeT: Code Generation with Generated Tests
Bei Chen et al. “CodeT: Code Generation with Generated Tests”. In:The Eleventh International Conference on Learning Representations. 2023.URL:https://openreview.net/forum?id=ktrw68Cmu9c
work page 2023
-
[30]
In: Zong, C., Xia, F., Li, W., Navigli, R
Yifeng He, Jicheng Wang, Yuyang Rong, and Hao Chen. “FuzzAug: Data Augmentation by Coverage-guided Fuzzing for Neural Test Generation”. In:Findings of the Association for Computational Linguistics: EMNLP 2025. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 15642–15655.ISBN: 979-8-89176-335-7.DOI:10.18653/v1/ 2025.findings-emnlp.8...
-
[31]
Learning to Write with Cooperative Discriminators
Ari Holtzman et al. “Learning to Write with Cooperative Discriminators”. In:Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 1638–1649.DOI:10.18653/v1/P18-1152.URL:https://aclanthology.org/P18-1152/
work page doi:10.18653/v1/p18-1152.url:https://aclanthology.org/p18-1152/ 2018
-
[32]
The Curious Case of Neural Text Degeneration
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. “The Curious Case of Neural Text Degeneration”. In: International Conference on Learning Representations. 2020.URL:https://openreview.net/forum?id=rygGQyrFvH
work page 2020
-
[33]
Hierarchical Neural Story Generation
Angela Fan, Mike Lewis, and Yann Dauphin. “Hierarchical Neural Story Generation”. In:Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 889–898.DOI:10.18653/v1/P18- 1082.URL:https://aclanthology. org/P18-1082/
-
[34]
Language models are unsupervised multitask learners
Alec Radford et al. “Language models are unsupervised multitask learners”. In:OpenAI blog1.8 (2019), p. 9
work page 2019
-
[35]
Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation
Hongxiang Zhang, Hao Chen, Muhao Chen, and Tianyi Zhang. “Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation”. In:Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 3028–3046.ISBN: 979-8-89176-332- 6.DOI:10....
work page doi:10.18653/v1/2025.emnlp-main.150.url:https://aclanthology.org/2025.emnlp-main.150/ 2025
-
[36]
A learning algorithm for Boltzmann machines
David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. “A learning algorithm for Boltzmann machines”. In:Cog- nitive science9.1 (1985), pp. 147–169
work page 1985
-
[37]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean.Distilling the Knowledge in a Neural Network. 2015. arXiv:1503.02531 [stat.ML].URL:https://arxiv.org/abs/1503.02531
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[38]
Controlling Linguistic Style Aspects in Neural Language Generation
Jessica Ficler and Yoav Goldberg. “Controlling Linguistic Style Aspects in Neural Language Generation”. In:Proceedings of the Workshop on Stylistic Variation. Copenhagen, Denmark: Association for Computational Linguistics, Sept. 2017, pp. 94–104.DOI:10.18653/v1/W17-4912.URL:https://aclanthology.org/W17-4912/
work page doi:10.18653/v1/w17-4912.url:https://aclanthology.org/w17-4912/ 2017
-
[39]
In: Duh, K., Gomez, H., Bethard, S
Matthew Renze. “The Effect of Sampling Temperature on Problem Solving in Large Language Models”. In:Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 7346–7356.DOI:10.18653/v1/2024.findings- emnlp.432.URL:https://aclanthology.org/ 2024.findings-emnlp.432/
-
[40]
Demystifying LLM-Based Software Engineering Agents
Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. “Demystifying LLM-Based Software Engineering Agents”. In:Proc. ACM Softw. Eng.2.FSE (June 2025).DOI:10.1145/3715754.URL:https://doi.org/10.1145/ 3715754
work page doi:10.1145/3715754.url:https://doi.org/10.1145/ 2025
-
[41]
Lipton, Mu Li, and Alexander J
Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola.Dive into Deep Learning.https://D2L.ai. Cambridge University Press, 2023. 18 Code Generation by Differential Test Time Scaling
work page 2023
-
[42]
Holistic Evaluation of Language Models
Percy Liang et al. “Holistic Evaluation of Language Models”. In:Transactions on Machine Learning Research(2023). Featured Certification, Expert Certification, Outstanding Certification.ISSN: 2835-8856.URL:https : / / openreview . net/forum?id=iO4LZibEqW
work page 2023
-
[43]
Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. “Quantifying Language Models’Sensitivity to Spurious Fea- tures in Prompt Design or: How I learned to start worrying about prompt formatting”. In:International Conference on Representation Learning. V ol. 2024. 2024, pp. 25055–25083.URL:https://proceedings.iclr.cc/paper_files/ paper/2024/file/6c0e...
work page 2024
-
[44]
ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
Jingming Zhuo et al. “ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs”. In:Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 1950–1976.DOI:10 . 18653 / v1 / 2024 . findings - emnlp . 108.URL:https : / / aclanthology . org / 2024 . findings...
work page 2024
-
[45]
What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering
“What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering”. In:Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Albuquerque, New Mexico: Association for Computational Linguistics, Apr. 2025, pp. 15...
work page 2025
-
[46]
Prompting Techniques for Secure Code Generation: A Systematic Investigation
Catherine Tony, Nicol ´as E. D´ıaz Ferreyra, Markus Mutas, Salem Dhif, and Riccardo Scandariato. “Prompting Techniques for Secure Code Generation: A Systematic Investigation”. In:ACM Trans. Softw. Eng. Methodol.34.8 (Oct. 2025).ISSN: 1049-331X.DOI:10.1145/3722108.URL:https://doi.org/10.1145/3722108
work page doi:10.1145/3722108.url:https://doi.org/10.1145/3722108 2025
-
[47]
How beginning programmers and code LLMs ( mis)read each other,
Sydney Nguyen et al. “How Beginning Programmers and Code LLMs (Mis)read Each Other”. In:Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. CHI ’24. Honolulu, HI, USA: Association for Computing Ma- chinery, 2024.ISBN: 9798400703300.DOI:10.1145/3613904.3642706.URL:https://doi.org/10.1145/3613904. 3642706
work page doi:10.1145/3613904.3642706.url:https://doi.org/10.1145/3613904 2024
-
[48]
CREME: Robustness Enhancement of Code LLMs via Layer-Aware Model Editing
Shuhan Liu et al. “CREME: Robustness Enhancement of Code LLMs via Layer-Aware Model Editing”. In:Proceedings of the IEEE/ACM 48th International Conference on Software Engineering. ICSE ’26. Rio de Janeiro, Brazil: Association for Computing Machinery, 2026.DOI:3744916.3773111.URL:https://arxiv.org/abs/2507.16407v3
-
[49]
Can Language Models Solve Olympiad Programming?
Ben Shi, Michael Tang, Karthik R Narasimhan, and Shunyu Yao. “Can Language Models Solve Olympiad Programming?” In:First Conference on Language Modeling. 2024.URL:https://openreview.net/forum?id=kGa4fMtP9l
work page 2024
-
[50]
Beam Search Strategies for Neural Machine Translation
Markus Freitag and Yaser Al-Onaizan. “Beam Search Strategies for Neural Machine Translation”. In:Proceedings of the First Workshop on Neural Machine Translation. Vancouver: Association for Computational Linguistics, Aug. 2017, pp. 56– 60.DOI:10.18653/v1/W17-3207.URL:https://aclanthology.org/W17-3207/
work page doi:10.18653/v1/w17-3207.url:https://aclanthology.org/w17-3207/ 2017
-
[51]
NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation
Kaustubh Dhole et al. “NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation”. In:Northern European Journal of Language Technology9 (2023).DOI:10 . 3384 / nejlt . 2000 - 1533 . 2023 . 4725.URL:https : //aclanthology.org/2023.nejlt-1.5/
work page 2023
-
[52]
Prompt Perturbation Consistency Learning for Robust Language Models
Yao Qiang et al. “Prompt Perturbation Consistency Learning for Robust Language Models”. In:Findings of the Association for Computational Linguistics: EACL 2024. St. Julian’s, Malta: Association for Computational Linguistics, Mar. 2024, pp. 1357–1370.URL:https://aclanthology.org/2024.findings-eacl.91/
work page 2024
-
[53]
Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Ra- tionales?
Zhanke Zhou et al. “Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Ra- tionales?” In:The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024.URL:https : / / openreview.net/forum?id=FbuODM02ra
work page 2024
-
[54]
Lalchand Pandia and Allyson Ettinger. “Sorting through the noise: Testing robustness of information processing in pre- trained language models”. In:Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 1583–1596.DOI: 10.18...
work page doi:10.18653/v1/2021.emnlp-main.119.url:https://aclanthology.org/2021.emnlp-main.119/ 2021
-
[55]
Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training
Feiteng Fang et al. “Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training”. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 10028–10039. DOI: 10.18653/v1/2024.acl-lo...
-
[56]
Models in the Wild: On Corruption Robustness of Neural NLP Systems
Barbara Rychalska, Dominika Basaj, Alicja Gosiewska, and Przemysław Biecek. “Models in the Wild: On Corruption Robustness of Neural NLP Systems”. In: Neural Information Processing. Ed. by Tom Gedeon, Kok Wai Wong, and Minho Lee. Cham: Springer International Publishing, 2019, pp. 235–247. ISBN: 978-3-030-36718-3
-
[57]
Understanding Programs by Exploiting (Fuzzing) Test Cases
Jianyu Zhao, Yuyang Rong, Yiwen Guo, Yifeng He, and Hao Chen. “Understanding Programs by Exploiting (Fuzzing) Test Cases”. In: Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics, July 2023, pp. 10667–10679. DOI: 10.18653/v1/2023.findings-acl.678. URL: https://aclanthology.org/2023.fi...
-
[58]
Continuous Fuzzing with libFuzzer and AddressSanitizer
Kosta Serebryany. “Continuous Fuzzing with libFuzzer and AddressSanitizer”. In: 2016 IEEE Cybersecurity Development (SecDev). 2016, pp. 157–157. URL: https://doi.org/10.1109/SecDev.2016.043
-
[59]
Prompt Fuzzing for Fuzz Driver Generation
Yunlong Lyu, Yuxuan Xie, Peng Chen, and Hao Chen. “Prompt Fuzzing for Fuzz Driver Generation”. In: Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. CCS ’24. Salt Lake City, UT, USA: Association for Computing Machinery, 2024, pp. 3793–3807. ISBN: 9798400706363. DOI: 10.1145/3658644.3670396. URL: https://doi.org/10.1145/3...
-
[60]
Yujia Li et al. “Competition-level code generation with AlphaCode”. In: Science 378.6624 (2022), pp. 1092–1097. DOI: 10.1126/science.abq1158. eprint: https://www.science.org/doi/pdf/10.1126/science.abq1158. URL: https://www.science.org/doi/abs/10.1126/science.abq1158
-
[61]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models”. In: The Eleventh International Conference on Learning Representations. 2023. URL: https://openreview.net/forum?id=1PL1NIMMrw
-
[62]
Hierarchical Clustering: Objective Functions and Algorithms
Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. “Hierarchical Clustering: Objective Functions and Algorithms”. In: J. ACM 66.4 (June 2019). ISSN: 0004-5411. DOI: 10.1145/3321386. URL: https://doi.org/10.1145/3321386
-
[63]
Revisiting agglomerative clustering
Eric K. Tokuda, Cesar H. Comin, and Luciano da F. Costa. “Revisiting agglomerative clustering”. In: Physica A: Statistical Mechanics and its Applications 585 (2022), p. 126433. ISSN: 0378-4371. DOI: 10.1016/j.physa.2021.126433. URL: https://www.sciencedirect.com/science/article/pii/S0378437121007068
-
[64]
Binyuan Hui et al. Qwen2.5-Coder Technical Report. 2024. arXiv: 2409.12186 [cs.CL]. URL: https://arxiv.org/abs/2409.12186
-
[65]
Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention”. In: Proceedings of the 29th Symposium on Operating Systems Principles. SOSP ’23. Koblenz, Germany: Association for Computing Machinery, 2023, pp. 611–626. ISBN: 9798400702297. DOI: 10.1145/3600006.3613165. URL: https://doi.org/10.1145/3600006.3613165
-
[67]
Dacheng Li et al. S*: Test Time Scaling for Code Generation (Source code). 2025. URL: https://github.com/NovaSky-AI/SkyThought/tree/0d190f11fd8e885bbe113aeccacba5ccde5b1102/skythought/test-time-scaling
-
[68]
Google. Gemini 2.5 Flash-Lite. URL: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-lite
-
[69]
Peter Rousseeuw. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis”. In: J. Comput. Appl. Math. 20.1 (Nov. 1987), pp. 53–65. ISSN: 0377-0427. DOI: 10.1016/0377-0427(87)90125-7. URL: https://doi.org/10.1016/0377-0427(87)90125-7
-
[70]
Chris Yuhao Liu et al. Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy. 2026. arXiv: 2507.01352 [cs.CL]. URL: https://arxiv.org/abs/2507.01352
-
[71]
QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs
Koen Claessen and John Hughes. “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs”. In: Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming. ICFP ’00. New York, NY, USA: Association for Computing Machinery, 2000, pp. 268–279. ISBN: 1581132026. URL: https://doi.org/10.1145/351240.351266
-
[72]
Property-Based Testing in Practice
Harrison Goldstein, Joseph W. Cutler, Daniel Dickstein, Benjamin C. Pierce, and Andrew Head. “Property-Based Testing in Practice”. In: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ICSE ’24. Lisbon, Portugal: Association for Computing Machinery, 2024. ISBN: 9798400702174. URL: https://doi.org/10.1145/3597503.3639581
-
[73]
Oracle-Guided Program Selection from Large Language Models
Zhiyu Fan, Haifeng Ruan, Sergey Mechtaev, and Abhik Roychoudhury. “Oracle-Guided Program Selection from Large Language Models”. In: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. ISSTA 2024. Vienna, Austria: Association for Computing Machinery, 2024, pp. 628–640. ISBN: 9798400706127. DOI: 10.1145/3650212.3680308. URL: https://doi.org/10.1145/3650212.3680308
-
[74]
A Survey on Large Language Models for Code Generation
Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. “A Survey on Large Language Models for Code Generation”. In: ACM Trans. Softw. Eng. Methodol. 35.2 (Jan. 2026). ISSN: 1049-331X. DOI: 10.1145/3747588. URL: https://doi.org/10.1145/3747588
-
[75]
An Yang et al. Qwen3 Technical Report. 2025. arXiv: 2505.09388 [cs.CL]. URL: https://arxiv.org/abs/2505.09388
-
[76]
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning
Daya Guo et al. “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning”. In: Nature 645.8081 (Sept. 2025), pp. 633–638. ISSN: 1476-4687. DOI: 10.1038/s41586-025-09422-z. URL: http://dx.doi.org/10.1038/s41586-025-09422-z
-
[77]
SWE-bench: Can Language Models Resolve Real-world Github Issues?
Carlos E Jimenez et al. “SWE-bench: Can Language Models Resolve Real-world Github Issues?” In: The Twelfth International Conference on Learning Representations. 2024. URL: https://openreview.net/forum?id=VTF8yNQM66
-
[78]
Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study
Qi Guo et al. “Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study”. In: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ICSE ’24. Lisbon, Portugal: Association for Computing Machinery, 2024. ISBN: 9798400702174. DOI: 10.1145/3597503.3623306. URL: https://doi.org/10.1145/3597503.3623306
-
[79]
Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. “Automated Program Repair in the Era of Large Pre-trained Language Models”. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 2023, pp. 1482–1494. DOI: 10.1109/ICSE48619.2023.00129.
-
[80]
Yifeng He et al. “UniTSyn: A Large-Scale Dataset Capable of Enhancing the Prowess of Large Language Models for Program Testing”. In: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. ISSTA 2024. Vienna, Austria: Association for Computing Machinery, 2024, pp. 1061–1072. ISBN: 9798400706127. DOI: 10.1145/3650212.3680342. URL: https://doi.org/10.1145/3650212.3680342
-
[81]
The Program Testing Ability of Large Language Models for Code
Weimin Xiong, Yiwen Guo, and Hao Chen. “The Program Testing Ability of Large Language Models for Code”. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. Miami, Florida, US: Association for Computational Linguistics, Nov. 2024, pp. 23–34. DOI: 10.18653/v1/2024.emnlp-industry.3. URL: https://aclantho...