arxiv: 2503.17181 · v3 · submitted 2025-03-21 · 💻 cs.SE · cs.AI

A Study of LLMs' Preferences for Libraries and Programming Languages

Pith reviewed 2026-05-22 22:47 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords large language models · code generation · library selection · programming language preferences · empirical study · Python · NumPy · Rust

The pith

Large language models prefer popular libraries like NumPy and default to Python even when other choices are more suitable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how eight LLMs select libraries and programming languages during code generation. It finds that models overuse familiar options such as NumPy in up to 45 percent of cases where the ground-truth solution does not require it. The same models default to Python in 58 percent of high-performance tasks where it is not optimal and never select Rust in those cases. A reader would care because these selection habits affect the efficiency and appropriateness of code produced by widely used tools.

Core claim

The study reveals that LLMs exhibit a strong tendency to overuse widely adopted libraries such as NumPy, with this usage being unnecessary in up to 45% of cases and deviating from ground-truth solutions. The models also demonstrate a significant preference for Python as the default language, selecting it in 58% of high-performance project initialization tasks where it is not optimal, and never choosing Rust in those cases. This highlights how LLMs prioritize familiarity and popularity over suitability and task-specific optimality.

What carries the argument

Empirical measurement of library and language choices in generated code, scored against ground-truth solutions across multiple tasks and eight LLMs.
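One way such a measurement could be sketched for Python output is to parse each generated program, collect its top-level imports, and compare them with the imports of the reference solution. This is an illustrative sketch only, not the paper's tooling: the helper names `imported_libraries` and `extra_libraries` and the two sample programs are hypothetical.

```python
import ast


def imported_libraries(source: str) -> set[str]:
    """Return the top-level package names imported by a Python source string."""
    libs = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import numpy.linalg as la` contributes "numpy"
            libs.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # `from os.path import join` contributes "os"; relative imports are skipped
            libs.add(node.module.split(".")[0])
    return libs


def extra_libraries(generated: str, ground_truth: str) -> set[str]:
    """Libraries the generated code imports that the reference solution does not."""
    return imported_libraries(generated) - imported_libraries(ground_truth)


# Hypothetical example: the model reaches for NumPy, the reference does not.
generated = "import numpy as np\n\ndef mean(xs):\n    return float(np.mean(xs))\n"
reference = "import statistics\n\ndef mean(xs):\n    return statistics.mean(xs)\n"
print(extra_libraries(generated, reference))  # {'numpy'}
```

Comparing import sets only shows that a choice deviates from the reference; whether the deviation is unnecessary still depends on the ground truth being the required choice, which is the premise discussed further down.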

If this is right

  • Generated code may be less efficient in performance-critical settings because of language and library biases.
  • Targeted fine-tuning and data diversification could reduce unnecessary selection of popular options.
  • Evaluation benchmarks for code generation need to measure language and library selection fidelity in addition to correctness.
  • Existing correctness-focused tests may overlook design choices that affect real-world code quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Prompt engineering that explicitly requests consideration of alternative languages might mitigate the observed defaults.
  • The bias could slow adoption of efficient languages in AI-assisted projects if not addressed.
  • Extending the evaluation to additional domains and languages would test whether the preference pattern holds more broadly.

Load-bearing premise

The ground-truth solutions used for comparison represent the required or optimal choices for the evaluated tasks.

What would settle it

A new task set in which ground-truth solutions require a less popular but more suitable library or language, with LLMs still selecting popular alternatives at the same rates.

Figures

Figures reproduced from arXiv: 2503.17181 by Detlef Nauck, Don Syme, Helen Yannakoudakis, Jie M. Zhang, Joost Noppen, Lukas Twist, Mark Harman.

Figure 1: Library Preferences, Benchmark Tasks (RQ1). Libraries used by LLMs when generating code for tasks from the BigCodeBench dataset. For each LLM, the most-used libraries are given with the percentage of problems they were imported for, along with total unique libraries used. All libraries not shown are imported for less than 2% of problems.
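The quantities named in the Figure 1 caption (percentage of problems each library was imported for, total unique libraries, and the 2% display cut-off) could be tallied roughly as follows. This is a minimal sketch under assumed inputs: the function `library_usage` and the sample data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def library_usage(imports_per_problem: list[set[str]], threshold: float = 2.0):
    """Return ({library: % of problems importing it, if >= threshold}, unique-library count)."""
    total = len(imports_per_problem)
    if total == 0:
        return {}, 0
    # Each problem counts a library at most once because its imports are a set.
    counts = Counter(lib for libs in imports_per_problem for lib in libs)
    percentages = {lib: 100.0 * n / total for lib, n in counts.items()}
    shown = {lib: pct for lib, pct in percentages.items() if pct >= threshold}
    return shown, len(counts)


# Hypothetical data: one set of imported libraries per benchmark problem.
problems = [{"numpy", "pandas"}, {"numpy"}, set(), {"re"}]
shown, unique = library_usage(problems, threshold=30.0)
print(shown, unique)  # {'numpy': 50.0} 3
```

The threshold is raised to 30% in the usage example only so that the tiny sample filters something out; the caption's own cut-off is 2%.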
Figure 2: Case Analysis. GitHub growth statistics for libraries studied in case analysis. For each library, the accumulated GitHub stars are plotted over its lifetime (data from GitHub, December 2024). The top/bottom two graphs show case analysis for benchmark/project initialisation tasks.
Original abstract

Despite the rapid progress of large language models (LLMs) in code generation, existing evaluations focus on functional correctness or syntactic validity, overlooking how LLMs make critical design choices such as which library or programming language to use. To fill this gap, we perform the first empirical study of LLMs' preferences for libraries and programming languages when generating code, covering eight diverse LLMs. We observe a strong tendency to overuse widely adopted libraries such as NumPy; in up to 45% of cases, this usage is not required and deviates from the ground-truth solutions. The LLMs we study also show a significant preference toward Python as their default language. For high-performance project initialisation tasks where Python is not the optimal language, it remains the dominant choice in 58% of cases, and Rust is not used once. These results highlight how LLMs prioritise familiarity and popularity over suitability and task-specific optimality; underscoring the need for targeted fine-tuning, data diversification, and evaluation benchmarks that explicitly measure language and library selection fidelity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript reports the first empirical study of library and language selection preferences across eight LLMs in code-generation tasks. It claims that models overuse popular libraries (e.g., NumPy is used in up to 45% of cases where the ground-truth solution does not require it) and default to Python even for high-performance project initialization tasks (58% of cases, with Rust never selected). It concludes that LLMs systematically favor familiarity and popularity over task-specific optimality and suitability.

Significance. If the central empirical observations hold once the methodological concerns below are addressed, the work fills a clear gap in LLM code-generation evaluation by moving beyond functional correctness to design-choice fidelity. The multi-model coverage and concrete deviation percentages provide a useful baseline for future benchmarks and fine-tuning efforts aimed at diversifying language and library usage. The observational design is appropriate for the question posed.

major comments (2)
  1. [Abstract / Results] Abstract and results sections: The central claim that observed deviations demonstrate prioritization of familiarity over suitability requires that the ground-truth solutions are in fact the required or optimal choices for the tasks. No expert validation, performance benchmarking, or alternative-optimality metric is supplied to rule out the possibility that the LLMs are simply returning other valid (if different) solutions; this assumption is load-bearing for the interpretation.
  2. [Methodology] Methodology (task and ground-truth definition): Sample sizes, task definitions, and the precise criteria used to label a library or language choice as “not required” or “not optimal” are not detailed enough in the provided abstract to allow independent assessment of whether the 45% and 58% figures support the prioritization conclusion.
minor comments (1)
  1. [Abstract] Abstract: The sentence beginning “These results highlight…” uses a semicolon before the participial phrase “underscoring…”; a comma or rephrasing would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our methodology and claims.

  1. Referee: [Abstract / Results] Abstract and results sections: The central claim that observed deviations demonstrate prioritization of familiarity over suitability requires that the ground-truth solutions are in fact the required or optimal choices for the tasks. No expert validation, performance benchmarking, or alternative-optimality metric is supplied to rule out the possibility that the LLMs are simply returning other valid (if different) solutions; this assumption is load-bearing for the interpretation.

    Authors: We agree this is a substantive point. The ground-truth solutions were selected based on standard reference implementations that minimize dependencies or optimize for the task constraints (e.g., built-in functions only for library tasks; performance-oriented languages for initialization). However, to make this assumption more robust, we will add a dedicated paragraph in the methodology section describing the task construction process, include references to established performance comparisons for the language tasks, and expand the limitations section to explicitly discuss alternative valid solutions and their potential impact on the reported percentages. revision: yes

  2. Referee: [Methodology] Methodology (task and ground-truth definition): Sample sizes, task definitions, and the precise criteria used to label a library or language choice as “not required” or “not optimal” are not detailed enough in the provided abstract to allow independent assessment of whether the 45% and 58% figures support the prioritization conclusion.

    Authors: The full manuscript details these elements in Section 3 (Methodology), including sample sizes (100 tasks per library category across 8 models, 50 tasks for language preference), task definitions (e.g., data processing without external libraries, high-performance project initialization), and labeling criteria (a choice is labeled 'not required' if the reference solution completes the task using only language built-ins or the standard library, with no external imports needed). We will revise the abstract to include a concise summary of these details and add explicit cross-references from the results to the methodology section. revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational empirical comparison with no derivations or self-referential predictions.

This paper performs an empirical study by generating code with LLMs and directly comparing library/language choices against provided ground-truth solutions. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the derivation of the central claims. The analysis is self-contained as direct observation of outputs versus external benchmarks (ground-truth), with no reduction of results to the paper's own definitions or prior author work by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central observations rest on the assumption that the chosen tasks and ground-truth solutions are representative; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The selected coding tasks and ground-truth solutions are representative of real-world requirements and optimality.
    Used to classify LLM choices as overuse or deviation.

pith-pipeline@v0.9.0 · 5726 in / 1069 out tokens · 56081 ms · 2026-05-22T22:47:40.958409+00:00 · methodology


Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

    cs.AI 2026-05 conditional novelty 8.0

    FormalRewardBench is the first benchmark for reward models in formal theorem proving, consisting of 250 Lean 4 preference pairs that show frontier LLMs scoring 59.8% while specialized provers score only 24.4%.

  2. ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

    cs.AI 2026-02 accept novelty 8.0

    ReplicatorBench evaluates LLM agents on replicating social and behavioral science claims across retrieval, computation, and interpretation stages, finding strength in experiment execution but weakness in resource retrieval.

  3. The software space of science

    cs.DL 2026-04 unverdicted novelty 7.0

    A network analysis of software mentions in 1.3 million papers identifies 520 tools in eight communities and shows disciplines maintain distinct, stable tool portfolios that are crystallizing toward common sets.

  4. CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

    cs.SE 2026-04 accept novelty 7.0

    CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.

  5. Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations

    cs.SE 2026-02 unverdicted novelty 7.0

    Agentic LLMs remain robust to renaming and insertion but degrade on composed transformations and deeper obfuscation in CTF tasks, enabled by a new Evolve-CTF tool for generating equivalent challenge families.

  6. Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries

    cs.SE 2025-09 unverdicted novelty 7.0

    A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%,...

  7. Task Abstention for Large Language Models in Code Generation

    cs.SE 2026-05 unverdicted novelty 6.0

    A distribution-free abstention rule grounded in multiple hypothesis testing uses execution consistency to let code LLMs avoid hallucination-prone tasks with theoretical guarantees.

  8. FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    FlexSQL reaches 65.4% on Spider2-Snow by allowing agents to flexibly explore schemas, generate diverse plans, choose SQL or Python execution, and apply two-tiered repair.

  9. Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

    cs.SE 2026-05 unverdicted novelty 4.0

    Empirical analysis of AI refactoring PRs shows quality attribute improvements in 22.5% of cases with new Pylint issues in 24.17% and Bandit findings in 4.7%, yet 73.5% developer acceptance.
