A Study of LLMs' Preferences for Libraries and Programming Languages

2025 · cs.SE · arXiv 2503.17181

9 Pith papers cite this work. Polarity classification is still indexing.

abstract

Despite the rapid progress of large language models (LLMs) in code generation, existing evaluations focus on functional correctness or syntactic validity, overlooking how LLMs make critical design choices such as which library or programming language to use. To fill this gap, we perform the first empirical study of LLMs' preferences for libraries and programming languages when generating code, covering eight diverse LLMs. We observe a strong tendency to overuse widely adopted libraries such as NumPy; in up to 45% of cases, this usage is not required and deviates from the ground-truth solutions. The LLMs we study also show a significant preference toward Python as their default language. For high-performance project initialisation tasks where Python is not the optimal language, it remains the dominant choice in 58% of cases, and Rust is not used once. These results highlight how LLMs prioritise familiarity and popularity over suitability and task-specific optimality, underscoring the need for targeted fine-tuning, data diversification, and evaluation benchmarks that explicitly measure language and library selection fidelity.

citation-role summary

background 1

citation-polarity summary

support 1

representative citing papers

FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

cs.AI · 2026-05-11 · conditional · novelty 8.0

FormalRewardBench is the first benchmark for reward models in formal theorem proving, consisting of 250 Lean 4 preference pairs that show frontier LLMs scoring 59.8% while specialized provers score only 24.4%.

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

cs.AI · 2026-02-11 · accept · novelty 8.0

ReplicatorBench evaluates LLM agents on replicating social and behavioral science claims across retrieval, computation, and interpretation stages, finding strength in experiment execution but weakness in resource retrieval.

The software space of science

cs.DL · 2026-04-26 · unverdicted · novelty 7.0

A network analysis of software mentions in 1.3 million papers identifies 520 tools in eight communities and shows disciplines maintain distinct, stable tool portfolios that are crystallizing toward common sets.

CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

cs.SE · 2026-04-14 · accept · novelty 7.0

CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.

Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations

cs.SE · 2026-02-05 · unverdicted · novelty 7.0

Agentic LLMs remain robust to renaming and insertion but degrade on composed transformations and deeper obfuscation in CTF tasks; the evaluation is enabled by a new Evolve-CTF tool that generates equivalent challenge families.

Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries

cs.SE · 2025-09-26 · unverdicted · novelty 7.0

A study of seven LLMs finds that realistic prompt variations trigger library hallucinations: one-character misspellings in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%. It also introduces LibHalluBench for evaluation.

Task Abstention for Large Language Models in Code Generation

cs.SE · 2026-05-16 · unverdicted · novelty 6.0

A distribution-free abstention rule grounded in multiple hypothesis testing uses execution consistency to let code LLMs avoid hallucination-prone tasks with theoretical guarantees.

FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents

cs.CL · 2026-05-04 · unverdicted · novelty 6.0

FlexSQL reaches 65.4% on Spider2-Snow by allowing agents to flexibly explore schemas, generate diverse plans, choose SQL or Python execution, and apply two-tiered repair.

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

cs.SE · 2026-05-20 · unverdicted · novelty 4.0

Empirical analysis of AI refactoring PRs shows quality attribute improvements in 22.5% of cases, new Pylint issues in 24.17%, and Bandit findings in 4.7%, yet a 73.5% developer acceptance rate.
