A new dataset and nine-metric majority-vote procedure show that existing code-reasoning benchmarks are dominated by lower-complexity problems that do not reflect real-world code.
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4
representative citing papers
citing papers explorer
- Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings
  A new dataset and nine-metric majority-vote procedure show that existing code-reasoning benchmarks are dominated by lower-complexity problems that do not reflect real-world code.
- Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning
  REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.
- Exploring Sustainability in Scientific Software through Code Quality & Test Coverage Metrics
  Sustainable scientific software projects exhibit higher and more consistent test coverage with clearer code-test correlations than unsustainable ones, while overall coverage remains low.
- JEDI: Java Evaluation of Declarative and Imperative Queries
  JEDI is a generated benchmark suite converting SQL queries into Java Stream and imperative implementations to evaluate performance and identify efficient parallelization strategies.