hub Mixed citations

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

2024 · cs.SE · arXiv 2401.03065

Mixed citation behavior. Most common role is background (45%).

40 Pith papers citing it

Background 45% of classified citations

abstract

We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark which can be used to create future variations of the benchmark. Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval do not show the same improvements on our benchmark. Third, we show that simple CoT and fine-tuning schemes can improve performance on our benchmark but remain far from solving it. The best setup, GPT-4 with chain of thought (CoT), achieves a pass@1 of 75% and 81% on input and output prediction, respectively. In contrast, Code Llama 34B achieves a pass@1 of 50% and 46% on input and output prediction, highlighting the gap between open- and closed-source models. As no model is close to acing CRUXEval, we provide examples of consistent GPT-4 failures on simple programs as a lens into its code reasoning capabilities and areas for improvement.
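The two tasks named in the abstract can be illustrated with a minimal sketch. The function `f`, the sample pair, and the helper names below are hypothetical and are not taken from the benchmark; the sketch also assumes that input prediction accepts any input that reproduces the given output, and that each problem gets a single attempt, in which case pass@1 is simply the fraction of problems solved.

```python
from typing import Any, Callable, Sequence


# Hypothetical sample in the spirit of the benchmark: a short function
# plus one concrete input-output pair. Not an actual CRUXEval item.
def f(text: str) -> str:
    result = []
    for ch in text:
        if ch.isalpha():
            result.append(ch.upper())
    return "".join(result)


SAMPLE_INPUT = "a1b2"
SAMPLE_OUTPUT = "AB"


def check_output_prediction(func: Callable[[Any], Any],
                            given_input: Any,
                            predicted_output: Any) -> bool:
    """Output prediction: given the function and an input, guess the output."""
    return func(given_input) == predicted_output


def check_input_prediction(func: Callable[[Any], Any],
                           predicted_input: Any,
                           given_output: Any) -> bool:
    """Input prediction: given the function and an output, propose an input.

    Assumption: any input that produces the given output counts as correct.
    A proposed input that makes the function raise counts as a failure.
    """
    try:
        return func(predicted_input) == given_output
    except Exception:
        return False


def pass_at_1(results: Sequence[bool]) -> float:
    """Fraction of problems solved, assuming one attempt per problem."""
    return sum(results) / len(results) if results else 0.0
```

Under these assumptions, `check_input_prediction(f, "x-y", SAMPLE_OUTPUT)` would not pass, while `check_input_prediction(f, "a-b", SAMPLE_OUTPUT)` would, even though `"a-b"` differs from the stored sample input.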

citation-role summary

background 5 · dataset 5 · method 1
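As a rough check on the 45% figure, the sketch below recomputes role shares from the counts listed above. It assumes the percentage is taken over the 11 classified citations rather than all 40 citing papers; the function name `role_shares` is hypothetical.

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 5, "dataset": 5, "method": 1}


def role_shares(counts: dict) -> dict:
    """Return each role's share of classified citations, in whole percent."""
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
```

Under this reading, background comes out at 45%, matching the figure reported on the page, and dataset comes out at the same share.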

citation-polarity summary

background 5 · use dataset 3 · unclear 2 · use method 1

representative citing papers

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

cs.AI · 2026-06-23 · unverdicted · novelty 7.0

TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.

RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations

cs.SE · 2026-05-25 · unverdicted · novelty 7.0

RepoMirage uses semantics-preserving perturbations on SWE-Bench to show code agents lack repository context reasoning, with performance falling sharply on extended structure tasks, and introduces RepoAnchor as a structure-first fix.

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

cs.SE · 2026-05-12 · unverdicted · novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Joint Consistency casts test-time aggregation as Ising-type energy minimization with pairwise LLM-judge interactions, subsuming voting methods and outperforming baselines across reasoning tasks.

Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation

cs.SE · 2026-04-23 · conditional · novelty 7.0

Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.

Evaluating LLMs Code Reasoning Under Real-World Context

cs.SE · 2026-04-14 · unverdicted · novelty 7.0

R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.

An Iterative Test-and-Repair Framework for Competitive Code Generation

cs.SE · 2026-04-07 · unverdicted · novelty 7.0

FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.

s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

cs.PL · 2026-03-15 · unverdicted · novelty 7.0

s2n-bignum-bench is a new benchmark requiring LLMs to synthesize HOL Light proofs for real-world low-level cryptographic assembly code.

Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings

cs.SE · 2025-12-16 · unverdicted · novelty 7.0

A new dataset and nine-metric majority-vote procedure show that existing code-reasoning benchmarks are dominated by lower-complexity problems that do not reflect real-world code.

Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software

cs.SE · 2025-10-17 · unverdicted · novelty 7.0

LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.

Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models

cs.SE · 2025-10-16 · unverdicted · novelty 7.0

LLMs achieve 81% coherent execution simulation on HumanEval but show mostly random or weak consistency across tests, with frontier models relying on natural language shortcuts instead of true program analysis.

CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation

cs.SE · 2025-04-30 · unverdicted · novelty 7.0

CodeFlowBench is a new benchmark with 5000+ problems and GitHub-sourced repos that evaluates LLMs on multi-turn code reuse using dependency-tree structural metrics, revealing performance drops as complexity rises.

CodeMind: Evaluating Large Language Models for Code Reasoning

cs.SE · 2024-02-15 · unverdicted · novelty 7.0

CodeMind evaluates ten LLMs on four benchmarks using three new code reasoning tasks, finding performance varies by model size and drops with complexity while showing no correlation with bug repair ability.

GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

LLMs can forecast GPU kernel performance accurately enough to serve as selective surrogates, allowing kernel searches to consider more candidates and recover faster kernels under fixed GPU evaluation budgets.

Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA

cs.SE · 2026-05-28 · unverdicted · novelty 6.0

Code-QA-Bench uses an answer-first pipeline and three-condition experiments to generate 628 tasks across 10 Python repositories and quantify that code access drives most performance gains while documentation adds only modest benefit on doc-dependent tasks.

MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

MONA integrates Nesterov acceleration into Muon's orthogonalization framework, reporting better convergence than Muon and AdamW on MoE models up to 68B parameters trained on 1T tokens and SOTA fine-tuning results.

Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

CodeThinker improves LLM code reasoning via consistency-based RL with stepwise training data, dynamic beam sampling, and consistency rewards, reaching SOTA on benchmarks with 4.3% gains on Qwen2.5-Coder-7B.

STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.

Teaching LLMs Program Semantics via Symbolic Execution Traces

cs.SE · 2026-05-07 · unverdicted · novelty 6.0

Training Qwen3-8B on symbolic execution traces from Soteria improves violation detection in C programs by over 17 points, transfers across five property types, and shows superadditive gains with chain-of-thought.

Hypothesis generation and updating in large language models

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

cs.SE · 2026-04-30 · unverdicted · novelty 6.0

Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

CoRE: A Fine-Grained Code Reasoning Benchmark Beyond Output Prediction

cs.SE · 2026-04-28 · unverdicted · novelty 6.0

CoRE benchmark shows frontier LLMs have large robustness gaps across equivalent code versions and often reach correct outputs via superficial execution without tracking intermediate states.

PrismaDV: Automated Task-Aware Data Unit Test Generation

cs.LG · 2026-04-23 · unverdicted · novelty 6.0

PrismaDV generates task-aware data unit tests by jointly analyzing downstream code and dataset profiles, outperforming task-agnostic baselines on new benchmarks spanning 60 tasks, with SIFTA enabling automatic prompt optimization that beats hand-written prompts.
