LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs

Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, Xiaolong Xu · 2025 · arXiv 2504.14655

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

cs.SE · 2026-04-14 · accept · novelty 7.0

CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.

QiMeng-PRepair: Precise Code Repair via Edit-Aware Reward Optimization

cs.SE · 2026-04-07 · unverdicted · novelty 7.0

PRepair mitigates LLM over-editing in code repair via Self-Breaking bug injection and EA-GRPO training, improving precision by up to 31.4% on the fix₁@1 metric.

Think Anywhere in Code Generation

cs.SE · 2026-03-31 · unverdicted · novelty 7.0

Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.

Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution

cs.SE · 2026-02-27 · unverdicted · novelty 7.0

IFCodeEvolve synthesizes coding data via actor-schema co-evolution with MCTS, boosting a 32B model's performance to match proprietary SOTA on instruction following.

CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

cs.SE · 2025-10-21 · conditional · novelty 7.0

CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reasoning and test-output tasks.

Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

CodeThinker improves LLM code reasoning via consistency-based RL with stepwise training data, dynamic beam sampling, and consistency rewards, reaching SOTA on benchmarks with 4.3% gains on Qwen2.5-Coder-7B.

Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.

Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Hybrid-LoRA selectively full fine-tunes modules with high sensitivity to low-rank adaptation using a novel score and applies LoRA elsewhere, matching full fine-tuning at 10% budget and outperforming PEFT baselines by up to 5.65%.

SOD: Step-wise On-policy Distillation for Small Language Model Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs

cs.SE · 2026-05-07 · unverdicted · novelty 6.0

ASTOR improves a single code LLM across four tasks by 9.0-9.5% over the best specialist and 7.5-12.8% over prior multi-task RL baselines via utility-driven data scheduling and adaptive KL regularization.

Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation

cs.LG · 2026-04-26 · unverdicted · novelty 6.0

Synthetic reward hacking data does not capture natural hacking behaviors in code generation RL, causing monitors trained on it to generalize poorly compared to those trained on in-the-wild trajectories.

Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths

cs.SE · 2025-10-01 · conditional · novelty 6.0

PerfOrch is a four-agent multi-LLM system that uses offline profiling to build language-and-category rankings for routing tasks, achieving 97.19% and 95.83% pass@1 on HumanEval-X and EffiBench-X with generalization across benchmarks.

Skywork Open Reasoner 1 Technical Report

cs.LG · 2025-05-28 · conditional · novelty 4.0

Skywork-OR1 uses RL on distilled CoT models to lift math and coding benchmark accuracy by 13-15 points while open-sourcing everything.

citing papers explorer

Showing 13 of 13 citing papers.

CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation · cs.SE · 2026-04-14 · accept · none · ref 37
QiMeng-PRepair: Precise Code Repair via Edit-Aware Reward Optimization · cs.SE · 2026-04-07 · unverdicted · none · ref 4
Think Anywhere in Code Generation · cs.SE · 2026-03-31 · unverdicted · none · ref 24
Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution · cs.SE · 2026-02-27 · unverdicted · none · ref 28
CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment · cs.SE · 2025-10-21 · conditional · none · ref 27
Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning · cs.LG · 2026-05-18 · unverdicted · none · ref 40
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards · cs.CL · 2026-05-14 · unverdicted · none · ref 34
Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training · cs.LG · 2026-05-12 · unverdicted · none · ref 9
SOD: Step-wise On-policy Distillation for Small Language Model Agents · cs.CL · 2026-05-08 · unverdicted · none · ref 75
Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs · cs.SE · 2026-05-07 · unverdicted · none · ref 24
Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation · cs.LG · 2026-04-26 · unverdicted · none · ref 13
Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths · cs.SE · 2025-10-01 · conditional · none · ref 68
Skywork Open Reasoner 1 Technical Report · cs.LG · 2025-05-28 · conditional · none · ref 30
