LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Pith reviewed 2026-05-10 17:27 UTC · model grok-4.3
The pith
LiveCodeBench gathers recent contest problems to evaluate LLMs on code without training data contamination and across multiple skills.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper argues that LiveCodeBench supplies a contamination-free and holistic evaluation of LLMs for code. The benchmark continuously collects problems from contests on LeetCode, AtCoder, and CodeForces, and currently holds 400 high-quality problems published between May 2023 and May 2024. It covers code generation, self-repair, code execution, and test output prediction, and the paper reports empirical results on contamination, model comparisons, and overfitting to static benchmarks.
What carries the argument
LiveCodeBench, a dynamic set of contest problems paired with prompts for multiple code tasks that updates over time.
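The date-based mechanism can be illustrated with a minimal sketch: keep only the problems released after a given model's training cutoff. The `Problem` record, the `problems_after_cutoff` helper, the sample entries, and the cutoff date are all hypothetical and are not taken from the paper or its toolkit.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    """Hypothetical record for one contest problem."""
    problem_id: str
    platform: str
    release_date: date


def problems_after_cutoff(problems, cutoff):
    """Keep only problems released strictly after a model's training cutoff."""
    return [p for p in problems if p.release_date > cutoff]


# Illustrative data only; these are not real LiveCodeBench entries.
problems = [
    Problem("a1", "LeetCode", date(2023, 5, 14)),
    Problem("b2", "AtCoder", date(2023, 11, 4)),
    Problem("c3", "CodeForces", date(2024, 3, 20)),
]

# Assumed cutoff for a hypothetical model.
fresh = problems_after_cutoff(problems, cutoff=date(2023, 9, 1))
print([p.problem_id for p in fresh])  # ['b2', 'c3']
```

Because the filter depends on the cutoff, each model would be scored on its own subset of problems, so comparisons across models with different cutoffs need a shared window.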
If this is right
- Static benchmarks can no longer be trusted to measure true progress once contamination is possible.
- Models must succeed on repair, execution, and prediction tasks to demonstrate broad code competence.
- Ongoing collection of new problems keeps the evaluation relevant as models improve.
- Public release of prompts and completions enables targeted analysis of where models fail.
- The provided toolkit supports adding new evaluation scenarios without rebuilding the benchmark.
Where Pith is reading between the lines
- Similar live-collection methods could be applied to other domains such as mathematics problem solving to reduce leakage.
- Model developers may begin training or fine-tuning specifically on recent contest data to improve scores.
- Production teams could adopt this benchmark to select models less likely to rely on memorized solutions.
- The approach implies that all future code benchmarks should incorporate temporal freshness as a core requirement.
Load-bearing premise
Problems taken from recent contests have not entered the training data of the evaluated models and adequately represent real coding demands.
What would settle it
Discovery that a substantial number of LiveCodeBench problems appear in the pretraining corpora of the tested models, or that model rankings on this benchmark match those on older contaminated ones exactly.
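One way to check the second condition is a rank correlation between model scores on a static benchmark and on LiveCodeBench. The sketch below implements Spearman's coefficient in plain Python; the function names and the score lists are made up for illustration and are not results from the paper. A coefficient of exactly 1.0 would mean the two rankings are identical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for four hypothetical models; not results from the paper.
static_scores = [0.80, 0.65, 0.70, 0.40]
live_scores = [0.45, 0.30, 0.20, 0.10]
rho = spearman(static_scores, live_scores)
print(round(rho, 3))  # 0.8
```

In this toy data the second and third models swap places between the two benchmarks, which is why the coefficient falls below 1.0.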
Original abstract
Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and May 2024. We have evaluated 18 base LLMs and 34 instruction-tuned LLMs on LiveCodeBench. We present empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks as well as individual model comparisons. We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and model
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LiveCodeBench, a benchmark that continuously collects coding problems from LeetCode, AtCoder, and CodeForces contests and currently hosts 400 problems published between May 2023 and May 2024. It evaluates 18 base and 34 instruction-tuned LLMs (52 total) not only on code generation but also on self-repair, code execution, and test output prediction. The work reports empirical findings on contamination, holistic performance, and potential overfitting to static benchmarks such as HumanEval and MBPP, and the authors plan to release all prompts, completions, and a toolkit for extension.
Significance. If the contamination-free claim can be substantiated, LiveCodeBench would provide a valuable, evolving resource for assessing generalization in code LLMs beyond saturated static benchmarks. The multi-capability scope and public release of data and completions are strengths that support reproducibility and community use.
Major comments (2)
- [Abstract and Introduction] The central claim that LiveCodeBench is 'contamination-free' rests on the recency of the 400 problems (May 2023–May 2024) without describing any direct verification procedure, such as n-gram overlap searches against public web archives, GitHub repositories, or known training data proxies where contest problems and solutions are routinely posted shortly after release.
- [Evaluation and Results sections] The reported empirical findings on contamination and overfitting (performance gaps versus HumanEval/MBPP) do not specify how problem quality was filtered, how test cases were validated, or whether prompt variations and temperature settings were controlled across the 52 models; without these details the attribution of differences to contamination versus difficulty or prompt sensitivity remains unclear.
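The n-gram overlap search named in the first major comment could look roughly like the sketch below. It measures what fraction of a problem statement's word n-grams also appear in a candidate document. The function names, the word-level tokenization, the default `n=8`, and the sample strings are assumptions for illustration; the paper does not describe such a procedure.

```python
import re


def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(problem_text, corpus_text, n=8):
    """Fraction of the problem's n-grams that also occur in corpus_text."""
    problem_grams = word_ngrams(problem_text, n)
    if not problem_grams:
        # Statement shorter than n words: nothing to compare.
        return 0.0
    corpus_grams = word_ngrams(corpus_text, n)
    return len(problem_grams & corpus_grams) / len(problem_grams)


# Invented strings, not real contest or web data.
problem = (
    "Given an array of integers return the length of the longest "
    "strictly increasing subsequence"
)
leaked = (
    "Blog post: given an array of integers, return the length of the "
    "longest strictly increasing subsequence. Solution follows."
)
unrelated = "A short note about cooking pasta at home on a weekday evening"

print(ngram_overlap(problem, leaked, n=5))     # 1.0
print(ngram_overlap(problem, unrelated, n=5))  # 0.0
```

A high overlap only shows that the text is publicly available in the searched source. Whether a given model was trained on it cannot be established this way for proprietary corpora.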
Minor comments (2)
- [Data Collection] The manuscript should clarify the exact criteria used to select and deduplicate the 400 problems across the three platforms.
- [Figures and Tables] Figure captions and table headers could more explicitly state the number of problems per capability (generation, repair, execution, prediction) to aid interpretation of the holistic comparisons.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, providing clarifications and committing to revisions that strengthen the substantiation of our claims without altering the core contributions.
Point-by-point responses
-
Referee: [Abstract and Introduction] The central claim that LiveCodeBench is 'contamination-free' rests on the recency of the 400 problems (May 2023–May 2024) without describing any direct verification procedure, such as n-gram overlap searches against public web archives, GitHub repositories, or known training data proxies where contest problems and solutions are routinely posted shortly after release.
Authors: We appreciate the referee highlighting the need for explicit verification procedures to support the contamination-free claim. The manuscript grounds this claim in the recency of the problems (May 2023–May 2024) relative to the documented training data cutoffs of the 52 evaluated models. To address the concern directly, the revised manuscript will include a new subsection under Data Collection that reports n-gram overlap analyses (n=5–10) performed against public web archives, GitHub repositories of contest solutions, and other accessible proxies. We will present the overlap statistics, discuss any detected cases, and explain why recency combined with these checks provides a practical and defensible basis for the claim, while acknowledging the inherent limits of verifying against proprietary training corpora.
Revision: yes
-
Referee: [Evaluation and Results sections] The reported empirical findings on contamination and overfitting (performance gaps versus HumanEval/MBPP) do not specify how problem quality was filtered, how test cases were validated, or whether prompt variations and temperature settings were controlled across the 52 models; without these details the attribution of differences to contamination versus difficulty or prompt sensitivity remains unclear.
Authors: We agree that greater methodological transparency is required to support the attribution of results. The original manuscript notes that problems are drawn directly from official contest platforms and are therefore high-quality, but we will expand the Evaluation and Results sections in the revision. Specifically, we will add: (1) explicit quality filtering criteria, including requirements for complete problem statements, at least three test cases per problem, and balanced coverage of difficulty levels across LeetCode, AtCoder, and CodeForces; (2) test-case validation details, describing automated execution against reference solutions plus manual spot-checks for correctness and edge-case coverage; and (3) experimental controls, confirming use of fixed prompt templates, temperature=0 for deterministic code generation, and identical decoding parameters across all models. These additions will make clear that observed gaps versus HumanEval/MBPP are not artifacts of inconsistent prompting or unvalidated tests.
Revision: yes
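The test-case validation step described in the second response (automated execution against a reference solution, with a minimum of three cases per problem) might be sketched as follows. This is a simplified illustration that treats the reference solution as a Python callable rather than a stdin/stdout program; `validate_test_cases`, `max_pair_sum`, and the sample cases are hypothetical and do not come from the paper.

```python
def validate_test_cases(reference_solution, test_cases, min_cases=3):
    """Check stored expected outputs against a trusted reference solution.

    test_cases is a list of (args_tuple, expected_output) pairs.
    Returns a list of human-readable issues; an empty list means the
    test cases passed this check.
    """
    issues = []
    if len(test_cases) < min_cases:
        issues.append(f"only {len(test_cases)} test cases, need {min_cases}")
    for index, (args, expected) in enumerate(test_cases):
        try:
            actual = reference_solution(*args)
        except Exception as exc:  # a crash in the reference is itself a finding
            issues.append(f"case {index}: reference raised {exc!r}")
            continue
        if actual != expected:
            issues.append(f"case {index}: expected {expected!r}, got {actual!r}")
    return issues


# Toy problem and reference solution, invented for illustration.
def max_pair_sum(nums):
    """Return the sum of the two largest numbers in nums."""
    top_two = sorted(nums)[-2:]
    return sum(top_two)


cases = [(([1, 2, 3],), 5), (([5, 5],), 10), (([-1, -2, -3],), -3)]
print(validate_test_cases(max_pair_sum, cases))  # []
```

A check like this only confirms that the stored outputs agree with the reference. It says nothing about edge-case coverage, which is why the response pairs it with manual spot-checks.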
Circularity Check
No circularity: empirical benchmark built from external contest data
Full rationale
The paper constructs LiveCodeBench by collecting 400 problems published May 2023–May 2024 from independent contest platforms (LeetCode, AtCoder, CodeForces) and evaluates 52 LLMs on them, releasing prompts and completions. No derivations, equations, fitted parameters, or predictions appear; the contamination-free claim rests on problem recency and external sources rather than any self-referential definition, self-citation chain, or renaming of prior results. The central claims are therefore independent of the paper's own outputs and do not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Problems from recent LeetCode, AtCoder, and CodeForces contests are free from contamination in current LLM training corpora.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities
-
IndisputableMonolith.Foundation.PhiForcing · phi_equation · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
we have evaluated 18 base LLMs and 34 instruction-tuned LLMs on LiveCodeBench
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Unsteady Metrics and Benchmarking Cultures of AI Model Builders
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
-
FlowCompile: An Optimizing Compiler for Structured LLM Workflows
FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
-
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
-
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
-
Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-train...
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
DuST uses on-policy RL to train code models on ranking their own sampled solutions by sandbox execution correctness, improving judgment NDCG, pass@1, and Best-of-4 accuracy while showing that SFT on the same data does...
-
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.
-
Test-Time Speculation
Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
-
ProactBench: Beyond What The User Asked For
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
-
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
-
TeamBench: Evaluating Agent Coordination under Enforced Role Separation
Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
-
ARIADNE: Agentic Reward-Informed Adaptive Decision Exploration via Blackboard-Driven MCTS for Competitive Program Generation
ARIADNE combines blackboard architecture with MCTS to coordinate strategy, code, test, evaluation, and repair stages, yielding higher Pass@1 scores than prior LLM baselines on APPS, CodeContests, and related benchmarks.
-
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
-
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
-
Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correc...
-
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
-
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
-
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs
Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constra...
-
OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve pe...
-
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
-
Super Apriel: One Checkpoint, Many Speeds
A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.
-
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
-
Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...
-
CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation
CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.
-
Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...
-
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneo...
-
Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.
-
Think Anywhere in Code Generation
Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.
-
BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations
BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.
-
LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.
-
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.
-
Scalable Token-Level Hallucination Detection in Large Language Models
TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...
-
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
-
Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning
ProFIL trains an activation probe on a frozen base model to zero advantages on theatrical post-commitment rollouts in GRPO, cutting theater 11-100%, raising faithful fractions, and shortening chains 4-19% without accu...
-
gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy
LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.
-
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...
-
Edit-Based Refinement for Parallel Masked Diffusion Language Models
ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.
-
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
-
VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation
VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.
-
Coding Agents Don't Know When to Act
Coding agents exhibit action bias by proposing undesirable changes on already-fixed issues 35-65% of the time, and explicit reproduction instructions only partially mitigate this while creating new abstention errors.
-
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
-
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...
-
Learning Agent Routing From Early Experience
BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.
-
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
-
Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation
A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 ...
-
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
-
Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning
REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.
-
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.
-
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specificatio...
Reference graph
Works this paper leans on
-
[1]
Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x , author =. arXiv preprint arXiv:2303.17568 , year =
-
[6]
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages =
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation , author =. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages =
work page 2021
-
[7]
The Eleventh International Conference on Learning Representations , year =
CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis , author =. The Eleventh International Conference on Learning Representations , year =
-
[9]
CodeT5+: Open code large language models for code understanding and generation , author =. arXiv preprint arXiv:2305.07922 , year =
-
[10]
Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming , pages =
A systematic evaluation of large language models of code , author =. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming , pages =
-
[13]
arXiv preprint arXiv:2212.10264 , year=
ReCode: Robustness Evaluation of Code Generation Models , author =. arXiv preprint arXiv:2212.10264 , year =
-
[16]
Advances in Neural Information Processing Systems , volume =
Unsupervised translation of programming languages , author =. Advances in Neural Information Processing Systems , volume =
-
[17]
arXiv preprint arXiv:2206.08474 , year=
Xlcost: A benchmark dataset for cross-lingual code intelligence , author =. arXiv preprint arXiv:2206.08474 , year =
-
[18]
arXiv preprint arXiv:2305.04032 , year =
ToolCoder: Teach Code Generation Models to use APIs with search tools , author =. arXiv preprint arXiv:2305.04032 , year =
-
[19]
Natural language to code generation in interactive data science notebooks , author =. arXiv preprint arXiv:2212.09248 , year =
-
[20]
International Conference on Machine Learning , pages =
DS-1000: A natural and reliable benchmark for data science code generation , author =. International Conference on Machine Learning , pages =. 2023 , organization =
2023
-
[21]
Proceedings of the 44th International Conference on Software Engineering , pages =
Jigsaw: Large language models meet program synthesis , author =. Proceedings of the 44th International Conference on Software Engineering , pages =
-
[22]
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
Codesearchnet challenge: Evaluating the state of semantic code search , author =. arXiv preprint arXiv:1909.09436 , year =
work page internal anchor Pith review arXiv 1909
-
[23]
Mistral 7B , author =. arXiv preprint arXiv:2310.06825 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
Textbooks Are All You Need , author =. arXiv preprint arXiv:2306.11644 , year =
work page internal anchor Pith review arXiv
-
[26]
Advances in Neural Information Processing Systems , volume =
Training language models to follow instructions with human feedback , author =. Advances in Neural Information Processing Systems , volume =
-
[27]
Advances in neural information processing systems , volume =
Language models are few-shot learners , author =. Advances in neural information processing systems , volume =
-
[29]
Michael Royzen and Justin Wei and Russell Coleman , title =. 2023 , url =
work page 2023
-
[30]
Advances in Neural Information Processing Systems , volume =
Chain-of-thought prompting elicits reasoning in large language models , author =. Advances in Neural Information Processing Systems , volume =
-
[31]
arXiv preprint arXiv:2210.00848 , year =
I speak, you verify: Toward trustworthy neural program synthesis , author =. arXiv preprint arXiv:2210.00848 , year =
-
[33]
Coder reviewer reranking for code generation. International Conference on Machine Learning, 2023.
-
[34]
Code Execution with Pre-trained Language Models. arXiv preprint arXiv:2305.05383.
-
[35]
Codescore: Evaluating code generation by learning code execution. arXiv preprint arXiv:2301.09043.
-
[36]
Test-Case-Driven Programming Understanding in Large Language Models for Better Code Generation. arXiv preprint arXiv:2309.16120.
-
[37]
Lever: Learning to verify language-to-code generation with execution. International Conference on Machine Learning, 2023.
-
[38]
Natural language to code translation with execution. arXiv preprint arXiv:2204.11454.
-
[39]
CodeGen-Test: An Automatic Code Generation Model Integrating Program Test Information. arXiv preprint arXiv:2202.07612.
-
[40]
Language models can teach themselves to program better. arXiv preprint arXiv:2207.14502.
-
[44]
Execution-based evaluation for open-domain code generation. arXiv preprint arXiv:2212.10481.
-
[45]
Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems.
-
[46]
Large Language Models for Software Engineering: Survey and Open Problems. arXiv preprint arXiv:2310.03533.
-
[47]
Large language models meet NL2Code: A survey. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
-
[49]
The false promise of imitating proprietary LLMs. arXiv preprint arXiv:2305.15717.
-
[50]
Self-Edit: Fault-Aware Code Editor for Code Generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), July.
-
[53]
Pangu-coder2: Boosting large language models for code with ranking feedback. arXiv preprint arXiv:2307.14936.
-
[54]
Planning with large language models for code generation. arXiv preprint arXiv:2303.05510.
-
[55]
NL2Type: inferring JavaScript function types from natural language information. 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019.
-
[56]
TypeT5: Seq2seq Type Inference using Static Analysis. arXiv preprint arXiv:2303.09564.
-
[57]
Type4py: Practical deep similarity learning-based type inference for Python. Proceedings of the 44th International Conference on Software Engineering.
-
[58]
ATOM: Commit message generation based on abstract syntax tree and hybrid ranking. IEEE Transactions on Software Engineering, 2020.
-
[61]
Summarizing source code using a neural attention model. 54th Annual Meeting of the Association for Computational Linguistics, 2016.
-
[62]
A neural model for generating natural language summaries of program subroutines. 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019.
-
[64]
AVATAR: A Parallel Corpus for Java-Python Program Translation. arXiv preprint arXiv:2108.11590 (2023).
-
[65]
DeepPERF: A Deep Learning-Based Approach For Improving Software Performance. arXiv preprint arXiv:2206.13619.
-
[67]
Deepfix: Fixing common C language errors by deep learning. Proceedings of the AAAI Conference on Artificial Intelligence.
-
[68]
Inferfix: End-to-end program repair with LLMs. arXiv preprint arXiv:2303.07263.
-
[69]
Fixeval: Execution-based evaluation of program fixes for competitive programming problems.
-
[70]
An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Transactions on Software Engineering and Methodology (TOSEM), 2019.
-
[71]
Practical program repair in the era of large pre-trained language models. arXiv preprint arXiv:2210.14179.
- [72]
-
[73]
Tfix: Learning to fix coding errors with a text-to-text transformer. International Conference on Machine Learning, 2021.
-
[74]
LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations. arXiv preprint arXiv:2303.09384.
-
[75]
Asleep at the keyboard? Assessing the security of GitHub Copilot’s code contributions. 2022 IEEE Symposium on Security and Privacy (SP), 2022.
-
[76]
Can we generate shellcodes via natural language? An empirical study. Automated Software Engineering, 2022.
-
[77]
Repository-level prompt generation for large language models of code. International Conference on Machine Learning, 2023.
- [78]
-
[80]
Methods2Test: A dataset of focal methods mapped to test cases. Proceedings of the 19th International Conference on Mining Software Repositories.
-
[81]
On learning meaningful assert statements for unit test cases. Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering.
-
[82]
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288.
-
[83]
Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. arXiv preprint arXiv:2307.02477.
-
[84]
Large language models can be easily distracted by irrelevant context. International Conference on Machine Learning, 2023.
-
[85]
Understanding by understanding not: Modeling negation in language models. arXiv preprint arXiv:2105.03519.
-
[86]
Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
-
[87]
The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python. arXiv preprint arXiv:2305.15507.
-
[88]
GPT-4 Can't Reason. arXiv preprint arXiv:2308.03762.
-
[90]
A Survey on Language Models for Code. 2023.
-
[91]
On the paradox of learning to reason from data. arXiv preprint arXiv:2205.11502.
-
[92]
LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers. arXiv preprint arXiv:2310.15164.
-
[93]
Teaching arithmetic to small transformers. arXiv preprint arXiv:2307.03381, 2023.
-
[94]
What Algorithms Can Transformers Learn? A Study in Length Generalization. arXiv preprint arXiv:2310.16028.
-
[95]
Faith and Fate: Limits of Transformers on Compositionality. arXiv preprint arXiv:2305.18654.
-
[96]
The Expressive Power of Transformers with Chain of Thought. arXiv preprint arXiv:2310.07923, 2024.
-
[97]
Giannou, A., Rajput, S., Sohn, J.-y., Lee, K., Lee, J., et al. Looped transformers as programmable computers. arXiv preprint arXiv:2301.13196.
-
[98]
Can Transformers Learn to Solve Problems Recursively? arXiv preprint arXiv:2305.14699.
-
[99]
Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.
-
[100]
Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 2023.
-
[101]
LLMs cannot find reasoning errors, but can correct them! arXiv preprint arXiv:2311.08516.
-
[102]
InCoder: A Generative Model for Code Infilling and Synthesis. 2022.
-
[104]
Bender, Emily M. and Koller, Alexander. Climbing towards. 2020.
-
[105]
Provable Limitations of Acquiring Meaning from Ungrounded Form: What Will Future Language Models Understand? Transactions of the Association for Computational Linguistics.