Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
super hub Mixed citations
Program Synthesis with Large Language Models
Mixed citation behavior. Most common role is background (52%).
abstract
This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The M
authors
co-cited works
representative citing papers
Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.
A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.
BoLT is a benchmark of surrogate models fitted to real LLM experiment data that enables evaluation of Bayesian and black-box optimization methods on multi-fidelity, multi-objective, high-dimensional LLM tasks.
PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
Backdoored model code enables deterministic, verifiable stealing of sparse secrets during local LLM fine-tuning via tensor-rule matching and gradient injection, achieving over 98% strict attack success rate while bypassing DP-SGD and auditing defenses.
StabilizerBench is a new benchmark for evaluating AI agents on generating, optimizing, and making fault-tolerant stabilizer circuits for quantum error correction, with efficient verification and multi-tier scoring.
NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.
Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
Regression accumulation affects 40-73% of 8-turn LLM coding tasks on extended HumanEval+/MBPP+ benchmarks, with verification gates improving final-turn pass rates on prior tests.
Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.
ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.
FRAME adds a learnable fractional-Fourier order per expert in a MoE-LoRA setup so that low-rank updates are placed in the domain where they are most compact, yielding gains over fixed-domain baselines on LLaMA-3.1-8B and Qwen2.5-7B.
Multi-tier verification on VULBENCH-CPP shows AI-generated C++ code triggers confirmed runtime violations roughly twice as often as human code, while static analysis misleadingly indicates parity due to code length.
AxDafny achieves 92.7% verification success on DafnyBench (6.5 points above prior proof-hint baselines) via verifier-guided repair and introduces the LCB-Pro-Dafny benchmark of 250 problems.
Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.
AlgoBench creates traceable variants of competitive programming problems via constraint shifts that invalidate original algorithms, paired with complexity metrics that reveal LLMs often produce functionally correct but asymptotically unsuitable solutions.
citing papers explorer
-
BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks
BoLT is a benchmark of surrogate models fitted to real LLM experiment data that enables evaluation of Bayesian and black-box optimization methods on multi-fidelity, multi-objective, high-dimensional LLM tasks.
-
StabilizerBench: A Benchmark for AI-Assisted Quantum Error Correction Circuit Synthesis
StabilizerBench is a new benchmark for evaluating AI agents on generating, optimizing, and making fault-tolerant stabilizer circuits for quantum error correction, with efficient verification and multi-tier scoring.
-
Regression Accumulation in Multi-Turn LLM Programming Conversations
Regression accumulation affects 40-73% of 8-turn LLM coding tasks on extended HumanEval+/MBPP+ benchmarks, with verification gates improving final-turn pass rates on prior tests.
-
Signature filtering: a lightweight enhancement for statistical watermark detection in large language models
Signature filtering learns unreliable tokens with MILP and removes them at detection time, raising true positive rates from 8-31% to 78-99% across Kgw, Sweet, Unigram, and Exp watermarks on multiple corpora and LLMs while controlling false positives.
-
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
Claw-SWE-Bench is a 350-instance multilingual benchmark for OpenClaw-style agent harnesses that shows adapter design raises Pass@1 from 19.1% to 73.4% on the same model while releasing data for reproducible comparison.
-
BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems
BOHM extracts multi-resolution attribution trees from existing routing weights in hierarchical AI systems, providing zero-cost explanations that correlate with SHAP when routing is near-optimal.
-
Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents
Reversa is a reverse documentation engineering framework that deploys a multi-agent pipeline to extract implicit rules from legacy software and produce traceable specifications with confidence scores and explicit gaps for human review.
-
Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models
TraFL applies trajectory flow balancing to post-train diffusion language models, preventing mode collapse and delivering consistent gains on reasoning tasks that hold under increased sampling.
-
3D Primitives are a Spatial Language for VLMs
3D geometric primitives in executable code act as an effective intermediate spatial language that boosts VLMs on reconstruction and question-answering tasks.
-
Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
-
SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies
SWE-WebDevBench finds that AI app builders commonly fail at translating business needs into complete, secure, production-ready software due to specification bottlenecks, frontend-backend decoupling, low engineering quality, and security weaknesses.
-
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured performance uplift on a frozen executor, outperforming outcome-only training on math and code benchmarks.
-
LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation
LiveFMBench shows that direct LLM prompting for C program formal specs overestimates accuracy by ~20% due to unfaithful behaviors like deceiving provers, while agentic workflows help under low sampling but overall performance remains far below human-authored specs.
-
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
-
DMax: Aggressive Parallel Decoding for dLLMs
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
-
Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation
SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.
-
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.
-
Reflexion: Language Agents with Verbal Reinforcement Learning
Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
-
CodeT: Code Generation with Generated Tests
CodeT improves code generation accuracy by using the same model to create test cases and then selecting solutions via output agreement on those tests, raising HumanEval pass@1 from 47% to 65.8%.
-
Essential Subspace Merging for Multi-Task Learning
The paper proposes Essential Subspace Decomposition and Merging (ESM/ESM++) to fuse task-specific model updates by isolating and orthogonalizing their principal activation-shift directions.
-
Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It
CoT SFT disrupts long-range routing in hybrid models via changes to W_Q and W_K; QK-Restore restores pre-SFT projections to recover NIAH performance.
-
Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents
Ratchet provides a minimal hygiene recipe for self-managing skill libraries in frozen LLM agents, delivering +0.328 rolling-mean pass@1 gain on MBPP+ hard-100 and +0.22 peak lift on SWE-bench Verified.
-
Harnessing LLM Agents with Skill Programs
HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.
-
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines across model scales.
-
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specification is the most damaging defect type while richer benchmarks are more resilient.
-
Improving Code Comprehension through Cognitive-Load Aware Automated Refactoring for Novice Programmers
CDDRefactorER constrains AI-driven refactoring using Cognitive-Driven Development rules to cut failures by 54-71% and raise novice comprehension scores by 22-31%.
-
Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design
Three-example few-shot prompting optimizes LLM-generated vision architectures while a whitespace-normalized hash provides 100x faster duplicate detection than AST parsing across seven benchmarks.
-
LLaDA2.0: Scaling Up Diffusion Language Models to 100B
LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
-
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.
-
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
-
Agentless: Demystifying LLM-based Software Engineering Agents
Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with 32% success rate at $0.70 cost.
-
Better & Faster Large Language Models via Multi-token Prediction
Multi-token prediction training yields higher sample efficiency, better benchmark scores on code generation, and up to 3x faster inference than standard next-token prediction for LLMs.
-
Gemini: A Family of Highly Capable Multimodal Models
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
-
The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
-
CodeT5+: Open Code Large Language Models for Code Understanding and Generation
CodeT5+ is a flexible encoder-decoder LLM family for code pretrained with diverse objectives on multilingual corpora and initialized from existing LLMs, achieving state-of-the-art results on code generation, completion, math programming, and retrieval tasks including new SoTA on HumanEval with the 1
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
Enhancing Large Language Model-Based Systems for End-to-End Circuit Analysis Problem Solving
Hybrid pipeline using YOLO vision and ngspice verification raises circuit analysis accuracy from Gemini's 79.52% baseline to 97.59%, with similar gains on hand-drawn diagrams.
-
Qwen2.5-Omni Technical Report
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.
-
Compact Constraint Encoding for LLM Code Generation: An Empirical Study of Token Economics and Constraint Compliance
Compact constraint headers reduce prompt tokens by 25-30% with no significant change in constraint compliance rates across tested models and tasks.
-
Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.