Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
Program Synthesis with Large Language Models
Mixed citation behavior. Most common role is background (52%).
abstract
This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.
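To make "synthesize solutions to ... problems" concrete, here is a minimal sketch of how a synthesized program might be checked against a task's test cases. It assumes that a task pairs a natural language description with assert-style tests and that a solution counts as correct only if every test passes. The task text, tests, candidate program, and helper names (`passes_all_tests`, `solve_rate`) are hypothetical illustrations, not taken from the dataset or the paper's evaluation code.

```python
# Hypothetical MBPP-style task: description plus assert-based tests (illustrative only).
task = {
    "text": "Write a function to return the sum of the squares of a list of integers.",
    "tests": [
        "assert sum_of_squares([1, 2, 3]) == 14",
        "assert sum_of_squares([]) == 0",
        "assert sum_of_squares([-2]) == 4",
    ],
}

# Stand-in for a model-generated program (an assumption, not real model output).
candidate = '''
def sum_of_squares(nums):
    return sum(n * n for n in nums)
'''


def passes_all_tests(program, tests):
    """Return True if the program defines code that satisfies every test."""
    namespace = {}
    try:
        exec(program, namespace)      # define the candidate function
        for test in tests:
            exec(test, namespace)     # a failing assert raises AssertionError
    except Exception:
        return False
    return True


def solve_rate(results):
    """Percentage of tasks whose candidate passed all tests."""
    return 100.0 * sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    ok = passes_all_tests(candidate, task["tests"])
    print(ok, solve_rate([ok]))
```

Running untrusted generated code with `exec` is unsafe outside a sandbox; a real harness would isolate each program in a separate process with time and resource limits.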
representative citing papers
Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.
A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.
BoLT is a benchmark of surrogate models fitted to real LLM experiment data that enables evaluation of Bayesian and black-box optimization methods on multi-fidelity, multi-objective, high-dimensional LLM tasks.
PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
Backdoored model code enables deterministic, verifiable stealing of sparse secrets during local LLM fine-tuning via tensor-rule matching and gradient injection, achieving over 98% strict attack success rate while bypassing DP-SGD and auditing defenses.
StabilizerBench is a new benchmark for evaluating AI agents on generating, optimizing, and making fault-tolerant stabilizer circuits for quantum error correction, with efficient verification and multi-tier scoring.
NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.
Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
Regression accumulation affects 40-73% of 8-turn LLM coding tasks on extended HumanEval+/MBPP+ benchmarks, with verification gates improving final-turn pass rates on prior tests.
Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.
ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.
FRAME adds a learnable fractional-Fourier order per expert in a MoE-LoRA setup so that low-rank updates are placed in the domain where they are most compact, yielding gains over fixed-domain baselines on LLaMA-3.1-8B and Qwen2.5-7B.
Multi-tier verification on VULBENCH-CPP shows AI-generated C++ code triggers confirmed runtime violations roughly twice as often as human code, while static analysis misleadingly indicates parity due to code length.
AxDafny achieves 92.7% verification success on DafnyBench (6.5 points above prior proof-hint baselines) via verifier-guided repair and introduces the LCB-Pro-Dafny benchmark of 250 problems.
Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.
AlgoBench creates traceable variants of competitive programming problems via constraint shifts that invalidate original algorithms, paired with complexity metrics that reveal LLMs often produce functionally correct but asymptotically unsuitable solutions.
citing papers explorer
- Rosetta: Composable Native Multimodal Pretraining
Rosetta proposes a composable multimodal pretraining method with MAOP to prevent catastrophic forgetting when expanding modalities beyond standard MoE and MoT approaches.
- Geometry-Preserving Orthonormal Initialization for Low-Rank Adaptation in RLVR
Orthonormal initialization for LoRA in RLVR achieves the minimal gap to full fine-tuning, stabilizes training, and outperforms standard LoRA and prior variants on mathematical reasoning benchmarks.
- ShopX: A Foundation Model for Intent-to-Item Fulfillment in Agentic Shopping
ShopX is a single foundation model combining intent understanding, planning, and SID-native item fulfillment for agentic shopping, with claimed improvements over tool-mediated systems on Taobao logs.
- BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding
BlockPilot is an instance-adaptive policy that predicts optimal block size from the prefilling representation for diffusion speculative decoding, reporting 5.92 acceptance length and 4.20x speedup on Qwen3-4B.
- On the Vulnerability of Parameter-Level Defenses to Model Merging
Parameter-level defenses for model merging are vulnerable to Anchor-Guided Attack because protected weights are dominated by the pretrained model, and a new defense ARF is introduced to counter it.
- AlgoSkill: Learning to Design Algorithms by Scheduling Human-Like Skills
AlgoSkill improves LLM algorithm design on programming benchmarks by framing it as verification-guided scheduling over a typed skill library with MCTS, outperforming direct generation and self-refinement.
- PaperClaw: Harnessing Agents for Autonomous Research and Human-in-the-Loop Refinement
PAPERCLAW is a multi-agent system for end-to-end autonomous research paper generation from literature to output, with human refinement and LLM-judge evaluation showing strong results.
- Sakana Fugu Technical Report
Sakana Fugu trains LLM orchestrators using fine-tuning, evolutionary algorithms, and RL to build query-adaptive multi-agent scaffolds, claiming SOTA results on benchmarks including SWE-Bench Pro and GPQA-Diamond.
- PromptMark: A Prompt-Guided Iterative-Feedback Framework for Source Code Watermarking
PromptMark is a black-box prompt-guided iterative-feedback framework that embeds statistically detectable watermarks in LLM-generated source code via naming patterns while preserving functional correctness.
- CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts
CodeSentinel introduces a three-layer defense system using syntax parsing and dynamic scoring to mitigate indirect prompt injection attacks in code contexts for large language models, reporting 0.80 F1 score on six attack families.
- No Two Developers Think Alike: How Problem-Solving Styles and Experience Shape Needs in Conversational Interaction with Copilot
Mixed-methods study of 27 developers characterizes five Copilot chat interaction modes and ten needs linked to problem-solving styles and experience levels.
- Model Merging to Evolution: Parameter Space Exploration for Expert Models
MERGEvolve unifies model merging with evolutionary strategies to explore outside convex parameter space and achieves competitive benchmark performance.
- Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging
RLVR induces sparse off-principal updates forming near-orthogonal shortcuts that degrade merging, addressed via Sensitivity-aware Resolving Merging using Fisher sensitivity, sparsification, and rescaling.
- The Hidden Power of Scaling Factor in LoRA Optimization
Alpha in LoRA outperforms learning-rate scaling, follows a square-root law with rank, and enables a minimalist LoRA-alpha method that improves performance across tasks.
- VIA-SD: Verification via Intra-Model Routing for Speculative Decoding
VIA-SD adds a routed slim-verifier tier between direct acceptance and full-model verification in speculative decoding, cutting rejection rates 0.10-0.22 and yielding 10-20% speedups over prior SD methods.
- Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application
This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.
- EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents
EEVEE introduces a router-based multi-dataset test-time prompt learning framework for LLM agents that uses router-prompt co-evolution to improve robustness on heterogeneous data streams.
- Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs
Continual training recipe upcycles dense Qwen2.5-8B LLM to 4x channel-sparse model via predictor-gated bank-wise sparsity in SwiGLU FFN with a single-layer repair for long-context failure on RULER-CWE.
- PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning
PADD distills from dense teachers to MoE students via neuron clustering, expert warmup, online adaptive distillation, path-refined policy optimization, and reward-augmented load balancing, yielding gains on math reasoning benchmarks.
- My Chemical Harness: Evolutionary Molecular Design over Synthetic Pathways with Large Language Model Agents
My Chemical Harness performs evolutionary molecular design by searching over validated synthetic routes with LLMs restricted to high-level preferences, outperforming baselines on an sEH proxy task across multiple metrics.
- PriFT: Prior-Support Guided Supervised Fine-Tuning
PriFT uses token reweighting signals from a frozen pretrained model to stabilize SFT and achieve better results than standard SFT baselines on reasoning tasks.
- SecRL-Prune: Structured Reinforcement Learning-Based Pruning of CodeLLMs for Preserving Adversarial Code Mutation
SecRL-Prune learns layer-wise pruning policies via RL on CodeLLMs, preserving higher pass@k and var@k than baselines at 10-30% compression on HumanEval and enabling semantics-preserving mutations that reduce malware detections in a case study.
- Towards the Readability of LLM-Generated Codes through Multitask Representation Engineering
Introduces multitask RepE to improve readability of LLM-generated code while analyzing the tradeoff with correctness via theory and experiments.
- What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems
Introduces PACT protocol that projects agent outputs into action-state records, yielding comparable or better task performance with substantially fewer tokens in multi-agent LLM systems and production harnesses.
- Characterizing initial human-AI proof formalization workflows
A controlled user study and qualitative survey find that AI assistance raises formalization accuracy for math proofs, with users flexibly combining multiple tools while retaining oversight.
- Neural Change Prediction: Relating Software Changes to Their Effects and Vice Versa
Neural Change Prediction generates mutation data to train bidirectional models linking code changes to behavioral effects for any executable program.
- eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion
eMoT treats reasoning trajectories as dynamic memories with corrosion, symbolic Python anchoring, and consistency refinement, raising accuracy on Game of 24 to 100% and improving math benchmarks over CoT baselines with a lightweight model.
- I-WebGenBench: Evaluating Interactivity in LLM-Generated Scientific Web Applications
A Paper-to-Interactive-System Agent and I-WebGenBench benchmark with 19 papers enable converting scientific PDFs into executable interactive web systems, with PaperVoyager framework shown to improve quality.
- MESA: Improving MoE Safety Alignment via Decentralized Expertise
MESA decentralizes safety duties in MoE LLMs via expert capacity reallocation and dynamic routing refinement based on optimal transport theory, yielding robust defense on harmful benchmarks while preserving helpfulness.
- CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models
CodeGolf Bench is a dynamic benchmark for LLM concise code generation in 60 languages, showing reasoning models reach 70.97% average human percentile on Python and C++ tasks while non-reasoning models lag.
- Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns
Synthesizes existing Tree-of-Thoughts work into a unified taxonomy using classical heuristic search terminology and identifies design patterns across shallow and deep reasoning tasks.
- Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation
Adapts QuantumKatas to Qiskit yielding a 350-task benchmark across 26 categories and evaluates 16 LLMs in 39,200 runs, reporting performance gaps and prompting effects.
- Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling
Dense2MoE unifies pruning of attention modules with upcycling of MLPs into MoE experts to produce on-device LLMs that improve the latency-accuracy Pareto frontier.
- GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training
GAC derives adaptive mixing weights for SFT-RL hybrid post-training from online gradient variance and signal disagreement estimates, improving benchmark performance over fixed schedules with under 1% overhead.
- Specification-Based Code-Text-Code Reengineering for LLM-Mediated Software Evolution
A Code2Text2Code reengineering framework using neutral specifications, verification steps, and graph-based loss estimation is proposed for LLM-mediated software evolution.
- ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning
ARES generates 100K rubric-annotated QA instances from raw documents and demonstrates superior rubric-based RL performance over baselines on open-ended benchmarks.
- Domain-Adaptable Reinforcement Learning for Code Generation with Dense Rewards
A PPO-based RL framework with execution-aware dense rewards and token-level mapping improves pass@1 by 19% on MBPP and reduces execution failures by 51% on RoboEval for LLM code generation.
- Pramana: A Protocol-Layer Treatment of Claim Verification in Autonomous Agent Networks
Pramana defines a typed ClaimAttestation protocol with four variants and verify operations, specifies its lifecycle in TLA+, model-checks it with TLC, and provides a tested Python implementation for auditable agent claims.
- Prompt Optimization for LLM Code Generation via Reinforcement Learning
A PPO agent with hybrid actions and test-driven rewards optimizes prompts for code LLMs, raising strict Pass@1 scores on MBPP+, HumanEval+, and APPS over prior methods.
- Code as Agent Harness
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
- Latent Action Reparameterization for Efficient Agent Inference
LAR learns a compact latent action space from trajectories that shortens the effective decision horizon for LLM agents, reducing token count and inference time while preserving task success.
- $\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control
f-OPD decomposes on-policy distillation drift into rollout and supervision components, then applies a sample-level freshness score to adaptively limit stale data influence and stabilize long-horizon agent training.
- Interactive Evaluation Requires a Design Science
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.
- Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications
Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.
- Leveraging Error Diversity in Group Rollouts for Reinforcement Learning
EDAS modulates RL advantage signals for incorrect rollouts by amplifying penalties on repeated errors and attenuating them on rare ones, yielding average gains of 6.29 points over DAPO on Qwen3-8B across seven math benchmarks.
- Lever: Speculative LLM Inference on Smartphones
Lever optimizes the drafting, verification, and execution stages of speculative decoding for flash-backed LLM inference on smartphones, reporting 2.93x average latency reduction over baseline flash-offloaded inference.
- The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
LLM-generated code matches human-written code in overall readability but exhibits different issue patterns, and prompt engineering has limited impact on improving it.
- Reinforced Collaboration in Multi-Agent Flow Networks
MANGO optimizes multi-agent LLM workflows via flow networks, RL, and textual gradients, delivering up to 12.8% higher performance and 47.4% better efficiency while generalizing to new domains.
- Multi-Token Residual Prediction
MRP predicts logit residuals between adjacent denoising steps in DLMs from backbone hidden states to support efficient multi-token denoising, yielding up to 1.4x lossless speedup or 22.6-point accuracy gains on code and math tasks.
- D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting
D-PACE derives per-position weights from a surrogate of expected accepted draft length to shift training focus toward currently limiting positions, yielding measured gains in wall-clock speedup and emitted length across benchmarks.