Program Synthesis with Large Language Models

Augustus Odena; Carrie Cai; Charles Sutton; David Dohan; Ellen Jiang; Henryk Michalewski; Jacob Austin; Maarten Bosma; Maxwell Nye; Michael Terry

arxiv: 2108.07732 · v1 · submitted 2021-08-16 · 💻 cs.PL · cs.LG


Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton


Pith reviewed 2026-05-24 13:41 UTC · model grok-4.3

classification 💻 cs.PL cs.LG

keywords: program synthesis, large language models, few-shot learning, code generation, Python, benchmarks, fine-tuning


The pith

Large language models synthesize correct Python programs from natural language descriptions for 59.6 percent of basic tasks using few-shot prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how well language models of different sizes can turn short natural language task descriptions into working Python code. It introduces two benchmarks consisting of entry-level problems and more involved math-derived tasks, then reports results in both few-shot and fine-tuned settings. Performance rises steadily with model scale, human feedback improves solutions further, and the work shows where current models still fail at understanding program behavior.

Core claim

The evaluated models range from 244 million to 137 billion parameters. Without any code-specific fine-tuning, the largest of them solves 59.6 percent of the 974 basic programming problems in the MBPP benchmark when prompted with a few examples. Fine-tuning on held-out benchmark data lifts results by roughly ten points for most sizes. The largest fine-tuned model reaches 83.8 percent on the MathQA-Python collection of 23,914 problems. Natural language feedback from a human halves the error rate relative to the model's first attempt. Accuracy scales log-linearly with parameter count, yet even the strongest models remain largely unable to predict the output of a program given its code and an input.

What carries the argument

The MBPP and MathQA-Python benchmarks, which pair natural language descriptions with short Python solutions, used to track synthesis accuracy across model sizes in few-shot and fine-tuned regimes.
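To make the measurement concrete, here is a minimal sketch of how synthesis accuracy could be scored when each task comes with test asserts and the model produces several samples per task. The function names and the toy tasks are hypothetical and are not taken from the paper or from MBPP. The sketch runs generated code with `exec`, which a real harness would need to sandbox and time-limit.

```python
from typing import Dict, List


def passes_tests(candidate_src: str, test_asserts: List[str]) -> bool:
    """Return True if the candidate program satisfies every test assert."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function(s)
        for test in test_asserts:
            exec(test, namespace)        # raises AssertionError on failure
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as failure.
        return False
    return True


def fraction_solved(samples_per_task: Dict[str, List[str]],
                    tests_per_task: Dict[str, List[str]]) -> float:
    """Fraction of tasks for which at least one sample passes all asserts."""
    solved = sum(
        any(passes_tests(src, tests_per_task[task]) for src in samples)
        for task, samples in samples_per_task.items()
    )
    return solved / len(samples_per_task)


# Hypothetical toy data (assumption: not real benchmark items).
samples = {
    "add": ["def add(a, b):\n    return a - b",
            "def add(a, b):\n    return a + b"],
    "square": ["def square(x):\n    return x + x"],
}
tests = {
    "add": ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"],
    "square": ["assert square(3) == 9"],
}
score = fraction_solved(samples, tests)
print(score)
```

In this toy run one of the two tasks is solved, because the second `add` sample passes while the only `square` sample fails its assert.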

If this is right

Synthesis accuracy improves in a log-linear fashion as the number of parameters grows.
Fine-tuning on a held-out slice of the benchmark data adds about ten percentage points across most model sizes.
Incorporating natural language human feedback reduces the initial error rate by half.
Models remain poor at predicting the concrete output of a program from its source code and a given input.
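The log-linear claim means that accuracy is approximately a straight line in the logarithm of the parameter count, so each tenfold increase in size adds a roughly constant number of accuracy points. The sketch below fits such a line by ordinary least squares. The data points are synthetic, generated from an assumed slope and intercept purely to show that the fit recovers them; they are not measurements from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(param_counts: Sequence[float],
                   accuracies: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of accuracy = intercept + slope * log10(params)."""
    xs = [math.log10(n) for n in param_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Assumed values for illustration only (not results from the paper).
TRUE_SLOPE = 0.15        # accuracy gained per tenfold increase in parameters
TRUE_INTERCEPT = -1.0
param_counts = [1e8, 1e9, 1e10, 1e11]   # hypothetical model sizes
synthetic_acc = [TRUE_INTERCEPT + TRUE_SLOPE * math.log10(n) for n in param_counts]

slope, intercept = fit_log_linear(param_counts, synthetic_acc)
print(f"slope per decade: {slope:.3f}, intercept: {intercept:.3f}")
```

A fitted slope read this way gives a quick check on whether a new model size sits on or off the trend.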

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed scaling suggests that further increases in model size could raise the fraction of solvable basic tasks without changes to training data.
The gap between generation success and execution prediction points to a possible benefit from training regimes that include direct execution signals.
If the pattern holds, language models could become reliable first-pass generators for many entry-level coding tasks once paired with simple verification steps.

Load-bearing premise

The benchmark problems share no meaningful overlap with the models' pretraining data and the natural language prompts test genuine synthesis rather than memorization or hidden leakage.

What would settle it

Showing that a substantial fraction of MBPP or MathQA-Python problems or close variants appear in the pretraining data of the tested models, or that accuracy stops rising when the same models are evaluated on an independently created set of equivalent tasks.
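One simple way to probe for the kind of overlap described above is to measure how many token n-grams of a benchmark solution also occur in a reference corpus. The following is a minimal sketch under stated assumptions: whitespace tokenization, a default window of five tokens, and an in-memory corpus are all illustrative choices, and none of this is the procedure used in the paper.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of whitespace-token n-grams in the text (empty if text is too short)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(solution: str, corpus_docs: Iterable[str], n: int = 5) -> float:
    """Fraction of the solution's n-grams that also appear somewhere in the corpus."""
    solution_grams = ngrams(solution, n)
    if not solution_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(solution_grams & corpus_grams) / len(solution_grams)


# Hypothetical toy inputs (assumption: not real benchmark or pretraining data).
solution = "def add(a, b): return a + b"
leaky_corpus = ["x = 1\ndef add(a, b): return a + b"]
clean_corpus = ["print('hello world') # nothing in common here at all"]

leaky_score = overlap_fraction(solution, leaky_corpus, n=3)
clean_score = overlap_fraction(solution, clean_corpus, n=3)
print(leaky_score, clean_score)
```

A high fraction flags a problem for manual inspection; it does not by itself prove memorization, since short idiomatic snippets recur naturally in code.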

Figures

Figures reproduced from arXiv: 2108.07732 by Augustus Odena, Carrie Cai, Charles Sutton, David Dohan, Ellen Jiang, Henryk Michalewski, Jacob Austin, Maarten Bosma, Maxwell Nye, Michael Terry, Quoc Le.

**Figure 2.** An example MathQA prompt along with a Python solution emitted by our largest model. Everything…

**Figure 3.** Performance vs model size, measured in two ways. (Left) Fraction of programs solved by…

**Figure 4.** Fraction of samples solving each task. The x-axis represents the index of a particular task, sorted by the…

**Figure 6.** Performance as a function of which prompt examples are chosen, as measured by fraction of tasks solved by at least one sample. The seed label corresponds to the random seed used to choose which held-out examples are shown as prompts. Seeds are ordered by the fraction of tasks solved by that seed.
4.3 Performance is Sensitive to Prompt Examples. While model performance is not strongly sensitive to the number…

**Figure 7.** Test cases for Task 11. The normal test cases incorrectly allow a program that deletes all occurrences of the…

**Figure 8.** In rare cases, the model generates a program which trivially passes the test asserts but does not solve the…

**Figure 9.** Higher temperatures achieve better scaling with…

**Figure 11.** Number of lines of code that appear in both the pre-training data and in the python programming dataset.

**Figure 13.** Percent of problems solved as the number of…

**Figure 14.** Two example human-model interactions. User text is purple and model text is blue. Left: an under-specified…

**Figure 15.** Synthesis performance of models fine-tuned on the execution task. While synthesis performance of the…

**Figure 16.** An example of a simple MathQA-style problem used as an additional test. We first verified that the model…

**Figure 17.** An example of a harder MathQA test problem. Without the parenthesized hint, it is solved by the 137B model…

**Figure 18.** Fraction of samples solving each MathQA task represented as a histogram and a graph. In the case of the…

**Figure 19.** Instructions given to the crowd workers (edited slightly for clarity).

**Figure 20.** Instructions used to edit the problems.
A.2 Instructions for human-model collaboration experiments. Each user will be tasked with attempting 12 problems with at most 5 turns of dialog (including an initial automated turn). Each problem will be tackled by two people. After 5 turns the task is considered failed. If the model passes the test cases at any point, the task is considered solved. Instructions: • E…

**Figure 21.** Instructions for human-model collaboration experiments. Instructions have been lightly edited for publication.

**Figure 22.** An extra dialog example.

Original abstract

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.
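The execution-prediction result at the end of the abstract can be pictured as a comparison between what a model says a program returns and what the program actually returns. The sketch below scores such predictions against real execution. The toy programs and the predicted strings are hypothetical stand-ins for model output, and comparing `repr` strings is an assumed matching rule rather than the paper's protocol.

```python
from typing import Sequence, Tuple

# (program source, function name, positional arguments)
Example = Tuple[str, str, tuple]


def actual_output(program_src: str, func_name: str, args: tuple) -> str:
    """Run the program and return repr() of the function's result."""
    namespace: dict = {}
    exec(program_src, namespace)   # unsafe for untrusted code; sandbox in practice
    return repr(namespace[func_name](*args))


def execution_accuracy(examples: Sequence[Example],
                       predictions: Sequence[str]) -> float:
    """Fraction of examples whose predicted output matches real execution."""
    correct = sum(
        actual_output(src, name, args) == pred.strip()
        for (src, name, args), pred in zip(examples, predictions)
    )
    return correct / len(examples)


# Hypothetical toy data (assumption: predictions are made up for illustration).
examples = [
    ("def double(x):\n    return 2 * x", "double", (4,)),
    ("def first(xs):\n    return xs[0]", "first", ([3, 1],)),
]
predicted = ["8", "1"]   # the second prediction is wrong; the real output is 3

accuracy = execution_accuracy(examples, predicted)
print(accuracy)
```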

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Scaling numbers on new code benchmarks look real but the synthesis claim rests on unverified lack of pretraining overlap.


The paper's core finding is that model size drives log-linear gains in few-shot Python synthesis on two fresh benchmarks, reaching 59.6% on MBPP with the 137B model and 83.8% after fine-tuning on MathQA-Python, with human dialog feedback cutting errors roughly in half. That is the part worth noting first. They also add an error analysis and show the models still cannot reliably predict program outputs from inputs even after fine-tuning.

The new datasets themselves are the clearest addition: MBPP's 974 entry-level tasks and the 23k MathQA-Python items give concrete targets that prior few-shot work lacked. The scaling plots and the feedback experiment are straightforward to follow and give numbers that later papers have built on. The execution-prediction test is a useful negative result that highlights a real limit.

The main gap is the missing decontamination step. The stress-test note is right that no n-gram overlap stats or removal of potentially seen problems are reported, and the abstract gives no hint that this was checked. For web-scale pretraining that is a material concern; even modest leakage would make the few-shot numbers look more like retrieval than synthesis. The fine-tuning results are less affected but still sit on the same benchmarks.

The work is aimed at researchers measuring LLM code capabilities rather than at production tool builders. It shows clear empirical thinking and honest reporting of where the models fail, so it is worth a serious referee's time even with the overlap question left open. I would send it to review and ask the authors to add the decontamination numbers or explain why they are unnecessary.

Referee Report

2 major / 2 minor

Summary. The paper evaluates large language models (244M–137B parameters) on program synthesis from natural language descriptions using two new benchmarks: MBPP (974 entry-level Python tasks) and MathQA-Python (23,914 problems). It reports log-linear scaling of synthesis accuracy with model size; the largest model reaches 59.6% on MBPP via few-shot prompting (improving ~10 points after fine-tuning on a held-out split) and 83.8% on MathQA-Python after fine-tuning. Additional results cover human-in-the-loop dialog refinement (halving error rate), error analysis, and an experiment showing limited ability to predict program execution outputs.

Significance. If the benchmarks are free of pretraining overlap, the work supplies concrete scaling trends, few-shot and fine-tuning numbers, and evidence that dialog feedback improves synthesis; the new benchmarks and the execution-prediction ablation are useful contributions for the program synthesis community.

major comments (2)

[§3 and §5] §3 (Benchmark Construction) and §5 (Experiments): The paper introduces MBPP and MathQA-Python as new benchmarks and reports headline accuracies (59.6% few-shot on MBPP, 83.8% fine-tuned on MathQA-Python) without any decontamination procedure, n-gram overlap statistics, or ablation removing potentially seen items against the 137B model's pretraining corpus. This is load-bearing for the central synthesis claim, because the observed performance and scaling could be explained by retrieval of memorized solutions rather than generalization from the prompt if even modest overlap exists.
[§5.2] §5.2 (Fine-tuning regime): The claim that fine-tuning on a held-out portion improves performance by ~10 points across sizes does not specify the split procedure, whether the held-out set was also checked for pretraining overlap, or how the fine-tuning data relates to the few-shot prompt construction. This affects whether the reported gains demonstrate improved synthesis or simply better adaptation to seen data.

minor comments (2)

[§4] The few-shot prompt templates and exact number of examples per problem are described only at a high level; providing the full prompt text or an appendix would improve reproducibility.
[§6] Table or figure reporting per-problem difficulty or error categories would benefit from explicit counts alongside percentages to allow readers to assess the error analysis.
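For readers unfamiliar with the setup the first minor comment refers to, a few-shot prompt concatenates several solved examples and ends with the unsolved target task, leaving the model to continue with a solution. The sketch below assembles such a prompt. The field names (`text`, `tests`, `code`) and the `Task:` / `Tests:` / `Solution:` layout are assumptions for illustration and are not the paper's actual template.

```python
from typing import Dict, List


def build_few_shot_prompt(examples: List[Dict],
                          task_description: str,
                          task_tests: List[str]) -> str:
    """Concatenate solved examples, then the target task with an open solution slot."""
    parts = []
    for ex in examples:
        parts.append(
            f"Task: {ex['text']}\n"
            "Tests:\n" + "\n".join(ex["tests"]) + "\n"
            f"Solution:\n{ex['code']}\n"
        )
    # The final block has no solution; the model is expected to complete it.
    parts.append(
        f"Task: {task_description}\n"
        "Tests:\n" + "\n".join(task_tests) + "\n"
        "Solution:\n"
    )
    return "\n".join(parts)


# Hypothetical demonstration example (assumption: not a real benchmark item).
demo_examples = [
    {
        "text": "Write a function to add two numbers.",
        "tests": ["assert add(2, 3) == 5"],
        "code": "def add(a, b):\n    return a + b",
    }
]
prompt = build_few_shot_prompt(
    demo_examples,
    "Write a function to square a number.",
    ["assert square(3) == 9"],
)
print(prompt)
```

Because results are reported as sensitive to which examples appear in the prompt, publishing the exact template and the sampled examples matters for reproducing a given number.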

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful review and constructive comments. We address the two major concerns below regarding data overlap and fine-tuning details. We agree these points affect interpretation of the results and will revise the manuscript to improve clarity and transparency where feasible.


Referee: [§3 and §5] The paper introduces MBPP and MathQA-Python without any decontamination procedure, n-gram overlap statistics, or ablation removing potentially seen items against the 137B model's pretraining corpus. Performance and scaling could be explained by memorization rather than generalization if overlap exists.

Authors: We acknowledge this is a substantive concern. MBPP problems were authored specifically for the benchmark and have no pretraining overlap by construction. MathQA-Python is a conversion of an existing dataset; we will add n-gram overlap statistics against publicly available code corpora (e.g., GitHub dumps) to §3 in the revision. However, we lack access to the full pretraining corpus of the 137B model, so a complete decontamination or ablation study is not possible. We will explicitly discuss this limitation and its implications for the scaling claims.

revision: partial

Referee: [§5.2] The fine-tuning claim does not specify the split procedure, whether the held-out set was checked for pretraining overlap, or how the fine-tuning data relates to the few-shot prompt construction.

Authors: We will revise §5.2 to clarify: the held-out fine-tuning split is a random 20% partition of the training problems, kept strictly disjoint from the test set. Few-shot prompts are constructed by sampling distinct examples from the remaining training data. We did not perform overlap checks on the held-out split due to lack of pretraining corpus access. These details and the limitation will be added to the revised manuscript.

revision: yes

standing simulated objections not resolved

Complete decontamination or ablation against the full pretraining corpus of the 137B model, as we do not have access to it.

Circularity Check

0 steps flagged

No circularity; purely empirical benchmark measurements

full rationale

The paper reports direct experimental results (accuracy percentages, scaling observations) on held-out benchmarks MBPP and MathQA-Python. No derivations, equations, fitted parameters renamed as predictions, or self-referential definitions exist. Central claims (e.g., 59.6% few-shot, 83.8% fine-tuned) are measurements of model outputs, not reductions to inputs by construction. Any self-citations are incidental and non-load-bearing for the empirical findings.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions of LLM evaluation: that benchmark problems are uncontaminated, that few-shot prompting measures synthesis rather than memorization, and that accuracy on these tasks correlates with real-world utility. No free parameters are fitted to produce the headline percentages; model sizes are given as inputs.

axioms (2)

domain assumption: Few-shot prompting with a well-designed prompt elicits genuine program synthesis rather than surface pattern matching or data leakage.
Invoked when reporting 59.6% few-shot accuracy on MBPP without fine-tuning.
domain assumption: The MBPP and MathQA-Python problems are representative of entry-level and math-related programming tasks solvable by humans.
Stated in the abstract description of the datasets.


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
cs.CL 2022-01 accept novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks
cs.LG 2026-05 conditional novelty 8.0

BoLT is a benchmark of surrogate models fitted to real LLM experiment data that enables evaluation of Bayesian and black-box optimization methods on multi-fidelity, multi-objective, high-dimensional LLM tasks.
PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation
cs.AI 2026-05 unverdicted novelty 8.0

PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
cs.AI 2026-05 accept novelty 8.0

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
cs.AI 2026-05 unverdicted novelty 8.0

SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
Secret Stealing Attacks on Local LLM Fine-Tuning through Supply-Chain Model Code Backdoors
cs.CR 2026-04 unverdicted novelty 8.0

Backdoored model code enables deterministic, verifiable stealing of sparse secrets during local LLM fine-tuning via tensor-rule matching and gradient injection, achieving over 98% strict attack success rate while bypa...
StabilizerBench: A Benchmark for AI-Assisted Quantum Error Correction Circuit Synthesis
quant-ph 2026-04 conditional novelty 8.0

StabilizerBench is a new benchmark for evaluating AI agents on generating, optimizing, and making fault-tolerant stabilizer circuits for quantum error correction, with efficient verification and multi-tier scoring.
Gradient-Based Program Synthesis with Neurally Interpreted Languages
cs.LG 2026-04 unverdicted novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prio...
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
cs.LG 2026-03 unverdicted novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
cs.CV 2026-01 unverdicted novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
cs.AI 2025-09 unverdicted novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
cs.CR 2025-07 unverdicted novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Code as Policies: Language Model Programs for Embodied Control
cs.RO 2022-09 accept novelty 8.0

Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
Show Your Work: Scratchpads for Intermediate Computation with Language Models
cs.LG 2021-11 unverdicted novelty 8.0

Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
Training-Free Looped Transformers
cs.LG 2026-05 unverdicted novelty 7.0

Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.
Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents
cs.LG 2026-05 unverdicted novelty 7.0

Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.
Learnability-Informed Fine-Tuning of Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 7.0

LIFT is a learnability-informed SFT algorithm for diffusion LMs that aligns token difficulty with diffusion time steps, yielding up to 3x gains on AIME'24 and AIME'25 over standard SFT baselines.
Self-Policy Distillation via Capability-Selective Subspace Projection
cs.CL 2026-05 unverdicted novelty 7.0

Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines...
GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving
cs.LG 2026-05 unverdicted novelty 7.0

GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.
What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema
cs.LG 2026-05 accept novelty 7.0

Pilot audit of twelve LLM benchmark papers finds mean disclosure score of 0.38/1.0 for agent benchmarks versus 0.66 for classical ones, with zero papers disclosing inference costs or full harness specs, and releases a...
SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents
cs.SE 2026-05 unverdicted novelty 7.0

SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.
BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems
cs.AI 2026-05 conditional novelty 7.0

BOHM extracts multi-resolution attribution trees from existing routing weights in hierarchical AI systems, providing zero-cost explanations that correlate with SHAP when routing is near-optimal.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens witho...
Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents
cs.SE 2026-05 conditional novelty 7.0

Reversa is a reverse documentation engineering framework that deploys a multi-agent pipeline to extract implicit rules from legacy software and produce traceable specifications with confidence scores and explicit gaps...
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
cs.AI 2026-05 unverdicted novelty 7.0

WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% exce...
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
cs.AI 2026-05 unverdicted novelty 7.0

WebGameBench is a benchmark that evaluates coding agents by having them generate browser-native games from specifications, then running those games in a real browser to assign EXCELLENT, USABLE, or UNUSABLE labels, wi...
Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
cs.AI 2026-05 unverdicted novelty 7.0

New benchmark evaluates three frontier deep research agents on 42 SME prompts with verifiers and rubrics, reporting low acceptance rates of 9.5-21.4% and agent-specific failure modes.
SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering
cs.SE 2026-05 unverdicted novelty 7.0

SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.
MasFACT: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer
cs.LG 2026-05 unverdicted novelty 7.0

MasFACT transfers historical topology priors across tasks via Fused Gromov-Wasserstein optimal transport and PAC-Bayes conservative adaptation to reduce topology forgetting in continual multi-agent settings.
1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?
cs.LG 2026-05 unverdicted novelty 7.0

Introduces the 1GC-7RC benchmark to evaluate AI coding agents on seven diverse ML tasks under single-GPU time and access constraints.
The IsalProgram Programming Language
cs.PL 2026-05 unverdicted novelty 7.0

IsalProgram is a regular assembly-like language where all instruction strings are valid programs executed on a circular doubly linked list VM without addresses or variable names.
Constrained Code Generation with Discrete Diffusion
cs.CL 2026-05 unverdicted novelty 7.0

Constrained Diffusion for Code (CDC) integrates constraint satisfaction into the reverse denoising process of discrete diffusion models via constraint-aware operators that use optimization and program analysis to stee...
AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents
cs.CL 2026-05 unverdicted novelty 7.0

AgentKernelArena is a new open benchmark that measures complete AI agent workflows on 196 GPU kernel tasks with correctness, performance, and generalization checks to unseen configurations.
PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding
cs.CL 2026-05 unverdicted novelty 7.0

PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy ...
Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language
cs.CL 2026-05 unverdicted novelty 7.0

Fine-tuning LLMs on an unseen language teaches syntax but fails to transfer semantic competence, leaving Python with up to a 19% performance advantage and no tested intervention closing the gap.
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
cs.SE 2026-05 unverdicted novelty 7.0

SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG 2026-05 unverdicted novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding
cs.CL 2026-05 unverdicted novelty 7.0

FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster ...
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights
cs.CL 2026-05 unverdicted novelty 7.0

TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4...
Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models
cs.LG 2026-05 conditional novelty 7.0

TraFL applies trajectory flow balancing to post-train diffusion language models, preventing mode collapse and delivering consistent gains on reasoning tasks that hold under increased sampling.
Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
cs.LG 2026-05 unverdicted novelty 7.0

RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-train...
3D Primitives are a Spatial Language for VLMs
cs.CV 2026-05 conditional novelty 7.0

3D geometric primitives in executable code act as an effective intermediate spatial language that boosts VLMs on reconstruction and question-answering tasks.
CR^2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference
cs.IT 2026-05 unverdicted novelty 7.0

CR^2 matches full-information routing performance for device-edge LLM inference using only device-side signals and cuts normalized deployment cost by up to 16.9% at matched accuracy.
Multi-Token Residual Prediction
cs.LG 2026-05 unverdicted novelty 7.0

MRP predicts logit residuals from hidden states to support dependency-aware multi-token denoising in a single forward pass for diffusion language models, yielding up to 1.42× lossless speedup on SDAR models.
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
cs.SE 2026-05 unverdicted novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...
Every Bit, Everywhere, All at Once: A Binomial Multibit LLM Watermark
cs.CR 2026-05 unverdicted novelty 7.0

A binomial multibit watermarking scheme encodes every payload bit at each LLM token with dynamic redirection, outperforming baselines in accuracy and robustness for large payloads.
EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales
cs.AI 2026-05 unverdicted novelty 7.0

EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.
Identified-Set Geometry of Distributional Model Extraction under Top-$K$ Censored API Access
cs.LG 2026-05 unverdicted novelty 7.0

Top-K logit censoring bounds the total-variation diameter of compatible teacher distributions by U_K but permits substantial capability transfer via distillation even when KL divergence is near zero.
Prospective Compression in Human Abstraction Learning
cs.AI 2026-05 unverdicted novelty 7.0

Humans exhibit abstraction learning consistent with prospective compression of future tasks in non-stationary domains, unlike retrospective compression algorithms or LLM-based approaches.
Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution
cs.NE 2026-05 unverdicted novelty 7.0

QD-LLM evolves prompt embeddings via neuroevolution in a quality-diversity framework, delivering 46% higher coverage and 41% higher QD-score than prior methods on coding and writing benchmarks.
Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
cs.CL 2026-05 conditional novelty 7.0

Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications
cs.MA 2026-05 unverdicted novelty 7.0

SmartEval is a new benchmark showing LLM-generated smart contracts score 8.29 points higher than expert versions on average but frequently omit logic (35.3%) or mishandle state transitions (23.4%).
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
cs.CL 2026-05 unverdicted novelty 7.0

TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
cs.AI 2026-05 unverdicted novelty 7.0

BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.
LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection
cs.LG 2026-05 unverdicted novelty 7.0

LEAP detects early-converging tokens in dLLMs via future context filtering and multi-sequence superposition, reducing average denoising steps by about 30% while maintaining accuracy.
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms
cs.LG 2026-05 unverdicted novelty 7.0

Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.
CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents
cs.AI 2026-05 unverdicted novelty 7.0

CoCoDA co-evolves a typed compositional DAG of primitive and composite tools with the agent planner, using signature-based retrieval and a size-based reward to scale libraries efficiently and let an 8B model match or ...

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 377 Pith papers · 6 internal anchors

[1]

Language Models are Few-Shot Learners

Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/62326dc7c4f7b849d6f013ba46489d6c-Paper.pdf. big-bench collaboration. Beyond the imitation game: Measuring and extrapolating the capabilities of language models. In preparation, 2021. URL https://github.com/google/BIG-bench/. Sid Black, Leo Gao, Phil Wang, Connor Leahy, and ...

[2]

URL https://arxiv.org/abs/2005.14165. Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. arXiv preprint arXiv:2012.07805, 2020. Mark Chen, Jerry Tworek, Heewoo Jun, Qi...

[3]

Measuring Coding Challenge Competence With APPS

URL http://arxiv.org/abs/2105.09938. Abram Hindle, Earl Barr, Zhendong Su, Prem Devanbu, and Mark Gable. On the “naturalness” of software. In International Conference on Software Engineering (ICSE). 2012. Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Association of Computational Linguistics (ACL), 201...

[4]

Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

URL http://arxiv.org/abs/2003.13848. Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Eduardo Blanco and Wei Lu, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgiu...

[5]

Augustus Odena, Kensen Shi, David Bieber, Rishabh Singh, and Charles Sutton

URL https://arxiv.org/abs/2002.09030. Augustus Odena, Kensen Shi, David Bieber, Rishabh Singh, and Charles Sutton. BUSTLE: bottom-up program synthesis through learning-guided exploration. CoRR, abs/2007.14381, 2020. URL https://arxiv.org/abs/2007.14381. Irene Vlassi Pandi, Earl T Barr, Andrew D Gordon, and Charles Sutton. OptTyper: Probabilistic type in...

[6]

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

URL http://arxiv.org/abs/1910.10683. Veselin Raychev, Martin Vechev, and Eran Yahav. Code completion with statistical language models. In ACM PLDI, 2014. Veselin Raychev, Martin Vechev, and Andreas Krause. Predicting program properties from “big code”. In ACM Symposium on Principles of Programming Languages (POPL), 2015. Veselin Raychev, Pavol Bielik, and ...

[7]

Learning to Execute

URL https://books.google.com/books?id=3BITSQAACAAJ. Jiayi Wei, Maruth Goyal, Greg Durrett, and Isil Dillig. LambdaNet: Probabilistic type inference using graph neural networks. In International Conference on Learning Representations, 2020. Michihiro Yasunaga and Percy Liang. Graph-based, self-supervised program repair from diagnostic feedback. In Inter...

[8]

Well-defined, unambiguous question and test case: Ensure the question is well-defined and unambiguous, given the question and a test case. If the question does not seem to be a good or useful question, flag it for removal.

[9]

No special conditions: Remove any special conditions specified in the question (e.g., requirements to solve the problem using a regex, printing to the console, or using a lambda function).

[10]

Function signature looks "normal" (inputs and outputs): Make sure the function signature is not unusual (e.g., one common case was to pass in a list and the length of that list)

[11]

Make sure the return values are well-specified: Sometimes they return strings indicating success or failure; consider whether it could be changed to a standard Boolean value. If they use strings as enums, define these values in the natural language question.

[12]

Test cases are accurate: Make sure the test cases contain no errors

[13]

Float comparisons are handled correctly: If the function returns floating point values, test using math.isclose(): import math; math.isclose(a, b, rel_tol=0.001)
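
The float-comparison instruction can be illustrated with a minimal sketch. The function `circle_area` and its expected value are hypothetical and not taken from the benchmark; only `math.isclose` and the `rel_tol=0.001` setting come from the instruction itself.

```python
import math


def circle_area(radius):
    """Hypothetical function under test; returns a floating point value."""
    return math.pi * radius ** 2


# Exact equality is brittle for floats: 0.1 + 0.2 is not exactly 0.3.
assert 0.1 + 0.2 != 0.3
assert math.isclose(0.1 + 0.2, 0.3, rel_tol=0.001)

# Tolerance-based test case in the style the instruction recommends.
assert math.isclose(circle_area(2.0), 12.566, rel_tol=0.001)
```

A relative tolerance keeps the test independent of how many digits the reference answer was rounded to, at the cost of accepting results that are slightly off.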

[14]

Questions asking for n elements of a list may not specify an expected order: disambiguate or adjust tests. If a question asks for a subset of a list (e.g., the largest n numbers), but does not specify an order, add that specification to the question text.
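
One way to read the ordering instruction is sketched below. The function `largest_n` and the sample data are hypothetical and not taken from the benchmark; the sketch only contrasts a test that relies on a stated order with one that tolerates any order.

```python
import heapq


def largest_n(values, n):
    """Hypothetical function under test: the n largest values, descending."""
    return heapq.nlargest(n, values)


data = [4, 1, 7, 3, 7, 2]

# Option 1: the question text fixes the order, so lists compare directly.
assert largest_n(data, 3) == [7, 7, 4]

# Option 2: the order is left open, so the test compares sorted copies.
assert sorted(largest_n(data, 3)) == sorted([4, 7, 7])

# Comparing as sets also ignores order, but it drops duplicates, so a
# wrong answer such as [7, 4] would pass against the expected [7, 7, 4].
assert set([7, 4]) == set([7, 7, 4])
```

The last assertion shows why a set-based check may be too permissive when the expected result can contain repeated values.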

[15]

Close, but it needs to return i if count is equal to len(str)

Consider whether using sets (set()) in the asserts is the right way to test results.

Figure 20: Instructions used to edit the problems.

A.2 Instructions for human-model collaboration experiments

Each user will be tasked with attempting 12 problems with at most 5 turns of dialog (including an initial automated turn). Each problem will be tackled by two peop...

