Program Synthesis with Large Language Models

Augustus Odena, David Dohan, Henryk Michalewski, Jacob Austin, Maarten Bosma, Maxwell Nye · 2021 · cs.PL · arXiv 2108.07732

Mixed citation behavior. Most common role is background (52%).

567 Pith papers citing it

Background 52% of classified citations

abstract

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.
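The abstract describes benchmarks that measure whether a model can synthesize a short Python program from a natural language description. A minimal sketch of how one such task might be scored is shown below, assuming each task comes with a handful of assert-style test cases. The helper name `passes_tests`, the sample task, and the candidate programs are hypothetical illustrations, not the authors' evaluation harness.

```python
def passes_tests(candidate_source: str, test_asserts: list[str]) -> bool:
    """Return True if the candidate program satisfies every assert statement.

    Hypothetical helper for illustration only. It runs code with exec() and
    no sandbox, so it must never be used on untrusted model output as-is.
    """
    namespace: dict = {}
    try:
        # Define whatever functions the candidate program declares.
        exec(candidate_source, namespace)
        # Run each test case in the same namespace; a failing assert raises.
        for test in test_asserts:
            exec(test, namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as failure.
        return False
    return True


# Hypothetical task in the style of an entry-level programming problem.
task_description = "Write a function to return the sum of two numbers."
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]

correct_candidate = "def add(a, b):\n    return a + b\n"
wrong_candidate = "def add(a, b):\n    return a - b\n"

result_ok = passes_tests(correct_candidate, tests)
result_bad = passes_tests(wrong_candidate, tests)
```

Under this sketch, a benchmark score would be the fraction of tasks for which at least one sampled candidate passes all of its tests; the sampling budget and prompt format are left out because the abstract does not specify them.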

citation-role summary

background 58 · dataset 41 · method 4 · other 2

citation-polarity summary

background 55 · use dataset 36 · unclear 9 · use method 4 · support 1

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

Sumi: Open Uniform Diffusion Language Model from Scratch

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · novelty 8.0

A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.

BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

cs.LG · 2026-05-16 · conditional · novelty 8.0

BoLT is a benchmark of surrogate models fitted to real LLM experiment data that enables evaluation of Bayesian and black-box optimization methods on multi-fidelity, multi-objective, high-dimensional LLM tasks.

PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

cs.AI · 2026-05-10 · unverdicted · novelty 8.0

PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

Secret Stealing Attacks on Local LLM Fine-Tuning through Supply-Chain Model Code Backdoors

cs.CR · 2026-04-30 · unverdicted · novelty 8.0

Backdoored model code enables deterministic, verifiable stealing of sparse secrets during local LLM fine-tuning via tensor-rule matching and gradient injection, achieving over 98% strict attack success rate while bypassing DP-SGD and auditing defenses.

StabilizerBench: A Benchmark for AI-Assisted Quantum Error Correction Circuit Synthesis

quant-ph · 2026-04-23 · conditional · novelty 8.0

StabilizerBench is a new benchmark for evaluating AI agents on generating, optimizing, and making fault-tolerant stabilizer circuits for quantum error correction, with efficient verification and multi-tier scoring.

Gradient-Based Program Synthesis with Neurally Interpreted Languages

cs.LG · 2026-04-20 · unverdicted · novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

cs.CR · 2025-07-14 · unverdicted · novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Code as Policies: Language Model Programs for Embodied Control

cs.RO · 2022-09-16 · accept · novelty 8.0

Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

Show Your Work: Scratchpads for Intermediate Computation with Language Models

cs.LG · 2021-11-30 · unverdicted · novelty 8.0

Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.

Regression Accumulation in Multi-Turn LLM Programming Conversations

cs.SE · 2026-07-02 · conditional · novelty 7.0

Regression accumulation affects 40-73% of 8-turn LLM coding tasks on extended HumanEval+/MBPP+ benchmarks, with verification gates improving final-turn pass rates on prior tests.

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 7.0 · 2 refs

ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.

FRAME: Learning the Adaptation Domain with a Mixture of Fractional-Fourier Experts

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

FRAME adds a learnable fractional-Fourier order per expert in a MoE-LoRA setup so that low-rank updates are placed in the domain where they are most compact, yielding gains over fixed-domain baselines on LLaMA-3.1-8B and Qwen2.5-7B.

The Illusion of Safety: Multi-Tier Verification of AI vs. Human C++ Code

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Multi-tier verification on VULBENCH-CPP shows AI-generated C++ code triggers confirmed runtime violations roughly twice as often as human code, while static analysis misleadingly indicates parity due to code length.

AxDafny: Agentic Verified Code Generation in Dafny

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

AxDafny achieves 92.7% verification success on DafnyBench (6.5 points above prior proof-hint baselines) via verifier-guided repair and introduces the LCB-Pro-Dafny benchmark of 250 problems.

Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.

AlgoBench: Benchmarking Algorithmic Adaptation in Code Generation

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

AlgoBench creates traceable variants of competitive programming problems via constraint shifts that invalidate original algorithms, paired with complexity metrics that reveal LLMs often produce functionally correct but asymptotically unsuitable solutions.

citing papers explorer

Showing 50 of 567 citing papers.

Rosetta: Composable Native Multimodal Pretraining · cs.CV · 2026-07-01 · unverdicted · none · ref 4 · internal anchor
Rosetta proposes a composable multimodal pretraining method with MAOP to prevent catastrophic forgetting when expanding modalities beyond standard MoE and MoT approaches.
Geometry-Preserving Orthonormal Initialization for Low-Rank Adaptation in RLVR · cs.LG · 2026-06-30 · unverdicted · none · ref 77 · internal anchor
Orthonormal initialization for LoRA in RLVR achieves the minimal gap to full fine-tuning, stabilizes training, and outperforms standard LoRA and prior variants on mathematical reasoning benchmarks.
ShopX: A Foundation Model for Intent-to-Item Fulfillment in Agentic Shopping · cs.IR · 2026-06-30 · unverdicted · none · ref 44 · internal anchor
ShopX is a single foundation model combining intent understanding, planning, and SID-native item fulfillment for agentic shopping, with claimed improvements over tool-mediated systems on Taobao logs.
BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding · cs.CL · 2026-06-30 · unverdicted · none · ref 3 · internal anchor
BlockPilot is an instance-adaptive policy that predicts optimal block size from the prefilling representation for diffusion speculative decoding, reporting 5.92 acceptance length and 4.20x speedup on Qwen3-4B.
On the Vulnerability of Parameter-Level Defenses to Model Merging · cs.LG · 2026-06-29 · unverdicted · none · ref 1 · internal anchor
Parameter-level defenses for model merging are vulnerable to Anchor-Guided Attack because protected weights are dominated by the pretrained model, and a new defense ARF is introduced to counter it.
AlgoSkill: Learning to Design Algorithms by Scheduling Human-Like Skills · cs.AI · 2026-06-29 · unverdicted · none · ref 10 · internal anchor
AlgoSkill improves LLM algorithm design on programming benchmarks by framing it as verification-guided scheduling over a typed skill library with MCTS, outperforming direct generation and self-refinement.
PaperClaw: Harnessing Agents for Autonomous Research and Human-in-the-Loop Refinement · cs.AI · 2026-06-21 · unverdicted · none · ref 48 · internal anchor
PAPERCLAW is a multi-agent system for end-to-end autonomous research paper generation from literature to output, with human refinement and LLM-judge evaluation showing strong results.
Sakana Fugu Technical Report · cs.LG · 2026-06-19 · unverdicted · none · ref 156 · internal anchor
Sakana Fugu trains LLM orchestrators using fine-tuning, evolutionary algorithms, and RL to build query-adaptive multi-agent scaffolds, claiming SOTA results on benchmarks including SWE-Bench Pro and GPQA-Diamond.
PromptMark: A Prompt-Guided Iterative-Feedback Framework for Source Code Watermarking · cs.CR · 2026-06-18 · unverdicted · none · ref 37 · internal anchor
PromptMark is a black-box prompt-guided iterative-feedback framework that embeds statistically detectable watermarks in LLM-generated source code via naming patterns while preserving functional correctness.
CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts · cs.CR · 2026-06-17 · unverdicted · none · ref 3 · internal anchor
CodeSentinel introduces a three-layer defense system using syntax parsing and dynamic scoring to mitigate indirect prompt injection attacks in code contexts for large language models, reporting 0.80 F1 score on six attack families.
No Two Developers Think Alike: How Problem-Solving Styles and Experience Shape Needs in Conversational Interaction with Copilot · cs.SE · 2026-06-17 · unverdicted · none · ref 2 · internal anchor
Mixed-methods study of 27 developers characterizes five Copilot chat interaction modes and ten needs linked to problem-solving styles and experience levels.
Model Merging to Evolution: Parameter Space Exploration for Expert Models · cs.NE · 2026-06-17 · unverdicted · none · ref 2 · internal anchor
MERGEvolve unifies model merging with evolutionary strategies to explore outside convex parameter space and achieves competitive benchmark performance.
Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging · cs.LG · 2026-06-16 · unverdicted · none · ref 1 · internal anchor
RLVR induces sparse off-principal updates forming near-orthogonal shortcuts that degrade merging, addressed via Sensitivity-aware Resolving Merging using Fisher sensitivity, sparsification, and rescaling.
The Hidden Power of Scaling Factor in LoRA Optimization · cs.AI · 2026-06-11 · unverdicted · none · ref 63 · internal anchor
Alpha in LoRA outperforms learning-rate scaling, follows a square-root law with rank, and enables a minimalist LoRA-alpha method that improves performance across tasks.
VIA-SD: Verification via Intra-Model Routing for Speculative Decoding · cs.CL · 2026-06-10 · unverdicted · none · ref 9 · internal anchor
VIA-SD adds a routed slim-verifier tier between direct acceptance and full-model verification in speculative decoding, cutting rejection rates 0.10-0.22 and yielding 10-20% speedups over prior SD methods.
Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application · cs.CL · 2026-06-10 · unverdicted · none · ref 44 · internal anchor
This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.
EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents · cs.LG · 2026-06-09 · unverdicted · none · ref 2 · internal anchor
EEVEE introduces a router-based multi-dataset test-time prompt learning framework for LLM agents that uses router-prompt co-evolution to improve robustness on heterogeneous data streams.
Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs · cs.CL · 2026-06-09 · unverdicted · none · ref 16 · internal anchor
Continual training recipe upcycles dense Qwen2.5-8B LLM to 4x channel-sparse model via predictor-gated bank-wise sparsity in SwiGLU FFN with a single-layer repair for long-context failure on RULER-CWE.
PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning · cs.CL · 2026-06-09 · unverdicted · none · ref 45 · internal anchor
PADD distills from dense teachers to MoE students via neuron clustering, expert warmup, online adaptive distillation, path-refined policy optimization, and reward-augmented load balancing, yielding gains on math reasoning benchmarks.
My Chemical Harness: Evolutionary Molecular Design over Synthetic Pathways with Large Language Model Agents · physics.chem-ph · 2026-06-08 · unverdicted · none · ref 2 · internal anchor
My Chemical Harness performs evolutionary molecular design by searching over validated synthetic routes with LLMs restricted to high-level preferences, outperforming baselines on an sEH proxy task across multiple metrics.
PriFT: Prior-Support Guided Supervised Fine-Tuning · cs.CL · 2026-06-08 · unverdicted · none · ref 1 · internal anchor
PriFT uses token reweighting signals from a frozen pretrained model to stabilize SFT and achieve better results than standard SFT baselines on reasoning tasks.
SecRL-Prune: Structured Reinforcement Learning-Based Pruning of CodeLLMs for Preserving Adversarial Code Mutation · cs.CR · 2026-06-04 · unverdicted · none · ref 3 · internal anchor
SecRL-Prune learns layer-wise pruning policies via RL on CodeLLMs, preserving higher pass@k and var@k than baselines at 10-30% compression on HumanEval and enabling semantics-preserving mutations that reduce malware detections in a case study.
Towards the Readability of LLM-Generated Codes through Multitask Representation Engineering · cs.SE · 2026-06-04 · unverdicted · none · ref 28 · internal anchor
Introduces multitask RepE to improve readability of LLM-generated code while analyzing the tradeoff with correctness via theory and experiments.
What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems · cs.AI · 2026-06-03 · unverdicted · none · ref 28 · internal anchor
Introduces PACT protocol that projects agent outputs into action-state records, yielding comparable or better task performance with substantially fewer tokens in multi-agent LLM systems and production harnesses.
Characterizing initial human-AI proof formalization workflows · cs.AI · 2026-06-02 · unverdicted · none · ref 201 · internal anchor
A controlled user study and qualitative survey find that AI assistance raises formalization accuracy for math proofs, with users flexibly combining multiple tools while retaining oversight.
Neural Change Prediction: Relating Software Changes to Their Effects and Vice Versa · cs.SE · 2026-06-02 · unverdicted · none · ref 3 · internal anchor
Neural Change Prediction generates mutation data to train bidirectional models linking code changes to behavioral effects for any executable program.
eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion · cs.AI · 2026-06-01 · unverdicted · none · ref 19 · internal anchor
eMoT treats reasoning trajectories as dynamic memories with corrosion, symbolic Python anchoring, and consistency refinement, raising accuracy on Game of 24 to 100% and improving math benchmarks over CoT baselines with a lightweight model.
I-WebGenBench : Evaluating Interactivity in LLM-Generated Scientific Web Applications · cs.CL · 2026-05-30 · unverdicted · none · ref 3 · internal anchor
A Paper-to-Interactive-System Agent and I-WebGenBench benchmark with 19 papers enable converting scientific PDFs into executable interactive web systems, with PaperVoyager framework shown to improve quality.
MESA: Improving MoE Safety Alignment via Decentralized Expertise · cs.LG · 2026-05-30 · unverdicted · none · ref 37 · internal anchor
MESA decentralizes safety duties in MoE LLMs via expert capacity reallocation and dynamic routing refinement based on optimal transport theory, yielding robust defense on harmful benchmarks while preserving helpfulness.
CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models · cs.SE · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
CodeGolf Bench is a dynamic benchmark for LLM concise code generation in 60 languages, showing reasoning models reach 70.97% average human percentile on Python and C++ tasks while non-reasoning models lag.
Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns · cs.AI · 2026-05-27 · unverdicted · none · ref 1 · internal anchor
Synthesizes existing Tree-of-Thoughts work into a unified taxonomy using classical heuristic search terminology and identifies design patterns across shallow and deep reasoning tasks.
Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation · quant-ph · 2026-05-26 · unverdicted · none · ref 2 · internal anchor
Adapts QuantumKatas to Qiskit yielding a 350-task benchmark across 26 categories and evaluates 16 LLMs in 39,200 runs, reporting performance gaps and prompting effects.
Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling · cs.LG · 2026-05-26 · unverdicted · none · ref 1 · internal anchor
Dense2MoE unifies pruning of attention modules with upcycling of MLPs into MoE experts to produce on-device LLMs that improve the latency-accuracy Pareto frontier.
GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training · cs.LG · 2026-05-25 · unverdicted · none · ref 1 · internal anchor
GAC derives adaptive mixing weights for SFT-RL hybrid post-training from online gradient variance and signal disagreement estimates, improving benchmark performance over fixed schedules with under 1% overhead.
Specification-Based Code-Text-Code Reengineering for LLM-Mediated Software Evolution · cs.SE · 2026-05-24 · unverdicted · none · ref 2 · internal anchor
A Code2Text2Code reengineering framework using neutral specifications, verification steps, and graph-based loss estimation is proposed for LLM-mediated software evolution.
ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning · cs.CL · 2026-05-22 · unverdicted · none · ref 3 · 2 links · internal anchor
ARES generates 100K rubric-annotated QA instances from raw documents and demonstrates superior rubric-based RL performance over baselines on open-ended benchmarks.
Domain-Adaptable Reinforcement Learning for Code Generation with Dense Rewards · cs.LG · 2026-05-20 · unverdicted · none · ref 2 · internal anchor
A PPO-based RL framework with execution-aware dense rewards and token-level mapping improves pass@1 by 19% on MBPP and reduces execution failures by 51% on RoboEval for LLM code generation.
Pramana: A Protocol-Layer Treatment of Claim Verification in Autonomous Agent Networks · cs.CR · 2026-05-19 · unverdicted · none · ref 15 · internal anchor
Pramana defines a typed ClaimAttestation protocol with four variants and verify operations, specifies its lifecycle in TLA+, model-checks it with TLC, and provides a tested Python implementation for auditable agent claims.
Prompt Optimization for LLM Code Generation via Reinforcement Learning · cs.SE · 2026-05-18 · unverdicted · none · ref 3 · internal anchor
A PPO agent with hybrid actions and test-driven rewards optimizes prompts for code LLMs, raising strict Pass@1 scores on MBPP+, HumanEval+, and APPS over prior methods.
Code as Agent Harness · cs.CL · 2026-05-18 · accept · none · ref 2 · internal anchor
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
Latent Action Reparameterization for Efficient Agent Inference · cs.AI · 2026-05-18 · unverdicted · none · ref 3 · internal anchor
LAR learns a compact latent action space from trajectories that shortens the effective decision horizon for LLM agents, reducing token count and inference time while preserving task success.
$\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control · cs.LG · 2026-05-18 · unverdicted · none · ref 51 · internal anchor
f-OPD decomposes on-policy distillation drift into rollout and supervision components, then applies a sample-level freshness score to adaptively limit stale data influence and stabilize long-horizon agent training.
Interactive Evaluation Requires a Design Science · cs.AI · 2026-05-18 · unverdicted · none · ref 1 · internal anchor
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.
Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications · cs.CR · 2026-05-17 · unverdicted · none · ref 4 · internal anchor
Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.
Leveraging Error Diversity in Group Rollouts for Reinforcement Learning · cs.LG · 2026-05-17 · unverdicted · none · ref 1 · 2 links · internal anchor
EDAS modulates RL advantage signals for incorrect rollouts by amplifying penalties on repeated errors and attenuating them on rare ones, yielding average gains of 6.29 points over DAPO on Qwen3-8B across seven math benchmarks.
Lever: Speculative LLM Inference on Smartphones · cs.LG · 2026-05-16 · unverdicted · none · ref 3 · internal anchor
Lever optimizes the drafting, verification, and execution stages of speculative decoding for flash-backed LLM inference on smartphones, reporting 2.93x average latency reduction over baseline flash-offloaded inference.
The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code · cs.SE · 2026-05-13 · unverdicted · none · ref 8 · internal anchor
LLM-generated code matches human-written code in overall readability but exhibits different issue patterns, and prompt engineering has limited impact on improving it.
Reinforced Collaboration in Multi-Agent Flow Networks · cs.LG · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
MANGO optimizes multi-agent LLM workflows via flow networks, RL, and textual gradients, delivering up to 12.8% higher performance and 47.4% better efficiency while generalizing to new domains.
Multi-Token Residual Prediction · cs.LG · 2026-05-12 · unverdicted · none · ref 4 · 2 links · internal anchor
MRP predicts logit residuals between adjacent denoising steps in DLMs from backbone hidden states to support efficient multi-token denoising, yielding up to 1.4x lossless speedup or 22.6-point accuracy gains on code and math tasks.
D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting · cs.LG · 2026-05-12 · unverdicted · none · ref 26 · internal anchor
D-PACE derives per-position weights from a surrogate of expected accepted draft length to shift training focus toward currently limiting positions, yielding measured gains in wall-clock speedup and emitted length across benchmarks.
