super hub Mixed citations

Program Synthesis with Large Language Models

Augustus Odena, David Dohan, Henryk Michalewski, Jacob Austin, Maarten Bosma, Maxwell Nye · 2021 · cs.PL · arXiv 2108.07732

Mixed citation behavior. Most common role is background (52%).

562 Pith papers citing it

Background 52% of classified citations

open full Pith review browse 562 citing papers more from Augustus Odena arXiv PDF

abstract

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 58 dataset 41 method 4 other 2

citation-polarity summary

background 55 use dataset 36 unclear 9 use method 4 support 1

claims ledger

abstract This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The M

authors

Augustus Odena David Dohan Henryk Michalewski Jacob Austin Maarten Bosma Maxwell Nye

co-cited works

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

Sumi: Open Uniform Diffusion Language Model from Scratch

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · novelty 8.0

A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.

BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

cs.LG · 2026-05-16 · conditional · novelty 8.0

BoLT is a benchmark of surrogate models fitted to real LLM experiment data that enables evaluation of Bayesian and black-box optimization methods on multi-fidelity, multi-objective, high-dimensional LLM tasks.

PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

cs.AI · 2026-05-10 · unverdicted · novelty 8.0

PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

Secret Stealing Attacks on Local LLM Fine-Tuning through Supply-Chain Model Code Backdoors

cs.CR · 2026-04-30 · unverdicted · novelty 8.0

Backdoored model code enables deterministic, verifiable stealing of sparse secrets during local LLM fine-tuning via tensor-rule matching and gradient injection, achieving over 98% strict attack success rate while bypassing DP-SGD and auditing defenses.

StabilizerBench: A Benchmark for AI-Assisted Quantum Error Correction Circuit Synthesis

quant-ph · 2026-04-23 · conditional · novelty 8.0

StabilizerBench is a new benchmark for evaluating AI agents on generating, optimizing, and making fault-tolerant stabilizer circuits for quantum error correction, with efficient verification and multi-tier scoring.

Gradient-Based Program Synthesis with Neurally Interpreted Languages

cs.LG · 2026-04-20 · unverdicted · novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

cs.CR · 2025-07-14 · unverdicted · novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Code as Policies: Language Model Programs for Embodied Control

cs.RO · 2022-09-16 · accept · novelty 8.0

Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

Show Your Work: Scratchpads for Intermediate Computation with Language Models

cs.LG · 2021-11-30 · unverdicted · novelty 8.0

Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.

Regression Accumulation in Multi-Turn LLM Programming Conversations

cs.SE · 2026-07-02 · conditional · novelty 7.0

Regression accumulation affects 40-73% of 8-turn LLM coding tasks on extended HumanEval+/MBPP+ benchmarks, with verification gates improving final-turn pass rates on prior tests.

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 7.0 · 2 refs

ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.

FRAME: Learning the Adaptation Domain with a Mixture of Fractional-Fourier Experts

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

FRAME adds a learnable fractional-Fourier order per expert in a MoE-LoRA setup so that low-rank updates are placed in the domain where they are most compact, yielding gains over fixed-domain baselines on LLaMA-3.1-8B and Qwen2.5-7B.

The Illusion of Safety: Multi-Tier Verification of AI vs. Human C++ Code

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Multi-tier verification on VULBENCH-CPP shows AI-generated C++ code triggers confirmed runtime violations roughly twice as often as human code, while static analysis misleadingly indicates parity due to code length.

AxDafny: Agentic Verified Code Generation in Dafny

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

AxDafny achieves 92.7% verification success on DafnyBench (6.5 points above prior proof-hint baselines) via verifier-guided repair and introduces the LCB-Pro-Dafny benchmark of 250 problems.

Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.

AlgoBench: Benchmarking Algorithmic Adaptation in Code Generation

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

AlgoBench creates traceable variants of competitive programming problems via constraint shifts that invalidate original algorithms, paired with complexity metrics that reveal LLMs often produce functionally correct but asymptotically unsuitable solutions.

citing papers explorer

Showing 50 of 562 citing papers.

SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution cs.LG · 2026-05-08 · unverdicted · none · ref 4 · internal anchor
SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation cs.CL · 2026-05-08 · unverdicted · none · ref 57 · 2 links · internal anchor
SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
PaT: Planning-after-Trial for Efficient Test-Time Code Generation cs.CL · 2026-05-08 · unverdicted · none · ref 4 · internal anchor
PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.
UniSD: Towards a Unified Self-Distillation Framework for Large Language Models cs.CL · 2026-05-07 · unverdicted · none · ref 27 · 2 links · internal anchor
UniSD unifies self-distillation components for autoregressive LLMs and its full integrated version improves base models by 5.4 points and baselines by 2.8 points across six benchmarks.
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts cs.CR · 2026-05-07 · unverdicted · none · ref 115 · 2 links · internal anchor
PragLocker generates function-preserving but non-portable prompts for LLM agents via code-symbol semantic anchoring followed by target-model feedback noise injection.
CodeEvolve: LLM-Driven Evolutionary Optimization with Runtime-Enriched Target Selection for Multi-Language Code Enhancement cs.SE · 2026-05-06 · unverdicted · none · ref 1 · internal anchor
CodeEvolve uses runtime-guided target selection and MCTS-augmented LLM evolution to optimize real Java and Apex code, reporting 15.22x average speedup on seven hotspots while preserving correctness.
Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation cs.LG · 2026-05-06 · unverdicted · none · ref 130 · internal anchor
The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning cs.CR · 2026-05-01 · unverdicted · none · ref 2 · internal anchor
Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.
BlenderRAG: High-Fidelity 3D Object Generation via Retrieval-Augmented Code Synthesis cs.CV · 2026-05-01 · unverdicted · none · ref 2 · internal anchor
BlenderRAG improves LLM-generated Blender code for 3D objects by retrieving semantically similar examples from a curated multimodal dataset of 500 expert-validated cases.
Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning cs.SE · 2026-05-01 · unverdicted · none · ref 5 · internal anchor
REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows cs.SE · 2026-04-30 · unverdicted · none · ref 2 · internal anchor
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
Trace-Level Analysis of Information Contamination in Multi-Agent Systems cs.AI · 2026-04-30 · unverdicted · none · ref 2 · internal anchor
Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction cs.CL · 2026-04-30 · unverdicted · none · ref 78 · internal anchor
MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling cs.AI · 2026-04-29 · unverdicted · none · ref 3 · internal anchor
A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing? cs.SE · 2026-04-28 · unverdicted · none · ref 1 · internal anchor
SAFEdit reaches 68.6% task success on EditBench code edits by using planner, editor, and verifier agents plus a failure abstraction layer, beating single-model and ReAct baselines.
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis cs.SE · 2026-04-27 · conditional · none · ref 2 · internal anchor
SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specification is the most damaging defect type while richer benchmarks are more resilient.
Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics cs.CL · 2026-04-26 · unverdicted · none · ref 3 · internal anchor
Query position is a first-order variable in dLLM ICL whose variance matches semantic quality impact; mitigated via Average Confidence metric and training-free Auto-ICL routing.
Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns cs.LG · 2026-04-25 · unverdicted · none · ref 2 · internal anchor
Profiling shows persistent expert load imbalance and domain-specific activation patterns in large MoE models; workload-aware grouping and placement reduce all-to-all communication volume by up to 20x.
No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows cs.SE · 2026-04-25 · unverdicted · none · ref 1 · internal anchor
MOSAIC generates executable scientific code without I/O test cases by combining student-teacher distillation with a consolidated context window to reduce hallucinations across subproblems.
From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification cs.SE · 2026-04-24 · unverdicted · none · ref 8 · internal anchor
Open-weight LLMs reach 81-91% success generating formally verified Dafny code for complex algorithmic problems when given structural signatures and self-healing verifier feedback.
WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning cs.CL · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much larger models.
Hybrid Policy Distillation for LLMs cs.CL · 2026-04-22 · unverdicted · none · ref 41 · internal anchor
Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve stability, efficiency, and performance.
Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation cs.SE · 2026-04-20 · unverdicted · none · ref 2 · internal anchor
Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.
TLoRA: Task-aware Low Rank Adaptation of Large Language Models cs.CL · 2026-04-20 · unverdicted · none · ref 80 · internal anchor
TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer trainable parameters.
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment cs.LG · 2026-04-20 · unverdicted · none · ref 79 · internal anchor
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
From Legal Text to Executable Decision Models: Evaluating Structured Representations for Legal Decision Model Generation cs.CL · 2026-04-18 · unverdicted · none · ref 2 · internal anchor
Input/output constraints boost LLM-generated decision model structural similarity to gold standards by 37-54%, with models matching gold outcomes on 51-53% of test scenarios while removing redundant logic.
Configuration Over Selection: Hyperparameter Sensitivity Exceeds Model Differences in Open-Source LLMs for RTL Generation cs.AR · 2026-04-18 · unverdicted · none · ref 12 · internal anchor
Hyperparameter configuration in open-source LLMs for RTL generation produces up to 25.5% intra-model pass-rate variation on VerilogEval and RTLLM, exceeding inter-model spreads by 5x with near-zero correlation in optimal settings across benchmarks.
Weight Patching: Toward Source-Level Mechanistic Localization in LLMs cs.AI · 2026-04-15 · unverdicted · none · ref 37 · internal anchor
Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a hierarchy of source, routing, and execution components.
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models cs.LG · 2026-04-15 · unverdicted · none · ref 2 · internal anchor
Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
Accelerating Speculative Decoding with Block Diffusion Draft Trees cs.CL · 2026-04-14 · unverdicted · none · ref 20 · internal anchor
DDTree builds a draft tree from a block diffusion drafter using a best-first heap on its output probabilities and verifies the tree in one target-model pass via an ancestor-only attention mask, increasing average accepted tokens per round.
Latent-Condensed Transformer for Efficient Long Context Modeling cs.CL · 2026-04-14 · unverdicted · none · ref 1 · internal anchor
Latent-Condensed Attention condenses context in MLA's latent space via query-aware semantic pooling and positional anchor selection, delivering up to 2.5x prefilling speedup and 90% KV cache reduction at 128K length with a length-independent error bound.
Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code cs.SE · 2026-04-13 · unverdicted · none · ref 4 · internal anchor
Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.
When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation cs.SE · 2026-04-10 · unverdicted · none · ref 2 · internal anchor
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
Robust Length Prediction: A Perspective from Heavy-Tailed Prompt-Conditioned Distributions cs.LG · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
LLM output lengths conditioned on a prompt form heavy-tailed distributions, so robust estimation from multiple samples outperforms single-sample labels for prediction.
An Imperfect Verifier is Good Enough: Learning with Noisy Rewards cs.LG · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
RLVR for LLMs tolerates up to 15% verifier noise with validation accuracy within 2 points of clean baselines across three model families and two task domains.
Babbling Suppression: Making LLMs Greener One Token at a Time cs.SE · 2026-04-08 · unverdicted · none · ref 4 · internal anchor
Babbling Suppression stops LLM code generation upon test passage to reduce token output and energy consumption by up to 65% across Python and Java benchmarks.
SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics cs.SE · 2026-04-06 · unverdicted · none · ref 10 · internal anchor
SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints cs.SE · 2026-04-06 · unverdicted · none · ref 6 · internal anchor
Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.
Improving Code Comprehension through Cognitive-Load Aware Automated Refactoring for Novice Programmers cs.SE · 2026-03-17 · conditional · none · ref 7 · internal anchor
CDDRefactorER constrains AI-driven refactoring using Cognitive-Driven Development rules to cut failures by 54-71% and raise novice comprehension scores by 22-31%.
Capacity-Aware Mixture Law Enables Efficient LLM Data Optimization cs.LG · 2026-03-09 · unverdicted · none · ref 1 · internal anchor
CAMEL is a scaling law capturing nonlinear model-size and mixture interactions to extrapolate optimal data mixtures for large LLMs from small-model experiments, reducing optimization cost by 50% and improving benchmarks by up to 3%.
Robust Policy Optimization to Prevent Catastrophic Forgetting cs.LG · 2026-02-09 · unverdicted · none · ref 6 · internal anchor
FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.
AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction cs.AI · 2026-02-05 · unverdicted · none · ref 3 · internal anchor
AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.
Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL cs.LG · 2026-02-03 · unverdicted · none · ref 34 · internal anchor
PULSE exploits BF16-invisible sparsity in weight updates to enable over 100x lower communication in distributed RL post-training via compute-visible sparsification.
MoCo: A One-Stop Shop for Model Collaboration Research cs.CL · 2026-01-29 · accept · none · ref 1 · internal anchor
MoCo supplies a unified library of 26 collaboration strategies and benchmarks demonstrating average outperformance over single models in 61 percent of (model, data) pairs.
Token-Level LLM Collaboration via FusionRoute cs.AI · 2026-01-08 · unverdicted · none · ref 2 · internal anchor
FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and merging methods on reasoning and generation benchmarks.
Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design cs.CV · 2025-12-30 · conditional · none · ref 4 · internal anchor
Three-example few-shot prompting optimizes LLM-generated vision architectures while a whitespace-normalized hash provides 100x faster duplicate detection than AST parsing across seven benchmarks.
VeruSAGE: A Study of Agent-Based Verification for Rust Systems cs.OS · 2025-12-20 · unverdicted · none · ref 5 · internal anchor
LLM agents complete over 80% of tasks on a new 849-task Rust verification benchmark and over 90% on unfinished human proofs.
LLaDA2.0: Scaling Up Diffusion Language Models to 100B cs.LG · 2025-12-10 · conditional · none · ref 2 · internal anchor
LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
Knowledge-Graph-Driven Data Synthesis for Low-Resource Software Development: A HarmonyOS Case Study cs.SE · 2025-11-29 · unverdicted · none · ref 4 · internal anchor
APIKG4Syn synthesizes API-oriented training data via knowledge graphs and Monte Carlo search to fine-tune a 7B model that reaches 25% pass@1 on HarmonyOS code generation, beating untuned GPT-4o at 17.59%.
Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match cs.CL · 2025-11-28 · unverdicted · none · ref 3 · internal anchor
FLy is a training-free method that speeds up LLM generation by accepting semantically correct but non-exact draft tokens via an entropy gate and deferred verification window.

Program Synthesis with Large Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer