Title resolution pending

Evaluating Large Language Models Trained on Code , author= · 2021

44 Pith papers cite this work. Polarity classification is still indexing.

44 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

cs.CL · 2023-10-10 · unverdicted · novelty 8.0

SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.

Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

Rule2DRC is a benchmark for LLM agents synthesizing DRC scripts from natural language rules, paired with SplitTester that improves Best-of-N selection via execution-guided discriminative test generation.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

DARE: Diffusion Language Model Activation Reuse for Efficient Inference

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.

BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

BWLA is the first post-training quantization method for LLMs that achieves 1-bit weights paired with low-bit activations such as 6 bits, using OKT to reshape weights and suppress activation tails plus PSP for low-rank refinement.

Social Bias in LLM-Generated Code: Benchmark and Mitigation

cs.SE · 2026-05-01 · unverdicted · novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

Evaluating Answer Leakage Robustness of LLM Tutors against Adversarial Student Attacks

cs.CR · 2026-04-20 · unverdicted · novelty 7.0

LLM tutors leak answers under adversarial student attacks, but a fine-tuned jailbreak agent and simple defenses can benchmark and improve robustness.

Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.

Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

cs.IR · 2026-04-19 · unverdicted · novelty 7.0

Code-switching creates a fundamental performance bottleneck for multilingual retrievers, causing drops of up to 27% on new benchmarks CSR-L and CS-MTEB, with embedding divergence as the key cause and vocabulary expansion insufficient to fix it.

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

cs.SE · 2025-02-25 · unverdicted · novelty 7.0

SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

Controlled experiments show structured reasoning traces and higher-density math-domain samples improve mathematical reasoning more than pure executable code, with internal routing patterns reflecting these data effects.

DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

cs.SE · 2026-05-17 · unverdicted · novelty 6.0 · 2 refs

DiagEval applies trajectory-conditioned diagnostic probes to recover 45.6-62.1% of misattributed failures in GUI-agent software evaluation, raising accuracy from 69.9% to 78.3% on WebDevJudge-Unit and 65.0% to 81.6% on RealDevBench.

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

cs.CR · 2026-05-17 · conditional · novelty 6.0

Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.

Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.

CTFusion: A CTF-based Benchmark for LLM Agent Evaluation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

CTFusion is a live-CTF streaming benchmark that prevents data contamination by forwarding only the first correct flag per challenge under a shared team account.

Crosslingual On-Policy Self-Distillation for Multilingual Reasoning

cs.CL · 2026-05-10 · unverdicted · novelty 6.0

COPSD improves mathematical reasoning in low-resource languages by having LLMs self-distill from their own high-resource English behavior via token-level divergence on rollouts with privileged crosslingual context.

POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(sqrt(T gamma_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery tasks such as protein search and quantum circuit design.

Query-efficient model evaluation using cached responses

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.

Computer Use at the Edge of the Statistical Precipice

cs.SE · 2026-05-07 · unverdicted · novelty 6.0

A blind replay script matches frontier model performance on static CUA benchmarks due to non-principled environments and evaluation methods, prompting PRISM design principles and the DigiWorld benchmark with improved statistical aggregation.

Parallel Prefix Verification for Speculative Generation

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

PARSE accelerates LLM inference via parallel semantic prefix verification in a single forward pass, delivering 1.25x-4.3x speedups alone and up to 4.5x when combined with EAGLE-3.

Learn-to-learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

A hypernetwork generates meta-gating parameters for SwiGLU blocks to let LLMs adapt their nonlinearity to arbitrary textual conditions, outperforming finetuning and meta-learning baselines with reasonable generalization to unseen cases.

citing papers explorer

Showing 44 of 44 citing papers.

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? cs.CL · 2023-10-10 · unverdicted · none · ref 13
SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.
Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation cs.LG · 2026-05-15 · unverdicted · none · ref 24
Rule2DRC is a benchmark for LLM agents synthesizing DRC scripts from natural language rules, paired with SplitTester that improves Best-of-N selection via execution-guided discriminative test generation.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching cs.CL · 2026-05-12 · unverdicted · none · ref 91 · 2 links
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
DARE: Diffusion Language Model Activation Reuse for Efficient Inference cs.LG · 2026-05-01 · unverdicted · none · ref 20
DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.
BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs cs.LG · 2026-05-01 · unverdicted · none · ref 30
BWLA is the first post-training quantization method for LLMs that achieves 1-bit weights paired with low-bit activations such as 6 bits, using OKT to reshape weights and suppress activation tails plus PSP for low-rank refinement.
Social Bias in LLM-Generated Code: Benchmark and Mitigation cs.SE · 2026-05-01 · unverdicted · none · ref 39
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning cs.LG · 2026-05-01 · unverdicted · none · ref 41 · 2 links
ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
Evaluating Answer Leakage Robustness of LLM Tutors against Adversarial Student Attacks cs.CR · 2026-04-20 · unverdicted · none · ref 52
LLM tutors leak answers under adversarial student attacks, but a fine-tuned jailbreak agent and simple defenses can benchmark and improve robustness.
Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion cs.CL · 2026-04-20 · unverdicted · none · ref 11
TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.
Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers cs.IR · 2026-04-19 · unverdicted · none · ref 29
Code-switching creates a fundamental performance bottleneck for multilingual retrievers, causing drops of up to 27% on new benchmarks CSR-L and CS-MTEB, with embedding divergence as the key cause and vocabulary expansion insufficient to fix it.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution cs.SE · 2025-02-25 · unverdicted · none · ref 101
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models cs.AI · 2024-06-14 · conditional · none · ref 115
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 39
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code cs.AI · 2026-05-19 · unverdicted · none · ref 1
Controlled experiments show structured reasoning traces and higher-density math-domain samples improve mathematical reasoning more than pure executable code, with internal routing patterns reflecting these data effects.
DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents cs.SE · 2026-05-17 · unverdicted · none · ref 8 · 2 links
DiagEval applies trajectory-conditioned diagnostic probes to recover 45.6-62.1% of misattributed failures in GUI-agent software evaluation, raising accuracy from 69.9% to 78.3% on WebDevJudge-Unit and 65.0% to 81.6% on RealDevBench.
ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents cs.CR · 2026-05-17 · conditional · none · ref 51
Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility cs.LG · 2026-05-13 · unverdicted · none · ref 13
SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
CTFusion: A CTF-based Benchmark for LLM Agent Evaluation cs.LG · 2026-05-12 · unverdicted · none · ref 29
CTFusion is a live-CTF streaming benchmark that prevents data contamination by forwarding only the first correct flag per challenge under a shared team account.
Crosslingual On-Policy Self-Distillation for Multilingual Reasoning cs.CL · 2026-05-10 · unverdicted · none · ref 31
COPSD improves mathematical reasoning in low-resource languages by having LLMs self-distill from their own high-resource English behavior via token-level divergence on rollouts with privileged crosslingual context.
POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles cs.LG · 2026-05-08 · unverdicted · none · ref 99
POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(sqrt(T gamma_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery tasks such as protein search and quantum circuit design.
Query-efficient model evaluation using cached responses cs.LG · 2026-05-08 · unverdicted · none · ref 159
DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.
Computer Use at the Edge of the Statistical Precipice cs.SE · 2026-05-07 · unverdicted · none · ref 4
A blind replay script matches frontier model performance on static CUA benchmarks due to non-principled environments and evaluation methods, prompting PRISM design principles and the DigiWorld benchmark with improved statistical aggregation.
Parallel Prefix Verification for Speculative Generation cs.AI · 2026-05-05 · unverdicted · none · ref 41
PARSE accelerates LLM inference via parallel semantic prefix verification in a single forward pass, delivering 1.25x-4.3x speedups alone and up to 4.5x when combined with EAGLE-3.
Learn-to-learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM cs.CL · 2026-05-03 · unverdicted · none · ref 114
A hypernetwork generates meta-gating parameters for SwiGLU blocks to let LLMs adapt their nonlinearity to arbitrary textual conditions, outperforming finetuning and meta-learning baselines with reasonable generalization to unseen cases.
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding cs.CL · 2026-05-01 · unverdicted · none · ref 38
EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems cs.AI · 2026-04-23 · unverdicted · none · ref 54
DiffMAS jointly optimizes latent communication and reasoning in multi-agent LLM systems via parameter-efficient supervised training on trajectories, yielding consistent gains over baselines on math, science, and code benchmarks.
Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought cs.LG · 2026-04-20 · unverdicted · none · ref 40
CAL-GRPO calibrates per-attempt weights in multi-attempt CoT to deliver unbiased gradients for optimizing Verification@K success while keeping variance low.
Probabilistic Programs of Thought cs.CL · 2026-04-19 · unverdicted · none · ref 45
Probabilistic programs of thought let LLMs produce many program variants from one generation by building a compact probabilistic representation of the token distribution.
Muon is Scalable for LLM Training cs.LG · 2025-02-24 · unverdicted · none · ref 102
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
LIMO: Less is More for Reasoning cs.CL · 2025-02-05 · unverdicted · none · ref 119
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models cs.AI · 2024-08-01 · conditional · none · ref 148
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models cs.CL · 2024-02-18 · unverdicted · none · ref 45
ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.
Llemma: An Open Language Model For Mathematics cs.CL · 2023-10-16 · unverdicted · none · ref 89
Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 35
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 5
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
The Efficiency Gap in Byte Modeling cs.LG · 2026-05-13 · unverdicted · none · ref 39
Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation cs.AI · 2026-05-11 · unverdicted · none · ref 32
RADAR is a redundancy-aware, query-adaptive framework that uses conditional discrete graph diffusion to generate efficient communication topologies for multi-agent LLM systems, outperforming baselines on six benchmarks with higher accuracy and lower token use.
DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 23
DACA-GRPO adds denoising-aware credit assignment and bias-reduced likelihood estimation to GRPO, delivering consistent gains up to 36.3pp on math, code, constraint, and schema benchmarks for diffusion LLMs.
GIFT: Guided Fine-Tuning and Transfer for Enhancing Instruction-Tuned Language Models cs.CL · 2026-05-02 · unverdicted · none · ref 42
GIFT guides adapter fine-tuning on base models with confidence signals from instruction-tuned models before merging, yielding task-specialized models that outperform direct fine-tuning on math and knowledge benchmarks.
GiVA: Gradient-Informed Bases for Vector-Based Adaptation cs.CL · 2026-04-23 · unverdicted · none · ref 19
GiVA uses gradients to initialize vector adapters so they match LoRA performance at eight times lower rank while keeping extreme parameter efficiency.
Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective cs.CR · 2026-04-20 · unverdicted · none · ref 22
BPE tokenization creates gibberish bias in CLLMs, causing secrets with high character entropy but low token entropy to be preferentially memorized due to training data distribution shifts.
Towards General Text Embeddings with Multi-stage Contrastive Learning cs.CL · 2023-08-07 · unverdicted · none · ref 62
GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility cs.LG · 2026-05-07 · unverdicted · none · ref 232 · 2 links
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding cs.DC · 2026-05-13 · unreviewed · ref 38 · 2 links

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer