
arxiv: 2311.07911 · v1 · submitted 2023-11-14 · 💻 cs.CL · cs.AI · cs.LG

Instruction-Following Evaluation for Large Language Models

Pith reviewed 2026-05-24 05:31 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords instruction following · LLM evaluation · benchmark · verifiable instructions · natural language instructions · model assessment

The pith

IFEval introduces an objective benchmark using verifiable instructions to test how well LLMs follow natural language directives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents IFEval as a benchmark for evaluating large language models on following instructions through a set of automatically checkable rules. It identifies 25 types of verifiable instructions such as length limits and keyword requirements and builds around 500 prompts that embed one or more of these rules. This setup replaces costly human judgments and potentially biased LLM evaluators with direct, reproducible checks. Results from two widely used LLMs illustrate how the benchmark works in practice. The approach aims to standardize measurement of a core LLM capability.

Core claim

IFEval is a benchmark built from 25 types of verifiable instructions and approximately 500 prompts, each containing one or more such instructions, that enables objective and reproducible assessment of whether large language models follow natural language directives without relying on human evaluators or other LLMs.

What carries the argument

IFEval benchmark, a collection of 25 verifiable instruction types and roughly 500 prompts designed for automatic verification of model outputs.

If this is right

  • Instruction-following performance of LLMs becomes measurable through direct checks rather than subjective review.
  • Evaluations can be run quickly and reproduced by anyone with access to the prompts and verification code.
  • Models can be compared on their ability to satisfy concrete constraints such as output length or required mentions.
  • Public release of the prompts and code allows consistent tracking of progress across different LLMs.
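
As a rough illustration of how such comparisons could be scored, the sketch below aggregates per-instruction pass/fail results into two numbers. The function names and the nested-list input format are assumptions made for this example, not the paper's released code. Figure 2 of the paper reports an instruction-level accuracy; the prompt-level variant is included here only as a contrast.

```python
from typing import List


def instruction_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of individual instructions followed, pooled over all prompts."""
    flat = [ok for prompt in results for ok in prompt]
    return sum(flat) / len(flat) if flat else 0.0


def prompt_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of prompts in which every instruction was followed."""
    if not results:
        return 0.0
    return sum(all(prompt) for prompt in results) / len(results)


# Hypothetical outcomes: one inner list per prompt, one bool per instruction.
results = [[True, True], [True, False], [True]]
print(instruction_level_accuracy(results))
print(prompt_level_accuracy(results))
```

A prompt carrying several instructions counts once at the prompt level but several times at the instruction level, so the two numbers can diverge when failures cluster in multi-instruction prompts.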

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be extended by turning additional practical constraints into verifiable rules.
  • Combining IFEval scores with other capability tests might reveal whether instruction following correlates with broader model reliability.
  • The focus on checkable rules suggests many real-world instructions could be made objective if their success conditions are defined clearly enough.

Load-bearing premise

The 25 chosen types of verifiable instructions and the prompts built from them capture the instruction-following skills that matter in actual LLM use.

What would settle it

A test showing that models scoring high on IFEval still fail to follow similar but non-verifiable instructions in open-ended tasks would undermine the benchmark's coverage.

Figures

Figures reproduced from arXiv: 2311.07911 by Denny Zhou, Jeffrey Zhou, Le Hou, Siddhartha Brahma, Sujoy Basu, Swaroop Mishra, Tianjian Lu, Yi Luan.

Figure 1: Instructions such as “write at least 25 sentences” can be automatically and objectively ver...
Figure 2: Instruction-level strict-accuracy of each model, separated by each instruction category.
Figure 3: Instruction following accuracy per detailed category.

Original abstract

One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
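
The two example instructions quoted in the abstract can be checked with a few lines of code. The following is a minimal sketch, not the implementation in the authors' repository: the function names are hypothetical, and the choices to count whitespace-separated tokens as words and to match the keyword case-insensitively on word boundaries are assumptions of this example.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (a deliberately simple notion of 'word')."""
    return len(text.split())


def check_more_than_n_words(response: str, n: int = 400) -> bool:
    """True if the response contains strictly more than n words."""
    return count_words(response) > n


def check_keyword_frequency(response: str, keyword: str = "AI", min_count: int = 3) -> bool:
    """True if keyword appears as a whole word at least min_count times."""
    # Assumption: matching is case-insensitive and respects word boundaries.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


response = "AI is everywhere. Many people use AI daily, and AI keeps improving."
print(check_more_than_n_words(response))
print(check_keyword_frequency(response))
```

Each check returns a plain boolean, so the verdict does not depend on a human rater or a judge model.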

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Instruction-Following Eval (IFEval), a benchmark for assessing LLMs' ability to follow natural language instructions. It consists of 25 types of verifiable instructions (examples: 'write in more than 400 words', 'mention the keyword of AI at least 3 times') and around 500 prompts each containing one or more such instructions. The design enables automatic, objective verification without human or LLM judges. Results are reported for two widely available LLMs, and the code and data are released publicly.

Significance. If the 25 instruction types are representative of practical use, IFEval would offer a significant contribution as a low-cost, reproducible, and bias-resistant alternative to existing evaluation methods. The public release of code and data is an explicit strength that directly supports adoption and further work in the field.

major comments (1)
  1. [Abstract / instruction types section] Abstract and section describing the 25 instruction types: the manuscript states only that the authors 'identified' the 25 types of verifiable instructions but supplies no sampling method, no comparison to distributions of instructions in user logs or existing datasets, and no correlation study with downstream task performance. This is load-bearing for the central claim that the benchmark evaluates instruction-following capabilities that matter for practical LLM use.
minor comments (2)
  1. Specify the exact number of prompts (rather than 'around 500') and provide a breakdown by instruction type in the main text.
  2. Add a table or enumerated list of all 25 instruction types together with one example prompt per type to improve clarity and reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for identifying this point about the justification of the 25 instruction types. We respond to the major comment below.

Point-by-point responses
  1. Referee: [Abstract / instruction types section] Abstract and section describing the 25 instruction types: the manuscript states only that the authors 'identified' the 25 types of verifiable instructions but supplies no sampling method, no comparison to distributions of instructions in user logs or existing datasets, and no correlation study with downstream task performance. This is load-bearing for the central claim that the benchmark evaluates instruction-following capabilities that matter for practical LLM use.

    Authors: We thank the referee for this observation. The 25 instruction types were selected as a set of constraints that can be verified deterministically by code (word counts, required keywords, formatting rules, etc.) while also reflecting instruction patterns commonly seen in LLM interactions. The manuscript's central contribution is the introduction of an objective, reproducible evaluation method rather than a claim that these 25 types constitute a statistically representative sample of all user instructions. No formal sampling from logs or datasets was performed, as the work prioritizes verifiable instructions over distributional analysis; likewise, no correlation study with downstream tasks was conducted, as that would require a separate experimental design outside the paper's scope. We believe the benchmark remains useful for its stated purpose of enabling automatic evaluation without human or LLM judges. We do not intend to add sampling procedures or correlation analyses in a revision.

    Revision: no

Circularity Check

0 steps flagged

No significant circularity; benchmark construction is self-contained

Full rationale

The paper introduces IFEval as a benchmark of verifiable instructions without any equations, derivations, parameter fitting, or predictions. The 25 instruction types are described as identified by the authors, with no self-citation chains, uniqueness theorems, or ansatzes invoked to justify them. No load-bearing step reduces to its own inputs by construction; the work relies on external verifiable criteria rather than internal fitting or self-referential logic. This matches the default expectation of no circularity for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the paper adds a new benchmark without introducing fitted parameters or new entities. The central addition rests on the domain assumption that certain instructions admit unambiguous automatic verification.

axioms (1)
  • domain assumption Certain natural language instructions (e.g., word counts, keyword mentions) can be verified automatically without ambiguity or human judgment.
    The benchmark is built around this property to enable objective evaluation.
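
To make this assumption concrete, here is a hedged sketch of a checker for the Figure 1 example, "write at least 25 sentences". The function names and the regex-based sentence splitter are assumptions of this example, not the paper's code. The check is deterministic once a definition of "sentence" is fixed, but the definition itself is a design choice.

```python
import re


def count_sentences(text: str) -> int:
    """Naive count: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def check_min_sentences(response: str, min_sentences: int = 25) -> bool:
    """True if the response contains at least min_sentences sentences."""
    return count_sentences(response) >= min_sentences


print(count_sentences("One. Two! Three?"))
# The naive splitter treats the abbreviation "Dr." as a sentence end.
print(count_sentences("Dr. Smith arrived."))
```

The second call shows where a simple rule can miscount, which is why the choice of splitter matters for how unambiguous the verdict really is.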



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

    cs.LG 2026-05 conditional novelty 8.0

    BoLT is a benchmark of surrogate models fitted to real LLM experiment data that enables evaluation of Bayesian and black-box optimization methods on multi-fidelity, multi-objective, high-dimensional LLM tasks.

  2. Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

    cs.LG 2026-05 accept novelty 8.0

    Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

  3. Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

    cs.AI 2026-04 unverdicted novelty 8.0

    User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

  4. LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    cs.CL 2024-06 unverdicted novelty 8.0

    LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

  5. ORPO: Monolithic Preference Optimization without Reference Model

    cs.CL 2024-03 conditional novelty 8.0

    ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

  6. Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

    cs.AI 2026-05 conditional novelty 7.0

    DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.

  7. Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

    cs.AI 2026-05 conditional novelty 7.0

    Presents the first fully open pipeline for clinical LLMs that unifies eight public QA datasets with clinician-vetted synthetic data from guidelines and vignettes, achieving improved performance on medical benchmarks w...

  8. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  9. Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-train...

  10. 3D Primitives are a Spatial Language for VLMs

    cs.CV 2026-05 conditional novelty 7.0

    3D geometric primitives in executable code act as an effective intermediate spatial language that boosts VLMs on reconstruction and question-answering tasks.

  11. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  12. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 7.0

    Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

  13. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  14. Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    Semantic consensus on model outputs for public prompts enables federated LLM fine-tuning that matches parameter-aggregation baselines with orders-of-magnitude lower communication.

  15. Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    TABOM is a trajectory-aligned Boltzmann modeling framework that turns self-distilled inference paths into a pairwise ranking loss to close the training-inference gap in diffusion language models and expand their effec...

  16. Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

    cs.LG 2026-05 unverdicted novelty 7.0

    Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...

  17. Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning

    cs.CV 2026-05 unverdicted novelty 7.0

    DRAPE generates query-image conditioned prompts on the fly for multimodal continual instruction tuning and reports SOTA results on MCIT benchmarks.

  18. Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

    cs.CL 2026-05 unverdicted novelty 7.0

    GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.

  19. ProactBench: Beyond What The User Asked For

    cs.LG 2026-05 unverdicted novelty 7.0

    ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.

  20. WASIL: In-the-Wild Arabic Spoken Interactions with LLMs

    cs.SD 2026-05 accept novelty 7.0

    WASIL is a released dataset of 8,529 in-the-wild Arabic spoken LLM interactions with audio, ASR hypotheses, responses, explicit like/dislike feedback, answerability annotations, a 2,000-turn MSA and dialect test set, ...

  21. Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

  22. Rubric-based On-policy Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

  23. Steer Like the LLM: Activation Steering that Mimics Prompting

    cs.CL 2026-05 unverdicted novelty 7.0

    PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.

  24. Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

    cs.AI 2026-05 unverdicted novelty 7.0

    TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correc...

  25. When Roles Fail: Epistemic Constraints on Advocate Role Fidelity in LLM-Based Political Statement Analysis

    cs.AI 2026-04 unverdicted novelty 7.0

    LLMs assigned advocate roles in political statement analysis frequently override those roles due to epistemic constraints, as quantified by new metrics and a stance classifier across 60 English and German statements.

  26. Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens

    cs.CL 2026-04 unverdicted novelty 7.0

    Entropy-guided supertokens from BPE on reasoning traces compress LLM outputs by 8.1% on average across models and math benchmarks with no accuracy loss while exposing strategy differences between correct and incorrect traces.

  27. The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

    cs.CL 2026-04 accept novelty 7.0

    SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

  28. Calibeating Prediction-Powered Inference

    stat.ML 2026-04 unverdicted novelty 7.0

    Post-hoc calibration of miscalibrated black-box predictions on a labeled sample improves efficiency of prediction-powered inference for semisupervised mean estimation.

  29. IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of...

  30. Super Apriel: One Checkpoint, Many Speeds

    cs.LG 2026-04 unverdicted novelty 7.0

    A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.

  31. EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

    cs.LG 2026-04 unverdicted novelty 7.0

    EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.

  32. Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.

  33. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  34. Schema Key Wording as an Instruction Channel in Structured Generation under Constrained Decoding

    cs.CL 2026-04 unverdicted novelty 7.0

    Schema-key wording functions as an implicit instruction channel under constrained decoding, with experiments showing that rephrasing only the keys can substantially change accuracy on math benchmarks while prompt, mod...

  35. Many-Tier Instruction Hierarchy in LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.

  36. Do AI Coding Agents Log Like Humans? An Empirical Study

    cs.SE 2026-04 unverdicted novelty 7.0

    AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans perfo...

  37. SAGE: A Service Agent Graph-guided Evaluation Benchmark

    cs.AI 2026-04 unverdicted novelty 7.0

    SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...

  38. MARS: Enabling Autoregressive Models Multi-Token Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.

  39. Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.

  40. Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

    cs.LG 2026-04 unverdicted novelty 7.0

    Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.

  41. Trivial Vocabulary Bans Improve LLM Reasoning More Than Deep Linguistic Constraints

    cs.CL 2026-04 accept novelty 7.0

    Banning filler words like 'very' and 'just' improved LLM reasoning by 6.7 percentage points while E-Prime improved it by only 3.7, with gains ranking in exact inverse order of theoretical depth across models and tasks.

  42. IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

    cs.CL 2026-03 unverdicted novelty 7.0

    IF-RewardBench uses preference graphs for listwise evaluation of judge models on instruction-following, exposing deficiencies in current judges and achieving stronger correlation with downstream task performance than ...

  43. Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution

    cs.SE 2026-02 unverdicted novelty 7.0

    IFCodeEvolve synthesizes coding data via actor-schema co-evolution with MCTS, boosting a 32B model's performance to match proprietary SOTA on instruction following.

  44. CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training

    cs.LG 2026-02 unverdicted novelty 7.0

    CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.

  45. Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

    eess.AS 2025-09 unverdicted novelty 7.0

    Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.

  46. PRIMETIME : Limits of LLMs in Temporal Primitives

    cs.NE 2025-04 unverdicted novelty 7.0

    PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

  47. VoiceBench: Benchmarking LLM-Based Voice Assistants

    cs.CL 2024-10 unverdicted novelty 7.0

    VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.

  48. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  49. ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning

    cs.CL 2026-05 unverdicted novelty 6.0

    ARES generates 100K rubric-annotated QA instances from raw documents and shows rubric-based RL trained on them outperforms continual pretraining, SFT, and binary-reward RL on seven benchmarks.

  50. PACE: Two-Timescale Self-Evolution for Small Language Model Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    PACE coordinates low-risk prompt evolution with validated higher-risk control-logic updates to improve frozen SLM agents on benchmarks without model retraining.

  51. On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

    cs.LG 2026-05 conditional novelty 6.0

    On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.

  52. Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

    cs.CL 2026-05 conditional novelty 6.0

    Experiments reveal that LLMs follow instructions at rates from 1% to 99% when opposed by hardcoded conflicting patterns, with robustness tied to output diversity and alignment with model priors rather than general capability.

  53. Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages

    cs.LG 2026-05 unverdicted novelty 6.0

    Introspective Training annotates data with natural-language feedback from a thinking reward model and conditions all LLM training stages on that feedback, bending scaling curves for up to 2.8x compute efficiency gains...

  54. DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

    cs.CL 2026-05 unverdicted novelty 6.0

    DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.

  55. Post-Trained MoE Can Skip Half Experts via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    ZEDA injects zero-output experts and uses two-stage self-distillation to adapt post-trained MoE models into dynamic ones that skip over half the experts, yielding 1.2x inference speedup with small accuracy drops.

  56. Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.

  57. Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

    cs.CL 2026-05 unverdicted novelty 6.0

    Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.

  58. PreFT: Prefill-only finetuning for efficient inference

    cs.LG 2026-05 accept novelty 6.0

    Prefill-only adaptation of LLMs yields 1.9x higher throughput for 512 adapters on Llama 3.1 70B with near-parity performance on RL tasks and recoverable loss on SFT.

  59. Bayesian Model Merging

    cs.LG 2026-05 unverdicted novelty 6.0

    Bayesian Model Merging introduces a bi-level optimization framework that merges task-specific models via closed-form Bayesian regression with an anchor prior and global hyperparameter search, outperforming baselines a...

  60. Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 175 Pith papers · 10 internal anchors

  3. [3]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  5. [5]

    A survey on evaluation of large language models

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023

  6. [6]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  7. [7]

    Can large language models be an alternative to human evaluations?

    Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937, 2023

  8. [8]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  9. [9]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

  10. [10]

    GPTScore: Evaluate as You Desire

    Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166, 2023

  11. [11]

    GPT-4 passes the bar exam

    Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo. GPT-4 passes the bar exam. Available at SSRN 4389233, 2023

  12. [12]

    GPT-4 vs. GPT-3.5: A concise showdown

    Anis Koubaa. GPT-4 vs. GPT-3.5: A concise showdown. 2023

  13. [13]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023

  14. [14]

    Cross-task generalization via natural language crowdsourcing instructions

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3470–3487, 2022

  15. [15]

    Automated evaluation of written discourse coherence using GPT-4

    Ben Naismith, Phoebe Mulcaire, and Jill Burstein. Automated evaluation of written discourse coherence using GPT-4. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pp. 394–403, 2023

  16. [16]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  17. [17]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  18. [18]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023

  19. [19]

    Are large language models good evaluators for abstractive summarization?

    Chenhui Shen, Liying Cheng, Yang You, and Lidong Bing. Are large language models good evaluators for abstractive summarization? arXiv preprint arXiv:2305.13091, 2023

  20. [20]

    Towards better evaluation of instruction-following: A case-study in summarization

    Ondrej Skopek, Rahul Aralikatte, Sian Gooding, and Victor Carbune. Towards better evaluation of instruction-following: A case-study in summarization. arXiv preprint arXiv:2310.08394, 2023

  21. [21]

    Evaluating large language models on controlled generation tasks

    Jiao Sun, Yufei Tian, Wangchunshu Zhou, Nan Xu, Qian Hu, Rahul Gupta, John Frederick Wieting, Nanyun Peng, and Xuezhe Ma. Evaluating large language models on controlled generation tasks. arXiv preprint arXiv:2310.14542, 2023

  22. [22]

    Stanford Alpaca: An instruction-following LLaMA model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  23. [23]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  24. [24]

    Multitask prompted training enables zero-shot task generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. Multitask prompted training enables zero-shot task generalization. In ICLR, 2022

  25. [25]

    Large Language Models are not Fair Evaluators

    Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023

  26. [26]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In ICLR, 2022

  27. [27]

    Large language models are diverse role-players for summarization evaluation

    Ning Wu, Ming Gong, Linjun Shou, Shining Liang, and Daxin Jiang. Large language models are diverse role-players for summarization evaluation. arXiv preprint arXiv:2303.15078, 2023

  28. [28]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023

  29. [29]

    Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections

    Ruiqi Zhong, Kristy Lee, Zheng Zhang, and Dan Klein. Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2856–2878, 2021
