pith. machine review for the scientific record.

arxiv: 2009.03300 · v3 · submitted 2020-09-07 · 💻 cs.CY · cs.AI · cs.CL · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Measuring Massive Multitask Language Understanding

Pith reviewed 2026-05-10 12:39 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.CL · cs.LG
keywords language models · multitask evaluation · world knowledge · problem solving · benchmarks · GPT-3 · model capabilities

The pith

Current language models, including the largest GPT-3 model, still require substantial improvements to reach expert-level accuracy on a new 57-task test of knowledge and problem solving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes a new test covering 57 tasks from fields such as elementary mathematics, US history, computer science, and law to measure a text model's multitask accuracy. Models must show extensive world knowledge and problem-solving ability to score highly on the test. Most recent models perform near random chance, while the very largest GPT-3 model improves by almost 20 percentage points on average. Yet the best models remain well below expert levels on every single task. The test also reveals lopsided results, frequent failure to recognize errors, and near-random accuracy on topics like morality and law.

Core claim

The paper introduces a 57-task test of world knowledge and problem-solving ability and shows that even the most advanced models evaluated fall short of expert performance on every task, with particular weaknesses in socially important domains such as morality and law.

What carries the argument

A new test of 57 multiple-choice tasks, ranging in difficulty from elementary to professional level and covering subjects such as mathematics, history, computer science, and law.

If this is right

  • Models exhibit lopsided performance across the different tasks.
  • Models frequently do not know when they are wrong.
  • Models achieve near-random accuracy on socially important subjects such as morality and law.
  • The test can be used to analyze models across many tasks and identify important shortcomings.
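
One common way to quantify whether a model "knows when it is wrong" is a calibration error that compares stated confidence with observed accuracy. The sketch below computes a binned expected calibration error (ECE). It is a generic illustration and not necessarily the metric used in the paper; the function name, the bin count, and the toy inputs are all assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map confidence in [0, 1] to a bin index; 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Each bin contributes in proportion to the number of predictions in it.
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece


# Toy data (hypothetical): very confident, but right only half the time.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, True, False]
)
# Toy data (hypothetical): confidence matches accuracy.
well_calibrated = expected_calibration_error([0.5, 0.5], [True, False])
print(overconfident, well_calibrated)
```

A larger value means a wider gap between how confident the model is and how often it is right; a model that is confident and frequently wrong scores poorly here even when its raw accuracy is unchanged.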

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This kind of broad test allows tracking of how model performance changes as models increase in size.
  • Task-by-task results could help focus additional training on areas where models are weakest.
  • The approach provides a consistent way to compare models on a shared set of academic and professional questions.

Load-bearing premise

The 57 chosen tasks and their expert-level thresholds accurately capture extensive world knowledge and problem solving ability without selection bias or overly narrow definitions of expertise.

What would settle it

A model that attains expert-level accuracy on all 57 tasks would show that the gap reported here has closed and that substantial further improvement is no longer needed.

Original abstract

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes the Massive Multitask Language Understanding (MMLU) benchmark consisting of 57 multiple-choice tasks drawn from academic and professional domains such as mathematics, history, computer science, and law. The authors evaluate a range of language models and report that most achieve near-random accuracy of approximately 25%, while the largest GPT-3 model reaches an average of 43.9% (a nearly 20-point gain over random). All evaluated models remain substantially below the stated expert-level accuracy of roughly 89% on every task, with notably weak performance on morality and law; models also exhibit lopsided subject performance and poor calibration regarding their own errors. The benchmark is positioned as a tool for measuring breadth of world knowledge and problem-solving ability.
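
The headline numbers in the summary can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above (about 25% chance accuracy, 43.9% for the largest GPT-3 model, roughly 89% for experts); the constant and function names are illustrative and not taken from the paper.

```python
# Figures quoted in the referee summary (percent accuracy).
RANDOM_CHANCE = 25.0   # approximate chance level quoted in the summary
GPT3_LARGEST = 43.9    # reported average for the largest GPT-3 model
EXPERT_LEVEL = 89.0    # "roughly 89%", an approximate figure


def point_gap(higher: float, lower: float) -> float:
    """Return the difference in percentage points, rounded to one decimal."""
    return round(higher - lower, 1)


gain_over_chance = point_gap(GPT3_LARGEST, RANDOM_CHANCE)
gap_to_expert = point_gap(EXPERT_LEVEL, GPT3_LARGEST)

print(f"gain over chance: {gain_over_chance} points")
print(f"remaining gap to expert level: {gap_to_expert} points")
```

Under these figures, the remaining distance to expert level is more than twice the gain over chance, which is the sense in which the best model still needs "substantial improvements".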

Significance. If the reported measurements hold, this work supplies a valuable, broad-coverage benchmark that enables systematic tracking of language-model progress across many domains simultaneously. Notable strengths include the careful sourcing of questions from real exams and textbooks, the public release of the full dataset for reproducibility, the consistent evaluation protocol applied to multiple model families, and the inclusion of clear random-chance baselines. These elements allow the community to replicate and extend the results, and the empirical gaps documented have already shaped subsequent scaling and evaluation research.

minor comments (3)
  1. [Section 3] Additional quantitative detail on how expert-level accuracy thresholds were estimated for each task (e.g., number of experts, agreement statistics) would help readers evaluate the size of the reported gaps to expert performance.
  2. [Table 1, Section 4] The random baseline is uniformly listed near 25%; explicitly confirming that every task uses four options and noting any exceptions would remove minor ambiguity.
  3. [Section 5] The discussion of lopsided performance and poor self-knowledge would be strengthened by reporting per-subject standard deviations or statistical tests of imbalance.
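
The checks requested in comments 2 and 3 are simple to run once per-task results are available. The sketch below, using only Python's standard library, computes the chance accuracy for a task with a given number of answer options and the spread of per-subject accuracies. The subject names and accuracy values are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, pstdev


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing among num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


def accuracy_spread(per_subject: dict) -> tuple:
    """Return (mean, population standard deviation) of per-subject accuracies."""
    values = list(per_subject.values())
    return mean(values), pstdev(values)


# Hypothetical per-subject accuracies (fractions), for illustration only.
toy_results = {"subject_a": 0.30, "subject_b": 0.50, "subject_c": 0.70}

avg, spread = accuracy_spread(toy_results)
print(f"chance with four options: {chance_accuracy(4):.2f}")
print(f"mean accuracy: {avg:.2f}, std dev across subjects: {spread:.3f}")
```

A large standard deviation relative to the mean is one simple signal of lopsided performance; a formal test of imbalance would need per-question counts as well.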

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work and for recommending minor revision. The referee's summary accurately captures the MMLU benchmark, its construction from real academic and professional sources, the evaluation results across model families, and the documented gaps relative to expert performance. We are pleased that the strengths of careful sourcing, public data release, consistent protocols, and random baselines are highlighted. As the report lists no specific major comments or requested changes, we have no points requiring detailed rebuttal or disagreement.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper constructs a new benchmark (MMLU) with 57 tasks drawn from existing exams and reports direct empirical accuracy measurements for language models against random-chance baselines and stated expert thresholds. No equations, derivations, or first-principles predictions appear; the central claims consist solely of observed performance numbers on the released test set. Self-citations, if present, are incidental background references and do not serve as load-bearing justification for any result. The evaluation chain is therefore self-contained and externally verifiable via the dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, invented entities, or non-standard axioms; the work rests on the standard assumption that multiple-choice accuracy on curated exam questions measures relevant knowledge.

pith-pipeline@v0.9.0 · 5472 in / 955 out tokens · 34780 ms · 2026-05-10T12:39:02.575757+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/PhiForcing.lean · phi_equation · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.

  • IndisputableMonolith/Foundation/DimensionForcing.lean · dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unsteady Metrics and Benchmarking Cultures of AI Model Builders

    cs.AI 2026-05 accept novelty 8.0

    AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

  2. HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 8.0

    HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...

  3. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  4. Crafting Reversible SFT Behaviors in Large Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

  5. ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

    cs.CL 2026-04 unverdicted novelty 8.0

    ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

  6. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  7. RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

    cs.LG 2026-05 unverdicted novelty 7.0

    RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.

  8. Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

    cs.CL 2026-05 unverdicted novelty 7.0

    New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

  9. Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

    cs.CL 2026-05 unverdicted novelty 7.0

    TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4...

  10. Inducing Artificial Uncertainty in Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

  11. Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

    cs.AI 2026-05 unverdicted novelty 7.0

    PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

  12. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

  13. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  14. Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    Semantic consensus on model outputs for public prompts enables federated LLM fine-tuning that matches parameter-aggregation baselines with orders-of-magnitude lower communication.

  15. Task-Aware Calibration: Provably Optimal Decoding in LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.

  16. Skill Description Deception Attack against Task Routing in Internet of Agents

    cs.MA 2026-05 conditional novelty 7.0

    Malicious agents can deceive LLM-based task routers in Internet of Agents systems by generating fake skill descriptions, achieving up to 98% success rate across nine domains.

  17. EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

    cs.AI 2026-05 conditional novelty 7.0

    EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.

  18. EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

    cs.AI 2026-05 unverdicted novelty 7.0

    EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.

  19. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  20. BadDLM: Backdooring Diffusion Language Models with Diverse Targets

    cs.CR 2026-05 unverdicted novelty 7.0

    BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.

  21. DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

    cs.AI 2026-05 unverdicted novelty 7.0

    DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deploy...

  22. Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...

  23. Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

    cs.CR 2026-05 conditional novelty 7.0

    A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.

  24. Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

    cs.LG 2026-05 unverdicted novelty 7.0

    Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

  25. Dataset Watermarking for Closed LLMs with Provable Detection

    cs.LG 2026-05 unverdicted novelty 7.0

    A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tunin...

  26. Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages

    cs.CL 2026-05 accept novelty 7.0

    Nsanku benchmark shows current LLMs achieve only modest zero-shot translation scores on 43 Ghanaian languages, with no model reaching both high average performance and high cross-language consistency.

  27. SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving

    cs.DC 2026-05 unverdicted novelty 7.0

    SplitZip is a new GPU-friendly lossless compressor for KV cache tensors that exploits exponent redundancy to achieve over 600 GB/s compression throughput and up to 1.32x faster transfers in disaggregated LLM serving.

  28. SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials

    cs.AI 2026-04 unverdicted novelty 7.0

    SciEval is a new benchmark of expert-annotated K-12 science lessons for LLM-based automatic evaluation, where zero-shot models perform poorly but fine-tuning yields up to 11% gains.

  29. Analysis and Explainability of LLMs Via Evolutionary Methods

    cs.NE 2026-04 unverdicted novelty 7.0

    Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.

  30. Green Shielding: A User-Centric Approach Towards Trustworthy AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...

  31. Breaking the Secret: Economic Interventions for Combating Collusion in Embodied Multi-Agent Systems

    cs.CR 2026-04 unverdicted novelty 7.0

    A mutagenic incentive mechanism reshapes payoffs in embodied MAS to induce strategic defection from collusion, achieving performance comparable to non-collusion baselines in simulations and real-world tests.

  32. Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 7.0

    Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

  33. How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

    cs.CL 2026-04 unverdicted novelty 7.0

    Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.

  34. Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options

    cs.CL 2026-04 unverdicted novelty 7.0

    Scaling multiple-choice questions to 100 options on a Korean error detection task shows that LLM performance on conventional benchmarks overstates true competence due to shortcut strategies.

  35. AgileLog: A Forkable Shared Log for Agents on Data Streams

    cs.DC 2026-04 unverdicted novelty 7.0

    AgileLog introduces forkable shared logs with cheap forking and isolation to support AI agents on data streams.

  36. Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations

    cs.LG 2026-04 unverdicted novelty 7.0

    Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.

  37. Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning

    cs.CL 2026-04 unverdicted novelty 7.0

    GRIP integrates retrieval into autoregressive generation through self-triggered control tokens for dynamic query planning, outperforming RAG baselines on QA benchmarks with fewer parameters than GPT-4o.

  38. EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

    cs.CV 2026-04 unverdicted novelty 7.0

    EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.

  39. GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.

  40. Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition

    cs.LG 2026-04 unverdicted novelty 7.0

    MATU quantifies uncertainty in LLM multi-agent systems by turning reasoning trajectories into embedding matrices, stacking runs into a tensor, and decomposing it to separate sources of variability.

  41. Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives

    cs.CL 2026-04 unverdicted novelty 7.0

    Social dynamics in LLM collectives cause representative agents to make less accurate decisions as peer pressure increases through larger adversarial groups, more capable peers, longer arguments, and persuasive styles.

  42. FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

    cs.CL 2026-04 unverdicted novelty 7.0

    FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

  43. DeonticBench: A Benchmark for Reasoning over Rules

    cs.CL 2026-04 unverdicted novelty 7.0

    DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.

  44. PolyReal: A Benchmark for Real-World Polymer Science Workflows

    cs.CV 2026-04 unverdicted novelty 7.0

    PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.

  45. Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    A new Latent Imagination Module uses cross-attention to predict latent visual embeddings from text, improving accuracy and calibration of vision-language models on text-only inputs.

  46. A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network

    cs.AR 2026-03 unverdicted novelty 7.0

    SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.

  47. Path-Constrained Mixture-of-Experts

    cs.LG 2026-03 unverdicted novelty 7.0

    PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.

  48. KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

    cs.CL 2026-03 conditional novelty 7.0

    KMMMU benchmark demonstrates that leading multimodal models achieve at most 52.42% accuracy on hard Korean exam questions, highlighting limitations in non-English multimodal understanding.

  49. PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

    cs.CL 2026-03 unverdicted novelty 7.0

    PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.

  50. EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

    cs.LG 2026-03 conditional novelty 7.0

    EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.

  51. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  52. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  53. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  54. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  55. LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

    cs.CL 2024-02 unverdicted novelty 7.0

    LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.

  56. Self-Rewarding Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

  57. WizardLM: Empowering large pre-trained language models to follow complex instructions

    cs.CL 2023-04 conditional novelty 7.0

    WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.

  58. Capabilities of GPT-4 on Medical Challenge Problems

    cs.CL 2023-03 unverdicted novelty 7.0

    GPT-4 exceeds the USMLE passing score by more than 20 points and outperforms both GPT-3.5 and the medically fine-tuned Med-PaLM on the MultiMedQA benchmarks.

  59. LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    LEMON trains an LLM orchestrator with counterfactual-augmented GRPO to produce deployable multi-agent specifications that reach state-of-the-art results on six reasoning and coding benchmarks.

  60. Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection

    cs.AI 2026-05 unverdicted novelty 6.0

    MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.

Reference graph

Works this paper leans on

300 extracted references · 200 canonical work pages · cited by 197 Pith papers · 9 internal anchors

  1. [1]

    M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents (extended abstract). J. Artif. Intell. Res., 47: 0 253--279, 2013

  2. [2]

    Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. Piqa: Reasoning about physical commonsense in natural language, 2019

  3. [3]

    Y. Bisk, A. Holtzman, J. Thomason, J. Andreas, Y. Bengio, J. Chai, M. Lapata, A. Lazaridou, J. May, A. Nisnevich, N. Pinto, and J. Turian. Experience grounds language, 2020

  4. [4]

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amod...

  5. [5]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018

  6. [6]

    Clark, O

    P. Clark, O. Etzioni, D. Khashabi, T. Khot, B. D. Mishra, K. Richardson, A. Sabharwal, C. Schoenick, O. Tafjord, N. Tandon, S. Bhakthavatsalam, D. Groeneveld, M. Guerquin, and M. Schmitz. From 'f' to 'a' on the n.y. regents science exams: An overview of the aristo project. ArXiv, abs/1909.01958, 2019

  7. [7]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019

  8. [8]

    Geirhos, J.-H

    R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann. Shortcut learning in deep neural networks, 2020

  9. [9]

    C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. ICML, 2017

  10. [10]

    Hendrycks, M

    D. Hendrycks, M. Mazeika, and T. Dietterich. Deep anomaly detection with outlier exposure. ICLR, 2019 a

  11. [11]

    Natural adversarial examples

    D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song. Natural adversarial examples. ArXiv, abs/1907.07174, 2019 b

  12. [12]

    Hendrycks, C

    D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt. Aligning ai with shared human values, 2020

  13. [13]

    Huang, R

    L. Huang, R. L. Bras, C. Bhagavatula, and Y. Choi. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning, 2019

  14. [14]

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models, 2020

  15. [15]

    D. Khashabi, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi. UnifiedQA: Crossing format boundaries with a single QA system, 2020

  16. [16]

    T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal. QASC: A dataset for question answering via sentence composition, 2019

  17. [17]

    A. Kumar, P. Liang, and T. Ma. Verified uncertainty calibration, 2019

  18. [18]

    G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy. RACE: Large-scale reading comprehension dataset from examinations, 2017

  19. [19]

    Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942, 2020

  20. [20]

    Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692, 2019

  21. [21]

    T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018

  22. [22]

    Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. V. Dillon, B. Lakshminarayanan, and J. Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. NeurIPS, 2019

  23. [23]

    F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel. Language models as knowledge bases?, 2019

  24. [24]

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019

  25. [25]

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019

  26. [26]

    M. Richardson, C. J. Burges, and E. Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193-203, Seattle, Washington, USA, Oct. 2013. Association for Computational Linguistics

  27. [27]

    A. B. Sai, A. K. Mohankumar, and M. M. Khapra. A survey of evaluation metrics used for NLG systems. 2020

  28. [28]

    A. Turing. Computing machinery and intelligence. 1950

  29. [29]

    A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2018

  30. [30]

    A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems, 2019

  31. [31]

    R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence?, 2019

  32. [32]

    R. Zellers, A. Holtzman, E. Clark, L. Qin, A. Farhadi, and Y. Choi. Evaluating machines by their real-world language use, 2020

  33. [33]

    Realizable and unrealizable specifications of reactive systems

  34. [34]

    Diederik P. Kingma and Jimmy Ba, 2014. Adam:

  35. [35]

    Generative Adverarial Metric [sic]

  36. [36]

    Feature Denoising for Improving Adversarial Robustness

  37. [37]

    Intriguing properties of neural networks

  38. [38]

    ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. ArXiv

  39. [39]

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, 2012

  40. [40]

    Striving for Simplicity: The All Convolutional Net. CoRR

  41. [41]

    Adversarial Logit Pairing

  42. [42]

    Evaluating and Understanding the Robustness of Adversarial Logit Pairing

  43. [43]

    Amodei, Dario and Olah, Chris and Steinhardt, Jacob and Christiano, Paul and Schulman, John and Man. Concrete problems in

  44. [44]

    Google street view: Capturing the world at street level. Computer

  45. [45]

    Nicomachean Ethics

  46. [46]

    General Purpose Intelligence: Arguing the Orthogonality Thesis

  47. [47]

    Army of None: Autonomous Weapons and the Future of War

  48. [48]

    Synthesizing robust adversarial examples

  49. [49]

    Adversarial Transformation Networks: Learning to Generate Adversarial Examples. CoRR

  50. [50]

    The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract). J. Artif. Intell. Res., 47

  51. [51]

    Emily M. Bender and Alexander Koller. Climbing towards. July 2020. doi:10.18653/v1/2020.acl-main.463

  52. [52]

    An Introduction to the Principles of Morals and Legislation

  53. [53]

    Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach

  54. [54]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv

  55. [55]

    Mixmatch: A holistic approach to semi-supervised learning

  56. [56]

    Big but Imperceptible Adversarial Perturbations via Semantic Manipulation. CoRR

  57. [57]

    Support Vector Machines Under Adversarial Label Noise

  58. [58]

    PIQA: Reasoning about Physical Commonsense in Natural Language. 1911.11641

  59. [59]

    PIQA: Reasoning about Physical Commonsense in Natural Language

  60. [60]

    Experience Grounds Language. 2004.10151

  61. [61]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

  62. [62]

    Superintelligence: Paths, Dangers, Strategies

  63. [63]

    Adversarial Filters of Dataset Biases

  64. [64]

    Wieland Brendel and Matthias Bethge, 2018. Approximating

  65. [65]

    Adversarial patch

  66. [66]

    Unrestricted Adversarial Examples. CoRR

  67. [67]

    Language Models are Few-Shot Learners

    Language Models are Few-Shot Learners. 2005.14165

  68. [68]

    Language Models are Few-Shot Learners. ArXiv

  69. [69]

    Learning to rank using gradient descent

  70. [70]

    Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods

  71. [71]

    Adversarial examples are not easily detected: Bypassing ten detection methods. Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security

  72. [72]

    Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy (SP)

  73. [73]

    Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, 2010

  74. [74]

    Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer, 2002

  75. [75]

    Dual Path Networks

  76. [76]

    Deep Reinforcement Learning from Human Preferences

  77. [77]

    Learning multiple layers of features from tiny images

  78. [78]

    Describing Textures in the Wild

  79. [79]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollar, 2014. Microsoft

  80. [80]

    Certified adversarial robustness via randomized smoothing
