super hub Mixed citations

Humanity's Last Exam

Adam Khoja, Alice Gatti, Dan Hendrycks, Jason Hausenloy, Long Phan, Mantas Mazeika + 2 more · 2025 · cs.LG · DOI 10.1038/s41586-025-09962-4 · arXiv 2501.14249

Mixed citation behavior. Most common role is background (42%).

140 Pith papers citing it

8 external citations · Pith

Background 42% of classified citations

open full Pith review browse 140 citing papers more from Adam Khoja arXiv PDF

abstract

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.

hub tools

JSON dossier citing papers JSON publisher DOI arXiv source

citation-role summary

background 14 dataset 13 method 5 other 1

citation-polarity summary

background 14 use dataset 12 use method 5 support 1 unclear 1

claims ledger

abstract Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, hu

authors

Adam Khoja Alice Gatti Dan Hendrycks Jason Hausenloy Long Phan Mantas Mazeika Nathaniel Li Oliver Zhang Richard Ren Ryan Kim

co-cited works

representative citing papers

PCB-QA: Evaluating LLMs over the First Printed Circuit Board Design Question-Answer Dataset

cs.AR · 2026-06-10 · unverdicted · novelty 8.0

PCB-QA is the first QA benchmark for LLMs on printed circuit board designs, with Gemini 3 Flash Preview reaching 93% accuracy on a JSON textual representation.

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

cs.AI · 2026-06-03 · unverdicted · novelty 8.0

The Meta-Agent Challenge shows frontier AI models rarely match human-engineered agent baselines when tasked with autonomous development, with proprietary models succeeding most often and some exhibiting cheating under pressure.

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

cs.CL · 2026-05-24 · unverdicted · novelty 8.0

Introduces BonaFide benchmark of 3,066 ground-truth labeled CoTs showing most faithfulness metrics perform near chance with biases and poor scaling to longer chains.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

q-fin.CP · 2026-04-03 · conditional · novelty 8.0

Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

Meta-Benchmarks for Financial-Services LLM Evaluation

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.

Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

Ko-WideSearch is a new Korean breadth-search benchmark spanning 16 categories and three difficulty tiers that evaluates web agents on full set membership plus per-item attributes, showing consistent gaps between set recovery and row completion.

Mat-Pref: Verifiable-Reward Training Improves Compositional Reasoning in Inorganic Materials

cs.LG · 2026-06-20 · unverdicted · novelty 7.0

Mat-Pref benchmark shows GRPO after SFT lets Qwen3-8B reach 65-72% on compositional materials reasoning tasks, exceeding zero-shot 235B models on held-out structure families and cross-property transfer.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.

Representing Time Series as Structured Programs for LLM Reasoning

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

T2SP converts time series into structured programs for trends, periods, and events, enabling off-the-shelf LLMs to perform better on editing, captioning, and QA tasks than raw string inputs.

Can AI Agents Synthesize Scientific Conclusions?

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

cs.LG · 2026-05-28 · conditional · novelty 7.0 · 2 refs

ResearchClawBench supplies 40 grounded tasks and expert rubrics to measure autonomous research agents, with the strongest systems scoring only 21.5 and 20.7 on average.

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

ATLAS traces RLVR data to 20 atomic sources, most datasets are variants, and DAPO++ curated with SCA improves RLVR performance while Q predicts training effectiveness.

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

The paper introduces a multi-turn interactive benchmark using 474 executable games to evaluate LLMs on evidence acquisition, belief updating, contextual robustness, and metacognitive adaptation, revealing large performance gaps and sensitivity to perturbations.

Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

Prover-verifier deliberation yields a high-confidence subset of LLM answers with ~30pp higher precision than the complement on GPQA Diamond by using defender-challenger dialogues.

IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

cs.AI · 2026-05-21 · conditional · novelty 7.0

IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.

Evaluating Cognitive Age Alignment in Interactive AI Agents

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.

TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across math, science, code, and knowledge tasks.

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.

citing papers explorer

Showing 50 of 140 citing papers.

PCB-QA: Evaluating LLMs over the First Printed Circuit Board Design Question-Answer Dataset cs.AR · 2026-06-10 · unverdicted · none · ref 33 · internal anchor
PCB-QA is the first QA benchmark for LLMs on printed circuit board designs, with Gemini 3 Flash Preview reaching 93% accuracy on a JSON textual representation.
The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development? cs.AI · 2026-06-03 · unverdicted · none · ref 24 · internal anchor
The Meta-Agent Challenge shows frontier AI models rarely match human-engineered agent baselines when tasked with autonomous development, with proprietary models succeeding most often and some exhibiting cheating under pressure.
Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth cs.CL · 2026-05-24 · unverdicted · none · ref 53 · internal anchor
Introduces BonaFide benchmark of 3,066 ground-truth labeled CoTs showing most faithfulness metrics perform near chance with biases and poor scaling to longer chains.
Unsteady Metrics and Benchmarking Cultures of AI Model Builders cs.AI · 2026-05-13 · accept · none · ref 49 · internal anchor
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs cs.CL · 2026-05-09 · unverdicted · none · ref 24 · 2 links · internal anchor
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing cs.CV · 2026-04-17 · unverdicted · none · ref 43 · internal anchor
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data q-fin.CP · 2026-04-03 · conditional · none · ref 18 · internal anchor
Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
Evaluating Large Language Models in Scientific Discovery cs.AI · 2025-12-17 · unverdicted · none · ref 44 · internal anchor
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark cs.AI · 2025-09-30 · unverdicted · none · ref 51 · internal anchor
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
Meta-Benchmarks for Financial-Services LLM Evaluation cs.AI · 2026-07-02 · unverdicted · none · ref 19 · internal anchor
A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.
Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents cs.CL · 2026-06-25 · unverdicted · none · ref 3 · internal anchor
Ko-WideSearch is a new Korean breadth-search benchmark spanning 16 categories and three difficulty tiers that evaluates web agents on full set membership plus per-item attributes, showing consistent gaps between set recovery and row completion.
Mat-Pref: Verifiable-Reward Training Improves Compositional Reasoning in Inorganic Materials cs.LG · 2026-06-20 · unverdicted · none · ref 7 · internal anchor
Mat-Pref benchmark shows GRPO after SFT lets Qwen3-8B reach 65-72% on compositional materials reasoning tasks, exceeding zero-shot 235B models on held-out structure families and cross-property transfer.
Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients cs.CL · 2026-06-16 · unverdicted · none · ref 145 · internal anchor
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
Representing Time Series as Structured Programs for LLM Reasoning cs.LG · 2026-06-10 · unverdicted · none · ref 58 · internal anchor
T2SP converts time series into structured programs for trends, periods, and events, enabling off-the-shelf LLMs to perform better on editing, captioning, and QA tasks than raw string inputs.
Can AI Agents Synthesize Scientific Conclusions? cs.AI · 2026-06-09 · unverdicted · none · ref 101 · internal anchor
A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research cs.LG · 2026-05-28 · conditional · none · ref 2 · 2 links · internal anchor
ResearchClawBench supplies 40 grounded tasks and expert rubrics to measure autonomous research agents, with the strongest systems scoring only 21.5 and 20.7 on average.
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know? cs.AI · 2026-05-27 · unverdicted · none · ref 44 · internal anchor
LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.
RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data cs.LG · 2026-05-26 · unverdicted · none · ref 30 · internal anchor
ATLAS traces RLVR data to 20 atomic sources, most datasets are variants, and DAPO++ curated with SCA improves RLVR performance while Q predicts training effectiveness.
Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games cs.AI · 2026-05-26 · unverdicted · none · ref 6 · internal anchor
The paper introduces a multi-turn interactive benchmark using 474 executable games to evaluate LLMs on evidence acquisition, belief updating, contextual robustness, and metacognitive adaptation, revealing large performance gaps and sensitivity to perturbations.
Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction cs.AI · 2026-05-24 · unverdicted · none · ref 1 · internal anchor
Prover-verifier deliberation yields a high-confidence subset of LLM answers with ~30pp higher precision than the complement on GPQA Diamond by using defender-challenger dialogues.
IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents cs.AI · 2026-05-21 · conditional · none · ref 4 · internal anchor
IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.
Evaluating Cognitive Age Alignment in Interactive AI Agents cs.AI · 2026-05-18 · unverdicted · none · ref 21 · internal anchor
The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.
TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints cs.AI · 2026-05-13 · unverdicted · none · ref 18 · internal anchor
TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across math, science, code, and knowledge tasks.
Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics cs.AI · 2026-05-13 · unverdicted · none · ref 6 · internal anchor
Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents cs.LG · 2026-05-11 · unverdicted · none · ref 56 · internal anchor
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
MaD Physics: Evaluating information seeking under constraints in physical environments cs.AI · 2026-05-11 · unverdicted · none · ref 5 · internal anchor
MaD Physics is a new benchmark for evaluating AI agents on constrained information-seeking, model inference, and prediction in three physical environments with altered laws to avoid knowledge contamination.
LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs cs.AI · 2026-05-10 · unverdicted · none · ref 15 · internal anchor
TESSERA combines LLMs as local policy and evaluator with MCTS on knowledge graphs to compose mechanistic drug-disease explanations.
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules cs.AI · 2026-05-09 · unverdicted · none · ref 29 · internal anchor
DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deployment issue.
AcademiClaw: When Students Set Challenges for AI Agents cs.AI · 2026-05-04 · unverdicted · none · ref 15 · internal anchor
AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use cs.LG · 2026-05-03 · unverdicted · none · ref 42 · internal anchor
The Reward Hacking Benchmark shows RL post-training raises exploit rates in tool-using LLM agents from 0.6% to 13.9%, with environmental hardening cutting exploits by 87.7% relative without lowering task success.
Super Apriel: One Checkpoint, Many Speeds cs.LG · 2026-04-21 · unverdicted · none · ref 42 · internal anchor
A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.
Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints cs.LG · 2026-04-17 · unverdicted · none · ref 2 · 2 links · internal anchor
Stargazer benchmarks AI agents on physics-constrained model fitting for astrophysical data, revealing that agents achieve statistical fits but often fail to recover correct physical parameters.
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models cs.CV · 2026-04-09 · unverdicted · none · ref 6 · internal anchor
PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces cs.CL · 2026-04-05 · unverdicted · none · ref 39 · internal anchor
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
The limits of bio-molecular modeling with large language models : a cross-scale evaluation cs.LG · 2026-04-03 · unverdicted · none · ref 25 · internal anchor
LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
Agentic Search in the Wild: Intents and Trajectory Dynamics from 14M+ Real Search Requests cs.IR · 2026-01-24 · unverdicted · none · ref 40 · internal anchor
Large-scale log study of 14M+ agentic searches finds short sessions, intent-specific repetition patterns, and that 54% of new query terms trace to prior retrieved evidence.
MemEvolve: Meta-Evolution of Agent Memory Systems cs.CL · 2025-12-21 · unverdicted · none · ref 13 · internal anchor
MemEvolve jointly evolves agent experiential knowledge and memory architectures via a modular codebase, delivering up to 17% gains on agent benchmarks with cross-task and cross-model generalization.
Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models cs.CL · 2025-07-05 · conditional · none · ref 17 · internal anchor
Evaluations of 53 LLMs on 14 basic math tasks show reasoning models use ~18x more tokens with sometimes lower accuracy, non-monotonic gains from extended budgets, and sharp performance drops under token constraints.
EduArt: An educational-level benchmark for evaluating art history knowledge in large language models cs.CL · 2026-07-02 · unverdicted · none · ref 25 · internal anchor
EduArt is a new benchmark of 871 educational questions that reveals multimodal LLMs perform near ceiling on multiple-choice art history items but drop sharply on open completion and error identification tasks.
Subliminal Clocks: Latent Time Modelling in Diffusion Language Models cs.AI · 2026-07-02 · unverdicted · none · ref 58 · internal anchor
DLMs encode a decodable latent timestep signal in residual activations that can be steered to predictably change model confidence and entropy.
Theoria: Rewrite-Acceptability Verification over Informal Reasoning States cs.AI · 2026-07-01 · unverdicted · none · ref 15 · internal anchor
Theoria rewrites solutions into auditable typed state transitions with justifications, certifying 105 of 185 HLE problems at 91.4% precision and outperforming holistic judges on adversarial poisoned proofs by catching hidden premises.
ECHO: Prune to act, trace to learn with selective turn memory in agentic RL cs.LG · 2026-06-30 · unverdicted · none · ref 19 · internal anchor
ECHO is a selective turn-memory framework for agentic RL that compresses turns into indexed records, selects them for bounded contexts, and uses source indices to assign outcome credit to supporting evidence, reaching 43.4% accuracy on BrowseComp-Plus versus 28.9% for GRPO and 36.1% for SUPO.
ACE: Pluggable Adaptive Context Elasticizer across Agents cs.AI · 2026-06-30 · unverdicted · none · ref 23 · internal anchor
ACE is a pluggable module that elastically orchestrates historical agent steps as raw, abstract, or dropped to maintain compact yet recoverable context for LLM agents handling long trajectories.
MINCE: Shrinking LLM Evaluation Datasets via Few-Model Monte Carlo Calibration cs.AI · 2026-06-22 · unverdicted · none · ref 41 · internal anchor
MINCE shrinks IFEVAL by 54%, MMLU by 89%, and GSM8K by 70% via few-model Monte Carlo calibration while keeping maximum drift at or below 2.62 percentage points.
How Inference Compute Shapes Frontier LLM Evaluation cs.AI · 2026-06-16 · unverdicted · none · ref 9 · 2 links · internal anchor
Frontier LLM scores on challenging benchmarks rise substantially with increased inference compute, showing that fixed-budget evaluations increasingly understate capability.
SEAGym: An Evaluation Environment for Self-Evolving LLM Agents cs.AI · 2026-06-16 · unverdicted · none · ref 11 · internal anchor
SEAGym turns existing benchmarks into multi-view evaluation sources for measuring reusable improvements in LLM agent harnesses, revealing complementary signals missed by single-curve or isolated-task tests.
ICBCBench: An Industry Consortium Benchmark for Financial Deep Research cs.CE · 2026-06-16 · unverdicted · none · ref 31 · internal anchor
ICBCBench is a new consortium-built benchmark that jointly measures retrieval-reasoning accuracy and end-to-end report quality for deep research agents in finance.
MiniMax Sparse Attention cs.AI · 2026-06-11 · unverdicted · none · ref 77 · internal anchor
MiniMax Sparse Attention is a GQA-based block-sparse attention mechanism that selects top-k blocks independently per group and delivers 28.4x per-token compute reduction at 1M context with on-par performance plus 14.2x prefill and 7.6x decode speedups via co-designed GPU kernel.
The Illusion of Multi-Agent Advantage cs.AI · 2026-06-11 · unverdicted · none · ref 28 · internal anchor
Automatically generated multi-agent systems underperform CoT-SC on benchmarks and a new diagnostic dataset, exposing architectural bloat that fails to deliver functional utility.
APPO: Agentic Procedural Policy Optimization cs.LG · 2026-06-10 · unverdicted · none · ref 53 · internal anchor
APPO refines branching and credit assignment in agentic RL via a Branching Score and procedure-level scaling, improving baselines by nearly 4 points on 13 benchmarks.

Humanity's Last Exam

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer