super hub Mixed citations

Humanity's Last Exam

Adam Khoja, Alice Gatti, Dan Hendrycks, Jason Hausenloy, Long Phan, Mantas Mazeika + 2 more · 2025 · cs.LG · DOI 10.1038/s41586-025-09962-4 · arXiv 2501.14249

Mixed citation behavior. Most common role is background (42%).

125 Pith papers citing it

8 external citations · Pith

Background 42% of classified citations

open full Pith review browse 125 citing papers more from Adam Khoja arXiv PDF

abstract

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.

hub tools

JSON dossier citing papers JSON publisher DOI arXiv source

citation-role summary

background 13 dataset 12 method 5 other 1

citation-polarity summary

background 13 use dataset 11 use method 5 support 1 unclear 1

claims ledger

abstract Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, hu

authors

Adam Khoja Alice Gatti Dan Hendrycks Jason Hausenloy Long Phan Mantas Mazeika Nathaniel Li Oliver Zhang Richard Ren Ryan Kim

co-cited works

representative citing papers

PCB-QA: Evaluating LLMs over the First Printed Circuit Board Design Question-Answer Dataset

cs.AR · 2026-06-10 · unverdicted · novelty 8.0

PCB-QA is the first QA benchmark for LLMs on printed circuit board designs, with Gemini 3 Flash Preview reaching 93% accuracy on a JSON textual representation.

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

cs.AI · 2026-06-03 · unverdicted · novelty 8.0

The Meta-Agent Challenge shows frontier AI models rarely match human-engineered agent baselines when tasked with autonomous development, with proprietary models succeeding most often and some exhibiting cheating under pressure.

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

cs.CL · 2026-05-24 · unverdicted · novelty 8.0

Introduces BonaFide benchmark of 3,066 ground-truth labeled CoTs showing most faithfulness metrics perform near chance with biases and poor scaling to longer chains.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

q-fin.CP · 2026-04-03 · conditional · novelty 8.0

Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

Ko-WideSearch is a new Korean breadth-search benchmark spanning 16 categories and three difficulty tiers that evaluates web agents on full set membership plus per-item attributes, showing consistent gaps between set recovery and row completion.

Can AI Agents Synthesize Scientific Conclusions?

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

ATLAS traces RLVR data to 20 atomic sources, most datasets are variants, and DAPO++ curated with SCA improves RLVR performance while Q predicts training effectiveness.

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

The paper introduces a multi-turn interactive benchmark using 474 executable games to evaluate LLMs on evidence acquisition, belief updating, contextual robustness, and metacognitive adaptation, revealing large performance gaps and sensitivity to perturbations.

Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

Prover-verifier deliberation yields a high-confidence subset of LLM answers with ~30pp higher precision than the complement on GPQA Diamond by using defender-challenger dialogues.

IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

cs.AI · 2026-05-21 · conditional · novelty 7.0

IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.

Evaluating Cognitive Age Alignment in Interactive AI Agents

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.

TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across math, science, code, and knowledge tasks.

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.

AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.

MaD Physics: Evaluating information seeking under constraints in physical environments

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

MaD Physics is a new benchmark for evaluating AI agents on constrained information-seeking, model inference, and prediction in three physical environments with altered laws to avoid knowledge contamination.

LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

TESSERA combines LLMs as local policy and evaluator with MCTS on knowledge graphs to compose mechanistic drug-disease explanations.

DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deployment issue.

AcademiClaw: When Students Set Challenges for AI Agents

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.

citing papers explorer

Showing 50 of 105 citing papers after filters.

PCB-QA: Evaluating LLMs over the First Printed Circuit Board Design Question-Answer Dataset cs.AR · 2026-06-10 · unverdicted · none · ref 33 · internal anchor
PCB-QA is the first QA benchmark for LLMs on printed circuit board designs, with Gemini 3 Flash Preview reaching 93% accuracy on a JSON textual representation.
The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development? cs.AI · 2026-06-03 · unverdicted · none · ref 24 · internal anchor
The Meta-Agent Challenge shows frontier AI models rarely match human-engineered agent baselines when tasked with autonomous development, with proprietary models succeeding most often and some exhibiting cheating under pressure.
Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth cs.CL · 2026-05-24 · unverdicted · none · ref 53 · internal anchor
Introduces BonaFide benchmark of 3,066 ground-truth labeled CoTs showing most faithfulness metrics perform near chance with biases and poor scaling to longer chains.
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs cs.CL · 2026-05-09 · unverdicted · none · ref 24 · 2 links · internal anchor
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing cs.CV · 2026-04-17 · unverdicted · none · ref 43 · internal anchor
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
Evaluating Large Language Models in Scientific Discovery cs.AI · 2025-12-17 · unverdicted · none · ref 44 · internal anchor
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark cs.AI · 2025-09-30 · unverdicted · none · ref 51 · internal anchor
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents cs.CL · 2026-06-25 · unverdicted · none · ref 3 · internal anchor
Ko-WideSearch is a new Korean breadth-search benchmark spanning 16 categories and three difficulty tiers that evaluates web agents on full set membership plus per-item attributes, showing consistent gaps between set recovery and row completion.
Can AI Agents Synthesize Scientific Conclusions? cs.AI · 2026-06-09 · unverdicted · none · ref 101 · internal anchor
A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know? cs.AI · 2026-05-27 · unverdicted · none · ref 44 · internal anchor
LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.
RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data cs.LG · 2026-05-26 · unverdicted · none · ref 30 · internal anchor
ATLAS traces RLVR data to 20 atomic sources, most datasets are variants, and DAPO++ curated with SCA improves RLVR performance while Q predicts training effectiveness.
Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games cs.AI · 2026-05-26 · unverdicted · none · ref 6 · internal anchor
The paper introduces a multi-turn interactive benchmark using 474 executable games to evaluate LLMs on evidence acquisition, belief updating, contextual robustness, and metacognitive adaptation, revealing large performance gaps and sensitivity to perturbations.
Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction cs.AI · 2026-05-24 · unverdicted · none · ref 1 · internal anchor
Prover-verifier deliberation yields a high-confidence subset of LLM answers with ~30pp higher precision than the complement on GPQA Diamond by using defender-challenger dialogues.
Evaluating Cognitive Age Alignment in Interactive AI Agents cs.AI · 2026-05-18 · unverdicted · none · ref 21 · internal anchor
The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.
TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints cs.AI · 2026-05-13 · unverdicted · none · ref 18 · internal anchor
TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across math, science, code, and knowledge tasks.
Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics cs.AI · 2026-05-13 · unverdicted · none · ref 6 · internal anchor
Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents cs.LG · 2026-05-11 · unverdicted · none · ref 56 · internal anchor
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
MaD Physics: Evaluating information seeking under constraints in physical environments cs.AI · 2026-05-11 · unverdicted · none · ref 5 · internal anchor
MaD Physics is a new benchmark for evaluating AI agents on constrained information-seeking, model inference, and prediction in three physical environments with altered laws to avoid knowledge contamination.
LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs cs.AI · 2026-05-10 · unverdicted · none · ref 15 · internal anchor
TESSERA combines LLMs as local policy and evaluator with MCTS on knowledge graphs to compose mechanistic drug-disease explanations.
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules cs.AI · 2026-05-09 · unverdicted · none · ref 29 · internal anchor
DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deployment issue.
AcademiClaw: When Students Set Challenges for AI Agents cs.AI · 2026-05-04 · unverdicted · none · ref 15 · internal anchor
AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use cs.LG · 2026-05-03 · unverdicted · none · ref 42 · internal anchor
The Reward Hacking Benchmark shows RL post-training raises exploit rates in tool-using LLM agents from 0.6% to 13.9%, with environmental hardening cutting exploits by 87.7% relative without lowering task success.
Super Apriel: One Checkpoint, Many Speeds cs.LG · 2026-04-21 · unverdicted · none · ref 42 · internal anchor
A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.
Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints cs.LG · 2026-04-17 · unverdicted · none · ref 2 · 2 links · internal anchor
Stargazer benchmarks AI agents on physics-constrained model fitting for astrophysical data, revealing that agents achieve statistical fits but often fail to recover correct physical parameters.
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models cs.CV · 2026-04-09 · unverdicted · none · ref 6 · internal anchor
PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces cs.CL · 2026-04-05 · unverdicted · none · ref 39 · internal anchor
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
The limits of bio-molecular modeling with large language models : a cross-scale evaluation cs.LG · 2026-04-03 · unverdicted · none · ref 25 · internal anchor
LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
Agentic Search in the Wild: Intents and Trajectory Dynamics from 14M+ Real Search Requests cs.IR · 2026-01-24 · unverdicted · none · ref 40 · internal anchor
Large-scale log study of 14M+ agentic searches finds short sessions, intent-specific repetition patterns, and that 54% of new query terms trace to prior retrieved evidence.
MemEvolve: Meta-Evolution of Agent Memory Systems cs.CL · 2025-12-21 · unverdicted · none · ref 13 · internal anchor
MemEvolve jointly evolves agent experiential knowledge and memory architectures via a modular codebase, delivering up to 17% gains on agent benchmarks with cross-task and cross-model generalization.
Theoria: Rewrite-Acceptability Verification over Informal Reasoning States cs.AI · 2026-07-01 · unverdicted · none · ref 15 · internal anchor
Theoria rewrites solutions into auditable typed state transitions with justifications, certifying 105 of 185 HLE problems at 91.4% precision and outperforming holistic judges on adversarial poisoned proofs by catching hidden premises.
ECHO: Prune to act, trace to learn with selective turn memory in agentic RL cs.LG · 2026-06-30 · unverdicted · none · ref 19 · internal anchor
ECHO is a selective turn-memory framework for agentic RL that compresses turns into indexed records, selects them for bounded contexts, and uses source indices to assign outcome credit to supporting evidence, reaching 43.4% accuracy on BrowseComp-Plus versus 28.9% for GRPO and 36.1% for SUPO.
ACE: Pluggable Adaptive Context Elasticizer across Agents cs.AI · 2026-06-30 · unverdicted · none · ref 23 · internal anchor
ACE is a pluggable module that elastically orchestrates historical agent steps as raw, abstract, or dropped to maintain compact yet recoverable context for LLM agents handling long trajectories.
How Inference Compute Shapes Frontier LLM Evaluation cs.AI · 2026-06-16 · unverdicted · none · ref 9 · internal anchor
Frontier LLM scores on challenging benchmarks rise substantially with increased inference compute, showing that fixed-budget evaluations increasingly understate capability.
Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation cs.CR · 2026-06-03 · unverdicted · none · ref 25 · internal anchor
Deep research agents exhibit widespread search-time contamination on six public benchmarks, with three defined leakage types inflating performance by up to 4%.
Quantifying Faithful Confidence Expression in Large Reasoning Models cs.CL · 2026-06-02 · unverdicted · none · ref 3 · internal anchor
A new framework quantifies faithful confidence expression in large reasoning models by comparing linguistic decisiveness to token probabilities, hidden states, and response consistency, revealing it as a persistent challenge.
Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses cs.AI · 2026-06-01 · unverdicted · none · ref 62 · internal anchor
Harness-1 uses a state-externalizing harness for RL-trained search agents and reports 0.730 average curated recall, outperforming the next open subagent by 11.4 points.
EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA cs.AI · 2026-05-27 · unverdicted · none · ref 7 · internal anchor
EAPO uses policy entropy ratio to adaptively weight positive samples in RLVR for open-ended QA, claiming better diversity and stability than fixed-weight baselines on medical datasets.
Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy cs.AI · 2026-05-25 · unverdicted · none · ref 37 · internal anchor
CIE-Scorer detects unfaithful CoT by tracing compact sentence-level circuits, building internal-external reasoning graphs, and scoring their discrepancy with Fused Gromov-Wasserstein distance, reporting SOTA results on FaithCoT-Bench with reduced circuit cost.
AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning cs.AI · 2026-05-23 · unverdicted · none · ref 34 · internal anchor
AgentFugue introduces a plug-in shared reasoning hub trained with SFT and RL that enables peer agents to share intermediate reasoning, yielding gains on long-horizon tasks over strong baselines.
SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent cs.AI · 2026-05-23 · unverdicted · none · ref 24 · internal anchor
SAM is a standalone memory framework for long-horizon LLM agents that creates state-adaptive cues from interactions, preserves raw trajectories for intent-driven recall, and optimizes the module via expert supervision and RL, outperforming baselines on BrowseComp and related benchmarks.
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning cs.AI · 2026-05-21 · unverdicted · none · ref 72 · internal anchor
SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
Mem-$\pi$: Adaptive Memory through Learning When and What to Generate cs.CL · 2026-05-20 · unverdicted · none · ref 31 · internal anchor
Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.
Forecasting Downstream Performance of LLMs With Proxy Metrics cs.CL · 2026-05-18 · unverdicted · none · ref 75 · internal anchor
Proxy metrics from next-token distributions over expert solutions outperform loss and compute baselines for ranking LLMs, selecting pretraining data, and extrapolating performance across compute scales.
Argus: Evidence Assembly for Scalable Deep Research Agents cs.CL · 2026-05-15 · unverdicted · none · ref 40 · 2 links · internal anchor
Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers.
Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks cs.LG · 2026-05-11 · unverdicted · none · ref 35 · internal anchor
Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
Instructions Shape Production of Language, not Processing cs.CL · 2026-05-11 · unverdicted · none · ref 188 · 2 links · internal anchor
Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
The Generalized Turing Test: A Foundation for Comparing Intelligence cs.AI · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems cs.AI · 2026-05-09 · unverdicted · none · ref 22 · internal anchor
EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and outperforms static baselines on GAIA, HLE, and DeepResearcher.
A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering cs.CL · 2026-05-08 · unverdicted · none · ref 18 · internal anchor
Sem-ECE is an asymptotically unbiased calibration error estimator for open-ended QA that uses semantic sampling of answers to derive confidence from class frequencies, with two variants that diverge on hard questions.
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 42 · 2 links · internal anchor
MELT decouples reasoning depth from memory in looped language models by sharing a single gated KV cache per layer and training it via chunk-wise distillation from Ouro starting models.

Humanity's Last Exam

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer