Measuring Massive Multitask Language Understanding

Andy Zou, Collin Burns, Dan Hendrycks, Dawn Song, Mantas Mazeika, Steven Basart · 2020 · cs.CY · arXiv 2009.03300

Mixed citation behavior. Most common role is background (45%).

533 Pith papers citing it

Background 45% of classified citations

abstract

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

citation-role summary

background: 31 · dataset: 30 · method: 5 · baseline: 3
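The role counts above appear to be where the "Background 45%" figure comes from. The sketch below shows that arithmetic under one assumption: that "classified citations" means the 69 citations that received a role label. The page does not state this, and the names `role_counts` and `role_shares` are illustrative only.

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 31, "dataset": 30, "method": 5, "baseline": 3}


def role_shares(counts):
    """Return each role's share of all classified citations, in whole percent."""
    total = sum(counts.values())  # assumed denominator: citations with a role label
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
print(sum(role_counts.values()))  # 69 classified citations
print(shares["background"])       # 45
```

Under that assumption, 31 of 69 classified citations gives roughly 45% for the background role, which agrees with the headline figure. Rounded shares need not sum to exactly 100.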

citation-polarity summary

background: 31 · use dataset: 28 · use method: 5 · baseline: 3 · unclear: 2

representative citing papers

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Sumi: Open Uniform Diffusion Language Model from Scratch

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

cs.AI · 2026-05-15 · unverdicted · novelty 8.0 · 2 refs

Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

Will Scaling Improve Social Simulation with LLMs?

cs.CL · 2026-07-02 · conditional · novelty 7.0

Scaling improves LLM social simulation fidelity in most opinion and behavior tasks but not for human cognitive bias calibration or low-resource domains.

Meta-Benchmarks for Financial-Services LLM Evaluation

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 7.0 · 2 refs

ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.

FlipGuard: Defending Large Language Models Against Quantization-Conditioned Backdoor Attacks

cs.CR · 2026-06-27 · unverdicted · novelty 7.0

FlipGuard perturbs LLM weights prior to quantization to neutralize quantization-conditioned backdoor attacks, evaluated via the Defense Effectiveness Ratio on multiple models and quantization schemes.

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

GeMoE: Gating Entropy is All You Need for Uncertainty-aware Adaptive Routing in MoE-based Large Vision-Language Models

cs.CV · 2026-06-24 · unverdicted · novelty 7.0

GeMoE adaptively sets the number of experts per token via gating entropy, retaining 99.5% of static-routing performance while raising average sparsity by 36.5%.

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

ICT framework applies JS divergence to token logits to select critical tokens for selective RLVR updates, claiming 4.58% average pass@4 gains on Qwen2.5 models across seven reasoning benchmarks.

SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design

cs.MA · 2026-06-18 · unverdicted · novelty 7.0

SIGMA introduces skill-incidence graphs to compose agents from reusable skills, yielding higher average performance and robustness than topology-only baselines on reasoning and coding benchmarks.

Comparing Linear Probes with Mahalanobis Cosine Similarity

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.

Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

Block-size curriculum learning trains an 8B diffusion model to achieve competitive reasoning performance on math and code benchmarks by transitioning from small to large training block sizes.

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

cs.LG · 2026-06-16 · unverdicted · novelty 7.0

Presents a distribution-aware scheduling framework for LLM inference that reduces P99 TTLT by 35-50% and TTFT by 34-47% versus SRPT with perfect length knowledge using statistical signals instead of predictions.

citing papers explorer

Showing 50 of 100 citing papers after filters.

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs cs.AI · 2026-05-15 · unverdicted · none · ref 14 · 2 links · internal anchor
Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.
Unsteady Metrics and Benchmarking Cultures of AI Model Builders cs.AI · 2026-05-13 · accept · none · ref 28 · internal anchor
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
Meta-Benchmarks for Financial-Services LLM Evaluation cs.AI · 2026-07-02 · unverdicted · none · ref 10 · internal anchor
A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.
Agentic Abstention: Do Agents Know When to Stop Instead of Act? cs.AI · 2026-06-27 · unverdicted · none · ref 80 · internal anchor
LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.
Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning cs.AI · 2026-06-18 · unverdicted · none · ref 6 · internal anchor
ICT framework applies JS divergence to token logits to select critical tokens for selective RLVR updates, claiming 4.58% average pass@4 gains on Qwen2.5 models across seven reasoning benchmarks.
Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation cs.AI · 2026-06-16 · unverdicted · none · ref 33 · internal anchor
CEO-Bench evaluates LLMs on CEO-level strategic resource reallocation via multi-role agent simulations, showing high structural validity but sharp divergence on strategic calibration across five frontier models on 13 scenarios.
Knowledge Index of Noah's Ark cs.AI · 2026-06-03 · unverdicted · none · ref 12 · internal anchor
Introduces KINA benchmark with 899 items over 261 disciplines, formal (1-1/e) coverage guarantee and bonus-on-bar tournament theorem, plus evaluations of 42 models with top score 53.17%.
BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents cs.AI · 2026-06-02 · unverdicted · none · ref 8 · internal anchor
BigFinanceBench is a workflow-grounded benchmark of 928 financial research tasks with point-weighted rubrics, where the best of ten tested agents scores 58.8% on derivation quality.
ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression cs.AI · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
ConMoE consolidates MoE experts into a smaller prototype pool via deterministic remapping based on contribution and replaceability, matching or beating pruning/merging baselines at 25-50% reduction on three models.
A Policy-Driven Runtime Layer for Agentic LLM Serving cs.AI · 2026-05-26 · unverdicted · none · ref 10 · internal anchor
Introduces a three-tier architecture with an agent runtime layer and four primitives for agent-aware policies in LLM serving, validated on KV caching via CacheSage showing 13-37pp hit-rate gains on five workloads.
JobBench: Aligning Agent Work With Human Will cs.AI · 2026-05-25 · unverdicted · none · ref 15 · internal anchor
JobBench is a new benchmark with 130 occupational tasks where the best of 36 tested AI models achieves only 45.9% success.
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation cs.AI · 2026-05-13 · unverdicted · none · ref 11 · internal anchor
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild cs.AI · 2026-05-10 · conditional · none · ref 42 · 2 links · internal anchor
EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules cs.AI · 2026-05-09 · unverdicted · none · ref 12 · internal anchor
DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deployment issue.
SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials cs.AI · 2026-04-28 · unverdicted · none · ref 13 · internal anchor
SciEval is a new benchmark of expert-annotated K-12 science lessons for LLM-based automatic evaluation, where zero-shot models perform poorly but fine-tuning yields up to 11% gains.
Batch-of-Thought: Cross-Instance Learning for Enhanced LLM Reasoning cs.AI · 2026-01-06 · conditional · none · ref 1 · internal anchor
Batch-of-Thought enables cross-instance learning by jointly processing related queries in batches, yielding higher accuracy and up to 61% lower inference costs on LLM reasoning tasks.
CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning cs.AI · 2025-12-21 · unverdicted · none · ref 8 · internal anchor
CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI cs.AI · 2024-11-07 · unverdicted · none · ref 26 · internal anchor
FrontierMath is a new benchmark of hundreds of original hard math problems that current AI models solve less than 2% of.
PACE: A Proxy for Agentic Capability Evaluation cs.AI · 2026-07-02 · unverdicted · none · ref 8 · internal anchor
PACE builds proxy benchmarks from non-agentic instances via relevance and global selection plus regression to predict agentic scores with MAE under 4%, Spearman correlation above 0.80, and 85% ranking accuracy at under 1% cost.
Mixture of Debaters: Learn to Debate at Architectural Level in Multi-Agent Reasoning cs.AI · 2026-06-28 · unverdicted · none · ref 15 · internal anchor
Mixture of Debaters uses MoE to enable dynamic self-debate inside one model, claiming better accuracy than multi-agent systems at 3.7x lower latency and 87% fewer tokens on multimodal benchmarks.
ReM-MoA: Reasoning Memory Sustains Mixture-of-Agents Scaling cs.AI · 2026-06-23 · unverdicted · none · ref 21 · internal anchor
ReM-MoA uses persistent ranked memory of reasoning traces and curated diversified routing to sustain depth scaling in Mixture-of-Agents, outperforming prior variants on five benchmarks with widening gains at greater depth.
DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling cs.AI · 2026-06-05 · unverdicted · none · ref 9 · internal anchor
DyCon dynamically controls reasoning depth in LRMs by modeling evolving difficulty from step-level embeddings, reducing redundant steps across multiple benchmarks.
Benchmark Everything Everywhere All at Once cs.AI · 2026-06-04 · unverdicted · none · ref 16 · internal anchor
Benchmark Agent is an autonomous agentic system that constructs benchmarks for LLMs and MLLMs via query analysis, subtask design, annotation and quality control, yielding 15 benchmarks with minimal human input.
LLM Self-Recognition: Steering and Retrieving Activation Signatures cs.AI · 2026-06-04 · unverdicted · none · ref 4 · internal anchor
Steering LLM residual streams with random sparse vectors creates detectable self-recognition fingerprints that enable over 98% accurate attribution of generated text to specific models without degrading output quality.
Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack cs.AI · 2026-06-04 · unverdicted · none · ref 4 · internal anchor
Posterior Attack exploits LLMs' safety awareness to bypass guardrails, with models having superior safety judgment being more susceptible, formalized as the Safety Paradox where monotonic safety improvements amplify vulnerability.
The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection cs.AI · 2026-06-02 · conditional · none · ref 12 · 2 links · internal anchor
Empirical evaluation across 25 LLMs shows contamination detection methods achieve correct outcomes in only 201 of 335 cases, exposing failure modes from distribution shift and benchmark scale.
ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents cs.AI · 2026-06-01 · unverdicted · none · ref 1 · internal anchor
ClinEnv is a new multi-stage EHR benchmark where LLMs acting as physicians reach only 0.31 decision F1, with outcome quality decoupled from information-gathering process quality.
SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment cs.AI · 2026-06-01 · unverdicted · none · ref 59 · internal anchor
SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.
RoleCDE: Benchmarking and Mitigating Role-Alignment Trade-offs in Role-Playing Agents cs.AI · 2026-06-01 · unverdicted · none · ref 32 · internal anchor
New benchmark RoleCDE reveals LLMs exhibit role value decoupling under conflicts and demonstrates mitigation via targeted fine-tuning.
Mitigating Hallucinations in Large Language Models Via Decoder Layer Skipping cs.AI · 2026-05-30 · unverdicted · none · ref 20 · internal anchor
DeLask dynamically skips hallucination-prone decoder layers in LLMs by measuring gradient driftance via cosine similarity and partially aggregating states instead of full skipping.
Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents cs.AI · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
Harness-updating capability is flat across base model capabilities while harness-benefit is non-monotonic, peaking at mid-tier models in self-evolving LLM agents.
From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation cs.AI · 2026-05-27 · unverdicted · none · ref 1 · internal anchor
The paper proposes CODE for causal knowledge editing in LLMs via on-policy self-distillation, reducing self-refutation to 1.8% and achieving up to 83.5% multi-hop accuracy.
Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification cs.AI · 2026-05-27 · unverdicted · none · ref 2 · internal anchor
STAR defense mitigates cooperative attacks in LLM-based multi-agent systems, improving task success rate by 36.76% on average while cooperative attacks cause a 5.34% relative drop compared to independent attacks.
Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration cs.AI · 2026-05-26 · unverdicted · none · ref 1 · internal anchor
Protocol choices in token-probability measurement and conditioning context make verbalized vs. token confidence comparisons sensitive, with Instruct models near parity under default generated-answer bare-context settings.
Open-World Evaluations for Measuring Frontier AI Capabilities cs.AI · 2026-05-19 · conditional · none · ref 30 · internal anchor
Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.
LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning cs.AI · 2026-05-14 · unverdicted · none · ref 30 · internal anchor
LEMON trains an LLM orchestrator with counterfactual-augmented GRPO to produce deployable multi-agent specifications that reach state-of-the-art results on six reasoning and coding benchmarks.
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection cs.AI · 2026-05-13 · unverdicted · none · ref 14 · internal anchor
MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
LIDSA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management cs.AI · 2026-05-12 · unverdicted · none · ref 37 · 2 links · internal anchor
LIDSA applies LLMs as primary decision-makers for signal-free intersection management, achieving up to 89% lower control delay and 93% lower waiting time versus fixed-cycle and other baselines in simulation.
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 72 · internal anchor
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces cs.AI · 2026-05-09 · unverdicted · none · ref 120 · internal anchor
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
Mental Health AI Safety Claims Must Preserve Temporal Evidence cs.AI · 2026-05-09 · unverdicted · none · ref 1 · 2 links · internal anchor
Mental health AI safety evaluations must preserve temporal evidence from interaction sequences rather than isolated responses, as current protocols create non-identifiable safety properties according to the introduced Temporal Safety Non-Identifiability concept and SCOPE-MH standard.
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable cs.AI · 2026-05-08 · unverdicted · none · ref 15 · internal anchor
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models cs.AI · 2026-05-07 · unverdicted · none · ref 6 · internal anchor
Dynamic Boundary Evaluation locates each LLM's performance boundary at ~50% pass probability via a calibrated item bank and Skill-Guided Boundary Search algorithm to enable unified, adaptive evaluations across safety, capability, and truthfulness.
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering cs.AI · 2026-05-07 · unverdicted · none · ref 43 · internal anchor
Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation cs.AI · 2026-05-06 · unverdicted · none · ref 24 · internal anchor
A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 benchmarks.
Truth or Tribe: How In-group Favoritism Prioritize Facts in Persona Agents cs.AI · 2026-05-02 · unverdicted · none · ref 4 · internal anchor
Persona agents display strong in-group favoritism by accepting false facts from similar peers more than dissimilar ones, persisting in defeasible reasoning and worsening with complexity, with three mitigation strategies evaluated.
Reasoning Structure Matters for Safety Alignment of Reasoning Models cs.AI · 2026-04-21 · unverdicted · none · ref 54 · internal anchor
Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System cs.AI · 2026-04-20 · unverdicted · none · ref 2 · internal anchor
ARES discovers dual vulnerabilities in LLMs and reward models via adaptive adversarial prompt composition and repairs them through sequential fine-tuning of the reward model followed by policy optimization.
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence cs.AI · 2026-04-20 · unverdicted · none · ref 35 · internal anchor
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum cs.AI · 2026-04-20 · unverdicted · none · ref 3 · internal anchor
AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specific scheduling.
