hub

Holistic Evaluation of Language Models

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D · 2022 · cs.CL · arXiv 2211.09110

59 Pith papers cite this work. Polarity classification is still indexing.

59 Pith papers citing it

open full Pith review browse 59 citing papers arXiv PDF

abstract

Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall to the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3

citation-polarity summary

background 2 unclear 1

claims ledger

abstract Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness).

co-cited works

representative citing papers

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

A Benchmark for Strategic Auditee Gaming Under Continuous Compliance Monitoring

cs.CY · 2026-05-07 · accept · novelty 8.0

Continuous auditing creates an unavoidable cover regime in which static auditors cannot simultaneously eliminate coverage and granularity failures, shown via new policies, strategies, and a reproducible simulator.

SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting

q-bio.NC · 2026-05-13 · unverdicted · novelty 7.0

SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-Poisson evaluation floor across seven model families on 105 Neuropixels sessions.

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

cs.HC · 2026-05-09 · accept · novelty 7.0

LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

cs.CV · 2026-05-07 · conditional · novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

iTRIALSPACE generates realistic virtual lesion trials on lung CTs that isolate performance drivers and show strong transfer of model rankings to real clinical data (ρ=0.93).

LLMSpace: Carbon Footprint Modeling for Large Language Model Inference on LEO Satellites

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

LLMSpace is the first framework to jointly model operational and embodied carbon for LLM inference on LEO satellites, incorporating radiation-hardened hardware, peripheral systems, and workload patterns such as prefill-decode behavior.

Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

cs.DC · 2026-05-05 · unverdicted · novelty 7.0

Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.

The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

cs.LG · 2026-05-02 · unverdicted · novelty 7.0

An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.

TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation

cs.CV · 2026-04-29 · accept · novelty 7.0

TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation

cs.CL · 2026-04-10 · accept · novelty 7.0

SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.

An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83.3% acceptable excerpts and human preference in 64.8% of blind comparisons.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

QLoRA: Efficient Finetuning of Quantized LLMs

cs.LG · 2023-05-23 · conditional · novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

Causal Bias Detection in Generative Artifical Intelligence

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.

Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing

cs.CR · 2026-05-11 · unverdicted · novelty 6.0

GRIEF fuzzer finds 15 vulnerabilities including 2 CVEs in vLLM and SGLang by testing concurrent workloads for KV-cache isolation failures and cross-request interference.

Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.

OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

cs.HC · 2026-05-08 · unverdicted · novelty 6.0

A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.

Query-efficient model evaluation using cached responses

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.

ModelLens: Finding the Best for Your Task from Myriads of Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.

When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems

cs.MA · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

CAFE finds positive distributional Jensen Gaps across five multi-agent LLM architectures under semantic stress, showing that quality drops can coexist with detectable stress geometry compatible with antifragile learning.

citing papers explorer

Showing 50 of 59 citing papers.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders cs.AI · 2026-05-13 · accept · none · ref 37 · internal anchor
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
A Benchmark for Strategic Auditee Gaming Under Continuous Compliance Monitoring cs.CY · 2026-05-07 · accept · none · ref 10
Continuous auditing creates an unavoidable cover regime in which static auditors cannot simultaneously eliminate coverage and granularity failures, shown via new policies, strategies, and a reproducible simulator.
SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting q-bio.NC · 2026-05-13 · unverdicted · none · ref 17 · internal anchor
SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-Poisson evaluation floor across seven model families on 105 Neuropixels sessions.
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model cs.CL · 2026-05-11 · unverdicted · none · ref 25 · internal anchor
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations cs.HC · 2026-05-09 · accept · none · ref 42 · internal anchor
LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs cs.CV · 2026-05-07 · conditional · none · ref 19 · internal anchor
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models cs.CV · 2026-05-07 · unverdicted · none · ref 27 · internal anchor
iTRIALSPACE generates realistic virtual lesion trials on lung CTs that isolate performance drivers and show strong transfer of model rankings to real clinical data (ρ=0.93).
LLMSpace: Carbon Footprint Modeling for Large Language Model Inference on LEO Satellites cs.LG · 2026-05-07 · unverdicted · none · ref 50 · 2 links · internal anchor
LLMSpace is the first framework to jointly model operational and embodied carbon for LLM inference on LEO satellites, incorporating radiation-hardened hardware, peripheral systems, and workload patterns such as prefill-decode behavior.
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs cs.DC · 2026-05-05 · unverdicted · none · ref 26 · internal anchor
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice cs.LG · 2026-05-02 · unverdicted · none · ref 17 · internal anchor
An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation cs.CV · 2026-04-29 · accept · none · ref 3 · internal anchor
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 138 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation cs.CL · 2026-04-10 · accept · none · ref 20 · internal anchor
SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks cs.AI · 2026-04-09 · unverdicted · none · ref 15 · internal anchor
An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83.3% acceptable excerpts and human preference in 64.8% of blind comparisons.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 115 · internal anchor
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
QLoRA: Efficient Finetuning of Quantized LLMs cs.LG · 2023-05-23 · conditional · none · ref 35 · internal anchor
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
Causal Bias Detection in Generative Artifical Intelligence cs.AI · 2026-05-12 · unverdicted · none · ref 30 · internal anchor
A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.
Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing cs.CR · 2026-05-11 · unverdicted · none · ref 22 · internal anchor
GRIEF fuzzer finds 15 vulnerabilities including 2 CVEs in vLLM and SGLang by testing concurrent workloads for KV-cache isolation failures and cross-request interference.
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks cs.AI · 2026-05-11 · unverdicted · none · ref 15 · internal anchor
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces cs.AI · 2026-05-09 · unverdicted · none · ref 130 · internal anchor
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios cs.HC · 2026-05-08 · unverdicted · none · ref 30 · internal anchor
A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.
Query-efficient model evaluation using cached responses cs.LG · 2026-05-08 · unverdicted · none · ref 129 · internal anchor
DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.
ModelLens: Finding the Best for Your Task from Myriads of Models cs.LG · 2026-05-08 · unverdicted · none · ref 5 · internal anchor
ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.
When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems cs.MA · 2026-05-04 · unverdicted · none · ref 24 · 2 links · internal anchor
CAFE finds positive distributional Jensen Gaps across five multi-agent LLM architectures under semantic stress, showing that quality drops can coexist with detectable stress geometry compatible with antifragile learning.
What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models cs.CL · 2026-05-03 · unverdicted · none · ref 6 · internal anchor
Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.
Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework cs.AI · 2026-05-02 · unverdicted · none · ref 6 · internal anchor
The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for continuous production evaluation with an open-source implementation.
Compared to What? Baselines and Metrics for Counterfactual Prompting cs.CL · 2026-05-01 · conditional · none · ref 116 · internal anchor
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models cs.IR · 2026-04-27 · conditional · none · ref 20 · internal anchor
RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora cs.SE · 2026-04-27 · unverdicted · none · ref 29 · internal anchor
Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted data repairs, demonstrated across 16 disciplines.
Are Large Language Models Economically Viable for Industry Deployment? cs.CL · 2026-04-21 · unverdicted · none · ref 52 · internal anchor
Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier cs.AI · 2026-04-19 · unverdicted · none · ref 1 · internal anchor
ISOPro replaces learned reward models with deterministic verifiers in a continuous evaluation setup for LLMs, delivering larger average capability gains than GRPO-LoRA across small models in scheduling and MBPP domains while characterizing a buffer-skew failure mode.
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models cs.LG · 2026-04-15 · unverdicted · none · ref 16 · internal anchor
Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment cs.AI · 2026-04-13 · unverdicted · none · ref 6 · internal anchor
Execution and refusal in tool-using LLM agents form separable behavioral dimensions whose joint distribution shifts systematically with normative regimes and autonomy scaffolding.
BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation cs.CL · 2026-04-10 · unverdicted · none · ref 3 · internal anchor
BERT-as-a-Judge fine-tunes a BERT encoder on synthetic question-candidate-reference triplets to judge answer correctness, outperforming lexical baselines and matching larger LLM judges across 36 models and 15 tasks.
AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis cs.CV · 2026-04-07 · unverdicted · none · ref 20 · internal anchor
AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.
SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics cs.SE · 2026-04-06 · unverdicted · none · ref 8 · internal anchor
SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
Evaluating Artificial Intelligence Through a Christian Understanding of Human Flourishing cs.AI · 2026-04-03 · unverdicted · none · ref 8 · internal anchor
Frontier AI models default to procedural secularism and score 17 points lower on Christian human-flourishing criteria than on pluralistic ones, with a 31-point gap in faith and spirituality.
Measuring Representation Robustness in Large Language Models for Geometry cs.CL · 2026-04-03 · unverdicted · none · ref 17 · internal anchor
LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capacity models.
Towards an AI co-scientist cs.AI · 2025-02-26 · unverdicted · none · ref 169 · internal anchor
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark cs.CL · 2024-06-03 · conditional · none · ref 22 · internal anchor
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena cs.CL · 2023-06-09 · accept · none · ref 24 · internal anchor
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
BloombergGPT: A Large Language Model for Finance cs.LG · 2023-03-30 · conditional · none · ref 66 · internal anchor
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model cs.CL · 2022-11-09 · unverdicted · none · ref 187 · 2 links · internal anchor
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management cs.LG · 2026-05-04 · unverdicted · none · ref 300
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
Mental Health AI Safety Claims Must Preserve Temporal Evidence cs.AI · 2026-05-09 · unverdicted · none · ref 5 · internal anchor
Mental health AI safety evaluations that discard temporal sequence and accumulation produce invalid conclusions; the paper formalizes this as Temporal Safety Non-Identifiability and proposes SCOPE-MH as a reporting standard that preserves evidence.
DAO-enabled decentralized physical AI: A new paradigm for human-machine collaboration cs.MA · 2026-05-06 · unverdicted · none · ref 55 · internal anchor
DePAI is a proposed democratic architecture that integrates DAOs, DePIN, and AI to enable human oversight of machine execution in physical-digital systems under transparent on-chain rules.
Reasoning emerges from constrained inference manifolds in large language models cs.LG · 2026-05-02 · unverdicted · none · ref 7 · internal anchor
Reasoning in LLMs emerges from inference dynamics forming constrained low-dimensional manifolds that preserve non-degenerate information volume, rather than from compression alone.
TRUST: A Framework for Decentralized AI Service v.0.1 cs.AI · 2026-04-29 · unverdicted · none · ref 24 · internal anchor
TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while proving a Safety-Profitability Theorem that rewards honest auditors.
To Copilot and Beyond: 22 AI Systems Developers Want Built cs.SE · 2026-04-09 · unverdicted · none · ref 34 · internal anchor
Survey of 860 developers reveals 22 desired AI systems for non-coding tasks with explicit constraints on authority, provenance, and quality signals, framed as bounded delegation where AI handles assembly work but not core craft.
PaLM 2 Technical Report cs.CL · 2023-05-17 · unverdicted · none · ref 91 · internal anchor
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

Holistic Evaluation of Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer