Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation

Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation · 2025 · arXiv 2502.06559

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · other: 2

citation-polarity summary

unclear: 2 · background: 1 · support: 1

representative citing papers

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

Dataset Watermarking for Closed LLMs with Provable Detection

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tuning tokens while preserving utility.

To Build or Not to Build? Factors that Lead to Non-Development or Abandonment of AI Systems

cs.CY · 2026-04-30 · unverdicted · novelty 6.0

A scoping review and empirical analysis produce a six-category taxonomy of factors driving AI non-development and abandonment, showing that practical issues like resource limits and organizational dynamics often outweigh ethical concerns in real decisions.

Simulating the Evolution of Alignment and Values in Machine Intelligence

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Evolutionary simulations demonstrate that deceptive beliefs become fixed in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.

RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics

cs.AI · 2026-04-01 · conditional · novelty 6.0

RIFT taxonomy identifies eight failure modes in rubric design for LLMs and provides automated metrics matching human judgments with up to 0.925 F1 score.

Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

cs.CL · 2026-03-16 · unverdicted · novelty 6.0

Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.

Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics

cs.CY · 2026-04-02 · unverdicted · novelty 5.0 · 2 refs

Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.

Computational Hermeneutics: Evaluating generative AI as a cultural technology

cs.AI · 2026-03-31 · unverdicted · novelty 5.0

Generative AI should be evaluated through computational hermeneutics using iterative, human-inclusive benchmarks that measure cultural context rather than isolated model outputs.

From Human-Level AI Tales to AI Leveling Human Scales

cs.LG · 2026-02-21 · unverdicted · novelty 5.0

Introduces a calibration framework for AI benchmarks using world-population probability levels on logarithmic scales derived from human test data and LLM extrapolation.

MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

cs.AI · 2026-02-13 · unverdicted · novelty 5.0

MoralityGym is a new benchmark using 98 ethical dilemmas in sequential environments to evaluate hierarchical moral alignment in AI agents via Morality Chains and a Morality Metric.

VERA-MH Concept Paper

cs.CY · 2025-10-17 · unverdicted · novelty 5.0

VERA-MH proposes an automated pipeline using simulated conversations and a rubric to evaluate AI chatbots on suicide risk handling in mental health contexts.

Position: AI Evaluations Should be Grounded on a Theory of Capability

cs.AI · 2025-09-23 · conditional · novelty 5.0

AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.

AI Consciousness and Existential Risk

cs.AI · 2025-11-24 · unverdicted · novelty 2.0

Unlike intelligence, consciousness does not directly predict AI existential risk, though it may indirectly affect risk through alignment or capability requirements.

citing papers explorer

Showing 13 of 13 citing papers.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders · cs.AI · 2026-05-13 · accept · none · ref 22
Dataset Watermarking for Closed LLMs with Provable Detection · cs.LG · 2026-05-07 · unverdicted · none · ref 3
To Build or Not to Build? Factors that Lead to Non-Development or Abandonment of AI Systems · cs.CY · 2026-04-30 · unverdicted · none · ref 44
Simulating the Evolution of Alignment and Values in Machine Intelligence · cs.AI · 2026-04-07 · unverdicted · none · ref 8
RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics · cs.AI · 2026-04-01 · conditional · none · ref 1
Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI · cs.CL · 2026-03-16 · unverdicted · none · ref 4
Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics · cs.CY · 2026-04-02 · unverdicted · none · ref 36 · 2 links
Computational Hermeneutics: Evaluating generative AI as a cultural technology · cs.AI · 2026-03-31 · unverdicted · none · ref 32
From Human-Level AI Tales to AI Leveling Human Scales · cs.LG · 2026-02-21 · unverdicted · none · ref 3
MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents · cs.AI · 2026-02-13 · unverdicted · none · ref 24
VERA-MH Concept Paper · cs.CY · 2025-10-17 · unverdicted · none · ref 3
Position: AI Evaluations Should be Grounded on a Theory of Capability · cs.AI · 2025-09-23 · conditional · none · ref 17
AI Consciousness and Existential Risk · cs.AI · 2025-11-24 · unverdicted · none · ref 51
