AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
hub
Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tuning tokens while preserving utility.
A scoping review and empirical analysis produce a six-category taxonomy of factors driving AI non-development and abandonment, showing that practical issues like resource limits and organizational dynamics often outweigh ethical concerns in real decisions.
Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.
RIFT taxonomy identifies eight failure modes in rubric design for LLMs and provides automated metrics matching human judgments with up to 0.925 F1 score.
Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.
Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.
Generative AI should be evaluated through computational hermeneutics using iterative, human-inclusive benchmarks that measure cultural context rather than isolated model outputs.
Introduces a calibration framework for AI benchmarks using world-population probability levels on logarithmic scales derived from human test data and LLM extrapolation.
MoralityGym is a new benchmark using 98 ethical dilemmas in sequential environments to evaluate hierarchical moral alignment in AI agents via Morality Chains and a Morality Metric.
VERA-MH proposes an automated pipeline using simulated conversations and a rubric to evaluate AI chatbots on suicide risk handling in mental health contexts.
AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.
Consciousness does not directly predict AI existential risk unlike intelligence, though it may indirectly affect risk through alignment or capability requirements.
citing papers explorer
-
Unsteady Metrics and Benchmarking Cultures of AI Model Builders
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
-
Dataset Watermarking for Closed LLMs with Provable Detection
A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tuning tokens while preserving utility.
-
To Build or Not to Build? Factors that Lead to Non-Development or Abandonment of AI Systems
A scoping review and empirical analysis produce a six-category taxonomy of factors driving AI non-development and abandonment, showing that practical issues like resource limits and organizational dynamics often outweigh ethical concerns in real decisions.
-
Simulating the Evolution of Alignment and Values in Machine Intelligence
Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.
-
RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics
RIFT taxonomy identifies eight failure modes in rubric design for LLMs and provides automated metrics matching human judgments with up to 0.925 F1 score.
-
Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI
Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.
-
Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics
Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.
-
Computational Hermeneutics: Evaluating generative AI as a cultural technology
Generative AI should be evaluated through computational hermeneutics using iterative, human-inclusive benchmarks that measure cultural context rather than isolated model outputs.
-
From Human-Level AI Tales to AI Leveling Human Scales
Introduces a calibration framework for AI benchmarks using world-population probability levels on logarithmic scales derived from human test data and LLM extrapolation.
-
MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents
MoralityGym is a new benchmark using 98 ethical dilemmas in sequential environments to evaluate hierarchical moral alignment in AI agents via Morality Chains and a Morality Metric.
-
VERA-MH Concept Paper
VERA-MH proposes an automated pipeline using simulated conversations and a rubric to evaluate AI chatbots on suicide risk handling in mental health contexts.
-
Position: AI Evaluations Should be Grounded on a Theory of Capability
AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.
-
AI Consciousness and Existential Risk
Consciousness does not directly predict AI existential risk unlike intelligence, though it may indirectly affect risk through alignment or capability requirements.