Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · other: 2

years

2026: 10 · 2025: 3

representative citing papers

Dataset Watermarking for Closed LLMs with Provable Detection

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tuning tokens while preserving utility.

From Human-Level AI Tales to AI Leveling Human Scales

cs.LG · 2026-02-21 · unverdicted · novelty 5.0

Introduces a calibration framework for AI benchmarks using world-population probability levels on logarithmic scales derived from human test data and LLM extrapolation.

VERA-MH Concept Paper

cs.CY · 2025-10-17 · unverdicted · novelty 5.0

VERA-MH proposes an automated pipeline using simulated conversations and a rubric to evaluate AI chatbots on suicide risk handling in mental health contexts.

Position: AI Evaluations Should be Grounded on a Theory of Capability

cs.AI · 2025-09-23 · conditional · novelty 5.0

AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.

AI Consciousness and Existential Risk

cs.AI · 2025-11-24 · unverdicted · novelty 2.0

Unlike intelligence, consciousness does not directly predict AI existential risk, though it may affect risk indirectly through alignment or capability requirements.
