Pilot audit of twelve LLM benchmark papers finds mean disclosure score of 0.38/1.0 for agent benchmarks versus 0.66 for classical ones, with zero papers disclosing inference costs or full harness specs, and releases an open JSON schema plus scoring CSV.
State of What Art? A Call for Multi-Prompt LLM Eval- uation,
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 2years
2026 2representative citing papers
An empirical audit identifies a strong SAE feature correlate for GPT-2 small failures on 'keys' prompts in the IOI task, performs ablation and baseline controls showing it is not causal, and presents the audit pipeline as the primary contribution.
citing papers explorer
-
What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema
Pilot audit of twelve LLM benchmark papers finds mean disclosure score of 0.38/1.0 for agent benchmarks versus 0.66 for classical ones, with zero papers disclosing inference costs or full harness specs, and releases an open JSON schema plus scoring CSV.
-
Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification
An empirical audit identifies a strong SAE feature correlate for GPT-2 small failures on 'keys' prompts in the IOI task, performs ablation and baseline controls showing it is not causal, and presents the audit pipeline as the primary contribution.