State of What Art? A Call for Multi-Prompt LLM Evaluation

2024

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema

cs.LG · 2026-05-20 · accept · novelty 7.0

Pilot audit of twelve LLM benchmark papers finds mean disclosure score of 0.38/1.0 for agent benchmarks versus 0.66 for classical ones, with zero papers disclosing inference costs or full harness specs, and releases an open JSON schema plus scoring CSV.

Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

An empirical audit identifies a strong SAE feature correlate for GPT-2 small failures on 'keys' prompts in the IOI task, performs ablation and baseline controls showing it is not causal, and presents the audit pipeline as the primary contribution.

citing papers explorer


What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema
cs.LG · 2026-05-20 · accept · none · ref 20
Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification
cs.LG · 2026-05-21 · unverdicted · none · ref 35
