Title resolution pending

URL: https://arxiv · 2022 · arXiv 2207.07048

2 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Benchmark-specific training is modeled as shift bribery, which is NP-hard under both Borda and mean win rate. Among the aggregation rules tested, mean win rate has the highest instance-level robustness (median 22 tasks on BBH).

Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

cs.LG · 2026-05-13 · unverdicted · novelty 5.0

Multi-level bootstrapping models annotator variance using large rater-ID datasets to find optimal tradeoffs between number of items N and ratings per item K for statistically significant AI evaluations.

citing papers explorer

Showing 2 of 2 citing papers.

How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

cs.LG · 2026-05-22 · unverdicted · none · ref 14

Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

cs.LG · 2026-05-13 · unverdicted · none · ref 17
