Journal of the American Statistical Association
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- Computer Use at the Edge of the Statistical Precipice
  A blind replay script matches frontier model performance on static CUA benchmarks due to non-principled environments and evaluation methods, prompting PRISM design principles and the DigiWorld benchmark with improved statistical aggregation.
- Practical validation of synthetic pre-crash scenarios
  A binning-based Bayesian ROPE equivalence testing method is introduced to quantitatively assess practical equivalence between synthetic and real pre-crash scenario datasets for driving automation safety impact evaluation.