pith. sign in

arxiv: 2605.28873 · v1 · pith:YBPOGTJYnew · submitted 2026-05-25 · 💻 cs.LG

Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit

classification 💻 cs.LG
keywords boundquantizationauditbenchmarksdeltatemplatebelowbenchmark
0
0 comments X
read the original abstract

This is a planning-method note with an unpaired pilot audit. We adapt the classical paired-binary sample-size calculation (Miettinen, 1968) to quantization benchmarks, giving a conservative minimum detectable effect (MDE) bound $\delta^{*} \le (z_{1-\alpha/2}+z_{1-\beta})\sqrt{\rho_d/m}$ in the paired item count $m$ and the FP16-NF4 disagreement rate $\rho_d$. The bound turns "how reliable is my quantization claim?" into a one-line budget a benchmark designer can commit to before running. We illustrate the bound on four models and four benchmarks ($k=5$ splits of $n=100$), and add a parallel MMLU prompt-template study to put the bound's quantization-noise scale alongside the prompt-noise scale. Assuming $\rho_d=0.10$ (an unmeasured planning value), all observed NF4-FP16 deltas fall below the implied MDE, and most cross-split SDs lie within $\pm 1.5$ pp of the binomial reference $\sqrt{p(1-p)/n}$, so much of the variance reported as "benchmark unreliability" on $n=100$ subsamples is binomial sampling noise. The single borderline cell (OPT-WinoGrande, $|\Delta|=3.2$ pp) is below the implied MDE at $\rho_d=0.10$ but above it at $\rho_d=0.05$, illustrating the planning trade-off the bound makes explicit. On MMLU, prompt-template ranges of 2-10 pp meet or exceed the largest observed quantization delta (3.2 pp), so a quantization audit that does not first fix the prompt template absorbs template variance into its noise floor. We complement the bound with a five-line pre-registration template.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents

    cs.LG 2026-06 unverdicted novelty 7.0

    High AUC from linear probes on model activations for indirect prompt injection does not license an unqualified claim of malicious-content detection, per a Qwen2.5-VL-7B case study with text and visual controls.

  2. Chains That See, Answers That Don't: A Multi-Aspect Evaluation Recipe for Forced Chain-of-Thought on Video-MME

    cs.CV 2026-06 conditional novelty 6.0

    Forced CoT produces video-dependent reasoning chains but does not improve MCQ accuracy on Qwen2.5-VL with Video-MME and causes a small drop on the 7B variant.