pith. machine review for the scientific record.

arxiv: 2605.06213 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 10:08 UTC · model grok-4.3

classification 💻 cs.AI
keywords: dynamic boundary evaluation · LLM evaluation · adaptive benchmarking · item difficulty · model comparison · safety evaluation · truthfulness assessment

The pith

Evaluating language models at each model's 0.5 success probability boundary reveals capability gaps that fixed benchmarks miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fixed benchmarks use identical items for every model and produce ceiling effects for capable systems alongside floor effects for weaker ones. The paper claims the richest evaluation signal occurs where a model's pass probability on a given prompt sits near 0.5 under random sampling. Dynamic Boundary Evaluation locates this boundary for any target model through API queries alone and maps the result onto a difficulty scale calibrated from multiple reference models. The method supplies a reusable item bank, an adaptive search procedure, and a protocol that expands coverage when needed, applying the same logic to safety refusal, instruction following, and truthfulness tasks.
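One way to read the 0.5 claim (an editorial gloss in the Rasch notation the figure captions use, not an equation taken from the paper): under a one-parameter logistic model, the information an item carries about a model's ability is p(1 − p), which peaks exactly where the pass probability is 0.5.

```latex
% Editorial sketch, assuming a 1PL (Rasch) reading of the boundary:
% theta = model ability, beta_x = item difficulty on the calibrated scale.
\[
  p_x(\theta) = \sigma(\theta - \beta_x) = \frac{1}{1 + e^{-(\theta - \beta_x)}},
  \qquad
  I_x(\theta) = p_x(\theta)\,\bigl(1 - p_x(\theta)\bigr).
\]
% I_x(theta) is maximized when p_x(theta) = 0.5, i.e. when beta_x = theta,
% so items at the 0.5 boundary constrain the ability estimate most sharply.
```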

Core claim

Dynamic Boundary Evaluation actively locates items at the 0.5 pass-probability boundary for each LLM using Skill-Guided Boundary Search on an item bank whose difficulties were validated across nine reference models, thereby placing every evaluated model on a single comparable ability scale without saturation.

What carries the argument

Dynamic Boundary Evaluation (DBE) together with Skill-Guided Boundary Search (SGBS), an algorithm that uses only API-level queries to identify boundary items for a target model and place it on a globally calibrated difficulty scale.
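A minimal sketch of what "API-level queries" buys here, under assumptions not taken from the paper: `query_model` and `grade` stand in for the target model's sampling endpoint and a category-specific pass/fail grader, and the sample count and tolerance are illustrative. With n = 100 samples, the standard error of a pass-rate estimate near 0.5 is about 0.5/√100 = 0.05, which bounds how tight the boundary band can meaningfully be.

```python
from typing import Callable, Dict, List

def estimate_pass_prob(item: str,
                       query_model: Callable[[str], str],
                       grade: Callable[[str, str], bool],
                       n_samples: int = 100) -> float:
    """Estimate an item's pass probability under random-sampling decoding.

    `query_model` and `grade` are placeholder callables (assumptions),
    standing in for the target LLM API and the paper's grader.
    """
    passes = sum(grade(item, query_model(item)) for _ in range(n_samples))
    return passes / n_samples

def select_boundary_items(items: List[str],
                          query_model: Callable[[str], str],
                          grade: Callable[[str, str], bool],
                          tol: float = 0.15) -> Dict[str, float]:
    """Keep items whose estimated pass probability lands near 0.5.

    Tolerances much below ~0.1 are noise-limited at n_samples = 100,
    since SE(p_hat) at p = 0.5 is roughly 0.05 there.
    """
    estimates = {x: estimate_pass_prob(x, query_model, grade) for x in items}
    return {x: p for x, p in estimates.items() if abs(p - 0.5) <= tol}
```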

If this is right

  • Models across a wide capability range can be compared directly without hitting performance ceilings or floors.
  • A new model receives a placement on the same scale using the pre-validated item difficulties.
  • The evaluation set grows automatically when a target model exceeds the bank's current coverage.
  • The protocol produces consistent results for safety refusal, constrained instruction following, and multi-turn sycophancy resistance.
  • Existing datasets remain compatible while the method supplies finer-grained distinctions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same calibrated bank could support repeated measurements over successive model releases to track progress on a fixed scale.
  • Shared item banks that expand over time might reduce the need for entirely new static benchmarks with each model generation.
  • The approach depends on API access, so fully closed models without query interfaces would require a different placement method.
  • Extending the bank to additional domains would create a broader map of abilities that connects safety, capability, and truthfulness evaluations.

Load-bearing premise

Difficulty labels obtained from nine reference models will correctly locate the 0.5 probability boundary for models outside that reference set.
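To make the premise concrete: placing a new model means freezing the calibrated difficulties and asking which ability best explains its pass/fail outcomes; the premise holds only if, at that ability, items with β ≈ θ̂ really land near 0.5 pass probability. A minimal placement sketch, assuming a plain grid-search MLE rather than whatever estimator the paper actually uses:

```python
import numpy as np

def place_on_scale(betas: np.ndarray, outcomes: np.ndarray) -> float:
    """Maximum-likelihood ability estimate under a 1PL model.

    `betas` are the fixed, pre-calibrated item difficulties and `outcomes`
    the new model's binary pass/fail results on those items. The placed
    model's 0.5 boundary then sits at difficulty beta ~= theta_hat, which
    is exactly where the load-bearing premise has to hold for models
    outside the nine-model reference panel.
    """
    grid = np.linspace(-6.0, 6.0, 601)                       # candidate abilities
    p = 1.0 / (1.0 + np.exp(-(grid[:, None] - betas[None, :])))
    loglik = (outcomes * np.log(p) + (1 - outcomes) * np.log(1 - p)).sum(axis=1)
    return float(grid[np.argmax(loglik)])

# Illustrative numbers only (not from the paper):
betas = np.array([-1.0, 0.0, 0.5, 1.5, 2.0])
outcomes = np.array([1, 1, 1, 0, 0])
theta_hat = place_on_scale(betas, outcomes)   # falls between 0.5 and 1.5
```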

What would settle it

Re-testing several models with newly generated items chosen to sit at their individual 0.5 boundaries and finding that their relative ordering differs from the original DBE scores.
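The test reduces to a rank-order comparison. A sketch of the analysis with made-up numbers (model names and scores are illustrative, not from the paper):

```python
from scipy.stats import kendalltau

# Original DBE abilities vs. abilities re-estimated on freshly generated
# items targeted at each model's own 0.5 boundary (hypothetical values).
original = {"model_a": 1.8, "model_b": 0.9, "model_c": -0.4}
retest   = {"model_a": 1.6, "model_b": 1.1, "model_c": -0.2}

models = sorted(original)
tau, p_value = kendalltau([original[m] for m in models],
                          [retest[m] for m in models])
# tau near 1 supports the original ordering; a low tau would unsettle it.
```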

Figures

Figures reproduced from arXiv: 2605.06213 by Da Yu, Haoxiang Wang, Huishuai Zhang.

Figure 1. Dynamic Boundary Evaluation (DBE) builds a calibrated difficulty scale and extends it on demand. The anchor set (black dots) is calibrated on a category-specific Rasch logit scale β̂ using responses from a fixed M = 9 reference panel (blue circles). A new model is first evaluated against the existing anchors. If its estimated ability θ̂_new lies within the panel-covered range, the anchor set suffices. If it…
Figure 2. Worked example of SGBS composition. The bandit samples a low-difficulty bare request q (blue) and a compatible skill subset s of size k = 2 (orange; short labels shown, full identifiers in Appendix B); a category-specific LLM composer fuses them into the evaluation item x = Compose(q, s) (red), with italic orange spans marking the two skills' surface effects. On the target model, x lands at p̂ ≈ 0.5 and is retained…
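A heavily hedged reading of the composition loop the Figure 2 caption describes: `compose` stands in for the category-specific LLM composer x = Compose(q, s), `pass_prob` for the repeated-sampling estimate on the target model, and the paper's bandit is reduced to uniform sampling for illustration.

```python
import random
from typing import Callable, List, Optional, Sequence

def sgbs_step(bare_requests: List[str],
              skills: List[str],
              compose: Callable[[str, Sequence[str]], str],
              pass_prob: Callable[[str], float],
              k: int = 2,
              tol: float = 0.1) -> Optional[str]:
    """One illustrative compose-and-check step in the spirit of Figure 2.

    `compose` and `pass_prob` are stand-in callables, not the paper's
    components; the bandit over requests and skill subsets is replaced
    by uniform sampling here.
    """
    q = random.choice(bare_requests)       # low-difficulty bare request
    s = random.sample(skills, k)           # skill subset of size k
    x = compose(q, s)                      # fused evaluation item
    return x if abs(pass_prob(x) - 0.5) <= tol else None  # retain boundary items
```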
original abstract

Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to any model, producing ceiling and floor effects that mask capability gaps. We argue that the most informative evaluation signal lies at the boundary, where the per-prompt pass probability is near $0.5$ under random-sampling decoding, and propose Dynamic Boundary Evaluation (DBE), which actively locates each model's boundary and places it on a globally comparable difficulty scale. DBE delivers three artifacts: (i) a calibrated item bank covering safety, capability, and truthfulness, with per-item difficulty labels validated across $9$ reference LLMs; (ii) Skill-Guided Boundary Search (SGBS), a search algorithm that finds boundary items for a given target LLM using only API-level query access; and (iii) an evaluation protocol that places a new LLM on a unified ability scale and grows the evaluation set adaptively when the target falls outside the bank's coverage. We instantiate DBE on four categories spanning safety (harmful request refusal and over-refusal), capability (constrained instruction following), and truthfulness (multi-turn sycophancy resistance). The resulting evaluation covers a broader model spectrum without saturation while remaining compatible with existing datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Dynamic Boundary Evaluation (DBE) to address limitations of fixed benchmarks in LLM evaluation, such as ceiling and floor effects. It posits that the most informative evaluation occurs at the per-prompt pass probability boundary of approximately 0.5 under random-sampling decoding. DBE consists of a calibrated item bank with difficulty labels validated on 9 reference LLMs covering safety, capability, and truthfulness; the Skill-Guided Boundary Search (SGBS) algorithm to identify boundary items for target models via API queries; and an adaptive protocol to place models on a unified ability scale and expand the bank as needed. The approach is instantiated on four categories including harmful request refusal, over-refusal, constrained instruction following, and multi-turn sycophancy resistance.

Significance. If the per-item difficulties prove invariant across model families and the 0.5 boundary captures superior signal, DBE could enable higher-resolution, saturation-free evaluations that place diverse LLMs on a common scale using only API access. The adaptive bank growth and practical search procedure are notable strengths that could support reproducible, extensible benchmarking.

major comments (2)
  1. [Abstract and §3 (Item Bank Calibration)] The central claim that per-item difficulty labels validated across 9 reference LLMs define a stable, model-independent scale (allowing new LLMs to be placed via their 0.5 boundary) is load-bearing for the 'globally comparable' assertion. The abstract states the labels were 'validated across 9 reference LLMs' but provides no quantitative tests of extrapolation (e.g., hold-out models from different families, architecture, or training regimes), nor evidence that pass probability is monotonic in a single-parameter ability-minus-difficulty model for all items and models. This assumption must be directly tested with cross-family results before the unified scale can be accepted.
  2. [§4 and §5 (Instantiation and Results)] The manuscript lacks detailed empirical results, methodology, or validation data for the instantiated DBE on the four categories (safety, capability, truthfulness). Without reported metrics on boundary location accuracy, scale stability, or comparisons to fixed benchmarks, it is difficult to assess whether the 0.5 operating point is demonstrably more informative or whether SGBS reliably converges.
minor comments (2)
  1. [§3.2] The description of SGBS would benefit from pseudocode, explicit convergence criteria, and the precise definition of 'boundary' (e.g., how many samples per item and tolerance around 0.5).
  2. [§2] Notation for pass probability and difficulty parameters should be introduced consistently with a single equation or table to avoid ambiguity when discussing the unified scale.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript's claims and empirical support.

point-by-point responses
  1. Referee: [Abstract and §3 (Item Bank Calibration)] The central claim that per-item difficulty labels validated across 9 reference LLMs define a stable, model-independent scale (allowing new LLMs to be placed via their 0.5 boundary) is load-bearing for the 'globally comparable' assertion. The abstract states the labels were 'validated across 9 reference LLMs' but provides no quantitative tests of extrapolation (e.g., hold-out models from different families, architecture, or training regimes), nor evidence that pass probability is monotonic in a single-parameter ability-minus-difficulty model for all items and models. This assumption must be directly tested with cross-family results before the unified scale can be accepted.

    Authors: We agree that the stability and extrapolability of the difficulty scale is central to the contribution. Section 3 describes validation across 9 reference LLMs chosen to span families, sizes, and training regimes, with difficulty labels derived from observed pass rates. However, we acknowledge that explicit hold-out experiments on additional unseen model families and direct tests of the single-parameter monotonicity assumption are not currently reported. In the revised version we will add a dedicated subsection with new cross-family hold-out results, including correlation between observed and predicted pass probabilities under the ability-difficulty model, goodness-of-fit statistics, and monotonicity checks across items. revision: yes

  2. Referee: [§4 and §5 (Instantiation and Results)] The manuscript lacks detailed empirical results, methodology, or validation data for the instantiated DBE on the four categories (safety, capability, truthfulness). Without reported metrics on boundary location accuracy, scale stability, or comparisons to fixed benchmarks, it is difficult to assess whether the 0.5 operating point is demonstrably more informative or whether SGBS reliably converges.

    Authors: We accept that the current presentation of results in §§4–5 is insufficiently detailed for full evaluation. While the manuscript reports instantiation across the four categories and qualitative advantages over fixed benchmarks, quantitative metrics on SGBS convergence, boundary-location accuracy, scale stability, and head-to-head comparisons are only summarized. We will substantially expand these sections with additional tables and figures reporting: (i) SGBS convergence statistics (queries required, success rate), (ii) boundary accuracy via repeated runs and variance estimates, (iii) scale stability via test-retest correlations on overlapping items, and (iv) direct comparisons showing reduced ceiling/floor effects relative to standard benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity: scale construction and boundary search are independent of target-model outputs

full rationale

The derivation begins with an externally calibrated item bank whose per-item difficulties are estimated from nine reference LLMs and then held fixed; SGBS subsequently searches for the 0.5-pass-probability boundary of a new target LLM using only API queries. Neither step defines its output in terms of itself, renames a fitted parameter as a prediction, nor relies on a self-citation chain whose validity is presupposed by the present paper. The assumption that item difficulties remain invariant across model families is an empirical claim subject to external falsification rather than a definitional reduction. Consequently the placement of a new model on the unified scale is not equivalent to the inputs by construction.
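The external falsification the rationale points to can be stated compactly (an editorial formulation, not a metric quoted from the paper): for a hold-out model with placed ability θ̂, compare observed pass rates to the pass rates the fixed difficulties predict.

```latex
% Editorial sketch of an invariance check on the calibrated difficulties.
\[
  \tilde{p}_x = \sigma\!\bigl(\hat{\theta} - \hat{\beta}_x\bigr),
  \qquad
  \mathrm{RMSE} = \sqrt{\frac{1}{|X|}\sum_{x \in X}\bigl(\hat{p}_x - \tilde{p}_x\bigr)^2},
\]
% where p_hat_x is the hold-out model's observed pass rate on item x.
% Large RMSE, or observed pass rates that are not monotone in
% theta_hat - beta_x, would falsify the invariance assumption for models
% outside the reference panel.
```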

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The approach rests on a domain assumption about the 0.5 boundary being the optimal operating point, and the calibration process may introduce fitted parameters not detailed in the abstract.

free parameters (1)
  • per-item difficulty labels
    Labels are validated across 9 reference LLMs, implying some form of fitting or aggregation to assign difficulties; a minimal calibration sketch follows this ledger.
axioms (1)
  • domain assumption: the most informative evaluation signal lies at the per-prompt pass probability near 0.5
    Core argument presented in the abstract as the foundation for DBE.
invented entities (2)
  • Dynamic Boundary Evaluation (DBE) · no independent evidence
    purpose: Adaptive evaluation framework for LLMs at capability boundaries
    Newly proposed method in the paper.
  • Skill-Guided Boundary Search (SGBS) · no independent evidence
    purpose: Algorithm to find boundary items for a target LLM using API access
    Introduced as a core component of DBE.
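A minimal sketch of the kind of fit that would produce the per-item difficulty labels flagged above, assuming a plain joint 1PL fit by gradient ascent; the paper's actual optimizer, masking, and gauge choices are not reproduced here.

```python
import numpy as np

def fit_1pl(responses: np.ndarray, n_iters: int = 2000, lr: float = 0.05):
    """Jointly fit panel abilities (theta, one per reference model) and item
    difficulties (beta, one per item) for a 1PL model on a (models x items)
    0/1 response matrix, by gradient ascent on the Bernoulli log-likelihood.
    A sum-to-zero constraint on theta fixes the scale's origin.
    """
    n_models, n_items = responses.shape
    theta = np.zeros(n_models)
    beta = np.zeros(n_items)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
        resid = responses - p                     # d loglik / d theta per cell
        theta += lr * resid.sum(axis=1) / n_items
        beta  -= lr * resid.sum(axis=0) / n_models
        theta -= theta.mean()                     # sum-to-zero gauge
    return theta, beta
```

The difficulties returned by such a fit are what the ledger treats as free parameters: estimated once from the reference panel, then frozen before any target model is placed.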

pith-pipeline@v0.9.0 · 5522 in / 1375 out tokens · 49691 ms · 2026-05-08T10:08:34.907950+00:00 · methodology

