Pith Number

pith:Q2XUZPOL

pith:2026:Q2XUZPOLM4F32YDX64DCHRK4EG
not attested · not anchored · not stored · refs resolved

Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

Christopher M. Homan, Chris Welty, Deepak Pandita, Flip Korn

Multi-level bootstrapping models annotator variance to find the N and K needed for statistically significant evaluations.

arxiv:2605.13801 v1 · 2026-05-13 · cs.LG · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{Q2XUZPOLM4F32YDX64DCHRK4EG}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

We introduce a multi-level bootstrapping approach to realistically model annotator behavior. Leveraging datasets with a large number of ratings and persistent rater identifiers, we analyze the tradeoffs between the number of items (N) and the number of responses per item (K) required to achieve statistical significance.

C2 weakest assumption

That datasets containing large numbers of ratings per item together with persistent rater identifiers are available, representative of typical evaluation settings, and that multi-level bootstrapping accurately captures real annotator variance without introducing new artifacts.

C3 one line summary

Multi-level bootstrapping models annotator variance using large rater-ID datasets to find optimal tradeoffs between number of items N and ratings per item K for statistically significant AI evaluations.
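To make the N versus K tradeoff in the claims above concrete, here is a minimal two-level bootstrap sketch. It is not the authors' procedure: it resamples items first and then responses within each sampled item, and it ignores persistent rater identifiers, which the paper relies on. The function names, the paired comparison of two systems, and the synthetic ratings are all assumptions made for illustration.

```python
import random
from statistics import mean


def resampled_mean(ratings, item_idx, k, rng):
    """Level 2: for each sampled item, draw k responses with replacement,
    then average the per-item means."""
    return mean(mean(rng.choices(ratings[i], k=k)) for i in item_idx)


def fraction_a_beats_b(ratings_a, ratings_b, n, k, trials=500, seed=0):
    """Fraction of bootstrap trials in which system A scores above system B.

    ratings_a and ratings_b are lists of per-item response lists, aligned by
    item index (hypothetical layout, not taken from the paper).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Level 1: draw n items with replacement, shared by both systems.
        item_idx = [rng.randrange(len(ratings_a)) for _ in range(n)]
        score_a = resampled_mean(ratings_a, item_idx, k, rng)
        score_b = resampled_mean(ratings_b, item_idx, k, rng)
        if score_a > score_b:
            wins += 1
    return wins / trials


def synthetic_ratings(num_items, num_responses, shift, rng):
    """Noisy 0/1 ratings with per-item difficulty (purely synthetic data)."""
    data = []
    for _ in range(num_items):
        p = min(max(rng.uniform(0.3, 0.7) + shift, 0.0), 1.0)
        data.append([1.0 if rng.random() < p else 0.0 for _ in range(num_responses)])
    return data


if __name__ == "__main__":
    gen = random.Random(42)
    a = synthetic_ratings(200, 30, shift=0.05, rng=gen)
    b = synthetic_ratings(200, 30, shift=0.0, rng=gen)
    # Same total budget N * K = 200, split differently between items and responses.
    for n, k in [(200, 1), (100, 2), (50, 4), (20, 10)]:
        print(n, k, fraction_a_beats_b(a, b, n, k))
```

Sweeping (N, K) pairs at a fixed annotation budget, as in the loop at the end, is one way to see how the split between items and responses per item affects how often a true difference is detected.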

References

31 extracted · 31 resolved · 0 Pith anchors

Receipt and verification
First computed 2026-05-18T02:44:15.506015Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

86af4cbdcb670bbd6077f70623c55c2190c2155bbf6044378f4e4a4788fa552b

Aliases

arxiv: 2605.13801 · arxiv_version: 2605.13801v1 · doi: 10.48550/arxiv.2605.13801 · pith_short_12: Q2XUZPOLM4F3 · pith_short_16: Q2XUZPOLM4F32YDX · pith_short_8: Q2XUZPOL
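The identifiers on this page appear to be related: the 26-character Pith Number matches the first 26 characters of the Base32 encoding of the canonical hash, and the pith_short_* aliases are prefixes of it. This is an observation from the values shown here, not a documented rule of the pith-number/v1.0 schema. The function name below is hypothetical.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "86af4cbdcb670bbd6077f70623c55c2190c2155bbf6044378f4e4a4788fa552b"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the leading characters.

    Assumption: inferred from this page's values, not from a specification.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_number_from_hash(CANONICAL_HASH)
# Short aliases are plain prefixes of the full identifier.
aliases = {f"pith_short_{n}": full[:n] for n in (8, 12, 16)}

print(full)
print(aliases)
```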
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/Q2XUZPOLM4F32YDX64DCHRK4EG \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 86af4cbdcb670bbd6077f70623c55c2190c2155bbf6044378f4e4a4788fa552b
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "164a9bef9af553411a717b73363bbad1d623c68a3be64100a6ded7d8d10528c4",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-13T17:22:27Z",
    "title_canon_sha256": "a29bc2911f33f194c25dd30689f81a22a966c777e44598867deef27b86c869f5"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13801",
    "kind": "arxiv",
    "version": 1
  }
}
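
The verification one-liner under Agent API can be unpacked into a short script. This sketch uses the same serialization settings as that command (sorted keys, compact separators, no ASCII escaping, UTF-8) and applies them to the record printed above. Whether the resulting digest equals the canonical hash depends on the API returning exactly this record, which is not checked here. The function names are illustrative.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and no whitespace, as in the one-liner."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_hash(record: dict) -> str:
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


# The canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "164a9bef9af553411a717b73363bbad1d623c68a3be64100a6ded7d8d10528c4",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-13T17:22:27Z",
        "title_canon_sha256": "a29bc2911f33f194c25dd30689f81a22a966c777e44598867deef27b86c869f5",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13801", "kind": "arxiv", "version": 1},
}

# Key order in the input does not matter because keys are sorted on output.
shuffled = dict(reversed(list(record.items())))

print(canonical_hash(record))
print(canonical_hash(record) == canonical_hash(shuffled))
```

Sorting keys and fixing the separators is what makes the digest reproducible across mirrors: two parties holding the same logical record produce the same bytes regardless of how their JSON library orders or spaces the output.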