Pith Number

pith:AWGGH2D4

pith:2026:AWGGH2D4NG3B6I2RCZ5MDLU535

not attested · not anchored · not stored · refs resolved

Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases

Hui Huang, Muyun Yang, Xuanxin Wu, Yuki Arase

Large reasoning models outperform standard LLMs as judges on accuracy and robustness but still carry strong evaluation biases that an explicit planning step can reduce.

arxiv:2601.03630 v2 · 2026-01-07 · cs.CL

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{AWGGH2D4NG3B6I2RCZ5MDLU535}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Large reasoning models (LRMs) outperform non-reasoning LLMs in judgment accuracy, particularly on reasoning-intensive tasks, and demonstrate superior instruction-following and robustness. They still exhibit strong evaluation biases, which PlanJudge mitigates while preserving accuracy.

C2 weakest assumption

That the chosen tasks, adversarial attacks, and bias metrics comprehensively capture real-world judgment scenarios and that observed improvements generalize beyond the tested models and datasets.

C3 one-line summary

Reasoning models judge better than non-reasoning LLMs yet retain biases; generating an evaluation plan first mitigates bias without losing accuracy.

References

12 extracted · 12 resolved · 0 Pith anchors

Receipt and verification

First computed	2026-05-17T23:39:16.710192Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

058c63e87c69b61f2351167ac1ae9ddf6bd0db3118300fbbfbddc82c4a84b427

Aliases

arxiv: 2601.03630 · arxiv_version: 2601.03630v2 · doi: 10.48550/arxiv.2601.03630 · pith_short_12: AWGGH2D4NG3B · pith_short_16: AWGGH2D4NG3B6I2R · pith_short_8: AWGGH2D4
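The values on this page suggest how the identifier relates to the canonical hash: Base32-encoding the 32 hash bytes (RFC 4648 alphabet) and keeping the first 26 characters reproduces the Pith Number shown above, and the short aliases are prefixes of it. The Python sketch below checks that observation. The rule is inferred from this single record, not taken from a published specification, and the helper name `pith_from_hash` is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "058c63e87c69b61f2351167ac1ae9ddf6bd0db3118300fbbfbddc82c4a84b427"
PITH_NUMBER = "AWGGH2D4NG3B6I2RCZ5MDLU535"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Assumption: the identifier is the first `length` characters of the
    RFC 4648 Base32 encoding of the raw SHA-256 digest bytes."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


derived = pith_from_hash(CANONICAL_HASH)

# Short aliases appear to be plain prefixes of the full identifier.
aliases = {n: derived[:n] for n in (8, 12, 16)}

print(derived == PITH_NUMBER)
print(aliases)
```

If the inferred rule holds for other records, a client could recompute an identifier from a canonical hash without contacting the resolver; that should be treated as unconfirmed until checked against the schema.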

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/AWGGH2D4NG3B6I2RCZ5MDLU535 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 058c63e87c69b61f2351167ac1ae9ddf6bd0db3118300fbbfbddc82c4a84b427

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "07229dbd403bf02fe1ca97dd890d85b09906bc427ad26ba8a4cb77eca547b26a",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-01-07T06:19:26Z",
    "title_canon_sha256": "815ace6fd9ed995b8703dba2203cf04166011cfe3041c0dda4ee93706a84ff21"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2601.03630",
    "kind": "arxiv",
    "version": 2
  }
}
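The shell pipeline under "Verify this Pith Number yourself" can also be run offline against a saved copy of the record. The sketch below repeats the same canonicalization in Python: sorted keys, compact separators, no ASCII escaping, UTF-8 bytes, then SHA-256. The function name `canonical_sha256` is hypothetical, and the digest only matches the published canonical hash if the saved record is identical to the served `canonical_record`.

```python
import hashlib
import json

# Canonical record as displayed on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "07229dbd403bf02fe1ca97dd890d85b09906bc427ad26ba8a4cb77eca547b26a",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-01-07T06:19:26Z",
    "title_canon_sha256": "815ace6fd9ed995b8703dba2203cf04166011cfe3041c0dda4ee93706a84ff21"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2601.03630",
    "kind": "arxiv",
    "version": 2
  }
}
"""


def canonical_sha256(record: dict) -> str:
    """Serialize with the same options as the shell one-liner, then hash."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)  # compare against the canonical hash listed on this page
```

Because keys are sorted before hashing, a mirror that stores the record with a different key order or indentation should still arrive at the same digest.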