Pith Number

pith:4BVJROPA

pith:2026:4BVJROPAXSUCTRS5KZBKUQADBN
not attested · not anchored · not stored · refs resolved

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

Fei Wang, Inderjit Dhillon, Nirmal Patel

Decomposing discrete rewards into ordinal binary indicators isolates evaluation noise and stabilizes policy updates in RLAIF without extra compute.

arxiv:2605.12667 v1 · 2026-05-12 · cs.LG · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{4BVJROPAXSUCTRS5KZBKUQADBN}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

ODRPO achieves robust performance on Qwen2.5-7B and Qwen3-4B models, outperforming baselines with relative improvements of up to 14.8% on FACTS-grounding-v2 and 7.5% on Alpaca-Evals. Critically, these gains are achieved with negligible training-time overhead, as ODRPO requires no additional compute per step compared to standard estimators.

C2 · weakest assumption

That decomposing discrete rewards into ordinal binary indicators structurally isolates evaluation noise and prevents outlier evaluations from corrupting the global update, as stated in the abstract description of the framework.

C3 · one line summary

ODRPO decomposes discrete rewards into ordinal binary indicators to compute independent advantages and reduce noise corruption in RLAIF policy optimization.
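
The decomposition described in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: the threshold form `1[reward >= k]`, the per-indicator group normalization, and the averaging step are all assumptions made for illustration, and every function name below is hypothetical.

```python
from statistics import mean, pstdev


def ordinal_indicators(reward, num_levels):
    """Binary indicators 1[reward >= k] for thresholds k = 1 .. num_levels - 1.

    Assumes rewards are integers in {0, ..., num_levels - 1}.
    """
    return [1.0 if reward >= k else 0.0 for k in range(1, num_levels)]


def normalized_advantages(values, eps=1e-8):
    """Mean-centered, std-scaled advantages within one group of samples."""
    mu = mean(values)
    sd = pstdev(values)
    if sd < eps:
        # Every sample agrees on this indicator, so it carries no signal.
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


def ordinal_advantages(rewards, num_levels):
    """Average of per-threshold advantages (the averaging is an assumption)."""
    per_sample = [ordinal_indicators(r, num_levels) for r in rewards]
    totals = [0.0] * len(rewards)
    for k in range(num_levels - 1):
        column = [indicators[k] for indicators in per_sample]
        for i, adv in enumerate(normalized_advantages(column)):
            totals[i] += adv
    return [t / (num_levels - 1) for t in totals]


if __name__ == "__main__":
    # One group of four sampled responses scored on a 0-4 scale.
    print(ordinal_advantages([0, 1, 1, 4], num_levels=5))
```

In this sketch each threshold is normalized on its own, so a single extreme score changes only the indicators for the thresholds it crosses, and each of those contributions is bounded by its own normalization. Whether this matches the estimator in the paper cannot be confirmed from this record.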

Receipt and verification
First computed 2026-05-18T03:09:50.304220Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

e06a98b9e0bca829c65d5642aa40030b57466e5bdfc12b0321aadb8be9a211cf

Aliases

arxiv: 2605.12667 · arxiv_version: 2605.12667v1 · doi: 10.48550/arxiv.2605.12667 · pith_short_12: 4BVJROPAXSUC · pith_short_16: 4BVJROPAXSUCTRS5 · pith_short_8: 4BVJROPA
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/4BVJROPAXSUCTRS5KZBKUQADBN \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: e06a98b9e0bca829c65d5642aa40030b57466e5bdfc12b0321aadb8be9a211cf
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "20cf49d7c089d5de6656ff33d2420fcd4c87b9714b87d760d8a9a206e4596a4b",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-12T19:17:14Z",
    "title_canon_sha256": "587504c2283391984dcacc0c61ac7c4a3ba7a095f7037440aca38a8efea03726"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12667",
    "kind": "arxiv",
    "version": 1
  }
}
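
The one-liner under "Verify this Pith Number yourself" can be unpacked into a small offline check. The sketch below mirrors the serialization settings used there (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding); the function name `canonical_sha256` is hypothetical, and the record literal is copied from the JSON shown above.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """SHA-256 of the record serialized with sorted keys and compact separators."""
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "20cf49d7c089d5de6656ff33d2420fcd4c87b9714b87d760d8a9a206e4596a4b",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-12T19:17:14Z",
        "title_canon_sha256": "587504c2283391984dcacc0c61ac7c4a3ba7a095f7037440aca38a8efea03726",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.12667", "kind": "arxiv", "version": 1},
}

if __name__ == "__main__":
    # Compare the printed digest with the value listed under "Canonical hash".
    print(canonical_sha256(record))
```

Because keys are sorted before hashing, the digest does not depend on the order in which the fields appear in the downloaded JSON.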