pith:4BVJROPA
ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization
Decomposing discrete rewards into ordinal binary indicators isolates evaluation noise and stabilizes policy updates in RLAIF without extra compute.
arxiv:2605.12667 v1 · 2026-05-12 · cs.LG · cs.AI
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{4BVJROPAXSUCTRS5KZBKUQADBN}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
ODRPO achieves robust performance on Qwen2.5-7B and Qwen3-4B models, outperforming baselines with relative improvements of up to 14.8% on FACTS-grounding-v2 and 7.5% on Alpaca-Evals. Critically, these gains come with negligible training-time overhead, as ODRPO requires no additional compute per step compared to standard estimators.
Decomposing discrete rewards into ordinal binary indicators structurally isolates evaluation noise and prevents outlier evaluations from corrupting the global update, as stated in the abstract's description of the framework.
ODRPO decomposes discrete rewards into ordinal binary indicators to compute independent advantages and reduce noise corruption in RLAIF policy optimization.
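The decomposition described in these claims can be illustrated with a small sketch. It is not the paper's implementation: the threshold indicators `1[r >= k]`, the group-wise mean/std normalization (in the style of GRPO), and the plain average across thresholds are all assumptions made for illustration, and the function names `ordinal_indicators` and `decomposed_advantages` are hypothetical.

```python
from statistics import fmean, pstdev


def ordinal_indicators(rewards, num_levels):
    """Map integer rewards in {0, ..., num_levels - 1} to binary indicators.

    Column k-1 holds 1[r >= k] for k = 1, ..., num_levels - 1.
    (Assumed form of the ordinal decomposition; hypothetical helper.)
    """
    return [[1.0 if r >= k else 0.0 for k in range(1, num_levels)] for r in rewards]


def decomposed_advantages(rewards, num_levels, eps=1e-8):
    """Compute one advantage per response from per-threshold advantages.

    Each indicator column is normalized independently within the group,
    then the per-threshold advantages are averaged. Both choices are
    assumptions, not details confirmed by the source.
    """
    indicators = ordinal_indicators(rewards, num_levels)
    num_thresholds = num_levels - 1
    advantages = [0.0] * len(rewards)
    for k in range(num_thresholds):
        column = [row[k] for row in indicators]
        mean, std = fmean(column), pstdev(column)
        for i, b in enumerate(column):
            # A threshold on which the whole group agrees has b == mean,
            # so it contributes exactly zero to every response.
            advantages[i] += (b - mean) / (std + eps) / num_thresholds
    return advantages
```

Under these assumptions, a threshold that every response in the group passes (or fails) contributes nothing, and each threshold's contribution is scaled separately, which is one way to read the claim that the advantages are computed independently.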
Receipt and verification
| First computed | 2026-05-18T03:09:50.304220Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
e06a98b9e0bca829c65d5642aa40030b57466e5bdfc12b0321aadb8be9a211cf
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/4BVJROPAXSUCTRS5KZBKUQADBN \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: e06a98b9e0bca829c65d5642aa40030b57466e5bdfc12b0321aadb8be9a211cf
Canonical record JSON
{
"metadata": {
"abstract_canon_sha256": "20cf49d7c089d5de6656ff33d2420fcd4c87b9714b87d760d8a9a206e4596a4b",
"cross_cats_sorted": [
"cs.AI"
],
"license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
"primary_cat": "cs.LG",
"submitted_at": "2026-05-12T19:17:14Z",
"title_canon_sha256": "587504c2283391984dcacc0c61ac7c4a3ba7a095f7037440aca38a8efea03726"
},
"schema_version": "1.0",
"source": {
"id": "2605.12667",
"kind": "arxiv",
"version": 1
}
}
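The hashing step of the verification command can also be written as a standalone Python function. This sketch mirrors the `python3 -c` one-liner above (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes, SHA-256); the function name `canonical_hash` is hypothetical, and the sketch makes no claim about what the service returns beyond what the command shows.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """Return the SHA-256 hex digest of a record's canonical JSON form.

    Canonical form here means: keys sorted, no whitespace between tokens,
    non-ASCII characters kept as-is, encoded as UTF-8. These settings are
    copied from the one-liner in the verification command.
    """
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

To check a record, load the canonical record JSON with `json.loads` and compare the result of `canonical_hash` with the published canonical hash. Because keys are sorted before hashing, the key order of the input does not affect the digest.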