pith:U6MVV3FT
Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation
CPPO derives advantages from contrastive Q-values to enable on-policy self-supervised RL that matches or exceeds reward-based PPO in 12 of 18 tasks.
arxiv:2605.13554 v1 · 2026-05-13 · cs.LG · cs.AI
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{U6MVV3FTONPQYZXFVZ4EWWSDO5}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
CPPO not only significantly outperforms the previous CRL baselines in 14 of 18 tasks, but also matches or exceeds the performance of PPO, which uses hand-crafted dense rewards, in 12 of the 18 tasks tested.
The method relies on advantages derived directly from contrastive Q-values providing a stable and unbiased signal suitable for on-policy PPO optimization, without introducing additional instability or requiring further corrections.
CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.
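The idea of feeding Q-derived advantages into PPO can be illustrated with a minimal sketch. This is not the paper's implementation: the baseline (a mean of Q-values over sampled actions), the function names `advantage_from_q` and `ppo_clip_term`, and the clip range `eps=0.2` are all assumptions chosen for illustration. Only the clipped surrogate itself is the standard PPO objective.

```python
import math


def advantage_from_q(q_taken, q_sampled):
    """Advantage as Q(s, a) minus a value baseline.

    Assumption: the baseline is the mean Q-value over actions sampled
    from the current policy. The record does not state how CPPO forms
    its baseline, so this is only one plausible choice.
    """
    baseline = sum(q_sampled) / len(q_sampled)
    return q_taken - baseline


def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single sample (to be maximised)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical numbers: Q-value of the action taken, and Q-values of
# three actions sampled at the same state.
adv = advantage_from_q(2.5, [1.0, 2.0, 3.0])
objective = ppo_clip_term(logp_new=-0.9, logp_old=-1.0, advantage=adv)
```

In this sketch the Q-values would come from a contrastive critic rather than from a reward signal, which is what makes the setup self-supervised.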
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-18T02:44:23.690855Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
a7995aecb3735f0c66e5ae784b5a4377527a759a4175cb10df6d94023c4fa217
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/U6MVV3FTONPQYZXFVZ4EWWSDO5 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: a7995aecb3735f0c66e5ae784b5a4377527a759a4175cb10df6d94023c4fa217
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "63390fe6dd7b862ea3da4f6b89f9776ffb51f3acf67604cdc7f36b3dcb7720df",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by-sa/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-13T13:58:49Z",
    "title_canon_sha256": "0a28ee2e3ca18cb54f5c3d4725563a9488e0b7905796b1d04cff7892f580e4dc"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13554",
    "kind": "arxiv",
    "version": 1
  }
}
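The verification command above can also be run offline as a short Python script. The sketch below mirrors the serialization used in that one-liner (sorted keys, compact separators, UTF-8 bytes, SHA-256). The helper name `canonical_sha256` is hypothetical, and the script assumes the record shown on this page is the complete canonical record; if the server returns additional fields, the digest will differ.

```python
import hashlib
import json


def canonical_sha256(record):
    """Hash a record the same way as the verification one-liner:
    sorted keys, no whitespace between tokens, UTF-8 encoded."""
    blob = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# Record copied from this page (assumed to be complete).
record = {
    "metadata": {
        "abstract_canon_sha256": "63390fe6dd7b862ea3da4f6b89f9776ffb51f3acf67604cdc7f36b3dcb7720df",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by-sa/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-13T13:58:49Z",
        "title_canon_sha256": "0a28ee2e3ca18cb54f5c3d4725563a9488e0b7905796b1d04cff7892f580e4dc",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13554", "kind": "arxiv", "version": 1},
}

# Hash listed under "Canonical hash" on this page.
EXPECTED = "a7995aecb3735f0c66e5ae784b5a4377527a759a4175cb10df6d94023c4fa217"

digest = canonical_sha256(record)
print(digest)
print("matches listed hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which fields appear in the source JSON.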