Pith Number

pith:U6MVV3FT

pith:2026:U6MVV3FTONPQYZXFVZ4EWWSDO5

not attested · not anchored · not stored · refs resolved

Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

Arnol Manuel Fokam, Arnu Pretorius, Asim Osman, Daniel Rajaonarivonivelomanantsoa, Felix Chalumeau, Juan Claude Formanek, Mark Bergh, Noah De Nicola, Omayma Mahjoub, Oussama Hidaoui, Refiloe Shabe, Ruan John de Kock, Sasha Abramowitz, Siddarth Singh, Simon Verster Du Toit, Ulrich Armel Mbou Sob

CPPO derives advantages from contrastive Q-values to enable on-policy self-supervised RL that matches reward-based PPO in most tasks.

arxiv:2605.13554 v1 · 2026-05-13 · cs.LG · cs.AI


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{U6MVV3FTONPQYZXFVZ4EWWSDO5}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim open
4. Citations open
5. Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
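The merge algorithm itself is not described on this page. As a rough illustration of why a deterministic merge lets any mirror recompute the same state, the sketch below folds events in a fixed total order. The event fields (`event_id`, `timestamp`, `field`, `value`), the function name `merge_state`, and the last-writer-wins rule are all hypothetical and chosen only for illustration.

```python
def merge_state(canonical_record, events):
    """Fold events into the record in a fixed total order (hypothetical rule)."""
    # Sorting by (timestamp, event_id) removes any dependence on arrival order.
    ordered = sorted(events, key=lambda e: (e["timestamp"], e["event_id"]))
    state = dict(canonical_record)
    seen = set()
    for event in ordered:
        if event["event_id"] in seen:  # skip duplicates delivered by several mirrors
            continue
        seen.add(event["event_id"])
        state[event["field"]] = event["value"]  # last writer wins (assumption)
    return state


record = {"source": "arxiv:2605.13554"}
events = [
    {"event_id": "e2", "timestamp": 2, "field": "citations", "value": "open"},
    {"event_id": "e1", "timestamp": 1, "field": "author_claim", "value": "open"},
]
# Two mirrors that received the events in different orders agree on the state.
assert merge_state(record, events) == merge_state(record, list(reversed(events)))
```

The key design point is the total order: as long as every mirror sorts by the same key before folding, the arrival order of events cannot change the result.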

Claims

C1 · strongest claim

CPPO not only significantly outperforms the previous CRL baselines in 14 of the 18 tasks tested, but also matches or exceeds the performance of PPO, which uses hand-crafted dense rewards, in 12 of the 18.

C2 · weakest assumption

That advantages derived directly from contrastive Q-values provide a stable and unbiased signal suitable for on-policy PPO optimization without introducing additional instability or requiring further corrections.

C3 · one line summary

CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.
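To make the idea in these claims concrete, here is a minimal sketch of how an advantage could be derived from a contrastive Q-value and passed to PPO's clipped surrogate. It is not the paper's implementation: the inner-product critic, the mean-over-candidate-actions baseline, and all function names are assumptions made for illustration. Only the clipped surrogate follows the standard PPO form.

```python
def contrastive_q(sa_embedding, goal_embedding):
    """Critic score as an inner product of embeddings (assumed critic form)."""
    return sum(x * y for x, y in zip(sa_embedding, goal_embedding))


def advantage_from_q(q_taken, q_candidates):
    """Advantage = Q of the taken action minus a baseline.

    The baseline here is the mean Q over candidate actions, an
    illustrative assumption rather than the paper's estimator.
    """
    baseline = sum(q_candidates) / len(q_candidates)
    return q_taken - baseline


def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single sample."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


goal = [1.0, 0.0]
candidates = [[2.0, 0.0], [0.0, 1.0]]  # embeddings of (state, action) pairs
q_values = [contrastive_q(sa, goal) for sa in candidates]
adv = advantage_from_q(q_values[0], q_values)  # taken action is candidate 0
objective = ppo_clip_objective(ratio=1.5, advantage=adv)
```

No reward signal appears anywhere in the sketch: the only learning signal is how well the state-action embedding aligns with the goal embedding, which is what makes the approach self-supervised.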

References

26 extracted · 26 resolved · 6 Pith anchors


Formal links

1 machine-checked theorem link

Receipt and verification

First computed	2026-05-18T02:44:23.690855Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

a7995aecb3735f0c66e5ae784b5a4377527a759a4175cb10df6d94023c4fa217

Aliases

arxiv: 2605.13554 · arxiv_version: 2605.13554v1 · doi: 10.48550/arxiv.2605.13554 · pith_short_12: U6MVV3FTONPQ · pith_short_16: U6MVV3FTONPQYZXF · pith_short_8: U6MVV3FT
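The identifiers on this page appear to be related. The values shown are consistent with the Pith Number being the first 26 characters of the RFC 4648 Base32 encoding of the canonical SHA-256 hash, and with each `pith_short_N` alias being an N-character prefix of it. This relationship is inferred from the values listed here, not from a published specification, so treat the sketch below as an observation rather than the official derivation. The helper name `pith_number` is hypothetical.

```python
import base64

CANONICAL_HASH = "a7995aecb3735f0c66e5ae784b5a4377527a759a4175cb10df6d94023c4fa217"


def pith_number(hash_hex, length=26):
    """Base32-encode the raw digest and keep the first `length` characters.

    The 26-character truncation is inferred from this page (assumption).
    """
    encoded = base64.b32encode(bytes.fromhex(hash_hex)).decode("ascii")
    return encoded[:length]


full = pith_number(CANONICAL_HASH)
aliases = {f"pith_short_{n}": full[:n] for n in (8, 12, 16)}
print(full)  # U6MVV3FTONPQYZXFVZ4EWWSDO5
print(aliases["pith_short_8"])  # U6MVV3FT
```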

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/U6MVV3FTONPQYZXFVZ4EWWSDO5 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: a7995aecb3735f0c66e5ae784b5a4377527a759a4175cb10df6d94023c4fa217

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "63390fe6dd7b862ea3da4f6b89f9776ffb51f3acf67604cdc7f36b3dcb7720df",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by-sa/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-13T13:58:49Z",
    "title_canon_sha256": "0a28ee2e3ca18cb54f5c3d4725563a9488e0b7905796b1d04cff7892f580e4dc"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13554",
    "kind": "arxiv",
    "version": 1
  }
}
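The shell pipeline under "Verify this Pith Number yourself" can also be reproduced offline. The sketch below applies the same canonicalisation as the one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) to the record shown above. It assumes the JSON printed on this page matches the `canonical_record` returned by the resolver. The function name `canonical_sha256` is hypothetical.

```python
import hashlib
import json


def canonical_sha256(record):
    """SHA-256 over JSON with sorted keys, compact separators and raw UTF-8."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = {
    "metadata": {
        "abstract_canon_sha256": "63390fe6dd7b862ea3da4f6b89f9776ffb51f3acf67604cdc7f36b3dcb7720df",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by-sa/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-13T13:58:49Z",
        "title_canon_sha256": "0a28ee2e3ca18cb54f5c3d4725563a9488e0b7905796b1d04cff7892f580e4dc",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13554", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare with the Canonical hash listed on this page
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror happens to serialise the record.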