{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:U6MVV3FTONPQYZXFVZ4EWWSDO5","short_pith_number":"pith:U6MVV3FT","schema_version":"1.0","canonical_sha256":"a7995aecb3735f0c66e5ae784b5a4377527a759a4175cb10df6d94023c4fa217","source":{"kind":"arxiv","id":"2605.13554","version":1},"attestation_state":"computed","paper":{"title":"Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"CPPO derives advantages from contrastive Q-values to enable on-policy self-supervised RL that matches reward-based PPO in most tasks.","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"Arnol Manuel Fokam, Arnu Pretorius, Asim Osman, Daniel Rajaonarivonivelomanantsoa, Felix Chalumeau, Juan Claude Formanek, Mark Bergh, Noah De Nicola, Omayma Mahjoub, Oussama Hidaoui, Refiloe Shabe, Ruan John de Kock, Sasha Abramowitz, Siddarth Singh, Simon Verster Du Toit, Ulrich Armel Mbou Sob","submitted_at":"2026-05-13T13:58:49Z","abstract_excerpt":"Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces, with little research invested in discrete environments. This leaves CRL disconnected from widely used and effective, modern on-policy training pipelines adopted across both single-agent and multi-agent "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2605.13554","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by-sa/4.0/","primary_cat":"cs.LG","submitted_at":"2026-05-13T13:58:49Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"0a28ee2e3ca18cb54f5c3d4725563a9488e0b7905796b1d04cff7892f580e4dc","abstract_canon_sha256":"63390fe6dd7b862ea3da4f6b89f9776ffb51f3acf67604cdc7f36b3dcb7720df"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T02:44:23.691281Z","signature_b64":"HZ+tsar0mlb2fTzM8vX6oqzFZn3LrGhuVNN4NhRibh/luVba9fsh9g3POz2l9IzJ3qNvT3chvOw/4E1qb+n4AA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"a7995aecb3735f0c66e5ae784b5a4377527a759a4175cb10df6d94023c4fa217","last_reissued_at":"2026-05-18T02:44:23.690855Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T02:44:23.690855Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"CPPO derives advantages from contrastive Q-values to enable on-policy self-supervised RL that matches reward-based PPO in most tasks.","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"Arnol Manuel Fokam, Arnu Pretorius, Asim Osman, Daniel Rajaonarivonivelomanantsoa, Felix Chalumeau, Juan Claude Formanek, Mark Bergh, Noah De Nicola, Omayma Mahjoub, Oussama Hidaoui, Refiloe Shabe, Ruan John de Kock, Sasha Abramowitz, Siddarth Singh, Simon Verster Du Toit, Ulrich Armel Mbou Sob","submitted_at":"2026-05-13T13:58:49Z","abstract_excerpt":"Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces, with little research invested in discrete environments. This leaves CRL disconnected from widely used and effective, modern on-policy training pipelines adopted across both single-agent and multi-agent "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"CPPO not only significantly outperforms the previous CRL baselines in 14 out of 18 tasks, but also matches or exceeds PPO's performance, which uses hand-crafted dense rewards, in 12 out of the 18 tasks tested.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That advantages derived directly from contrastive Q-values provide a stable and unbiased signal suitable for on-policy PPO optimization without introducing additional instability or requiring further corrections.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"CPPO derives advantages from contrastive Q-values to enable on-policy self-supervised RL that matches reward-based PPO in most tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4b0e6e5acc23e9a97334c831b30dbf1b5c1436995f83a2fa070d226df171a6dc"},"source":{"id":"2605.13554","kind":"arxiv","version":1},"verdict":{"id":"76c73591-e3c2-4796-8faf-7264f8d2859f","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T20:08:03.482747Z","strongest_claim":"CPPO not only significantly outperforms the previous CRL baselines in 14 out of 18 tasks, but also matches or exceeds PPO's performance, which uses hand-crafted dense rewards, in 12 out of the 18 tasks tested.","one_line_summary":"CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That advantages derived directly from contrastive Q-values provide a stable and unbiased signal suitable for on-policy PPO optimization without introducing additional instability or requiring further corrections.","pith_extraction_headline":"CPPO derives advantages from contrastive Q-values to enable on-policy self-supervised RL that matches reward-based PPO in most tasks."},"references":{"count":26,"sample":[{"doi":"10.1145/3520304.3528937","year":null,"title":"ISBN 9781450392686","work_id":"c59f5744-8a15-48cc-b9d2-0b874431da1b","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Demystifying the mechanisms behind emergent exploration in goal-conditioned rl.arXiv preprint arXiv:2510.14129,","work_id":"01d08c4c-5754-432c-af58-93779661e7a0","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Felix Book, Arne Traue, Maximilian Schenke, Barnabas Haucke-Korber, and Oliver Wallscheid","work_id":"64834bf5-ebb6-4b6f-9300-49f2ada83fbf","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Accelerating goal-conditioned RL algorithms and research","work_id":"f9e535c2-b74c-4dcf-bb61-30989835ae57","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2011,"title":"arXiv preprint arXiv:2107.01460 , year=","work_id":"8f6e3282-7149-476f-a158-df2f822f8a72","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":26,"snapshot_sha256":"28914bede60a6c6690924607473fc57d2e59bdee970ff7f115364da4749fb08b","internal_anchors":6},"formal_canon":{"evidence_count":1,"snapshot_sha256":"3aa5113e4f8a4cec8ebea3355c98501d43403f97a6481b657ebbfa01e1a327bd"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2605.13554","created_at":"2026-05-18T02:44:23.690927+00:00"},{"alias_kind":"arxiv_version","alias_value":"2605.13554v1","created_at":"2026-05-18T02:44:23.690927+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.13554","created_at":"2026-05-18T02:44:23.690927+00:00"},{"alias_kind":"pith_short_12","alias_value":"U6MVV3FTONPQ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"U6MVV3FTONPQYZXF","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"U6MVV3FT","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/U6MVV3FTONPQYZXFVZ4EWWSDO5","json":"https://pith.science/pith/U6MVV3FTONPQYZXFVZ4EWWSDO5.json","graph_json":"https://pith.science/api/pith-number/U6MVV3FTONPQYZXFVZ4EWWSDO5/graph.json","events_json":"https://pith.science/api/pith-number/U6MVV3FTONPQYZXFVZ4EWWSDO5/events.json","paper":"https://pith.science/paper/U6MVV3FT"},"agent_actions":{"view_html":"https://pith.science/pith/U6MVV3FTONPQYZXFVZ4EWWSDO5","download_json":"https://pith.science/pith/U6MVV3FTONPQYZXFVZ4EWWSDO5.json","view_paper":"https://pith.science/paper/U6MVV3FT","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2605.13554&json=true","fetch_graph":"https://pith.science/api/pith-number/U6MVV3FTONPQYZXFVZ4EWWSDO5/graph.json","fetch_events":"https://pith.science/api/pith-number/U6MVV3FTONPQYZXFVZ4EWWSDO5/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/U6MVV3FTONPQYZXFVZ4EWWSDO5/action/timestamp_anchor","attest_storage":"https://pith.science/pith/U6MVV3FTONPQYZXFVZ4EWWSDO5/action/storage_attestation","attest_author":"https://pith.science/pith/U6MVV3FTONPQYZXFVZ4EWWSDO5/action/author_attestation","sign_citation":"https://pith.science/pith/U6MVV3FTONPQYZXFVZ4EWWSDO5/action/citation_signature","submit_replication":"https://pith.science/pith/U6MVV3FTONPQYZXFVZ4EWWSDO5/action/replication_record"}},"created_at":"2026-05-18T02:44:23.690927+00:00","updated_at":"2026-05-18T02:44:23.690927+00:00"}