{"state_type":"pith_open_graph_state","state_version":"1.0","pith_number":"pith:2026:GY43OMZ3V66BX2ECGX6W2HOQP4","merge_version":"pith-open-graph-merge-v1","event_count":2,"valid_event_count":2,"invalid_event_count":0,"equivocation_count":0,"current":{"canonical_record":{"metadata":{"abstract_canon_sha256":"f2b0b5545aeada88c5c61f59a3fe22d06cfb6430a1b9d68a2b9b87d21a91f7dd","cross_cats_sorted":[],"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2026-02-06T08:03:11Z","title_canon_sha256":"6ff29da67c0f21410f70c05d05895ed21596719fc205159b6d48e4dd92eb0618"},"schema_version":"1.0","source":{"id":"2602.06475","kind":"arxiv","version":2}},"source_aliases":[{"alias_kind":"arxiv","alias_value":"2602.06475","created_at":"2026-05-18T03:09:23Z"},{"alias_kind":"arxiv_version","alias_value":"2602.06475v2","created_at":"2026-05-18T03:09:23Z"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2602.06475","created_at":"2026-05-18T03:09:23Z"},{"alias_kind":"pith_short_12","alias_value":"GY43OMZ3V66B","created_at":"2026-05-18T12:33:37Z"},{"alias_kind":"pith_short_16","alias_value":"GY43OMZ3V66BX2EC","created_at":"2026-05-18T12:33:37Z"},{"alias_kind":"pith_short_8","alias_value":"GY43OMZ3","created_at":"2026-05-18T12:33:37Z"}],"graph_snapshots":[{"event_id":"sha256:5631d3fa37e2926f70008a955fda733f077edebe401aeba379f578a9c5d23ad7","target":"graph","created_at":"2026-05-18T03:09:23Z","signer":{"key_id":"pith-v1-2026-05","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","signer_id":"pith.science","signer_type":"pith_registry"},"payload":{"graph_snapshot":{"author_claims":{"count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","strong_count":0},"builder_version":"pith-number-builder-2026-05-17-v1","claims":{"count":4,"items":[{"attestation":"unclaimed","claim_id":"C1","kind":"strongest_claim","source":"verdict.strongest_claim","status":"machine_extracted","text":"We propose Group Causal Counterfactual Policy Optimization to explicitly train LLMs to learn generalizable reasoning patterns. It proposes an episodic causal counterfactual reward that jointly captures (i) robustness, encouraging the answer distribution induced by a reasoning step to remain stable under counterfactual perturbations; and (ii) effectiveness, enforcing sufficient variability so that the learned reasoning strategy can transfer across questions."},{"attestation":"unclaimed","claim_id":"C2","kind":"weakest_assumption","source":"verdict.weakest_assumption","status":"machine_extracted","text":"That multi-candidate reasoning trajectories for a fixed question can be validly interpreted as a family of counterfactual experiments with sufficient theoretical support, and that the resulting robustness and effectiveness reward will produce reasoning patterns that generalize without introducing new failure modes or biases."},{"attestation":"unclaimed","claim_id":"C3","kind":"one_line_summary","source":"verdict.one_line_summary","status":"machine_extracted","text":"Group Causal Counterfactual Policy Optimization trains LLMs on generalizable reasoning by defining episodic rewards for counterfactual robustness and transferability then optimizing the policy with token-level advantages."},{"attestation":"unclaimed","claim_id":"C4","kind":"headline","source":"verdict.pith_extraction.headline","status":"machine_extracted","text":"Treating multiple reasoning paths for one question as counterfactual experiments trains LLMs to favor stable and transferable reasoning patterns over lucky guesses."}],"snapshot_sha256":"ddacc500c2b9e32c8303b6154e5ffe472607665e4f672b628be814a62383cc00"},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"paper":{"abstract_excerpt":"Large language models (LLMs) excel at complex tasks with advances in reasoning capabilities. However, existing reward mechanisms remain tightly coupled to final correctness and pay little attention to the underlying reasoning process: trajectories with sound reasoning but wrong answers receive low credit, while lucky guesses with flawed logic may be highly rewarded, affecting reasoning generalization. From a causal perspective, we interpret multi-candidate reasoning for a fixed question as a family of counterfactual experiments with theoretical supports. Building on this, we propose Group Caus","authors_text":"Changwen Zheng, Huijie Guo, Hui Xiong, Jiahuan Zhou, Jingyao Wang, Peizheng Guo, Wenwen Qiang","cross_cats":[],"headline":"Treating multiple reasoning paths for one question as counterfactual experiments trains LLMs to favor stable and transferable reasoning patterns over lucky guesses.","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2026-02-06T08:03:11Z","title":"Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning"},"references":{"count":19,"internal_anchors":8,"resolved_work":19,"sample":[{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":1,"title":"Training language models to reason efficiently","work_id":"ddc0d049-f0ca-4946-aa25-6843f6072e08","year":null},{"cited_arxiv_id":"2107.03374","doi":"","is_internal_anchor":true,"ref_index":2,"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","year":null},{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":3,"title":"Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models","work_id":"e7962a8e-fc74-4d96-9a1b-3c7897f6c60d","year":null},{"cited_arxiv_id":"2110.14168","doi":"","is_internal_anchor":true,"ref_index":4,"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","year":null},{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":5,"title":"Group causal policy optimization for post-training large language models.arXiv preprint arXiv:2508.05428,","work_id":"7c0a4979-d2cf-4925-be8d-10b6ea3b1b52","year":null}],"snapshot_sha256":"d31a5b8b9918a54e71d0b2f490d0209c49cb47b7d7f4cd239c1e1f722e324eb3"},"source":{"id":"2602.06475","kind":"arxiv","version":2},"verdict":{"created_at":"2026-05-16T07:18:03.368093Z","id":"4e68fa02-2d1d-429f-8588-f614cda1b38f","model_set":{"reader":"grok-4.3"},"one_line_summary":"Group Causal Counterfactual Policy Optimization trains LLMs on generalizable reasoning by defining episodic rewards for counterfactual robustness and transferability then optimizing the policy with token-level advantages.","pipeline_version":"pith-pipeline@v0.9.0","pith_extraction_headline":"Treating multiple reasoning paths for one question as counterfactual experiments trains LLMs to favor stable and transferable reasoning patterns over lucky guesses.","strongest_claim":"We propose Group Causal Counterfactual Policy Optimization to explicitly train LLMs to learn generalizable reasoning patterns. It proposes an episodic causal counterfactual reward that jointly captures (i) robustness, encouraging the answer distribution induced by a reasoning step to remain stable under counterfactual perturbations; and (ii) effectiveness, enforcing sufficient variability so that the learned reasoning strategy can transfer across questions.","weakest_assumption":"That multi-candidate reasoning trajectories for a fixed question can be validly interpreted as a family of counterfactual experiments with sufficient theoretical support, and that the resulting robustness and effectiveness reward will produce reasoning patterns that generalize without introducing new failure modes or biases."}},"verdict_id":"4e68fa02-2d1d-429f-8588-f614cda1b38f"}}],"author_attestations":[],"timestamp_anchors":[],"storage_attestations":[],"citation_signatures":[],"replication_records":[],"corrections":[],"mirror_hints":[],"record_created":{"event_id":"sha256:a5947bbff03ac4d7c6a7bb4f5a2780ebb70b583edd1e4ef34b64874b8f85e4c5","target":"record","created_at":"2026-05-18T03:09:23Z","signer":{"key_id":"pith-v1-2026-05","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","signer_id":"pith.science","signer_type":"pith_registry"},"payload":{"attestation_state":"computed","canonical_record":{"metadata":{"abstract_canon_sha256":"f2b0b5545aeada88c5c61f59a3fe22d06cfb6430a1b9d68a2b9b87d21a91f7dd","cross_cats_sorted":[],"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2026-02-06T08:03:11Z","title_canon_sha256":"6ff29da67c0f21410f70c05d05895ed21596719fc205159b6d48e4dd92eb0618"},"schema_version":"1.0","source":{"id":"2602.06475","kind":"arxiv","version":2}},"canonical_sha256":"3639b7333bafbc1be88235fd6d1dd07f24fd3d19aaa6381bb3febc6451ec41e9","receipt":{"algorithm":"ed25519","builder_version":"pith-number-builder-2026-05-17-v1","canonical_sha256":"3639b7333bafbc1be88235fd6d1dd07f24fd3d19aaa6381bb3febc6451ec41e9","first_computed_at":"2026-05-18T03:09:23.754563Z","key_id":"pith-v1-2026-05","kind":"pith_receipt","last_reissued_at":"2026-05-18T03:09:23.754563Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","receipt_version":"0.3","signature_b64":"HIwSqlpSkPaCVQeQDUcx7u4ZDAiKmsxjaRa9kFxbGB4Eh1RdGmeXBUcitK8u6UzWapDP3pt49Io1Yaht3YHBCw==","signature_status":"signed_v1","signed_at":"2026-05-18T03:09:23.755334Z","signed_message":"canonical_sha256_bytes"},"source_id":"2602.06475","source_kind":"arxiv","source_version":2}}},"equivocations":[],"invalid_events":[],"applied_event_ids":["sha256:a5947bbff03ac4d7c6a7bb4f5a2780ebb70b583edd1e4ef34b64874b8f85e4c5","sha256:5631d3fa37e2926f70008a955fda733f077edebe401aeba379f578a9c5d23ad7"],"state_sha256":"70089d04328b3a33621bf2fe7e292d4e93878e8bf43a8c9b2fe805c3fbc48835"}