{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:SW3MFUY7E3GKINHIMFFDXQQHKX","short_pith_number":"pith:SW3MFUY7","schema_version":"1.0","canonical_sha256":"95b6c2d31f26cca434e8614a3bc20755fd77a64cc1c201bead4661616441c705","source":{"kind":"arxiv","id":"2605.14297","version":1},"attestation_state":"computed","paper":{"title":"Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Hybrid Policy Optimization mixes pathwise and score-function gradients to keep policy updates unbiased in hybrid discrete-continuous action spaces.","cross_cats":["cs.AI","math.OC","stat.ML"],"primary_cat":"cs.LG","authors_text":"Daniel Russo, Matias Alvo, Yash Kanoria","submitted_at":"2026-05-14T02:59:45Z","abstract_excerpt":"We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it -- a structure common in robotics, control, and operations problems. Standard model-free policy gradient methods rely on score-function (SF) estimators and suffer from severe credit-assignment issues in high-dimensional settings, leading to poor gradient quality. On the other hand, differentiable simulation largely sidesteps these issues by backpropagating through a simulator, but the presence of "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2605.14297","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2026-05-14T02:59:45Z","cross_cats_sorted":["cs.AI","math.OC","stat.ML"],"title_canon_sha256":"e1114d9c38f5a4309d09384a406f3e6b004dc368b29ae41ae882ac0db629b4f5","abstract_canon_sha256":"e02d514969680dbb6e652b72a68618d3483c1c5330c1af21e75c6a72910658bf"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:10.133136Z","signature_b64":"gZWlIvm/WX5tPee+b+2VLAS3GJEe6ws6+WwhjPBpIthNIfTgwmPgd0dotAyn+CzZzMReib6xXu0jZ3+nbBftBQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"95b6c2d31f26cca434e8614a3bc20755fd77a64cc1c201bead4661616441c705","last_reissued_at":"2026-05-17T23:39:10.132625Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:10.132625Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Hybrid Policy Optimization mixes pathwise and score-function gradients to keep policy updates unbiased in hybrid discrete-continuous action spaces.","cross_cats":["cs.AI","math.OC","stat.ML"],"primary_cat":"cs.LG","authors_text":"Daniel Russo, Matias Alvo, Yash Kanoria","submitted_at":"2026-05-14T02:59:45Z","abstract_excerpt":"We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it -- a structure common in robotics, control, and operations problems. Standard model-free policy gradient methods rely on score-function (SF) estimators and suffer from severe credit-assignment issues in high-dimensional settings, leading to poor gradient quality. On the other hand, differentiable simulation largely sidesteps these issues by backpropagating through a simulator, but the presence of "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we propose Hybrid Policy Optimization (HPO), which backpropagates through the simulator wherever smoothness permits, using a mixed gradient estimator that combines pathwise and SF gradients while maintaining unbiasedness. We also show how problems with action discontinuities can be reformulated in hybrid form... Empirically, HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator problems, with performance gaps increasing as the continuous action dimension grows.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The mixed gradient estimator maintains unbiasedness despite the combination of pathwise and score-function components, and that the simulator allows backpropagation where smoothness permits without introducing bias from discrete actions.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Hybrid Policy Optimization mixes pathwise and score-function gradients to keep policy updates unbiased in hybrid discrete-continuous action spaces.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"78286b4f6bb9c9039e66889001ec11d0342629fdc318bdc05820d0da7b312e59"},"source":{"id":"2605.14297","kind":"arxiv","version":1},"verdict":{"id":"c6107710-0e46-4588-b01e-ef586618a85c","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T02:09:09.475787Z","strongest_claim":"we propose Hybrid Policy Optimization (HPO), which backpropagates through the simulator wherever smoothness permits, using a mixed gradient estimator that combines pathwise and SF gradients while maintaining unbiasedness. We also show how problems with action discontinuities can be reformulated in hybrid form... Empirically, HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator problems, with performance gaps increasing as the continuous action dimension grows.","one_line_summary":"HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The mixed gradient estimator maintains unbiasedness despite the combination of pathwise and score-function components, and that the simulator allows backpropagation where smoothness permits without introducing bias from discrete actions.","pith_extraction_headline":"Hybrid Policy Optimization mixes pathwise and score-function gradients to keep policy updates unbiased in hybrid discrete-continuous action spaces."},"references":{"count":118,"sample":[{"doi":"","year":1992,"title":"Machine learning , volume=","work_id":"15236db4-ba67-4457-ba1f-f1c64db32413","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2015,"title":"International conference on machine learning , pages=","work_id":"1b004632-ad43-4f04-ba31-468765de30ad","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Dynamic programming and optimal control 3rd edition, volume ii , author=","work_id":"48e8a747-5b40-4a90-97d6-795c97ffd533","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Advances in neural information processing systems , volume=","work_id":"fb4b96c7-3106-4baa-86ba-34546a03ba2a","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Advances in neural information processing systems , volume=","work_id":"87fb4db8-b97a-4595-92bb-55bba75eb86c","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":118,"snapshot_sha256":"751231bd1eb2b4f6b18748579e57a246eb352586781a8b6c5e6f82947417d7d7","internal_anchors":12},"formal_canon":{"evidence_count":2,"snapshot_sha256":"00389b8db5ade0f6271c44948ef79ca801757a3c5a08f11e76560cab2dc1ea86"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2605.14297","created_at":"2026-05-17T23:39:10.132705+00:00"},{"alias_kind":"arxiv_version","alias_value":"2605.14297v1","created_at":"2026-05-17T23:39:10.132705+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.14297","created_at":"2026-05-17T23:39:10.132705+00:00"},{"alias_kind":"pith_short_12","alias_value":"SW3MFUY7E3GK","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"SW3MFUY7E3GKINHI","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"SW3MFUY7","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/SW3MFUY7E3GKINHIMFFDXQQHKX","json":"https://pith.science/pith/SW3MFUY7E3GKINHIMFFDXQQHKX.json","graph_json":"https://pith.science/api/pith-number/SW3MFUY7E3GKINHIMFFDXQQHKX/graph.json","events_json":"https://pith.science/api/pith-number/SW3MFUY7E3GKINHIMFFDXQQHKX/events.json","paper":"https://pith.science/paper/SW3MFUY7"},"agent_actions":{"view_html":"https://pith.science/pith/SW3MFUY7E3GKINHIMFFDXQQHKX","download_json":"https://pith.science/pith/SW3MFUY7E3GKINHIMFFDXQQHKX.json","view_paper":"https://pith.science/paper/SW3MFUY7","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2605.14297&json=true","fetch_graph":"https://pith.science/api/pith-number/SW3MFUY7E3GKINHIMFFDXQQHKX/graph.json","fetch_events":"https://pith.science/api/pith-number/SW3MFUY7E3GKINHIMFFDXQQHKX/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/SW3MFUY7E3GKINHIMFFDXQQHKX/action/timestamp_anchor","attest_storage":"https://pith.science/pith/SW3MFUY7E3GKINHIMFFDXQQHKX/action/storage_attestation","attest_author":"https://pith.science/pith/SW3MFUY7E3GKINHIMFFDXQQHKX/action/author_attestation","sign_citation":"https://pith.science/pith/SW3MFUY7E3GKINHIMFFDXQQHKX/action/citation_signature","submit_replication":"https://pith.science/pith/SW3MFUY7E3GKINHIMFFDXQQHKX/action/replication_record"}},"created_at":"2026-05-17T23:39:10.132705+00:00","updated_at":"2026-05-17T23:39:10.132705+00:00"}