{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:R2MVQSCXWXV7LP42AYNCDGFLR6","short_pith_number":"pith:R2MVQSCX","schema_version":"1.0","canonical_sha256":"8e99584857b5ebf5bf9a061a2198ab8f92e0fefc8f5199cd514f398f85b9462c","source":{"kind":"arxiv","id":"2602.22495","version":3},"attestation_state":"computed","paper":{"title":"Reinforcement-aware Knowledge Distillation for LLM Reasoning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"RLAD enables better distillation of reasoning LLMs by imitating the teacher selectively during policy updates.","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"Dhananjay Ram, Shuli Jiang, Shuo Yang, Stefano Soatto, Wei Xia, Yantao Shen, Yuting Zhang, Zhaoyang Zhang, Zhuowen Tu","submitted_at":"2026-02-26T00:20:39Z","abstract_excerpt":"Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2602.22495","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2026-02-26T00:20:39Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"5c443d568afefd33c13d07cff1913eafe48bc17f95dc977340c2973acbd21553","abstract_canon_sha256":"9fc3211235a91ef9f4e24e5340f6e9b9d74918b26850234a2a52ae235f7949f6"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-06-19T16:12:18.947242Z","signature_b64":"L/C8kR1yng+5ENcpCBtdF5/8qQiiboCOP/izT4PAH4x7SX39PHUomFxPpla05DIBIqnKndzw3HxJDvIVpadhBA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"8e99584857b5ebf5bf9a061a2198ab8f92e0fefc8f5199cd514f398f85b9462c","last_reissued_at":"2026-06-19T16:12:18.946766Z","signature_status":"signed_v1","first_computed_at":"2026-06-19T16:12:18.946766Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Reinforcement-aware Knowledge Distillation for LLM Reasoning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"RLAD enables better distillation of reasoning LLMs by imitating the teacher selectively during policy updates.","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"Dhananjay Ram, Shuli Jiang, Shuo Yang, Stefano Soatto, Wei Xia, Yantao Shen, Yuting Zhang, Zhaoyang Zhang, Zhuowen Tu","submitted_at":"2026-02-26T00:20:39Z","abstract_excerpt":"Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That guiding the student toward the teacher only when it improves the current policy update will reliably avoid distribution mismatch and objective interference without introducing new instabilities or requiring additional hyperparameter tuning.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"RLAD replaces standard KL-based distillation with Trust Region Ratio Distillation, a PPO-style likelihood ratio objective that performs advantage-aware imitation on student rollouts and outperforms offline KD, GRPO, and KL on-policy KD on logic and math benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"RLAD enables better distillation of reasoning LLMs by imitating the teacher selectively during policy updates.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"801bf2559c547c7cc79448d1ec3eb9031eefe7e4801c2f01723a689f98daaa2a"},"source":{"id":"2602.22495","kind":"arxiv","version":3},"verdict":{"id":"30034ae4-366e-443a-ac64-4c51ceaee026","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T19:39:05.412985Z","strongest_claim":"Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.","one_line_summary":"RLAD replaces standard KL-based distillation with Trust Region Ratio Distillation, a PPO-style likelihood ratio objective that performs advantage-aware imitation on student rollouts and outperforms offline KD, GRPO, and KL on-policy KD on logic and math benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That guiding the student toward the teacher only when it improves the current policy update will reliably avoid distribution mismatch and objective interference without introducing new instabilities or requiring additional hyperparameter tuning.","pith_extraction_headline":"RLAD enables better distillation of reasoning LLMs by imitating the teacher selectively during policy updates."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2602.22495/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"2a15af77ebaec38f81cd023bcf478d0fc039dc2ed1a3651001b9b80e99b330e7"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2602.22495","created_at":"2026-06-19T16:12:18.946821+00:00"},{"alias_kind":"arxiv_version","alias_value":"2602.22495v3","created_at":"2026-06-19T16:12:18.946821+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2602.22495","created_at":"2026-06-19T16:12:18.946821+00:00"},{"alias_kind":"pith_short_12","alias_value":"R2MVQSCXWXV7","created_at":"2026-06-19T16:12:18.946821+00:00"},{"alias_kind":"pith_short_16","alias_value":"R2MVQSCXWXV7LP42","created_at":"2026-06-19T16:12:18.946821+00:00"},{"alias_kind":"pith_short_8","alias_value":"R2MVQSCX","created_at":"2026-06-19T16:12:18.946821+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":4,"internal_anchor_count":4,"sample":[{"citing_arxiv_id":"2605.12652","citing_title":"Multi-Rollout On-Policy Distillation via Peer Successes and Failures","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03128","citing_title":"Self-Distilled RLVR","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10674","citing_title":"Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07274","citing_title":"Structured Role-Aware Policy Optimization for Multimodal Reasoning","ref_index":26,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/R2MVQSCXWXV7LP42AYNCDGFLR6","json":"https://pith.science/pith/R2MVQSCXWXV7LP42AYNCDGFLR6.json","graph_json":"https://pith.science/api/pith-number/R2MVQSCXWXV7LP42AYNCDGFLR6/graph.json","events_json":"https://pith.science/api/pith-number/R2MVQSCXWXV7LP42AYNCDGFLR6/events.json","paper":"https://pith.science/paper/R2MVQSCX"},"agent_actions":{"view_html":"https://pith.science/pith/R2MVQSCXWXV7LP42AYNCDGFLR6","download_json":"https://pith.science/pith/R2MVQSCXWXV7LP42AYNCDGFLR6.json","view_paper":"https://pith.science/paper/R2MVQSCX","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2602.22495&json=true","fetch_graph":"https://pith.science/api/pith-number/R2MVQSCXWXV7LP42AYNCDGFLR6/graph.json","fetch_events":"https://pith.science/api/pith-number/R2MVQSCXWXV7LP42AYNCDGFLR6/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/R2MVQSCXWXV7LP42AYNCDGFLR6/action/timestamp_anchor","attest_storage":"https://pith.science/pith/R2MVQSCXWXV7LP42AYNCDGFLR6/action/storage_attestation","attest_author":"https://pith.science/pith/R2MVQSCXWXV7LP42AYNCDGFLR6/action/author_attestation","sign_citation":"https://pith.science/pith/R2MVQSCXWXV7LP42AYNCDGFLR6/action/citation_signature","submit_replication":"https://pith.science/pith/R2MVQSCXWXV7LP42AYNCDGFLR6/action/replication_record"}},"created_at":"2026-06-19T16:12:18.946821+00:00","updated_at":"2026-06-19T16:12:18.946821+00:00"}