{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:QPK55XOEOCON22UGJKKJS3DQKV","short_pith_number":"pith:QPK55XOE","schema_version":"1.0","canonical_sha256":"83d5deddc4709cdd6a864a94996c7055788aa593dd016cfe512bacfeae05da21","source":{"kind":"arxiv","id":"2605.11215","version":2},"attestation_state":"computed","paper":{"title":"ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"ReCoVer keeps the per-iteration gradient distribution identical to failure-free LLM pre-training by holding microbatch count constant after any GPU losses.","cross_cats":["cs.AI"],"primary_cat":"cs.DC","authors_text":"Avinash Maurya, Bogdan Nicolae, Franck Cappello, Hui Zhou, Paul Hovland, Ruijie Zhang, Sheng Di, Zhengyang Wang, Zheng Zhang, Ziyue Liu","submitted_at":"2026-05-11T20:28:31Z","abstract_excerpt":"Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either focus on specific parallelism schemes or risk drifting away from a failure-free training trajectory. We propose ReCoVer, a resilient LLM pre-training system that upholds a single invariant: each iteration keeps the number of microbatches constant, ensuring per-iteration gradients remain stochastically equivalent to a failure-free run. The framework is organized as three decoupled protocol layers: (1) Fault-"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2605.11215","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","primary_cat":"cs.DC","submitted_at":"2026-05-11T20:28:31Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"91e4fbfce92febf85a941d4a82beae08b28fc6555cf7a0e0e3d3ff02ac94ab6b","abstract_canon_sha256":"bc2338b3a12770977420da8129c7e8404ca43f5fe48360a1c4d2aa3f6349b47c"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-25T02:01:23.074587Z","signature_b64":"y72eCCdirJc/xytIK7J1disaYwpMpAgZ63JGbQa2/BfkQUyoAP0Et0EpAe3800NHqFlduolFgfKV/G2nXke/Ag==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"83d5deddc4709cdd6a864a94996c7055788aa593dd016cfe512bacfeae05da21","last_reissued_at":"2026-05-25T02:01:23.073962Z","signature_status":"signed_v1","first_computed_at":"2026-05-25T02:01:23.073962Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"ReCoVer keeps the per-iteration gradient distribution identical to failure-free LLM pre-training by holding microbatch count constant after any GPU losses.","cross_cats":["cs.AI"],"primary_cat":"cs.DC","authors_text":"Avinash Maurya, Bogdan Nicolae, Franck Cappello, Hui Zhou, Paul Hovland, Ruijie Zhang, Sheng Di, Zhengyang Wang, Zheng Zhang, Ziyue Liu","submitted_at":"2026-05-11T20:28:31Z","abstract_excerpt":"Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either focus on specific parallelism schemes or risk drifting away from a failure-free training trajectory. We propose ReCoVer, a resilient LLM pre-training system that upholds a single invariant: each iteration keeps the number of microbatches constant, ensuring per-iteration gradients remain stochastically equivalent to a failure-free run. The framework is organized as three decoupled protocol layers: (1) Fault-"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"ReCoVer successfully preserves the training trajectory from a failure-free reference despite of 256 GPUs lost spread across the run. For comparison with checkpoint-and-restart baselines, ReCoVer demonstrates 2.23× higher effective throughput after successive failures. This advantage results in ReCoVer processing 74.9% more tokens at 234 GPU-hours.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that maintaining a constant number of microbatches per iteration across survivors, combined with the fault-tolerant collectives and in-step recovery, produces gradients that are stochastically equivalent to a failure-free run without introducing bias or divergence over long training.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"ReCoVer uses fault-tolerant collectives, in-step recovery, and dynamic microbatch redistribution to maintain training trajectory equivalence under GPU failures, delivering 2.23x higher effective throughput than checkpoint-restart on up to 512 GPUs with 256 failures.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"ReCoVer keeps the per-iteration gradient distribution identical to failure-free LLM pre-training by holding microbatch count constant after any GPU losses.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b47d8432bf270fa1b7ccf78990de751d442283e8e22538a700f526e32c5ae205"},"source":{"id":"2605.11215","kind":"arxiv","version":2},"verdict":{"id":"e91a6509-42d6-4b37-b6f5-9c2e207957a5","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-13T01:55:57.685577Z","strongest_claim":"ReCoVer successfully preserves the training trajectory from a failure-free reference despite of 256 GPUs lost spread across the run. For comparison with checkpoint-and-restart baselines, ReCoVer demonstrates 2.23× higher effective throughput after successive failures. This advantage results in ReCoVer processing 74.9% more tokens at 234 GPU-hours.","one_line_summary":"ReCoVer uses fault-tolerant collectives, in-step recovery, and dynamic microbatch redistribution to maintain training trajectory equivalence under GPU failures, delivering 2.23x higher effective throughput than checkpoint-restart on up to 512 GPUs with 256 failures.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that maintaining a constant number of microbatches per iteration across survivors, combined with the fault-tolerant collectives and in-step recovery, produces gradients that are stochastically equivalent to a failure-free run without introducing bias or divergence over long training.","pith_extraction_headline":"ReCoVer keeps the per-iteration gradient distribution identical to failure-free LLM pre-training by holding microbatch count constant after any GPU losses."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.11215/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"claim_evidence","ran_at":"2026-05-20T04:42:00.902561Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T12:40:12.459574Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T10:01:17.482223Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T08:39:58.272638Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"1571e2ab6a6d3553a45a8bea8ad93f63fe234e12c4769b27c6d184092f24d180"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"993bbae0f3ff26ec079c6132c1e2cd3a95449a029fbd1dfd489a783a9563d2cd"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2605.11215","created_at":"2026-05-25T02:01:23.074037+00:00"},{"alias_kind":"arxiv_version","alias_value":"2605.11215v2","created_at":"2026-05-25T02:01:23.074037+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.11215","created_at":"2026-05-25T02:01:23.074037+00:00"},{"alias_kind":"pith_short_12","alias_value":"QPK55XOEOCON","created_at":"2026-05-25T02:01:23.074037+00:00"},{"alias_kind":"pith_short_16","alias_value":"QPK55XOEOCON22UG","created_at":"2026-05-25T02:01:23.074037+00:00"},{"alias_kind":"pith_short_8","alias_value":"QPK55XOE","created_at":"2026-05-25T02:01:23.074037+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/QPK55XOEOCON22UGJKKJS3DQKV","json":"https://pith.science/pith/QPK55XOEOCON22UGJKKJS3DQKV.json","graph_json":"https://pith.science/api/pith-number/QPK55XOEOCON22UGJKKJS3DQKV/graph.json","events_json":"https://pith.science/api/pith-number/QPK55XOEOCON22UGJKKJS3DQKV/events.json","paper":"https://pith.science/paper/QPK55XOE"},"agent_actions":{"view_html":"https://pith.science/pith/QPK55XOEOCON22UGJKKJS3DQKV","download_json":"https://pith.science/pith/QPK55XOEOCON22UGJKKJS3DQKV.json","view_paper":"https://pith.science/paper/QPK55XOE","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2605.11215&json=true","fetch_graph":"https://pith.science/api/pith-number/QPK55XOEOCON22UGJKKJS3DQKV/graph.json","fetch_events":"https://pith.science/api/pith-number/QPK55XOEOCON22UGJKKJS3DQKV/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/QPK55XOEOCON22UGJKKJS3DQKV/action/timestamp_anchor","attest_storage":"https://pith.science/pith/QPK55XOEOCON22UGJKKJS3DQKV/action/storage_attestation","attest_author":"https://pith.science/pith/QPK55XOEOCON22UGJKKJS3DQKV/action/author_attestation","sign_citation":"https://pith.science/pith/QPK55XOEOCON22UGJKKJS3DQKV/action/citation_signature","submit_replication":"https://pith.science/pith/QPK55XOEOCON22UGJKKJS3DQKV/action/replication_record"}},"created_at":"2026-05-25T02:01:23.074037+00:00","updated_at":"2026-05-25T02:01:23.074037+00:00"}