{"paper":{"title":"Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Asteria enables practical second-order LLM training by managing optimizer state and background tasks at the runtime level.","cross_cats":["cs.LG"],"primary_cat":"cs.DC","authors_text":"Junhao Zhang, Wes Armour, Yishun Lu, Zeyu Yang","submitted_at":"2026-05-15T17:03:55Z","abstract_excerpt":"Second-order methods offer an attractive path toward more sample-efficient LLM training, but their practical use is often blocked by the systems cost of maintaining and updating large matrix-based optimizer states. We introduce \\textbf{Asteria}, a runtime system designed to remove this bottleneck by separating second-order optimization logic from the critical GPU training path. Rather than keeping all preconditioner state on the accelerator, Asteria dynamically distributes optimizer state across GPU memory, CPU memory, and optional NVMe storage according to architectural constraints and runtim"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our results suggest that second-order LLM training can be made practical not by simplifying the optimizer alone, but by rethinking how optimizer state, background computation, and distributed synchronization are managed at the runtime level.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The bounded-staleness protocol and asynchronous shadow-state preparation preserve optimizer effectiveness without introducing unacceptable latency or convergence degradation; this is invoked in the description of the distributed training protocol and the training-hook mechanism.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Asteria is a runtime system that enables second-order optimization for LLMs by dynamically distributing optimizer state across GPU, CPU, and NVMe while using asynchronous inverse-root computations and bounded-staleness synchronization.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Asteria enables practical second-order LLM training by managing optimizer state and background tasks at the runtime level.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8a923ecb098796e926da84289e311c6d4c6ae206b50bcd9d28f65fc8b7a6a489"},"source":{"id":"2605.16184","kind":"arxiv","version":1},"verdict":{"id":"ffbe5073-ba6b-4763-bc81-95bff7590dc3","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T18:25:25.742696Z","strongest_claim":"Our results suggest that second-order LLM training can be made practical not by simplifying the optimizer alone, but by rethinking how optimizer state, background computation, and distributed synchronization are managed at the runtime level.","one_line_summary":"Asteria is a runtime system that enables second-order optimization for LLMs by dynamically distributing optimizer state across GPU, CPU, and NVMe while using asynchronous inverse-root computations and bounded-staleness synchronization.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The bounded-staleness protocol and asynchronous shadow-state preparation preserve optimizer effectiveness without introducing unacceptable latency or convergence degradation; this is invoked in the description of the distributed training protocol and the training-hook mechanism.","pith_extraction_headline":"Asteria enables practical second-order LLM training by managing optimizer state and background tasks at the runtime level."},"integrity":{"clean":false,"summary":{"advisory":0,"critical":1,"by_detector":{"doi_compliance":{"total":1,"advisory":0,"critical":1,"informational":0}},"informational":0},"endpoint":"/pith/2605.16184/integrity.json","findings":[{"note":"Identifier '10.5555/3488766.3488792' is syntactically valid but the DOI registry (doi.org) returned 404, and Crossref / OpenAlex / internal corpus also have no record. The cited work could not be located through any authoritative source.","detector":"doi_compliance","severity":"critical","ref_index":22,"audited_at":"2026-05-19T18:31:07.664936Z","detected_doi":"10.5555/3488766.3488792","finding_type":"unresolvable_identifier","verdict_class":"cross_source","detected_arxiv_id":null}],"available":true,"detectors_run":[{"name":"doi_title_agreement","ran_at":"2026-05-19T18:31:18.731147Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T18:31:07.664936Z","status":"completed","version":"1.0.0","findings_count":1},{"name":"cited_work_retraction","ran_at":"2026-05-19T17:52:01.423501Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"citation_quote_validity","ran_at":"2026-05-19T17:49:47.141079Z","status":"skipped","version":"0.1.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T17:33:30.681142Z","status":"skipped","version":"1.0.0","findings_count":0},{"name":"external_links","ran_at":"2026-05-19T17:31:43.142821Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"claim_evidence","ran_at":"2026-05-19T16:41:55.417383Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"0fca3d6d21030c87ba8f5f2fb7a4f2eafe9bebbfc06a8c4bd8d64045e7436116"},"references":{"count":30,"sample":[{"doi":"","year":null,"title":"Decoupled weight decay regularization","work_id":"e46e7f06-2a03-4133-8bb3-0f5986c8c24f","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.48550/arxiv.1711.05101","year":null,"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","ref_index":2,"cited_arxiv_id":"1711.05101","is_internal_anchor":true},{"doi":"","year":2018,"title":"Shampoo: Preconditioned stochastic tensor optimization","work_id":"85339699-854e-4bf1-aa39-0d8b72be6def","ref_index":3,"cited_arxiv_id":"1802.09568","is_internal_anchor":true},{"doi":"10.48550/arxiv.2409.11321","year":2025,"title":"SOAP: Improving and Stabilizing Shampoo using Adam","work_id":"65c9d9eb-0524-49bd-9aa5-2756ce24dc7f","ref_index":4,"cited_arxiv_id":"2409.11321","is_internal_anchor":true},{"doi":"10.48550/arxiv.2","year":2026,"title":"Towards Learning Boulder Excavation with Hydraulic Excavators","work_id":"66aaca7e-8976-4bb5-928b-3e1d434b72a7","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":30,"snapshot_sha256":"186a5cd29b74add182429f715aa44d294f83549396b00224511a0f074ef6d297","internal_anchors":13},"formal_canon":{"evidence_count":2,"snapshot_sha256":"578d662e79ecc5973c566118a296ab1ae904e580bf4eb2737f9982166cde6f91"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}