PVM: Efficient shadow paging for deploying secure containers in cloud-native environment
5 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: background (1)
Citation-polarity summary: background (1)
citing papers explorer
- All is Not Lost: LLM Recovery without Checkpoints
  CheckFree recovers intermediate stage failures in pipeline-parallel LLM training via neighbor averaging; CheckFree+ adds out-of-order execution to handle first/last stages by copying neighbors, with small embedding storage, outperforming checkpointing and redundancy at 5-10% failure rates by up to
- Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations
  Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in Astra-Sim simulations and a Tofino2 prototype.
- Agentic Verification of Software Systems
  AutoRocq is an LLM agent that learns proofs on-the-fly by collaborating with the Rocq prover to verify programs on SV-COMP benchmarks and Linux kernel modules.
- FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
  FlexPipe introduces runtime pipeline refactoring for LLMs to achieve higher resource efficiency and lower latency in serverless GPU clusters with fragmentation.
- Goldilocks Isolation: High Performance VMs with Edera
  Edera presents an optimized hypervisor with paravirtualization that achieves near-Docker performance while providing stronger isolation, demonstrated via a Kubernetes-compatible runtime and driver isolation.