ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.DC 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over one year.
citing papers explorer
-
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.
-
CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over one year.