RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
Canonical reference. 80% of citing Pith papers cite this work as background.
representative citing papers
DataPRM is an environment-aware generative process reward model that improves LLM data analysis agents by 7-11% on benchmarks via active verification and reflection-aware ternary rewards.
COPUS co-adapts batch size and parallelism during LLM training, guided by goodput, and converges 3.9-8% faster on average than approaches that fix one while tuning the other.
RetroInfer introduces the wave index and wave buffer to realize sparse KV-cache attention for long-context LLM inference with up to 4.4x throughput gains while matching full-attention accuracy.
Strait cuts high-priority deadline violations in ML inference serving by 1-11 percentage points through contention modeling and priority scheduling under high GPU load.