FlashCP introduces Whole-Doc sharding, sharding-aware KV communication, and a heuristic for mixed sharding plans, claiming up to 1.63x speedup over prior CP methods for LLM training.
Packing analysis: Packing is more appropriate for large models or datasets in supervised fine-tuning. arXiv preprint arXiv:2410.08081.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.DC (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
FlashCP: Load-Balanced Communication-Efficient Context Parallelism for LLM Training