Understanding stragglers in large model training using what-if analysis

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 2

citation-polarity summary

fields: cs.DC 3
years: 2026 3
roles: background 2
polarities: background 1, support 1

representative citing papers

ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

cs.DC · 2026-05-07 · unverdicted · novelty 4.0 · 2 refs

ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.
