Amped: An analytical model for performance in distributed training of transformers

· 2023

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

browse 1 citing papers

representative citing papers

Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection

cs.DC · 2025-08-29 · unverdicted · novelty 6.0

Chameleon provides adaptive fault tolerance for distributed training by real-time selection of optimal recovery policies via a unified performance model, demonstrated with low overhead on a 32-card cluster.

citing papers explorer

Showing 1 of 1 citing paper.

Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection cs.DC · 2025-08-29 · unverdicted · none · ref 27
Chameleon provides adaptive fault tolerance for distributed training by real-time selection of optimal recovery policies via a unified performance model, demonstrated with low overhead on a 32-card cluster.

Amped: An analytical model for performance in distributed training of transformers

fields

years

verdicts

representative citing papers

citing papers explorer