Advances in Neural Information Processing Systems

Large scale distributed deep networks

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials

cs.DC · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

JanusPipe introduces SymFold and WaveK to enable efficient 3D-parallel training for conservative MLIPs, reporting 1.51x and 1.45x average throughput gains over 1F1B and Hanayo baselines on 32 GPUs.

AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

cs.CL · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

cs.LG · 2023-09-25 · accept · novelty 6.0

DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

AdaptiveLoad: Towards Efficient Video Diffusion Transformer Training

cs.DC · 2026-05-18 · unverdicted · novelty 5.0

AdaptiveLoad cuts computational imbalance in video DiT training from 39% to 18.9% and raises throughput 27.2% via memory-compute constraints and a custom LayerNorm-Modulate kernel.

citing papers explorer

JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials
cs.DC · 2026-05-18 · unverdicted · none · ref 25 · 2 links
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
cs.CL · 2026-05-01 · unverdicted · none · ref 21 · 2 links
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
cs.LG · 2023-09-25 · accept · none · ref 57
AdaptiveLoad: Towards Efficient Video Diffusion Transformer Training
cs.DC · 2026-05-18 · unverdicted · none · ref 30
