In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

4 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet

cs.DC · 2026-05-18 · unverdicted · novelty 6.0

EPIC defines a unified abstraction for in-network collectives on Ethernet with polymorphic implementations and modular design to support incremental hardware evolution.

DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication

cs.DC · 2025-11-10 · unverdicted · novelty 6.0

DMA offloads on AMD MI300X GPUs are extended to latency-bound ML communication using untapped hardware features, closing up to 4.5x performance gap versus RCCL in collectives and delivering up to 1.5x lower latency and 1.9x higher throughput in LLM inference over vLLM.

HARP: Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters

cs.DC · 2025-09-29 · unverdicted · novelty 6.0

HARP provides a fine-grained inter-operator parallel planner and a heterogeneity-aware 1F1B scheduler that together improve training throughput by 1.3x-1.6x on mixed GPU clusters compared with current homogeneous-oriented frameworks.

CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

cs.DC · 2026-05-06 · unverdicted · novelty 4.0

CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over one year.

citing papers explorer

Showing 4 of 4 citing papers.

EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet
cs.DC · 2026-05-18 · unverdicted · none · ref 69
EPIC defines a unified abstraction for in-network collectives on Ethernet with polymorphic implementations and modular design to support incremental hardware evolution.

DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication
cs.DC · 2025-11-10 · unverdicted · none · ref 25
DMA offloads on AMD MI300X GPUs are extended to latency-bound ML communication using untapped hardware features, closing up to 4.5x performance gap versus RCCL in collectives and delivering up to 1.5x lower latency and 1.9x higher throughput in LLM inference over vLLM.

HARP: Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters
cs.DC · 2025-09-29 · unverdicted · none · ref 27
HARP provides a fine-grained inter-operator parallel planner and a heterogeneity-aware 1F1B scheduler that together improve training throughput by 1.3x-1.6x on mixed GPU clusters compared with current homogeneous-oriented frameworks.

CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
cs.DC · 2026-05-06 · unverdicted · none · ref 46
CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over one year.
