Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

10 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications

cs.DC · 2026-05-18 · unverdicted · novelty 7.0

PopPy combines an ahead-of-time compiler and runtime to extract parallelism from Python compound AI applications, delivering up to 6.4x end-to-end speedups while preserving sequential semantics.

PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving

cs.DC · 2026-04-14 · unverdicted · novelty 7.0

PipeLive enables live pipeline parallelism reconfiguration for LLMs via KV cache redesign and VM-migration-inspired patching, cutting TTFT by 2.5x and reconfiguration time to under 10ms.

A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

cs.DC · 2026-05-15 · conditional · novelty 6.0

PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.

Performance Isolation and Semantic Determinism in Efficient GPU Spatial Sharing

cs.DC · 2026-03-16 · unverdicted · novelty 6.0

CoGPU resolves the tradeoff in GPU sharing by introducing GPU coroutines for semantic-preserving resource migration, delivering up to 79.2% higher training throughput and zero token mismatch in inference.

HARP: Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters

cs.DC · 2025-09-29 · unverdicted · novelty 6.0

HARP provides a fine-grained inter-operator parallel planner and a heterogeneity-aware 1F1B scheduler that together improve training throughput by 1.3x-1.6x on mixed GPU clusters compared with current homogeneous-oriented frameworks.

Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services

cs.DC · 2025-09-24 · unverdicted · novelty 6.0

Amoeba adaptively adjusts tensor parallelism at runtime for LLM inference services to handle mixed short and long context requests, delivering 1.75x-6.57x throughput gains over prior solutions in real-world trace evaluations.

eLLM: Elastic Memory Management Framework for Efficient LLM Serving

cs.DC · 2025-06-18 · unverdicted · novelty 6.0

eLLM unifies LLM memory management with virtual tensors and elastic ballooning to CPU memory, reporting 2.32x higher decoding throughput and 3x larger batch sizes for 128K inputs.

CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead

cs.AR · 2026-04-13 · unverdicted · novelty 5.0

CUTEv2 delivers a unified matrix unit architecture for CPUs that achieves over 90% GEMM utilization, 1.57-2.31x speedups on AI models, and a compact 0.53 mm² footprint while supporting cross-platform integration via asynchronous abstractions.

HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs

cs.DC · 2026-05-22

citing papers explorer

Showing 10 of 10 citing papers.

Efficient Training on Multiple Consumer GPUs with RoundPipe · cs.DC · 2026-04-29 · conditional · none · ref 7
PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications · cs.DC · 2026-05-18 · unverdicted · none · ref 18
PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving · cs.DC · 2026-04-14 · unverdicted · none · ref 5
A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM · cs.DC · 2026-05-15 · conditional · none · ref 5
Performance Isolation and Semantic Determinism in Efficient GPU Spatial Sharing · cs.DC · 2026-03-16 · unverdicted · none · ref 16
HARP: Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters · cs.DC · 2025-09-29 · unverdicted · none · ref 4
Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services · cs.DC · 2025-09-24 · unverdicted · none · ref 14
eLLM: Elastic Memory Management Framework for Efficient LLM Serving · cs.DC · 2025-06-18 · unverdicted · none · ref 5
CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead · cs.AR · 2026-04-13 · unverdicted · none · ref 8
HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs · cs.DC · 2026-05-22 · unreviewed · ref 5
