Echo: Simulating distributed training at scale

Echo: Simulating Distributed Training At Scale · 2024 · arXiv 2412.12487

5 Pith papers cite this work.

representative citing papers

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

cs.DC · 2026-05-20 · unverdicted · novelty 7.0

Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error versus prior simulators.

Simulating Unified Tensor Resharding in heterogeneous AI systems

cs.DC · 2026-06-25 · unverdicted · novelty 6.0

Xsim is a heterogeneity-aware simulator for distributed LLM training supporting load balancing, customized collectives, tensor resharding, and pluggable network simulation, reporting under 5% error in training time predictions.

Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference

cs.DC · 2026-05-16 · unverdicted · novelty 5.0 · 2 refs

Charon is a unified modular simulator that predicts LLM training and inference performance with under 5.35% error and identifies throughput improvements over baselines in a real deployment case.

Libra: Efficient Resource Management for Agentic RL Post-Training

cs.LG · 2026-06-02 · unverdicted · novelty 4.0

Libra optimizes GPU allocation across rollout and training in agentic RL via an elastic hybrid pool and C-MLFQ scheduler based on tool-return causal signals, claiming up to 3.0x throughput and 2.5x faster reward convergence on 48 A800 GPUs.

Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO

cs.DC · 2026-04-13 · unverdicted · novelty 4.0

StableHLO serves as a viable unified representation for cross-architecture performance modeling of distributed ML workloads, preserving relative trends while exposing fidelity trade-offs.

citing papers explorer

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation · cs.DC · 2026-05-20 · unverdicted · none · ref 27
Simulating Unified Tensor Resharding in heterogeneous AI systems · cs.DC · 2026-06-25 · unverdicted · none · ref 19
Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference · cs.DC · 2026-05-16 · unverdicted · none · ref 4 · 2 links
Libra: Efficient Resource Management for Agentic RL Post-Training · cs.LG · 2026-06-02 · unverdicted · none · ref 11
Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO · cs.DC · 2026-04-13 · unverdicted · none · ref 30
