Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.AR (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
  CAIS delivers 1.38x end-to-end LLM training speedup over NVLS and 1.61x over T3 by making in-switch computing aware of computation memory requirements instead of treating communication as an isolated phase.
- Technology solutions targeting the performance of gen-AI inference in resource constrained platforms
  A roofline-based model is used to assess bandwidth and latency needs for High Bandwidth Storage in 13B-parameter models with long contexts and the utility of bonded memory chiplets for 1B-parameter models to ease capacity and bandwidth constraints in on-device gen-AI inference.