ByteScale: Communication-Efficient Scaling of LLM Training with a 2048K Context Length on 16384 GPUs , url=

Arjun Singhvi, Nandita Dukkipati, Prashant Chandra, Hassan M · 2025 · arXiv 8958.375435

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

RNG: Flat Datacenter Networks at Scale

cs.NI · 2026-04-16 · unverdicted · novelty 7.0 · 3 refs

RNG deploys the first production flat datacenter network using quasi-random graphs, a new distributed routing protocol, and a passive optical cabling shuffle device, achieving fat-tree performance at substantially lower cost.

SCENIC: Stream Computation-Enhanced SmartNIC

cs.AR · 2026-04-16 · unverdicted · novelty 7.0

SCENIC delivers a programmable 200G SmartNIC with offloaded protocol stacks, stream compute units, and full OS transparency that matches commercial performance for custom offloads like collective communication and GPU data partitioning.

Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

cs.DC · 2026-05-08 · unverdicted · novelty 5.0

FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.

citing papers explorer

Showing 3 of 3 citing papers.

RNG: Flat Datacenter Networks at Scale cs.NI · 2026-04-16 · unverdicted · none · ref 42 · 3 links
RNG deploys the first production flat datacenter network using quasi-random graphs, a new distributed routing protocol, and a passive optical cabling shuffle device, achieving fat-tree performance at substantially lower cost.
SCENIC: Stream Computation-Enhanced SmartNIC cs.AR · 2026-04-16 · unverdicted · none · ref 80
SCENIC delivers a programmable 200G SmartNIC with offloaded protocol stacks, stream compute units, and full OS transparency that matches commercial performance for custom offloads like collective communication and GPU data partitioning.
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP cs.DC · 2026-05-08 · unverdicted · none · ref 47
FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.

ByteScale: Communication-Efficient Scaling of LLM Training with a 2048K Context Length on 16384 GPUs , url=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer