RDMA over Ethernet for Distributed Training at Meta Scale
Mixed citation behavior. Most common role is background (50%).
Citing papers by year: 2026 (10)
citing papers explorer
- Near-optimal Online Traffic Engineering
  OnlineTE uses optimization decomposition to enable distributed, near-optimal traffic engineering that reacts in seconds to changes in large WANs and outperforms prior centralized approaches in emulation.
- NeuroRisk: Physics-Informed Neural Optimization for Risk-Aware Traffic Engineering
  NeuroRisk is a physics-informed deep unrolled optimizer for risk-aware traffic engineering that achieves small optimality gaps and 100-100000x speedup over solvers while outperforming neural baselines on throughput.
- Bridge: Optimizing Collective Communication Schedules in Reconfigurable Networks with Reusable Subrings
  Bridge reduces All-to-All completion time by typically 3x to 10x and improves AllReduce by up to 6.6x over Ring by reusing optical subrings across multiple steps in reconfigurable networks.
- RNG: Flat Datacenter Networks at Scale
  RNG deploys the first production flat datacenter network using quasi-random graphs, a new distributed routing protocol, and a passive optical cabling shuffle device, achieving fat-tree performance at substantially lower cost.
- SCENIC: Stream Computation-Enhanced SmartNIC
  SCENIC delivers a programmable 200G SmartNIC with offloaded protocol stacks, stream compute units, and full OS transparency that matches commercial performance for custom offloads like collective communication and GPU data partitioning.
- UCCL-Zip: Lossless Compression Supercharged GPU Communication
  UCCL-Zip adds lossless compression to GPU communication to reduce LLM bottlenecks while preserving exact numerical correctness.
- Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations
  Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in Astra-Sim simulations and a Tofino2 prototype.
- Aether: Network Validation Using Agentic AI and Digital Twin
  Aether orchestrates five AI agents atop a unified digital twin to automate network change validation, achieving 100% error detection, 92-96% diagnostic coverage, and 6-7 minute completion on synthetic scenarios and real ISP incidents.
- Characterizing the Impact of Congestion in Modern HPC Interconnects
  This study empirically characterizes congestion responses in EDR/HDR/NDR InfiniBand, Cray Slingshot, and Ethernet fabrics under controlled steady and bursty collective communication patterns at multiple system scales.
- Grid Integration of AI Data Centers: A Critical Review of Energy Storage Solutions
  A hierarchical review of energy storage technologies for smoothing the sub-second variable loads of AI data centers on the utility grid.