fabric-lib: RDMA Point-to-Point Communication for LLM Systems
Pith reviewed 2026-05-18 02:50 UTC · model grok-4.3
The pith
fabric-lib bridges common NICs with a uniform one-sided WriteImm interface and ImmCounter notifications for portable LLM point-to-point communication.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
fabric-lib bridges the functionality of common NICs behind a uniform interface for one-sided WriteImm operations, with an ImmCounter primitive for completion notification that makes no assumptions about network transport ordering. It transparently manages multiple NICs per GPU and reaches a peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS EFA, while enabling production use cases for KvCache transfer, RL weight updates, and MoE dispatch/combine.
What carries the argument
The ImmCounter primitive, which supplies reliable completion notification for one-sided WriteImm operations without any ordering assumptions from the underlying network transport.
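The counting idea can be illustrated with a small, self-contained Python model. This is a sketch under stated assumptions, not fabric-lib's API: the class name ImmCounterModel, its methods expect and on_write_imm, and the scheme of keeping one counter per immediate value are hypothetical choices made for illustration. It assumes the receiver is told about each completed write together with its immediate value and knows how many writes to expect. Because only the count matters, the notification fires at the same point regardless of arrival order.

```python
import random
from collections import defaultdict


class ImmCounterModel:
    """Toy model of a completion counter keyed by immediate value.

    Hypothetical illustration only; this is not fabric-lib's API.
    """

    def __init__(self):
        self._counts = defaultdict(int)  # imm -> writes observed so far
        self._waiters = {}               # imm -> (expected count, callback)

    def expect(self, imm, count, callback):
        """Fire `callback(imm)` once `count` writes tagged `imm` have arrived."""
        self._waiters[imm] = (count, callback)
        self._maybe_fire(imm)  # writes may have landed before registration

    def on_write_imm(self, imm):
        """Record one completed write carrying immediate value `imm`."""
        self._counts[imm] += 1
        self._maybe_fire(imm)

    def _maybe_fire(self, imm):
        waiter = self._waiters.get(imm)
        if waiter is not None and self._counts[imm] >= waiter[0]:
            del self._waiters[imm]
            self._counts[imm] -= waiter[0]
            waiter[1](imm)


def simulate(num_chunks=8, seed=0):
    """Deliver `num_chunks` writes in shuffled order; return fired notifications."""
    fired = []
    counter = ImmCounterModel()
    counter.expect(imm=7, count=num_chunks, callback=fired.append)
    arrivals = list(range(num_chunks))
    random.Random(seed).shuffle(arrivals)  # model an unordered transport
    for i, _chunk in enumerate(arrivals):
        counter.on_write_imm(7)
        # The notification must fire exactly when the final chunk lands.
        assert (len(fired) == 1) == (i == num_chunks - 1)
    return fired
```

In this model the shuffle stands in for a transport that may reorder messages. A design that instead treated the last-sent write as the completion signal would need in-order delivery, which is the assumption the counter avoids.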
If this is right
- KvCache transfers for disaggregated inference support dynamic scaling without hardware-specific code.
- Reinforcement learning weight updates for trillion-parameter models complete in 1.3 seconds.
- MoE dispatch and combine operations achieve lower decode latency than DeepEP on ConnectX-7 and become feasible on EFA.
- Point-to-point primitives can be used alongside collectives without locking systems to one vendor's NIC.
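A back-of-envelope calculation puts the 1.3-second weight-update figure in context. The sketch below takes the trillion-parameter size, the 1.3 s duration, and the 400 Gbps peak from the text above; everything else is an assumption made for illustration: the bytes per parameter (the page does not state the weight precision), that each parameter crosses the network exactly once, and that every link runs at its full peak rate. The function name required_links is hypothetical.

```python
def required_links(num_params, bytes_per_param, seconds, link_gbps):
    """Return (aggregate Tbps, equivalent count of fully used links).

    ASSUMPTION: each parameter is sent once and links run at peak rate.
    """
    total_bits = num_params * bytes_per_param * 8
    aggregate_bps = total_bits / seconds
    return aggregate_bps / 1e12, aggregate_bps / (link_gbps * 1e9)


if __name__ == "__main__":
    # ASSUMPTION: 1 or 2 bytes per parameter; the page does not say which.
    for bytes_per_param in (1, 2):
        tbps, links = required_links(1e12, bytes_per_param, 1.3, 400)
        print(f"{bytes_per_param} B/param: {tbps:.1f} Tbps aggregate, "
              f"about {links:.0f} links at 400 Gbps")
```

Under these assumptions the transfer needs aggregate bandwidth well beyond a single 400 Gbps link, so the figure implies many NICs sending in parallel. It says nothing about how the paper's deployment was actually laid out.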
Where Pith is reading between the lines
- Similar abstraction layers could reduce hardware-specific code in training frameworks that span multiple cloud providers.
- The no-ordering design may simplify fault-tolerant extensions for larger clusters where messages can arrive out of sequence.
- Adding support for additional RDMA verbs such as reads or atomics would broaden the library's use in other distributed systems.
Load-bearing premise
A single uniform interface can be built across different NICs such as ConnectX-7 and EFA while keeping the exact performance and notification semantics required by the workloads.
What would settle it
Measure whether ImmCounter notifications arrive reliably and throughput stays near 400 Gbps when running the three production workloads at scale on EFA hardware.
Original abstract
Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present fabric-lib, which bridges the functionality of common NICs to expose a uniform interface. fabric-lib exposes one-sided WriteImm operations with an ImmCounter primitive for completion notification, without ordering assumptions of network transport, transparently managing multiple NICs per GPU. We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We showcase fabric-lib through three production systems: (1) KvCache transfer for disaggregated inference with dynamic scaling, (2) RL weight updates achieving 1.3 seconds for trillion-parameter models, and (3) MoE dispatch/combine implementation exceeding DeepEP decode latency on ConnectX-7, with the first viable latencies on EFA. We demonstrate that our portable point-to-point communication complements collectives while avoiding lock-in. fabric-lib is open-sourced at https://github.com/perplexityai/pplx-garden/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents fabric-lib, a library providing a uniform RDMA point-to-point interface for LLM systems. It exposes one-sided WriteImm operations paired with an ImmCounter primitive for completion notification that avoids assumptions on network transport ordering, transparently manages multiple NICs per GPU, and is shown to reach 400 Gbps peak throughput on both NVIDIA ConnectX-7 and AWS EFA. The library is integrated into three production workloads: dynamic KV cache transfer for disaggregated inference, RL weight updates completing in 1.3 s for trillion-parameter models, and an MoE dispatch/combine kernel that matches or exceeds DeepEP decode latency on ConnectX-7 while providing the first viable latencies on EFA. The work is open-sourced.
Significance. If the portability and performance claims hold, the contribution is significant for distributed LLM systems. It supplies a practical, vendor-agnostic primitive that complements existing collectives for emerging patterns such as disaggregated inference and MoE routing, thereby reducing hardware lock-in. Explicit production deployments together with concrete throughput and latency numbers provide direct evidence of utility; the open-source release further strengthens reproducibility and adoption potential.
major comments (1)
- [Performance evaluation section] The central claim of 400 Gbps peak throughput on both ConnectX-7 and EFA is load-bearing for the portability argument, yet the manuscript provides no description of message sizes, number of trials, or variability measures used to obtain these numbers. Without these details the reader cannot assess whether the reported peak is representative or platform-specific.
minor comments (2)
- [Abstract] The MoE use-case sentence states that latencies 'exceed DeepEP decode latency' but does not quantify the improvement or specify the exact metric (e.g., per-token latency or end-to-end time).
- [Related work] The manuscript would benefit from a short paragraph contrasting fabric-lib with prior portable RDMA abstractions (e.g., libfabric, UCX) to clarify the novelty of the ImmCounter design.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work and the recommendation for minor revision. We address the major comment below.
- Referee: [Performance evaluation section] The central claim of 400 Gbps peak throughput on both ConnectX-7 and EFA is load-bearing for the portability argument, yet the manuscript provides no description of message sizes, number of trials, or variability measures used to obtain these numbers. Without these details the reader cannot assess whether the reported peak is representative or platform-specific.
  Authors: We agree that additional methodological details in the performance evaluation section would improve clarity and allow readers to better assess the reported peaks. In the revised manuscript we will expand this section to describe the message sizes used to reach the 400 Gbps peaks, the number of trials conducted, and the variability measures (such as standard deviation or range) observed across runs on both ConnectX-7 and EFA.
  Revision: yes
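The reporting requested here (message size, trial count, and a variability measure) can be sketched in a few lines of Python. This is an illustrative helper under stated assumptions, not the authors' benchmark harness: the function names throughput_gbps and summarize are hypothetical, each trial is assumed to time a fixed number of equal-sized messages, and a gigabit is taken as 1e9 bits.

```python
import statistics


def throughput_gbps(message_bytes, num_messages, elapsed_seconds):
    """Convert one timed trial into gigabits per second (1 Gb = 1e9 bits)."""
    return message_bytes * num_messages * 8 / elapsed_seconds / 1e9


def summarize(trials_gbps):
    """Return trial count, mean, sample standard deviation, min, and max."""
    return {
        "n": len(trials_gbps),
        "mean": statistics.mean(trials_gbps),
        # Sample stdev needs at least two trials; report 0.0 for a single run.
        "stdev": statistics.stdev(trials_gbps) if len(trials_gbps) > 1 else 0.0,
        "min": min(trials_gbps),
        "max": max(trials_gbps),
    }


if __name__ == "__main__":
    # Placeholder timings chosen for easy arithmetic; NOT measurements.
    trials = [throughput_gbps(1_000_000, 1000, t) for t in (1.0, 2.0)]
    print(summarize(trials))
```

Reporting the summary per message size and per NIC type would let a reader judge whether a quoted peak is typical or a best case.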
Circularity Check
No circularity: implementation and benchmark claims rest on code and measurements
The manuscript describes a systems implementation (fabric-lib) that bridges RDMA primitives across NICs and reports throughput and latency numbers from production workloads. No mathematical derivation, fitted parameters, or predictions appear; the central claims are supported by direct implementation details and empirical results rather than any reduction to self-referential definitions or prior self-citations. The ImmCounter primitive is presented as an engineering abstraction whose correctness is asserted via the described semantics and tested workloads, not derived from equations that presuppose the outcome. This is a standard non-circular systems paper whose evidence is external to any internal fitting or renaming.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: TransferEngine exposes one-sided WRITE IMM operations with a novel IMMCOUNTER primitive for completion notification that does not rely on message ordering... transparently managing multiple NICs per GPU
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
- Efficient Remote KV Cache Reuse with GPU-native Video Codec
  KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.
- NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding
  NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 ...
- Eliminating Hidden Serialization in Multi-Node Megakernel Communication
  Perseus removes serialization bottlenecks in multi-node megakernel MoE communication via batched per-destination fences and hardware fence flags, delivering up to 10.3x speedup on proxy transports and matching or exce...
- UCCL-Zip: Lossless Compression Supercharged GPU Communication
  UCCL-Zip adds lossless compression to GPU communication to reduce LLM bottlenecks while preserving exact numerical correctness.