fabric-lib: RDMA Point-to-Point Communication for LLM Systems

Kevin Hu (1); Lequn Chen (1); Nandor Licker (1); Vladimir Zaytsev (1) ((1) Perplexity AI)

arxiv: 2510.27656 · v2 · submitted 2025-10-31 · 💻 cs.DC

Pith reviewed 2026-05-18 02:50 UTC · model grok-4.3

keywords RDMA · point-to-point communication · LLM systems · disaggregated inference · MoE routing · network interface controllers · portable networking

The pith

fabric-lib bridges common NICs with a uniform one-sided WriteImm interface and ImmCounter notifications for portable LLM point-to-point communication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces fabric-lib to deliver flexible point-to-point RDMA communication for LLM system patterns such as disaggregated inference, MoE routing, and asynchronous reinforcement fine-tuning. Existing solutions tie implementations to particular network cards, which limits portability across hardware and integration into inference engines. fabric-lib solves this by exposing a consistent interface for one-sided writes with an ImmCounter primitive that signals completions without depending on network ordering guarantees. It also handles multiple NICs per GPU transparently. The result is shown through peak 400 Gbps throughput on both NVIDIA ConnectX-7 and AWS EFA plus three working production deployments.

Core claim

fabric-lib bridges the functionality of common NICs to expose a uniform interface for one-sided WriteImm operations with an ImmCounter primitive for completion notification, without ordering assumptions of network transport, transparently managing multiple NICs per GPU, and achieves peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS EFA while enabling production use cases for KvCache transfer, RL weight updates, and MoE dispatch/combine.
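
As a rough illustration of what transparently managing multiple NICs per GPU could involve, the sketch below splits one logical write into contiguous per-NIC chunks. The helper name `split_across_nics` and the even-split policy are assumptions made for illustration; this page does not describe how fabric-lib actually shards a transfer.

```python
from typing import List, Tuple


def split_across_nics(offset: int, length: int, num_nics: int) -> List[Tuple[int, int, int]]:
    """Split one logical write into contiguous per-NIC chunks.

    Returns (nic_index, chunk_offset, chunk_length) tuples. Hypothetical
    helper: the even split is an assumed policy, not fabric-lib's documented one.
    """
    if num_nics <= 0:
        raise ValueError("num_nics must be positive")
    if length < 0:
        raise ValueError("length must be non-negative")
    base, extra = divmod(length, num_nics)
    chunks = []
    cursor = offset
    for nic in range(num_nics):
        # The first `extra` NICs carry one additional byte each.
        size = base + (1 if nic < extra else 0)
        if size == 0:
            continue  # Skip NICs that would receive an empty chunk.
        chunks.append((nic, cursor, size))
        cursor += size
    return chunks


if __name__ == "__main__":
    for nic, off, size in split_across_nics(offset=4096, length=10, num_nics=4):
        print(f"NIC {nic}: bytes [{off}, {off + size})")
```

Because each chunk can complete independently and in any order, a scheme like this only works if the receiver's notification mechanism does not depend on arrival order, which is the role the summary assigns to ImmCounter.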

What carries the argument

The ImmCounter primitive, which supplies reliable completion notification for one-sided WriteImm operations without any ordering assumptions from the underlying network transport.
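
One way to picture counter-based completion notification is the toy model below: the receiver counts arrived writes per immediate value and fires a callback once the expected count is reached. Because counting is commutative, the outcome is the same for any arrival order. The class `ImmCounterSketch` and its methods are hypothetical and are not taken from fabric-lib's API.

```python
import random
from collections import defaultdict
from typing import Callable, Dict


class ImmCounterSketch:
    """Toy model of counter-based completion notification.

    Hypothetical class: names and methods are illustrative only.
    """

    def __init__(self) -> None:
        self._counts: Dict[int, int] = defaultdict(int)
        self._expected: Dict[int, int] = {}
        self._callbacks: Dict[int, Callable[[], None]] = {}

    def expect(self, imm: int, count: int, on_done: Callable[[], None]) -> None:
        """Register interest in `count` writes tagged with immediate `imm`."""
        self._expected[imm] = count
        self._callbacks[imm] = on_done
        self._maybe_fire(imm)  # Writes may have landed before registration.

    def on_write_imm(self, imm: int) -> None:
        """Record one arrived write; arrival order is irrelevant."""
        self._counts[imm] += 1
        self._maybe_fire(imm)

    def _maybe_fire(self, imm: int) -> None:
        expected = self._expected.get(imm)
        if expected is not None and self._counts[imm] >= expected:
            callback = self._callbacks.pop(imm)
            del self._expected[imm]
            del self._counts[imm]
            callback()


if __name__ == "__main__":
    done = []
    counter = ImmCounterSketch()
    counter.expect(7, 4, lambda: done.append(7))
    counter.expect(9, 2, lambda: done.append(9))
    arrivals = [7] * 4 + [9] * 2
    random.shuffle(arrivals)  # Simulate a transport with no ordering guarantee.
    for imm in arrivals:
        counter.on_write_imm(imm)
    print(sorted(done))
```

The sketch also handles writes that arrive before `expect` is called, since a one-sided sender has no way to know when the receiver registered its interest.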

If this is right

KvCache transfers for disaggregated inference support dynamic scaling without hardware-specific code.
Reinforcement learning weight updates for trillion-parameter models complete in 1.3 seconds.
MoE dispatch and combine operations achieve lower decode latency than DeepEP on ConnectX-7 and become feasible on EFA.
Point-to-point primitives can be used alongside collectives without locking systems to one vendor's NIC.
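
To put the 1.3-second weight-update figure in context, the back-of-envelope sketch below estimates the aggregate bandwidth such a transfer implies. The bytes-per-parameter values and the idea that every link sustains the full 400 Gbps are illustrative assumptions, not numbers reported on this page.

```python
import math


def aggregate_bandwidth_gbps(num_params: float, bytes_per_param: float, seconds: float) -> float:
    """Aggregate bandwidth in Gbit/s needed to move all parameters in `seconds`."""
    total_bits = num_params * bytes_per_param * 8
    return total_bits / seconds / 1e9


def min_links(aggregate_gbps: float, per_link_gbps: float) -> int:
    """Smallest number of links that covers the aggregate at full line rate."""
    return math.ceil(aggregate_gbps / per_link_gbps)


if __name__ == "__main__":
    # Assumed precisions: 1 byte (an 8-bit format) and 2 bytes (a 16-bit format).
    for bytes_per_param in (1, 2):
        need = aggregate_bandwidth_gbps(1e12, bytes_per_param, 1.3)
        links = min_links(need, 400.0)
        print(f"{bytes_per_param} B/param: {need:.0f} Gbit/s aggregate, >= {links} links at 400 Gbit/s")
```

Under these assumptions the transfer cannot fit on a single 400 Gbps link, so a figure like 1.3 seconds implies many sender and receiver pairs moving shards in parallel.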

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar abstraction layers could reduce hardware-specific code in training frameworks that span multiple cloud providers.
The no-ordering design may simplify fault-tolerant extensions for larger clusters where messages can arrive out of sequence.
Adding support for additional RDMA verbs such as reads or atomics would broaden the library's use in other distributed systems.

Load-bearing premise

A single uniform interface can be built across different NICs such as ConnectX-7 and EFA while keeping the exact performance and notification semantics required by the workloads.

What would settle it

Measure whether ImmCounter notifications arrive reliably and throughput stays near 400 Gbps when running the three production workloads at scale on EFA hardware.

Original abstract

Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present fabric-lib, which bridges the functionality of common NICs to expose a uniform interface. fabric-lib exposes one-sided WriteImm operations with an ImmCounter primitive for completion notification, without ordering assumptions of network transport, transparently managing multiple NICs per GPU. We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We showcase fabric-lib through three production systems: (1) KvCache transfer for disaggregated inference with dynamic scaling, (2) RL weight updates achieving 1.3 seconds for trillion-parameter models, and (3) MoE dispatch/combine implementation exceeding DeepEP decode latency on ConnectX-7, with the first viable latencies on EFA. We demonstrate that our portable point-to-point communication complements collectives while avoiding lock-in. fabric-lib is open-sourced at https://github.com/perplexityai/pplx-garden/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

fabric-lib ships a practical open-source RDMA abstraction with ImmCounter and multi-NIC handling that targets real LLM point-to-point needs across ConnectX-7 and EFA.

The key takeaway is that fabric-lib offers a portable RDMA interface tailored to LLM point-to-point patterns, with an ImmCounter primitive that works without transport ordering assumptions and transparent multi-NIC handling per GPU. The paper does a good job of showing how this plugs into production use cases: KvCache transfers for dynamic disaggregated inference, RL weight updates in 1.3 seconds for trillion-parameter models, and an MoE dispatch/combine that outperforms DeepEP on decode latency on ConnectX-7 while providing the first viable option on EFA. Reaching a 400 Gbps peak on both NVIDIA and AWS hardware is a solid data point, and releasing the code on GitHub adds real value for others to build on.

The soft spots are mostly around verification depth. The abstract gives throughput and latency numbers but does not detail error bars, exact baseline comparisons, or how the authors confirmed that ImmCounter notifications remain reliable across platforms under stress. EFA's RDMA variant might introduce subtle differences in ordering or reliability that could affect the notification mechanism, so the implementation section is worth a close read. If the paper shows explicit handling for those differences, the claim strengthens; otherwise it stays somewhat open.

There are no major flaws in the central argument, which rests on the implementation and reported measurements. The citation pattern seems focused on relevant prior work without over-reliance on self-cites.

This paper is for practitioners and researchers working on scalable LLM systems who want to reduce hardware lock-in in their communication layers. A reader interested in disaggregated inference or MoE routing would get concrete ideas and code to try. It deserves serious peer review because the engineering contribution is timely and the open-source release allows referees to inspect the actual primitives.

Referee Report

1 major / 2 minor

Summary. The manuscript presents fabric-lib, a library providing a uniform RDMA point-to-point interface for LLM systems. It exposes one-sided WriteImm operations paired with an ImmCounter primitive for completion notification that avoids assumptions on network transport ordering, transparently manages multiple NICs per GPU, and is shown to reach 400 Gbps peak throughput on both NVIDIA ConnectX-7 and AWS EFA. The library is integrated into three production workloads: dynamic KV cache transfer for disaggregated inference, RL weight updates completing in 1.3 s for trillion-parameter models, and an MoE dispatch/combine kernel that matches or exceeds DeepEP decode latency on ConnectX-7 while providing the first viable latencies on EFA. The work is open-sourced.

Significance. If the portability and performance claims hold, the contribution is significant for distributed LLM systems. It supplies a practical, vendor-agnostic primitive that complements existing collectives for emerging patterns such as disaggregated inference and MoE routing, thereby reducing hardware lock-in. Explicit production deployments together with concrete throughput and latency numbers provide direct evidence of utility; the open-source release further strengthens reproducibility and adoption potential.

major comments (1)

[Performance evaluation section] The central claim of 400 Gbps peak throughput on both ConnectX-7 and EFA is load-bearing for the portability argument, yet the manuscript provides no description of message sizes, number of trials, or variability measures used to obtain these numbers. Without these details the reader cannot assess whether the reported peak is representative or platform-specific.

minor comments (2)

[Abstract] The MoE use-case sentence states that latencies 'exceed DeepEP decode latency' but does not quantify the improvement or specify the exact metric (e.g., per-token latency or end-to-end time).
[Related work] The manuscript would benefit from a short paragraph contrasting fabric-lib with prior portable RDMA abstractions (e.g., libfabric, UCX) to clarify the novelty of the ImmCounter design.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work and the recommendation for minor revision. We address the major comment below.

Referee: [Performance evaluation section] The central claim of 400 Gbps peak throughput on both ConnectX-7 and EFA is load-bearing for the portability argument, yet the manuscript provides no description of message sizes, number of trials, or variability measures used to obtain these numbers. Without these details the reader cannot assess whether the reported peak is representative or platform-specific.

Authors: We agree that additional methodological details in the performance evaluation section would improve clarity and allow readers to better assess the reported peaks. In the revised manuscript we will expand this section to describe the message sizes used to reach the 400 Gbps peaks, the number of trials conducted, and the variability measures (such as standard deviation or range) observed across runs on both ConnectX-7 and EFA.

Revision: yes
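
The kind of reporting discussed in this exchange could look like the following sketch, which summarizes repeated throughput trials for each message size with the mean, sample standard deviation, and range. The function name `summarize_trials` is hypothetical, and the sample values are toy placeholders chosen for easy arithmetic; they are not measurements from the paper.

```python
import statistics
from typing import Dict, List


def summarize_trials(samples_gbps: List[float]) -> Dict[str, float]:
    """Summarize repeated throughput measurements for one message size."""
    if len(samples_gbps) < 2:
        raise ValueError("need at least two trials to report variability")
    return {
        "trials": len(samples_gbps),
        "mean": statistics.mean(samples_gbps),
        "stdev": statistics.stdev(samples_gbps),  # Sample standard deviation.
        "min": min(samples_gbps),
        "max": max(samples_gbps),
    }


if __name__ == "__main__":
    # Placeholder data only: keys are message sizes, values are toy Gbit/s samples.
    toy_results = {
        "64 KiB": [10.0, 12.0, 11.0],
        "1 MiB": [20.0, 22.0, 21.0],
    }
    for size, samples in toy_results.items():
        s = summarize_trials(samples)
        print(
            f"{size}: n={s['trials']} mean={s['mean']:.1f} "
            f"stdev={s['stdev']:.2f} range=[{s['min']:.1f}, {s['max']:.1f}] Gbit/s"
        )
```

Reporting a table like this for each NIC would let a reader see at which message size the peak is reached and how stable it is across runs.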

Circularity Check

0 steps flagged

No circularity: implementation and benchmark claims rest on code and measurements

The manuscript describes a systems implementation (fabric-lib) that bridges RDMA primitives across NICs and reports throughput and latency numbers from production workloads. No mathematical derivation, fitted parameters, or predictions appear; the central claims are supported by direct implementation details and empirical results rather than any reduction to self-referential definitions or prior self-citations. The ImmCounter primitive is presented as an engineering abstraction whose correctness is asserted via the described semantics and tested workloads, not derived from equations that presuppose the outcome. This is a standard non-circular systems paper whose evidence is external to any internal fitting or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems and engineering paper; the central claim does not rest on mathematical axioms, fitted parameters, or newly postulated entities.

pith-pipeline@v0.9.0 · 5777 in / 1331 out tokens · 51832 ms · 2026-05-18T02:50:29.458550+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Paper passage: TransferEngine exposes one-sided WRITE IMM operations with a novel IMMCOUNTER primitive for completion notification that does not rely on message ordering... transparently managing multiple NICs per GPU

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Remote KV Cache Reuse with GPU-native Video Codec
cs.DC · 2026-02 · conditional · novelty 7.0

KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.

NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding
cs.DC · 2026-05 · unverdicted · novelty 6.0

NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 ...

Eliminating Hidden Serialization in Multi-Node Megakernel Communication
cs.DC · 2026-05 · conditional · novelty 6.0

Perseus removes serialization bottlenecks in multi-node megakernel MoE communication via batched per-destination fences and hardware fence flags, delivering up to 10.3x speedup on proxy transports and matching or exce...

UCCL-Zip: Lossless Compression Supercharged GPU Communication
cs.DC · 2026-04 · unverdicted · novelty 6.0

UCCL-Zip adds lossless compression to GPU communication to reduce LLM bottlenecks while preserving exact numerical correctness.