
arxiv: 2510.27656 · v2 · submitted 2025-10-31 · 💻 cs.DC

fabric-lib: RDMA Point-to-Point Communication for LLM Systems

Pith reviewed 2026-05-18 02:50 UTC · model grok-4.3

classification 💻 cs.DC
keywords RDMA · point-to-point communication · LLM systems · disaggregated inference · MoE routing · network interface controllers · portable networking

The pith

fabric-lib bridges common NICs with a uniform one-sided WriteImm interface and ImmCounter notifications for portable LLM point-to-point communication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces fabric-lib to deliver flexible point-to-point RDMA communication for LLM system patterns such as disaggregated inference, MoE routing, and asynchronous reinforcement fine-tuning. Existing solutions tie implementations to particular network cards, which limits portability across hardware and integration into inference engines. fabric-lib addresses this by exposing a consistent interface for one-sided writes, paired with an ImmCounter primitive that signals completions without depending on network ordering guarantees. It also handles multiple NICs per GPU transparently. The paper reports peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS EFA, along with three production deployments.

Core claim

fabric-lib bridges the functionality of common NICs to expose a uniform interface for one-sided WriteImm operations with an ImmCounter primitive for completion notification, without ordering assumptions of network transport, transparently managing multiple NICs per GPU, and achieves peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS EFA while enabling production use cases for KvCache transfer, RL weight updates, and MoE dispatch/combine.
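The claim says fabric-lib manages multiple NICs per GPU transparently, but this page does not describe the mechanism. One way to picture it is round-robin striping of a large transfer across the available NICs. The sketch below is a minimal illustration under that assumption only; `stripe_across_nics` and its parameters are hypothetical names, not part of fabric-lib's API.

```python
from typing import List, Tuple


def stripe_across_nics(
    total_bytes: int, num_nics: int, chunk_bytes: int
) -> List[Tuple[int, int, int]]:
    """Plan a transfer as (nic_index, offset, length) chunks, assigned round-robin.

    Hypothetical helper for illustration; round-robin is an assumed policy.
    """
    if total_bytes < 0 or num_nics <= 0 or chunk_bytes <= 0:
        raise ValueError("sizes must be positive and total_bytes non-negative")

    plan = []
    offset = 0
    chunk_index = 0
    while offset < total_bytes:
        # The last chunk may be shorter than chunk_bytes.
        length = min(chunk_bytes, total_bytes - offset)
        plan.append((chunk_index % num_nics, offset, length))
        offset += length
        chunk_index += 1
    return plan


# Example: a 10-byte buffer split into 4-byte chunks over 2 NICs.
for nic, offset, length in stripe_across_nics(10, num_nics=2, chunk_bytes=4):
    print(f"NIC {nic}: bytes [{offset}, {offset + length})")
```

Chunks sent over different NICs can complete in any order, which is one reason a receiver might prefer counting completions over relying on arrival order.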

What carries the argument

The ImmCounter primitive, which supplies reliable completion notification for one-sided WriteImm operations without any ordering assumptions from the underlying network transport.
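To make the idea concrete, here is a toy model of order-independent completion counting. It is a sketch under stated assumptions, not fabric-lib's implementation: the class name `ImmCounterSketch` and the methods `on_write_imm` and `wait` are invented for illustration, and the model assumes each completed write delivers one immediate value that the receiver uses as a counter key.

```python
import random
import threading
from collections import defaultdict
from typing import Optional


class ImmCounterSketch:
    """Toy model: count arrivals per immediate value, ignoring arrival order."""

    def __init__(self) -> None:
        self._counts = defaultdict(int)
        self._cond = threading.Condition()

    def on_write_imm(self, imm: int) -> None:
        # Called once per completed write that carried immediate value `imm`.
        with self._cond:
            self._counts[imm] += 1
            self._cond.notify_all()

    def wait(self, imm: int, expected: int, timeout: Optional[float] = None) -> bool:
        # Block until `expected` writes tagged `imm` have landed, or time out.
        with self._cond:
            return self._cond.wait_for(
                lambda: self._counts[imm] >= expected, timeout
            )


# Deliver eight chunk completions in a random order; only the count matters.
counter = ImmCounterSketch()
arrival_order = list(range(8))
random.shuffle(arrival_order)
for _chunk in arrival_order:
    counter.on_write_imm(imm=7)

print(counter.wait(imm=7, expected=8, timeout=1.0))
```

Because the receiver waits on a count rather than on a "last" message, the result is the same whichever chunk arrives first.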

If this is right

  • KvCache transfers for disaggregated inference support dynamic scaling without hardware-specific code.
  • Reinforcement learning weight updates for trillion-parameter models complete in 1.3 seconds.
  • MoE dispatch and combine operations achieve lower decode latency than DeepEP on ConnectX-7 and become feasible on EFA.
  • Point-to-point primitives can be used alongside collectives without locking systems to one vendor's NIC.
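The 1.3-second figure in the list above can be put in perspective with back-of-the-envelope arithmetic. The sketch below assumes "trillion-parameter" means 10^12 parameters and treats bytes per parameter as a free assumption, since this page does not state the weight precision or the number of NICs involved. The function names are hypothetical.

```python
def required_aggregate_gbps(num_params: float, bytes_per_param: float, seconds: float) -> float:
    """Aggregate bandwidth (Gbps) needed to move all weights once in `seconds`."""
    bits = num_params * bytes_per_param * 8
    return bits / seconds / 1e9


def nic_equivalents(aggregate_gbps: float, nic_gbps: float = 400.0) -> float:
    """How many NICs running at `nic_gbps` would add up to `aggregate_gbps`."""
    return aggregate_gbps / nic_gbps


# Assumed precisions: 1 byte (e.g. 8-bit) and 2 bytes (e.g. 16-bit) per parameter.
for bytes_per_param in (1, 2):
    gbps = required_aggregate_gbps(1e12, bytes_per_param, seconds=1.3)
    print(
        f"{bytes_per_param} B/param: {gbps:,.0f} Gbps aggregate, "
        f"about {nic_equivalents(gbps):.1f} NICs at 400 Gbps"
    )
```

Under these assumptions the transfer needs aggregate bandwidth well beyond a single 400 Gbps NIC, so it would have to be spread across many NICs in parallel.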

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar abstraction layers could reduce hardware-specific code in training frameworks that span multiple cloud providers.
  • The no-ordering design may simplify fault-tolerant extensions for larger clusters where messages can arrive out of sequence.
  • Adding support for additional RDMA verbs such as reads or atomics would broaden the library's use in other distributed systems.

Load-bearing premise

A single uniform interface can be built across different NICs such as ConnectX-7 and EFA while keeping the exact performance and notification semantics required by the workloads.

What would settle it

Measure whether ImmCounter notifications arrive reliably and throughput stays near 400 Gbps when running the three production workloads at scale on EFA hardware.
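A measurement of this kind is only informative if it reports the number of trials and the spread across them alongside the peak. The sketch below shows one way to summarize repeated throughput trials against a nominal line rate. The function name is hypothetical and the sample values are made-up placeholders, not measurements from the paper.

```python
import statistics
from typing import Dict, Sequence


def summarize_trials(gbps_samples: Sequence[float], line_rate_gbps: float = 400.0) -> Dict[str, float]:
    """Summarize throughput trials: count, mean, sample stdev, range, line-rate fraction."""
    if len(gbps_samples) < 2:
        raise ValueError("need at least two trials to report variability")
    mean = statistics.mean(gbps_samples)
    return {
        "trials": len(gbps_samples),
        "mean_gbps": mean,
        "stdev_gbps": statistics.stdev(gbps_samples),  # sample standard deviation
        "min_gbps": min(gbps_samples),
        "max_gbps": max(gbps_samples),
        "mean_fraction_of_line_rate": mean / line_rate_gbps,
    }


# Placeholder values for illustration only; not data from the paper.
placeholder_samples = [390.0, 400.0, 380.0]
for key, value in summarize_trials(placeholder_samples).items():
    print(f"{key}: {value}")
```

Running the same summary per message size and per NIC type would show whether a reported peak is typical or an outlier.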

Original abstract

Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present fabric-lib, which bridges the functionality of common NICs to expose a uniform interface. fabric-lib exposes one-sided WriteImm operations with an ImmCounter primitive for completion notification, without ordering assumptions of network transport, transparently managing multiple NICs per GPU. We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We showcase fabric-lib through three production systems: (1) KvCache transfer for disaggregated inference with dynamic scaling, (2) RL weight updates achieving 1.3 seconds for trillion-parameter models, and (3) MoE dispatch/combine implementation exceeding DeepEP decode latency on ConnectX-7, with the first viable latencies on EFA. We demonstrate that our portable point-to-point communication complements collectives while avoiding lock-in. fabric-lib is open-sourced at https://github.com/perplexityai/pplx-garden/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents fabric-lib, a library providing a uniform RDMA point-to-point interface for LLM systems. It exposes one-sided WriteImm operations paired with an ImmCounter primitive for completion notification that avoids assumptions on network transport ordering, transparently manages multiple NICs per GPU, and is shown to reach 400 Gbps peak throughput on both NVIDIA ConnectX-7 and AWS EFA. The library is integrated into three production workloads: dynamic KV cache transfer for disaggregated inference, RL weight updates completing in 1.3 s for trillion-parameter models, and an MoE dispatch/combine kernel that matches or exceeds DeepEP decode latency on ConnectX-7 while providing the first viable latencies on EFA. The work is open-sourced.

Significance. If the portability and performance claims hold, the contribution is significant for distributed LLM systems. It supplies a practical, vendor-agnostic primitive that complements existing collectives for emerging patterns such as disaggregated inference and MoE routing, thereby reducing hardware lock-in. Explicit production deployments together with concrete throughput and latency numbers provide direct evidence of utility; the open-source release further strengthens reproducibility and adoption potential.

major comments (1)
  1. [Performance evaluation section] The central claim of 400 Gbps peak throughput on both ConnectX-7 and EFA is load-bearing for the portability argument, yet the manuscript provides no description of message sizes, number of trials, or variability measures used to obtain these numbers. Without these details the reader cannot assess whether the reported peak is representative or platform-specific.
minor comments (2)
  1. [Abstract] The MoE use-case sentence states that latencies 'exceed DeepEP decode latency' but does not quantify the improvement or specify the exact metric (e.g., per-token latency or end-to-end time).
  2. [Related work] The manuscript would benefit from a short paragraph contrasting fabric-lib with prior portable RDMA abstractions (e.g., libfabric, UCX) to clarify the novelty of the ImmCounter design.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work and the recommendation for minor revision. We address the major comment below.

Point-by-point responses
  1. Referee: [Performance evaluation section] The central claim of 400 Gbps peak throughput on both ConnectX-7 and EFA is load-bearing for the portability argument, yet the manuscript provides no description of message sizes, number of trials, or variability measures used to obtain these numbers. Without these details the reader cannot assess whether the reported peak is representative or platform-specific.

    Authors: We agree that additional methodological details in the performance evaluation section would improve clarity and allow readers to better assess the reported peaks. In the revised manuscript we will expand this section to describe the message sizes used to reach the 400 Gbps peaks, the number of trials conducted, and the variability measures (such as standard deviation or range) observed across runs on both ConnectX-7 and EFA.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: implementation and benchmark claims rest on code and measurements

full rationale

The manuscript describes a systems implementation (fabric-lib) that bridges RDMA primitives across NICs and reports throughput and latency numbers from production workloads. No mathematical derivation, fitted parameters, or predictions appear; the central claims are supported by direct implementation details and empirical results rather than any reduction to self-referential definitions or prior self-citations. The ImmCounter primitive is presented as an engineering abstraction whose correctness is asserted via the described semantics and tested workloads, not derived from equations that presuppose the outcome. This is a standard non-circular systems paper whose evidence is external to any internal fitting or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems and engineering paper; the central claim does not rest on mathematical axioms, fitted parameters, or newly postulated entities.

pith-pipeline@v0.9.0 · 5777 in / 1331 out tokens · 51832 ms · 2026-05-18T02:50:29.458550+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Remote KV Cache Reuse with GPU-native Video Codec

    cs.DC 2026-02 conditional novelty 7.0

    KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.

  2. NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding

    cs.DC 2026-05 unverdicted novelty 6.0

    NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 ...

  3. Eliminating Hidden Serialization in Multi-Node Megakernel Communication

    cs.DC 2026-05 conditional novelty 6.0

    Perseus removes serialization bottlenecks in multi-node megakernel MoE communication via batched per-destination fences and hardware fence flags, delivering up to 10.3x speedup on proxy transports and matching or exce...

  4. UCCL-Zip: Lossless Compression Supercharged GPU Communication

    cs.DC 2026-04 unverdicted novelty 6.0

    UCCL-Zip adds lossless compression to GPU communication to reduce LLM bottlenecks while preserving exact numerical correctness.