DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism
9 Pith papers cite this work.
citation summary
years: 2026 (9)
roles: background (3)
polarities: background (3)
citing papers explorer
- Near-optimal Online Traffic Engineering
  OnlineTE uses optimization decomposition to enable distributed, near-optimal traffic engineering that reacts in seconds to changes in large WANs and outperforms prior centralized approaches in emulation.
- SCENIC: Stream Computation-Enhanced SmartNIC
  SCENIC delivers a programmable 200G SmartNIC with offloaded protocol stacks, stream compute units, and full OS transparency that matches commercial performance for custom offloads like collective communication and GPU data partitioning.
- Mestra: Exploring Migration on Virtualized CGRAs
  Mestra adds multi-tenancy and live migration to CGRAs, cutting workload makespan by up to 70% and tail latency by up to 30% at 0.13% extra LUT cost per region.
- NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding
  NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.
- FPGA-Accelerated Lock Management and Transaction Processing: Architecture, Optimization, and Design Space Exploration
  FPGA lock agents with on-chip tables achieve up to 51x higher TPC-C throughput than CPU baselines by removing DRAM access overhead for lock operations.
- Proxics: an efficient programming model for far memory accelerators
  Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.
- Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations
  Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in Astra-Sim simulations and a Tofino2 prototype.
- Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
  FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.
- Should I Hide My Duck in the Lake?
  A vision for a cloud SmartNIC that hides Parquet decoding costs by offloading parsing and filters directly on the network datapath, backed by DuckDB performance estimates.