DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism

Hongtao Chen, Weiyu Xie, Boxin Zhang, Jingqi Tang, Jiahao Wang, Jianwei Dong, Shaoyuan Chen, Ziwei Yuan, Chen Lin, Chengyu Qiu, Yuening Zhu, Qingliang Ou, Jiaqi Liao, Xianglin Chen, Zhiyuan Ai, Yongwei Wu, Mingxing Zhang · 2025 · arXiv 1569.376484

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Near-optimal Online Traffic Engineering

cs.NI · 2026-05-15 · unverdicted · novelty 7.0

OnlineTE uses optimization decomposition to enable distributed, near-optimal traffic engineering that reacts in seconds to changes in large WANs and outperforms prior centralized approaches in emulation.

SCENIC: Stream Computation-Enhanced SmartNIC

cs.AR · 2026-04-16 · unverdicted · novelty 7.0

SCENIC delivers a programmable 200G SmartNIC with offloaded protocol stacks, stream compute units, and full OS transparency that matches commercial performance for custom offloads like collective communication and GPU data partitioning.

Mestra: Exploring Migration on Virtualized CGRAs

cs.AR · 2026-04-06 · unverdicted · novelty 7.0

Mestra adds multi-tenancy and live migration to CGRAs, cutting workload makespan by up to 70% and tail latency by up to 30% at 0.13% extra LUT cost per region.

NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding

cs.DC · 2026-05-20 · unverdicted · novelty 6.0

NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.

FPGA-Accelerated Lock Management and Transaction Processing: Architecture, Optimization, and Design Space Exploration

cs.AR · 2026-05-13 · conditional · novelty 6.0

FPGA lock agents with on-chip tables achieve up to 51X higher TPC-C throughput than CPU baselines by removing DRAM access overhead for lock operations.

Proxics: an efficient programming model for far memory accelerators

cs.OS · 2026-04-20 · conditional · novelty 6.0

Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.

Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations

cs.NI · 2026-04-18 · unverdicted · novelty 6.0

Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in Astra-Sim simulations and a Tofino2 prototype.

Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

cs.DC · 2026-05-08 · unverdicted · novelty 5.0

FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.

Should I Hide My Duck in the Lake?

cs.DB · 2026-02-21 · unverdicted · novelty 5.0

A vision for a cloud SmartNIC that hides Parquet decoding costs by offloading parsing and filters directly on the network datapath, backed by DuckDB performance estimates.

citing papers explorer

Showing 9 of 9 citing papers.

Near-optimal Online Traffic Engineering · cs.NI · 2026-05-15 · unverdicted · none · ref 44
SCENIC: Stream Computation-Enhanced SmartNIC · cs.AR · 2026-04-16 · unverdicted · none · ref 73
Mestra: Exploring Migration on Virtualized CGRAs · cs.AR · 2026-04-06 · unverdicted · none · ref 10
NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding · cs.DC · 2026-05-20 · unverdicted · none · ref 7
FPGA-Accelerated Lock Management and Transaction Processing: Architecture, Optimization, and Design Space Exploration · cs.AR · 2026-05-13 · conditional · none · ref 24
Proxics: an efficient programming model for far memory accelerators · cs.OS · 2026-04-20 · conditional · none · ref 70
Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations · cs.NI · 2026-04-18 · unverdicted · none · ref 18
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP · cs.DC · 2026-05-08 · unverdicted · none · ref 37
Should I Hide My Duck in the Lake? · cs.DB · 2026-02-21 · unverdicted · none · ref 29
