
arxiv: 2605.21603 · v1 · pith:NIIWYHZLnew · submitted 2026-05-20 · 💻 cs.DC

DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling

Pith reviewed 2026-05-22 08:31 UTC · model grok-4.3

classification 💻 cs.DC
keywords intra-device parallelism · operator scheduling · graph partitioning · ML frameworks · programmable interface · throughput optimization · CUDA Graphs compatibility

The pith

DynaFlow decouples logical model definition from physical execution schedule to add intra-device parallelism flexibly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that intra-device parallelism strategies can be integrated into existing ML systems without invasive code overhauls by separating the logical model graph from how operators are actually scheduled on hardware. Current approaches force developers into model-specific rewrites that are expensive to maintain because strategies depend heavily on workload, architecture, and hardware context. DynaFlow solves this with a frontend that adds annotations for partitioning the graph and a programmable interface to define custom strategies, plus a backend that runs the resulting control and data flows asynchronously while avoiding extra memory copies. If the approach works, ML developers could reuse the same parallelism ideas across frameworks and adapt them quickly to new settings instead of building separate versions each time.

Core claim

DynaFlow enables transparent and flexible integration of intra-device parallelism by decoupling the logical model definition from the physical execution schedule. It supplies annotations for graph partitioning and a programmable interface for custom strategies in the frontend, while the backend asynchronously manages complex control and data flows, uses custom memory management to remove copy overhead, and keeps compatibility with optimizations such as CUDA Graphs and TorchInductor.

What carries the argument

Decoupling of the logical model definition from the physical execution schedule, realized through annotations for graph partitioning and a programmable interface for custom intra-device parallelism strategies.
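
A toy illustration of that decoupling, not taken from the paper: the operator list below stands in for the logical model, and two interchangeable schedules decide only the order in which (micro-batch, operator) steps run. All names (`LOGICAL_OPS`, `sequential_schedule`, `interleaved_schedule`, `run`) are hypothetical and the operators are simple arithmetic stand-ins.

```python
from typing import Callable

# Logical model: an ordered list of named operators (toy stand-ins, assumed).
LOGICAL_OPS: list[tuple[str, Callable[[float], float]]] = [
    ("attn", lambda x: x + 1.0),
    ("allreduce", lambda x: x * 2.0),
    ("mlp", lambda x: x - 3.0),
]


def sequential_schedule(n_batches: int, n_ops: int) -> list[tuple[int, int]]:
    """Run every operator of batch 0, then every operator of batch 1, and so on."""
    return [(b, i) for b in range(n_batches) for i in range(n_ops)]


def interleaved_schedule(n_batches: int, n_ops: int) -> list[tuple[int, int]]:
    """Alternate between micro-batches at each operator position."""
    return [(b, i) for i in range(n_ops) for b in range(n_batches)]


def run(inputs: list[float], schedule: list[tuple[int, int]]) -> list[float]:
    """Execute (batch, op_index) steps in schedule order and return the outputs."""
    state = list(inputs)
    next_op = [0] * len(inputs)
    for b, i in schedule:
        # The per-batch operator order belongs to the logical model, so any
        # physical schedule must respect it.
        assert i == next_op[b], "schedule violates per-batch operator order"
        _, fn = LOGICAL_OPS[i]
        state[b] = fn(state[b])
        next_op[b] += 1
    return state


inputs = [1.0, 5.0]
seq_out = run(inputs, sequential_schedule(len(inputs), len(LOGICAL_OPS)))
inter_out = run(inputs, interleaved_schedule(len(inputs), len(LOGICAL_OPS)))
```

Both schedules produce the same outputs because the model definition never changes; only the execution order does.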

If this is right

  • Representative parallelism strategies integrate into six state-of-the-art ML systems with only minimal code changes.
  • Throughput improves by up to 1.29x for inference and training workloads.
  • Compatibility is retained with existing optimizations including CUDA Graphs and TorchInductor.
  • Strategies adapt to different workloads, model architectures, and hardware without maintaining multiple specialized versions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of logic from schedule could lower the cost of experimenting with new operator-overlap ideas across the broader ML ecosystem.
  • Framework designers might adopt similar decoupling layers to support dynamic scheduling as a built-in feature rather than an add-on.
  • The technique could be tested on training loops with larger batch sizes to see whether the asynchronous backend scales without introducing new bottlenecks.

Load-bearing premise

The annotations for graph partitioning and the programmable interface can be added to existing ML frameworks without invasive overhauls or breaking compatibility with optimizations like CUDA Graphs and TorchInductor.

What would settle it

Integrating DynaFlow into a seventh ML framework, then measuring both the lines of code changed and the resulting throughput across a range of models and hardware, would show whether integration stays minimal and whether speedups approach the reported maximum of 1.29x.

Figures

Figures reproduced from arXiv: 2605.21603 by Baris Kasikci, Hongtao Zhang, Jinbin Luo, Shengkai Lin, Stephanie Wang, Yibo Wu, Yile Gu, Yi Pan, Ziren Wang, Ziyi Xu.

Figure 1. Representative intra-device parallelism strategies: (a) Overlapping computation and communication on different streams; (b) Fine-grained kernel fusion; (c) Splitting the input batch for concurrent execution.
To address this, recent research has explored intra-device parallelism, a class of strategies that aims to maximize resource utilization within a single device. Techniques such as overlapping computat…
Figure 2. Performance of different intra-device parallelism strategies under different execution contexts on serving Llama-3-70B with 4 GPUs and tensor parallelism.
… other optimizations like CUDA Graphs. To address the challenge, we introduce DynaFlow, a transparent, flexible, and efficient framework for integrating intra-device parallelism into existing ML systems. DynaFlow enables programmers to easily implement …
Figure 3. Execution time breakdown of serving (a) a Llama-3-8B model on 2 GPUs with tensor parallelism (TP); (b) a DeepSeek-V2-Lite model on 2 GPUs with expert parallelism (EP). The batch size is 512 and the sequence length is 1024.
2 BACKGROUND AND MOTIVATION Modern large-scale ML models are composed of a sequence of operators with highly diverse resource requirements. These are broadly categorized by their resource …
Figure 4. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png]
Figure 5. APIs for graph partition. [PITH_FULL_IMAGE:figures/full_fig_p005_5.png]
Figure 6. APIs for programmable operator scheduling.
    # Initialize parallel execution for N micro-batches
    def split(batch_sizes: list[int]): pass
    # Get operators ready to execute for a micro-batch
    def get_ready_ops(ubatch_idx: int) -> list[op]: pass
    # Dispatch one or more ready operators to execute
    def execute(operators: tuple[op], stream=None, replace_func=None): pass
Python context manager to wrap any code sections. 3.2.2 Programmable Scheduling The DynaFlow frontend provides a unified and dynamic abstraction for scheduling the partitioned subgraphs. Designing this abstraction requires a balance between flexibility and complexity. One design alternative is to expose all subgraph executables to users directly. This would prov…
Figure 7. Examples of using DynaFlow’s API to define DBO (up) and Tokenweave (down), with the desired execution order.
… within a fully Python-native frontend to preserve flexibility. To implement a custom strategy, a developer inherits from a base class, OpSchedulerBase, and overrides its schedule method. Inside this method, the developer interacts with the DynaFlow backend using a set of high-level APIs (detailed …
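
To make the scheduling API concrete, here is a minimal sketch of a custom strategy written against a mock backend. Only the names `split`, `get_ready_ops`, `execute`, `OpSchedulerBase`, and `schedule` come from the paper excerpts; everything else (`Op`, `MockBackend`, `TwoBatchInterleave`, the constructor taking a backend, the string stream labels) is a hypothetical stand-in, and the real DynaFlow classes may look quite different.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Op:
    """Hypothetical operator handle: a name plus the micro-batch it belongs to."""
    name: str
    ubatch: int


@dataclass
class MockBackend:
    """Stand-in for the backend; it only records the dispatch order."""
    op_names: list[str]
    cursors: list[int] = field(default_factory=list)
    trace: list[tuple[str, int, str]] = field(default_factory=list)

    def split(self, batch_sizes: list[int]) -> None:
        # One cursor per micro-batch, each starting at the first operator.
        self.cursors = [0 for _ in batch_sizes]

    def get_ready_ops(self, ubatch_idx: int) -> list[Op]:
        # The toy graph is linear, so at most one operator is ready at a time.
        i = self.cursors[ubatch_idx]
        return [Op(self.op_names[i], ubatch_idx)] if i < len(self.op_names) else []

    def execute(self, operators: tuple[Op, ...], stream=None, replace_func=None) -> None:
        # replace_func is accepted for signature parity but ignored by this mock.
        for op in operators:
            self.trace.append((op.name, op.ubatch, stream or "default"))
            self.cursors[op.ubatch] += 1


class OpSchedulerBase:
    """Assumed shape of the base class: holds a backend, subclasses override schedule()."""

    def __init__(self, backend: MockBackend) -> None:
        self.backend = backend

    def schedule(self) -> None:
        raise NotImplementedError


class TwoBatchInterleave(OpSchedulerBase):
    """Alternate two micro-batches and put communication on a separate stream label."""

    def schedule(self) -> None:
        be = self.backend
        be.split([4, 4])
        progressed = True
        while progressed:
            progressed = False
            for ub in (0, 1):
                ready = be.get_ready_ops(ub)
                if ready:
                    op = ready[0]
                    stream = "comm" if op.name == "allreduce" else "compute"
                    be.execute((op,), stream=stream)
                    progressed = True


backend = MockBackend(["attn", "allreduce", "mlp"])
TwoBatchInterleave(backend).schedule()
```

The strategy never touches the model definition: it only asks which operators are ready and decides where and when to dispatch them.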
Figure 8. CPU execution time for a single forward pass in vLLM with different DynaFlow configurations.
… strategies into state-of-the-art ML systems with minimal code changes and quantify the resulting performance improvements. 5.1 Evaluation Setup Testbeds. We evaluate DynaFlow on (1) a DGX B200 system with 8 NVIDIA B200 GPUs connected by NVLink and (2) an H100 system with 4 NVIDIA H100 GPUs connected by NVLink. We…
Figure 9. Serving throughput of DynaFlow-enabled NanoFlow integration.
5.2 Microbenchmarks 5.2.1 Frontend Effectiveness Transparency. We first evaluate transparency by quantifying the engineering cost, in lines of code (LoC), of integrating DynaFlow into existing ML systems. In vLLM, integration was minimal, requiring only 75 LoC in the GPUModelRunner to handle attention metadata for micro-batches. For MoE models, …
Figure 10. Serving throughput of DynaFlow-based DBO integration in vLLM.
… throughput improvement over baseline vLLM on Llama-3-8B, Llama-3-70B, and Qwen-2.5-72B. The speedup mainly comes from the network-bound and memory-bound operations being overlapped when we split the batch. As the number of GPUs increased, the overlapped communication and straggler effects take a higher ratio in the end-to-end time, so the sp…
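
The overlap effect described in this excerpt can be illustrated with a small timeline model. This is a sketch under stated assumptions, not a measurement: the durations are invented, halving the batch is assumed to halve every operator's time, and the two streams are assumed not to contend for hardware. The function `makespan` and all variable names are hypothetical.

```python
def makespan(schedule: list[tuple[int, str, str]], durations: dict[str, float]) -> float:
    """Finish time of a schedule of (micro_batch, op_name, stream) steps.

    An operator starts once its stream is free and the previous operator of
    the same micro-batch has finished.
    """
    stream_free: dict[str, float] = {}
    ubatch_done: dict[int, float] = {}
    for ub, op, stream in schedule:
        start = max(stream_free.get(stream, 0.0), ubatch_done.get(ub, 0.0))
        end = start + durations[op]
        stream_free[stream] = end
        ubatch_done[ub] = end
    return max(ubatch_done.values())


# Invented durations for one full batch (arbitrary time units).
full = {"attn": 4.0, "allreduce": 2.0, "mlp": 4.0}
# Assumption: each half-size micro-batch takes exactly half the time.
half = {name: t / 2 for name, t in full.items()}

# Baseline: one batch, so communication sits on the critical path.
baseline = makespan(
    [(0, "attn", "compute"), (0, "allreduce", "comm"), (0, "mlp", "compute")],
    full,
)

# Split batch: micro-batch 1 computes while micro-batch 0 communicates.
overlapped = makespan(
    [
        (0, "attn", "compute"),
        (0, "allreduce", "comm"),
        (1, "attn", "compute"),
        (0, "mlp", "compute"),
        (1, "allreduce", "comm"),
        (1, "mlp", "compute"),
    ],
    half,
)

speedup = baseline / overlapped
```

With these made-up numbers the baseline takes 10 units and the split schedule 8, because both all-reduces are hidden behind computation from the other micro-batch. Real speedups depend on how well operators scale down with batch size, which this model ignores.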
Figure 11. End-to-end throughput of DynaFlow-enabled communication overlap. [PITH_FULL_IMAGE:figures/full_fig_p011_11.png]
Legend: vLLM; Tokenweave; vLLM w/ Tokenweave (DynaFlow); HF; HF w/ Tokenweave (DynaFlow); Megatron-LM; Comet; Megatron-LM w/ Comet (DynaFlow).
Panels: Llama-3-8B, 2 GPUs, vLLM; Llama-3-8B, 2 GPUs, vLLM; Llama-3-8B, 2 GPUs, HF, Infer; Qwen, 4 GPUs, Megatron, Infer; Mixtral, 4 GPUs, Megatron, Infer.
Figure 12. End-to-end throughput of DynaFlow-enabled communication fusion.
5.3.5 Flux We used DynaFlow’s replace_func API to integrate fused compute-communication kernels from Triton-distributed (Zheng et al., 2025a;b) into vLLM, targeting the Linear and AllReduce subgraphs. This integration, however, resulted in a performance degradation of up to 20% compared to the original baseline. Profiling analysis indicate…
Figure 13. Overhead analysis. Init, Load, Trace, Analysis, Capture refer to engine initialization, model weight loading, model tracing using TorchDynamo, static analysis in DynaFlow, and CUDA graph capture [PITH_FULL_IMAGE:figures/full_fig_p012_13.png]
Figure 14. Ablation study. memory, graph, dynamic refer to zero-copy memory pre-allocation, CUDA graph, and dynamic scheduling.
… where disabling CUDA Graphs decreased its throughput to 0.83x, indicating the general importance of mitigating CPU overhead for this workload. Next, disabling our zero-copy memory pre-allocation mechanism resulted in a throughput of 1.10x. Finally, we use a static splitting strategy that sp…
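
The excerpt mentions zero-copy memory pre-allocation without showing how a pool might be sized. One generic approach for a fixed execution plan is to reserve the peak number of bytes that are live at any step. The sketch below illustrates that idea only; it is not DynaFlow's allocator, and the plan format, tensor names, and sizes are all hypothetical.

```python
def peak_live_bytes(
    plan: list[tuple[str, list[str], list[str]]],
    sizes: dict[str, int],
) -> int:
    """Peak bytes live at any step of an ordered plan.

    plan: ordered (op_name, input_tensors, output_tensors) steps.
    A tensor is live from the step that first touches it through its last use.
    """
    last_use: dict[str, int] = {}
    for step, (_, inputs, outputs) in enumerate(plan):
        for tensor in list(inputs) + list(outputs):
            last_use[tensor] = step

    live: set[str] = set()
    peak = 0
    for step, (_, inputs, outputs) in enumerate(plan):
        live.update(inputs)
        live.update(outputs)
        peak = max(peak, sum(sizes[t] for t in live))
        # Release tensors that no later step needs.
        live = {t for t in live if last_use[t] > step}
    return peak


# Hypothetical three-operator plan with made-up tensor sizes in bytes.
plan = [
    ("attn", ["x"], ["a"]),
    ("allreduce", ["a"], ["b"]),
    ("mlp", ["b"], ["y"]),
]
sizes = {"x": 4, "a": 8, "b": 8, "y": 2}

pool_bytes = peak_live_bytes(plan, sizes)
total_bytes = sum(sizes.values())
```

Here the pool needs 16 bytes rather than the 22 bytes of all tensors combined, since `x` is dead before `b` is produced. A pool sized this way can be allocated once, before execution starts, which is the property that matters for static capture mechanisms.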
Figure 15. Serving throughput of DynaFlow-based DBO integration in vLLM under PCIe interconnect.
A.4 Installation Please follow the instructions under examples/ae in the GitHub repository under branch ae to install DynaFlow and the modified systems. A.5 Evaluation and expected result The evaluation scripts will generate JSON files containing the throughput metrics in either token/s or seq/s under examples/ae/resul…
Abstract

Intra-device parallelism addresses resource under-utilization in ML inference and training by overlapping the execution of operators with different resource usage. However, its wide adoption is hindered by a fundamental conflict with the static, sequential programming model of existing frameworks. Integrating these strategies requires invasive, model-specific code overhauls, representing an intractable engineering cost. This is further amplified by the high sensitivity of strategies to execution contexts (e.g., workload, model architecture, hardware), forcing developers to implement and maintain multiple specialized solutions. To address this, we propose DynaFlow, a framework that enables the transparent and flexible integration of intra-device parallelism by decoupling the logical model definition from the physical execution schedule. DynaFlow introduces a flexible frontend with annotations for graph partitioning and a programmable interface for defining custom intra-device parallelism strategies. Its efficient backend manages complex control/data-flow asynchronously, uses custom memory management to eliminate copy overheads, and preserves compatibility with optimizations like CUDA Graphs and TorchInductor. We demonstrate that DynaFlow can integrate representative parallelism strategies into 6 state-of-the-art ML systems with minimal code changes, achieving up to a 1.29x throughput improvement. DynaFlow is publicly available at https://github.com/uw-syfi/DynaFlow.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DynaFlow, a framework that decouples logical model definition from physical execution schedule to enable transparent integration of intra-device parallelism strategies. It provides a frontend with graph partitioning annotations and a programmable interface for custom strategies, backed by an asynchronous backend using custom memory management that claims to preserve compatibility with CUDA Graphs and TorchInductor. Evaluation shows integration of representative strategies into 6 state-of-the-art ML systems with minimal code changes and up to 1.29x throughput gains.

Significance. If the compatibility and minimal-overhaul claims hold, the work could meaningfully reduce engineering costs for adopting context-sensitive intra-device parallelism across ML frameworks, improving resource utilization in inference and training. Public code release supports reproducibility and further experimentation.

major comments (2)
  1. [§4.3] §4.3 (Backend Implementation): The claim that the asynchronous control/data-flow management and custom memory management preserve compatibility with static CUDA Graphs and TorchInductor is load-bearing for the central 'transparent integration without invasive changes' thesis, yet the manuscript provides no concrete mechanism (e.g., static pre-allocation rules or how partitioning annotations eliminate runtime decisions) that would guarantee capture succeeds for arbitrary custom strategies.
  2. [§5.1] §5.1 (Integration Experiments): The reported 1.29x throughput gains across the six systems rest on integration results, but without explicit baseline definitions, data exclusion criteria, or component ablations, it is not possible to confirm that gains are attributable to DynaFlow rather than unstated factors or framework-specific tuning.
minor comments (2)
  1. [§3.1] Notation in §3.1 for the programmable interface could be clarified with a small example of a complete custom strategy definition to aid reader understanding.
  2. [Figures 9-12] Figures 9-12 (throughput plots): Adding per-run variance or confidence intervals would strengthen visual interpretation of the speedups.
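
The confidence intervals requested in the last minor comment could be computed along the following lines. This is a sketch with invented throughput samples; `mean_ci` is a hypothetical helper, and it uses a normal approximation, whereas a Student's t interval would be more appropriate for a handful of runs.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(samples: list[float], confidence: float = 0.95) -> tuple[float, float]:
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate variance")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(samples) / sqrt(len(samples))
    return mean(samples), half_width


# Invented per-run throughput numbers (e.g. tokens/s) for one configuration.
runs = [100.0, 102.0, 98.0, 101.0, 99.0]
avg, half = mean_ci(runs)
```

Plotting `avg` with error bars of `half` per configuration would show whether a reported speedup exceeds run-to-run noise.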

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We address each major point below, clarifying the mechanisms and experimental details while committing to revisions that strengthen the manuscript without misrepresenting our contributions.

Point-by-point responses
  1. Referee: [§4.3] §4.3 (Backend Implementation): The claim that the asynchronous control/data-flow management and custom memory management preserve compatibility with static CUDA Graphs and TorchInductor is load-bearing for the central 'transparent integration without invasive changes' thesis, yet the manuscript provides no concrete mechanism (e.g., static pre-allocation rules or how partitioning annotations eliminate runtime decisions) that would guarantee capture succeeds for arbitrary custom strategies.

    Authors: We agree that §4.3 would benefit from greater specificity on the compatibility mechanism. In the revised manuscript we will expand this section to explain that the frontend partitioning annotations are resolved at graph-construction time, producing a fixed operator grouping and data-flow DAG. This static plan is then handed to the asynchronous backend, which performs all memory allocations upfront using a custom pool sized to the maximum live tensors required by the plan. Because no allocations or control-flow decisions occur after the initial capture phase, the resulting execution stream satisfies the requirements for CUDA Graph capture and remains compatible with TorchInductor’s static optimizations. We will include a short pseudocode example and a table contrasting dynamic versus annotated execution to make the guarantee explicit for the representative strategies we evaluate. revision: yes

  2. Referee: [§5.1] §5.1 (Integration Experiments): The reported 1.29x throughput gains across the six systems rest on integration results, but without explicit baseline definitions, data exclusion criteria, or component ablations, it is not possible to confirm that gains are attributable to DynaFlow rather than unstated factors or framework-specific tuning.

    Authors: We acknowledge the value of additional experimental transparency. In the revision we will (1) explicitly define the baseline as the unmodified framework executing the identical model without any intra-device parallelism, (2) state the data-exclusion rules (discard the first 20% of iterations as warm-up, and discard any run whose throughput deviates by more than two standard deviations from the median), and (3) add a component ablation that isolates the contribution of the programmable scheduling interface from that of the custom memory manager. These clarifications will be placed in §5.1 and the corresponding appendix, allowing readers to attribute the observed speedups directly to the parallelism strategies enabled by DynaFlow. revision: yes
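
The two exclusion rules named in this simulated response are simple to state in code. The sketch below is one possible reading of them, with invented data; the helper names `drop_warmup` and `drop_outlier_runs` are hypothetical, and the sample standard deviation is an assumption since the response does not say which estimator is meant.

```python
from statistics import median, stdev


def drop_warmup(iterations: list[float], warmup_fraction: float = 0.2) -> list[float]:
    """Discard the leading fraction of iterations as warm-up."""
    n_warmup = int(len(iterations) * warmup_fraction)
    return iterations[n_warmup:]


def drop_outlier_runs(run_throughputs: list[float], k: float = 2.0) -> list[float]:
    """Keep runs within k sample standard deviations of the median."""
    if len(run_throughputs) < 2:
        return list(run_throughputs)
    med = median(run_throughputs)
    sd = stdev(run_throughputs)
    return [r for r in run_throughputs if abs(r - med) <= k * sd]


# Invented per-iteration latencies: the first two are slow warm-up steps.
iterations = [9.0, 7.0, 5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1]
steady = drop_warmup(iterations)

# Invented per-run throughputs with one obvious outlier.
runs = [100.0, 101.0, 99.0, 100.0, 102.0, 98.0, 100.0, 101.0, 99.0, 150.0]
kept = drop_outlier_runs(runs)
```

Note that a single extreme run inflates the standard deviation itself, so with very few runs this rule can fail to exclude anything; a median-absolute-deviation threshold is a common, more robust alternative.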

Circularity Check

0 steps flagged

No circularity: claims rest on implementation and measurements

Full rationale

The paper describes a systems framework (DynaFlow) that decouples logical model definition from execution schedule via annotations and a programmable interface. Its central claims—transparent integration into 6 ML systems with minimal changes and up to 1.29x throughput—are presented as outcomes of the implemented backend (asynchronous control/data-flow, custom memory management, CUDA Graph compatibility) and empirical evaluation. No equations, fitted parameters, predictions, uniqueness theorems, or self-citation chains appear in the abstract or description that would reduce the result to its inputs by construction. The contribution is self-contained as an engineering artifact whose correctness is externally verifiable via the public GitHub release and reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a new systems framework rather than new mathematical entities or parameters; it relies on standard assumptions about ML framework APIs and device execution models.

pith-pipeline@v0.9.0 · 5783 in / 1088 out tokens · 28400 ms · 2026-05-22T08:31:39.522207+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 11 internal anchors

  1. [1]

    Longformer: The Long-Document Transformer

    Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

  2. [2]

    Flux: Fast Software-Based Communication Overlap on GPUs Through Kernel Fusion

    Chang, L.-W., Bao, W., Hou, Q., Jiang, C., Zheng, N., Zhong, Y., Zhang, X., Song, Z., Yao, C., Jiang, Z., et al. Flux: Fast software-based communication overlap on gpus through kernel fusion. arXiv preprint arXiv:2406.06858.

  3. [3]

    Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

    Ge, S., Zhang, Y., Liu, L., Zhang, M., Han, J., and Gao, J. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801.

  4. [4]

    TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

    Gond, R., Kwatra, N., and Ramjee, R. Tokenweave: Efficient compute-communication overlap for distributed llm inference. arXiv preprint arXiv:2505.11329.

  5. [5]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.

  6. [6]

    PyTorch Distributed: Experiences on Accelerating Data Parallel Training

    Li, S., Zhao, Y., Varma, R., Salpekar, O., Noordhuis, P., Li, T., Paszke, A., Smith, J., Vaughan, B., Damania, P., et al. Pytorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704.

  7. [7]

    TorchTitan: One-Stop PyTorch Native Solution for Production Ready LLM Pre-Training

    Liang, W., Liu, T., Wright, L., Constable, W., Gu, A., Huang, C.-C., Zhang, I., Feng, W., Huang, H., Wang, J., et al. Torchtitan: One-stop pytorch native solution for production ready llm pre-training. arXiv preprint arXiv:2410.06511.

  8. [8]

    DeepSeek-V3 Technical Report

    Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.

  9. [9]

    YaRN: Efficient Context Window Extension of Large Language Models

    Peng, B., Quesnelle, J., Fan, H., and Shippole, E. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.

  10. [10]

    Horovod: fast and easy distributed deep learning in TensorFlow

    Sergeev, A. and Del Balso, M. Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799.

  11. [11]

    Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

    Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., and Han, S. Quest: Query-aware sparsity for efficient long-context llm inference. arXiv preprint arXiv:2406.10774.

  12. [12]

    LongCat-Flash Technical Report

    Team, M. L., Li, B., Lei, B., Wang, B., Rong, B., Wang, C., Zhang, C., Gao, C., Zhang, C., Sun, C., et al. Longcat-flash technical report. arXiv preprint arXiv:2509.01322.

  13. [13]

    Efficient Streaming Language Models with Attention Sinks

    Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.

  14. [14]

    Comet: Fine-Grained Computation-Communication Overlapping for Mixture-of-Experts

    Zhang, S., Zheng, N., Lin, H., Jiang, Z., Bao, W., Jiang, C., Hou, Q., Cui, W., Zheng, S., Chang, L.-W., et al. Comet: Fine-grained computation-communication overlapping for mixture-of-experts. arXiv preprint arXiv:2502.19811.

  15. [15]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277.

  16. [16]

    Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler

    Zheng, S., Bao, W., Hou, Q., Zheng, X., Fang, J., Huang, C., Li, T., Duanmu, H., Chen, R., Xu, R., Guo, Y., Zheng, N., Jiang, Z., Di, X., Wang, D., Ye, J., Lin, H., Chang, L.-W., Lu, L., Liang, Y., Zhai, J., and Liu, X. Triton-distributed: Programming overlapping kernels on distributed ai systems with the triton compiler, 2025a. URL https://arxiv.org...