Recognition: no theorem link
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
Pith reviewed 2026-05-12 01:31 UTC · model grok-4.3
The pith
FCP shards sequences into blocks and uses arbitrary peer-to-peer links plus bin-packing to scale context parallelism near-linearly on up to 256 GPUs while raising attention efficiency for mixed-length data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FCP is a context parallelism paradigm that performs sharding and scheduling at block-level granularity, enables arbitrary peer-to-peer communication, and applies bin-packing to blocks drawn from both short and long sequences. This combination produces high compute efficiency together with balanced workload distribution. On up to 256 NVIDIA GPUs it delivers near-linear scalability and raises attention MFU by factors between 1.13x and 2.21x.
What carries the argument
Block-level sharding combined with arbitrary peer-to-peer communication and bin-packing of sequence blocks to balance workloads.
If this is right
- Near-linear scaling continues to hold when the number of GPUs reaches 256 during pre-training.
- Attention MFU improves by 1.13x to 2.21x relative to prior context-parallelism methods.
- Workload remains balanced even when input sequences exhibit large length differences.
- Correctness and convergence of the training process stay intact under the new scheduling.
Where Pith is reading between the lines
- The same block-packing idea could be applied to other distributed training stages such as tensor or pipeline parallelism to reduce padding waste.
- Flexible any-to-any communication may allow dynamic GPU allocation during a single training run without restarting.
- If overhead stays low, the method opens the door to routinely training models whose context length exceeds the current practical limit set by efficiency losses.
Load-bearing premise
Block-level sharding with arbitrary peer-to-peer communication and bin-packing can be realized with negligible overhead while preserving training correctness and convergence on datasets that contain high sequence-length variance.
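To see what "arbitrary peer-to-peer communication" amounts to in practice, here is a minimal sketch of how KV-block exchange between context-parallel ranks could be expressed with PyTorch's batched point-to-point primitives. The function name, the send/receive plans, and the tensor shapes are assumptions for illustration, not the paper's implementation, and the code presumes an already-initialized process group (e.g. launched with torchrun and the NCCL backend).

```python
# Sketch (assumed, not the paper's code): exchange KV blocks between
# arbitrary context-parallel ranks using PyTorch's batched P2P ops.
import torch.distributed as dist

def exchange_kv_blocks(send_plan, recv_plan):
    """send_plan: list of (tensor, dst_rank); recv_plan: list of (tensor, src_rank).
    All transfers are posted as one batch, so blocks can be placed on any
    rank without the fixed neighbor order a ring topology imposes."""
    ops = [dist.P2POp(dist.isend, t, dst) for t, dst in send_plan]
    ops += [dist.P2POp(dist.irecv, t, src) for t, src in recv_plan]
    if ops:
        for req in dist.batch_isend_irecv(ops):
            req.wait()
```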
What would settle it
A training run on a real dataset with extreme sequence-length variance in which the reported attention MFU gain falls to 1x or below, showing that communication or packing overhead has erased the expected benefit.
original abstract
Context parallelism (CP) has been widely adopted to support the growing context length in foundation model pretraining. However, existing designs fail to handle the large variation in sequence length from training datasets, resulting in suboptimal performance. These methods often over-shard short sequences, leading to compute inefficiency and excessive communication, or process long and short sequences separately without proper bin-packing, causing workload imbalance. In this paper, we propose FCP, a flexible context parallelism paradigm that shards and schedules sequences at block-level granularity. Instead of relying on rigid communication topologies such as ring, FCP enables arbitrary peer-to-peer communication, allowing flexible placement of sequence blocks across workers. By bin-packing blocks from both short and long sequences, FCP achieves both high compute efficiency and balanced workload distribution. Extensive evaluations show that FCP attains near-linear scalability on up to 256 NVIDIA GPUs, with 1.13x-2.21x improvement in the attention MFU.
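To make the bin-packing idea concrete, here is a minimal sketch of block-level load balancing under an assumed causal-attention cost model, using a longest-processing-time-first greedy pass. The block size, the cost model, and the heuristic are illustrative assumptions, not the paper's scheduler.

```python
# Sketch (assumptions: block size, causal-cost model, LPT greedy heuristic):
# split mixed-length sequences into blocks and balance attention cost across
# context-parallel ranks.
import heapq

BLOCK = 1024  # assumed block size in tokens

def blocks_with_cost(seq_lens):
    """Split each sequence into blocks; cost ~ causal attention work,
    i.e. query tokens in the block times the key/value prefix length."""
    out = []
    for sid, n in enumerate(seq_lens):
        for start in range(0, n, BLOCK):
            end = min(start + BLOCK, n)
            out.append(((end - start) * end, sid, start, end))
    return out

def lpt_pack(blocks, num_ranks):
    """Greedy LPT: each block, largest first, goes to the lightest rank."""
    heap = [(0, r) for r in range(num_ranks)]  # (total cost, rank)
    heapq.heapify(heap)
    assign = [[] for _ in range(num_ranks)]
    for cost, sid, start, end in sorted(blocks, reverse=True):
        load, r = heapq.heappop(heap)
        assign[r].append((sid, start, end))
        heapq.heappush(heap, (load + cost, r))
    return assign, sorted(load for load, _ in heap)

if __name__ == "__main__":
    seq_lens = [128000, 2048, 512, 4096, 64000, 1024, 900, 32000]
    assign, loads = lpt_pack(blocks_with_cost(seq_lens), num_ranks=8)
    print("per-rank cost:", loads, "max/min:", loads[-1] / loads[0])
```

A real scheduler would also weigh the P2P traffic each placement induces; this sketch captures only the compute-balance side of the claim.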
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FCP, a flexible context parallelism paradigm for foundation model pre-training that shards sequences at block-level granularity, replaces rigid ring topologies with arbitrary peer-to-peer communication, and applies bin-packing across short and long sequences to improve compute efficiency and workload balance. It reports near-linear scalability on up to 256 NVIDIA GPUs together with 1.13x–2.21x gains in attention MFU relative to prior CP methods.
Significance. If the empirical claims are substantiated, FCP would address a practical bottleneck in large-scale pre-training by accommodating the high sequence-length variance typical of real corpora without the over-sharding or imbalance penalties of existing rigid CP designs. The shift to block-level flexible placement and P2P scheduling could improve hardware utilization in distributed attention kernels and reduce wasted compute on short sequences.
major comments (3)
- [Abstract] The headline claims of near-linear scaling to 256 GPUs and a 1.13x–2.21x attention-MFU improvement are stated without accompanying methodology, baseline descriptions, workload sequence-length statistics, error bars, or a quantitative breakdown of bin-packing efficiency and P2P traffic volume versus ring CP; these omissions make the central performance assertions unverifiable from the provided text.
- [Evaluation] No evidence is supplied that block-level sharding, arbitrary P2P communication, and bin-packing together incur negligible extra communication or compute cost while producing bitwise-identical attention outputs and unchanged optimizer trajectories; any residual imbalance or metadata overhead would directly undermine the reported MFU gains and the near-linear scaling at 256 GPUs.
- [Results] The assertion that bin-packing "achieves both high compute efficiency and balanced workload distribution" is unsupported by packing-efficiency metrics, per-sequence load-balance statistics, or communication-volume comparisons, all of which are needed to substantiate the weakest assumption: that overhead remains negligible on real pre-training datasets with high sequence-length variance.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to enhance clarity and provide the requested substantiation.
point-by-point responses
- Referee: [Abstract] The headline claims of near-linear scaling to 256 GPUs and a 1.13x–2.21x attention-MFU improvement are stated without accompanying methodology, baseline descriptions, workload sequence-length statistics, error bars, or a quantitative breakdown of bin-packing efficiency and P2P traffic volume versus ring CP; these omissions make the central performance assertions unverifiable from the provided text.
  Authors: We acknowledge that the abstract, constrained by length, omits supporting details. The methodology for block-level sharding, flexible P2P communication, and bin-packing is described in Section 3; baselines and the comparison to ring CP are in Section 5.1; sequence-length statistics from the real pre-training corpus appear in Figure 2; error bars are present in all scaling and MFU plots (Figures 4–6); and quantitative breakdowns of packing efficiency and P2P traffic volume are provided in Section 5.3 and Table 3. We will revise the abstract to briefly reference the evaluation setup on high-variance workloads and the ring-CP baselines. revision: yes
- Referee: [Evaluation] No evidence is supplied that block-level sharding, arbitrary P2P communication, and bin-packing together incur negligible extra communication or compute cost while producing bitwise-identical attention outputs and unchanged optimizer trajectories; any residual imbalance or metadata overhead would directly undermine the reported MFU gains and the near-linear scaling at 256 GPUs.
  Authors: Block-level sharding preserves exact attention semantics because each block is processed exactly as in a standard implementation; P2P communication merely exchanges the required KV blocks without changing the computation graph or the numerical results. Consequently, attention outputs are bitwise identical and optimizer trajectories remain unchanged. We will add a dedicated subsection to the Evaluation (new Section 5.2) with small-scale bitwise-equivalence tests, explicit communication-volume measurements, and an overhead analysis showing that flexible P2P plus bin-packing yields lower total cost than rigid ring methods, consistent with the observed near-linear scaling. revision: yes
- Referee: [Results] The assertion that bin-packing "achieves both high compute efficiency and balanced workload distribution" is unsupported by packing-efficiency metrics, per-sequence load-balance statistics, or communication-volume comparisons, all of which are needed to substantiate the weakest assumption: that overhead remains negligible on real pre-training datasets with high sequence-length variance.
  Authors: We agree that more explicit quantitative support is warranted. We will augment the Results section with packing-efficiency metrics (average packing density and wasted-compute percentage), per-sequence load-balance statistics (workload variance and max/min ratio across workers), and direct communication-volume comparisons (bytes transferred versus ring CP). These will appear in new tables and figures to substantiate the efficiency and balance claims on the high-variance dataset. revision: yes
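The promised balance statistics are cheap to compute once per packed batch. A hypothetical helper along the following lines would suffice; the metric definitions (and the notion of "ideal" load) are assumptions for illustration, not the paper's.

```python
# Hypothetical helper for the load-balance metrics named in the rebuttal:
# per-rank workload variance, max/min ratio, and packing density relative
# to a perfectly even split. Definitions assumed, not the paper's.
import statistics

def balance_metrics(per_rank_cost):
    ideal = sum(per_rank_cost) / len(per_rank_cost)
    return {
        "variance": statistics.pvariance(per_rank_cost),
        "max_over_min": max(per_rank_cost) / min(per_rank_cost),
        "packing_density": ideal / max(per_rank_cost),  # 1.0 = perfect balance
    }

print(balance_metrics([9.8e9, 1.01e10, 9.9e9, 1.0e10]))
```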
Circularity Check
No circularity: purely empirical system evaluation
full rationale
The paper introduces FCP as a block-level sharding and bin-packing approach to context parallelism, supported solely by reported runtime measurements (near-linear scaling to 256 GPUs and 1.13x–2.21x attention MFU gains). No derivation chain, no fitted parameters renamed as predictions, and no load-bearing self-citations appear in the abstract or the described content. All central claims rest on external benchmark runs rather than on quantities that reduce to the inputs by construction, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Dividing sequences into blocks does not alter model semantics or convergence behavior.
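This assumption is easy to sanity-check for the attention computation itself: blockwise evaluation with a running max and log-sum-exp rescaling (the standard FlashAttention/Ring Attention recombination) reproduces monolithic attention up to floating-point error. The sketch below is an illustration of that general principle under assumed shapes and block size, not the paper's kernel, and it checks numerical closeness rather than bitwise identity.

```python
# Sketch: blockwise attention over KV blocks matches monolithic attention
# numerically (bitwise identity depends on the specific kernel). Shapes and
# block size are illustrative assumptions.
import numpy as np

def attention(q, k, v):
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def blockwise_attention(q, k, v, block=64):
    d = q.shape[-1]
    m = np.full((q.shape[0], 1), -np.inf)   # running max of scores
    l = np.zeros((q.shape[0], 1))           # running softmax denominator
    o = np.zeros_like(q)                    # running unnormalized output
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)
        scale = np.exp(m - m_new)           # rescale previous partial results
        l = l * scale + p.sum(axis=-1, keepdims=True)
        o = o * scale + p @ vb
        m = m_new
    return o / l

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 32)) for _ in range(3))
assert np.allclose(attention(q, k, v), blockwise_attention(q, k, v))
print("blockwise == monolithic (within float tolerance)")
```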