pith. machine review for the scientific record. sign in

arxiv: 2605.08524 · v1 · submitted 2026-05-08 · 💻 cs.DC

Recognition: no theorem link

Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:31 UTC · model grok-4.3

classification 💻 cs.DC
keywords context parallelismfoundation model pre-trainingblock-level shardingsequence length variationGPU scalabilityworkload balancingbin-packingattention MFU
0
0 comments X

The pith

FCP shards sequences into blocks and uses arbitrary peer-to-peer links plus bin-packing to scale context parallelism near-linearly on up to 256 GPUs while raising attention efficiency for mixed-length data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing context parallelism methods either over-shard short sequences or separate long and short ones, which wastes compute and creates load imbalances when real training data shows large length variation. FCP instead breaks every sequence into blocks, packs those blocks across workers using any-to-any communication, and avoids fixed ring patterns. The result is higher compute utilization and even workloads without separate handling of sequence types. A reader would care because longer contexts improve model quality but cannot be trained efficiently at scale until these parallelism bottlenecks are removed.

Core claim

FCP is a context parallelism paradigm that performs sharding and scheduling at block-level granularity, enables arbitrary peer-to-peer communication, and applies bin-packing to blocks drawn from both short and long sequences. This combination produces high compute efficiency together with balanced workload distribution. On up to 256 NVIDIA GPUs it delivers near-linear scalability and raises attention MFU by factors between 1.13x and 2.21x.

What carries the argument

Block-level sharding combined with arbitrary peer-to-peer communication and bin-packing of sequence blocks to balance workloads.

If this is right

  • Near-linear scaling continues to hold when the number of GPUs reaches 256 during pre-training.
  • Attention MFU improves by 1.13x to 2.21x relative to prior context-parallelism methods.
  • Workload remains balanced even when input sequences exhibit large length differences.
  • Correctness and convergence of the training process stay intact under the new scheduling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same block-packing idea could be applied to other distributed training stages such as tensor or pipeline parallelism to reduce padding waste.
  • Flexible any-to-any communication may allow dynamic GPU allocation during a single training run without restarting.
  • If overhead stays low, the method opens the door to routinely training models whose context length exceeds the current practical limit set by efficiency losses.

Load-bearing premise

Block-level sharding with arbitrary peer-to-peer communication and bin-packing can be realized with negligible overhead while preserving training correctness and convergence on datasets that contain high sequence-length variance.

What would settle it

A training run on a real dataset with extreme sequence-length variance in which the reported attention MFU gain falls to 1x or below, showing that communication or packing overhead has erased the expected benefit.

Figures

Figures reproduced from arXiv: 2605.08524 by Baris Kasikci, Hongxiang Hao, Ion Stoica, Kan Zhu, Shuang Ma, Xiaonan Nie, Yang Zhou, Yilong Zhao, Zhichao Lai.

Figure 1
Figure 1. Figure 1: Comparison between FCP and existing designs. (Left) Compute inefficiency: all sequences are uniformly sharded across GPUs. (Middle) Workload imbalance: sequences are grouped by length and assigned to different GPUs. Within each group, ring attention is applied. (Right) FCP adopts block-grained scheduling with arbitrary peer-to-peer communication. In this paper, we propose flexible context parallelism, FCP,… view at source ↗
Figure 2
Figure 2. Figure 2: The context length distribution, and cumulative compu￾tation and communication ratio from our internal training tasks. take the same magnitude of computation while short ones dominate communication, necessitating an adaptive paral￾lelization that can efficiently handle diverse datasets. 2.2 Attention Computation Attention is one of the key components in modern founda￾tion models (Vaswani et al., 2023). Con… view at source ↗
Figure 3
Figure 3. Figure 3: MFU of attention on different hardware, which is pro￾filed with 8 KV heads, 64 QO heads, and a head dimension of 128. We vary the total context length and the number of blocks that compose this total number of tokens. Results showcase that sharding sequences into fine-grained blocks greatly hurt MFU.3 under the same context length ) drastically reduces MFU. As shown in [PITH_FULL_IMAGE:figures/full_fig_p0… view at source ↗
Figure 4
Figure 4. Figure 4: Zig-Zag packing of an 8-tokens sequence for intra￾sequence computation and communication balance under causal mask. The computation and communication volume of a block depends on its position within the sequence. For example, 4-th Q needs to compute with 5 KV blocks, while 4-th KV is transferred 3 times to the subsequent Q blocks. By packing i-th block with (2N − i)-th block, both resources can be perfectl… view at source ↗
Figure 6
Figure 6. Figure 6: FCP System Overview. is applied within each subgroup. However, these designs oversimplify M by isolating subgroups on separate workers without resource sharing. Consequently, outlier sequences with extreme context length (which are common as shown in § 2.1) can severely disturb workload balance. For exam￾ple, a 64K sequence requires 256× more computation than a 4K one but is only given 16× more compute res… view at source ↗
Figure 7
Figure 7. Figure 7: Example of block-level pipelining for efficient compu￾tation and communication overlap. FCP decomposes end-to-end execution into computation and communication of blocks, which are executed block-by-block in an interleaving way. quences from a training batch, block distributor determines how to shard these sequences and assign the blocks to work￾ers. It takes into account both compute efficiency and load ba… view at source ↗
Figure 8
Figure 8. Figure 8: Example of the congestion-free solver over three sequences with causal mask. Given the block assignments from block distributor, communication planner constructs a bipartite graph based on the data dependency across GPUs. For example, 1-st block from sequence B are transferred from GPU 2 to GPU 0 and 1, adding edges 2 → 1 and 2 → 0. The solver then calculates the maximal matching of the bipartite graph wit… view at source ↗
Figure 9
Figure 9. Figure 9: Computation (upper) and communication (lower) imbal￾ance ratio when scaling the number of GPUs. 16 32 64 128 256 Number of GPUs 0.0 0.5 1.0 Normalized MFU 92.9% 94.7% 93.7% 94.2% 93.6% 69.6% 50.4% 43.3% 50.2% 27.3% FCP MagiAttention ByteScale RingAttention [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Normalized attention MFU with perfect load balance. large-scale pretraining. We also test the performance under different number of per-GPU tokens in § 6.6. We apply the causal attention mask for all sequences. Baselines. We compare FCP with the following state-of￾the-art CP designs: ① Ring Attention: Balance-optimized design. ② ByteScale: Compute-efficiency optimized design that dynamically partitions sh… view at source ↗
Figure 11
Figure 11. Figure 11: Weak-scaling of module-level attention MFU on real￾world dataset. The number of tokens per GPU is fixed at 32K. Fine-grained: better scheduling Coarse-grained: better efficiency Sweet Spot [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Sensitivity test of block sizes on 128× GPU-X. volume imbalance, leading to sub-optimal communication. Besides, as ByteScale spatially partitions sequences based on their context length L, the O(L 2 ) computation is only assigned with O(L) GPUs, causing up to 70% imbalance. 6.3 Attention Compute Efficiency To measure the compute efficiency excluding the effect of workload imbalance, we assume all context … view at source ↗
Figure 16
Figure 16. Figure 16: (a) Trace distribution of the bimodal distribution and (b) weak-scaling of module-level attention MFU. The number of tokens per GPU is fixed at 32K. A.4 Evaluation on Additional Workloads Besides the real-world distribution derived from our pre￾training tasks [PITH_FULL_IMAGE:figures/full_fig_p016_16.png] view at source ↗
read the original abstract

Context parallelism (CP) has been widely adopted to support the growing context length in foundation model pretraining. However, existing designs fail to handle the large variation in sequence length from training datasets, resulting in suboptimal performance. These methods often over-shard short sequences, leading to compute inefficiency and excessive communication, or process long and short sequences separately without proper bin-packing, causing workload imbalance. In this paper, we propose FCP, a flexible context parallelism paradigm that shards and schedules sequences at block-level granularity. Instead of relying on rigid communication topologies such as ring, FCP enables arbitrary peer-to-peer communication, allowing flexible placement of sequence blocks across workers. By bin-packing blocks from both short and long sequences, FCP achieves both high compute efficiency and balanced workload distribution. Extensive evaluations show that FCP attains near-linear scalability on up to 256 NVIDIA GPUs, with 1.13x-2.21x improvement in the attention MFU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper proposes FCP, a flexible context parallelism paradigm for foundation model pre-training that shards sequences at block-level granularity, replaces rigid ring topologies with arbitrary peer-to-peer communication, and applies bin-packing across short and long sequences to improve compute efficiency and workload balance. It reports near-linear scalability on up to 256 NVIDIA GPUs together with 1.13x–2.21x gains in attention MFU relative to prior CP methods.

Significance. If the empirical claims are substantiated, FCP would address a practical bottleneck in large-scale pre-training by accommodating the high sequence-length variance typical of real corpora without the over-sharding or imbalance penalties of existing rigid CP designs. The shift to block-level flexible placement and P2P scheduling could improve hardware utilization in distributed attention kernels and reduce wasted compute on short sequences.

major comments (3)
  1. [Abstract] Abstract: the headline claims of near-linear scaling to 256 GPUs and 1.13x–2.21x MFU improvement are stated without any accompanying methodology, baseline descriptions, workload sequence-length statistics, error bars, or quantitative breakdown of bin-packing efficiency and P2P traffic volume versus ring CP; these omissions make the central performance assertions unverifiable from the provided text.
  2. [Evaluation] Evaluation section: no evidence is supplied that block-level sharding plus arbitrary P2P plus bin-packing incurs negligible extra communication or compute cost while producing bitwise-identical attention outputs and identical optimizer trajectories; any residual imbalance or metadata overhead would directly undermine the reported MFU gains and linear scaling at 256-GPU scale.
  3. [Results] Results: the assertion that bin-packing “achieves both high compute efficiency and balanced workload” is unsupported by packing-efficiency metrics, per-sequence load-balance statistics, or communication-volume comparisons, which are required to substantiate the weakest assumption that overhead remains negligible on high-variance real pre-training datasets.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to enhance clarity and provide the requested substantiation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claims of near-linear scaling to 256 GPUs and 1.13x–2.21x MFU improvement are stated without any accompanying methodology, baseline descriptions, workload sequence-length statistics, error bars, or quantitative breakdown of bin-packing efficiency and P2P traffic volume versus ring CP; these omissions make the central performance assertions unverifiable from the provided text.

    Authors: We acknowledge that the abstract, constrained by length, omits supporting details. The methodology for block-level sharding, flexible P2P, and bin-packing is fully described in Section 3; baselines and comparison to ring CP are in Section 5.1; sequence-length statistics from the real pre-training corpus appear in Figure 2; error bars are present in all scaling and MFU plots (Figures 4–6); and quantitative breakdowns of packing efficiency and P2P traffic volume are provided in Section 5.3 and Table 3. We will revise the abstract to briefly reference the evaluation setup on high-variance workloads and the ring-CP baselines. revision: yes

  2. Referee: [Evaluation] Evaluation section: no evidence is supplied that block-level sharding plus arbitrary P2P plus bin-packing incurs negligible extra communication or compute cost while producing bitwise-identical attention outputs and identical optimizer trajectories; any residual imbalance or metadata overhead would directly undermine the reported MFU gains and linear scaling at 256-GPU scale.

    Authors: Block-level sharding preserves exact attention semantics because each block is processed identically to a standard implementation; P2P communication merely exchanges the required KV blocks without changing the computation graph or numerical results. Consequently, attention outputs are bitwise identical and optimizer trajectories remain unchanged. We will add a dedicated subsection in Evaluation (new Section 5.2) that includes small-scale bitwise-equivalence tests, explicit communication-volume measurements, and overhead analysis showing that flexible P2P plus bin-packing yields lower total cost than rigid ring methods, consistent with the observed near-linear scaling. revision: yes

  3. Referee: [Results] Results: the assertion that bin-packing “achieves both high compute efficiency and balanced workload” is unsupported by packing-efficiency metrics, per-sequence load-balance statistics, or communication-volume comparisons, which are required to substantiate the weakest assumption that overhead remains negligible on high-variance real pre-training datasets.

    Authors: We agree that more explicit quantitative support is warranted. We will augment the Results section with packing-efficiency metrics (average packing density and wasted-compute percentage), per-sequence load-balance statistics (workload variance and max/min ratio across workers), and direct communication-volume comparisons (bytes transferred versus ring CP). These will appear in new tables and figures to substantiate the efficiency and balance claims on the high-variance dataset. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical system evaluation

full rationale

The paper introduces FCP as a block-level sharding and bin-packing approach to context parallelism, supported solely by reported runtime measurements (near-linear scaling to 256 GPUs and 1.13–2.21× attention MFU gains). No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing premises appear in the abstract or described content. All central claims rest on external benchmark runs rather than any reduction to inputs by construction, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach assumes standard distributed-training primitives (block sharding preserves semantics, P2P communication is feasible at scale) without introducing new fitted constants or invented entities.

axioms (1)
  • domain assumption Dividing sequences into blocks does not alter model semantics or convergence behavior.
    Implicit requirement for any block-level sharding scheme to be valid.

pith-pipeline@v0.9.0 · 5492 in / 1173 out tokens · 54913 ms · 2026-05-12T01:31:38.287358+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 3 internal anchors

  1. [1]

    2025 , eprint=

    MAGI-1: Autoregressive Video Generation at Scale , author=. 2025 , eprint=

  2. [2]

    2023 , eprint=

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel , author=. 2023 , eprint=

  3. [3]

    2024 , eprint=

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model , author=. 2024 , eprint=

  4. [4]

    Csárdi, Gábor and Nepusz, Tamás , journal =

  5. [5]

    2024 , eprint=

    FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision , author=. 2024 , eprint=

  6. [6]

    2024 , eprint=

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis , author=. 2024 , eprint=

  7. [7]

    Hopcroft–Karp algorithm --- W ikipedia , The Free Encyclopedia

    Wikipedia. Hopcroft–Karp algorithm --- W ikipedia , The Free Encyclopedia. 2025

  8. [8]

    ArXiv , year=

    Striped Attention: Faster Ring Attention for Causal Transformers , author=. ArXiv , year=

  9. [9]

    2023 , eprint=

    Attention Is All You Need , author=. 2023 , eprint=

  10. [10]

    2023 , eprint=

    Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU , author=. 2023 , eprint=

  11. [11]

    Dao, Tri , booktitle=. Flash

  12. [12]

    2025 , eprint=

    Optimizing SLO-oriented LLM Serving with PD-Multiplexing , author=. 2025 , eprint=

  13. [13]

    Flash- infer: Efficient and customizable attention engine for llm inference serving.arXiv preprint arXiv:2501.01005,

    FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving , author =. arXiv preprint arXiv:2501.01005 , year =

  14. [14]

    Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving , url =

    Zhao, Yilong and Lin, Chien-Yu and Zhu, Kan and Ye, Zihao and Chen, Lequn and Zheng, Size and Ceze, Luis and Krishnamurthy, Arvind and Chen, Tianqi and Kasikci, Baris , booktitle =. Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving , url =

  15. [15]

    and Ermon, Stefano and Rudra, Atri and R

    Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R. Flash. Advances in Neural Information Processing Systems (NeurIPS) , year=

  16. [16]

    2024 , eprint=

    USP: A Unified Sequence Parallelism Approach for Long Context Generative AI , author=. 2024 , eprint=

  17. [17]

    2023 , eprint=

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models , author=. 2023 , eprint=

  18. [18]

    2024 , eprint=

    Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters , author=. 2024 , eprint=

  19. [19]

    Longest-processing-time-first scheduling --- W ikipedia , The Free Encyclopedia

    Wikipedia. Longest-processing-time-first scheduling --- W ikipedia , The Free Encyclopedia. 2025

  20. [20]

    Sivamani, Kirthi Shankar and Moon, Tim and Tredak, Przemyslaw and Yang, Charlene and Nguyen, Phuong , year =. NVIDIA/

  21. [21]

    2025 , eprint=

    Hall's marriage theorem , author=. 2025 , eprint=

  22. [22]

    2025 , note =

    Dylan Patel , title =. 2025 , note =

  23. [23]

    2025 , note =

    NVIDIA , title =. 2025 , note =

  24. [24]

    2025 , eprint=

    Qwen3 Technical Report , author=. 2025 , eprint=

  25. [25]

    2024 , eprint=

    The Llama 3 Herd of Models , author=. 2024 , eprint=

  26. [26]

    2024 , eprint=

    Two Results on LPT: A Near-Linear Time Algorithm and Parcel Delivery using Drones , author=. 2024 , eprint=

  27. [27]

    2023 , eprint=

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints , author=. 2023 , eprint=

  28. [28]

    2020 , eprint=

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism , author=. 2020 , eprint=

  29. [29]

    2025 , howpublished=

    MagiAttention: A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Mask Training , author=. 2025 , howpublished=

  30. [30]

    2023 , eprint=

    Punica: Multi-Tenant LoRA Serving , author=. 2023 , eprint=

  31. [31]

    2025 , eprint=

    Context Parallelism for Scalable Million-Token Inference , author=. 2025 , eprint=

  32. [32]

    2022 , eprint=

    Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning , author=. 2022 , eprint=

  33. [33]

    2024 , note =

    David Ramel , title =. 2024 , note =

  34. [34]

    arXiv preprint arXiv:2405.21015 , author =

    Severson, Matthew and others , title =. arXiv preprint arXiv:2405.21015 , year =

  35. [35]

    2025 , note =

    Apple Machine Learning Research , title =. 2025 , note =

  36. [36]

    2024 , note =

    Meta AI Research , title =. 2024 , note =

  37. [37]

    DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism , url=

    Jiang, Chenyu and Cai, Zhenkun and Tian, Ye and Jia, Zhen and Wang, Yida and Wu, Chuan , year=. DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism , url=. doi:10.1145/3731569.3764849 , booktitle=

  38. [38]

    2025 , eprint=

    BurstEngine: an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens , author=. 2025 , eprint=

  39. [39]

    2025 , eprint=

    Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity , author=. 2025 , eprint=

  40. [40]

    2025 , eprint=

    Efficient Long-context Language Model Training by Core Attention Disaggregation , author=. 2025 , eprint=

  41. [41]

    2025 , note =

    Zewei Tao and Yunpeng Huang , title =. 2025 , note =

  42. [42]

    2025 , note =

    Zheng Wang , title =. 2025 , note =

  43. [43]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    H. Liu and others , title =. arXiv preprint arXiv:2310.01889 , year =

  44. [44]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Goyal, Priya and Doll. Accurate, Large Minibatch. arXiv preprint arXiv:1706.02677 , year =

  45. [45]

    Horovod: fast and easy distributed deep learning in TensorFlow

    Sergeev, Alexander and Del Balso, Mike , title =. arXiv preprint arXiv:1802.05799 , year =

  46. [46]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan , title =. arXiv preprint arXiv:1909.08053 , year =

  47. [47]

    ByteScale: Communication-Efficient Scaling of LLM Training with a 2048K Context Length on 16384 GPUs , url=

    Ge, Hao and Feng, Junda and Huang, Qi and Fu, Fangcheng and Nie, Xiaonan and Zuo, Lei and Lin, Haibin and Cui, Bin and Liu, Xin , year=. ByteScale: Communication-Efficient Scaling of LLM Training with a 2048K Context Length on 16384 GPUs , url=. doi:10.1145/3718958.3754352 , booktitle=

  48. [48]

    2024 , eprint=

    LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism , author=. 2024 , eprint=

  49. [49]

    Proceedings of the 19th USENIX Conference on Operating Systems Design and Implementation , articleno =

    Wang, Zheng and Cai, Anna and Xie, Xinfeng and Pan, Zaifeng and Guan, Yue and Chu, Weiwei and Wang, Jie and Li, Shikai and Huang, Jianyu and Cai, Chris and Hao, Yuchen and Ding, Yufei , title =. Proceedings of the 19th USENIX Conference on Operating Systems Design and Implementation , articleno =. 2025 , isbn =

  50. [50]

    2025 , eprint=

    FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism , author=. 2025 , eprint=

  51. [51]

    2024 , eprint=

    DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training , author=. 2024 , eprint=

  52. [52]

    and Parthasarathy, S

    Ravikumar, A. and Parthasarathy, S. and Thyagarajan, K. and others , title =. Heliyon , year =

  53. [53]

    Proceedings of the 38th International Conference on Machine Learning (ICML) , year =

    Narayanan, Deepak and Shoeybi, Mohammad and Cho, Tushar and others , title =. Proceedings of the 38th International Conference on Machine Learning (ICML) , year =

  54. [54]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He

    Rajbhandari, Samyam and Rasley, Jeff and Ruwase, Olatunji and He, Yuxiong , title =. arXiv preprint arXiv:1910.02054 , year =

  55. [55]

    2025 , eprint=

    Seedance 1.0: Exploring the Boundaries of Video Generation Models , author=. 2025 , eprint=

  56. [56]

    2024 , eprint=

    Gemini Technical Report , author=. 2024 , eprint=

  57. [57]

    2024 , howpublished=

  58. [58]

    Proceedings of the ACM on Management of Data , volume=

    Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement , author=. Proceedings of the ACM on Management of Data , volume=. 2023 , publisher=