HSAP: A Hierachical Sequence-aware Parallelism for Hybrid-Context Generative Models
Pith reviewed 2026-06-30 07:24 UTC · model grok-4.3
The pith
A hierarchical sequence-aware parallelism algorithm computes correct causal attention on hybrid-context packed sequences across devices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Sequence-Aware Parallelism algorithm conquers intensive tensor transmission and partial attention computation across device groups by using JIT compilation to optimize the communication strategy of all device groups at the NCCL level; when embedded in the hierarchical framework, this enables correct causal attention on hybrid-context packed sequences while preserving high parallelism degrees.
What carries the argument
The Sequence-Aware Parallelism algorithm, which applies JIT compilation to tune NCCL communication for correct partial causal attention across device groups on hybrid-context sequences.
If this is right
- Sequence parallelism can be applied to packed hybrid-context data at full degree without attention contamination.
- Memory and communication overhead can be managed hierarchically while retaining the benefits of the sequence-aware method.
- Training and fine-tuning of generative models on packed sequences becomes feasible at larger scale across multiple devices.
Where Pith is reading between the lines
- The approach may combine with tensor or pipeline parallelism to support even larger models without redesigning attention kernels.
- Similar communication optimization could apply to other distributed attention patterns beyond causal masks.
- If the JIT strategy generalizes, it could reduce the need to limit context packing in production LLM pipelines.
Load-bearing premise
The JIT-optimized NCCL communication strategy correctly assembles partial causal attention results on hybrid-context sequences without errors or prohibitive extra cost.
What would settle it
Compare attention output tensors produced by the algorithm on a batch of hybrid-context packed sequences against the same computation run without any sequence parallelism; any mismatch or unexpectedly high communication volume would disprove the claim.
Figures
read the original abstract
In this paper, we aim to combine the advantages of existing sequence parallelism paradigms and overcomes their drawbacks, the most serious of which is the incapability to correctly compute causal attention on the hybrid-context packed sequences, in a stronger sequence parallelism framework. The practical technique of packing sequences for efficiently pretraining and fine-tuning large language models causes cross-contamination problem in attention computation, which can be effectively solved when no parallelism in the sequence length dimension is taken. However, in sequence parallelism, existing approaches either ignore the scenario of hybrid-context sequences or conversely sacrifice and limit parallelism degree for supporting the scenario. To this end, we innovatively propose an efficient Sequence-Aware Parallelism algorithm to conquer the obstacles of intensive tensor transmission and partial attention computation across multiple device groups. Our algorithm utilizes JIT (Just-In-Time) compilation to optimize the communication strategy of all device groups in NCCL level. Further, we integrate existing sequence parallelism paradigms into a Hierachical Sequence-Aware Parallelism framework which benefits from our sequence-aware algorithm. We additionally elaborate on the memory and communication overhead management of the hierachical framework to optimize its performance. Through multiple experiments, we demonstrate that our proposed approach outperform other state-of-the-arts sequence parallelism approches in multiple metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HSAP, a hierarchical sequence-aware parallelism framework for hybrid-context generative models. It introduces a Sequence-Aware Parallelism algorithm that uses JIT compilation to optimize NCCL-level communication across device groups, enabling correct partial causal attention computation on packed hybrid-context sequences without cross-contamination. The framework integrates existing sequence parallelism methods, manages memory and communication overhead, and claims to outperform prior sequence parallelism approaches in multiple metrics based on experiments.
Significance. If the central claims hold, the work would address a practical limitation in sequence parallelism for packed sequences during LLM pretraining and fine-tuning, potentially allowing higher degrees of parallelism while preserving causality. The emphasis on JIT-optimized communication and hierarchical integration could offer efficiency gains, though the absence of any supporting derivations or results makes the significance currently speculative.
major comments (2)
- [Abstract] Abstract: The central claim that the Sequence-Aware Parallelism algorithm 'correctly compute partial causal attention on hybrid-context sequences across device groups without introducing errors' is asserted without any equations, mask-handling logic, communication schedule, or verification that the JIT strategy at NCCL level preserves causality when tensors are split and exchanged. This mechanism is load-bearing for the paper's advantage over existing sequence parallelism methods.
- [Abstract] Abstract: The statement that the approach 'outperform other state-of-the-arts sequence parallelism approches in multiple metrics' through 'multiple experiments' is unsupported by any reported data, tables, error bars, model sizes, datasets, or experimental setup, preventing assessment of whether the hierarchical framework delivers the claimed benefits.
minor comments (3)
- [Abstract] Typo: 'Hierachical' should be spelled 'Hierarchical'.
- [Abstract] Typo: 'approches' should be 'approaches'.
- [Abstract] The abstract is overly dense; clearer separation between the problem statement, the proposed algorithm, the hierarchical framework, and the overhead management would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. The two major points both concern the abstract's high-level claims. We agree these claims require stronger grounding and will revise the manuscript to incorporate the requested details from the algorithm description and experimental evaluation.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that the Sequence-Aware Parallelism algorithm 'correctly compute partial causal attention on hybrid-context sequences across device groups without introducing errors' is asserted without any equations, mask-handling logic, communication schedule, or verification that the JIT strategy at NCCL level preserves causality when tensors are split and exchanged. This mechanism is load-bearing for the paper's advantage over existing sequence parallelism methods.
Authors: We agree the abstract alone does not supply the supporting derivations. The manuscript body contains the Sequence-Aware Parallelism algorithm description, including the equations governing partial causal attention on packed hybrid-context sequences, the mask construction logic that prevents cross-contamination across device groups, the JIT-optimized NCCL communication schedule, and the verification that causality is preserved under tensor splitting and exchange. We will revise the abstract to reference these elements explicitly and, if needed, add a concise summary of the mask and communication logic. revision: yes
-
Referee: [Abstract] Abstract: The statement that the approach 'outperform other state-of-the-arts sequence parallelism approches in multiple metrics' through 'multiple experiments' is unsupported by any reported data, tables, error bars, model sizes, datasets, or experimental setup, preventing assessment of whether the hierarchical framework delivers the claimed benefits.
Authors: We acknowledge that the abstract references experimental outcomes without presenting the supporting data. The manuscript includes an experiments section reporting comparisons against prior sequence parallelism methods across multiple metrics, with tables, error bars, model sizes, datasets, and experimental configurations. We will revise the abstract to include a brief, quantitative summary of the key results or qualify the performance claim until the full results are visible in the abstract. revision: yes
Circularity Check
No circularity: algorithmic proposal is self-contained with no self-referential reductions
full rationale
The paper introduces a new Sequence-Aware Parallelism algorithm and hierarchical framework as an independent engineering contribution, supported by experimental results rather than any derivation chain. No equations, fitted parameters, uniqueness theorems, or self-citations are invoked in a load-bearing way that reduces the central claim to its own inputs by construction. The abstract and description frame the work as overcoming prior limitations through a novel JIT-optimized NCCL strategy, without any self-definitional loops or renamed known results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Causal attention must be computed correctly without cross-contamination on packed hybrid-context sequences
invented entities (2)
-
Sequence-Aware Parallelism algorithm
no independent evidence
-
Hierachical Sequence-Aware Parallelism framework
no independent evidence
Reference graph
Works this paper leans on
-
[2]
Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, PeterJ. , year=. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , journal=
-
[3]
Online normalizer calculation for softmax
Milakov, Maxim and Gimelshein, Natalia , year=. Online normalizer calculation for softmax. , journal=
-
[4]
2023 , month=
Structured Packing in LLM Training Improves Long Context Utilization , author=. 2023 , month=
2023
-
[5]
LLaMA: Open and Efficient Foundation Language Models
LLaMA: Open and Efficient Foundation Language Models , author=. arXiv preprint arXiv:2302.13971 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
and Stoica, Ion and Xing, Eric P
Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , month =. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ url =
-
[7]
Proceedings of the 34th International Conference on Neural Information Processing Systems , pages=
Language models are few-shot learners , author=. Proceedings of the 34th International Conference on Neural Information Processing Systems , pages=
-
[8]
Proceedings of the 31st International Conference on Neural Information Processing Systems , pages=
Attention is all you need , author=. Proceedings of the 31st International Conference on Neural Information Processing Systems , pages=
-
[10]
NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following , year=
How Long Can Context Length of Open-Source LLMs truly Promise? , author=. NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following , year=
2023
-
[11]
2023 , url =
MosaicML NLP Team , title =. 2023 , url =
2023
-
[13]
YaRN: Efficient Context Window Extension of Large Language Models
Yarn: Efficient context window extension of large language models , author=. arXiv preprint arXiv:2309.00071 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Scaling vision transformers to gigapixel images via hierarchical self-supervised learning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[16]
Advances in Neural Information Processing Systems , volume=
Combiner: Full attention transformer with sparse computation cost , author=. Advances in Neural Information Processing Systems , volume=
-
[17]
Transactions of the Association for Computational Linguistics , volume=
Efficient Content-Based Sparse Attention with Routing Transformers , author=. Transactions of the Association for Computational Linguistics , volume=
-
[19]
NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following , year=
Ring Attention with Blockwise Transformers for Near-Infinite Context , author=. NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following , year=
2023
-
[21]
Proceedings of Machine Learning and Systems , volume=
Reducing activation recomputation in large transformer models , author=. Proceedings of Machine Learning and Systems , volume=
-
[23]
Proceedings of the 52nd International Conference on Parallel Processing , pages=
Colossal-ai: A unified deep learning system for large-scale parallel training , author=. Proceedings of the 52nd International Conference on Parallel Processing , pages=
-
[24]
Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ NeurIPS 2023) , year=
LightSeq:: Sequence Level Parallelism for Distributed Training of Long Context Transformers , author=. Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ NeurIPS 2023) , year=
2023
-
[25]
arXiv preprint arXiv:2311.02382 , year=
Ultra-Long Sequence Distributed Transformer , author=. arXiv preprint arXiv:2311.02382 , year=
-
[26]
International Conference on Learning Representations , year=
Reformer: The Efficient Transformer , author=. International Conference on Learning Representations , year=
-
[27]
Linformer: Self-Attention with Linear Complexity
Linformer: Self-attention with linear complexity , author=. arXiv preprint arXiv:2006.04768 , year=
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[28]
Advances in Neural Information Processing Systems , volume=
Luna: Linear Unified Nested Attention , author=. Advances in Neural Information Processing Systems , volume=
-
[29]
Advances in Neural Information Processing Systems , volume=
Flashattention: Fast and memory-efficient exact attention with io-awareness , author=. Advances in Neural Information Processing Systems , volume=
-
[30]
The Twelfth International Conference on Learning Representations , year=
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning , author=. The Twelfth International Conference on Learning Representations , year=
-
[31]
and Ermon, Stefano and Rudra, Atri and Re, Christopher , year =
Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and Re, Christopher , year =. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness , DOI =
-
[32]
and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis , year =
Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and Zidek, Augustin and Potapenko, Anna..nyals, Oriol and Senior, Andrew W. and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis , year =. Highly accurate protein structure prediction with...
-
[33]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces , DOI =
Gu, Albert and Dao, Tri , year =. Mamba: Linear-Time Sequence Modeling with Selective State Spaces , DOI =
-
[34]
Zhang, Zhenyuan and Zhao, Qihang and Zhou, Peng and Zhou, Qinghua and Zhu, Jian and Zhu, Rui-Jie , year =
Peng, Bo and Alcaide, Eric and Anthony, Quentin and Albalak, Alon and Arcadinho, Samuel and Biderman, Stella and Cao, Huanqi and Cheng, Xin and Chung, Michael and Grella, Matteo and GV, Kranthi Kira.. Zhang, Zhenyuan and Zhao, Qihang and Zhou, Peng and Zhou, Qinghua and Zhu, Jian and Zhu, Rui-Jie , year =. RWKV: Reinventing RNNs for the Transformer Era , DOI =
-
[35]
and Salakhutdinov, Ruslan , year =
Dai, Zihang and Yang, Zhilin and Yang, Yiming and Carbonell, Jaime and Le, Quoc V. and Salakhutdinov, Ruslan , year =. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context , DOI =
-
[36]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning , DOI =
Dao, Tri , year =. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning , DOI =
-
[37]
and Zettlemoyer, Luke and Yih, Scott and Lewis, Mike , year =
Shi, Weijia and Min, Sewon and Lomeli, Maria and Zhou, Chunting and Li, Margaret and James, Rich and Lin, Xi Victoria and Smith, Noah A. and Zettlemoyer, Luke and Yih, Scott and Lewis, Mike , year =. In-Context Pretraining: Language Modeling Beyond Document Boundaries , DOI =
-
[38]
Evolutionary-scale prediction of atomic-level protein structure with a language model , volume =
Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and dos Santos Costa, Allan and Fazel-Zarandi, Maryam and Sercu, Tom and Candido, Salvatore and Rives, Alexander , year =. Evolutionary-scale prediction of atomic-level protein structure w...
-
[39]
Block-State Transformers , repository =
Fathi, Mahan and Pilault, Jonathan and Firat, Orhan and Pal, Christopher and Bacon, Pierre-Luc and Goroshin, Ross , year =. Block-State Transformers , repository =
-
[40]
01-ai/Yi: A series of large language models trained from scratch by developers @01-ai , URL =
01-ai, , year =. 01-ai/Yi: A series of large language models trained from scratch by developers @01-ai , URL =
-
[41]
, year =
Taori, Rohan and Gulrajani, Ishaan and Zhang, Tianyi and Dubois, Yann and Li, Xuechen and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B. , year =. Stanford Alpaca: An Instruction-following LLaMA model , publisher =
-
[42]
and Gonzalez, Joseph E
Li, Dacheng and Shao, Rulin and Xie, Anze and Xing, Eric P. and Gonzalez, Joseph E. and Stoica, Ion and Ma, Xuezhe and Zhang, Hao , year =. LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers , repository =
-
[43]
and Fitzgibbon, Andrew , year =
Krell, Mario Michael and Kosec, Matej and Perez, Sergio P. and Fitzgibbon, Andrew , year =. Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance , DOI =
-
[44]
De Vries, Harm , title =
-
[45]
2024 , eprint=
World Model on Million-Length Video And Language With Blockwise RingAttention , author=. 2024 , eprint=
2024
-
[46]
2024 , url=
Video generation models as world simulators , author=. 2024 , url=
2024
-
[47]
2023 , month =
GPT-4 Technical Report , DOI =. 2023 , month =
2023
-
[48]
2023 , month =
Gemini: A Family of Highly Capable Multimodal Models , DOI =. 2023 , month =
2023
-
[50]
Together Computer , title =
-
[51]
2023 , eprint=
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations , author=. 2023 , eprint=
2023
-
[52]
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. 2024. https://openai.com/research/video-generation-models-as-world-simulators Video generation models as world simulators
2024
-
[53]
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 1877--1901
2020
-
[54]
Together Computer. 2023. https://github.com/togethercomputer/RedPajama-Data Redpajama: an open dataset for training large language models
2023
-
[55]
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R \'e . 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344--16359
2022
-
[56]
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. https://arxiv.org/abs/2305.14233 Enhancing chat language models by scaling high-quality instructional conversations . Preprint, arXiv:2305.14233
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[57]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team , Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Millican..na, Tim Green, Demis Hassabis, Koray Kavukcuoglu, Jeffrey Dean, and Oriol Vinyals. 2023. https://doi.org/10.48550/arXiv.2312.11805 Gemini: A family of highly capable multimodal models . ArXiv:2312....
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.11805 2023
-
[58]
Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, and Yuxiong He. 2023. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[59]
Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5
2023
-
[60]
Mario Michael Krell, Matej Kosec, Sergio P. Perez, and Andrew Fitzgibbon. 2022. https://doi.org/10.48550/arXiv.2107.02027 Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance . ArXiv:2107.02027 [cs, math]
-
[61]
Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. 2023 a . How long can context length of open-source llms truly promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following
2023
-
[62]
Dacheng Li, Rulin Shao, Anze Xie, Eric P Xing, Joseph E Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. 2023 b . Lightseq:: Sequence level parallelism for distributed training of long context transformers. In Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ NeurIPS 2023)
2023
-
[63]
Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, and Yang You. 2023 c . Colossal-ai: A unified deep learning system for large-scale parallel training. In Proceedings of the 52nd International Conference on Parallel Processing, pages 766--775
2023
-
[64]
Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. 2023 d . https://doi.org/10.18653/v1/2023.acl-long.134 Sequence parallelism: Long sequence training from system perspective . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2391--2404, Toronto, Canada. Associati...
-
[65]
Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. 2024. https://arxiv.org/abs/2402.08268 World model on million-length video and language with blockwise ringattention . Preprint, arXiv:2402.08268
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[66]
Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023 a . Ring attention with blockwise transformers for near-infinite context. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following
2023
-
[67]
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023 b . Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[68]
Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. 2024. https://doi.org/10.57967/hf/2497 Fineweb-edu
-
[69]
Maxim Milakov and Natalia Gimelshein. 2018. Online normalizer calculation for softmax. arXiv: Performance,arXiv: Performance
2018
-
[70]
OpenAI , Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, ..rvin Anadkat, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. 2023. https://doi.org/10.48550/arXiv.2303.08774 Gpt-4 technical report . ArXiv:2303.08774 [cs]
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023
-
[71]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and PeterJ. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv: Learning,arXiv: Learning
2019
-
[72]
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. https://doi.org/10.1145/3394486.3406703 Deepspeed
-
[73]
Smith, Luke Zettlemoyer, Scott Yih, and Mike Lewis
Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Rich James, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, and Mike Lewis. 2023. https://doi.org/10.48550/arXiv.2310.10638 In-context pretraining: Language modeling beyond document boundaries . ArXiv:2310.10638 [cs]
-
[74]
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[75]
Konrad Staniszewski, Szymon Tworkowski, Sebastian Jaszczur, Henryk Michalewski, -L ukasz Kuci’nski, and Piotr Mi l o’s. 2023. Structured packing in llm training improves long context utilization
2023
-
[76]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000--6010
2017
- [77]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.