Recognition: 1 theorem link · Lean theorem
HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments
Pith reviewed 2026-05-16 22:17 UTC · model grok-4.3
The pith
HetRL models LLM reinforcement learning scheduling on mixed GPUs as one joint optimization problem and solves it with hybrid or exact algorithms to raise throughput.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HetRL formulates the scheduling of RL training in heterogeneous environments as a constrained joint optimization problem and provides two complementary approaches: a hybrid scheduling algorithm that efficiently identifies near-optimal solutions and an ILP-based scheduling algorithm that obtains optimal solutions, enabling flexible trade-offs between solution optimality and efficiency. Extensive evaluation shows that HetRL achieves up to 9.17x the throughput of state-of-the-art systems and 3.17x on average.
What carries the argument
The constrained joint optimization formulation that encodes all computation and data dependencies across the multiple models and tasks of an LLM RL workflow, solved by either a hybrid heuristic or an integer-linear-programming algorithm.
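To make the shape of that formulation concrete, here is a minimal, hypothetical ILP sketch in the spirit of the paper's second algorithm, not its actual model: it places the tasks of one RL iteration onto heterogeneous GPUs and minimizes makespan under dependency constraints, using the open-source PuLP solver. The task names, profiled costs, and GPU list are invented for illustration, and per-GPU serialization (two tasks sharing a GPU must not overlap) is deliberately omitted to keep the sketch short.

```python
# Illustrative ILP scheduling sketch (not HetRL's exact formulation).
# Requires: pip install pulp
import pulp

# Tasks of a single RL training iteration and their dependency edges.
tasks = ["rollout", "reward", "critic_fwd", "actor_update"]
deps = [("rollout", "reward"), ("reward", "critic_fwd"), ("critic_fwd", "actor_update")]
gpus = ["A100", "V100", "RTX3090"]

# Profiled per-(task, GPU) runtimes in seconds (hypothetical numbers).
cost = {
    ("rollout", "A100"): 10, ("rollout", "V100"): 18, ("rollout", "RTX3090"): 22,
    ("reward", "A100"): 4, ("reward", "V100"): 7, ("reward", "RTX3090"): 9,
    ("critic_fwd", "A100"): 6, ("critic_fwd", "V100"): 11, ("critic_fwd", "RTX3090"): 14,
    ("actor_update", "A100"): 8, ("actor_update", "V100"): 15, ("actor_update", "RTX3090"): 19,
}

prob = pulp.LpProblem("rl_schedule", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (tasks, gpus), cat="Binary")  # x[t][g] = 1 if t runs on g
start = pulp.LpVariable.dicts("start", tasks, lowBound=0)    # start time of each task
makespan = pulp.LpVariable("makespan", lowBound=0)
prob += makespan                                             # objective: minimize makespan

def duration(t):
    # Runtime of task t under its chosen placement.
    return pulp.lpSum(cost[(t, g)] * x[t][g] for g in gpus)

for t in tasks:
    prob += pulp.lpSum(x[t][g] for g in gpus) == 1           # each task placed exactly once
    prob += start[t] + duration(t) <= makespan               # every task finishes in time
for a, b in deps:
    prob += start[b] >= start[a] + duration(a)               # respect data dependencies

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for t in tasks:
    g = next(g for g in gpus if pulp.value(x[t][g]) > 0.5)
    print(f"{t:>13} -> {g:<8} start={pulp.value(start[t]):.1f}")
print("makespan:", pulp.value(makespan))
```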
If this is right
- Mid-range and older GPUs become usable for LLM RL training without large efficiency losses.
- Operators can trade scheduler run time against schedule quality on the fly by choosing the hybrid or ILP solver.
- Heterogeneous clusters no longer need to be partitioned into homogeneous sub-clusters for RL workloads.
- The same modeling approach can be applied to other multi-model training pipelines that share data dependencies.
Where Pith is reading between the lines
- The same joint-optimization style may extend to inference serving or fine-tuning pipelines that also mix heterogeneous accelerators.
- Dynamic re-optimization when GPUs join or leave the cluster could be added by periodically re-solving the same model (see the sketch after this list).
- If the optimization model turns out to be too slow for very large clusters, lighter machine-learning-based approximations of the same constraints become a natural next step.
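If that dynamic re-optimization route were taken, the control loop could be as simple as the sketch below. This is a reading of the bullet above, not anything in the paper: discover_gpus, solve_schedule, and apply_schedule are hypothetical callables standing in for a health checker, either HetRL solver, and a job reconfigurator.

```python
import time

def reoptimize_on_membership_change(discover_gpus, solve_schedule, apply_schedule,
                                    poll_seconds=30.0):
    """Hypothetical control loop: re-solve the same optimization model
    whenever the set of reachable GPUs changes.

    discover_gpus:  () -> iterable of GPU ids (e.g. from a health checker)
    solve_schedule: (frozenset of GPU ids) -> schedule (hybrid or ILP solver)
    apply_schedule: (schedule) -> None (reconfigure the running job)
    """
    current = frozenset(discover_gpus())
    apply_schedule(solve_schedule(current))
    while True:
        time.sleep(poll_seconds)
        seen = frozenset(discover_gpus())
        if seen != current:                       # a GPU joined or left
            current = seen
            apply_schedule(solve_schedule(seen))  # same model, new inputs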
Load-bearing premise
The dependencies among models and tasks in LLM reinforcement learning can be captured accurately enough by a mathematical optimization model that the resulting schedule is both correct and fast enough to justify the modeling effort.
What would settle it
A direct measurement on a real heterogeneous cluster showing either that the optimizer's schedules yield actual runtimes far worse than the model predicted, or that the time spent solving the optimization outweighs the throughput gains it delivers.
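Stated as arithmetic, those two failure conditions reduce to checks like the following hypothetical helpers; all inputs are quantities such an experiment would measure, and the names are ours.

```python
def prediction_gap(predicted_s: float, measured_s: float) -> float:
    """Relative error between the optimizer's predicted runtime and reality."""
    return abs(measured_s - predicted_s) / predicted_s

def optimizer_pays_off(solve_s: float, baseline_run_s: float,
                       scheduled_run_s: float) -> bool:
    """The optimizer is worthwhile only if solve time plus the scheduled
    runtime beats simply running the baseline schedule."""
    return solve_s + scheduled_run_s < baseline_run_s
```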
Original abstract
As large language models (LLMs) continue to scale and new GPUs are released even more frequently, there is an increasing demand for LLM post-training in heterogeneous environments to fully leverage underutilized mid-range or previous-generation GPUs and alleviate the shortage of homogeneous high-end GPUs within a single availability zone. However, achieving high-performance reinforcement learning (RL) training for LLMs on such computing resources remains challenging because the workflow involves multiple models and tasks with complex computation and data dependencies. In this paper, we present HetRL, a distributed system for efficient RL training in infrastructures with heterogeneous GPUs and networks. HetRL formulates the scheduling of RL training in heterogeneous environments as a constrained joint optimization problem and provides two complementary approaches for addressing this problem: (1) a hybrid scheduling algorithm that efficiently identifies near-optimal solutions, and (2) an integer linear programming (ILP)-based scheduling algorithm that obtains optimal solutions, enabling flexible trade-offs between solution optimality and efficiency. Our extensive evaluation, consuming 20,000 GPU-hours, shows that HetRL achieves up to 9.17x the throughput of state-of-the-art systems, and 3.17x on average, across a wide range of workloads and settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HetRL, a distributed system for efficient reinforcement learning training of LLMs in heterogeneous GPU and network environments. It formulates the scheduling of multi-model RL workflows as a constrained joint optimization problem and provides a hybrid scheduling algorithm for near-optimal solutions along with an ILP-based algorithm for optimal solutions, claiming up to 9.17x throughput (3.17x on average) over state-of-the-art systems based on an evaluation consuming 20,000 GPU-hours across diverse workloads.
Significance. If the performance claims hold under rigorous validation of the modeling assumptions, HetRL could meaningfully advance practical LLM post-training by enabling effective use of mixed-generation GPUs, reducing dependence on scarce high-end homogeneous clusters. The scale of the empirical evaluation is a notable strength, providing broad coverage of workloads and settings that supports potential real-world applicability.
major comments (2)
- The headline throughput gains (up to 9.17x) rest on the claim that the constrained joint optimization accurately encodes all computation and data dependencies (model updates, activations, gradients, heterogeneous memory/network) with negligible abstraction error. The manuscript must explicitly detail how dynamic aspects such as variable sequence lengths, on-the-fly KV cache sizing, and non-stationary network contention are represented in the optimization formulation; without this, it is unclear whether the computed schedules translate to the reported speedups in real execution.
- The evaluation reports 20,000 GPU-hours of results but provides insufficient information on exact baseline implementations, workload definitions (e.g., specific RL tasks, model sizes, sequence length distributions), and potential confounding factors such as whether all systems were tested under identical heterogeneous GPU/network conditions. This weakens support for the central claim that the gains are attributable to the proposed scheduling rather than experimental setup differences.
minor comments (2)
- Clarify the precise definitions of 'state-of-the-art systems' used for comparison and include a table summarizing their configurations relative to HetRL.
- Add error bars or statistical significance measures to all throughput figures to convey variability across runs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments identify important areas for improving clarity around the optimization modeling and evaluation transparency. We address each point below and will incorporate the suggested revisions to strengthen the paper.
Point-by-point responses
Referee: The headline throughput gains (up to 9.17x) rest on the claim that the constrained joint optimization accurately encodes all computation and data dependencies (model updates, activations, gradients, heterogeneous memory/network) with negligible abstraction error. The manuscript must explicitly detail how dynamic aspects such as variable sequence lengths, on-the-fly KV cache sizing, and non-stationary network contention are represented in the optimization formulation; without this, it is unclear whether the computed schedules translate to the reported speedups in real execution.
Authors: We appreciate this comment on the modeling assumptions. The joint optimization in Section 3 represents the RL workflow as a DAG with nodes for forward/backward passes and model updates; computation costs are obtained via offline profiling using representative sequence length distributions from each workload, while KV cache memory is encoded as per-layer capacity constraints using the maximum profiled size. Non-stationary network contention is addressed through the hybrid scheduler's monitoring loop that triggers re-optimization when bandwidth deviates beyond a threshold (detailed in Section 4.3). We agree the exposition can be strengthened and will add explicit equations and a short subsection in the revision showing how these dynamic factors are abstracted with bounded error via profiling. This will clarify the link to observed speedups.
Revision: partial
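As a reading aid, the bandwidth-deviation trigger the authors describe amounts to a relative-deviation check like the sketch below; the 20% threshold and all names are our stand-ins, not code from the paper.

```python
def should_reoptimize(profiled_gbps: float, measured_gbps: float,
                      rel_threshold: float = 0.2) -> bool:
    """Re-solve when measured link bandwidth drifts past a relative
    threshold from the value the current schedule was profiled against."""
    if profiled_gbps <= 0:
        return True  # no usable baseline profile: force a re-solve
    return abs(measured_gbps - profiled_gbps) / profiled_gbps > rel_threshold
```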
Referee: The evaluation reports 20,000 GPU-hours of results but provides insufficient information on exact baseline implementations, workload definitions (e.g., specific RL tasks, model sizes, sequence length distributions), and potential confounding factors such as whether all systems were tested under identical heterogeneous GPU/network conditions. This weakens support for the central claim that the gains are attributable to the proposed scheduling rather than experimental setup differences.
Authors: We thank the referee for highlighting the need for greater evaluation detail. Section 5.1 already specifies the baselines as Megatron-LM and DeepSpeed with their schedulers ported to the heterogeneous setting, workloads as PPO/DPO/GRPO on 7B–70B models, and sequence lengths drawn from the distributions in Table 2. All runs used the same physical cluster (A100/V100/RTX 3090 mix) with network conditions controlled via traffic control emulation. To fully address the concern we will expand the section with an appendix table listing exact model configurations, sequence length histograms, baseline code versions, and a discussion of controlled variables. These additions will be included in the revised manuscript.
Revision: yes
Circularity Check
No significant circularity; performance claims rest on direct empirical measurement
full rationale
The paper models heterogeneous RL scheduling as a constrained joint optimization problem solved via hybrid or ILP algorithms, then reports throughput gains (up to 9.17x, 3.17x average) from 20,000 GPU-hour experiments across workloads. No derivation chain reduces any claimed result to fitted parameters, self-referential quantities, or self-citation load-bearing premises. The optimization formulation is presented as an engineering modeling choice whose validity is assessed by runtime measurement rather than by construction or imported uniqueness theorems. No self-definitional steps, renamed empirical patterns, or ansatz smuggling appear in the provided text. The work is self-contained against external benchmarks via direct execution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: RL training workflows consist of multiple models and tasks whose computation and data dependencies can be precisely modeled for joint optimization.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tagged: unclear)
  Unclear relation between the paper passage and the cited Recognition theorem. Linked passage: "HetRL formulates the scheduling of RL training in heterogeneous environments as a constrained joint optimization problem"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] AWS Elastic Fabric Adapter. https://aws.amazon.com/hpc/efa/, 2025a.
- [2] AWS OFI NCCL. https://github.com/aws/aws-ofi-nccl, 2025b.
- [3] Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., and Hooker, S. Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. In Ku, L., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
- [4] Bui, T. N. and Moon, B. R. Genetic algorithm and graph partitioning. IEEE Trans. Computers, 45(7): 841–855, 1996. doi:10.1109/12.508322.
- [5] Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., Wang, T. T., Marks, S., Ségerie, C., Carroll, M., Peng, A., Christoffersen, P. J. K., Damani, M., Slocum, S., Anwar, U., Siththaranjan, A., Nadeau, M., Michaud, E. J., Pfau, J., Krasheninnikov, D., Chen, X., Langosco, L., Hase, P., et al. Open problems and fundamental limitations of reinforcement learning from human feedback. Trans. Mach. Learn. Res., 2023.
- [6] Dai, J., Pan, X., Sun, R., Ji, J., Xu, X., Liu, M., Wang, Y., and Yang, Y. Safe RLHF: Safe reinforcement learning from human feedback. In ICLR 2024. URL https://openreview.net/forum?id=TyFrPOKYXw.
- [7] DeepSeek-AI. DeepSeek-V3 technical report. CoRR, abs/2412.19437, 2024. doi:10.48550/arXiv.2412.19437.
- [8] Fu, W., Gao, J., Shen, X., Zhu, C., Mei, Z., He, C., Xu, S., Wei, G., Mei, J., Wang, J., Yang, T., Yuan, B., and Wu, Y. AReaL: A large-scale asynchronous reinforcement learning system for language reasoning. CoRR, abs/2505.24298, 2025. doi:10.48550/arXiv.2505.24298.
- [9] Gao, Y., He, Y., Li, X., Zhao, B., Lin, H., Liang, Y., Zhong, J., Zhang, H., Wang, J., Zeng, Y., Gui, K., Tong, J., and Yang, M. An empirical study on low GPU utilization of deep learning jobs. In ICSE 2024, pp. 96:1–96:13. ACM, 2024.
- [10] Garey, M. R. and Johnson, D. S. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979. ISBN 0-7167-1044-7.
- [11] Han, Z., You, A., Wang, H., Luo, K., Yang, G., Shi, W., Chen, M., Zhang, S., Lan, Z., Deng, C., Ji, H., Liu, W., Huang, Y., Zhang, Y., Pan, C., Wang, J., Huang, X., Li, C., and Wu, J. AsyncFlow: An asynchronous streaming RL framework for efficient LLM post-training. CoRR, abs/2507.01663, 2025. doi:10.48550/arXiv.2507.01663.
- [12] Hu, J., Wu, X., Wang, W., Xianyu, Zhang, D., and Cao, Y. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework. CoRR, abs/2405.11143, 2024. doi:10.48550/arXiv.2405.11143.
- [13] Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M. X., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., and Chen, Z. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019.
- [14] Jamieson, K. G. and Talwalkar, A. Non-stochastic best arm identification and hyperparameter optimization. In AISTATS 2016, pp. 240–248, 2016.
- [15] Jia, Z., Zaharia, M., and Aiken, A. Beyond data and model parallelism for deep neural networks. In Proceedings of the Second Conference on Machine Learning and Systems (MLSys 2019), 2019.
- [16] Jiang, Y., Fu, F., Yao, X., He, G., Miao, X., Klimovic, A., Cui, B., Yuan, B., and Yoneki, E. Demystifying cost-efficiency in LLM serving over heterogeneous GPUs. In ICML 2025. OpenReview.net.
- [17] Jiang, Y., Fu, F., Yao, X., Wang, T., Cui, B., Klimovic, A., and Yoneki, E. ThunderServe: High-performance and cost-efficient LLM serving in cloud environments. CoRR, abs/2502.09334, 2025a. doi:10.48550/arXiv.2502.09334.
- [18] Jiang, Y., Yan, R., and Yuan, B. HexGen-2: Disaggregated generative inference of LLMs in heterogeneous environment. In ICLR 2025, 2025b. URL https://openreview.net/forum?id=Cs6MrbFuMq.
- [19] Karp, R. M. Reducibility among combinatorial problems. In Miller, R. E. and Thatcher, J. W. (eds.), Complexity of Computer Computations, pp. 85–103. Plenum Press, 1972.
- [20] Kaufmann, T., Weng, P., Bengs, V., and Hüllermeier, E. A survey of reinforcement learning from human feedback. Trans. Mach. Learn. Res., 2025. URL https://openreview.net/forum?id=f7OkIurx4b.
- [21] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In SOSP 2023. ACM.
- [22] Lam, S. and Sethi, R. Worst case analysis of two scheduling algorithms. SIAM J. Comput., 6(3): 518–536, 1977. doi:10.1137/0206037.
- [23] Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., Gu, Y., Malik, S., Graf, V., Hwang, J. D., Yang, J., Bras, R. L., Tafjord, O., Wilhelm, C., Soldaini, L., Smith, N. A., Wang, Y., Dasigi, P., and Hajishirzi, H. Tülu 3: Pushing frontiers in open language model post-training. CoRR, abs/2411.15124, 2024. doi:10.48550/arXiv.2411.15124.
- [24] Li, L., Jamieson, K. G., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. In ICLR 2017. URL https://openreview.net/forum?id=ry18Ww5ee.
- [25] Li, Z., Zheng, L., Zhong, Y., Liu, V., Sheng, Y., Jin, X., Huang, Y., Chen, Z., Zhang, H., Gonzalez, J. E., and Stoica, I. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In OSDI 2023. USENIX Association.
- [26] Llama Team. The Llama 3 herd of models. CoRR, abs/2407.21783, 2024. doi:10.48550/arXiv.2407.21783.
- [27] Mei, Y., Zhuang, Y., Miao, X., Yang, J., Jia, Z., and Vinayak, R. Helix: Serving large language models over heterogeneous GPUs and network via max-flow. In ASPLOS 2025. ACM.
- [28] Noukhovitch, M., Huang, S., Xhonneux, S., Hosseini, A., Agarwal, R., and Courville, A. C. Asynchronous RLHF: Faster and more efficient off-policy RL for language models. In ICLR 2025. URL https://openreview.net/forum?id=FhTAG591Ve.
- [29] Qwen Team. Qwen3 technical report. CoRR, abs/2505.09388, 2025. doi:10.48550/arXiv.2505.09388.
- [30] Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
- [31] Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. ZeRO: Memory optimizations toward training trillion parameter models. In SC 2020. doi:10.1109/SC41405.2020.00024.
- [32] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.
- [33] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300, 2024. doi:10.48550/arXiv.2402.03300.
- [34] Shen, G., Wang, Z., Delalleau, O., Zeng, J., Dong, Y., Egert, D., Sun, S., Zhang, J. J., Jain, S., Taghibakhshi, A., Ausin, M. S., Aithal, A., and Kuchaiev, O. NeMo-Aligner: Scalable toolkit for efficient model alignment. CoRR, abs/2405.01481, 2024. doi:10.48550/arXiv.2405.01481.
- [35] Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. HybridFlow: A flexible and efficient RLHF framework. In EuroSys 2025, pp. 1279–1297. ACM. doi:10.1145/3689031.3696075.
- [36] Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019.
- [37] Soper, A. J., Walshaw, C., and Cross, M. A combined evolutionary search and multilevel optimisation approach to graph-partitioning. J. Glob. Optim., 29(2): 225–241, 2004. doi:10.1023/B:JOGO.0000042115.44455.f3.
- [38] Strati, F., Elvinger, P., Kerimoglu, T., and Klimovic, A. ML training with cloud GPU shortages: Is cross-region the answer? In EuroMLSys 2024, pp. 107–116. ACM. doi:10.1145/3642970.3655843.
- [39] Strati, F., Zhang, Z., Manos, G., Périz, I. S., Hu, Q., Chen, T., Buzcu, B., Han, S., Delgado, P., and Klimovic, A. Sailor: Automating distributed training over dynamic, heterogeneous, and geo-distributed clusters. In SOSP 2025. ACM.
- [40] Tarnawski, J., Narayanan, D., and Phanishayee, A. Piper: Multidimensional planner for DNN parallelization. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), pp. 24829–24840, 2021.
- [41] Um, T., Oh, B., Kang, M., Lee, W., Kim, G., Kim, D., Kim, Y., Muzzammil, M., and Jeon, M. Metis: Fast automatic distributed training on heterogeneous GPUs. In USENIX ATC 2024, pp. 563–578. USENIX Association.
- [42] Verhelst, M., Benini, L., and Verma, N. How to keep pushing ML accelerator performance? Know your rooflines! IEEE J. Solid-State Circuits, 60(6): 1888–1905, 2025. doi:10.1109/JSSC.2025.3553765.
- [43] Wu, B., Wang, S., Tang, Y., Ding, J., Helenowski, E., Tan, L., Xu, T., Gowda, T., Chen, Z., Zhu, C., Tang, X., Qian, Y., Zhu, B., and Hou, R. LlamaRL: A distributed asynchronous reinforcement learning framework for efficient large-scale LLM training. CoRR, abs/2505.24034, 2025a. doi:10.48550/arXiv.2505.24034.
- [44] Wu, Y., Liu, X., Jin, S., Xu, C., Qian, F., Mao, Z. M., Lentz, M., Zhuo, D., and Stoica, I. HeterMoE: Efficient training of mixture-of-experts models on heterogeneous GPUs. CoRR, abs/2504.03871, 2025b. doi:10.48550/arXiv.2504.03871.
- [45] Yao, Z., Aminabadi, R. Y., Ruwase, O., Rajbhandari, S., Wu, X., Awan, A. A., Rasley, J., Zhang, M., Li, C., Holmes, C., Zhou, Z., Wyatt, M., Smith, M., Kurilenko, L., Qin, H., Tanaka, M., Che, S., Song, S. L., and He, Y. DeepSpeed-Chat: Easy, fast and affordable RLHF training of ChatGPT-like models at all scales. CoRR, abs/2308.01320, 2023.
- [46] Yuan, B., He, Y., Davis, J., Zhang, T., Dao, T., Chen, B., Liang, P., Ré, C., and Zhang, C. Decentralized training of foundation models in heterogeneous environments. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
- [47] Zhang, Z. Understanding GPU architecture implications on LLM serving workloads. Master's thesis, ETH Zurich, 2024.
- [48] Zheng, L., Li, Z., Zhang, H., Zhuang, Y., Chen, Z., Huang, Y., Wang, Y., Xu, Y., Zhuo, D., Xing, E. P., Gonzalez, J. E., and Stoica, I. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In OSDI 2022. USENIX Association.
- [49] Zhong, Y., Zhang, Z., Song, X., Hu, H., Jin, C., Wu, B., Chen, N., Chen, Y., Zhou, Y., Wan, C., Zhou, H., Jiang, Y., Zhu, Y., and Jiang, D. StreamRL: Scalable, heterogeneous, and elastic RL for LLMs with disaggregated stream generation. CoRR, abs/2504.15930, 2025a. doi:10.48550/arXiv.2504.15930.
- [50] Zhong, Y., Zhang, Z., Wu, B., Liu, S., Chen, Y., Wan, C., Hu, H., Xia, L., Ming, R., Zhu, Y., and Jin, X. Optimizing RLHF training for large language models with stage fusion. In NSDI 2025, pp. 489–503. USENIX Association.