Optimizing Teacher-Student Partitioning for Scalable Knowledge Distillation on HPC Systems

Adrian P. Dieguez; Alex Batlle; Harris Teague; Jordi Ros-Giralt; Victor Conchello Vendrell; Vinnam Kim

arxiv: 2606.27797 · v1 · pith:2NGSTNM7new · submitted 2026-06-26 · 💻 cs.DC · cs.AI· cs.LG

Optimizing Teacher-Student Partitioning for Scalable Knowledge Distillation on HPC Systems

Adrian P. Dieguez , Victor Conchello Vendrell , Alex Batlle , Vinnam Kim , Jordi Ros-Giralt , Harris Teague This is my paper

Pith reviewed 2026-06-29 03:02 UTC · model grok-4.3

classification 💻 cs.DC cs.AIcs.LG

keywords knowledge distillationHPC systemsteacher-student partitioningvertical horizontal splittingtopology-aware parallelismasymmetric model partitioningscalable traininginflection points

0 comments

The pith

Decoupling teacher and student partitioning in knowledge distillation yields up to 67% higher samples-per-second than symmetric methods on HPC systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that libraries treating teacher and student models symmetrically in knowledge distillation waste resources due to their large differences in size and communication needs. It introduces a method that partitions the two models independently by combining vertical and horizontal splits, guided by an analytical expression for choosing the better regime at inflection points. This exploits asymmetry and topology-aware parallelism to avoid unnecessary teacher data structures. A sympathetic reader would care because the result makes large-scale guided distillation feasible on existing production HPC hardware without altering the models themselves.

Core claim

The paper claims that an HPC-aware methodology for knowledge distillation decouples teacher and student partitioning, combines vertical and horizontal model splits, and derives an analytical expression for inflection points between the regimes; when the best strategy is selected this way, the approach achieves up to 67% higher samples-per-second than TRL by eliminating redundant teacher-model data structures and applying topology-aware parallelism on production HPC clusters.

What carries the argument

The analytical expression that locates inflection points between vertical and horizontal splitting regimes, which selects the optimal partitioning strategy for asymmetric teacher-student models.

If this is right

Avoiding unnecessary teacher-model data structures directly raises samples-per-second throughput.
Topology-aware parallelism becomes effective once teacher and student partitions are chosen independently.
The inflection-point expression removes the need for exhaustive search over split strategies.
GKD training scales better on production HPC clusters without changes to the underlying distillation loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoupling logic could extend to other compression methods that pair large and small models, such as pruning or quantization-aware training.
If the inflection expression depends on measured communication costs, it could be recomputed at runtime for dynamic cluster conditions.
The approach suggests that future KD frameworks should expose separate partitioning APIs rather than assume symmetric data-parallel layouts.

Load-bearing premise

The derived analytical expression for inflection points between vertical and horizontal splitting regimes will continue to identify the optimal strategy on hardware and model sizes different from the tested production clusters.

What would settle it

Measuring actual optimal splits versus the expression's predictions on a different HPC cluster with altered interconnect latency or on models substantially larger or smaller than those tested.

Figures

Figures reproduced from arXiv: 2606.27797 by Adrian P. Dieguez, Alex Batlle, Harris Teague, Jordi Ros-Giralt, Victor Conchello Vendrell, Vinnam Kim.

**Figure 1.** Figure 1: (a) Training throughput from 2 to 16 nodes comparing original TRL GKD best configuration against our proposal best configuration; (b) Training loss evolution when using ZeRO (DDP) or TP in the teacher, showing no negative effect on accuracy. 5 Effects of Horizontal and Vertical Parallelism for GKD Section 4 demonstrated how teacher’s forward pass can be accelerated under the existing DeepSpeed DDP-style tr… view at source ↗

**Figure 2.** Figure 2: Partitioning options: (a) partitioning for a forward pass where each GPU holds a disjoint subset of full layers; (b) each GPU stores a portion of each layer, syncing through frequent communication collectives to execute the forward pass [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

Knowledge Distillation (KD) enables training smaller student models under the guidance of larger teacher models, and the widely adopted TRL library implements it. Yet, TRL treats both models symmetrically, missing opportunities to exploit their pronounced asymmetry in memory footprint, and communication requirements. This paper presents an HPC-aware methodology for KD that decouples teacher and student partitioning efficiently. Our approach achieves up to 67% higher samples-per-second than TRL by avoiding unnecessary teacher-model data structures and selecting the best split strategy. We combine vertical and horizontal partitioning of models, deriving an analytical expression that identifies the existence of inflection points between splitting regimes. These results showed that exploiting teacher--student asymmetry through topology-aware parallelism notably accelerated GKD training on production HPC clusters at our company

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a practical HPC tweak for knowledge distillation by splitting teacher and student asymmetrically, with a claimed 67% throughput gain and an analytical selector for split regimes, but the selector's value outside their clusters is unclear.

read the letter

The core claim is that TRL handles teacher and student the same way, but you can cut unnecessary memory and communication by partitioning them separately on HPC hardware. The authors report up to 67% more samples per second on their production clusters by combining vertical and horizontal splits and using an analytical expression to pick between regimes.

The new piece is the closed-form expression for inflection points. It lets you decide the split without exhaustive search, which is a modest but direct engineering step beyond just applying known partitioning tricks to the KD asymmetry.

The empirical side holds up as far as it goes: real runs on actual clusters show the speedup, and skipping teacher data structures is a sensible move given the size difference. That part reads as honest measurement rather than fitted curves.

The weak point is scope. The expression embeds hardware-specific numbers for memory footprints, all-reduce costs, and interconnect behavior. Change the GPU count, network, or model dimension ratios and the inflection points move, so the selector may not stay optimal. The abstract gives no derivation steps or cross-hardware checks, which leaves the analytical contribution looking narrow.

This is for people who already run knowledge distillation at scale on clusters and need to squeeze out throughput. A distributed-systems or large-model training reader would get immediate use from the partitioning tactics. The work is coherent enough on its own terms to deserve referee time; the measured gains are concrete even if the formula needs more testing.

Referee Report

2 major / 2 minor

Summary. The paper claims that by decoupling teacher and student partitioning in knowledge distillation using a combination of vertical and horizontal splits on HPC systems, and using a derived analytical expression to identify inflection points for choosing the optimal strategy, they achieve up to 67% higher samples-per-second than the TRL library. This is validated on production HPC clusters at the authors' company.

Significance. If the analytical expression for inflection points generalizes beyond the tested clusters and the performance gains are reproducible, this work could provide a valuable methodology for scalable knowledge distillation by exploiting teacher-student asymmetry in memory and communication requirements. The topology-aware parallelism is a practical contribution for large-scale training on HPC.

major comments (2)

[§3 (Methodology)] §3 (Methodology): The analytical expression identifying inflection points between vertical and horizontal partitioning regimes is stated without derivation steps, explicit assumptions about memory footprints or interconnect characteristics, or proof of its parameter-free nature. This is load-bearing for the central claim that the expression selects the best split strategy.
[§5 (Results)] §5 (Results): The reported 67% throughput gain lacks error bars, details on the number of runs, dataset sizes, or model dimensions used in the experiments. Without these, it is unclear whether the gain holds under the stated conditions or depends on post-hoc tuning specific to the production clusters.

minor comments (2)

[Abstract] Abstract: The abstract mentions 'GKD training' without defining the acronym on first use.
[Figure 2] Figure 2: The figure comparing split strategies would benefit from clearer labels on the axes indicating the inflection point location.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on the methodology and experimental reporting. We address each major comment below and will revise the manuscript to improve clarity and completeness.

read point-by-point responses

Referee: [§3 (Methodology)] §3 (Methodology): The analytical expression identifying inflection points between vertical and horizontal partitioning regimes is stated without derivation steps, explicit assumptions about memory footprints or interconnect characteristics, or proof of its parameter-free nature. This is load-bearing for the central claim that the expression selects the best split strategy.

Authors: We agree that the derivation was insufficiently detailed. In the revised version we will expand §3 to include the full step-by-step derivation of the inflection-point expression, the explicit assumptions on memory footprints and interconnect bandwidth/latency, and a short argument establishing its parameter-free character under the stated model. This will directly support the claim that the expression selects the optimal regime. revision: yes
Referee: [§5 (Results)] §5 (Results): The reported 67% throughput gain lacks error bars, details on the number of runs, dataset sizes, or model dimensions used in the experiments. Without these, it is unclear whether the gain holds under the stated conditions or depends on post-hoc tuning specific to the production clusters.

Authors: We acknowledge the omission of statistical and experimental details. The revised §5 will report error bars computed over multiple independent runs, the exact number of runs, the dataset sizes employed, and the model dimensions (parameter counts and layer configurations) for both teacher and student. These additions will allow readers to assess reproducibility on the described HPC clusters. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper reports an empirical 67% throughput gain measured directly on production HPC clusters and presents a derived analytical expression for inflection points between partitioning regimes. No provided equations, self-citations, or steps reduce the reported performance numbers or the inflection-point selector to a fitted parameter or self-referential definition by construction. The central claims rest on external measurements and first-principles derivation rather than internal re-labeling of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract relies on standard domain assumptions about model memory footprints and inter-node communication costs in distributed training. No new free parameters, axioms, or invented entities are introduced.

axioms (1)

domain assumption Standard assumptions about model memory footprint and communication requirements in distributed training on HPC clusters
Invoked to justify decoupling teacher and student partitioning and to derive the analytical inflection-point expression.

pith-pipeline@v0.9.1-grok · 5681 in / 1252 out tokens · 48060 ms · 2026-06-29T03:02:26.744138+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 4 canonical work pages · 2 internal anchors

[1]

In: International Conference on Learning Representations (ICLR) (2024)

Agarwal, R., Vieillard, N., Zhou, Y., Stanczyk, P., Ramos, S., Geist, M., Bachem, O.: On-policy distillation of language models: Learning from self-generated mis- takes. In: International Conference on Learning Representations (ICLR) (2024)

2024
[2]

arXiv preprint arXiv:2207.00032 , year =

Aminabadi, R.Y., Rajbhandari, S., Zhang, M., Awan, A.A., Li, C., Li, D., Zheng, E., Rasley, J., Smith, S., Ruwase, O., He, Y.: Deepspeed inference: Enabling ef- ficient inference of transformer models at unprecedented scale. arXiv preprint arXiv:2207.00032 (2022),https://arxiv.org/abs/2207.00032

work page arXiv 2022
[3]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv (2015),https://arxiv.org/abs/1503.02531

work page internal anchor Pith review Pith/arXiv arXiv 2015
[4]

co/docs/trl/index(2026), accessed: Mar 10, 2026

Hugging Face: Trl — transformer reinforcement learning.https://huggingface. co/docs/trl/index(2026), accessed: Mar 10, 2026

2026
[5]

In: Proceedings of the 29th Symposium on Operating Systems Principles

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with pagedattention. In: Proceedings of the 29th Symposium on Operating Systems Principles. p. 611–626. SOSP ’23, ACM (2023)

2023
[6]

In: Proceedings of the International Workshop on Reproducible Research in Pattern Recognition (RRPR)

Matsubara, Y.: torchdistill: A modular, configuration-driven framework for knowl- edge distillation. In: Proceedings of the International Workshop on Reproducible Research in Pattern Recognition (RRPR). Lecture Notes in Computer Science, vol. 12636, pp. 24–44. Springer (2021)

2021
[7]

Dieguez et al

Microsoft: Deepspeed: Accelerating deep learning training and inference.https: //github.com/microsoft/DeepSpeed(2024), accessed: 2025-08-06 14 Adrian P. Dieguez et al

2024
[8]

In: Proceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis

Pérez Diéguez, A., Batlle Casellas, A., Torres, A., Teague, H., Ros, J.: Pretrain- ing llms at scale: Tuning strategies and performance portability. In: Proceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. p. 1512–1523. ACM (2025)

2025
[9]

ArXiv (May 2020)

Rajbhandari, S., Rasley, J., Ruwase, O., He, Y.: Zero: Memory optimizations to- ward training trillion parameter models. ArXiv (May 2020)

2020
[10]

In: Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC2) at NeurIPS (2019)

Sanh, V., Debut, L., Chaumond, J., Wolf, T.: Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. In: Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC2) at NeurIPS (2019)

2019
[11]

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-lm: Training multi-billion parameter language models using model par- allelism (2020),https://arxiv.org/abs/1909.08053

work page internal anchor Pith review Pith/arXiv arXiv 2020
[12]

Tan,S.,Tam,W.L.,Wang,Y.,Gong,W.,Yang,Y.,Tang,H.,He,K.,Liu,J.,Wang, J., Zhao, S., Zhang, P., Tang, J.: Gkd: A general knowledge distillation framework forlarge-scalepre-trainedlanguagemodel.arXivpreprintarXiv:2306.06629(2023), https://arxiv.org/abs/2306.06629

work page arXiv 2023
[13]

In: Advances in Neural Information Processing Systems (NeurIPS) (2017)

Vaswani,A.,Shazeer,N.,Parmar,N.,Uszkoreit,J.,Jones,L.,Gomez,A.N.,Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)

2017

[1] [1]

In: International Conference on Learning Representations (ICLR) (2024)

Agarwal, R., Vieillard, N., Zhou, Y., Stanczyk, P., Ramos, S., Geist, M., Bachem, O.: On-policy distillation of language models: Learning from self-generated mis- takes. In: International Conference on Learning Representations (ICLR) (2024)

2024

[2] [2]

arXiv preprint arXiv:2207.00032 , year =

Aminabadi, R.Y., Rajbhandari, S., Zhang, M., Awan, A.A., Li, C., Li, D., Zheng, E., Rasley, J., Smith, S., Ruwase, O., He, Y.: Deepspeed inference: Enabling ef- ficient inference of transformer models at unprecedented scale. arXiv preprint arXiv:2207.00032 (2022),https://arxiv.org/abs/2207.00032

work page arXiv 2022

[3] [3]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv (2015),https://arxiv.org/abs/1503.02531

work page internal anchor Pith review Pith/arXiv arXiv 2015

[4] [4]

co/docs/trl/index(2026), accessed: Mar 10, 2026

Hugging Face: Trl — transformer reinforcement learning.https://huggingface. co/docs/trl/index(2026), accessed: Mar 10, 2026

2026

[5] [5]

In: Proceedings of the 29th Symposium on Operating Systems Principles

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with pagedattention. In: Proceedings of the 29th Symposium on Operating Systems Principles. p. 611–626. SOSP ’23, ACM (2023)

2023

[6] [6]

In: Proceedings of the International Workshop on Reproducible Research in Pattern Recognition (RRPR)

Matsubara, Y.: torchdistill: A modular, configuration-driven framework for knowl- edge distillation. In: Proceedings of the International Workshop on Reproducible Research in Pattern Recognition (RRPR). Lecture Notes in Computer Science, vol. 12636, pp. 24–44. Springer (2021)

2021

[7] [7]

Dieguez et al

Microsoft: Deepspeed: Accelerating deep learning training and inference.https: //github.com/microsoft/DeepSpeed(2024), accessed: 2025-08-06 14 Adrian P. Dieguez et al

2024

[8] [8]

In: Proceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis

Pérez Diéguez, A., Batlle Casellas, A., Torres, A., Teague, H., Ros, J.: Pretrain- ing llms at scale: Tuning strategies and performance portability. In: Proceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. p. 1512–1523. ACM (2025)

2025

[9] [9]

ArXiv (May 2020)

Rajbhandari, S., Rasley, J., Ruwase, O., He, Y.: Zero: Memory optimizations to- ward training trillion parameter models. ArXiv (May 2020)

2020

[10] [10]

In: Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC2) at NeurIPS (2019)

Sanh, V., Debut, L., Chaumond, J., Wolf, T.: Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. In: Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC2) at NeurIPS (2019)

2019

[11] [11]

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-lm: Training multi-billion parameter language models using model par- allelism (2020),https://arxiv.org/abs/1909.08053

work page internal anchor Pith review Pith/arXiv arXiv 2020

[12] [12]

Tan,S.,Tam,W.L.,Wang,Y.,Gong,W.,Yang,Y.,Tang,H.,He,K.,Liu,J.,Wang, J., Zhao, S., Zhang, P., Tang, J.: Gkd: A general knowledge distillation framework forlarge-scalepre-trainedlanguagemodel.arXivpreprintarXiv:2306.06629(2023), https://arxiv.org/abs/2306.06629

work page arXiv 2023

[13] [13]

In: Advances in Neural Information Processing Systems (NeurIPS) (2017)

Vaswani,A.,Shazeer,N.,Parmar,N.,Uszkoreit,J.,Jones,L.,Gomez,A.N.,Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)

2017