CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield

Dohyeon Kwon; Youngjin Park

arxiv: 2606.31796 · v1 · pith:MKRS2OTCnew · submitted 2026-06-30 · 💻 cs.CL · cs.AI

CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield

Dohyeon Kwon , Youngjin Park This is my paper

Pith reviewed 2026-07-01 05:44 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords selective ground truth token traininggradient couplingrecurrent depth compressionmixture of efficient expertsparameter-efficient language modelstoken supervisionloss reduction

0 comments

The pith

Selective supervision on 15% of tokens recovers 67% of full loss reduction in language models through positive gradient coupling in shared weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that concentrating supervision on the 15% of output tokens carrying semantic content allows the remaining 85% to improve via positive gradient coupling in position-shared transformer weights. This yields 4.5 times per-supervised-token efficiency while recovering most of the loss reduction that full supervision would produce. Depth compression via layer averaging followed by recurrent unrolling reduces a 48-layer model to six layers with 34 effective recurrent steps, matching the loss of a model twice as large. Assembling multiple such compressed models into a mixture of efficient experts further lowers loss beyond any single model at comparable active parameters. The techniques are shown on a Korean foundation model trained entirely from the authors' runs.

Core claim

Selective Ground Truth Token Training concentrates supervision on roughly 15% of tokens yet recovers about 67% of full-sequence loss reduction because the gradient coupling coefficient gamma-bar equals 0.72; the effect is guaranteed when this coefficient is positive and collapses on shuffled text. A 48-layer transformer compressed to six layers and restored by 34 recurrent unrollings reaches held-out loss 2.934, within noise of a 566M dense model at 2.926. A two-expert mixture of these compressed models reaches loss 2.789, outperforming the best single compressed model at 2.926.

What carries the argument

Selective Ground Truth Token Training (SGT) that exploits positive gradient coupling (gamma-bar = 0.72) across position-shared weights, combined with recurrent unrolling after layer averaging for depth compression and Mixture of Efficient Experts (MoEE) fusion.

Load-bearing premise

The improvement on unsupervised tokens stems from natural-language structure that produces positive gradient coupling in the shared weights.

What would settle it

Repeating the selective-training runs on shuffled text and finding either no loss reduction on unsupervised tokens or a non-positive gradient coupling coefficient would falsify the central efficiency claim.

Figures

Figures reproduced from arXiv: 2606.31796 by Dohyeon Kwon, Youngjin Park.

**Figure 1.** Figure 1: Sovereign multi-teacher fusion. Frontier models are compressed, then serve triple duty: standalone deployment, distillation teacher, and MoE expert. The sovereign student absorbs collective knowledge via GT-focused KL distillation. 2. Used as a teacher for SGT distillation, where only GT-position logits are matched (reducing distillation memory and compute by 5×). 3. Assembled as an expert in a MoEE archi… view at source ↗

**Figure 2.** Figure 2: Parameter efficiency frontier. Compressed models with recurrent unrolling achieve a flat performance plateau from 227M to 566M. MoEE at 1,022M breaks below the plateau. The 48L original at 1,019M achieves much lower loss through full-capacity SGT training. step 100. Extending to 1,000 steps (verified in a separate run) produced <0.01 additional improvement, confirming that 500 steps capture the full SGT le… view at source ↗

**Figure 3.** Figure 3: CHERRY-1.8B architecture. Sovereign tokenizer → 48L DUS backbone → Oracle MTP heads. Cabstractor fuses four frozen extractors. All trainable parameters are sovereign. Korean tokens. Post-CPT weight-space cosine similarity to the seed is below 0.35. 5.3 Oracle multi-token prediction Three cross-attention prediction heads [9] (343M total) at horizons h ∈ {1, 2, 3} receive oracle future hidden states during t… view at source ↗

read the original abstract

We study three complementary techniques for training compute-efficient language models. (1) Selective supervision and per-token efficiency. Selective Ground Truth Token Training (SGT) concentrates supervision on the ~15% of output tokens that carry semantic payload. Through positive gradient coupling in position-shared transformer weights -- a token-level instance of auxiliary-task transfer -- the remaining 85% of unsupervised tokens still improve substantially, giving a 4.5x per-supervised-token efficiency (at the step-100 eval optimum, ~67% of the full-sequence loss reduction is recovered from 15% of the supervision). We prove that this improvement on unsupervised tokens is guaranteed whenever the gradient coupling coefficient gamma-bar = 0.72 is positive (Theorem 1), and show the effect is a property of natural-language structure: it collapses on shuffled text. (2) Depth compression with recurrent recovery. A 48-layer, 1B-parameter transformer is compressed to 6 layers (227M) by averaging adjacent layers and restored through learned recurrent unrolling. With 34 effective recurrent layers it reaches a held-out loss of 2.934, within measurement noise of a 566M dense model at 2.926 -- a 2.5x reduction in parameters. (3) Fusion of compressed experts. Assembling several compressed models as a Mixture of Efficient Experts (MoEE) with multi-token prediction improves over each single expert at comparable active parameters: a 2-expert MoEE reaches loss 2.789 versus 2.926 for the best single compressed model. We validate these techniques on CHERRY-1.8B, a Korean foundation model whose every trainable parameter derives from our own training runs. We are explicit throughout about the scope of the evidence (one model family, Korean data, loss-based metrics) and about which claims are established versus prospective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The selective token supervision is the freshest piece but its link to natural-language structure rests on a control that does not separate semantics from local token statistics.

read the letter

The paper's main new element is Selective Ground Truth Token Training, which supervises only about 15% of tokens yet recovers roughly two-thirds of the full loss drop through measured positive gradient coupling. The other two pieces—recurrent unrolling after layer averaging for depth recovery and a small MoEE on the compressed experts—are more incremental but are assembled into one pipeline on a Korean 1.8B model the authors trained from scratch.

They do a few things right. The scope is stated plainly: one model family, Korean data, loss metrics only. Theorem 1 is presented as a conditional guarantee rather than an unconditional claim. The recurrent compression result lands close to a larger dense baseline at lower parameter count, which is a usable engineering data point if the numbers replicate.

The soft spot is the interpretation of the shuffled-text result. The abstract treats the collapse of the effect on shuffled data as evidence that the coupling is a property of natural-language structure. Shuffling removes both long-range semantics and short-range co-occurrence statistics at once, so the drop in gamma-bar cannot be pinned on linguistic structure specifically. An n-gram-preserving control would be needed to make that step secure. The theorem itself only requires gamma-bar > 0; the causal story attached to it is the part that needs more work.

People experimenting with cheaper supervision schedules or compressed transformers could extract practical ideas from the selective-token section. The paper is careful enough about its own limits that a reader can judge the claims without over-reading.

It should go to peer review. The combination is concrete, the authors flag what is measured versus what is prospective, and the gradient-coupling angle is worth a closer look even if the current control does not fully support the strongest reading.

Referee Report

1 major / 1 minor

Summary. The paper introduces three techniques for compute-efficient language model training on the CHERRY-1.8B Korean model: (1) Selective Ground Truth Token Training (SGT) that supervises ~15% of tokens yet recovers ~67% of full-sequence loss reduction via positive gradient coupling (gamma-bar=0.72), with Theorem 1 guaranteeing unsupervised-token improvement conditional on gamma-bar>0 and the effect collapsing on shuffled text; (2) depth compression of a 48-layer model to 6 layers via averaging and recurrent unrolling, achieving loss 2.934 comparable to a 566M dense model; (3) Mixture of Efficient Experts (MoEE) with multi-token prediction, where a 2-expert variant reaches loss 2.789. The work explicitly limits claims to one model family, Korean data, and loss metrics.

Significance. If the empirical results and conditional theorem hold, SGT provides a measured 4.5x per-supervised-token efficiency gain with a general guarantee, the recurrent compression delivers a 2.5x parameter reduction at comparable loss, and MoEE demonstrates gains from fusing compressed experts. The explicit scope statement and focus on falsifiable loss metrics are strengths. These techniques could meaningfully advance parameter- and supervision-efficient training if the gradient-coupling mechanism proves robust beyond the reported setting.

major comments (1)

[Abstract and Theorem 1 section] Abstract and the section presenting Theorem 1 plus the shuffled-text experiment: the interpretation that positive gamma-bar arises specifically from natural-language structure (rather than generic token-transition statistics) rests on the shuffled-text collapse. Shuffling destroys both higher-order semantics and local co-occurrence statistics simultaneously; without an intermediate control preserving n-gram or bigram distributions while disrupting only long-range order, the attribution to linguistic structure is not isolated. This affects the explanatory claim for why the 4.5x efficiency appears on natural text, even though the conditional guarantee of Theorem 1 itself remains intact.

minor comments (1)

Ensure all reported scalars (gamma-bar=0.72, 15% token fraction, 67% recovery, step-100 optimum, loss values 2.789/2.926/2.934) are explicitly cross-referenced to the exact experimental tables, figures, or appendix derivations where they are computed.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the interpretability of our SGT results. The single major comment concerns the strength of the attribution in the abstract and Theorem 1 section. We address it point-by-point below.

read point-by-point responses

Referee: [Abstract and Theorem 1 section] Abstract and the section presenting Theorem 1 plus the shuffled-text experiment: the interpretation that positive gamma-bar arises specifically from natural-language structure (rather than generic token-transition statistics) rests on the shuffled-text collapse. Shuffling destroys both higher-order semantics and local co-occurrence statistics simultaneously; without an intermediate control preserving n-gram or bigram distributions while disrupting only long-range order, the attribution to linguistic structure is not isolated. This affects the explanatory claim for why the 4.5x efficiency appears on natural text, even though the conditional guarantee of Theorem 1 itself remains intact.

Authors: We agree that the shuffled-text experiment does not isolate higher-order linguistic structure from local token-transition statistics, as random shuffling disrupts both simultaneously. The current control only demonstrates that the positive gamma-bar (and thus the efficiency gain) depends on the original sequential token statistics rather than on fully randomized transitions. We will revise the abstract and the relevant section to state that the effect is a property of the preserved token co-occurrence structure in natural text (as shown by the collapse under shuffling), without claiming isolation of long-range semantics. This keeps Theorem 1 and all empirical results unchanged while removing the over-attribution. We will also add a brief note that n-gram-preserving controls would be a useful direction for future work. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper measures the gradient coupling coefficient gamma-bar empirically (reported as 0.72) and states a theorem guaranteeing improvement on unsupervised tokens conditional on gamma-bar > 0. The claim that the effect stems from natural-language structure is backed by the shuffled-text control experiment. No quoted step reduces a claimed prediction or result to a fitted parameter by the paper's own equations, nor does any load-bearing premise rely on self-citation chains, imported uniqueness theorems, or ansatzes smuggled via prior work. The techniques are validated directly on the authors' own runs with explicit scope limitations, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Central claims rest on the mathematical guarantee in Theorem 1 for gradient coupling and on empirical loss measurements from training CHERRY-1.8B; the 0.72 coupling coefficient and 15% token fraction are presented as observed values that enable the efficiency result.

free parameters (2)

gamma-bar = 0.72
Gradient coupling coefficient stated as 0.72 in the description of Theorem 1 for SGT.
supervised token fraction = ~0.15
The ~15% of tokens selected for supervision in SGT.

axioms (1)

domain assumption Improvement on unsupervised tokens is guaranteed whenever gradient coupling coefficient gamma-bar is positive (Theorem 1).
The proof is invoked to guarantee the 4.5x efficiency result under the stated coupling.

pith-pipeline@v0.9.1-grok · 5871 in / 1469 out tokens · 27045 ms · 2026-07-01T05:44:22.498141+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 13 canonical work pages · 4 internal anchors

[1]

ICML41–48 (2009)

Bengio, Y.et al.Curriculum learning.Proc. ICML41–48 (2009)

2009
[2]

Accelerating Large Language Model Decoding with Speculative Sampling

Chen, C.et al.Accelerating large language model decoding with speculative sampling. Preprint athttps://arxiv.org/abs/2302.01318(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

DeepSeek-V3 technical report.Preprint at https://arxiv.org/abs/2412

DeepSeek-AI. DeepSeek-V3 technical report.Preprint at https://arxiv.org/abs/2412. 19437(2024)

2024
[4]

DeepSeek-V4: Towards real-world expertise in reasoning and coding

DeepSeek-AI. DeepSeek-V4: Towards real-world expertise in reasoning and coding. Technical Report(2026)

2026
[5]

ICLR(2019)

Dehghani, M.et al.Universal transformers.Proc. ICLR(2019)

2019
[6]

BERT: Pre-training of deep bidi- rectional transformers for language understanding.Proc

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidi- rectional transformers for language understanding.Proc. NAACL-HLT, pp. 4171–4186 (2019)

2019
[7]

& Shazeer, N

Fedus, W., Zoph, B. & Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.JMLR23(120), 1–39 (2022)

2022
[8]

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Geiping, J.et al.Scaling up test-time compute with latent reasoning: A recurrent depth approach.Advances in Neural Information Processing Systems(NeurIPS 2025). https://arxiv.org/abs/2502.05171

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

ICML(2024)

Gloeckle, F.et al.Better & faster large language models via multi-token prediction.Proc. ICML(2024)

2024
[10]

& Roberts, D

Gromov, A., Tirumala, K., Shapourian, H., Glorioso, P . & Roberts, D. A. The unreason- able ineffectiveness of the deeper layers.Proc. ICLR(2025). https://arxiv.org/abs/ 2403.17887. 25 CHERRYAXOps Team, TeamSparta Inc

work page arXiv 2025
[11]

& Dao, T

Gu, A. & Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. Proc. Conference on Language Modeling (COLM)(2024). https://arxiv.org/abs/2312. 00752

2024
[12]

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning.Nature645, 633–638 (2025)

Guo, D.et al.(DeepSeek-AI). DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning.Nature645, 633–638 (2025). https://doi.org/10.1038/ s41586-025-09422-z

2025
[13]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network.NIPS 2014 Deep Learning Workshop.https://arxiv.org/abs/1503.02531

work page internal anchor Pith review Pith/arXiv arXiv 2014
[14]

& Hongler, C

Jacot, A., Gabriel, F. & Hongler, C. Neural tangent kernel: Convergence and generaliza- tion in neural networks.Advances in Neural Information Processing Systems31(NeurIPS 2018)

2018
[15]

arXiv preprint arXiv:1910.00762 , year=

Jiang, A. H., Wong, D. L.-K., Zhou, G.,et al.Accelerating deep learning by focusing on the biggest losers.Preprint athttps://arxiv.org/abs/1910.00762(2019)

work page arXiv 1910
[16]

Jiao, X.et al.TinyBERT: Distilling BERT for natural language understanding.Findings of EMNLP(2020)

2020
[17]

Scaling Laws for Neural Language Models

Kaplan, J.et al.Scaling laws for neural language models.Preprint at https://arxiv. org/abs/2001.08361(2020)

work page internal anchor Pith review Pith/arXiv arXiv 2001
[18]

and Fleuret, F

Katharopoulos, A. and Fleuret, F. Not all samples are created equal: Deep learning with importance sampling.Proc. ICML, pp. 2530–2539 (2018)

2018
[19]

NAACL-HLT Industry Track(2024)

Kim, D.et al.SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling.Proc. NAACL-HLT Industry Track(2024). https://arxiv.org/abs/ 2312.15166

work page arXiv 2024
[20]

& Soricut, R

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P . & Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations.Proc. ICLR(2020)

2020
[21]

& Matias, Y

Leviathan, Y., Kalman, M. & Matias, Y. Fast inference from transformers via speculative decoding.Proc. ICML(2023)

2023
[22]

EXAONE 3.5: Series of large language models for real-world use cases

LG AI Research. EXAONE 3.5: Series of large language models for real-world use cases. Preprint athttps://arxiv.org/abs/2412.04862(2024)

work page arXiv 2024
[23]

Li, M., Gururangan, S., Dettmers, T.,et al.Branch-train-merge: Embarrassingly parallel training of expert language models.Preprint at https://arxiv.org/abs/2208.03306 (2022)

work page arXiv 2022
[24]

ICLR(2024)

Lightman, H., Kosaraju, V ., Burda, Y.,et al.Let’s verify step by step.Proc. ICLR(2024)

2024
[25]

IEEE ICCV2980–2988 (2017)

Lin, T.-Y.et al.Focal loss for dense object detection.Proc. IEEE ICCV2980–2988 (2017)

2017
[26]

https://arxiv.org/abs/2404

Lin, Z.et al.Not all tokens are what you need for pretraining.Advances in Neural Information Processing Systems37(NeurIPS 2024, Oral). https://arxiv.org/abs/2404. 07965

2024
[27]

ICML(2024)

Liu, Z.et al.MobileLLM: Optimizing sub-billion parameter language models for on- device use cases.Proc. ICML(2024). 26 CHERRYAXOps Team, TeamSparta Inc

2024
[28]

& Hutter, F

Loshchilov, I. & Hutter, F. Decoupled weight decay regularization.Proc. ICLR(2019)

2019
[29]

Ma, X.et al.LLM-Pruner: On the structural pruning of large language models.Advances in Neural Information Processing Systems36(NeurIPS 2023)

2023
[30]

Men, X.et al.ShortGPT: Layers in large language models are more redundant than you expect.Findings of ACL(2025), 20192–20204.https://arxiv.org/abs/2403.03853

work page arXiv 2025
[31]

ICML(2022)

Mindermann, S., Brauner, J., Razzak, M.,et al.Prioritized training on points that are learnable, worth learning, and not yet learnt.Proc. ICML(2022)

2022
[32]

MiniMax-M3: Native multimodal mixture-of-experts.Technical Report (2026)

MiniMax AI. MiniMax-M3: Native multimodal mixture-of-experts.Technical Report (2026)

2026
[33]

HyperCLOVA X technical report.Preprint at https://arxiv.org/abs/2404.01954(2024)

NAVER Cloud HyperCLOVA X Team. HyperCLOVA X technical report.Preprint at https://arxiv.org/abs/2404.01954(2024)

work page arXiv 2024
[34]

NVFP4: A 4-bit floating-point format for efficient low-precision inference

NVIDIA. NVFP4: A 4-bit floating-point format for efficient low-precision inference. NVIDIA Technical Report(2025)

2025
[35]

Peng, B.et al.RWKV: Reinventing RNNs for the transformer era.Findings of EMNLP (2023), 14048–14077

2023
[36]

Radford, A.et al.Language models are unsupervised multitask learners.OpenAI Blog (2019)

2019
[37]

Raffel, C., Shazeer, N., Roberts, A.,et al.Exploring the limits of transfer learning with a unified text-to-text transformer.JMLR21(140):1–67 (2020)

2020
[38]

& Birch, A

Sennrich, R., Haddow, B. & Birch, A. Neural machine translation of rare words with subword units.Proc. ACL1715–1725 (2016)

2016
[39]

ICLR(2017)

Shazeer, N.et al.Outrageously large neural networks: The sparsely-gated mixture-of- experts layer.Proc. ICLR(2017)

2017
[40]

A.X-K1: A sovereign Korean foundation model.Preprint at https://arxiv

SK Telecom. A.X-K1: A sovereign Korean foundation model.Preprint at https://arxiv. org/abs/2601.09200(2026)

work page arXiv 2026
[41]

Vaswani, A.et al.Attention is all you need.Advances in Neural Information Processing Systems30(NeurIPS 2017), 5998–6008

2017
[42]

Wang, W.et al.MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers.Findings of ACL(2021)

2021
[43]

Y.,et al.Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.Proc

Wortsman, M., Ilharco, G., Gadre, S. Y.,et al.Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.Proc. ICML(2022)

2022
[44]

& Finn, C

Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K. & Finn, C. Gradient surgery for multi-task learning.Advances in Neural Information Processing Systems33(NeurIPS 2020)

2020
[45]

GLM-5.2: A 744B parameter sparse model with IndexShare attention

Zhipu AI (Z.ai). GLM-5.2: A 744B parameter sparse model with IndexShare attention. Technical Report(2026). 27 CHERRYAXOps Team, TeamSparta Inc. Appendix A Notation Table A.1:Notation used throughout the paper. Symbol Meaning FFull token set: all positions in a response GGround Truth token set (|G|/|F | ≈0.15) ¯GNon-GT token set:F \ G G ∗ Super GT: single ...

2026
[46]

Factual tokens alone achieve 75% of full SGT performanceat 41% of the GT budget, confirming that factual entities are the highest-ROI supervision targets
[47]

4.5× for full SGT), consistent with the hypothesis that CoT pivots carry disproportionate semantic load

Reasoning tokens yield the highest per-token ROI(8.9× vs. 4.5× for full SGT), consistent with the hypothesis that CoT pivots carry disproportionate semantic load
[48]

This confirms that SSGT is best suited for installingspecific capabilities(self-correction, verification) rather than general- purpose training

Self-correction (SSGT) tokens achieve11.3 × ROI—the highest among all categories— but the absolute eval loss (2.13) is furthest from full SGT. This confirms that SSGT is best suited for installingspecific capabilities(self-correction, verification) rather than general- purpose training
[49]

W” and “Eval loss

All single-category runs show W> 1, confirming that wave propagation operates regardless of which GT category provides the supervision signal. C Depth compression C.1 Compression-recovery trajectory Observation.Recovery is remarkably uniform across compression levels (81–84%), sug- gesting that adjacent-layer merging preserves the model’s essential struct...

work page arXiv

[1] [1]

ICML41–48 (2009)

Bengio, Y.et al.Curriculum learning.Proc. ICML41–48 (2009)

2009

[2] [2]

Accelerating Large Language Model Decoding with Speculative Sampling

Chen, C.et al.Accelerating large language model decoding with speculative sampling. Preprint athttps://arxiv.org/abs/2302.01318(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

DeepSeek-V3 technical report.Preprint at https://arxiv.org/abs/2412

DeepSeek-AI. DeepSeek-V3 technical report.Preprint at https://arxiv.org/abs/2412. 19437(2024)

2024

[4] [4]

DeepSeek-V4: Towards real-world expertise in reasoning and coding

DeepSeek-AI. DeepSeek-V4: Towards real-world expertise in reasoning and coding. Technical Report(2026)

2026

[5] [5]

ICLR(2019)

Dehghani, M.et al.Universal transformers.Proc. ICLR(2019)

2019

[6] [6]

BERT: Pre-training of deep bidi- rectional transformers for language understanding.Proc

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidi- rectional transformers for language understanding.Proc. NAACL-HLT, pp. 4171–4186 (2019)

2019

[7] [7]

& Shazeer, N

Fedus, W., Zoph, B. & Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.JMLR23(120), 1–39 (2022)

2022

[8] [8]

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Geiping, J.et al.Scaling up test-time compute with latent reasoning: A recurrent depth approach.Advances in Neural Information Processing Systems(NeurIPS 2025). https://arxiv.org/abs/2502.05171

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

ICML(2024)

Gloeckle, F.et al.Better & faster large language models via multi-token prediction.Proc. ICML(2024)

2024

[10] [10]

& Roberts, D

Gromov, A., Tirumala, K., Shapourian, H., Glorioso, P . & Roberts, D. A. The unreason- able ineffectiveness of the deeper layers.Proc. ICLR(2025). https://arxiv.org/abs/ 2403.17887. 25 CHERRYAXOps Team, TeamSparta Inc

work page arXiv 2025

[11] [11]

& Dao, T

Gu, A. & Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. Proc. Conference on Language Modeling (COLM)(2024). https://arxiv.org/abs/2312. 00752

2024

[12] [12]

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning.Nature645, 633–638 (2025)

Guo, D.et al.(DeepSeek-AI). DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning.Nature645, 633–638 (2025). https://doi.org/10.1038/ s41586-025-09422-z

2025

[13] [13]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network.NIPS 2014 Deep Learning Workshop.https://arxiv.org/abs/1503.02531

work page internal anchor Pith review Pith/arXiv arXiv 2014

[14] [14]

& Hongler, C

Jacot, A., Gabriel, F. & Hongler, C. Neural tangent kernel: Convergence and generaliza- tion in neural networks.Advances in Neural Information Processing Systems31(NeurIPS 2018)

2018

[15] [15]

arXiv preprint arXiv:1910.00762 , year=

Jiang, A. H., Wong, D. L.-K., Zhou, G.,et al.Accelerating deep learning by focusing on the biggest losers.Preprint athttps://arxiv.org/abs/1910.00762(2019)

work page arXiv 1910

[16] [16]

Jiao, X.et al.TinyBERT: Distilling BERT for natural language understanding.Findings of EMNLP(2020)

2020

[17] [17]

Scaling Laws for Neural Language Models

Kaplan, J.et al.Scaling laws for neural language models.Preprint at https://arxiv. org/abs/2001.08361(2020)

work page internal anchor Pith review Pith/arXiv arXiv 2001

[18] [18]

and Fleuret, F

Katharopoulos, A. and Fleuret, F. Not all samples are created equal: Deep learning with importance sampling.Proc. ICML, pp. 2530–2539 (2018)

2018

[19] [19]

NAACL-HLT Industry Track(2024)

Kim, D.et al.SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling.Proc. NAACL-HLT Industry Track(2024). https://arxiv.org/abs/ 2312.15166

work page arXiv 2024

[20] [20]

& Soricut, R

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P . & Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations.Proc. ICLR(2020)

2020

[21] [21]

& Matias, Y

Leviathan, Y., Kalman, M. & Matias, Y. Fast inference from transformers via speculative decoding.Proc. ICML(2023)

2023

[22] [22]

EXAONE 3.5: Series of large language models for real-world use cases

LG AI Research. EXAONE 3.5: Series of large language models for real-world use cases. Preprint athttps://arxiv.org/abs/2412.04862(2024)

work page arXiv 2024

[23] [23]

Li, M., Gururangan, S., Dettmers, T.,et al.Branch-train-merge: Embarrassingly parallel training of expert language models.Preprint at https://arxiv.org/abs/2208.03306 (2022)

work page arXiv 2022

[24] [24]

ICLR(2024)

Lightman, H., Kosaraju, V ., Burda, Y.,et al.Let’s verify step by step.Proc. ICLR(2024)

2024

[25] [25]

IEEE ICCV2980–2988 (2017)

Lin, T.-Y.et al.Focal loss for dense object detection.Proc. IEEE ICCV2980–2988 (2017)

2017

[26] [26]

https://arxiv.org/abs/2404

Lin, Z.et al.Not all tokens are what you need for pretraining.Advances in Neural Information Processing Systems37(NeurIPS 2024, Oral). https://arxiv.org/abs/2404. 07965

2024

[27] [27]

ICML(2024)

Liu, Z.et al.MobileLLM: Optimizing sub-billion parameter language models for on- device use cases.Proc. ICML(2024). 26 CHERRYAXOps Team, TeamSparta Inc

2024

[28] [28]

& Hutter, F

Loshchilov, I. & Hutter, F. Decoupled weight decay regularization.Proc. ICLR(2019)

2019

[29] [29]

Ma, X.et al.LLM-Pruner: On the structural pruning of large language models.Advances in Neural Information Processing Systems36(NeurIPS 2023)

2023

[30] [30]

Men, X.et al.ShortGPT: Layers in large language models are more redundant than you expect.Findings of ACL(2025), 20192–20204.https://arxiv.org/abs/2403.03853

work page arXiv 2025

[31] [31]

ICML(2022)

Mindermann, S., Brauner, J., Razzak, M.,et al.Prioritized training on points that are learnable, worth learning, and not yet learnt.Proc. ICML(2022)

2022

[32] [32]

MiniMax-M3: Native multimodal mixture-of-experts.Technical Report (2026)

MiniMax AI. MiniMax-M3: Native multimodal mixture-of-experts.Technical Report (2026)

2026

[33] [33]

HyperCLOVA X technical report.Preprint at https://arxiv.org/abs/2404.01954(2024)

NAVER Cloud HyperCLOVA X Team. HyperCLOVA X technical report.Preprint at https://arxiv.org/abs/2404.01954(2024)

work page arXiv 2024

[34] [34]

NVFP4: A 4-bit floating-point format for efficient low-precision inference

NVIDIA. NVFP4: A 4-bit floating-point format for efficient low-precision inference. NVIDIA Technical Report(2025)

2025

[35] [35]

Peng, B.et al.RWKV: Reinventing RNNs for the transformer era.Findings of EMNLP (2023), 14048–14077

2023

[36] [36]

Radford, A.et al.Language models are unsupervised multitask learners.OpenAI Blog (2019)

2019

[37] [37]

Raffel, C., Shazeer, N., Roberts, A.,et al.Exploring the limits of transfer learning with a unified text-to-text transformer.JMLR21(140):1–67 (2020)

2020

[38] [38]

& Birch, A

Sennrich, R., Haddow, B. & Birch, A. Neural machine translation of rare words with subword units.Proc. ACL1715–1725 (2016)

2016

[39] [39]

ICLR(2017)

Shazeer, N.et al.Outrageously large neural networks: The sparsely-gated mixture-of- experts layer.Proc. ICLR(2017)

2017

[40] [40]

A.X-K1: A sovereign Korean foundation model.Preprint at https://arxiv

SK Telecom. A.X-K1: A sovereign Korean foundation model.Preprint at https://arxiv. org/abs/2601.09200(2026)

work page arXiv 2026

[41] [41]

Vaswani, A.et al.Attention is all you need.Advances in Neural Information Processing Systems30(NeurIPS 2017), 5998–6008

2017

[42] [42]

Wang, W.et al.MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers.Findings of ACL(2021)

2021

[43] [43]

Y.,et al.Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.Proc

Wortsman, M., Ilharco, G., Gadre, S. Y.,et al.Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.Proc. ICML(2022)

2022

[44] [44]

& Finn, C

Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K. & Finn, C. Gradient surgery for multi-task learning.Advances in Neural Information Processing Systems33(NeurIPS 2020)

2020

[45] [45]

GLM-5.2: A 744B parameter sparse model with IndexShare attention

Zhipu AI (Z.ai). GLM-5.2: A 744B parameter sparse model with IndexShare attention. Technical Report(2026). 27 CHERRYAXOps Team, TeamSparta Inc. Appendix A Notation Table A.1:Notation used throughout the paper. Symbol Meaning FFull token set: all positions in a response GGround Truth token set (|G|/|F | ≈0.15) ¯GNon-GT token set:F \ G G ∗ Super GT: single ...

2026

[46] [46]

Factual tokens alone achieve 75% of full SGT performanceat 41% of the GT budget, confirming that factual entities are the highest-ROI supervision targets

[47] [47]

4.5× for full SGT), consistent with the hypothesis that CoT pivots carry disproportionate semantic load

Reasoning tokens yield the highest per-token ROI(8.9× vs. 4.5× for full SGT), consistent with the hypothesis that CoT pivots carry disproportionate semantic load

[48] [48]

This confirms that SSGT is best suited for installingspecific capabilities(self-correction, verification) rather than general- purpose training

Self-correction (SSGT) tokens achieve11.3 × ROI—the highest among all categories— but the absolute eval loss (2.13) is furthest from full SGT. This confirms that SSGT is best suited for installingspecific capabilities(self-correction, verification) rather than general- purpose training

[49] [49]

W” and “Eval loss

All single-category runs show W> 1, confirming that wave propagation operates regardless of which GT category provides the supervision signal. C Depth compression C.1 Compression-recovery trajectory Observation.Recovery is remarkably uniform across compression levels (81–84%), sug- gesting that adjacent-layer merging preserves the model’s essential struct...

work page arXiv