arxiv: 2605.24956 · v1 · pith:6NSVQ2QEnew · submitted 2026-05-24 · 💻 cs.CL

NITP: Next Implicit Token Prediction for LLM Pre-training

Pith reviewed 2026-06-30 12:20 UTC · model grok-4.3

classification 💻 cs.CL
keywords next token prediction · LLM pre-training · representation supervision · self-supervised learning · MoE models · MMLU-Pro · latent space regularization

The pith

NITP augments next-token prediction with dense supervision from shallow-layer representations to constrain LLM latent spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard next-token prediction supervises language models only through discrete one-hot labels in output space, leaving the latent representation space under-constrained and prone to degenerate geometries. NITP augments this with dense continuous supervision by training the model to predict the implicit semantic content of the next token, taking shallow-layer representations from the same model as self-supervised targets. The paper provides theoretical analysis that this regularizes the optimization landscape by reducing under-constrained degrees of freedom and promoting compact, structured representations. Empirically it reports consistent downstream gains on dense and MoE models from 0.5B to 9B parameters, including a 5.7% absolute lift on MMLU-Pro for the largest model, at roughly 2% extra training FLOPs and zero inference cost. A sympathetic reader would care because tighter latent-space constraints could translate into stronger generalization without changing model scale or inference budget.

Core claim

NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. This augments discrete next-token prediction with dense continuous supervision directly in the representation space. Theoretical analysis shows that NITP regularizes the optimization landscape by mitigating under-constrained degrees of freedom and encouraging a compact, structured representation geometry.

What carries the argument

Next Implicit Token Prediction (NITP), which uses shallow-layer representations from the same model as self-supervised targets for the implicit semantic content of the next token.
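
One way to picture the objective, going by the Figure 2 caption reproduced on this page (last-layer hidden states are projected and matched to temporally shifted shallow-layer representations with a cosine similarity, jointly with standard NTP), is the minimal sketch below. It is not the authors' implementation: the linear projection W, the 1 - cosine form of the loss, the weight lam, and every function name are assumptions made for illustration, and plain Python lists stand in for tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def project(W, h):
    """Apply a linear projection W (a list of rows) to the vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def nitp_loss(last_hidden, shallow_hidden, W):
    """Mean of 1 - cosine between the projected last-layer state at
    position t and the shallow-layer state at position t + 1.

    The t -> t + 1 pairing is the temporal shift; the loss form itself
    is an assumption, not taken from the paper.
    """
    T = len(last_hidden)
    if T < 2 or len(shallow_hidden) != T:
        raise ValueError("need two sequences of equal length >= 2")
    terms = [
        1.0 - cosine(project(W, last_hidden[t]), shallow_hidden[t + 1])
        for t in range(T - 1)
    ]
    return sum(terms) / len(terms)


def total_loss(ntp_loss, last_hidden, shallow_hidden, W, lam=0.1):
    """Joint objective: NTP loss plus a weighted NITP term (lam is hypothetical)."""
    return ntp_loss + lam * nitp_loss(last_hidden, shallow_hidden, W)
```

Because the last position has no successor, a sequence of length T contributes T - 1 terms. Whether the shallow-layer targets are detached from the gradient during backpropagation is not stated on this page, so the sketch leaves that question open.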

If this is right

  • Consistent downstream gains across dense and MoE models ranging from 0.5B to 9B parameters.
  • 5.7% absolute improvement on MMLU-Pro for a 9B MoE model.
  • Additional gains of 6.4% on C3 and 4.3% on CommonsenseQA.
  • Approximately 2% extra training FLOPs with no added inference cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The claimed regularization of representation geometry could be directly tested by tracking anisotropy metrics on hidden states before and after NITP training.
  • Similar dense supervision from early layers might transfer to non-language modalities where next-token-style objectives exist.
  • The low overhead suggests NITP could be stacked with other representation-level regularizers without compounding compute costs.
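
The first extension above could be checked with a simple anisotropy probe. The sketch below uses mean pairwise cosine similarity between hidden states, one common choice in the representation-geometry literature; the paper may rely on a different metric, and the function names here are illustrative only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def anisotropy(states):
    """Mean cosine similarity over all unordered pairs of hidden states.

    Values near 1 mean the states crowd into a narrow cone; values near
    0 (or below) mean directions are spread more evenly.
    """
    pairs = list(combinations(states, 2))
    if not pairs:
        raise ValueError("need at least two hidden states")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Running the same probe on last-layer states sampled from an NTP checkpoint and an NITP checkpoint, on identical inputs, would give a direct before-and-after comparison.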

Load-bearing premise

Shallow-layer representations from the same model provide stable self-supervised targets for the implicit semantic content of the next token without introducing instability or circular dependencies during training.

What would settle it

Repeating the same pre-training run with and without NITP on a 9B-scale model and finding no improvement, or a degradation, on MMLU-Pro or C3 relative to the NTP baseline.

Figures

Figures reproduced from arXiv: 2605.24956 by Debing Zhang, Junchi Yan, Shaofeng Zhang, Xiangdong Zhang, Xiaohan Qin, Yu Cheng.

Figure 1: Top: Representation geometry of the last hidden states under NTP and NITP. Bottom: Average downstream performance of 9B MoE and 2B dense models (details in Appendix B). Team et al., 2025; He & Su, 2025; Chen et al., 2024). By maximizing the likelihood of the next token over massive corpora, this paradigm learns general-purpose representations that support a wide range of downstream tasks (Guo et al., 2025…
Figure 2: Overview of NITP. Next Implicit Token Prediction supervises hidden states by predicting temporally shifted implicit tokens, i.e. shallow-layer representations, and is jointly optimized with the standard next-token prediction objective. These shallow representations act as semantics anchors (see Section 3.3). The last-layer hidden states are projected to match the implicit targets using a cosine similarity …
Figure 3: Loss comparison with and without temporal shift.
Figure 4: Average performance under different NITP loss. MoE model in
Figure 5: Training dynamics of the NITP loss. Evolution of L_NITP during pre-training, exhibiting a characteristic three-phase behavior: an initial collapse induced by random initialization, a transient hump caused by the emergence of structured shallow-layer targets, and a long-term stable convergence. According to the task types used in our evaluation pipeline, the remaining tasks are grouped into Classification (1…
Original abstract

Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate and anisotropic configurations that can limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense continuous supervision directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. We provide theoretical analysis showing that NITP regularizes the optimization landscape by mitigating under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, with approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at https://github.com/aHapBean/NITP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard next-token prediction (NTP) leaves the latent representation space under-constrained, leading to degenerate and anisotropic hidden states. It proposes Next Implicit Token Prediction (NITP), which augments NTP with dense continuous supervision by training the model to predict the implicit semantic content of the next token using shallow-layer representations from the same model as stable self-supervised targets. The paper provides a theoretical analysis showing that NITP regularizes the optimization landscape and reports empirical results across dense and MoE models (0.5B to 9B parameters) with consistent downstream gains, including a 5.7% absolute improvement on MMLU-Pro for a 9B MoE model, at roughly 2% additional training FLOPs and no inference cost. Code is released at https://github.com/aHapBean/NITP.

Significance. If the stability of the moving targets and the claimed regularization effect hold under joint optimization, NITP would offer a low-overhead method to improve representation geometry during pre-training of both dense and MoE models. The reported gains on multiple benchmarks and the public implementation supporting reproducibility are concrete strengths.

major comments (2)
  1. [Abstract] Abstract: The central claim that shallow-layer representations serve as 'stable' self-supervised targets is load-bearing for both the theoretical regularization argument and the empirical gains. Because these targets are computed from shallow layers updated by the same optimizer on the same forward pass, they are non-stationary; the manuscript supplies no measurement of target drift (e.g., cosine similarity of shallow states on held-out prefixes across training steps), no ablation separating stability from the claimed benefit, and no analysis showing that joint dynamics avoid oscillatory or collapsed solutions.
  2. [Abstract] Abstract: The theoretical analysis is invoked to show that NITP 'regularizes the optimization landscape by mitigating under-constrained degrees of freedom,' yet the provided text contains no equations, derivations, or formal statements of the analysis, preventing evaluation of whether the claimed mitigation follows from the construction or is contradicted by the moving-target supervision.
minor comments (1)
  1. [Abstract] Abstract: Experimental details (baselines, number of runs, variance, exact model configurations, and training hyperparameters) are absent, making it difficult to assess the reported improvements such as the 5.7% MMLU-Pro gain.
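
The drift measurement requested in major comment 1 could be as simple as the sketch below: take the shallow-layer states for the same held-out prefixes at two training checkpoints and report one minus their mean cosine similarity. The function name target_drift and the input layout (one vector per held-out position, in matching order) are hypothetical; the paper reports no such measurement.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def target_drift(states_a, states_b):
    """1 - mean cosine between matched shallow-layer states.

    states_a and states_b hold the shallow-layer vectors for the same
    held-out positions at two checkpoints. 0 means the targets did not
    rotate at all; 2 means every target flipped direction.
    """
    if len(states_a) != len(states_b) or not states_a:
        raise ValueError("need two non-empty lists of equal length")
    sims = [cosine(a, b) for a, b in zip(states_a, states_b)]
    return 1.0 - sum(sims) / len(sims)
```

Plotting this value for consecutive checkpoints, separately for shallow and deep layers, would show whether the shallow targets really change more slowly over training.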

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the two major comments below. Both points identify substantive gaps in the submitted version, and we will incorporate the requested evidence and formalization in a revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that shallow-layer representations serve as 'stable' self-supervised targets is load-bearing for both the theoretical regularization argument and the empirical gains. Because these targets are computed from shallow layers updated by the same optimizer on the same forward pass, they are non-stationary; the manuscript supplies no measurement of target drift (e.g., cosine similarity of shallow states on held-out prefixes across training steps), no ablation separating stability from the claimed benefit, and no analysis showing that joint dynamics avoid oscillatory or collapsed solutions.

    Authors: We agree that the targets are non-stationary by construction. Our use of the term 'stable' was intended to reflect the empirical observation that shallow-layer representations change more slowly than deeper ones, but this was not quantified. We will add (i) measurements of target drift via cosine similarity on held-out prefixes across training checkpoints, (ii) an ablation that isolates the effect of target stability (e.g., by freezing shallow layers at selected points), and (iii) monitoring of loss curves and representation metrics to check for oscillatory or collapsed behavior under joint optimization. These additions will be included in the revised manuscript.

    revision: yes

  2. Referee: [Abstract] Abstract: The theoretical analysis is invoked to show that NITP 'regularizes the optimization landscape by mitigating under-constrained degrees of freedom,' yet the provided text contains no equations, derivations, or formal statements of the analysis, preventing evaluation of whether the claimed mitigation follows from the construction or is contradicted by the moving-target supervision.

    Authors: The submitted version omitted the theoretical analysis section. The full manuscript contains a dedicated section with the formal argument, including the loss formulation, a derivation showing how the additional continuous supervision term reduces the effective degrees of freedom in the representation space, and a discussion of its interaction with the moving targets. We will restore and expand this section in the revision, explicitly addressing whether the regularization holds under non-stationary targets.

    revision: yes

Circularity Check

1 step flagged

NITP targets defined from model's own shallow layers reduce supervision to internal states

specific steps
  1. self-definitional [Abstract]
    "NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets."

    The target for the 'implicit semantic content' at position t+1 is the shallow-layer representation computed by the identical model on the same pass; therefore the added loss term is constructed from the model's internal states rather than an independent signal, rendering the regularization effect partly definitional.

full rationale

The core NITP mechanism defines its dense supervision signal directly from the shallow-layer hidden states of the model under training on the same forward pass. This makes the claimed regularization of the representation geometry dependent on the model's own evolving parameters rather than an external or fixed target, so the 'mitigation of under-constrained degrees of freedom' follows in part by construction of the loss. Empirical gains are reported but the theoretical analysis rests on the unverified stability assertion for these moving targets.
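
Writing the joint objective out makes the flagged dependency explicit. The form below is an assumed reconstruction from the abstract and the Figure 2 caption, not the paper's own equation: the weight \lambda, the projection g_\phi, the shallow layer index s, and the 1 - cos form are illustrative choices.

```latex
\mathcal{L}(\theta,\phi)
  = \mathcal{L}_{\mathrm{NTP}}(\theta)
  + \lambda\,\mathcal{L}_{\mathrm{NITP}}(\theta,\phi),
\qquad
\mathcal{L}_{\mathrm{NITP}}(\theta,\phi)
  = \frac{1}{T-1}\sum_{t=1}^{T-1}
    \Bigl[\,1 - \cos\!\bigl(g_\phi\bigl(h^{(L)}_{t}(\theta)\bigr),\;
                            h^{(s)}_{t+1}(\theta)\bigr)\Bigr]
```

Here h^{(L)}_t is the last-layer hidden state at position t and h^{(s)}_{t+1} is the shallow-layer state at position t + 1. Both arguments of the cosine depend on the same parameters \theta, which is the circularity the audit points at; whether the target side is detached from the gradient is not stated on this page.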

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities; full manuscript required for ledger.

Reference graph

Works this paper leans on

61 extracted references · 36 canonical work pages · 21 internal anchors

  2. [2]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Ainslie, J., Lee-Thorp, J., De Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

  3. [3]

    Transformers need glasses! information over-squashing in language tasks

    Barbero, F., Banino, A., Kapturowski, S., Kumaran, D., Madeira Araújo, J., Vitvitskyi, O., Pascanu, R., and Veličković, P. Transformers need glasses! information over-squashing in language tasks. Advances in Neural Information Processing Systems, 37: 98111--98142, 2024

  4. [4]

    Representation learning: A review and new perspectives

    Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8): 1798--1828, 2013

  5. [5]

    Next token prediction towards multimodal intelligence: A comprehensive survey

    Chen, L., Wang, Z., Ren, S., Li, L., Zhao, H., Li, Y., Cai, Z., Guo, H., Zhang, L., Xiong, Y., et al. Next token prediction towards multimodal intelligence: A comprehensive survey. arXiv preprint arXiv:2412.18619, 2024

  6. [6]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  7. [7]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  8. [8]

    Opencompass: A universal evaluation platform for foundation models

    Contributors, O. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023

  9. [9]

    How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings

    Ethayarajh, K. How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings. arXiv preprint arXiv:1909.00512, 2019

  10. [10]

    Representation degeneration problem in training natural language generation models

    Gao, J., He, D., Tan, X., Qin, T., Wang, L., and Liu, T.-Y. Representation degeneration problem in training natural language generation models. arXiv preprint arXiv:1907.12009, 2019

  11. [11]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  12. [12]

    Better & Faster Large Language Models via Multi-token Prediction

    Gloeckle, F., Idrissi, B. Y., Rozière, B., Lopez-Paz, D., and Synnaeve, G. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737, 2024

  13. [13]

    Bootstrap your own latent-a new approach to self-supervised learning

    Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33: 21271--21284, 2020

  14. [14]

    Minillm: Knowledge distillation of large language models

    Gu, Y., Dong, L., Wei, F., and Huang, M. Minillm: Knowledge distillation of large language models. In International Conference on Learning Representations, volume 2024, pp. 32694--32717, 2024a

  15. [15]

    Xiezhi: An ever-updating benchmark for holistic domain knowledge evaluation

    Gu, Z., Zhu, X., Ye, H., Zhang, L., Wang, J., Zhu, Y., Jiang, S., Xiong, Z., Li, Z., Wu, W., et al. Xiezhi: An ever-updating benchmark for holistic domain knowledge evaluation. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pp. 18099--18107, 2024b

  16. [16]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  17. [17]

    A law of next-token prediction in large language models

    He, H. and Su, W. J. A law of next-token prediction in large language models. Physical Review E, 112(3): 035317, 2025

  18. [18]

    Measuring Massive Multitask Language Understanding

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  19. [19]

    Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  20. [20]

    C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

    Huang, Y., Bai, Y., Zhu, Z., Zhang, J., Zhang, J., Su, T., Liu, J., Lv, C., Zhang, Y., Fu, Y., et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36: 62991--63010, 2023

  21. [21]

    Tinybert: Distilling bert for natural language understanding

    Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., and Liu, Q. Tinybert: Distilling bert for natural language understanding. In Findings of the association for computational linguistics: EMNLP 2020, pp. 4163--4174, 2020

  22. [22]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  23. [23]

    When choosing plausible alternatives, clever hans can be clever

    Kavumba, P., Inoue, N., Heinzerling, B., Singh, K., Reisert, P., and Inui, K. When choosing plausible alternatives, clever hans can be clever. arXiv preprint arXiv:1911.00225, 2019

  24. [24]

    Self-distillation for further pre-training of transformers

    Lee, S., Kang, M., Lee, J., Hwang, S. J., and Kawaguchi, K. Self-distillation for further pre-training of transformers. arXiv preprint arXiv:2210.02871, 2022

  25. [25]

    Less is more: Task-aware layer-wise distillation for language model compression

    Liang, C., Zuo, S., Zhang, Q., He, P., Chen, W., and Zhao, T. Less is more: Task-aware layer-wise distillation for language model compression. In International Conference on Machine Learning, pp. 20852--20867. PMLR, 2023

  26. [26]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a

  27. [27]

    DeepSeek-V3 Technical Report

    Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024b

  28. [28]

    L-mtp: Leap multi-token prediction beyond adjacent context for large language models

    Liu, X., Xia, X., Zhao, W., Zhang, M., Yu, X., Su, X., Yang, S., Ng, S.-K., and Chua, T.-S. L-mtp: Leap multi-token prediction beyond adjacent context for large language models. arXiv preprint arXiv:2505.17505, 2025

  29. [29]

    Fantastic semantics and where to find them: Investigating which layers of generative llms reflect lexical semantics

    Liu, Z., Kong, C., Liu, Y., and Sun, M. Fantastic semantics and where to find them: Investigating which layers of generative llms reflect lexical semantics. arXiv preprint arXiv:2403.01509, 2024c

  30. [30]

    Beyond multi-token prediction: Pretraining llms with future summaries

    Mahajan, D., Goyal, S., Idrissi, B. Y., Pezeshki, M., Mitliagkas, I., Lopez-Paz, D., and Ahuja, K. Beyond multi-token prediction: Pretraining llms with future summaries. arXiv preprint arXiv:2510.14751, 2025

  31. [31]

    Language Models are Few-Shot Learners

    Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 1(3): 3, 2020

  32. [32]

    New insights and perspectives on the natural gradient method

    Martens, J. New insights and perspectives on the natural gradient method. Journal of Machine Learning Research, 21(146): 1--76, 2020

  33. [33]

    Mteb: Massive text embedding benchmark

    Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. Mteb: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014--2037, 2023

  34. [34]

    Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  35. [35]

    Future lens: Anticipating subsequent tokens from a single hidden state

    Pal, K., Sun, J., Yuan, A., Wallace, B. C., and Bau, D. Future lens: Anticipating subsequent tokens from a single hidden state. arXiv preprint arXiv:2311.04897, 2023

  36. [36]

    The lambada dataset: Word prediction requiring a broad discourse context

    Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N.-Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The lambada dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers), pp. 1525--1534, 2016

  37. [37]

    Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140): 1--67, 2020

  38. [38]

    FitNets: Hints for Thin Deep Nets

    Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014

  39. [39]

    The effective rank: A measure of effective dimensionality

    Roy, O. and Vetterli, M. The effective rank: A measure of effective dimensionality. In 2007 15th European signal processing conference, pp. 606--610. IEEE, 2007

  40. [40]

    Your llm knows the future: Uncovering its multi-token prediction potential

    Samragh, M., Kundu, A., Harrison, D., Nishu, K., Naik, D., Cho, M., and Farajtabar, M. Your llm knows the future: Uncovering its multi-token prediction potential. arXiv preprint arXiv:2507.11851, 2025

  41. [41]

    GLU Variants Improve Transformer

    Shazeer, N. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

  42. [42]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  43. [43]

    Layer by Layer: Uncovering Hidden Representations in Language Models

    Skean, O., Arefin, M. R., Zhao, D., Patel, N., Naghiyev, J., LeCun, Y., and Shwartz-Ziv, R. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013, 2025

  44. [44]

    Investigating prior knowledge for challenging chinese machine reading comprehension

    Sun, K., Yu, D., Yu, D., and Cardie, C. Investigating prior knowledge for challenging chinese machine reading comprehension. Transactions of the Association for Computational Linguistics, 8: 141--155, 2020a

  45. [45]

    Patient knowledge distillation for bert model compression

    Sun, S., Cheng, Y., Gan, Z., and Liu, J. Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355, 2019

  46. [46]

    Contrastive distillation on intermediate representations for language model compression

    Sun, S., Gan, Z., Fang, Y., Cheng, Y., Wang, S., and Liu, J. Contrastive distillation on intermediate representations for language model compression. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 498--508, 2020b

  47. [47]

    Challenging big-bench tasks and whether chain-of-thought can solve them

    Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q., Chi, E., Zhou, D., et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003--13051, 2023

  48. [48]

    Llm pretraining with continuous concepts

    Tack, J., Lanchantin, J., Yu, J., Cohen, A., Kulikov, I., Lan, J., Hao, S., Tian, Y., Weston, J., and Li, X. Llm pretraining with continuous concepts. arXiv preprint arXiv:2502.08524, 2025

  49. [49]

    Commonsenseqa: A question answering challenge targeting commonsense knowledge

    Talmor, A., Herzig, J., Lourie, N., and Berant, J. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149--4158, 2019

  50. [50]

    Kimi K2: Open Agentic Intelligence

    Team, K., Bai, Y., Bao, Y., Chen, G., Chen, J., Chen, N., Chen, R., Chen, Y., Chen, Y., Chen, Y., et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025

  51. [51]

    Improving neural language generation with spectrum control

    Wang, L., Huang, J., Huang, K., Hu, Z., Wang, G., and Gu, Q. Improving neural language generation with spectrum control. In International Conference on Learning Representations, 2020

  52. [52]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37: 95266--95290, 2024

  53. [53]

    Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective

    Wen, K., Li, Z., Wang, J., Hall, D., Liang, P., and Ma, T. Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective. arXiv preprint arXiv:2410.05192, 2024

  54. [54]

    Sheared llama: Accelerating language model pre-training via structured pruning

    Xia, M., Gao, T., Zeng, Z., and Chen, D. Sheared llama: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023

  55. [55]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  56. [56]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yue, Y., Chen, Z., Lu, R., Zhao, A., Wang, Z., Song, S., and Huang, G. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025

  57. [57]

    Be your own teacher: Improve the performance of convolutional neural networks via self distillation

    Zhang, L., Song, J., Gao, A., Chen, J., Bao, C., and Ma, K. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3713--3722, 2019

  58. [58]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Yang, B., Xie, P., Yang, A., Liu, D., Lin, J., et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025

  59. [59]

    Representation degeneration problem in prompt-based models for natural language understanding

    Zhao, Q., He, R., Zhang, J., Liu, C., and Wang, B. Representation degeneration problem in prompt-based models for natural language understanding. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 13946--13957, 2024

  60. [60]

    Agieval: A human-centric benchmark for evaluating foundation models

    Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., Saied, A., Chen, W., and Duan, N. Agieval: A human-centric benchmark for evaluating foundation models. In Findings of the Association for Computational Linguistics: NAACL 2024, pp. 2299--2314, 2024

  61. [61]

    Predicting the order of upcoming tokens improves language modeling

    Zuhri, Z. M., Fuadi, E. H., and Aji, A. F. Predicting the order of upcoming tokens improves language modeling. arXiv preprint arXiv:2508.19228, 2025