5th International Conference on Learning Representations
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- A Study on Hidden Layer Distillation for Large Language Model Pre-Training
  Hidden layer distillation yields systematic perplexity gains over logit KD in LLM pre-training but does not consistently improve downstream performance.
- K-Quantization and its Impact on Output Performance
  Empirical evaluation of quantization effects on eight LLMs across bit widths, showing that performance generally declines at lower precision, with model-size-dependent resilience and acceptable accuracy at 2 bits in many cases.