Fully Looped Transformer stabilizes looped training up to 12 iterations via distributed inter-loop signals and attention injection, improving downstream performance by up to 13.2%.
hub
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , year=
12 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding better performance than scratch training.
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reasoning tasks.
OP-Mix is an on-policy data mixing method that uses low-rank adapter interpolation to find near-optimal data mixtures throughout language model training with reduced compute.
Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
In compute-optimal regimes, language model parameter count scales proportionally with data bytes rather than tokens, and the optimal compression rate decreases with increasing compute.
Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.
Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
A hypernetwork produces a condition-dependent beta that meta-gates SwiGLU nonlinearity, giving LLMs adaptive behavior across task, domain, persona and style inputs without finetuning.
GiVA uses gradients to initialize vector adapters so they match LoRA performance at eight times lower rank while keeping extreme parameter efficiency.
Multilingual pooling for quality classifiers outperforms monolingual baselines in rank stability and accuracy for LLM pretraining data selection across high- and low-resource languages.
citing papers explorer
-
Simply Stabilizing the Loop via Fully Looped Transformer
Fully Looped Transformer stabilizes looped training up to 12 iterations via distributed inter-loop signals and attention injection, improving downstream performance by up to 13.2%.
-
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding better performance than scratch training.
-
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
-
Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reasoning tasks.
-
Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks
Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
-
Compute Optimal Tokenization
In compute-optimal regimes, language model parameter count scales proportionally with data bytes rather than tokens, and the optimal compression rate decreases with increasing compute.
-
The Efficiency Gap in Byte Modeling
Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
-
Learn-To-Learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM
A hypernetwork produces a condition-dependent beta that meta-gates SwiGLU nonlinearity, giving LLMs adaptive behavior across task, domain, persona and style inputs without finetuning.
-
GiVA: Gradient-Informed Bases for Vector-Based Adaptation
GiVA uses gradients to initialize vector adapters so they match LoRA performance at eight times lower rank while keeping extreme parameter efficiency.
-
Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection
Multilingual pooling for quality classifiers outperforms monolingual baselines in rank stability and accuracy for LLM pretraining data selection across high- and low-resource languages.