arXiv: 2002.05202 · v1 · submitted 2020-02-12 · cs.LG · cs.NE · stat.ML

Recognition: 2 theorem links


GLU Variants Improve Transformer

Noam Shazeer


Pith reviewed 2026-05-10 16:56 UTC · model grok-4.3

classification: cs.LG · cs.NE · stat.ML

keywords: GLU · Gated Linear Units · Transformer · activation functions · feed-forward sublayers · sequence-to-sequence · ReLU · GELU


The pith

Some gated linear unit variants improve Transformer quality over standard ReLU or GELU activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines changes to Gated Linear Units by swapping the usual sigmoid for other nonlinear functions. These modified units are placed inside the feed-forward sections of the Transformer for sequence-to-sequence work. Tests show that certain swaps produce higher quality results than the ReLU or GELU versions normally used. A reader would care because this points to a small, targeted tweak that can lift performance in widely used models. The approach keeps the rest of the architecture unchanged while focusing gains on the activation step.

Core claim

Gated Linear Units consist of the component-wise product of two linear projections, one of which is first passed through a nonlinear function. Variations on GLU are possible by using different nonlinear functions in place of sigmoid. When these variants are inserted into the feed-forward sublayers of the Transformer sequence-to-sequence model, some of them yield quality improvements over the typically-used ReLU or GELU activations.
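The core claim can be made concrete with a minimal pure-Python sketch of a gated feed-forward sublayer. It is an illustration under stated assumptions, not the paper's implementation: bias terms are omitted, GELU is the exact erf form, Swish uses beta = 1, and the helper names (`matvec`, `glu_variant`, `ffn_glu_variant`, `ACTIVATIONS`) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def gelu(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def swish(z):
    # Swish with beta = 1 (assumption): z * sigmoid(z).
    return z * sigmoid(z)

def identity(z):
    return z

def matvec(x, W):
    # Row vector x (length n) times matrix W (n rows, m columns).
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def glu_variant(x, W, V, act):
    # Component-wise product of two linear projections; only xW passes through `act`.
    gate = [act(g) for g in matvec(x, W)]
    value = matvec(x, V)
    return [g * v for g, v in zip(gate, value)]

def ffn_glu_variant(x, W, V, W2, act):
    # Gated hidden layer followed by the output projection W2 (no biases assumed).
    return matvec(glu_variant(x, W, V, act), W2)

# Variant name -> nonlinearity applied to the gate projection.
ACTIVATIONS = {
    "GLU": sigmoid,
    "Bilinear": identity,
    "ReGLU": relu,
    "GEGLU": gelu,
    "SwiGLU": swish,
}

if __name__ == "__main__":
    x = [-1.0, 2.0]
    W = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 1.0], [1.0, 1.0]]
    W2 = [[1.0], [1.0]]
    for name, act in ACTIVATIONS.items():
        print(name, ffn_glu_variant(x, W, V, W2, act))
```

Swapping the entry in `ACTIVATIONS` is the only change between variants, which is the sense in which the modification leaves the rest of the architecture untouched.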

What carries the argument

Gated Linear Units with alternative nonlinear functions replacing sigmoid, placed in the feed-forward sublayers of the Transformer.

Load-bearing premise

The quality gains come from the choice of nonlinear function in the GLU and would appear in other models, datasets, and training conditions.

What would settle it

Repeating the experiments on the same datasets with identical training but swapping the best GLU variants back to ReLU or GELU and finding no quality drop would show the claim does not hold.

Original abstract

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Shazeer's short note reports that a few GLU variants beat ReLU and GELU inside Transformer feed-forward layers, but the experiments give no details on controls or variance so the gains are hard to trust.

Hey, this is a very short empirical note. It takes the Gated Linear Unit from 2016, swaps the sigmoid for a few other nonlinearities, and drops the result into the feed-forward sublayer of the 2017 Transformer. The claim is that some of those variants produce better quality than the standard ReLU or GELU on the tasks they tried. That is the entire contribution in one paragraph.

What the paper does cleanly is keep the change minimal and cheap to test. Anyone already running Transformers can implement the variants in a few lines and see if they help. That kind of low-friction tweak is worth knowing about even if the effect size turns out modest.

The soft spot is exactly what the stress-test note flags. The abstract mentions empirical tests and quality improvements but supplies zero information on model scale, dataset, number of runs, random seeds, or whether every other hyperparameter stayed identical across conditions. On WMT-scale work, BLEU deltas smaller than about half a point often sit inside run-to-run noise, so without those controls it is impossible to know whether the activation choice is doing the work. The paper does not appear to contain any deeper analysis or theory either, just the observation.

This is the sort of thing that is useful for practitioners who train large models and are willing to run their own ablations. It is not aimed at readers looking for new mechanisms or formal guarantees. I would bring it to a reading group for a ten-minute discussion on activation choices, and I would cite it if I were writing about SwiGLU-style blocks. It is worth sending to peer review because the idea is practical and the author has a good track record, but any referee would rightly ask for the missing experimental details before accepting the central claim.

Referee Report

3 major / 1 minor

Summary. The paper introduces variations on Gated Linear Units (GLU) by replacing the sigmoid with other nonlinear or linear functions. These variants are tested by replacing the standard activation in the feed-forward sublayers of the Transformer sequence-to-sequence model, with the central claim that some variants produce quality improvements over the usual ReLU or GELU activations.

Significance. If the improvements are shown to be robust, this would supply a simple, capacity-neutral modification that could be adopted in Transformer-based models for tasks such as machine translation, offering a modest but practical gain in model quality.

major comments (3)

Abstract: the claim that 'some of them yield quality improvements' is presented without any description of model sizes, datasets, number of runs, statistical significance, or controls for other variables; this information is load-bearing for the central empirical claim.
Experiments section: no indication is given that multiple random seeds were used or that identical hyperparameter schedules and initialization were applied across all activation choices; single-run BLEU deltas on WMT-scale tasks frequently lie within the 0.2-0.5 point noise floor, leaving attribution to the GLU variants insecure.
Results section: the manuscript does not report whether the GLU variants were implemented with identical parameter count and computational cost to the ReLU/GELU baselines, which is required to ensure the observed gains are not due to effective capacity differences.
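The seed-variance concern in the second major comment can be sketched as a simple check of whether a variant's mean score clears the baseline's run-to-run spread. Everything here is an assumption for illustration: the function name `delta_vs_noise` is hypothetical, the two-standard-deviation threshold is an arbitrary rule of thumb rather than a formal significance test, and the scores in the usage lines are placeholders, not results from the paper.

```python
import statistics

def delta_vs_noise(baseline_scores, variant_scores, k=2.0):
    """Compare a mean improvement against baseline run-to-run noise.

    Returns (delta, noise, exceeds), where `noise` is the sample standard
    deviation of the baseline runs and `exceeds` is True when |delta| is
    larger than k * noise. Needs at least two baseline runs.
    """
    base_mean = statistics.mean(baseline_scores)
    variant_mean = statistics.mean(variant_scores)
    noise = statistics.stdev(baseline_scores)
    delta = variant_mean - base_mean
    return delta, noise, abs(delta) > k * noise

if __name__ == "__main__":
    # Placeholder scores from three hypothetical seeds per condition.
    baseline = [27.0, 27.2, 27.4]
    variant = [27.3, 27.5, 27.7]
    print(delta_vs_noise(baseline, variant))
```

With a single run per condition the noise term cannot be estimated at all, which is the substance of the referee's objection.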

minor comments (1)

§2: the definitions of the GLU variants would benefit from explicit equations showing the replacement function in each case.
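For orientation, the kind of explicit equations the minor comment asks for might look like the block below. This is a sketch in the bias-free notation of the FFN_GEGLU passage quoted on this page; the symbol choices (σ for sigmoid, ⊗ for the component-wise product, Swish with β = 1) are assumptions of this sketch and should be checked against the paper's own definitions.

```latex
\begin{align*}
\mathrm{GLU}(x, W, V)      &= \sigma(xW) \otimes xV \\
\mathrm{Bilinear}(x, W, V) &= xW \otimes xV \\
\mathrm{ReGLU}(x, W, V)    &= \max(0, xW) \otimes xV \\
\mathrm{GEGLU}(x, W, V)    &= \mathrm{GELU}(xW) \otimes xV \\
\mathrm{SwiGLU}(x, W, V)   &= \mathrm{Swish}_{1}(xW) \otimes xV \\
\mathrm{FFN}_{\mathrm{GEGLU}}(x, W, V, W_2) &= \bigl(\mathrm{GELU}(xW) \otimes xV\bigr) W_2
\end{align*}
```

Each line differs from the first only in the function applied to the xW projection; the feed-forward version multiplies the gated hidden vector by the output projection W_2.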

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address each major comment below and will incorporate clarifications into a revised manuscript.


Referee: Abstract: the claim that 'some of them yield quality improvements' is presented without any description of model sizes, datasets, number of runs, statistical significance, or controls for other variables; this information is load-bearing for the central empirical claim.

Authors: We agree that the abstract would be strengthened by additional context on the experimental setting. In the revision we will expand the abstract to note that the experiments follow the transfer-learning setup of the T5 base model [5]: each variant is pre-trained on the text-to-text span-filling objective and compared by held-out log-perplexity, then fine-tuned on GLUE [10], SuperGLUE [11], and SQuAD [6]. Full details on training procedure and controls remain in the body of the paper. revision: yes
Referee: Experiments section: no indication is given that multiple random seeds were used or that identical hyperparameter schedules and initialization were applied across all activation choices; single-run BLEU deltas on WMT-scale tasks frequently lie within the 0.2-0.5 point noise floor, leaving attribution to the GLU variants insecure.

Authors: All variants were trained with identical hyperparameter schedules, data order, and initialization procedures; the only difference was the form of the feed-forward sublayer. Because of the high computational cost of training at this scale we report single runs, which is standard practice for such experiments. In the revised Experiments section we will explicitly document the shared training protocol and add a short discussion of run-to-run variance, noting that the improvements should be interpreted in light of typical noise levels for the reported metrics. revision: partial
Referee: Results section: the manuscript does not report whether the GLU variants were implemented with identical parameter count and computational cost to the ReLU/GELU baselines, which is required to ensure the observed gains are not due to effective capacity differences.

Authors: A GLU-variant feed-forward layer has three weight matrices rather than the two in the ReLU/GELU baseline, so it cannot match the baseline at the same hidden width. To preserve parameter and operation counts, the hidden dimension d_ff of the GLU-variant layers is reduced by a factor of 2/3 (from 3072 to 2048 in the base configuration). We will add an explicit statement in the revised Results section confirming that every compared model is matched for parameters and computation. revision: yes
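The parameter-matching question raised in the third comment can be checked with a few lines of arithmetic. The sketch below assumes bias-free projections and uses d_model = 768 and d_ff = 3072 as illustrative dimensions; the function names are hypothetical. A gated layer holds three matrices (W, V, W2) against the baseline's two, so at equal hidden width it is 1.5 times larger, and shrinking the hidden width to two-thirds restores the match.

```python
def ffn_params(d_model, d_ff):
    # Baseline FFN: two bias-free matrices, d_model x d_ff and d_ff x d_model.
    return 2 * d_model * d_ff

def glu_ffn_params(d_model, d_ff):
    # Gated FFN: W and V (d_model x d_ff each) plus W2 (d_ff x d_model).
    return 3 * d_model * d_ff

if __name__ == "__main__":
    d_model, d_ff = 768, 3072          # illustrative dimensions (assumption)
    reduced_d_ff = d_ff * 2 // 3       # two-thirds of the baseline hidden width
    print("baseline FFN:        ", ffn_params(d_model, d_ff))
    print("gated, same width:   ", glu_ffn_params(d_model, d_ff))
    print("gated, reduced width:", glu_ffn_params(d_model, reduced_d_ff))
```

The same ratio applies to multiply-accumulate operations per token, since each matrix contributes one matrix-vector product.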

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation with no derivation chain


The paper is an empirical study that defines GLU variants by reference to prior work and tests them by swapping activations in the Transformer feed-forward sublayers, reporting measured quality deltas on held-out perplexity and downstream fine-tuning benchmarks. No first-principles derivations, predictions, or fitted parameters are presented as outputs; the central claim rests entirely on experimental outcomes rather than any reduction to self-citation, self-definition, or ansatz. Citations to the original GLU and Transformer papers supply background definitions only and are not load-bearing for any claimed prediction or uniqueness result. The derivation chain is therefore absent, and the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard machine-learning experimental assumptions rather than new axioms or invented entities; no free parameters are introduced in the abstract.

axioms (1)

domain assumption: Standard supervised training and evaluation protocols for sequence-to-sequence models produce reliable quality comparisons. Invoked implicitly when claiming quality improvements from activation changes.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We test these variants in the feed-forward sublayers of the Transformer sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations."

DAlembert.Inevitability · bilinear_family_forced · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: FFN_GEGLU(x, W, V, W_2) = (GELU(xW) ⊗ xV) W_2

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations
cs.LG 2026-04 unverdicted novelty 8.0

CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving ...
Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
cs.LG 2023-12 unverdicted novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting
cs.CV 2026-05 unverdicted novelty 7.0

VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.
Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
cond-mat.str-el 2026-05 conditional novelty 7.0

PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding
cs.LG 2026-05 unverdicted novelty 7.0

GeoFlowVLM learns joint distributions of l2-normalized VLM embeddings on the product hypersphere via Riemannian flow matching to expose both aleatoric and epistemic uncertainty through derived entropy and typicality scores.
Neural Statistical Functions
cs.LG 2026-05 unverdicted novelty 7.0

Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.
Locking Pretrained Weights via Deep Low-Rank Residual Distillation
cs.LG 2026-05 unverdicted novelty 7.0

DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via mo...
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
cs.CL 2026-05 unverdicted novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models
cs.LG 2026-05 unverdicted novelty 7.0

Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
eess.AS 2026-05 unverdicted novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
Fast Byte Latent Transformer
cs.CL 2026-05 unverdicted novelty 7.0

BLT-D, BLT-S, and BLT-DV use block-wise diffusion training and speculative verification to enable parallel byte generation in byte-level LMs, cutting memory-bandwidth cost by over 50%.
Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity
stat.ML 2026-05 unverdicted novelty 7.0

Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.
SMolLM: Small Language Models Learn Small Molecular Grammar
cs.LG 2026-05 unverdicted novelty 7.0

A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.
Degradation-Aware Adaptive Context Gating for Unified Image Restoration
cs.CV 2026-05 unverdicted novelty 7.0

DACG-IR adds a lightweight degradation-aware module that generates prompts to adaptively gate attention temperature, output features, and spatial-channel fusion in an encoder-decoder network for unified image restoration.
Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting
cs.CV 2026-05 unverdicted novelty 7.0

LeGS turns density control in 3D Gaussian Splatting into a learnable RL policy whose reward is derived from a closed-form sensitivity analysis that measures each Gaussian's marginal contribution to reconstruction quality.
Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning
astro-ph.GA 2026-04 unverdicted novelty 7.0

A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
Can an MLP Absorb Its Own Skip Connection?
cs.LG 2026-04 accept novelty 7.0

Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.
WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images
cs.CV 2026-04 unverdicted novelty 7.0

WildSplatter jointly learns 3D Gaussians and appearance embeddings from unconstrained photo collections to enable fast feed-forward reconstruction and flexible lighting control in 3D Gaussian Splatting.
Grokking of Diffusion Models: Case Study on Modular Addition
cs.LG 2026-04 unverdicted novelty 7.0

Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
Scalable Model-Based Clustering with Sequential Monte Carlo
stat.ML 2026-04 unverdicted novelty 7.0

A memory-efficient SMC clustering method decomposes problems into approximately independent subproblems to handle large-scale online clustering with complex distributions.
Mamba Sequence Modeling meets Model Predictive Control
math.OC 2026-04 unverdicted novelty 7.0

Mamba-MPC stabilizes and tracks references on SISO and MIMO systems in simulation and hardware while outperforming LSTM-MPC with faster computation.
MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts
cs.CL 2026-04 unverdicted novelty 7.0

MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B pa...
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
cs.LG 2026-04 unverdicted novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
Envisioning the Future, One Step at a Time
cs.CV 2026-04 unverdicted novelty 7.0

An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
q-bio.QM 2026-04 unverdicted novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators
cs.AR 2026-04 conditional novelty 7.0

ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
cs.CV 2026-04 unverdicted novelty 7.0

Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
Self-Supervised Foundation Model for Calcium-imaging Population Dynamics
q-bio.QM 2026-04 unverdicted novelty 7.0

CalM uses a discrete tokenizer and dual-axis autoregressive transformer pretrained self-supervised on calcium traces to outperform specialized baselines on population dynamics forecasting and adapt to superior behavio...
Screening Is Enough
cs.LG 2026-04 unverdicted novelty 7.0

Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.
Scaling Latent Reasoning via Looped Language Models
cs.CL 2025-10 unverdicted novelty 7.0

Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
Less is More: Recursive Reasoning with Tiny Networks
cs.LG 2025-10 unverdicted novelty 7.0

TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.
Training Agents Inside of Scalable World Models
cs.AI 2025-09 conditional novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
cs.LG 2025-02 unverdicted novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Moshi: a speech-text foundation model for real-time dialogue
eess.AS 2024-09 accept novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
cs.CV 2024-06 conditional novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Jamba: A Hybrid Transformer-Mamba Language Model
cs.CL 2024-03 conditional novelty 7.0

Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
cs.LG 2024-02 unverdicted novelty 7.0

Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
A Generalist Agent
cs.AI 2022-05 accept novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
The Power of Scale for Parameter-Efficient Prompt Tuning
cs.CL 2021-04 unverdicted novelty 7.0

Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
cs.LG 2026-05 unverdicted novelty 6.0

SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
ELF: Embedded Language Flows
cs.CL 2026-05 unverdicted novelty 6.0

ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
cs.CV 2026-05 unverdicted novelty 6.0

A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
On the global convergence of gradient descent for wide shallow models with bounded nonlinearities
math.OC 2026-05 unverdicted novelty 6.0

Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
cs.LG 2026-05 unverdicted novelty 6.0

BROS achieves the same O(ε^{-2}) sample complexity as exact single-loop SBO methods while cutting peak memory by up to 44.9% through randomized subspaces and bias-corrected Hessian estimation.
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
cs.LG 2026-05 unverdicted novelty 6.0

BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.
Sparse Layers are Critical to Scaling Looped Language Models
cs.LG 2026-05 unverdicted novelty 6.0

Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
cs.CV 2026-05 unverdicted novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
Revisiting Transformer Layer Parameterization Through Causal Energy Minimization
cs.LG 2026-05 unverdicted novelty 6.0

CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
cs.LG 2026-05 unverdicted novelty 6.0

Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.
Tyche: One Step Flow for Efficient Probabilistic Weather Forecasting
cs.LG 2026-05 unverdicted novelty 6.0

Tyche achieves competitive probabilistic weather forecasting skill and calibration using a single-step flow model with JVP-regularized training and rollout finetuning.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 6.0

A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
cs.LG 2026-05 unverdicted novelty 6.0

Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
cs.LG 2026-05 unverdicted novelty 6.0

Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training...
Cumulative-Goodness Free-Riding in Forward-Forward Networks: Real, Repairable, but Not Accuracy-Dominant
cs.LG 2026-05 unverdicted novelty 6.0

Cumulative-goodness Forward-Forward networks exhibit layer free-riding where discrimination gradients decay exponentially with prior positive margins; per-block, hardness-gated, and depth-scaled remedies yield 4-45x b...
Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts
cs.LG 2026-05 unverdicted novelty 6.0

AIR-MoE introduces a two-stage inverted-index routing method based on vector quantization that approximates optimal expert selection for granular MoE models at lower cost and with empirical performance gains.
CHE-TKG: Collaborative Historical Evidence and Evolutionary Dynamics Learning for Temporal Knowledge Graph Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

CHE-TKG is a collaborative dual-view model that jointly captures historical evidence and evolutionary dynamics in temporal knowledge graphs via separate encoders and contrastive alignment to achieve state-of-the-art r...
Velox: Learning Representations of 4D Geometry and Appearance
cs.CV 2026-05 unverdicted novelty 6.0

Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
cs.CV 2026-05 unverdicted novelty 6.0

An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · cited by 128 Pith papers · 5 internal anchors

[1] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. CoRR, abs/1612.08083, 2016. URL http://arxiv.org/abs/1612.08083

[2] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315-323, 2011.

[3] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415, 2016. URL http://arxiv.org/abs/1606.08415

[4] Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641-648, 2007.

[5] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019.

[6] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

[7] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

[8] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235, 2018.

[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

[10] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

[11] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537, 2019.