GLU Variants Improve Transformer

Noam Shazeer · 2020 · cs.LG · arXiv 2002.05202

Mixed citation behavior. Most common role is background (47%).

269 Pith papers citing it

Background 47% of classified citations

abstract

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
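The gating described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the names `glu_variant` and `ffn_glu`, the weight names `W`, `V`, `W2`, the toy dimensions, and the omission of bias terms are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu_variant(x, W, V, activation=sigmoid):
    """Component-wise product of two linear projections of x.

    The activation is applied to the first projection only; with the
    default sigmoid this is the gated linear unit from the abstract.
    Bias terms are omitted here for brevity (an assumption).
    """
    return activation(x @ W) * (x @ V)

def ffn_glu(x, W, V, W2, activation=sigmoid):
    """Hypothetical feed-forward sublayer: gated hidden layer, then an output projection."""
    return glu_variant(x, W, V, activation) @ W2

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.normal(size=(4, d_model))      # 4 token positions
W = rng.normal(size=(d_model, d_ff))   # gate projection
V = rng.normal(size=(d_model, d_ff))   # value projection
W2 = rng.normal(size=(d_ff, d_model))  # output projection

relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z

out_sigmoid = ffn_glu(x, W, V, W2)                      # sigmoid gate
out_relu = ffn_glu(x, W, V, W2, activation=relu)        # a nonlinear variant
out_linear = ffn_glu(x, W, V, W2, activation=identity)  # the "even linear" variant
```

Swapping the `activation` argument is the only change between variants, which mirrors the abstract's framing of using different nonlinear (or even linear) functions in place of the sigmoid.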

citation-role summary

background 30 · method 24 · dataset 2 · extension 1

citation-polarity summary

background 27 · use method 23 · unclear 4 · use dataset 2 · extend 1

authors

Noam Shazeer

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Tight Sample Complexity of Transformers

cs.LG · 2026-06-08 · unverdicted · novelty 8.0

Depth-L transformers with W parameters have VC dimension Theta(L W log(T W)), yielding matching O(L W log((T+T')W)) upper and Omega(L W log((T+T')W/L)) lower bounds on sample complexity for chain-of-thought learning.

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

cs.LG · 2026-04-14 · unverdicted · novelty 8.0

CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.

Test-Time Training with KV Binding Is Secretly Linear Attention

cs.LG · 2026-02-24 · conditional · novelty 8.0

Test-time training with KV binding reduces to learned linear attention.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

Joint inference of weak lensing convergence map and cosmology with diffusion models

astro-ph.CO · 2026-06-30 · unverdicted · novelty 7.0

A transformer-based diffusion model learns the joint distribution of convergence maps and cosmology from log-normal weak lensing simulations and generates calibrated posterior samples matching MCMC results.

DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection

eess.AS · 2026-06-28 · unverdicted · novelty 7.0

DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.

Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA among pixel-space AR models.

MADField: Multi-fidelity Amortized Density Field for Adsorption in Nanoporous Materials

physics.comp-ph · 2026-06-19 · unverdicted · novelty 7.0

MADField is a multi-fidelity amortized model for predicting density fields to improve accuracy and speed of adsorption calculations in nanoporous materials for high-throughput screening.

AttentionCap: Transformer Based Capacitance Matrix Learning Toward Full-Chip Extraction

cs.LG · 2026-06-06 · unverdicted · novelty 7.0

AttentionCap, a customized Transformer, predicts capacitance matrices across multiple process nodes with 0.67% self-capacitance and 3.99% coupling error on unseen designs, outperforming CNN baselines in accuracy and speed.

Stateful Visual Encoders for Vision-Language Models

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Stateful visual encoders condition each visual representation on prior features, yielding consistent gains on multi-image tasks under supervised finetuning across model sizes and domains.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

SubFit enables better LLM compression by fitting residual bypasses to non-contiguously selected submodules, outperforming layer-granularity baselines in accuracy-perplexity trade-offs at 12.5-37.5% sparsity.

mRNAutilus: Multi-Objective-Guided Discrete Generation of mRNA with Optimized Therapeutic Properties

q-bio.BM · 2026-05-29 · unverdicted · novelty 7.0

mRNAutilus generates full-length therapeutic mRNAs via diffusion models and multi-objective guidance, achieving over 400-fold expression gains for luciferase and outperforming baselines for Spike and other targets in zero-shot tests.

Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces Chess-World-Model benchmark from 10M chess games showing recurrent models (SLiCE, Mamba-3, Gated DeltaNet) outperform Transformers on exact state tracking, with random-play split remaining hard at larger scales.

An In-Vitro Study on Cross-Lingual Generalization in Language Models

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

An in-vitro study with synthetic languages finds cross-lingual transfer depends more on tokenization preserving reusable substructure than on lexical similarity or balance, with transfer emerging in stages.

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

Bilingual fine-tuning on a new parallel Filipino-English dementia dataset yields Macro-F1 scores of 0.969-0.973 and eliminates cross-lingual degradation for all tested transformers.

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.

Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.

StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.

$\phi$-Balancing for Mixture-of-Experts Training

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.

VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.

Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo

cond-mat.str-el · 2026-05-13 · conditional · novelty 7.0

PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.

citing papers explorer

Showing 13 of 13 citing papers after filters.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · none · ref 95 · internal anchor

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

Gated Linear Attention Transformers with Hardware-Efficient Training

cs.LG · 2023-12-11 · unverdicted · none · ref 86 · internal anchor

Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · none · ref 95 · internal anchor

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

CogVLM: Visual Expert for Pretrained Language Models

cs.CV · 2023-11-06 · conditional · none · ref 19 · internal anchor

CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.

Vision Transformers Need Registers

cs.CV · 2023-09-28 · unverdicted · none · ref 297 · internal anchor

Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

BloombergGPT: A Large Language Model for Finance

cs.LG · 2023-03-30 · conditional · none · ref 101 · internal anchor

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices

cs.CV · 2023-12-28 · unverdicted · none · ref 105 · internal anchor

MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetson Orin GPU.

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

cs.CL · 2023-11-07 · unverdicted · none · ref 53 · internal anchor

mPLUG-Owl2 presents a modular MLLM architecture that enables modality collaboration via shared functional modules and modality-adaptive components, achieving SOTA on both text and multi-modal tasks with one generic model.

Revisiting Sentiment Analysis for Software Engineering in the Era of Large Language Models

cs.SE · 2023-10-17 · unverdicted · none · ref 55 · internal anchor

bLLMs achieve state-of-the-art results on limited and imbalanced SE sentiment datasets even in zero-shot settings, but fine-tuned sLLMs outperform when ample balanced training data is available.

DINOv2: Learning Robust Visual Features without Supervision

cs.CV · 2023-04-14 · unverdicted · none · ref 22 · internal anchor

Pith review generated a malformed one-line summary.

Baichuan 2: Open Large-scale Language Models

cs.CL · 2023-09-19 · unverdicted · none · ref 59 · internal anchor

Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.

A Survey of Large Language Models

cs.CL · 2023-03-31 · accept · none · ref 286 · internal anchor

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

A Comprehensive Overview of Large Language Models

cs.CL · 2023-07-12 · unverdicted · none · ref 74 · internal anchor

A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
