pith. sign in

super hub Mixed citations

GLU Variants Improve Transformer

Mixed citation behavior. Most common role is background (47%).

269 Pith papers citing it
Background 47% of classified citations
abstract

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.

hub tools

citation-role summary

background 30 method 24 dataset 2 extension 1

citation-polarity summary

claims ledger

  • abstract Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.

authors

co-cited works

clear filters

representative citing papers

Tight Sample Complexity of Transformers

cs.LG · 2026-06-08 · unverdicted · novelty 8.0

Depth-L transformers with W parameters have VC dimension Theta(L W log(T W)), yielding matching O(L W log((T+T')W)) upper and Omega(L W log((T+T')W/L)) lower bounds on sample complexity for chain-of-thought learning.

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

cs.LG · 2026-04-14 · unverdicted · novelty 8.0

CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

Stateful Visual Encoders for Vision-Language Models

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Stateful visual encoders condition each visual representation on prior features, yielding consistent gains on multi-image tasks under supervised finetuning across model sizes and domains.

$\phi$-Balancing for Mixture-of-Experts Training

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.

citing papers explorer

Showing 27 of 27 citing papers after filters.

  • Moshi: a speech-text foundation model for real-time dialogue eess.AS · 2024-09-17 · accept · none · ref 88 · internal anchor

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  • Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation cs.CV · 2024-06-10 · conditional · none · ref 28 · internal anchor

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  • DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cs.CL · 2024-05-07 · unverdicted · none · ref 165 · internal anchor

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  • Improving Dictionary Learning with Gated Sparse Autoencoders cs.LG · 2024-04-24 · unverdicted · none · ref 239 · internal anchor

    Gated SAEs decouple which features to use from how large their activations should be, applying the L1 penalty only to selection and thereby eliminating shrinkage while halving the number of firing features needed for good fidelity.

  • Jamba: A Hybrid Transformer-Mamba Language Model cs.CL · 2024-03-28 · conditional · none · ref 45 · internal anchor

    Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.

  • GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection cs.LG · 2024-03-06 · conditional · none · ref 36 · internal anchor

    GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.

  • Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models cs.LG · 2024-02-29 · unverdicted · none · ref 27 · internal anchor

    Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.

  • The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits cs.CL · 2024-02-27 · unverdicted · none · ref 9 · internal anchor

    BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.

  • OLMo: Accelerating the Science of Language Models cs.CL · 2024-02-01 · accept · none · ref 8 · internal anchor

    OLMo delivers a fully open competitive language model with training data, code, and evaluations to enable community-driven scientific research on LMs.

  • When Attention Sink Emerges in Language Models: An Empirical View cs.CL · 2024-10-14 · accept · none · ref 43 · internal anchor

    Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.

  • Pixtral 12B cs.CV · 2024-10-09 · unverdicted · none · ref 19 · internal anchor

    Pixtral-12B is a 12B multimodal LLM with a custom vision encoder that ingests images at native resolution and aspect ratio, achieving leading benchmark results among open models while preserving text capabilities.

  • Emu3: Next-Token Prediction is All You Need cs.CV · 2024-09-27 · unverdicted · none · ref 76 · internal anchor

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  • Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model cs.AI · 2024-08-20 · unverdicted · none · ref 21 · internal anchor

    A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.

  • FlashNorm: Fast Normalization for Transformers cs.LG · 2024-07-12 · accept · none · ref 14 · internal anchor

    FlashNorm is an exact algebraic reformulation of RMSNorm plus linear projection that folds weights and defers normalization to allow parallel execution, plus scale-invariance simplifications that remove redundant norms in certain architectures.

  • DataComp-LM: In search of the next generation of training sets for language models cs.LG · 2024-06-17 · unverdicted · none · ref 163 · internal anchor

    DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

  • An Empirical Study of Mamba-based Language Models cs.LG · 2024-06-12 · accept · none · ref 44 · internal anchor

    An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.

  • Chameleon: Mixed-Modal Early-Fusion Foundation Models cs.CL · 2024-05-16 · unverdicted · none · ref 29 · internal anchor

    Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro on captioning, VQA, text, and image tasks.

  • Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference cs.CL · 2024-12-18 · unverdicted · none · ref 183 · internal anchor

    ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.

  • Movie Gen: A Cast of Media Foundation Models cs.CV · 2024-10-17 · unverdicted · none · ref 61 · internal anchor

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  • mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models cs.CV · 2024-08-09 · unverdicted · none · ref 55 · internal anchor

    mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.

  • InternLM2 Technical Report cs.CL · 2024-03-26 · unverdicted · none · ref 217 · internal anchor

    InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.

  • Gemma: Open Models Based on Gemini Research and Technology cs.CL · 2024-03-13 · accept · none · ref 98 · internal anchor

    Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

  • DeepSeek-VL: Towards Real-World Vision-Language Understanding cs.AI · 2024-03-08 · unverdicted · none · ref 24 · internal anchor

    DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder, and pretraining that preserves language capabilities.

  • Yi: Open Foundation Models by 01.AI cs.CL · 2024-03-07 · unverdicted · none · ref 70 · internal anchor

    Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

  • DeepSeek LLM: Scaling Open-Source Language Models with Longtermism cs.CL · 2024-01-05 · unverdicted · none · ref 167 · internal anchor

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  • TinyLlama: An Open-Source Small Language Model cs.CL · 2024-01-04 · accept · none · ref 30 · internal anchor

    TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.

  • Gemma 2: Improving Open Language Models at a Practical Size cs.CL · 2024-07-31 · conditional · none · ref 110 · internal anchor

    Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.