pith. sign in

super hub Mixed citations

GLU Variants Improve Transformer

Mixed citation behavior. Most common role is background (47%).

209 Pith papers citing it
Background 47% of classified citations
abstract

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.

hub tools

citation-role summary

background 30 method 24 dataset 2 extension 1

citation-polarity summary

claims ledger

  • abstract Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.

authors

co-cited works

clear filters

representative citing papers

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

cs.LG · 2026-04-14 · unverdicted · novelty 8.0

CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

$\phi$-Balancing for Mixture-of-Experts Training

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.

Neural Statistical Functions

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.

Locking Pretrained Weights via Deep Low-Rank Residual Distillation

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.

Fast Byte Latent Transformer

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

BLT-D, BLT-S, and BLT-DV use block-wise diffusion training and speculative verification to enable parallel byte generation in byte-level LMs, cutting memory-bandwidth cost by over 50%.

Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting

cs.CV · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

LeGS turns density control in 3D Gaussian Splatting into a learnable RL policy whose reward is derived from a closed-form sensitivity analysis that measures each Gaussian's marginal contribution to reconstruction quality.

Can an MLP Absorb Its Own Skip Connection?

cs.LG · 2026-04-26 · accept · novelty 7.0

Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.

citing papers explorer

Showing 24 of 24 citing papers after filters.

  • Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo cond-mat.str-el · 2026-05-13 · conditional · none · ref 41 · internal anchor

    PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.

  • From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models cs.LG · 2026-05-11 · unverdicted · none · ref 47 · internal anchor

    Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

  • Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech eess.AS · 2026-05-10 · unverdicted · none · ref 23 · internal anchor

    GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

  • Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting cs.CV · 2026-05-01 · unverdicted · none · ref 20 · 2 links · internal anchor

    LeGS turns density control in 3D Gaussian Splatting into a learnable RL policy whose reward is derived from a closed-form sensitivity analysis that measures each Gaussian's marginal contribution to reconstruction quality.

  • Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning astro-ph.GA · 2026-04-28 · unverdicted · none · ref 61 · internal anchor

    A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.

  • Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings q-bio.QM · 2026-04-09 · unverdicted · none · ref 57 · internal anchor

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.

  • Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training cs.CV · 2026-04-08 · unverdicted · none · ref 44 · internal anchor

    Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.

  • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach cs.LG · 2025-02-07 · unverdicted · none · ref 139 · internal anchor

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  • Sparse Layers are Critical to Scaling Looped Language Models cs.LG · 2026-05-09 · unverdicted · none · ref 12 · internal anchor

    Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.

  • HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet cs.CV · 2026-04-16 · unverdicted · none · ref 56 · internal anchor

    HAMSA achieves 85.7% ImageNet-1K top-1 accuracy as a spectral-domain SSM with 2.2x faster inference and lower memory than transformers or scanning-based SSMs.

  • Nucleus-Image: Sparse MoE for Image Generation cs.CV · 2026-04-14 · unverdicted · none · ref 19 · internal anchor

    A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.

  • Back to Basics: Let Denoising Generative Models Denoise cs.CV · 2025-11-17 · unverdicted · none · ref 54 · internal anchor

    Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.

  • Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 78 · internal anchor

    Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.

  • Emu3: Next-Token Prediction is All You Need cs.CV · 2024-09-27 · unverdicted · none · ref 76 · internal anchor

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  • LaMDA: Language Models for Dialog Applications cs.CL · 2022-01-20 · unverdicted · none · ref 93 · internal anchor

    LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.

  • CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation physics.ins-det · 2026-05-12 · unverdicted · none · ref 45 · internal anchor

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditional flow matching.

  • gpt-oss-120b & gpt-oss-20b Model Card cs.CL · 2025-08-08 · unverdicted · none · ref 9 · internal anchor

    OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.

  • Emerging Properties in Unified Multimodal Pretraining cs.CV · 2025-05-20 · unverdicted · none · ref 65 · internal anchor

    BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

  • Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference cs.CL · 2024-12-18 · unverdicted · none · ref 183 · internal anchor

    ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.

  • InternLM2 Technical Report cs.CL · 2024-03-26 · unverdicted · none · ref 217 · internal anchor

    InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.

  • Gemma: Open Models Based on Gemini Research and Technology cs.CL · 2024-03-13 · accept · none · ref 98 · internal anchor

    Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

  • Yi: Open Foundation Models by 01.AI cs.CL · 2024-03-07 · unverdicted · none · ref 70 · internal anchor

    Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

  • TinyLlama: An Open-Source Small Language Model cs.CL · 2024-01-04 · accept · none · ref 30 · internal anchor

    TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.

  • Gemma 2: Improving Open Language Models at a Practical Size cs.CL · 2024-07-31 · conditional · none · ref 110 · internal anchor

    Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.