super hub · Mixed citations

GLU Variants Improve Transformer

Mixed citation behavior. Most common role is background (47%).

221 Pith papers citing it
Background 47% of classified citations
abstract

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
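As a rough illustration of the abstract's description, the sketch below computes a gated unit as the component-wise product of two linear projections, with a swappable function applied to one of them. The names `linear`, `gated_unit`, `W`, and `V` are hypothetical and chosen only for this example, and bias terms are omitted as a simplifying assumption; this is not the paper's code.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, the gate function in the original GLU."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """ReLU, one possible nonlinear replacement for sigmoid."""
    return max(0.0, z)


def identity(z):
    """Identity, giving a purely linear (bilinear) gate."""
    return z


def linear(x, weights):
    """Project vector x (length d_in) with a d_in x d_out weight matrix."""
    d_out = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(d_out)]


def gated_unit(x, W, V, activation=sigmoid):
    """Component-wise product of activation(xW) and xV (biases omitted)."""
    gate = [activation(z) for z in linear(x, W)]
    value = linear(x, V)
    return [g * v for g, v in zip(gate, value)]


if __name__ == "__main__":
    x = [2.0, 4.0]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    eye = [[1.0, 0.0], [0.0, 1.0]]
    # With a zero gate projection, sigmoid(0) = 0.5 halves each value.
    print(gated_unit(x, zeros, eye))
    # Swapping the activation yields a different variant of the same unit.
    print(gated_unit(x, eye, eye, activation=relu))
    print(gated_unit(x, eye, eye, activation=identity))
```

Passing a different `activation` is the only change needed to move between variants, which mirrors the abstract's point that the sigmoid can be replaced by other nonlinear or linear functions.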

citation-role summary

background 30 · method 24 · dataset 2 · extension 1

representative citing papers

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

cs.LG · 2026-04-14 · unverdicted · novelty 8.0

CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

$\phi$-Balancing for Mixture-of-Experts Training

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.

Neural Statistical Functions

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.

Locking Pretrained Weights via Deep Low-Rank Residual Distillation

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.

Fast Byte Latent Transformer

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

BLT-D, BLT-S, and BLT-DV use block-wise diffusion training and speculative verification to enable parallel byte generation in byte-level LMs, cutting memory-bandwidth cost by over 50%.

Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting

cs.CV · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

LeGS turns density control in 3D Gaussian Splatting into a learnable RL policy whose reward is derived from a closed-form sensitivity analysis that measures each Gaussian's marginal contribution to reconstruction quality.

citing papers explorer

Showing 8 of 8 citing papers after filters.

  • Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting cs.CV · 2026-05-01 · unverdicted · none · ref 20 · 2 links · internal anchor

    LeGS turns density control in 3D Gaussian Splatting into a learnable RL policy whose reward is derived from a closed-form sensitivity analysis that measures each Gaussian's marginal contribution to reconstruction quality.

  • Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training cs.CV · 2026-04-08 · unverdicted · none · ref 44 · internal anchor

    Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.

  • HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet cs.CV · 2026-04-16 · unverdicted · none · ref 56 · internal anchor

    HAMSA achieves 85.7% ImageNet-1K top-1 accuracy as a spectral-domain SSM with 2.2x faster inference and lower memory than transformers or scanning-based SSMs.

  • Nucleus-Image: Sparse MoE for Image Generation cs.CV · 2026-04-14 · unverdicted · none · ref 19 · internal anchor

    A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.

  • Back to Basics: Let Denoising Generative Models Denoise cs.CV · 2025-11-17 · unverdicted · none · ref 54 · internal anchor

    Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.

  • Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 78 · internal anchor

    Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.

  • Emu3: Next-Token Prediction is All You Need cs.CV · 2024-09-27 · unverdicted · none · ref 76 · internal anchor

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  • Emerging Properties in Unified Multimodal Pretraining cs.CV · 2025-05-20 · unverdicted · none · ref 65 · internal anchor

    BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.