pith. sign in

super hub Mixed citations

GLU Variants Improve Transformer

Mixed citation behavior. Most common role is background (47%).

220 Pith papers citing it
Background 47% of classified citations
abstract

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.

hub tools

citation-role summary

background 30 method 24 dataset 2 extension 1

citation-polarity summary

claims ledger

  • abstract Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.

authors

co-cited works

clear filters

representative citing papers

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

cs.LG · 2026-04-14 · unverdicted · novelty 8.0

CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

$\phi$-Balancing for Mixture-of-Experts Training

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.

Neural Statistical Functions

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.

Locking Pretrained Weights via Deep Low-Rank Residual Distillation

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.

Fast Byte Latent Transformer

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

BLT-D, BLT-S, and BLT-DV use block-wise diffusion training and speculative verification to enable parallel byte generation in byte-level LMs, cutting memory-bandwidth cost by over 50%.

Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting

cs.CV · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

LeGS turns density control in 3D Gaussian Splatting into a learnable RL policy whose reward is derived from a closed-form sensitivity analysis that measures each Gaussian's marginal contribution to reconstruction quality.

citing papers explorer

Showing 11 of 11 citing papers after filters.

  • CID-TKG: Collaborative Historical Invariance and Evolutionary Dynamics Learning for Temporal Knowledge Graph Reasoning cs.AI · 2026-03-09 · unverdicted · none · ref 43 · internal anchor

    CID-TKG combines historical invariance and evolutionary dynamics graphs with contrastive alignment of view-specific relation representations to reach state-of-the-art performance on temporal knowledge graph extrapolation.

  • Training Agents Inside of Scalable World Models cs.AI · 2025-09-29 · conditional · none · ref 34 · internal anchor

    Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

  • A Generalist Agent cs.AI · 2022-05-12 · accept · none · ref 52 · internal anchor

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  • Generative Recursive Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 47 · 2 links · internal anchor

    GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.

  • Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design cs.AI · 2026-05-15 · unverdicted · none · ref 120 · internal anchor

    Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.

  • Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 250 · internal anchor

    Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

  • Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model cs.AI · 2024-08-20 · unverdicted · none · ref 21 · internal anchor

    A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.

  • LACE: Lattice Attention for Cross-thread Exploration cs.AI · 2026-04-16 · unverdicted · none · ref 30 · 3 links · internal anchor

    LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.

  • Hierarchical Reasoning Model cs.AI · 2025-06-26 · unverdicted · none · ref 57 · internal anchor

    HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples without pre-training or CoT supervision.

  • SAGE-32B: Agentic Reasoning via Iterative Distillation cs.AI · 2026-01-04 · unverdicted · none · ref 8 · internal anchor

    SAGE-32B improves multi-tool agentic success rates over same-size baselines by combining iterative distillation with an inverse-reasoning meta-cognition head.

  • DeepSeek-VL: Towards Real-World Vision-Language Understanding cs.AI · 2024-03-08 · unverdicted · none · ref 24 · internal anchor

    DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder, and pretraining that preserves language capabilities.