super hub Mixed citations

GLU Variants Improve Transformer

Noam Shazeer · 2020 · cs.LG · arXiv 2002.05202

Mixed citation behavior. Most common role is background (47%).

209 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 209 citing papers more from Noam Shazeer arXiv PDF

abstract

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 30 method 24 dataset 2 extension 1

citation-polarity summary

background 27 use method 23 unclear 4 use dataset 2 extend 1

claims ledger

abstract Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.

authors

Noam Shazeer

co-cited works

representative citing papers

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

cs.LG · 2026-04-14 · unverdicted · novelty 8.0

CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.

Test-Time Training with KV Binding Is Secretly Linear Attention

cs.LG · 2026-02-24 · conditional · novelty 8.0

Test-time training with KV binding reduces to learned linear attention.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

mRNAutilus: Multi-Objective-Guided Discrete Generation of mRNA with Optimized Therapeutic Properties

q-bio.BM · 2026-05-29 · unverdicted · novelty 7.0

mRNAutilus generates full-length therapeutic mRNAs via diffusion models and multi-objective guidance, achieving over 400-fold expression gains for luciferase and outperforming baselines for Spike and other targets in zero-shot tests.

Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces Chess-World-Model benchmark from 10M chess games showing recurrent models (SLiCE, Mamba-3, Gated DeltaNet) outperform Transformers on exact state tracking, with random-play split remaining hard at larger scales.

Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.

StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.

$\phi$-Balancing for Mixture-of-Experts Training

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.

VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.

Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo

cond-mat.str-el · 2026-05-13 · conditional · novelty 7.0

PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.

GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

GeoFlowVLM learns joint distributions of l2-normalized VLM embeddings on the product hypersphere via Riemannian flow matching to expose both aleatoric and epistemic uncertainty through derived entropy and typicality scores.

Neural Statistical Functions

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.

Locking Pretrained Weights via Deep Low-Rank Residual Distillation

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

eess.AS · 2026-05-10 · unverdicted · novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

Fast Byte Latent Transformer

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

BLT-D, BLT-S, and BLT-DV use block-wise diffusion training and speculative verification to enable parallel byte generation in byte-level LMs, cutting memory-bandwidth cost by over 50%.

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

Degradation-Aware Adaptive Context Gating for Unified Image Restoration

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

DACG-IR adds a lightweight degradation-aware module that generates prompts to adaptively gate attention temperature, output features, and spatial-channel fusion in an encoder-decoder network for unified image restoration.

Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting

cs.CV · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

LeGS turns density control in 3D Gaussian Splatting into a learnable RL policy whose reward is derived from a closed-form sensitivity analysis that measures each Gaussian's marginal contribution to reconstruction quality.

Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning

astro-ph.GA · 2026-04-28 · unverdicted · novelty 7.0

A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.

Can an MLP Absorb Its Own Skip Connection?

cs.LG · 2026-04-26 · accept · novelty 7.0

Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.

WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

WildSplatter jointly learns 3D Gaussians and appearance embeddings from unconstrained photo collections to enable fast feed-forward reconstruction and flexible lighting control in 3D Gaussian Splatting.

citing papers explorer

Showing 50 of 209 citing papers.

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection cs.LG · 2024-03-06 · conditional · none · ref 36 · internal anchor
GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models cs.LG · 2024-02-29 · unverdicted · none · ref 27 · internal anchor
Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits cs.CL · 2024-02-27 · unverdicted · none · ref 9 · internal anchor
BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.
OLMo: Accelerating the Science of Language Models cs.CL · 2024-02-01 · accept · none · ref 8 · internal anchor
OLMo delivers a fully open competitive language model with training data, code, and evaluations to enable community-driven scientific research on LMs.
A Generalist Agent cs.AI · 2022-05-12 · accept · none · ref 52 · internal anchor
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
The Power of Scale for Parameter-Efficient Prompt Tuning cs.CL · 2021-04-18 · unverdicted · none · ref 45 · internal anchor
Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.
Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT cs.CL · 2026-06-25 · unverdicted · none · ref 27 · internal anchor
Cascaded multi-granularity pruning reaches 13.8x compression on MHA+GELU LLMs for bearing fault diagnosis at 83.82% accuracy while causing ~74pp collapse on GQA+SwiGLU models that violate the formalized Structural Independence Assumption.
Better Queries, Cheaper Attention: Adapting Transformers for Efficient Sparse Reconstruction hep-ex · 2026-06-16 · unverdicted · none · ref 25 · internal anchor
A geometry-aware dynamic-query transformer decoder with Local Strided Cross-Attention raises track reconstruction efficiency from 94.1% to 98.1%, halves latency, and cuts memory use by over 10x versus fixed-query baselines in a simplified HL-LHC simulation.
Contribution Weights: A Geometrical Analysis of Self-Attention Transformers cs.LG · 2026-05-29 · unverdicted · none · ref 49 · internal anchor
Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.
Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders cs.CL · 2026-05-28 · unverdicted · none · ref 26 · internal anchor
Explicitly disentangling semantic and positional streams in a Transformer encoder reveals that absolute positional representations collapse to a 2D document-structure manifold, attention heads specialize by role, and the approach improves linguistic probing performance on 49 of 65 phenomena.
STAP: A Shuffle-Tokenized App Predictor with Ultra Long Context for Vocabulary-Free Mobile App Prediction cs.LG · 2026-05-28 · unverdicted · none · ref 21 · internal anchor
A Transformer model with app-identity shuffling and ultra-long context achieves vocabulary-free next-app prediction with cross-dataset zero-shot capability and competitive cold-start performance.
Leveraging Routing Dynamics in Mixture-of-Experts Models for Efficient Language Adaptation cs.CL · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
Continual multilingual pre-training of an English-centric MoE model produces language-agnostic routing in early layers and specialization in final layers; updating only final-layer experts yields competitive multilingual performance while changing less than 2% of parameters.
Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models cs.LG · 2026-05-22 · unverdicted · none · ref 27 · internal anchor
Complete-muE combines active-width μP and activated-expert scaling to transfer hyperparameters across dense FFN, dense MoE, and sparse MoE while covering changes in experts, capacity, width, depth, batch size, and duration.
Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation cs.LG · 2026-05-22 · unverdicted · none · ref 30 · internal anchor
RankElastor mitigates embedding collapse via spectrum-robust token mixing and GLU-based P-FFNs, yielding better performance and scaling on industrial recommendation datasets.
RiT: Vanilla Diffusion Transformers Suffice in Representation Space cs.CV · 2026-05-21 · conditional · none · ref 26 · internal anchor
A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.
Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models cs.CV · 2026-05-20 · unverdicted · none · ref 5 · internal anchor
Polynomial replacements for activations in MLPs, convolutions, and attention within MetaFormer yield PolyNeXt models that match or exceed standard performance on ImageNet, ADE20K, and robustness benchmarks while beating prior polynomial networks.
Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor cs.LG · 2026-05-20 · conditional · none · ref 60 · internal anchor
Empirical update to prior work shows most of 20 recent Transformer modifications do not transfer at 1-3B scales when measured with downstream CLIMB-12 tasks, multi-seed noise floor, and cross-scale stability.
HRM-Text: Efficient Pretraining Beyond Scaling cs.CL · 2026-05-20 · unverdicted · none · ref 27 · internal anchor
A 1B-parameter hierarchical recurrent model pretrained on 40B instruction-response tokens achieves 60.7% MMLU and strong results on ARC-C, DROP, GSM8K, and MATH while using 100-900x fewer tokens than standard baselines.
Generative Recursive Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 47 · 2 links · internal anchor
GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design cs.AI · 2026-05-15 · unverdicted · none · ref 120 · internal anchor
Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility cs.LG · 2026-05-13 · unverdicted · none · ref 3 · internal anchor
SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer cs.CV · 2026-05-11 · unverdicted · none · ref 44 · internal anchor
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B to over 200B parameters.
On the global convergence of gradient descent for wide shallow models with bounded nonlinearities math.OC · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization cs.LG · 2026-05-11 · unverdicted · none · ref 49 · 2 links · internal anchor
BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.
Sparse Layers are Critical to Scaling Looped Language Models cs.LG · 2026-05-09 · unverdicted · none · ref 12 · internal anchor
Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 71 · internal anchor
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
Revisiting Transformer Layer Parameterization Through Causal Energy Minimization cs.LG · 2026-05-08 · unverdicted · none · ref 15 · internal anchor
CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators cs.LG · 2026-05-07 · unverdicted · none · ref 31 · internal anchor
Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.
Tyche: One Step Flow for Efficient Probabilistic Weather Forecasting cs.LG · 2026-05-07 · unverdicted · none · ref 12 · internal anchor
Tyche achieves competitive probabilistic weather forecasting skill and calibration using a single-step flow model with JVP-regularized training and rollout finetuning.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts cs.LG · 2026-05-07 · unverdicted · none · ref 42 · internal anchor
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less cs.LG · 2026-05-07 · unverdicted · none · ref 24 · internal anchor
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity cs.LG · 2026-05-07 · unverdicted · none · ref 21 · internal anchor
Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training convergence.
Cumulative-Goodness Free-Riding in Forward-Forward Networks: Real, Repairable, but Not Accuracy-Dominant cs.LG · 2026-05-07 · unverdicted · none · ref 28 · internal anchor
Cumulative-goodness Forward-Forward networks exhibit layer free-riding where discrimination gradients decay exponentially with prior positive margins; per-block, hardness-gated, and depth-scaled remedies yield 4-45x better layer separation but <1% accuracy change on CIFAR and Tiny ImageNet.
Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts cs.LG · 2026-05-06 · unverdicted · none · ref 40 · internal anchor
AIR-MoE introduces a two-stage inverted-index routing method based on vector quantization that approximates optimal expert selection for granular MoE models at lower cost and with empirical performance gains.
CHE-TKG: Collaborative Historical Evidence and Evolutionary Dynamics Learning for Temporal Knowledge Graph Reasoning cs.CL · 2026-05-06 · unverdicted · none · ref 40 · internal anchor
CHE-TKG is a collaborative dual-view model that jointly captures historical evidence and evolutionary dynamics in temporal knowledge graphs via separate encoders and contrastive alignment to achieve state-of-the-art reasoning.
Velox: Learning Representations of 4D Geometry and Appearance cs.CV · 2026-05-06 · unverdicted · none · ref 76 · internal anchor
Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth simulation.
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer cs.CV · 2026-05-01 · unverdicted · none · ref 31 · internal anchor
An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs cs.CL · 2026-04-30 · unverdicted · none · ref 21 · internal anchor
Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while preserving safety.
PointTransformerX: Portable and Efficient 3D Point Cloud Processing without Sparse Algorithms cs.CV · 2026-04-27 · unverdicted · none · ref 39 · internal anchor
PointTransformerX is a fully PyTorch-native 3D point cloud transformer backbone that reaches 98.7% of PointTransformer V3 accuracy on ScanNet using 79.2% fewer parameters, 1.6x faster inference, and only 253 MB memory while running natively on NVIDIA, AMD, and CPU hardware.
Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions cs.LG · 2026-04-23 · unverdicted · none · ref 7 · internal anchor
GEM is a new family of C^{2N}-smooth rational activation functions with variants that achieve performance on par with or exceeding GELU on ResNet, GPT-2, and BERT benchmarks.
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation cs.CV · 2026-04-23 · unverdicted · none · ref 32 · internal anchor
Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens q-bio.NC · 2026-04-20 · unverdicted · none · ref 107 · internal anchor
OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.
RankUp: Towards High-rank Representations for Large Scale Advertising Recommender Systems cs.IR · 2026-04-20 · unverdicted · none · ref 26 · 2 links · internal anchor
RankUp raises effective rank of representations in deep MetaFormer recommenders via randomized splitting and multi-embeddings, delivering 2-5% GMV gains in production deployments at Weixin.
HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet cs.CV · 2026-04-16 · unverdicted · none · ref 56 · internal anchor
HAMSA achieves 85.7% ImageNet-1K top-1 accuracy as a spectral-domain SSM with 2.2x faster inference and lower memory than transformers or scanning-based SSMs.
C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions cs.LG · 2026-04-15 · unverdicted · none · ref 13 · internal anchor
C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outperforming HRM on Sudoku-extreme and Maze when paired with the new ItrSA++ model.
Nucleus-Image: Sparse MoE for Image Generation cs.CV · 2026-04-14 · unverdicted · none · ref 19 · internal anchor
A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
Learning Long-term Motion Embeddings for Efficient Kinematics Generation cs.CV · 2026-04-13 · unverdicted · none · ref 33 · internal anchor
A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
Fast Spatial Memory with Elastic Test-Time Training cs.CV · 2026-04-08 · unverdicted · none · ref 43 · internal anchor
Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.
Leveraging Artist Catalogs for Cold-Start Music Recommendation cs.IR · 2026-04-08 · unverdicted · none · ref 58 · internal anchor
ACARec attends over artist catalogs to generate CF embeddings for new tracks, more than doubling recall and NDCG versus content-only baselines in music recommendation.
Orthogonal Quadratic Complements for Vision Transformer Feed-Forward Networks cs.CV · 2026-04-08 · unverdicted · none · ref 5 · internal anchor
Orthogonal Quadratic Complements project low-rank quadratic auxiliary branches onto the orthogonal complement of the main hidden state in vision transformer FFNs, improving accuracy by 1.3 points on CIFAR-100 and 1.4 points on TinyImageNet over baselines.

GLU Variants Improve Transformer

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer