GLU Variants Improve Transformer

Noam Shazeer · 2020 · cs.LG · arXiv 2002.05202

Mixed citation behavior. Most common role is background (47%).

269 Pith papers citing it

Background 47% of classified citations

abstract

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
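
As a rough illustration of the construction the abstract describes, the sketch below computes the component-wise product of two linear projections, with one projection passed through a swappable function (sigmoid by default). It is a minimal NumPy sketch, not the paper's implementation: the names `glu_variant`, `gated_ffn`, `W`, `V`, `W2` and the toy sizes are assumptions made for this example, and bias terms are omitted for brevity.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def glu_variant(x, W, V, activation=sigmoid):
    """Component-wise product of two linear projections of x.

    The first projection (x @ W) is passed through `activation`;
    the second (x @ V) is left linear. Biases are omitted (assumption).
    """
    return activation(x @ W) * (x @ V)


def gated_ffn(x, W, V, W2, activation=sigmoid):
    """Hypothetical feed-forward sublayer: gated hidden layer, then an
    output projection W2 back to the model dimension."""
    return glu_variant(x, W, V, activation) @ W2


# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_ff = 4, 8
x = rng.normal(size=(3, d_model))      # 3 positions, d_model features each
W = rng.normal(size=(d_model, d_ff))   # gate projection
V = rng.normal(size=(d_model, d_ff))   # value projection
W2 = rng.normal(size=(d_ff, d_model))  # output projection

glu_out = glu_variant(x, W, V)                                        # sigmoid gate
relu_out = glu_variant(x, W, V, activation=lambda z: np.maximum(z, 0.0))  # ReLU gate
linear_out = glu_variant(x, W, V, activation=lambda z: z)             # no nonlinearity
ffn_out = gated_ffn(x, W, V, W2)
```

Passing a different `activation` is all it takes to move between variants in this sketch, which mirrors the abstract's point that the sigmoid can be replaced by other nonlinear (or even linear) functions.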

citation-role summary

background: 30 · method: 24 · dataset: 2 · extension: 1

citation-polarity summary

background: 27 · use method: 23 · unclear: 4 · use dataset: 2 · extend: 1

authors

Noam Shazeer

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Tight Sample Complexity of Transformers

cs.LG · 2026-06-08 · unverdicted · novelty 8.0

Depth-L transformers with W parameters have VC dimension Theta(L W log(T W)), yielding matching O(L W log((T+T')W)) upper and Omega(L W log((T+T')W/L)) lower bounds on sample complexity for chain-of-thought learning.

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

cs.LG · 2026-04-14 · unverdicted · novelty 8.0

CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.

Test-Time Training with KV Binding Is Secretly Linear Attention

cs.LG · 2026-02-24 · conditional · novelty 8.0

Test-time training with KV binding reduces to learned linear attention.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

Joint inference of weak lensing convergence map and cosmology with diffusion models

astro-ph.CO · 2026-06-30 · unverdicted · novelty 7.0

A transformer-based diffusion model learns the joint distribution of convergence maps and cosmology from log-normal weak lensing simulations and generates calibrated posterior samples matching MCMC results.

DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection

eess.AS · 2026-06-28 · unverdicted · novelty 7.0

DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.

Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA among pixel-space AR models.

MADField: Multi-fidelity Amortized Density Field for Adsorption in Nanoporous Materials

physics.comp-ph · 2026-06-19 · unverdicted · novelty 7.0

MADField is a multi-fidelity amortized model for predicting density fields to improve accuracy and speed of adsorption calculations in nanoporous materials for high-throughput screening.

AttentionCap: Transformer Based Capacitance Matrix Learning Toward Full-Chip Extraction

cs.LG · 2026-06-06 · unverdicted · novelty 7.0

AttentionCap, a customized Transformer, predicts capacitance matrices across multiple process nodes with 0.67% self-capacitance and 3.99% coupling error on unseen designs, outperforming CNN baselines in accuracy and speed.

Stateful Visual Encoders for Vision-Language Models

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Stateful visual encoders condition each visual representation on prior features, yielding consistent gains on multi-image tasks under supervised finetuning across model sizes and domains.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

SubFit enables better LLM compression by fitting residual bypasses to non-contiguously selected submodules, outperforming layer-granularity baselines in accuracy-perplexity trade-offs at 12.5-37.5% sparsity.

mRNAutilus: Multi-Objective-Guided Discrete Generation of mRNA with Optimized Therapeutic Properties

q-bio.BM · 2026-05-29 · unverdicted · novelty 7.0

mRNAutilus generates full-length therapeutic mRNAs via diffusion models and multi-objective guidance, achieving over 400-fold expression gains for luciferase and outperforming baselines for Spike and other targets in zero-shot tests.

Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces Chess-World-Model benchmark from 10M chess games showing recurrent models (SLiCE, Mamba-3, Gated DeltaNet) outperform Transformers on exact state tracking, with random-play split remaining hard at larger scales.

An In-Vitro Study on Cross-Lingual Generalization in Language Models

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

An in-vitro study with synthetic languages finds cross-lingual transfer depends more on tokenization preserving reusable substructure than on lexical similarity or balance, with transfer emerging in stages.

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

Bilingual fine-tuning on a new parallel Filipino-English dementia dataset yields Macro-F1 scores of 0.969-0.973 and eliminates cross-lingual degradation for all tested transformers.

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.

Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.

StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.

$\phi$-Balancing for Mixture-of-Experts Training

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.

VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.

Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo

cond-mat.str-el · 2026-05-13 · conditional · novelty 7.0

PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.

citing papers explorer

Showing 50 of 195 citing papers after filters.

DataComp-VLM: Improved Open Datasets for Vision-Language Models cs.CV · 2026-06-26 · conditional · none · ref 258 · 2 links · internal anchor
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
Tight Sample Complexity of Transformers cs.LG · 2026-06-08 · unverdicted · none · ref 63 · internal anchor
Depth-L transformers with W parameters have VC dimension Theta(L W log(T W)), yielding matching O(L W log((T+T')W)) upper and Omega(L W log((T+T')W/L)) lower bounds on sample complexity for chain-of-thought learning.
CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations cs.LG · 2026-04-14 · unverdicted · none · ref 22 · internal anchor
CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.
Test-Time Training with KV Binding Is Secretly Linear Attention cs.LG · 2026-02-24 · conditional · none · ref 15 · internal anchor
Test-time training with KV binding reduces to learned linear attention.
Joint inference of weak lensing convergence map and cosmology with diffusion models astro-ph.CO · 2026-06-30 · unverdicted · none · ref 63 · internal anchor
A transformer-based diffusion model learns the joint distribution of convergence maps and cosmology from log-normal weak lensing simulations and generates calibrated posterior samples matching MCMC results.
DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection eess.AS · 2026-06-28 · unverdicted · none · ref 55 · internal anchor
DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.
Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation cs.CV · 2026-06-26 · unverdicted · none · ref 10 · internal anchor
PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA among pixel-space AR models.
MADField: Multi-fidelity Amortized Density Field for Adsorption in Nanoporous Materials physics.comp-ph · 2026-06-19 · unverdicted · none · ref 43 · internal anchor
MADField is a multi-fidelity amortized model for predicting density fields to improve accuracy and speed of adsorption calculations in nanoporous materials for high-throughput screening.
AttentionCap: Transformer Based Capacitance Matrix Learning Toward Full-Chip Extraction cs.LG · 2026-06-06 · unverdicted · none · ref 23 · internal anchor
AttentionCap, a customized Transformer, predicts capacitance matrices across multiple process nodes with 0.67% self-capacitance and 3.99% coupling error on unseen designs, outperforming CNN baselines in accuracy and speed.
Stateful Visual Encoders for Vision-Language Models cs.CV · 2026-06-03 · unverdicted · none · ref 63 · internal anchor
Stateful visual encoders condition each visual representation on prior features, yielding consistent gains on multi-image tasks under supervised finetuning across model sizes and domains.
Diffusing in the Right Space: A Systematic Study of Latent Diffusability cs.CV · 2026-06-02 · unverdicted · none · ref 56 · internal anchor
A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.
From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression cs.CL · 2026-06-01 · unverdicted · none · ref 112 · internal anchor
SubFit enables better LLM compression by fitting residual bypasses to non-contiguously selected submodules, outperforming layer-granularity baselines in accuracy-perplexity trade-offs at 12.5-37.5% sparsity.
mRNAutilus: Multi-Objective-Guided Discrete Generation of mRNA with Optimized Therapeutic Properties q-bio.BM · 2026-05-29 · unverdicted · none · ref 9 · internal anchor
mRNAutilus generates full-length therapeutic mRNAs via diffusion models and multi-objective guidance, achieving over 400-fold expression gains for luciferase and outperforming baselines for Spike and other targets in zero-shot tests.
Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences cs.LG · 2026-05-28 · unverdicted · none · ref 3 · internal anchor
Introduces Chess-World-Model benchmark from 10M chess games showing recurrent models (SLiCE, Mamba-3, Gated DeltaNet) outperform Transformers on exact state tracking, with random-play split remaining hard at larger scales.
An In-Vitro Study on Cross-Lingual Generalization in Language Models cs.CL · 2026-05-26 · unverdicted · none · ref 3 · internal anchor
An in-vitro study with synthetic languages finds cross-lingual transfer depends more on tokenization preserving reusable substructure than on lexical similarity or balance, with transfer emerging in stages.
Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech cs.CL · 2026-05-25 · unverdicted · none · ref 7 · internal anchor
Bilingual fine-tuning on a new parallel Filipino-English dementia dataset yields Macro-F1 scores of 0.969-0.973 and eliminates cross-lingual degradation for all tested transformers.
MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning cs.AI · 2026-05-25 · unverdicted · none · ref 41 · internal anchor
MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.
Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration cs.CV · 2026-05-20 · unverdicted · none · ref 67 · internal anchor
DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.
StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video cs.CV · 2026-05-18 · unverdicted · none · ref 40 · internal anchor
StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.
$\phi$-Balancing for Mixture-of-Experts Training cs.LG · 2026-05-14 · unverdicted · none · ref 33 · internal anchor
φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.
VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting cs.CV · 2026-05-14 · unverdicted · none · ref 72 · internal anchor
VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.
Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo cond-mat.str-el · 2026-05-13 · conditional · none · ref 41 · internal anchor
PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding cs.LG · 2026-05-13 · unverdicted · none · ref 42 · internal anchor
GeoFlowVLM learns joint distributions of l2-normalized VLM embeddings on the product hypersphere via Riemannian flow matching to expose both aleatoric and epistemic uncertainty through derived entropy and typicality scores.
Neural Statistical Functions cs.LG · 2026-05-11 · unverdicted · none · ref 30 · internal anchor
Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.
Locking Pretrained Weights via Deep Low-Rank Residual Distillation cs.LG · 2026-05-11 · unverdicted · none · ref 40 · internal anchor
DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining cs.CL · 2026-05-11 · unverdicted · none · ref 13 · internal anchor
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models cs.LG · 2026-05-11 · unverdicted · none · ref 47 · internal anchor
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech eess.AS · 2026-05-10 · unverdicted · none · ref 23 · internal anchor
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
Fast Byte Latent Transformer cs.CL · 2026-05-08 · unverdicted · none · ref 27 · internal anchor
BLT-D, BLT-S, and BLT-DV use block-wise diffusion training and speculative verification to enable parallel byte generation in byte-level LMs, cutting memory-bandwidth cost by over 50%.
Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity stat.ML · 2026-05-08 · unverdicted · none · ref 73 · internal anchor
Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.
Degradation-Aware Adaptive Context Gating for Unified Image Restoration cs.CV · 2026-05-02 · unverdicted · none · ref 47 · internal anchor
DACG-IR adds a lightweight degradation-aware module that generates prompts to adaptively gate attention temperature, output features, and spatial-channel fusion in an encoder-decoder network for unified image restoration.
Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting cs.CV · 2026-05-01 · unverdicted · none · ref 20 · 2 links · internal anchor
LeGS turns density control in 3D Gaussian Splatting into a learnable RL policy whose reward is derived from a closed-form sensitivity analysis that measures each Gaussian's marginal contribution to reconstruction quality.
Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning astro-ph.GA · 2026-04-28 · unverdicted · none · ref 61 · internal anchor
A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
Can an MLP Absorb Its Own Skip Connection? cs.LG · 2026-04-26 · accept · none · ref 9 · internal anchor
Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.
WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images cs.CV · 2026-04-23 · unverdicted · none · ref 24 · internal anchor
WildSplatter jointly learns 3D Gaussians and appearance embeddings from unconstrained photo collections to enable fast feed-forward reconstruction and flexible lighting control in 3D Gaussian Splatting.
Grokking of Diffusion Models: Case Study on Modular Addition cs.LG · 2026-04-20 · unverdicted · none · ref 23 · internal anchor
Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
Scalable Model-Based Clustering with Sequential Monte Carlo stat.ML · 2026-04-16 · unverdicted · none · ref 3 · internal anchor
A memory-efficient SMC clustering method decomposes problems into approximately independent subproblems to handle large-scale online clustering with complex distributions.
Mamba Sequence Modeling meets Model Predictive Control math.OC · 2026-04-15 · unverdicted · none · ref 18 · internal anchor
Mamba-MPC stabilizes and tracks references on SISO and MIMO systems in simulation and hardware while outperforming LSTM-MPC with faster computation.
MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts cs.CL · 2026-04-13 · unverdicted · none · ref 24 · internal anchor
MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B parameters.
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation cs.LG · 2026-04-11 · unverdicted · none · ref 178 · internal anchor
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
Envisioning the Future, One Step at a Time cs.CV · 2026-04-10 · unverdicted · none · ref 93 · internal anchor
An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings q-bio.QM · 2026-04-09 · unverdicted · none · ref 57 · internal anchor
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.
A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators cs.AR · 2026-04-09 · conditional · none · ref 62 · internal anchor
ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training cs.CV · 2026-04-08 · unverdicted · none · ref 44 · internal anchor
Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
Screening Is Enough cs.LG · 2026-04-01 · unverdicted · none · ref 31 · internal anchor
Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.
CID-TKG: Collaborative Historical Invariance and Evolutionary Dynamics Learning for Temporal Knowledge Graph Reasoning cs.AI · 2026-03-09 · unverdicted · none · ref 43 · internal anchor
CID-TKG combines historical invariance and evolutionary dynamics graphs with contrastive alignment of view-specific relation representations to reach state-of-the-art performance on temporal knowledge graph extrapolation.
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training cs.CV · 2026-03-04 · unverdicted · none · ref 56 · internal anchor
ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation cs.CV · 2026-03-03 · unverdicted · none · ref 32 · internal anchor
GDLA delivers state-of-the-art accuracy on CT, MRI, ultrasound and dermoscopy segmentation benchmarks while keeping linear O(N) complexity in a PVT encoder-decoder.
NEST: Nested Event Stream Transformer for Sequences of Multisets cs.LG · 2026-01-31 · unverdicted · none · ref 31 · internal anchor
NEST is a nested transformer for sequences of multisets that uses masked set modeling to learn improved set-level representations from hierarchical event streams like EHRs.
Deep Delta Learning cs.LG · 2026-01-01 · unverdicted · none · ref 11 · internal anchor
Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accumulation.
