super hub Mixed citations

GLU Variants Improve Transformer

Noam Shazeer · 2020 · cs.LG · arXiv 2002.05202

Mixed citation behavior. Most common role is background (47%).

235 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 235 citing papers more from Noam Shazeer arXiv PDF

abstract

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 30 method 24 dataset 2 extension 1

citation-polarity summary

background 27 use method 23 unclear 4 use dataset 2 extend 1

claims ledger

abstract Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.

authors

Noam Shazeer

co-cited works

representative citing papers

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

cs.LG · 2026-04-14 · unverdicted · novelty 8.0

CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.

Test-Time Training with KV Binding Is Secretly Linear Attention

cs.LG · 2026-02-24 · conditional · novelty 8.0

Test-time training with KV binding reduces to learned linear attention.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection

eess.AS · 2026-06-28 · unverdicted · novelty 7.0

DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.

Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA among pixel-space AR models.

mRNAutilus: Multi-Objective-Guided Discrete Generation of mRNA with Optimized Therapeutic Properties

q-bio.BM · 2026-05-29 · unverdicted · novelty 7.0

mRNAutilus generates full-length therapeutic mRNAs via diffusion models and multi-objective guidance, achieving over 400-fold expression gains for luciferase and outperforming baselines for Spike and other targets in zero-shot tests.

Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces Chess-World-Model benchmark from 10M chess games showing recurrent models (SLiCE, Mamba-3, Gated DeltaNet) outperform Transformers on exact state tracking, with random-play split remaining hard at larger scales.

An In-Vitro Study on Cross-Lingual Generalization in Language Models

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

An in-vitro study with synthetic languages finds cross-lingual transfer depends more on tokenization preserving reusable substructure than on lexical similarity or balance, with transfer emerging in stages.

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

Bilingual fine-tuning on a new parallel Filipino-English dementia dataset yields Macro-F1 scores of 0.969-0.973 and eliminates cross-lingual degradation for all tested transformers.

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.

Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.

StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.

$\phi$-Balancing for Mixture-of-Experts Training

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.

VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.

Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo

cond-mat.str-el · 2026-05-13 · conditional · novelty 7.0

PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.

GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

GeoFlowVLM learns joint distributions of l2-normalized VLM embeddings on the product hypersphere via Riemannian flow matching to expose both aleatoric and epistemic uncertainty through derived entropy and typicality scores.

Neural Statistical Functions

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.

Locking Pretrained Weights via Deep Low-Rank Residual Distillation

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

eess.AS · 2026-05-10 · unverdicted · novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

Fast Byte Latent Transformer

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

BLT-D, BLT-S, and BLT-DV use block-wise diffusion training and speculative verification to enable parallel byte generation in byte-level LMs, cutting memory-bandwidth cost by over 50%.

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

citing papers explorer

Showing 50 of 235 citing papers.

CHE-TKG: Collaborative Historical Evidence and Evolutionary Dynamics Learning for Temporal Knowledge Graph Reasoning cs.CL · 2026-05-06 · unverdicted · none · ref 40 · internal anchor
CHE-TKG is a collaborative dual-view model that jointly captures historical evidence and evolutionary dynamics in temporal knowledge graphs via separate encoders and contrastive alignment to achieve state-of-the-art reasoning.
Velox: Learning Representations of 4D Geometry and Appearance cs.CV · 2026-05-06 · unverdicted · none · ref 76 · internal anchor
Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth simulation.
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer cs.CV · 2026-05-01 · unverdicted · none · ref 31 · internal anchor
An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs cs.CL · 2026-04-30 · unverdicted · none · ref 21 · internal anchor
Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while preserving safety.
PointTransformerX: Portable and Efficient 3D Point Cloud Processing without Sparse Algorithms cs.CV · 2026-04-27 · unverdicted · none · ref 39 · internal anchor
PointTransformerX is a fully PyTorch-native 3D point cloud transformer backbone that reaches 98.7% of PointTransformer V3 accuracy on ScanNet using 79.2% fewer parameters, 1.6x faster inference, and only 253 MB memory while running natively on NVIDIA, AMD, and CPU hardware.
Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions cs.LG · 2026-04-23 · unverdicted · none · ref 7 · internal anchor
GEM is a new family of C^{2N}-smooth rational activation functions with variants that achieve performance on par with or exceeding GELU on ResNet, GPT-2, and BERT benchmarks.
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation cs.CV · 2026-04-23 · unverdicted · none · ref 32 · internal anchor
Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens q-bio.NC · 2026-04-20 · unverdicted · none · ref 107 · internal anchor
OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.
RankUp: Towards High-rank Representations for Large Scale Advertising Recommender Systems cs.IR · 2026-04-20 · unverdicted · none · ref 26 · 2 links · internal anchor
RankUp raises effective rank of representations in deep MetaFormer recommenders via randomized splitting and multi-embeddings, delivering 2-5% GMV gains in production deployments at Weixin.
HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet cs.CV · 2026-04-16 · unverdicted · none · ref 56 · internal anchor
HAMSA achieves 85.7% ImageNet-1K top-1 accuracy as a spectral-domain SSM with 2.2x faster inference and lower memory than transformers or scanning-based SSMs.
C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions cs.LG · 2026-04-15 · unverdicted · none · ref 13 · internal anchor
C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outperforming HRM on Sudoku-extreme and Maze when paired with the new ItrSA++ model.
Nucleus-Image: Sparse MoE for Image Generation cs.CV · 2026-04-14 · unverdicted · none · ref 19 · internal anchor
A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
Learning Long-term Motion Embeddings for Efficient Kinematics Generation cs.CV · 2026-04-13 · unverdicted · none · ref 33 · internal anchor
A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
Fast Spatial Memory with Elastic Test-Time Training cs.CV · 2026-04-08 · unverdicted · none · ref 43 · internal anchor
Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.
Leveraging Artist Catalogs for Cold-Start Music Recommendation cs.IR · 2026-04-08 · unverdicted · none · ref 58 · internal anchor
ACARec attends over artist catalogs to generate CF embeddings for new tracks, more than doubling recall and NDCG versus content-only baselines in music recommendation.
Orthogonal Quadratic Complements for Vision Transformer Feed-Forward Networks cs.CV · 2026-04-08 · unverdicted · none · ref 5 · internal anchor
Orthogonal Quadratic Complements project low-rank quadratic auxiliary branches onto the orthogonal complement of the main hidden state in vision transformer FFNs, improving accuracy by 1.3 points on CIFAR-100 and 1.4 points on TinyImageNet over baselines.
ARCS: Autoregressive Circuit Synthesis with Topology-Aware Graph Attention and Spec Conditioning cs.LG · 2026-03-30 · unverdicted · none · ref 8 · internal anchor
ARCS generates valid SPICE-simulatable analog circuits in milliseconds via graph VAE, flow-matching, and GRPO reinforcement learning, reaching 99.9% validity with 8 evaluations across 32 topologies.
Rethinking Language Model Scaling under Transferable Hypersphere Optimization cs.LG · 2026-03-30 · conditional · none · ref 18 · internal anchor
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
FluidFlow: a flow-matching generative model for fluid dynamics surrogates on unstructured meshes cs.LG · 2026-03-30 · unverdicted · none · ref 37 · internal anchor
FluidFlow uses conditional flow-matching with U-Net and DiT architectures to predict pressure and friction coefficients on airfoils and 3D aircraft meshes, outperforming MLP baselines with better generalization.
Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers cs.LG · 2026-03-18 · unverdicted · none · ref 34 · internal anchor
Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.
M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling cs.LG · 2026-03-15 · unverdicted · none · ref 32 · internal anchor
M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.
SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation cs.RO · 2026-03-05 · conditional · none · ref 29 · internal anchor
SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.
SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm cs.LG · 2026-02-08 · unverdicted · none · ref 19 · internal anchor
SiameseNorm is a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers by coupling streams via shared residual blocks, yielding performance gains with maintained stability on language, vision, and diffusion models.
PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction cs.PF · 2026-01-21 · unverdicted · none · ref 61 · internal anchor
PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.
ViT$^3$: Unlocking Test-Time Training in Vision cs.CV · 2025-12-01 · unverdicted · none · ref 50 · internal anchor
ViT³ is a Test-Time Training vision model that achieves linear complexity, matches or exceeds other linear models like Mamba on classification, generation, detection and segmentation, and narrows the gap to standard vision Transformers.
Back to Basics: Let Denoising Generative Models Denoise cs.CV · 2025-11-17 · unverdicted · none · ref 54 · internal anchor
Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.
Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 78 · internal anchor
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
Vision-LLMs for Spatiotemporal Traffic Forecasting cs.LG · 2025-10-13 · conditional · none · ref 32 · internal anchor
ST-Vision-LLM reframes spatiotemporal traffic forecasting as vision-language fusion, using visual encoders on traffic grids and efficient numerical tokenization to achieve 15.6% better long-term accuracy and 30% gains in few-shot cross-domain settings.
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill cs.LG · 2025-10-09 · unverdicted · none · ref 13 · internal anchor
Layered prefill replaces token-chunked prefill with layer-group interleaving in MoE models, cutting TTFT by up to 70%, end-to-end latency by 41%, and per-token energy by 22% while preserving stall-free TBT.
Scaling Sequence-to-Sequence Generative Neural Rendering cs.CV · 2025-10-05 · unverdicted · none · ref 13 · internal anchor
Kaleido is a masked autoregressive generative model that unifies 3D view synthesis and video modeling by pre-training a single transformer on video data, achieving SOTA zero-shot and many-view performance on view synthesis benchmarks.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution cs.CL · 2025-09-17 · unverdicted · none · ref 251 · internal anchor
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining cs.CL · 2025-09-05 · unverdicted · none · ref 9 · internal anchor
Sparse crosscoders on LLM checkpoint triplets track emergence, maintenance, and discontinuation of linguistic features during pretraining via a new RelIE metric.
mGRADE: Minimal Recurrent Gating Meets Delay Convolutions for Lightweight Sequence Modeling cs.LG · 2025-07-02 · unverdicted · none · ref 35 · internal anchor
mGRADE uses learnable-spaced convolutions shown to be equivalent to delay embeddings plus a lightweight gated recurrent component to achieve low-memory multi-timescale sequence modeling.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 250 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource cs.CL · 2025-06-13 · conditional · none · ref 32 · internal anchor
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning cs.LG · 2025-05-22 · conditional · none · ref 115 · internal anchor
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
A3 : an Analytical Low-Rank Approximation Framework for Attention cs.CL · 2025-05-19 · conditional · none · ref 12 · internal anchor
A3 splits Transformer layers into QK, OV, and MLP components and derives analytical low-rank approximations that reduce hidden dimensions while minimizing each component's functional loss, yielding better perplexity than prior low-rank methods on LLaMA models.
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free cs.CL · 2025-05-10 · conditional · none · ref 23 · internal anchor
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
When Attention Sink Emerges in Language Models: An Empirical View cs.CL · 2024-10-14 · accept · none · ref 43 · internal anchor
Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.
Pixtral 12B cs.CV · 2024-10-09 · unverdicted · none · ref 19 · internal anchor
Pixtral-12B is a 12B multimodal LLM with a custom vision encoder that ingests images at native resolution and aspect ratio, achieving leading benchmark results among open models while preserving text capabilities.
Emu3: Next-Token Prediction is All You Need cs.CV · 2024-09-27 · unverdicted · none · ref 76 · internal anchor
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model cs.AI · 2024-08-20 · unverdicted · none · ref 21 · internal anchor
A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
FlashNorm: Fast Normalization for Transformers cs.LG · 2024-07-12 · accept · none · ref 14 · internal anchor
FlashNorm is an exact algebraic reformulation of RMSNorm plus linear projection that folds weights and defers normalization to allow parallel execution, plus scale-invariance simplifications that remove redundant norms in certain architectures.
DataComp-LM: In search of the next generation of training sets for language models cs.LG · 2024-06-17 · unverdicted · none · ref 163 · internal anchor
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
An Empirical Study of Mamba-based Language Models cs.LG · 2024-06-12 · accept · none · ref 44 · internal anchor
An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.
Chameleon: Mixed-Modal Early-Fusion Foundation Models cs.CL · 2024-05-16 · unverdicted · none · ref 29 · internal anchor
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro on captioning, VQA, text, and image tasks.
Gated Linear Attention Transformers with Hardware-Efficient Training cs.LG · 2023-12-11 · unverdicted · none · ref 86 · internal anchor
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 95 · internal anchor
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
CogVLM: Visual Expert for Pretrained Language Models cs.CV · 2023-11-06 · conditional · none · ref 19 · internal anchor
CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
Vision Transformers Need Registers cs.CV · 2023-09-28 · unverdicted · none · ref 297 · internal anchor
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

GLU Variants Improve Transformer

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer