Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin · 2023

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.

VLAs are Confined yet Capable of Generalizing to Novel Instructions

cs.RO · 2025-05-06 · unverdicted · novelty 7.0

Averaging and temporally interpolating text latents in VLAs enables 83% success on novel task combinations in the libero-ood benchmark where SOTA models achieve under 15%.

Adam-SHANG: A Convergent Adam-Type Method for Stochastic Smooth Convex Optimization

math.OC · 2026-05-13 · unverdicted · novelty 6.0

Adam-SHANG is a convergent Adam variant for stochastic smooth convex optimization that uses a stable lagged-preconditioner update and a computable trace-ratio stepsize rule.

WhisperRT -- Turning Whisper into a Causal Streaming Model

cs.CL · 2025-08-17 · conditional · novelty 6.0

WhisperRT converts Whisper to a causal streaming ASR model via encoder causality, decoder synchronization on partial states, and fine-tuning, achieving better performance than non-fine-tuned streaming methods on sub-300ms chunks with lower complexity.

Functional Subspace, where language models can use vector algebra to solve problems

cs.CL · 2026-02-02 · unverdicted · novelty 5.0

LLMs form functional subspaces in activation space where in-context learning tasks are solved by vector algebra operations such as addition and subtraction.

Energy Scaling Laws for Diffusion Models: Quantifying Compute in Image Generation

cs.LG · 2025-11-21 · unverdicted · novelty 5.0

An adapted scaling law predicts GPU energy consumption for diffusion model inference with R² > 0.9 within architectures and strong cross-architecture generalization.

Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

cs.LG · 2025-07-01 · unverdicted · novelty 5.0

JAM aligns frozen vision and language models via joint autoencoders and multimodal Spread Loss, reliably inducing cross-modal alignment across layer depths, objectives, and model scales.

Improving Spatio-Temporal Residual Error Propagation by Mitigating Over-Squashing

cs.LG · 2026-05-18 · unverdicted · novelty 4.0

Teger is a backbone-agnostic structured uncertainty module that uses discrete Forman curvature for spatial graph rewiring inside a low-rank-plus-diagonal covariance head to mitigate over-squashing and improve residual error propagation in spatio-temporal forecasting.

ChronoVAE-HOPE: Beyond Attention -- A Next-Generation VAE Foundation Model for Specialized Time Series Classification

cs.LG · 2026-05-21

KairosHope: A Next-Generation Time-Series Foundation Model for Specialized Classification via Dual-Memory Architecture

cs.LG · 2026-05-18

Language Modeling with Hyperspherical Flows

cs.LG · 2026-05-11 · 2 refs

citing papers explorer

Showing 11 of 11 citing papers.

FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

cs.LG · 2026-05-06 · unverdicted · none · ref 25

VLAs are Confined yet Capable of Generalizing to Novel Instructions

cs.RO · 2025-05-06 · unverdicted · none · ref 39

Adam-SHANG: A Convergent Adam-Type Method for Stochastic Smooth Convex Optimization

math.OC · 2026-05-13 · unverdicted · none · ref 39

WhisperRT -- Turning Whisper into a Causal Streaming Model

cs.CL · 2025-08-17 · conditional · none · ref 39

Functional Subspace, where language models can use vector algebra to solve problems

cs.CL · 2026-02-02 · unverdicted · none · ref 10

Energy Scaling Laws for Diffusion Models: Quantifying Compute in Image Generation

cs.LG · 2025-11-21 · unverdicted · none · ref 32

Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

cs.LG · 2025-07-01 · unverdicted · none · ref 29

Improving Spatio-Temporal Residual Error Propagation by Mitigating Over-Squashing

cs.LG · 2026-05-18 · unverdicted · none · ref 23

ChronoVAE-HOPE: Beyond Attention -- A Next-Generation VAE Foundation Model for Specialized Time Series Classification

cs.LG · 2026-05-21 · unreviewed · ref 15

KairosHope: A Next-Generation Time-Series Foundation Model for Specialized Classification via Dual-Memory Architecture

cs.LG · 2026-05-18 · unreviewed · ref 19

Language Modeling with Hyperspherical Flows

cs.LG · 2026-05-11 · unreviewed · ref 89 · 2 links
