Gomez, Lukasz Kaiser, and Illia Polosukhin
11 Pith papers cite this work. Polarity classification is still indexing.
roles: background 1
polarities: background 1
citing papers explorer
- FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning
  FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.
- VLAs are Confined yet Capable of Generalizing to Novel Instructions
  Averaging and temporally interpolating text latents in VLAs enables 83% success on novel task combinations in the libero-ood benchmark where SOTA models achieve under 15%.
- Adam-SHANG: A Convergent Adam-Type Method for Stochastic Smooth Convex Optimization
  Adam-SHANG is a convergent Adam variant for stochastic smooth convex optimization that uses a stable lagged-preconditioner update and a computable trace-ratio stepsize rule.
- WhisperRT -- Turning Whisper into a Causal Streaming Model
  WhisperRT converts Whisper to a causal streaming ASR model via encoder causality, decoder synchronization on partial states, and fine-tuning, achieving better performance than non-fine-tuned streaming methods on sub-300ms chunks with lower complexity.
- Functional Subspace, where language models can use vector algebra to solve problems
  LLMs form functional subspaces in activation space where in-context learning tasks are solved by vector algebra operations such as addition and subtraction.
- Energy Scaling Laws for Diffusion Models: Quantifying Compute in Image Generation
  An adapted scaling law predicts GPU energy consumption for diffusion model inference with R² > 0.9 within architectures and strong cross-architecture generalization.
- Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models
  JAM aligns frozen vision and language models via joint autoencoders and multimodal Spread Loss, reliably inducing cross-modal alignment across layer depths, objectives, and model scales.
- Improving Spatio-Temporal Residual Error Propagation by Mitigating Over-Squashing
  Teger is a backbone-agnostic structured uncertainty module that uses discrete Forman curvature for spatial graph rewiring inside a low-rank-plus-diagonal covariance head to mitigate over-squashing and improve residual error propagation in spatio-temporal forecasting.
- ChronoVAE-HOPE: Beyond Attention -- A Next-Generation VAE Foundation Model for Specialized Time Series Classification
- KairosHope: A Next-Generation Time-Series Foundation Model for Specialized Classification via Dual-Memory Architecture
- Language Modeling with Hyperspherical Flows