hub

Root mean square layer normalization.Advances in neural information processing systems, 32

Biao Zhang, Rico Sennrich · 2019

14 Pith papers cite this work. Polarity classification is still indexing.

14 Pith papers citing it

browse 14 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

method 3 background 1

citation-polarity summary

use method 3 background 1

representative citing papers

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

math.PR · 2026-04-29 · unverdicted · novelty 7.0

Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

cs.CV · 2026-05-21 · conditional · novelty 6.0

A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.

HRM-Text: Efficient Pretraining Beyond Scaling

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

A 1B-parameter hierarchical recurrent model pretrained on 40B instruction-response tokens achieves 60.7% MMLU and strong results on ARC-C, DROP, GSM8K, and MATH while using 100-900x fewer tokens than standard baselines.

Efficient Image Synthesis with Sphere Latent Encoder

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

Decouples Sphere Encoder into fixed pretrained encoder and spherical latent denoiser, yielding higher quality and faster inference than the joint original on Animal-Faces, Oxford-Flowers and ImageNet-1K.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

Nucleus-Image: Sparse MoE for Image Generation

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

cs.LG · 2026-04-06 · unverdicted · novelty 6.0 · 2 refs

FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.

HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations

cs.RO · 2026-03-03 · unverdicted · novelty 6.0

HoMMI learns whole-body mobile manipulation policies from robot-free human demonstrations by augmenting UMI with egocentric sensing and bridging the embodiment gap through an agnostic visual representation, relaxed head actions, and a whole-body controller.

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

Lens is a 3.8B-parameter text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL fine-tuning, and 4-step

Graph Hierarchical Recurrence for Long-Range Generalization

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

GHR uses hierarchical recurrence on pooled graph abstractions to improve long-range dependency capture and out-of-range generalization while using far fewer parameters than existing models.

Cubit: Token Mixer with Kernel Ridge Regression

cs.LG · 2026-05-07 · unverdicted · novelty 5.0 · 2 refs

Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.

JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

cs.CL · 2026-04-03 · unverdicted · novelty 5.0

JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.

Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer

cs.LG · 2026-05-05 · unverdicted · novelty 4.0

Nora is a matrix optimizer that stabilizes weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights while approximating structured preconditioning with O(mn) complexity and proven scalability.

citing papers explorer

Showing 14 of 14 citing papers.

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models math.PR · 2026-04-29 · unverdicted · none · ref 54
Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation cs.LG · 2026-04-11 · unverdicted · none · ref 177
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
RiT: Vanilla Diffusion Transformers Suffice in Representation Space cs.CV · 2026-05-21 · conditional · none · ref 43
A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.
HRM-Text: Efficient Pretraining Beyond Scaling cs.CL · 2026-05-20 · unverdicted · none · ref 26
A 1B-parameter hierarchical recurrent model pretrained on 40B instruction-response tokens achieves 60.7% MMLU and strong results on ARC-C, DROP, GSM8K, and MATH while using 100-900x fewer tokens than standard baselines.
Efficient Image Synthesis with Sphere Latent Encoder cs.CV · 2026-05-15 · unverdicted · none · ref 41
Decouples Sphere Encoder into fixed pretrained encoder and spherical latent denoiser, yielding higher quality and faster inference than the joint original on Animal-Faces, Oxford-Flowers and ImageNet-1K.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 103
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
Nucleus-Image: Sparse MoE for Image Generation cs.CV · 2026-04-14 · unverdicted · none · ref 24
A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control cs.LG · 2026-04-06 · unverdicted · none · ref 94 · 2 links
FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.
HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations cs.RO · 2026-03-03 · unverdicted · none · ref 46
HoMMI learns whole-body mobile manipulation policies from robot-free human demonstrations by augmenting UMI with egocentric sensing and bridging the embodiment gap through an agnostic visual representation, relaxed head actions, and a whole-body controller.
Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models cs.CV · 2026-05-20 · unverdicted · none · ref 20
Lens is a 3.8B-parameter text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL fine-tuning, and 4-step
Graph Hierarchical Recurrence for Long-Range Generalization cs.LG · 2026-05-18 · unverdicted · none · ref 38
GHR uses hierarchical recurrence on pooled graph abstractions to improve long-range dependency capture and out-of-range generalization while using far fewer parameters than existing models.
Cubit: Token Mixer with Kernel Ridge Regression cs.LG · 2026-05-07 · unverdicted · none · ref 95 · 2 links
Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency cs.CL · 2026-04-03 · unverdicted · none · ref 10
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer cs.LG · 2026-05-05 · unverdicted · none · ref 10
Nora is a matrix optimizer that stabilizes weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights while approximating structured preconditioning with O(mn) complexity and proven scalability.

Root mean square layer normalization.Advances in neural information processing systems, 32

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer