Root mean square layer normalization. Advances in Neural Information Processing Systems, 32
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary: background 1
citing papers explorer
- Large Language Diffusion Models
  LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
- Training Agents Inside of Scalable World Models
  Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
- Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex
  MINE uses mechanistic interpretability on language-aligned image representations to generate per-voxel feature descriptions, validated via image generation and counterfactual edits that causally shift brain activation.
- UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
  A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
- Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers
  Mean-Variance Split residuals separate centered variation from mean updates to prevent collapse and enable stable training of 1000-layer Diffusion Transformers.