LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Physics of language models: Part 2.1, grade-school math and the hidden reasoning process
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.
Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distributions.
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
citing papers explorer
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
Scaling Latent Reasoning via Looped Language Models
Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
-
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.
-
The Power of Power Law: Asymmetry Enables Compositional Reasoning
Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distributions.
-
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
- Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning