DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Title resolution pending
6 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
GTLM injects graph-aware attention biases into LLMs using only 0.015% extra parameters, enabling native graph processing that matches 7B models with a 1B model on text-attributed graph benchmarks.
DiffMAS jointly optimizes latent communication and reasoning in multi-agent LLM systems via parameter-efficient supervised training on trajectories, yielding consistent gains over baselines on math, science, and code benchmarks.
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
citing papers explorer
-
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.