super hub Mixed citations

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Aixin Liu, Bei Feng, Bingxuan Wang, Bin Wang, Bo Liu, DeepSeek-AI · 2024 · cs.CL · arXiv 2405.04434

Mixed citation behavior. Most common role is background (70%).

118 Pith papers citing it

Background 70% of classified citations


abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
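The headline figures in the abstract can be turned into a few derived ratios. The following is a minimal sketch that uses only the numbers quoted above; the constant and function names are hypothetical, and the savings are taken as stated relative to the DeepSeek 67B baseline named in the abstract.

```python
# Figures quoted in the abstract (savings are relative to DeepSeek 67B).
TOTAL_PARAMS_B = 236.0        # total parameters, in billions
ACTIVE_PARAMS_B = 21.0        # parameters activated per token, in billions
TRAINING_COST_SAVING = 0.425  # 42.5% of training costs saved
KV_CACHE_REDUCTION = 0.933    # KV cache reduced by 93.3%


def derived_ratios():
    """Return ratios implied by the abstract's figures (illustrative helper)."""
    kv_remaining = 1.0 - KV_CACHE_REDUCTION
    return {
        # Share of all parameters that are active for a single token.
        "active_fraction": ACTIVE_PARAMS_B / TOTAL_PARAMS_B,
        # Share of the baseline KV cache that is still needed.
        "kv_cache_remaining": kv_remaining,
        # How many times smaller the KV cache is than the baseline.
        "kv_cache_shrink_factor": 1.0 / kv_remaining,
        # Share of the baseline training cost that is still spent.
        "training_cost_remaining": 1.0 - TRAINING_COST_SAVING,
    }


if __name__ == "__main__":
    for name, value in derived_ratios().items():
        print(f"{name}: {value:.3f}")
```

Under these figures, roughly 9% of the parameters are active per token, and a 93.3% reduction leaves a KV cache about one fifteenth the size of the baseline.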


citation-role summary

background 24 · method 5 · dataset 3 · baseline 1

citation-polarity summary

background 23 · use method 5 · use dataset 3 · baseline 1 · support 1



representative citing papers

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

cs.LG · 2026-05-09 · unverdicted · novelty 8.0

OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

Training-Free Looped Transformers

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.

HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs

cs.DC · 2026-05-22 · unverdicted · novelty 7.0

HyperParallel-MoE achieves up to 1.58x lower Dispatch-to-Combine MoE-FFN latency on Ascend A3 clusters via tile-level heterogeneous scheduling of AIC and AIV resources.

Text2CAD-Bench: A Benchmark for LLM-based Text-to-Parametric CAD Generation

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Text2CAD-Bench supplies 600 dual-prompt examples across four geometric and domain levels to test LLMs on text-to-parametric CAD, finding solid basic performance but sharp drops on complex topology and advanced features.

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.

$\phi$-Balancing for Mixture-of-Experts Training

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

stat.ML · 2026-05-13 · unverdicted · novelty 7.0

MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 for MXFP4 with reduced HBM traffic.

The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

cs.DC · 2026-05-12 · unverdicted · novelty 7.0

Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · novelty 7.0

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

cs.LG · 2026-05-06 · conditional · novelty 7.0 · 2 refs

KernelBenchX benchmark shows task category explains nearly three times more variance in LLM kernel correctness than method choice, iterative refinement boosts correctness but reduces performance, and quantization remains unsolved.

Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs

cs.CR · 2026-05-06 · unverdicted · novelty 7.0

Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.

When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

cs.PF · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity

cs.LG · 2026-04-27 · unverdicted · novelty 7.0

Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.

DPC: A Distributed Page Cache over CXL

cs.DC · 2026-04-21 · conditional · novelty 7.0

DPC maintains exactly one DRAM copy of each file page in a CXL-connected cluster and delivers up to 12.4X speedup (5.6X geometric mean) over replicated caches on data-sharing workloads.

Using large language models for embodied planning introduces systematic safety risks

cs.AI · 2026-04-20 · unverdicted · novelty 7.0

LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.

The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spectral initialization.

Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to SFT.

How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles

cs.AI · 2026-04-08 · unverdicted · novelty 7.0

A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.

EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

cs.LG · 2026-03-06 · conditional · novelty 7.0

EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

cs.AI · 2025-11-05 · unverdicted · novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.

EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

cs.AI · 2025-09-22 · unverdicted · novelty 7.0

EngiBench shows that LLM accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.

citing papers explorer

Showing 50 of 118 citing papers.

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents cs.LG · 2026-05-09 · unverdicted · none · ref 6 · internal anchor
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.
LiveBench: A Challenging, Contamination-Limited LLM Benchmark cs.CL · 2024-06-27 · unverdicted · none · ref 26 · internal anchor
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
Training-Free Looped Transformers cs.LG · 2026-05-22 · unverdicted · none · ref 25 · internal anchor
Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.
HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs cs.DC · 2026-05-22 · unverdicted · none · ref 7 · internal anchor
HyperParallel-MoE achieves up to 1.58x lower Dispatch-to-Combine MoE-FFN latency on Ascend A3 clusters via tile-level heterogeneous scheduling of AIC and AIV resources.
Text2CAD-Bench: A Benchmark for LLM-based Text-to-Parametric CAD Generation cs.LG · 2026-05-18 · unverdicted · none · ref 21 · internal anchor
Text2CAD-Bench supplies 600 dual-prompt examples across four geometric and domain levels to test LLMs on text-to-parametric CAD, finding solid basic performance but sharp drops on complex topology and advanced features.
LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models cs.LG · 2026-05-17 · unverdicted · none · ref 10 · internal anchor
LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.
$\phi$-Balancing for Mixture-of-Experts Training cs.LG · 2026-05-14 · unverdicted · none · ref 8 · internal anchor
φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.
Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference stat.ML · 2026-05-13 · unverdicted · none · ref 14 · internal anchor
MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 for MXFP4 with reduced HBM traffic.
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures cs.DC · 2026-05-12 · unverdicted · none · ref 6 · internal anchor
Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference cs.DC · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 31 · internal anchor
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels cs.LG · 2026-05-06 · conditional · none · ref 6 · 2 links · internal anchor
KernelBenchX benchmark shows task category explains nearly three times more variance in LLM kernel correctness than method choice, iterative refinement boosts correctness but reduces performance, and quantization remains unsolved.
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs cs.CR · 2026-05-06 · unverdicted · none · ref 21 · internal anchor
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs cs.PF · 2026-05-04 · unverdicted · none · ref 1 · 2 links · internal anchor
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.
Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity cs.LG · 2026-04-27 · unverdicted · none · ref 4 · internal anchor
Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.
DPC: A Distributed Page Cache over CXL cs.DC · 2026-04-21 · conditional · none · ref 13 · internal anchor
DPC maintains exactly one DRAM copy of each file page in a CXL-connected cluster and delivers up to 12.4X speedup (5.6X geometric mean) over replicated caches on data-sharing workloads.
Using large language models for embodied planning introduces systematic safety risks cs.AI · 2026-04-20 · unverdicted · none · ref 35 · internal anchor
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations cs.LG · 2026-04-15 · unverdicted · none · ref 1 · internal anchor
Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.
The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks cs.LG · 2026-04-11 · unverdicted · none · ref 15 · internal anchor
In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spectral initialization.
Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning cs.AI · 2026-04-10 · unverdicted · none · ref 4 · internal anchor
COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to SFT.
How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles cs.AI · 2026-04-08 · unverdicted · none · ref 11 · internal anchor
A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE cs.LG · 2026-03-06 · conditional · none · ref 9 · internal anchor
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators cs.AI · 2025-11-05 · unverdicted · none · ref 8 · internal anchor
SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving cs.AI · 2025-09-22 · unverdicted · none · ref 28 · internal anchor
EngiBench shows that LLM accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts cs.LG · 2024-08-28 · conditional · none · ref 7 · internal anchor
Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision cs.LG · 2024-07-11 · accept · none · ref 19 · internal anchor
FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs cs.LG · 2026-05-21 · unverdicted · none · ref 17 · internal anchor
GEMQ applies global LP-based expert importance estimation and router fine-tuning within progressive quantization to cut memory and speed inference in MoE LLMs with little accuracy loss.
HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction cs.CV · 2026-05-20 · unverdicted · none · ref 26 · internal anchor
HDMoE uses hierarchical MoE and RFR modules to address redundant information and fine-grained intra/inter-modality relationships in multimodal cancer survival prediction, with positive results on private liver cancer and TCGA datasets.
Latent Cache Flow: Model-to-Model Communication Without Text cs.LG · 2026-05-19 · unverdicted · none · ref 7 · internal anchor
Latent Cache Flow uses small adapters to jointly translate and compress KV caches between LLMs, enabling accurate communication even with mismatched contexts and outperforming both prior cache adapters and text in early tests.
What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code cs.AI · 2026-05-19 · unverdicted · none · ref 20 · internal anchor
Controlled experiments show structured reasoning traces and higher-density math-domain samples improve mathematical reasoning more than pure executable code, with internal routing patterns reflecting these data effects.
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention cs.CL · 2026-05-18 · unverdicted · none · ref 66 · internal anchor
DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.
ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse cs.DC · 2026-05-16 · unverdicted · none · ref 41 · internal anchor
ObjectCache enables KV cache storage in object storage via layerwise retrieval and custom scheduling, adding 5.6% latency for 64K contexts over local DRAM on a 100 Gbps RoCE cluster.
UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models cs.LG · 2026-05-15 · unverdicted · none · ref 47 · internal anchor
UB-SMoE balances expert utilization in heterogeneous federated SMoE fine-tuning via Dynamic Modulated Routing and Universal Pseudo-Gradient, delivering up to 45% compute reduction and 8.7x performance gains for low-resource clients over prior LoRA-rank methods.
RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision cs.AI · 2026-05-15 · unverdicted · none · ref 6 · internal anchor
RTL-BenchMT is an agent-assisted framework for dynamically maintaining RTL generation benchmarks by fixing flaws and reducing overfitting in LLM-based EDA applications.
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility cs.LG · 2026-05-13 · unverdicted · none · ref 61 · internal anchor
SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
EMO: Frustratingly Easy Progressive Training of Extendable MoE cs.LG · 2026-05-13 · unverdicted · none · ref 7 · 2 links · internal anchor
EMO progressively expands the expert pool in MoE models during training to match fixed-expert performance with improved wall-clock efficiency.
CHAL: Council of Hierarchical Agentic Language cs.AI · 2026-05-12 · unverdicted · none · ref 100 · internal anchor
CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
Search Your Block Floating Point Scales! cs.LG · 2026-05-12 · unverdicted · none · ref 138 · internal anchor
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent cs.LG · 2026-05-11 · unverdicted · none · ref 33 · internal anchor
PowerStep delivers coordinate-wise adaptive optimization by nonlinearly transforming a momentum buffer under an lp-norm steepest-descent geometry, matching Adam convergence with half the memory and supporting aggressive quantization.
From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay cs.AI · 2026-05-10 · unverdicted · none · ref 42 · internal anchor
NSER uses zero-shot LLMs to induce behavioral rules from RL trajectories, grounds them in differentiable first-order logic, and applies the symbolic structures to dynamically reweight experience replay for better sample efficiency.
LBI: Parallel Scan Backpropagation via Latent Bounded Interfaces cs.LG · 2026-05-09 · unverdicted · none · ref 10 · internal anchor
LBI enables tractable parallel backpropagation by reducing inter-region adjoint computation to low-dimensional r x r Jacobians while preserving exact gradients under a bounded-interface model.
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 19 · 2 links · internal anchor
MELT decouples reasoning depth from memory in looped language models by sharing a single gated KV cache per layer and training it via chunk-wise distillation from Ouro starting models.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts cs.LG · 2026-05-07 · unverdicted · none · ref 8 · internal anchor
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
Continuous Latent Diffusion Language Model cs.CL · 2026-05-07 · unverdicted · none · ref 55 · internal anchor
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems cs.AR · 2026-05-07 · unverdicted · none · ref 11 · internal anchor
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism cs.DC · 2026-05-06 · unverdicted · none · ref 23 · internal anchor
Nitsum dynamically adapts tensor parallelism and GPU splits in LLM serving to raise SLO-compliant goodput by up to 5.3 times over prior systems.
The Impossibility Triangle of Long-Context Modeling cs.CL · 2026-05-06 · unverdicted · none · ref 20 · internal anchor
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints cs.LG · 2026-05-06 · unverdicted · none · ref 122 · internal anchor
A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.
DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs cs.PL · 2026-05-02 · unverdicted · none · ref 6 · internal anchor
DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% while saving hundreds of thousands of GPU hours monthly.
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling cs.AI · 2026-04-29 · unverdicted · none · ref 10 · internal anchor
A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
