International Conference on Learning Representations , year=

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer , author=

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

browse 8 citing papers

citation-role summary

background 3 method 1

citation-polarity summary

background 3 use method 1

representative citing papers

PACT: Peak-Aware Cross-Attention Graph Transformers for Efficient Storm-Surge Emulation

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

PACT introduces a peak-aware cross-attention graph transformer that emulates station-level storm surges more accurately than prior graph neural network baselines while running in seconds after training.

Continual Fine-Tuning of Large Language Models via Program Memory

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

ProCL organizes LoRA adapters into input-conditioned program memory slots that combine with a distributed adapter to improve retention and reduce forgetting in continual LLM fine-tuning.

XPERT: Expert Knowledge Transfer for Effective Training of Language Models

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning

cs.LG · 2026-05-06 · unverdicted · novelty 6.0 · 2 refs

SPHERE applies a Parseval penalty to MoE policies in continual RL to maintain spectral plasticity, yielding 133% and 50% higher average success on MetaWorld and HumanoidBench versus unregularized MoE baselines.

AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

cs.CL · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.

Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

A parameter-free decomposition in MoE models separates routing control from content, showing that expert trajectories cluster tokens by semantic function across languages and forms, making paths rather than experts the natural unit of interpretability.

Minimal-Intervention KV Retention via Set-Conditioned Diversity

cs.LG · 2026-05-14 · conditional · novelty 5.0

A minimal scoring modification to TriAttention using greedy facility-location selection with V-space redundancy penalty improves KV retention at budgets 64 and 128 on distilled reasoning models under matched-memory held-out evaluation.

citing papers explorer

Showing 8 of 8 citing papers.

PACT: Peak-Aware Cross-Attention Graph Transformers for Efficient Storm-Surge Emulation cs.LG · 2026-05-09 · unverdicted · none · ref 8
PACT introduces a peak-aware cross-attention graph transformer that emulates station-level storm surges more accurately than prior graph neural network baselines while running in seconds after training.
Continual Fine-Tuning of Large Language Models via Program Memory cs.LG · 2026-05-13 · unverdicted · none · ref 35
ProCL organizes LoRA adapters into input-conditioned program memory slots that combine with a distributed adapter to improve retention and reduce forgetting in continual LLM fine-tuning.
XPERT: Expert Knowledge Transfer for Effective Training of Language Models cs.CL · 2026-05-09 · unverdicted · none · ref 9
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
ZAYA1-8B Technical Report cs.AI · 2026-05-06 · unverdicted · none · ref 91
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning cs.LG · 2026-05-06 · unverdicted · none · ref 44 · 2 links
SPHERE applies a Parseval penalty to MoE policies in continual RL to maintain spectral plasticity, yielding 133% and 50% higher average success on MetaWorld and HumanoidBench versus unregularized MoE baselines.
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs cs.CL · 2026-05-01 · unverdicted · none · ref 1 · 2 links
AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.
Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs cs.AI · 2026-04-20 · unverdicted · none · ref 16
A parameter-free decomposition in MoE models separates routing control from content, showing that expert trajectories cluster tokens by semantic function across languages and forms, making paths rather than experts the natural unit of interpretability.
Minimal-Intervention KV Retention via Set-Conditioned Diversity cs.LG · 2026-05-14 · conditional · none · ref 4
A minimal scoring modification to TriAttention using greedy facility-location selection with V-space redundancy penalty improves KV retention at budgets 64 and 128 on distilled reasoning models under matched-memory held-out evaluation.

International Conference on Learning Representations , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer