SpikingBrain: Spiking Brain-inspired Large Models
Abstract
Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms, and training remains stable for weeks on hundreds of MetaX GPUs with Model FLOPs Utilization at expected levels. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models also significantly improve long-context efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Furthermore, the proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
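The abstract's headline numbers hinge on spike coding: real-valued activations are converted into sparse, binary, event-driven spike trains, and the fraction of silent neuron-timesteps (69.15 percent in the paper) is what enables low-power operation. The paper's actual adaptive spiking neurons and coding framework are not specified here, so the following is only a minimal illustrative sketch using a plain leaky integrate-and-fire (LIF) encoder; the function name, threshold, leak factor, and step count are all assumptions for illustration, not SpikingBrain's design.

```python
import numpy as np

def lif_encode(x, threshold=1.0, leak=0.9, steps=8):
    """Encode a real-valued activation vector into binary spike trains
    with a simple leaky integrate-and-fire neuron (illustrative only)."""
    v = np.zeros_like(x)                       # membrane potential
    spikes = np.zeros((steps,) + x.shape)
    for t in range(steps):
        v = leak * v + x                       # leaky integration of input current
        fired = v >= threshold                 # event-driven: spike on threshold crossing
        spikes[t] = fired
        v = np.where(fired, v - threshold, v)  # soft reset after a spike
    return spikes

rng = np.random.default_rng(0)
acts = rng.uniform(0.0, 0.5, size=64)          # toy activations
s = lif_encode(acts)
sparsity = 1.0 - s.mean()                      # fraction of silent neuron-timesteps
```

Because downstream computation only has to act when a spike occurs, higher sparsity translates directly into fewer operations; the measured `sparsity` here plays the same role as the 69.15 percent figure reported in the abstract, though the real scheme and its numbers will differ.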
This paper has not been read by Pith yet.
Forward citations
Cited by 4 Pith papers
-
LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs
LayerBoost applies layer-specific attention changes guided by sensitivity analysis plus brief distillation to cut LLM inference latency up to 68% while keeping competitive quality.
-
Adaptive Spiking Neurons for Vision and Language Modeling
ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.
-
LIFE -- an energy efficient advanced continual learning agentic AI framework for frontier systems
LIFE is a proposed agentic framework that combines four components to enable incremental, flexible, and energy-efficient continual learning for HPC operations such as latency spike mitigation.