Scalable matmul-free language modeling

2024 · arXiv 2406.02528

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · method: 1

citation-polarity summary

background: 2 · use method: 1

representative citing papers

On the Importance of Multistability for Horizon Generalization in Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Multistability is necessary for temporal horizon generalization in POMDPs, sufficient in simple tasks along with transient dynamics in complex ones, while monostable parallelizable RNNs like SSMs and gated linear RNNs fail by construction.

Improving the Performance and Learning Stability of Parallelizable RNNs Designed for Ultra-Low Power Applications

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Cumulative state updates in CMRU restore gradient flow through time in quantized bistable RNNs, yielding more stable convergence and competitive or superior performance versus LRUs and minGRUs on long-range sequence tasks.

BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

cs.LG · 2026-04-05 · unverdicted · novelty 5.0

BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels.

Hands-on PDC in Undergraduate Computing Education

cs.CY · 2026-04-28 · unverdicted · novelty 4.0

A three-year evaluation of an undergraduate assignment shows that using a real supercomputer for matrix multiplication benchmarks improves student understanding of parallelism and multithreading.

citing papers explorer

Showing 4 of 4 citing papers.

On the Importance of Multistability for Horizon Generalization in Reinforcement Learning · cs.LG · 2026-05-12 · unverdicted · none · ref 11
Improving the Performance and Learning Stability of Parallelizable RNNs Designed for Ultra-Low Power Applications · cs.LG · 2026-05-12 · unverdicted · none · ref 47
BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design · cs.LG · 2026-04-05 · unverdicted · none · ref 28
Hands-on PDC in Undergraduate Computing Education · cs.CY · 2026-04-28 · unverdicted · none · ref 15
