Scalable matmul-free language modeling
4 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
citing papers explorer
- On the Importance of Multistability for Horizon Generalization in Reinforcement Learning
  Multistability is necessary for temporal horizon generalization in POMDPs, sufficient in simple tasks along with transient dynamics in complex ones, while monostable parallelizable RNNs like SSMs and gated linear RNNs fail by construction.
- Improving the Performance and Learning Stability of Parallelizable RNNs Designed for Ultra-Low Power Applications
  Cumulative state updates in CMRU restore gradient flow through time in quantized bistable RNNs, yielding more stable convergence and competitive or superior performance versus LRUs and minGRUs on long-range sequence tasks.
- BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
  BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels.
- Hands-on PDC in Undergraduate Computing Education
  A three-year evaluation of an undergraduate assignment shows that using a real supercomputer for matrix multiplication benchmarks improves student understanding of parallelism and multithreading.