Recognition: 3 theorem links
Efficient Sparse Selective-Update RNNs for Long-Range Sequence Modeling
Pith reviewed 2026-05-16 03:17 UTC · model grok-4.3
The pith
suRNNs use neuron-level binary switches to update recurrent states only on informative events, matching Transformer accuracy on long-range tasks while remaining more efficient.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our experiments on the Long Range Arena, WikiText, and other synthetic benchmarks show that suRNNs match or exceed the accuracy of much more complex models such as Transformers, while remaining significantly more efficient for long-term storage.
Load-bearing premise
That a neuron-level binary switch can be trained to reliably identify informative events without introducing training instability or irreversible loss of critical past information.
Original abstract
Real-world sequential signals, such as audio or video, contain critical information that is often embedded within long periods of silence or noise. While recurrent neural networks (RNNs) are designed to process such data efficiently, they often suffer from "memory decay" due to a rigid update schedule: they typically update their internal state at every time step, even when the input is static. This constant activity forces the model to overwrite its own memory and makes it hard for the learning signal to reach back to distant past events. Here we show that we can overcome this limitation using Selective-Update RNNs (suRNNs), a non-linear architecture that learns to preserve its memory when the input is redundant. By using a neuron-level binary switch that only opens for informative events, suRNNs decouple the recurrent updates from the raw sequence length. This mechanism allows the model to maintain an exact, unchanged memory of the past during low-information intervals, creating a direct path for gradients to flow across time. Our experiments on the Long Range Arena, WikiText, and other synthetic benchmarks show that suRNNs match or exceed the accuracy of much more complex models such as Transformers, while remaining significantly more efficient for long-term storage. By allowing each neuron to learn its own update timescale, our approach resolves the mismatch between how long a sequence is and how much information it actually contains. By providing a principled approach to managing temporal information density, this work establishes a new direction for achieving Transformer-level performance within the highly efficient framework of recurrent modeling.
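For intuition, here is a minimal runnable sketch of the update rule the abstract describes: a per-neuron hard gate that either writes a candidate state or copies the previous state exactly. This is an illustrative reconstruction, not the paper's code; the gating network, the 0.5 threshold, and the straight-through estimator are assumptions.

    # Hypothetical sketch of a selective-update RNN cell (PyTorch).
    import torch
    import torch.nn as nn

    class SelectiveUpdateCell(nn.Module):
        def __init__(self, input_size: int, hidden_size: int):
            super().__init__()
            self.candidate = nn.Linear(input_size + hidden_size, hidden_size)
            self.gate = nn.Linear(input_size + hidden_size, hidden_size)

        def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
            z = torch.cat([x, h], dim=-1)
            h_tilde = torch.tanh(self.candidate(z))  # proposed new state
            p = torch.sigmoid(self.gate(z))          # per-neuron open probability
            # Hard binary switch via a straight-through estimator: the forward
            # pass uses the 0/1 gate, the backward pass uses the soft p.
            g = (p > 0.5).float() + p - p.detach()
            # Closed neurons (g = 0) keep their previous state exactly, so the
            # step is an identity map for them and gradients pass through intact.
            return g * h_tilde + (1.0 - g) * h

    # Toy usage: roll the cell over a length-20 random sequence.
    cell = SelectiveUpdateCell(input_size=8, hidden_size=16)
    h = torch.zeros(1, 16)
    for x in torch.randn(20, 1, 8):
        h = cell(x, h)

The copy branch (1 - g) * h is what decouples the number of recurrent writes from the raw sequence length: neurons whose gates stay closed carry their state unchanged across arbitrarily long low-information intervals.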
Editorial analysis
A structured set of objections, weighed in public.
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper introduces Selective-Update RNNs (suRNNs) via an independent architectural mechanism: a neuron-level binary switch that opens only for informative events, decoupling recurrent updates from raw sequence length. Performance claims rest on external experimental benchmarks (Long Range Arena, WikiText, synthetic tasks) rather than on fitted parameters or equations that would tautologically reproduce the results. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided derivation; the core idea is a direct architectural addition that does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
Invented entities (1)
- neuron-level binary switch (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · Jcost_unit0 (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "When the gate is switched off (g_{t,i} = 0), the i-th neuron acts as an ideal memory cell, preserving the exact same state from the previous time step" (see the formal sketch after this list).
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · bare_distinguishability_of_absolute_floor (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "the effective credit-assignment depth scales with the number of informative updates |U^on_i(s,t)| rather than sequence length" (see the formal sketch after this list).
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (refines)
  REFINES: relation between the paper passage and the cited Recognition theorem.
  Passage: "selective update induces a dual-mode dynamics... identity map to preserve states during non-informative intervals"
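The echoed passages above share one formal shape. A minimal LaTeX sketch, reconstructed from the quoted excerpts under simplifying assumptions (the gate g_{t,i} and the informative-update set U^on_i(s,t) follow the paper's quoted notation; the candidate state \tilde h_{t,i} is an assumed name, and the gradient line treats the gate pattern as fixed and ignores cross-neuron terms):

    % Dual-mode, per-neuron update: an open gate writes, a closed gate is the identity.
    h_{t,i} = g_{t,i}\,\tilde h_{t,i} + (1 - g_{t,i})\,h_{t-1,i},
    \qquad g_{t,i} \in \{0, 1\}.

    % A closed step (g_{k,i} = 0) has Jacobian 1, so the gradient product over
    % the interval [s, t] collapses to the informative steps alone:
    \frac{\partial h_{t,i}}{\partial h_{s,i}}
      = \prod_{k \in U^{\mathrm{on}}_i(s,t)} \frac{\partial h_{k,i}}{\partial h_{k-1,i}}.

Under these assumptions the effective credit-assignment depth is |U^on_i(s,t)| rather than t - s, which is the claim in the second echoed passage.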
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Working Memory in a Recurrent Spiking Neural Networks With Heterogeneous Synaptic Delays
  A recurrent SNN with heterogeneous synaptic delays (D=41) achieves perfect F1=1.0 recall of 16 arbitrary spike patterns on a synthetic benchmark by representing them as chains of overlapping spiking motifs.