Hierarchical vs. Flat Iteration in Shared-Weight Transformers

2026 · cs.CL · arXiv 2604.14442

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

We present an empirical study of whether hierarchically structured, shared-weight recurrence can match the representational quality of independent-layer stacking in a Transformer-based language model. HRM-LM replaces L independent Transformer layers with a two-speed recurrent pair: a Fast module operating at every step for local refinement, and a Slow module operating every T steps for global compression. This recurrent hierarchy is unrolled for M = N × T steps with shared parameters. The central and most robust finding, supported by a parameter-matched Universal Transformer ablation (UniTF, 1.2B) across five independent runs, is a sharp empirical gap between the two approaches.
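The two-speed schedule described in the abstract can be sketched as a plain loop. This is a minimal illustration rather than the authors' implementation. The function name `unroll_two_speed`, the scalar states, the way each module reads the other's state, and the choice to fire the Slow module at the end of each block of T steps are all assumptions made for the example. What it does take from the abstract is that the Fast update runs at every step, the Slow update runs every T steps, and the same two modules are reused for all M = N × T steps, so N counts the Slow updates.

```python
from typing import Callable, List, Tuple

# An update takes (own state, other module's state) and returns the new own state.
Update = Callable[[float, float], float]


def unroll_two_speed(
    fast: Update,
    slow: Update,
    z_fast: float,
    z_slow: float,
    n_cycles: int,
    period: int,
) -> Tuple[float, float, List[str]]:
    """Unroll a two-speed recurrence for M = n_cycles * period steps.

    `fast` and `slow` are reused at every call, which stands in for
    weight sharing. All names here are hypothetical.
    """
    trace: List[str] = []
    total_steps = n_cycles * period  # M = N x T
    for step in range(1, total_steps + 1):
        # Fast module: applied at every step.
        z_fast = fast(z_fast, z_slow)
        trace.append("fast")
        # Slow module: applied once every `period` steps
        # (assumed here to fire at the end of each block of T steps).
        if step % period == 0:
            z_slow = slow(z_slow, z_fast)
            trace.append("slow")
    return z_fast, z_slow, trace


# Toy scalar updates standing in for Transformer blocks (hypothetical).
_, _, trace = unroll_two_speed(
    fast=lambda f, s: 0.5 * f + 0.5 * s,
    slow=lambda s, f: 0.9 * s + 0.1 * f,
    z_fast=1.0,
    z_slow=0.0,
    n_cycles=2,
    period=3,
)
print(trace.count("fast"), trace.count("slow"))  # 6 fast calls, 2 slow calls
```

With N = 2 and T = 3 the loop makes six Fast calls and two Slow calls, while the number of distinct modules stays at two regardless of M. That is the contrast with stacking L independent layers that the abstract sets up.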

representative citing papers

CART: Context-Anchored Recurrent Transformer -- A Parameter-Efficient Architecture with Learned Stability

cs.LG · 2026-05-31 · unverdicted · novelty 4.0

CART is a recurrent transformer with a shared core, frozen prelude KV tensors, and an LTI stability gate; it fails to beat dense baselines at parameter parity across the tested widths.

citing papers explorer

Showing 1 of 1 citing paper.

CART: Context-Anchored Recurrent Transformer -- A Parameter-Efficient Architecture with Learned Stability

cs.LG · 2026-05-31 · unverdicted · none · ref 4 · internal anchor
