First Conference on Language Modeling
11 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space
  Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.
- FRACTAL: SSM with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences
  FRACTAL integrates fractional recurrent architecture into SSMs using a tunable singularity index to capture multi-scale temporal features, reporting 87.11% average on Long Range Arena and outperforming S5.
- Interpretability Can Be Actionable
  Interpretability research should be judged by actionability (the degree to which its insights support concrete decisions and interventions) rather than explanatory power alone.
- The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
  Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.
- RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction
  RetentiveKV uses entropy to drive state-space model transitions that retain and reactivate low-attention visual tokens in a continuous memory instead of pruning them, delivering 5x KV cache compression and 1.5x faster decoding.
- Multi-Beholder: Biomarker Prediction for Low-Grade Glioma with Multiple Instance Learning and One-Class Classification
  Multi-Beholder integrates one-class classification into multiple instance learning to predict LGG biomarker status from histopathology images, reporting AUCs of 0.973 on TCGA-LGG and 0.820 on an external Xiangya cohort.
- MATE: Solving Contextual Markov Decision Processes with Memory of Accumulated Transition Embeddings
  MATE uses permutation-invariant sum-aggregated memory of transition embeddings to solve CMDPs with online adaptation and computational advantages over Transformers and RNNs.
- Federated Nested Learning: Collaborative Training of Self-Referential Memories for Test-Time Adaptation
  FedNL reformulates federated learning as nested optimization with linear attention for collaborative test-time adaptation on non-IID data.
- Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models
  Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
- LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs
  LayerBoost selectively replaces or removes attention in non-critical transformer layers to cut inference latency up to 68% while recovering quality via brief distillation.
- Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
  Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.