Transformer Circuits Thread

A Mathematical Framework for Transformer Circuits

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 3 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

Delta Attention Residuals

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Delta Attention Residuals attend over per-sublayer deltas instead of cumulative hidden states, producing higher-contrast attention weights and 1.7-8.2% validation perplexity gains over standard and attention residuals across 220M-7.6B models.

Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Unpack decomposes transformer credit via a unified backward recursion on the φ(S)U template, recovering known IOI circuits with mode labels and showing consistent duplicate-name suppression across Pythia scales from a single forward pass.

PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts

cs.CL · 2026-05-16 · unverdicted · novelty 6.0

Benchmark construction artifacts in hallucination detection corpora allow naive text-similarity baselines to achieve near-perfect scores, and controlled evaluations show most methods perform near chance except SAPLMA and the new DRIFT probe.

Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Health foundation model embeddings contain an interpretable symbolic organization shared across modalities that supports cross-domain transfer without joint training.

Temporal Functional Circuits: From Spline Plots to Faithful Explanations in KAN Forecasting

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

A gated residual KAN framework called Temporal Functional Circuits maps edge functions to input lags, ranks them by activation, and validates faithfulness via interventions showing that learned B-splines add predictive value beyond base activations.

How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

LLM OOD detectors are length-confounded; a two-pathway embedding-plus-trajectory framework detects covert OOD inputs at 0.721 average AUROC and 0.850 on jailbreaks.

Position: Ideas Should be the Center of Machine Learning Research

cs.LG · 2026-05-14 · conditional · novelty 4.0

Machine learning research should prioritize ideas by testing their predicted behavioral signatures in modern models through custom experiments instead of leaderboard chasing or abstract theorems.

citing papers explorer

Showing 8 of 8 citing papers.

WriteSAE: Sparse Autoencoders for Recurrent State
cs.LG · 2026-05-12 · unverdicted · none · ref 65 · 3 links

Delta Attention Residuals
cs.LG · 2026-05-13 · unverdicted · none · ref 26

Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition
cs.LG · 2026-05-22 · unverdicted · none · ref 4

PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts
cs.CL · 2026-05-16 · unverdicted · none · ref 79

Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer
cs.LG · 2026-05-08 · unverdicted · none · ref 6

Temporal Functional Circuits: From Spline Plots to Faithful Explanations in KAN Forecasting
cs.LG · 2026-05-07 · unverdicted · none · ref 45

How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework
cs.CL · 2026-04-30 · unverdicted · none · ref 48

Position: Ideas Should be the Center of Machine Learning Research
cs.LG · 2026-05-14 · conditional · none · ref 18
