Causality is Key for Interpretability Claims to Generalise, February 2026

URL https://aclanthology · 2020 · arXiv 2602.16698

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

When Does LeJEPA Learn a World Model?

stat.ML · 2026-05-25 · unverdicted · novelty 8.0

LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.

From Mechanistic to Compositional Interpretability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 3 refs

The paper introduces compositional interpretability as a category-theoretic framework that casts mechanistic explanations as commuting syntactic-semantic mappings optimized under faithfulness and complexity constraints derived from minimum description length.

ToxiREX: A Dataset on Toxic REasoning in ConteXt

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

cs.CL · 2026-06-06 · unverdicted · novelty 6.0 · 2 refs

Introduces distribution-level unsupervised feature discovery for LLMs by clustering continuations using semantic embeddings and prefix-to-continuation attribution signatures via rate-distortion optimization.

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

cs.LG · 2026-06-06 · unverdicted · novelty 6.0

Behavioral safety metrics for LLMs are insufficient because models can maintain safe outputs while remaining vulnerable to latent-space interventions, as shown via dissociated models and the new Latent Vulnerability Score.

There Will Be a Scientific Theory of Deep Learning

stat.ML · 2026-04-23 · unverdicted · novelty 2.0

A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.

citing papers explorer

When Does LeJEPA Learn a World Model?
stat.ML · 2026-05-25 · unverdicted · none · ref 86
LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.

From Mechanistic to Compositional Interpretability
cs.LG · 2026-05-09 · unverdicted · none · ref 37 · 3 links
The paper introduces compositional interpretability as a category-theoretic framework that casts mechanistic explanations as commuting syntactic-semantic mappings optimized under faithfulness and complexity constraints derived from minimum description length.

ToxiREX: A Dataset on Toxic REasoning in ConteXt
cs.CL · 2026-06-26 · unverdicted · none · ref 38
ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms
cs.CL · 2026-06-06 · unverdicted · none · ref 5 · 2 links
Introduces distribution-level unsupervised feature discovery for LLMs by clustering continuations using semantic embeddings and prefix-to-continuation attribution signatures via rate-distortion optimization.

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective
cs.LG · 2026-06-06 · unverdicted · none · ref 47
Behavioral safety metrics for LLMs are insufficient because models can maintain safe outputs while remaining vulnerable to latent-space interventions, as shown via dissociated models and the new Latent Vulnerability Score.

There Will Be a Scientific Theory of Deep Learning
stat.ML · 2026-04-23 · unverdicted · none · ref 275
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.
