Title resolution pending

Ferrando, J · 2024 · arXiv 2405.00208

14 Pith papers cite this work. Polarity classification is still indexing.

14 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 2 baseline 1

citation-polarity summary

background 2 baseline 1

representative citing papers

Data-driven Circuit Discovery for Interpretability of Language Models

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.

Are LLMs Ready for Conflict Monitoring? Empirical Evidence from West Africa

cs.CL · 2026-05-05 · conditional · novelty 7.0

LLMs show significant biases in conflict event classification, with open-weight models exhibiting false illegitimation and adapted models showing actor bias and lexical sensitivity, making them unsuitable for unsupervised deployment.

In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

Speech language models show in-context learning where speaking rate affects both accuracy and mimicry, and induction heads are causally necessary for this capability.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0 · 2 refs

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.

Instructions Shape Production of Language, not Processing

cs.CL · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.

Enabling Performant and Flexible Model-Internal Observability for LLM Inference

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.

Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

cs.AI · 2026-04-20 · conditional · novelty 6.0

Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.

Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

cs.CL · 2026-02-18 · unverdicted · novelty 6.0

CA-LIG is a unified hierarchical attribution method that computes layer-wise Integrated Gradients fused with class-specific attention gradients to generate signed, context-sensitive explanations for transformer models.

Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity

cs.LG · 2026-05-13 · unverdicted · novelty 5.0

Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.

ConceptTracer: Interactive Analysis of Concept Saliency and Selectivity in Neural Representations

cs.LG · 2026-04-08 · unverdicted · novelty 5.0

ConceptTracer supplies an interactive interface and saliency/selectivity metrics to locate concept-responsive neurons in neural representations, shown on TabPFN.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

cs.LG · 2025-09-11 · unverdicted · novelty 5.0

Safe-SAIL supplies a pre-explanation metric and segment-level simulation to interpret 1758 safety SAE features across pornography, politics, violence, and terror, with public models and tools released.

Relational Linear Properties in Language Models: An Empirical Investigation

cs.LG · 2026-05-21

citing papers explorer

Showing 2 of 2 citing papers after filters.

Data-driven Circuit Discovery for Interpretability of Language Models cs.AI · 2026-05-09 · unverdicted · none · ref 10
Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks cs.AI · 2026-04-20 · conditional · none · ref 22
Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer