A Language Model’s Guide Through Latent Space

A language model’s guide through latent space · 2024 · arXiv 2402.14433

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

cs.LG · 2026-04-21 · conditional · novelty 7.0

Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering

cs.CL · 2026-05-28 · unverdicted · novelty 5.0

UniSteer trains a conditional flow matching model on LLM residual-stream activations to enable text-conditioned steering and classification across multiple behavioral tasks.

A Byzantine Fault Tolerance Approach towards AI Safety

cs.DC · 2025-04-20 · unverdicted · novelty 4.0

Proposes a fault-tolerance architecture for AI safety by analogizing unreliable AI artifacts to Byzantine nodes and applying consensus mechanisms.

Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought

cs.LG · 2025-10-28

citing papers explorer

Showing 5 of 5 citing papers.

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control · cs.LG · 2026-04-21 · conditional · none · ref 43
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

Refusal in Language Models Is Mediated by a Single Direction · cs.LG · 2024-06-17 · accept · none · ref 193
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering · cs.CL · 2026-05-28 · unverdicted · none · ref 20
UniSteer trains a conditional flow matching model on LLM residual-stream activations to enable text-conditioned steering and classification across multiple behavioral tasks.

A Byzantine Fault Tolerance Approach towards AI Safety · cs.DC · 2025-04-20 · unverdicted · none · ref 3
Proposes a fault-tolerance architecture for AI safety by analogizing unreliable AI artifacts to Byzantine nodes and applying consensus mechanisms.

Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought · cs.LG · 2025-10-28 · unreviewed · ref 30
