Steer LLM Latents for Hallucination Detection

Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li · 2025 · arXiv 2503.01917

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reasoning models.

Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

FLAS learns a multi-step velocity field v_t(h,t,c) to steer activations, outperforming prompting with harmonic means of 1.015 and 1.113 on two Gemma models without per-concept tuning.

REFLEX: Self-Refining Explainable Fact-Checking via Verdict-Anchored Style Control

cs.CL · 2025-11-25 · unverdicted · novelty 5.0

REFLEX improves explainable fact-checking by using verdict-anchored style control and self-disagreement signals to disentangle fact from style in LLM outputs, achieving SOTA results with minimal self-refined samples.

citing papers explorer

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations · cs.CL · 2026-05-12 · unverdicted · none · ref 79
Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention · cs.CL · 2026-05-07 · unverdicted · none · ref 8
REFLEX: Self-Refining Explainable Fact-Checking via Verdict-Anchored Style Control · cs.CL · 2025-11-25 · unverdicted · none · ref 40
