Steer LLM latents for hallucination detection
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CL 3
verdicts
UNVERDICTED 3
representative citing papers
FLAS learns a multi-step velocity field v_t(h,t,c) to steer activations, outperforming prompting with harmonic means of 1.015 and 1.113 on two Gemma models without per-concept tuning.
REFLEX improves explainable fact-checking by using verdict-anchored style control and self-disagreement signals to disentangle fact from style in LLM outputs, achieving SOTA results with minimal self-refined samples.
citing papers explorer
- REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reasoning models.
- Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention
FLAS learns a multi-step velocity field v_t(h,t,c) to steer activations, outperforming prompting with harmonic means of 1.015 and 1.113 on two Gemma models without per-concept tuning.
- REFLEX: Self-Refining Explainable Fact-Checking via Verdict-Anchored Style Control
REFLEX improves explainable fact-checking by using verdict-anchored style control and self-disagreement signals to disentangle fact from style in LLM outputs, achieving SOTA results with minimal self-refined samples.