REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

Buyun Liang; Darshan Thaker; Fengrui Tian; Hamed Hassani; Jinqi Luo; Kaleab A. Kinfu; Kwan Ho Ryan Chan; Liangzu Peng; René Vidal

arxiv: 2605.12813 · v2 · pith:ZSY4N6MWnew · submitted 2026-05-12 · 💻 cs.CL · cs.AI · cs.CR · cs.LG

Buyun Liang, Jinqi Luo, Liangzu Peng, Kwan Ho Ryan Chan, Darshan Thaker, Kaleab A. Kinfu, Fengrui Tian, Hamed Hassani, René Vidal

Pith reviewed 2026-06-30 21:55 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CR · cs.LG

keywords adversarial attacks · LLM hallucinations · latent space optimization · realistic prompts · constrained optimization · LLM robustness · editing directions

The pith

REALISTA optimizes continuous combinations of input-specific editing directions in latent space to produce semantically coherent adversarial prompts that elicit LLM hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates hallucination elicitation as finding adversarial prompts that remain semantically equivalent to benign ones. Prior discrete attacks search only limited variations while continuous latent attacks often produce invalid decoded text. REALISTA addresses both limits by building an input-dependent dictionary of valid rephrasing directions and optimizing their combinations. This yields prompts that stay coherent yet trigger hallucinations. The approach works on open-source models at least as well as prior realistic methods and succeeds on large reasoning models in free-form settings where earlier attacks fail.

Core claim

REALISTA constructs an input-dependent dictionary of valid editing directions, each tied to a semantically equivalent and coherent rephrasing, then optimizes continuous combinations of these directions in latent space. The resulting adversarial prompts remain realistic rephrasings yet reliably elicit hallucinations, achieving superior or comparable results to state-of-the-art realistic attacks on open-source LLMs and succeeding on large reasoning models under free-form response conditions where prior realistic attacks fail.

What carries the argument

Input-dependent dictionary of valid editing directions, which supplies a set of semantically coherent rephrasings whose continuous linear combinations are optimized in latent space while preserving equivalence after decoding.
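
As a rough illustration of the mechanism described above, the following minimal sketch builds editing directions as differences between concept latents and the original prompt latent, then forms a continuous combination of them. It follows the notation of the Figure 2 caption (`z0`, concept latents `c`, dictionary `D`); the coefficient vector `w`, the function names, and the toy three-dimensional vectors are assumptions made for illustration, and the sketch leaves out the encoder, the decoder, and the optimizer that the paper relies on.

```python
from typing import List

Vector = List[float]


def build_edit_dictionary(z0: Vector, concepts: List[Vector]) -> List[Vector]:
    """Return one editing direction z_i = c_i - z0 per concept latent."""
    return [[c_k - z_k for c_k, z_k in zip(c, z0)] for c in concepts]


def apply_edits(z0: Vector, directions: List[Vector], w: Vector) -> Vector:
    """Return z0 + sum_i w[i] * directions[i], a point in latent space."""
    if len(w) != len(directions):
        raise ValueError("need exactly one coefficient per direction")
    z = list(z0)
    for w_i, d in zip(w, directions):
        for k, d_k in enumerate(d):
            z[k] += w_i * d_k
    return z


# Toy latents (hypothetical values, not taken from the paper).
z0 = [1.0, 0.0, 2.0]
concepts = [[2.0, 0.0, 2.0], [1.0, 3.0, 2.0]]

D = build_edit_dictionary(z0, concepts)
z_adv = apply_edits(z0, D, [0.5, 0.25])
```

In this sketch, an all-zero `w` returns `z0`, and a one-hot `w` recovers the corresponding concept latent (up to floating-point rounding), so the original prompt and each individual rephrasing are special cases of the combination.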

If this is right

Hallucination elicitation reduces to constrained optimization over realistic rephrasings rather than unrestricted prompt search.
Large reasoning models remain vulnerable to free-form adversarial inputs even when discrete prompt attacks cannot reach them.
The same dictionary-plus-optimization design can be applied to other LLM failure modes that require semantic preservation.
Performance gains appear on both open-source and reasoning models without sacrificing prompt realism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could generate synthetic training examples to strengthen alignment procedures against realistic adversarial inputs.
Extending the dictionary construction to other modalities or tasks might produce comparable attacks on vision-language models.
If the dictionary size or coverage proves insufficient on some inputs, hybrid discrete-continuous search could be tested as a direct follow-up.

Load-bearing premise

The dictionary of editing directions contains enough effective combinations that remain semantically equivalent and coherent once decoded into text.

What would settle it

Running the optimization on a target LLM and finding that every decoded prompt from the optimized combination either fails to elicit hallucinations or produces incoherent or non-equivalent text.
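
The falsification condition above can be stated as a small predicate. This is only a sketch: the `AttackOutcome` record and its three boolean fields are hypothetical labels that some external judge (automatic or human) would have to supply, and nothing here reflects how the paper scores its own experiments.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AttackOutcome:
    """Judgements for one decoded prompt (all fields are hypothetical labels)."""
    elicits_hallucination: bool
    is_coherent: bool
    is_equivalent: bool

    @property
    def is_valid_success(self) -> bool:
        # A success must be both effective and realistic.
        return (self.elicits_hallucination
                and self.is_coherent
                and self.is_equivalent)


def claim_is_contradicted(outcomes: Iterable[AttackOutcome]) -> bool:
    """True when no decoded prompt is both realistic and hallucination-eliciting."""
    return not any(o.is_valid_success for o in outcomes)


outcomes = [
    AttackOutcome(True, False, True),   # hallucination, but incoherent text
    AttackOutcome(False, True, True),   # realistic, but no hallucination
]
contradicted = claim_is_contradicted(outcomes)
```

A single outcome with all three flags set is enough to make `claim_is_contradicted` return `False`, which mirrors the "every decoded prompt" wording of the test.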

Figures

Figures reproduced from arXiv: 2605.12813 by Buyun Liang, Darshan Thaker, Fengrui Tian, Hamed Hassani, Jinqi Luo, Kaleab A. Kinfu, Kwan Ho Ryan Chan, Liangzu Peng, René Vidal.

**Figure 2.** (Left) Input-dependent edit dictionary construction. We employ a concept optimization procedure to construct a set of latent concepts $c^{(1)}, \dots, c^{(n)}$ conditioned on the original prompt $x_0$ and WordNet (Miller, 1995). These concepts are assembled into an edit dictionary $D$, where each column corresponds to an interpretable editing direction $z^{(i)} = c^{(i)} - z_0$. See §3.1 and Appendix §C for details on the…

**Figure 3.** Optimization trajectory when solving (12). At each optimization iteration, the objective value and the maximum constraint violation are reported as bootstrap means (10,000 resamples) computed over the MMLU subset

**Figure 6.** Normalized word edit distance across layers between original prompts and their reconstructions. Dots indicate the best-performing layer for each model, and shaded regions show standard deviation over 10,000 bootstrap samples. Reconstruction quality degrades as depth increases. The reconstruction quality of decoder ψ directly affects our latent-space optimization in REALISTA. To assess this, we conduct an e…

**Figure 7.** Top 50 most frequently used concepts.

**Figure 8.** Top 50-100 most frequently used concepts.
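
The metric named in the Figure 6 caption, a normalized word edit distance summarized over bootstrap samples, can be sketched as follows. The caption gives only the name of the metric, so the details here are assumptions: whitespace tokenization, Levenshtein distance over words, normalization by the longer word count, and a percentile-free bootstrap that reports the mean and standard deviation of resampled means. The example prompts are made up.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_word_edit_distance(original: str, reconstruction: str) -> float:
    """Edit distance divided by the longer word count (assumed normalization)."""
    a, b = original.split(), reconstruction.split()
    longest = max(len(a), len(b))
    return word_edit_distance(a, b) / longest if longest else 0.0


def bootstrap_mean_std(values: List[float], resamples: int = 10_000,
                       seed: int = 0) -> Tuple[float, float]:
    """Mean and standard deviation of the resampled means."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.choices(values, k=len(values)))
             for _ in range(resamples)]
    return statistics.fmean(means), statistics.pstdev(means)


# Hypothetical (original, reconstruction) pairs for one layer.
pairs = [
    ("what is the capital of France", "what is the capital of France"),
    ("what is the capital of France", "what is a capital of France"),
]
distances = [normalized_word_edit_distance(o, r) for o, r in pairs]
mean, std = bootstrap_mean_std(distances, resamples=1_000)
```

A distance of 0.0 means the decoder reproduced the prompt word for word, and 1.0 means no word survived in place, so lower values indicate better reconstruction under this assumed normalization.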

Original abstract

Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, making it important to systematically evaluate their reliability under realistic adversarial inputs. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing attack methods remain limited: discrete prompt-based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent-space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent-space attack framework. REALISTA constructs an input-dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizes continuous combinations of these directions in latent space. This design combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing-based attacks. Experiments demonstrate that REALISTA achieves superior or comparable performance to state-of-the-art realistic attacks on open-source LLMs and, crucially, succeeds in attacking large reasoning models under free-form response settings, where prior realistic attacks fail. Code is available at https://github.com/Buyun-Liang/REALISTA.
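
One hedged way to write down the constrained problem that the abstract describes is given below. The symbols $D$, $z_0$, $c^{(i)}$, $z^{(i)}$ and the decoder $\psi$ appear in the figure captions on this page; the coefficient vector $w$, the target model $f$, the hallucination objective $\mathcal{L}_{\text{hall}}$, and the constraint functions $g_j$ are placeholders introduced here. This is a generic sketch, not a reproduction of the paper's equation (12), whose exact form is not shown on this page.

```latex
\begin{aligned}
\max_{w \in \mathbb{R}^{n}} \quad
  & \mathcal{L}_{\text{hall}}\bigl( f\bigl( \psi( z_0 + D w ) \bigr) \bigr) \\
\text{s.t.} \quad
  & g_j(w) \le 0, \qquad j = 1, \dots, m,
\end{aligned}
\qquad
D = \bigl[\, z^{(1)} \;\cdots\; z^{(n)} \,\bigr],
\quad
z^{(i)} = c^{(i)} - z_0 .
```

Under this reading, the constraints $g_j$ would stand in for whatever conditions keep the decoded prompt $\psi(z_0 + D w)$ a coherent and equivalent rephrasing, and the search variable is the low-dimensional $w$ and not the full latent vector.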

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

REALISTA's input-dependent dictionary for latent attacks is a clean hybrid idea, but the abstract's lack of numbers leaves the main claims uncheckable.

The paper's core move is to build, for each input, a dictionary of latent directions that each decode to a valid rephrasing, then optimize continuous combinations of those directions. This sits between pure discrete prompt search and unrestricted latent attacks. The framing is direct and the motivation is practical: existing realistic attacks either stay too narrow or produce garbage after decoding.

What the work does well is state the constraint problem clearly and propose a concrete algorithmic fix that keeps the search continuous while trying to enforce semantic equivalence by construction. Releasing code is also a plus for anyone who wants to test the dictionary construction themselves.

The soft spot is obvious from the abstract alone: no quantitative results, no tables, no details on how they measure coherence or equivalence after decoding, and no characterization of how dense or complete the per-input dictionaries actually are. The central claim, that this succeeds on large reasoning models in free-form settings where prior realistic attacks fail, rests entirely on that unshown empirical part. If the dictionary turns out to be sparse or the decoded outputs drift, the method reduces to the discrete case the authors criticize.

This is for groups doing LLM robustness evaluation who need better test inputs. It is worth a serious referee because the problem matters and the proposed construction is not a trivial extension, but any review should start by demanding the full experimental section and failure analysis on the decoded prompts.

Referee Report

2 major / 1 minor

Summary. The paper proposes REALISTA, a latent adversarial attack framework for eliciting hallucinations in LLMs. It formulates the task as constrained optimization over semantically coherent and equivalent prompts, constructs an input-dependent dictionary of valid editing directions in latent space, and optimizes continuous combinations of these directions. The central empirical claim is that this yields superior or comparable performance to prior realistic attacks on open-source LLMs and succeeds against large reasoning models in free-form response settings where earlier realistic methods fail.

Significance. If the central empirical results hold, the work would be significant for LLM reliability evaluation by combining the search flexibility of continuous latent attacks with the semantic validity of discrete rephrasing attacks. The public code release at https://github.com/Buyun-Liang/REALISTA is a clear strength that supports reproducibility.

major comments (2)

[Abstract] Abstract: performance claims (superior/comparable results on open-source LLMs; success on large reasoning models in free-form settings) are stated without any quantitative numbers, tables, error bars, or details on how semantic equivalence and coherence are enforced or measured.
[Abstract] The load-bearing premise that the input-dependent dictionary of editing directions is sufficiently dense and combinable to produce valid, coherent, semantically equivalent rephrasings after gradient-based optimization (rather than out-of-distribution text) receives no quantitative characterization of coverage, density, or decoded-output failure rate; without this the superiority claim on reasoning models cannot be assessed.

minor comments (1)

[Abstract] The abstract would benefit from a brief sentence on the scale of the evaluated models and the number of test cases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where the abstract could be strengthened with quantitative details. We agree that incorporating key results and method specifics will improve clarity and will revise the abstract accordingly. Our responses to the major comments are below.

Referee: [Abstract] Abstract: performance claims (superior/comparable results on open-source LLMs; success on large reasoning models in free-form settings) are stated without any quantitative numbers, tables, error bars, or details on how semantic equivalence and coherence are enforced or measured.

Authors: We acknowledge that the current abstract presents these claims qualitatively. The full manuscript (Section 4) includes detailed tables, figures with error bars, and quantitative comparisons showing performance margins on open-source LLMs as well as success rates on reasoning models. We will revise the abstract to include 1-2 key quantitative highlights (e.g., relative improvements and success rates) and a concise clause on how semantic equivalence is enforced via the constrained optimization over input-dependent editing directions (detailed in Section 3). revision: yes

Referee: [Abstract] The load-bearing premise that the input-dependent dictionary of editing directions is sufficiently dense and combinable to produce valid, coherent, semantically equivalent rephrasings after gradient-based optimization (rather than out-of-distribution text) receives no quantitative characterization of coverage, density, or decoded-output failure rate; without this the superiority claim on reasoning models cannot be assessed.

Authors: The manuscript validates coherence and equivalence through automatic metrics and human evaluation in the experiments, but we agree that explicit quantitative characterization of dictionary coverage, density, and decoded-output failure rates is not currently in the abstract (or sufficiently highlighted). We will add a brief analysis of these aspects (e.g., failure rate of optimized prompts remaining valid rephrasings) to the method/experiments sections and reference the key statistic in the revised abstract to support the claims on reasoning models. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic framework with independent construction steps

The paper presents REALISTA as a procedural framework: it constructs an input-dependent dictionary of editing directions from the input prompt and then performs continuous optimization over combinations of those directions. No equations or claims reduce a derived quantity to a fitted parameter by construction, no self-citations are invoked as load-bearing uniqueness theorems, and the optimization objective is stated as an external constrained problem rather than a self-referential definition. The method therefore remains self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method implicitly assumes the existence of a latent space in which linear combinations of editing directions remain decodable to coherent text, but no details are given.

pith-pipeline@v0.9.1-grok · 5794 in / 1201 out tokens · 20256 ms · 2026-06-30T21:55:06.071217+00:00 · methodology
