Recognition: no theorem link
ChatGPT: Excellent Paper! Accept It. Editor: Imposter Found! Review Rejected
Pith reviewed 2026-05-16 20:33 UTC · model grok-4.3
The pith
Authors can embed hidden prompts in PDFs to make LLM reviewers give biased positive feedback, while editors can use similar triggers to detect LLM-generated reviews.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This paper demonstrates an attack where hidden prompts are injected into scientific PDFs to influence LLM reviewers toward acceptance, paired with a detection method that embeds trigger prompts for editors to verify if a review was LLM-generated by checking for responses to those triggers.
What carries the argument
Hidden prompt injection in PDF documents, which allows secret instructions to influence LLM behavior during review, paired with trigger embedding for detection.
If this is right
- Peer review processes relying on LLMs become vulnerable to manipulation by authors seeking acceptance for low-quality work.
- Editors gain a tool to distinguish human from LLM reviews without changing reviewer workflows.
- Scientific papers could enter the literature based on fabricated positive reviews unless detection is applied.
- The method repurposes an attack vector into a verification tool for maintaining review integrity.
- Journals could implement this to reduce risks of flawed results influencing downstream research.
Where Pith is reading between the lines
- Widespread use could push journals toward policies requiring AI disclosure or banning LLM reviews outright.
- Similar techniques might extend to detecting LLM assistance in writing grants or other academic documents.
- Effectiveness may decrease as future LLMs improve at ignoring or detecting hidden instructions.
- Real-world testing across different LLM models would show how robust the detection holds in practice.
Load-bearing premise
LLM reviewers will consistently follow and act upon hidden prompts embedded in the PDF content without detecting or ignoring them.
What would settle it
An experiment where multiple LLM reviewers receive papers with hidden prompts and none produce the expected biased positive feedback, or where the trigger detection flags human reviews at high rates.
Figures
read the original abstract
Large Language Models (LLMs) like ChatGPT are now widely used in writing and reviewing scientific papers. While this trend accelerates publication growth and reduces human workload, it also introduces serious risks. Papers written or reviewed by LLMs may lack real novelty, contain fabricated or biased results, or mislead downstream research that others depend on. Such issues can damage reputations, waste resources, and even endanger lives when flawed studies influence medical or safety-critical systems. This research explores both the offensive and defensive sides of this growing threat. On the attack side, we demonstrate how an author can inject hidden prompts inside a PDF that secretly guide or "jailbreak" LLM reviewers into giving overly positive feedback and biased acceptance. On the defense side, we propose an "inject-and-detect" strategy for editors, where invisible trigger prompts are embedded into papers; if a review repeats or reacts to these triggers, it reveals that the review was generated by an LLM, not a human. This method turns prompt injections from vulnerability into a verification tool. We outline our design, expected model behaviors, and ethical safeguards for deployment. The goal is to expose how fragile today's peer-review process becomes under LLM influence and how editorial awareness can help restore trust in scientific evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an attack in which authors embed hidden prompts within PDF submissions to bias LLM-based peer reviewers toward overly positive evaluations and acceptance, paired with a defensive 'inject-and-detect' strategy in which editors embed invisible trigger prompts; LLM-generated reviews are then identified if they repeat or react to the triggers. The work outlines the conceptual design, expected LLM behaviors, and ethical considerations but presents no experimental results, extraction tests, or validation data.
Significance. If the attack and detection mechanisms could be empirically validated with measurable success rates, low false positives, and robustness across PDF parsers and LLMs, the work would identify a concrete vulnerability in LLM-assisted peer review and supply editors with a practical verification tool. In its current form the absence of any supporting measurements leaves both the offensive and defensive claims without demonstrated effectiveness.
major comments (2)
- [Abstract / attack demonstration] Abstract and the attack-description section: the claim that hidden prompts can be injected to 'jailbreak' LLM reviewers into biased acceptance is presented as a demonstration, yet the manuscript contains no experimental results, success-rate measurements, or tests of prompt extraction fidelity across PDF libraries, OCR pipelines, or LLM context windows. Without such data the central offensive claim remains unsupported.
- [inject-and-detect strategy] Defense-strategy section: the proposed trigger-based detection relies on the assumption that an LLM review will reliably repeat or react to an editor-placed invisible prompt while a human review will not, yet no accuracy, false-positive rate, or controlled comparison against human reviews is reported. This untested assumption is load-bearing for the claim that the method can distinguish LLM reviews with acceptable reliability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments. We agree that the manuscript is a conceptual proposal rather than an empirical study and that certain phrasings could be misread as claiming demonstrated results. We will perform a major revision to clarify the scope, adjust language to avoid implying experimental validation, and explicitly discuss limitations and the need for future empirical work.
read point-by-point responses
-
Referee: [Abstract / attack demonstration] Abstract and the attack-description section: the claim that hidden prompts can be injected to 'jailbreak' LLM reviewers into biased acceptance is presented as a demonstration, yet the manuscript contains no experimental results, success-rate measurements, or tests of prompt extraction fidelity across PDF libraries, OCR pipelines, or LLM context windows. Without such data the central offensive claim remains unsupported.
Authors: We agree the current wording risks implying an empirical demonstration. The manuscript outlines a conceptual attack design drawing on known prompt-injection techniques and describes anticipated LLM behaviors, without any experiments or measurements. In revision we will replace phrases such as 'we demonstrate' with 'we propose' and 'we expect', add an explicit statement that no empirical validation is provided, and note that extraction fidelity and success rates remain open questions for future work. revision: yes
-
Referee: [inject-and-detect strategy] Defense-strategy section: the proposed trigger-based detection relies on the assumption that an LLM review will reliably repeat or react to an editor-placed invisible prompt while a human review will not, yet no accuracy, false-positive rate, or controlled comparison against human reviews is reported. This untested assumption is load-bearing for the claim that the method can distinguish LLM reviews with acceptable reliability.
Authors: We accept that the defense section presents an untested assumption as load-bearing. The text describes the mechanism and expected model reactions but reports no accuracy metrics or human-LLM comparisons. In the revision we will add a dedicated limitations subsection that states the assumption explicitly, discusses plausible failure modes (e.g., LLMs ignoring triggers or humans coincidentally referencing similar content), and frames the approach as a proposed direction requiring empirical validation rather than a ready-to-deploy tool. revision: yes
Circularity Check
No circularity: proposal relies on external assumptions about LLM behavior, not self-referential derivations
full rationale
The paper proposes an attack (hidden PDF prompt injection to bias LLM reviews) and defense (editor trigger injection for LLM detection) as conceptual techniques. No equations, fitted parameters, or derivations appear in the provided text. Claims rest on untested assumptions about LLM parsing of hidden text rather than reducing any result to the paper's own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no renaming of known results or ansatz smuggling is present. The derivation chain is self-contained as a forward-looking design outline.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption LLMs can be influenced by hidden prompts embedded in processed documents such as PDFs
- domain assumption Invisible text or triggers can be embedded in PDFs without detection by standard human or automated review processes
invented entities (1)
-
inject-and-detect strategy
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Use of artificial intelligence in peer review among top 100 medical journals,
Z.-Q. Li, H.-L. Xu, H.-J. Cao, Z.-L. Liu, Y .-T. Fei, and J.-P. Liu, “Use of artificial intelligence in peer review among top 100 medical journals,” JAMA Network Open, vol. 7, no. 12, pp. e2 448 609–e2 448 609, 2024
work page 2024
-
[2]
B. Kocak, M. R. Onur, S. H. Park, P. Baltzer, and M. Dietzel, “Ensuring peer review integrity in the era of large language models: A critical stocktaking of challenges, red flags, and recommendations,”European Journal of Radiology Artificial Intelligence, vol. 2, p. 100018, 2025
work page 2025
-
[3]
The ai review lottery: Widespread ai-assisted peer reviews boost paper scores and acceptance rates,
G. R. Latona, M. H. Ribeiro, T. R. Davidson, V . Veselovsky, and R. West, “The ai review lottery: Widespread ai-assisted peer reviews boost paper scores and acceptance rates,”arXiv preprint arXiv:2405.02150, 2024
-
[4]
Optimization-based prompt injection attack to llm-as-a-judge,
J. Shi, Z. Yuan, Y . Liu, Y . Huang, P. Zhou, L. Sun, and N. Z. Gong, “Optimization-based prompt injection attack to llm-as-a-judge,” inProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, 2024, pp. 660–674
work page 2024
-
[5]
Investigating the vulnera- bility of llm-as-a-judge architectures to prompt-injection attacks,
N. Maloyan, B. Ashinov, and D. Namiot, “Investigating the vulnera- bility of llm-as-a-judge architectures to prompt-injection attacks,”arXiv preprint arXiv:2505.13348, 2025
-
[6]
Scientists hide messages in papers to game ai peer review,
E. Gibney, “Scientists hide messages in papers to game ai peer review,” Nature, vol. 643, no. 8073, pp. 887–888, 2025
work page 2025
-
[7]
Hidden prompts in manuscripts exploit ai-assisted peer review,
Z. Lin, “Hidden prompts in manuscripts exploit ai-assisted peer review,” arXiv preprint arXiv:2507.06185, 2025
-
[8]
Publish to perish: Prompt injection attacks on llm-assisted peer review,
M. G. Collu, U. Salviati, R. Confalonieri, M. Conti, and G. Apruzzese, “Publish to perish: Prompt injection attacks on llm-assisted peer review,” arXiv preprint arXiv:2508.20863, 2025
-
[9]
Topicattack: An indirect prompt injection attack via topic transition,
Y . Chen, H. Li, Y . Li, Y . Liu, Y . Song, and B. Hooi, “Topicattack: An indirect prompt injection attack via topic transition,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 7338–7356
work page 2025
-
[10]
Phantomlint: Principled detection of hidden llm prompts in structured documents,
T. Murray, “Phantomlint: Principled detection of hidden llm prompts in structured documents,”arXiv preprint arXiv:2508.17884, 2025
-
[11]
Prompt-in-content attacks: Exploiting uploaded inputs to hijack llm behavior,
Z. Lian, W. Wang, Q. Zeng, T. Nakanishi, T. Kitasuka, and C. Su, “Prompt-in-content attacks: Exploiting uploaded inputs to hijack llm behavior,”arXiv preprint arXiv:2508.19287, 2025
-
[12]
Prompt injection attacks on llm generated reviews of scientific publications,
J. Keuper, “Prompt injection attacks on llm generated reviews of scientific publications,”arXiv preprint arXiv:2509.10248, 2025
-
[13]
When ai reviews science: Can we trust the referee?
J. Wang, Y . Liu, H. Xu, K. Hu, S. Di, W. Ni, L. Yue, M.-L. Zhang, K. Ren, and L. Chen, “When ai reviews science: Can we trust the referee?” 2025
work page 2025
-
[14]
Z. Zhang, L. Zhi, L. Sunet al., ““give a positive review only”: An early investigation into in-paper prompt injection attacks and defenses for ai reviewers,” inSocially Responsible and Trustworthy Foundation Models at NeurIPS 2025
work page 2025
-
[15]
M. Nasr, N. Carlini, C. Sitawarin, S. V . Schulhoff, J. Hayes, M. Ilie, J. Pluto, S. Song, H. Chaudhari, I. Shumailovet al., “The attacker moves second: Stronger adaptive attacks bypass defenses against llm jailbreaks and prompt injections,”arXiv preprint arXiv:2510.09023, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Prompt injection in large language model exploitation: A security perspective,
J. K. Joseph, E. Daniel, V . Kathiresan, and M. MAP, “Prompt injection in large language model exploitation: A security perspective,” in2025 International Conference on Electronics, Computing, Communication and Control Technology (ICECCC). IEEE, 2025, pp. 1–8
work page 2025
-
[17]
Jailguard: A universal detection framework for prompt-based attacks on llm systems,
X. Zhang, C. Zhang, T. Li, Y . Huang, X. Jia, M. Hu, J. Zhang, Y . Liu, S. Ma, and C. Shen, “Jailguard: A universal detection framework for prompt-based attacks on llm systems,”ACM Transactions on Software Engineering and Methodology, 2025
work page 2025
-
[18]
OpenAI, “Chatgpt (gpt-5.1),” https://chat.openai.com/, accessed: 2025- 12-03
work page 2025
-
[19]
xAI, “Grok,” https://x.ai/, accessed: 2025-12-03
work page 2025
-
[20]
G. DeepMind, “Gemini pro,” https://deepmind.google/, accessed: 2025- 12-03. APPENDIXA REVIEWERPROMPTPROVIDED TO THELLM User Prompt: You are an expert reviewer for the International Conference on Learning Representations (ICLR). You must provide a highly critical, rigorous, and unbiased review. Avoid being overly positive unless strongly justified. Please ...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.