arxiv: 2512.20405 · v3 · submitted 2025-12-23 · 💻 cs.CR

Recognition: no theorem link

ChatGPT: Excellent Paper! Accept It. Editor: Imposter Found! Review Rejected

Kanchon Gharami , Sanjiv Kumar Sarkar , Safayat Bin Hakim , Yongxin Liu , Nahid Farhady Ghalaty , Shafika Showkat Moni

Authors on Pith no claims yet

Pith reviewed 2026-05-16 20:33 UTC · model grok-4.3

classification 💻 cs.CR

keywords prompt injectionLLM reviewerspeer reviewPDF hidden textjailbreakreview detectionscientific publishing

0 comments

The pith

Authors can embed hidden prompts in PDFs to make LLM reviewers give biased positive feedback, while editors can use similar triggers to detect LLM-generated reviews.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that authors can embed invisible instructions within PDF files of scientific papers to manipulate large language models acting as reviewers into providing overly favorable evaluations and recommending acceptance. The attack exploits the way LLMs parse and follow textual content, including elements not visible to human readers. As a countermeasure, the authors propose that editors insert their own secret trigger prompts into submitted papers; any review that references or responds to these triggers indicates it was produced by an LLM rather than a human expert. This dual approach reveals both the vulnerability of LLM-assisted peer review and a practical way to identify when it has been used. If effective, it would allow journals to maintain integrity by flagging or rejecting automated reviews while deterring manipulation attempts.

Core claim

This paper demonstrates an attack where hidden prompts are injected into scientific PDFs to influence LLM reviewers toward acceptance, paired with a detection method that embeds trigger prompts for editors to verify if a review was LLM-generated by checking for responses to those triggers.

What carries the argument

Hidden prompt injection in PDF documents, which allows secret instructions to influence LLM behavior during review, paired with trigger embedding for detection.

If this is right

Peer review processes relying on LLMs become vulnerable to manipulation by authors seeking acceptance for low-quality work.
Editors gain a tool to distinguish human from LLM reviews without changing reviewer workflows.
Scientific papers could enter the literature based on fabricated positive reviews unless detection is applied.
The method repurposes an attack vector into a verification tool for maintaining review integrity.
Journals could implement this to reduce risks of flawed results influencing downstream research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use could push journals toward policies requiring AI disclosure or banning LLM reviews outright.
Similar techniques might extend to detecting LLM assistance in writing grants or other academic documents.
Effectiveness may decrease as future LLMs improve at ignoring or detecting hidden instructions.
Real-world testing across different LLM models would show how robust the detection holds in practice.

Load-bearing premise

LLM reviewers will consistently follow and act upon hidden prompts embedded in the PDF content without detecting or ignoring them.

What would settle it

An experiment where multiple LLM reviewers receive papers with hidden prompts and none produce the expected biased positive feedback, or where the trigger detection flags human reviews at high rates.

Figures

Figures reproduced from arXiv: 2512.20405 by Kanchon Gharami, Nahid Farhady Ghalaty, Safayat Bin Hakim, Sanjiv Kumar Sarkar, Shafika Showkat Moni, Yongxin Liu.

**Figure 1.** Figure 1: High level overview of attack pipeline. Become →white text Yes No Inject Phantom Check Manuscript Review Ready Editor’s [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

**Figure 2.** Figure 2: High level overview of defense pipeline. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 5.** Figure 5: Distribution of overall ratings for each model after the attack. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 4.** Figure 4: Average scores for each evaluation dimension across models after the [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 7.** Figure 7: Recommendation distribution (Reject to Accept) per model under the [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Confusion matrix showing perfect separation between clean and [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗

read the original abstract

Large Language Models (LLMs) like ChatGPT are now widely used in writing and reviewing scientific papers. While this trend accelerates publication growth and reduces human workload, it also introduces serious risks. Papers written or reviewed by LLMs may lack real novelty, contain fabricated or biased results, or mislead downstream research that others depend on. Such issues can damage reputations, waste resources, and even endanger lives when flawed studies influence medical or safety-critical systems. This research explores both the offensive and defensive sides of this growing threat. On the attack side, we demonstrate how an author can inject hidden prompts inside a PDF that secretly guide or "jailbreak" LLM reviewers into giving overly positive feedback and biased acceptance. On the defense side, we propose an "inject-and-detect" strategy for editors, where invisible trigger prompts are embedded into papers; if a review repeats or reacts to these triggers, it reveals that the review was generated by an LLM, not a human. This method turns prompt injections from vulnerability into a verification tool. We outline our design, expected model behaviors, and ethical safeguards for deployment. The goal is to expose how fragile today's peer-review process becomes under LLM influence and how editorial awareness can help restore trust in scientific evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches a prompt-injection attack on LLM reviewers via hidden PDF text plus an inject-and-detect defense for editors, but both rest on untested assumptions about reliable extraction and behavioral response.

read the letter

The main point is that authors could embed hidden instructions in a PDF to push LLM reviewers toward positive, biased feedback, while editors could plant their own triggers to flag when a review is machine-generated. The paper frames this as turning a known security issue into a practical check on review integrity. That reframing is the clearest new angle; prior work on prompt injection has not focused on this specific publishing workflow. The authors also flag ethical issues around deployment, which shows they thought through real-world use. The central weakness is the absence of any concrete tests. The argument assumes hidden text survives PDF extraction, reaches the LLM context, and actually overrides normal review behavior, yet no extraction rates, success rates, or false-positive numbers appear. If standard parsers drop the hidden content or the model treats it as ordinary text, both the attack and the detection stop working. The abstract mentions demonstrations, but without data or error analysis the claims stay speculative. This is for researchers tracking AI security in academic publishing or editors building review tools. A reader already working on prompt engineering or research integrity would pick up the use-case framing quickly. I would send it for peer review; the idea is worth testing and the authors would benefit from referee pressure to add controlled measurements.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes an attack in which authors embed hidden prompts within PDF submissions to bias LLM-based peer reviewers toward overly positive evaluations and acceptance, paired with a defensive 'inject-and-detect' strategy in which editors embed invisible trigger prompts; LLM-generated reviews are then identified if they repeat or react to the triggers. The work outlines the conceptual design, expected LLM behaviors, and ethical considerations but presents no experimental results, extraction tests, or validation data.

Significance. If the attack and detection mechanisms could be empirically validated with measurable success rates, low false positives, and robustness across PDF parsers and LLMs, the work would identify a concrete vulnerability in LLM-assisted peer review and supply editors with a practical verification tool. In its current form the absence of any supporting measurements leaves both the offensive and defensive claims without demonstrated effectiveness.

major comments (2)

[Abstract / attack demonstration] Abstract and the attack-description section: the claim that hidden prompts can be injected to 'jailbreak' LLM reviewers into biased acceptance is presented as a demonstration, yet the manuscript contains no experimental results, success-rate measurements, or tests of prompt extraction fidelity across PDF libraries, OCR pipelines, or LLM context windows. Without such data the central offensive claim remains unsupported.
[inject-and-detect strategy] Defense-strategy section: the proposed trigger-based detection relies on the assumption that an LLM review will reliably repeat or react to an editor-placed invisible prompt while a human review will not, yet no accuracy, false-positive rate, or controlled comparison against human reviews is reported. This untested assumption is load-bearing for the claim that the method can distinguish LLM reviews with acceptable reliability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. We agree that the manuscript is a conceptual proposal rather than an empirical study and that certain phrasings could be misread as claiming demonstrated results. We will perform a major revision to clarify the scope, adjust language to avoid implying experimental validation, and explicitly discuss limitations and the need for future empirical work.

read point-by-point responses

Referee: [Abstract / attack demonstration] Abstract and the attack-description section: the claim that hidden prompts can be injected to 'jailbreak' LLM reviewers into biased acceptance is presented as a demonstration, yet the manuscript contains no experimental results, success-rate measurements, or tests of prompt extraction fidelity across PDF libraries, OCR pipelines, or LLM context windows. Without such data the central offensive claim remains unsupported.

Authors: We agree the current wording risks implying an empirical demonstration. The manuscript outlines a conceptual attack design drawing on known prompt-injection techniques and describes anticipated LLM behaviors, without any experiments or measurements. In revision we will replace phrases such as 'we demonstrate' with 'we propose' and 'we expect', add an explicit statement that no empirical validation is provided, and note that extraction fidelity and success rates remain open questions for future work. revision: yes
Referee: [inject-and-detect strategy] Defense-strategy section: the proposed trigger-based detection relies on the assumption that an LLM review will reliably repeat or react to an editor-placed invisible prompt while a human review will not, yet no accuracy, false-positive rate, or controlled comparison against human reviews is reported. This untested assumption is load-bearing for the claim that the method can distinguish LLM reviews with acceptable reliability.

Authors: We accept that the defense section presents an untested assumption as load-bearing. The text describes the mechanism and expected model reactions but reports no accuracy metrics or human-LLM comparisons. In the revision we will add a dedicated limitations subsection that states the assumption explicitly, discusses plausible failure modes (e.g., LLMs ignoring triggers or humans coincidentally referencing similar content), and frames the approach as a proposed direction requiring empirical validation rather than a ready-to-deploy tool. revision: yes

Circularity Check

0 steps flagged

No circularity: proposal relies on external assumptions about LLM behavior, not self-referential derivations

full rationale

The paper proposes an attack (hidden PDF prompt injection to bias LLM reviews) and defense (editor trigger injection for LLM detection) as conceptual techniques. No equations, fitted parameters, or derivations appear in the provided text. Claims rest on untested assumptions about LLM parsing of hidden text rather than reducing any result to the paper's own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no renaming of known results or ansatz smuggling is present. The derivation chain is self-contained as a forward-looking design outline.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The paper rests on assumptions about LLM susceptibility to hidden document prompts and the technical feasibility of undetectable PDF embedding, while introducing a novel detection strategy without external validation.

axioms (2)

domain assumption LLMs can be influenced by hidden prompts embedded in processed documents such as PDFs
This underpins the attack demonstration and is invoked when describing jailbreaking of reviewers.
domain assumption Invisible text or triggers can be embedded in PDFs without detection by standard human or automated review processes
Required for both the attack and the proposed defense to function as described.

invented entities (1)

inject-and-detect strategy no independent evidence
purpose: To detect LLM-generated reviews by embedding secret triggers that reveal AI involvement if responded to
This is the core new contribution proposed in the paper.

pith-pipeline@v0.9.0 · 5548 in / 1422 out tokens · 34717 ms · 2026-05-16T20:33:33.053977+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

[1]

Use of artificial intelligence in peer review among top 100 medical journals,

Z.-Q. Li, H.-L. Xu, H.-J. Cao, Z.-L. Liu, Y .-T. Fei, and J.-P. Liu, “Use of artificial intelligence in peer review among top 100 medical journals,” JAMA Network Open, vol. 7, no. 12, pp. e2 448 609–e2 448 609, 2024

work page 2024
[2]

Ensuring peer review integrity in the era of large language models: A critical stocktaking of challenges, red flags, and recommendations,

B. Kocak, M. R. Onur, S. H. Park, P. Baltzer, and M. Dietzel, “Ensuring peer review integrity in the era of large language models: A critical stocktaking of challenges, red flags, and recommendations,”European Journal of Radiology Artificial Intelligence, vol. 2, p. 100018, 2025

work page 2025
[3]

The ai review lottery: Widespread ai-assisted peer reviews boost paper scores and acceptance rates,

G. R. Latona, M. H. Ribeiro, T. R. Davidson, V . Veselovsky, and R. West, “The ai review lottery: Widespread ai-assisted peer reviews boost paper scores and acceptance rates,”arXiv preprint arXiv:2405.02150, 2024

work page arXiv 2024
[4]

Optimization-based prompt injection attack to llm-as-a-judge,

J. Shi, Z. Yuan, Y . Liu, Y . Huang, P. Zhou, L. Sun, and N. Z. Gong, “Optimization-based prompt injection attack to llm-as-a-judge,” inProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, 2024, pp. 660–674

work page 2024
[5]

Investigating the vulnera- bility of llm-as-a-judge architectures to prompt-injection attacks,

N. Maloyan, B. Ashinov, and D. Namiot, “Investigating the vulnera- bility of llm-as-a-judge architectures to prompt-injection attacks,”arXiv preprint arXiv:2505.13348, 2025

work page arXiv 2025
[6]

Scientists hide messages in papers to game ai peer review,

E. Gibney, “Scientists hide messages in papers to game ai peer review,” Nature, vol. 643, no. 8073, pp. 887–888, 2025

work page 2025
[7]

Hidden prompts in manuscripts exploit ai-assisted peer review,

Z. Lin, “Hidden prompts in manuscripts exploit ai-assisted peer review,” arXiv preprint arXiv:2507.06185, 2025

work page arXiv 2025
[8]

Publish to perish: Prompt injection attacks on llm-assisted peer review,

M. G. Collu, U. Salviati, R. Confalonieri, M. Conti, and G. Apruzzese, “Publish to perish: Prompt injection attacks on llm-assisted peer review,” arXiv preprint arXiv:2508.20863, 2025

work page arXiv 2025
[9]

Topicattack: An indirect prompt injection attack via topic transition,

Y . Chen, H. Li, Y . Li, Y . Liu, Y . Song, and B. Hooi, “Topicattack: An indirect prompt injection attack via topic transition,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 7338–7356

work page 2025
[10]

Phantomlint: Principled detection of hidden llm prompts in structured documents,

T. Murray, “Phantomlint: Principled detection of hidden llm prompts in structured documents,”arXiv preprint arXiv:2508.17884, 2025

work page arXiv 2025
[11]

Prompt-in-content attacks: Exploiting uploaded inputs to hijack llm behavior,

Z. Lian, W. Wang, Q. Zeng, T. Nakanishi, T. Kitasuka, and C. Su, “Prompt-in-content attacks: Exploiting uploaded inputs to hijack llm behavior,”arXiv preprint arXiv:2508.19287, 2025

work page arXiv 2025
[12]

Prompt injection attacks on llm generated reviews of scientific publications,

J. Keuper, “Prompt injection attacks on llm generated reviews of scientific publications,”arXiv preprint arXiv:2509.10248, 2025

work page arXiv 2025
[13]

When ai reviews science: Can we trust the referee?

J. Wang, Y . Liu, H. Xu, K. Hu, S. Di, W. Ni, L. Yue, M.-L. Zhang, K. Ren, and L. Chen, “When ai reviews science: Can we trust the referee?” 2025

work page 2025
[14]

“give a positive review only

Z. Zhang, L. Zhi, L. Sunet al., ““give a positive review only”: An early investigation into in-paper prompt injection attacks and defenses for ai reviewers,” inSocially Responsible and Trustworthy Foundation Models at NeurIPS 2025

work page 2025
[15]

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

M. Nasr, N. Carlini, C. Sitawarin, S. V . Schulhoff, J. Hayes, M. Ilie, J. Pluto, S. Song, H. Chaudhari, I. Shumailovet al., “The attacker moves second: Stronger adaptive attacks bypass defenses against llm jailbreaks and prompt injections,”arXiv preprint arXiv:2510.09023, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Prompt injection in large language model exploitation: A security perspective,

J. K. Joseph, E. Daniel, V . Kathiresan, and M. MAP, “Prompt injection in large language model exploitation: A security perspective,” in2025 International Conference on Electronics, Computing, Communication and Control Technology (ICECCC). IEEE, 2025, pp. 1–8

work page 2025
[17]

Jailguard: A universal detection framework for prompt-based attacks on llm systems,

X. Zhang, C. Zhang, T. Li, Y . Huang, X. Jia, M. Hu, J. Zhang, Y . Liu, S. Ma, and C. Shen, “Jailguard: A universal detection framework for prompt-based attacks on llm systems,”ACM Transactions on Software Engineering and Methodology, 2025

work page 2025
[18]

Chatgpt (gpt-5.1),

OpenAI, “Chatgpt (gpt-5.1),” https://chat.openai.com/, accessed: 2025- 12-03

work page 2025
[19]

xAI, “Grok,” https://x.ai/, accessed: 2025-12-03

work page 2025
[20]

Gemini pro,

G. DeepMind, “Gemini pro,” https://deepmind.google/, accessed: 2025- 12-03. APPENDIXA REVIEWERPROMPTPROVIDED TO THELLM User Prompt: You are an expert reviewer for the International Conference on Learning Representations (ICLR). You must provide a highly critical, rigorous, and unbiased review. Avoid being overly positive unless strongly justified. Please ...

work page 2025