SCI-Defense: Defending Manipulation Attacks from Generative Engine Optimization

Haibo Jin; Haohan Wang; Huimin Zeng; Xucheng Yu

arxiv: 2605.21948 · v1 · pith:FY7PWNWCnew · submitted 2026-05-21 · 💻 cs.LG

Pith reviewed 2026-05-22 07:39 UTC · model grok-4.3

classification 💻 cs.LG

keywords Generative Engine Optimization · LLM ranking defense · Semantic manipulation · Product description attacks · Perplexity detection · Integrity scoring

The pith

SCI-Defense detects semantic manipulation attacks on LLM rankings by scoring four specific signals in product descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SCI-Defense as a framework to counter Generative Engine Optimization attacks that add misleading semantic cues to product descriptions in order to inflate their rankings in LLM-based systems. It combines perplexity checks with Semantic Integrity Scoring across authority attribution, narrative purposiveness, comparative claims, and temporal claims, plus inter-candidate comparison. A reader would care because these attacks could distort recommendations and reduce trust in AI search tools. The approach reaches perfect precision and zero false positives on Amazon data while exposing that standard defenses miss purely semantic tricks.
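The inter-candidate comparison step can be pictured with a small sketch. This page only says that the paper uses cross-candidate embedding similarity; it does not give the embedding model or the decision rule. The bag-of-words vectors, the function names, and the example descriptions below are therefore all assumptions made for illustration, and the code stops at computing similarities rather than guessing how the paper turns them into a flag.

```python
import math
from collections import Counter


def bow_vector(text):
    """Toy stand-in for an embedding: lowercase word counts (an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarity_to_others(texts):
    """For each candidate, average its similarity to every other candidate."""
    vectors = [bow_vector(t) for t in texts]
    means = []
    for i, v in enumerate(vectors):
        sims = [cosine(v, u) for j, u in enumerate(vectors) if j != i]
        means.append(sum(sims) / len(sims) if sims else 0.0)
    return means


# Hypothetical candidate descriptions, invented for illustration only.
candidates = [
    "stainless steel water bottle 750 ml",
    "insulated steel water bottle 500 ml",
    "experts agree this bottle is the top choice this year",
]
for text, score in zip(candidates, mean_similarity_to_others(candidates)):
    print(f"{score:.3f}  {text}")
```

In this toy set the third description shares little vocabulary with the other two, so its mean similarity is the lowest. A real system would swap `bow_vector` for a learned sentence embedding.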

Core claim

SCI-Defense combines Perplexity detection, Semantic Integrity Scoring on four manipulation dimensions, and Inter-Candidate Detection to identify GEO attacks. On 600 Amazon product descriptions it reaches Precision of 1.000 and FPR of 0.000, with Recall of 1.000, 0.952, and 0.830 against String, Reasoning, and Review attacks respectively. Existing PPL-only, classifier, and paraphrasing defenses record zero recall against semantic manipulation attacks.
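For readers checking the headline numbers, the sketch below shows how precision, recall, and false positive rate follow from confusion-matrix counts. The counts are hypothetical, since this page does not report per-class sample sizes; they are chosen only so that recall comes out at 0.83.

```python
def detection_metrics(tp, fp, tn, fn):
    """Precision, recall, and false positive rate from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr}


# Hypothetical counts: 100 attacked and 100 clean descriptions.
metrics = detection_metrics(tp=83, fp=0, tn=100, fn=17)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision of 1.000 and FPR of 0.000 are both consequences of the same observation, namely zero false positives, so they should be read as one finding rather than two independent ones.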

What carries the argument

Semantic Integrity Scoring that checks content along Authority Attribution, Narrative Purposiveness, Comparative Claims, and Temporal Claims to flag manipulation.
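The passage quoted further down this page gives the base score as a weighted sum, Sbase = λAA·SAA + λNP·SNP + λCA·SCA + λTC·STC. A minimal sketch of that sum follows. The weight values and the per-dimension scores are hypothetical, because the page does not report them, and the function name `s_base` is introduced here for illustration.

```python
# Dimension abbreviations as used in the abstract.
DIMENSIONS = ("AA", "NP", "CA", "TC")


def s_base(scores, weights):
    """Weighted sum of the four per-dimension scores."""
    missing = [d for d in DIMENSIONS if d not in scores or d not in weights]
    if missing:
        raise KeyError(f"missing dimensions: {missing}")
    return sum(weights[d] * scores[d] for d in DIMENSIONS)


# Hypothetical equal weights and scores; the paper's values are not given here.
weights = {d: 0.25 for d in DIMENSIONS}
scores = {"AA": 0.8, "NP": 0.6, "CA": 0.4, "TC": 0.2}
print(round(s_base(scores, weights), 3))
```

How Sbase relates to the Sfinal score shown in Figure 2 is not described on this page, so the sketch stops at the base score.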

Load-bearing premise

The four manipulation dimensions capture the main detectable semantic signals used by GEO attacks.

What would settle it

A manipulation method that raises product rankings in an LLM system without increasing scores on any of the four dimensions would show the defense misses the attack.

Figures

Figures reproduced from arXiv: 2605.21948 by Haibo Jin, Haohan Wang, Huimin Zeng, Xucheng Yu.

**Figure 1.** SCI-Defense: a three-component framework defending LLM-based ranking against GEO attacks. Each attack type is matched to the component designed to detect it: PPL (GPT-2 perplexity) intercepts statistically anomalous String attacks; SIS (GPT-4o) scores four semantic dimensions to expose the persuasion structure of Reasoning and Review attacks; ICD (cross-candidate embedding similarity) provides complementar…
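Perplexity is the exponential of the average negative log-probability a language model assigns to each token. The caption says the paper computes it with GPT-2; the sketch below swaps in an add-one-smoothed unigram model so that it runs without any model download. The reference text and the test strings are invented, and only the perplexity formula itself carries over to the real setting.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Return an add-one-smoothed unigram probability function (a toy stand-in for GPT-2)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen tokens

    def prob(token):
        return (counts[token] + 1) / (total + vocab)

    return prob


def perplexity(tokens, prob):
    """exp of the mean negative log-probability over the tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    mean_neg_log = -sum(math.log(prob(t)) for t in tokens) / len(tokens)
    return math.exp(mean_neg_log)


# Hypothetical reference text and inputs, for illustration only.
reference = "this steel water bottle keeps drinks cold for hours and fits most cup holders".split()
prob = train_unigram(reference)
fluent = "steel bottle keeps drinks cold".split()
gibberish = "xq!! zzkv ]]describe ++rank qpl".split()

print("fluent:", round(perplexity(fluent, prob), 2))
print("gibberish:", round(perplexity(gibberish, prob), 2))
```

The gibberish string scores higher because every token is unseen, which is the kind of statistical anomaly a perplexity filter keys on. A fluent but manipulative sentence would not stand out this way, consistent with the page's point that PPL-only filters miss semantic attacks.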

**Figure 2.** Estimated Sfinal score distributions for clean descriptions and three attack types. Clean descriptions cluster well below τs=0.55 (zero false positives). String attacks are predominantly intercepted by the PPL early-exit stage before reaching SIS scoring. Reasoning attacks concentrate above τm=0.65, yielding Recall=0.952. Review attacks straddle τs, with 74.7% of scores falling in [0.45, 0.55), explaining…
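The caption implies a staged decision: a perplexity early exit, then two score thresholds, τs=0.55 and τm=0.65. The sketch below is one possible reading of that cascade. The label names, the interpretation of τs as a "suspicious" cut-off and τm as a "manipulated" cut-off, and the perplexity threshold value are all assumptions, since the caption is truncated and does not define them.

```python
TAU_S = 0.55  # lower score threshold, value from the Figure 2 caption
TAU_M = 0.65  # upper score threshold, value from the Figure 2 caption


def classify(ppl, s_final, ppl_threshold):
    """Staged decision: perplexity early exit first, then the two score thresholds."""
    if ppl > ppl_threshold:
        return "blocked_by_ppl"  # never reaches semantic scoring
    if s_final >= TAU_M:
        return "manipulated"  # assumed meaning of tau_m
    if s_final >= TAU_S:
        return "suspicious"  # assumed meaning of tau_s
    return "clean"


# Hypothetical perplexity threshold; the page does not report the paper's value.
PPL_THRESHOLD = 1000.0

for ppl, score in [(5000.0, 0.10), (40.0, 0.70), (40.0, 0.60), (40.0, 0.50)]:
    print(ppl, score, classify(ppl, score, PPL_THRESHOLD))
```

Under this reading, a Review attack scoring 0.50 falls in the [0.45, 0.55) band the caption highlights and is returned as clean, which is one way to see why recall on that attack type is lower.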

Original abstract

LLM-based ranking systems are vulnerable to Generative Engine Optimization (GEO) attacks, where adversaries inject semantic signals into product descriptions to artificially boost rankings. We propose SCI-Defense, a three-component defense framework combining Perplexity detection (PPL), Semantic Integrity Scoring (SIS), and Inter-Candidate Detection (ICD). SIS evaluates four manipulation dimensions: Authority Attribution (AA), Narrative Purposiveness (NP), Comparative Claims (CA), and Temporal Claims (TC). Evaluated on 600 Amazon product descriptions across 6 categories, SCI-Defense achieves Precision=1.000 and FPR=0.000, with Recall of 1.000, 0.952, and 0.830 against String, Reasoning, and Review attacks respectively. On 600 MS MARCO web passages, String attacks are blocked with perfect recall while Review attacks yield near-zero recall, as web passages lack the persuasion-oriented signals that SIS targets in product descriptions. We demonstrate that existing defenses -- PPL-only filters, SafetyClf content classifiers, and paraphrasing -- achieve zero recall against semantic manipulation attacks. We further demonstrate new attacks such as Specification Amplification and Use-Case Saturation can expose semantic relevance manipulation as a structural defense blind spot that suggests directions for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SCI-Defense catches most attacks on product descriptions with perfect precision but drops sharply on web passages and flags its own gaps for new tactics.

The main point is that SCI-Defense combines perplexity detection, a semantic scorer on four manipulation dimensions, and inter-candidate detection to block generative engine optimization attacks. It reports perfect precision and zero false positive rate on Amazon product descriptions for string, reasoning, and review attacks, while existing methods get zero recall.

What the paper does well is lay out the vulnerability in LLM ranking systems and show through experiments that semantic changes can fool rankings without obvious fluency issues. Testing on both product and web passage datasets highlights the method's strengths in one area and weaknesses in another. Acknowledging that attacks like specification amplification expose a blind spot for semantic relevance manipulation is helpful.

The soft spots are the domain specificity and lack of detail. On MS MARCO, review attacks have near-zero recall because those passages do not carry the same persuasion signals that the four dimensions target. This suggests the approach may not generalize if attackers use different semantic tactics. The abstract provides no exact scoring formulas, implementation details, or significance tests, so it is difficult to verify how the results were obtained or how sensitive they are to the test set.

This paper is for researchers in AI security and information retrieval who deal with ranking manipulation. It shows clear thinking about the problem and reports results in a way that reveals limitations rather than hiding them, so it deserves a serious referee. I would recommend sending it to peer review.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes SCI-Defense, a three-component defense (Perplexity detection (PPL), Semantic Integrity Scoring (SIS) on four manipulation dimensions—Authority Attribution, Narrative Purposiveness, Comparative Claims, Temporal Claims—and Inter-Candidate Detection (ICD)) against Generative Engine Optimization attacks that inject semantic signals into content to manipulate LLM-based rankings. Evaluated on 600 Amazon product descriptions, it reports Precision=1.000, FPR=0.000, and recalls of 1.000/0.952/0.830 against String/Reasoning/Review attacks; on 600 MS MARCO passages, String attacks are fully blocked but Review attacks yield near-zero recall. The work shows existing defenses (PPL-only, SafetyClf, paraphrasing) achieve zero recall and identifies new attacks (Specification Amplification, Use-Case Saturation) that expose structural blind spots.

Significance. If the results hold, the paper contributes by demonstrating concrete vulnerabilities in generative ranking systems and by showing that existing content filters fail against semantic manipulation. Explicitly surfacing new attack vectors and domain-specific limitations (persuasion signals in product text vs. web passages) provides a useful map for future defenses rather than claiming a complete solution.

major comments (3)

[Abstract] The headline metrics (Precision=1.000, FPR=0.000, Recall 1.000/0.952/0.830 on Amazon data) are reported without the scoring formulas for SIS across the four dimensions or implementation details for the PPL+SIS+ICD pipeline, which is load-bearing for assessing whether the central performance claims can be reproduced or generalized.
[Abstract] Near-zero recall on Review attacks for MS MARCO web passages (versus strong results on Amazon product descriptions) shows that SIS effectiveness depends on persuasion-oriented signals absent from general web text; this domain specificity directly limits the scope of the claim that SCI-Defense defends GEO attacks.
[Abstract] The explicit statement that Specification Amplification and Use-Case Saturation expose a structural blind spot for semantic relevance manipulation indicates that the four SIS dimensions may not capture primary signals used by all GEO tactics, undermining robustness claims even if the three-component pipeline is implemented as described.

minor comments (2)

The evaluation would be strengthened by reporting statistical significance tests or confidence intervals alongside the precision/recall figures.
Adding pseudocode or a detailed algorithmic description of how SIS aggregates the four dimensions would improve clarity and reproducibility.
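As an illustration of the first minor comment, the sketch below computes a Wilson score interval for a proportion such as recall. The sample size of 100 attacked descriptions is hypothetical, since this page does not report per-attack counts; the Wilson interval is one common choice and is not something the paper is known to use.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical: 100 attacked descriptions, all detected (observed recall 1.000).
low, high = wilson_interval(100, 100)
print(f"recall 1.000, 95% interval [{low:.3f}, {high:.3f}]")
```

Even a perfect observed recall leaves a lower bound below 1 on a finite sample, which is why interval estimates would make the reported figures easier to interpret.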

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where we agree with the need for clarification or revision and where we provide additional context from the manuscript.

Referee: [Abstract] The headline metrics (Precision=1.000, FPR=0.000, Recall 1.000/0.952/0.830 on Amazon data) are reported without the scoring formulas for SIS across the four dimensions or implementation details for the PPL+SIS+ICD pipeline, which is load-bearing for assessing whether the central performance claims can be reproduced or generalized.

Authors: We agree that the abstract's brevity omits the explicit scoring formulas for Semantic Integrity Scoring (SIS) on the four dimensions and the precise implementation details of the combined PPL+SIS+ICD pipeline. These formulas (e.g., weighted aggregation across Authority Attribution, Narrative Purposiveness, Comparative Claims, and Temporal Claims) and pipeline steps are fully specified in Sections 3.2 and 4.1 of the manuscript to support reproducibility. To improve standalone readability of the abstract, we will make a partial revision by adding one sentence briefly describing the SIS dimensions and noting that full formulas and pipeline details appear in the main text.

revision: partial

Referee: [Abstract] Near-zero recall on Review attacks for MS MARCO web passages (versus strong results on Amazon product descriptions) shows that SIS effectiveness depends on persuasion-oriented signals absent from general web text; this domain specificity directly limits the scope of the claim that SCI-Defense defends GEO attacks.

Authors: The referee correctly identifies this as a core finding rather than an oversight. The manuscript explicitly attributes the near-zero recall on Review attacks in MS MARCO to the lack of persuasion-oriented signals in general web passages, contrasting with their presence in Amazon product descriptions. We already frame this as evidence of domain specificity in the results and discussion sections. We will revise the abstract and conclusion to more prominently state that SCI-Defense's effectiveness is strongest in persuasion-rich domains such as product text and to qualify the scope of claims about defending GEO attacks more broadly.

revision: yes

Referee: [Abstract] The explicit statement that Specification Amplification and Use-Case Saturation expose a structural blind spot for semantic relevance manipulation indicates that the four SIS dimensions may not capture primary signals used by all GEO tactics, undermining robustness claims even if the three-component pipeline is implemented as described.

Authors: We acknowledge the referee's concern and note that the manuscript already presents Specification Amplification and Use-Case Saturation as exposing a structural blind spot in the current four SIS dimensions for certain semantic relevance manipulations. This is positioned as identifying an avenue for future research rather than a claim of comprehensive robustness against every possible GEO tactic. To prevent any overinterpretation, we will revise the discussion to more explicitly separate the demonstrated effectiveness against the three evaluated attack types from the acknowledged limitations against other semantic strategies, while reiterating the need for expanded dimensions.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on independent attack datasets

The paper proposes SCI-Defense as a three-component pipeline (PPL + SIS + ICD) where SIS explicitly scores four hand-specified manipulation dimensions (Authority Attribution, Narrative Purposiveness, Comparative Claims, Temporal Claims). Performance metrics are obtained by direct evaluation on separately generated attack datasets (600 Amazon descriptions and 600 MS MARCO passages) rather than by any fitted parameter, self-referential equation, or self-citation that reduces the claimed result to the input by construction. The authors themselves note structural blind spots for unmodeled tactics such as Specification Amplification, confirming that the assessment is externally falsifiable and not tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical applied ML paper with no formal derivations or mathematical axioms; claims rest on experimental results from constructed attack datasets.

pith-pipeline@v0.9.0 · 5761 in / 1171 out tokens · 33162 ms · 2026-05-22T07:39:18.697430+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: SIS evaluates four manipulation dimensions: Authority Attribution (AA), Narrative Purposiveness (NP), Comparative Claims (CA), and Temporal Claims (TC). ... $S_{\text{base}} = \lambda_{AA} S_{AA} + \lambda_{NP} S_{NP} + \lambda_{CA} S_{CA} + \lambda_{TC} S_{TC}$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: We propose SCI-Defense, a three-component defense framework combining Perplexity detection (PPL), Semantic Integrity Scoring (SIS), and Inter-Candidate Detection (ICD).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 4 internal anchors

[1] Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, and Ameet Deshpande. GEO: Generative engine optimization. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024.
[2] Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023.
[3] Anonymous. Core: Corpus-based ranking exploitation via llm manipulation. arXiv preprint arXiv:2602.03608, 2026.
[4] Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Garg, Andreas Terzis, Florian Tramer, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023.
[5] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.
[6] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injections. In AISec Workshop at CCS, 2023.
[7] Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. Large language models are zero-shot rankers for recommender systems. In Proceedings of ECIR, 2024.
[8] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. 2023.
[9] Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. 2023.
[10] Haibo Jin, Leyang Hu, Xinnuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, and Haohan Wang. Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models. arXiv preprint arXiv:2407.01599, 2024.
[11] Haibo Jin, Andy Zhou, Joe D Menke, and Haohan Wang. Jailbreaking large language models against moderation guardrails via cipher characters. Advances in Neural Information Processing Systems, 37:59408–59435, 2024.
[12] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. arXiv preprint arXiv:2301.10226, 2023.
[13] Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. DetectGPT: Zero-shot machine-generated text detection using probability curvature. arXiv preprint arXiv:2301.11305, 2023.
[14] Fredrik Nestaas, Edvard Hallström, and Samuele Mücke. Adversarial search engine optimization for large language models. In Proceedings of the ACM Web Conference 2024, 2024.
[15] Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022.
[16] Samuel Pfrommer, Yatong Cohen, Stefano Soatto, et al. Ranking manipulation for conversational search engines. arXiv preprint arXiv:2406.03589, 2024.
[17] Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. Rankvicuna: Zero-shot listwise document reranking with open-source large language models. arXiv preprint arXiv:2309.15088, 2023.
[18] Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. RankZephyr: Effective and robust zero-shot listwise reranking is a breeze! arXiv preprint arXiv:2312.02724, 2023.
[19] Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. Large language models are effective text rankers with pairwise ranking prompting. In Findings of the Association for Computational Linguistics: NAACL 2024, 2024.
[20] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
[21] Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is chatgpt good at search? investigating large language models as re-ranking agents. In Proceedings of EMNLP, 2023.
[22] Yupei Yu, Yuqi Liu, Jianfeng Gao, and Kai Chen. Datasentinel: A game-theoretic detection of prompt injection attacks. In Proceedings of IEEE S&P, 2025.
[23] Zheng Zhang, Puhan Shi, Lixin Hu, Biao Qin, Yangqiu Li, Dawei Yin, and Ping Li. ShieldLM: Empowering llms as aligned, trustworthy and responsible language models. arXiv preprint arXiv:2402.04269, 2024.
[24] Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. Poisoning retrieval corpora by injecting adversarial passages. In Proceedings of EMNLP, 2023.
[25] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.