Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges
Pith reviewed 2026-06-30 00:19 UTC · model grok-4.3
The pith
A black-box bandit can discover style edits that boost LLM judge scores by 1-2 points without changing meaning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BITE casts the selection of semantics-preserving stylistic edits as a contextual bandit problem and uses a LinUCB policy to adaptively choose edits that maximize the judge's score. Tested across diverse LLM judges and tasks including pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks, it achieves an attack success rate exceeding 65 percent and raises scores by 1-2 points on a 9-point scale while preserving semantic equivalence. The method requires no access to model parameters or gradients.
What carries the argument
BITE, which treats stylistic edit selection as a contextual bandit problem solved by a LinUCB policy to identify score-inflating edits.
Load-bearing premise
Semantics-preserving stylistic edits exist that can reliably and substantially alter LLM judge scores, and a black-box contextual bandit can efficiently identify such edits across diverse judges and tasks.
What would settle it
A trial in which the bandit explores many edits on new judges and tasks yet the attack success rate stays near zero would show the claimed vulnerability is not general.
Figures
read the original abstract
The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a black-box adversarial framework that learns semantics-preserving edits to mislead an LLM judge and artificially inflate the scores it assigns. We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge's score without access to model parameters or gradients. Empirically, we test BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks. BITE achieves an attack success rate exceeding 65% and raises scores by 1-2 points on a 9-point scale, all while preserving semantic equivalence. We further assess the attack's stealthiness, showing that BITE evades standard style-control methods and several detection baselines. Our findings expose a fundamental weakness in the LLM-as-a-judge paradigm and motivate robust, attack-aware evaluation. Our code is available at https://github.com/xianglinyang/llm-as-a-judge-attack.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BITE, a black-box adversarial framework that formulates selection of semantics-preserving stylistic edits as a contextual bandit problem solved via LinUCB. The edits are intended to exploit known stylistic biases in LLM judges (e.g., verbosity preference) to inflate assigned scores on pointwise and pairwise tasks drawn from chatbot leaderboards and AI-reviewer benchmarks. The central empirical claim is an attack success rate exceeding 65% together with 1-2 point score lifts on a 9-point scale while semantic equivalence is preserved and standard detection baselines are evaded. Code is released.
Significance. If the empirical results are reproducible and statistically supported, the work is significant because it supplies a concrete, adaptive, black-box attack that turns documented LLM-judge biases into a practical vulnerability. The bandit formulation for edit selection is a reasonable technical choice for the black-box setting and the evaluation across both pointwise and pairwise protocols broadens the scope. Releasing code is a positive factor for verifiability.
major comments (2)
- [Abstract / Experimental Evaluation] Abstract and Experimental Evaluation section: the reported aggregate ASR >65% and 1-2 point score lifts are presented without any information on the number of independent trials per judge-task combination, variance across runs, statistical testing, or the concrete procedure used to verify semantic equivalence after each edit. These omissions are load-bearing for the central claim that the bandit reliably discovers semantics-preserving edits.
- [Method] Method section: the reward function and context features for the LinUCB policy are not specified in sufficient detail to determine how the bandit balances score maximization against the semantic-equivalence constraint; without this, it is impossible to assess whether the reported success rates could be achieved by chance or by an unstated oracle.
minor comments (1)
- [Stealthiness Evaluation] The abstract states that BITE 'evades standard style-control methods' but does not name the specific baselines or report their detection rates; this should be clarified in the stealthiness subsection.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We agree that additional experimental details and methodological specifications are necessary to strengthen the paper and will revise accordingly.
read point-by-point responses
-
Referee: [Abstract / Experimental Evaluation] Abstract and Experimental Evaluation section: the reported aggregate ASR >65% and 1-2 point score lifts are presented without any information on the number of independent trials per judge-task combination, variance across runs, statistical testing, or the concrete procedure used to verify semantic equivalence after each edit. These omissions are load-bearing for the central claim that the bandit reliably discovers semantics-preserving edits.
Authors: We agree these details are essential for reproducibility and credibility of the central claims. In the revised manuscript we will add: (i) the exact number of independent trials per judge-task combination, (ii) variance measures (standard deviation or confidence intervals) across runs, (iii) results of statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) on the reported score lifts, and (iv) a precise description of the semantic-equivalence verification procedure, which combines an automated similarity threshold with targeted manual review of borderline cases. revision: yes
-
Referee: [Method] Method section: the reward function and context features for the LinUCB policy are not specified in sufficient detail to determine how the bandit balances score maximization against the semantic-equivalence constraint; without this, it is impossible to assess whether the reported success rates could be achieved by chance or by an unstated oracle.
Authors: We acknowledge the need for greater transparency in the bandit formulation. The revised Method section will explicitly define the reward function (judge score as primary reward, with a multiplicative penalty or hard constraint for edits failing the semantic-equivalence check) and enumerate the context features (edit category, estimated semantic similarity score, historical reward statistics, and task type). These additions will make clear that the policy operates without an oracle and that success rates arise from the documented exploration-exploitation mechanism. revision: yes
Circularity Check
No significant circularity
full rationale
The paper presents an empirical attack framework (BITE) that casts stylistic edit selection as a contextual bandit problem solved via LinUCB and validates it through experiments on LLM judges. There are no mathematical derivations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations that reduce the central claims to their own inputs by construction. The reported attack success rates and score improvements are direct experimental outcomes, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
URL https: //aclanthology.org/2025.acl-long.897/
doi: 10.18653/v1/2025.acl-long.897. URL https: //aclanthology.org/2025.acl-long.897/. Couto, P. H., Ho, Q. P., Kumari, N., Rachmat, B. K., Khuong, T. G. H., Ullah, I., and Sun-Hosoya, L. Relevai- reviewer: A benchmark on ai reviewers for survey paper relevance, 2024. URL https://arxiv.org/abs/ 2406.10294. Doddapaneni, S., Khan, M. S. U. R., Verma, S., and...
-
[2]
emnlp-main.911/
URL https://aclanthology.org/2024. emnlp-main.911/. Dubois, Y ., Liang, P., and Hashimoto, T. Length-controlled alpacaeval: A simple debiasing of automatic evalu- ators. InFirst Conference on Language Modeling,
2024
-
[3]
Ellison, M
URL https://openreview.net/forum? id=CybBmzWBX0. Ellison, M. Aaai launches ai-powered peer review assess- ment system, May 2025. URL https://aaai.org/ aaai-launches-ai-powered-peer-review-assessment-system/ . Feuer, B., Goldblum, M., Datta, T., Nambiar, S., Be- saleli, R., Dooley, S., Cembalest, M., and Dickerson, J. P. Style outweighs substance: Failure ...
2025
-
[4]
URL https://openreview.net/forum? id=MzHNftnAM1. Foster, D. J., Gentile, C., Mohri, M., and Zimmert, J. Adapt- ing to misspecification in contextual bandits.Advances in Neural Information Processing Systems, 33:11478– 11489, 2020. Ghosh, A., Chowdhury, S. R., and Gopalan, A. Misspecified linear bandits, 2017. URL https://arxiv.org/ abs/1704.06880. Huang, ...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/1772690.1772758 2020
-
[5]
Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.naacl-long
-
[6]
URL https://aclanthology.org/2025. naacl-long.15/. 10 Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y ., and Karbasi, A. Tree of attacks: Jailbreaking black-box llms automatically, 2023. OpenAI. Prompt injection detection, 2025. URL https://open...
-
[7]
URL https://aclanthology.org/2024. emnlp-main.427/. Shi, J., Yuan, Z., Liu, Y ., Huang, Y ., Zhou, P., Sun, L., and Gong, N. Z. Optimization-based prompt in- jection attack to llm-as-a-judge. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, CCS ’24, pp. 660–674, New York, NY , USA, 2024a. Association for Com- put...
-
[8]
Fallacy Oversight Bias
URL https://openreview.net/forum? id=syThiTmWWm. 11 Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges A. Details on Stylistic Edits A.1. Stylistic Bias Collection This section provides a detailed overview of the specific biases considered and analyzed in our study. These biases represent systematic tendencies in Large Language...
2024
-
[9]
To Add Structure: If the text has an implicit title or a new section, use a Header (‘#‘) to make the structure explicit
-
[10]
To Emphasize Importance: If a word or phrase represents a critical point, a key takeaway, or a term of high significance, use Bold (‘**text**‘)
-
[11]
To Add Nuance or Stylistic Emphasis: If a word needs a softer emphasis, represents an internal thought, is a title (of a book, etc.), or is a foreign term, use Italics (‘*text*‘)
-
[12]
To Show a Correction or Changed Thought: If a part of the text seems like a self-correction or a dismissed idea, use Strikethrough (‘ text ‘)
-
[13]
To Clarify or Add an Aside: If a phrase provides extra context, a clarification, or a supplementary detail that slightly breaks the flow of the main sentence, enclose it in Parentheses ‘(text)‘ or Brackets ‘[text]‘
-
[14]
To Highlight a Statement: If a sentence stands out as a powerful quote or a central thesis, you can set it apart as a Blockquote (‘¿ text‘). #### Available Markdown Toolbox - Headers (‘#‘, ‘##‘, etc.) - Bold (‘**text**‘) - Italics (‘*text*‘) - Strikethrough (‘ text ‘) - Brackets (‘[text]‘) and Parentheses (‘(text)‘) - Blockquotes (‘¿ text‘) - Inline Quote...
2020
-
[15]
from the ’About this space’ section
Emoji Bias (+1 Score) Prompt:”Extract the address... from the ’About this space’ section... ” Base Answer (Score: 6) The address... is not provided... However, it is located in Grass Valley, CA. BITE Answer (Score: 7) ...However, it is located in Grass Valley, CA.[house emoji]
-
[16]
could you tell me a bit about the key experiences
Markdown/Formatting Bias (+2 Score) Prompt:”Hi, can you help me write my memoir?” Base Answer (Score: 7) Absolutely, I’d love to help... could you tell me a bit about the key experiences... BITE Answer (Score: 9) Absolutely, I’d love to help youwrite your memoir! ... tell me a bit about thekey experiences or themes
-
[17]
Structured Output Bias (JSON, +1 Score) Prompt:”Notify me of any suggestions... for making the text more grammatically correct. ” Base Answer (Score: 8) [Standard paragraph explaining the grammar fix]. BITE Answer (Score: 9) {"Sentence": "...", "Suggested Change": "..."} 20 Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges B.6...
-
[18]
Understand the task and criteria: Familiarize yourself with the task to be evaluated
-
[19]
Examine the output generated from completing the task
Review the inputs and output: Look at the inputs provided for the task. Examine the output generated from completing the task
-
[20]
Write verbal feedback justifying your evaluation that includes a detailed rationale, referring to specific aspects of the output and comparing them to the rubric
-
[21]
feedback
Assign a final score from 1 to 9. 1 is the lowest score and 9 is the highest score. ## FORMAT FOR THE EV ALUATION {{ “feedback”: “Write the verbal explanation of the score here.”, “score”: “Give the numeric score from 1 to 9 here. 1 is the lowest score and 9 is the highest score.” }} Please accurately evaluate the task. Strictly adhere to the evaluation i...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.