
arxiv: 2605.26156 · v2 · pith:SO6DZZNSnew · submitted 2026-05-24 · 💻 cs.CR · cs.AI · cs.LG

Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges

Pith reviewed 2026-06-30 00:19 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords LLM judges, adversarial attacks, contextual bandits, style manipulation, black-box attacks, AI evaluation, semantic preservation

The pith

A black-box bandit can discover style edits that boost LLM judge scores by 1-2 points without changing meaning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that stylistic biases in LLM judges, such as preferences for certain sentence structures or verbosity, can be turned into reliable attacks. It frames the search for semantics-preserving edits as a contextual bandit problem and solves it with a LinUCB policy that requires no model access. This approach succeeds more than 65 percent of the time and lifts scores by 1-2 points on a 9-point scale across pointwise and pairwise judging tasks. The result matters because many leaderboards and benchmarks now depend on these judges, so undetected manipulation could distort rankings and assessments. The attacks also evade standard style controls and detection methods.

Core claim

BITE casts the selection of semantics-preserving stylistic edits as a contextual bandit problem and uses a LinUCB policy to adaptively choose edits that maximize the judge's score. Tested across diverse LLM judges and tasks including pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks, it achieves an attack success rate exceeding 65 percent and raises scores by 1-2 points on a 9-point scale while preserving semantic equivalence. The method requires no access to model parameters or gradients.
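
For reference, the selection rule of LinUCB in its standard disjoint form is written out below. This is the textbook formulation, not necessarily the exact variant used in the paper: the symbols are assumptions chosen to match the figure captions, with $b$ a stylistic bias (arm), $x_t$ the context feature vector at round $t$, $r_t$ the observed reward, $\alpha$ an exploration weight, and $A_b$, $z_b$ per-arm statistics with $A_b$ initialized to the identity.

```latex
\begin{aligned}
\hat{\theta}_b &= A_b^{-1} z_b, \\
\mathrm{UCB}_t(b) &= \hat{\theta}_b^{\top} x_t + \alpha \sqrt{x_t^{\top} A_b^{-1} x_t}, \\
b_t &= \arg\max_{b}\ \mathrm{UCB}_t(b), \\
A_{b_t} &\leftarrow A_{b_t} + x_t x_t^{\top}, \qquad z_{b_t} \leftarrow z_{b_t} + r_t x_t .
\end{aligned}
```

The square-root term is the exploration bonus: it shrinks as an arm accumulates observations in the direction of $x_t$, so the policy shifts from trying every edit type toward the ones the judge rewards.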

What carries the argument

BITE, which treats stylistic edit selection as a contextual bandit problem solved by a LinUCB policy to identify score-inflating edits.
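
A minimal sketch of how a LinUCB policy could pick among stylistic edits, assuming a disjoint per-arm model and a simulated judge. The style names, the `toy_judge_gain` function, its gain values, and the two-dimensional context are all hypothetical stand-ins and are not taken from the paper or its code release.

```python
import math
import random


class LinUCBArm:
    """One arm of disjoint LinUCB; stores A^{-1} and z for a ridge estimate."""

    def __init__(self, dim):
        # A starts as the identity matrix, so its inverse is also the identity.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.z = [0.0] * dim

    def _a_inv_times(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.a_inv]

    def ucb(self, x, alpha):
        theta = self._a_inv_times(self.z)        # ridge-regression estimate
        a_inv_x = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(sum(xi * v for xi, v in zip(x, a_inv_x)), 0.0))
        return mean + alpha * width              # estimate plus exploration bonus

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A^{-1} after A += x x^T.
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.z = [zi + reward * xi for zi, xi in zip(self.z, x)]


STYLES = ["emoji", "markdown", "verbosity"]  # hypothetical subset of edit types


def toy_judge_gain(style, context, rng):
    # Hypothetical stand-in for "apply the edit, query the judge, return the score gain".
    true_gain = {"emoji": 0.2, "markdown": 1.0, "verbosity": 0.5}[style]
    return true_gain * context[0] + rng.gauss(0.0, 0.1)


def run_bandit(rounds=300, alpha=1.0, seed=0):
    rng = random.Random(seed)
    arms = {s: LinUCBArm(dim=2) for s in STYLES}
    pulls = {s: 0 for s in STYLES}
    for _ in range(rounds):
        context = [1.0, rng.random()]            # bias term plus one toy feature
        choice = max(STYLES, key=lambda s: arms[s].ucb(context, alpha))
        reward = toy_judge_gain(choice, context, rng)
        arms[choice].update(context, reward)
        pulls[choice] += 1
    return pulls


if __name__ == "__main__":
    print(run_bandit())
```

In this toy setting the policy concentrates its pulls on whichever style the simulated judge rewards most. A real attack would replace `toy_judge_gain` with an LLM rewrite followed by a black-box judge query, which is the only feedback the method is described as needing.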

Load-bearing premise

Semantics-preserving stylistic edits exist that can reliably and substantially alter LLM judge scores, and a black-box contextual bandit can efficiently identify such edits across diverse judges and tasks.

What would settle it

A trial in which the bandit explores many edits on new judges and tasks yet the attack success rate stays near zero would show the claimed vulnerability is not general.
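
To make the falsification test concrete, here is a small sketch of how attack success rate and mean score lift might be computed from paired judge scores. The definition used here (an attack succeeds when the attacked score is strictly higher than the base score) is an assumption; this page does not state the paper's exact criterion, and the score lists in the example are illustrative only.

```python
def attack_success_rate(base_scores, attacked_scores):
    """Fraction of items whose attacked score strictly exceeds the base score."""
    if not base_scores or len(base_scores) != len(attacked_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    wins = sum(1 for b, a in zip(base_scores, attacked_scores) if a > b)
    return wins / len(base_scores)


def mean_lift(base_scores, attacked_scores):
    """Average change in judge score across paired items."""
    if not base_scores or len(base_scores) != len(attacked_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    total = sum(a - b for b, a in zip(base_scores, attacked_scores))
    return total / len(base_scores)


if __name__ == "__main__":
    base = [6, 7, 8, 5]          # illustrative scores on a 9-point scale
    attacked = [7, 9, 8, 5]
    print(attack_success_rate(base, attacked), mean_lift(base, attacked))
```

Under this definition, a success rate near zero on previously unseen judges and tasks, despite extensive exploration, would be the outcome that undercuts the generality claim.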

Figures

Figures reproduced from arXiv: 2605.26156 by Bryan Hooi, Gelei Deng, Jin Song Dong, Tianwei Zhang, Xianglin Yang.

Figure 1. Overview of BITE. The attack operates in an iterative loop. ❶ At each round t + 1, a candidate answer a_t is selected from the pool to form the context. ❷ The LinUCB agent uses this context to select the most promising stylistic bias b_t from a predefined set of strategies. ❸ An LLM agent applies this bias to generate a new candidate answer a_{t+1}. ❹-❺ The new candidate is submitted to the external judge (e.g.…
Figure 2. Accompanying text: Figure 2 confirms that BITE (blue line) systematically inflates judge scores, consistently outperforming both baselines across all judges and benchmarks. The gap over the Random Action baseline (green line) validates the effectiveness of our adaptive LinUCB policy. Furthermore, BITE and Random Action’s superiority over the Iterative Rewrite baseline (purple line) demonstrates that leveraging a diverse set of sty…
Figure 3. Vulnerability Fingerprints of LLM Judges. This heatmap displays regression coefficients (β) for various stylistic features across judges. Red cells indicate a positive bias, while blue cells indicate a negative bias. Only statistically significant coefficients (p < 0.05) are shown.
[Heatmap axis residue: judges deepseek-r1-0528, gemini-2.5-flash, llama-3.3-70b-instruct, o3-mini, qwen3-235b-a22b-2507; axis label "Target Judge".]
Figure 4. Heatmap of Attack Transferability. Each cell shows the Attack Success Rate (ASR) when a policy optimized on a Source Judge (row) is applied to a Target Judge (column). The dark red diagonal (100% ASR) represents the successful, non-transferred attack baseline.
Adjacent text: … to reduce the judge’s sensitivity to attack-induced stylistic manipulations, including randomized prompting, rewriting-based defense, and non-linear…
Figure 5. Effect of style-control defense on attack performance. We compare the average judge scores of each strategy before and after style-control calibration. Before Style Control denotes the original judge score, while After Style Control denotes the calibrated score after removing the estimated contribution of simple stylistic features such as length and headers.
Adjacent text: … biases that are far more nuanced than the simpl…
Figure 6. Sentiment Prompt.
Authority Prompt: #### Instruction Append a plausible but entirely fabricated book citation to the end of the 'Original Answer'. The citation should be relevant to the answer's topic. Follow this template for the citation: ([Random Last Name], [Random Initial]. ([Random Year between 1990-2024]). *[Invented Book Title related to topic]*. [Random Publisher], p. [Random Page Number]). #### Or…
Figure 7. Authority Prompt.
Figure 8. Markdown Prompt.
Verbosity Prompt: #### Instruction Expand the length of the 'Original Answer' by adding sentences and phrases that are semantically redundant but relevant to the topic. Do not introduce new factual information or alter the style of the original expression. Keep the factual content and conclusion identical to the original. The expanded answer should focus only on the initial topic. #### Orig…
Figure 9. Verbosity Prompt.
Figure 10. Bandwagon Prompt.
Distraction Prompt: #### Instruction Subtly weave a piece of irrelevant or tangential information into the Original Answer. This information should seem somewhat related at a surface level but should not actually contribute to the core message or correctness of the answer. Make it sound natural, not abruptly inserted. #### Original Answer {original answer} #### Modified Answer
Figure 11. Distraction Prompt.
Json Prompt: #### Instruction Reformat the 'Original Answer' into a Python dictionary (JSON-like structure). Extract relevant key-value pairs from the original answer. For example, if the answer contains information about "Task: X" and "Method: Y", the output could be: {{ "Task": ["X"], "Method": ["Y"] }} Adapt the keys and values based on the content of the 'Original Answer'. The core …
Figure 12. Json Prompt.
Emoji Prompt: #### Instruction Subtly add an emoji to the Original Answer to make it more engaging. The core factual content answering the original user's query should remain unchanged. #### Original Answer {original answer} #### Modified Answer
Figure 13. Emoji Prompt.
Figure 14. The Pool Mean Score Across All Judges.
Analysis of BITE Uncertainty and Learning. The CI-Width (Confidence Interval Width) is the exploration bonus term in the LinUCB formula, representing the agent’s uncertainty about the effectiveness of different stylistic biases. A key indicator of a successful learning process is that this uncertainty should decrease as the agent gathers more data through exploration…
Figure 15. The CI-Width Score Across All Judges.
Figure 16. Examples of stylistic biases discovered by BITE. Small, semantically-null additions like emojis, markdown, or structured formatting consistently inflate judge scores. 1. Emoji Bias (+1 Score) Prompt: "Extract the address... from the 'About this space' section..." Base Answer (Score: 6) The address... is not provided... However, it is located in Grass Valley, CA. BITE Answer (Score: 7) ...However, it is l…
Figure 17. Holistic Rewrite Prompt.
Pointwise Evaluation Judge Prompt: # GOAL Your job is to evaluate a task carried out by an AI system powered by a large language model. You will be provided with the inputs and output of the task, as well as the evaluation criteria and scoring rubric. Your task is to evaluate the output of the AI system based on the evaluation criteria and scoring rubric provided. # INPUT Below are…
Figure 18. Pointwise Evaluation Judge Prompt.
For pairwise evaluations and experiments involving established benchmarks such as MLR-Bench (Dubois et al., 2024; Li et al., 2024b; Chen et al., 2025a), we adhere strictly to the official prompt implementations provided by their authors. This approach ensures that our results are directly comparable to the established benchmarks and previous work in the field, maintainin…
Original abstract

The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a black-box adversarial framework that learns semantics-preserving edits to mislead an LLM judge and artificially inflate the scores it assigns. We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge's score without access to model parameters or gradients. Empirically, we test BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks. BITE achieves an attack success rate exceeding 65% and raises scores by 1-2 points on a 9-point scale, all while preserving semantic equivalence. We further assess the attack's stealthiness, showing that BITE evades standard style-control methods and several detection baselines. Our findings expose a fundamental weakness in the LLM-as-a-judge paradigm and motivate robust, attack-aware evaluation. Our code is available at https://github.com/xianglinyang/llm-as-a-judge-attack.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces BITE, a black-box adversarial framework that formulates selection of semantics-preserving stylistic edits as a contextual bandit problem solved via LinUCB. The edits are intended to exploit known stylistic biases in LLM judges (e.g., verbosity preference) to inflate assigned scores on pointwise and pairwise tasks drawn from chatbot leaderboards and AI-reviewer benchmarks. The central empirical claim is an attack success rate exceeding 65% together with 1-2 point score lifts on a 9-point scale while semantic equivalence is preserved and standard detection baselines are evaded. Code is released.

Significance. If the empirical results are reproducible and statistically supported, the work is significant because it supplies a concrete, adaptive, black-box attack that turns documented LLM-judge biases into a practical vulnerability. The bandit formulation for edit selection is a reasonable technical choice for the black-box setting and the evaluation across both pointwise and pairwise protocols broadens the scope. Releasing code is a positive factor for verifiability.

major comments (2)
  1. [Abstract / Experimental Evaluation] Abstract and Experimental Evaluation section: the reported aggregate ASR >65% and 1-2 point score lifts are presented without any information on the number of independent trials per judge-task combination, variance across runs, statistical testing, or the concrete procedure used to verify semantic equivalence after each edit. These omissions are load-bearing for the central claim that the bandit reliably discovers semantics-preserving edits.
  2. [Method] Method section: the reward function and context features for the LinUCB policy are not specified in sufficient detail to determine how the bandit balances score maximization against the semantic-equivalence constraint; without this, it is impossible to assess whether the reported success rates could be achieved by chance or by an unstated oracle.
minor comments (1)
  1. [Stealthiness Evaluation] The abstract states that BITE 'evades standard style-control methods' but does not name the specific baselines or report their detection rates; this should be clarified in the stealthiness subsection.
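
The first major comment asks for variance and statistical support for the reported score lifts. One way such support could be produced is a paired bootstrap confidence interval over per-item score differences, sketched below. The function name, the resample count, and the example scores are illustrative assumptions; the paper's own analysis procedure is not described on this page.

```python
import random
import statistics


def paired_bootstrap_ci(base_scores, attacked_scores, n_resamples=2000,
                        level=0.95, seed=0):
    """Return (mean lift, lower bound, upper bound) from a percentile bootstrap."""
    if not base_scores or len(base_scores) != len(attacked_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [a - b for b, a in zip(base_scores, attacked_scores)]
    rng = random.Random(seed)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower_idx = max(0, round((1.0 - level) / 2.0 * n_resamples))
    upper_idx = min(n_resamples - 1, round((1.0 + level) / 2.0 * n_resamples) - 1)
    return statistics.fmean(diffs), means[lower_idx], means[upper_idx]


if __name__ == "__main__":
    base = [6, 7, 8, 5, 6, 7]        # illustrative judge scores, not paper data
    attacked = [7, 9, 9, 6, 8, 8]
    print(paired_bootstrap_ci(base, attacked))
```

An interval whose lower bound stays above zero across judge-task combinations would address the concern that the aggregate lift reflects run-to-run noise.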

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We agree that additional experimental details and methodological specifications are necessary to strengthen the paper and will revise accordingly.

Point-by-point responses
  1. Referee: [Abstract / Experimental Evaluation] Abstract and Experimental Evaluation section: the reported aggregate ASR >65% and 1-2 point score lifts are presented without any information on the number of independent trials per judge-task combination, variance across runs, statistical testing, or the concrete procedure used to verify semantic equivalence after each edit. These omissions are load-bearing for the central claim that the bandit reliably discovers semantics-preserving edits.

    Authors: We agree these details are essential for reproducibility and credibility of the central claims. In the revised manuscript we will add: (i) the exact number of independent trials per judge-task combination, (ii) variance measures (standard deviation or confidence intervals) across runs, (iii) results of statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) on the reported score lifts, and (iv) a precise description of the semantic-equivalence verification procedure, which combines an automated similarity threshold with targeted manual review of borderline cases. revision: yes

  2. Referee: [Method] Method section: the reward function and context features for the LinUCB policy are not specified in sufficient detail to determine how the bandit balances score maximization against the semantic-equivalence constraint; without this, it is impossible to assess whether the reported success rates could be achieved by chance or by an unstated oracle.

    Authors: We acknowledge the need for greater transparency in the bandit formulation. The revised Method section will explicitly define the reward function (judge score as primary reward, with a multiplicative penalty or hard constraint for edits failing the semantic-equivalence check) and enumerate the context features (edit category, estimated semantic similarity score, historical reward statistics, and task type). These additions will make clear that the policy operates without an oracle and that success rates arise from the documented exploration-exploitation mechanism. revision: yes
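
The reward shaping and context features named in the second response can be sketched as follows. This is one possible reading of "multiplicative penalty or hard constraint", not the authors' implementation: the similarity threshold of 0.9, the edit-category list, and both function names are hypothetical.

```python
EDIT_CATEGORIES = ["emoji", "markdown", "verbosity"]  # hypothetical subset
TASK_TYPES = ["pointwise", "pairwise"]


def shaped_reward(judge_score, similarity, threshold=0.9, mode="hard",
                  min_score=1.0, max_score=9.0):
    """Scale a judge score to [0, 1] and penalize edits that drift semantically."""
    normalized = (judge_score - min_score) / (max_score - min_score)
    if mode == "hard":
        # Hard constraint: no reward at all if the equivalence check fails.
        return normalized if similarity >= threshold else 0.0
    if mode == "multiplicative":
        # Soft penalty: reward shrinks in proportion to the similarity shortfall.
        return normalized * min(1.0, similarity / threshold)
    raise ValueError(f"unknown mode: {mode}")


def build_context(edit_category, similarity, mean_past_reward, task_type):
    """Concatenate one-hot edit and task indicators with two scalar features."""
    one_hot_edit = [1.0 if c == edit_category else 0.0 for c in EDIT_CATEGORIES]
    one_hot_task = [1.0 if t == task_type else 0.0 for t in TASK_TYPES]
    return one_hot_edit + [similarity, mean_past_reward] + one_hot_task


if __name__ == "__main__":
    print(shaped_reward(9, 0.95))                         # passes the check
    print(shaped_reward(9, 0.50))                         # fails the hard constraint
    print(build_context("markdown", 0.95, 0.4, "pairwise"))
```

The hard-constraint mode makes the equivalence check a gate, while the multiplicative mode lets the bandit trade a small similarity loss against a larger score gain. Which of the two is used would affect how strictly the "semantics-preserving" claim holds.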

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents an empirical attack framework (BITE) that casts stylistic edit selection as a contextual bandit problem solved via LinUCB and validates it through experiments on LLM judges. There are no mathematical derivations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations that reduce the central claims to their own inputs by construction. The reported attack success rates and score improvements are direct experimental outcomes, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the high-level description of the bandit policy and stylistic edits.

pith-pipeline@v0.9.1-grok · 5761 in / 1049 out tokens · 37869 ms · 2026-06-30T00:19:43.112798+00:00 · methodology

