One token to fool llm-as-a-judge

Yulai Zhao, Haolin Liu, Dian Yu, SY Kung, Haitao Mi, Dong Yu · 2025 · arXiv 2507.08794

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

read on arXiv browse 9 citing papers

citation-role summary

background 4

citation-polarity summary

background 3 unclear 1

representative citing papers

Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.

Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

ODRPO decomposes discrete rewards into ordinal binary indicators to create robust, variance-aware advantage estimators for noisy RLAIF in LLM alignment.

When AI reviews science: Can we trust the referee?

cs.AI · 2026-04-26 · unverdicted · novelty 6.0

AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.

Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

cs.LG · 2025-10-01 · unverdicted · novelty 6.0

Derives backward and forward corrections for asymmetric verifier noise that improve RLVR performance on math reasoning tasks.

LLM-as-Judge for Semantic Judging of Powerline Segmentation in UAV Inspection

cs.AI · 2026-04-07 · unverdicted · novelty 4.0

An LLM produces consistent categorical judgments and appropriate confidence declines when evaluating powerline segmentation quality under controlled visual degradations, suggesting it can serve as a reliable watchdog.

A Survey on LLM-as-a-Judge

cs.CL · 2024-11-23 · unverdicted · novelty 4.0

A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

citing papers explorer

Showing 9 of 9 citing papers.

Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities cs.LG · 2026-05-11 · unverdicted · none · ref 21
Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.
Beyond Semantic Manipulation: Token-Space Attacks on Reward Models cs.LG · 2026-04-03 · unverdicted · none · ref 21
TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.
ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization cs.LG · 2026-05-12 · unverdicted · none · ref 6 · 2 links
ODRPO decomposes discrete rewards into ordinal binary indicators to create robust, variance-aware advantage estimators for noisy RLAIF in LLM alignment.
When AI reviews science: Can we trust the referee? cs.AI · 2026-04-26 · unverdicted · none · ref 95
AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data cs.LG · 2026-04-20 · unverdicted · none · ref 5
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR cs.LG · 2026-04-06 · unverdicted · none · ref 10
Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers cs.LG · 2025-10-01 · unverdicted · none · ref 36
Derives backward and forward corrections for asymmetric verifier noise that improve RLVR performance on math reasoning tasks.
LLM-as-Judge for Semantic Judging of Powerline Segmentation in UAV Inspection cs.AI · 2026-04-07 · unverdicted · none · ref 14
An LLM produces consistent categorical judgments and appropriate confidence declines when evaluating powerline segmentation quality under controlled visual degradations, suggesting it can serve as a reliable watchdog.
A Survey on LLM-as-a-Judge cs.CL · 2024-11-23 · unverdicted · none · ref 218
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

One token to fool llm-as-a-judge

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer