Quantifying and Auditing LLM Evaluation via Positive-Unlabeled Learning

Chi-Kuang Yeh; Lei Ding; Yi-Ting Hung; Zilong Zhang

arXiv: 2606.19057 · v1 · pith:3ZEIAIJDnew · submitted 2026-06-17 · 📊 stat.ML · cs.LG · stat.CO · stat.ME

Pith reviewed 2026-06-26 19:02 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · stat.CO · stat.ME

keywords LLM evaluation · positive-unlabeled learning · partial optimal transport · bias correction · human preferences · LLM as judge · geometric auditing

The pith

Partial optimal transport aligns a small set of human-verified positives with unlabeled LLM outputs to recover consistent preferences and correct judge biases without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM judges used for scalable evaluation exhibit systematic biases, such as favoring longer responses regardless of actual quality. Human supervision supplies reliable positive labels but leaves most outputs unlabeled, creating a selective supervision setting. The paper casts this as a positive-unlabeled learning problem and introduces a geometric auditing method based on partial optimal transport. By matching human positives to a reliable subset of unlabeled items inside a fixed embedding space, the method extracts human-consistent preferences and adjusts biased LLM judgments. Experiments show gains in human alignment, resistance to presentation biases, and usable confidence scores.

Core claim

By treating LLM evaluation under selective human supervision as positive-unlabeled learning, partial optimal transport between a small verified positive set and a reliable unlabeled subset in embedding space identifies alignments that reflect human preferences, enabling debiasing of LLM judges without retraining or full labeling.

What carries the argument

Partial optimal transport alignment between human-verified positives and a reliable subset of unlabeled outputs in a fixed embedding space, used to recover human-consistent preferences.
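For readers unfamiliar with the tool, the standard partial optimal transport problem can be written as below. This is the textbook formulation, not necessarily the exact objective used in the paper. The symbols are notation assumed here: $p_i$ are the $n$ human-verified positives, $u_j$ the $N$ unlabeled items, $a$ and $b$ their mass vectors, $T$ the transport plan, and $m$ the amount of mass that must be moved. The Euclidean cost follows the authors' description in the simulated rebuttal.

```latex
\min_{T \ge 0} \; \sum_{i=1}^{n} \sum_{j=1}^{N} C_{ij}\, T_{ij}
\quad \text{s.t.} \quad
T \mathbf{1}_N \le a, \qquad
T^{\top} \mathbf{1}_n \le b, \qquad
\mathbf{1}_n^{\top} T\, \mathbf{1}_N = m,
\qquad \text{with } C_{ij} = \lVert p_i - u_j \rVert_2 ,
\quad 0 \le m \le \min\!\left(\lVert a \rVert_1, \lVert b \rVert_1\right).
```

Because only a mass of $m$ has to be transported, unlabeled items far from every positive can be left unmatched, which is what allows a subset of the unlabeled pool to be singled out.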

If this is right

The corrected judgments exhibit higher correlation with human preferences than the original biased judges.
The method reduces sensitivity to presentation biases such as verbosity without changing the underlying judge model.
It supplies interpretable confidence estimates derived from the transport plan.
The framework provides a scalable alternative to full human labeling or judge retraining.
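One plausible way to read "confidence estimates derived from the transport plan" is the share of each unlabeled item's mass that the plan actually matches. The sketch below illustrates that reading only; the function name `transport_confidence` and the normalization are assumptions, and the page does not state how the paper computes its scores.

```python
def transport_confidence(plan, unlabeled_mass):
    """Fraction of each unlabeled item's available mass that the plan matched.

    plan[i][j] is the mass moved from positive i to unlabeled item j.
    unlabeled_mass[j] is the total mass available at unlabeled item j.
    This normalization is an illustrative assumption, not the paper's definition.
    """
    n_unlabeled = len(unlabeled_mass)
    received = [sum(row[j] for row in plan) for j in range(n_unlabeled)]
    return [
        min(1.0, r / m) if m > 0 else 0.0
        for r, m in zip(received, unlabeled_mass)
    ]


# Toy plan: 2 positives, 3 unlabeled items, each unlabeled item holds mass 0.5.
plan = [
    [0.5, 0.0, 0.0],
    [0.0, 0.25, 0.0],
]
confidence = transport_confidence(plan, [0.5, 0.5, 0.5])
```

In this toy plan the first item is fully matched, the second half matched, and the third receives no mass, so the scores rank the items by how strongly the positives claim them.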

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same transport-based auditing could be applied to other selective-supervision settings where only positive labels are cheap to obtain.
Transport cost between positives and unlabeled items might serve as a diagnostic signal for when an LLM judge has drifted from human standards.
Extending the reliable-subset selection step to use learned embeddings instead of fixed ones could further reduce dependence on the initial embedding choice.

Load-bearing premise

A reliable subset of unlabeled outputs can be identified and aligned with human positives via partial optimal transport to reveal true preferences.
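To make the premise concrete, here is a minimal sketch of partial transport on a toy problem. It assumes unit mass on every point and an integer mass budget `k`; under those assumptions an optimal partial transport plan can be taken to be a one-to-one matching of size `k`, so exhaustive search is exact. The helper `partial_matching` is hypothetical and only workable for a handful of points; a real pipeline would use a linear-programming or dedicated OT solver.

```python
from itertools import combinations, permutations
from math import dist


def partial_matching(positives, unlabeled, k):
    """Exhaustively find the cheapest one-to-one matching of size k.

    Assumes unit masses and an integer budget k, in which case partial
    optimal transport admits an optimal plan of this form.
    Cost is Euclidean distance between embedding vectors.
    """
    best_cost, best_pairs = float("inf"), []
    for pos_idx in combinations(range(len(positives)), k):
        for unl_idx in permutations(range(len(unlabeled)), k):
            cost = sum(
                dist(positives[i], unlabeled[j])
                for i, j in zip(pos_idx, unl_idx)
            )
            if cost < best_cost:
                best_cost, best_pairs = cost, list(zip(pos_idx, unl_idx))
    return best_cost, best_pairs


# Toy 2-D embeddings: two positives, three unlabeled items (one far away).
positives = [(0.0, 0.0), (1.0, 0.0)]
unlabeled = [(0.0, 0.1), (1.0, 0.2), (5.0, 5.0)]
cost, pairs = partial_matching(positives, unlabeled, k=2)
matched_unlabeled = sorted(j for _, j in pairs)
```

The distant third unlabeled point is left unmatched, which is the behaviour the premise relies on: the mass budget forces the plan to pick the unlabeled items that sit closest to the verified positives.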

What would settle it

A test set where the transport-derived alignments show no increase in correlation with held-out human judgments or no reduction in verbosity bias compared with the original LLM judge.
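The check described above can be sketched as two numbers computed on held-out comparisons. The sketch uses simple agreement rate in place of a correlation coefficient, and "share of decisions favoring the longer response" as a stand-in for verbosity bias; both choices, the function names, and the toy data are assumptions made for illustration.

```python
def agreement_rate(predicted, human):
    """Share of comparisons where the predicted winner equals the human winner."""
    return sum(p == h for p, h in zip(predicted, human)) / len(human)


def longer_preference_rate(predicted, lengths):
    """Share of comparisons where the predicted winner is the longer response.

    lengths[i] is a (len_a, len_b) pair; comparisons with equal lengths are skipped.
    """
    picks = [
        (p == "a") == (len_a > len_b)
        for p, (len_a, len_b) in zip(predicted, lengths)
        if len_a != len_b
    ]
    return sum(picks) / len(picks)


# Hypothetical held-out data: each winner is "a" or "b".
human = ["a", "b", "b", "a"]
judge = ["a", "a", "a", "a"]      # original LLM judge
adjusted = ["a", "b", "a", "a"]   # judge after transport-based correction
lengths = [(120, 80), (300, 90), (250, 100), (60, 200)]

gain = agreement_rate(adjusted, human) - agreement_rate(judge, human)
bias_drop = (
    longer_preference_rate(judge, lengths)
    - longer_preference_rate(adjusted, lengths)
)
```

Under this reading, the paper's claim would be undermined on a given test set if `gain` or `bias_drop` came out at or below zero.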

Figures

Figures reproduced from arXiv: 2606.19057 by Chi-Kuang Yeh, Lei Ding, Yi-Ting Hung, Zilong Zhang.

**Figure 1.** Overview of proposed PUAUDIT framework; (A) Embedding construction, where pairwise comparison from datasets like Chatbot Arena are processed via a reward model to derive score embedding; (B) Difference Embedding, which extracts feature representations to construct difference vectors that capture the specific direction of quality improvement between winner and loser response; (C) Optimal Transportation, whe…

**Figure 2.** Diagnostic visualization of the POT-based alignment score on the MT-Bench run. Panels (a) …

**Figure 3.** Comparison of original and adjusted consistency ratios across question types and models.

**Figure 4.** Given a question prompt, two LLMs each generate a response, with one designated as …

**Figure 5.** The vector decomposition for Lemma A.1: We denote …

**Figure 6.** Geometric orthogonality demonstrates the robustness of our proposed method against …

**Figure 7.** Non-generative attacks used to evaluate surface-level biases in LLM-based judges. The …

**Figure 8.** Prompt template for the length expansion attack. The attack increases response length …

**Figure 9.** Prompt template for the sentiment style attack. The attack modifies the emotional tone of …

**Figure 10.** Prompt template used by the offline LLM-as-judge to generate preference judgments and …
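The Figure 1 caption describes difference vectors pointing from the loser to the winner, and the appendix fragment quoted in the reference list mentions a unit consensus direction z at the centre of a positive cone Cτ(z). The sketch below shows one way those pieces could fit together. The estimator for z (normalized mean of unit difference vectors) and the cone test (cosine at least `tau`) are assumptions; the page does not give the paper's definitions.

```python
from math import sqrt


def unit(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def difference_vector(winner_emb, loser_emb):
    """Unit vector pointing from the loser's embedding to the winner's."""
    return unit([w - l for w, l in zip(winner_emb, loser_emb)])


def consensus_direction(diffs):
    """Normalized mean of unit difference vectors (an assumed estimator of z)."""
    dim = len(diffs[0])
    mean = [sum(d[k] for d in diffs) / len(diffs) for k in range(dim)]
    return unit(mean)


def in_cone(d, z, tau):
    """True when the cosine between unit vectors d and z is at least tau."""
    return sum(a * b for a, b in zip(d, z)) >= tau


# Toy human-verified (winner, loser) embedding pairs in 2-D.
human_pairs = [([1.0, 0.2], [0.0, 0.0]), ([0.9, -0.1], [0.1, 0.1])]
diffs = [difference_vector(w, l) for w, l in human_pairs]
z = consensus_direction(diffs)

aligned = in_cone(difference_vector([2.0, 0.1], [1.0, 0.0]), z, tau=0.9)
opposed = in_cone(difference_vector([0.0, 0.0], [1.0, 0.0]), z, tau=0.9)
```

A comparison whose difference vector falls inside the cone agrees with the direction of improvement seen in the human-verified pairs; one pointing the opposite way does not.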

Original abstract

Large Language Models (LLMs) are increasingly used as judges for scalable evaluation, yet such LLM-as-a-Judge systems exhibit systematic biases that are decoupled from semantic quality, most notably verbosity bias. Meanwhile, human supervision is costly and typically selective, yielding reliable positive judgments but leaving most outputs unlabelled and potentially mixed in quality. We formulate LLM evaluation under selective human supervision as a positive-unlabelled learning problem and propose a geometric auditing framework based on Partial Optimal Transport. By aligning a small set of human-verified positives with a reliable subset of unlabelled outputs in a fixed embedding space, our method identifies human-consistent preferences and corrects biased judges without retraining. Experiments demonstrate improved alignment with human preferences, increased robustness to presentation biases, and interpretable confidence estimates, offering a scalable and statistically grounded alternative to existing LLM-as-a-judge pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's new angle is casting LLM judge auditing as positive-unlabeled learning solved by partial optimal transport on embeddings, but the reliable subset step is underspecified and risks circularity if it leans on the judge being audited.

The main thing here is that the authors treat selective human labels on LLM outputs as a positive-unlabeled problem and bring in partial optimal transport to align the verified positives with some unlabelled examples in a fixed embedding space. That combination is the actual new piece, and it aims to recover human-consistent preferences and debias the judge without retraining.

They do a reasonable job naming the real bottleneck: human supervision is costly and incomplete, while LLM judges carry biases like verbosity that sit apart from semantic quality. Working in a fixed embedding space and aiming for interpretable confidence estimates keeps the method lightweight, which fits the practical constraints of evaluation pipelines.

The soft spot is the reliable subset of unlabelled outputs. The abstract gives no mechanism for picking it, and the stress-test concern lands because if that step depends on the same biased judge, the partial OT alignment cannot cleanly correct the bias. Fixed embeddings may also mix stylistic presentation with content quality, which would weaken the geometric claim. Without explicit selection rules or transport cost details, the statistical correction is not yet demonstrated.

Experiments are said to show better human alignment and robustness, but the abstract supplies no setups, baselines, or numbers, so the evidence cannot be weighed. The formulation draws on existing PU learning and OT tools without obvious internal contradictions.

This is for people building or auditing LLM evaluation systems who need to stretch limited human labels. A reader working on judge debiasing would find the framing worth examining even if the details need tightening. It deserves a serious referee to check the subset selection and the experimental support.

I would send it to peer review.

Referee Report

2 major / 1 minor

Summary. The paper claims to formulate LLM evaluation under selective human supervision as a positive-unlabeled learning problem and introduces a geometric auditing framework based on partial optimal transport. By aligning human-verified positives with a reliable subset of unlabeled outputs in a fixed embedding space, the method purportedly identifies human-consistent preferences, corrects biases (e.g., verbosity) in LLM judges, and yields interpretable confidence estimates without retraining.

Significance. If the core assumptions hold and the method can be validated with explicit, non-circular subset selection and transport definitions, it would provide a scalable, statistically grounded alternative to existing LLM-as-a-judge pipelines, potentially improving robustness to presentation biases while leveraging limited human labels. The combination of PU learning with partial OT for auditing is a novel angle with possible broader applicability to AI evaluation.

major comments (2)

[Abstract] The central claim requires that a 'reliable subset' of unlabeled outputs can be identified independently and aligned via partial OT to recover human-consistent preferences, but no mechanism, criteria, or independence assumptions for selecting this subset are specified. If selection depends on the LLM judge whose biases are being audited, the partial OT step cannot guarantee correction and the PU-learning reduction becomes circular.
[Abstract] The geometric auditing premise assumes that a fixed embedding space and partial OT can separate semantic quality from presentation biases (e.g., verbosity), but no transport cost definition or validation is provided to support this separation; embeddings commonly encode surface features, so this requires a concrete test or counterexample to establish that the alignment yields bias-corrected preferences.

minor comments (1)

[Abstract] The statement that 'Experiments demonstrate improved alignment...' lacks any reference to datasets, baselines, metrics, or error bars, which weakens the ability to evaluate the empirical support for the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the abstract and the underlying framework. We address each major comment below and indicate where revisions will strengthen the presentation of the method's assumptions and definitions.

Referee: [Abstract] The central claim requires that a 'reliable subset' of unlabeled outputs can be identified independently and aligned via partial OT to recover human-consistent preferences, but no mechanism, criteria, or independence assumptions for selecting this subset are specified. If selection depends on the LLM judge whose biases are being audited, the partial OT step cannot guarantee correction and the PU-learning reduction becomes circular.

Authors: We agree that the abstract does not explicitly state the selection mechanism or independence assumptions. The manuscript selects the reliable subset via a distance-based threshold in the fixed embedding space to the human positives, using only embedding geometry and excluding any LLM-judge scores; this is intended to keep the procedure non-circular. We will revise the abstract, methods, and experimental sections to state the criterion, the independence from judge outputs, and the resulting PU-learning reduction explicitly. revision: yes

Referee: [Abstract] The geometric auditing premise assumes that a fixed embedding space and partial OT can separate semantic quality from presentation biases (e.g., verbosity), but no transport cost definition or validation is provided to support this separation; embeddings commonly encode surface features, so this requires a concrete test or counterexample to establish that the alignment yields bias-corrected preferences.

Authors: We agree that the abstract omits the transport cost definition and supporting validation. The manuscript defines the cost as Euclidean distance in the fixed embedding space and reports improved human alignment after transport; however, to directly address separation from surface features such as verbosity, we will add an explicit cost-function statement plus a targeted validation experiment (or counterexample) in the revised version. revision: yes
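The selection rule described in the first response (a distance-based threshold to the human positives, using embedding geometry only) could look roughly like the sketch below. The nearest-positive rule, the function name `select_reliable`, and the threshold value are assumptions; the rebuttal does not specify how distance to the positive set is aggregated or how the threshold is chosen.

```python
from math import dist


def select_reliable(unlabeled, positives, threshold):
    """Indices of unlabeled points within `threshold` of their nearest positive.

    Uses only Euclidean distance in the embedding space; no judge scores enter,
    which is the property the rebuttal relies on to avoid circularity.
    """
    reliable = []
    for j, u in enumerate(unlabeled):
        nearest = min(dist(u, p) for p in positives)
        if nearest <= threshold:
            reliable.append(j)
    return reliable


# Toy 2-D embeddings.
positives = [(0.0, 0.0), (1.0, 1.0)]
unlabeled = [(0.1, 0.0), (0.9, 1.2), (4.0, 4.0), (0.5, 0.5)]
reliable = select_reliable(unlabeled, positives, threshold=0.5)
```

Note that the rule is only as independent of presentation bias as the embedding itself: if the embedding encodes length or tone, so does the distance, which is the referee's second concern.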

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper formulates LLM evaluation as a positive-unlabeled learning problem and proposes alignment via partial optimal transport between human-verified positives and a reliable subset of unlabeled outputs in fixed embeddings. No equations, selection procedures, or claims in the abstract reduce by construction to fitted inputs renamed as predictions, self-definitions, or self-citation load-bearing uniqueness theorems. The central method is presented as an application of existing PU learning and OT techniques to the auditing task, with experiments claimed to demonstrate alignment improvements. No load-bearing step is shown to be equivalent to its inputs by definition. This is the expected outcome for a method paper whose assumptions are stated explicitly rather than derived internally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities. Full text required for audit.

pith-pipeline@v0.9.1-grok · 5689 in / 1089 out tokens · 42623 ms · 2026-06-26T19:02:20.416757+00:00 · methodology

Reference graph

Works this paper leans on

21 extracted references · 12 linked inside Pith

[1] Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631.
[2] Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. Evaluation of text generation: A survey. arXiv preprint arXiv:2006.14799.
[3] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216.
[4] Roi Cohen, May Hamri, Mor Geva, and Amir Globerson. LM vs LM: Detecting factual errors via cross examination. arXiv preprint arXiv:2305.13281.
[5] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. arXiv preprint arXiv:2411.15594.
[6] Weiyi He and Yue Xing. Impact of positional encoding: Clean and adversarial rademacher complexity for transformers under in-context regression. arXiv preprint arXiv:2512.09275.
[7] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b. arXiv preprint arXiv:23...
[8] Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. arXiv preprint arXiv:2405.01535.
[9] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
[10] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634.
[11] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of nlp models with checklist. arXiv preprint arXiv:2005.04118.
[12] Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. Verbosity bias in preference labeling by large language models. arXiv preprint arXiv:2310.10076.
[13] Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in rlhf. arXiv preprint arXiv:2310.03716.
[14] Hugo S. Vera et al. Embeddinggemma: Powerful and lightweight text representations. arXiv preprint arXiv:2509.20354.
[15] Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. arXiv preprint arXiv:2404.18796.
[16] Chen Yang, Zhen Sun, Da Yu, Haoran Li, Jiahui Li, Jun Xu, Xiang Zheng, Zhi Liu, Shaohan Chen, Yu Zeng, et al. Qwen2.5: A comprehensive series of large language models. arXiv preprint arXiv:2412.15115. Technical report introducing the Qwen2.5 model family.
[17] Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, et al. Justice or prejudice? quantifying biases in LLM-as-a-judge. URL https://arxiv.org/abs/2410.02736.
[18] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
[19] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176.
[20] Lianghui Zhu, Xinggang Wang, and Xinlong Wang. Judgelm: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631.
[21] Attacker. Therefore, POT will route the transport mass to U+ to minimize the global cost. ■ Lemma A.1. Let z be the unit consensus direction, defined as the centre of the positive cone Cτ(z) as in Assumption 4.1, and let z_i^⊥, x^⊥ be unit vectors in the (d−1)-dimensional orthogonal complement subspace (Figure 5). Assuming the unlabeled samples' orthogonal comp... 2024
