Recognition: no theorem link
Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
Pith reviewed 2026-05-10 19:34 UTC · model grok-4.3
The pith
Both humans and LLMs assign higher trust to identical information when labeled human-authored than when labeled AI-generated.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a counterfactual setup that holds all text constant and varies only the source label, the work shows that trust ratings rise when content is marked as human-authored and fall when marked as AI-generated. This pattern appears in both human participants and LLM judges. Model attention concentrates more on the label region than the content region, with stronger label focus under human labels, while decision logits indicate greater uncertainty under AI labels. These internal patterns match the human eye-tracking data, indicating that the source label acts as a shared heuristic cue.
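A minimal sketch of how these internal signals could be probed, assuming a Hugging Face causal LM; the model name, prompt layout, label-region boundary, and layer/head averaging are illustrative choices, not the paper's protocol.

```python
# Sketch: label-region attention mass and decision uncertainty for one prompt.
# All names and aggregation choices here are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in judge; the paper's models may differ
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

label = "Source: human-authored."  # counterfactual twin: "Source: AI-generated."
content = "Regular exercise lowers blood pressure in most adults."
prompt = f"{label} {content} Rate your trust in this statement from 1 to 7:"

# The label is the prompt prefix, so its token count marks the region boundary.
label_len = len(tok(label)["input_ids"])
ids = tok(prompt, return_tensors="pt")["input_ids"]

with torch.no_grad():
    out = model(ids, output_attentions=True)

# Attention from the final (decision) position onto each earlier token,
# averaged over layers and heads -- one of several defensible aggregations.
att = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]  # shape: (seq_len,)
label_mass = att[:label_len].sum().item()
content_mass = att[label_len:].sum().item()

# Decision uncertainty: entropy of the next-token distribution at the decision point.
probs = torch.softmax(out.logits[0, -1], dim=-1)
entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()

print(f"label attention mass:   {label_mass:.3f}")
print(f"content attention mass: {content_mass:.3f}")
print(f"next-token entropy:     {entropy:.3f} nats")
```

Running this on the human-labeled and AI-labeled twins of the same content and comparing the masses and entropies reproduces the shape of the reported analysis, whatever the paper's exact aggregation.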
What carries the argument
The counterfactual design that isolates the source label by presenting identical content under human versus AI authorship disclosures.
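The pairing logic is small enough to state directly; the label wording and rating scale below are assumptions for illustration, not the paper's stimuli.

```python
# Sketch: the counterfactual manipulation in miniature. Each content item is
# shown under both source labels, so any trust gap is attributable to the label.
ITEMS = [
    "Vitamin D supplementation reduces fracture risk in older adults.",
    "The new reactor design cuts cooling-water use by 40 percent.",
]

LABELS = {
    "human": "This text was written by a human expert.",
    "ai": "This text was generated by an AI system.",
}

def make_trials(items, labels):
    """Yield (condition, prompt) pairs that differ only in the source label."""
    for text in items:
        for cond, disclosure in labels.items():
            yield cond, f"{disclosure}\n\n{text}\n\nHow much do you trust this statement (1-7)?"

for cond, prompt in make_trials(ITEMS, LABELS):
    print(cond, "->", prompt.splitlines()[0])
```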
If this is right
- LLM-as-a-Judge systems may systematically undervalue AI-generated outputs when source labels are visible.
- Alignment procedures that train on human preferences risk embedding label-based heuristics into model behavior.
- Evaluation validity suffers when labels are disclosed, because judgments track the label more than the content.
- Attention and uncertainty metrics in LLMs can serve as detectable signals of this heuristic reliance.
Where Pith is reading between the lines
- The same label effect could appear in other judgment tasks such as quality scoring or fact-checking beyond trust.
- Blinding source labels during both human and model evaluation might eliminate the bias and produce more content-focused assessments (see the blinding sketch after this list).
- If training data contain labeled examples, models may learn to overweight labels even when labels are not explicitly provided at inference time.
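A sketch of the blinding idea from the second bullet, assuming source disclosures arrive as recognizable strings; the regex patterns and the neutral marker are hypothetical, not drawn from the paper.

```python
# Sketch: neutralize source disclosures before the judge (human or LLM) sees them.
# The patterns and the neutral marker are illustrative assumptions.
import re

LABEL_PATTERNS = [
    r"(?i)this text was (?:written|generated) by an? (?:human(?: expert)?|ai(?: system)?)[.:]?",
    r"(?i)\bsource:\s*(?:human-authored|ai-generated)[.:]?",
]

def blind_labels(prompt: str, neutral_marker: str = "Source: undisclosed.") -> str:
    """Replace any recognized source disclosure with a neutral marker."""
    blinded = prompt
    for pattern in LABEL_PATTERNS:
        blinded = re.sub(pattern, neutral_marker, blinded)
    return blinded

print(blind_labels("Source: AI-generated. Regular exercise lowers blood pressure."))
# -> Source: undisclosed. Regular exercise lowers blood pressure.
```

If the bias is label-driven as claimed, trust ratings on blinded prompts should converge across the former conditions.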
Load-bearing premise
The experiment successfully keeps every factor except the source label identical across conditions, so no other cue influences the trust difference.
What would settle it
A replication in which the source label is removed or replaced with a neutral marker and the trust gap between the former human and AI conditions disappears for both humans and LLMs.
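One plausible shape for that confirmatory analysis, using a paired Wilcoxon signed-rank test (a test the paper's reference list points to for paired comparisons); the ratings below are fabricated placeholders that only demonstrate the computation.

```python
# Sketch: test the trust gap between label conditions on the same items.
# All numbers are invented placeholders, not data from the paper.
import numpy as np
from scipy.stats import wilcoxon

# Per-item mean trust ratings for identical content under each label.
human_label = np.array([5.2, 4.8, 5.5, 4.9, 5.1, 5.4])
ai_label    = np.array([4.1, 4.1, 4.6, 3.9, 4.5, 4.6])

stat, p = wilcoxon(human_label, ai_label)
gap = (human_label - ai_label).mean()
print(f"mean trust gap = {gap:.2f}, Wilcoxon p = {p:.4f}")

# In the proposed replication, the same test run on neutral-label data should
# show the gap collapsing toward zero for both humans and LLM judges.
```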
Original abstract
Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge). This work challenges the reliability of that practice by showing that trust judgments by LLMs are biased by disclosed source labels. Using a counterfactual design, we find that both humans and LLM judges assign higher trust to information labeled as human-authored than to the same content labeled as AI-generated. Eye-tracking data reveal that humans rely heavily on source labels as heuristic cues for judgments. We analyze LLM internal states during judgment. Across label conditions, models allocate denser attention to the label region than to the content region, and this label dominance is stronger under Human labels than under AI labels, consistent with the human gaze patterns. In addition, decision uncertainty measured by logits is higher under AI labels than under Human labels. These results indicate that the source label is a salient heuristic cue for both humans and LLMs. They raise validity concerns for label-sensitive LLM-as-a-Judge evaluation, and we cautiously suggest that aligning models with human preferences may propagate human heuristic reliance into models, motivating debiased evaluation and alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that source labels (human-authored vs. AI-generated) bias trust judgments in both humans and LLMs used as judges. Using a counterfactual design that holds content constant while manipulating only the label, the authors report higher trust ratings for human-labeled content. Human data include explicit ratings and eye-tracking fixation metrics showing heavy label reliance; LLM data include judgment outputs, denser attention weights on label tokens (stronger for human labels), and higher logit-based uncertainty for AI labels. These patterns are interpreted as evidence of shared heuristic reliance on source labels, raising concerns for the validity of LLM-as-a-Judge evaluations and potential propagation of biases via alignment.
Significance. If the central empirical result holds under the reported controls, the work is significant for AI evaluation research because it identifies a concrete, measurable bias that affects both human and model judgments in the same direction. The convergence between behavioral (eye-tracking) and internal-state (attention and uncertainty) measures provides a rare cross-species link between cognitive heuristics and model mechanisms, directly supporting the call for debiased evaluation protocols.
minor comments (2)
- Abstract: the summary of results would be strengthened by including at least the total sample sizes for human participants and LLM trials, along with the primary statistical test outcomes or effect sizes that support the trust difference claim.
- The description of the LLM attention analysis should clarify how label-region attention weights are normalized and aggregated across layers to ensure direct comparability with the human eye-tracking fixation metrics (one possible normalization is sketched below).
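One defensible normalization, sketched under the assumption that attention is read from the final decision token and averaged over layers and heads; dividing each region's summed mass by its token count gives a per-token density loosely analogous to fixation duration per word.

```python
# Sketch: length-normalized, layer-aggregated attention densities per region.
# The aggregation choices are assumptions, not the paper's stated procedure.
import torch

def region_attention_density(attentions, label_slice, content_slice):
    """attentions: tuple of per-layer tensors shaped (1, heads, seq, seq).
    Returns per-token attention density for each region, measured from the
    final query position and averaged over layers and heads."""
    att = torch.stack(attentions).mean(dim=(0, 2))[0, -1]  # shape: (seq,)
    label_density = att[label_slice].mean().item()      # mass per label token
    content_density = att[content_slice].mean().item()  # mass per content token
    return label_density, content_density
```

Because both quantities are per-token averages, a label-over-content ratio computed this way is at least unit-compatible with per-word fixation metrics, which is the comparability the comment asks the authors to establish.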
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work, the accurate summary of our findings, and the recommendation for minor revision. We are pleased that the cross-species convergence between human behavioral measures and LLM internal states was recognized as significant for AI evaluation research.
Circularity Check
Empirical study with no derivational chain or self-referential reductions
full rationale
This paper reports an empirical observational study using counterfactual designs to compare trust judgments by humans and LLMs under manipulated source labels. No mathematical derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. Claims rest on experimental measurements (ratings, eye-tracking, attention weights, logits) with reported controls for confounds, and no load-bearing self-citations or uniqueness theorems are invoked to justify core results. The analysis is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard assumptions of experimental psychology and statistical inference hold for trust ratings, gaze data, and model logits.
Reference graph
Works this paper leans on
- [1] Tobii AB. 2024. Tobii Pro Lab. Computer software. http://www.tobii.com/
- [2] Benjamin R Bates, Sharon Romina, Rukhsana Ahmed, and Danielle Hopson. 2006. The effect of source credibility on consumers' perceptions of the quality of health information on the internet. Medical Informatics and the Internet in Medicine, 31(1):45--52.
- [3] Oliver Brady, Paul Nulty, Lili Zhang, Tomás E Ward, and David P McGovern. 2025. Dual-process theory and decision-making in large language models. Nature Reviews Psychology, 4(12):777--792.
- [4] John T. Cacioppo, Louis G. Tassinary, and Gary G. Berntson. 2016. Strong Inference in Psychophysiological Science, pages 3--15. Cambridge Handbooks in Psychology. Cambridge University Press.
- [5] Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. 2024. Humans or LLMs as the judge? A study on judgement bias. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8301--8327, Miami, Florida, USA. Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.emnlp-main.474
- [6] Vanessa Cheung, Maximilian Maier, and Falk Lieder. 2025. Large language models show amplified cognitive biases in moral decision-making. Proceedings of the National Academy of Sciences, 122(25):e2412015122. https://doi.org/10.1073/pnas.2412015122
- [7] Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, Gang Wang, and Jun Xu. 2024. Neural retrievers are biased towards LLM-generated content. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '24, pages 526--537, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3637528.3671882
- [8] Ricardo Dominguez-Olmedo, Moritz Hardt, and Celestine Mendler-Dünner. 2024. Questioning the survey responses of large language models. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24, Red Hook, NY, USA. Curran Associates Inc.
- [9] Jessica Maria Echterhoff, Yao Liu, Abeer Alessa, Julian McAuley, and Zexue He. 2024. Cognitive bias in decision-making with LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12640--12653, Miami, Florida, USA. Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-emnlp.739
- [10] Abdallah El Ali, Karthikeya Puttur Venkatraj, Sophie Morosoli, Laurens Naudts, Natali Helberger, and Pablo Cesar. 2024. Transparent AI disclosure obligations: Who, what, when, where, why, how. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA '24, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3613905.3650750
- [11] Franz Faul, Edgar Erdfelder, Albert-Georg Lang, and Axel Buchner. 2007. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2):175--191.
- [12] Bertram Gawronski, Dillon M Luke, and Laura A Creighton. 2024. Dual-process theories. In The Oxford Handbook of Social Cognition, Second Edition, pages 319--353. Oxford University Press.
- [13] Ellen R Girden. 1992. ANOVA: Repeated Measures. Number 84. Sage.
- [14] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. 2025. A survey on LLM-as-a-judge. Preprint, arXiv:2411.15594. https://arxiv.org/abs/2411.15594
- [15] Rajarshi Haldar and Julia Hockenmaier. 2025. Rating roulette: Self-inconsistency in LLM-as-a-judge frameworks. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 24986--25004, Suzhou, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.findings-emnlp.1361
- [16] James W Hardin and Joseph M Hilbe. 2012. Generalized Estimating Equations, 2nd edition. Chapman & Hall/CRC, Philadelphia, PA.
- [17] Winston Haynes. 2013. Benjamini–Hochberg Method, page 78. Springer New York, New York, NY. https://doi.org/10.1007/978-1-4419-9863-7_1215
- [18] Maurice Jakesch, Megan French, Xiao Ma, Jeffrey T. Hancock, and Mor Naaman. 2019. AI-mediated communication: How the perception that profile text was written by AI affects trustworthiness. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, pages 1--13, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3290605.3300469
- [19]
- [20] Frances C. Johnson, Jennifer E. Rowley, and Laura Sbaffi. 2015. Modelling trust formation in health information contexts. Journal of Information Science, 41:415--429. https://api.semanticscholar.org/CorpusID:206454953
- [21] Marcel A Just and Patricia A Carpenter. 1980. A theory of reading: From eye fixations to comprehension. Psychological Review, 87(4):329--354.
- [22] Tatsuki Kuribayashi, Yohei Oseki, Souhaib Ben Taieb, Kentaro Inui, and Timothy Baldwin. 2025. Large language models are human-like internally. Transactions of the Association for Computational Linguistics, 13:1743--1766. https://doi.org/10.1162/TACL.a.58
- [23] Walter Laurito, Benjamin Davis, Peli Grietzer, Tomáš Gavenčiak, Ada Böhm, and Jan Kulveit. 2025. AI–AI bias: Large language models favor communications generated by large language models. Proceedings of the National Academy of Sciences, 122(31):e2415697122. https://doi.org/10.1073/pnas.2415697122
- [24] Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, and Huan Liu. 2025a. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.emnlp-main.138
- [25] Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024. LLMs-as-judges: A comprehensive survey on LLM-based evaluation methods. Preprint, arXiv:2412.05579.
- [26]
- [27] Songze Li, Chuokun Xu, Jiaying Wang, Xueluan Gong, Chen Chen, Jirui Zhang, Jun Wang, Kwok-Yan Lam, and Shouling Ji. 2025c. LLMs cannot reliably judge (yet?): A comprehensive assessment on the robustness of LLM-as-a-judge. Preprint, arXiv:2506.09443. https://arxiv.org/abs/2506.09443
- [28] Q. Vera Liao and S. Shyam Sundar. 2022. Designing for responsible trust in AI systems: A communication perspective. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT '22, pages 1257--1268, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3531146.3533182
- [29] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG evaluation using GPT-4 with better human alignment. Preprint, arXiv:2303.16634. https://arxiv.org/abs/2303.16634
- [30] Joao Marecos, Duarte Tude Graça, Francisco Goiana-da Silva, Hutan Ashrafian, and Ara Darzi. 2024. Source credibility labels and other nudging interventions in the context of online health misinformation: A systematic literature review. Journalism and Media, 5(2):702--717. https://doi.org/10.3390/journalmedia5020046
- [31]
- [32] OpenAI. 2024. GPT-4 technical report. Preprint, arXiv:2303.08774. https://arxiv.org/abs/2303.08774
- [33] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and others. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744.
- [34] Jonathan Peirce, Jeremy R Gray, Sol Simpson, Michael MacAskill, Richard Höchenberger, Hiroyuki Sogo, Erik Kastman, and Jonas Kristoffer Lindeløv. 2019. PsychoPy2: Experiments in behavior made easy. Behavior Research Methods, 51(1):195--203.
- [35] Moritz Reis, Florian Reis, and Wilfried Kunde. 2024. Influence of believed AI involvement on the perception of digital medical advice. Nature Medicine.
- [36] Bernard Rosner, Robert J Glynn, and Mei-Ling T Lee. 2006. The Wilcoxon signed rank test for paired comparisons of clustered data. Biometrics, 62(1):185--192.
- [37] Jennifer E. Rowley, Frances C. Johnson, and Laura Sbaffi. 2015. Students' trust judgements in online health information seeking. Health Informatics Journal, 21:316--327. https://api.semanticscholar.org/CorpusID:21888204
- [38] Nicolas Scharowski, Michaela Benk, Swen J. Kühne, Léane Wettstein, and Florian Brühlmann. 2023. Certification labels for trustworthy AI: Insights from an empirical mixed-method study. In 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 248--260, Chicago, IL, USA. ACM. https://doi.org/10.1145/3593013.3593994
- [39]
- [40] Huanxin Sheng, Xinyi Liu, Hangfeng He, Jieyu Zhao, and Jian Kang. 2025. Analyzing uncertainty of LLM-as-a-judge: Interval evaluations with conformal prediction. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 11297--11339, Suzhou, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.emnlp-main.569
- [41]
- [42]
- [43] Lindia Tjuatja, Valerie Chen, Tongshuang Wu, Ameet Talwalkar, and Graham Neubig. 2024. Do LLMs exhibit human-like response biases? A case study in survey design. Transactions of the Association for Computational Linguistics, 12:1011--1026. https://doi.org/10.1162/tacl_a_00685
- [44]
- [45] Yidong Wang, Yunze Song, Tingyuan Zhu, Xuanwang Zhang, Zhuohao Yu, Hao Chen, Chiyu Song, Qiufeng Wang, Cunxiang Wang, Zhen Wu, Xinyu Dai, Yue Zhang, Wei Ye, and Shikun Zhang. 2025b. TrustJudge: Inconsistencies of LLM-as-a-judge and how to alleviate them. Preprint, arXiv:2509.21117. https://arxiv.org/abs/2509.21117
- [46]
- [47] Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11--20, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1002
- [48] Chengxing Xie, Canyu Chen, Feiran Jia, Ziyu Ye, Shiyang Lai, Kai Shu, Jindong Gu, Adel Bibi, Ziniu Hu, David Jurgens, James Evans, Philip H.S. Torr, Bernard Ghanem, and Guohao Li. 2024. Can large language model agents simulate human trust behavior? In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24, Red Hook, NY, USA. Curran Associates Inc.
- [49]
- [50]
- [51] Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, and Xiangliang Zhang. 2024. Justice or prejudice? Quantifying biases in LLM-as-a-judge. Preprint, arXiv:2410.02736. https://arxiv.org/abs/2410.02736
- [52] Yidan Yin, Nan Jia, and Cheryl J. Wakslak. 2024. AI can help people feel heard, but an AI label diminishes this impact. Proceedings of the National Academy of Sciences, 121(14):e2319112121. https://doi.org/10.1073/pnas.2319112121
- [53] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Preprint, arXiv:2306.05685. https://arxiv.org/abs/2306.05685