pith. machine review for the scientific record.

arxiv: 2604.18835 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:28 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords LLM-as-a-judge · semantic similarity · sensitivity testing · positional bias · context coherence · model fingerprint · document comparison · perturbation analysis

The pith

LLM similarity scores depend on where a change occurs in a document, on the coherence of the surrounding text, and on which model is judging, not only on the meaning of the change itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how large language models behave when asked to score semantic similarity between pairs of documents. It places one altered sentence inside longer texts and changes its position, the nature of the alteration, and whether the surrounding sentences match the topic. Models consistently treat the same semantic change differently depending on these structural factors. The differences appear across multiple models and thousands of examples, showing that the scores capture more than pure meaning. The authors supply a reusable method to measure and compare these effects for any current or future model.

Core claim

LLMs penalize semantic differences more harshly when the altered sentence appears earlier in the document; they assign systematically lower and more polarized scores when the change sits in topically unrelated context; and each model produces a stable, model-specific scoring distribution that persists across perturbation types, even as all models share a common ordering of leniency across kinds of alteration.

What carries the argument

The needle-in-a-haystack experimental framework that embeds one controlled semantic perturbation inside varying context and position, then measures similarity scores across all combinations.
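As a concrete illustration of that machinery, here is a minimal sketch of how such a factorial grid could be assembled. The factor levels, helper names, and corpus handling below are assumptions for exposition, not the paper's exact design parameters.

    import itertools
    import random

    PERTURBATIONS = ("negation", "conjunction_swap", "entity_replacement")
    CONTEXTS = ("original", "unrelated")      # topical coherence of the hay
    POSITIONS = (0.0, 0.25, 0.5, 0.75, 1.0)   # relative needle position (assumed levels)
    LENGTHS = (10, 20, 40)                    # sentences per document (assumed levels)

    def build_pair(topic_sents, filler_sents, perturb_fn, context, position, length):
        """Return (doc_a, doc_b): identical hay, one perturbed needle sentence."""
        hay = topic_sents[:length] if context == "original" else random.sample(filler_sents, length)
        idx = round(position * (length - 1))  # map relative position to a sentence slot
        doc_a, doc_b = list(hay), list(hay)
        doc_b[idx] = perturb_fn(doc_a[idx])   # the needle: one controlled semantic edit
        return " ".join(doc_a), " ".join(doc_b)

    # 3 perturbations x 2 contexts x 5 positions x 3 lengths = 90 cells; replicated
    # over many source documents and scored by several judges, this scales to the
    # tens of thousands of pairs the paper reports.
    grid = list(itertools.product(PERTURBATIONS, CONTEXTS, POSITIONS, LENGTHS))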

If this is right

  • Similarity scores will be harsher for changes that occur near the start of a document than for equivalent changes near the end.
  • Placing a semantic change inside topically unrelated text will drive scores toward the extremes of very low or very high similarity.
  • Each model can be identified by its own stable scoring pattern even when the type of semantic change is varied.
  • The same testing procedure can be applied directly to new models to audit their sensitivity without retraining or fine-tuning.
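The last point is directly actionable. Below is a hedged sketch of one way such an audit could compare a new judge's score distribution against a stored fingerprint, using an earth-mover distance; the synthetic distributions are illustrative assumptions, not the paper's data or method.

    import numpy as np
    from scipy.stats import wasserstein_distance

    def fingerprint_distance(scores_a, scores_b):
        """Earth-mover distance between two judges' similarity-score distributions."""
        return wasserstein_distance(scores_a, scores_b)

    rng = np.random.default_rng(0)
    # Illustrative fingerprints: one judge lenient and tightly clustered, one
    # bipolarized toward the extremes (the pattern reported under unrelated context).
    judge_a = np.clip(rng.normal(0.80, 0.05, 5000), 0.0, 1.0)
    judge_b = np.concatenate([rng.beta(2, 10, 2500), rng.beta(10, 2, 2500)])

    print(fingerprint_distance(judge_a, judge_b))  # large gap => distinct fingerprints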

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Ranking or retrieval systems that rely on these scores may systematically undervalue documents whose key differences appear later in the text.
  • Providing more coherent surrounding context could be a practical way to reduce the impact of isolated alterations on model judgments.
  • The shared leniency hierarchy across models suggests a common underlying mechanism that could be probed in other evaluation tasks such as summarization or question answering.

Load-bearing premise

The chosen sentence alterations, context contrasts, and document lengths are representative of the semantic differences that arise in practical document comparison tasks.

What would settle it

A replication that uses different perturbation types or new document domains and finds no consistent positional bias or no difference between related and unrelated context would show the sensitivities are narrower than claimed.

read the original abstract

We propose a scalable, multifactorial experimental framework that systematically probes LLM sensitivity to subtle semantic changes in pairwise document comparison. We analogize this as a needle-in-a-haystack problem: a single semantically altered sentence (the needle) is embedded within surrounding context (the hay), and we vary the perturbation type (negation, conjunction swap, named entity replacement), context type (original vs. topically unrelated), needle position, and document length across all combinations, testing five LLMs on tens of thousands of document pairs. Our analysis reveals several striking findings. First, LLMs exhibit a within-document positional bias distinct from previously studied candidate-order effects: most models penalize semantic differences more harshly when they occur earlier in a document. Second, when the altered sentence is surrounded by topically unrelated context, it systematically lowers similarity scores and induces bipolarized scores that indicate either very low or very high similarity. This is consistent with an interpretive frame account in which topically-related context may allow models to contextualize and downweight the alterations. Third, each LLM produces a qualitatively distinct scoring distribution, a stable "fingerprint" that is invariant to perturbation type, yet all models share a universal hierarchy in how leniently they treat different perturbation types. Together, these results demonstrate that LLM semantic similarity scores are sensitive to document structure, context coherence, and model identity in ways that go beyond the semantic change itself, and that the proposed framework offers a practical, LLM-agnostic toolkit for auditing and comparing scoring behavior across current and future models.
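The three perturbation types named in the abstract are easy to illustrate, as in the sketch below. These string-level stand-ins are assumptions for exposition; a faithful implementation would need real parsing and named-entity recognition (the paper's references include spaCy) rather than regexes.

    import re

    def negate(sentence):
        """Negation: flip the polarity of the first auxiliary verb."""
        return re.sub(r"\b(is|was|are|were)\b", r"\1 not", sentence, count=1)

    def conjunction_swap(sentence):
        """Conjunction swap: 'and' <-> 'or', a scope-changing edit."""
        table = {"and": "or", "or": "and"}
        return " ".join(table.get(w, w) for w in sentence.split())

    def entity_replace(sentence, old_entity, new_entity):
        """Named entity replacement: swap one entity for a same-type distractor."""
        return sentence.replace(old_entity, new_entity)

    print(negate("The treaty was signed in Vienna."))
    # -> The treaty was not signed in Vienna.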

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a scalable, multifactorial experimental framework for probing LLM sensitivity to subtle semantic changes in pairwise document comparison, framed as a 'needle-in-a-haystack' problem. A single altered sentence (via negation, conjunction swap, or named entity replacement) is embedded in varying contexts (original vs. topically unrelated), positions, and document lengths. Five LLMs are tested on tens of thousands of pairs, revealing positional bias (harsher penalties for early differences), bipolarized scores with unrelated context, distinct model-specific scoring 'fingerprints' invariant to perturbation type, and a universal hierarchy in leniency across perturbation types. The work positions the framework as an LLM-agnostic toolkit for auditing and comparing scoring behavior.

Significance. If the results hold, this work would be significant for NLP and LLM evaluation research by demonstrating that semantic similarity scores from LLMs are influenced by document structure, context coherence, and model identity beyond the semantic change itself. The large-scale, controlled, multifactorial design (varying perturbation type, context, position, and length) and identification of stable model fingerprints provide a practical, reproducible method for auditing current and future models. This addresses a key gap in understanding LLM-as-a-judge reliability for tasks like retrieval and evaluation, with the toolkit offering clear utility for the community.

major comments (2)
  1. [Experimental results section] The description of tens of thousands of document pairs provides no statistical details, error bars, confidence intervals, p-values, or controls for prompt sensitivity (e.g., testing multiple prompt templates or paraphrases). This leaves claims of positional bias, bipolarized scores, and model fingerprints only partially supported, as observed differences could arise from sampling noise or prompt-specific artifacts rather than robust effects.
  2. [Method section on perturbation and context design] The three perturbation types (negation, conjunction swap, named entity replacement) and original-vs-unrelated context contrast are presented without justification or validation that they represent the range of semantic differences arising in practical document comparison tasks. If these synthetic 'needles' induce larger or qualitatively different meaning shifts than natural paraphrases, implications, or domain edits, the reported sensitivities may be specific to the haystack construction and not generalizable, weakening the central claim that effects go 'beyond the semantic change itself'.
minor comments (2)
  1. [Abstract] While the high-level findings are clear, including precise counts (e.g., exact number of pairs per condition and specific model names) and a one-sentence note on the analysis approach would improve immediate accessibility for readers.
  2. [Figure captions and legends] Ensure all visualizations of score distributions explicitly label axes, include sample sizes per condition, and note any aggregation methods to support interpretation of the 'bipolarized' and 'fingerprint' patterns.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the statistical presentation and methodological justification in our work. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Experimental results section] The description of tens of thousands of document pairs provides no statistical details, error bars, confidence intervals, p-values, or controls for prompt sensitivity (e.g., testing multiple prompt templates or paraphrases). This leaves claims of positional bias, bipolarized scores, and model fingerprints only partially supported, as observed differences could arise from sampling noise or prompt-specific artifacts rather than robust effects.

    Authors: We agree that additional statistical details would improve the robustness of the presented findings. The scale of the experiments (tens of thousands of pairs) was intended to provide stability, but we acknowledge the lack of explicit error bars, confidence intervals, and p-values in the current results section. In the revised manuscript, we will add error bars to figures, report confidence intervals for key metrics, and include appropriate statistical tests for the main effects. We will also add controls for prompt sensitivity by evaluating multiple prompt templates and paraphrases, confirming that the core patterns (positional bias, bipolarization, and model fingerprints) hold across variations. These revisions will be incorporated to better rule out sampling noise or prompt artifacts (a minimal sketch of one such bootstrap check appears after these responses). revision: yes

  2. Referee: [Method section on perturbation and context design] The three perturbation types (negation, conjunction swap, named entity replacement) and original-vs-unrelated context contrast are presented without justification or validation that they represent the range of semantic differences arising in practical document comparison tasks. If these synthetic 'needles' induce larger or qualitatively different meaning shifts than natural paraphrases, implications, or domain edits, the reported sensitivities may be specific to the haystack construction and not generalizable, weakening the central claim that effects go 'beyond the semantic change itself'.

    Authors: We selected negation, conjunction swap, and named entity replacement as they correspond to well-established categories of semantic alteration in the NLP literature on meaning-preserving and meaning-changing edits, enabling precise isolation of perturbation type while keeping changes minimal and controlled. The original-versus-unrelated context contrast was chosen to directly test the role of topical coherence in modulating LLM judgments. In the revised methods section, we will add explicit justification with references to prior semantic perturbation studies and clarify the design rationale. We will also include a limitations discussion acknowledging that these synthetic perturbations may not exhaustively cover all natural variations (e.g., paraphrases or domain-specific edits) and that broader validation would require additional experiments. This does not undermine the central claim, which concerns sensitivity to structural and contextual factors beyond any given semantic change within the controlled framework; the toolkit remains extensible to other perturbation types. revision: yes
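The statistical control both sides agree on is straightforward to make concrete. Below is a minimal sketch of a bootstrap confidence interval for the early-versus-late score gap; the inputs are hypothetical, and an interval that excludes zero would argue against sampling noise as the source of the positional bias.

    import numpy as np

    def bootstrap_ci(early_scores, late_scores, n_boot=10_000, alpha=0.05, seed=0):
        """Percentile CI for mean(late) - mean(early), resampling with replacement."""
        rng = np.random.default_rng(seed)
        diffs = np.empty(n_boot)
        for i in range(n_boot):
            diffs[i] = (rng.choice(late_scores, size=len(late_scores)).mean()
                        - rng.choice(early_scores, size=len(early_scores)).mean())
        return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])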

Circularity Check

0 steps flagged

No circularity: purely empirical experimental study

full rationale

The paper describes a multifactorial experimental framework that embeds controlled semantic perturbations (negation, conjunction swap, named entity replacement) into documents and measures LLM similarity scores across variations in context, position, and length. All reported findings—positional bias, bipolarization under unrelated context, and model-specific fingerprints—derive directly from the tens of thousands of pairwise comparisons performed on five LLMs. No equations, fitted parameters, predictions, or uniqueness theorems are invoked; the central claims rest on observable experimental outcomes rather than any self-referential reduction or self-citation chain. References to prior candidate-order effects are external and non-load-bearing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard assumptions about LLM prompting behavior and the representativeness of the chosen perturbations; no new free parameters, axioms beyond domain norms, or invented entities are introduced.

axioms (1)
  • domain assumption: LLMs can be prompted to produce consistent numerical similarity scores between document pairs
    The entire sensitivity analysis depends on treating the model outputs as reliable measurements of semantic similarity.
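That axiom is also the measurement instrument, so it is worth seeing how thin it is. Below is a hedged sketch of score elicitation and parsing; the prompt wording and the 0-100 scale are assumptions, not the paper's protocol.

    import re

    JUDGE_PROMPT = (
        "Rate the semantic similarity of the two documents on a scale from 0 "
        "(unrelated) to 100 (identical meaning). Reply with the number only.\n\n"
        "Document A:\n{a}\n\nDocument B:\n{b}"
    )

    def parse_score(reply):
        """Extract the first number from the judge's reply; None if malformed."""
        match = re.search(r"\d+(?:\.\d+)?", reply)
        return float(match.group()) if match else None

    # Any reply that fails to parse, drifts in scale, or varies across resamples
    # erodes the axiom, and with it the downstream analysis.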

pith-pipeline@v0.9.0 · 5595 in / 1345 out tokens · 50466 ms · 2026-05-10T04:28:28.412862+00:00 · methodology

