Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring
Pith reviewed 2026-05-20 22:34 UTC · model grok-4.3
The pith
Multimodal LLMs compress ordinal clinical scores toward the middle of the scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
All three LLM families exhibit a pronounced central tendency effect in which predictions are systematically compressed toward the middle of the scale, producing over-prediction at the low end (scores 0 to 1) and under-prediction at the high end (scores 5 to 4). This endpoint compression is not removed by few-shot exemplars that span the full score range or by stripping clinical terminology from the prompt. The effect hits the clinically critical extremes hardest, where accurate distinction between severe impairment and normal performance most influences screening decisions.
What carries the argument
Per-score error breakdown that isolates the endpoint compression pattern in multimodal LLM outputs on the 0-5 ordinal rubric.
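A per-score breakdown of this kind can be sketched in a few lines. The snippet below is an illustration only, not the paper's code: the function name `per_score_bias` and the toy labels are assumptions. It groups predictions by the human-assigned score and reports the mean signed error (prediction minus label) for each bin, so a positive value at score 0 together with a negative value at score 5 would indicate compression toward the middle.

```python
from collections import defaultdict

def per_score_bias(y_true, y_pred):
    """Return {true_score: (n, mean signed error)}, where signed error = pred - true."""
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(p - t)
    return {s: (len(e), sum(e) / len(e)) for s, e in sorted(errors.items())}

# Hypothetical toy labels on the 0-5 rubric, not taken from the paper.
y_true = [0, 0, 1, 2, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 3, 4, 4, 4]

for score, (n, bias) in per_score_bias(y_true, y_pred).items():
    print(f"true={score} n={n} signed_bias={bias:+.2f}")
```

Reporting the bin size `n` next to each bias value matters because the tails of clinical ordinal datasets are often sparse.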
If this is right
- Screening decisions that depend on correctly identifying the lowest or highest scores will be less reliable when current LLMs are used without adjustment.
- Aggregate metrics such as within-1 accuracy can look acceptable while large directional errors remain at the scale boundaries.
- Any clinical deployment of LLM raters will need explicit post-hoc calibration steps that target the extremes.
- The same compression pattern is likely to appear in other ordinal clinical rating tasks that use similar zero-shot multimodal prompts.
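The second point above, that within-1 accuracy can hide boundary errors, is easy to reproduce on made-up data. In this sketch (all names and numbers are hypothetical, not from the paper) a rater moves every endpoint one step inward and still scores perfectly on within-1 accuracy.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def within_one(y_true, y_pred):
    """Fraction of predictions at most one point from the label."""
    return sum(abs(p - t) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_signed_error(y_true, y_pred, score):
    """Mean of (pred - true) over items whose label equals `score`."""
    diffs = [p - t for t, p in zip(y_true, y_pred) if t == score]
    return sum(diffs) / len(diffs)

# Hypothetical rater that pulls every 0 up to 1 and every 5 down to 4.
labels = [0, 0, 0, 2, 3, 3, 5, 5, 5]
preds = [1, 1, 1, 2, 3, 3, 4, 4, 4]

print("within-1 accuracy:", within_one(labels, preds))
print("MAE:", round(mae(labels, preds), 2))
print("signed error at 0:", mean_signed_error(labels, preds, 0))
print("signed error at 5:", mean_signed_error(labels, preds, 5))
```

Here within-1 accuracy is 100% even though no true 0 and no true 5 is scored correctly, which is why the signed per-score error is the more informative diagnostic at the extremes.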
Where Pith is reading between the lines
- The bias may generalize to other multimodal medical image scoring rubrics that require fine-grained ordinal distinctions.
- Targeted fine-tuning on balanced clinical score distributions could reduce the inward pull, though this approach was not tested in the study.
- A lightweight calibration model applied after the LLM step might restore accuracy at the ends while preserving the base model's speed.
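One possible form of the lightweight calibration step mentioned in the last point is a linear stretch fitted on a small labeled calibration set. This is a sketch under stated assumptions, not something the paper evaluated: the helper names `fit_linear_stretch` and `calibrate`, the least-squares choice, and the toy scores are all hypothetical.

```python
def fit_linear_stretch(pred, true):
    """Least-squares fit of true ~ a * pred + b on a labeled calibration set.

    Assumes the predictions are not all identical (otherwise sxx is zero).
    """
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    sxx = sum((p - mean_p) ** 2 for p in pred)
    sxy = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    a = sxy / sxx
    return a, mean_t - a * mean_p

def calibrate(score, a, b, lo=0, hi=5):
    """Apply the stretch, round to the nearest integer, clip to the rubric range."""
    return min(hi, max(lo, round(a * score + b)))

# Hypothetical calibration set in which the rater never uses 0 or 5.
cal_pred = [1, 1, 2, 3, 4, 4]
cal_true = [0, 0, 2, 3, 5, 5]

a, b = fit_linear_stretch(cal_pred, cal_true)
print([calibrate(s, a, b) for s in range(6)])
```

A fitted slope above 1 pushes raw scores away from the centre. Whether such a mapping transfers to new images or new models would have to be checked on held-out data.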
Load-bearing premise
The central tendency bias is a property of the LLM families themselves rather than an artifact of the specific prompt phrasing or dataset selection.
What would settle it
Re-running the LLMs on a fresh set of clock drawings and finding that average predicted scores for human-labeled 0s stay near 0 and for human-labeled 5s stay near 5, with no net positive error at the low end or negative error at the high end, would falsify the compression claim.
Original abstract
Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67, within-1 accuracy 92%) despite higher absolute error. However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are systematically compressed toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction at the high end (score 5 to 4). This effect disproportionately affects the clinically critical extremes where accurate scoring most impacts screening decisions for cognitive impairment. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, and highlight the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript benchmarks three multimodal LLM families against fine-tuned Vision Transformers for ordinal scoring of Clock Drawing Test images on two public datasets using the Shulman rubric. It reports that LLMs achieve competitive tolerance-based agreement (e.g., GPT-5 MAE 0.67, within-1 accuracy 92%) but exhibit a central tendency bias with systematic over-prediction at low scores (0-1) and under-prediction at high scores (4-5), an effect that persists in ablations using full-range few-shot examples and removal of clinical terminology. Supervised models show superior calibration (MAE 0.52, within-1 accuracy 91%). The work positions this as an extension of LLM-as-a-judge bias literature to clinical multimodal assessment.
Significance. If the central tendency bias result holds after addressing robustness concerns, the paper makes a useful contribution by documenting a reproducible failure mode of frontier multimodal LLMs on clinically critical ordinal scales. The comparison to supervised baselines, the persistence across ablations, and the emphasis on implications for high-stakes screening decisions provide concrete evidence that calibration-aware evaluation is needed before deployment. This extends existing NLP bias findings to a multimodal clinical setting with direct relevance to cognitive impairment screening.
major comments (1)
- [Per-score analysis] Per-score analysis (results section): The central claim of a 'pronounced central tendency effect' rests on directional errors at score extremes, yet the manuscript provides neither per-bin sample counts, score histograms, nor error bars/confidence intervals on the per-score metrics. Clinical ordinal datasets are typically imbalanced with sparse tails; without these quantities it is impossible to assess whether the reported over-prediction at 0-1 and under-prediction at 4-5 are statistically reliable or driven by a small number of ambiguous cases. Adding these diagnostics is load-bearing for the robustness of the bias claim.
minor comments (2)
- [Methods] Clarify the exact model versions and prompt templates used for each LLM family; the abstract references 'GPT-5' which may be a placeholder.
- [Abstract] The abstract states 'all three LLM families' but does not name them explicitly; list the families in the first paragraph of the results for immediate clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comment on per-score analysis below and will incorporate the requested diagnostics to strengthen the robustness of our central tendency bias claims.
Point-by-point responses
- Referee: Per-score analysis (results section): The central claim of a 'pronounced central tendency effect' rests on directional errors at score extremes, yet the manuscript provides neither per-bin sample counts, score histograms, nor error bars/confidence intervals on the per-score metrics. Clinical ordinal datasets are typically imbalanced with sparse tails; without these quantities it is impossible to assess whether the reported over-prediction at 0-1 and under-prediction at 4-5 are statistically reliable or driven by a small number of ambiguous cases. Adding these diagnostics is load-bearing for the robustness of the bias claim.
  Authors: We agree that the current presentation of per-score results would benefit from explicit sample counts, histograms, and statistical uncertainty measures to confirm the reliability of the bias at the extremes. In the revised manuscript we will add: (1) a figure with ground-truth score histograms for both datasets, (2) a table reporting the exact number of images per Shulman score bin, and (3) 95% bootstrap confidence intervals on the per-score MAE and signed bias values. These additions will allow readers to verify that the observed over-prediction at scores 0-1 and under-prediction at scores 4-5 are not artifacts of very small tail samples. We expect the core finding to remain unchanged and to be more convincingly supported.
  Revision: yes
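The bootstrap intervals promised in the response could be computed along the following lines. This is a minimal sketch assuming a percentile bootstrap on the per-bin signed errors. The function names, the resample count, and the toy data are assumptions and are not drawn from the manuscript.

```python
import random

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def per_score_bias_ci(y_true, y_pred, **kwargs):
    """Return {true_score: (n, mean signed error, (ci_low, ci_high))}."""
    out = {}
    for s in sorted(set(y_true)):
        errs = [p - t for t, p in zip(y_true, y_pred) if t == s]
        out[s] = (len(errs), sum(errs) / len(errs), bootstrap_ci(errs, **kwargs))
    return out

# Hypothetical labels: eight images with true score 0, one with true score 5.
truth = [0, 0, 0, 0, 0, 0, 0, 0, 5]
model = [0, 1, 1, 2, 1, 0, 1, 1, 4]

for score, (n, bias, (lo, hi)) in per_score_bias_ci(truth, model).items():
    print(f"true={score} n={n} bias={bias:+.2f} 95% CI=({lo:+.2f}, {hi:+.2f})")
```

A bin holding a single image yields an interval that collapses to one point, which says nothing about reliability. This is one reason the per-bin counts requested by the referee need to be reported alongside the intervals.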
Circularity Check
No circularity: direct empirical benchmarking against ground-truth labels
The paper is a purely observational benchmarking study that measures LLM outputs directly against human-annotated ground-truth scores on two public datasets. It reports per-score error patterns, compares to supervised baselines, and tests prompt ablations. No derivation chain, fitted parameters renamed as predictions, self-referential equations, or load-bearing self-citations exist. The central tendency observation is computed from model predictions versus external labels and does not reduce to any input by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The Shulman rubric provides a reliable ordinal ground truth for Clock Drawing Test images in clinical assessment.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are systematically compressed toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction at the high end (score 5 to 4)"
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.