arxiv: 2605.16386 · v1 · pith:V2COHBZMnew · submitted 2026-05-11 · 💻 cs.CV

Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

Pith reviewed 2026-05-20 22:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal LLMs · central tendency bias · ordinal clinical scoring · Clock Drawing Test · Shulman rubric · LLM evaluation bias · cognitive screening · model auditing

The pith

Multimodal LLMs compress ordinal clinical scores toward the middle of the scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests multimodal large language models as automated scorers for the Clock Drawing Test using the Shulman rubric on two public datasets. It compares zero-shot LLM performance to fully fine-tuned vision transformers and finds that LLMs stay competitive on tolerance-based metrics yet reveal a clear pattern when errors are broken down by score. All three LLM families systematically push predictions inward, overestimating the lowest scores and underestimating the highest ones. The pattern survives changes to the prompt that add balanced examples across the full range or remove clinical wording. Readers should care because the extremes of the scale are exactly where scoring errors most affect whether cognitive impairment screening leads to further care.

Core claim

All three LLM families exhibit a pronounced central tendency effect in which predictions are systematically compressed toward the middle of the scale, producing over-prediction at the low end (true scores 0 to 1) and under-prediction at the high end (true scores 4 to 5). This endpoint compression is not removed by few-shot exemplars that span the full score range or by stripping clinical terminology from the prompt. The effect hits the clinically critical extremes hardest, where accurate distinction between severe impairment and normal performance most influences screening decisions.

What carries the argument

Per-score error breakdown that isolates the endpoint compression pattern in multimodal LLM outputs on the 0-5 ordinal rubric.
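
A per-score breakdown of this kind can be sketched in a few lines. The snippet below is an illustration only: the function name `signed_error_by_score` and the toy labels are assumptions, not the paper's code or data. It groups items by ground-truth score and averages prediction minus label, so a positive value at the low end together with a negative value at the high end is the compression signature described above.

```python
from collections import defaultdict

def signed_error_by_score(y_true, y_pred):
    """Mean signed error (prediction minus label) for each ground-truth score."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        sums[t] += p - t
        counts[t] += 1
    return {s: sums[s] / counts[s] for s in sorted(counts)}

# Hypothetical labels and predictions on the 0-5 scale (not data from the paper).
y_true = [0, 0, 1, 2, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 3, 4, 4, 4]

bias = signed_error_by_score(y_true, y_pred)
for score, err in bias.items():
    print(f"true score {score}: mean signed error {err:+.2f}")
```

In this toy input only the endpoints carry error, which is why the breakdown separates them cleanly from the mid-scale scores.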

If this is right

  • Screening decisions that depend on correctly identifying the lowest or highest scores will be less reliable when current LLMs are used without adjustment.
  • Aggregate metrics such as within-1 accuracy can look acceptable while large directional errors remain at the scale boundaries.
  • Any clinical deployment of LLM raters will need explicit post-hoc calibration steps that target the extremes.
  • The same compression pattern is likely to appear in other ordinal clinical rating tasks that use similar zero-shot multimodal prompts.
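
The second point above, that aggregate metrics can hide boundary errors, can be illustrated with a small sketch. The helper names and the toy scores are assumptions made for illustration and do not reproduce the paper's numbers. Every endpoint item is off by exactly one point toward the middle, so within-1 accuracy stays perfect while the signed error at the boundaries is a full point.

```python
def mae(y_true, y_pred):
    """Mean absolute error over all items."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions at most one point away from the label."""
    return sum(abs(p - t) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_signed_error(y_true, y_pred, score):
    """Mean of (prediction - label) restricted to one ground-truth score."""
    errs = [p - t for t, p in zip(y_true, y_pred) if t == score]
    return sum(errs) / len(errs)

# Hypothetical data: endpoints pulled one step inward, mid-scale exact.
y_true = [0, 0, 1, 2, 2, 3, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 2, 3, 3, 4, 4, 4]

print("MAE:", mae(y_true, y_pred))
print("within-1 accuracy:", within_one_accuracy(y_true, y_pred))
print("signed error at 0:", mean_signed_error(y_true, y_pred, 0))
print("signed error at 5:", mean_signed_error(y_true, y_pred, 5))
```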

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The bias may generalize to other multimodal medical image scoring rubrics that require fine-grained ordinal distinctions.
  • Targeted fine-tuning on balanced clinical score distributions could reduce the inward pull, though this approach was not tested in the study.
  • A lightweight calibration model applied after the LLM step might restore accuracy at the ends while preserving the base model's speed.
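
As a rough illustration of the calibration idea in the last point, the sketch below fits a straight line to raw scores on a held-out calibration set and inverts it. This is one possible approach chosen for simplicity, not a method from the paper; the names `fit_line` and `make_calibrator` and the toy numbers are hypothetical. It also assumes the rater yields fractional raw scores (for example, an average over repeated samples) and that the fitted slope is non-zero.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit ys ~ intercept + slope * xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def make_calibrator(cal_true, cal_pred, lo=0, hi=5):
    """Return a function that undoes a fitted linear compression of raw scores."""
    intercept, slope = fit_line(cal_true, cal_pred)

    def calibrate(raw):
        stretched = (raw - intercept) / slope  # invert raw = intercept + slope * true
        return min(hi, max(lo, stretched))     # clip to the rubric's range

    return calibrate

# Hypothetical calibration set: raw scores follow 1 + 0.6 * true exactly.
cal_true = [0, 1, 2, 3, 4, 5]
cal_pred = [1.0, 1.6, 2.2, 2.8, 3.4, 4.0]
calibrate = make_calibrator(cal_true, cal_pred)

print(round(calibrate(1.0), 3), round(calibrate(4.0), 3))
```

A linear stretch can only help when the raw outputs still rank drawings correctly; it cannot separate cases that the rater has collapsed onto the same raw score.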

Load-bearing premise

The central tendency bias is a property of the LLM families themselves rather than an artifact of the specific prompt phrasing or dataset selection.

What would settle it

Re-running the LLMs on a fresh set of clock drawings and finding that average predicted scores for human-labeled 0s stay near 0 and for human-labeled 5s stay near 5, with no net positive error at the low end or negative error at the high end, would falsify the compression claim.
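
The check described above can be written down directly. The sketch below is illustrative: the function names are hypothetical, and the tolerance of 0.25 points is an arbitrary assumption, since the source does not say how close to 0 or 5 the means would need to be.

```python
def endpoint_means(y_true, y_pred, low=0, high=5):
    """Mean predicted score among items labeled `low` and among items labeled `high`."""
    lows = [p for t, p in zip(y_true, y_pred) if t == low]
    highs = [p for t, p in zip(y_true, y_pred) if t == high]
    return sum(lows) / len(lows), sum(highs) / len(highs)

def compression_refuted(y_true, y_pred, tol=0.25):
    """True when both endpoint means sit within `tol` of the scale ends (tol is an assumption)."""
    low_mean, high_mean = endpoint_means(y_true, y_pred)
    return (low_mean - 0) <= tol and (5 - high_mean) <= tol

# Hypothetical re-run results, not data from the paper.
labels = [0, 0, 0, 0, 5, 5, 5, 5]
compressed = [1, 1, 0, 1, 4, 4, 5, 4]
faithful = [0, 0, 0, 0, 5, 5, 5, 5]

print(endpoint_means(labels, compressed))
print(compression_refuted(labels, compressed), compression_refuted(labels, faithful))
```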

Figures

Figures reproduced from arXiv: 2605.16386 by Bhanu Cherukuvada, Catherine Price, Jessica Sena, Jiaqing Zhang, Miguel Contreras, Parisa Rashidi, Sandeep Elluri, Scott Siegel, Subhash Nerella, Yonah Joffe.

Figure 1: Predicted-score distributions versus ground truth. Supervised models (left) approximate the …
Figure 3: Confusion matrices for ViT-Ordinal (unfrozen) and GPT-5 (zero-shot) on NHATS and …
Figure 2: Score-level calibration. Supervised models (solid) cluster near the identity diagonal; LLM judges (dashed) exhibit shallower slopes. Supervised models cluster near the identity diagonal, while all three LLMs produce calibration curves with noticeably shallower slopes: mean predictions lie above the diagonal at the low end (true scores 0–1) and below it at the high end (true scores 4–5). A bootstrap test…
Figure 4: Confusion matrices for supervised models. Unfrozen ViT variants concentrate mass along …
Figure 5: Confusion matrices for LLM judges under zero-shot clinical prompting. All three models …
Figure 6: Confusion matrices under prompt ablations. Few-shot prompting increases diagonal mass …
Original abstract

Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67, within-1 accuracy 92%) despite higher absolute error. However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are systematically compressed toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction at the high end (score 5 to 4). This effect disproportionately affects the clinically critical extremes where accurate scoring most impacts screening decisions for cognitive impairment. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, and highlight the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript benchmarks three multimodal LLM families against fine-tuned Vision Transformers for ordinal scoring of Clock Drawing Test images on two public datasets using the Shulman rubric. It reports that LLMs achieve competitive tolerance-based agreement (e.g., GPT-5 MAE 0.67, within-1 accuracy 92%) but exhibit a central tendency bias with systematic over-prediction at low scores (0-1) and under-prediction at high scores (4-5), an effect that persists in ablations using full-range few-shot examples and removal of clinical terminology. Supervised models show superior calibration (MAE 0.52, within-1 accuracy 91%). The work positions this as an extension of LLM-as-a-judge bias literature to clinical multimodal assessment.

Significance. If the central tendency bias result holds after addressing robustness concerns, the paper makes a useful contribution by documenting a reproducible failure mode of frontier multimodal LLMs on clinically critical ordinal scales. The comparison to supervised baselines, the persistence across ablations, and the emphasis on implications for high-stakes screening decisions provide concrete evidence that calibration-aware evaluation is needed before deployment. This extends existing NLP bias findings to a multimodal clinical setting with direct relevance to cognitive impairment screening.

major comments (1)
  1. [Per-score analysis] Per-score analysis (results section): The central claim of a 'pronounced central tendency effect' rests on directional errors at score extremes, yet the manuscript provides neither per-bin sample counts, score histograms, nor error bars/confidence intervals on the per-score metrics. Clinical ordinal datasets are typically imbalanced with sparse tails; without these quantities it is impossible to assess whether the reported over-prediction at 0-1 and under-prediction at 4-5 are statistically reliable or driven by a small number of ambiguous cases. Adding these diagnostics is load-bearing for the robustness of the bias claim.
minor comments (2)
  1. [Methods] Clarify the exact model versions and prompt templates used for each LLM family; the abstract references 'GPT-5' which may be a placeholder.
  2. [Abstract] The abstract states 'all three LLM families' but does not name them explicitly; list the families in the first paragraph of the results for immediate clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment on per-score analysis below and will incorporate the requested diagnostics to strengthen the robustness of our central tendency bias claims.

  1. Referee: Per-score analysis (results section): The central claim of a 'pronounced central tendency effect' rests on directional errors at score extremes, yet the manuscript provides neither per-bin sample counts, score histograms, nor error bars/confidence intervals on the per-score metrics. Clinical ordinal datasets are typically imbalanced with sparse tails; without these quantities it is impossible to assess whether the reported over-prediction at 0-1 and under-prediction at 4-5 are statistically reliable or driven by a small number of ambiguous cases. Adding these diagnostics is load-bearing for the robustness of the bias claim.

    Authors: We agree that the per-score results need explicit sample counts, histograms, and statistical uncertainty measures to confirm the reliability of the bias at the extremes. In the revised manuscript we will add: (1) a figure with ground-truth score histograms for both datasets, (2) a table reporting the exact number of images per Shulman score bin, and (3) 95% bootstrap confidence intervals on the per-score MAE and signed bias values. These additions will allow readers to verify that the observed over-prediction at scores 0-1 and under-prediction at scores 4-5 are not artifacts of very small tail samples. We expect the core finding to remain unchanged and to be more convincingly supported.

    revision: yes
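
For readers who want to see what item (3) might look like in practice, here is a minimal percentile-bootstrap sketch for the mean signed error within a single score bin. It is an assumption about one reasonable way to compute such an interval, not the authors' procedure; the function name, resample count, seed, and toy errors are all hypothetical.

```python
import random

def bootstrap_ci(errors, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `errors`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(errors)
    means = []
    for _ in range(n_boot):
        # Resample the bin with replacement and record the resampled mean.
        sample = [errors[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical signed errors (prediction - label) for images labeled 0.
errors_at_zero = [1, 1, 0, 1, 2, 1, 0, 1]
low, high = bootstrap_ci(errors_at_zero)
print(f"n = {len(errors_at_zero)}, 95% CI for mean signed error: [{low:.2f}, {high:.2f}]")
```

Reporting the bin size next to the interval matters here: with only a handful of images in a tail bin, the interval will be wide regardless of how the point estimate looks.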

Circularity Check

0 steps flagged

No circularity: direct empirical benchmarking against ground-truth labels

The paper is a purely observational benchmarking study that measures LLM outputs directly against human-annotated ground-truth scores on two public datasets. It reports per-score error patterns, compares to supervised baselines, and tests prompt ablations. No derivation chain, fitted parameters renamed as predictions, self-referential equations, or load-bearing self-citations exist. The central tendency observation is computed from model predictions versus external labels and does not reduce to any input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study rests on standard clinical assumptions about the validity of the Shulman rubric and the representativeness of the public datasets; no free parameters are fitted to produce the bias result and no new entities are postulated.

axioms (1)
  • domain assumption: The Shulman rubric provides a reliable ordinal ground truth for Clock Drawing Test images in clinical assessment.
    All model comparisons and bias measurements are defined relative to scores produced under this rubric.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are systematically compressed toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction at the high end (score 5 to 4)

  • IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

    Relation between the paper passage and the cited Recognition theorem.

    Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
