Analysis of Blood Report Images Using General Purpose Vision-Language Models
Pith reviewed 2026-05-18 18:18 UTC · model grok-4.3
The pith
General-purpose vision-language models can interpret blood report images to give patients clear explanations of results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
General-purpose VLMs are a practical technology for preliminary blood report analysis because they can extract and explain key information directly from report images, thereby improving health literacy and reducing barriers to understanding complex medical data.
What carries the argument
Comparative prompting of three VLMs on blood report images followed by Sentence-BERT similarity scoring of their free-text answers.
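As a rough illustration of the scoring step, the sketch below computes pairwise cosine similarity between one answer per model. It is not the paper's code. The three-dimensional vectors are toy stand-ins for real Sentence-BERT embeddings, and the function names and model keys are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings):
    """Map each unordered pair of model names to the similarity of their answers."""
    return {
        (a, b): cosine_similarity(embeddings[a], embeddings[b])
        for a, b in combinations(sorted(embeddings), 2)
    }


# Assumption: toy vectors standing in for sentence embeddings of each answer.
toy_embeddings = {
    "qwen-vl-max": [1.0, 0.0, 0.0],
    "gemini-2.5-pro": [1.0, 1.0, 0.0],
    "llama-4-maverick": [0.0, 1.0, 0.0],
}

scores = pairwise_similarities(toy_embeddings)
mean_score = sum(scores.values()) / len(scores)
```

With three models there are three unordered pairs per question, so a single reported average hides which pair disagrees.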
If this is right
- Patient-facing apps could let users photograph blood reports and receive immediate plain-language summaries.
- Health literacy may increase because explanations come directly from the image without manual data entry.
- The same prompting and comparison method could support quick checks on other printed lab or imaging documents.
- Development of specialized medical VLMs could start from these general models rather than from scratch.
Where Pith is reading between the lines
- The approach could be tested on reports from different countries or labs to check robustness across formats and languages.
- Real deployment would still require separate validation against ground-truth lab values and patient outcomes.
- Mobile-phone versions might reduce anxiety by giving users an on-the-spot draft interpretation before a clinic visit.
Load-bearing premise
That agreement between different models' answers, scored by sentence embeddings, reliably shows the answers are medically correct and useful.
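A toy case shows why this premise carries so much weight. In the hypothetical below, all three models misread the same value, so agreement is perfect while accuracy is zero. The model names, values, and helper functions are invented for illustration and do not come from the paper.

```python
def agreement_rate(answers):
    """Fraction of unordered model pairs that give an identical answer."""
    models = sorted(answers)
    pairs = [(a, b) for i, a in enumerate(models) for b in models[i + 1:]]
    same = sum(1 for a, b in pairs if answers[a] == answers[b])
    return same / len(pairs)


def accuracy(answers, truth):
    """Fraction of models whose answer equals the ground-truth value."""
    return sum(1 for value in answers.values() if value == truth) / len(answers)


# Hypothetical scenario: every model reads a printed 13.5 as 18.5.
answers = {
    "model_a": "18.5 g/dL",
    "model_b": "18.5 g/dL",
    "model_c": "18.5 g/dL",
}
truth = "13.5 g/dL"

agreement = agreement_rate(answers)  # all pairs match
acc = accuracy(answers, truth)       # no model is correct
```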
What would settle it
A side-by-side comparison of the VLM outputs against independent physician annotations on the same 100 reports, measuring factual errors or clinical omissions.
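One way such a comparison could be scored is sketched below, assuming each report has been reduced to a mapping from parameter name to numeric value. The data layout, the 1% relative tolerance, and the function name are assumptions made for this example, not something the paper specifies.

```python
def score_against_annotations(extracted, annotated, rel_tol=0.01):
    """Compare model-extracted values with physician-annotated values.

    A parameter missing from `extracted` counts as an omission. A value that
    differs from the annotation by more than `rel_tol` counts as an error.
    """
    if not annotated:
        raise ValueError("annotated must contain at least one parameter")
    errors, omissions = [], []
    for name, true_value in annotated.items():
        if name not in extracted:
            omissions.append(name)
        elif abs(extracted[name] - true_value) > rel_tol * abs(true_value):
            errors.append(name)
    total = len(annotated)
    return {
        "error_rate": len(errors) / total,
        "omission_rate": len(omissions) / total,
        "errors": errors,
        "omissions": omissions,
    }


# Hypothetical annotation and model output for a single report.
annotated = {"hemoglobin": 13.5, "wbc": 7.2, "platelets": 250.0, "glucose": 95.0}
extracted = {"hemoglobin": 13.5, "wbc": 72.0, "platelets": 250.0}

result = score_against_annotations(extracted, annotated)
```

Here the misplaced decimal in `wbc` is counted as a factual error and the missing `glucose` as an omission, which are the two failure types named above.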
Original abstract
The reliable analysis of blood reports is important for health knowledge, but individuals often struggle with interpretation, leading to anxiety and overlooked issues. We explore the potential of general-purpose Vision-Language Models (VLMs) to address this challenge by automatically analyzing blood report images. We conduct a comparative evaluation of three VLMs: Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick, determining their performance on a dataset of 100 diverse blood report images. Each model was prompted with clinically relevant questions adapted to each blood report. The answers were then processed using Sentence-BERT to compare and evaluate how closely the models responded. The findings suggest that general-purpose VLMs are a practical and promising technology for developing patient-facing tools for preliminary blood report analysis. Their ability to provide clear interpretations directly from images can improve health literacy and reduce the limitations to understanding complex medical information. This work establishes a foundation for the future development of reliable and accessible AI-assisted healthcare applications. While results are encouraging, they should be interpreted cautiously given the limited dataset size.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates three general-purpose vision-language models (Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick) on a dataset of 100 blood report images. Models are prompted with clinically relevant questions, and performance is assessed by computing pairwise similarities between responses using Sentence-BERT. The authors conclude that these VLMs constitute a practical and promising technology for patient-facing preliminary analysis tools that can improve health literacy.
Significance. If the similarity-based findings were corroborated by direct accuracy measures, the work would provide a useful starting point for exploring accessible VLM applications in medical report interpretation. The application domain is relevant to computer vision and healthcare AI, and the comparative setup across three models offers a baseline for future studies.
major comments (2)
- [Abstract and evaluation approach] Abstract and evaluation description: the sole metric is Sentence-BERT similarity across model answers to the same prompts, with no ground-truth labels, extracted numerical values from the reports, expert correctness annotations, or accuracy/error-rate analysis. This directly undermines the central claim that the models deliver 'clear interpretations' suitable for patient-facing tools, because inter-model agreement can arise from shared biases or consistent hallucinations without matching actual report content.
- [Results and discussion] Dataset and results discussion: although the abstract notes the limited size of 100 images, the manuscript provides no breakdown of response variability, failure cases, or comparison against even simple baselines such as OCR-plus-rule extraction of lab values. This absence weakens the assertion that the approach is 'practical' for preliminary analysis.
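The OCR-plus-rule baseline mentioned in the second major comment could look roughly like the sketch below. It assumes OCR has already produced one text line per test in the form `name value unit low - high`. That line format, the regular expression, and the sample text are assumptions, and real reports vary widely in layout.

```python
import re

# Assumed OCR line layout: "<name> <value> <unit> <low> - <high>"
LINE_PATTERN = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z ]*?)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s+"
    r"(?P<unit>\S+)\s+"
    r"(?P<low>\d+(?:\.\d+)?)\s*-\s*(?P<high>\d+(?:\.\d+)?)$"
)


def parse_and_flag(ocr_text):
    """Extract lab values and flag each against its printed reference range."""
    results = []
    for line in ocr_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip headers, footers, and lines in another layout
        value = float(match["value"])
        low, high = float(match["low"]), float(match["high"])
        if value < low:
            flag = "low"
        elif value > high:
            flag = "high"
        else:
            flag = "normal"
        results.append((match["name"], value, match["unit"], flag))
    return results


# Hypothetical OCR output for a three-test report.
sample = """Hemoglobin 11.2 g/dL 13.0 - 17.0
Total WBC Count 7200 /cumm 4000 - 11000
Fasting Glucose 126 mg/dL 70 - 100
Report generated by lab system"""

flags = parse_and_flag(sample)
```

A baseline of this kind cannot explain results in plain language, but it provides extracted values that VLM answers can be checked against.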
minor comments (1)
- [Methods] The description of prompt adaptation to each blood report could be clarified with an example prompt template to improve reproducibility.
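To make the reproducibility request concrete, a prompt template might be documented along the lines below. The wording is purely hypothetical: the paper's actual prompts are not reproduced in this review, and `PROMPT_TEMPLATE` and `build_prompts` are illustrative names.

```python
# Hypothetical template; the paper's real prompt wording is not known here.
PROMPT_TEMPLATE = (
    "You are looking at an image of a blood test report.\n"
    "Question: What is the reported value of {parameter}, "
    "and is it within the reference range printed on the report?\n"
    "Answer in plain language suitable for a patient."
)


def build_prompts(parameters):
    """Fill the template once per parameter that appears on a given report."""
    return [PROMPT_TEMPLATE.format(parameter=p) for p in parameters]


prompts = build_prompts(["hemoglobin", "platelet count"])
```

Publishing the template together with the per-report parameter list would let others regenerate the exact questions posed to each model.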
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript evaluating general-purpose VLMs for blood report analysis. We have revised the paper to more explicitly acknowledge the limitations of our similarity-based evaluation and to expand the discussion of results. Our point-by-point responses follow.
- Referee: [Abstract and evaluation approach] Abstract and evaluation description: the sole metric is Sentence-BERT similarity across model answers to the same prompts, with no ground-truth labels, extracted numerical values from the reports, expert correctness annotations, or accuracy/error-rate analysis. This directly undermines the central claim that the models deliver 'clear interpretations' suitable for patient-facing tools, because inter-model agreement can arise from shared biases or consistent hallucinations without matching actual report content.
  Authors: We agree that inter-model similarity via Sentence-BERT is an indirect proxy and does not establish factual correctness or rule out shared hallucinations. Our original intent was to provide an initial comparative baseline across three VLMs on this task in the absence of an annotated benchmark. We have revised the abstract and added a limitations paragraph in the discussion to frame the results more cautiously, emphasizing consistency rather than verified accuracy and explicitly noting that expert validation is required before any patient-facing deployment. The comparative setup remains useful as an early reference point for future studies.
  Revision: yes
- Referee: [Results and discussion] Dataset and results discussion: although the abstract notes the limited size of 100 images, the manuscript provides no breakdown of response variability, failure cases, or comparison against even simple baselines such as OCR-plus-rule extraction of lab values. This absence weakens the assertion that the approach is 'practical' for preliminary analysis.
  Authors: We accept that additional granularity on the results would strengthen the paper. The revised manuscript now includes a new subsection with quantitative measures of response variability (e.g., average pairwise similarities and cases of divergence) and qualitative examples of failure modes such as missed parameters or inconsistent unit handling. A direct OCR-plus-rule baseline comparison was outside the scope of the original end-to-end VLM study; we have added this as a recommended direction for follow-up work rather than claiming the current approach is already optimal.
  Revision: partial
- Direct accuracy or error-rate evaluation against ground-truth labels or expert annotations, which would require new data labeling not performed in the current exploratory study.
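The response-variability measures described in the authors' second response could be computed roughly as follows. This is a sketch under assumptions: each report is taken to have a list of pairwise similarity scores, the 0.6 divergence threshold is arbitrary, and the report identifiers and numbers are made up.

```python
from statistics import mean, pstdev


def summarize_variability(per_report_scores, divergence_threshold=0.6):
    """Summarize inter-model similarity across reports.

    `per_report_scores` maps a report id to its pairwise similarity scores.
    Reports whose mean score falls below the threshold are flagged.
    """
    report_means = {rid: mean(s) for rid, s in per_report_scores.items()}
    divergent = sorted(
        rid for rid, m in report_means.items() if m < divergence_threshold
    )
    values = list(report_means.values())
    return {
        "overall_mean": mean(values),
        "spread": pstdev(values),  # population standard deviation
        "divergent_reports": divergent,
    }


# Made-up scores for three reports, three model pairs each.
toy_scores = {
    "report_001": [0.9, 0.8, 0.7],
    "report_002": [0.5, 0.4, 0.3],
    "report_003": [0.8, 0.8, 0.8],
}

summary = summarize_variability(toy_scores)
```

Flagged reports are natural candidates for the qualitative failure-case examples, although low similarity only signals disagreement and does not show which model is wrong.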
Circularity Check
No circularity: empirical evaluation of off-the-shelf VLMs using standard similarity metric
The paper conducts a direct empirical comparison of three general-purpose VLMs (Qwen-VL-Max, Gemini 2.5 Pro, Llama 4 Maverick) on 100 blood-report images. Prompts are applied and responses are compared via Sentence-BERT cosine similarity; no equations, parameter fitting, derivations, or predictions are claimed. The central claim that VLMs are promising for patient-facing tools rests on this observed similarity, which is an external measurement rather than a quantity defined by the authors' own prior work or inputs. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear. The analysis is self-contained against external benchmarks and does not reduce any result to its own definitions by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The prompts used are clinically relevant and appropriate for blood report analysis.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Passage: "The answers were then processed using Sentence-BERT to compare and evaluate how closely the models responded... overall average similarity score of 0.803."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.