
arxiv: 2509.06033 · v1 · submitted 2025-09-07 · 💻 cs.CV

Analysis of Blood Report Images Using General Purpose Vision-Language Models

Pith reviewed 2026-05-18 18:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · blood report analysis · medical image interpretation · patient health literacy · AI-assisted healthcare · preliminary diagnosis tools

The pith

General-purpose vision-language models can interpret blood report images to give patients clear explanations of results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether off-the-shelf vision-language models can read images of blood reports and answer questions about the findings. It runs three models (Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick) on 100 varied reports, asks every model the same clinically relevant questions for a given report, and uses Sentence-BERT embeddings to measure how closely the models' answers agree with one another. The authors conclude that these models already produce consistent, understandable interpretations that could help people make sense of their lab results without waiting for a doctor.

Core claim

General-purpose VLMs are a practical technology for preliminary blood report analysis because they can extract and explain key information directly from report images, thereby improving health literacy and reducing barriers to understanding complex medical data.

What carries the argument

Comparative prompting of three VLMs on blood report images followed by Sentence-BERT similarity scoring of their free-text answers.
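The scoring step can be sketched as pairwise cosine similarity over answer embeddings. The sketch below is illustrative only: `embed` is a toy bag-of-words stand-in for a Sentence-BERT encoder (an assumption made so the example runs without extra libraries), and the model names and answers are hypothetical, not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def embed(text):
    """Toy stand-in for a Sentence-BERT encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pairwise_similarity(answers):
    """Map each unordered pair of model names to the similarity of their answers."""
    vectors = {name: embed(text) for name, text in answers.items()}
    return {
        (m1, m2): cosine(vectors[m1], vectors[m2])
        for m1, m2 in combinations(sorted(vectors), 2)
    }


# Hypothetical answers from three models to one question about one report.
answers = {
    "model_a": "hemoglobin is below the reference range",
    "model_b": "hemoglobin is below the normal range",
    "model_c": "all values are within the reference range",
}
scores = pairwise_similarity(answers)
mean_score = sum(scores.values()) / len(scores)
```

With a real sentence encoder in place of `embed`, the same loop would give one score per model pair per question; averaging across questions and reports would then give the kind of agreement figure the paper reports.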

If this is right

  • Patient-facing apps could let users photograph blood reports and receive immediate plain-language summaries.
  • Health literacy may increase because explanations come directly from the image without manual data entry.
  • The same prompting and comparison method could support quick checks on other printed lab or imaging documents.
  • Development of specialized medical VLMs could start from these general models rather than from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on reports from different countries or labs to check robustness across formats and languages.
  • Real deployment would still require separate validation against ground-truth lab values and patient outcomes.
  • Mobile-phone versions might reduce anxiety by giving users an on-the-spot draft interpretation before a clinic visit.

Load-bearing premise

That agreement between different models' answers, scored by sentence embeddings, reliably shows the answers are medically correct and useful.
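A small worked example shows why this premise carries so much weight. In the sketch below, agreement is simplified to exact match on extracted values instead of embedding similarity, and every number is hypothetical. Two models that misread a decimal point in the same way agree perfectly while both being wrong.

```python
def agreement_rate(a, b):
    """Fraction of shared fields on which two models report the same value."""
    fields = a.keys() & b.keys()
    return sum(a[f] == b[f] for f in fields) / len(fields)


def accuracy(pred, truth):
    """Fraction of ground-truth fields that a model reproduces exactly."""
    return sum(pred.get(f) == v for f, v in truth.items()) / len(truth)


# Hypothetical ground truth and two model readings with the same decimal error.
truth = {"hemoglobin_g_dl": 13.5, "wbc_10e3_ul": 7.2}
model_a = {"hemoglobin_g_dl": 135.0, "wbc_10e3_ul": 72.0}
model_b = {"hemoglobin_g_dl": 135.0, "wbc_10e3_ul": 72.0}

agreement = agreement_rate(model_a, model_b)  # 1.0: the models agree on every field
acc_a = accuracy(model_a, truth)              # 0.0: neither value matches the report
```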

What would settle it

A side-by-side comparison of the VLM outputs against independent physician annotations on the same 100 reports, measuring factual errors or clinical omissions.
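One way such a comparison might be scored is sketched below. It assumes, purely for illustration, that both the physician annotations and the model outputs have been reduced to per-report dictionaries mapping a parameter name to a status label; the function name, the labels, and the report IDs are hypothetical.

```python
def score_against_annotations(model_findings, physician_findings):
    """Count factual errors and omissions relative to physician annotations."""
    errors = omissions = total = 0
    for report_id, reference in physician_findings.items():
        predicted = model_findings.get(report_id, {})
        for parameter, status in reference.items():
            total += 1
            if parameter not in predicted:
                omissions += 1  # the model never mentioned this parameter
            elif predicted[parameter] != status:
                errors += 1     # the model mentioned it but got the status wrong
    if total == 0:
        return {"factual_error_rate": 0.0, "omission_rate": 0.0}
    return {
        "factual_error_rate": errors / total,
        "omission_rate": omissions / total,
    }


# Hypothetical annotations for two reports.
physician = {
    "report_001": {"hemoglobin": "low", "wbc": "normal", "platelets": "high"},
    "report_002": {"hemoglobin": "normal", "glucose": "high"},
}
model = {
    "report_001": {"hemoglobin": "low", "wbc": "high"},
    "report_002": {"hemoglobin": "normal", "glucose": "high"},
}
rates = score_against_annotations(model, physician)
```

Here one of five annotated findings is wrong and one is missing, so both rates come out to one in five. Keeping errors and omissions separate matters because a model that stays silent about an abnormal value fails differently from one that misstates it.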

Figures

Figures reproduced from arXiv: 2509.06033 by Hamid Beigy, Nadia Bakhsheshi.

Figure 1: Standard architecture of a Vision-Language Model (VLM). The Vision Encoder processes the input image into embeddings, which are then projected into the language model's space by the Projector module. The Large Language Model (LLM) merges these visual tokens with text tokens from the text prompt to generate a textual response.
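The data flow in the caption can be illustrated with a toy sketch. It assumes the projector is a single linear map and that image tokens are placed before the text tokens; both are simplifying assumptions, and the dimensions and values are made up for readability.

```python
def project(patch_embeddings, weight):
    """Apply a linear projector: one matrix-vector product per image patch."""
    return [
        [sum(w * x for w, x in zip(row, patch)) for row in weight]
        for patch in patch_embeddings
    ]


def build_llm_input(image_tokens, text_tokens):
    """Concatenate projected image tokens with text token embeddings."""
    return image_tokens + text_tokens


# Two image patches from a hypothetical vision encoder (dimension 3).
patches = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
# Hypothetical projector weights mapping dimension 3 to the LLM dimension 2.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
# Three text token embeddings already in the LLM dimension.
text_tokens = [[0.5, 0.5], [0.1, 0.9], [0.7, 0.3]]

image_tokens = project(patches, weight)
llm_input = build_llm_input(image_tokens, text_tokens)
```

The resulting sequence has five tokens of equal width, which is what lets the language model attend over image and text positions uniformly.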
Original abstract

The reliable analysis of blood reports is important for health knowledge, but individuals often struggle with interpretation, leading to anxiety and overlooked issues. We explore the potential of general-purpose Vision-Language Models (VLMs) to address this challenge by automatically analyzing blood report images. We conduct a comparative evaluation of three VLMs: Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick, determining their performance on a dataset of 100 diverse blood report images. Each model was prompted with clinically relevant questions adapted to each blood report. The answers were then processed using Sentence-BERT to compare and evaluate how closely the models responded. The findings suggest that general-purpose VLMs are a practical and promising technology for developing patient-facing tools for preliminary blood report analysis. Their ability to provide clear interpretations directly from images can improve health literacy and reduce the limitations to understanding complex medical information. This work establishes a foundation for the future development of reliable and accessible AI-assisted healthcare applications. While results are encouraging, they should be interpreted cautiously given the limited dataset size.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates three general-purpose vision-language models (Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick) on a dataset of 100 blood report images. Models are prompted with clinically relevant questions, and performance is assessed by computing pairwise similarities between responses using Sentence-BERT. The authors conclude that these VLMs constitute a practical and promising technology for patient-facing preliminary analysis tools that can improve health literacy.

Significance. If the similarity-based findings were corroborated by direct accuracy measures, the work would provide a useful starting point for exploring accessible VLM applications in medical report interpretation. The application domain is relevant to computer vision and healthcare AI, and the comparative setup across three models offers a baseline for future studies.

major comments (2)
  1. [Abstract and evaluation approach] Abstract and evaluation description: the sole metric is Sentence-BERT similarity across model answers to the same prompts, with no ground-truth labels, extracted numerical values from the reports, expert correctness annotations, or accuracy/error-rate analysis. This directly undermines the central claim that the models deliver 'clear interpretations' suitable for patient-facing tools, because inter-model agreement can arise from shared biases or consistent hallucinations without matching actual report content.
  2. [Results and discussion] Dataset and results discussion: although the abstract notes the limited size of 100 images, the manuscript provides no breakdown of response variability, failure cases, or comparison against even simple baselines such as OCR-plus-rule extraction of lab values. This absence weakens the assertion that the approach is 'practical' for preliminary analysis.
minor comments (1)
  1. [Methods] The description of prompt adaptation to each blood report could be clarified with an example prompt template to improve reproducibility.
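A prompt template of the kind the minor comment asks for might look like the sketch below. The wording, the template name, and the example questions are all hypothetical; the paper's actual prompts are not reproduced on this page.

```python
from string import Template

# Hypothetical template; the paper's real prompt wording is not known here.
PROMPT_TEMPLATE = Template(
    "You are shown an image of a blood test report.\n"
    "Question: $question\n"
    "Answer using only values visible in the report, "
    "and say so if a value is not shown."
)


def build_prompts(questions):
    """Fill the template once per report-specific question."""
    return [PROMPT_TEMPLATE.substitute(question=q) for q in questions]


# Hypothetical questions adapted to one report.
prompts = build_prompts([
    "Is the hemoglobin value within the printed reference range?",
    "Which parameters are flagged as abnormal?",
])
```

Publishing a fixed template alongside the per-report questions would let others send identical inputs to each model and reproduce the comparison.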

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript evaluating general-purpose VLMs for blood report analysis. We have revised the paper to more explicitly acknowledge the limitations of our similarity-based evaluation and to expand the discussion of results. Our point-by-point responses follow.

Point-by-point responses
  1. Referee: [Abstract and evaluation approach] Abstract and evaluation description: the sole metric is Sentence-BERT similarity across model answers to the same prompts, with no ground-truth labels, extracted numerical values from the reports, expert correctness annotations, or accuracy/error-rate analysis. This directly undermines the central claim that the models deliver 'clear interpretations' suitable for patient-facing tools, because inter-model agreement can arise from shared biases or consistent hallucinations without matching actual report content.

    Authors: We agree that inter-model similarity via Sentence-BERT is an indirect proxy and does not establish factual correctness or rule out shared hallucinations. Our original intent was to provide an initial comparative baseline across three VLMs on this task in the absence of an annotated benchmark. We have revised the abstract and added a limitations paragraph in the discussion to frame the results more cautiously, emphasizing consistency rather than verified accuracy and explicitly noting that expert validation is required before any patient-facing deployment. The comparative setup remains useful as an early reference point for future studies.

    revision: yes

  2. Referee: [Results and discussion] Dataset and results discussion: although the abstract notes the limited size of 100 images, the manuscript provides no breakdown of response variability, failure cases, or comparison against even simple baselines such as OCR-plus-rule extraction of lab values. This absence weakens the assertion that the approach is 'practical' for preliminary analysis.

    Authors: We accept that additional granularity on the results would strengthen the paper. The revised manuscript now includes a new subsection with quantitative measures of response variability (e.g., average pairwise similarities and cases of divergence) and qualitative examples of failure modes such as missed parameters or inconsistent unit handling. A direct OCR-plus-rule baseline comparison was outside the scope of the original end-to-end VLM study; we have added this as a recommended direction for follow-up work rather than claiming the current approach is already optimal.

    revision: partial
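For reference, the rule half of an OCR-plus-rule baseline can be quite small. The sketch below assumes the OCR step has already produced plain text with one result per line in the order name, value, unit, reference range. That layout, the regular expression, and the sample text are assumptions for illustration; real reports vary widely in format.

```python
import re

# Assumed line layout: "<name> <value> <unit> <low>-<high>".
LINE_PATTERN = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z ]*?)\s+(?P<value>\d+(?:\.\d+)?)\s+(?P<unit>\S+)\s+"
    r"(?P<low>\d+(?:\.\d+)?)\s*-\s*(?P<high>\d+(?:\.\d+)?)$"
)


def extract_lab_values(ocr_text):
    """Parse result lines and flag each value against its printed range."""
    results = []
    for line in ocr_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if not match:
            continue  # skip headers, footers, and lines in other layouts
        value = float(match["value"])
        low, high = float(match["low"]), float(match["high"])
        if value < low:
            flag = "low"
        elif value > high:
            flag = "high"
        else:
            flag = "normal"
        results.append({"name": match["name"], "value": value,
                        "unit": match["unit"], "flag": flag})
    return results


# Hypothetical OCR output for one report.
ocr_text = """Hemoglobin 11.2 g/dL 13.0 - 17.0
WBC Count 7.2 10^3/uL 4.0-11.0
Platelets 480 10^3/uL 150-410
Report generated by lab"""
parsed = extract_lab_values(ocr_text)
```

A baseline like this cannot explain results in plain language, but its extracted values and flags would give a concrete yardstick against which to check what the VLMs say about the same report.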

standing simulated objections not resolved
  • Direct accuracy or error-rate evaluation against ground-truth labels or expert annotations, which would require new data labeling not performed in the current exploratory study.

Circularity Check

0 steps flagged

No circularity: empirical evaluation of off-the-shelf VLMs using a standard similarity metric

full rationale

The paper conducts a direct empirical comparison of three general-purpose VLMs (Qwen-VL-Max, Gemini 2.5 Pro, Llama 4 Maverick) on 100 blood-report images. Prompts are applied and responses are compared via Sentence-BERT cosine similarity; no equations, parameter fitting, derivations, or predictions are claimed. The central claim that VLMs are promising for patient-facing tools rests on this observed similarity, which is an external measurement rather than a quantity defined by the authors' own prior work or inputs. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear. The analysis is self-contained against external benchmarks and does not reduce any result to its own definitions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unvalidated premise that clinically relevant prompts were used and that model agreement via embeddings indicates practical utility for health applications.

axioms (1)
  • domain assumption: The prompts used are clinically relevant and appropriate for blood report analysis.
    Stated in the abstract as 'prompted with clinically relevant questions' with no further validation or expert review described.



