Enhancing Pathological VLMs with Cross-scale Reasoning
Pith reviewed 2026-06-27 02:06 UTC · model grok-4.3
The pith
A reinforcement-learned VLM trained on curated cross-scale pathology questions reaches SOTA on both multi-magnification and single-scale benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ScaleReasoner-R1, trained via reinforcement learning to optimize performance on cross-scale VQA tasks, achieves state-of-the-art performance on the cross-scale reasoning benchmark and generalizes to state-of-the-art performance on established single-scale benchmarks.
What carries the argument
The leakage-aware curation pipeline (adversarial text-only screening plus constraint-guided question design) that produces the Scale-VQA benchmark used to train ScaleReasoner-R1 with reinforcement learning.
If this is right
- Limited cross-scale supervision can significantly improve pathological understanding in VLMs.
- The model generalizes from cross-scale training to SOTA results on single-scale benchmarks.
- Pathology interpretation can be formulated as multi-magnification reasoning.
- Multi-image VQA benchmarks require explicit protection against text-only shortcuts.
Where Pith is reading between the lines
- Cross-scale supervision may prove useful for VLMs in any domain whose data naturally spans multiple resolutions.
- The same curation approach could be tested on other multi-image medical or scientific VQA tasks to reduce shortcut learning.
- Reinforcement learning may be especially suited to teaching evidence integration across scales compared with standard supervised fine-tuning.
Load-bearing premise
The leakage-aware curation pipeline successfully removes text-only shortcuts so that performance truly reflects cross-scale visual reasoning.
What would settle it
If models trained on Scale-VQA retain high accuracy when the same questions are rephrased to restore magnification-dependent text cues, or if single-scale benchmark gains disappear when cross-scale questions are removed from training, the claim that gains stem from visual cross-scale reasoning would be falsified.
Figures
read the original abstract
Pathological images are inherently multi-scale, requiring pathologists to integrate evidence from global tissue architecture at low magnification to cellular morphology at higher magnification for accurate diagnosis. While existing pathological datasets for vision-language models (VLMs) include various scales, they often lack explicit cross-scale reasoning objectives. This limitation prevents VLMs from capturing essential cross-scale representations and learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. However, creating such a task reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers using magnification-dependent artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning to optimize performance on cross-scale VQA tasks. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to SOTA performance on established single-scale benchmarks. Findings suggest that even the limited cross-scale supervision can significantly improve pathological understanding. Code is available at https://github.com/iMVR-PL/ScaleReasoner-R1.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a cross-scale reasoning paradigm for pathological VLMs. It constructs Scale-VQA, a benchmark of 4,685 multiple-choice VQA items grounded in 2,537 multi-magnification pathology images, via a leakage-aware curation pipeline (adversarial text-only screening plus constraint-guided question design) intended to eliminate magnification-dependent textual shortcuts. ScaleReasoner-R1 is then trained with reinforcement learning on this benchmark and is reported to achieve SOTA on Scale-VQA while generalizing to SOTA on established single-scale pathology benchmarks. The authors conclude that even limited cross-scale supervision improves pathological understanding; code is released.
Significance. If the curation pipeline demonstrably removes text-only shortcuts, the work would be significant: it supplies the first explicit cross-scale VQA benchmark and training objective for pathology, where multi-magnification integration is clinically essential. The RL optimization step and the reported generalization to single-scale tasks are concrete strengths. Public code release supports reproducibility.
major comments (1)
- [Benchmark construction / leakage-aware curation pipeline] The leakage-aware curation pipeline (described in the methods section on benchmark construction) is load-bearing for the central claim that Scale-VQA performance reflects cross-scale visual reasoning. The manuscript provides no quantitative validation—such as accuracy of text-only baselines on the final 4,685-question set—leaving open the possibility that residual shortcuts remain. This directly affects both the SOTA result on Scale-VQA and the generalization claim.
minor comments (2)
- The abstract states that 'even the limited cross-scale supervision can significantly improve pathological understanding' without quantifying the improvement or discussing limitations of the RL objective; a dedicated limitations paragraph would strengthen the discussion.
- Table or figure reporting the exact text-only screening rejection rates and final question statistics would make the curation pipeline more transparent.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the leakage-aware curation pipeline, which is central to the validity of Scale-VQA. We address the major comment below.
read point-by-point responses
-
Referee: [Benchmark construction / leakage-aware curation pipeline] The leakage-aware curation pipeline (described in the methods section on benchmark construction) is load-bearing for the central claim that Scale-VQA performance reflects cross-scale visual reasoning. The manuscript provides no quantitative validation—such as accuracy of text-only baselines on the final 4,685-question set—leaving open the possibility that residual shortcuts remain. This directly affects both the SOTA result on Scale-VQA and the generalization claim.
Authors: We agree that the manuscript would be strengthened by explicit quantitative validation of the text-only screening. The adversarial text-only screening step was applied during curation to remove magnification-dependent textual shortcuts, but post-curation accuracy of text-only baselines on the final 4,685-question set was not reported. In the revised manuscript we will add these results (text-only model accuracy on the curated set, expected near random chance) in the benchmark construction section to confirm that residual shortcuts are minimal. This addition directly supports the cross-scale reasoning claim and the reported generalization. revision: yes
Circularity Check
No circularity: empirical benchmark curation and RL training with independent evaluation claims.
full rationale
The paper constructs Scale-VQA via an adversarial text-only screening pipeline and constraint-guided design, then trains ScaleReasoner-R1 with RL and reports empirical SOTA results on the new benchmark plus generalization to single-scale tasks. No equations, fitted parameters renamed as predictions, or self-citations appear in the provided text that would reduce any claim to a self-referential definition. The load-bearing assumption about shortcut removal is an empirical verification issue rather than a definitional or self-citation reduction. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
The Journal of pathol- ogy249(3), 286–294 (2019)
Abels, E., Pantanowitz, L., Aeffner, F., Zarella, M.D., Van der Laak, J., Bui, M.M., Vemuri, V.N., Parwani, A.V., Gibbs, J., Agosto-Arroyo, E., et al.: Computational pathology definitions, best practices, and recommendations for regulatory guid- ance: a white paper from the digital pathology association. The Journal of pathol- ogy249(3), 286–294 (2019)
2019
-
[2]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Agrawal, A., Batra, D., Parikh, D., Kembhavi, A.: Don’t just assume; look and an- swer: Overcoming priors for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4971–4980 (2018). https://doi.org/10.1109/CVPR.2018.00522
-
[3]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Nature Computational Science (2025)
Chen, K., Liu, M., Yan, F., et al.: Cost-effective instruction learning for pathology vision and language analysis. Nature Computational Science (2025). https://doi.org/10.1038/s43588-025-00818-5
-
[5]
In: European Conference on Com- puter Vision
Chen, P., Zhu, C., Zheng, S., Li, H., Yang, L.: Wsi-vqa: Interpreting whole slide images by generative visual question answering. In: European Conference on Com- puter Vision. pp. 401–417. Springer (2025)
2025
-
[6]
arXiv preprint arXiv:2410.11761 (2024) 10 C
Chen, Y., Wang, G., Ji, Y., Li, Y., Ye, J., Li, T., , Ming, H., Yu, R., Qiao, Y., He, J.: Slidechat: A large vision-language assistant for whole-slide pathology image understanding. arXiv preprint arXiv:2410.11761 (2024) 10 C. Phan et al
-
[7]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Chen, Y., Wang, G., Ji, Y., Li, Y., Ye, J., Li, T., Hu, M., Yu, R., Qiao, Y., He, J.: Slidechat: A large vision-language assistant for whole-slide pathology image understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5134–5143 (Jun 2025)
2025
-
[8]
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Chu, T., Zhai, Y., Yang, J., Tong, S., Xie, S., Schuurmans, D., Le, Q.V., Levine, S., Ma, Y.: Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Nucleic acids research 44(8), e71–e71 (2016)
Colaprico, A., Silva, T.C., Olsen, C., Garofano, L., Cava, C., Garolini, D., Sabedot, T.S., Malta, T.M., Pagnotta, S.M., Castiglioni, I., et al.: Tcgabiolinks: an r/bioconductor package for integrative analysis of tcga data. Nucleic acids research 44(8), e71–e71 (2016)
2016
-
[10]
Goyal,Y.,Khot,T.,Summers-Stay,D.,Batra,D.,Parikh,D.:MakingthevinVQA matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6904–6913 (2017). https://doi.org/10.1109/CVPR.2017.670
-
[11]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Hashimoto, N., Fukushima, D., Koga, R., Takagi, Y., Ko, K., Kohno, K., Nakaguro, M., Nakamura, S., Hontani, H., Takeuchi, I.: Multi-scale domain- adversarial multiple-instance cnn for cancer subtype classification with unanno- tated histopathological images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3852–38...
2020
-
[13]
PathVQA: 30000+ Questions for Medical Visual Question Answering
He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2003
-
[14]
Advances in Neural Information Processing Systems36, 28541–28564 (2023)
Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems36, 28541–28564 (2023)
2023
-
[15]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Liang, Y., Lyu, X., Chen, W., Ding, M., Zhang, J., He, X., Wu, S., Xing, X., Yang, S., Wang, X., et al.: Wsi-llava: A multimodal large language model for whole slide image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22718–22727 (2025)
2025
- [16]
-
[17]
A Multimodal Generative AI Copilot for Human Pathology,
Lu, M.Y., Chen, B., Williamson, D.F.K., Chen, R.J., Zhao, M., Chow, A.K., Ike- mura, K., Kim, A., Pouli, D., Patel, A., Soliman, A., Chen, C., Ding, T., Wang, J.J., Gerber, G., Liang, I., Le, L.P., Parwani, A.V., Weishaupt, L.L., Mahmood, F.: A multimodal generative ai copilot for human pathology. Nature634(8033), 466–473 (Oct 2024). https://doi.org/10.10...
-
[18]
arXiv e-prints pp
Saygin Seyfioglu, M., Ikezogwo, W.O., Ghezloo, F., Krishna, R., Shapiro, L.: Quilt- llava: Visual instruction tuning by extracting localized narratives from open-source histopathology videos. arXiv e-prints pp. arXiv–2312 (2023)
2023
-
[19]
In: Proceedings of the 58th annual meet- ing of the association for computational linguistics
Shrestha, R., Kafle, K., Kanan, C.: A negative case analysis of visual grounding methods for VQA. In: Proceedings of the 58th annual meet- ing of the association for computational linguistics. pp. 8172–8181 (2020). https://doi.org/10.18653/v1/2020.acl-main.727
-
[20]
Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025) Enhancing Pathological VLMs with Cross-scale Reasoning 11
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
In: European Conference on Computer Vision
Sun, Y., Wu, H., Zhu, C., Zheng, S., Chen, Q., Zhang, K., Zhang, Y., Wan, D., Lan, X., Zheng, M., et al.: Pathmmu: A massive multimodal expert-level benchmark for understanding and reasoning in pathology. In: European Conference on Computer Vision. pp. 56–73. Springer (2024)
2024
-
[22]
Gemini: A Family of Highly Capable Multimodal Models
Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.B., Yu, J., Soricut, R., Schalk- wyk,J.,Dai,A.M.,Hauth,A.,etal.:Gemini:afamilyofhighlycapablemultimodal models. arXiv preprint arXiv:2312.11805 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
Xu, W., Chan, H.P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu,C.,Li,Z.,etal.:Lingshu:Ageneralistfoundationmodelforunifiedmultimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
arXiv preprint arXiv:2507.17303 (2025)
Xu, Z., Liu, Z., Hou, J., Ma, J., Jin, C., Wang, Y., Chen, Z., Zhang, Z., Huang, F., Guo, Z., et al.: A versatile pathology co-pilot via reasoning enhanced multimodal large language model. arXiv preprint arXiv:2507.17303 (2025)
-
[25]
arXiv preprint arXiv:2305.15075 (2023)
Zhang, H., Chen, J., Jiang, F., Yu, F., Chen, Z., Li, J., Chen, G., Wu, X., Zhang, Z., Xiao, Q., Wan, X., Wang, B., Li, H.: Huatuogpt, towards taming language models to be a doctor. arXiv preprint arXiv:2305.15075 (2023)
-
[26]
arXiv preprint arXiv:2505.11404 (2025)
Zhang, W., Zhang, P., Guo, J., Cheng, T., Chen, J., Zhang, S., Zhang, Z., Yi, Y., Bu, H.: Patho-r1: A multimodal reinforcement learning-based pathology expert reasoner. arXiv preprint arXiv:2505.11404 (2025)
-
[27]
Nature Machine Intelligence1(5), 236–245 (2019)
Zhang, Z., Chen, P., McGough, M., Xing, F., Wang, C., Bui, M., Xie, Y., Sapkota, M., Cui, L., Dhillon, J., et al.: Pathologist-level interpretable whole-slide cancer diagnosis with deep learning. Nature Machine Intelligence1(5), 236–245 (2019)
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.