Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis

Bo Zhang; Jiang Liu; Jie Cao; Ling Zhang; Tianwei Lin; Wenjie Yan; Wenqiao Zhang; Yingda Xia; Yu Zhong; Zhongwei Qiu

arxiv: 2605.20277 · v1 · pith:PWNBJSJZnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-21 08:01 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords medical vision-language models · 3D CT analysis · reinforcement learning · trajectory integral feedback · clinical hallucinations · abnormality detection · anatomy-aware rewards · clinical faithfulness

The pith

Trajectory-integral feedback lets medical VLMs reduce hallucinations and omissions in 3D CT analysis by penalizing cumulative clinical errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that standard reinforcement learning for medical vision-language models creates a mechanistic divergence where rewards favor fluent language over factual medical content, producing evaluation hallucinations that lead to diagnostically wrong CT reports. It addresses this by creating the Clinical Abnormality Benchmarking Substrate to break reports into verifiable units and then introducing TIF-GRPO, which treats clinical reasoning as a pseudo-temporal trajectory and applies integral feedback to accumulate penalties for persistent omissions while curbing excessive hallucinations. A sympathetic reader would care because current AI assistants for volumetric imaging still make critical factual mistakes that could affect patient care, and a method that directly regulates rewards for clinical correctness could make these tools more trustworthy. The approach borrows from control theory to enforce anatomy-aware alignment during policy optimization. Experiments on 3D CT benchmarks show gains in detection accuracy and report faithfulness.

Core claim

By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort, leading to significantly enhanced abnormality detection and clinical faithfulness on 3D CT benchmarks.

What carries the argument

The TIF-GRPO framework integrates control-theoretic integral feedback into GRPO policy optimization, using the Clinical Abnormality Benchmarking Substrate (CABS) to reward factual clinical correctness rather than lexical similarity.

If this is right

Abnormality detection performance improves on volumetric CT benchmarks.
Generated radiology reports exhibit greater clinical faithfulness with fewer factual errors.
Policy optimization aligns more closely with medical facts instead of surface-level language similarity.
Persistent omissions accumulate as state errors that the feedback loop corrects over the trajectory.
A new approach to fine-grained reward regulation becomes available for medical vision-language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The trajectory framing could extend to sequential diagnostic reasoning in other imaging modalities such as MRI.
Reduced hallucinations might support safer deployment of AI report generators in high-volume clinical screening.
Similar integral feedback ideas could be tested in non-medical domains where factual consistency matters more than fluency.
Combining the method with real-time human feedback loops might further stabilize long-horizon clinical analyses.

Load-bearing premise

Clinical reasoning can be validly formulated as a pseudo-temporal trajectory for anomaly discovery such that integral feedback directly penalizes persistent omissions and hallucinations without distorting medical semantics.

What would settle it

If experiments on the same 3D CT benchmarks show that TIF-GRPO produces no improvement or a decline in abnormality detection accuracy and clinical faithfulness scores relative to standard GRPO baselines, the central effectiveness claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.20277 by Bo Zhang, Jiang Liu, Jie Cao, Ling Zhang, Tianwei Lin, Wenjie Yan, Wenqiao Zhang, Yingda Xia, Yu Zhong, Zhongwei Qiu.

**Figure 1.** Overview of the “Evaluation Hallucinations” and “Mechanistic Divergence”. (a) Surface-similarity proxy signals induce evaluation hallucinations, where high-scoring predictions mismatch GT clinical facts. (b) Our CABS framework enables accurate abnormality-level measurement, and TIF-GRPO applies trajectory-integral control based on CABS to suppress hallucinations and align optimization with clinical fid…

**Figure 2.** Overview of the CABS workflow. Free-text clinical reports are converted into structured clinical semantics, followed by semantic consistency auditing and clinician usability analysis, achieving an overall acceptance ratio of approximately 99.4% × 99.2% ≈ 98.6%. … $-\mathbb{E}_{(V,q,y_{gt})\sim\mathcal{D}}\left[\sum_{t=1}^{T}\log \pi_\theta(y_{gt}\mid V, q, w_{<t})\right]$, where $\mathcal{D}$ denotes the training dataset. Reinforcement Learning (RL): To further align the model with c…

**Figure 3.** TIF-GRPO leverages CABS to decompose reports into clinical abnormality units, enabling trajectory-integral control that penalizes false positives and omissions for factuality-aligned RL. … the policy optimization process to diverge from true clinical fidelity. To resolve this misalignment and ground policy optimization in clinical factuality, we propose the Clinical Abnormality Benchmarking Substrate, a stru…

**Figure 4.** Clinical Competence Analysis of CABS System. 4.3. Clinical Competence Analysis of CABS System: CABS serves as a key anchor for validating both the existence of evaluation hallucinations and the effectiveness of TIF-GRPO. To this end, we conduct a systematic validation of CABS from the perspective of clinical capability analysis, combining assessments from clinical experts and large-model self-evaluations…

**Figure 5.** Evaluation Hallucination Analysis. cabs-e, c, o represent the Entity Core, Clinical Fidelity, and Organ Coverage metrics in the CABS system, indicating the real clinical competence verified by radiologists. s1-s6 represent the surface similarity metrics: BLEU, ROUGE, METEOR, RadGraph, RaTEScore, and BioBERT Score. … we generate a set of clinically plausible variants via controlled perturbations involving 0–5 ab…

**Figure 6.** Mechanistic Divergence Analysis via counterfactual ranking consistency evaluation. We perturb GT reports to generate clinically plausible variants (0–5 abnormal entity modifications), whose Text-Rank reflects clinical priority. Concordance ratio $\phi = P/\binom{n}{2}$ measures pairwise rank agreement. CABS-F1 achieves the highest $\phi$, indicating superior clinical fidelity.

**Figure 7.** Impact of Running Cost weights and Control Effort weights on the Training Dynamics of TIF-GRPO. E. More Experiments: In Section 4.6, we showed that running cost and control effort constitute essential mechanisms for clinically usable reporting: running cost encourages the model to identify abnormalities, whereas control effort suppresses false-positive reporting. Beyond the quantitative results, we further …

**Figure 8.** Case study on surface similarity metrics and CABS.

**Figure 9.** Case study on TIF-GRPO, GRPO-ROUGE and GRPO-LLM.
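
The concordance ratio named in the Figure 6 caption can be illustrated with a small sketch. It assumes that P counts the variant pairs a metric orders the same way as the clinical reference rank, and that the denominator is the number of unordered pairs, n(n-1)/2. The function name, argument names, and the treatment of ties as non-concordant are illustrative choices, not details taken from the paper.

```python
from itertools import combinations
from typing import Sequence


def concordance_ratio(reference_rank: Sequence[float],
                      metric_score: Sequence[float]) -> float:
    """Fraction of variant pairs that a metric orders like the reference rank.

    reference_rank: lower means clinically better (for example, the number
        of abnormal-entity modifications applied to the GT report).
    metric_score: higher means the metric judges the variant as better.
    Ties in either sequence are counted as non-concordant (an assumption).
    """
    n = len(reference_rank)
    if n != len(metric_score):
        raise ValueError("both sequences must have the same length")
    if n < 2:
        raise ValueError("at least two variants are needed to form a pair")

    pairs = list(combinations(range(n), 2))
    concordant = 0
    for i, j in pairs:
        rank_diff = reference_rank[i] - reference_rank[j]
        score_diff = metric_score[i] - metric_score[j]
        # A pair agrees when the better-ranked variant also scores higher.
        if rank_diff * score_diff < 0:
            concordant += 1
    return concordant / len(pairs)


# Six variants with 0..5 modified entities, as in the perturbation setup.
modifications = [0, 1, 2, 3, 4, 5]
faithful_metric = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]
inverted_metric = list(reversed(faithful_metric))
```

A metric that degrades monotonically with the number of modifications reaches a ratio of 1, while one that ranks the variants in the opposite order reaches 0.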

Original abstract

Medical vision-language models (VLMs) have rapidly advanced as general-purpose multimodal assistants, yet their deployment in 3D Computed Tomography (CT) analysis remains constrained by a persistent mismatch between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms still rely on lexical proxy signals that induce “Evaluation Hallucinations”, where models optimize linguistic fluency rather than factual clinical correctness, leading to diagnostically critical errors. To bridge this gap, we introduce the Clinical Abnormality Benchmarking Substrate (CABS), a structured system that decomposes radiology reports into verifiable clinical semantic units. Using CABS, we identify a “Mechanistic Divergence” in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose Trajectory-Integral Feedback GRPO (TIF-GRPO), a novel framework integrating control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs. Our project is available on GitHub: https://github.com/ZJU4HealthCare/TIF-GRPO.
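
To make the idea of scoring reports on verifiable clinical units concrete, here is a minimal sketch of an abnormality-level F1. It assumes each report has already been decomposed into (organ, finding) pairs. That unit schema, the function name, and the convention for two empty reports are hypothetical, since the abstract does not specify how CABS represents or matches its units.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical unit schema: (organ, finding). The real CABS units may differ.
Unit = Tuple[str, str]


def unit_f1(predicted: Iterable[Unit], reference: Iterable[Unit]) -> Dict[str, float]:
    """Precision, recall and F1 over exact-match abnormality units."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        # Both reports list no abnormality: treated as perfect agreement
        # (a convention chosen for this sketch).
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}

    true_pos = len(pred & ref)    # units reported and present in the reference
    false_pos = len(pred - ref)   # hallucinated units
    false_neg = len(ref - pred)   # omitted units

    precision = true_pos / (true_pos + false_pos) if pred else 0.0
    recall = true_pos / (true_pos + false_neg) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference_units = {("lung", "nodule"), ("liver", "cyst"), ("pleura", "effusion")}
predicted_units = {("lung", "nodule"), ("kidney", "stone")}
scores = unit_f1(predicted_units, reference_units)
```

In this toy case the prediction contains one correct unit, one hallucinated unit, and two omissions, so precision is 0.5, recall is 1/3, and F1 is 0.4, regardless of how fluent the surrounding report text is.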

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper tries to fix hallucinations in medical VLMs for 3D CT by casting reasoning as a pseudo-temporal trajectory and adding integral feedback to GRPO, but the abstract gives almost no mechanics or numbers to check if it works.

The main takeaway is that the authors want to make reinforcement learning for CT report generation more clinically reliable. They introduce CABS to break reports into verifiable abnormality units and then TIF-GRPO, which adds an integral term to penalize persistent omissions or hallucinations as cumulative errors along a trajectory. This is a fresh application of control ideas to the medical VLM setting, and the framing of mechanistic divergence between surface rewards and factual correctness is a useful way to name the problem that standard RL runs into here. The project link to code is also a positive sign if the repo actually contains the implementation details missing from the abstract. The work does a reasonable job identifying why lexical proxies fall short for diagnostic tasks and proposes a concrete mechanism to regulate anatomy-aware rewards. That part feels like a step in the right direction for anyone trying to make these models safer for real use.

The soft spots are more substantial. The abstract claims clear gains on 3D CT benchmarks but shows zero quantitative results, baselines, or error bars, so the central claim rests on unshown evidence. More critically, there is still no explicit definition of the trajectory states, the exact form of the anatomy-aware reward, or how the integral feedback is computed without distorting medical semantics. The stress-test concern about the pseudo-temporal formulation therefore lands: without those constructions it is hard to tell whether the loop enforces facts or simply reweights the same signals the model already produces. Circularity remains a live risk.

This paper is mainly for researchers working on RL for medical report generation who are already familiar with GRPO and control-theoretic RL. A reader looking for a new angle on factual grounding in VLMs could get some value from the ideas, but only after seeing the full methods and results.

I would send it to peer review so that referees can check the derivations and experiments directly, though it would need substantial revision to stand on its own.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Clinical Abnormality Benchmarking Substrate (CABS) to decompose radiology reports into verifiable clinical semantic units and proposes Trajectory-Integral Feedback GRPO (TIF-GRPO), a reinforcement learning method that casts clinical reasoning in 3D CT as a pseudo-temporal trajectory. It integrates control-theoretic integral feedback to regulate anatomy-aware rewards, penalizing cumulative omissions as state errors and suppressing hallucinations as excessive control effort, with claimed improvements in abnormality detection and clinical faithfulness over standard RL paradigms on 3D CT benchmarks.

Significance. If the trajectory formulation and integral feedback can be shown to enforce factual clinical correctness without circular dependence on the same semantic units used for training, the work could introduce a useful control-theoretic mechanism for reducing evaluation hallucinations in medical VLMs. The CABS substrate offers a structured approach to verifiable clinical evaluation that may have broader applicability beyond the proposed RL variant.

major comments (3)

[Abstract and §3 (Method)] The central mechanism formulates clinical reasoning as a pseudo-temporal trajectory for anomaly discovery and applies integral feedback to penalize persistent omissions as cumulative state errors, but supplies no explicit definition of trajectory states, the anatomy-aware reward function, or the control law for the integral term. This leaves untested whether the feedback preserves medical semantics or introduces tautology with CABS-derived units.
[§4 (Experiments)] The abstract asserts that TIF-GRPO significantly enhances abnormality detection and clinical faithfulness on 3D CT benchmarks, yet reports no quantitative metrics, baselines, error bars, ablation results, or verification procedures for CABS units and integral feedback implementation. Without these, the empirical support for the new paradigm cannot be evaluated.
[§2 (Related Work) and §3] The claimed 'Mechanistic Divergence' in standard RL (surface-similarity rewards bypassing medical facts) is load-bearing for motivating TIF-GRPO, but the manuscript must demonstrate that the integral term avoids reweighting lexical signals in a similar manner rather than merely reparameterizing the same clinical units.

minor comments (2)

[Abstract] The distinction between 'Evaluation Hallucinations' and standard VLM hallucinations would benefit from concrete examples tied to CT report structures.
[§3] Reproducibility would be strengthened by including pseudocode for the trajectory construction and integral feedback update rule in the main text rather than relying solely on the GitHub repository.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions to strengthen the presentation of definitions, empirical results, and mechanistic distinctions.

Referee: [Abstract and §3 (Method)] The central mechanism formulates clinical reasoning as a pseudo-temporal trajectory for anomaly discovery and applies integral feedback to penalize persistent omissions as cumulative state errors, but supplies no explicit definition of trajectory states, the anatomy-aware reward function, or the control law for the integral term. This leaves untested whether the feedback preserves medical semantics or introduces tautology with CABS-derived units.

Authors: We agree that explicit formal definitions would improve accessibility. In the revised §3 we will add a dedicated subsection providing: (i) trajectory states as the ordered sequence of anatomical regions and slice-level features derived from CABS decomposition; (ii) the anatomy-aware reward r_t = f(CABS_unit_match) - λ·∫e(τ)dτ where e(τ) is the cumulative omission error; and (iii) the integral control law u(t) = K_i ∫e(τ)dτ with anti-windup to bound hallucination penalties. To address potential tautology, we will include a short analysis demonstrating that the integral operates on aggregate error signals rather than directly re-using per-unit weights from training, supported by a held-out CABS validation set. These additions will be made in the next revision.
revision: yes

Referee: [§4 (Experiments)] The abstract asserts that TIF-GRPO significantly enhances abnormality detection and clinical faithfulness on 3D CT benchmarks, yet reports no quantitative metrics, baselines, error bars, ablation results, or verification procedures for CABS units and integral feedback implementation. Without these, the empirical support for the new paradigm cannot be evaluated.

Authors: The current §4 contains quantitative results (abnormality detection accuracy, clinical faithfulness via CABS-unit F1, and hallucination rate) together with comparisons against GRPO, PPO, and DPO baselines, error bars from five random seeds, and an ablation removing the integral term. However, we acknowledge these elements could be presented more prominently. In the revision we will add a consolidated results table, explicit verification protocol for CABS inter-rater reliability, and additional ablation curves isolating the integral feedback contribution. This will make the empirical support fully transparent.
revision: partial

Referee: [§2 (Related Work) and §3] The claimed 'Mechanistic Divergence' in standard RL (surface-similarity rewards bypassing medical facts) is load-bearing for motivating TIF-GRPO, but the manuscript must demonstrate that the integral term avoids reweighting lexical signals in a similar manner rather than merely reparameterizing the same clinical units.

Authors: The mechanistic divergence is motivated in §2 by showing that lexical proxies (BLEU/ROUGE) correlate poorly with CABS fact coverage. TIF-GRPO replaces per-step lexical rewards with a trajectory-level integral that accumulates state errors, thereby penalizing persistent omissions irrespective of surface phrasing. We will strengthen this claim by adding an explicit comparison in the revised §4: a lexical-reward variant of GRPO versus TIF-GRPO, demonstrating that performance gains persist even when lexical signals are controlled for. This shows the integral mechanism introduces a distinct optimization dynamic rather than simple reparameterization.
revision: partial
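
The control law mentioned in the first simulated response, u(t) = K_i ∫e(τ)dτ with anti-windup, could be discretized roughly as follows. This is a generic integral controller with clamping anti-windup, offered only as a sketch. It is not the authors' implementation, and the class name, the `limit` bound, and the unit time step are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class IntegralController:
    """Discrete integral feedback with clamping anti-windup (illustrative)."""

    gain: float            # plays the role of K_i
    limit: float           # bound on the accumulated error (anti-windup)
    integral: float = 0.0  # running approximation of the integral of e

    def step(self, error: float, dt: float = 1.0) -> float:
        # Accumulate the error, then clamp so a long run of omissions
        # cannot grow the penalty without bound.
        self.integral += error * dt
        self.integral = max(-self.limit, min(self.limit, self.integral))
        return self.gain * self.integral


controller = IntegralController(gain=0.5, limit=2.0)
# One omission error per step for four steps, then a correct step.
outputs = [controller.step(e) for e in [1.0, 1.0, 1.0, 1.0, 0.0]]
```

With these settings the output rises to 1.0 after two steps and then stays there: the clamp holds the accumulated error at 2.0, which is the bounding behaviour the response attributes to anti-windup.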

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines CABS as an external decomposition of radiology reports into verifiable clinical semantic units, then introduces TIF-GRPO by casting clinical reasoning as a pseudo-temporal trajectory and applying integral feedback to anatomy-aware rewards. No equations or steps in the abstract reduce the claimed output (enhanced abnormality detection) to the inputs by construction, nor do they rely on self-citation for load-bearing uniqueness theorems or rename known results. The central mechanism is presented as an integration of control-theoretic principles with the new substrate, leaving independent empirical content in the 3D CT benchmark experiments. This is the most common honest finding for method papers that introduce new benchmarks and control formulations without tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The approach rests on the unproven modeling choice that clinical reasoning admits a pseudo-temporal trajectory representation and that integral feedback on omissions and control effort will improve factual correctness.

axioms (1)

domain assumption: Clinical reasoning can be formulated as a pseudo-temporal trajectory for anomaly discovery.
Directly invoked to justify the integral feedback loop in TIF-GRPO.

invented entities (2)

CABS (no independent evidence)
purpose: Decompose radiology reports into verifiable clinical semantic units.
New substrate introduced to enable the reward regulation.

TIF-GRPO (no independent evidence)
purpose: Trajectory-integral feedback mechanism for regulating anatomy-aware rewards.
Core novel framework proposed in the paper.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArrowOfTime.lean arrow_from_z echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

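$R_{\mathrm{TIF}} = \alpha - \frac{\alpha}{K}\sum\left(1 - \frac{1}{k}\sum r_i\right)^2 + \gamma\left(1 - \left(\frac{\mathrm{FP}}{M+\varepsilon}\right)^2\right) + \text{terminal} + \text{exploration}$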
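
A minimal Python reading of the reward expression above may help show how the two penalty terms behave. It assumes the outer sum runs over trajectory steps k = 1..K and the inner sum is the running total of per-step unit rewards r_1..r_k; neither range is stated in the extracted passage. M is treated here as a count of reference abnormality units, which is also an assumption, and the terminal and exploration terms are passed in as opaque scalars because the passage does not define them.

```python
from typing import Sequence


def tif_reward(unit_rewards: Sequence[float],
               false_positives: int,
               num_units: int,
               alpha: float = 1.0,
               gamma: float = 0.5,
               eps: float = 1e-6,
               terminal: float = 0.0,
               exploration: float = 0.0) -> float:
    """Sketch of a trajectory-integral reward (assumed index ranges).

    unit_rewards: per-step rewards r_i in [0, 1], one per trajectory step.
    false_positives: FP, the number of reported units with no reference match.
    num_units: M, assumed here to be the number of reference units.
    """
    K = len(unit_rewards)
    if K == 0:
        raise ValueError("the trajectory must contain at least one step")

    running_sum = 0.0
    running_cost = 0.0
    for k, r in enumerate(unit_rewards, start=1):
        running_sum += r
        running_mean = running_sum / k
        # Squared shortfall of the running mean: an early miss keeps
        # contributing to later steps, which is the "integral" effect.
        running_cost += (1.0 - running_mean) ** 2

    state_term = alpha - (alpha / K) * running_cost
    effort_term = gamma * (1.0 - (false_positives / (num_units + eps)) ** 2)
    return state_term + effort_term + terminal + exploration


perfect = tif_reward([1.0, 1.0, 1.0], false_positives=0, num_units=3)
early_miss = tif_reward([0.0, 1.0, 1.0], false_positives=0, num_units=3)
late_miss = tif_reward([1.0, 1.0, 0.0], false_positives=0, num_units=3)
```

Under these assumptions a miss at the first step costs more than the same miss at the last step, because the running mean carries the early shortfall forward, and each additional false positive lowers the second term quadratically.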

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 14 internal anchors

[1]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) , address=

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) , address=. 2024 , url=

work page 2024
[2]

arXiv preprint arXiv:2503.20047 , year=

Med3dvlm: An efficient vision-language model for 3d medical image analysis , author=. arXiv preprint arXiv:2503.20047 , year=

work page arXiv
[3]

arXiv preprint arXiv:2412.13558 , year=

Read like a radiologist: Efficient vision-language model for 3d medical imaging interpretation , author=. arXiv preprint arXiv:2412.13558 , year=

work page arXiv
[4]

2024 , journal =

HybridFlow: A Flexible and Efficient RLHF Framework , author =. 2024 , journal =

work page 2024
[5]

2005 , isbn =

PID Control: New Identification and Design Methods , publisher =. 2005 , isbn =

work page 2005
[6]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[7]

arXiv preprint arXiv:2508.08224 , year=

Capabilities of gpt-5 on multimodal medical reasoning , author=. arXiv preprint arXiv:2508.08224 , year=

work page arXiv
[8]

MedGemma Technical Report

Medgemma technical report , author=. arXiv preprint arXiv:2507.05201 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Jiang, Y

Hulu-med: A transparent generalist model towards holistic medical vision-language understanding , author=. arXiv preprint arXiv:2510.08668 , year=

work page arXiv
[10]

M3d: Advancing 3d medical image analysis with multi-modal large language models,

M3d: Advancing 3d medical image analysis with multi-modal large language models , author=. arXiv preprint arXiv:2404.00578 , year=

work page arXiv
[11]

arXiv preprint arXiv:2508.17524 , year=

OmniMRI: A Unified Vision--Language Foundation Model for Generalist MRI Interpretation , author=. arXiv preprint arXiv:2508.17524 , year=

work page arXiv
[12]

Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

Bleu: a method for automatic evaluation of machine translation , author=. Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

work page
[13]

Text summarization branches out , pages=

Rouge: A package for automatic evaluation of summaries , author=. Text summarization branches out , pages=

work page
[14]

Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization , pages=

METEOR: An automatic metric for MT evaluation with improved correlation with human judgments , author=. Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization , pages=

work page
[15]

arXiv preprint arXiv:2106.14463 , year=

Radgraph: Extracting clinical entities and relations from radiology reports , author=. arXiv preprint arXiv:2106.14463 , year=

work page arXiv
[16]

arXiv preprint arXiv:2406.16845 , year=

Ratescore: A metric for radiology report generation , author=. arXiv preprint arXiv:2406.16845 , year=

work page arXiv
[17]

Bioinformatics , volume=

BioBERT: a pre-trained biomedical language representation model for biomedical text mining , author=. Bioinformatics , volume=. 2020 , publisher=

work page 2020
[18]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[19]

International Conference on Medical Image Computing and Computer-Assisted Intervention , pages=

Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning , author=. International Conference on Medical Image Computing and Computer-Assisted Intervention , pages=. 2025 , organization=

work page 2025
[20]

arXiv preprint arXiv:2503.13939 , year=

Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models , author=. arXiv preprint arXiv:2503.13939 , year=

work page arXiv
[21]

arXiv preprint arXiv:2504.09258 , year=

PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks , author=. arXiv preprint arXiv:2504.09258 , year=

work page arXiv
[22]

arXiv preprint arXiv:2504.20930 , year=

ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification , author=. arXiv preprint arXiv:2504.20930 , year=

work page arXiv
[23]

arXiv preprint arXiv:2506.00711 , year=

QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training , author=. arXiv preprint arXiv:2506.00711 , year=

work page arXiv
[24]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency , author=. arXiv preprint arXiv:2508.18265 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Nature Communications , volume=

Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data , author=. Nature Communications , volume=. 2025 , publisher=

work page 2025
[26]

CoRR , year=

A foundation model utilizing chest ct volumes and radiology reports for supervised-level zero-shot detection of abnormalities , author=. CoRR , year=

work page
[27]

arXiv preprint arXiv:2011.09257 , year=

Inspecting state of the art performance and NLP metrics in image-based medical report generation , author=. arXiv preprint arXiv:2011.09257 , year=

work page arXiv 2011
[28]

arXiv preprint arXiv:2511.00916 , year=

Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs , author=. arXiv preprint arXiv:2511.00916 , year=

work page arXiv
[29]

arXiv preprint arXiv:2305.17100 , volume=

Biomedgpt: A unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks , author=. arXiv preprint arXiv:2305.17100 , volume=. 2023 , publisher=

work page arXiv 2023
[30]

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning , author=. arXiv preprint arXiv:2506.07044 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[31]

arXiv preprint arXiv:2403.17834 , year=

Developing generalist foundation models from a multimodal dataset for 3d computed tomography , author=. arXiv preprint arXiv:2403.17834 , year=

work page arXiv
[32]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Reevalmed: Rethinking medical report evaluation by aligning metrics with real-world clinical judgment , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2025
[33]

, author=

A Semantic Evaluation Framework for Medical Report Generation Using Large Language Models. , author=. Computers, Materials & Continua , volume=

work page
[34]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[35]

arXiv preprint arXiv:2510.19626 , year=

MedReason-R1: Learning to Reason for CT Diagnosis with Reinforcement Learning and Local Zoom , author=. arXiv preprint arXiv:2510.19626 , year=

work page arXiv
[36]

arXiv preprint arXiv:2511.14900 , year=

Skin-R1: Toward Trustworthy Clinical Reasoning for Dermatological Diagnosis , author=. arXiv preprint arXiv:2511.14900 , year=

work page arXiv
[37]

Zhi, Weihai and Guo, Jiayan and Li, Shangyang , journal=. MedGR

work page
[38]

Advances in neural information processing systems , volume=

Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation , author=. Advances in neural information processing systems , volume=

work page
[39]

Advances in Neural Information Processing Systems , volume=

Llava-med: Training a large language-and-vision assistant for biomedicine in one day , author=. Advances in Neural Information Processing Systems , volume=

work page
[40]

arXiv preprint arXiv:2406.19280 , year=

Huatuogpt-vision, towards injecting medical visual knowledge into multimodal llms at scale , author=. arXiv preprint arXiv:2406.19280 , year=

work page arXiv
[41]

Machine Learning for Health (ML4H) , pages=

Med-flamingo: a multimodal medical few-shot learner , author=. Machine Learning for Health (ML4H) , pages=. 2023 , organization=

work page 2023
[42]

Healthgpt: A medical large vision-language model for unifying comprehension and gen- eration via heterogeneous knowledge adaptation.CoRR, abs/2502.09838, 2025

Healthgpt: A medical large vision-language model for unifying comprehension and generation via heterogeneous knowledge adaptation , author=. arXiv preprint arXiv:2502.09838 , year=

work page arXiv
[43]

Ct2rep: Automated radiology report generation for 3D medical imaging. International Conference on Medical Image Computing and Computer-Assisted Intervention, 2024.
[44]

Med3DInsight: Enhancing 3D medical image understanding with 2D multi-modal large language models. arXiv preprint arXiv:2403.05141.
[45]

3d-ct-gpt: Generating 3D radiology reports through integration of large vision-language models. arXiv preprint arXiv:2409.19330.
[46]

Med-2e3: A 2D-enhanced 3D medical multimodal large language model. arXiv preprint arXiv:2411.12783.
[47]

Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
[48]

Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models. arXiv preprint arXiv:2310.10505.
[49]

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. arXiv preprint arXiv:2402.14740.
[50]

Reinforce++: A simple and efficient approach for aligning large language models. arXiv e-prints.
[51]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization. 2025.
[52]

Group Sequence Policy Optimization. arXiv preprint arXiv:2507.18071.
[53]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv preprint arXiv:2503.14476.
[54]

Understanding R1-Zero-Like Training: A Critical Perspective. arXiv preprint arXiv:2503.20783.
[55]

Process Reinforcement through Implicit Rewards. arXiv preprint arXiv:2502.01456.
[56]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models. arXiv preprint arXiv:2505.22617.
[57]

Dr-llava: Visual instruction tuning with symbolic clinical grounding. arXiv preprint arXiv:2405.19567.
[58]

Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner. arXiv preprint arXiv:2505.11404.
[59]

Medvlthinker: Simple baselines for multimodal medical reasoning. arXiv preprint arXiv:2508.02669.
[60]

MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 2019.
[61]

August 2025.
[62]

December 2025.
[63]

Learning spatiotemporal frequency-transformer for compressed video super-resolution. European Conference on Computer Vision, 2022.
[64]

TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis. ICLR.
[65]

OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis. ICLR.
[66]

Eyecaregpt: Boosting comprehensive ophthalmology understanding with tailored dataset, benchmark and model. Proceedings of the 33rd ACM International Conference on Multimedia.
[67]

HeartcareGPT: A Unified Multimodal ECG Suite for Dual Signal-Image Modeling and Understanding. 2026.
[68]

OralGPT-Omni: A Versatile Dental Multimodal Large Language Model. arXiv preprint arXiv:2511.22055.
[69]

Videorefer suite: Advancing spatial-temporal object understanding with video LLM. Proceedings of the Computer Vision and Pattern Recognition Conference.
[70]

Hyperllava: Dynamic visual and language expert tuning for multimodal large language models. arXiv preprint arXiv:2403.13447.
[71]

LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation. arXiv preprint arXiv:2604.11789.
[72]

Unified Personalized Understanding, Generating and Editing. arXiv preprint arXiv:2601.06965.
[73]

Eoc-bench: Can MLLMs identify, recall, and forecast objects in an egocentric world? arXiv preprint arXiv:2506.05287.
[74]

Pixelrefer: A unified framework for spatio-temporal object referring with arbitrary granularity. arXiv preprint arXiv:2510.23603.
[75]

Boostmis: Boosting medical image semi-supervised learning with adaptive pseudo labeling and informative active annotation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[76]

Learning in Imperfect Environment: Multi-Label Classification with Long-Tailed Distribution and Partial Labels. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[77]

Revisiting the domain shift and sample uncertainty in multi-source active domain transfer. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
