Thought Graph Traversal for Test-time Scaling in Chest X-ray VLLMs

arxiv: 2506.11989 · v3 · submitted 2025-06-13 · 💻 cs.CV

Yue Yao, Zelin Wen, Yan Tong, Xinyu Tian, Xuqing Li, Xiao Ma, Dongliang Xu, Tom Gedeon

Pith reviewed 2026-05-19 09:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords chest X-ray · vision-language models · test-time scaling · report generation · thought graph · medical priors · reasoning budget

The pith

Integrating medical thought graphs into prompts lets frozen VLLMs produce more accurate and consistent chest X-ray reports at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes a Thought Graph Traversal framework to improve how vision-language models generate reports for chest X-rays by embedding structured medical knowledge as a graph in the prompt. The model is guided to follow a logical sequence of organ checks and uses a reasoning budget forcing step to extend inference depth when needed. A reader might care because the method boosts performance on existing frozen models without any retraining or new data, making medical AI more practical. It also makes reasoning traceable to help identify biases in training datasets.

Core claim

The paper claims that a lightweight Thought Graph Traversal framework, which incorporates structured medical priors to guide reasoning through organ-specific findings in a coherent order, combined with reasoning budget forcing to adjust inference depth at test time, enables a frozen radiology VLLM to self-correct and generate more accurate, consistent chest X-ray reports that outperform baseline prompting approaches on standard benchmarks.

What carries the argument

The Thought Graph Traversal (TGT) framework that structures medical priors into a traversable graph to enforce a medically coherent order of reasoning in the prompt.
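
As a rough illustration of what structuring priors into a traversable graph could look like, the sketch below orders a small organ graph so that each organ is visited after its prerequisites, then writes that order into a prompt. The organ names, the dependency edges, the prior texts, and the function names `traversal_order` and `build_prompt` are all hypothetical; they are not taken from the paper or its released prompts.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each organ maps to the organs that should be
# examined before it. The edges are illustrative assumptions only.
ORGAN_GRAPH = {
    "heart": set(),
    "lungs": {"heart"},
    "pleura": {"lungs"},
    "mediastinum": {"heart"},
    "bones": {"pleura", "mediastinum"},
}

# Hypothetical per-organ priors (placeholder text, not from the paper).
PRIORS = {
    "heart": "Assess the size and contour of the cardiac silhouette.",
    "lungs": "Check both lung fields for opacities or consolidation.",
    "pleura": "Look for effusion or pneumothorax along the pleural margins.",
    "mediastinum": "Check mediastinal width and tracheal position.",
    "bones": "Inspect ribs, clavicles, and spine for fractures or lesions.",
}


def traversal_order(graph):
    """Return a deterministic order in which every organ follows its prerequisites."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    order = []
    while sorter.is_active():
        # Sort the ready nodes so the result does not depend on set iteration order.
        ready = sorted(sorter.get_ready())
        order.extend(ready)
        sorter.done(*ready)
    return order


def build_prompt(graph, priors):
    """Write the traversal order and the matching priors into a single prompt string."""
    header = "Examine the chest X-ray in the following order, one organ per step."
    steps = [
        f"Step {i} ({organ}): {priors[organ]}"
        for i, organ in enumerate(traversal_order(graph), start=1)
    ]
    return "\n".join([header, *steps, "Then write the final report."])


if __name__ == "__main__":
    print(build_prompt(ORGAN_GRAPH, PRIORS))
```

The point of the sketch is only that the ordering is computed from the graph rather than hard-coded, so changing an edge changes the order of reasoning steps in the prompt.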

If this is right

The frozen model can self-correct its initial analysis during generation.
Generated reports achieve higher accuracy and consistency on standard benchmarks.
Reasoning paths are traceable, allowing identification of dataset biases.
Improvements occur without any modifications to the underlying model or additional training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could extend to other medical vision tasks such as analyzing CT or MRI images with similar structured priors.
Traceable reasoning might support explainability requirements in clinical AI deployments.
Budget forcing could be adapted for other test-time scaling techniques in non-medical domains.

Load-bearing premise

That adding structured medical priors as a thought graph into the prompt will produce deeper, more logical analysis and self-correction without any model changes or additional training data.

What would settle it

Evaluating the TGT method against baseline prompting on a held-out chest X-ray dataset and observing no improvement in report quality metrics such as accuracy or consistency would falsify the main claim.

Figures

Figures reproduced from arXiv: 2506.11989 by Dongliang Xu, Tom Gedeon, Xiao Ma, Xinyu Tian, Xuqing Li, Yan Tong, Yue Yao, Zelin Wen.

**Figure 1.** Test-time scaling with Thought Graph Traversal. We apply test-time scaling to radiology report generation by introducing a Thought Graph Traversal framework, which enables structured, multi-step reasoning under varying test-time compute budgets. Our method improves report quality as the reasoning budget increases (measured by the length of reasoning tokens). Notably, model accuracy shows a positive correlatio…

**Figure 2.** Overview of Thought Graph Traversal for structured radiology report generation. (Left) In the preprocessing stage, GPT-4o is used to extract organ entities and their corresponding descriptions from training reports, which are stored in organ_list and database. (Center) During inference, for each patient, a fixed set of questions is asked per organ (orange boxes), generating diverse viewpoints on an organ i…

**Figure 3.** Examples of conventional prompting and our Thought Graph. (Left) Conventional prompting mimics stylistic examples without deep understanding, often leading to hallucinations or logically inconsistent reports. (Right) Our method explicitly guides the model’s attention toward organ-level reasoning and enforces a medically coherent reasoning order, leading to more accurate, logical, and interconnected diagn…

**Figure 4.** Impact of example quantity in prompt design. We empirically study the effect of example count on model performance during chest X-ray report generation. Using our curated prompt with seven examples as a baseline, we observe that reducing the number of examples significantly degrades report quality due to insufficient guidance, leading to reasoning and formatting errors. Interestingly, increasing the number…

**Figure 5.** Analysis of organ description positions in expert-written reports from the IU X-Ray and MIMIC-CXR datasets. The left y-axis represents the different organs, and the right y-axis indicates the sentence number in the report where each organ is mentioned. The figure shows that certain organs, such as the heart and lungs, tend to appear earlier in reports, while pathological findings are more commonly placed l…

**Figure 6.** Scatter plot showing the relationship between organ order distance and ROUGE-L values. We conduct experiments across all 120 permutations of five organs in the reasoning graph. Correlation analysis reveals that ROUGE-L is sensitive to organ sequence, with statistically significant negative correlations (Spearman ρ = −0.6163, p = 8.53×10^−14; Pearson ρ = −0.6072, p = 2.45×10^−13), suggesting the impor…
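
The Figure 6 setup (all 120 orderings of five organs, each scored by a distance from a reference order and then rank-correlated with ROUGE-L) can be sketched as below. The caption here does not define "organ order distance", so this sketch assumes the Kendall tau distance (number of organ pairs whose relative order is swapped); the reference order and organ names are likewise hypothetical. No ROUGE-L values are included, since those would have to come from actual model runs.

```python
from itertools import permutations

# Hypothetical reference order; the paper's actual organs and order may differ.
REFERENCE = ("heart", "lungs", "pleura", "mediastinum", "bones")


def order_distance(order, reference=REFERENCE):
    """Kendall tau distance: count organ pairs whose relative order differs.

    This is an assumed stand-in for the paper's organ order distance.
    """
    pos = {organ: i for i, organ in enumerate(order)}
    n = len(reference)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if pos[reference[i]] > pos[reference[j]]
    )


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    indexed = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(indexed):
        j = i
        while j + 1 < len(indexed) and values[indexed[j + 1]] == values[indexed[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[indexed[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# One distance per candidate ordering; pair these with measured ROUGE-L scores
# (one per ordering) and pass both lists to spearman().
ALL_ORDERS = list(permutations(REFERENCE))
DISTANCES = [order_distance(order) for order in ALL_ORDERS]
```

Under this assumed distance, values range from 0 (the reference order) to 10 (the fully reversed order), so many of the 120 orderings tie; that is why the rank step averages tied ranks.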

Abstract

Test-time scaling offers a promising way to improve the reasoning performance of vision-language large models (VLLMs) without additional training. In this paper, we explore a simple but effective approach for applying test-time scaling to chest X-ray report generation. Specifically, we introduce a lightweight Thought Graph Traversal (TGT) framework that guides the model to reason through organ-specific findings in a medically coherent order. This framework integrates structured medical priors into the prompt, enabling deeper and more logical analysis with no changes to the underlying model. To further enhance reasoning depth, we apply a reasoning budget forcing strategy that adjusts the model's inference depth at test time by dynamically extending its generation process. This simple yet powerful combination allows a frozen radiology VLLM to self-correct and generate more accurate, consistent chest X-ray reports. Our method outperforms baseline prompting approaches on standard benchmarks, and also reveals dataset biases through traceable reasoning paths. Code and prompts are open-sourced for reproducibility at https://github.com/glerium/Thought-Graph-Traversal
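
The abstract does not spell out how reasoning budget forcing is implemented. One common form of the idea is to append a continuation cue whenever the model stops before a minimum reasoning length, and the sketch below assumes that form. The cue text, the whitespace token count, the `max_rounds` cap, and the `generate_fn` callable are all hypothetical; `fake_model` is a stand-in so the loop can run without a real VLLM.

```python
def budget_forced_reasoning(generate_fn, prompt, min_tokens, max_rounds=4,
                            cue="Wait, let me re-check."):
    """Extend the reasoning until it reaches min_tokens or max_rounds is hit.

    generate_fn is a hypothetical callable mapping a text prompt to generated
    text. Tokens are approximated by whitespace splitting (an assumption).
    """
    reasoning = generate_fn(prompt)
    rounds = 1
    while len(reasoning.split()) < min_tokens and rounds < max_rounds:
        # Append a continuation cue so the model keeps reasoning instead of stopping.
        reasoning = f"{reasoning} {cue}"
        continuation = generate_fn(f"{prompt} {reasoning}")
        reasoning = f"{reasoning} {continuation}"
        rounds += 1
    return reasoning, rounds


def fake_model(text):
    """Stand-in for a frozen VLLM call: always returns a fixed five-word chunk."""
    return "heart size appears within normal"


if __name__ == "__main__":
    trace, used = budget_forced_reasoning(fake_model, "Describe the image.", min_tokens=20)
    print(used, len(trace.split()))
```

The `max_rounds` cap is a design choice in this sketch: without it, a model that keeps returning very short continuations would loop until the budget is met, however long that takes.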

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces an organ-ordered thought graph plus budget forcing as a prompting method for frozen chest X-ray VLLMs, but the central performance claims rest on unshown numbers and missing ablations.

Hi colleague,

The one thing to know is that this work proposes a Thought Graph Traversal method to guide chest X-ray VLLMs through organ-specific reasoning in a coherent sequence at test time, using structured medical priors and a budget forcing strategy to improve report generation without any model updates.

What the paper does is take existing test-time scaling ideas and apply them specifically to radiology with this graph structure. It does well in keeping the method lightweight and in providing open code and prompts for others to use. This makes it easy to test whether the approach actually helps in practice. The traceable reasoning paths could also be useful for spotting biases in the data.

Where it falls short is the lack of concrete evidence in the description. The claim of outperforming baselines is stated but without any reported metrics, ablation studies, or details on the experiments. The key question is whether the graph traversal and ordering provide benefits beyond simply including the medical knowledge in the prompt. An experiment comparing the structured graph to an unstructured list of the same priors would clarify if the framework's structure is doing real work or if it's the priors themselves. Without that, the advantage of the method over simpler prompting remains unclear.

Readers interested in medical AI, vision-language models, or inference-time improvements might find this relevant. It could spark ideas for similar structured prompting in other domains.

I would send this to peer review. The idea is accessible and the open-sourcing supports verification, so referees can assess the empirical claims properly.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Thought Graph Traversal (TGT) framework for test-time scaling of frozen chest X-ray VLLMs. Structured medical priors are embedded as a graph in the prompt to direct organ-specific reasoning in a medically coherent order; this is combined with a reasoning budget forcing strategy that extends generation length at inference time to promote self-correction. The authors claim the approach yields more accurate and consistent reports than standard prompting baselines on common benchmarks while also exposing dataset biases through traceable reasoning paths, all without model changes or additional training.

Significance. If the empirical results hold after proper controls, the work would show that lightweight, training-free prompt engineering with domain-structured priors can meaningfully improve reasoning depth and consistency in medical VLLMs. The open-sourcing of code and prompts would further support reproducibility and allow the community to test the method on additional datasets or models.

major comments (2)

[Abstract] Abstract and experimental sections: the central claim that TGT 'outperforms baseline prompting approaches on standard benchmarks' is presented without any quantitative metrics, error bars, dataset sizes, or statistical tests. This absence makes it impossible to judge the magnitude or reliability of the reported gains and is load-bearing for the paper's main contribution.
[Method] Method section (Thought Graph Traversal description): no ablation is reported that isolates the effect of the graph structure and traversal order from simply providing the same organ-specific medical priors in a flat list. Without this control, it remains unclear whether performance improvements stem from the claimed logical traversal mechanism or from the content of the priors alone.
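
The control requested in the second major comment amounts to holding the prior content fixed while varying only the structure. A minimal sketch of such a pair of prompt variants is given below, with a check that both carry exactly the same prior sentences. The priors, the organ order, and all function names are hypothetical placeholders, not the paper's prompts.

```python
# Hypothetical priors and organ order, used only to illustrate the control.
PRIORS = {
    "heart": "Assess the size and contour of the cardiac silhouette.",
    "lungs": "Check both lung fields for opacities or consolidation.",
    "pleura": "Look for effusion or pneumothorax along the pleural margins.",
    "mediastinum": "Check mediastinal width and tracheal position.",
    "bones": "Inspect ribs, clavicles, and spine for fractures or lesions.",
}
ORDER = ["heart", "lungs", "pleura", "mediastinum", "bones"]


def structured_prompt(priors, order):
    """Ordered variant: one explicit reasoning step per organ."""
    lines = [
        f"Step {i}: examine the {organ}. {priors[organ]}"
        for i, organ in enumerate(order, start=1)
    ]
    return "Reason through the organs strictly in this order.\n" + "\n".join(lines)


def flat_prompt(priors):
    """Flat variant: the same priors as plain bullets, with no step instructions."""
    lines = [f"- {priors[organ]}" for organ in sorted(priors)]
    return "Background knowledge (no particular order):\n" + "\n".join(lines)


def prior_sentences(prompt, priors):
    """Return the set of prior sentences that appear verbatim in a prompt."""
    return {text for text in priors.values() if text in prompt}


if __name__ == "__main__":
    a = structured_prompt(PRIORS, ORDER)
    b = flat_prompt(PRIORS)
    # Both variants should expose identical prior content to the model.
    print(prior_sentences(a, PRIORS) == prior_sentences(b, PRIORS))
```

A flat list still has to be written in some order; this sketch sorts alphabetically so that the flat variant carries no medically motivated ordering.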

minor comments (2)

[Abstract] The abstract would benefit from naming the specific benchmarks and reporting at least the key performance deltas to give readers immediate context.
[Method] Clarify the exact form of the 'reasoning budget forcing' strategy (e.g., how the extension length is chosen and whether it is applied uniformly or adaptively).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which help clarify the presentation of our contributions. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

Referee: [Abstract] Abstract and experimental sections: the central claim that TGT 'outperforms baseline prompting approaches on standard benchmarks' is presented without any quantitative metrics, error bars, dataset sizes, or statistical tests. This absence makes it impossible to judge the magnitude or reliability of the reported gains and is load-bearing for the paper's main contribution.

Authors: We agree that explicit quantitative support is necessary for the central claim. In the revised manuscript we will update both the abstract and the experimental section to report concrete metrics (including clinical accuracy, report quality scores, and standard NLP metrics), along with error bars from multiple runs, exact dataset sizes, and statistical significance tests such as paired t-tests or Wilcoxon tests. These additions will allow readers to assess the magnitude and reliability of the observed improvements.

Revision: yes

Referee: [Method] Method section (Thought Graph Traversal description): no ablation is reported that isolates the effect of the graph structure and traversal order from simply providing the same organ-specific medical priors in a flat list. Without this control, it remains unclear whether performance improvements stem from the claimed logical traversal mechanism or from the content of the priors alone.

Authors: We concur that an ablation isolating the contribution of the graph structure and traversal order is valuable. We will add this control experiment to the revised method and results sections, comparing Thought Graph Traversal against a flat-list baseline that supplies identical organ-specific medical priors without the graph or ordered traversal. The new results will quantify any incremental benefit attributable to the structured traversal mechanism.

Revision: yes
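
The paired significance testing mentioned in the responses could take a form like the sketch below. It uses a paired sign-flip permutation test, which is a stand-in chosen here because it needs only the standard library; it is not the paired t-test or Wilcoxon test named in the rebuttal, though it addresses the same question of whether per-sample score differences between two prompting methods are consistent. The function name and parameters are hypothetical, and the inputs would be per-report metric scores from real runs.

```python
import random


def paired_sign_flip_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided paired permutation test on the mean per-sample difference.

    scores_a and scores_b are per-sample scores of two methods on the same
    test cases. Under the null hypothesis the sign of each difference is
    arbitrary, so signs are flipped at random and the resampled mean
    differences are compared with the observed one.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

Calling `paired_sign_flip_test(tgt_scores, flat_scores)` on two equally long lists of per-report scores returns an estimated p-value; a small value would suggest the difference between the two prompting variants is unlikely to be due to chance pairing alone.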

Circularity Check

0 steps flagged

No circularity: empirical prompting method with external benchmark evaluation

The paper presents a prompting framework (Thought Graph Traversal) for test-time scaling in VLLMs, integrating medical priors into prompts and using reasoning budget forcing. Performance is measured against baseline prompting on standard benchmarks, with no equations, fitted parameters, derivations, or self-citation chains that reduce claims to inputs by construction. The method is self-contained as an empirical technique whose validity rests on external comparisons rather than internal redefinitions or forced predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption that medically coherent ordering of findings improves reasoning depth in VLLMs; no free parameters are introduced in the abstract, and the only new entity is the Thought Graph Traversal framework itself.

axioms (1)

domain assumption: Structured medical priors integrated into prompts enable deeper and more logical analysis in frozen VLLMs.
Invoked in the description of the TGT framework as the mechanism for guiding reasoning without model changes.

invented entities (1)

Thought Graph Traversal framework (no independent evidence)
purpose: To guide the model through organ-specific findings in a medically coherent order
Presented as a new lightweight structure added to the prompt

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear

Paper passage: We classify our graph traversal into 1) Sequential traversal... 2) Parallel traversal... each organ node now accumulates its subgraph traversal over time, guided by token budget constraints... performance plateaus around 450 tokens

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Thought Graph Traversal... integrates structured medical priors... reasoning budget forcing strategy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Ho ffmann, S

J. Ho ffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, et al., Training compute- optimal large language models, in: Proceedings of the 36th International Confer- ence on Neural Information Processing Systems, 2022, pp. 30016–30030

work page 2022

[2] [2]

s1: Simple test-time scaling

N. Muennigho ff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettle- moyer, P. Liang, E. Candès, T. Hashimoto, s1: Simple test-time scaling, arXiv preprint arXiv:2501.19393 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

F. Wang, Z. Han, X. Liu, Y . Yin, X. Gao, Ctpt: Continual test-time prompt tuning for vision-language models, Pattern Recognition 161 (2025) 111300

work page 2025

[4] [4]

J. Yin, X. Zhang, L. Wu, X. Wang, Context-aware prompt learning for test- time vision recognition with frozen vision-language model, Pattern Recognition (2025) 111359

work page 2025

[5] [5]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

C. Snell, J. Lee, K. Xu, A. Kumar, Scaling llm test-time compute optimally can be more effective than scaling model parameters (2024). arXiv:2408.03314. URL https://arxiv.org/abs/2408.03314

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

URL https://openai.com/index/learning-to-reason-with-llms/

OpenAI, Learning to reason with llms (September 2024). URL https://openai.com/index/learning-to-reason-with-llms/

work page 2024

[7] [7]

o1-coder: an o1 replication for coding

Y . Zhang, S. Wu, Y . Yang, J. Shu, J. Xiao, C. Kong, J. Sang, o1-coder: an o1 replication for coding (2024). arXiv:2412.00154. URL https://arxiv.org/abs/2412.00154

work page arXiv 2024

[8] [8]

Y . Qin, X. Li, H. Zou, Y . Liu, S. Xia, Z. Huang, Y . Ye, W. Yuan, H. Liu, Y . Li, P. Liu, O1 replication journey: A strategic progress report – part 1 (2024).arXiv: 2410.18982. URL https://arxiv.org/abs/2410.18982

work page arXiv 2024

[9] [9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Deepseek-r1: Incentivizing reasoning capability in llms via rein- forcement learning (2025). arXiv:2501.12948. URL https://arxiv.org/abs/2501.12948 18

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Z. Chen, Y . Song, T.-H. Chang, X. Wan, Generating radiology reports via memory-driven transformer, in: Proceedings of the 2020 Conference on Empiri- cal Methods in Natural Language Processing (EMNLP), 2020, pp. 1439–1449

work page 2020

[11] [11]

Ganeshan, P.-A

D. Ganeshan, P.-A. T. Duong, L. Probyn, L. Lenchik, T. A. McArthur, M. Retrou- vey, E. H. Ghobadi, S. L. Desouches, D. Pastel, I. R. Francis, Structured reporting in radiology, Academic radiology 25 (1) (2018) 66–73

work page 2018

[12] [12]

Miller, Explanation in artificial intelligence: Insights from the social sciences, Artificial intelligence 267 (2019) 1–38

T. Miller, Explanation in artificial intelligence: Insights from the social sciences, Artificial intelligence 267 (2019) 1–38

work page 2019

[13] [13]

GPT-4 Technical Report

OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, et al., Gpt-4 technical report (2023). arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International conference on machine learning, PMLR, 2021, pp. 8748–8763

work page 2021

[15] [15]

C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, N. Duan, Visual chatgpt: Talking, draw- ing and editing with visual foundation models, arXiv preprint arXiv:2303.04671 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

Z. Wang, L. Liu, L. Wang, L. Zhou, Metransformer: Radiology report genera- tion by transformer with multiple learnable expert tokens, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11558–11567

work page 2023

[17] [17]

S. Yang, X. Wu, S. Ge, Z. Zheng, S. K. Zhou, L. Xiao, Radiology report gener- ation with a learned knowledge base and multi-modal alignment, Medical Image Analysis 86 (2023) 102798

work page 2023

[18] [18]

Tanida, P

T. Tanida, P. Müller, G. Kaissis, D. Rueckert, Interactive and explainable region- guided radiology report generation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7433–7442. 19

work page 2023

[19] [19]

H. Qin, Y . Song, Reinforced cross-modal alignment for radiology report genera- tion, in: Findings of the Association for Computational Linguistics: ACL 2022, 2022, pp. 448–458

[20]

F. Liu, X. Wu, S. Ge, W. Fan, Y. Zou, Exploring and distilling posterior and prior knowledge for radiology report generation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 13753–13762

[21]

M. Li, B. Lin, Z. Chen, H. Lin, X. Liang, X. Chang, Dynamic graph enhanced contrastive learning for chest x-ray report generation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3334–3343

[22]

Z. Huang, X. Zhang, S. Zhang, Kiut: Knowledge-injected u-transformer for radiology report generation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19809–19818

[23]

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901

[24]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in neural information processing systems 35 (2022) 24824–24837

[25]

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, in: Advances in Neural Information Processing Systems, 2022

[26]

Y. Liu, J. Singh, G. Liu, A. Payani, L. Zheng, Towards hierarchical multi-agent workflows for zero-shot prompt optimization, arXiv preprint arXiv:2405.20252 (2024)

[27]

Z. Hu, P. Yang, Y. Jiang, Z. Bai, Prompting large language model with context and pre-answer for knowledge-based vqa, Pattern Recognition 151 (2024) 110399

[28]

S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, K. Narasimhan, Tree of thoughts: Deliberate problem solving with large language models, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, Vol. 36, Curran Associates, Inc., 2023, pp. 11809–11822. URL https://proceedings....

[29]

Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, W. Chen, Critic: Large language models can self-correct with tool-interactive critiquing (2024). arXiv:2305.11738. URL https://arxiv.org/abs/2305.11738

[30]

J. Chen, C. Gui, R. Ouyang, A. Gao, S. Chen, G. H. Chen, X. Wang, R. Zhang, Z. Cai, K. Ji, G. Yu, X. Wan, B. Wang, Huatuogpt-vision, towards injecting medical visual knowledge into multimodal llms at scale (2024). arXiv:2406.19280. URL https://arxiv.org/abs/2406.19280

[31]

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, J. Lin, Qwen2.5-vl technical report (2025). arXiv:2502.13923. URL https://arxiv.org/abs/2502.13923

[32]

OpenAI, Gpt-4o system card (2024). arXiv:2410.21276. URL https://arxiv.org/abs/2410.21276

[33]

J. Pavlopoulos, V. Kougia, I. Androutsopoulos, A survey on biomedical image captioning, in: Proceedings of the second workshop on shortcomings in vision and language, 2019, pp. 26–36

[34]

A. E. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, Y. Peng, Z. Lu, R. G. Mark, S. J. Berkowitz, S. Horng, Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs, arXiv preprint arXiv:1901.07042 (2019)

[35]

K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318

[36]

S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72

[37]

C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81

[38]

Y. Deng, W. Zhang, Z. Chen, Q. Gu, Rephrase and respond: Let large language models ask better questions for themselves, arXiv preprint arXiv:2311.04205 (2023)

[39]

A. de Wynter, X. Wang, Q. Gu, S.-Q. Chen, On meta-prompting, arXiv preprint arXiv:2312.06562 (2023)

[40]

A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, P. Clark, Self-refine: Iterative refinement with self-feedback, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Process...