Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models

arXiv: 2605.19663 · v1 · pith:EXTWKTII · submitted 2026-05-19 · 💻 cs.AI

Weicong Ni, Tianbao Jiang, Linlin Wang

Pith reviewed 2026-05-20 05:33 UTC · model grok-4.3

classification 💻 cs.AI

keywords: vision-language models · hallucination reduction · structured reasoning · pseudocode library · difficulty assessment · POPE benchmark · MMStar benchmark · robotic automation

The pith

A pseudocode library with difficulty-based strategy selection lets vision-language models reason more reliably and cut hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes PStar, a framework that equips vision-language models with a library of structured pseudocode reasoning paths. It uses a Difficulty Feature Vector to evaluate how complex a question is and then picks the right path for step-by-step inference. This adaptive selection is meant to make outputs more consistent and less prone to visual or language errors. The authors show the method raises scores to 87.1 percent on POPE and 68.0 percent on MMStar, exceeding GPT-4V. The approach targets real-world uses such as robotic automation where hallucinations can cause unsafe decisions.

Core claim

PStar formulates a library of abstract reasoning functions as modular pseudocode and introduces a Difficulty Feature Vector that lets the model assess question complexity, then adaptively selects the matching structured reasoning path to produce more robust and interpretable outputs.

What carries the argument

The Pseudocode-guided Structured Reasoning framework (PStar) that maintains a library of modular reasoning strategies and uses the Difficulty Feature Vector to match question complexity to the appropriate path.
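The selection mechanism can be illustrated with a minimal Python sketch. Everything concrete here is a hypothetical placeholder: the three feature names, the equal-weight score, the two thresholds, and the abstract function names in the library are assumptions made for illustration, since the abstract does not specify the DFV components or the library contents.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DifficultyFeatureVector:
    """Hypothetical DFV; the real feature set is not given in the abstract."""
    text_complexity: float    # assumed to lie in [0, 1]
    visual_complexity: float  # assumed to lie in [0, 1]
    reasoning_depth: float    # assumed to lie in [0, 1]

    def score(self) -> float:
        # Assumption: an unweighted mean collapses the vector to one scalar.
        return (self.text_complexity
                + self.visual_complexity
                + self.reasoning_depth) / 3.0


# Hypothetical library: each path is an ordered list of abstract function names.
PSEUDOCODE_LIBRARY: Dict[str, List[str]] = {
    "easy": ["describe_image", "answer"],
    "medium": ["describe_image", "locate_objects", "answer"],
    "hard": ["describe_image", "locate_objects",
             "decompose_question", "verify", "answer"],
}


def select_path(dfv: DifficultyFeatureVector,
                low: float = 0.33, high: float = 0.66) -> List[str]:
    """Map a DFV to a reasoning path using two illustrative thresholds."""
    s = dfv.score()
    if s < low:
        return PSEUDOCODE_LIBRARY["easy"]
    if s < high:
        return PSEUDOCODE_LIBRARY["medium"]
    return PSEUDOCODE_LIBRARY["hard"]
```

In this sketch a harder question simply receives a longer path with an explicit verification step; the paper's actual matching rule may differ.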

If this is right

Hallucination rates drop in open-ended visual-language tasks, supporting safer use in robotic decision making.
Reasoning becomes more interpretable because each output follows an explicit pseudocode path.
Performance exceeds GPT-4V on the POPE and MMStar benchmarks without requiring larger models.
The same adaptive selection works across questions that differ widely in difficulty and input modality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be tested on sequential decision tasks where errors compound over multiple steps.
Replacing the Difficulty Feature Vector with a learned module might further improve selection accuracy.
The pseudocode library offers a template that could be ported to other multimodal models facing similar hallucination issues.

Load-bearing premise

The Difficulty Feature Vector can accurately judge question complexity so that the model reliably picks the best reasoning strategy from the pseudocode library.

What would settle it

An experiment that replaces the Difficulty Feature Vector with random or fixed strategy selection and measures whether POPE and MMStar scores drop below the reported 87.1 percent and 68.0 percent.
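A harness for this kind of ablation might look like the following sketch. It is a toy stand-in under stated assumptions: each question is reduced to a difficulty score plus the name of the path that answers it correctly, whereas a real run would call the VLM and score its output on POPE or MMStar. The selector names, thresholds, and toy data are hypothetical and produce no real benchmark numbers.

```python
import random
from typing import Callable, List, Tuple

# Toy question: (difficulty score, name of the path that answers it correctly).
Question = Tuple[float, str]
Selector = Callable[[Question, random.Random], str]

PATHS = ["easy", "medium", "hard"]


def adaptive(q: Question, rng: random.Random) -> str:
    """Difficulty-based selection with illustrative thresholds."""
    d = q[0]
    if d < 0.33:
        return "easy"
    if d < 0.66:
        return "medium"
    return "hard"


def fixed(q: Question, rng: random.Random) -> str:
    """Non-adaptive baseline: always use the same path."""
    return "medium"


def uniform_random(q: Question, rng: random.Random) -> str:
    """Non-adaptive baseline: pick a path uniformly at random."""
    return rng.choice(PATHS)


def accuracy(questions: List[Question], selector: Selector, seed: int = 0) -> float:
    """Fraction of questions for which the selector picks the correct path."""
    if not questions:
        raise ValueError("questions must be non-empty")
    rng = random.Random(seed)
    correct = sum(1 for q in questions if selector(q, rng) == q[1])
    return correct / len(questions)


# Hypothetical toy data, not drawn from any benchmark.
TOY_QUESTIONS: List[Question] = [
    (0.1, "easy"), (0.5, "medium"), (0.9, "hard"), (0.2, "easy"),
]
```

Comparing `accuracy(TOY_QUESTIONS, adaptive)` with the two baselines mirrors the proposed test: if adaptive selection does not beat fixed or random selection on real data, the DFV is not carrying the reported gains.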

Figures

Figures reproduced from arXiv: 2605.19663 by Linlin Wang, Tianbao Jiang, Weicong Ni.

**Figure 1.** Overview of PStar for Structured Reasoning. (a) Difficulty-Aware Sampling uses Difficulty Feature Vectors (DFV)

**Figure 2.** Visualization of DFVs in PCA space for seed sizes

**Figure 3.** Use case study of two examples.

TABLE III: Comparison of methods on MMStar benchmark.

| Method | MMStar | Open-Source | Data | Training-Free |
|---|---|---|---|---|
| Mulberry [22] | 61.3% | × | 260k | × |
| AStar [23] | 61.7% | ✓ | 0.5k | ✓ |
| LLaVA-CoT [6] | 57.6% | × | 100k | × |
| LlamaV-o1 [24] | 59.5% | × | 118k | × |
| Ours | 68.0% | ✓ | 0.5k | ✓ |

these Pseudocode-guided paths step by step, executing each abstract function to perform structured reasoning. This example illustrates how PStar…

Original abstract

Vision-Language Models (VLMs) are becoming the cornerstone of high-level reasoning for robotic automation, enabling robots to parse natural language commands and perceive their environments. However, their susceptibility to hallucinations introduces critical failures in decision-making, posing significant safety and reliability risks in physical deployments. This challenge is exacerbated by the open-ended nature of real-world tasks, where questions vary vastly in difficulty and modality, demanding robust and adaptable reasoning strategies. To tackle this, we propose the Pseudocode-guided Structured Reasoning framework (PStar), which adaptively selects structured pseudocode reasoning paths to help VLMs perform flexible and step-by-step reasoning. We first design a set of abstract reasoning functions and formulate a structured pseudocode library to represent modular reasoning strategies. Crucially, we design a Difficulty Feature Vector (DFV) that allows the model to assess question complexity and adaptively choose appropriate reasoning strategies, enhancing robustness and interpretability. Extensive experiments demonstrate that PStar significantly reduces hallucination rates, achieving state-of-the-art scores of 87.1% on POPE and 68.0% on MMStar, outperforming even GPT-4V. By providing a validated mechanism to reduce visual-language errors, PStar offers a critical step toward deploying more trustworthy and deterministic VLMs for real-world automated systems, where such errors can lead to catastrophic outcomes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

PStar's core idea of pseudocode templates plus a model-generated Difficulty Feature Vector is worth a look for robotics reliability, but the adaptive part risks circularity since DFV comes from the same VLM.

The paper's main contribution is a framework called PStar that gives VLMs a library of structured pseudocode reasoning functions and lets the model pick among them using a Difficulty Feature Vector. It reports strong numbers on POPE and MMStar, beating GPT-4V while cutting hallucinations, and frames this as useful for safer robotic automation where open-ended questions vary in difficulty and modality. That combination of modular pseudocode strategies with an adaptive selector is new enough to stand out from plain chain-of-thought work.

The write-up does a clear job laying out why hallucination is a safety blocker in physical deployments and why fixed reasoning paths fall short on varied real-world tasks. The experiments are presented as extensive, which suggests they ran the usual benchmarks with some controls.

The soft spot is exactly the one the stress test raises. The DFV is described as something the VLM itself computes to judge complexity and modality, with no mention of external features, ground-truth labels, or a separate verifier. If the model hallucinates on difficulty assessment, it will route to the wrong pseudocode path and the reported gains lose their explanation. That makes the adaptive mechanism fragile by construction. The abstract gives no error analysis or ablation on DFV accuracy, so it is hard to know whether the improvements come from better reasoning or just from the model happening to pick decent paths on these particular datasets.

This paper is aimed at researchers building reliable VLMs for automation and robotics. Anyone working on structured reasoning or hallucination mitigation would find the pseudocode library and DFV concept useful to examine, even if the results need tighter validation.

It deserves a serious referee because the problem is practical and the proposed mechanism is concrete, though the review will likely focus on whether the circularity concern actually undermines the claims once the full experiments and ablations are checked.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Pseudocode-guided Structured Reasoning (PStar) framework for vision-language models. It introduces a library of abstract reasoning functions expressed as structured pseudocode, along with a Difficulty Feature Vector (DFV) that enables the model to assess question complexity and adaptively select reasoning strategies from the library. The central claim is that this approach significantly reduces hallucinations, yielding state-of-the-art results of 87.1% on POPE and 68.0% on MMStar while outperforming GPT-4V, with potential benefits for reliable robotic automation.

Significance. If the performance claims and the reliability of the DFV-based adaptation are substantiated, the work would represent a meaningful step toward interpretable, adaptive reasoning in VLMs for safety-critical applications. The modular pseudocode library offers a concrete mechanism for step-by-step inference that could improve robustness over purely prompt-based methods.

major comments (2)

[Abstract] Abstract: The headline claims of 87.1% on POPE and 68.0% on MMStar (outperforming GPT-4V) are presented with no accompanying experimental details, baselines, statistical tests, error analysis, or dataset splits. This absence makes it impossible to evaluate whether the reported gains are attributable to the structured reasoning or to other factors.
[Method] Method section (DFV description): The Difficulty Feature Vector is computed internally by the same VLM whose hallucinations the framework aims to mitigate. Because DFV directly determines strategy selection from the pseudocode library, any error in difficulty or modality assessment produces the wrong reasoning path; the manuscript provides no external verifier, ground-truth labels, or ablation showing that DFV errors do not collapse the adaptive mechanism to a non-adaptive baseline.

minor comments (1)

[Abstract] The abstract and method descriptions introduce several new terms (DFV, structured pseudocode library) without a clear notation table or diagram showing how DFV components map to library entries.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which highlight important aspects of clarity and robustness in our presentation of PStar. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

Referee: [Abstract] Abstract: The headline claims of 87.1% on POPE and 68.0% on MMStar (outperforming GPT-4V) are presented with no accompanying experimental details, baselines, statistical tests, error analysis, or dataset splits. This absence makes it impossible to evaluate whether the reported gains are attributable to the structured reasoning or to other factors.

Authors: We agree that the abstract would be strengthened by including more context for the reported results. In the revised version, we will expand the abstract to briefly note the evaluation on the POPE and MMStar benchmarks, the set of baselines including GPT-4V, and that full experimental details, dataset information, and statistical analysis appear in the Experiments section. This will allow readers to better contextualize the claims while respecting abstract length limits.

revision: yes

Referee: [Method] Method section (DFV description): The Difficulty Feature Vector is computed internally by the same VLM whose hallucinations the framework aims to mitigate. Because DFV directly determines strategy selection from the pseudocode library, any error in difficulty or modality assessment produces the wrong reasoning path; the manuscript provides no external verifier, ground-truth labels, or ablation showing that DFV errors do not collapse the adaptive mechanism to a non-adaptive baseline.

Authors: The referee correctly notes the potential for error propagation when DFV is generated by the same VLM. While objective ground-truth labels for question difficulty do not exist in a model-independent sense, we have included ablations in the original experiments comparing adaptive DFV selection against fixed non-adaptive baselines. In the revision, we will add a new analysis subsection that reports DFV accuracy against human annotations on a subset of questions and includes sensitivity tests with injected DFV noise to quantify robustness. These additions will directly address the concern about collapse to non-adaptive behavior.

revision: partial
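
A sensitivity test with injected DFV noise could be sketched as below. This is an illustration under assumptions, not the authors' protocol: the DFV is reduced to a single difficulty score in [0, 1], the noise is Gaussian, and the thresholds and the name `flip_rate` are hypothetical. The sketch measures how often noise changes the selected path, which bounds how often a noisy DFV could change the downstream answer.

```python
import random
from typing import Sequence


def select_path(score: float) -> str:
    """Illustrative threshold rule mapping a difficulty score to a path name."""
    if score < 0.33:
        return "easy"
    if score < 0.66:
        return "medium"
    return "hard"


def flip_rate(scores: Sequence[float], sigma: float, seed: int = 0) -> float:
    """Fraction of questions whose selected path changes under Gaussian noise.

    sigma is the standard deviation of the injected noise; noisy scores are
    clipped back into [0, 1] before the path is re-selected.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)
    flips = 0
    for s in scores:
        noisy = min(1.0, max(0.0, s + rng.gauss(0.0, sigma)))
        if select_path(noisy) != select_path(s):
            flips += 1
    return flips / len(scores)
```

Sweeping `sigma` upward and plotting benchmark accuracy against the flip rate would show whether performance degrades gracefully or collapses toward the non-adaptive baseline.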

Circularity Check

0 steps flagged

No circularity: framework description contains no derivations, equations, or self-referential reductions.

full rationale

The paper presents PStar as an empirical framework that designs abstract reasoning functions, a pseudocode library, and a Difficulty Feature Vector (DFV) to adaptively select strategies. No equations, fitted parameters, or derivation chains appear in the abstract or described text. Performance numbers (87.1% POPE, 68.0% MMStar) are reported from external benchmark experiments rather than from any internal prediction that reduces to the inputs by construction. The DFV is introduced as a designed component for assessing complexity, but the text provides no indication that it is defined in terms of the target outputs or fitted to the evaluation metrics in a self-referential manner. This is a standard non-circular proposal of a method whose validity rests on experimental validation against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the untested premise that structured pseudocode paths plus a designed difficulty vector will generalize across open-ended visual-language tasks; the DFV and pseudocode library are newly introduced constructs without shown external validation.

axioms (1)

domain assumption: VLMs are susceptible to hallucinations that cause critical failures in decision-making for real-world tasks
Stated directly in the opening of the abstract as the motivating problem.

invented entities (2)

Difficulty Feature Vector (DFV) · no independent evidence
purpose: To assess question complexity and enable adaptive selection of reasoning strategies
Newly designed component described as crucial for robustness and interpretability.

structured pseudocode library · no independent evidence
purpose: To represent modular reasoning strategies for step-by-step inference
Abstract states the authors first design abstract reasoning functions and formulate this library.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We design a Difficulty Feature Vector (DFV) that allows the model to assess question complexity and adaptively choose appropriate reasoning strategies

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: A*-Based Reasoning Path Generation ... novel cost function $g(S) = \sum_i \lambda_{a_i} \cdot \mathrm{len}(r_i) / \mathrm{usefulness}(r_i)$
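
The cost function quoted in the second passage sums, over the steps of a path, a per-action weight times the length of the step divided by its usefulness. The sketch below shows one way to evaluate it. The interpretation is an assumption: `a_i` is read as the action type of step `i`, `len(r_i)` as a word count of the step text, and `usefulness(r_i)` as a positive score supplied externally. The passage defines none of these, and the names `Step` and `path_cost` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Step:
    action: str        # a_i: assumed to be the abstract function type
    text: str          # r_i: assumed to be the generated reasoning text
    usefulness: float  # usefulness(r_i): assumed positive, scored elsewhere


def path_cost(steps: List[Step], weights: Dict[str, float]) -> float:
    """Evaluate g(S) = sum_i lambda_{a_i} * len(r_i) / usefulness(r_i)."""
    total = 0.0
    for step in steps:
        if step.usefulness <= 0:
            raise ValueError("usefulness must be positive")
        length = len(step.text.split())  # assumption: length in words
        total += weights[step.action] * length / step.usefulness
    return total


# Hypothetical two-step path with illustrative weights.
example_steps = [
    Step("perceive", "list the objects", 1.0),
    Step("verify", "check each object again", 2.0),
]
example_weights = {"perceive": 1.0, "verify": 0.5}
```

Under this reading, long steps with low usefulness are expensive, so a search that minimizes the cost would favor short, informative reasoning paths.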

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 4 internal anchors

[1] H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 296–26 306
[2] Z. Yang, L. Li, K. Lin, J. Wang, C.-C. Lin, Z. Liu, and L. Wang, “The dawn of lmms: Preliminary explorations with gpt-4v(ision),” arXiv preprint arXiv:2309.17421, vol. 9, no. 1, p. 1, 2023
[3] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” in ICLR, 2024
[4] Z. Chen, W. Wang, Y. Cao, Y. Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liu et al., “Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling,” arXiv preprint arXiv:2412.05271, 2024
[5] T. Yu, H. Zhang, Q. Li, Q. Xu, Y. Yao, D. Chen, X. Lu, G. Cui, Y. Dang, T. He, X. Feng, J. Song, B. Zheng, Z. Liu, T.-S. Chua, and M. Sun, “Rlaif-v: Open-source ai feedback leads to super gpt-4v trustworthiness,” 2024
[6] G. Xu, P. Jin, H. Li, Y. Song, L. Sun, and L. Yuan, “Llava-cot: Let vision language models reason step-by-step,” 2025
[7] H. Xu, A. Sharaf, Y. Chen, W. Tan, L. Shen, B. Van Durme, K. Murray, and Y. J. Kim, “Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation,” in International Conference on Machine Learning. PMLR, 2024, pp. 55 204–55 224
[8] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen, “Evaluating object hallucination in large vision-language models,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 292–305
[9] T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y. Yacoob et al., “Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 375–14 385
[10] L. Chen, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, J. Wang, Y. Qiao, D. Lin et al., “Are we on the right way for evaluating large vision-language models?” arXiv preprint arXiv:2403.20330, 2024
[11] P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao, “Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts,” in International Conference on Learning Representations (ICLR), 2024
[12] K. Wang, J. Pan, W. Shi, Z. Lu, H. Ren, A. Zhou, M. Zhan, and H. Li, “Measuring multimodal mathematical reasoning with math-vision dataset,” in The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024
[13] P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan, “Learn to explain: Multimodal reasoning via thought chains for science question answering,” in The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022
[14] Q. Team, “Qwen2.5-vl,” January 2025
[15] X. Liang, S. Song, Z. Zheng, H. Wang, Q. Yu, X. Li, R.-H. Li, F. Xiong, and Z. Li, “Internal consistency and self-feedback in large language models: A survey,” CoRR, 2024
[16] A. Grattafiori and T. Abhimanyu Dubey, “The llama 3 herd of models,” 2024
[17] Y. Zhao, J. Huang, J. Hu, X. Wang, Y. Mao, D. Zhang, Z. Jiang, Z. Wu, B. Ai, A. Wang et al., “Swift: a scalable lightweight infrastructure for fine-tuning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 28, 2025, pp. 29 733–29 735
[18] Y. Wang, Q. Liu, J. Xu, T. Liang, X. Chen, Z. He, L. Song, D. Yu, J. Li, Z. Zhang, R. Wang, Z. Tu, H. Mi, and D. Yu, “Thoughts are all over the place: On the underthinking of o1-like llms,” 2025
[19] Y. Sui, Y.-N. Chuang, G. Wang, J. Zhang, T. Zhang, J. Yuan, H. Liu, A. Wen, S. Zhong, H. Chen, and X. Hu, “Stop overthinking: A survey on efficient reasoning for large language models,” 2025
[20] L. Orseau, L. Lelis, T. Lattimore, and T. Weber, “Single-agent policy tree search with guarantees,” Advances in Neural Information Processing Systems, vol. 31, 2018
[21] G. Dong, C. Zhang, M. Deng, Y. Zhu, Z. Dou, and J.-R. Wen, “Progressive multimodal reasoning via active retrieval,” 2024
[22] H. Yao, J. Huang, W. Wu, J. Zhang, Y. Wang, S. Liu, Y. Wang, Y. Song, H. Feng, L. Shen, and D. Tao, “Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search,” 2024
[23] J. Wu, M. Feng, S. Zhang, R. Jin, F. Che, Z. Wen, and J. Tao, “Boosting multimodal reasoning with mcts-automated structured thinking,” 2025
[24] O. Thawakar, D. Dissanayake, K. More, R. Thawkar, A. Heakl, N. Ahsan, Y. Li, M. Zumri, J. Lahoud, R. M. Anwer, H. Cholakkal, I. Laptev, M. Shah, F. S. Khan, and S. Khan, “Llamav-o1: Rethinking step-by-step visual reasoning in llms,” 2025
[25] F. Teng, Z. Yu, Q. Shi, J. Zhang, C. Wu, and Y. Luo, “Atom of thoughts for markov llm test-time scaling,” 2025
[26] J. Li, D. Guo, D. Yang, R. Xu, Y. Wu, and J. He, “Codei/o: Condensing reasoning patterns via code input-output prediction,” 2025
[27] J. N. Farr, J. J. Jenkins, and D. G. Paterson, “Simplification of flesch reading ease formula.” Journal of applied psychology, vol. 35, no. 5, p. 333, 1951
[28] W. Rong, Z. Li, W. Zhang, and L. Sun, “An improved canny edge detection algorithm,” in 2014 IEEE international conference on mechatronics and automation. IEEE, 2014, pp. 577–582
[29] H. Duan, J. Yang, Y. Qiao, X. Fang, L. Chen, Y. Liu, X. Dong, Y. Zang, P. Zhang, J. Wang et al., “Vlmevalkit: An open-source toolkit for evaluating large multi-modality models,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 11 198–11 201
[30] Q. Huang, X. Dong, P. Zhang, B. Wang, C. He, J. Wang, D. Lin, W. Zhang, and N. Yu, “Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 418–13 427
[31] S. Yin, C. Fu, S. Zhao, T. Xu, H. Wang, D. Sui, Y. Shen, K. Li, X. Sun, and E. Chen, “Woodpecker: Hallucination correction for multimodal large language models,” Science China Information Sciences, vol. 67, no. 12, p. 220105, 2024
[32] W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, S. XiXuan, J. Xu, K. Chen, B. Xu, J. Li, Y. Dong, M. Ding, and J. Tang, “CogVLM: Visual expert for pretrained language models,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
[33] T. Yu, H. Zhang, Y. Yao, Y. Dang, D. Chen, X. Lu, G. Cui, T. He, Z. Liu, T.-S. Chua, and M. Sun, “Rlaif-v: Aligning mllms through open-source ai feedback for super gpt-4v trustworthiness,” arXiv preprint arXiv:2405.17220, 2024
[34] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,” arXiv preprint arXiv:2302.00923, 2023
[35] D. Mondal, S. Modi, S. Panda, R. Singh, and G. S. Rao, “Kam-cot: knowledge augmented multimodal chain-of-thoughts reasoning,” in Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelli...
[36] Q. Yan, Y. Yuan, X. Hu, Y. Wang, J. Xu, J. Li, C.-W. Fu, and P.-A. Heng, “Medhalltune: An instruction-tuning benchmark for mitigating medical hallucination in vision-language models,” 2025
[37] W. Xiao, Z. Huang, L. Gan, W. He, H. Li, Z. Yu, F. Shu, H. Jiang, and L. Zhu, “Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 24, pp. 25 543–25 551, Apr. 2025
[38] K. Singh, A. Biswas, S. Bhowmick, P. Moturi, and S. K. Gollapalli, “Sbsc: Step-by-step coding for improving mathematical olympiad performance,” 2025
[39] DeepSeek-AI, D. Guo, D. Yang, et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025
[40] Y. Du, K. Chen, Y. Zhan, C. H. Low, T. You, M. Islam, Z. Guo, Y. Jin, G. Chen, and P.-A. Heng, “Llm-assisted multi-teacher continual learning for visual question answering in robotic surgery,” 2024
[41] Z. Deng, Y. Shi, K. Ji, L. Xu, S. Huang, and J. Wang, “Human-object interaction via automatically designed vlm-guided motion policy,” 2025
[42] K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, “Ok-vqa: A visual question answering benchmark requiring external knowledge,” 2019
[43] Z. Bai, P. Wang, T. Xiao, T. He, Z. Han, Z. Zhang, and M. Z. Shou, “Hallucination of multimodal large language models: A survey,” 2025
[44] T. Yu, Y. Yao, H. Zhang, T. He, Y. Han, G. Cui, J. Hu, Z. Liu, H.-T. Zheng, and M. Sun, “Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13 807–13 816


P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan, “Learn to explain: Multimodal reasoning via thought chains for science question answering,” inThe 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

work page 2022

[14] [14]

Qwen2.5-vl,

Q. Team, “Qwen2.5-vl,” January 2025

work page 2025

[15] [15]

Internal consistency and self-feedback in large language models: A survey,

X. Liang, S. Song, Z. Zheng, H. Wang, Q. Yu, X. Li, R.-H. Li, F. Xiong, and Z. Li, “Internal consistency and self-feedback in large language models: A survey,”CoRR, 2024

work page 2024

[16] [16]

The llama 3 herd of models,

A. Grattafiori and T. Abhimanyu Dubey, “The llama 3 herd of models,” 2024

work page 2024

[17] [17]

Swift: a scalable lightweight infrastructure for fine-tuning,

Y . Zhao, J. Huang, J. Hu, X. Wang, Y . Mao, D. Zhang, Z. Jiang, Z. Wu, B. Ai, A. Wanget al., “Swift: a scalable lightweight infrastructure for fine-tuning,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 28, 2025, pp. 29 733–29 735

work page 2025

[18] [18]

Thoughts are all over the place: On the underthinking of o1-like llms,

Y . Wang, Q. Liu, J. Xu, T. Liang, X. Chen, Z. He, L. Song, D. Yu, J. Li, Z. Zhang, R. Wang, Z. Tu, H. Mi, and D. Yu, “Thoughts are all over the place: On the underthinking of o1-like llms,” 2025

work page 2025

[19] [19]

Stop overthinking: A survey on efficient reasoning for large language models,

Y . Sui, Y .-N. Chuang, G. Wang, J. Zhang, T. Zhang, J. Yuan, H. Liu, A. Wen, S. Zhong, H. Chen, and X. Hu, “Stop overthinking: A survey on efficient reasoning for large language models,” 2025

work page 2025

[20] [20]

Single-agent policy tree search with guarantees,

L. Orseau, L. Lelis, T. Lattimore, and T. Weber, “Single-agent policy tree search with guarantees,”Advances in Neural Information Processing Systems, vol. 31, 2018

work page 2018

[21] [21]

Progressive multimodal reasoning via active retrieval,

G. Dong, C. Zhang, M. Deng, Y . Zhu, Z. Dou, and J.-R. Wen, “Progressive multimodal reasoning via active retrieval,” 2024

work page 2024

[22] [22]

Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search,

H. Yao, J. Huang, W. Wu, J. Zhang, Y . Wang, S. Liu, Y . Wang, Y . Song, H. Feng, L. Shen, and D. Tao, “Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search,” 2024

work page 2024

[23] [23]

Boosting multimodal reasoning with mcts-automated structured thinking,

J. Wu, M. Feng, S. Zhang, R. Jin, F. Che, Z. Wen, and J. Tao, “Boosting multimodal reasoning with mcts-automated structured thinking,” 2025

work page 2025

[24] [24]

Llamav-o1: Rethinking step-by-step visual reasoning in llms,

O. Thawakar, D. Dissanayake, K. More, R. Thawkar, A. Heakl, N. Ahsan, Y . Li, M. Zumri, J. Lahoud, R. M. Anwer, H. Cholakkal, I. Laptev, M. Shah, F. S. Khan, and S. Khan, “Llamav-o1: Rethinking step-by-step visual reasoning in llms,” 2025

work page 2025

[25] [25]

Atom of thoughts for markov llm test-time scaling,

F. Teng, Z. Yu, Q. Shi, J. Zhang, C. Wu, and Y . Luo, “Atom of thoughts for markov llm test-time scaling,” 2025

work page 2025

[26] [26]

Codei/o: Condensing reasoning patterns via code input-output prediction,

J. Li, D. Guo, D. Yang, R. Xu, Y . Wu, and J. He, “Codei/o: Condensing reasoning patterns via code input-output prediction,” 2025

work page 2025

[27] [27]

Simplification of flesch reading ease formula

J. N. Farr, J. J. Jenkins, and D. G. Paterson, “Simplification of flesch reading ease formula.”Journal of applied psychology, vol. 35, no. 5, p. 333, 1951

work page 1951

[28] [28]

An improved canny edge detection algorithm,

W. Rong, Z. Li, W. Zhang, and L. Sun, “An improved canny edge detection algorithm,” in2014 IEEE international conference on mechatronics and automation. IEEE, 2014, pp. 577–582

work page 2014

[29] [29]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models,

H. Duan, J. Yang, Y . Qiao, X. Fang, L. Chen, Y . Liu, X. Dong, Y . Zang, P. Zhang, J. Wanget al., “Vlmevalkit: An open-source toolkit for evaluating large multi-modality models,” inProceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 11 198– 11 201

work page 2024

[30] [30]

Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection- allocation,

Q. Huang, X. Dong, P. Zhang, B. Wang, C. He, J. Wang, D. Lin, W. Zhang, and N. Yu, “Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection- allocation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 418–13 427

work page 2024

[31] [31]

Woodpecker: Hallucination correction for multimodal large language models,

S. Yin, C. Fu, S. Zhao, T. Xu, H. Wang, D. Sui, Y . Shen, K. Li, X. Sun, and E. Chen, “Woodpecker: Hallucination correction for multimodal large language models,”Science China Information Sciences, vol. 67, no. 12, p. 220105, 2024

work page 2024

[32] [32]

CogVLM: Visual expert for pretrained language models,

W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y . Wang, J. Ji, Z. Yang, L. Zhao, S. XiXuan, J. Xu, K. Chen, B. Xu, J. Li, Y . Dong, M. Ding, and J. Tang, “CogVLM: Visual expert for pretrained language models,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024

[33] [33]

Rlaif-v: Aligning mllms through open- source ai feedback for super gpt-4v trustworthiness,

T. Yu, H. Zhang, Y . Yao, Y . Dang, D. Chen, X. Lu, G. Cui, T. He, Z. Liu, T.-S. Chua, and M. Sun, “Rlaif-v: Aligning mllms through open- source ai feedback for super gpt-4v trustworthiness,”arXiv preprint arXiv:2405.17220, 2024

work page arXiv 2024

[34] [34]

Multimodal Chain-of-Thought Reasoning in Language Models

Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,”arXiv preprint arXiv:2302.00923, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [35]

Kam- cot: knowledge augmented multimodal chain-of-thoughts reasoning,

D. Mondal, S. Modi, S. Panda, R. Singh, and G. S. Rao, “Kam- cot: knowledge augmented multimodal chain-of-thoughts reasoning,” inProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelli...

work page 2024

[36] [36]

Medhalltune: An instruction-tuning benchmark for mitigating medical hallucination in vision-language models,

Q. Yan, Y . Yuan, X. Hu, Y . Wang, J. Xu, J. Li, C.-W. Fu, and P.-A. Heng, “Medhalltune: An instruction-tuning benchmark for mitigating medical hallucination in vision-language models,” 2025

work page 2025

[37] [37]

Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback,

W. Xiao, Z. Huang, L. Gan, W. He, H. Li, Z. Yu, F. Shu, H. Jiang, and L. Zhu, “Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 24, pp. 25 543–25 551, Apr. 2025

work page 2025

[38] [38]

Sbsc: Step-by-step coding for improving mathematical olympiad performance,

K. Singh, A. Biswas, S. Bhowmick, P. Moturi, and S. K. Gollapalli, “Sbsc: Step-by-step coding for improving mathematical olympiad performance,” 2025

work page 2025

[39] [39]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

DeepSeek-AI, D. Guo, D. Yang, et.al, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025

work page 2025

[40] [40]

Llm-assisted multi-teacher continual learning for visual question answering in robotic surgery,

Y . Du, K. Chen, Y . Zhan, C. H. Low, T. You, M. Islam, Z. Guo, Y . Jin, G. Chen, and P.-A. Heng, “Llm-assisted multi-teacher continual learning for visual question answering in robotic surgery,” 2024

work page 2024

[41] [41]

Human-object interaction via automatically designed vlm-guided motion policy,

Z. Deng, Y . Shi, K. Ji, L. Xu, S. Huang, and J. Wang, “Human-object interaction via automatically designed vlm-guided motion policy,” 2025

work page 2025

[42] [42]

Ok-vqa: A visual question answering benchmark requiring external knowledge,

K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, “Ok-vqa: A visual question answering benchmark requiring external knowledge,” 2019

work page 2019

[43] [43]

Hallucination of multimodal large language models: A survey,

Z. Bai, P. Wang, T. Xiao, T. He, Z. Han, Z. Zhang, and M. Z. Shou, “Hallucination of multimodal large language models: A survey,” 2025

work page 2025

[44] [44]

Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback,

T. Yu, Y . Yao, H. Zhang, T. He, Y . Han, G. Cui, J. Hu, Z. Liu, H.-T. Zheng, and M. Sun, “Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13 807–13 816

work page 2024