Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models
Pith reviewed 2026-05-20 05:33 UTC · model grok-4.3
pith:EXTWKTII
The pith
A pseudocode library with difficulty-based strategy selection lets vision-language models reason more reliably and cut hallucinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PStar formulates a library of abstract reasoning functions as modular pseudocode and introduces a Difficulty Feature Vector that lets the model assess question complexity, then adaptively selects the matching structured reasoning path to produce more robust and interpretable outputs.
What carries the argument
The Pseudocode-guided Structured Reasoning framework (PStar), which maintains a library of modular reasoning strategies and uses the Difficulty Feature Vector to match each question's complexity to the appropriate reasoning path.
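As a rough illustration of difficulty-based dispatch, the sketch below maps a feature vector to one entry of a strategy library. Everything specific in it is an assumption made for illustration: the abstract does not state the DFV's components, the library's contents, or the selection rule, so the field names (`visual_complexity`, `reasoning_depth`), the averaging step, the thresholds, and the strategy names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DifficultyFeatureVector:
    """Hypothetical DFV; the real components are not given in the abstract."""
    visual_complexity: float  # assumed to lie in [0, 1]
    reasoning_depth: float    # assumed to lie in [0, 1]

    def score(self) -> float:
        # Assumed aggregation: plain average of the two components.
        return (self.visual_complexity + self.reasoning_depth) / 2


# Hypothetical pseudocode library: each strategy is an ordered list of
# abstract reasoning steps, written here as plain strings.
PSEUDOCODE_LIBRARY: Dict[str, List[str]] = {
    "direct": ["perceive(image)", "answer(question)"],
    "stepwise": ["perceive(image)", "decompose(question)", "answer(question)"],
    "verify": [
        "perceive(image)",
        "decompose(question)",
        "answer(question)",
        "check(answer, image)",
    ],
}


def select_strategy(dfv: DifficultyFeatureVector,
                    easy: float = 0.33,
                    hard: float = 0.66) -> str:
    """Return the library key for the path matched to the difficulty score."""
    s = dfv.score()
    if s < easy:
        return "direct"
    if s < hard:
        return "stepwise"
    return "verify"


if __name__ == "__main__":
    dfv = DifficultyFeatureVector(visual_complexity=0.9, reasoning_depth=0.8)
    name = select_strategy(dfv)
    print(name, PSEUDOCODE_LIBRARY[name])
```

A threshold rule is only one possible design; the paper may use a different mapping from DFV to library entry.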
If this is right
- Hallucination rates drop in open-ended vision-language tasks, supporting safer use in robotic decision making.
- Reasoning becomes more interpretable because each output follows an explicit pseudocode path.
- Performance exceeds GPT-4V on the POPE and MMStar benchmarks without requiring larger models.
- The same adaptive selection works across questions that differ widely in difficulty and input modality.
Where Pith is reading between the lines
- The framework could be tested on sequential decision tasks where errors compound over multiple steps.
- Replacing the Difficulty Feature Vector with a learned module might further improve selection accuracy.
- The pseudocode library offers a template that could be ported to other multimodal models facing similar hallucination issues.
Load-bearing premise
The Difficulty Feature Vector can accurately judge question complexity so that the model reliably picks the best reasoning strategy from the pseudocode library.
What would settle it
An experiment that replaces the Difficulty Feature Vector with random or fixed strategy selection and measures whether POPE and MMStar scores drop below the reported 87.1 percent and 68.0 percent.
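The control conditions in such an experiment could be set up along the following lines. This is a sketch under stated assumptions, not the paper's protocol: the strategy names, the `(difficulty, gold_answer)` item format, and the `answer_fn` callback (a placeholder for the actual model call) are hypothetical, and the adaptive selector under test is supplied by the caller.

```python
import random
from typing import Callable, Sequence, Tuple

# Toy stand-in for a benchmark item: (difficulty score, gold answer).
Example = Tuple[float, str]
Selector = Callable[[float], str]
# answer_fn(example, strategy) -> predicted answer; placeholder for a VLM call.
AnswerFn = Callable[[Example, str], str]

STRATEGIES = ("direct", "stepwise", "verify")  # hypothetical library entries


def fixed_selector(difficulty: float) -> str:
    """Control 1: always use the same strategy, ignoring difficulty."""
    return "stepwise"


def make_random_selector(seed: int) -> Selector:
    """Control 2: pick a strategy uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return lambda difficulty: rng.choice(STRATEGIES)


def accuracy(examples: Sequence[Example],
             selector: Selector,
             answer_fn: AnswerFn) -> float:
    """Fraction of items answered correctly when `selector` picks the strategy."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = sum(
        answer_fn(example, selector(example[0])) == example[1]
        for example in examples
    )
    return correct / len(examples)
```

Running `accuracy` once with the adaptive selector and once with each control, on the same items and with the same `answer_fn`, gives the comparison described above.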
Original abstract
Vision-Language Models (VLMs) are becoming the cornerstone of high-level reasoning for robotic automation, enabling robots to parse natural language commands and perceive their environments. However, their susceptibility to hallucinations introduces critical failures in decision-making, posing significant safety and reliability risks in physical deployments. This challenge is exacerbated by the open-ended nature of real-world tasks, where questions vary vastly in difficulty and modality, demanding robust and adaptable reasoning strategies. To tackle this, we propose the Pseudocode-guided Structured Reasoning framework (PStar), which adaptively selects structured pseudocode reasoning paths to help VLMs perform flexible and step-by-step reasoning. We first design a set of abstract reasoning functions and formulate a structured pseudocode library to represent modular reasoning strategies. Crucially, we design a Difficulty Feature Vector (DFV) that allows the model to assess question complexity and adaptively choose appropriate reasoning strategies, enhancing robustness and interpretability. Extensive experiments demonstrate that PStar significantly reduces hallucination rates, achieving state-of-the-art scores of 87.1% on POPE and 68.0% on MMStar, outperforming even GPT-4V. By providing a validated mechanism to reduce visual-language errors, PStar offers a critical step toward deploying more trustworthy and deterministic VLMs for real-world automated systems, where such errors can lead to catastrophic outcomes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Pseudocode-guided Structured Reasoning (PStar) framework for vision-language models. It introduces a library of abstract reasoning functions expressed as structured pseudocode, along with a Difficulty Feature Vector (DFV) that enables the model to assess question complexity and adaptively select reasoning strategies from the library. The central claim is that this approach significantly reduces hallucinations, yielding state-of-the-art results of 87.1% on POPE and 68.0% on MMStar while outperforming GPT-4V, with potential benefits for reliable robotic automation.
Significance. If the performance claims and the reliability of the DFV-based adaptation are substantiated, the work would represent a meaningful step toward interpretable, adaptive reasoning in VLMs for safety-critical applications. The modular pseudocode library offers a concrete mechanism for step-by-step inference that could improve robustness over purely prompt-based methods.
major comments (2)
- [Abstract] Abstract: The headline claims of 87.1% on POPE and 68.0% on MMStar (outperforming GPT-4V) are presented with no accompanying experimental details, baselines, statistical tests, error analysis, or dataset splits. This absence makes it impossible to evaluate whether the reported gains are attributable to the structured reasoning or to other factors.
- [Method] Method section (DFV description): The Difficulty Feature Vector is computed internally by the same VLM whose hallucinations the framework aims to mitigate. Because DFV directly determines strategy selection from the pseudocode library, any error in difficulty or modality assessment produces the wrong reasoning path; the manuscript provides no external verifier, ground-truth labels, or ablation showing that DFV errors do not collapse the adaptive mechanism to a non-adaptive baseline.
minor comments (1)
- [Abstract] The abstract and method descriptions introduce several new terms (DFV, structured pseudocode library) without a clear notation table or diagram showing how DFV components map to library entries.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which highlight important aspects of clarity and robustness in our presentation of PStar. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] Abstract: The headline claims of 87.1% on POPE and 68.0% on MMStar (outperforming GPT-4V) are presented with no accompanying experimental details, baselines, statistical tests, error analysis, or dataset splits. This absence makes it impossible to evaluate whether the reported gains are attributable to the structured reasoning or to other factors.
  Authors: We agree that the abstract would be strengthened by including more context for the reported results. In the revised version, we will expand the abstract to briefly note the evaluation on the POPE and MMStar benchmarks, the set of baselines including GPT-4V, and that full experimental details, dataset information, and statistical analysis appear in the Experiments section. This will allow readers to better contextualize the claims while respecting abstract length limits.
  Revision: yes
- Referee: [Method] Method section (DFV description): The Difficulty Feature Vector is computed internally by the same VLM whose hallucinations the framework aims to mitigate. Because DFV directly determines strategy selection from the pseudocode library, any error in difficulty or modality assessment produces the wrong reasoning path; the manuscript provides no external verifier, ground-truth labels, or ablation showing that DFV errors do not collapse the adaptive mechanism to a non-adaptive baseline.
  Authors: The referee correctly notes the potential for error propagation when DFV is generated by the same VLM. While objective ground-truth labels for question difficulty do not exist in a model-independent sense, we have included ablations in the original experiments comparing adaptive DFV selection against fixed non-adaptive baselines. In the revision, we will add a new analysis subsection that reports DFV accuracy against human annotations on a subset of questions and includes sensitivity tests with injected DFV noise to quantify robustness. These additions will directly address the concern about collapse to non-adaptive behavior.
  Revision: partial
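One way a sensitivity test with injected DFV noise might look is sketched below. It is an illustration under assumptions rather than the authors' planned analysis: the DFV is reduced to a single scalar score in [0, 1], the noise is zero-mean Gaussian, and the `selector` argument stands in for whatever rule maps a score to a strategy.

```python
import random
from typing import Callable, Sequence


def flip_rate(scores: Sequence[float],
              selector: Callable[[float], str],
              sigma: float,
              seed: int = 0) -> float:
    """Fraction of questions whose selected strategy changes under noise.

    scores:   assumed scalar difficulty scores in [0, 1], one per question
    selector: hypothetical rule mapping a score to a strategy name
    sigma:    standard deviation of the injected Gaussian noise
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # seeded so runs are repeatable
    flips = 0
    for score in scores:
        # Perturb the score, then clamp it back into the assumed [0, 1] range.
        noisy = min(1.0, max(0.0, score + rng.gauss(0.0, sigma)))
        if selector(noisy) != selector(score):
            flips += 1
    return flips / len(scores)
```

Sweeping `sigma` and plotting the flip rate next to benchmark accuracy would show how quickly selection errors accumulate and whether accuracy falls toward the non-adaptive baseline.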
Circularity Check
No circularity: framework description contains no derivations, equations, or self-referential reductions.
The paper presents PStar as an empirical framework that designs abstract reasoning functions, a pseudocode library, and a Difficulty Feature Vector (DFV) to adaptively select strategies. No equations, fitted parameters, or derivation chains appear in the abstract or described text. Performance numbers (87.1% POPE, 68.0% MMStar) are reported from external benchmark experiments rather than from any internal prediction that reduces to the inputs by construction. The DFV is introduced as a designed component for assessing complexity, but the text provides no indication that it is defined in terms of the target outputs or fitted to the evaluation metrics in a self-referential manner. This is a standard non-circular proposal of a method whose validity rests on experimental validation against independent benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: VLMs are susceptible to hallucinations that cause critical failures in decision-making for real-world tasks
invented entities (2)
- Difficulty Feature Vector (DFV): no independent evidence
- structured pseudocode library: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We design a Difficulty Feature Vector (DFV) that allows the model to assess question complexity and adaptively choose appropriate reasoning strategies"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "A*-Based Reasoning Path Generation ... novel cost function $g(S) = \sum_i \lambda_{a_i} \cdot \mathrm{len}(r_i) / \mathrm{usefulness}(r_i)$"
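The cost function in the passage quoted above can be evaluated as in the following sketch. Several points are assumptions, since the excerpt elides the surrounding definitions: the sum is taken over the steps of a path S, the ratio is applied per step, `len` is counted in whitespace-separated tokens, and the weights and usefulness scores are supplied by the caller. The `Step` type and `path_cost` name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Step:
    """One step of a reasoning path S (hypothetical container)."""
    action: str        # action type a_i, used to look up the weight lambda_{a_i}
    text: str          # reasoning text r_i
    usefulness: float  # usefulness(r_i), assumed strictly positive


def path_cost(steps: Sequence[Step], weights: Dict[str, float]) -> float:
    """g(S) = sum_i lambda_{a_i} * len(r_i) / usefulness(r_i)."""
    total = 0.0
    for step in steps:
        if step.usefulness <= 0:
            raise ValueError("usefulness must be positive")
        length = len(step.text.split())  # assumed length measure: token count
        total += weights[step.action] * length / step.usefulness
    return total


if __name__ == "__main__":
    path = [
        Step("perceive", "two red cups visible", 2.0),
        Step("answer", "yes two", 1.0),
    ]
    print(path_cost(path, {"perceive": 0.5, "answer": 1.0}))  # prints 3.0
```

Under this reading, long steps with low usefulness are penalized most, so a search that minimizes g(S) would favor short, informative paths.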
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.