PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents
Pith reviewed 2026-05-16 23:34 UTC · model grok-4.3
The pith
PyFi trains VLMs on pyramid question chains to decompose complex financial images into simpler sub-questions
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PyFi-600K organizes financial question-answer pairs into a reasoning pyramid synthesized by adversarial MCTS agents; fine-tuning VLMs on these progressive chains allows them to answer complex financial visual questions by first solving simpler sub-questions at lower pyramid levels, delivering the stated accuracy improvements on the generated test data.
What carries the argument
PyFi-adv, a multi-agent adversarial mechanism under Monte Carlo Tree Search (MCTS) that generates a pyramid-structured question chain for each financial image
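The challenger/solver dynamic can be sketched as a toy escalation loop. This is a minimal sketch under stated assumptions: the MCTS search itself is omitted, and the agent behaviors, question templates, and success probabilities are hypothetical stand-ins, not the paper's implementation.

```python
import random

def challenger(level):
    """Hypothetical challenger: proposes a question at the requested depth."""
    templates = {1: "What entity is plotted?",
                 2: "What is the peak value?",
                 3: "Is the year-over-year growth accelerating?"}
    return templates[level]

def solver(question, rng):
    """Hypothetical solver: succeeds less often as questions get deeper."""
    depth = 1 if "entity" in question else 2 if "peak" in question else 3
    return rng.random() > 0.2 * depth

def adversarial_chain(max_level=3, seed=42):
    """Challenger escalates level by level; the chain records where the
    solver breaks, yielding a progressive (non-decreasing) question chain."""
    rng = random.Random(seed)
    chain = []
    for level in range(1, max_level + 1):
        q = challenger(level)
        solved = solver(q, rng)
        chain.append((level, q, solved))
        if not solved:  # apex for this image: solver fails at this depth
            break
    return chain

print(adversarial_chain())
```

In the paper's actual mechanism the escalation is guided by MCTS rollouts rather than a single greedy pass; this sketch only illustrates the challenger-probes, solver-answers contract.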
Load-bearing premise
The automatically generated question chains truly increase in reasoning difficulty in a manner that matches genuine financial visual understanding demands.
What would settle it
Test the fine-tuned models on a fresh collection of real financial images and questions written independently by domain experts, checking whether the accuracy gains remain when the test distribution is not produced by the same adversarial process.
Figures
Original abstract
This paper proposes PyFi, a novel framework for pyramid-like financial image understanding that enables vision language models (VLMs) to reason through question chains in a progressive, simple-to-complex manner. At the core of PyFi is PyFi-600K, a dataset comprising 600K financial question-answer pairs organized into a reasoning pyramid: questions at the base require only basic perception, while those toward the apex demand increasing levels of capability in financial visual understanding and expertise. This data is scalable because it is synthesized without human annotations, using PyFi-adv, a multi-agent adversarial mechanism under the Monte Carlo Tree Search (MCTS) paradigm, in which, for each image, a challenger agent competes with a solver agent by generating question chains that progressively probe deeper capability levels in financial visual reasoning. Leveraging this dataset, we present fine-grained, hierarchical, and comprehensive evaluations of advanced VLMs in the financial domain. Moreover, fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B on the pyramid-structured question chains enables these models to answer complex financial questions by decomposing them into sub-questions with gradually increasing reasoning demands, yielding average accuracy improvements of 19.52% and 8.06%, respectively, on the dataset. All resources of code, dataset and models are available at: https://github.com/AgenticFinLab/PyFi .
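As a concrete picture of the pyramid organization the abstract describes, here is a minimal sketch of a progressive question chain. The field names, the monotone-level constraint, and the example questions are assumptions for illustration, not the released PyFi-600K schema.

```python
from dataclasses import dataclass, field

@dataclass
class PyramidQuestion:
    level: int      # 1 = base perception; higher levels demand more expertise
    question: str
    answer: str

@dataclass
class QuestionChain:
    image_id: str
    questions: list = field(default_factory=list)

    def add(self, q: PyramidQuestion) -> None:
        # Chains are progressive: each question is at least as deep as the last.
        if self.questions and q.level < self.questions[-1].level:
            raise ValueError("chain must be non-decreasing in level")
        self.questions.append(q)

chain = QuestionChain(image_id="chart_0001")
chain.add(PyramidQuestion(1, "What type of chart is shown?", "candlestick"))
chain.add(PyramidQuestion(2, "Which month has the highest close?", "March"))
chain.add(PyramidQuestion(3, "Does the trend suggest bullish momentum?", "yes"))
print([q.level for q in chain.questions])  # → [1, 2, 3]
```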
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PyFi, a framework for pyramid-like financial image understanding in VLMs. It introduces the PyFi-600K synthetic dataset of 600K financial QA pairs generated via PyFi-adv, a multi-agent adversarial MCTS system that produces progressive question chains from basic visual perception to complex financial reasoning. Fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B on these chains reportedly enables decomposition of complex questions into sub-questions, yielding average accuracy gains of 19.52% and 8.06% on the dataset. All code, data, and models are released.
Significance. If the gains prove generalizable, the work provides a scalable annotation-free method to synthesize hierarchical training data for financial VLMs, potentially advancing automated analysis of charts, reports, and visual financial documents. The adversarial MCTS challenger-solver loop is a creative mechanism for probing capability levels. However, the self-generated nature of the evaluation dataset substantially reduces the strength of the central claim until independent validation is shown.
major comments (2)
- [Abstract] All accuracy improvements (19.52% for the 3B model and 8.06% for the 7B model) are measured exclusively on the PyFi-600K dataset produced by the identical PyFi-adv MCTS generation process. This creates a circularity risk where reported gains may reflect adaptation to the synthesis policy rather than improved financial visual reasoning; no external benchmarks, human-annotated test sets, or held-out splits from independent sources are described.
- [Evaluation] Evaluation section (inferred from abstract claims): The paper does not report baseline comparisons against standard fine-tuning on non-pyramid data, random question chains, or existing financial VLM benchmarks (e.g., FinVQA or ChartQA variants). Without these controls, it is impossible to isolate the contribution of the pyramid structure versus simple data scaling.
minor comments (2)
- [Abstract] The MCTS parameters (exploration constant, depth limits) are not described; these free parameters should be listed explicitly to support the reproducibility claims.
- [Abstract] The GitHub repository link is welcome, but the abstract would benefit from a one-sentence statement on the source and diversity of the underlying financial images used for generation.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments, which highlight important aspects of our evaluation methodology. We address each major comment below and commit to revisions that strengthen the claims regarding generalizability.
Point-by-point responses
- Referee: [Abstract] All accuracy improvements (19.52% for the 3B model and 8.06% for the 7B model) are measured exclusively on the PyFi-600K dataset produced by the identical PyFi-adv MCTS generation process. This creates a circularity risk where reported gains may reflect adaptation to the synthesis policy rather than improved financial visual reasoning; no external benchmarks, human-annotated test sets, or held-out splits from independent sources are described.
  Authors: We acknowledge the circularity concern as a substantive limitation of the current evaluation. The PyFi-600K dataset is intentionally self-generated to enable scalable, annotation-free creation of progressive reasoning chains, and the reported gains specifically demonstrate improved decomposition of complex questions into sub-questions. However, this does not fully isolate gains from adaptation to the generator. In the revised manuscript, we will add results on held-out splits of PyFi-600K (disjoint from training chains) and include zero-shot/few-shot evaluations on external benchmarks such as ChartQA and FinVQA to provide independent validation. We will also clarify the fixed nature of the MCTS policy versus learned decomposition skills. Revision: yes.
- Referee: [Evaluation] (inferred from abstract claims): The paper does not report baseline comparisons against standard fine-tuning on non-pyramid data, random question chains, or existing financial VLM benchmarks (e.g., FinVQA or ChartQA variants). Without these controls, it is impossible to isolate the contribution of the pyramid structure versus simple data scaling.
  Authors: This is a valid criticism. The current results focus on the benefits of pyramid-structured chains but lack controls to separate the hierarchical organization from mere data volume or random ordering. In the revision, we will expand the evaluation section to include: (i) fine-tuning on the same 600K pairs but with randomly shuffled question orders, (ii) standard fine-tuning on non-pyramid financial QA data of comparable scale, and (iii) performance on public benchmarks (ChartQA, FinVQA) for direct comparison. These additions will better isolate the pyramid structure's contribution. Revision: yes.
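Control (i), shuffling the question order while holding the QA pairs and data volume fixed, is straightforward to implement. The chain contents below are toy examples, not PyFi-600K records.

```python
import random

def shuffled_chain(chain, seed=0):
    """Control (i): same QA pairs and the same volume, but the pyramid
    ordering is destroyed, isolating the contribution of the progressive
    simple-to-complex structure from raw data scale."""
    rng = random.Random(seed)
    control = list(chain)  # copy so the original pyramid is untouched
    rng.shuffle(control)
    return control

pyramid = [(1, "What type of chart is shown?"),
           (2, "Which month has the highest close?"),
           (3, "Does the trend suggest bullish momentum?")]
print(shuffled_chain(pyramid))
```

If the pyramid-ordered model beats this shuffled control at equal scale, the hierarchical organization itself, not just the extra synthetic data, carries the gain.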
Circularity Check
Accuracy gains reported only on the same MCTS-synthetic PyFi-600K dataset
specific steps
- Pattern: fitted input called prediction. [Abstract]:
"fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B on the pyramid-structured question chains enables these models to answer complex financial questions by decomposing them into sub-questions with gradually increasing reasoning demands, yielding average accuracy improvements of 19.52% and 8.06%, respectively, on the dataset."
The dataset (PyFi-600K) is synthesized without human annotations using the PyFi-adv multi-agent MCTS mechanism; both the training chains and the evaluation instances therefore share the identical generation policy and difficulty progression. The accuracy numbers are thus computed on data whose structure is defined by the same adversarial loop that supplied the fine-tuning examples, so the reported lifts are not independent of the synthesis method.
full rationale
The paper's headline result (accuracy lifts after fine-tuning) is measured exclusively on the PyFi-600K dataset whose question chains were produced by the identical PyFi-adv MCTS challenger/solver loop used to create the training data. No external human-annotated test set or cross-benchmark is referenced, so the reported improvements reduce to performance on data whose distribution and difficulty structure are defined by the same generation process. This matches the fitted-input-called-prediction pattern: the model is fitted to the synthetic pyramid chains and then evaluated on closely related instances from the same synthetic distribution, making the numerical gains statistically expected rather than an independent demonstration of new capability.
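The objection can be operationalized as a generalization-gap check: score the fine-tuned model on a held-out split from the same generator and on an independent, expert-written split, then compare. The answers and scores below are invented placeholders, not results from the paper.

```python
def accuracy(preds, golds):
    """Fraction of exact-match answers."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Hypothetical model outputs on two disjoint test sets.
synthetic_gold = ["a", "b", "c", "d", "e"]   # PyFi-adv-generated split
external_gold  = ["a", "b", "c", "d", "e"]   # expert-written questions
preds_synth    = ["a", "b", "c", "d", "x"]   # strong on same-generator data
preds_external = ["a", "x", "x", "d", "x"]   # weaker on independent data

gap = accuracy(preds_synth, synthetic_gold) - accuracy(preds_external, external_gold)
print(f"generalization gap: {gap:.2f}")  # → generalization gap: 0.40
```

A large gap would indicate the gains track the synthesis policy; a small gap would support genuine financial visual reasoning improvement.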
Axiom & Free-Parameter Ledger
free parameters (1)
- MCTS exploration and depth parameters
axioms (1)
- Domain assumption: training on progressively harder question chains improves complex reasoning performance in VLMs.
invented entities (1)
- PyFi-adv multi-agent adversarial system (no independent evidence)
Reference graph
Works this paper leans on
- [1] rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. arXiv preprint arXiv:2501.04519.
- [2] Let's Verify Step by Step. In The Twelfth International Conference on Learning Representations.
- [3] Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework. arXiv preprint arXiv:2501.17015.
- [4] Fin-Fact: A Benchmark Dataset for Multimodal Financial Fact-Checking and Explanation Generation. In Companion Proceedings of the ACM Web Conference 2025, pages 785–788.
- [5] FinChart-Bench: Benchmarking Financial Chart Comprehension in Vision-Language Models. arXiv preprint arXiv:2507.14823.
- [6] FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain. arXiv preprint arXiv:2505.17471.