From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

Like Zhang; Pan Wang; Qianli Xing; Qingzu He; Qi Wang; Runliang Niu; Shiqi Wang; Xiyu Hu

arxiv: 2606.18846 · v1 · pith:2KNSO4YSnew · submitted 2026-06-17 · 💻 cs.CV

From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

Like Zhang , Runliang Niu , Shiqi Wang , Xiyu Hu , Qianli Xing , Pan Wang , Qingzu He , Qi Wang This is my paper

Pith reviewed 2026-06-26 21:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language modelsdata annotationon-policy annotationvisual reasoningflowchartsGUI screenshotsbounding boxesstructured data

0 comments

The pith

An on-policy annotation loop with a unified schema produces high-acceptance data that lifts VLM accuracy on flowcharts by 35 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ScreenAnnotator to create training data that unites spatial coordinates, open-vocabulary labels, attributes, and relationships in one representation for advanced visual reasoning in vision-language models. Existing tools fall short on expressiveness, couple annotation poorly to training, and force repeated labeling, so the authors replace them with a single atom schema, an on-policy loop containing a Bayesian verifier, and template-driven synthesis that turns each atom into multiple tasks. The loop raises acceptance to nearly 100 percent on flowcharts and 77 percent on GUI images while lowering annotation time, and the resulting data produces a 76.1 percent average accuracy after fine-tuning, a 35.1-point absolute improvement.

Core claim

The paper establishes that an on-policy annotation loop embedded with a Bayesian Annotation Verifier, built on a unified annotation atom schema, drives annotation acceptance rates to nearly 100 percent on flowcharts and 77 percent on GUI screenshots while steadily decreasing per-image time as data accumulate; the same data, when used to fine-tune a VLM, yields 76.1 percent average accuracy, a 35.1-point absolute gain over the baseline.

What carries the argument

The unified annotation atom schema that binds spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into one unit, together with the on-policy annotation loop containing the Bayesian Annotation Verifier.

If this is right

Annotation acceptance reaches nearly 100 percent on flowcharts and 77 percent on GUI screenshots without manual correction.
Per-image annotation time declines as the volume of labeled data grows.
Fine-tuning on the synthesized data raises VLM accuracy on flowchart reasoning to 76.1 percent.
Template-driven synthesis generates multiple reasoning tasks from each static atom, removing the need for repeated annotation.
The same schema and loop apply across flowchart and GUI domains without redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The schema could be tested on additional visual domains such as diagrams or scene graphs to check whether acceptance rates and accuracy gains hold.
If the verifier's Bayesian update is replaced by a simpler rule-based check, one could measure whether acceptance and time savings remain comparable.
The multi-task synthesis templates might be extended to generate chain-of-thought reasoning examples directly from the atoms.

Load-bearing premise

The unified annotation atom schema binds spatial, semantic, and structural information into a representation that is sufficient and lossless for all downstream visual reasoning tasks.

What would settle it

Measure whether a VLM fine-tuned on data produced by this tool shows the same 35-point accuracy gain when evaluated on a new set of GUI screenshots or flowchart variants never seen during annotation.

Figures

Figures reproduced from arXiv: 2606.18846 by Like Zhang, Pan Wang, Qianli Xing, Qingzu He, Qi Wang, Runliang Niu, Shiqi Wang, Xiyu Hu.

**Figure 1.** Figure 1: Motivation of SCREENANNOTATOR: bridging the infrastructure gap between conventional image annotation tools and the data demands of modern VLMs. markers into their textual chains of thought. Training these models therefore requires data that unifies spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into a single representation. Recent years have witnes… view at source ↗

**Figure 2.** Figure 2: (a) The On-Policy Annotation Loop: At each round, (1) the current policy model (a detector + VLM) [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: The unified annotation atom schema. Each [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: On-policy annotation efficiency (RQ1). (a)–(b): accept rate and completion rate per round for flowchart [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Lift score of BAV at budget levels 1%, 5%, [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 7.** Figure 7: Qualitative examples. (a) Two-hop retrieval on a flowchart: the base VLM skips intermediate reasoning and [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

Vision-language models (VLMs) are rapidly advancing toward sophisticated grounded structured visual reasoning. Training models for such advanced capabilities demands a new genre of data that seamlessly unifies spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into a singular representation. However, existing data annotation tools fundamentally fail to meet these intricate demands, suffering from three systematic bottlenecks: limited expressiveness, severe annotation-training decoupling, and poor data reusability. To bridge this infrastructure gap, we introduce an open-source annotation tool, ScreenAnnotator. First, we define a unified annotation atom schema that binds spatial, semantic, and structural primitives into a single unit. Second, we implement an on-policy annotation loop embedded with a Bayesian Annotation Verifier (BAV). Finally, we design a template-driven multi-task data synthesis process dynamically transforms static atoms into diverse multi-dimensional reasoning tasks, eliminating redundant re-annotation. The on-policy loop drives the annotation accept rate to nearly 100% on flowcharts and 77% on GUI screenshots, while steadily reducing per-image annotation time as labeled data accumulate. In the flowchart scenario, fine-tuning a VLM yields 76.1% average accuracy, which is a 35.1% point absolute gain. Our code is available at: https://github.com/WnQinm/Annotator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ScreenAnnotator ties a unified atom schema to an on-policy Bayesian verifier and template synthesis for VLM annotation data, with reported gains in accept rate and fine-tuning accuracy on flowcharts.

read the letter

This paper gives us ScreenAnnotator, which defines a single annotation atom that holds spatial coordinates, open-vocabulary text, attributes, and topology together. It runs an on-policy loop with a Bayesian Annotation Verifier to accept or reject labels during collection, then uses templates to spin the same atoms into multiple reasoning tasks.

The integration of those three pieces is the concrete new element. The tool is open source and directly targets the expressiveness, decoupling, and reusability problems listed in the abstract. The reported numbers show the loop reaching near-100% accept on flowcharts, 77% on GUI screenshots, falling annotation time, and a 35-point accuracy lift after fine-tuning.

The main soft spot is that the abstract states those outcomes without the evaluation protocol, baseline models, dataset sizes, or statistical checks. If the full paper supplies those and shows the gains survive ablations, the results hold. The claim that the schema is lossless for downstream reasoning is plausible for the flowchart and GUI cases shown but remains an assumption until tested more broadly.

The work is aimed at groups building structured visual-reasoning data pipelines. Anyone already dealing with annotation bottlenecks for VLMs will find usable ideas. It deserves a serious referee because the infrastructure problem is real and the pipeline is specified enough to review.

Referee Report

3 major / 2 minor

Summary. The paper introduces ScreenAnnotator, an open-source annotation tool for vision-language models that addresses bottlenecks in expressiveness, annotation-training decoupling, and data reusability. It defines a unified annotation atom schema integrating spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships; implements an on-policy annotation loop with a Bayesian Annotation Verifier (BAV); and uses template-driven multi-task data synthesis to generate diverse reasoning tasks from static atoms. The authors report that the on-policy loop achieves nearly 100% accept rates on flowcharts and 77% on GUI screenshots while reducing per-image annotation time, and that fine-tuning a VLM on flowchart data yields 76.1% average accuracy (a 35.1 percentage point absolute gain).

Significance. If the reported performance gains are supported by rigorous, reproducible evaluation, the work could meaningfully advance VLM training infrastructure by enabling more expressive and reusable annotations for grounded visual reasoning. The open-source release and the on-policy loop design are concrete strengths that could support follow-on research.

major comments (3)

[Abstract] Abstract: the manuscript states specific numeric outcomes (accept rates, time reduction, 35.1-point accuracy gain) but supplies no information on evaluation protocol, baseline comparisons, dataset sizes, statistical tests, or potential confounds, so the data-to-claim link cannot be assessed.
[Introduction / Schema definition] The central claim that the unified annotation atom schema is sufficient and lossless for downstream visual reasoning tasks is load-bearing yet unsupported by any ablation or comparison showing that alternative representations would produce inferior VLM performance.
[On-Policy Annotation Loop] The on-policy loop with BAV is presented as driving the reported accept rates, but the manuscript provides no implementation details, priors, or integration mechanics for BAV, preventing verification of how the loop produces the claimed efficiency gains.

minor comments (2)

The term 'on-policy' is used without clarifying its relation to reinforcement-learning concepts or how the verifier influences the policy.
Figure or diagram clarity: a visual example of an annotation atom and its transformation into multi-task templates would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the manuscript states specific numeric outcomes (accept rates, time reduction, 35.1-point accuracy gain) but supplies no information on evaluation protocol, baseline comparisons, dataset sizes, statistical tests, or potential confounds, so the data-to-claim link cannot be assessed.

Authors: We agree that the abstract would be improved by briefly indicating the scale of the evaluation and the nature of the comparisons. In the revised version we will expand the abstract to note the number of images annotated, the VLM used for fine-tuning, the baseline model, and the evaluation split, while keeping the abstract concise. Full protocol details, including any statistical reporting, already appear in Section 4 and will be cross-referenced. revision: yes
Referee: [Introduction / Schema definition] The central claim that the unified annotation atom schema is sufficient and lossless for downstream visual reasoning tasks is load-bearing yet unsupported by any ablation or comparison showing that alternative representations would produce inferior VLM performance.

Authors: The empirical evidence for sufficiency is the 35.1-point accuracy improvement obtained when the synthesized tasks are derived from the atom schema. We did not, however, run an explicit ablation that replaces the atom schema with an alternative representation while holding all other factors fixed. In the revision we will add a short discussion subsection that contrasts the atom schema with common alternatives (e.g., separate bounding-box + caption pipelines) and explains the design rationale for unification; we will also note the absence of a controlled ablation as a limitation and outline how such an experiment could be conducted in future work. revision: partial
Referee: [On-Policy Annotation Loop] The on-policy loop with BAV is presented as driving the reported accept rates, but the manuscript provides no implementation details, priors, or integration mechanics for BAV, preventing verification of how the loop produces the claimed efficiency gains.

Authors: We acknowledge that the current manuscript text does not supply the concrete priors, update rules, or pseudocode for the Bayesian Annotation Verifier. The revision will include a new subsection (or expanded appendix) that specifies the prior distributions, the likelihood model, the decision threshold, and the exact integration points within the annotation loop. We will also add a short algorithmic description so that the efficiency claims can be reproduced. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces an annotation tool (ScreenAnnotator) with a unified atom schema, on-policy loop, and Bayesian Annotation Verifier, then reports measured empirical outcomes including accept rates (nearly 100% on flowcharts, 77% on GUI), per-image time reduction, and a 35.1-point accuracy gain after fine-tuning. These quantities are presented as experimental results from running the tool and training, not as quantities defined in terms of fitted parameters, self-referential definitions, or derivations that reduce to inputs by construction. No equations, uniqueness theorems, or self-citations appear in the text; the central claims rest on externally verifiable performance metrics rather than internal redefinitions or ansatzes smuggled via prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The design rests on domain assumptions about the completeness of the atom schema and the effectiveness of the on-policy Bayesian loop; no free parameters or invented physical entities are introduced.

axioms (2)

domain assumption Existing annotation tools suffer from limited expressiveness, severe annotation-training decoupling, and poor data reusability.
This premise justifies the need for the new tool and is stated directly in the abstract.
domain assumption A single unified annotation atom schema can capture all required spatial, semantic, and structural primitives without loss for visual reasoning.
Central design choice invoked when defining the schema.

invented entities (1)

Bayesian Annotation Verifier (BAV) no independent evidence
purpose: To perform verification inside the on-policy annotation loop.
New component introduced by the paper; no independent evidence outside the reported accept rates is provided.

pith-pipeline@v0.9.1-grok · 5789 in / 1320 out tokens · 31114 ms · 2026-06-26T21:42:34.765597+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 3 canonical work pages · 1 internal anchor

[1]

In2020 IEEE In- ternational Conference on Robotics and Automation (ICRA), pages 87–93, Paris, France

Bayesod: A bayesian approach for uncertainty estimation in deep object detectors. In2020 IEEE In- ternational Conference on Robotics and Automation (ICRA), pages 87–93, Paris, France. IEEE. Elmar Haussmann, Michele Fenzi, Kashyap Chitta, Jan Ivanecky, Hanson Xu, Donna Roy, Akshita Mittel, Nicolas Koumchatzky, Clement Farabet, and Jose M. Alvarez. 2020. Sc...

2020
[2]

InProceedings of the 8th International Conference on Learning Representa- tions, Addis Ababa, Ethiopia

Dividemix: Learning with noisy labels as semi-supervised learning. InProceedings of the 8th International Conference on Learning Representa- tions, Addis Ababa, Ethiopia. Sheng Liu, Zhihui Zhu, Qing Qu, and Chong You
[3]

InProceedings of the 39th Inter- national Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 14153–14172

Robust training under label noise by over- parameterization. InProceedings of the 39th Inter- national Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 14153–14172. PMLR. Ruijie Lu, Yiyang Ma, Xiaokang Chen, Lingxiao Luo, Zhiyu Wu, Zizheng Pan, Xingchao Liu, Yutong Lin, Hao Li, Wen Liu, Zhewen Hao, Xi Gao, Shaoh...

work page arXiv 2026
[4]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824. Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B. Gupta, Xiaojiang Chen, and Xin Wang. 2022. A survey of deep active learn- ing.ACM Computing Surveys, 54(9):1–40. Bryan C Russell, Antonio Torralba, Kevin P Murphy, and William T Freeman....

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

Text” denotes the textual answer component and “Box

Label Studio: Data labeling soft- ware. Open source software available from https://github.com/HumanSignal/label-studio. Xueying Zhan, Qingzhong Wang, Kuan-hao Huang, Haoyi Xiong, Dejing Dou, and Antoni B. Chan. 2022. A comparative survey of deep active learning.arXiv preprint arXiv:2203.13450. Yudian Zheng, Guoliang Li, Yuanbing Li, Caihua Shan, and Reyn...

work page arXiv 2022
[6]

Successor. Q: In this flowchart, what is the next step after {node.text }? A: The next step after {node.text} is <|ref|>{successors(node )[0].text}<|/ref|><|box|>{successors(node)[0].bbox}<|/ box|>
[7]

Q: In this flowchart, which step directly connects to {node

Predecessor. Q: In this flowchart, which step directly connects to {node. text}? A: The step that directly connects to {node.text} is <|ref|>{ predecessors(node)[0].text}<|/ref|><|box|>{predecessors (node)[0].bbox}<|/box|>. 13
[8]

Count.This task type has four variants: node count, edge count, out-degree, and in-degree. Q: How many nodes are there in this flowchart? A: There are {len(nodes)} nodes in this flowchart <|ref|>{len (nodes)} nodes<|/ref|><|box|>[{nodes[0].bbox}, {nodes [1].bbox}, ..., {nodes[n].bbox}]<|/box|>. Q: How many connecting edges are there in this flowchart? A: ...
[9]

Q: Starting from {path[0].text}, which node is reached after two steps? A: Starting from <|ref|>{path[0].text}<|/ref|><|box|>{path[0]

Two-hop. Q: Starting from {path[0].text}, which node is reached after two steps? A: Starting from <|ref|>{path[0].text}<|/ref|><|box|>{path[0]. bbox}<|/box|>, the first step reaches <|ref|>{path[1]. text}<|/ref|><|box|>{path[1].bbox}<|/box|>, and the second step reaches <|ref|>{path[2].text}<|/ref|><|box |>{path[2].bbox}<|/box|>
[10]

Q: List the complete path from {path[0].text} to {path[-1]

Path. Q: List the complete path from {path[0].text} to {path[-1]. text}. A: The complete path is: <|ref|>{path[0].text}<|/ref|><|box |>{path[0].bbox}<|/box|> -> <|ref|>{path[1].text}<|/ref |><|box|>{path[1].bbox}<|/box|> -> ... -> <|ref|>{path [-1].text}<|/ref|><|box|>{path[-1].bbox}<|/box|>
[11]

Q: When the decision result of {edge.from_node.text} is {edge

Condition. Q: When the decision result of {edge.from_node.text} is {edge. text}, where does the flow go? A: When the result of <|ref|>{edge.from_node.text}<|/ref|><| box|>{edge.from_node.bbox}<|/box|> is <|ref|>{edge.text }<|/ref|><|box|>{edge.bbox}<|/box|>, the flow goes to <|ref|>{edge.to_node.text}<|/ref|><|box|>{edge.to_node. bbox}<|/box|>
[12]

Q: Where is {node.text} located in the image? A: <|ref|>{node.text}<|/ref|><|box|>{node.bbox}<|/box|> is located in the {region(node.bbox)} region of the image

Spatial. Q: Where is {node.text} located in the image? A: <|ref|>{node.text}<|/ref|><|box|>{node.bbox}<|/box|> is located in the {region(node.bbox)} region of the image
[13]

Relative. Q: In which direction is {first_node.text} relative to { second_node.text}? A: <|ref|>{first_node.text}<|/ref|><|box|>{first_node.bbox }<|/box|> is located to the {relative_pos(first_node. bbox, second_node.bbox)} of <|ref|>{second_node.text }<|/ref|><|box|>{second_node.bbox}<|/box|>
[14]

Region. Q: Which nodes are located in the {target_region} region of the image? A: The nodes located in the {target_region} region are: { node_1.text}, {node_2.text}, ..., {node_k.text} <|ref |>{target_region} region nodes<|/ref|><|box|>[{node_1. bbox}, {node_2.bbox}, ..., {node_k.bbox}]<|/box|>
[15]

Nearest. Q: Which node is spatially closest to {node.text}? A: The node spatially closest to <|ref|>{node.text}<|/ref|><| box|>{node.bbox}<|/box|> is <|ref|>{nearest_node.text }<|/ref|><|box|>{nearest_node.bbox}<|/box|>
[16]

Q: List all nodes in spatial order from top to bottom

Sorting.This task has two variants: top-to- bottom and left-to-right. Q: List all nodes in spatial order from top to bottom. A: The top-to-bottom order of nodes is: <|ref|>{node_1.text }<|/ref|><|box|>{node_1.bbox}<|/box|> -> <|ref|>{node_2. text}<|/ref|><|box|>{node_2.bbox}<|/box|> -> ... -> <| ref|>{node_n.text}<|/ref|><|box|>{node_n.bbox}<|/box|>. Q: L...
[17]

Q: What are the coordinates of {node.text}? A: The coordinates of {node.text} are <|ref|>{node.text}<|/ ref|><|box|>{node.bbox}<|/box|>

Coordinate.This task has two variants: grounding (text → box) and degrounding (box → text). Q: What are the coordinates of {node.text}? A: The coordinates of {node.text} are <|ref|>{node.text}<|/ ref|><|box|>{node.bbox}<|/box|>. Q: Which node is located in the region with coordinates {node. bbox}? A: The node located in this region is <|ref|>{node.text}<|...

[1] [1]

In2020 IEEE In- ternational Conference on Robotics and Automation (ICRA), pages 87–93, Paris, France

Bayesod: A bayesian approach for uncertainty estimation in deep object detectors. In2020 IEEE In- ternational Conference on Robotics and Automation (ICRA), pages 87–93, Paris, France. IEEE. Elmar Haussmann, Michele Fenzi, Kashyap Chitta, Jan Ivanecky, Hanson Xu, Donna Roy, Akshita Mittel, Nicolas Koumchatzky, Clement Farabet, and Jose M. Alvarez. 2020. Sc...

2020

[2] [2]

InProceedings of the 8th International Conference on Learning Representa- tions, Addis Ababa, Ethiopia

Dividemix: Learning with noisy labels as semi-supervised learning. InProceedings of the 8th International Conference on Learning Representa- tions, Addis Ababa, Ethiopia. Sheng Liu, Zhihui Zhu, Qing Qu, and Chong You

[3] [3]

InProceedings of the 39th Inter- national Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 14153–14172

Robust training under label noise by over- parameterization. InProceedings of the 39th Inter- national Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 14153–14172. PMLR. Ruijie Lu, Yiyang Ma, Xiaokang Chen, Lingxiao Luo, Zhiyu Wu, Zizheng Pan, Xingchao Liu, Yutong Lin, Hao Li, Wen Liu, Zhewen Hao, Xi Gao, Shaoh...

work page arXiv 2026

[4] [4]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824. Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B. Gupta, Xiaojiang Chen, and Xin Wang. 2022. A survey of deep active learn- ing.ACM Computing Surveys, 54(9):1–40. Bryan C Russell, Antonio Torralba, Kevin P Murphy, and William T Freeman....

work page internal anchor Pith review Pith/arXiv arXiv 2022

[5] [5]

Text” denotes the textual answer component and “Box

Label Studio: Data labeling soft- ware. Open source software available from https://github.com/HumanSignal/label-studio. Xueying Zhan, Qingzhong Wang, Kuan-hao Huang, Haoyi Xiong, Dejing Dou, and Antoni B. Chan. 2022. A comparative survey of deep active learning.arXiv preprint arXiv:2203.13450. Yudian Zheng, Guoliang Li, Yuanbing Li, Caihua Shan, and Reyn...

work page arXiv 2022

[6] [6]

Successor. Q: In this flowchart, what is the next step after {node.text }? A: The next step after {node.text} is <|ref|>{successors(node )[0].text}<|/ref|><|box|>{successors(node)[0].bbox}<|/ box|>

[7] [7]

Q: In this flowchart, which step directly connects to {node

Predecessor. Q: In this flowchart, which step directly connects to {node. text}? A: The step that directly connects to {node.text} is <|ref|>{ predecessors(node)[0].text}<|/ref|><|box|>{predecessors (node)[0].bbox}<|/box|>. 13

[8] [8]

Count.This task type has four variants: node count, edge count, out-degree, and in-degree. Q: How many nodes are there in this flowchart? A: There are {len(nodes)} nodes in this flowchart <|ref|>{len (nodes)} nodes<|/ref|><|box|>[{nodes[0].bbox}, {nodes [1].bbox}, ..., {nodes[n].bbox}]<|/box|>. Q: How many connecting edges are there in this flowchart? A: ...

[9] [9]

Q: Starting from {path[0].text}, which node is reached after two steps? A: Starting from <|ref|>{path[0].text}<|/ref|><|box|>{path[0]

Two-hop. Q: Starting from {path[0].text}, which node is reached after two steps? A: Starting from <|ref|>{path[0].text}<|/ref|><|box|>{path[0]. bbox}<|/box|>, the first step reaches <|ref|>{path[1]. text}<|/ref|><|box|>{path[1].bbox}<|/box|>, and the second step reaches <|ref|>{path[2].text}<|/ref|><|box |>{path[2].bbox}<|/box|>

[10] [10]

Q: List the complete path from {path[0].text} to {path[-1]

Path. Q: List the complete path from {path[0].text} to {path[-1]. text}. A: The complete path is: <|ref|>{path[0].text}<|/ref|><|box |>{path[0].bbox}<|/box|> -> <|ref|>{path[1].text}<|/ref |><|box|>{path[1].bbox}<|/box|> -> ... -> <|ref|>{path [-1].text}<|/ref|><|box|>{path[-1].bbox}<|/box|>

[11] [11]

Q: When the decision result of {edge.from_node.text} is {edge

Condition. Q: When the decision result of {edge.from_node.text} is {edge. text}, where does the flow go? A: When the result of <|ref|>{edge.from_node.text}<|/ref|><| box|>{edge.from_node.bbox}<|/box|> is <|ref|>{edge.text }<|/ref|><|box|>{edge.bbox}<|/box|>, the flow goes to <|ref|>{edge.to_node.text}<|/ref|><|box|>{edge.to_node. bbox}<|/box|>

[12] [12]

Q: Where is {node.text} located in the image? A: <|ref|>{node.text}<|/ref|><|box|>{node.bbox}<|/box|> is located in the {region(node.bbox)} region of the image

Spatial. Q: Where is {node.text} located in the image? A: <|ref|>{node.text}<|/ref|><|box|>{node.bbox}<|/box|> is located in the {region(node.bbox)} region of the image

[13] [13]

Relative. Q: In which direction is {first_node.text} relative to { second_node.text}? A: <|ref|>{first_node.text}<|/ref|><|box|>{first_node.bbox }<|/box|> is located to the {relative_pos(first_node. bbox, second_node.bbox)} of <|ref|>{second_node.text }<|/ref|><|box|>{second_node.bbox}<|/box|>

[14] [14]

Region. Q: Which nodes are located in the {target_region} region of the image? A: The nodes located in the {target_region} region are: { node_1.text}, {node_2.text}, ..., {node_k.text} <|ref |>{target_region} region nodes<|/ref|><|box|>[{node_1. bbox}, {node_2.bbox}, ..., {node_k.bbox}]<|/box|>

[15] [15]

Nearest. Q: Which node is spatially closest to {node.text}? A: The node spatially closest to <|ref|>{node.text}<|/ref|><| box|>{node.bbox}<|/box|> is <|ref|>{nearest_node.text }<|/ref|><|box|>{nearest_node.bbox}<|/box|>

[16] [16]

Q: List all nodes in spatial order from top to bottom

Sorting.This task has two variants: top-to- bottom and left-to-right. Q: List all nodes in spatial order from top to bottom. A: The top-to-bottom order of nodes is: <|ref|>{node_1.text }<|/ref|><|box|>{node_1.bbox}<|/box|> -> <|ref|>{node_2. text}<|/ref|><|box|>{node_2.bbox}<|/box|> -> ... -> <| ref|>{node_n.text}<|/ref|><|box|>{node_n.bbox}<|/box|>. Q: L...

[17] [17]

Q: What are the coordinates of {node.text}? A: The coordinates of {node.text} are <|ref|>{node.text}<|/ ref|><|box|>{node.bbox}<|/box|>

Coordinate.This task has two variants: grounding (text → box) and degrounding (box → text). Q: What are the coordinates of {node.text}? A: The coordinates of {node.text} are <|ref|>{node.text}<|/ ref|><|box|>{node.bbox}<|/box|>. Q: Which node is located in the region with coordinates {node. bbox}? A: The node located in this region is <|ref|>{node.text}<|...