pith. sign in

arxiv: 2606.18846 · v1 · pith:2KNSO4YSnew · submitted 2026-06-17 · 💻 cs.CV

From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

Pith reviewed 2026-06-26 21:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language modelsdata annotationon-policy annotationvisual reasoningflowchartsGUI screenshotsbounding boxesstructured data
0
0 comments X

The pith

An on-policy annotation loop with a unified schema produces high-acceptance data that lifts VLM accuracy on flowcharts by 35 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ScreenAnnotator to create training data that unites spatial coordinates, open-vocabulary labels, attributes, and relationships in one representation for advanced visual reasoning in vision-language models. Existing tools fall short on expressiveness, couple annotation poorly to training, and force repeated labeling, so the authors replace them with a single atom schema, an on-policy loop containing a Bayesian verifier, and template-driven synthesis that turns each atom into multiple tasks. The loop raises acceptance to nearly 100 percent on flowcharts and 77 percent on GUI images while lowering annotation time, and the resulting data produces a 76.1 percent average accuracy after fine-tuning, a 35.1-point absolute improvement.

Core claim

The paper establishes that an on-policy annotation loop embedded with a Bayesian Annotation Verifier, built on a unified annotation atom schema, drives annotation acceptance rates to nearly 100 percent on flowcharts and 77 percent on GUI screenshots while steadily decreasing per-image time as data accumulate; the same data, when used to fine-tune a VLM, yields 76.1 percent average accuracy, a 35.1-point absolute gain over the baseline.

What carries the argument

The unified annotation atom schema that binds spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into one unit, together with the on-policy annotation loop containing the Bayesian Annotation Verifier.

If this is right

  • Annotation acceptance reaches nearly 100 percent on flowcharts and 77 percent on GUI screenshots without manual correction.
  • Per-image annotation time declines as the volume of labeled data grows.
  • Fine-tuning on the synthesized data raises VLM accuracy on flowchart reasoning to 76.1 percent.
  • Template-driven synthesis generates multiple reasoning tasks from each static atom, removing the need for repeated annotation.
  • The same schema and loop apply across flowchart and GUI domains without redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The schema could be tested on additional visual domains such as diagrams or scene graphs to check whether acceptance rates and accuracy gains hold.
  • If the verifier's Bayesian update is replaced by a simpler rule-based check, one could measure whether acceptance and time savings remain comparable.
  • The multi-task synthesis templates might be extended to generate chain-of-thought reasoning examples directly from the atoms.

Load-bearing premise

The unified annotation atom schema binds spatial, semantic, and structural information into a representation that is sufficient and lossless for all downstream visual reasoning tasks.

What would settle it

Measure whether a VLM fine-tuned on data produced by this tool shows the same 35-point accuracy gain when evaluated on a new set of GUI screenshots or flowchart variants never seen during annotation.

Figures

Figures reproduced from arXiv: 2606.18846 by Like Zhang, Pan Wang, Qianli Xing, Qingzu He, Qi Wang, Runliang Niu, Shiqi Wang, Xiyu Hu.

Figure 1
Figure 1. Figure 1: Motivation of SCREENANNOTATOR: bridging the infrastructure gap between conventional image an￾notation tools and the data demands of modern VLMs. markers into their textual chains of thought. Train￾ing these models therefore requires data that unifies spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into a single representation. Recent years have witnes… view at source ↗
Figure 2
Figure 2. Figure 2: (a) The On-Policy Annotation Loop: At each round, (1) the current policy model (a detector + VLM) [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The unified annotation atom schema. Each [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: On-policy annotation efficiency (RQ1). (a)–(b): accept rate and completion rate per round for flowchart [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Lift score of BAV at budget levels 1%, 5%, [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative examples. (a) Two-hop retrieval on a flowchart: the base VLM skips intermediate reasoning and [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

Vision-language models (VLMs) are rapidly advancing toward sophisticated grounded structured visual reasoning. Training models for such advanced capabilities demands a new genre of data that seamlessly unifies spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into a singular representation. However, existing data annotation tools fundamentally fail to meet these intricate demands, suffering from three systematic bottlenecks: limited expressiveness, severe annotation-training decoupling, and poor data reusability. To bridge this infrastructure gap, we introduce an open-source annotation tool, ScreenAnnotator. First, we define a unified annotation atom schema that binds spatial, semantic, and structural primitives into a single unit. Second, we implement an on-policy annotation loop embedded with a Bayesian Annotation Verifier (BAV). Finally, we design a template-driven multi-task data synthesis process dynamically transforms static atoms into diverse multi-dimensional reasoning tasks, eliminating redundant re-annotation. The on-policy loop drives the annotation accept rate to nearly 100% on flowcharts and 77% on GUI screenshots, while steadily reducing per-image annotation time as labeled data accumulate. In the flowchart scenario, fine-tuning a VLM yields 76.1% average accuracy, which is a 35.1% point absolute gain. Our code is available at: https://github.com/WnQinm/Annotator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ScreenAnnotator, an open-source annotation tool for vision-language models that addresses bottlenecks in expressiveness, annotation-training decoupling, and data reusability. It defines a unified annotation atom schema integrating spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships; implements an on-policy annotation loop with a Bayesian Annotation Verifier (BAV); and uses template-driven multi-task data synthesis to generate diverse reasoning tasks from static atoms. The authors report that the on-policy loop achieves nearly 100% accept rates on flowcharts and 77% on GUI screenshots while reducing per-image annotation time, and that fine-tuning a VLM on flowchart data yields 76.1% average accuracy (a 35.1 percentage point absolute gain).

Significance. If the reported performance gains are supported by rigorous, reproducible evaluation, the work could meaningfully advance VLM training infrastructure by enabling more expressive and reusable annotations for grounded visual reasoning. The open-source release and the on-policy loop design are concrete strengths that could support follow-on research.

major comments (3)
  1. [Abstract] Abstract: the manuscript states specific numeric outcomes (accept rates, time reduction, 35.1-point accuracy gain) but supplies no information on evaluation protocol, baseline comparisons, dataset sizes, statistical tests, or potential confounds, so the data-to-claim link cannot be assessed.
  2. [Introduction / Schema definition] The central claim that the unified annotation atom schema is sufficient and lossless for downstream visual reasoning tasks is load-bearing yet unsupported by any ablation or comparison showing that alternative representations would produce inferior VLM performance.
  3. [On-Policy Annotation Loop] The on-policy loop with BAV is presented as driving the reported accept rates, but the manuscript provides no implementation details, priors, or integration mechanics for BAV, preventing verification of how the loop produces the claimed efficiency gains.
minor comments (2)
  1. The term 'on-policy' is used without clarifying its relation to reinforcement-learning concepts or how the verifier influences the policy.
  2. Figure or diagram clarity: a visual example of an annotation atom and its transformation into multi-task templates would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the manuscript states specific numeric outcomes (accept rates, time reduction, 35.1-point accuracy gain) but supplies no information on evaluation protocol, baseline comparisons, dataset sizes, statistical tests, or potential confounds, so the data-to-claim link cannot be assessed.

    Authors: We agree that the abstract would be improved by briefly indicating the scale of the evaluation and the nature of the comparisons. In the revised version we will expand the abstract to note the number of images annotated, the VLM used for fine-tuning, the baseline model, and the evaluation split, while keeping the abstract concise. Full protocol details, including any statistical reporting, already appear in Section 4 and will be cross-referenced. revision: yes

  2. Referee: [Introduction / Schema definition] The central claim that the unified annotation atom schema is sufficient and lossless for downstream visual reasoning tasks is load-bearing yet unsupported by any ablation or comparison showing that alternative representations would produce inferior VLM performance.

    Authors: The empirical evidence for sufficiency is the 35.1-point accuracy improvement obtained when the synthesized tasks are derived from the atom schema. We did not, however, run an explicit ablation that replaces the atom schema with an alternative representation while holding all other factors fixed. In the revision we will add a short discussion subsection that contrasts the atom schema with common alternatives (e.g., separate bounding-box + caption pipelines) and explains the design rationale for unification; we will also note the absence of a controlled ablation as a limitation and outline how such an experiment could be conducted in future work. revision: partial

  3. Referee: [On-Policy Annotation Loop] The on-policy loop with BAV is presented as driving the reported accept rates, but the manuscript provides no implementation details, priors, or integration mechanics for BAV, preventing verification of how the loop produces the claimed efficiency gains.

    Authors: We acknowledge that the current manuscript text does not supply the concrete priors, update rules, or pseudocode for the Bayesian Annotation Verifier. The revision will include a new subsection (or expanded appendix) that specifies the prior distributions, the likelihood model, the decision threshold, and the exact integration points within the annotation loop. We will also add a short algorithmic description so that the efficiency claims can be reproduced. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces an annotation tool (ScreenAnnotator) with a unified atom schema, on-policy loop, and Bayesian Annotation Verifier, then reports measured empirical outcomes including accept rates (nearly 100% on flowcharts, 77% on GUI), per-image time reduction, and a 35.1-point accuracy gain after fine-tuning. These quantities are presented as experimental results from running the tool and training, not as quantities defined in terms of fitted parameters, self-referential definitions, or derivations that reduce to inputs by construction. No equations, uniqueness theorems, or self-citations appear in the text; the central claims rest on externally verifiable performance metrics rather than internal redefinitions or ansatzes smuggled via prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The design rests on domain assumptions about the completeness of the atom schema and the effectiveness of the on-policy Bayesian loop; no free parameters or invented physical entities are introduced.

axioms (2)
  • domain assumption Existing annotation tools suffer from limited expressiveness, severe annotation-training decoupling, and poor data reusability.
    This premise justifies the need for the new tool and is stated directly in the abstract.
  • domain assumption A single unified annotation atom schema can capture all required spatial, semantic, and structural primitives without loss for visual reasoning.
    Central design choice invoked when defining the schema.
invented entities (1)
  • Bayesian Annotation Verifier (BAV) no independent evidence
    purpose: To perform verification inside the on-policy annotation loop.
    New component introduced by the paper; no independent evidence outside the reported accept rates is provided.

pith-pipeline@v0.9.1-grok · 5789 in / 1320 out tokens · 31114 ms · 2026-06-26T21:42:34.765597+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    In2020 IEEE In- ternational Conference on Robotics and Automation (ICRA), pages 87–93, Paris, France

    Bayesod: A bayesian approach for uncertainty estimation in deep object detectors. In2020 IEEE In- ternational Conference on Robotics and Automation (ICRA), pages 87–93, Paris, France. IEEE. Elmar Haussmann, Michele Fenzi, Kashyap Chitta, Jan Ivanecky, Hanson Xu, Donna Roy, Akshita Mittel, Nicolas Koumchatzky, Clement Farabet, and Jose M. Alvarez. 2020. Sc...

  2. [2]

    InProceedings of the 8th International Conference on Learning Representa- tions, Addis Ababa, Ethiopia

    Dividemix: Learning with noisy labels as semi-supervised learning. InProceedings of the 8th International Conference on Learning Representa- tions, Addis Ababa, Ethiopia. Sheng Liu, Zhihui Zhu, Qing Qu, and Chong You

  3. [3]

    InProceedings of the 39th Inter- national Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 14153–14172

    Robust training under label noise by over- parameterization. InProceedings of the 39th Inter- national Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 14153–14172. PMLR. Ruijie Lu, Yiyang Ma, Xiaokang Chen, Lingxiao Luo, Zhiyu Wu, Zizheng Pan, Xingchao Liu, Yutong Lin, Hao Li, Wen Liu, Zhewen Hao, Xi Gao, Shaoh...

  4. [4]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824. Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B. Gupta, Xiaojiang Chen, and Xin Wang. 2022. A survey of deep active learn- ing.ACM Computing Surveys, 54(9):1–40. Bryan C Russell, Antonio Torralba, Kevin P Murphy, and William T Freeman....

  5. [5]

    Text” denotes the textual answer component and “Box

    Label Studio: Data labeling soft- ware. Open source software available from https://github.com/HumanSignal/label-studio. Xueying Zhan, Qingzhong Wang, Kuan-hao Huang, Haoyi Xiong, Dejing Dou, and Antoni B. Chan. 2022. A comparative survey of deep active learning.arXiv preprint arXiv:2203.13450. Yudian Zheng, Guoliang Li, Yuanbing Li, Caihua Shan, and Reyn...

  6. [6]

    Successor. Q: In this flowchart, what is the next step after {node.text }? A: The next step after {node.text} is <|ref|>{successors(node )[0].text}<|/ref|><|box|>{successors(node)[0].bbox}<|/ box|>

  7. [7]

    Q: In this flowchart, which step directly connects to {node

    Predecessor. Q: In this flowchart, which step directly connects to {node. text}? A: The step that directly connects to {node.text} is <|ref|>{ predecessors(node)[0].text}<|/ref|><|box|>{predecessors (node)[0].bbox}<|/box|>. 13

  8. [8]

    Count.This task type has four variants: node count, edge count, out-degree, and in-degree. Q: How many nodes are there in this flowchart? A: There are {len(nodes)} nodes in this flowchart <|ref|>{len (nodes)} nodes<|/ref|><|box|>[{nodes[0].bbox}, {nodes [1].bbox}, ..., {nodes[n].bbox}]<|/box|>. Q: How many connecting edges are there in this flowchart? A: ...

  9. [9]

    Q: Starting from {path[0].text}, which node is reached after two steps? A: Starting from <|ref|>{path[0].text}<|/ref|><|box|>{path[0]

    Two-hop. Q: Starting from {path[0].text}, which node is reached after two steps? A: Starting from <|ref|>{path[0].text}<|/ref|><|box|>{path[0]. bbox}<|/box|>, the first step reaches <|ref|>{path[1]. text}<|/ref|><|box|>{path[1].bbox}<|/box|>, and the second step reaches <|ref|>{path[2].text}<|/ref|><|box |>{path[2].bbox}<|/box|>

  10. [10]

    Q: List the complete path from {path[0].text} to {path[-1]

    Path. Q: List the complete path from {path[0].text} to {path[-1]. text}. A: The complete path is: <|ref|>{path[0].text}<|/ref|><|box |>{path[0].bbox}<|/box|> -> <|ref|>{path[1].text}<|/ref |><|box|>{path[1].bbox}<|/box|> -> ... -> <|ref|>{path [-1].text}<|/ref|><|box|>{path[-1].bbox}<|/box|>

  11. [11]

    Q: When the decision result of {edge.from_node.text} is {edge

    Condition. Q: When the decision result of {edge.from_node.text} is {edge. text}, where does the flow go? A: When the result of <|ref|>{edge.from_node.text}<|/ref|><| box|>{edge.from_node.bbox}<|/box|> is <|ref|>{edge.text }<|/ref|><|box|>{edge.bbox}<|/box|>, the flow goes to <|ref|>{edge.to_node.text}<|/ref|><|box|>{edge.to_node. bbox}<|/box|>

  12. [12]

    Q: Where is {node.text} located in the image? A: <|ref|>{node.text}<|/ref|><|box|>{node.bbox}<|/box|> is located in the {region(node.bbox)} region of the image

    Spatial. Q: Where is {node.text} located in the image? A: <|ref|>{node.text}<|/ref|><|box|>{node.bbox}<|/box|> is located in the {region(node.bbox)} region of the image

  13. [13]

    Relative. Q: In which direction is {first_node.text} relative to { second_node.text}? A: <|ref|>{first_node.text}<|/ref|><|box|>{first_node.bbox }<|/box|> is located to the {relative_pos(first_node. bbox, second_node.bbox)} of <|ref|>{second_node.text }<|/ref|><|box|>{second_node.bbox}<|/box|>

  14. [14]

    Region. Q: Which nodes are located in the {target_region} region of the image? A: The nodes located in the {target_region} region are: { node_1.text}, {node_2.text}, ..., {node_k.text} <|ref |>{target_region} region nodes<|/ref|><|box|>[{node_1. bbox}, {node_2.bbox}, ..., {node_k.bbox}]<|/box|>

  15. [15]

    Nearest. Q: Which node is spatially closest to {node.text}? A: The node spatially closest to <|ref|>{node.text}<|/ref|><| box|>{node.bbox}<|/box|> is <|ref|>{nearest_node.text }<|/ref|><|box|>{nearest_node.bbox}<|/box|>

  16. [16]

    Q: List all nodes in spatial order from top to bottom

    Sorting.This task has two variants: top-to- bottom and left-to-right. Q: List all nodes in spatial order from top to bottom. A: The top-to-bottom order of nodes is: <|ref|>{node_1.text }<|/ref|><|box|>{node_1.bbox}<|/box|> -> <|ref|>{node_2. text}<|/ref|><|box|>{node_2.bbox}<|/box|> -> ... -> <| ref|>{node_n.text}<|/ref|><|box|>{node_n.bbox}<|/box|>. Q: L...

  17. [17]

    Q: What are the coordinates of {node.text}? A: The coordinates of {node.text} are <|ref|>{node.text}<|/ ref|><|box|>{node.bbox}<|/box|>

    Coordinate.This task has two variants: grounding (text → box) and degrounding (box → text). Q: What are the coordinates of {node.text}? A: The coordinates of {node.text} are <|ref|>{node.text}<|/ ref|><|box|>{node.bbox}<|/box|>. Q: Which node is located in the region with coordinates {node. bbox}? A: The node located in this region is <|ref|>{node.text}<|...