hub

arXiv preprint arXiv:2507.07998 , year=

Pyvision: Agentic vision with dynamic tooling , author= · 2025 · arXiv 2507.07998

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 dataset 1 method 1

citation-polarity summary

background 2 use dataset 1 use method 1

representative citing papers

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

DailyClue is a new benchmark that requires MLLMs to actively seek visual clues in authentic daily scenarios across four domains and 16 subtasks before performing reasoning.

GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

cs.CL · 2026-04-05 · unverdicted · novelty 7.0

GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.

Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection

cs.AI · 2025-12-18 · unverdicted · novelty 7.0

ForenAgent lets MLLMs create and iteratively improve low-level Python tools for image forgery detection via a two-stage training pipeline and a new 100k-image benchmark dataset.

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines including GPT-4o.

Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

cs.CV · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoning benchmarks.

DeepEyesV2: Toward Agentic Multimodal Model

cs.CV · 2025-11-07 · unverdicted · novelty 6.0

DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.

MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

cs.AI · 2025-10-28 · unverdicted · novelty 6.0

MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

cs.AI · 2025-09-02 · accept · novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

cs.IR · 2025-08-07 · unverdicted · novelty 6.0

WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

IndusAgent achieves state-of-the-art zero-shot performance on industrial anomaly benchmarks by using a custom Indus-CoT dataset, dynamic tool orchestration, and gated RL to optimize anomaly classification, localization, and reasoning.

Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models

cs.CV · 2026-05-15 · unverdicted · novelty 5.0

Generation-to-Understanding synergy lets multimodal models create self-generated visual edits as intermediate steps, improving performance on twelve benchmarks while revealing limits in task-aligned self-reflection.

Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

cs.CV · 2026-04-27 · unverdicted · novelty 5.0

PND mitigates object hallucination in vision-language models via dual-path contrastive decoding that boosts visual evidence and penalizes linguistic priors, yielding up to 6.5% gains on POPE, MME, and CHAIR benchmarks.

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

cs.CV · 2026-05-12 · 3 refs

citing papers explorer

Showing 15 of 15 citing papers.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations cs.CV · 2026-04-20 · unverdicted · none · ref 54
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios cs.CV · 2026-04-15 · unverdicted · none · ref 3
DailyClue is a new benchmark that requires MLLMs to actively seek visual clues in authentic daily scenarios across four domains and 16 subtasks before performing reasoning.
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces cs.CL · 2026-04-05 · unverdicted · none · ref 58
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection cs.AI · 2025-12-18 · unverdicted · none · ref 49
ForenAgent lets MLLMs create and iteratively improve low-level Python tools for image forgery detection via a two-stage training pipeline and a new 100k-image benchmark dataset.
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation cs.CV · 2026-05-15 · unverdicted · none · ref 49
VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines including GPT-4o.
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model cs.CV · 2026-05-12 · unverdicted · none · ref 62 · 2 links
SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoning benchmarks.
DeepEyesV2: Toward Agentic Multimodal Model cs.CV · 2025-11-07 · unverdicted · none · ref 62
DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
MGA: Memory-Driven GUI Agent for Observation-Centric Interaction cs.AI · 2025-10-28 · unverdicted · none · ref 45
MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey cs.AI · 2025-09-02 · accept · none · ref 251
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent cs.IR · 2025-08-07 · unverdicted · none · ref 29
WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.
IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools cs.CV · 2026-05-20 · unverdicted · none · ref 111
IndusAgent achieves state-of-the-art zero-shot performance on industrial anomaly benchmarks by using a custom Indus-CoT dataset, dynamic tool orchestration, and gated RL to optimize anomaly classification, localization, and reasoning.
Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models cs.CV · 2026-05-15 · unverdicted · none · ref 53
Generation-to-Understanding synergy lets multimodal models create self-generated visual edits as intermediate steps, improving performance on twelve benchmarks while revealing limits in task-aligned self-reflection.
Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation cs.CV · 2026-04-27 · unverdicted · none · ref 6
PND mitigates object hallucination in vision-language models via dual-path contrastive decoding that boosts visual evidence and penalizes linguistic priors, yielding up to 6.5% gains on POPE, MME, and CHAIR benchmarks.
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models cs.CV · 2026-04-09 · unverdicted · none · ref 55
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models cs.CV · 2026-05-12 · unreviewed · ref 29 · 3 links

arXiv preprint arXiv:2507.07998 , year=

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer