MTA-Agent generates 21K verified multi-hop vision-language QA trajectories via tool-augmented evidence synthesis, allowing a 32B open model to outperform GPT-5 and Gemini variants on complex multimodal benchmarks while increasing reasoning steps.
It is particularly useful for identifying specific objects, landmarks, or entities that are visually distinctive but difficult to describe in text alone.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
MTA-Agent: An Open Recipe for Multimodal Deep Search Agents