VISTAQA is a new benchmark for joint visual question answering correctness and pixel-level grounding, evaluated with the GROVE metric that uses per-sample geometric mean to require both dimensions to succeed.
Visual reasoning tracer: Object-level grounded reasoning benchmark
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 3verdicts
UNVERDICTED 3roles
background 1polarities
background 1representative citing papers
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
Vision-OPD uses on-policy self-distillation from crop-conditioned to full-image policies within the same MLLM to close the regional-to-global perception gap.
citing papers explorer
-
VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence
VISTAQA is a new benchmark for joint visual question answering correctness and pixel-level grounding, evaluated with the GROVE metric that uses per-sample geometric mean to require both dimensions to succeed.
-
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
-
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
Vision-OPD uses on-policy self-distillation from crop-conditioned to full-image policies within the same MLLM to close the regional-to-global perception gap.