Visual program distillation: Distilling tools and programmatic reasoning into vision-language models, 2024

2023 · arXiv 2312.03052

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Grounded Reinforcement Learning for Visual Reasoning

cs.CV · 2025-05-29 · unverdicted · novelty 6.0

ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

Toxic Memes: A Survey of Computational Perspectives on the Detection and Explanation of Meme Toxicities

cs.CL · 2024-06-11 · accept · novelty 6.0

A PRISMA-based survey of 158 computational works on toxic meme detection introduces a new toxicity taxonomy and a framework linking target, intent, and conveyance tactics while noting trends in LLMs and cross-modal methods.

BLINK: Multimodal Large Language Models Can See but Not Perceive

cs.CV · 2024-04-18 · accept · novelty 6.0

BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

IndusAgent achieves state-of-the-art zero-shot performance on industrial anomaly benchmarks by using a custom Indus-CoT dataset, dynamic tool orchestration, and gated RL to optimize anomaly classification, localization, and reasoning.

citing papers explorer

Showing 4 of 4 citing papers.

Grounded Reinforcement Learning for Visual Reasoning
cs.CV · 2025-05-29 · unverdicted · none · ref 27

Toxic Memes: A Survey of Computational Perspectives on the Detection and Explanation of Meme Toxicities
cs.CL · 2024-06-11 · accept · none · ref 84

BLINK: Multimodal Large Language Models Can See but Not Perceive
cs.CV · 2024-04-18 · accept · none · ref 40

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools
cs.CV · 2026-05-20 · unverdicted · none · ref 38
