SADL: What to Ignore? A Benchmark for Subject-Aware Distractor Localization

Cao-Tri Nguyen; Minh-Triet Tran; Nguyen-Khoa Luong; Vinh-Tiep Nguyen

arxiv: 2606.30393 · v1 · pith:GKT2NBI3new · submitted 2026-06-29 · 💻 cs.CV

SADL: What to Ignore? A Benchmark for Subject-Aware Distractor Localization

Cao-Tri Nguyen , Nguyen-Khoa Luong , Vinh-Tiep Nguyen , Minh-Triet Tran This is my paper

Pith reviewed 2026-06-30 06:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords subject-aware distractor localizationvision-language modelsbenchmarkvisual distractorsobject removalphoto compositionmultimodal reasoning

0 comments

The pith

VLMs identify distractors well but over-apply exclusion and suppress true ones at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SADL, the first benchmark for subject-aware distractor localization in real photographs. It evaluates seven vision-language models on a pipeline that first classifies distractors and then applies exclusion filters based on five inclusion factors and three contextual rules. The models perform strongly at initial identification yet systematically remove too many essential objects when trying to respect subject context. A reader would care because current editing tools still require manual decisions about what to keep or remove, and context-blind removal can break scene meaning such as retaining a person while deleting their chair.

Core claim

VLMs are highly capable of identifying distractors, but then over-apply exclusion, which systematically suppresses true distractors at scale. The SADL benchmark supplies 1,800 subject-aware cases across 1,000 photographs together with 14,617 annotated candidates and 1,938 hard negatives to expose and measure this bottleneck through structured evaluation of classification followed by filtering.

What carries the argument

SADL benchmark that structures evaluation around five inclusion factors and three contextual exclusion rules applied after initial distractor classification.

If this is right

Context-agnostic removal can disrupt semantic coherence, for example by keeping a person but removing the chair they sit on.
A two-stage pipeline of distractor classification followed by exclusion filtering is required for subject-aware decisions.
The 14,617 annotated candidates enable systematic testing of future multimodal systems on subject-conditioned reasoning.
Existing saliency models and open-vocabulary detectors fail because they lack subject awareness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Better calibration of the exclusion stage could allow more automated yet context-respecting photo editing tools.
The over-exclusion pattern may recur in other tasks where models must selectively attend to a designated subject.
Fine-tuning VLMs on the SADL cases offers a direct test of whether the suppression rate can be reduced.

Load-bearing premise

The 1,938 hard negatives and the five inclusion factors plus three exclusion rules used in the benchmark accurately capture real-world subject-aware decisions without introducing annotation bias or incomplete coverage of edge cases.

What would settle it

Apply the same VLM pipeline to a fresh collection of 1,000 independently annotated photographs and check whether the measured rate of true-distractor suppression matches the rate observed on SADL.

Figures

Figures reproduced from arXiv: 2606.30393 by Cao-Tri Nguyen, Minh-Triet Tran, Nguyen-Khoa Luong, Vinh-Tiep Nguyen.

**Figure 2.** Figure 2: Dataset statistics of SADL. Panels show: (a) label distribution; (b) distractors per object; (c) images per main object; (d) objects per inclusion factor; (e) subject description word count; (f) UpSet plot of factor-combination outcomes; (g) exclusion-rule distribution; (h) distractor yield vs. triggered inclusion level; (i) scene category distribution; (j) top-10 factor combinations; (k) status-flip frequ… view at source ↗

read the original abstract

Photographs frequently contain \emph{visual distractors} besides foregrounds and backgrounds of the intended subject, competing for attention and weakening composition. While modern editing tools streamline object removal, identifying which objects to remove remains a mostly manual process. Existing saliency models and open-vocabulary detectors operate without subject awareness, failing to adapt to shifting user intent. Furthermore, context-agnostic removal may disrupt the scene's semantic coherence (e.g., keep the person but remove the chair they are sitting on). To address these limitations, we formalize the task of subject-aware distractor localization, which identifies distractors while retaining compositionally essential objects. This paper introduces \textsc{SADL}, the first real-world benchmark for this task, comprising 1,800 subject-aware cases across 1,000 photographs to enable systematic evaluation and facilitate future research. In total, there are 14,617 annotated candidates, including a robust set of 1,938 hard negatives to stress-test exclusion calibration. We evaluate seven proprietary and open-weight Vision-Language Models (VLMs) on a sequential pipeline of distractor classification followed by exclusion filtering, structured around five inclusion factors and three contextual exclusion rules. Our analysis reveals that VLMs are highly capable of identifying distractors, but then over-apply exclusion, which systematically suppresses true distractors at scale. By exposing this critical bottleneck, \textsc{SADL} provides a foundational diagnostic tool to advance subject-conditioned reasoning in multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SADL is the first benchmark for subject-aware distractor localization and shows VLMs over-exclude after initial detection.

read the letter

The main takeaway is that this paper creates SADL, a benchmark with 1,800 cases from 1,000 photos and 1,938 hard negatives, then runs VLMs through a classification-plus-exclusion pipeline using five inclusion factors and three rules. The key observation is that the models spot distractors reasonably but apply exclusion too broadly, suppressing valid ones at scale.

What is new is the explicit subject-aware framing. Earlier saliency and open-vocab detection work treats everything the same regardless of user intent or scene coherence. The dataset construction and the two-stage evaluation setup give a concrete way to measure that difference.

The work is useful for exposing a practical bottleneck in current VLMs for editing or composition tasks. The hard-negative set is a sensible choice for testing calibration.

The soft spot is the annotation side. The abstract states the rules and numbers but supplies little on how the 14,617 candidates were labeled, what inter-annotator agreement looked like, or how the inclusion/exclusion criteria were validated against real user decisions. If those details are thin in the full paper, the ground truth reliability stays hard to judge.

This is for people working on image editing tools, context-aware detection, or VLM evaluation in CV. Anyone building or testing models that need to respect subject and composition will find the diagnostic useful.

It deserves a serious referee. The benchmark is new, the evaluation points to a repeatable issue, and the resource itself can support follow-up work even if the rules need tightening later.

Referee Report

2 major / 0 minor

Summary. The paper introduces SADL, the first real-world benchmark for subject-aware distractor localization, with 1,800 cases across 1,000 photographs, 14,617 annotated candidates, and 1,938 hard negatives. It evaluates seven VLMs on a sequential pipeline of distractor classification followed by exclusion filtering using five inclusion factors and three contextual exclusion rules, finding that VLMs identify distractors capably but systematically over-apply exclusion and suppress true distractors.

Significance. If the benchmark rules and annotations prove reliable, the work supplies a diagnostic benchmark that isolates over-exclusion as a concrete failure mode in subject-conditioned VLM reasoning; the provision of hard negatives for stress-testing is a constructive design choice that strengthens the evaluation.

major comments (2)

[Abstract] Abstract: the central evaluation results rest on the five inclusion factors and three exclusion rules, yet no inter-annotator agreement, annotation protocol, or external validation of these rules is reported; without this information the claim that VLMs 'over-apply exclusion' cannot be assessed for robustness against annotation bias.
[Benchmark description] Benchmark description: the assertion that the 1,938 hard negatives 'stress-test exclusion calibration' is load-bearing for the main finding, but the manuscript supplies no analysis of edge-case coverage or agreement between the rule set and real-world subject-aware decisions, leaving the benchmark's fidelity unverified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater transparency around the annotation process and benchmark validation. We address each point below and will revise the manuscript accordingly to strengthen the presentation of the SADL benchmark.

read point-by-point responses

Referee: [Abstract] Abstract: the central evaluation results rest on the five inclusion factors and three exclusion rules, yet no inter-annotator agreement, annotation protocol, or external validation of these rules is reported; without this information the claim that VLMs 'over-apply exclusion' cannot be assessed for robustness against annotation bias.

Authors: We agree that explicit reporting of the annotation protocol, inter-annotator agreement, and validation steps is necessary to support claims about over-exclusion. The rules were developed through iterative expert review by the authors and domain specialists in photography and visual composition, but these details were omitted from the initial submission. In the revised manuscript we will add a dedicated subsection under Benchmark Construction that describes the full annotation protocol, reports inter-annotator agreement statistics (e.g., Cohen’s kappa on rule application), and provides external validation via a small-scale user study comparing rule-based decisions to independent photographer judgments. revision: yes
Referee: [Benchmark description] Benchmark description: the assertion that the 1,938 hard negatives 'stress-test exclusion calibration' is load-bearing for the main finding, but the manuscript supplies no analysis of edge-case coverage or agreement between the rule set and real-world subject-aware decisions, leaving the benchmark's fidelity unverified.

Authors: We acknowledge that the manuscript currently lacks quantitative or qualitative analysis of edge-case coverage and real-world alignment for the hard-negative set. While the hard negatives were constructed by applying the same five inclusion factors and three exclusion rules to candidate objects that satisfy distractor criteria yet must be retained, we did not include coverage statistics or agreement metrics. In revision we will expand the Benchmark Description section with (i) a breakdown of edge-case categories covered by the 1,938 hard negatives, (ii) qualitative examples illustrating boundary cases, and (iii) a small-scale agreement study measuring consistency between the rule set and independent subject-aware decisions by photographers. This will directly address concerns about benchmark fidelity. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces an empirical benchmark (SADL) for subject-aware distractor localization, consisting of 1,800 cases, 14,617 annotations, and 1,938 hard negatives, along with a VLM evaluation pipeline using five inclusion factors and three exclusion rules. No mathematical derivations, equations, fitted parameters, predictions, or self-citation chains appear in the provided text. The central claims rest on benchmark construction and empirical model evaluations, which are self-contained design choices without reduction to inputs by definition or construction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper. No free parameters, axioms, or invented entities are introduced; the contribution rests on dataset construction and rule-based evaluation.

pith-pipeline@v0.9.1-grok · 5806 in / 1021 out tokens · 21804 ms · 2026-06-30T06:06:08.558530+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 5 canonical work pages · 3 internal anchors

[1]

Bang An, Sicheng Zhu, Michael-Andrei Panaitescu-Liess, Chaithanya Kumar Mummadi, and Furong Huang. 2024. PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts. InInternational Conference on Learning Representations (ICLR)

2024
[2]

Juyeon Bae et al. 2025. iDIS: Irrelevant Distractors in Image-Based Understanding. arXiv preprint arXiv:2501.XXXXX(2025)

2025
[3]

Ali Borji and Laurent Itti. 2013. State-of-the-Art in Visual Attention Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence35, 1 (2013), 185– 207

2013
[4]

Tim Brooks, Aleksander Holynski, and Alexei A. Efros. 2023. InstructPix2Pix: Learning to Follow Image Editing Instructions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2023
[5]

Jacob Cohen. 1960. A Coefficient of Agreement for Nominal Scales.Educational and Psychological Measurement20, 1 (1960), 37–46

1960
[6]

Matt Deitke et al. 2025. Molmo2: Open, State-of-the-Art Multimodal Models. arXiv preprint arXiv:2501.XXXXX(2025)

2025
[7]

Hang Deng et al . 2025. Words vs. Visuals: Impact of Noise on Multimodal Reasoning.arXiv preprint arXiv:2503.XXXXX(2025)

2025
[8]

Goldman, and Aaron Hertzmann

Ohad Fried, Eli Shechtman, Dan B. Goldman, and Aaron Hertzmann. 2015. Find- ing Distractors in Images. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1703–1712

2015
[9]

Google DeepMind. 2025. Gemini 3: A Family of Highly Capable Multimodal Models.Technical Report(2025)

2025
[10]

Tri Huynh et al. 2023. SimpSON: Simplifying Photo Cleanup with Single-Click Distractors Removal Using Vision Transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2023
[11]

Laurent Itti and Christof Koch. 2001. Computational Modelling of Visual Atten- tion.Nature Reviews Neuroscience2, 3 (2001), 194–203

2001
[12]

Laurent Itti, Christof Koch, and Ernst Niebur. 1998. A Model of Saliency-Based Visual Attention for Rapid Scene Analysis.IEEE Transactions on Pattern Analysis and Machine Intelligence20, 11 (1998), 1254–1259

1998
[13]

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. ReferItGame: Referring to Objects in Photographs of Natural Scenes. InProceed- ings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2014
[14]

Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. 2023. Segment Anything. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4015–4026

2023
[15]

2018.Content Analysis: An Introduction to Its Methodology (4th ed.)

Klaus Krippendorff. 2018.Content Analysis: An Introduction to Its Methodology (4th ed.). Sage Publications

2018
[16]

Shamma, Michael S

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations.International Journal of Computer Vision123, 1 (2017), 32–73

2017
[17]

Rajesh Kumain et al . 2025. Deep Visual Saliency: A Survey.arXiv preprint arXiv:2501.XXXXX(2025)

2025
[18]

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. 2020. The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale.International Journal of Computer Visio...

2020
[19]

Lawrence Zitnick

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. InEuropean Conference on Computer Vision (ECCV). Springer, 740–755

2014
[20]

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chun- yuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. 2024. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In European Conference on Computer Vision (ECCV)

2024
[21]

Zhihao Ma et al. 2025. Caution: Visual Noise May Harm Your Reasoning Agent. arXiv preprint arXiv:2502.XXXXX(2025)

2025
[22]

Meta AI. 2025. Llama-4: The Llama 4 Herd.Technical Report(2025)

2025
[23]

Mistral AI. 2024. Pixtral 12B.arXiv preprint arXiv:2410.07073(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

OpenAI. 2024. GPT-4o System Card. https://openai.com/index/gpt-4o-system- card.Technical Report(2024)

2024
[25]

OpenAI. 2025. GPT-4.1 Technical Report.Technical Report(2025)

2025
[26]

Za- iane, and Martin Jagersand

Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R. Za- iane, and Martin Jagersand. 2020. U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection. InPattern Recognition, Vol. 106. Elsevier

2020
[27]

Qwen Team and Alibaba Cloud. 2025. Qwen3-VL: Think Before You See: Enabling Comprehensive Thinking for Multimodal Large Language Models.arXiv preprint arXiv:2505.XXXXX(2025)

2025
[28]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. InProceedings of the International Confer- ence on Machine Learning (ICML)

2021
[29]

Nikhila Ravi et al . 2024. SAM 3: Segment Anything Model 3.arXiv preprint arXiv:2412.XXXXX(2024)

2024
[30]

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. 2024. SAM 2: Segment Anything in Images and Videos.arXiv preprint arXiv:24...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Yanpeng Sun, Shan Zhang, Wei Tang, Aotian Chen, Piotr Koniusz, Kai Zou, Yuan Xue, and Anton van den Hengel. 2025. Math Blind: Failures in Diagram Understanding Undermine Reasoning in MLLMs.arXiv preprint arXiv:2503.20745 (2025). Also known as MATHGLANCE

work page arXiv 2025
[32]

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. 2022. Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2022
[33]

Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. 2024. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:2311.06242

work page arXiv 2024
[34]

Cheng Yang, Chufan Shi, Bo Shui, Yaokang Wu, Muzi Tao, Huijuan Wang, Ivan Yee Lee, Yong Liu, Xuezhe Ma, and Taylor Berg-Kirkpatrick. 2026. UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models.arXiv preprint arXiv:2602.08336(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

Berg, and Tamara L

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg
[36]

InEuropean Conference on Computer Vision (ECCV)

Modeling Context in Referring Expressions. InEuropean Conference on Computer Vision (ECCV)
[37]

Bowen Zhang, Liqing Zhang, et al. 2018. Detecting and Removing Visual Dis- tractors for Video Aesthetic Enhancement. InIEEE Transactions on Multimedia. 7

2018

[1] [1]

Bang An, Sicheng Zhu, Michael-Andrei Panaitescu-Liess, Chaithanya Kumar Mummadi, and Furong Huang. 2024. PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts. InInternational Conference on Learning Representations (ICLR)

2024

[2] [2]

Juyeon Bae et al. 2025. iDIS: Irrelevant Distractors in Image-Based Understanding. arXiv preprint arXiv:2501.XXXXX(2025)

2025

[3] [3]

Ali Borji and Laurent Itti. 2013. State-of-the-Art in Visual Attention Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence35, 1 (2013), 185– 207

2013

[4] [4]

Tim Brooks, Aleksander Holynski, and Alexei A. Efros. 2023. InstructPix2Pix: Learning to Follow Image Editing Instructions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2023

[5] [5]

Jacob Cohen. 1960. A Coefficient of Agreement for Nominal Scales.Educational and Psychological Measurement20, 1 (1960), 37–46

1960

[6] [6]

Matt Deitke et al. 2025. Molmo2: Open, State-of-the-Art Multimodal Models. arXiv preprint arXiv:2501.XXXXX(2025)

2025

[7] [7]

Hang Deng et al . 2025. Words vs. Visuals: Impact of Noise on Multimodal Reasoning.arXiv preprint arXiv:2503.XXXXX(2025)

2025

[8] [8]

Goldman, and Aaron Hertzmann

Ohad Fried, Eli Shechtman, Dan B. Goldman, and Aaron Hertzmann. 2015. Find- ing Distractors in Images. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1703–1712

2015

[9] [9]

Google DeepMind. 2025. Gemini 3: A Family of Highly Capable Multimodal Models.Technical Report(2025)

2025

[10] [10]

Tri Huynh et al. 2023. SimpSON: Simplifying Photo Cleanup with Single-Click Distractors Removal Using Vision Transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2023

[11] [11]

Laurent Itti and Christof Koch. 2001. Computational Modelling of Visual Atten- tion.Nature Reviews Neuroscience2, 3 (2001), 194–203

2001

[12] [12]

Laurent Itti, Christof Koch, and Ernst Niebur. 1998. A Model of Saliency-Based Visual Attention for Rapid Scene Analysis.IEEE Transactions on Pattern Analysis and Machine Intelligence20, 11 (1998), 1254–1259

1998

[13] [13]

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. ReferItGame: Referring to Objects in Photographs of Natural Scenes. InProceed- ings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2014

[14] [14]

Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. 2023. Segment Anything. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4015–4026

2023

[15] [15]

2018.Content Analysis: An Introduction to Its Methodology (4th ed.)

Klaus Krippendorff. 2018.Content Analysis: An Introduction to Its Methodology (4th ed.). Sage Publications

2018

[16] [16]

Shamma, Michael S

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations.International Journal of Computer Vision123, 1 (2017), 32–73

2017

[17] [17]

Rajesh Kumain et al . 2025. Deep Visual Saliency: A Survey.arXiv preprint arXiv:2501.XXXXX(2025)

2025

[18] [18]

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. 2020. The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale.International Journal of Computer Visio...

2020

[19] [19]

Lawrence Zitnick

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. InEuropean Conference on Computer Vision (ECCV). Springer, 740–755

2014

[20] [20]

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chun- yuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. 2024. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. In European Conference on Computer Vision (ECCV)

2024

[21] [21]

Zhihao Ma et al. 2025. Caution: Visual Noise May Harm Your Reasoning Agent. arXiv preprint arXiv:2502.XXXXX(2025)

2025

[22] [22]

Meta AI. 2025. Llama-4: The Llama 4 Herd.Technical Report(2025)

2025

[23] [23]

Mistral AI. 2024. Pixtral 12B.arXiv preprint arXiv:2410.07073(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

OpenAI. 2024. GPT-4o System Card. https://openai.com/index/gpt-4o-system- card.Technical Report(2024)

2024

[25] [25]

OpenAI. 2025. GPT-4.1 Technical Report.Technical Report(2025)

2025

[26] [26]

Za- iane, and Martin Jagersand

Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R. Za- iane, and Martin Jagersand. 2020. U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection. InPattern Recognition, Vol. 106. Elsevier

2020

[27] [27]

Qwen Team and Alibaba Cloud. 2025. Qwen3-VL: Think Before You See: Enabling Comprehensive Thinking for Multimodal Large Language Models.arXiv preprint arXiv:2505.XXXXX(2025)

2025

[28] [28]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. InProceedings of the International Confer- ence on Machine Learning (ICML)

2021

[29] [29]

Nikhila Ravi et al . 2024. SAM 3: Segment Anything Model 3.arXiv preprint arXiv:2412.XXXXX(2024)

2024

[30] [30]

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. 2024. SAM 2: Segment Anything in Images and Videos.arXiv preprint arXiv:24...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

Yanpeng Sun, Shan Zhang, Wei Tang, Aotian Chen, Piotr Koniusz, Kai Zou, Yuan Xue, and Anton van den Hengel. 2025. Math Blind: Failures in Diagram Understanding Undermine Reasoning in MLLMs.arXiv preprint arXiv:2503.20745 (2025). Also known as MATHGLANCE

work page arXiv 2025

[32] [32]

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. 2022. Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

2022

[33] [33]

Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. 2024. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:2311.06242

work page arXiv 2024

[34] [34]

Cheng Yang, Chufan Shi, Bo Shui, Yaokang Wu, Muzi Tao, Huijuan Wang, Ivan Yee Lee, Yong Liu, Xuezhe Ma, and Taylor Berg-Kirkpatrick. 2026. UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models.arXiv preprint arXiv:2602.08336(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[35] [35]

Berg, and Tamara L

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg

[36] [36]

InEuropean Conference on Computer Vision (ECCV)

Modeling Context in Referring Expressions. InEuropean Conference on Computer Vision (ECCV)

[37] [37]

Bowen Zhang, Liqing Zhang, et al. 2018. Detecting and Removing Visual Dis- tractors for Video Aesthetic Enhancement. InIEEE Transactions on Multimedia. 7

2018