International Conference on Machine Learning

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation · 2022

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

SHED: Style-Homogenized Embedding Alignment for Domain Generalization

cs.CV · 2026-05-16 · conditional · novelty 7.0

SHED improves domain generalization in CLIP by aligning style-homogenized embeddings instead of raw ones, achieving state-of-the-art results on five benchmarks including a 4% gain on DomainNet.

HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

Training on automatically generated hard negative captions improves vision-language models' zero-shot detection of fine-grained image-text mismatches and robustness to noisy inputs.

Beyond Thinking: Imagining in 360° for Humanoid Visual Search

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency in 360° environments.

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

cs.CL · 2026-05-02 · unverdicted · novelty 6.0 · 2 refs

A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

cs.CV · 2024-03-05 · conditional · novelty 6.0

Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

cs.CV · 2023-11-16 · unverdicted · novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

cs.CV · 2026-04-23 · unverdicted · novelty 5.0

VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

cs.LG · 2024-02-18 · unverdicted · novelty 5.0

POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

cs.CL · 2026-05-19

citing papers explorer

Showing 9 of 9 citing papers.

SHED: Style-Homogenized Embedding Alignment for Domain Generalization

cs.CV · 2026-05-16 · conditional · none · ref 32

SHED improves domain generalization in CLIP by aligning style-homogenized embeddings instead of raw ones, achieving state-of-the-art results on five benchmarks including a 4% gain on DomainNet.

HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities

cs.CL · 2026-05-06 · unverdicted · none · ref 49

Training on automatically generated hard negative captions improves vision-language models' zero-shot detection of fine-grained image-text mismatches and robustness to noisy inputs.

Beyond Thinking: Imagining in 360° for Humanoid Visual Search

cs.CV · 2026-05-09 · unverdicted · none · ref 27

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency in 360° environments.

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

cs.CL · 2026-05-02 · unverdicted · none · ref 46 · 2 links

A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

cs.CV · 2024-03-05 · conditional · none · ref 24

Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

cs.CV · 2023-11-16 · unverdicted · none · ref 86

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

cs.CV · 2026-04-23 · unverdicted · none · ref 26

VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

cs.LG · 2024-02-18 · unverdicted · none · ref 135

POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

cs.CL · 2026-05-19 · unreviewed · ref 64
