SHED improves domain generalization in CLIP by aligning style-homogenized embeddings instead of raw ones, achieving state-of-the-art results on five benchmarks including a 4% gain on DomainNet.
International Conference on Machine Learning , pages=
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Training on automatically generated hard negative captions improves vision-language models' zero-shot detection of fine-grained image-text mismatches and robustness to noisy inputs.
Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency in 360° environments.
A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.
Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
citing papers explorer
-
SHED: Style-Homogenized Embedding Alignment for Domain Generalization
SHED improves domain generalization in CLIP by aligning style-homogenized embeddings instead of raw ones, achieving state-of-the-art results on five benchmarks including a 4% gain on DomainNet.
-
HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities
Training on automatically generated hard negative captions improves vision-language models' zero-shot detection of fine-grained image-text mismatches and robustness to noisy inputs.
-
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency in 360° environments.
-
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.
-
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.
-
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
- Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning