super hub Mixed citations

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Alexey Gritsenko, Ibrahim Alabdulmohsin, Michael Tschannen, Muhammad Ferjad Naeem, Nikhil Parthasarathy, Xiao Wang · 2025 · cs.CV · arXiv 2502.14786

Mixed citation behavior. Most common role is background (57%).

205 Pith papers citing it

Background 57% of classified citations

open full Pith review browse 205 citing papers more from Alexey Gritsenko arXiv PDF

abstract

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 40 method 23 baseline 3 dataset 1

citation-polarity summary

background 38 use method 23 baseline 3 unclear 2 use dataset 1

claims ledger

abstract We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and trans

authors

Alexey Gritsenko Ibrahim Alabdulmohsin Michael Tschannen Muhammad Ferjad Naeem Nikhil Parthasarathy Xiao Wang

co-cited works

representative citing papers

Is Dimensionality a Barrier for Retrieval Models?

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

Representation Fr\'echet Loss for Visual Generation

cs.CV · 2026-04-30 · unverdicted · novelty 8.0

Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-representation metric FDr^k.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.

DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

DecQ uses detail-condensing queries on shallow and deep VFM features to improve both reconstruction PSNR and generative convergence/FID in RAEs without fine-tuning the encoder.

Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding

cs.CV · 2026-05-19 · conditional · novelty 7.0

Injecting pre-computed layout priors from RT-DETR into VLM prompts raises markdown F1 from 0.37 to 0.92 on a 10k-page OOD benchmark and cuts infinite-loop failures across domains.

Vision Harnessing Agent for Open Ad-hoc Segmentation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a training-free surrogate framework that outperforms baselines.

VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

VIP evolves text prompts using visual cues and saliency-aware aggregation inside dino.txt to deliver 1.4-8.4% higher mIoU on dense vision-language tasks with low overhead.

Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

cs.RO · 2026-05-12 · conditional · novelty 7.0

A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

Attention Transfer Is Not Universally Effective for Vision Transformers

cs.CV · 2026-05-08 · accept · novelty 7.0

Attention transfer from ViT teachers succeeds for only 7 of 11 families and fails for the rest because of architectural mismatch between teacher and student.

Attributions All the Way Down? The Metagame of Interpretability

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Defines meta-attributions as directional second-order Shapley values on attribution methods, proves hierarchical decomposition of attributions, and demonstrates applications in language models, vision-language encoders, and diffusion transformers.

Posterior Augmented Flow Matching

cs.CV · 2026-05-01 · unverdicted · novelty 7.0

PAFM augments flow matching with an importance-sampled mixture over an approximate posterior of target completions, yielding an unbiased lower-variance estimator that improves FID by up to 3.4 on ImageNet and CC12M.

Differentially Private Contrastive Learning via Bounding Group-level Contribution

cs.CR · 2026-04-29 · unverdicted · novelty 7.0

DP-GCL improves differentially private contrastive learning by bounding group-level contributions through batch partitioning and intra-group augmentation, delivering 5.6% higher image classification accuracy and 20.1% higher retrieval accuracy than existing approaches.

GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution

cs.CV · 2026-04-28 · unverdicted · novelty 7.0

GramSR uses DINOv3 visual features instead of text captions to condition a one-step diffusion model for super-resolution via sequential pixel, semantic, and texture LoRA modules.

StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

cs.GR · 2026-04-23 · unverdicted · novelty 7.0

StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.

RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

RSRCC is a new 126k-question benchmark for fine-grained remote sensing change question-answering, constructed via a hierarchical semi-supervised pipeline with retrieval-augmented Best-of-N ranking.

Evaluating Remote Sensing Image Captions Beyond Metric Biases

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA performance.

citing papers explorer

Showing 50 of 205 citing papers.

Is Dimensionality a Barrier for Retrieval Models? cs.LG · 2026-05-22 · unverdicted · none · ref 148 · internal anchor
Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.
On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models cs.CR · 2026-05-10 · conditional · none · ref 43 · internal anchor
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
Representation Fr\'echet Loss for Visual Generation cs.CV · 2026-04-30 · unverdicted · none · ref 48 · internal anchor
Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-representation metric FDr^k.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 139 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding cs.CV · 2026-01-01 · unverdicted · none · ref 15 · internal anchor
S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.
ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors cs.CV · 2025-12-09 · unverdicted · none · ref 47 · internal anchor
ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval cs.CV · 2026-05-22 · unverdicted · none · ref 17 · internal anchor
ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.
DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving cs.CV · 2026-05-22 · unverdicted · none · ref 85 · internal anchor
DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.
DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders cs.CV · 2026-05-21 · unverdicted · none · ref 6 · internal anchor
DecQ uses detail-condensing queries on shallow and deep VFM features to improve both reconstruction PSNR and generative convergence/FID in RAEs without fine-tuning the encoder.
Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding cs.CV · 2026-05-19 · conditional · none · ref 30 · internal anchor
Injecting pre-computed layout priors from RT-DETR into VLM prompts raises markdown F1 from 0.37 to 0.92 on a 10k-page OOD benchmark and cuts infinite-loop failures across domains.
Vision Harnessing Agent for Open Ad-hoc Segmentation cs.CV · 2026-05-19 · unverdicted · none · ref 19 · internal anchor
VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding cs.CV · 2026-05-13 · unverdicted · none · ref 44 · internal anchor
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation cs.CL · 2026-05-13 · unverdicted · none · ref 10 · internal anchor
Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a training-free surrogate framework that outperforms baselines.
VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference cs.CV · 2026-05-12 · unverdicted · none · ref 7 · 2 links · internal anchor
VIP evolves text prompts using visual cues and saliency-aware aggregation inside dino.txt to deliver 1.4-8.4% higher mIoU on dense vision-language tasks with low overhead.
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation cs.RO · 2026-05-12 · conditional · none · ref 31 · internal anchor
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models cs.AI · 2026-05-11 · unverdicted · none · ref 38 · internal anchor
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
Attention Transfer Is Not Universally Effective for Vision Transformers cs.CV · 2026-05-08 · accept · none · ref 15 · internal anchor
Attention transfer from ViT teachers succeeds for only 7 of 11 families and fails for the rest because of architectural mismatch between teacher and student.
Attributions All the Way Down? The Metagame of Interpretability cs.LG · 2026-05-07 · unverdicted · none · ref 5 · internal anchor
Defines meta-attributions as directional second-order Shapley values on attribution methods, proves hierarchical decomposition of attributions, and demonstrates applications in language models, vision-language encoders, and diffusion transformers.
Posterior Augmented Flow Matching cs.CV · 2026-05-01 · unverdicted · none · ref 33 · internal anchor
PAFM augments flow matching with an importance-sampled mixture over an approximate posterior of target completions, yielding an unbiased lower-variance estimator that improves FID by up to 3.4 on ImageNet and CC12M.
Differentially Private Contrastive Learning via Bounding Group-level Contribution cs.CR · 2026-04-29 · unverdicted · none · ref 49 · internal anchor
DP-GCL improves differentially private contrastive learning by bounding group-level contributions through batch partitioning and intra-group augmentation, delivering 5.6% higher image classification accuracy and 20.1% higher retrieval accuracy than existing approaches.
GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution cs.CV · 2026-04-28 · unverdicted · none · ref 31 · internal anchor
GramSR uses DINOv3 visual features instead of text captions to condition a one-step diffusion model for super-resolution via sequential pixel, semantic, and texture LoRA modules.
StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition cs.GR · 2026-04-23 · unverdicted · none · ref 19 · internal anchor
StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking cs.CV · 2026-04-22 · unverdicted · none · ref 33 · internal anchor
RSRCC is a new 126k-question benchmark for fine-grained remote sensing change question-answering, constructed via a hierarchical semi-supervised pipeline with retrieval-augmented Best-of-N ranking.
Evaluating Remote Sensing Image Captions Beyond Metric Biases cs.CV · 2026-04-22 · unverdicted · none · ref 50 · internal anchor
Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA performance.
Hybrid Latent Reasoning with Decoupled Policy Optimization cs.CV · 2026-04-22 · unverdicted · none · ref 30 · internal anchor
HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
Coevolving Representations in Joint Image-Feature Diffusion cs.CV · 2026-04-19 · unverdicted · none · ref 43 · internal anchor
CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample quality than fixed-representation baselines.
Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes cs.CV · 2026-04-16 · unverdicted · none · ref 34 · internal anchor
Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models cs.CV · 2026-04-15 · unverdicted · none · ref 48 · internal anchor
Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.
UNIGEOCLIP: Unified Geospatial Contrastive Learning cs.CV · 2026-04-13 · unverdicted · none · ref 35 · internal anchor
UNIGEOCLIP creates a unified embedding for aerial imagery, street views, elevation, text, and coordinates via all-to-all contrastive alignment plus a scaled lat-long encoder, outperforming single-modality and coordinate baselines on geospatial tasks.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 62 · internal anchor
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
Bottleneck Tokens for Unified Multimodal Retrieval cs.LG · 2026-04-13 · unverdicted · none · ref 20 · internal anchor
Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
RewardFlow: Generate Images by Optimizing What You Reward cs.CV · 2026-04-09 · unverdicted · none · ref 40 · internal anchor
RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding cs.CV · 2026-04-09 · unverdicted · none · ref 52 · internal anchor
InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
Show Me the Infographic I Imagine: Intent-Aware Infographic Retrieval for Authoring Support cs.IR · 2026-04-09 · unverdicted · none · ref 35 · internal anchor
Presents a new retrieval system that enriches user queries with an intent taxonomy to improve matching of natural language descriptions to infographic designs and support authoring.
Personalizing Text-to-Image Generation to Individual Taste cs.CV · 2026-04-08 · unverdicted · none · ref 52 · internal anchor
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens cs.CV · 2026-04-06 · conditional · none · ref 68 · internal anchor
Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.
No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models cs.CV · 2026-03-26 · unverdicted · none · ref 30 · internal anchor
Concept-centric short captions and cross-modal attention pooling yield SOTA compositionality in contrastive V&L models without degrading zero-shot or retrieval performance.
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models cs.CV · 2026-01-07 · unverdicted · none · ref 58 · 2 links · internal anchor
LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.
MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding cs.CV · 2025-12-19 · conditional · none · ref 81 · internal anchor
MMLandmarks supplies 197k aerial and 329k ground images plus text and GPS for 18,557 landmarks to benchmark multimodal geo-spatial understanding.
MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors cs.CV · 2025-12-17 · unverdicted · none · ref 46 · internal anchor
MoonSeg3R is the first method for online monocular 3D instance segmentation, achieving performance competitive with RGB-D systems by using CUT3R priors for geometric consistency and temporal query memory.
SoccerMaster: A Vision Foundation Model for Soccer Understanding cs.CV · 2025-12-11 · unverdicted · none · ref 63 · internal anchor
SoccerMaster is the first soccer-specific vision foundation model that unifies tasks from player detection to event classification via multi-task pretraining and outperforms task-specific models on downstream evaluations.
PowerCLIP: Powerset Alignment for Contrastive Pre-Training cs.CV · 2025-11-28 · conditional · none · ref 56 · internal anchor
PowerCLIP improves CLIP-style models by exhaustively aligning powersets of image regions to textual parse trees via efficient non-linear aggregators that approximate the full combinatorial loss.
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds cs.CV · 2025-11-23 · unverdicted · none · ref 85 · internal anchor
TRANSPORTER generates videos from VLM logits using optimal transport to interpret model predictions on object attributes, actions, and scenes.
CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab? cs.CV · 2025-10-01 · unverdicted · none · ref 8 · internal anchor
CardioBench is a new public benchmark that standardizes eight echocardiography datasets into four regression and five classification tasks to evaluate foundation model generalization.
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning cs.CV · 2025-07-08 · conditional · none · ref 39 · internal anchor
MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models cs.CV · 2025-06-10 · unverdicted · none · ref 89 · internal anchor
AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping cs.CV · 2025-05-19 · unverdicted · none · ref 36 · internal anchor
A contrastive multimodal framework augments satellite-audio datasets with vision-language model sound descriptions to learn shared soundscape concepts for zero-shot retrieval and synthesis.
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding cs.CV · 2025-04-14 · unverdicted · none · ref 62 · internal anchor
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
Cambrian-P: Pose-Grounded Video Understanding cs.CV · 2026-05-21 · unverdicted · none · ref 93 · internal anchor
Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
Proxy-Based Approximation of Shapley and Banzhaf Interactions cs.LG · 2026-05-21 · unverdicted · none · ref 62 · 2 links · internal anchor
ProxySHAP approximates higher-order Shapley and Banzhaf interactions via tree proxies plus residual correction and a polynomial-time interventional TreeSHAP generalization for tree ensembles.

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer