SAM 2: Segment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalya

10 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.

IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation

cs.CV · 2026-01-06 · conditional · novelty 7.0

IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.

MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition

cs.CV · 2025-12-08 · unverdicted · novelty 7.0

MICo-150K is a new 150K-image dataset with 7 tasks, a De&Re real-image subset, MICo-Bench, and Weighted-Ref-VIEScore metric that improves AI models for generating consistent composites from arbitrary numbers of reference images.

Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation

cs.RO · 2025-12-04 · unverdicted · novelty 7.0

Hoi! is a new multimodal dataset of force-grounded articulated object manipulations with cross-view video and tactile sensing from human hands and robotic grippers.

RiGS: Rigid-aware 4D Gaussian Splatting from a Single Monocular Video

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

RiGS decomposes scenes into static, rigid, and transient 4D Gaussians with an object-wise dynamic mask and scene flow guidance to model multi-scale motions and achieve SOTA novel view synthesis.

V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

cs.CV · 2025-11-25 · unverdicted · novelty 6.0

V2-SAM adapts SAM2 to cross-view object correspondence with geometry-aware and appearance-based prompt generators plus a post-hoc cyclic consistency selector, reporting new state-of-the-art results on Ego-Exo4D, DAVIS-2017, and HANDAL-X.

Eevee: Towards Close-up High-resolution Video-based Virtual Try-on

cs.CV · 2025-11-24 · unverdicted · novelty 6.0

A new dataset with high-fidelity close-up garment images and full/close-up try-on videos plus the VGID metric enables better texture and structure preservation in high-resolution video virtual try-on.

TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model

cs.CV · 2026-05-18 · conditional · novelty 5.0

TinySAM 2 reaches 90% of SAM 2.1 performance on DAVIS and SA-V using 7% of the memory tokens and 3% of the training data via frame selection, spatial average pooling, temporal similarity-based token pruning, and a RepViT image encoder.

Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation

cs.CV · 2025-09-12 · unverdicted · novelty 5.0

Extends online 2D multi-camera tracking to 3D via depth-based point cloud reconstruction, clustering for 3D boxes, and local ID consistency for global data association, placing 3rd on the 2025 AI City Challenge 3D MTMC dataset.

VVitCutLER: Towards Unsupervised Object Detection and Segmentation in Videos

cs.CV · 2026-05-11 · unverdicted · novelty 4.0

VVitCutLER introduces VitCut as a temporally stable pseudo-label generator with cross-frame consistency and feature aggregation to improve unsupervised video object detection and segmentation.

citing papers explorer

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding · cs.CV · 2026-04-09 · unverdicted · none · ref 43
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation · cs.CV · 2026-01-06 · conditional · none · ref 32
MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition · cs.CV · 2025-12-08 · unverdicted · none · ref 59
Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation · cs.RO · 2025-12-04 · unverdicted · none · ref 39
RiGS: Rigid-aware 4D Gaussian Splatting from a Single Monocular Video · cs.CV · 2026-05-22 · unverdicted · none · ref 44
V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence · cs.CV · 2025-11-25 · unverdicted · none · ref 46
Eevee: Towards Close-up High-resolution Video-based Virtual Try-on · cs.CV · 2025-11-24 · unverdicted · none · ref 46
TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model · cs.CV · 2026-05-18 · conditional · none · ref 20
Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation · cs.CV · 2025-09-12 · unverdicted · none · ref 39
VVitCutLER: Towards Unsupervised Object Detection and Segmentation in Videos · cs.CV · 2026-05-11 · unverdicted · none · ref 29