SAM 2: Segment Anything in Images and Videos
Pith reviewed 2026-05-10 13:52 UTC · model grok-4.3
The pith
SAM 2 segments images and videos from prompts, reporting better video accuracy with 3x fewer user interactions than prior approaches and image segmentation that is more accurate and 6x faster than SAM.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM).
What carries the argument
A simple transformer architecture with streaming memory for real-time video processing, trained on data from an interactive data engine that iteratively collects and refines large-scale video segmentation annotations.
If this is right
- Video segmentation tasks can be completed with higher accuracy while requiring only one-third the user interactions of previous approaches.
- Image segmentation becomes both more accurate and six times faster than with the original Segment Anything Model.
- A large-scale video segmentation dataset is released for use in training and evaluating future models.
- Real-time processing of video streams is supported by the streaming memory design in the transformer architecture.
- The model, dataset, and training code are made available to advance video segmentation and related perception tasks.
Where Pith is reading between the lines
- The iterative data engine loop could be reused to gather high-quality annotations for other dynamic vision tasks such as tracking or action recognition.
- Streaming memory opens the possibility of applying the model to live video feeds without needing to store entire sequences in memory.
- Releasing both the model and the dataset allows independent groups to test whether the performance gains hold on entirely new video domains.
Load-bearing premise
The annotations produced by the interactive data engine are high-quality and diverse enough that a model trained on them generalizes to independent benchmarks without overfitting to the collection process itself.
What would settle it
Testing the released model on established video segmentation benchmarks such as DAVIS or YouTube-VOS and finding that it does not deliver comparable or better accuracy with substantially fewer user interactions than prior methods.
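One way to make this test concrete is sketched below, assuming binary masks stored as nested lists of 0/1 values and a per-method accuracy curve indexed by number of interactions. The function names (`mask_iou`, `interactions_to_reach`) are hypothetical and are not taken from the paper or its released code, and scoring accuracy by plain intersection over union is an assumption about the metric rather than the benchmarks' official protocol.

```python
from typing import List, Optional, Sequence

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixels (assumed mask format)


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_video_iou(pred_frames: List[Mask], truth_frames: List[Mask]) -> float:
    """Average per-frame IoU over one video."""
    scores = [mask_iou(p, t) for p, t in zip(pred_frames, truth_frames)]
    return sum(scores) / len(scores)


def interactions_to_reach(curve: Sequence[float], target: float) -> Optional[int]:
    """Smallest number of interactions whose accuracy meets the target.

    curve[k] is the accuracy measured after k + 1 interactions.
    Returns None if the target is never reached.
    """
    for n, accuracy in enumerate(curve, start=1):
        if accuracy >= target:
            return n
    return None
```

Comparing `interactions_to_reach` for two methods at the same target gives the interaction ratio that the 3x claim refers to. Any curves passed in would have to come from actual benchmark runs.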
Original abstract
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing our main model, dataset, as well as code for model training and our demo.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SAM 2, a promptable foundation model for visual segmentation in both images and videos. It describes an interactive data engine used to collect the SA-V video segmentation dataset, a transformer-based architecture augmented with streaming memory for real-time video processing, and reports that the resulting model achieves higher accuracy than prior methods while requiring 3x fewer user interactions on video benchmarks and is both more accurate and 6x faster than the original SAM on image tasks. The work releases the model, dataset, and training code.
Significance. If the performance claims and data quality hold after addressing potential biases, this constitutes a substantial contribution as the first large-scale promptable video segmentation foundation model. The combination of an iterative data engine, streaming memory design, and open release of data/code provides a new baseline and resource for video perception tasks, with potential downstream impact on applications requiring efficient interactive segmentation.
major comments (2)
- [§3] §3 (Data Engine): The interactive annotation process is presented as the source of the SA-V dataset that underpins the video segmentation gains, yet the section provides no explicit decontamination steps, diversity statistics, or audits to rule out correlation between collected annotations and the prompt styles or video distributions in held-out benchmarks such as DAVIS and YouTube-VOS. Because the central claim of 3x fewer interactions with better accuracy rests on the assumption that SA-V yields unbiased, generalizable supervision, this omission is load-bearing and requires concrete evidence (e.g., overlap analysis or held-out collection protocols) to secure the result.
- [§5] §5 (Experiments): The reported interaction-efficiency and accuracy deltas are given without error bars, per-video-length breakdowns, or ablations on prompt type and streaming-memory size. These omissions prevent verification that the 3x interaction reduction and accuracy edge are robust rather than artifacts of particular benchmark characteristics or the data-collection loop itself.
minor comments (2)
- [§4] Notation for the streaming memory update rules is introduced without a compact equation or pseudocode block, making it harder to reproduce the real-time inference path.
- [Figure 3] Figure 3 (qualitative results) would benefit from explicit annotation of prompt locations and failure cases to illustrate the claimed interaction savings.
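For the first minor comment, the kind of compact pseudocode being requested might look like the following. This is only a sketch under stated assumptions: the paper's actual update rules are not reproduced on this page, so the design here (a fixed-size first-in-first-out bank of recent frames plus a separate list of prompted frames that is never evicted) and the names `StreamingMemory`, `max_recent`, `update`, and `context` are all hypothetical.

```python
from collections import deque
from typing import Any, Deque, List, Tuple

Entry = Tuple[int, Any]  # (frame index, per-frame memory features)


class StreamingMemory:
    """Bounded memory for frame-by-frame video processing (illustrative only)."""

    def __init__(self, max_recent: int = 6) -> None:
        # Assumption: prompted frames are always kept, recent frames are capped.
        self.prompted: List[Entry] = []
        self.recent: Deque[Entry] = deque(maxlen=max_recent)

    def update(self, frame_idx: int, features: Any, is_prompted: bool = False) -> None:
        """Store the memory of a processed frame."""
        entry = (frame_idx, features)
        if is_prompted:
            self.prompted.append(entry)
        else:
            # A full deque silently drops its oldest entry on append.
            self.recent.append(entry)

    def context(self) -> List[Entry]:
        """Entries the current frame would attend to: prompted first, then recent."""
        return list(self.prompted) + list(self.recent)
```

Because the recent-frame bank is capped, memory use stays constant however long the video runs, which is the property the streaming design relies on for real-time inference.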
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important aspects of data collection transparency and experimental reporting that we will address in the revision. Below we respond point-by-point to the major comments.
-
Referee: [§3] §3 (Data Engine): The interactive annotation process is presented as the source of the SA-V dataset that underpins the video segmentation gains, yet the section provides no explicit decontamination steps, diversity statistics, or audits to rule out correlation between collected annotations and the prompt styles or video distributions in held-out benchmarks such as DAVIS and YouTube-VOS. Because the central claim of 3x fewer interactions with better accuracy rests on the assumption that SA-V yields unbiased, generalizable supervision, this omission is load-bearing and requires concrete evidence (e.g., overlap analysis or held-out collection protocols) to secure the result.
Authors: We agree that §3 would benefit from greater transparency on potential biases. The SA-V dataset was collected via an iterative data engine that begins with a small seed set and expands through model-assisted annotation on a broad range of public video sources chosen to differ from common benchmarks. In the revised manuscript we will expand §3 with a new subsection that reports: (1) explicit steps taken to avoid overlap with DAVIS and YouTube-VOS (video ID and content hashing), (2) diversity statistics (video length distribution, scene categories, motion types), and (3) a quantitative overlap analysis between collected prompts and those appearing in the held-out benchmarks. These additions will be supported by a table and short appendix figure. revision: yes
-
Referee: [§5] §5 (Experiments): The reported interaction-efficiency and accuracy deltas are given without error bars, per-video-length breakdowns, or ablations on prompt type and streaming-memory size. These omissions prevent verification that the 3x interaction reduction and accuracy edge are robust rather than artifacts of particular benchmark characteristics or the data-collection loop itself.
Authors: We acknowledge that the current experimental section omits several robustness checks. The main numbers are averages over the full benchmark suites, but we did not report variance or stratified breakdowns. In the revision we will augment §5 (and the supplementary material) with: error bars computed across multiple random seeds and video subsets, per-video-length performance curves (short/medium/long clips), and additional ablations varying prompt type (point, box, mask) and memory bank size. These results were already computed during development and confirm that the reported gains hold across conditions; we will include the key tables and figures. revision: yes
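Two of the promised checks can be sketched in a few lines: an overlap test based on video IDs and content hashes, and error bars across seeds. The sketch assumes videos are available as raw bytes keyed by ID; the helper names (`content_hash`, `find_overlap`, `mean_and_std`) are hypothetical, and the authors' actual decontamination and evaluation code is not shown on this page.

```python
import hashlib
import statistics
from typing import Dict, List, Sequence, Tuple


def content_hash(video_bytes: bytes) -> str:
    """SHA-256 digest of the raw video bytes."""
    return hashlib.sha256(video_bytes).hexdigest()


def find_overlap(collected: Dict[str, bytes], benchmark: Dict[str, bytes]) -> List[str]:
    """IDs of collected videos that match a benchmark video by ID or by content hash."""
    bench_ids = set(benchmark)
    bench_hashes = {content_hash(data) for data in benchmark.values()}
    return sorted(
        vid
        for vid, data in collected.items()
        if vid in bench_ids or content_hash(data) in bench_hashes
    )


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one metric across seeds (needs >= 2 scores)."""
    return statistics.mean(scores), statistics.stdev(scores)
```

Exact hashing only catches byte-identical copies; a re-encoded or cropped copy of a benchmark video would pass this check, so a fuller audit would need some form of perceptual matching.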
Circularity Check
No significant circularity in claimed derivation or performance chain
full rationale
The paper's central claims rest on empirical benchmark results (accuracy, interaction efficiency, speed) obtained by training a transformer model on a newly collected SA-V dataset produced via the interactive data engine. No equations or derivations reduce reported gains to quantities defined by the authors' own fitted parameters or prior outputs. Self-citations to the original SAM work supply architectural context but are not load-bearing; the new streaming memory, data collection loop, and held-out evaluations (DAVIS, YouTube-VOS, etc.) are independent of those citations. The data engine description is procedural rather than definitional, and performance numbers are externally falsifiable rather than tautological. This is the common honest case of an empirical systems paper whose results do not collapse by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- streaming memory size and update rules
axioms (1)
- domain assumption: A transformer backbone pretrained on image data can be adapted to video via a streaming memory module without catastrophic forgetting of image segmentation capability.
Forward citations
Cited by 60 Pith papers
-
DyABD: The Abdominal Muscle Segmentation in Dynamic MRI Benchmark
DyABD is the first benchmark dataset for abdominal muscle segmentation in dynamic MRIs featuring exercise-induced anatomical changes and pre/post-surgery scans, where existing models achieve an average Dice score of 0.82.
-
Learning a Particle Dynamics Model with Real-world Videos
A neural particle dynamics model is trained on real videos using dense Gaussians and rendering supervision, without particle-level labels, plus a new dataset of about 500 videos.
-
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.
-
CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration
CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
-
REACH: Hand Pose Estimation from Room Corners
REACH-Net estimates accurate 3D hand poses from afar using a Transformer that correlates hand and body features across multiview tokens and exploits temporal coordination autoregressively on the new REACH dataset with...
-
CHOIR: Contact-aware 4D Hand-Object Interaction Reconstruction
CHOIR reconstructs 4D hand-object interactions from monocular videos by initializing coarse sequences from visual priors, rectifying placements with contact predictions, and jointly optimizing for geometric and contac...
-
VSCD: Video-based Scene Change Detection in Unaligned Scenes
VSCD presents a query-centric multi-reference model for pixel-wise change detection in unaligned, unsynchronized indoor videos, backed by a 1.1-million-frame benchmark and real-robot validation for surveillance and in...
-
Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors
LangTail uses entity-level semantic priors from language models aligned via contrastive learning in a hierarchical clustering setup to resolve long-tail ambiguity, yielding +13.5, +12.9, and +8.9 mIoU gains on ScanNet...
-
ShadeBench: A Benchmark Dataset for Building Shade Simulation in Sustainable Society
ShadeBench is a multimodal benchmark dataset for urban shade understanding that includes temporally varying shade maps, satellite imagery, building representations, and text to support shade generation, segmentation, ...
-
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
AutoTool uses reinforcement learning with dual-mode rewards to train multimodal LLMs to adaptively choose between tool-assisted and text-centric reasoning, yielding accuracy and efficiency gains on V* and POPE benchmarks.
-
Vision Harnessing Agent for Open Ad-hoc Segmentation
VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
-
Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation
A task-specific iterative framework for weakly supervised 4D radar scene flow estimation uses instance-aware self-supervised losses from 2D tracking/segmentation and a rigid static loss from odometry to outperform LiD...
-
Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation
Weakly supervised iterative framework for radar scene flow estimation using back-projected 2D instance masks and odometry-based rigid static loss to outperform LiDAR-dependent and fully supervised baselines on the VoD...
-
CineMatte: Background Matting for Virtual Production and Beyond
CineMatte uses a cross-attention design on a Siamese DINOv3 ViT plus a pretrained upsampler to produce robust mattes for virtual production, backed by a new non-synthetic 4K VP dataset that supports camera motion.
-
Best Segmentation Buddies for Image-Shape Correspondence
The work defines Best Segmentation Buddies as vertices on a 3D shape whose nearest image pixel under distilled features falls inside a given 2D segment, then uses the same features to segment the shape in 3D.
-
Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification
IC-Seg is a new agentic framework using multi-turn clarification and Hi-GRPO hierarchical optimization to resolve ambiguous queries in referring video object segmentation while maintaining performance on standard benchmarks.
-
SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability
SeamCam quantifies camouflage by computing one minus the highest IoU recoverable from category-conditioned detection proposals against a ground-truth mask, achieving 78.82% agreement with human judgments.
-
ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest
Introduces the ELDOR UAV dataset and four benchmark tasks for semantic segmentation and classification of mining disturbances and ecological recovery in rainforest imagery.
-
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.
-
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
-
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
-
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
-
Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances
AFFORDMEM improves AP50 by 3.23-3.7 points on SceneFun3D splits by using a reusable cross-scene affordance memory bank and in-scene spatial memory to guide VLMs toward actionable 3D regions.
-
Count Anything at Any Granularity
Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...
-
Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing
Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.
-
PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction
PaMoSplat reconstructs dynamic scenes by lifting 2D segmentations to coherent 3D Gaussian parts and estimating their motions via optical flow-guided differential evolution for higher quality rendering and faster training.
-
3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects
3DReflecNet is a 22 TB+ dataset of over 120,000 synthetic and 1,000 real objects with millions of multi-view frames for benchmarking 3D reconstruction on reflective, transparent, and low-texture surfaces.
-
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
-
Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
Urban-ImageNet is a 2-million-image multi-modal dataset with HUSIC 10-class taxonomy enabling benchmarks for urban scene classification, cross-modal retrieval, and instance segmentation.
-
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.
-
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
-
From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation
A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
-
BrickCraft: Visuomotor Skill Composition with Situated Manual Guidance for Long-Horizon Interlocking Brick Assembly
BrickCraft composes reusable visuomotor skills via relative anchoring to partial structures and situated visual manuals to achieve long-horizon interlocking brick assembly from limited demonstrations with generalizati...
-
From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting
Direct primitive comparison in Gaussian splatting with anisotropic drift models and observability terms enables multi-view consistent change detection that separates geometric and appearance changes, outperforming pri...
-
From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting
GS-DIFF detects changes in 3D Gaussian Splatting scenes by direct primitive attribute comparison with anisotropic drift models and observability terms, outperforming render-then-compare baselines by ~17% mIoU.
-
PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset
PicoEyes unifies gaze estimation for mixed reality by jointly predicting 3D eye parameters, segmentation, optical and visual axes, and depth maps from monocular or binocular inputs, supported by a new large-scale mult...
-
PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset
PicoEyes delivers a unified end-to-end model for full 3D gaze estimation including eye parameters, axes, segmentation and depth from monocular or binocular near-eye images, supported by a new large-scale multi-view dataset.
-
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
Towards All-Day Perception for Off-Road Driving: A Large-Scale Multispectral Dataset and Comprehensive Benchmark
Presents the first large-scale infrared off-road dataset and a flow-free temporal model achieving state-of-the-art freespace detection performance with real-time inference.
-
MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding
MarkIt uses a query-to-mask bridge with open-vocabulary segmentation to add visual markers and frame indices to videos, enabling Vid-LLMs to achieve state-of-the-art temporal grounding on moment retrieval and highligh...
-
HANDFUL: Sequential Grasp-Conditioned Dexterous Manipulation with Resource Awareness
HANDFUL learns resource-aware grasps using finger contact rewards and curriculum learning to improve success on sequential dexterous tasks in simulation and on a real LEAP hand.
-
DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation
DouC fuses an OG-CLIP branch for patch reliability via inference-time token gating with an FADE-CLIP branch for structural priors via proxy attention, outperforming prior training-free methods on eight benchmarks.
-
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a movie-derived dataset and benchmark that enables AI models to generate multi-shot videos with coherent narratives and preserved subject identity across shots.
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
-
VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
VFM4SDG is a dual-prior framework that distills cross-domain stable relations from VFMs into DETR encoders and injects semantic-contextual priors into decoder queries to reduce missed detections in single-domain gener...
-
Instance-level Visual Active Tracking with Occlusion-Aware Planning
OA-VAT improves visual active tracking by combining instance-level prototype discrimination with occlusion-aware diffusion planning, reporting gains over prior SOTA on simulated and real drone benchmarks.
-
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
SurgCoT is a new benchmark that evaluates chain-of-thought spatiotemporal reasoning in multimodal large language models on surgical videos using five defined dimensions and an annotation protocol of Question-Option-Kn...
-
DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts
DETR-ViP boosts visual-prompted detection performance by learning globally discriminative prompts through integration and distillation on top of image-text contrastive learning, with a selective fusion step for stability.
-
ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding
ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.
-
VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
VGGT-Segmentor achieves new state-of-the-art cross-view segmentation on Ego-Exo4D with 67.7% and 68.0% average IoU using a geometry-enhanced model and correspondence-free pretraining that beats most supervised baselines.
-
VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems
VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.
-
Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models
SAM-family models split into occluder-aware types that avoid predicting into occluded regions and occluder-agnostic types that confidently segment hidden areas, shown via a new benchmark on polyp datasets.
-
Online Reasoning Video Object Segmentation
The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
-
Scene Change Detection with Vision-Language Representation Learning
LangSCD fuses VLM-generated text descriptions with visual features and adds geometric-semantic matching to improve scene change detection, while releasing the NYC-CD dataset of 8122 New York City image pairs with mult...
-
Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection
Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
-
EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates
EgoFun3D creates a new task, 271-video dataset, and pipeline using function templates to model interactive 3D objects from egocentric videos for simulation.
-
Point2Pose: Occlusion-Recovering 6D Pose Tracking and 3D Reconstruction for Multiple Unknown Objects Via 2D Point Trackers
A model-free system uses 2D point trackers to achieve causal 6D pose tracking and incremental 3D reconstruction for multiple unseen rigid objects from RGB-D video, with recovery from complete occlusions.
-
YUV20K: A Complexity-Driven Benchmark and Trajectory-Aware Alignment Model for Video Camouflaged Object Detection
YUV20K is a complexity-driven VCOD benchmark with 24k annotated frames, paired with a model using Motion Feature Stabilization via semantic primitives and Trajectory-Aware Alignment via deformable sampling that outper...
Reference graph
Works this paper leans on
-
[1]
2018 robotic scene segmentation challenge
Max Allan, Satoshi Kondo, Sebastian Bodenstedt, Stefan Leger, Rahim Kadkhodamohammadi, Imanol Luengo, Felix Fuentes, Evangello Flouty, Ahmed Mohammed, Marius Pedersen, et al. 2018 robotic scene segmentation challenge. arXiv preprint arXiv:2001.11190,
-
[2]
Learning what to learn for video object segmentation
Goutam Bhat, Felix Järemo Lawin, Martin Danelljan, Andreas Robinson, Michael Felsberg, Luc Van Gool, and Radu Timofte. Learning what to learn for video object segmentation. ECCV, abs/2003.11540,
-
[3]
Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual KITTI 2. arXiv preprint arXiv:2001.10773,
-
[4]
The 2018 DAVIS challenge on video object segmentation
Sergi Caelles, Alberto Montes, Kevis-Kokitsi Maninis, Yuhua Chen, Luc Van Gool, Federico Perazzi, and Jordi Pont-Tuset. The 2018 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1803.00557,
-
[5]
Sergi Caelles, Jordi Pont-Tuset, Federico Perazzi, Alberto Montes, Kevis-Kokitsi Maninis, and Luc Van Gool. The 2019 DAVIS challenge on VOS: Unsupervised multi-object segmentation. arXiv preprint arXiv:1905.00737,
-
[6]
Nucleus segmentation across imaging experiments: the 2018 data science bowl
Juan C. Caicedo, Allen Goodman, Kyle W. Karhohs, Beth A. Cimini, Jeanelle Ackerman, Marzieh Haghighi, CherKeng Heng, Tim Becker, Minh Doan, Claire McQuin, Mohammad Rohban, Shantanu Singh, and Anne E. Carpenter. Nucleus segmentation across imaging experiments: the 2018 data science bowl. Nature Methods,
-
[7]
arXiv preprint arXiv:2305.06558 (2023)
Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Rethinking space-time networks with improved memory coverage for efficient video object segmentation. In NeurIPS, 2021a. Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion. In CVPR, 2021b. Ho Kei Cheng, Seoun...
-
[8]
cite arxiv:1406.1078Comment: EMNLP
-
[9]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691,
-
[10]
Ruining Deng, Can Cui, Quan Liu, Tianyuan Yao, Lucas W Remedios, Shunxing Bao, Bennett A Landman, Lee E Wheless, Lori A Coburn, Keith T Wilson, et al. Segment anything model (SAM) for digital pathology: Assess zero-shot segmentation on whole slide imaging. arXiv preprint arXiv:2304.04155,
-
[11]
Estibaliz Gómez-de Mariscal, Hasini Jayatilaka, Özgün Çiçek, Thomas Brox, Denis Wirtz, and Arrate Muñoz-Barrutia. Search for temporal cell segmentation robustness in phase-contrast microscopy videos. arXiv preprint arXiv:2112.08817,
-
[12]
Raghav Goyal, Wan-Cyuan Fan, Mennatullah Siam, and Leonid Sigal. Tam-vt: Transformation-aware multi-scale video transformer for segmentation and tracking. arXiv preprint arXiv:2312.08514,
-
[13]
Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. arXiv preprint arXiv:2311.18259,
-
[14]
Exploiting depth information for wildlife monitoring
Timm Haucke and Volker Steinhage. Exploiting depth information for wildlife monitoring. arXiv preprint arXiv:2102.05607,
-
[15]
Rotary position embedding for vision transformer
Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. arXiv preprint arXiv:2403.13298,
-
[16]
Videoclick: Video Object Segmentation with a Single Click
Namdar Homayounfar, Justin Liang, Wei-Chiu Ma, and Raquel Urtasun. Videoclick: Video Object Segmentation with a Single Click. arXiv preprint arXiv:2101.06545,
-
[17]
TrashCan: A semantically-segmented dataset towards visual detection of marine debris
Jungseok Hong, Michael Fulton, and Junaed Sattar. TrashCan: A semantically-segmented dataset towards visual detection of marine debris. arXiv:2007.08097,
-
[18]
Lingyi Hong, Zhongying Liu, Wenchao Chen, Chenzhi Tan, Yuang Feng, Xinyu Zhou, Pinxue Guo, Jinglun Li, Zhaoyu Chen, Shuyong Gao, et al. Lvos: A benchmark for large-scale long-term video object segmentation. arXiv preprint arXiv:2404.19326,
-
[19]
Yuan-Ting Hu, Jia-Bin Huang, and Alexander G. Schwing. MaskRNN: Instance level video object segmentation. In NeurIPS, 2018a.
Yuan-Ting Hu, Jia-Bin Huang, and Alexander G. Schwing. VideoMatch: Matching based video object segmentation. ECCV, abs/1809.01123, 2018b.
Xiaoqian Huang, Kachole Sanket, Abdulla Ayyad, Fariborz Baghaei Naeini, Dimitrios Makris, and ...
-
[20]
Quantifying the Carbon Emissions of Machine Learning
Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700,
-
[21]
Carbon Emissions and Large Neural Network Training
David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350,
-
[22]
The 2017 DAVIS Challenge on Video Object Segmentation
Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv:1704.00675,
-
[23]
Segment anything meets point tracking
Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, and Fisher Yu. Segment anything meets point tracking. arXiv:2307.01197,
-
[24]
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864,
-
[25]
Can sam segment anything? when sam meets camouflaged object detection
Lv Tang, Haoke Xiao, and Bo Li. Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:2304.04709, 2023.
-
[26]
NDD20: A large-scale few-shot dolphin dataset for coarse and fine-grained categorisation
Cameron Trotter, Georgia Atkinson, Matt Sharpe, Kirsten Richardson, A. Stephen McGough, Nick Wright, Ben Burville, and Per Berggren. NDD20: A large-scale few-shot dolphin dataset for coarse and fine-grained categorisation. arXiv:2005.13359,
- [27]
-
[28]
Weiyao Wang, Matt Feiszli, Heng Wang, and Du Tran. Unidentified video objects: A benchmark for dense, open-world segmentation. In ICCV, pp. 10776–10785, 2021b.
Junde Wu, Wei Ji, Yuanpei Liu, Huazhu Fu, Min Xu, Yanwu Xu, and Yueming Jin. Medical sam adapter: Adapting segment anything model for medical image segmentation. arXiv preprint arXiv:2304.12620, 2023...
-
[29]
N. Xu, L. Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian L. Price, Scott D. Cohen, and Thomas S. Huang. Youtube-vos: Sequence-to-sequence video object segmentation. In ECCV, 2018a.
N. Xu, L. Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas S. Huang. Youtube-vos: A large-scale video object segmentation benchmark....
-
[30]
Lei Yang, Yan Zi Wei, Yisheng He, Wei Sun, Zhenhang Huang, Haibin Huang, and Haoqiang Fan. iShape: A first step towards irregular shape instance segmentation. arXiv:2109.15068, 2021a.
Zongxin Yang and Yi Yang. Decoupling features in hierarchical propagation for video object segmentation. In NeurIPS,
-
[31]
Faster segment anything: Towards lightweight sam for mobile applications
Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster segment anything: Towards lightweight sam for mobile applications, 2023a.
Jiaming Zhang, Yutao Cui, Gangshan Wu, and Limin Wang. Joint modeling of feature, correspondence, and a compressed memory for video object segmentation. ArXiv, abs/2308.13505, ...