SAM 2: Segment Anything in Images and Videos
Pith reviewed 2026-05-10 13:52 UTC · model grok-4.3
The pith
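SAM 2 segments objects in images and videos from prompts: on video it is more accurate with 3x fewer user interactions than prior approaches, and on images it is more accurate and 6x faster than the original SAM.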
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM).
What carries the argument
A simple transformer architecture with streaming memory for real-time video processing, trained on data from an interactive data engine that iteratively collects and refines large-scale video segmentation annotations.
If this is right
- Video segmentation tasks can be completed with higher accuracy while requiring only one-third the user interactions of previous approaches.
- Image segmentation becomes both more accurate and six times faster than with the original Segment Anything Model.
- A large-scale video segmentation dataset is released for use in training and evaluating future models.
- Real-time processing of video streams is supported by the streaming memory design in the transformer architecture.
- The model, dataset, and training code are made available to advance video segmentation and related perception tasks.
Where Pith is reading between the lines
- The iterative data engine loop could be reused to gather high-quality annotations for other dynamic vision tasks such as tracking or action recognition.
- Streaming memory opens the possibility of applying the model to live video feeds without needing to store entire sequences in memory.
- Releasing both the model and the dataset allows independent groups to test whether the performance gains hold on entirely new video domains.
Load-bearing premise
The annotations produced by the interactive data engine are high-quality and diverse enough that a model trained on them generalizes to independent benchmarks without overfitting to the collection process itself.
What would settle it
Testing the released model on established video segmentation benchmarks such as DAVIS or YouTube-VOS and finding that it does not deliver comparable or better accuracy with substantially fewer user interactions than prior methods.
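The accuracy half of such a test needs a per-frame score. Video segmentation benchmarks such as DAVIS commonly report region similarity, the Jaccard index (intersection over union) between predicted and ground-truth masks. This page does not say which metric is meant, so the following is only a minimal sketch under that assumption. The mask format (nested lists of 0/1) and the function names are illustrative and do not come from the paper or its code.

```python
from typing import Sequence

# Illustrative mask format: rows of 0/1 values (an assumption, not the paper's).
Mask = Sequence[Sequence[int]]


def region_similarity(pred: Mask, truth: Mask) -> float:
    """Jaccard index (intersection over union) of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_similarity(preds: Sequence[Mask], truths: Sequence[Mask]) -> float:
    """Average the per-frame scores over one video."""
    scores = [region_similarity(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)
```

Comparing systems would then mean computing this score at a fixed budget of user interactions per video, or counting the interactions each system needs to reach a fixed score.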
Original abstract
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing our main model, dataset, as well as code for model training and our demo.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SAM 2, a promptable foundation model for visual segmentation in both images and videos. It describes an interactive data engine used to collect the SA-V video segmentation dataset, a transformer-based architecture augmented with streaming memory for real-time video processing, and reports that the resulting model achieves higher accuracy than prior methods while requiring 3x fewer user interactions on video benchmarks and is both more accurate and 6x faster than the original SAM on image tasks. The work releases the model, dataset, and training code.
Significance. If the performance claims and data quality hold after addressing potential biases, this constitutes a substantial contribution as the first large-scale promptable video segmentation foundation model. The combination of an iterative data engine, streaming memory design, and open release of data/code provides a new baseline and resource for video perception tasks, with potential downstream impact on applications requiring efficient interactive segmentation.
major comments (2)
- [§3] §3 (Data Engine): The interactive annotation process is presented as the source of the SA-V dataset that underpins the video segmentation gains, yet the section provides no explicit decontamination steps, diversity statistics, or audits to rule out correlation between collected annotations and the prompt styles or video distributions in held-out benchmarks such as DAVIS and YouTube-VOS. Because the central claim of 3x fewer interactions with better accuracy rests on the assumption that SA-V yields unbiased, generalizable supervision, this omission is load-bearing and requires concrete evidence (e.g., overlap analysis or held-out collection protocols) to secure the result.
- [§5] §5 (Experiments): The reported interaction-efficiency and accuracy deltas are given without error bars, per-video-length breakdowns, or ablations on prompt type and streaming-memory size. These omissions prevent verification that the 3x interaction reduction and accuracy edge are robust rather than artifacts of particular benchmark characteristics or the data-collection loop itself.
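The error bars requested in the second major comment could be reported as a mean with an approximate confidence interval over repeated runs. The sketch below assumes one scalar score per random seed and uses the normal approximation. The scores in it are made up for illustration and are not results from the paper.

```python
import math
import statistics


def mean_with_error_bar(scores, z=1.96):
    """Return (mean, half_width) of an approximate 95% interval.

    Uses the normal approximation mean +/- z * s / sqrt(n). With only a
    handful of seeds, a t-quantile in place of z would be more appropriate.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.fmean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(n)  # sample std. deviation
    return mean, z * std_err


# Hypothetical per-seed scores, invented for this example only.
runs = [0.70, 0.72, 0.74]
mean, half_width = mean_with_error_bar(runs)
print(f"{mean:.3f} +/- {half_width:.3f}")
```

The same function can be applied per stratum (for example short, medium, and long clips) to produce the per-video-length breakdown the comment asks for.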
minor comments (2)
- [§4] Notation for the streaming memory update rules is introduced without a compact equation or pseudocode block, making it harder to reproduce the real-time inference path.
- [Figure 3] Figure 3 (qualitative results) would benefit from explicit annotation of prompt locations and failure cases to illustrate the claimed interaction savings.
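To illustrate the kind of pseudocode the [§4] comment asks for, here is a minimal sketch of a bounded streaming memory. It assumes the simplest possible update rule: a first-in, first-out queue of per-frame entries with a fixed capacity. The class name, the capacity parameter, and the FIFO rule are assumptions made for illustration, not the update rule defined in the paper.

```python
from collections import deque


class StreamingMemory:
    """Bounded FIFO of per-frame entries (an assumed update rule)."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be positive")
        # deque(maxlen=...) drops the oldest entry once the queue is full,
        # so memory use stays constant however long the video runs.
        self._entries = deque(maxlen=capacity)

    def add(self, frame_index: int, features) -> None:
        """Store the features computed for one processed frame."""
        self._entries.append((frame_index, features))

    def context(self):
        """Entries the next frame may condition on, oldest first."""
        return list(self._entries)


memory = StreamingMemory(capacity=3)
for i in range(5):
    memory.add(i, features=f"feat{i}")  # placeholder features
print([idx for idx, _ in memory.context()])  # → [2, 3, 4]
```

Because only the most recent entries are kept, frames can be processed as they arrive without storing the whole sequence, which is the property the streaming design is credited with on this page.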
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important aspects of data collection transparency and experimental reporting that we will address in the revision. Below we respond point-by-point to the major comments.
-
Referee: [§3] §3 (Data Engine): The interactive annotation process is presented as the source of the SA-V dataset that underpins the video segmentation gains, yet the section provides no explicit decontamination steps, diversity statistics, or audits to rule out correlation between collected annotations and the prompt styles or video distributions in held-out benchmarks such as DAVIS and YouTube-VOS. Because the central claim of 3x fewer interactions with better accuracy rests on the assumption that SA-V yields unbiased, generalizable supervision, this omission is load-bearing and requires concrete evidence (e.g., overlap analysis or held-out collection protocols) to secure the result.
Authors: We agree that §3 would benefit from greater transparency on potential biases. The SA-V dataset was collected via an iterative data engine that begins with a small seed set and expands through model-assisted annotation on a broad range of public video sources chosen to differ from common benchmarks. In the revised manuscript we will expand §3 with a new subsection that reports: (1) explicit steps taken to avoid overlap with DAVIS and YouTube-VOS (video ID and content hashing), (2) diversity statistics (video length distribution, scene categories, motion types), and (3) a quantitative overlap analysis between collected prompts and those appearing in the held-out benchmarks. These additions will be supported by a table and short appendix figure. revision: yes
-
Referee: [§5] §5 (Experiments): The reported interaction-efficiency and accuracy deltas are given without error bars, per-video-length breakdowns, or ablations on prompt type and streaming-memory size. These omissions prevent verification that the 3x interaction reduction and accuracy edge are robust rather than artifacts of particular benchmark characteristics or the data-collection loop itself.
Authors: We acknowledge that the current experimental section omits several robustness checks. The main numbers are averages over the full benchmark suites, but we did not report variance or stratified breakdowns. In the revision we will augment §5 (and the supplementary material) with: error bars computed across multiple random seeds and video subsets, per-video-length performance curves (short/medium/long clips), and additional ablations varying prompt type (point, box, mask) and memory bank size. These results were already computed during development and confirm that the reported gains hold across conditions; we will include the key tables and figures. revision: yes
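The overlap check promised in the first response (video ID and content hashing) could look roughly like the sketch below. It assumes each video is available as raw bytes keyed by an ID and uses SHA-256 from the standard library. The function names and the data layout are hypothetical and are not taken from the paper or its released code.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 digest of raw bytes; only catches byte-identical copies."""
    return hashlib.sha256(data).hexdigest()


def find_overlap(collected: dict, benchmark: dict) -> dict:
    """Report collected videos whose ID or content appears in the benchmark.

    Both arguments map a video ID (str) to that video's raw bytes.
    """
    bench_hashes = {content_hash(blob): vid for vid, blob in benchmark.items()}
    overlap = {}
    for vid, blob in collected.items():
        digest = content_hash(blob)
        if vid in benchmark:
            overlap[vid] = "same id"
        elif digest in bench_hashes:
            overlap[vid] = "same content as " + bench_hashes[digest]
    return overlap


# Toy byte strings standing in for video files.
collected = {"a": b"x", "b": b"y", "c": b"z"}
benchmark = {"a": b"q", "d": b"y"}
print(find_overlap(collected, benchmark))
```

Exact hashing misses re-encoded, cropped, or resized copies of the same footage, so a real audit would likely also need a perceptual or frame-level fingerprint.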
Circularity Check
No significant circularity in claimed derivation or performance chain
The paper's central claims rest on empirical benchmark results (accuracy, interaction efficiency, speed) obtained by training a transformer model on a newly collected SA-V dataset produced via the interactive data engine. No equations or derivations reduce reported gains to quantities defined by the authors' own fitted parameters or prior outputs. Self-citations to the original SAM work supply architectural context but are not load-bearing; the new streaming memory, data collection loop, and held-out evaluations (DAVIS, YouTube-VOS, etc.) are independent of those citations. The data engine description is procedural rather than definitional, and performance numbers are externally falsifiable rather than tautological. This is the common honest case of an empirical systems paper whose results do not collapse by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- streaming memory size and update rules
axioms (1)
- Domain assumption: a transformer backbone pretrained on image data can be adapted to video via a streaming memory module without catastrophic forgetting of image segmentation capability.
Forward citations
Cited by 60 Pith papers
-
DyABD: The Abdominal Muscle Segmentation in Dynamic MRI Benchmark
DyABD is the first benchmark dataset for abdominal muscle segmentation in dynamic MRIs featuring exercise-induced anatomical changes and pre/post-surgery scans, where existing models achieve an average Dice score of 0.82.
-
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.
-
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
-
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
-
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
-
Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances
AFFORDMEM improves AP50 by 3.23-3.7 points on SceneFun3D splits by using a reusable cross-scene affordance memory bank and in-scene spatial memory to guide VLMs toward actionable 3D regions.
-
Count Anything at Any Granularity
Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...
-
Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing
Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.
-
PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction
PaMoSplat reconstructs dynamic scenes by lifting 2D segmentations to coherent 3D Gaussian parts and estimating their motions via optical flow-guided differential evolution for higher quality rendering and faster training.
-
3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects
3DReflecNet is a 22 TB+ dataset of over 120,000 synthetic and 1,000 real objects with millions of multi-view frames for benchmarking 3D reconstruction on reflective, transparent, and low-texture surfaces.
-
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
-
Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
Urban-ImageNet is a 2-million-image multi-modal dataset with HUSIC 10-class taxonomy enabling benchmarks for urban scene classification, cross-modal retrieval, and instance segmentation.
-
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.
-
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
-
From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation
A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
-
BrickCraft: Visuomotor Skill Composition with Situated Manual Guidance for Long-Horizon Interlocking Brick Assembly
BrickCraft composes reusable visuomotor skills via relative anchoring to partial structures and situated visual manuals to achieve long-horizon interlocking brick assembly from limited demonstrations with generalizati...
-
From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting
GS-DIFF detects changes in 3D Gaussian Splatting scenes by direct primitive attribute comparison with anisotropic drift models and observability terms, outperforming render-then-compare baselines by ~17% mIoU.
-
From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting
Direct primitive comparison in Gaussian splatting with anisotropic drift models and observability terms enables multi-view consistent change detection that separates geometric and appearance changes, outperforming pri...
-
PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset
PicoEyes delivers a unified end-to-end model for full 3D gaze estimation including eye parameters, axes, segmentation and depth from monocular or binocular near-eye images, supported by a new large-scale multi-view dataset.
-
PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset
PicoEyes unifies gaze estimation for mixed reality by jointly predicting 3D eye parameters, segmentation, optical and visual axes, and depth maps from monocular or binocular inputs, supported by a new large-scale mult...
-
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
Towards All-Day Perception for Off-Road Driving: A Large-Scale Multispectral Dataset and Comprehensive Benchmark
Presents the first large-scale infrared off-road dataset and a flow-free temporal model achieving state-of-the-art freespace detection performance with real-time inference.
-
MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding
MarkIt uses a query-to-mask bridge with open-vocabulary segmentation to add visual markers and frame indices to videos, enabling Vid-LLMs to achieve state-of-the-art temporal grounding on moment retrieval and highligh...
-
HANDFUL: Sequential Grasp-Conditioned Dexterous Manipulation with Resource Awareness
HANDFUL learns resource-aware grasps using finger contact rewards and curriculum learning to improve success on sequential dexterous tasks in simulation and on a real LEAP hand.
-
DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation
DouC fuses an OG-CLIP branch for patch reliability via inference-time token gating with an FADE-CLIP branch for structural priors via proxy attention, outperforming prior training-free methods on eight benchmarks.
-
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a movie-derived dataset and benchmark that enables AI models to generate multi-shot videos with coherent narratives and preserved subject identity across shots.
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
-
Instance-level Visual Active Tracking with Occlusion-Aware Planning
OA-VAT improves visual active tracking by combining instance-level prototype discrimination with occlusion-aware diffusion planning, reporting gains over prior SOTA on simulated and real drone benchmarks.
-
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
SurgCoT is a new benchmark that evaluates chain-of-thought spatiotemporal reasoning in multimodal large language models on surgical videos using five defined dimensions and an annotation protocol of Question-Option-Kn...
-
DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts
DETR-ViP boosts visual-prompted detection performance by learning globally discriminative prompts through integration and distillation on top of image-text contrastive learning, with a selective fusion step for stability.
-
ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding
ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.
-
VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
VGGT-Segmentor achieves new state-of-the-art cross-view segmentation on Ego-Exo4D with 67.7% and 68.0% average IoU using a geometry-enhanced model and correspondence-free pretraining that beats most supervised baselines.
-
VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems
VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.
-
Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models
SAM-family models split into occluder-aware types that avoid predicting into occluded regions and occluder-agnostic types that confidently segment hidden areas, shown via a new benchmark on polyp datasets.
-
Online Reasoning Video Object Segmentation
The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
-
Scene Change Detection with Vision-Language Representation Learning
LangSCD fuses VLM-generated text descriptions with visual features and adds geometric-semantic matching to improve scene change detection, while releasing the NYC-CD dataset of 8122 New York City image pairs with mult...
-
Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection
Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
-
EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates
EgoFun3D creates a new task, 271-video dataset, and pipeline using function templates to model interactive 3D objects from egocentric videos for simulation.
-
Point2Pose: Occlusion-Recovering 6D Pose Tracking and 3D Reconstruction for Multiple Unknown Objects Via 2D Point Trackers
A model-free system uses 2D point trackers to achieve causal 6D pose tracking and incremental 3D reconstruction for multiple unseen rigid objects from RGB-D video, with recovery from complete occlusions.
-
YUV20K: A Complexity-Driven Benchmark and Trajectory-Aware Alignment Model for Video Camouflaged Object Detection
YUV20K is a complexity-driven VCOD benchmark with 24k annotated frames, paired with a model using Motion Feature Stabilization via semantic primitives and Trajectory-Aware Alignment via deformable sampling that outper...
-
Large-Scale Universal Defect Generation: Foundation Models and Datasets
A 300K quadruplet dataset and UniDG foundation model enable reference- or text-driven defect generation across categories, outperforming few-shot baselines on anomaly detection tasks.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
-
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
-
3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...
-
Training a Student Expert via Semi-Supervised Foundation Model Distillation
A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.
-
TRACE: Thermal Recognition Attentive-Framework for CO2 Emissions from Livestock
TRACE achieves 0.998 mIoU for CO2 plume segmentation and best-in-class flux classification from MWIR thermal videos via a gas-aware attention encoder and temporal fusion module.
-
Quantitative Video World Model Evaluation for Geometric-Consistency
PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.
-
VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence
VoxCor creates reusable volumetric features from frozen 2D ViT models by combining triplanar inference with a closed-form weighted partial least squares projection, enabling direct voxel correspondence across modaliti...
-
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
-
SECOND-Grasp: Semantic Contact-guided Dexterous Grasping
SECOND-Grasp integrates semantic contact proposals from vision-language reasoning with geometric refinement to achieve 98%+ lifting success and improved intent-aware grasping on seen and unseen objects.
-
DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions
DexSynRefine synthesizes HOI motions with an extended manifold method, refines them via task-space residual RL, and adapts for sim-to-real transfer, outperforming kinematic retargeting by 50-70 percentage points on fi...
-
From Priors to Perception: Grounding Video-LLMs in Physical Reality
Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard L...
-
Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...
-
HumanSplatHMR: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatar
HumanSplatHMR closes the loop between human mesh recovery and Gaussian Splatting by using photometric, segmentation, and depth losses to refine poses during avatar optimization.
-
Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.
-
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...
-
TAIL-Safe: Task-Agnostic Safety Monitoring for Imitation Learning Policies
TAIL-Safe learns a Lipschitz Q-function from digital-twin failure data to identify an empirical control-invariant safe set for imitation learning policies and applies gradient-based recovery to keep actions inside it.
-
Affordance Agent Harness: Verification-Gated Skill Orchestration
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
Reference graph
Works this paper leans on
-
[1]
2018 robotic scene segmentation challenge
Max Allan, Satoshi Kondo, Sebastian Bodenstedt, Stefan Leger, Rahim Kadkhodamohammadi, Imanol Luengo, Felix Fuentes, Evangello Flouty, Ahmed Mohammed, Marius Pedersen, et al. 2018 robotic scene segmentation challenge. arXiv preprint arXiv:2001.11190,
-
[2]
Learning what to learn for video object segmentation.ECCV, abs/2003.11540,
Goutam Bhat, Felix Järemo Lawin, Martin Danelljan, Andreas Robinson, Michael Felsberg, Luc Van Gool, and Radu Timofte. Learning what to learn for video object segmentation.ECCV, abs/2003.11540,
-
[3]
Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual KITTI 2.arXiv preprint arXiv:2001.10773,
work page internal anchor Pith review arXiv 2001
-
[4]
arXiv preprint arXiv:1803.00557 , year=
Sergi Caelles, Alberto Montes, Kevis-Kokitsi Maninis, Yuhua Chen, Luc Van Gool, Federico Perazzi, and Jordi Pont-Tuset. The 2018 davis challenge on video object segmentation.arXiv preprint arXiv:1803.00557,
-
[5]
Sergi Caelles, Jordi Pont-Tuset, Federico Perazzi, Alberto Montes, Kevis-Kokitsi Maninis, and Luc Van Gool. The 2019 davis challenge on vos: Unsupervised multi-object segmentation.arXiv preprint arXiv:1905.00737,
-
[6]
Caicedo, Allen Goodman, Kyle W
Juan C. Caicedo, Allen Goodman, Kyle W. Karhohs, Beth A. Cimini, Jeanelle Ackerman, Marzieh Haghighi, CherKeng Heng, Tim Becker, Minh Doan, Claire McQuin, Mohammad Rohban, Shantanu Singh, and Anne E. Carpenter. Nucleus segmentation across imaging experiments: the 2018 data science bowl.Nature Methods,
work page 2018
-
[7]
Segment and Track Anything.arXiv preprint arXiv:2305.06558, 2023
Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Rethinking space-time networks with improved memory coverage for efficient video object segmentation. InNeurIPS, 2021a. Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion. InCVPR, 2021b. Ho Kei Cheng, Seoun...
-
[8]
cite arxiv:1406.1078Comment: EMNLP
work page internal anchor Pith review arXiv
-
[9]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691,
work page internal anchor Pith review arXiv
-
[10]
Ruining Deng, Can Cui, Quan Liu, Tianyuan Yao, Lucas W Remedios, Shunxing Bao, Bennett A Landman, Lee E Wheless, Lori A Coburn, Keith T Wilson, et al. Segment anything model (sam) for digital pathology: Assess zero-shot segmentation on whole slide imaging.arXiv preprint arXiv:2304.04155,
-
[11]
Estibaliz Gómez-de Mariscal, Hasini Jayatilaka, Özgün Çiçek, Thomas Brox, Denis Wirtz, and Arrate Muñoz- Barrutia. Search for temporal cell segmentation robustness in phase-contrast microscopy videos.arXiv preprint arXiv:2112.08817,
-
[12]
Raghav Goyal, Wan-Cyuan Fan, Mennatullah Siam, and Leonid Sigal. Tam-vt: Transformation-aware multi-scale video transformer for segmentation and tracking.arXiv preprint arXiv:2312.08514,
-
[13]
38 Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives.arXiv preprint arXiv:2311.18259,
-
[14]
Exploiting depth information for wildlife monitoring
Timm Haucke and Volker Steinhage. Exploiting depth information for wildlife monitoring. arXiv preprint arXiv:2102.05607,
-
[15]
Rotary position embedding for vision transformer
Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. arXiv preprint arXiv:2403.13298,
-
[16]
Videoclick: Video Object Segmentation with a Single Click.arXiv preprint arXiv:2101.06545,
Namdar Homayounfar, Justin Liang, Wei-Chiu Ma, and Raquel Urtasun. Videoclick: Video Object Segmentation with a Single Click.arXiv preprint arXiv:2101.06545,
-
[17]
TrashCan: A semantically-segmented dataset towards visual detection of marine debris
Jungseok Hong, Michael Fulton, and Junaed Sattar. TrashCan: A semantically-segmented dataset towards visual detection of marine debris. arXiv:2007.08097,
-
[18]
Lingyi Hong, Zhongying Liu, Wenchao Chen, Chenzhi Tan, Yuang Feng, Xinyu Zhou, Pinxue Guo, Jinglun Li, Zhaoyu Chen, Shuyong Gao, et al. Lvos: A benchmark for large-scale long-term video object segmentation. arXiv preprint arXiv:2404.19326,
-
[19]
Yuan-Ting Hu, Jia-Bin Huang, and Alexander G. Schwing. MaskRNN: Instance level video object segmentation. In NeurIPS, 2018a.
Yuan-Ting Hu, Jia-Bin Huang, and Alexander G. Schwing. VideoMatch: Matching based video object segmentation. ECCV, abs/1809.01123, 2018b.
Xiaoqian Huang, Kachole Sanket, Abdulla Ayyad, Fariborz Baghaei Naeini, Dimitrios Makris, and ...
-
[20]
Quantifying the Carbon Emissions of Machine Learning
Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700,
-
[21]
Carbon Emissions and Large Neural Network Training
David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350,
-
[22]
The 2017 DAVIS Challenge on Video Object Segmentation
Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv:1704.00675,
-
[23]
Segment anything meets point tracking
Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, and Fisher Yu. Segment anything meets point tracking. arXiv:2307.01197,
-
[24]
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864,
-
[25]
Can sam segment anything? when sam meets camouflaged object detection
Lv Tang, Haoke Xiao, and Bo Li. Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:2304.04709,
-
[26]
NDD20: A large-scale few-shot dolphin dataset for coarse and fine-grained categorisation
Cameron Trotter, Georgia Atkinson, Matt Sharpe, Kirsten Richardson, A. Stephen McGough, Nick Wright, Ben Burville, and Per Berggren. NDD20: A large-scale few-shot dolphin dataset for coarse and fine-grained categorisation. arXiv:2005.13359,
- [27]
-
[28]
Weiyao Wang, Matt Feiszli, Heng Wang, and Du Tran. Unidentified video objects: A benchmark for dense, open-world segmentation. In ICCV, pp. 10776–10785, 2021b.
Junde Wu, Wei Ji, Yuanpei Liu, Huazhu Fu, Min Xu, Yanwu Xu, and Yueming Jin. Medical sam adapter: Adapting segment anything model for medical image segmentation. arXiv preprint arXiv:2304.12620, 2023...
-
[29]
N. Xu, L. Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian L. Price, Scott D. Cohen, and Thomas S. Huang. Youtube-vos: Sequence-to-sequence video object segmentation. In ECCV, 2018a.
N. Xu, L. Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas S. Huang. Youtube-vos: A large-scale video object segmentation benchmark....
-
[30]
iShape: A first step towards irregular shape instance segmentation
Lei Yang, Yan Zi Wei, Yisheng HE, Wei Sun, Zhenhang Huang, Haibin Huang, and Haoqiang Fan. iShape: A first step towards irregular shape instance segmentation. arXiv:2109.15068, 2021a.
Zongxin Yang and Yi Yang. Decoupling features in hierarchical propagation for video object segmentation. In NeurIPS,
-
[31]
Faster segment anything: Towards lightweight sam for mobile applications
Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster segment anything: Towards lightweight sam for mobile applications, 2023a.
Jiaming Zhang, Yutao Cui, Gangshan Wu, and Limin Wang. Joint modeling of feature, correspondence, and a compressed memory for video object segmentation. ArXiv, abs/2308.13505, ...