arxiv: 2604.09955 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: unknown

Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation

Tzu Ling Liu , Ian Stavness , Mrigank Rochan

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords video unsupervised domain adaptationaction recognitiontokenizationmotion focusdomain shiftcomputational efficiencyvideo adaptation

0 comments

The pith

Learnable Motion-Focused Tokenization improves video unsupervised domain adaptation by discarding low-motion background tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LMFT to address challenges in video unsupervised domain adaptation for action recognition, where models must transfer from a labeled source to an unlabeled target domain. It works by breaking video frames into patch tokens and training a selection process to retain only motion-rich tokens while dropping low-motion ones that mostly represent static backgrounds. This reduces the impact of background-induced domain shifts and lowers the number of tokens fed to the adaptation model. Experiments across three benchmarks and 21 adaptation settings demonstrate state-of-the-art accuracy alongside major reductions in computational cost. Readers may value this because video models often fail when backgrounds differ and because efficiency matters for deployment on constrained hardware.

Core claim

LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. When used within a VUDA framework, this approach achieves state-of-the-art performance on three standard VUDA benchmarks across 21 domain adaptation settings while significantly reducing computational overhead compared with prior methods.

What carries the argument

Learnable Motion-Focused Tokenization (LMFT), which identifies and keeps motion-rich patch tokens from video frames to focus adaptation on action-relevant content.

If this is right

State-of-the-art results on standard VUDA benchmarks across 21 settings.
Significant reduction in computational overhead during adaptation.
Better handling of domain shifts caused by differing static backgrounds in source and target videos.
Retention of action-relevant information while removing redundant background tokens.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selection principle could extend to supervised video tasks or other video problems such as detection and captioning where background noise is costly.
Training the motion selector without target labels might transfer to other unsupervised video adaptation settings beyond action recognition.
The method's success depends on motion reliably signaling action importance, so it may need adjustments for actions that rely on subtle or static cues.

Load-bearing premise

That low-motion tokens primarily correspond to uninformative background regions whose removal will not discard action-relevant information and that the learnable selection process can be trained effectively in the unsupervised target domain.

What would settle it

A set of target-domain videos containing important actions performed with minimal motion, where discarding low-motion tokens causes clear drops in recognition accuracy.

Figures

Figures reproduced from arXiv: 2604.09955 by Ian Stavness, Mrigank Rochan, Tzu Ling Liu.

**Figure 1.** Figure 1: Overview of LMFT. For both source and target videos, LMFT tokenizes frames into patch tokens, computes the L1 distance [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Visualization of LMFT on four videos (two left, two right). Each video has three rows: original frames, motion differences, and [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LMFT adds a learnable motion-based token dropper to VUDA that targets efficiency and background shift, but the unsupervised selector's safety on target data is the part that needs checking.

read the letter

The paper's main move is to tokenize video frames and then learn which low-motion tokens to discard before feeding the rest into a domain-adaptation pipeline. The claim is that this keeps action-relevant patches, removes uninformative background, and cuts compute at the same time. That combination is the thing worth noticing: most VUDA papers chase accuracy alone, while this one tries to make the whole thing lighter for deployment on three standard benchmarks across 21 shifts.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Learnable Motion-Focused Tokenization (LMFT) for Video Unsupervised Domain Adaptation (VUDA). It tokenizes video frames into patch tokens and introduces a learnable mechanism to discard low-motion, redundant tokens (assumed to be background) while retaining motion-rich, action-relevant tokens. The framework is evaluated on three standard VUDA benchmarks across 21 domain adaptation settings, claiming state-of-the-art performance alongside substantial reductions in computational overhead.

Significance. If the central claims hold after addressing the noted concerns, the work would be significant for video domain adaptation: it directly targets the background-induced domain shift problem while simultaneously improving efficiency, an aspect often neglected in prior VUDA literature. The motion-focused token pruning offers a practical route to scalable action recognition adaptation.

major comments (3)

[Abstract, §3] Abstract and §3 (LMFT description): The core assumption that low-motion tokens 'primarily correspond to background regions' and can be safely discarded without losing action-relevant information is load-bearing for both the effectiveness and efficiency claims, yet the unsupervised training of the selector in the target domain provides no validation against ground-truth action regions or robustness checks for action classes where discriminative cues are static or low-motion (e.g., pose-based actions).
[§4] §4 (Experiments): The SOTA results across 21 settings are asserted without reported ablations that isolate the contribution of the learnable motion selector versus other framework components, or tests that forcibly retain low-motion tokens to measure information loss; this leaves the efficiency-performance tradeoff unsubstantiated.
[§3.2] §3.2 (Token selection mechanism): The end-to-end training of the motion-based selector in the unlabeled target domain risks domain-shift sensitivity in motion statistics, but no analysis or failure-case discussion is provided for scenarios where motion estimation itself shifts across domains.

minor comments (2)

[Abstract] The abstract would be clearer with explicit naming of the three benchmarks and a one-sentence summary of the tokenization architecture.
[§3] Notation for the motion estimation and selection thresholds should be defined consistently between text and any equations or pseudocode.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. The comments highlight important aspects of our assumptions, experimental validation, and potential limitations, which we address point by point below. We will incorporate revisions to strengthen the paper as outlined.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (LMFT description): The core assumption that low-motion tokens 'primarily correspond to background regions' and can be safely discarded without losing action-relevant information is load-bearing for both the effectiveness and efficiency claims, yet the unsupervised training of the selector in the target domain provides no validation against ground-truth action regions or robustness checks for action classes where discriminative cues are static or low-motion (e.g., pose-based actions).

Authors: We acknowledge that the assumption linking low-motion tokens to background regions is central and that direct ground-truth validation is unavailable in the unsupervised target domain. Our empirical results across 21 settings demonstrate consistent gains, providing indirect support. To address the concern directly, we will revise §3 and the abstract to clarify the assumption's scope, add a limitations discussion for low-motion or pose-based actions, and include qualitative token visualizations in the supplementary material. revision: yes
Referee: [§4] §4 (Experiments): The SOTA results across 21 settings are asserted without reported ablations that isolate the contribution of the learnable motion selector versus other framework components, or tests that forcibly retain low-motion tokens to measure information loss; this leaves the efficiency-performance tradeoff unsubstantiated.

Authors: The referee is correct that dedicated ablations isolating the learnable selector are not reported. While overall SOTA comparisons and efficiency metrics are provided, we will add targeted ablations in the revised §4, including variants with/without the selector, random token retention baselines, and forced retention of low-motion tokens to quantify any information loss and better substantiate the tradeoff. revision: yes
Referee: [§3.2] §3.2 (Token selection mechanism): The end-to-end training of the motion-based selector in the unlabeled target domain risks domain-shift sensitivity in motion statistics, but no analysis or failure-case discussion is provided for scenarios where motion estimation itself shifts across domains.

Authors: We agree that domain shifts in motion statistics represent a potential risk not explicitly analyzed. The end-to-end adaptation and strong cross-domain results provide some evidence of robustness, but we will revise §3.2 to include an analysis of motion statistic differences across domains and a discussion of possible failure cases, drawing on examples from the evaluated benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity: LMFT is a proposed method whose performance claims rest on experimental benchmarks rather than self-referential definitions or fitted inputs.

full rationale

The paper introduces LMFT as a learnable token selection process that discards low-motion tokens while retaining action-relevant ones for VUDA. No equations or steps in the abstract or description reduce a claimed prediction or result to its own inputs by construction; the selection is trained end-to-end on the target domain without invoking self-citations for uniqueness or smuggling ansatzes. The SOTA performance is asserted via experiments across 21 settings on three benchmarks, which constitutes independent empirical content rather than a renaming or self-definition. This is a standard method-proposal paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities can be extracted or verified from the text.

pith-pipeline@v0.9.0 · 5467 in / 1090 out tokens · 60118 ms · 2026-05-10T16:34:34.437242+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 5 canonical work pages · 4 internal anchors

[1]

Divprune: Diversity-based visual token pruning for large multimodal models

Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, and Yong Zhang. Divprune: Diversity-based visual token pruning for large multimodal models. InCVPR, 2025. 2, 7

2025
[2]

Token merging: Your vit but faster.ICLR, 2023

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.ICLR, 2023. 2, 7

2023
[3]

Temporal attentive align- ment for large-scale video domain adaptation

Min-Hung Chen, Zsolt Kira, Ghassan AlRegib, Jaekwon Yoo, Ruxin Chen, and Jian Zheng. Temporal attentive align- ment for large-scale video domain adaptation. InICCV,
[4]

Don’t look twice: Faster video transformers with run-length tokenization

Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Ni- inuma, Kris Kitani, and L´aszl´o Jeni. Don’t look twice: Faster video transformers with run-length tokenization. InNeurIPS,
[5]

Dual-head contrastive domain adaptation for video action recognition

Victor G Turrisi da Costa, Giacomo Zara, Paolo Rota, Thi- ago Oliveira-Santos, Nicu Sebe, Vittorio Murino, and Elisa Ricci. Dual-head contrastive domain adaptation for video action recognition. InWACV, 2022. 1, 6

2022
[6]

Turrisi Da Costa, Giacomo Zara, Paolo Rota, Thi- ago Oliveira-Santos, Nicu Sebe, Vittorio Murino, and Elisa Ricci

Victor G. Turrisi Da Costa, Giacomo Zara, Paolo Rota, Thi- ago Oliveira-Santos, Nicu Sebe, Vittorio Murino, and Elisa Ricci. Unsupervised Domain Adaptation for Video Trans- formers in Action Recognition. InICPR, pages 1258–1265,
[7]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. Flashattention-2: Faster attention with bet- ter parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023. 5

work page internal anchor Pith review arXiv 2023
[8]

Flashattention: Fast and memory-efficient exact attention with io-awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christo- pher R ´e. Flashattention: Fast and memory-efficient exact attention with io-awareness. InNeurIPS, 2022. 5

2022
[9]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 1, 2, 5, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2010
[10]

Unsupervised domain adaptation by backpropagation

Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. InICML, 2015. 1, 2, 6

2015
[11]

Bypass back-propagation: Optimization-based structural pruning for large language models via policy gra- dient

Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, and Gui- Song Xia. Bypass back-propagation: Optimization-based structural pruning for large language models via policy gra- dient. InACL, 2025. 4, 8

2025
[12]

Sefar: Semi-supervised fine- grained action recognition with temporal perturbation and learning stabilization

Yongle Huang, Haodong Chen, Zhenbang Xu, Zihan Jia, Haozhou Sun, and Dian Shao. Sefar: Semi-supervised fine- grained action recognition with temporal perturbation and learning stabilization. InAAAI, 2025. 2

2025
[13]

The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu- man action video dataset.arXiv preprint arXiv:1705.06950,

work page internal anchor Pith review arXiv
[14]

Drone- hat: Hybrid attention transformer for complex action recog- nition in drone surveillance videos

Mustaqeem Khan, Jamil Ahmad, Abdulmotaleb El Saddik, Wail Gueaieb, Giulia De Masi, and Fakhri Karray. Drone- hat: Hybrid attention transformer for complex action recog- nition in drone surveillance videos. InCVPR, 2024. 2

2024
[15]

Kuehne, H

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: A large video database for human motion recogni- tion. InICCV, 2011. 5

2011
[16]

Source-free video domain adaptation with spatial-temporal- historical consistency learning

Kai Li, Deep Patel, Erik Kruus, and Martin Renqiang Min. Source-free video domain adaptation with spatial-temporal- historical consistency learning. InCVPR, 2023. 6

2023
[17]

Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. InICML, 2015. 1, 2, 6

2015
[18]

The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables

Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables.arXiv preprint arXiv:1611.00712, 2016. 8

work page Pith review arXiv 2016
[19]

Moments in time dataset: one million videos for event under- standing.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 2019

Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ra- makrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl V ondrick, and Aude Oliva. Moments in time dataset: one million videos for event under- standing.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 2019. 5

2019
[20]

Multi-modal domain adaptation for fine-grained action recognition

Jonathan Munro and Dima Damen. Multi-modal domain adaptation for fine-grained action recognition. InCVPR,
[21]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, 2021. 4, 6

2021
[22]

Unsupervised video domain adaptation with masked pre-training and collabora- tive self-training

Arun Reddy, William Paul, Corban Rivera, Ketul Shah, Celso M de Melo, and Rama Chellappa. Unsupervised video domain adaptation with masked pre-training and collabora- tive self-training. InCVPR, 2024. 1, 2, 3, 6, 7, 8

2024
[23]

Contrast and mix: Temporal contrastive video domain adaptation with background mixing

Aadarsh Sahoo, Rutav Shah, Rameswar Panda, Kate Saenko, and Abir Das. Contrast and mix: Temporal contrastive video domain adaptation with background mixing. InNeurIPS,
[24]

Llava-prumerge: Adaptive token reduction for efficient large multimodal models

Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. InICCV, 2025. 2, 7

2025
[25]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402, 2012. 5

work page internal anchor Pith review arXiv 2012
[26]

Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. InNeurIPS, 2022. 5

2022
[27]

Videomae v2: Scaling video masked autoencoders with dual masking

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yi- nan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In CVPR, 2023. 5

2023
[28]

Simple statistical gradient-following al- gorithms for connectionist reinforcement learning.Machine learning, pages 229–256, 1992

Ronald J Williams. Simple statistical gradient-following al- gorithms for connectionist reinforcement learning.Machine learning, pages 229–256, 1992. 4

1992
[29]

Svformer: Semi-supervised video trans- former for action recognition

Zhen Xing, Qi Dai, Han Hu, Jingjing Chen, Zuxuan Wu, and Yu-Gang Jiang. Svformer: Semi-supervised video trans- former for action recognition. InCVPR, 2023. 2

2023
[30]

Arid: A new dataset for recogniz- ing action in the dark

Yuecong Xu, Jianfei Yang, Haozhi Cao, Kezhi Mao, Jianx- iong Yin, and Simon See. Arid: A new dataset for recogniz- ing action in the dark. InDeep Learning for Human Activity Recognition, 2021. 5 9

2021
[31]

Source-free video domain adaptation by learning temporal consistency for action recognition

Yuecong Xu, Jianfei Yang, Haozhi Cao, Keyu Wu, Min Wu, and Zhenghua Chen. Source-free video domain adaptation by learning temporal consistency for action recognition. In ECCV, 2022. 6

2022
[32]

Multi-source video do- main adaptation with temporal attentive moment alignment network.Circuits and Systems for Video Technology, 2023

Yuecong Xu, Jianfei Yang, Haozhi Cao, Keyu Wu, Min Wu, Zhengguo Li, and Zhenghua Chen. Multi-source video do- main adaptation with temporal attentive moment alignment network.Circuits and Systems for Video Technology, 2023. 5

2023
[33]

Leveraging endo- and exo- temporal regularization for black-box video domain adapta- tion.Transactions on Machine Learning Research, 2024

Yuecong Xu, Jianfei Yang, Haozhi Cao, Min Wu, Xiaoli Li, Lihua Xie, and Zhenghua Chen. Leveraging endo- and exo- temporal regularization for black-box video domain adapta- tion.Transactions on Machine Learning Research, 2024. 6

2024
[34]

The unreasonable effectiveness of large language-vision models for source-free video domain adaptation

Giacomo Zara, Alessandro Conti, Subhankar Roy, St ´ephane Lathuili`ere, Paolo Rota, and Elisa Ricci. The unreasonable effectiveness of large language-vision models for source-free video domain adaptation. InICCV, 2023. 1, 3, 6, 7

2023
[35]

Audio-adaptive activity recognition across video do- mains

Yunhua Zhang, Hazel Doughty, Ling Shao, and Cees GM Snoek. Audio-adaptive activity recognition across video do- mains. InCVPR, 2022. 5, 6 10

2022