Temporal Aware Pruning for Efficient Diffusion-based Video Generation

Bo Yuan; Junhao Ran; Sheng Li; Xulong Tang; Yang Sui; Yue Dai

arxiv: 2605.17837 · v1 · pith:A3E76MFMnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI

Temporal Aware Pruning for Efficient Diffusion-based Video Generation

Sheng Li , Yang Sui , Junhao Ran , Bo Yuan , Yue Dai , Xulong Tang This is my paper

Pith reviewed 2026-05-20 12:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords video diffusiontoken pruningtemporal coherenceefficient inferencetraining-freeViT pruningspatiotemporal sequences

0 comments

The pith

Temporal smoothing of token importance across frames lets pruning speed up video diffusion while keeping coherence

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that token pruning for video diffusion models works much better when it respects time relationships between frames instead of treating each frame separately. Standard attention-based pruning per frame creates flickering, background inconsistency, and quality loss because it ignores how important tokens should stay aligned over time. TAPE counters this by smoothing importance scores temporally, reselecting tokens at selected layers to fit their different semantic roles, and varying how much to prune based on the current diffusion timestep. A reader would care because these models are too slow for practical use due to long spatiotemporal sequences, and a training-free fix could make high-quality video generation faster on ordinary hardware.

Core claim

TAPE is a training-free method that applies temporal smoothing to align token-importance across adjacent frames and suppress selection jitter, performs token reselection in selected layers to align pruning with layers' diverse semantic focus, and adopts a timestep-level budget scheduling that prunes aggressively at early noisy steps and relaxes during later refinement.

What carries the argument

Temporal smoothing of token importance scores across frames together with layer-wise reselection and timestep-adaptive pruning budgets.

Load-bearing premise

That the temporal smoothing and layer reselection will not create new artifacts or quality drops that standard visual metrics fail to catch, especially in complex motion or long sequences.

What would settle it

Running TAPE-generated videos with rapid complex motions or extended lengths and measuring visible flickering, background drift, or drops in perceptual scores against the unpruned baseline and other pruning methods.

Figures

Figures reproduced from arXiv: 2605.17837 by Bo Yuan, Junhao Ran, Sheng Li, Xulong Tang, Yang Sui, Yue Dai.

**Figure 2.** Figure 2: Overview of TAPE. At timestep T, ① timestep-aware scheduling first decides the pruning ratio, which will be reduced at late steps; ② Token reselection is conducted intermittently, align pruning decisions with diverse semantic focuses in different layers; upon selection, ③ temporal smoothing blends current and aligned previous scores to enforce temporally coherent pruning. ToMe (Bolya & Hoffman, 2023) and t… view at source ↗

**Figure 3.** Figure 3: An example of attention distribution across layers. Each block (i.e., token) in the attention [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Visualization of the generated video. baseline. Although a 40% token reduction rate introduces slight softness in some regions, the overall structure, motion, and prompt semantics are still well captured, demonstrating that TAPE maintains strong visual fidelity even under aggressive pruning. We provide additional visualizations of videos generated with our pruning method TAPE in the supplementary material … view at source ↗

**Figure 5.** Figure 5: An example visualization of pruned areas across frames for EViT and our proposed TAPE [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: Additional visualizations of the generated videos. [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

read the original abstract

Video diffusion models have recently enabled high-quality video generation with ViT-based architectures, but remain computationally intensive because generation requires attention computation over long spatiotemporal sequences. Token pruning has proven effective for ViTs and VLMs. However, most prior pruning methods are attention-based and operate per frame, failing to ensure the vital temporal coherence across frames in video generation tasks. In practice, naively adopting attention-only pruning causes noticeable degradation due to worsened background consistency, flickering, and reduced image quality. To address this, we propose TAPE, a training-free Temporal Aware Pruning for Efficient diffusion-based video generation. TAPE (i) applies temporal smoothing to align token-importance across adjacent frames and suppress selection jitter; and (ii) performs token reselection in selected layers to align token pruning with layers' diverse semantic focus and avoid error accumulation in specific areas; it also (iii) adopt a timestep-level budget scheduling that prunes aggressively at early noisy steps and relaxes pruning during fidelity-critical refinement. The experimental results show that TAPE delivers significant speedups while preserving high visual fidelity, outperforming prior token reduction approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TAPE adds temporal smoothing, layer reselection, and timestep scheduling to token pruning for video diffusion, which targets flickering but needs full results to confirm it works without new artifacts.

read the letter

Colleague, the core of this paper is a training-free pruning method called TAPE that tries to make token reduction work for video diffusion without breaking temporal coherence. It uses three pieces: smoothing importance scores across adjacent frames to cut selection jitter, reselection in chosen layers to match their different semantic roles, and a schedule that prunes more aggressively in early noisy timesteps while easing off later. This combination is presented as new relative to the per-frame attention pruning baselines mentioned in the abstract. The approach makes sense as a direct response to the background inconsistency and flickering that come from applying image-style pruning frame by frame in a multi-step diffusion process. The paper does a reasonable job naming those failure modes and offering simple, heuristic fixes that do not require retraining or extra parameters beyond the smoothing strength and budget schedule. That keeps the method lightweight and easy to implement on top of existing models. The soft spots are mostly around verification. The abstract claims significant speedups with preserved visual fidelity and better results than prior token reduction work, but without the actual metrics, baseline comparisons, ablation tables, or error analysis it is difficult to judge the size of the gains or whether subtle temporal issues remain in complex motion or longer sequences. Standard image metrics can miss low-level flickering or consistency drift, so the stress-test concern about residual artifacts is worth checking against the full experiments. If the results only cover short clips or lack detailed temporal evaluation, the central claim weakens. This paper is for people working on practical speedups for diffusion-based video generation. A reader already running these models and looking for inference tricks could get value from trying the three mechanisms. It deserves a serious referee because the problem is real, the method is straightforward to reproduce, and the full paper should contain the data needed to test whether the fixes actually deliver on fidelity.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes TAPE, a training-free token pruning method for ViT-based diffusion video generation models. It introduces three components: temporal smoothing to align token importance scores across adjacent frames and reduce selection jitter, layer-wise token reselection to match pruning to each layer's semantic focus and prevent localized error accumulation, and a timestep-dependent pruning budget that applies aggressive pruning in early noisy denoising steps while relaxing it during later fidelity-critical steps. The central claim is that these changes mitigate the background inconsistency, flickering, and quality loss seen in naive per-frame attention-based pruning, yielding substantial inference speedups while preserving visual fidelity and outperforming prior token-reduction baselines.

Significance. If the fidelity claims hold under rigorous testing, the work would meaningfully lower the computational barrier for spatiotemporal attention in video diffusion, enabling longer or higher-resolution generation on modest hardware. The training-free design and explicit targeting of temporal coherence issues are practical strengths. The heuristic nature of the three components is acknowledged but does not undermine potential utility provided the experimental comparisons are robust.

major comments (2)

[§4.1 and Table 2] §4.1 and Table 2: The reported speedups and visual-quality metrics (FID, CLIP-T, etc.) are shown against prior token-reduction methods, but no quantitative temporal-consistency metrics (e.g., optical-flow warping error, inter-frame LPIPS, or flicker index) are provided for long sequences or complex motion. This directly bears on the central claim that temporal smoothing plus reselection fully eliminates the flickering and background inconsistency the abstract attributes to naive pruning.
[§3.2] §3.2: The temporal-smoothing operation is described as aligning importance scores across frames, yet the manuscript lists 'temporal smoothing strength' as a free hyper-parameter with no sensitivity analysis or default-value justification. If the reported gains depend on per-video tuning of this parameter, the comparison to prior methods that also require hyper-parameter choices is weakened.

minor comments (2)

[Figure 4] Figure 4 caption: the legend does not clarify whether the visualized token masks are from the same denoising timestep or aggregated across steps.
[Related Work] Related-work section: citation to the original token-pruning ViT papers is present, but recent video-specific pruning works (e.g., those using motion-aware masks) are referenced only briefly; a short comparison paragraph would help situate the novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our work. We have carefully considered each comment and provide point-by-point responses below, along with our plans for revisions.

read point-by-point responses

Referee: [§4.1 and Table 2] §4.1 and Table 2: The reported speedups and visual-quality metrics (FID, CLIP-T, etc.) are shown against prior token-reduction methods, but no quantitative temporal-consistency metrics (e.g., optical-flow warping error, inter-frame LPIPS, or flicker index) are provided for long sequences or complex motion. This directly bears on the central claim that temporal smoothing plus reselection fully eliminates the flickering and background inconsistency the abstract attributes to naive pruning.

Authors: We agree that quantitative temporal consistency metrics would provide additional support for our central claims regarding the mitigation of flickering and background inconsistency. While the existing metrics (FID, CLIP-T) and qualitative results demonstrate the effectiveness of TAPE, we will incorporate optical-flow warping error and inter-frame LPIPS evaluations in the revised manuscript to directly quantify temporal coherence improvements over baselines. revision: yes
Referee: [§3.2] §3.2: The temporal-smoothing operation is described as aligning importance scores across frames, yet the manuscript lists 'temporal smoothing strength' as a free hyper-parameter with no sensitivity analysis or default-value justification. If the reported gains depend on per-video tuning of this parameter, the comparison to prior methods that also require hyper-parameter choices is weakened.

Authors: The temporal smoothing strength is set to a fixed default value in all our experiments, which we will explicitly state and justify in the revised manuscript. To address the concern, we will also include a sensitivity analysis demonstrating that the performance remains robust across a range of values for this hyper-parameter, indicating that the gains do not rely on per-video tuning. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in heuristic design and empirical claims

full rationale

The paper proposes TAPE as a training-free method using three heuristic components—temporal smoothing to align token importance across frames, layer-wise reselection to match semantic focus, and timestep budget scheduling for aggressive early pruning—explicitly to mitigate issues like flickering and error accumulation from naive per-frame pruning. These are presented as design choices supported by experimental comparisons to baselines and prior token-reduction methods, with no mathematical derivation, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claims to inputs by construction. The speedups and fidelity preservation are asserted via empirical results on standard metrics rather than any self-referential loop, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions from attention-based pruning and diffusion models plus a small number of scheduling hyperparameters whose values are chosen to balance speed and quality.

free parameters (2)

temporal smoothing strength
Controls how strongly importance scores are aligned across frames; value chosen experimentally.
timestep pruning budget schedule
Determines how aggressively to prune at each denoising step; tuned for early vs late steps.

axioms (2)

domain assumption Attention scores provide a reliable proxy for token importance in ViT-based diffusion models
Invoked when deciding which tokens to prune.
domain assumption Maintaining temporal coherence across frames is critical for perceived video quality
Stated as the reason naive per-frame pruning fails.

pith-pipeline@v0.9.0 · 5728 in / 1350 out tokens · 35463 ms · 2026-05-20T12:06:37.691369+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArrowOfTime.lean arrow_from_z unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

TAPE (i) applies temporal smoothing to align token-importance across adjacent frames... (ii) performs token reselection in selected layers... (iii) adopt a timestep-level budget scheduling that prunes aggressively at early noisy steps
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

˜s_n_i = α·s_n_i + (1−α)·s_{n−1}_i (α=0.5)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

169 extracted references · 169 canonical work pages · 17 internal anchors

[1]

Proceedings of the IEEE international conference on computer vision , pages=

Segflow: Joint learning for video object segmentation and optical flow , author=. Proceedings of the IEEE international conference on computer vision , pages=

work page
[2]

Artificial Intelligence Review , volume=

Optical flow for video super-resolution: A survey , author=. Artificial Intelligence Review , volume=. 2022 , publisher=

work page 2022
[3]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Learning accurate dense correspondences and when to trust them , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[4]

IEEE Transactions on Image Processing , volume=

MSA-Net: Establishing reliable correspondences by multiscale attention network , author=. IEEE Transactions on Image Processing , volume=. 2022 , publisher=

work page 2022
[5]

IEEE transactions on pattern analysis and machine intelligence , volume=

Memc-net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2019 , publisher=

work page 2019
[6]

IEEE Transactions on Multimedia , volume=

Dynamic motion estimation and evolution video prediction network , author=. IEEE Transactions on Multimedia , volume=. 2020 , publisher=

work page 2020
[7]

arXiv preprint arXiv:2202.07800 , year=

Not all patches are what you need: Expediting vision transformers via token reorganizations , author=. arXiv preprint arXiv:2202.07800 , year=

work page arXiv
[8]

International conference on machine learning , pages=

A simple framework for contrastive learning of visual representations , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020
[9]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Hardness-aware deep metric learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[10]

Gao, Tianyu and Yao, Xingcheng and Chen, Danqi , booktitle=

work page
[11]

2018 IEEE international conference on robotics and automation (ICRA) , pages=

Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation , author=. 2018 IEEE international conference on robotics and automation (ICRA) , pages=. 2018 , organization=

work page 2018
[12]

Unsupervised Representation Learning by Predicting Image Rotations

Unsupervised representation learning by predicting image rotations , author=. arXiv preprint arXiv:1803.07728 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. arXiv preprint arXiv:1810.04805 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

European conference on computer vision , pages=

Unsupervised learning of visual representations by solving jigsaw puzzles , author=. European conference on computer vision , pages=. 2016 , organization=

work page 2016
[15]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Momentum contrast for unsupervised visual representation learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[16]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

An image is worth 16x16 words: Transformers for image recognition at scale , author=. arXiv preprint arXiv:2010.11929 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2010
[17]

Science China Information Sciences , volume=

A unified pruning framework for vision transformers , author=. Science China Information Sciences , volume=. 2023 , publisher=

work page 2023
[18]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Width & depth pruning for vision transformers , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[19]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Patch slimming for efficient vision transformers , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[20]

Advances in neural information processing systems , volume=

Dynamicvit: Efficient vision transformers with dynamic token sparsification , author=. Advances in neural information processing systems , volume=

work page
[21]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

A-vit: Adaptive tokens for efficient vision transformer , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[22]

arXiv preprint arXiv:2305.17530 , year=

Pumer: Pruning and merging tokens for efficient vision language models , author=. arXiv preprint arXiv:2305.17530 , year=

work page arXiv
[23]

Advances in neural information processing systems , volume=

Bootstrap your own latent-a new approach to self-supervised learning , author=. Advances in neural information processing systems , volume=

work page
[24]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Exploring simple siamese representation learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[25]

International Conference on Machine Learning , pages=

Toward understanding the feature learning process of self-supervised contrastive learning , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[26]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

Accelerating Self-Supervised Learning via Efficient Training Strategies , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

work page
[27]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Contrastive dual gating: Learning sparse features with contrastive learning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[28]

2021 58th ACM/IEEE Design Automation Conference (DAC) , pages=

Enabling on-device self-supervised contrastive learning with selective data contrast , author=. 2021 58th ACM/IEEE Design Automation Conference (DAC) , pages=. 2021 , organization=

work page 2021
[29]

International Conference on Machine Learning , pages=

Rigging the lottery: Making all tickets winners , author=. International Conference on Machine Learning , pages=. 2020 , organization=

work page 2020
[30]

Advances in Neural Information Processing Systems , volume=

Layer Freezing & Data Sieving: Missing Pieces of a Generic Framework for Sparse Training , author=. Advances in Neural Information Processing Systems , volume=

work page
[31]

IEEE Micro , volume=

Sustainable ai processing at the edge , author=. IEEE Micro , volume=. 2022 , publisher=

work page 2022
[32]

Companion Proceedings of the Web Conference 2022 , pages=

Optimizing Data Layout for Training Deep Neural Networks , author=. Companion Proceedings of the Web Conference 2022 , pages=

work page 2022
[33]

International Conference on Machine Learning , pages=

Self-damaging contrastive learning , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[34]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Emerging properties in self-supervised vision transformers , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

work page
[35]

Advances in neural information processing systems , volume=

What makes for good views for contrastive learning? , author=. Advances in neural information processing systems , volume=

work page
[36]

Improved Baselines with Momentum Contrastive Learning

Improved baselines with momentum contrastive learning , author=. arXiv preprint arXiv:2003.04297 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2003
[37]

Science China Technological Sciences , volume=

Modeling of nano piezoelectric actuator based on block matching algorithm with optimal block size , author=. Science China Technological Sciences , volume=. 2013 , publisher=

work page 2013
[38]

Proceedings of the IEEE international conference on computer vision workshops , pages=

3d object representations for fine-grained categorization , author=. Proceedings of the IEEE international conference on computer vision workshops , pages=

work page
[39]

Optics Express , volume=

Occlusion removal method of partially occluded 3D object using sub-image block matching in computational integral imaging , author=. Optics Express , volume=. 2008 , publisher=

work page 2008
[40]

SSIM , author=

Image quality metrics: PSNR vs. SSIM , author=. 2010 20th international conference on pattern recognition , pages=. 2010 , organization=

work page 2010
[41]

IEEE transactions on Image Processing , volume=

A new diamond search algorithm for fast block-matching motion estimation , author=. IEEE transactions on Image Processing , volume=. 2000 , publisher=

work page 2000
[42]

European conference on computer vision , pages=

Visualizing and understanding convolutional networks , author=. European conference on computer vision , pages=. 2014 , organization=

work page 2014
[43]

Advances in neural information processing systems , volume=

How transferable are features in deep neural networks? , author=. Advances in neural information processing systems , volume=

work page
[44]

The Eleventh International Conference on Learning Representations , year=

Which Layer is Learning Faster? A Systematic Exploration of Layer-wise Convergence Rate for Deep Neural Networks , author=. The Eleventh International Conference on Learning Representations , year=

work page
[45]

European Conference on Computer Vision , pages=

Towards Efficient and Effective Self-Supervised Learning of Visual Representations , author=. European Conference on Computer Vision , pages=. 2022 , organization=

work page 2022
[46]

2009 , publisher=

Learning multiple layers of features from tiny images , author=. 2009 , publisher=

work page 2009
[47]

2009 IEEE conference on computer vision and pattern recognition , pages=

Imagenet: A large-scale hierarchical image database , author=. 2009 IEEE conference on computer vision and pattern recognition , pages=. 2009 , organization=

work page 2009
[48]

2011 , publisher=

The caltech-ucsd birds-200-2011 dataset , author=. 2011 , publisher=

work page 2011
[49]

Proceedings of the ieee/cvf International Conference on computer vision , pages=

Scaling and benchmarking self-supervised visual representation learning , author=. Proceedings of the ieee/cvf International Conference on computer vision , pages=

work page
[50]

MSB based new hybrid image compression technique for wireless transmission , author=. Advances in Computing and Information Technology: Proceedings of the Second International Conference on Advances in Computing and Information Technology (ACITY) July 13-15, 2012, Chennai, India-Volume 2 , pages=. 2013 , organization=

work page 2012
[51]

Entropy , volume=

On the performance of video resolution, motion and dynamism in transmission using near-capacity transceiver for wireless communication , author=. Entropy , volume=. 2021 , publisher=

work page 2021
[52]

IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

Contextual transformer networks for visual recognition , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

work page
[53]

Proceedings of NAACL-HLT , pages=

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , author=. Proceedings of NAACL-HLT , pages=

work page
[54]

Robotics and Autonomous Systems , volume=

RiSH: A robot-integrated smart home for elderly care , author=. Robotics and Autonomous Systems , volume=. 2018 , publisher=

work page 2018
[55]

Artificial Intelligence Review , volume=

Applications, databases and open computer vision research from drone videos and images: a survey , author=. Artificial Intelligence Review , volume=. 2021 , publisher=

work page 2021
[56]

International Conference on Machine Learning , pages=

Barlow twins: Self-supervised learning via redundancy reduction , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[57]

Advances in Neural Information Processing Systems , volume=

Back razor: Memory-efficient transfer learning by self-sparsified backpropagation , author=. Advances in Neural Information Processing Systems , volume=

work page
[58]

IEEE Transactions on Evolutionary Computation , volume=

Differential Evolution-Based Feature Selection: A Niching-Based Multiobjective Approach , author=. IEEE Transactions on Evolutionary Computation , volume=. 2022 , publisher=

work page 2022
[59]

2016 3rd international conference on computing for sustainable global development (INDIACom) , pages=

A review of supervised machine learning algorithms , author=. 2016 3rd international conference on computing for sustainable global development (INDIACom) , pages=. 2016 , organization=

work page 2016
[60]

Advances in Neural Information Processing Systems , volume=

Channel gating neural networks , author=. Advances in Neural Information Processing Systems , volume=

work page
[61]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Three Guidelines You Should Know for Universally Slimmable Self-Supervised Learning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[62]

Advances in Neural Information Processing Systems , volume=

Ressl: Relational self-supervised learning with weak augmentation , author=. Advances in Neural Information Processing Systems , volume=

work page
[63]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

Seed the views: Hierarchical semantic alignment for contrastive representation learning , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2022 , publisher=

work page 2022
[64]

Advances in neural information processing systems , volume=

Learning representations by maximizing mutual information across views , author=. Advances in neural information processing systems , volume=

work page
[65]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[66]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Self-supervised learning of pretext-invariant representations , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[67]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Hnssl: Hard negative-based self-supervised learning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[68]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

A simple data mixing prior for improving self-supervised learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[69]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Selfaugment: Automatic augmentation policies for self-supervised learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[70]

Fine-Grained Visual Classification of Aircraft

Fine-grained visual classification of aircraft , author=. arXiv preprint arXiv:1306.5151 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[71]

Proceedings IEEE Conference on Computer Vision and Pattern Recognition

Statistics of range images , author=. Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662) , volume=. 2000 , organization=

work page 2000
[72]

European Conference on Computer Vision , pages=

Fast-MoCo: Boost momentum-based contrastive learning with combinatorial patches , author=. European Conference on Computer Vision , pages=. 2022 , organization=

work page 2022
[73]

arXiv preprint arXiv:2103.13559 , year=

Rethinking self-supervised learning: Small is beautiful , author=. arXiv preprint arXiv:2103.13559 , year=

work page arXiv
[75]

The Twelfth International Conference on Learning Representations , year=

Waxing-and-waning: a generic similarity-based framework for efficient self-supervised learning , author=. The Twelfth International Conference on Learning Representations , year=

work page
[76]

arXiv preprint arXiv:2401.16694 , year=

etuner: A Redundancy-Aware Framework for Efficient Continual Learning Application on Edge Devices , author=. arXiv preprint arXiv:2401.16694 , year=

work page arXiv
[77]

2019 international conference on communications, information system and computer engineering (CISCE) , pages=

EEG signal classification method based on feature priority analysis and CNN , author=. 2019 international conference on communications, information system and computer engineering (CISCE) , pages=. 2019 , organization=

work page 2019
[78]

2019 11th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC) , volume=

A neural network-based teaching style analysis model , author=. 2019 11th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC) , volume=. 2019 , organization=

work page 2019
[79]

Multimedia Tools and Applications , volume=

An adaptive regression based single-image super-resolution , author=. Multimedia Tools and Applications , volume=. 2022 , publisher=

work page 2022
[80]

The Eleventh International Conference on Learning Representations , year=

SmartFRZ: An Efficient Training Framework using Attention-Based Layer Freezing , author=. The Eleventh International Conference on Learning Representations , year=

work page
[81]

Advances in Neural Information Processing Systems , volume=

Mest: Accurate and fast memory-economic sparse training framework on the edge , author=. Advances in Neural Information Processing Systems , volume=

work page

Showing first 80 references.

[1] [1]

Proceedings of the IEEE international conference on computer vision , pages=

Segflow: Joint learning for video object segmentation and optical flow , author=. Proceedings of the IEEE international conference on computer vision , pages=

work page

[2] [2]

Artificial Intelligence Review , volume=

Optical flow for video super-resolution: A survey , author=. Artificial Intelligence Review , volume=. 2022 , publisher=

work page 2022

[3] [3]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Learning accurate dense correspondences and when to trust them , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page

[4] [4]

IEEE Transactions on Image Processing , volume=

MSA-Net: Establishing reliable correspondences by multiscale attention network , author=. IEEE Transactions on Image Processing , volume=. 2022 , publisher=

work page 2022

[5] [5]

IEEE transactions on pattern analysis and machine intelligence , volume=

Memc-net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2019 , publisher=

work page 2019

[6] [6]

IEEE Transactions on Multimedia , volume=

Dynamic motion estimation and evolution video prediction network , author=. IEEE Transactions on Multimedia , volume=. 2020 , publisher=

work page 2020

[7] [7]

arXiv preprint arXiv:2202.07800 , year=

Not all patches are what you need: Expediting vision transformers via token reorganizations , author=. arXiv preprint arXiv:2202.07800 , year=

work page arXiv

[8] [8]

International conference on machine learning , pages=

A simple framework for contrastive learning of visual representations , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020

[9] [9]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Hardness-aware deep metric learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page

[10] [10]

Gao, Tianyu and Yao, Xingcheng and Chen, Danqi , booktitle=

work page

[11] [11]

2018 IEEE international conference on robotics and automation (ICRA) , pages=

Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation , author=. 2018 IEEE international conference on robotics and automation (ICRA) , pages=. 2018 , organization=

work page 2018

[12] [12]

Unsupervised Representation Learning by Predicting Image Rotations

Unsupervised representation learning by predicting image rotations , author=. arXiv preprint arXiv:1803.07728 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. arXiv preprint arXiv:1810.04805 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

European conference on computer vision , pages=

Unsupervised learning of visual representations by solving jigsaw puzzles , author=. European conference on computer vision , pages=. 2016 , organization=

work page 2016

[15] [15]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Momentum contrast for unsupervised visual representation learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page

[16] [16]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

An image is worth 16x16 words: Transformers for image recognition at scale , author=. arXiv preprint arXiv:2010.11929 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2010

[17] [17]

Science China Information Sciences , volume=

A unified pruning framework for vision transformers , author=. Science China Information Sciences , volume=. 2023 , publisher=

work page 2023

[18] [18]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Width & depth pruning for vision transformers , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page

[19] [19]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Patch slimming for efficient vision transformers , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[20] [20]

Advances in neural information processing systems , volume=

Dynamicvit: Efficient vision transformers with dynamic token sparsification , author=. Advances in neural information processing systems , volume=

work page

[21] [21]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

A-vit: Adaptive tokens for efficient vision transformer , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[22] [22]

arXiv preprint arXiv:2305.17530 , year=

Pumer: Pruning and merging tokens for efficient vision language models , author=. arXiv preprint arXiv:2305.17530 , year=

work page arXiv

[23] [23]

Advances in neural information processing systems , volume=

Bootstrap your own latent-a new approach to self-supervised learning , author=. Advances in neural information processing systems , volume=

work page

[24] [24]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Exploring simple siamese representation learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page

[25] [25]

International Conference on Machine Learning , pages=

Toward understanding the feature learning process of self-supervised contrastive learning , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021

[26] [26]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

Accelerating Self-Supervised Learning via Efficient Training Strategies , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

work page

[27] [27]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Contrastive dual gating: Learning sparse features with contrastive learning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[28] [28]

2021 58th ACM/IEEE Design Automation Conference (DAC) , pages=

Enabling on-device self-supervised contrastive learning with selective data contrast , author=. 2021 58th ACM/IEEE Design Automation Conference (DAC) , pages=. 2021 , organization=

work page 2021

[29] [29]

International Conference on Machine Learning , pages=

Rigging the lottery: Making all tickets winners , author=. International Conference on Machine Learning , pages=. 2020 , organization=

work page 2020

[30] [30]

Advances in Neural Information Processing Systems , volume=

Layer Freezing & Data Sieving: Missing Pieces of a Generic Framework for Sparse Training , author=. Advances in Neural Information Processing Systems , volume=

work page

[31] [31]

IEEE Micro , volume=

Sustainable ai processing at the edge , author=. IEEE Micro , volume=. 2022 , publisher=

work page 2022

[32] [32]

Companion Proceedings of the Web Conference 2022 , pages=

Optimizing Data Layout for Training Deep Neural Networks , author=. Companion Proceedings of the Web Conference 2022 , pages=

work page 2022

[33] [33]

International Conference on Machine Learning , pages=

Self-damaging contrastive learning , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021

[34] [34]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Emerging properties in self-supervised vision transformers , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

work page

[35] [35]

Advances in neural information processing systems , volume=

What makes for good views for contrastive learning? , author=. Advances in neural information processing systems , volume=

work page

[36] [36]

Improved Baselines with Momentum Contrastive Learning

Improved baselines with momentum contrastive learning , author=. arXiv preprint arXiv:2003.04297 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2003

[37] [37]

Science China Technological Sciences , volume=

Modeling of nano piezoelectric actuator based on block matching algorithm with optimal block size , author=. Science China Technological Sciences , volume=. 2013 , publisher=

work page 2013

[38] [38]

Proceedings of the IEEE international conference on computer vision workshops , pages=

3d object representations for fine-grained categorization , author=. Proceedings of the IEEE international conference on computer vision workshops , pages=

work page

[39] [39]

Optics Express , volume=

Occlusion removal method of partially occluded 3D object using sub-image block matching in computational integral imaging , author=. Optics Express , volume=. 2008 , publisher=

work page 2008

[40] [40]

SSIM , author=

Image quality metrics: PSNR vs. SSIM , author=. 2010 20th international conference on pattern recognition , pages=. 2010 , organization=

work page 2010

[41] [41]

IEEE transactions on Image Processing , volume=

A new diamond search algorithm for fast block-matching motion estimation , author=. IEEE transactions on Image Processing , volume=. 2000 , publisher=

work page 2000

[42] [42]

European conference on computer vision , pages=

Visualizing and understanding convolutional networks , author=. European conference on computer vision , pages=. 2014 , organization=

work page 2014

[43] [43]

Advances in neural information processing systems , volume=

How transferable are features in deep neural networks? , author=. Advances in neural information processing systems , volume=

work page

[44] [44]

The Eleventh International Conference on Learning Representations , year=

Which Layer is Learning Faster? A Systematic Exploration of Layer-wise Convergence Rate for Deep Neural Networks , author=. The Eleventh International Conference on Learning Representations , year=

work page

[45] [45]

European Conference on Computer Vision , pages=

Towards Efficient and Effective Self-Supervised Learning of Visual Representations , author=. European Conference on Computer Vision , pages=. 2022 , organization=

work page 2022

[46] [46]

2009 , publisher=

Learning multiple layers of features from tiny images , author=. 2009 , publisher=

work page 2009

[47] [47]

2009 IEEE conference on computer vision and pattern recognition , pages=

Imagenet: A large-scale hierarchical image database , author=. 2009 IEEE conference on computer vision and pattern recognition , pages=. 2009 , organization=

work page 2009

[48] [48]

2011 , publisher=

The caltech-ucsd birds-200-2011 dataset , author=. 2011 , publisher=

work page 2011

[49] [49]

Proceedings of the ieee/cvf International Conference on computer vision , pages=

Scaling and benchmarking self-supervised visual representation learning , author=. Proceedings of the ieee/cvf International Conference on computer vision , pages=

work page

[50] [50]

MSB based new hybrid image compression technique for wireless transmission , author=. Advances in Computing and Information Technology: Proceedings of the Second International Conference on Advances in Computing and Information Technology (ACITY) July 13-15, 2012, Chennai, India-Volume 2 , pages=. 2013 , organization=

work page 2012

[51] [51]

Entropy , volume=

On the performance of video resolution, motion and dynamism in transmission using near-capacity transceiver for wireless communication , author=. Entropy , volume=. 2021 , publisher=

work page 2021

[52] [52]

IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

Contextual transformer networks for visual recognition , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

work page

[53] [53]

Proceedings of NAACL-HLT , pages=

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , author=. Proceedings of NAACL-HLT , pages=

work page

[54] [54]

Robotics and Autonomous Systems , volume=

RiSH: A robot-integrated smart home for elderly care , author=. Robotics and Autonomous Systems , volume=. 2018 , publisher=

work page 2018

[55] [55]

Artificial Intelligence Review , volume=

Applications, databases and open computer vision research from drone videos and images: a survey , author=. Artificial Intelligence Review , volume=. 2021 , publisher=

work page 2021

[56] [56]

International Conference on Machine Learning , pages=

Barlow twins: Self-supervised learning via redundancy reduction , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021

[57] [57]

Advances in Neural Information Processing Systems , volume=

Back razor: Memory-efficient transfer learning by self-sparsified backpropagation , author=. Advances in Neural Information Processing Systems , volume=

work page

[58] [58]

IEEE Transactions on Evolutionary Computation , volume=

Differential Evolution-Based Feature Selection: A Niching-Based Multiobjective Approach , author=. IEEE Transactions on Evolutionary Computation , volume=. 2022 , publisher=

work page 2022

[59] [59]

2016 3rd international conference on computing for sustainable global development (INDIACom) , pages=

A review of supervised machine learning algorithms , author=. 2016 3rd international conference on computing for sustainable global development (INDIACom) , pages=. 2016 , organization=

work page 2016

[60] [60]

Advances in Neural Information Processing Systems , volume=

Channel gating neural networks , author=. Advances in Neural Information Processing Systems , volume=

work page

[61] [61]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Three Guidelines You Should Know for Universally Slimmable Self-Supervised Learning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[62] [62]

Advances in Neural Information Processing Systems , volume=

Ressl: Relational self-supervised learning with weak augmentation , author=. Advances in Neural Information Processing Systems , volume=

work page

[63] [63]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

Seed the views: Hierarchical semantic alignment for contrastive representation learning , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2022 , publisher=

work page 2022

[64] [64]

Advances in neural information processing systems , volume=

Learning representations by maximizing mutual information across views , author=. Advances in neural information processing systems , volume=

work page

[65] [65]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page

[66] [66]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Self-supervised learning of pretext-invariant representations , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page

[67] [67]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Hnssl: Hard negative-based self-supervised learning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[68] [68]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

A simple data mixing prior for improving self-supervised learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page

[69] [69]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Selfaugment: Automatic augmentation policies for self-supervised learning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page

[70] [70]

Fine-Grained Visual Classification of Aircraft

Fine-grained visual classification of aircraft , author=. arXiv preprint arXiv:1306.5151 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[71] [71]

Proceedings IEEE Conference on Computer Vision and Pattern Recognition

Statistics of range images , author=. Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662) , volume=. 2000 , organization=

work page 2000

[72] [72]

European Conference on Computer Vision , pages=

Fast-MoCo: Boost momentum-based contrastive learning with combinatorial patches , author=. European Conference on Computer Vision , pages=. 2022 , organization=

work page 2022

[73] [73]

arXiv preprint arXiv:2103.13559 , year=

Rethinking self-supervised learning: Small is beautiful , author=. arXiv preprint arXiv:2103.13559 , year=

work page arXiv

[74] [75]

The Twelfth International Conference on Learning Representations , year=

Waxing-and-waning: a generic similarity-based framework for efficient self-supervised learning , author=. The Twelfth International Conference on Learning Representations , year=

work page

[75] [76]

arXiv preprint arXiv:2401.16694 , year=

etuner: A Redundancy-Aware Framework for Efficient Continual Learning Application on Edge Devices , author=. arXiv preprint arXiv:2401.16694 , year=

work page arXiv

[76] [77]

2019 international conference on communications, information system and computer engineering (CISCE) , pages=

EEG signal classification method based on feature priority analysis and CNN , author=. 2019 international conference on communications, information system and computer engineering (CISCE) , pages=. 2019 , organization=

work page 2019

[77] [78]

2019 11th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC) , volume=

A neural network-based teaching style analysis model , author=. 2019 11th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC) , volume=. 2019 , organization=

work page 2019

[78] [79]

Multimedia Tools and Applications , volume=

An adaptive regression based single-image super-resolution , author=. Multimedia Tools and Applications , volume=. 2022 , publisher=

work page 2022

[79] [80]

The Eleventh International Conference on Learning Representations , year=

SmartFRZ: An Efficient Training Framework using Attention-Based Layer Freezing , author=. The Eleventh International Conference on Learning Representations , year=

work page

[80] [81]

Advances in Neural Information Processing Systems , volume=

Mest: Accurate and fast memory-economic sparse training framework on the edge , author=. Advances in Neural Information Processing Systems , volume=

work page