TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness

Coloma Ballester; Dimosthenis Karatzas; Pritam Mishra

arxiv: 2506.20588 · v2 · submitted 2025-06-25 · 💻 cs.CV

Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

Pith reviewed 2026-05-19 07:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords video summarization · self-supervised learning · Markov processes · temporal information · representativeness · SUMME · TVSUM

The pith

A self-supervised video summarization model uses Markov process losses to capture temporal dependencies without attention mechanisms or labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a self-supervised framework for selecting key frames from videos that avoids the need for human annotations and complex neural architectures like attention layers or transformers. It relies on a two-stage training process and new loss functions derived from Markov processes to measure how well frames represent the video's content and preserve relative timing information. This setup is intended to deliver both spatial and temporal understanding in an efficient way. If the approach holds, it would allow video summarization systems to train and run on new datasets without retraining costs or labeled data. The reported results on standard benchmarks show it surpassing prior unsupervised techniques and approaching the accuracy of supervised models.

Core claim

TRIM integrates Markov process-driven loss metrics that quantify temporal relative information and representativeness within a two-stage self-supervised learning paradigm, enabling the model to capture spatial and temporal dependencies in video data without attention, RNNs, or transformers, and achieving state-of-the-art performance among unsupervised methods on the SUMME and TVSUM datasets while rivaling the best supervised approaches.

What carries the argument

Markov process-driven loss metrics combined with a two-stage self-supervised learning paradigm that maximizes temporal relative information and representativeness to select summary frames.

If this is right

Training video summarizers becomes possible without any labeled data or human annotations.
Computational overhead drops by removing attention, RNN, and transformer components.
Models can transfer more readily across different video domains and datasets.
Real-time summarization on resource-limited devices becomes more feasible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Markov-based losses might extend to other sequence selection tasks such as key clip extraction in long-form video.
Removing reliance on complex architectures could simplify deployment in privacy-sensitive or on-device settings.
Further gains may appear if the two-stage paradigm is combined with lightweight convolutional backbones.

Load-bearing premise

The Markov process-driven loss metrics and two-stage self-supervised paradigm are sufficient to capture both spatial and temporal dependencies without attention, RNNs, or transformers.

What would settle it

Evaluating the model on SUMME or TVSUM under the standard unsupervised protocol and finding that its F-score or mAP falls below the best prior unsupervised baselines would disprove the performance claim.
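
The check described above depends on how the F-score is computed. This page does not spell out the evaluation protocol, so the following is only a minimal sketch of the overlap-based F-score commonly used for frame-level summaries. The function name `f_score` and the toy selection vectors are hypothetical, and aggregation over multiple annotators is left out.

```python
def f_score(pred, ref):
    """Overlap F-score between two binary frame-selection vectors.

    pred, ref: equal-length sequences of 0/1 flags (1 = frame is in the summary).
    Assumption: a single reference summary; multi-annotator aggregation is omitted.
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    n_pred = sum(1 for p in pred if p)
    n_ref = sum(1 for r in ref if r)
    if overlap == 0 or n_pred == 0 or n_ref == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_ref
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy vectors for illustration only (not taken from SUMME or TVSUM).
pred = [1, 1, 0, 0, 1, 0, 0, 0]
ref = [1, 0, 0, 0, 1, 1, 0, 0]
score = f_score(pred, ref)
print(f"F-score: {score:.4f}")
```

Comparing such scores against prior unsupervised baselines is only meaningful when the same splits and the same summary-length budget are used for every method.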

Figures

Figures reproduced from arXiv: 2506.20588 by Coloma Ballester, Dimosthenis Karatzas, Pritam Mishra.

**Figure 1.** Overview of Stage 1 − the proposed Self-supervised Pre-training method. Stage 2: Unsupervised fine-tuning: Stage 2 leverages the weights from pre-training to initialize the network, which is subsequently fine-tuned using the unsupervised loss functions described in Section 3.4. All extracted features for a given video are processed in a single batch through our proposed neural network described in …

**Figure 2.** Overview of Stage 2 − the proposed unsupervised fine-tuning method.

**Figure 3.** Ablation of mask ratio m and hyperparameter ν. Fig. 3a (left): Sensitivity of hyperparameter ν within LP RE loss function during two-stage training. Fig. 3b (right): Performance of proposed model when pre-trained with SSL at different mask ratio. The mask ratio could be between 0 to 0.5, where 0.5 represents that half of the frames in the video have been randomly masked, and this has been done randomly for…

**Figure 4.** Hyper-parameter sensitivity analysis for…

**Figure 5.** Ablation study [from Table 2] evaluating the contribution of each loss component, based…

**Figure 6.** Ablation study [from Table 2] evaluating the contribution of each loss component, based…

**Figure 7.** Comparison between mean and standard deviation of correlation coefficients across 10…

**Figure 8.** Cross-Validation Splits used by our proposed method on TVSUM dataset […]

**Figure 9.** Cross-Validation Splits used by our proposed method on SUMME dataset […]
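
The random frame masking mentioned in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name `random_frame_mask`, the rounding rule, and the seed handling are assumptions, and the captions do not say how masked frames are replaced (zeros, a learned token, or something else).

```python
import random


def random_frame_mask(num_frames, mask_ratio, seed=None):
    """Return a list of booleans, True where a frame is masked.

    mask_ratio is restricted to [0, 0.5], following the range quoted in the
    Figure 3 caption; 0.5 means half of the frames are masked at random.
    """
    if not 0.0 <= mask_ratio <= 0.5:
        raise ValueError("mask_ratio must lie in [0, 0.5]")
    rng = random.Random(seed)
    # Assumption: the number of masked frames is rounded to the nearest integer.
    num_masked = int(round(num_frames * mask_ratio))
    masked = set(rng.sample(range(num_frames), num_masked))
    return [i in masked for i in range(num_frames)]


mask = random_frame_mask(num_frames=10, mask_ratio=0.5, seed=0)
print(mask)
```

Sampling indices without replacement keeps the masked fraction exact for a given video length, which makes an ablation over the ratio easier to interpret.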

Original abstract

The increasing ubiquity of video content and the corresponding demand for efficient access to meaningful information have elevated video summarization and video highlights as a vital research area. However, many state-of-the-art methods depend heavily either on supervised annotations or on attention-based models, which are computationally expensive and brittle in the face of distribution shifts that hinder cross-domain applicability across datasets. We introduce a pioneering self-supervised video summarization model that captures both spatial and temporal dependencies without the overhead of attention, RNNs, or transformers. Our framework integrates a novel set of Markov process-driven loss metrics and a two-stage self-supervised learning paradigm that ensures both performance and efficiency. Our approach achieves state-of-the-art performance on the SUMME and TVSUM datasets, outperforming all existing unsupervised methods. It also rivals the best supervised models, demonstrating the potential for efficient, annotation-free architectures. This paves the way for more generalizable video summarization techniques and challenges the prevailing reliance on complex architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TRIM combines Markov-process losses with two-stage self-supervised training to do video summarization without attention or RNNs, but the memoryless property makes long-range temporal capture look like the weakest link.

TRIM is trying to replace attention and RNNs in video summarization with Markov process losses in a two-stage self-supervised setup, and it reports state-of-the-art unsupervised results on SUMME and TVSUM that come close to supervised models. The new part is the specific combination of those Markov-driven metrics with the training schedule. It aims for efficiency and better generalization across datasets by staying annotation-free and architecture-light. If the performance numbers check out in the full experiments, that could be a practical step for handling large video collections without heavy compute. The main concern is whether the Markov approach really gets the temporal structure. Standard Markov chains only look at the immediate previous state, so long-range frame dependencies might not be captured unless they added something extra that the abstract doesn't mention. The stress test points this out, and without seeing the exact equations or how they model the states, it's hard to tell if the gains are from the claimed method or from other choices like feature extractors. This work is aimed at researchers in computer vision looking for simpler self-supervised alternatives to current dominant architectures. Someone focused on scalable video processing would find it relevant to check out. I would send it for peer review. The idea has enough novelty and the benchmarks are there to make it worth a referee's time, even with likely questions on the temporal modeling.

Referee Report

2 major / 2 minor

Summary. The paper introduces TRIM, a self-supervised video summarization framework that employs novel Markov process-driven loss metrics together with a two-stage self-supervised training paradigm. The central claim is that this attention-free approach captures both spatial and temporal dependencies, achieves state-of-the-art results on the SUMME and TVSUM benchmarks, outperforms all prior unsupervised methods, and rivals the best supervised models.

Significance. If the performance claims are substantiated by rigorous ablations and statistical tests, the work would be significant for demonstrating that memory-efficient, non-attention-based models can reach competitive accuracy in video summarization. This would reduce reliance on computationally heavy architectures and annotated data, improving cross-domain applicability.

major comments (2)

[§3.2] §3.2 (Markov-driven loss metrics): The first-order Markov transition probabilities are memoryless by definition (P(X_{t+1}|X_t) depends only on the immediate predecessor). It is therefore unclear how the formulation encodes dependencies spanning many frames without explicit higher-order states, recurrence, or aggregation over longer windows. The SOTA claim on SUMME/TVSUM rests on this temporal modeling; a concrete derivation or ablation showing long-range capture is required.
[§4.3] §4.3 (experimental results): The abstract asserts outperformance over all unsupervised methods and parity with supervised ones, yet no error bars, statistical significance tests, or cross-validation details are referenced in the visible results summary. Without these, it is impossible to determine whether the reported gains are robust or arise from hyper-parameter tuning or feature preprocessing.
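
The memoryless property raised in the first major comment can be made concrete with a toy example. The sketch below is not the paper's loss formulation; it uses a hypothetical two-state transition matrix `P` and the standard Chapman-Kolmogorov relation (k-step transition probabilities are the k-th matrix power) to show that, in a first-order chain, any dependence on a distant state passes through the intermediate states and fades as the gap grows.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def k_step(p, k):
    """k-step transition matrix of a first-order chain: p raised to the power k."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, p)
    return result


# Hypothetical two-state chain, chosen only for illustration.
P = [[0.9, 0.1],
     [0.2, 0.8]]

for k in (1, 5, 50):
    pk = k_step(P, k)
    # As k grows the two rows converge: the start state stops mattering.
    print(k, [[round(x, 4) for x in row] for row in pk])
```

For this matrix the rows approach the stationary distribution (2/3, 1/3), so the state 50 steps ahead is almost independent of the starting state. That is why a first-order formulation needs some additional aggregation over longer windows if it is meant to capture long-range structure.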

minor comments (2)

[§3] The two-stage self-supervised paradigm is introduced in the abstract and §3 but lacks an explicit algorithmic outline or pseudocode; adding a clear diagram or numbered steps would improve reproducibility.
Notation for the representativeness and relative-information terms is introduced without a consolidated table of symbols; a short notation table would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and indicate the revisions we will incorporate to improve the paper.

Referee: [§3.2] §3.2 (Markov-driven loss metrics): The first-order Markov transition probabilities are memoryless by definition (P(X_{t+1}|X_t) depends only on the immediate predecessor). It is therefore unclear how the formulation encodes dependencies spanning many frames without explicit higher-order states, recurrence, or aggregation over longer windows. The SOTA claim on SUMME/TVSUM rests on this temporal modeling; a concrete derivation or ablation showing long-range capture is required.

Authors: We appreciate this observation on the temporal modeling. While individual transition probabilities are first-order, the Markov-driven losses are formulated over the complete sequence and combined with the representativeness term such that information propagates across multiple timesteps via repeated application of the transition dynamics. This yields effective long-range dependency capture without explicit higher-order states or recurrence. In the revised manuscript we will add an explicit multi-step derivation in §3.2 illustrating this propagation and include an ablation that varies the effective temporal horizon to empirically confirm long-range effects. revision: yes
Referee: [§4.3] §4.3 (experimental results): The abstract asserts outperformance over all unsupervised methods and parity with supervised ones, yet no error bars, statistical significance tests, or cross-validation details are referenced in the visible results summary. Without these, it is impossible to determine whether the reported gains are robust or arise from hyper-parameter tuning or feature preprocessing.

Authors: We agree that statistical rigor is necessary to support the performance claims. The current results are reported as point estimates; we will revise §4.3 to include error bars from multiple runs with varied random seeds, paired statistical significance tests against the leading baselines, and expanded details on the cross-validation protocol and preprocessing pipeline. revision: yes
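
The reporting promised in the second response could look roughly like the sketch below: a mean and standard deviation across seeds, plus a paired t-statistic against a baseline evaluated on the same seeds or splits. Every number here is a placeholder invented for illustration and is not a result from the paper; the helper names are hypothetical as well.

```python
import math
import statistics


def mean_and_std(scores):
    """Mean and sample standard deviation, suitable for error bars."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(a, b):
    """Paired t-statistic for two score lists measured on the same seeds/splits."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length lists with at least 2 entries")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder F-scores over five seeds; NOT taken from the paper.
model_scores = [0.50, 0.52, 0.51, 0.53, 0.49]
baseline_scores = [0.48, 0.49, 0.50, 0.50, 0.47]

mean, std = mean_and_std(model_scores)
t_stat = paired_t_statistic(model_scores, baseline_scores)
print(f"model: {mean:.3f} +/- {std:.3f}, paired t = {t_stat:.2f}, df = {len(model_scores) - 1}")
```

The t-statistic is then compared with the critical value for the chosen significance level and n − 1 degrees of freedom (about 2.776 for a two-sided test at 0.05 with 4 degrees of freedom). With only a handful of seeds, a non-parametric alternative such as the Wilcoxon signed-rank test may be a safer choice.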

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The provided abstract and claims introduce a self-supervised framework with Markov process-driven loss metrics and a two-stage paradigm to capture spatial-temporal dependencies without attention/RNNs/transformers, reporting SOTA on SUMME/TVSUM. No equations, fitting procedures, or self-citations are shown that would reduce any prediction or uniqueness claim to the inputs by construction. The central performance claims rest on external dataset benchmarks rather than tautological redefinitions or fitted-input renamings. The derivation is therefore self-contained against external benchmarks, with no load-bearing self-citation chains or ansatz smuggling detectable from the text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; the paper introduces new loss metrics whose internal definitions and any fitted scalars are not visible, so the ledger remains largely empty.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We propose a novel metric based on the entropy... Δ_t = |H_t − H_{t−1}| / H_t ... Γ_t = |H_t − (1/(t−1)) Σ H_i| / H_t (equations 1-2); L_UNSUP = α L_PTRIM + β L_PCTRIM + γ L_REP + L_SD
IndisputableMonolith/Foundation/DimensionForcing.lean alexander_duality_circle_linking unclear

simple 1D CNN... without attention, RNNs or transformers
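
The two entropy-based terms in the first quoted passage (the relative change against the previous frame and against the running mean of earlier frames) can be computed from a per-frame entropy sequence as sketched below. Several points are assumptions, since the quoted excerpt leaves them open: the sum in the second term is taken over frames 1 to t−1, both terms are defined from the second frame onward, and `shannon_entropy` is only one plausible way of obtaining H_t from non-negative frame features.

```python
import math


def shannon_entropy(weights):
    """Shannon entropy (natural log) of non-negative weights after normalization.

    Assumption: the excerpt does not say how H_t is obtained from a frame.
    """
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def delta(h):
    """|H_t - H_{t-1}| / H_t for t = 2..T, where h[0] holds H_1."""
    return [abs(h[t] - h[t - 1]) / h[t] for t in range(1, len(h))]


def gamma(h):
    """|H_t - mean(H_1..H_{t-1})| / H_t for t = 2..T (summation range assumed)."""
    out = []
    running_sum = 0.0
    for t in range(1, len(h)):
        running_sum += h[t - 1]
        mean_prev = running_sum / t  # t earlier frames in 0-based indexing
        out.append(abs(h[t] - mean_prev) / h[t])
    return out


# Toy entropy sequence for illustration; every H_t must be positive.
H = [1.0, 2.0, 4.0]
print(delta(H))
print(gamma(H))
```

The first term reacts only to the immediately preceding frame, while the second compares each frame with the average of everything before it, which is one way a formulation could bring longer history into play.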

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TRIMMER: A New Paradigm for Video Summarization through Self-Supervised Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 5.0

TRIMMER proposes a self-supervised RL method for video summarization that uses entropy-based rewards to capture temporal dynamics and semantic diversity, claiming SOTA results among unsupervised approaches.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Apostolidis, E., Adamantidou, E., Metsai, A.I., Mezaris, V., Patras, I.: Unsupervised video summarization via attention-driven adversarial learning. In: MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part I 26. pp. 492–504. Springer (2020)
[2] Apostolidis, E., Balaouras, G., Mezaris, V., Patras, I.: Combining global and local attention with positional encoding for video summarization. In: 2021 IEEE international symposium on multimedia (ISM). pp. 226–234. IEEE (2021)
[3] Apostolidis, E., Balaouras, G., Mezaris, V., Patras, I.: Summarizing videos using concentrated attention and considering the uniqueness and diversity of the video frames. In: Proceedings of the 2022 international conference on multimedia retrieval. pp. 407–415 (2022)
[4] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International conference on machine learning. pp. 1597–1607. PMLR (2020)
[5] Chen, Y., Tao, L., Wang, X., Yamasaki, T.: Weakly supervised video summarization by hierarchical reinforcement learning. In: Proceedings of the 1st ACM International Conference on Multimedia in Asia. pp. 1–6 (2019)
[6] Fajtl, J., Sokeh, H.S., Argyriou, V., Monekosso, D., Remagnino, P.: Summarizing videos with attention. In: Computer Vision–ACCV 2018 Workshops: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers 14. pp. 39–54. Springer (2019)
[7] Feydy, J., Séjourné, T., Vialard, F.X., Amari, S.i., Trouve, A., Peyré, G.: Interpolating between optimal transport and mmd using sinkhorn divergences. In: The 22nd International Conference on Artificial Intelligence and Statistics. pp. 2681–2690 (2019)
[8] Fu, H., Wang, H., Yang, J.: Video summarization with a dual attention capsule network. In: 2020 25th International Conference on Pattern Recognition (ICPR). pp. 446–451. IEEE (2021)
[9] Ghauri, J.A., Hakimov, S., Ewerth, R.: Supervised video summarization via multiple feature sets with parallel attention. In: 2021 IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6. IEEE (2021)
[10] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems 33, 21271–21284 (2020)
[11] Gygli, M., Grabner, H., Riemenschneider, H., Van Gool, L.: Creating summaries from user videos. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VII 13. pp. 505–520. Springer (2014)
[12] He, B., Wang, J., Qiu, J., Bui, T., Shrivastava, A., Wang, Z.: Align and attend: Multimodal summarization with dual contrastive losses. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14867–14878 (2023)
[13] He, X., Hua, Y., Song, T., Zhang, Z., Xue, Z., Ma, R., Robertson, N., Guan, H.: Unsupervised video summarization with attentive conditional generative adversarial networks. In: Proceedings of the 27th ACM International Conference on multimedia. pp. 2296–2304 (2019)
[14] Hsu, T.C., Liao, Y.S., Huang, C.R.: Video summarization with spatiotemporal vision transformer. IEEE Transactions on Image Processing 32, 3013–3026 (2023)
[15] Jiang, H., Mu, Y.: Joint video summarization and moment localization by cross-task sample transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16388–16398 (2022)
[16] Jung, Y., Cho, D., Kim, D., Woo, S., Kweon, I.S.: Discriminative feature learning for unsupervised video summarization. In: Proceedings of the AAAI Conference on artificial intelligence. vol. 33, pp. 8537–8544 (2019)
[17] Jung, Y., Cho, D., Woo, S., Kweon, I.S.: Global-and-local relative position embedding for unsupervised video summarization. In: European conference on computer vision. pp. 167–183. Springer (2020)
[18] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
[19] Li, H., Ke, Q., Gong, M., Drummond, T.: Progressive video summarization via multimodal self-supervised learning. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 5584–5593 (2023)
[20] Li, H., Ke, Q., Gong, M., Zhang, R.: Video joint modelling based on hierarchical transformer for co-summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(3), 3904–3917 (2022)
[21] Li, P., Ye, Q., Zhang, L., Yuan, L., Xu, X., Shao, L.: Exploring global diverse attention via pairwise temporal relation for video summarization. Pattern Recognition 111, 107677 (2021)
[22] Liang, G., Lv, Y., Li, S., Wang, X., Zhang, Y.: Video summarization with a dual-path attentive network. Neurocomputing 467, 1–9 (2022)
[23] Mahasseni, B., Lam, M., Todorovic, S.: Unsupervised video summarization with adversarial lstm networks. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 202–211 (2017)
[24] Mei, S., Guan, G., Wang, Z., He, M., Hua, X.S., Feng, D.D.: l2,0 constrained sparse dictionary selection for video summarization. In: 2014 IEEE international conference on multimedia and expo (ICME). pp. 1–6. IEEE (2014)
[25] Narasimhan, M., Rohrbach, A., Darrell, T.: Clip-it! language-guided video summarization. Advances in neural information processing systems 34, 13988–14000 (2021)
[26] Otani, M., Nakashima, Y., Rahtu, E., Heikkila, J.: Rethinking the evaluation of video summaries. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7596–7604 (2019)
[27] Potapov, D., Douze, M., Harchaoui, Z., Schmid, C.: Category-specific video summarization. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13. pp. 540–555. Springer (2014)
[28] Rochan, M., Ye, L., Wang, Y.: Video summarization using fully convolutional sequence networks. In: Proceedings of the European conference on computer vision (ECCV). pp. 347–363 (2018)
[29] Shim, M., Kim, T., Kim, J., Wee, D.: Masked autoencoder for unsupervised video summarization (2023), https://arxiv.org/abs/2306.01395
[30] Son, J., Park, J., Kim, K.: Csta: Cnn-based spatiotemporal attention for video summarization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18847–18856 (2024)
[31] Song, Y., Vallmitjana, J., Stent, A., Jaimes, A.: Tvsum: Summarizing web videos using titles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015)
[32] Terbouche, H., Morel, M., Rodriguez, M., Othmani, A.: Multi-annotation attention model for video summarization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3143–3152 (2023)
[33] Villani, C., et al.: Optimal transport: old and new, vol. 338. Springer (2008)
[34] Wang, J., Bai, Y., Long, Y., Hu, B., Chai, Z., Guan, Y., Wei, X.: Query twice: Dual mixture attention meta learning for video summarization. In: Proceedings of the 28th ACM international conference on multimedia. pp. 4023–4031 (2020)
[35] Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: Self-supervised learning via redundancy reduction. In: International Conference on Machine Learning. pp. 12310–12320. PMLR (2021)
[36] Zhang, K., Chao, W.L., Sha, F., Grauman, K.: Video summarization with long short-term memory. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14. pp. 766–782. Springer (2016)
[37] Zhang, Y., Liu, Y., Kang, W., Tao, R.: Vss-net: Visual semantic self-mining network for video summarization. IEEE Transactions on Circuits and Systems for Video Technology 34(4), 2775–2788 (2023)
[38] Zhao, B., Gong, M., Li, X.: Hierarchical multimodal transformer to summarize videos. Neurocomputing 468, 360–369 (2022)
[39] Zhao, B., Li, X., Lu, X.: Hierarchical recurrent neural network for video summarization. In: Proceedings of the 25th ACM international conference on Multimedia. pp. 863–871 (2017)
[40] Zhao, B., Li, X., Lu, X.: Hsa-rnn: Hierarchical structure-adaptive rnn for video summarization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7405–7414 (2018)
[41] Zhou, K., Qiao, Y., Xiang, T.: Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)
[42] Zhu, W., Han, Y., Lu, J., Zhou, J.: Relational reasoning over spatial-temporal graphs for video summarization. IEEE Transactions on Image Processing 31, 3017–3031 (2022)
[43] Zhu, W., Lu, J., Li, J., Zhou, J.: Dsnet: A flexible detect-to-summarize network for video summarization. IEEE Transactions on Image Processing 30, 948–962 (2020)

[1] [1]

In: MultiMedia Modeling: 26th Inter- national Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part I 26

Apostolidis, E., Adamantidou, E., Metsai, A.I., Mezaris, V ., Patras, I.: Unsupervised video summarization via attention-driven adversarial learning. In: MultiMedia Modeling: 26th Inter- national Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part I 26. pp. 492–504. Springer (2020)

work page 2020

[2] [2]

In: 2021 IEEE international symposium on multimedia (ISM)

Apostolidis, E., Balaouras, G., Mezaris, V ., Patras, I.: Combining global and local attention with positional encoding for video summarization. In: 2021 IEEE international symposium on multimedia (ISM). pp. 226–234. IEEE (2021)

work page 2021

[3] [3]

In: Proceedings of the 2022 international conference on multimedia retrieval

Apostolidis, E., Balaouras, G., Mezaris, V ., Patras, I.: Summarizing videos using concentrated attention and considering the uniqueness and diversity of the video frames. In: Proceedings of the 2022 international conference on multimedia retrieval. pp. 407–415 (2022)

work page 2022

[4] [4]

In: International conference on machine learning

Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International conference on machine learning. pp. 1597–1607. PMLR (2020)

work page 2020

[5] [5]

In: Proceedings of the 1st ACM International Conference on Multimedia in Asia

Chen, Y ., Tao, L., Wang, X., Yamasaki, T.: Weakly supervised video summarization by hierarchical reinforcement learning. In: Proceedings of the 1st ACM International Conference on Multimedia in Asia. pp. 1–6 (2019)

work page 2019

[6] [6]

In: Computer Vision–ACCV 2018 Workshops: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers 14

Fajtl, J., Sokeh, H.S., Argyriou, V ., Monekosso, D., Remagnino, P.: Summarizing videos with attention. In: Computer Vision–ACCV 2018 Workshops: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers 14. pp. 39–54. Springer (2019)

work page 2018

[7] Feydy, J., Séjourné, T., Vialard, F.X., Amari, S.i., Trouve, A., Peyré, G.: Interpolating between optimal transport and MMD using Sinkhorn divergences. In: The 22nd International Conference on Artificial Intelligence and Statistics. pp. 2681–2690 (2019)

[8] Fu, H., Wang, H., Yang, J.: Video summarization with a dual attention capsule network. In: 2020 25th International Conference on Pattern Recognition (ICPR). pp. 446–451. IEEE (2021)

[9] Ghauri, J.A., Hakimov, S., Ewerth, R.: Supervised video summarization via multiple feature sets with parallel attention. In: 2021 IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6. IEEE (2021)

[10] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent - a new approach to self-supervised learning. Advances in neural information processing systems 33, 21271–21284 (2020)

[11] Gygli, M., Grabner, H., Riemenschneider, H., Van Gool, L.: Creating summaries from user videos. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VII 13. pp. 505–520. Springer (2014)

[12] He, B., Wang, J., Qiu, J., Bui, T., Shrivastava, A., Wang, Z.: Align and attend: Multimodal summarization with dual contrastive losses. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14867–14878 (2023)

[13] He, X., Hua, Y., Song, T., Zhang, Z., Xue, Z., Ma, R., Robertson, N., Guan, H.: Unsupervised video summarization with attentive conditional generative adversarial networks. In: Proceedings of the 27th ACM International Conference on multimedia. pp. 2296–2304 (2019)

[14] Hsu, T.C., Liao, Y.S., Huang, C.R.: Video summarization with spatiotemporal vision transformer. IEEE Transactions on Image Processing 32, 3013–3026 (2023)

[15] Jiang, H., Mu, Y.: Joint video summarization and moment localization by cross-task sample transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16388–16398 (2022)

[16] Jung, Y., Cho, D., Kim, D., Woo, S., Kweon, I.S.: Discriminative feature learning for unsupervised video summarization. In: Proceedings of the AAAI Conference on artificial intelligence. vol. 33, pp. 8537–8544 (2019)

[17] Jung, Y., Cho, D., Woo, S., Kweon, I.S.: Global-and-local relative position embedding for unsupervised video summarization. In: European conference on computer vision. pp. 167–183. Springer (2020)

[18] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

[19] Li, H., Ke, Q., Gong, M., Drummond, T.: Progressive video summarization via multimodal self-supervised learning. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 5584–5593 (2023)

[20] Li, H., Ke, Q., Gong, M., Zhang, R.: Video joint modelling based on hierarchical transformer for co-summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(3), 3904–3917 (2022)

[21] Li, P., Ye, Q., Zhang, L., Yuan, L., Xu, X., Shao, L.: Exploring global diverse attention via pairwise temporal relation for video summarization. Pattern Recognition 111, 107677 (2021)

[22] Liang, G., Lv, Y., Li, S., Wang, X., Zhang, Y.: Video summarization with a dual-path attentive network. Neurocomputing 467, 1–9 (2022)

[23] Mahasseni, B., Lam, M., Todorovic, S.: Unsupervised video summarization with adversarial LSTM networks. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 202–211 (2017)

[24] Mei, S., Guan, G., Wang, Z., He, M., Hua, X.S., Feng, D.D.: l2,0 constrained sparse dictionary selection for video summarization. In: 2014 IEEE international conference on multimedia and expo (ICME). pp. 1–6. IEEE (2014)

[25] Narasimhan, M., Rohrbach, A., Darrell, T.: CLIP-It! Language-guided video summarization. Advances in neural information processing systems 34, 13988–14000 (2021)

[26] Otani, M., Nakashima, Y., Rahtu, E., Heikkila, J.: Rethinking the evaluation of video summaries. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7596–7604 (2019)

[27] Potapov, D., Douze, M., Harchaoui, Z., Schmid, C.: Category-specific video summarization. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13. pp. 540–555. Springer (2014)

[28] Rochan, M., Ye, L., Wang, Y.: Video summarization using fully convolutional sequence networks. In: Proceedings of the European conference on computer vision (ECCV). pp. 347–363 (2018)

[29] Shim, M., Kim, T., Kim, J., Wee, D.: Masked autoencoder for unsupervised video summarization (2023), https://arxiv.org/abs/2306.01395

[30] Son, J., Park, J., Kim, K.: CSTA: CNN-based spatiotemporal attention for video summarization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18847–18856 (2024)

[31] Song, Y., Vallmitjana, J., Stent, A., Jaimes, A.: TVSum: Summarizing web videos using titles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015)

[32] Terbouche, H., Morel, M., Rodriguez, M., Othmani, A.: Multi-annotation attention model for video summarization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3143–3152 (2023)

[33] Villani, C., et al.: Optimal transport: old and new, vol. 338. Springer (2008)

[34] Wang, J., Bai, Y., Long, Y., Hu, B., Chai, Z., Guan, Y., Wei, X.: Query twice: Dual mixture attention meta learning for video summarization. In: Proceedings of the 28th ACM international conference on multimedia. pp. 4023–4031 (2020)

[35] Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: Self-supervised learning via redundancy reduction. In: International Conference on Machine Learning. pp. 12310–12320. PMLR (2021)

[36] Zhang, K., Chao, W.L., Sha, F., Grauman, K.: Video summarization with long short-term memory. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14. pp. 766–782. Springer (2016)

[37] Zhang, Y., Liu, Y., Kang, W., Tao, R.: VSS-Net: Visual semantic self-mining network for video summarization. IEEE Transactions on Circuits and Systems for Video Technology 34(4), 2775–2788 (2023)

[38] Zhao, B., Gong, M., Li, X.: Hierarchical multimodal transformer to summarize videos. Neurocomputing 468, 360–369 (2022)

[39] Zhao, B., Li, X., Lu, X.: Hierarchical recurrent neural network for video summarization. In: Proceedings of the 25th ACM international conference on Multimedia. pp. 863–871 (2017)

[40] Zhao, B., Li, X., Lu, X.: HSA-RNN: Hierarchical structure-adaptive RNN for video summarization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7405–7414 (2018)

[41] Zhou, K., Qiao, Y., Xiang, T.: Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)

[42] Zhu, W., Han, Y., Lu, J., Zhou, J.: Relational reasoning over spatial-temporal graphs for video summarization. IEEE Transactions on Image Processing 31, 3017–3031 (2022)

[43] Zhu, W., Lu, J., Li, J., Zhou, J.: DSNet: A flexible detect-to-summarize network for video summarization. IEEE Transactions on Image Processing 30, 948–962 (2020)

A Appendix / supplemental material

A.1 Supplementary Evaluation Results

In this section, we further provide additional performance in terms of F1 score similar to early stage video summarizat...
