Deep kernel video approximation for unsupervised action segmentation

Jouke Dijkstra; Silvia L. Pintea

arxiv: 2604.21572 · v1 · submitted 2026-04-23 · 💻 cs.CV

Pith reviewed 2026-05-09 22:41 UTC · model grok-4.3

keywords: kernel, approximation, distribution, metric, space, video, when, action

The pith

Unsupervised per-video action segmentation via deep kernel approximation of frame distributions using NTKs and MMD achieves competitive benchmark results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Action segmentation divides a video into segments where each segment corresponds to a distinct action such as walking or waving. Standard approaches require large labeled datasets for training. This work performs the task on one video at a time without labels by learning an approximation to the video's frame distribution inside a kernel space. Neural tangent kernels are chosen because they capture complex patterns more effectively than fixed kernels and avoid trivial solutions when the approximation and kernel are learned together. Closeness between the real video distribution and the approximation is measured with maximum mean discrepancy, a metric that preserves geometric structure in the data space and is easier to optimize than optimal transport. The resulting segments are evaluated on six standard action segmentation benchmarks where the method matches state-of-the-art per-video unsupervised techniques. It also reports higher F1 scores than earlier agglomerative clustering approaches when the number of segments is not known in advance.
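The closeness measure described above can be illustrated with a small numerical sketch. The snippet below computes the biased (V-statistic) estimate of squared MMD between a toy set of frame features and two candidate approximations. It is an illustration under stated assumptions, not the paper's implementation: a plain Gaussian kernel stands in for the NTK-based kernel, and the names `gaussian_kernel`, `mean_kernel`, `mmd2`, `good_approx`, and `poor_approx`, as well as the toy data, are hypothetical.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs (x, y)."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd2(frames, approx, kernel=gaussian_kernel):
    """Biased estimate of squared MMD between two sample sets."""
    return (mean_kernel(frames, frames, kernel)
            - 2.0 * mean_kernel(frames, approx, kernel)
            + mean_kernel(approx, approx, kernel))

# Toy "video" (assumption): 40 two-dimensional frame features
# drawn around two well-separated action centres.
rng = random.Random(0)
frames = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(20)]
frames += [[rng.gauss(3.0, 0.1), rng.gauss(3.0, 0.1)] for _ in range(20)]

good_approx = [[0.0, 0.0], [3.0, 3.0]]  # one point per action
poor_approx = [[0.0, 0.0], [0.0, 0.0]]  # ignores the second action

print("MMD^2 (good approximation):", mmd2(frames, good_approx))
print("MMD^2 (poor approximation):", mmd2(frames, poor_approx))
```

In this toy setting the approximation that places one point on each action centre gives the smaller value, which is the behaviour a learned approximation would be optimized towards.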

Core claim

We show competitive results when compared to state-of-the-art per-video methods, on six standard benchmarks. Additionally, our method has higher F1 scores than prior agglomerative work, when the number of segments is unknown.

Load-bearing premise

That maximum mean discrepancy in neural tangent kernel space provides a reliable, non-trivial measure of closeness for video frame distributions that directly yields accurate action boundaries.

Figures

Figures reproduced from arXiv: 2604.21572 by Jouke Dijkstra, Silvia L. Pintea.

**Figure 1.** We learn a small video approximation ṽ_m ∼ Q (in green) such that its distribution is "as close as possible" to the distribution of the real video frames v_i ∼ P (in orange). To define "as close as possible" we rely on a geometry-preserving metric, MMD (maximum mean discrepancy), which minimizes the distance between the kernel means µ_Q and µ_P. into account: OT (optimal transport) and MMD (maximum mean discrepa…

**Figure 2.** We learn the video approximation ṽ by optimizing the MMD² loss over a product of NTK and Gaussian kernels. Non-parametric yet descriptive deep kernels: infinite-NTKs. MMD relies on kernel functions k(·, ·). Learning the kernel function k(·, ·) together with the video approximation ṽ leads to a trivial solution. Therefore, we need to rely on fixed kernels, such as the exponential family kernels. Howe…

**Figure 3.** Model choices: (a) Kernel choice on 50 Salads [39]. The combination of the Gaussian kernel and NTK, as given by Eq. (5), is the most accurate. (b) Input smoothing σ. We consider three datasets with varying video lengths: 50 Salads [39] (long), Desktop Assembly [24] (medium), and YTI [40] (short). The choice of the smoothing hyperparameter s has an effect, especially on long videos. s N M and we vary the smo…

**Figure 4.** Failure cases. Our method fails to correctly segment very similar actions, as in the case of Desktop Assembly. Another failure is when distinct actions are grouped together, such as the background class on the challenging MPII Cooking 2 dataset. the smoothing hyperparameter together with the video approximation is possible, however this cannot be directly used in combination with batched data. And batched…
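The Figure 2 caption mentions an MMD² loss over a product of NTK and Gaussian kernels. Eq. (5) of the paper is not reproduced on this page, so the following is only a sketch of what such a product kernel could look like. It assumes the closed-form infinite-width NTK of a two-layer ReLU network without biases (written with the degree-0 and degree-1 arc-cosine kernels, in one common normalization) as a stand-in for the paper's NTK. The function names `relu_ntk` and `product_kernel` and the `bandwidth` parameter are hypothetical.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def relu_ntk(x, y):
    """Closed-form NTK of an infinitely wide two-layer ReLU network (no biases).

    Assumed stand-in for the paper's NTK; normalization conventions vary.
    """
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    dot = sum(a * b for a, b in zip(x, y))
    u = max(-1.0, min(1.0, dot / (norm_x * norm_y)))  # clamp cosine for acos
    theta = math.acos(u)
    kappa0 = (math.pi - theta) / math.pi  # arc-cosine kernel, degree 0
    kappa1 = (u * (math.pi - theta) + math.sqrt(1.0 - u * u)) / math.pi  # degree 1
    return norm_x * norm_y * (u * kappa0 + kappa1)

def product_kernel(x, y, bandwidth=1.0):
    """Pointwise product of the Gaussian kernel and the ReLU NTK."""
    return gaussian_kernel(x, y, bandwidth) * relu_ntk(x, y)

a, b = [1.0, 0.0], [0.0, 1.0]
print("NTK(a, a)     =", relu_ntk(a, a))
print("NTK(a, b)     =", relu_ntk(a, b))
print("product(a, b) =", product_kernel(a, b))
```

The product of two positive semi-definite kernels is again positive semi-definite (Schur product theorem), so a kernel built this way remains a valid choice for MMD. Neither factor has trainable parameters, which is consistent with the caption's point that the kernel is not learned jointly with the approximation.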

Original abstract

This work focuses on per-video unsupervised action segmentation, which is of interest to applications where storing large datasets is either not possible, or nor permitted. We propose to segment videos by learning in deep kernel space, to approximate the underlying frame distribution, as closely as possible. To define this closeness metric between the original video distribution and its approximation, we rely on maximum mean discrepancy (MMD) which is a geometry-preserving metric in distribution space, and thus gives more reliable estimates. Moreover, unlike the commonly used optimal transport metric, MMD is both easier to optimize, and faster. We choose to use neural tangent kernels (NTKs) to define the kernel space where MMD operates, because of their improved descriptive power as opposed to fixed kernels. And, also, because NTKs sidestep the trivial solution, when jointly learning the inputs (video approximation) and the kernel function. Finally, we show competitive results when compared to state-of-the-art per-video methods, on six standard benchmarks. Additionally, our method has higher F1 scores than prior agglomerative work, when the number of segments is unknown.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a per-video unsupervised action segmentation method that learns an approximation to the underlying frame distribution in neural tangent kernel (NTK) space by minimizing maximum mean discrepancy (MMD). It positions MMD as a geometry-preserving and computationally preferable alternative to optimal transport, claims that NTKs avoid trivial solutions when jointly optimizing the approximation and kernel, and reports competitive results against state-of-the-art per-video methods on six standard benchmarks plus higher F1 scores than prior agglomerative work when the number of segments is unknown.

Significance. If the empirical claims are substantiated, the work could offer a practical per-video approach suitable for storage-constrained settings while leveraging the descriptive power of NTKs and the optimization advantages of MMD. The explicit contrast with optimal transport and the handling of unknown segment counts are potentially useful contributions, but the soundness of the central empirical results remains difficult to assess from the supplied text.

major comments (2)

[Method (and Abstract)] The core assumption that minimizing MMD between original and approximated frame sets in NTK space yields representations from which accurate action boundaries can be recovered is load-bearing yet insufficiently justified. MMD is a permutation-invariant metric on distributions; matching marginal frame statistics can succeed without preserving temporal order or transition points. The manuscript must demonstrate (via explicit construction or controlled experiment) that the learned approximation encodes sequential structure beyond a bag-of-frames match, otherwise the segmentation step cannot be guaranteed to produce reliable boundaries.
[Experiments] The abstract states that competitive results are shown on six benchmarks, yet the provided text supplies no experimental protocol, quantitative tables, ablation studies, error analysis, or details on how the approximated distribution is converted into segment boundaries (change-point detection, clustering, etc.). Without these elements the central empirical claim cannot be verified.
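The permutation-invariance point in the first major comment can be checked numerically. The sketch below computes a biased squared-MMD estimate between a toy frame sequence and a fixed approximation, once with the frames in temporal order and once after shuffling them. It assumes a Gaussian kernel and one-dimensional toy features; the names `mmd2`, `video`, `approx`, and `shuffled` are illustrative and do not come from the manuscript.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2(xs, ys, kernel=gaussian_kernel):
    """Biased estimate of squared MMD between two sample sets."""
    def mean_k(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))
    return mean_k(xs, xs) - 2.0 * mean_k(xs, ys) + mean_k(ys, ys)

# Toy sequence (assumption): 15 frames of one action followed by 15 of another.
rng = random.Random(1)
video = [[rng.gauss(0.0, 0.1)] for _ in range(15)]
video += [[rng.gauss(2.0, 0.1)] for _ in range(15)]
approx = [[0.0], [2.0]]

# Destroy the temporal order while keeping the same set of frames.
shuffled = video[:]
rng.shuffle(shuffled)

in_order = mmd2(video, approx)
out_of_order = mmd2(shuffled, approx)
print("difference:", abs(in_order - out_of_order))  # floating-point rounding only
```

Because the estimate only sums kernel values over pairs, reordering the frames leaves it unchanged up to rounding. Any temporal information therefore has to enter through another route, such as the input smoothing mentioned in the Figure 3 caption or the boundary-recovery step.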

minor comments (1)

[Abstract] Typo in abstract: 'nor permitted' should read 'not permitted'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below, clarifying our approach and committing to revisions that strengthen the justification and experimental details.

Point-by-point responses

Referee: [Method (and Abstract)] The core assumption that minimizing MMD between original and approximated frame sets in NTK space yields representations from which accurate action boundaries can be recovered is load-bearing yet insufficiently justified. MMD is a permutation-invariant metric on distributions; matching marginal frame statistics can succeed without preserving temporal order or transition points. The manuscript must demonstrate (via explicit construction or controlled experiment) that the learned approximation encodes sequential structure beyond a bag-of-frames match, otherwise the segmentation step cannot be guaranteed to produce reliable boundaries.

Authors: We appreciate the referee's observation on the permutation-invariance of MMD. In our per-video setting, the approximation is optimized jointly with the NTK to capture the video's frame statistics in a high-capacity kernel space, after which segmentation proceeds by applying temporal change-point detection directly to the sequence of original frames using distances induced by the learned NTK (or to the approximated set when it retains ordering). This is not a pure bag-of-frames match because the NTK is data-dependent and the optimization is performed on the full video sequence. To make this explicit, we will add a controlled ablation in the revised manuscript that compares boundary recovery when the approximation is allowed to reorder frames versus when temporal order is preserved, demonstrating that the learned NTK-MMD approximation retains sufficient sequential information for reliable change-point detection. revision: yes
Referee: [Experiments] The abstract states that competitive results are shown on six benchmarks, yet the provided text supplies no experimental protocol, quantitative tables, ablation studies, error analysis, or details on how the approximated distribution is converted into segment boundaries (change-point detection, clustering, etc.). Without these elements the central empirical claim cannot be verified.

Authors: We acknowledge that the initial submission text did not sufficiently detail the experimental protocol. The full manuscript contains results on the six standard benchmarks (Breakfast, 50Salads, YouTube, etc.), with F1 scores under both known and unknown segment counts, comparisons to prior per-video and agglomerative methods, and a description of the boundary recovery step (temporal clustering on the NTK-induced distances from the approximated distribution). We will expand this section in revision to include the full protocol, quantitative tables, ablation studies on kernel choice and MMD regularization, and error analysis, ensuring all elements needed to verify the claims are present and clearly explained. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation applies established MMD/NTK concepts independently

Full rationale

The paper defines a new per-video unsupervised segmentation pipeline by optimizing MMD between original and approximated frame sets inside NTK space, then applying change-point detection or clustering to the resulting representations. No equation or step reduces a claimed prediction to a quantity defined by the inputs (e.g., no fitted parameter is renamed as an independent result). No load-bearing self-citation chain is invoked to justify uniqueness or the kernel choice; NTKs are adopted for their known descriptive properties rather than by prior author work. The central claim therefore rests on the empirical performance on six benchmarks rather than on any definitional equivalence or fitted-input renaming. This is the normal non-circular outcome for an application paper that combines existing metrics with a new task.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review yields minimal explicit parameters or entities; the approach rests on standard assumptions about kernel metrics.

axioms (2)

domain assumption MMD is a geometry-preserving metric in distribution space
Invoked to justify the closeness measure between original video distribution and its approximation.
domain assumption NTKs have improved descriptive power as opposed to fixed kernels and sidestep trivial solutions when jointly learning inputs and kernel
Stated as the reason for selecting neural tangent kernels over alternatives.

pith-pipeline@v0.9.0 · 5487 in / 1269 out tokens · 48495 ms · 2026-05-09T22:41:53.688831+00:00 · methodology

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

[1] Z. Du, X. Wang, G. Zhou, and Q. Wang, “Fast and unsupervised action boundary detection for action segmentation,” in CVPR, 2022.

[2] S. Sarfraz, N. Murray, V. Sharma, A. Diba, L. Van Gool, and R. Stiefelhagen, “Temporally-weighted hierarchical clustering for unsupervised action segmentation,” in CVPR, 2021.

[3] F. Spurio, E. Bahrami, G. Francesca, and J. Gall, “Hierarchical vector quantization for unsupervised action segmentation,” in AAAI, 2025.

[4] Z. Xing and W. Zhao, “Unsupervised action segmentation via fast learning of semantically consistent actoms,” in AAAI, 2024.

[5] J. Feydy, T. Séjourné, F.-X. Vialard, S.-i. Amari, A. Trouvé, and G. Peyré, “Interpolating between optimal transport and MMD using Sinkhorn divergences,” in ICAIS, 2019.

[6] A. S. Ali, S. A. Mahmood, M. Saeed, A. Konin, M. Z. Zia, and Q.-H. Tran, “Joint self-supervised video alignment and action segmentation,” in ICCV, 2025.

[7] M. Xu and S. Gould, “Temporally consistent unbalanced optimal transport for unsupervised action segmentation,” in CVPR, 2024.

[8] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola, “A kernel method for the two-sample-problem,” NeurIPS, 2006.

[9] L. Wenliang, D. J. Sutherland, H. Strathmann, and A. Gretton, “Learning deep kernels for exponential family densities,” in ICML, 2019.

[10] J. Lee, S. Schoenholz, J. Pennington, B. Adlam, L. Xiao, R. Novak, and J. Sohl-Dickstein, “Finite versus infinite neural networks: an empirical study,” NeurIPS, 2020.

[11] G. Ding, F. Sener, and A. Yao, “Temporal action segmentation: An analysis of modern techniques,” TPAMI, 2023.

[12] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto, “Fast unsupervised ego-action learning for first-person sports videos,” in CVPR, 2011.

[13] F. Nater, H. Grabner, and L. Van Gool, “Temporal relations in videos for unsupervised activity analysis,” in BMVC, 2011.

[14] F. Sener and A. Yao, “Unsupervised learning and segmentation of complex activities from video,” in CVPR, 2018.

[15] R. G. VidalMata, W. J. Scheirer, A. Kukleva, D. Cox, and H. Kuehne, “Joint visual-temporal embedding for unsupervised learning of actions in untrimmed sequences,” in WACV, 2021.

[16] M. Dimiccoli and H. Wendt, “Learning event representations for temporal segmentation of image sequences by dynamic graph embedding,” TIP, 2020.

[17] A. Kukleva, H. Kuehne, F. Sener, and J. Gall, “Unsupervised learning of action classes with continuous temporal embedding,” in CVPR, 2019.

[18] Z. Wang, H. Chen, X. Li, C. Liu, Y. Xiong, J. Tighe, and C. Fowlkes, “Sscap: Self-supervised co-occurrence action parsing for unsupervised temporal action segmentation,” in WACV, 2022.

[19] E. Bueno-Benito, B. T. Vecino, and M. Dimiccoli, “Leveraging triplet loss for unsupervised action segmentation,” in CVPR, 2023.

[20] S. N. Aakur and S. Sarkar, “A perceptual prediction framework for self supervised event segmentation,” in CVPR, 2019.

[21] A. Garcia del Molino, J.-H. Lim, and A.-H. Tan, “Predicting visual context for unsupervised event segmentation in continuous photo-streams,” in ICM, 2018.

[22] S. Swetha, H. Kuehne, Y. S. Rawat, and M. Shah, “Unsupervised discriminative embedding for sub-action learning in complex activities,” in ICIP, 2021.

[23] J. Li and S. Todorovic, “Action shuffle alternating learning for unsupervised action segmentation,” in CVPR, 2021.

[24] S. Kumar, S. Haresh, A. Ahmed, A. Konin, M. Z. Zia, and Q.-H. Tran, “Unsupervised action segmentation by joint representation learning and online clustering,” in CVPR, 2022.

[25] L. Csató and M. Opper, “Sparse on-line Gaussian processes,” Neural Computation, 2002.

[26] A. Smola and P. Bartlett, “Sparse greedy Gaussian process regression,” NeurIPS, 2000.

[27] E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using pseudo-inputs,” NeurIPS, 2005.

[28] T. Nguyen, Z. Chen, and J. Lee, “Dataset meta-learning from kernel ridge-regression,” in ICLR, 2020.

[29] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” NeurIPS, 2018.

[30] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang, “On exact computation with an infinitely wide neural net,” NeurIPS, 2019.

[31] A. J. Smola, A. Gretton, and K. Borgwardt, “Maximum mean discrepancy,” in ICONIP, 2006.

[32] X. Cheng and Y. Xie, “Neural tangent kernel maximum mean discrepancy,” NeurIPS, 2021.

[33] A. Geifman, A. Yadav, Y. Kasten, M. Galun, D. Jacobs, and B. Ronen, “On the similarity between the Laplace and neural tangent kernels,” NeurIPS, 2020.

[34] S. Jia, E. Nezhadarya, Y. Wu, and J. Ba, “Efficient statistical tests: A neural tangent kernel approach,” in ICML, 2021.

[35] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” JMLR, 2012.

[36] K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al., “Kernel mean embedding of distributions: A review and beyond,” Foundations and Trends in Machine Learning, 2017.

[37] A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré, “Sample complexity of Sinkhorn divergences,” in ICAIS, PMLR, 2019.

[38] H. Kuehne, A. Arslan, and T. Serre, “The language of actions: Recovering the syntax and semantics of goal-directed human activities,” in CVPR, 2014.

[39] S. Stein and S. J. McKenna, “Combining embedded accelerometers with computer vision for recognizing food preparation activities,” in International Joint Conference on Pervasive and Ubiquitous Computing, 2013.

[40] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien, “Unsupervised learning from narrated instruction videos,” in CVPR, 2016.

[41] P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic, “Weakly supervised action labeling in videos under ordering constraints,” in ECCV, 2014.

[42] M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele, “Recognizing fine-grained and composite activities using hand-centric features and script data,” IJCV, 2016.

[43] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in ICCV, 2013.

[44] E. Bueno-Benito and M. Dimiccoli, “Clot: Closed loop optimal transport for unsupervised action segmentation,” in ICCV, pp. 10719–10729, 2025.

[45] G. Yang and E. J. Hu, “Feature learning in infinite-width neural networks,” CoRR, 2020.

[46] J. Sohl-Dickstein, R. Novak, S. S. Schoenholz, and J. Lee, “On the infinite width limit of neural networks with a standard parameterization,” CoRR, 2020.

[47] R. Novak, L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz, “Neural tangents: Fast and easy infinite neural networks in Python,” in ICLR, 2020.

[48] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein, “Deep neural networks as Gaussian processes,” ICLR, 2018.
