Deep kernel video approximation for unsupervised action segmentation
Pith reviewed 2026-05-09 22:41 UTC · model grok-4.3
The pith
Unsupervised per-video action segmentation via deep kernel approximation of frame distributions using NTKs and MMD achieves competitive benchmark results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show competitive results when compared to state-of-the-art per-video methods, on six standard benchmarks. Additionally, our method has higher F1 scores than prior agglomerative work, when the number of segments is unknown.
Load-bearing premise
That maximum mean discrepancy in neural tangent kernel space provides a reliable, non-trivial measure of closeness for video frame distributions that directly yields accurate action boundaries.
Figures
read the original abstract
This work focuses on per-video unsupervised action segmentation, which is of interest to applications where storing large datasets is either not possible, or nor permitted. We propose to segment videos by learning in deep kernel space, to approximate the underlying frame distribution, as closely as possible. To define this closeness metric between the original video distribution and its approximation, we rely on maximum mean discrepancy (MMD) which is a geometry-preserving metric in distribution space, and thus gives more reliable estimates. Moreover, unlike the commonly used optimal transport metric, MMD is both easier to optimize, and faster. We choose to use neural tangent kernels (NTKs) to define the kernel space where MMD operates, because of their improved descriptive power as opposed to fixed kernels. And, also, because NTKs sidestep the trivial solution, when jointly learning the inputs (video approximation) and the kernel function. Finally, we show competitive results when compared to state-of-the-art per-video methods, on six standard benchmarks. Additionally, our method has higher F1 scores than prior agglomerative work, when the number of segments is unknown.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a per-video unsupervised action segmentation method that learns an approximation to the underlying frame distribution in neural tangent kernel (NTK) space by minimizing maximum mean discrepancy (MMD). It positions MMD as a geometry-preserving and computationally preferable alternative to optimal transport, claims that NTKs avoid trivial solutions when jointly optimizing the approximation and kernel, and reports competitive results against state-of-the-art per-video methods on six standard benchmarks plus higher F1 scores than prior agglomerative work when the number of segments is unknown.
Significance. If the empirical claims are substantiated, the work could offer a practical per-video approach suitable for storage-constrained settings while leveraging the descriptive power of NTKs and the optimization advantages of MMD. The explicit contrast with optimal transport and the handling of unknown segment counts are potentially useful contributions, but the soundness of the central empirical results remains difficult to assess from the supplied text.
major comments (2)
- [Method (and Abstract)] The core assumption that minimizing MMD between original and approximated frame sets in NTK space yields representations from which accurate action boundaries can be recovered is load-bearing yet insufficiently justified. MMD is a permutation-invariant metric on distributions; matching marginal frame statistics can succeed without preserving temporal order or transition points. The manuscript must demonstrate (via explicit construction or controlled experiment) that the learned approximation encodes sequential structure beyond a bag-of-frames match, otherwise the segmentation step cannot be guaranteed to produce reliable boundaries.
- [Experiments] The abstract states that competitive results are shown on six benchmarks, yet the provided text supplies no experimental protocol, quantitative tables, ablation studies, error analysis, or details on how the approximated distribution is converted into segment boundaries (change-point detection, clustering, etc.). Without these elements the central empirical claim cannot be verified.
minor comments (1)
- [Abstract] Typo in abstract: 'nor permitted' should read 'not permitted'.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below, clarifying our approach and committing to revisions that strengthen the justification and experimental details.
read point-by-point responses
-
Referee: [Method (and Abstract)] The core assumption that minimizing MMD between original and approximated frame sets in NTK space yields representations from which accurate action boundaries can be recovered is load-bearing yet insufficiently justified. MMD is a permutation-invariant metric on distributions; matching marginal frame statistics can succeed without preserving temporal order or transition points. The manuscript must demonstrate (via explicit construction or controlled experiment) that the learned approximation encodes sequential structure beyond a bag-of-frames match, otherwise the segmentation step cannot be guaranteed to produce reliable boundaries.
Authors: We appreciate the referee's observation on the permutation-invariance of MMD. In our per-video setting, the approximation is optimized jointly with the NTK to capture the video's frame statistics in a high-capacity kernel space, after which segmentation proceeds by applying temporal change-point detection directly to the sequence of original frames using distances induced by the learned NTK (or to the approximated set when it retains ordering). This is not a pure bag-of-frames match because the NTK is data-dependent and the optimization is performed on the full video sequence. To make this explicit, we will add a controlled ablation in the revised manuscript that compares boundary recovery when the approximation is allowed to reorder frames versus when temporal order is preserved, demonstrating that the learned NTK-MMD approximation retains sufficient sequential information for reliable change-point detection. revision: yes
-
Referee: [Experiments] The abstract states that competitive results are shown on six benchmarks, yet the provided text supplies no experimental protocol, quantitative tables, ablation studies, error analysis, or details on how the approximated distribution is converted into segment boundaries (change-point detection, clustering, etc.). Without these elements the central empirical claim cannot be verified.
Authors: We acknowledge that the initial submission text did not sufficiently detail the experimental protocol. The full manuscript contains results on the six standard benchmarks (Breakfast, 50Salads, YouTube, etc.), with F1 scores under both known and unknown segment counts, comparisons to prior per-video and agglomerative methods, and a description of the boundary recovery step (temporal clustering on the NTK-induced distances from the approximated distribution). We will expand this section in revision to include the full protocol, quantitative tables, ablation studies on kernel choice and MMD regularization, and error analysis, ensuring all elements needed to verify the claims are present and clearly explained. revision: yes
Circularity Check
No significant circularity; derivation applies established MMD/NTK concepts independently
full rationale
The paper defines a new per-video unsupervised segmentation pipeline by optimizing MMD between original and approximated frame sets inside NTK space, then applying change-point detection or clustering to the resulting representations. No equation or step reduces a claimed prediction to a quantity defined by the inputs (e.g., no fitted parameter is renamed as an independent result). No load-bearing self-citation chain is invoked to justify uniqueness or the kernel choice; NTKs are adopted for their known descriptive properties rather than by prior author work. The central claim therefore rests on the empirical performance on six benchmarks rather than on any definitional equivalence or fitted-input renaming. This is the normal non-circular outcome for an application paper that combines existing metrics with a new task.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption MMD is a geometry-preserving metric in distribution space
- domain assumption NTKs have improved descriptive power as opposed to fixed kernels and sidestep trivial solutions when jointly learning inputs and kernel
Reference graph
Works this paper leans on
-
[1]
Fast and unsupervised action boundary detection for action segmentation,
Z. Du, X. Wang, G. Zhou, and Q. Wang, “Fast and unsupervised action boundary detection for action segmentation,” inCVPR, 2022
work page 2022
-
[2]
Temporally-weighted hierarchical clustering for unsupervised action segmenta- tion,
S. Sarfraz, N. Murray, V. Sharma, A. Diba, L. Van Gool, and R. Stiefelhagen, “Temporally-weighted hierarchical clustering for unsupervised action segmenta- tion,” inCVPR, 2021
work page 2021
-
[3]
Hierarchical vector quantization for unsupervised action segmentation,
F. Spurio, E. Bahrami, G. Francesca, and J. Gall, “Hierarchical vector quantization for unsupervised action segmentation,” inAAAI, 2025
work page 2025
-
[4]
Unsupervised action segmentation via fast learning of se- mantically consistent actoms,
Z. Xing and W. Zhao, “Unsupervised action segmentation via fast learning of se- mantically consistent actoms,” inAAAI, 2024
work page 2024
-
[5]
In- terpolating between optimal transport and mmd using sinkhorn divergences,
J. Feydy, T. Séjourné, F.-X. Vialard, S.-i. Amari, A. Trouvé, and G. Peyré, “In- terpolating between optimal transport and mmd using sinkhorn divergences,” in ICAIS, 2019
work page 2019
-
[6]
Joint self-supervised video alignment and action segmentation,
A. S. Ali, S. A. Mahmood, M. Saeed, A. Konin, M. Z. Zia, and Q.-H. Tran, “Joint self-supervised video alignment and action segmentation,” inICCV, 2025
work page 2025
-
[7]
Temporally consistent unbalanced optimal transport for unsupervised action segmentation,
M. Xu and S. Gould, “Temporally consistent unbalanced optimal transport for unsupervised action segmentation,” inCVPR, 2024
work page 2024
-
[8]
Akernelmethod for the two-sample-problem,
A.Gretton,K.Borgwardt,M.Rasch,B.Schölkopf,andA.Smola,“Akernelmethod for the two-sample-problem,”NeurIPS, 2006
work page 2006
-
[9]
Learning deep kernels for exponential family densities,
L. Wenliang, D. J. Sutherland, H. Strathmann, and A. Gretton, “Learning deep kernels for exponential family densities,” inICML, 2019
work page 2019
-
[10]
Finite versus infinite neural networks: an empirical study,
J. Lee, S. Schoenholz, J. Pennington, B. Adlam, L. Xiao, R. Novak, and J. Sohl- Dickstein, “Finite versus infinite neural networks: an empirical study,”NeurIPS, 2020. 14 S.L Pintea and J. Dijkstra
work page 2020
-
[11]
Temporal action segmentation: An analysis of modern techniques,
G. Ding, F. Sener, and A. Yao, “Temporal action segmentation: An analysis of modern techniques,”TPAMI, 2023
work page 2023
-
[12]
Fast unsupervised ego-action learning for first-person sports videos,
K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto, “Fast unsupervised ego-action learning for first-person sports videos,” inCVPR, 2011
work page 2011
-
[13]
Temporal relations in videos for unsu- pervised activity analysis.,
F. Nater, H. Grabner, and L. Van Gool, “Temporal relations in videos for unsu- pervised activity analysis.,” inBMVC, 2011
work page 2011
-
[14]
Unsupervised learning and segmentation of complex activi- ties from video,
F. Sener and A. Yao, “Unsupervised learning and segmentation of complex activi- ties from video,” inCVPR, 2018
work page 2018
-
[15]
Jointvisual- temporal embedding for unsupervised learning of actions in untrimmed sequences,
R.G.VidalMata,W.J.Scheirer,A.Kukleva,D.Cox,andH.Kuehne,“Jointvisual- temporal embedding for unsupervised learning of actions in untrimmed sequences,” inW ACV, 2021
work page 2021
-
[16]
M. Dimiccoli and H. Wendt, “Learning event representations for temporal segmen- tation of image sequences by dynamic graph embedding,”TIP, 2020
work page 2020
-
[17]
Unsupervised learning of action classes with continuous temporal embedding,
A. Kukleva, H. Kuehne, F. Sener, and J. Gall, “Unsupervised learning of action classes with continuous temporal embedding,” inCVPR, 2019
work page 2019
-
[18]
Sscap: Self-supervised co-occurrence action parsing for unsupervised temporal action seg- mentation,
Z. Wang, H. Chen, X. Li, C. Liu, Y. Xiong, J. Tighe, and C. Fowlkes, “Sscap: Self-supervised co-occurrence action parsing for unsupervised temporal action seg- mentation,” inW ACV, 2022
work page 2022
-
[19]
Leveraging triplet loss for un- supervised action segmentation,
E. Bueno-Benito, B. T. Vecino, and M. Dimiccoli, “Leveraging triplet loss for un- supervised action segmentation,” inCVPR, 2023
work page 2023
-
[20]
A perceptual prediction framework for self supervised event segmentation,
S. N. Aakur and S. Sarkar, “A perceptual prediction framework for self supervised event segmentation,” inCVPR, 2019
work page 2019
-
[21]
Predicting visual context for unsupervised event segmentation in continuous photo-streams,
A. Garcia del Molino, J.-H. Lim, and A.-H. Tan, “Predicting visual context for unsupervised event segmentation in continuous photo-streams,” inICM, 2018
work page 2018
-
[22]
Unsupervised discriminative embedding for sub-action learning in complex activities,
S. Swetha, H. Kuehne, Y. S. Rawat, and M. Shah, “Unsupervised discriminative embedding for sub-action learning in complex activities,” inICIP, 2021
work page 2021
-
[23]
Action shuffle alternating learning for unsupervised action segmentation,
J. Li and S. Todorovic, “Action shuffle alternating learning for unsupervised action segmentation,” inCVPR, 2021
work page 2021
-
[24]
Unsuper- vised action segmentation by joint representation learning and online clustering,
S. Kumar, S. Haresh, A. Ahmed, A. Konin, M. Z. Zia, and Q.-H. Tran, “Unsuper- vised action segmentation by joint representation learning and online clustering,” inCVPR, 2022
work page 2022
-
[25]
Sparse on-line gaussian processes,
L. Csató and M. Opper, “Sparse on-line gaussian processes,”Neural computation, 2002
work page 2002
-
[26]
Sparse greedy gaussian process regression,
A. Smola and P. Bartlett, “Sparse greedy gaussian process regression,”NeurIPS, 2000
work page 2000
-
[27]
Sparse gaussian processes using pseudo-inputs,
E. Snelson and Z. Ghahramani, “Sparse gaussian processes using pseudo-inputs,” NeurIPS, 2005
work page 2005
-
[28]
Dataset meta-learning from kernel ridge- regression,
T. Nguyen, Z. Chen, and J. Lee, “Dataset meta-learning from kernel ridge- regression,” inICLR, 2020
work page 2020
-
[29]
Neural tangent kernel: Convergence and generalization in neural networks,
A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,”NeurIPS, 2018
work page 2018
-
[30]
On exact computation with an infinitely wide neural net,
S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang, “On exact computation with an infinitely wide neural net,”NeurIPS, 2019
work page 2019
-
[31]
A. J. Smola, A. Gretton, and K. Borgwardt, “Maximum mean discrepancy,” in ICONIP, 2006
work page 2006
-
[32]
Neural tangent kernel maximum mean discrepancy,
X. Cheng and Y. Xie, “Neural tangent kernel maximum mean discrepancy,” NeurIPS, 2021
work page 2021
-
[33]
On the similarity between the laplace and neural tangent kernels,
A. Geifman, A. Yadav, Y. Kasten, M. Galun, D. Jacobs, and B. Ronen, “On the similarity between the laplace and neural tangent kernels,”NeurIPS, 2020
work page 2020
-
[34]
Efficient statistical tests: A neural tangent kernel approach,
S. Jia, E. Nezhadarya, Y. Wu, and J. Ba, “Efficient statistical tests: A neural tangent kernel approach,” inICML, 2021. Deep kernel video approximation for unsupervised action segmentation 15
work page 2021
-
[35]
A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,”JMLR, 2012
work page 2012
-
[36]
Kernel mean embedding of distributions: A review and beyond,
K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf,et al., “Kernel mean embedding of distributions: A review and beyond,”Foundations and Trends in Machine Learning, 2017
work page 2017
-
[37]
Sample complexity of sinkhorn divergences,
A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré, “Sample complexity of sinkhorn divergences,” inICAIS, PMLR, 2019
work page 2019
-
[38]
The language of actions: Recovering the syntax and semantics of goal-directed human activities,
H. Kuehne, A. Arslan, and T. Serre, “The language of actions: Recovering the syntax and semantics of goal-directed human activities,” inCVPR, 2014
work page 2014
-
[39]
Combining embedded accelerometers with computer vision for recognizing food preparation activities,
S. Stein and S. J. McKenna, “Combining embedded accelerometers with computer vision for recognizing food preparation activities,” inInternational Joint Confer- ence on Pervasive and Ubiquitous Computing, 2013
work page 2013
-
[40]
Unsupervised learning from narrated instruction videos,
J.-B.Alayrac,P.Bojanowski,N.Agrawal,J.Sivic,I.Laptev,andS.Lacoste-Julien, “Unsupervised learning from narrated instruction videos,” inCVPR, 2016
work page 2016
-
[41]
Weakly supervised action labeling in videos under ordering constraints,
P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic, “Weakly supervised action labeling in videos under ordering constraints,” inECCV, 2014
work page 2014
-
[42]
Recognizing fine-grained and composite activities using hand-centric features and script data,
M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele, “Recognizing fine-grained and composite activities using hand-centric features and script data,”IJCV, 2016
work page 2016
-
[43]
Action recognition with improved trajectories,
H. Wang and C. Schmid, “Action recognition with improved trajectories,” inICCV, 2013
work page 2013
-
[44]
Clot: Closed loop optimal transport for unsu- pervised action segmentation,
E. Bueno-Benito and M. Dimiccoli, “Clot: Closed loop optimal transport for unsu- pervised action segmentation,” inICCV, pp. 10719–10729, 2025
work page 2025
-
[45]
Feature learning in infinite-width neural networks,
G. Yang and E. J. Hu, “Feature learning in infinite-width neural networks,”Corr, 2020
work page 2020
-
[46]
On the infinite width limit of neural networks with a standard parameterization,
J. Sohl-Dickstein, R. Novak, S. S. Schoenholz, and J. Lee, “On the infinite width limit of neural networks with a standard parameterization,”CoRR, 2020
work page 2020
-
[47]
Neural tangents: Fast and easy infinite neural networks in python,
R. Novak, L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz, “Neural tangents: Fast and easy infinite neural networks in python,” inICLR, 2020
work page 2020
-
[48]
Deep neural networks as gaussian processes,
J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein, “Deep neural networks as gaussian processes,”ICLR, 2018
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.