arxiv: 2604.21572 · v1 · submitted 2026-04-23 · 💻 cs.CV

Deep kernel video approximation for unsupervised action segmentation

Pith reviewed 2026-05-09 22:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords kernel · approximation · distribution · metric · space · video · when · action

The pith

Unsupervised per-video action segmentation via deep kernel approximation of frame distributions using NTKs and MMD achieves competitive benchmark results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Action segmentation divides a video into segments where each segment corresponds to a distinct action such as walking or waving. Standard approaches require large labeled datasets for training. This work performs the task on one video at a time without labels by learning an approximation to the video's frame distribution inside a kernel space. Neural tangent kernels are chosen because they capture complex patterns more effectively than fixed kernels and avoid trivial solutions when the approximation and kernel are learned together. Closeness between the real video distribution and the approximation is measured with maximum mean discrepancy, a metric that preserves geometric structure in the data space and is easier to optimize than optimal transport. The resulting segments are evaluated on six standard action segmentation benchmarks where the method matches state-of-the-art per-video unsupervised techniques. It also reports higher F1 scores than earlier agglomerative clustering approaches when the number of segments is not known in advance.
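To make the closeness measure concrete, the following is a minimal sketch of a biased (V-statistic) estimate of squared MMD between two sets of frame features. It assumes a plain Gaussian kernel rather than the NTK-based kernel the paper describes, and the names `gaussian_kernel`, `mmd2`, the bandwidth `sigma=4.0`, and the random stand-in features are all illustrative choices, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def mmd2(x, z, sigma=4.0):
    """Biased estimate of squared MMD: mean k(x,x) - 2 mean k(x,z) + mean k(z,z)."""
    return (gaussian_kernel(x, x, sigma).mean()
            - 2.0 * gaussian_kernel(x, z, sigma).mean()
            + gaussian_kernel(z, z, sigma).mean())

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 16))        # stand-in for per-frame features (assumption)
near = frames[::4]                         # subsample drawn from the same distribution
far = rng.normal(loc=2.0, size=(50, 16))   # sample from a shifted distribution

mmd_near = mmd2(frames, near)
mmd_far = mmd2(frames, far)
print(mmd_near, mmd_far)
```

Under these assumptions the subsample should score a much smaller discrepancy than the shifted sample, which is the property the method relies on when it treats a small MMD as "close".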

Core claim

We show competitive results when compared to state-of-the-art per-video methods, on six standard benchmarks. Additionally, our method has higher F1 scores than prior agglomerative work, when the number of segments is unknown.

Load-bearing premise

That maximum mean discrepancy in neural tangent kernel space provides a reliable, non-trivial measure of closeness for video frame distributions that directly yields accurate action boundaries.

Figures

Figures reproduced from arXiv: 2604.21572 by Jouke Dijkstra, Silvia L. Pintea.

Figure 1. We learn a small video approximation ṽ_m ∼ Q (in green) such that its distribution is "as close as possible" to the distribution of the real video frames v_i ∼ P (in orange). To define "as close as possible" we rely on a geometry-preserving metric, MMD (maximum mean discrepancy), which minimizes the distance between the kernel means µ_Q and µ_P.
Figure 2. We learn the video approximation ṽ by optimizing the MMD² loss over a product of NTK and Gaussian kernels. Non-parametric yet descriptive deep kernels: infinite-NTKs. MMD relies on kernel functions k(·, ·). Learning the kernel function k(·, ·) together with the video approximation ṽ leads to a trivial solution. Therefore, we need to rely on fixed kernels, such as the exponential family kernels. …
Figure 3. Model choices: (a) Kernel choice on 50 Salads [39]. The combination of the Gaussian kernel and NTK, as given by Eq. (5), is the most accurate. (b) Input smoothing σ. We consider three datasets with varying video lengths: 50 Salads [39] (long), Desktop Assembly [24] (medium), and YTI [40] (short). The choice of the smoothing hyperparameter s has an effect, especially on long videos.
Figure 4. Failure cases. Our method fails to correctly segment very similar actions, as in the case of Desktop Assembly. Another failure is when distinct actions are grouped together, such as the background class on the challenging MPII Cooking 2 dataset. … the smoothing hyperparameter together with the video approximation is possible; however, this cannot be directly used in combination with batched data. …
Original abstract

This work focuses on per-video unsupervised action segmentation, which is of interest to applications where storing large datasets is either not possible, or nor permitted. We propose to segment videos by learning in deep kernel space, to approximate the underlying frame distribution, as closely as possible. To define this closeness metric between the original video distribution and its approximation, we rely on maximum mean discrepancy (MMD) which is a geometry-preserving metric in distribution space, and thus gives more reliable estimates. Moreover, unlike the commonly used optimal transport metric, MMD is both easier to optimize, and faster. We choose to use neural tangent kernels (NTKs) to define the kernel space where MMD operates, because of their improved descriptive power as opposed to fixed kernels. And, also, because NTKs sidestep the trivial solution, when jointly learning the inputs (video approximation) and the kernel function. Finally, we show competitive results when compared to state-of-the-art per-video methods, on six standard benchmarks. Additionally, our method has higher F1 scores than prior agglomerative work, when the number of segments is unknown.
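As a rough illustration of what "learning an approximation of the frame distribution" can look like, the sketch below fits a handful of points to a synthetic video by gradient descent on the squared MMD. It is a simplification under stated assumptions: a Gaussian kernel stands in for the NTK-based kernel from the abstract, the gradient is derived by hand for that kernel, and `fit_approximation`, its step size, and the three-cluster toy data are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def mmd2(x, z, sigma):
    return (gaussian_kernel(x, x, sigma).mean()
            - 2.0 * gaussian_kernel(x, z, sigma).mean()
            + gaussian_kernel(z, z, sigma).mean())

def mmd2_grad(x, z, sigma):
    """Analytic gradient of the biased MMD^2 estimate with respect to z."""
    n, m = len(x), len(z)
    kxz = gaussian_kernel(x, z, sigma)             # shape (n, m)
    kzz = gaussian_kernel(z, z, sigma)             # shape (m, m)
    # sum_i k(x_i, z_m) (x_i - z_m): attraction towards the real frames
    pull = kxz.T @ x - kxz.sum(0)[:, None] * z
    # sum_m' k(z_m, z_m') (z_m' - z_m): interaction between approximation points
    spread = kzz @ z - kzz.sum(1)[:, None] * z
    return (2.0 / sigma ** 2) * (spread / m ** 2 - pull / (n * m))

def fit_approximation(x, m=4, sigma=1.0, lr=0.5, steps=300, seed=0):
    """Plain gradient descent on MMD^2, starting near the mean frame (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    z = x.mean(0) + 0.1 * rng.normal(size=(m, x.shape[1]))
    history = [mmd2(x, z, sigma)]
    for _ in range(steps):
        z = z - lr * mmd2_grad(x, z, sigma)
        history.append(mmd2(x, z, sigma))
    return z, history

# Toy "video": three actions of 50 frames each, as noisy 2-D features.
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
frames = np.concatenate([mu + 0.2 * rng.normal(size=(50, 2)) for mu in means])

approx, history = fit_approximation(frames)
print(history[0], history[-1])
```

The approximation points start bunched together and are pulled towards the frames while being pushed apart from each other, so the discrepancy should shrink over the iterations. The kernel is held fixed throughout, which mirrors the abstract's point that learning the kernel together with the inputs invites a trivial solution.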

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a per-video unsupervised action segmentation method that learns an approximation to the underlying frame distribution in neural tangent kernel (NTK) space by minimizing maximum mean discrepancy (MMD). It positions MMD as a geometry-preserving and computationally preferable alternative to optimal transport, claims that NTKs avoid trivial solutions when jointly optimizing the approximation and kernel, and reports competitive results against state-of-the-art per-video methods on six standard benchmarks plus higher F1 scores than prior agglomerative work when the number of segments is unknown.

Significance. If the empirical claims are substantiated, the work could offer a practical per-video approach suitable for storage-constrained settings while leveraging the descriptive power of NTKs and the optimization advantages of MMD. The explicit contrast with optimal transport and the handling of unknown segment counts are potentially useful contributions, but the soundness of the central empirical results remains difficult to assess from the supplied text.

major comments (2)
  1. [Method (and Abstract)] The core assumption that minimizing MMD between original and approximated frame sets in NTK space yields representations from which accurate action boundaries can be recovered is load-bearing yet insufficiently justified. MMD is a permutation-invariant metric on distributions; matching marginal frame statistics can succeed without preserving temporal order or transition points. The manuscript must demonstrate (via explicit construction or controlled experiment) that the learned approximation encodes sequential structure beyond a bag-of-frames match, otherwise the segmentation step cannot be guaranteed to produce reliable boundaries.
  2. [Experiments] The abstract states that competitive results are shown on six benchmarks, yet the provided text supplies no experimental protocol, quantitative tables, ablation studies, error analysis, or details on how the approximated distribution is converted into segment boundaries (change-point detection, clustering, etc.). Without these elements the central empirical claim cannot be verified.
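The permutation-invariance concern in major comment 1 can be checked numerically. The sketch below is an illustration only: it uses a Gaussian kernel in place of the paper's NTK-based kernel, and the toy features and helper names are assumptions. It shuffles the temporal order of the frames and compares the squared MMD against the same approximation before and after.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def mmd2(x, z, sigma=1.0):
    return (gaussian_kernel(x, x, sigma).mean()
            - 2.0 * gaussian_kernel(x, z, sigma).mean()
            + gaussian_kernel(z, z, sigma).mean())

rng = np.random.default_rng(0)
# Three "actions" of 40 frames each, in temporal order (toy data).
frames = np.concatenate([c + 0.1 * rng.normal(size=(40, 3)) for c in (0.0, 1.0, 2.0)])
approx = frames[::10]                              # any fixed approximation set
shuffled = frames[rng.permutation(len(frames))]    # destroy the temporal order

gap = abs(mmd2(frames, approx) - mmd2(shuffled, approx))
print(gap)
```

The two values agree up to floating-point rounding, because every term is an average over unordered pairs. Any temporal information therefore has to enter through the features or the boundary-recovery step, not through the MMD objective itself.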
minor comments (1)
  1. [Abstract] Typo in abstract: 'nor permitted' should read 'not permitted'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below, clarifying our approach and committing to revisions that strengthen the justification and experimental details.

  1. Referee: [Method (and Abstract)] The core assumption that minimizing MMD between original and approximated frame sets in NTK space yields representations from which accurate action boundaries can be recovered is load-bearing yet insufficiently justified. MMD is a permutation-invariant metric on distributions; matching marginal frame statistics can succeed without preserving temporal order or transition points. The manuscript must demonstrate (via explicit construction or controlled experiment) that the learned approximation encodes sequential structure beyond a bag-of-frames match, otherwise the segmentation step cannot be guaranteed to produce reliable boundaries.

    Authors: We appreciate the referee's observation on the permutation-invariance of MMD. In our per-video setting, the approximation is optimized jointly with the NTK to capture the video's frame statistics in a high-capacity kernel space, after which segmentation proceeds by applying temporal change-point detection directly to the sequence of original frames using distances induced by the learned NTK (or to the approximated set when it retains ordering). This is not a pure bag-of-frames match because the NTK is data-dependent and the optimization is performed on the full video sequence. To make this explicit, we will add a controlled ablation in the revised manuscript that compares boundary recovery when the approximation is allowed to reorder frames versus when temporal order is preserved, demonstrating that the learned NTK-MMD approximation retains sufficient sequential information for reliable change-point detection. revision: yes

  2. Referee: [Experiments] The abstract states that competitive results are shown on six benchmarks, yet the provided text supplies no experimental protocol, quantitative tables, ablation studies, error analysis, or details on how the approximated distribution is converted into segment boundaries (change-point detection, clustering, etc.). Without these elements the central empirical claim cannot be verified.

    Authors: We acknowledge that the initial submission text did not sufficiently detail the experimental protocol. The full manuscript contains results on the six standard benchmarks (Breakfast, 50Salads, YouTube, etc.), with F1 scores under both known and unknown segment counts, comparisons to prior per-video and agglomerative methods, and a description of the boundary recovery step (temporal clustering on the NTK-induced distances from the approximated distribution). We will expand this section in revision to include the full protocol, quantitative tables, ablation studies on kernel choice and MMD regularization, and error analysis, ensuring all elements needed to verify the claims are present and clearly explained. revision: yes
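Neither the abstract nor the rebuttal specifies the boundary-recovery step precisely, so the following is only one plausible reading, sketched under strong assumptions: each frame is assigned to its nearest approximation point under the kernel-induced distance, and a boundary is placed wherever the assignment changes. The Gaussian kernel, the function `segment_by_nearest_prototype`, and the use of the true cluster centres as a stand-in for a learned approximation are all hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def segment_by_nearest_prototype(frames, prototypes, sigma=1.0):
    """Label each frame by its closest prototype and return the change points."""
    k = gaussian_kernel(frames, prototypes, sigma)
    # For a Gaussian kernel k(x, x) = k(z, z) = 1, so the squared
    # kernel-induced distance reduces to 2 - 2 k(x, z).
    dist2 = 2.0 - 2.0 * k
    labels = dist2.argmin(axis=1)
    boundaries = np.flatnonzero(np.diff(labels)) + 1   # first frame of each new segment
    return labels, boundaries

rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
# Toy video: three well-separated actions of 30 frames each.
frames = np.concatenate([c + 0.1 * rng.normal(size=(30, 2)) for c in centres])

labels, boundaries = segment_by_nearest_prototype(frames, centres)
print(boundaries)
```

On well-separated toy data this recovers the two transitions exactly. Real videos would likely need temporal smoothing or a proper change-point detector on top, which is the detail the referee asks the authors to document.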

Circularity Check

0 steps flagged

No significant circularity; derivation applies established MMD/NTK concepts independently

The paper defines a new per-video unsupervised segmentation pipeline by optimizing MMD between original and approximated frame sets inside NTK space, then applying change-point detection or clustering to the resulting representations. No equation or step reduces a claimed prediction to a quantity defined by the inputs (e.g., no fitted parameter is renamed as an independent result). No load-bearing self-citation chain is invoked to justify uniqueness or the kernel choice; NTKs are adopted for their known descriptive properties rather than by prior author work. The central claim therefore rests on the empirical performance on six benchmarks rather than on any definitional equivalence or fitted-input renaming. This is the normal non-circular outcome for an application paper that combines existing metrics with a new task.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review yields minimal explicit parameters or entities; the approach rests on standard assumptions about kernel metrics.

axioms (2)
  • domain assumption: MMD is a geometry-preserving metric in distribution space
    Invoked to justify the closeness measure between original video distribution and its approximation.
  • domain assumption: NTKs have improved descriptive power as opposed to fixed kernels and sidestep trivial solutions when jointly learning inputs and kernel
    Stated as the reason for selecting neural tangent kernels over alternatives.
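To show what a parameter-free deep kernel looks like in practice, here is a small sketch of the closed-form NTK of an infinitely wide one-hidden-layer ReLU network without biases, combined with a Gaussian kernel by elementwise product (the figure captions mention such a product). This is an assumption-laden illustration: the paper's actual architecture, depth, and scaling are not given in the text above, constant factors differ between parameterizations, and the function names are invented here.

```python
import numpy as np

def relu_ntk_one_hidden_layer(a, b):
    """Closed-form NTK: ||x|| ||y|| (u * kappa0(u) + kappa1(u)), with u = cos(angle)."""
    norms = np.outer(np.linalg.norm(a, axis=1), np.linalg.norm(b, axis=1))
    u = np.clip((a @ b.T) / np.maximum(norms, 1e-12), -1.0, 1.0)
    theta = np.arccos(u)
    kappa0 = (np.pi - theta) / np.pi                               # arc-cosine kernel, degree 0
    kappa1 = (np.sqrt(1.0 - u ** 2) + u * (np.pi - theta)) / np.pi  # arc-cosine kernel, degree 1
    return norms * (u * kappa0 + kappa1)

def gaussian_kernel(a, b, sigma=1.0):
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 5))                 # stand-in frame features (assumption)

ntk = relu_ntk_one_hidden_layer(x, x)
product = ntk * gaussian_kernel(x, x)        # elementwise product of two kernels is a kernel
print(np.linalg.eigvalsh(product).min())
```

The kernel has no trainable weights, so there is nothing for the optimizer to collapse when only the approximation points are learned. That is the sense in which a fixed infinite-width kernel sidesteps the trivial solution described in the second axiom.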

pith-pipeline@v0.9.0 · 5487 in / 1269 out tokens · 48495 ms · 2026-05-09T22:41:53.688831+00:00 · methodology

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

  1. [1] Z. Du, X. Wang, G. Zhou, and Q. Wang, “Fast and unsupervised action boundary detection for action segmentation,” in CVPR, 2022

  2. [2] S. Sarfraz, N. Murray, V. Sharma, A. Diba, L. Van Gool, and R. Stiefelhagen, “Temporally-weighted hierarchical clustering for unsupervised action segmentation,” in CVPR, 2021

  3. [3] F. Spurio, E. Bahrami, G. Francesca, and J. Gall, “Hierarchical vector quantization for unsupervised action segmentation,” in AAAI, 2025

  4. [4] Z. Xing and W. Zhao, “Unsupervised action segmentation via fast learning of semantically consistent actoms,” in AAAI, 2024

  5. [5] J. Feydy, T. Séjourné, F.-X. Vialard, S.-i. Amari, A. Trouvé, and G. Peyré, “Interpolating between optimal transport and mmd using sinkhorn divergences,” in ICAIS, 2019

  6. [6] A. S. Ali, S. A. Mahmood, M. Saeed, A. Konin, M. Z. Zia, and Q.-H. Tran, “Joint self-supervised video alignment and action segmentation,” in ICCV, 2025

  7. [7] M. Xu and S. Gould, “Temporally consistent unbalanced optimal transport for unsupervised action segmentation,” in CVPR, 2024

  8. [8] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola, “A kernel method for the two-sample-problem,” NeurIPS, 2006

  9. [9] L. Wenliang, D. J. Sutherland, H. Strathmann, and A. Gretton, “Learning deep kernels for exponential family densities,” in ICML, 2019

  10. [10] J. Lee, S. Schoenholz, J. Pennington, B. Adlam, L. Xiao, R. Novak, and J. Sohl-Dickstein, “Finite versus infinite neural networks: an empirical study,” NeurIPS, 2020

  11. [11] G. Ding, F. Sener, and A. Yao, “Temporal action segmentation: An analysis of modern techniques,” TPAMI, 2023

  12. [12] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto, “Fast unsupervised ego-action learning for first-person sports videos,” in CVPR, 2011

  13. [13] F. Nater, H. Grabner, and L. Van Gool, “Temporal relations in videos for unsupervised activity analysis,” in BMVC, 2011

  14. [14] F. Sener and A. Yao, “Unsupervised learning and segmentation of complex activities from video,” in CVPR, 2018

  15. [15] R. G. VidalMata, W. J. Scheirer, A. Kukleva, D. Cox, and H. Kuehne, “Joint visual-temporal embedding for unsupervised learning of actions in untrimmed sequences,” in WACV, 2021

  16. [16] M. Dimiccoli and H. Wendt, “Learning event representations for temporal segmentation of image sequences by dynamic graph embedding,” TIP, 2020

  17. [17] A. Kukleva, H. Kuehne, F. Sener, and J. Gall, “Unsupervised learning of action classes with continuous temporal embedding,” in CVPR, 2019

  18. [18] Z. Wang, H. Chen, X. Li, C. Liu, Y. Xiong, J. Tighe, and C. Fowlkes, “Sscap: Self-supervised co-occurrence action parsing for unsupervised temporal action segmentation,” in WACV, 2022

  19. [19] E. Bueno-Benito, B. T. Vecino, and M. Dimiccoli, “Leveraging triplet loss for unsupervised action segmentation,” in CVPR, 2023

  20. [20] S. N. Aakur and S. Sarkar, “A perceptual prediction framework for self supervised event segmentation,” in CVPR, 2019

  21. [21] A. Garcia del Molino, J.-H. Lim, and A.-H. Tan, “Predicting visual context for unsupervised event segmentation in continuous photo-streams,” in ICM, 2018

  22. [22] S. Swetha, H. Kuehne, Y. S. Rawat, and M. Shah, “Unsupervised discriminative embedding for sub-action learning in complex activities,” in ICIP, 2021

  23. [23] J. Li and S. Todorovic, “Action shuffle alternating learning for unsupervised action segmentation,” in CVPR, 2021

  24. [24] S. Kumar, S. Haresh, A. Ahmed, A. Konin, M. Z. Zia, and Q.-H. Tran, “Unsupervised action segmentation by joint representation learning and online clustering,” in CVPR, 2022

  25. [25] L. Csató and M. Opper, “Sparse on-line gaussian processes,” Neural computation, 2002

  26. [26] A. Smola and P. Bartlett, “Sparse greedy gaussian process regression,” NeurIPS, 2000

  27. [27] E. Snelson and Z. Ghahramani, “Sparse gaussian processes using pseudo-inputs,” NeurIPS, 2005

  28. [28] T. Nguyen, Z. Chen, and J. Lee, “Dataset meta-learning from kernel ridge-regression,” in ICLR, 2020

  29. [29] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” NeurIPS, 2018

  30. [30] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang, “On exact computation with an infinitely wide neural net,” NeurIPS, 2019

  31. [31] A. J. Smola, A. Gretton, and K. Borgwardt, “Maximum mean discrepancy,” in ICONIP, 2006

  32. [32] X. Cheng and Y. Xie, “Neural tangent kernel maximum mean discrepancy,” NeurIPS, 2021

  33. [33] A. Geifman, A. Yadav, Y. Kasten, M. Galun, D. Jacobs, and B. Ronen, “On the similarity between the laplace and neural tangent kernels,” NeurIPS, 2020

  34. [34] S. Jia, E. Nezhadarya, Y. Wu, and J. Ba, “Efficient statistical tests: A neural tangent kernel approach,” in ICML, 2021

  35. [35] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” JMLR, 2012

  36. [36] K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al., “Kernel mean embedding of distributions: A review and beyond,” Foundations and Trends in Machine Learning, 2017

  37. [37] A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré, “Sample complexity of sinkhorn divergences,” in ICAIS, PMLR, 2019

  38. [38] H. Kuehne, A. Arslan, and T. Serre, “The language of actions: Recovering the syntax and semantics of goal-directed human activities,” in CVPR, 2014

  39. [39] S. Stein and S. J. McKenna, “Combining embedded accelerometers with computer vision for recognizing food preparation activities,” in International Joint Conference on Pervasive and Ubiquitous Computing, 2013

  40. [40] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien, “Unsupervised learning from narrated instruction videos,” in CVPR, 2016

  41. [41] P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic, “Weakly supervised action labeling in videos under ordering constraints,” in ECCV, 2014

  42. [42] M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele, “Recognizing fine-grained and composite activities using hand-centric features and script data,” IJCV, 2016

  43. [43] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in ICCV, 2013

  44. [44] E. Bueno-Benito and M. Dimiccoli, “Clot: Closed loop optimal transport for unsupervised action segmentation,” in ICCV, pp. 10719–10729, 2025

  45. [45] G. Yang and E. J. Hu, “Feature learning in infinite-width neural networks,” CoRR, 2020

  46. [46] J. Sohl-Dickstein, R. Novak, S. S. Schoenholz, and J. Lee, “On the infinite width limit of neural networks with a standard parameterization,” CoRR, 2020

  47. [47] R. Novak, L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz, “Neural tangents: Fast and easy infinite neural networks in python,” in ICLR, 2020

  48. [48] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein, “Deep neural networks as gaussian processes,” ICLR, 2018