arxiv: 2509.24382 · v2 · submitted 2025-09-29 · 💻 cs.CV · cs.AI

REMAP: Regularized Matching and Partial Alignment of Video Embeddings

Soumyadeep Chandra, Kaushik Roy

Pith reviewed 2026-05-18 13:01 UTC · model grok-4.3

keywords procedure learning · video alignment · partial optimal transport · instructional videos · Gromov-Wasserstein · unsupervised learning · temporal regularization

The pith

Partial transport with smoothness regularization aligns only the meaningful parts of instructional videos while leaving background and repeats unmatched.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an unsupervised approach to extract procedural steps from long, noisy instructional videos by aligning their frame embeddings without requiring every frame to find a match. It relaxes the balanced-transport requirement of standard optimal transport so that non-informative segments can stay unmatched, then adds Laplacian-based terms to keep the alignment temporally smooth and structurally consistent. If the method works as claimed, procedure-learning systems could ignore clutter and execution variability that currently derail full-alignment techniques. A reader would care because real-world videos rarely consist of clean, consecutive actions, so an alignment that tolerates irrelevance would make automated step extraction more practical for tasks such as recipe following or assembly guidance.

Core claim

The central claim is that formulating video-embedding alignment as regularized fused partial Gromov-Wasserstein optimal transport, which jointly captures semantic similarity and temporal structure while allowing partial matches, produces more accurate step correspondences than balanced-transport baselines on egocentric and third-person instructional datasets.

What carries the argument

Regularized fused partial Gromov-Wasserstein optimal transport, which relaxes the all-to-all matching constraint and adds Laplacian smoothness plus structural penalties to suppress degenerate alignments.
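
A minimal sketch of the partial-matching idea, not the paper's solver: it runs a generic entropic unbalanced Sinkhorn iteration (KL-relaxed marginals) on a toy cost matrix and leaves out the Gromov-Wasserstein, smoothness, and structural terms. The function names, the toy cost, and the values of `eps` and `tau` are illustrative assumptions.

```python
import numpy as np

def unbalanced_sinkhorn(C, a, b, eps=0.1, tau=0.5, n_iter=500):
    """Entropic OT with KL-relaxed marginals (generic unbalanced Sinkhorn).

    Illustrative stand-in only: no Gromov-Wasserstein term and no
    smoothness or structural regularizer.
    """
    K = np.exp(-C / eps)      # Gibbs kernel of the cost
    f = tau / (tau + eps)     # exponent induced by the KL marginal penalty
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** f
        v = (b / (K.T @ u)) ** f
    return u[:, None] * K * v[None, :]

def toy_cost(n=6, n_steps=4, mismatch=1.0, background=5.0):
    """Hypothetical cost: the first n_steps frames match on the diagonal,
    the remaining frames are 'background' and are expensive to match anywhere."""
    C = np.full((n, n), mismatch)
    C[np.arange(n_steps), np.arange(n_steps)] = 0.0
    C[n_steps:, :] = background
    C[:, n_steps:] = background
    return C

C = toy_cost()
a = b = np.full(6, 1.0 / 6.0)          # uniform reference mass per frame
T = unbalanced_sinkhorn(C, a, b)
row_mass = T.sum(axis=1)               # mass each frame of video X actually ships
print(np.round(row_mass, 4))
```

In this toy setup the two high-cost rows should keep only a small share of the mass that the matched rows carry, which is the behavior the partial relaxation is meant to allow. A balanced solver would force those rows to ship their full mass somewhere.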

If this is right

Non-informative frames in long videos can be left unmatched without forcing the alignment to absorb noise.
Temporal smoothness enforced by the Laplacian term reduces erratic jumps between unrelated actions.
Structural regularization preserves the sequential order of procedural steps even when some actions are repeated or omitted.
The same partial-alignment principle applies to both egocentric and third-person video collections without task-specific tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same partial-transport idea could be tested on aligning text transcripts or sensor traces that contain irrelevant intervals.
If the regularization weights can be learned from data rather than set by hand, the method might adapt to videos of widely varying lengths.
Extending the framework to cross-modal alignment between video and narrated text would be a direct next step.

Load-bearing premise

That relaxing the balanced-transport requirement together with Laplacian smoothness and structural regularization will reliably stop alignments from latching onto background or repeated frames in real instructional videos.

What would settle it

On a controlled set of instructional videos in which background segments are explicitly labeled and removed, measure whether the partial-transport version still yields higher step-matching accuracy than a standard full-transport version.

Figures

Figures reproduced from arXiv: 2509.24382 by Kaushik Roy, Soumyadeep Chandra.

**Figure 2.** REALIGN framework. (a) An encoder generates frame-level embeddings from two video …

**Figure 3.** (a) Examples of pairwise alignment scenarios captured by the assignment matrix. (b) …

**Figure 4.** (a) Qualitative outcomes on MECCANO and PC Assembly, where color highlights dis…

Original abstract

Real-world instructional videos are long, noisy, and often contain extended background segments, repeated actions, and execution variability that do not correspond to meaningful procedural steps. We propose **REMAP**, an unsupervised framework for procedure learning based on *Regularized Fused Partial Gromov-Wasserstein Optimal Transport*. REMAP relaxes balanced transport constraints, allowing non-informative or redundant frames to remain unmatched through partial transport. The formulation jointly models semantic similarity and temporal structure, while incorporating Laplacian-based smoothness and structural regularization to prevent degenerate alignments and reduce background interference. We evaluate REMAP on large-scale egocentric and third-person benchmarks. The method consistently outperforms state-of-the-art approaches, achieving up to **11.6% (+4.45pp)** F1 and **19.6% (+4.73pp)** IoU improvements on EgoProceL, and an average **41% (+17.15pp)** F1 gain on ProceL and CrossTask. These results highlight the importance of partial alignment in handling real-world procedural variability and demonstrate that REMAP provides a robust and scalable approach for instructional video understanding.
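
The abstract quotes each gain twice, as a relative percentage and as percentage points. As a rough reading aid, the sketch below backs out the baseline score each pair would imply, assuming the relative figure is the percentage-point gain divided by the best prior score on the same metric. That assumption is not confirmed by the source, and the "up to" figures may refer to individual tasks rather than dataset averages.

```python
def implied_baseline(pp_gain, relative_gain_pct):
    """Baseline score b such that pp_gain / b == relative_gain_pct / 100."""
    return pp_gain / (relative_gain_pct / 100.0)

# (percentage-point gain, relative gain in %) as quoted in the abstract
reported = {
    "EgoProceL F1": (4.45, 11.6),
    "EgoProceL IoU": (4.73, 19.6),
    "ProceL/CrossTask average F1": (17.15, 41.0),
}

for name, (pp, rel) in reported.items():
    base = implied_baseline(pp, rel)
    print(f"{name}: implied baseline ~{base:.2f}, implied new score ~{base + pp:.2f}")
```

Under that assumption the implied baselines come out near 38.4 F1 and 24.1 IoU on EgoProceL and near 41.8 F1 on ProceL and CrossTask. These are back-of-envelope values, not numbers reported by the paper.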

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

REMAP adds a partial fused Gromov-Wasserstein setup with smoothness regularizers to video alignment and shows benchmark gains, but the results do not isolate what the partial transport actually contributes.

REMAP uses regularized partial Gromov-Wasserstein optimal transport to align embeddings from instructional videos while allowing background or redundant frames to stay unmatched. The approach fuses semantic and temporal costs and adds Laplacian smoothness plus structural penalties to avoid bad alignments. That combination is the main technical step beyond standard OT matching for this setting.

The paper reports clear improvements over prior methods, including 4.45pp F1 and 4.73pp IoU lifts on EgoProceL plus larger average F1 gains on ProceL and CrossTask. Those numbers are specific and the benchmarks are relevant for procedure learning. The formulation itself looks like a reasonable extension of existing fused partial GW work, with the regularizers chosen to fit the video noise pattern. The citation list covers the right OT and video alignment papers without obvious gaps.

The soft spot is the lack of direct evidence that partial transport is doing the claimed work on background frames. The reported scores are aggregate, with no ablation or metric that measures unmatched background fraction, alignment entropy on known non-procedural segments, or a controlled comparison that turns the partial relaxation on and off. Without that isolation it is possible the gains come mostly from the pre-trained embeddings or the fused cost rather than the background-handling mechanism. The evaluation protocols are also described at a high level only, so it is hard to judge splits, hyperparameter search, or potential confounds.

This paper is mainly for people working on unsupervised temporal alignment or procedure learning from noisy videos. A reader who already follows OT applications in CV could pick up the regularizer choices and try them elsewhere. I would send it to peer review. The core idea is grounded enough to be worth referee time, but the experiments need tighter controls before the central claim can be taken as settled.

Referee Report

2 major / 2 minor

Summary. The paper proposes REMAP, an unsupervised framework for procedure learning in instructional videos based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport. It relaxes balanced transport constraints via partial transport to leave non-informative background frames unmatched, while jointly modeling semantic similarity and temporal structure through Laplacian-based smoothness and structural regularization to avoid degenerate alignments. The method is evaluated on EgoProceL, ProceL, and CrossTask, reporting consistent outperformance of prior state-of-the-art with gains up to 11.6% F1 (+4.45pp) and 19.6% IoU (+4.73pp) on EgoProceL and an average 41% F1 (+17.15pp) on the other two benchmarks.

Significance. If the reported gains are shown to stem specifically from the partial-transport relaxation and regularization rather than from pre-trained embeddings or the fused cost alone, the work would offer a practical advance in handling noisy, variable-length instructional videos for unsupervised procedure segmentation. The emphasis on partial alignment directly targets a common real-world confound (extended background segments) that balanced OT formulations struggle with.

major comments (2)

[Evaluation] Evaluation section: The reported F1 and IoU improvements on EgoProceL, ProceL, and CrossTask are aggregate figures only; no ablation or diagnostic (e.g., fraction of background frames left unmatched, alignment entropy on known background segments, or comparison against a non-partial fused GW baseline) isolates the contribution of the partial-transport term from the Laplacian smoothness and structural regularization terms. This leaves open the possibility that gains arise primarily from the pre-trained embeddings or cost fusion rather than the claimed background-handling mechanism.
[Abstract and §4] Abstract and §4 (method): The central claim that partial transport 'reliably leaves non-informative or redundant frames unmatched' while preserving procedural step alignments is asserted but not supported by any quantitative isolation experiment; without such evidence the load-bearing assumption that the fused partial GW formulation plus regularization prevents degenerate alignments remains unverified.
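
The diagnostics requested in the major comments could be computed along the following lines. This is a hedged sketch, not code from the paper: it assumes a transport plan `T` between two videos, a per-frame background mask for the first video, and a hypothetical rule that a frame counts as unmatched when it ships less than `unmatched_ratio` of its reference mass.

```python
import numpy as np

def background_diagnostics(T, is_background, row_marginal, unmatched_ratio=0.1):
    """Summarize how a transport plan treats labeled background frames.

    T: (n, m) transport plan; is_background: (n,) boolean mask over frames of
    video X; row_marginal: (n,) reference mass per frame.
    The unmatched_ratio threshold is an illustrative assumption.
    """
    T = np.asarray(T, dtype=float)
    is_background = np.asarray(is_background, dtype=bool)
    row_mass = T.sum(axis=1)
    unmatched = row_mass < unmatched_ratio * np.asarray(row_marginal, dtype=float)

    # Entropy of each row after normalizing it to a distribution over targets;
    # rows with zero mass get entropy 0.
    P = T / np.maximum(row_mass[:, None], 1e-300)
    safe_P = np.where(P > 0, P, 1.0)
    row_entropy = -(P * np.log(safe_P)).sum(axis=1)

    return {
        "background_unmatched_fraction": float(unmatched[is_background].mean()),
        "foreground_unmatched_fraction": float(unmatched[~is_background].mean()),
        "background_entropy": float(row_entropy[is_background].mean()),
        "foreground_entropy": float(row_entropy[~is_background].mean()),
    }

# Toy plan: two cleanly matched foreground frames, two background frames.
T_demo = np.array([
    [0.25, 0.0, 0.0],
    [0.0, 0.25, 0.0],
    [0.005, 0.005, 0.005],
    [0.0, 0.0, 0.0],
])
mask_demo = np.array([False, False, True, True])
stats = background_diagnostics(T_demo, mask_demo, np.full(4, 0.25))
print(stats)
```

Running the same function on plans from a balanced fused GW baseline and from the partial variant, with embeddings and cost held fixed, would give the on/off comparison the report asks for.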

minor comments (2)

[Evaluation] The manuscript should include explicit details on dataset splits, hyperparameter selection procedure, and statistical significance testing for the reported percentage-point gains.
[§3] Notation for the fused cost and the partial-transport relaxation parameter should be introduced with a clear equation reference in §3.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and will revise the manuscript to incorporate additional experiments that isolate the contributions of the partial-transport term.

Referee: [Evaluation] Evaluation section: The reported F1 and IoU improvements on EgoProceL, ProceL, and CrossTask are aggregate figures only; no ablation or diagnostic (e.g., fraction of background frames left unmatched, alignment entropy on known background segments, or comparison against a non-partial fused GW baseline) isolates the contribution of the partial-transport term from the Laplacian smoothness and structural regularization terms. This leaves open the possibility that gains arise primarily from the pre-trained embeddings or cost fusion rather than the claimed background-handling mechanism.

Authors: We agree that the current aggregate results do not fully isolate the partial-transport relaxation from the regularization terms. In the revised manuscript we will add a dedicated ablation subsection that includes (i) a direct comparison of REMAP against a balanced (non-partial) fused Gromov-Wasserstein baseline using the same embeddings and cost fusion, and (ii) quantitative diagnostics such as the fraction of background frames left unmatched and alignment entropy computed on known background segments. These additions will clarify the specific contribution of the partial-transport mechanism. revision: yes
Referee: [Abstract and §4] Abstract and §4 (method): The central claim that partial transport 'reliably leaves non-informative or redundant frames unmatched' while preserving procedural step alignments is asserted but not supported by any quantitative isolation experiment; without such evidence the load-bearing assumption that the fused partial GW formulation plus regularization prevents degenerate alignments remains unverified.

Authors: We acknowledge that the manuscript currently lacks a quantitative isolation experiment directly supporting the behavior of the partial-transport term. We will add a new analysis (in §4 or the supplementary material) that reports the proportion of unmatched frames on non-informative segments, together with alignment entropy and degeneracy checks under the regularized fused partial GW objective. This will provide empirical verification that the formulation leaves background frames unmatched while preserving procedural alignments. revision: yes

Circularity Check

0 steps flagged

No significant circularity; formulation introduces independent regularization terms

The paper presents REMAP as a novel unsupervised framework combining partial Gromov-Wasserstein optimal transport with Laplacian smoothness and structural regularization. No derivation step reduces by construction to a fitted parameter, self-citation chain, or renamed input; the central claim rests on the explicit relaxation of balanced transport constraints and the addition of new regularization terms rather than on any self-referential equivalence. The evaluation reports aggregate performance gains without claiming that any result is forced by prior fitted quantities or uniqueness theorems from the same authors. This is a standard case of a self-contained methodological proposal.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the assumption that partial transport plus Laplacian and structural regularization can isolate procedural steps; specific regularization weights are likely free parameters.

free parameters (1)

regularization weights for smoothness and structure
Balance semantic similarity, temporal structure, and prevent degenerate alignments.

axioms (1)

domain assumption: Gromov-Wasserstein distance can jointly capture semantic similarity and temporal structure in video embeddings
Core modeling choice stated in the abstract.

pith-pipeline@v0.9.0 · 5725 in / 1075 out tokens · 40256 ms · 2026-05-18T13:01:29.495682+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

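$\min_{T \ge 0}\ (1-\rho)\langle C, T\rangle + \rho \sum_{i,j,k,l} L(C^x_{ik}, C^y_{jl})\, T_{ij} T_{kl} + \tau\left(\mathrm{KL}(T\mathbf{1} \,\|\, \alpha) + \mathrm{KL}(T^\top\mathbf{1} \,\|\, \beta)\right) - \epsilon\, h(T)$ ... virtual frame ... IDM-style structural regularization $M(\hat{T})$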
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

Laplace-shaped priors $Q(i,j) = \phi \exp(-|d_t(i,j)|/b) + (1-\phi)\exp(-|d_o(i,j)|/b)$ ... IDM $M(\hat{T})$ ... contrastive stabilization
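
To make the two quoted passages concrete, the sketch below evaluates the visible terms of the objective for a given plan and builds the Laplace-shaped prior from two distance matrices supplied by the caller. Several pieces are assumptions rather than the paper's definitions: the inner loss L is taken as the squared difference, KL is the generalized KL divergence, h(T) is taken as -Σ T (log T - 1), and the virtual frame and the IDM term M(T̂) are omitted because the excerpt elides them. The dense four-index tensor is only practical for toy sizes.

```python
import numpy as np

def kl(p, q):
    """Generalized KL divergence sum(p log(p/q) - p + q), with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    ratio = np.where(p > 0, p / q, 1.0)
    return float(np.sum(p * np.log(ratio) - p + q))

def entropy(T):
    """h(T) = -sum T (log T - 1); an assumed convention."""
    safe = np.where(T > 0, T, 1.0)
    return float(-np.sum(T * (np.log(safe) - 1.0)))

def gw_term(T, Cx, Cy):
    """sum_{ijkl} (Cx[i,k] - Cy[j,l])^2 T[i,j] T[k,l], with assumed squared loss."""
    L4 = (Cx[:, None, :, None] - Cy[None, :, None, :]) ** 2  # axes (i, j, k, l)
    return float(np.einsum("ijkl,ij,kl->", L4, T, T))

def objective(T, C, Cx, Cy, alpha, beta, rho, tau, eps):
    """Visible terms only: fused cost, KL marginal relaxation, entropy."""
    linear = float(np.sum(C * T))
    marginals = kl(T.sum(axis=1), alpha) + kl(T.sum(axis=0), beta)
    return ((1.0 - rho) * linear + rho * gw_term(T, Cx, Cy)
            + tau * marginals - eps * entropy(T))

def laplace_prior(d_t, d_o, phi, b):
    """Q = phi * exp(-|d_t| / b) + (1 - phi) * exp(-|d_o| / b).

    d_t and d_o are taken as given; their definitions are not in the excerpt.
    """
    return phi * np.exp(-np.abs(d_t) / b) + (1.0 - phi) * np.exp(-np.abs(d_o) / b)
```

Setting `rho`, `tau`, and `eps` to zero reduces `objective` to the plain linear transport cost ⟨C, T⟩, which gives a quick sanity check before the other terms are switched on.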

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
