Improving Temporal Action Segmentation via Constraint-Aware Decoding

Basura Fernando; Chen Li; Debaditya Roy; Hao Zhang; Yeo Keat Ee

arxiv: 2605.10149 · v1 · submitted 2026-05-11 · 💻 cs.CV

Yeo Keat Ee, Debaditya Roy, Chen Li, Hao Zhang, Basura Fernando

Pith reviewed 2026-05-12 03:13 UTC · model grok-4.3

keywords: temporal action segmentation · constraint-aware decoding · Viterbi algorithm · structural priors · video understanding · semi-supervised learning · action boundary detection

The pith

Constraint-aware decoding using statistical priors from training data refines temporal action segmentation predictions at inference time without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a lightweight framework to improve temporal action segmentation by integrating statistical structural priors into the decoding process. These priors, including transition confidences, boundary sets, and class durations, are extracted directly from annotated data and applied via a modified Viterbi algorithm. This corrects common structural errors in predictions from existing models. It benefits both fully supervised and semi-supervised approaches while remaining computationally efficient and avoiding the complexity of grammar-based methods.

Core claim

By incorporating statistical structural priors such as transition confidence, action boundary sets, and per-class duration into a modified Viterbi decoding algorithm, the framework enables inference-time refinement of TAS predictions. This approach corrects structural prediction errors in both fully and semi-supervised models without the need for retraining or added model complexity.

What carries the argument

Modified Viterbi decoding algorithm that integrates statistical structural priors extracted from annotated training data.
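
To make the idea concrete, here is a minimal sketch of Viterbi decoding with a transition prior added to the frame scores. It is not the authors' decoder: the function name `constrained_viterbi`, the `weight` and `floor` parameters, and the choice to leave self-transitions unpenalized are all assumptions made for illustration, and the boundary-set and duration terms are omitted.

```python
import math

def constrained_viterbi(frame_log_probs, transition_conf, weight=1.0, floor=1e-6):
    """Decode the best label path from per-frame log-probabilities.

    frame_log_probs: list of dicts {label: log p(label | frame)}.
    transition_conf: dict {(prev, next): confidence in [0, 1]} for prev != next.
    weight:          hypothetical knob scaling the transition prior.
    floor:           confidence assumed for transitions never seen in training.
    """
    labels = list(frame_log_probs[0])
    # Best score of any path ending in each label at the current frame.
    score = {c: frame_log_probs[0][c] for c in labels}
    backptr = []
    for obs in frame_log_probs[1:]:
        new_score, ptr = {}, {}
        for c in labels:
            best_prev, best_val = None, -math.inf
            for p in labels:
                if p == c:
                    prior = 0.0  # assumption: staying in the same action is free
                else:
                    conf = max(transition_conf.get((p, c), 0.0), floor)
                    prior = weight * math.log(conf)
                val = score[p] + prior
                if val > best_val:
                    best_prev, best_val = p, val
            new_score[c] = best_val + obs[c]
            ptr[c] = best_prev
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final label.
    path = [max(score, key=score.get)]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

With a prior in which A → B is common and A → C was never observed, a single-frame C prediction in the middle of an A segment is overwritten by A, because the two unseen transitions cost more than the small frame-level gain.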

If this is right

The method corrects structural prediction errors such as invalid action transitions and boundary misplacements in existing TAS outputs.
It applies to both fully supervised and semi-supervised TAS models with no changes to the underlying network.
Refinement occurs at inference time, preserving the original model's efficiency and avoiding retraining costs.
Priors are extracted directly from available annotations, enabling use in new or low-resource domains.
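
As a rough illustration of how such priors might be read off frame-level annotations, the sketch below collapses label sequences into segments and collects per-class mean durations plus the sets of actions seen at the start and end of videos. The page does not define "boundary sets", so treating them as start and end action sets is an assumption, as are the function names `to_segments` and `extract_priors`.

```python
from collections import defaultdict
from itertools import groupby
from statistics import mean

def to_segments(frame_labels):
    """Collapse per-frame labels into (label, length_in_frames) segments."""
    return [(label, len(list(run))) for label, run in groupby(frame_labels)]

def extract_priors(videos):
    """videos: list of per-frame label lists taken from annotated training data."""
    durations = defaultdict(list)
    start_actions, end_actions = set(), set()
    for frames in videos:
        segments = to_segments(frames)
        if not segments:
            continue
        # Assumed reading of "boundary sets": actions that open or close a video.
        start_actions.add(segments[0][0])
        end_actions.add(segments[-1][0])
        for label, length in segments:
            durations[label].append(length)
    mean_duration = {label: mean(lengths) for label, lengths in durations.items()}
    return mean_duration, start_actions, end_actions
```

Because everything here is a single pass over the annotations, the statistics can be recomputed cheaply whenever new labeled videos become available in a target domain.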

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This decoding refinement could extend to other structured sequence tasks like speech recognition or pose estimation where statistical priors on transitions are available.
The framework highlights a practical way to combine data-driven predictions with explicit constraints without increasing training complexity.
If priors prove domain-specific, cross-dataset validation would be needed to ensure they do not degrade performance on target videos.

Load-bearing premise

The statistical structural priors extracted from annotated training data remain representative and beneficial on unseen test videos without introducing new errors or requiring per-dataset tuning.

What would settle it

The claimed improvement would be falsified if the modified Viterbi decoder, applied to videos whose action transitions or durations do not match the extracted priors, produced lower accuracy than the baseline model.

Figures

Figures reproduced from arXiv: 2605.10149 by Basura Fernando, Chen Li, Debaditya Roy, Hao Zhang, Yeo Keat Ee.

**Figure 2.** 3.1 Extracting Structural Constraints from Activity Videos. Structural constraints define the structure of activities in terms of its constituent actions. One such structural constraint that we consider are frequently occurring action transitions. Specifically, for each consecutive pair of actions (A → B) in the video, we compute transition confidence as: $\mathrm{Conf}(A \to B) = \frac{\mathrm{Count}(A \to B)}{\mathrm{Count}(A)}$ (1). All obser…

**Figure 3.** Inference time scaling with video length. 4.4 Complexity and Runtime Comparison. Several grammar-based approaches have been proposed for modeling activity structure in videos. Differentiable grammars [25] and adversarial generative grammars [26] integrate grammar learning into deep networks, while stochastic grammars [27] capture hierarchical and temporal relations. The KARI method [5] uses a Breadth-firs…

**Figure 4.** Qualitative examples from the semi-supervised experiments demonstrate that our…

**Figure 5.** Comparison of edit score improvements across ICC iterations in a semi-supervised set…
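
The transition confidence of Eq. (1), quoted in the Figure 2 excerpt, can be computed with a few lines of counting. The sketch below assumes that Count(A) is the number of segments labeled A across all training videos and that Count(A → B) is the number of times a B segment directly follows an A segment; the excerpt is truncated before it settles either detail, and the function name `transition_confidence` is hypothetical.

```python
from collections import Counter
from itertools import groupby

def transition_confidence(videos):
    """videos: list of per-frame label lists. Returns {(A, B): Conf(A -> B)}."""
    pair_count, action_count = Counter(), Counter()
    for frames in videos:
        # Collapse repeated frame labels into the ordered list of actions.
        actions = [label for label, _ in groupby(frames)]
        action_count.update(actions)                  # Count(A), assumed per segment
        pair_count.update(zip(actions, actions[1:]))  # Count(A -> B)
    return {(a, b): n / action_count[a] for (a, b), n in pair_count.items()}
```

For two training videos with action orders a, b, c and a, b, d, this yields a confidence of 1.0 for a → b and 0.5 each for b → c and b → d.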

Original abstract

Temporal action segmentation (TAS) divides untrimmed videos into labeled action segments. While fully supervised methods have advanced the field, challenges such as action variability, ambiguous boundaries, and high annotation costs remain, especially in new or low-resource domains. Grammar-based approaches improve segmentation with structural priors but rely on complex parsing limiting scalability. In this work, we propose a lightweight, constraint-based refinement framework that enhances TAS predictions by integrating statistical structural priors such as transition confidence, action boundary sets, and per-class duration, that can be directly extracted from annotated data. These constraints are integrated into a modified Viterbi decoding algorithm, allowing inference-time refinement without retraining or added model complexity. Our approach improves both fully and semi-supervised TAS models by correcting structural prediction errors while maintaining high efficiency. Code is available at https://github.com/LUNAProject22/CAD

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a lightweight constraint-aware decoding framework for temporal action segmentation (TAS). Statistical structural priors (transition confidence, action boundary sets, and per-class duration) are extracted directly from annotated training data and incorporated into a modified Viterbi algorithm. This enables inference-time refinement of predictions from existing fully supervised or semi-supervised TAS models without retraining or added model complexity, with the goal of correcting structural errors while preserving efficiency. Code is released.

Significance. If the claimed improvements are robust, the work would provide a practical, model-agnostic post-processing step that leverages readily available training statistics. This is valuable for low-resource domains and could be adopted as a standard refinement module for TAS pipelines. The emphasis on efficiency and the public code repository are positive factors for reproducibility and impact.

major comments (3)

[§4] §4 (Experiments): No cross-dataset or domain-shift experiments are reported to test whether training-derived priors (transition confidences, boundary sets, durations) remain beneficial on unseen videos whose action ordering or timing statistics differ from the training set. This directly affects the central claim of applicability to new or low-resource domains.
[§3.2] §3.2 (Modified Viterbi): The integration of per-class duration constraints appears to use hard or strongly weighted penalties; if test-video durations deviate from training statistics, this risks introducing new segmentation errors rather than correcting them. No sensitivity analysis or mismatch quantification is provided.
[Table 2] Table 2 / §4.3 (Ablations): Reported gains from adding individual constraints are shown, but without standard deviations across runs or statistical significance tests, it is difficult to determine whether the improvements are reliable or could be explained by variance in the base TAS models.

minor comments (2)

[§3] Notation for the modified score function in the Viterbi recursion is introduced without a single consolidated definition; a small table or boxed equation summarizing all constraint terms would improve clarity.
[§2] The related-work section omits several recent semi-supervised TAS methods published after 2022; a brief comparison of how constraint-aware decoding differs from those approaches would strengthen context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the potential impact of our constraint-aware decoding framework. We address each major comment point by point below, providing clarifications and indicating planned revisions where appropriate.

Referee: [§4] §4 (Experiments): No cross-dataset or domain-shift experiments are reported to test whether training-derived priors (transition confidences, boundary sets, durations) remain beneficial on unseen videos whose action ordering or timing statistics differ from the training set. This directly affects the central claim of applicability to new or low-resource domains.

Authors: We acknowledge that explicit cross-dataset or domain-shift experiments would provide stronger evidence for generalization. Our current evaluation uses standard TAS benchmarks (50Salads, GTEA, Breakfast) with priors extracted from each dataset's training split and tested on held-out videos that already contain natural variability in action ordering and durations. For low-resource scenarios, the framework is intended to derive priors directly from whatever annotated data is available in the target domain. We will add a dedicated discussion paragraph in the revised manuscript clarifying the scope of our claims, noting the absence of cross-dataset results as a limitation, and outlining how the statistical priors could be adapted under moderate shifts.

Revision: partial

Referee: [§3.2] §3.2 (Modified Viterbi): The integration of per-class duration constraints appears to use hard or strongly weighted penalties; if test-video durations deviate from training statistics, this risks introducing new segmentation errors rather than correcting them. No sensitivity analysis or mismatch quantification is provided.

Authors: The duration term is added as a soft penalty (scaled by a tunable hyperparameter) inside the Viterbi recursion rather than a hard constraint; the weight is chosen on a validation split to avoid over-penalization. We agree that quantifying sensitivity to duration mismatch is valuable. In the revision we will insert a new subsection with (i) a sensitivity plot showing segmentation accuracy as a function of increasing duration mismatch (artificially induced on the test set) and (ii) statistics of the observed per-class duration differences between training and test splits in each benchmark.

Revision: yes

Referee: [Table 2] Table 2 / §4.3 (Ablations): Reported gains from adding individual constraints are shown, but without standard deviations across runs or statistical significance tests, it is difficult to determine whether the improvements are reliable or could be explained by variance in the base TAS models.

Authors: We thank the referee for highlighting this reporting gap. Although the base TAS models follow the original training protocols, we did not previously report run-to-run variability. For the revised manuscript we will re-execute the ablation experiments across five random seeds, report mean and standard deviation for each configuration in Table 2, and add paired statistical significance tests (e.g., Wilcoxon signed-rank) between the baseline and constraint-augmented results.

Revision: yes
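
The second response describes the duration term as a soft penalty scaled by a tunable hyperparameter, but this page does not give its functional form. The sketch below shows one plausible shape, a penalty proportional to the relative deviation from the class mean; the functions `duration_penalty` and `score_segmentation`, the `weight` parameter, and the linear form are all assumptions rather than the paper's definition.

```python
def duration_penalty(length, mean_length, weight=0.5):
    """Soft penalty (<= 0) that grows as a segment deviates from its class mean.

    weight is the hypothetical tunable hyperparameter; 0 disables the term.
    """
    relative_deviation = abs(length - mean_length) / mean_length
    return -weight * relative_deviation

def score_segmentation(segments, mean_duration, weight=0.5):
    """Sum penalties over (label, length) segments.

    Classes without a duration prior are left unpenalized (an assumption).
    """
    return sum(
        duration_penalty(length, mean_duration[label], weight)
        for label, length in segments
        if label in mean_duration
    )
```

Because the penalty is finite, a segment much longer or shorter than its training mean is discouraged rather than forbidden, which is the property the referee's concern about duration mismatch turns on. Sweeping `weight` on a validation split is one way to run the sensitivity analysis the authors promise.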

Circularity Check

0 steps flagged

No circularity; priors extracted independently from training annotations and applied post-hoc

The paper's core method extracts transition confidence, action boundary sets, and per-class durations directly from annotated training data, then integrates them into a modified Viterbi decoder for inference-time refinement of TAS model outputs. This is a standard, non-self-referential pipeline: training data provides external statistics, the base TAS model produces independent predictions, and the decoder applies constraints without deriving the priors from the predictions themselves or reducing any claimed result to a fit on the target output. No equations, self-citations, or ansatzes create definitional loops or force predictions by construction. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that data-derived statistical priors are sufficiently representative and that the modified Viterbi search can correct errors without new failure modes. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Statistical priors extracted from training annotations remain valid and helpful on test videos. This must hold for the constraints to improve rather than degrade segmentation quality.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean (and Cost/FunctionalEquation.lean) · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear

We propose a lightweight, constraint-based refinement framework that enhances TAS predictions by integrating statistical structural priors such as transition confidence, action boundary sets, and per-class duration, that can be directly extracted from annotated data. These constraints are integrated into a modified Viterbi decoding algorithm...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

[1] Chenxia Wu, Jiemi Zhang, Silvio Savarese, and Ashutosh Saxena. Watch-n-patch: Unsupervised understanding of actions and relations. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4362–4370, 2015.
[2] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7082–7092, 2018.
[3] Tsukasa Shiota, Motohiro Takagi, Kaori Kumagai, Hitoshi Seshimo, and Yushi Aono. Egocentric action recognition by capturing hand-object contact and object state. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6527–6537, 2024.
[4] Yuchi Ishikawa, Seito Kasai, Yoshimitsu Aoki, and Hirokatsu Kataoka. Alleviating over-segmentation errors by detecting action boundaries. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2321–2330, 2020.
[5] Dayoung Gong, Joonseok Lee, Deunsol Jung, Suha Kwak, and Minsu Cho. Activity grammars for temporal action segmentation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[6] Yazan Abu Farha and Juergen Gall. Ms-tcn: Multi-stage temporal convolutional network for action segmentation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3570–3579, 2019.
[7] Jun Li and Sinisa Todorovic. Set-constrained viterbi for set-supervised action segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10820–10829, 2020.
[8] Jun Li and Sinisa Todorovic. Anchor-constrained viterbi for set-supervised action segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9806–9815, 2021.
[9] Alexander Richard, Hilde Kuehne, Ahsan Iqbal, and Juergen Gall. Neuralnetwork-viterbi: A framework for weakly supervised video learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[10] Hilde Kuehne, Juergen Gall, and Thomas Serre. An end-to-end generative framework for video segmentation and recognition. In Proc. IEEE Winter Applications of Computer Vision Conference (WACV 16), Lake Placid, Mar 2016.
[11] Ozan Sener, Amir Zamir, Silvio Savarese, and Ashutosh Saxena. Unsupervised semantic parsing of video collections. 2015 IEEE International Conference on Computer Vision (ICCV), pages 4480–4488, 2015.
[12] Effrosyni Mavroudi, Divya Bhaskara, Shahin Sefati, Haider Ali, and Rene Vidal. End-to-end fine-grained action segmentation and recognition using conditional random field models and discriminative sparse coding. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1558–1567, 2018.
[13] Alexander Richard, Hilde Kuehne, and Juergen Gall. Weakly supervised action learning with rnn based fine-to-coarse modeling. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1273–1282, 2017.
[14] Daochang Liu, Qiyue Li, Anh-Dung Dinh, Tingting Jiang, Mubarak Shah, and Chang Xu. Diffusion action segmentation. In International Conference on Computer Vision (ICCV), 2023.
[15] Colin S. Lea, Michael D. Flynn, René Vidal, Austin Reiter, and Gregory Hager. Temporal convolutional networks for action segmentation and detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1003–1012, 2016.
[16] Yifei Huang, Yusuke Sugano, and Yoichi Sato. Improving action segmentation via graph-based temporal reasoning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14021–14031, 2020.
[17] Fangqiu Yi, Hongyu Wen, and Tingting Jiang. Asformer: Transformer for action segmentation. In The British Machine Vision Conference (BMVC), 2021.
[18] Yan Bin Ng and Basura Fernando. Weakly supervised action segmentation with effective use of attention and self-attention. Computer Vision and Image Understanding, 213:103298, 2021.
[19] Yan Bin Ng and Basura Fernando. Forecasting future action sequences with attention: a new approach to weakly supervised action forecasting. IEEE Transactions on Image Processing, 29:8880–8891, 2020.
[20] Shijie Li, Yazan Abu Farha, Yun Liu, Ming-Ming Cheng, and Juergen Gall. Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:6647–6658, 2020.
[21] Dipika Singhania, Rahul Rahaman, and Angela Yao. Iterative contrast-classify for semi-supervised temporal action segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2262–2270, 2022.
[22] Dipika Singhania, Rahul Rahaman, and Angela Yao. C2f-tcn: A framework for semi- and fully-supervised temporal action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–18, 2023.
[23] Guodong Ding and Angela Yao. Leveraging action affinity and continuity for semi-supervised temporal action segmentation. In European Conference on Computer Vision, 2022.
[24] Hilde Kuehne, Ali Bilgin Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 780–787, 2014.
[25] AJ Piergiovanni, Anelia Angelova, and Michael S Ryoo. Differentiable grammars for videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11874–11881, 2020.
[26] AJ Piergiovanni, Anelia Angelova, Alexander Toshev, and Michael S Ryoo. Adversarial generative grammars for human activity prediction. In European Conference on Computer Vision, pages 507–
[27] Siyuan Qi, Siyuan Huang, Ping Wei, and Song-Chun Zhu. Predicting human activities using stochastic grammar. In Proceedings of the IEEE International Conference on Computer Vision, pages 1164–1172, 2017.
[28] Hamed Pirsiavash and Deva Ramanan. Parsing videos of actions with segmental grammars. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 612–619, 2014.
[29] Nam N. Vo and Aaron F. Bobick. From stochastic grammar to bayes network: Probabilistic parsing of complex activity. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 2641–2648, 2014.
[30] Ziwei Xu, Yogesh Rawat, Yongkang Wong, Mohan S Kankanhalli, and Mubarak Shah. Don't pour cereal into coffee: Differentiable temporal logic for temporal action segmentation. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 14890–14903. Curran Associates, Inc., 2022.
[31] Thanh-Son Nguyen, Hong Yang, Tzeh Yuan Neoh, Hao Zhang, Ee Yeo Keat, and Basura Fernando. Neuro symbolic knowledge reasoning for procedural video question answering. arXiv preprint arXiv:2503.14957, 2025.
[32] Wonje Choi, Jinwoo Park, Sanghyun Ahn, Daehee Lee, and Honguk Woo. Nesyc: A neuro-symbolic continual learner for complex embodied tasks in open domains. arXiv preprint arXiv:2503.00870, 2025.
[33] Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In CVPR, pages 780–787, 2014.
[34] Sebastian Stein and Stephen J McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing, pages 729–738, 2013.

[1] [1]

Watch-n-patch: Unsupervised understanding of actions and relations.2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4362–4370, 2015

Chenxia Wu, Jiemi Zhang, Silvio Savarese, and Ashutosh Saxena. Watch-n-patch: Unsupervised understanding of actions and relations.2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4362–4370, 2015

work page 2015

[2] [2]

Tsm: Temporal shift module for efficient video understanding

Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7082–7092, 2018

work page 2019

[3] [3]

Egocentric action recognition by capturing hand-object contact and object state.2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6527–6537, 2024

Tsukasa Shiota, Motohiro Takagi, Kaori Kumagai, Hitoshi Seshimo, and Yushi Aono. Egocentric action recognition by capturing hand-object contact and object state.2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6527–6537, 2024

work page 2024

[4] [4]

Alleviating over- segmentation errors by detecting action boundaries.2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2321–2330, 2020

Yuchi Ishikawa, Seito Kasai, Yoshimitsu Aoki, and Hirokatsu Kataoka. Alleviating over- segmentation errors by detecting action boundaries.2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2321–2330, 2020

work page 2021

[5] [5]

Activity grammars for temporal action segmentation

Dayoung Gong, Joonseok Lee, Deunsol Jung, Suha Kwak, and Minsu Cho. Activity grammars for temporal action segmentation. InThirty-seventh Conference on Neural Information Processing Systems, 2023

work page 2023

[6] [6]

Ms-tcn: Multi-stage temporal convolutional network for action segmentation.2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3570–3579, 2019

Yazan Abu Farha and Juergen Gall. Ms-tcn: Multi-stage temporal convolutional network for action segmentation.2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3570–3579, 2019

work page 2019

[7] [7]

Set-constrained viterbi for set-supervised action segmentation

Jun Li and Sinisa Todorovic. Set-constrained viterbi for set-supervised action segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10820–10829, 2020

work page 2020

[8] [8]

Anchor-constrained viterbi for set-supervised action segmentation

Jun Li and Sinisa Todorovic. Anchor-constrained viterbi for set-supervised action segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9806–9815, 2021

work page 2021

[9] [9]

Neuralnetwork-viterbi: A frame- work for weakly supervised video learning

Alexander Richard, Hilde Kuehne, Ahsan Iqbal, and Juergen Gall. Neuralnetwork-viterbi: A frame- work for weakly supervised video learning. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

work page 2018

[10] [10]

An end-to-end generative framework for video segmentation and recognition

Hilde Kuehne, Juergen Gall, and Thomas Serre. An end-to-end generative framework for video segmentation and recognition. InProc. IEEE Winter Applications of Computer Vision Conference (WACV 16), Lake Placid, Mar 2016. 11

work page 2016

[11] [11]

Unsupervised semantic parsing of video collections.2015 IEEE International Conference on Computer Vision (ICCV), pages 4480– 4488, 2015

Ozan Sener, Amir Zamir, Silvio Savarese, and Ashutosh Saxena. Unsupervised semantic parsing of video collections.2015 IEEE International Conference on Computer Vision (ICCV), pages 4480– 4488, 2015

work page 2015

[12] [12]

End-to-end fine- grained action segmentation and recognition using conditional random field models and discrimina- tive sparse coding

Effrosyni Mavroudi, Divya Bhaskara, Shahin Sefati, Haider Ali, and Rene Vidal. End-to-end fine- grained action segmentation and recognition using conditional random field models and discrimina- tive sparse coding. In2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1558–1567, 2018

work page 2018

[13] [13]

Weakly supervised action learning with rnn based fine-to-coarse modeling.2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1273–1282, 2017

Alexander Richard, Hilde Kuehne, and Juergen Gall. Weakly supervised action learning with rnn based fine-to-coarse modeling.2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1273–1282, 2017

work page 2017

[14] [14]

Diffusion action segmentation

Daochang Liu, Qiyue Li, Anh-Dung Dinh, Tingting Jiang, Mubarak Shah, and Chang Xu. Diffusion action segmentation. InInternational Conference on Computer Vision (ICCV), 2023

work page 2023

[15] [15]

Lea, Michael D

Colin S. Lea, Michael D. Flynn, René Vidal, Austin Reiter, and Gregory Hager. Temporal convolu- tional networks for action segmentation and detection.2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1003–1012, 2016

work page 2017

[16] Yifei Huang, Yusuke Sugano, and Yoichi Sato. Improving action segmentation via graph-based temporal reasoning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14021–14031, 2020.

[17] Fangqiu Yi, Hongyu Wen, and Tingting Jiang. Asformer: Transformer for action segmentation. In The British Machine Vision Conference (BMVC), 2021.

[18] Yan Bin Ng and Basura Fernando. Weakly supervised action segmentation with effective use of attention and self-attention. Computer Vision and Image Understanding, 213:103298, 2021.

[19] Yan Bin Ng and Basura Fernando. Forecasting future action sequences with attention: a new approach to weakly supervised action forecasting. IEEE Transactions on Image Processing, 29:8880–8891, 2020.

[20] Shijie Li, Yazan Abu Farha, Yun Liu, Ming-Ming Cheng, and Juergen Gall. Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:6647–6658, 2020.

[21] Dipika Singhania, Rahul Rahaman, and Angela Yao. Iterative contrast-classify for semi-supervised temporal action segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2262–2270, 2022.

[22] Dipika Singhania, Rahul Rahaman, and Angela Yao. C2f-tcn: A framework for semi- and fully-supervised temporal action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–18, 2023.

[23] Guodong Ding and Angela Yao. Leveraging action affinity and continuity for semi-supervised temporal action segmentation. In European Conference on Computer Vision, 2022.

[24] Hilde Kuehne, Ali Bilgin Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 780–787, 2014.

[25] AJ Piergiovanni, Anelia Angelova, and Michael S Ryoo. Differentiable grammars for videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11874–11881, 2020.

[26] AJ Piergiovanni, Anelia Angelova, Alexander Toshev, and Michael S Ryoo. Adversarial generative grammars for human activity prediction. In European Conference on Computer Vision, pages 507–

[27] Siyuan Qi, Siyuan Huang, Ping Wei, and Song-Chun Zhu. Predicting human activities using stochastic grammar. In Proceedings of the IEEE International Conference on Computer Vision, pages 1164–1172, 2017.

[28] Hamed Pirsiavash and Deva Ramanan. Parsing videos of actions with segmental grammars. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 612–619, 2014.

[29] Nam N. Vo and Aaron F. Bobick. From stochastic grammar to bayes network: Probabilistic parsing of complex activity. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 2641–2648, 2014.

[30] Ziwei Xu, Yogesh Rawat, Yongkang Wong, Mohan S Kankanhalli, and Mubarak Shah. Don't pour cereal into coffee: Differentiable temporal logic for temporal action segmentation. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 14890–14903. Curran Associates, Inc., 2022.

[31] Thanh-Son Nguyen, Hong Yang, Tzeh Yuan Neoh, Hao Zhang, Ee Yeo Keat, and Basura Fernando. Neuro symbolic knowledge reasoning for procedural video question answering. arXiv preprint arXiv:2503.14957, 2025.

[32] Wonje Choi, Jinwoo Park, Sanghyun Ahn, Daehee Lee, and Honguk Woo. Nesyc: A neuro-symbolic continual learner for complex embodied tasks in open domains. arXiv preprint arXiv:2503.00870, 2025.

[33] Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In CVPR, pages 780–787, 2014.

[34] Sebastian Stein and Stephen J McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing, pages 729–738, 2013.