pith · machine review for the scientific record

arxiv: 2604.15173 · v1 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

Boundary-Centric Active Learning for Temporal Action Segmentation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 11:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords temporal action segmentation · active learning · boundary detection · label efficiency · video annotation · uncertainty sampling · action transitions

The pith

Focusing supervision only on action boundaries yields stronger temporal segmentation with far fewer labels than standard active learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work shows that annotation cost in untrimmed videos concentrates at action transitions, where small timing errors hurt segmental metrics most. It builds a two-stage loop that first picks uncertain videos and then ranks the top-K boundaries inside each video with a score fusing neighborhood uncertainty, class ambiguity, and temporal dynamics. Only those boundary frames receive labels, yet the model still trains on short clips centered on them to retain context through its receptive field. On GTEA, 50Salads, and Breakfast, this boundary-centric protocol beats prior active-learning baselines and state-of-the-art methods at low label budgets, with the biggest margins on datasets where boundary accuracy drives edit and overlap F1 scores.
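
A minimal sketch of that two-stage loop, written only from the description above. It assumes per-video MC-dropout softmax outputs of shape (S, T, C); every name here (video_uncertainty, candidate_boundaries, select_boundary_frames, active_round, clip_len) is a hypothetical stand-in, not the paper's actual component.

    import numpy as np

    def video_uncertainty(mc_probs):
        # mc_probs: (S, T, C) softmax outputs from S stochastic MC-dropout passes
        mean_p = mc_probs.mean(axis=0)                              # (T, C)
        frame_entropy = -(mean_p * np.log(mean_p + 1e-12)).sum(axis=-1)
        return frame_entropy.mean()                                 # one score per video

    def candidate_boundaries(pred_labels):
        # frames where the predicted label changes are boundary candidates
        return np.where(np.diff(pred_labels) != 0)[0] + 1

    def select_boundary_frames(mc_probs, k, score_fn):
        pred = mc_probs.mean(axis=0).argmax(axis=-1)                # (T,)
        cands = candidate_boundaries(pred)
        scores = np.array([score_fn(mc_probs, t) for t in cands])
        return cands[np.argsort(scores)[::-1][:k]]                  # top-K by score

    def active_round(unlabeled, n_videos, k, score_fn, clip_len=64):
        # Stage 1: rank unlabeled videos by predictive uncertainty, keep the top n_videos
        ranked = sorted(unlabeled, key=lambda v: video_uncertainty(v["mc_probs"]),
                        reverse=True)[:n_videos]
        queries = []
        for video in ranked:
            # Stage 2: request labels only at the top-K boundary frames; training later
            # uses clips centered on those frames to keep receptive-field context
            for t in select_boundary_frames(video["mc_probs"], k, score_fn):
                queries.append((video["id"], int(t),
                                (int(t) - clip_len // 2, int(t) + clip_len // 2)))
        return queries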

Core claim

B-ACT ranks unlabeled videos by predictive uncertainty, then, inside each chosen video, detects candidate transitions from the current predictions and selects the top-K via a boundary score that combines neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Labels are requested only for those boundary frames, while training proceeds on boundary-centered clips, yielding higher label efficiency than representative TAS active-learning baselines under sparse budgets.

What carries the argument

The boundary score, which fuses neighborhood uncertainty, class ambiguity, and temporal predictive dynamics to rank candidate transitions for labeling.
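
Because the abstract gives neither the exact functional forms nor the fusion weights, the three terms below (windowed predictive entropy for neighborhood uncertainty, a top-2 margin for class ambiguity, and a local change in the predicted distribution for temporal dynamics) are plausible assumptions rather than the paper's definitions; the figure captions only suggest that tunable weights exist (e.g. w_gb, w_∇b).

    import numpy as np

    def boundary_score(mc_probs, t, window=8, w_u=1.0, w_a=1.0, w_d=1.0):
        # mc_probs: (S, T, C) MC-dropout softmax outputs; t: candidate boundary frame.
        # All three terms are assumed forms, not the paper's exact definitions.
        mean_p = mc_probs.mean(axis=0)                              # (T, C)
        n_frames = mean_p.shape[0]
        lo, hi = max(0, t - window), min(n_frames, t + window + 1)

        # (i) neighborhood uncertainty: mean predictive entropy in a window around t
        ent = -(mean_p[lo:hi] * np.log(mean_p[lo:hi] + 1e-12)).sum(axis=-1).mean()

        # (ii) class ambiguity: a small top-1 vs top-2 margin at t means high ambiguity
        top2 = np.sort(mean_p[t])[-2:]
        ambiguity = 1.0 - (top2[1] - top2[0])

        # (iii) temporal predictive dynamics: total-variation change across the frame
        dynamics = 0.5 * np.abs(mean_p[min(t + 1, n_frames - 1)] - mean_p[max(t - 1, 0)]).sum()

        return w_u * ent + w_a * ambiguity + w_d * dynamics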

If this is right

  • Label budgets can be cut while maintaining or improving edit and overlap F1 because supervision targets the locations where errors concentrate.
  • Gains are largest on datasets where boundary placement dominates the evaluation metrics rather than interior frame accuracy.
  • The two-stage hierarchy first reduces video-level redundancy then focuses frame-level effort on transitions.
  • Training on boundary-centered clips preserves receptive-field context without requiring dense labels across entire videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same boundary-first idea could transfer to other dense temporal tasks such as speech diarization or surgical video phase recognition.
  • Combining the boundary score with existing semi-supervised consistency losses might shrink the required labeled set even further.
  • Evaluating the method on longer, uncurated videos would test whether the two-stage selection still avoids redundant boundary queries at scale.

Load-bearing premise

The proposed boundary score reliably identifies the frames whose labels will improve the model the most, and labeling only those frames while training on centered clips supplies enough temporal context.

What would settle it

Running the same video-selection stage but replacing boundary selection with random or uniform frame sampling inside each video, then measuring whether segmentation F1 scores on GTEA, 50Salads, or Breakfast fall below the boundary-centric results at identical label budgets.
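
A sketch of that control experiment. Every callable here (select_videos, select_boundaries, train_and_eval) is a placeholder for the paper's stage-1 selection, stage-2 ranking, and training code; the per-video budget k and the seed protocol are assumptions.

    import numpy as np

    def uniform_frames(n_frames, k):
        return np.linspace(0, n_frames - 1, num=k, dtype=int)

    def random_frames(n_frames, k, rng):
        return rng.choice(n_frames, size=min(k, n_frames), replace=False)

    def run_control(datasets, budgets, select_videos, select_boundaries,
                    train_and_eval, seeds=(0, 1, 2)):
        # Compare boundary-centric vs. random/uniform frame queries at equal budgets,
        # keeping the uncertainty-based video selection (stage 1) identical throughout.
        results = {}
        for name in datasets:                        # e.g. GTEA, 50Salads, Breakfast
            for k in budgets:                        # frames labeled per selected video
                for seed in seeds:
                    rng = np.random.default_rng(seed)
                    videos = select_videos(name, seed)   # list of (video_id, n_frames)
                    for strategy in ("boundary", "random", "uniform"):
                        frames = {}
                        for vid, n_frames in videos:
                            if strategy == "boundary":
                                frames[vid] = select_boundaries(vid, k)
                            elif strategy == "random":
                                frames[vid] = random_frames(n_frames, k, rng)
                            else:
                                frames[vid] = uniform_frames(n_frames, k)
                        f1 = train_and_eval(name, frames, seed)  # e.g. segmental F1@50
                        results[(name, k, strategy, seed)] = f1
        return results

If the boundary-centric runs fail to beat the random and uniform controls at matched budgets, the load-bearing premise above does not hold.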

Figures

Figures reproduced from arXiv: 2604.15173 by Halil Ismail Helvaci and Sen-Ching Samson Cheung.

Figure 1: Comparison of active learning strategies for video. (a) Traditional …

Figure 2: Overview of the proposed boundary-centric active learning pipeline for TAS. Stage 1: unlabeled videos are ranked by predictive uncertainty estimated via MC dropout, and the most uncertain videos are selected. Stage 2: for each selected video, candidate boundaries are extracted from label changes in the predicted sequence and ranked by a boundary score that fuses (i) local predictive uncertainty in a tempor…

Figure 3: Qualitative TAS on sample videos with boundary scores. Breakfast (top), 50Salads (middle), and GTEA (bottom) visualize ground truth vs. predicted segments, per-frame errors, and the boundary score with GT/predicted boundaries. w_gb = 0.3, w_∇b = 0.5 provides the best overall trade-off, achieving the top accuracy (62.9) and the best F1 scores (68.3/62.3/45.7), while maintaining a high edit score (60.5). Thi…

Figure 4: Ablation on video selection. Performance as a function of annotation budget when selecting videos uniformly at random vs. using uncertainty-based acquisition (higher is better). (Top) Frame-wise accuracy and (bottom) edit score on GTEA. Uncertainty-based selection yields consistent gains beyond the extremely low-budget regime, indicating that reliable video-level uncertainty estimates improve label effici…

Figure 5: Ablation on clip selection. With uncertainty-based video selection fixed, we compare random clip sampling to the proposed SBAU-based clip selector under varying annotation budgets. (Top) Frame-wise accuracy and (bottom) edit score on GTEA. SBAU-driven clip selection yields increasing gains as budget grows, while at extremely low budgets boundary-aware scores can be less reliable, making random sampling com…

Figure 6: Per-class performance breakdown. Precision, recall, and F1 scores for each action class on (a) GTEA, (b) 50Salads, and (c) Breakfast, evaluated under the same labeling budget and training protocol as the ablation experiments. Across datasets, performance is strong on frequent, visually distinctive actions, while lower F1 is concentrated in fine-grained or visually similar classes and short-duration action…
Original abstract

Temporal action segmentation (TAS) demands dense temporal supervision, yet most of the annotation cost in untrimmed videos is spent identifying and refining action transitions, where segmentation errors concentrate and small temporal shifts disproportionately degrade segmental metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these high-leverage boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-$K$ boundaries via a novel boundary score that fuses neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Importantly, our annotation protocol requests labels for only the boundary frames while still training on boundary-centered clips to exploit temporal context through the model's receptive field. Extensive experiments on GTEA, 50Salads, and Breakfast demonstrate that boundary-centric supervision delivers strong label efficiency and consistently surpasses representative TAS active learning baselines and prior state of the art under sparse budgets, with the largest gains on datasets where boundary placement dominates edit and overlap-based F1 scores.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents B-ACT, a clip-budgeted active learning framework for temporal action segmentation that prioritizes labeling boundary frames. It uses a hierarchical approach: ranking videos by predictive uncertainty and then selecting top-K boundaries in selected videos via a boundary score fusing neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Training is performed on boundary-centered clips using only the boundary labels, claiming better label efficiency and outperformance over TAS active learning baselines on GTEA, 50Salads, and Breakfast under sparse budgets.

Significance. Should the empirical results prove robust, this boundary-centric strategy could meaningfully advance label-efficient learning for dense video tasks like TAS, where annotation is costly and errors cluster at transitions. The multi-dataset evaluation and focus on segmental metrics are positive aspects. However, the absence of theoretical derivation for the boundary score and limited ablations temper the broader significance.

major comments (3)
  1. The definition of the boundary score as a fusion of three components lacks an ablation analysis to determine the individual contributions of neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. This is critical as the central claim depends on this score effectively identifying frames that most improve the model.
  2. The results do not include statistical significance tests for the reported improvements, nor detailed comparisons with exact metric values against all baselines. Additionally, there is no analysis addressing whether boundary-only labeling suffices for learning action interiors on datasets like Breakfast with variable action lengths, which directly impacts the validity of the label efficiency claim.
  3. The choice of top-K boundaries per video and the fusion weights in the boundary score are free parameters without sensitivity analysis, potentially affecting the reproducibility and generality of the reported gains.
minor comments (2)
  1. The abstract mentions 'prior state of the art' but does not specify which methods are included in the comparisons.
  2. Ensure all figures clearly label the axes and include error bars if applicable for the performance curves.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We appreciate the recognition of the potential impact of our boundary-centric active learning strategy for label-efficient temporal action segmentation. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the paper.

Point-by-point responses
  1. Referee: The definition of the boundary score as a fusion of three components lacks an ablation analysis to determine the individual contributions of neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. This is critical as the central claim depends on this score effectively identifying frames that most improve the model.

    Authors: We agree that an ablation analysis is essential to substantiate the boundary score design. In the revised manuscript, we will add a dedicated ablation study evaluating each component (neighborhood uncertainty, class ambiguity, and temporal predictive dynamics) in isolation as well as in all combinations. This will report segmental metrics on GTEA, 50Salads, and Breakfast under identical budgets, highlighting the synergistic gains from the full fusion. revision: yes

  2. Referee: The results do not include statistical significance tests for the reported improvements, nor detailed comparisons with exact metric values against all baselines. Additionally, there is no analysis addressing whether boundary-only labeling suffices for learning action interiors on datasets like Breakfast with variable action lengths, which directly impacts the validity of the label efficiency claim.

    Authors: We acknowledge the validity of these concerns. We will add statistical significance tests (e.g., paired t-tests with p-values) for all reported improvements and expand the result tables to list exact metric values for every baseline and our method across all datasets. To directly examine boundary-only labeling, we will include a new analysis on Breakfast: performance on interior (non-boundary) frames when training exclusively with boundary labels versus full supervision, demonstrating how boundary-centered clips enable the model to learn interiors via temporal context despite variable action lengths. revision: yes

  3. Referee: The choice of top-K boundaries per video and the fusion weights in the boundary score are free parameters without sensitivity analysis, potentially affecting the reproducibility and generality of the reported gains.

    Authors: We agree that sensitivity analysis is important for reproducibility. In the revision, we will add experiments varying K (e.g., 1–10) and different fusion weight combinations (equal, grid-searched, and component-specific), showing that gains remain consistent across reasonable ranges. We will explicitly state the hyperparameters used in the main results and release code to support exact reproduction. revision: yes
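
The promised sensitivity analysis could look like the grid sweep below; train_and_eval stands in for one full active-learning run, and the K values and weight grid are illustrative choices, not values taken from the paper.

    from itertools import product

    def sensitivity_sweep(train_and_eval, ks=(1, 2, 3, 5, 10),
                          weight_grid=(0.0, 0.3, 0.5, 1.0)):
        # Sweep top-K and the boundary-score fusion weights; train_and_eval(k, w_u, w_a, w_d)
        # is a placeholder returning (accuracy, edit, F1@50) on a validation split.
        rows = []
        for k, w_u, w_a, w_d in product(ks, weight_grid, weight_grid, weight_grid):
            if w_u == w_a == w_d == 0.0:
                continue                      # degenerate score with all-zero weights
            acc, edit, f1 = train_and_eval(k, w_u, w_a, w_d)
            rows.append({"K": k, "w_u": w_u, "w_a": w_a, "w_d": w_d,
                         "acc": acc, "edit": edit, "F1@50": f1})
        return sorted(rows, key=lambda r: r["F1@50"], reverse=True)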

Circularity Check

0 steps flagged

No significant circularity in the B-ACT framework derivation

full rationale

The paper defines a procedural two-stage active learning loop that ranks videos by predictive uncertainty and selects boundary frames inside each video via a composite boundary score fusing neighborhood uncertainty, class ambiguity, and temporal dynamics. All performance claims are supported by direct empirical comparisons on GTEA, 50Salads, and Breakfast against external baselines under fixed clip budgets. No equation or prediction is shown to be mathematically identical to a fitted parameter or self-citation input; the boundary score is an explicit design choice whose utility is measured rather than derived by construction from the evaluation data.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The approach rests on the domain assumption that segmentation errors concentrate at boundaries and that uncertainty-based selection plus the new score will identify high-value frames; several implementation details such as exact weighting in the boundary score and choice of K are not specified in the abstract and function as free parameters.

free parameters (2)
  • top-K boundaries per video
    Number of boundaries selected for annotation within each chosen video; controls the per-video budget and must be set to match overall label constraints.
  • boundary score fusion weights
    Relative importance given to neighborhood uncertainty, class ambiguity, and temporal predictive dynamics when ranking candidate transitions.
axioms (2)
  • domain assumption Segmentation errors in TAS concentrate at action boundaries and small temporal shifts disproportionately affect segmental metrics.
    Explicitly stated as the core motivation for boundary-centric supervision.
  • domain assumption Predictive uncertainty is a reliable proxy for annotation value in both video selection and boundary ranking.
    Used to rank videos in stage one and to contribute to the boundary score in stage two.
invented entities (1)
  • boundary score no independent evidence
    purpose: Ranks candidate transition points by combining neighborhood uncertainty, class ambiguity, and temporal predictive dynamics.
    Newly defined scoring function whose exact formulation and weighting are not provided in the abstract.

pith-pipeline@v0.9.0 · 5495 in / 1552 out tokens · 109393 ms · 2026-05-10T11:30:50.819898+00:00 · methodology

