pith. machine review for the scientific record. sign in

arxiv: 2605.08945 · v1 · submitted 2026-05-09 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

PIDNet: Progressive Implicit Decouple Network for Multimodal Action Quality Assessment

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal action quality assessmentprogressive fusionimplicit decouplingvideo action scoringmamba temporal modelingwavelet frequency analysissports video analysiscross-modal attention
0
0 comments X

The pith

PIDNet progressively decouples and fuses RGB, flow and audio features to assess action quality more accurately than prior multimodal methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that quality signals in multimodal videos are heterogeneous across modalities and evolve in distinct stages, so coarse fusion blurs important cues. PIDNet therefore maps the three streams into a shared space, disentangles long-range temporal patterns from local frequency details, then fuses them in three explicit stages that retrieve complements while suppressing redundancy. Experiments on two gymnastics datasets show this yields competitive score correlations and tighter error control. If the approach holds, automated judging systems can integrate visual motion and sound evidence without losing stage-specific information.

Core claim

The central claim is that an iMambaWave module combined with a three-stage Group3M fusion network performs implicit decoupling of modality-specific information, cross-modal complementary cues and global semantics, producing higher correlation with human scores and better error control on the Rhythmic Gymnastics and Fis-V datasets than existing unimodal and multimodal baselines.

What carries the argument

The iMambaWave module, which maps RGB, optical flow and audio features into a shared latent space then disentangles them via a Bi-Mamba branch for long-range dependencies and a wavelet branch for local perturbations, followed by gated aggregation and three-stage progressive fusion with modality complementary attention.

If this is right

  • Higher agreement between automatic scores and human judges on rhythmic gymnastics routines.
  • Reduced cross-modal redundancy while preserving stage-specific quality signals.
  • Consistent gains in temporal modeling when the decoupling module is added to other visual backbones.
  • Ablation-confirmed contribution from each fusion stage to overall accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The staged fusion pattern could apply to other video tasks where cues strengthen or weaken over time, such as skill assessment in surgery or dance.
  • Real-time coaching tools might become feasible if the progressive structure allows early stopping at intermediate stages.
  • Adding further modalities like depth or pose keypoints would test whether the same decoupling logic continues to suppress redundancy.

Load-bearing premise

Quality evidence from different modalities is heterogeneous and evolves progressively, so it requires dedicated implicit decoupling and staged fusion steps rather than unified or coarse modeling.

What would settle it

A head-to-head experiment on the same Rhythmic Gymnastics dataset in which a simple early-fusion or late-fusion multimodal baseline achieves equal or higher score correlation and lower error than PIDNet.

Figures

Figures reproduced from arXiv: 2605.08945 by Nenggan Zheng, Pengfei Wang, Qiqi Li.

Figure 1
Figure 1. Figure 1: Illustration of the iMambaWave module. The input feature [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The architecture of PIDNet. RGB, optical flow, and audio features are extracted by pretrained backbones and projected into a shared latent space, [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Hyperparameter analysis of the proposed PIDNet on the RG and Fis [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of intermediate feature response maps across the three [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative visualization of test samples along the multimodal temporal dimension. The first sample is clubs #001 from the RG dataset, and the second [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative visualization of the iMambaWave gate responses for the test sample ball #033 from RG dataset. The top row shows representative frames [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
read the original abstract

Action quality assessment (AQA) aims to automatically quantify the execution quality of human actions in videos and is valuable for applications such as competitive sports judging. In multimodal AQA, quality evidence from different modalities is heterogeneous, and quality cues evolve progressively over time. Existing methods often rely on coarse fusion or unified temporal modeling, which may blur modality-specific cues, preserve cross-modal redundancy, and weaken stage-specific quality evidence. To address these issues, we propose a progressive implicit decoupling and fusion network (PIDNet) that progressively integrates modality-specific information, cross-modal complementary cues, and global quality semantics for accurate assessment. Specifically, we design an iMambaWave module that maps RGB, optical flow, and audio features into a shared latent space and disentangles them with a Bi-Mamba branch and a wavelet-transform branch to capture long-range temporal dependencies and local perturbation details, respectively. A gated aggregation mechanism adaptively fuses temporal and frequency-domain information. We further build a three-stage progressive fusion network using Group3M blocks, where modality complementary attention retrieves cross-modal evidence while suppressing redundancy, and multi-scale convolutions enrich feature representations. Experiments on the Rhythmic Gymnastics and Fis-V datasets show that PIDNet achieves highly competitive score correlation with favorable error control compared with existing unimodal and multimodal methods. Ablation studies verify the effectiveness of each component. Moreover, iMambaWave consistently improves visual representation and temporal modeling across multiple backbones, showing good generalization and plug-and-play capability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes PIDNet, a progressive implicit decoupling and fusion network for multimodal action quality assessment. It introduces the iMambaWave module to project RGB, optical flow, and audio features into a shared space, disentangling them via a Bi-Mamba branch (long-range temporal dependencies) and a wavelet-transform branch (local perturbations), followed by gated aggregation. These features are then processed by a three-stage progressive fusion network built from Group3M blocks that employ modality-complementary attention to retrieve cross-modal evidence while suppressing redundancy, plus multi-scale convolutions. Experiments on the Rhythmic Gymnastics and Fis-V datasets report competitive Spearman/Pearson correlations and favorable error metrics relative to unimodal and multimodal baselines, with ablations confirming component contributions and iMambaWave shown to be plug-and-play across backbones.

Significance. If the reported gains hold under rigorous verification, the work offers a concrete architecture for addressing modality heterogeneity and stage-wise evolution of quality cues in AQA, moving beyond coarse fusion or unified modeling. The explicit separation of long-range and local-frequency modeling inside iMambaWave, combined with staged complementary attention, is a targeted response to the stated limitations of prior methods. The ablation studies and cross-backbone generalization tests provide useful evidence of component utility; these strengths would be further enhanced by reproducible code or parameter counts.

major comments (2)
  1. [Experiments section] Experiments section, main results table: the reported correlations and error metrics lack error bars, standard deviations across runs, or statistical significance tests (e.g., paired t-tests or Wilcoxon tests against the strongest baselines), which is required to substantiate the central claim of 'highly competitive' performance and 'favorable error control'.
  2. [Method section] Method section (iMambaWave and Group3M): the precise equations or pseudocode for the gated aggregation (how temporal and frequency-domain features are weighted) and the modality-complementary attention inside Group3M blocks are not supplied, preventing independent verification that the mechanism indeed 'adaptively fuses' and 'suppresses redundancy' as asserted.
minor comments (3)
  1. [Abstract] Abstract and introduction: the phrase 'highly competitive' is used without quantifying the absolute or relative improvements (e.g., Δρ or ΔMAE) over the top two baselines on each dataset.
  2. [Related Work] Related-work section: several recent multimodal AQA papers that also employ attention-based fusion are cited only briefly; a short paragraph contrasting PIDNet's progressive implicit decoupling against those approaches would clarify novelty.
  3. [Figures] Figure captions: the architecture diagram (presumably Figure 2 or 3) would benefit from explicit labels on the Bi-Mamba vs. wavelet paths and the three fusion stages to match the textual description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The comments identify key areas where additional rigor and detail will strengthen the manuscript. We address each major comment below and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [Experiments section] Experiments section, main results table: the reported correlations and error metrics lack error bars, standard deviations across runs, or statistical significance tests (e.g., paired t-tests or Wilcoxon tests against the strongest baselines), which is required to substantiate the central claim of 'highly competitive' performance and 'favorable error control'.

    Authors: We agree that the absence of error bars, standard deviations, and statistical significance tests limits the strength of the performance claims. In the revised manuscript we will rerun the main experiments across multiple random seeds (at least five runs) to report means and standard deviations for Spearman and Pearson correlations as well as error metrics. We will also add paired t-tests (or Wilcoxon signed-rank tests where appropriate) against the strongest baselines to establish statistical significance of the observed improvements. revision: yes

  2. Referee: [Method section] Method section (iMambaWave and Group3M): the precise equations or pseudocode for the gated aggregation (how temporal and frequency-domain features are weighted) and the modality-complementary attention inside Group3M blocks are not supplied, preventing independent verification that the mechanism indeed 'adaptively fuses' and 'suppresses redundancy' as asserted.

    Authors: We acknowledge that the current manuscript provides only descriptive text for these components. In the revision we will insert the explicit mathematical formulations for the gated aggregation operation within iMambaWave (including the weighting of temporal and frequency-domain branches) and for the modality-complementary attention mechanism inside each Group3M block. Where helpful, we will also supply concise pseudocode to clarify the adaptive fusion and redundancy-suppression steps. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper is an empirical architecture proposal for multimodal AQA, introducing iMambaWave (Bi-Mamba + wavelet branches with gated aggregation) and three-stage Group3M fusion (modality complementary attention + multi-scale convolutions). No equations, first-principles derivations, or predictions appear in the provided text. Central claims rest on experimental results and ablations on public datasets (Rhythmic Gymnastics, Fis-V), with no self-citation load-bearing the architecture choice, no fitted parameters renamed as predictions, and no renaming of known results. The design is motivated by stated heterogeneity of modalities but does not reduce to its own inputs by construction; results are externally validated via comparisons and component isolations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Abstract-only review limits visibility into any fitted hyperparameters or additional assumptions; the design rests on domain assumptions about modality heterogeneity and progressive quality cues.

axioms (1)
  • domain assumption Quality evidence from different modalities is heterogeneous and evolves progressively over time, necessitating specific decoupling rather than coarse fusion.
    Explicitly stated as motivation for the progressive implicit decoupling and fusion network.
invented entities (2)
  • iMambaWave module no independent evidence
    purpose: Maps RGB, optical flow, and audio features into shared latent space and disentangles them using Bi-Mamba for temporal dependencies and wavelet-transform for local details.
    Newly proposed component without independent evidence outside this work's experiments.
  • Group3M blocks no independent evidence
    purpose: Enable three-stage progressive fusion with modality complementary attention and multi-scale convolutions.
    Architectural invention introduced for the fusion network.

pith-pipeline@v0.9.0 · 5567 in / 1384 out tokens · 48515 ms · 2026-05-12T02:27:05.956018+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 1 internal anchor

  1. [1]

    Tsa-net: Tube self-attention network for action quality assessment,

    S. Wang, D. Yang, P. Zhai, C. Chen, and L. Zhang, “Tsa-net: Tube self-attention network for action quality assessment,” inProc. ACM Int. Conf. Multimedia, 2021, pp. 4902–4910

  2. [2]

    Egcn: An ensemble-based learning framework for exploring effective skeleton- based rehabilitation exercise assessment,

    Y . B. X.B., Y . Liu, X. Zhang, G. Chen, and K. C. Chan, “Egcn: An ensemble-based learning framework for exploring effective skeleton- based rehabilitation exercise assessment,” inProc. Int. Joint Conf. Artif. Intelligence., 2022, pp. 3681–3687

  3. [3]

    A decade of action quality assessment: Largest systematic survey of trends, challenges, and future directions: Hao yin et al

    H. Yin, P. Parmar, D. Xu, Y . Zhang, T. Zheng, and W. Fu, “A decade of action quality assessment: Largest systematic survey of trends, challenges, and future directions: Hao yin et al.”Int. J. Comput. Vis., vol. 134, no. 2, p. 73, 2026

  4. [4]

    Uil-aqa: Uncertainty-aware clip-level interpretable action quality assessment,

    X. Dong, X. Liu, W. Li, A. Adeyemi-Ejeye, and A. Gilbert, “Uil-aqa: Uncertainty-aware clip-level interpretable action quality assessment,” Int. J. Comput. Vis., vol. 134, no. 1, p. 24, 2026

  5. [5]

    Causality-enhanced multiple instance learning with graph convolutional networks for parkinsonian freezing-of-gait assessment,

    R. Guo, Z. Xie, C. Zhang, and X. Qian, “Causality-enhanced multiple instance learning with graph convolutional networks for parkinsonian freezing-of-gait assessment,”IEEE Trans. Image Process., vol. 33, pp. 3991–4001, 2024

  6. [6]

    Sedskill: Surgical events driven method for skill assessment from thoracoscopic surgical videos,

    X. Ding, X. Xu, and X. Li, “Sedskill: Surgical events driven method for skill assessment from thoracoscopic surgical videos,” inProc. Int. Conf. Med. Image Comput. Comput.-Assisted Interv. (MICCAI). Springer, 2023, pp. 35–45

  7. [7]

    Phi: Bridging domain shift in long-term action quality assessment via progressive hierarchical instruction,

    K. Zhou, H. P. Shum, F. W. Li, X. Zhang, and X. Liang, “Phi: Bridging domain shift in long-term action quality assessment via progressive hierarchical instruction,”IEEE Trans. Image Process., 2025

  8. [8]

    Learning semantics-guided representations for scoring figure skating,

    Z. Du, D. He, X. Wang, and Q. Wang, “Learning semantics-guided representations for scoring figure skating,”IEEE Trans. Multimedia, vol. 26, pp. 4987–4997, 2024

  9. [9]

    Video swin transformer,

    Z. Liu, J. Ning, Y . Cao, Y . Wei, Z. Zhang, S. Lin, and H. Hu, “Video swin transformer,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2022, pp. 3202–3211

  10. [10]

    Quo vadis, action recognition? a new model and the kinetics dataset,

    J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6299–6308

  11. [11]

    Learning spatiotemporal features with 3d convolutional networks,

    D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” inProc. Int. Conf. Comput. Vis., 2015, pp. 4489–4497

  12. [12]

    Enhancing long-term action quality assessment: A dual-modality dataset and causal cross- modal framework for trampoline gymnastics,

    F. Lin, J. Huang, Z. Chen, K. Zhu, and C. Feng, “Enhancing long-term action quality assessment: A dual-modality dataset and causal cross- modal framework for trampoline gymnastics,”Sensors, vol. 25, no. 18, p. 5824, 2025

  13. [13]

    Hierarchical graph convolutional networks for action quality assessment,

    K. Zhou, Y . Ma, H. P. Shum, and X. Liang, “Hierarchical graph convolutional networks for action quality assessment,”IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 12, pp. 7749–7763, 2023

  14. [14]

    Action assessment by joint relation graphs,

    J.-H. Pan, J. Gao, and W.-S. Zheng, “Action assessment by joint relation graphs,” inProc. Int. Conf. Comput. Vis., 2019, pp. 6331–6340

  15. [15]

    Adaptive spatiotemporal graph transformer network for action quality assessment,

    J. Liu, H. Wang, W. Zhou, K. Stawarz, P. Corcoran, Y . Chen, and H. Liu, “Adaptive spatiotemporal graph transformer network for action quality assessment,”IEEE Trans. Circuits Syst. Video Technol., 2025

  16. [16]

    Language-guided audio-visual learning for long-term sports assessment,

    H. Xu, X. Ke, H. Wu, R. Xu, Y . Li, and W. Guo, “Language-guided audio-visual learning for long-term sports assessment,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2025, pp. 23 967–23 977

  17. [17]

    Nar- rative action evaluation with prompt-guided multimodal interaction,

    S. Zhang, S. Bai, G. Chen, L. Chen, J. Lu, J. Wang, and Y . Tang, “Nar- rative action evaluation with prompt-guided multimodal interaction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 18 430– 18 439

  18. [18]

    Vision- language action knowledge learning for semantic-aware action quality assessment,

    H. Xu, X. Ke, Y . Li, R. Xu, H. Wu, X. Lin, and W. Guo, “Vision- language action knowledge learning for semantic-aware action quality assessment,” inProc. Eur. Conf. Comput. Vis.Springer, 2024, pp. 423–440

  19. [19]

    Skating-mixer: Long-term sport audio-visual modeling with mlps,

    J. Xia, M. Zhuge, T. Geng, S. Fan, Y . Wei, Z. He, and F. Zheng, “Skating-mixer: Long-term sport audio-visual modeling with mlps,” in AAAI Conf. Artif. Intell., vol. 37, no. 3, 2023, pp. 2901–2909

  20. [20]

    Multimodal action quality assessment,

    L.-A. Zeng and W.-S. Zheng, “Multimodal action quality assessment,” IEEE Trans. Image Process., vol. 33, pp. 1600–1613, 2024

  21. [21]

    Techcoach: Towards technical-point-aware descriptive action coach- ing,

    Y .-M. Li, A.-L. Wang, L.-A. Zeng, K.-Y . Lin, Y .-M. Tang, and W. Zheng, “Techcoach: Towards technical-point-aware descriptive action coach- ing,” inAAAI Conf. Artif. Intell., vol. 40, no. 8, 2026, pp. 6699–6707

  22. [22]

    From beats to scores: A multi-modal framework for comprehensive figure skating assessment,

    F. Wang, Q. Wang, and D. Chen, “From beats to scores: A multi-modal framework for comprehensive figure skating assessment,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2025, pp. 5905–5914

  23. [23]

    The pros and cons: Rank-aware temporal attention for skill determination in long videos,

    H. Doughty, W. Mayol-Cuevas, and D. Damen, “The pros and cons: Rank-aware temporal attention for skill determination in long videos,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 7862– 7871

  24. [24]

    Magr: Manifold-aligned graph regularization for continual action quality assessment,

    K. Zhou, L. Wang, X. Zhang, H. P. Shum, F. W. Li, J. Li, and X. Liang, “Magr: Manifold-aligned graph regularization for continual action quality assessment,” inProc. Eur. Conf. Comput. Vis.Springer, 2024, pp. 375–392

  25. [25]

    What and how well you performed? a multitask learning approach to action quality assessment,

    P. Parmar and B. T. Morris, “What and how well you performed? a multitask learning approach to action quality assessment,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 304–313

  26. [26]

    Human-centric fine-grained action quality assessment,

    J. Xu, S. Yin, and Y . Peng, “Human-centric fine-grained action quality assessment,”IEEE Trans. Pattern Anal. Mach. Intell., 2025

  27. [27]

    Group-aware contrastive regression for action quality assessment,

    X. Yu, Y . Rao, W. Zhao, J. Lu, and J. Zhou, “Group-aware contrastive regression for action quality assessment,” inProc. Int. Conf. Comput. Vis., 2021, pp. 7919–7928

  28. [28]

    Pairwise contrastive learning network for action quality assessment,

    M. Li, H.-B. Zhang, Q. Lei, Z. Fan, J. Liu, and J.-X. Du, “Pairwise contrastive learning network for action quality assessment,” inProc. Eur. Conf. Comput. Vis.Springer, 2022, pp. 457–473

  29. [29]

    Finediving: A fine- grained dataset for procedure-aware action quality assessment,

    J. Xu, Y . Rao, X. Yu, G. Chen, J. Zhou, and J. Lu, “Finediving: A fine- grained dataset for procedure-aware action quality assessment,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 2949–2958

  30. [30]

    Action quality assessment with temporal parsing transformer,

    Y . Bai, D. Zhou, S. Zhang, J. Wang, E. Ding, Y . Guan, Y . Long, and J. Wang, “Action quality assessment with temporal parsing transformer,” inProc. Eur. Conf. Comput. Vis.Springer, 2022, pp. 422–438

  31. [31]

    Likert scoring with grade decoupling for long-term action assessment,

    A. Xu, L.-A. Zeng, and W.-S. Zheng, “Likert scoring with grade decoupling for long-term action assessment,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 3232–3241. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14

  32. [32]

    Localization- assisted uncertainty score disentanglement network for action quality assessment,

    Y . Ji, L. Ye, H. Huang, L. Mao, Y . Zhou, and L. Gao, “Localization- assisted uncertainty score disentanglement network for action quality assessment,” inProc. ACM Int. Conf. Multimedia, 2023, pp. 8590–8597

  33. [33]

    Cofinal: Enhancing action quality assessment with coarse-to-fine instruction alignment,

    K. Zhou, J. Li, R. Cai, L. Wang, X. Zhang, and X. Liang, “Cofinal: Enhancing action quality assessment with coarse-to-fine instruction alignment,”Proc. Int. Joint Conf. Artif. Intelligence., 2024

  34. [34]

    Quality-guided vision- language learning for long-term action quality assessment,

    H. Xu, H. Wu, X. Ke, Y . Li, R. Xu, and W. Guo, “Quality-guided vision- language learning for long-term action quality assessment,”IEEE Trans. Multimedia, 2025

  35. [35]

    Learning long-range action repre- sentation by two-stream mamba pyramid network for figure skating assessment,

    F. Wang, Q. Wang, and P. Zhao, “Learning long-range action repre- sentation by two-stream mamba pyramid network for figure skating assessment,” inProc. ACM Int. Conf. Multimedia, 2025, pp. 867–875

  36. [36]

    Attention-driven multimodal align- ment for long-term action quality assessment,

    X. Wang, P.-J. Li, and Y .-Y . Shen, “Attention-driven multimodal align- ment for long-term action quality assessment,”Appl. Soft. Comput., p. 113649, 2025

  37. [37]

    Bert: Pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” inProc. Conf. North Am. Assoc. Comput. Linguist., 2019, pp. 4171–4186

  38. [38]

    Language models are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language models are few-shot learners,”Adv. Neural Inf. Process. Syst., vol. 33, pp. 1877– 1901, 2020

  39. [39]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

  40. [40]

    Visual-semantic alignment temporal parsing for action quality assessment,

    K. Gedamu, Y . Ji, Y . Yang, J. Shao, and H. T. Shen, “Visual-semantic alignment temporal parsing for action quality assessment,”IEEE Trans. Circuits Syst. Video Technol., vol. 35, no. 3, pp. 2436–2449, 2025

  41. [41]

    Rica 2: Rubric- informed, calibrated assessment of actions,

    A. Majeedi, V . R. Gajjala, S. S. S. N. GNVV , and Y . Li, “Rica 2: Rubric- informed, calibrated assessment of actions,” inProc. Eur. Conf. Comput. Vis.Springer, 2024, pp. 143–161

  42. [42]

    Hybrid dynamic-static context-aware attention network for action assessment in long videos,

    L.-A. Zeng, F.-T. Hong, W.-S. Zheng, Q.-Z. Yu, W. Zeng, Y .-W. Wang, and J.-H. Lai, “Hybrid dynamic-static context-aware attention network for action assessment in long videos,” inProc. ACM Int. Conf. Multimedia, 2020, pp. 2526–2534

  43. [43]

    Learning to score figure skating sport videos,

    C. Xu, Y . Fu, B. Zhang, Z. Chen, Y .-G. Jiang, and X. Xue, “Learning to score figure skating sport videos,”IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 12, pp. 4578–4590, 2020

  44. [44]

    Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection,

    Y . Liu, S. Li, Y . Wu, C.-W. Chen, Y . Shan, and X. Qie, “Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3042–3051

  45. [45]

    Msaf: Multimodal split attention fusion,

    L. Su, C. Hu, G. Li, and D. Cao, “Msaf: Multimodal split attention fusion,”arXiv preprint arXiv:2012.07175, 2020

  46. [46]

    Action quality assessment across multiple actions,

    P. Parmar and B. Morris, “Action quality assessment across multiple actions,” inProc. IEEE Winter Conf. Appl. Comput. Vis.IEEE, 2019, pp. 1468–1476

  47. [47]

    Video and accelerometer-based motion analysis for automated surgical skills assessment,

    A. Zia, Y . Sharma, V . Bettadapura, E. L. Sarin, and I. Essa, “Video and accelerometer-based motion analysis for automated surgical skills assessment,”Int. J. Comput. Assist. Radiol. Surg., vol. 13, pp. 443–455, 2018

  48. [48]

    Ast: Audio spectrogram transformer,

    Y . Gong, Y .-A. Chung, and J. Glass, “Ast: Audio spectrogram trans- former,”arXiv preprint arXiv:2104.01778, 2021

  49. [49]

    Audio set: An ontology and human- labeled dataset for audio events,

    J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human- labeled dataset for audio events,” inInt. Conf. Acoust. Speech Signal Process. (ICASSP). IEEE, 2017, pp. 776–780

  50. [50]

    Learning to score olympic events,

    P. Parmar and B. Tran Morris, “Learning to score olympic events,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 20–28

  51. [51]

    Joint visual and audio learning for video highlight detection,

    T. Badamdorj, M. Rochan, Y . Wang, and L. Cheng, “Joint visual and audio learning for video highlight detection,” inProc. Int. Conf. Comput. Vis., 2021, pp. 8107–8117