Recognition: 2 theorem links
· Lean TheoremPIDNet: Progressive Implicit Decouple Network for Multimodal Action Quality Assessment
Pith reviewed 2026-05-12 02:27 UTC · model grok-4.3
The pith
PIDNet progressively decouples and fuses RGB, flow and audio features to assess action quality more accurately than prior multimodal methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an iMambaWave module combined with a three-stage Group3M fusion network performs implicit decoupling of modality-specific information, cross-modal complementary cues and global semantics, producing higher correlation with human scores and better error control on the Rhythmic Gymnastics and Fis-V datasets than existing unimodal and multimodal baselines.
What carries the argument
The iMambaWave module, which maps RGB, optical flow and audio features into a shared latent space then disentangles them via a Bi-Mamba branch for long-range dependencies and a wavelet branch for local perturbations, followed by gated aggregation and three-stage progressive fusion with modality complementary attention.
If this is right
- Higher agreement between automatic scores and human judges on rhythmic gymnastics routines.
- Reduced cross-modal redundancy while preserving stage-specific quality signals.
- Consistent gains in temporal modeling when the decoupling module is added to other visual backbones.
- Ablation-confirmed contribution from each fusion stage to overall accuracy.
Where Pith is reading between the lines
- The staged fusion pattern could apply to other video tasks where cues strengthen or weaken over time, such as skill assessment in surgery or dance.
- Real-time coaching tools might become feasible if the progressive structure allows early stopping at intermediate stages.
- Adding further modalities like depth or pose keypoints would test whether the same decoupling logic continues to suppress redundancy.
Load-bearing premise
Quality evidence from different modalities is heterogeneous and evolves progressively, so it requires dedicated implicit decoupling and staged fusion steps rather than unified or coarse modeling.
What would settle it
A head-to-head experiment on the same Rhythmic Gymnastics dataset in which a simple early-fusion or late-fusion multimodal baseline achieves equal or higher score correlation and lower error than PIDNet.
Figures
read the original abstract
Action quality assessment (AQA) aims to automatically quantify the execution quality of human actions in videos and is valuable for applications such as competitive sports judging. In multimodal AQA, quality evidence from different modalities is heterogeneous, and quality cues evolve progressively over time. Existing methods often rely on coarse fusion or unified temporal modeling, which may blur modality-specific cues, preserve cross-modal redundancy, and weaken stage-specific quality evidence. To address these issues, we propose a progressive implicit decoupling and fusion network (PIDNet) that progressively integrates modality-specific information, cross-modal complementary cues, and global quality semantics for accurate assessment. Specifically, we design an iMambaWave module that maps RGB, optical flow, and audio features into a shared latent space and disentangles them with a Bi-Mamba branch and a wavelet-transform branch to capture long-range temporal dependencies and local perturbation details, respectively. A gated aggregation mechanism adaptively fuses temporal and frequency-domain information. We further build a three-stage progressive fusion network using Group3M blocks, where modality complementary attention retrieves cross-modal evidence while suppressing redundancy, and multi-scale convolutions enrich feature representations. Experiments on the Rhythmic Gymnastics and Fis-V datasets show that PIDNet achieves highly competitive score correlation with favorable error control compared with existing unimodal and multimodal methods. Ablation studies verify the effectiveness of each component. Moreover, iMambaWave consistently improves visual representation and temporal modeling across multiple backbones, showing good generalization and plug-and-play capability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PIDNet, a progressive implicit decoupling and fusion network for multimodal action quality assessment. It introduces the iMambaWave module to project RGB, optical flow, and audio features into a shared space, disentangling them via a Bi-Mamba branch (long-range temporal dependencies) and a wavelet-transform branch (local perturbations), followed by gated aggregation. These features are then processed by a three-stage progressive fusion network built from Group3M blocks that employ modality-complementary attention to retrieve cross-modal evidence while suppressing redundancy, plus multi-scale convolutions. Experiments on the Rhythmic Gymnastics and Fis-V datasets report competitive Spearman/Pearson correlations and favorable error metrics relative to unimodal and multimodal baselines, with ablations confirming component contributions and iMambaWave shown to be plug-and-play across backbones.
Significance. If the reported gains hold under rigorous verification, the work offers a concrete architecture for addressing modality heterogeneity and stage-wise evolution of quality cues in AQA, moving beyond coarse fusion or unified modeling. The explicit separation of long-range and local-frequency modeling inside iMambaWave, combined with staged complementary attention, is a targeted response to the stated limitations of prior methods. The ablation studies and cross-backbone generalization tests provide useful evidence of component utility; these strengths would be further enhanced by reproducible code or parameter counts.
major comments (2)
- [Experiments section] Experiments section, main results table: the reported correlations and error metrics lack error bars, standard deviations across runs, or statistical significance tests (e.g., paired t-tests or Wilcoxon tests against the strongest baselines), which is required to substantiate the central claim of 'highly competitive' performance and 'favorable error control'.
- [Method section] Method section (iMambaWave and Group3M): the precise equations or pseudocode for the gated aggregation (how temporal and frequency-domain features are weighted) and the modality-complementary attention inside Group3M blocks are not supplied, preventing independent verification that the mechanism indeed 'adaptively fuses' and 'suppresses redundancy' as asserted.
minor comments (3)
- [Abstract] Abstract and introduction: the phrase 'highly competitive' is used without quantifying the absolute or relative improvements (e.g., Δρ or ΔMAE) over the top two baselines on each dataset.
- [Related Work] Related-work section: several recent multimodal AQA papers that also employ attention-based fusion are cited only briefly; a short paragraph contrasting PIDNet's progressive implicit decoupling against those approaches would clarify novelty.
- [Figures] Figure captions: the architecture diagram (presumably Figure 2 or 3) would benefit from explicit labels on the Bi-Mamba vs. wavelet paths and the three fusion stages to match the textual description.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. The comments identify key areas where additional rigor and detail will strengthen the manuscript. We address each major comment below and will revise the paper accordingly.
read point-by-point responses
-
Referee: [Experiments section] Experiments section, main results table: the reported correlations and error metrics lack error bars, standard deviations across runs, or statistical significance tests (e.g., paired t-tests or Wilcoxon tests against the strongest baselines), which is required to substantiate the central claim of 'highly competitive' performance and 'favorable error control'.
Authors: We agree that the absence of error bars, standard deviations, and statistical significance tests limits the strength of the performance claims. In the revised manuscript we will rerun the main experiments across multiple random seeds (at least five runs) to report means and standard deviations for Spearman and Pearson correlations as well as error metrics. We will also add paired t-tests (or Wilcoxon signed-rank tests where appropriate) against the strongest baselines to establish statistical significance of the observed improvements. revision: yes
-
Referee: [Method section] Method section (iMambaWave and Group3M): the precise equations or pseudocode for the gated aggregation (how temporal and frequency-domain features are weighted) and the modality-complementary attention inside Group3M blocks are not supplied, preventing independent verification that the mechanism indeed 'adaptively fuses' and 'suppresses redundancy' as asserted.
Authors: We acknowledge that the current manuscript provides only descriptive text for these components. In the revision we will insert the explicit mathematical formulations for the gated aggregation operation within iMambaWave (including the weighting of temporal and frequency-domain branches) and for the modality-complementary attention mechanism inside each Group3M block. Where helpful, we will also supply concise pseudocode to clarify the adaptive fusion and redundancy-suppression steps. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper is an empirical architecture proposal for multimodal AQA, introducing iMambaWave (Bi-Mamba + wavelet branches with gated aggregation) and three-stage Group3M fusion (modality complementary attention + multi-scale convolutions). No equations, first-principles derivations, or predictions appear in the provided text. Central claims rest on experimental results and ablations on public datasets (Rhythmic Gymnastics, Fis-V), with no self-citation load-bearing the architecture choice, no fitted parameters renamed as predictions, and no renaming of known results. The design is motivated by stated heterogeneity of modalities but does not reduce to its own inputs by construction; results are externally validated via comparisons and component isolations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Quality evidence from different modalities is heterogeneous and evolves progressively over time, necessitating specific decoupling rather than coarse fusion.
invented entities (2)
-
iMambaWave module
no independent evidence
-
Group3M blocks
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
iMambaWave module... Bi-Mamba branch and a wavelet-transform branch... gated aggregation... Group3M blocks... modality complementary attention (MoCA)
-
IndisputableMonolith/Foundation/DimensionForcing.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
three-stage progressive fusion network... 8-tick period not referenced
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Tsa-net: Tube self-attention network for action quality assessment,
S. Wang, D. Yang, P. Zhai, C. Chen, and L. Zhang, “Tsa-net: Tube self-attention network for action quality assessment,” inProc. ACM Int. Conf. Multimedia, 2021, pp. 4902–4910
work page 2021
-
[2]
Y . B. X.B., Y . Liu, X. Zhang, G. Chen, and K. C. Chan, “Egcn: An ensemble-based learning framework for exploring effective skeleton- based rehabilitation exercise assessment,” inProc. Int. Joint Conf. Artif. Intelligence., 2022, pp. 3681–3687
work page 2022
-
[3]
H. Yin, P. Parmar, D. Xu, Y . Zhang, T. Zheng, and W. Fu, “A decade of action quality assessment: Largest systematic survey of trends, challenges, and future directions: Hao yin et al.”Int. J. Comput. Vis., vol. 134, no. 2, p. 73, 2026
work page 2026
-
[4]
Uil-aqa: Uncertainty-aware clip-level interpretable action quality assessment,
X. Dong, X. Liu, W. Li, A. Adeyemi-Ejeye, and A. Gilbert, “Uil-aqa: Uncertainty-aware clip-level interpretable action quality assessment,” Int. J. Comput. Vis., vol. 134, no. 1, p. 24, 2026
work page 2026
-
[5]
R. Guo, Z. Xie, C. Zhang, and X. Qian, “Causality-enhanced multiple instance learning with graph convolutional networks for parkinsonian freezing-of-gait assessment,”IEEE Trans. Image Process., vol. 33, pp. 3991–4001, 2024
work page 2024
-
[6]
Sedskill: Surgical events driven method for skill assessment from thoracoscopic surgical videos,
X. Ding, X. Xu, and X. Li, “Sedskill: Surgical events driven method for skill assessment from thoracoscopic surgical videos,” inProc. Int. Conf. Med. Image Comput. Comput.-Assisted Interv. (MICCAI). Springer, 2023, pp. 35–45
work page 2023
-
[7]
K. Zhou, H. P. Shum, F. W. Li, X. Zhang, and X. Liang, “Phi: Bridging domain shift in long-term action quality assessment via progressive hierarchical instruction,”IEEE Trans. Image Process., 2025
work page 2025
-
[8]
Learning semantics-guided representations for scoring figure skating,
Z. Du, D. He, X. Wang, and Q. Wang, “Learning semantics-guided representations for scoring figure skating,”IEEE Trans. Multimedia, vol. 26, pp. 4987–4997, 2024
work page 2024
-
[9]
Z. Liu, J. Ning, Y . Cao, Y . Wei, Z. Zhang, S. Lin, and H. Hu, “Video swin transformer,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2022, pp. 3202–3211
work page 2022
-
[10]
Quo vadis, action recognition? a new model and the kinetics dataset,
J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6299–6308
work page 2017
-
[11]
Learning spatiotemporal features with 3d convolutional networks,
D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” inProc. Int. Conf. Comput. Vis., 2015, pp. 4489–4497
work page 2015
-
[12]
F. Lin, J. Huang, Z. Chen, K. Zhu, and C. Feng, “Enhancing long-term action quality assessment: A dual-modality dataset and causal cross- modal framework for trampoline gymnastics,”Sensors, vol. 25, no. 18, p. 5824, 2025
work page 2025
-
[13]
Hierarchical graph convolutional networks for action quality assessment,
K. Zhou, Y . Ma, H. P. Shum, and X. Liang, “Hierarchical graph convolutional networks for action quality assessment,”IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 12, pp. 7749–7763, 2023
work page 2023
-
[14]
Action assessment by joint relation graphs,
J.-H. Pan, J. Gao, and W.-S. Zheng, “Action assessment by joint relation graphs,” inProc. Int. Conf. Comput. Vis., 2019, pp. 6331–6340
work page 2019
-
[15]
Adaptive spatiotemporal graph transformer network for action quality assessment,
J. Liu, H. Wang, W. Zhou, K. Stawarz, P. Corcoran, Y . Chen, and H. Liu, “Adaptive spatiotemporal graph transformer network for action quality assessment,”IEEE Trans. Circuits Syst. Video Technol., 2025
work page 2025
-
[16]
Language-guided audio-visual learning for long-term sports assessment,
H. Xu, X. Ke, H. Wu, R. Xu, Y . Li, and W. Guo, “Language-guided audio-visual learning for long-term sports assessment,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2025, pp. 23 967–23 977
work page 2025
-
[17]
Nar- rative action evaluation with prompt-guided multimodal interaction,
S. Zhang, S. Bai, G. Chen, L. Chen, J. Lu, J. Wang, and Y . Tang, “Nar- rative action evaluation with prompt-guided multimodal interaction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 18 430– 18 439
work page 2024
-
[18]
Vision- language action knowledge learning for semantic-aware action quality assessment,
H. Xu, X. Ke, Y . Li, R. Xu, H. Wu, X. Lin, and W. Guo, “Vision- language action knowledge learning for semantic-aware action quality assessment,” inProc. Eur. Conf. Comput. Vis.Springer, 2024, pp. 423–440
work page 2024
-
[19]
Skating-mixer: Long-term sport audio-visual modeling with mlps,
J. Xia, M. Zhuge, T. Geng, S. Fan, Y . Wei, Z. He, and F. Zheng, “Skating-mixer: Long-term sport audio-visual modeling with mlps,” in AAAI Conf. Artif. Intell., vol. 37, no. 3, 2023, pp. 2901–2909
work page 2023
-
[20]
Multimodal action quality assessment,
L.-A. Zeng and W.-S. Zheng, “Multimodal action quality assessment,” IEEE Trans. Image Process., vol. 33, pp. 1600–1613, 2024
work page 2024
-
[21]
Techcoach: Towards technical-point-aware descriptive action coach- ing,
Y .-M. Li, A.-L. Wang, L.-A. Zeng, K.-Y . Lin, Y .-M. Tang, and W. Zheng, “Techcoach: Towards technical-point-aware descriptive action coach- ing,” inAAAI Conf. Artif. Intell., vol. 40, no. 8, 2026, pp. 6699–6707
work page 2026
-
[22]
From beats to scores: A multi-modal framework for comprehensive figure skating assessment,
F. Wang, Q. Wang, and D. Chen, “From beats to scores: A multi-modal framework for comprehensive figure skating assessment,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2025, pp. 5905–5914
work page 2025
-
[23]
The pros and cons: Rank-aware temporal attention for skill determination in long videos,
H. Doughty, W. Mayol-Cuevas, and D. Damen, “The pros and cons: Rank-aware temporal attention for skill determination in long videos,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 7862– 7871
work page 2019
-
[24]
Magr: Manifold-aligned graph regularization for continual action quality assessment,
K. Zhou, L. Wang, X. Zhang, H. P. Shum, F. W. Li, J. Li, and X. Liang, “Magr: Manifold-aligned graph regularization for continual action quality assessment,” inProc. Eur. Conf. Comput. Vis.Springer, 2024, pp. 375–392
work page 2024
-
[25]
What and how well you performed? a multitask learning approach to action quality assessment,
P. Parmar and B. T. Morris, “What and how well you performed? a multitask learning approach to action quality assessment,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 304–313
work page 2019
-
[26]
Human-centric fine-grained action quality assessment,
J. Xu, S. Yin, and Y . Peng, “Human-centric fine-grained action quality assessment,”IEEE Trans. Pattern Anal. Mach. Intell., 2025
work page 2025
-
[27]
Group-aware contrastive regression for action quality assessment,
X. Yu, Y . Rao, W. Zhao, J. Lu, and J. Zhou, “Group-aware contrastive regression for action quality assessment,” inProc. Int. Conf. Comput. Vis., 2021, pp. 7919–7928
work page 2021
-
[28]
Pairwise contrastive learning network for action quality assessment,
M. Li, H.-B. Zhang, Q. Lei, Z. Fan, J. Liu, and J.-X. Du, “Pairwise contrastive learning network for action quality assessment,” inProc. Eur. Conf. Comput. Vis.Springer, 2022, pp. 457–473
work page 2022
-
[29]
Finediving: A fine- grained dataset for procedure-aware action quality assessment,
J. Xu, Y . Rao, X. Yu, G. Chen, J. Zhou, and J. Lu, “Finediving: A fine- grained dataset for procedure-aware action quality assessment,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 2949–2958
work page 2022
-
[30]
Action quality assessment with temporal parsing transformer,
Y . Bai, D. Zhou, S. Zhang, J. Wang, E. Ding, Y . Guan, Y . Long, and J. Wang, “Action quality assessment with temporal parsing transformer,” inProc. Eur. Conf. Comput. Vis.Springer, 2022, pp. 422–438
work page 2022
-
[31]
Likert scoring with grade decoupling for long-term action assessment,
A. Xu, L.-A. Zeng, and W.-S. Zheng, “Likert scoring with grade decoupling for long-term action assessment,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 3232–3241. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14
work page 2022
-
[32]
Localization- assisted uncertainty score disentanglement network for action quality assessment,
Y . Ji, L. Ye, H. Huang, L. Mao, Y . Zhou, and L. Gao, “Localization- assisted uncertainty score disentanglement network for action quality assessment,” inProc. ACM Int. Conf. Multimedia, 2023, pp. 8590–8597
work page 2023
-
[33]
Cofinal: Enhancing action quality assessment with coarse-to-fine instruction alignment,
K. Zhou, J. Li, R. Cai, L. Wang, X. Zhang, and X. Liang, “Cofinal: Enhancing action quality assessment with coarse-to-fine instruction alignment,”Proc. Int. Joint Conf. Artif. Intelligence., 2024
work page 2024
-
[34]
Quality-guided vision- language learning for long-term action quality assessment,
H. Xu, H. Wu, X. Ke, Y . Li, R. Xu, and W. Guo, “Quality-guided vision- language learning for long-term action quality assessment,”IEEE Trans. Multimedia, 2025
work page 2025
-
[35]
F. Wang, Q. Wang, and P. Zhao, “Learning long-range action repre- sentation by two-stream mamba pyramid network for figure skating assessment,” inProc. ACM Int. Conf. Multimedia, 2025, pp. 867–875
work page 2025
-
[36]
Attention-driven multimodal align- ment for long-term action quality assessment,
X. Wang, P.-J. Li, and Y .-Y . Shen, “Attention-driven multimodal align- ment for long-term action quality assessment,”Appl. Soft. Comput., p. 113649, 2025
work page 2025
-
[37]
Bert: Pre-training of deep bidirectional transformers for language understanding,
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” inProc. Conf. North Am. Assoc. Comput. Linguist., 2019, pp. 4171–4186
work page 2019
-
[38]
Language models are few-shot learners,
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language models are few-shot learners,”Adv. Neural Inf. Process. Syst., vol. 33, pp. 1877– 1901, 2020
work page 1901
-
[39]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Visual-semantic alignment temporal parsing for action quality assessment,
K. Gedamu, Y . Ji, Y . Yang, J. Shao, and H. T. Shen, “Visual-semantic alignment temporal parsing for action quality assessment,”IEEE Trans. Circuits Syst. Video Technol., vol. 35, no. 3, pp. 2436–2449, 2025
work page 2025
-
[41]
Rica 2: Rubric- informed, calibrated assessment of actions,
A. Majeedi, V . R. Gajjala, S. S. S. N. GNVV , and Y . Li, “Rica 2: Rubric- informed, calibrated assessment of actions,” inProc. Eur. Conf. Comput. Vis.Springer, 2024, pp. 143–161
work page 2024
-
[42]
Hybrid dynamic-static context-aware attention network for action assessment in long videos,
L.-A. Zeng, F.-T. Hong, W.-S. Zheng, Q.-Z. Yu, W. Zeng, Y .-W. Wang, and J.-H. Lai, “Hybrid dynamic-static context-aware attention network for action assessment in long videos,” inProc. ACM Int. Conf. Multimedia, 2020, pp. 2526–2534
work page 2020
-
[43]
Learning to score figure skating sport videos,
C. Xu, Y . Fu, B. Zhang, Z. Chen, Y .-G. Jiang, and X. Xue, “Learning to score figure skating sport videos,”IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 12, pp. 4578–4590, 2020
work page 2020
-
[44]
Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection,
Y . Liu, S. Li, Y . Wu, C.-W. Chen, Y . Shan, and X. Qie, “Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3042–3051
work page 2022
-
[45]
Msaf: Multimodal split attention fusion,
L. Su, C. Hu, G. Li, and D. Cao, “Msaf: Multimodal split attention fusion,”arXiv preprint arXiv:2012.07175, 2020
-
[46]
Action quality assessment across multiple actions,
P. Parmar and B. Morris, “Action quality assessment across multiple actions,” inProc. IEEE Winter Conf. Appl. Comput. Vis.IEEE, 2019, pp. 1468–1476
work page 2019
-
[47]
Video and accelerometer-based motion analysis for automated surgical skills assessment,
A. Zia, Y . Sharma, V . Bettadapura, E. L. Sarin, and I. Essa, “Video and accelerometer-based motion analysis for automated surgical skills assessment,”Int. J. Comput. Assist. Radiol. Surg., vol. 13, pp. 443–455, 2018
work page 2018
-
[48]
Ast: Audio spectrogram transformer,
Y . Gong, Y .-A. Chung, and J. Glass, “Ast: Audio spectrogram trans- former,”arXiv preprint arXiv:2104.01778, 2021
-
[49]
Audio set: An ontology and human- labeled dataset for audio events,
J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human- labeled dataset for audio events,” inInt. Conf. Acoust. Speech Signal Process. (ICASSP). IEEE, 2017, pp. 776–780
work page 2017
-
[50]
Learning to score olympic events,
P. Parmar and B. Tran Morris, “Learning to score olympic events,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 20–28
work page 2017
-
[51]
Joint visual and audio learning for video highlight detection,
T. Badamdorj, M. Rochan, Y . Wang, and L. Cheng, “Joint visual and audio learning for video highlight detection,” inProc. Int. Conf. Comput. Vis., 2021, pp. 8107–8117
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.