pith. machine review for the scientific record.

arxiv: 2604.08613 · v1 · submitted 2026-04-09 · 💻 cs.CV

ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction

Pith reviewed 2026-05-10 18:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords: video saliency prediction, multi-expert ensemble, adaptive gating, spatio-temporal features, NTIRE challenge, inference fusion, ensemble learning, computer vision

The pith

ViSAGE aggregates diverse inductive biases via specialized decoders with adaptive gating and inference-time fusion to achieve top ranks in video saliency prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ViSAGE, a multi-expert ensemble framework designed for video saliency prediction. Each specialized decoder applies adaptive gating and modulation to refine spatio-temporal features from video input. Complementary predictions produced by the different experts are then fused at inference time. With this structure, ViSAGE ranked first on two of four evaluation metrics on the NTIRE 2026 challenge private test set and outperformed most other entries on the remaining two. A reader would care because the approach offers a concrete way to combine multiple modeling biases for a task where single architectures often struggle with the full range of motion and attention patterns in video.

Core claim

ViSAGE is a multi-expert ensemble framework for video saliency prediction in which each specialized decoder performs adaptive gating and modulation to refine spatio-temporal features. The complementary predictions from the different experts are fused at inference time. This design aggregates diverse inductive biases to capture complex spatio-temporal saliency cues in videos. On the private test set of the NTIRE 2026 Challenge, the method ranked first on two out of four evaluation metrics and outperformed most other solutions on the remaining two.

What carries the argument

ViSAGE multi-expert ensemble: specialized decoders each apply adaptive gating and modulation to spatio-temporal features, with predictions fused at inference time.
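
The text reviewed here gives no equations for the gating or modulation step. As a minimal sketch of one plausible reading, the Python below blends each feature with a scaled-and-shifted copy of itself through a sigmoid gate, in the style of feature-wise affine modulation. The function name `gated_modulation`, its parameters, and the per-feature affine form are assumptions made for illustration, not the authors' formulation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_modulation(features, gate_logits, scale, shift):
    """Hypothetical gated modulation of a feature vector.

    For each feature x_i:
        g_i   = sigmoid(gate_logit_i)
        out_i = g_i * (scale_i * x_i + shift_i) + (1 - g_i) * x_i

    A gate near 0 passes the feature through unchanged; a gate near 1
    replaces it with the modulated value.
    """
    out = []
    for x, z, s, b in zip(features, gate_logits, scale, shift):
        g = sigmoid(z)
        out.append(g * (s * x + b) + (1.0 - g) * x)
    return out
```

Under these assumptions, calling `gated_modulation([1.0, 2.0], [0.0, 0.0], [2.0, 0.5], [0.0, 1.0])` uses a half-open gate on both features, so each output is the midpoint between the raw and the modulated value. In a real decoder the gate logits, scales, and shifts would be predicted by learned layers rather than passed in by hand.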

If this is right

  • The framework aggregates complementary inductive biases to handle complex spatio-temporal saliency cues more effectively than most competing single-model solutions.
  • Adaptive gating within each decoder refines features in a manner tailored to the expert's specialization.
  • Inference-time fusion combines the experts without requiring changes to the training procedure.
  • The resulting model demonstrates strong performance and generalization on the held-out private test set of the challenge.
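
The inference-time fusion described in the points above can be sketched as a weighted average of the experts' saliency maps, followed by renormalization. This is only an illustration: the source does not state the fusion rule, so the function `fuse_saliency`, the optional `weights` argument, and the choice to normalize the result to unit mass are all assumptions.

```python
def fuse_saliency(expert_maps, weights=None):
    """Hypothetical fusion: weighted mean of expert maps, normalized to sum to 1.

    expert_maps: list of 2-D lists (rows of non-negative floats), all the same shape.
    weights:     optional per-expert weights; equal weighting if omitted.
    """
    if weights is None:
        weights = [1.0] * len(expert_maps)
    total_weight = sum(weights)
    height = len(expert_maps[0])
    width = len(expert_maps[0][0])

    # Weighted mean of the experts at every pixel.
    fused = [
        [
            sum(w * m[r][c] for w, m in zip(weights, expert_maps)) / total_weight
            for c in range(width)
        ]
        for r in range(height)
    ]

    # Rescale so the map sums to 1; an all-zero map is returned unchanged.
    mass = sum(sum(row) for row in fused)
    if mass > 0:
        fused = [[v / mass for v in row] for row in fused]
    return fused
```

Because this step operates only on the experts' outputs, it needs no access to their weights or training code, which is consistent with the point that fusion requires no change to the training procedure.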

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The design suggests that video saliency benefits from modular specialization rather than monolithic scaling of a single network.
  • Inference-time fusion could be extended to other video tasks such as action recognition or video summarization where multiple attention cues matter.
  • A practical test would be to measure whether the same set of experts maintains its advantage on uncurated, real-world video streams outside the challenge dataset.
  • The approach highlights a trade-off between training separate decoders and the added inference cost of running and fusing them.

Load-bearing premise

That multiple specialized decoders with adaptive gating will supply sufficiently complementary information whose fusion at inference time reliably improves saliency prediction over single-model or alternative ensemble designs.

What would settle it

An ablation on the same private test set in which the fused output scores no higher than the single best expert, or lower than a simple non-adaptive average of the experts, on all four metrics would indicate that the gating and fusion steps do not deliver the claimed advantage.
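
The decision rule in this test can be written down directly. The sketch below assumes that every metric is "higher is better" and uses placeholder metric names; the helper `fusion_adds_value` and its inputs are hypothetical, and the scores in the usage note are made-up values, not results from the paper.

```python
def fusion_adds_value(fused, best_single, simple_average):
    """Apply the ablation criterion per metric (assumes higher is better).

    Each argument maps a metric name to a score. On a given metric the fused
    model "passes" only if it scores strictly higher than the best single
    expert and no lower than the simple average of the experts.

    Returns (supported, per_metric), where `supported` is False only when the
    fused model fails on every metric.
    """
    per_metric = {}
    for metric, score in fused.items():
        beats_best = score > best_single[metric]
        matches_average = score >= simple_average[metric]
        per_metric[metric] = beats_best and matches_average
    return any(per_metric.values()), per_metric
```

For example, with illustrative scores `fused = {"metric_a": 0.60, "metric_b": 0.50}`, `best_single = {"metric_a": 0.58, "metric_b": 0.55}`, and `simple_average = {"metric_a": 0.59, "metric_b": 0.52}`, the fused model passes on `metric_a` only, so the claim would not be refuted under this rule. Metrics where lower is better would need their comparisons inverted.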

Figures

Figures reproduced from arXiv: 2604.08613 by Hao Liu, Kun Wang, Liqiang Nie, Qianlong Xiang, Yupeng Hu, Zhiran Li.

Figure 1: Overview of the proposed ViSAGE framework for video saliency prediction. Our method utilizes a shared InternVideo2 backbone
Figure 2: Visualization of predicted saliency under different types

Original abstract

In this report, we present our champion solution for the NTIRE 2026 Challenge on Video Saliency Prediction held in conjunction with CVPR 2026. To exploit complementary inductive biases for video saliency, we propose Video Saliency with Adaptive Gated Experts (ViSAGE), a multi-expert ensemble framework. Each specialized decoder performs adaptive gating and modulation to refine spatio-temporal features. The complementary predictions from different experts are then fused at inference. ViSAGE thereby aggregates diverse inductive biases to capture complex spatio-temporal saliency cues in videos. On the Private Test set, ViSAGE ranked first on two out of four evaluation metrics, and outperformed most competing solutions on the other two metrics, demonstrating its effectiveness and generalization ability. Our code has been released at https://github.com/iLearn-Lab/CVPRW26-ViSAGE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents ViSAGE, a multi-expert ensemble framework for video saliency prediction submitted to the NTIRE 2026 Challenge. It consists of specialized decoders that apply adaptive gating and modulation to spatio-temporal features, with complementary predictions fused at inference time to aggregate diverse inductive biases. The central empirical claim is that this design ranked first on two of four metrics on the private test set while outperforming most competing solutions on the remaining metrics, demonstrating effectiveness and generalization.

Significance. If the observed rankings can be causally linked to the adaptive gating and multi-expert fusion rather than base architecture or training choices, the work would provide a useful demonstration of how complementary inductive biases can improve spatio-temporal saliency modeling. The public release of code at the cited GitHub repository is a clear strength that supports reproducibility and follow-on research.

major comments (2)
  1. [ViSAGE framework description] The method description supplies only a high-level narrative of the specialized decoders, adaptive gating, and inference-time fusion with no equations, diagrams, or implementation details. This prevents verification of how the claimed aggregation of inductive biases is realized in practice.
  2. [Experimental evaluation] No ablation experiments are reported (e.g., single-expert baseline, non-adaptive gating, or simple averaging fusion). Without these controls, the first-place rankings on two private-test metrics cannot be used to confirm that the performance gains arise from the proposed components rather than model scale, data, or ensemble size.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our NTIRE 2026 challenge report. We address the major points below, noting the constraints of a challenge submission while committing to improvements where feasible.

  1. Referee: [ViSAGE framework description] The method description supplies only a high-level narrative of the specialized decoders, adaptive gating, and inference-time fusion with no equations, diagrams, or implementation details. This prevents verification of how the claimed aggregation of inductive biases is realized in practice.

    Authors: We acknowledge that the manuscript presents a high-level overview, consistent with the format of many challenge reports under page limits. The full technical realization, including the adaptive gating mechanism, modulation operations, expert specialization, and inference-time fusion, is implemented in the publicly released code repository (https://github.com/iLearn-Lab/CVPRW26-ViSAGE). In the revised manuscript we will add a framework diagram and the key equations governing the gating and fusion steps to make the aggregation of inductive biases explicit.

    Revision: yes

  2. Referee: [Experimental evaluation] No ablation experiments are reported (e.g., single-expert baseline, non-adaptive gating, or simple averaging fusion). Without these controls, the first-place rankings on two private-test metrics cannot be used to confirm that the performance gains arise from the proposed components rather than model scale, data, or ensemble size.

    Authors: We agree that controlled ablations would help isolate the contribution of adaptive gating and multi-expert fusion. However, this paper reports the final submitted solution for the NTIRE challenge; the private test set remains inaccessible after the challenge deadline, precluding new experiments on the reported metrics. The released code permits the community to perform ablations on the public validation split. We will expand the manuscript with a discussion of the design rationale for each component and why the observed rankings on the hidden test set support the overall approach.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical challenge ranking with no derivation chain


The paper is a challenge report describing the ViSAGE multi-expert framework at a high level and stating its empirical rankings on the NTIRE 2026 private test set (first on two of four metrics). No equations, parameter fittings, derivations, or self-citations appear in the provided text. The central claim is a direct outcome of external challenge evaluation rather than an internal result that reduces to its own inputs by construction. None of the enumerated circularity patterns appears in a load-bearing step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities used in the model.

pith-pipeline@v0.9.0 · 5455 in / 1007 out tokens · 41832 ms · 2026-05-10T18:34:43.579700+00:00 · methodology

