ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction
Pith reviewed 2026-05-10 18:34 UTC · model grok-4.3
The pith
ViSAGE aggregates diverse inductive biases via specialized decoders with adaptive gating and inference-time fusion to achieve top ranks in video saliency prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ViSAGE is a multi-expert ensemble framework for video saliency prediction in which each specialized decoder performs adaptive gating and modulation to refine spatio-temporal features. The complementary predictions from the different experts are fused at inference time. This design aggregates diverse inductive biases to capture complex spatio-temporal saliency cues in videos. On the private test set of the NTIRE 2026 Challenge, the method ranked first on two out of four evaluation metrics and outperformed most other solutions on the remaining two.
What carries the argument
ViSAGE multi-expert ensemble: specialized decoders each apply adaptive gating and modulation to spatio-temporal features, with predictions fused at inference time.
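The paper text reviewed here does not spell out the fusion rule, so the following is only a minimal sketch of what inference-time fusion of expert predictions could look like. It assumes each expert outputs a non-negative saliency map for the same frame and that fusion is a weighted average of sum-normalized maps. The names `normalize`, `fuse`, and the weights are hypothetical and are not taken from the released code.

```python
from typing import List, Sequence

Map = List[List[float]]  # one saliency map: rows of non-negative floats


def normalize(m: Map) -> Map:
    """Scale a map so its entries sum to 1 (uniform if the map is all zeros)."""
    h, w = len(m), len(m[0])
    total = sum(sum(row) for row in m)
    if total <= 0:
        return [[1.0 / (h * w)] * w for _ in range(h)]
    return [[v / total for v in row] for row in m]


def fuse(expert_maps: Sequence[Map], weights: Sequence[float]) -> Map:
    """Weighted average of normalized expert maps (assumed fusion rule)."""
    if not expert_maps or len(expert_maps) != len(weights):
        raise ValueError("need one weight per expert map")
    wsum = sum(weights)
    if wsum <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [normalize(m) for m in expert_maps]
    h, w = len(norm[0]), len(norm[0][0])
    return [
        [sum(wk * nm[i][j] for wk, nm in zip(weights, norm)) / wsum
         for j in range(w)]
        for i in range(h)
    ]


# Two toy 2x2 expert predictions for the same frame.
expert_a = [[4.0, 0.0], [0.0, 0.0]]   # sharply peaked expert
expert_b = [[1.0, 1.0], [1.0, 1.0]]   # diffuse expert
fused = fuse([expert_a, expert_b], weights=[1.0, 1.0])
```

Because fusion in this sketch happens only on outputs, experts can be added or dropped without retraining, at the cost of one forward pass per expert.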
If this is right
- The framework aggregates complementary inductive biases to handle complex spatio-temporal saliency cues more effectively than most competing solutions in the challenge.
- Adaptive gating within each decoder refines features in a manner tailored to the expert's specialization.
- Inference-time fusion combines the experts without requiring changes to the training procedure.
- The resulting model demonstrates strong performance and generalization on the held-out private test set of the challenge.
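The adaptive gating and modulation mentioned in the list above is described only in words. One common way to realize such a step is a feature-wise affine modulation followed by a sigmoid gate; the sketch below illustrates that pattern on a single feature vector. The function name, the per-channel parameters `gamma` and `beta`, and the gate logits are all assumptions made for illustration and may differ from what the authors implemented.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gate_and_modulate(
    features: Sequence[float],
    gate_logits: Sequence[float],
    gamma: Sequence[float],
    beta: Sequence[float],
) -> List[float]:
    """Per channel c: y_c = sigmoid(g_c) * (gamma_c * x_c + beta_c)."""
    if not (len(features) == len(gate_logits) == len(gamma) == len(beta)):
        raise ValueError("all inputs must have one entry per channel")
    return [
        sigmoid(g) * (s * x + b)
        for x, g, s, b in zip(features, gate_logits, gamma, beta)
    ]


# Channel 0 is half-open (logit 0), channel 1 is almost fully open.
out = gate_and_modulate(
    features=[1.0, 2.0],
    gate_logits=[0.0, 100.0],
    gamma=[2.0, 1.0],
    beta=[0.0, -1.0],
)
```

In a real decoder the gate logits and modulation parameters would presumably be predicted from the features themselves, which is what would make the gating adaptive; here they are fixed numbers to keep the example small.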
Where Pith is reading between the lines
- The design suggests that video saliency benefits from modular specialization rather than monolithic scaling of a single network.
- Inference-time fusion could be extended to other video tasks such as action recognition or video summarization where multiple attention cues matter.
- A practical test would be to measure whether the same set of experts maintains its advantage on uncurated, real-world video streams outside the challenge dataset.
- The approach highlights a trade-off between training separate decoders and the added inference cost of running and fusing them.
Load-bearing premise
That multiple specialized decoders with adaptive gating will supply sufficiently complementary information whose fusion at inference time reliably improves saliency prediction over single-model or alternative ensemble designs.
What would settle it
An ablation on the same private test set in which the fused output scores no higher than the single best expert, or lower than a simple non-adaptive average of the experts, on all four metrics would indicate that the gating and fusion steps do not deliver the claimed advantage.
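The test above can be written down as a small decision rule. The sketch assumes every metric is oriented so that higher is better (a lower-is-better metric would need to be negated first) and uses placeholder metric names `m1` to `m4`, since the four challenge metrics are not named in this text. All scores are made-up toy numbers.

```python
from typing import Dict, List

Scores = Dict[str, float]  # metric name -> score, higher is better


def fusion_refuted(fused: Scores, experts: List[Scores], uniform_avg: Scores) -> bool:
    """Return True if the ablation outcome described above is observed.

    Condition A: on every metric, fused is no higher than the best single expert.
    Condition B: on every metric, fused is lower than the uniform average.
    """
    metrics = list(fused)
    no_gain_over_best = all(
        fused[m] <= max(e[m] for e in experts) for m in metrics
    )
    worse_than_average = all(fused[m] < uniform_avg[m] for m in metrics)
    return no_gain_over_best or worse_than_average


# Toy scores (hypothetical, for illustration only).
experts = [
    {"m1": 0.78, "m2": 0.69, "m3": 0.61, "m4": 0.49},
    {"m1": 0.79, "m2": 0.68, "m3": 0.58, "m4": 0.48},
]
uniform_avg = {"m1": 0.79, "m2": 0.69, "m3": 0.59, "m4": 0.49}

fused_helps = {"m1": 0.80, "m2": 0.70, "m3": 0.60, "m4": 0.50}
fused_flat = {"m1": 0.79, "m2": 0.69, "m3": 0.61, "m4": 0.49}

verdict_helps = fusion_refuted(fused_helps, experts, uniform_avg)  # False
verdict_flat = fusion_refuted(fused_flat, experts, uniform_avg)    # True
```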
Original abstract
In this report, we present our champion solution for the NTIRE 2026 Challenge on Video Saliency Prediction held in conjunction with CVPR 2026. To exploit complementary inductive biases for video saliency, we propose Video Saliency with Adaptive Gated Experts (ViSAGE), a multi-expert ensemble framework. Each specialized decoder performs adaptive gating and modulation to refine spatio-temporal features. The complementary predictions from different experts are then fused at inference. ViSAGE thereby aggregates diverse inductive biases to capture complex spatio-temporal saliency cues in videos. On the Private Test set, ViSAGE ranked first on two out of four evaluation metrics, and outperformed most competing solutions on the other two metrics, demonstrating its effectiveness and generalization ability. Our code has been released at https://github.com/iLearn-Lab/CVPRW26-ViSAGE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ViSAGE, a multi-expert ensemble framework for video saliency prediction submitted to the NTIRE 2026 Challenge. It consists of specialized decoders that apply adaptive gating and modulation to spatio-temporal features, with complementary predictions fused at inference time to aggregate diverse inductive biases. The central empirical claim is that this design ranked first on two of four metrics on the private test set while outperforming most competing solutions on the remaining metrics, demonstrating effectiveness and generalization.
Significance. If the observed rankings can be causally linked to the adaptive gating and multi-expert fusion rather than base architecture or training choices, the work would provide a useful demonstration of how complementary inductive biases can improve spatio-temporal saliency modeling. The public release of code at the cited GitHub repository is a clear strength that supports reproducibility and follow-on research.
Major comments (2)
- [ViSAGE framework description] The method description supplies only a high-level narrative of the specialized decoders, adaptive gating, and inference-time fusion with no equations, diagrams, or implementation details. This prevents verification of how the claimed aggregation of inductive biases is realized in practice.
- [Experimental evaluation] No ablation experiments are reported (e.g., single-expert baseline, non-adaptive gating, or simple averaging fusion). Without these controls, the first-place rankings on two private-test metrics cannot be used to confirm that the performance gains arise from the proposed components rather than model scale, data, or ensemble size.
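To make the requested controls concrete, here is a hedged sketch of an ablation table that scores each single expert and a uniform (non-adaptive) average against a ground-truth map. Pearson's linear correlation coefficient (CC), a widely used saliency metric, serves as a stand-in because the four challenge metrics are not named in this text. The helper names and the toy maps are hypothetical.

```python
import math
from typing import Dict, List

Map = List[List[float]]


def cc(pred: Map, target: Map) -> float:
    """Pearson correlation between two flattened maps (0.0 if either is constant)."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    mp, mt = sum(p) / len(p), sum(t) / len(t)
    cov = sum((a - mp) * (b - mt) for a, b in zip(p, t))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    st = math.sqrt(sum((b - mt) ** 2 for b in t))
    if sp == 0 or st == 0:
        return 0.0
    return cov / (sp * st)


def uniform_average(maps: List[Map]) -> Map:
    """Element-wise mean of the expert maps (the non-adaptive fusion control)."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) / len(maps) for j in range(w)]
            for i in range(h)]


def ablation_table(experts: Dict[str, Map], target: Map) -> Dict[str, float]:
    """CC of every single expert plus the uniform-average control."""
    table = {name: cc(m, target) for name, m in experts.items()}
    table["uniform_average"] = cc(uniform_average(list(experts.values())), target)
    return table


# Toy 2x2 data: expert "a" matches the target, expert "b" misses it.
target = [[1.0, 0.0], [0.0, 0.0]]
experts = {
    "a": [[1.0, 0.0], [0.0, 0.0]],
    "b": [[0.0, 1.0], [0.0, 0.0]],
}
table = ablation_table(experts, target)
```

A full ablation would add the adaptive-gating and fused variants as extra rows and repeat the comparison for every metric on a split with available ground truth.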
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our NTIRE 2026 challenge report. We address the major points below, noting the constraints of a challenge submission while committing to improvements where feasible.
Point-by-point responses
- Referee: [ViSAGE framework description] The method description supplies only a high-level narrative of the specialized decoders, adaptive gating, and inference-time fusion with no equations, diagrams, or implementation details. This prevents verification of how the claimed aggregation of inductive biases is realized in practice.
  Authors: We acknowledge that the manuscript presents a high-level overview, consistent with the format of many challenge reports under page limits. The full technical realization, including the adaptive gating mechanism, modulation operations, expert specialization, and inference-time fusion, is implemented in the publicly released code repository (https://github.com/iLearn-Lab/CVPRW26-ViSAGE). In the revised manuscript we will add a framework diagram and the key equations governing the gating and fusion steps to make the aggregation of inductive biases explicit.
  Revision: yes
- Referee: [Experimental evaluation] No ablation experiments are reported (e.g., single-expert baseline, non-adaptive gating, or simple averaging fusion). Without these controls, the first-place rankings on two private-test metrics cannot be used to confirm that the performance gains arise from the proposed components rather than model scale, data, or ensemble size.
  Authors: We agree that controlled ablations would help isolate the contribution of adaptive gating and multi-expert fusion. However, this paper reports the final submitted solution for the NTIRE challenge; the private test set remains inaccessible after the challenge deadline, precluding new experiments on the reported metrics. The released code permits the community to perform ablations on the public validation split. We will expand the manuscript with a discussion of the design rationale for each component and why the observed rankings on the hidden test set support the overall approach.
  Revision: partial
Circularity Check
No significant circularity; empirical challenge ranking with no derivation chain
The paper is a challenge report describing the ViSAGE multi-expert framework at a high level and stating its empirical rankings on the NTIRE 2026 private test set (first on two of four metrics). No equations, parameter fittings, derivations, or self-citations appear in the provided text. The central claim is a direct outcome of external challenge evaluation rather than any internal reduction to inputs by construction. None of the enumerated circularity patterns appears in a load-bearing step.