arXiv: 2604.08613 · v1 · submitted 2026-04-09 · 💻 cs.CV

ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction

Kun Wang, Yupeng Hu, Zhiran Li, Hao Liu, Qianlong Xiang, Liqiang Nie

Pith reviewed 2026-05-10 18:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords video saliency prediction · multi-expert ensemble · adaptive gating · spatio-temporal features · NTIRE challenge · inference fusion · ensemble learning · computer vision


The pith

ViSAGE aggregates diverse inductive biases via specialized decoders with adaptive gating and inference-time fusion to achieve top ranks in video saliency prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ViSAGE, a multi-expert ensemble framework for video saliency prediction. Each specialized decoder applies adaptive gating and modulation to refine spatio-temporal features extracted from the video input, and the complementary predictions of the experts are fused at inference time. On the private test set of the NTIRE 2026 challenge, this design ranked first on two of the four evaluation metrics and outperformed most other entries on the remaining two. The approach matters because it offers a concrete way to combine multiple modeling biases for a task in which a single architecture often struggles to cover the full range of motion and attention patterns in video.

Core claim

ViSAGE is a multi-expert ensemble framework for video saliency prediction in which each specialized decoder performs adaptive gating and modulation to refine spatio-temporal features. The complementary predictions from the different experts are fused at inference time. This design aggregates diverse inductive biases to capture complex spatio-temporal saliency cues in videos. On the private test set of the NTIRE 2026 Challenge, the method ranked first on two out of four evaluation metrics and outperformed most other solutions on the remaining two.

What carries the argument

ViSAGE multi-expert ensemble: specialized decoders each apply adaptive gating and modulation to spatio-temporal features, with predictions fused at inference time.
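
To make the mechanism concrete, here is a minimal NumPy sketch of one way a decoder could gate and modulate spatio-temporal features. It is not the authors' implementation: the function name `gated_modulation`, the pooled context vector, the weight matrices, and the tensor shapes are all assumptions chosen for illustration. The modulation follows the general scale-and-shift pattern of conditioning layers, and the gate blends the modulated features back with the originals.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_modulation(features, w_gate, w_scale, w_shift):
    """Refine features of shape (T, H, W, C) with a channel-wise gate and affine modulation.

    Hypothetical sketch: the context vector and weight shapes are assumptions,
    not details taken from the paper.
    """
    context = features.mean(axis=(0, 1, 2))   # (C,) global spatio-temporal average
    gate = sigmoid(context @ w_gate)          # (C,), values in (0, 1)
    scale = 1.0 + context @ w_scale           # (C,), multiplicative term centered on identity
    shift = context @ w_shift                 # (C,), additive term
    modulated = features * scale + shift      # broadcasts over T, H, W
    # The gate interpolates between the modulated and the original features.
    return gate * modulated + (1.0 - gate) * features

rng = np.random.default_rng(0)
T, H, W, C = 4, 8, 8, 16
features = rng.standard_normal((T, H, W, C))
w_gate, w_scale, w_shift = (0.1 * rng.standard_normal((C, C)) for _ in range(3))
refined = gated_modulation(features, w_gate, w_scale, w_shift)
print(refined.shape)  # (4, 8, 8, 16)
```

With all-zero weights the block reduces to the identity mapping, which is a convenient sanity check before any training.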

If this is right

The framework aggregates complementary inductive biases to handle complex spatio-temporal saliency cues more effectively than most competing single-model solutions.
Adaptive gating within each decoder refines features in a manner tailored to the expert's specialization.
Inference-time fusion combines the experts without requiring changes to the training procedure.
The resulting model demonstrates strong performance and generalization on the held-out private test set of the challenge.
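
The inference-time fusion mentioned in this list can be sketched as a weighted average of per-expert saliency maps. This is a hedged illustration, not the paper's fusion rule: the text reviewed here does not specify the weights or the normalization, so `fuse_predictions`, the uniform default weights, and the sum-to-one normalization are assumptions.

```python
import numpy as np

def normalize_map(saliency, eps=1e-8):
    """Clip negatives and rescale a saliency map so it sums to (approximately) one."""
    saliency = np.clip(saliency, 0.0, None)
    return saliency / (saliency.sum() + eps)

def fuse_predictions(expert_maps, weights=None):
    """Fuse a list of (H, W) expert maps by a weighted average (assumed fusion rule)."""
    maps = np.stack([normalize_map(m) for m in expert_maps])  # (E, H, W)
    if weights is None:
        weights = np.ones(len(maps))                          # uniform weights by default
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                         # weights sum to one
    fused = np.tensordot(weights, maps, axes=1)               # (H, W)
    return normalize_map(fused)

rng = np.random.default_rng(0)
expert_maps = [rng.random((8, 8)) for _ in range(3)]          # stand-ins for expert outputs
fused = fuse_predictions(expert_maps)
print(fused.shape)  # (8, 8)
```

Because the function only consumes finished predictions, it can be applied to already-trained experts, which is the sense in which fusion leaves the training procedure untouched.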

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The design suggests that video saliency benefits from modular specialization rather than monolithic scaling of a single network.
Inference-time fusion could be extended to other video tasks such as action recognition or video summarization where multiple attention cues matter.
A practical test would be to measure whether the same set of experts maintains its advantage on uncurated, real-world video streams outside the challenge dataset.
The approach highlights a trade-off between training separate decoders and the added inference cost of running and fusing them.

Load-bearing premise

That multiple specialized decoders with adaptive gating will supply sufficiently complementary information whose fusion at inference time reliably improves saliency prediction over single-model or alternative ensemble designs.

What would settle it

An ablation on the same private test set in which the fused output scores no higher than the single best expert, or lower than a simple non-adaptive average of the experts, on all four metrics would indicate that the gating and fusion steps do not deliver the claimed advantage.
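
The decision rule in this paragraph can be written down as a small check. The sketch below is illustrative only: the metric names `m1` to `m4` are placeholders (the four challenge metrics are not named here), the scores are invented numbers rather than results, and the code assumes that higher is better on every metric.

```python
def fusion_claim_undermined(fused, experts, simple_average):
    """Return True if, on every metric, the fused output is no better than the
    best single expert or is worse than a plain average of the experts.

    fused, simple_average: dict mapping metric name -> score.
    experts: list of such dicts, one per expert.
    Assumes higher scores are better on all metrics.
    """
    for metric, fused_score in fused.items():
        best_single = max(expert[metric] for expert in experts)
        fails = fused_score <= best_single or fused_score < simple_average[metric]
        if not fails:
            return False  # fusion shows a real gain on at least one metric
    return True

# Invented placeholder scores, not taken from the paper or the challenge.
experts = [
    {"m1": 0.70, "m2": 0.60, "m3": 0.50, "m4": 0.80},
    {"m1": 0.68, "m2": 0.58, "m3": 0.47, "m4": 0.79},
]
simple_average = {"m1": 0.71, "m2": 0.61, "m3": 0.51, "m4": 0.81}
fused_helpful = {"m1": 0.72, "m2": 0.63, "m3": 0.52, "m4": 0.82}
fused_unhelpful = {"m1": 0.69, "m2": 0.60, "m3": 0.49, "m4": 0.80}

print(fusion_claim_undermined(fused_helpful, experts, simple_average))    # False
print(fusion_claim_undermined(fused_unhelpful, experts, simple_average))  # True
```

For a metric where lower is better, the score would need to be negated before applying the check.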

Figures

Figures reproduced from arXiv: 2604.08613 by Hao Liu, Kun Wang, Liqiang Nie, Qianlong Xiang, Yupeng Hu, Zhiran Li.

**Figure 1.** Overview of the proposed ViSAGE framework for video saliency prediction. Our method utilizes a shared InternVideo2 backbone

**Figure 2.** Visualization of predicted saliency under different types

Original abstract

In this report, we present our champion solution for the NTIRE 2026 Challenge on Video Saliency Prediction held in conjunction with CVPR 2026. To exploit complementary inductive biases for video saliency, we propose Video Saliency with Adaptive Gated Experts (ViSAGE), a multi-expert ensemble framework. Each specialized decoder performs adaptive gating and modulation to refine spatio-temporal features. The complementary predictions from different experts are then fused at inference. ViSAGE thereby aggregates diverse inductive biases to capture complex spatio-temporal saliency cues in videos. On the Private Test set, ViSAGE ranked first on two out of four evaluation metrics, and outperformed most competing solutions on the other two metrics, demonstrating its effectiveness and generalization ability. Our code has been released at https://github.com/iLearn-Lab/CVPRW26-ViSAGE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A challenge report on a multi-expert video saliency model that tops two metrics but skips the ablations needed to explain why.


The key takeaway is that this team took first place on two of the four private-test metrics in the NTIRE 2026 video saliency challenge using their ViSAGE multi-expert system with adaptive gating. They built a framework in which each expert decoder applies its own adaptive gating and modulation to spatio-temporal features from the video, and the outputs are then fused during inference. The idea is that the different experts bring complementary biases that together handle complex saliency patterns better. They released the code, which is a plus for anyone wanting to reproduce or build on it.

The paper does a good job of laying out the high-level architecture and clearly stating the ranking results. It shows generalization by performing well on the held-out private set.

Where it falls short is the missing experimental support for the main design choices. There is no single-expert baseline, no ablation of the gating mechanism against fixed fusion, and no error analysis or comparison to simple ensembles. Without those, it is difficult to know whether the adaptive experts made the difference or whether other implementation details carried the load. The description stays at a summary level with no equations or detailed diagrams.

This kind of paper is useful for specialists in video saliency prediction who track challenge results and look for practical implementations. Someone outside that niche probably won't find much to take away. I would send it to peer review. The top ranking on a public benchmark plus the code release makes it worth a referee's time, even if revisions would need to add the missing controls and deeper analysis.

Referee Report

2 major / 0 minor

Summary. The paper presents ViSAGE, a multi-expert ensemble framework for video saliency prediction submitted to the NTIRE 2026 Challenge. It consists of specialized decoders that apply adaptive gating and modulation to spatio-temporal features, with complementary predictions fused at inference time to aggregate diverse inductive biases. The central empirical claim is that this design ranked first on two of four metrics on the private test set while outperforming most competing solutions on the remaining metrics, demonstrating effectiveness and generalization.

Significance. If the observed rankings can be causally linked to the adaptive gating and multi-expert fusion rather than base architecture or training choices, the work would provide a useful demonstration of how complementary inductive biases can improve spatio-temporal saliency modeling. The public release of code at the cited GitHub repository is a clear strength that supports reproducibility and follow-on research.

major comments (2)

[ViSAGE framework description] The method description supplies only a high-level narrative of the specialized decoders, adaptive gating, and inference-time fusion with no equations, diagrams, or implementation details. This prevents verification of how the claimed aggregation of inductive biases is realized in practice.
[Experimental evaluation] No ablation experiments are reported (e.g., single-expert baseline, non-adaptive gating, or simple averaging fusion). Without these controls, the first-place rankings on two private-test metrics cannot be used to confirm that the performance gains arise from the proposed components rather than model scale, data, or ensemble size.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our NTIRE 2026 challenge report. We address the major points below, noting the constraints of a challenge submission while committing to improvements where feasible.


Referee: [ViSAGE framework description] The method description supplies only a high-level narrative of the specialized decoders, adaptive gating, and inference-time fusion with no equations, diagrams, or implementation details. This prevents verification of how the claimed aggregation of inductive biases is realized in practice.

Authors: We acknowledge that the manuscript presents a high-level overview, consistent with the format of many challenge reports under page limits. The full technical realization, including the adaptive gating mechanism, modulation operations, expert specialization, and inference-time fusion, is implemented in the publicly released code repository (https://github.com/iLearn-Lab/CVPRW26-ViSAGE). In the revised manuscript we will add a framework diagram and the key equations governing the gating and fusion steps to make the aggregation of inductive biases explicit.

revision: yes

Referee: [Experimental evaluation] No ablation experiments are reported (e.g., single-expert baseline, non-adaptive gating, or simple averaging fusion). Without these controls, the first-place rankings on two private-test metrics cannot be used to confirm that the performance gains arise from the proposed components rather than model scale, data, or ensemble size.

Authors: We agree that controlled ablations would help isolate the contribution of adaptive gating and multi-expert fusion. However, this paper reports the final submitted solution for the NTIRE challenge; the private test set remains inaccessible after the challenge deadline, precluding new experiments on the reported metrics. The released code permits the community to perform ablations on the public validation split. We will expand the manuscript with a discussion of the design rationale for each component and why the observed rankings on the hidden test set support the overall approach.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical challenge ranking with no derivation chain


The paper is a challenge report describing the ViSAGE multi-expert framework at a high level and stating its empirical rankings on the NTIRE 2026 private test set (first on two of four metrics). No equations, parameter fittings, derivations, or self-citations appear in the provided text. The central claim is a direct outcome of external challenge evaluation rather than any internal reduction to inputs by construction. No load-bearing steps of the enumerated circularity patterns exist.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities used in the model.


