Deep Saliency Models : The Quest For The Loss Function
Pith reviewed 2026-05-25 09:17 UTC · model grok-4.3
The pith
A linear combination of several loss functions improves deep saliency model performance across datasets and architectures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On a fixed network architecture, modifying the loss function can significantly improve or degrade the results. A linear combination of several well-chosen loss functions leads to significant improvements in performance on different datasets as well as on a different network architecture, demonstrating the robustness of a combined metric.
What carries the argument
Linear combination of multiple loss functions used to train a saliency prediction network.
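A minimal sketch of such a combined loss, assuming three of the components the paper names (KLD, CC, NSS) and their common definitions from the saliency literature. The function names, the `EPS` constant, and the default weights are illustrative assumptions, not the authors' implementation or coefficients. Maps are flattened Python lists for readability; real training would need a differentiable tensor library.

```python
import math

EPS = 1e-7  # assumed small constant to avoid division by zero and log(0)


def _normalize_sum(values):
    """Scale a map so that it sums to (almost exactly) 1."""
    total = sum(values)
    return [v / (total + EPS) for v in values]


def _standardize(values):
    """Shift and scale a map to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / (std + EPS) for v in values]


def kld(pred, target):
    """Kullback-Leibler divergence of pred from target (lower is better)."""
    p = _normalize_sum(pred)
    q = _normalize_sum(target)
    return sum(qi * math.log(EPS + qi / (pi + EPS)) for pi, qi in zip(p, q))


def cc(pred, target):
    """Pearson correlation coefficient, in [-1, 1] (higher is better)."""
    zp = _standardize(pred)
    zq = _standardize(target)
    return sum(a * b for a, b in zip(zp, zq)) / len(zp)


def nss(pred, fixations):
    """Mean standardized prediction at fixated pixels (higher is better)."""
    hits = [z for z, f in zip(_standardize(pred), fixations) if f]
    if not hits:
        raise ValueError("fixation map contains no fixations")
    return sum(hits) / len(hits)


def combined_loss(pred, target, fixations, weights=(1.0, 1.0, 1.0)):
    """Weighted sum to minimize; CC and NSS are negated because they are
    similarity scores. The weights are placeholders, not reported values."""
    w_kld, w_cc, w_nss = weights
    return (w_kld * kld(pred, target)
            - w_cc * cc(pred, target)
            - w_nss * nss(pred, fixations))
```

With this sign convention a prediction that matches the ground truth receives a lower loss than one that inverts it, which is the behaviour a training objective needs.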
If this is right
- Changing only the loss function on one network can raise or lower saliency prediction scores on standard datasets.
- Loss functions not previously used for saliency can contribute usefully when included.
- A single combined loss outperforms any of its component losses alone.
- The same loss combination improves results when transferred to a different network architecture.
- The performance lift holds across multiple datasets.
Where Pith is reading between the lines
- Loss selection may be treated as an additional hyper-parameter to tune rather than a fixed choice.
- The same blending approach could be tested on other pixel-wise prediction tasks such as semantic segmentation.
- An automated search over loss weights might discover even stronger combinations than the hand-chosen ones reported.
- If the gains survive stricter controls on all other training variables, loss design would become a primary lever for model improvement.
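The automated weight search suggested above could be sketched as a plain random search. This is an illustration only: `evaluate` is a hypothetical callable that would train a model with the given loss weights and return a validation score (higher is better), and the function name, trial count, and seed are assumptions.

```python
import random


def random_search(evaluate, n_losses, n_trials=50, seed=0):
    """Sample weight vectors that sum to 1 and keep the best-scoring one."""
    rng = random.Random(seed)
    best_weights, best_score = None, float("-inf")
    for _ in range(n_trials):
        raw = [rng.random() for _ in range(n_losses)]
        total = sum(raw)
        weights = [r / total for r in raw]  # normalize onto the simplex
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


def toy_evaluate(weights, optimum=(0.5, 0.3, 0.2)):
    """Stand-in objective for demonstration; not a real training run."""
    return -sum((w - o) ** 2 for w, o in zip(weights, optimum))


best_weights, best_score = random_search(toy_evaluate, n_losses=3)
```

Each call to a real `evaluate` would be a full training run, so the search budget is the main cost. Selecting weights on the same data used for reporting would also bias the comparison the review is concerned with.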
Load-bearing premise
That measured performance differences are produced by the loss functions themselves rather than by interactions with hyper-parameters or training settings that stayed fixed only within each experiment.
What would settle it
Retraining the identical architectures with the reported loss combinations while deliberately varying optimizer settings or data preprocessing to match the single-loss baselines and checking whether the gains disappear.
read the original abstract
Recent advances in deep learning have pushed the performances of visual saliency models way further than it has ever been. Numerous models in the literature present new ways to design neural networks, to arrange gaze pattern data, or to extract as much high and low-level image features as possible in order to create the best saliency representation. However, one key part of a typical deep learning model is often neglected: the choice of the loss function. In this work, we explore some of the most popular loss functions that are used in deep saliency models. We demonstrate that on a fixed network architecture, modifying the loss function can significantly improve (or depreciate) the results, hence emphasizing the importance of the choice of the loss function when designing a model. We also introduce new loss functions that have never been used for saliency prediction to our knowledge. And finally, we show that a linear combination of several well-chosen loss functions leads to significant improvements in performances on different datasets as well as on a different network architecture, hence demonstrating the robustness of a combined metric.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript explores the impact of loss function choice on deep visual saliency prediction. It reports that, on a fixed network architecture, swapping or combining loss functions produces substantial performance changes across standard saliency benchmarks. New loss functions are introduced, and a linear combination of selected losses is shown to improve results on multiple datasets and on a second architecture.
Significance. If the reported gains can be attributed unambiguously to the loss functions, the work would usefully draw attention to an under-examined design choice in saliency modeling and supply a practical recipe for combining losses that appears robust across datasets and architectures. The cross-dataset and cross-architecture evaluation is a constructive element of the study.
major comments (3)
- [Experimental protocol / Methods] The central claim—that performance differences are caused by the loss functions themselves—requires that optimizer, learning-rate schedule, batch size, epoch count, weight initialization, and data-augmentation pipeline remained bitwise identical for every loss variant. The manuscript states that a fixed network is used but supplies no explicit confirmation that these other training factors were held constant; without that assurance the attribution of gains to the loss functions is not yet secured.
- [Results and tables] No error bars, standard deviations across multiple runs, or statistical significance tests are reported for any of the performance deltas. Consequently the magnitude and reliability of the claimed improvements (both single-loss and combined-loss) cannot be assessed from the presented data.
- [Cross-architecture experiments] When the linear-combination result is extended to a second architecture, the manuscript does not state whether the hyper-parameter settings (including any re-tuning) were identical to those used for the first architecture. This leaves open the possibility that part of the reported gain arises from architecture-specific optimization rather than from the loss combination alone.
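One way to make the control requested in the first major comment checkable is to derive every run from a single frozen configuration and assert that only the loss differs. The sketch below assumes a hypothetical `TrainConfig`; every field value is a placeholder, not a setting reported in the paper.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical training settings; all defaults are placeholders."""
    loss: str
    optimizer: str = "adam"
    learning_rate: float = 1e-4
    batch_size: int = 8
    epochs: int = 10
    seed: int = 0
    augmentation: str = "none"


def make_runs(base, loss_names):
    """Create one config per loss, copying every other field from base."""
    return [replace(base, loss=name) for name in loss_names]


def differing_fields(a, b):
    """Return the sorted names of fields whose values differ."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])


base = TrainConfig(loss="kld")
runs = make_runs(base, ["kld", "cc", "kld+cc+nss"])
for run in runs[1:]:
    # Fails loudly if anything other than the loss was changed.
    assert differing_fields(runs[0], run) == ["loss"]
```

Identical configurations are necessary but not sufficient for bitwise-identical training, since some hardware kernels are nondeterministic. A check like this covers only the declared settings.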
minor comments (2)
- [Abstract] The abstract contains minor grammatical awkwardness ('way further than it has ever been').
- [Title] Title capitalization is inconsistent ('The Quest For The Loss Function').
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing the need for unambiguous attribution of performance gains to loss functions. We address each major comment below and will revise the manuscript to incorporate clarifications and additional analyses where appropriate.
Point-by-point responses
- Referee: [Experimental protocol / Methods] The central claim, that performance differences are caused by the loss functions themselves, requires that optimizer, learning-rate schedule, batch size, epoch count, weight initialization, and data-augmentation pipeline remained bitwise identical for every loss variant. The manuscript states that a fixed network is used but supplies no explicit confirmation that these other training factors were held constant; without that assurance the attribution of gains to the loss functions is not yet secured.
  Authors: We confirm that the optimizer (Adam), learning-rate schedule, batch size, epoch count, weight initialization, and data-augmentation pipeline were held identical across all loss variants on the fixed architecture. The manuscript's emphasis on a 'fixed network' was intended to convey this, but we acknowledge the lack of explicit wording. In the revised manuscript we will add a dedicated sentence in Section 3 (Experimental Setup) stating that all non-loss training factors remained bitwise identical.
  Revision: yes
- Referee: [Results and tables] No error bars, standard deviations across multiple runs, or statistical significance tests are reported for any of the performance deltas. Consequently the magnitude and reliability of the claimed improvements (both single-loss and combined-loss) cannot be assessed from the presented data.
  Authors: The observation is correct; the current tables report single-run results. We will add standard deviations computed over three independent runs with different random seeds and include paired t-test p-values for the key deltas (single-loss vs. baseline and combined-loss vs. best single loss) in the revised tables and text.
  Revision: yes
- Referee: [Cross-architecture experiments] When the linear-combination result is extended to a second architecture, the manuscript does not state whether the hyper-parameter settings (including any re-tuning) were identical to those used for the first architecture. This leaves open the possibility that part of the reported gain arises from architecture-specific optimization rather than from the loss combination alone.
  Authors: The same hyper-parameter values (including the loss weights) were transferred without re-tuning to the second architecture. We will insert an explicit statement in the cross-architecture subsection clarifying that no architecture-specific hyper-parameter search was performed for the loss combination.
  Revision: yes
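The paired comparison over seeds promised in the second response can be sketched as follows. The scores are made-up placeholder numbers chosen only to exercise the code, not results from the paper, and `paired_t` is a hypothetical helper name.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Paired t statistic for per-seed scores of two training variants.

    Returns (mean difference, sample std of differences, t, degrees of freedom).
    Assumes the differences are not all identical (std > 0).
    """
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, sd_diff, t_stat, n - 1


# Placeholder per-seed scores (e.g. CC) for three seeds; NOT real results.
baseline = [0.70, 0.72, 0.71]
combined = [0.74, 0.75, 0.76]

mean_diff, sd_diff, t_stat, dof = paired_t(baseline, combined)
print(f"mean diff={mean_diff:.3f} sd={sd_diff:.3f} t={t_stat:.2f} dof={dof}")
```

With three runs there are only two degrees of freedom, so the two-sided 5% critical value is about 4.30 and the test has little power. A library routine such as `scipy.stats.ttest_rel` would return the p-value directly.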
Circularity Check
No significant circularity; empirical claims rest on external benchmarks
full rationale
The paper performs an empirical comparison of loss functions on fixed network architectures, reporting performance on standard external saliency datasets and benchmarks. No equations, predictions, or first-principles derivations are presented that reduce by construction to fitted inputs or self-citations. Central claims about linear combinations of losses are validated through independent evaluation metrics across datasets and a second architecture, with no load-bearing self-citation chains or self-definitional reductions identified.
Axiom & Free-Parameter Ledger
free parameters (1)
- loss combination weights
axioms (1)
- domain assumption: Gradient descent on a neural network can optimize any differentiable loss that compares predicted and ground-truth saliency maps
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "a linear combination of several well-chosen loss functions leads to significant improvements"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We also introduce new loss functions that have never been used for saliency prediction"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Deep Saliency Models : The Quest For The Loss Function, arXiv 1907. INTRODUCTION. Despite decades of research, visual attention mechanisms of humans remain complex to understand and even more complex to model. With the availability of large databases of eye-tracking and mouse movements recorded on images [1, 2], there is now a far better understanding of the perceptual mechanisms. Significant progress has been made in ...
- [2] RELATED WORKS. Computational models of saliency prediction, a long-standing problem in computer vision, have been studied from so many perspectives that going through all is beyond the scope of this manuscript. We, thus, provide a brief account of relevant works and summarize them in this section. We refer the readers to [5, 6] for an overview. To dat...
- [3] employs support vector machines and [20] uses extreme learning machines. Our focus is, however, the second group. Within end-to-end deep learning techniques, the main research has been on architecture design. Many of the models borrow the pre-trained weights of an image recognition network and experiment combining different layers in various ways. In ...
- [4] LOSS FUNCTIONS FOR DEEP SALIENCY NETWORK. Before delving into the description of loss functions, we present the architecture of the convolutional neural network that will be used throughout this paper. After this presentation, we elaborate on the tested loss functions. 3.1. Proposed baseline architecture. Figure 1 presents the overall architecture of the ...
- [5] combining KLD, CC and NSS loss functions, and the second one (LC 2) adding Deep Features loss, Gram Matrices loss and sigmoid-weighted MSE. This specific combination was chosen because it relies on an existing successful combination and also aggregates the four types of metrics together. We followed the work of [26] to set the coefficients for the first ...
- [6] EXPERIMENTS. 4.1. Testing protocols. To carry out the evaluation, we use seven quality metrics applied on the MIT benchmark [1, 38]: CC (correlation coefficient, CC ∈ [−1, 1]), SIM (similarity, intersection between histograms of saliency, SIM ∈ [0, 1]), AUC (Area Under Curve, AUC ∈ [0, 1]), NSS (Normalized Scanpath Saliency, NSS ∈ ]−∞, +∞[), EMD (Eart...
- [7] CONCLUSION. In this paper, we introduced a deep neural network which purpose was to evaluate the impact of loss functions on the pre- [interleaved figure caption: Fig. 4: Example of good predictions by the combination loss while a single loss makes bad predictions (for SAM-VGG model). (a) original image; (b) Ground truth saliency map; (c) KLD + CC + NSS + DF + GM + SIG-MSE + R com...]
- [8] Zoya Bylinskii, Tilke Judd, Ali Borji, Laurent Itti, Frédo Durand, Aude Oliva, and Antonio Torralba, “MIT saliency benchmark,” 2015.
- [9] M. Jiang, S. Huang, J. Duan, and Q. Zhao, “Salicon: Saliency in context,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- [10] Ali Borji, “Saliency prediction in the deep learning era: An empirical investigation,” arXiv preprint arXiv:1810.03716, 2018.
- [11] S. Jetley, N. Murray, and E. Vig, “End-to-end saliency mapping via probability distribution prediction,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [12] A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 185–207, 2013.
- [13] A. Borji, D. N. Sihite, and L. Itti, “Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study,” IEEE Transactions on Image Processing, vol. 22, no. 1, pp. 55–69, Jan 2013.
- [14] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
- [15] Neil D. B. Bruce and John K. Tsotsos, “Saliency based on information maximization,” in Proceedings of the 18th International Conference on Neural Information Processing Systems, 2005.
- [16] Jonathan Harel, Christof Koch, and Pietro Perona, “Graph-based visual saliency,” in Proceedings of the 19th International Conference on Neural Information Processing Systems, 2006.
- [17] X. Hou, J. Harel, and C. Koch, “Image signature: Highlighting sparse salient regions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 1, pp. 194–201, 2012.
- [18] J. Zhang and S. Sclaroff, “Saliency detection: A boolean map approach,” in 2013 IEEE International Conference on Computer Vision, 2013.
- [19] Qi Zhao and Christof Koch, “Learning saliency-based visual attention: A review,” Signal Processing, vol. 93, no. 6, pp. 1401–1407, 2013.
- [20] Nicolas Riche, Matthieu Duvinage, Matei Mancas, Bernard Gosselin, and Thierry Dutoit, “Saliency and human fixations: State-of-the-art and study of comparison metrics,” in The IEEE International Conference on Computer Vision (ICCV), 2013.
- [21] A. Borji, H. R. Tavakoli, D. N. Sihite, and L. Itti, “Analysis of scores, datasets, and models in visual saliency prediction,” in 2013 IEEE International Conference on Computer Vision, 2013.
- [22] N. D. B. Bruce, C. Catton, and S. Janjic, “A deeper look at saliency: Feature contrast, semantics, and beyond,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 516–524.
- [23] H. R. Tavakoli, F. Ahmed, A. Borji, and J. Laaksonen, “Saliency revisited: Analysis of mouse movements versus fixations,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6354–6362.
- [24] Z. Bylinskii, A. Recasens, A. Borji, A. Oliva, A. Torralba, and F. Durand, “Where should saliency models look next?,” in European Conference on Computer Vision (ECCV), 2016.
- [25] Sen He, Hamed R Tavakoli, Ali Borji, Yang Mi, and Nicolas Pugeault, “Understanding and visualizing deep visual saliency models,” arXiv preprint arXiv:1903.02501, 2019.
- [26] E. Vig, M. Dorr, and D. Cox, “Large-scale optimization of hierarchical features for saliency prediction in natural images,” in IEEE Computer Vision and Pattern Recognition (CVPR), 2014.
- [27] Hamed R. Tavakoli, Ali Borji, Jorma Laaksonen, and Esa Rahtu, “Exploiting inter-image similarity and ensemble of extreme learners for fixation prediction using deep features,” Neurocomput., vol. 244, no. C, pp. 10–18, June 2017.
- [28] Xun Huang, Chengyao Shen, Xavier Boix, and Qi Zhao, “Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks,” in The IEEE International Conference on Computer Vision (ICCV), 2015.
- [29] M. Kummerer, L. Theis, and M. Bethge, “Deep gaze i: Boosting saliency prediction with feature maps trained on imagenet,” in ICLR Workshop, 2015.
- [30] M. Kummerer, T. S. Wallis, L. A. Gatys, and M. Bethge, “Understanding low- and high-level contributions to fixation prediction,” in The IEEE International Conference on Computer Vision (ICCV), 2017.
- [31] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara, “A Deep Multi-Level Network for Saliency Prediction,” in International Conference on Pattern Recognition (ICPR), 2016.
- [32] Nian Liu and Junwei Han, “A deep spatial contextual long-term recurrent convolutional network for saliency detection,” IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3264–3274, 2018.
- [33] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara, “Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model,” IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 5142–5154, 2018.
- [34] Sen Jia, “EML-NET: an expandable multi-layer network for saliency prediction,” CoRR, vol. abs/1805.01047, 2018.
- [35] M. Kuemmerer, T. Wallis, and M. Bethge, “Information-theoretic model comparison unifies saliency metrics,” Proceedings of the National Academy of Sciences, vol. 112, no. 52, pp. 16054–16059, Oct 2015.
- [36] Matthias Kümmerer, Thomas SA Wallis, and Matthias Bethge, “Deepgaze ii: Reading fixations from deep features trained on object recognition,” arXiv preprint arXiv:1610.01563, 2016.
- [37] Alex Kendall and Roberto Cipolla, “Geometric loss functions for camera pose regression with deep learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [38] Justin Johnson, Alexandre Alahi, and Li Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision, 2016.
- [39] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 2999–3007.
- [40] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [41] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
- [42] Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba, “Learning to predict where humans look,” in 12th international conference on Computer Vision. IEEE, 2009, pp. 2106–2113.
- [43] Junting Pan, Cristian Canton, Kevin McGuinness, Noel E O’Connor, Jordi Torres, Elisa Sayrol, and Xavier Giro-i Nieto, “Salgan: Visual saliency prediction with generative adversarial networks,” arXiv preprint arXiv:1701.01081, 2017.
- [44] Wenguan Wang, Jianbing Shen, and Ling Shao, “Video salient object detection via fully convolutional networks,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 38–49, 2018.
- [45] Olivier Le Meur and Thierry Baccino, “Methods for comparing scanpaths and saliency maps: strengths and weaknesses,” Behavior Research Methods, vol. 45, no. 1, pp. 251–266, 2013.
- [46] Robert J Peters, Asha Iyer, Laurent Itti, and Christof Koch, “Components of bottom-up gaze allocation in natural images,” Vision research, vol. 45, no. 18, pp. 2397–2416, 2005.
- [47] L. A. Gatys, A. S. Ecker, and M. Bethge, “A neural algorithm of artistic style,” arXiv preprint, 2015.
- [48] A. Garcia-Diaz, X. R. Fdez-Vidal, X. M. Pardo, and R. Dosil, “Saliency from hierarchical adaptation through decorrelation and variance normalization,” Image and Vision Computing, vol. 30, no. 1, pp. 51–64, 2012.
- [49] Junting Pan, Elisa Sayrol, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E O’Connor, “Shallow and deep convolutional networks for saliency prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 598–606.
- [50] Ali Borji and Laurent Itti, “Cat2000: A large scale fixation dataset for boosting saliency research,” arXiv preprint arXiv:1505.03581, 2015.
- [51]
- [52] Wang Y., Chang G.J., Zhang Y., “An element sensitive saliency model with position prior learning for web pages,” in ICIAI, 2018.