pith. machine review for the scientific record.

arxiv: 2604.08858 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: 2 theorem links


BIAS: A Biologically Inspired Algorithm for Video Saliency Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords video saliency detection · biologically inspired · motion detector · Itti-Koch framework · saliency maps · traffic accident anticipation · dynamic attention · foci of attention

The pith

BIAS adds a retina-inspired motion detector to the Itti-Koch framework and uses greedy multi-Gaussian fitting to produce fast saliency maps for video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a dynamic saliency detector called BIAS that adds temporal motion features drawn from retinal processing to a classic attention model. It locates attention foci by fitting multiple Gaussian peaks in a greedy way that trades off competition and coverage. This combination runs at millisecond latency and produces saliency maps that beat several deep-learning systems on the DHF1K benchmark when attention is driven mainly by motion. The same maps are then shown to support early recognition of traffic accidents, reaching state-of-the-art cause-effect labeling and flagging incidents up to 0.72 seconds before human annotators mark them.
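
As a rough illustration of the combination step described above, the sketch below folds a motion conspicuity map into an Itti-Koch-style master saliency map by normalizing and summing channels. The normalization operator, the equal channel weighting, and the light smoothing are assumptions made for illustration; the paper's actual channels and weights are not specified on this page.

```python
# Minimal sketch, assuming simple max-normalization and equal channel weights.
import numpy as np
from scipy.ndimage import gaussian_filter

def normalize_map(m, eps=1e-8):
    """Rescale a conspicuity map to [0, 1]; a stand-in for Itti-Koch's N(.) operator."""
    m = m - m.min()
    return m / (m.max() + eps)

def master_saliency(static_maps, motion_map, motion_weight=1.0, sigma=2.0):
    """Average the normalized static channels, add the motion channel, and smooth lightly."""
    static = np.mean([normalize_map(m) for m in static_maps], axis=0)
    combined = static + motion_weight * normalize_map(motion_map)
    return normalize_map(gaussian_filter(combined, sigma))
```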

Core claim

BIAS detects salient regions with millisecond-scale latency and outperforms heuristic-based approaches and several deep-learning models on the DHF1K dataset, particularly in videos dominated by bottom-up attention. Applied to traffic accident analysis, BIAS demonstrates strong real-world utility, achieving state-of-the-art performance in cause-effect recognition and anticipating accidents up to 0.72 seconds before manual annotation with reliable accuracy.

What carries the argument

Retina-inspired motion detector that extracts temporal features, paired with a greedy multi-Gaussian peak-fitting algorithm that identifies foci of attention while balancing winner-take-all competition and information maximization.
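
The greedy peak-fitting step is not spelled out on this page, so the following is one plausible reading rather than the paper's algorithm: repeatedly take the winner-take-all maximum of a residual map, explain it with a Gaussian component, and stop once a new component accounts for too little of the map's total mass (a crude stand-in for the information-maximization side of the trade-off). The fixed sigma, the component cap, and the stopping threshold are assumptions.

```python
import numpy as np

def greedy_gaussian_foci(saliency, sigma=12.0, max_foci=5, min_gain=0.05):
    """Greedy multi-Gaussian peak fitting (sketch).

    Pick the residual maximum (winner-take-all), subtract a Gaussian bump fitted
    to its amplitude, and stop when the explained fraction of the map's total
    mass falls below `min_gain`.
    """
    h, w = saliency.shape
    yy, xx = np.mgrid[0:h, 0:w]
    residual = saliency.astype(float)
    total = residual.sum() + 1e-8
    foci = []
    for _ in range(max_foci):
        y, x = np.unravel_index(np.argmax(residual), residual.shape)
        amp = residual[y, x]
        bump = amp * np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma ** 2))
        bump = np.minimum(bump, residual)   # never over-explain the residual
        gain = bump.sum() / total
        if gain < min_gain:
            break
        foci.append((x, y, amp, gain))
        residual = residual - bump
    return foci
```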

Load-bearing premise

The retina-inspired motion detector and greedy peak-fitting step together produce saliency maps that capture human attention patterns and generalize beyond the DHF1K and traffic-video test sets.

What would settle it

Evaluating BIAS on a fresh collection of videos dominated by top-down attention cues and finding that its saliency maps or accident-anticipation accuracy fall below the deep-learning baselines it beat on DHF1K.
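
Settling it would come down to per-video saliency scores. A minimal sketch of two standard metrics (linear correlation coefficient and normalized scanpath saliency) follows, assuming continuous ground-truth maps for CC and binary fixation maps for NSS; the exact metrics and baselines the paper reports are not listed on this page.

```python
import numpy as np

def cc(pred, gt):
    """Linear correlation coefficient between predicted and ground-truth saliency maps."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float((p * g).mean())

def nss(pred, fixations):
    """Normalized scanpath saliency: mean z-scored prediction at fixated pixels."""
    z = (pred - pred.mean()) / (pred.std() + 1e-8)
    return float(z[fixations > 0].mean())

def video_score(pred_maps, gt_maps, metric=cc):
    """Average a metric over the frames of one video (names here are assumptions)."""
    return float(np.mean([metric(p, g) for p, g in zip(pred_maps, gt_maps)]))
```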

Figures

Figures reproduced from arXiv: 2604.08858 by Ya-tang Li, Zhao-ji Zhang.

Figure 1: (a) General architecture of BIAS. (b) Comparison of center–surround …
Figure 2: Predicted saliency maps on an example video clip from the …
Figure 3: Comparison of performance and runtime between BIAS and other …
Figure 4: (a) Correlation coefficients for different center–delta pairs. (b) Per …
Figure 5: (a) From left to right: original frames, human fixation ground truth, …
Figure 6: The input is a sequence of predicted saliency maps. SPARK-ResNet …
Figure 7: Comparison of predicted times across different models. (a) Predicted …
read the original abstract

We present BIAS, a fast, biologically inspired model for dynamic visual saliency detection in continuous video streams. Building on the Itti--Koch framework, BIAS incorporates a retina-inspired motion detector to extract temporal features, enabling the generation of saliency maps that integrate both static and motion information. Foci of attention (FOAs) are identified using a greedy multi-Gaussian peak-fitting algorithm that balances winner-take-all competition with information maximization. BIAS detects salient regions with millisecond-scale latency and outperforms heuristic-based approaches and several deep-learning models on the DHF1K dataset, particularly in videos dominated by bottom-up attention. Applied to traffic accident analysis, BIAS demonstrates strong real-world utility, achieving state-of-the-art performance in cause-effect recognition and anticipating accidents up to 0.72 seconds before manual annotation with reliable accuracy. Overall, BIAS bridges biological plausibility and computational efficiency to achieve interpretable, high-speed dynamic saliency detection.
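
The abstract names a retina-inspired motion detector but gives no equations. A common retina-flavored abstraction, shown here only as a sketch, is a biphasic temporal filter applied per pixel (a transient, ganglion-cell-like response) followed by a center-surround difference of Gaussians; the kernel shape and all parameters below are illustrative assumptions, not the paper's detector.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def biphasic_kernel(length=9, tau_fast=1.5, tau_slow=3.0):
    """Difference of two exponentials: a crude biphasic (transient) impulse response."""
    t = np.arange(length, dtype=float)
    k = np.exp(-t / tau_fast) / tau_fast - np.exp(-t / tau_slow) / tau_slow
    return k / (np.abs(k).sum() + 1e-8)

def motion_map(frames, sigma_c=1.0, sigma_s=4.0):
    """Per-pixel temporal filtering, rectification, then a center-surround stage."""
    stack = np.stack(frames).astype(float)          # (T, H, W) grayscale frames
    k = biphasic_kernel()
    temporal = np.abs(np.apply_along_axis(
        lambda v: np.convolve(v, k, mode="same"), 0, stack))[-1]   # latest frame
    return np.abs(gaussian_filter(temporal, sigma_c) - gaussian_filter(temporal, sigma_s))
```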

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper presents BIAS, a biologically inspired algorithm for dynamic visual saliency detection in continuous video streams. It extends the Itti-Koch framework with a retina-inspired motion detector for temporal features and employs a greedy multi-Gaussian peak-fitting algorithm to identify foci of attention (FOAs), balancing winner-take-all competition with information maximization. The model is claimed to achieve millisecond-scale latency, outperform heuristic-based and several deep-learning approaches on the DHF1K dataset (especially bottom-up attention videos), and deliver state-of-the-art results in traffic accident cause-effect recognition while anticipating accidents up to 0.72 seconds before manual annotation.

Significance. If the performance claims hold with proper validation, BIAS could provide an efficient, interpretable, and low-latency alternative to deep models for video saliency, potentially useful for real-time applications such as traffic monitoring. The hybrid biological-computational approach is a strength if the retina-inspired components are shown to contribute measurably beyond standard motion detectors.

major comments (4)
  1. Methods section on retina-inspired motion detector: no quantitative validation against retinal recordings or physiological data is provided to establish the fidelity of the temporal feature extraction; without this, the biological inspiration claim cannot be assessed as load-bearing for the reported performance gains.
  2. Results section on DHF1K dataset: outperformance is asserted over heuristics and deep models but no ablation studies removing the motion detector or biological components are described, leaving it unclear whether the gains derive from the retina-inspired elements rather than the peak-fitting or other factors.
  3. Results section on traffic accident analysis: the 0.72-second anticipation claim and SOTA cause-effect recognition require explicit dataset details, exact evaluation metrics, baseline comparisons, and statistical significance tests; absent these, the real-world utility assertion lacks direct support.
  4. Evaluation methodology: no held-out video domains or cross-dataset tests beyond DHF1K and traffic videos are mentioned, raising the risk that the greedy multi-Gaussian fitting overfits to dataset-specific motion statistics rather than generalizing.
minor comments (2)
  1. Abstract: specific performance metrics (e.g., AUC, NSS, sAUC) and the exact deep-learning models compared are not listed, reducing clarity of the outperformance claim.
  2. Figure captions and pseudocode: the peak-fitting algorithm would benefit from an explicit equation or algorithm box to clarify the information-maximization term.
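
Major comment 2 asks for ablations that isolate the motion channel. A minimal harness of that shape is sketched below: plain frame differencing and Farneback optical-flow magnitude (via OpenCV) as interchangeable motion-map sources, with the retina-style sketch from earlier as the third variant. The dictionary keys and the assumption that frames arrive as 8-bit grayscale arrays are illustrative; the paper's planned ablations may differ.

```python
import numpy as np
import cv2  # only needed for the optical-flow variant

def frame_diff_motion(frames):
    """Ablation baseline: absolute difference of the last two grayscale frames."""
    return np.abs(frames[-1].astype(float) - frames[-2].astype(float))

def optical_flow_motion(frames):
    """Ablation baseline: Farneback flow magnitude between the last two frames
    (frames assumed to be 8-bit single-channel arrays)."""
    flow = cv2.calcOpticalFlowFarneback(frames[-2], frames[-1], None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return np.linalg.norm(flow, axis=2)

# Hypothetical ablation table: swap the motion channel, keep everything else fixed.
MOTION_CHANNELS = {
    # "retina_filter": motion_map,   # plug in the retina-style sketch shown earlier
    "frame_diff": frame_diff_motion,
    "farneback_flow": optical_flow_motion,
}
```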

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating where revisions will be made to strengthen the paper.

read point-by-point responses
  1. Referee: Methods section on retina-inspired motion detector: no quantitative validation against retinal recordings or physiological data is provided to establish the fidelity of the temporal feature extraction; without this, the biological inspiration claim cannot be assessed as load-bearing for the reported performance gains.

    Authors: The retina-inspired motion detector is constructed from established models of retinal ganglion cell responses and direction selectivity, as detailed in the methods with supporting citations to physiological literature. We do not claim an exact replica of biological recordings but rather a computationally efficient abstraction that captures key temporal dynamics. To address the concern, we will revise the methods section to include an expanded discussion mapping each component to specific retinal mechanisms and known physiological properties. New direct quantitative validation against raw retinal data would require dedicated physiological experiments outside the scope of this computational modeling paper. revision: partial

  2. Referee: Results section on DHF1K dataset: outperformance is asserted over heuristics and deep models but no ablation studies removing the motion detector or biological components are described, leaving it unclear whether the gains derive from the retina-inspired elements rather than the peak-fitting or other factors.

    Authors: We agree that ablation studies are necessary to isolate contributions. The revised manuscript will include new ablation experiments on DHF1K: one removing the retina-inspired motion detector (replaced by standard frame differencing), one replacing it with conventional optical flow, and one ablating the multi-Gaussian fitting in favor of standard WTA. These results will quantify the incremental benefit of the biological components, particularly on bottom-up attention videos. revision: yes

  3. Referee: Results section on traffic accident analysis: the 0.72-second anticipation claim and SOTA cause-effect recognition require explicit dataset details, exact evaluation metrics, baseline comparisons, and statistical significance tests; absent these, the real-world utility assertion lacks direct support.

    Authors: The traffic accident experiments use a standard annotated traffic video dataset with cause-effect labels. The 0.72 s figure is the mean anticipation interval at which saliency-based prediction accuracy remains above a fixed threshold prior to annotated accident onset. In revision we will expand the section with: complete dataset statistics and source, precise metric definitions (anticipation time, cause-effect accuracy), full list of baselines with their scores, and statistical significance results (e.g., paired t-tests and confidence intervals). These additions will directly support the reported utility. revision: yes

  4. Referee: Evaluation methodology: no held-out video domains or cross-dataset tests beyond DHF1K and traffic videos are mentioned, raising the risk that the greedy multi-Gaussian fitting overfits to dataset-specific motion statistics rather than generalizing.

    Authors: DHF1K already spans diverse motion and scene statistics, and the traffic videos constitute an independent real-world domain. To further demonstrate generalization, the revision will add an internal cross-validation protocol on DHF1K (holding out video subsets stratified by motion intensity and scene type) and report the resulting variance. We will also discuss the low-parameter nature of the greedy fitting procedure, which reduces overfitting risk compared with learned models. revision: partial
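
Response 3 defines the 0.72 s figure as a mean anticipation interval relative to the annotated accident onset. Under that stated definition, the computation could look like the sketch below; the per-frame accident score, the fixed threshold, and the frame rate are all assumptions.

```python
import numpy as np

def anticipation_time(scores, onset_frame, fps=30.0, threshold=0.5):
    """Seconds before the annotated onset at which the per-frame accident score
    first rises above `threshold` and stays there through the onset (sketch)."""
    scores = np.asarray(scores, dtype=float)
    above = scores[:onset_frame + 1] >= threshold
    if not above[-1]:
        return 0.0                      # not flagged by the onset at all
    first = onset_frame
    while first > 0 and above[first - 1]:   # start of the final above-threshold run
        first -= 1
    return (onset_frame - first) / fps

def mean_anticipation(per_video):
    """per_video: list of (scores, onset_frame) pairs; returns mean seconds."""
    return float(np.mean([anticipation_time(s, o) for s, o in per_video]))
```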

Circularity Check

0 steps flagged

No circularity: BIAS is an independently defined algorithm

full rationale

The paper constructs BIAS by extending the standard Itti-Koch saliency framework with an explicitly described retina-inspired motion detector and a greedy multi-Gaussian peak-fitting procedure for FOAs. These components are introduced as design choices motivated by biology and information-maximization principles rather than derived from or fitted to the DHF1K or traffic-accident evaluation data. No equation reduces to a parameter estimated on the same test sets, no self-citation supplies a uniqueness theorem or ansatz, and performance results are presented as empirical outcomes of the algorithm rather than predictions forced by construction. The derivation chain therefore remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are detailed; the motion detector and peak-fitting algorithm are presented as constructed components without specified fitting values or unproven assumptions.

pith-pipeline@v0.9.0 · 5460 in / 1275 out tokens · 67483 ms · 2026-05-10T18:19:53.835398+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages

  1. [1]

    Information capacity of a single retinal channel,

    D. Kelly, “Information capacity of a single retinal channel,” IRE Trans. Inf. Theory, vol. 8, no. 3, pp. 221–226, Apr. 1962

  2. [2]

    A new framework for understanding vision from the perspective of the primary visual cortex,

    L. Zhaoping, “A new framework for understanding vision from the perspective of the primary visual cortex,” Curr. Opin. in Neurobio., vol. 58, pp. 1–10, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0959438819300042

  3. [3]

    How much the eye tells the brain,

    K. Koch, J. McLean, R. Segev, M. A. Freed, M. J. Berry, V. Balasubramanian, and P. Sterling, “How much the eye tells the brain,” Curr. Biol., vol. 16, no. 14, pp. 1428–1434, Jul. 2006. [Online]. Available: https://www.cell.com/current-biology/abstract/S0960-9822(06)01639-3

  4. [4]

    The unbearable slowness of being: Why do we live at 10 bits/s?

    J. Zheng and M. Meister, “The unbearable slowness of being: Why do we live at 10 bits/s?” Neuron, vol. 113, no. 2, pp. 192–204, 2025. [Online]. Available: https://www.cell.com/neuron/abstract/S0896-6273(24)00808-0

  5. [5]

    The attention system of the human brain: 20 years after,

    S. E. Petersen and M. I. Posner, “The attention system of the human brain: 20 years after,” Annu. Rev. of Neurosci., vol. 35, no. 1, pp. 73–89, Jun. 2012. [Online]. Available: https://www.annualreviews.org/doi/10.1146/annurev-neuro-062111-150525

  6. [6]

    A feature-integration theory of attention,

    A. M. Treisman and G. Gelade, “A feature-integration theory of attention,” Cogn. Psychol., vol. 12, no. 1, pp. 97–136, 1980. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0010028580900055

  7. [7]

    Shifts in selective visual attention: Towards the underlying neural circuitry,

    C. Koch and S. Ullman, “Shifts in selective visual attention: Towards the underlying neural circuitry,” Human Neurobiology, vol. 4, no. 4, pp. 219–227, 1985

  8. [8]

    A model of saliency-based visual attention for rapid scene analysis,

    L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE TPAMI, vol. 20, no. 11, pp. 1254–1259, 1998

  9. [9]

    State-of-the-art in visual attention modeling,

    A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” IEEE TPAMI, vol. 35, no. 1, pp. 185–207, Jan. 2013

  10. [10]

    Computational modelling of visual attention,

    L. Itti and C. Koch, “Computational modelling of visual attention,” Nat. Rev. Neurosci., vol. 2, no. 3, pp. 194–203, Mar. 2001. [Online]. Available: https://www.nature.com/articles/35058500

  11. [11]

    Revisiting video saliency prediction in the deep learning era,

    W. Wang, J. Shen, J. Xie, M.-M. Cheng, H. Ling, and A. Borji, “Revisiting video saliency prediction in the deep learning era,” IEEE TPAMI, vol. 43, no. 1, pp. 220–237, Jan. 2021. [Online]. Available: https://ieeexplore.ieee.org/document/8744328

  12. [12]

    Global Status Report on Road Safety 2023,

    Global Status Report on Road Safety 2023, 1st ed. Geneva: World Health Organization, 2023

  13. [13]

    Anticipating accidents in dashcam videos,

    F.-H. Chan, Y.-T. Chen, Y. Xiang, and M. Sun, “Anticipating accidents in dashcam videos,” in ACCV, S.-H. Lai, V. Lepetit, K. Nishino, and Y. Sato, Eds. Cham: Springer Int. Publishing, 2017, pp. 136–153

  14. [14]

    DoTA: Unsupervised detection of traffic anomaly in driving videos,

    Y. Yao, X. Wang, M. Xu, Z. Pu, Y. Wang, E. Atkins, and D. J. Crandall, “DoTA: Unsupervised detection of traffic anomaly in driving videos,” IEEE TPAMI, vol. 45, no. 1, pp. 444–459, Jan. 2023

  15. [15]

    Revisiting video saliency: A large-scale benchmark and a new model,

    W. Wang, J. Shen, F. Guo, M.-M. Cheng, and A. Borji, “Revisiting video saliency: A large-scale benchmark and a new model,” in CVPR, 2018, pp. 4894–4903

  16. [16]

    Driver anomaly detection: A dataset and contrastive learning approach,

    O. Kopuklu, J. Zheng, H. Xu, and G. Rigoll, “Driver anomaly detection: A dataset and contrastive learning approach,” in 2021 IEEE Winter Conf. on Appl. of Comput. Vis. (WACV). Waikoloa, HI, USA: IEEE, Jan. 2021, pp. 91–100. [Online]. Available: https://ieeexplore.ieee.org/document/9423242/

  17. [17]

    Traffic accident benchmark for causality recognition,

    T. You and B. Han, “Traffic accident benchmark for causality recognition,” in ECCV, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds. Cham: Springer Int. Publishing, 2020, pp. 540–556

  18. [18]

    DADA-2000: Can driving accident be predicted by driver attention? Analyzed by a benchmark,

    J. Fang, D. Yan, J. Qiao, J. Xue, H. Wang, and S. Li, “DADA-2000: Can driving accident be predicted by driver attention? Analyzed by a benchmark,” in 2019 IEEE Intell. Transp. Syst. Conf. (ITSC). Auckland, New Zealand: IEEE Press, 2019, pp. 4303–4309. [Online]. Available: https://doi.org/10.1109/ITSC.2019.8917218

  19. [19]

    A measure of motion salience for surveillance applications,

    R. Wildes, “A measure of motion salience for surveillance applications,” in ICIP, Oct. 1998, pp. 183–187 vol. 3. [Online]. Available: https://ieeexplore.ieee.org/document/727163

  20. [20]

    Detecting salient motion by accumulating directionally-consistent flow,

    L. Wixson, “Detecting salient motion by accumulating directionally-consistent flow,” IEEE TPAMI, vol. 22, no. 8, pp. 774–780, Aug. 2000. [Online]. Available: https://ieeexplore.ieee.org/document/868680

  21. [21]

    The discriminant center-surround hypothesis for bottom-up saliency,

    D. Gao, V. Mahadevan, and N. Vasconcelos, “The discriminant center-surround hypothesis for bottom-up saliency,” in NeurIPS, vol. 20. Curran Associates, Inc., 2007. [Online]. Available: https://papers.nips.cc/paper_files/paper/2007/hash/51ef186e18dc00c2d31982567235c559-Abstract.html

  22. [22]

    A model of motion attention for video skimming,

    Y.-F. Ma and H.-J. Zhang, “A model of motion attention for video skimming,” in ICIP, vol. 1, Sep. 2002, pp. I–I. [Online]. Available: https://ieeexplore.ieee.org/document/1037976

  23. [23]

    Static and space-time visual saliency detection by self-resemblance,

    H. J. Seo and P. Milanfar, “Static and space-time visual saliency detection by self-resemblance,” J. Vis., vol. 9, no. 12, p. 15, Nov. 2009. [Online]. Available: https://doi.org/10.1167/9.12.15

  24. [24]

    Spatiotemporal saliency detection and its applications in static and dynamic scenes,

    W. Kim, C. Jung, and C. Kim, “Spatiotemporal saliency detection and its applications in static and dynamic scenes,” IEEE TCSVT, vol. 21, no. 4, pp. 446–456, Apr. 2011. [Online]. Available: https://ieeexplore.ieee.org/document/5728853

  25. [25]

    Spatiotemporal saliency in dynamic scenes,

    V. Mahadevan and N. Vasconcelos, “Spatiotemporal saliency in dynamic scenes,” IEEE TPAMI, vol. 32, no. 1, pp. 171–177, Jan. 2010. [Online]. Available: https://ieeexplore.ieee.org/document/4967608

  26. [26]

    Video saliency incorporating spatiotemporal cues and uncertainty weighting,

    Y. Fang, Z. Wang, W. Lin, and Z. Fang, “Video saliency incorporating spatiotemporal cues and uncertainty weighting,” IEEE TIP, vol. 23, no. 9, pp. 3910–3921, Sep. 2014. [Online]. Available: https://ieeexplore.ieee.org/document/6857361

  27. [27]

    A generic framework of user attention model and its application in video summarization,

    Y.-F. Ma, X.-S. Hua, L. Lu, and H.-J. Zhang, “A generic framework of user attention model and its application in video summarization,” IEEE TMM, vol. 7, no. 5, pp. 907–919, Oct. 2005. [Online]. Available: https://ieeexplore.ieee.org/document/1510638

  28. [28]

    Visual attention detection in video sequences using spatiotemporal cues,

    Y. Zhai and M. Shah, “Visual attention detection in video sequences using spatiotemporal cues,” in ACM MM, ser. MM ’06. New York, NY, USA: Association for Computing Machinery, Oct. 2006, pp. 815–824. [Online]. Available: https://doi.org/10.1145/1180639.1180824

  29. [29]

    Predicting visual fixations on video based on low-level visual features,

    O. Le Meur, P. Le Callet, and D. Barba, “Predicting visual fixations on video based on low-level visual features,” Vis. Res., vol. 47, no. 19, pp. 2483–2498, Sep. 2007. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0042698907002593

  30. [30]

    How many bits does it take for a stimulus to be salient?

    S. H. Khatoonabadi, N. Vasconcelos, I. V. Bajić, and Y. Shan, “How many bits does it take for a stimulus to be salient?” in CVPR, Jun. 2015, pp. 5501–5510. [Online]. Available: https://ieeexplore.ieee.org/document/7299189

  31. [31]

    Salient motion detection in compressed domain,

    K. Muthuswamy and D. Rajan, “Salient motion detection in compressed domain,” IEEE Sign. Process. Letters, vol. 20, no. 10, pp. 996–999, 2013

  32. [32]

    Region-of-interest based compressed domain video transcoding scheme,

    A. Sinha, G. Agarwal, and A. Anbu, “Region-of-interest based compressed domain video transcoding scheme,” in ICASSP, vol. 3. Montreal, Que., Canada: IEEE, 2004, pp. iii–161–4. [Online]. Available: http://ieeexplore.ieee.org/document/1326506/

  33. [33]

    Bayesian surprise attracts human attention,

    L. Itti and P. Baldi, “Bayesian surprise attracts human attention,” Vis. Res., vol. 49, no. 10, pp. 1295–1306, Jun. 2009. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0042698908004380

  34. [34]

    SUN: A bayesian framework for saliency using natural statistics,

    L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell, “SUN: A bayesian framework for saliency using natural statistics,” J. Vis., vol. 8, no. 7, p. 32, Dec. 2008. [Online]. Available: http://jov.arvojournals.org/article.aspx?doi=10.1167/8.7.32

  35. [35]

    A new perceived motion based shot content representation,

    Y.-F. Ma and H.-J. Zhang, “A new perceived motion based shot content representation,” in ICIP, vol. 3, Oct. 2001, pp. 426–429. [Online]. Available: https://ieeexplore.ieee.org/document/958142

  36. [36]

    Dynamic visual attention: Searching for coding length increments,

    X. Hou and L. Zhang, “Dynamic visual attention: Searching for coding length increments,” in NeurIPS, vol. 21. Curran Associates, Inc., 2008. [Online]. Available: https://papers.nips.cc/paper_files/paper/2008/hash/a8baa56554f96369ab93e4f3bb068c22-Abstract.html

  38. [38]

    Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform,

    C. Guo, Q. Ma, and L. Zhang, “Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform,” in CVPR, Jun. 2008, pp. 1–8. [Online]. Available: https://ieeexplore.ieee.org/document/4587715

  39. [39]

    A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression,

    C. Guo and L. Zhang, “A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression,” IEEE TIP, vol. 19, no. 1, pp. 185–198, 2010. [Online]. Available: https://ieeexplore.ieee.org/document/5223506

  40. [40]

    Dynamic whitening saliency,

    V. Leborán, A. García-Díaz, X. R. Fdez-Vidal, and X. M. Pardo, “Dynamic whitening saliency,” IEEE TPAMI, vol. 39, no. 5, pp. 893–907, May 2017. [Online]. Available: https://ieeexplore.ieee.org/document/7469361

  41. [41]

    Clustering of gaze during dynamic scene viewing is predicted by motion,

    P. K. Mital, T. J. Smith, R. L. Hill, and J. M. Henderson, “Clustering of gaze during dynamic scene viewing is predicted by motion,” Cogn. Comput., vol. 3, no. 1, pp. 5–24, Mar. 2011. [Online]. Available: https://doi.org/10.1007/s12559-010-9074-z

  42. [42]

    Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition,

    S. Mathe and C. Sminchisescu, “Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition,” IEEE TPAMI, vol. 37, no. 7, pp. 1408–1424, Jul. 2015. [Online]. Available: https://ieeexplore.ieee.org/document/6942210

  43. [43]

    Eye-tracking database for a set of standard video sequences,

    H. Hadizadeh, M. J. Enriquez, and I. V. Bajić, “Eye-tracking database for a set of standard video sequences,” IEEE TIP, vol. 21, no. 2, pp. 898–903, Feb. 2012. [Online]. Available: https://ieeexplore.ieee.org/document/5986709

  44. [44]

    Automatic foveation for video compression using a neurobiological model of visual attention,

    L. Itti, “Automatic foveation for video compression using a neurobiological model of visual attention,” IEEE TIP, vol. 13, no. 10, pp. 1304–1318, Oct. 2004

  45. [45]

    Two-stream convolutional networks for action recognition in videos,

    K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NeurIPS, ser. NIPS’14, vol. 1. Cambridge, MA, USA: MIT Press, Dec. 2014, pp. 568–576

  46. [46]

    Deepvs: A deep learning based video saliency prediction approach,

    L. Jiang, M. Xu, T. Liu, M. Qiao, and Z. Wang, “Deepvs: A deep learning based video saliency prediction approach,” in ECCV, 2018, pp. 602–617

  47. [47]

    Spatio-temporal saliency networks for dynamic saliency prediction,

    C. Bak, A. Kocak, E. Erdem, and A. Erdem, “Spatio-temporal saliency networks for dynamic saliency prediction,” IEEE TMM, vol. 20, no. 7, pp. 1688–1698, Jul. 2018. [Online]. Available: https://ieeexplore.ieee.org/document/8119879

  48. [48]

    Video saliency prediction based on spatial-temporal two-stream network,

    K. Zhang and Z. Chen, “Video saliency prediction based on spatial-temporal two-stream network,” IEEE TCSVT, vol. 29, no. 12, pp. 3544–3557, Dec. 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8543830

  49. [49]

    Salsac: A video saliency prediction model with shuffled attentions and correlation-based convlstm,

    X. Wu, Z. Wu, J. Zhang, L. Ju, and S. Wang, “Salsac: A video saliency prediction model with shuffled attentions and correlation-based convlstm,” in AAAI, vol. 34, 2020, pp. 12 410–12 417

  50. [50]

    Simple vs complex temporal recurrences for video saliency prediction

    P. Linardos, E. Mohedano, J. J. Nieto, N. E. O’Connor, X. Giró-i-Nieto, and K. McGuinness, “Simple vs complex temporal recurrences for video saliency prediction,” arXiv preprint arXiv:1907.01869, 2019

  51. [51]

    Going from image to video saliency: Augmenting image salience with dynamic attentional push,

    S. Gorji and J. J. Clark, “Going from image to video saliency: Augmenting image salience with dynamic attentional push,” in CVPR, Jun. 2018, pp. 7501–7511. [Online]. Available: https://ieeexplore.ieee.org/document/8578881

  52. [52]

    Unified image and video saliency modeling,

    R. Droste, J. Jiao, and J. A. Noble, “Unified image and video saliency modeling,” in ECCV, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds., vol. 12350. Cham: Springer Int. Publishing, 2020, pp. 419–435. [Online]. Available: https://link.springer.com/10.1007/978-3-030-58558-7_25

  53. [53]

    Predicting human eye fixations via an LSTM-based saliency attentive model,

    M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, “Predicting human eye fixations via an LSTM-based saliency attentive model,” IEEE TIP, vol. 27, no. 10, pp. 5142–5154, Oct. 2018. [Online]. Available: https://ieeexplore.ieee.org/document/8400593

  54. [54]

    A spatial-temporal recurrent neural network for video saliency prediction,

    K. Zhang, Z. Chen, and S. Liu, “A spatial-temporal recurrent neural network for video saliency prediction,” IEEE TIP, vol. 30, pp. 572–587, 2021. [Online]. Available: https://ieeexplore.ieee.org/document/9263359

  56. [56]

    Deep3DSaliency: Deep stereoscopic video saliency detection model by 3d convolutional networks,

    Y. Fang, G. Ding, J. Li, and Z. Fang, “Deep3DSaliency: Deep stereoscopic video saliency detection model by 3d convolutional networks,” IEEE TIP, Dec. 2018

  57. [57]

    Tased-net: Temporally-aggregating spatial encoder-decoder network for video saliency detection,

    K. Min and J. J. Corso, “Tased-net: Temporally-aggregating spatial encoder-decoder network for video saliency detection,” in CVPR, 2019, pp. 2394–2403

  58. [58]

    Spatio-temporal self-attention network for video saliency prediction,

    Z. Wang, Z. Liu, G. Li, Y. Wang, T. Zhang, L. Xu, and J. Wang, “Spatio-temporal self-attention network for video saliency prediction,” IEEE TMM, vol. 25, pp. 1161–1174, 2023. [Online]. Available: https://ieeexplore.ieee.org/document/9667292/

  59. [59]

    Hierarchical domain-adapted feature learning for video saliency prediction,

    G. Bellitto, F. Proietto Salanitri, S. Palazzo, F. Rundo, D. Giordano, and C. Spampinato, “Hierarchical domain-adapted feature learning for video saliency prediction,” IJCV, vol. 129, no. 12, pp. 3216–3232, Dec. 2021. [Online]. Available: https://link.springer.com/10.1007/s11263-021-01519-y

  60. [60]

    Video saliency prediction using spatiotemporal residual attentive networks,

    Q. Lai, W. Wang, H. Sun, and J. Shen, “Video saliency prediction using spatiotemporal residual attentive networks,” IEEE TIP, vol. 29, pp. 1113–1126, 2019

  61. [61]

    Temporal-spatial feature pyramid for video saliency detection,

    Q. Chang and S. Zhu, “Temporal-spatial feature pyramid for video saliency detection,” Sep. 2021, arXiv:2105.04213. [Online]. Available: http://arxiv.org/abs/2105.04213

  62. [62]

    ViNet: Pushing the limits of visual modality for audio-visual saliency prediction,

    S. Jain, P. Yarlagadda, S. Jyoti, S. Karthik, R. Subramanian, and V. Gandhi, “ViNet: Pushing the limits of visual modality for audio-visual saliency prediction,” in 2021 IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS). Prague, Czech Republic: IEEE, Sep. 2021, pp. 3520–3527. [Online]. Available: https://ieeexplore.ieee.org/document/9635989/

  63. [63]

    Transformer-based multi-scale feature integration network for video saliency prediction,

    X. Zhou, S. Wu, R. Shi, B. Zheng, S. Wang, H. Yin, J. Zhang, and C. Yan, “Transformer-based multi-scale feature integration network for video saliency prediction,” IEEE TCSVT, vol. 33, no. 12, pp. 7696–7707, Dec. 2023. [Online]. Available: https://ieeexplore.ieee.org/document/10130326/authors#authors

  64. [64]

    SalFoM: Dynamic saliency prediction with video foundation models,

    M. Moradi, M. Moradi, F. Rundo, C. Spampinato, A. Borji, and S. Palazzo, “SalFoM: Dynamic saliency prediction with video foundation models,” in Pattern Recognition, A. Antonacopoulos, S. Chaudhuri, R. Chellappa, C.-L. Liu, S. Bhattacharya, and U. Pal, Eds. Cham: Springer Nature Switzerland, 2025, pp. 33–48

  65. [65]

    TM2SP: A transformer-based multi-level spatiotemporal feature pyramid network for video saliency prediction,

    C. Li and S. Liu, “TM2SP: A transformer-based multi-level spatiotemporal feature pyramid network for video saliency prediction,” IEEE TCSVT, vol. 35, no. 6, pp. 5236–5250, Jun. 2025. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/10841372/authors

  66. [66]

    Video saliency forecasting transformer,

    C. Ma, H. Sun, Y. Rao, J. Zhou, and J. Lu, “Video saliency forecasting transformer,” IEEE TCSVT, vol. 32, no. 10, pp. 6850–6862, Oct. 2022. [Online]. Available: https://ieeexplore.ieee.org/document/9770033/

  67. [67]

    Transformer-based video saliency prediction with high temporal dimension decoding,

    M. Moradi, S. Palazzo, and C. Spampinato, “Transformer-based video saliency prediction with high temporal dimension decoding,” Jan. 2024, arXiv:2401.07942. [Online]. Available: http://arxiv.org/abs/2401.07942

  68. [68]

    Scalability in perception for autonomous driving: Waymo open dataset,

    P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov, “Scalability in perception for autonomous driving: Waymo open dataset,” in CVPR, Jun. 2020, pp. 2443...

  69. [69]

    Large scale interactive motion forecasting for autonomous driving : The waymo open motion dataset,

    S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp, C. Qi, Y. Zhou, Z. Yang, A. Chouard, P. Sun, J. Ngiam, V. Vasudevan, A. McCauley, J. Shlens, and D. Anguelov, “Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset,” in ICCV, Oct. 2021, pp. 9690–9699. [Online]. Available: https:/...

  70. [70]

    Exploring the limitations of behavior cloning for autonomous driving,

    F. Codevilla, E. Santana, A. Lopez, and A. Gaidon, “Exploring the limitations of behavior cloning for autonomous driving,” in ICCV, Oct. 2019, pp. 9328–9337. [Online]. Available: https://ieeexplore.ieee.org/document/9009463

  71. [71]

    Safety-critical learning for long-tail events: The TUM traffic accident dataset,

    W. Zimmer, R. Greer, X. Zhou, R. Song, M. Pavel, D. Lehmberg, A. Ghita, A. Gopalkrishnan, M. Trivedi, and A. Knoll, “Safety-critical learning for long-tail events: The TUM traffic accident dataset,” Aug. 2025, arXiv:2508.14567. [Online]. Available: http://arxiv.org/abs/2508.14567

  73. [73]

    Uncertainty-based traffic accident anticipation with spatio-temporal relational learning,

    W. Bao, Q. Yu, and Y. Kong, “Uncertainty-based traffic accident anticipation with spatio-temporal relational learning,” in ACM MM, ser. MM ’20. New York, NY, USA: Association for Computing Machinery, Oct. 2020, pp. 2682–2690. [Online]. Available: https://dl.acm.org/doi/10.1145/3394171.3413827

  74. [74]

    Advances, challenges, and future research needs in machine learning-based crash prediction models: A systematic review,

    Y. Ali, F. Hussain, and M. M. Haque, “Advances, challenges, and future research needs in machine learning-based crash prediction models: A systematic review,” Accident Analysis & Prevention, vol. 194, p. 107378, Jan. 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0001457523004256

  75. [75]

    Traffic accident risk prediction based on deep learning and spatiotemporal features of vehicle trajectories,

    H. Li and L. Chen, “Traffic accident risk prediction based on deep learning and spatiotemporal features of vehicle trajectories,” PLOS ONE, vol. 20, no. 5, p. e0320656, May 2025. [Online]. Available: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0320656

  76. [76]

    Prediction of traffic accident risk based on vehicle trajectory data,

    H. Li and L. Yu, “Prediction of traffic accident risk based on vehicle trajectory data,” Traffic Injury Prevention, vol. 26, no. 2, pp. 164–171, 2025

  77. [77]

    A dynamic spatial-temporal attention network for early anticipation of traffic accidents,

    M. M. Karim, Y . Li, R. Qin, and Z. Yin, “A dynamic spatial-temporal attention network for early anticipation of traffic accidents,” IEEE Trans. on Intell. Transp Syst., vol. 23, no. 7, pp. 9590–9600, Jul. 2022. [Online]. Available: https://doi.org/10.1109/TITS.2022.3155613

  78. [78]

    Applying computational tools to predict gaze direction in interactive visual environments,

    R. J. Peters and L. Itti, “Applying computational tools to predict gaze direction in interactive visual environments,” ACM Trans. Appl. Percept., vol. 5, no. 2, pp. 9:1–9:19, May 2008. [Online]. Available: https://doi.org/10.1145/1279920.1279923

  79. [79]

    Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios,

    Recommendation ITU-R BT.601, “Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios,” Int. Radio Consultative Committee, Int. Telecommunication Union, Switzerland, CCIR Rep., 2011

  80. [80]

    Fast 2d complex gabor filter with kernel decomposition,

    J. Kim, S. Um, and D. Min, “Fast 2d complex gabor filter with kernel decomposition,” IEEE TIP, vol. 27, no. 4, pp. 1713–1722, Apr. 2018. [Online]. Available: http://ieeexplore.ieee.org/document/8207611/

Showing first 80 references.