Pith · machine review for the scientific record

arxiv: 2604.20311 · v2 · submitted 2026-04-22 · 💻 cs.MM · cs.AI

Recognition: unknown

Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction


Pith reviewed 2026-05-09 23:09 UTC · model grok-4.3

classification 💻 cs.MM cs.AI
keywords micro-video popularity prediction · spatio-temporal enlargement · frame scoring module · topology-aware memory bank · long-sequence perception · scalable knowledge utilization · video recommendation · temporal dynamics

The pith

A unified framework for joint spatio-temporal enlargement lets micro-video popularity models perceive longer sequences and draw from more historical videos without growing storage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper targets two limits in micro-video popularity prediction: models that attend to only short spans of a video, and memory stores too small to surface relevant past examples. The authors build a single system that enlarges both the temporal view and the spatial view, so that highlight patterns across many frames can be captured and new videos can be folded into history without growing storage. If the approach works, forecasts become more accurate because the model sees the full content arc and its ties to earlier videos. Readers who care about online platforms will note that such forecasts directly shape what content gets recommended and how servers allocate bandwidth.

Core claim

The central claim is that temporal enlargement via a frame scoring module (which extracts highlight cues through sparse sampling and dense perception pathways and fuses them adaptively) combined with spatial enlargement via a topology-aware memory bank (which hierarchically clusters historical content and updates encoder features rather than expanding storage) produces precise long-sequence understanding and scalable knowledge utilization, directly improving popularity prediction.
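The two-pathway scoring and adaptive fusion can be made concrete. Below is a minimal NumPy sketch, assuming mean activation as a stand-in for the learned frame scorer and a variance-based softmax gate for the adaptive fusion; every function name and parameter here is hypothetical, not the paper's API:

```python
import numpy as np

def frame_scores_sparse(frames, stride=8):
    """Sparse pathway: score every `stride`-th frame, then forward-fill."""
    scores = np.full(len(frames), np.nan)
    idx = np.arange(0, len(frames), stride)
    scores[idx] = frames[idx].mean(axis=-1)   # stand-in for a learned scorer
    for i in range(1, len(scores)):           # carry each coarse score forward
        if np.isnan(scores[i]):
            scores[i] = scores[i - 1]
    return scores

def frame_scores_dense(frames, window=4):
    """Dense pathway: a local moving average over every frame's score."""
    raw = frames.mean(axis=-1)
    return np.convolve(raw, np.ones(window) / window, mode="same")

def fuse(sparse, dense):
    """Adaptive fusion via a softmax gate on each pathway's spread;
    a toy stand-in for the paper's learned fusion."""
    w = np.exp(np.array([sparse.std(), dense.std()]))
    w /= w.sum()
    return w[0] * sparse + w[1] * dense

rng = np.random.default_rng(0)
frames = rng.random((64, 512))            # 64 frames, 512-dim features
fused = fuse(frame_scores_sparse(frames), frame_scores_dense(frames))
highlights = np.argsort(fused)[-8:]       # top-8 candidate highlight frames
```

The point of the sketch is the shape of the computation: one cheap coarse pass, one fine pass over all frames, and a data-dependent blend rather than a fixed average.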

What carries the argument

The joint spatio-temporal enlargement formed by the frame scoring module for adaptive long-sequence fusion and the topology-aware memory bank for hierarchical clustering and feature updates.

If this is right

  • The method produces consistent gains over eleven baselines on three standard MVPP benchmarks.
  • It improves both prediction accuracy and ranking consistency.
  • It supports more reliable content recommendation and traffic allocation by using fuller video context.
  • It incorporates all relevant historical videos while keeping storage growth bounded.
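The bounded-storage claim in the last bullet can be sketched directly. This is an illustrative toy, assuming cosine-similarity cluster assignment and a similarity-weighted blend; the class name and all parameters are hypothetical, not the paper's implementation:

```python
import numpy as np

class TopologyAwareMemoryBank:
    """Illustrative fixed-capacity cluster memory. Cosine assignment and the
    blend rule are assumptions for this sketch, not the paper's exact design."""

    def __init__(self, centroids):
        self.features = np.array(centroids, dtype=float)  # one slot per cluster
        self.counts = np.zeros(len(self.features), dtype=int)

    def _nearest(self, x):
        sims = self.features @ x / (
            np.linalg.norm(self.features, axis=1) * np.linalg.norm(x) + 1e-9
        )
        return int(np.argmax(sims)), float(sims.max())

    def add(self, x, beta=0.3):
        # Update the nearest cluster's feature in place instead of appending
        # a new entry, so the number of stored vectors never grows.
        c, s = self._nearest(x)
        self.features[c] = (1 - beta * s) * self.features[c] + (beta * s) * x
        self.counts[c] += 1
        return c

rng = np.random.default_rng(1)
bank = TopologyAwareMemoryBank(rng.random((16, 32)))  # 16 clusters, 32-dim
for _ in range(1000):                                 # ingest 1000 "videos"
    bank.add(rng.random(32))
print(bank.features.shape)                            # still (16, 32)
```

Storage stays at one feature per cluster no matter how many videos arrive; the open question the referee raises is how much per-video detail survives those in-place updates.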

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same enlargement pattern could be tested on other sequential prediction tasks such as view-count forecasting for longer videos.
  • Applying the memory bank to datasets with millions of clips would check whether clustering remains efficient at web scale.
  • The dual sampling paths suggest a general way to combine coarse and fine video signals in any task that needs both speed and detail.

Load-bearing premise

The adaptive fusion of sparse and dense frame views together with the hierarchical clustering and feature updates will deliver the claimed long-sequence perception and scalable storage without introducing biases or new efficiency costs.

What would settle it

A side-by-side test on videos whose length exceeds the original sparse sampling window or on a growing set of historical references where the new method shows no gain in accuracy or ranking over the flat-memory baselines.

Figures

Figures reproduced from arXiv: 2604.20311 by Chen Xu, Dali Wang, Junqing Yu, Yi-Ping Phoebe Chen, Yunyao Zhang, Zikai Song.

Figure 1. Motivation of our work. (a) Innovation I addresses …
Figure 2. Overall framework of STAP. (1) The Temporal Enlargement process starts from multimodal encoding and frame …
Figure 3. Temporal Enlargement workflow. Frame scoring …
Figure 4. Spatial Enlargement workflow. The fused query en…
Figure 5. Sensitivity analysis of STAP on key hyperparameters. …
Figure 6. Temporal-side robustness under different training …
Figure 7. Spatial-memory retrieval quality on representative …
original abstract

Micro-video popularity prediction (MVPP) aims to forecast the future popularity of videos on online media, which is essential for applications such as content recommendation and traffic allocation. In real-world scenarios, it is critical for MVPP approaches to understand both the temporal dynamics of a given video (temporal) and its historical relevance to other videos (spatial). However, existing approaches sufer from limitations in both dimensions: temporally, they rely on sparse short-range sampling that restricts content perception; spatially, they depend on flat retrieval memory with limited capacity and low efficiency, hindering scalable knowledge utilization. To overcome these limitations, we propose a unified framework that achieves joint spatio-temporal enlargement, enabling precise perception of extremely long video sequences while supporting a scalable memory bank that can infinitely expand to incorporate all relevant historical videos. Technically, we employ a Temporal Enlargement driven by a frame scoring module that extracts highlight cues from video frames through two complementary pathways: sparse sampling and dense perception. Their outputs are adaptively fused to enable robust long-sequence content understanding. For Spatial Enlargement, we construct a Topology-Aware Memory Bank that hierarchically clusters historically relevant content based on topological relationships. Instead of directly expanding memory capacity, we update the encoder features of the corresponding clusters when incorporating new videos, enabling unbounded historical association without unbounded storage growth. Extensive experiments on three widely used MVPP benchmarks demonstrate that our method consistently outperforms 11 strong baselines across mainstream metrics, achieving robust improvements in both prediction accuracy and ranking consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a unified framework for micro-video popularity prediction (MVPP) that performs joint spatio-temporal enlargement. Temporally, a frame scoring module extracts highlight cues via complementary sparse sampling and dense perception pathways whose outputs are adaptively fused. Spatially, a Topology-Aware Memory Bank hierarchically clusters historical videos by topological relationships and updates the encoder features of the corresponding clusters (rather than adding new entries) to support unbounded historical association without unbounded storage growth. The authors report that the resulting model consistently outperforms 11 strong baselines across mainstream metrics on three widely used MVPP benchmarks, with gains in both prediction accuracy and ranking consistency.

Significance. If the central claims hold, the work would advance MVPP by addressing two practical bottlenecks—restricted temporal range from short-range sampling and limited scalability of flat retrieval memory—thereby improving content recommendation and traffic allocation systems. The constructive design of the memory bank (hierarchical clustering plus feature updates) is a notable technical contribution that could generalize beyond MVPP if information preservation is demonstrated.

major comments (1)
  1. [§3.3] §3.3 (Topology-Aware Memory Bank): The claim that updating cluster encoder features enables 'unbounded historical association' without loss of video-specific information is load-bearing for the spatial-enlargement contribution and the reported benchmark gains. The manuscript does not specify the exact update rule (e.g., averaging, weighted fusion) nor provide an analysis or ablation showing that distinctive per-video cues are retained after repeated updates. If clustering is imperfect or updates overwrite details, the memory bank would not truly support scalable knowledge utilization, making the outperformance potentially attributable to the temporal module alone.
minor comments (2)
  1. [Abstract] Abstract: 'sufer' is a typographical error and should read 'suffer'.
  2. [Abstract] Abstract: The abstract asserts consistent outperformance and 'robust improvements' but supplies no numerical deltas, ablation results, or error bars. Adding at least one representative table excerpt or key metric values would improve immediate readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential impact. We address the single major comment point-by-point below, agreeing that clarification and additional validation are warranted for the Topology-Aware Memory Bank.

point-by-point responses
  1. Referee: [§3.3] §3.3 (Topology-Aware Memory Bank): The claim that updating cluster encoder features enables 'unbounded historical association' without loss of video-specific information is load-bearing for the spatial-enlargement contribution and the reported benchmark gains. The manuscript does not specify the exact update rule (e.g., averaging, weighted fusion) nor provide an analysis or ablation showing that distinctive per-video cues are retained after repeated updates. If clustering is imperfect or updates overwrite details, the memory bank would not truly support scalable knowledge utilization, making the outperformance potentially attributable to the temporal module alone.

    Authors: We agree that the update rule and its effect on information retention must be made explicit to substantiate the spatial-enlargement claims. In the revised manuscript we will add the precise update formula in §3.3: when a new video is assigned to cluster c with centroid similarity s, the cluster encoder feature is updated via a weighted moving average f_c ← (1 − β·s)·f_c + (β·s)·f_new, where β is a fixed decay hyper-parameter (set to 0.3 in all experiments). This rule blends new information proportionally to relevance while damping overwriting of prior cluster content. We will also insert a new ablation subsection (§4.4) that (i) tracks average cosine similarity between original per-video features and their post-update cluster representations over 10 successive updates, (ii) compares prediction performance when the memory bank is replaced by a flat retrieval baseline of equal capacity, and (iii) reports that the joint spatio-temporal model still yields statistically significant gains over the temporal-only variant. These additions directly address the concern that gains might stem solely from the temporal module. revision: yes
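The rebuttal's update rule is simple enough to simulate. A minimal sketch of the stated formula f_c ← (1 − β·s)·f_c + (β·s)·f_new with β = 0.3, plus the retention tracking proposed in item (i); random Gaussian features stand in for real encoder outputs, and clamping the similarity at zero is an added assumption:

```python
import numpy as np

def update(f_c, f_new, s, beta=0.3):
    # Rebuttal's rule: f_c <- (1 - beta*s) * f_c + (beta*s) * f_new
    return (1 - beta * s) * f_c + (beta * s) * f_new

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
f0 = rng.normal(size=128)             # original cluster feature
f_c = f0.copy()
retention = []
for _ in range(10):                   # ten successive insertions, as in (i)
    f_new = rng.normal(size=128)
    s = max(cosine(f_c, f_new), 0.0)  # assumption: clamp similarity at zero
    f_c = update(f_c, f_new, s)
    retention.append(cosine(f0, f_c))
# Because the blend weight is beta*s rather than a constant, dissimilar
# videos barely move the cluster feature, so drift away from f0 is gradual.
```

A plot of `retention` over many more insertions of genuinely similar videos would be the informative version of this test; with unrelated random vectors the gate keeps updates small almost by construction.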

Circularity Check

0 steps flagged

No circularity: constructive framework with independent empirical validation

full rationale

The paper presents a constructive proposal for joint spatio-temporal enlargement in MVPP, describing a frame scoring module with adaptive fusion for temporal enlargement and a topology-aware memory bank using hierarchical clustering and cluster feature updates for spatial enlargement. No equations, derivations, or self-referential definitions appear that reduce the claimed outperformance to fitted inputs or prior self-citations by construction. The central claims rest on the architectural design and benchmark experiments against 11 baselines, which are externally falsifiable and do not collapse into the method's own parameters or definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The approach rests on standard assumptions of deep learning models for video and retrieval tasks plus two invented architectural components whose effectiveness is asserted via experiments.

invented entities (2)
  • Frame scoring module with sparse and dense pathways no independent evidence
    purpose: Extract highlight cues from video frames for long-sequence understanding
    New component introduced to address sparse short-range sampling limitation
  • Topology-Aware Memory Bank no independent evidence
    purpose: Hierarchically cluster historical videos and update cluster features for scalable unbounded memory
    New memory structure to overcome flat retrieval memory limits

pith-pipeline@v0.9.0 · 5577 in / 1193 out tokens · 27993 ms · 2026-05-09T23:09:07.522630+00:00 · methodology

discussion (0)

