Pith · machine review for the scientific record

arxiv: 2604.20311 · v2 · submitted 2026-04-22 · 💻 cs.MM · cs.AI

Recognition: unknown

Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction


Pith reviewed 2026-05-09 23:09 UTC · model grok-4.3

classification 💻 cs.MM cs.AI
keywords micro-video popularity prediction · spatio-temporal enlargement · frame scoring module · topology-aware memory bank · long-sequence perception · scalable knowledge utilization · video recommendation · temporal dynamics

The pith

A unified framework for joint spatio-temporal enlargement lets micro-video popularity models perceive longer sequences and draw from more historical videos without growing storage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper targets two limits in micro-video popularity prediction: models that attend to only short spans of a video, and memory stores too small to surface relevant past examples. The authors build a single system that enlarges both the temporal view and the spatial view, so that highlight patterns across many frames can be captured and new videos can be folded into history without growing storage. If the approach works, forecasts become more accurate because the model sees the full content arc and its ties to earlier videos. Readers who care about online platforms will note that such forecasts directly shape what content gets recommended and how servers allocate bandwidth.

Core claim

The central claim is that temporal enlargement via a frame scoring module (which extracts highlight cues through sparse sampling and dense perception pathways and fuses them adaptively) combined with spatial enlargement via a topology-aware memory bank (which hierarchically clusters historical content and updates encoder features rather than expanding storage) produces precise long-sequence understanding and scalable knowledge utilization, directly improving popularity prediction.
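The two-pathway scoring and adaptive fusion can be made concrete. Below is a minimal NumPy sketch, assuming mean activation as a stand-in for the learned frame scorer and a variance-based softmax gate for the adaptive fusion; every function name and parameter here is hypothetical, not the paper's API:

```python
import numpy as np

def frame_scores_sparse(frames, stride=8):
    """Sparse pathway: score every `stride`-th frame, then forward-fill."""
    scores = np.full(len(frames), np.nan)
    idx = np.arange(0, len(frames), stride)
    scores[idx] = frames[idx].mean(axis=-1)   # stand-in for a learned scorer
    for i in range(1, len(scores)):           # carry each coarse score forward
        if np.isnan(scores[i]):
            scores[i] = scores[i - 1]
    return scores

def frame_scores_dense(frames, window=4):
    """Dense pathway: a local moving average over every frame's score."""
    raw = frames.mean(axis=-1)
    return np.convolve(raw, np.ones(window) / window, mode="same")

def fuse(sparse, dense):
    """Adaptive fusion via a softmax gate on each pathway's spread;
    a toy stand-in for the paper's learned fusion."""
    w = np.exp(np.array([sparse.std(), dense.std()]))
    w /= w.sum()
    return w[0] * sparse + w[1] * dense

rng = np.random.default_rng(0)
frames = rng.random((64, 512))            # 64 frames, 512-dim features
fused = fuse(frame_scores_sparse(frames), frame_scores_dense(frames))
highlights = np.argsort(fused)[-8:]       # top-8 candidate highlight frames
```

The point of the sketch is the shape of the computation: one cheap coarse pass, one fine pass over all frames, and a data-dependent blend rather than a fixed average.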

What carries the argument

The joint spatio-temporal enlargement formed by the frame scoring module for adaptive long-sequence fusion and the topology-aware memory bank for hierarchical clustering and feature updates.

If this is right

  • The method produces consistent gains over eleven baselines on three standard MVPP benchmarks.
  • It improves both prediction accuracy and ranking consistency.
  • It supports more reliable content recommendation and traffic allocation by using fuller video context.
  • It incorporates all relevant historical videos while keeping storage growth bounded.
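The bounded-storage claim in the last bullet can be sketched directly. This is an illustrative toy, assuming cosine-similarity cluster assignment and a similarity-weighted blend; the class name and all parameters are hypothetical, not the paper's implementation:

```python
import numpy as np

class TopologyAwareMemoryBank:
    """Illustrative fixed-capacity cluster memory. Cosine assignment and the
    blend rule are assumptions for this sketch, not the paper's exact design."""

    def __init__(self, centroids):
        self.features = np.array(centroids, dtype=float)  # one slot per cluster
        self.counts = np.zeros(len(self.features), dtype=int)

    def _nearest(self, x):
        sims = self.features @ x / (
            np.linalg.norm(self.features, axis=1) * np.linalg.norm(x) + 1e-9
        )
        return int(np.argmax(sims)), float(sims.max())

    def add(self, x, beta=0.3):
        # Update the nearest cluster's feature in place instead of appending
        # a new entry, so the number of stored vectors never grows.
        c, s = self._nearest(x)
        self.features[c] = (1 - beta * s) * self.features[c] + (beta * s) * x
        self.counts[c] += 1
        return c

rng = np.random.default_rng(1)
bank = TopologyAwareMemoryBank(rng.random((16, 32)))  # 16 clusters, 32-dim
for _ in range(1000):                                 # ingest 1000 "videos"
    bank.add(rng.random(32))
print(bank.features.shape)                            # still (16, 32)
```

Storage stays at one feature per cluster no matter how many videos arrive; the open question the referee raises is how much per-video detail survives those in-place updates.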

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same enlargement pattern could be tested on other sequential prediction tasks such as view-count forecasting for longer videos.
  • Applying the memory bank to datasets with millions of clips would check whether clustering remains efficient at web scale.
  • The dual sampling paths suggest a general way to combine coarse and fine video signals in any task that needs both speed and detail.

Load-bearing premise

The adaptive fusion of sparse and dense frame views together with the hierarchical clustering and feature updates will deliver the claimed long-sequence perception and scalable storage without introducing biases or new efficiency costs.

What would settle it

A side-by-side test on videos whose length exceeds the original sparse sampling window or on a growing set of historical references where the new method shows no gain in accuracy or ranking over the flat-memory baselines.

Figures

Figures reproduced from arXiv: 2604.20311 by Chen Xu, Dali Wang, Junqing Yu, Yi-Ping Phoebe Chen, Yunyao Zhang, Zikai Song.

Figure 1. Motivation of our work. (a) Innovation I addresses …
Figure 2. Overall framework of STAP. (1) The Temporal Enlargement process starts from multimodal encoding and frame …
Figure 3. Temporal Enlargement workflow. Frame scoring …
Figure 4. Spatial Enlargement workflow. The fused query en…
Figure 5. Sensitivity analysis of STAP on key hyperparameters. …
Figure 6. Temporal-side robustness under different training …
Figure 7. Spatial-memory retrieval quality on representative …
original abstract

Micro-video popularity prediction (MVPP) aims to forecast the future popularity of videos on online media, which is essential for applications such as content recommendation and traffic allocation. In real-world scenarios, it is critical for MVPP approaches to understand both the temporal dynamics of a given video (temporal) and its historical relevance to other videos (spatial). However, existing approaches sufer from limitations in both dimensions: temporally, they rely on sparse short-range sampling that restricts content perception; spatially, they depend on flat retrieval memory with limited capacity and low efficiency, hindering scalable knowledge utilization. To overcome these limitations, we propose a unified framework that achieves joint spatio-temporal enlargement, enabling precise perception of extremely long video sequences while supporting a scalable memory bank that can infinitely expand to incorporate all relevant historical videos. Technically, we employ a Temporal Enlargement driven by a frame scoring module that extracts highlight cues from video frames through two complementary pathways: sparse sampling and dense perception. Their outputs are adaptively fused to enable robust long-sequence content understanding. For Spatial Enlargement, we construct a Topology-Aware Memory Bank that hierarchically clusters historically relevant content based on topological relationships. Instead of directly expanding memory capacity, we update the encoder features of the corresponding clusters when incorporating new videos, enabling unbounded historical association without unbounded storage growth. Extensive experiments on three widely used MVPP benchmarks demonstrate that our method consistently outperforms 11 strong baselines across mainstream metrics, achieving robust improvements in both prediction accuracy and ranking consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a unified framework for micro-video popularity prediction (MVPP) that performs joint spatio-temporal enlargement. Temporally, a frame scoring module extracts highlight cues via complementary sparse sampling and dense perception pathways whose outputs are adaptively fused. Spatially, a Topology-Aware Memory Bank hierarchically clusters historical videos by topological relationships and updates the encoder features of the corresponding clusters (rather than adding new entries) to support unbounded historical association without unbounded storage growth. The authors report that the resulting model consistently outperforms 11 strong baselines across mainstream metrics on three widely used MVPP benchmarks, with gains in both prediction accuracy and ranking consistency.

Significance. If the central claims hold, the work would advance MVPP by addressing two practical bottlenecks—restricted temporal range from short-range sampling and limited scalability of flat retrieval memory—thereby improving content recommendation and traffic allocation systems. The constructive design of the memory bank (hierarchical clustering plus feature updates) is a notable technical contribution that could generalize beyond MVPP if information preservation is demonstrated.

major comments (1)
  1. [§3.3] §3.3 (Topology-Aware Memory Bank): The claim that updating cluster encoder features enables 'unbounded historical association' without loss of video-specific information is load-bearing for the spatial-enlargement contribution and the reported benchmark gains. The manuscript does not specify the exact update rule (e.g., averaging, weighted fusion) nor provide an analysis or ablation showing that distinctive per-video cues are retained after repeated updates. If clustering is imperfect or updates overwrite details, the memory bank would not truly support scalable knowledge utilization, making the outperformance potentially attributable to the temporal module alone.
minor comments (2)
  1. [Abstract] Abstract: 'sufer' is a typographical error and should read 'suffer'.
  2. [Abstract] Abstract: The abstract asserts consistent outperformance and 'robust improvements' but supplies no numerical deltas, ablation results, or error bars. Adding at least one representative table excerpt or key metric values would improve immediate readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential impact. We address the single major comment point-by-point below, agreeing that clarification and additional validation are warranted for the Topology-Aware Memory Bank.

point-by-point responses
  1. Referee: [§3.3] §3.3 (Topology-Aware Memory Bank): The claim that updating cluster encoder features enables 'unbounded historical association' without loss of video-specific information is load-bearing for the spatial-enlargement contribution and the reported benchmark gains. The manuscript does not specify the exact update rule (e.g., averaging, weighted fusion) nor provide an analysis or ablation showing that distinctive per-video cues are retained after repeated updates. If clustering is imperfect or updates overwrite details, the memory bank would not truly support scalable knowledge utilization, making the outperformance potentially attributable to the temporal module alone.

    Authors: We agree that the update rule and its effect on information retention must be made explicit to substantiate the spatial-enlargement claims. In the revised manuscript we will add the precise update formula in §3.3: when a new video is assigned to cluster c with centroid similarity s, the cluster encoder feature is updated via a weighted moving average f_c ← (1 − β·s)·f_c + (β·s)·f_new, where β is a fixed decay hyper-parameter (set to 0.3 in all experiments). This rule blends new information proportionally to relevance while damping overwriting of prior cluster content. We will also insert a new ablation subsection (§4.4) that (i) tracks average cosine similarity between original per-video features and their post-update cluster representations over 10 successive updates, (ii) compares prediction performance when the memory bank is replaced by a flat retrieval baseline of equal capacity, and (iii) reports that the joint spatio-temporal model still yields statistically significant gains over the temporal-only variant. These additions directly address the concern that gains might stem solely from the temporal module. revision: yes
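The rebuttal's update rule is simple enough to simulate. A minimal sketch of the stated formula f_c ← (1 − β·s)·f_c + (β·s)·f_new with β = 0.3, plus the retention tracking proposed in item (i); random Gaussian features stand in for real encoder outputs, and clamping the similarity at zero is an added assumption:

```python
import numpy as np

def update(f_c, f_new, s, beta=0.3):
    # Rebuttal's rule: f_c <- (1 - beta*s) * f_c + (beta*s) * f_new
    return (1 - beta * s) * f_c + (beta * s) * f_new

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
f0 = rng.normal(size=128)             # original cluster feature
f_c = f0.copy()
retention = []
for _ in range(10):                   # ten successive insertions, as in (i)
    f_new = rng.normal(size=128)
    s = max(cosine(f_c, f_new), 0.0)  # assumption: clamp similarity at zero
    f_c = update(f_c, f_new, s)
    retention.append(cosine(f0, f_c))
# Because the blend weight is beta*s rather than a constant, dissimilar
# videos barely move the cluster feature, so drift away from f0 is gradual.
```

A plot of `retention` over many more insertions of genuinely similar videos would be the informative version of this test; with unrelated random vectors the gate keeps updates small almost by construction.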

Circularity Check

0 steps flagged

No circularity: constructive framework with independent empirical validation

full rationale

The paper presents a constructive proposal for joint spatio-temporal enlargement in MVPP, describing a frame scoring module with adaptive fusion for temporal enlargement and a topology-aware memory bank using hierarchical clustering and cluster feature updates for spatial enlargement. No equations, derivations, or self-referential definitions appear that reduce the claimed outperformance to fitted inputs or prior self-citations by construction. The central claims rest on the architectural design and benchmark experiments against 11 baselines, which are externally falsifiable and do not collapse into the method's own parameters or definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The approach rests on standard assumptions of deep learning models for video and retrieval tasks plus two invented architectural components whose effectiveness is asserted via experiments.

invented entities (2)
  • Frame scoring module with sparse and dense pathways no independent evidence
    purpose: Extract highlight cues from video frames for long-sequence understanding
    New component introduced to address sparse short-range sampling limitation
  • Topology-Aware Memory Bank no independent evidence
    purpose: Hierarchically cluster historical videos and update cluster features for scalable unbounded memory
    New memory structure to overcome flat retrieval memory limits

pith-pipeline@v0.9.0 · 5577 in / 1193 out tokens · 27993 ms · 2026-05-09T23:09:07.522630+00:00 · methodology

discussion (0)

