Will It Go Viral? Grounding Micro-Video Popularity Prediction on the Open Web
Pith reviewed 2026-05-20 00:50 UTC · model grok-4.3
The pith
Structured open-web context and trend-aware adaptation are required for accurate micro-video popularity prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Micro-video popularity prediction is reformulated as open-web grounded prediction. The WEBSHORTS dataset couples 14K videos with real-time open-web context organized as three-dimensional evidence-cards and daily view counts over 7 days. SHORTS-CAST generates dimension-wise rationales from the evidence-card to guide popularity regression and adapts selectively when delayed labels reveal genuine trend shifts. It outperforms content-only, retrieval-augmented, and other online adaptation baselines under offline and delayed-label online protocols.
What carries the argument
The three-dimensional evidence-card capturing external attention along complementary web-context dimensions, which serves as the basis for rationale generation and popularity prediction in the SHORTS-CAST framework.
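To make the shape of the data concrete, here is a minimal sketch of how one WEBSHORTS record might be held in memory. It is an illustration, not the authors' schema. The abstract does not name the three dimensions, so they appear as generic placeholders. The class names, field names, and the choice of the last daily count as the regression target are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvidenceCard:
    """Hypothetical container for open-web context collected at upload time.

    The source says the card has three complementary dimensions but does not
    name them, so generic placeholder fields are used here.
    """
    dimension_a: str
    dimension_b: str
    dimension_c: str

    def dimensions(self) -> List[str]:
        # Return the three dimension texts in a fixed order.
        return [self.dimension_a, self.dimension_b, self.dimension_c]


@dataclass
class WebShortsRecord:
    """Hypothetical record: one video, its evidence-card, 7 daily view counts."""
    video_id: str
    card: EvidenceCard
    daily_views: List[int]

    def __post_init__(self) -> None:
        # The dataset tracks views over 7 days, so reject other lengths.
        if len(self.daily_views) != 7:
            raise ValueError("expected exactly 7 daily view counts")

    def target(self) -> int:
        # Assumption: counts are cumulative and the day-7 value is the target.
        return self.daily_views[-1]
```

Keeping the card separate from the view counts mirrors the split the summary draws between context available at upload time and labels that only arrive later.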
If this is right
- Improved accuracy in popularity forecasting supports better recommendation and advertising decisions.
- Trend-aware adaptation enables handling of fast-evolving short-form video ecosystems.
- Use of delayed labels allows detection of genuine trend shifts for model updates.
- Structured web context reduces reliance on historical internal video corpora.
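The delayed-label setting behind the points above can be illustrated with a toy loop. This is a sketch under stated assumptions, not the SHORTS-CAST mechanism. The model is a one-feature linear regressor, the gate is a plain absolute-error threshold standing in for the paper's unspecified trend-shift criterion, and the function name and every parameter are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple


def run_delayed_label_stream(
    stream: List[Tuple[float, float]],
    delay: int = 7,
    threshold: float = 1.0,
    lr: float = 0.1,
) -> Tuple[List[float], int]:
    """Predict each item on arrival; learn from its label `delay` steps later.

    stream: (feature, label) pairs in arrival order.
    Returns the list of predictions and the number of updates performed.
    """
    w, b = 0.0, 0.0
    pending: Deque[Tuple[float, float]] = deque()  # items whose label is hidden
    preds: List[float] = []
    updates = 0
    for x, y in stream:
        preds.append(w * x + b)  # the prediction never sees this item's label
        pending.append((x, y))
        if len(pending) > delay:
            x_old, y_old = pending.popleft()  # label revealed after the delay
            err = (w * x_old + b) - y_old     # error under the current model
            if abs(err) > threshold:          # assumed gate: skip small errors
                w -= lr * err * x_old         # one SGD step on squared error
                b -= lr * err
                updates += 1
    return preds, updates
```

With `delay=2`, the first three predictions come from the untouched initial model, since no label has been revealed before then. Raising `threshold` makes updates rarer, which is the selective part of the idea.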
Where Pith is reading between the lines
- The approach may generalize to predicting engagement for other time-sensitive content like live streams or social posts.
- Real-time web data collection could be combined with privacy-preserving techniques for broader adoption.
- Comparing performance across different web search providers might show robustness or sensitivity to data sources.
Load-bearing premise
Open-web context collected at upload time supplies predictive signal for popularity that is not already present in the video content or in retrieval from platform-internal video corpora.
What would settle it
A controlled experiment showing that a model without open-web context achieves comparable performance to SHORTS-CAST on the delayed-label online protocol would falsify the claim that web context is necessary.
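One way to judge "comparable performance" in such an experiment is a paired bootstrap over per-video errors. The sketch below is an assumed analysis, not one reported in the paper. The function name, the use of absolute error, and the 95% interval are illustrative choices.

```python
import random
from typing import List, Tuple


def paired_bootstrap_mae_gap(
    err_with: List[float],
    err_without: List[float],
    n_resamples: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """95% bootstrap interval for mean|err_without| - mean|err_with|.

    Both lists must hold errors for the same videos in the same order.
    """
    if len(err_with) != len(err_without) or not err_with:
        raise ValueError("need two equal-length, non-empty error lists")
    rng = random.Random(seed)
    # Per-video gap: positive when the model with web context is better.
    gaps = [abs(b) - abs(a) for a, b in zip(err_with, err_without)]
    n = len(gaps)
    means = []
    for _ in range(n_resamples):
        sample = [gaps[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[(25 * n_resamples) // 1000]        # 2.5th percentile
    hi = means[(975 * n_resamples) // 1000 - 1]   # 97.5th percentile
    return lo, hi
```

Under these assumptions, an interval that contains zero would be consistent with the content-only model performing comparably. An interval strictly above zero would favor the web-context model.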
Original abstract
Micro-video popularity prediction (MVPP) forecasts the popularity a newly uploaded short-form video will attract within a fixed number of days after upload. This task supports downstream applications in recommendation, advertising, and creator analytics, yet the problem is hard since virality depends on external trends rather than video content alone. Prior MVPP methods incorporate context by retrieving similar videos from platform-internal corpora, however historical neighbors cannot reveal whether a topic is currently trending, controversial, or already saturated across the open web. To this end, we reformulate MVPP as open-web grounded prediction and introduce WEBSHORTS, the first micro-video dataset that couples 14K videos with real-time open-web context collected at upload time, alongside daily view counts tracked over 7 days. The context for each video is organized as a structured evidence-card that captures the external attention landscape along three complementary web-context dimensions. We further propose SHORTS-CAST, a framework that generates dimension-wise rationales from the evidence-card to guide popularity regression, then adapts at deployment by selectively updating the context-to-popularity mapping when delayed labels reveal genuine trend shifts. In our experiments, SHORTS-CAST consistently outperforms content-only, video corpus retrieval-augmented, and online adaptation baselines under both offline and delayed-label online protocols, confirming that structured web context and trend-aware adaptation are jointly necessary for popularity forecasting under realistic deployment constraints in fast-evolving short-form video ecosystems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the WEBSHORTS dataset of 14K micro-videos paired with real-time open-web context collected at upload time and daily view counts over 7 days. Context is organized into structured three-dimensional evidence-cards. It proposes SHORTS-CAST, which generates dimension-wise rationales from the evidence-card for popularity regression and selectively adapts the context-to-popularity mapping at deployment when delayed labels indicate genuine trend shifts. Experiments report that SHORTS-CAST consistently outperforms content-only, video corpus retrieval-augmented, and online adaptation baselines under both offline and delayed-label online protocols, concluding that structured web context and trend-aware adaptation are jointly necessary for micro-video popularity prediction.
Significance. If the results hold under rigorous controls, the work advances micro-video popularity prediction by demonstrating the value of grounding forecasts in contemporaneous open-web signals rather than historical internal corpora alone. The new WEBSHORTS dataset and the evidence-card representation provide a concrete resource for future research on external context in dynamic media ecosystems. The selective adaptation mechanism directly targets the challenge of evolving trends.
major comments (2)
- [Experiments / online protocol description] The claim that trend-aware adaptation is jointly necessary for the online protocol rests on delayed labels reliably indicating genuine external trend shifts rather than platform noise or random fluctuations. The manuscript provides no quantitative check (e.g., correlation of label deltas with independent signals such as search-volume spikes or external mention counts) at the moments adaptation is triggered. This validation is load-bearing for the 'genuine trend shifts' premise and the resulting conclusion.
- [Method / SHORTS-CAST framework] Details on evidence-card construction (exact sources, aggregation rules, and temporal alignment for the three dimensions) and on the rationale-generation process (models, prompts, or training) are insufficient for replication. These choices directly affect whether the reported gains can be attributed to the structured web context rather than implementation specifics.
minor comments (1)
- [Abstract] The abstract would benefit from a brief statement of the magnitude of improvements (e.g., relative gains or absolute metrics) to allow readers to gauge practical significance without reading the full experimental tables.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. The comments highlight important aspects of validation and reproducibility that we address point by point below. We will revise the manuscript to incorporate clarifications and additional details where feasible.
- Referee: [Experiments / online protocol description] The claim that trend-aware adaptation is jointly necessary for the online protocol rests on delayed labels reliably indicating genuine external trend shifts rather than platform noise or random fluctuations. The manuscript provides no quantitative check (e.g., correlation of label deltas with independent signals such as search-volume spikes or external mention counts) at the moments adaptation is triggered. This validation is load-bearing for the 'genuine trend shifts' premise and the resulting conclusion.
  Authors: We acknowledge that an explicit quantitative validation linking label deltas to independent external signals would further support the interpretation of genuine trend shifts. Our online protocol is intentionally designed around the realistic constraint of delayed labels only, with the selective adaptation mechanism intended to respond to significant deviations that may reflect external changes. To address this, we will add a supplementary analysis in the revised manuscript that examines correlations between adaptation triggers and spikes in web mentions or related signals already present in the evidence-cards. We will also clarify the assumptions underlying the protocol and discuss potential noise sources as a limitation if the correlations prove modest.
  Revision: partial
- Referee: [Method / SHORTS-CAST framework] Details on evidence-card construction (exact sources, aggregation rules, and temporal alignment for the three dimensions) and on the rationale-generation process (models, prompts, or training) are insufficient for replication. These choices directly affect whether the reported gains can be attributed to the structured web context rather than implementation specifics.
  Authors: We agree that the current level of detail is insufficient for replication and that this affects attribution of gains to the structured context. In the revised manuscript we will expand the Methods section (and add an appendix if needed) to specify: the exact sources and collection methods for each of the three evidence-card dimensions; the aggregation rules, counting procedures, and normalization steps; the temporal alignment logic across sources; and the precise models, prompting templates, and any training or fine-tuning procedures used for rationale generation. These additions will enable readers to reproduce the evidence-card construction and SHORTS-CAST pipeline.
  Revision: yes
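The correlation check raised in the first exchange above could look roughly like the following. It is a hedged sketch, not the authors' planned analysis. The function names, the choice of Pearson correlation, and the idea of restricting the computation to steps where adaptation fired are assumptions made for illustration.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def trigger_alignment(
    label_deltas: Sequence[float],
    external_signal: Sequence[float],
    triggered: Sequence[bool],
) -> float:
    """Correlate label deltas with an external signal on triggered steps only."""
    # Keep only the time steps at which the adaptation gate fired.
    xs = [d for d, t in zip(label_deltas, triggered) if t]
    ys = [s for s, t in zip(external_signal, triggered) if t]
    return pearson(xs, ys)
```

A value near zero on the triggered steps would suggest that the gate is reacting to noise rather than to external attention, which is the concern the referee raises.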
Circularity Check
No significant circularity; derivation relies on external data and empirical baselines
The paper introduces a new dataset (WEBSHORTS) coupling videos with real-time open-web context collected at upload time and proposes SHORTS-CAST for generating rationales and selective adaptation using delayed labels. Performance is evaluated against content-only, retrieval-augmented, and online adaptation baselines under offline and delayed-label protocols. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The central claim of joint necessity for web context and trend-aware adaptation follows from comparative outperformance rather than by construction from the inputs themselves. The approach is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
- structured evidence-card: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We reformulate MVPP as open-web grounded prediction and introduce WEBSHORTS... evidence-card that captures the external attention landscape along three complementary web-context dimensions."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat induction and embed_strictMono (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Growth-Conditioned Drift filtering... triggers lightweight updates... when delayed labels reveal genuine trend shifts."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Micro tells macro: Predicting the popularity of micro-videos via a transductive model
Jingyuan Chen, Xuemeng Song, Liqiang Nie, Xiang Wang, Hanwang Zhang, and Tat-Seng Chua. Micro tells macro: Predicting the popularity of micro-videos via a transductive model. InProceedings of the 24th ACM International Conference on Multimedia, pages 898–907, 2016. doi: 10.1145/2964284.2964314
-
[2]
Smp challenge: An overview of social media prediction challenge 2019
Bo Wu, Wen-Huang Cheng, Peiye Liu, Bei Liu, Zhaoyang Zeng, and Jiebo Luo. Smp challenge: An overview of social media prediction challenge 2019. InProceedings of the 27th ACM International Conference on Multimedia, pages 2667–2671, 2019
work page 2019
-
[3]
Mvp: Winning solution to smp challenge 2025 video track
Liliang Ye, Yunyao Zhang, Yafeng Wu, Yi-Ping Phoebe Chen, Junqing Yu, Wei Yang, and Zikai Song. Mvp: Winning solution to smp challenge 2025 video track. InProceedings of the ACM International Conference on Multimedia, pages 14079–14085, 2025. doi: 10.1145/3746027.3763761
-
[4]
A multimodal variational encoder-decoder framework for micro-video popularity prediction
Jiayi Xie, Yaochen Zhu, Zhibin Zhang, Jian Peng, Jing Yi, Yaosi Hu, Hongyi Liu, and Zhenzhong Chen. A multimodal variational encoder-decoder framework for micro-video popularity prediction. In Proceedings of The Web Conference 2020, WWW ’20, page 2542–2548, New York, NY , USA, 2020. Association for Computing Machinery. ISBN 9781450370233. doi: 10.1145/336...
-
[5]
Predicting micro-video popularity via multi-modal retrieval augmentation
Ting Zhong, Jian Lang, Yifan Zhang, Zhangtao Cheng, Kunpeng Zhang, and Fan Zhou. Predicting micro-video popularity via multi-modal retrieval augmentation. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2579–2583, 2024. doi: 10.1145/3626772.3657929
-
[6]
Zhangtao Cheng, Jian Lang, Ting Zhong, and Fan Zhou. Seeing the unseen in micro-video popularity prediction: Self-correlation retrieval for missing modality generation. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 142–152, 2025. doi: 10.1145/ 3690624.3709308
-
[7]
Jack Hessel, Lillian Lee, and David Mimno. Cats and captions vs. creators and the clock: Comparing multimodal content to context in predicting relative popularity. InProceedings of the 26th international conference on world wide web, pages 927–936, 2017
work page 2017
-
[8]
Expecting to be hip: Hawkes intensity processes for social media popularity
Marian-Andrei Rizoiu, Lexing Xie, Scott Sanner, Manuel Cebrian, Honglin Yu, and Pascal Van Hentenryck. Expecting to be hip: Hawkes intensity processes for social media popularity. InProceedings of the 26th International Conference on World Wide Web, WWW ’17, page 735–744, Republic and Canton of Geneva, CHE, 2017. International World Wide Web Conferences S...
-
[9]
Retrieval- augmented hypergraph for multimodal social media popularity prediction
Zhangtao Cheng, Jienan Zhang, Xovee Xu, Goce Trajcevski, Ting Zhong, and Fan Zhou. Retrieval- augmented hypergraph for multimodal social media popularity prediction. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 445–455, 2024. doi: 10.1145/3637528.3672041
-
[10]
Echoes in the feed: Evolution- aware prompt-augmented micro-video popularity prediction
Wei Chen, Jiao Li, Jian Lang, Zhangtao Cheng, Yong Wang, and Fan Zhou. Echoes in the feed: Evolution- aware prompt-augmented micro-video popularity prediction. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2744–2748, 2025. doi: 10.1145/3726302.3730184
-
[11]
In-context prompt-augmented micro- video popularity prediction
Zhangtao Cheng, Jiao Li, Jian Lang, Ting Zhong, and Fan Zhou. In-context prompt-augmented micro- video popularity prediction. InProceedings of the AAAI Conference on Artificial Intelligence, pages 11527–11535, 2025. doi: 10.1609/aaai.v39i11.33254
-
[12]
Xovee Xu, Yifan Zhang, Fan Zhou, and Jingkuan Song. Improving multimodal social media popularity prediction via selective retrieval knowledge augmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 932–940, 2025. doi: 10.1609/aaai.v39i1.32078
-
[13]
A content-driven micro-video recommendation dataset at scale.arXiv preprint arXiv:2309.15379, 2023
Yongxin Ni, Yu Cheng, Xiangyan Liu, Junchen Fu, Youhua Li, Xiangnan He, Yongfeng Zhang, and Fajie Yuan. A content-driven micro-video recommendation dataset at scale.arXiv preprint arXiv:2309.15379, 2023
-
[14]
Freeman, Frédo Durand, Eli Shechtman, and Xun Huang
Yijie Xu, Bolun Zheng, Wei Zhu, Hangjia Pan, Yuchen Yao, Ning Xu, Anan Liu, Quan Zhang, and Chenggang Yan. Smtpd: A new benchmark for temporal prediction of social media popularity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18847–18857, 2025. doi: 10.1109/CVPR52734.2025.01756. 10
-
[15]
Real-time short video recommendation on mobile devices
Xudong Gong, Qinlin Feng, Yuan Zhang, Jiangling Qin, Weijie Ding, Biao Li, Peng Jiang, and Kun Gai. Real-time short video recommendation on mobile devices. InProceedings of the 31st ACM international conference on information & knowledge management, pages 3103–3112, 2022
work page 2022
-
[16]
Smp challenge: An overview and analysis of social media prediction challenge
Bo Wu, Peiye Liu, Wen-Huang Cheng, Bei Liu, Zhaoyang Zeng, Jia Wang, Qiushi Huang, and Jiebo Luo. Smp challenge: An overview and analysis of social media prediction challenge. InProceedings of the 31st ACM International Conference on Multimedia, pages 9651–9655, 2023
work page 2023
-
[17]
Aditya Khosla, Atish Das Sarma, and Raffay Hamid. What makes an image popular? InProceedings of the 23rd International Conference on World Wide Web (WWW), pages 867–876, 2014. doi: 10.1145/ 2566486.2567996
-
[18]
Peiguang Jing, Yuting Su, Liqiang Nie, Xu Bai, Jing Liu, and Meng Wang. Low-rank multi-view embedding learning for micro-video popularity prediction.IEEE Transactions on Knowledge and Data Engineering (TKDE), 30(8):1519–1532, 2018. doi: 10.1109/TKDE.2017.2785784
-
[19]
Social media popularity prediction based on visual-textual features with xgboost
Junhong Chen, Dayong Liang, Zhanmo Zhu, Xiaojing Zhou, Zihan Ye, and Xiuyun Mo. Social media popularity prediction based on visual-textual features with xgboost. InProceedings of the 27th ACM International Conference on Multimedia, pages 2692–2696, 2019
work page 2019
-
[20]
HyFea: Winning solution to social media popularity prediction for multimedia grand challenge 2020
Xin Lai, Yihong Zhang, and Wei Zhang. HyFea: Winning solution to social media popularity prediction for multimedia grand challenge 2020. InProceedings of the 28th ACM International Conference on Multimedia (MM), pages 4565–4569, 2020. doi: 10.1145/3394171.3416275
-
[21]
Jiayi Xie, Yaochen Zhu, and Zhenzhong Chen. Micro-video popularity prediction via multimodal varia- tional information bottleneck.IEEE Transactions on Multimedia, 25:24–37, 2021
work page 2021
-
[22]
Tsun-hin Cheung and Kin-man Lam. Crossmodal bipolar attention for multimodal classification on social media.Neurocomputing, 514:1–12, 2022
work page 2022
-
[23]
Multi-modal variational auto-encoder model for micro-video popularity prediction
Zhuoran Zhang, Shibiao Xu, Li Guo, and Wenke Lian. Multi-modal variational auto-encoder model for micro-video popularity prediction. InProceedings of the 8th International Conference on Communication and Information Processing (ICCIP), pages 9–16, 2022. doi: 10.1145/3571662.3571664
-
[24]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021
work page 2021
-
[25]
Multi-queue momentum contrast for microvideo-product retrieval
Yali Du, Yinwei Wei, Wei Ji, Fan Liu, Xin Luo, and Liqiang Nie. Multi-queue momentum contrast for microvideo-product retrieval. InProceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 1003–1011, 2023
work page 2023
-
[26]
Dual-stream pre-training transformer to enhance multimodal learning for social media prediction
Wenhao Hu, Weilong Chen, Weimin Yuan, Yan Wang, Shimin Cai, and Yanru Zhang. Dual-stream pre-training transformer to enhance multimodal learning for social media prediction. InProceedings of the 32nd ACM International Conference on Multimedia, pages 11450–11456, 2024
work page 2024
-
[27]
Higher-order vision-language alignment for social media prediction
Mingsheng Tu, Tianjiao Wan*, Qisheng Xu, Xinhao Jiang, Kele Xu, and Cheng Yang. Higher-order vision-language alignment for social media prediction. InProceedings of the 32nd ACM International Conference on Multimedia, pages 11457–11463, 2024
work page 2024
-
[28]
Efficient test-time adaptation of vision-language models
Adilbek Karmanov, Dayan Guan, Shijian Lu, Abdulmotaleb El Saddik, and Eric Xing. Efficient test-time adaptation of vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14162–14171, 2024
work page 2024
-
[29]
Realistic test-time adaptation of vision-language models
Maxime Zanella, Clément Fuchs, Christophe De Vleeschouwer, and Ismail Ben Ayed. Realistic test-time adaptation of vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25103–25112, 2025
work page 2025
-
[30]
Noisy test-time adaptation in vision-language models
Chentao Cao, Zhun Zhong, Zhanke Zhou, Tongliang Liu, Yang Liu, Kun Zhang, and Bo Han. Noisy test-time adaptation in vision-language models. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=iylpeTI0Ql
work page 2025
-
[31]
Dota: Distributional test-time adaptation of vision-language models
Zongbo Han, Jialong Yang, Guangyu Wang, Junfan Li, Qianli Xu, Mike Zheng Shou, and Changqing Zhang. Dota: Distributional test-time adaptation of vision-language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems. 11
-
[32]
Lightweight online adaption for time series foundation model forecasts
Thomas L Lee, William Toner, Rajkarn Singh, Artjom Joosen, and Martin Asenov. Lightweight online adaption for time series foundation model forecasts. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=gAxYbvoOQz
work page 2025
-
[33]
Lifan Zhao and Yanyan Shen. Proactive model adaptation against concept drift for online time series forecasting. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2020–2031, 2025. doi: 10.1145/3690624.3709210
-
[34]
Fast and slow streams for online time series forecast- ing without information leakage
Ying yee Ava Lau, Zhiwen Shao, and Dit-Yan Yeung. Fast and slow streams for online time series forecast- ing without information leakage. InThe Thirteenth International Conference on Learning Representations,
-
[35]
URLhttps://openreview.net/forum?id=I0n3EyogMi
-
[36]
Continual collaborative distillation for recommender system
Gyuseok Lee, SeongKu Kang, Wonbin Kweon, and Hwanjo Yu. Continual collaborative distillation for recommender system. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’24, page 1495–1505, New York, NY , USA, 2024. Association for Computing Machinery. ISBN 9798400704901. doi: 10.1145/3637528.3671924. URL https://do...
-
[37]
Mitigating distribution shifts in sequential recommendation: An invariance perspective
Yuxin Liao, Yonghui Yang, Min Hou, Le Wu, Hefei Xu, and Hao Liu. Mitigating distribution shifts in sequential recommendation: An invariance perspective. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1603–1613, 2025
work page 2025
-
[38]
Online drift detection with maximum concept discrepancy
Ke Wan, Yi Liang, and Susik Yoon. Online drift detection with maximum concept discrepancy. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2924–2935, 2024. doi: 10.1145/3637528.3672016
-
[39]
Inflora: Interference-free low-rank adaptation for continual learning
Yan-Shuo Liang and Wu-Jun Li. Inflora: Interference-free low-rank adaptation for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23638– 23647, 2024
work page 2024
-
[40]
Online-lora: Task-free online continual learning via low rank adaptation
Xiwen Wei, Guihong Li, and Radu Marculescu. Online-lora: Task-free online continual learning via low rank adaptation. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025
work page 2025
-
[41]
Gated integration of low-rank adaptation for continual learning of large language models
Yan-Shuo Liang, Jiarui Chen, and Wu-Jun Li. Gated integration of low-rank adaptation for continual learning of large language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems
-
[42]
Hierarchical knowledge prompt tuning for multi-task test-time adaptation
Qiang Zhang, Mengsheng Zhao, Jiawei Liu, Fanrui Zhang, Yongchao Xu, and Zheng-Jun Zha. Hierarchical knowledge prompt tuning for multi-task test-time adaptation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 30524–30533, 2025
work page 2025
-
[43]
Dpcore: Dynamic prompt coreset for continual test-time adaptation
Yunbei Zhang, Akshay Mehra, Shuaicheng Niu, and Jihun Hamm. Dpcore: Dynamic prompt coreset for continual test-time adaptation. InForty-second International Conference on Machine Learning
-
[44]
Forecasting the buzz: Enriching hashtag popularity prediction with llm reasoning
Yifei Xu, Jiaying Wu, Herun Wan, Yang Li, Zhen Hou, and Min-Yen Kan. Forecasting the buzz: Enriching hashtag popularity prediction with llm reasoning. InProceedings of the 34th ACM International Conference on Information and Knowledge Management, pages 5396–5400, 2025. doi: 10.1145/3746252.3760970
-
[45]
Mmsum: A dataset for multimodal summarization and thumbnail generation of videos
Jielin Qiu, Jiacheng Zhu, William Han, Aditesh Kumar, Karthik Mittal, Claire Jin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Ding Zhao, et al. Mmsum: A dataset for multimodal summarization and thumbnail generation of videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21909–21921, 2024
work page 2024
-
[46]
Jeongeun Lee, Youngjae Yu, and Dongha Lee. Hippo-video: Simulating watch histories with large language models for personalized video highlighting. InConference on Language Modeling, 2025. URL https://arxiv.org/abs/2507.16873. Published as a conference paper at COLM 2025
-
[47]
Towards automatic learning of procedures from web instructional videos
Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018
work page 2018
-
[48]
Howto100m: Learning a text-video embedding by watching hundred million narrated video clips
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019
work page 2019
-
[49]
Robust speech recognition via large-scale weak supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023. 12
work page 2023
-
[50]
Perplexity ai.https://www.perplexity.ai/, 2024
Perplexity AI. Perplexity ai.https://www.perplexity.ai/, 2024. Accessed: 2025-05-08
work page 2024
-
[51]
Introducing chatgpt search, 2024
OpenAI. Introducing chatgpt search, 2024. URL https://openai.com/index/ introducing-chatgpt-search/
work page 2024
-
[52]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[53]
Search-o1: Agentic search-enhanced large reasoning models
Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420–5438, 2025
work page 2025
-
[54]
Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anas- tasios Nikolas Angelopoulos, Trevor Darrell, Narges Norouzi, and Joseph E. Gonzalez. Search arena: Analyzing search-augmented LLMs. InThe Fourteenth International Conference on Learning Representa- tions, 2026. URLhttps://openreview.net/forum?id=MMGRlDnhtI
work page 2026
-
[55]
Zhengyang Liang, Yan Shu, Xiangrui Liu, Minghao Qin, Kaixin Liang, Paolo Rota, Nicu Sebe, Zheng Liu, and Lizi Liao. Video-browsecomp: Benchmarking agentic video research on open web.arXiv preprint arXiv:2512.23044, 2025
-
[56]
Agenticshop: Benchmarking agentic product curation for personalized web shopping
Sunghwan Kim, Ryang Heo, Yongsik Seo, Jinyoung Yeo, and Dongha Lee. Agenticshop: Benchmarking agentic product curation for personalized web shopping. InProceedings of the ACM Web Conference 2026, pages 2489–2500, 2026
work page 2026
-
[57]
xAI. grok-4.1-fast-reasoning, 2025. URL https://docs.x.ai/developers/models/ grok-4-1-fast-reasoning
work page 2025
-
[58]
OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. URL https://api. semanticscholar.org/CorpusID:257532815
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[59]
Critique-out-loud reward models,
Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D Chang, and Prithviraj Ammanabrolu. Critique- out-loud reward models.arXiv preprint arXiv:2408.11791, 2024
-
[60]
MM-RLHF: The next step forward in multimodal LLM alignment
YiFan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Tingting Gao, Zhang Zhang, Fan Yang, Di ZHANG, Liang Wang, and Rong Jin. MM-RLHF: The next step forward in multimodal LLM alignment. In Forty-second International Conference on Machine Learning, 2025. URL https...
work page 2025
-
[61]
Personalized reward modeling for text-to-image generation
Jeongeun Lee, Ryang Heo, and Dongha Lee. Personalized reward modeling for text-to-image generation. arXiv preprint arXiv:2511.19458, 2025
-
[62]
Yankai Yang, Yancheng Long, Hongyang Wei, Wei Chen, Tianke Zhang, Kaiyu Jiang, Haonan Fan, Changyi Liu, Jiankang Chen, Kaiyu Tang, et al. Joint reward modeling: Internalizing chain-of-thought for efficient visual reward models.arXiv preprint arXiv:2602.07533, 2026
-
[63]
Multimodal llms as customized reward models for text-to-image generation
Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, Branislav Kveton, Yufan Zhou, Jiuxiang Gu, Jian Chen, and Changyou Chen. Multimodal llms as customized reward models for text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19638–19648, 2025
work page 2025
-
[64]
Ren-Meng Cao, Xiao Fan Liu, and Xiao-Ke Xu. Why cannot long-term cascade be predicted? Exploring temporal dynamics in information diffusion processes. Royal Society Open Science, 8(9), 2021
-
[65]
Stephen L France, Mahyar Sharif Vaghefi, and Huimin Zhao. Characterizing viral videos: Methodology and applications. Electronic Commerce Research and Applications, 19:19–32, 2016
-
[66]
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022
-
[67]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
-
[68]
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations.
-
[69]
Rishabh Bhardwaj, George Polovets, and Monica Sunkara. Adaptation approaches for nearest neighbor language models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 1135–1146, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.73. URL https://aclanthology.org/2023.findings-acl.73/
-
[71]
Yifan Zhang, Xue Wang, Kexin Jin, Kun Yuan, Zhang Zhang, Liang Wang, Rong Jin, and Tieniu Tan. AdaNPC: Exploring non-parametric classifier for test-time adaptation. In International Conference on Machine Learning, pages 41647–41676. PMLR, 2023
-
[72]
Sisuo Lyu, Siru Zhong, Tiegang Chen, Weilin Ruan, Qingxiang Liu, Taiqiang Lv, Qingsong Wen, Raymond Chi-Wing Wong, and Yuxuan Liang. TS-Memory: Plug-and-play memory for time series foundation models. arXiv preprint arXiv:2602.11550, 2026
-
[73]
Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10658–10671, 2023
-
[74]
Yaqian Zhang, Bernhard Pfahringer, Eibe Frank, Albert Bifet, Nick Jin Sean Lim, and Alvin Jia. A simple but strong baseline for online continual learning: Repeated augmented rehearsal. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=bhvUOhnsgZ
-
[75]
Jieyong Kim, Ryang Heo, Yongsik Seo, SeongKu Kang, Jinyoung Yeo, and Dongha Lee. Self-consistent reasoning-based aspect-sentiment quad prediction with extract-then-assign strategy. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7295–7303, 2024
-
[76]
Yongsik Seo, Sungwon Song, Ryang Heo, Jieyong Kim, and Dongha Lee. Make compound sentences simple to analyze: Learning to split sentences for aspect-based sentiment analysis. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11171–11184, 2024
-
[77]
Sangam Lee, Ryang Heo, SeongKu Kang, and Dongha Lee. Imagine all the relevance: Scenario-profiled indexing with knowledge expansion for dense retrieval. In Second Conference on Language Modeling
-
[78]
Xianming Li and Jing Li. Angle-optimized text embeddings. arXiv preprint arXiv:2309.12871, 2023
-
[79]
Ryang Heo, Yongsik Seo, Junseong Lee, and Dongha Lee. Can large language models be effective online opinion miners? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 23108–23147, 2025.
A Limitations and Broader Impacts
Limitations. While our results confirm the value of open-web grounding and trend-aware adap...