Pith · machine review for the scientific record

arxiv: 2604.20858 · v1 · submitted 2026-03-01 · 💻 cs.IR · cs.AI

Recognition: 2 Lean theorem links

Mixture of Sequence: Theme-Aware Mixture-of-Experts for Long-Sequence Recommendation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 17:26 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords: sequential recommendation · mixture of experts · long sequences · session hopping · theme-aware routing · multi-scale fusion · click-through rate

The pith

A theme-aware mixture-of-experts model splits long user sequences into coherent theme-specific subsequences to filter interest shifts and improve recommendations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long user histories in sequential recommendation often mix stable short-term interests with abrupt shifts that add noise and hurt predictions. The paper shows this pattern, called session hopping, and introduces the Mixture of Sequence framework to handle it. MoS learns latent themes in the data, routes sessions into separate subsequences that stay within one theme, and then fuses outputs from experts that look at the full sequence, recent actions, and theme-specific details. If the approach works, models can maintain or raise accuracy on long sequences while using less computation than other mixture-of-experts methods.

Core claim

The Mixture of Sequence (MoS) framework is a model-agnostic MoE approach that extracts theme-specific and multi-scale subsequences from noisy raw user sequences. It employs a theme-aware routing mechanism to adaptively learn the latent themes of user sequences and organizes these sequences into multiple coherent subsequences. Each subsequence contains only sessions aligned with a specific theme. A multi-scale fusion mechanism then leverages three types of experts to capture global sequence characteristics, short-term user behaviors, and theme-specific semantic patterns.
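The routing step described above can be sketched concretely. This is a minimal illustration of codebook-based assignment, not the paper's implementation: the function names, the cosine-similarity measure, and hard top-1 routing are our assumptions.

```python
import math

def theme_route(session_embs, codebook):
    """Assign each session embedding to its most similar theme vector
    (hard top-1 routing), yielding theme-coherent subsequences."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    subsequences = {t: [] for t in range(len(codebook))}
    for i, s in enumerate(session_embs):
        # Route the whole session, not individual items, to one theme expert.
        best = max(range(len(codebook)), key=lambda t: cos(s, codebook[t]))
        subsequences[best].append(i)
    return subsequences

# Toy history: sessions 0-2 lie near theme 0, sessions 3-4 near theme 1.
sessions = [[5.1, 0.2, -0.3], [4.8, -0.1, 0.4], [5.3, 0.0, 0.1],
            [0.2, 5.0, 0.1], [-0.1, 4.7, 0.3]]
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # one theme vector per expert
print(theme_route(sessions, codebook))  # → {0: [0, 1, 2], 1: [3, 4]}
```

Each resulting subsequence contains only sessions aligned with one theme, which is the filtering property the core claim rests on.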

What carries the argument

Theme-aware routing that groups sessions into theme-coherent subsequences, paired with multi-scale expert fusion for global, short-term, and semantic views.

If this is right

  • Predictions rely only on sessions that match the active theme, reducing the impact of misleading shifts.
  • The model remains compatible with many existing sequential backbones because the routing and fusion layers sit on top.
  • Computational cost drops relative to standard MoE baselines because each expert processes shorter, cleaner subsequences.
  • State-of-the-art results hold across multiple datasets while requiring fewer total floating-point operations than MoE baselines.
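The three-expert fusion behind these points can be sketched in miniature. In MoS the experts are learned networks and the fusion weights are learned; the plain averages, scalar item scores, and fixed weights below are simplifying assumptions for illustration only.

```python
def multi_scale_fusion(seq, theme_subseq, recent_k=3, weights=(0.4, 0.3, 0.3)):
    """Fuse three expert views of one user history.

    seq:          full interaction sequence (scalar item scores for brevity)
    theme_subseq: indices of the theme-coherent subsequence for the active theme
    recent_k:     window size for the short-term expert
    """
    mean = lambda xs: sum(xs) / len(xs)
    global_view = mean(seq)                            # global expert: full sequence
    short_view = mean(seq[-recent_k:])                 # short-term expert: recent actions
    theme_view = mean([seq[i] for i in theme_subseq])  # theme expert: routed subsequence
    w_g, w_s, w_t = weights
    return w_g * global_view + w_s * short_view + w_t * theme_view

# History mixing two themes; the active theme covers items 0, 1, and 4.
score = multi_scale_fusion([1.0, 1.0, 9.0, 9.0, 1.0], theme_subseq=[0, 1, 4])
print(round(score, 2))  # → 3.88
```

The global view keeps whole-sequence signal, so theme-based filtering need not lose information; that is the stated role of the fusion mechanism.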

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing logic could extend to other sequential tasks that show periodic reappearance of patterns, such as next-basket prediction in retail.
  • Increasing the number of themes or adding a hierarchical scale might handle extremely long histories with more frequent shifts.
  • Combining the router with explicit user profile features could make theme discovery more stable on cold-start users.

Load-bearing premise

User interests remain stable inside short sessions and shift in patterns that a learned router can reliably separate into distinct latent themes without losing key signals.
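This premise has a simple operational reading: inside a session, consecutive items stay similar; at a session boundary, similarity drops sharply. A minimal sketch of that segmentation follows; the cosine measure and the threshold value are our assumptions, and the paper's own analysis (Figure 2) uses a full self-similarity matrix rather than this consecutive-pair shortcut.

```python
import math

def session_boundaries(item_embs, threshold=0.5):
    """Return indices where a new session begins, i.e. positions where
    similarity between consecutive item embeddings drops below threshold."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    return [i for i in range(1, len(item_embs))
            if cos(item_embs[i - 1], item_embs[i]) < threshold]

# Two stable interest runs separated by one abrupt shift.
history = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05],   # session A
           [0.0, 1.0], [0.1, 0.95]]               # session B
print(session_boundaries(history))  # → [3]
```

If interests did not stay stable within sessions, no threshold would separate cleanly and the router's subsequences would fragment, which is exactly the failure mode this premise rules out.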

What would settle it

A controlled test on standard long-sequence benchmarks where replacing the theme-aware router with random session grouping produces equal or higher accuracy and lower FLOPs than the full MoS model.
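That control arm is straightforward to specify. A sketch of the random-grouping baseline (interface and names are ours, not the paper's): it preserves the partition's cost profile while destroying any theme signal, so a remaining accuracy gap isolates the router's contribution.

```python
import random

def random_route(num_sessions, num_themes, seed=0):
    """Control condition: assign sessions to groups uniformly at random,
    ignoring content. Matching downstream accuracy at equal FLOPs would
    mean the theme-aware router carries no usable signal."""
    rng = random.Random(seed)
    groups = {t: [] for t in range(num_themes)}
    for i in range(num_sessions):
        groups[rng.randrange(num_themes)].append(i)
    return groups

groups = random_route(num_sessions=6, num_themes=2)
# Every session is placed exactly once, mirroring the learned router's
# output shape (and therefore its per-expert FLOPs).
assert sorted(i for g in groups.values() for i in g) == list(range(6))
print(groups)
```

Running both arms through the same downstream model with identical expert capacity makes the comparison a test of the routing signal alone.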

Figures

Figures reproduced from arXiv: 2604.20858 by Hanghang Tong, Huayu Li, Hyunsik Yoo, Kai Wang, Mengyue Hang, Rong Jin, Ruizhong Qiu, Shuo Chang, Ting-Wei Li, Weilin Cong, Wen-yen Chen, Xiao Lin, Xuying Ning, Yajuan Wang, Zhicheng Tang, Zhichen Zeng, Zhining Liu.

Figure 1: An illustration of user interests. The user under …
Figure 2: Heatmap of the self-similarity matrix for a representative user transaction history. Red indicates high similarity, blue indicates low similarity, and green lines denote session boundaries. Interests remain highly consistent and stable within a short temporal span (a session), as indicated by the red area of …
Figure 3: The pipeline of MoS. The left panel illustrates theme-aware routing, which assigns inputs to experts according to theme vectors in the codebook. The right panel shows multi-scale fusion, which models user behaviors by extracting subsequences at different granularities from the full sequence. Here, the i-th row of the codebook describes the theme feature associated with the i-th expert. Given an item embedd…
Figure 4: Trade-off between utility and efficiency.
Figure 5: Dispatch behavior of routers for the most popular …
Figure 6: Impact of α_I and α_W on model utility. … the interior region denotes full fusion of all experts. It is evident that full fusion consistently achieves the best results, while pairwise fusion outperforms using a single expert. This observation suggests that although the global expert alone already provides strong representations for recommendations, the three types of experts capture complementary behavi…
Figure 7: Scaling study of MoS. MoS consistently enhances AUC/GAUC with increased sequence length and number of experts.
Figure 8: Examples of session hopping. (a) Router dispatch behavior for the most popular item. (b) Router dispatch behavior for the least popular item.
Figure 9: Comparison between dispatch behavior of different MoE routers.
Original abstract

Sequential recommendation has rapidly advanced in click-through rate prediction due to its ability to model dynamic user interests. A key challenge, however, lies in modeling long sequences: users often exhibit significant interest shifts, introducing substantial irrelevant or misleading information. Our empirical analysis corroborates this challenge and uncovers a recurring behavioral pattern in long sequences (session hopping): user interests remain stable within short temporal spans (sessions) but shift drastically across sessions and may reappear after multiple sessions. To address this challenge, we propose the Mixture of Sequence (MoS) framework, a model-agnostic MoE approach that achieves accurate predictions by extracting theme-specific and multi-scale subsequences from noisy raw user sequences. First, MoS employs a theme-aware routing mechanism to adaptively learn the latent themes of user sequences and organizes these sequences into multiple coherent subsequences. Each subsequence contains only sessions aligned with a specific theme, thereby effectively filtering out irrelevant or even misleading information introduced by user interest shifts in session hopping. In addition, to alleviate potential information loss, we introduce a multi-scale fusion mechanism, which leverages three types of experts to capture global sequence characteristics, short-term user behaviors, and theme-specific semantic patterns. Together, these two mechanisms endow MoS with the ability to deliver accurate recommendations from multi-faceted and multi-scale perspectives. Experimental results demonstrate that MoS consistently achieves the SOTA performance while introducing fewer FLOPs compared with other MoE counterparts, providing strong evidence of its excellent balance between utility and efficiency. The code is available at https://github.com/xiaolin-cs/MoS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Mixture of Sequence (MoS) framework, a model-agnostic Mixture-of-Experts approach for long-sequence recommendation. It identifies a recurring 'session hopping' pattern in which user interests remain stable within short temporal sessions but shift drastically across sessions. MoS employs a theme-aware routing mechanism to adaptively learn latent themes and extract coherent theme-specific subsequences that filter irrelevant information, together with a multi-scale fusion mechanism that combines experts capturing global sequence characteristics, short-term behaviors, and theme-specific patterns. The paper claims that this yields state-of-the-art performance while using fewer FLOPs than other MoE baselines, with code released at https://github.com/xiaolin-cs/MoS.

Significance. If the empirical results hold, the work offers a practical, model-agnostic way to mitigate interest-shift noise in long user sequences by exploiting stable intra-session themes and multi-scale experts, potentially improving the accuracy-efficiency trade-off in sequential recommendation. The open-source code is a clear positive for reproducibility. The significance is limited, however, by the absence of concrete experimental support for the core assumptions and performance claims.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'MoS consistently achieves the SOTA performance while introducing fewer FLOPs compared with other MoE counterparts' is presented without any reference to datasets, baselines, metrics (AUC, NDCG, etc.), ablation tables, or statistical significance tests. This absence directly undermines verification of both the utility and efficiency assertions that constitute the paper's main contribution.
  2. [Abstract] Abstract: the theme-aware routing is asserted to 'adaptively learn the latent themes' and thereby filter misleading information from session hopping, yet no intra-/inter-session similarity statistics, theme coherence scores, or ablation isolating learned routing versus random routing is supplied. If the router fails to recover stable themes accurately, subsequence extraction becomes a source of information loss rather than a gain, rendering both the SOTA accuracy and the reported FLOPs reduction claims unsupported.
minor comments (1)
  1. [Abstract] The phrase 'session hopping' is introduced in italics without a formal definition or citation to prior session-based modeling literature; a brief operational definition would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the need for stronger empirical grounding of our claims. We address each comment below and have updated the manuscript to incorporate additional details and analyses.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'MoS consistently achieves the SOTA performance while introducing fewer FLOPs compared with other MoE counterparts' is presented without any reference to datasets, baselines, metrics (AUC, NDCG, etc.), ablation tables, or statistical significance tests. This absence directly undermines verification of both the utility and efficiency assertions that constitute the paper's main contribution.

    Authors: The abstract is intentionally concise as a high-level summary. The full manuscript provides the requested details in Section 5, with results across multiple public datasets, standard metrics including AUC and NDCG, comparisons to MoE and other baselines, ablation tables, FLOPs measurements, and statistical significance tests. To improve accessibility, we have revised the abstract to briefly reference the evaluation setting on public benchmarks with AUC/NDCG metrics and direct readers to the experimental section for full tables and analyses. revision: yes

  2. Referee: [Abstract] Abstract: the theme-aware routing is asserted to 'adaptively learn the latent themes' and thereby filter misleading information from session hopping, yet no intra-/inter-session similarity statistics, theme coherence scores, or ablation isolating learned routing versus random routing is supplied. If the router fails to recover stable themes accurately, subsequence extraction becomes a source of information loss rather than a gain, rendering both the SOTA accuracy and the reported FLOPs reduction claims unsupported.

    Authors: The manuscript already includes an empirical analysis of the session-hopping pattern and demonstrates the routing's value through end-to-end gains and component ablations. We agree that more targeted diagnostics would strengthen the argument. In the revision we have added intra-/inter-session similarity statistics, theme coherence scores, and a new ablation comparing the learned router against random routing; these results confirm that the router recovers coherent themes and that subsequence extraction improves rather than harms performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity; architectural proposal is self-contained

full rationale

The paper proposes the Mixture of Sequence (MoS) framework as a model-agnostic MoE architecture featuring a theme-aware routing mechanism to organize user sequences into theme-specific subsequences and a multi-scale fusion mechanism with three expert types. These components are motivated by an empirical observation of session-hopping patterns rather than being defined in terms of target performance metrics or reduced to fitted parameters called predictions. No equations or derivations in the provided text reduce the central claims to self-citations, ansatzes smuggled via prior work, or renaming of known results; the SOTA and efficiency claims rest on external experimental comparisons. The derivation chain is therefore independent and self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The framework rests on the empirical validity of the session-hopping pattern and on the assumption that latent themes can be learned and used to partition sequences without external supervision.

free parameters (2)
  • number of latent themes
    The routing mechanism adaptively learns latent themes; the number of themes functions as a model hyperparameter or learned capacity.
  • expert scale configuration
    Choice of three expert types (global, short-term, theme-specific) and their fusion weights are design decisions that affect the multi-scale component.
axioms (1)
  • domain assumption: User interests remain stable within short temporal sessions but shift drastically across sessions (session hopping).
    Invoked as the basis for the theme-aware routing; stated as corroborated by empirical analysis in the abstract.
invented entities (2)
  • theme-aware routing mechanism (no independent evidence)
    purpose: To learn latent themes and reorganize raw sequences into coherent theme-specific subsequences that filter irrelevant information.
    Newly introduced component central to noise removal.
  • multi-scale fusion mechanism (no independent evidence)
    purpose: To combine outputs from global, short-term, and theme-specific experts and thereby mitigate information loss from subsequence extraction.
    Newly introduced component to preserve predictive signals.

pith-pipeline@v0.9.0 · 5648 in / 1588 out tokens · 81173 ms · 2026-05-15T17:26:12.434969+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

164 extracted references · 164 canonical work pages · 9 internal anchors
