pith. sign in

arxiv: 2606.07454 · v1 · pith:Z464W5RCnew · submitted 2026-06-05 · 💻 cs.IR · cs.AI

PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams

Pith reviewed 2026-06-27 20:26 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords scientific paper recommendationlongitudinal evaluationuser profilinginterest driftrecommendation systemsinformation retrievaladaptive rankingdaily streams
0
0 comments X

The pith

PaperFlow organizes daily scientific paper recommendation into profiling, recommending, and adapting stages that handle interest shifts over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that scientific paper recommendation should be treated as a longitudinal daily process rather than static ranking over fixed sets. It proposes PaperFlow as a framework with three coupled stages: building inspectable user profiles from cold-start evidence, ranking each day's paper stream under a display budget using multiple signals, and updating the profile from distinct feedback types while modeling drift across days. A shared longitudinal benchmark with 24 simulated users, 50 daily streams, and over 1,200 episodes is introduced to enable consistent evaluation. Experiments show PaperFlow outperforming five baselines on oracle ranking, alignment with simulated selections, and blind human judgments.

Core claim

PaperFlow structures scientific paper recommendation as three coupled stages—Profiling from heterogeneous cold-start evidence into a structured scholarly profile, Recommending via multi-signal aggregation on date-specific streams under a fixed budget, and Adapting from semantically distinct feedback while modeling interest drift—and demonstrates stronger oracle-based ranking, behavioral alignment with simulated reading, and blind human-evaluation scores than five baselines on a fixed longitudinal user-day benchmark containing 1,200 episodes and 20,727 papers.

What carries the argument

The PaperFlow framework's three coupled stages (Profiling, Recommending, Adapting) linked through a longitudinal user-day benchmark that fixes users, dates, candidate pools, visible inputs, and hidden relevance labels under a shared temporal boundary.

If this is right

  • Daily paper streams can be ranked more effectively when user profiles are maintained and updated across days rather than recomputed from scratch.
  • Interest drift becomes modellable when feedback is treated as semantically distinct signals instead of uniform relevance labels.
  • Evaluation of recommendation systems gains consistency when benchmarks fix the temporal information boundary and separate visible inputs from hidden labels.
  • Automatic metrics gain credibility when aligned with blind human expert judgments on the same episodes.
  • Cold-start scenarios in scientific reading become addressable through structured, inspectable profiles built from heterogeneous evidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged structure could apply to other evolving recommendation settings such as daily news or research tool suggestions where user state changes over short time scales.
  • The fixed-benchmark design could serve as a template for testing adaptive systems in domains outside papers, provided the simulated users are replaced by real traces.
  • If the simulated drift patterns diverge from actual researcher behavior, the adaptation stage would need recalibration using real longitudinal logs.
  • Integration with live arXiv or PubMed feeds would allow direct measurement of whether the multi-signal ranking reduces information overload on working days.

Load-bearing premise

The 24 simulated research users, their feedback signals, and the hidden simulated relevance labels accurately reflect real scientific reading behavior, interest drift, and relevance judgments.

What would settle it

Deploy PaperFlow to real researchers for multiple consecutive days, record their actual reading selections and explicit feedback, then compare the system's rankings and adaptations against those observed choices and reported satisfaction.

Figures

Figures reproduced from arXiv: 2606.07454 by Bihui Yu, Cheng Tan, Fuqiang Wang, Jiaohao Fu, Jie Dong, Jingxuan Wei, Siyuan Li, Song Tan, Xinglong Xu, Zheng Guo, Zheng Sun.

Figure 1
Figure 1. Figure 1: Motivation and paradigm comparison between traditional scientific paper recommendation [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the PaperFlow dynamic personalized scientific reading loop. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Construction pipeline of the PaperFlow benchmark. Daily paper streams and simulated [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Automatic–human metric alignment (ModelAutoScore vs. ModelHumanScore) [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Interest-drift analysis. Cell color is normalized within each metric, with darker green [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Representative PaperFlow case study. reproducibility and controlled temporal comparison, but the simulated labels should not be interpreted as human-annotated truth or as deployment logs from real users. The current benchmark is mainly derived from arXiv daily paper streams, so coverage may differ across fields and publication venues. Future work should connect the protocol with larger-scale human evaluati… view at source ↗
Figure 8
Figure 8. Figure 8: The 24 simulated researcher profiles. mark construction. The system must combine long-term directions, short-term behavior, and explicit rules to decide whether a paper is worth showing to the user. B.2 The 24 Simulated Researchers [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Overview of PaperFlow evaluation metrics. The figure groups metrics by oracle-based [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Case-study selection criteria. The figure summarizes five representative case types for in [PITH_FULL_IMAGE:figures/full_fig_p045_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Successful recommendation case for an NLP user. The episode shows that the user’s [PITH_FULL_IMAGE:figures/full_fig_p046_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Interest-drift case from GUI/Web agents to multimodal reasoning. Repeated evidence [PITH_FULL_IMAGE:figures/full_fig_p047_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Behavior-consistency case with a high-SelectedNDCG list. Selected papers appear near the front of the Top-20, showing that behavior-based agreement can complement static oracle relevance in longitudinal recommendation evaluation. 48 [PITH_FULL_IMAGE:figures/full_fig_p048_13.png] view at source ↗
read the original abstract

Scientific paper recommendation is typically evaluated as static ranking over a fixed candidate set, yet real scientific reading unfolds as a daily, longitudinal process in which interests shift and feedback accumulates. We introduce PaperFlow, a framework that organizes it into three coupled stages: Profiling, which constructs and maintains a structured, inspectable scholarly profile from heterogeneous cold-start evidence; Recommending, which ranks each date-specific paper stream through multi-signal aggregation under a fixed display budget; and Adapting, which updates user state from semantically distinct feedback signals and models interest drift across days. We further define a longitudinal user-day benchmark that fixes users, dates, candidate pools, visible inputs, and hidden simulated relevance labels under a shared temporal information boundary. The benchmark contains 24 simulated research users, 50 daily paper streams, 1,200 user-day episodes, 20,727 unique papers, and 497,448 episode-paper records. We additionally specify a blind human-evaluation protocol to validate alignment between automatic metrics and expert judgments. Experiments against five scientific recommendation baselines show that PaperFlow achieves the strongest oracle-based ranking, the highest behavioral alignment with simulated reading selections, and the best blind human-evaluation score.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents PaperFlow, a framework for recommending scientific papers in a daily longitudinal setting. It consists of three stages: Profiling to build inspectable scholarly profiles from cold-start evidence, Recommending to rank date-specific paper streams under a display budget using multi-signal aggregation, and Adapting to update user state from feedback and model interest drift. The authors introduce a longitudinal user-day benchmark with 24 simulated users, 50 daily streams, 1,200 episodes, and 497,448 records, and report that PaperFlow outperforms five baselines in oracle ranking, behavioral alignment, and blind human evaluation.

Significance. This work has potential significance in moving paper recommendation evaluation from static to dynamic, longitudinal settings that better reflect real scientific reading. The provision of a fixed, reproducible benchmark with explicit temporal boundaries and the use of blind human evaluation are positive aspects that could facilitate future comparisons. Credit is due for the detailed specification of the benchmark parameters (24 users, 50 streams, shared information boundary) and the multi-stage framework. However, the significance is tempered by the exclusive reliance on simulated data for all quantitative claims.

major comments (1)
  1. [Longitudinal user-day benchmark definition and evaluation] The performance claims (strongest oracle-based ranking, highest behavioral alignment with simulated reading selections, best blind human-evaluation score) are obtained exclusively on the longitudinal user-day benchmark constructed from 24 simulated research users, hidden simulated relevance labels, and an interest-drift model (as described in the benchmark definition and evaluation sections). No validation of the simulation procedure against real scholarly reading patterns, interest drift, or external relevance judgments is provided, which is load-bearing for the central empirical contribution and the reported superiority over the five baselines.
minor comments (1)
  1. [Abstract] The abstract could more explicitly state the limitations of the simulated benchmark to set appropriate expectations.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and for acknowledging the reproducibility of the benchmark and the value of longitudinal evaluation. We address the single major comment below.

read point-by-point responses
  1. Referee: [Longitudinal user-day benchmark definition and evaluation] The performance claims (strongest oracle-based ranking, highest behavioral alignment with simulated reading selections, best blind human-evaluation score) are obtained exclusively on the longitudinal user-day benchmark constructed from 24 simulated research users, hidden simulated relevance labels, and an interest-drift model (as described in the benchmark definition and evaluation sections). No validation of the simulation procedure against real scholarly reading patterns, interest drift, or external relevance judgments is provided, which is load-bearing for the central empirical contribution and the reported superiority over the five baselines.

    Authors: We agree the simulation is central to the quantitative results. The benchmark was deliberately constructed as a fully specified, reproducible simulation to enable controlled longitudinal experiments (interest drift, accumulating feedback, date-specific streams) that cannot be ethically or practically obtained at this scale with real users. All five baselines are evaluated under identical conditions, so relative improvements remain valid within the benchmark. The blind human evaluation supplies an external, non-simulated check: domain experts preferred PaperFlow outputs without knowledge of the underlying labels or drift model. We will add a dedicated Limitations subsection that explicitly discusses the simulation assumptions, their potential divergence from real reading patterns, and the absence of direct validation against external user logs. No new experiments are required for this clarification. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external baseline comparisons within a self-defined but non-reductive benchmark

full rationale

The paper introduces PaperFlow as a three-stage framework and defines its own longitudinal user-day benchmark with 24 simulated users and hidden relevance labels. It then reports performance against five independent baselines plus a blind human-evaluation protocol. No derivation, equation, or result reduces to its own inputs by construction; the benchmark is an evaluation substrate rather than a self-referential loop, and no self-citations or fitted parameters are invoked as load-bearing premises. This is the standard case of a proposal evaluated on a custom testbed.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework and benchmark are described at a high level without technical internals.

pith-pipeline@v0.9.1-grok · 5768 in / 1141 out tokens · 22356 ms · 2026-06-27T20:26:18.892421+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

71 extracted references · 18 canonical work pages

  1. [1]

    Hwang, Varsha Kishore, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Daniel S

    Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D’Arcy, David Wadden, Matt Latzke, Jenna Sparks, Jena D. Hwang, Varsha Kishore, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Daniel S. Weld, Doug Downey, Wen-Tau Yih, P...

  2. [2]

    Agentic Feedback Loop Modeling Improves Recommendation and User Simulation

    Shihao Cai, Jizhi Zhang, Keqin Bao, Chongming Gao, Qifan Wang, Fuli Feng, and Xiangnan He. Agentic Feedback Loop Modeling Improves Recommendation and User Simulation. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2235–2244,

  3. [3]

    URLhttps://doi.org/10.1145/3726302.3729893

    doi: 10.1145/3726302.3729893. URLhttps://doi.org/10.1145/3726302.3729893

  4. [4]

    A Multi-Agent Conversational Recommender System.arXiv preprint arXiv:2402.01135, 2024

    Jiabao Fang, Shen Gao, Pengjie Ren, Xiuying Chen, Suzan Verberne, and Zhaochun Ren. A Multi-Agent Conversational Recommender System.arXiv preprint arXiv:2402.01135, 2024. URL https://arxiv.org/abs/ 2402.01135

  5. [5]

    Scholar Inbox: Personalized Paper Recommendations for Scientists

    Markus Flicke, Glenn Angrabeit, Madhav Iyengar, Vitalii Protsenko, Illia Shakun, Jovan Cicvaric, Bora Kargi, 11 PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams Haoyu He, Lukas Schuler, Lewin Scholz, Kavyanjali Agnihotri, Yong Cao, and Andreas Geiger. Scholar Inbox: Personalized Paper Recommendations for Scientists. InACL 2025 S...

  6. [6]

    2023 , isbn =

    Raymond Fok, Hita Kambhamettu, Luca Soldaini, Jonathan Bragg, Kyle Lo, Andrew Head, Marti A. Hearst, and Daniel S. Weld. Scim: Intelligent Skimming Support for Scientific Papers. InIUI 2023, 2023. doi: 10.1145/3581641.3584034. URLhttps://doi.org/10.1145/3581641.3584034

  7. [7]

    Chat-REC: Towards Interactive and Explainable LLMs-Augmented Recommender System.arXiv preprint arXiv:2303.14524, 2023

    Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang. Chat-REC: Towards Interactive and Explainable LLMs-Augmented Recommender System.arXiv preprint arXiv:2303.14524, 2023. URLhttps://arxiv.org/abs/2303.14524

  8. [8]

    PaSa: An LLM Agent for Comprehensive Academic Paper Search

    Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, and Weinan E. PaSa: An LLM Agent for Comprehensive Academic Paper Search. InACL 2025 Long Papers, 2025. URL https: //aclanthology.org/2025.acl-long.572/

  9. [9]

    Towards Next-Generation Recommender Systems: A Benchmark for Personalized Recommendation Assistant with LLMs (RecBench+)

    Jiani Huang, Shijie Wang, Liang bo Ning, Wenqi Fan, Shuaiqiang Wang, Dawei Yin, and Qing Li. Towards Next-Generation Recommender Systems: A Benchmark for Personalized Recommendation Assistant with LLMs (RecBench+). InWSDM 2026, 2026. doi: 10.1145/3773966.3777954. URL https://doi.org/10.1145/ 3773966.3777954

  10. [10]

    Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations.ACM Transactions on Recommender Systems, 2025

    Xu Huang, Jianxun Lian, Yuxuan Lei, Jing Yao, Defu Lian, and Xing Xie. Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations.ACM Transactions on Recommender Systems, 2025. doi: 10.1145/3731446. URLhttps://doi.org/10.1145/3731446

  11. [11]

    Kang, Sherry Tongshuang Wu, Joseph Chee Chang, and Aniket Kittur

    Hyeonsu B. Kang, Sherry Tongshuang Wu, Joseph Chee Chang, and Aniket Kittur. Synergi: A Mixed- Initiative System for Scholarly Synthesis and Sensemaking. InUIST 2023, 2023. doi: 10.1145/3586183.3606759. URLhttps://doi.org/10.1145/3586183.3606759

  12. [12]

    Kang, Matt Latzke, Juho Kim, Jonathan Bragg, Joseph Chee Chang, and Pao Siangliulue

    Yoonjoo Lee, Hyeonsu B. Kang, Matt Latzke, Juho Kim, Jonathan Bragg, Joseph Chee Chang, and Pao Siangliulue. PaperWeaver: Enriching Topical Paper Alerts by Contextualizing Recommended Papers with User-collected Papers. InCHI 2024, 2024. doi: 10.1145/3613904.3642196. URL https://doi.org/10.1145/ 3613904.3642196

  13. [13]

    Incorporating External Knowledge and Goal Guidance for LLM-based Conversational Recommender Systems

    Chuang Li, Yang Deng, Hengchang Hu, Min-Yen Kan, and Haizhou Li. Incorporating External Knowledge and Goal Guidance for LLM-based Conversational Recommender Systems. InFindings of NAACL 2025, 2025. URLhttps://aclanthology.org/2025.findings-naacl.17/

  14. [14]

    Towards Personalized Deep Research: Benchmarks and Evaluations (PDR-Bench)

    Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, Minghao Liu, Yuchen Eleanor Jiang, Ningyu Zhang, and Wangchunshu Zhou. Towards Personalized Deep Research: Benchmarks and Evaluations (PDR-Bench). InICLR 2026 Poster, 2026. URL https://openreview.net/forum?id=51LIRzF53v

  15. [15]

    acl-long.805/

    Guanyu Lin, Tao Feng, Pengrui Han, Ge Liu, and Jiaxuan You. Arxiv Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 122–130, 2024. doi: 10.18653/v1/2024. emnlp-demo.13. URLhttps://aclanthology.org/202...

  16. [16]

    Towards Interest Drift-driven User Representation Learning in Sequential Recommendation (IDURL)

    Xiaolin Lin, Weike Pan, and Zhong Ming. Towards Interest Drift-driven User Representation Learning in Sequential Recommendation (IDURL). InSIGIR 2025, 2025. doi: 10.1145/3726302.3730099. URL https: //doi.org/10.1145/3726302.3730099

  17. [17]

    Knowledge-Enhanced Recommendation with User-Centric Subgraph Network.arXiv preprint arXiv:2403.14377, 2024

    Guangyi Liu, Quanming Yao, Yongqi Zhang, and Lei Chen. Knowledge-Enhanced Recommendation with User-Centric Subgraph Network.arXiv preprint arXiv:2403.14377, 2024. URLhttps://arxiv.org/abs/2403. 14377

  18. [19]

    URLhttps://arxiv.org/abs/2503.01189

  19. [20]

    Kyle Lo, Joseph Chee Chang, Andrew Head, Jonathan Bragg, Amy X. Zhang, Cassidy Trier, Chloe Anastasi- ades, Tal August, Russell Authur, Danielle Bragg, Erin Bransom, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Yen-Sung Chen, Evie Yu-Yen Cheng, Yvonne Chou, Doug Downey, Rob Evans, Raymond 12 PaperFlow: Profiling, Recommending, and Adapting Acros...

  20. [21]

    Programming with data: Test-driven data engineering for self-improving llms from raw corpora

    Chenkai Pan, Xinglong Xu, Yuhang Xu, Yujun Wu, Siyuan Li, Jintao Chen, Conghui He, Jingxuan Wei, and Cheng Tan. Programming with data: Test-driven data engineering for self-improving llms from raw corpora. arXiv preprint arXiv:2604.24819, 2026

  21. [22]

    Zhang, and Daniel S Weld

    Napol Rachatasumrit, Jonathan Bragg, Amy X. Zhang, and Daniel S Weld. CiteRead: Integrating Localized Citation Contexts into Scientific Paper Reading. InIUI 2022, 2022. doi: 10.1145/3490099.3511162. URL https://doi.org/10.1145/3490099.3511162

  22. [23]

    Rahmani, Xi Wang, Xiao Fu, and Aldo Lipani

    Jerome Ramos, Hossen A. Rahmani, Xi Wang, Xiao Fu, and Aldo Lipani. Transparent and Scrutable Recommendations Using Natural Language User Profiles. InACL 2024 Long Papers, 2024. URL https: //aclanthology.org/2024.acl-long.753/

  23. [24]

    AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

    Yu Shang, Peijie Liu, Yuwei Yan, Zijing Wu, Leheng Sheng, Yuanqing Yu, Chumeng Jiang, An Zhang, Fengli Xu, Yu Wang, Min Zhang, and Yong Li. AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems. InNeurIPS 2025 Datasets and Benchmarks Track Spotlight, 2025. URL https:// openreview.net/forum?id=fm77rDf9JS

  24. [25]

    Large Language Models are Learnable Planners for Long-Term Recommendation

    Wentao Shi, Xiangnan He, Yang Zhang, Chongming Gao, Xinyue Li, Jizhi Zhang, Qifan Wang, and Fuli Feng. Large Language Models are Learnable Planners for Long-Term Recommendation. InSIGIR 2024, 2024. doi: 10.1145/3626772.3657683. URLhttps://doi.org/10.1145/3626772.3657683

  25. [26]

    RAH! RecSys– Assistant–Human: A Human-Centered Recommendation Framework With LLM Agents.IEEE Transactions on Computational Social Systems, 11(5):6759–6770, 2024

    Yubo Shu, Haonan Zhang, Hansu Gu, Peng Zhang, Tun Lu, Dongsheng Li, and Ning Gu. RAH! RecSys– Assistant–Human: A Human-Centered Recommendation Framework With LLM Agents.IEEE Transactions on Computational Social Systems, 11(5):6759–6770, 2024. doi: 10.1109/TCSS.2024.3404039. URL https://doi. org/10.1109/TCSS.2024.3404039

  26. [27]

    Skarlinski, Sam Cox, Jon M

    Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, and Andrew D. White. Language agents achieve superhuman synthesis of scientific knowledge.arXiv preprint arXiv:2409.13740, 2024. URL https://arxiv.org/abs/2409. 13740

  27. [28]

    Ruotong Wang, Xinyi Zhou, Lin Qiu, Joseph Chee Chang, Jonathan Bragg, and Amy X. Zhang. PaperPing: A Socially-aware AI Agent that Recommends Academic Papers to Research Group Chats with Contextualized Explanations. InCSCW Companion 2025, 2025. doi: 10.1145/3715070.3757230. URL https://doi.org/10. 1145/3715070.3757230

  28. [29]

    Ruotong Wang, Xinyi Zhou, Lin Qiu, Joseph Chee Chang, Jonathan Bragg, and Amy X. Zhang. Social- RAG: Retrieving from Group Interactions to Socially Ground AI Generation. InCHI 2025, 2025. doi: 10.1145/3706598.3713749. URLhttps://doi.org/10.1145/3706598.3713749

  29. [30]

    Discourse-Aware Scientific Paper Recommendation via QA-Style Sum- marization and Multi-Level Contrastive Learning.arXiv preprint arXiv:2511.03330, 2025

    Shenghua Wang and Zhen Yin. Discourse-Aware Scientific Paper Recommendation via QA-Style Sum- marization and Multi-Level Contrastive Learning.arXiv preprint arXiv:2511.03330, 2025. URL https: //arxiv.org/abs/2511.03330

  30. [31]

    SurveyAgent: A Conversational System for Personalized and Efficient Research Survey.arXiv preprint arXiv:2404.06364, 2024

    Xintao Wang, Jiangjie Chen, Nianqi Li, Lida Chen, Xinfeng Yuan, Wei Shi, Xuyang Ge, Rui Xu, and Yanghua Xiao. SurveyAgent: A Conversational System for Personalized and Efficient Research Survey.arXiv preprint arXiv:2404.06364, 2024. URLhttps://arxiv.org/abs/2404.06364

  31. [32]

    RecMind: Large Language Model Powered Agent For Recommendation

    Yancheng Wang, Ziyan Jiang, Zheng Chen, Fan Yang, Yingxue Zhou, Eunah Cho, Xing Fan, Xiaojiang Huang, Yanbin Lu, and Yingzhen Yang. RecMind: Large Language Model Powered Agent For Recommendation. In Findings of NAACL 2024, 2024. URLhttps://aclanthology.org/2024.findings-naacl.271/. 13 PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams

  32. [33]

    Pager: Bridging the semantic-execution gap in point-precise geometric gui control.arXiv preprint arXiv:2605.15963, 2026

    Jingxuan Wei, Xi Bai, Shan Liu, Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, et al. Pager: Bridging the semantic-execution gap in point-precise geometric gui control.arXiv preprint arXiv:2605.15963, 2026

  33. [34]

    The trinity of consistency as a defining principle for general world models.arXiv preprint arXiv:2602.23152, 2026

    Jingxuan Wei, Siyuan Li, Yuhang Xu, Zheng Sun, Junjie Jiang, Hexuan Jin, Caijun Jia, Honghao He, Xinglong Xu, Chang Yu, et al. The trinity of consistency as a defining principle for general world models.arXiv preprint arXiv:2602.23152, 2026

  34. [35]

    Research Paper Recommender System by Considering Users’ Information Seeking Behaviors

    Zhelin Xu, Shuhei Yamamoto, and Hideo Joho. Research Paper Recommender System by Considering Users’ Information Seeking Behaviors. InJSAI 2025, 2025. URL https://www.jstage.jst.go.jp/article/pjsai/ JSAI2025/0/JSAI2025_1D4OS24b02/_article/-char/en

  35. [36]

    Embracing Plasticity: Balancing Stability and Plasticity in Continual Recommender Systems

    Hyunsik Yoo, SeongKu Kang, Ruizhong Qiu, Charlie Xu, Fei Wang, and Hanghang Tong. Embracing Plasticity: Balancing Stability and Plasticity in Continual Recommender Systems. InSIGIR 2025, 2025. doi: 10.1145/3726302.3729964. URLhttps://doi.org/10.1145/3726302.3729964

  36. [37]

    Paperfit: Vision-in-the-loop typesetting optimization for scientific documents.arXiv preprint arXiv:2605.10341, 2026

    Bihui Yu, Xinglong Xu, Junjie Jiang, Jiabei Cheng, Caijun Jia, Siyuan Li, Conghui He, Jingxuan Wei, and Cheng Tan. Paperfit: Vision-in-the-loop typesetting optimization for scientific documents.arXiv preprint arXiv:2605.10341, 2026

  37. [38]

    AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems

    Junjie Zhang, Yupeng Hou, Ruobing Xie, Wenqi Sun, Julian McAuley, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems. InThe Web Conference 2024 / WWW 2024, 2024. doi: 10.1145/3589334.3645537. URL https: //doi.org/10.1145/3589334.3645537

  38. [39]

    Let Me Do It For You: Towards LLM Empowered Recommendation via Tool Learning

    Yuyue Zhao, Jiancan Wu, Xiang Wang, Wei Tang, Dingxian Wang, and Maarten de Rijke. Let Me Do It For You: Towards LLM Empowered Recommendation via Tool Learning. InSIGIR 2024, 2024. doi: 10.1145/3626772.3657828. URLhttps://doi.org/10.1145/3626772.3657828. 14 PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams Appendix Appendix Table...

  39. [40]

    Retrieve the candidate papers for the corresponding date from the fixed paper pool

  40. [41]

    Compute the base relevance score from the user profile and paper content

  41. [42]

    Check must-read author, institution, and keyword matches

  42. [43]

    Apply interest-drift weighting to new interest topics and downweight suppressed old topics

  43. [44]

    Apply recent reading-signal weights to short-term topics that repeatedly appear and are selected by the user

  44. [45]

    Generate the Top-20 recommendation list

  45. [46]

    Simulate user selections based on oracle labels, system rank, system label, and drift-topic matches

  46. [47]

    always show papers by this author,

    Generate reading reports for selected papers and record token usage, episode metadata, and the drift timeline. This pipeline connects recommendation, selection, profile update, and reading assistance into a closed loop. The recommendation stage determines what the user sees. The selection stage simulates what the user actually clicks or reads. The profile...

  47. [48]

    Do not invent methods, experiments, or conclusions that do not appear in the paper information

  48. [49]

    Do not exaggerate it as directly solving the user's problem

    If the paper is only topically related, explicitly say it is topically related. Do not exaggerate it as directly solving the user's problem

  49. [50]

    If a must-read author, institution, or keyword is matched, state the reason

  50. [51]

    If the paper is related to an interest-drift direction, state the corresponding new interest topic

  51. [52]

    user_profile

    Output only one or two sentences. User: { "user_profile": { "core_directions": {...}, "must_read": {...}, "drift_state": "...", "reading_signal_topics": [...] }, "paper": { "title": "...", "abstract": "...", "authors": [...], "institutions": [...], "keywords": [...] }, "signals": { "system_label": "...", "score": 0.0, "matched_topics": [...], "matched_mus...

  52. [53]

    ProfileMatch: whether the list matches the user's profile and current research interests

  53. [54]

    RankingQuality: whether stronger or more useful papers appear earlier in the Top-20 list

  54. [55]

    DecisionUsefulness: whether the list helps the user decide what to read, skim, or skip

  55. [56]

    ProfileMatch

    DiversityFocusBalance: whether the list balances focused relevance with useful breadth. Return JSON: { "ProfileMatch": 1, "RankingQuality": 1, "DecisionUsefulness": 1, "DiversityFocusBalance": 1, "comments": "brief rationale" } Model-comparison report prompt. Model-Comparison Report Prompt You will see a researcher profile, a recommended paper, the paper ...

  56. [57]

    The candidate paper pool remains fixed

  57. [58]

    User profiles remain fixed

  58. [59]

    The Top-20 display budget remains fixed

  59. [60]

    The embedding model remains fixed

  60. [61]

    Evaluation metrics remain fixed

  61. [62]

    42 PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams

    Each model writes to a separate output directory. 42 PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams

  62. [63]

    8.TokenCost is reported only as an efficiency metric and is not included in ModelAutoScore or ModelHumanScore

    Token usage is recorded by date. 8.TokenCost is reported only as an efficiency metric and is not included in ModelAutoScore or ModelHumanScore. If a model encounters JSON parsing errors, API connection errors, or token-usage accounting anoma- lies, these should be recorded separately and should not be directly compared with models that completed full runs...

  63. [64]

    Several Top-5 papers are both must_read and relevant

  64. [65]

    The user selects five papers from the list

  65. [66]

    The ranking remains strong beyond static oracle labels. Top-20 Overview: Density of Relevant/Must-read Papers Rank Relevant/Must_read Rank Relevant/Must_read relevant/must_read relevant(not must_read) not relevant User Long-term Interests Mechanism: The success comes from aligned signals. The user's long-termdirections are NLP, LLMs, and information extra...

  66. [67]

    The user initially concentrates on GUI agents and web automation

  67. [68]

    The observing state accumulates repeated new-topic hits

  68. [69]

    The signal eventually crosses the drift threshold. New-Topic Hits over Time New-topic hits (per observation) Drift threshold threshold(30) Time (observations) Update Mechanism: After the threshold is crossed, the system locks multimodal reasoning as the anchor topic. It then increases the ranking weight of papers related to multimodal interaction and reas...

  69. [70]

    Some papers may be only weak_relevant under the oracle

  70. [71]

    The user may still click or read them because similar topicswere selected recently

  71. [72]

    Interpretation: The case illustrates that behavioral consistency is not identical to static relevance

    If the system ranks these papers early, SelectedNDCG@20 increases. Interpretation: The case illustrates that behavioral consistency is not identical to static relevance. Selected papers are not always those with the highest oracle gain, but they may better match the user's recent reading trajectory. Longitudinal scientific recommendation should therefore ...