ShopX: A Foundation Model for Intent-to-Item Fulfillment in Agentic Shopping

Chenchi Zhang; Dunxian Huang; Haibin Chen; Hao Chang; Honghao Fu; Huimin Yi; Jiacheng Chen; Jingxuan Feng; Ju Huang; Junjun Zheng

arXiv: 2606.31693 · v1 · pith:LP6GK5MDnew · submitted 2026-06-30 · 💻 cs.IR · cs.AI · cs.CL


Jiacheng Chen, Tao Zhang, Manxi Lin, Dunxian Huang, Teng Shi, Honghao Fu, Mengyan Li, Xinming Zhang, Chenchi Zhang, Xuan Lu, Xiaoxiong Du, Haibin Chen, Shaolin Ye, Hao Chang, Xiaoqi Li, Shuwen Xiao, Yujin Yuan, Jingxuan Feng, Shaopan Xiong, Huimin Yi, Ju Huang, Qiu Shen, Ying Chen, Junjun Zheng, Xiangheng Kong, Yuning Jiang


Pith reviewed 2026-07-01 03:53 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL

keywords foundation model, agentic shopping, semantic IDs, intent fulfillment, LLM agents, item-space operations, multi-turn tasks


The pith

ShopX trains one foundation model to translate shopping intents directly into item-space actions using semantic IDs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that wrapping LLMs around separate search and ranking systems creates lossy hand-offs that hurt complex or ambiguous shopping requests. ShopX instead unifies intent parsing, planning, and execution inside a single model that operates natively on semantic IDs for retrieval, ranking, and bundling. A custom training recipe equips the base LLM to perform these multi-turn item-space tasks while keeping its general knowledge and instruction following intact. The resulting model-native framework is evaluated on real Taobao-derived single- and multi-turn tasks and shown to outperform tool-mediated agent baselines, especially on harder cases. This design removes the need for external retrieval interfaces between the agent and the item catalog.

Core claim

ShopX is a foundation model that combines intent understanding, execution planning, and flexible SID-native item-space operations inside one system, deployed through a model-facing action protocol and serving harness that supports context access, catalog grounding, and state management for agentic shopping workflows.
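
The abstract does not specify the action protocol, so the following is only a minimal sketch of what a model-facing protocol with catalog grounding and state management could look like. Every name here (`Harness`, `SessionState`, the `retrieve`/`rank`/`bundle` op names, the JSON action shape, the string SIDs, and the toy catalog) is a hypothetical illustration, not the paper's interface.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SessionState:
    """Per-conversation state kept by the harness between turns (hypothetical)."""
    candidates: List[str] = field(default_factory=list)
    history: List[dict] = field(default_factory=list)


class Harness:
    """Toy serving harness: parses model-emitted actions and grounds them in a catalog."""

    def __init__(self, catalog: Dict[str, dict]) -> None:
        self.catalog = catalog
        self.state = SessionState()
        self.handlers: Dict[str, Callable[[dict], dict]] = {
            "retrieve": self._retrieve,
            "rank": self._rank,
            "bundle": self._bundle,
        }

    def step(self, model_output: str) -> dict:
        """Execute one JSON action emitted by the model and return an observation."""
        action = json.loads(model_output)
        handler = self.handlers.get(action.get("op"))
        if handler is None:
            return {"ok": False, "error": f"unknown op: {action.get('op')}"}
        result = handler(action)
        self.state.history.append(action)  # state management: remember what ran
        return {"ok": True, **result}

    def _retrieve(self, action: dict) -> dict:
        # Catalog grounding: drop any SID that does not map to a real item.
        grounded = [s for s in action["sids"] if s in self.catalog]
        self.state.candidates = grounded
        return {"items": grounded}

    def _rank(self, action: dict) -> dict:
        # Accept the model's listwise order, restricted to current candidates.
        order = [s for s in action["order"] if s in self.state.candidates]
        self.state.candidates = order
        return {"items": order}

    def _bundle(self, action: dict) -> dict:
        picked = [s for s in action["sids"] if s in self.state.candidates]
        total = sum(self.catalog[s]["price"] for s in picked)
        return {"items": picked, "total_price": total}


# Made-up catalog keyed by made-up SIDs.
catalog = {
    "1-4-2": {"title": "trail running shoes", "price": 80.0},
    "1-9-3": {"title": "hiking socks", "price": 12.0},
}
harness = Harness(catalog)
r1 = harness.step('{"op": "retrieve", "sids": ["1-4-2", "1-9-3", "9-9-9"]}')
r2 = harness.step('{"op": "rank", "order": ["1-9-3", "1-4-2"]}')
r3 = harness.step('{"op": "bundle", "sids": ["1-4-2", "1-9-3"]}')
print(r1, r2, r3, sep="\n")
```

In this sketch the model composes operations by emitting a sequence of actions, and the harness only validates and records them; there is no separate search or ranking service between the two.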

What carries the argument

Semantically recoverable, LLM-operable semantic IDs (SIDs) that let the model compose operations such as SID beam-search retrieval, listwise ranking, and product bundling directly in item space.
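
To make "SID beam-search retrieval" concrete, here is a minimal sketch of beam search constrained to a trie of valid SIDs, a common pattern in generative retrieval. It assumes fixed-length integer SIDs and uses a hand-written `toy_scorer` in place of an LLM's next-token log-probabilities; the catalog, the function names, and the scoring values are all illustrative assumptions, not the paper's implementation.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

SID = Tuple[int, ...]

# Hypothetical toy catalog: each item is addressed by a 3-token SID.
CATALOG: Dict[SID, str] = {
    (1, 4, 2): "trail running shoes",
    (1, 4, 7): "road running shoes",
    (1, 9, 3): "hiking boots",
    (5, 2, 8): "espresso machine",
}


def build_trie(sids: Sequence[SID]) -> dict:
    """Nested-dict trie in which every root-to-leaf path is a valid SID."""
    root: dict = {}
    for sid in sids:
        node = root
        for tok in sid:
            node = node.setdefault(tok, {})
    return root


def toy_scorer(prefix: SID, allowed: List[int]) -> Dict[int, float]:
    """Stand-in for LLM log-probs, renormalized over the allowed tokens only."""
    prefs = {1: 2.0, 4: 1.5, 2: 1.0, 7: 0.5, 9: 0.2}  # made-up preferences
    logits = {t: prefs.get(t, 0.0) for t in allowed}
    norm = math.log(sum(math.exp(v) for v in logits.values()))
    return {t: v - norm for t, v in logits.items()}


def sid_beam_search(
    trie: dict,
    score_fn: Callable[[SID, List[int]], Dict[int, float]],
    beam_width: int,
    sid_len: int,
) -> List[Tuple[SID, float]]:
    """Return up to beam_width complete SIDs, best cumulative log-prob first."""
    beams = [((), 0.0, trie)]  # (prefix, cumulative log-prob, trie node)
    for _ in range(sid_len):
        candidates = []
        for prefix, logp, node in beams:
            allowed = list(node.keys())  # only tokens that extend a real SID
            if not allowed:
                continue
            scores = score_fn(prefix, allowed)
            for tok in allowed:
                candidates.append((prefix + (tok,), logp + scores[tok], node[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return [(prefix, logp) for prefix, logp, _ in beams]


results = sid_beam_search(build_trie(list(CATALOG)), toy_scorer, beam_width=2, sid_len=3)
for sid, logp in results:
    print(sid, CATALOG[sid], round(logp, 3))
```

Because decoding never leaves the trie, every returned SID resolves to a catalog item by construction, which is the property that lets retrieval happen inside the model rather than behind an external search interface.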

If this is right

Model-native fulfillment reduces lossy hand-offs between agent orchestration and item-space execution.
Performance gains appear most clearly on complex or ambiguous multi-turn requests.
The same model can retain general LLM capabilities while gaining specialized item-space skills.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same unification pattern could be tested in other agentic domains that currently route language to external tools.
If SIDs prove stable across catalogs, the approach might reduce dependence on separate indexing pipelines in production systems.

Load-bearing premise

Semantically recoverable SIDs can be designed and a training recipe exists that equips a general LLM for flexible multi-turn item-space fulfillment while retaining its original knowledge and instruction-following abilities.

What would settle it

A controlled comparison on the same Taobao-derived tasks where the ShopX model-native system shows no gain or worse performance than tool-mediated baselines on complex or ambiguous requests.
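
One way to run such a comparison is a paired bootstrap over per-task scores, sketched below. The score lists are synthetic placeholders invented for illustration (they are not results from the paper), and `paired_bootstrap` is a hypothetical helper name.

```python
import random
from typing import List, Tuple


def paired_bootstrap(
    a: List[float], b: List[float], n_resamples: int = 2000, seed: int = 0
) -> Tuple[float, float, float]:
    """Mean paired difference (a - b) with a 95% percentile bootstrap interval."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and the same length")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Synthetic per-task success flags on the same eight tasks (placeholders only).
model_native = [1, 1, 0, 1, 1, 0, 1, 1]
tool_mediated = [1, 0, 0, 1, 0, 0, 1, 1]

mean_diff, ci_lo, ci_hi = paired_bootstrap(model_native, tool_mediated)
print(f"mean difference = {mean_diff:.3f}, 95% CI = [{ci_lo:.3f}, {ci_hi:.3f}]")
```

If the interval for the complex or ambiguous subset includes zero or lies below it, the comparison would count against the paper's claim; an interval clearly above zero would support it.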

Original abstract

The wave of AI-native applications is moving shopping beyond page- and feed-based browsing toward intent-driven experiences orchestrated by LLM agents. A common design wraps an LLM around existing search and recommendation pipelines, forcing complex intents through low-bandwidth retrieval or ranking interfaces and leaving a gap between language understanding and item-space fulfillment. Generative recommendation gives LLMs a direct item-space interface through semantic IDs (SIDs), but existing models mainly generate candidates for retrieval rather than translate flexible intents into item-space outcomes. We propose ShopX to address this bottleneck by unifying intent understanding, execution planning, and flexible SID-native item-space operations into a single foundation model. We deploy ShopX in agentic shopping workflows through a model-native item-fulfillment framework with a serving harness that defines a model-facing action protocol and exposes support surfaces for context access, catalog grounding, and state management. Within this framework, ShopX plans and composes SID-based item-space operations such as SID beam-search retrieval, listwise ranking, or product bundling. This model-centric design reduces lossy hand-offs between agent orchestration and item-space execution. To build ShopX, we design semantically recoverable, LLM-operable SIDs and a training recipe that equips a general LLM for flexible multi-turn item-space fulfillment while retaining the knowledge and instruction-following abilities needed by a shopping agent. We evaluate the ShopX framework against tool-mediated agentic systems on single- and multi-turn fulfillment tasks derived from anonymized Taobao production logs, showing that model-native fulfillment improves overall framework behavior, especially on complex or ambiguous requests.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ShopX tries to replace tool calls with direct SID operations inside one LLM for shopping agents, but the abstract supplies no training details or numbers, so the retention and improvement claims remain unverified.


The core idea is to train one model that does intent parsing, planning, and item-space actions like SID beam search or bundling without handing off to separate retrieval tools. That removes one layer of interface loss in agentic flows, which is a concrete engineering target for e-commerce agents. The framework description with its action protocol and support surfaces for catalog grounding is a reasonable way to make the model usable in multi-turn sessions.

The abstract shows no equations or loss terms, so there is nothing to check for circularity. The evaluation is described only at the level of "improves overall framework behavior, especially on complex or ambiguous requests" on tasks derived from Taobao logs, with no metrics, baselines, or error bars supplied.

The weakest part is the training recipe. The text asserts that semantically recoverable SIDs plus some mixture let a general LLM keep its knowledge and instruction following while gaining flexible fulfillment skills, but gives no data mixture, retention ablations, or capability checks. If that recipe does not exist or if it degrades the base model, the claimed advantage over tool-mediated systems disappears. That matches the stress-test concern exactly.

This is aimed at people already building agentic shopping systems who want to experiment with model-native item operations. A reader could pull the framework sketch and the SID concept for their own work, but the current version is too thin on evidence to stand on its own.

I would send it to peer review only if the full manuscript adds the missing training description, ablations, and quantitative results; otherwise it stays at the level of a system sketch.

Referee Report

2 major / 1 minor

Summary. The paper proposes ShopX, a foundation model unifying intent understanding, execution planning, and flexible SID-native item-space operations (e.g., beam-search retrieval, listwise ranking, bundling) for agentic shopping. It introduces a model-native fulfillment framework with a serving harness for action protocols, context access, and state management. The authors design semantically recoverable SIDs and a training recipe claimed to equip a general LLM for multi-turn fulfillment while retaining knowledge and instruction-following. Evaluation on single- and multi-turn tasks from anonymized Taobao production logs is asserted to show that model-native fulfillment outperforms tool-mediated agentic systems, especially on complex or ambiguous requests.

Significance. If the training recipe and empirical results hold, the work could meaningfully advance agentic e-commerce by closing the gap between LLM reasoning and direct item-space manipulation, reducing lossy tool interfaces. The focus on preserving general LLM capabilities during domain adaptation is a positive framing that aligns with practical deployment needs.

major comments (2)

[Abstract] The central claim that a training recipe equips a general LLM for SID-based multi-turn operations (beam-search, ranking, bundling) while retaining instruction-following is unsupported by any description of SID construction, loss terms, data mixtures, or retention ablations; this detail is load-bearing for the weakest assumption identified in the stress-test.
[Abstract] The assertion that 'model-native fulfillment improves overall framework behavior, especially on complex or ambiguous requests' is presented without any metrics, baselines, error bars, task definitions, or statistical details from the Taobao log evaluation, preventing verification of the claimed superiority over tool-mediated systems.

minor comments (1)

[Abstract] The acronym SID is used without an initial expansion or reference to prior generative-recommendation literature, which reduces accessibility for readers outside that sub-area.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments on the abstract below, noting that the full manuscript provides the supporting technical details while the abstract serves as a concise summary. We propose targeted revisions to improve clarity.


Referee: [Abstract] The central claim that a training recipe equips a general LLM for SID-based multi-turn operations (beam-search, ranking, bundling) while retaining instruction-following is unsupported by any description of SID construction, loss terms, data mixtures, or retention ablations; this detail is load-bearing for the weakest assumption identified in the stress-test.

Authors: The manuscript body contains these descriptions: SID construction and semantic recoverability are detailed in Section 3.1, the training recipe (including loss terms, data mixtures, and the multi-turn fulfillment protocol) appears in Section 4, and retention ablations for instruction-following and general capabilities are reported in Section 5.3. The abstract summarizes rather than replicates these sections. We will revise the abstract to include one additional sentence providing high-level pointers to these elements.
Revision: partial

Referee: [Abstract] The assertion that 'model-native fulfillment improves overall framework behavior, especially on complex or ambiguous requests' is presented without any metrics, baselines, error bars, task definitions, or statistical details from the Taobao log evaluation, preventing verification of the claimed superiority over tool-mediated systems.

Authors: Section 5 defines the single- and multi-turn tasks derived from the Taobao logs, specifies the tool-mediated baselines, reports metrics with error bars, and includes statistical comparisons. The abstract condenses the outcome. We will revise the abstract to include a short quantitative summary (e.g., relative improvement ranges on complex queries) while remaining within length constraints.
Revision: partial

Circularity Check

0 steps flagged

No derivation chain or load-bearing equations present; claims rest on architectural description and external evaluation


The provided abstract and manuscript description contain no equations, parameter fits, uniqueness theorems, or self-citations that reduce any prediction or result to the inputs by construction. The central claims concern the existence of a training recipe and SID design, supported by evaluation on Taobao production logs rather than internal self-reference. This is a standard self-contained systems paper with no circular steps matching the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information on free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5909 in / 1005 out tokens · 43858 ms · 2026-07-01T03:53:07.658758+00:00 · methodology


Reference graph

Works this paper leans on

74 extracted references · 34 canonical work pages · 14 internal anchors

[1]

Introducing gpt-5

OpenAI. Introducing gpt-5. https://openai.com/index/introducing-gpt-5/ , August 2025. Accessed: 2026-06-08

2025
[2]

Introducing claude 4

Anthropic. Introducing claude 4. https://www.anthropic.com/news/claude-4 , May 2025. Accessed: 2026-06-08

2025
[3]

Gemini 3: Introducing the latest gemini ai model from google

Google. Gemini 3: Introducing the latest gemini ai model from google. https://blog.g oogle/products/gemini/gemini-3/, November 2025. Accessed: 2026-06-08

2025
[4]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Introducing codex

OpenAI. Introducing codex. https://openai.com/index/introducing-codex/ , May 2025. Accessed: 2026-06-08

2025
[6]

Claude code

Anthropic. Claude code. https://docs.anthropic.com/en/docs/claude-code/ge tting-started, 2025. Accessed: 2026-06-08

2025
[7]

Openclaw.https://docs.openclaw.ai/, 2026

OpenClaw. Openclaw.https://docs.openclaw.ai/, 2026. Accessed: 2026-06-08

2026
[8]

Amazon’s rufus ai assistant now available to all u.s

Amazon. Amazon’s rufus ai assistant now available to all u.s. customers. https://ww w.aboutamazon.com/news/retail/how-to-use-amazon-rufus , 2024. Accessed: 2026-05-11

2024
[9]

Powering product discovery in chatgpt

OpenAI. Powering product discovery in chatgpt. https://openai.com/index/power ing-product-discovery-in-chatgpt/, March 2026. Accessed: 2026-05-11

2026
[10]

Buy it in chatgpt: Instant checkout and the agentic commerce protocol

OpenAI. Buy it in chatgpt: Instant checkout and the agentic commerce protocol. https: //openai.com/index/buy-it-in-chatgpt/, September 2025. Accessed: 2026-05-11

2025
[11]

千问与淘宝打通用ai也能“逛淘宝”了

People’s Daily Online. 千问与淘宝打通用ai也能“逛淘宝”了. https://finance.people .com.cn/n1/2026/0511/c1004-40717594.html, May 2026. Accessed: 2026-06-01

2026
[12]

Shop with ai mode, use ai to buy and try clothes on yourself virtually

Google. Shop with ai mode, use ai to buy and try clothes on yourself virtually. https: //blog.google/products-and-platforms/products/shopping/google-shopp ing-ai-mode-virtual-try-on-update/, May 2025. Accessed: 2026-05-11

2025
[13]

淘宝内测 ai搜索，上线两款新品

Sina Finance. 淘宝内测 ai搜索，上线两款新品 . https://finance.sina.com.cn/ tech/it/2025-09-12/doc-infqfwzc7426261.shtml , September 2025. Accessed: 2026-06-01

2025
[14]

小红书站内开测ai搜索功能，并已上线独立 app

3E Life. 小红书站内开测ai搜索功能，并已上线独立 app. https://www.3elife.net/A rt/internet/202501/05/100181.html, January 2025. Accessed: 2026-06-01

2025
[15]

A survey on large language models for recommendation.arXiv preprint arXiv:2305.19860, 2023

Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, Hui Xiong, and Enhong Chen. A survey on large language models for recommendation.arXiv preprint arXiv:2305.19860, 2023

work page arXiv 2023
[16]

A survey of large language model empowered agents for recommenda- tion and search.arXiv preprint arXiv:2503.05659, 2025

Yizhe Zhang et al. A survey of large language model empowered agents for recommenda- tion and search.arXiv preprint arXiv:2503.05659, 2025

work page arXiv 2025
[17]

arXiv preprint arXiv:2303.14524 , year=

Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang. Chat- rec: Towards interactive and explainable llms-augmented recommender system.arXiv preprint arXiv:2303.14524, 2023. 35

work page arXiv 2023
[18]

Recommender ai agent: Integrating large language models for interactive recommendations.arXiv preprint arXiv:2308.16505, 2023

Xu Huang, Jianxun Lian, Yuxuan Lei, Jing Yao, Defu Lian, and Xing Xie. Recommender ai agent: Integrating large language models for interactive recommendations.arXiv preprint arXiv:2308.16505, 2023

work page arXiv 2023
[19]

Recmind: Large language model powered agent for recommendation

Yancheng Wang, Ziyan Jiang, Zheng Chen, Fan Yang, Yingxue Zhou, Eunah Cho, Xing Fan, Xiaojiang Huang, Yanbin Lu, and Yingzhen Yang. Recmind: Large language model powered agent for recommendation. InFindings of the Association for Computational Linguistics: NAACL 2024, 2024

2024
[20]

Recai: Lever- aging large language models for next-generation recommender systems.arXiv preprint arXiv:2403.06465, 2024

Jianxun Lian, Yuxuan Lei, Xu Huang, Jing Yao, Wei Xu, and Xing Xie. Recai: Lever- aging large language models for next-generation recommender systems.arXiv preprint arXiv:2403.06465, 2024

work page arXiv 2024
[21]

Retrieval-augmented conversational recommendation with prompt-based semi-structured natural language state tracking

Sara Kemper, Justin Cui, Kai Dicarlantonio, Kathy Lin, Danjie Tang, Anton Korikov, and Scott Sanner. Retrieval-augmented conversational recommendation with prompt-based semi-structured natural language state tracking. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024

2024
[22]

Recgpt technical report.arXiv preprint arXiv:2507.22879, 2025

Chao Yi et al. Recgpt technical report.arXiv preprint arXiv:2507.22879, 2025

work page arXiv 2025
[23]

Recgpt-v2 technical report.arXiv preprint arXiv:2512.14503, 2025

Chao Yi et al. Recgpt-v2 technical report.arXiv preprint arXiv:2512.14503, 2025

work page arXiv 2025
[24]

Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q

Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H. Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, and Maheswaran Sathiamoorthy. Recommender systems with generative retrieval. In Advances in Neural Information Processing Systems, 2023

2023
[25]

Adapting large language models by integrating collaborative semantics for recommendation

Bowen Zheng, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, Ming Chen, and Ji-Rong Wen. Adapting large language models by integrating collaborative semantics for recommendation. In2024 IEEE 40th International Conference on Data Engineering, pages 1435–1448, 2024

2024
[26]

Learnable item tokenization for generative recommendation

Wenjie Wang, Honghui Bao, Xinyu Lin, Jizhi Zhang, Yongqi Li, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. Learnable item tokenization for generative recommendation. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 2400–2409, 2024

2024
[27]

Generative recommender with end-to-end learnable item tokenization

Enze Liu, Bowen Zheng, Cheng Ling, Lantao Hu, Han Li, and Wayne Xin Zhao. Generative recommender with end-to-end learnable item tokenization. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 729–739, 2025

2025
[28]

FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets

Kairui Fu, Tao Zhang, Shuwen Xiao, Ziyang Wang, Xinming Zhang, Chenchi Zhang, Yuliang Yan, Junjun Zheng, Yu Li, Zhihong Chen, Jian Wu, Xiangheng Kong, Shengyu Zhang, Kun Kuang, Yuning Jiang, and Bo Zheng. Forge: Forming semantic identifiers for generative retrieval in industrial datasets.arXiv preprint arXiv:2509.20904, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Onerec technical report.arXiv preprint arXiv:2506.13695, 2025

Guorui Zhou et al. Onerec technical report.arXiv preprint arXiv:2506.13695, 2025

work page arXiv 2025
[30]

OneRec-V2 Technical Report

Guorui Zhou et al. Onerec-v2 technical report.arXiv preprint arXiv:2508.20900, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Openonerec technical report.arXiv preprint arXiv:2512.24762, 2025a

Guorui Zhou, Honghui Bao, et al. Openonerec technical report.arXiv preprint arXiv:2512.24762, 2025. 36

work page arXiv 2025
[32]

Neural re-ranking in multi-stage recommender systems: A review

Zhanyu Liu, Shiyao Wang, Xingmei Wang, Rongzhou Zhang, Jiaxin Deng, Honghui Bao, Jinghao Zhang, Wuchao Li, Pengfei Zheng, Xiangyu Wu, Yifei Hu, Qigen Hu, Xinchen Luo, Lejian Ren, Zixing Zhang, Qianqian Wang, Kuo Cai, Yunfan Wu, Hongtao Cheng, Zexuan Cheng, Lu Ren, Huanjie Wang, Yi Su, Ruiming Tang, Kun Gai, and Guorui Zhou. Onerec- think: In-text reasonin...

work page arXiv 2025
[33]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment.arXiv preprint arXiv:2303.16634, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P . Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. InAdvances in Neural Information Processing Systems, 2023

2023
[35]

Prometheus 2: An open source language model specialized in evaluating other language models

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

2024
[36]

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean M. Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. InInternational Conference on Learning Representations, 2026. URL https://openreview.n et/forum?id=c1bTcrDmt4

2026
[37]

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 2024

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 2024

2024
[38]

Cmmlu: Measuring massive multitask language understanding in chinese

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese. InFindings of the Association for Computational Linguistics: ACL 2024, 2024

2024
[39]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Le, Ed H

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V . Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them.Transactions on Machine Learning Research, 2023

2023
[41]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024

2024
[42]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. InAdvances in Neural Information Processing Systems, 2021. 37

2021
[43]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[44]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc V . Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[45]

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. InAdvances in Neural Information Processing Systems, 2023

2023
[46]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning, 2021

2021
[48]

BERT: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019

2019
[49]

BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. InProceedings of the 40th International Conference on Machine Learning, 2023

2023
[50]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[51]

Towards noise contrastive estimation with soft targets for conditional models.arXiv preprint arXiv:2404.14076, 2024

Johannes Hugger and Virginie Uhlmann. Towards noise contrastive estimation with soft targets for conditional models.arXiv preprint arXiv:2404.14076, 2024

work page arXiv 2024
[52]

ColBERT: Efficient and effective passage search via contextualized late interaction over BERT

Omar Khattab and Matei Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–48, 2020

2020
[53]

MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

Zilin Xiao, Qi Ma, Mengting Gu, Chun-cheng Jason Chen, Xintao Chen, Vicente Ordonez, and Vijai Mohan. MetaEmbed: Scaling multimodal retrieval at test-time with flexible late interaction.arXiv preprint arXiv:2509.18095, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Neural discrete representa- tion learning

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representa- tion learning. InAdvances in Neural Information Processing Systems, 2017

2017
[55]

Autoregressive image generation using residual quantization

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2022
[56]

Improved Residual Vector Quantization for High-dimensional Approximate Nearest Neighbor Search

Shicong Liu, Hongtao Lu, and Junru Shao. Improved residual vector quantization for high-dimensional approximate nearest neighbor search.arXiv preprint arXiv:1509.05195, 2015. 38

work page internal anchor Pith review Pith/arXiv arXiv 2015
[57]

OneSearch: A preliminary exploration of the unified end-to-end generative framework for e-commerce search.arXiv preprint arXiv:2509.03236, 2025

Ben Chen, Xian Guo, Siyuan Wang, Zihan Liang, Yue Lv, Yufei Ma, Xinlong Xiao, Bowen Xue, Xuxin Zhang, Ying Yang, Huangyu Dai, Xing Xu, Tong Zhao, Mingcan Peng, XiaoYang Zheng, Cong Zhang, Qihang Zhao, Yuqing Ding, Chenyi Lei, Wenwu Ou, and Han Li. OneSearch: A preliminary exploration of the unified end-to-end generative framework for e-commerce search.arX...

work page arXiv 2025
[58]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

MiMo-V2-Flash Technical Report

Xiaomi MiMo Team. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[60]

Deepseek-v4 technical report

DeepSeek-AI. Deepseek-v4 technical report. https://huggingface.co/deepseek-a i/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf, 2026. Accessed: 2026-06-03

2026
[61]

Nemotron-cascade 2: Post-training llms with cascade rl and multi-domain on-policy distillation.arXiv preprint arXiv:2603.19220, 2026

NVIDIA Nemotron Team. Nemotron-cascade 2: Post-training llms with cascade rl and multi-domain on-policy distillation.arXiv preprint arXiv:2603.19220, 2026

work page arXiv 2026
[62]

Reinforcement learning optimization for large- scale learning: An efficient and user-friendly scaling library.arXiv preprint arXiv:2506.06122, 2025

Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, et al. Reinforcement learning optimization for large- scale learning: An efficient and user-friendly scaling library.arXiv preprint arXiv:2506.06122, 2025

work page arXiv 2025
[63]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[64]

Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is ChatGPT good at search? Investigating large language models as re-ranking agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.

[65]

Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. Zero-shot listwise document reranking with a large language model. arXiv preprint arXiv:2305.02156, 2023.

[66]

Jingtong Gao, Bo Chen, Weiwen Liu, Xiangyang Li, Yichao Wang, Wanyu Wang, Huifeng Guo, Ruiming Tang, and Xiangyu Zhao. LLM4Rerank: LLM-based auto-reranking framework for recommendations. arXiv preprint arXiv:2406.12433, 2024.

[67]

Chuang Li, Yang Deng, Hengchang Hu, See-Kiong Ng, Min-Yen Kan, and Haizhou Li. CARE: Contextual adaptation of recommenders for LLM-based conversational recommendation. arXiv preprint arXiv:2508.13889, 2025.

[68]

Teng Shi, Chenglei Shen, Weijie Yu, Shen Nie, Chongxuan Li, Xiao Zhang, Ming He, Yan Han, and Jun Xu. LLaDA-Rec: Discrete diffusion for parallel semantic ID generation in generative recommendation. arXiv preprint arXiv:2511.06254, 2025.

[69]

Yidan Wang, Zhaochun Ren, Weiwei Sun, Jiyuan Yang, Zhixiang Liang, Xin Chen, Ruobing Xie, Su Yan, Xu Zhang, Pengjie Ren, Zhumin Chen, and Xin Xin. Content-based collaborative generation for recommender systems. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 2420–2430, 2024.

[70]

Haohao Qu, Wenqi Fan, Zihuai Zhao, and Qing Li. TokenRec: Learning to tokenize ID for LLM-based generative recommendations. IEEE Transactions on Knowledge and Data Engineering, 2025.

[71]

Xinyu Lin, Haihan Shi, Wenjie Wang, Fuli Feng, Qifan Wang, See-Kiong Ng, and Tat-Seng Chua. Order-agnostic identifier for large language model-based generative recommendation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1923–1933, 2025.

[72]

Bowen Zheng, Hongyu Lu, Yu Chen, Wayne Xin Zhao, and Ji-Rong Wen. Universal item tokenization for transferable generative recommendation. arXiv preprint arXiv:2504.04405, 2025.

[73]

Kaiyuan Li, Rui Xiang, Yong Bai, Yongxiang Tang, Yanhua Cheng, Xialong Liu, Peng Jiang, and Kun Gai. BBQRec: Behavior-bind quantization for multi-modal sequential recommendation. arXiv preprint arXiv:2504.06636, 2025.

[74]

Yupeng Hou, Jiacheng Li, Ashley Shin, Jinsung Jeon, Abhishek Santhanam, Wei Shao, Kaveh Hassani, Ning Yao, and Julian McAuley. Generating long semantic IDs in parallel for recommendation. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 956–966, 2025.

Appendix A. Author List

Core Contributors Jiacheng Chen∗ ...
