
arxiv: 2602.22913 · v2 · submitted 2026-02-26 · 💻 cs.IR · cs.LG

SIGMA: A Semantic-Grounded Instruction-Driven Generative Multi-Task Recommender at AliExpress

Pith reviewed 2026-05-15 19:17 UTC · model grok-4.3

classification 💻 cs.IR cs.LG
keywords generative recommendation · instruction-driven · multi-task recommender · semantic grounding · hybrid tokenization · adaptive fusion · recommender systems

The pith

SIGMA grounds items in a unified semantic-collaborative space and follows instructions to handle many recommendation tasks at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SIGMA as a generative recommender that shifts from narrow next-item prediction to serving diverse real-world tasks through natural language instructions. Items are first placed into one latent space that holds both broad meaning and interaction patterns from user data. Hybrid tokenization supports both detailed modeling and quick generation, while a large multi-task dataset trains the system to respond correctly to different instructions. A three-step generation routine with adaptive fusion then tunes the output distribution for the accuracy or diversity level each task requires. Experiments on offline data and live A/B tests at AliExpress are used to show that these steps produce effective results across tasks.

Core claim

SIGMA grounds item entities in a unified latent space capturing both general semantics and collaborative signals. Building on this foundation, it introduces hybrid item tokenization for precise modeling and efficient generation, constructs a large-scale multi-task supervised fine-tuning dataset to enable instruction-following across recommendation demands, and applies a three-step item generation procedure integrated with adaptive probabilistic fusion to calibrate output distributions according to task-specific needs for accuracy and diversity.

What carries the argument

The unified latent space that merges semantic and collaborative signals, together with hybrid tokenization and the adaptive probabilistic fusion step inside the three-step generation procedure.
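A contrastive (InfoNCE-style) objective is one common way to pull two views of the same item into a shared space. The sketch below is an illustration under that assumption, not the paper's stated loss: the function name `info_nce`, the temperature value, and the toy 2-d embeddings are all hypothetical. It shows only that the loss is small when each item's semantic vector sits closest to its own collaborative vector and large when the pairing is scrambled.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Assumes a non-zero vector; a zero vector would divide by zero.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(semantic, collaborative, temperature=0.1):
    """Mean InfoNCE loss; row i of each view describes the same item."""
    sem = [normalize(v) for v in semantic]
    col = [normalize(v) for v in collaborative]
    total = 0.0
    for i, s in enumerate(sem):
        # Cosine similarity of item i's semantic view to every collaborative view.
        logits = [dot(s, c) / temperature for c in col]
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the matching (positive) view.
        total += log_z - logits[i]
    return total / len(sem)


# Toy data (made up): two items, two views each.
semantic = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each item close to its own semantic view
swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairing scrambled

loss_aligned = info_nce(semantic, aligned)
loss_swapped = info_nce(semantic, swapped)
print(loss_aligned < loss_swapped)
```

In a real system the two views would come from learned encoders and the loss would be minimized by gradient descent; here the embeddings are fixed so the behaviour of the objective is easy to inspect.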

If this is right

  • One model can address multiple distinct recommendation tasks without separate training pipelines for each.
  • Output calibration via adaptive fusion allows the same generation process to favor accuracy on some tasks and diversity on others.
  • Instruction following reduces the need to redesign the system when business requirements change.
  • Offline gains translate to measurable lifts in live user metrics during A/B testing.
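The calibration idea in the second bullet can be sketched as a weighted mix of the model's next-item distribution with a prior. This is a minimal sketch under stated assumptions: the page does not give the paper's fusion formula, so the linear mixture, the uniform prior, the task names, and the weights in `TASK_WEIGHT` are all hypothetical. It shows how a single knob can move the same distribution toward accuracy (sharper) or diversity (flatter).

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def fuse(model_probs, prior_probs, weight):
    """Linear mixture: weight=1 keeps the model output, weight=0 keeps the prior."""
    return [weight * p + (1.0 - weight) * q
            for p, q in zip(model_probs, prior_probs)]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical per-task weights: higher favours accuracy, lower favours diversity.
TASK_WEIGHT = {"accuracy_oriented_task": 0.9, "diversity_oriented_task": 0.4}

model_probs = softmax([4.0, 2.0, 1.0, 0.5])   # made-up scores for four candidate items
uniform_prior = [0.25, 0.25, 0.25, 0.25]

accurate = fuse(model_probs, uniform_prior, TASK_WEIGHT["accuracy_oriented_task"])
diverse = fuse(model_probs, uniform_prior, TASK_WEIGHT["diversity_oriented_task"])

print(entropy(accurate) < entropy(diverse))
```

Mixing toward a uniform prior keeps the top-ranked item the same while raising the entropy of the distribution, so sampling from `diverse` spreads probability over more items than sampling from `accurate`.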

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Platforms could reuse the same grounding and fusion layers when adding new tasks simply by extending the instruction dataset.
  • The approach points toward recommenders that treat task variation as a prompting problem rather than an architecture problem.
  • Similar grounding techniques might let generative models incorporate new data modalities without full retraining.

Load-bearing premise

That grounding items in one latent space holding both semantics and collaborative signals, plus hybrid tokenization and adaptive fusion, will produce accurate and diverse recommendations when the system is driven by instructions.

What would settle it

Compare SIGMA's outputs against task-specific baselines on a held-out set of multi-task queries, using the same instruction prompts for both, and check whether accuracy and diversity metrics improve, hold steady, or degrade.
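Such a comparison needs one accuracy metric and one diversity metric computed on the same recommendation lists. The sketch below uses Recall@K and a simple category-coverage ratio; these metric choices, the function names, and the toy item data are assumptions for illustration, not the paper's evaluation protocol.

```python
def recall_at_k(recommended, relevant, k):
    """Fraction of relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)


def category_coverage(recommended, item_category, k):
    """Distinct categories in the top-k divided by the number of top-k items."""
    top = recommended[:k]
    if not top:
        return 0.0
    return len({item_category[item] for item in top}) / len(top)


# Toy data (made up): four items in three categories, two relevant items.
item_category = {"a": "shoes", "b": "shoes", "c": "bags", "d": "watches"}
relevant = {"a", "d"}
candidate = ["a", "c", "d"]   # list from the system under test
baseline = ["a", "b", "c"]    # list from a task-specific baseline

for name, recs in (("candidate", candidate), ("baseline", baseline)):
    print(name,
          recall_at_k(recs, relevant, 3),
          round(category_coverage(recs, item_category, 3), 3))
```

Reporting both numbers side by side per task makes it visible when a gain in one metric is bought with a loss in the other.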

Figures

Figures reproduced from arXiv: 2602.22913 by Bin Chen, Bing Wang, Chao Zhang, Huaikuan Yi, Lei Kou, Lei Shen, Xiaoyi Zeng, Yang Yu, Yayu Cao.

Figure 1: The overall framework of SIGMA. Given these factors, we propose SIGMA, a Semantic-Grounded Instruction-Driven Generative Multi-Task Recommender deployed at AliExpress. Specifically, to mitigate the domain shift and lack of collaborative signals for general LLMs [16, 19], we first propose a multi-view alignment framework that grounds natural language, world knowledge, and item entities within a unified late…
Figure 2: Performance variations of different methods on …
Figure 3: The online serving architecture for SIGMA.
Original abstract

With the rapid evolution of Large Language Models (LLMs), generative recommendation is gradually reshaping the paradigm of recommender systems. However, most existing methods remain confined to the interaction-driven next-item prediction paradigm, struggling to keep pace with the latest evolving trends or address the diverse recommendation tasks along with business-specific requirements in real-world scenarios. To this end, we present SIGMA, a Semantic-Grounded Instruction-Driven Generative Multi-Task Recommender deployed at AliExpress. Specifically, we first ground item entities in a unified latent space capturing both general semantics and collaborative signals. Building upon this, we introduce a hybrid item tokenization method for both precise modeling and efficient generation. Moreover, we construct a large-scale multi-task supervised fine-tuning dataset empowering SIGMA to fulfill various recommendation demands via instruction-following. Finally, we design a three-step item generation procedure integrated with an adaptive probabilistic fusion mechanism to calibrate the output distributions based on task-specific requirements for recommendation accuracy and diversity. Extensive offline experiments and online A/B tests demonstrate the effectiveness of SIGMA across various real-world recommendation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SIGMA, a generative multi-task recommender deployed at AliExpress. It grounds item entities in a unified latent space combining semantics and collaborative signals, employs hybrid item tokenization for precise modeling and efficient generation, constructs a large-scale multi-task SFT dataset for instruction-following across recommendation tasks, and uses a three-step generation procedure with adaptive probabilistic fusion to balance accuracy and diversity. Effectiveness is claimed via offline experiments and online A/B tests on real-world tasks.

Significance. If the empirical results hold, the work provides a practical demonstration of scaling LLM-based generative recommendation to production multi-task settings in e-commerce. The combination of semantic grounding, instruction-driven multi-task SFT, and task-calibrated fusion addresses limitations of interaction-only next-item paradigms, offering a deployable framework that supports diverse business requirements while maintaining generation efficiency.

major comments (2)
  1. §4.3 (adaptive probabilistic fusion): the mechanism is described at a high level but lacks the explicit formulation or pseudocode for how task-specific priors are computed and combined with the LLM output distribution; without this, it is unclear whether the calibration step is fully determined by the instruction or requires additional learned parameters.
  2. §5.1 (offline experiments): while standard metrics are referenced, the section does not report per-task breakdowns or statistical significance tests for the claimed gains over baselines; this weakens the multi-task effectiveness argument because aggregate improvements could be driven by a subset of tasks.
minor comments (2)
  1. §3.2: the hybrid tokenization procedure would be clearer with an accompanying figure showing the semantic and collaborative token streams before fusion.
  2. Notation for the unified latent space (e.g., the embedding dimensions and fusion weights) is introduced without a consolidated table; adding one would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive feedback. We address each major comment below and will incorporate the necessary clarifications and additions in the revised manuscript.

  1. Referee: §4.3 (adaptive probabilistic fusion): the mechanism is described at a high level but lacks the explicit formulation or pseudocode for how task-specific priors are computed and combined with the LLM output distribution; without this, it is unclear whether the calibration step is fully determined by the instruction or requires additional learned parameters.

    Authors: We appreciate this observation. The current description in §4.3 is indeed high-level. In the revision we will add the explicit mathematical formulation showing how task-specific priors are derived directly from instruction embeddings (without extra learned parameters beyond the base model) and combined with the LLM output distribution through a calibrated weighted fusion. We will also include pseudocode for the full three-step generation procedure. revision: yes

  2. Referee: §5.1 (offline experiments): while standard metrics are referenced, the section does not report per-task breakdowns or statistical significance tests for the claimed gains over baselines; this weakens the multi-task effectiveness argument because aggregate improvements could be driven by a subset of tasks.

    Authors: We agree that per-task breakdowns and statistical tests would strengthen the multi-task claims. In the revised manuscript we will add a table with per-task metric results and report p-values from paired statistical significance tests to demonstrate that gains are consistent and significant across tasks rather than driven by a subset. revision: yes
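The per-task paired test discussed in point 2 could look like the sketch below. It is an illustration, not the authors' analysis: the exact sign-flip permutation test is one reasonable choice among several, and the task names and scores are made up. Each pair is the same evaluation slice scored once by the model and once by the baseline.

```python
from itertools import product


def paired_permutation_p(model_scores, baseline_scores):
    """Two-sided exact sign-flip test on the summed paired differences."""
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null, each difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:   # tolerance for float rounding
            extreme += 1
    return extreme / total


# Made-up scores: (model, baseline) on six evaluation slices per task.
per_task = {
    "task_a": ([0.42, 0.45, 0.40, 0.47, 0.44, 0.43],
               [0.38, 0.41, 0.37, 0.42, 0.40, 0.39]),
    "task_b": ([0.30, 0.28, 0.33, 0.29, 0.31, 0.30],
               [0.31, 0.27, 0.32, 0.30, 0.30, 0.31]),
}

p_values = {task: paired_permutation_p(m, b) for task, (m, b) in per_task.items()}
for task, p in p_values.items():
    print(task, round(p, 4))
```

Exact enumeration visits 2^n sign patterns, so it only suits a small number of pairs; with more pairs one would sample sign patterns at random instead. In this toy data the aggregate gain comes entirely from `task_a`, which is the situation the referee's comment is worried about.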

Circularity Check

0 steps flagged

No significant circularity detected


The paper describes an LLM-based recommender architecture (semantic grounding, hybrid tokenization, multi-task SFT, three-step generation with adaptive fusion) and supports its claims exclusively through offline experiments and online A/B tests. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing steps. The central effectiveness argument rests on empirical results rather than any self-referential reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on standard assumptions from generative modeling and recommendation systems plus several newly introduced technical components whose independent validation is not shown in the abstract.

axioms (2)
  • domain assumption: Item entities can be grounded in a unified latent space that captures both general semantics and collaborative signals.
    This is stated as the first foundational step enabling all subsequent modeling.
  • domain assumption: A large-scale multi-task supervised fine-tuning dataset can empower instruction-following for diverse recommendation demands.
    Assumed when constructing the dataset to support multiple tasks.
invented entities (2)
  • hybrid item tokenization method (no independent evidence)
    purpose: precise modeling and efficient generation
    New method introduced to support both accuracy and generation speed.
  • adaptive probabilistic fusion mechanism (no independent evidence)
    purpose: calibrate output distributions based on task-specific requirements for accuracy and diversity
    New mechanism added in the final generation step.




Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 3 internal anchors

  1. [1]

    Ben Chen, Xian Guo, Siyuan Wang, Zihan Liang, Yue Lv, Yufei Ma, Xinlong Xiao, Bowen Xue, Xuxin Zhang, Ying Yang, et al. 2025. OneSearch: A Preliminary Exploration of the Unified End-to-End Generative Framework for E-commerce Search. arXiv:2509.03236

  2. [2]

    Jiaxin Deng, Shiyao Wang, Kuo Cai, Lejian Ren, Qigen Hu, Weifeng Ding, Qiang Luo, and Guorui Zhou. 2025. OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment. arXiv:2502.18965

  3. [3]

    Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5). In Proceedings of the 16th ACM Conference on Recommender Systems (RecSys ’22). Association for Computing Machinery, New York, NY, USA, 299–315

  4. [4]

    Mihajlo Grbovic and Haibin Cheng. 2018. Real-time Personalization using Embeddings for Search Ranking at Airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’18). Association for Computing Machinery, New York, NY, USA, 311–320

  5. [5]

    Xian Guo, Ben Chen, Siyuan Wang, Ying Yang, Chenyi Lei, Yuqing Ding, and Han Li. 2025. OneSug: The Unified End-to-End Generative Framework for E-commerce Query Suggestion. arXiv:2506.06913

  6. [6]

    Yupeng Hou, Jiacheng Li, Ashley Shin, Jinsung Jeon, Abhishek Santhanam, Wei Shao, Kaveh Hassani, Ning Yao, and Julian McAuley. 2025. Generating Long Semantic IDs in Parallel for Recommendation. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD ’25). Association for Computing Machinery, New York, NY, USA, 956–966

  7. [7]

    Junguang Jiang, Yanwen Huang, Bin Liu, Xiaoyu Kong, Xinhang Li, Ziru Xu, Han Zhu, Jian Xu, and Bo Zheng. 2025. Large Language Model as Universal Retriever in Industrial-Scale Recommender System. arXiv:2502.03041

  8. [8]

    Jiayi Liao, Sihang Li, Zhengyi Yang, Jiancan Wu, Yancheng Yuan, Xiang Wang, and Xiangnan He. 2024. LLaRA: Large Language-Recommendation Assistant. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 1785–1795

  9. [9]

    Jianghao Lin, Xinyi Dai, Yunjia Xi, Weiwen Liu, Bo Chen, Hao Zhang, Yong Liu, Chuhan Wu, Xiangyang Li, Chenxu Zhu, Huifeng Guo, Yong Yu, Ruiming Tang, and Weinan Zhang. 2025. How Can Recommender Systems Benefit from Large Language Models: A Survey. ACM Trans. Inf. Syst. 43, 2, Article 28 (2025), 47 pages

  10. [10]

    Enze Liu, Bowen Zheng, Cheng Ling, Lantao Hu, Han Li, and Wayne Xin Zhao

  11. [11]

    Generative Recommender with End-to-End Learnable Item Tokenization. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25). Association for Computing Machinery, New York, NY, USA, 729–739

  12. [12]

    Zihan Liu, Yupeng Hou, and Julian McAuley. 2024. Multi-Behavior Generative Recommendation. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM ’24). Association for Computing Machinery, New York, NY, USA, 1575–1585

  13. [13]

    Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations

  14. [14]

    Hanjia Lyu, Song Jiang, Hanqing Zeng, Yinglong Xia, Qifan Wang, Si Zhang, Ren Chen, Chris Leung, Jiajie Tang, and Jiebo Luo. 2024. LLM-Rec: Personalized Recommendation via Prompting Large Language Models. In Findings of the Association for Computational Linguistics: NAACL 2024. Association for Computational Linguistics, Mexico City, Mexico, 583–612

  15. [15]

    Thomas M. McDonald, Lucas Maystre, Mounia Lalmas, Daniel Russo, and Kamil Ciosek. 2023. Impatient Bandits: Optimizing Recommendations for the Long-Term Without Delay. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’23). Association for Computing Machinery, New York, NY, USA, 1687–1697

  16. [16]

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748

  17. [17]

    Shutong Qiao, Wei Zhou, Junhao Wen, Chen Gao, Qun Luo, Peixuan Chen, and Yong Li. 2025. Multi-view Intent Learning and Alignment with Large Language Models for Session-based Recommendation. ACM Trans. Inf. Syst. 43, 4, Article 91 (2025), 25 pages

  18. [18]

    Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Tran, Jonah Samost, et al

  19. [19]

    Recommender Systems with Generative Retrieval. Advances in Neural Information Processing Systems 36 (2023), 10299–10315

  20. [20]

    Chunqi Wang, Bingchao Wu, Zheng Chen, Lei Shen, Bing Wang, and Xiaoyi Zeng

  21. [21]

    Scaling Transformers for Discriminative Recommendation via Generative Pretraining. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD ’25). Association for Computing Machinery, New York, NY, USA, 2893–2903

  22. [22]

    Ye Wang, Jiahao Xun, Minjie Hong, Jieming Zhu, Tao Jin, Wang Lin, Haoyuan Li, Linjun Li, Yan Xia, Zhou Zhao, and Zhenhua Dong. 2024. EAGER: Two-Stream Generative Recommender with Behavior-Semantic Collaboration. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24). Association for Computing Machinery, New York...

  23. [23]

    Chen Wei, Yixin Ji, Zeyuan Chen, Jia Xu, and Zhongyi Liu. 2024. LLMGR: Large Language Model-based Generative Retrieval in Alipay Search. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 2847–2851

  24. [24]

    Zhipeng Wei, Kuo Cai, Junda She, Jie Chen, Minghao Chen, Yang Zeng, Qiang Luo, Wencong Zeng, Ruiming Tang, Kun Gai, and Guorui Zhou. 2025. OneLoc: Geo-Aware Generative Recommender Systems for Local Life Service. arXiv:2508.14646

  25. [25]

    Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, Hui Xiong, and Enhong Chen. 2024. A Survey on Large Language Models for Recommendation. World Wide Web 27, 5 (2024), 31 pages

  26. [26]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 Technical Report. arXiv:2505.09388

  27. [27]

    Wencai Ye, Mingjie Sun, Shuhang Chen, Wenjin Wu, and Peng Jiang. 2025. Align3GR: Unified Multi-Level Alignment for LLM-based Generative Recommendation. arXiv:2511.11255

  28. [28]

    Bowen Zheng, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, Ming Chen, and Ji-Rong Wen. 2024. Adapting Large Language Models by Integrating Collaborative Semantics for Recommendation. In 2024 IEEE 40th International Conference on Data Engineering (ICDE). IEEE, New York, NY, USA, 1435–1448

  29. [29]

    Guorui Zhou, Jiaxin Deng, Jinghao Zhang, Kuo Cai, Lejian Ren, Qiang Luo, Qianqian Wang, Qigen Hu, Rui Huang, Shiyao Wang, et al. 2025. OneRec Technical Report. arXiv:2506.13695