SIGMA: A Semantic-Grounded Instruction-Driven Generative Multi-Task Recommender at AliExpress
Pith reviewed 2026-05-15 19:17 UTC · model grok-4.3
The pith
SIGMA grounds items in a unified semantic-collaborative space and follows instructions to handle many recommendation tasks at once.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SIGMA grounds item entities in a unified latent space capturing both general semantics and collaborative signals. Building on this foundation, it introduces hybrid item tokenization for precise modeling and efficient generation, constructs a large-scale multi-task supervised fine-tuning dataset to enable instruction-following across recommendation demands, and applies a three-step item generation procedure integrated with adaptive probabilistic fusion to calibrate output distributions according to task-specific needs for accuracy and diversity.
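The passage quoted later on this page describes the hybrid tokenization as combining SID prefixes with unique item IDs. A minimal sketch of that idea follows, assuming each item already carries a tuple of semantic-ID codes. The `Item` class, the token spellings, and `prefix_len` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Item:
    item_id: int
    sid: Tuple[int, ...]  # hypothetical semantic-ID codes, coarse to fine


def tokenize(item: Item, prefix_len: int = 2) -> List[str]:
    """Return a short shared semantic prefix followed by one unique item token."""
    # Assumption: the first `prefix_len` codes act as the SID prefix.
    prefix = [f"<sid_{level}_{code}>" for level, code in enumerate(item.sid[:prefix_len])]
    # The trailing token identifies the item exactly, so similar items stay distinct.
    return prefix + [f"<item_{item.item_id}>"]


items = [Item(101, (3, 7, 1)), Item(102, (3, 7, 4)), Item(205, (5, 0, 2))]
for it in items:
    print(it.item_id, tokenize(it))
```

In this sketch, items 101 and 102 share a prefix and differ only in the final token, which is one way a short prefix could keep generation cheap while the unique ID keeps the target precise.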
What carries the argument
The unified latent space that merges semantic and collaborative signals, together with hybrid tokenization and the adaptive probabilistic fusion step inside the three-step generation procedure.
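The page does not reproduce the paper's fusion formula, and the referee report below notes that it is missing. As an illustration only, one simple way to calibrate an output distribution is a convex mixture of the model distribution with a task prior. The function name `fuse`, the `weight` parameter, and the uniform prior are assumptions, not the authors' method.

```python
from typing import Dict


def fuse(p_model: Dict[str, float], p_prior: Dict[str, float], weight: float) -> Dict[str, float]:
    """Mix two distributions over candidate items: (1 - weight) * model + weight * prior."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    keys = set(p_model) | set(p_prior)
    mixed = {k: (1.0 - weight) * p_model.get(k, 0.0) + weight * p_prior.get(k, 0.0) for k in keys}
    total = sum(mixed.values())
    if total <= 0.0:
        raise ValueError("fused distribution has no probability mass")
    # Renormalize in case either input did not sum exactly to one.
    return {k: v / total for k, v in mixed.items()}


p_model = {"a": 0.7, "b": 0.2, "c": 0.1}          # toy model distribution
p_prior = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}    # toy uniform prior favoring diversity

accuracy_task = fuse(p_model, p_prior, weight=0.0)   # keeps the model distribution
diversity_task = fuse(p_model, p_prior, weight=0.5)  # flattens toward the prior
print(accuracy_task)
print(diversity_task)
```

A weight near zero leaves the model's ranking sharp, while a larger weight spreads probability across more candidates; how SIGMA actually sets such a weight per task is not stated here.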
If this is right
- One model can address multiple distinct recommendation tasks without separate training pipelines for each.
- Output calibration via adaptive fusion allows the same generation process to favor accuracy on some tasks and diversity on others.
- Instruction following reduces the need to redesign the system when business requirements change.
- Offline gains translate to measurable lifts in live user metrics during A/B testing.
Where Pith is reading between the lines
- Platforms could reuse the same grounding and fusion layers when adding new tasks simply by extending the instruction dataset.
- The approach points toward recommenders that treat task variation as a prompting problem rather than an architecture problem.
- Similar grounding techniques might let generative models incorporate new data modalities without full retraining.
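To make the "task variation as a prompting problem" reading concrete, here is a small sketch of how multi-task instruction records might be assembled. The task names, template wording, and record fields are all hypothetical; the paper's actual dataset format is not shown on this page.

```python
import json
from typing import Dict, List

# Hypothetical instruction templates, one per recommendation task.
TEMPLATES: Dict[str, str] = {
    "next_item": "Given the user's recent items {history}, recommend the next item.",
    "diverse": "Given the user's recent items {history}, recommend items from different categories.",
}


def build_record(task: str, history: List[str], target: str) -> Dict[str, str]:
    """Build one supervised fine-tuning example for the given task."""
    if task not in TEMPLATES:
        raise KeyError(f"unknown task: {task}")
    instruction = TEMPLATES[task].format(history=", ".join(history))
    return {"task": task, "instruction": instruction, "response": target}


record = build_record("next_item", ["<item_1>", "<item_2>"], "<item_3>")
print(json.dumps(record, indent=2))
```

Under this framing, supporting a new task would mean adding a template and examples rather than changing the model architecture.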
Load-bearing premise
That grounding items in one latent space holding both semantics and collaborative signals, plus hybrid tokenization and adaptive fusion, will produce accurate and diverse recommendations when the system is driven by instructions.
What would settle it
Compare SIGMA's outputs against task-specific baselines on a held-out set of multi-task queries, measuring whether accuracy and diversity metrics improve or stay the same under the same instruction prompts.
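The comparison above needs one accuracy metric and one diversity metric computed the same way for every system. The sketch below uses recall@k and category coverage as stand-ins; the page does not say which metrics the paper reports, and the item and category names are toy values.

```python
from typing import Dict, List, Set


def recall_at_k(recommended: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & relevant)
    return hits / len(relevant)


def category_coverage(recommended: List[str], category_of: Dict[str, str], k: int) -> float:
    """Distinct categories in the top-k list divided by the list length (a simple diversity proxy)."""
    top = recommended[:k]
    if not top:
        return 0.0
    return len({category_of[item] for item in top}) / len(top)


recs = ["a", "b", "c", "d"]                      # toy ranked output
relevant = {"b", "d", "z"}                       # toy held-out interactions
category_of = {"a": "shoes", "b": "shoes", "c": "bags", "d": "hats"}

print(recall_at_k(recs, relevant, k=3))
print(category_coverage(recs, category_of, k=3))
```

Running both functions on SIGMA and on each task-specific baseline, under identical prompts and cutoffs, would give the per-task numbers the test calls for.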
Original abstract
With the rapid evolution of Large Language Models (LLMs), generative recommendation is gradually reshaping the paradigm of recommender systems. However, most existing methods remain confined to the interaction-driven next-item prediction paradigm, struggling to keep pace with the latest evolving trends or address the diverse recommendation tasks along with business-specific requirements in real-world scenarios. To this end, we present SIGMA, a Semantic-Grounded Instruction-Driven Generative Multi-Task Recommender deployed at AliExpress. Specifically, we first ground item entities in a unified latent space capturing both general semantics and collaborative signals. Building upon this, we introduce a hybrid item tokenization method for both precise modeling and efficient generation. Moreover, we construct a large-scale multi-task supervised fine-tuning dataset empowering SIGMA to fulfill various recommendation demands via instruction-following. Finally, we design a three-step item generation procedure integrated with an adaptive probabilistic fusion mechanism to calibrate the output distributions based on task-specific requirements for recommendation accuracy and diversity. Extensive offline experiments and online A/B tests demonstrate the effectiveness of SIGMA across various real-world recommendation tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SIGMA, a generative multi-task recommender deployed at AliExpress. It grounds item entities in a unified latent space combining semantics and collaborative signals, employs hybrid item tokenization for precise modeling and efficient generation, constructs a large-scale multi-task SFT dataset for instruction-following across recommendation tasks, and uses a three-step generation procedure with adaptive probabilistic fusion to balance accuracy and diversity. Effectiveness is claimed via offline experiments and online A/B tests on real-world tasks.
Significance. If the empirical results hold, the work provides a practical demonstration of scaling LLM-based generative recommendation to production multi-task settings in e-commerce. The combination of semantic grounding, instruction-driven multi-task SFT, and task-calibrated fusion addresses limitations of interaction-only next-item paradigms, offering a deployable framework that supports diverse business requirements while maintaining generation efficiency.
major comments (2)
- §4.3 (adaptive probabilistic fusion): the mechanism is described at a high level but lacks the explicit formulation or pseudocode for how task-specific priors are computed and combined with the LLM output distribution; without this, it is unclear whether the calibration step is fully determined by the instruction or requires additional learned parameters.
- §5.1 (offline experiments): while standard metrics are referenced, the section does not report per-task breakdowns or statistical significance tests for the claimed gains over baselines; this weakens the multi-task effectiveness argument because aggregate improvements could be driven by a subset of tasks.
minor comments (2)
- The hybrid tokenization procedure in §3.2 would be clearer with an accompanying figure showing the semantic and collaborative token streams before fusion.
- Notation for the unified latent space (e.g., the embedding dimensions and fusion weights) is introduced without a consolidated table; adding one would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and constructive feedback. We address each major comment below and will incorporate the necessary clarifications and additions in the revised manuscript.
- Referee: §4.3 (adaptive probabilistic fusion): the mechanism is described at a high level but lacks the explicit formulation or pseudocode for how task-specific priors are computed and combined with the LLM output distribution; without this, it is unclear whether the calibration step is fully determined by the instruction or requires additional learned parameters.
  Authors: We appreciate this observation. The current description in §4.3 is indeed high-level. In the revision we will add the explicit mathematical formulation showing how task-specific priors are derived directly from instruction embeddings (without extra learned parameters beyond the base model) and combined with the LLM output distribution through a calibrated weighted fusion. We will also include pseudocode for the full three-step generation procedure.
  Revision: yes
- Referee: §5.1 (offline experiments): while standard metrics are referenced, the section does not report per-task breakdowns or statistical significance tests for the claimed gains over baselines; this weakens the multi-task effectiveness argument because aggregate improvements could be driven by a subset of tasks.
  Authors: We agree that per-task breakdowns and statistical tests would strengthen the multi-task claims. In the revised manuscript we will add a table with per-task metric results and report p-values from paired statistical significance tests to demonstrate that gains are consistent and significant across tasks rather than driven by a subset.
  Revision: yes
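The rebuttal promises paired significance tests per task without naming a test. One option that needs no distributional assumptions is an exact paired sign-flip permutation test, sketched below with the standard library only. The function name and the per-query scores are made-up placeholders, not results from the paper.

```python
from itertools import product
from typing import Sequence


def paired_permutation_pvalue(model: Sequence[float], baseline: Sequence[float]) -> float:
    """Two-sided exact sign-flip test on paired differences (practical for small samples)."""
    if len(model) != len(baseline) or not model:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [m - b for m, b in zip(model, baseline)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    total = 0
    # Under the null hypothesis, each difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=n):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)) / n)
        if stat >= observed - 1e-12:  # small tolerance for floating-point ties
            extreme += 1
    return extreme / total


# Toy per-query metric values for a single task (placeholders only).
model_scores = [0.42, 0.45, 0.40, 0.47, 0.43]
baseline_scores = [0.40, 0.42, 0.39, 0.43, 0.41]
p = paired_permutation_pvalue(model_scores, baseline_scores)
print(p)
```

Applying the test separately to each task, and correcting for multiple comparisons if many tasks are tested, would show whether gains hold task by task rather than only in aggregate. The exact enumeration grows as 2^n, so larger samples would need a Monte Carlo variant.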
Circularity Check
No significant circularity detected
The paper describes an LLM-based recommender architecture (semantic grounding, hybrid tokenization, multi-task SFT, three-step generation with adaptive fusion) and supports its claims exclusively through offline experiments and online A/B tests. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing steps. The central effectiveness argument rests on empirical results rather than any self-referential reduction to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Item entities can be grounded in a unified latent space that captures both general semantics and collaborative signals
- domain assumption: A large-scale multi-task supervised fine-tuning dataset can empower instruction-following for diverse recommendation demands
invented entities (2)
- hybrid item tokenization method: no independent evidence
- adaptive probabilistic fusion mechanism: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat recovery and embed_strictMono (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: hybrid item tokenization method that combines SID prefixes with unique item IDs... three-step item generation procedure integrated with an adaptive probabilistic fusion mechanism
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.