FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets

Chenchi Zhang; Junjun Zheng; Kairui Fu; Kun Kuang; Shengyu Zhang; Shuwen Xiao; Tao Zhang; Xiangheng Kong; Xinming Zhang; Yuliang Yan

arxiv: 2509.20904 · v3 · pith:GEQEYW4T · submitted 2025-09-25 · 💻 cs.IR

Kairui Fu, Tao Zhang, Shuwen Xiao, Ziyang Wang, Xinming Zhang, Chenchi Zhang, Yuliang Yan, Junjun Zheng, Xiangheng Kong, Shengyu Zhang, Kun Kuang, Yuning Jiang

keywords: semantic, construction, experiments, forge, generative, identifiers, retrieval, taobao

Semantic identifiers (SIDs) have gained increasing attention in generative retrieval (GR) for recommendation due to their meaningful semantic discriminability. However, current studies in this field (1) offer limited investigation into construction strategies for better SIDs, and (2) typically rely on costly GR training to assess SIDs. To address these challenges, we propose FORGE, a comprehensive benchmark for FOrming semantic identifieRs for Generative rEtrieval. Specifically, FORGE provides a taxonomy of the SID construction process from several perspectives and validates the impact of each perspective on downstream GR through offline experiments across diverse settings. Notably, these empirical findings have led to a 0.35% increase in transaction count in online A/B experiments in the Guess You Like section of Taobao. The corresponding SID construction strategies have since been deployed at full scale on Taobao, demonstrating their practical effectiveness. To avoid expensive SID assessment that requires full GR training, we propose two novel SID evaluation metrics that are highly correlated with recommendation performance, enabling convenient evaluations without any GR training. Furthermore, to support the community, we release AL-GR, the industrial dataset used in our experiments, comprising 14 billion interactions and 250 million items with the corresponding multimodal features collected from Taobao. All the code and data are available at https://github.com/selous123/al_sid.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

IAT: Instance-As-Token Compression for Historical User Sequence Modeling in Industrial Recommender Systems
cs.IR 2026-04 unverdicted novelty 7.0

IAT compresses each historical interaction instance into a unified embedding token via temporal-order or user-order schemes, allowing standard sequence models to learn long-range preferences with better performance an...

Tencent Advertising Algorithm Challenge 2025: All-Modality Generative Recommendation
cs.IR 2026-04 accept novelty 7.0

Releases TencentGR-1M and TencentGR-10M datasets with baselines for all-modality generative recommendation in advertising, including weighted evaluation for conversions.

RAD-DPO: Robust Adaptive Denoising Direct Preference Optimization for Generative Retrieval in E-commerce
cs.IR 2026-02 unverdicted novelty 7.0

RAD-DPO adds token-level gradient detachment, similarity-based dynamic reward weighting, and a multi-label global contrastive objective to DPO for better handling of hierarchical Semantic IDs and noisy feedback in e-c...

From Local Indices to Global Identifiers: Generative Reranking for Recommender Systems via Global Action Space
cs.IR 2026-04 unverdicted novelty 6.0

GloRank reformulates list-wise reranking as token generation over a global item identifier space, using supervised pre-training followed by reinforcement learning to maximize list-wise utility and outperforming baseli...

UniRec: Bridging the Expressive Gap between Generative and Discriminative Recommendation via Chain-of-Attribute
cs.IR 2026-04 unverdicted novelty 6.0

UniRec bridges the expressive gap in generative recommendation by prefixing semantic ID sequences with structured attribute tokens, recovering explicit feature crossing and yielding +22.6% HR@50 gains plus online lift...

Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation
cs.IR 2026-04 unverdicted novelty 6.0

STAMP mitigates semantic dilution in SID-based generative recommendation via adaptive input pruning and densified output supervision, delivering 1.23-1.38x speedup and 17-55% VRAM savings with maintained or improved accuracy.

Efficient Generative Retrieval for E-commerce Search with Semantic Cluster IDs and Expert-Guided RL
cs.IR 2026-05 unverdicted novelty 5.0

CQ-SID semantic IDs and EG-GRPO RL improve generative retrieval hit rates up to 26.76% over RQ-VAE baselines and deliver +1.15% GMV in live e-commerce A/B tests.