arxiv: 2606.00324 · v1 · pith:STEZ6NWFnew · submitted 2026-05-29 · 💻 cs.IR · cs.AI

LLMs Need Encoders for Semantic IDs Too

Pith reviewed 2026-06-28 20:30 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords semantic ids · generative recommendation · prefix n-gram memory · llm encoders · hierarchical codes · multimodal models · prefix-conditioned representations

The pith

Semantic IDs require a dedicated encoder in LLMs because each token's meaning depends on its prefix context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal LLMs already attach separate encoders to non-text modalities such as images because raw embeddings fail to capture their internal structure. Semantic IDs used in generative recommendation form another such modality: each level of the hierarchical code only acquires meaning from the preceding levels. Current practice simply adds the SID tokens to the LLM vocabulary and expects joint training to discover these dependencies. PrefixMem instead supplies pre-computed prefix n-gram representations that the model can read directly at SID positions. On large-scale Pinterest data the method raises deepest-level SID accuracy by up to 46 percent relative and full-SID retrieval recall by up to 22 percent relative at matched compute, with the largest gains on examples that defeat greedy decoding.
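The prefix-dependence described above can be made concrete with a toy sketch. The catalog below is entirely hypothetical (the codes and item names are invented for illustration and do not come from the paper); it only shows that the same final-level code can resolve to unrelated items under different prefixes, which is why a single flat embedding per code is a weak representation.

```python
# Hypothetical 3-level SID catalog; every code and item name is invented.
CATALOG = {
    (3, 1, 7): "hiking boots",
    (3, 2, 7): "trail maps",
    (5, 1, 7): "pasta recipes",
}

def readings_of_code(code, level, catalog):
    """Map each prefix to the item reached when `code` appears at `level` (0-based)."""
    return {
        sid[:level]: item
        for sid, item in catalog.items()
        if len(sid) == level + 1 and sid[level] == code
    }

# Code 7 at the deepest level has three unrelated readings, one per prefix.
readings = readings_of_code(7, 2, CATALOG)
for prefix, item in sorted(readings.items()):
    print(prefix, "->", item)
```

A vocabulary-only approach assigns code 7 one input embedding shared by all three rows; a prefix-conditioned encoder can hand the model a different vector for each.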

Core claim

Semantic IDs constitute another modality whose tokens benefit from a dedicated encoder; PrefixMem provides prefix-conditioned representations and improves deepest-level SID accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative at matched training compute. The encoder can be pre-trained independently and then attached to any LLM for joint training.

What carries the argument

PrefixMem, a lightweight SID encoder based on prefix n-gram memory tables that supplies structured, prefix-conditioned representations at SID token positions.

If this is right

  • The encoder can be pre-trained separately before attachment to any LLM family.
  • Accuracy gains concentrate on hard examples, reaching up to 77 percent relative improvement where greedy decoding fails.
  • Full-SID retrieval recall improves by up to 22 percent relative at matched training compute.
  • The same attachment pattern works across multiple LLM families on large recommendation data.
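Because every headline number here is a relative gain, it may help to spell out the arithmetic. The sketch below uses made-up baseline values (they are NOT from the paper, which this page does not quote in absolute terms) to show how a large relative gain can correspond to a modest absolute change.

```python
def relative_gain(baseline, treated):
    """Relative improvement of `treated` over `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (treated - baseline) / baseline

# Hypothetical numbers, NOT from the paper: an accuracy of 0.20 rising to
# 0.292 is a 46% relative gain but only 9.2 points in absolute terms.
gain = relative_gain(0.20, 0.292)
print(f"{gain:.0%} relative")
```

Reading the 46, 22, and 77 percent figures therefore requires the absolute baselines, which only the full paper can supply.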

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Any token sequence whose semantics are prefix-dependent may benefit from an analogous lightweight encoder rather than vocabulary expansion alone.
  • The pattern suggests that future multimodal LLMs could maintain a small library of modality-specific encoders, each handling one structured token type.
  • Systems that rely solely on vocabulary growth for hierarchical codes may systematically leave performance on the table once training compute is fixed.

Load-bearing premise

The observed gains arise from the prefix-conditioned structure supplied by the n-gram tables rather than from extra parameters or altered training dynamics alone.

What would settle it

The claim would be falsified by an ablation that replaces the prefix n-gram tables with an equal number of extra parameters emitting non-prefix or random vectors at SID positions and finds accuracy unchanged relative to PrefixMem.
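One way such a control could be set up is sketched below, under stated assumptions: the table size and prime multipliers are placeholders, and the "prefix-blind" variant is this page's illustration of a non-prefix control, not an experiment reported by the authors. Both variants address the same table, so the parameter count is identical; only the information used to choose the row differs.

```python
TABLE_SIZE = 1024                      # hypothetical table size
PRIMES = (1000003, 1000033, 1000037)   # placeholder prime multipliers

def xor_hash(codes, primes=PRIMES, table_size=TABLE_SIZE):
    """XOR the products code * prime, then reduce modulo the table size."""
    acc = 0
    for code, prime in zip(codes, primes):
        acc ^= code * prime
    return acc % table_size

def prefix_keyed_row(prefix):
    """Treatment: the row depends on the whole prefix (c_1, ..., c_n)."""
    return xor_hash(prefix)

def prefix_blind_row(prefix):
    """Control: same table and size, but only the current code selects the row."""
    return xor_hash(prefix[-1:])

a, b = (3, 1, 7), (5, 1, 7)  # same final code, different prefixes
print(prefix_keyed_row(a), prefix_keyed_row(b))   # distinct rows
print(prefix_blind_row(a), prefix_blind_row(b))   # identical rows
```

If the prefix-blind control matched the prefix-keyed variant in accuracy, the gains could not be attributed to prefix conditioning.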

Figures

Figures reproduced from arXiv: 2606.00324 by Jaewon Yang, Jiajing Xu, Xiangyi Chen, Xinyi Li, Yi-Ping Hsu, Zelun Wang.

Figure 1: The same SID code means different things under …
Figure 2: The PrefixMem module at level $\ell = 3$. Given prefix $(c_1, c_2, c_3)$, prefix $n$-grams of increasing length are hashed into multi-head embedding tables. Retrieved vectors are aggregated, projected via $W_{\text{out}}$, and added (⊕) to the input embedding of $c_3$, enriching it before the LLM predicts $c_4$.

Equation (1), excerpted alongside the caption: $\mathrm{idx}_{n,h} = \left( \bigoplus_{i=1}^{n} (c_i \times p_{i,h}) \right) \bmod T$, where $\times$ is integer multiplication, $\oplus$ is bitwise XOR, $p_{i,h}$ are fixed prime constants, …
Figure 3: TF-L5 accuracy over training on Qwen3 1.7B. The …
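The Figure 2 caption and Equation (1) together describe a lookup that can be sketched in a few lines. The following is a minimal pure-Python illustration, not the authors' implementation: the table size, embedding width, number of heads, and the specific primes are placeholders, random values stand in for learned table weights, and summation is assumed as the aggregator because the caption does not say how retrieved vectors are combined.

```python
import random
from functools import reduce

TABLE_SIZE = 1024          # T in Eq. (1); hypothetical value
DIM = 4                    # embedding width; hypothetical value
MAX_N = 3                  # longest prefix n-gram; hypothetical value
# PRIMES[h][i] plays the role of p_{i,h}; the specific primes are placeholders.
PRIMES = [
    [1000003, 1000033, 1000037],
    [998244353, 1000000007, 1000000009],
]

def prefix_index(prefix, head_primes, table_size=TABLE_SIZE):
    """Eq. (1): XOR the products c_i * p_{i,h}, then reduce modulo T."""
    mixed = reduce(lambda acc, pair: acc ^ (pair[0] * pair[1]),
                   zip(prefix, head_primes), 0)
    return mixed % table_size

def make_tables(seed=0):
    """One table per (n-gram length, head); random init stands in for learned weights."""
    rng = random.Random(seed)
    return {
        (n, h): [[rng.gauss(0.0, 0.02) for _ in range(DIM)]
                 for _ in range(TABLE_SIZE)]
        for n in range(1, MAX_N + 1)
        for h in range(len(PRIMES))
    }

def prefix_memory(prefix, tables):
    """Sum the rows retrieved for every n-gram length and head (sum is assumed)."""
    out = [0.0] * DIM
    for n in range(1, min(len(prefix), MAX_N) + 1):
        for h, head_primes in enumerate(PRIMES):
            row = tables[(n, h)][prefix_index(prefix[:n], head_primes)]
            out = [a + b for a, b in zip(out, row)]
    return out

def enrich(token_embedding, prefix, tables, w_out):
    """Project the aggregated memory with W_out and add it to the token embedding."""
    mem = prefix_memory(prefix, tables)
    projected = [sum(w * m for w, m in zip(w_row, mem)) for w_row in w_out]
    return [e + p for e, p in zip(token_embedding, projected)]

tables = make_tables()
identity = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
base = [0.0] * DIM  # stand-in for the vocabulary embedding of code c_3 = 7
enriched_a = enrich(base, (3, 1, 7), tables, identity)
enriched_b = enrich(base, (5, 1, 7), tables, identity)
```

Here `enriched_a` and `enriched_b` start from the same token embedding for code 7 yet end up as different vectors, because the two prefixes hash to different table rows. Hash collisions are possible by construction; using several heads with different primes is a common way to make a simultaneous collision in every head unlikely.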
Original abstract

Multimodal LLMs use dedicated encoders to bridge non-language modalities (vision encoders for images, depth models for audio codec tokens) because raw token embeddings alone cannot capture modality-specific structure. We argue that Semantic IDs (SIDs), the hierarchical codes used in generative recommendation, constitute another such modality: a SID level token's meaning depends on its prefix context, yet current systems simply add SID tokens to the vocabulary and rely on training to learn these context-dependent meanings from scratch. We propose PrefixMem, a lightweight SID encoder based on prefix n-gram memory tables that provides the LLM with structured, prefix-conditioned representations at SID token positions. Like vision encoders in multimodal LLMs, PrefixMem can be pre-trained independently and then attached to any LLM for joint training. We evaluate on large-scale data from Pinterest across multiple LLM families and show that PrefixMem improves deepest-level SID accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative at matched training compute. The encoder's benefit concentrates on hard examples where greedy decoding fails, with up to 77% relative accuracy gains, confirming that SID tokens benefit from a dedicated encoder just as other non-language modalities do.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Semantic IDs (SIDs) in generative recommendation systems form a distinct modality whose tokens require a dedicated encoder, analogous to vision encoders, because a SID token's meaning depends on its prefix. It proposes PrefixMem, a lightweight prefix n-gram memory table encoder that supplies structured, prefix-conditioned representations and can be pre-trained independently before joint LLM training. On large-scale Pinterest data across LLM families, PrefixMem yields up to 46% relative improvement in deepest-level SID accuracy and 22% in full-SID retrieval recall at matched training compute, with larger gains on hard examples where greedy decoding fails.

Significance. If the performance gains are shown to arise specifically from the prefix-conditioned n-gram structure rather than added capacity, the work would strengthen the case for treating SIDs as a modality requiring explicit encoders and could guide encoder design in industrial recommendation systems. The empirical scale on real Pinterest data is a positive feature, but the absence of parameter-matched controls limits the strength of the modality-encoder analogy.

major comments (2)
  1. [Abstract / evaluation paragraph] Abstract and evaluation paragraph: the reported gains (46% deepest-level accuracy, 22% retrieval recall) are stated to occur 'at matched training compute,' yet no evidence is given that total parameter count is held constant or that an ablation with equivalent extra capacity (e.g., unstructured MLP or additional embedding layers on SID tokens) was performed. Without this control, the central claim that PrefixMem's benefit derives from structured prefix-conditioned representations rather than capacity or optimization dynamics alone cannot be substantiated.
  2. [Abstract] Abstract: the manuscript provides no details on baselines, error bars, exact data splits, or ablation controls for the Pinterest experiments. This makes it impossible to assess whether the reported relative improvements are robust or whether they could be explained by differences in training dynamics unrelated to the proposed encoder structure.
minor comments (2)
  1. [Abstract] The abstract mentions improvements 'across multiple LLM families' but does not name the families or report per-family breakdowns; adding this would improve reproducibility.
  2. [evaluation paragraph] Notation for 'deepest-level SID accuracy' and 'full-SID retrieval recall' is used without an explicit definition or reference to the precise metric formulas in the evaluation section.
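For orientation, the two metrics could plausibly be computed as sketched below. These definitions are assumptions made for illustration, since the abstract does not give formulas: "deepest-level accuracy" is read as teacher-forced accuracy on the final SID level (consistent with the "TF-L5" label in Figure 3), and "full-SID retrieval recall" as recall@k over complete decoded SIDs. The toy data is invented.

```python
def deepest_level_accuracy(predicted_last_codes, target_sids):
    """Share of examples whose predicted final-level code equals the target's.

    Assumes each prediction was made with the true prefix supplied
    (teacher forcing); this reading is an assumption, not the paper's text.
    """
    hits = sum(pred == sid[-1]
               for pred, sid in zip(predicted_last_codes, target_sids))
    return hits / len(target_sids)

def full_sid_recall_at_k(candidate_lists, target_sids, k):
    """Share of examples whose complete target SID is among the top-k candidates."""
    hits = sum(target in cands[:k]
               for cands, target in zip(candidate_lists, target_sids))
    return hits / len(target_sids)

# Invented toy data: two examples with 3-level SIDs.
targets = [(3, 1, 7), (5, 1, 7)]
predicted_last = [7, 2]
candidates = [[(3, 1, 7), (3, 2, 7)], [(5, 1, 2), (5, 1, 7)]]

acc = deepest_level_accuracy(predicted_last, targets)
recall_1 = full_sid_recall_at_k(candidates, targets, k=1)
recall_2 = full_sid_recall_at_k(candidates, targets, k=2)
```

The two metrics can diverge: the first isolates the hardest level given a correct prefix, while the second requires every level to be right within the candidate budget.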

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments emphasizing the need for parameter-matched controls and fuller experimental details to strengthen the modality-encoder analogy. We respond to each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract / evaluation paragraph] Abstract and evaluation paragraph: the reported gains (46% deepest-level accuracy, 22% retrieval recall) are stated to occur 'at matched training compute,' yet no evidence is given that total parameter count is held constant or that an ablation with equivalent extra capacity (e.g., unstructured MLP or additional embedding layers on SID tokens) was performed. Without this control, the central claim that PrefixMem's benefit derives from structured prefix-conditioned representations rather than capacity or optimization dynamics alone cannot be substantiated.

    Authors: We agree that matching training compute in FLOPs does not by itself rule out capacity or optimization effects, and the current manuscript does not report total parameter counts or include an unstructured MLP ablation of matched capacity. In the revision we will add (i) explicit parameter counts for PrefixMem-augmented models versus baselines and (ii) a capacity-matched ablation replacing the prefix n-gram tables with an unstructured MLP of equivalent parameter budget applied to SID token positions. These additions will directly test whether the prefix-conditioned structure, rather than added capacity, drives the observed gains. revision: yes

  2. Referee: [Abstract] Abstract: the manuscript provides no details on baselines, error bars, exact data splits, or ablation controls for the Pinterest experiments. This makes it impossible to assess whether the reported relative improvements are robust or whether they could be explained by differences in training dynamics unrelated to the proposed encoder structure.

    Authors: The current manuscript is indeed concise on these points. In the revision we will expand the experimental section to report: baseline configurations and hyper-parameters, error bars computed over multiple random seeds, precise train/validation/test splits on the Pinterest data, and the full set of ablation controls (including the new capacity-matched MLP ablation noted above). This will enable readers to evaluate robustness and rule out unrelated training dynamics. revision: yes
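The capacity-matched ablation promised in the first response depends on a parameter budget. A rough sketch of that bookkeeping follows; all sizes are hypothetical, and the encoder layout assumed here (one table per n-gram length and head, plus a single output projection) is inferred from the Figure 2 caption rather than confirmed by the manuscript.

```python
def prefixmem_params(num_orders, num_heads, table_size, mem_dim, model_dim):
    """Parameter count under the assumed layout: hashed tables plus W_out."""
    tables = num_orders * num_heads * table_size * mem_dim
    w_out = mem_dim * model_dim
    return tables + w_out

def matched_mlp_hidden(budget, model_dim):
    """Largest hidden width of a bias-free two-layer MLP
    (model_dim -> hidden -> model_dim) that fits within `budget` parameters."""
    return budget // (2 * model_dim)

# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
budget = prefixmem_params(num_orders=3, num_heads=2, table_size=1024,
                          mem_dim=64, model_dim=2048)
hidden = matched_mlp_hidden(budget, model_dim=2048)
print(budget, hidden)
```

Matching parameters is not the same as matching compute: a table lookup reads one row per n-gram and head, whereas a dense MLP multiplies through all of its weights at every SID position, so the revision would need to state which quantity is held fixed.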

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on external data

Full rationale

The paper proposes PrefixMem as an SID encoder and reports relative accuracy/recall gains on large-scale Pinterest data across LLM families. No mathematical derivations, equations, or 'predictions' appear that reduce by construction to fitted inputs or self-defined quantities. Claims rest on observed metrics at matched training compute rather than any self-referential loop, uniqueness theorem, or ansatz smuggled via self-citation. This matches the default case of a self-contained empirical result against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies insufficient detail to enumerate specific free parameters or axioms; the approach implicitly assumes that prefix n-gram statistics are sufficient to capture context-dependent SID semantics and that the encoder can be attached without disrupting LLM training dynamics.

pith-pipeline@v0.9.1-grok · 5751 in / 1049 out tokens · 19530 ms · 2026-06-28T20:30:56.833118+00:00 · methodology

