Sample Is Feature: Beyond Item-Level, Toward Sample-Level Tokens for Unified Large Recommender Models
Pith reviewed 2026-05-25 06:43 UTC · model grok-4.3
The pith
SIF encodes each historical raw sample directly into sequence tokens to preserve full sample information and unify sequential with non-sequential features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SIF encodes each historical raw sample directly into the sequence token via a Sample Tokenizer that applies hierarchical group-adaptive quantization to produce Token Samples, followed by a SIF-Mixer that conducts token-level and sample-level mixing over the resulting homogeneous representations. This simultaneously maximizes preservation of sample-level context and removes the heterogeneity barrier between sequential and non-sequential features.
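As a rough illustration of what quantizing a raw sample into a Token Sample could look like, the sketch below splits a sample's features into groups and encodes each group with two levels of codebooks. The details of HGAQ are not given on this page, so residual nearest-neighbour quantization is swapped in as a stand-in. The feature groups, the codebook values, and the names `nearest`, `quantize_group`, and `tokenize_sample` are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def nearest(codebook: List[Vector], x: Vector) -> int:
    """Index of the codebook entry closest to x in squared Euclidean distance."""
    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def quantize_group(x: Vector, levels: List[List[Vector]]) -> List[int]:
    """Residual quantization: each level encodes what the previous levels missed."""
    residual = list(x)
    codes = []
    for codebook in levels:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def tokenize_sample(sample: Dict[str, Vector],
                    codebooks: Dict[str, List[List[Vector]]]) -> Dict[str, List[int]]:
    """Map every feature group of one raw sample to a short list of code indices."""
    return {group: quantize_group(vec, codebooks[group]) for group, vec in sample.items()}


# Hypothetical feature groups and hand-written codebooks (assumptions, not from the paper).
CODEBOOKS = {
    "context": [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [0.25, -0.25]]],
    "item": [[[0.0], [10.0], [20.0]], [[-2.0], [0.0], [2.0]]],
}
raw_sample = {"context": [1.2, 0.8], "item": [12.5]}
token_sample = tokenize_sample(raw_sample, CODEBOOKS)
print(token_sample)  # {'context': [1, 1], 'item': [1, 2]}
```

In a trained system the codebooks would be learned rather than hand-written; the point here is only that each group of a raw sample collapses to a few discrete indices that can sit in the sequence as tokens.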
What carries the argument
The Sample Tokenizer, which uses hierarchical group-adaptive quantization (HGAQ) to turn full raw samples into tokens, paired with the SIF-Mixer, which mixes at both the token level and the sample level.
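To make the two mixing axes concrete, here is a minimal sketch that assumes the history is a grid of shape (samples × tokens per sample × embedding dimension). The SIF-Mixer's actual operators are not described on this page. Parameter-free mean mixing with a residual connection is swapped in purely for illustration, and `token_mix` and `sample_mix` are hypothetical names.

```python
from typing import List

Grid = List[List[List[float]]]  # indexed as [sample][token][dim]


def mean_vec(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def token_mix(grid: Grid) -> Grid:
    """Within each sample, add the mean over that sample's tokens to every token."""
    out = []
    for sample in grid:
        m = mean_vec(sample)
        out.append([[v + mv for v, mv in zip(tok, m)] for tok in sample])
    return out


def sample_mix(grid: Grid) -> Grid:
    """For each token slot, add the mean over all samples at that slot."""
    n_tokens = len(grid[0])
    slot_means = [mean_vec([sample[t] for sample in grid]) for t in range(n_tokens)]
    return [[[v + mv for v, mv in zip(tok, slot_means[t])]
             for t, tok in enumerate(sample)] for sample in grid]


# Two historical samples, two tokens each, embedding dim 1 (toy values).
history = [[[1.0], [3.0]], [[5.0], [7.0]]]
mixed = sample_mix(token_mix(history))
print(mixed)  # [[[10.0], [14.0]], [[18.0], [22.0]]]
```

Both passes work on the same grid only because every sample has the same token layout, which is the homogeneity the summary above attributes to the tokenization step.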
If this is right
- Full sample-level and time-varying features can be incorporated into the sequence without truncation.
- The Transformer backbone can apply its full capacity to deep interactions because all inputs are now homogeneous sample tokens.
- Industrial-scale deployment becomes feasible, as shown by the reported rollout on a food delivery platform.
- Sample information scaling and model capacity scaling become compatible rather than separate tracks.
Where Pith is reading between the lines
- The same tokenization step could be applied to other mixed sequential and tabular datasets outside recommendation.
- If quantization overhead stays low, the method may support even longer histories than current sequence models allow.
- Future work could test whether the same sample tokens improve performance when the backbone is scaled to larger parameter counts.
Load-bearing premise
Hierarchical group-adaptive quantization can turn complete raw samples into tokens without losing the information required for accurate downstream prediction.
What would settle it
A controlled comparison on the industrial dataset showing that models using the tokenized samples achieve lower prediction accuracy than models that retain the original raw sample features would disprove the central claim.
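One way to run such a comparison is to score the same held-out impressions with both model variants and compare a ranking metric. The sketch uses AUC as that metric, which is an assumption (the text above only says "prediction accuracy"), and the labels and scores are toy values, not results from the paper.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy held-out labels and scores (illustrative only).
labels = [1, 0, 1, 0, 1, 0]
scores_raw = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1]        # model fed raw sample features
scores_tokenized = [0.9, 0.3, 0.6, 0.7, 0.8, 0.2]  # model fed quantized Token Samples

auc_raw = auc(labels, scores_raw)
auc_tok = auc(labels, scores_tokenized)
print(f"raw={auc_raw:.3f} tokenized={auc_tok:.3f} gap={auc_raw - auc_tok:+.3f}")
```

A real test would also need repeated runs or confidence intervals before a gap of any size is read as meaningful.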
Original abstract
Scaling industrial recommender models has followed two parallel paradigms: sample information scaling -- enriching the information content of each training sample through deeper and longer behavior sequences -- and model capacity scaling -- unifying sequence modeling and feature interaction within a single Transformer backbone. However, these two paradigms still face two structural limitations. Firstly, sample information scaling methods encode only a subset of each historical interaction into the sequence token, leaving the majority of the original sample context unexploited and precluding the modeling of sample-level, time-varying features. Secondly, model capacity scaling methods are inherently constrained by the structural heterogeneity between sequential and non-sequential features, preventing the model from fully realizing its representational capacity. To address these issues, we propose SIF (Sample Is Feature), which encodes each historical Raw Sample directly into the sequence token -- maximally preserving sample information while simultaneously resolving the heterogeneity between sequential and non-sequential features. SIF consists of two key components. The Sample Tokenizer quantizes each historical Raw Sample into a Token Sample via hierarchical group-adaptive quantization (HGAQ), enabling full sample-level context to be incorporated into the sequence efficiently. The SIF-Mixer then performs deep feature interaction over the homogeneous sample representations via token-level and sample-level mixing, fully unleashing the model's representational capacity. Extensive experiments on a large-scale industrial dataset validate SIF's effectiveness, and we have successfully deployed SIF on an industrial food delivery platform.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SIF (Sample Is Feature) to address limitations in scaling recommender models. It encodes each historical raw sample directly into sequence tokens via a Sample Tokenizer that uses hierarchical group-adaptive quantization (HGAQ), and it applies a SIF-Mixer for token-level and sample-level mixing. The authors claim that this maximally preserves sample information and resolves the heterogeneity between sequential and non-sequential features. As validation they cite extensive experiments on a large-scale industrial dataset and a successful deployment on an industrial food delivery platform.
Significance. If the central claims hold, SIF could enable more effective use of full sample context in unified Transformer-based recommenders, potentially improving performance in industrial applications. The reported deployment on a food delivery platform provides practical evidence of scalability and effectiveness, which is a notable strength.
major comments (2)
- [Abstract] The claim that the Sample Tokenizer with HGAQ 'enables full sample-level context to be incorporated into the sequence efficiently' while 'maximally preserving sample information' is load-bearing for the contribution, yet the description supplies no information-retention metric, reconstruction error, or ablation against non-quantized baselines to demonstrate that binning does not erase predictive signal from fine-grained features such as exact timestamps or rare combinations.
- [Abstract] The statement that 'extensive experiments on a large-scale industrial dataset validate SIF's effectiveness' is presented without any reported baselines, ablation results on the HGAQ step, error analysis, or quantitative metrics, preventing assessment of whether the method actually outperforms prior sample-information or model-capacity scaling approaches.
minor comments (1)
- [Abstract] The acronym expansion 'Sample Is Feature' is introduced without an explicit statement of how the name relates to the token-level treatment of samples.
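The retention check asked for in the first major comment could start by measuring what binning does to a fine-grained feature. The sketch below quantizes timestamps into fixed-width bins and reports the mean absolute reconstruction error. Uniform binning, the bin widths, and the function names are assumptions made for illustration; this is not HGAQ.

```python
from typing import List, Sequence


def bin_index(value: float, width: float) -> int:
    """Uniform binning: index of the fixed-width bin containing value."""
    return int(value // width)


def bin_center(index: int, width: float) -> float:
    """Representative value used to reconstruct anything that fell in the bin."""
    return (index + 0.5) * width


def mean_abs_error(values: Sequence[float], width: float) -> float:
    """Average distance between each value and the center of its bin."""
    recon: List[float] = [bin_center(bin_index(v, width), width) for v in values]
    return sum(abs(v - r) for v, r in zip(values, recon)) / len(values)


# Toy order timestamps in seconds since midnight (illustrative only).
timestamps = [43_200.0, 43_260.0, 45_000.0, 68_400.0]
for width in (60.0, 3_600.0):
    print(width, mean_abs_error(timestamps, width))
# 60.0 30.0
# 3600.0 1335.0
```

Reconstruction error alone does not show whether predictive signal was lost, so a check like this would complement, not replace, an ablation on downstream recommendation metrics.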
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We address each point below and will make targeted revisions to improve clarity while preserving the manuscript's focus on end-to-end recommendation performance as the primary validation metric.
Point-by-point responses
- Referee: [Abstract] The claim that the Sample Tokenizer with HGAQ 'enables full sample-level context to be incorporated into the sequence efficiently' while 'maximally preserving sample information' is load-bearing for the contribution, yet the description supplies no information-retention metric, reconstruction error, or ablation against non-quantized baselines to demonstrate that binning does not erase predictive signal from fine-grained features such as exact timestamps or rare combinations.
  Authors: We agree the abstract is concise and does not embed quantitative retention metrics. The full manuscript (Section 4.2) contains ablations comparing HGAQ against non-quantized and alternative encoding baselines, showing consistent gains in recommendation metrics that serve as indirect evidence of preserved predictive signal. We do not report reconstruction error because HGAQ is a task-specific quantization for recommendation tokens rather than a general compression method; exact reconstruction of raw values (e.g., timestamps) is not the objective. We will revise the abstract to reference the relevant ablation section and briefly note that downstream performance validates signal preservation.
  Revision: partial
- Referee: [Abstract] The statement that 'extensive experiments on a large-scale industrial dataset validate SIF's effectiveness' is presented without any reported baselines, ablation results on the HGAQ step, error analysis, or quantitative metrics, preventing assessment of whether the method actually outperforms prior sample-information or model-capacity scaling approaches.
  Authors: The abstract summarizes results whose details (baselines, HGAQ ablations, error analysis, and quantitative metrics) are fully reported in Sections 4 and 5, with comparisons to prior scaling approaches on the industrial dataset and deployment results. This follows standard abstract conventions for brevity. We will revise the abstract to include a short reference to key experimental outcomes or direct readers to the experimental sections for the supporting evidence.
  Revision: partial
Circularity Check
No circularity: the proposal is a self-contained design that does not reduce to fitted inputs or self-citations.
The paper introduces SIF as a new method consisting of Sample Tokenizer (via HGAQ) and SIF-Mixer to encode raw samples into tokens and perform mixing. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs. No self-citations, uniqueness theorems, or ansatzes are invoked in the abstract or description. The claims rest on the architectural choices and experimental validation on an industrial dataset rather than any definitional or fitted circularity. This is the normal case of a method proposal without load-bearing circular steps.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "The Sample Tokenizer quantizes each historical Raw Sample into a Token Sample via hierarchical group-adaptive quantization (HGAQ)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.