Catalog-Native LLM: Speaking Item-ID Dialect with Less Entanglement for Recommendation

Chen Wang; Heng Huang; Michelle Gong; Reza Shirkavand; Xiaokai Wei; Zheng Hui

arxiv: 2510.05125 · v2 · submitted 2025-09-30 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-18 12:17 UTC · model grok-4.3

keywords: recommendation systems, large language models, mixture of experts, item ID dialect, collaborative filtering, token-type gating, multimodal integration, catalog native LLM

The pith

By splitting each LLM block's feed-forward network into text and item experts with token-type gating, item interaction histories become a native dialect without destructive interference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to combine the predictive power of collaborative filtering with the reasoning abilities of large language models in one system. It does this by treating sequences of item IDs and user histories as if they were another language the model already knows how to speak. The technical move is to divide the feed-forward network inside every transformer block into two separate experts, one for ordinary text and one for item tokens, then route each token to the right expert using a simple gate based on its type. This separation stops the two kinds of information from getting in each other's way during training and inference. If the approach works, the resulting model can deliver accurate recommendations while still answering natural-language questions and explaining its suggestions using the knowledge it had before the change.
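To make the "another language" framing concrete, here is a minimal Python sketch of how item IDs might be appended to a text vocabulary and how a per-token type flag could be derived from the resulting ids. The vocabulary size, the item IDs, the segment format, and the function names (`build_item_vocab`, `encode`, `token_types`) are all illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, List, Tuple, Union

BASE_VOCAB_SIZE = 8  # hypothetical; real text vocabularies are far larger

Segment = Tuple[str, Union[List[int], str]]


def build_item_vocab(item_ids: List[str], base_vocab_size: int) -> Dict[str, int]:
    """Give every catalog item one new token id placed after the text vocabulary."""
    return {item: base_vocab_size + i for i, item in enumerate(item_ids)}


def encode(segments: List[Segment], item_vocab: Dict[str, int]) -> List[int]:
    """Flatten text segments (already tokenized) and item IDs into one id sequence."""
    ids: List[int] = []
    for kind, payload in segments:
        if kind == "text":
            ids.extend(payload)  # payload is a list of text token ids
        else:
            ids.append(item_vocab[payload])  # payload is a single item ID string
    return ids


def token_types(ids: List[int], base_vocab_size: int) -> List[int]:
    """0 marks a text token, 1 marks an item token (id at or above the text vocab size)."""
    return [1 if t >= base_vocab_size else 0 for t in ids]


item_vocab = build_item_vocab(["B001", "B002", "B003"], BASE_VOCAB_SIZE)
segments: List[Segment] = [("text", [1, 4, 2]), ("item", "B002"), ("item", "B001"), ("text", [5])]
ids = encode(segments, item_vocab)
types = token_types(ids, BASE_VOCAB_SIZE)
print(ids)    # [1, 4, 2, 9, 8, 5]
print(types)  # [0, 0, 0, 1, 1, 0]
```

Deriving the type flag from the id range keeps the gate free of learned parameters, which is one way to read the "simple gate based on its type" described above. The paper's actual tokenizer extension may differ.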

Core claim

IDIOMoE treats item interaction histories as a native dialect within the language space of a pretrained LLM. It achieves this by splitting the Feed Forward Network of each block into a dedicated text expert and an item expert, then applying token-type gating so that text tokens and item tokens are processed by their respective experts. The construction avoids destructive interference between the text and catalog modalities, produces strong recommendation performance on both public and proprietary datasets, and leaves the original text understanding of the pretrained model intact.

What carries the argument

The split feed-forward network inside each transformer block, with one expert for text tokens and one expert for item tokens, selected by token-type gating.
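A minimal sketch of the split described above, written in plain Python so it runs without dependencies. Each token's hidden vector passes through exactly one of two feed-forward experts, chosen by a 0/1 token-type flag rather than by a learned router. The ReLU activation, the tiny weight matrices, and the names `Expert` and `gated_ffn` are assumptions for illustration; the residual connection, normalization, and the backbone's real activation are omitted.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (cols)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


class Expert:
    """A two-layer feed-forward block: w_out @ relu(w_in @ x)."""

    def __init__(self, w_in: Matrix, w_out: Matrix) -> None:
        self.w_in = w_in
        self.w_out = w_out

    def __call__(self, x: Vector) -> Vector:
        return matvec(self.w_out, relu(matvec(self.w_in, x)))


def gated_ffn(hidden: List[Vector], token_types: List[int],
              text_expert: Expert, item_expert: Expert) -> List[Vector]:
    """Static token-type routing: type 1 goes to the item expert, type 0 to the text expert."""
    out = []
    for h, t in zip(hidden, token_types):
        expert = item_expert if t == 1 else text_expert
        out.append(expert(h))
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
text_expert = Expert(w_in=identity, w_out=identity)                  # computes relu(x)
item_expert = Expert(w_in=[[2.0, 0.0], [0.0, 2.0]], w_out=identity)  # computes relu(2x)

hidden = [[1.0, -1.0], [0.5, 2.0]]
out = gated_ffn(hidden, [0, 1], text_expert, item_expert)
print(out)  # [[1.0, 0.0], [1.0, 4.0]]
```

Because the routing is a fixed function of token type, text-only inputs never touch the item expert's weights, which is the property the preservation claim relies on.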

If this is right

Recommendation accuracy improves on both public and proprietary datasets.
The original text understanding and reasoning abilities of the pretrained LLM remain available.
Collaborative signals from item histories can be processed in the same forward pass as natural-language queries.
The model supports unified handling of implicit preferences and explicit text inputs without separate pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar expert splits could be tried for other ID-based or catalog data types that currently clash with language modeling.
The gating approach might reduce the amount of retraining needed when adding new data modalities to an existing LLM.
If the simple split works at current scale, it raises the question of whether the same pattern holds when the number of modalities grows beyond two.

Load-bearing premise

That treating item histories as a native dialect and using simple token-type gating on the split experts is enough to stop destructive interference between text and item signals.

What would settle it

Measure whether the modified model loses accuracy on standard text comprehension benchmarks or fails to beat strong recommendation baselines on the same datasets after the split and gating are applied.

Figures

Figures reproduced from arXiv: 2510.05125 by Chen Wang, Heng Huang, Michelle Gong, Reza Shirkavand, Xiaokai Wei, Zheng Hui.

**Figure 2.** Overview of our proposed IDIOMoE. We extend the LLM tokenizer with new …

**Figure 3.** Language understanding retention.

Excerpt from Section 3.1 (Preliminary): We study how incorporating item textual attributes affects performance given a user’s interaction history. We start from the pretrained Qwen/Qwen2.5-0.5B (Qwen et al., 2025), extend its vocabulary with item-ID tokens, and compare three variants that differ only in input format and the source of item embeddings. In all variants, instruction text tok…

**Figure 4.** Results on our industrial dataset.

Excerpt (Datasets, Evaluation, & Backbone): We use the public Amazon datasets Games, Instruments, and Arts (Ni et al., 2019) as well as Sports, Beauty, and Toys (McAuley et al., 2015). We further report performance on larger 2023 Amazon variants (Beauty, Books, and Toys) with substantially larger item vocabularies (Hou et al., 2024a). We also train and evaluate on our in-house industrial-sc…

**Figure 5.** FFN key-value memory analysis comparing MoE vs. non-MoE. Each subfigure shows …

**Figure 6.** Layer-wise attention metrics on text-only inputs. Left: previous-token attention. Middle: …

**Figure 7.** Distance profiles aggregated over early, mid, and late layers. MoE and backbone curves are …

Original abstract

While collaborative filtering delivers predictive accuracy and efficiency, and Large Language Models (LLMs) enable expressive and generalizable reasoning, modern recommendation systems must bring these strengths together. Growing user expectations, such as natural-language queries and transparent explanations, further highlight the need for a unified approach. However, doing so is nontrivial. Collaborative signals are often token-efficient but semantically opaque, while LLMs are semantically rich but struggle to model implicit user preferences when trained only on textual inputs. This paper introduces Item-ID + Oral-language Mixture-of-Experts Language Model (IDIOMoE), which treats item interaction histories as a native dialect within the language space, enabling collaborative signals to be understood in the same way as natural language. By splitting the Feed Forward Network of each block of a pretrained LLM into a separate text expert and an item expert with token-type gating, our method avoids destructive interference between text and catalog modalities. IDIOMoE demonstrates strong recommendation performance across both public and proprietary datasets, while preserving the text understanding of the pretrained model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The IDIOMoE split on FFNs is a straightforward attempt to reduce text-item interference in LLM recsys, but shared attention still mixes the modalities upstream.

The paper's main contribution is IDIOMoE, which splits the feed-forward network in each transformer block of a pretrained LLM into separate text and item experts, then routes via token-type gating. This lets item interaction histories act as a native dialect inside the language model rather than forcing everything through text-only inputs. The goal is to keep the original model's text reasoning intact while adding collaborative signals for recommendation tasks. That framing is clear and directly targets a practical pain point in production systems that want natural-language interfaces without maintaining separate ranking and explanation models.

The reported results on both public and proprietary datasets suggest the approach can deliver usable recommendation performance without obvious collapse in text capabilities. That combination of preservation plus gains is the part worth paying attention to if the numbers hold up under scrutiny.

The soft spot is the shared multi-head self-attention that sits before the split FFNs. Item-ID tokens and text tokens still compute joint attention scores and produce mixed hidden states that then reach the specialized experts. The abstract and stress-test note give no attention-pattern analysis or ablation that isolates the attention component, so it remains possible that some entanglement happens upstream of the gating. If the full paper includes those checks or shows the mixing is harmless in practice, the architecture claim strengthens; otherwise the interference-avoidance story stays partial.

The work is aimed at people building hybrid LLM-recommendation systems who need something deployable without massive retraining. A reader already working on catalog-aware language models or MoE variants for recsys would find the concrete split-and-gate design useful to compare against. It has enough of a defined architecture and claimed outcomes to merit sending out for peer review, though any referee would likely press on the attention-mixing question and ask for clearer ablations.

Referee Report

1 major / 0 minor

Summary. The paper introduces IDIOMoE, which treats item interaction histories as a native 'dialect' within the language space of a pretrained LLM. It splits the Feed Forward Network of each transformer block into separate text and item experts, using token-type gating to route inputs and thereby avoid destructive interference between text and catalog modalities. The method is claimed to deliver strong recommendation performance on public and proprietary datasets while preserving the pretrained model's text understanding capabilities.

Significance. If the central architectural claim holds and the reported performance is substantiated, the work would offer a parameter-efficient route to unifying collaborative signals with LLM reasoning, reducing the need for separate modality-specific models or extensive retraining in recommendation systems.

major comments (1)

Abstract: the claim that splitting only the FFN with token-type gating 'avoids destructive interference' is load-bearing for the central contribution, yet the preceding multi-head self-attention remains fully shared. Item-ID and text tokens therefore produce jointly attended hidden states that are passed to the specialized experts; without attention-pattern analysis, cross-modal weight statistics, or an ablation that isolates the shared attention component, it is unclear whether entanglement is prevented upstream of the gating mechanism.
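One way to probe the concern above is to measure how much attention mass crosses between token types. The sketch below is a hypothetical diagnostic, not an analysis taken from the paper: `cross_modal_mass` averages, over queries, the share of attention that lands on keys of the other token type, and `attention_entropy` averages the per-query entropy of the attention distribution. The attention matrix and type flags are made-up toy values.

```python
import math
from typing import List

Matrix = List[List[float]]


def cross_modal_mass(attn: Matrix, token_types: List[int]) -> float:
    """Mean share of each query's attention that falls on keys of the other token type."""
    shares = []
    for i, row in enumerate(attn):
        other = sum(a for a, t in zip(row, token_types) if t != token_types[i])
        shares.append(other)
    return sum(shares) / len(shares)


def attention_entropy(attn: Matrix) -> float:
    """Mean entropy (in nats) of each query's attention distribution over keys."""
    total = 0.0
    for row in attn:
        total += -sum(a * math.log(a) for a in row if a > 0.0)
    return total / len(attn)


# Toy causal attention for one head: rows are queries, columns are keys, rows sum to 1.
attn = [
    [1.00, 0.00, 0.00],
    [0.50, 0.50, 0.00],
    [0.25, 0.25, 0.50],
]
types = [0, 1, 1]  # one text token followed by two item tokens (assumed layout)

mass = cross_modal_mass(attn, types)
entropy = attention_entropy(attn)
print(mass)     # 0.25
print(entropy)
```

A cross-modal mass near zero in every layer would suggest the shared attention rarely mixes the two token types; a large value would mean the experts receive jointly attended states, which is the scenario the comment asks the authors to characterize.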

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment highlighting the distinction between shared attention and specialized FFN experts. We address the point directly below and commit to revisions that provide the requested evidence.

Referee: Abstract: the claim that splitting only the FFN with token-type gating 'avoids destructive interference' is load-bearing for the central contribution, yet the preceding multi-head self-attention remains fully shared. Item-ID and text tokens therefore produce jointly attended hidden states that are passed to the specialized experts; without attention-pattern analysis, cross-modal weight statistics, or an ablation that isolates the shared attention component, it is unclear whether entanglement is prevented upstream of the gating mechanism.

Authors: We agree that the shared multi-head self-attention produces jointly attended representations, and that this leaves open the possibility of upstream entanglement. Our central claim is narrower: the token-type gating and modality-specific FFN experts prevent the kind of destructive interference that would otherwise occur when a single set of FFN weights must process both modalities. The shared attention is intentional, as it permits limited cross-modal information flow that can be beneficial for recommendation reasoning. To make this distinction explicit and to substantiate the claim, we will revise the manuscript to include (1) attention-map visualizations contrasting intra-modal and cross-modal attention patterns and (2) an ablation that freezes or replaces the shared attention layers while keeping the expert FFNs. These additions will appear in the experimental section and will be referenced from the abstract.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper proposes an architectural change to pretrained LLMs by splitting each block's Feed Forward Network into separate text and item experts routed via token-type gating. This design choice is presented directly as a way to reduce modality interference while treating item IDs as a native dialect. No equations, parameter-fitting steps, or predictions are described that would reduce claims to inputs by construction. The abstract and description contain no self-citation load-bearing arguments, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation. The central claim rests on the proposed split-expert mechanism plus empirical results on public and proprietary datasets, making the derivation self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5719 in / 1105 out tokens · 32246 ms · 2026-05-18T12:17:21.451258+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

Paper passage: "By splitting the Feed Forward Network of each block of a pretrained LLM into a separate text expert and an item expert with token-type gating, our method avoids destructive interference between text and catalog modalities."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1] · 2024

evaluates an LLM as a ranker given a user’s history and a set of candidate items in the prompt. Such methods can achieve competitive performance without task-specific training, demonstrating strong generalization, but are sensitive to prompt design, prone to sequence-order biases, and often ignore subtle interaction semantics. Fine-Tuning and Alignment. To...

[2] · 2024

knowledge entanglement

encodes multimodal item attributes (text, images) into a shared quantitative token space, enhancing cold-start and cross-domain performance. RL-based alignment (Lu et al., 2024) further improves controllability by optimizing instruction-following behavior with preference-based rewards, enabling conversational Friedman et al. (2023); Li et al. (2019); Chen...

[3] · 2019

adopts bidirectional masked modeling for sequences. Transformer extensions and self-supervision. FDSA (Zhang et al., 2019) enriches feature dependencies within Transformers, and S3-Rec (Zhou et al., 2020) pretrains with sequence-aware self-supervision. Table 8: Statistics of Amazon datasets used. Dataset Total sequences Num items Games 42259...

[4]

Experts per FFN block: 2 (ID expert + Text expert)

[5]

Routing: static token-type routing (ID tokens → ID expert; text tokens → Text expert)

[6]

Shared components: attention, LayerNorms, positional embeddings

[7]

Expert widths: Text expert width = 1. ID expert width = 1 for ablations. Tuned for main tables.

[8]

Placement: all layers become MoE for ablations. last-k with 4, 8, 16 is tuned for main results.

[9]

Freezing policy: For Table 1 experiments (text analysis) and ablations, the LLM backbone is frozen. In other small-scale runs we select the best among freeze-all, freeze-text-expert-only, and freeze-attention-only. On the industrial dataset, we freeze everything and train only the item experts and item embeddings.

[10]

Factorized embedding: On Amazon datasets, instead of a single embedding table $E \in \mathbb{R}^{N_{\text{items}} \times d}$, we first project to a lower-dimensional space and then to the model dimension to reduce embedding parameters: $E = W_l W_u$, where $W_l \in \mathbb{R}^{N_{\text{items}} \times d_{\text{mid}}}$ and $W_u \in \mathbb{R}^{d_{\text{mid}} \times d}$.

[11]

For main results (not ablations and not Table 1), we warm up the item expert with item-only sequences for 20% of epochs, then gradually mix in text tokens with a linear schedule. Ablations with LLM-based models and Table 1 do not use this warm-up to ensure fairness. B.6 Results. B.6.1 Proprietary Results. Table 9 shows the results on our industrial dataset. ...

[12]

previous-token attention, A[i, i−1], averaged over valid positions

[13]

attention to the first token, A[:, 0]

[14]

the distance profile, A[i, i−d], as a function of offset d. Figure 6: Layer-wise attention metrics on text-only inputs. Left: previous-token attention. Middle: attention to the first token. Right: attention entropy. MoE (blue) and backbone (orange) overlap across layers, indicating preserved attention geometry. Figure 7: Distance profiles aggre...

[15]

the entropy of the attention distribution over keys per query, averaged over queries. We also aggregate distance profiles over early/mid/late layer blocks for clarity. Figures 6 and 7 show that the MoE model and the pretrained backbone exhibit near-identical attention patterns on text-only inputs across all layers. Layer-wise previous-token bias, first-toke...

[16]

The MoE architecture modifies the feed-forward pathways, while the backbone self-attention blocks remain architecturally unchanged

[17]

The text-only inputs do not activate item-specific experts, so the effective computation path closely matches the backbone. Consequently, attention structure (diagonal strength, range of contextual aggregation) remains stable, even though token-level representations downstream of attention can still differ due to MoE expert routing within the MLPs. Under te...

[18]

Overhead shrinks with sequence length. At short contexts (256 tokens), MoE adds modest training overhead (+6.5% latency, -6.1% tokens/s) and a larger inference overhead (+18.4% latency). As context grows, routing/pack–scatter costs amortize: at 512 tokens the inference overhead drops to +12.5%, and at 1024 tokens it is only +3.8% with no memory increase. ...

[19]

Memory is neutral. Peak GPU memory is within ±0.5G of the dense baseline across all settings, and identical at 1024 tokens for both training (29.4G) and inference (4.67G), consistent with activating one expert per token. IDIOMoE achieves near-parity efficiency at long contexts (≤4% overhead at 1024) and acceptable overheads at short contexts (≈18% at 256...
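Entry [10] above describes a factorized item embedding in which the full table E is the product of a tall matrix W_l and a small matrix W_u. The sketch below illustrates the lookup and the parameter saving in plain Python. The catalog size, `d_mid`, and `d` used in the count are illustrative assumptions (the excerpt does not state them), and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def full_table_params(n_items: int, d: int) -> int:
    """Parameters in a single embedding table E of shape (n_items, d)."""
    return n_items * d


def factorized_params(n_items: int, d_mid: int, d: int) -> int:
    """Parameters in W_l (n_items, d_mid) plus W_u (d_mid, d)."""
    return n_items * d_mid + d_mid * d


def embed(item_index: int, w_l: Matrix, w_u: Matrix) -> List[float]:
    """Look up one item: take its low-dimensional row of W_l, then project it with W_u."""
    low = w_l[item_index]
    d = len(w_u[0])
    return [sum(low[k] * w_u[k][j] for k in range(len(low))) for j in range(d)]


# Toy matrices: 3 items, d_mid = 2, d = 3.
W_L = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
W_U = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(embed(2, W_L, W_U))  # [5.0, 7.0, 9.0]

# Assumed sizes for the count: 100,000 items, d_mid = 64, d = 896.
full = full_table_params(100_000, 896)
factored = factorized_params(100_000, 64, 896)
print(full, factored)  # 89600000 6457344
```

Under these assumed sizes the factorized form needs roughly one fourteenth of the parameters of the full table; the actual saving depends on the catalog size and on the `d_mid` the authors chose.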
