Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders
Mixed citation behavior. Most common role is background (44%).
abstract
Feature engineering has long been central to recommender systems, yet effectively leveraging textual item features remains challenging. Recent advances in large language models (LLMs) have enabled their use as semantic encoders for recommendation, but their roles and behaviors in this setting are still not well understood. Prior studies often rely on general-purpose embedding benchmarks (e.g., MTEB) when selecting LLMs, overlooking the unique characteristics of recommendation tasks. To address this gap, we introduce BLaIR, a comprehensive benchmark for evaluating LLMs as semantic encoders in recommendation scenarios. We contribute (1) a new large-scale Amazon Reviews 2023 dataset with over 570 million reviews and 48 million items, (2) a unified benchmark covering sequential recommendation, collaborative filtering, and product search, and (3) a new complex-query product search task featuring both semi-synthetic and real-world evaluation datasets. Experiments with 11 leading LLMs show that their rankings on BLaIR have little correlation with those on MTEB, highlighting the unique challenges of semantic encoding in recommendation.
citing papers explorer
-
Towards Robust Federated Multimodal Graph Learning under Modality Heterogeneity
FedMPO recovers missing modalities via topology-aware generation, filters noisy recoveries with missing-aware routing, and uses reliability-aware aggregation to achieve up to 5.65% gains over baselines in high-missing and non-IID federated graph settings.
-
RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents
RecoAtlas is a benchmark that evaluates LLM recommendation agents on behavior-grounded metrics for relevance, complementarity, and diversity in addition to semantic coherence.
-
fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery
fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent latents than standard crosscoders on GPT2-Small, Pythia, and Gemma2 models.
-
FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence
FraudBench shows that current multimodal LLMs and specialized AI-image detectors often fail to spot AI-generated fake damage in refund evidence, with true positive rates frequently below 50% on synthetic subsets while producing false positives on real damage.
-
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
-
Expressiveness Limits of Autoregressive Semantic ID Generation in Generative Recommendation
Autoregressive semantic ID generation creates tree-induced probability correlations that prevent generative recommenders from capturing simple patterns; Latte adds latent tokens to relax these correlations.
-
One Pass, Any Order: Position-Invariant Listwise Reranking for LLM-Based Recommendation
InvariRank achieves permutation-invariant listwise reranking for LLM-based recommendations via a structured attention mask that blocks cross-candidate interactions and shared positional framing under RoPE, enabling stable rankings in one forward pass.
-
Breaking the Autoregressive Chain: Hyper-Parallel Decoding for Efficient LLM-Based Attribute Value Extraction
Hyper-Parallel Decoding enables parallel generation of independent sequences in LLMs via position ID manipulation, delivering up to 13.8X speedup for attribute value extraction.
-
HORIZON: A Benchmark for In-the-wild User Behaviour Modeling
HORIZON creates a cross-domain, long-horizon user modeling benchmark from Amazon Reviews that tests generalization across time, domains, and unseen users, exposing gaps in sequential and LLM-based recommendation models.
-
DynLP: Parallel Dynamic Batch Update for Label Propagation in Semi-Supervised Learning
DynLP is a parallel dynamic batch update algorithm for label propagation that achieves significant speedups by updating only relevant parts of the graph on GPUs.
-
GenRecEdit: Adapting Model Editing for Generative Recommendation with Cold-Start Items
GenRecEdit injects cold-start items into generative recommendation models via context-aware token editing and interference-reducing triggers, boosting cold-start accuracy while using only 9.5% of retraining time.
-
ItemRAG: Item-Based Retrieval-Augmented Generation for LLM-Based Recommendation
ItemRAG augments LLM recommendation prompts with item-level retrievals that blend semantic and co-purchase signals, outperforming user-history RAG in both standard and cold-start settings.
-
VoteGCL: Enhancing Graph-based Recommendations with Majority-Voting LLM-Rerank Augmentation
VoteGCL augments graph-based recommendation systems with high-confidence synthetic interactions generated via majority-voting LLM reranks and integrates them into graph contrastive learning to improve accuracy and reduce popularity bias.
-
PipeANN-Filter: An Efficient Filtered Vector Search System on SSD
PipeANN-Filter improves filtered vector search latency and throughput on SSD by exploring a superset of valid vectors identified via probabilistic filters and verifying attributes only after selecting top-k candidates.
-
Conditional Attribute Estimation with Autoregressive Sequence Models
Conditional Attribute Transformers jointly estimate next-token probabilities and conditional attribute values for autoregressive sequence models, enabling credit assignment, counterfactuals, and steerable generation in one pass.
-
Task-Aware Automated User Profile Generation for Recommendation Simulation Using Large Language Models
APG4RecSim automatically generates realistic user profiles for LLM-based recommendation simulations, outperforming manual baselines by up to 7% in nDCG@10 and 8% in JSD on three benchmark datasets.
-
CAMPA: Efficient and Aligned Multimodal Graph Learning via Decoupled Propagation and Aggregation
CAMPA resolves modal conflicts in decoupled multimodal GNNs via cross-modal aligned propagation and trajectory aligned aggregation, outperforming coupled and decoupled baselines on benchmarks while retaining efficiency.
-
LLM Agents Enable User-Governed Personalization Beyond Platform Boundaries
LLM agents enable users to integrate cross-platform and offline data for personalization that outperforms single-platform baselines in proof-of-concept tests.
-
Bridging Textual Profiles and Latent User Embeddings for Personalization
BLUE aligns LLM-generated textual user profiles with embedding-based recommendation objectives via reinforcement learning and next-item text supervision, yielding better zero-shot performance and cross-domain transfer than baselines.
-
PREFER: Personalized Review Summarization with Online Preference Learning
PREFER is an online preference learning system that generates personalized review summaries and improves alignment with user interests in simulations on Amazon review data.
-
One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving
HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.
-
Decision-aware User Simulation Agent for Evaluating Conversational Recommender Systems
Hesitator is a theory-grounded simulator that separates utility-based item selection from overload-aware commitment decisions to reduce unrealistically high acceptance rates in conversational recommender evaluations.
-
From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems
A unified benchmark of eleven counterfactual explanation (CE) methods shows that effectiveness-sparsity trade-offs vary by method and format, that performance is consistent from item to list level, and that graph-based explainers face scalability limits on large graphs.
-
Self-Distilled Reinforcement Learning for Co-Evolving Agentic Recommender Systems
CoARS enables co-evolving recommender and user agents by using interaction-derived rewards and self-distilled credit assignment to internalize multi-turn feedback into model parameters, outperforming prior agentic baselines.
-
PeReGrINE: Evaluating Personalized Review Fidelity with User Item Graph Context
PeReGrINE is a graph-based benchmark that restructures Amazon Reviews 2023 with temporal cutoffs and introduces dissonance analysis to measure how well retrieval-conditioned models match user style and product consensus.
-
TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning
TRU is a plug-and-play unlearning method for multimodal recommenders that applies ranking fusion, modality scaling, and layer isolation to achieve better retain-forget trade-offs than uniform baselines.
-
Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network
Introduces FraudSquad, a hybrid model using language model embeddings and a gated graph transformer that outperforms baselines on newly created LLM-generated spam review datasets.
-
SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding
SessionIntentBench is a large-scale multimodal benchmark for inter-session intention-shift modeling in e-commerce, with 1.95M intention entries and human-annotated gold labels showing current L(V)LMs struggle but improve when intention is injected.
-
Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target
ABPO combines group-relative policy optimization with anchored exposure correction and asymmetric feedback handling to enable effective continual updates for LLM recommenders under bandit feedback constraints.
-
RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching
RcLLM accelerates generative recommendation inference by 1.31x-9.51x in TTFT through beyond-prefix KV caching, replicated user caches, sharded item caches, affinity scheduling, and selective attention with negligible accuracy loss.
-
Stable Multimodal Graph Unlearning via Feature-Dimension Aware Quantile Selection
FDQ improves stability in multimodal graph unlearning by using feature-dimension aware quantile selection to protect sensitive high-dimensional layers while preserving utility and enabling effective forgetting.
-
Rethinking Semantic Collaborative Integration: Why Alignment Is Not Enough
Semantic and collaborative representations show low item-level overlap on sparse data, so global alignment suppresses complementary signals and a shared-plus-private fusion design is needed instead.
-
Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation
HaNoRec dynamically weights harder preference samples and applies Gaussian perturbations to output distributions to improve multimodal LLM performance on sequential recommendation tasks.
-
Learning Decomposed Contextual Token Representations from Pretrained and Collaborative Signals for Generative Recommendation
DECOR learns decomposed contextual token representations by combining pretrained semantics with collaborative signals to fix objective misalignment in two-stage generative recommendation systems.
-
To GPU or Not to GPU: Vector Search in Relational Engines
Relational engines achieve faster SQL+vector-search queries on GPU than CPU when using compact vector indexes and fast interconnects, reversing the CPU-only design in current systems.
-
Multistakeholder Impacts of Profile Portability in a Recommender Ecosystem
Data portability scenarios in algorithmic pluralism produce varying effects on user utility across different recommendation algorithms.
-
Verbalized Algorithms: Classical Algorithms are All You Need (Mostly)