Deep Learning Recommendation Model for Personalization and Recommendation Systems
With the advent of deep learning, neural network-based recommendation models have emerged as an important tool for tackling personalization and recommendation tasks. These networks differ significantly from other deep learning networks because they must handle categorical features, and they are not well studied or understood. In this paper, we develop a state-of-the-art deep learning recommendation model (DLRM) and provide its implementation in both the PyTorch and Caffe2 frameworks. In addition, we design a specialized parallelization scheme that uses model parallelism on the embedding tables to mitigate memory constraints while exploiting data parallelism to scale out compute from the fully-connected layers. We compare DLRM against existing recommendation models and characterize its performance on the Big Basin AI platform, demonstrating its usefulness as a benchmark for future algorithmic experimentation and system co-design.
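The parallelization scheme described in the abstract can be illustrated with a small, framework-free sketch. It is not the paper's implementation: the function names (`assign_tables`, `split_batch`, `exchange_plan`), the round-robin table placement, and the explicit exchange step between the two phases are all assumptions made for illustration. The sketch only shows which worker owns which embedding table (model parallelism), which worker processes which samples in the fully-connected layers (data parallelism), and which looked-up vectors would have to move between workers as a result.

```python
from typing import Dict, List, Tuple


def assign_tables(num_tables: int, num_workers: int) -> Dict[int, List[int]]:
    """Model parallelism: each worker owns whole embedding tables.

    Round-robin placement is an assumption of this sketch.
    """
    owners: Dict[int, List[int]] = {w: [] for w in range(num_workers)}
    for table in range(num_tables):
        owners[table % num_workers].append(table)
    return owners


def split_batch(batch_size: int, num_workers: int) -> Dict[int, range]:
    """Data parallelism: each worker gets a contiguous slice of the batch."""
    base, extra = divmod(batch_size, num_workers)
    slices: Dict[int, range] = {}
    start = 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        slices[w] = range(start, start + size)
        start += size
    return slices


def exchange_plan(
    owners: Dict[int, List[int]], slices: Dict[int, range]
) -> Dict[int, List[Tuple[int, int, int]]]:
    """List, per destination worker, the (source, table, sample) vectors it needs.

    Every worker needs the embedding of every table for its own samples,
    so each table owner sends results to every sample owner.
    """
    plan: Dict[int, List[Tuple[int, int, int]]] = {dst: [] for dst in slices}
    for src, tables in owners.items():
        for dst, samples in slices.items():
            for table in tables:
                for sample in samples:
                    plan[dst].append((src, table, sample))
    return plan


# Hypothetical toy configuration: 5 embedding tables, batch of 4, 2 workers.
owners = assign_tables(num_tables=5, num_workers=2)
slices = split_batch(batch_size=4, num_workers=2)
plan = exchange_plan(owners, slices)

for worker in sorted(plan):
    print(f"worker {worker}: owns tables {owners[worker]}, "
          f"samples {list(slices[worker])}, receives {len(plan[worker])} vectors")
```

In this toy configuration each worker ends up with one vector per table for each of its samples, which is why the memory-heavy tables can stay sharded while the fully-connected layers are replicated and fed different slices of the batch.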
Forward citations
Cited by 28 Pith papers
-
Agentic Recommender System with Hierarchical Belief-State Memory
MARS uses hierarchical memory and LLM planning to achieve 26.4% higher HR@1 on InstructRec benchmarks compared to prior methods.
-
TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals
TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
-
Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
-
Tencent Advertising Algorithm Challenge 2025: All-Modality Generative Recommendation
Releases TencentGR-1M and TencentGR-10M datasets with baselines for all-modality generative recommendation in advertising, including weighted evaluation for conversions.
-
Agentic Recommender System with Hierarchical Belief-State Memory
MARS uses hierarchical event-preference-profile memory with an LLM-scheduled lifecycle of six operations to achieve state-of-the-art results on InstructRec benchmarks.
-
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
Chakra introduces a portable, interoperable graph-based execution trace format for distributed ML workloads along with supporting tools to standardize performance benchmarking and software-hardware co-design.
-
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
Chakra introduces a standardized graph-based execution trace representation for distributed ML workloads along with supporting tools to enable benchmarking, analysis, generation, and co-design across simulators and hardware.
-
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
-
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.
-
LLM Agents Enable User-Governed Personalization Beyond Platform Boundaries
LLM agents enable users to integrate cross-platform and offline data for personalization that outperforms single-platform baselines in proof-of-concept tests.
-
One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving
HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.
-
RecFlash: Fast Recommendation System on In-Storage Computing with Frequency-Based Data Mapping
RecFlash uses frequency-based data remapping in NAND flash in-storage computing to improve recommendation inference latency by up to 81% and energy consumption by 91.9% over prior ISC architectures.
-
LayerPipe2: Multistage Pipelining and Weight Recompute via Improved Exponential Moving Average for Training Neural Networks
LayerPipe2 derives per-layer delay assignments for multistage pipelined training and uses an improved moving average to recompute past weights without explicit storage.
-
SilverTorch: A Unified Model-based System to Democratize Large-Scale Recommendation on GPUs
SilverTorch replaces standalone ANN indexing and filtering with a unified GPU model using a model-based Bloom index and fused Int8 ANN kernel, delivering up to 23.7x throughput and 13.35x cost efficiency gains on indu...
-
Make It Long, Keep It Fast: End-to-End 10K Long User Behavior Sequence Modeling for Billion-Scale Douyin Recommendation
Douyin deploys stacked target-to-history cross attention and request-level batching to scale end-to-end recommendation modeling to 10k-length histories, observing scaling-law gains and live engagement improvements.
-
Learning from Natural Language Feedback for Personalized Question Answering
VAC replaces scalar rewards with natural language feedback in an alternating training loop between a feedback model and a policy model, yielding better personalized QA on the LaMP-QA benchmark.
-
TrainMover: An Interruption-Resilient Runtime for ML Training
TrainMover achieves ~20s downtime for interruptions in 1024-GPU LLM training via two-phase delta-based communication setup, communication-free sandboxed warmup, and general standby design, projecting 55% reduction in ...
-
LLM Retrieval for Stable and Predictable Ad Recommendations
LLM-based semantic retrieval with hierarchical attributes and graph expansion improves stability and predictability in industrial ad recommendation systems.
-
FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation
FLUID retires candidate-side item IDs in production livestream rankers via cross-domain multimodal hierarchical codes and late-fusion ID-free design, reporting online gains of +0.55% Quality Watch Duration and +2.05% ...
-
Recommender Systems as Control Systems
Modeling recommender systems as control systems shows that time-optimized fairness interventions can improve overall long-term performance rather than merely trading off against utility.
-
SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data
SURGE achieves fixed-batch throughput for GPU embedding generation on 800M texts across 40k partitions using 12.6x less memory, 68x faster time-to-first-output, and fault tolerance via a streaming two-threshold policy...
-
Intelligent Elastic Feature Fading: Enabling Model Retrain-Free Feature Efficiency Rollouts at Scale
IEFF enables retrain-free feature efficiency rollouts in ranking systems by elastically controlling feature coverage at serving time, achieving 5x faster rollouts, zero retraining GPU cost, and 50-55% less performance...
-
SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling
SOLARIS speculatively precomputes user-item latent representations to decouple large-model inference from real-time serving, delivering 0.67% revenue gain when deployed in Meta's ad system.
-
Beyond Dense Connectivity: Explicit Sparsity for Scalable Recommendation
SSR uses static random filters and iterative competitive sparse mechanisms to explicitly enforce sparsity in recommendation models, outperforming dense baselines on public and billion-scale industrial datasets.
-
Make It Long, Keep It Fast: End-to-End 10K Long User Behavior Sequence Modeling for Billion-Scale Douyin Recommendation
Introduces STCA for linear-complexity target-to-history attention, RLB for shared user encoding across targets, and length-extrapolative training to enable end-to-end 10K sequence modeling with observed scaling-law ga...
-
Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators
Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.
-
Joint Model Parameter Scaling and Universal-Domain Data Integration for E-commerce Search Ranking
UniScale couples entire-space data construction with a hierarchical fusion transformer to improve scaling behavior and deliver 1.70% purchase and 2.04% GMV lifts in large-scale e-commerce search A/B tests.
-
A General Framework for Multimodal LLM-Based Multimedia Understanding in Large-Scale Recommendation Systems
A framework integrates MM-LLMs into recommendation systems via caption generation as categorical features, reporting 0.35% offline AUC lift and 0.02% online metric improvement.