hub Canonical reference
Azzolini et al.
Canonical reference. 86% of citing Pith papers cite this work as background.
abstract
With the advent of deep learning, neural network-based recommendation models have emerged as an important tool for tackling personalization and recommendation tasks. These networks differ significantly from other deep learning networks due to their need to handle categorical features and are not well studied or understood. In this paper, we develop a state-of-the-art deep learning recommendation model (DLRM) and provide its implementation in both PyTorch and Caffe2 frameworks. In addition, we design a specialized parallelization scheme utilizing model parallelism on the embedding tables to mitigate memory constraints while exploiting data parallelism to scale-out compute from the fully-connected layers. We compare DLRM against existing recommendation models and characterize its performance on the Big Basin AI platform, demonstrating its usefulness as a benchmark for future algorithmic experimentation and system co-design.
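The parallelization scheme named in the abstract can be illustrated with a small single-process simulation. This is a hedged sketch, not the paper's implementation: the device count, table sizes, round-robin table placement, and the `lookup_on_owner` and `all_to_all` helpers are all assumptions made for illustration. It shows embedding tables placed on separate simulated devices (model parallelism), followed by an exchange that leaves each device holding every table's embeddings for its own slice of the batch, which is the layout a data-parallel fully-connected stack would consume.

```python
# Hypothetical sketch: names, sizes, and round-robin placement are assumptions,
# not taken from the DLRM code base.
import random

NUM_DEVICES = 2
NUM_TABLES = 4        # one embedding table per categorical feature
ROWS_PER_TABLE = 10
EMBED_DIM = 3
BATCH_SIZE = 4        # assumed divisible by NUM_DEVICES

random.seed(0)

# Model parallelism: each table lives on exactly one device (round-robin).
table_owner = {t: t % NUM_DEVICES for t in range(NUM_TABLES)}
tables = {
    t: [[random.random() for _ in range(EMBED_DIM)] for _ in range(ROWS_PER_TABLE)]
    for t in range(NUM_TABLES)
}

# One categorical index per (sample, table).
batch = [
    [random.randrange(ROWS_PER_TABLE) for _ in range(NUM_TABLES)]
    for _ in range(BATCH_SIZE)
]


def lookup_on_owner(device):
    """Look up the full batch, but only in the tables this device owns."""
    return {
        t: [tables[t][sample[t]] for sample in batch]
        for t in range(NUM_TABLES)
        if table_owner[t] == device
    }


def all_to_all(per_device_lookups):
    """Redistribute so each device holds all tables for its batch slice."""
    slice_size = BATCH_SIZE // NUM_DEVICES
    shards = []
    for device in range(NUM_DEVICES):
        start = device * slice_size
        shard = [
            [per_device_lookups[table_owner[t]][t][i] for t in range(NUM_TABLES)]
            for i in range(start, start + slice_size)
        ]
        shards.append(shard)
    return shards


lookups = [lookup_on_owner(d) for d in range(NUM_DEVICES)]
shards = all_to_all(lookups)

# Data parallelism: each device would now run its own replica of the
# fully-connected layers on shards[device].
for device, shard in enumerate(shards):
    print(f"device {device}: {len(shard)} samples x {len(shard[0])} embeddings")
```

In a real multi-GPU setup the exchange would be a collective communication step rather than list indexing; the simulation only shows which data ends up where.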
citing papers explorer
- TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals
  TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
- Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
  Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
- Tencent Advertising Algorithm Challenge 2025: All-Modality Generative Recommendation
  Releases TencentGR-1M and TencentGR-10M datasets with baselines for all-modality generative recommendation in advertising, including weighted evaluation for conversions.
- Agentic Recommender System with Hierarchical Belief-State Memory
  MARS uses hierarchical event-preference-profile memory with an LLM-scheduled lifecycle of six operations to achieve state-of-the-art results on InstructRec benchmarks.
- MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
  Chakra introduces a standardized graph-based execution trace representation for distributed ML workloads along with supporting tools to enable benchmarking, analysis, generation, and co-design across simulators and hardware.
- LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
  LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
- LayerPipe2: Multistage Pipelining and Weight Recompute via Improved Exponential Moving Average for Training Neural Networks
  LayerPipe2 derives per-layer delay assignments for multistage pipelined training and uses an improved moving average to recompute past weights without explicit storage.
- SilverTorch: A Unified Model-based System to Democratize Large-Scale Recommendation on GPUs
  SilverTorch replaces standalone ANN indexing and filtering with a unified GPU model using a model-based Bloom index and fused Int8 ANN kernel, delivering up to 23.7x throughput and 13.35x cost efficiency gains on industry data.
- Learning from Natural Language Feedback for Personalized Question Answering
  VAC replaces scalar rewards with natural language feedback in an alternating training loop between a feedback model and a policy model, yielding better personalized QA on the LaMP-QA benchmark.
- TrainMover: An Interruption-Resilient Runtime for ML Training
  TrainMover achieves ~20s downtime for interruptions in 1024-GPU LLM training via two-phase delta-based communication setup, communication-free sandboxed warmup, and general standby design, projecting 55% reduction in wasted GPU hours.
- LLM Agents Enable User-Governed Personalization Beyond Platform Boundaries
  LLM agents enable users to integrate cross-platform and offline data for personalization that outperforms single-platform baselines in proof-of-concept tests.
- One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving
  HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.
- RecFlash: Fast Recommendation System on In-Storage Computing with Frequency-Based Data Mapping
  RecFlash uses frequency-based data remapping in NAND flash in-storage computing to improve recommendation inference latency by up to 81% and energy consumption by 91.9% over prior ISC architectures.
- LLM Retrieval for Stable and Predictable Ad Recommendations
  LLM-based semantic retrieval with hierarchical attributes and graph expansion improves stability and predictability in industrial ad recommendation systems.
- Make It Long, Keep It Fast: End-to-End 10K Long User Behavior Sequence Modeling for Billion-Scale Douyin Recommendation
  Introduces STCA for linear-complexity target-to-history attention, RLB for shared user encoding across targets, and length-extrapolative training to enable end-to-end 10K sequence modeling with observed scaling-law gains and production deployment improvements.
- Recommender Systems as Control Systems
  Modeling recommender systems as control systems shows that time-optimized fairness interventions can improve overall long-term performance rather than merely trading off against utility.
- SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data
  SURGE achieves fixed-batch throughput for GPU embedding generation on 800M texts across 40k partitions using 12.6x less memory, 68x faster time-to-first-output, and fault tolerance via a streaming two-threshold policy with an analytical cost model accurate to 2%.
- Intelligent Elastic Feature Fading: Enabling Model Retrain-Free Feature Efficiency Rollouts at Scale
  IEFF enables retrain-free feature efficiency rollouts in ranking systems by elastically controlling feature coverage at serving time, achieving 5x faster rollouts, zero retraining GPU cost, and 50-55% less performance degradation than abrupt feature removal.
- SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling
  SOLARIS speculatively precomputes user-item latent representations to decouple large-model inference from real-time serving, delivering 0.67% revenue gain when deployed in Meta's ad system.
- Beyond Dense Connectivity: Explicit Sparsity for Scalable Recommendation
  SSR uses static random filters and iterative competitive sparse mechanisms to explicitly enforce sparsity in recommendation models, outperforming dense baselines on public and billion-scale industrial datasets.
- Joint Model Parameter Scaling and Universal-Domain Data Integration for E-commerce Search Ranking
  UniScale couples entire-space data construction with a hierarchical fusion transformer to improve scaling behavior and deliver 1.70% purchase and 2.04% GMV lifts in large-scale e-commerce search A/B tests.
- Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators
  Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.
- A General Framework for Multimodal LLM-Based Multimedia Understanding in Large-Scale Recommendation Systems
  A framework integrates MM-LLMs into recommendation systems via caption generation as categorical features, reporting 0.35% offline AUC lift and 0.02% online metric improvement.
- FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation