HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, and improve online A/B metrics by 12.4%.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2024 2representative citing papers
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
citing papers explorer
-
Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, and improve online A/B metrics by 12.4%.
-
A Survey on Knowledge Distillation of Large Language Models
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.