Meta’s official technical blog about the DLRM (Deep Learning Recommendation Model) proposed in 2019: https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model.
- [arXiv ‘19] Deep Learning Recommendation Model for Personalization and Recommendation Systems
- [arXiv ‘20] Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems (Zion server)
Meta originally used CPU servers to train and serve its personalized recommendation models. To meet throughput/latency requirements and SLAs, it has since moved to GPU servers. Challenges remain, however, such as the large memory capacity demanded by embedding tables and the irregular memory access pattern of the embedding-table lookup procedure.
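The irregular access pattern comes from the gather-and-pool structure of the lookup itself. A minimal pure-Python sketch (toy table sizes, illustrative only, not any production kernel):

```python
# Minimal sketch of the embedding-lookup (gather + sum-pool) pattern that
# dominates DLRM memory traffic. Table and feature ids are toy values.

def embedding_lookup(table, indices):
    """Gather the rows at `indices` and sum-pool them into one vector.

    The gathers hit data-dependent, non-contiguous rows of a table that
    can be hundreds of GB in production, which is why the access pattern
    is irregular and cache-unfriendly.
    """
    dim = len(table[0])
    pooled = [0.0] * dim
    for idx in indices:              # random, data-dependent row reads
        row = table[idx]
        for d in range(dim):
            pooled[d] += row[d]
    return pooled

# Toy table: 8 rows of dimension 4, row r filled with the value r.
table = [[float(r)] * 4 for r in range(8)]
print(embedding_lookup(table, [1, 5, 2]))  # -> [8.0, 8.0, 8.0, 8.0]
```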
Research papers
Introduce performance concerns of DLRM training / inference, guiding the hardware (xPU topology, network interconnection, etc.) and system configurations (software stack, distributed training policy, embedding table placement, etc.) in the datacenter.
- [HPCA ‘20] The Architectural Implications of Facebook’s DNN-Based Personalized Recommendation
- [HPCA ‘21] Understanding Training Efficiency of Deep Learning Recommendation Models at Scale
- [ISCA ‘22] Software-hardware co-design for fast and scalable training of deep learning recommendation models
Facebook’s ongoing production practice on hyper-parameter selection & tuning and system configuration for the best DLRM performance.
- [MLSys ‘20] Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems (Baidu)
A three-tier hierarchical workflow that exploits device HBM, host memory, and SSD to hold the large-scale embedding tables
- [ASPLOS ‘21] MERCI: efficient embedding reduction on commodity hardware via sub-query memoization
Preprocesses the embedding table and stores partial reduction results for frequently co-occurring features. Requires extra memory capacity but issues fewer runtime memory accesses.
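A hedged sketch of the sub-query-memoization idea: the pooled (summed) embedding of a frequently co-occurring feature group is precomputed once, so a query touches one memoized vector instead of one row per feature. Group selection here is a hand-picked list; MERCI's actual clustering is far more sophisticated.

```python
# Toy illustration of sub-query memoization for embedding reduction.
# `frequent_groups` is assumed to come from an offline co-occurrence
# analysis (hand-picked here).

def build_memo(table, frequent_groups):
    """Precompute the pooled embedding of each frequent feature group."""
    dim = len(table[0])
    memo = {}
    for group in frequent_groups:
        partial = [0.0] * dim
        for idx in group:
            for d in range(dim):
                partial[d] += table[idx][d]
        memo[group] = partial
    return memo

def pooled_lookup(table, memo, query):
    """Sum-pool a query, substituting memoized partial sums when possible."""
    remaining = set(query)
    dim = len(table[0])
    out = [0.0] * dim
    for group, partial in memo.items():
        if set(group) <= remaining:     # whole group occurs in this query
            for d in range(dim):
                out[d] += partial[d]    # one access replaces len(group) gathers
            remaining -= set(group)
    for idx in remaining:               # fall back to per-row gathers
        for d in range(dim):
            out[d] += table[idx][d]
    return out

table = [[float(r)] * 2 for r in range(8)]
memo = build_memo(table, [(1, 5)])
print(pooled_lookup(table, memo, [1, 5, 2]))  # -> [8.0, 8.0]
```

The result is identical to the naive gather-and-sum; only the number of table accesses changes, which is the paper's memory-capacity-for-bandwidth trade.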
- [RecSys ‘22] Merlin HugeCTR: GPU-accelerated Recommender System Training and Inference (NVIDIA)
NVIDIA’s account of its HugeCTR system design and the HPS (Hierarchical Parameter Server)
- [ASPLOS ‘22] RecShard: statistical feature-based memory optimization for industry-scale neural recommendation
Estimates embedding-table access patterns from input feature statistics. Binary-encoded sparse inputs are hashed before the embedding-table lookup.
- [EuroSys ‘22] Fleche: an efficient GPU embedding cache for personalized recommendations
Unified query keys (a flattened embedding-table cache with slab hashing) and query kernel fusion to sidestep the pronounced host-side cache-management overhead
- [ISCA ‘22] Software-hardware co-design for fast and scalable training of deep learning recommendation models
4D parallelism (table, row, column, and data) for the embedding operator, plus ZionEX, a SW-HW co-designed system that maximizes offline training throughput.
- [KDD ‘22] Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters
Uses hybrid parameter updating: synchronous for the dense part and asynchronous for the embeddings. A distributed heterogeneous (CPU + GPU) training system is then designed and optimized for this algorithm, i.e., a parameter server for the CPU embedding nodes and Allreduce for the GPU dense nodes.
- [ASPLOS ‘23] GRACE: A Scalable Graph-Based Approach to Accelerating Recommendation Model Inference
Models precomputed reduction results as a graph problem and proposes a greedy algorithm that iteratively groups nodes (items) into clusters (items that frequently co-occur in one query). The system is also heterogeneous-memory-aware.
- [ISCA ‘23] Optimizing CPU Performance for Recommendation Systems At-Scale
Combines software prefetching and manual hyper-threading to speed up CPU DLRM inference and alleviate the inefficiency caused by irregular memory accesses in the embedding-table lookup phase.
- [OSDI ‘23] AdaEmbed: Adaptive Embedding for Large-Scale Recommendation Models
Prunes embedding features during training, even yielding noticeable accuracy gains.
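A toy illustration of in-training embedding pruning. The per-row importance metric here is just a score the caller supplies (e.g. an access count); AdaEmbed's real criterion also factors in gradient information, and pruning runs continuously during training rather than once.

```python
# Budget-based row pruning sketch (simplified stand-in for AdaEmbed).

def select_rows_to_keep(importance, budget_rows):
    """Keep the `budget_rows` highest-importance rows; prune the rest.

    `importance` maps row id -> importance score (assumed precomputed).
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    return set(ranked[:budget_rows])

importance = {0: 5.0, 1: 0.5, 2: 3.0, 3: 0.1}
print(select_rows_to_keep(importance, 2))  # the two hottest rows: 0 and 2
```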
- [SOSP ‘23] Bagpipe: Accelerating Deep Recommendation Model Training
Proposes a caching and prefetching mechanism for offline DLRM training, based on a logically replicated, physically partitioned distributed cache design.
Architectural research
- [MICRO ‘21] RecPipe: Co-designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance
A simulated accelerator design for multi-stage RM inference; model quality measured by NDCG
- [ISCA ‘20] Centaur: a chiplet-based, hybrid sparse-dense accelerator for personalized recommendations
- [ISCA ‘21] SPACE: locality-aware processing in heterogeneous memory for personalized recommendations
- [ISCA ‘23] MTIA: First Generation Silicon Targeting Meta’s Recommendation Systems
- [ICS ‘23] Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training
Exploits the SmartNIC’s on-chip resources: caches for popular embeddings and cores for low-arithmetic-intensity computation. An FPGA serves as the SmartNIC.
- [ATC ‘22] FpgaNIC: An FPGA-based Versatile 100Gb SmartNIC for GPUs
Useful links / posts / articles
- https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41126/
- https://www.zhihu.com/tardis/zm/art/82839874 (Chinese)
- https://www.cnblogs.com/rossiXYZ/p/15897877.html (Chinese, source code intro of HugeCTR)
- NVIDIA DLRM PyTorch examples
- Software Prefetching
- Intel Prefetch Intrinsics