Graph Embeddings at Scale

Anish Khazane; C. Bayan Bruss; Jonathan Rider; Keegan E. Hines; Richard Serpe; Saurabh Nagrecha

arxiv: 1907.01705 · v1 · pith:RO5BA4DRnew · submitted 2019-07-03 · 💻 cs.LG · stat.ML

Graph Embeddings at Scale

C. Bayan Bruss , Anish Khazane , Jonathan Rider , Richard Serpe , Saurabh Nagrecha , Keegan E. Hines This is my paper

Pith reviewed 2026-05-25 10:22 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords graph embeddingsskip-gramdistributed traininglarge-scale graphslink predictionvertex representationsFriendster network

0 comments

The pith

A distributed infrastructure scales skip-gram embeddings to graphs of 68 million vertices by avoiding any graph partitioning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a distributed system that trains vertex embeddings on very large networks without splitting the graph across machines. It dynamically builds limited-size computation graphs on workers and uses fast indexing to update the embeddings. The authors demonstrate this on the public Friendster network and a private 50-million-vertex graph, tracking both how accurately the embeddings predict links and how quickly the training converges. They also report that adding more workers improves both accuracy and speed. The approach matters because many industrial graphs are too large for single-machine training yet too awkward to partition without losing performance.

Core claim

The system scales the skip-gram algorithm to graphs containing tens of millions of vertices by completely avoiding graph partitioning, instead dynamically creating size-constrained computational graphs across worker nodes and relying on highly efficient indexing operations to update embeddings.

What carries the argument

Distributed infrastructure that avoids graph partitioning by dynamically creating size-constrained computational graphs and using efficient indexing for embedding updates.

If this is right

Skip-gram embeddings become feasible on graphs with 50-70 million vertices using commodity distributed hardware.
Link-prediction accuracy and convergence rate both improve as the number of worker nodes increases.
The same infrastructure can be applied to other existing embedding algorithms beyond skip-gram.
Heterogeneous graphs of industrial size can be embedded without manual partitioning decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could reduce the engineering cost of deploying embeddings in recommendation or ranking pipelines that already use large social or interaction graphs.
If the indexing and dynamic-graph approach generalizes, it may allow rapid experimentation with new embedding objectives on the same infrastructure.
Downstream tasks that rely on vertex similarity, such as community detection, may see improved results once embeddings are trained at full graph scale.

Load-bearing premise

That a system which never partitions the graph but instead creates temporary small computation graphs on workers will still produce high-quality embeddings at this scale.

What would settle it

Run the system on the Friendster graph and obtain link-prediction accuracy no better than a random baseline or observe training that fails to converge within a reasonable number of iterations.

Figures

Figures reproduced from arXiv: 1907.01705 by Anish Khazane, C. Bayan Bruss, Jonathan Rider, Keegan E. Hines, Richard Serpe, Saurabh Nagrecha.

**Figure 2.** Figure 2: Performance benchmarks on large scale graphs. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Experiments run on the Friendster graph 5 CONCLUSION In this paper, we present a distributed architecture for training embeddings on large-scale graphs. We first explored the infrastructural advantages and disadvantages of other systems in industry that compute recommendations or embeddings at scale. We then presented our own approach, and delved into experimental results from scaling the skip-gram algor… view at source ↗

read the original abstract

Graph embedding is a popular algorithmic approach for creating vector representations for individual vertices in networks. Training these algorithms at scale is important for creating embeddings that can be used for classification, ranking, recommendation and other common applications in industry. While industrial systems exist for training graph embeddings on large datasets, many of these distributed architectures are forced to partition copious amounts of data and model logic across many worker nodes. In this paper, we propose a distributed infrastructure that completely avoids graph partitioning, dynamically creates size constrained computational graphs across worker nodes, and uses highly efficient indexing operations for updating embeddings that allow the system to function at scale. We show that our system can scale an existing embeddings algorithm - skip-gram - to train on the open-source Friendster network (68 million vertices) and on an internal heterogeneous graph (50 million vertices). We measure the performance of our system on two key quantitative metrics: link-prediction accuracy and rate of convergence. We conclude this work by analyzing how a greater number of worker nodes actually improves our system's performance on the aforementioned metrics and discuss our next steps for rigorously evaluating the embedding vectors produced by our system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper describes an engineering system for running skip-gram embeddings on 68M-node graphs without explicit partitioning, but the scaling evidence stays high-level and leaves the communication and indexing costs unexamined.

read the letter

The main thing here is a distributed training setup for skip-gram that skips the usual graph partitioning step. Instead it builds dynamic, size-limited computational graphs on the fly and leans on fast indexing to update embeddings across workers. They run this on the public Friendster graph (68M vertices) and a 50M-vertex internal heterogeneous graph, tracking link-prediction accuracy and convergence speed as more workers are added. That infrastructure choice is the concrete contribution; the underlying random-walk plus negative-sampling algorithm is unchanged from prior work. The results show that adding workers improves both metrics, which is at least consistent with the no-partitioning design working at these sizes. Credit is due for shipping a working system on real industrial-scale data rather than toy graphs. The soft spot is that the abstract (and the stress-test note) give almost no numbers on walk-generation time, per-epoch communication volume, or memory footprint once the indexing layer is in play. Without those curves it is difficult to know whether the claimed avoidance of partitioning actually removes the bottleneck or simply relocates it into the indexing and dynamic-graph machinery. The paper also defers rigorous embedding-quality checks to future work, so the link-prediction numbers are the only quantitative anchor provided. This is the kind of systems paper that industrial groups working on recommendation or ranking at tens of millions of nodes would want to read for the design pattern. A referee could usefully press on the missing scaling diagnostics and ask for a clearer comparison against partitioned baselines. I would send it to review rather than desk-reject; the claim is practically relevant even if the current evidence is thin.

Referee Report

2 major / 1 minor

Summary. The paper proposes a distributed infrastructure for training skip-gram graph embeddings that avoids graph partitioning by dynamically creating size-constrained computational graphs across workers and relying on efficient indexing for embedding updates. It claims this system scales an existing skip-gram implementation to the Friendster network (68M vertices) and an internal heterogeneous graph (50M vertices), with performance measured on link-prediction accuracy and rate of convergence; the abstract further states that increasing the number of workers improves these metrics.

Significance. If the no-partitioning design and scaling results hold with proper verification, the work would offer a practical systems contribution for industrial-scale graph embedding training by reducing partitioning overhead. The focus on an unmodified skip-gram algorithm and quantitative metrics on real large graphs is a strength, but the absence of supporting implementation details, baselines, and scaling data in the provided description limits assessable impact.

major comments (2)

[Abstract] Abstract: the central claim that the system 'completely avoids graph partitioning' while scaling skip-gram to 68M vertices rests on unshown experiments; no description is given of how dynamic graph creation and indexing handle adjacency information or random walks without full replication or worker-count-dependent communication volume, which is load-bearing for the scalability assertion.
[Abstract] Abstract: the reported improvements in link-prediction accuracy and convergence rate with more workers lack error bars, baseline comparisons (e.g., against partitioned systems), or any data on walk generation time and per-epoch communication costs, preventing verification that the design functions at the claimed scale without prohibitive overheads.

minor comments (1)

The abstract concludes by deferring 'rigorous evaluation' of the produced embeddings without specifying the planned metrics, datasets, or comparison methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point-by-point below, indicating where revisions will be made to strengthen the presentation of our no-partitioning approach and experimental claims.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that the system 'completely avoids graph partitioning' while scaling skip-gram to 68M vertices rests on unshown experiments; no description is given of how dynamic graph creation and indexing handle adjacency information or random walks without full replication or worker-count-dependent communication volume, which is load-bearing for the scalability assertion.

Authors: The body of the manuscript (Sections 3 and 4) describes the dynamic creation of size-constrained computational graphs across workers and the use of efficient indexing for embedding updates. However, we agree that the abstract does not sufficiently elaborate on the handling of adjacency information and random walks to demonstrate the absence of full replication or worker-dependent communication volume. We will revise the abstract and add a clarifying subsection in the methods to explicitly address these mechanisms and their implications for scalability. revision: yes
Referee: [Abstract] Abstract: the reported improvements in link-prediction accuracy and convergence rate with more workers lack error bars, baseline comparisons (e.g., against partitioned systems), or any data on walk generation time and per-epoch communication costs, preventing verification that the design functions at the claimed scale without prohibitive overheads.

Authors: Section 5 of the manuscript presents the link-prediction accuracy and convergence results as a function of worker count. We acknowledge that the abstract claims would benefit from additional supporting details. In revision we will add error bars to the relevant figures, report walk generation times and per-epoch communication costs, and include a discussion of related partitioned baselines. Comprehensive new baseline experiments may require additional runs and will be noted as such. revision: partial

Circularity Check

0 steps flagged

No circularity: systems paper with no derivations or fitted predictions

full rationale

The paper presents a distributed systems architecture for scaling skip-gram embeddings on large graphs (Friendster 68M vertices, internal 50M) using dynamic computational graphs and indexing, with no equations, parameter fits, or mathematical derivations. Performance is measured empirically via link-prediction accuracy and convergence rate; the central claims are engineering feasibility and scaling behavior, not a closed-form result that reduces to its inputs. No self-citations load-bear any uniqueness theorem or ansatz. This is self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations, parameters, or postulates are present. The contribution is described as an engineering system rather than a mathematical derivation.

pith-pipeline@v0.9.0 · 5736 in / 1124 out tokens · 27158 ms · 2026-05-25T10:22:30.984585+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 5 internal anchors

[1]

Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. 2018. A com- prehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30, 9 (2018), 1616–1637

work page 2018
[2]

Christopher M De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. 2015. Taming the wild: A unified analysis of hogwild-style algorithms. In Advances in neural information processing systems . 2674–2682

work page 2015
[3]

Chantat Eksombatchai, Pranav Jindal, Jerry Zitao Liu, Yuchen Liu, Rahul Sharma, Charles Sugnet, Mark Ulrich, and Jure Leskovec. 2018. Pixie: A system for recommending 3+ billion items to 200+ million users in real-time. In Proceedings of the 2018 World Wide Web Conference on World Wide Web . International World Wide Web Conferences Steering Committee, 1775–1784

work page 2018
[4]

Joshua Goodman. 2001. Classes for Fast Maximum Entropy Training. CoRR cs.CL/0108006 (2001). http://arxiv.org/abs/cs.CL/0108006

work page arXiv 2001
[5]

Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. 2015. E-commerce in your inbox: Product recommendations at scale. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . ACM, 1809– 1818

work page 2015
[6]

Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. CoRR abs/1607.00653 (2016). arXiv:1607.00653 http://arxiv.org/abs/ 1607.00653

work page internal anchor Pith review Pith/arXiv arXiv 2016
[7]

Michael Gutmann and Aapo HyvÃďrinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statis- tics (Proceedings of Machine Learning Research) , Yee Whye Teh and Mike Tit- terington (Eds.), Vol. 9. PMLR, Chia Laguna Re...

work page 2010
[8]

Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[9]

KONECT. 2017. Friendster Network Dataset

work page 2017
[10]

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics 3 (2015), 211–225. https://transacl.org/ ojs/index.php/tacl/article/view/570

work page 2015
[11]

Thomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. Graph Embeddings at Scale MLG’19, August 2019, Anchorage, Alaska USA In Neural Information Processing Systems

work page 2013
[12]

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean

work page
[13]

In Advances in Neural Information Processing Systems 26 , C

Distributed Representations of Words and Phrases and their Com- positionality. In Advances in Neural Information Processing Systems 26 , C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Wein- berger (Eds.). Curran Associates, Inc., 3111–3119. http://papers.nips.cc/paper/ 5021-distributed-representations-of-words-and-phrases-and-their-compo...

work page
[14]

Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings ef- ficiently with noise-contrastive estimation. In Advances in neural information processing systems. 2265–2273

work page 2013
[15]

Erik Ordentlich, Lee Yang, Andy Feng, Peter Cnudde, Mihajlo Grbovic, Nemanja Djuric, Vladan Radosavljevic, and Gavin Owens. 2016. Network-efficient dis- tributed word2vec training system for large vocabularies. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 1139–1148

work page 2016
[16]

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learn- ing of Social Representations. CoRR abs/1403.6652 (2014). arXiv:1403.6652 http://arxiv.org/abs/1403.6652

work page internal anchor Pith review Pith/arXiv arXiv 2014
[17]

Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems . 693–701

work page 2011
[18]

Charalampos Tsourakakis, Christos Gkantsidis, Bozidar Radunovic, and Milan Vojnovic. 2014. FENNEL: Streaming Graph Partitioning for Massive Scale Graphs. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM ’14). ACM, New York, NY, USA, 333–342. https://doi.org/10. 1145/2556195.2556213

work page arXiv 2014
[19]

Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao, and Dik Lun Lee. 2018. Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba. arXiv preprint arXiv:1803.02349 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Ledell Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, and Jason Weston. 2017. StarSpace: Embed All The Things! CoRR abs/1709.03856 (2017). arXiv:1709.03856 http://arxiv.org/abs/1709.03856

work page internal anchor Pith review Pith/arXiv arXiv 2017

[1] [1]

Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. 2018. A com- prehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30, 9 (2018), 1616–1637

work page 2018

[2] [2]

Christopher M De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. 2015. Taming the wild: A unified analysis of hogwild-style algorithms. In Advances in neural information processing systems . 2674–2682

work page 2015

[3] [3]

Chantat Eksombatchai, Pranav Jindal, Jerry Zitao Liu, Yuchen Liu, Rahul Sharma, Charles Sugnet, Mark Ulrich, and Jure Leskovec. 2018. Pixie: A system for recommending 3+ billion items to 200+ million users in real-time. In Proceedings of the 2018 World Wide Web Conference on World Wide Web . International World Wide Web Conferences Steering Committee, 1775–1784

work page 2018

[4] [4]

Joshua Goodman. 2001. Classes for Fast Maximum Entropy Training. CoRR cs.CL/0108006 (2001). http://arxiv.org/abs/cs.CL/0108006

work page arXiv 2001

[5] [5]

Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. 2015. E-commerce in your inbox: Product recommendations at scale. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . ACM, 1809– 1818

work page 2015

[6] [6]

Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. CoRR abs/1607.00653 (2016). arXiv:1607.00653 http://arxiv.org/abs/ 1607.00653

work page internal anchor Pith review Pith/arXiv arXiv 2016

[7] [7]

Michael Gutmann and Aapo HyvÃďrinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statis- tics (Proceedings of Machine Learning Research) , Yee Whye Teh and Mike Tit- terington (Eds.), Vol. 9. PMLR, Chia Laguna Re...

work page 2010

[8] [8]

Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[9] [9]

KONECT. 2017. Friendster Network Dataset

work page 2017

[10] [10]

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics 3 (2015), 211–225. https://transacl.org/ ojs/index.php/tacl/article/view/570

work page 2015

[11] [11]

Thomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. Graph Embeddings at Scale MLG’19, August 2019, Anchorage, Alaska USA In Neural Information Processing Systems

work page 2013

[12] [12]

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean

work page

[13] [13]

In Advances in Neural Information Processing Systems 26 , C

Distributed Representations of Words and Phrases and their Com- positionality. In Advances in Neural Information Processing Systems 26 , C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Wein- berger (Eds.). Curran Associates, Inc., 3111–3119. http://papers.nips.cc/paper/ 5021-distributed-representations-of-words-and-phrases-and-their-compo...

work page

[14] [14]

Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings ef- ficiently with noise-contrastive estimation. In Advances in neural information processing systems. 2265–2273

work page 2013

[15] [15]

Erik Ordentlich, Lee Yang, Andy Feng, Peter Cnudde, Mihajlo Grbovic, Nemanja Djuric, Vladan Radosavljevic, and Gavin Owens. 2016. Network-efficient dis- tributed word2vec training system for large vocabularies. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 1139–1148

work page 2016

[16] [16]

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learn- ing of Social Representations. CoRR abs/1403.6652 (2014). arXiv:1403.6652 http://arxiv.org/abs/1403.6652

work page internal anchor Pith review Pith/arXiv arXiv 2014

[17] [17]

Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems . 693–701

work page 2011

[18] [18]

Charalampos Tsourakakis, Christos Gkantsidis, Bozidar Radunovic, and Milan Vojnovic. 2014. FENNEL: Streaming Graph Partitioning for Massive Scale Graphs. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM ’14). ACM, New York, NY, USA, 333–342. https://doi.org/10. 1145/2556195.2556213

work page arXiv 2014

[19] [19]

Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao, and Dik Lun Lee. 2018. Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba. arXiv preprint arXiv:1803.02349 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[20] [20]

Ledell Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, and Jason Weston. 2017. StarSpace: Embed All The Things! CoRR abs/1709.03856 (2017). arXiv:1709.03856 http://arxiv.org/abs/1709.03856

work page internal anchor Pith review Pith/arXiv arXiv 2017