Abstract
Characterizing large complex networks such as online social networks through node querying is a challenging task. Network service providers often impose severe constraints on the query rate, limiting the sample size to a small fraction of the network of interest. Various ad hoc subgraph sampling methods have been proposed, but many of them give biased estimates and provide no theoretical guarantees on accuracy. In this work, we focus on developing sampling methods for large networks where querying a node also reveals partial structural information about its neighbors. Our methods are optimized for NoSQL graph databases (if the database can be accessed directly), or utilize the Web APIs available on most major large networks for graph sampling. We show that our sampling method provably converges to an unbiased estimator and is more accurate than state-of-the-art methods. We also explore methods to uncover shortest paths between a subset of nodes and to detect high-degree nodes by sampling only a small fraction of the network of interest. Our results demonstrate that utilizing neighborhood information yields methods that are two orders of magnitude faster than state-of-the-art methods.
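To make the idea of an unbiased estimator under a query budget concrete, the sketch below shows the standard re-weighted random walk estimator of the degree distribution. This is a common baseline in the graph sampling literature, not the neighborhood-based method proposed in this article. It assumes an undirected, connected graph; the function name `rw_degree_distribution` and the `neighbors` query callback are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict


def rw_degree_distribution(neighbors, start, budget, seed=0):
    """Estimate the degree distribution from a simple random walk.

    neighbors: hypothetical query function, node -> list of neighbor ids
               (stands in for one API or database query per node).
    start:     node at which the walk begins.
    budget:    number of walk steps, i.e. the number of node queries.
    """
    rng = random.Random(seed)
    weights = defaultdict(float)  # degree -> sum of 1/degree over visits
    total = 0.0
    node = start
    for _ in range(budget):
        nbrs = neighbors(node)
        d = len(nbrs)
        # In the long run the walk visits a node with frequency proportional
        # to its degree, so each visit is weighted by 1/d to undo that bias.
        weights[d] += 1.0 / d
        total += 1.0 / d
        node = rng.choice(nbrs)
    # Normalize so the estimated fractions sum to one.
    return {d: w / total for d, w in weights.items()}


# Toy usage: a star graph with one hub (node 0) and four leaves.
toy_graph = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
estimate = rw_degree_distribution(toy_graph.__getitem__, start=0, budget=1000)
```

Without the 1/d weights, the hub in the toy graph would account for half of all visits even though it is only one of five nodes, which is the kind of bias the abstract attributes to ad hoc sampling methods.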
Notes
In directed networks where querying a node retrieves the node's incoming and outgoing edges.
Acknowledgements
The authors wish to thank the anonymous reviewers for their helpful feedback. This work was supported in part by Army Research Office Contract W911NF-12-1-0385, and ARL under Cooperative Agreement W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied of the ARL, or the U.S. Government. The work was also supported in part by National Natural Science Foundation of China (61603290, 61602371, U1301254), Ministry of Education & China Mobile Research Fund (MCM20160311), China Postdoctoral Science Foundation (2015M582663), Natural Science Basic Research Plan in Zhejiang Province of China (LGG18F020016), Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034, 2017JM6095), Shenzhen Basic Research Grant (JCYJ20160229195940462).
About this article
Cite this article
Wang, P., Zhao, J., Ribeiro, B. et al. Practical characterization of large networks using neighborhood information. Knowl Inf Syst 58, 701–728 (2019). https://doi.org/10.1007/s10115-018-1167-0
DOI: https://doi.org/10.1007/s10115-018-1167-0