Abstract
Random walk-based sampling methods are gaining popularity and importance in characterizing large networks. While powerful, they suffer from slow mixing when the graph is loosely connected, which results in poor estimation accuracy. Random walk with jumps (RWwJ) can address the slow mixing problem, but it is inapplicable if the graph does not support uniform vertex sampling (UNI). In this work, we develop methods that can efficiently sample a graph without requiring UNI while still enjoying benefits similar to those of RWwJ. We observe that many graphs under study, called target graphs, do not exist in isolation. In many situations, a target graph is related to an auxiliary graph and a bipartite graph, and together they form a better-connected two-layered network structure. This new viewpoint brings extra benefits to graph sampling: if directly sampling a target graph is difficult, we can sample it indirectly with the assistance of the other two graphs. We propose a series of new graph sampling techniques that exploit such a two-layered network structure to estimate target graph characteristics. Experiments conducted on both synthetic and real-world networks demonstrate the effectiveness and usefulness of these new techniques.
Notes
Twitter API rate limiting. https://dev.twitter.com/rest/public/rate-limiting.
Sina Weibo provides a check-in service (http://place.weibo.com) that allows users to share location information with their friends, e.g., the restaurants where they had lunch or the hotels where they stayed while traveling. The service is similar to the check-in function in Foursquare and other location-based OSNs.
A Weibo user ID is in the range [1,000,000,000, 6,200,000,000], as of May 2017. About 10% of the IDs in this range represent valid users.
Weibo search API. http://open.weibo.com/wiki/2/location/pois/search/by_area.
For Facebook, the friendship network is an undirected graph. For Twitter, the followees and followers of a user are known once the user is collected, so we can build an undirected graph of the Twitter follower network on the fly.
A Markov chain is said to be time reversible with respect to stationary distribution \(\pi \) if it satisfies condition \(\pi _ip_{ij}=\pi _jp_{ji}, \forall i,j\).
Let \(\hat{\theta }^{\text {RW}}\) denote the RW estimate. Then \(\hat{\theta }^{\text {RW}}=\frac{1}{Z^{\text {RW}}}\sum _{u\in S}\frac{f(u)}{d_u}\), where \(d_u\) is the degree of u, the samples in multiset S are collected using a simple RW in G, and \(Z^{\text {RW}}=\sum _{u\in S}\frac{1}{d_u}\).
Let \(\hat{\theta }^{\text {MHRW}}\) denote the MHRW estimate. Because MHRW obtains samples uniformly at random, the estimator is simply \(\hat{\theta }^{\text {MHRW}}=\frac{1}{|S|}\sum _{u\in S}f(u)\), where the samples in multiset S are collected using an MHRW in G.
The user ID space ranges from 100,000 to 10,000,000, and the actor ID space ranges from 892,000 to 2,100,000.
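The time-reversibility condition and the RW and MHRW estimators defined in the notes above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy graph GRAPH, all function names, and the choice f(u) = d_u are hypothetical, and the MHRW acceptance probability min(1, d_u/d_v) is the standard rule for a uniform target distribution, which the notes themselves do not spell out.

```python
import random
from fractions import Fraction

# Hypothetical toy graph (undirected, connected, contains a triangle),
# stored as adjacency lists. It is not taken from the paper.
GRAPH = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2, 4],
    4: [3],
}


def rw_transition(graph, i, j):
    """Transition probability p_ij of a simple RW: 1/d_i if j is a neighbor of i."""
    return Fraction(1, len(graph[i])) if j in graph[i] else Fraction(0)


def is_time_reversible(graph):
    """Check pi_i * p_ij == pi_j * p_ji for all i, j, with pi_i = d_i / (2|E|)."""
    two_m = sum(len(nbrs) for nbrs in graph.values())
    pi = {u: Fraction(len(nbrs), two_m) for u, nbrs in graph.items()}
    return all(
        pi[i] * rw_transition(graph, i, j) == pi[j] * rw_transition(graph, j, i)
        for i in graph
        for j in graph
    )


def simple_rw(graph, start, steps, rng):
    """Collect a multiset S of samples with a simple random walk."""
    u, samples = start, []
    for _ in range(steps):
        u = rng.choice(graph[u])
        samples.append(u)
    return samples


def mhrw(graph, start, steps, rng):
    """Metropolis-Hastings RW targeting the uniform distribution over vertices.

    Assumption: a neighbor v is proposed uniformly and accepted with
    probability min(1, d_u / d_v); otherwise the walk stays at u.
    """
    u, samples = start, []
    for _ in range(steps):
        v = rng.choice(graph[u])
        if rng.random() < min(1.0, len(graph[u]) / len(graph[v])):
            u = v
        samples.append(u)
    return samples


def rw_estimate(graph, samples, f):
    """RW estimate: (1 / Z) * sum_{u in S} f(u) / d_u, with Z = sum_{u in S} 1 / d_u."""
    z = sum(1.0 / len(graph[u]) for u in samples)
    return sum(f(u) / len(graph[u]) for u in samples) / z


def mhrw_estimate(samples, f):
    """MHRW estimate: (1 / |S|) * sum_{u in S} f(u)."""
    return sum(f(u) for u in samples) / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)
    f = lambda u: len(GRAPH[u])  # estimate the average degree
    truth = sum(len(nbrs) for nbrs in GRAPH.values()) / len(GRAPH)
    print("time reversible:", is_time_reversible(GRAPH))
    print("true average degree:", truth)
    print("RW estimate:", rw_estimate(GRAPH, simple_rw(GRAPH, 0, 100_000, rng), f))
    print("MHRW estimate:", mhrw_estimate(mhrw(GRAPH, 0, 100_000, rng), f))
```

Both estimates should approach the true average of f over all vertices as the walks get longer. The RW estimate achieves this by down-weighting each sample by its degree, and the MHRW estimate by rejecting moves toward high-degree vertices.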
Acknowledgements
The authors wish to thank the anonymous reviewers for their helpful feedback. The research presented in this paper is supported in part by the National Key R&D Program of China (2018YFC0830500), the National Natural Science Foundation of China (U1301254, 61603290, 61602371, 61772412), the Ministry of Education & China Mobile Research Fund (MCM20160311), the Natural Science Foundation of Jiangsu Province (SBK2014021758), the 111 International Collaboration Program of China, the Prospective Joint Research of Industry-Academia-Research Joint Innovation Funding of Jiangsu Province (BY2014074), the Shenzhen Basic Research Grant (JCYJ20160229195940462, JCYJ20170816100819428), the China Postdoctoral Science Foundation (2015M582663), and the Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034). The work by John C. S. Lui was supported in part by GRF 14208816.
Additional information
Communicated by Hanghang Tong.
Appendix
See Table 4.
Cite this article
Zhao, J., Wang, P., Lui, J.C.S. et al. Sampling online social networks by random walk with indirect jumps. Data Min Knowl Disc 33, 24–57 (2019). https://doi.org/10.1007/s10618-018-0587-5
DOI: https://doi.org/10.1007/s10618-018-0587-5