Linked Ego Networks: Improving estimate reliability and validity with respondent-driven sampling
Introduction
In many forms of research, there is no list of all members of the studied population (i.e., a sampling frame) from which a random sample can be drawn and from which population characteristics can be inferred based on the selection probabilities of the sample units. In such situations, non-probability sampling methods may be used, such as key informant sampling (Deaux and Callaghan, 1985), targeted/location sampling (Watters and Biernacki, 1989), and snowball sampling (Erickson, 1979). However, these methods all introduce considerable selection bias, which impairs generalization of the findings from the sample to the studied population (Heckathorn, 1997, Magnani et al., 2005). Respondent-driven sampling (RDS) is an alternative method that is currently used extensively in public health research to study hard-to-access populations, e.g., injecting drug users (IDUs), men who have sex with men (MSM) and sex workers (SWs). RDS uses a link-tracing network sampling design and, provided that a limited number of assumptions are fulfilled, yields asymptotically unbiased population estimates while remaining feasible to implement, making it the state-of-the-art sampling method for studying hard-to-access populations (Johnston et al., 2008, Wejnert, 2009, Lansky et al., 2007, Kogan et al., 2011, Wejnert and Heckathorn, 2008).
RDS starts with a number of pre-selected respondents who serve as “seeds”. After an interview, the seeds are asked to distribute a certain number of coupons (usually 3) to their friends who are also within the studied population. Individuals with a valid coupon can then participate in the study and are provided the same number of coupons to distribute. The above recruitment process is repeated until the desired sample size is reached (Heckathorn, 1997). In a typical RDS, information about who recruits whom and the respondents’ number of friends within the population (degree) are also recorded for the purpose of generating population estimates from the sample (Heckathorn, 2002, Salganik and Heckathorn, 2004).
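The recruitment procedure described above can be sketched in Python. This is a minimal illustration, not the simulation code used in the study: the function name `simulate_rds`, the breadth-first processing order, and the toy ring network are all assumptions made for the example, and real recruitment is neither synchronous nor guaranteed to use every coupon.

```python
import random
from collections import deque


def simulate_rds(adjacency, seeds, coupons=3, sample_size=100, rng=None):
    """Simulate coupon-based recruitment without replacement (hypothetical helper).

    adjacency: dict mapping each node to a list of its neighbours (undirected).
    Returns a list of (recruiter, recruit) pairs; seeds have recruiter None.
    """
    rng = rng or random.Random()
    sampled = set(seeds)
    records = [(None, s) for s in seeds]
    queue = deque(seeds)
    while queue and len(records) < sample_size:
        recruiter = queue.popleft()
        # Only friends who are not yet in the sample can redeem a coupon.
        eligible = [v for v in adjacency[recruiter] if v not in sampled]
        rng.shuffle(eligible)
        for recruit in eligible[:coupons]:
            if len(records) >= sample_size:
                break
            sampled.add(recruit)
            records.append((recruiter, recruit))
            queue.append(recruit)
    return records


# Toy network (assumption): a ring of 20 people, each linked to the two
# nearest people on either side.
n = 20
adjacency = {i: [(i + k) % n for k in (-2, -1, 1, 2)] for i in range(n)}
records = simulate_rds(adjacency, seeds=[0], coupons=3, sample_size=10,
                       rng=random.Random(1))

# Degree of each respondent, as would be self-reported in an interview.
degrees = {v: len(adjacency[v]) for _, v in records}
```

The returned pairs correspond to the "who recruits whom" information, and `degrees` to the reported number of friends, which are the two ingredients recorded in a typical RDS study.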
Suppose an RDS study is conducted on a connected network with the additional assumptions that (i) network links are undirected, (ii) sampling of peer recruitment is done with replacement, (iii) each participant recruits one peer from his/her neighbors, and (iv) the peer recruitment is a random selection among all the participant's neighbors. Then the RDS process can be modeled as a Markov process, and the composition of the sample will stabilize and be independent of the properties of the seeds (Salganik and Heckathorn, 2004, Heckathorn, 2007, Volz and Heckathorn, 2008). It follows that the probability of each node being included in the RDS sample is proportional to its degree. Specifically, consider a given sample $U$ of size $n$, with $n_A$ being the number of respondents in the sample with property $A$ (e.g., HIV-positive) and $n_B = n - n_A$ being the rest. Let $\{d_1, d_2, \ldots, d_n\}$ be the respondents' degrees and $S = \begin{pmatrix} s_{AA} & s_{AB} \\ s_{BA} & s_{BB} \end{pmatrix}$ be the recruitment matrix observed from the sample, where $s_{XY}$ is the proportion of recruitments from group $X$ to group $Y$ (for the purpose of this paper, we consider a binary property such that each individual belongs either to group $A$ or $B$). Then the proportion of individuals belonging to group $A$ in the population, $P_A$, can be estimated by (Salganik and Heckathorn, 2004, Volz and Heckathorn, 2008):
$$\hat{P}_A^{\mathrm{RDSI}} = \frac{s_{BA}\hat{D}_B}{s_{AB}\hat{D}_A + s_{BA}\hat{D}_B}$$
or
$$\hat{P}_A^{\mathrm{RDSII}} = \frac{\sum_{i \in A \cap U} d_i^{-1}}{\sum_{i \in U} d_i^{-1}}$$
where $\hat{D}_A = n_A / \sum_{i \in A \cap U} d_i^{-1}$ and $\hat{D}_B = n_B / \sum_{i \in B \cap U} d_i^{-1}$ are the estimated average degrees for individuals of group $A$ and $B$ in the population. Both estimators give asymptotically unbiased estimates when the above assumptions are fulfilled (Salganik and Heckathorn, 2004, Volz and Heckathorn, 2008).
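As a minimal Python sketch of the two estimators, assuming their commonly cited forms (RDSI built from the cross-group recruitment proportions and harmonic-mean degree estimates, RDSII as an inverse-degree weighted proportion), the calculation might look as follows. The function name, argument layout, and toy data are illustrative assumptions and are not taken from the original study.

```python
def rds_estimates(degrees, in_group_a, recruitments):
    """Return (RDSI, RDSII) estimates of the population proportion in group A.

    degrees: dict respondent -> self-reported degree (must be > 0).
    in_group_a: dict respondent -> True if the respondent has property A.
    recruitments: list of (recruiter, recruit) pairs, seeds excluded.
    Assumes both groups appear in the sample and both groups recruit.
    """
    inv_a = sum(1 / degrees[v] for v in degrees if in_group_a[v])
    inv_b = sum(1 / degrees[v] for v in degrees if not in_group_a[v])
    n_a = sum(1 for v in degrees if in_group_a[v])
    n_b = len(degrees) - n_a

    # RDSII: inverse-degree weighted share of group A in the sample.
    rds2 = inv_a / (inv_a + inv_b)

    # Estimated average degrees (harmonic means within each group).
    d_a = n_a / inv_a
    d_b = n_b / inv_b

    # Cross-group recruitment proportions s_AB and s_BA.
    from_a = [pair for pair in recruitments if in_group_a[pair[0]]]
    from_b = [pair for pair in recruitments if not in_group_a[pair[0]]]
    s_ab = sum(1 for _, v in from_a if not in_group_a[v]) / len(from_a)
    s_ba = sum(1 for _, v in from_b if in_group_a[v]) / len(from_b)

    # RDSI: solves the link balance s_AB * D_A * P_A = s_BA * D_B * (1 - P_A).
    rds1 = s_ba * d_b / (s_ab * d_a + s_ba * d_b)
    return rds1, rds2


# Toy sample (assumption): group A members report degree 2, group B degree 4.
degrees = {0: 2, 1: 4, 2: 2, 3: 4, 4: 4, 5: 2}
in_group_a = {0: True, 1: False, 2: True, 3: False, 4: False, 5: True}
recruitments = [(0, 1), (0, 2), (1, 3), (1, 5), (2, 4)]
rds1, rds2 = rds_estimates(degrees, in_group_a, recruitments)
```

In this toy sample the raw proportion of group A is 0.5, whereas both estimators shift it upward because group A respondents report lower degrees and are therefore under-represented by degree-proportional sampling.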
The methodology of RDS is well designed in principle; however, the assumptions underlying the RDS estimators are rarely met in practice (Wejnert, 2009, Tomas and Gile, 2011, Goel and Salganik, 2010, Bengtsson et al., 2012). For example, empirical RDS studies typically use more than one coupon, and sampling is conducted without replacement, that is, each respondent is only allowed to participate once. A comprehensive evaluation was made by Lu et al. (2012), in which the effects of violating assumptions (i)–(iv), as well as the effects of the selection and number of seeds and coupons, were evaluated one by one by simulating RDS processes on an empirical MSM network as well as on artificial networks and comparing the RDS estimates with known population properties. They showed that when the sample size is relatively small (<10% of the population), RDS estimators are robust to violations of certain assumptions, such as a low response rate and errors in self-reported degrees. On the other hand, large bias and variance may result from differential recruitment or from networks with non-reciprocal relationships. For relatively large sample sizes (>50% of the population), similar results were found by Gile and Handcock (2010), who focused on the sensitivity of RDS estimators to the selection of seeds, respondent behavior and violation of assumption (ii).
Only recently have researchers found that the variance of RDS estimates may have been severely underestimated (Salganik, 2006). In a study based on simulated RDS samples on empirical networks, Goel and Salganik (2010) found that the RDS estimator typically generates five to ten times greater variance than simple random sampling (Salganik, 2006). Moreover, McCreesh et al. (2012) conducted an RDS study of male household heads in rural Uganda, for which the true population data were known, and found that only one-third of the RDS estimates outperformed the raw proportions in the RDS sample, and that only 50–74% of the RDS 95% confidence intervals, calculated using a bootstrap approach for RDS, included the true population proportion.
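To make the variance comparison concrete, the ratio of the variance of RDS estimates to the variance expected under simple random sampling (often called the design effect) can be computed from repeated simulated samples. The sketch below is only an illustration: the function name and the five example estimates are hypothetical, and the simple-random-sampling variance uses the with-replacement formula p(1 − p)/n.

```python
from statistics import pvariance


def design_effect(estimates, true_p, n):
    """Variance of repeated RDS estimates relative to simple random sampling.

    estimates: proportion estimates from repeated (simulated) RDS samples.
    true_p: known population proportion.
    n: size of each sample.
    """
    var_rds = pvariance(estimates)
    # Variance of a sample proportion under simple random sampling
    # with replacement.
    var_srs = true_p * (1 - true_p) / n
    return var_rds / var_srs


# Hypothetical estimates from five simulated samples of size 100.
deff = design_effect([0.2, 0.4, 0.3, 0.5, 0.1], true_p=0.3, n=100)
```

A value of `deff` between five and ten would correspond to the range reported in the simulation study cited above.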
For the above reasons, there has been increasing interest in developing new estimators to improve the performance of RDS. For example, Gile (2011) developed a successive-sampling-based estimator for RDS that relaxes the assumption of sampling with replacement and demonstrated its superior performance when the size of the population is known. Lu et al. (2013) proposed new estimators for RDS on directed networks, given a known difference in indegree between the estimated groups. Both of the above estimators can be used as a sensitivity test when the required population parameters are not known.
The traditional RDSI and RDSII estimators, as well as the estimators more recently developed by Gile (2011), Gile and Handcock (2011) and Lu et al. (2013), all use the same information collected by standard RDS practice, that is, the recruitment matrix S, and the degree and studied properties of each respondent in the sample. There is, however, scope to improve estimates dramatically if data on the composition of respondents’ ego networks can be put to use. Such data have already been collected for other purposes in many RDS studies. For example, in an RDS study of MSM in Campinas City, Brazil, by de Mello et al. (2008), respondents were asked to describe the percentage of their friends/acquaintances with certain characteristics, such as disclosure of sexual orientation to family and HIV status. In an RDS study of opiate users in Yunnan, China, information about supporting, drug-using, and sexual behaviors between respondents and their network members was collected (Li et al., 2011). One of the most thorough RDS studies utilizing ego network information was done by Rudolph et al. (2011), who asked respondents to provide extensive characteristics for each alter within their personal networks, such as demographic characteristics, history of incarceration, drug injection, and crack and heroin use.
Aiming to improve the RDS estimator, we will focus on how to integrate this additional information in the estimation process to generate improved population estimates. The rest of this paper is organized as follows. In Section 2, we develop a new estimator that integrates traditional RDS data with egocentric data; in Section 3, we describe network data used for simulation and study design; in Section 4, we evaluate the performance of the new estimator by simulated RDS processes under various settings; and in Section 5, we summarize and draw our conclusions.
RDSIego: estimator for RDS with egocentric data
The ego networks from an RDS sample differ from the general egocentric data collected in many sociological surveys (Britton and Trapman, 2012, Everett and Borgatti, 2005) in that each “ego” is connected with (recruited by) its recruiter. For example, in a partial RDS chain as illustrated in Fig. 1, three consecutive participants are each asked to provide their personal network compositions, and the second and third participants are recruited by the first and second, respectively.
For each respondent in a RDS sample , let ,
Network data
In this paper we use both an anonymized empirical social network and simulated networks to evaluate the performance of the newly proposed estimator. The empirical network, previously analyzed in Lu et al. (2012, 2013) and Rybski et al. (2009), comes from the Nordic region's largest and most active web community for homosexual, bisexual, transgender, and queer persons. Nodes of the network are website members who identify themselves as homosexual males, and links are friendship
Estimates of network link types
The difference between RDSI and RDSIego lies in the estimation of the recruitment matrix S. As a first step, we therefore simulate the RDS process under both random recruitment and differential recruitment, and then estimate the proportion of type eA→B links in the population by both the raw sample recruitment proportion, sAB, and the proposed ego-network-based estimator, for each of the four variables in the MSM network: age, ct, cs and pf.
An example of the
Conclusion and discussion
Ego network data has been collected for decades and exists largely in sociological surveys (Britton and Trapman, 2012, Everett and Borgatti, 2005, Handcock and Gile, 2010, Newman, 2003, Mizruchi and Marquis, 2006, Marsden, 2002, Hanneman and Riddle, 2005); the RDS sampling mechanism further makes it possible to collect “linked-ego network” data. By combining RDS recruitment trees with ego networks, this study developed a new estimator, RDSIego, for RDS studies. Given that participants can
Acknowledgement
The author would like to thank Professor Fredrik Liljeros and Dr. Linus Bengtsson for helpful discussions. This work has been partially funded by Riksbankens Jubileumsfond (The Bank of Sweden Tercentenary Foundation).
References (49)
Network items and the general social survey. Social Networks (1984)
Ego network betweenness. Social Networks (2005)
Are respondents more likely to list alters with certain characteristics? Implications for name generator data. Social Networks (2004)
Egocentric and sociocentric measures of network centrality. Social Networks (2002)
Does the online collection of ego-centered network data reduce data quality? An experimental comparison. Social Networks (2010)
Egocentric, sociocentric, or dyadic? Identifying the appropriate level of analysis in the study of organizational networks. Social Networks (2006)
Ego-centered networks and the ripple effect. Social Networks (2003)
The game of contacts: estimating the social visibility of groups. Social Networks (2011)
A comparative study of social network models: network evolution models and nodal attribute models. Social Networks (2009)
Implementation of web-based respondent-driven sampling among men who have sex with men in Vietnam. PLoS ONE (2012)
Global HIV surveillance among MSM: is risk behavior seriously underestimated? AIDS
Inferring global network properties from egocentric data with applications to epidemics
Key informant versus self-report estimates of health-risk behavior. Evaluation Review
Assessment of risk factors for HIV infection among men who have sex with men in the metropolitan area of Campinas City, Brazil, using respondent-driven sampling
Some problems of inference from chain data. Sociological Methodology
Improved inference for respondent-driven sampling data with application to HIV prevalence estimation. Journal of the American Statistical Association
Respondent-driven sampling: an assessment of current methodology. Sociological Methodology
Network model-assisted inference from respondent-driven sampling data
Walking in Facebook: a case study of unbiased sampling of OSNs
Assessing respondent-driven sampling. Proceedings of the National Academy of Sciences of the United States of America
Modeling social networks from sampled data. Annals of Applied Statistics
Introduction to Social Network Methods
On the theory of sampling from finite populations. Annals of Mathematical Statistics
Respondent-driven sampling: a new approach to the study of hidden populations. Social Problems