Linked Ego Networks: Improving estimate reliability and validity with respondent-driven sampling

doi:10.1016/j.socnet.2013.10.001

Social Networks

Volume 35, Issue 4, October 2013, Pages 669-685

https://doi.org/10.1016/j.socnet.2013.10.001

Highlights

• An estimator is developed for respondent-driven sampling with ego network data.
• The estimator has improved precision in estimating population characteristics.
• The estimator is robust to differential recruitment and to variations in network structure.
• Effect of reporting error is evaluated by simulations on both empirical and synthetic networks.

Abstract

Respondent-driven sampling (RDS) is currently widely used for the study of HIV/AIDS-related high-risk populations. However, recent studies have shown that traditional RDS methods are likely to generate large variances and may be severely biased, since the assumptions behind RDS are seldom fully met in real life. To improve estimation in RDS studies, we propose a new method that generates estimates from ego network data, which is collected by asking respondents about the composition of their personal networks, for example “what proportion of your friends are married?”. Using simulations on an extracted real-world social network of gay men as well as on artificial networks with varying structural properties, we show that the precision of estimates of population characteristics is greatly improved. The proposed estimator outperforms traditional RDS estimators and, most importantly, is robust to respondents’ recruitment preferences and to degree reporting error, both of which are common in RDS practice and may generate large biases and errors in traditional RDS estimates. These positive results encourage researchers to collect ego network data on variables of interest in RDS studies, both for hard-to-access populations and for general populations when random sampling is not applicable.

Introduction

In many forms of research, there is no list of all members of the studied population (i.e., a sampling frame) from which a random sample may be drawn and from which estimates of population characteristics may be inferred based on the selection probabilities of the sample units. Non-probability sampling methods may be used in such situations, such as key informant sampling (Deaux and Callaghan, 1985), targeted/location sampling (Watters and Biernacki, 1989), and snowball sampling (Erickson, 1979). However, these methods all introduce considerable selection bias, which impairs generalization of the findings from the sample to the studied population (Heckathorn, 1997, Magnani et al., 2005). Respondent-driven sampling (RDS) is an alternative method that is currently used extensively in public health research to study hard-to-access populations, e.g., injecting drug users (IDUs), men who have sex with men (MSM) and sex workers (SWs). RDS uses a link-tracing network sampling design and provides, given fulfillment of a limited number of assumptions, asymptotically unbiased population estimates as well as a feasible implementation, making it the state-of-the-art sampling method for studying hard-to-access populations (Johnston et al., 2008, Wejnert, 2009, Lansky et al., 2007, Kogan et al., 2011, Wejnert and Heckathorn, 2008).

RDS starts with a number of pre-selected respondents who serve as “seeds”. After an interview, the seeds are asked to distribute a certain number of coupons (usually 3) to their friends who are also within the studied population. Individuals with a valid coupon can then participate in the study and are provided the same number of coupons to distribute. The above recruitment process is repeated until the desired sample size is reached (Heckathorn, 1997). In a typical RDS, information about who recruits whom and the respondents’ number of friends within the population (degree) are also recorded for the purpose of generating population estimates from the sample (Heckathorn, 2002, Salganik and Heckathorn, 2004).
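The recruitment procedure described above can be sketched as a small simulation. This is a minimal illustration, not the simulation code used in this paper: the function name `simulate_rds`, the adjacency-dictionary input, and the simplifications (every coupon is redeemed, recruits are chosen uniformly at random among friends not yet sampled, and nobody participates twice) are all assumptions.

```python
import random
from collections import deque


def simulate_rds(neighbors, seeds, coupons=3, sample_size=100, rng=None):
    """Simulate coupon-based recruitment on a network (hypothetical helper).

    neighbors: dict mapping each person to a list of friends in the population.
    seeds: pre-selected initial respondents.
    Returns the respondents in recruitment order and a dict recording
    who recruited whom (seeds map to None).
    """
    rng = rng or random.Random(0)
    sampled = list(seeds)
    seen = set(seeds)
    recruiter_of = {seed: None for seed in seeds}
    queue = deque(seeds)

    while queue and len(sampled) < sample_size:
        person = queue.popleft()
        # Assumption: coupons go to friends who have not yet participated.
        eligible = [f for f in neighbors[person] if f not in seen]
        rng.shuffle(eligible)
        for friend in eligible[:coupons]:
            if len(sampled) >= sample_size:
                break
            seen.add(friend)
            sampled.append(friend)
            recruiter_of[friend] = person
            queue.append(friend)

    return sampled, recruiter_of


if __name__ == "__main__":
    # Toy ring network of 20 people, each with two friends.
    ring = {i: [(i - 1) % 20, (i + 1) % 20] for i in range(20)}
    sampled, recruiter_of = simulate_rds(ring, seeds=[0], sample_size=10)
    degrees = {person: len(ring[person]) for person in sampled}
    print(sampled, recruiter_of, degrees)
```

The two recorded quantities, the recruiter of each respondent and each respondent's degree, are the inputs that standard RDS estimation needs.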

Suppose a RDS study is conducted on a connected network with the additional assumptions that (i) network links are undirected, (ii) sampling of peer recruitment is done with replacement, (iii) each participant recruits one peer from his/her neighbors, and (iv) the peer recruitment is a random selection among all the participant's neighbors. Then the RDS process can be modeled as a Markov process, and the composition of the sample will stabilize and be independent of the properties of the seeds (Salganik and Heckathorn, 2004, Heckathorn, 2007, Volz and Heckathorn, 2008). It follows that the probability for each node to be included in the RDS sample is proportional to its degree. Specifically, consider a sample $U = \{v_{1}, v_{2}, \dots, v_{n}\}$, with $n_{A}$ being the number of respondents in the sample with property A (e.g., HIV-positive) and $n_{B} = n - n_{A}$ being the number of the rest. Let $\{d_{1}, d_{2}, \dots, d_{n}\}$ be the respondents’ degrees and $S = \begin{bmatrix} s_{AA} & s_{AB} \\ s_{BA} & s_{BB} \end{bmatrix}$ be the recruitment matrix observed from the sample, where $s_{XY}$ is the proportion of recruitments from group X to group Y (for the purpose of this paper, we consider a binary property such that each individual belongs either to group A or B). Then the proportion of individuals belonging to group A in the population, $P_{A}^{*}$, can be estimated by (Salganik and Heckathorn, 2004, Volz and Heckathorn, 2008):

${\hat{P}}_{A}^{RDSI} = \frac{s_{BA} {\hat{\bar{D}}}_{B}}{s_{AB} {\hat{\bar{D}}}_{A} + s_{BA} {\hat{\bar{D}}}_{B}},$ or

${\hat{P}}_{A}^{RDSII} = \frac{\sum_{v_{i} \in A \cap U} d_{i}^{- 1}}{\sum_{v_{i} \in U} d_{i}^{- 1}},$ where ${\hat{\bar{D}}}_{A} = n_{A} / (\sum_{v_{i} \in A \cap U} d_{i}^{- 1})$ and ${\hat{\bar{D}}}_{B} = n_{B} / (\sum_{v_{i} \in B \cap U} d_{i}^{- 1})$ are the estimated average degrees for individuals of group A and B in the population. Both estimators give asymptotically unbiased estimates when the above assumptions are fulfilled (Salganik and Heckathorn, 2004, Volz and Heckathorn, 2008).
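The two estimators above can be written out directly in code. The following is a minimal sketch, not the authors' implementation: the function names and the input layout (a list of group flags, a list of degrees, and a list of recruiter/recruit index pairs) are hypothetical, and $s_{AB}$ is assumed to be row-normalized, i.e. the share of recruitments made by group A respondents that go to group B.

```python
def estimated_mean_degree(degrees):
    """Estimated average degree of a group: n / sum(1 / d_i)."""
    return len(degrees) / sum(1.0 / d for d in degrees)


def rds2_estimate(in_a, degrees):
    """RDSII: inverse-degree-weighted share of group A in the sample."""
    weight_a = sum(1.0 / d for a, d in zip(in_a, degrees) if a)
    weight_all = sum(1.0 / d for d in degrees)
    return weight_a / weight_all


def rds1_estimate(in_a, degrees, recruitments):
    """RDSI from cross-group recruitment shares and estimated mean degrees.

    recruitments: list of (recruiter_index, recruit_index) pairs.
    Assumes positive degrees and at least one recruitment made by each group.
    """
    mean_deg_a = estimated_mean_degree([d for a, d in zip(in_a, degrees) if a])
    mean_deg_b = estimated_mean_degree([d for a, d in zip(in_a, degrees) if not a])

    by_a = [(r, c) for r, c in recruitments if in_a[r]]
    by_b = [(r, c) for r, c in recruitments if not in_a[r]]
    # Assumption: s_AB and s_BA are proportions within each recruiter group.
    s_ab = sum(1 for _, c in by_a if not in_a[c]) / len(by_a)
    s_ba = sum(1 for _, c in by_b if in_a[c]) / len(by_b)

    return s_ba * mean_deg_b / (s_ab * mean_deg_a + s_ba * mean_deg_b)


if __name__ == "__main__":
    # Toy chain of four respondents: A -> B -> A -> B.
    in_a = [True, False, True, False]
    degrees = [2, 4, 2, 4]
    recruitments = [(0, 1), (1, 2), (2, 3)]
    print(rds1_estimate(in_a, degrees, recruitments))
    print(rds2_estimate(in_a, degrees))
```

In this toy sample the raw proportion of group A is 1/2, while both estimators return 2/3, because the group A respondents have lower degree and are therefore up-weighted.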

The RDS methodology is well designed; however, the assumptions underlying the RDS estimators are rarely met in practice (Wejnert, 2009, Tomas and Gile, 2011, Goel and Salganik, 2010, Bengtsson et al., 2012). For example, empirical RDS studies use more than one coupon and sampling is conducted without replacement, that is, each respondent is only allowed to participate once. A comprehensive evaluation was made by Lu et al. (2012), who evaluated the effects of violating assumptions (i)–(iv), as well as the effects of the selection and number of seeds and coupons, one by one, by simulating the RDS process on an empirical MSM network and on artificial networks and comparing RDS estimates with known population properties. They showed that when the sample size is relatively small (<10% of the population), RDS estimators are robust to certain departures from ideal conditions, such as low response rates and errors in self-reported degrees. On the other hand, large bias and variance may result from differential recruitment or from networks with non-reciprocal relationships. Similar results were found by Gile and Handcock (2010) for relatively large sample sizes (>50% of the population), where they focused on the sensitivity of RDS estimators to the selection of seeds, respondent behavior and violation of assumption (ii).

It was not until recently that researchers found that the variance in RDS may have been severely underestimated (Salganik, 2006). Goel and Salganik (2010), using simulated RDS samples on empirical networks, found that the RDS estimator typically generates five to ten times greater variance than simple random sampling (Salganik, 2006). Moreover, McCreesh et al. (2012) conducted a RDS study of male household heads in rural Uganda, where the true population data were known, and found that only one-third of RDS estimates outperformed the raw proportions in the RDS sample, and that only 50–74% of RDS 95% confidence intervals, calculated with a bootstrap approach for RDS, included the true population proportion.
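To make the scale of that variance inflation concrete, a rough worked example follows. It assumes a hypothetical sample of $n = 500$ respondents and treats the variance ratio relative to simple random sampling as a design effect (deff); neither the sample size nor this framing comes from the studies cited above.

$n_{\text{eff}} = n / \text{deff}$, so $500 / 5 = 100$ and $500 / 10 = 50$.

Under these assumptions, 500 RDS respondents would carry about as much information as a simple random sample of 50 to 100 respondents.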

For the above reasons, there has been increasing interest in developing new RDS estimators to improve the performance of RDS. For example, Gile (2011) developed a successive-sampling-based estimator for RDS that relaxes the assumption of sampling with replacement and demonstrated its superior performance when the size of the population is known. Lu et al. (2013) proposed new estimators for RDS on directed networks, given a known difference in in-degree between the estimated groups. Both of the above estimators can be used as a sensitivity test when the required population parameters are not known.

The traditional RDSI and RDSII estimators, as well as the estimators newly developed by Gile (2011), Gile and Handcock (2011) and Lu et al. (2013), all use the same information collected by standard RDS practice, that is, the recruitment matrix S, and the degree and studied properties of each respondent in the sample. There is, however, scope to improve estimates dramatically if data on the composition of respondents’ ego networks can be put to use. Such data has already been collected for other purposes in many RDS studies. For example, in a RDS study of MSM in Campinas City, Brazil, by de Mello et al. (2008), respondents were asked to describe the percentage of their friends/acquaintances with certain characteristics, such as disclosure of sexual orientation to family, HIV status, and the like. In a RDS study of opiate users in Yunnan, China, information about supporting, drug-using, and sexual behaviors between respondents and their network members was collected (Li et al., 2011). One of the most thorough RDS studies utilizing ego network information was done by Rudolph et al. (2011), who asked the respondents to provide extensive characteristics for each alter within their personal networks, such as demographic characteristics, history of incarceration, drug injection, and crack and heroin use.

Aiming to improve the RDS estimator, we will focus on how to integrate this additional information in the estimation process to generate improved population estimates. The rest of this paper is organized as follows. In Section 2, we develop a new estimator that integrates traditional RDS data with egocentric data; in Section 3, we describe network data used for simulation and study design; in Section 4, we evaluate the performance of the new estimator by simulated RDS processes under various settings; and in Section 5, we summarize and draw our conclusions.

Section snippets

RDSI^ego: estimator for RDS with egocentric data

The ego networks from a RDS sample differ from the general egocentric data collected in many sociological surveys (Britton and Trapman, 2012, Everett and Borgatti, 2005) in that each “ego” is connected with (recruited by) its recruiter. For example, in the partial RDS chain illustrated in Fig. 1, participants $v_{i}$, $v_{j}$ and $v_{k}$ are asked to provide their personal network compositions, and $v_{j}$ and $v_{k}$ are recruited by $v_{i}$ and $v_{j}$, respectively.

For each respondent $v_{i}$ in a RDS sample $U = \{v_{1}, v_{2}, \dots, v_{n}\}$, let $n_{i}^{A}$, $n_{i}$

Network data

In this paper we use both an anonymized empirical social network and simulated networks to evaluate the performance of the newly proposed estimator. The empirical network, previously analyzed in Lu et al. (2012, 2013) and Rybski et al. (2009), comes from the Nordic region's largest and most active web community for homosexual, bisexual, transgender, and queer persons. Nodes of the network are website members who identify themselves as homosexual males, and links are friendship

Estimates of network link types

The difference between RDSI and RDSI^ego lies in the estimation of the recruitment matrix S. As a first step, we therefore simulate the RDS process with random recruitment ($p_{A}^{diff} = 0$) and differential recruitment ($p_{A}^{diff} = 1$) and then estimate the proportion of type $e_{A \to B}$ links in the population, $s_{AB}^{*}$, by both the raw sample recruitment proportion, $s_{AB}$, and the proposed ego-network-based estimator, ${\hat{s}}_{AB}^{ego}$, for each of the four variables in the MSM network: age, ct, cs and pf.
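The contrast between the two ways of estimating $s_{AB}^{*}$ can be illustrated with a short sketch. The definition of ${\hat{s}}_{AB}^{ego}$ is not included in this excerpt, so the ego-based rule below is an illustrative assumption and not the paper's formula: it averages, over group A respondents, the fraction of their reported alters who belong to group B. The function names and input layout are hypothetical.

```python
def raw_recruitment_proportion(in_a, recruitments):
    """Share of recruitments made by group A respondents whose recruit is in B.

    recruitments: list of (recruiter_index, recruit_index) pairs.
    """
    by_a = [(r, c) for r, c in recruitments if in_a[r]]
    return sum(1 for _, c in by_a if not in_a[c]) / len(by_a)


def ego_link_proportion(in_a, alters_in_b, alters_total):
    """Assumed ego-based rule: mean reported fraction of B alters among A egos.

    alters_in_b[i]: number of respondent i's alters reported to be in group B.
    alters_total[i]: number of alters respondent i reported on.
    Respondents who reported no alters are skipped.
    """
    fractions = [
        b / n
        for a, b, n in zip(in_a, alters_in_b, alters_total)
        if a and n > 0
    ]
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    in_a = [True, False, True, True]
    recruitments = [(0, 1), (0, 2), (2, 3)]
    alters_in_b = [1, 0, 3, 2]
    alters_total = [2, 5, 4, 4]
    print(raw_recruitment_proportion(in_a, recruitments))
    print(ego_link_proportion(in_a, alters_in_b, alters_total))
```

The raw proportion depends only on whom respondents chose to recruit, so it shifts when recruitment is differential. The ego-based figure uses what respondents report about all their alters, which is why it may be less sensitive to recruitment preference.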

An example of the

Conclusion and discussion

Ego network data has been collected for decades and exists largely in sociological surveys (Britton and Trapman, 2012, Everett and Borgatti, 2005, Handcock and Gile, 2010, Newman, 2003, Mizruchi and Marquis, 2006, Marsden, 2002, Hanneman and Riddle, 2005); the RDS sampling mechanism further makes it possible to collect “linked-ego network” data. By combining RDS recruitment trees with ego networks, this study developed a new estimator, RDSI^ego, for RDS studies. Given that participants can

Acknowledgement

The author would like to thank Professor Fredrik Liljeros and Dr. Linus Bengtsson for helpful discussions. This work has been partially funded by Riksbankens Jubileumsfond (The Bank of Sweden Tercentenary Foundation).

References (49)

R.S. Burt. Network items and the general social survey. Social Networks (1984)
M. Everett et al. Ego network betweenness. Social Networks (2005)
A. Marin. Are respondents more likely to list alters with certain characteristics? Implications for name generator data. Social Networks (2004)
P.V. Marsden. Egocentric and sociocentric measures of network centrality. Social Networks (2002)
U. Matzat et al. Does the online collection of ego-centered network data reduce data quality? An experimental comparison. Social Networks (2010)
M.S. Mizruchi et al. Egocentric, sociocentric, or dyadic? Identifying the appropriate level of analysis in the study of organizational networks. Social Networks (2006)
M.E.J. Newman. Ego-centered networks and the ripple effect. Social Networks (2003)
M. Salganik et al. The game of contacts: estimating the social visibility of groups. Social Networks (2011)
R. Toivonen et al. A comparative study of social network models: network evolution models and nodal attribute models. Social Networks (2009)
L. Bengtsson et al. Implementation of web-based respondent-driven sampling among men who have sex with men in Vietnam. PLoS ONE (2012)
L. Bengtsson et al. Global HIV surveillance among MSM: is risk behavior seriously underestimated? AIDS (2010)
T. Britton et al. Inferring global network properties from egocentric data with applications to epidemics (2012)
E. Deaux et al. Key informant versus self-report estimates of health-risk behavior. Evaluation Review (1985)
M. de Mello et al. Assessment of risk factors for HIV infection among men who have sex with men in the metropolitan area of Campinas City, Brazil, using respondent-driven sampling (2008)
B.H. Erickson. Some problems of inference from chain data. Sociological Methodology (1979)
K.J. Gile. Improved inference for respondent-driven sampling data with application to HIV prevalence estimation. Journal of the American Statistical Association (2011)
K.J. Gile et al. Respondent-driven sampling: an assessment of current methodology. Sociological Methodology (2010)
K.J. Gile et al. Network model-assisted inference from respondent-driven sampling data (2011)
M. Gjoka et al. Walking in Facebook: a case study of unbiased sampling of OSNs
S. Goel et al. Assessing respondent-driven sampling. Proceedings of the National Academy of Sciences of the United States of America (2010)
M.S. Handcock et al. Modeling social networks from sampled data. Annals of Applied Statistics (2010)
R.A. Hanneman et al. Introduction to Social Network Methods (2005)
M.H. Hansen et al. On the theory of sampling from finite populations. Annals of Mathematical Statistics (1943)
D.D. Heckathorn. Respondent-driven sampling: a new approach to the study of hidden populations. Social Problems (1997)

