Analysis of complex survey data using SAS

https://doi.org/10.1016/S0169-2607(00)00088-2

Abstract

Commonly used statistical methods and software packages typically assume that observations are independent and identically distributed and fail to account for complex sampling designs when present. I suggest an approach to analyzing complex survey data in SAS, using weighted generalized estimating equations. Limited Monte Carlo simulations support the method. An example demonstrates application of the method and compares results to those from software commonly used in the analysis of complex survey data.

Introduction

Data analyses that ignore elements of a complex sampling design, such as clustering and unequal sample weights, may lead to dubious point estimates and measures of dispersion [1], [2]. These misleading results arise because the observations in complex survey data are dependent and unequally weighted. Commonly used statistical programs, such as SAS, assume that observations are independent and identically distributed (iid). Using a set of limited and extreme Monte Carlo simulations, I demonstrate that analysis using weighted generalized estimating equations (GEE) [3], available in SAS Version 6.12 or later, provides correct point estimates and variances. In addition, I demonstrate the method by example using the Duke Established Populations for Epidemiologic Study of the Elderly (EPESE) cohort.

Model

Consider the standard logistic regression model, logit Pr[Y=1]=β0+β1x, where the logit of the response probability, log{Pr[Y=1]/(1−Pr[Y=1])}, is a linear function, β0+β1x, of the exposure x, with intercept β0 and slope β1. This model is fit using the SAS code:
    PROC GENMOD; MODEL Y=X/D=B;
This invokes the generalized linear model procedure, models Y as the outcome and X as the exposure, and specifies a binomial distribution. The default link is the canonical link, which is the logit for a binomial distribution, hence a
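The model statement above treats observations as independent and equally weighted. As a minimal sketch only, and not necessarily the author's exact code, a weighted GEE version of the same logistic model is commonly requested in PROC GENMOD by adding WEIGHT and REPEATED statements. The data set name survey and the variables cluster (sampling cluster identifier) and wt (sample weight) are hypothetical, and the independence working correlation (TYPE=IND) is an assumption made for illustration.

```sas
/* Sketch only: data set and variable names are hypothetical. */
PROC GENMOD DATA=survey;
  CLASS cluster;                        /* cluster identifier must be a CLASS variable */
  WEIGHT wt;                            /* sample weight (assumed variable name)       */
  MODEL Y = X / D=B;                    /* same logistic model as in the text          */
  REPEATED SUBJECT=cluster / TYPE=IND;  /* GEE; independence working correlation is an assumption */
RUN;
```

With a REPEATED statement, PROC GENMOD reports empirical (robust) standard errors for the GEE parameter estimates, which is what allows the variance to reflect within-cluster dependence.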

Monte Carlo simulation

I generated 1000 samples of 500 clustered pairs for each of eight scenarios (true odds ratio 1 or 2, sample weight effect present or absent, intra-cluster correlation 0 or 1), providing a limited and extreme set of hypothetical simulations. The odds ratio was set by randomly generating a binary exposure with probability 0.5 and then randomly generating the binary outcome, conditional on the exposure when the odds ratio was 2; throughout, the probability of the outcome was 25%. The sample weight
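The data-generating steps described so far can be sketched in a SAS DATA step. This is a hypothetical illustration for a single sample, not the author's simulation code, and it rests on several assumptions that the text does not confirm: the 25% outcome probability is taken to apply to the unexposed group, an intra-cluster correlation of 1 is implemented by giving both members of a pair identical exposure and outcome, and the sample weight effect is left out entirely. The data set name sim, all variable names, and the seed are invented for the example. The RAND function and CALL STREAMINIT belong to SAS 9 and later; SAS 6.12 would need an older generator such as RANUNI.

```sas
/* Hypothetical sketch: one simulated sample of 500 clustered pairs. */
DATA sim;
  CALL STREAMINIT(20001);            /* arbitrary seed                                   */
  or_true = 2;                       /* true odds ratio: 1 or 2                          */
  icc     = 1;                       /* intra-cluster correlation: 0 or 1                */
  p0      = 0.25;                    /* ASSUMPTION: outcome probability in the unexposed */
  odds1   = or_true * p0 / (1 - p0); /* outcome odds among the exposed                   */
  p1      = odds1 / (1 + odds1);     /* outcome probability among the exposed            */
  DO cluster = 1 TO 500;
    DO member = 1 TO 2;
      /* Draw new values for the first member, or for both members when icc = 0. */
      IF member = 1 OR icc = 0 THEN DO;
        x = RAND('BERNOULLI', 0.5);
        IF x = 1 THEN y = RAND('BERNOULLI', p1);
        ELSE y = RAND('BERNOULLI', p0);
      END;
      /* ASSUMPTION: when icc = 1, member 2 keeps member 1's x and y. */
      OUTPUT;
    END;
  END;
  KEEP cluster member x y;
RUN;
```

Under these assumptions an odds ratio of 2 gives an outcome probability of 0.4 among the exposed (odds of 2 × 1/3 = 2/3). Wrapping the inner loops in an outer replicate loop and fitting the model BY replicate would extend the sketch to 1000 samples.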

Example

The Duke EPESE study (N=4162) was one of four NIA-funded studies that randomly sampled community-dwelling men and women 65 years of age or older in 1982 to identify predictors of mortality, hospitalization, and placement in long-term care facilities [4]. The sampling universe was all noninstitutionalized persons 65 years of age and older (i.e., elders) living in five counties of the north-central Piedmont area of North Carolina. The first stage of the four-stage sampling design was area sampling

Comment

In summary, I demonstrated the use of SAS for analysis of complex survey data. There are three important advantages to using SAS for analysis of complex surveys. First, there is no need to purchase and learn new software, such as SUDAAN or WesVar. Second, the procedure is relatively easy to implement. Third, in addition to accurate estimation of the odds ratio from a logistic regression model as demonstrated, PROC GENMOD fits many variations of generalized linear models. Admittedly, there are

References (5)

  • W.G. Cochran, Sampling Techniques (1977)
  • P.S. Levy et al., Sampling of Populations: Methods and Applications (1991)
There are more references available in the full text version of this article.
