Abstract

This paper derives closed-form solutions for the -and- shape parameters associated with the Tukey family of distributions based on the method of percentiles (MOP). A proposed MOP univariate procedure is described and compared with the method of moments (MOM) in the context of distribution fitting and estimating skew and kurtosis functions. The MOP methodology is also extended from univariate to multivariate data generation. A procedure is described for simulating nonnormal distributions with specified Spearman correlations. The MOP procedure has an advantage over the MOM because it does not require numerical integration to compute intermediate correlations. Simulation results demonstrate that the proposed MOP procedure is superior to the MOM in terms of distribution fitting, estimation, relative bias, and relative error.

1. Introduction

The Tukey -and- families of univariate and multivariate nonormal distributions are commonly used for distribution fitting, modeling events, random variable generation, and other applied mathematical contexts such as operational risk, extreme oceanic wind speeds, common stock returns, and solar flare data. See [117].

The family of univariate -and- distributions can be summarized as follows: where is an i.i.d. standard normal random variable with probability density function (pdf), , and cumulative distribution function (cdf), . The transformations in (1)–(3) are strictly monotone increasing functions with real-valued constants and that produce distributions defined as (i) asymmetric -and- (, ), (ii) log-normal (, ), and (iii) symmetric (), respectively. The constant ± controls the skew of a distribution in terms of both direction and magnitude. Taking the negative of will change the direction of the skew but not its magnitude; that is, . The constant controls the tail-weight of a distribution where the function (i) preserves symmetry, (ii) is increasing for and , and (iii) produces increased tail-weight as the value of becomes larger. In summary, (1)–(3) are computationally efficient for the purpose of generating nonormal distributions because they only require the specification of one or two shape parameters () and an algorithm that produces standard normal random deviates.

The values of and associated with (1)–(3) can be determined from either the method of moments (MOM), for example, [8, 10, 13], or the method of percentiles (MOP), for example, [9, 18]. However, these two methods have disadvantages. Specifically, in the context of the MOM, the estimates of conventional skew and kurtosis associated with heavy-tailed or -skewed distributions can be substantially biased, have high variance, or can be influenced by outliers; see, for example, [1924]. In terms of the MOP, the primary disadvantage associated with the -and- procedure is the laborious “multiplicative iterative” approach needed to determine the constant for symmetric and asymmetric nonormal -and- distributions; see [9, p. 484–489].

On the other hand, the MOP approach described by Karian and Dudewicz [2427] in the context of the generalized lambda distribution (GLD) has demonstrated to be an attractive and computationally efficient alternative to the MOM in terms of distribution fitting and computing the GLD shape parameters. Further, it has been demonstrated that the MOP is superior to the MOM over a broad range of combinations of skew and kurtosis for fitting GLDs to theoretical or empirical distributions (see [24]).

A procedure for simulating correlated -and- distributions with a specified correlation structure based on (1)–(3) is described by Kowalchuk and Headrick [10] and Headrick [28]. Specifically, the -and- procedure makes use of the popular NORTA (NORmal To Anything) approach. That is, the procedure begins with generating multivariate standard normal deviates prior to transformation. However, one limitation arises because the Pearson correlation is not invariant under nonlinear strictly increasing transformations. This is a concern because the transformations in (1)–(3) have this characteristic, for example, [8, 28]. Thus, the initial multivariate normal correlation structure used in the NORTA approach will not be maintained subsequent to any of the transformations in (1)–(3). As such, the NORTA procedure must begin with the computation of an intermediate correlation (IC) matrix, which is different from the specified correlation matrix between the nonormal distributions. The purpose of the IC matrix is to adjust for the non-normalization effect of a transformation such that the -and- procedure can generate nonormal distributions with a specified correlation matrix. Further, in the context of the -and- procedure, NORTA also requires numerical integration techniques to solve for the IC matrix, see [28, p. 143], which can be computationally expensive.

In contradistinction, the Spearman correlation has comparative advantages over the Pearson correlation in the context of the -and- procedure for computing ICs. Specifically, the transformations in (1)–(3) must be strictly monotone increasing functions to produce valid pdfs. Thus, subsequent to transformation, the rank order of , , remains the same as the rank order of , ; that is, the Spearman correlation remains unchanged. Moreover, there is a straightforward equation that can be used to directly solve for all pairwise ICs, for example, [28, p. 114], and thus does not require the numerical integration techniques associated with the conventional product MOM -and- procedure.

In view of the above, the present aim is to obviate the problems associated with the MOM in the context of the family of -and- distributions in (1)–(3) by characterizing these distributions through the MOP as described in Karian and Dudewicz [27, p. 172-173]. Specifically, the purpose of this paper is to obviate the laborious “multiplicative iterative” technique given in [9] and develop the methodology and a procedure for simulating -and- distributions with specified medians (), interdecile ranges (), left-right tail-weight ratios (, a skew function), and tail-weight factors (, a kurtosis function). In terms of simulating multivariate -and- distributions, the Spearman correlation will be used in lieu of the Pearson correlation. In summary, the advantages of the proposed MOP procedure are that (i) the MOP parameters (, ) exist for any distribution, whether the mean exists or not, for example, [25]; (ii) there are less relative bias and less relative standard error when juxtaposed with the MOM procedure; (iii) there are closed-form solutions for the and constants; (iv) there is a straightforward equation to obtain pairwise ICs for the purpose of simulating correlated nonormal -and- distributions in (1)–(3).

The remainder of the paper is outlined as follows. In Section 2, a summary of the univariate -and- distributions based on the MOM is provided. In Section 3, the requisite information associated with the MOP is provided. Further, the systems of equations for determining the closed-form solutions of the and constants are subsequently derived for simulating univariate nonormal distributions with specified values of . In Section 4, a comparison of the MOM and the MOP is provided by fitting the SPSS data from [29]. In Section 5, the methodology for simulating correlated nonormal -and- distributions with the specified Spearman correlation matrices is provided. In Section 6, the steps for implementing the proposed MOP procedure are described. A numerical example and results of a simulation are also provided to confirm the derivations and compare the proposed procedure with the MOM procedure. In Section 7, the results of the simulation are discussed.

2. Preliminaries for the Tukey Family of -and- Distributions

2.1. The Tukey Family of -and- Distributions Based on the MOM

The requirement that in (1)–(3) be a strictly monotone increasing function implies that an inverse function exists and thus , where is the general form of the cdf for both the MOM and the MOP. Differentiating both sides with respect to yields , where is the general form of the pdf for both the MOM and the MOP. Hence, . Whence, the pdf integrates to one because is the standard normal pdf and will be nonnegative on its support for given that and where for the transformations in (1)–(3).

The and/or constants associated with (1)–(3) that determine the shape of a distribution are computed using a moment-matching approach that involves the conventional measures of the mean (), variance (), skew (), and kurtosis (). Specifically, the values of and in (1) are determined by simultaneously solving (6) and (7) for specified values of skew and kurtosis (e.g., [28, p. 125, Eqs. 5.16-5.17], and [10, (A1)–(A4)]):

The mean and the variance for a -and- distribution can be determined by evaluating (4) and (5) using the solutions of and obtained from (6) and (7). For any -and- distribution, the th moment will exist if (see [8]). As such, (4)–(7) exist if the first four moments are defined. That is, . Note that the equations of for distributions in (2) and distributions in (3) can be obtained from [10, Eqs. (A5)–(A8)] and [10, Eqs. (A9)–(A12)], respectively.

3. The Percentile Based -and- Family of Distributions

3.1. General Considerations

The percentiles () associated with a conventional based -and- pdf can be obtained by making use of the standard normal cdf, . As such, we will define the following location, scale, and shape parameters as in [27, p. 172-173] where (8) are the (i) median, (ii) interdecile range, (iii) left-right tail-weight ratio (a skew function), and (iv) tail-weight factor (a kurtosis function), respectively. The parameters in (8) are defined to have the restrictions and where a symmetric distribution implies that .

3.2. The Tukey Family of -and- Distributions Based on the MOP

The derivation of a percentile based system of -and- pdfs begins by substituting the standard normal distribution percentiles () into the quantile functions in (1)–(3) and making use of (8) gives where , , from the standard normal distribution. Note from symmetry that and . The explicit forms of (10) are

Simultaneously, solving for the coefficients in (13) and (14) gives the convenient closed-form expressions For a symmetric distribution (), the closed-form solution for can be expressed as

Estimates of based on the percentiles in (10) for a sample of size can be determined by finding the and integer values and their corresponding expected values of the order statistics and , by making use of the following equation (see [30, 31]): such that and subsequently solve the equation for . Thus, an empirical estimate of can then be obtained based on the order statistics of a sample of size as .

4. A Comparison of the MOM and the MOP on Distribution Fitting

Presented in Figure 1 are the MOM and the MOP pdfs superimposed on the histogram of the SPSS data from [29]. This is a data file that concerns the company’s efforts to use the information in its data warehouse to make special offers for customers who are most likely to reply. Specifically, these data are the amount each customer spent on their primary credit card in the last month. The parameters (, ) associated with Figure 1 were based on a sample size of participants. The bootstrap estimates, confidence intervals, and relative standard errors of (, ) are also provided in Figure 1. Note that to fit the -and- distributions to the data, a linear transformation has to be imposed on for both the MOM and the MOP procedures. Specifically, in the context of the MOM, and . In the context of the MOP, and . The values of the sample and the theoretical means (), standard deviations (), medians (med, ), and interdecile range (IDR, ) for the data and -and- pdfs are given in Figure 1.

Visual inspection of the approximation in Figure 1 indicates that the MOP pdf provides a more accurate fit to the actual data over the MOM. Further, have more precision than because the relative standard errors (RSEs) of are much less than those of . To compare the accuracy of the data fitting for the MOP and the MOM, the Euclidean distances are calculated as , where is the observed proportion in each interval and is the expected proportion in each interval for both the MOP and the MOM. From Table 1, the MOP has a more accurate data fit because the Euclidean distance () of the MOP is less than that of the MOM.

5. The Spearman Correlations for the System of the MOP

We assume that the variates and in (1)–(3) produce valid pdfs and are thus increasing monotonic transformations in and . This implies that the rank orders of () and () are identical and thus will have rank correlations of .

Given these assumptions, suppose it is desired to simulate a -variate distribution from the quantile functions in (1)–(3) with a specified Spearman correlation matrix and where each distribution has specified and . Let and have univariate normal pdfs and , the Pearson correlation , and standard normal bivariate density of The correlation can be obtained from the derivation of given in [32]. That is, because (1)–(3) are strictly increasing monotonic transformations, and thus we have from Headrick [28, Eq. 4.34] where is the specified Spearman correlation and is the intermediate correlation (IC). For a specified value of and a finite sample size , the value of the IC, , can be obtained by numerically solving (21).

6. The Procedure for Simulation and Monte Carlo Study

To implement the method for simulating -and-, , distributions with specified , and the Spearman correlations, we suggest the following six steps.(1)Specify the values of and for the transformations of the forms in (1)–(3), and obtain the constants of and/or by solving (15)-(16) using the specified values of and for each distribution. Specify a matrix of the Spearman correlations between and , where .(2)Compute the ICs by substituting the solutions of the constants from Step 1 into (21) and then solve for . Repeat this step separately for all pairwise combinations of correlations.(3)Assemble the ICs into a matrix and decompose this matrix using a Cholesky factorization. Note that this step requires the IC matrix to be positive definite.(4)Use the results of the Cholesky factorization from Step 3 to generate standard normal variables () correlated at the intermediate levels as follows: where are independent standard normal random variables and represents the element in the th row and the th column of the matrix associated with the Cholesky factorization performed in Step 3.(5)Substitute from step 4 into equations of the form in (1)–(3), as noted in Step 1, to generate the and/or distributions with the specified values of and and the Spearman correlations.

To demonstrate the previous steps and evaluate the proposed procedure, a comparison between the MOP and the MOM procedures is subsequently described. Specifically, the distributions in Figure 2 are used as a basis for a comparison using the specified correlation matrices in Table 2. Tables 3, 4, and 5 give the solved IC matrices for the MOM and the MOP procedures with samples of sizes 25 and 750, respectively. Note that the ICs for the MOM were computed by using the Mathematica source code as in [10, Table 1] and the ICs for the MOP are based on (21). Tables 6, 7, and 8 give the results of the Cholesky decompositions on the IC matrices, which are then used to create with the specified ICs by making use of the formulae given in (22) of step 4 with . The values of are subsequently substituted into equations of the form in (1)–(3) to produce for both procedures.

In terms of the simulation, a Fortran algorithm was written for both methods to generate 25,000 independent sample estimates for the specified parameters of (i) conventional skew (), kurtosis (), and the Pearson correlation; (ii) left-right tail-weight ratio (), tail-weight factor (), and the Spearman correlation. All estimates were based on sample sizes of and . The formulae used for computing estimates of were based on Fisher’s -statistics, that is, the formulae currently used by most commercial software packages such as SAS, SPSS, and Minitab, for computing indices of skew and kurtosis (where for the standard normal distribution). The formulae used for computing estimates of were based on (10). Note that the estimates of percentiles were based on (17). The estimate for the Pearson correlations were based on the usual formula for the Pearson product-moment of correlation statistic and the estimate for the Spearman correlations were computed based on usual formula for the Spearman rank of correlation statistic. The estimates for the Pearson and Spearman correlations were both transformed using Fisher’s transformation. Bias-corrected accelerated bootstrapped average (median) estimates, confidence intervals (CIs), and standard errors were subsequently obtained for the estimates associated with the parameters using 10,000 resamples via the commercial software package Spotfire S+ [33]. The bootstrap results for the estimates of the medians and CIs associated with the Pearson and Spearman correlations were transformed back to their original matrices (i.e., estimates for the Pearson and Spearman correlations). Further, if a parameter () was outside its associated bootstrap CI, then an index of relative bias (RB) was computed for the estimate () as . Note that the small amount of bias associated with any bootstrap CI containing a parameter was considered negligible and thus not reported. The results of the simulation are reported in Tables 9, 10, 11, and 12 and are discussed in the next section.

7. Discussion and Conclusion

One of the advantages that the MOP has over the MOM is that the MOP can be much less biased when sampling is from distributions with more severe departures from normality. Moreover, inspection of the simulation results in Tables 9 and 12 clearly indicates this result. That is, the superiority that the MOP estimates () have over their corresponding MOM counterparts (). For example, with samples size of , the estimates of skew and kurtosis for Distribution 3 (heavy-skewed and heavy-tailed) were, on average, 82.37% and 99.65% below their associated population parameters, whereas the estimates of and were 3.02% and 11.06% over their respective parameters. It is also evident from comparing Tables 9 and 12 that and are more efficient estimators as their relative standard errors are considerably smaller than those of the MOM estimators of skew and kurtosis. For example, with samples size of , in terms of Distribution 3, inspection of Tables 9 and 12 indicates RSE measures of RSE( and RSE( compared with RSE and RSE(. This demonstrates that and have more precision because they have less variance around their estimates.

Presented in Tables 13, 14, 15, and 16 are the results associated with the conventional Pearson and the Spearman correlations. Overall inspection of these tables indicates that the Spearman correlation is superior to the Pearson correlation in terms of RB. For example, with samples of size , the RB for the two heavy-tailed distributions (i.e., distributions 3 and 4) was 9.66% for the Pearson correlation compared with only 2.29% for the Spearman correlation. Further, for large sample sizes (), the Spearman correlation bootstrap CIs contained most of the population parameters, whereas the Pearson correlation CIs contained none of the parameters. It is also noted that the variability of the Spearman correlation appears to be more stable than that of the Pearson correlation both within and across the different conditions.

In summary, the proposed MOP procedure is an attractive alternative to the traditional MOM procedure. In particular, the MOP procedure has distinct advantages when distributions with large departures from normality are used.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.