Article

A Family of Finite Mixture Distributions for Modelling Dispersion in Count Data

1 Institute of Actuarial Science and Data Analytics, UCSI University, Kuala Lumpur 56000, Malaysia
2 Institute of Mathematical Sciences, University of Malaya, Kuala Lumpur 50603, Malaysia
3 School of Mathematical Sciences, University of Nottingham Malaysia, Semenyih 43500, Malaysia
4 Faculty of Science and Technology, University of Canberra, Bruce, ACT 2617, Australia
5 Department of Mathematics and Statistics, University of Victoria, Victoria, BC V8W 3R4, Canada
6 Department of Medical Research, China Medical University Hospital, China Medical University, Taichung 40402, Taiwan
7 Center for Converging Humanities, Kyung Hee University, 26 Kyungheedae-ro, Dongdaemun-gu, Seoul 02447, Republic of Korea
8 Department of Mathematics and Informatics, Azerbaijan University, 71 Jeyhun Hajibeyli Street, AZ1007 Baku, Azerbaijan
9 Section of Mathematics, International Telematic University Uninettuno, I-00186 Rome, Italy
* Authors to whom correspondence should be addressed.
Stats 2023, 6(3), 942-955; https://doi.org/10.3390/stats6030059
Submission received: 3 August 2023 / Revised: 12 September 2023 / Accepted: 13 September 2023 / Published: 18 September 2023
(This article belongs to the Special Issue Advances in Probability Theory and Statistics)

Abstract

This paper considers the construction of a family of discrete distributions with the flexibility to cater for under-, equi- and over-dispersion in count data using a finite mixture model based on standard distributions. We are motivated to introduce this family because its simple finite mixture structure adds flexibility and facilitates application and use in analysis. The family of distributions is exemplified using a mixture of negative binomial and shifted negative binomial distributions. Some basic and probabilistic properties are derived. We perform hypothesis testing for equi-dispersion and simulation studies of their power and consider parameter estimation via maximum likelihood and probability-generating-function-based methods. The utility of the distributions is illustrated via their application to real biological data sets exhibiting under-, equi- and over-dispersion. It is shown that the distribution fits better than the well-known generalized Poisson and COM–Poisson distributions for handling under-, equi- and over-dispersion in count data.

1. Introduction

In the statistical analysis of count data, an important feature is whether the data are under-, equi- or over-dispersed. Many models cater for one particular type of dispersion. In contrast, few distributions have the flexibility to model under-, equi- and over-dispersion. Examples of such distributions are the hyper-Poisson [1], double Poisson [2] and weighted Poisson [3] distributions. Among flexible distributions of this kind, two popular choices are the generalized Poisson distribution [4] and the COM–Poisson distribution [5,6,7,8].
The generalized Poisson distribution (GPD) has a probability mass function (pmf)
$$P(X = x) = \begin{cases} \dfrac{\theta(\theta + x\lambda)^{x-1} e^{-\theta - x\lambda}}{x!}, & x = 0, 1, 2, \ldots, \\ 0, & x > m \text{ when } \lambda < 0, \end{cases}$$
where $\theta > 0$, $\max(-1, -\theta/m) < \lambda \le 1$ and $m$ ($\ge 4$) is the largest positive integer satisfying $\theta + m\lambda > 0$ when $\lambda$ is negative. For $\lambda < 0$, the probabilities do not sum to 1. The GPD is under-, equi- or over-dispersed when $\lambda$ is negative, zero or positive, respectively. If $\lambda = 0$, the Poisson distribution is obtained. However, for under-dispersion, the parameters must be restricted for the GPD to be a proper distribution [9]; see also [10] for further discussion.
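The remark that the GPD probabilities do not sum to 1 for negative $\lambda$ can be illustrated numerically. The following is a minimal sketch, assuming the pmf as written above; the function names and the parameter values ($\theta = 2$, $\lambda = -0.4$, so that $m = 4$) are illustrative choices, not taken from the paper.

```python
import math

def gpd_pmf(x, theta, lam):
    """Generalized Poisson pmf, set to 0 once theta + x*lam <= 0 (i.e. x > m)."""
    base = theta + x * lam
    if base <= 0:
        return 0.0
    # Work in log-space to avoid overflow of base**(x - 1) for large x.
    log_prob = (math.log(theta) + (x - 1) * math.log(base)
                - base - math.lgamma(x + 1))
    return math.exp(log_prob)

def gpd_total_mass(theta, lam, x_max=200):
    """Sum of the pmf over x = 0, ..., x_max (an illustrative truncation)."""
    return sum(gpd_pmf(x, theta, lam) for x in range(x_max + 1))

# Illustrative values: lambda < 0 (truncated support) versus 0 < lambda < 1.
mass_negative = gpd_total_mass(2.0, -0.4)
mass_positive = gpd_total_mass(2.0, 0.3)
print(mass_negative, mass_positive)
```

With these values the total mass for $\lambda = -0.4$ falls slightly short of 1, whereas for $\lambda = 0.3$ it equals 1 up to rounding.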
The COM–Poisson distribution arises from a single queue–single server system with state-dependent service rates [5]. Recently, this distribution has attracted a lot of interest and has been applied in many areas [6]. Refs. [11,12] considered the COM–Poisson distribution in the context of modelling time series of counts. The COM–Poisson distribution is a member of the exponential family of distributions and confers several desirable properties for statistical inference. The COM–Poisson distribution has a pmf given by
$$P(X = x) = \frac{\lambda^{x}}{(x!)^{\nu}\, Z(\lambda, \nu)}, \quad x = 0, 1, 2, \ldots,$$
where $Z(\lambda, \nu) = \sum_{j=0}^{\infty} \lambda^{j} / (j!)^{\nu}$ for $\lambda > 0$ and $\nu \ge 0$. The COM–Poisson distribution is under-dispersed or over-dispersed when $\nu > 1$ or $\nu < 1$, respectively. The case $\nu = 1$ gives the Poisson distribution. The COM–Poisson distribution cannot be directly expressed in terms of its mean. To facilitate regression modelling, ref. [13] considered a mean-parametrized COM–Poisson model based on an iterative solution of a mean equation. Compared with over-dispersion, there is relatively less discussion in the literature about under-dispersion in count data. Ref. [14] reviewed under-dispersed count models and various sources of data under-dispersion. Ref. [15] discussed count models that are arbitrarily under-dispersed and showed that the mean-parametrized COM–Poisson distribution can handle arbitrarily small under-dispersion.
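Because $Z(\lambda, \nu)$ is an infinite series, it has to be approximated in practice. A minimal sketch, assuming simple truncation of the series at a fixed number of terms (an illustrative choice, not the method of any cited reference), shows how the index of dispersion moves below and above 1 as $\nu$ crosses 1.

```python
import math

def com_poisson_pmf(lam, nu, terms=200):
    """COM-Poisson probabilities for x = 0..terms-1 with a truncated Z(lam, nu)."""
    # log of lam^j / (j!)^nu, computed in log-space for numerical stability
    log_w = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(terms)]
    shift = max(log_w)
    weights = [math.exp(v - shift) for v in log_w]
    z = sum(weights)  # truncated normalizing constant (up to the common shift)
    return [w / z for w in weights]

def index_of_dispersion(pmf):
    """Variance-to-mean ratio of a pmf given as a list over 0, 1, 2, ..."""
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
    return var / mean

for nu in (0.5, 1.0, 2.0):
    print(nu, index_of_dispersion(com_poisson_pmf(2.0, nu)))
```

The truncation point of 200 terms is ample for the moderate $\lambda$ used here; extreme parameter values may require more terms or an asymptotic approximation.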
It is to be noted that the distributions mentioned above have a restricted parameter range (GPD) or normalizing constants in the form of infinite series (double Poisson, COM–Poisson) that may add to the complexity of a statistical analysis. Recently, motivated by these issues, a particular case of the generalized shifted inverse trinomial (GIT) distribution [16], designated $GIT_{3,1}$ [17], has been proposed in order to cater for under-, equi- or over-dispersion in count data. It is noted that when the GIT is equi-dispersed, the distribution is non-Poisson. The GIT has a simple probabilistic structure as the convolution of negative binomial and binomial distributions. This is apparent from its probability-generating function (pgf), given by
$$G(t) = \left(\frac{1 - p_3}{1 - p_3 t}\right)^{n} \left(\frac{p_1 + p_2 t}{p_1 + p_2}\right)^{n}, \quad 1 - p_3 = p_1 + p_2, \quad n > 0.$$
It is a member of a general family of distributions formed via the convolution of binomial and pseudo-binomial variables [18]. This family has a pgf in the form
$$\left(\frac{1 - Q_1 t}{1 - Q_1}\right)^{U_1} \left(\frac{1 - Q_2 t}{1 - Q_2}\right)^{U_2},$$
and Kemp gave conditions for this to be a valid pgf. Some members of this family of distributions have been considered in the literature.
The pmf of the GIT distribution is given by
$$f_n(x) = \sum_{i=0}^{\min(n, x)} \frac{n}{n + x - i} \binom{n + x - i}{n - i,\; i,\; x - i} p_1^{\,n-i} p_2^{\,i} p_3^{\,x-i}$$
for $x = 0, 1, 2, \ldots$, where $n$ is a positive integer and $p_i \ge 0$ for $i = 1, 2, 3$; $\sum_{i=1}^{3} p_i = 1$.
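The convolution structure of the GIT distribution can be checked numerically against the pmf sum. The sketch below assumes the pmf and pgf as written above, with a binomial component of success probability $p_2/(p_1 + p_2)$ and a negative binomial component with parameter $p_3$; the function names and parameter values are illustrative.

```python
import math

def git_pmf(x, n, p1, p2, p3):
    """GIT pmf from the finite sum with multinomial coefficients."""
    total = 0.0
    for i in range(min(n, x) + 1):
        m = n + x - i
        multinomial = math.factorial(m) // (
            math.factorial(n - i) * math.factorial(i) * math.factorial(x - i))
        total += n / m * multinomial * p1 ** (n - i) * p2 ** i * p3 ** (x - i)
    return total

def git_pmf_by_convolution(x, n, p1, p2, p3):
    """Same pmf as a convolution of Binomial(n, p2/(p1+p2)) and NB(n, p3)."""
    q = p2 / (p1 + p2)
    total = 0.0
    for i in range(min(n, x) + 1):
        binomial = math.comb(n, i) * q ** i * (1 - q) ** (n - i)
        neg_binomial = (math.comb(n + x - i - 1, x - i)
                        * p3 ** (x - i) * (1 - p3) ** n)
        total += binomial * neg_binomial
    return total

# Illustrative parameter values (p1 + p2 + p3 = 1).
n, p1, p2, p3 = 3, 0.5, 0.2, 0.3
for x in range(5):
    print(x, git_pmf(x, n, p1, p2, p3), git_pmf_by_convolution(x, n, p1, p2, p3))
```

The two columns agree term by term, which reflects the identity $\frac{n}{n+x-i}\binom{n+x-i}{n-i,\,i,\,x-i} = \binom{n}{i}\binom{n+x-i-1}{x-i}$.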
The GIT pmf may also be expressed as
$$f_n(x) = \binom{n + x - 1}{x} p_1^{\,n} p_3^{\,x}\; {}_2F_1\!\left(-n, -x; -n - x + 1; -\frac{p_2}{p_1 p_3}\right)$$
in terms of the Gauss hypergeometric function ${}_2F_1$. Other recent efforts to develop distributions that can handle under-, equi- or over-dispersion have been made through COM–Poisson-type extensions of the binomial and negative binomial distributions [19,20,21,22,23,24]. Ref. [25] examined the flexible weighted Poisson distribution, which extends the weighted Poisson of [3] and contains the COM–Poisson and hyper-Poisson distributions as special cases.
Even though the GIT has a simple probabilistic structure as a convolution of binomial and negative binomial random variables and is free from some of the limitations of the other popular distributions, the positive integer index parameter n can be an impediment in parameter estimation. Section 2 proposes a finite mixture model for the construction of a family of distributions that can cater for under-, equi- or over-dispersion. This is exemplified by a finite mixture of negative binomial and shifted negative binomial distributions. The negative binomial distribution is over-dispersed, but the finite mixture exhibits under-, equi- and over-dispersion. Section 2 also presents some basic properties of the finite mixture of negative binomial distributions, while Section 3 discusses parameter estimation and test of equi-dispersion. Empirical modelling of biological count data is considered in Section 4, with concluding remarks presented in Section 5.

2. Finite Mixture of Distributions

Finite mixture models have been widely employed due to their flexibility and computationally convenient representation of complex distributions. Mixture formulations lead to simple techniques in statistical analysis; for example, ref [26] gave a simple method for generating bivariate binomial samples with given marginals and correlation. A good review of this topic is found in [27]. A general family of two-component finite mixture distributions is first established and then a particular case of interest is examined.

2.1. Finite Mixture of Distribution and Its Shifted Distribution

Let $P(k; \Theta)$ be a probability mass function (pmf), where $\Theta$ is the vector of parameters and $k = 0, 1, 2, 3, \ldots$, and let $P(k-1; \Theta)$ be the shifted pmf. Define the pmf $P(k; \Theta, p)$ of a random variable $X$ as
$$P(k; \Theta, p) = (1 - p)\, P(k; \Theta) + p\, P(k - 1; \Theta), \quad k = 0, 1, 2, 3, \ldots, \tag{1}$$
where $P(-1; \Theta) = 0$ and $0 < p < 1$. It is easy to check that
$$\sum_{k=0}^{\infty} P(k; \Theta, p) = 1.$$
If $G(z)$ is the probability-generating function (pgf) of $P(k; \Theta)$, the pgf of $X$ is
$$G_X(z) = (1 - p)\, G(z) + p z\, G(z).$$
Moments of the finite mixture may be obtained from this pgf. Let $\mu$ and $\sigma^2$ be the mean and variance of the component distribution $P(k; \Theta)$. The mean and variance of $X$ with pmf given by Equation (1) are
$$\mu_X = \mu + p, \quad \sigma_X^2 = \sigma^2 + p(1 - p).$$
The index of dispersion $I_D$ is given by
$$I_D = \frac{\sigma_X^2}{\mu_X} = \frac{\sigma^2 + p(1 - p)}{\mu + p}.$$
The distribution of finite mixture (1) is under-, equi- or over-dispersed according to
$$I_D <, =, > 1,$$
that is,
$$\sigma^2 - \mu <, =, > p^2.$$
The sign of $\sigma^2 - \mu$ is negative, zero or positive according to whether the component distribution is under-, equi- or over-dispersed. Since $p^2 > 0$, an equi-dispersed component always gives an under-dispersed finite mixture (1), whereas an over-dispersed component can give an under-, equi- or over-dispersed mixture, depending on how $\sigma^2 - \mu$ compares with $p^2$. This is illustrated in the next section by using a mixture of negative binomial (NB) distributions.
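The construction in Equation (1) and the resulting moments can be sketched for an arbitrary component pmf. The example below is illustrative only: it assumes a Poisson component with mean 2 truncated at 100 terms and a mixing weight $p = 0.4$, neither of which comes from the paper.

```python
import math

def shift_mixture(base_pmf, p):
    """pmf of (1 - p) * P(k) + p * P(k - 1) on k = 0..len-1, with P(-1) = 0."""
    shifted = [0.0] + base_pmf[:-1]
    return [(1 - p) * a + p * b for a, b in zip(base_pmf, shifted)]

def mean_and_variance(pmf):
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
    return mean, var

# Illustrative component: Poisson with mean 2 (equi-dispersed), truncated.
mu = 2.0
poisson = [math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
           for k in range(100)]
p = 0.4
mean_x, var_x = mean_and_variance(shift_mixture(poisson, p))
print(mean_x, var_x, var_x / mean_x)  # mean = mu + p, variance = mu + p(1 - p)
```

For this equi-dispersed component the mixture has $\sigma^2 - \mu = 0 < p^2$, so the index of dispersion is below 1.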

2.2. Finite Mixture of Negative Binomial Distributions

The negative binomial distribution is a popular model that only caters for over-dispersion. However, when used as components in the finite mixture model (1), the resulting mixture model can handle under-, equi- or over-dispersion. The mixture of NB distributions has pmf
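$$P(k) = \begin{cases} (1 - p)(1 - \theta)^{\nu}, & k = 0, \\[4pt] (1 - p) \dfrac{(\nu)_k}{k!} \theta^{k} (1 - \theta)^{\nu} + p\, \dfrac{(\nu)_{k-1}}{(k-1)!} \theta^{k-1} (1 - \theta)^{\nu}, & k \ge 1, \end{cases} \tag{3}$$
where $0 < p < 1$, $0 < \theta < 1$, $\nu > 0$ and $(\nu)_k = \Gamma(\nu + k)/\Gamma(\nu)$. Let $NB(\theta, \nu)$ denote an NB random variable with parameters $\theta$, $\nu$. Thus (3) may be interpreted as follows: an $NB(\theta, \nu)$ chance mechanism contributes a proportion $1 - p$ of the counts $k \ge 0$. However, for non-zero counts where $k \ge 1$, a shifted $NB(\theta, \nu)$ also contributes a proportion $p$.
The pmf (3) can be written as
$$P(k) = (1 - p) \frac{(\nu)_k}{k!} \theta^{k} (1 - \theta)^{\nu} + p\, \frac{k}{\theta(\nu + k - 1)} \frac{(\nu)_k}{k!} \theta^{k} (1 - \theta)^{\nu} = \left[1 - p + \frac{p k}{\theta(\nu + k - 1)}\right] \frac{(\nu)_k}{k!} \theta^{k} (1 - \theta)^{\nu}, \quad k \ge 0. \tag{4}$$
It is easy to compute the probability $P(k)$ from (4) by using the one-term recurrence formula for the NB pmf
$$P_{\mathrm{NB}}(k) = \frac{(\nu)_k}{k!} \theta^{k} (1 - \theta)^{\nu},$$
that is, $P_{\mathrm{NB}}(k + 1) = (\nu + k)\, \theta\, P_{\mathrm{NB}}(k) / (k + 1)$.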
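The recurrence gives a direct way to tabulate the NB-shifted NB probabilities. The following minimal sketch computes the pmf both from the two-component form (3) via the recurrence and from the weighted form (4) via log-gamma functions; the function names are illustrative, and the guard for $k = 0$ avoids the undefined ratio $k/(\nu + k - 1)$ when $\nu = 1$.

```python
import math

def nb_pmf_list(nu, theta, k_max):
    """NB probabilities for k = 0..k_max via P(k+1) = (nu + k) * theta * P(k) / (k + 1)."""
    probs = [(1 - theta) ** nu]
    for k in range(k_max):
        probs.append((nu + k) * theta * probs[k] / (k + 1))
    return probs

def nb_shifted_nb_pmf(p, nu, theta, k_max):
    """Mixture (3): (1 - p) * NB(k) + p * NB(k - 1), with NB(-1) = 0."""
    nb = nb_pmf_list(nu, theta, k_max)
    return [(1 - p) * nb[k] + (p * nb[k - 1] if k >= 1 else 0.0)
            for k in range(k_max + 1)]

def nb_shifted_nb_pmf_weighted(k, p, nu, theta):
    """Weighted form (4), evaluated directly with log-gamma functions."""
    log_nb = (math.lgamma(nu + k) - math.lgamma(nu) - math.lgamma(k + 1)
              + k * math.log(theta) + nu * math.log(1 - theta))
    weight = 1 - p + (p * k / (theta * (nu + k - 1)) if k >= 1 else 0.0)
    return weight * math.exp(log_nb)

pmf = nb_shifted_nb_pmf(0.1, 1, 0.1, 50)
print(pmf[0])  # probability at zero for p = 0.1, nu = 1, theta = 0.1
```

For $p = 0.1$, $\nu = 1$, $\theta = 0.1$ the probability at zero is $(1 - p)(1 - \theta)^{\nu} = 0.81$.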
The pgf of pmf (3) is given by
$$G(z) = (1 - p) \left(\frac{1 - \theta}{1 - \theta z}\right)^{\nu} + p z \left(\frac{1 - \theta}{1 - \theta z}\right)^{\nu}. \tag{5}$$
The form of the pgf given by (5) shows that this finite mixture of NB distributions with pmf (3) can be called an NB-shifted NB distribution. The pgf (5) may also be written as
$$G(z) = (1 - p + p z) \left(\frac{1 - \theta}{1 - \theta z}\right)^{\nu}, \quad |z| < \frac{1}{\theta}.$$
This is seen to be the convolution of Bernoulli and NB distributions. The mean and variance of this special case are, respectively,
$$\mu = p + \frac{\nu \theta}{1 - \theta}$$
and
$$\sigma^2 = p(1 - p) + \frac{\nu \theta}{(1 - \theta)^2}.$$
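The convolution representation suggests a simple way to simulate from the NB-shifted NB distribution: add a Bernoulli($p$) draw to an independent NB draw. The sketch below is illustrative and uses only the standard library; the NB draw is obtained by inverse-transform sampling with the one-term recurrence, which is one possible choice rather than the authors' method, and the seed and parameter values are arbitrary.

```python
import random

def sample_nb(nu, theta, rng):
    """Inverse-transform NB draw using P(k+1) = (nu + k) * theta * P(k) / (k + 1)."""
    u = rng.random()
    k = 0
    prob = (1 - theta) ** nu
    cdf = prob
    # Stop as well if prob underflows to 0, so the loop always terminates.
    while u > cdf and prob > 0.0:
        prob *= (nu + k) * theta / (k + 1)
        k += 1
        cdf += prob
    return k

def sample_nb_shifted_nb(p, nu, theta, rng):
    """Bernoulli(p) draw plus an independent NB(theta, nu) draw."""
    bernoulli = 1 if rng.random() < p else 0
    return bernoulli + sample_nb(nu, theta, rng)

rng = random.Random(2023)  # arbitrary seed for reproducibility
p, nu, theta = 0.8, 10, 0.1
draws = [sample_nb_shifted_nb(p, nu, theta, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)
print(sample_mean, p + nu * theta / (1 - theta))
print(sample_var, p * (1 - p) + nu * theta / (1 - theta) ** 2)
```

The sample mean and variance should be close to the theoretical values given above, up to Monte Carlo error.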

2.3. Weighted Negative Binomial Distribution

Let X be a random variable with pmf p(k) and assume that the probability of ascertaining the event X = k has a weighting factor w(k). The weighted distribution [28,29] with weight w(x) has a pmf
$$P(k) = \frac{w(k)\, p(k)}{\sum_{x} w(x)\, p(x)}. \tag{6}$$
If $w(k) = k$, the distribution with pmf (6) is known as the size (length)-biased distribution.
Let $Y$ be the size-biased version of a random variable $X$ with weight $w(k) = k$. Consider another weighted random variable $X_w$ with weight $w(k) = k + \gamma$. The pmf of $X_w$ is given by
$$P(X_w = k) = \frac{\gamma}{\mu + \gamma} P(X = k) + \frac{\mu}{\mu + \gamma} P(Y = k), \tag{7}$$
where $\mu = \sum_{x} x\, p(x)$. If $\gamma \to \infty$, (7) reduces to $P(X = k)$.
By comparing Equations (4) and (7), it is seen that the finite mixture of NB can be regarded as a weighted NB distribution with weight
$$w(k) = 1 - p + \frac{p k}{\theta(\nu + k - 1)}.$$

2.4. Conditions for Under-, Equi- and Over-Dispersion

The index of dispersion is
$$I_D = \frac{\text{Variance}}{\text{Mean}} = \frac{p(1 - p) + \dfrac{\nu \theta}{(1 - \theta)^2}}{p + \dfrac{\nu \theta}{1 - \theta}}.$$
The distribution is under-, equi- or over-dispersed according to whether $I_D$ is less than, equal to or more than 1. For the NB component, $\sigma^2 - \mu = \nu \theta^2 / (1 - \theta)^2$, so comparing this quantity with $p^2$ gives the following conditions for under-, equi- and over-dispersion:
If $\sqrt{\nu}\, \theta / (1 - \theta) < p$, the distribution is under-dispersed. It is equi-dispersed if $\sqrt{\nu}\, \theta / (1 - \theta) = p$ and over-dispersed if $\sqrt{\nu}\, \theta / (1 - \theta) > p$.
Remark 1.
It should be noted that for the special case of equi-dispersion, the distribution does not reduce to the Poisson distribution. With the substitution $p = \sqrt{\nu}\, \theta / (1 - \theta)$ into Equation (4), a weighted negative binomial distribution is obtained.
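The dispersion conditions can be checked against the index of dispersion directly. The sketch below writes the condition as a comparison of $p$ with $\sqrt{\nu}\,\theta/(1-\theta)$, which follows from $\sigma^2 - \mu = \nu\theta^2/(1-\theta)^2$ for the NB component, and evaluates it for the parameter combinations of Table 1; the function names are illustrative.

```python
import math

def index_of_dispersion(p, nu, theta):
    """Variance-to-mean ratio of the NB-shifted NB distribution."""
    mean = p + nu * theta / (1 - theta)
    var = p * (1 - p) + nu * theta / (1 - theta) ** 2
    return var / mean

def dispersion_type(p, nu, theta, tol=1e-12):
    """Classify by the sign of h = p - sqrt(nu) * theta / (1 - theta)."""
    h = p - math.sqrt(nu) * theta / (1 - theta)
    if abs(h) <= tol:
        return "equi"
    return "under" if h > 0 else "over"

# (nu, theta, p) for cases (a) to (f) of Table 1.
cases = {"a": (1, 0.1, 0.1), "b": (10, 0.1, 0.1), "c": (50, 0.1, 0.1),
         "d": (10, 0.1, 0.8), "e": (10, 0.4, 0.4), "f": (10, 0.8, 0.1)}
for label, (nu, theta, p) in cases.items():
    print(label, round(index_of_dispersion(p, nu, theta), 2),
          dispersion_type(p, nu, theta))
```

Only case (d) satisfies $p > \sqrt{\nu}\,\theta/(1-\theta)$ and is therefore under-dispersed, in agreement with the index of dispersion reported in Table 1.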

2.5. Shapes of the Distribution

To examine the shapes of the NB-shifted NB distribution, the probabilities are computed for the parameter combinations listed in Table 1.
Figure 1(i) shows that increasing the parameter ν sharply decreases the probability at zero but only mildly increases the index of dispersion. The probability at zero drops from 0.81 (case (a)) to almost zero (case (c)) when ν increases from 1 to 50. For case (a), the computed index of dispersion is close to 1, so the distribution can be considered equi-dispersed. When θ and p are fixed and ν (>1) is increased as shown in Table 1, the computed index of dispersion is always larger than 1, meaning that the distribution is over-dispersed. In addition, the mode shifts to the right as ν increases, the distribution spreads over a wider range of counts, and it remains right-skewed with a heavier right tail.
On the other hand, Table 1 shows that, when ν is fixed and θ and p are varied, either over- or under-dispersion is obtained. The distribution spreads over a wider range of counts when θ is larger than p, as presented in Figure 1(ii). Case (f) has the highest computed index of dispersion, and its pmf appears flatter than those of the other cases.

2.6. Log-Concavity, Strong Unimodality and Reliability Properties

A distribution is said to be log-concave if its pmf $f_k$, $f_k > 0$, $k = 0, 1, 2, 3, \ldots$, satisfies $f_k^2 \ge f_{k+1} f_{k-1}$.
From the results of [30] (p. 388), the NB-shifted NB distribution is log-concave whenever both the binomial (Bernoulli) and negative binomial components are log-concave, since log-concavity is preserved under convolution. The Bernoulli distribution is always log-concave, and the negative binomial pmf is log-concave when $\nu \ge 1$.
From Theorem 3 of [30] (p. 386), which states that a necessary and sufficient condition for a pmf $f_k$ to be strongly unimodal is that $f_k$ be log-concave for all $k$, it follows that the finite mixture of NB distributions is strongly unimodal under the same condition.
The failure rate of a distribution with pmf $f_k$, $f_k > 0$, $k = 0, 1, 2, 3, \ldots$, is defined by
$$r_k = \frac{f_k}{\sum_{i \ge k} f_i}.$$
From the log-concavity property, the NB-shifted NB distribution has an increasing failure rate (IFR) [31]. Furthermore, the following implications hold:
$$\mathrm{IFR} \Rightarrow \mathrm{IFRA} \Rightarrow \mathrm{NBU} \Rightarrow \mathrm{NBUE} \Rightarrow \mathrm{HNBUE},$$
where IFRA is ‘increasing failure rate average’, NBU is ‘new better than used’, NBUE is ‘new better than used in expectation’ and HNBUE is ‘harmonic new better than used in expectation’. Thus, the NB-shifted NB distribution is IFR, IFRA, NBU, NBUE and HNBUE.
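Log-concavity and the increasing failure rate can be checked numerically for specific parameter values. The sketch below is illustrative: it uses three of the Table 1 parameter combinations (all with $\nu \ge 1$), a truncated support, and small tolerances to absorb floating-point rounding; none of these choices comes from the paper.

```python
def nb_shifted_nb_pmf(p, nu, theta, k_max):
    """NB-shifted NB probabilities for k = 0..k_max via the NB recurrence."""
    nb = [(1 - theta) ** nu]
    for k in range(k_max):
        nb.append((nu + k) * theta * nb[k] / (k + 1))
    return [(1 - p) * nb[k] + (p * nb[k - 1] if k >= 1 else 0.0)
            for k in range(k_max + 1)]

def is_log_concave(f, upto, rel_tol=1e-9):
    """Check f_k^2 >= f_{k+1} * f_{k-1} for k = 1..upto-1 (up to rounding)."""
    return all(f[k] ** 2 >= f[k + 1] * f[k - 1] * (1 - rel_tol)
               for k in range(1, upto))

def failure_rates(f, upto):
    """r_k = f_k / sum_{i >= k} f_i, with tails accumulated from the right."""
    tails = [0.0] * (len(f) + 1)
    for k in range(len(f) - 1, -1, -1):
        tails[k] = tails[k + 1] + f[k]
    return [f[k] / tails[k] for k in range(upto)]

def is_nondecreasing(values, tol=1e-12):
    return all(b >= a - tol for a, b in zip(values, values[1:]))

# (p, nu, theta) for cases (a), (d) and (f) of Table 1.
for p, nu, theta in [(0.1, 1, 0.1), (0.8, 10, 0.1), (0.1, 10, 0.8)]:
    f = nb_shifted_nb_pmf(p, nu, theta, 400)
    print(is_log_concave(f, 40), is_nondecreasing(failure_rates(f, 40)))
```

Both checks return `True` for these cases; this is a numerical illustration for particular parameters, not a proof.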

3. Statistical Inferences

3.1. Parameter Estimation

In this section, maximum likelihood (ML) estimation and probability-generating function-based estimation (pgf-estimator) are employed.
ML estimation is an efficient method for the estimation of unknown parameters, but the score equations involved can be complicated and difficult to solve. To overcome numerical complexity, numerical optimization via the simulated annealing algorithm [32] is used.
An alternative estimation method based on the pgf is suggested because the NB-shifted NB pgf has a simple form. This method has been shown to provide quick and consistent estimates for discrete distributions and is robust to outliers [33]. The pgf-based statistic considered here is
$$T = \int_{0}^{1} \left(F_N(t) - F(t)\right)^2 dt, \tag{8}$$
where $F_N(t) = \frac{1}{N} \sum_{i=1}^{N} t^{x_i}$ and $F(t) = E[t^{X}]$ are, respectively, the empirical pgf and the theoretical pgf of the distribution. Ref. [33] demonstrated that the pgf estimator outperformed the ML estimator in terms of achieving low values of mean-square error and bias.
Remark 2.
In the statistic (8), the integral over (0, 1) may be interpreted as averaging over the auxiliary variable t by a uniform distribution. A non-uniform distribution can be used.
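The statistic (8) can be evaluated numerically for given parameter values; the pgf estimator is then the parameter vector that minimizes it. The sketch below is a minimal illustration that assumes composite Simpson integration on a uniform grid (a numerical choice made here, not specified in the paper) and omits the minimization step, which could be carried out with any general-purpose optimizer.

```python
def nb_shifted_nb_pgf(t, p, nu, theta):
    """Theoretical pgf (1 - p + p t) * ((1 - theta) / (1 - theta t))^nu."""
    return (1 - p + p * t) * ((1 - theta) / (1 - theta * t)) ** nu

def empirical_pgf(t, sample):
    """F_N(t) = (1 / N) * sum of t^{x_i}."""
    return sum(t ** x for x in sample) / len(sample)

def pgf_distance(sample, p, nu, theta, panels=200):
    """Composite Simpson approximation of T in (8); panels must be even."""
    h = 1.0 / panels
    total = 0.0
    for j in range(panels + 1):
        t = j * h
        diff = empirical_pgf(t, sample) - nb_shifted_nb_pgf(t, p, nu, theta)
        weight = 1 if j in (0, panels) else (4 if j % 2 else 2)
        total += weight * diff * diff
    return total * h / 3

# Hypothetical small sample, used only to show the call.
counts = [0, 1, 1, 2, 2, 2, 3, 4]
print(pgf_distance(counts, 0.8, 10, 0.1))
```

A non-uniform weighting over $t$, as mentioned in Remark 2, would amount to multiplying the squared difference by a weight function inside the integral.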

3.2. Test for Equi-Dispersion

The NB-shifted NB distribution is equi-dispersed when $p = \sqrt{\nu}\, \theta / (1 - \theta)$. Hence, with $h(p, \nu, \theta) = p - \sqrt{\nu}\, \theta / (1 - \theta)$, the following set of hypotheses is considered:
$$H_0: h(p, \nu, \theta) = 0 \quad \text{vs.} \quad H_1: h(p, \nu, \theta) \ne 0.$$
The power of the tests described below is studied in Section 3.3, and the results are provided in Table 2.

3.2.1. Rao’s Score Test [34]

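The log-likelihood function can be written as $\ln L = \sum_{i=1}^{N} I(Y_i = y_i) \ln P(Y_i = y_i)$, where $Y_1, Y_2, \ldots, Y_N$ is a random sample. Rao’s score test statistic is given by $T = V I^{-1} V^{T}$, where the score vector $V$ and information matrix $I$ are evaluated at the restricted ML estimates.
The score vector $V$ is given as $\left(\dfrac{\partial \ln L}{\partial p}, \dfrac{\partial \ln L}{\partial \nu}, \dfrac{\partial \ln L}{\partial \theta}\right)$, where
$$\frac{\partial \ln L}{\partial p} = \sum_{i=1}^{N} I(y_i \ge 0) \frac{\partial \ln P(Y_i = y_i)}{\partial p} = \sum_{i=1}^{N} I(y_i \ge 0) \frac{y_i(\theta - 1) + (\nu - 1)\theta}{\alpha_i},$$
where $\alpha_i = y_i \left(p(\theta - 1) - \theta\right) + (p - 1)(\nu - 1)\theta$,
$$\frac{\partial \ln L}{\partial \nu} = \sum_{i=1}^{N} I(y_i \ge 0) \frac{\partial \ln P(Y_i = y_i)}{\partial \nu} = \sum_{i=1}^{N} I(y_i \ge 0) \left[\frac{y_i p}{(y_i + \nu - 1)\alpha_i} - \psi_0(\nu) + \psi_0(\nu + y_i) + \ln(1 - \theta)\right],$$
$$\frac{\partial \ln L}{\partial \theta} = \sum_{i=1}^{N} I(y_i \ge 0) \frac{\partial \ln P(Y_i = y_i)}{\partial \theta} = \sum_{i=1}^{N} I(y_i \ge 0) \left[\frac{\nu}{\theta - 1} + \frac{y_i}{\theta}\left(1 + \frac{p}{\alpha_i}\right)\right].$$
The elements of the information matrix are the negatives of the expected second derivatives of $\ln L$, which are
$$E\left[\frac{\partial^2 \ln L}{\partial p^2}\right] = -\sum_{i=1}^{N} E\left[I(y_i \ge 0) \left(\frac{y_i(\theta - 1) + (\nu - 1)\theta}{\alpha_i}\right)^{2}\right],$$
$$E\left[\frac{\partial^2 \ln L}{\partial p\, \partial \nu}\right] = -\sum_{i=1}^{N} E\left[I(y_i \ge 0) \frac{y_i \theta}{\alpha_i^2}\right],$$
$$E\left[\frac{\partial^2 \ln L}{\partial p\, \partial \theta}\right] = -\sum_{i=1}^{N} E\left[I(y_i \ge 0) \frac{y_i (y_i + \nu - 1)}{\alpha_i^2}\right],$$
$$E\left[\frac{\partial^2 \ln L}{\partial \nu\, \partial \theta}\right] = \sum_{i=1}^{N} E\left[I(y_i \ge 0) \left(\frac{1}{\theta - 1} - \frac{y_i^2 p^2}{(y_i + \nu - 1)^3 D_i^2 \theta^3} + \frac{y_i p}{(y_i + \nu - 1)^2 D_i \theta^2}\right)\right], \quad D_i = 1 + p\left(\frac{y_i}{(y_i + \nu - 1)\theta} - 1\right),$$
$$E\left[\frac{\partial^2 \ln L}{\partial \nu^2}\right] = \sum_{i=1}^{N} E\left[I(y_i \ge 0) \left(-\frac{y_i p \left[2(p - 1)(\nu - 1)\theta + y_i\left(p(2\theta - 1) - 2\theta\right)\right]}{(y_i + \nu - 1)^2 \alpha_i^2} - \psi_1(\nu) + \psi_1(\nu + y_i)\right)\right],$$
$$E\left[\frac{\partial^2 \ln L}{\partial \theta^2}\right] = \sum_{i=1}^{N} E\left[I(y_i \ge 0) \left(-\frac{\nu}{(\theta - 1)^2} - \frac{y_i}{\theta^2}\left(1 + \frac{y_i p^2}{\alpha_i^2} + \frac{2p}{\alpha_i}\right)\right)\right],$$
where $\psi_0$ is the digamma function and $\psi_1$ is the polygamma function of order 1 (the trigamma function).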
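The score components can be verified numerically by comparing them with finite differences of the log-pmf. The sketch below assumes the per-observation score terms in the form $\partial \ln P/\partial p = [y(\theta-1) + (\nu-1)\theta]/\alpha$, $\partial \ln P/\partial \nu = yp/[(y+\nu-1)\alpha] + \psi_0(\nu+y) - \psi_0(\nu) + \ln(1-\theta)$ and $\partial \ln P/\partial \theta = \nu/(\theta-1) + (y/\theta)(1 + p/\alpha)$, with $\alpha = y(p(\theta-1)-\theta) + (p-1)(\nu-1)\theta$. It uses the identity $\psi_0(\nu + y) - \psi_0(\nu) = \sum_{j=0}^{y-1} 1/(\nu + j)$ for integer $y$ so that only the standard library is needed; the function names and test values are illustrative.

```python
import math

def log_pmf(k, p, nu, theta):
    """Log of the NB-shifted NB pmf in its weighted form."""
    weight = 1 - p + (p * k / (theta * (nu + k - 1)) if k >= 1 else 0.0)
    return (math.log(weight) + math.lgamma(nu + k) - math.lgamma(nu)
            - math.lgamma(k + 1) + k * math.log(theta) + nu * math.log(1 - theta))

def score_terms(k, p, nu, theta):
    """Per-observation derivatives of log P with respect to (p, nu, theta).

    Note: alpha vanishes when k = 0 and nu = 1, so that case is not covered.
    """
    alpha = k * (p * (theta - 1) - theta) + (p - 1) * (nu - 1) * theta
    d_p = (k * (theta - 1) + (nu - 1) * theta) / alpha
    digamma_diff = sum(1.0 / (nu + j) for j in range(k))  # psi0(nu+k) - psi0(nu)
    d_nu = k * p / ((k + nu - 1) * alpha) + digamma_diff + math.log(1 - theta)
    d_theta = nu / (theta - 1) + (k / theta) * (1 + p / alpha)
    return d_p, d_nu, d_theta

def numeric_gradient(k, p, nu, theta, h=1e-6):
    """Central finite differences of log_pmf, for comparison only."""
    return (
        (log_pmf(k, p + h, nu, theta) - log_pmf(k, p - h, nu, theta)) / (2 * h),
        (log_pmf(k, p, nu + h, theta) - log_pmf(k, p, nu - h, theta)) / (2 * h),
        (log_pmf(k, p, nu, theta + h) - log_pmf(k, p, nu, theta - h)) / (2 * h),
    )

print(score_terms(1, 0.3, 2.5, 0.4))
print(numeric_gradient(1, 0.3, 2.5, 0.4))
```

Summing `score_terms` over the observations gives the score vector $V$; the same finite-difference idea can be applied to the second derivatives.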

3.2.2. Generalized Likelihood Ratio Test

The generalized likelihood ratio (GLR) test statistic is given by
$$G_{LR} = -2 \ln \frac{L(\hat{\beta}^{*}; x)}{L(\hat{\beta}; x)},$$
where $\hat{\beta}^{*}$ is the vector of restricted ML estimators under $H_0$ and $\hat{\beta}$ is the vector of unrestricted ML estimators under $H_1$. Under $H_0$, the test statistic is asymptotically $\chi^2$ distributed with 1 degree of freedom.
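Given the maximized log-likelihoods under $H_0$ and $H_1$, the GLR statistic and its asymptotic p-value are straightforward to compute. The sketch below is illustrative: the log-likelihood values are hypothetical, and the p-value uses the fact that a $\chi^2$ variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability is $\operatorname{erfc}(\sqrt{x/2})$.

```python
import math

def glr_statistic(loglik_restricted, loglik_unrestricted):
    """-2 * ln(L_restricted / L_unrestricted) from the two log-likelihoods."""
    return -2.0 * (loglik_restricted - loglik_unrestricted)

def chi2_1_pvalue(stat):
    """Upper-tail probability of a chi-square(1) variable at stat."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

# Hypothetical maximized log-likelihoods, for illustration only.
stat = glr_statistic(-105.2, -103.0)
print(stat, chi2_1_pvalue(stat))
```

The null hypothesis of equi-dispersion would be rejected at level $\alpha$ when the p-value is below $\alpha$.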

3.3. Statistical Power Analysis of the Rao’s Score and Generalized Likelihood Ratio Tests

Table 2 displays the power of the score and GLR tests from a Monte Carlo simulation study with 1000 repetitions. In the simulation study, the significance level α is set at 5% and 10%. The sample sizes considered are N = 100 (small), 500 (moderate) and 1000 (large), and different levels of under-, equi- and over-dispersion are incorporated. The effect sizes reported in Table 2 correspond to the absolute deviation $|p - \sqrt{\nu}\, \theta / (1 - \theta)|$. Rao’s score test and the GLR test are asymptotically equivalent.
Table 2 demonstrates that, in the scenario of equi-dispersion, $p = \sqrt{\nu}\, \theta / (1 - \theta)$, the estimated empirical levels of Rao’s score test and the GLR test approach each other as the sample size N increases. In the case of under-dispersion, the GLR test consistently exhibits a higher power than Rao’s score test, even with small sample sizes, and a 100% detection rate is achieved for an effect size of 0.61 when the sample size $N \ge 500$. For over-dispersion, when the effect size is 0.25, Rao’s score test marginally outperforms the GLR test. However, as the effect size increases to 0.60, slightly stronger detection is observed with the GLR test, particularly with larger sample sizes (N = 1000). It is evident that, as the deviation from $p = \sqrt{\nu}\, \theta / (1 - \theta)$ increases, the power of the tests also increases.

4. Modeling of Biological Count Data

The NB-shifted NB distribution has been fitted to count data with different indexes of dispersion. The three real-life data sets are as follows:
(1) Frequency and distribution of chromosome aberrations (dicentrics plus ring) in peripheral blood lymphocytes irradiated in vitro with γ-rays (dose 10 in Table 3 and dose 6 in Table 4), representing under- and equi-dispersed data [35].
(2) Fetal movement (Table 5), representing over-dispersed data [36].
The performance of the NB-shifted NB distribution is compared with that of the GPD, GIT and COM–Poisson distributions. The parameter n of the GIT distribution is chosen to give the lowest chi-square value. For the data set with an ID of 0.97 (a value very close to 1), the Poisson distribution fits well and is included as a comparison for equi-dispersed data. For the NB-shifted NB distribution, both ML estimation (MLE) and pgf-based estimation are used.
For Table 3, the fits obtained by the NB-shifted NB distribution are comparable, in terms of the p value, with those of the GPD, GIT and COM–Poisson distributions. Both estimation methods work well and give similar chi-square values. Table 4 shows that, based upon the p value and chi-square value, the fits by the NB-shifted NB and GIT distributions are significantly better than the others. For this data set, pgf-based estimation is slightly better than MLE, with a lower chi-square value. For the over-dispersed data (Table 5), the NB-shifted NB distribution achieves the highest p value (for MLE) and the lowest chi-square values among the fitted distributions, and MLE slightly outperforms pgf-based estimation. In addition, the tests of equi-dispersion (Rao’s score and GLR tests) indicate that the null hypothesis should be rejected for the data in Table 5, which agrees with the computed index of dispersion. For the two data sets in Table 3 and Table 4, the simulation results presented in Table 6 show that the bias for the estimated parameter $\hat{\nu}$ is relatively large.
To evaluate the performance of the estimators in fitting the NB-shifted NB distribution to real-life data sets, a non-parametric bootstrap simulation [37] was conducted with 1000 repetitions. The simulation involved re-sampling, with replacement, from the original data set to generate the bootstrap samples. For each of these samples, the ML and pgf-based estimates were computed. The non-parametric bootstrap does not require specific assumptions about the distribution of the underlying population.
The results of this simulation are summarized in Table 6, which provides key statistics (the mean, standard error and confidence interval) from the bootstrap analysis for both the MLE and the pgf-estimator. Additionally, the bias of the MLE and pgf-estimator was assessed by comparing the average of the bootstrap estimates with the respective estimated parameters. The findings presented in Table 6 indicate that the pgf-estimator has a lower bias than the MLE. The mean, standard error, confidence intervals and bias computed for both estimators are in close agreement. For the two data sets in Table 3 and Table 4, the standard errors for the estimated parameter $\hat{\nu}$ are relatively large, indicating low precision in the estimates.
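The bootstrap procedure described above can be sketched generically. For brevity, the example below applies it to the sample index of dispersion rather than to the ML or pgf-based estimators, and the counts are hypothetical; the function names, the seed and the percentile confidence interval are illustrative choices, not the authors' implementation.

```python
import random

def sample_index_of_dispersion(sample):
    """Sample variance divided by sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return var / mean

def bootstrap(sample, statistic, repetitions=1000, seed=0):
    """Non-parametric bootstrap: mean, standard error, 95% percentile CI, bias."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        statistic([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(repetitions))
    mean = sum(estimates) / repetitions
    se = (sum((e - mean) ** 2 for e in estimates) / (repetitions - 1)) ** 0.5
    lower = estimates[int(0.025 * repetitions)]
    upper = estimates[int(0.975 * repetitions) - 1]
    return {"mean": mean, "se": se, "ci": (lower, upper),
            "bias": mean - statistic(sample)}

# Hypothetical under-dispersed counts, for illustration only.
counts = [0] * 5 + [1] * 20 + [2] * 30 + [3] * 20 + [4] * 5
result = bootstrap(counts, sample_index_of_dispersion)
print(result)
```

Replacing `sample_index_of_dispersion` with a function that returns a fitted parameter would give the corresponding bootstrap summary for that estimator.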

5. Concluding Remarks

A two-component mixture model of a distribution with its shifted counterpart has been proposed in order to generate models that can cater for under-, equi- and over-dispersion. As an example, the paper considers the finite mixture of NB and shifted NB distributions. Unlike the generalized Poisson distribution, this distribution has no problem with the range of the parameters. Even though the COM–Poisson distribution has a simple pmf, the computation of its normalizing constant can encounter problems for extreme values of the parameters. The GIT distribution does not face these issues. However, in comparison, the finite mixture of NB distributions (NB-shifted NB) is much simpler. The proposed distribution has been fitted to a variety of biological data sets. The good fits of the distribution relative to the GPD, COM–Poisson and GIT distributions indicate that it is a viable and simple alternative model for under-, equi- and over-dispersion. The use of components other than NB and shifted NB for the finite mixture will be considered elsewhere.

Author Contributions

Conceptualization, S.H.O.; Methodology, S.H.O., S.Z.S. and S.L.; Validation, S.H.O., S.L. and H.M.S.; Formal analysis, S.H.O., S.Z.S., S.L. and H.M.S.; Investigation, S.H.O., S.Z.S., S.L. and H.M.S.; Resources, S.Z.S.; Data curation, S.H.O. and S.Z.S.; Writing—original draft, S.H.O., S.Z.S., S.L. and H.M.S.; Writing—review & editing, S.H.O., S.Z.S. and H.M.S.; Supervision, S.H.O.; Project administration, S.H.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Ministry of Higher Education grant FRGS/1/2020/STG06/SYUC/02/1 and UCSI University grant REIG-FBM-2022/050.

Acknowledgments

The authors wish to thank the referees for their valuable comments which have vastly improved the paper. The first author is supported by the Ministry of Higher Education grant FRGS/1/2020/STG06/SYUC/02/1 and UCSI University grant REIG-FBM-2022/050.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bardwell, G.E.; Crow, E.L. A two-parameter family of hyperPoisson distributions. J. Am. Stat. Assoc. 1964, 59, 133–141.
  2. Efron, B. Double exponential families and their use in generalized linear regression. J. Am. Stat. Assoc. 1986, 81, 709–721.
  3. Castillo, J.D.; Pérez-Casany, M. Weighted Poisson distributions for overdispersion and under-dispersion situations. Ann. Inst. Stat. Math. 1998, 50, 567–585.
  4. Consul, P.C. Generalized Poisson Distributions: Properties and Applications; Marcel Dekker Inc.: New York, NY, USA; Basel, Switzerland, 1989.
  5. Conway, R.W.; Maxwell, W.L. A queueing model with state dependent service rates. J. Ind. Eng. 1962, 12, 132–136.
  6. Sellers, K.F.; Borle, S.; Shmueli, G. The COM-Poisson model for count data: A survey of methods and applications. Appl. Stoch. Model. Bus. Ind. 2012, 28, 104–116.
  7. Sellers, K.F.; Swift, A.W.; Weems, K.S. A flexible distribution class for count data. J. Stat. Distrib. Appl. 2017, 4, 1–21.
  8. Shmueli, G.; Minka, T.P.; Kadane, J.B.; Borle, S.; Boatwright, P. A useful distribution for fitting discrete data: Revival of the Conway–Maxwell–Poisson distribution. Appl. Stat. 2005, 54, 127–142.
  9. Nelson, D.L. Some Remarks on Generalizations of the Negative Binomial and Poisson Distributions. Technometrics 1975, 17, 135–136.
  10. Scollnik, D.P.M. On the analysis of the truncated generalized Poisson distribution using a Bayesian method. ASTIN Bull. 1998, 28, 135–152.
  11. Zhu, F. Modeling time series of counts with COM-Poisson INGARCH models. Math. Comput. Model. 2012, 56, 191–203.
  12. Sellers, K.F.; Peng, S.J.; Arab, A. A flexible univariate autoregressive time-series model for dispersed count data. J. Time Ser. Anal. 2020, 41, 436–453.
  13. Huang, A. Mean-parametrized Conway-Maxwell-Poisson regression models for dispersed counts. Stat. Model. 2017, 17, 359–380.
  14. Sellers, K.F.; Morris, D.S. Underdispersion models: Models that are “under the radar”. Commun. Stat.-Theory Methods 2017, 46, 12075–12086.
  15. Huang, A. On arbitrarily underdispersed discrete distributions. Am. Stat. 2022, 77, 29–34.
  16. Sim, S.Z.; Ong, S.H. A generalized inverse trinomial distribution with application. Stat. Methodol. 2016, 33, 217–233.
  17. Aoyama, K.; Shimizu, K.; Ong, S.H. A first–passage time random walk distribution with five transition probabilities: A generalization of the shifted inverse trinomial. Ann. Inst. Stat. Math. 2008, 60, 1–20.
  18. Kemp, A.W. Convolutions involving binomial pseudo-variables. Sankhyā 1979, 41, 232–243.
  19. Borges, P.; Rodrigues, J.; Balakrishnan, N.; Bazan, J. A COM-Poisson type generalization of the binomial distribution and its properties and applications. Stat. Probab. Lett. 2014, 87, 158–166.
  20. Imoto, T. A generalized Conway-Maxwell-Poisson distribution which includes the negative binomial distribution. Appl. Math. Comput. 2014, 247, 824–834.
  21. Chakraborty, S.; Imoto, T. Extended Conway-Maxwell-Poisson distribution and its properties and applications. J. Stat. Distrib. Appl. 2016, 3, 1–19.
  22. Chakraborty, S.; Ong, S.H. A COM-Poisson-type generalization of the negative binomial distribution. Commun. Stat.-Theory Methods 2016, 45, 4117–4135.
  23. Imoto, T.; Ng, C.M.; Ong, S.H.; Chakraborty, S. A modified Conway-Maxwell-Poisson type binomial distribution and its applications. Commun. Stat.-Theory Methods 2017, 46, 12210–12225.
  24. Zhang, H.; Tan, K.; Li, B. COM-negative binomial distribution: Modeling overdispersion and ultrahigh zero-inflated count data. Front. Math. China 2018, 13, 967–998.
  25. Cahoy, D.; Di Nardo, E.; Polito, F. Flexible models for overdispersed and underdispersed count data. Stat. Pap. 2021, 62, 2969–2990.
  26. Ong, S.H. The computer generation of bivariate binomial variables with given marginals and correlation. Commun. Statist.-Simul. Comput. 1992, 21, 285–299.
  27. McLachlan, G.J.; Lee, S.X.; Rathnayake, S.I. Finite Mixture Models. Annu. Rev. Stat. Its Appl. 2019, 6, 355–378.
  28. Rao, C.R. On discrete distributions arising out of methods of ascertainment. Sankhyā Indian J. Stat. Ser. A 1965, 27, 311–324.
  29. Rao, C.R. Weighted Distributions Arising Out of Methods of Ascertainment: What Population Does a Sample Represent? In A Celebration of Statistics; Atkinson, A.C., Fienberg, S.E., Eds.; Springer: New York, NY, USA, 1985.
  30. Keilson, J.; Gerber, H. Some Results for Discrete Unimodality. J. Am. Stat. Assoc. 1971, 66, 386–389.
  31. Gupta, P.L.; Gupta, R.C.; Ong, S.H.; Srivastava, H.M. A class of Hurwitz–Lerch Zeta distributions and their applications in reliability. Appl. Math. Comput. 2008, 196, 521–531.
  32. Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 1953, 21, 1087–1092.
  33. Sim, S.Z.; Ong, S.H. Parameter estimation for discrete distributions by generalized Hellinger type divergence based on probability generating function. Commun. Stat. Simul. Comput. 2010, 39, 305–314.
  34. Rao, C.R. Score Test: Historical Review and Recent Developments. In Advances in Ranking and Selection, Multiple Comparisons, and Reliability; Balakrishnan, N., Nagaraja, H.N., Kannan, N., Eds.; Statistics for Industry and Technology; Birkhäuser: Boston, MA, USA, 2005.
  35. Sasaki, M.S. Chromosomal biodosimetry by unfolding a mixed Poisson distribution: A generalized model. Int. J. Radiat. Biol. 2003, 79, 83–97.
  36. Leroux, B.G.; Puterman, M.L. Maximum penalized likelihood estimation for independent and Markov dependent mixture models. Biometrics 1992, 48, 545–558.
  37. Davison, A.C.; Hinkley, D.V. Bootstrap Methods and Their Application; Cambridge University Press: Cambridge, UK, 1997.
Figure 1. (i,ii): The pmf plots of the mixture distribution as proposed in Table 1.
Table 1. Different combinations of parameters under the NB-shifted NB distribution. In Cases (a) to (c), θ and p are fixed and ν is increased; in Cases (d) to (f), ν is fixed and θ and p are varied.

| Parameter | Case (a) | Case (b) | Case (c) | Case (d) | Case (e) | Case (f) |
|---|---|---|---|---|---|---|
| ν | 1 | 10 | 50 | 10 | 10 | 10 |
| θ | 0.1 | 0.1 | 0.1 | 0.1 | 0.4 | 0.8 |
| p | 0.1 | 0.1 | 0.1 | 0.8 | 0.4 | 0.1 |
| ID | 1.01 | 1.09 | 1.11 | 0.73 | 1.61 | 4.99 |

Table 2. Simulated power of Score and GLR tests for NB-shifted NB distribution. (Power = number of rejections divided by number of repetitions).

| | | | Equi-Dispersion | Over-Dispersion | Over-Dispersion | Under-Dispersion | Under-Dispersion |
|---|---|---|---|---|---|---|---|
| ν | | | 10 | 10 | 40 | 25 | 3 |
| θ | | | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| p | | | 0.35 | 0.1 | 0.1 | 0.8 | 0.8 |
| Index of dispersion | | | 1 | 1.09 | 1.11 | 0.91 | 0.47 |
| Effect size | | | 0 | 0.25 | 0.60 | 0.24 | 0.61 |
| N | α | Method | | | | | |
| 100 | 0.05 | score | 0.030 | 0.086 | 0.114 | 0.010 | 0.048 |
| | | GLR | 0.036 | 0.077 | 0.105 | 0.072 | 0.911 |
| | 0.1 | score | 0.056 | 0.148 | 0.160 | 0.030 | 0.362 |
| | | GLR | 0.082 | 0.133 | 0.175 | 0.144 | 0.960 |
| 500 | 0.05 | score | 0.035 | 0.280 | 0.391 | 0.059 | 1.000 |
| | | GLR | 0.050 | 0.274 | 0.389 | 0.279 | 1.000 |
| | 0.1 | score | 0.069 | 0.394 | 0.493 | 0.124 | 1.000 |
| | | GLR | 0.099 | 0.387 | 0.509 | 0.407 | 1.000 |
| 1000 | 0.05 | score | 0.032 | 0.494 | 0.637 | 0.152 | 1.000 |
| | | GLR | 0.046 | 0.483 | 0.654 | 0.538 | 1.000 |
| | 0.1 | score | 0.080 | 0.626 | 0.742 | 0.286 | 1.000 |
| | | GLR | 0.088 | 0.621 | 0.758 | 0.655 | 1.000 |

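The definition of simulated power used in Table 2 (number of rejections divided by number of repetitions) can be illustrated with a minimal Monte Carlo sketch. The sketch does not implement the score or GLR tests for the NB-shifted NB distribution; as a stand-in it uses a large-sample two-sided test based on the sample index of dispersion, and it draws data from a Poisson distribution (equi-dispersion) and a gamma-mixed Poisson distribution (over-dispersion) rather than from the NB-shifted NB model. All function names, the seed, the sample size and the number of repetitions are assumptions made for illustration only.

```python
import math
import random
from statistics import NormalDist, mean, variance


def poisson(rng, lam):
    """Draw one Poisson(lam) variate by Knuth's multiplication method."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def rejects_equidispersion(sample, alpha):
    """Stand-in two-sided test of ID = 1 (NOT the paper's score or GLR test).

    Under a Poisson null, (n - 1) * ID is roughly chi-square with n - 1
    degrees of freedom, so sqrt((n - 1) / 2) * (ID - 1) is roughly N(0, 1).
    """
    n, m = len(sample), mean(sample)
    if m == 0:
        return False  # degenerate all-zero sample: do not reject
    z = math.sqrt((n - 1) / 2) * (variance(sample) / m - 1)
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


def simulated_power(draw, n, alpha, repetitions, rng):
    """Power = number of rejections divided by number of repetitions."""
    rejections = sum(
        rejects_equidispersion([draw(rng) for _ in range(n)], alpha)
        for _ in range(repetitions)
    )
    return rejections / repetitions


def equi(rng):
    # Poisson with mean 2, so ID = 1.
    return poisson(rng, 2.0)


def over(rng):
    # Gamma-mixed Poisson with mean 2 and ID = 1 + scale = 2.
    return poisson(rng, rng.gammavariate(2.0, 1.0))


rng = random.Random(2023)  # arbitrary seed, for reproducibility only
size = simulated_power(equi, n=100, alpha=0.05, repetitions=300, rng=rng)
power = simulated_power(over, n=100, alpha=0.05, repetitions=300, rng=rng)
print(f"rejection rate under equi-dispersion: {size:.3f}")
print(f"rejection rate under over-dispersion: {power:.3f}")
```

Under the null setting the rejection rate estimates the size of the test and should lie near α, which is how the equi-dispersion column of Table 2 is read; under the alternative it estimates the power.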
Table 3. Frequency and distribution of chromosome aberrations (dicentrics plus ring) in peripheral blood lymphocytes irradiated in vitro with γ-rays [35].

Dose: 10

| No. of Cells | Observed Frequency | Expected Frequency: GPD (MLE) | Expected Frequency: GIT (MLE) | Expected Frequency: COM–Poisson (MLE) | Expected Frequency: NB-Shifted NB (MLE) | Expected Frequency: NB-Shifted NB (pgf-Estimator) |
|---|---|---|---|---|---|---|
| 0 | 0 | 1.62 | 1.49 | 1.26 | 0.07 | 0.12 |
| 1 | 9 | 8.50 | 8.24 | 8.01 | 7.57 | 8.44 |
| 2 | 26 | 21.54 | 21.52 | 21.73 | 23.56 | 24.39 |
| 3 | 33 | 35.03 | 35.49 | 35.92 | 37.97 | 37.53 |
| 4 | 39 | 41.08 | 41.60 | 41.77 | 41.94 | 40.48 |
| 5 | 36 | 37.02 | 37.10 | 36.97 | 35.63 | 34.30 |
| 6 | 26 | 26.66 | 26.35 | 26.18 | 24.81 | 24.29 |
| 7 | 23 | 15.77 | 15.42 | 15.36 | 14.74 | 14.95 |
| 8 | 3 | 7.81 | 7.65 | 7.65 | 7.68 | 8.20 |
| 9 | 2 | 3.28 | 3.29 | 3.30 | 3.58 | 4.09 |
| 10 | 2 | 1.18 | 1.25 | 1.25 | 1.52 | 1.88 |
| 11 | 1 | 0.50 | 0.62 | 0.60 | 0.91 | 1.33 |
| Total | 200 | | | | | |
| χ² | | 10.67 | 10.60 | 10.54 | 9.85 | 9.86 |
| p value | | 0.30 | 0.30 | 0.31 | 0.28 | 0.28 |
| Parameter estimates | | θ̂ = 4.82 | p̂₂ = 0.26 | ν̂ = 1.22 | θ̂ = 0.08 | θ̂ = 0.15 |
| | | λ̂ = −0.09 | p̂₃ = 0.13 | λ̂ = 6.34 | p̂ = 0.99 | p̂ = 0.99 |
| | | | n = 10 | | ν̂ = 38.58 | ν̂ = 19.75 |

ID: 0.85
Test for equi-dispersion: Rao’s score test statistic = 0.06 and GLR test statistic = 2.17 (Do not reject the null hypothesis).
Table 4. Frequency and distribution of chromosome aberrations (dicentrics plus ring) in peripheral blood lymphocytes irradiated in vitro with γ-rays [35].

Dose: 6

| No. of Cells | Observed Frequency | Expected Frequency: Poisson (MLE) | Expected Frequency: GPD (MLE) | Expected Frequency: GIT (MLE) | Expected Frequency: COM–Poisson (MLE) | Expected Frequency: NB-Shifted NB (MLE) | Expected Frequency: NB-Shifted NB (pgf-Estimator) |
|---|---|---|---|---|---|---|---|
| 0 | 19 | 23.77 | 22.87 | 19.18 | 21.91 | 19.08 | 18.95 |
| 1 | 56 | 50.62 | 50.50 | 55.43 | 50.88 | 56.17 | 56.93 |
| 2 | 60 | 53.92 | 54.82 | 59.54 | 55.70 | 56.33 | 55.94 |
| 3 | 31 | 38.28 | 39.01 | 35.04 | 39.28 | 36.73 | 36.22 |
| 4 | 18 | 20.38 | 20.46 | 17.28 | 20.27 | 18.74 | 18.61 |
| 5 | 11 | 8.68 | 8.44 | 7.82 | 8.21 | 8.14 | 8.24 |
| 6 | 5 | 4.35 | 3.91 | 5.71 | 3.74 | 4.81 | 5.12 |
| Total | 200 | | | | | | |
| χ² | | 4.60 | 4.77 | 1.88 | 4.61 | 2.17 | 2.01 |
| p value | | 0.47 | 0.31 | 0.76 | 0.33 | 0.54 | 0.57 |
| Parameter estimates | | λ̂ = 2.13 | θ̂ = 2.17 | p̂₂ = 0.34 | ν̂ = 1.09 | θ̂ = 0.19 | θ̂ = 0.21 |
| | | | λ̂ = −0.02 | p̂₃ = 0.35 | λ̂ = 2.32 | p̂ = 0.63 | p̂ = 0.65 |
| | | | | n = 2 | | ν̂ = 6.44 | ν̂ = 5.48 |

ID: 0.97
Test for equi-dispersion: Rao’s score test statistic = 0.03 and GLR test statistic = 0.05 (Do not reject the null hypothesis).
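As a rough guide to how the χ² and p value rows can be read, the sketch below recomputes Pearson's statistic for the Poisson column of Table 4 from the tabulated observed and expected frequencies. It assumes that no cells were pooled and that the degrees of freedom equal the number of cells minus one minus the number of fitted parameters (7 − 1 − 1 = 5 here); the paper's exact grouping rule is not stated in the table, and the function names `pearson_chi2` and `chi2_sf` are illustrative. Under these assumptions the result comes out close to the tabulated 4.60 and 0.47, with small differences attributable to rounding of the expected frequencies.

```python
import math


def pearson_chi2(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E over cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df.

    Uses the closed forms of the regularized upper incomplete gamma
    function for integer and half-integer shape parameters.
    """
    h = x / 2.0
    if df % 2 == 0:
        return math.exp(-h) * sum(
            h ** k / math.factorial(k) for k in range(df // 2)
        )
    tail = sum(
        h ** (k - 0.5) / math.gamma(k + 0.5)
        for k in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(h)) + math.exp(-h) * tail


# Observed and expected (Poisson, MLE) frequencies copied from Table 4.
observed = [19, 56, 60, 31, 18, 11, 5]
expected_poisson = [23.77, 50.62, 53.92, 38.28, 20.38, 8.68, 4.35]

statistic = pearson_chi2(observed, expected_poisson)
# Assumed degrees of freedom: 7 cells - 1 - 1 fitted parameter.
df = len(observed) - 1 - 1
p_value = chi2_sf(statistic, df)
print(f"chi-square = {statistic:.2f}, df = {df}, p value = {p_value:.2f}")
```

The same two functions can be applied to the other columns by changing the expected frequencies and subtracting the corresponding number of fitted parameters.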
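Table 5. Fetal movement [36].

| Number of Movements | Observed Frequency (Number of Intervals) | Expected Frequency: GPD (MLE) | Expected Frequency: GIT (MLE) | Expected Frequency: COM–Poisson (MLE) | Expected Frequency: NB-Shifted NB (MLE) | Expected Frequency: NB-Shifted NB (pgf-Estimator) |
|---|---|---|---|---|---|---|
| 0 | 182 | 182.50 | 176.61 | 176.67 | 182.02 | 181.96 |
| 1 | 41 | 39.49 | 46.71 | 46.63 | 41.22 | 41.60 |
| 2 | 12 | 11.62 | 12.29 | 12.30 | 10.30 | 10.14 |
| 3 | 2 | 3.95 | 3.23 | 3.24 | 3.78 | 3.69 |
| 4 | 2 | 1.46 | 0.85 | 0.85 | 1.52 | 1.49 |
| 5 | 0 | 0.57 | 0.22 | 0.23 | 0.65 | 0.63 |
| 6 | 0 | 0.23 | 0.06 | 0.06 | 0.28 | 0.27 |
| 7 | 1 | 0.17 | 0.02 | 0.02 | 0.23 | 0.23 |
| Total | 240 | | | | | |
| χ² | | 6.09 | 48.75 | 48.32 | 4.73 | 4.87 |
| p value | | 0.30 | 0.0 | 0.0 | 0.32 | 0.30 |
| Parameter estimates | | θ̂ = 0.66 | p̂₂ = 0.001 | ν̂ = 0.001 | θ̂ = 0.50 | θ̂ = 0.50 |
| | | λ̂ = 0.22 | p̂₃ = 0.26 | λ̂ = 0.26 | p̂ = 0.08 | p̂ = 0.09 |
| | | | n = 1 | | ν̂ = 0.28 | ν̂ = 0.27 |

ID: 1.84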
Test for equi-dispersion: Rao’s score test statistic = 26.85 and GLR test statistic = 19.67 (Reject the null hypothesis).
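The ID values reported with Tables 3 to 5 can be recovered directly from the observed frequencies. The sketch below assumes that ID denotes the sample variance divided by the sample mean, with the variance computed using the n − 1 divisor; the function name `index_of_dispersion` is illustrative. With that convention the three data sets give 0.85, 0.97 and 1.84, in agreement with the tabulated values.

```python
from statistics import mean, variance


def index_of_dispersion(freqs):
    """Sample variance divided by sample mean for counts 0, 1, 2, ...

    freqs[k] is the observed number of units with count k.
    """
    # Expand the frequency table into the raw sample of counts.
    sample = [k for k, f in enumerate(freqs) for _ in range(f)]
    # statistics.variance uses the n - 1 divisor.
    return variance(sample) / mean(sample)


# Observed frequencies copied from Tables 3, 4 and 5.
observed = {
    "Table 3": [0, 9, 26, 33, 39, 36, 26, 23, 3, 2, 2, 1],
    "Table 4": [19, 56, 60, 31, 18, 11, 5],
    "Table 5": [182, 41, 12, 2, 2, 0, 0, 1],
}

for name, freqs in observed.items():
    print(name, round(index_of_dispersion(freqs), 2))
```

Values below 1 (Tables 3 and 4) point towards under-dispersion and values above 1 (Table 5) towards over-dispersion, although only the Table 5 data lead to rejection of equi-dispersion in the tests reported above.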
Table 6. Summary statistics for MLE and pgf-estimator based on bootstrap simulation.

| Data | Statistic | MLE θ̂ | MLE p̂ | MLE ν̂ | pgf-Estimator θ̂ | pgf-Estimator p̂ | pgf-Estimator ν̂ |
|---|---|---|---|---|---|---|---|
| Table 3 | Estimated parameters | 0.08 | 0.99 | 38.58 | 0.15 | 0.99 | 19.75 |
| | Mean | 0.09 | 0.98 | 57.32 | 0.14 | 0.98 | 36.97 |
| | Standard error | 0.07 | 0.06 | 36.64 | 0.08 | 0.01 | 32.35 |
| | Confidence interval | (0.03, 0.24) | (0.83, 0.99) | (10.51, 100) | (0.03, 0.32) | (0.94, 0.99) | (7.41, 100) |
| | Bias | 0.01 | −0.02 | 18.74 | −0.01 | −0.01 | 17.22 |
| Table 4 | Estimated parameters | 0.19 | 0.63 | 6.44 | 0.21 | 0.65 | 5.48 |
| | Mean | 0.17 | 0.59 | 20.50 | 0.20 | 0.61 | 18.35 |
| | Standard error | 0.10 | 0.19 | 30.70 | 0.11 | 0.16 | 28.73 |
| | Confidence interval | (0.02, 0.36) | (0.01, 0.81) | (2.41, 100) | (0.02, 0.40) | (0.08, 0.81) | (2.16, 99.99) |
| | Bias | −0.02 | −0.05 | 14.06 | −0.02 | −0.04 | 12.88 |
| Table 5 | Estimated parameters | 0.5 | 0.08 | 0.28 | 0.5 | 0.09 | 0.27 |
| | Mean | 0.48 | 0.07 | 0.40 | 0.49 | 0.08 | 0.37 |
| | Standard error | 0.14 | 0.05 | 0.36 | 0.14 | 0.06 | 0.28 |
| | Confidence interval | (0.20, 0.72) | (0.01, 0.17) | (0.08, 1.28) | (0.22, 0.74) | (0.01, 0.18) | (0.07, 1.10) |
| | Bias | −0.02 | −0.01 | 0.12 | −0.01 | −0.01 | 0.10 |

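The quantities summarised in Table 6 (bootstrap mean, standard error, confidence interval and bias) can be sketched for a generic estimator as follows. This is not the procedure used for the table: it assumes a nonparametric bootstrap with a percentile interval, takes bias as the bootstrap mean minus the original estimate (which is consistent with the tabulated values up to rounding), and uses the sample index of dispersion of the Table 5 data as a stand-in for the MLE and pgf-estimator of θ, p and ν. The function names, the number of replicates and the seed are illustrative.

```python
import random
from statistics import mean, stdev, variance


def index_of_dispersion(sample):
    """Stand-in estimator: sample variance (n - 1 divisor) over sample mean."""
    return variance(sample) / mean(sample)


def bootstrap_summary(sample, estimator, replicates=500, level=0.95, seed=0):
    """Nonparametric bootstrap summary of an estimator (percentile interval)."""
    rng = random.Random(seed)
    estimate = estimator(sample)
    # Resample with replacement and re-estimate on each bootstrap sample.
    boot = sorted(
        estimator(rng.choices(sample, k=len(sample))) for _ in range(replicates)
    )
    lower = boot[round((1 - level) / 2 * replicates)]
    upper = boot[round((1 + level) / 2 * replicates) - 1]
    boot_mean = mean(boot)
    return {
        "estimate": estimate,
        "mean": boot_mean,
        "standard_error": stdev(boot),
        "confidence_interval": (lower, upper),
        "bias": boot_mean - estimate,
    }


# Observed frequencies of 0, 1, ..., 7 movements copied from Table 5.
freqs = [182, 41, 12, 2, 2, 0, 0, 1]
sample = [k for k, f in enumerate(freqs) for _ in range(f)]

summary = bootstrap_summary(sample, index_of_dispersion)
for key, value in summary.items():
    print(key, value)
```

Replacing `index_of_dispersion` with a function that returns a fitted parameter would give one column of a table with the same layout as Table 6.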
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Ong, S.H.; Sim, S.Z.; Liu, S.; Srivastava, H.M. A Family of Finite Mixture Distributions for Modelling Dispersion in Count Data. Stats 2023, 6, 942-955. https://doi.org/10.3390/stats6030059
