Few statistical tests for proportions comparison
Introduction
In operations research, one frequently needs to compare two solution methods with each other. This is particularly the case when tuning the parameters of an algorithm, where one wants to know whether a given parameter setting is better than another. In practice, several approaches exist for identifying the best setting. Without being exhaustive, common techniques are the following:
1. In the context of optimization, a set of problem instances is solved with both methods to be compared. Then the mean, standard deviation (and possibly other measures such as median, minimum, maximum, skewness, kurtosis, etc.) of the solution values obtained are computed.
2. In the context of solving problems exactly, the mean, standard deviation, etc. of the computational effort needed to obtain the optimum solution are computed.
3. The maximal computational effort is fixed, as well as a goal to reach. One counts the number of times each method reaches the goal within the allowed computational effort.
Naturally, there are many variants and other statistics that can be collected. In the first comparison technique, the computational effort is not taken into account: either it is assumed to be very small, or both methods are assumed to require approximately the same computational effort.
Very often in practice, the measures computed in the first and second comparison techniques quoted above are very primitive; sometimes they are limited to the mean alone. This is clearly insufficient for claiming that one solution method is statistically better than another.
When the standard deviation is provided in addition to the mean, it is generally (implicitly) assumed that the distribution of the population is normal. Under this assumption, a large number of statistical tests are available and can be validly performed. Unfortunately, the normality assumption is far from being always satisfied. For instance, an optimization technique that frequently finds globally optimal solutions has a distribution with a truncated tail, since it is impossible to go beyond the optimum. This situation is illustrated in Fig. 1, which provides the empirical distributions of solution values obtained by two nondeterministic optimization techniques (Robust taboo search (Taillard, 1991) and POPMUSIC (Taillard and Voss, 2002)) on a turbine runner balancing problem instance. Although this situation is frequent with metaheuristic-based optimization methods, it cannot be generalized.
This figure shows clearly that the distributions are asymmetrical and left truncated (this is a minimization problem; the vertical axis is placed at a lower bound on the optimum), and that the two distribution functions are different. Therefore, estimating a parameter (the mean) of an a priori unknown distribution function is not straightforward. Moreover, a confidence interval for the mean should be given, which is also not easy to obtain. A bootstrap approach (Davison and Hinkley, 1997, Efron and Tibshirani, 1993) could be convenient.
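A percentile bootstrap, for instance, yields a confidence interval for the mean without any normality assumption. The following is a minimal stdlib sketch on synthetic, right-skewed data (an assumption for illustration, not data from the article):

```python
# Percentile-bootstrap confidence interval for the mean of an a priori
# unknown (here, skewed) distribution. The sample is synthetic; in practice
# it would hold objective values from repeated runs of an optimization method.
import random

random.seed(42)
# Synthetic "solution costs": a lower bound of 100 plus a right-skewed deviation
sample = [100 + abs(random.gauss(0, 5)) for _ in range(200)]

def bootstrap_ci(data, stat=lambda xs: sum(xs) / len(xs),
                 n_resamples=2000, alpha=0.05):
    """Percentile bootstrap CI for an arbitrary statistic of the data."""
    n = len(data)
    stats = sorted(stat([random.choice(data) for _ in range(n)])
                   for _ in range(n_resamples))
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci(sample)
print(f"95% bootstrap CI for the mean: [{lo:.2f}, {hi:.2f}]")
```

No distributional assumption enters the computation: the interval comes entirely from resampling the observed values.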
When the third comparison approach quoted above is used (counting the number of successes), the sign test (Arbuthnott, 1710) (see, e.g. Conover, 1999) or, better, Fisher’s exact test for 2 × 2 contingency tables is convenient. A run of a method is successful if it reaches a given goal. In the context of NP-complete problems, the goal is to find a feasible solution. In the context of optimization problems (e.g. NP-hard problems), the goal could be finding the optimum solution (provided that such a solution can be characterized) or finding a solution that is within a given percentage above (respectively: below) a lower (respectively: upper) bound on the optimum. When two methods have to be compared on a given set of problem instances, a success for a method could be to provide a solution of better quality than the solution produced by the other method for the same problem instance. In the context of comparing two methods for multiobjective optimization, a success could be to find a solution that is not dominated by the set of solutions produced by the other method. Naturally, the definition of a “success” must be clearly stated before a statistical test is undertaken, but the user has wide latitude in choosing the definition, possibly leading to different conclusions!
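As a sketch, Fisher’s exact test for such a 2 × 2 success/failure table can be computed directly from hypergeometric probabilities; the success counts below are hypothetical, chosen only to illustrate the computation:

```python
# Stdlib-only sketch of the two-sided Fisher's exact test for a 2x2 table
# of success/failure counts from independent runs of two methods.
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(k):  # P(first cell = k) under fixed margins
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    k_min = max(0, row1 - (n - col1))
    k_max = min(row1, col1)
    p_obs = prob(a)
    # small tolerance guards against floating-point ties
    return sum(p for p in (prob(k) for k in range(k_min, k_max + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical counts: Method A succeeded in 18 of 20 runs, Method B in 11 of 20
p = fisher_exact_2x2(18, 2, 11, 9)
print(f"two-sided p = {p:.4f}")  # about 0.0310: significant at the 5% level
```

The definition of “success” fixed beforehand determines the table entries; the test itself only sees the four counts.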
This article develops a statistical test that is more powerful than both the sign test and Fisher’s test for comparing proportions. The test is based on a standard methodology that nevertheless seems uncommon in practice. Indeed, it is not developed in the literature consulted, although it cannot be excluded that it appears somewhere, since there is a huge number of articles and books dealing with contingency tables (see, e.g. Conover, 1999, Good, 2005). Before the presentation of the new test, other approaches commonly used in practice are reviewed. Finally, the new test is numerically compared to these approaches.
Section snippets
Comparing proportions
The central problem treated in this article is the following: suppose that two populations A and B are governed by Bernoulli distributions, i.e., the probability of success of an occurrence of A (respectively: B) is given by pa (respectively: pb). From the OR user’s point of view, the result of the execution of a method is considered a random variable. Indeed, either the method is nondeterministic, which is typically the case of simulated annealing, or the problem data can be
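Under this Bernoulli model, the classical large-sample procedure for testing H0: pa = pb is the two-proportion z-test, one of the “standard tests” available in this setting. A minimal sketch with hypothetical counts (an illustration, not the test proposed by the article):

```python
# Two-sided large-sample z-test for H0: pa = pb, given the number of
# successes and runs of each method. Stdlib only; counts are hypothetical.
from math import sqrt, erf

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference of two proportions."""
    pa_hat, pb_hat = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)   # pooled estimate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (pa_hat - pb_hat) / se
    # standard normal CDF via the error function
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = two_proportion_z_test(45, 100, 30, 100)
print(f"z = {z:.3f}, p = {p:.4f}")
```

The normal approximation behind this test is only valid for reasonably large samples, which is precisely why exact tests such as Fisher’s remain relevant for small success counts.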
A new test for comparing proportions
The drawback of the McNemar test is that paired data are required, which is not always possible in practice. For instance, suppose that Method B was run on nb randomly generated problem instances. The rules for problem generation are perfectly known, but the nb instances themselves have not been published. So the designer of Method A, who wants to compare his method to Method B, can run his method as many times as he wants (na times). However, if the code of Method B is
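The paired setting that the McNemar test requires can be sketched as follows: both methods are run on the same instances, and only the discordant pairs (one method succeeds where the other fails) carry information. A stdlib sketch of the exact binomial form, with hypothetical counts:

```python
# Exact two-sided McNemar test from the discordant pair counts.
from math import comb

def mcnemar_exact(b, c):
    """Exact McNemar test: b = instances where only Method A succeeded,
    c = instances where only Method B succeeded.

    Under H0 (equal success probabilities), the b successes among the
    n = b + c discordant pairs follow a Binomial(n, 1/2) distribution.
    """
    n = b + c
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical paired runs: A alone succeeded 12 times, B alone 3 times
p = mcnemar_exact(12, 3)
print(f"two-sided p = {p:.4f}")
```

Note that the concordant pairs (both succeed or both fail) do not appear in the computation at all, which is exactly why unpaired data cannot be fed to this test.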
Numerical results
The power of a statistical hypothesis test is defined as the probability of rejecting a false null hypothesis. So, the higher the power of a test, the better it can discriminate subtle differences between samples, and the better the test is considered.
This section empirically shows that the new test we propose is more powerful than those provided by McNemar and Fisher and, for large samples, slightly more powerful than standard tests. If abusively applied
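Power can be estimated empirically by Monte Carlo simulation: generate many experiments under a known alternative hypothesis and count how often the test rejects H0. The sketch below uses the standard two-proportion z-test purely for illustration; all parameters (pa = 0.6, pb = 0.4, sample sizes, trial counts) are assumptions, not results from the article:

```python
# Monte Carlo power estimation: the fraction of simulated experiments in
# which a false H0 (pa = pb, while truly pa = 0.6 and pb = 0.4) is rejected.
import random
from math import sqrt, erf

def z_test_p_value(sa, na, sb, nb):
    """Two-sided p-value of the pooled two-proportion z-test."""
    pool = (sa + sb) / (na + nb)
    se = sqrt(pool * (1 - pool) * (1 / na + 1 / nb))
    if se == 0:
        return 1.0
    z = (sa / na - sb / nb) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def estimated_power(pa, pb, n, alpha=0.05, trials=1000, seed=1):
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sa = sum(rng.random() < pa for _ in range(n))  # successes of A
        sb = sum(rng.random() < pb for _ in range(n))  # successes of B
        if z_test_p_value(sa, n, sb, n) < alpha:
            rejections += 1
    return rejections / trials

print(f"estimated power: {estimated_power(0.6, 0.4, 50):.3f}")
```

The same harness, run with different tests on identical simulated samples, is how powers of competing tests can be compared empirically.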
Conclusions
The new statistical test developed in this article is shown to be much more powerful than the classical McNemar nonparametric test. The power of the new test is comparable to that of standard parametric tests, and it is slightly more powerful than Fisher’s exact test. This result is very positive, since it is commonly believed that parametric tests are significantly more powerful than nonparametric ones and that Fisher’s exact test is the best for 2 × 2 contingency tables. The tables provided in this
Acknowledgements
The authors are grateful to the anonymous referees whose comments helped to improve this article. They thank C. Evéquoz for corrections brought to the manuscript. This research was partially supported by the University of Applied Sciences of Western Switzerland, Grant QUALOPT-11731.
References (17)
- Taillard, É.D. (1991). Robust taboo search for the quadratic assignment problem. Parallel Computing.
- Arbuthnott, J. (1710). An argument for divine providence, taken from the constant regularity observed in the births of both sexes. Philosophical Transactions.
- Statistical Analysis for Engineers and Scientists (1994).
- Conover, W.J. (1999). Practical Nonparametric Statistics.
- Cramér, H. (1946). Mathematical Methods of Statistics.
- Davison, A.C., Hinkley, D.V. (1997). Bootstrap Methods and their Application.
- Efron, B., Tibshirani, R.J. (1993). An Introduction to the Bootstrap.
- Finney, D.J. (1948). The Fisher–Yates test of significance in 2 × 2 contingency tables. Biometrika.