Few statistical tests for proportions comparison
Introduction
In operations research, one frequently needs to compare two solution methods with each other. This is particularly the case when tuning the parameters of an algorithm, where one wants to know whether a given parameter setting is better than another. In practice, several approaches exist for identifying the best setting. Without being exhaustive, common techniques are the following:
1. In the context of optimization, a set of problem instances is solved with both methods to be compared. Then the mean, standard deviation (and possibly other measures such as median, minimum, maximum, skewness, kurtosis, etc.) of the solution values obtained are computed.
2. In the context of solving problems exactly, the mean, standard deviation, etc. of the computational effort needed to obtain the optimum solution are computed.
3. The maximal computational effort is fixed, as well as a goal to reach. One counts the number of times each method reaches the goal within the allowed computational effort.
Naturally, there are many variants and other statistics that can be collected. In the first comparison technique, the computational effort is not taken into account: either it is assumed to be very small, or both methods are assumed to require approximately the same computational effort.
Very often in practice, the measures computed in the first and second comparison techniques quoted above are very primitive; sometimes they are limited to the mean alone. This is clearly insufficient for claiming that one solution method is statistically better than another.
When the standard deviation is provided in addition to the mean, it is generally (implicitly) assumed that the distribution of the population is normal. Under this assumption, a large number of statistical tests are available and can be validly performed. Unfortunately, the normality assumption is far from being always satisfied. For instance, an optimization technique that frequently finds globally optimal solutions has a distribution with a truncated tail, since it is impossible to go beyond the optimum. This situation is illustrated in Fig. 1, which provides the empirical distributions of solution values obtained by two nondeterministic optimization techniques (Robust taboo search (Taillard, 1991) and POPMUSIC (Taillard and Voss, 2002)) on a turbine runner balancing problem instance. Although this situation is frequent with metaheuristic-based optimization methods, it cannot be generalized.
This figure shows clearly that the distributions are asymmetrical and left truncated (this is a minimization problem; the vertical axis is placed at a lower bound on the optimum), and that the two distribution functions are different. Therefore, estimating a parameter (the mean) of an a priori unknown distribution function is not straightforward. Moreover, a confidence interval for the mean should be given, which is also not easy to obtain. A bootstrap approach (Davison and Hinkley, 1997, Efron and Tibshirani, 1993) could be convenient.
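A percentile bootstrap, for instance, yields a confidence interval for the mean without any normality assumption. The following is a minimal stdlib sketch on synthetic, right-skewed data (an assumption for illustration, not data from the article):

```python
# Percentile-bootstrap confidence interval for the mean of an a priori
# unknown (here, skewed) distribution. The sample is synthetic; in practice
# it would hold objective values from repeated runs of an optimization method.
import random

random.seed(42)
# Synthetic "solution costs": a lower bound of 100 plus a right-skewed deviation
sample = [100 + abs(random.gauss(0, 5)) for _ in range(200)]

def bootstrap_ci(data, stat=lambda xs: sum(xs) / len(xs),
                 n_resamples=2000, alpha=0.05):
    """Percentile bootstrap CI for an arbitrary statistic of the data."""
    n = len(data)
    stats = sorted(stat([random.choice(data) for _ in range(n)])
                   for _ in range(n_resamples))
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci(sample)
print(f"95% bootstrap CI for the mean: [{lo:.2f}, {hi:.2f}]")
```

No distributional assumption enters the computation: the interval comes entirely from resampling the observed values.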
When the third comparison approach quoted above is used (counting the number of successes), the sign test (Arbuthnott, 1710) (see, e.g. Conover, 1999) or, better, Fisher’s exact test for 2 × 2 contingency tables is convenient. A run of a method is successful if it reaches a given goal. In the context of NP-complete problems, the goal is to find a feasible solution. In the context of optimization problems (e.g. NP-hard problems), the goal could be finding the optimum solution (provided that such a solution can be characterized) or finding a solution that is within a given percentage above (respectively: below) a lower (respectively: upper) bound on the optimum. When two methods have to be compared on a given set of problem instances, a success for a method could be to provide a solution of better quality than the solution produced by the other method for the same problem instance. In the context of comparing two methods for multiobjective optimization, a success could be to find a solution that is not dominated by the set of solutions produced by the other method. Naturally, the definition of a “success” must be clearly stated before a statistical test is undertaken, but the user has wide latitude in choosing the definition, possibly leading to different conclusions!
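As a sketch, Fisher’s exact test for such a 2 × 2 success/failure table can be computed directly from hypergeometric probabilities; the success counts below are hypothetical, chosen only to illustrate the computation:

```python
# Stdlib-only sketch of the two-sided Fisher's exact test for a 2x2 table
# of success/failure counts from independent runs of two methods.
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(k):  # P(first cell = k) under fixed margins
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    k_min = max(0, row1 - (n - col1))
    k_max = min(row1, col1)
    p_obs = prob(a)
    # small tolerance guards against floating-point ties
    return sum(p for p in (prob(k) for k in range(k_min, k_max + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical counts: Method A succeeded in 18 of 20 runs, Method B in 11 of 20
p = fisher_exact_2x2(18, 2, 11, 9)
print(f"two-sided p = {p:.4f}")  # about 0.0310: significant at the 5% level
```

The definition of “success” fixed beforehand determines the table entries; the test itself only sees the four counts.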
This article develops a statistical test that is more powerful than both the sign test and Fisher’s test for comparing proportions. The test is based on a standard methodology that nevertheless seems uncommon in practice. Indeed, it is not developed in the literature consulted, although it cannot be excluded that it appears somewhere, since there is a huge number of articles and books dealing with contingency tables (see, e.g. Conover, 1999, Good, 2005). Before the presentation of the new test, other approaches commonly used in practice are reviewed. Finally, the new test is numerically compared to these approaches.
Section snippets
Comparing proportions
The central problem treated in this article is the following: suppose that two populations A and B are governed by Bernoulli distributions, i.e., the probability of success of an occurrence of A (respectively: B) is given by pa (respectively: pb). From the OR user’s point of view, the result of the execution of a method is considered a random variable. Indeed, either the method is nondeterministic, which is typically the case of simulated annealing, or the problem data can be
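Under this Bernoulli model, the classical large-sample procedure for testing H0: pa = pb is the two-proportion z-test, one of the “standard tests” available in this setting. A minimal sketch with hypothetical counts (an illustration, not the test proposed by the article):

```python
# Two-sided large-sample z-test for H0: pa = pb, given the number of
# successes and runs of each method. Stdlib only; counts are hypothetical.
from math import sqrt, erf

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference of two proportions."""
    pa_hat, pb_hat = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)   # pooled estimate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (pa_hat - pb_hat) / se
    # standard normal CDF via the error function
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = two_proportion_z_test(45, 100, 30, 100)
print(f"z = {z:.3f}, p = {p:.4f}")
```

The normal approximation behind this test is only valid for reasonably large samples, which is precisely why exact tests such as Fisher’s remain relevant for small success counts.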
A new test for comparing proportions
The drawback of the McNemar test is that paired data are required, which is not always possible in practice. For instance, suppose that Method B was run on nb randomly generated problem instances. The rules for problem generation are perfectly known, but the nb instances themselves have not been published. So the designer of Method A, who wants to compare his method to Method B, can run his method as many times as he wants (na times). However, if the code of Method B is
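The paired setting that the McNemar test requires can be sketched as follows: both methods are run on the same instances, and only the discordant pairs (one method succeeds where the other fails) carry information. A stdlib sketch of the exact binomial form, with hypothetical counts:

```python
# Exact two-sided McNemar test from the discordant pair counts.
from math import comb

def mcnemar_exact(b, c):
    """Exact McNemar test: b = instances where only Method A succeeded,
    c = instances where only Method B succeeded.

    Under H0 (equal success probabilities), the b successes among the
    n = b + c discordant pairs follow a Binomial(n, 1/2) distribution.
    """
    n = b + c
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical paired runs: A alone succeeded 12 times, B alone 3 times
p = mcnemar_exact(12, 3)
print(f"two-sided p = {p:.4f}")
```

Note that the concordant pairs (both succeed or both fail) do not appear in the computation at all, which is exactly why unpaired data cannot be fed to this test.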
Numerical results
The power of a statistical hypothesis test is defined as the probability of rejecting a false null hypothesis. So, the higher the power of a test, the better it can discriminate subtle differences between samples, and the better the test is considered.
This section empirically shows that the new test we propose is more powerful than those provided by McNemar and Fisher and, for large samples, slightly more powerful than standard tests. If abusively applied
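Power can be estimated empirically by Monte Carlo simulation: generate many experiments under a known alternative hypothesis and count how often the test rejects H0. The sketch below uses the standard two-proportion z-test purely for illustration; all parameters (pa = 0.6, pb = 0.4, sample sizes, trial counts) are assumptions, not results from the article:

```python
# Monte Carlo power estimation: the fraction of simulated experiments in
# which a false H0 (pa = pb, while truly pa = 0.6 and pb = 0.4) is rejected.
import random
from math import sqrt, erf

def z_test_p_value(sa, na, sb, nb):
    """Two-sided p-value of the pooled two-proportion z-test."""
    pool = (sa + sb) / (na + nb)
    se = sqrt(pool * (1 - pool) * (1 / na + 1 / nb))
    if se == 0:
        return 1.0
    z = (sa / na - sb / nb) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def estimated_power(pa, pb, n, alpha=0.05, trials=1000, seed=1):
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sa = sum(rng.random() < pa for _ in range(n))  # successes of A
        sb = sum(rng.random() < pb for _ in range(n))  # successes of B
        if z_test_p_value(sa, n, sb, n) < alpha:
            rejections += 1
    return rejections / trials

print(f"estimated power: {estimated_power(0.6, 0.4, 50):.3f}")
```

The same harness, run with different tests on identical simulated samples, is how powers of competing tests can be compared empirically.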
Conclusions
The new statistical test developed in this article is shown to be much more powerful than the classical McNemar nonparametric test. The power of the new test is comparable to that of standard parametric tests, and it is slightly more powerful than Fisher’s exact test. This result is very positive, since it is commonly believed that parametric tests are significantly more powerful than nonparametric ones and that Fisher’s exact test is the best for 2 × 2 contingency tables. The tables provided in this
Acknowledgements
The authors are grateful to the anonymous referees whose comments helped to improve this article. They thank C. Evéquoz for corrections brought to the manuscript. This research was partially supported by the University of Applied Sciences of Western Switzerland, Grant QUALOPT-11731.
References (17)
- Taillard, É.D. (1991). Robust taboo search for the quadratic assignment problem. Parallel Computing.
- Arbuthnott, J. (1710). An argument for divine providence, taken from the constant regularity observed in the births of both sexes. Philosophical Transactions.
- Statistical Analysis for Engineers and Scientists (1994).
- Conover, W.J. (1999). Practical Nonparametric Statistics.
- Cramér, H. (1946). Mathematical Methods of Statistics.
- Davison, A.C., Hinkley, D.V. (1997). Bootstrap Methods and their Application.
- Efron, B., Tibshirani, R.J. (1993). An Introduction to the Bootstrap.
- Finney, D.J. (1948). The Fisher–Yates test of significance in 2 × 2 contingency tables. Biometrika.