An analysis of the quality of experimental design and reliability of results in tribology research
Introduction
In recent years, several high-profile projects and publications have questioned the repeatability of scientific research in general. One notable replication study in psychology [1] aimed to replicate the results of one hundred papers published in highly regarded journals in the field; it failed to reproduce the original results for two thirds of the sample. Ioannidis [2] estimated the percentage of findings in the medical literature that are likely to be true, concluding that fewer than 50% are expected to be true for most experimental designs.
There are many reasons why erroneous results make it to publication. Most obviously, random chance can produce a result that seems important when no real effect is present. This is more likely when sample sizes are smaller, when effect sizes are smaller [3], when a greater number of relationships is tested with less pre-selection [4], and when statistical thresholds are too lenient [2], [5]. This problem is exacerbated by the bias to publish positive results and leave negative results unpublished [6], [7].
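The interaction of these factors can be illustrated with the positive-predictive-value argument used by Ioannidis [2]: of all findings that pass a significance test, the fraction that reflect real effects depends on the test threshold, the study power and the prior odds that a tested relationship is true. The sketch below is illustrative only; the example numbers for these three quantities are assumptions, not values from this study.

```python
# Sketch of the positive-predictive-value (PPV) argument from Ioannidis [2]:
# of all 'significant' findings, how many reflect real effects?
# The example numbers below (alpha, power, prior odds) are illustrative.

def ppv(alpha, power, prior_odds):
    """Probability that a statistically significant finding is true.

    alpha      -- false positive rate of the test (e.g. 0.05)
    power      -- chance of detecting a real effect (1 - beta)
    prior_odds -- ratio of true to false relationships tested (R)
    """
    true_positives = power * prior_odds   # real effects correctly detected
    false_positives = alpha * 1.0         # per unit of false relationships
    return true_positives / (true_positives + false_positives)

# A field testing mostly speculative relationships (1 true per 10 false)
# with underpowered studies (power = 0.4):
print(round(ppv(0.05, 0.4, 0.1), 2))  # -> 0.44: most positives are false
```

With well-powered studies of well-motivated hypotheses (e.g. power = 0.8, even prior odds) the same formula gives a PPV above 0.9, which is the case for raising power and pre-selecting hypotheses.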
Bias can also be introduced through the design of the experiment and the data analysis methods used. This is not always obvious; for example, if a nearly significant result is found, adding further repeats to the experiment to push for a significant result increases the false positive rate [8]. This and other seemingly 'borderline' methods, such as poor normalisation, have been used deliberately to show erroneous results, including that listening to children's music makes the listener younger (p = 0.03) [8].
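A minimal simulation makes this 'optional stopping' effect concrete. The group sizes, stopping rule and number of simulations below are illustrative assumptions; both groups are drawn from the same distribution, so every significant result is, by construction, a false positive.

```python
# Simulation of optional stopping [8]: run a two-sample t-test on
# n = 10 per group; if the result is 'nearly significant'
# (0.05 < p < 0.10), add 5 more samples per group and test again.
# Both groups come from the SAME distribution (no real effect).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, false_pos = 20_000, 0
for _ in range(n_sims):
    a, b = rng.normal(size=10), rng.normal(size=10)
    p = ttest_ind(a, b).pvalue
    if 0.05 < p < 0.10:  # 'push for significance' with extra repeats
        a = np.concatenate([a, rng.normal(size=5)])
        b = np.concatenate([b, rng.normal(size=5)])
        p = ttest_ind(a, b).pvalue
    false_pos += p < 0.05

print(f"false positive rate: {false_pos / n_sims:.3f}")  # above the nominal 0.05
```

The nominal rate for a single fixed-size test is 5%; the conditional retesting pushes the realised rate above that, even though each individual test looks legitimate.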
These problems have driven some fields of science to stricter rules for publishing. In physics and genomic studies the threshold p values for statistical tests have been lowered; the same case has been made for other areas of science [5]. Many medical journals now require pre-registration of a trial, including its data analysis techniques, before the trial starts, on the understanding that the trial will be published regardless of the outcome. Kaplan et al. [7] found that this reduced the proportion of positive results in one heart disease journal from 57% to 8%.
These problems have been particularly felt in fields which study complex systems, such as the human mind or body, using experimental methods. Tribological tests typically aim to investigate properties of, or directly compare, complex mechanical systems. The global responses of these systems can be altered by a wide variety of factors, many of which are ignored for practical reasons, cannot be fully controlled, cannot be measured, or are neglected to maintain applicability to a real system. Each of these factors has the potential to influence system-level results such as forces and wear rates. For this reason, results from tribological tests should be expected to show some random scatter, and the problems outlined above could be expected in the field of tribology.
There is strong evidence to suggest that typical tribological results show large random variation. For example, current standards for ball-on-flat wear testing [9] and friction testing of plastic sheets [10] give between-laboratory coefficients of variation (COV) of 49% and 18% (static) respectively, with in-laboratory values of 23% and 15%. At these levels of variation, if a wear test (COV = 23%) were repeated in the same laboratory, the two results would be expected to differ by more than 50% of the smaller value 23% of the time; if the tests are performed in different laboratories, this rises to 57%. Assuming the COV is accurate, this also means that for ball-on-flat wear testing a total of 21 tests would be required for a laboratory to estimate the mean worn volume with a 5% standard error. For more complicated systems, results should be expected to contain more variation.
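The sample-size figure quoted above follows from the standard error of the mean, SE = COV/√n (both expressed as fractions of the mean). A minimal sketch, assuming normally distributed results, reproduces the numbers:

```python
# Tests needed so the standard error of the mean falls to target_se,
# with both COV and target_se expressed as fractions of the mean:
#   SE = COV / sqrt(n)  =>  n = (COV / SE)**2
# Assumes independent, normally distributed test results.

def n_for_standard_error(cov, target_se):
    return round((cov / target_se) ** 2)

# In-laboratory ball-on-flat wear testing, COV = 23%, target SE = 5%:
print(n_for_standard_error(0.23, 0.05))  # -> 21, the figure quoted above

# Between laboratories the COV is 49%, so the requirement grows sharply:
print(n_for_standard_error(0.49, 0.05))  # -> 96
```

The quadratic dependence on COV is the practical point: doubling the scatter quadruples the number of repeats needed for the same precision.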
In light of the problems outlined above, the aim of this study is to assess the state of experimental design in experimental tribology research. In particular, the prevalence of simple design fundamentals is assessed, together with estimates of study power (the chance that a study will find an effect if one is present) for the field. This is pursued through a large-scale review of the output of highly regarded journals in the field. The aim is not to critique individual papers or authors, but to assess the state of the research and propose improvements.
Criteria
The method used to evaluate the quality of each study is given in Table 1. Although other methods have been developed for grading work in tribology [11], these were not used, in favour of simpler and more objective measures. Grading methods also exist for studies in medicine (e.g. [12]); however, these are not used either, as many of their criteria are not relevant to this field.
Justification for criteria
A brief justification for each of the measures in Table 1 is given below. It should be noted that these criteria have
Training of assessors
In-person training was provided for assessors; however, after this initial training others joined the project. At a minimum, the aims of the project were described over e-mail, and the project coordinator (M Watson) was available throughout the process for consultation. All assessors received a formatted answer sheet and a clear description of each criterion with their list of works. Examples of each of these are given in the Additional material.
Results
Of the 379 unique papers that were assessed, 290 were experimental studies. Where an assessor recorded that a paper was not an experimental study, it was excluded from the following analysis; 'N/A' results are also excluded. Basic results for the prevalence of the criteria described above are shown in Fig. 1. As shown, the prevalence of the majority of the measures is below 10% in this sample; however, control tests (64%, n = 235), repeated tests (30%, n = 290) and giving
Discussion
The results outlined above show that the measures investigated are not well adopted as a whole. Prevalence is below 10% for the investigated factors apart from control tests, giving details of analysis methods, and repeating tests at least three times. For control tests, the agreement between assessors was very poor. This may be due to the definition of control tests used in this study and the variety of studies reviewed. For example, it is not always a fair control to compare to bare metal
Statistical design
The first step in designing a study of a change to a tribological system (such as a change in speed, load, material or coating) is to decide what you are actually interested in. Modern testing machines measure everything possible to give the user more understanding of what is going on, and post-test analyses are also often used. This extra information can be useful, but it leads to a problem: if the experimenter is biased in any way towards a particular outcome, or even just towards positive results,
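The risk of inspecting many measured quantities can be sketched with the family-wise error rate, 1 − (1 − α)^k, for k outcomes each tested at significance level α. This assumes the outcomes are independent; real measurement channels (friction, wear, temperature, vibration, ...) are usually correlated, so the numbers below are an illustrative upper bound, not a result from this study.

```python
# Family-wise false positive rate when k independent outcomes are each
# tested at alpha = 0.05 and ANY significant one would be reported.
# Real test-rig channels are correlated, so this is an upper-bound sketch.
alpha = 0.05
for k in (1, 3, 5, 10, 20):  # number of outcomes inspected
    family_wise = 1 - (1 - alpha) ** k
    print(f"{k:2d} outcomes -> {family_wise:.0%} chance of a false positive")
```

With ten logged channels, the chance of at least one spurious 'significant' effect is roughly 40% even when nothing real has changed, which is why the quantity of interest should be chosen before the test.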
Conclusions
We intend to change the way researchers think about these experiments. While it is true that samples of mild steel contain less variation than a similar number of human subjects, there is randomness and contamination present in every sample and procedure. Within a tribological system there is ample opportunity for these random changes to influence macro-scale results.
This study has shown that current research practices in the field do not take this into account, instead these systems are
Acknowledgements
This work was funded in part by EPSRC Grant: EP/R001766/1.
References (26)
- et al., Surface properties of eroded human primary and permanent enamel and the possible remineralization influence of CPP-ACP, Wear (2017)
- et al., Using a patterned microtexture to reduce polyethylene wear in metal-on-polyethylene prosthetic bearing couples, Wear (2017)
- et al., The effect of surface treatment on the friction and wear behavior of dental Y-TZP ceramic against human enamel, Tribol. Int. (2017)
- et al., The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials, Lancet (2001)
- Estimating the reproducibility of psychological science, Science (2015)
- Why most published research findings are false, PLoS Med. (2005)
- et al., Power failure: why small sample size undermines the reliability of neuroscience, Nat. Rev. Neurosci. (2013)
- et al., The garden of forking paths: why multiple comparisons can be a problem, even when there is no fishing expedition or p-hacking and the research hypothesis was posited ahead of time, Psychol. Bull. (2014)
- D.J. Benjamin, J.O. Berger, M. Johannesson, B.A. Nosek, E.-J. Wagenmakers, R. Berk, K.A. Bollen, B. Brembs, L. Brown, ...
- et al., Publication bias: what are the challenges and can they be overcome?, J. Psychiatry Neurosci. (2012)
- Likelihood of null effects of large NHLBI clinical trials has increased over time, PLoS One
- False-positive psychology, Psychol. Sci.
- ASTM G133-05, Standard Test Method for Linearly Reciprocating Ball-on-Flat Sliding Wear