Wear

Volumes 426–427, Part B, 30 April 2019, Pages 1712-1718

An analysis of the quality of experimental design and reliability of results in tribology research

https://doi.org/10.1016/j.wear.2018.12.028

Highlights

  • The prevalence of good quality experimental design in tribology research is very low.

  • Statistics are used in only 9.2% of studies.

  • The median sample size was 1 for the papers we reviewed.

Abstract

In recent years several high-profile projects have questioned the repeatability and validity of scientific research in the fields of psychology and medicine. In general, these studies have shown or estimated that less than 50% of published research findings are true or replicable, even when no breaches of ethics are committed. This low rate stems from widespread poor study design: either the use of underpowered studies or designs that allow the introduction of bias into the results.

In this work, we have aimed to assess, for the first time, the prevalence of good study design in the field of tribology. A set of simple criteria covering factors such as randomisation, blinding, and the use of control and repeated tests has been drawn up. These criteria have been used in a mass review of the output of five highly regarded tribology journals for the year 2017. In total, 379 papers, 28% of the total 2017 output of the selected journals, were reviewed by 26 reviewers.

Our results show that the prevalence of these simple aspects of study design is poor. Of 290 experimental studies, 2.2% used any form of blinding and 3.2% randomised either the tests or the test samples, while none randomised both. 30% repeated experiments three or more times, and 86% of those that repeated tests used a single batch of test materials. 4.4% performed statistical tests on their data.

Due to the low prevalence of repeated tests and statistical analysis, it is impossible to give a realistic estimate of the percentage of published works that are likely to be false positives; however, these results compare poorly with other, better-studied fields. Finally, recommendations for improved study design are given for researchers, along with recommendations on group practice for research group leaders.

Introduction

In recent years several high-profile projects and publications have questioned the repeatability of scientific research in general. One notable replication study in psychology [1] aimed to replicate the results of one hundred papers published in highly regarded journals of the field; it failed to reproduce the original results for two-thirds of the sample. Ioannidis [2] estimated the percentage of findings in the medical literature that are likely to be true, concluding that less than 50% are expected to be true for most experimental designs.

There are many reasons for erroneous results making it to publication. Most obviously, random chance can produce a result that seems important when no real effect is present. This is more likely when sample sizes are smaller, when effect sizes are smaller [3], when there is a greater number and less pre-selection of tested relationships [4], and when statistical significance thresholds are too lenient [2], [5]. The problem is exacerbated by the bias to publish positive results and leave negative results unpublished [6], [7].

Bias can also be introduced through the design of the experiment and the data analysis methods used. This is not always obvious; for example, if a nearly significant result is found, adding further repeats to the experiment to push it over the significance threshold increases the false positive rate [8]. This and other seemingly 'borderline' methods, such as poor normalisation, have been used deliberately to demonstrate erroneous results, including that listening to children's music makes the listener younger (p = 0.03) [8].
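
The optional-stopping effect described above is easy to demonstrate numerically. The following Python sketch is our illustration, not part of the original paper; the starting sample size of 5 per group and the cap of 15 are arbitrary assumptions. Both groups are drawn from the same distribution, so every 'significant' result is a false positive:

    # Monte Carlo sketch: adding repeats whenever p just misses 0.05
    # ("optional stopping") inflates the false positive rate above 5%.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    trials, alpha = 20_000, 0.05

    fixed_hits = peeking_hits = 0
    for _ in range(trials):
        # Fixed design: one test at the final sample size of 15.
        if stats.ttest_ind(rng.normal(size=15), rng.normal(size=15)).pvalue < alpha:
            fixed_hits += 1
        # Optional stopping: start at n = 5, re-test after each added pair.
        a, b = list(rng.normal(size=5)), list(rng.normal(size=5))
        while True:
            p = stats.ttest_ind(a, b).pvalue
            if p < alpha or len(a) >= 15:
                break
            a.append(rng.normal())
            b.append(rng.normal())
        if p < alpha:
            peeking_hits += 1

    print(f"fixed design false positive rate:      {fixed_hits / trials:.3f}")
    print(f"optional stopping false positive rate: {peeking_hits / trials:.3f}")

The fixed design stays close to the nominal 5%, while repeatedly re-testing after each added sample pushes the false positive rate well above it.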

These problems have driven some fields of science towards stricter rules for publishing. In physics and genomic studies the threshold p values for statistical tests have been lowered, and the case for doing so has been made in other areas of science [5]. Many medical journals now require pre-registration of a trial, including its data analysis techniques, before the trial starts, on the understanding that the trial will be published regardless of the outcome. Kaplan et al. [7] found that this reduced the proportion of positive results in one heart disease journal from 57% to 8%.

These problems have been felt particularly in fields that study complex systems, such as the human mind or body, using experimental methods. Tribological tests typically aim to investigate properties of, or directly compare, complex mechanical systems. The global responses of these systems can be altered by a wide variety of factors, many of which are ignored for practical reasons, cannot be fully controlled or measured, or are neglected to maintain applicability to a real system. Each of these factors has the potential to influence system-level results such as forces and wear rates. For this reason, results from tribological tests should be expected to show some random scatter, and the problems outlined above can be expected in the field of tribology.

There is strong evidence to suggest that typical tribological results show large random variation. For example, current standards for ball-on-flat wear testing [9] and friction testing of plastic sheets [10] give between-laboratory coefficients of variation (COV) of 49% and 18% (static) respectively, with in-laboratory values of 23% and 15%. At these levels of variation, if a wear test (in-laboratory COV = 23%) were repeated once in the same laboratory, the two results would be expected to differ by more than 50% of the smaller value 23% of the time. If the tests are performed in different laboratories this rises to 57%. Assuming the COV is accurate, this also means that a total of 21 ball-on-flat wear tests would be required for a laboratory to estimate the mean worn volume with a 5% standard error. For more complicated systems, results should be expected to contain even more variation.
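
These figures can be checked with a short Monte Carlo sketch. This is our illustration rather than the authors' calculation, and it assumes normally distributed test results with the stated COVs:

    # Monte Carlo check of the variation figures above, assuming
    # normally distributed wear results with mean 1 and the stated COVs.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000_000

    def frac_differing_by_half(cov):
        """Fraction of test pairs whose two results differ by more
        than 50% of the smaller of the two values."""
        x = rng.normal(1.0, cov, n)
        y = rng.normal(1.0, cov, n)
        return np.mean(np.abs(x - y) > 0.5 * np.minimum(x, y))

    print(f"same lab (COV = 23%):     {frac_differing_by_half(0.23):.2f}")
    print(f"between labs (COV = 49%): {frac_differing_by_half(0.49):.2f}")

    # Tests needed for the standard error of the mean to be 5% of the
    # mean: SE = COV / sqrt(n), so n = (COV / 0.05)**2 ≈ 21 for COV = 23%.
    print(f"tests for 5% standard error: {(0.23 / 0.05) ** 2:.1f}")

The sample-size line reproduces the 21 tests quoted above; the pairwise-difference fractions depend on the distributional assumption, so treat them as order-of-magnitude checks.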

In light of the problems outlined above, the aim of this study is to assess the state of experimental design in experimental tribology research. In particular, the prevalence of simple design fundamentals will be assessed, along with estimates of study power (the chance a study will find an effect if one is present) for the field. This will be pursued through a large-scale review of the output of highly regarded journals in the field. The aim is not to critique individual papers or authors, but to assess the state of the research and propose improvements.
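
To give a feel for what study power means at the sample sizes common in the field, the sketch below estimates the power of a two-sample t-test by simulation. The effect size of one standard deviation and the per-group sample sizes are our assumptions for illustration, not values taken from the reviewed papers:

    # Illustrative power estimate: the chance a two-sample t-test
    # detects a true one-standard-deviation difference (a large effect)
    # at alpha = 0.05, for small per-group sample sizes.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    trials, effect = 20_000, 1.0

    for n in (3, 5, 10, 20):
        hits = sum(
            stats.ttest_ind(rng.normal(0.0, 1.0, n),
                            rng.normal(effect, 1.0, n)).pvalue < 0.05
            for _ in range(trials)
        )
        print(f"n = {n:2d} per group: power ≈ {hits / trials:.2f}")

Even for this large an effect, power is low at small sample sizes; with a median sample size of 1, as found in the review below, power in this sense cannot be computed at all because no within-condition variance can be estimated.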

Criteria

The method used to evaluate the quality of each study is given in Table 1. Although other methods have been developed for grading work in tribology [11], these were set aside in favour of simpler, more objective measures. Grading schemes also exist for studies in medicine (e.g. [12]); however, these were not used either, as many of their criteria are not relevant to this field.

Justification for criteria

A brief justification for each of the measures in Table 1 is given below. It should be noted that these criteria have …

Training of assessors

In-person training was provided for assessors; however, after this initial training others joined the project. At a minimum, the aims of the project were described over e-mail, and the project coordinator (M. Watson) was available throughout the process for consultation. All assessors received a formatted answer sheet and a clear description of each criterion along with their list of works. Examples of each of these are given in the Additional material.

Results

Of the 379 unique papers that were assessed, 290 were experimental studies. Where an assessor recorded that a paper was not an experimental study, it is excluded from the following analysis, as are 'N/A' results. Basic results for the prevalence of the criteria described above are shown in Fig. 1. As shown, the prevalence of the majority of the measures is below 10% in this sample; however, control tests (64%, n = 235), repeated tests (30%, n = 290) and giving details of analysis methods …

Discussion

The results outlined above show that the measures investigated are not well adopted as a whole. Prevalence is below 10% for the investigated factors, apart from control tests, giving details of analysis methods, and repeating tests at least three times. For control tests, the agreement between assessors was very poor. This may be due to the definition of control tests used in this study and the variety of studies reviewed; for example, it is not always a fair control to compare to bare metal …

Statistical design

The first step in designing a study of a change to a tribological system (such as a change in speed, load, material or coating) is to decide what you are actually interested in. Modern testing machines measure everything possible to give the user more understanding of what is going on, and post-test analyses are also often used. This extra information can be useful, but it leads to a problem. If the experimenter is biased in any way towards a particular outcome, or even just towards positive results, …
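
The snippet above is cut short, but the many-outcomes problem it raises is easy to quantify. If m independent outcome measures are each tested at α = 0.05 on a system with no real effect, the chance that at least one appears significant is 1 − (1 − α)^m. A minimal sketch of this, our illustration:

    # Family-wise false positive rate when m independent outcome
    # measures (friction, wear volume, roughness, temperature, ...)
    # are each tested at alpha = 0.05 and there is no real effect.
    alpha = 0.05
    for m in (1, 3, 5, 10, 20):
        p_any = 1 - (1 - alpha) ** m
        print(f"{m:2d} outcomes tested: P(at least one 'significant') = {p_any:.2f}")

With ten outcomes the chance of at least one spurious 'finding' already exceeds 40%, which is why deciding on the outcome of interest in advance matters.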

Conclusions

We intend to change the way researchers think about these experiments. While it is true that samples of mild steel contain less variation than a similar number of human subjects, there is randomness and contamination present in every sample and procedure, and within a tribological system there is ample opportunity for these random variations to influence macro-scale results.

This study has shown that current research practices in the field do not take this into account; instead, these systems are …

Acknowledgements

This work was funded in part by EPSRC Grant: EP/R001766/1.

References (26)

  • R.M. Kaplan et al., Likelihood of null effects of large NHLBI clinical trials has increased over time, PLoS One (2015)

  • J.P. Simmons et al., False-positive psychology, Psychol. Sci. (2011)

  • ASTM G133-05, Standard Test Method for Linearly Reciprocating Ball-on-Flat Sliding Wear, ASTM (2015)