An analysis of the quality of experimental design and reliability of results in tribology research
Introduction
In recent years, several high-profile projects and publications have questioned the repeatability of scientific research in general. One notable replication study in psychology [1] aimed to replicate the results of one hundred papers published in highly regarded journals in the field; it failed to reproduce the original results for two thirds of the sample. Ioannidis [2] estimated the percentage of findings in the medical literature that are likely to be true, concluding that fewer than 50% are expected to be true for most experimental designs.
There are many reasons why erroneous results make it to publication. Most obviously, random chance can produce a result that seems important when no real effect is present. This is more likely when sample sizes are smaller, when effect sizes are smaller [3], when a greater number of relationships is tested with less pre-selection [4], and when statistical thresholds are too lenient [2], [5]. This problem is exacerbated by the bias to publish positive results and leave negative results unpublished [6], [7].
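The interaction of these factors can be illustrated with the positive-predictive-value argument used by Ioannidis [2]: of all findings that pass a significance test, the fraction that reflect real effects depends on the test threshold, the study power and the prior odds that a tested relationship is true. The sketch below is illustrative only; the example numbers for these three quantities are assumptions, not values from this study.

```python
# Sketch of the positive-predictive-value (PPV) argument from Ioannidis [2]:
# of all 'significant' findings, how many reflect real effects?
# The example numbers below (alpha, power, prior odds) are illustrative.

def ppv(alpha, power, prior_odds):
    """Probability that a statistically significant finding is true.

    alpha      -- false positive rate of the test (e.g. 0.05)
    power      -- chance of detecting a real effect (1 - beta)
    prior_odds -- ratio of true to false relationships tested (R)
    """
    true_positives = power * prior_odds   # real effects correctly detected
    false_positives = alpha * 1.0         # per unit of false relationships
    return true_positives / (true_positives + false_positives)

# A field testing mostly speculative relationships (1 true per 10 false)
# with underpowered studies (power = 0.4):
print(round(ppv(0.05, 0.4, 0.1), 2))  # -> 0.44: most positives are false
```

With well-powered studies of well-motivated hypotheses (e.g. power = 0.8, even prior odds) the same formula gives a PPV above 0.9, which is the case for raising power and pre-selecting hypotheses.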
Bias can also be introduced through the design of the experiment and the data analysis methods used. This is not always obvious; for example, if a nearly significant result is found, adding further repeats to the experiment to push for a significant result increases the false positive rate [8]. This and other seemingly 'borderline' methods, such as poor normalisation, have been used deliberately to show erroneous results, including that listening to children's music makes the listener younger (p = 0.03) [8].
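A minimal simulation makes this 'optional stopping' effect concrete. The group sizes, stopping rule and number of simulations below are illustrative assumptions; both groups are drawn from the same distribution, so every significant result is, by construction, a false positive.

```python
# Simulation of optional stopping [8]: run a two-sample t-test on
# n = 10 per group; if the result is 'nearly significant'
# (0.05 < p < 0.10), add 5 more samples per group and test again.
# Both groups come from the SAME distribution (no real effect).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, false_pos = 20_000, 0
for _ in range(n_sims):
    a, b = rng.normal(size=10), rng.normal(size=10)
    p = ttest_ind(a, b).pvalue
    if 0.05 < p < 0.10:  # 'push for significance' with extra repeats
        a = np.concatenate([a, rng.normal(size=5)])
        b = np.concatenate([b, rng.normal(size=5)])
        p = ttest_ind(a, b).pvalue
    false_pos += p < 0.05

print(f"false positive rate: {false_pos / n_sims:.3f}")  # above the nominal 0.05
```

The nominal rate for a single fixed-size test is 5%; the conditional retesting pushes the realised rate above that, even though each individual test looks legitimate.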
These problems have driven some fields of science to stricter rules for publishing. In physics and genomic studies the threshold p values for statistical tests have been lowered; the same case has been made for other areas of science [5]. Many medical journals now require pre-registration of a trial, including its data analysis techniques, before the trial starts, on the understanding that the trial will be published regardless of the outcome. Kaplan et al. [7] found that this reduced the proportion of positive results in one heart disease journal from 57% to 8%.
These problems have been particularly felt in fields which study complex systems, such as the human mind or body, using experimental methods. Tribological tests typically aim to investigate properties of, or directly compare, complex mechanical systems. The global responses of these systems can be altered by a wide variety of factors, many of which are ignored for practical reasons, cannot be fully controlled, cannot be measured, or are neglected to maintain applicability to a real system. Each of these factors has the potential to influence system-level results such as forces and wear rates. For this reason, results from tribological tests should be expected to show some random scatter, and the problems outlined above could be expected in the field of tribology.
There is strong evidence to suggest that typical tribological results show large random variation. For example, current standards for ball-on-flat wear testing [9] and friction testing of plastic sheets [10] give between-laboratory coefficients of variation (COV) of 49% and 18% (static) respectively, with in-laboratory values of 23% and 15%. At these levels of variation, if a wear test (COV = 23%) were repeated in the same laboratory, the two results would be expected to differ by more than 50% of the smaller value 23% of the time; if the tests are performed in different laboratories, this rises to 57%. Assuming the COV is accurate, this also means that for ball-on-flat wear testing a total of 21 tests would be required for a laboratory to estimate the mean worn volume with a 5% standard error. For more complicated systems, results should be expected to contain more variation.
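The sample-size figure quoted above follows from the standard error of the mean, SE = COV/√n (both expressed as fractions of the mean). A minimal sketch, assuming normally distributed results, reproduces the numbers:

```python
# Tests needed so the standard error of the mean falls to target_se,
# with both COV and target_se expressed as fractions of the mean:
#   SE = COV / sqrt(n)  =>  n = (COV / SE)**2
# Assumes independent, normally distributed test results.

def n_for_standard_error(cov, target_se):
    return round((cov / target_se) ** 2)

# In-laboratory ball-on-flat wear testing, COV = 23%, target SE = 5%:
print(n_for_standard_error(0.23, 0.05))  # -> 21, the figure quoted above

# Between laboratories the COV is 49%, so the requirement grows sharply:
print(n_for_standard_error(0.49, 0.05))  # -> 96
```

The quadratic dependence on COV is the practical point: doubling the scatter quadruples the number of repeats needed for the same precision.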
In light of the problems outlined above, the aim of this study is to assess the state of experimental design in experimental tribology research. In particular, the prevalence of simple design fundamentals is assessed, together with estimates of study power (the chance that a study will find an effect if one is present) for the field. This is pursued through a large-scale review of the output of highly regarded journals in the field. The aim is not to critique individual papers or authors, but to assess the state of the research and propose improvements.
Criteria
The method used to evaluate the quality of each study is given in Table 1. Although other methods have been developed for grading work in tribology [11], these were not used, in favour of simpler and more objective measures. Grading methods also exist for studies in medicine (e.g. [12]); however, these are not used either, as many of their criteria are not relevant to this field.
Justification for criteria
A brief justification for each of the measures in Table 1 is given below. It should be noted that these criteria have
Training of assessors
In-person training was provided for assessors; however, after this initial training others joined the project. At a minimum, the aims of the project were described over e-mail, and the project coordinator (M Watson) was available throughout the process for consultation. All assessors received a formatted answer sheet and a clear description of each criterion with their list of works. Examples of each of these are given in the Additional material.
Results
Of the 379 unique papers that were assessed, 290 were experimental studies. Where an assessor recorded that a paper was not an experimental study, it was excluded from the following analysis; 'N/A' results are also excluded. Basic results for the prevalence of the criteria described above are shown in Fig. 1. As shown, the prevalence of the majority of the measures is below 10% in this sample; however, control tests (64%, n = 235), repeated tests (30%, n = 290) and giving
Discussion
The results outlined above show that the measures investigated are not well adopted as a whole. Prevalence is below 10% for the investigated factors apart from control tests, giving details of analysis methods, and repeating tests at least three times. For control tests, the agreement between assessors was very poor. This may be due to the definition of control tests used in this study and the variety of studies reviewed. For example, it is not always a fair control to compare to bare metal
Statistical design
The first step in designing a study of a change to a tribological system (such as a change in speed, load, material or coating) is to decide what you are actually interested in. Modern testing machines measure everything possible to give the user more understanding of what is going on, and post-test analyses are also often used. This extra information can be useful, but it leads to a problem: if the experimenter is biased in any way towards a particular outcome, or even just towards positive results,
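The risk of inspecting many measured quantities can be sketched with the family-wise error rate, 1 − (1 − α)^k, for k outcomes each tested at significance level α. This assumes the outcomes are independent; real measurement channels (friction, wear, temperature, vibration, ...) are usually correlated, so the numbers below are an illustrative upper bound, not a result from this study.

```python
# Family-wise false positive rate when k independent outcomes are each
# tested at alpha = 0.05 and ANY significant one would be reported.
# Real test-rig channels are correlated, so this is an upper-bound sketch.
alpha = 0.05
for k in (1, 3, 5, 10, 20):  # number of outcomes inspected
    family_wise = 1 - (1 - alpha) ** k
    print(f"{k:2d} outcomes -> {family_wise:.0%} chance of a false positive")
```

With ten logged channels, the chance of at least one spurious 'significant' effect is roughly 40% even when nothing real has changed, which is why the quantity of interest should be chosen before the test.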
Conclusions
We intend to change the way researchers think about these experiments. While it is true that samples of mild steel contain less variation than a similar number of human subjects, there is randomness and contamination present in every sample and procedure. Within a tribological system there is ample opportunity for these random changes to influence macro-scale results.
This study has shown that current research practices in the field do not take this into account, instead these systems are
Acknowledgements
This work was funded in part by EPSRC Grant: EP/R001766/1.
References (26)
- et al., Surface properties of eroded human primary and permanent enamel and the possible remineralization influence of CPP-ACP, Wear (2017)
- et al., Using a patterned microtexture to reduce polyethylene wear in metal-on-polyethylene prosthetic bearing couples, Wear (2017)
- et al., The effect of surface treatment on the friction and wear behavior of dental Y-TZP ceramic against human enamel, Tribol. Int. (2017)
- et al., The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials, Lancet (2001)
- Estimating the reproducibility of psychological science, Science (2015)
- Why most published research findings are false, PLoS Med. (2005)
- et al., Power failure: why small sample size undermines the reliability of neuroscience, Nat. Rev. Neurosci. (2013)
- et al., The garden of forking paths: why multiple comparisons can be a problem, even when there is no fishing expedition or p-hacking and the research hypothesis was posited ahead of time, Psychol. Bull. (2014)
- D.J. Benjamin, J.O. Berger, M. Johannesson, B.A. Nosek, E.-J. Wagenmakers, R. Berk, K.A. Bollen, B. Brembs, L. Brown, ...
- et al., Publication bias: what are the challenges and can they be overcome?, J. Psychiatry Neurosci. (2012)
- Likelihood of null effects of large NHLBI clinical trials has increased over time, PLoS One
- False-positive psychology, Psychol. Sci.
- ASTM G133-05, Standard Test Method for Linearly Reciprocating Ball-on-Flat Sliding Wear