MSclassifR: an R Package for Supervised Classification of Mass Spectra with Machine Learning Methods

Alexandre Godmer; Yahia Benzerara; Emmanuelle Varon; Nicolas Veziris; Karen Druart; Renaud Mozet; Mariette Matondo; Alexandra Aubry; Quentin Giai Gianetto

doi:10.1101/2022.03.14.484252

Abstract

MSclassifR is an R package designed to improve the classification of mass spectra obtained from MALDI-TOF mass spectrometry. It offers a comprehensive range of functions for processing mass spectra, identifying discriminant m/z values, and making accurate predictions. The package introduces innovative algorithms for selecting discriminant m/z values and making predictions. To assess the effectiveness of these methods, extensive tests were conducted on challenging real datasets, including bacterial subspecies of the Mycobacterium abscessus complex, virulent and avirulent phenotypes of Escherichia coli, different species of Streptococci, and nasal swabs from individuals infected and uninfected with SARS-CoV-2. Additionally, multiple datasets of varying sizes were created from these real datasets to evaluate the robustness of the algorithms. The results demonstrated that the Machine Learning-based pipelines in MSclassifR achieved high accuracy and Kappa values. On an in-house dataset, some pipelines achieved more than 95% mean accuracy, whereas a commercial system achieved only 62% mean accuracy. Certain methods were more robust to changes in dataset size when constructing Machine Learning-based pipelines. These simulations also helped determine the minimum training set sizes required to obtain reliable results. The package is freely available online, and its open-source nature encourages collaborative development and customization and fosters innovation within the community focused on improving diagnosis based on MALDI-TOF spectra.

Key points

MSclassifR is a comprehensive R package enabling the construction of data analysis pipelines for the precise classification of mass spectra.
Our R package contains an innovative method for variable selection from random forests, which delivered excellent results on real data.
In-depth analysis of various machine learning-based pipelines using our package allowed us to draw conclusions about the optimal m/z selection and prediction methods depending on the size of the training dataset.
Using a publicly available dataset of mass spectra obtained from various MALDI-TOF instruments across different countries, MSclassifR can build robust pipelines that adapt automatically to different instruments.
When tested on an in-house dataset, MSclassifR pipelines consistently outperformed commercial software in terms of prediction accuracy.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

The manuscript has been considerably improved. It incorporates many evaluations of different possible pipelines on challenging datasets and presents a new method of variable selection using variable importances from random forests.