gms | German Medical Science

66. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 12. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

26. - 30.09.2021, online

A new resampling strategy improves proximity estimation with the unsupervised random forest algorithm

Meeting Abstract

  • Cesaire Joris Kuete Fouodo - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
  • Silke Szymczak - Universität zu Lübeck, Lübeck, Germany
  • Inke R. König - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 66. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 12. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 26.-30.09.2021. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 139

doi: 10.3205/21gmds089, urn:nbn:de:0183-21gmds0898

Published: September 24, 2021

© 2021 Kuete Fouodo et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Random forests (RF) are fast and perform well in high-dimensional classification problems. In precision medicine, another use of large-scale data is to stratify individuals into homogeneous subgroups. In this unsupervised learning setting, unsupervised random forests (URF) can be used to compute dissimilarities between individuals [1], which then serve as input for clustering algorithms. The crucial step of URF is the generation of an artificial data set by resampling the original values of the individuals. The original and artificial data sets are combined, and the standard RF algorithm is trained to classify observations as original or artificial. The proximity between each pair of individuals is obtained by counting how often they end up in the same terminal node across the forest, and dissimilarities are derived from these proximities.
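The bookkeeping around the forest can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`make_synthetic`, `label_and_combine`, `proximity`, `dissimilarity`) are hypothetical, the resampling shown is plain independent sampling from each variable's observed values (one of the approaches described in [1], not the LDR strategy), the leaf assignments are made-up numbers standing in for the output of a trained forest, and the conversion sqrt(1 - proximity) is one common choice.

```python
import math
import random


def make_synthetic(rows, rng):
    """Resample each variable independently from its observed values.

    Sampling every column on its own (with replacement) keeps the marginal
    distributions but breaks the dependence between variables.
    """
    n_vars = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_vars)]
    return [[rng.choice(columns[j]) for j in range(n_vars)] for _ in rows]


def label_and_combine(rows, synthetic):
    """Stack original (label 1) and artificial (label 0) observations."""
    X = rows + synthetic
    y = [1] * len(rows) + [0] * len(synthetic)
    return X, y


def proximity(leaf_ids):
    """leaf_ids[t][i] is the terminal node of observation i in tree t.

    Returns, for each pair of observations, the fraction of trees in which
    both fall into the same terminal node.
    """
    n_trees, n = len(leaf_ids), len(leaf_ids[0])
    counts = [[0] * n for _ in range(n)]
    for tree in leaf_ids:
        for i in range(n):
            for j in range(n):
                if tree[i] == tree[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]


def dissimilarity(prox):
    """Convert proximities to dissimilarities as sqrt(1 - proximity)."""
    return [[math.sqrt(max(0.0, 1.0 - p)) for p in row] for row in prox]


rng = random.Random(0)
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # toy data, 3 individuals
synthetic = make_synthetic(rows, rng)
X, y = label_and_combine(rows, synthetic)  # input for any RF classifier

# Hypothetical terminal-node ids of the 3 original individuals in 4 trees;
# in practice these come from the forest trained on (X, y).
leaf_ids = [[0, 0, 1], [2, 2, 2], [5, 6, 6], [1, 1, 3]]
prox = proximity(leaf_ids)
diss = dissimilarity(prox)
```

The forest fit itself is deliberately left out; any RF implementation that reports the terminal node of each observation per tree can supply `leaf_ids`. The resulting `diss` matrix is what would be passed to a clustering algorithm.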

We review the resampling approaches proposed by Shi and Horvath [1], explain their limitations, and propose a new intuitive strategy based on the low-density regions (LDR) of the marginal distribution of each variable. We perform a simulation study to compare the different approaches. The results show that resampling original data from LDR improves the quality of the dissimilarities between individuals and leads to more homogeneous clusters.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Shi T, Horvath S. Unsupervised Learning with Random Forest Predictors. Journal of Computational and Graphical Statistics. 2006;15(1):118–38.