Approximate multiple kernel learning with least-angle regression

doi:10.1016/j.neucom.2019.02.030

Neurocomputing

Volume 340, 7 May 2019, Pages 245-258

https://doi.org/10.1016/j.neucom.2019.02.030

Under a Creative Commons license

open access

Abstract

Kernel methods provide a principled approach to general data representations. Multiple kernel learning and kernel approximation are often treated as separate tasks, although considerable savings in time and memory are expected if the two are performed simultaneously.

Our proposed Mklaren algorithm selectively approximates multiple kernel matrices in regression. It uses Incomplete Cholesky Decomposition and Least-angle regression (LAR) to select basis functions, achieving linear complexity in both the number of data points and the number of kernels. Since it approximates kernel matrices rather than kernel functions, it allows an arbitrary set of kernels to be combined. In contrast to single kernel-based approximations, it selectively approximates different kernels in different regions of the input spaces.
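As a rough illustration of the low-rank building block named above, the following sketch computes a pivoted Incomplete Cholesky Decomposition of a single kernel matrix. It is not the Mklaren algorithm: it uses the standard greedy pivot rule (largest residual diagonal entry) in place of the LAR-based selection criterion, and it handles one kernel only. The names rbf_kernel, incomplete_cholesky, gamma and tol are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def incomplete_cholesky(X, kernel, rank, tol=1e-10):
    """Pivoted incomplete Cholesky: returns G (n x k, k <= rank) and the pivots.

    K is approximated by G @ G.T. Only the kernel diagonal and one kernel
    column per step are evaluated, so the full n x n matrix is never formed
    and the cost grows linearly with the number of data points n.
    """
    n = X.shape[0]
    # Residual diagonal, initialised to the diagonal of the kernel matrix.
    d = np.array([kernel(X[i:i + 1], X[i:i + 1])[0, 0] for i in range(n)])
    G = np.zeros((n, rank))
    pivots = []
    for j in range(rank):
        # Greedy pivot rule (an assumption of this sketch, not the LAR criterion).
        i = int(np.argmax(d))
        if d[i] <= tol:
            G = G[:, :j]  # residual is numerically zero: stop early
            break
        col = kernel(X, X[i:i + 1])[:, 0]  # kernel column K[:, i]
        G[:, j] = (col - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d = np.maximum(d - G[:, j] ** 2, 0.0)  # update residual diagonal
        pivots.append(i)
    return G, pivots
```

Calling incomplete_cholesky(X, rbf_kernel, rank=k) returns a factor G whose columns correspond to the selected pivot points (often called inducing points); a regression model can then be fitted on G instead of the full kernel matrix.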

The LAR criterion provides robust selection of inducing points in noisy settings and accurate modelling of regression functions in continuous and discrete input spaces. Among general kernel matrix decompositions, Mklaren achieves the minimal approximation rank required for performance comparable to using the exact kernel matrix, at a cost lower than 1% of the required operations. Finally, we demonstrate its scalability and interpretability in settings with millions of data points and thousands of kernels.

Keywords

Kernel methods

Kernel approximation

Multiple kernel learning

Least-angle regression

Martin Stražar received master's and doctoral degrees in computer science from the University of Ljubljana, Faculty of Computer and Information Science, in 2013 and 2018, respectively.

His research interests span scalable machine learning, data integration, kernel methods and Bayesian statistics. The main applications of the developed models are in bioinformatics: modelling with next-generation sequencing (NGS) data sets, protein-RNA interactions and single-cell RNA sequencing.

Dr. Stražar was the recipient of the 2012 iGEM Best Health and Medicine prize and the 2012 iGEM Best Modelling prize. He is one of the core contributors to the data mining software platforms Orange (https://orange.biolab.si) and Single cell Orange (https://singlecell.biolab.si).

Tomaž Curk received a doctoral degree in computer science from the University of Ljubljana, Faculty of Computer and Information Science in 2007.

His research interests include bioinformatics, machine learning and data integration, with applications in modelling RNA-seq data on gene expression and iCLIP and RBDmap data on protein-RNA interaction.

Dr. Curk has served as Vice-Dean for Research at the Faculty of Computer and Information Science since 2016. He is one of the initial contributors to the data mining software Orange (http://orange.biolab.si), the gene expression analysis software dictyExpress (http://dictyexpress.org), the gene interaction analysis software SNPsyn (http://snpsyn.biolab.si) and the iCount software for protein-RNA interaction analytics (https://github.com/tomazc/iCount).