Neural Networks, Volume 44, August 2013, Pages 44-50

Convergence rate of the semi-supervised greedy algorithm

https://doi.org/10.1016/j.neunet.2013.03.001

Abstract

This paper proposes a new greedy algorithm that combines semi-supervised learning with sparse representation in data-dependent hypothesis spaces. The proposed algorithm uses only a small portion of the labeled and unlabeled data to represent the target function, and thus efficiently reduces the computational burden of semi-supervised learning. We establish an estimate of the generalization error based on empirical covering numbers. A detailed analysis shows that the error decays as $O(n^{-1})$. Our theoretical result illustrates that unlabeled data can improve the learning performance under mild conditions.

Introduction

Semi-supervised learning, i.e., learning from a set of labeled and unlabeled data, has attracted much attention recently; its main challenge is how to improve prediction performance using a few labeled data together with a large set of unlabeled data. In the literature, semi-supervised learning algorithms have been proposed from different perspectives. Examples include graph-based learning (Belkin and Niyogi, 2004, Belkin et al., 2006, Chen et al., 2009, Johnson and Zhang, 2007, Johnson and Zhang, 2008), co-training (Blum and Mitchell, 1998, Sindhwani et al., 2005, Sindhwani and Rosenberg, 2008) and many others. Reviews of semi-supervised learning are given in Chapelle, Schölkopf, and Zien (2006) and Zhu (2005).

Among the methods proposed for semi-supervised learning, a family can be unified under a Tikhonov regularization scheme in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ with a Mercer kernel $K$, e.g., Belkin and Niyogi (2004), Sindhwani et al. (2005) and Sindhwani and Rosenberg (2008). For the labeled data $\{(x_i,y_i)\}_{i=1}^{n}$ and the unlabeled data $\{x_i\}_{i=n+1}^{n+m}$, the solution of the regularization framework usually has the expression $\sum_{i=1}^{n+m}\alpha_i K(x_i,\cdot)$, $\alpha_i\in\mathbb{R}$.
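To make the computational issue discussed below concrete, here is a minimal sketch (ours, not from the paper) of one member of this family, Laplacian regularized least squares in the closed form of Belkin et al., assuming a Gaussian kernel and a k-nearest-neighbour graph Laplacian; the point is that the expansion ranges over all $n+m$ points, so an $(n+m)\times(n+m)$ kernel matrix must be built and solved.

```python
import numpy as np

def gaussian_kernel(A, B, mu=0.5):
    # K(x, t) = exp(-|x - t|^2 / (2 * mu)); bandwidth parametrization is an assumption
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * mu))

def laprls_coefficients(X_lab, y, X_unlab, lam_A=1e-2, lam_I=1e-2, k=5):
    """Coefficients alpha of f = sum_i alpha_i K(x_i, .) over ALL n + m points
    (Laplacian RLS, one example of the Tikhonov family described above).
    X_lab, X_unlab are arrays of shape (n, d) and (m, d)."""
    n, m = len(X_lab), len(X_unlab)
    X = np.vstack([X_lab, X_unlab])                # all n + m inputs
    K = gaussian_kernel(X, X)                      # (n+m) x (n+m) kernel matrix
    # k-NN graph Laplacian on the full sample (the graph/manifold term)
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    W = np.zeros_like(d2)
    nbrs = np.argsort(d2, axis=1)[:, 1:k + 1]      # k nearest neighbours, excluding self
    for i, js in enumerate(nbrs):
        W[i, js] = 1.0
    W = np.maximum(W, W.T)                         # symmetrize the adjacency
    L = np.diag(W.sum(axis=1)) - W                 # unnormalized graph Laplacian
    J = np.diag(np.r_[np.ones(n), np.zeros(m)])    # selects the labeled rows
    y_ext = np.r_[np.asarray(y, float), np.zeros(m)]
    # closed form: (J K + lam_A n I + lam_I n / (n+m)^2 L K) alpha = y_ext
    A = J @ K + lam_A * n * np.eye(n + m) + lam_I * n / (n + m) ** 2 * (L @ K)
    return np.linalg.solve(A, y_ext)
```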

Semi-supervised algorithms search for coefficients $\{\alpha_i\}$ that yield good prediction performance. Although these methods perform well in empirical evaluations (Belkin and Niyogi, 2004, Sindhwani et al., 2005, Sindhwani and Rosenberg, 2008), two issues remain to be addressed in theory:

  • Computational difficulty. Because the regularized framework generally uses kernel expansions over all labeled and unlabeled data, computation becomes a serious problem when the set of unlabeled data is huge in real applications.

  • Manifold assumption. In many graph-based methods such as Belkin and Niyogi (2004), Sindhwani et al. (2005) and Sindhwani and Rosenberg (2008), it is assumed that the high-dimensional data lie on a low-dimensional manifold. However, for many types of data, convincing evidence of such a manifold structure is not available (Fan, Gu, Qiao, & Zhang, 2011).

To address these issues, previous work has attempted to realize sparse semi-supervised learning (Fan et al., 2011, Sun and Shawe-Taylor, 2010, Tsang and Kwok, 2007), but its limitation is that the unlabeled data are used only to construct an additional sparse regularization term.

In this paper, we investigate the sparse representation of semi-supervised learning without the manifold assumption, and consider the sparsity of semi-supervised learning in data-dependent hypothesis spaces. Inspired by the greedy algorithms in Barron, Cohen, Dahmen, and DeVore (2008), Nair, Choudhury, and Keane (2007) and Zhang, 2002, Zhang, 2009, we propose a new sparse greedy algorithm for semi-supervised learning. Theoretical analysis shows that the proposed algorithm realizes sparse learning efficiently. The main contributions of this work are highlighted below:

  • Our method integrates three machine learning techniques in a coherent way: sparse semi-supervised learning (Fan et al., 2011, Sun and Shawe-Taylor, 2010, Tsang and Kwok, 2007), the greedy algorithm (Nair et al., 2007, Zhang, 2002, Zhang, 2009), and error analysis in data-dependent hypothesis spaces (Shi et al., 2011, Sun and Wu, 2011, Wu and Zhou, 2008, Xiao and Zhou, 2010). We also show how to use them to design and analyze a new semi-supervised algorithm.

  • Generalization error bounds are derived for nonsymmetric and indefinite kernels. The theoretical results quantify the relative value of labeled and unlabeled data for achieving fast learning rates. In particular, we illustrate that the role of the unlabeled data is twofold. First, the semi-supervised method can achieve fast learning rates by using additional unlabeled data. Second, the learning rates essentially depend on the number of labeled data even if the number of unlabeled data tends to infinity. Furthermore, our error analysis relies on weaker conditions than previous methods based on density or manifold assumptions in Belkin et al. (2006), Belkin and Niyogi (2004), Chen and Li (2009), Chen et al. (2009), Chen, Li, and Peng (2010), Johnson and Zhang, 2007, Johnson and Zhang, 2008 and Rigollet (2007).

  • Even in supervised learning settings, we achieve faster learning rates than the previous results in Xiao and Zhou (2010), Shi et al. (2011) and Sun and Wu (2011). In particular, our analysis does not require the interior cone condition used in Shi et al. (2011) and Xiao and Zhou (2010).

The organization of this paper is as follows. Section 2 provides the necessary background on semi-supervised learning and then presents the sparse semi-supervised greedy algorithm. Section 3 contains the main result on error analysis, whose proof is given in Section 4. An empirical study is presented in Section 5. We conclude the paper in Section 6.


The sparse semi-supervised greedy algorithm

Let the input space $X\subset\mathbb{R}^d$ be a compact subset and $Y=[-M,M]$. In the semi-supervised model, a learner obtains a labeled data set $\mathbf{z}=\{(x_i,y_i)\}_{i=1}^{n}$ and an unlabeled data set $\mathbf{x}=\{x_{n+j}\}_{j=1}^{m}$. Here, the labeled examples $(x_i,y_i)\in Z=X\times Y$, $1\le i\le n$, are independent copies of the random element $(x,y)$ with distribution $\rho$ on $Z$. The unlabeled data $x_{n+j}$, $1\le j\le m$, are independent copies of $x$, whose distribution (the marginal distribution of $\rho$) is denoted by $\rho_X$. The learning goal is to pick a function $f:X\to Y$ that minimizes the generalization error $\mathcal{E}(f)=\int_Z (f(x)-y)^2\,d\rho$.
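The remainder of this section specifies the algorithm in full; the sketch below is our own illustration of the orthogonal-greedy idea it builds on, not the paper's exact procedure. It assumes a dictionary of kernel atoms centred at all labeled and unlabeled points, normalized by their empirical $L^2_{\rho_X}$ norm, with atoms selected greedily and refitted on the labeled data; step sizes, stopping rules and the truncation $\pi(\cdot)$ to $[-M,M]$ used in the analysis are omitted.

```python
import numpy as np

def ssg_orthogonal_greedy(X_lab, y, X_unlab, kernel, k=10):
    """Sketch of a sparse semi-supervised greedy fit (illustrative only):
    orthogonal greedy selection over kernel atoms centred at ALL labeled and
    unlabeled points, refitted on the labeled data at each step."""
    centers = np.vstack([X_lab, X_unlab])        # dictionary centres, n + m of them
    G = kernel(X_lab, centers)                   # atom values at the labeled inputs
    # empirical normalization K_u / ||K_u||_{L^2(rho_X)}, estimated on the full sample
    norms = np.sqrt(np.mean(kernel(centers, centers) ** 2, axis=0))
    G = G / norms[None, :]
    y = np.asarray(y, float)
    chosen, residual = [], y.copy()
    for _ in range(k):
        scores = np.abs(G.T @ residual)          # correlation with the current residual
        if chosen:
            scores[chosen] = -np.inf             # never reselect an atom
        chosen.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(G[:, chosen], y, rcond=None)   # orthogonal refit
        residual = y - G[:, chosen] @ coef
    # coefficients for the unnormalized atoms: f = sum_j (coef_j / ||K_{u_j}||) K(u_j, .)
    return centers[chosen], coef / norms[chosen]
```

Only the k selected centres enter the final expansion, which is the sparsity that avoids the computational burden discussed in the Introduction.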

Main result

Now we introduce a data-free function space similar to those in Shi et al. (2011) and Xiao and Zhou (2010).

Definition 2

Define the data-free function space $\mathcal{H}_1=\{f: f=\sum_{j=1}^{\infty}\alpha_j \tilde{K}_{u_j},\ \{\alpha_j\}\in\ell^1,\ \{u_j\}\subset X\}$, where $\tilde{K}_{u_j}=K_{u_j}/\|K_{u_j}\|_{L^2_{\rho_X}}$, equipped with the norm $\|f\|_{\mathcal{H}_1}=\inf\{\sum_{j=1}^{\infty}|\alpha_j| : f=\sum_{j=1}^{\infty}\alpha_j \tilde{K}_{u_j}\}$.
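As a remark of ours (not in the original text), any finite greedy expansion automatically belongs to this space: if $f=\sum_{j=1}^{k}\alpha_j \tilde{K}_{u_j}$ for some finite $k$, then $f\in\mathcal{H}_1$ with $\|f\|_{\mathcal{H}_1}\le\sum_{j=1}^{k}|\alpha_j|$, since the norm is an infimum over all admissible representations of $f$.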

In order to investigate the approximation of $\pi(\hat{f}_k)$ to $f_\rho$, we introduce the regularizing function $f_\lambda=\arg\min_{f\in\mathcal{H}_1}\{\mathcal{E}(f)+\lambda\|f\|_{\mathcal{H}_1}\}$, where $\lambda>0$ is a regularization parameter.

The regularizing error can be expressed as $\mathcal{D}(\lambda)=\inf_{f\in\mathcal{H}_1}\{\mathcal{E}(f)-\mathcal{E}(f_\rho)+\lambda\|f\|_{\mathcal{H}_1}\}$.

The decay of $\mathcal{D}(\lambda)$ as $\lambda\to 0$ characterizes how well $f_\rho$ can be approximated in $\mathcal{H}_1$.

Error analysis

In this section, we prove Theorem 1 by bounding the sample and hypothesis errors from above. The sample error is bounded using standard error analysis techniques and empirical covering numbers. The hypothesis error is bounded using the theoretical analysis of the greedy algorithm presented in Barron et al. (2008).
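Schematically (our paraphrase; the exact splitting used in the proof may differ), the excess risk is decomposed around the regularizing function $f_\lambda$ as

$\mathcal{E}(\pi(\hat{f}_k))-\mathcal{E}(f_\rho)=\big[\mathcal{E}(\pi(\hat{f}_k))-\mathcal{E}_{\mathbf{z}}(\pi(\hat{f}_k))\big]+\big[\mathcal{E}_{\mathbf{z}}(\pi(\hat{f}_k))-\mathcal{E}_{\mathbf{z}}(f_\lambda)\big]+\big[\mathcal{E}_{\mathbf{z}}(f_\lambda)-\mathcal{E}(f_\lambda)\big]+\big[\mathcal{E}(f_\lambda)-\mathcal{E}(f_\rho)\big],$

where the first and third brackets form the sample error (controlled by empirical covering numbers), the second bracket is the hypothesis error (controlled by the greedy-approximation bound of Barron et al., 2008), and the last bracket is at most the regularizing error $\mathcal{D}(\lambda)$.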

An empirical study

Our theoretical analysis shows that the semi-supervised greedy algorithm (SSG) efficiently achieves fast learning rates for regression. In this section, we compare our method with the least square regularized regression (LSR) algorithm in an RKHS.

The least square regularized regression algorithm has been extensively studied in learning theory (Cucker & Zhou, 2007) and can be formulated as $f_{\mathbf{z}}=\arg\min_{f\in\mathcal{H}_K}\{\mathcal{E}_{\mathbf{z}}(f)+\lambda\|f\|_K^2\}$.
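For reference (a generic sketch, not tied to the paper's experimental code), LSR has the well-known closed form $\alpha=(K+\lambda nI)^{-1}y$ for the coefficients of $f_{\mathbf{z}}=\sum_{i=1}^{n}\alpha_i K(x_i,\cdot)$:

```python
import numpy as np

def lsr_fit(X, y, kernel, lam=1e-3):
    """Least square regularized regression in the RKHS H_K (kernel ridge):
    minimizes E_z(f) + lam * ||f||_K^2 and returns a callable f_z."""
    n = len(X)
    K = kernel(X, X)                                   # n x n Gram matrix on the labeled data
    alpha = np.linalg.solve(K + lam * n * np.eye(n), np.asarray(y, float))
    return lambda T: kernel(T, X) @ alpha              # f_z(t) = sum_i alpha_i K(t, x_i)
```

Unlike the greedy method, this expansion uses all n labeled points rather than a small set of selected centres.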

We consider $X=[0,1]$ and the Gaussian kernel $K(x,t)=\exp\bigl(-(x-t)^2/(2\mu)\bigr)$ with width parameter $\mu$.

Conclusion and discussion

This paper has introduced a sparse semi-supervised method that learns regression functions from samples via the orthogonal greedy algorithm. Fast learning rates were derived under mild assumptions. The requirement that the kernel be symmetric or positive semi-definite and the interior cone condition on X (see Shi et al., 2011) are both dropped in this paper. Several extensions of this method are discussed below:

  • 1. Semi-supervised learning based on other greedy algorithms: The proposed method

Acknowledgments

The authors would like to thank the reviewers for their valuable comments and suggestions. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 11001092, No. 11071058, and No. 11226304; by the Fundamental Research Funds for the Central Universities (Program No. 2011PY130); and by the Macau Science and Technology Development Fund (FDCT) under Grant 017/2012/A1 and the Research Committee at University of Macau under grants

References (29)

  • Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In 11th annual conference on...
  • Chapelle, O., et al. (2006). Semi-supervised learning.
  • Chen, H., et al. (2009). Semi-supervised multi-category classification with imperfect model. IEEE Transactions on Neural Networks.
  • Chen, D. R., et al. (2004). Support vector machine soft margin classifiers: error analysis. Journal of Machine Learning Research.