Neural Networks, Volume 44, August 2013, Pages 44-50

Convergence rate of the semi-supervised greedy algorithm

https://doi.org/10.1016/j.neunet.2013.03.001

Abstract

This paper proposes a new greedy algorithm that combines semi-supervised learning with sparse representation in data-dependent hypothesis spaces. The proposed algorithm uses only a small portion of the labeled and unlabeled data to represent the target function, and thus efficiently reduces the computational burden of semi-supervised learning. We establish an estimate of the generalization error based on empirical covering numbers. A detailed analysis shows that the error decays as $O(n^{-1})$. Our theoretical result illustrates that unlabeled data can improve the learning performance under mild conditions.

Introduction

Semi-supervised learning, i.e., learning from a set of labeled and unlabeled data, has attracted much attention recently; its main challenge is how to improve prediction performance using a few labeled data together with a large set of unlabeled data. In the literature, semi-supervised learning algorithms have been proposed from different perspectives. Examples include graph-based learning (Belkin and Niyogi, 2004, Belkin et al., 2006, Chen et al., 2009, Johnson and Zhang, 2007, Johnson and Zhang, 2008), co-training (Blum and Mitchell, 1998, Sindhwani et al., 2005, Sindhwani and Rosenberg, 2008) and many others. Reviews of semi-supervised learning are given in Chapelle, Schölkopf, and Zien (2006) and Zhu (2005).

Among the methods proposed for semi-supervised learning, a family can be unified under a Tikhonov regularization scheme in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ with a Mercer kernel $K$, e.g., Belkin and Niyogi (2004), Sindhwani et al. (2005) and Sindhwani and Rosenberg (2008). For the labeled data $\{(x_i,y_i)\}_{i=1}^{n}$ and the unlabeled data $\{x_i\}_{i=n+1}^{n+m}$, the solution of the regularization framework usually has the expression $\sum_{i=1}^{n+m}\alpha_i K(x_i,\cdot)$, $\alpha_i\in\mathbb{R}$.
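To make the computational issue discussed below concrete, here is a minimal sketch (ours, not from the paper) of one member of this family, Laplacian regularized least squares in the closed form of Belkin et al., assuming a Gaussian kernel and a k-nearest-neighbour graph Laplacian; the point is that the expansion ranges over all $n+m$ points, so an $(n+m)\times(n+m)$ kernel matrix must be built and solved.

```python
import numpy as np

def gaussian_kernel(A, B, mu=0.5):
    # K(x, t) = exp(-|x - t|^2 / (2 * mu)); bandwidth parametrization is an assumption
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * mu))

def laprls_coefficients(X_lab, y, X_unlab, lam_A=1e-2, lam_I=1e-2, k=5):
    """Coefficients alpha of f = sum_i alpha_i K(x_i, .) over ALL n + m points
    (Laplacian RLS, one example of the Tikhonov family described above).
    X_lab, X_unlab are arrays of shape (n, d) and (m, d)."""
    n, m = len(X_lab), len(X_unlab)
    X = np.vstack([X_lab, X_unlab])                # all n + m inputs
    K = gaussian_kernel(X, X)                      # (n+m) x (n+m) kernel matrix
    # k-NN graph Laplacian on the full sample (the graph/manifold term)
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    W = np.zeros_like(d2)
    nbrs = np.argsort(d2, axis=1)[:, 1:k + 1]      # k nearest neighbours, excluding self
    for i, js in enumerate(nbrs):
        W[i, js] = 1.0
    W = np.maximum(W, W.T)                         # symmetrize the adjacency
    L = np.diag(W.sum(axis=1)) - W                 # unnormalized graph Laplacian
    J = np.diag(np.r_[np.ones(n), np.zeros(m)])    # selects the labeled rows
    y_ext = np.r_[np.asarray(y, float), np.zeros(m)]
    # closed form: (J K + lam_A n I + lam_I n / (n+m)^2 L K) alpha = y_ext
    A = J @ K + lam_A * n * np.eye(n + m) + lam_I * n / (n + m) ** 2 * (L @ K)
    return np.linalg.solve(A, y_ext)
```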

Semi-supervised algorithms search for coefficients $\{\alpha_i\}$ that yield good prediction performance. Although these methods perform well in empirical evaluations (Belkin and Niyogi, 2004, Sindhwani et al., 2005, Sindhwani and Rosenberg, 2008), two issues remain to be addressed in theory:

  • Computational difficulty. Because the regularized framework generally uses kernel expansions over all labeled and unlabeled data, computation becomes a serious problem when the set of unlabeled data is huge in real applications.

  • Manifold assumption. In many graph-based methods such as Belkin and Niyogi (2004), Sindhwani et al. (2005) and Sindhwani and Rosenberg (2008), it is assumed that the high-dimensional data lie on a low-dimensional manifold. However, for many types of data, convincing evidence of such a manifold structure is not available (Fan, Gu, Qiao, & Zhang, 2011).

To address these issues, previous work has attempted to realize sparse semi-supervised learning (Fan et al., 2011, Sun and Shawe-Taylor, 2010, Tsang and Kwok, 2007), but its limitation is that the unlabeled data are used only to construct an additional sparse regularization term.

In this paper, we investigate the sparse representation of semi-supervised learning without the manifold assumption, and consider the sparsity of semi-supervised learning in data-dependent hypothesis spaces. Inspired by the greedy algorithms in Barron, Cohen, Dahmen, and DeVore (2008), Nair, Choudhury, and Keane (2007) and Zhang, 2002, Zhang, 2009, we propose a new sparse greedy algorithm for semi-supervised learning. Theoretical analysis shows that the proposed algorithm realizes sparse learning efficiently. The main contributions of this work are highlighted below:

  • Our method integrates three machine learning techniques in a coherent way: sparse semi-supervised learning (Fan et al., 2011, Sun and Shawe-Taylor, 2010, Tsang and Kwok, 2007), the greedy algorithm (Nair et al., 2007, Zhang, 2002, Zhang, 2009), and error analysis in data-dependent hypothesis spaces (Shi et al., 2011, Sun and Wu, 2011, Wu and Zhou, 2008, Xiao and Zhou, 2010). We also show how to use them to design and analyze a new semi-supervised algorithm.

  • Generalization error bounds are derived for nonsymmetric and indefinite kernels. The theoretical results quantify the relative value of labeled and unlabeled data for achieving fast learning rates. In particular, we illustrate that the role of the unlabeled data is twofold. First, the semi-supervised method can achieve fast learning rates by using additional unlabeled data. Second, the learning rates essentially depend on the number of labeled data even if the number of unlabeled data tends to infinity. Furthermore, our error analysis relies on weaker conditions than previous methods based on density or manifold assumptions in Belkin et al. (2006), Belkin and Niyogi (2004), Chen and Li (2009), Chen et al. (2009), Chen, Li, and Peng (2010), Johnson and Zhang, 2007, Johnson and Zhang, 2008 and Rigollet (2007).

  • Even in supervised learning settings, we achieve faster learning rates than the previous results in Xiao and Zhou (2010), Shi et al. (2011) and Sun and Wu (2011). In particular, our analysis does not require the interior cone condition used in Shi et al. (2011) and Xiao and Zhou (2010).

The organization of this paper is as follows. Section 2 provides the necessary background on semi-supervised learning and then presents the sparse semi-supervised greedy algorithm. Section 3 contains the main result on error analysis, whose proof is given in Section 4. An empirical study is presented in Section 5. We conclude the paper in Section 6.


The sparse semi-supervised greedy algorithm

Let the input space $X\subset\mathbb{R}^d$ be a compact subset and $Y=[-M,M]$. In the semi-supervised model, a learner obtains a labeled data set $\mathbf{z}=\{(x_i,y_i)\}_{i=1}^{n}$ and an unlabeled data set $\mathbf{x}=\{x_{n+j}\}_{j=1}^{m}$. Here, the labeled examples $(x_i,y_i)\in Z=X\times Y$, $1\le i\le n$, are independent copies of the random element $(x,y)$ with distribution $\rho$ on $Z$. The unlabeled data $x_{n+j}$, $1\le j\le m$, are independent copies of $x$, whose distribution (the marginal distribution of $\rho$) is denoted by $\rho_X$. The learning goal is to pick a function $f:X\to Y$ that minimizes the generalization error $\mathcal{E}(f)=\int_Z (f(x)-y)^2\,d\rho$.
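The remainder of this section specifies the algorithm in full; the sketch below is our own illustration of the orthogonal-greedy idea it builds on, not the paper's exact procedure. It assumes a dictionary of kernel atoms centred at all labeled and unlabeled points, normalized by their empirical $L^2_{\rho_X}$ norm, with atoms selected greedily and refitted on the labeled data; step sizes, stopping rules and the truncation $\pi(\cdot)$ to $[-M,M]$ used in the analysis are omitted.

```python
import numpy as np

def ssg_orthogonal_greedy(X_lab, y, X_unlab, kernel, k=10):
    """Sketch of a sparse semi-supervised greedy fit (illustrative only):
    orthogonal greedy selection over kernel atoms centred at ALL labeled and
    unlabeled points, refitted on the labeled data at each step."""
    centers = np.vstack([X_lab, X_unlab])        # dictionary centres, n + m of them
    G = kernel(X_lab, centers)                   # atom values at the labeled inputs
    # empirical normalization K_u / ||K_u||_{L^2(rho_X)}, estimated on the full sample
    norms = np.sqrt(np.mean(kernel(centers, centers) ** 2, axis=0))
    G = G / norms[None, :]
    y = np.asarray(y, float)
    chosen, residual = [], y.copy()
    for _ in range(k):
        scores = np.abs(G.T @ residual)          # correlation with the current residual
        if chosen:
            scores[chosen] = -np.inf             # never reselect an atom
        chosen.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(G[:, chosen], y, rcond=None)   # orthogonal refit
        residual = y - G[:, chosen] @ coef
    # coefficients for the unnormalized atoms: f = sum_j (coef_j / ||K_{u_j}||) K(u_j, .)
    return centers[chosen], coef / norms[chosen]
```

Only the k selected centres enter the final expansion, which is the sparsity that avoids the computational burden discussed in the Introduction.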

Main result

Now we introduce a data-free function space similar to those in Shi et al. (2011) and Xiao and Zhou (2010).

Definition 2

Define the data-free function space $\mathcal{H}_1=\{f: f=\sum_{j=1}^{\infty}\alpha_j \tilde{K}_{u_j},\ \{\alpha_j\}\in\ell^1,\ \{u_j\}\subset X\}$, where $\tilde{K}_{u_j}=K_{u_j}/\|K_{u_j}\|_{L^2_{\rho_X}}$, equipped with the norm $\|f\|_{\mathcal{H}_1}=\inf\{\sum_{j=1}^{\infty}|\alpha_j| : f=\sum_{j=1}^{\infty}\alpha_j \tilde{K}_{u_j}\}$.
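As a remark of ours (not in the original text), any finite greedy expansion automatically belongs to this space: if $f=\sum_{j=1}^{k}\alpha_j \tilde{K}_{u_j}$ for some finite $k$, then $f\in\mathcal{H}_1$ with $\|f\|_{\mathcal{H}_1}\le\sum_{j=1}^{k}|\alpha_j|$, since the norm is an infimum over all admissible representations of $f$.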

In order to investigate the approximation of $\pi(\hat{f}_k)$ to $f_\rho$, we introduce the regularizing function $f_\lambda=\arg\min_{f\in\mathcal{H}_1}\{\mathcal{E}(f)+\lambda\|f\|_{\mathcal{H}_1}\}$, where $\lambda>0$ is a regularization parameter.

The regularizing error can be expressed as $\mathcal{D}(\lambda)=\inf_{f\in\mathcal{H}_1}\{\mathcal{E}(f)-\mathcal{E}(f_\rho)+\lambda\|f\|_{\mathcal{H}_1}\}$.

The decay of $\mathcal{D}(\lambda)$ as $\lambda\to 0$ characterizes how well $f_\rho$ can be approximated in $\mathcal{H}_1$.

Error analysis

In this section, we prove Theorem 1 by bounding the sample and hypothesis errors from above. The sample error is bounded using standard error analysis techniques and empirical covering numbers. The hypothesis error is bounded using the theoretical analysis of the greedy algorithm presented in Barron et al. (2008).
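Schematically (our paraphrase; the exact splitting used in the proof may differ), the excess risk is decomposed around the regularizing function $f_\lambda$ as

$\mathcal{E}(\pi(\hat{f}_k))-\mathcal{E}(f_\rho)=\big[\mathcal{E}(\pi(\hat{f}_k))-\mathcal{E}_{\mathbf{z}}(\pi(\hat{f}_k))\big]+\big[\mathcal{E}_{\mathbf{z}}(\pi(\hat{f}_k))-\mathcal{E}_{\mathbf{z}}(f_\lambda)\big]+\big[\mathcal{E}_{\mathbf{z}}(f_\lambda)-\mathcal{E}(f_\lambda)\big]+\big[\mathcal{E}(f_\lambda)-\mathcal{E}(f_\rho)\big],$

where the first and third brackets form the sample error (controlled by empirical covering numbers), the second bracket is the hypothesis error (controlled by the greedy-approximation bound of Barron et al., 2008), and the last bracket is at most the regularizing error $\mathcal{D}(\lambda)$.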

An empirical study

Our theoretical analysis shows that the semi-supervised greedy algorithm (SSG) efficiently achieves fast learning rates for regression. In this section, we compare our method with the least square regularized regression (LSR) algorithm in an RKHS.

The least square regularized regression algorithm has been extensively studied in learning theory (Cucker & Zhou, 2007) and can be formulated as $f_{\mathbf{z}}=\arg\min_{f\in\mathcal{H}_K}\{\mathcal{E}_{\mathbf{z}}(f)+\lambda\|f\|_K^2\}$.
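For reference (a generic sketch, not tied to the paper's experimental code), LSR has the well-known closed form $\alpha=(K+\lambda nI)^{-1}y$ for the coefficients of $f_{\mathbf{z}}=\sum_{i=1}^{n}\alpha_i K(x_i,\cdot)$:

```python
import numpy as np

def lsr_fit(X, y, kernel, lam=1e-3):
    """Least square regularized regression in the RKHS H_K (kernel ridge):
    minimizes E_z(f) + lam * ||f||_K^2 and returns a callable f_z."""
    n = len(X)
    K = kernel(X, X)                                   # n x n Gram matrix on the labeled data
    alpha = np.linalg.solve(K + lam * n * np.eye(n), np.asarray(y, float))
    return lambda T: kernel(T, X) @ alpha              # f_z(t) = sum_i alpha_i K(t, x_i)
```

Unlike the greedy method, this expansion uses all n labeled points rather than a small set of selected centres.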

We consider $X=[0,1]$ and the Gaussian kernel $K(x,t)=\exp\bigl(-(x-t)^2/(2\mu)\bigr)$ with width parameter $\mu$.

Conclusion and discussion

This paper has introduced a sparse semi-supervised method that learns regression functions from samples via the orthogonal greedy algorithm. Fast learning rates were derived under mild assumptions. The requirement that the kernel be symmetric or positive semi-definite and the interior cone condition on X (see Shi et al., 2011) are both dropped in this paper. Several extensions of this method are discussed below:

  • 1. Semi-supervised learning based on other greedy algorithms: The proposed method

Acknowledgments

The authors would like to thank the reviewers for their valuable comments and suggestions. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 11001092, No. 11071058, and No. 11226304; by the Fundamental Research Funds for the Central Universities (Program No. 2011PY130); and by the Macau Science and Technology Development Fund (FDCT) under Grant 017/2012/A1 and the Research Committee at University of Macau under grants

References (29)

  • Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In 11th annual conference on...
  • Chapelle, O., et al. (2006). Semi-supervised learning.
  • Chen, H., et al. (2009). Semi-supervised multi-category classification with imperfect model. IEEE Transactions on Neural Networks.
  • Chen, D. R., et al. (2004). Support vector machine soft margin classifiers: error analysis. Journal of Machine Learning Research.