Abstract

We propose a novel online manifold regularization framework based on the notion of duality in constrained optimization. The Fenchel conjugate of hinge functions is the key to transferring manifold regularization from the offline to the online setting. Our algorithms are derived by gradient ascent on the dual function. For practical purposes, we propose two buffering strategies and two sparse approximations to reduce the computational complexity. Detailed experiments verify the utility of our approaches. An important conclusion is that our online MR algorithms can handle settings where the target hypothesis is not fixed but drifts with the sequence of examples. We also recap and draw connections to earlier work. This paper paves the way for the design and analysis of online manifold regularization algorithms.

1. Introduction

Semisupervised learning (S2L) of different classifiers is an important problem in machine learning with interesting theoretical properties and practical applications [15]. Different from standard supervised learning (SL), the S2L paradigm learns from both labeled and unlabeled examples. In this paper, we investigate online semisupervised learning (OS2L) problems, which have the following three features: (i) data is abundant but the resources to label it are limited; (ii) data arrives in a stream and cannot all be stored; (iii) no statistical assumptions are made, which means that the underlying distribution can change over time.

OS2L algorithms take place in a sequence of consecutive rounds. On each round, the learner is given a training example and is required to predict its label if the example is unlabeled. To label the examples, the learner uses a prediction mechanism, which builds a mapping from the set of examples to the set of labels. The quality of an OS2L algorithm is measured by the cumulative loss it makes along its run. The challenge of OS2L is that we do not observe the true labels of unlabeled examples and so cannot evaluate the performance of the prediction mechanism directly. Thus, if we want to update the prediction mechanism, we have to rely on indirect forms of feedback.

Many OS2L algorithms have been proposed in recent years. A popular idea [5, 6] is to use a heuristic method to greedily label the unlabeled examples, which essentially still employs an online supervised learning framework. References [7–9] instead treat the OS2L problem as online semisupervised clustering, in which there are must-link pairs (in the same cluster) and cannot-link pairs (not in the same cluster), but the effectiveness of these methods is often influenced by “bridge points” (see a survey in [10]).

To solve the OS2L problem, we introduce a novel online manifold regularization (MR) framework for the design and analysis of new online MR algorithms. Manifold regularization is a geometric framework for learning from examples. This form of regularization exploits the geometry of the probability distribution that generates the data and incorporates it as an additional regularization term. Hence, the objective function has two regularization terms: one controls the complexity of the classifier in the ambient space, and the other controls the complexity as measured by the geometry of the distribution.

Since decreasing the primal MR objective function is impossible before all the training examples have been obtained, we propose a Fenchel conjugate transform to optimize the dual problem in an online manner. Unfortunately, the basic online MR algorithms derived from our framework have to store all the incoming examples, and their per-round time complexity grows with the number of examples received so far. Therefore, we propose two buffering strategies and two sparse approximations to make our online MR algorithms practical. We also discuss the applicability of our framework to settings where the target hypothesis is not fixed but drifts with the sequence of examples.

To the best of our knowledge, the closest prior work is an empirical online version of manifold regularization of SVMs [11]. Their method defines an instantaneous regularized risk to avoid optimizing the primal MR problem directly. The learning process is based on convex programming with stochastic gradient descent in kernel space. The update scheme of this work can be derived from our online MR framework.

This paper is structured as follows. In Section 2 we begin with a primal view of semisupervised learning problem based on manifold regularization. In Section 3, our new framework for designing and analyzing online MR algorithms is introduced. Next, in Section 4, we derive new algorithms from our online MR framework by gradient ascent. In Section 5, we propose two sparse approximations for kernel representation to reduce computational complexity. Connections to earlier analysis techniques are in Section 6. Experiments and analyses are in Section 7. In Section 8, possible extensions of our work are given.

2. Problem Setting

Our notation and problem setting are formally introduced in this section. Italic lowercase letters refer to scalars, and bold letters refer to vectors. denotes the th training example, where is the point, is its label, and is a flag indicating whether the label can be seen. If , the example is labeled; if , the example is unlabeled. The hinge function is denoted by . denotes the inner product between two vectors. For any , the set of integers is denoted by .

Consider an input sequence , where and (). Let be a kernel over the training points and the corresponding reproducing kernel Hilbert space (RKHS). The S2L problem based on manifold regularization [12] can be written as minimizing (1), where is the RKHS norm of , is a loss function for the predictions of the training points, and are trade-off parameters, is a distance function that measures the difference between the predictions of and , and are edge weights that define a graph over the examples, for example, a fully connected graph with Gaussian weights or -NN binary weights.

In (1), the objective function is composed of three sums. The first sum measures the complexity of , the second measures the loss on the labeled examples, and the last one is the manifold regularizer, which encourages prediction smoothness over the graph, meaning that similar examples tend to have similar predictions.
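As an illustration of this three-term structure, the following sketch evaluates an offline MR objective of this type for a linear classifier with hinge loss, an absolute-difference graph penalty, and Gaussian edge weights; the symbol names (`w`, `gamma1`, `gamma2`, `sigma`) are our own illustrative choices, not necessarily the paper's notation.

```python
import numpy as np

def gaussian_weights(X, sigma=1.0):
    """Fully connected graph with Gaussian edge weights exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def mr_objective(w, X, y, labeled, gamma1=1.0, gamma2=1.0, sigma=1.0):
    """Offline MR objective with its three sums: complexity of w, hinge loss on the
    labeled points, and the manifold (graph) regularizer over all points."""
    f = X @ w                                    # predictions f(x_i) = <w, x_i>
    complexity = 0.5 * float(w @ w)              # complexity in the ambient space
    hinge = float(np.sum(np.maximum(0.0, 1.0 - y[labeled] * f[labeled])))
    W = gaussian_weights(X, sigma)               # edge weights of the example graph
    smoothness = float(np.sum(W * np.abs(f[:, None] - f[None, :])))
    return complexity + gamma1 * hinge + gamma2 * smoothness
```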

Denote that . Obviously, it is easy to seek using existing optimization tools after all the training examples have arrived, which is called offline MR. Different from offline methods, an online MR process is performed in a sequence of consecutive rounds. On each round, when an example arrives, the online MR algorithm is required to present its predicted label and to update its prediction mechanism so as to be more accurate later.

For simplicity and concreteness, we focus on semisupervised binary linear classifiers in this paper, which means that and the data labels belong to . The loss is chosen as a popular convex loss function in supervised classification, the hinge loss. The distance function is defined as an absolute function in this paper. Furthermore, (3) can be decomposed into two hinge functions (see Figure 1 for an illustration), as follows.
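In generic scalar notation (our symbols $a$ and $b$, not necessarily the paper's), the decomposition in question is the standard identity

```latex
% Absolute difference written as a sum of two hinge functions:
\[
  |a - b| \;=\; \max(0,\, a - b) \;+\; \max(0,\, b - a),
\]
% so the absolute difference between two predictions splits into two hinge terms.
```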

To learn a max-margin decision boundary, we can rewrite (1) as (5). Letting the edge weights and , we obtain a simplified version of (5), given in (6).

Minimizing (6) in an online manner is what we consider in the rest of this paper.

3. Online Manifold Regularization in the Dual Problem

In this section, we propose a unified online manifold regularization framework of semisupervised binary classification problems. Our presentation reveals how the S2L problem based on MR in Section 2 can be optimized in an online manner.

Before describing our framework, let us recall the definition of the Fenchel conjugate, which we use as our main analysis tool. The Fenchel conjugate of a function $f$ is defined as $f^\star(\boldsymbol{\theta}) = \sup_{\mathbf{w}} \left( \langle \mathbf{w}, \boldsymbol{\theta} \rangle - f(\mathbf{w}) \right)$. In particular, the Fenchel conjugate of hinge functions is the key to transferring manifold regularization from offline to online in this paper.
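As a concrete, standard scalar instance (stated in generic symbols of our own rather than the paper's notation), the conjugate of the hinge $g(a) = \max(0,\, \gamma - a)$ is

```latex
\[
  g^\star(\theta) \;=\;
  \begin{cases}
    \gamma\,\theta, & \text{if } -1 \le \theta \le 0,\\
    +\infty,        & \text{otherwise,}
  \end{cases}
\]
```

so the dual variable attached to each hinge term lives in a bounded interval, which is what makes coordinate-wise dual ascent feasible.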

Proposition 1. Let , where for all , , and . The Fenchel conjugate of is

Proof. We first rewrite as follows: where for all . Based on the definition of the Fenchel conjugate, we obtain that
The third equality above follows from the strong max-min property, which allows the max-min problem to be exchanged for a min-max problem. If , is ; otherwise, if , we have .

Returning to the primal problem, we want to obtain a sequence of boundaries that makes . In (6), decreasing the objective function directly is impossible before all the training examples have been received. In practice, we only have the example set on round , since the training examples arrive in a stream. To resolve this difficulty, we propose a Fenchel conjugate transform of S2L problems based on MR.

An equivalent problem of (6) is

Using the Lagrange dual function, we can rewrite (11) by introducing a vector group : Consider the dual function , where is the Fenchel conjugate of . The primal problem can then be described via the Fenchel conjugate transform as follows:

In (14), we can see that our goal has shifted from minimizing the primal problem to maximizing the dual function . In the following, we show how to ascend the dual function without the unobserved examples.

Based on Proposition 1, the Fenchel conjugate of is if , and for all , , ; otherwise, .

Since our goal is to maximize the dual function, we can restrict to the case that , and for all , , , where . has associated coefficients which are .

Based on the previous analysis, the dual function can be rewritten using a new coefficient vector ,

Our online MR task can thus be restated as ascending the dual function by updating the coefficient vector . Obviously, unobserved examples have no influence on the value of the dual function in (16) once their associated coefficients are set to zero.

Denote by the coefficient vector on round , whose elements can be written as . The update process of the coefficient vector on round should satisfy the following conditions: (i) if , ; (ii) .

The first condition means that the unobserved examples have no influence on the value of the dual function , and the second means that the value of the dual function never decreases along the online MR process. Therefore, the dual function on round can be written as

Based on Lemmas 2 and 3 in the appendix, each coefficient vector has an associated boundary vector . On round , the associated boundary vector of is

In a more general form, the associated boundary vector in (18) can also be written as in (19), where .

To summarize, we present a template online MR algorithm based on the dual ascending procedure in Algorithm 1.

INPUT: two positive scalars: and ; edge weights .
INITIALIZE: a coefficient vector and its associated decision boundary vector .
PROCESS: For
     Receive an example ,
     Choose a new coefficient vector that satisfies ,
     Return a new associated boundary vector in (18),
     If , predict .
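The following Python sketch illustrates the overall shape of this template loop under simplifying assumptions: predictions come from a linear boundary vector, and the two problem-specific ingredients (the coefficient update satisfying condition (ii), and the map from coefficients to the boundary vector in (18)/(19)) are passed in as abstract callables. All names are illustrative, not the paper's notation.

```python
import numpy as np

def template_online_mr(stream, choose_coefficients, boundary_from_coefficients, dim):
    """Template online MR loop (illustrative sketch, not the paper's exact pseudocode).

    stream: iterable of (x, y, labeled) triples; y is ignored when labeled is False.
    choose_coefficients(coeffs, examples): returns updated dual coefficients such
        that the value of the dual function never decreases.
    boundary_from_coefficients(coeffs, examples): maps coefficients and stored
        examples to the associated boundary vector.
    """
    examples, coeffs = [], []
    w = np.zeros(dim)                            # associated boundary vector
    predictions = []
    for x, y, labeled in stream:
        if not labeled:
            predictions.append(np.sign(w @ x))   # predict the unlabeled point's label
        examples.append((x, y, labeled))
        coeffs = choose_coefficients(coeffs, examples)
        w = boundary_from_coefficients(coeffs, examples)
    return w, predictions
```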

4. Deriving New Algorithms by Gradient Ascent

In the previous section, a template algorithmic framework for online MR was proposed based on the idea of ascending the dual function. In this section, we derive different online MR algorithms using different update schemes for the coefficient vector in the dual function.

Let denote a subset of dual coefficients, each element of which belongs to the coefficient vector . Our online MR algorithms simply perform a gradient ascent step over () on round that aims to increase the value of the dual function, as in (20), where and is a step size. We now propose three update schemes that modify different coefficients on each learning round. A generic sketch of such a projected gradient step is given below.
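As a generic illustration of one such step (the exact gradient of the dual is not reproduced here), the sketch below performs a finite-difference gradient ascent step on a chosen subset of coefficients and projects each of them back into a box constraint; the bounds `lo`/`hi` are placeholders for whatever interval the hinge conjugate imposes and are therefore an assumption of this sketch.

```python
import numpy as np

def ascend_subset(dual_fn, alpha, subset, eta, lo=0.0, hi=1.0, eps=1e-6):
    """One gradient-ascent step on the dual over a chosen subset of coefficients.
    Illustrative sketch: the partial derivatives are approximated numerically so the
    snippet stays agnostic about the exact dual formula."""
    alpha = np.asarray(alpha, dtype=float).copy()
    for i in subset:
        e = np.zeros_like(alpha)
        e[i] = eps
        grad_i = (dual_fn(alpha + e) - dual_fn(alpha - e)) / (2.0 * eps)  # finite difference
        alpha[i] = np.clip(alpha[i] + eta * grad_i, lo, hi)               # ascend, then project
    return alpha
```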

4.1. Example-Associate (EA) Update

In traditional online supervised learning, the prediction mechanism is usually updated using only the newly arrived example, as in the Perceptron. Based on this notion, we propose an example-associate update scheme that ascends the dual function by updating the associated coefficients of the new training example on round , which means .

In the online MR process, the coefficients do not need to be set to zero on round . From Proposition 1, we have already established that every element of the coefficient vector belongs to . Using a gradient ascent step as in (20), the example-associate update process can be written as

Equation (21) implies that if , we have , and otherwise . Equations (22) and (23) also imply that the gradient ascent must satisfy , and otherwise we do not perform a gradient ascent on .

Unfortunately, this update scheme does not work in practice, because it needs to store every input point to update the boundary vector, and its time complexity increases with every round. Here, we propose two buffering strategies that keep only a small buffer of examples on each learning round; a sketch of both strategies follows this paragraph. Denote that , and say that the example belongs to the buffer on round if . (i) Buffer-. Let the buffer size be . This strategy replaces the oldest point in the buffer with the new incoming point after each learning round, which means that . (ii) Buffer-. This strategy replaces the oldest unlabeled point in the buffer with the incoming point while keeping labeled points; the oldest labeled point is evicted from the buffer only when the buffer is filled with labeled points.
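The sketch below implements both strategies in Python under the obvious reading of the description above; the names `fifo` and `keep_labeled` are ours.

```python
def update_buffer(buffer, new_example, size, strategy="fifo"):
    """buffer: list of (x, y, labeled) triples, oldest first (illustrative sketch).

    "fifo" evicts the oldest point regardless of its label; "keep_labeled" evicts
    the oldest unlabeled point and drops a labeled point only when the buffer
    holds labeled points exclusively."""
    buffer = list(buffer) + [new_example]
    if len(buffer) <= size:
        return buffer
    if strategy == "fifo":
        evict = 0                                           # oldest point
    else:
        unlabeled = [i for i, (_, _, lab) in enumerate(buffer[:-1]) if not lab]
        evict = unlabeled[0] if unlabeled else 0            # oldest unlabeled, else oldest labeled
    del buffer[evict]
    return buffer
```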

Based on the previous analysis, the subset of dual coefficients can be chosen using the process in Algorithm 2.

PROCESS: ,
    If , .
    For each
      If , ;
      If , .
    Return .

Denote by the maximal step size on round . Since every dual coefficient belongs to , we have , , , . The optimal step size is given by (24), where is the Hessian of over . Then, we obtain that if , .

Combining (22) and (23), we have

We can also rewrite the update process in the form of (19) as follows: The new associated boundary vector is

Algorithm 3 shows an online MR algorithm based on EA update.

INPUT: two positive scalars: and ; edge weights .
INITIALIZE: a coefficient vector and its associated decision boundary vector .
PROCESS: For
     Receive an example ,
     Get using the process in Algorithm 2,
     Choose a step size ,
     Update the boundary vector using (27),
     If , predict ,
     Renew the buffer.

In particular, when choosing a small stationary step size on each learning round, we must have . In this case, the update process of the boundary vector can be written as This update scheme is called the -EA update.

For a more aggressive gradient ascent, we can choose , which is called the aggressive-EA update.

4.2. Overall Update

In the EA update scheme, we update the boundary vector additively by modifying only the associated coefficients of the new training example. In fact, the associated coefficients of all previously arrived examples can also be updated on each learning round. Here, we propose another update scheme, called the overall update.

Using the EA update, the dual function on round can be rewritten as in (28). In fact, the dual coefficients in can also be updated in (28). Since has dual coefficients, it is impractical to update each of them individually. We therefore introduce a new variable into (29), as

From (30), we can see that a gradient ascent update on amounts to multiplying all the dual coefficients in by . Since every dual coefficient in belongs to , we constrain . The initial value of is zero. Using gradient ascent on , we obtain

Therefore, we choose on each learning round for the overall update. A buffering strategy can also be used in the overall update, . As with the EA update, we propose a process to obtain for the overall update in Algorithm 4.

PROCESS: ,
    If ,
      .
    If , .
    For each
     If , ;
     If , .
    Return .

The gradient ascent for is the same as in the EA update, as shown in (21), (22), and (23). Combined with the constraint on , the maximal step size for the overall update is ,   ,  ,   . The optimal step size can also be obtained using (24). Obviously, if , . Rewriting the overall update process in the form of (19), we have

The new associated boundary vector is given in (33). Algorithm 5 shows the online MR algorithm based on the overall update.

INPUT: two positive scalars: and ; edge weights .
INITIALIZE: a coefficient vector and its associated decision boundary vector .
PROCESS: For
     Receive an example ,
     Get using the process in Algorithm 4,
     Choose a step size ,
     Update the boundary vector using (33),
     If , predict ,
     Renew the buffer.

As with the EA update, we can also derive the -overall update and the aggressive-overall update from the previous analysis.
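A minimal sketch of the idea behind the overall update, under our own simplifying assumptions (a single common scaling factor applied to all stored coefficients, and a generic box constraint on each coefficient), is given below; it is not the paper's exact update formula.

```python
import numpy as np

def overall_update(coeffs, new_coeffs, scale, lo=0.0, hi=1.0):
    """Rescale all stored dual coefficients by a common factor, then append the
    (EA-style) coefficients of the new example; every coefficient is projected
    back into the assumed box [lo, hi]."""
    old = np.clip(np.asarray(coeffs, dtype=float) * scale, lo, hi)
    new = np.clip(np.asarray(new_coeffs, dtype=float), lo, hi)
    return np.concatenate([old, new])
```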

4.3. Two-Step Update

In the two update schemes above, we implicitly assume that all the elements of an example arrive at the same time. In some practical applications, however, the label occasionally arrives after the training point . In that case, there is no need to wait until all the elements of an example have been received before updating the boundary vector. Here, we propose a two-step update scheme.

The two-step update scheme performs two updates on each learning round. The first update takes place when the training point arrives and adjusts the boundary vector using the geometry of the training points; the second takes place when the label arrives and adjusts the boundary vector using the label. Obviously, either the EA update or the overall update can be used in each step of the two-step scheme. As an example, we use the EA update to describe the two-step update process.

Denote by the coefficient vector after the first update on round and by its associated boundary. The example-associate coefficients in the first update on round are , and the new associated boundary vector can be written as In the second update, the example-associate coefficient is , and the new associated boundary vector is If , the second update in (35) does not happen, and the two-step update degenerates into the EA update. The ranges of and can be obtained by the same process as in Section 4.1. As in the previous analysis, the overall update can also be used in each step of the two-step update scheme.

The online MR algorithm based on the two-step update can be described in Algorithm 6.

INPUT: two positive scalars: and ; edge weights .
INITIALIZE: a coefficient vector and its associated decision boundary vector .
PROCESS: For
     Receive a training point ,
     Choose a new coefficient vector that satisfies ,
     Return a new associated boundary vector in (18).
     Receive ,
     If , , , predict ;
     Else if , choose a new coefficient vector that satisfies , return a new associated
     boundary vector in (18).

This scheme is better viewed as a new perspective on the online MR problem, and its effect depends on the update scheme used in each step. Therefore, we focus mainly on the first two update schemes in the rest of this paper.

5. Sparse Approximations for Kernel Representation

In practice, kernel functions are commonly used to learn linear classifiers, as in the SVM. Our online MR framework involves only inner products between points, so a kernel function can easily be introduced. If we define the kernel matrix such that can be replaced by in our framework, we can rewrite (19) in kernel form. Unfortunately, the kernelized online MR algorithms in Section 4 have to store the example sequence up to the current round (in the worst case). When using a buffering strategy with a buffer size of , the stored matrix size is and the time complexity is on round . For practical purposes, we present two approaches to construct a sparse kernel representation of the boundary vector on each round.
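For concreteness, the sketch below evaluates a kernelized boundary of this general shape with the standard RBF kernel used in our experiments; `coeffs` stands for whatever combined coefficients the chosen update scheme attaches to each stored point, so the exact values are not specified here.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Standard RBF kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))

def kernel_predict(x, stored_points, coeffs, sigma=1.0):
    """Kernel representation of the boundary: score(x) = sum_i c_i K(x_i, x)."""
    score = sum(c * rbf_kernel(xi, x, sigma) for c, xi in zip(coeffs, stored_points))
    return np.sign(score), score
```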

5.1. Absolute Threshold

To construct a sparse representation of the boundary vector, the absolute threshold discards examples whose associated coefficients are close to zero (more details in Section 7). Let denote the absolute threshold. Once the absolute value of the associated coefficient of an input example can no longer increase in further updates, the example is discarded if . Examples in the buffer cannot be discarded, since the absolute values of their associated coefficients may still increase in later rounds. The sparse approximation process based on the absolute threshold is described in Algorithm 7.

INPUT: the absolute threshold ; the kernel representation of boundary on round : ,
PROCESS: For each in
     If is not in the buffer and , discard the example and its associated coefficient .
    Return a new boundary .

The sparse approximation process based on the absolute threshold may differ slightly across update schemes in practice. For online MR algorithms with the EA update, the coefficients of examples that are not in the buffer no longer change, so the sparse approximation only needs to examine the example ( for Buffer-) that is removed from the buffer on round . For online MR algorithms with the overall update, the sparse approximation must examine all examples that are not in the buffer on the current round, since their coefficients can still change. This approach may fail to help: if we are unlucky and all the are larger than on every round, the kernel representation of the boundary vector will not become sparse at all.
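In code, the absolute-threshold rule amounts to the following filter (illustrative sketch; `in_buffer` marks the examples that are currently buffered and therefore protected):

```python
def prune_by_threshold(stored_points, coeffs, in_buffer, eps):
    """Drop every stored example whose coefficient magnitude is below eps,
    except examples still in the buffer, whose coefficients may yet change."""
    kept = [(x, c) for x, c, buffered in zip(stored_points, coeffs, in_buffer)
            if buffered or abs(c) >= eps]
    return [x for x, _ in kept], [c for _, c in kept]
```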

5.2. Maximal Coefficients (-MC)

Another way to construct a sparse kernel representation is to keep only the examples whose associated coefficients have the largest absolute values. This approach is called maximal coefficients (-MC) in this paper. As with the absolute threshold, -MC does not discard examples in the buffer, whose coefficient magnitudes may still increase in later rounds. The sparse approximation process based on -MC is described in Algorithm 8.

INPUT: the parameter ; the kernel representation of boundary on round : ,
PROCESS: For each in and not in the buffer
     If does not belong to the first maximum of the coefficients, discard the example and its associated
     coefficient .
    Return a new boundary .

When using -MC for online MR algorithms with a buffer size of , the stored kernel matrix size is at most and the time complexity is on each round.
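A sketch of the corresponding filter (with the buffer again protected, and `k` standing for the number of retained coefficients) follows:

```python
def prune_top_k(stored_points, coeffs, in_buffer, k):
    """Among examples outside the buffer, keep only the k with the largest
    coefficient magnitudes; buffered examples are always kept (illustrative sketch)."""
    outside = [i for i, buffered in enumerate(in_buffer) if not buffered]
    top = set(sorted(outside, key=lambda i: abs(coeffs[i]), reverse=True)[:k])
    keep = [i for i in range(len(stored_points)) if in_buffer[i] or i in top]
    return [stored_points[i] for i in keep], [coeffs[i] for i in keep]
```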

6. On the Connection to Previous Work

6.1. About Dual Ascending Procedure

In the area of online learning, Shalev-Shwartz and Singer [13] propose a primal-dual perspective on online supervised learning algorithms. That work shares our dual ascending perspective for obtaining a better boundary vector. Different from it, we deal with the online MR problem of semisupervised learning, and our emphasis is on how to construct a dual ascending model in the semisupervised setting. An important conclusion of this paper is that the Fenchel conjugate of hinge functions is the key to transferring manifold regularization from offline to online, which is also the reason why we use an absolute function to describe the difference between the predictions of two points.

The primal MR problem degenerates into a basic supervised learning problem [14] when the trade-off parameter is chosen as . Consider Then, the dual function degenerates into (39), which is the dual function of the basic supervised learning problem and is discussed in detail in [13].

6.2. About Online Manifold Regularization

Goldberg et al. [11] propose an empirical study of online MR that deals with the MR problem as follows: where and are trade-off parameters and is the number of labeled examples. Different from our framework, they use a squared function to measure the difference between the predictions of two points (see Figure 2).

To avoid minimizing (40) directly, they further propose an empirical instantaneous regularized risk on round , where is the reverse label probability , which is assumed to be given and easily determined from the rate at which humans can label the data at hand. In our work, we ignore this rate since it can be absorbed into the trade-off parameters and .

Based on the observation that has the form , Goldberg et al. perform a gradient descent step over on each round that aims to reduce the instantaneous risk. The update scheme can be written as

Furthermore, this work uses an annealing heuristic that chooses a decaying step size , . This online MR algorithm is an empirical result whose practicability is demonstrated by experiments, but it lacks sufficient theoretical analysis.

Compared with previous work, our online MR framework reinterprets the online MR process based on the notion of ascending the dual function, and it can also be used to derive different online MR algorithms. Here, we demonstrate that the update scheme in (42) can be derived from our online MR framework.

In Section 4.2, the gradient direction of the overall update for ascending the dual function on round can be written as (43). Choosing (44), we have (45). If , we must have , and is a feasible ascending direction that makes . Using to ascend the dual function, the update scheme can be written as (46). Equations (42) and (46) are essentially the same update scheme with different trade-off parameters and edge weights.

7. Experiments and Analyses

This section presents a series of experimental results demonstrating the effectiveness of our derived online MR algorithms. It is known that the performance of semisupervised learning depends on the correctness of the model assumptions. Thus, our focus is on comparing different online MR algorithms rather than different semisupervised regularization methods.

7.1. Datasets and Protocols

We report experimental results on two artificial and two real-world datasets with different properties, summarized in Table 1.

The artificial datasets consist of two-class problems. The method for generating the two moons dataset is available at http://manifold.cs.uchicago.edu/manifold_regularization/manifold.html; we set the radius of the two moons to 4 and the width to 2, and only one example per class is labeled in this dataset. To demonstrate that our online MR can handle concept drift, we also perform experiments on the two rotating spirals dataset, in which 2% of the examples are labeled. Figure 3 shows that the spirals smoothly rotate 360° during the sequence, so the target boundary drifts with the sequence of examples.

The real-world datasets consist of two-class and multiclass problems. The Isolet dataset derives from the Isolet database of letters of the English alphabet spoken in isolation (available from the UCI machine learning repository). The database contains utterances of 150 subjects who spoke the name of each letter of the English alphabet twice. The speakers are grouped into 5 sets of 30 speakers each, referred to as isolet1 through isolet5. We consider the task of classifying the first 13 letters of the English alphabet against the last 13, using only isolet1 and isolet5 (1 utterance is missing in isolet5 due to poor recording). During the online MR process, all 52 utterances of one speaker are labeled and all the rest are left unlabeled. Our USPS dataset contains the USPS training set for handwritten digit recognition (preprocessed using PCA to 100 dimensions), and we apply online MR algorithms to the 45 binary classification problems that arise in pairwise classification; 5 examples are randomly labeled for each class.

Our experimental protocols are as follows: (1) the training sequences are generated randomly from each dataset (except for the two rotating spirals); (2) the offline MR algorithm used for comparison is LapSVM [12], a state-of-the-art semisupervised learning algorithm based on manifold regularization; (3) each example in each dataset is processed once during the online MR process.

To avoid the influence of particular training sequences, all results on each dataset are averaged over five such trials, except for the two rotating spirals (this setup is inspired by [11]). The error bars are ±1 standard deviation.

All methods use the standard RBF kernel . The edge weights are Gaussian weights defining a fully connected graph, and the edge weight parameter is . For the online MR algorithm comparisons, we choose Buffer- with to avoid high computational complexity. All experiments were implemented in MATLAB.

7.2. Computational Complexity

For offline MR, a kernel matrix needs to be stored and inverted on round , and the time complexity is approximately when using a gradient descent algorithm. In contrast, the computational complexity of our online MR algorithms is determined by the buffer size and the number of examples in the kernel representation of the boundary vector on each round.

For online MR without buffering strategies and sparse approximation approaches, the number of examples in the kernel representation is , and the time complexity is . When using a buffering strategy with a buffer size of , the time complexity reduces to , but the number of examples in the kernel representation is still . In practice, only part of the examples have to be stored (and computed over) thanks to the sparse approximation. Figure 4 shows the number of examples in the kernel representation of the boundary vector on each learning round for the different sparse approximation approaches.

We also compare the cumulative runtime curves of five different MR algorithms on the two moons and Isolet datasets. The first is basic online MR, which uses only the -EA update with no buffering strategy or sparse approximation. The second is online MR with the -EA update and Buffer- (). The third is online MR with the -EA update, Buffer- (), and an absolute threshold . The fourth is online MR with the -EA update, Buffer- (), and -MC (). The last applies offline MR (LapSVM) on each round. Figure 5 shows that online MR with buffering strategies and sparse representation has a lower runtime growth rate than both basic online MR and offline MR. Online MR algorithms without buffering strategies and sparse approximations are time and memory consuming, and it is intractable to apply them to real-world long-running tasks.

The cumulative runtime growth curves of online MR with buffering strategies and sparse approximation approaches scale only linearly, while the others scale quadratically.

7.3. Accuracies

We used the same model selection strategy for both our online MR framework and the traditional offline MR algorithm.

Based on the idea of being “interested in the best performance and simply select[ing] the parameter values minimizing the error” [15], we select combinations of parameter values from the finite grid in Table 2, which is sufficient for algorithm comparisons.

After choosing an update scheme based on our online MR framework, we still have to select a step size on each learning round. We report the online MR error rate for three scenarios in this paper: (i) stationary step size ; (ii) aggressive step size ; (iii) decreasing step size , which is also used in [11].

The best performances of all the online MR algorithms are presented in Table 3 and Figure 6. The following sections provide additional details.

7.4. Additional Results

We now provide some additional results gathered as the online MR algorithms run and discuss the behavior of our derived online MR algorithms more precisely.

7.4.1. Effect of the Parameters , , and the Step Size

The parameters , and have effects on generalization similar to those in the purely offline MR approach (see [12] for an empirical study). However, one has to try many parameter choices during model selection. As increases, the manifold regularizer incorporates unlabeled examples and causes the decision vector to adjust appropriately to the geometry of the training examples. If , the unlabeled examples are disregarded and online MR degenerates into online supervised learning.

The step size controls the increment of the dual function on each learning round. We used three different step size selection methods for the algorithm comparisons in the last section; here, we discuss their effects.

Stationary Step Size. Under mild conditions, this seemingly naive step size selection method achieves acceptable error rates on any input sequence. Figure 7 shows that a large stationary step size does not perform well in online MR algorithms. When one wishes to avoid optimizing the step size on each learning round, we suggest a stationary step size with a small value.

Aggressive Step Size. Since online MR algorithms adjust the boundary according to the local geometry of the incoming point and its label, the aggressive step size selection method aims to search for the optimal step size to increase the dual function more aggressively on each learning round. The experiments in Table 3 and Figure 6 imply that the aggressive selection method does not perform well on all the sequences.

Decreasing Step Size. This step size selection method is based on the idea that the boundary vector approaches the optimal boundary as the online MR algorithms run. It performs well on datasets whose target boundaries are fixed, but the experiments on the two spirals dataset show that it does not perform well for drifting target boundaries.

7.4.2. Increasing Dual Function Achieves Comparable Risks and Error Rates

We compare the primal objective function with the dual function on the training sequence of the two moons dataset as increases. Figure 8 shows that the two curves approach each other along the online MR process using the EA update (). The value of the dual function never decreases as increases; correspondingly, the curve of the primal function shows a downward trend with some small fluctuations. Our experiments support the theory in Section 3 that increasing the dual function achieves risks comparable to those of the primal MR problem.

We also report the performance of on the whole dataset in Figure 9. This result shows that the decision boundary improves along the online MR process. Since online MR adjusts the decision boundary according to the label of the incoming example and the local geometry of the buffer on each learning round, the error rate of on the whole dataset does not decrease monotonically along the online MR process. This is also the reason why online MR can track changes in the data sequence.

7.4.3. Online MR Handles Concept Drift

When the underlying distributions, both and , change during the course of learning, the algorithms are expected to track the changes in the data sequence. In the two rotating spirals dataset, the points change their true labels during the sequence, so any stationary boundary vector would have an error rate of 50%.

We show the error rates of basic online MR versus online MR (Buffer-) with different buffer sizes in Figure 10. This experiment illustrates that a suitable buffer size is able to adapt to the changing sequence and maintain a small error rate.

8. Conclusion and Future Directions

In this paper we presented an online manifold regularization framework based on a dual ascending procedure. To ascend the dual function, we proposed three schemes for updating the boundary on each learning round. Unfortunately, the basic online MR algorithms are time and memory consuming, so we also applied buffering strategies and sparse approximation approaches to make them practical. Experiments show that our online MR algorithms can adjust the boundary vector with the input sequence and achieve risks and error rates comparable to offline MR. In particular, our online MR algorithms can handle settings where the target boundary is not fixed but drifts with the sequence of examples.

There are many interesting open questions in the online semisupervised learning setting. For instance, we plan to study new online learning algorithms for other semisupervised regularizers, in particular those with nonconvex risks for unlabeled examples, such as S3VMs. Another direction is how to choose more effective parameters intelligently during model selection.

Appendix

Fenchel Conjugate

The Fenchel conjugate of a function $f$ is defined as $f^\star(\boldsymbol{\theta}) = \sup_{\mathbf{w}} \left( \langle \mathbf{w}, \boldsymbol{\theta} \rangle - f(\mathbf{w}) \right)$. Since $f^\star$ is defined as a supremum of linear functions, it is convex. The Fenchel-Young inequality states that for any $\mathbf{w}$ and $\boldsymbol{\theta}$, we have $f(\mathbf{w}) + f^\star(\boldsymbol{\theta}) \ge \langle \mathbf{w}, \boldsymbol{\theta} \rangle$.

Subgradients play an important role in the definition of the Fenchel conjugate. In particular, the following lemma states that if , then the Fenchel-Young inequality holds with equality. Here, we describe a few lemmas on the Fenchel conjugate that we use as theoretical tools in this paper. More details can be found in [16].

Lemma 2. Let be a closed and convex function, and let be its differential set at . Then, for all , one has .

Proof. Since and is closed and convex, we know that for all . Equivalently, The right-hand side of the above equals , and thus The assumption that is closed and convex implies that is the Fenchel conjugate of . Thus, Combining the two inequalities, we have

Lemma 3. Let be any norm on , and let with . Then where is the dual norm of . The domain of is also . For example, if , then , since the norm is dual to itself.

Lemma 4. Let where and with . Then, the conjugate of is

Lemma 5. Let be a function, and let be its Fenchel conjugate. For and , the Fenchel conjugate of is .