Abstract

Text classification and generation are two important tasks in natural language processing. In this paper, we address both tasks with a Variational Autoencoder (VAE), a powerful deep generative model. The self-attention mechanism is introduced into the encoder. The modified encoder extracts global features of the input text to produce the hidden code, and we train a neural network classifier on the hidden code to perform classification. On the other hand, the label of the text is fed into the decoder explicitly to strengthen the categorical information, which helps with text generation. Experiments show that our model achieves competitive classification results and generates realistic text. Thus the proposed integrated deep generative model could serve as an alternative for both tasks.

1. Introduction

Text classification is one of the most basic and important tasks in natural language processing, in which predefined categories are assigned to a text. The result of classification is often used as the input to other tasks, so an efficient and accurate classification algorithm is of great benefit. Traditional classification models combined with classical text representations, such as support vector machines based on bag-of-words vectors, have been able to achieve good results in simple application scenarios [1–4].

In recent years, deep learning models based on neural networks have achieved remarkable results in various tasks, such as computer vision [5] and speech recognition [6]. These models have also played a significant role in natural language processing. A considerable part of this work builds on word vector representations, which are learned through neural language models [7–10]. These word vector representations, also known as word embeddings, are transformed by a deep neural network such as a convolutional neural network (CNN) or a recurrent neural network (RNN) to obtain more abstract features of the text [11–14]. Text classification can then be performed on these features. Such methods often lead to state-of-the-art results on larger datasets [15].

On the other hand, text generation is also an important task of concern. Much work on neural language models and sequence-to-sequence models has aimed at improving the quality of the generated text for a variety of purposes [16–18]. Compared with other models, deep generative models have stronger expressive power, have the potential to handle more types of data, and have the natural advantage of being able to generate samples from the model, indicating their potential for this task.

The Variational Autoencoder (VAE), a powerful deep generative model, has attracted the attention of many researchers in recent years [19, 20]. It consists of a probabilistic encoder and a probabilistic decoder and takes advantage of variational inference: a variational lower bound is optimized instead of the traditional log-likelihood.

The Variational Autoencoder has proved its ability both in theory and in practice and is therefore a natural choice for the text generation task. Moreover, the encoder of the VAE can be regarded as a feature extractor, which can be utilized for the classification task. In this paper, we propose an integrated model based on VAE to handle both generation and classification. Although VAE has shown impressive advances in the visual domain, such as image classification and image generation, its application to natural language processing has been relatively less studied [21]. In our work, we use a modified version of VAE to extract global features of the text, while using the text label to facilitate generation during the decoding stage. The influence of the label information is enhanced by explicitly feeding the label to the decoder at each timestep. For text classification, we train a neural network classifier and integrate it with the VAE. Experiments show that our model can handle text classification and text generation at the same time and achieves satisfactory results.

The remainder of the paper is structured as follows. In Section 2 we review the VAE architecture, which is the foundation of our method and experiments. In Section 3 we introduce the basic RNN structure. In Section 4 we elaborate the details of the proposed method. In Section 5 we discuss related works. We provide the experimental evaluation and show the results in Section 6. Finally, we conclude in Section 7.

2. Review of VAE

In this section, we review the basic VAE model as the foundation of our work.

Variational methods provide an optimization-based alternative to sampling-based Monte Carlo methods for learning complex latent variable models. They approximate the true posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a simple, tractable family of distributions. For instance, the mean-field variational method [22] approximates the true posterior with a fully factorized set of distributions. Recently, stochastic variational inference methods have been proposed that update the variational parameters directly by sampling from the variational posterior [23–25].

The Variational Autoencoder (VAE) [19, 26] has proved to be one of the most successful probabilistic generative models, combining the variational learning framework with deep neural networks. It can be regarded as a generative model based on a regularized version of the standard autoencoder.

In VAE, the deterministic encoder is replaced by a learned posterior recognition model $q_\phi(z \mid x)$. Similarly, there is a probabilistic decoder $p_\theta(x \mid z)$.

The objective function is the Evidence Lower Bound (ELBO) of variational learning, which takes the following form:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z)\big] + H\big(q_\phi(z \mid x)\big),$$
where $H(\cdot)$ is the entropy.

The ELBO could be reformulated as
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big).$$

In practice, the prior over the latent variables is usually chosen to be the centered isotropic multivariate Gaussian $p_\theta(z) = \mathcal{N}(z; 0, I)$. The variational approximate posterior is chosen to be a multivariate Gaussian with a diagonal covariance structure whose distribution parameters are computed from $x$ by a fully connected neural network with a single hidden layer:
$$q_\phi(z \mid x) = \mathcal{N}\big(z; \mu(x), \sigma^2(x) I\big),$$

where $\mu(x)$ and $\sigma(x)$ are outputs of the encoding MLP.

In the traditional stochastic variational learning framework, the Evidence Lower Bound is optimized with respect to both the generative parameters $\theta$ and the variational parameters $\phi$. The gradient with respect to $\theta$ could be estimated by
$$\nabla_\theta \mathcal{L} \simeq \frac{1}{L} \sum_{l=1}^{L} \nabla_\theta \log p_\theta\big(x, z^{(l)}\big), \qquad z^{(l)} \sim q_\phi(z \mid x),$$
while the gradient with respect to $\phi$ could be estimated with the score-function estimator
$$\nabla_\phi \,\mathbb{E}_{q_\phi(z \mid x)}\big[f(z)\big] = \mathbb{E}_{q_\phi(z \mid x)}\big[f(z)\, \nabla_\phi \log q_\phi(z \mid x)\big].$$

Unfortunately, this gradient estimator exhibits high variance, and some other methods should be used to alleviate it.

In VAE, the so-called reparameterization trick is used to solve the above problem. Assume that $z$ is a continuous random variable sampled from $z \sim q_\phi(z \mid x)$. Then $z$ can often be expressed as a deterministic variable $z = g_\phi(\epsilon, x)$, where $\epsilon$ is an auxiliary variable with an independent marginal $p(\epsilon)$ and $g_\phi(\cdot)$ is some vector-valued function parameterized by $\phi$.

Given $z = g_\phi(\epsilon, x)$, we have
$$\mathbb{E}_{q_\phi(z \mid x)}\big[f(z)\big] = \mathbb{E}_{p(\epsilon)}\big[f\big(g_\phi(\epsilon, x)\big)\big];$$
thus we could construct the Monte Carlo estimator
$$\mathbb{E}_{q_\phi(z \mid x)}\big[f(z)\big] \simeq \frac{1}{L} \sum_{l=1}^{L} f\big(g_\phi(\epsilon^{(l)}, x)\big),$$
where $\epsilon^{(l)} \sim p(\epsilon)$.

For instance, assume that $q_\phi(z \mid x) = \mathcal{N}(z; \mu, \sigma^2 I)$. Then $z$ could be reparameterized by $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\odot$ denotes element-wise multiplication. We then have the estimator
$$\widetilde{\mathcal{L}}(\theta, \phi; x) = \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x \mid z^{(l)}\big) - D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big), \qquad z^{(l)} = \mu + \sigma \odot \epsilon^{(l)},$$
where $\epsilon^{(l)} \sim \mathcal{N}(0, I)$ and the KL divergence term can be computed analytically for Gaussian distributions.
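To make the trick concrete, the following minimal NumPy sketch draws reparameterized samples and evaluates the analytic Gaussian KL term; the function and variable names (`reparameterize`, `mu`, `log_sigma_sq`) are illustrative and not taken from the paper.

```python
import numpy as np

def reparameterize(mu, log_sigma_sq, rng=np.random.default_rng()):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_sigma_sq) * eps

def gaussian_kl(mu, log_sigma_sq):
    """Analytic KL( N(mu, sigma^2 I) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1 + log_sigma_sq - mu**2 - np.exp(log_sigma_sq), axis=-1)

# Example: a batch of 4 posteriors over a 128-dimensional latent space.
mu = np.zeros((4, 128))
log_sigma_sq = np.zeros((4, 128))
z = reparameterize(mu, log_sigma_sq)   # shape (4, 128)
kl = gaussian_kl(mu, log_sigma_sq)     # shape (4,)
```

Because the randomness enters only through `eps`, gradients can flow through `mu` and `log_sigma_sq` back to the encoder parameters, which is exactly what the reparameterization trick provides.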

3. Recurrent Neural Network

Neural networks have been applied to a variety of tasks in natural language processing. In particular, recurrent neural networks with long short-term memory (LSTM) [27, 28] cells or gated recurrent units (GRU) [29] have proven successful at tasks including machine translation [16–18], machine comprehension, and many others. These models are especially suitable for handling sequential data.

Due to the vanishing and exploding gradient problems, it is widely believed that learning long-range dependencies with recurrent neural networks is challenging. To deal with this issue, LSTM introduces a more complex interaction structure. The recurrent computations in LSTM can be represented by
$$\begin{aligned}
i_t &= \sigma\big(W_i x_t + U_i h_{t-1} + b_i\big),\\
f_t &= \sigma\big(W_f x_t + U_f h_{t-1} + b_f\big),\\
o_t &= \sigma\big(W_o x_t + U_o h_{t-1} + b_o\big),\\
\tilde{c}_t &= \tanh\big(W_c x_t + U_c h_{t-1} + b_c\big),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,\\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}$$
where $x_t$ is the input, $h_t$ is the cell's state, $c_t$ is the cell's memory, $\tilde{c}_t$ is the cell's candidate memory, $i_t$, $f_t$, and $o_t$ are the states of the input, forget, and output gates, and $W_\ast$, $U_\ast$, and $b_\ast$ are parameters of the cell.

It should be noted that there are many variants of LSTM, but throughout this paper, we always use the standard LSTM.
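For reference, the following NumPy sketch implements a single step of the standard LSTM equations above; the parameter layout and initialization are illustrative assumptions, with the embedding and hidden sizes matching the values reported later in Section 6.2.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One standard LSTM step. W: (4H, D), U: (4H, H), b: (4H,); the four row blocks
    hold the input gate, forget gate, output gate, and candidate-memory parameters."""
    H = h_prev.shape[0]
    gates = W @ x_t + U @ h_prev + b
    i = sigmoid(gates[0:H])             # input gate
    f = sigmoid(gates[H:2*H])           # forget gate
    o = sigmoid(gates[2*H:3*H])         # output gate
    c_tilde = np.tanh(gates[3*H:4*H])   # candidate memory
    c_t = f * c_prev + i * c_tilde
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Example with embedding size D = 300 and hidden size H = 128 (values used in the paper).
D, H = 300, 128
rng = np.random.default_rng(0)
W = rng.normal(size=(4*H, D)) * 0.01
U = rng.normal(size=(4*H, H)) * 0.01
b = np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
```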

4. Details of the Model

In this section, we introduce the structure of the model in detail.

4.1. The Probabilistic Encoder

In VAE, the encoder encodes the input into the hidden code. To handle text data, we use LSTM as the main part of the encoder. Given a sequence consisting of $N$ words $w_1, w_2, \ldots, w_N$, we first convert each word $w_i$ into an embedding vector $x_i$. The embeddings are the column vectors of an embedding matrix $E \in \mathbb{R}^{d \times V}$, where $d$ is the dimension of the embedding and $V$ is the size of the vocabulary; the $i$-th column corresponds to the embedding of the $i$-th word in the vocabulary. The matrix $E$ is a parameter to be learned, and the dimension $d$ is a hyperparameter chosen by the user.

The embeddings of the words are then fed into the LSTM cell step by step, and finally we obtain the hidden states of all $N$ timesteps. Unlike the vanilla sequential VAE [21], we prefer a bidirectional LSTM and introduce the self-attention mechanism to extract information from the entire text more efficiently. The variational posterior represented by the encoder is a multivariate Gaussian with a diagonal covariance structure, where the mean and the standard deviation of the posterior are the outputs of an MLP. Formally, we have
$$\begin{aligned}
\overrightarrow{h}_t &= \overrightarrow{\mathrm{LSTM}}\big(x_t, \overrightarrow{h}_{t-1}\big), \qquad
\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}\big(x_t, \overleftarrow{h}_{t+1}\big),\\
h_t &= \big[\overrightarrow{h}_t; \overleftarrow{h}_t\big],\\
\alpha_t &= \frac{\exp\big(u^\top \tanh(W_a h_t + b_a)\big)}{\sum_{k=1}^{N} \exp\big(u^\top \tanh(W_a h_k + b_a)\big)},\\
s &= \sum_{t=1}^{N} \alpha_t h_t,\\
\mu &= W_\mu s + b_\mu, \qquad \log \sigma^2 = W_\sigma s + b_\sigma,
\end{aligned}$$
where $W_a$, $b_a$, $u$, $W_\mu$, $b_\mu$, $W_\sigma$, and $b_\sigma$ are parameters to be learned. In the above procedure, $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ represent the hidden states of the forward and backward directions of the LSTM, $h_t$ is the concatenation of $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$, $u$ is the weight vector used in the self-attention mechanism, and $s$ is the weighted average of all the hidden states, also called the context state vector.
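The following PyTorch sketch shows one way to realize this encoder (BiLSTM, self-attention pooling, and the MLP producing the posterior parameters). The paper was implemented in TensorFlow; the module and variable names here are our own, and only the stated sizes (vocabulary 15000, embedding 300, hidden 128, latent 128) come from Section 6.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveEncoder(nn.Module):
    """BiLSTM encoder with self-attention pooling; outputs mu and log sigma^2 of q(z|x).
    A sketch of the architecture described in Section 4.1, not the authors' exact code."""
    def __init__(self, vocab_size=15000, embed_dim=300, hidden=128, latent=128, attn_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.attn_proj = nn.Linear(2 * hidden, attn_dim)
        self.attn_vec = nn.Linear(attn_dim, 1, bias=False)   # the attention weight vector u
        self.to_mu = nn.Linear(2 * hidden, latent)
        self.to_logvar = nn.Linear(2 * hidden, latent)

    def forward(self, tokens):                                  # tokens: (batch, seq_len)
        h, _ = self.bilstm(self.embed(tokens))                  # (batch, seq_len, 2*hidden)
        scores = self.attn_vec(torch.tanh(self.attn_proj(h)))   # (batch, seq_len, 1)
        alpha = F.softmax(scores, dim=1)                        # attention weights over timesteps
        context = (alpha * h).sum(dim=1)                        # weighted average of hidden states
        return self.to_mu(context), self.to_logvar(context)

# Usage: a batch of 2 sequences of length 35 (the maximum length used in the paper).
enc = AttentiveEncoder()
mu, logvar = enc(torch.randint(0, 15000, (2, 35)))
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)         # reparameterized sample
```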

4.2. The Text Classifier

Since text classification is one of our aims, we can use the hidden code as the extracted abstract feature of the text. Ideally, we assume that the hidden code contains important information about the input, such as semantics, length, and sentiment. Based on these features, we could use traditional classifiers, such as SVM and AdaBoost, to classify texts. However, categorization is only one of our goals; the other is text generation, and the class information about the text helps the latter. Therefore, we integrate the classifier into the VAE model and train a neural network as the classifier.

The hidden code is fed into a fully connected layer followed by a softmax layer, and the output is a vector indicating the probability of each class. Formally, we have
$$\hat{y} = \mathrm{softmax}\big(W_c z + b_c\big),$$
where $W_c$ and $b_c$ are parameters to be learned.
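A possible realization of the classifier is sketched below; the two hidden layers of sizes 200 and 400 follow Section 6.2, while the single fully connected layer in the formula above is the simplest special case. Names and details are ours.

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """Feed-forward classifier on the hidden code z. Layer sizes (200, 400) follow
    Section 6.2; this is a sketch, not the authors' implementation."""
    def __init__(self, latent=128, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent, 200), nn.ReLU(),
            nn.Linear(200, 400), nn.ReLU(),
            nn.Linear(400, num_classes),
        )

    def forward(self, z):
        return self.net(z)   # unnormalized logits; apply softmax or cross-entropy on top

clf = TextClassifier()
logits = clf(torch.randn(4, 128))
probs = torch.softmax(logits, dim=-1)   # class probabilities as in the formula above
```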

4.3. The Probabilistic Decoder

The decoder generates the words sequentially, conditioned on the information provided by the encoder. We again use LSTM as the main part of the decoder, and the word embedding procedure remains unchanged. Decoding at timestep $t$ could be expressed as
$$h_t = \mathrm{LSTM}\big(x_{t-1}, h_{t-1}\big), \qquad p\big(w_t \mid w_{<t}, z\big) = \mathrm{softmax}\big(W_v h_t + b_v\big),$$
where the initial state is derived from the hidden code $z$ and $W_v$ and $b_v$ are parameters to be learned.

Recall that we have the categorization information for the text, so we may use the label to help control the generation process. A natural idea is to condition on both the hidden code and the label only at the beginning of the decoding procedure. Unfortunately, this often does not perform well in practice: as the sequence gets longer, the signal provided by the class label weakens rapidly. To make full use of the label, we concatenate the word embedding and the one-hot label vector $y$ at each timestep, so the hidden state of each timestep becomes
$$h_t = \mathrm{LSTM}\big([x_{t-1}; y], h_{t-1}\big).$$
The remaining steps are the same as before.
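A decoder with per-timestep label conditioning could look like the PyTorch sketch below. Initializing the LSTM state from the hidden code is an assumption on our part; the per-step concatenation of the word embedding and the one-hot label follows the description above, and all names are illustrative.

```python
import torch
import torch.nn as nn

class LabeledDecoder(nn.Module):
    """LSTM decoder that receives the one-hot label at every timestep (Section 4.3).
    Sketch only: the hidden code z is assumed to initialize the LSTM state."""
    def __init__(self, vocab_size=15000, embed_dim=300, hidden=128, latent=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(latent, hidden)
        self.init_c = nn.Linear(latent, hidden)
        self.lstm = nn.LSTM(embed_dim + num_classes, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)
        self.num_classes = num_classes

    def forward(self, tokens, z, labels):              # tokens: (B, T), labels: (B,)
        x = self.embed(tokens)                         # (B, T, embed_dim)
        y = nn.functional.one_hot(labels, self.num_classes).float()   # (B, C)
        y = y.unsqueeze(1).expand(-1, x.size(1), -1)   # repeat the label at every timestep
        h0 = torch.tanh(self.init_h(z)).unsqueeze(0)   # (1, B, hidden)
        c0 = torch.tanh(self.init_c(z)).unsqueeze(0)
        out, _ = self.lstm(torch.cat([x, y], dim=-1), (h0, c0))
        return self.out(out)                           # per-step vocabulary logits

dec = LabeledDecoder()
logits = dec(torch.randint(0, 15000, (2, 35)), torch.randn(2, 128), torch.tensor([0, 1]))
```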

4.4. The Objective Function

As our model needs to take into account both text classification and generation, the objective function we need to optimize also consists of two parts. Given a labeled data pair $(x, y)$, we have
$$p_\theta(x \mid z, y) = \prod_{t=1}^{N} p_\theta\big(w_t \mid w_{<t}, z, y\big),$$
and the Evidence Lower Bound could be expressed as
$$\mathcal{L}_{\mathrm{VAE}}(\theta, \phi; x, y) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z, y)\big] - D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big),$$
where $D_{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence between the distributions $q_\phi(z \mid x)$ and $p(z)$.

The loss of the classifier could be expressed as
$$\mathcal{L}_{\mathrm{cls}}(x, y) = -\sum_{c} \tilde{p}(c \mid x) \log \hat{y}_c,$$
where $\tilde{p}(c \mid x)$ represents the empirical distribution of the ground truth label and $\hat{y}$ is the output of the classifier.

The final objective function for the entire dataset is
$$\mathcal{L} = \sum_{(x, y)} \Big[ -\mathcal{L}_{\mathrm{VAE}}(\theta, \phi; x, y) + \lambda\, \mathcal{L}_{\mathrm{cls}}(x, y) \Big],$$
where $\lambda$ is the hyperparameter controlling the trade-off between these two parts.

The reparameterization trick and Monte Carlo sampling could be used to calculate the gradient with respect to each parameter, after which standard gradient-based optimization is applied.
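Putting the pieces together, a single training step on the combined objective might look like the sketch below, assuming the hypothetical `AttentiveEncoder`, `LabeledDecoder`, and `TextClassifier` modules from the previous sketches; `kl_weight` anticipates the cost annealing described in Section 6.2 and `lam` plays the role of the trade-off hyperparameter $\lambda$.

```python
import torch
import torch.nn.functional as F

def training_step(enc, dec, clf, tokens, labels, kl_weight, lam, optimizer):
    """One optimization step on the combined objective (Section 4.4): reconstruction
    + weighted KL + lambda * classification loss. A sketch under the assumptions above."""
    mu, logvar = enc(tokens)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick

    logits = dec(tokens[:, :-1], z, labels)                   # predict the next token at each step
    recon = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                            tokens[:, 1:].reshape(-1))
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu**2 - logvar.exp(), dim=-1))
    cls = F.cross_entropy(clf(z), labels)

    loss = recon + kl_weight * kl + lam * cls
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```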

The framework of our model is shown in Figure 1.

5. Related Works

In recent years, VAE has proved to be a powerful deep generative model. The first work to combine VAE with text generation is [21], which uses the vanilla LSTM as both the encoder and the decoder. They introduced some tricks for training VAE, such as word dropout and KL annealing, which we also use in our work. However, their work does not include the classification task, and their model cannot generate text for a given category by controlling the label.

There are similar works on the same topic, such as [36, 37]. In [36], the vanilla LSTM is used as the encoder of the VAE, and the hidden code is transformed from the hidden state of the last timestep of the LSTM. This means that the LSTM must compress the information into the last hidden state as much as possible, which is hard due to the sequential structure of the vanilla LSTM; in fact, the information from earlier timesteps is often lost. In contrast, our model uses a bidirectional LSTM as the encoder, which captures more contextual information than the vanilla one. More importantly, we introduce the self-attention mechanism [38, 39], so the final context state vector is derived from all the hidden states of the LSTM rather than just the last one. The extracted feature therefore contains more global information and can improve the performance of both the classification and generation tasks. Moreover, although [36] uses a modified CNN-based decoder, it does not explicitly take advantage of the label during the decoding stage, so the categorical signal may not be strong enough to guide the generation procedure, whereas our method explicitly uses the label information during decoding, which helps control the generation process. In [37], the vanilla LSTM is also used as the encoder, which is less robust and less efficient.

6. Experiments

In this section we show experimental results on several datasets to demonstrate the performance of the proposed method. First we introduce the datasets used in the experiments; then we describe the experimental setup; finally we present the results from two perspectives: text classification and text generation.

6.1. Datasets

We evaluated the model on three benchmarks: the IMDB dataset, SST1, and SST2.

The IMDB dataset is a benchmark for sentiment classification [30]. The task is to determine whether a movie review is positive or negative. The Stanford Sentiment Treebank (SST) dataset consists of movie reviews with one sentence per review [35]. SST1 provides fine-grained labels: very positive, positive, neutral, negative, and very negative. SST2 is the same as SST1 but with neutral reviews removed and binary labels. Details about the datasets are shown in Table 1.

6.2. Experimental Setup

For all the experiments, the vocabulary size is set to 15000; words outside the vocabulary are replaced by a special unknown-word token. In practice, sentences of arbitrary length cannot be handled, so we set the maximum sentence length to 35. We truncate a sentence if its length exceeds the maximum, and we pad sentences with a padding token so that the lengths within each minibatch are consistent. During decoding, we add start-of-sequence and end-of-sequence tokens at the beginning and end of each sentence, respectively.
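A minimal preprocessing sketch along these lines is shown below; the token strings (`<unk>`, `<pad>`, `<bos>`, `<eos>`) are placeholders rather than the paper's actual symbols, and padding to a fixed length instead of per-minibatch is a simplification.

```python
def preprocess(tokens, vocab, max_len=35, pad="<pad>", unk="<unk>", bos="<bos>", eos="<eos>"):
    """Truncate to max_len, map out-of-vocabulary words to the unknown token, add
    sequence-boundary tokens, and pad to a fixed length (a simplification)."""
    tokens = [t if t in vocab else unk for t in tokens[:max_len]]
    tokens = [bos] + tokens + [eos]
    return tokens + [pad] * (max_len + 2 - len(tokens))

vocab = {"the", "movie", "was", "great"}
print(preprocess("the movie was surprisingly great".split(), vocab))
```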

We use the pretrained 300-dimensional GloVe vectors [40] to initialize the word embeddings. During training, these embeddings are optimized as part of the model parameters. The hidden size of the LSTM cell is set to 128, and the size of the hidden code is also 128. The batch size is 64. The classifier is a neural network with two hidden layers of sizes 200 and 400.

We implemented the model with TensorFlow. The model is trained end-to-end using the ADAM optimizer.

In order to train the VAE better, the cost annealing trick is adopted to smooth training by gradually increasing the weight of the KL divergence from zero to one. This trick avoids a vanishingly small KL term in the VAE module. We also use word dropout to regularize the model: the tokens fed into the decoder are replaced by the unknown-word token with a certain probability. We set this probability to 0.25 during training; the trick is not used at decoding time.
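The two tricks can be sketched as follows; the linear annealing schedule is an assumption (the paper does not specify the exact schedule), while the 0.25 word-dropout probability matches the text above.

```python
import numpy as np

def kl_weight(step, warmup_steps=10000):
    """Cost-annealing schedule: ramp the KL weight from 0 to 1.
    A linear ramp is assumed here; the paper does not state the schedule."""
    return min(1.0, step / warmup_steps)

def word_dropout(tokens, unk_id, keep_prob=0.75, rng=np.random.default_rng()):
    """Replace decoder-input tokens with the unknown token with probability 0.25,
    as described in Section 6.2; applied only during training."""
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < keep_prob
    return np.where(mask, tokens, unk_id)

# Example: drop roughly a quarter of the decoder inputs in a length-10 sequence.
print(word_dropout([5, 12, 7, 9, 3, 44, 2, 8, 19, 6], unk_id=1))
```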

All the other hyperparameters, such as the trade-off between the two terms in the cost function, are chosen based on the performance on the development set.

6.3. Classification Performance

To evaluate the proposed model, we first train an SVM classifier based on bag-of-words features as a baseline on each dataset. These baselines are implemented with LIBSVM. We also compare our model with several previous best methods.

The classification results on the IMDB dataset are shown in Table 2, together with some previous best supervised results. For instance, the NB-LM bow3 method first generates binary bag-of-n-gram vectors, multiplies the component for each n-gram by its NB weight, and then trains a logistic regression classifier.

We found that although our model does not outperform the best methods, it still obtains competitive results, which indicates that the features extracted by the encoder are meaningful. We emphasize that the best models are specially optimized for the classification task, while our model also needs to consider text generation. Therefore, judging our model by the classification results alone does not fully demonstrate its ability.

We also carried out a series of similar experiments on the SST dataset. The results are shown in Table 3. This time we compared with several strong neural-network-based methods, such as the Recursive Autoencoder (RAE) with pretrained word vectors from Wikipedia, the Matrix-Vector Recursive Neural Network (MV-RNN) with parse trees, the Recursive Neural Tensor Network (RNTN) with a tensor-based feature function and parse trees, and the Dynamic Convolutional Neural Network (DCNN) with k-max pooling. Once again, our model far outperforms the baseline and achieves competitive results.

6.4. Generation Performance

We then conducted additional experiments to demonstrate the ability of our model in text generation. In order to make full use of the label information to generate text of the corresponding category, we explicitly feed the label into the decoder; in other words, the word embedding and the one-hot label vector are concatenated at each timestep as the input.

We used the model trained on the IMDB dataset to generate text for different categories. The results are shown in Table 4. We found that the generated text is realistic and consistent with the given label; that is, it expresses the corresponding positive or negative sentiment.

To make the experiments more complete, we also evaluated text generation on the SST dataset. The samples are shown in Table 5. We found that the generated text is shorter and simpler than the text generated from the IMDB dataset, because the IMDB dataset is larger and our model learns more complex structures from it. The results demonstrate that our model consistently achieves good generative performance.

7. Conclusion

In this paper, we have proposed an integrated deep generative model to deal with both text classification and text generation. We use a bidirectional LSTM and introduce the self-attention mechanism to enhance the encoder of the VAE. We then extract the global feature of the text and use a neural network classifier to perform classification based on it. Moreover, the categorization information is explicitly fed into the decoder to help control the generation process. The experiments have shown that our model achieves competitive results on both tasks. While our work has enhanced the encoder of the VAE, we have not modified the main structure of the decoder. In fact, new mechanisms could be introduced to enhance the decoder, such as a coverage mechanism. In addition, constraint functions could be added to the objective to produce more controllable text. We leave these as future work.

Data Availability

All the datasets used in this paper are publicly available and could be obtained from http://ai.stanford.edu/~amaas/data/sentiment/ and http://nlp.stanford.edu/sentiment/.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant nos. 11771393 and 11632015) and Zhejiang Natural Science Foundation (Grant no. LZ14A010002).