1 Introduction

The digital revolution has led to an exponential increase in the information accessible online, which has found ample utility in personalization of products and services. By providing a more relevant and engaging experience to the users with personalized content, a significant increase in revenue and market share has been achieved. Many businesses like Yelp, Amazon, Netflix, etc. have their core business models centered around product personalization. Specifically, in the recommendation arena of movies, articles, books, etc. personalization is indispensable [23].

A vast majority of personalization techniques have focused on extracting explicit information on users and businesses through online ratings issued by a user. For example, Collaborative Filtering (CF) based algorithms personalize product recommendations based on the users’ and items’ historical ratings [16]. Social network based personalization algorithms further enhance the CF approaches by incorporating the user’s social network related information. On the other hand, content-based algorithms incorporate attributes extracted from the business itself to generate personalized recommendations. However, these approaches fail to capture users’ inherent preferences and the ground-level facts about the businesses, which are better reflected in the reviews written by the users. Research on extracting information on user-business interactions through online reviews has focused on a variety of topic modeling and sentiment analysis approaches to gain insight into the user’s perspective and opinions [4, 5, 37]. However, a lack of generalization and the difficulty humans have in interpreting the results have limited the scope of their practical applications.

In this work, we propose a topic modeling based sentiment enhanced entity (user/item) profiling methodology that generates user and item profiles as numeric vectors from the review text. The proposed framework consists of three stages: text preprocessing, topic modeling, and profile generation. In text preprocessing, we adapt a codeword insertion step to replace topic relevant terms with the domain specific category names based on the business’ features. For example, in the context of restaurant reviews, the words “pizza” and “pepperoni” would be replaced by “American” if the restaurant serves North American cuisines and would be replaced by “Continental” if the restaurant serves Italian cuisines instead. We further parse the text to get Parts of Speech (POS) tags and a dependency tree in order to extract nouns for topic modeling and aspect level dependencies for sentiment classification.

In the topic modeling stage, we aggregate all the nouns and noun phrases from the preprocessed reviews at entity-level and generate two separate corpora, corresponding to user-level and business-level documents. We implement the entity Latent Dirichlet Allocation (eLDA) for extracting the topics, where “entity” refers to a user or item to be profiled. The derived topics are validated using several quantitative and qualitative (human judgment-based) metrics. The extracted topics are then mapped to the profile aspects in the profile generation stage. This ensures a domain-specific interpretability of the profiles. In addition, we use an aspect level sentiment classifier to classify the business aspects into strength/weaknesses as perceived by users. Finally, the entity profile vectors are generated as an aggregation of the probabilities of all the topics related to the aspects.

For this study, we have considered restaurant reviews posted by users in a publicly available dataset released by Yelp for the Yelp Dataset Challenge to generate personalized profiles for users and restaurants. Our proposed method outperforms the benchmarked algorithms on both quantitative and qualitative metrics. This can be attributed to the following aspects of the proposed approach:

Noun Extraction: Extracting nouns and noun phrases retains the information critical to generating the profile, while ignoring the irrelevant information [22]. This in turn improves the topic quality and also reduces the computation time of the topic modeling technique.

Codeword Insertion: Replacing multiple words referring to the same object with a codeword explicitly indicating the category reinforces domain-specific connections among words that are not apparent from the corpus. Replacing redundant terms with codewords reduces the vocabulary size without loss of information. This enhances topic interpretability by concentrating the word-topic probability over a smaller set of words. It also improves discoverability of implicit topics that are not directly aggregated with a codeword.

The topic probabilities obtained from the proposed model are then used for generating entity profiles as a numeric vector of pre-determined dimension using the online reviews. Furthermore, we have enhanced the business profile by classifying the business aspects into user-perceived strengths and weaknesses using the parse tree based aspect level sentiment analysis.

The rest of the paper is structured as follows: Sect. 2 presents literature related to this work. The solution overview is described in detail in Sect. 3, followed by the dataset description and experimental findings in Sect. 4 and Sect. 5, respectively. Finally, the conclusion and future work are presented in Sect. 7.

2 Literature Review

In this section, we have reviewed the following topics relevant to our work: (1) Topic Modeling (2) Sentiment Analysis (3) Profiling and Personalization.

2.1 Topic Modeling

Information retrieval from massive text corpora has attracted attention from various research communities over the years. A plethora of work has focused on reducing the dimensionality of the word-frequency space by clustering various documents based on common underlying topics. The traditional Term Frequency-Inverse Document Frequency (TF-IDF) approach proposed by Salton and Buckley [28] reduces a document of arbitrary length to a real vector having a dimension equal to the total number of terms in the vocabulary of the entire corpus. Each vector element represents the term frequency (TF) weighted by the logarithm of the inverse of the proportion of documents containing that term (IDF). Clustering algorithms are then applied to the TF-IDF matrix to cluster documents. This approach is appealing because of its simplicity and intuitive interpretability; however, it fails to capture underlying topics in the document, which can be modeled based on the co-occurrence of multiple terms. This problem has been addressed by Hofmann in the probabilistic Latent Semantic Indexing (pLSI) approach [12]. pLSI models a document as a probabilistic mixture of multiple topics, where a topic itself is a probability distribution over the terms in the vocabulary. However, this method is not scalable, since the number of parameters grows linearly with the number of documents in the corpus, and it does not generalize to documents outside the training corpus.
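As a concrete illustration of the TF-IDF weighting described above, the following minimal sketch computes the weights for a toy corpus. The function name `tf_idf` and the example documents are illustrative assumptions, and raw-count TF with a natural-log IDF is only one of several common variants.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return the vocabulary and one TF-IDF vector per tokenized document."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # TF weighted by the log of the inverse proportion of documents with the term.
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

# Toy corpus (assumed data, not taken from the Yelp dataset).
docs = [["pizza", "crust", "pizza"], ["sushi", "rice"], ["pizza", "rice"]]
vocab, vectors = tf_idf(docs)
```

Each document becomes a vector whose length equals the vocabulary size, which is the representation that the clustering step operates on.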

The Latent Dirichlet Allocation (LDA) approach proposed by Blei et al. [5] addresses these issues by treating the document-level topic proportions as a random variable drawn from a Dirichlet prior rather than as document-specific parameters, as in the pLSI model. This ensures that the learnt parameters are not document specific, do not grow linearly with the number of documents and can be used to infer topics for another corpus built from the same vocabulary. LDA has been widely applied in a number of domains, successfully yielding topics of good quality. Variations of LDA such as local-LDA by Brody and Elhadad [6] and sentence LDA (sent-LDA) by Bao and Dutta [1] have been developed to further improve topic quality in specific domains. Another probabilistic topic modeling approach, Correlated Topic Models (CTM), has been developed by Blei and Lafferty [4] to model and extract topics having a high correlation. CTM supports the extraction of a greater number of topics from the same corpus with a higher log-likelihood; however, CTM is computationally more expensive than LDA, and has comparable performance for a lower number of topics.

The approaches discussed so far have the ability to soft cluster groups of words into various topics based on their co-occurrence in the corpus. However, they do not have any means of enforcing a domain-specific understanding of words that may not be captured in a bag-of-words approach because of the inherent richness of the domain’s vocabulary. Bao et al. [2] proposed an LDA based topic modeling approach in which the ratings corresponding to the reviews were incorporated in the topic modeling algorithm. This approach improved recommendation quality, but failed to explicitly generate topics and entity profiles. A semantic approach to topic modeling has been proposed by Linshi [17], where words of similar sentiment polarity have been grouped to reveal the sentiment associated with the individual topics. This approach has demonstrated the potential for revealing additional information through the incorporation of semantic knowledge in topic modeling. In this study, we have adopted a semantic-LDA hybrid approach by incorporating domain-specific knowledge in topic modeling, resulting in better topic quality and interpretability.

2.2 Sentiment Analysis

Sentiment Analysis (SA) is the computational treatment of subjective opinions and emotions expressed by people in texts. SA has most commonly been approached as a classification problem [24], where entire documents or parts of documents are classified to be associated with various sentiments. SA is done on three levels: document-level, sentence-level and aspect-level. Document-level SA classifies a document into a multitude of sentiments such as positive, negative, neutral, etc. Sentence-level SA makes the same classification for individual sentences instead of documents and the aspect-level SA associates different sentiments to various aspects of the document [24].

Aspect-level SA has the potential to provide better insight into the distinct polarity associated with various themes in rich review texts compared to other techniques where an aggregated polarity is given. It primarily involves three major steps: aspect detection, sentiment classification and aggregation [29]. Aspect detection methods have generally been frequency based [13, 18], syntactic [26, 38] or machine learning based [14, 19]. Frequency based and syntactic approaches are intuitive and have similar performance to machine learning techniques [29]. The sentiment classification, the second step, can occur separately or jointly [37] with aspect detection using either a lexicon-based bag of words approach for sentiment aggregation or implementing algorithms such as LDA, logistic regression, Hidden Markov Models, K-Nearest Neighbour, and Support Vector Machines for supervised and semi-supervised learning of the sentiments [24]. Aggregation of the sentiment associated with an aspect is the weighted average of all the individual sentiments [33, 34]. In this study, we have employed a hybrid aspect-level SA approach to extract the sentiment level polarities associated with the aspects of business profiles. A parse-tree based syntactic approach has been used for aspect detection and a machine learning based supervised classification approach for sentiment classification. Finally, we have aggregated the sentiments associated with an aspect by taking a simple average of the classification probabilities.

2.3 Profiling and Personalization

The exponential growth of the Internet has led to an abundance of information at the disposal of web users. Personalization has helped tackle information overload by limiting nonessential content. To deliver personalized services, accurately profiling users and businesses has become paramount. This has opened a new area of research that mines both implicit (visits, clicks, time spent, etc.) and explicit (reviews, stars, etc.) information for profiling and personalization. Collaborative Filtering (CF) is a technique used for personalizing content by grouping similar users and businesses based on historical ratings [16, 27]. The underlying assumption is that users who rated similarly in the past tend to rate similarly in the future. One main limitation of similarity based CF is that its time complexity does not scale linearly with the number of users and businesses. Rather than exploiting the entire user space to find similar users, researchers have started using online social networks to identify users’ preferences. In [15, 20, 35], the latent features of friends from online social networks are assumed to be similar to the target user’s latent features, and the influence of friends’ networks is utilized for personalizing content for the target user. Although personalization can be facilitated by incorporating recommendations from friends in a social network, there is potential in incorporating individual preferences and perspectives about the businesses for quality profiling. To retain the individual preferences of experienced users along with social network influence, Feng and Qian [11] proposed a framework that fuses personal interest with social influence. The personal interest factor captures the most desirable business categories for the target user by analyzing the historical ratings. The recent growth of GPS enabled gadgets and Web 2.0 technologies has attracted users to update their location information via check-ins. Prior works [7, 39, 40, 41] focused only on geographical influence for providing recommendations, whereas recent works [9, 36] use both geographical and social influences for providing Point-Of-Interest (POI) recommendations. Though the previously stated works provide personalized information using explicit information (ratings or stars), they do not analyze the user preferences that are latent in reviews, posts, blogs, etc.

In this study, we propose an approach to generate profiles for users and items by incorporating latent features mined from the user reviews using a codeword based entity-level LDA technique. Further, a sentiment analysis technique has been employed to detect the sentiments associated with the topics.

3 Solution Overview

3.1 Problem Formulation

In this study, we aim to construct M user and N item profiles as k-dimensional real-valued vectors \(U_{u} \in \mathbb {R}^k \forall u \in \mathbb {U}\) and \(I_{i} \in \mathbb {R}^k \forall i \in \mathbb {I}\) from review texts, where \(\mathbb {U}\) is the set of users and \(\mathbb {I}\) is the set of items, such that \(|\mathbb {U}| = M\) and \(|\mathbb {I}| = N\).

Fig. 1. Architecture for entity profile generation

3.2 Solution Details

Herein, we propose a Latent Dirichlet Allocation (LDA) based entity profiling algorithm (eLDA) that ensures that the domain specific user interests are captured in the user profile \(U_{u} \forall u \in \mathbb {U}\). Item profiles \(I_{i} \forall i \in \mathbb {I}\) capture the key business attributes along with the sentiment polarity, which indicates the strengths and weaknesses as perceived by the users in general. The overall architecture of eLDA is presented in Fig. 1.

Our algorithm consists of three major stages: text pre-processing, topic modeling, and entity profile generation. Each stage is explained in detail below.

Text Pre-processing. Text pre-processing is customary in any Natural Language Processing technique. In eLDA this process consists of three stages, namely, codeword insertion, parse tree generation, and preliminary cleaning.

Codeword Insertion. We aggregate multiple words that represent the same category in the texts by replacing them with a common “codeword”. For example, the words ice-cream, cake, gateau, and pastry are replaced by the codeword “desserts”. This work focuses on constructing restaurant profiles from a publicly available dataset collected from Yelp. To tag the codewords effectively, a bag-of-words dictionary \(\mathbb {B}\) corresponding to each codeword (category) has been generated. Rather than creating the codeword dictionary manually, a convenient route is to crawl multiple recipe, cuisine and professional restaurant review websites with appropriately categorized recipe names. As a result of this data collection process, our bag-of-words dictionary contains over 5000 words corresponding to 10 distinct cuisine categories, namely: American, Continental, Southern, Alcohol, Desserts, Asian, Indo-Arabic, Sandwiches and Breakfast-Brunch. We use the bag-of-words list to replace the words with the corresponding codeword, determined by the context or the business features \(F_i\). In our case \(|F|=10\). For example, we replace the word “pizza” with “American” if the restaurant serves North American cuisine, and with “Continental” if it serves Italian cuisine. Thus, we define the codeword insertion operator \(C(v,\mathbb {B})\) for word v and dictionary \(\mathbb {B}\) as:

$$\begin{aligned} C(v,\mathbb {B})= {\left\{ \begin{array}{ll} \mathbb {B}[v] &{} \mathbb {B}[v] \in F_{i} \\ v &{} \mathbb {B}[v] \not \in F_{i} \\ \end{array}\right. } \end{aligned}$$
(1)

Additionally, to detect whether a restaurant enjoys a location advantage, location specific keywords such as the address of the restaurant have been added to a location codeword dictionary. Each dictionary category label is used as the respective codeword.
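The codeword insertion operator of Eq. 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary excerpt, the feature set, and the function name `insert_codewords` are hypothetical, and a word that belongs to several categories is assumed to map to a list of candidates from which the first one offered by the business is chosen.

```python
def insert_codewords(tokens, codebook, features):
    """Apply C(v, B) token by token for a business with feature set F_i."""
    out = []
    for v in tokens:
        # Candidate categories for v; words outside the dictionary have none.
        candidates = codebook.get(v.lower(), [])
        # Keep the first candidate category that the business actually offers
        # (assumed tie-breaking rule, not specified in the text).
        match = next((c for c in candidates if c in features), None)
        out.append(match if match is not None else v)
    return out

codebook = {  # hypothetical excerpt of the dictionary B
    "pizza": ["American", "Continental"],
    "pepperoni": ["American", "Continental"],
    "gateau": ["Desserts"],
}
italian_place = {"Continental", "Desserts"}  # assumed business features F_i
review = ["The", "pizza", "and", "gateau", "were", "great"]
replaced = insert_codewords(review, codebook, italian_place)
```

Words whose categories do not appear in the business features are left unchanged, which mirrors the second case of Eq. 1.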

Parse Tree Generation. After codeword insertion, we parse the texts in order to generate the parse trees along with Parts Of Speech (POS) tags associated with the words. The POS tags are used to extract the nouns and noun phrases, which are later used in the topic modeling stage, to gain maximum critical information [22]. The parse trees are used for the aspect-level sentiment analysis of the business features in the profile generation stage. We have used the Compositional Vector Grammar based Parser [21, 32] available as a part of the Stanford NLP package for generating parse trees in our dataset. The parse tree attaches a POS tag t and an index j denoting the tree-node position for each of the words obtained from the codeword transformation. Thus, the tree transformation converts each word v into a triplet:

$$\begin{aligned} T(v)=(v,t,j) \end{aligned}$$
(2)

Here, T(v)[i] corresponds to the word v, POS tag t and tree-node position j for \(i \in \{1,2,3\}\), respectively.

Preliminary Cleaning. After generating parse trees, we remove the stop-words, punctuation and special characters in order to retain information critical to topic modeling and sentiment analysis. Finally, we lemmatize and stem the words to their root words in order to improve topic quality and interpretability [3].

Topic Modeling. We formulate the entity Latent Dirichlet Allocation (eLDA) algorithm as an extension of LDA [5] for extracting topics to construct entity-level profiles described as follows:

  1. Extract nouns and noun phrases, i.e., collect from each review all words \(w=C(v,\mathbb {B})\) s.t. \(T(w)[2] \in \{\mathrm {NN}, \mathrm {NNS}, \mathrm {NNP}, \mathrm {NNPS}\}\) after codeword insertion.

  2. Generate M user-level aggregated documents \(E_{u} \forall u \in \mathbb {U}\) and N item-level aggregated documents \(E_{i} \forall i \in \mathbb {I}\) s.t. \(E_{e}\) is the aggregation of all the pre-processed nouns present in the reviews pertaining to the entity \(e \in \mathbb {U} \) or \(e \in \mathbb {I}\).

  3. For each entity aggregation \(E_{e}\):

    (a) Choose \(L\sim \) Poisson\((\xi )\).

    (b) Choose \(\theta \sim \) Dirichlet\((\alpha )\).

    (c) For each of the L nouns \(w_{l}\) in \(E_{e}\):

      i. Choose a topic \(z_{l} \sim \) Multinomial\((\theta )\).

      ii. Choose a noun \(w_{l}\) from \(p(w_{l}|z_{l},\beta )\), a multinomial probability distribution conditioned on the topic \(z_{l}\).
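Steps 1 and 2 of the procedure above can be sketched as follows. The review records, the field names `user`, `item` and `tagged`, and the function name `entity_documents` are assumptions made for illustration; the tagged tokens are taken to be the output of the pre-processing stage.

```python
from collections import defaultdict

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def entity_documents(reviews):
    """Build user-level and item-level noun documents from tagged reviews."""
    user_docs, item_docs = defaultdict(list), defaultdict(list)
    for review in reviews:
        # Step 1: keep only tokens whose POS tag marks a noun.
        nouns = [word for word, tag in review["tagged"] if tag in NOUN_TAGS]
        # Step 2: append the nouns to the documents of both entities involved.
        user_docs[review["user"]].extend(nouns)
        item_docs[review["item"]].extend(nouns)
    return dict(user_docs), dict(item_docs)

reviews = [  # hypothetical pre-processed reviews
    {"user": "u1", "item": "i1",
     "tagged": [("American", "NNP"), ("was", "VBD"), ("tasty", "JJ")]},
    {"user": "u1", "item": "i2",
     "tagged": [("service", "NN"), ("slow", "JJ")]},
    {"user": "u2", "item": "i1",
     "tagged": [("Desserts", "NNS"), ("parking", "NN")]},
]
E_u, E_i = entity_documents(reviews)
```

The same review therefore contributes to two documents, one in the user-level corpus and one in the item-level corpus.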

We obtain the probability of \(\mathbf w _{u}\), \(\mathbf w _{i}\) (which are the term frequency forms of \(E_{u}\) and \(E_{i}\)), the user-level corpus \(D_{u}\) and item-level corpus \(D_{i}\) of entity-level aggregations as:

$$\begin{aligned} p(\mathbf w _u|\alpha ,\beta )= \int p(\theta _{u}|\alpha ) \left( \prod _{l=1}^{L_{u}}\sum _{z_{u_{l}}}p(z_{u_{l}}|\theta _{u})p(w_{u_{l}}|z_{u_{l}},\beta )\right) d\theta _{u} \end{aligned}$$
(3)
$$\begin{aligned} p(\mathbf w _i|\alpha ,\beta )= \int p(\theta _{i}|\alpha ) \left( \prod _{l=1}^{L_{i}}\sum _{z_{i_{l}}}p(z_{i_{l}}|\theta _{i})p(w_{i_{l}}|z_{i_{l}},\beta )\right) d\theta _{i} \end{aligned}$$
(4)
$$\begin{aligned} p(D_{u}|\alpha ,\beta )=\prod _{u\in \mathbb {U}} \int p(\theta _{u}|\alpha ) \left( \prod _{l=1}^{L_{u}}\sum _{z_{u_{l}}}p(z_{u_{l}}|\theta _{u})p(w_{u_{l}}|z_{u_{l}},\beta )\right) d\theta _{u} \end{aligned}$$
(5)
$$\begin{aligned} p(D_{i}|\alpha ,\beta )=\prod _{i\in \mathbb {I}} \int p(\theta _{i}|\alpha ) \left( \prod _{l=1}^{L_{i}}\sum _{z_{i_{l}}}p(z_{i_{l}}|\theta _{i})p(w_{i_{l}}|z_{i_{l}},\beta )\right) d\theta _{i} \end{aligned}$$
(6)

Finally, the posterior distribution of the topic proportions \(\theta _{e}\) and topic assignments \(\mathbf z _{e}\) for an entity aggregation \(E_{e}\) is given by:

$$\begin{aligned} p(\theta _{e},\mathbf z _e|\mathbf w _{e},\alpha ,\beta )=\frac{p(\theta _{e},\mathbf z _{e},\mathbf w _{e}|\alpha ,\beta )}{p(\mathbf w _{e}|\alpha ,\beta )} \end{aligned}$$
(7)

We use Gibbs sampling to approximate the posterior and infer the topic distribution, since the integral in Eq. 3 is intractable to compute exactly in general. Finally, we extract a set of topics \(T_{u}\), \(T_{i}\) from document corpora \(D_{u}\), \(D_{i}\) and the associated document-topic probability matrices \(P_{u}\), \(P_{i}\).
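To make the inference step more tangible, the sketch below implements a collapsed Gibbs sampler for LDA on a toy corpus of word ids. The text does not state which Gibbs variant is used, so the collapsed form, the symmetric hyperparameters, the iteration count, and the function name `gibbs_lda` are all assumptions.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=100, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]                # document-topic counts
    nkw = [[0] * vocab_size for _ in range(n_topics)]   # topic-word counts
    nk = [0] * n_topics                                 # words per topic
    z = []
    for d, doc in enumerate(docs):                      # random initial assignment
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for l, w in enumerate(doc):
                k = z[d][l]                             # remove current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Conditional p(z = t | all other assignments), up to a constant.
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + vocab_size * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][l] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # Point estimates of document-topic and topic-word distributions.
    theta = [[(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha)
              for t in range(n_topics)] for d, doc in enumerate(docs)]
    phi = [[(nkw[t][w] + beta) / (nk[t] + vocab_size * beta)
            for w in range(vocab_size)] for t in range(n_topics)]
    return theta, phi

toy_docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 2, 3]]  # assumed toy entity documents
theta, phi = gibbs_lda(toy_docs, n_topics=2, vocab_size=4)
```

Here the rows of `theta` play the role of the document-topic probability matrices \(P_{u}\) and \(P_{i}\), with one row per entity aggregation.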

3.3 Entity Profile Generation

The entity profile generation stage is executed in three steps: topic to aspect mapping, aspect-level sentiment analysis, and profile generation.

Topic to Aspect Mapping. The topics extracted by eLDA from the user-level document corpus \(D_{u}\) and the business-level document corpus \(D_{i}\) are mapped to k domain-specific profile categories, where the mapping is defined as:

$$\begin{aligned} \mathcal {M}:\{T_{u},T_{i}\} \rightarrow \varPi \end{aligned}$$
(8)

Here, \(\varPi \) refers to the domain specific set of categories. In general, \(|\varPi |=k \le \max \{|T_{u}|,|T_{i}|\}\).

Aspect Level Sentiment Analysis. We use the parse tree transformation \(T(v)=(v,t,j)\) generated in the pre-processing step for aspect identification and the corresponding sentiment classification. We use a supervised classifier \(\mathbb {K}\) trained on a sentiment-labeled review dataset for a binary (positive/negative) classification of the sentiments. It is worth noting that we do not apply sentiment classification for User eLDA, as that would misrepresent the user profiles. We argue that a user mentions a particular aspect of a restaurant in a review only if he/she is concerned about that aspect. Hence, the user’s interest in the identified aspects is always treated as positive.

The aspect level sentiment analysis proceeds as follows:

For every item-level document \(d_{i} \in D_{i}, \forall i \in \mathbb {I}\):

  1. Aspects \(A_{i} \leftarrow \{v \mid T(v)=(v,t,j),\ t \in \{\mathrm {NN}, \mathrm {NNS}, \mathrm {NNP}, \mathrm {NNPS}\}\}\)

  2. \(temp \leftarrow \{\}\)

  3. For each aspect v in \(A_{i}\), with \(T(v)=(v,t,j)\):

    (a) Generate the dependency text \(\delta \) as the aggregation of the children of v, i.e., aggregate all words u with \(T(u)=(u,t',j')\) s.t. \(j' \in children[j]\).

    (b) After the preliminary cleaning described in Sect. 3.2, generate the document-term frequency matrix \(dtm_{i}\) for \(\delta \).

    (c) Use classifier \(\mathbb {K}\) to obtain the probability of positive sentiment \(s_{i}=\mathbb {K}(dtm_{i})\).

    (d) Store \(s_{i}\) corresponding to v in temp s.t. \(temp[v]=s_{i}\).

  4. Use Eq. 8 to map the individual aspects in \(A_{i}\) to the k profile categories and aggregate the sentiment probabilities for all aspects under each category as an arithmetic mean. Thus, \(temp,\mathcal {M}\rightarrow S_{i}\).

  5. Store \(S_{i}\) in S.
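The procedure above can be sketched as follows for a single item-level document. The lexicon-based `toy_classifier` is only a hypothetical stand-in for the trained classifier \(\mathbb {K}\), it takes the dependency words directly instead of a document-term matrix, and the triples, child indices and aspect-to-category mapping are assumed example data.

```python
from collections import defaultdict

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def toy_classifier(words):
    """Hypothetical stand-in for K: returns an assumed P(positive)."""
    positive, negative = {"great", "fresh"}, {"slow", "cold"}
    score = sum((w in positive) - (w in negative) for w in words)
    return min(1.0, max(0.0, 0.5 + 0.25 * score))

def aspect_sentiments(triples, children, aspect_to_category, classifier):
    """Return the mean positive-sentiment probability per profile category."""
    word_at = {j: v for v, _, j in triples}
    temp = {}
    for v, t, j in triples:
        if t not in NOUN_TAGS:          # step 1: aspects are nouns only
            continue
        # Step 3(a): dependency text = words at the child nodes of the aspect.
        delta = [word_at[c] for c in children.get(j, []) if c in word_at]
        temp[v] = classifier(delta)     # steps 3(c) and 3(d)
    by_category = defaultdict(list)
    for aspect, s in temp.items():      # step 4: map aspects to categories
        if aspect in aspect_to_category:
            by_category[aspect_to_category[aspect]].append(s)
    return {cat: sum(vals) / len(vals) for cat, vals in by_category.items()}

triples = [("American", "NNP", 0), ("great", "JJ", 1),
           ("service", "NN", 2), ("slow", "JJ", 3), ("cold", "JJ", 4)]
children = {0: [1], 2: [3, 4]}          # assumed parse-tree child indices
mapping = {"American": "Food", "service": "Service"}
S_i = aspect_sentiments(triples, children, mapping, toy_classifier)
```

In this toy example the "Food" category receives a high positive probability and the "Service" category a low one, which is the kind of per-category signal that feeds the strength/weakness classification.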

Profile Generation. We aggregate the topic probabilities \(P_{u}\) and \(P_{i}\) using the mapping \(\mathcal {M}\) given in Eq. 8 to obtain the k-dimensional profile category probabilities \(\mathbb {P}_{u}\) and \(\mathbb {P}_{i}\).

The user profile \(U_{u}\) is defined as a k-dimensional numerical vector whose k elements denote the probability that the user u is interested in each of the k categories in \(\varPi \). Thus,

$$\begin{aligned} U_{u}=[\mathbb {P}_{ur}]_{k \times 1} \end{aligned}$$
(9)

where \(\mathbb {P}_{ur}\) is the probability that user u is interested in category \(\varPi _{r}\).
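A minimal sketch of this aggregation is given below. It assumes that the topic probabilities of an entity are summed within each category, which is one natural reading of "aggregate"; the category names, the topic-to-category mapping and the function name `profile_vector` are hypothetical.

```python
def profile_vector(topic_probs, topic_to_category, categories):
    """Sum the topic probabilities of one entity within each profile category."""
    totals = {c: 0.0 for c in categories}
    for topic, p in enumerate(topic_probs):
        totals[topic_to_category[topic]] += p
    return [totals[c] for c in categories]

categories = ["Food", "Service", "Ambience"]               # assumed Pi, k = 3
topic_to_category = ["Food", "Service", "Food", "Ambience"]  # assumed mapping M
topic_probs = [0.4, 0.1, 0.3, 0.2]                          # one row of P_u
U_u = profile_vector(topic_probs, topic_to_category, categories)
```

Because each row of \(P_{u}\) sums to one and every topic is mapped to exactly one category, the resulting profile vector also sums to one under this assumption.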

The business profile \(I_{i}\) is defined analogously to the user profile, with the additional incorporation of the normalized sentiment score \(S_{i}\). The normalized sentiment score attaches a sentiment polarity to the existing profile topics, which makes it possible to capture the positive and negative aspects of a business. Thus, the business profile is defined as:

$$\begin{aligned} I_{i}=[\mathbb {P}_{is} \times f_{is}]_{k \times 1} \end{aligned}$$
(10)

where \(\mathbb {P}_{is}\) is the probability that business i is associated with category \(\varPi _{s}\) and \(S_{is}\) is the overall sentiment of business i with respect to category \(\varPi _{s}\). Also, \(f_{is}\) is the sentiment polarity of item i with respect to category s defined as:

$$\begin{aligned} f_{is}= {\left\{ \begin{array}{ll} -1 &{} S_{is} \in [0,l_{1}] \\ 1 &{} S_{is} \in [l_{1},l_{2}] \\ 2 &{} S_{is} \in [l_{2},1] \\ \end{array}\right. } \end{aligned}$$
(11)

Here, \(l_{1}\) and \(l_{2}\) are the experimentally decided cut-offs for the negative and neutral sentiments. Thus, a sentiment polarity of \(-1\) denotes a negative sentiment, 1 denotes neutral sentiment and 2 denotes a positive sentiment.
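Eqs. 10 and 11 can be illustrated with the following sketch. The cut-off values 0.4 and 0.6 are placeholders rather than the experimentally chosen ones, and since the intervals in Eq. 11 share their end points, the sketch assumes that a score exactly at a cut-off falls into the upper interval.

```python
def polarity(s, l1, l2):
    """Sentiment polarity f_is for a sentiment score s in [0, 1]."""
    if s < l1:
        return -1   # negative
    if s < l2:
        return 1    # neutral
    return 2        # positive

def item_profile(category_probs, sentiments, l1, l2):
    """Business profile: category probability times sentiment polarity."""
    return [p * polarity(s, l1, l2) for p, s in zip(category_probs, sentiments)]

P_i = [0.5, 0.3, 0.2]   # assumed category probabilities for one business
S_i = [0.9, 0.5, 0.1]   # assumed per-category sentiment scores
I_i = item_profile(P_i, S_i, l1=0.4, l2=0.6)
```

A negative entry in the profile thus flags a category that is prominent in the reviews but perceived as a weakness.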

4 Dataset Description

In this study, we use the publicly available dataset released by Yelp (see footnote 1). The dataset contains user-item interactions in the form of ratings and textual user reviews along with business features. The dataset contains 144,072 businesses, located in the United States of America (USA), the United Kingdom, Canada and Germany. These businesses are categorized into 1,191 categories such as restaurants, nightlife, religious organizations, etc. Out of 45,472 users, we consider only users who have reviewed at least five businesses globally, which reduces the number of users to 11,609 and the number of businesses to 1,983. Without loss of generality, we consider the reviews of businesses tagged as “restaurants” and located in the city of Phoenix, USA. The basic dataset description is provided in Table 1.

Table 1. Dataset description

5 Experimental Findings

In this section, we discuss the experiments conducted to evaluate our proposed topic modeling algorithm (eLDA) against benchmark models.

5.1 Benchmark Models

We have chosen three unsupervised models TF-IDF, standard LDA and local-LDA as benchmark models for evaluating the performance of our proposed model.

  1. Term Frequency-Inverse Document Frequency (TF-IDF) [28]: This algorithm weights the term frequency (TF) by the logarithm of the inverse of the proportion of documents containing that term. A document-term matrix is then generated, with each document row containing the TF-IDF scores for all the terms in the vocabulary. The K-means clustering technique is then employed for clustering the documents.

  2. Standard LDA [5]: This algorithm treats each review as an individual document. Each document is further modeled as a probabilistic mixture of topics, which themselves are distributions over the terms present in the vocabulary. This is a soft clustering technique that yields a probability distribution over topics for each document.

  3. Local LDA [6]: This algorithm is similar to standard LDA but treats each sentence as a document.

All the benchmark models are applied on review text which has been processed for removal of stopwords, numbers and punctuation marks. The corpus is then stemmed and lemmatized to obtain root words.

5.2 Intermediate Models

We introduce two intermediate models, “LDA Noun-Codeword” and “LDA Noun”, to analyze the effects of the noun extraction and codeword insertion steps in isolation. “LDA Noun-Codeword” applies LDA on noun-codewords (some nouns replaced with noun-codewords), where each document corresponds to one review. “LDA Noun” applies LDA on nouns extracted from the reviews. In the forthcoming sections, we compare the performance of our proposed models, User eLDA and Item eLDA, as well as the intermediate models, against the benchmark models.

5.3 Metrics for Evaluation

For evaluating the quality of the derived topics, we consider metrics such as perplexity, silhouette coefficient, word intrusion score and topic labeling. Quantitative metrics such as perplexity and silhouette coefficient are used to validate the model performance whereas word intrusion and topic labeling are qualitative measures used for evaluating the quality of the derived topics.

  1. Perplexity: Perplexity measures the predictive performance of a model on a set of unobserved documents. It is a monotonically decreasing function of the log-likelihood of the unseen documents and is defined as follows:

    $$\begin{aligned} Perplexity (D_{test}) = \exp \left( - \frac{\sum _{d\in \mathbb {T}} \log p(w_{d})}{\sum _{d\in \mathbb {T}} N_{d}} \right) \end{aligned}$$
    (12)

    where \(D_{test}\) is the test corpus containing documents belonging to the test set \(\mathbb {T}\), \(p(w_{d})\) is the likelihood of the words in document d, and \(N_{d}\) is the number of words in document d. Lower values of perplexity imply better performance of the model. Since TF-IDF does not have a log-likelihood component, perplexity cannot be used to validate TF-IDF.

  2. Silhouette Coefficient: The silhouette coefficient measures how well a document fits its assigned cluster compared to the other clusters. For a given document i, the silhouette coefficient is given as

    $$\begin{aligned} s(i) = \frac{b(i) - a(i)}{\max \{{a(i),b(i)}\}} \end{aligned}$$
    (13)

    where a(i) is the average distance between document i and all other documents in the same cluster, and b(i) is the lowest average distance between document i and the documents of any other cluster. The silhouette coefficient ranges from −1 to 1, where 1 implies that a document is better matched to its own cluster than to the other clusters and −1 implies the opposite.

  3. Word Intrusion: The word intrusion task aids in quantitatively measuring the coherence of the topics [8]. In the word intrusion task, a subject is presented with a set of the most probable terms from a topic along with a randomly selected intruder term that does not belong to that topic. The task of the subject is to find the intruder term. For example, for a given set of words like \(\{summer, winter, spring, autumn, dog\}\), a subject can easily identify dog as the intruder since all other words refer to seasons. When the terms in a topic lack such coherence, it becomes very difficult to identify the intruder term and subjects may choose the intruder at random. Model precision is defined as follows:

    $$\begin{aligned} MP_{k}^{m} = \frac{1}{S} \sum _{s} \mathbbm {1} (i_{k,s}^{m} =w_{k}^{m}) \end{aligned}$$
    (14)

    where \(MP_{k}^{m}\) is the word intrusion score generated for topic k inferred from model m, and S is the number of subjects. Also, \(i_{k,s}^{m}\) is the intruder word chosen by subject s and \(w_{k}^{m}\) is the actual intruder word. \(\mathbbm {1}(\cdot )\) is equal to 1 if the intruder word identified by the subject is the same as the actual intruder word \(w_{k}^{m}\) and 0 otherwise. The final word intrusion score for a model is the average word intrusion score over all the topics.

  4. Topic Labeling: Topic labeling is the process of tagging each of the topics with a domain-specific name. The degree to which subjects agree on the labels assigned to the topics can be quantified using Kappa statistics [31]. Cohen’s Kappa is defined as follows:

    $$\begin{aligned} \kappa = \frac{P_{0} - P_{c}}{1-P_{c}} \end{aligned}$$
    (15)

    where \(P_{0}\) is the proportion of mutual agreement and \(P_{c}\) is the proportion of agreement by chance. The values of Kappa range from −1 to 1, where 1 represents perfect agreement, −1 perfect disagreement and 0 agreement by chance.

5.4 Model Performance

Predictive Power. In this section, we evaluate the performance of intermediate models against benchmark models. To validate the impact of using noun codewords to extract topics, we first compare the performance of the intermediate models LDA Noun and LDA Noun-Codeword with the benchmark models. Table 2 summarizes the perplexities on the held-out documents as the number of topics (K) is varied from 5 to 100. Note that TF-IDF is not reported in Table 2, as it has no log-likelihood component from which to measure perplexity.

Table 2. Performance comparison with model perplexity

As we can see from Table 2, LDA Noun-Codeword significantly outperforms the benchmark models. This can be attributed to two reasons. Firstly, replacing most of the food terms among the nouns with codewords reduces the number of terms in the vocabulary. Secondly, it reduces the ambiguity in assigning food terms to specific cuisine topics. For example, “pizza”, being a very popular food term, can belong to multiple topics like “American”, “Continental”, etc., and assigning “pizza” to the relevant topic for a given document becomes challenging. However, information about the cuisines offered by each restaurant can be used to replace these food terms with cuisine terms. For example, “pizza” can be replaced with “American” if the restaurant serves American food. The task is now simplified to assigning these codewords to cuisine topics, and therefore identifying each entity’s cuisine preferences becomes easier. It also helps to uncover other aspects of the restaurants, such as deals and hangout places, which would otherwise remain hidden.
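
The codeword replacement described above can be sketched as follows. This is a minimal illustration only: the lexicon `FOOD_CODEWORDS`, the function name `insert_codewords`, and the simple tokenizer are hypothetical stand-ins, not the dictionary or pipeline used in our experiments.

```python
import re

# Hypothetical lexicon: food term -> {cuisine offered by the restaurant -> codeword}.
# Terms and codewords are illustrative assumptions, not the authors' dictionary.
FOOD_CODEWORDS = {
    "pizza": {"american": "AMERICAN", "italian": "CONTINENTAL"},
    "pepperoni": {"american": "AMERICAN", "italian": "CONTINENTAL"},
}


def insert_codewords(text, restaurant_cuisines):
    """Replace each known food term with the codeword matching the restaurant's cuisines."""
    cuisines = {c.lower() for c in restaurant_cuisines}
    tokens = []
    for token in re.findall(r"[a-z']+", text.lower()):
        mapping = FOOD_CODEWORDS.get(token, {})
        # Use the first codeword whose cuisine the restaurant actually offers.
        codeword = next((cw for cuisine, cw in mapping.items() if cuisine in cuisines), None)
        tokens.append(codeword if codeword else token)
    return tokens


print(insert_codewords("Great pizza and friendly service", ["American"]))
print(insert_codewords("Great pizza and friendly service", ["Italian"]))
```

A food term is left untouched when the restaurant offers none of the cuisines listed for it, so the vocabulary shrinks only where the business metadata resolves the ambiguity.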

Cluster Quality. Table 3 summarizes the silhouette coefficients for all the models for varied numbers of topics. For LDA-based models, each document is considered as one data point and its most probable topic as the cluster to which the document is assigned. Each document’s topic probabilities are used as its feature vector for computing distance. For TF-IDF, the document-term matrix with TF-IDF scores is used for measuring distance. We use Euclidean distance to compute the distance between the documents for all the models.
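
As a rough sketch of this procedure for the LDA-based models, the snippet below assigns each document to its most probable topic and computes the silhouette value of Eq. (13) with Euclidean distance. The function name `silhouette_scores` and the toy topic distributions are assumptions for illustration; returning 0 when a document is alone in its cluster (or when only one cluster exists) is a common convention rather than something specified in the text.

```python
import math


def silhouette_scores(doc_topic_probs):
    """Per-document silhouette s(i), with hard clusters given by the most probable topic."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in doc_topic_probs]
    n = len(doc_topic_probs)
    scores = []
    for i in range(n):
        # Distances from document i to every other document, grouped by cluster label.
        by_cluster = {}
        for j in range(n):
            if j != i:
                d = math.dist(doc_topic_probs[i], doc_topic_probs[j])
                by_cluster.setdefault(labels[j], []).append(d)
        own = by_cluster.pop(labels[i], None)
        if own is None or not by_cluster:
            scores.append(0.0)  # assumed convention for singleton or single-cluster cases
            continue
        a = sum(own) / len(own)                                    # a(i): own-cluster mean
        b = min(sum(ds) / len(ds) for ds in by_cluster.values())   # b(i): nearest other cluster
        scores.append((b - a) / max(a, b))
    return scores


# Toy document-topic distributions (made up for illustration).
docs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
scores = silhouette_scores(docs)
print(sum(scores) / len(scores))  # model-level score: mean over documents
```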

Table 3. Performance comparison with silhouette coefficient

As seen from Table 3, Local LDA has the highest silhouette coefficient among the models. One should note that the silhouette coefficient assumes hard clustering, where one document can be assigned to only one cluster. The silhouette coefficient for Local LDA is relatively high because Local LDA treats each sentence as one document, and the number of topics discussed in a sentence is much lower than the number of topics discussed in an entire review. Therefore, the silhouette coefficient alone is not a sufficient determinant of model performance. To compare the quality of the topics, we use other metrics as well.

Performance of Entity Level Models. To determine the efficacy of the proposed approaches (User eLDA, Item eLDA and LDA Noun-Codeword), perplexity and silhouette coefficient are measured for different numbers of topics.

Fig. 2. Performance comparison with model perplexity

Fig. 3. Performance comparison with silhouette coefficient

Figure 2 shows that LDA Noun-Codeword achieves the lowest perplexity for higher values of K. However, from Fig. 3, it can be observed that Item eLDA achieves better silhouette coefficients than the other models. Figure 4 compares the performance of the Noun-Codeword models for 50 topics with respect to both perplexity and silhouette coefficient. Model performance is said to be better if the model perplexity is low and the silhouette coefficient is high. To test the statistical significance of the differences between models in terms of perplexity and silhouette coefficient, we perform a one-tailed two-sample t-test across the different numbers of topics. Table 4 summarizes the results of the statistical test.

Fig. 4. Performance comparison of Noun-Codeword models (k = 50)

Table 4. Performance comparison with statistical significance

The statistical test on User eLDA perplexity yields a p-value of 0.0007, which signifies that the mean perplexity for User eLDA is statistically higher than that of LDA Noun-Codeword. The statistical test on Item eLDA perplexity yields a p-value of 0.25, which implies that the mean perplexity for Item eLDA is statistically lower than or not different from that of LDA Noun-Codeword. Although the perplexity of User eLDA is higher than that of LDA Noun-Codeword, User eLDA still substantially outperforms the traditional models (65.23% and 69.46% reduction in perplexity compared to Local LDA and Standard LDA, respectively).

The statistical tests on the silhouette coefficient yield p-values of 0.41 and 0.87 for User eLDA and Item eLDA, respectively, and therefore signify that the mean silhouette coefficients for User eLDA and Item eLDA are statistically higher than or not different from that of LDA Noun-Codeword.
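
For readers who wish to reproduce this kind of comparison, the sketch below computes a two-sample t statistic from two sets of per-K scores. It assumes the unequal-variance (Welch) form, since the exact variant of the test is not specified above, and the sample values are made up. The one-tailed p-value would then be read from a Student's t distribution with the returned degrees of freedom (for example with a statistics package), which this sketch does not do.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for mean(a) - mean(b)."""
    va = variance(sample_a) / len(sample_a)  # squared standard error of mean(a)
    vb = variance(sample_b) / len(sample_b)  # squared standard error of mean(b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1))
    return t, df


# Hypothetical perplexities at several values of K (not the values behind Table 4).
entity_model = [620.0, 540.0, 500.0, 470.0]
review_model = [600.0, 500.0, 450.0, 420.0]
t_stat, dof = welch_t(entity_model, review_model)
print(t_stat, dof)  # a large positive t supports "entity_model mean is higher"
```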

The benefits of aggregating the noun-codewords at the entity level (user/item) are multifaceted. Firstly, the size of the document-term matrix reduces drastically through aggregation, so a larger number of terms can be accommodated in the document-term matrix. The number of documents in the document-term matrix is reduced by 77.34% and 96.12% for User eLDA and Item eLDA, respectively, compared to LDA Noun-Codeword. Secondly, since the size of the document-term matrix is reduced, a reduction in computational time can also be expected. Lastly, profiling the entities becomes much simpler, as each document corresponds to an individual entity and the topic probabilities for each document can simply be used to represent the entity’s preference vector.
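
The aggregation step can be illustrated with a small sketch. The record layout (`user`, `item`, `tokens`) and the helper names are assumptions made for this example, and the four toy reviews are not drawn from the dataset.

```python
from collections import defaultdict


def aggregate_by_entity(reviews, key):
    """Concatenate the noun/codeword tokens of all reviews sharing the same entity id."""
    docs = defaultdict(list)
    for review in reviews:
        docs[review[key]].extend(review["tokens"])
    return dict(docs)


def document_reduction(n_reviews, n_entity_docs):
    """Percentage reduction in the number of rows of the document-term matrix."""
    return 100.0 * (n_reviews - n_entity_docs) / n_reviews


# Toy pre-processed reviews (illustrative only).
reviews = [
    {"user": "u1", "item": "r1", "tokens": ["AMERICAN", "service"]},
    {"user": "u1", "item": "r2", "tokens": ["ALCOHOL", "price"]},
    {"user": "u2", "item": "r1", "tokens": ["AMERICAN", "location"]},
    {"user": "u3", "item": "r1", "tokens": ["service", "karaoke"]},
]
user_docs = aggregate_by_entity(reviews, "user")  # one document per user
item_docs = aggregate_by_entity(reviews, "item")  # one document per restaurant
print(document_reduction(len(reviews), len(user_docs)))
print(document_reduction(len(reviews), len(item_docs)))
```

Each aggregated document then plays the role of a single row in the document-term matrix, which is why the row count drops to the number of distinct users or items.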

Fig. 5. Perplexity as a function of number of topics

Selecting the Number of Topics (K). To choose the optimal number of topics (K), held-out perplexity is plotted as a function of the number of topics for the Noun-Codeword models. As seen from Fig. 5, the perplexity of the models decreases as the number of topics grows, but remains stable beyond K = 50. Therefore, we pick 50 as the optimal number of topics for our dataset.
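
The selection above was made by inspecting the curve. One way such a plateau could be detected programmatically is sketched below; the 1% tolerance, the function name `pick_num_topics`, and the perplexity values are all assumptions for illustration and do not reproduce Fig. 5.

```python
def pick_num_topics(perplexity_by_k, tol=0.01):
    """Return the smallest K after which every relative perplexity drop stays below tol."""
    ks = sorted(perplexity_by_k)
    for idx, k in enumerate(ks[:-1]):
        later_drops = [
            (perplexity_by_k[ks[j]] - perplexity_by_k[ks[j + 1]]) / perplexity_by_k[ks[j]]
            for j in range(idx, len(ks) - 1)
        ]
        if all(drop < tol for drop in later_drops):
            return k
    return ks[-1]  # no plateau found: fall back to the largest K tried


# Made-up values for illustration only; they are not the results in Fig. 5.
curve = {5: 900.0, 10: 700.0, 25: 520.0, 50: 400.0, 75: 398.0, 100: 397.0}
print(pick_num_topics(curve))
```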

5.5 Quality of Topics

The metrics discussed in Sect. 5.4 are crucial for evaluating a model’s performance. However, it is equally important to validate the quality of the derived topics. Herein, we present word clouds for each of the models, where the font size corresponds to the probability of the terms occurring in the topics. We then validate the quality of the topics using the word intrusion and topic labeling techniques.

Word Clouds. Due to space constraints, five random topics are selected from each of the models and are presented using word clouds. The size of each term corresponds to the term probability for a given topic. The five topics include the following: cuisine (2), location (1), customer service (1) and order delivery (1). As seen from Fig. 6, for the Noun-Codeword models, i.e., LDA Noun-Codeword, User eLDA and Item eLDA, the term probabilities of noun codewords in cuisine topics (“Continental”, “American”) are exceptionally high. Similarly, the key terms for other topics such as location, customer service and order delivery have relatively higher probabilities compared to other models. This visual representation depicts the relevance of terms in a topic for the Noun-Codeword models, as the term probabilities are relatively higher compared to other models.

Fig. 6. Word cloud

Topic Labeling. Topic labeling is a crucial step for understanding the context of the derived topics. Manually labeling topics ensures high labeling quality [8]. We selected the top 15 topics (based on mean topic probabilities) from each model and employed a random user to manually label the topics. Two subjects were then asked to rate these topic labels as “relevant” or “irrelevant”. Cohen’s Kappa is used to measure the inter-rater agreement for the labeled topics. Table 5 shows the Kappa statistics for different models.

Table 5. Cohen’s Kappa for topic labeling

It can be observed from Table 5 that the Kappa statistic is high for the Noun-Codeword models (User eLDA, Item eLDA, LDA Noun-Codeword). This indicates that our proposed models are able to generate high-quality topics, thereby reducing the chances of mislabeling the topics.
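
For concreteness, Cohen’s Kappa of Eq. (15) for two raters can be computed as in the following sketch. The function name and the five toy ratings are illustrative assumptions; they are not the ratings collected in our study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labeling the same items."""
    n = len(rater_a)
    # P_0: observed proportion of items on which the raters agree.
    p0 = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # P_c: agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    pc = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if pc == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (p0 - pc) / (1 - pc)


# Toy ratings of five topic labels by two subjects (made up for illustration).
subject_1 = ["relevant", "relevant", "relevant", "irrelevant", "irrelevant"]
subject_2 = ["relevant", "relevant", "irrelevant", "irrelevant", "irrelevant"]
print(cohens_kappa(subject_1, subject_2))
```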

Word Intrusion. To evaluate the coherence of terms in topics, the word intrusion task is performed. The top five terms by term probability are chosen from each of five topics, for all the models (shown in Fig. 6). For each of the topics, a low-probability term from the same topic is chosen as the intruder word. These sets of terms are presented to 15 subjects. The task of the subjects is to find the intruder word in the given set of words. Since the probability of the cuisine-specific term in cuisine topics for the Noun-Codeword models is extremely high, the cuisine term in isolation is sufficient for representing the entire topic. For example, in Fig. 6, User eLDA Topic 4 shows that the term “Continental” has a very high probability and is representative of the entire topic. This eliminates the need for performing the word intrusion task for cuisine topics for the Noun-Codeword models, and we argue that the model precision is always 1 for cuisine topics. For the benchmark models, we perform word intrusion on five topics, whereas for the Noun-Codeword models (User eLDA, Item eLDA and LDA Noun-Codeword) we perform the word intrusion task on only the three non-cuisine topics (location, customer service and order delivery). Figure 7 presents model precision for all the models.

Fig. 7. Performance comparison with model precision

Figure 7 shows that model precision is high for the Noun-Codeword models (User eLDA, Item eLDA and LDA Noun-Codeword), which demonstrates better semantic coherence in the inferred topics.
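
The model precision of Eq. (14) and its average over topics can be sketched as below. The function names and the subject responses are hypothetical and serve only to show how the indicator is averaged.

```python
def model_precision(choices, true_intruder):
    """MP for one topic: fraction of subjects whose choice equals the true intruder."""
    return sum(choice == true_intruder for choice in choices) / len(choices)


def model_word_intrusion_score(choices_by_topic, intruder_by_topic):
    """Final score for a model: mean of the per-topic model precisions."""
    scores = [
        model_precision(choices_by_topic[topic], intruder_by_topic[topic])
        for topic in intruder_by_topic
    ]
    return sum(scores) / len(scores)


# Hypothetical responses: four subjects on one topic, two subjects on another.
choices = {
    "seasons": ["dog", "dog", "winter", "dog"],
    "service": ["tv", "waiter"],
}
intruders = {"seasons": "dog", "service": "tv"}
print(model_word_intrusion_score(choices, intruders))
```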

5.6 Sentiment Analysis

Sentiment Classifier Selection. We implement supervised machine learning sentiment classification algorithms to classify the sentiments associated with each of the aspects as positive or negative. We evaluate supervised classification algorithms commonly cited in the sentiment analysis literature, including logistic regression, K-Nearest Neighbours (KNN), Gradient Boosting classifier, Random Forest classifier and Naive Bayes classifier [24, 25, 29], to test their efficacy in classifying restaurant reviews. We use a dataset containing 1000 unique Yelp restaurant reviews (see footnote 2) with sentiment labels for training our classifiers. The standard precision, recall, F-measure, and accuracy of the models, averaged over five random 80%–20% train-test splits, are shown in Table 6.

Table 6. Sentiment classifier model performance

Clearly, logistic regression outperforms the other algorithms in terms of F-measure score and overall accuracy. Therefore we select logistic regression as the sentiment classifier (\(\mathbb {K}\)) because of its high accuracy and well-balanced performance in terms of precision and recall.
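
The reported metrics can be computed from predictions as in the sketch below, which also averages them over several splits. It assumes that the F-measure is the balanced F1 score and that label 1 denotes positive sentiment; the helper names and the toy labels are illustrative, and the classifier training itself is not shown.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 and accuracy for one train-test split."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def average_over_splits(per_split_metrics):
    """Average each metric over the random splits."""
    keys = per_split_metrics[0].keys()
    return {k: sum(m[k] for m in per_split_metrics) / len(per_split_metrics) for k in keys}


# Toy labels and predictions for two splits (made up for illustration).
split_1 = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
split_2 = binary_metrics([1, 0, 0, 1], [1, 0, 0, 1])
print(average_over_splits([split_1, split_2]))
```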

6 Applications of Entity Profiles

6.1 Example of Profile Generation

The user and business profiles generated using our methodology have a multitude of potential applications. Our methodology generates user and business profiles from review texts, thus reflecting the popular preferences and perspectives about the businesses. We demonstrate one example each of profiling a user and a business using sample reviews taken from the Yelp dataset.

An Example of User Profiling. Consider a snippet of the user-level document for one user: ‘Decent food, decent beer, fast service what more do you want? Price was fair too ! Some TVs to catch sports. Nothing special but gets the job done. Okay food, drinks, waitresses etc. The highlight is the karaoke...’

We can intuitively say that this user is interested in food, drinks, service, prices and overall ambience based on the fact that he/she has mentioned these attributes. We demonstrate step-wise generation of the profile for this particular user:

  1. 1.

    After the pre-processing stage, the nouns and codewords are extracted, and the above text is transformed into: “AMERICAN ALCOHOL service price tvs sports nothing job AMERICAN drinks waitresses highlight karaoke....”. The text now contains only nouns and noun phrases with the appropriate codewords, “AMERICAN” and “ALCOHOL” in this case, which denote the corresponding food categories.

  2. 2.

    In the topic modelling stage, the pre-processed text is run through the User eLDA model to extract the corresponding topics \(T_{u}\) and the K-dimensional probability distribution vector \(p_{u}\).

  3. 3.

    In the profile generation stage, a manually generated mapping \(\mathcal {M}\) is used to map the \(K=50\) topics in \(T_{u}\) to the k user profile categories. In our case, we have \(k=15\) categories, corresponding to \(\varPi = \{\)american, continental, asian, indo-arabic, desserts, sandwiches, alcohol, nightlife, breakfast-brunch, southern, ambience, price, restaurant quality, service efficiency, location\(\}\). For example, the mapping \(\mathcal {M}\) would map the topics “good waiter”, “bad service” and “bad waiter” to the category “service efficiency”. Each entry of the k-dimensional category probability vector \(P_{u}\) is then obtained by summing the entries of the K-dimensional topic probability vector \(p_{u}\) for the topics mapped to that category.

  4. 4.

    The user profile \(U_{u}\) is generated as: \([0.147,0,0,0,0,0,0.031,0,0,0.039,0,0,0.031,0.112,0.147]\), where \(U_u[i]\) corresponds to \(\varPi _{i}\) for all \(i\in [1,15]\).

    The probabilities for most of the categories in this example are very low and hence were rounded to zero. We can directly infer that the user has a very high interest in American cuisine (0.147), restaurant location (0.147) and service efficiency (0.112), while being mindful of restaurant quality (0.031), alcohol (0.031), and southern cuisine (0.039). This finding corroborates our initial intuition about the user.
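
The aggregation in the profile generation stage can be sketched as follows. The category order follows the list given above; the five-topic distribution and the `topic_to_category` mapping are toy stand-ins for the K = 50 topics and the manual mapping, and ignoring unmapped topics is an assumption of this sketch.

```python
CATEGORIES = [
    "american", "continental", "asian", "indo-arabic", "desserts",
    "sandwiches", "alcohol", "nightlife", "breakfast-brunch", "southern",
    "ambience", "price", "restaurant quality", "service efficiency", "location",
]


def category_vector(topic_probs, topic_to_category):
    """Sum topic probabilities into the 15 profile categories (rounded to 3 decimals)."""
    totals = dict.fromkeys(CATEGORIES, 0.0)
    for topic, prob in enumerate(topic_probs):
        category = topic_to_category.get(topic)
        if category is not None:  # assumption: topics without a category are ignored
            totals[category] += prob
    return [round(totals[c], 3) for c in CATEGORIES]


# Toy example with 5 topics instead of 50 (illustrative values only).
topic_probs = [0.10, 0.05, 0.30, 0.20, 0.35]
topic_to_category = {
    0: "service efficiency",  # e.g. a "good waiter" topic
    1: "service efficiency",  # e.g. a "bad service" topic
    2: "american",
    3: "location",
}
profile = category_vector(topic_probs, topic_to_category)
print(profile)
```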

An Example of Item Profiling. Consider a snippet of the item-level document for one restaurant: ‘Brand new food business! They make you a pizza, you take it home and bake it. It’s a create your own pizza place... But terrible service... ’

We can infer that this restaurant serves take-home pizza and is satisfactory in terms of food but poor in service. We can generate the item profile \(I_i\) for this restaurant by following steps similar to those used for generating user profiles. The category probability vector obtained for this item from the Item eLDA model is [0.34, 0.04, 0, 0, 0, 0, 0.03, 0, 0, 0.04, 0, 0, 0.031, 0.23, 0]. Further, we generate the aspect level sentiment polarities as follows:

Fig. 8. Example of a parse tree for a single sentence

  1. 1.

    Firstly, we generate the parse tree from the text preprocessing stage as depicted in Fig. 8.

  2. 2.

    We then derive the aspect level sentiment scores as explained in Sect. 3.3. Thus we obtain the following aspects and corresponding scores: {new american business: 0.83, AMERICAN: 0.69, home: 0.41, own pizza place: 0.63, service: 0.14}.

  3. 3.

    We then aggregate the aspect level sentiments to category level sentiments using the arithmetic mean. In this case, the sentiment for the ‘American’ category is the mean of the sentiment scores of ‘American’, ‘new american business’ and ‘own pizza place’, which is 0.72. The sentiment scores for home and service correspond to the service efficiency category, and their mean is 0.28.

  4. 4.

    Then the sentiment polarities are calculated as follows: American, having a score of 0.72, takes a positive sentiment polarity of +2, and service efficiency, with a score of 0.28, takes a negative sentiment polarity of −1.

  5. 5.

    The topic probabilities for this business obtained from Item eLDA are [0.34, 0.04, 0, 0, 0, 0, 0.03, 0, 0, 0.04, 0, 0, 0.031, 0.23, 0]. The category sentiment polarities are then multiplied by the corresponding topic probabilities to generate the business profile \(I_{i}\) as: \(I_{i}=[0.68,0.04,0,0,0,0,0.03,0,0,0.04,0,0,0.031,-0.23,0] \). This vector represents the business profile for the specified business.

We can infer from the generated business profile that the restaurant is perceived to have very good American cuisine, decent continental cuisine, alcohol service, southern cuisine and overall quality, but bad service efficiency. Thus, the generated business profile closely reflects our initial intuition about the restaurant.
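
The item-profiling steps of this example can be retraced with the short sketch below. The aspect scores, the polarities (+2 and −1) and the probability vector are taken from the example itself; the rule that converts a category score into a polarity is the one from Sect. 3.3 and is not reproduced here, and leaving categories without a polarity unchanged (multiplier 1) is an assumption that matches the resulting vector.

```python
from statistics import mean

CATEGORIES = [
    "american", "continental", "asian", "indo-arabic", "desserts",
    "sandwiches", "alcohol", "nightlife", "breakfast-brunch", "southern",
    "ambience", "price", "restaurant quality", "service efficiency", "location",
]


def category_sentiment(aspect_scores, aspect_to_category):
    """Arithmetic mean of the aspect-level scores within each category."""
    grouped = {}
    for aspect, score in aspect_scores.items():
        grouped.setdefault(aspect_to_category[aspect], []).append(score)
    return {category: mean(scores) for category, scores in grouped.items()}


def apply_polarity(category_probs, polarity_by_category):
    """Multiply each category probability by its polarity (1 if none is available)."""
    return [
        round(prob * polarity_by_category.get(category, 1), 3)
        for category, prob in zip(CATEGORIES, category_probs)
    ]


aspect_scores = {"new american business": 0.83, "AMERICAN": 0.69,
                 "own pizza place": 0.63, "home": 0.41, "service": 0.14}
aspect_to_category = {"new american business": "american", "AMERICAN": "american",
                      "own pizza place": "american", "home": "service efficiency",
                      "service": "service efficiency"}
sentiment = category_sentiment(aspect_scores, aspect_to_category)
# sentiment is roughly 0.72 for "american" and 0.275 for "service efficiency".

polarity = {"american": 2, "service efficiency": -1}  # values from the worked example
probs = [0.34, 0.04, 0, 0, 0, 0, 0.03, 0, 0, 0.04, 0, 0, 0.031, 0.23, 0]
item_profile = apply_polarity(probs, polarity)
print(item_profile)
```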

6.2 Proposed Applications

The applications of these personalized profiles are multifaceted. We envision their direct applications in the following fields:

Recommender Systems. Recommender systems generate personalized recommendations by clustering similar users and items based on historical ratings [20, 35]. Similarity measurements can be directly refined by incorporating the user and business profiles generated from text reviews. Further, our profiles can explain the recommendations in a more natural manner, by weighing in the user’s interests and the business’s strengths and weaknesses.
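
As one possible illustration of how the profiles could enter a similarity computation, the sketch below scores a user profile against an item profile with cosine similarity. Cosine similarity is our choice for this example only; the text above does not commit to a specific similarity measure, and the function name is hypothetical. The two vectors are the user and business profiles from Sect. 6.1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two profile vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


user_profile = [0.147, 0, 0, 0, 0, 0, 0.031, 0, 0, 0.039, 0, 0, 0.031, 0.112, 0.147]
item_profile = [0.68, 0.04, 0, 0, 0, 0, 0.03, 0, 0, 0.04, 0, 0, 0.031, -0.23, 0]
print(cosine_similarity(user_profile, item_profile))
```

Because the business profile carries negative entries for poorly perceived categories, a user who cares about such a category lowers the score, which is the behaviour one would want from a sentiment-aware match.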

Web Personalization. Web personalization customizes online content to suit individual users and to effectively project the businesses [10]. Generated profiles can potentially make a twofold impact in this sector. Personalization of online content based on the generated profiles can potentially improve click-through and conversion rates.

Marketing. Traditionally, users and businesses are profiled using historical ratings and purchase history to classify them into various market segments [30]. However, profiles generated using online reviews have the potential to unravel attributes such as price, quality, ambiance, etc., which in turn can improve the existing customer segmentation process.

7 Conclusion

In this paper, we propose an entity-level topic model, eLDA, which incorporates domain-specific noun-codewords aggregated at entity levels (user and item levels) to derive the underlying abstract topics hidden in textual reviews. Several experiments have been conducted on a large review dataset of restaurants to validate the predictive performance and the quality of the derived topics. Experimental results reveal that our proposed models, User eLDA and Item eLDA, outperform other benchmark models with improved topic coherence. The topic probabilities obtained from the eLDA models are then mapped to domain-specific aspects for building entity profiles. This pivotal step for personalization reveals the actual meaning of these latent features and hence improves interpretability, which was largely overlooked by traditional CF-based personalization algorithms. The aggregation of documents to the entity level drastically reduces the size of the Document-Term Matrix (DTM), which in turn reduces the computational cost. Findings show a reduction of 77.34% and 96.12% in DTM size for User eLDA and Item eLDA, respectively, compared to LDA Noun-Codeword, with good predictive power and topic quality. In our future work, we plan to integrate the entity profiles into an existing recommender system algorithm and validate the quality of the recommendations.