1 Introduction

The goal of query classification is to identify the category label, known a priori, that best represents the domain of the keywords submitted in a query. However, the performance of a query-based classifier depends largely on the keywords submitted by users, which often do not express the information actually being sought. Such keywords are called noise terms; they cannot unambiguously represent the actual context of the query and therefore cause classification errors. Selecting more appropriate keywords that represent the context of the query improves the performance of the classifier.

In most cases, a feedback mechanism [1] performs well when the user can modify the query based on the suggestions provided by the search engine. However, there is no scope for query refinement through feedback when the user has no domain knowledge, a scenario that is very common in the rural healthcare sector of India. Healthcare services in remote villages face a real challenge due to the scarcity of doctors. Health assistants generally manage the rural health centers, but they lack the expertise to refine a patient’s query that contains noise terms. In [2], a query classification system was proposed for diagnosing disease at the primary level by processing imprecise query keywords with the help of an experts’ knowledge base. The aim of this paper is therefore to develop an autonomous provisional disease diagnosis system, using statistical and computational methods, that can effectively monitor the health of rural people.

It has been observed that relevant words follow a correlated spatial distribution in a document, while irrelevant words are distributed at random. The patterns of occurrence of relevant and non-relevant terms therefore differ markedly. We propose a spatial-distribution-based method for obtaining symptom-related terms from the disease-related document corpus. By analogy with the level-statistics analysis of quantum disordered systems, the occurrences of a word are treated as “energy levels” within an “energy spectrum”, and relevant words are extracted as those whose energy levels attract each other [3]. In this paper, we propose a nearest-neighbour spacing distribution (NNSD) based approach to obtain symptoms with respect to the disease-classes, which are known a priori. A symptom vocabulary (SV) is constructed from the unique symptoms present in the disease-classes [4]. We build a disease-symptom matrix (DSM) whose dimensions are the number of diseases and the number of symptoms in the SV; each element of the matrix denotes the tf-score [5] of the respective symptom, computed from the disease-related medical document corpus (MDC). The DSM is built by extracting knowledge from the MDC and is sparse in nature.

After knowledge extraction, the imprecise query submitted by the patient is expanded using the terms that have a strong association with the query keyword. Measuring association requires a suitable adaptive technique that represents the context of the query more precisely. In this paper, a 5-gram collocation model and the log-likelihood ratio (LLR) are employed to measure the association and expand the query [6]. For each query keyword, five co-occurring terms are taken as the expanded query, called the bag-of-symptoms (BoS). A term in the BoS might not exactly match any term in the SV, even though it is semantically or contextually similar. Here, we propose a novel approach for finding the most similar term in the SV for each term in the BoS using a distributional similarity measure. Finally, the SV is encoded with the tf values of the terms in the BoS and is treated as the feature vector (FV). Since the terms in a BoS are very specific, the FV is sparse; it is used as the test pattern for predicting the disease. We use the sparse DSM and the sparse test pattern to predict the disease of a patient by applying the Sparse Representation based Classifier (SRC) [7]. The proposed system is described in Fig. 1.

Fig. 1. Architecture of the proposed system

This paper is divided into four sections. Section 2 describes the methodology. Results are summarized in Sect. 3 and conclusions are drawn in Sect. 4.

2 Methodology

In this paper, we propose an autonomous provisional disease identification system based on the patient’s query keyword, which is often imprecise and contains noise terms. The first contribution of the paper is knowledge extraction by analyzing the pre-defined disease-related document corpora collected from different medical sources.

2.1 Disease-Class Generation

Here, we use the NNSD of words (or symptoms) over the documents to find the symptoms relevant to a disease. The spacing distribution P(d) of a word w is obtained as the normalized histogram of the set of distances, or spacings, \( (d_{1}, d_{2}, \ldots, d_{m}) \) between consecutive occurrences of w in the documents, where m is the number of times the word w occurs in the document [4]. It has been observed that a non-relevant word like “and” is placed at random in the document, whereas a relevant word like “angina” appears in a “heart-disease” related document following the spatial distribution P(d). The level of attraction of relevant words is therefore higher than that of irrelevant words. The relevance of a word is defined by the parameter \( \rho = \frac{\sigma }{\overline{d}} \), where \( \overline{d} \) is the average distance and \( \sigma = \sqrt{\overline{d^{2}} - \overline{d}^{2}} \) is the standard deviation of the distribution P(d). The ρ values of different words are used to compare their distributional similarity. When words are uncorrelated, their spacings follow a Poisson distribution.
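
The computation of ρ described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the document is already tokenized into a flat list, measures spacing in token positions, and uses the hypothetical function name `spacing_relevance`.

```python
import math

def spacing_relevance(tokens, word):
    """Return rho = sigma / mean of the spacings between consecutive
    occurrences of `word`, or None if there are fewer than two spacings."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    # Nearest-neighbour spacings d_1, d_2, ... between consecutive occurrences.
    spacings = [b - a for a, b in zip(positions, positions[1:])]
    if len(spacings) < 2:
        return None  # too few occurrences for a meaningful estimate
    mean = sum(spacings) / len(spacings)
    mean_sq = sum(d * d for d in spacings) / len(spacings)
    # sigma = sqrt(mean(d^2) - mean(d)^2); clamp tiny negative rounding errors.
    sigma = math.sqrt(max(mean_sq - mean * mean, 0.0))
    return sigma / mean

# Toy document: "angina" occurs in two tight clusters, "and" at regular intervals.
tokens = ["x"] * 60
for pos in (0, 1, 2, 50, 51, 52):
    tokens[pos] = "angina"
for pos in (5, 15, 25, 35):
    tokens[pos] = "and"

rho_angina = spacing_relevance(tokens, "angina")
rho_and = spacing_relevance(tokens, "and")
```

In this toy input the clustered word receives a larger ρ than the evenly spaced one, which is the kind of contrast the method relies on; real corpora would need a threshold on ρ that the text does not specify.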

The relevant words follow a correlated spatial distribution and form a group with respect to a disease based on ρ. In this paper, we obtain the group of relevant words that serve as symptoms for each disease-class by applying NNSD to each word in the document. Figure 2 shows that the relevant words “heart”, “angina” and “palpitation” follow a similar type of distribution with different means and standard deviations, while the non-relevant terms “called” and “high” follow a random distribution.

Fig. 2. Distribution of different keywords

The symptoms of the different disease-classes are thus obtained, and a symptom vocabulary (SV) is built from the k unique symptoms present in the n disease-classes (\( n \ll k \)). A disease-symptom matrix \( \text{DSM}_{n \times k} \) is built as the measurement space, each element of which is calculated using Eq. (1):

$$ \text{DSM}[i,j] = \begin{cases} \log \left( 1 + f_{w,D} \right), & \text{if the } j^{\text{th}} \text{ term of SV is present in disease-class } i \\ 0, & \text{otherwise} \end{cases} $$
(1)

where \( f_{w,D} \) is the count of the term w in the disease-related documents D [5].

The DSM is a sparse matrix, since most of the symptoms are unique to a disease, and it is used to classify the query keyword submitted by the patient.
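
As an illustration of Eq. (1), the sketch below builds a small DSM from token counts. It is only a sketch under stated assumptions: the logarithm is taken as the natural log (the text does not give the base), the symptom sets per class are assumed to be already available from the NNSD step, and the function name `build_dsm` and the toy data are hypothetical.

```python
import math
from collections import Counter

def build_dsm(class_tokens, class_symptoms):
    """class_tokens:   disease-class -> list of tokens from its documents
    class_symptoms: disease-class -> set of symptoms selected for that class
    Returns (classes, vocabulary, dsm) with dsm[i][j] = log(1 + count) or 0."""
    classes = sorted(class_tokens)
    # Symptom vocabulary: union of the unique symptoms over all classes.
    vocabulary = sorted(set().union(*class_symptoms.values()))
    dsm = []
    for c in classes:
        counts = Counter(class_tokens[c])
        row = [
            math.log(1 + counts[s]) if s in class_symptoms[c] else 0.0
            for s in vocabulary
        ]
        dsm.append(row)
    return classes, vocabulary, dsm

# Toy data (hypothetical, not taken from the MDC).
class_tokens = {
    "heart-disease": ["angina", "angina", "nausea", "the"],
    "diabetes": ["thirst", "nausea", "the"],
}
class_symptoms = {
    "heart-disease": {"angina", "nausea"},
    "diabetes": {"thirst"},
}
classes, vocabulary, dsm = build_dsm(class_tokens, class_symptoms)
```

Note that "nausea" occurs in the diabetes tokens but is not in that class's symptom set, so its entry stays zero, which is how the matrix ends up sparse.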

2.2 Query Expansion Model

In the proposed query expansion model, the query of a patient is expanded using 5-gram collocation by consulting the same MDC. We find the terms that co-occur with the query keyword using LLR as the association measure. It has been observed that beyond 5-grams the co-occurring terms are redundant [8]. The expanded query consists of five co-occurring terms and is defined as a bag-of-symptoms (BoS). BoSs are not unique: there may be multiple BoSs for each keyword due to its associations with different words throughout the document. Among multiple BoSs, the one with the highest LLR score is chosen as the expanded query. For example, if a patient enquires about “heart” related problems, the keyword “heart” is expanded and the top-scoring BoS (heartbeat angina heart disease nausea) is taken as the expanded query.
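
The LLR-based expansion can be sketched as below. The score is the standard log-likelihood ratio statistic for a 2×2 contingency table; how the 5-gram window is applied is an assumption here (every run of five consecutive tokens is treated as one observation), and `llr` and `expand_query` are hypothetical names rather than the authors' code.

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of counts."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    score = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / total
                score += o * math.log(o / expected)
    return 2.0 * score

def expand_query(tokens, keyword, n=5, top=5):
    """Rank terms that share an n-token window with `keyword` by LLR."""
    windows = [set(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    with_kw = [w for w in windows if keyword in w]
    candidates = set().union(*with_kw) - {keyword}
    scores = {}
    for term in candidates:
        k11 = sum(1 for w in with_kw if term in w)   # keyword and term
        k12 = len(with_kw) - k11                     # keyword only
        k21 = sum(1 for w in windows if term in w and keyword not in w)
        k22 = len(windows) - k11 - k12 - k21         # neither
        scores[term] = llr(k11, k12, k21, k22)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top]
```

For instance, `expand_query(corpus_tokens, "heart")` would return up to five terms for a tokenized corpus `corpus_tokens`. One caveat of this sketch is that the statistic is large for any strong departure from independence, so a fuller implementation might also check that the observed co-occurrence exceeds the expected one.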

Each BoS is used to generate the feature vector (FV) by comparing each term of the BoS with the symptoms in the SV on the basis of the ρ value. The symptom in the SV that is closest to a term of the BoS is encoded with the tf score [5] of that term. If multiple terms of a BoS are mapped to the same symptom of the SV, the highest tf score is used to encode that symptom. The remaining elements of the SV are set to zero, so the FV is sparse in nature.
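
The encoding step can be sketched as follows. The sketch assumes that a BoS term already present in the SV maps to itself and that any other term maps to the SV symptom with the nearest ρ value; the function name `encode_query` and all numeric values are hypothetical.

```python
def encode_query(bos_tf, bos_rho, sv_rho, vocabulary):
    """bos_tf:  BoS term -> tf score
    bos_rho: BoS term -> rho value of the term
    sv_rho:  SV symptom -> rho value of the symptom
    Returns the feature vector aligned with `vocabulary`."""
    index = {symptom: j for j, symptom in enumerate(vocabulary)}
    fv = [0.0] * len(vocabulary)
    for term, tf in bos_tf.items():
        if term in index:
            target = term
        else:
            # Fall back to the symptom whose rho is closest to the term's rho.
            target = min(vocabulary, key=lambda s: abs(sv_rho[s] - bos_rho[term]))
        j = index[target]
        # Keep the highest tf when several terms map to the same symptom.
        fv[j] = max(fv[j], tf)
    return fv

# Hypothetical values for illustration only.
vocabulary = ["fatigue", "heart", "nausea"]
sv_rho = {"fatigue": 1.2, "heart": 2.0, "nausea": 1.5}
bos_tf = {"heart": 0.9, "coronary": 1.27, "nausea": 0.3}
bos_rho = {"heart": 2.0, "coronary": 1.9, "nausea": 1.5}
fv = encode_query(bos_tf, bos_rho, sv_rho, vocabulary)
```

Here "coronary" is absent from the vocabulary, is mapped to "heart" by nearest ρ, and its higher tf score replaces the one contributed by "heart" itself.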

2.3 Sparse Representation Based Classification of Query

The FV is represented by the sparse vector y, and we apply SRC to classify the query by reconstruction using Eq. (2).

$$ {\mathbf{y}}^{T} = \text{DSM}^{\text{T}} *{\mathbf{W}} $$
(2)

where W is the coefficient vector, which is sparse since not all disease-classes contribute to reconstructing the query sample y.

The sparsest solution can be obtained by solving the following optimization problem, given in Eq. (3),

$$ \widehat{\mathbf{W}}_{0} = \arg \min \left\| \mathbf{W} \right\|_{0} \quad \text{subject to} \quad \text{DSM}^{T} * \mathbf{W} = \mathbf{y}^{T} $$
(3)

where \( \left\| \cdot \right\|_{0} \) is the \( L_{0} \)-norm, which counts the number of non-zero entries in the coefficient vector. Because \( L_{0} \) minimization is NP-hard in general, it is relaxed to the corresponding \( L_{1} \)-norm minimization, which can be solved in polynomial time by standard linear programming algorithms [8]. After the sparsest solution, say \( \widehat{\mathbf{W}}_{1} \), is obtained, the SRC [7] is performed in the following way.

For each disease-class i, let \( \partial_{i} : {\mathbb{R}}^{S} \to {\mathbb{R}}^{S} \) be the characteristic function that selects the coefficients associated with the i th class. Using only the coefficients associated with the i th class, a given test sample y is reconstructed as \( {\mathbf{y}_{new}^{i}}^{T} = \text{DSM}^{T} * \partial_{i} \left( \widehat{\mathbf{W}}_{1} \right) \), where \( \mathbf{y}_{new}^{i} \) is called the prototype of class i with respect to the sample y. Equation (4) calculates the residual distance between the actual sample and its prototype of class i,

$$ r_{i} \left( \mathbf{y} \right) = \left\| \mathbf{y} - \mathbf{y}_{new}^{i} \right\|_{2} $$
(4)

The SRC decision rule: If \( r_{m} \left( {\mathbf{y}} \right) = \min_{i} r_{i} \left( {\mathbf{y}} \right), \) y is assigned to the class m [9].
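
The classification procedure of Eqs. (2)–(4) can be sketched in pure Python as below. This is not the authors' solver: instead of the equality-constrained problem solved by linear programming, the sketch swaps in an \( L_{1} \)-penalized least-squares (Lasso) formulation minimized by iterative soft-thresholding (ISTA), which tolerates the case where y cannot be reconstructed exactly. The names `src_classify`, `lam` and `iters`, and the toy matrix, are assumptions, and the DSM is assumed to contain at least one non-zero entry.

```python
import math

def soft_threshold(x, t):
    """Shrink x towards zero by t (proximal operator of the L1 norm)."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def src_classify(dsm, y, lam=0.01, iters=2000):
    """dsm: n x k list of rows (one per disease-class); y: length-k query vector.
    Returns (index of best class, coefficient vector W, residuals r_i)."""
    n, k = len(dsm), len(y)
    # Step size 1/L with L = squared Frobenius norm, an upper bound on the
    # largest eigenvalue of DSM * DSM^T, so the iteration cannot diverge.
    step = 1.0 / sum(v * v for row in dsm for v in row)
    w = [0.0] * n
    for _ in range(iters):
        recon = [sum(w[i] * dsm[i][j] for i in range(n)) for j in range(k)]
        resid = [recon[j] - y[j] for j in range(k)]
        grad = [sum(dsm[i][j] * resid[j] for j in range(k)) for i in range(n)]
        w = [soft_threshold(w[i] - step * grad[i], step * lam) for i in range(n)]
    # Class-wise prototype uses only the coefficient of class i (Eq. 4).
    residuals = [
        math.sqrt(sum((y[j] - w[i] * dsm[i][j]) ** 2 for j in range(k)))
        for i in range(n)
    ]
    best = min(range(n), key=residuals.__getitem__)
    return best, w, residuals

# Toy example with three classes and five symptoms (hypothetical values).
dsm = [
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
y = [0.9, 1.1, 0.0, 0.0, 0.1]
best, w, residuals = src_classify(dsm, y)
```

In the toy example the query shares its non-zero entries mostly with the first row, so the first class yields the smallest residual. Since the DSM has one row per disease-class, selecting the coefficients of class i reduces to keeping the single coefficient `w[i]`.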

Example

  1. Step1.

    The BoS corresponding to the patient’s keyword ‘Angina’ is [Fatigue Coronary Palpitation Heart Nausea]T

  2. Step2:

    Encode the expanded query as the test pattern y using the SV (1×70). The term “Coronary” is not present in the SV, so replace “Coronary” with the most similar symptom, “Heart”, by comparing ρ values. The FV y is given as follows:

    y (1×70) = [ 0, 0, …, 0, 1.27, …, 0, 0, …, 0, 0.3, …, 0, 0, …, 0, 0.3, …, 0, 0, …, 0, 0.9, …, 0, 0 ]T

  3. Step3:

    Considering y as the encoded test sample and the DSM (4 × 70) as the training set, obtain the sparse coding vector W (4×1) using Eq. (3):

    W 4×1 = [0.12 0.03 –0.04 –0.005]T

  4. Step4:

    Reconstruct the prototype \( \mathbf{y}_{new}^{i} \) of y for each disease-class i, using only the non-zero coefficient \( W_{i} \) associated with the i th disease-class label.

  5. Step5:

    The residual distance for each class i is computed using Eq. (4):

    r i  = [1.88 1.97 2.03 2]T

  6. Step6:

    The minimum residual distance is 1.88, corresponding to i = 1. Therefore, the query is classified as disease-class “Heart-Disease”.

3 Results and Discussions

In our experiment, a large medical document corpus (MDC) was prepared by consulting several medical websites (webmd.com, mayoclinic.org, healthcare.com) and the literature [10]. It contains 260 documents divided into four sub-corpora representing the diseases “Heart-disease”, “Diabetes”, “Diarrhea” and “Lung-disease”.

The NNSD-SRC method was applied to the four sub-corpora to extract the relevant terms, which are the symptoms; the dimension of the resulting SV is 70. We sampled 200 patient queries from a rural health kiosk over a span of one week and classified them using 10-fold cross-validation. The NNSD-SRC method shows a significant improvement in accuracy and a lower rate of misclassification compared with the other classifiers, as given in Table 1. Its high precision and recall values indicate that NNSD-SRC performs better than the other classifiers. The ROC curves for the different classifiers, given in Fig. 3, show that NNSD-SRC achieves the best performance.

Table 1. Comparisons using different classifiers

Fig. 3. ROC curve using different classifiers

4 Conclusions

In this paper, we proposed an NNSD-SRC based provisional disease diagnosis method that minimizes the experts’ involvement. The patient’s query is expanded moderately using a 5-gram collocation approach. To classify the query, a sparse representation based classifier (SRC) is employed, which exploits the sparsity of the feature vector and of the DSM. The SRC-based classifier outperforms the other classifiers, showing significant improvement in accuracy and sensitivity on different data sets. In this work, we prepared a benchmark data set, the MDC, of medical documents related to “Heart-disease”, “Diabetes”, “Diarrhea” and “Lung-disease”, which was verified with experts. The performance of the system is satisfactory, and it can be used in rural healthcare in India, where the scarcity of doctors is a real challenge.