
1 Introduction

The need to search for specific information on the ever expanding Internet has led to the development of Web search engines. While their benefit is a direct connection between users and the information or products sought, any search outcome is influenced by commercial interests as well as by the users’ own ambiguity in formulating their requests or queries. Travel services are one example. The Internet has made the travel industry’s information and services accessible in real time; customers can purchase flight tickets, hotels and holiday packages online. Distribution costs have been reduced by a shorter value chain; however, businesses that do not appear in the top positions of the search results may lose potential customers. A similar scenario occurs in academic search; the Internet has democratized academic publication. Authors can upload their work to their personal Web pages, bypassing the traditional model of journal peer review, and they have a vested interest in placing their publications in top search positions in order to reach a larger audience and be cited more. In both examples ranking algorithms are essential because they decide relevance: they make information visible or hidden to customers and users. Under this model, Web search engines or recommender systems may be tempted to artificially promote the results of specific businesses for a fee, while authors or businesses may be tempted to manipulate ranking algorithms by “optimizing” the presentation of their work or products. The main consequence is that irrelevant results may be shown in top positions while relevant ones are “hidden” at the very bottom of the search list.

In order to address these search issues, this paper proposes an Intelligent Internet Search Assistant (ISA) that acts as an interface between an individual user’s query and the different search engines. Our ISA acquires a query from the user and retrieves results from one or several search engines, assigning one neuron to each Web result dimension. Result relevance is calculated by applying our cost function, which divides a query into a multidimensional vector and weights its dimension terms with different relevance parameters. Our ISA adapts to and learns the perceived user’s interest and reorders the retrieved snippets based on our dimension relevant centre point; it learns result relevance through an iterative process in which the user directly evaluates the listed results. We evaluate and compare its performance against other search engines with a newly proposed quality definition that combines both relevance and rank. We have also included two learning algorithms: Gradient Descent, which learns the centre of relevant dimensions, and Reinforcement Learning, which updates the network weights by rewarding relevant dimensions and punishing irrelevant ones. We have validated our ISA against other Web search engines and metasearch engines using travel services and open user queries, and we have analysed the Gradient Descent and Reinforcement Learning algorithms in terms of result relevance and learning speed.

We describe the application of neural networks to Web search in Sect. 2. We define our Intelligent Internet Search Assistant’s mathematical model in Sect. 3 and validate it against other Web search engines in Sect. 4. Finally, we present our conclusions in Sect. 5.

2 Related Work

The ability of neural networks to learn iteratively from different inputs to acquire the desired outputs, as a mechanism of adaptation to users’ interests in order to provide relevant answers, has already been applied to the World Wide Web and recommender systems.

F. Scarselli et al. [1] and M. Chau et al. [2] use a neural network by assigning a neuron to each Web page; they create a graph where the neural links are the equivalent of the hyperlinks. S. Bermejo et al. [3] use an approach similar to ours, the allocation of one neuron per Web search result; the main difference is that their network is trained to cluster results by meaning. C. Burges et al. [4] define RankNet, which uses neural networks to evaluate Web sites by training the network on query-document pairs. B. Shu et al. [5] retrieve results from different Web search engines and train the network under the assumption that a result in a top position is relevant. J. Boyan et al. [6] use reinforcement learning to rank Web pages using their HTML properties and the hyperlink connections between them. X. Wang et al. [7] use a back propagation neural network whose input nodes correspond to a specific quantified user profile and whose single output node is the probability that the user would consider the Web page relevant.

3 The Intelligent Internet Search Assistant Model

The search assistant we design is based on the Random Neural Network (RNN) [8–10]. This is a biologically inspired spiking recurrent stochastic model for neural networks. Its main analytical properties are its “product form” and the existence of a unique network steady-state solution. The RNN represents more closely how signals are transmitted in many biological neural networks, where they actually travel as spikes or impulses rather than as analogue signal levels. It has been used in different applications, including network routing with cognitive packet networks, using reinforcement learning, which requires the search for paths that meet certain pre-specified quality of service requirements [11, 17], the search for exit routes for evacuees in emergency situations [12, 13], pattern-based search for specific objects [14], video compression [15], and image texture learning and generation [16].

3.1 Search Model

In our own application of the RNN, the search for information or for some meaning requires us to specify certain elements: an M-dimensional universe of X entities or ideas to be searched, a high level query that specifies the N properties or concepts requested by a user, and a method that searches and selects Y entities from the universe, showing the first Z results to the user according to an algorithm or rule. Each entity or concept in the universe is distinct from the others in some recognizable way; for instance, two entities may differ only in the date or time-stamp that records when they were last stored, or in their ownership or origin. On the other hand, we consider concepts to be distinct if they carry any different meaning, even if they are identical with respect to a user’s query.

We consider the universe we are searching within as a relation U that consists of a set of X M-tuples, U = {v1, v2 … vX}, where vi = (li1, li2 … liM) and li are the M different attributes for i = 1, 2 … X. The relation U is a very large relation consisting of M >> N attributes. The key concept in this paper is that a query can be defined as Rt(n(t)) = (Rt(1), Rt(2), …, Rt(n(t))), where n(t) is a variable N-dimensional attribute vector with 1 < N < M and t is the search iteration, t > 0; n(t) is variable so that attributes can be added or removed based on their relevance as the search progresses, i.e. as t increases. Each Rt(n(t)) takes its values from the attributes within the domain D(n(t)), where D is the corresponding domain that forms the universe U. Thus D(n(t)) is a set of properties or meanings based on words or integers, but possibly also words in another language, or a set of icons, images or sounds.

The answer A to the query Rt(n(t)) is a set of Y M-tuples A = {v1, v2 … vY}, where vo = (lo1, lo2 … loM) and lo are the M different attributes for o = 1, 2 … Y. Our Intelligent Internet Search Assistant shows the user only the first set of Z tuples that have the highest neuron potentials among the set of Y tuples. The neuron potential that represents the relevance of each M-tuple vo is calculated at each iteration t. The user, or the high level query itself, is limited mainly by two factors: the user’s lack of information about all the attributes that form the universe U of entities and ideas, and the user’s lack of precise knowledge about what he is looking for.
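
To make these definitions concrete, the following minimal Python sketch models the answer set and the top-Z selection rule; the class and function names are illustrative and not part of the paper’s implementation.

```python
# A minimal model of the search elements (illustrative names, not the
# paper's implementation).
from dataclasses import dataclass
from typing import List

@dataclass
class Result:
    """One M-tuple v_o of the answer set A."""
    attributes: List[float]   # (l_o1, l_o2, ..., l_oM)
    potential: float = 0.0    # neuron potential representing relevance

def top_z(answer: List[Result], z: int) -> List[Result]:
    """Show the user only the Z tuples with the highest neuron potentials."""
    return sorted(answer, key=lambda r: r.potential, reverse=True)[:z]
```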

3.2 Result Cost Function

We consider the universe U to be formed of all the results that can be searched. We assign each result provided by a search engine to an M-tuple vo of the answer set A and calculate the result relevance with the cost function described in this section. The query Rt(n(t)) is a variable N-dimensional vector that specifies the attributes the user considers relevant; the number of dimensions of the attribute vector n(t) varies as the iteration t increases. Our Intelligent Internet Search Assistant associates an M-tuple vo with each result provided by the search engine, creating an answer set A of Y M-tuples; search engines select their results from the universe U. We apply our cost function to each result or M-tuple vo of the answer set A, treating each vo as an M-dimensional vector. The cost function is first calculated on the relevant N attributes the user introduced in the high level query, R1(n(1)), within the domain D(n(1)); as the search progresses to Rt(n(t)), attributes may be added or removed based on the perceived relevance within the domain D’(n(t)). We calculate the overall Result Score, RS, by measuring the relationship between the values of its different attributes:

$$ \text{RS} = \text{RV} \cdot \text{HW} $$
(1)

where RV is the Result Value, which measures the result relevance, and HW is the Homogeneity Weight. The Homogeneity Weight rewards results whose relevance or scores are dispersed across their attributes. This parameter is also based on the idea that the first dimensions or attributes of the user query Rt(n(t)) are more important than the last ones:

$$ \text{HW} = \frac{\sum_{n=1}^{N} \text{HF}[n]}{N} $$
(2)

where HF[n], the homogeneity factor, is an N-dimensional vector associated with the result and n is the attribute index from the query Rt(n(t)):

$$ \text{HF}[n] = \begin{cases} \dfrac{N - n}{N} & \text{if } \text{SD}[n] > 0 \\[2ex] 0 & \text{if } \text{SD}[n] = 0 \end{cases} $$
(3)

We define the Score Dimension SD[n] as an N-dimensional vector that represents the attribute values of each result or M-tuple vo in relation to the query Rt(n(t)). The Result Value (RV) is the sum of each dimension’s individual score:

$$ \text{RV} = \sum_{n=1}^{N} \text{SD}[n] $$
(4)

where n is the attribute index from the query Rt(n(t)). Each dimension of the Score Dimension vector SD[n] is calculated independently for each n-attribute value that forms the query Rt(n(t)):

$$ \text{SD}[n] = S \cdot \text{PPW} \cdot \text{RPW} \cdot \text{DPW} $$
(5)

We consider only three types of domains of interest: words, numbers (as for dates and times) and prices. S is the score calculated depending on whether the domain of the attribute is a word (WS), a number (NS) or a price (PS). If the domain D(n) is a word, our ISA calculates the Word Score (WS) following the formula:

$$ S = \frac{\text{WR}}{\text{NW}} $$
(6)

where WR is 1 if the word of the n-attribute of the query Rt(n(t)) is contained in the search result and 0 otherwise, and NW is the number of words in the search result. If the domain D(n) is a number, our ISA selects the best Number Score (NS) among the numbers contained in the search result, the one that maximizes the cost function:

$$ S = \frac{1 - \dfrac{\left| \text{DV} - \text{RV} \right|}{\left| \text{DV} \right| + \left| \text{RV} \right|}}{\text{NN}} $$
(7)

where DV is the value of the n-attribute of the query Rt(n(t)), RV is the value of a number in the result and NN is the total number of numbers in the result. If the domain D(n) is a price, our ISA chooses the best Price Score (PS) among the prices in the result, the one that maximizes the cost function:

$$ S = \frac{\text{DV} / \text{RV}}{\text{NP}} $$
(8)

where DV is the value of the n-attribute of the query Rt(n(t)), RV is the value of a price in the result and NP is the total number of prices in the result. We penalize search results that provide unnecessary information by dividing the score by the total number of elements in the Web result. The Score Dimension vector SD[n] is weighted according to different relevance factors:

$$ \text{SD}[n] = S \cdot \text{PPW} \cdot \text{RPW} \cdot \text{DPW} $$
(9)

The Position Parameter Weight (PPW) is based on the idea that an attribute value shown within the first positions of the search result is more relevant than one shown at the end:

$$ \text{PPW} = \frac{\text{NC} - \text{DVP}}{\text{NC}} $$
(10)

where NC is the number of characters in the result and DVP is the position within the result where the value of the dimension appears. The Relevance Parameter Weight (RPW) incorporates the user’s perception of relevance by rewarding the first attributes of the query Rt(n(t)) as highly desirable and penalising the last ones:

$$ \text{RPW} = 1 - \frac{\text{PD}}{N} $$
(11)

where PD is the position of the n-attribute in the query Rt(n(t)) and N is the total number of dimensions of the query vector Rt(n(t)). The Dimension Parameter Weight (DPW) incorporates the observed user relevance of the domain values D(n(t)) by giving a higher score to the domain type the user has filled in most in the query:

$$ \text{DPW} = \frac{\text{NDT}}{N} $$
(12)

where NDT is the number of dimensions with the same domain (word, number or price) in the query Rt(n(t)) and N is the total number of dimensions of the query vector Rt(n(t)). We assign this final Result Score (RS) value to each M-tuple vo of the answer set A. Our ISA uses this value to reorder the answer set A of Y M-tuples, showing the user the first set of Z results with the highest potential values.
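
As an illustration, the following Python sketch computes the cost function of Eqs. (1)-(12) for one result; the tokenisation and the handling of empty results are our own assumptions, since the paper does not specify them.

```python
# A hedged sketch of the cost function, Eqs. (1)-(12). Tokenisation and
# empty-result handling are our assumptions; the paper does not fix them.
from typing import List

def word_score(term: str, words: List[str]) -> float:
    """Eq. (6): S = WR / NW, WR = 1 if the query word is in the result."""
    wr = 1.0 if term in words else 0.0
    return wr / len(words)

def number_score(dv: float, numbers: List[float]) -> float:
    """Eq. (7): best-matching number in the result, penalised by NN."""
    nn = len(numbers)
    return max((1 - abs(dv - rv) / (abs(dv) + abs(rv))) / nn
               for rv in numbers) if nn else 0.0

def price_score(dv: float, prices: List[float]) -> float:
    """Eq. (8): best price ratio in the result, penalised by NP."""
    np_ = len(prices)
    return max((dv / rv) / np_ for rv in prices) if np_ else 0.0

def weighted_sd(s: float, dvp: int, nc: int, pd: int, n_dim: int,
                ndt: int) -> float:
    """Eqs. (9)-(12): SD[n] = S * PPW * RPW * DPW."""
    ppw = (nc - dvp) / nc   # Eq. (10): earlier in the snippet scores higher
    rpw = 1 - pd / n_dim    # Eq. (11): earlier query attributes weigh more
    dpw = ndt / n_dim       # Eq. (12): dominant domain types weigh more
    return s * ppw * rpw * dpw

def result_score(sd: List[float]) -> float:
    """Eqs. (1)-(4): RS = RV * HW from the attribute scores SD[1..N]."""
    n_dim = len(sd)
    rv = sum(sd)                                           # Eq. (4)
    hf = [(n_dim - (i + 1)) / n_dim if sd[i] > 0 else 0.0  # Eq. (3),
          for i in range(n_dim)]                           # 1-based n
    hw = sum(hf) / n_dim                                   # Eq. (2)
    return rv * hw                                         # Eq. (1)
```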

3.3 User Iteration

Based on the answer set A, the user can now act as an intelligent critic and select a subset of P relevant results, CP, of A. CP is a set that consists of P M-tuples, CP = {v1, v2 … vP}, where vp is a vector of M dimensions, vp = (lp1, lp2 … lpM), and lp are the M different attributes for p = 1, 2 … P. Similarly, the user can also select a subset of Q irrelevant results, CQ of A, CQ = {v1, v2 … vQ}, where vq is a vector of M dimensions, vq = (lq1, lq2 … lqM), and lq are the M different attributes for q = 1, 2 … Q. Based on this user iteration, our Intelligent Internet Search Assistant provides the user with a different answer set A of Z M-tuples, reordered by MD, the minimum distance to the Relevant Centre Point for the selected results, following the formula:

$$ \text{RCP}[n] = \frac{\sum_{p=1}^{P} \text{SD}_{p}[n]}{P} = \frac{\sum_{p=1}^{P} l_{pn}}{P} $$
(13)

where P is the number of relevant results selected, n is the attribute index from the query Rt(n(t)) and SDp[n] is the Score Dimension vector associated with the result or M-tuple vp formed of lpn attributes. An equivalent equation applies to the calculation of the Irrelevant Centre Point. Our Intelligent Internet Search Assistant reorders the retrieved set of Y M-tuples, showing the user only the first Z M-tuples, based on the lowest value of MD, the difference between their distances to the Relevant Centre Point (RD) and the Irrelevant Centre Point (ID) respectively:

$$ \text{MD} = \text{RD} - \text{ID} $$
(14)

where MD is the result distance, RD is the Relevant Distance and ID is the Irrelevant Distance. The Relevant Distance (RD) of each result or M-tuple vo is formulated as:

$$ \text{RD} = \sqrt{\sum_{n=1}^{N} \left( \text{SD}[n] - \text{RCP}[n] \right)^{2}} $$
(15)

where SD[n] is the Score Dimension vector of the result or M-tuple vo and RCP[n] is the coordinate of the Relevant Centre Point. An equivalent equation applies to the calculation of the Irrelevant Distance. We are therefore presenting an iterative search process that learns and adapts to the perceived user relevance based on the dimensions or attributes the user introduced in the initial query.
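
A minimal sketch of this relevance-feedback step, Eqs. (13)-(15), assuming the Score Dimension vectors are held as plain lists of floats:

```python
# A minimal sketch of the relevance-feedback reordering, Eqs. (13)-(15).
import math
from typing import List

def centre_point(selected: List[List[float]]) -> List[float]:
    """Eq. (13): per-attribute mean over the user-selected results."""
    p = len(selected)
    return [sum(sd[n] for sd in selected) / p
            for n in range(len(selected[0]))]

def distance(sd: List[float], centre: List[float]) -> float:
    """Eq. (15): Euclidean distance to a centre point."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sd, centre)))

def reorder(results: List[List[float]], relevant: List[List[float]],
            irrelevant: List[List[float]], z: int) -> List[List[float]]:
    """Eq. (14): sort ascending by MD = RD - ID and show the first Z."""
    rcp = centre_point(relevant)    # Relevant Centre Point
    icp = centre_point(irrelevant)  # Irrelevant Centre Point
    return sorted(results,
                  key=lambda sd: distance(sd, rcp) - distance(sd, icp))[:z]
```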

3.4 Dimension Learning

The answer set A to the query R1(n(1)) is based on the N-dimensional query introduced by the user; however, results are formed of M dimensions, so the subset of results the user has considered relevant may contain other hidden relevant concepts that the user did not consider in the original query. We take the domain D(m), the M attributes from which our universe U is formed, to be the different independent words that form the set of Y results retrieved from the search engines. Our cost function is expanded from the N attributes defined in the query R1(n(1)) to the M attributes that form the retrieved results. Our Score Dimension vector, SD[m], is now based on M dimensions, and an analogous attribute expansion is applied to the Relevant Centre Point calculation, RCP[m]. The query R1(n(1)) is based on the N-dimensional vector introduced by the user, but the answer set A consists of Y M-tuples. The user, based on the presented set A, selects a subset of P relevant results, CP, and a subset of Q irrelevant results, CQ.

Let us consider CP as a set that consists of P M-tuples, CP = {v1, v2 … vP}, where vp is a vector of M dimensions, vp = (lp1, lp2 … lpM), and lp are the M different attributes for p = 1, 2 … P. The M-dimensional vector Dimension Average, DA[m], is the average value of the m-attributes over the selected P relevant results:

$$ \text{DA}[m] = \frac{\sum_{p=1}^{P} \text{SD}_{p}[m]}{P} = \frac{\sum_{p=1}^{P} l_{pm}}{P} $$
(16)

where P is the number of relevant results selected, m is the attribute index of the relation U and SDp[m] is the Score Dimension vector associated with the result or M-tuple vp formed of lpm attributes. We define ADV as the Average Dimension Value of the M-dimensional vector DA[m]:

$$ \text{ADV} = \frac{\sum_{m=1}^{M} \text{DA}[m]}{M} $$
(17)

where M is the total number of attributes that form the relation U. The correlation vector σ[m] is the average difference between each result’s dimension values and the average vector:

$$ \sigma[m] = \frac{\sum_{p=1}^{P} \left( \text{SD}_{p}[m] - \text{DA}[m] \right)}{P} = \frac{\sum_{p=1}^{P} \left( l_{pm} - \text{DA}[m] \right)}{P} $$
(18)

where P is the number of relevant results selected, m is the attribute index of the relation U and SDp[m] is the Score Dimension vector associated with the result or M-tuple vp formed of lpm attributes. We define C as the average correlation value over the M dimensions of the vector σ[m]:

$$ C = \frac{\sum_{m=1}^{M} \sigma[m]}{M} $$
(19)

where M is the total number of attributes that form the relation U. We consider an m-attribute relevant if its associated Dimension Average value DA[m] is larger than the average dimension value ADV and its correlation value σ[m] is smaller than the average correlation C. We have therefore changed the relevant attributes of the searched entities or ideas by correlating the error value of their concepts or properties, represented as attributes or dimensions. In the next iteration, the query R2(n(2)) is formed by the attributes our ISA has considered relevant. The answer to the query R2(n(2)) is a different set A of Y M-tuples. This process iterates until there are no new relevant results to show the user.
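
The dimension-learning rule of Eqs. (16)-(19) can be sketched as follows. Note that Eq. (18), as printed, averages signed deviations, which sum to zero by construction; the sketch assumes absolute deviations were intended.

```python
# A sketch of the dimension-learning rule, Eqs. (16)-(19). Eq. (18) as
# printed averages signed deviations (identically zero), so absolute
# deviations are assumed here.
from typing import List

def relevant_dimensions(selected: List[List[float]]) -> List[int]:
    """Return the m-attributes kept for the next query iteration."""
    p, m_dim = len(selected), len(selected[0])
    # Eq. (16): Dimension Average DA[m] over the P relevant results
    da = [sum(sd[m] for sd in selected) / p for m in range(m_dim)]
    # Eq. (17): Average Dimension Value ADV
    adv = sum(da) / m_dim
    # Eq. (18): deviation of each dimension from its average
    # (assumption: absolute value)
    sigma = [sum(abs(sd[m] - da[m]) for sd in selected) / p
             for m in range(m_dim)]
    # Eq. (19): average correlation value C
    c = sum(sigma) / m_dim
    # Keep attributes that score above average and deviate below average
    return [m for m in range(m_dim) if da[m] > adv and sigma[m] < c]
```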

3.5 Gradient Descent Learning

Gradient Descent learning is based on adaptation to the perceived user interests or understanding of meaning by correlating the attribute values of each result to extract similar meanings and cancel superfluous ones. The ISA Gradient Descent learning algorithm is based on a recurrent model. The inputs i = {i1, …, iP} are the M-tuples vp corresponding to the selected relevant result subset CP, and the desired outputs y = {y1, …, yP} are the same values as the inputs. Our ISA then obtains the learned Random Neural Network weights, calculates the relevant dimensions and finally reorders the results according to the minimum distance to the new Relevant Centre Point focused on the relevant dimensions.
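
Since the RNN training equations are not detailed here, the following sketch illustrates only the autoassociative set-up with a plain linear model trained by gradient descent; it is a stand-in under that assumption, not the RNN’s product-form dynamics.

```python
# A stand-in sketch of the autoassociative training step: a plain linear
# model trained by gradient descent so that output ~= input. This is an
# assumption for illustration, not the RNN's product-form dynamics.
import numpy as np

def train_autoassociator(x: np.ndarray, lr: float = 0.01,
                         epochs: int = 500) -> np.ndarray:
    """x holds the P selected results as rows of M attribute values."""
    p, m = x.shape
    w = np.zeros((m, m))            # network weights to be learned
    for _ in range(epochs):
        y = x @ w                   # network output for the inputs
        grad = x.T @ (y - x) / p    # gradient of the mean squared error
        w -= lr * grad              # gradient-descent step
    return w
```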

3.6 Reinforcement Learning

The external interaction with the environment is provided when the user selects the relevant result set CP. Reinforcement Learning adapts to the perceived user relevance by incrementing the value of relevant dimensions and reducing it for irrelevant ones; it modifies the values of the m attributes of the results, accentuating hidden relevant meanings and attenuating irrelevant properties. We associate the Random Neural Network weights with the answer set A: W = A. Our ISA updates the network weights W, rewarding the result’s relevant attributes by:

$$ w(p, m) = l_{pm}^{s-1} + l_{pm}^{s-1} \cdot \left( \frac{l_{pm}^{s-1}}{\sum_{m=1}^{M} l_{pm}^{s-1}} \right) $$
(20)

where p is the result or M-tuple vp formed of lpm attributes, m is the result attribute index, M is the total number of attributes and s is the iteration number. Our ISA also updates the network weights, punishing the result’s irrelevant attributes by:

$$ w(p, m) = l_{pm}^{s-1} - l_{pm}^{s-1} \cdot \left( \frac{l_{pm}^{s-1}}{\sum_{m=1}^{M} l_{pm}^{s-1}} \right) $$
(21)

where p is the result or M-tuple vp formed of lpm attributes, m is the result attribute index, M is the total number of attributes and s is the iteration number. Our ISA then recalculates the potential of each result based on the updated network weights and reorders them, showing the user the results with the highest potential or score.
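
A sketch of the reward and punishment updates of Eqs. (20) and (21), applied to the attribute values of a single result:

```python
# A sketch of the reinforcement updates, Eqs. (20) and (21): each
# attribute of a result is scaled up (reward) or down (punishment) in
# proportion to its share of the result's total attribute value.
from typing import List

def update_attributes(l_p: List[float], reward: bool) -> List[float]:
    """l_p holds the attributes l_pm of result p at iteration s - 1."""
    total = sum(l_p)
    if total == 0:
        return list(l_p)             # nothing to redistribute
    sign = 1.0 if reward else -1.0
    return [v + sign * v * (v / total) for v in l_p]
```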

4 Validation

The Intelligent Internet Search Assistant we have proposed emulates how Web search engines work by using a very similar interface to introduce and display information. We validate our ISA algorithm with a set of three different experiments. Users in the experiments can choose between the different Web search engines and the number N of results they would like to retrieve from each one. We propose the following formula to measure Web search quality; it is based on the idea that a better search engine provides a list with more relevant results in the top positions. In a list of N results, we assign a score of N to the first result and 1 to the last one; the proposed quality value is then the sum of the position scores of the selected results. Our definition of Quality, Q, is:

$$ Q = \sum_{i=1}^{Y} \text{RSE}_{i} $$
(22)

where RSEi is the rank of the result i in a particular search engine, with a value of N if the result is in the first position and 1 if it is in the last one, and Y is the total number of results selected by the user. The best Web search engine is the one with the largest Quality value. We define the normalized quality, \( \overline{Q} \), as the quality Q divided by its optimum value, which is reached when the user considers relevant all the results provided by the Web search engine. In this situation Y and N have the same value:

$$ \overline{Q} = \frac{Q}{N(N + 1)/2} $$
(23)

We define I as the quality improvement between a Web search engine and a reference:

$$ I = \frac{\text{QW} - \text{QR}}{\text{QR}} $$
(24)

where I is the Improvement, QW is the quality of the Web search engine and QR is the quality of the reference; we use the Quality of Google as QR in our validation exercise.
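
The three validation metrics, Eqs. (22)-(24), can be sketched as follows; selected_ranks is an illustrative name for the position scores RSEi of the user-selected results.

```python
# A sketch of the validation metrics, Eqs. (22)-(24); selected_ranks
# holds the position scores RSE_i of the selected results (N for first
# position, 1 for last).
from typing import List

def quality(selected_ranks: List[int]) -> int:
    """Eq. (22): Q is the sum of the selected results' position scores."""
    return sum(selected_ranks)

def normalised_quality(q: float, n: int) -> float:
    """Eq. (23): divide Q by the optimum N(N + 1)/2."""
    return q / (n * (n + 1) / 2)

def improvement(q_w: float, q_r: float) -> float:
    """Eq. (24): improvement of a search engine over the reference."""
    return (q_w - q_r) / q_r
```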

In our first experiment we asked our validators to search for different queries using only Google; ISA provides a set of reordered results from which the user selects the relevant ones. We show the average values over the 20 different queries: the average number of results retrieved by Google and the average number of results selected by the user. We report the normalized quality of Google and ISA together with the improvement of our algorithm over Google. In our second experiment, ISA provides a reordered list from which the user selects the relevant results; our ISA then reorders the results using the dimension relevant centre point, providing the user with another reordered result list from which the user again selects the relevant ones. We show the average values over the 16 different queries: the average number of results retrieved and the average number of results selected by the user. We also report the normalized quality of Google, ISA and ISA with the relevant centre iteration, including the improvement over Google in both scenarios. In our third experiment, validators can select the Web search engines from which they would like their results to be retrieved; as in our first experiment, the users select the relevant results, and our ISA combines the results retrieved from the different Web search engines selected. We present the average values over the 18 different queries and show the normalized quality of each selected Web search engine including our ISA; because users can choose any Web search engine, we do not report the improvement value, as there is no unique reference Web search engine (Table 1).

Table 1. Web search engine validation

4.1 ISA Learning

Users in the experiments can choose between Google and Bing and between the Gradient Descent and Reinforcement Learning types. Our ISA then collects the first 50 results from the selected Web search engine, reorders them according to its cost function and finally shows the user the first 20 results. We consider 50 results a good approximation of search depth, as more results would add clutter and irrelevance; 20 results is the average number of results a user reads before launching another search if no relevant one is found. ISA reorders results while learning in the two-step iterative process, showing only the best 20 results to the user. We present the average Quality values of the Web search engine and ISA for the 29 different queries searched by different users, together with the learning type and the Web search engine used. The first I represents the improvement of ISA over the Web search engine; the second I is between ISA iterations 2 and 1, and the third I is between ISA iterations 3 and 2 (Table 2).

Table 2. ISA learning validation

5 Conclusions

We have proposed a novel approach to Web search in which the user iteratively trains a neural network while looking for relevant results. We have also defined a different process: the application of the Random Neural Network as a biologically inspired algorithm to measure both user relevance and result ranking based on a predetermined cost function. Our Intelligent Internet Search Assistant generally performs slightly better than Google and other Web search engines; however, this evaluation may be biased because users tend to concentrate on the first results provided, which are the ones our algorithm shows. Our ISA adapts to and learns from the user’s previous relevance measurements, increasing its quality and improvement significantly within the first iteration. The Reinforcement Learning algorithm performs better than Gradient Descent: although Gradient Descent provides better quality on the first iteration, Reinforcement Learning outperforms it on the second one due to its higher learning rate, and both show only residual learning on their third iteration. Gradient Descent would be the preferred learning algorithm if only one iteration were required; Reinforcement Learning would be a better option in the case of two iterations. Three iterations are not recommended because the learning gain is only residual.