Practical and effective IR-style keyword search over semantic web

https://doi.org/10.1016/j.ipm.2008.12.005

Abstract

This paper presents a novel IR-style keyword search model for semantic web data retrieval that differs from current retrieval methods. In this model, an answer to a keyword query is a connected subgraph that contains all the query keywords. In addition, the answer is minimal, in the sense that no proper subgraph of it is also an answer to the query. We provide an approximation algorithm to retrieve these answers efficiently. A dedicated ranking strategy is also proposed so that answers can be appropriately ordered. Experimental results over real datasets show that our model outperforms existing alternative solutions with respect to both effectiveness and efficiency.

Introduction

With the increasing number of ontologies encoded in RDF/S or OWL available on the semantic web, extensive efforts have been made towards effective retrieval facilities for the data. Existing research in this area can be classified into two major categories: the structured query approach and semantic search. The former is standardized by the W3C, which envisions that users should be able to issue expressive, SQL-like structured queries in languages such as RQL, RDQL, or SPARQL (Prud'hommeaux & Seaborne, 2008) and receive a set of triples as answers. This approach is effective if users have detailed knowledge of the underlying schemas and ontology languages. In practice, however, this assumption does not always hold, since these languages are complicated even for developers of semantic web applications. An alternative is so-called semantic search, which usually adopts more user-friendly unstructured (Rocha, Schwabe, & Aragao, 2004) or semi-structured (Anyanwu et al., 2005; Guha et al., 2003) query strategies. However, these approaches are still much less attractive than keyword search, which is the most effective and successful paradigm in modern information retrieval.

This paper focuses on the problem of supporting effective keyword search over semantic web data. The semantic web data model (RDF/S or OWL) conforms to a node-labeled and edge-labeled directed graph (Hayes, 2004), where nodes represent resources or literals and edges represent properties carrying heterogeneous semantics. In our proposed model, an answer to a keyword query is a connected “minimal” subgraph that contains all the query keywords. In a real-world semantic web application, a large number of literal nodes may match individual keywords, and hence many answers may satisfy the query. Finding these answers efficiently is a non-trivial problem (e.g., consider a 5-keyword query over an RDF graph with thousands of nodes). In addition, the answers should be ranked in decreasing order of relevance. Otherwise, users will be discouraged from using keyword search.

To illustrate our model, consider the simplified scientific literature knowledge base shown in Fig. 1. We assume the user issues a keyword query Q = {John, K6} to find useful information relating the two keywords. Fig. 2 shows four possible answers (i.e., R1, R2, R3 and C1), each of which contains both keywords and conveys particular semantics. For example, answer R1 corresponds to paper p4, written by author John, with term K6 in its full text. However, there is a major difference between the answers shown in Fig. 2a and the one in Fig. 2b. The candidate answer C1 has three literal nodes that all contain the same keyword K6. In fact, answers R1, R2 and R3 are all proper subgraphs of C1. Hence, we remove C1 from the final results because it contains redundant nodes and edges. In addition, the answers in Fig. 2a are not equally useful to users. For example, R1 might be more desirable than R2 and R3, since it represents a stronger relationship between the keywords John and K6.
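The answer semantics described above (connected, covers every keyword, no proper subgraph is itself an answer) can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's algorithm: the toy graph, the node names, and the helper functions are all hypothetical (Fig. 1 is not reproduced here), a subgraph is simplified to a node set with its induced edges, edge direction is ignored for connectivity, and minimality is checked by brute force, which is only feasible for tiny examples.

```python
from itertools import combinations

# Hypothetical toy graph (NOT the paper's Fig. 1), given as undirected edges.
EDGES = [
    ("John", "a1"), ("a1", "p4"), ("p4", "K6@p4"),
    ("p4", "p7"), ("p7", "K6@p7"),
]
# Assumed labelling: which keyword each literal node matches.
KEYWORDS = {"John": "John", "K6@p4": "K6", "K6@p7": "K6"}


def induced_adjacency(nodes):
    """Adjacency sets of the subgraph induced by `nodes`."""
    adj = {n: set() for n in nodes}
    for u, v in EDGES:
        if u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def is_connected(nodes):
    """Depth-first search over the induced subgraph."""
    nodes = set(nodes)
    if not nodes:
        return False
    adj = induced_adjacency(nodes)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adj[n] - seen)
    return seen == nodes


def is_answer(nodes, query):
    """Connected and contains every query keyword."""
    covered = {KEYWORDS[n] for n in nodes if n in KEYWORDS}
    return is_connected(nodes) and set(query) <= covered


def is_minimal_answer(nodes, query):
    """An answer such that no proper subset of its nodes is also an answer."""
    nodes = set(nodes)
    if not is_answer(nodes, query):
        return False
    return not any(
        is_answer(subset, query)
        for size in range(1, len(nodes))
        for subset in combinations(sorted(nodes), size)
    )


QUERY = ["John", "K6"]
R1 = {"John", "a1", "p4", "K6@p4"}   # plays the role of a minimal answer
C1 = R1 | {"p7", "K6@p7"}            # contains R1, so it is redundant
```

On this toy graph, `is_minimal_answer(R1, QUERY)` holds, while `C1` is an answer but not a minimal one, mirroring why a candidate such as C1 is discarded.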

The key contributions of this paper are the following:

  • We propose a novel IR-style keyword search model for semantic web data retrieval. Our search scheme supports a free-form keyword query interface, which is both desirable and practical for ordinary semantic web users. The model generates answers with explicit semantics (connected subgraphs) and returns ranked top-k results effectively.

  • We present an efficient search algorithm for finding answers. The key problem is that many potential answers may match a keyword query over a large data graph, so locating them efficiently is a non-trivial challenge.

  • We conduct experiments over real datasets. Experimental results show that our model is feasible and delivers high-quality search results.

The rest of this paper is organized as follows. Section 2 reviews related work. We define the data model, and then describe the formal query and answer semantics in Section 3. We detail our solution for keyword search in Section 4. Section 5 reports the experimental results. Finally, we conclude the paper and point out future research directions in Section 6.

Section snippets

Related work

We briefly discuss the semantic search approach. The idea of semantic search was first introduced by Guha, McCool, and Miller (2003). Later, Rocha et al. (2004) presented a purely entity-centric semantic search approach, which seeks to find important entities related to a given set of keywords using a spreading activation algorithm (Cohen & Kjeldsen, 1987; Crestani, 1997). Its limitation is that “an object by itself is intensely uninteresting”. To address this issue, Anyanwu et al. (2005)

Graph data model

Assume there is an infinite set U (URI references), an infinite set B = {N_j : j ∈ ℕ} (blank nodes), and an infinite set L (RDF literals). A 3-tuple (subject, property, object) ∈ (U ∪ B) × U × (U ∪ B ∪ L) is called an RDF triple, which asserts that a resource, the subject, has a property whose value is the object. An RDF graph is a set of RDF triples (Hayes, 2004).
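As a small illustration of the triple definition above, the membership condition (U ∪ B) × U × (U ∪ B ∪ L) can be expressed as a type check. The class names, the `is_rdf_triple` helper, and the `ex:` identifiers below are hypothetical and chosen for this sketch only; they do not correspond to any particular RDF library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URIRef:       # an element of U
    value: str


@dataclass(frozen=True)
class BlankNode:    # an element of B
    label: str


@dataclass(frozen=True)
class Literal:      # an element of L
    text: str


def is_rdf_triple(subject, prop, obj):
    """True iff (subject, prop, obj) lies in (U ∪ B) × U × (U ∪ B ∪ L)."""
    return (
        isinstance(subject, (URIRef, BlankNode))
        and isinstance(prop, URIRef)
        and isinstance(obj, (URIRef, BlankNode, Literal))
    )


# An RDF graph is a set of triples; these two are made-up examples.
graph = {
    (URIRef("ex:p4"), URIRef("ex:author"), URIRef("ex:a1")),
    (URIRef("ex:a1"), URIRef("ex:name"), Literal("John")),
}
```

Note that a literal may appear only in the object position, so a tuple such as `(Literal("John"), URIRef("ex:name"), URIRef("ex:a1"))` is rejected by the check.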

We extend RDF semantics and model the data as a weighted directed graph G = (V, E, w), where V = {v_i : i ∈ ℕ} is a finite set of nodes denoting subjects or

An approximate algorithm

The basic idea of our proposed algorithm is as follows. Initially, an answer tree starts with a single node in any group. We then grow the answer tree by repeatedly adding shortest paths to nodes whose groups are not yet covered by the current tree. After l-1 iterations, the answer tree is inserted into an increasing priority queue. Later, we dequeue the tree and replace some nodes which have increasing costs. The newly obtained tree is also inserted into the queue. We first give the
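The tree-growing phase outlined in this snippet can be sketched as follows. This is a hedged illustration under several assumptions, not the authors' implementation: a "group" is taken to be the set of nodes matching one keyword, l is taken to be the number of groups, the graph is a plain adjacency dictionary with non-negative weights, and the function names and toy data are hypothetical. The later priority-queue refinement step is not sketched, because its details are not available in this excerpt.

```python
import heapq


def path_to_nearest_uncovered(graph, tree_nodes, groups, covered):
    """Multi-source Dijkstra from the current tree to the closest node that
    belongs to a group not yet covered. Returns the path, or None."""
    dist = {n: 0.0 for n in tree_nodes}
    prev = {}
    heap = [(0.0, n) for n in sorted(tree_nodes)]
    heapq.heapify(heap)
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if any(i not in covered and u in g for i, g in enumerate(groups)):
            path = [u]
            while path[-1] in prev:          # walk back to a tree node
                path.append(prev[path[-1]])
            return path[::-1]
        for v, w in graph.get(u, {}).items():
            if v not in done and d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(heap, (d + w, v))
    return None


def greedy_answer_tree(graph, groups, start):
    """Grow a tree from `start` by repeatedly attaching the shortest path
    to a node of some uncovered group, until every group is covered."""
    tree_nodes, tree_edges = {start}, set()
    while True:
        covered = {i for i, g in enumerate(groups) if g & tree_nodes}
        if len(covered) == len(groups):
            return tree_nodes, tree_edges
        path = path_to_nearest_uncovered(graph, tree_nodes, groups, covered)
        if path is None:
            return None                      # some group is unreachable
        tree_nodes.update(path)
        tree_edges.update(zip(path, path[1:]))


# Hypothetical weighted graph and keyword groups for illustration.
GRAPH = {
    "John": {"a1": 1},
    "a1": {"John": 1, "p4": 1, "p9": 1},
    "p4": {"a1": 1, "K6@p4": 1},
    "p9": {"a1": 1, "p7": 1},
    "p7": {"p9": 1, "K6@p7": 1},
    "K6@p4": {"p4": 1},
    "K6@p7": {"p7": 1},
}
GROUPS = [{"John"}, {"K6@p4", "K6@p7"}]
result = greedy_answer_tree(GRAPH, GROUPS, "John")
```

With two groups, a single shortest-path addition suffices here: starting from "John", the sketch attaches the path through a1 and p4 to the nearer K6 literal. Each greedy choice is locally shortest only, which is why the result is an approximation rather than a guaranteed minimum-cost tree.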

Experiments

Section 5.1 describes the data set, evaluation metrics and setups for comparison. Section 5.2 reports experimental results.

Conclusion and future work

Existing semantic web data retrieval methods can be classified into two major categories: the structured query approach and semantic search. However, it is still highly desirable to support flexible keyword search over the semantic web, since ordinary users usually do not understand the underlying data structure. Moreover, they have been accustomed to traditional keyword search for years. This paper presents a novel IR-style keyword search model for semantic web data retrieval. In our model, an

Acknowledgements

This work is supported by the National 973 Key Basic Research Program under Grant No. 2003CB317003, and the CityU Strategic Research Grants 7002102 and 7002214.

References (20)

  • Cohen, P., et al. (1987). Information retrieval by constrained spreading activation on semantic networks. Information Processing and Management.
  • Ning, X., et al. (2008). RSS: A framework enabling ranked search on the semantic web. Information Processing and Management.
  • Anyanwu, K., Maduko, A., & Sheth, A. (2005). SemRank: Ranking complex semantic relationship search results on the...
  • Balmin, A., Hristidis, V., & Papakonstantinou, Y. (2004). ObjectRank: Authority-based keyword search in databases. In...
  • Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., & Sudarshan, S. (2002). Keyword searching and browsing in...
  • Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., et al. (2005). Learning to rank using...
  • Crestani, F. (1997). Application of spreading activation techniques in information retrieval. Artificial Intelligence Review.
  • Diligenti, M., Gori, M., & Maggini, M. (2005). Learning web page scores by error back-propagation. In Proceedings of...
  • Fredman, M. L., et al. (1987). Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM.
  • Garg, N., Konjevod, G., & Ravi, R. (1998). A polylogarithmic approximation algorithm for the group Steiner tree...
There are more references available in the full text version of this article.
