Practical and effective IR-style keyword search over semantic web
Introduction
With the increasing number of ontologies encoded in RDF/S or OWL and available on the semantic web, extensive efforts have been made to provide effective retrieval facilities for the data. Existing research in this area can be classified into two major categories: the structured query approach and semantic search. The former is standardized by the W3C, which envisions that users should be able to issue expressive, SQL-like structured queries in languages such as RQL, RDQL, or SPARQL (Prudhommeaux & Seaborne, 2008) and receive a set of triples as answers. This approach is effective if users have detailed knowledge of the underlying schemas and ontology languages. In practice, however, this assumption does not always hold, since these languages are complicated even for developers of semantic web applications. An alternative approach is so-called semantic search, which usually adopts more user-friendly unstructured (Rocha, Schwabe, & Aragao, 2004) or semi-structured (Anyanwu et al., 2005, Guha et al., 2003) query strategies. However, these reported approaches are still much less attractive than keyword search, which is the most effective and successful paradigm for modern information retrieval.
This paper focuses on the problem of supporting effective keyword search over semantic web data. The semantic web data model (RDF/S or OWL) conforms to a node-labeled and edge-labeled directed graph (Hayes, 2004), where nodes represent resources or literals and edges represent properties carrying heterogeneous semantics. In our proposed model, an answer to a keyword query is a connected “minimal” subgraph that contains all the query keywords. In a real-world semantic web application, a large number of literal nodes may match individual keywords, and hence many answers may satisfy the query. Finding these answers efficiently is a non-trivial problem (e.g., consider a 5-keyword query over an RDF graph with thousands of nodes). In addition, the answers should be ranked in decreasing order of relevance; otherwise, users will be discouraged from using keyword search.
To illustrate our model, consider the simplified scientific literature knowledge base shown in Fig. 1. Assume the user issues a keyword query with the two keywords John and K6 to find useful information relating them. Fig. 2 shows four possible answers, each of which contains both keywords and conveys particular semantics. For example, one answer corresponds to a paper written by author John with the term K6 in its full text. However, there is a major difference between the answers shown in Fig. 2a and the one in Fig. 2b. The candidate answer in Fig. 2b has three literal nodes containing the same keyword K6 simultaneously. In fact, the answers in Fig. 2a are all proper subsets of it. Hence, we remove it from the final answer results because it has redundant nodes and edges. In addition, the answers in Fig. 2a are not equally useful to users. For example, one of them might be more desirable than the others, since it represents a stronger relationship between the keywords John and K6.
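The redundancy check described above can be sketched in a few lines. This is only an illustration under stated assumptions: each candidate answer is represented as a frozen set of labeled edges, the helper name `drop_redundant` is hypothetical, and the sample edges are invented for the example rather than taken from Fig. 1 or Fig. 2.

```python
from typing import FrozenSet, Iterable, List, Tuple

# Assumed representation: an edge is (subject, property, object),
# and a candidate answer is the set of edges in its subgraph.
Edge = Tuple[str, str, str]
Answer = FrozenSet[Edge]


def drop_redundant(answers: Iterable[Answer]) -> List[Answer]:
    """Keep only answers that do not properly contain another answer."""
    candidates = list(dict.fromkeys(answers))  # de-duplicate, keep input order
    # `b < a` on frozensets is the proper-subset test.
    return [a for a in candidates if not any(b < a for b in candidates)]


# Hypothetical data: `large` contains every edge of `small` plus extra ones.
small = frozenset({
    ("paper1", "author", "John"),
    ("paper1", "fullText", "K6"),
})
large = small | {
    ("paper1", "cites", "paper2"),
    ("paper2", "title", "K6"),
}

minimal_answers = drop_redundant([small, large])
```

Here `large` is discarded because `small` is a proper subset of it, which mirrors the reason given for removing the answer with redundant nodes and edges.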
The key contributions of this paper are the following:
- We propose a novel IR-style keyword search model for semantic web data retrieval. Our search scheme supports a free-form keyword query interface, which is both desirable and practical for ordinary semantic web users. The model generates answers with explicit semantics (connected subgraphs) and returns ranked top-k results effectively.
- We present an efficient search algorithm for finding answers. The key problem is that many potential answers may match a keyword query over a large data graph, so locating them efficiently is a non-trivial challenge.
- We conduct experiments over real datasets. The experimental results show that our model is feasible and delivers high-quality search results.
The rest of this paper is organized as follows. Section 2 reviews related work. We define the data model, and then describe the formal query and answer semantics in Section 3. We detail our solution for keyword search in Section 4. Section 5 reports the experimental results. Finally, we conclude the paper and point out future research directions in Section 6.
Related work
We briefly discuss the semantic search approach. The idea of semantic search was first introduced by Guha, McCool, and Miller (2003). Later, Rocha et al. (2004) presented a purely entity-centric semantic search approach, which seeks to find important entities related to a given set of keywords using a spreading activation algorithm (Cohen and Kjeldsen, 1987, Crestani, 1997). Its limitation is that “an object by itself is intensely uninteresting”. To address this issue, Anyanwu et al. (2005)
Graph data model
Assume there is an infinite set $U$ (URI references), an infinite set $B$ (blank nodes), and an infinite set $L$ (RDF literals). A 3-tuple $(s, p, o) \in (U \cup B) \times U \times (U \cup B \cup L)$ is called an RDF triple, which asserts that a resource, the subject $s$, has a property $p$ whose value is the object $o$. An RDF graph is a set of RDF triples (Hayes, 2004).
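As a rough illustration of this definition, the sketch below models the three kinds of terms and checks the usual RDF positional constraint: the subject is a URI reference or blank node, the property is a URI reference, and the object may be any term. The class names, the `is_rdf_triple` helper, and the `ex:` identifiers are assumptions made for the example, not part of the paper.

```python
from dataclasses import dataclass
from typing import Set, Tuple, Union


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    text: str


Term = Union[URI, BlankNode, Literal]
Triple = Tuple[Term, Term, Term]


def is_rdf_triple(s: Term, p: Term, o: Term) -> bool:
    """Check the positional constraints on subject, property and object."""
    return (
        isinstance(s, (URI, BlankNode))
        and isinstance(p, URI)
        and isinstance(o, (URI, BlankNode, Literal))
    )


# An RDF graph is simply a set of triples (hypothetical sample data).
rdf_graph: Set[Triple] = {
    (URI("ex:paper1"), URI("ex:author"), Literal("John")),
    (URI("ex:paper1"), URI("ex:cites"), URI("ex:paper2")),
}

all_valid = all(is_rdf_triple(*t) for t in rdf_graph)
```

A literal in the subject position, such as `(Literal("John"), URI("ex:wrote"), URI("ex:paper1"))`, fails the check.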
We extend RDF semantics and model the data as a weighted directed graph , where is a finite set of nodes denoting subjects or
An approximate algorithm
The basic idea of our proposed algorithm is as follows. Initially, an answer tree starts with a single node in any group. We then grow the answer tree by repeatedly adding shortest paths to nodes whose groups are not yet covered by the current tree. After iterations, the answer tree is inserted into a priority queue ordered by increasing cost. Later, we dequeue the tree and replace some nodes which have increasing costs. The newly obtained tree is also inserted into the queue. We first give the
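The tree-growing step outlined above can be illustrated with a small greedy sketch. It is not the authors' algorithm: it covers only the initial construction (no priority queue or node replacement), and it assumes a graph stored as a nested dictionary of non-negative edge weights and keyword groups given as sets of matching nodes. The function names and the sample graph are hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Set, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: edge weight}


def shortest_path_to_any(
    graph: Graph, sources: Set[str], targets: Set[str]
) -> Optional[Tuple[float, List[str]]]:
    """Multi-source Dijkstra: cheapest path from any source to any target."""
    dist = {s: 0.0 for s in sources}
    prev: Dict[str, str] = {}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        if u in targets:
            path = [u]
            while path[-1] in prev:  # walk back to a source node
                path.append(prev[path[-1]])
            return d, path[::-1]
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return None


def greedy_answer_tree(graph: Graph, groups: List[Set[str]], start: str):
    """Grow a tree from `start` until every keyword group is covered."""
    nodes: Set[str] = {start}
    edges: Set[Tuple[str, str]] = set()
    cost = 0.0
    while True:
        uncovered = [g for g in groups if not (g & nodes)]
        if not uncovered:
            return cost, nodes, edges
        found = shortest_path_to_any(graph, nodes, set().union(*uncovered))
        if found is None:
            return None  # some group is unreachable from the tree
        d, path = found
        cost += d
        nodes.update(path)
        edges.update(zip(path, path[1:]))


# Hypothetical graph: two literal nodes match the second keyword.
graph = {
    "john": {"p1": 1.0, "p2": 1.0},
    "p1": {"john": 1.0, "k6_a": 1.0},
    "p2": {"john": 1.0, "p3": 1.0},
    "p3": {"p2": 1.0, "k6_b": 1.0},
    "k6_a": {"p1": 1.0},
    "k6_b": {"p3": 1.0},
}
groups = [{"john"}, {"k6_a", "k6_b"}]
result = greedy_answer_tree(graph, groups, "john")
```

In this example the tree reaches `k6_a` through `p1` at total cost 2.0 and never touches the more distant `k6_b`. Because every tree node starts at distance zero, each added path leaves the tree exactly once, so the result remains a tree.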
Experiments
Section 5.1 describes the data set, evaluation metrics and setups for comparison. Section 5.2 reports experimental results.
Conclusion and future work
Existing semantic web data retrieval methods can be classified into two major categories: the structured query approach and semantic search. However, it is still very desirable to support flexible keyword search over the semantic web, since ordinary users usually do not understand the underlying data structure. Moreover, they have been accustomed to traditional keyword search for years. This paper presents a novel IR-style keyword search model for semantic web data retrieval. In our model, an
Acknowledgements
This work is supported by the National 973 Key Basic Research Program under Grant No. 2003CB317003, and the CityU Strategic Research Grants 7002102 and 7002214.
References (20)
- Cohen, P. R., & Kjeldsen, R. (1987). Information retrieval by constrained spreading activation on semantic networks. Information Processing and Management.
- et al. (2008). RSS: A framework enabling ranked search on the semantic web. Information Processing and Management.
- Anyanwu, K., Maduko, A., & Sheth, A. (2005). SemRank: Ranking complex semantic relationship search results on the...
- Balmin, A., Hristidis, V., & Papakonstantinou, Y. (2004). ObjectRank: Authority-based keyword search in databases. In...
- Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., & Sudarshan, S. (2002). Keyword searching and browsing in...
- Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., et al. (2005). Learning to rank using...
- Crestani, F. (1997). Application of spreading activation techniques in information retrieval. Artificial Intelligence Review.
- Diligenti, M., Gori, M., & Maggini, M. (2005). Learning web page scores by error back-propagation. In Proceedings of...
- Fredman, M. L., & Tarjan, R. E. (1987). Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM.
- Garg, N., Konjevod, G., & Ravi, R. (1998). A polylogarithmic approximation algorithm for the group Steiner tree...