Integrating keyword search into XML query processing

https://doi.org/10.1016/S1389-1286(00)00069-4

Abstract

Due to the popularity of the XML data format, several query languages for XML have been proposed, specially devised to handle data of which the structure is unknown, loose, or absent. While these languages are rich enough to allow for querying the content and structure of an XML document, a varying or unknown structure can make formulating queries a very difficult task. We propose an extension to XML query languages that enables keyword search at the granularity of XML elements, that helps novice users formulate queries, and also yields new optimization opportunities for the query processor. We present an implementation of this extension on top of a commercial RDBMS; we then discuss implementation choices and performance results.

Introduction

There is no doubt that XML is rapidly becoming one of the most important data formats. It is already used for scientific data (e.g., DNA sequences), in linguistics (e.g., the Treebank database at the University of Pennsylvania), to annotate large documents (e.g., Shakespeare's work), or for data exchange on the Internet (e.g., for electronic commerce). Furthermore, large software vendors, including IBM, Microsoft, and Oracle, as well as a large number of new start-ups are developing tools to manage XML data and applications which are based on XML.

One of the strengths of XML is that it can be used to represent structured data (i.e., records) as well as unstructured data (i.e., text). For example, XML can be used in a hospital to represent (structured) information about patients (e.g., name, address, birth date) and (unstructured) observations from doctors. To take advantage of this strength, however, it is important to have tools that can work effectively with both kinds of data; it is in particular important to have XML query languages which select records from the structured part of an XML document and search for information in text. For instance, it should be possible to pose one query that finds all patients that are older than 45 years and have some specific symptoms.

Keyword search is also important to query XML data with a regular structure, if the user does not know the structure or only knows the structure partially. Such a situation arises frequently on the Web; a user visits an (XML) Web site, but does not know (and does not want to know) how the data are stored at that Web site. For instance, a user who wants to buy a car on the Internet might not know how exactly the price and category of a car are represented at the dealer’s Web site; rather than looking at the DTD, the user would prefer to directly ask for all cars with price < $1000; this query involves a keyword search for price and the evaluation of a predicate on the value of price.

A third reason to integrate keyword search into XML query processing is to query several XML documents at the same time. Again, a user might be interested in buying a cheap car on the Internet; this time, however, the user wants to get information from several car dealers at once. The car dealers may store their data in different ways, but all car dealers that the user is interested in will somehow specify a price for each car. The user query will be the same as in the previous paragraph, i.e., the query will involve keyword search on price even if the user knows exactly how each car dealer stores his/her data.
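The car example can be made concrete with a small sketch. It is not the query-language extension proposed in this paper; it is a plain Python illustration, using only the standard library, of what "keyword search on price plus a predicate on its value" means when two dealers structure their data differently. The two dealer documents, the function name `cheap_offers`, and its parameters are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical dealers that store the same information differently.
DEALER_A = (
    "<cars>"
    "<car><model>Golf</model><price>900</price></car>"
    "<car><model>Passat</model><price>4500</price></car>"
    "</cars>"
)
DEALER_B = (
    "<inventory>"
    "<vehicle name='Panda'><details><price currency='USD'>750</price></details></vehicle>"
    "<vehicle name='Punto'><details><price currency='USD'>2100</price></details></vehicle>"
    "</inventory>"
)


def cheap_offers(xml_text, keyword="price", limit=1000.0):
    """Return (parent tag, value) for every element named `keyword`
    whose numeric content is below `limit`, wherever it occurs."""
    root = ET.fromstring(xml_text)
    hits = []
    for parent in root.iter():          # every element, at any depth
        for child in parent:            # its direct children
            if child.tag.lower() != keyword:
                continue
            try:
                value = float((child.text or "").strip())
            except ValueError:
                continue                # non-numeric content: predicate fails
            if value < limit:
                hits.append((parent.tag, value))
    return hits


for name, doc in (("dealer A", DEALER_A), ("dealer B", DEALER_B)):
    print(name, cheap_offers(doc))
```

The same call works on both documents because it never names a path such as `cars/car/price`; only the keyword and the predicate are fixed.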

Both regular (structured) XML query processing and keyword search have been studied extensively in previous work. (We will give an overview of related work in Section 6.) To date, however, nobody has ever shown how both features can be combined. Extending an XML query language for keyword search and showing how such an extended query language can be implemented is the purpose of this paper.

Obviously, there are many alternative ways to process XML queries with keyword search. In this work, we propose to exploit a standard, off-the-shelf relational database system (RDBMS) as much as possible. Examples of popular RDBMS products are IBM DB2, Microsoft SQL Server, and Oracle 8. Using an RDBMS has several advantages. First, as we will see, it is very easy to build an extended XML query processor that integrates keyword search on top of an RDBMS, because the RDBMS already provides most of the required functionality. Second, RDBMSs are universally available: most organizations already have one installed, so no additional costs are incurred. Third, RDBMSs allow XML data to be mixed with other (relational) data. Not all the data in the world are XML yet! Fourth, RDBMSs show very good performance for this purpose. More than twenty years of research and development have been invested into making RDBMSs the best possible general-purpose query processors, and the RDBMS vendors are continuously improving their products. In particular, RDBMSs are capable of storing and processing large volumes of data (up to terabytes).

Relational databases can be used in different ways for our purposes. In this paper, we consider two scenarios. In the first scenario, all the XML data are replicated (or initially stored) in the relational database. This scenario provides the best performance: an XML query that includes keyword search can be executed entirely by the RDBMS, thereby taking full advantage of the powerful query processing capabilities of the RDBMS and interleaving keyword search with the other operations of an XML query in the best possible way. Also, no data are moved through the network and no process boundaries need to be crossed to execute queries. In effect, this scenario shows how an RDBMS can be used as a data warehouse for XML data.

Unfortunately, it is not always possible or cost-effective to build a data warehouse. In the long run, for instance, it will not be viable for technical and legal reasons to replicate all the XML data on the Web. Therefore, we describe a second scenario in which query processing is carried out in a distributed way. In this scenario, XML documents are stored by individual data sources. An RDBMS is used to store indices which can be used to execute keyword searches and to find all relevant XML data sources for a query. In fact, the XML data sources could again be powered by an RDBMS; however, the data sources could also be implemented on top of a simple file system.

The techniques developed in this work are also applicable if an object-oriented database system (OODBMS) is used instead of an RDBMS. To some extent, our approaches are also applicable if a special-purpose XML query processor like Tamino [14] or Excelon [5] is used. The current generation of OODBMSs and special-purpose XML query processors, however, is not mature enough to process large amounts of data, so we focus on the use of RDBMSs throughout this paper. Furthermore, we will not exploit any 'object-relational' features which are currently built into many RDBMSs because these features are not useful for our needs.

In a nutshell, the goal of our work is to integrate keyword search into XML query processing and make use of existing (relational) database systems as much as possible. In the remainder of this paper, we will report on the following developments.

  1. We will show how to extend an existing XML query language in order to support keyword search. This will make it possible to query XML data without structure (i.e., text), help users to query XML documents with structure, if the users do not know the structure, and help to query multiple XML documents with the same ontology, but different DTDs.

  2. We will present an extension of inverted files in order to support keyword search. Furthermore, we show how such extended inverted files can be stored in a relational database.

  3. We will show how XML queries that involve keyword search and other operations can be entirely processed using an RDBMS, if the XML data are replicated in one relational database.

  4. We will also show how XML queries with keyword search can be executed, if the XML data cannot be stored in a relational database.

  5. We will present performance experiments that demonstrate the overheads of our approach (size of indices, etc.) and give a feeling for the cost of extended XML query processing with keyword search.

Section 2 describes the data model and query language used in this work. Section 3 presents the proposed indices for keyword search (i.e., inverted files). Section 4 discusses the role of RDBMSs in query processing in more detail. Section 5 contains performance experiments. Section 6 gives an overview of related work. Section 7 concludes this paper with suggestions for future work.

Data model and query language

Much recent work has addressed the problem of finding a formal data model and a query language for XML data. Since this remains an open problem (no common agreement has been reached yet), we describe in this section the data model and query language that are the basis for our work. However, our data model and query language are similar in spirit to the other proposals, so that the results presented in this paper can be easily adapted to those other formalisms. Assuming that the final

Relational support for full-text indexing

In this section we describe an extension of inverted files for full-text indexing. An extended inverted file can be used to implement keyword search (i.e., the contains predicates) and to find relevant XML data sources or XML elements in a distributed environment. Furthermore, we will show how inverted files can be stored in a relational database and discuss variants. How inverted files are used during query processing is detailed in Section 4.
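As a rough illustration of the idea, the following sketch stores a word-to-element index in a relational table and uses it both to find the XML elements that contain a keyword and to find the data sources worth contacting. It is a minimal sketch under stated assumptions, not the schema used in this paper: the table name `inverted_file`, its columns (`word`, `doc`, `elem_id`, `kind`), the numbering of elements in document order, and the two sample documents are all hypothetical. SQLite stands in for the RDBMS only because it ships with Python.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample sources, keyed by an illustrative document name.
DOCS = {
    "a.xml": "<patients><patient><name>Ann</name>"
             "<note>persistent cough and fever</note></patient></patients>",
    "b.xml": "<cars><car><price>900</price></car></cars>",
}


def build_inverted_file(conn, docs):
    """Index every tag name and every word of text at element granularity."""
    conn.execute(
        "CREATE TABLE inverted_file "
        "(word TEXT, doc TEXT, elem_id INTEGER, kind TEXT)"
    )
    conn.execute("CREATE INDEX idx_word ON inverted_file(word)")
    for doc, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # elem_id is simply the position of the element in document order.
        for elem_id, elem in enumerate(root.iter()):
            rows = [(elem.tag.lower(), doc, elem_id, "tag")]
            for word in re.findall(r"\w+", (elem.text or "").lower()):
                rows.append((word, doc, elem_id, "text"))
            conn.executemany(
                "INSERT INTO inverted_file VALUES (?, ?, ?, ?)", rows
            )


def elements_containing(conn, word):
    """Element-level lookup: which elements mention `word`?"""
    cur = conn.execute(
        "SELECT DISTINCT doc, elem_id FROM inverted_file "
        "WHERE word = ? ORDER BY doc, elem_id",
        (word.lower(),),
    )
    return cur.fetchall()


def sources_containing(conn, word):
    """Source-level lookup: which documents are relevant at all?"""
    cur = conn.execute(
        "SELECT DISTINCT doc FROM inverted_file WHERE word = ? ORDER BY doc",
        (word.lower(),),
    )
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")
build_inverted_file(conn, DOCS)
print(elements_containing(conn, "fever"))
print(sources_containing(conn, "price"))
```

Recording the element identifier rather than only the document is what makes the index usable at the granularity of XML elements; the `kind` column is one possible way to distinguish a match on a tag name from a match in text.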

Extended XML-QL query processing

We will now turn to a discussion of how XML-QL queries with contains predicates can be processed. We will first describe query processing in the first scenario of the introduction, in which the inverted file and all the XML data are stored or replicated in an RDBMS. After that, we will discuss the second scenario of the introduction.

Experiments

We implemented a prototype XML-QL query processor with keyword search on top of an off-the-shelf RDBMS. In this section, we will present the results of initial performance experiments conducted with our prototype. We will only report on experiments performed in a scenario in which all the XML data (in addition to the inverted file) is replicated in the RDBMS.

Related work

Both 'structured queries' and 'keyword search' have extensively been studied in the database and information retrieval literature. Specific work on XML query processing is reported in [4, 13, 16], and information retrieval techniques such as those used in current Web search engines can be used for XML just as well as for HTML or any other text data. What makes our work different is that we show how keyword search can be integrated into (structured) query processing and why this works particularly

Conclusion

We showed how an existing XML query language can be extended in order to support keyword search. Furthermore, we described how such an extended XML query language can be implemented. The most important data structure needed for keyword search is the inverted file. We gave the necessary extensions of inverted files for XML query processing and showed how inverted files can be stored and queried using a relational database system. The techniques described in this paper can easily be implemented;

References (20)

  1. P. Buneman, S.B. Davidson, G.G. Hillebrand and D. Suciu, A query language and optimization techniques for unstructured...
  2. S. Ceri, S. Comai, E. Damiani, P. Fraternali, S. Paraboschi and L. Tanca, XML-GL: a graphical language for querying and...
  3. A. Deutsch, M.F. Fernandez, D. Florescu, A.Y. Levy and D. Suciu, A query language for XML (electronic version), in:...
  4. A. Deutsch, M.F. Fernandez and D. Suciu, Storing semistructured data with STORED (electronic version), in: Proc. of ACM...
  5. Excelon from ODI:...
  6. W.B. Frakes and R.A. Baeza-Yates (Eds.), Information Retrieval: Data Structures and Algorithms, Prentice-Hall,...
  7. D. Florescu and D. Kossmann, Storing and querying XML data using an RDBMS (extended version), IEEE Data Eng. Bull. 22...
  8. D. Florescu, A. Levy, I. Manolescu and D. Suciu, Query optimization in the presence of limited access patterns...
  9. H. Galhardas, D. Florescu, D. Shasha and E. Simon, AJAX: an extensible data cleaning tool (electronic version), in:...
  10. B.R. Iyer and D. Wilhite, Data compression support in databases, in: J.B. Bocca, M. Jarke and C. Zaniolo (Eds.), Proc....
There are more references available in the full text version of this article.

Daniela Florescu received her Ph.D. in 1996, on 'Search Spaces for object oriented query optimization'. She is now a researcher at INRIA Rocquencourt, in the Caravel project. Dr. Florescu is among the authors of the XML-QL query language and the main designer of the Strudel Web-site management system. Her current research interests include XML technologies (query languages, storage, query optimization), static query optimization, data-intensive Web-site management, and data cleaning. Daniela Florescu is also a member of the W3C working group on XML query languages.

Donald Kossmann received BSc and MSc degrees in 1989 and 1991 from the University of Passau (Germany) and a Ph.D. in Computer Science in 1995 from the Technical University of Aachen (Germany). From 1995 to 1996, he was a Research Associate at the University of Maryland, College Park. Since 1996, he has been an Assistant Professor for Computer Science at the University of Passau (Germany). His research is focussed on distributed and object-oriented database systems.

Ioana Manolescu received her MSc in 1998, from Ecole Normale Supérieure, in Paris, and is now a Ph.D. student at INRIA Rocquencourt, in the Caravel project; her topic is ‘Query Optimization for Semistructured Data’. She is also interested in XML schema extraction for storage optimization, and query optimization for data integration systems. Together with Daniela Florescu and Donald Kossmann, she is currently working on a new system that allows a relational data integration engine to seamlessly integrate XML documents.

1

E-mail: {Daniela.Florescu,Ioana.Manolescu}@inria.fr

2

E-mail: [email protected]
