ABSTRACT
Dataspaces are collections of heterogeneous and partially unstructured data. Unlike data-integration systems, which also offer uniform access to heterogeneous data sources, dataspaces do not assume that all the semantic relationships between sources are known and specified. Much of the user interaction with dataspaces involves exploring the data, and users do not have a single schema against which they can pose queries. Consequently, it is important that queries be allowed to specify varying degrees of structure, ranging from keyword queries to more structure-aware queries.
This paper considers indexing support for queries that combine keywords and structure. We describe several extensions to inverted lists to capture structure when it is present. In particular, our extensions incorporate attribute labels, relationships between data items, hierarchies of schema elements, and synonyms among schema elements. We describe experiments showing that our indexing techniques improve query efficiency by an order of magnitude compared with alternative approaches, and scale well with the size of the data.
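To make the idea of an inverted list extended with attribute labels concrete, here is a minimal sketch. It is not the paper's actual index layout, which the abstract does not specify. The class name `AttributeInvertedIndex`, the `keyword//attribute` key format, and the sample items are all assumptions made up for illustration.

```python
from collections import defaultdict


class AttributeInvertedIndex:
    """Toy inverted list whose keys may carry an attribute label (hypothetical design)."""

    def __init__(self):
        # Maps an index key to the set of item ids that contain it.
        self.postings = defaultdict(set)

    def add(self, item_id, attribute, text):
        """Index every token of `text` both as a plain keyword and as keyword//attribute."""
        for token in text.lower().split():
            self.postings[token].add(item_id)
            self.postings[f"{token}//{attribute.lower()}"].add(item_id)

    def search(self, keyword, attribute=None):
        """Return sorted item ids matching the keyword, optionally restricted to an attribute."""
        key = keyword.lower()
        if attribute is not None:
            key = f"{key}//{attribute.lower()}"
        return sorted(self.postings.get(key, set()))


# Made-up sample data: two items that mention the same word under different attributes.
index = AttributeInvertedIndex()
index.add(1, "author", "Halevy")
index.add(2, "title", "Halevy lecture notes")

keyword_hits = index.search("halevy")                        # pure keyword query
structured_hits = index.search("halevy", attribute="author")  # keyword plus structure
print(keyword_hits, structured_hits)
```

In this sketch the same lookup path serves both query styles: a query with no attribute falls back to the plain keyword entry, while a query that names an attribute reads a narrower posting list. The cost is a larger index, since each token is stored twice.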