DOI: 10.1145/1247480.1247487
Article

Indexing dataspaces

Published: 11 June 2007

ABSTRACT

Dataspaces are collections of heterogeneous and partially unstructured data. Unlike data-integration systems that also offer uniform access to heterogeneous data sources, dataspaces do not assume that all the semantic relationships between sources are known and specified. Much of the user interaction with dataspaces involves exploring the data, and users do not have a single schema to which they can pose queries. Consequently, it is important that queries are allowed to specify varying degrees of structure, spanning keyword queries to more structure-aware queries.

This paper considers indexing support for queries that combine keywords and structure. We describe several extensions to inverted lists to capture structure when it is present. In particular, our extensions incorporate attribute labels, relationships between data items, hierarchies of schema elements, and synonyms among schema elements. We describe experiments showing that our indexing techniques improve query efficiency by an order of magnitude compared with alternative approaches, and scale well with the size of the data.
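As a rough illustration of the first extension mentioned above (attribute labels), the following minimal sketch stores each keyword occurrence twice in an inverted list: once as a plain term and once as a term tagged with the attribute it appeared in. The class name, method names, and the `keyword//attribute//` key format are assumptions made for this example and are not taken from the text of the abstract; relationships, schema hierarchies, and synonyms are not covered here.

```python
from collections import defaultdict


class AttributeInvertedIndex:
    """Toy inverted index (hypothetical, for illustration only).

    Each term maps to {item_id: occurrence count}. A keyword is stored both
    as a plain term and as a term labeled with the attribute it occurred in.
    """

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(int))

    @staticmethod
    def _labeled(word, attribute):
        # Assumed key format for attribute-labeled terms.
        return f"{word.lower()}//{attribute.lower()}//"

    def add(self, item_id, attribute, text):
        """Index the words of one attribute value of one data item."""
        for word in text.lower().split():
            self.postings[word][item_id] += 1
            self.postings[self._labeled(word, attribute)][item_id] += 1

    def keyword(self, word):
        """Items containing the keyword anywhere (structure-free query)."""
        return sorted(self.postings.get(word.lower(), {}))

    def predicate(self, attribute, word):
        """Items containing the keyword inside the given attribute."""
        return sorted(self.postings.get(self._labeled(word, attribute), {}))


index = AttributeInvertedIndex()
index.add("paper1", "title", "Indexing dataspaces")
index.add("paper1", "author", "Dong")
index.add("person1", "name", "Dong")

print(index.keyword("dong"))            # keyword-only query
print(index.predicate("name", "dong"))  # query that also names an attribute
```

Because both kinds of terms live in the same structure, a keyword-only query and a query that names an attribute are each answered with a single lookup, which is the general idea of folding structure into the inverted list rather than consulting a separate structural index.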


Published in

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
June 2007
1210 pages
ISBN: 9781595936868
DOI: 10.1145/1247480

General Chairs: Lizhu Zhou, Tok Wang Ling
Program Chair: Beng Chin Ooi

Copyright © 2007 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Overall Acceptance Rate: 785 of 4,003 submissions, 20%
