Article

Indexing dataspaces

Authors:
Xin Dong, University of Washington, Seattle, WA
Alon Halevy, Google Inc., Mountain View, CA

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, June 2007, Pages 43–54
https://doi.org/10.1145/1247480.1247487

Published: 11 June 2007

ABSTRACT

Dataspaces are collections of heterogeneous and partially unstructured data. Unlike data-integration systems that also offer uniform access to heterogeneous data sources, dataspaces do not assume that all the semantic relationships between sources are known and specified. Much of the user interaction with dataspaces involves exploring the data, and users do not have a single schema to which they can pose queries. Consequently, it is important that queries are allowed to specify varying degrees of structure, spanning keyword queries to more structure-aware queries.

This paper considers indexing support for queries that combine keywords and structure. We describe several extensions to inverted lists to capture structure when it is present. In particular, our extensions incorporate attribute labels, relationships between data items, hierarchies of schema elements, and synonyms among schema elements. We describe experiments showing that our indexing techniques improve query efficiency by an order of magnitude compared with alternative approaches, and scale well with the size of the data.
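
As a rough illustration of the first extension mentioned above (attribute labels), the sketch below builds a small inverted list in which each keyword is indexed twice: once on its own and once combined with the attribute it occurs in. The abstract does not give the paper's index layout, so the `keyword//attribute//` key format, the function names, and the sample items are assumptions made for this example only.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase a value and split it on whitespace (deliberately simplistic)."""
    return text.lower().split()


def build_index(items):
    """Build an inverted list over attribute-value items.

    items: dict mapping item id -> dict mapping attribute name -> text value.
    Returns a dict mapping index key -> dict mapping item id -> occurrence count.
    Each keyword is stored under two keys (an assumed, illustrative layout):
      "keyword"              matches the keyword anywhere in the item
      "keyword//attribute//" matches the keyword only in that attribute
    """
    index = defaultdict(lambda: defaultdict(int))
    for item_id, attrs in items.items():
        for attr, value in attrs.items():
            for word in tokenize(value):
                index[word][item_id] += 1
                index[f"{word}//{attr.lower()}//"][item_id] += 1
    return index


def lookup(index, keyword, attribute=None):
    """Return sorted ids of items containing the keyword, optionally within one attribute."""
    key = keyword.lower()
    if attribute is not None:
        key = f"{key}//{attribute.lower()}//"
    # .get avoids creating empty entries in the defaultdict on a miss
    return sorted(index.get(key, {}))


# Hypothetical sample data: one publication item and one person item.
items = {
    "p1": {"title": "Indexing dataspaces", "author": "Xin Dong"},
    "a1": {"name": "Xin Dong", "affiliation": "University of Washington"},
}
index = build_index(items)

print(lookup(index, "dong"))            # pure keyword query: ['a1', 'p1']
print(lookup(index, "dong", "author"))  # keyword restricted to an attribute: ['p1']
```

A single index lookup answers both the pure keyword query and the attribute-restricted query, which is the kind of mixed keyword-and-structure access the abstract describes. Relationships, schema hierarchies, and synonyms would need further key conventions that this sketch does not attempt.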

Index Terms

Indexing dataspaces
1. Information systems
  1. Information retrieval
    1. Document representation

Published in
SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
June 2007
1210 pages
ISBN: 9781595936868
DOI: 10.1145/1247480
General Chairs:
Lizhu Zhou, Tsinghua University, China
Tok Wang Ling, National University of Singapore, Singapore
Program Chair:
Beng Chin Ooi, National University of Singapore, Singapore
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
dataspace
heterogeneity
indexing
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%