research-article

Dima: a distributed in-memory similarity-based query processing system

Authors:
Ji Sun

Tsinghua University, Beijing, China

Tsinghua University, Beijing, China
View Profile

,
Zeyuan Shang

Tsinghua University, Beijing, China

Tsinghua University, Beijing, China
View Profile

,
Guoliang Li

Tsinghua University, Beijing, China

Tsinghua University, Beijing, China
View Profile

,
Dong Deng

MIT

MIT
View Profile

,
Zhifeng Bao

RMIT

RMIT
View Profile

Proceedings of the VLDB Endowment Volume 10 Issue 12pp 1925–1928https://doi.org/10.14778/3137765.3137810

Published:01 August 2017Publication History

Proceedings of the VLDB Endowment

Abstract

Data analysts in industries spend more than 80% of time on data cleaning and integration in the whole process of data analytics due to data errors and inconsistencies. It calls for effective query processing techniques to tolerate the errors and inconsistencies. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports two core similarity-based query operations, i.e., similarity search and similarity join. Dima extends the SQL programming interface for users to easily invoke these two operations in their data analysis jobs. To avoid expensive data transformation in a distributed environment, we design selectable signatures where two records approximately match if they share common signatures. More importantly, we can adaptively select the signatures to balance the workload. Dima builds signature-based global indexes and local indexes to support efficient similarity search and join. Since Spark is one of the widely adopted distributed in-memory computing systems, we have seamlessly integrated Dima into Spark and developed effective query optimization techniques in Spark. To the best of our knowledge, this is the first full-fledged distributed in-memory system that can support similarity-based query processing. We demonstrate our system in several scenarios, including entity matching, web table integration and query recommendation.

References

A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006. Google ScholarDigital Library
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007. Google ScholarDigital Library
D. Deng, G. Li, S. Hao, J. Wang, and J. Feng. Massjoin: A mapreduce-based method for scalable string similarity joins. In ICDE, pages 340--351, 2014.Google ScholarCross Ref
D. Deng, G. Li, H. Wen, and J. Feng. An efficient partition based method for exact set similarity joins. PVLDB, 9(4):360--371, 2015. Google ScholarDigital Library
Y. Jiang, G. Li, J. Feng, and W. Li. String similarity joins: An experimental evaluation. PVLDB, 7(8):625--636, 2014. Google ScholarDigital Library
G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. PVLDB, 5(3):253--264, 2011. Google ScholarDigital Library
S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, pages 743--754, 2004. Google ScholarDigital Library
R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In SIGMOD, pages 495--506, 2010. Google ScholarDigital Library
J. Wang, G. Li, and J. Feng. Fast-join: An efficient method for fuzzy token matching based string similarity join. In ICDE, pages 458--469, 2011. Google ScholarDigital Library
J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD Conference, pages 85--96, 2012. Google ScholarDigital Library
C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, pages 131--140, 2008. Google ScholarDigital Library

Index Terms

Dima: a distributed in-memory similarity-based query processing system
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information retrieval
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 10, Issue 12
August 2017
427 pages
ISSN:2150-8097
Editors:
Peter Boncz
CWI
,
Ken Salem
University of Waterloo
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2017
Published in pvldb Volume 10, Issue 12
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 9
  Total Citations
  View Citations
- 171
  Total Downloads
- Downloads (Last 12 months)20
- Downloads (Last 6 weeks)5
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Dima: a distributed in-memory similarity-based query processing system

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Big Data Analytics with R and Hadoop

Big Data Analytics

Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Dima: a distributed in-memory similarity-based query processing system

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Big Data Analytics with R and Hadoop

Big Data Analytics

Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media