Efficient Set-Correlation Operator Inside Databases

Gao, Fei; Song, Shao-Xu; Chen, Lei; Wang, Jian-Min

doi:10.1007/s11390-016-1657-z

Regular Paper
Published: 08 July 2016

Volume 31, pages 683–701 (2016)

Journal of Computer Science and Technology

Abstract

Large collections of short text records, such as news highlights, scientific paper citations, and messages posted in discussion forums, are now prevalent and are often stored as set records in hidden-Web databases. Many interesting information retrieval tasks rely on correlation queries over these short text records, such as finding hot topics in news highlights and searching for scientific papers related to a certain topic. However, current relational database management systems (RDBMSs) do not directly support set correlation queries. In this paper, we therefore address both the effectiveness and the efficiency of set correlation queries over set records in databases. First, we present a framework for set correlation queries inside databases. To the best of our knowledge, only Pearson’s correlation can be implemented to construct token correlations by using RDBMS facilities. We therefore propose a novel correlation coefficient that extends Pearson’s correlation, and provide a pure-SQL implementation inside databases. We further propose optimal strategies for setting the correlation filtering threshold, which can greatly reduce the query time. Our theoretical analysis proves that, with a proper setting of the filtering threshold, we can improve query efficiency with only a small loss of effectiveness. Finally, we conduct extensive experiments to show the effectiveness and efficiency of the proposed correlation query and optimization strategies.
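As a rough illustration of how token correlations could be built with ordinary RDBMS facilities, the sketch below computes the standard Pearson correlation (the phi coefficient) between binary token occurrences over set records and then applies a filtering threshold. It is not the extended coefficient proposed in the paper, whose definition is not given in this abstract. The table `record_token`, the function `token_correlations`, and the `threshold` parameter are hypothetical names chosen for the example. The counting is done in SQL through a self-join, while the final arithmetic is done in Python only because the square-root function is not guaranteed to be available in every SQLite build.

```python
import math
import sqlite3


def token_correlations(records, threshold=0.0):
    """Return {(a, b): phi} for co-occurring token pairs with phi >= threshold.

    records: list of token collections, one per set record (hypothetical input).
    phi is the standard Pearson correlation of two binary occurrence variables.
    """
    conn = sqlite3.connect(":memory:")
    # One row per (record, token); the schema is an assumption for this sketch.
    conn.execute(
        "CREATE TABLE record_token (rid INTEGER, token TEXT, PRIMARY KEY (rid, token))"
    )
    rows = [(rid, tok) for rid, toks in enumerate(records) for tok in set(toks)]
    conn.executemany("INSERT INTO record_token VALUES (?, ?)", rows)
    n = len(records)

    query = """
        WITH support AS (
            SELECT token, COUNT(*) AS cnt
            FROM record_token
            GROUP BY token
        ),
        pair AS (
            SELECT x.token AS a, y.token AS b, COUNT(*) AS cnt
            FROM record_token x
            JOIN record_token y ON x.rid = y.rid AND x.token < y.token
            GROUP BY x.token, y.token
        )
        SELECT p.a, p.b, p.cnt, sa.cnt, sb.cnt
        FROM pair p
        JOIN support sa ON sa.token = p.a
        JOIN support sb ON sb.token = p.b
    """

    result = {}
    for a, b, n_ab, n_a, n_b in conn.execute(query):
        denom = math.sqrt(n_a * n_b * (n - n_a) * (n - n_b))
        if denom == 0:
            continue  # a token present in every record has zero variance
        phi = (n * n_ab - n_a * n_b) / denom
        if phi >= threshold:  # correlation filtering threshold
            result[(a, b)] = phi
    conn.close()
    return result


if __name__ == "__main__":
    sample = [["db", "sql"], ["db", "sql"], ["news"], ["news", "topic"]]
    for pair, phi in sorted(token_correlations(sample, threshold=0.5).items()):
        print(pair, round(phi, 3))
```

Raising `threshold` keeps fewer token pairs, which is the trade-off the abstract describes: less correlation data to join at query time, at the cost of discarding weaker correlations.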

Author information

Authors and Affiliations

Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, 100084, China
Fei Gao, Shao-Xu Song & Jian-Min Wang
School of Software, Tsinghua University, Beijing, 100084, China
Fei Gao, Shao-Xu Song & Jian-Min Wang
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China
Lei Chen

Corresponding author

Correspondence to Shao-Xu Song.

About this article

Cite this article

Gao, F., Song, SX., Chen, L. et al. Efficient Set-Correlation Operator Inside Databases. J. Comput. Sci. Technol. 31, 683–701 (2016). https://doi.org/10.1007/s11390-016-1657-z

Received: 26 February 2016
Revised: 05 May 2016
Published: 08 July 2016
Issue Date: July 2016
DOI: https://doi.org/10.1007/s11390-016-1657-z
