Text Relevance Analysis Method over Large-Scale High-Dimensional Text Data Processing

Wang, Ling; Ding, Wei; Zhou, Tie Hua; Ryu, Keun Ho

doi:10.1007/978-3-319-24069-5_35

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9329)

Abstract

As the amount of digital information explodes in social, industrial, and scientific domains, MapReduce, a distributed computation framework, has become widely adopted for analytics on large-scale data. Using approximation algorithms to tackle large-scale data problems has likewise become an important approach in recent years. In high-dimensional text data processing in particular, Semantic Web applications and search engines must pay attention to proximity search and text relevance analysis. The main difficulties of large-scale text processing are fast comparison and relevance judgment. In this paper, we propose an approximate bit-string method for approximate search on the MapReduce platform. Experiments show excellent efficiency, effectiveness, and scalability for the proposed algorithms.
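
The abstract does not spell out how the bit strings are built, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a SimHash-style signature as a stand-in technique, and simulates the map and reduce steps in plain Python rather than on a real MapReduce cluster. All names here (`signature`, `map_phase`, `reduce_phase`, `BITS`, `prefix_bits`, `max_distance`) are hypothetical and chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

BITS = 32  # assumed signature length; not taken from the chapter


def token_hash(token: str) -> int:
    """Stable 32-bit hash of a token (the built-in hash() is salted per run)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def signature(text: str) -> int:
    """SimHash-style bit string: bit i is set when more tokens vote 1 than 0."""
    votes = [0] * BITS
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def map_phase(docs, prefix_bits=4):
    """Map step: emit (bucket key, (doc id, signature)) for each document.

    The bucket key is the top prefix_bits bits of the signature, so only
    documents sharing that prefix are compared later (an approximation).
    """
    for doc_id, text in docs.items():
        sig = signature(text)
        yield sig >> (BITS - prefix_bits), (doc_id, sig)


def reduce_phase(pairs, max_distance=8):
    """Reduce step: within each bucket, report pairs whose signatures are close."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    for members in buckets.values():
        for (id_a, sig_a), (id_b, sig_b) in combinations(sorted(members), 2):
            if hamming(sig_a, sig_b) <= max_distance:
                yield id_a, id_b
```

Comparing short fixed-length bit strings by Hamming distance is much cheaper than comparing full high-dimensional term vectors, which is the kind of quick comparison the abstract refers to. Bucketing by signature prefix trades recall for speed: similar documents whose signatures differ in the prefix bits are never compared.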

Author information

Authors and Affiliations

Department of Computer Science, School of Electrical & Computer Engineering, Northeast Dianli University, Jilin, China
Ling Wang & Wei Ding
Database/Bioinformatics Laboratory, School of Electrical & Computer Engineering, Chungbuk National University, Chungbuk, Korea
Tie Hua Zhou & Keun Ho Ryu

Corresponding author

Correspondence to Tie Hua Zhou.

Editor information

Editors and Affiliations

Universidad Complutense de Madrid, Madrid, Spain
Manuel Núñez
Wrocław University of Technology, Wroclaw, Poland
Ngoc Thanh Nguyen
Computer Science Department, Universidad Autónoma De Madrid, Madrid, Spain
David Camacho
Wrocław University of Technology, Wroclaw, Poland
Bogdan Trawiński

About this paper

Cite this paper

Wang, L., Ding, W., Zhou, T.H., Ryu, K.H. (2015). Text Relevance Analysis Method over Large-Scale High-Dimensional Text Data Processing. In: Núñez, M., Nguyen, N., Camacho, D., Trawiński, B. (eds) Computational Collective Intelligence. Lecture Notes in Computer Science, vol 9329. Springer, Cham. https://doi.org/10.1007/978-3-319-24069-5_35

DOI: https://doi.org/10.1007/978-3-319-24069-5_35
Published: 24 October 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24068-8
Online ISBN: 978-3-319-24069-5
eBook Packages: Computer Science, Computer Science (R0)
