Abstract
As the volume of digital information grows rapidly in social, industrial, and scientific domains, MapReduce, a distributed computation framework, has become widely adopted for analytics on large-scale data. In recent years, approximation algorithms have likewise become an important way to handle large-scale data problems. In high-dimensional text data processing in particular, the semantic Web and search engines must support proximity search and text relevance analysis. The main difficulties of large-scale text processing are fast comparison and relevance judgment. In this paper, we propose an approximate bit-string method for approximate search on the MapReduce platform. Experiments show that the proposed algorithms achieve excellent performance in efficiency, effectiveness, and scalability.
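The abstract does not spell out how the bit strings are built, so the following is only an illustrative sketch of the general idea and not the authors' algorithm. It assumes a SimHash-style sign-bit signature (a well-known technique swapped in here for illustration), splits each signature into bands so that similar documents meet under a common key, and simulates the map and reduce phases in plain Python. The names `signature`, `map_phase`, `reduce_phase`, and `run`, and the parameters `BITS`, `bands`, and `max_distance`, are all hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

BITS = 32  # assumed signature length; not taken from the paper


def term_bit(term, i):
    """Deterministic pseudo-random +1/-1 for a (term, bit position) pair."""
    digest = hashlib.md5(f"{i}:{term}".encode("utf-8")).digest()
    return 1 if digest[0] & 1 else -1


def signature(text, bits=BITS):
    """SimHash-style bit string: bit i is set when the weighted vote is >= 0."""
    counts = defaultdict(int)
    for term in text.lower().split():
        counts[term] += 1
    sig = 0
    for i in range(bits):
        total = sum(weight * term_bit(term, i) for term, weight in counts.items())
        if total >= 0:
            sig |= 1 << i
    return sig


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def map_phase(doc_id, text, bands=4, bits=BITS):
    """Emit one (band index, band value) key per band of the signature."""
    sig = signature(text, bits)
    width = bits // bands
    mask = (1 << width) - 1
    for b in range(bands):
        yield (b, (sig >> (b * width)) & mask), (doc_id, sig)


def reduce_phase(values, max_distance=8):
    """Compare documents that share a band; keep pairs within max_distance."""
    for (id_a, sig_a), (id_b, sig_b) in combinations(sorted(values), 2):
        distance = hamming(sig_a, sig_b)
        if distance <= max_distance:
            yield (id_a, id_b, distance)


def run(docs):
    """Simulate shuffle locally: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_phase(doc_id, text):
            groups[key].append(value)
    results = set()  # a pair may share several bands, so deduplicate
    for values in groups.values():
        results.update(reduce_phase(values))
    return sorted(results)
```

Comparing short bit strings by Hamming distance replaces a full high-dimensional vector comparison with an XOR and a bit count, which is the kind of fast comparison the abstract identifies as a core difficulty. On a real cluster the grouping step in `run` would be performed by the framework's shuffle rather than by a local dictionary.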
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wang, L., Ding, W., Zhou, T.H., Ryu, K.H. (2015). Text Relevance Analysis Method over Large-Scale High-Dimensional Text Data Processing. In: Núñez, M., Nguyen, N., Camacho, D., Trawiński, B. (eds) Computational Collective Intelligence. Lecture Notes in Computer Science, vol 9329. Springer, Cham. https://doi.org/10.1007/978-3-319-24069-5_35
DOI: https://doi.org/10.1007/978-3-319-24069-5_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24068-8
Online ISBN: 978-3-319-24069-5
eBook Packages: Computer Science, Computer Science (R0)