Text Relevance Analysis Method over Large-Scale High-Dimensional Text Data Processing

  • Conference paper
Computational Collective Intelligence

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9329)

Abstract

As the amount of digital information explodes in social, industrial, and scientific domains, MapReduce, a distributed computation framework, has become widely adopted for analytics on large-scale data. In recent years, approximation algorithms have also become an important way to handle large-scale data problems. For high-dimensional text data in particular, semantic Web applications and search engines depend on proximity search and text relevance analysis. The main difficulties of large-scale text processing are fast comparison and relevance judgment. In this paper, we propose an approximate bit string method for approximate search on the MapReduce platform. Experiments show excellent efficiency, effectiveness, and scalability for the proposed algorithms.
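
The abstract does not describe how the bit strings are built, so the following is only an illustrative sketch of the general idea, not the paper's algorithm. It assumes a SimHash-style signature (a swapped-in, well-known technique) and simulates the map and reduce steps in a single Python process. All names (`signature`, `map_phase`, `reduce_phase`) and parameters (`bits`, `bands`, `max_distance`) are hypothetical.

```python
import hashlib
from collections import defaultdict


def signature(text, bits=64):
    """Assumed SimHash-style bit string: each token votes on every bit."""
    votes = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def map_phase(docs, bits=64, bands=4):
    """Emit ((band index, band value), (doc id, signature)) pairs.

    Documents whose signatures agree on a whole band share a key.
    """
    width = bits // bands
    mask = (1 << width) - 1
    for doc_id, text in docs.items():
        sig = signature(text, bits)
        for b in range(bands):
            yield (b, (sig >> (b * width)) & mask), (doc_id, sig)


def reduce_phase(pairs, max_distance=3):
    """Group by key, then compare full signatures only within a group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    matches = set()
    for members in groups.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                (id_a, sig_a), (id_b, sig_b) = members[i], members[j]
                if id_a != id_b and hamming(sig_a, sig_b) <= max_distance:
                    matches.add(tuple(sorted((id_a, id_b))))
    return matches


docs = {
    "a": "mapreduce text relevance analysis",
    "b": "mapreduce text relevance analysis",
    "c": "softcover book shipping information",
}
candidates = reduce_phase(map_phase(docs))
```

With 4 bands and a threshold of 3 differing bits, any two signatures within the threshold must agree exactly on at least one band, so the grouping step loses no qualifying pair while avoiding a full all-pairs comparison.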

Author information

Corresponding author

Correspondence to Tie Hua Zhou.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, L., Ding, W., Zhou, T.H., Ryu, K.H. (2015). Text Relevance Analysis Method over Large-Scale High-Dimensional Text Data Processing. In: Núñez, M., Nguyen, N., Camacho, D., Trawiński, B. (eds) Computational Collective Intelligence. Lecture Notes in Computer Science, vol 9329. Springer, Cham. https://doi.org/10.1007/978-3-319-24069-5_35

  • DOI: https://doi.org/10.1007/978-3-319-24069-5_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24068-8

  • Online ISBN: 978-3-319-24069-5

  • eBook Packages: Computer Science (R0)
