Interdependence of Text Mining Quality and the Input Data Preprocessing

  • Conference paper
Artificial Intelligence Perspectives and Applications

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 347)

Abstract

The paper focuses on the application of preprocessing techniques to short informal textual documents written in different natural languages. The goal is to evaluate their impact on the quality of the results and on the computational complexity of the text mining process designed to reveal knowledge hidden in the data. An extensive number of experiments were carried out on real-world text data, applying spelling correction, stemming, stop word removal, and their combinations. Support vector machines, decision trees, and k-means, as commonly used methods, were used to analyze the text data. The text mining quality was generally not influenced significantly; however, a positive impact in the form of decreased computational complexity was observed.
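The link between preprocessing and computational complexity can be illustrated with a minimal Python sketch, which is not the authors' implementation. It assumes a bag-of-words representation, in which every distinct token becomes one feature, so a smaller vocabulary means fewer dimensions for the learning algorithm to process. The stop word list, the suffix-stripping function `crude_stem` (a stand-in for a real stemmer, not the Porter or Snowball algorithm), and the sample reviews are all hypothetical.

```python
import re

# Hypothetical, very small stop word list used only for illustration.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "very",
              "it", "to", "of", "were", "in"}

# Suffixes tried in order by the crude stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip the first matching suffix if at least three letters remain.

    This is a deliberately simple stand-in, NOT a real stemming algorithm.
    """
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, remove_stop_words=False, stem=False):
    """Apply the selected combination of preprocessing steps."""
    tokens = tokenize(text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


def vocabulary_size(docs, **options):
    """Count distinct tokens, i.e. the number of bag-of-words features."""
    return len({t for d in docs for t in preprocess(d, **options)})


# Hypothetical short informal documents.
REVIEWS = [
    "The rooms were clean and the staff was very helpful",
    "Clean room, helpful staff, and it is in the centre",
    "Cleaning was poor and the heating failed",
]

if __name__ == "__main__":
    for options in ({},
                    {"remove_stop_words": True},
                    {"remove_stop_words": True, "stem": True}):
        print(options, vocabulary_size(REVIEWS, **options))
```

On this toy data the number of features drops with each added step, while the content-bearing words survive. This mirrors, under the stated assumptions, why preprocessing can reduce computational cost without necessarily changing the quality of the results.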

Corresponding author

Correspondence to František Dařena.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Dařena, F., Žižka, J. (2015). Interdependence of Text Mining Quality and the Input Data Preprocessing. In: Silhavy, R., Senkerik, R., Oplatkova, Z., Prokopova, Z., Silhavy, P. (eds) Artificial Intelligence Perspectives and Applications. Advances in Intelligent Systems and Computing, vol 347. Springer, Cham. https://doi.org/10.1007/978-3-319-18476-0_15

  • DOI: https://doi.org/10.1007/978-3-319-18476-0_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18475-3

  • Online ISBN: 978-3-319-18476-0

  • eBook Packages: Engineering, Engineering (R0)
