Abstract
The paper focuses on the application of preprocessing techniques to short, informal text documents written in different natural languages. The goal is to evaluate how preprocessing affects the quality of the results and the computational complexity of a text mining process designed to reveal knowledge hidden in the data. An extensive number of experiments were carried out on real-world text data, applying spelling correction, stemming, stop word removal, and their combinations. Support vector machines, decision trees, and the k-means algorithm, as commonly used methods, were used to analyze the text data. The text mining quality was generally not influenced significantly; however, a positive impact in the form of decreased computational complexity was observed.
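To illustrate why such preprocessing can lower computational cost, the following minimal sketch counts the distinct terms (the dimensionality of a bag-of-words representation) before and after stop word removal and stemming. It is not the authors' pipeline: the stop word list, the suffix-stripping rule, and the three sample reviews are hypothetical stand-ins, and the stemmer is a deliberately crude substitute for a real Snowball stemmer.

```python
import re

# Hypothetical, tiny stop word list; a real experiment would use a full
# list for each language involved.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "were", "it", "very", "to", "of", "in"}

# Illustrative suffixes only; this is NOT the Porter or Snowball algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vocabulary(docs, stop=False, stem=False):
    """Return the set of distinct terms after the selected preprocessing steps."""
    vocab = set()
    for doc in docs:
        tokens = tokenize(doc)
        if stop:
            tokens = remove_stop_words(tokens)
        if stem:
            tokens = [crude_stem(t) for t in tokens]
        vocab.update(tokens)
    return vocab


# Made-up sample documents standing in for short informal reviews.
REVIEWS = [
    "The room was clean and the staff were helpful",
    "Rooms were cleaned daily, staff helped a lot",
    "It was a very noisy room",
]

if __name__ == "__main__":
    for stop, stem in [(False, False), (True, False), (False, True), (True, True)]:
        size = len(vocabulary(REVIEWS, stop=stop, stem=stem))
        print(f"stop words removed={stop}, stemmed={stem}: {size} distinct terms")
```

Each preprocessing step can only merge or drop terms, so the vocabulary never grows. A smaller vocabulary means fewer features for a classifier or clustering algorithm to process, which is the kind of complexity reduction the abstract refers to; whether result quality changes has to be measured separately.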
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Dařena, F., Žižka, J. (2015). Interdependence of Text Mining Quality and the Input Data Preprocessing. In: Silhavy, R., Senkerik, R., Oplatkova, Z., Prokopova, Z., Silhavy, P. (eds) Artificial Intelligence Perspectives and Applications. Advances in Intelligent Systems and Computing, vol 347. Springer, Cham. https://doi.org/10.1007/978-3-319-18476-0_15
DOI: https://doi.org/10.1007/978-3-319-18476-0_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18475-3
Online ISBN: 978-3-319-18476-0
eBook Packages: Engineering, Engineering (R0)