Interdependence of Text Mining Quality and the Input Data Preprocessing

Dařena, František; Žižka, Jan

doi:10.1007/978-3-319-18476-0_15

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 347)

Abstract

The paper focuses on the application of preprocessing techniques to short informal textual documents written in different natural languages. The goal is to evaluate their impact on the quality of the results and on the computational complexity of the text mining process designed to reveal knowledge hidden in the data. An extensive number of experiments was carried out on real-world text data, applying spelling correction, stemming, stop-word removal, and their combinations. Support vector machines, decision trees, and k-means, as commonly used methods, were considered for analyzing the text data. The text mining quality was generally not influenced significantly; however, a positive impact in the form of decreased computational complexity was observed.
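One plausible reason for the lower computational cost is that stop-word removal and stemming shrink the vocabulary, and therefore the number of features the learning algorithms must handle. The following is a minimal sketch of that effect, not the authors' pipeline: the documents, the stop-word list, and the `crude_stem` suffix stripper are all hypothetical stand-ins (a real experiment would use a proper stemmer and language-specific stop lists), and spelling correction is omitted.

```python
import re

# Hypothetical toy resources; not the data or tools used in the paper.
STOP_WORDS = {"the", "a", "is", "was", "and", "to", "of", "in", "it", "very"}
SUFFIXES = ("ing", "ed", "es", "s")  # crude suffix stripping, not a real stemmer


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip one common English suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vocabulary(docs, remove_stop_words=False, stem=False):
    """Return the set of distinct terms (the feature space) for the documents."""
    terms = set()
    for doc in docs:
        for token in tokenize(doc):
            if remove_stop_words and token in STOP_WORDS:
                continue
            terms.add(crude_stem(token) if stem else token)
    return terms


# Made-up short informal documents, for illustration only.
docs = [
    "The room was very clean and the staff cleaned it daily",
    "Rooms were cleaned, and the staff is helping guests",
    "Helpful staff helped us to book a room in the hotel",
]

raw_vocab = vocabulary(docs)
reduced_vocab = vocabulary(docs, remove_stop_words=True, stem=True)

print("terms without preprocessing:", len(raw_vocab))
print("terms after stop-word removal and stemming:", len(reduced_vocab))
```

Each distinct term becomes one dimension of the document vectors passed to a classifier or clustering algorithm, so a smaller vocabulary means smaller vectors. Whether accuracy changes is a separate question that this sketch does not address.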

Author information

Authors and Affiliations

Department of Informatics, Faculty of Business and Economics, Mendel University, Zemědělská 1, 613 00, Brno, Czech Republic
František Dařena & Jan Žižka

Corresponding author

Correspondence to František Dařena.

Editor information

Editors and Affiliations

Faculty of Applied Informatics, Tomas Bata University in Zlín, Zlín, Czech Republic
Radek Silhavy, Roman Senkerik, Zuzana Kominkova Oplatkova, Zdenka Prokopova & Petr Silhavy

About this paper

Cite this paper

Dařena, F., Žižka, J. (2015). Interdependence of Text Mining Quality and the Input Data Preprocessing. In: Silhavy, R., Senkerik, R., Oplatkova, Z., Prokopova, Z., Silhavy, P. (eds) Artificial Intelligence Perspectives and Applications. Advances in Intelligent Systems and Computing, vol 347. Springer, Cham. https://doi.org/10.1007/978-3-319-18476-0_15

DOI: https://doi.org/10.1007/978-3-319-18476-0_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18475-3
Online ISBN: 978-3-319-18476-0
eBook Packages: Engineering, Engineering (R0)
