Abstract
Word boundary ambiguity is a major problem for Thai morphological analysis because Thai words are written consecutively with no word delimiters. However, the part-of-speech (POS) tagged corpus that has been used is constructed from academic papers, and no previous research has worked on documents written in informal language. This paper presents Thai morphological analysis with unknown word boundary detection using both POS-tagged and untagged corpora. The Viterbi algorithm and the Maximum Entropy (ME)-Viterbi algorithm are employed separately to evaluate our methods. The unknown word problem is handled by using string length to estimate word boundaries. The experiments are performed on documents written in formal language and on documents written in informal language. They show that the proposed method of using an untagged corpus in addition to a tagged corpus is effective for text written in informal language.
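As a rough illustration of the general idea only, the sketch below segments an undelimited string with a Viterbi-style dynamic program and scores out-of-lexicon strings by their length. It is not the method of the paper: it uses unigram word scores with no POS tags, Latin toy data instead of Thai, and every name and number in it (LEXICON, MAX_WORD_LEN, UNKNOWN_BASE, UNKNOWN_PER_CHAR, score, segment) is a hypothetical choice made for this example.

```python
import math

# Toy unigram log-probabilities (hypothetical values, not taken from the paper).
LEXICON = {
    "the": math.log(0.5),
    "cat": math.log(0.2),
    "sat": math.log(0.2),
    "a": math.log(0.1),
}

MAX_WORD_LEN = 6         # longest candidate span considered (assumption)
UNKNOWN_BASE = -5.0      # assumed fixed cost of proposing any unknown word
UNKNOWN_PER_CHAR = -3.0  # assumed extra cost per character of an unknown word


def score(span):
    """Log-score of a candidate word; unknown spans are scored by length."""
    if span in LEXICON:
        return LEXICON[span]
    return UNKNOWN_BASE + UNKNOWN_PER_CHAR * len(span)


def segment(text):
    """Return the highest-scoring segmentation of text as a list of words."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score of any split of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            cand = best[start] + score(text[start:end])
            if cand > best[end]:
                best[end] = cand
                back[end] = start
    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]
```

With these toy scores, segment("thecatsat") yields the three lexicon words, and segment("thezogsat") keeps the unknown string "zog" as a single word because the fixed cost makes one longer unknown span cheaper than several short ones. A tagged model would instead score word and tag pairs along each path.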
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Luangpiensamut, W., Komiya, K., Kotani, Y. (2012). Using Tagged and Untagged Corpora to Improve Thai Morphological Analysis with Unknown Word Boundary Detections. In: Anthony, P., Ishizuka, M., Lukose, D. (eds) PRICAI 2012: Trends in Artificial Intelligence. PRICAI 2012. Lecture Notes in Computer Science, vol 7458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32695-0_69
DOI: https://doi.org/10.1007/978-3-642-32695-0_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32694-3
Online ISBN: 978-3-642-32695-0
eBook Packages: Computer Science, Computer Science (R0)