Abstract
This paper presents a semi-supervised learning method that enhances biomedical named entity classification with Feature Coupling Degree (FCD) features, which are generated from labeled data and terabyte-scale unlabeled data. Highly discriminative context words are selected from labeled free text using the chi-square method, and queries formed by combining a named entity with each context word are submitted to a search engine. The returned web page counts are then converted into binary features by discretization. We investigate the effect of this type of feature on a biomedical corpus generated from several online resources. A Support Vector Machine (SVM) is used as the classifier, and the performance of different features is compared across various kernels and discretization methods. The results show that the method improves classification performance, especially for Out-of-Vocabulary (OOV) terms and for relatively small training sets. In addition, FCD features alone, used with polynomial kernels, achieve performance competitive with classical features.
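The feature-generation steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the quoted query format, the bin thresholds, and the `page_count` callable (a stand-in for a real search engine) are all assumptions made for illustration. Note that the plain chi-square statistic is two-sided, so it ranks words that are strongly absent from the target class as highly as words that are strongly present.

```python
import bisect
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 word/class contingency table.

    a: in-class contexts containing the word
    b: out-of-class contexts containing the word
    c: in-class contexts without the word
    d: out-of-class contexts without the word
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def top_context_words(contexts, labels, target, k):
    """Return the k words most associated (either way) with the target class."""
    in_class, out_class = Counter(), Counter()
    n_in = sum(1 for y in labels if y == target)
    n_out = len(labels) - n_in
    for words, y in zip(contexts, labels):
        counter = in_class if y == target else out_class
        for w in set(words):  # count each word once per context
            counter[w] += 1
    scores = {}
    for w in set(in_class) | set(out_class):
        a, b = in_class[w], out_class[w]
        scores[w] = chi_square(a, b, n_in - a, n_out - b)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def binarize_count(count, thresholds):
    """One-hot encode a page count into len(thresholds) + 1 bins.

    The thresholds are hypothetical; the paper compares several
    discretization methods that are not reproduced here.
    """
    bin_index = bisect.bisect_right(thresholds, count)
    return [1 if i == bin_index else 0 for i in range(len(thresholds) + 1)]


def fcd_features(entity, context_words, page_count, thresholds):
    """Concatenate binarized page counts for each entity/context-word query.

    page_count is any callable mapping a query string to an integer;
    the quoted query format below is an assumption.
    """
    features = []
    for w in context_words:
        query = f'"{entity}" "{w}"'
        features.extend(binarize_count(page_count(query), thresholds))
    return features
```

In use, `page_count` could be backed by a dictionary of cached counts, and the resulting binary vector would be passed to an SVM alongside, or instead of, classical features.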
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, Y., Lin, H., Yang, Z. (2008). Enhancing Biomedical Named Entity Classification Using Terabyte Unlabeled Data. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_71
DOI: https://doi.org/10.1007/978-3-540-68636-1_71
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science, Computer Science (R0)