Classifying Digital Resources in a Practical and Coherent Way with Easy-to-Get Features

Chen, Chong; Yan, Hongfei; Li, Xiaoming

doi:10.1007/978-3-540-89447-6_18

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5345)

Abstract

With a rich variety of forms and types, digital resources are complex data objects. They grow fast in volume on the Web but are hard to classify efficiently. The paper presents a practical classification solution using features from the file names and extensions of digital resources. These features are easy to get and common to all resources. However, they are generally low-frequency and sparse, which suggests that a statistical approach may not work well. Our solution combines a Naive Bayes (NB) classifier with Simple Good-Turing (SGT) probability estimation, which shows great promise under these conditions, with a total accuracy of 80%. In our opinion, the results arise because 1) the features fit NB's conditional independence hypothesis well; and 2) the abundant one-time-occurrence features lead to a reasonable probability estimate for unobserved features, which also means that a general feature selection strategy is not needed in this case. A 7.4TB digital resource collection, CDAL, is used to train and evaluate the model.

The research is supported by PRC MOST Grant 2006BAH02A10, 863 Grant 2006AA01Z196 and 863 Grant 2007AA01Z154.
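
The combination named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tokenizer, the class labels, and the example file names are hypothetical. The estimator is a simplified Good-Turing step that only reserves the mass N1/N (the share of one-time-occurrence features) for unseen features; full Simple Good-Turing additionally smooths the frequency-of-frequency counts, which is omitted here. The clamp on the unseen mass and the way that mass is split across unseen feature types are assumptions made so the toy example stays well defined.

```python
import math
import re
from collections import Counter


def features(filename):
    """Lowercased file-name tokens plus one extension feature (hypothetical scheme)."""
    stem, dot, ext = filename.lower().rpartition(".")
    if not dot:  # no "." in the name: treat everything as the stem
        stem, ext = ext, ""
    tokens = [t for t in re.split(r"[\W_]+", stem) if t]
    if ext:
        tokens.append("ext:" + ext)
    return tokens


class GoodTuringNB:
    """Naive Bayes with a simplified Good-Turing reserve for unseen features."""

    def fit(self, names, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = {label: Counter() for label in self.class_counts}
        for name, label in zip(names, labels):
            self.feat_counts[label].update(features(name))
        vocab = {f for counts in self.feat_counts.values() for f in counts}
        self.totals, self.unseen_mass, self.unseen_types = {}, {}, {}
        for label, counts in self.feat_counts.items():
            total = sum(counts.values())
            n1 = sum(1 for c in counts.values() if c == 1)
            self.totals[label] = total
            # Good-Turing: probability mass of unseen features is about N1 / N.
            # Clamping to [1e-6, 0.5] is an assumption for very small samples.
            self.unseen_mass[label] = min(max(n1 / max(total, 1), 1e-6), 0.5)
            # Assumption: unseen mass is shared equally by the vocabulary
            # entries this class has not observed (at least one).
            self.unseen_types[label] = max(len(vocab) - len(counts), 1)
        return self

    def log_prob(self, label, feat):
        counts = self.feat_counts[label]
        p0 = self.unseen_mass[label]
        if feat in counts:
            # Seen features share the remaining (1 - p0) mass by relative count.
            return math.log((1 - p0) * counts[feat] / self.totals[label])
        return math.log(p0 / self.unseen_types[label])

    def predict(self, name):
        n = sum(self.class_counts.values())

        def score(label):
            prior = math.log(self.class_counts[label] / n)
            return prior + sum(self.log_prob(label, f) for f in features(name))

        return max(self.class_counts, key=score)


# Toy training data; names and labels are invented for illustration.
names = ["holiday_photo_01.jpg", "family_photo.png",
         "lecture_notes_ch1.pdf", "thesis_draft_notes.doc"]
labels = ["image", "image", "document", "document"]
clf = GoodTuringNB().fit(names, labels)
print(clf.predict("beach_photo.jpg"), clf.predict("course_notes.pdf"))
```

Because every feature contributes an independent log-probability term, a file name with mostly unseen tokens can still be classified from the one or two tokens (often the extension) that were observed in training.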

Author information

Authors and Affiliations

Computer Networks and Distributed Systems Laboratory, School of EECS, Peking University, 100871, Beijing, China
Chong Chen, Hongfei Yan & Xiaoming Li
Department of Information Management, School of Management, Beijing Normal University, 100875, Beijing, China
Chong Chen

Editor information

Editors and Affiliations

Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi Kohoku-ku, 223-8522, Yokohama, Japan
Takahira Yamaguchi

About this paper

Cite this paper

Chen, C., Yan, H., Li, X. (2008). Classifying Digital Resources in a Practical and Coherent Way with Easy-to-Get Features. In: Yamaguchi, T. (ed.) Practical Aspects of Knowledge Management. PAKM 2008. Lecture Notes in Computer Science, vol 5345. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89447-6_18

DOI: https://doi.org/10.1007/978-3-540-89447-6_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89446-9
Online ISBN: 978-3-540-89447-6
eBook Packages: Computer Science, Computer Science (R0)
