ABSTRACT
In this paper we present an approach to structure learning for web documents. The aim is to support webgenre tagging in web corpus linguistics. A central finding of the paper is that purely structure-oriented approaches to web document classification provide an information gain that can be exploited in approaches combining web content and structure analysis.
- Web corpus mining by instance of Wikipedia