DOI: 10.5555/1628297.1628307

Web corpus mining by instance of Wikipedia

Published: 01 April 2006

ABSTRACT

This paper presents an approach to structure learning for web documents, as a step toward webgenre tagging in web corpus linguistics. A central finding is that purely structure-oriented approaches to web document classification provide an information gain that can be exploited in approaches combining web content and structure analysis.


Published in

WAC '06: Proceedings of the 2nd International Workshop on Web as Corpus
April 2006
82 pages

Publisher: Association for Computational Linguistics, United States