Web corpus mining by instance of Wikipedia

Authors:
Rüdiger Gleim, Bielefeld University, Bielefeld, Germany
Alexander Mehler, Bielefeld University, Bielefeld, Germany
Matthias Dehmer, Technische Universität Darmstadt

WAC '06: Proceedings of the 2nd International Workshop on Web as Corpus, April 2006, Pages 67–74

Published: 01 April 2006

ABSTRACT

This paper presents an approach to structure learning for web documents, motivated by the goal of webgenre tagging in web corpus linguistics. A central finding is that purely structure-oriented approaches to web document classification provide an information gain that can be exploited in approaches combining web content analysis with structure analysis.


Index terms: Computing methodologies > Machine learning > Learning paradigms > Unsupervised learning

Published in
WAC '06: Proceedings of the 2nd International Workshop on Web as Corpus
April 2006, 82 pages
Program Chairs: Adam Kilgarriff, Marco Baroni
Publisher: Association for Computational Linguistics, United States
