Utilising Wikipedia for Text Mining Applications

Author: Muhammad Atif Qureshi
National University of Ireland, Galway and University of Milano-Bicocca

ACM SIGIR Forum, Volume 49, Issue 2, December 2015, pp. 150–151. https://doi.org/10.1145/2888422.2888449

Published: 29 January 2016

Abstract

Text mining broadly refers to the process of drawing inferences from textual data. To ensure the quality and effectiveness of these inferences, several approaches have been proposed for different text mining applications. Among these applications, classifying a piece of text into pre-defined classes using training data is a supervised approach, while arranging related documents or terms into clusters is an unsupervised approach. In both cases, processing is undertaken at the level of documents to make sense of the text within them. Recent research has begun exploring the role of knowledge bases in solving problems that arise in text mining. Of these knowledge bases, Wikipedia, one of the largest human-curated online encyclopaedias, has proven to be one of the most valuable resources for such problems. However, previous Wikipedia-based research has not used Wikipedia categories and Wikipedia articles together as a source of information.

This thesis serves as a first step towards closing this gap. Throughout its contributions, we show the effectiveness of the Wikipedia category-article structure for various text mining tasks. Wikipedia categories are organised taxonomically and serve as semantic tags for Wikipedia articles, which provides a strong abstraction and an expressive mode of knowledge representation. We explore the effectiveness of this mode of expression (i.e., the category-article structure) by applying it to text classification, subjectivity analysis (via a notion of "perspective" in news search), and keyword extraction.

First, we show the effectiveness of exploiting Wikipedia for two classification tasks: (1) classifying tweets as relevant or irrelevant to an entity or brand, and (2) classifying tweets into topical dimensions such as workplace, innovation, etc. To do so, we define a notion of relatedness between the text of a tweet and the information embedded in the Wikipedia category-article structure. Then, we present an application in news search that uses the same notion of relatedness to show additional information with each search result, highlighting the amount of perspective or subjective bias in each returned result towards a certain opinion, topical drift, etc. Finally, we present a keyword extraction strategy that uses community detection over Wikipedia categories to discover related keywords arranged in different communities.

The relationship between Wikipedia categories and articles is explored via a textual phrase matching framework, whose starting point is textual phrases that match the titles or redirects of Wikipedia articles. The categories associated with each matched article are then extracted and used to derive various structural measures, such as those relating to taxonomical depth and the Wikipedia articles they contain. These measures are used in our proposed text classification, subjectivity analysis, and keyword extraction frameworks, and performance is analysed via extensive experimental evaluations. In these evaluations, our Wikipedia framework, based on its category-article structure, outperforms the standard text mining approaches from the literature against which it is compared.
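The pipeline described above (match phrases against article titles or redirects, collect the categories of the matched articles, then derive a structural measure such as taxonomical depth) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy dictionaries stand in for a real Wikipedia dump, the function names are hypothetical, greedy longest-match is one plausible matching strategy, and depth is taken here to be the shortest path from a category up to an assumed root category. The abstract does not specify how the thesis defines or computes these measures.

```python
from collections import deque

# Hypothetical toy data standing in for a Wikipedia dump (not real dump contents).
TITLES = {  # lower-cased article title or redirect -> canonical article title
    "apple": "Apple",
    "apple computer": "Apple Inc.",  # treated here as a redirect
    "iphone": "IPhone",
}
ARTICLE_CATEGORIES = {  # article -> categories that tag it
    "Apple": ["Fruits"],
    "Apple Inc.": ["Technology companies"],
    "IPhone": ["Smartphones"],
}
PARENT_CATEGORIES = {  # category -> parent categories in the taxonomy
    "Technology companies": ["Companies"],
    "Smartphones": ["Mobile phones"],
    "Mobile phones": ["Technology"],
    "Companies": ["Main topic"],
    "Technology": ["Main topic"],
    "Fruits": ["Food"],
    "Food": ["Main topic"],
    "Main topic": [],
}
ROOT = "Main topic"  # assumed root of the toy taxonomy


def match_phrases(text, max_len=3):
    """Greedily match the longest word n-grams against titles and redirects."""
    tokens = text.lower().split()
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in TITLES:
                matches.append(TITLES[phrase])
                i += n
                break
        else:
            i += 1  # no title starts at this token
    return matches


def category_depth(category):
    """Shortest number of parent links from a category up to ROOT (BFS)."""
    queue, seen = deque([(category, 0)]), {category}
    while queue:
        node, depth = queue.popleft()
        if node == ROOT:
            return depth
        for parent in PARENT_CATEGORIES.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return None  # category is not connected to ROOT


def category_profile(text):
    """Map each category of the matched articles to its taxonomical depth."""
    profile = {}
    for article in match_phrases(text):
        for category in ARTICLE_CATEGORIES.get(article, []):
            profile[category] = category_depth(category)
    return profile
```

With this toy data, `category_profile("new apple computer beats the iphone")` matches the redirect "apple computer" before the shorter title "apple", so the text is tagged with the categories of Apple Inc. and IPhone rather than those of the fruit. A profile of this kind is the sort of input a relatedness score between a text and a set of categories could be built on.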

Published in
ACM SIGIR Forum, Volume 49, Issue 2
December 2015
141 pages
ISSN: 0163-5840
DOI: 10.1145/2888422
Editors: Ben Carterette (University of Delaware, Newark, DE, USA), Craig Macdonald (University of Glasgow, Glasgow, UK)
Copyright © 2016 Author
Publisher: Association for Computing Machinery, New York, NY, United States
Qualifiers: column