Authors:
Jingwen Wang and Jie Wang
Affiliation:
University of Massachusetts, United States
Keyword(s):
Article Extraction, Text Automation, Density, Similarity.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Business Analytics; Data Analytics; Data Engineering; Information Extraction; Interactive and Online Data Mining; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Soft Computing; Symbolic Systems; Web Mining
Abstract:
We present a new method called qRead to achieve real-time content extraction from web pages with high
accuracy. Early approaches to content extraction include empirical filtering rules, Document Object Model
(DOM) trees, and machine learning models. These methods, while they have met with some success, may not
meet the requirements of real-time extraction with high accuracy. For example, constructing a DOM tree on
a complex web page is time-consuming, and using machine learning models could make things unnecessarily
complicated. Unlike previous approaches, qRead uses segment densities and similarities to
identify main contents. In particular, qRead first filters obvious junk contents, eliminates HTML tags, and partitions
the remaining text into natural segments. It then uses the highest ratio of words over the number of lines
in a segment, combined with the similarity between the segment and the title, to identify main contents. We show,
through extensive experiments, that qRead achieves a 96.8% accuracy on Chinese web pages with an average
extraction time of 13.20 milliseconds, and a 93.6% accuracy on English web pages with an average extraction
time of 11.37 milliseconds, providing substantial improvements in accuracy over previous approaches and
meeting the real-time extraction requirement.
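
The pipeline described in the abstract (filter junk, strip tags, partition into segments, then rank by density and title similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the junk filters, the similarity measure, or how density and similarity are combined. The set of junk elements, the cosine similarity over word counts, the `weight` parameter, the scoring formula, and all function names below are assumptions made for illustration. The tokenizer only handles space-delimited text such as English; Chinese pages would need a different word segmenter.

```python
import re
from collections import Counter
from math import sqrt

# Assumption: "obvious junk" is taken to mean head/script/style/noscript blocks and comments.
JUNK_BLOCKS = re.compile(r"(?is)<(head|script|style|noscript)\b.*?</\1\s*>|<!--.*?-->")
TAGS = re.compile(r"(?s)<[^>]+>")
WORDS = re.compile(r"[A-Za-z0-9']+")


def tokenize(text):
    """Lowercased word tokens (English-style text only)."""
    return [w.lower() for w in WORDS.findall(text)]


def split_segments(html):
    """Drop junk blocks and tags, then group consecutive non-empty lines into segments."""
    text = JUNK_BLOCKS.sub("\n", html)
    text = TAGS.sub("", text)
    segments, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def density(segment):
    """Ratio of the number of words to the number of lines in a segment."""
    return sum(len(tokenize(line)) for line in segment) / len(segment)


def cosine(a, b):
    """Cosine similarity between two token lists (assumed similarity measure)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def extract_main(html, title, weight=1.0):
    """Return the segment with the best combined score (hypothetical combination rule)."""
    segments = split_segments(html)
    if not segments:
        return ""
    title_tokens = tokenize(title)

    def score(segment):
        similarity = cosine(tokenize(" ".join(segment)), title_tokens)
        return density(segment) * (1.0 + weight * similarity)

    return "\n".join(max(segments, key=score))
```

Calling `extract_main(page_html, page_title)` on a page whose navigation and footer consist of short one-word lines should favor the article body, since long text lines raise the words-per-line ratio and shared vocabulary with the title raises the similarity term. Because the sketch relies only on regular expressions and a single pass over the lines, it avoids building a DOM tree, which is consistent with the real-time motivation stated above.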