Web data extraction, applications and techniques: A survey

doi:10.1016/j.knosys.2014.07.007

Knowledge-Based Systems

Volume 70, November 2014, Pages 301-323

https://doi.org/10.1016/j.knosys.2014.07.007

Abstract

Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction.

This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, and this offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains.

Introduction

Web Data Extraction systems are a broad class of software applications aimed at extracting data from Web sources [81], [11]. A Web Data Extraction system usually interacts with a Web source and extracts data stored in it: for instance, if the source is an HTML Web page, the extracted content could consist of elements in the page as well as the full text of the page itself. Finally, extracted data might be post-processed, converted into the most convenient structured format and stored for further usage [130], [63].
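As a minimal illustration of the pipeline just described (read an HTML page, extract some elements and the full text, convert the result into a structured format), the following Python sketch may help. It relies only on the standard library; the class name `LinkAndTextExtractor`, the function `extract` and the sample page are hypothetical names introduced here for illustration and do not correspond to any system discussed in this survey.

```python
from html.parser import HTMLParser
import json


class LinkAndTextExtractor(HTMLParser):
    """Collects hyperlinks (page elements) and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return the extracted data as a plain dictionary (a structured record)."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    parser.close()
    return {"links": parser.links, "text": " ".join(parser.text_parts)}


# Hypothetical sample page; a real system would first fetch it from a Web source.
SAMPLE_PAGE = """<html><head><title>Catalog</title><style>p{color:red}</style></head>
<body><h1>Products</h1><p>Blue mug, 9 EUR. <a href="/mug">Details</a></p></body></html>"""

record = extract(SAMPLE_PAGE)
print(json.dumps(record, indent=2))  # post-processing step: serialize as JSON
```

In a real setting the serialized record would be stored in a database or passed on to further analysis; the JSON output here only stands in for that step.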

Web Data Extraction systems find extensive use in a wide range of applications including the analysis of text-based documents available to a company (like e-mails, support forums, technical and legal documentation, and so on), Business and Competitive Intelligence [9], crawling of Social Web platforms [17], [52], Bio-Informatics [99] and so on. The importance of Web Data Extraction systems depends on the fact that a large (and steadily growing) amount of data is continuously produced, shared and consumed online: Web Data Extraction systems make it possible to collect these data efficiently with limited human effort. The availability and analysis of collected data is an indispensable requirement to understand the complex social, scientific and economic phenomena which generate the data. For example, collecting digital traces produced by users of Social Web platforms like Facebook, YouTube or Flickr is the key step to understand, model and predict human behavior [68], [94], [3].

In the commercial field, the Web provides a wealth of public domain information. A company can probe the Web to acquire and analyze information about the activity of its competitors. This process is known as Competitive Intelligence [22], [125] and it is crucial to quickly identify the opportunities provided by the market, to anticipate the decisions of the competitors as well as to learn from their failures and successes.

The design and implementation of Web Data Extraction systems have been discussed from different perspectives and leverage scientific methods coming from various disciplines including Machine Learning, Logic and Natural Language Processing.

In the design of a Web Data Extraction system, many factors must be taken into account; some of them are independent of the specific application domain in which we plan to perform Web Data Extraction. Other factors, instead, heavily depend on the particular features of the application domain: as a consequence, some technological solutions which appear to be effective in some application contexts are not suitable in others.

In its most general formulation, the problem of extracting data from the Web is hard because it is constrained by several requirements. The key challenges we can encounter in the design of a Web Data Extraction system can be summarized as follows:

• Web Data Extraction techniques implemented in a Web Data Extraction system often require the help of human experts. A first challenge consists of providing a high degree of automation by reducing human effort as much as possible. Human feedback, however, may play an important role in raising the level of accuracy achieved by a Web Data Extraction system. A related challenge is, therefore, to identify a reasonable trade-off between the need to build highly automated Web Data Extraction procedures and the requirement of achieving accurate performance.
• Web Data Extraction techniques should be able to process large volumes of data in a relatively short time. This requirement is particularly stringent in the field of Business and Competitive Intelligence because a company needs to perform timely analysis of market conditions.
• Applications in the field of Social Web or, more generally, those dealing with personal data must provide solid privacy guarantees. Therefore, potential (even if unintentional) attempts to violate user privacy should be identified and counteracted promptly and adequately.
• Approaches relying on Machine Learning often require a significantly large training set of manually labeled Web pages. In general, the task of labeling pages is time-consuming and error-prone and, therefore, in many cases we cannot assume the existence of labeled pages.
• Oftentimes, a Web Data Extraction tool has to routinely extract data from a Web Data source which can evolve over time. Web sources are continuously evolving and structural changes happen with no forewarning, and are thus unpredictable. As a result, in real-world scenarios the need emerges to maintain these systems, which might stop working correctly if they lack the flexibility to detect and handle structural modifications of the related Web sources.
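The last challenge, detecting that a Web source has changed its structure, can be illustrated with a simple heuristic. The Python sketch below summarizes a page as the set of its root-to-element tag paths and compares two snapshots with the Jaccard coefficient. This is only an illustrative assumption, not a method proposed in the surveyed literature; the names `tag_paths`, `structural_similarity` and `layout_changed`, as well as the threshold value 0.8, are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records every distinct root-to-element tag path, e.g. 'html/body/ul/li'."""

    def __init__(self):
        super().__init__()
        self._stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self._stack + [tag]))
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structural_similarity(old_html, new_html):
    """Jaccard coefficient between the tag-path sets of two page snapshots."""
    old, new = tag_paths(old_html), tag_paths(new_html)
    if not old and not new:
        return 1.0
    return len(old & new) / len(old | new)


def layout_changed(old_html, new_html, threshold=0.8):
    """Flag a structural change when similarity drops below an assumed threshold."""
    return structural_similarity(old_html, new_html) < threshold


OLD = "<html><body><div><ul><li>A</li><li>B</li></ul></div></body></html>"
SAME_LAYOUT = "<html><body><div><ul><li>C</li></ul></div></body></html>"
REDESIGNED = "<html><body><table><tr><td>A</td></tr></table></body></html>"

print(layout_changed(OLD, SAME_LAYOUT))  # content changed, structure did not
print(layout_changed(OLD, REDESIGNED))   # list replaced by a table
```

A check of this kind only signals that maintenance may be needed; repairing the extraction rules themselves is a separate problem.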

The theme of Web Data Extraction is covered by a number of reviews. Laender et al. [81] presented a survey that offers a rigorous taxonomy to classify Web Data Extraction systems. The authors introduced a set of criteria and a qualitative analysis of various Web Data Extraction tools.

Kushmerick [79] defined a profile of finite-state approaches to the Web Data Extraction problem. The author analyzed both wrapper induction approaches (i.e., approaches capable of automatically generating wrappers by exploiting suitable examples) and maintenance ones (i.e., methods to update a wrapper each time the structure of the Web source changes). In that paper, Web Data Extraction techniques derived from Natural Language Processing and Hidden Markov Models were also discussed. On the wrapper induction problem, Flesca et al. [45] and Kaiser and Miksch [64] surveyed approaches, techniques and tools. The latter paper, in particular, provided a model describing the architecture of an Information Extraction system. Chang et al. [19] introduced a three-dimensional categorization of Web Data Extraction systems, based on task difficulties, techniques used and degree of automation. In 2007, Fiumara [44] applied these criteria to classify four state-of-the-art Web Data Extraction systems. A relevant survey on Information Extraction is due to Sarawagi [105] and, in our opinion, anybody who intends to approach this discipline should read it. Recently, some authors focused on unstructured data management systems (UDMSs) [36], i.e., software systems that analyze raw text data, extract some structure from them (e.g., person name and location), integrate the structure (e.g., objects like New York and NYC are merged into a single object) and use the integrated structure to build a database. UDMSs are a relevant example of Web Data Extraction systems and the work from Doan et al. [36] provides an overview of Cimple, a UDMS developed at the University of Wisconsin. To the best of our knowledge, the survey from Baumgartner et al. [11] is the most recently updated review on the discipline as of this work.
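To make the notion of wrapper induction more concrete, the following Python sketch learns a pair of left and right string delimiters from a few labeled example pages and then applies them to an unseen page. It is a deliberately naive, single-attribute illustration written loosely in the spirit of delimiter-based wrappers; it is not the algorithm of any of the papers cited above, and the names `induce_lr_wrapper` and `apply_wrapper` and the sample pages are hypothetical.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_lr_wrapper(examples):
    """Learn (left, right) delimiters from (page, labeled value) pairs.

    Assumes each labeled value occurs in its page; the first occurrence is used.
    """
    befores, afters = [], []
    for page, value in examples:
        start = page.index(value)
        befores.append(page[:start])
        afters.append(page[start + len(value):])
    left = common_suffix(befores)
    right = os.path.commonprefix(afters)
    if not left or not right:
        raise ValueError("examples share no common delimiters")
    return left, right


def apply_wrapper(wrapper, page):
    """Extract every substring enclosed by the learned delimiters."""
    left, right = wrapper
    values, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end])
        pos = end + len(right)
    return values


# Hypothetical training pages with the value to extract labeled by hand.
EXAMPLES = [
    ("<p>Codes</p><b>Italy</b><i>39</i>", "Italy"),
    ("<h1>List</h1><b>Austria</b><i>43</i>", "Austria"),
]
wrapper = induce_lr_wrapper(EXAMPLES)

NEW_PAGE = "<div>Intro</div><b>Spain</b><i>34</i><br><b>France</b><i>33</i>"
countries = apply_wrapper(wrapper, NEW_PAGE)
print(wrapper, countries)
```

The sketch also hints at why maintenance matters: if the source replaces the `<b>` tags with another element, the learned delimiters no longer match and the wrapper silently returns nothing.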

The goal of this survey is to provide a structured and comprehensive overview of the research in Web Data Extraction as well as an overview of the most recent results in the literature.

We adopt a different point of view with respect to that used in other surveys on this discipline: most of them present a list of tools, reporting a feature-based classification or an experimental comparison of these tools. Many of these papers are solid starting points in the study of this area. Unlike the existing surveys, our ambition is to provide a classification of existing Web Data Extraction techniques in terms of the application domains in which they have been employed. We want to shed light on the various research directions in this field as well as to understand to what extent techniques initially applied in a particular application domain have been later re-used in others. To the best of our knowledge, this is the first survey that deeply analyzes Web Data Extraction techniques (and systems implementing these techniques) from the perspective of their application fields.

However, we also provide a detailed discussion of techniques to perform Web Data Extraction. We identify two main categories, i.e., approaches based on Tree Matching algorithms and approaches based on Machine Learning algorithms. For each category, we first describe the basic techniques employed and then we illustrate their variants. We also show how each category addresses the problems of wrapper generation and maintenance. After that, we focus on applications that are strictly interconnected with Web Data Extraction tasks. We cover in particular enterprise, social and scientific applications by discussing which fields have already been approached (e.g., advertising engineering, enterprise solutions, Business and Competitive Intelligence, etc.) and which are potentially going to be in the future (e.g., Bio-informatics, Web Harvesting, etc.).
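As a rough idea of what a Tree Matching approach computes, the sketch below implements a simple recursive tree matching procedure: two ordered labeled trees (for instance, simplified DOM trees of two versions of a page) are compared, and the size of the largest top-down, order-preserving mapping between them is returned. The `Node` class, the function names and the normalization used in `similarity` are assumptions made here for illustration, and the example trees are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A labeled ordered tree node, e.g. an HTML element with its children."""
    label: str
    children: list = field(default_factory=list)


def simple_tree_matching(a, b):
    """Number of node pairs in a maximum top-down, order-preserving matching."""
    if a.label != b.label:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j]: best matching between the first i children of a
    # and the first j children of b (dynamic programming table).
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            best[i][j] = max(best[i - 1][j], best[i][j - 1],
                             best[i - 1][j - 1] + w)
    return best[m][n] + 1  # +1 for the matched roots


def size(tree):
    return 1 + sum(size(child) for child in tree.children)


def similarity(a, b):
    """Matching size divided by the average tree size (one possible choice)."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Two hypothetical versions of the same page: a paragraph and a list item were added.
old_page = Node("html", [Node("body", [
    Node("h1"),
    Node("ul", [Node("li"), Node("li")]),
])])
new_page = Node("html", [Node("body", [
    Node("h1"),
    Node("p"),
    Node("ul", [Node("li"), Node("li"), Node("li")]),
])])

matched = simple_tree_matching(old_page, new_page)
print(matched, round(similarity(old_page, new_page), 3))
```

A high similarity between the tree a wrapper was built on and the tree currently served by the source suggests that the wrapper can still be applied or adapted, which is how tree matching relates to wrapper maintenance.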

We also discuss the potential of cross-fertilization, i.e., whether strategies employed in a given domain can be re-used in others or, otherwise, whether some applications can be adopted only in particular domains.

This survey is organized into two main parts. The first one is devoted to providing general definitions which are helpful to understand the material proposed in the survey. To this purpose, Section 2 illustrates the techniques exploited for collecting data from Web sources, and the algorithms that underlie most Web Data Extraction systems. The main features of existing Web Data Extraction systems are discussed at length in Section 3.

The second part of this work is about the applications of Web Data Extraction systems to real-world scenarios. In Section 4 we identify two main domains in which Web Data Extraction techniques have been employed: applications at the enterprise level and at the Social Web level. The former are described in Section 4.1, whereas the latter are covered in Section 4.2. This part concludes by discussing the opportunities of cross-fertilization among different application scenarios (see Section 4.3).

In Section 5 we draw our conclusions and discuss potential applications of Web Data Extraction techniques that might arise in the future.

Section snippets

Techniques

The first part of this survey is devoted to the discussion of the techniques adopted in the field of Web Data Extraction. In this part we extensively review approaches to extracting data from HTML pages. HTML is the predominant language for implementing Web pages and it is widely supported by the W3C consortium. HTML pages can be regarded as a form of semi-structured data (even if less structured than other sources like XML documents) in which information follows a nested structure; HTML

Web Data Extraction Systems

In this section we examine in detail the characteristics of existing Web Data Extraction systems. We can generically define a Web Data Extraction system as a platform implementing a sequence of procedures (for example, Web wrappers) that extract information from Web sources [81]. A large number of Web Data Extraction systems are available as commercial products, even if an increasing number of free, open-source alternatives to commercial software is now entering the market.

In the

Applications

The aim of the second part of this paper is to survey and analyze a large number of applications that are strictly interconnected with Web Data Extraction tasks. To the best of our knowledge, this is the first attempt to classify applications based on Web Data Extraction techniques, even if they were originally designed to operate in specific domains and, in some cases, can appear unrelated.

The spectrum of applications possibly benefiting from Web Data Extraction techniques is quite

Conclusions

The World Wide Web contains a large amount of unstructured data. The need for structured information has pushed researchers to develop and implement various strategies to accomplish the task of automatically extracting data from Web sources. Such a process is known as Web Data Extraction and it has had (and continues to have) a wide range of applications in several fields, ranging from commercial to Social Web applications.

The central thread of this survey is to classify existing

References (130)

L. Bettencourt et al., The power of a good idea: quantitative modeling of the spread of ideas from epidemiological models, Phys. A: Stat. Mech. Appl. (2006)
H. Chen et al., Ci spider: a tool for competitive intelligence on the web, Decis. Support Syst. (2002)
W. Chen, New algorithm for ordered tree-to-tree correction problem, J. Algor. (2001)
C.-N. Hsu et al., Generating finite-state transducers for semi-structured data extraction from the web, Inf. Syst. (1998)
N. Kushmerick, Wrapper induction: efficiency and expressiveness, Artif. Intell. (2000)
F. Abel et al., Cross-system user modeling and personalization on the social web, User Model. User-Adapt. Interact. (2013)
D. Amalfitano et al., Reverse engineering finite state machines from rich internet applications
L. Backstrom, P. Boldi, M. Rosa, J. Ugander, S. Vigna, Four degrees of separation, 2011....
M. Balduzzi et al., Abusing social networks for automated user profiling
R. Baumgartner et al., Web data extraction for service creation, Search Computing: Challenges and Directions (2010)
R. Baumgartner et al., Deepweb navigation in web data extraction
R. Baumgartner et al., The elog web extraction language
R. Baumgartner et al., Visual web information extraction with lixto
R. Baumgartner, O. Frölich, G. Gottlob, P. Harz, M. Herzog, P. Lehmann, T. Wien, Web data extraction for business...
R. Baumgartner, K. Fröschl, M. Hronsky, M. Pöttler, N. Walchhofer, Semantic online tourism market monitoring, in: Proc....
R. Baumgartner et al., Web data extraction system, Encycl. Database Syst. (2009)
R. Baumgartner et al., Scalable web data extraction for online market intelligence, Proc. 35th Int. Conf. Very Large Databases (2009)
A. Berger et al., A maximum entropy approach to natural language processing, Comput. Linguist. (1996)
M. Berthold et al., Intelligent Data Analysis: An Introduction (1999)
M. Califf et al., Bottom-up relational learning of pattern matching rules for information extraction, J. Machine Learning Res. (2003)
S. Catanese et al., Crawling facebook for social network analysis purposes
A. Chaabane, G. Acs, M. Kaafar, You are what you like! information leakage through users’ interests, in: Proc. Annual...
C. Chang et al., A survey of web information extraction systems, IEEE Trans. Knowl. Data Eng. (2006)
D. Chau, S. Pandit, S. Wang, C. Faloutsos, Parallel crawling for online social networks, in: Proc. 16th International...
F. Chen et al., Efficient information extraction over evolving text data
M. Collins, A new statistical parser based on bigram lexical dependencies
M.D. Conover et al., The geospatial characteristics of a social movement communication network, PloS One (2013)
D. Crandall et al., Mapping the world’s photos
V. Crescenzi et al., Automatic information extraction from large websites, J. ACM (2004)
V. Crescenzi et al., Roadrunner: towards automatic data extraction from large web sites
V. Crescenzi, G. Mecca, P. Merialdo, Improving the expressiveness of roadrunner, in: SEBD, 2004, pp....
N. Dalvi et al., Robust web extraction: an approach based on a probabilistic tree-edit model
N. Dalvi et al., Automatic wrappers for large scale Web extraction, Proc. VLDB Endowment (2011)
K. Dave, S. Lawrence, D. Pennock, Mining the peanut gallery: opinion extraction and semantic classification of product...
P. De Meo et al., Analyzing user behavior across social sharing environments, ACM Trans. Intell. Syst. Technol. (2013)
P. De Meo et al., Finding reliable users and social networks in a social internetworking system
M. Descher, T. Feilhauer, T. Ludescher, P. Masser, B. Wenzel, P. Brezany, I. Elsayed, A. Wöhrer, A.M. Tjoa, D. Huemer,...
A. Doan et al., Information extraction challenges in managing unstructured data, ACM SIGMOD Record (2009)
R. Fayzrakhmanov, M. Goebel, W. Holzinger, B. Kruepl, A. Mager, R. Baumgartner, Modelling web navigation with the user...
E. Ferrara, A large-scale community structure analysis in facebook, EPJ Data Sci. (2012)
E. Ferrara et al., Automatic wrapper adaptation by tree edit distance matching, Combinations Intell. Methods Appl. (2011)
E. Ferrara, R. Baumgartner, Design of automatically adaptable web wrappers, in: Proc. 3rd International Conference on...
E. Ferrara et al., Intelligent self-repairable web wrappers, Lecture Notes in Computer Science (2011)
E. Ferrara et al., Clustering memes in social media
E. Ferrara et al., Traveling trends: social butterflies or frequent fliers?
G. Fiumara, Automated information extraction from web sources: a survey, in: Proc. of Between Ontologies and...
S. Flesca et al., Web wrapper induction: a brief survey, AI Commun. (2004)
D. Freitag, Machine learning for information extraction in informal domains, Machine Learning (2000)
T. Furche, G. Gottlob, G. Grasso, O. Gunes, X. Guo, A. Kravchenko, G. Orsi, C. Schallhart, A.J. Sellers, C. Wang,...
T. Furche et al., OXPath: a language for scalable, memory-efficient data extraction from web applications, Proc. VLDB Endowment (2011)
