
Knowledge-Based Systems

Volume 35, November 2012, Pages 312-319

Statistical cross-language Web content quality assessment

https://doi.org/10.1016/j.knosys.2012.05.018

Abstract

Cross-language Web content quality assessment plays an important role in many Web content processing applications. In previous research, statistical systems based on natural language processing, heuristic content and term frequency-inverse document frequency features have proven effective for Web content quality assessment. However, these features are language-dependent and therefore unsuitable for cross-language ranking. This paper proposes a cross-language Web content quality assessment method. First, multi-modal language-independent features are extracted, including character features, domain registration features, two-layer hyperlink analysis features and third-party Web service features. All the extracted features are then fused. Based on the fused features, feature selection is carried out to obtain a new eigenspace, and a cross-language Web content quality model is finally learned on that eigenspace. Experiments on the ECML/PKDD 2010 Discovery Challenge cross-language datasets demonstrate that features at every scale are discriminative, that different modalities of features are complementary to each other, and that feature selection is effective for statistical learning based cross-language Web content quality assessment.

Introduction

Web content quality assessment is crucial to various Web content processing applications, such as Web search, Web archiving, Internet directories and domain abuse detection. One fundamental question is how to assess the quality of Web content. To answer this question we need to define appropriate quality metrics. The best way to assess the quality of a Web page is probably to browse it, since human beings are the end receivers in most Web content processing environments. However, manual assessment is too slow and expensive to operate at scale.

Most existing data quality measurements were developed on an ad hoc basis to solve specific problems [19]. The fundamental principles for developing stable metrics are seldom investigated. Pipino et al. [31] described principles that can help organizations develop usable data quality metrics; however, these principles were not specifically developed for the Web content setting. As such, Web content quality assessment remains an open problem. The focus of the current research is on computational models that can automatically predict Web content quality.

In previous research, Herrera-Viedma et al. proposed several evaluation methods for the information quality of specific Web sites, in which a fuzzy linguistic model and semantic Web technologies are jointly used to improve the evaluation process [16], [15], [17], [14]. This work offers helpful inspiration for feature extraction, especially semantic feature extraction, in statistical learning based methods.

Richardson et al. [34] proposed Feature Based Static Ranking and claimed that it outperformed PageRank [29]. In their work, most features are heuristic content attributes (e.g. page and anchor text); the popularity feature is impracticable for most researchers, since it relies on commercial service data; moreover, they do not take Web spam into account, which severely distorts search engines’ ranking results.

In 2010, the ECML/PKDD Discovery Challenge (DC2010) released a set of Web content quality metrics and a dataset, with the aim of developing host-level classification based on the genre of Web sites as well as their readability, authoritativeness, trustworthiness, neutrality, etc. [6]. DC2010 also involved a multilingual task and called for language-independent features with which to classify the French and German language test sets. A natural question arises: why is cross-language Web content quality assessment important and necessary? To answer this question, we provide some illustrative examples below:

  • Web content quality assessment is a typical static ranking (query-independent ranking) problem [34]. Cross-language Web content quality assessment is therefore important for ranking in cross-language information retrieval.

  • Domain names are hostnames that identify Internet Protocol (IP) resources such as Web sites. Statistics show that the number of domains has increased rapidly, with registrations across the top-level domains alone reaching 215 million [36], while DNS resolution capacity is limited. On the other hand, 80% of queries are absorbed by 0.6% of domains, and 52.8% of queries are spam related [22]. To use DNS resolution resources more efficiently, DNS experts suggest that Internet Service Providers should serve domains at different levels based on their quality. Since domains are registered and used in many languages, effective cross-language quality assessment algorithms are urgently needed.

  • Cybercrime seriously harms surfers and e-commerce. In particular, phishing attacks spiked to 67,677 in the second half of 2010, up from 48,244 in the first half of the year [2]. Confirmed attack sites are blocked by popular browsers such as Internet Explorer; however, false positives exist. Since the targets of phishing attacks are mostly financial institutions and high-status e-commerce companies, false positives can lead to very severe consequences. Cross-language content quality assessment is effective in reducing such false detections.

In DC2010, all participating teams employed statistical learning methods for Web content quality assessment. The learning features include link analysis features (page-level and host-level), heuristic content features, term frequency-inverse document frequency features and natural language processing features [5], [3], [12]. However, being language dependent, the natural language processing and term frequency-inverse document frequency features are unfit for cross-language quality assessment; that is, features extracted from English are not suitable for the French and German tasks in DC2010.

According to the above discussion, how to construct a computational model and extract effective features to train it for cross-language Web content quality prediction becomes an interesting and challenging topic. Supervised learning methods have demonstrated their effectiveness for many classification and regression problems, including spam detection, text categorization and anti-phishing [9], [21], [38], [35], [11], [1], [26]. These successful applications inspire us to evaluate Web content quality with supervised learning.

In this paper, we propose a statistical assessment model and a series of language-independent features to assess Web content quality. Comparative experiments on the DC2010 dataset demonstrate the effectiveness of the proposed method and the extracted multi-scale features.

The proposed cross-language Web content quality assessment framework is depicted in Fig. 1. The method first extracts multi-scale language-independent features based on the hostname string, the hyperlink graph, third-party resources and the WHOIS service. The extracted multi-scale features are then fused into a joint feature vector, after which Information Gain (IG) based feature selection is carried out to generate the eigenspace, and the assessment model is learned on that eigenspace. Finally, the learned model can assess any given Web site or rank any host list. The details of the assessment model are described in the next section.
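As a concrete illustration of this pipeline (not the authors' implementation), the sketch below fuses per-modality feature matrices into a joint vector, performs information-gain style feature selection (approximated here with mutual information) and fits a learner on the reduced eigenspace. The choice of learner, the number of retained features and all names are illustrative assumptions.

```python
# Minimal sketch of the assessment pipeline described above, assuming each
# modality is already extracted as a numeric matrix with one row per host.
# The learner (a random forest regressor) and the number of retained
# features are illustrative choices, not the paper's configuration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, mutual_info_regression


def fuse_features(char_f, whois_f, link_f, third_party_f):
    """Concatenate per-modality feature matrices into one joint matrix."""
    return np.hstack([char_f, whois_f, link_f, third_party_f])


def train_quality_model(X_joint, y_quality, k_features=50):
    """Select the most informative features (information gain approximated
    by mutual information) and fit a model on the reduced eigenspace."""
    selector = SelectKBest(mutual_info_regression,
                           k=min(k_features, X_joint.shape[1]))
    X_reduced = selector.fit_transform(X_joint, y_quality)
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_reduced, y_quality)
    return selector, model


def assess_hosts(selector, model, X_joint_new):
    """Score unseen hosts; higher scores indicate higher predicted quality."""
    return model.predict(selector.transform(X_joint_new))
```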

The remainder of this paper is organized as follows. Section 2 presents the statistical Web content quality assessment model. Section 3 first introduces feature extraction, including character features, domain registration features, third-party features and two-level link analysis features, and then discusses the feature fusion and feature selection strategies. Section 4 describes the data sets, the learning algorithm and the evaluation metrics, and then compares in detail the effectiveness of the extracted language-independent features and the fused features. Finally, Section 5 draws conclusions and provides some implications for future work on Web content quality assessment.

Section snippets

Statistical Web content quality assessment

In order to model a supervised learning problem, a training set with labels should be collected in advance. To label the quality of a given Web page, DC2010 provides a feasible way: the quality value is defined based on genre, trust, factuality and bias. Typically, DC2010 assigns the discrete values empirically: a spam host is worth quality 0; news/editorial and educational sites are worth 5; discussion hosts are worth 4, while others are worth 3. DC2010 also offers 2 bonus scores for facts or
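A minimal sketch of this base label scheme, assuming only the scores listed above (the bonus rules are truncated in this excerpt and are therefore omitted):

```python
# Illustrative encoding of the DC2010 base quality labels described above.
# Bonus scores are not modelled here because their rules are truncated in
# the excerpt.
BASE_QUALITY = {
    "spam": 0,
    "news/editorial": 5,
    "educational": 5,
    "discussion": 4,
    "other": 3,
}


def base_quality(genre: str) -> int:
    """Return the base quality score for a host genre (defaults to 'other')."""
    return BASE_QUALITY.get(genre, BASE_QUALITY["other"])
```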

Language-independent statistical features extraction

This section describes the multi-scale features used for learning an automatic cross-language Web content assessment model. The extracted features include character features, domain registration features, page and host level hyperlink analysis features, and third-party Web service features.

In the description of the extracted features, several figures are presented, which provide a visual representation of the prevalence of Web content quality relative to a given feature. The figure only shows the
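To make the notion of language-independent character features concrete, the sketch below computes a few plausible hostname-level measures. The specific features shown (length, digit ratio, hyphen count, label count) are assumptions for illustration only and are not necessarily the ones used in the paper.

```python
# Hedged example of hostname-level character features; the exact feature
# set used in the paper is not listed in this excerpt.
def hostname_character_features(hostname: str) -> dict:
    """Compute simple language-independent character features of a hostname."""
    labels = hostname.split(".")
    length = len(hostname)
    return {
        "length": length,
        "digit_ratio": sum(c.isdigit() for c in hostname) / max(length, 1),
        "hyphen_count": hostname.count("-"),
        "label_count": len(labels),
        "longest_label": max(len(label) for label in labels),
    }


# Example: hostname_character_features("www.example-shop24.co.uk")
```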

Evaluation measurement

Normalized discounted cumulative gain (NDCG) is a measure of the effectiveness of ranking algorithms and related applications [20], and it was used for evaluating the submissions to DC2010 [6]. In our experiments, this measure is also employed for performance evaluation.
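For reference, a standard formulation of DCG and NDCG at cut-off k is given below, where rel_i is the graded quality of the host at rank i and IDCG@k is the DCG of the ideal ordering; the exact gain and discount used by the DC2010 evaluation script may differ in detail.

```latex
% Standard definition of (N)DCG at cut-off k.
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```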

Data set

All the available English training samples in the DC2010 data collection are used [6], the same as those in our previous work [12]. The DC2010 data set consists of 23,808,829 pages, 600 million hyperlinks and 191,388 different

Conclusions

Cross-language Web content quality assessment plays an important role in many Web content processing applications, such as cross-language Web information retrieval, Web abuse detection, and domain reputation assessment. In this paper, we proposed a statistical learning based cross-language Web content quality assessment solution, in which a statistical assessment model is built; a series of language-independent features are extracted; and information gain based feature selection on the joint

Acknowledgments

The authors thank the anonymous reviewers for their valuable comments, and the DC2010 organizing committee for providing the check and measure scripts.

This work was supported by grants from the National Natural Science Foundation of China (Nos. 61005029, 61070039, and 61103138) and the Natural Science Foundation of Beijing (No. 4112062).

References (39)

  • L. Becchetti et al., Link analysis for web spam detection, ACM Transactions on the Web (2008)
  • A.A. Benczúr, C. Castillo, Z. Gyöngyi, J. Masanes, Overview of the ECML/PKDD discovery challenge 2010, in: Proc. of the...
  • A.A. Benczúr, C. Castillo, J. Masanes, M. Matthews, M. Erdélyi, Z. Gyöngyi, ECML/PKDD 2010 discovery challenge data...
  • A.A. Benczúr, C. Castillo, J. Masanes, M. Matthews, M. Erdélyi, Z. Gyöngyi, Rules: ECML/PKDD 2010 discovery challenge,...
  • L. Breiman, Bagging predictors, Machine Learning (1996)
  • C. Castillo et al., Know your neighbors: web spam detection using the web topology
  • W. Fan et al., Effective estimation of posterior probabilities: explaining the accuracy of randomized decision tree approaches
  • G. Geng et al., Link based small sample learning for web spam detection
  • G. Geng, B. Xiao, X. Zhang, D. Zhang, Evaluating Web content quality via multi-scale features, in: Proc. of the...
A short version of the present paper appeared at ACM SIGIR 2011 [Geng, G., Li, X., Wang, L., Wang, W., Shen, S., et al. 2011].
