DOI: 10.1145/2361354.2361380
research-article

Structural and visual comparisons for web page archiving

Published: 04 September 2012

ABSTRACT

In this paper, we propose a Web page archiving system that combines state-of-the-art comparison methods based on the source code of Web pages with computer vision techniques. To detect whether successive versions of a Web page are similar, our system relies on: (1) a combination of structural and visual comparison methods embedded in a statistical discriminative model, (2) a visual similarity measure designed for Web pages that improves change detection, and (3) a supervised feature selection method adapted to Web archiving. We train a Support Vector Machine model on vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, described by their vector of similarity scores, are similar. Experiments on real archives validate our approach.
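The classification step described in the abstract can be illustrated with a minimal sketch. It assumes each pair of successive page versions is summarized by a vector of similarity scores in [0, 1] and labeled +1 (similar) or -1 (dissimilar). The three feature names, the synthetic scores, and the plain subgradient-descent trainer are all assumptions made for illustration; the abstract does not specify the actual features, data, or SVM solver used by the system.

```python
# Sketch only: synthetic scores and hypothetical feature names,
# not the authors' implementation.

# Hypothetical similarity features for a pair of successive versions.
FEATURES = ["link_overlap", "dom_tree_similarity", "visual_similarity"]

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Fit w, b by batch subgradient descent on the L2-regularized hinge loss."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of (lam / 2) * ||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:              # hinge loss is active for this pair
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 (versions similar) or -1 (versions dissimilar)."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Synthetic training set: one score vector per pair of successive versions.
X = [
    [0.95, 0.90, 0.92], [0.88, 0.93, 0.85], [0.91, 0.80, 0.97], [0.85, 0.89, 0.90],
    [0.20, 0.35, 0.10], [0.30, 0.15, 0.25], [0.10, 0.25, 0.30], [0.40, 0.20, 0.15],
]
y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print(predict(w, b, [0.9, 0.9, 0.9]))   # a pair with high scores on every feature
print(predict(w, b, [0.2, 0.2, 0.2]))   # a pair with low scores on every feature
```

A linear model is used here only because it keeps the sketch short and dependency-free; a production system would more likely rely on an established SVM library and on real labeled archive data.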

      • Published in

        DocEng '12: Proceedings of the 2012 ACM symposium on Document engineering
        September 2012
        256 pages
        ISBN: 9781450311168
        DOI: 10.1145/2361354

        Copyright © 2012 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 178 of 537 submissions, 33%
