ABSTRACT
In this paper, we propose a Web page archiving system that combines state-of-the-art comparison methods based on the source code of Web pages with computer vision techniques. To detect whether successive versions of a Web page are similar, our system relies on: (1) a combination of structural and visual comparison methods embedded in a statistical discriminative model; (2) a visual similarity measure designed for Web pages that improves change detection; and (3) a supervised feature selection method adapted to Web archiving. We train a Support Vector Machine model on vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, represented by their vector of similarity scores, are similar. Experiments on real archives validate our approach.
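The classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each pair of page versions is summarized by two hypothetical similarity scores in [0, 1] (one structural, one visual), uses small synthetic data invented for illustration, and swaps in a plain linear SVM trained by subgradient descent on the hinge loss, since the abstract does not specify the solver or kernel.

```python
# Sketch only: synthetic data, hypothetical feature names, and a simple
# linear SVM trained by subgradient descent on the regularized hinge loss.

def decision_value(w, b, x):
    """Signed score of a similarity vector x under the linear model (w, b)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=500):
    """Fit a linear SVM on vectors X with labels y in {+1, -1}.

    lam is the L2 regularization strength and lr a fixed learning rate;
    both values are illustrative assumptions, not tuned settings.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * decision_value(w, b, xi)
            # Regularization step: shrink the weights slightly.
            w = [(1.0 - lr * lam) * wj for wj in w]
            # Hinge-loss step: only samples inside the margin contribute.
            if margin < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 (versions similar) or -1 (versions dissimilar)."""
    return 1 if decision_value(w, b, x) >= 0.0 else -1


# Each row is [structural_sim, visual_sim] for one pair of successive
# versions (hypothetical features, synthetic values).
X = [
    [0.95, 0.90], [0.88, 0.93], [0.91, 0.85], [0.97, 0.96],  # similar pairs
    [0.30, 0.45], [0.20, 0.35], [0.40, 0.25], [0.15, 0.50],  # dissimilar pairs
]
y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(X, y)

# Classify two unseen version pairs from their similarity-score vectors.
label_high = predict(w, b, [0.96, 0.94])
label_low = predict(w, b, [0.22, 0.33])
print("high-similarity pair:", label_high)
print("low-similarity pair:", label_low)
```

In practice the score vectors would hold many more structural and visual measures, and an established SVM library would likely replace the hand-written trainer; the sketch only shows how a vector of similarity scores maps to a similar/dissimilar decision.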
Index Terms
- Structural and visual comparisons for web page archiving