research-article

Structural and visual comparisons for web page archiving

Authors:
Marc Teva Law, Nicolas Thome, Stéphane Gançarski, Matthieu Cord
LIP6, UPMC - Sorbonne University, Paris, France

DocEng '12: Proceedings of the 2012 ACM symposium on Document engineering, September 2012, Pages 117–120
https://doi.org/10.1145/2361354.2361380

Published: 04 September 2012

ABSTRACT

In this paper, we propose a Web page archiving system that combines state-of-the-art comparison methods based on the source code of Web pages with computer vision techniques. To detect whether successive versions of a Web page are similar, our system relies on: (1) a combination of structural and visual comparison methods embedded in a statistical discriminative model; (2) a visual similarity measure designed for Web pages that improves change detection; and (3) a supervised feature selection method adapted to Web archiving. We train a Support Vector Machine model on vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, each pair represented by its vector of similarity scores, are similar. Experiments on real archives validate our approach.
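
The classification step described in the abstract can be illustrated with a minimal sketch. It assumes a linear SVM trained by subgradient descent on the regularised hinge loss, which is a stand-in and not necessarily the solver or kernel the authors use. The feature names, the toy scores, the rescaling to [-1, 1], and the hyperparameters are all hypothetical; the abstract only states that each pair of versions is described by a vector of similarity scores. Ranking features by weight magnitude is shown as one common heuristic for supervised feature selection, not as the paper's selection method.

```python
from typing import List, Sequence, Tuple

# Hypothetical feature names: the abstract only says "similarity scores".
FEATURES = ["structural_links", "structural_images", "visual_layout"]

# Toy training set (invented values): one score vector per pair of versions.
SAMPLES = [
    [0.95, 0.90, 0.92], [0.90, 0.97, 0.91], [0.98, 0.93, 0.90],  # similar
    [0.10, 0.15, 0.05], [0.20, 0.05, 0.12], [0.08, 0.18, 0.15],  # dissimilar
]
LABELS = [1, 1, 1, -1, -1, -1]  # +1 = similar versions, -1 = dissimilar


def rescale(scores: Sequence[float]) -> List[float]:
    """Map similarity scores from [0, 1] to [-1, 1] so they are centred on zero."""
    return [2.0 * s - 1.0 for s in scores]


def decision(w: Sequence[float], b: float, scores: Sequence[float]) -> float:
    """Linear decision value; positive means the two versions look similar."""
    return sum(wi * xi for wi, xi in zip(w, rescale(scores))) + b


def train_linear_svm(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 200,
    lr: float = 0.1,
    lam: float = 0.01,
) -> Tuple[List[float], float]:
    """Fit a linear SVM by subgradient descent on the L2-regularised hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for scores, y in zip(samples, labels):
            x = rescale(scores)
            violated = y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1.0
            # Regularisation step: shrink the weights slightly at every sample.
            w = [(1.0 - lr * lam) * wi for wi in w]
            if violated:
                # Hinge-loss step: move the boundary to enlarge this sample's margin.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w: Sequence[float], b: float, scores: Sequence[float]) -> int:
    """Return +1 (similar) or -1 (dissimilar) for one pair of versions."""
    return 1 if decision(w, b, scores) >= 0.0 else -1


def rank_features(w: Sequence[float], names: Sequence[str]) -> List[str]:
    """Order feature names by decreasing absolute weight (a simple heuristic)."""
    order = sorted(range(len(names)), key=lambda i: abs(w[i]), reverse=True)
    return [names[i] for i in order]


w, b = train_linear_svm(SAMPLES, LABELS)
print("weights:", dict(zip(FEATURES, w)), "bias:", b)
print("new pair judged:", predict(w, b, [0.93, 0.95, 0.90]))
print("features by weight magnitude:", rank_features(w, FEATURES))
```

In an archiving setting, a prediction of -1 for a new crawl compared against the last stored version would suggest that the page has changed enough to be worth storing again; low-weight features are candidates for removal to save computation.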


Index Terms

Structural and visual comparisons for web page archiving
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems
  1. Information systems applications
    1. Digital libraries and archives


Published in
DocEng '12: Proceedings of the 2012 ACM symposium on Document engineering
September 2012
256 pages
ISBN:9781450311168
DOI:10.1145/2361354
General Chair: Cyril Concolato, Telecom ParisTech, France
Program Chair: Patrick Schmitz, University of California, Berkeley, USA
Copyright © 2012 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
change detection algorithms
digital preservation
pattern recognition
support vector machines
web archiving
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 178 of 537 submissions, 33%