Web archive profiling through CDX summarization

Alam, Sawood; Nelson, Michael L.; Van de Sompel, Herbert; Balakireva, Lyudmila L.; Shankar, Harihar; Rosenthal, David S. H.

doi:10.1007/s00799-016-0184-4

Published: 16 July 2016

International Journal on Digital Libraries, Volume 17, pages 223–238 (2016)

Sawood Alam¹ (ORCID: orcid.org/0000-0002-8267-3326),
Michael L. Nelson¹,
Herbert Van de Sompel²,
Lyudmila L. Balakireva²,
Harihar Shankar² &
David S. H. Rosenthal³

Abstract

With the proliferation of public web archives, it is becoming more important to better profile their contents, both to understand their immense holdings and to support routing of requests in the Memento aggregator. To save time, the Memento aggregator should only poll the archives that are likely to have a copy of the requested URI. Using the crawler index files produced after crawling, we can generate profiles of the archives that summarize their holdings and can be used to inform routing of the Memento aggregator’s URI requests. Previous work in profiling ranged from using full URIs (no false positives, but with large profiles) to using only top-level domains (TLDs) (smaller profiles, but with many false positives). This work explores strategies in between these two extremes. In our experiments, we correctly identified about 78% of the URIs that were present or not present in the archive with less than 1% relative cost as compared to the complete knowledge profile, and 94% of the URIs with less than 10% relative cost, without any false negatives. With respect to the TLD-only profile, the registered domain profile doubled the routing precision, while complete hostname and one path segment gave a tenfold increase in the routing precision.
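The trade-off between profile size and false positives can be illustrated with a minimal Python sketch. It is not the authors' profiling script; the function names (`profile_key`, `build_profile`, `should_poll`), the `host_segments` and `path_segments` parameters, and the sample URIs are assumptions made for illustration. Keys are written in a SURT-like form (host labels reversed, TLD first), and keeping a fixed number of host labels only approximates a registered-domain policy, which in practice would need the Public Suffix List.

```python
from urllib.parse import urlsplit


def profile_key(uri, host_segments=None, path_segments=0):
    """Return a SURT-like key such as 'uk,co,bbc,www)/news' for a URI.

    host_segments=None keeps the full hostname; an integer keeps that many
    labels counted from the TLD. path_segments is the number of leading
    path segments to retain. (Hypothetical helper, not the paper's code.)
    """
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    labels = (parts.hostname or "").lower().split(".")
    labels.reverse()  # TLD first, as in SURT
    if host_segments is not None:
        labels = labels[:host_segments]
    segments = [s for s in parts.path.split("/") if s][:path_segments]
    return ",".join(labels) + ")/" + "/".join(segments)


def build_profile(uris, **policy):
    """Summarize an archive's holdings as a set of keys under one policy."""
    return {profile_key(u, **policy) for u in uris}


def should_poll(profile, uri, **policy):
    """Route a request to the archive only if its key is in the profile."""
    return profile_key(uri, **policy) in profile


# Assumed sample holdings of one archive.
holdings = ["http://www.bbc.co.uk/news/world/article.html"]

tld_policy = {"host_segments": 1}
host_path_policy = {"host_segments": None, "path_segments": 1}

tld_profile = build_profile(holdings, **tld_policy)
host_path_profile = build_profile(holdings, **host_path_policy)

# A URI the archive does not hold: the TLD-only profile still says "poll"
# (a false positive), while the finer-grained profile rules it out.
lookup = "http://www.bbc.co.uk/sport/"
print(should_poll(tld_profile, lookup, **tld_policy))              # True
print(should_poll(host_path_profile, lookup, **host_path_policy))  # False
```

Because every held URI contributes its own key to the profile, a lookup for a held URI always succeeds under the same policy, which is why coarser keys add false positives but never false negatives.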

Notes

http://labs.mementoweb.org/aggregator_config/archivelist.xml.
http://timetravel.mementoweb.org/.
http://oldweb.today/.
CDX files are created as an index of the WARC [15] files generated by the Heritrix web crawler; see [13] for a description of the CDX file format.
https://github.com/oduwsdl/archive_profiler.
http://stackoverflow.com/questions/16256913/improving-performance-of-very-large-dictionary-in-python.
In our dataset, Archive-It has 0.71% non-HTTP entries in its CDX files, while UKWA has no non-HTTP entries.

References

1. Alam, S., Cartledge, C.L., Nelson, M.L.: Support for various HTTP methods on the web. Tech. Rep. (2014). arXiv:1405.2330
2. Alam, S., Kreymer, I., Nelson, M.L.: Object resource stream (ORS) and CDX-JSON (CDXJ) draft (2015). https://github.com/oduwsdl/ORS
3. Alam, S., Nelson, M.L., Van de Sompel, H., Balakireva, L., Shankar, H., Rosenthal, D.S.H.: Web archive profiling through CDX summarization. In: Proceedings of 19th international conference on theory and practice of digital libraries. TPDL 2015, vol. 9316, pp. 3–14. Poznań, Poland (2015)
4. AlNoamany, Y., AlSum, A., Weigle, M.C., Nelson, M.L.: Who and what links to the Internet Archive. Int. J. Digit. Librar. 14(3–4), 101–115 (2014)
5. AlSum, A., Weigle, M.C., Nelson, M.L., Van de Sompel, H.: Profiling web archive coverage for top-level domain and content language. Proc. Int. Conf. Theory Pract. Digit. Librar. TPDL 2013, 60–71 (2013)
6. AlSum, A., Weigle, M.C., Nelson, M.L., Van de Sompel, H.: Profiling web archive coverage for top-level domain and content language. Int. J. Digit. Librar. 14(3–4), 149–166 (2014)
7. Ben-Kiki, O., Evans, C., Ingy döt Net: YAML Ain’t Markup Language (YAML™) Version 1.2 (2009). http://www.yaml.org/spec/1.2/spec.html
8. Bornand, N.J., Balakireva, L., Van de Sompel, H.: Routing memento requests using binary classifiers. In: Proceedings of the 16th ACM/IEEE-CS on joint conference on digital libraries, JCDL ’16, pp. 63–72 (2016). doi:10.1145/2910896.2910899
9. Crockford, D.: The application/json media type for JavaScript object notation (JSON). RFC 4627 (2006)
10. Deutsch, P.: GZIP file format specification version 4.3. RFC 1952 (1996)
11. Egghe, L.: Untangling Herdan’s law and Heaps’ law: mathematical and informetric arguments. J. Am. Soc. Inf. Sci. Technol. 58(5), 702–709 (2007)
12. Gravano, L., Chang, C.C.K., García-Molina, H., Paepcke, A.: STARTS: Stanford proposal for internet meta-searching. SIGMOD Rec. 26(2), 207–218 (1997). doi:10.1145/253262.253299
13. Internet Archive: CDX file format (2003). http://archive.org/web/researcher/cdx_file_format.php
14. Internet Archive: Archive-It, web archiving services for libraries and archives (2006). https://www.archive-it.org/
15. ISO 28500: WARC (Web ARChive) file format (2009). http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
16. Liu, L.: Query routing in large-scale digital library systems. In: 15th International Conference on Data Engineering, 1999. Proceedings, pp. 154–163 (1999). doi:10.1109/ICDE.1999.754918
17. Meng, W., Yu, C., Liu, K.L.: Building efficient and effective metasearch engines. ACM Comput. Surv. (CSUR) 34(1), 48–89 (2002)
18. Mozilla Foundation: Public Suffix List (2015). https://publicsuffix.org/
19. Sanderson, R.: Global web archive integration with memento. In: Proceedings of the 12th ACM/IEEE-CS joint conference on digital libraries, pp. 379–380. ACM, New York (2012)
20. Sanderson, R., Van de Sompel, H., Nelson, M.L.: IIPC memento aggregator experiment (2012). http://www.netpreserve.org/sites/default/files/resources/Sanderson.pdf
21. Sigurðsson, K., Stack, M., Ranitovic, I.: Heritrix user manual: sort-friendly URI reordering transform (2006). http://crawler.archive.org/articles/user_manual/glossary.html#surt
22. Sporny, M., Kellogg, G., Lanthaler, M.: A JSON-based serialization for linked data. W3C Recommendation (2014)
23. Stanford University Libraries: Stanford Web Archive Portal (2013). https://swap.stanford.edu/
24. Sugiura, A., Etzioni, O.: Query routing for web search engines: architecture and experiments. Comput. Netw. 33(1), 417–429 (2000)
25. Tran, T., Zhang, L.: Keyword query routing. IEEE Trans. Knowl. Data Eng. 26(2), 363–375 (2014)
26. UK Web Archive: Crawled URL Index JISC UK Web Domain Dataset (1996–2013) (2014). doi:10.5259/ukwa.ds.2/cdx/1
27. Van de Sompel, H., Nelson, M.L., Sanderson, R.: HTTP framework for time-based access to resource states -- Memento. RFC 7089 (2013)
28. Weka: Attribute-relation file format (ARFF) (2009). http://weka.wikispaces.com/ARFF

Acknowledgments

This work is supported in part by the International Internet Preservation Consortium (IIPC). Andy Jackson (from the British Library) helped us with the UKWA datasets. Kris Carpenter (from the Internet Archive) and Joseph E. Ruettgers (from Old Dominion University) helped us with the Archive-It datasets. Ahmed AlSum and Nicholas Taylor (from Stanford University Libraries) helped us with the Stanford datasets and by running our profiling script on the Stanford archive. Ilya Kreymer contributed to the discussion about the CDXJ profile serialization format.

Author information

Authors and Affiliations

Computer Science Department, Old Dominion University, Norfolk, VA, USA
Sawood Alam & Michael L. Nelson
Los Alamos National Laboratory, Los Alamos, NM, USA
Herbert Van de Sompel, Lyudmila L. Balakireva & Harihar Shankar
Stanford University Libraries, Stanford, CA, USA
David S. H. Rosenthal

Corresponding author

Correspondence to Sawood Alam.

About this article

Cite this article

Alam, S., Nelson, M.L., Van de Sompel, H. et al. Web archive profiling through CDX summarization. Int J Digit Libr 17, 223–238 (2016). https://doi.org/10.1007/s00799-016-0184-4

Received: 10 January 2016
Revised: 03 July 2016
Accepted: 04 July 2016
Published: 16 July 2016
Issue Date: September 2016
DOI: https://doi.org/10.1007/s00799-016-0184-4
