DOI: 10.1145/3383583.3398513
research-article

The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives

Published: 01 August 2020

ABSTRACT

The Archives Unleashed project aims to improve scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building, all proceeding concurrently in mutually reinforcing efforts. As we near the end of our initially conceived three-year project, we report on our progress and share lessons learned along the way. The main contribution articulated in this paper is a process model that decomposes scholarly inquiries into four main activities: filter, extract, aggregate, and visualize. Based on the insight that these activities can be disaggregated across time, space, and tools, it is possible to generate "derivative products", using our Archives Unleashed Toolkit, that serve as useful starting points for scholarly inquiry. Scholars can download these products from the Archives Unleashed Cloud and manipulate them just like any other dataset, thus providing access to web archives without requiring any specialized knowledge. Over the past few years, our platform has processed over a thousand different collections from over two hundred users, totaling around 300 terabytes of web archives.
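The four activities named in the abstract can be pictured as a small pipeline. The sketch below is illustrative only: it uses plain Python over a handful of made-up in-memory records, and the record layout, function names, and sample URLs are all assumptions for this example, not the Archives Unleashed Toolkit API. The aggregated counts stand in for a "derivative product" that could be saved and analyzed later with other tools.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical stand-in for web archive records: (crawl date, URL, MIME type).
RECORDS = [
    ("20110105", "http://example.org/a.html", "text/html"),
    ("20110105", "http://example.org/logo.png", "image/png"),
    ("20120310", "http://example.com/b.html", "text/html"),
    ("20120311", "http://example.org/c.html", "text/html"),
]

def filter_records(records, mime_type):
    """Filter: keep only records of one MIME type."""
    return [r for r in records if r[2] == mime_type]

def extract_domains(records):
    """Extract: pull (crawl year, domain) out of each record."""
    return [(date[:4], urlparse(url).netloc) for date, url, _ in records]

def aggregate(pairs):
    """Aggregate: count pages per (year, domain)."""
    return Counter(pairs)

def visualize(counts):
    """Visualize: render the counts as a plain-text bar chart."""
    lines = []
    for (year, domain), n in sorted(counts.items()):
        lines.append(f"{year} {domain:<12} {'#' * n} ({n})")
    return "\n".join(lines)

# The aggregated counts play the role of a small derivative dataset.
counts = aggregate(extract_domains(filter_records(RECORDS, "text/html")))
chart = visualize(counts)
print(chart)
```

Because each step only consumes the previous step's output, the stages could in principle run at different times or in different tools, which is the disaggregation the abstract describes.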

Published in

JCDL '20: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020
August 2020
611 pages
ISBN: 9781450375856
DOI: 10.1145/3383583

Copyright © 2020 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 415 of 1,482 submissions, 28%
