DOI: 10.1145/2740908.2741695
research-article

Scaling Down Distributed Infrastructure on Wimpy Machines for Personal Web Archiving

Published: 18 May 2015

ABSTRACT

Warcbase is an open-source platform for storing, managing, and analyzing web archives using modern "big data" infrastructure on commodity clusters: specifically, HBase for storage and Hadoop for data analytics. This paper describes an effort to scale "down" Warcbase onto a Raspberry Pi, an inexpensive single-board computer about the size of a deck of playing cards. Beyond serving as an interesting technology demonstration, such a design presents new opportunities for personal web archiving: it enables a low-cost, low-power, portable device that can continuously capture a user's web browsing history (not only the URLs of the pages that a user has visited, but also the contents of those pages) and allows the user to revisit any previously encountered page as it appeared at that time. Experiments show that data ingestion throughput and temporal browsing latency are adequate with existing hardware, which means that such capabilities are already feasible today.


Published in

WWW '15 Companion: Proceedings of the 24th International Conference on World Wide Web
May 2015
1602 pages
ISBN: 9781450334730
DOI: 10.1145/2740908

Copyright © 2015 held by the International World Wide Web Conference Committee (IW3C2)

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
