ABSTRACT
Warcbase is an open-source platform for storing, managing, and analyzing web archives using modern "big data" infrastructure on commodity clusters: specifically, HBase for storage and Hadoop for data analytics. This paper describes an effort to scale Warcbase "down" onto a Raspberry Pi, an inexpensive single-board computer about the size of a deck of playing cards. Beyond serving as an interesting technology demonstration, such a design presents new opportunities for personal web archiving: it enables a low-cost, low-power, portable device that continuously captures a user's web browsing history (not only the URLs of the pages visited, but also the contents of those pages) and allows the user to revisit any previously encountered page as it appeared at the time of the visit. Experiments show that data ingestion throughput and temporal browsing latency are adequate on existing hardware, which means that such capabilities are already feasible today.
Index Terms
- Scaling Down Distributed Infrastructure on Wimpy Machines for Personal Web Archiving