Abstract
We present an approach to web content aggregation that allows information to be harvested from web pages independently of the specific markup language used. The approach builds on ideas from data warehousing, and we present solutions, adapted to this context, to two well-known problems of data integration: detection of equivalences and data cleaning. We describe how the content aggregation engine has been realised as an extensible framework so that end-users as well as developers can use the associated tools to create personal libraries of content extracted from the web.
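To make the two data integration problems named in the abstract concrete, the following is a minimal sketch of one common way to handle them. It is not the method of the paper. It assumes that data cleaning means normalising harvested strings and that equivalence detection compares the cleaned strings by Levenshtein edit distance. The function names `clean`, `levenshtein` and `probably_equivalent`, and the 0.2 threshold, are hypothetical choices made for illustration.

```python
import re
import unicodedata


def clean(value: str) -> str:
    """Normalise a harvested string (assumed cleaning step, not from the paper)."""
    # Unify Unicode forms and letter case.
    text = unicodedata.normalize("NFKC", value).casefold()
    # Replace punctuation with spaces, then collapse runs of whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def probably_equivalent(a: str, b: str, threshold: float = 0.2) -> bool:
    """Treat two values as equivalent if their normalised distance is small.

    The threshold is an illustrative assumption and would need tuning.
    """
    cleaned_a, cleaned_b = clean(a), clean(b)
    longest = max(len(cleaned_a), len(cleaned_b))
    if longest == 0:
        return True
    return levenshtein(cleaned_a, cleaned_b) / longest <= threshold
```

Under these assumptions, `probably_equivalent("Data Mountain", "data  mountain!")` holds because both values clean to the same string, while two unrelated titles fall well above the threshold. A real aggregation engine would likely combine several such measures per attribute rather than rely on a single string distance.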
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Geel, M., Church, T., Norrie, M.C. (2012). Mix-n-Match: Building Personal Libraries from Web Content. In: Zaphiris, P., Buchanan, G., Rasmussen, E., Loizides, F. (eds) Theory and Practice of Digital Libraries. TPDL 2012. Lecture Notes in Computer Science, vol 7489. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33290-6_37
DOI: https://doi.org/10.1007/978-3-642-33290-6_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33289-0
Online ISBN: 978-3-642-33290-6
eBook Packages: Computer Science, Computer Science (R0)