Restoring Semantically Incomplete Document Collections Using Lexical Signatures

Meneses, Luis; Barthwal, Himanshu; Singh, Sanjeev; Furuta, Richard; Shipman, Frank

doi:10.1007/978-3-642-40501-3_33

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8092)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Unexpected changes complicate the management of missing resources in a digital collection. In decentralized and distributed collections such as Walden’s Paths, a missing point or an incomplete resource is a serious problem, because it can interrupt the continuity of the narration and render the collection semantically incomplete. We foresee two scenarios when a resource cannot be found. In the first, we have access to a copy of the missing document or to its lexical signature, which allows us to find the missing resource. The second is more interesting to us: what happens if we have no valid metadata associated with the missing resource? To address this case, we used the lexical signatures of the valid documents within a collection to find suitable replacements for absent resources. We found that traditional similarity metrics do not adequately convey the relationships between the elements in the collections. Our analyses also showed that our procedures were able to restore the semantic integrity of incomplete document collections.
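As a rough illustration of the idea in the abstract, the sketch below derives a lexical signature for each valid document and pools those signatures into candidate query terms for a missing resource. It is not the authors' procedure: the top-k TF-IDF definition of a signature, the smoothing formula, the pooling step, and the names `tokenize`, `lexical_signature`, and `collection_query` are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(doc, corpus, k=5):
    """Return the k terms of `doc` with the highest TF-IDF score.

    `corpus` is a list of document strings used for document frequencies.
    Assumption: a signature is simply the top-k TF-IDF terms.
    """
    tf = Counter(tokenize(doc))
    token_sets = [set(tokenize(d)) for d in corpus]
    n = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for s in token_sets if term in s)
        idf = math.log((1 + n) / (1 + df)) + 1.0  # smoothed IDF (assumed)
        scores[term] = count * idf
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def collection_query(valid_docs, k=5):
    """Pool the signatures of the valid documents into k query terms.

    Terms that recur across many signatures rank first (hypothetical
    pooling rule, not taken from the paper).
    """
    pooled = Counter()
    for doc in valid_docs:
        pooled.update(lexical_signature(doc, valid_docs, k))
    ranked = sorted(pooled.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

The resulting terms could be submitted to a search engine to look for replacement candidates; how candidates are ranked and accepted is outside the scope of this sketch.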

Author information

Authors and Affiliations

Center for the Study of Digital Libraries and Department of Computer Science and Engineering, Texas A&M University, College Station, TX, 77843–3112, USA
Luis Meneses, Himanshu Barthwal, Sanjeev Singh, Richard Furuta & Frank Shipman

Editor information

Editors and Affiliations

Department of Computer and Information Science, Norwegian University of Science and Technology, 7491, Trondheim, Norway
Trond Aalberg
Department of Archives and Library Science, Ionian University, 49100, Corfu, Greece
Christos Papatheodorou
Department of Library Information and Archive Sciences, University of Malta, MSD2280, Msida, Malta
Milena Dobreva
Library and Information Center, University of Patras, 26504, Patras, Greece
Giannis Tsakonas
National Archives of Malta, RBT1043, Rabat, Malta
Charles J. Farrugia

About this paper

Cite this paper

Meneses, L., Barthwal, H., Singh, S., Furuta, R., Shipman, F. (2013). Restoring Semantically Incomplete Document Collections Using Lexical Signatures. In: Aalberg, T., Papatheodorou, C., Dobreva, M., Tsakonas, G., Farrugia, C.J. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2013. Lecture Notes in Computer Science, vol 8092. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40501-3_33

DOI: https://doi.org/10.1007/978-3-642-40501-3_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40500-6
Online ISBN: 978-3-642-40501-3
eBook Packages: Computer Science, Computer Science (R0)
