Analyzing Information Retrieval Methods to Recover Broken Web Links

  • Conference paper
Advances in Information Retrieval (ECIR 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5993)

Abstract

In this work we compare different techniques for automatically finding candidate web pages to replace broken links. We extract information from the anchor text, from the content of the page containing the link, and from the cached page stored in a digital library. The selected information is processed and submitted to a search engine. We have compared different information retrieval methods for both the selection of the terms used to construct the queries submitted to the search engine and the ranking of the candidate pages that it returns, in order to help the user find the best replacement. In particular, we have used term frequencies and a language model approach for the selection of terms, and cooccurrence measures and a language model approach for ranking the final results. To test the different methods, we have also defined a methodology that does not require user judgments, which increases the objectivity of the results.
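The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain term frequency for query expansion and uses the Dice coefficient as one possible cooccurrence measure for ranking. The function names (`build_query`, `rank_candidates`), the tiny stopword list, and the example URLs are all hypothetical, and the search engine call between the two steps is omitted.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "at", "with"}


def tokenize(text):
    """Lowercase the text, split it into alphanumeric tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def build_query(anchor_text, context_text, k=3):
    """Return the anchor terms plus the k most frequent context terms not in the anchor."""
    anchor_terms = tokenize(anchor_text)
    seen = set(anchor_terms)
    counts = Counter(t for t in tokenize(context_text) if t not in seen)
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return anchor_terms + [term for term, _ in ordered[:k]]


def dice(a, b):
    """Dice coefficient between two term collections: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def rank_candidates(reference_text, candidates):
    """Order (url, text) candidates by Dice overlap with the reference text, best first."""
    ref = tokenize(reference_text)
    scored = [(dice(ref, tokenize(text)), url) for url, text in candidates]
    return [url for _, url in sorted(scored, key=lambda p: (-p[0], p[1]))]


if __name__ == "__main__":
    query = build_query(
        "ECIR conference",
        "The conference on information retrieval accepts information retrieval papers on retrieval",
    )
    print(" ".join(query))
    print(rank_candidates(
        "information retrieval conference",
        [("http://a.example", "cooking recipes and food"),
         ("http://b.example", "information retrieval conference papers")],
    ))
```

In this sketch the reference text passed to `rank_candidates` stands in for whatever source is available (anchor text, surrounding page content, or a cached copy); the language model alternatives mentioned in the abstract are not shown.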

This work has been partially supported by the Spanish Ministry of Science and Innovation within the project QEAVis-Catiex (TIN2007-67581-C02-01) and the Regional Government of Madrid under the Research Network MAVIR (S-0505/TIC-0267).

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Martinez-Romo, J., Araujo, L. (2010). Analyzing Information Retrieval Methods to Recover Broken Web Links. In: Gurrin, C., et al. Advances in Information Retrieval. ECIR 2010. Lecture Notes in Computer Science, vol 5993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12275-0_6

  • DOI: https://doi.org/10.1007/978-3-642-12275-0_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12274-3

  • Online ISBN: 978-3-642-12275-0

  • eBook Packages: Computer Science (R0)
