Learning URI Selection Criteria to Improve the Crawling of Linked Open Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11503)

Abstract

As the Web of Linked Open Data grows, the problem of crawling that cloud becomes increasingly important. Unlike general Web crawlers, a Linked Data crawler is selective: it focuses on collecting linked RDF (including RDFa) data on the Web. From the perspectives of throughput and coverage, the key issue for a Linked Data crawler, given a newly discovered and targeted URI, is to decide whether this URI is likely to dereference into an RDF data source and is therefore worth downloading. Current solutions adopt heuristic rules to filter out irrelevant URIs. Unfortunately, heuristics that are too restrictive hamper the coverage of the crawl. In this paper, we propose and compare approaches for learning strategies to crawl Linked Data on the Web by predicting whether a newly discovered URI will lead to an RDF data source or not. We detail the features used to predict relevance and the methods we evaluated, including a promising adaptation of the FTRL-proximal online learning algorithm. We compare several options through extensive experiments, using existing crawlers as baseline methods, to evaluate their efficacy.
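
To make the prediction task concrete, here is a minimal sketch of an online classifier that scores a URI before it is dereferenced. It uses the standard per-coordinate FTRL-proximal update for logistic regression as described by McMahan et al. (2013); the paper's own adaptation and feature set are not given in this excerpt, so everything below is an assumption for illustration. The names `uri_features` and `FTRLProximal`, the tokenisation of the URI, the hyperparameter values, the toy URIs and their labels are all hypothetical, and `zlib.crc32` stands in for MurmurHash3 only because it ships with the standard library.

```python
import math
import re
import zlib
from urllib.parse import urlsplit

DIM = 2 ** 20  # size of the hashed feature space (assumed value)


def uri_features(uri, dim=DIM):
    """Map a URI to a sorted list of hashed, binary feature indices."""
    parts = urlsplit(uri)
    tokens = ["bias", "scheme=" + parts.scheme, "host=" + parts.netloc]
    tokens += ["path=" + t for t in re.split(r"[/._-]+", parts.path) if t]
    tokens += ["query=" + t for t in re.split(r"[&=]+", parts.query) if t]
    # CRC32 is a stand-in hash; a real crawler might use MurmurHash3 instead.
    return sorted({zlib.crc32(t.encode("utf-8")) % dim for t in tokens})


class FTRLProximal:
    """Per-coordinate FTRL-proximal for logistic loss with binary features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=0.1, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # accumulated adjusted gradients per coordinate
        self.n = {}  # accumulated squared gradients per coordinate

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 regularisation keeps this coordinate at zero
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n.get(i, 0.0))) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict(self, indices):
        """Estimated probability that the URI leads to RDF data."""
        score = sum(self._weight(i) for i in indices)
        score = max(-35.0, min(35.0, score))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, indices, label):
        """Learn from one observed outcome (label 1 = RDF, 0 = not RDF)."""
        p = self.predict(indices)
        g = p - label  # gradient of the log loss for a feature with value 1
        for i in indices:
            w = self._weight(i)
            n_i = self.n.get(i, 0.0)
            sigma = (math.sqrt(n_i + g * g) - math.sqrt(n_i)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * w
            self.n[i] = n_i + g * g
        return p


# Toy stream of (URI, outcome) pairs; the labels are made up.
stream = [
    ("http://example.org/data/alice.rdf", 1),
    ("http://example.org/page/alice.html", 0),
    ("http://example.org/data/bob.rdf", 1),
    ("http://example.org/page/bob.html", 0),
]
model = FTRLProximal()
for _ in range(20):
    for uri, label in stream:
        model.update(uri_features(uri), label)

p_rdf = model.predict(uri_features("http://example.org/data/carol.rdf"))
p_html = model.predict(uri_features("http://example.org/page/carol.html"))
print(p_rdf, p_html)
```

In a crawler, `predict` would be called when a URI is discovered and `update` once the fetch reveals the actual content type, so the model keeps learning as the crawl proceeds.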

Notes

  1. RFC 3986, section 3 (2005).

  2. https://github.com/aappleby/smhasher/wiki/MurmurHash3.

  3. A Bloom filter may report false positives (but never false negatives) with low probability, so a URI may occasionally be assigned a wrong content-type feature.

  4. https://any23.apache.org/.

  5. We only consider hard URIs.
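
The false-positive behaviour mentioned in note 3 can be illustrated with a small Bloom filter. This is a generic sketch, not the paper's implementation: the class name `BloomFilter`, the bit-array size, the number of hash functions, the SHA-256 based double hashing and the example URIs are all assumptions, and how the filter is wired to content types in the actual crawler is not described in this excerpt.

```python
import hashlib


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + k * h2) % self.num_bits for k in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if every position is set; unrelated items may set them too.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical use: remember URIs that were observed to serve RDF.
seen_rdf = BloomFilter()
known = ["http://example.org/data/alice.rdf", "http://example.org/data/bob.rdf"]
for uri in known:
    seen_rdf.add(uri)

print(all(uri in seen_rdf for uri in known))
```

Every added URI is always reported as present. A URI that was never added is usually reported as absent, but can occasionally be reported as present, which is the source of the wrong feature values the note warns about.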

References

  1. Berners-Lee, T.: Linked data - design issues (2006). https://www.w3.org/DesignIssues/LinkedData.html

  2. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)

  3. Burer, S., Monteiro, R.D.C.: A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Math. Program. 95(2), 329–357 (2003)

  4. Chakrabarti, S., van den Berg, M., Dom, B.: Focused crawling: a new approach to topic-specific web resource discovery. Comput. Netw. 31(11–16), 1623–1640 (1999)

  5. Diligenti, M., Coetzee, F., Lawrence, S., Giles, C.L., Gori, M.: Focused crawling using context graphs. In: VLDB, pp. 527–534 (2000)

  6. Dodds, L.: Slug: A Semantic Web Crawler (2006)

  7. Duchi, J.C., Singer, Y.: Efficient learning using forward-backward splitting. In: NIPS, pp. 495–503 (2009)

  8. Ermilov, I., Lehmann, J., Martin, M., Auer, S.: LODStats: the data web census dataset. In: Groth, P., et al. (eds.) ISWC 2016. LNCS, vol. 9982, pp. 38–46. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46547-0_5

  9. Färber, M., Bartscherer, F., Menne, C., Rettinger, A.: Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Seman. Web 9(1), 77–129 (2018)

  10. Heath, T., Bizer, C.: Linked Data: Evolving the Web Into a Global Data Space, vol. 1. Morgan & Claypool Publishers, San Rafael (2011)

  11. Hogan, A., Harth, A., Passant, A., Decker, S., Polleres, A.: Weaving the pedantic web. In: LDOW (2010)

  12. Hogan, A., Harth, A., Umbrich, J., Kinsella, S., Polleres, A., Decker, S.: Searching and browsing linked data with SWSE: the semantic web search engine. J. Web Sem. 9(4), 365–401 (2011)

  13. Isele, R., Umbrich, J., Bizer, C., Harth, A.: LDspider: an open-source crawling framework for the web of linked data. In: Proceedings of the ISWC 2010 Posters & Demonstrations Track (2010)

  14. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 6, 707–710 (1966)

  15. McMahan, H.B.: Follow-the-regularized-leader and mirror descent: equivalence theorems and L1 regularization. In: AISTATS, pp. 525–533 (2011)

  16. McMahan, H.B., et al.: Ad click prediction: a view from the trenches. In: SIGKDD, pp. 1222–1230 (2013)

  17. Meusel, R., Mika, P., Blanco, R.: Focused crawling for structured data. In: CIKM, pp. 1039–1048 (2014)

  18. Umbrich, J., Harth, A., Hogan, A., Decker, S.: Four heuristics to guide structured content crawling. In: ICWE, pp. 196–202 (2008)

  19. Weinberger, K.Q., Dasgupta, A., Langford, J., Smola, A.J., Attenberg, J.: Feature hashing for large scale multitask learning. In: ICML, pp. 1113–1120 (2009)

  20. Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. In: NIPS, pp. 2116–2124 (2009)

Acknowledgement

This work is supported by the ANSWER project PIA FSN2 N°P159564-2661789/DOS0060094 between Inria and Qwant.

Author information

Correspondence to Hai Huang.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Huang, H., Gandon, F. (2019). Learning URI Selection Criteria to Improve the Crawling of Linked Open Data. In: Hitzler, P., et al. The Semantic Web. ESWC 2019. Lecture Notes in Computer Science, vol 11503. Springer, Cham. https://doi.org/10.1007/978-3-030-21348-0_13

  • DOI: https://doi.org/10.1007/978-3-030-21348-0_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-21347-3

  • Online ISBN: 978-3-030-21348-0

  • eBook Packages: Computer Science, Computer Science (R0)
