Learning URI Selection Criteria to Improve the Crawling of Linked Open Data

Huang, Hai; Gandon, Fabien

doi:10.1007/978-3-030-21348-0_13

Conference paper
First Online: 25 May 2019

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11503)

Abstract

As the Web of Linked Open Data grows, the problem of crawling that cloud becomes increasingly important. Unlike general Web crawlers, a Linked Data crawler is selective: it focuses on collecting linked RDF (including RDFa) data on the Web. From the perspectives of throughput and coverage, the key issue for a Linked Data crawler, given a newly discovered and targeted URI, is to decide whether this URI is likely to dereference into an RDF data source, and therefore whether the representation it points to is worth downloading. Current solutions adopt heuristic rules to filter out irrelevant URIs. Unfortunately, heuristics that are too restrictive hamper the coverage of the crawl. In this paper, we propose and compare approaches that learn strategies for crawling Linked Data on the Web by predicting whether a newly discovered URI will lead to an RDF data source. We detail the features used to predict relevance and the methods we evaluated, including a promising adaptation of the FTRL-proximal online learning algorithm. We compare several options through extensive experiments, using existing crawlers as baseline methods, to evaluate their efficacy.
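To make the prediction task concrete, the following is a minimal sketch of an online classifier that scores a URI before it is dereferenced. It uses the per-coordinate FTRL-proximal update for logistic regression as published by McMahan et al. (2013). It is not the adaptation evaluated in the paper. The feature set (hashed URI tokens), the use of `zlib.crc32` as a stand-in hash function, the hyperparameter values, and the toy training URIs are all assumptions made for illustration.

```python
import math
import re
import zlib

DIM = 2 ** 18  # size of the hashed feature space (assumed value)


def uri_features(uri):
    """Map a URI to sorted hashed indices of its tokens (binary features).

    Splitting on non-alphanumeric characters is an illustrative choice,
    not the feature set described in the paper.
    """
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", uri.lower()) if t]
    # zlib.crc32 is a deterministic stand-in for a dedicated hash function.
    return sorted({zlib.crc32(t.encode("utf-8")) % DIM for t in tokens})


class FTRLProximal:
    """Per-coordinate FTRL-proximal logistic regression with L1/L2 terms."""

    def __init__(self, alpha=0.1, beta=1.0, l1=0.1, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-coordinate accumulated adjusted gradients
        self.n = {}  # per-coordinate sum of squared gradients

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 regularization keeps this weight at exactly zero
        n = self.n.get(i, 0.0)
        scale = (self.beta + math.sqrt(n)) / self.alpha + self.l2
        return -(z - math.copysign(self.l1, z)) / scale

    def predict(self, idx):
        """Estimated probability that the URI leads to an RDF source."""
        s = sum(self._weight(i) for i in idx)
        s = max(min(s, 35.0), -35.0)  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, idx, y):
        """Update the model once the true label y (0 or 1) is known."""
        p = self.predict(idx)
        g = p - y  # log-loss gradient for a binary feature (x_i = 1)
        for i in idx:
            w = self._weight(i)
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * w
            self.n[i] = n + g * g
        return p


# Toy stream: label 1 means the URI dereferenced to RDF (hypothetical data).
train = [
    ("http://example.org/data/alice.rdf", 1),
    ("http://example.org/photos/alice.jpg", 0),
    ("http://example.org/data/bob.rdf", 1),
    ("http://example.org/photos/bob.jpg", 0),
]

model = FTRLProximal()
for _ in range(20):
    for uri, label in train:
        model.update(uri_features(uri), label)

for uri in ("http://example.org/data/carol.rdf",
            "http://example.org/photos/carol.jpg"):
    print(uri, round(model.predict(uri_features(uri)), 3))
```

In a crawler, `predict` would be called when a URI is discovered and `update` after the URI has actually been fetched, so the model keeps learning during the crawl without storing past examples.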

Notes

1. RFC 3986, section 3 (2005).
2. https://github.com/aappleby/smhasher/wiki/MurmurHash3.
3. A Bloom filter may report false positives (but never false negatives) with low probability. Thus a URI may be assigned a wrong content type feature.
4. https://any23.apache.org/.
5. We only consider hard URIs.
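
The behaviour described in note 3 can be illustrated with a minimal Bloom filter sketch. How the paper uses the filter is not shown in this excerpt. The class below, its sizes, the SHA-256-based hashing, and the example URIs are assumptions for illustration only.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over strings (illustrative sketch)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if every position is set: may be a false positive.
        # False means the item was definitely never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical URIs recorded in the filter.
seen = ["http://example.org/a.rdf", "http://example.org/b.ttl"]
bf = BloomFilter()
for uri in seen:
    bf.add(uri)

print(all(uri in bf for uri in seen))  # added items are always reported
```

An item that was added always sets all of its bit positions, so membership tests never miss it. An item that was never added can still find all of its positions set by other items, which is the source of the occasional false positive.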

Acknowledgement

This work is supported by the ANSWER project PIA FSN2 N°P159564-2661789/DOS0060094 between Inria and Qwant.

Author information

Authors and Affiliations

Inria, Université Côte d’Azur, CNRS, I3S, Sophia Antipolis, France
Hai Huang & Fabien Gandon

Corresponding author

Correspondence to Hai Huang.

Editor information

Editors and Affiliations

Wright State University, Dayton, OH, USA
Pascal Hitzler
KMi, The Open University, Milton Keynes, UK
Miriam Fernández
University of California, Santa Barbara, CA, USA
Krzysztof Janowicz
Maastricht University, Maastricht, The Netherlands
Amrapali Zaveri
Heriot-Watt University, Edinburgh, UK
Alasdair J.G. Gray
IBM Research, Dublin, Ireland
Vanessa Lopez
The Australian National University, Canberra, ACT, Australia
Armin Haller
Jönköping University, Jönköping, Sweden
Karl Hammar

About this paper

Cite this paper

Huang, H., Gandon, F. (2019). Learning URI Selection Criteria to Improve the Crawling of Linked Open Data. In: Hitzler, P., et al. The Semantic Web. ESWC 2019. Lecture Notes in Computer Science, vol 11503. Springer, Cham. https://doi.org/10.1007/978-3-030-21348-0_13

DOI: https://doi.org/10.1007/978-3-030-21348-0_13
Published: 25 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-21347-3
Online ISBN: 978-3-030-21348-0
eBook Packages: Computer Science, Computer Science (R0)
