Anytime Large-Scale Analytics of Linked Open Data

Soulet, Arnaud; Suchanek, Fabian M.

doi:10.1007/978-3-030-30793-6_33

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11778)

Included in the following conference series:

International Semantic Web Conference

Abstract

Analytical queries are queries with numerical aggregators: computing the average number of objects per property, identifying the most frequent subjects, etc. Such queries are essential to monitor the quality and the content of the Linked Open Data (LOD) cloud. Many analytical queries cannot be executed directly on the SPARQL endpoints, because the fair use policy cuts off expensive queries. In this paper, we show how to rewrite such queries into a set of queries that each satisfy the fair use policy. We then show how to execute these queries in such a way that the result provably converges to the exact query answer. Our algorithm is an anytime algorithm, meaning that it can give intermediate approximate results at any time point. Our experiments show that the approach converges rapidly towards the exact solution, and that it can compute even complex indicators at the scale of the LOD cloud.
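
The anytime behaviour described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes, purely for illustration, that the expensive aggregate has already been split into cheap sub-queries, one per partition of the data, and that each sub-query returns a partial sum. The names `anytime_sum`, `run_subquery`, and `TOY_COUNTS` are hypothetical, and the toy dictionary stands in for a SPARQL endpoint. The sub-queries are visited in random order without replacement, and after each one the partial sum is scaled up to give an approximate answer, so the estimate equals the exact sum once every partition has been visited.

```python
import random
from typing import Callable, Iterator, Sequence


def anytime_sum(
    partitions: Sequence[str],
    run_subquery: Callable[[str], float],
    seed: int = 0,
) -> Iterator[float]:
    """Yield a running estimate of sum(run_subquery(p) for p in partitions).

    Each partition is queried exactly once, in random order, so the last
    value yielded is the exact sum.
    """
    order = list(partitions)
    random.Random(seed).shuffle(order)  # random visiting order (assumption)
    partial = 0.0
    for done, part in enumerate(order, start=1):
        partial += run_subquery(part)  # one cheap query per partition
        # Scale the partial sum up to the full number of partitions.
        yield partial * len(order) / done


# Toy stand-in for an endpoint: the number of triples per (hypothetical) property.
TOY_COUNTS = {"p1": 40, "p2": 10, "p3": 30, "p4": 20}

estimates = list(anytime_sum(list(TOY_COUNTS), TOY_COUNTS.__getitem__))
```

A caller can stop consuming the generator at any point and keep the latest value as the approximate answer; the intermediate values depend on the random order, while the final one does not.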

Notes

1. https://wiki.dbpedia.org/public-sparql-endpoint
2. Given a set \(\varOmega\) with a probability distribution P, \(x \sim P(\varOmega)\) denotes that the element \(x \in \varOmega\) is drawn at random with probability \(P(x)\).

Acknowledgements

This work was partially supported by the grant ANR-16-CE23-0007-01 (“DICOS”).

Author information

Authors and Affiliations

Université de Tours, LIFAT, Blois, France
Arnaud Soulet
Telecom Paris, Institut Polytechnique de Paris, Paris, France
Arnaud Soulet & Fabian M. Suchanek

Corresponding author

Correspondence to Arnaud Soulet.

Editor information

Editors and Affiliations

Fondazione Bruno Kessler, Trento, Italy
Chiara Ghidini
Linköping University, Linköping, Sweden
Olaf Hartig
University of Bonn, Bonn, Germany
Maria Maleshkova
University of Economics Prague, Prague, Czech Republic
Vojtěch Svátek
University of Illinois at Chicago, Chicago, IL, USA
Isabel Cruz
University of Chile, Santiago, Chile
Aidan Hogan
Memect Technology, Beijing, China
Jie Song
Mines Saint-Etienne, Saint-Etienne, France
Maxime Lefrançois
Inria Sophia Antipolis - Méditerranée, Sophia Antipolis, France
Fabien Gandon

Cite this paper

Soulet, A., Suchanek, F.M. (2019). Anytime Large-Scale Analytics of Linked Open Data. In: Ghidini, C., et al. (eds.) The Semantic Web – ISWC 2019. ISWC 2019. Lecture Notes in Computer Science, vol 11778. Springer, Cham. https://doi.org/10.1007/978-3-030-30793-6_33

DOI: https://doi.org/10.1007/978-3-030-30793-6_33
Published: 17 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30792-9
Online ISBN: 978-3-030-30793-6
eBook Packages: Computer Science, Computer Science (R0)
