ABSTRACT
Knowledge Graphs (KGs) are becoming the core of most artificial intelligence and cognitive applications. Popular KGs such as DBpedia and Wikidata have chosen the RDF data model to represent their data. Despite its advantages, RDF data poses challenges in use, one of which is data validation. Ontologies that specify domain conceptualizations in RDF data are designed for entailment rather than validation, and most of them lack the granular information needed to validate constraints. Recent work on RDF Shapes and the standardization of languages such as SHACL and ShEx provide better mechanisms for representing integrity constraints for RDF data. However, manually creating constraints for large KGs is still a tedious task. In this paper, we present a data-driven approach for inducing integrity constraints for RDF data using data profiling. These constraints can be combined into RDF Shapes and used to validate RDF graphs. Our method uses machine learning techniques to automatically generate RDF Shapes, with profiled RDF data as features. In the experiments, the proposed approach achieved 97% precision in deriving RDF Shapes with cardinality constraints for a subset of DBpedia data.
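To make the idea of profiling-based shape induction concrete, the following is a minimal sketch under stated assumptions, not the method evaluated in the paper. It profiles how many values each property has per subject in a toy set of triples and then applies a naive rule (observed minimum and maximum) to emit SHACL `sh:minCount` and `sh:maxCount` constraints. The paper instead feeds profiled data to machine learning classifiers; the naive rule here is a stand-in for that step. The triples, the `ex:` names, and the helper functions are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy triples (subject, predicate, object); not DBpedia data.
TRIPLES = {
    ("ex:alice", "ex:birthDate", "1980-01-01"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:name", "Alice A."),
    ("ex:bob", "ex:birthDate", "1975-05-05"),
    ("ex:bob", "ex:name", "Bob"),
    ("ex:carol", "ex:name", "Carol"),
}
SUBJECTS = {"ex:alice", "ex:bob", "ex:carol"}  # assumed instances of ex:Person


def profile_cardinalities(triples, subjects):
    """Return {predicate: Counter({cardinality: number_of_subjects})}."""
    values_per_subject = defaultdict(Counter)
    for s, p, _ in triples:
        values_per_subject[p][s] += 1
    # Subjects without the property contribute a cardinality of 0.
    return {
        p: Counter(counts[s] for s in subjects)
        for p, counts in values_per_subject.items()
    }


def induce_bounds(distribution):
    """Naive stand-in for the learned model: observed min and max."""
    return min(distribution), max(distribution)


def to_shacl(shape_name, target_class, bounds):
    """Serialize the induced bounds as a SHACL node shape in Turtle."""
    props = [
        f"    sh:property [ sh:path {p} ; sh:minCount {lo} ; sh:maxCount {hi} ]"
        for p, (lo, hi) in sorted(bounds.items())
    ]
    header = f"{shape_name} a sh:NodeShape ;\n    sh:targetClass {target_class} ;\n"
    return header + " ;\n".join(props) + " ."


profile = profile_cardinalities(TRIPLES, SUBJECTS)
bounds = {p: induce_bounds(dist) for p, dist in profile.items()}
shape = to_shacl("ex:PersonShape", "ex:Person", bounds)
print(shape)
```

Taking the observed minimum and maximum is brittle on noisy data, because a single erroneous resource changes the induced bound. That weakness is one reason to treat the cardinality distribution as a feature vector for a classifier rather than reading the constraint directly off the data.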
Index Terms
- RDF shape induction using knowledge base profiling