Abstract
A fundamental service for exploiting the large data sources now available online is the ability to identify the topics of the data they contain. Unfortunately, the heterogeneity of these sources and the lack of centralized control make it difficult to identify the topics directly from the actual values used in the sources. We present an approach that generates a signature for each source and matches it against the signatures of the concepts in a reference vocabulary, producing a description of the topics of the source in terms of that vocabulary. The reference vocabulary may be supplied ready-made, created manually, or created by applying our signature-generation algorithm to a well-curated data source whose topics are clearly identified. In our case, we used DBpedia to create the vocabulary, since it is one of the largest known collections of entities and concepts. The signatures are generated by exploiting the entropy and the mutual information of the attributes of the sources to produce semantic identifiers of the attributes, which together form a unique signature of the concepts (i.e., the topics) of the source. Because the identifiers are based on the entropy of the attribute values, they are independent of the naming heterogeneity of attributes or tables. Although traditional information-theoretic quantities such as entropy and mutual information are well established, they can become unreliable: they are sensitive to overfitting, and they require the same number of samples as was used to construct the reference vocabulary. To overcome these limitations, we normalize the measures and use pseudo-additive entropy, which automatically downweights vocabulary items and property values with very low frequencies, yielding a more stable solution than the traditional counterparts.
We have materialized our theory in a system called WHATSIT and we experimentally demonstrate its effectiveness.
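The contrast between traditional and pseudo-additive entropy can be illustrated with a minimal sketch. This is not the WHATSIT implementation: it assumes the Tsallis form of pseudo-additive entropy, an entropic index q = 2, and normalization by the maximum entropy attainable for the observed number of distinct values. The function names are hypothetical, and the abstract does not specify any of these choices.

```python
import math
from collections import Counter


def value_distribution(values):
    """Relative frequency of each distinct value of an attribute."""
    counts = Counter(values)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def shannon_entropy(values):
    """Traditional (additive) entropy, in nats."""
    return -sum(p * math.log(p) for p in value_distribution(values))


def tsallis_entropy(values, q=2.0):
    """Pseudo-additive entropy in the Tsallis form: (1 - sum p^q) / (q - 1).

    The index q is an assumed parameter. For q > 1, each p^q term shrinks
    quickly as p gets small, so rare values contribute relatively less.
    """
    if q == 1.0:
        return shannon_entropy(values)  # limiting case q -> 1
    probs = value_distribution(values)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def normalized_tsallis_entropy(values, q=2.0):
    """Entropy divided by its maximum (uniform distribution over n values).

    Normalizing to [0, 1] is an assumed way of making attributes with
    different numbers of distinct values comparable.
    """
    n = len(set(values))
    if n <= 1:
        return 0.0  # a constant (or empty) attribute carries no information
    if q == 1.0:
        max_entropy = math.log(n)
    else:
        max_entropy = (1.0 - n ** (1.0 - q)) / (q - 1.0)
    return tsallis_entropy(values, q) / max_entropy


# Hypothetical attribute: one dominant value and two values seen only once.
skewed = ["a"] * 98 + ["b", "c"]
print(shannon_entropy(skewed), tsallis_entropy(skewed))
```

In this toy attribute, the two rare values account for most of the Shannon entropy but only about half of the q = 2 entropy, which is the downweighting effect the abstract describes.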
Notes
For the reference ontology, the maximum entropies and entropy variances are stored in \(GE_{index}\), while, for the target source, these measures have to be computed at runtime.
Although the table shows the analysis for only a few properties belonging to three classes, we ran the experiment over 50+ properties belonging to 10+ classes and obtained results entirely similar to those shown.
For the sake of simplicity, the table shows the analysis for only a few classes. Nevertheless, we ran the experiment over 50+ classes, and the results showed trends similar to those represented.
Acknowledgments
The authors would like to acknowledge the networking support by the COST Action IC1302 (http://www.keystone-cost.eu).
Cite this article
Bergamaschi, S., Ferrari, D., Guerra, F. et al. Providing Insight into Data Source Topics. J Data Semant 5, 211–228 (2016). https://doi.org/10.1007/s13740-016-0063-6
DOI: https://doi.org/10.1007/s13740-016-0063-6