Providing Insight into Data Source Topics

  • Original Article
Journal on Data Semantics

Abstract

A fundamental service for exploiting the large data sources now available online is the ability to identify the topics of the data they contain. Unfortunately, heterogeneity and the lack of centralized control make it difficult to identify the topics directly from the actual values used in the sources. We present an approach that generates signatures of sources and matches them against a reference vocabulary of concepts, through the concepts' own signatures, to produce a description of the topics of the source in terms of this reference vocabulary. The reference vocabulary may be provided ready-made, created manually, or created by applying our signature-generation algorithm to a well-curated data source with clearly identified topics. In our case, we used DBpedia to create the vocabulary, since it is one of the largest known collections of entities and concepts. The signatures are generated by exploiting the entropy and the mutual information of the attributes of the sources to generate semantic identifiers of the various attributes, which together form a unique signature of the concepts (i.e., the topics) of the source. Because the identifiers are based on the entropy of the attribute values, they are independent of the naming heterogeneity of attributes or tables. Although the use of traditional information-theoretical quantities such as entropy and mutual information is not new, these quantities may become untrustworthy because they are sensitive to overfitting and require as many samples as were used to construct the reference vocabulary. To overcome these limitations, we normalize and use pseudo-additive entropy measures, which automatically downweight the role of vocabulary items and property values with very low frequencies, resulting in a more stable solution than their traditional counterparts. We have implemented our theory in a system called WHATSIT, and we experimentally demonstrate its effectiveness.
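To make the entropy-based idea in the abstract concrete, the following is a minimal sketch and not the WHATSIT implementation. It computes the Shannon entropy of an attribute's values and a pseudo-additive entropy in the Tsallis form, S_q = (1 - Σ p_i^q) / (q - 1). The function names, the sample columns, and the choice q = 2 are assumptions made for illustration. The normalization step mentioned in the abstract is omitted because its exact form is not given here.

```python
import math
from collections import Counter


def value_probabilities(values):
    """Empirical probability of each distinct value in a column."""
    counts = Counter(values)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical value distribution."""
    return -sum(p * math.log2(p) for p in value_probabilities(values))


def tsallis_entropy(values, q=2.0):
    """Pseudo-additive (Tsallis-form) entropy; q is an assumed parameter, q != 1."""
    probs = value_probabilities(values)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def relative_change(before, after):
    """Relative change of a measure between two versions of a column."""
    return abs(after - before) / before


# Hypothetical columns: different names and labels, same value distribution.
country = ["IT", "IT", "FR", "DE"]
nazione = ["Italia", "Italia", "Francia", "Germania"]

# Hypothetical column before and after one very rare value appears.
balanced = ["a"] * 50 + ["b"] * 50
with_rare_value = balanced + ["c"]

if __name__ == "__main__":
    # Entropy depends only on the value distribution, not on names or labels.
    print(shannon_entropy(country), shannon_entropy(nazione))
    # One rare value shifts Shannon entropy more than the q = 2 entropy.
    print(relative_change(shannon_entropy(balanced),
                          shannon_entropy(with_rare_value)))
    print(relative_change(tsallis_entropy(balanced),
                          tsallis_entropy(with_rare_value)))
```

In this toy data the two differently named columns receive the same entropy, which illustrates why such identifiers do not depend on attribute or table names. Adding a single rare value changes the q = 2 measure proportionally less than it changes the Shannon entropy, which is the downweighting effect the abstract describes.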

Notes

  1. http://ckan.org.

  2. http://dbpedia.org.

  3. For the reference ontology, the maximum entropies and entropy variances are stored in \(GE_{index}\), while, for the target source, these measures have to be computed at runtime.

  4. http://wiki.dbpedia.org/Datasets/DatasetStatistics.

  5. Although the table shows the analysis for only a few properties belonging to three classes, we performed the experiment over 50+ properties belonging to 10+ classes and obtained results entirely similar to those shown.

  6. For the sake of simplicity, the table shows the analysis for only a few classes. Nevertheless, we performed the experiment over 50+ classes, and the results showed trends similar to those represented.

  7. http://perso.telecom-paristech.fr/~eagan/class/as2013/inf229/labs/datasets.

  8. http://www.w3.org/TR/WD-rdf-syntax-971002/.

Acknowledgments

The authors would like to acknowledge the networking support by the COST Action IC1302 (http://www.keystone-cost.eu).

Author information

Corresponding author

Correspondence to Francesco Guerra.

About this article

Cite this article

Bergamaschi, S., Ferrari, D., Guerra, F. et al. Providing Insight into Data Source Topics. J Data Semant 5, 211–228 (2016). https://doi.org/10.1007/s13740-016-0063-6

  • DOI: https://doi.org/10.1007/s13740-016-0063-6
