Cross-lingual link discovery with TR-ESA

https://doi.org/10.1016/j.ins.2017.02.019

Abstract

Cross-lingual data linking is the problem of establishing links between resources, such as places, services, or movies, which are described in different languages. In cross-lingual data linking it is often the case that very short descriptions have to be matched, which makes the problem even more challenging. This work presents a method named TRanslation-based Explicit Semantic Analysis (tr-esa) to represent and match short textual descriptions available in different languages. tr-esa translates short descriptions in any given language into a pivot language by exploiting a machine translation tool. Then, it generates a Wikipedia-based representation of the translated text by using the Explicit Semantic Analysis technique. The resulting representations are used to match short descriptions in different languages. The method is incorporated in CroSeR (Cross-lingual Service Retrieval), an interactive data linking tool that recommends potential matches to users. We compared results coming from an in-vitro evaluation on a gold standard consisting of five datasets in different languages, with an in-vivo experiment that involved human experts supported by CroSeR. The in-vivo evaluation confirmed the results of the in-vitro evaluation and the overall effectiveness of the proposed method.

Section snippets

Introduction and motivations

The Linked Data paradigm has been proposed to publish structured data on the web in such a way that the data can be easily consumed by third-party applications [18]. Several tools can be used to transform data into Resource Description Framework (rdf), a format compliant with the Linked Data principles. However, publishing an rdf dataset on the web is not sufficient to realize the vision of linked data. To interconnect two datasets, a data linking task has to be performed. The task

A Semantic matching function for short textual descriptions

Matching two or more texts is essential for several artificial-intelligence tasks, such as classification, clustering, filtering, and retrieval. Text matching can be implemented as simple string matching, which analyzes the lexical overlap between two texts, or it can also take their semantics into account.
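To illustrate why lexical overlap alone is not enough in a cross-lingual setting, the following minimal sketch scores two texts with the Jaccard coefficient of their token sets. The choice of Jaccard and the example phrases are assumptions made for illustration only; they are not taken from the paper.

```python
import re

def tokens(text):
    """Lowercase a text and return the set of its word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def lexical_overlap(a, b):
    """Jaccard coefficient between the token sets of two texts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

# Two English descriptions share most of their words...
same_language = lexical_overlap("birth certificate", "certificate of birth")
# ...while an English and an Italian description of the same thing share none.
cross_language = lexical_overlap("birth certificate", "certificato di nascita")
```

In this toy case the two English descriptions overlap substantially, whereas the English and Italian descriptions have no tokens in common, which is the gap a semantic matching function is meant to close.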

In this section, we present a semantic-based matching function able to deal with short textual content in different languages. The data linking strategy adopted in the paper is based on this

Interactive cross-lingual data linking

tr-esa is used as a feature generation and matching method in the interactive approach to cross-lingual data linking [16], [26], [27] proposed in the paper.
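As a rough sketch of how such a method could generate features and rank candidate matches, the code below translates a source description into a pivot language, maps each text to a weighted vector of Wikipedia concepts, and ranks target descriptions by cosine similarity. Everything concrete here is an assumption: `TOY_TRANSLATIONS` stands in for a machine translation tool, `TOY_ESA_INDEX` stands in for an Explicit Semantic Analysis index (which would normally be derived from Wikipedia article text), and the weights, the function names, and the use of cosine similarity are illustrative choices rather than details confirmed by the paper.

```python
import math
from collections import defaultdict

# Hypothetical stand-in for a machine translation tool: a tiny lookup table
# from Italian descriptions to English, used here as the pivot language.
TOY_TRANSLATIONS = {"certificato di nascita": "birth certificate"}

# Hypothetical stand-in for an ESA index: word -> {Wikipedia concept: weight}.
TOY_ESA_INDEX = {
    "birth": {"Birth certificate": 0.9, "Childbirth": 0.7},
    "certificate": {"Birth certificate": 0.8, "Academic certificate": 0.6},
    "registration": {"Birth certificate": 0.3, "Vehicle registration": 0.8},
    "waste": {"Waste management": 0.9},
    "collection": {"Waste management": 0.5, "Museum": 0.4},
}

def translate(text):
    """Translate into the pivot language; unknown texts pass through unchanged."""
    return TOY_TRANSLATIONS.get(text.lower(), text)

def esa_vector(text):
    """Sum the concept weights of every known word in the text."""
    vector = defaultdict(float)
    for word in text.lower().split():
        for concept, weight in TOY_ESA_INDEX.get(word, {}).items():
            vector[concept] += weight
    return dict(vector)

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(source, targets, k=3):
    """Return the k target descriptions most similar to the source."""
    source_vector = esa_vector(translate(source))
    scored = [(t, cosine(source_vector, esa_vector(translate(t)))) for t in targets]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

ranked = recommend("Certificato di nascita", ["Waste collection", "Birth registration"])
```

With these toy tables, "Birth registration" is ranked above "Waste collection" even though it shares no word with the Italian source text, because both descriptions activate the same concept after translation.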

Definition 1

(Cross-lingual data linking) Let S and T be two sets of resources, called the source (S) and target (T) datasets, described in two different languages $L_1$ and $L_2$, respectively. Let R be a set of relations between resources in S and T. A cross-lingual data linking task can be defined as a partial function $l: S \times T \to R$, defined as follows: $l(s_i, t_j) = r_w$,

CroSeR for cross-lingual linking of E-gov services

The cross-lingual link discovery approach described in Section 3 was implemented in a system named CroSeR (Cross-lingual Service Retrieval). CroSeR supports users in the specific task of linking e-gov services described in different languages. In this domain, S represents the source service catalog, T is the target service catalog, and R is the set of relations defined as R={owl:sameAs, skos:narrowMatch, skos:broadMatch}. The target catalog T is the European Local Government Service List (lgsl).
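One way to picture a linking over S, T, and R is as a partial mapping from (source, target) pairs to relations, where pairs without an entry are simply unlinked. The sketch below is only an illustration under that reading: the names `Relation`, `Linking`, and `link`, as well as the identifiers `s1`, `t7`, and `t8`, are hypothetical and do not come from CroSeR.

```python
from enum import Enum
from typing import Dict, Optional, Tuple

class Relation(Enum):
    """The three relations in R, as listed for the e-gov domain."""
    SAME_AS = "owl:sameAs"
    NARROW_MATCH = "skos:narrowMatch"
    BROAD_MATCH = "skos:broadMatch"

# A linking stores a relation only for the pairs that are actually linked.
Linking = Dict[Tuple[str, str], Relation]

def link(linking: Linking, source_id: str, target_id: str) -> Optional[Relation]:
    """Return the relation for a pair, or None when the pair is not linked."""
    return linking.get((source_id, target_id))

# Made-up identifiers for one source service and two target services.
linking: Linking = {("s1", "t7"): Relation.SAME_AS}

linked = link(linking, "s1", "t7")
unlinked = link(linking, "s1", "t8")
```

Returning `None` for missing pairs mirrors the idea that only some source and target pairs receive a relation.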

Experimental evaluation

We carried out two experimental sessions: an in-vitro experiment to identify the best system configuration, and an in-vivo experiment in which CroSeR helped human experts link an Italian catalog of e-gov services to the lgsl.

We tested our approach in the e-gov domain for different reasons:

- First, linking public service descriptions is a real-world problem of interest for many governments involved in Open Data initiatives. Linking public services is an objective of

Related work

To better scope the problem addressed in this paper, we report the distinction between multi-language information access (mlia) and cross-lingual information access (clia) proposed in the literature. mlia is the problem of accessing, querying and retrieving information from collections in any language and at any level of specificity [43]. In this sense, mlia subsumes clia, which is the problem of accessing a data collection in a target language L′ by using a source language L, where L ≠ L′.

Conclusions and future work

In this paper we presented a cross-lingual link discovery approach based on an effective method to match short textual descriptions written in different languages. Our matching method is based on the definition of tr-esa, a translation-based version of Explicit Semantic Analysis that performs a machine translation of the input text and generates a Wikipedia-based representation of it. This matching method is used to recommend potential cross-lingual links to users of a web application by

References (53)

  • S.C. Deerwester et al., Indexing by latent semantic analysis, JASIS (1990)
  • G. Demartini et al., Large-scale linked data integration using probabilistic reasoning and crowdsourcing, VLDB J. (2013)
  • M. Deshpande et al., Item-based top-n recommendation algorithms, ACM Trans. Inf. Syst. (2004)
  • W.E. Djeddi, M.T. Khadir, XMap++: results for OAEI 2014, in: [48], pp....
  • Z. Dragisic, K. Eckert, J. Euzenat, D. Faria, A. Ferrara, R. Granada, V. Ivanova, E. Jiménez-Ruiz, A.O. Kempf, P....
  • D. Faria, C. Martins, A. Nanavaty, A. Taheri, C. Pesquita, E. Santos, I.F. Cruz, F.M. Couto, Agreementmakerlight...
  • S. Fernando et al., Comparing taxonomies for organising collections of documents
  • P. Ferragina et al., TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities), Proceedings of the 19th ACM International Conference on Information and Knowledge Management (2010)
  • E. Gabrilovich et al., Wikipedia-based semantic interpretation for natural language processing, J. Artif. Intell. Res. (2009)
  • J. Gracia et al., Monolingual and cross-lingual ontology matching with CIDER-CL: evaluation report for OAEI 2013
  • H. Halpin et al., When owl:sameAs isn’t the same: an analysis of identity in linked data, The Semantic Web–ISWC 2010 (2010)
  • T. Heath et al., Linked Data: Evolving the Web into a Global Data Space (2011)
  • T. Hedlund et al., Dictionary-based cross-language information retrieval: learning experiences from CLEF 2000–2002, Inf. Retr. (2004)
  • M.A. Helou et al., Cross-lingual lexical matching with word translation and local similarity optimization, Proceedings of the 10th International Conference on Semantic Systems, SEMANTiCS 2015, Vienna, Austria, September (2015)
  • M.A. Helou et al., Effectiveness of automatic translations for cross-lingual ontology mapping, J. Artif. Intell. Res. (2016)
  • S. Hertling et al., WikiMatch - using Wikipedia for ontology matching, Proceedings of the 7th International Workshop on Ontology Matching (OM 2012) (2012)