Cross-lingual link discovery with TR-ESA

doi:10.1016/j.ins.2017.02.019

Information Sciences

Volumes 394–395, July 2017, Pages 68-87

https://doi.org/10.1016/j.ins.2017.02.019

Abstract

Cross-lingual data linking is the problem of establishing links between resources, such as places, services, or movies, that are described in different languages. In cross-lingual data linking, very short descriptions often have to be matched, which makes the problem even more challenging. This work presents a method named TRanslation-based Explicit Semantic Analysis (TR-ESA) to represent and match short textual descriptions available in different languages. TR-ESA translates short descriptions in any given language into a pivot language by means of a machine translation tool. Then, it generates a Wikipedia-based representation of the translated text by using the Explicit Semantic Analysis technique. The resulting representations are used to match short descriptions in different languages. The method is incorporated in CroSeR (Cross-lingual Service Retrieval), an interactive data linking tool that recommends potential matches to users. We compared the results of an in-vitro evaluation on a gold standard consisting of five datasets in different languages with those of an in-vivo experiment that involved human experts supported by CroSeR. The in-vivo evaluation confirmed the results of the in-vitro evaluation and the overall effectiveness of the proposed method.

Section snippets

Introduction and motivations

The Linked Data paradigm has been proposed to publish structured data on the web in such a way that the data can be easily consumed by third-party applications [18]. Several tools can be used to transform data into Resource Description Framework (RDF), a format compliant with the Linked Data principles. However, publishing an RDF dataset on the web is not sufficient to realize the vision of linked data. To interconnect two datasets, a data linking task has to be performed. The task

A Semantic matching function for short textual descriptions

Matching two or more texts is essential for several artificial-intelligence tasks, such as classification, clustering, filtering, and retrieval. Text matching can be implemented as simple string matching, which analyzes the lexical overlap between two texts, or it can also take their semantics into account.

In this section, we present a semantic-based matching function able to deal with short textual content in different languages. The data linking strategy adopted in the paper is based on this
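As an illustration of the translate-then-represent-then-match idea summarized in the abstract, the following Python sketch may help. It is a minimal sketch under stated assumptions, not the authors' implementation: `TOY_DICTIONARY` is a hypothetical stand-in for a machine translation tool, `TOY_ESA_INDEX` is a hypothetical stand-in for a Wikipedia-based concept index, English is assumed as the pivot language, the target description is assumed to be already written in the pivot language, and cosine similarity is assumed as the matching measure.

```python
import math
from collections import Counter

# Hypothetical stand-in for a machine translation tool:
# a tiny Italian -> English word dictionary (illustrative only).
TOY_DICTIONARY = {
    "rilascio": "issue",
    "passaporto": "passport",
    "raccolta": "collection",
    "rifiuti": "waste",
}

# Hypothetical stand-in for a Wikipedia-based ESA index:
# each term maps to weighted concepts (invented weights).
TOY_ESA_INDEX = {
    "issue": {"License": 0.4, "Passport": 0.3},
    "passport": {"Passport": 0.9, "Travel document": 0.7},
    "application": {"License": 0.3, "Passport": 0.2},
    "waste": {"Waste management": 0.9, "Recycling": 0.6},
    "collection": {"Waste management": 0.5},
}


def translate_to_pivot(text, dictionary=TOY_DICTIONARY):
    """Word-by-word 'translation'; unknown words pass through unchanged."""
    return " ".join(dictionary.get(tok, tok) for tok in text.lower().split())


def esa_vector(text, index=TOY_ESA_INDEX):
    """Sum the concept weights of every known term in the text."""
    vector = Counter()
    for tok in text.lower().split():
        for concept, weight in index.get(tok, {}).items():
            vector[concept] += weight
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tr_esa_similarity(source_text, target_text):
    """Translate the source, build concept vectors, compare them."""
    source_vec = esa_vector(translate_to_pivot(source_text))
    target_vec = esa_vector(target_text)  # assumed already in the pivot language
    return cosine(source_vec, target_vec)


if __name__ == "__main__":
    print(tr_esa_similarity("rilascio passaporto", "passport application"))
    print(tr_esa_similarity("raccolta rifiuti", "passport application"))
```

In this toy setting, the two passport-related descriptions share concepts and obtain a positive score, while the waste-related description shares none with the target and scores zero, even though no word is lexically shared between the Italian and English texts.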

Interactive cross-lingual data linking

TR-ESA is used as a feature generation and matching method in the interactive approach to cross-lingual data linking [16], [26], [27] proposed in the paper.

Definition 1

(Cross-lingual data linking) Let S and T be two sets of resources, called the source (S) and target (T) datasets, described in two different languages L₁ and L₂, respectively. Let R be a set of relations between resources in S and T. A cross-lingual data linking task can be defined as a partial function l: S × T → R, defined as follows: $l(s_{i}, t_{j}) = r_{w},$
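To make the notion of a partial function l: S × T → R concrete, it can be stored as a dictionary that holds only the pairs for which a relation has been established. The Python sketch below is an illustration under assumptions: the class name `LinkSet`, its methods, and the resource identifiers are hypothetical and do not come from the paper.

```python
from typing import Dict, Iterable, Optional, Tuple


class LinkSet:
    """A partial function l: S x T -> R, stored as a dictionary of known links."""

    def __init__(self, sources: Iterable[str], targets: Iterable[str],
                 relations: Iterable[str]) -> None:
        self.sources = set(sources)
        self.targets = set(targets)
        self.relations = set(relations)
        self._links: Dict[Tuple[str, str], str] = {}

    def add_link(self, s: str, t: str, r: str) -> None:
        """Record l(s, t) = r after checking that s, t and r are admissible."""
        if s not in self.sources or t not in self.targets:
            raise KeyError(f"unknown resource pair: ({s}, {t})")
        if r not in self.relations:
            raise ValueError(f"unknown relation: {r}")
        self._links[(s, t)] = r

    def __call__(self, s: str, t: str) -> Optional[str]:
        """Return the relation for (s, t), or None where l is undefined."""
        return self._links.get((s, t))


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    l = LinkSet(["s1", "s2"], ["t1", "t2"], ["owl:sameAs", "skos:narrowMatch"])
    l.add_link("s1", "t1", "owl:sameAs")
    print(l("s1", "t1"))  # relation recorded for this pair
    print(l("s1", "t2"))  # undefined pair
```

Returning `None` for unlinked pairs reflects the partiality of l: most pairs in S × T are expected to have no relation at all.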

CroSeR for cross-lingual linking of E-gov services

The cross-lingual link discovery approach described in Section 3 was implemented in a system named CroSeR (Cross-lingual Service Retrieval). CroSeR supports users in the specific task of linking e-gov services described in different languages. In this domain, S represents the source service catalog, T is the target service catalog, and R is the set of relations defined as R = {owl:sameAs, skos:narrowMatch, skos:broadMatch}. The target service catalog T is the European Local Government Service List (LGSL).
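The interaction pattern, in which the system ranks candidate target services and the user chooses a relation from R, can be sketched in Python as follows. This is a hedged illustration, not CroSeR's actual code: the function `recommend`, the placeholder similarity `token_overlap` (a simple Jaccard overlap standing in for the TR-ESA score), and the service labels are all assumptions introduced here.

```python
from enum import Enum
from typing import Callable, List, Sequence, Tuple


class Relation(Enum):
    """The relation set R named in the text."""
    SAME_AS = "owl:sameAs"
    NARROW_MATCH = "skos:narrowMatch"
    BROAD_MATCH = "skos:broadMatch"


def token_overlap(a: str, b: str) -> float:
    """Placeholder similarity (Jaccard over tokens); NOT the TR-ESA score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def recommend(source_label: str, target_labels: Sequence[str],
              similarity: Callable[[str, str], float],
              k: int = 3) -> List[Tuple[str, float]]:
    """Return the k target labels most similar to the source label."""
    scored = [(t, similarity(source_label, t)) for t in target_labels]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical target labels, not actual catalog entries.
    targets = ["passport application", "waste collection", "parking permit"]
    for label, score in recommend("passport issue", targets, token_overlap, k=2):
        print(f"{label}\t{score:.2f}")
    # The user would then confirm a candidate and pick a relation, for example:
    chosen = Relation.NARROW_MATCH
    print(chosen.value)
```

Passing the similarity as a parameter keeps the ranking step independent of how scores are produced, so any scoring function with the same signature could be plugged in.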

Experimental evaluation

We carried out two experimental sessions: an in-vitro experiment to identify the best system configuration, and an in-vivo experiment in which CroSeR was used to help human experts link an Italian catalog of e-gov services to the LGSL.

We tested our approach in the e-gov domain for different reasons:

- First, linking public service descriptions is a real-world problem of interest for many governments involved in Open Data initiatives. Linking public services is an objective of

Related work

To better scope the problem addressed in this paper, we report the distinction between multi-language information access (MLIA) and cross-lingual information access (CLIA) proposed in the literature. MLIA is the problem of accessing, querying and retrieving information from collections in any language and at any level of specificity [43]. In this sense, MLIA subsumes CLIA, which is the problem of accessing a data collection in a target language L′ by using a source language L, where L ≠ L′.

Conclusions and future work

In this paper we presented a cross-lingual link discovery approach based on an effective method to match short textual descriptions written in different languages. Our matching method is based on the definition of TR-ESA, a translation-based version of Explicit Semantic Analysis that performs a machine translation of the input text and generates a Wikipedia-based representation for it. This matching method is used to recommend potential cross-lingual links to users of a web application by

References (53)

E. Belyaeva et al., Using semantic data to improve cross-lingual linking of article clusters, Web Semant. (2015)

J. Gracia et al., Challenges for the multilingual web of data, Web Semant. (2012)

F. Narducci et al., Concept-based item representations for a cross-lingual content-based recommendation process, Inf. Sci. (2016)

A.-C. Ngonga Ngomo, On link discovery using a hybrid approach, J. Data Semant. (2012)

H. Paulheim, WeSeE-Match results for OEAI 2012, Proceedings of the 7th International Workshop on Ontology Matching (OM 2012) (2012)

L.-X. Tang et al., Overview of the NTCIR-10 cross-lingual link discovery task, Proceedings of the Tenth NTCIR Workshop Meeting, page to appear, NII, Tokyo (2013)

C. Baldassarre et al., Bridging the gap between citizens and local administrations with knowledge-based service bundle recommendations, 24th International Workshop on Database and Expert Systems Applications, DEXA 2013, Prague, Czech Republic, August 26–29, 2013 (2013)

D.M. Blei et al., Latent Dirichlet allocation, J. Mach. Learn. Res. (2003)

T. Cassidy et al., Analysis and refinement of cross-lingual entity linking, Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics (2012)

I.F. Cruz et al., Quality-based model for effective and robust multi-user pay-as-you-go ontology matching, Semant. Web (2015)

S.C. Deerwester et al., Indexing by latent semantic analysis, JASIS (1990)

G. Demartini et al., Large-scale linked data integration using probabilistic reasoning and crowdsourcing, VLDB J. (2013)

M. Deshpande et al., Item-based top-n recommendation algorithms, ACM Trans. Inf. Syst. (2004)

W.E. Djeddi, M.T. Khadir, XMap++: results for OAEI 2014, in: [48], pp....

Z. Dragisic, K. Eckert, J. Euzenat, D. Faria, A. Ferrara, R. Granada, V. Ivanova, E. Jiménez-Ruiz, A.O. Kempf, P....

D. Faria, C. Martins, A. Nanavaty, A. Taheri, C. Pesquita, E. Santos, I.F. Cruz, F.M. Couto, Agreementmakerlight...

S. Fernando et al., Comparing taxonomies for organising collections of documents

P. Ferragina et al., TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities), Proceedings of the 19th ACM International Conference on Information and Knowledge Management (2010)

E. Gabrilovich et al., Wikipedia-based semantic interpretation for natural language processing, J. Artif. Intell. Res. (2009)

J. Gracia et al., Monolingual and cross-lingual ontology matching with CIDER-CL: evaluation report for OAEI 2013

H. Halpin et al., When owl:sameAs isn’t the same: an analysis of identity in linked data, The Semantic Web–ISWC 2010 (2010)

T. Heath et al., Linked Data: Evolving the Web into a Global Data Space (2011)

T. Hedlund et al., Dictionary-based cross-language information retrieval: learning experiences from CLEF 2000–2002, Inf. Retr. (2004)

M.A. Helou et al., Cross-lingual lexical matching with word translation and local similarity optimization, Proceedings of the 10th International Conference on Semantic Systems, SEMANTiCS 2015, Vienna, Austria, September (2015)

M.A. Helou et al., Effectiveness of automatic translations for cross-lingual ontology mapping, J. Artif. Intell. Res. (2016)

S. Hertling et al., WikiMatch - using Wikipedia for ontology matching, Proceedings of the 7th International Workshop on Ontology Matching (OM 2012) (2012)

