Keywords

1 Introduction and Motivation

The progressive digitization process of historical archives makes documentary and archival contents spread beyond their traditional boundaries – the archives themselves – and opens new possibilities of employment for a significant amount of biographical, factual and geographical data available in the archival resources. On the other hand, the Semantic Web and Linked Open Data paradigm provides a stable technological layer for representing such information and for making it reusable in a wide range of applications. In such a context, how to extract and represent this knowledge, in order to make it accessible and usable for the end user, becomes an issue.

In this paper we propose to use a knowledge-based approach for extracting and representing historical events gathered from archives and to make them available for higher-level applications in the form of narrative chains. The resulting framework should provide a reusable ICT solution for the historical domain.

2 State of the Art

This project is situated at the intersection of different research areas; in this section we give a concise account of the most relevant works in such domains, highlighting interesting results that have been achieved, according to our purpose and scope.

Historical events and cultural heritage. In general, a comprehensive survey on the adoption of Semantic Web technologies in the historical research can be found in [20]. Our project shares some of the goals and, partially, the application domain of the Agora project, developed at the VU Amsterdam, aimed at providing a context for cultural heritage collections (in particular, museum objects), connecting them to relevant historical events [26]. The Agora project, in turn, was tightly related to the Semantics of History project, aimed at the extraction of historical events from textual resources. Further results of the two projects, particularly in providing semantically-based access to museum resources, can be found in [7]. An example of semantic, event-based access to cultural resources is depicted in [8] while in [4] events are first class concept in an ontology-based application that employs narratives for museums and exhibitions.

Event ontologies and ontology modeling. For a synthetic introduction to the concept of “event” in the philosophical thought, the “Events” item in the Stanford Encyclopedia of Philosophy can be taken as a reference [2]. As a first step of this Ph.D. project, an analysis of the existing ontologies has been carried out (and it is still ongoing) to assess how each model covers the concept of “event”; among the existing ontologies and models, the Simple Event Model (SEM) [27], the Linked Open Description of Events (LODE) [24], the Event Ontology [10] and the F Event Model [22] are explicitly designed for representing events, though they show different purposes and levels of expressiveness and reusability. SEM is a domain-independent model, aimed at representing events on the web. It explicitly models the concepts of event, actor, time, space, role and authority and it builds on a loose (inclusive) definition of “event”. LODE is a lightweight and property-based ontology and it is the result of an alignment between different ontologies, namely ABC, CIDOC-CRM, Event Ontology and DOLCE + DnS Ultralite (DUL). LODE aims at describing the aspects of an event that result from the answer to the four w-questions: what is happened, where, when and who is involved. The Event Ontology (EO) is a minimal event model, it represents the basic features of an event and, though it can be considered domain-independent, it was designed having in mind the description of music performances (like concerts). The F-Event Model is an upper ontology extending DOLCE + DnS Ultralite (DUL); by means of a rich conceptualization and a quite complex formalization, it allows to describe events in time and space, participation, structural relations between events (like mereology, causality, correlation), documentary support about an event and interpretation.

Other ontologies, like CIDOC-CRM [19] and the Europeana Data Model [9], provide a modeling of the concept of event within the wider framework represented by Cultural Heritage-related resources. An interesting classification of events (at upper level) can be found in [18], where the authors present a formal characterization of events, based on order-sorted logic. Among the foundational ontologies, DOLCE [1] implements in its backbone taxonomy one of the possible classifications of events discussed in [2], distinguishing between activities, accomplishments, achievements and states. The analysis also suggested that there are ontological problems yet not tackled in the representation of historical events [See Sect. 3].

Semantics for diachronical geography. As illustrated in Sect. 3, one of the goals of this Ph.D. proposal is to support a diachronical perspective on the geographical knowledge involved in the representation of historical events and concepts; in this field, valuable experience can be provided by projects like Pleiades [21], a gazetteer of places of ancient times, used for educational purposes, and GeoLat [3], a system allowing users to explore the geography of ancient times in Latin literature texts.

Event extraction. The task of extracting events from textual corpora has been performed according to different approaches, that can be roughly classified in data-driven or knowledge-driven [16]. While the former employ quantitative, statistical techniques and require to be trained on large textual corpora, the latter rely on domain knowledge and perform text mining by means of rules and patterns. A set of publicly available Information Extraction tools (some of which including event extraction), combining NLP and Semantic Web technologies, has been reviewed in [11]. However, as far as historical events are concerned, few examples are available (see [6, 23]), but there seems to be no benchmark to refer to. An interesting result has been achieved in [15], though in that case the input was represented by Wikipedia pages (i.e., a single type of source, providing semi-structured content) rather than by heterogeneous plain texts.

3 Problem Statement and Contributions

The Ph.D. proposal partially originates within Harlock’900, a three-year project established in December 2015 and involving the Department of Computer Science of the University of Turin and the Istituto Piemontese Antonio Gramsci, a non-profit institute promoting research on contemporary history topics. An overview of the approach that will be adopted in Harlock’900 can be found in [13]. The project’s main goal is that of employing a set of resources from the Istituto Gramsci’s archive and enriching the existing metadata with information describing the content of the resource itself; such information will be expressed using formal representations based on ontologies. The goal of the enrichment process is that of allowing the reuse and exploration of information for application (e.g. educational, touristic) purposes. The enriched metadata represent a semantic layer that will ensure a content-based access to archival resources.

The semantic layer should take into account the following aspects:

  • supporting a diachronical perspective on geography, connecting the representation of a place in history with the actualized representation of the same place;

  • connecting the representation of factual/historical entities at different granularity levels, allowing, for example, both a biographical and a general history perspective to be mutually connected and aligned;

  • supporting the use – at application level – of narrative structures by giving a formal representation of events, their factual components and their complex relations.

This Ph.D. research will become part of such a wider project, taking on specific responsibility for the following research questions:

  • what are the full requirements of the semantic layer (from an ontological point of view), accordingly to the previously stated aspects?

  • Are the currently available event models adequate to fulfill the mentioned aspects?

  • If not, what extensions to existing ontologies or integrations between different ontology modules are required for the purpose?

  • Which tools can be used to automatically or semi-automatically perform event mining from text and annotation based on the model?

  • What contribution can be given by Linked Open Data (LOD) (as input) in such a process?

  • What contribution can the resulting framework give to the LOD (as output)?

The research problem at the basis of this proposal can be further decomposed into three main sub-problems:

  • from the analysis performed so far it seems that the current available event models are not able to accurately account for all the conceptual requirements in the historical domain: for example, the reviewed models do not seem to provide the needed means to accurately represent collective entities (e.g. a political party); secondly, it seems difficult, using such models, to give a diachronical representation of places, asserting for example that a certain place, named in a certain way in 1920 and having a certain role at that time, was named differently in 2005 and had a different purpose (a military base became a school, a private house became an hotel, and so on). Third, in the event extraction process it could be beneficial to have an articulated taxonomy of historical events available, to classify the extracted events. In wider terms, it seems that, when individually taken, none of the available event models is able to cover all the conceptual aspects involved in a rich representation of (historical) events.

  • an event-based framework supporting narrative is needed in order to put the domain information enclosed in historical archives at the service of different sorts of applications; this problem could impose backward requirements, especially on the event model;

  • despite remarkable improvements in Natural Language Processing (NLP) and Semantic Web methodologies during the last decade, and even if some state-of-the-art tools are available, there still is no standard for event extraction from text; this project does not have as main goal that of improving NLP methodologies themselves; nevertheless, the availability of a domain event model, designed for the purpose of representing historical events, could allow us to treat our archival resources as a valuable test bed for such technologies.

The Ph.D. proposal aims at providing an event-based framework able to make contents provided by historical archives reusable for original applications. The underlying research hypothesis is that none of the available event models, individually taken as they are, provides a covering of all the aspects that characterize the concept of (historical) “event”, while their integration (and their enrichment, where necessary) would allow us to succeed both in extracting and in representing the semantics of events in historical archives in a form that is suitable for their exploitation in novel contexts (such as tourism or education applications) provided by the Harlock’900 project.

4 Research Methodology and Approach

In order to achieve our research goals, we will adopt an iterative methodology, involving the following steps: (a) analysis of the existing event ontologies and, simultaneously, (b) requirements and data analysis (based on an appropriate selection of texts from the historical archives); (c) design hypothesis of the model and (d) resources annotation based on the model. “Outside” the iterative cycle, once the model will have become stable enough, the connection with the LOD will be established by contributing with the extracted data and, if necessary, by providing mappings of our model.

It is relevant to point out that, while the Harlock’900 project aims at covering different resource types, the Ph.D. proposal will focus only on textual digitized resources (hence images, posters, handwritten papers and similar are excluded from its scope). In particular, given the traditional distinction between primary and secondary historical sources (as recalled in [20]), the considered input data at application level will consist of primary (“narrative”) sources like biographies, newspaper articles, chronicles. In the analysis stage, also secondary resources (e.g., history books) will be employed, as well as digital historical datasets (see for instance the one cited in [15] and available as API at http://www.vizgr.org/historical-events/) for hands-on testing of the existing event models and for creating a corpus of semantic representations of historical events to be used by the system. Since the Harlock’900 project aims at making accessible archival resources, which enclose much information about historical events (that are, in that form, often barely accessible), primary narrative sources are more interesting for the purpose of this work.

The described methodology will be instantiated as follows:

  • requirements: a textual corpus from the archive will be selected and manually analyzed, extracting the ontological design requirements; simultaneously, the available event models are analyzed in order to identify the requirements fulfilled by each model, design patterns, lacks in coverage and differences between the models;

  • hypothesis on the model: based on the previous step, an ontological analysis of the event domain will be carried out, in order to define the intended model; in doing so, particular attention will be paid to the reuse of existing (parts of) ontologies, integrating them as needed; in fact, the analysis will not necessarily produce a new ontological model (i.e., it is not aimed at creating a new ontology from scratch), but it should identify the classes and relations needed for the purpose of the project and the existing models that provide them. Only if needed, an extension in terms of classes or properties will be realized, but the current work is mainly conceived as an integration between different models. This approach will also guarantee a high degree of interoperability with existing ontologies. Furthermore, in the design process, state-of-the-art design methodologies will be employed like OntoClean [14];

  • event extraction and annotation: based on the designed model, historical events must be extracted from textual resources and annotated; a set of tools will be implemented/employed that make use of extraction techniques [11]; a second means for enriching resources (that will be taken into account in Harlock’900 but not in this Ph.D. project) is that of allowing “trusted users” – i.e. domain professionals – to contribute by editing metadata in order to participate to their enrichment.

5 Current Work

Since the presented project refers to a Ph.D. at early stage (beginning: 1 October 2015), no significant result has been achieved so far. Current work is focused on the analysis of the literature and of the existing event ontologies, in order to identify requirements and design choices. Also, the ontology design and annotation experience gained in previous projects, like the one documented in [12], will represent a valuable starting point. The next steps in the analysis will be (a) a hands-on test of the models aimed at understanding their strength and weaknesses in the annotation of historical texts, and (b) an investigation of the LOD cloud to verify which models are employed to describe events, how many and what kind of events are represented.

6 Evaluation Plan

Within the described proposal, three main phases will be subject to evaluation. The semantic event model will be evaluated during each iteration cycle, identifying a suitable corpus of texts in order to asses the coverage and expressiveness of the designed model. The extraction and annotation tool(s) will be evaluated using traditional statistical performance measures (precision, recall, accuracy, for instance). Lastly, once the whole framework is available and operative, a user study will be set up to test its main functionalities with end users.

7 Conclusions

In this paper I outlined my Ph.D. proposal, aimed at extracting semantic knowledge from historical archives and making it available for application purposes. To this end, I stated the main research problem I intend to tackle, together with the approach and research methodology to be followed. Also, the steps to be evaluated and the evaluation approach have been sketched. Due to the early stage of this Ph.D., it has not been possible to provide achieved results yet.