1 Introduction

Many datasets, such as demographic data, economic indexes, or the results of public policies, are statistical in nature [1]. Often these data need to be combined to create value, and data cubes are useful for combining data [2]. A data cube is an array built from two or more datasets based on Structured Query Language (SQL) join functionality [3]. Data cubes enable data analysis, for example of time series, to detect trends, abnormalities and unusual patterns, or to compare geographic regions with each other. The authors of [4] show that data cubes can be used to aggregate unemployment and election datasets to explore the relationship between them.
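The join-based combination underlying a data cube can be sketched in a few lines of SQL. The following is a minimal illustrative sketch, inspired by the unemployment-and-election example of [4]; the table layout and all figures are assumptions made up for illustration.

```python
import sqlite3

# Two illustrative statistical datasets sharing the dimensions
# (region, year); all numbers are invented for the sketch.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE unemployment (region TEXT, year INTEGER, rate REAL)")
cur.execute("CREATE TABLE election (region TEXT, year INTEGER, turnout REAL)")
cur.executemany("INSERT INTO unemployment VALUES (?, ?, ?)",
                [("North", 2016, 7.2), ("South", 2016, 5.4)])
cur.executemany("INSERT INTO election VALUES (?, ?, ?)",
                [("North", 2016, 61.0), ("South", 2016, 68.5)])

# Join on the shared dimensions to relate the two measures,
# the operation on which data cube construction is based.
cube = cur.execute("""
    SELECT u.region, u.year, u.rate, e.turnout
    FROM unemployment u
    JOIN election e ON u.region = e.region AND u.year = e.year
    ORDER BY u.region
""").fetchall()
```

Each joined row pairs an unemployment rate with an election turnout for the same region and year, which is exactly the kind of combined view a cube exposes.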

Organising and reusing datasets is often hard due to challenges such as access to data [5], manipulation of data [6], accuracy of data [7], and a long list of other data quality issues [8,9,10,11,12]. Linking such datasets using the Linked Open Statistical Data (LOSD) approach enables the creation of data cubes. Statistical datasets have their peculiarities, and for this reason the W3C adopted the Resource Description Framework (RDF) Data Cube (QB) vocabulary to standardise the modelling of cubes as RDF graphs [13]. Since statistical data cube platforms are still at an early stage of maturity [6], there is a need to evaluate OpenCube platforms, yet no models exist for evaluating them.

2 Research Approach

The objective of this paper is to develop a framework for evaluating open data cube platforms (ODCPs). Eight main processes are identified and a list of 23 requirements is derived which can be used to evaluate OpenCube platforms and applications. Using this evaluation model, six cases were evaluated. The first three cases were developed by students at Delft University of Technology (https://goo.gl/y5HgJq), whereas the other three cases were developed within the OpenGovIntelligence project (www.opengovintelligence.eu).

The literature provides no overview of the functions needed by data cubes. Nevertheless, ISO/IEC 25010:2010, the standard for Systems and Software Quality Requirements and Evaluation [14], can be of help, as it presents a structured list of requirements. This list of requirements is used here for evaluating statistical cube platforms. Further, based on the description of ISO 25010:2010, we created questions to evaluate each of the requirements, as presented in Table 1.

Table 1. Open statistical data cube parameters, requirements and questions

The questionnaire was used to evaluate six case studies in which open data cubes were designed using the OpenGovIntelligence platform. The survey was conducted in a qualitative way to identify whether the platform could be used to design statistical data cubes. The answers allowed us to evaluate the data cubes by looking at which requirements were fulfilled by the open data cube platform. This also allowed us to identify the main issues that designers face during the design and implementation of open statistical data cubes. The requirements covered were used as an indication of the maturity of development.

3 Background

Statistical data is often organised in a multidimensional manner, where a measured fact is described by a number of dimensions. As an example, Olympics statistics can involve three dimensions: country (USA, GB, China), medal (gold, silver, bronze) and year (2004, 2008, 2012), as summarised in Fig. 1 [15]. In the example, each cell contains a measure referring to Olympic statistical data, but together the cells form a data cube.

Fig. 1. Olympic medals distributed by country over the years.
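The three-dimensional cube of Fig. 1 can be sketched in memory as a mapping from dimension tuples to measures. This is a minimal illustrative sketch; the medal counts are invented, not real Olympic data.

```python
# Each cell of the cube is addressed by the dimension tuple
# (country, medal, year) and holds one measure (a medal count).
# All counts below are illustrative placeholders.
cube = {
    ("USA",   "gold",   2008): 36,
    ("USA",   "gold",   2012): 46,
    ("China", "gold",   2008): 48,
    ("GB",    "silver", 2012): 17,
}

# One cell is one observation: the USA gold-medal count for 2012.
usa_gold_2012 = cube[("USA", "gold", 2012)]
```

Fixing values for all dimensions selects a single cell; fixing only some dimensions yields the slices and roll-ups discussed in the analysis steps below.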

The functionality we derived was created by adapting the Linked Open Statistical Data Cubes (LOSDC) cycle consisting of eight steps [1], as modified by [16]. The steps are divided into (1) Data Cube Creation and (2) Data Cube Analysis processes. Figure 2 shows the main steps, which are described hereafter, along with the typical software tools used to support each step.

Fig. 2. Data cube steps.

A-Data Cubes Creation Processes

Step 1-Discover and Pre-process Raw Data

This first step is aimed at handling and preparing the file formats to be ready for the next steps. As an example, XLS (spreadsheet file format), Comma-Separated Values (CSV) and JavaScript Object Notation (JSON) files are used as input. One of the most widely used tools for this step is OpenRefine (http://openrefine.org/). This step is needed to increase the capacity and resilience for managing, updating and extending data, because the output is in a more interoperable format (CSV, JSON) than, for example, XLS.
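A typical pre-processing task of this step is converting a raw CSV extract into JSON while fixing up value types. The sketch below uses only the standard library; the column names and values are assumptions for illustration.

```python
import csv
import io
import json

# Raw CSV as it might arrive from a statistics office (illustrative).
raw_csv = """region,year,unemployment_rate
North,2016,7.2
South,2016,5.4
"""

# Parse rows into dictionaries keyed by the header fields.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Cast numeric fields, a typical clean-up before cube building:
# CSV delivers every value as a string.
for row in rows:
    row["year"] = int(row["year"])
    row["unemployment_rate"] = float(row["unemployment_rate"])

# Serialise the cleaned records as JSON for the next step.
cleaned = json.dumps(rows, indent=2)
```

Tools such as OpenRefine perform the same kind of reshaping and type repair interactively and at scale.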

Step 2-Define Structure and Create Cubes

The objective of the second step is to define the structure of the data cube using the Resource Description Framework (RDF) Data Cube vocabulary. For this, custom code lists or standard taxonomies created by external, supranational or international organisations, such as the W3C Data Cube vocabulary (https://www.w3.org/TR/vocab-data-cube/), can be used [13]. After this, the data in RDF format is validated. Tools used for this step are Cube Builder (https://github.com/OpenGovIntelligence/data-cube-builder) and Grafter (http://grafter.org/). This step is necessary for enabling ontology and concept scheme management.
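A cube structure in the QB vocabulary pairs a data structure definition with observations. The sketch below emits a minimal example as Turtle text; the `qb:` terms come from the W3C Data Cube vocabulary, while the `ex:` namespace and the dimension and measure names are assumptions made up for illustration.

```python
# Prefixes: qb: is the W3C Data Cube vocabulary namespace;
# ex: is an illustrative example namespace.
PREFIXES = """@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/ns#> .
"""

# The data structure definition declares the cube's dimensions
# and measures (here: area, period, and an unemployment rate).
STRUCTURE = """ex:unemploymentDSD a qb:DataStructureDefinition ;
    qb:component [ qb:dimension ex:refArea ] ,
                 [ qb:dimension ex:refPeriod ] ,
                 [ qb:measure   ex:unemploymentRate ] .
"""

def observation(obs_id, area, period, rate):
    """Render one qb:Observation (a single cube cell) as Turtle."""
    return (f"ex:{obs_id} a qb:Observation ;\n"
            f"    qb:dataSet ex:unemploymentCube ;\n"
            f"    ex:refArea {area} ;\n"
            f"    ex:refPeriod {period} ;\n"
            f"    ex:unemploymentRate {rate} .\n")

turtle = PREFIXES + STRUCTURE + observation("obs1", "ex:North", '"2016"', 7.2)
```

Tools such as Cube Builder and Grafter generate equivalent RDF from tabular input instead of hand-written templates.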

Step 3-Annotate Cubes

The third step creates metadata about the datasets. Metadata explains the meaning of the datasets and enables data provenance and an understanding of data production processes and cube structures. In this way data can be reused by others, and the effort and cost for publishers to integrate with other data sources are reduced. Annotation can be based on a standard thesaurus of statistical concepts, can validate the metadata, and can include the creation of links with compatible (external and internal) data cubes. As an example, the W3C created the Vocabulary of Interlinked Datasets (VoID), aiming to be the connection between publishers and users of RDF datasets [17]. In practice, OntoGov (Ontology-Enabled Electronic Government service configuration) defined a vocabulary with well-defined terms that enabled automated discovery, composition, negotiation and reconfiguration of services between departments and governments [18]. The latter facilitates analysis and even automatic combination with other datasets.
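A VoID annotation for a published cube might look like the following Turtle fragment, here held as a Python string. The `void:` and `dcterms:` terms are standard vocabularies; the dataset URI, title, publisher and endpoint are illustrative assumptions.

```python
# Illustrative VoID description of a cube dataset: who published it,
# what it is called, where to query it, and which vocabulary it uses.
void_metadata = """@prefix void:    <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/ns#> .

ex:unemploymentCube a void:Dataset ;
    dcterms:title "Regional unemployment cube" ;
    dcterms:publisher ex:statisticsOffice ;
    void:sparqlEndpoint <http://example.org/sparql> ;
    void:vocabulary <http://purl.org/linked-data/cube#> .
"""
```

Such a description lets harvesters and analysts discover the cube, attribute it, and locate its query endpoint without inspecting the data itself.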

Step 4-Publish Cube

The fourth step finishes the Data Cube Creation Process by publishing data cubes in data catalogues. This step can also use a Linked Data API (Application Programming Interface) or a SPARQL endpoint (SPARQL being the query language for RDF). Example tools for this step are the Cube API (https://github.com/OpenGovIntelligence/json-qb-api-implementation) and the aggregator (http://opencube-toolkit.eu/opencube-aggregator/).
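Once published behind a SPARQL endpoint, a cube's observations can be retrieved with a standard protocol request. The sketch below only constructs the request URL and does not send it; the endpoint address and dataset URI are illustrative assumptions.

```python
import urllib.parse

# Hypothetical SPARQL endpoint of a published cube (illustrative).
ENDPOINT = "http://example.org/sparql"

# List a few observations of one cube dataset.
query = """PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?obs WHERE {
  ?obs a qb:Observation ;
       qb:dataSet <http://example.org/ns#unemploymentCube> .
} LIMIT 10
"""

# Per the SPARQL Protocol, a GET request carries the query text
# in the 'query' parameter, URL-encoded.
request_url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
```

A Linked Data API such as the Cube API wraps the same access pattern in simpler JSON-oriented calls for client applications.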

B-Data Cube Analysis Processes

Step 5-Discover and Explore Cube

Based on the metadata, analysts can start to discover the cubes by browsing and pivoting the datasets. This step enables the expansion of cubes, which means combining them with other data resources. Standardised semantic annotation helps users to find data of interest faster and more easily.

Step 6-Transform Cube

The sixth step expands cubes and also allows analysts to create slices or dices, using pre-computed summarisations and other statistical functionality. This can also help users to understand the content and structure of datasets faster and more easily. The tool used in this step is the aggregator.
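Slicing fixes one dimension of the cube; dicing restricts several dimensions to subsets of their values. A minimal in-memory sketch over the tuple-keyed cube representation (with illustrative counts):

```python
# Cube cells keyed by (country, medal, year); counts are illustrative.
cube = {
    ("USA",   "gold",   2008): 36,
    ("USA",   "gold",   2012): 46,
    ("USA",   "silver", 2012): 29,
    ("China", "gold",   2012): 38,
}

def slice_cube(cube, year):
    """Slice: fix the year dimension, keeping the remaining two."""
    return {(c, m): v for (c, m, y), v in cube.items() if y == year}

def dice_cube(cube, countries, medals):
    """Dice: restrict two dimensions to subsets of their values."""
    return {k: v for k, v in cube.items()
            if k[0] in countries and k[1] in medals}

slice_2012 = slice_cube(cube, 2012)
usa_gold = dice_cube(cube, {"USA"}, {"gold"})
```

Platform tools such as the aggregator perform these operations server-side, with pre-computed summarisations, rather than in client memory.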

Step 7-Analyse Cube

This step enables statistical analysis of the created cubes using comprehensive Online Analytical Processing (OLAP) operations. The tools Cube Browser (https://github.com/OpenGovIntelligence/qb-olap-browser) and Cube Explorer (https://github.com/OpenGovIntelligence/data-cube-explorer) allow analysts to create and evaluate learning and predictive models or to estimate dependencies between measures. Further, it is possible to publish the descriptions of the resulting models on the Web of Linked Data, which enables the connection of data cubes with each other.
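The core OLAP operation behind such analysis is the roll-up: aggregating the measure over the dimensions that are not of interest. A minimal sketch over the tuple-keyed representation (illustrative counts):

```python
from collections import defaultdict

# Cube cells keyed by (country, medal, year); counts are illustrative.
cube = {
    ("USA",   "gold", 2008): 36,
    ("USA",   "gold", 2012): 46,
    ("China", "gold", 2008): 48,
    ("China", "gold", 2012): 38,
}

def roll_up(cube, keep):
    """Sum the measure over all dimensions not listed in `keep`
    (dimension positions: 0=country, 1=medal, 2=year)."""
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in keep)] += value
    return dict(totals)

# Roll up over medal and year: total gold medals per country.
gold_per_country = roll_up(cube, keep=(0,))
```

Tools such as Cube Browser and Cube Explorer expose this family of operations interactively and feed the aggregates into predictive models.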

Step 8-Communicate results

This final step concludes the data cube analysis processes, after which the cycle can start over again. The main objective of this step is to create visualisations and reports which can be used in policy-making efforts. As an example, analysts can create charts (bar chart, pie chart, sorted pie chart, area chart) and maps (heat maps) based on the LOSD and data cubes. The tool used for this step is the Cube Visualizer (https://github.com/OpenGovIntelligence/CubeVisualizer), a web application that creates and presents to the user graphical representations of an RDF data cube's one-dimensional slices. It enables non-technical users to reuse data more efficiently, in new and innovative ways, without a high level of technical skills.
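As a stand-in for the richer charts produced by tools such as the Cube Visualizer, a one-dimensional slice can be rendered as a simple textual bar chart. The data below is illustrative.

```python
# A rolled-up one-dimensional slice (illustrative counts).
gold_per_country = {"USA": 82, "China": 86, "GB": 55}

def bar_chart(data, width=20):
    """Render a mapping label -> value as a textual bar chart,
    scaling the longest bar to `width` characters."""
    top = max(data.values())
    lines = []
    for label, value in sorted(data.items()):
        bar = "#" * round(width * value / top)
        lines.append(f"{label:<6} {bar} {value}")
    return "\n".join(lines)

chart = bar_chart(gold_per_country)
```

The Cube Visualizer produces the equivalent graphical charts directly from an RDF cube's one-dimensional slices, without requiring such hand-written code.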

4 Open Cubes in Practice: Case Studies

Six cases were selected to evaluate the implementation of statistical data cubes. The first three cases were developed by students at Delft University of Technology (https://goo.gl/y5HgJq). The other three cases were developed as part of the OpenGovIntelligence project (www.opengovintelligence.eu). The six applications are:

  1. The "world most suitable country to live" (http://kossa.superhost.pl/sen1611/app/);
  2. The "Gender Inequality in Europe" (http://raditya.me/genderinequality/paymentgap/mapview/);
  3. The "Best places for the automotive industry to install plants in Europe";
  4. The "Environmental monitoring centre" of the Flemish Government (Belgium);
  5. The "Irish system of maritime tourism, search and rescue" from Galway (Ireland);
  6. The "Real Estate Market Analysis Dashboard" from the Estonian Ministry of Economy (Estonia).

All cases took similar development approaches but have different objectives and audiences. Using the 23 requirements, a questionnaire was designed to evaluate the benefits and identify the challenges of the data cube platform. The questionnaire was filled in by 40 students and 6 technical experts of the OGI project. The benefits and challenges of the platforms are summarised in Table 2.

Table 2. Open statistical data cube platform benefits and challenges

5 Discussions and Conclusions

More and more statistical data is being disclosed by organisations, which enables people from around the world to use these data. Yet data cube platforms are not a mature technology. This paper proposed a model for evaluating open statistical data cubes using a list of 23 requirements derived from the ISO 25010:2010 standard for Systems and Software Quality Requirements and Evaluation. Based on this list of 23 requirements, a questionnaire was developed and used to evaluate six cases which make use of the same platform for processing LOSD using open data cubes. The questionnaire was filled in by 40 students and 6 technical experts, and from the responses the benefits and challenges of using open statistical data cubes were determined. The identified benefits include ease of use, the easy creation of open cubes when data is available in linked data format, and the flexibility of the open cube platform to integrate with other software in order to use functionality it provides. The identified challenges include the absence of a single platform covering all steps, a lack of proper documentation, the absence of guidelines for open data cube creation (which blocks capacity building and the learning of skills), fragmentation of tools, the need for much manual work, and issues with installing and running the software needed for OpenCube. The results show that open cubes can be used, but that a lot of manual effort is still necessary and a variety of tools are needed that are not built to interoperate with each other. We recommend further integration of the building blocks in the platforms to reduce the barriers to the use of LOSD by the public.