Contextualizing data warehouses with documents

https://doi.org/10.1016/j.dss.2006.12.005

Abstract

Current data warehouse and OLAP technologies are applied to analyze the structured data that companies store in databases. The context that helps to understand these data over time is usually described separately in text-rich documents. This paper proposes to integrate the traditional corporate data warehouse with a document warehouse, resulting in a contextualized warehouse. In such a warehouse, the user first selects an analysis context by supplying some keywords. Then, the analysis is performed on a novel type of OLAP cube, called an R-cube, which is materialized by retrieving and ranking the documents and corporate facts related to the selected context.

Introduction

Current data warehouse and OLAP technologies can be efficiently applied to analyze the huge amounts of structured data that companies produce. These organizations also produce many documents and use the Web as their largest source of external information. Examples of internal and external sources of information include purchase-trend and market-research reports, demographic and credit reports, popular business journals, industry newsletters, and technology reports. Although these documents cannot be analyzed by current OLAP technologies, mainly because they are unstructured and contain a large amount of text, they hold highly valuable information that can help companies analyze their data. Because XML has become the standard for data exchange over the Internet [25], it is now easy to find some of these documents in XML formats. Furthermore, existing XML tagging techniques [22] can be applied to give some structure to plain documents by identifying the different document sections, and tools for exporting from most proprietary systems to XML-like formats are now available.

In this paper we present an architecture that integrates a corporate warehouse of structured data with a warehouse of text-rich XML documents. The resulting contextualized warehouse is a new type of decision support system that allows users to obtain strategic information by analyzing data under different contexts. For example, if we have a document warehouse with financial news articles, we can analyze the evolution of the sales measures of our corporate warehouse in the context of a period of crisis described by the relevant news. Thus, it is easier to find out which products were affected by the crisis.

Here, we define a context as a set of textual fragments that can provide analysts with strategic information important for decision-making tasks. Since the document warehouse may contain documents about many different topics, we apply modern Information Retrieval (IR) [2] techniques to select the context of analysis from the document warehouse. In order to build a contextualized OLAP cube, the analyst specifies the context under analysis by supplying a sequence of keywords. Each fact in the resulting cube has a numerical value representing its relevance with respect to the specified context, hence the name R-cube (Relevance cube). Moreover, each fact in the R-cube is linked to the set of relevant documents that describe its context. In this paper we extend an existing multidimensional data model to represent these two new dimensions (relevance and context), and we study how the traditional OLAP operations are affected by them.
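The idea of attaching a relevance value and a set of context documents to each fact can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `score_document` and `build_r_cube`, the toy data, the keyword-frequency score (a stand-in for the IR model, which this excerpt does not specify), and the rule that a fact's relevance is the normalized sum of the scores of the documents that describe it.

```python
from collections import defaultdict

def score_document(text, keywords):
    """Toy score (assumption): fraction of the words in the text that are keywords."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in keywords) / len(words)

def build_r_cube(facts, documents, keywords):
    """Attach a relevance value "R" and a context "Ctxt" to each fact.

    facts:     dict fact_id -> dict of dimension values and measures
    documents: dict doc_id -> (text, set of fact_ids the document describes)
    Facts not described by any matching document are left out of the result.
    """
    kw = {k.lower() for k in keywords}
    raw = defaultdict(float)   # unnormalized relevance per fact
    ctxt = defaultdict(set)    # matching documents per fact
    for doc_id, (text, fact_ids) in documents.items():
        score = score_document(text, kw)
        if score == 0.0:
            continue
        for fid in fact_ids:
            if fid in facts:
                raw[fid] += score
                ctxt[fid].add(doc_id)
    total = sum(raw.values())
    # Normalize so that the relevance values of the cube sum to 1 (assumption).
    return {fid: {**facts[fid], "R": raw[fid] / total, "Ctxt": ctxt[fid]}
            for fid in raw}

# Hypothetical sample data.
facts = {
    "f1": {"product": "A", "month": "1998-10", "sales": 100},
    "f2": {"product": "B", "month": "1998-10", "sales": 250},
    "f3": {"product": "C", "month": "1998-11", "sales": 80},
}
documents = {
    "d1": ("financial crisis hits exports", {"f1"}),
    "d2": ("crisis deepens", {"f1", "f2"}),
    "d3": ("new product launch", {"f3"}),
}
cube = build_r_cube(facts, documents, ["crisis"])
for fid, fact in sorted(cube.items()):
    print(fid, round(fact["R"], 2), sorted(fact["Ctxt"]))
```

In this toy run, fact `f3` does not appear in the cube because no document matching the keyword describes it.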

The relevance and context dimensions provide information about facts that can be very useful for analysis tasks. The relevance dimension can be used to explore the most relevant portions of an R-cube. For example, it can be used to identify the period of a political crisis, or the regions undergoing economic development. The usefulness of the context dimension is twofold. First, it can be used to restrict the analysis to the facts described in a given subset of documents (e.g., the most relevant documents). Second, the user will be able to gain insight into the circumstances of a fact by retrieving its related documents.
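The three uses described above can be sketched over a toy R-cube. The representation (a list of dictionaries with `R` and `Ctxt` keys), the function names, and the sample values are hypothetical. As a simplification, the restriction below leaves the relevance values unchanged, whereas the algebra of the paper defines how relevance is updated, which this excerpt does not show.

```python
# A toy R-cube (assumption): each fact carries a relevance "R" and a context
# "Ctxt" holding the ids of the documents that describe it.
r_cube = [
    {"region": "Asia", "month": "1998-08", "index": 7200.0, "R": 0.5, "Ctxt": {"d1", "d2"}},
    {"region": "Asia", "month": "1998-09", "index": 7100.0, "R": 0.3, "Ctxt": {"d2"}},
    {"region": "Europe", "month": "1998-08", "index": 5400.0, "R": 0.2, "Ctxt": {"d3"}},
]

def most_relevant(cube, threshold):
    """Relevance dimension: keep the facts whose relevance reaches the threshold."""
    return [f for f in cube if f["R"] >= threshold]

def restrict_to_documents(cube, doc_ids):
    """Context dimension, first use: keep the facts described by at least one of
    the given documents and narrow their context to those documents.
    Relevance is copied unchanged here (simplification)."""
    result = []
    for f in cube:
        shared = f["Ctxt"] & doc_ids
        if shared:
            result.append({**f, "Ctxt": shared})
    return result

def documents_for(cube, **dims):
    """Context dimension, second use: collect the documents related to the
    facts that match the given dimension values."""
    docs = set()
    for f in cube:
        if all(f.get(k) == v for k, v in dims.items()):
            docs |= f["Ctxt"]
    return docs

print([f["month"] for f in most_relevant(r_cube, 0.3)])
print(restrict_to_documents(r_cube, {"d2"}))
print(documents_for(r_cube, region="Europe"))
```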

The main contributions of this paper are: (1) an architecture that integrates a corporate warehouse with a document warehouse, resulting in a contextualized warehouse; (2) a formal definition of the multidimensional data model and the unary algebra operations to manage R-cubes; and (3) a prototypical system that shows the usefulness of the approach.

The rest of the paper is organized as follows: Section 2 discusses related work. In Section 3 we present the architecture of a contextualized warehouse and how the analysis cubes (R-cubes) are built. The multidimensional data model for R-cubes is presented in Section 4. In Section 5 we propose an algebra for R-cubes. A prototypical contextualized warehouse is shown in Section 6. Finally, Section 7 addresses conclusions and future work.

Section snippets

Related work

In [9] the importance of external contextual information to understand the results of historical analysis operations was emphasized: “External contextual information is information outside the corporation that nevertheless plays an important role in understanding information over time.” Since contextual information is usually available as documents (e.g., on-line news, company reports, etc.), which cannot be managed by relational systems, few approaches regarding contextual information in a

Contextualized warehouses

A contextualized warehouse is a decision support system that allows users to combine all their sources of structured and unstructured data, and to analyze the integrated data under different contexts. Fig. 1 shows the proposed architecture for the contextualized warehouse. Its main components are a corporate warehouse, a document warehouse and the fact extractor module. The corporate warehouse is a traditional data warehouse that integrates the company's structured data sources [9], [10]. The

A multidimensional data model for R-cubes

In this section we define a formal data model for the R-cubes. We extend an existing multidimensional model [16] with two new special dimensions to represent both the relevance of the facts and their context. For each component of the extended data model, we show its definition and give some examples.
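As a rough illustration of what adding two special dimensions to a cube could look like in code, the following sketch models a fact with ordinary dimension values plus a relevance and a context. It is not the formal model of the paper: the class names `Fact` and `RCube`, the field names, and the assumption that relevance lies in [0, 1] are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List

@dataclass
class Fact:
    fact_id: str
    dimensions: Dict[str, str]   # ordinary dimensions, e.g. {"Product": "A"}
    relevance: float             # value in the special dimension R
    context: FrozenSet[str]      # value in the special dimension Ctxt (document ids)

@dataclass
class RCube:
    dimension_names: List[str]
    facts: List[Fact] = field(default_factory=list)

    def all_dimensions(self) -> List[str]:
        """Ordinary dimensions followed by the two special ones."""
        return self.dimension_names + ["R", "Ctxt"]

    def add(self, fact: Fact) -> None:
        """Add a fact after checking it is characterized in every dimension."""
        missing = set(self.dimension_names) - set(fact.dimensions)
        if missing:
            raise ValueError(f"fact {fact.fact_id} lacks dimensions: {sorted(missing)}")
        # Assumption: relevance is a value between 0 and 1.
        if not 0.0 <= fact.relevance <= 1.0:
            raise ValueError("relevance must lie in [0, 1]")
        self.facts.append(fact)

# Hypothetical usage.
r_cube = RCube(dimension_names=["Product", "Time"])
r_cube.add(Fact("f1", {"Product": "A", "Time": "1998-10"}, 0.6, frozenset({"d1", "d2"})))
print(r_cube.all_dimensions())
```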

The R-cubes algebra

In this section we present an algebra for the R-cubes by extending the definition of the unary operators presented in [16] to regard the relevance and context of the facts. For each operator, we show its definition, and discuss how the relevance and context are updated in the result by giving some examples.

Throughout the definitions we will assume an R-cube RM = (F, D, FD, Q), where D = {D_i, i = 1, …, n} ∪ {R, Ctxt}, FD = {FD_i, i = 1, …, n} ∪ {FR, FCtxt} and whose quality is Quality. The set of documents relevant for the

The prototype

In order to validate the usefulness of our approach, we have developed a prototype. The resulting contextualized warehouse allows users to analyze stock market indexes with the advantage of having each measure value associated with a news extract that explains it. This section gives an overview of the main aspects involved in the design of the system. First, we describe the document and corporate warehouses of the prototype. Afterwards, we show the usefulness of the prototype by means of an

Conclusions and future work

A contextualized warehouse is a new decision support system that allows users to combine all their sources of structured data and unstructured documents, and to obtain strategic information by analyzing the integrated data under different contexts. In a contextualized warehouse, the user specifies an analysis context by supplying some keywords. Then, the analysis is performed on an R-cube which is materialized by retrieving the documents and facts related to the selected context.

R-cubes are

Acknowledgements

This project has been partially supported by the Danish Research Council for Technology and Production under grant no. 26-02-0277, the Spanish National Research Project TIN2005-09098-C05-04, and the Fundación Bancaixa Castelló.

Juan Manuel Pérez-Martínez obtained a B.S. degree from the Universitat Jaume I (Spain) in 2000, where he is registered for a Ph.D. Currently he is associate lecturer at the same university. He is author of a number of communications in international conferences such as DEXA, ECIR, etc. His research interests are information retrieval, multidimensional databases, and web-based technologies. Contact him at [email protected].

References (26)

  • A. Badia, Text warehousing: present and future
  • R.A. Baeza-Yates et al., Modern Information Retrieval (1999)
  • K. Beyer et al., Extending XQuery for analytics
  • S. Bhowmick et al., Web warehousing: design and issues
  • T.T. Chinenyanga et al., An expressive and efficient language for XML information retrieval
  • W.B. Croft, Combining approaches to information retrieval
  • R. Danger et al., A proposal for the automatic generation of instances from unstructured text
  • N. Fuhr et al., XIRQL: a query language for information retrieval in XML documents
  • W.H. Inmon, Building the Data Warehouse (1996)
  • R. Kimball, The Data Warehouse Toolkit (2002)
  • V. Lavrenko et al., Relevance-based language models
  • D.M. Llidó et al., Extracting temporal references to assign document event-time periods
  • B.R. Moole, A probabilistic multidimensional data model and algebra for OLAP in decision support systems

    Rafael Berlanga-Llavori has been an associate professor of Computer Science at Universitat Jaume I, Spain, for 12 years. He received the B.S. degree in Physics from Universidad de Valencia, and the Ph.D. degree in Computer Science in 1996 from the same university. He is author of several articles in international journals, such as Information Processing and Management, Concurrency: Practice and Experience, and Applied Intelligence, among others, and numerous communications in international conferences such as DEXA, ECIR, CIARP, etc. His current research interests are knowledge bases, information retrieval, and temporal reasoning. Contact him at [email protected].

    María José Aramburu-Cabo is an associate professor of Computer Science at Universitat Jaume I, Spain. She obtained the B.S. degree in Computer Science from Universidad Politécnica de Valencia in 1991, and a Ph.D. from the School of Computer Science of the University of Birmingham (UK) in 1998. She is author of several articles in international journals, such as Information Processing and Management, Concurrency: Practice and Experience, and Applied Intelligence, and numerous communications in international conferences such as DEXA, ECIR, etc. Her main research interests include document databases and their applications. Contact her at [email protected].

    Torben Bach Pedersen is an associate professor of Computer Science at Aalborg University, Denmark. His research interests include multidimensional databases, OLAP, data warehousing, federated databases, data streams, and location-based services. He has published more than 60 scientific papers on these topics in journals and conferences such as The VLDB Journal, Information Systems, IEEE Computer, VLDB, ICDE, SSDBM, SSTD, IDEAS, ACM-GIS, ECIR, Hypertext, DOLAP, and DaWaK. He is a member of the Editorial Board of the International Journal on Data Warehousing and Mining, and has served on more than 30 program committees including VLDB, ICDE, EDBT, SSDBM, and DaWaK. Before joining Aalborg University, he worked in the software industry for more than six years. He received the Ph.D. and M.S. degrees in Computer Science from Aalborg University and Aarhus University, respectively. He is a member of the IEEE, the IEEE Computer Society, and the ACM. Contact him at [email protected].

    1

    Fax: +34 964 728435.

    2

    Fax: +45 9815 9889.