Contextualizing data warehouses with documents
Introduction
Current data warehouse and OLAP technologies can be efficiently applied to analyze the huge amounts of structured data that companies produce. These organizations also produce many documents and use the Web as their largest source of external information. Examples of internal and external sources of information include purchase-trend and market-research reports, demographic and credit reports, popular business journals, industry newsletters, and technology reports. Although these documents cannot be analyzed by current OLAP technologies, mainly because they are unstructured and contain a large amount of text, they hold highly valuable information that can help companies analyze their data. Because XML has become the standard for data exchange over the Internet [25], it is now easy to find some of these documents in XML formats. Furthermore, existing XML tagging techniques [22] can be applied to give some structure to plain documents by identifying the different document sections, and tools for exporting from most proprietary systems to XML-like formats are now available.
In this paper we present an architecture that integrates a corporate warehouse of structured data with a warehouse of text-rich XML documents. The resulting contextualized warehouse is a new type of decision support system that allows users to obtain strategic information by analyzing data under different contexts. For example, if we have a document warehouse with financial news articles, we can analyze the evolution of the sales measures of our corporate warehouse in the context of a period of crisis described by the relevant news. Thus, it is easier to find out which products were affected by the crisis.
Here, we define a context as a set of textual fragments that can provide analysts with strategic information important for decision-making tasks. Since the document warehouse may contain documents about many different topics, we apply modern Information Retrieval (IR) [2] techniques to select the context of analysis from the document warehouse. To build a contextualized OLAP cube, the analyst specifies the context under analysis by supplying a sequence of keywords. Each fact in the resulting cube has a numerical value representing its relevance with respect to the specified context, hence the name R-cube (Relevance cube). Moreover, each fact in the R-cube is linked to the set of relevant documents that describe its context. In this paper we extend an existing multidimensional data model to represent these two new dimensions (relevance and context), and we study how the traditional OLAP operations are affected by them.
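The construction of an R-cube from a keyword query can be illustrated with a minimal sketch. It is not the method used in the paper: the toy term-frequency score and the normalized-sum aggregation are assumptions standing in for the IR techniques, and the names `Fact`, `score_document` and `build_r_cube`, as well as the sample documents and facts, are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Fact:
    """A fact from the corporate warehouse (hypothetical shape)."""
    dims: dict                # e.g. {"product": "A", "month": "1997-10"}
    measure: float            # e.g. a sales amount
    doc_ids: set              # documents that describe this fact
    relevance: float = 0.0    # relevance dimension, filled in by build_r_cube
    context: dict = field(default_factory=dict)  # context dimension: doc id -> score


def score_document(text: str, keywords: list[str]) -> float:
    """Toy relevance score (assumption): share of tokens matching a keyword."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[k.lower()] for k in keywords) / len(tokens)


def build_r_cube(facts, documents, keywords):
    """Keep the facts described by relevant documents and annotate them."""
    doc_scores = {d: score_document(t, keywords) for d, t in documents.items()}
    total = sum(doc_scores.values())
    cube = []
    for f in facts:
        # Context of the fact: its documents with a non-zero score.
        ctx = {d: doc_scores[d] for d in f.doc_ids if doc_scores.get(d, 0.0) > 0}
        if not ctx:
            continue  # no relevant document describes this fact
        # Assumed aggregation: share of the total document score.
        relevance = sum(ctx.values()) / total
        cube.append(Fact(f.dims, f.measure, set(f.doc_ids), relevance, ctx))
    return cube


documents = {
    "d1": "financial crisis hits asian markets",
    "d2": "new product launch announced",
    "d3": "crisis deepens as exports fall",
}
facts = [
    Fact({"product": "A", "month": "1997-10"}, 120.0, {"d1"}),
    Fact({"product": "B", "month": "1997-10"}, 80.0, {"d2"}),
    Fact({"product": "C", "month": "1997-11"}, 95.0, {"d1", "d3"}),
]
cube = build_r_cube(facts, documents, ["crisis"])
for f in cube:
    print(f.dims["product"], round(f.relevance, 2), sorted(f.context))
```

In this sketch the fact for product B is left out of the cube because none of its documents match the keyword, while the remaining facts carry both a relevance value and the documents that form their context.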
The relevance and context dimensions provide information about facts that can be very useful for analysis tasks. The relevance dimension can be used to explore the most relevant portions of an R-cube. For example, it can be used to identify the period of a political crisis, or the regions undergoing economic development. The usefulness of the context dimension is twofold. First, it can be used to restrict the analysis to the facts described in a given subset of documents (e.g., the most relevant documents). Second, the user will be able to gain insight into the circumstances of a fact by retrieving its related documents.
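The three uses just described can be sketched over a small, hand-made R-cube. The representation (a list of dictionaries), the function names, and all sample values are hypothetical and only serve as an illustration; the paper's formal operators are defined in its later sections.

```python
# Hypothetical R-cube: every fact carries a relevance value and a context,
# here the set of ids of the documents that describe the fact.
r_cube = [
    {"region": "Asia", "month": "1997-10", "sales": 120.0,
     "relevance": 0.45, "context": {"d1", "d3"}},
    {"region": "Asia", "month": "1997-11", "sales": 95.0,
     "relevance": 0.30, "context": {"d3"}},
    {"region": "Europe", "month": "1997-10", "sales": 210.0,
     "relevance": 0.05, "context": {"d7"}},
]

# Hypothetical document warehouse: document id -> text fragment.
warehouse = {
    "d1": "Currency crisis spreads across the region.",
    "d3": "Exports fall as the crisis deepens.",
    "d7": "Retail sales remain stable.",
}


def most_relevant(cube, threshold):
    """Relevance dimension: keep the facts at or above a relevance threshold."""
    return [f for f in cube if f["relevance"] >= threshold]


def restrict_to_documents(cube, doc_ids):
    """Context dimension, first use: keep the facts described by doc_ids."""
    return [f for f in cube if f["context"] & doc_ids]


def documents_of(fact, docs):
    """Context dimension, second use: retrieve the documents of one fact."""
    return [docs[d] for d in sorted(fact["context"]) if d in docs]


crisis_months = [f["month"] for f in most_relevant(r_cube, 0.25)]
described_by_d1 = restrict_to_documents(r_cube, {"d1"})
explanation = documents_of(r_cube[0], warehouse)
```

Here `crisis_months` would point the analyst to the period that the documents associate with the queried context, and `explanation` gives the text fragments behind a single fact.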
The main contributions of this paper are: (1) an architecture that integrates a corporate warehouse with a document warehouse, resulting in a contextualized warehouse; (2) a formal definition of the multidimensional data model and the unary algebra operations to manage R-cubes; and (3) a prototypical system that shows the usefulness of the approach.
The rest of the paper is organized as follows: Section 2 discusses related work. In Section 3 we present the architecture of a contextualized warehouse and how the analysis cubes (R-cubes) are built. The multidimensional data model for R-cubes is presented in Section 4. In Section 5 we propose an algebra for R-cubes. A prototypical contextualized warehouse is shown in Section 6. Finally, Section 7 addresses conclusions and future work.
Section snippets
Related work
In [9] the importance of external contextual information to understand the results of historical analysis operations was emphasized: “External contextual information is information outside the corporation that nevertheless plays an important role in understanding information over time.” Since contextual information is usually available as documents (e.g., on-line news, company reports, etc.), which cannot be managed by relational systems, few approaches regarding contextual information in a
Contextualized warehouses
A contextualized warehouse is a decision support system that allows users to combine all their sources of structured and unstructured data, and to analyze the integrated data under different contexts. Fig. 1 shows the proposed architecture for the contextualized warehouse. Its main components are a corporate warehouse, a document warehouse and the fact extractor module. The corporate warehouse is a traditional data warehouse that integrates the company's structured data sources [9], [10]. The
A multidimensional data model for R-cubes
In this section we define a formal data model for the R-cubes. We extend an existing multidimensional model [16] with two new special dimensions to represent both the relevance of the facts and their context. For each component of the extended data model, we show its definition and give some examples.
The R-cubes algebra
In this section we present an algebra for the R-cubes by extending the definition of the unary operators presented in [16] to take into account the relevance and context of the facts. For each operator, we show its definition and, by means of examples, discuss how the relevance and context are updated in the result.
Throughout the definitions we will assume an R-cube RM = (F, D, FD, Q), where D = {Di, i = 1, …, n} ∪ {R, Ctxt}, FD = {FDi, i = 1, …, n} ∪ {FR, FCtxt}, and whose quality is Quality. The set of documents relevant for the
The prototype
To validate the usefulness of our approach, we have developed a prototype. The resulting contextualized warehouse allows users to analyze stock market indexes with the advantage of having each measure value associated with a news extract that explains it. This section gives an overview of the main aspects involved in the design of the system. First, we describe the document and corporate warehouses of the prototype. Afterwards, we show the usefulness of the prototype by means of an
Conclusions and future work
A contextualized warehouse is a new decision support system that allows users to combine all their sources of structured data and unstructured documents, and to obtain strategic information by analyzing the integrated data under different contexts. In a contextualized warehouse, the user specifies an analysis context by supplying some keywords. Then, the analysis is performed on an R-cube which is materialized by retrieving the documents and facts related to the selected context.
R-cubes are
Acknowledgements
This project has been partially supported by the Danish Research Council for Technology and Production under grant no. 26-02-0277, the Spanish National Research Project TIN2005-09098-C05-04, and the Fundación Bancaixa Castelló.
Juan Manuel Pérez-Martínez obtained a B.S. degree from the Universitat Jaume I (Spain) in 2000, where he is registered for a Ph.D. He is currently an associate lecturer at the same university. He is the author of a number of communications in international conferences such as DEXA and ECIR. His research interests are information retrieval, multidimensional databases, and web-based technologies.
References (26)
Text warehousing: present and future
Modern Information Retrieval (1999)
Extending XQuery for analytics
Web warehousing: design and issues
An expressive and efficient language for XML information retrieval
Combining approaches to information retrieval
A proposal for the automatic generation of instances from unstructured text
XIRQL: a query language for information retrieval in XML documents
Building the Data Warehouse (1996)
The Data Warehouse Toolkit (2002)
Relevance-based language models
Extracting temporal references to assign document event-time periods
A probabilistic multidimensional data model and algebra for OLAP in decision support systems
Rafael Berlanga-Llavori has been an associate professor of Computer Science at Universitat Jaume I, Spain, for 12 years. He received the B.S. degree in Physics from Universidad de Valencia, and the Ph.D. degree in Computer Science in 1996 from the same university. He is the author of several articles in international journals, such as Information Processing and Management, Concurrency: Practice and Experience, and Applied Intelligence, and of numerous communications in international conferences such as DEXA, ECIR, and CIARP. His current research interests are knowledge bases, information retrieval, and temporal reasoning.
María José Aramburu-Cabo is an associate professor of Computer Science at Universitat Jaume I, Spain. She obtained the B.S. degree in Computer Science from Universidad Politécnica de Valencia in 1991, and a Ph.D. from the School of Computer Science of the University of Birmingham (UK) in 1998. She is the author of several articles in international journals, such as Information Processing and Management, Concurrency: Practice and Experience, and Applied Intelligence, and of numerous communications in international conferences such as DEXA and ECIR. Her main research interests include document databases and their applications.
Torben Bach Pedersen is an associate professor of Computer Science at Aalborg University, Denmark. His research interests include multidimensional databases, OLAP, data warehousing, federated databases, data streams, and location-based services. He has published more than 60 scientific papers on these issues in journals and conferences such as The VLDB Journal, Information Systems, IEEE Computer, VLDB, ICDE, SSDBM, SSTD, IDEAS, ACM-GIS, ECIR, Hypertext, DOLAP, and DaWaK. He is a member of the Editorial Board of the International Journal on Data Warehousing and Mining, and has served on more than 30 program committees including VLDB, ICDE, EDBT, SSDBM, and DaWaK. Before joining Aalborg University, he worked in the software industry for more than six years. He received the Ph.D. and M.S. degrees in Computer Science from Aalborg University and Aarhus University, respectively. He is a member of the IEEE, the IEEE Computer Society, and the ACM.