Skip to main content
Log in

Executable schema mappings for statistical data processing

  • Published:
Distributed and Parallel Databases Aims and scope Submit manuscript

Abstract

Data processing is the core of any statistical information system. Statisticians are interested in specifying transformations and manipulations of data at a high level, in terms of entities of statistical models. We illustrate here a proposal where a high-level language, EXL, is used for the declarative specification of statistical programs, and a translation into executable form in various target systems is available. The language is based on the theory of schema mappings, in particular those defined by a specific class of tgds, which we actually use to optimize user programs and facilitate the translation towards several target systems. The characteristics of such class guarantee good tractability properties and the applicability in Big Data settings. A concrete implementation, EXLEngine, has been carried out and is currently used at the Bank of Italy.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

Notes

  1. The seasonal decomposition is an operator that decomposes a time series into various components, one of which is the trend, which, roughly speaking considers medium- or long-term “variations”, ignoring seasonal, cyclic (and stochastic) ones [12, 33].

  2. Note that for the first quarter the PCHNG is not meaningful.

  3. As we will see in Sect. 5, we also have some egds, which enforce the functional nature of EXL relations.

  4. That is, repeated elements are meaningful.

  5. We could say “at most” one operator, but it is easy to assume that there are no statements that just copy a relation with no additional operations.

  6. The case with several relations is indeed possible and we will discuss it in Sect. 5.4.

  7. This total order is not strictly necessary, the only thing that is needed is that the rules that involve these general operators are applied only after their operands have been fully computed.

  8. As in the rest of the paper, we refer to tgds with one atom in the rhs.

  9. Indeed, the case where the two operators are multi-tuple and have different grouping dimensions requires a slight extension of the syntax, where the grouping dimensions would be specified as an argument of the operator itself, so, for example \(R(x,y,z) \rightarrow Q(x, \hbox {max}(\hbox {avg}(z, \hbox {group by}~x, y)))\), calculates the maximum, grouped by \(x\), of the averages of z, grouped by \(x\) and \(y\).

  10. Notice that there is the residual, and indeed remote, possibility in which the repeated dimension tuple has the identity element as its measure for the aggregation under consideration or that, in general, the repeated tuples compensate the error. However this condition is value and aggregation dependent and should be considered as a case of “correctness by chance”.

  11. There is the residual possibility that two EXL statements share a subsexpression, resulting in two tgds sharing one ore more atoms of the lhs. Since we break down all the statements into elementary statements, we could end up having tgds with coinciding premises, which indeed we detect and simplify in the system.

  12. http://kettle.pentaho.com/.

  13. http://mahout.apache.org.

  14. http://spark.apache.org.

References

  1. Arenas, M., Fagin, R., Nash, A.: Composition with target constraints. Logical Methods Comput. Sci. 7(3) (2011)

  2. Arenas, M., Gottlob, G., Pieris, A.: Expressive languages for querying the semantic web. In: PODS, pp. 14–26 (2014)

  3. Atzeni, P., Cappellari, P., Torlone, R., Bernstein, P., Gianforme, G.: Model-independent schema translation. VLDB J. 17, 1347–1370 (2008)

    Article  Google Scholar 

  4. Atzeni, P., Bellomarini, L., Bugiotti, F., Gianforme, G.: MISM: a platform for model-independent solutions to model management problems. J. Data Semant. 14, 133–161 (2009)

    Article  Google Scholar 

  5. Atzeni, P., Bellomarini, L., Bugiotti, F., Celli, F., Gianforme, G.: A runtime approach to model-generic translation of schema and data. Inf. Syst. 37, 269–287 (2012)

    Article  Google Scholar 

  6. Atzeni, P., Bellomarini, L., Bugiotti, F.: Exlengine: executable schema mappings for statistical data processing. In: EDBT, pp. 672–682 (2013)

  7. Bellomarini, L., Gottlob, G., Pieris, A., Sallinger, E.: Swift logic for big data and knowledge graphs. In: IJCAI, pp. 2–10 (2017)

  8. Bernstein, P.A., Melnik, S.: Model management 2.0: manipulating richer mappings. In: SIGMOD Conference, pp. 1–12 (2007)

  9. Boehm, M., Tatikonda, S., Reinwald, B., Sen, P., Tian, Y., Burdick, D., Vaithyanathan, S.: Hybrid parallelization strategies for large-scale machine learning in systemml. PVLDB 7(7), 553–564 (2014)

    Google Scholar 

  10. Bonifati, A., Chang, E.Q., Ho, T., Lakshmanan, L.V.S., Pottinger, R.: Heptox: Marrying XML and heterogeneity in your P2P databases. In: VLDB, pp. 1267–1270 (2005)

  11. Brockwell, P.J., Davis, R.A. (eds.): Introduction to Time Series and Forecasting. Springer, New York (2002)

    MATH  Google Scholar 

  12. Calì, A., Gottlob, G., Lukasiewicz, T.: A general datalog-based framework for tractable query answering over ontologies. In: PODS, pp. 77–86 (2009)

  13. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Data complexity of query answering in description logics (extended abstract). In: IJCAI, pp. 4163–4167 (2015)

  14. Chaudhuri, S.: An overview of query optimization in relational systems. In: PODS, PODS ’98, pp. 34–43, New York, NY, USA, (1998). ACM

  15. Chaudhuri, S., Shim, K.: Including group-by in query optimization. In: VLDB, pp. 354–366. Morgan Kaufmann, Burlington (1994)

  16. Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J.M., Welton, C.: MAD skills: new analysis practices for big data. PVLDB 2(2), 1481–1492 (2009)

    Google Scholar 

  17. Das, S., Sismanis, Y., Beyer, K.S., Gemulla, R., Haas, P.J., McPherson, J.: Ricardo: integrating R and hadoop. In: SIGMOD, pp. 987–998 (2010)

  18. Del Vecchio, V.: Statistical data and concepts representation. Bank of Italy (1997). http://goo.gl/YIAqDp

  19. Del Vecchio, V., Di Giovanni, F., Pambianco, S.: The “matrix” model. Bank of Italy (2007). http://goo.gl/Dj2XT0

  20. Dessloch, S., Hernández, M., Wisnesky, R., Radwan, A., Zhou, J.: Orchid: integrating schema mapping and ETL. In: ICDE, pp. 1307–1316 (2008)

  21. Di Giovanni, F., Piazza, D.: Processing and managing statistical data: a national central bank experience. Bank of Italy (2009). http://goo.gl/ZNi5zh

  22. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering. In: ICDT, pp. 207–224 (2003)

  23. Fagin, R., Kolaitis, P.G., Popa, L.: Data exchange: getting to the core. ACM Trans. Database Syst. 30(1), 174–210 (2005)

    Article  MATH  Google Scholar 

  24. Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Composing schema mappings: second-order dependencies to the rescue. ACM Trans. Database Syst. 30(4), 994–1055 (2005)

    Article  Google Scholar 

  25. Fagin, R., Haas, L., Hernández, M., Miller, R., Popa, L., Velegrakis, Y.: Clio: schema mapping creation and data exchange. In: Conceptual Modeling: Foundations and Applications, pp. 198–236 (2009)

  26. Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Schema mapping evolution through composition and inversion. In: Schema Matching and Mapping, pp. 191–222 (2011)

  27. Gottlob, G., Pichler, R., Savenkov, V.: Normalization and optimization of schema mappings. PVLDB 2(1), 1102–1113 (2009)

    Google Scholar 

  28. Haas, L.M., Hernández, M.A., Ho, H., Popa, L., Roth, M.: Clio grows up: from research prototype to industrial tool. In: SIGMOD, pp. 805–810. ACM (2005)

  29. Kolaitis, P.: Schema mappings, data exchange, and metadata management. In: PODS, pp. 61–75 (2005)

  30. Kolaitis, P.G., Panttaja, J., Tan, W.C.: The complexity of data exchange. In: SIGMOD, pp. 30–39 (2006)

  31. Mahdi, E.: A survey of r software for parallel computing. Am. J. Appl. Math. Stat. 2(4), 224–230 (2014)

    Article  Google Scholar 

  32. Mecca, G., Papotti, P., Raunich, S.: Core schema mappings: scalable core computations in data exchange. Inf. Syst. 37(7), 677–711 (2012)

    Article  Google Scholar 

  33. Mumick, I.S., Pirahesh, H., Ramakrishnan, R.: The magic of duplicates and aggregates. In: VLDB, pp. 264–277 (1990)

  34. Ramsay, J.O., Hooker, G., Graves, S. (eds.): Functional Data Analysis with R and Matlab. Springer, New York (2009)

    MATH  Google Scholar 

  35. Sallinger, E.: Reasoning about schema mappings. In: Data Exchange, Integration, and Streams, pp. 97–127 (2013)

  36. Schmidberger, M., Morgan, M., Eddelbuettel, D., Yu, H., Tierney, L., Mansmann, U.: State of the art in parallel computing with r. J. Stat. Softw. 31(1), 1–27 (2009). 8

    Article  Google Scholar 

  37. Stonebraker, M., Becla, J., DeWitt, D.J., Lim, K., Maier, D., Ratzesberger, O., Zdonik, S.B.: Requirements for science data bases and SCIDB. In: CIDR (2009)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Paolo Atzeni.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Atzeni, P., Bellomarini, L., Bugiotti, F. et al. Executable schema mappings for statistical data processing. Distrib Parallel Databases 36, 265–300 (2018). https://doi.org/10.1007/s10619-017-7212-2

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10619-017-7212-2

Keywords

Navigation