Ensemble learning of runtime prediction models for gene-expression analysis workflows

Abstract

The adequate management of scientific workflow applications strongly depends on the availability of accurate performance models of sub-tasks. Numerous approaches use machine learning to generate such models autonomously, thus alleviating the human effort associated with this process. However, these standalone models may lack robustness, leading to a decay in the quality of information provided to the workflow systems built on top of them. This paper presents a novel approach for learning ensemble prediction models of task runtimes. The ensemble-learning method known as bootstrap aggregating (bagging) is used to produce robust ensembles of M5P regression trees with better predictive performance than standalone models can achieve. Our approach has been tested on gene-expression analysis workflows. The results show that the ensemble method leads to significant prediction-error reductions compared with learned standalone models. This is the first initiative using ensemble learning for generating performance prediction models. These promising results encourage further research in this direction.
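
As a concrete illustration of the bagging idea described above, the following minimal sketch (Python with scikit-learn; not the authors' implementation) trains a bagged ensemble of regression trees on synthetic task data and compares its error against a standalone tree. A DecisionTreeRegressor stands in for the M5P model trees used in the paper, and the task features and runtimes are entirely hypothetical.

import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Hypothetical task features: input size (MB), CPU benchmark score, number of cores.
X = rng.uniform(low=[1.0, 0.5, 1.0], high=[500.0, 2.0, 16.0], size=(300, 3))
# Hypothetical runtimes (s): roughly proportional to size / (score * cores), plus noise.
y = 10.0 + 0.8 * X[:, 0] / (X[:, 1] * X[:, 2]) + rng.normal(0.0, 1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Standalone regression tree versus a bagged ensemble of the same base learner.
standalone = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=30,
                          random_state=0).fit(X_tr, y_tr)

print("MAE standalone tree:", mean_absolute_error(y_te, standalone.predict(X_te)))
print("MAE bagged ensemble:", mean_absolute_error(y_te, bagged.predict(X_te)))

On such data the bagged ensemble typically yields a lower mean absolute error, reflecting the variance reduction that motivates the paper's use of bagging.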

Notes

  1. This task of classification learning should not be confused with the learning task of runtime prediction.

  2. HTCondor. http://research.cs.wisc.edu/htcondor/.

  3. SciMark2 benchmark. http://math.nist.gov/scimark2.

  4. Linpack benchmark. http://www.netlib.org/linpack.

  5. This sub-sampling process should not be confused with the sub-sampling carried out in bagging.

Acknowledgments

This research is supported by the ANPCyT project No. PICT-2012-2731, and by the MINCyT project No. RC0904. MH and FZ were supported by the Czech Science Foundation project No. P202/12/2032. The financial support from SeCTyP-UNCuyo through project No. M004 is also gratefully acknowledged. DAM thanks CONICET for the granted fellowship. We also thank Alejandro Edera and Rubén Santos for their fruitful comments. Finally, the authors thank the anonymous reviewers for their valuable comments and suggestions, which helped to improve the quality of this paper.

Author information

Corresponding author

Correspondence to David A. Monge.

Additional information

Reproducibility of Results: The data used in this study and additional results are available at the following URL: http://damonge.wordpress.com/research/enslearn-clus2014.

Cite this article

Monge, D.A., Holec, M., Železný, F. et al. Ensemble learning of runtime prediction models for gene-expression analysis workflows. Cluster Comput 18, 1317–1329 (2015). https://doi.org/10.1007/s10586-015-0481-5
