Ensemble learning of runtime prediction models for gene-expression analysis workflows

Abstract

The adequate management of scientific workflow applications strongly depends on the availability of accurate performance models of sub-tasks. Numerous approaches use machine learning to generate such models autonomously, thus alleviating the human effort associated with this process. However, these standalone models may lack robustness, leading to a decay in the quality of information provided to the workflow systems built on top of them. This paper presents a novel approach for learning ensemble prediction models of task runtimes. The ensemble-learning method known as bootstrap aggregating (bagging) is used to produce robust ensembles of M5P regression trees with better predictive performance than standalone models can achieve. Our approach has been tested on gene-expression analysis workflows. The results show that the ensemble method leads to significant prediction-error reductions compared with learned standalone models. This is the first initiative using ensemble learning for generating performance prediction models. These promising results encourage further research in this direction.
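
As a concrete illustration of the bagging idea described above, the following minimal sketch (Python with scikit-learn; not the authors' implementation) trains a bagged ensemble of regression trees on synthetic task data and compares its error against a standalone tree. A DecisionTreeRegressor stands in for the M5P model trees used in the paper, and the task features and runtimes are entirely hypothetical.

import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Hypothetical task features: input size (MB), CPU benchmark score, number of cores.
X = rng.uniform(low=[1.0, 0.5, 1.0], high=[500.0, 2.0, 16.0], size=(300, 3))
# Hypothetical runtimes (s): roughly proportional to size / (score * cores), plus noise.
y = 10.0 + 0.8 * X[:, 0] / (X[:, 1] * X[:, 2]) + rng.normal(0.0, 1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Standalone regression tree versus a bagged ensemble of the same base learner.
standalone = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=30,
                          random_state=0).fit(X_tr, y_tr)

print("MAE standalone tree:", mean_absolute_error(y_te, standalone.predict(X_te)))
print("MAE bagged ensemble:", mean_absolute_error(y_te, bagged.predict(X_te)))

On such data the bagged ensemble typically yields a lower mean absolute error, reflecting the variance reduction that motivates the paper's use of bagging.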

Notes

  1. This task of classification learning should not be confused with the learning task of runtime prediction.

  2. HTCondor. http://research.cs.wisc.edu/htcondor/.

  3. SciMark2 benchmark. http://math.nist.gov/scimark2.

  4. Linpack benchmark. http://www.netlib.org/linpack.

  5. This sub-sampling process should not be confused with the sub-sampling carried out in bagging.

Acknowledgments

This research is supported by the ANPCyT project No. PICT-2012-2731, and by the MINCyT project No. RC0904. MH and FZ were supported by the Czech Science Foundation project No. P202/12/2032. The financial support from SeCTyP-UNCuyo through project No. M004 is also gratefully acknowledged. DAM thanks CONICET for the granted fellowship. We also thank Alejandro Edera and Rubén Santos for their fruitful comments. Finally, the authors thank the anonymous reviewers for their valuable comments and suggestions, which helped to improve the quality of this paper.

Author information

Corresponding author

Correspondence to David A. Monge.

Additional information

Reproducibility of Results: The data used in this study and additional results are available at the following URL: http://damonge.wordpress.com/research/enslearn-clus2014.

Cite this article

Monge, D.A., Holec, M., Železný, F. et al. Ensemble learning of runtime prediction models for gene-expression analysis workflows. Cluster Comput 18, 1317–1329 (2015). https://doi.org/10.1007/s10586-015-0481-5
