Optimization of data-intensive workflows in stream-based data processing models

The Journal of Supercomputing

Abstract

Stream computing applications require minimum latency and high throughput to process real-time data efficiently. Typically, data-intensive applications in which large datasets must be moved across execution nodes have low latency requirements. In this paper, a stream-based data processing model is adopted to develop an algorithm for optimally partitioning the input data such that the inter-partition data flow remains minimal. The proposed algorithm improves the execution of data-intensive workflows in heterogeneous computing environments by partitioning the workflow and mapping each partition onto the available heterogeneous resources that offer the minimum execution time. Minimizing data movement between the partitions reduces the latency, which can be reduced further by applying advanced data parallelism techniques. We apply a data parallelism technique to the bottleneck (most compute-intensive) task in each partition, which significantly reduces the latency. We study the effectiveness and performance of the proposed approach using synthesized workflows and real-world applications, such as Montage and CyberShake. Our evaluation shows that the proposed algorithm provides schedules with approximately 12% lower latency and nearly 17% higher throughput than the existing state-of-the-art algorithms.
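The three ideas named in the abstract (partitioning with minimal inter-partition data flow, mapping partitions onto heterogeneous resources, and parallelising the bottleneck task) can be illustrated with a toy sketch. This is not the paper's algorithm: the workflow, the weights, the resource speeds, the brute-force two-way split, the greedy mapping, and the ideal speed-up assumed for replicated tasks are all hypothetical choices made only for illustration.

```python
from itertools import combinations

# Hypothetical toy workflow; all weights are illustrative, not from the paper.
work = {"a": 4, "b": 6, "c": 12, "d": 5, "e": 3}   # compute cost per task
data = {("a", "b"): 10, ("b", "c"): 2, ("a", "c"): 1,
        ("c", "d"): 8, ("d", "e"): 9}               # data volume per edge
speed = {"r1": 1.0, "r2": 2.0}                      # heterogeneous resources


def cut_volume(part, data):
    """Data volume on edges that cross between `part` and the other tasks."""
    return sum(v for (u, w), v in data.items() if (u in part) != (w in part))


def min_cut_bipartition(work, data):
    """Brute-force the 2-way split with the least inter-partition data flow."""
    tasks = sorted(work)
    best_cost, best_part = None, None
    for k in range(1, len(tasks)):
        for subset in combinations(tasks, k):
            cost = cut_volume(set(subset), data)
            if best_cost is None or cost < best_cost:
                best_cost, best_part = cost, set(subset)
    return best_cost, best_part, set(tasks) - best_part


def partition_time(part, work, resource_speed, replicas=1):
    """Serial time of a partition when its bottleneck task is split over
    `replicas` copies (assumes ideal speed-up and no merge overhead)."""
    bottleneck = max(part, key=lambda t: work[t])
    total = sum(work[t] for t in part) - work[bottleneck] * (1 - 1 / replicas)
    return total / resource_speed


cost, p1, p2 = min_cut_bipartition(work, data)

# Simple greedy mapping (an assumption): heavier partition gets the faster resource.
heavy, light = sorted([p1, p2], key=lambda p: sum(work[t] for t in p), reverse=True)
fast, slow = sorted(speed, key=speed.get, reverse=True)
mapping = {fast: heavy, slow: light}

before = partition_time(heavy, work, speed[fast])
after = partition_time(heavy, work, speed[fast], replicas=2)
print(cost, mapping, before, after)
```

In this toy instance the split separates tasks `a` and `b` from `c`, `d`, and `e`, because only the two lightest edges cross the boundary. Exhaustive search is feasible only for very small graphs; a practical partitioner would need a heuristic and would also have to respect the precedence constraints of the workflow.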

Notes

  1. https://pegasus.isi.edu/overview/.

  2. https://confluence.pegasus.isi.edu/display/pegasus/Workflow+Data.

Acknowledgements

The work presented in this paper is partly supported by the Ministry of Education Malaysia (FRGS FP051-2013A and UMRG RP001F-13ICT).

Author information

Corresponding author

Correspondence to Chee Sun Liew.

About this article

Cite this article

Ahmad, S.G., Liew, C.S., Rafique, M.M. et al. Optimization of data-intensive workflows in stream-based data processing models. J Supercomput 73, 3901–3923 (2017). https://doi.org/10.1007/s11227-017-1991-0

  • DOI: https://doi.org/10.1007/s11227-017-1991-0
