ABSTRACT
Stream processing applications have recently gained significant attention in the networking and database communities. At the core of these applications is a stream processing engine that performs resource allocation and management to support continuous tracking of queries over collections of physically distributed and rapidly updating data streams. While numerous stream processing systems exist, there has been little work on understanding the performance characteristics of these applications in a distributed setup. In this paper, we examine the performance bottlenecks that streaming data applications, in particular the Linear Road stream data management benchmark, face in large-scale distributed environments, using the Stream Processing Core (SPC), a stream processing middleware we have developed. First, we present the design and implementation of the Linear Road benchmark on the SPC middleware. SPC has been designed to scale to tens of thousands of processing nodes while supporting concurrent applications and multiple simultaneous queries. Second, we identify the main bottlenecks that limit the scalability and query response latency of the Linear Road application. Our results show that data locality, buffer capacity, the physical allocation of processing elements to infrastructure nodes, and the packaging of streamed data for transport are important factors in achieving good application performance. Although we evaluate our system primarily with the Linear Road application, we believe the results also provide useful insights into overall system behavior for other large-scale distributed continuous streaming data applications. Finally, we examine how SPC can be used and tuned to enable a very efficient implementation of the Linear Road application in a distributed environment.
Index Terms
- Design, implementation, and evaluation of the Linear Road benchmark on the Stream Processing Core