Abstract
In practice, it has been acknowledged that the Hadoop framework is not an adequate choice for supporting interactive queries, which aim to achieve response times of milliseconds or a few seconds. In addition, many programmers may be unfamiliar with the Hadoop framework and would prefer to use SQL as a high-level declarative language to implement their jobs, delegating all optimization details of the execution process to the underlying engine. This chapter provides an overview of various systems that have been introduced to support an SQL flavor on top of Hadoop-like infrastructure and to provide competitive and scalable performance when processing large-scale structured data.
M. Han, K. Daudjee, K. Ammar, M. Tamer Özsu, X. Wang, T. Jin, An experimental comparison of Pregel-like graph processing systems. Proc. VLDB Endowment 7(12), 1047–1058 (2014)
Y. Lu, J. Cheng, D. Yan, H. Wu, Large-scale distributed graph computing systems: an experimental evaluation. Proc. VLDB Endowment 8(3), 281–292 (2014)
Y. Guo, M. Biczak, A.L. Varbanescu, A. Iosup, C. Martella, T.L. Willke, How well do graph-processing platforms perform? An empirical performance evaluation and analysis, in IPDPS, 2014
M. Li, J. Tan, Y. Wang, L. Zhang, V. Salapura, SparkBench: a comprehensive benchmarking suite for in memory data analytic platform Spark, in Proceedings of the 12th ACM International Conference on Computing Frontiers, CF’15, Ischia, 18–21 May 2015, pp. 53:1–53:8
M. Capota, T. Hegeman, A. Iosup, A. Prat-Pérez, O. Erling, P.A. Boncz, Graphalytics: a big data benchmark for graph-processing platforms, in Proceedings of the Third International Workshop on Graph Data Management Experiences and Systems, GRADES 2015, Melbourne, 31 May–4 June 2015, pp. 7:1–7:6
O. Batarfi, R. El Shawi, A.G. Fayoumi, R. Nouri, A. Barnawi, S. Sakr et al., Large scale graph processing systems: survey and an experimental evaluation. Clust. Comput. 18(3), 1189–1213 (2015)
V. Aluko, S. Sakr, Big SQL systems: an experimental evaluation. Clust. Comput. 22(4), 1347–1377 (2019)
N. Mahmoud, Y. Essam, R. El Shawi, S. Sakr, DLBench: an experimental evaluation of deep learning frameworks, in 2019 IEEE International Congress on Big Data, BigData Congress 2019, Milan, 8–13 July 2019, pp. 149–156
E. Shahverdi, A. Awad, S. Sakr, Big stream processing systems: an experimental evaluation, in 2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW) (IEEE, Piscataway, 2019), pp. 53–60
I. Gog, M. Schwarzkopf, N. Crooks, M.P. Grosvenor, A. Clement, S. Hand, Musketeer: all for one, one for all in data processing systems, in EuroSys (2015), pp. 2:1–2:16
D. Agrawal, M. Lamine Ba, L. Berti-Equille, S. Chawla, A. Elmagarmid, H. Hammady, Y. Idris, Z. Kaoudi, Z. Khayyat, S. Kruse, M. Ouzzani, P. Papotti, J.-A. Quian-Ruiz, N. Tang, M.J. Zaki, Rheem: enabling multi-platform task execution, in SIGMOD Conference, 2016
N. Huijboom, T. Van den Broek, Open data: an international comparison of strategies. Eur. J. ePractice 12(1), 4–16 (2011)
M. Balazinska, B. Howe, D. Suciu, Data markets in the cloud: an opportunity for the database community. Proc. VLDB Endowment 4(12), 1482–1485 (2011)
R. El Shawi, M. Maher, S. Sakr, Automated machine learning: state-of-the-art and open challenges (2019). CoRR, abs/1906.02287
H. Miao, A. Li, L.S. Davis, A. Deshpande, ModelHub: deep learning lifecycle management, in 2017 IEEE 33rd International Conference on Data Engineering (ICDE) (IEEE, Piscataway, 2017), pp. 1393–1394
M. Vartak, H. Subramanyam, W.-E. Lee, S. Viswanathan, S. Husnoo, S. Madden, M. Zaharia, Model DB: a system for machine learning model management, in Proceedings of the Workshop on Human-In-the-Loop Data Analytics (ACM, New York, 2016), p. 14
P. Bailis, K. Olukotun, C. Ré, M. Zaharia, Infrastructure for usable machine learning: the Stanford DAWN Project (2017). Preprint. arXiv:1705.07538
Copyright information
© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Sakr, S. (2020). Large-Scale Processing Systems of Structured Data. In: Big Data 2.0 Processing Systems. Springer, Cham. https://doi.org/10.1007/978-3-030-44187-6_3
DOI: https://doi.org/10.1007/978-3-030-44187-6_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-44186-9
Online ISBN: 978-3-030-44187-6
eBook Packages: Computer Science, Computer Science (R0)