Hash Semi Join MapReduce to Join Billion Records in a Reasonable Time

Marwa Hussien Mohamed; Mohamed Helmy Khafagy  and Mohamed Hasan Ibrahim

doi:10.17485/ijst/2018/v11i18/119112


Indian Journal of Science and Technology


Year: 2018, Volume: 11, Issue: 18, Pages: 1-9

Original Article


Marwa Hussien Mohamed¹*, Mohamed Helmy Khafagy² and Mohamed Hasan Ibrahim³

¹ Department of Information Systems, Arab Academy for Science, Technology and Maritime Transport, Cairo, Egypt; [email protected]
² Department of Computer Science, Fayoum University, Cairo, Egypt; [email protected]
³ Department of Information Systems, Fayoum University, Cairo, Egypt; [email protected]

*Author for correspondence
Marwa Hussien Mohamed,
Department of Information Systems, Arab Academy for Science, Technology and Maritime Transport, Cairo, Egypt; [email protected]

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objective: MapReduce is a programming model for processing massive data sets, and analyzing big data efficiently is one of today's most important challenges. Methods/Statistical Analysis: MapReduce is used to discover hidden patterns and relations in data through two simple functions, map and reduce, written by the programmer; the framework provides load balancing, fault tolerance and high scalability. Join is one of the most important operations in data analysis, but MapReduce does not directly support it. Findings: This paper explains two two-way MapReduce join algorithms, semi-join and per-split semi-join, and proposes a new algorithm, hash semi-join, which uses a hash table to improve performance by eliminating unused records as early as possible. In the second phase, it performs the join through the hash table rather than using the map function to match join keys against the other table. The hash table does not adversely affect memory usage, because only the matched records from the second table are stored. Our experimental results show that hash semi-join outperforms the other two algorithms as the data size grows from 100 million to 50 billion records. Application/Improvements: In experiments on 30 machines, running time increases with the number of joined records between the two tables, but our algorithm achieves a better running time than the other algorithms.

Keywords: Hadoop, Hash Semi Join, MapReduce, Two-Way Join
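
The idea summarized in the abstract can be illustrated with a minimal single-machine sketch in Python. It is not the authors' Hadoop implementation: the function name `hash_semi_join`, the three-phase split, and the sample tables are assumptions made for illustration only. The sketch collects the join keys of the first table, keeps only the matching records of the second table in a hash table, and then probes that table to produce the joined records.

```python
from collections import defaultdict


def hash_semi_join(left, right, key_left=0, key_right=0):
    """Illustrative in-memory sketch (hypothetical helper, not the paper's code).

    left, right: lists of tuples; key_left / key_right: index of the join key.
    """
    # Phase 1 (assumed): collect the distinct join keys of the first table.
    left_keys = {row[key_left] for row in left}

    # Phase 2 (assumed): discard unmatched records of the second table early
    # and store only the matched ones in a hash table keyed by the join key.
    matched = defaultdict(list)
    for row in right:
        if row[key_right] in left_keys:
            matched[row[key_right]].append(row)

    # Phase 3 (assumed): probe the hash table for each record of the first table.
    joined = []
    for row in left:
        for other in matched.get(row[key_left], ()):
            joined.append(row + other)
    return joined


# Hypothetical sample data: (customer_id, name) and (customer_id, item).
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(1, "book"), (3, "pen"), (3, "ink"), (9, "cup")]
result = hash_semi_join(customers, orders)
```

In this sketch the order with key 9 never enters the hash table, which mirrors the abstract's point that only matched records from the second table are kept in memory.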