1 Introduction

Markets around the world are becoming increasingly saturated, with an ever-growing number of customers moving their registered services between competing companies. Companies have therefore realized that they should focus their marketing efforts on customer retention rather than customer acquisition. Indeed, studies have shown that the resources a company spends on acquiring new customers far exceed what it would spend on retaining its existing ones. Retention strategies can be targeted at high-risk customers who intend to stop using the service or to move to a competitor. This loss of customers is known as customer churn. From a machine learning viewpoint, churn can be framed as a binary classification problem. Although there are other approaches to churn prediction (for instance, survival analysis), the most common formulation labels customers who churn within a specific time frame as one class and customers who remain engaged with the product as the complementary class; accurate early identification of likely churners is therefore essential to minimizing the cost of a company's overall retention marketing strategy. Churn is defined differently by every organization and product. In general, customers who stop using a product or service for a given period of time are referred to as churners. Accordingly, churn is one of the most important components of a product's or service's Key Performance Indicators (KPIs). A full customer lifecycle analysis requires examining retention metrics in order to understand the health of the business or product.
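Framing churn as a binary label over a fixed time window can be made concrete with a small sketch. The 30-day inactivity window and the field names here are illustrative assumptions, not the paper's definition:

```python
from datetime import date

def churn_label(last_active: date, as_of: date, window_days: int = 30) -> int:
    """Label a customer 1 ('churner') if inactive longer than the window, else 0."""
    return 1 if (as_of - last_active).days > window_days else 0

as_of = date(2020, 3, 1)
# One recently active customer and one long-inactive customer.
labels = [churn_label(d, as_of) for d in (date(2020, 2, 25), date(2020, 1, 10))]
print(labels)  # [0, 1]
```

In practice the window length is a business decision (prepaid and postpaid products, for example, often use different definitions), which is why churn is "defined differently by every organization."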
Currently, large companies, government organizations, and senior decision makers increasingly turn to big data analytics to make important decisions and stay competitive. Data analysis is the process of cleaning, transforming, and modeling huge datasets to obtain useful information. Many big data tools exist for this purpose; Apache Spark, introduced in 2010, is one of them. Spark has a friendly Scala interface, is compatible with Hadoop and its modules, and runs faster than Hadoop thanks to its in-memory, real-time processing. Spark also provides a very clear and valuable application programming interface (API). Compared with Scala, PySpark (Spark's Python API) is easier to implement and simpler to write; accordingly, PySpark is used in this paper.

2 Literature Survey

Jie Lu et al. stated that customer churn prediction is one of the main features of modern telecom customer relationship management (CRM) systems. They conducted a real-world study on customer churn and proposed using boosting to improve a customer churn prediction model [1]. Hua Hsu discussed a recommendation system for customer churn based on a decision tree algorithm; the data used for this analysis covered over 4,000 members and more than 60,000 transactions over a period of four months [2]. Malathi and Kamalraj focused their research on customer churn prediction using different data mining techniques, an approach that telecommunication industries can use in the customer retention activities of their customer relationship management efforts. They used data mining techniques to predict churn from customer details [3]. Veronika Effendy proposed a technique for handling imbalanced data to improve the customer churn prediction model. The proposed technique combines weighted random forest (WRF) with sampling to balance the dataset so that the accuracy of churn prediction is enhanced [4]. G. Ganesh Sundar Kumar proposed a one-class SVM-based under-sampling technique for enhancing insurance fraud detection and churn prediction. The data is first sampled using the one-class SVM technique, and classification is then performed using machine learning algorithms. Based on the results, it is concluded that the decision tree, combined with the one-class SVM, performs better than the other classification algorithms; it reduces system complexity and helps improve prediction precision [5]. D. Olle et al. proposed a hybrid learning model to predict customer churn in the telecommunication industry. Their model uses WEKA, one of the well-known machine learning tools.

They also stated that data mining techniques can detect customers with a high tendency to churn but cannot provide the reason for the churn. The primary goal of their study is to show that models built using hybrid data mining techniques can explain churn behavior more accurately than models using a single method [6]. According to Yen, C. et al., customer churn in the telecommunication industry refers to customers shifting from one service provider to another. Customer management comprises the processes a telecom company conducts to retain its customers. The evolution of technology and the increase in the number of service providers have made this market more competitive than ever, so the telecom industry has realized that it must retain its customers in order to sustain its profits and survive in this competitive world [7]. Yihui et al. proposed a decision tree-based random forest method for characteristic extraction. From the original data with Q characteristics, N samples with q < Q characteristics are selected randomly for each tree to form a forest. The number of trees depends on the number of characteristics randomly combined for each decision tree from the whole set of Q characteristics [8].

3 Background Study

3.1 Apache Spark

Apache Spark is a computation platform designed to be fast and easy to use. It provides fault tolerance and scalability on commodity hardware. It offers application programming interfaces (APIs) for Scala, Python, and Java, and libraries for SQL, machine learning, streaming, and graph processing. It can run on Hadoop clusters or standalone. It is based on the Hadoop MapReduce model and extends it to support more types of computation, including stream processing and interactive queries. The main feature of Apache Spark is its in-memory cluster computing, which greatly increases the processing speed of an application.

As shown in Fig. 1, Apache Spark uses the Resilient Distributed Dataset (RDD), a read-only collection of objects that can be operated on in parallel and that is partitioned across a set of machines; a lost partition can be rebuilt. There are two types of RDD operations: transformations and actions. Transformations are functions that take an RDD and produce an RDD of another type, using functions supplied by the user; examples include map, filter, reduceByKey, join, cogroup, and randomSplit. Actions, in contrast, are the RDD operations that return non-RDD values; the values produced by an action are returned to the driver or written to an external storage system. An action is thus a way of sending data from the executors to the driver. Executors are responsible for performing tasks, while the driver is a JVM process that coordinates the execution of tasks and the workers.
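The transformation/action distinction can be mimicked in plain Python: built-ins such as `filter` and `map` are lazy, like Spark transformations, while `reduce` (or `list`, `sum`) forces evaluation and returns a plain value to the caller, like a Spark action returning to the driver. This is only a single-machine analogy of the RDD behavior described above, not actual Spark code:

```python
from functools import reduce

durations = [120, 45, 300, 10, 95]  # call durations in seconds (toy data)

# "Transformations": lazy, nothing is computed yet.
long_calls = filter(lambda d: d >= 60, durations)   # like RDD.filter
minutes = map(lambda d: d / 60, long_calls)          # like RDD.map

# "Action": forces evaluation of the whole lazy chain and returns a
# plain (non-lazy) value, like RDD.reduce returning a result to the driver.
total_minutes = reduce(lambda a, b: a + b, minutes)
print(total_minutes)  # (120 + 300 + 95) / 60
```

The real difference in Spark is that the lazy chain is recorded as a lineage over partitioned data, which is what allows a lost partition to be recomputed.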

Fig. 1 Example of transformation process in Apache Spark

3.2 MongoDB

MongoDB is an open-source document database and one of the leading NoSQL databases. It is a cross-platform database that provides availability, high performance, and scalability. It works on the concepts of documents and collections, and a MongoDB server generally hosts multiple databases. The number of fields, the size of the document, and the content can differ from one document to another. In MongoDB, data is stored in the form of JSON-style documents. It is schemaless and easy to scale.
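A customer record in MongoDB is simply a JSON-style document, and two documents in the same collection may have different fields. The field names below are illustrative, not the paper's actual schema; the round trip through Python's `json` module only demonstrates the document shape (MongoDB itself stores a binary JSON encoding called BSON):

```python
import json

customer = {
    "_id": "cust-001",          # hypothetical identifier
    "state": "KS",
    "account_length": 128,
    "intl_plan": False,
    "day_minutes": 265.1,
    # Nested documents are allowed, and other customers may omit this field.
    "complaints": [{"date": "2020-01-05", "type": "billing"}],
}

# Serialize and parse back to show the schemaless, self-describing structure.
restored = json.loads(json.dumps(customer))
print(restored["account_length"])  # 128
```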

3.3 Boosting

Boosting is one of the most powerful learning ideas designed for classification problems. Most boosting algorithms iteratively learn weak classifiers with respect to a distribution over the training data and add them to a final strong classifier; in effect, they convert a set of weak learners into a strong one. There are many types of boosting algorithms, such as gradient tree boosting, XGBoost, and AdaBoost (adaptive boosting).
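The idea of iteratively reweighting the data and combining weak learners can be sketched with a minimal AdaBoost on one-dimensional toy data, using decision stumps as the weak learners. The data, the number of rounds, and the stump form are all illustrative choices, not the paper's configuration:

```python
import math

def stump(x, thr, sign):
    """Weak learner: predict `sign` if x > thr, else -sign (labels are +1/-1)."""
    return sign if x > thr else -sign

def best_stump(xs, ys, w):
    """Pick the stump with the lowest weighted training error."""
    best, best_err = None, float("inf")
    for thr in xs:
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if stump(xi, thr, sign) != yi)
            if err < best_err:
                best, best_err = (thr, sign), err
    return best, best_err

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    w = [1.0 / n] * n                      # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        (thr, sign), err = best_stump(xs, ys, w)
        err = max(err, 1e-10)              # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, sign))
        # Reweight: misclassified points gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * yi * stump(xi, thr, sign))
             for xi, yi, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Strong classifier: weighted vote of the weak learners."""
    return 1 if sum(a * stump(x, t, s) for a, t, s in ensemble) > 0 else -1

xs = list(range(10))
ys = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]  # toy labels, separable at x = 3.5
model = adaboost(xs, ys)
print([predict(model, x) for x in xs] == ys)  # True
```

Gradient boosting (as used by GBTClassifier later in this paper) differs in that each new learner fits the residual error of the current ensemble rather than a reweighted dataset, but the "add weak learners to build a strong one" structure is the same.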

4 Architecture

Figure 2 shows the call churn prediction architecture, which takes customer information as input and predicts churn. Each customer generates several streams of events, for example a stream of billing actions, streams of call records kept by the company and the subscriber, SMS, complaints, etc. The Integrator module integrates all of these streams and generates a logically unique stream of events. The Data Manager module manages the queue of pending customer predictions and the customer information database. The predictions queue contains all predictions awaiting refusal or confirmation in the future, whereas the customer information database contains all of the basic information about the subscribers (name, age, type of contract, address, etc.). The Record Processor is the soul of the system. It receives the stream of events generated by the Integrator and uses those events for two purposes: first, it updates the customer information database according to each event; second, it generates records from events using information from the customer information database. The Record Processor builds, maintains, and applies the predictive models; it therefore contains the machine learning or data mining algorithms that make prediction possible. Thus, the Record Processor module produces predictions and profiles of the predicted churners. The churner IDs and profiles produced by the Record Processor are passed to the user interface or to the other parts of the customer management system (CMM) so that adequate actions can be assessed and performed.
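The Integrator's job of producing one logically unique, time-ordered event stream can be illustrated with a tiny single-process sketch using `heapq.merge`. The stream contents, timestamps, and field layout are hypothetical; a production integrator would consume distributed, possibly out-of-order streams:

```python
import heapq

# Per-source event streams, each already ordered by timestamp (first field).
billing = [(1, "billing", "invoice issued"), (5, "billing", "payment received")]
calls = [(2, "call", "outbound 120s"), (4, "call", "inbound 30s")]
complaints = [(3, "complaint", "billing dispute")]

# heapq.merge lazily merges the sorted streams into one ordered stream,
# which is what the Record Processor then consumes event by event.
unified = list(heapq.merge(billing, calls, complaints))
print([t for t, _, _ in unified])  # [1, 2, 3, 4, 5]
```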

Fig. 2 Framework architecture of call churn prediction

Figure 3 shows the establishment of the MongoDB-Spark connection. The MongoDB Spark Connector is an open-source project for reading and writing MongoDB data with Apache Spark. The connector offers features such as converting a MongoDB collection into a Spark RDD and methods to load collections directly into a Spark DataFrame or Dataset. MongoDB helps us develop applications faster because tables and stored procedures are no longer required. This is an advantage for developers because, previously, tables had to be translated into an object model before they could be used in an application. Now the stored data and the object model have the same JSON-like structure, known as BSON. MongoDB provides various options, supports scalability, and maintains data consistency.
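A typical way to wire the connector is through Spark configuration at submit time. The package coordinates, option names, and URIs below are assumptions (they vary across connector and Spark versions) and are shown only to illustrate the shape of the configuration, not the paper's actual setup:

```
# spark-submit options (MongoDB Spark Connector 10.x-style option names; adjust to your versions)
--packages org.mongodb.spark:mongo-spark-connector_2.12:10.1.1
--conf spark.mongodb.read.connection.uri=mongodb://127.0.0.1/telecom.customers
--conf spark.mongodb.write.connection.uri=mongodb://127.0.0.1/telecom.predictions
```

With these options set, collections can be loaded into a DataFrame without any per-record translation code, which is the developer advantage described above.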

Fig. 3 MongoDB Spark Connection establishment

5 Methodology

5.1 Dataset

We chose to study a dataset from a telecom company that includes data on millions of mobile customers active at a particular point in time. We initially extracted variables from the customer database, including contract and mobile plan information, usage, account length, area, billing, and product holding information. It also includes customer-care inbound/outbound information (Table 1).

Table 1 List of attributes with its type in the dataset

6 Pseudo Code

  1. Loading the data from MongoDB

     $$\text{data} = \text{load}()$$

  2. Create training and testing data

     $$\text{train}, \text{test} = \text{data.split}(0.8, 0.2)$$

  3. Check for correlation among columns

     $$\text{train.correlation}()$$

  4. Converting string values to numeric values

     $$\text{val} = \text{StringIndexer}(\text{train})$$

  5. Acquiring the vectors

     $$\text{vector} = \text{VectorAssembler}(\text{train})$$

  6. Creating a pipeline

     $$\text{Pipeline}(\text{val}, \text{vector}, \text{GBTClassifier})$$

  7. Training the model

     $$\text{model} = \text{pipeline.fit}(\text{train})$$

  8. Predicting the test data

     $$\text{predict} = \text{model.transform}(\text{test})$$

  9. Displaying the predicted data

     $$\text{print}(\text{predict})$$
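The steps above map naturally onto PySpark's ML pipeline API. The sketch below is a minimal illustration under assumed column names (the label and feature columns are placeholders), and the read format depends on the MongoDB connector version; it is not the paper's exact code and requires a running Spark session with the connector on the classpath:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("churn").getOrCreate()

# Step 1: load the data (format/options depend on the connector version used).
data = spark.read.format("mongodb").load()

# Step 2: split into training and testing sets.
train, test = data.randomSplit([0.8, 0.2], seed=42)

# Steps 4-6: index a string column, assemble a feature vector, build the pipeline.
indexer = StringIndexer(inputCol="intl_plan", outputCol="intl_plan_idx")      # assumed column
assembler = VectorAssembler(
    inputCols=["account_length", "day_minutes", "intl_plan_idx"],             # assumed columns
    outputCol="features",
)
gbt = GBTClassifier(labelCol="churn", featuresCol="features")
pipeline = Pipeline(stages=[indexer, assembler, gbt])

# Steps 7-9: fit on the training split, transform the test split, display output.
model = pipeline.fit(train)
predictions = model.transform(test)
predictions.select("churn", "prediction").show()
```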

7 Results Analysis

We evaluated the performance of our churn prediction model using a training set of customer information collected from the link (https://bigml.com/user/cesareconti89/gallery/dataset/58cfbada49c4a13341003cba). Each customer is classified according to churn susceptibility. The testing dataset is monitored frequently and updated with the latest information; in this way, we replicate the real-world scenario of churn prediction. The input to the model is the customer information database, which contains all of the information about each customer (state, account length, area code, phone number, etc.); all of this information is used to predict churn.

In Fig. 4, churn is predicted for the customers: a churn value of 1 means 'True,' and 0 means 'False.' Churn is computed for each customer, and the overall churn percentage is derived from the counts. In our model, 136 of the 932 customers are predicted to churn, a churn rate of about 14.6%. The bar plot of the churn predictions for the 932 customers is shown in Fig. 5.
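The reported churn rate follows directly from the prediction counts; a quick check using the paper's numbers (136 predicted churners out of 932 customers) with Python's `Counter`:

```python
from collections import Counter

# Predicted churn flags for the 932 customers: 136 ones, 796 zeros.
predictions = [1] * 136 + [0] * 796

counts = Counter(predictions)
churn_rate = counts[1] / sum(counts.values())
print(f"{churn_rate:.1%}")  # 14.6%
```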

Fig. 4 Analysis of the outcomes of churn

Fig. 5 Bar plot for the predicted churns

A churn prediction model should be measured by its ability to identify churners for marketing purposes. We therefore used the Receiver Operating Characteristic (ROC) to test the efficiency of this model.


The churn prediction system is tested using the Receiver Operating Characteristic (ROC). Under this test, the measured model efficiency is 74.1%: roughly, if the model predicts churn for four customers, about three of those predictions are correct and one may be wrong.
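The ROC-based efficiency can be read as an area-under-the-curve (AUC) style measure: the probability that a randomly chosen churner receives a higher model score than a randomly chosen non-churner. A small self-contained sketch of that computation (the scores and labels below are illustrative, not the paper's data):

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # model scores: higher = more likely to churn
print(auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect ranking, so the model's 74.1% sits meaningfully above chance.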

8 Conclusion

The telecommunication industry has suffered heavy losses from high churn rates and the large-scale churning of customers. Though some loss is unavoidable, churn can be managed and kept at an acceptable level. This research conducts an experimental investigation of customer churn prediction based on a real-world dataset. In this paper, a boosting algorithm is used to predict customer churn, and the evaluation of the developed model shows that it is effective at predicting churn. There is still substantial work to do from both the business and the technical points of view. On the one hand, the performance and efficiency of the model can be further improved, and other classification methods can be applied and compared. On the other hand, churn prediction only provides a basis for generating and prioritizing lists of customers to contact; identifying the reasons behind customers' churn behavior and addressing customer needs are also essential for targeted marketing.