1 Introduction

Markets around the world are becoming increasingly saturated, with an ever-growing number of customers moving their registered services between competing companies. Companies have therefore realized that they should focus their marketing efforts on customer retention rather than customer acquisition. Indeed, studies have shown that the resources a company spends on acquiring new customers far exceed what it would spend on retaining its existing ones. Retention strategies can be targeted at high-risk customers who intend to stop using the service or to move to a competitor. This loss of customers is known as customer churn. From a machine learning viewpoint, churn can be framed as a binary classification problem. Although there are other approaches to churn prediction (for instance, survival analysis), the most common formulation labels customers who churn within a specific time frame as one class and customers who remain engaged with the product as the complementary class; accurate early identification of likely churners is therefore essential to minimizing the cost of a company's overall retention marketing strategy. Churn is defined differently by every organization and product. In general, customers who stop using a product or service for a given period of time are referred to as churners. Accordingly, churn is one of the most important components of a product's or service's Key Performance Indicators (KPIs). A full customer lifecycle analysis requires examining retention metrics in order to understand the health of the business or product.
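Framing churn as a binary label over a fixed time window can be made concrete with a small sketch. The 30-day inactivity window and the field names here are illustrative assumptions, not the paper's definition:

```python
from datetime import date

def churn_label(last_active: date, as_of: date, window_days: int = 30) -> int:
    """Label a customer 1 ('churner') if inactive longer than the window, else 0."""
    return 1 if (as_of - last_active).days > window_days else 0

as_of = date(2020, 3, 1)
# One recently active customer and one long-inactive customer.
labels = [churn_label(d, as_of) for d in (date(2020, 2, 25), date(2020, 1, 10))]
print(labels)  # [0, 1]
```

In practice the window length is a business decision (prepaid and postpaid products, for example, often use different definitions), which is why churn is "defined differently by every organization."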
Currently, large companies, government organizations, and senior decision makers increasingly turn to big data analytics to make important decisions and stay competitive. Data analysis is the process of cleaning, transforming, and modeling huge datasets to obtain useful information. Many big data tools exist for this purpose; Apache Spark, introduced in 2010, is one of them. Spark has a friendly Scala interface, is compatible with Hadoop and its modules, and runs faster than Hadoop thanks to its in-memory, real-time processing. Spark also provides a very clear and valuable application programming interface (API). Compared with Scala, PySpark (Spark's Python API) is easier to implement and simpler to write; accordingly, PySpark is used in this paper.

2 Literature Survey

Jie Lu et al. stated that customer churn prediction is one of the main features of modern telecom customer relationship management (CRM) systems. They conducted a real-world study on customer churn and proposed using boosting to improve a customer churn prediction model [1]. Hua Hsu discussed a recommendation system for customer churn based on a decision tree algorithm; the data used for this analysis covered over 4,000 members and more than 60,000 transactions over a period of four months [2]. Malathi and Kamalraj focused their research on customer churn prediction using different data mining techniques, an approach that telecommunication industries can use in the customer retention activities of their customer relationship management efforts. They used data mining techniques to predict churn from customer details [3]. Veronika Effendy proposed a technique for handling imbalanced data to improve the customer churn prediction model. The proposed technique combines weighted random forest (WRF) with sampling to balance the dataset so that the accuracy of churn prediction is enhanced [4]. G. Ganesh Sundar Kumar proposed a one-class SVM-based under-sampling technique for enhancing insurance fraud detection and churn prediction. The data is first sampled using the one-class SVM technique, and classification is then performed using machine learning algorithms. Based on the results, it is concluded that the decision tree, combined with the one-class SVM, performs better than the other classification algorithms; it reduces system complexity and helps improve prediction precision [5]. D. Olle et al. proposed a hybrid learning model to predict customer churn in the telecommunication industry. Their model uses WEKA, one of the well-known machine learning tools.

They also stated that data mining techniques can detect customers with a high tendency to churn but cannot provide the reason for the churn. The primary goal of their study is to show that models built using hybrid data mining techniques can explain churn behavior more accurately than models using a single method [6]. According to Yen, C. et al., customer churn in the telecommunication industry refers to customers shifting from one service provider to another. Customer management comprises the processes a telecom company conducts to retain its customers. The evolution of technology and the increase in the number of service providers have made this market more competitive than ever, so the telecom industry has realized that it must retain its customers in order to sustain its profits and survive in this competitive world [7]. Yihui et al. proposed a decision tree-based random forest method for characteristic extraction. From the original data with Q characteristics, N samples with q < Q characteristics are selected randomly for each tree to form a forest. The number of trees depends on the number of characteristics randomly combined for each decision tree from the whole set of Q characteristics [8].

3 Background Study

3.1 Apache Spark

Apache Spark is a computation platform designed to be fast and easy to use. It provides fault tolerance and scalability on commodity hardware. It offers application programming interfaces (APIs) for Scala, Python, and Java, and libraries for SQL, machine learning, streaming, and graph processing. It can run on Hadoop clusters or standalone. It is based on the Hadoop MapReduce model and extends it to support more types of computation, including stream processing and interactive queries. The main feature of Apache Spark is its in-memory cluster computing, which greatly increases the processing speed of an application.

As shown in Fig. 1, Apache Spark uses the Resilient Distributed Dataset (RDD), a read-only collection of objects that can be operated on in parallel and that is partitioned across a set of machines; a lost partition can be rebuilt. There are two types of RDD operations: transformations and actions. Transformations are functions that take an RDD and produce an RDD of another type, using functions supplied by the user; examples include map, filter, reduceByKey, join, cogroup, and randomSplit. Actions, in contrast, are the RDD operations that return non-RDD values; the values produced by an action are returned to the driver or written to an external storage system. An action is thus a way of sending data from the executors to the driver. Executors are responsible for performing tasks, while the driver is a JVM process that coordinates the execution of tasks and the workers.
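The transformation/action distinction can be mimicked in plain Python: built-ins such as `filter` and `map` are lazy, like Spark transformations, while `reduce` (or `list`, `sum`) forces evaluation and returns a plain value to the caller, like a Spark action returning to the driver. This is only a single-machine analogy of the RDD behavior described above, not actual Spark code:

```python
from functools import reduce

durations = [120, 45, 300, 10, 95]  # call durations in seconds (toy data)

# "Transformations": lazy, nothing is computed yet.
long_calls = filter(lambda d: d >= 60, durations)   # like RDD.filter
minutes = map(lambda d: d / 60, long_calls)          # like RDD.map

# "Action": forces evaluation of the whole lazy chain and returns a
# plain (non-lazy) value, like RDD.reduce returning a result to the driver.
total_minutes = reduce(lambda a, b: a + b, minutes)
print(total_minutes)  # (120 + 300 + 95) / 60
```

The real difference in Spark is that the lazy chain is recorded as a lineage over partitioned data, which is what allows a lost partition to be recomputed.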

Fig. 1 Example of transformation process in Apache Spark

3.2 MongoDB

MongoDB is an open-source document database and one of the leading NoSQL databases. It is a cross-platform database that provides availability, high performance, and scalability. It works on the concepts of documents and collections, and a MongoDB server generally hosts multiple databases. The number of fields, the size of the document, and the content can differ from one document to another. In MongoDB, data is stored in the form of JSON-style documents. It is schemaless and easy to scale.
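A customer record in MongoDB is simply a JSON-style document, and two documents in the same collection may have different fields. The field names below are illustrative, not the paper's actual schema; the round trip through Python's `json` module only demonstrates the document shape (MongoDB itself stores a binary JSON encoding called BSON):

```python
import json

customer = {
    "_id": "cust-001",          # hypothetical identifier
    "state": "KS",
    "account_length": 128,
    "intl_plan": False,
    "day_minutes": 265.1,
    # Nested documents are allowed, and other customers may omit this field.
    "complaints": [{"date": "2020-01-05", "type": "billing"}],
}

# Serialize and parse back to show the schemaless, self-describing structure.
restored = json.loads(json.dumps(customer))
print(restored["account_length"])  # 128
```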

3.3 Boosting

Boosting is one of the most powerful learning ideas designed for classification problems. Most boosting algorithms iteratively learn weak classifiers with respect to a distribution over the training data and add them to a final strong classifier; in effect, they convert a set of weak learners into a strong one. There are many types of boosting algorithms, such as gradient tree boosting, XGBoost, and AdaBoost (adaptive boosting).
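The idea of iteratively reweighting the data and combining weak learners can be sketched with a minimal AdaBoost on one-dimensional toy data, using decision stumps as the weak learners. The data, the number of rounds, and the stump form are all illustrative choices, not the paper's configuration:

```python
import math

def stump(x, thr, sign):
    """Weak learner: predict `sign` if x > thr, else -sign (labels are +1/-1)."""
    return sign if x > thr else -sign

def best_stump(xs, ys, w):
    """Pick the stump with the lowest weighted training error."""
    best, best_err = None, float("inf")
    for thr in xs:
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if stump(xi, thr, sign) != yi)
            if err < best_err:
                best, best_err = (thr, sign), err
    return best, best_err

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    w = [1.0 / n] * n                      # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        (thr, sign), err = best_stump(xs, ys, w)
        err = max(err, 1e-10)              # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, sign))
        # Reweight: misclassified points gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * yi * stump(xi, thr, sign))
             for xi, yi, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Strong classifier: weighted vote of the weak learners."""
    return 1 if sum(a * stump(x, t, s) for a, t, s in ensemble) > 0 else -1

xs = list(range(10))
ys = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]  # toy labels, separable at x = 3.5
model = adaboost(xs, ys)
print([predict(model, x) for x in xs] == ys)  # True
```

Gradient boosting (as used by GBTClassifier later in this paper) differs in that each new learner fits the residual error of the current ensemble rather than a reweighted dataset, but the "add weak learners to build a strong one" structure is the same.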

4 Architecture

Figure 2 shows the call churn prediction architecture, which takes customer information as input and predicts churn. Each customer generates several streams of events, for example a stream of billing actions, streams of call records kept by the company and the subscriber, SMS, complaints, etc. The Integrator module integrates all of these streams and generates a logically unique stream of events. The Data Manager module manages the queue of pending customer predictions and the customer information database. The predictions queue contains all predictions awaiting refusal or confirmation in the future, whereas the customer information database contains all of the basic information about the subscribers (name, age, type of contract, address, etc.). The Record Processor is the soul of the system. It receives the stream of events generated by the Integrator and uses those events for two purposes: first, it updates the customer information database according to each event; second, it generates records from events using information from the customer information database. The Record Processor builds, maintains, and applies the predictive models; it therefore contains the machine learning or data mining algorithms that make prediction possible. Thus, the Record Processor module produces predictions and profiles of the predicted churners. The churner IDs and profiles produced by the Record Processor are passed to the user interface or to the other parts of the customer management system (CMM) so that adequate actions can be assessed and performed.
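The Integrator's job of producing one logically unique, time-ordered event stream can be illustrated with a tiny single-process sketch using `heapq.merge`. The stream contents, timestamps, and field layout are hypothetical; a production integrator would consume distributed, possibly out-of-order streams:

```python
import heapq

# Per-source event streams, each already ordered by timestamp (first field).
billing = [(1, "billing", "invoice issued"), (5, "billing", "payment received")]
calls = [(2, "call", "outbound 120s"), (4, "call", "inbound 30s")]
complaints = [(3, "complaint", "billing dispute")]

# heapq.merge lazily merges the sorted streams into one ordered stream,
# which is what the Record Processor then consumes event by event.
unified = list(heapq.merge(billing, calls, complaints))
print([t for t, _, _ in unified])  # [1, 2, 3, 4, 5]
```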

Fig. 2 Framework architecture of call churn prediction

Figure 3 shows the establishment of the MongoDB-Spark connection. The MongoDB Spark Connector is an open-source project for reading and writing MongoDB data with Apache Spark. The connector offers features such as converting a MongoDB collection into a Spark RDD and methods to load collections directly into a Spark DataFrame or Dataset. MongoDB helps us develop applications faster because tables and stored procedures are no longer required. This is an advantage for developers because, previously, tables had to be translated into an object model before they could be used in an application. Now the stored data and the object model have the same JSON-like structure, known as BSON. MongoDB provides various options, supports scalability, and maintains data consistency.
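A typical way to wire the connector is through Spark configuration at submit time. The package coordinates, option names, and URIs below are assumptions (they vary across connector and Spark versions) and are shown only to illustrate the shape of the configuration, not the paper's actual setup:

```
# spark-submit options (MongoDB Spark Connector 10.x-style option names; adjust to your versions)
--packages org.mongodb.spark:mongo-spark-connector_2.12:10.1.1
--conf spark.mongodb.read.connection.uri=mongodb://127.0.0.1/telecom.customers
--conf spark.mongodb.write.connection.uri=mongodb://127.0.0.1/telecom.predictions
```

With these options set, collections can be loaded into a DataFrame without any per-record translation code, which is the developer advantage described above.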

Fig. 3 MongoDB Spark Connection establishment

5 Methodology

5.1 Dataset

We chose to study a dataset from a telecom company that includes data on millions of mobile customers active at a particular point in time. We initially extracted variables from the customer database, including contract and mobile plan information, usage, account length, area, billing, and product holding information. It also includes customer-care inbound/outbound information (Table 1).

Table 1 List of attributes with its type in the dataset

6 Pseudo Code

  1. Loading the data from MongoDB

     $$\text{data} = \text{load}()$$

  2. Create training and testing data

     $$\text{train}, \text{test} = \text{data.split}(0.8, 0.2)$$

  3. Check for correlation among columns

     $$\text{train.correlation}()$$

  4. Converting string values to numeric values

     $$\text{val} = \text{StringIndexer}(\text{train})$$

  5. Acquiring the vectors

     $$\text{vector} = \text{VectorAssembler}(\text{train})$$

  6. Creating a pipeline

     $$\text{Pipeline}(\text{val}, \text{vector}, \text{GBTClassifier})$$

  7. Training the model

     $$\text{model} = \text{pipeline.fit}(\text{train})$$

  8. Predicting the test data

     $$\text{predict} = \text{model.transform}(\text{test})$$

  9. Displaying the predicted data

     $$\text{print}(\text{predict})$$
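The steps above map naturally onto PySpark's ML pipeline API. The sketch below is a minimal illustration under assumed column names (the label and feature columns are placeholders), and the read format depends on the MongoDB connector version; it is not the paper's exact code and requires a running Spark session with the connector on the classpath:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("churn").getOrCreate()

# Step 1: load the data (format/options depend on the connector version used).
data = spark.read.format("mongodb").load()

# Step 2: split into training and testing sets.
train, test = data.randomSplit([0.8, 0.2], seed=42)

# Steps 4-6: index a string column, assemble a feature vector, build the pipeline.
indexer = StringIndexer(inputCol="intl_plan", outputCol="intl_plan_idx")      # assumed column
assembler = VectorAssembler(
    inputCols=["account_length", "day_minutes", "intl_plan_idx"],             # assumed columns
    outputCol="features",
)
gbt = GBTClassifier(labelCol="churn", featuresCol="features")
pipeline = Pipeline(stages=[indexer, assembler, gbt])

# Steps 7-9: fit on the training split, transform the test split, display output.
model = pipeline.fit(train)
predictions = model.transform(test)
predictions.select("churn", "prediction").show()
```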

7 Results Analysis

We evaluated the performance of our churn prediction model using a training set of customer information collected from the link (https://bigml.com/user/cesareconti89/gallery/dataset/58cfbada49c4a13341003cba). Each customer is classified according to churn susceptibility. The testing dataset is monitored frequently and updated with the latest information; in this way, we replicate the real-world scenario of churn prediction. The input to the model is the customer information database, which contains all of the information about each customer (state, account length, area code, phone number, etc.); all of this information is used to predict churn.

In Fig. 4, churn is predicted for the customers: a churn value of 1 means 'True,' and 0 means 'False.' Churn is computed for each customer, and the overall churn percentage is derived from the counts. In our model, 136 of the 932 customers are predicted to churn, a churn rate of about 14.6%. The bar plot of the churn predictions for the 932 customers is shown in Fig. 5.
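The reported churn rate follows directly from the prediction counts; a quick check using the paper's numbers (136 predicted churners out of 932 customers) with Python's `Counter`:

```python
from collections import Counter

# Predicted churn flags for the 932 customers: 136 ones, 796 zeros.
predictions = [1] * 136 + [0] * 796

counts = Counter(predictions)
churn_rate = counts[1] / sum(counts.values())
print(f"{churn_rate:.1%}")  # 14.6%
```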

Fig. 4 Analysis of the outcomes of churn

Fig. 5 Bar plot for the predicted churns

A churn prediction model should be measured by its ability to identify churners for marketing purposes. We therefore used the Receiver Operating Characteristic (ROC) to test the efficiency of this model.


The churn prediction system is tested using the Receiver Operating Characteristic (ROC). Under this test, the measured model efficiency is 74.1%: roughly, if the model predicts churn for four customers, about three of those predictions are correct and one may be wrong.
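The ROC-based efficiency can be read as an area-under-the-curve (AUC) style measure: the probability that a randomly chosen churner receives a higher model score than a randomly chosen non-churner. A small self-contained sketch of that computation (the scores and labels below are illustrative, not the paper's data):

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # model scores: higher = more likely to churn
print(auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect ranking, so the model's 74.1% sits meaningfully above chance.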

8 Conclusion

The telecommunication industry has suffered heavy losses from high churn rates and the large-scale churning of customers. Though some loss is unavoidable, churn can be managed and kept at an acceptable level. This research conducts an experimental investigation of customer churn prediction based on a real-world dataset. In this paper, a boosting algorithm is used to predict customer churn, and the evaluation of the developed model shows that it is effective at predicting churn. There is still substantial work to do from both the business and the technical points of view. On the one hand, the performance and efficiency of the model can be further improved, and other classification methods can be applied and compared. On the other hand, churn prediction only provides a basis for generating and prioritizing lists of customers to contact; identifying the reasons behind customers' churn behavior and addressing customer needs are also essential for targeted marketing.