1 Introduction

Relations between objects in various systems are commonly modelled by networks: hyperlinks connecting Web pages, paper citations, e-mail conversations or interactions in social portals. Network models, in turn, serve as a basis for different kinds of processing and analysis. One of them is node classification, i.e. labelling of the nodes in a network. Node classification has a deep theoretical background; however, due to new phenomena appearing in artificial environments such as social networks on the Internet, the problem is currently being re-invented and re-implemented.

Nodes in a network may be classified either by inference based on their known profiles (the standard concept of classification) or based on relational information derived from the network. The second approach utilizes information about connections between nodes (the structure of the network) and can be very useful in assigning labels to the nodes being classified. For example, it is very likely that a given Web page x is related to sport (label sport) if x is linked by many other Web pages about sport.

One of the strong motivations to use the relational model is the ability to capture dependencies between correlated observations. There is an intuitive desire to use information about one object to reach conclusions about other, related objects. For example, a Web network should be able to propagate information about the topic of a document to other documents that link to it; the propagation, following successive degrees of the neighbourhood, can reveal the topical structure of the document network. In this paper, an algorithm using this process is proposed, in accordance with the principle of relational influence propagation [13].

Hence, a form of collective classification should be provided, with decisions on all nodes' labels made simultaneously rather than classifying each node separately. Such an approach makes it possible to take into account correlations between connected nodes, which carry usually undervalued knowledge.

Moreover, the continuing explosion of data in transactional systems requires more sophisticated methods to analyse enormous amounts of data. There is a strong need to process large data in parallel, especially in complex analyses like collective classification.

In this paper, a MapReduce approach to collective classification, able to perform processing on huge data sets, is proposed and examined. The proposed method handles massive data thanks to parallel and distributed processing of the Relational Influence Propagation algorithm. Section 2 covers related work, while Sect. 3 presents the proposed MapReduce approach to label propagation in a network. Section 4 contains a description of the experimental setup and the obtained results. The paper is concluded in Sect. 5.

2 Related work

2.1 Relational influence propagation

Collective classification problems may be solved using two main approaches: within-network and across-network inference. Within-network classification, in which training entities are connected directly to the entities whose labels are to be classified, contrasts with across-network classification, where models learnt from one network are applied to another, similar network [4]. Across-network classification can be understood as a transfer learning approach that accomplishes relational classification [5]. Overall, networked data have several unique characteristics that simultaneously complicate and provide leverage to learning and classification.

One of the methods using the within-network approach is relational influence propagation. The main idea is based on iterative propagation of the known labels in a network to the non-labelled nodes [3]. The method was originally derived from enhanced hypertext categorization [1]. Due to the profile of the method, its accuracy strictly depends on the sampling of the training set: a few false inference steps (propagation of initial information) can lead to a "snowball" effect [3]. Generally, the final result of this class of algorithms depends on the network sampling method, since the independent and identically distributed assumption made by standard classification and clustering models is inappropriate in the complex relational domain [6].

Among others, several statistical relational learning (SRL) techniques have been introduced, including probabilistic relational models, relational Markov networks and probabilistic entity-relationship models [7, 8]. Two distinct types of classification in networks may be distinguished: one based on a collection of local conditional classifiers and one based on classification stated as a single global objective function. The best-known implementations of the first approach are iterative classification (ICA) and Gibbs sampling (GS), whereas examples of the latter are loopy belief propagation (LBP) and mean field relaxation labelling (MF) [9]. In general, the second group of algorithms represents the idea of label propagation: starting from the labelled nodes, each node propagates its known label to its unlabelled neighbours, and the process is repeated until convergence.

2.2 MapReduce programming model

Processing of large data requires a parallel computational model and parallel execution. To address these requirements, the MapReduce programming model may be employed. MapReduce provides means for data processing derived from functional languages [10] and is dedicated to solving complex and distributable problems [11–13]. It utilizes a large number of nodes, hereafter collectively referred to as a cluster. MapReduce breaks the processing into two consecutive phases: the Map phase and the Reduce phase.

The first phase commences with splitting the data into separate chunks. According to the input file configuration, each data chunk must conform to the \(\langle {key, value}\rangle \) format. The data are then processed by a Map function, which takes such an input pair. Assuming independence between Map function invocations, the processing may be conducted in parallel. Each computational node in the cluster may execute multiple Map functions. The aim of the Map function is to propagate the processed input data to the next phase, again in the \(\langle {key, value}\rangle \) format. Before the next step of processing begins, all results from the Map functions are sorted (the shuffle in MapReduce). This allows the data to be split according to the value of the key into separate chunks, one chunk per key. Each data chunk, provided in the \(\langle {key, list(value)}\rangle \) format, is then processed by a separate reducer in the Reduce phase. The reducer implements the final processing and emits \(\langle {key, value}\rangle \) pairs as results, which are saved as the output of processing. Usually, each reducer outputs one pair.

Both the Map and Reduce phases need to be specified and implemented by the user [14, 15]. The aforementioned process is presented in Fig. 1.

Fig. 1 The MapReduce programming model

Thanks to the initial data splitting, the MapReduce programming model is able to process data sets too large for other models. The most common open-source implementation of MapReduce is the Apache Hadoop library [16]. Apache Hadoop is a framework that allows distributed processing of large data sets. Such processing can be performed across clusters of computers offering local computation and storage. The architectural solutions of Hadoop deliver high availability and robustness, not only because of hardware properties but also due to failure handling in the application layer. Moreover, data replication helps to retain a fault-tolerant and highly reliable computational environment. A single MapReduce phase in Hadoop is called a Job. A Job consists of the map method, the reduce method, the input files and a configuration.
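To make the structure of a Job concrete, the following is a minimal sketch of a Hadoop Job in Java. The input format (tab-separated node identifiers and scores) and all class names are hypothetical; the sketch only illustrates how the map method, the reduce method, the input files and the configuration fit together, not the authors' actual implementation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class ExampleJob {

  // Map phase: consumes <key, value> input pairs and emits intermediate pairs.
  public static class ExampleMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      // Hypothetical line format: "nodeId<TAB>score"
      String[] parts = value.toString().split("\t");
      ctx.write(new Text(parts[0]),
                new DoubleWritable(Double.parseDouble(parts[1])));
    }
  }

  // Reduce phase: receives <key, list(value)> after the shuffle, emits final pairs.
  public static class ExampleReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0.0;
      for (DoubleWritable v : values) sum += v.get();
      ctx.write(key, new DoubleWritable(sum)); // usually one output pair per key
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The Job: map method + reduce method + input files + configuration.
    Job job = Job.getInstance(conf, "example-job");
    job.setJarByClass(ExampleJob.class);
    job.setMapperClass(ExampleMapper.class);
    job.setReducerClass(ExampleReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```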

3 Collective classification by means of relational influence propagation using MapReduce

The most common way to utilize information about labelled and unlabelled data is to construct a graph from the data and perform a Markov random walk on it. The idea of the Markov random walk has been used many times [17–19] and involves defining a probability distribution over the labels for each node in the graph. In the case of labelled nodes, the distribution reflects the true labels; the aim is then to recover this distribution for the unlabelled nodes. Such a label propagation approach allows classification to be performed based on relational data.

In this paper we assume, similarly to [3], that the iterative classification algorithm works on a relational data structure consisting of nodes and ties between them. In our approach, relational influence propagation is realized iteratively and is based on the physical model of harmonic energy minimization presented in [20]. The main idea of the algorithm is to maintain the potential balance between nodes in a network. Assuming that the potential represents the label probability, node labels in this model are distributed along weighted arcs in the graph structure. Our proposal is additionally equipped with an improved conception of dummy nodes, initially proposed in [18]. Because the propagation of information about known labels needs to preserve balance in the whole network, the label probability may change even in the initially chosen informative nodes: due to the impact of incoming connections, their labels may eventually be altered, which is not at all desirable. Therefore, the original labels in such nodes need to be kept, and this is achieved by dummy nodes.

Let \(G(V,E,W)\) denote a graph with vertices V, arcs E and an \(n \times n\) arc weight matrix W. According to [18], in a weighted graph G(V,E,W) with \(n=|V|\) vertices, label propagation may be solved by the linear Eqs. 1 and 2.

$$\begin{aligned} \forall {i,j}\in V \quad \sum _{(i,j)\in E}w_{ij} P_i = \sum _{(i,j)\in E}w_{ij} P_j \end{aligned}$$
(1)
$$\begin{aligned} \forall {i}\in V \quad \sum _{c \in {\text{ classes}}(i)} P_i(c) = 1 \end{aligned}$$
(2)

where \(w_{ij} \in W\) and \(P_i\) denotes the class likelihood for the ith node.

Let us assume that the set of nodes V is partitioned into labelled \(V_L\) and unlabelled \(V_U\) vertices, \(V=V_L \cup V_U.\) Let \(P_u\) denote the probability distribution over the labels associated with vertex \(u \in V.\) For each node \(v \in V_L,\) for which \(P_v\) is known, a dummy node \(v{^{\prime }}\) is inserted such that \(w_{v^{\prime }v}=1\) and \(P_{v^{\prime }} = P_v.\) This operation is equivalent to ‘clamping’ discussed in [18]. Let \(V_D\) be the set of dummy nodes. Then Eqs. 1 and 2 can be solved by the Iterative Label Propagation procedure (Algorithm 1).

Algorithm 1: Iterative Label Propagation

The dummy nodes are therefore artificial nodes added to the graph and connected to the labelled nodes; each labelled node in the network receives one dummy node, which carries the original label of its labelled parent. During the iterative calculation of the propagation algorithm, the label probability of a dummy node cannot change, as each dummy node is connected by an arc directed from the dummy node to the original labelled node; see Fig. 2. This is the opposite of the proposal for preserving original label values presented in [20], where the arc between a dummy node and the corresponding labelled node was directed from the real node to the dummy. With the direction of [20], nothing prevents the original label from changing, since only incoming arcs may influence the class probability. We therefore propose to model the direction of influence from a dummy node to its corresponding labelled node; the dummy node then keeps the true label of the original node according to Eq. 3.

$$\begin{aligned} P_i&= \frac{w_{i^{\prime }i}P_{i^{\prime }} + \sum _{(j,i)\in E}w_{ji} P_j}{w_{i^{\prime }i}+\sum _{(j,i)\in E}w_{ji}} \\ w_{i^{\prime }i}&= 1, \quad \sum _{(j,i)\in E}w_{ji} = 1 \\ P_i&= \frac{P_{i^{\prime }}+\sum _{(j,i)\in E}w_{ji} P_j}{2} \end{aligned}$$
(3)

where \(P_i\) denotes the probability of classes for node i, and \(i^{\prime }\) denotes the dummy node corresponding to node i. When \(P_{i^{\prime }} = 1\) or \(P_{i^{\prime }} = 0,\) which is the common initial situation in a crisp-label classification problem, the probability \(P_i\) will remain \(P_i\approx P_{i^{\prime }}.\) This means that an initially known label of 1 or 0 will not be changed during the algorithm iterations.
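A minimal sketch of this dummy-node construction, assuming an in-memory graph held as weighted adjacency lists; the types, ids and field names are our own illustration, not part of the original implementation:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Weighted arc u -> v; a hypothetical in-memory representation. */
record Arc(long target, double weight) {}

final class DummyNodes {
  /**
   * For every labelled node v, insert a dummy node v' with a single arc
   * v' -> v of weight w_{v'v} = 1 and with P_{v'} = P_v.
   */
  static void insert(Map<Long, List<Arc>> outArcs,       // adjacency lists, keyed by node id
                     Map<Long, double[]> p,              // P_v for the labelled nodes
                     Set<Long> labelled) {
    long nextId = Collections.max(outArcs.keySet()) + 1; // fresh ids for V_D
    for (long v : labelled) {
      long dummy = nextId++;
      outArcs.computeIfAbsent(dummy, k -> new ArrayList<>())
             .add(new Arc(v, 1.0));                      // arc directed dummy -> labelled node
      p.put(dummy, p.get(v).clone());                    // the dummy keeps the original label
    }
  }
}
```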

Fig. 2 Two situations of network balancing using Iterative Label Propagation: a graph whose structure cannot be balanced and a graph with added dummy nodes (black nodes) that can be balanced

In general, using the proposed dummy nodes, we are able to avoid situations where the network cannot be balanced. As an example, consider Fig. 2. In the first configuration, with two nodes and two arcs between them, it is impossible to balance the network: each iteration of the algorithm will shift the probabilities (relevance of labels) between the two nodes and no stabilization will be achieved. What is more, in this situation the originally known labels of the nodes are changed. After adding dummy nodes, as in the second configuration in Fig. 2, it is possible to balance the network, and the balance is reached within a few iterations.

As can be observed, each iteration of Iterative Label Propagation performs certain operations on each of the nodes. These operations are calculated based on local information only, namely the node's neighbourhood. This fact can be exploited in a parallel version of the algorithm; see Algorithm 2.

Algorithm 2: MapReduce Iterative Label Propagation

The MapReduce version of the Iterative Label Propagation algorithm consists of two phases. The Map phase takes all labelled and dummy nodes and propagates their labels to all nodes on their adjacency lists, taking into account the arc weights between nodes. The Reduce phase calculates a new label for each node with at least one labelled neighbour, based on the list of labelled neighbours and the relation strength between nodes (the arc weight). The final result, namely the new label of a particular node, is computed as the weighted sum of the label probabilities in its neighbourhood.
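The sketch below shows how a single propagation iteration could be expressed as one map-reduce pair in Hadoop. It is a simplification of the actual six-Job pipeline (cf. Table 1 in Sect. 4): a single class probability per node is assumed for brevity, and the text-based graph format is hypothetical; a real pipeline would also normalize the output format between iterations.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * One propagation iteration. Hypothetical input line format:
 *   "nodeId p target:weight,target:weight,..."
 * where p is the current class probability (-1 when unknown).
 */
public class PropagationStep {

  public static class PropagateMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(" ");
      long node = Long.parseLong(f[0]);
      double p = Double.parseDouble(f[1]);
      if (p >= 0 && f.length > 2) {
        for (String arc : f[2].split(",")) {              // propagate the known label
          String[] tw = arc.split(":");                   // along every outgoing arc
          ctx.write(new LongWritable(Long.parseLong(tw[0])),
                    new Text("L " + p + " " + tw[1]));
        }
      }
      ctx.write(new LongWritable(node), new Text("N " + line)); // keep the node record
    }
  }

  public static class PropagateReducer
      extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable node, Iterable<Text> msgs, Context ctx)
        throws IOException, InterruptedException {
      double num = 0.0, den = 0.0;
      String record = null;
      for (Text t : msgs) {
        String[] f = t.toString().split(" ", 2);
        if (f[0].equals("L")) {                           // "p weight" from a labelled neighbour
          String[] pw = f[1].split(" ");
          num += Double.parseDouble(pw[0]) * Double.parseDouble(pw[1]);
          den += Double.parseDouble(pw[1]);
        } else {
          record = f[1];                                  // the original node record
        }
      }
      if (record == null) return;                         // message for a node not in the file
      String[] f = record.split(" ", 3);
      // Weighted mean over labelled in-neighbours (cf. Eq. 3); unchanged if none reached us.
      String p = den > 0 ? String.valueOf(num / den) : f[1];
      ctx.write(node, new Text(p + (f.length > 2 ? " " + f[2] : "")));
    }
  }
}
```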

4 Experiments and results

To present the profile of the proposed method, a telecommunication network was built from a 3-month history of phone calls from a leading European telecommunication company. The original data set consisted of about 500,000,000 phone calls and more than 16 million unique users.

All communication facts (phone calls) were performed using one of 38 tariffs, grouped into 4 types. In order to establish a binary classification task, only phone calls of two of these types were extracted and utilized in the experiments.

Users were labelled with the class conditional probability of tariffs: the sum of outcoming phone call durations in a particular tariff was divided by the summarized duration of all outcoming calls. Eventually, the final data set consisted of 12,787,114 users.

To test the method, an initial sampling of known labels in the network, from which the unknown labels were to be populated, was required. Therefore, the data set was partitioned into two disjoint subsets: a training set and a testing set. The training set consisted of 7,163,227 users who made calls in the first and second month. The testing set consisted of 1,522,759 users who made calls only in the third month. Additionally, there were 5,566,376 unknown users in the network who did not make any calls; they can be regarded as clients of other, external telecommunication providers.

To build the relational structure of the data required by the proposed algorithm, detailed billing information was used and transformed into the network. The users' network was calculated using Eq. 4 to obtain a connection strength between particular users:

$$\begin{aligned} w_{ij}=\frac{2\cdot d_{ij}}{d_i+d_j} \end{aligned}$$
(4)

where \(d_{ij}\) denotes the summarized duration of calls between users i and j, \(d_i\) the summarized duration of the ith user's outcoming calls and \(d_j\) the summarized duration of the jth user's incoming calls. The resulting network was composed of 67,184,654 weighted arcs between the aforementioned users.
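For illustration, Eq. 4 translates directly into a one-line function; the aggregation of billing records into the durations \(d_{ij}\), \(d_i\) and \(d_j\) is assumed to happen upstream, and the class name is ours:

```java
final class ConnectionStrength {
  /**
   * Connection strength w_ij between users i and j (Eq. 4), where
   * dIJ is the summarized duration of calls between i and j,
   * dI the summarized duration of i's outcoming calls,
   * dJ the summarized duration of j's incoming calls.
   */
  static double weight(double dIJ, double dI, double dJ) {
    return (2.0 * dIJ) / (dI + dJ);
  }

  public static void main(String[] args) {
    // e.g. 120 s of mutual calls, 600 s outcoming for i, 300 s incoming for j
    System.out.println(weight(120, 600, 300)); // prints 0.2666...
  }
}
```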

The goal of the first experiment was to predict the class conditional probability of the tariff for unlabelled users. Using the previously selected nodes for training, the MapReduce Iterative Label Propagation method was employed. After 35 iterations, 12,787,114 labels were obtained; however, only 399,075 labels from the testing set could be evaluated. This revealed that some of the nodes from the testing set were not reachable in label propagation: these nodes had no incoming connection through which a label could be propagated. In other words, 1,123,684 nodes in the testing set had outcoming connections (and therefore true labels; see Eq. 4) but no incoming connection. This situation is illustrated in Fig. 3. Starting from the training set, labels are propagated to connected nodes in the testing set and further to nodes belonging to an external provider. However, the black node in the testing set cannot be reached in the iterative process as it has no incoming arcs; thus, this node remains unlabelled. As a consequence, some nodes of the external provider may similarly remain unlabelled. In the experiment, there were 341,564 such unreachable nodes (cf. the black node in Fig. 3) in the external provider set.

Fig. 3 Distinct types of nodes in Iterative Label Propagation: labelled and unlabelled; training, testing and external provider nodes

The Iterative Label Propagation algorithm was implemented in the MapReduce programming model. It consists of six Jobs, each comprising a map and a reduce phase. A detailed description of the Jobs is presented in Table 1. The convergence criterion of the algorithm was controlled by the \(\epsilon \) coefficient, a threshold on the change of the class conditional probability of each node; the algorithm iterated as long as these changes were greater than \(\epsilon \). In all experiments, \(\epsilon \) was set to 0.001. The algorithm was implemented in the Hadoop distributed system [16] and was run on a computer cluster composed of six nodes: one master and five slave machines. Each machine contained 8 CPUs, 20 GB RAM and a 160 GB hard disk. The master was configured with two mappers and two reducers, and the slave machines were configured with three mappers and three reducers each. The experiment was organized to examine the average computational time devoted to each of the map-reduce steps as well as to measure the accuracy.

Table 1 MapReduce jobs implemented in the algorithm
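The outer driver loop, iterating the Jobs until the \(\epsilon \) criterion holds, could be organized roughly as sketched below; the Job factory method, the counter group and the counter name are hypothetical placeholders for the six Jobs of Table 1.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class IlpDriver {
  static final double EPSILON = 0.001;                  // threshold on probability change

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setDouble("ilp.epsilon", EPSILON);             // read by the reducers
    long unstable = Long.MAX_VALUE;
    for (int iter = 0; unstable > 0; iter++) {
      // buildIteration would chain the six Jobs of Table 1; every reducer
      // increments the (hypothetical) counter when a node's class conditional
      // probability changes by more than epsilon.
      Job job = buildIteration(conf, iter);
      if (!job.waitForCompletion(true)) {
        throw new IllegalStateException("iteration " + iter + " failed");
      }
      unstable = job.getCounters().findCounter("ILP", "UNSTABLE_NODES").getValue();
    }
  }

  static Job buildIteration(Configuration conf, int iter) throws Exception {
    Job job = Job.getInstance(conf, "ilp-iteration-" + iter);
    // mapper/reducer classes and input/output paths per Table 1 (omitted here)
    return job;
  }
}
```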

The computation time was measured over 120 iterations and is presented in Table 2. On average, one iteration in the given experimental scenario took about 17 min. Moreover, during the first 35 iterations of the algorithm, the number of unstable (in terms of the \(\epsilon \) threshold) nodes was measured. Additionally, the F measure was recorded.

Table 2 Average execution time in [s] for particular map-reduce jobs

As observed in Fig. 4, after about 15 iterations of the algorithm, the number of unstable nodes stabilized around 1,000 while maintaining a decreasing trend. After the sixth iteration, there was no further improvement in the F measure, which was in general very low. This revealed that either the Iterative Label Propagation algorithm was not a good model for tariff modelling in the telecommunication data or the sampling method used to derive the training set was inappropriate.

Fig. 4 Number of nodes with not stabilized labels and F measure obtained after a particular number of algorithm iterations

Therefore, other sampling approaches to obtaining the training set were examined. We considered a sampling method based on the standard measure of a node's degree: for each node \(v \in V,\) the number of outcoming connections \(deg^{+}(v),\) called the out-degree measure, was calculated. According to this structural property, the original data set was partitioned into the training and testing sets, as in the formula presented in Eq. 5.

$$\begin{aligned} \forall v \in V_{\text{ labelled}} {\left\{ \begin{array}{ll} v \in {\text{ TrainingSet}}, &{} {\text{ deg}}^{+}(v) \ge d \\ v \in {\text{ TestSet}}, &{} {\text{ deg}}^{+}(v) < d \end{array}\right.} \end{aligned}$$
(5)

In this setting, nodes with out-degree greater than or equal to d were assigned to the training set, and the rest to the test set.
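Given a precomputed out-degree per node, the partitioning of Eq. 5 amounts to a one-pass filter, as in the following sketch (the container types and names are illustrative):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class DegreeSampling {
  /** Partition the labelled nodes by the out-degree threshold d (Eq. 5). */
  static Map<String, Set<Long>> split(Map<Long, Integer> outDegree,
                                      Set<Long> labelled, int d) {
    Set<Long> training = new HashSet<>();
    Set<Long> test = new HashSet<>();
    for (long v : labelled) {
      if (outDegree.getOrDefault(v, 0) >= d) {
        training.add(v);                                // deg+(v) >= d
      } else {
        test.add(v);                                    // deg+(v) < d
      }
    }
    return Map.of("training", training, "test", test);
  }
}
```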

To examine the proposed sampling method, three distinct sampling scenarios were designed. As the method is based on the out-degree structural measure, three distinct thresholds \(d\) were established: 3, 6 and 9, used in the samplings S3, S6 and S9, respectively. The values of d were selected based on the histogram of the out-degree measure. The results obtained in the S3, S6 and S9 experiments are presented in Figs. 5 and 6.

Fig. 5 Number of nodes with not stabilized labels after a particular number of algorithm iterations in three distinct sampling methods S3, S6 and S9

Fig. 6 F measure for both binary classes obtained after a particular number of algorithm iterations in three distinct sampling methods S3, S6 and S9

According to the results presented in Fig. 5 and Table 3, the fastest converging scenario was the one with S3 sampling. The F measure obtained in all three scenarios (results gathered in Table 4) was dramatically higher than previously reported (compare Fig. 6 with Fig. 4). Thus, with the second proposed sampling approach, we were able to obtain quite satisfactory node classification results. It can be observed that the S3 sampling method provides the best results for class 1 among the examined sampling methods. This follows from the way the training and testing data sets are constructed according to Eq. 5: the network used in the experiments had a power law degree distribution, so the smaller the value of d, the bigger the training set and the smaller the test set. As a greater number of nodes with known labels in the graph makes within-network classification easier, S3 sampling provided the best results.

Table 3 Number of nodes with not stabilized labels after a particular number of algorithm iterations in three distinct sampling methods S3, S6 and S9
Table 4 F measure obtained after a particular number of algorithm iterations in three distinct sampling methods S3, S6 and S9

5 Conclusions

The problem of collective classification using the MapReduce programming model and the relational influence propagation approach was considered in this paper. We examined the algorithm's behaviour on a large network in a parallel environment that can perform complex calculations on large data sets. The proposed method was evaluated on a real data set from the telecommunication domain. The results indicated that it could be used to classify the network's users in order to propose new offerings or tariffs to them. The experiments revealed that with appropriate network sampling methods we were able to improve classification accuracy; the problem of network sampling is probably one of the most important in learning in the relational domain.

Further experimentation will consider a comparison of the presented method with other approaches. Moreover, further studies on much larger data sets will be conducted. Additionally, the data set sampling problem will be examined with respect to node attributes.