
1 Introduction

In several pattern recognition applications, there is a need to define an element-to-element relation between two objects. This process, commonly referred to as “matching”, has been applied to data structures such as sets of points [1], strings [2], trees [3] and, most notably, graphs [4–8]. While this previous work demonstrates a long-standing effort to increase the quality of structural matching methods, it also gives rise to scenarios in which two or more parties, having applied different matching algorithms, produce several matching solutions. These solutions, henceforth referred to as “correspondences”, may be the result of the several existing methodologies or of different parameterisations of those methodologies, each of which generates a different set of mappings between the elements of an output data structure and the elements of an input data structure.

Given a set of objects, their median is defined as the object that has the smallest sum of distances (SOD) [9, 10] to all objects in the set [11]. From this definition, we can distinguish the generalised median (GM) from the set median; the difference lies in the space where each median is searched for. In the first case there are no restrictions on the search, while in the second the exploration space is restricted to the elements of the set.

Due to its robustness, the concept of the GM has been used to deduce the representative prototype of a set of data structures [12] and for clustering ensemble purposes [13], on data structures such as strings [14], graphs [15] and data clusters [16]. For these data structures (and for correspondences as well), the GM computation turns out to be an NP-complete problem. This drawback has led to a variety of methods developed solely for the GM approximation, such as a genetic search method [11] or approximations through the weighted mean of a pair of strings [17], graphs [18] or data clusters [19]. More recently, a method known as the Evolutionary method [20] has been presented, offering a good trade-off between accuracy and runtime, which makes it one of the most viable options for GM approximation in most domains.

In this paper, we present two methodologies to obtain the GM of a set of correspondences. The first one is based on a voting and minimisation process, and the second one is based on the Evolutionary method adapted to the correspondence case. Notice that the calculation of a representative prototype of a set of correspondences has been studied before in what we have called the “correspondence consensus frameworks” [21–25]. Nonetheless, there are some differences between approximating the GM and these consensus frameworks, the most important one being the function to be minimised. As mentioned above, the GM computation aims to minimise the SOD, whereas in the consensus frameworks the minimised function may also incorporate other criteria, such as the cost defined on the correspondences or the structural information of the mapped data structures.

The rest of the paper is structured as follows. In Sect. 2, we introduce the basic definitions. In Sects. 3 and 4, we present both methods. In Sect. 5, we compare them and evaluate the results in terms of the distance to the ground truth GM and the runtime. Finally, Sect. 6 is reserved for conclusions and further work.

2 Basic Definitions

Let us represent any kind of data structure as \( G = (\varSigma , \gamma ) \), where \( v_{i} \in \varSigma \) is an element inside the structure (elements can be, for instance, characters for strings, or nodes and edges for trees and graphs), and \( \gamma \) is a function that maps each element to a set of attributes. To allow maximum flexibility in the matching process, these structures may have been extended with null elements (represented as \( \Phi \)), which have a set of attributes that differentiate them from the rest of the elements.

Given \( G = (\varSigma , \gamma ) \) and \( G '= (\varSigma ', \gamma ') \) of the same order \( N \) (naturally or due to the aforementioned null element extension), we define the set of all possible correspondences \( T_{{G,G^{ '} }} \) such that each correspondence in this set maps elements of \( G \) to elements of \( G ' \), \( f:\varSigma \to \varSigma ' \), in a bijective manner. Henceforth, we write \( T \) instead of \( T_{{G,G^{ '} }} \). Moreover, consider a subset of correspondences \( S \subseteq T \) between the same pair of structures that have been produced using different matching approaches.

Let \( f^{1} \) and \( f^{2} \) denote two different correspondences within \( S \). We can quantify how dissimilar these two correspondences are through the Hamming distance (HD), which counts the number of different mappings between \( f^{1} \) and \( f^{2} \). More formally, the HD is defined as:

$$ Dist_{HD} \left( {f^{1} , f^{2} } \right) = \mathop \sum \limits_{i = 1}^{N} \left( {1 - \partial \left( {v_{x}^{\prime } , v_{y}^{\prime } } \right)} \right) $$
(1)

where \( x \) and \( y \) are such that \( f^{1} \left( {v_{i} } \right) = v '_{x} \) and \( f^{2} \left( {v_{i} } \right) = v '_{y} \), and \( \partial \) is the well-known Kronecker delta function:

$$ \partial (a,b) = \begin{cases} 0 & \text{if } a \ne b \\ 1 & \text{if } a = b \end{cases} $$
(2)
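To make the definition concrete, the HD of Eqs. 1 and 2 can be sketched in a few lines of Python (a minimal illustration under our own encoding assumption: each correspondence is stored as a list where position \( i \) holds the index of the element that \( v_{i} \) is mapped to):

```python
def hamming_distance(f1, f2):
    """Hamming distance between two bijective correspondences:
    the number of positions mapped differently (Eqs. 1 and 2)."""
    return sum(1 for a, b in zip(f1, f2) if a != b)

# f1 maps v0->v'0, v1->v'1, ...; f2 swaps the images of v1 and v2:
print(hamming_distance([0, 1, 2, 3], [0, 2, 1, 3]))  # -> 2
```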

Given a set of \( M \) input correspondences, the GM is the correspondence that has the smallest SOD to all correspondences in the set:

$$ \hat{f} = \mathop {argmin}\limits_{\forall f \in T} \mathop \sum \limits_{i = 1}^{M} Dist_{HD} \left( {f,f^{i} } \right) $$
(3)

If the minimisation to find \( \hat{f} \) is restricted to the elements of \( S \), then the solution is called the set median. Conversely, searching over all elements of \( T \) yields the GM, which is considered a more attractive but more computationally demanding option. As noted, the calculation of a median is closely tied to the distance between the objects involved; hence the importance of defining the HD for the correspondence case.
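For very small \( N \), both medians can be computed exhaustively, which illustrates the difference between the two search spaces (a brute-force sketch on hypothetical toy data, not a practical algorithm; \( T \) is enumerated as all \( N! \) bijections):

```python
import itertools

def sod(f, S):
    """Sum of Hamming distances from candidate f to every correspondence in S."""
    return sum(sum(a != b for a, b in zip(f, g)) for g in S)

def set_median(S):
    """Median restricted to the input set S (search space: S)."""
    return min(S, key=lambda f: sod(f, S))

def generalised_median_bruteforce(S, N):
    """GM by exhaustive search over all N! bijections in T (tiny N only)."""
    return min(itertools.permutations(range(N)), key=lambda f: sod(f, S))

S = [(0, 1, 2), (1, 0, 2), (0, 2, 1)]
print(set_median(S), generalised_median_bruteforce(S, 3))  # -> (0, 1, 2) (0, 1, 2)
```

In this toy set the GM happens to coincide with the set median; in general the GM can lie outside \( S \).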

3 Minimisation Method

The first method presented in this paper is called the Minimisation method; its name refers to the minimisation of the sum of costs of the linear assignment problem (SLAP). To introduce it, consider the following example. Suppose that three separate entities have proposed the correspondences shown in Fig. 1, depicted as \( f^{1} \) (red lines), \( f^{2} \) (blue lines) and \( f^{3} \) (green lines). Notice that, as commented in Sect. 2, the input set has been extended with a null element (\( \varPhi \)) to make the correspondences mutually bijective.

Fig. 1. Example of three correspondences. (Color figure online)

We can represent these correspondences as correspondence matrices \( F^{1} \), \( F^{2} \) and \( F^{3} \), as shown in Fig. 2. These matrices are defined as follows: \( F^{k} \left[ {x,y} \right] = 1 \) if \( f^{k} \left( {v_{x} } \right) = v '_{y} \), and \( F^{k} \left[ {x,y} \right] = 0 \) otherwise.

Fig. 2. Correspondence matrices \( F^{1} \), \( F^{2} \) and \( F^{3} \).

Our method minimises the following expression:

$$ \hat{f} = \mathop {\text{argmin}}\limits_{\forall f \in T} \left\{ {\mathop \sum \limits_{x,y = 1}^{N} [H \circ F]\{ x,y\} } \right\} $$
(4)

where \( \{ x,y\} \) is a specific cell and \( H \) is the following matrix:

$$ H = \mathop \sum \limits_{k = 1}^{M} \left( {{\mathbf{1}} - F^{k} } \right) $$
(5)

with 1 being a matrix of all ones, \( F \) being the correspondence matrix of \( f \in T \) (if \( f\left( {v_{x} } \right) = v '_{y} \) then \( F\{ x,y\} = 1 \), otherwise \( F\{ x,y\} = 0 \)) and \( \circ \) being the Hadamard product.

We deduce \( \hat{f} \) through Eq. 4, but our goal is to minimise the SOD of all correspondences and so obtain the true \( \hat{f} \) (Eq. 3). Therefore, we must show that the correspondence \( \hat{f} \) obtained in Eq. 4 is the same as the one in Eq. 3, i.e. that Eq. 6 holds:

$$ \mathop \sum \limits_{k = 1}^{M} Dist_{HD} \left( {f , f^{k} } \right) = \mathop \sum \limits_{x,y = 1}^{N} \left[ {\left[ {\mathop \sum \limits_{k = 1}^{M} 1 - F^{k} } \right] \circ F} \right]\{ x,y\} $$
(6)

Applying the distributive property of the Hadamard product over matrix addition, the following expression is obtained:

$$ \mathop \sum \limits_{k = 1}^{M} Dist_{HD} \left( {f , f^{k} } \right) = \mathop \sum \limits_{k = 1}^{M} \,\left( {\mathop \sum \limits_{x,y = 1}^{N} \left[ {\left[ {1 - F^{k} } \right] \circ F} \right]\{ x,y\} } \right) $$
(7)

Then, if we show that each individual term satisfies the equality \( Dist_{HD} \left( {f , f^{k} } \right) = \mathop \sum \limits_{x,y = 1}^{N} \left[ {\left[ {1 - F^{k} } \right] \circ F} \right]\{ x,y\} \), then Eq. 7 holds. By definition, the HD counts the number of mappings that differ between the two correspondences; the expression \( \mathop \sum \limits_{x,y = 1}^{N} \left[ {\left[ {1 - F^{k} } \right] \circ F} \right]\{ x,y\} \) counts the same quantity, since it counts the number of times that \( F\{ x,y\} = 1 \) while simultaneously \( F^{k} \{ x,y\} = 0 \).
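The identity of Eq. 6 can also be checked numerically on random permutations (a sanity-check sketch with arbitrary sizes \( N = 6 \), \( M = 4 \) of our choosing, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 4
f = rng.permutation(N)                    # candidate correspondence f
fs = [rng.permutation(N) for _ in range(M)]  # input correspondences f^1..f^M

def corr_matrix(p):
    """Correspondence matrix F: F[x, y] = 1 iff p maps v_x to v'_y."""
    F = np.zeros((N, N))
    F[np.arange(N), p] = 1
    return F

lhs = sum(int(np.sum(f != fk)) for fk in fs)             # SOD of HDs (Eq. 3)
H = sum(np.ones((N, N)) - corr_matrix(fk) for fk in fs)  # voting matrix (Eq. 5)
rhs = int(np.sum(H * corr_matrix(f)))                    # Hadamard form (Eq. 6)
print(lhs == rhs)  # -> True
```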

Notice that by adding all correspondence matrices in Eq. 5, we create a structure similar to a voting matrix [26]. The method thus reduces to solving the linear assignment problem on this voting matrix with any solver, such as the Hungarian method [27], the Munkres algorithm [28] or the Jonker-Volgenant solver [29], as shown in Algorithm 1:
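As an illustrative sketch (not the paper's Algorithm 1 verbatim), the voting-and-minimisation procedure might look as follows, with correspondences stored as integer permutation arrays of our own choosing and SciPy's `linear_sum_assignment` (a Jonker-Volgenant-style solver) standing in for the SLAP solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def gm_minimisation(S, N):
    """Exact GM under the HD via one linear assignment (Eqs. 4 and 5).

    S: list of correspondences, each an integer array f with f[x] = y.
    """
    H = np.zeros((N, N))
    for f in S:
        F = np.zeros((N, N))
        F[np.arange(N), f] = 1          # correspondence matrix F^k
        H += 1 - F                      # accumulate votes against each mapping
    rows, cols = linear_sum_assignment(H)  # minimise sum of H over a bijection
    return cols                            # GM correspondence: x -> cols[x]

S = [np.array([0, 1, 2]), np.array([1, 0, 2]), np.array([0, 2, 1])]
print(gm_minimisation(S, 3))  # -> [0 1 2]
```

For this toy set, the returned correspondence has SOD 4 to the inputs, matching an exhaustive search over all 3! bijections.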

Since all three solvers have been proven to obtain an optimal value, this method is guaranteed to obtain the exact GM, given that only first-order information is involved and no second-order information is considered; that is, we do not take into account relations between the mapped elements inside their sets. Figure 3 shows the GM correspondence obtained for this practical example.

Fig. 3. GM correspondence of the three correspondences in Fig. 1.

4 Evolutionary Method

The second option explored in this paper for the GM correspondence computation is the meta-algorithm presented in [20], called the Evolutionary method. This proposal relies on the concept of the weighted mean of a pair of correspondences, defined as follows. Given \( f^{1} \), \( f^{2} \) and a distance \( Dist \) between them (for instance \( Dist_{HD} \)), the mean correspondence of \( f^{1} \) and \( f^{2} \) is a correspondence \( \bar{f} \in {\text{T}} \) such that:

$$ \begin{array}{*{20}c} {Dist\left( {f^{1} ,\bar{f}} \right) = Dist\left( {\bar{f},f^{2} } \right)} \\ {Dist\left( {f^{1} ,f^{2} } \right) = Dist\left( {f^{1} ,\bar{f}} \right) + Dist\left( {\bar{f},f^{2} } \right) } \\ \end{array} $$
(8)

Additionally, the weighted mean correspondence \( \bar{f}_{\alpha } \in {\text{T}} \) is defined as a correspondence in \( T \) that holds:

$$ \begin{array}{*{20}c} {Dist\left( {f^{1} ,\bar{f}_{\alpha } } \right) = \alpha } \\ {Dist\left( {f^{1} ,f^{2} } \right) = \alpha + Dist\left( {\bar{f}_{\alpha } ,f^{2} } \right)} \\ {\text{where}\,\alpha \,\text{is}\,\text{a}\,\text{constant}:\,0 \le \alpha \le Dist\left( {f^{1} ,\,f^{2} } \right)} \\ \end{array} $$
(9)

Clearly, \( \bar{f} _{0} = f^{1} \) and \( \bar{f}_{{Dist\left( {f^{1} ,f^{2} } \right)}} = f^{2} \). Nevertheless, both the mean correspondence and the weighted mean correspondence given a specific \( \alpha \) are usually not unique. The concept of the weighted mean has been previously defined for strings [17], graphs [18] and data clusters [19].

As proven in [11], the GM of a set of elements in any space can be estimated through an optimal pairing of these elements: by computing the weighted means of such optimal pairs, the weighted means tend to converge on one element that is a good estimation of the GM of the set. Since in some cases the GM can still be far from the deduced element, an iterative algorithm is proposed in [20] that tends to reach the true GM. This algorithm, applied to the correspondence domain, consists of the steps shown in Algorithm 2:

We proceed to detail steps 1, 2 and 4 in the correspondence domain. Notice that the third step simply adds the obtained weighted mean correspondences to the current set of correspondences.

4.1 Optimal Pairs of Correspondences

We generate the distance matrix over the whole set of correspondences, where any distance between correspondences can be used. The optimal pairs of elements are those that generate the minimum SOD between them [11]. Thus, we obtain the pairs of correspondences by applying a SLAP solver such as the Hungarian method [27], the Munkres algorithm [28] or the Jonker-Volgenant solver [29]. Note that we do not want a correspondence to be assigned as the optimal pair of itself; for this reason, instead of filling the diagonal of the distance matrix with zeros, we impose a high value. Nevertheless, if there is an odd number of correspondences, the solver necessarily returns one correspondence mapped to itself. In this case, this correspondence is stored until the third step.
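The pairing step can be sketched as follows (an illustrative implementation under our encoding assumptions, with the HD as the plugged-in distance; `linear_sum_assignment` again plays the role of the SLAP solver, and the large diagonal value forbids self-pairings):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hd(a, b):
    return sum(x != y for x, y in zip(a, b))

def optimal_pairs(S, dist=hd):
    """Step 1: pair up correspondences minimising the SOD between paired ones.

    Returns (pairs, leftover); leftover holds any correspondence left unpaired
    (e.g. with an odd |S|), to be carried over to step 3.
    """
    n = len(S)
    D = np.array([[dist(S[i], S[j]) for j in range(n)] for i in range(n)], float)
    np.fill_diagonal(D, 1e9)               # high value instead of zeros
    rows, cols = linear_sum_assignment(D)
    pairs, seen = [], set()
    for i, j in zip(rows, cols):
        if i not in seen and j not in seen and i != j:
            pairs.append((i, j))
            seen.update((i, j))
    leftover = [i for i in range(n) if i not in seen]
    return pairs, leftover

S = [[0, 1, 2], [0, 1, 2], [1, 0, 2], [1, 0, 2]]
print(optimal_pairs(S))  # -> ([(0, 1), (2, 3)], [])
```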

4.2 Weighted Means of Pairs of Correspondences

The aim of the second step is to estimate \( \varOmega \) equidistant weighted means per pair of correspondences. Thus, we generate \( \bar{f}_{{\alpha_{1} }} , \ldots ,\bar{f}_{{\alpha_{\varOmega } }} \) such that \( \alpha_{i} = \frac{i}{\varOmega + 1} \) (Eq. 9). \( \varOmega \) is usually set between 1 and 3. Through practical validation, we have seen that restricting the process to calculating only the mean correspondence (that is, \( \varOmega = 1 \)) makes the process converge more slowly than having, for instance, three equidistant weighted means, even though these are obtained in a sub-optimal form. Moreover, experimentation has shown that for \( \varOmega > 3 \), the computational cost increases without any gain in accuracy.

The weighted mean search strategy we use is inspired by the “Moving Elements Uniformly” strategy presented in [19] for the domain of data clusters. In that case, the authors were able to generate multiple weighted mean data clusters from two initial ones. To do so, they defined an initial weighted mean as one of the data clusters, and then systematically swapped elements belonging to two different clusters in such a way that the weighted mean data cluster formed tends to move from one of the initial data clusters towards the other.

Our proposal initially defines the weighted mean correspondence as one of the two correspondences. Then, it simply swaps pairs of element-to-element mappings in the proposed weighted mean \( \bar{f}_{\alpha } \). Note that every time a swap is performed, the value \( Dist\left( {f^{1} ,\bar{f}_{\alpha } } \right) \) is increased by two, but we cannot guarantee that \( Dist\left( {f^{2} ,\bar{f}_{\alpha } } \right) \) is also decreased by two. For this reason, the strategy checks whether the current correspondence is a true weighted mean (i.e. Eq. 9 holds). If it is, a weighted mean has been formed and the swapping process continues until all required weighted means are found. If it is not, the process is reset and repeated until weighted means are found. This method is based on a theorem presented in [21], which states that a weighted mean correspondence must satisfy \( \bar{f}_{\alpha } \left( {v_{x} } \right) = f^{1} \left( {v_{x} } \right) \) or \( \bar{f}_{\alpha } \left( {v_{x} } \right) = f^{2} \left( {v_{x} } \right) \) for all elements.
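A randomised sketch of this swap strategy, as we read the description above (the retry and swap budgets are hypothetical parameters of our illustration; the function returns one weighted mean for a given \( \alpha \) under the HD, or None if none is found within the budget):

```python
import random

def hd(a, b):
    return sum(x != y for x, y in zip(a, b))

def weighted_mean(f1, f2, alpha, max_tries=1000, seed=0):
    """Start from f1 and swap pairs of mappings until Eq. 9 holds;
    reset and retry whenever the swaps move away from f2."""
    rng = random.Random(seed)
    d12 = hd(f1, f2)
    assert 0 <= alpha <= d12
    for _ in range(max_tries):
        f = list(f1)
        for _ in range(4 * len(f)):
            if hd(f1, f) >= alpha:
                break
            i, j = rng.sample(range(len(f)), 2)
            f[i], f[j] = f[j], f[i]   # one swap changes Dist(f1, f) by at most 2
        if hd(f1, f) == alpha and alpha + hd(f, f2) == d12:
            return f                  # Eq. 9 holds: a true weighted mean
    return None

f = weighted_mean([0, 1, 2, 3], [1, 0, 3, 2], alpha=2)
print(f, hd([0, 1, 2, 3], f), hd(f, [1, 0, 3, 2]))
```

As noted in the text, weighted means are usually not unique; different seeds may return different valid answers.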

4.3 Selecting the Optimal Correspondences

Once the current correspondences are put together with the new weighted mean correspondences to enlarge the set (step 3), the method could return to the first step of Algorithm 2 with this enlarged set without running the fourth step. Nevertheless, the computational cost and memory needed in each iteration would then grow exponentially. The aim of the fourth step is therefore to discard the correspondences that are believed not to be good candidates for the GM. To that end, a distance matrix is computed over the whole set of correspondences, and the correspondences with the largest SOD to the rest are discarded. Note that this criterion is in line with the definition of the GM (Eq. 3).
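Assuming the enlarged set is simply a list of correspondences, the fourth step reduces to ranking by SOD and truncating (a sketch; the retained set size `keep` is a hypothetical parameter of our illustration):

```python
def hd(a, b):
    return sum(x != y for x, y in zip(a, b))

def select_best(S, keep, dist=hd):
    """Step 4: keep the `keep` correspondences with the smallest SOD to the
    whole set, discarding the rest (in line with Eq. 3)."""
    sods = [sum(dist(f, g) for g in S) for f in S]
    order = sorted(range(len(S)), key=lambda i: sods[i])
    return [S[i] for i in order[:keep]]

S = [[0, 1, 2], [0, 1, 2], [1, 0, 2], [0, 2, 1]]
print(select_best(S, keep=2))  # -> [[0, 1, 2], [0, 1, 2]]
```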

When the fourth step finishes, Algorithm 2 iterates again until one of three conditions is met: (1) the minimum SOD over the correspondences in the set falls below a threshold; (2) a maximum number of iterations is reached; (3) the difference in total SOD between the previous iteration and the current one falls below a minimum. Regardless of the terminating condition, Algorithm 2 returns as the GM correspondence the correspondence in the set that currently has the minimum SOD to the set. Convergence is assured since the SOD in each iteration is equal to or lower than that of the previous iteration; moreover, if the SOD remains equal, Algorithm 2 stops.

5 Experimental Validation

Two methods have been presented. The first one obtains the exact GM correspondence but is restricted to the use of the HD. The second one deduces an approximation of the GM correspondence but is not restricted to any particular distance between correspondences. In this section, we show how close the suboptimal method comes to the optimal one, as well as the runtime of both methods. For a fair comparison, we have used the HD in both cases.

We performed three tests, all executed the same way but using correspondences with \( N = 5 \), \( N = 10 \) and \( N = 30 \) mapped elements, respectively. Each test was prepared as follows. We randomly generated 100 sets of \( M = 2, 3, \ldots , 50 \) correspondences. For each set, both methods to find the GM correspondence were executed. In the Minimisation method, the Hungarian method [27] was used to solve the SLAP.

Figure 4 shows the normalised difference in SOD between the GM correspondences generated by the Evolutionary method and those from the Minimisation method (\( x \)-axis) in the first test (\( N = 5 \)), second test (\( N = 10 \)) and third test (\( N = 30 \)), respectively. Each dot in the plot represents the average of the 100 executions. For the Evolutionary method, we show results using the maximum number of iterations \( I_{max} = 1 \) and \( I_{max} = 2 \). Results for larger values of \( I_{max} \) are not shown since they deliver exactly the same values as those of \( I_{max} = 2 \).

Fig. 4. Average difference of SOD (HD) between the Evolutionary method and the Minimisation method (\( x \)-axis, optimal method) for \( N = 5 \), \( N = 10 \) and \( N = 30 \).

In all three cases, for a set of two correspondences (\( M = 2 \)) the Evolutionary method obtains optimal GM correspondences, since the method only has to compute the mean. Nonetheless, as the number of correspondences \( M \) increases, the overestimation reaches a peak value and then decreases, lowering again towards the optimal value of the GM correspondence. This leads us to think that the Evolutionary method has an optimal number of correspondences to be used, since certain values of \( M \) lead to more overestimation than others. Finally, from these plots we conclude that the Evolutionary method, regardless of the \( I_{max} \) value used, obtains values that are very close to the optimal ones; in fact, the worst case overestimates the SOD by 4.5% with respect to the optimal SOD.

Figure 5 shows the runtime difference between the Evolutionary method and the Minimisation method (\( x \)-axis) in seconds. In the case of the Evolutionary method, the time spent in each iteration is clearly constant. Comparing both methods, the Minimisation method is clearly faster than the Evolutionary one, although both have a polynomial computational cost with respect to the number of correspondences used to deduce the GM. Finally, comparing the three plots, we see that the number of elements \( N \) in the sets seems to have almost no influence on the runtime.

Fig. 5. Average difference of runtime (seconds) between the Evolutionary method and the Minimisation method (\( x \)-axis) for \( N = 5 \), \( N = 10 \) and \( N = 30 \).

6 Conclusions and Future Work

We have presented two methods to deduce the GM correspondence. The first one, called the Minimisation method, computes the exact GM in a reduced runtime, but it is restricted to the use of the HD. Since it is based on solving the SLAP, its scalability is limited when the correspondences are large. The second one, called the Evolutionary method, obtains a fair approximation of the GM and may be used with any type of distance between correspondences. This method scales better because, although a SLAP solution is required (step 1), it only takes into consideration the distances between correspondences, and not the correspondences themselves as in the Minimisation method. Notice that the Evolutionary method has been a viable solution for the approximation of the GM of strings and graphs [20], since these structures involve second-order relations and therefore finding their exact GM has an exponential cost.

In the concrete case where the aim is to find the GM of correspondences and the HD can be used, we have shown that the problem can be solved in cubic time (the computational cost of the SLAP) using the Minimisation method. Nevertheless, we consider this paper a first step towards future research in which other distances between correspondences will be explored, and thus the Evolutionary method should not be discarded for future use.

We believe that other distances between correspondences, taking into consideration not only the element-to-element mapping but also the structure and attributes of the related elements, could produce GM correspondences that are more interesting from the application point of view. For instance, when the correspondences relate attributed graphs, the mapping is defined node-to-node. In this case, we could consider the local structure of the nodes (their adjacent edges and terminal nodes) to penalise the cost of the mapping. Then, the Minimisation method would no longer produce an exact GM and, therefore, we would need to compare both algorithms not only in terms of runtime but also in terms of accuracy in deducing the best approximation to the GM correspondence.