
1 Introduction

Seriation, also known as ranking, ordination or sequencing, is a general procedure that aims at linearly ordering a set of elements in a dataset. The central assumption in this type of problem, where a distance metric is defined, is the following [1]: there is a natural order of the objects in which the total distance between adjacent elements is minimized. The seriation results strongly depend on the distance metric and on the way the similarity matrix is computed. Seriation is a transversal problem in many scientific and technological fields, such as archeology [2], genomics and proteomics, paleontological data and network analysis [3]. It can be formulated and solved using several techniques based on linear algebra, graph theory or optimization [4,5,6]. Popular and well-studied approaches to this problem were developed in the scope of the C1P and 2-SUM problems. In this formulation, the rows and columns of a binary similarity matrix are permuted in order to estimate the correct order of the objects that minimizes a given distance criterion. For a complete survey of seriation methods see [1, 7, 8]. A greedy optimization method over Robinsonian matrices, based on graph theory, has been used with success in several problems of low dimension and topological complexity [1, 5, 9]. Another popular class of seriation methods is formulated as a combinatorial optimization problem, in which heuristic solutions are obtained using simulated annealing and dynamic programming algorithms [10]. The performance and accuracy of these methods are compromised by noisy data, and the processing time explodes exponentially with the dimension of the problem, making the computation of the solution harder. Moreover, the seriation of the objects in a dataset in the presence of noise has been considered a challenging problem [11]. From the perspective of clustering and classification theories, a seriation result can be interpreted as a chain of objects with an optimum length [7]. Seriation of the objects from the (dis)similarity matrix associated with a dataset can be considered a one-mode two-way clustering problem [1, 8]. Conversely, the clustering of the objects into a single group can easily be translated into the language of seriation. A nice connection between clustering and seriation is given in [7, 12, 13]: from the clustering perspective, matrix permutation is a procedure that orders the clusters at each level so that the objects on the edge of each cluster are adjacent to the nearest object outside the cluster. From the perspective of seriation, the result is an optimal ordering, with additional grouping conditions for identifying the clusters and their boundaries. Since the matrix permutation approach of seriation algorithms has some advantages over clustering, such as not destroying the data through transformation or reduction, we prefer to address the problem as a seriation approach [7]. We propose here an iterative robust method in which, in each iteration, the accuracy of the solution improves by local optimization of the Hamiltonian path. By permuting the complex nodes of the minimum spanning tree with respect to the acyclic Hamiltonian path, we are able to obtain more accurate and robust results than the classical seriation problem solvers.

2 Problem Definition

The goal of this work is to reorder a set of images of mouse brain cross-sections obtained through a manual cutting process. The corresponding fluorescence images of these cross-sections are noisy and distorted. The digitalization of hundreds of these thin slices involves time-consuming and exhausting manual work prone to errors during the fixation of the slices onto the microscope slides and coverslips. The visual ordering of these images, which have high morphological similarity, is a challenging task.

This process gets harder when the technician loses the correct order of the cross-sections during staining. To reconstruct a 3D digital atlas of the mouse brain, section-to-section registration of the slices with reference images and their correct ordering are essential. In this study, we aim to reorder 91 adult mouse brain sections that were shuffled in the lab. We apply a new seriation method based on locally optimal leaf permutation over the minimum spanning tree of the data to recover the order of the images, and we compare our results with other classical seriation methods, obtaining competitive and accurate results. Reordering mouse brain section images affected by morphological distortion can be addressed as a seriation problem with a noisy similarity matrix. The systematic physical damage that these soft-tissue slices suffer in neuroscience laboratories decreases the efficiency of global image similarity indexes such as the Euclidean distance or mutual information. In this work, we use a Histogram of Oriented Gradients (HOG) based image descriptor with various window sizes to build the (dis)similarity matrix [14].

3 Materials and Methods

The Traveling Salesman Problem (TSP) is a classical approach to formulating seriation problems, in which one tries to find a linear order of the objects in a dataset according to their similarities [7, 9, 15]. For a noiseless similarity matrix, the optimal solution of this formulation is usually accurate, since the method finds the best among all possible permutations by minimizing the Hamiltonian path [16]. TSP can also be formulated as a graph problem. Here G = {V, E, W} represents a graph with vertices (cities) V = {1, 2,…, n}, and \( e_{i} \in E \) are the edges, each associated with a weight \( w_{i} \); for images, the weights indicate the distance or similarity between image i and the other connected objects. A Hamiltonian cycle is used to compute the minimum-distance tour that visits all cities (objects) exactly once. This formulation leads to finding the minimum path between the objects by computing the minimum spanning tree (MST) for the optimal tour. Many TSP solvers have been suggested over the last decades. Since TSP is NP-complete, linear and dynamic programming methods are used to decrease the computational cost for large datasets. The linear programming formulation of TSP is given by:

$$ \begin{array}{ll} minimize & \sum\nolimits_{i = 1}^{m} w_{i} x_{i} = \mathbf{w}^{T} \mathbf{x} \\ subject\ to & \mathbf{x} \in S \\ \end{array} $$
(1)

where m is the number of edges of the graph G, \( w_{i} \in \mathbf{w} \) is the weight of edge \( e_{i} \), and x is the incidence vector of a tour. The feasible set S contains all possible Hamiltonian cycles that connect all objects in the dataset, which is what makes the TSP hard [9]. The dynamic programming method can be formulated as follows. Given a subset of objects \( S \subset \left\{ 2, 3, \ldots, n \right\} \) and \( l \in S \), let \( d^{*}\left(S, l\right) \) represent the length of the shortest path from city 1 to city l that passes through all cities in S exactly once. For the singleton \( S = \left\{ l \right\} \), \( d^{*}\left(S, l\right) \) is defined as \( d_{1l} \). The shortest path for larger sets with \( \left| S \right| > 1 \) is given by:

$$ d^{*}\left( S, l \right) = \min_{m \in S\backslash\left\{ l \right\}} \left( d^{*}\left( S\backslash\left\{ l \right\}, m \right) + d_{ml} \right) $$
(2)

The minimum tour length for a complete tour including the first node is as follows:

$$ d^{**} = \min_{l \in \left\{ 2, 3, \ldots, n \right\}} \left( d^{*}\left( \left\{ 2, 3, \ldots, n \right\}, l \right) + d_{l1} \right) $$
(3)

The quantity \( d^{*}\left( S, l \right) \) can be computed recursively using the last two equations, which leads to the optimal tour length \( d^{**} \). In a second step, the optimal permutation \( \pi = \left\{ 1, i_{2}, i_{3}, \ldots, i_{n} \right\} \) is recovered in reverse order, from \( i_{n} \) back to \( i_{2} \). The permutation \( \pi \) is optimal if:

$$ d^{ * *} = d^{ *} \left( {\left\{ {2,3, \ldots ,n} \right\},i_{n} } \right) + d_{{i_{n} 1}} $$
(4)

For \( 2 \le p \le n - 1 \) we have:

$$ d^{ *} \left( {\left\{ {i_{2} ,i_{3} , \ldots ,i_{p} ,i_{p + 1} } \right\},i_{p + 1} } \right) = d^{ *} \left( {\left\{ {i_{2} ,i_{3} , \ldots ,i_{p} } \right\},i_{p} } \right) + d_{{i_{p} i_{p + 1} }} $$
(5)
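To make the recursion in Eqs. (2)–(5) concrete, the following is a minimal Held–Karp sketch in Python; the function name and dense-matrix interface are our own illustrative choices, not the solver used in the experiments. Cities are indexed from 0, and d is a dense n × n distance matrix.

```python
import itertools
import numpy as np

def held_karp(d):
    """Exact TSP tour for a small dense distance matrix d (n x n).

    Illustrative sketch of Eqs. (2)-(5): d_star[(S, l)] holds the length of
    the shortest path from city 0 through every city in S, ending at l.
    """
    n = d.shape[0]
    cities = range(1, n)
    # Base case: S = {l}, d*(S, l) = d_{1l}
    d_star = {(frozenset([l]), l): d[0, l] for l in cities}
    parent = {}
    for size in range(2, n):
        for subset in itertools.combinations(cities, size):
            S = frozenset(subset)
            for l in S:
                rest = S - {l}
                # Eq. (2): extend the best sub-path ending at m by the edge (m, l)
                best_m = min(rest, key=lambda m: d_star[(rest, m)] + d[m, l])
                d_star[(S, l)] = d_star[(rest, best_m)] + d[best_m, l]
                parent[(S, l)] = best_m
    # Eq. (3): close the tour back to the first city
    full = frozenset(cities)
    last = min(full, key=lambda l: d_star[(full, l)] + d[l, 0])
    # Backtrack the optimal permutation in reverse (Eqs. (4)-(5))
    order, S = [], full
    while len(S) > 1:
        order.append(last)
        S, last = S - {last}, parent[(S, last)]
    order.append(last)
    order.append(0)
    order.reverse()
    return order
```

For n = 91 images this exact recursion is infeasible, since its cost grows as \( O(n^{2} 2^{n}) \); this is why relaxations and heuristics, as discussed below, are used in practice.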

For a large dataset, alternative formulations based on relaxations of the linear programming method can provide the TSP solution. However, for a small number of objects, dynamic programming is fast and more efficient [1, 9]. To measure the accuracy of the seriation results, we used two distance parameters, Kendall's \( \tau \) and Spearman's \( \rho \) [6, 17]. The TSP solver as a seriation method uses the Hamiltonian path, but for a noisy dataset it does not recover the path graph correctly, whereas in the noiseless case the path of the minimum spanning tree and the TSP solution coincide [15]. The proposed method starts with the construction of the (dis)similarity matrix over the dataset. Employing the HOG descriptor with different cell sizes for the images and computing the linear correlation coefficient between the feature vectors gives a set of minimum spanning trees. The desired path is a tree with two nodes of vertex degree one and all remaining nodes of vertex degree two. A noiseless similarity matrix leads to an MST that fits a path graph, while a noisy one does not. By measuring the complexity of each MST, i.e., counting the degrees of the nodes in the tree, we are able to choose the tree with maximum consistency with a path graph. Applying a permutation function over the complex nodes, those with degree greater than two, minimizes the difference between the tree and the ground truth. This process is repeated by re-computing the Hamiltonian path between each complex node and its neighbors while keeping the rest of the vertices constant. We applied a standard deviation filter over the images and random window sizes for the HOG descriptor so that the local permutation converges. The permuted nodes are relocated on the global tree iteratively. This procedure continues until the complexity parameter of the tree reaches the maximum value n−2. The details of each step are described separately below.
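As an illustration of the complexity measure described above, the following sketch counts the number of degree-two nodes of the minimum spanning tree; it assumes a dense dissimilarity matrix with strictly positive off-diagonal entries and uses SciPy's MST routine, and the helper names are ours.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def tree_degrees(D):
    """Node degrees of the MST of a dense dissimilarity matrix D.

    Assumes off-diagonal entries of D are strictly positive (zeros are
    interpreted by SciPy as missing edges).
    """
    mst = minimum_spanning_tree(D)   # sparse, one direction per edge
    adj = mst + mst.T                # symmetrize to count both endpoints
    return np.asarray((adj > 0).sum(axis=0)).ravel()

def complexity(D):
    """Number of degree-2 nodes; a path graph over n nodes reaches n - 2."""
    return int(np.sum(tree_degrees(D) == 2))
```

When complexity(D) equals n − 2 the MST is already a path graph; otherwise the nodes of degree greater than two are the complex nodes targeted by the permutation step.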

3.1 Similarity and Distance Matrix

The pre-processing step for resizing and cleaning the images is crucial before building the similarity matrix. In an image-based dataset with morphological deformation, the image descriptor plays the main role in recovering the true seriation, while the Euclidean distance mostly performs poorly. In this paper, we use the Histogram of Oriented Gradients (HOG) as the image descriptor to obtain the feature vectors of the images [14]. To obtain the (dis)similarity matrix for our dataset, we first rigidly aligned all slices with a template image of multiple neighbors. To speed up the computation and reduce the effects of morphological deformation, which occur mostly at the edges of the slices, we selected a region of interest of about fifty percent of the image size as a bounding box around the image center. We then used the morphological features obtained by applying the HOG descriptor to construct the similarity matrix, estimating the linear correlation coefficient between the feature vectors pairwise. Consider a dataset with n images. The feature matrix \( F_{n \times m} \) has one row per image (n in total) and one column per element of the normalized gradient histogram (m in total). We use the L2 norm, \( f = v\left( {\left\| v \right\|_{2}^{2} + \varepsilon^{2} } \right)^{ - 1/2} \), to normalize the raw feature vector v collected from each image. The length of the vector f changes with the size of the HOG detection window, which gives a set of pre-R similarity matrices [2]. The linear dependency of two feature vectors with m elements is defined by Pearson's correlation coefficient:

$$ \rho \left( f_{i}, f_{j} \right) = \frac{1}{m - 1}\sum\nolimits_{k = 1}^{m} \left( \frac{f_{ik} - \bar{\mu}_{f_{i}}}{\sigma_{f_{i}}} \right)\left( \frac{f_{jk} - \bar{\mu}_{f_{j}}}{\sigma_{f_{j}}} \right) $$
(6)

where \( \bar{\mu } \) and \( \sigma \) are the mean and standard deviation of the vectors. For each pair of feature vectors \( f_{i} \) and \( f_{j} \), the linear correlation matrix is defined as:

$$ r_{ij} = \left( {\begin{array}{*{20}c} 1 & {\rho_{ij} } \\ {\rho_{ji} } & 1 \\ \end{array} } \right). $$
(7)

where \( \rho_{ij} \) is the correlation coefficient between images \( i \) and \( j \). Since the pre-R similarity matrix is symmetric in this case, it follows that \( \rho_{ij} = \rho_{ji} \). For the similarity and dissimilarity matrices S and D over n objects, we define \( {\mathbf{S}} = {\mathbf{R}}^{{^\circ {\mathbf{2}}}} \) and \( {\mathbf{D}} = {\mathbf{J}} - {\mathbf{R}}^{{^\circ {\mathbf{2}}}} \), where J is the \( n \times n \) all-ones matrix and \( {\mathbf{R}}^{{^\circ {\mathbf{2}}}} \) is the entry-wise (Hadamard) square of the correlation matrix, with elements \( R_{ij} = \rho_{ij}^{2} \). The (dis)similarity matrix for the mouse brain images satisfies the following conditions:

$$ \left\{ \begin{array}{ll} S_{ii} = 1, & S_{ij} = S_{ji} \\ D_{ii} = 0, & D_{ij} = D_{ji} \\ \end{array} \right. \quad and \quad \left\{ \begin{array}{l} tr\left( {\mathbf{D}}_{{\mathbf{n}}} \right) = 0 \\ tr\left( {\mathbf{S}}_{{\mathbf{n}}} - {\mathbf{I}}_{{\mathbf{n}}} \right) = 0 \\ \end{array} \right. $$
(8)

where \( {\mathbf{I}}_{{\mathbf{n}}} \) is the \( n \times n \) identity matrix.
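A minimal sketch of the construction above, using scikit-image's HOG implementation; the cell size, block size, and the assumption that the images are already aligned, cropped and resized to a common shape are illustrative choices rather than the exact settings used in the paper.

```python
import numpy as np
from skimage.feature import hog

def feature_matrix(images, cell_size=(16, 16)):
    """Stack normalized HOG feature vectors (one row per image) into F (n x m).

    Assumes grayscale images of identical size; cell_size is illustrative.
    """
    feats = [hog(img, pixels_per_cell=cell_size, cells_per_block=(2, 2))
             for img in images]
    F = np.vstack(feats)
    # L2 normalization of each raw feature vector: f = v / sqrt(||v||^2 + eps^2)
    eps = 1e-5
    return F / np.sqrt((F ** 2).sum(axis=1, keepdims=True) + eps ** 2)

def dissimilarity(F):
    """Pearson correlations between feature vectors (Eq. 6), then Eq. 8 matrices."""
    R = np.corrcoef(F)   # pairwise correlation coefficients rho_ij
    S = R ** 2           # similarity, S_ii = 1
    D = 1.0 - S          # dissimilarity D = J - R∘2, D_ii = 0
    return S, D
```

Repeating dissimilarity() for several HOG cell sizes yields the set of pre-R (dis)similarity matrices used in the MST selection step below.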

3.2 Optimal Leaf Permutation Method

The HOG-based Optimal Leaf Permutation (HOG_OLP) method includes three main steps: (dis)similarity matrix construction, MST selection, and leaf permutation. For the dissimilarity matrix D, we first choose the HOG detection window size randomly, over several iterations, in order to obtain the topologically simplest MST. The window size ranges between 4 × 4 pixels and forty percent of the image size. We generate the minimum spanning tree for each dissimilarity matrix associated with a different HOG window size, which allows us to use a relaxation parameter to select the best MST. We maximize g, a relaxation parameter that counts the number of nodes with degree two. This index helps the algorithm find the best topological MST for leaf permutation, where:

$$ g_{m} = { \hbox{max} }\left\{ {g_{1} ,g_{2} , \ldots ,g_{l} |g_{i} = \# deg_{2} \left( {T_{i} } \right)} \right\} $$
(9)

The desired minimum spanning tree is a tree with \( g = \left( {n - 2} \right) \), where n is the number of nodes in the tree (see Fig. 1(e) and (f)). After finding a suitable MST, we apply an optimal permutation function locally over the complex nodes, which reorders the nonlinear leaves and relocates them on the main tree. Let \( f\left( {v_{i} ,l} \right) \) denote the feature matrix of the l neighbors connected to vertex i, and let \( I\left\{ {v_{i} } \right\} \) indicate the images related to node i, with weight \( w_{ij} \) as the dissimilarity value between images i and j. We estimate the starting and ending vertices of the MST by computing the longest path over the distance matrix in each iteration. The following algorithm explains how our method reduces the complexity of the spanning tree by permuting the leaves. The function returns a minimum spanning tree \( T_{sl} \) with node degrees less than three, \( 1 \le deg\left\{ {v_{i} } \right\} \le 2 \). This iterative process continues until every complex node with degree greater than two is connected to only two nodes (a linear tree, see Fig. 1(f)). The local permutation function uses a sub-similarity matrix built by extracting additional HOG features after applying median and standard deviation filters. Therefore, for each complex node and its first-degree neighbors, the sub-similarity matrix differs from the global one. To preserve the global seriation result, we always update and restore the similarity values when each complex node and its neighbors are realigned on the main similarity matrix. The final tree has the optimal path length among the candidate solutions, although this path is not necessarily the minimal one.

Fig. 1.
figure 1

Seriation method for mouse brain microscopic images. (a): sample of montaged images, (b): pre-R distance matrix of the data, (c): shading distance matrix after seriation, (d): graph of connected images based on the similarity index, (e): minimum spanning tree of the data computed from the HOG-based (dis)similarity matrix, (f): linear seriation obtained by applying the HOG_OLP method.

figure a
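The local permutation step of the algorithm above can be sketched as follows: for a complex node, the shortest open (Hamiltonian) path over the node and its first-degree neighbors is found by brute force on the local sub-dissimilarity matrix. The helper name is ours, and the additional HOG features and median/standard-deviation filtering used to build this sub-matrix are omitted here.

```python
import itertools

def permute_complex_node(D_sub):
    """Shortest open path over a small local dissimilarity matrix.

    D_sub is the sub-dissimilarity matrix of a complex node and its
    first-degree neighbors (a handful of images), so brute-force
    enumeration over all local orderings is affordable.
    Returns the ordering of local indices forming the shortest path.
    """
    k = D_sub.shape[0]

    def path_len(order):
        return sum(D_sub[order[i], order[i + 1]] for i in range(k - 1))

    return min(itertools.permutations(range(k)), key=path_len)
```

The returned local order replaces the star-shaped branch around the complex node; the global similarity values are then updated and restored as described above, and the loop repeats until the degree-two count reaches n − 2.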

4 Results

We tested the proposed algorithm with two types of datasets. First, we applied the method to one hundred teapot images and compared the results with other seriation algorithms from the literature [18]. Tests with the mouse brain dataset were also performed, and the results are displayed in Figs. 2 and 3 and Table 1. Since the teapot images are not affected by any kind of morphological deformation or noise, the resulting similarity matrix is clean and straightforward. Therefore, all seriation methods tested with this type of synthetic data were able to produce the ground-truth result (100% accuracy). The different methods only differ in the estimated starting and ending images of the final sequence. In fact, none of the methods can estimate these endpoints without additional or prior information, which is not considered in this work. A dataset of 91 microscopy images was used to illustrate the application of the proposed methodology to real data. This sequence was accidentally shuffled in the lab and was manually reordered by visual inspection to generate the ground truth. The reordering and ground-truth definition were performed by a neuroscientist through visual comparison with the Allen Brain Atlas reference images [19]. We compared the proposed method against 32 other seriation methods described in the literature [1, 8]. Most of them failed to give acceptable results and were excluded from the comparison. Only the following four seriation methods were included in this paper: single-linkage Hierarchical Clustering (HC_Single), the Gruvaeus and Wainer seriation method (GW_Single), Optimal Leaf Ordering (OLO_Single) and a TSP solver. A good survey and review of seriation methods can be found in [7, 8]. The Kendall \( \tau \) and Spearman distances computed for the mouse brain dataset (see Table 1) show that the proposed algorithm

Fig. 2.
figure 2

Plot (1) indicates the Kendall \( \tau \) and Spearman's \( \rho \) measurements for the five seriation methods. The shading distance matrices (a–f) are for the 91 mouse brain images and correspond to GW single linkage, Hierarchical Clustering (HC), Optimal Leaf Ordering (OLO), the TSP solution and the HOG-based Optimal Leaf Permutation (HOG_OLP) method, respectively. Plot (2) illustrates the overlap-based similarity matrix for the mouse brain dataset; the x- and y-axes indicate the ground truth and the seriation results.

Fig. 3.
figure 3

The left image shows the absolute value of the difference between the ground truth and the seriation results. The dark area illustrates that the seriation reordered the images successfully. The right images are 3D blocks of the 91 mouse brain sections before (top) and after (bottom) reordering, respectively.

Table 1. The results of seriation accuracy for 91 mouse brain images

(HOG_OLP) reorders more images correctly than the others. The hierarchical clustering method, which permutes the leaves of a dendrogram structure without changing the main clusters, tries to find the optimal result but, for a noisy similarity matrix, fails to give the correct enumeration. On the other hand, the optimal leaf ordering method, which starts from hierarchical clustering and permutes the leaves, is more efficient, and employing the GW method over the images also improves the accuracy with respect to plain hierarchical clustering. The Kendall and Spearman coefficients for the optimal leaf ordering method rise to 0.9746 and 0.997 out of one, respectively, which are higher than those of the HC and GW methods.

Among all the other methods, the TSP solution performs best. The two accuracy measures for the TSP solver are 0.9863 and 0.9987, which indicates that more images are sorted correctly than with the other three methods. However, although the TSP solver gives the minimum path length for the applied distance matrix, for a noisy similarity matrix the minimum path is not the correct ordering; therefore, a more robust and optimal result is needed. Applying our method to this dataset gives more robust results than the others: the Kendall tau for our method is 0.9927 and Spearman's coefficient is 0.9993, which shows better efficiency (see Table 1). Figure 2 shows the accuracy measures for all the seriation methods as well as the shading distance matrices. All the plots show that the HOG_OLP method performs better than the others. In the same figure, plot (2) compares the seriation results with the ground truth. A correct result should lie on a straight line; dots scattered away from the straight line indicate the errors of the seriation methods. The hierarchical clustering method has the largest error among the methods, which means it could not seriate the images successfully. The left image in Fig. 3 shows the absolute differences between the ground truth and the seriation results; a dark background means that the image ranks match the ground truth completely. Finally, the right images in the same figure show a 3D volumetric stack of the mouse brain before and after seriation. The slices were seriated with our algorithm; the black arrows on the images highlight the reordering results and the consistency of the tissue.
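The accuracy figures reported in Table 1 can be computed, given the ground-truth order and a recovered order, with standard rank-correlation functions; a sketch with our own variable names is shown below.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def seriation_accuracy(true_order, recovered_order):
    """Rank correlations between a ground-truth ordering and a seriation result.

    Both arguments are permutations of the image indices 0..n-1. We compare,
    for each image, its rank in the two orderings; a perfect seriation
    (up to reversal) gives |tau| = |rho| = 1.
    """
    n = len(true_order)
    true_rank = np.empty(n, dtype=int)
    rec_rank = np.empty(n, dtype=int)
    true_rank[list(true_order)] = np.arange(n)
    rec_rank[list(recovered_order)] = np.arange(n)
    tau, _ = kendalltau(true_rank, rec_rank)
    rho, _ = spearmanr(true_rank, rec_rank)
    return abs(tau), abs(rho)
```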

5 Conclusion

In this paper we propose a graph-based seriation algorithm for noisy similarity matrices that increases the robustness of the seriation solution. HOG_OLP is a method consisting of two main steps: similarity matrix construction and leaf permutation. In the first step, the similarity matrix is calculated from the linear correlation coefficients between the feature vectors obtained by the HOG method. We use the HOG cell size, applied when extracting the features of the mouse brain images, as a parameter to obtain a set of (dis)similarity matrices. Minimizing the Hamiltonian path over a noisy dissimilarity matrix through the minimum spanning tree provides a set of complex MSTs. In the second step, we identify the best MST for local permutation by measuring the vertex degrees of the nodes as a complexity parameter of the tree. A tree with the maximum number of nodes of degree two is the most consistent with a path graph. The complex nodes, together with their first-degree neighbors, are reordered by a local permutation function. By keeping the global order of the nodes, applying appropriate image filters to the corresponding images and repeating the same procedure to obtain the local MST in a loop, the complex nodes converge to the desired path graph. By iteratively relocating the linearized complex nodes on the main tree along the optimal path, we obtain the final order of the images. The accuracy measures comparing the results of the proposed method and the other seriation algorithms with the ground truth show better and competitive performance. Our results demonstrate that this approach can be used for the seriation of image-based datasets with high morphological deformation or noisy similarity matrices.