1 Introduction

Image registration is an invaluable tool for medical image analysis and has received vast attention in imaging research for the past several decades. Image registration is used as a tool to find meaningful temporal transformations to align images taken at different time frames. Traditionally, registration algorithms assume smooth transformations. This assumption quickly falls apart for many cases, since different organs move, to a certain degree, independently from one another. Image misalignment becomes inevitable if smoothness is assumed at regions where discontinuities are expected, such as organ boundaries [4]. In this paper, we introduce an unsupervised learning model that learns the relationship between image pairs and a corresponding displacement field. We propose a regularizer that accounts for local image discontinuities while simultaneously respecting local homogeneity. This approach drastically decreases registration time, as the registration task is no longer an optimization task, but becomes a simple function evaluation.

2 Related Work

In traditional image registration, the most common approach is to solve an optimization problem, where the objective function is comprised of two terms, an image dissimilarity term and a regularization term to restrict the solution space. Common methods include elastic and diffusion models [16], free-form deformations using b-splines [3], and more recently, kernel methods [11,12,13]. Because all of these methods optimize an energy function for every image pair, large-scale or successive registration tasks becomes very time consuming. Specialized algorithms such as Thirion’s Demons [5, 17, 22] allow significant reduction in computational time by estimating force vectors that acts to drive the deformation followed by Gaussian smoothing during the optimization process. Unfortunately, this algorithm restricts models to be diffusion-based models only.

With the rise of deep learning over the past decade, learning-based approaches have become extremely popular. Several models are trained in a supervised manner which required ground truth transformations to be available [6, 15, 18]. Although these methods showed promising results, the task of obtaining ground truth transformation fields is cumbersome and highly prone to error. Thus, recent methods have shifted to an unsupervised approach instead, where models are trained based on how transformation fields act on images, rather than strictly on the transformations [1, 9, 25]. For a survey of learning-based image registration methods, refer to the article by Haskins et al. [10].

3 Method

Our model follows a framework popularized by Voxelmorph [2]. Let \(I_F\) and \(I_M\) denote fixed and moving images. We find a function \(g_{\theta }(I_F, I_M)\) that produces the displacement field \(\mathbf {u}\), i.e. \(\mathbf {u} = g_{\theta }(I_F, I_M)\). The deformation \(\phi \) can then expressed as the mapping \(\phi = Id + \mathbf {u}\) where Id is the identity mapping. The deformation field is applied to \(I_M\) to produce the warped image \(I_M \circ \phi \) where \(I_F(x)\) is similar to \([I_M \circ \phi ](x)\) for all voxel locations \(x \in \varOmega \). Since \(\phi \) may map the original coordinate system to non-integer valued voxel locations, interpolation is required to warp \(I_M\) under \(\phi \). For our experiments, we use trilinear interpolation due to its simplicity. An overview of the model is shown in Fig. 1.

Fig. 1.
figure 1

Overview of the model. Fixed and moving images \(I_F\), \(I_M\) are passed into a convolutional neural network which produces the displacement field \(\mathbf {u}(x)\). The spatial transformer morphs \(I_M\) based on the displacement field. The loss is measured over the dissimilarity between the fixed and morphed moving images, as well as additional penalty functions defined over \(\mathbf {u}\).

3.1 Network Architecture

The function \(g_{\theta }\) is modeled using a convolutional neural network where \(\theta \) denotes the network parameters. The neural network follows a modified version of U-Net [19], which contains an encoder and a decoder structure that mirror each other and are connected by skip connections at each layer (Fig. 2). The encoder/decoder architecture is motivated by image pyramid techniques in many computer vision algorithms, where each encoding and decoding layer operate from coarse to fine representations of the input.

The encoder consists of three convolution layers by applying \(3 \times 3 \times 3\) convolutions with stride 2 for downsampling, followed by LeakyReLU with slope of 0.2 at each layer. Each convolution layer has 32 output channels except the first layer which contains 16 output channels.

The decoder follows a similar structure as the encoder but in reverse order. In the first decoding layer, we simply use the output of the final encoding layer as the input. In subsequent decoding layers, we first upsample the output of the previous decoding layer. Skip connections are constructed by concatenating layer outputs with that of the mirroring encoding layer. This effectively uses representations of the encoding layers to enforce more precise outputs in the decoding layers. Similar to the encoder, each decoding layer applies \(3 \times 3 \times 3\) convolutions followed by LeakyReLU of slope 0.2, but with stride 1 to preserve resolution at each layer. The output of the final decoding layer is passed into an additional convolution layer with 3 output channels, where each output channel contains the coordinate components of the displacement field \(\mathbf {u}\).

Fig. 2.
figure 2

Network architecture of \(g_{\theta }\) based on a modified version of U-Net. The network receives \(I_M\) and \(I_F\) to produce the displacement field \(\mathbf {u}\). The input and output of the network are of dimensions \(D \times H \times W \times 2\) and \(D \times H \times W \times 3\) respectively. The architecture consists of a contractive path (encoder) and a mirroring expansive path (decoder) connected by skip connections at each layer.

3.2 Loss Function

We train our model using a loss function in the form

$$\begin{aligned} \mathcal {L}(I_F, I_M, \mathbf {u}) = \lambda _{sim} \mathcal {L}_{sim}(I_F, I_M, \mathbf {u}) + \lambda _{disc} \mathcal {L}_{disc}(\mathbf {u}) + \lambda _{mag} \mathcal {L}_{mag}(\mathbf {u}), \end{aligned}$$
(1)

where \(\mathcal {L}_{sim}\) measures image dissimilarity, \(\mathcal {L}_{disc}\) is a discontinuity preserving regularizer, and \(\mathcal {L}_{mag}\) is a second loss term that manages the (ir)regularities in the magnitude of the displacement fields. \(\lambda _{sim}, \lambda _{disc}\), and \(\lambda _{mag}\) are corresponding regularization constants.

Similarity Loss. To measure image similarity/dissimilarity, we use a local normalized cross correlation which is defined as

$$\begin{aligned} \mathrm {LNCC}(I_M, I_F) = \sum _{x \in \varOmega } \frac{\left[ \sum _{y \in \mathcal {N}(x)} \left( I_M(y) - \mu _M(x) \right) \left( I_F(y) - \mu _F(x) \right) \right] ^2}{\left[ \sum _{y \in \mathcal {N}(x)} \left( I_M(y) - \mu _M(x) \right) ^2 \right] \left[ \sum _{y \in \mathcal {N}(x)} \left( I_F(y) - \mu _F(x) \right) ^2 \right] } \end{aligned}$$
(2)

where x is any voxel in the image domain \(\varOmega \), and \(y \in \mathcal {N}(x)\) are the neighborhood points around voxel x, and \(\mu _M(x)\) and \(\mu _F(x)\) are the average local intensities around x in the moving and fixed images, respectively. LNCC is maximized when \(I_F = I_M\) which measures similarity, thus we define the dissimilarity measure as \(\mathcal {L}_{\mathrm {sim}} = 1 - \mathrm {LNCC}\).

Discontinuous Loss. In designing the discontinuous loss, we first assume that there are no topological changes, i.e. no new tissue is introduced nor destroyed. We then consider the requirements based on these physical scenarios: 1. Homogeneous movement, 2. Movement along rigid structures, and 3. Sliding organs.

These scenarios help us define the requirements for our regularizer. Firstly, the regularizer must preserve smooth local deformations that occur locally within organ interiors. Secondly, the regularizer must not penalize large local changes in deformation magnitude as long as the movement is in a similar direction. This is to mimic the movement of soft tissues or organs against rigid structures such as the rib cage or the spinal column. Finally, the regularizer must be able to account for movements in the opposite directions along organ boundary. This final requirement is perhaps the most significant as there are many scenarios where sliding organs exist. Common examples include the sliding of the lungs against the chest wall during the respiratory cycle, and the movement of organs against one another in the abdominal region. Figure 3 visually summarizes possible desired behaviors of a discontinuous displacement field.

Fig. 3.
figure 3

Desired behaviors of the discontinuous displacement field. Figure 4(a) demonstrates local homogeneity which is expected within organs. Figure 4(b) allows displacement vectors of different magnitudes as long as they are in a similar direction, which represents soft tissue moving against rigid structures. (c) depicts sliding boundary conditions as displacement vectors on opposite sides of the boundary travel in opposite directions.

Let \(\mathbf {u}\) be represented by a collection of displacement vectors \(\{u_i\}_{i = 1, \dots , N}\), where N is the number of voxels in the image. Now consider two arbitrary vectors \(u_i\) and \(u_j\), respectively corresponding to locations \(x_i\) and \(x_j\) in the image domain. The area of the parallelogram spanned by \(u_i\) and \(u_j\) is maximized when \(u_i\) and \(u_j\) is orthogonal to one another, and minimized when they are parallel. Thus the three conditions are encouraged for any regularizer in the form

$$\begin{aligned} \mathcal {L}_{disc} = \sum _{i,j = 1}^N g(\mathcal {P}(u_i, u_j)) \end{aligned}$$
(3)

where \(\mathcal {P}\) the unsigned area of the parallelogram spanned by \(u_i\) and \(u_j\), and \(g: \mathbb {R} \rightarrow \mathbb {R}\) is a strictly increasing function satisfying \(g(0) = 0\). \(\mathcal {P}\) is computed as

$$\begin{aligned} \mathcal {P} (u_i, u_j) = \Vert u_i \times u_j \Vert _2 \end{aligned}$$
(4)

where \(\times \) denotes the cross product. We propose the regularizer

$$\begin{aligned} \mathcal {L}_{disc} = \sum _{i,j = 1}^N \frac{1}{2} \log \left( 1 + \mathcal {P}(u_i, u_j)^2 \right) k(x_i, x_j) \end{aligned}$$
(5)

where \(k(x_i, x_j)\) is a decreasing weight function that depends on the proximity between the locations \(x_i\) and \(x_j\). For our experiments, we choose the \(C^4\) Wendland kernel [26] for \(k(x_i, x_j)\).

Magnitude Loss. During preliminary stages of our experiments, we noticed that deformations in large dark image regions (background of CT image, for instance) behave erratically. We found that imposing an additional magnitude-based regularizer is needed to suppress this unpredictable behavior. Thus we add the following term to our loss function

$$\begin{aligned} \mathcal {L}_{mag}(u) = \max _i (\Vert u_i \Vert _2). \end{aligned}$$
(6)

This effectively discourages large magnitudes of u. Evidently, this additional term may become problematic for coarse registration where large-scale movement may be expected. However, since this is aimed towards addressing local discontinuities, it is safe to assume that deformations remain relatively small.

4 Experiments

4.1 Setup

Our model is implemented using PyTorch 1.3.0 and trained using an NVIDIA GeForce GTX 1080Ti with 11 GB of graphics memory. CPU tests are performed on an Intel Xeon E5-1620 at 3.7 GHz. We trained our model using Adam optimizer [14] with \(\lambda _{sim} = 100, \lambda _{disc} = \lambda _{mag} = 1\), and learning rate \(10^{-4}\).

The model is evaluated over 4DCT datasets provided by DIR-Lab [7, 8] and the POPI-model [23]. The DIR-Lab Reference 4DCT datasets contain ten sets of image volumes of sizes \(256 \times 256\) and \(512 \times 512\) with various number of axial slices (average of 100 and 128 for the two respective resolutions). To account for these variations, we only keep the middle 96 axial slices of the \(256 \times 256\) volume, and the middle 112 axial slices of the \(512 \times 512\) volumes. Each set of image volumes are taken over 10 time steps over the period of a single respiratory cycle. Since the input is a pair of image volumes, \(I_F\) is chosen as the image volume with a randomly chosen case number and time step, and \(I_M\) is selected based on the same case number with a different time step. By choosing eight cases as training data, this allows \(8 \times 10 \times 9 = 720\) training samples and \(2 \times 10 \times 9 = 180\) test samples, despite only having ten available cases. The POPI-Model contains six image volumes of sizes \(512 \times 512\) with 140 to 190 axial slices. For consistency, we only keep the middle 136 axial slices and use five of the six cases as training data. We follow the same approach as DIR-Lab in choosing \(I_F\) and \(I_M\).

4.2 Results

We first compare our discontinuity-preserving model with one that assumes global smoothness. As a baseline, we trained a second model using the DIR-lab dataset with an identical configuration, with the exception where the discontinuous loss \(\mathcal {L}_{\mathrm {disc}}\) is replaced with a total variation loss \(\mathcal {L}_{\mathrm {TV}}\) defined as

$$\begin{aligned} \mathcal {L}_{\mathrm {TV}} = \sum _i \Vert \nabla u_i \Vert _2 \end{aligned}$$
(7)
Fig. 4.
figure 4

Results obtained using \(\mathcal {L}_{TV}\) (a) and \(\mathcal {L}_{disc}\) (b). Columns 1 and 4 show an overlay of \(\mathbf {u}\) over \(I_M\). Columns 2 and 5 show a magnified local region where transformation discontinuities are expected. Columns 3 and 6 are heatmaps of the displacement field’s local magnitudes.

Fig. 5.
figure 5

Qualitative results of proposed model. From left to right: fixed image \(I_F\), moving image \(I_M\), registered image \(I_M \circ \phi \), absolute error before registration \(|I_F - I_M|\), absolute error after registration \(|I_F - I_M \circ \phi |\), heatmap of displacement field \(\mathbf {u}\).

where the summation is over all voxels indexed by i. Figure 4 shows a comparison between our model trained using \(\mathcal {L}_{\mathrm {TV}}\) and \(\mathcal {L}_{\mathrm {disc}}\). One can quickly identify sudden changes in the displacement field near the lung’s boundaries especially near the lung/vertebrae interface. Additional registration results are shown in Fig. 5. We compare our results (Table 1) quantitatively to the following methods: Free-Form Deformations (FFD) [20], isotropic parametric Total Variation (pTV) [24], and Sparse Kernel Machines (SKM) [12]. For comparison, we fixed frame 1 as the fixed image, and register all remaining frames to the reference. Finally, we compare the time required to register a pair of images using our approach versus a classical registration algorithm using minimization (Table 2). Classical registration is applied using the AIRLab framework [21] via diffusion regularizer.

Table 1. Target Registration Error (TRE) in millimeters (mm) against FFD [20], pTV [24], and SKM [12] on the DIR-Lab and POPI 4DCT Model. Baseline model is the same configuration but trained with \(\mathcal {L}_{TV}\) in place of \(\mathcal {L}_{disc}\).
Table 2. Comparison of registration time between learning-based model and inverse model. For the learning-based model, we used our proposed model for evaluation. For the inverse model, we perform pairwise registration with diffusion regularizer over 1,000 iterations. The inverse model is evaluated using the AIRLab framework [21]. The CPU time for the classical model over DIR-Lab 512 and POPI Model is not computed, as they were much higher than the corresponding GPU time. Time is measured in seconds.

5 Conclusion and Future Work

We presented an unsupervised learning-based model for discontinuity preserving image registration. Although the training set was relatively small, our model performed on par with existing methods while begin able to handle locations where discontinuities may occur. Furthermore, our model significantly reduced computation by several orders of magnitude, allowing successive registration to be performed within a relatively short time frame. A drawback of the model is its sensitivity to noise. In particular, since \(\mathcal {L}_{disc}\) is computed by comparing local displacement vectors with neighboring displacement vectors individually, there are no mechanisms to discourage local chaotic behaviors in the displacement field. A possible remedy is to extend the current model to incorporate additional information, such as segmentation masks and edge information. This allows image discontinuities to be defined rather than relying on only image intensities to predict boundary regions.