
Robust phase unwrapping algorithm based on Zernike polynomial fitting and Swin-Transformer network


Published 2 February 2022 © 2022 IOP Publishing Ltd
Citation: Zixin Zhao et al 2022 Meas. Sci. Technol. 33 055002. DOI: 10.1088/1361-6501/ac4ac2


Abstract

Phase unwrapping plays an important role in optical phase measurements. In particular, phase unwrapping under heavy noise conditions remains an open issue. In this paper, a deep learning-based method is proposed to conduct the phase unwrapping task by combining Zernike polynomial fitting and a Swin-Transformer network. In this proposed method, phase unwrapping is regarded as a regression problem, and the Swin-Transformer network is used to map the relationship between the wrapped phase data and the Zernike polynomial coefficients. Because of the self-attention mechanism of the transformer network, the fitting coefficients can be estimated accurately even under extremely harsh noise conditions. Simulation and experimental results are presented to demonstrate the outperformance of the proposed method over the other two polynomial fitting-based methods. This is a promising phase unwrapping method in optical metrology, especially in electronic speckle pattern interferometry.


1. Introduction

In fringe pattern analysis, measured quantities such as displacement, temperature, and strain are obtained by extracting the fringe phase [1, 2]. However, because the demodulated phase map is usually sawtooth shaped with values restricted to the [−π, π] interval, phase unwrapping is required to obtain the real continuous phase data. Many researchers have proposed methods to accomplish this task, and these methods can be classified into two groups: time domain and spatial domain. In the time-domain group, several wrapped phase maps with different frequencies are required, and phase unwrapping is performed pixel by pixel along the time axis; this approach is suitable for abrupt or discontinuous phase profiles. In the spatial-domain group, only one wrapped phase map is used, and the algorithms can be divided into two categories: path-dependent and path-independent. Path-dependent algorithms include the Goldstein branch-cut method [3, 4], the quality-map-guided method [5, 6], the mask-cut algorithm [7], and Flynn's minimum-discontinuity algorithm [8]. The Goldstein branch-cut method is extremely fast but is significantly affected by noise. The quality-map-guided method and the mask-cut algorithm have high accuracy but require a good quality map. Path-independent algorithms mainly include least-squares methods [9–11], transport-of-intensity-based methods [12, 13], and polynomial fitting-based methods [9, 14, 15]. The least-squares and transport-of-intensity methods both establish a discrete Poisson equation; however, their ways of calculating the input for the Poisson equation differ [16]. Polynomial fitting methods rely on the assumption that the absolute phase can be fitted by a specific polynomial basis (such as the Zernike polynomials), and the phase unwrapping task is converted into the estimation of the fitting coefficients with the help of the second differential of the wrapped phase data. Because differentiation amplifies noise, polynomial fitting-based methods are highly sensitive to noise and are only suitable for wrapped phase maps with little noise.

With the rapid development of deep learning, convolutional neural networks (CNNs) have been successfully applied to phase unwrapping. In 2018, Dardikman et al [17] proposed an algorithm based on a residual neural network to solve the phase unwrapping problem in the field of interferometric phase microscopy. Spoorthi et al proposed PhaseNet [18] for phase unwrapping based on the SegNet network. In 2019, Wang et al [19] and Zhang et al [20] proposed U-net architectures based on deep-learning semantic segmentation for phase unwrapping, which offer some robustness to low noise and aliasing. However, these methods are only suitable for low-noise conditions; their performance decreases greatly in high-noise cases. In 2020, Liu et al proposed D-Net [21], based on a CNN, for fast demodulation of single-frame interference fringe patterns. Treating demodulation as a classification task, the network predicts the 34 Zernike polynomial coefficients of a fringe pattern and then fits the phase information to obtain a noiseless phase result, which offers a new idea for phase unwrapping. In addition to CNNs, the transformer [22] architecture, popular in natural language processing, has gradually revealed its advantages in the vision field. The transformer was originally designed for sequence modeling and transduction tasks; however, owing to its unique attention mechanism, it has also achieved excellent results in vision tasks such as image classification, image denoising, and semantic segmentation [23–28].

As mentioned earlier, polynomial-based methods often fail under heavy noise conditions. To solve this problem, a robust phase unwrapping method is proposed based on Zernike polynomial fitting and a Swin-Transformer network. With the help of the transformer network's sensitivity to contour information [24], the correlation between the fringe information and the Zernike polynomial coefficients can be constructed, and the interference of locally dense, high-level noise can be effectively reduced through the transformer's self-attention mechanism. As a result, the fitting coefficients can be accurately estimated even under extremely harsh noise conditions. Numerous tests on simulated and experimental data were conducted to demonstrate the effectiveness of the proposed method. Moreover, the D-Net [21] and derivative Zernike polynomial fitting (DZPF) [15] methods were used for comparison at different noise levels. The results show that the proposed method performs better than the other two methods, especially for data disturbed by large noise. In general, the proposed method can effectively improve the unwrapping accuracy under noisy conditions by using a transformer network.

The remainder of this paper is organized as follows: section 2 describes the principle of the proposed method. Section 3 presents experiments on simulated and experimental data and demonstrates the outperformance of the proposed method against two representative methods. Finally, conclusions are presented in section 4.

2. Theory

2.1. Phase unwrapping principle based on the transformer network

The wrapped phase can be expressed as

$\psi \left( {x,y} \right) = W\left( {\varphi \left( {x,y} \right)} \right)\qquad\left( 1 \right)$

where $\varphi \left( {x,y} \right)$ is the unwrapped phase, $\psi \left( {x,y} \right)$ is the wrapped phase, and $W\left( \cdot \right)$ is the wrap function. $\varphi \left( {x,y} \right)$ can be represented by the orthogonal Zernike polynomial [29] as

$\varphi \left( {x,y} \right) = \sum\limits_{i = 1}^N {{C_i}{Z_i}\left( {x,y} \right)} \qquad\left( 2 \right)$

where ${Z_i}\left( {x,y} \right)$ represents the $i$th Zernike polynomial defined on the unit circle, ${C_i}$ is the corresponding coefficient, and $N$ is the number of terms. ${Z_i}\left( {x,y} \right)$ can be expressed as

${Z_{{\text{even}}\,i}}\left( {r,\theta } \right) = R_n^m\left( r \right)\cos \left( {m\theta } \right)\qquad\left( 3 \right)$

${Z_{{\text{odd}}\,i}}\left( {r,\theta } \right) = R_n^m\left( r \right)\sin \left( {m\theta } \right)\qquad\left( 4 \right)$

where ${Z_{{\text{even}}\,i}}$ and ${Z_{{\text{odd}}\,i}}$ denote the $i$th term for even and odd $i$, respectively. The radial polynomial $R_n^m\left( r \right)$ can be expressed as

$R_n^m\left( r \right) = \sum\limits_{k = 0}^{\left( {n - m} \right)/2} {\frac{{{{\left( { - 1} \right)}^k}\left( {n - k} \right)!}}{{k!\left( {\frac{{n + m}}{2} - k} \right)!\left( {\frac{{n - m}}{2} - k} \right)!}}{r^{n - 2k}}} \qquad\left( 5 \right)$

where $r$ and $\theta $ are the radial coordinate within the unit circle and the angle measured from the y-axis, respectively, and $n$ and $m$ are integers representing the radial order and azimuthal frequency of the polynomial, respectively.
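As a concrete reference for equations (2)–(5), the following NumPy sketch evaluates the radial polynomial, individual Zernike terms, and the phase expansion on a square grid. The single-index-to-(n, m) mapping and the helper names (radial_poly, zernike, zernike_phase) are illustrative assumptions, not code from the paper.

```python
import numpy as np
from math import factorial


def radial_poly(n, m, r):
    """Radial polynomial R_n^m(r) of equation (5); requires n - |m| to be even."""
    m = abs(m)
    R = np.zeros_like(r)
    for k in range((n - m) // 2 + 1):
        c = ((-1) ** k * factorial(n - k)
             / (factorial(k)
                * factorial((n + m) // 2 - k)
                * factorial((n - m) // 2 - k)))
        R += c * r ** (n - 2 * k)
    return R


def zernike(n, m, r, theta):
    """Single Zernike term: cosine branch for m >= 0, sine branch for m < 0."""
    if m >= 0:
        return radial_poly(n, m, r) * np.cos(m * theta)
    return radial_poly(n, -m, r) * np.sin(-m * theta)


def zernike_phase(coeffs, nm_pairs, size=256):
    """Sum C_i * Z_i(x, y) over a square grid, i.e. equation (2).

    coeffs   : sequence of fitting coefficients C_i
    nm_pairs : matching sequence of (n, m) index pairs (ordering is assumed)
    """
    y, x = np.mgrid[-1:1:size * 1j, -1:1:size * 1j]
    r = np.hypot(x, y)
    theta = np.arctan2(x, y)             # angle measured from the y-axis, as in the text
    phi = np.zeros((size, size))
    for c, (n, m) in zip(coeffs, nm_pairs):
        phi += c * zernike(n, m, r, theta)
    phi[r > 1] = 0.0                     # Zernike terms are defined on the unit circle
    return phi
```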

Figure 1 shows the unwrapping process of the proposed method. Phase unwrapping is treated as a regression task in which the wrapped phase ψ(x, y) is the input, and the output of the network is the 34 Zernike polynomial coefficients excluding the piston (the first Zernike polynomial is a constant). Finally, the unwrapped phase is reconstructed from these coefficients.
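Assuming a trained model (here called net) and the hypothetical zernike_phase helper sketched above, the complete unwrapping step would then look roughly like this:

```python
# Hypothetical inference: the network maps a wrapped phase map to the 34
# non-piston Zernike coefficients, which are summed back into the phase.
# `net`, `wrapped_phase` and `nm_pairs` are placeholders, not released code.
coeffs = net(wrapped_phase.reshape(1, 1, 256, 256)).squeeze()    # shape (34,)
unwrapped = zernike_phase(coeffs.detach().numpy(), nm_pairs)     # equation (2)
```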

Figure 1. Diagram of the proposed method.

2.2. Dataset generation

The training data are important for deep learning-based methods. In this method, Zernike polynomials were used to generate the ground-truth phase data, and figure 2 illustrates the data-generation process. A deformed speckle field was generated and subtracted from the initial speckle field to obtain the speckle fringe pattern. By considering multiplicative noise (speckle), additive noise (Gaussian, salt-and-pepper), and non-uniform illumination or reflection in the fringe pattern, a series of wrapped phase data with different visibility and noise levels was obtained. A total of 50 000 wrapped phase maps with a size of 256 $ \times $ 256 and their corresponding Zernike polynomial coefficients ${C_i}$ were generated; some of them are shown in figure 3. Note that the simulated data are very close to real data. Among them, 45 000 samples were used as the training set and 5000 as the validation set to evaluate the training of the model. When the dataset was fed into the network for training, the data were augmented by rotation and size transformation to improve the performance of the model.
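A simplified sketch of this generation recipe is shown below (random coefficients → Zernike phase → wrapping → speckle-like, Gaussian, and salt-and-pepper corruption). It reuses the hypothetical zernike_phase helper from section 2.1; all coefficient ranges and noise levels are illustrative, not the authors' settings.

```python
import numpy as np

rng = np.random.default_rng(0)


def wrap(phi):
    """Wrap a continuous phase into [-pi, pi] (the W(.) operator of equation (1))."""
    return phi - 2.0 * np.pi * np.round(phi / (2.0 * np.pi))


def make_sample(nm_pairs, size=256):
    """Generate one (noisy wrapped phase, coefficients) training pair.

    A simplified stand-in for the speckle-subtraction pipeline of figure 2;
    the wrapped phase is corrupted with crude multiplicative, Gaussian and
    salt-and-pepper noise and then re-wrapped.
    """
    coeffs = rng.uniform(-5.0, 5.0, len(nm_pairs))        # ground-truth C_i (illustrative range)
    phi = zernike_phase(coeffs, nm_pairs, size)

    psi = wrap(phi)
    psi *= 0.8 + 0.2 * rng.rayleigh(1.0, psi.shape)       # crude multiplicative speckle
    psi += rng.normal(0.0, 0.3, psi.shape)                # additive Gaussian noise
    pepper = rng.random(psi.shape) < 0.05                 # salt-and-pepper corruption
    psi[pepper] = rng.uniform(-np.pi, np.pi, int(pepper.sum()))
    return wrap(psi).astype(np.float32), coeffs.astype(np.float32)


# Example: build a small batch of wrapped-phase / coefficient pairs.
# dataset = [make_sample(nm_pairs) for _ in range(1000)]
```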

Figure 2. Flowchart of training data generation.

Figure 3. Sections of the generated data with different noise levels and visibility.

2.3. Network construction and training

In the proposed method, a network model is constructed and modified based on the Swin-Transformer architecture [28]. An overview of the unwrapping network architecture is shown in figure 4; the network includes a patch partition module, a linear embedding module, Swin-Transformer modules, patch merging modules, and a multi-layer perceptron (MLP) head. First, the wrapped phase map (H $ \times $ W) is split into non-overlapping 4 $ \times $ 4 patches by the patch partition module. There are $\frac{H}{4} \times \frac{W}{4}$ patches in total, and each patch is converted into a 'token' with a feature dimension of 16. A linear embedding module projects the raw-valued feature of each token to an arbitrary dimension C, where C is equal to 96. All tokens are then input into the Swin-Transformer module for feature transformation and for computing the self-attention between tokens while maintaining the original token number ($\frac{H}{4} \times \frac{W}{4}$) and feature dimension (C). The number of tokens is continuously reduced through the patch merging modules as the network deepens to enlarge the receptive field, which produces a hierarchical representation. The first patch merging layer changes the number of tokens to $\frac{H}{8} \times \frac{W}{8}$ and the feature dimension to 2C, and two consecutive Swin-Transformer blocks are then used for feature transformation at this unchanged resolution. After the second patch merging layer, the number of tokens becomes $\frac{H}{{16}} \times \frac{W}{{16}}$ and the feature dimension becomes 4C, and six consecutive Swin-Transformer blocks are used to extract deeper features. Finally, the resolution and feature dimension are reduced to $\frac{H}{{32}} \times \frac{W}{{32}}$ and 8C, respectively, through a further patch merging layer and two consecutive Swin-Transformer blocks, and the 34 Zernike polynomial coefficients are output through a two-layer MLP with a Gaussian error linear unit (GELU) layer [24].
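The token and feature-dimension schedule described above can be made concrete with the following PyTorch sketch. Plain nn.TransformerEncoderLayer blocks stand in for the Swin-Transformer blocks (no shifted-window attention), and the depth of the first stage is assumed to be 2 (the text specifies 2, 6 and 2 blocks only for the later stages), so this only illustrates the resolution/feature bookkeeping and the MLP head, not the authors' exact network.

```python
import torch
import torch.nn as nn


class PatchMerging(nn.Module):
    """Merge each 2 x 2 group of tokens: 4x fewer tokens, 2x feature dimension."""

    def __init__(self, dim):
        super().__init__()
        self.reduce = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, h, w):                      # x: (B, h*w, dim)
        b, _, d = x.shape
        x = x.view(b, h, w, d)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)
        return self.reduce(x.view(b, (h // 2) * (w // 2), 4 * d))


class UnwrapNet(nn.Module):
    """Hierarchical transformer regressor: wrapped phase -> 34 Zernike coefficients.

    Plain TransformerEncoder blocks replace the windowed Swin blocks, so only
    the resolution/feature schedule (C, 2C, 4C, 8C) and the MLP head follow the
    description in the text.
    """

    def __init__(self, c=96, depths=(2, 2, 6, 2), n_coeffs=34):
        super().__init__()
        # Patch partition (4 x 4) and linear embedding fused into one strided conv:
        # each 16-valued raw patch is projected to a C-dimensional token.
        self.embed = nn.Conv2d(1, c, kernel_size=4, stride=4)
        dims = [c, 2 * c, 4 * c, 8 * c]
        self.stages, self.merges = nn.ModuleList(), nn.ModuleList()
        for i, depth in enumerate(depths):
            layer = nn.TransformerEncoderLayer(
                d_model=dims[i], nhead=dims[i] // 32, dim_feedforward=4 * dims[i],
                activation="gelu", batch_first=True)
            self.stages.append(nn.TransformerEncoder(layer, num_layers=depth))
            if i < len(depths) - 1:
                self.merges.append(PatchMerging(dims[i]))
        # Two-layer MLP head with a GELU layer, outputting the coefficients.
        self.head = nn.Sequential(nn.LayerNorm(dims[-1]),
                                  nn.Linear(dims[-1], dims[-1]), nn.GELU(),
                                  nn.Linear(dims[-1], n_coeffs))

    def forward(self, x):                            # x: (B, 1, H, W)
        x = self.embed(x)                            # (B, C, H/4, W/4)
        b, d, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)             # (B, H/4 * W/4, C)
        for i, stage in enumerate(self.stages):
            x = stage(x)                             # full attention stands in for
            if i < len(self.merges):                 # the efficient windowed attention
                x = self.merges[i](x, h, w)
                h, w = h // 2, w // 2
        return self.head(x.mean(dim=1))              # global average pool over tokens
```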

Figure 4. The network structure of the proposed method.

As shown in figure 5, each time the tokens pass through the patch merging module, four neighbouring tokens are merged into a new token: the number of tokens becomes a quarter of the original, and the receptive field and feature dimension are expanded by factors of four and two, respectively. The network can efficiently extract feature information at different scales by merging image patches (shown in yellow) to construct hierarchical feature maps (shown in red) and then computing the self-attention between them.
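Using the PatchMerging sketch above, this bookkeeping can be checked directly for a 256 × 256 input: a 64 × 64 grid of C-dimensional tokens becomes a 32 × 32 grid of 2C-dimensional tokens.

```python
import torch

merge = PatchMerging(dim=96)              # from the architecture sketch above
tokens = torch.randn(1, 64 * 64, 96)      # H/4 x W/4 tokens for a 256 x 256 input
merged = merge(tokens, 64, 64)
print(merged.shape)                       # torch.Size([1, 1024, 192]): 32 x 32 tokens, 2C features
```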

Figure 5. The change in the receptive field level.

During network training, the loss function was set as

${\text{Loss}} = 0.8\sum\limits_{i = 2}^4 {{{\left( {{C_i} - {P_i}} \right)}^2}} + 0.2\sum\limits_{i = 5}^{35} {{{\left( {{C_i} - {P_i}} \right)}^2}} \qquad\left( 6 \right)$

where ${C_i}$ is the $i$th ground-truth coefficient and ${P_i}$ is the $i$th predicted coefficient. The weighted mean-square error (MSE) between the estimated and actual values of the Zernike polynomial fitting coefficients was defined as the training loss, and the parameters of the network were iteratively updated using backpropagated gradients based on this loss. Because ${C_2}$, ${C_3}$, and ${C_4}$ are dominant in the actual data, their weight was set to the larger value of 0.8, determined by adjusting the parameter over several training runs; the weight of the remaining coefficients was set to 0.2. Adaptive moment estimation (Adam) was used to optimize the network parameters, and the learning rate was set to ${10^{ - 4}}$. The network was trained for 1000 epochs in total with a batch size of 32 images. The change in the loss function during training is shown in figure 6; the loss values were recorded every 100 epochs. The validation curve and the training curve maintained the same downward trend and both converged to a small value, indicating that the network had not overfitted.
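A hedged sketch of the training loop implied by equation (6) and the settings above (0.8/0.2 coefficient weighting, Adam with a learning rate of 1e-4, batches of 32). The exact normalization of the weighted loss and the data loader are assumptions; UnwrapNet refers to the architecture sketch given earlier.

```python
import torch


def weighted_mse(pred, target):
    """Weighted coefficient loss in the spirit of equation (6).

    pred, target: (batch, 34) tensors ordered C_2 ... C_35. The 0.8 / 0.2 split
    follows the text; averaging over batch and coefficients is an assumption.
    """
    w = torch.full((pred.shape[1],), 0.2, device=pred.device)
    w[:3] = 0.8                                       # C_2, C_3, C_4 dominate
    return torch.mean(w * (pred - target) ** 2)


model = UnwrapNet()                                   # architecture sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for psi, coeffs in loader:                            # `loader` is a placeholder yielding
    optimizer.zero_grad()                             # (B, 1, 256, 256) / (B, 34) batches of 32
    loss = weighted_mse(model(psi), coeffs)
    loss.backward()
    optimizer.step()
```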

Figure 6. The change in the loss function during training.

3. Results

3.1. Results on synthetic data

A large quantity of data was used to test the performance of the proposed method. First, the unwrapping results on the test dataset were evaluated; the results are shown in figures 7 and 8.

Figure 7. The unwrapping result of an open-fringe phase. (a) Wrapped phase map, (b) ground truth, (c) unwrapped phase with RMSE = 0.0407 rad, and (d) the true (in blue) and predicted Zernike polynomial fitting coefficients (in red).

Figure 8. The unwrapping result of a closed-fringe phase. (a) Wrapped phase map, (b) ground truth, (c) unwrapped phase with RMSE = 0.0627 rad, and (d) the true (in blue) and predicted Zernike polynomial fitting coefficients (in red).

The performance of the trained model was tested on the wrapped phases of both open and closed fringes. Figure 7 shows the test results for the wrapped phase of an open fringe. Figure 7(d) presents the Zernike polynomial fitting coefficients predicted by the model, which are close to the actual Zernike polynomial coefficients, and the root mean square error (RMSE) between the fitted phase (figure 7(c)) and the actual phase distribution (figure 7(b)) is only 0.0407 rad, indicating a small prediction error. The test results for the closed-fringe wrapped phase are shown in figure 8. From the unwrapping result in figure 8(c), the RMSE between the fitted phase and the actual phase (figure 8(b)) is only 0.0627 rad, which is also extremely small. A comprehensive comparison of the RMSE and phase distributions shows that the trained network model has a good unwrapping performance on test data with different fringe shapes.
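For reference, the RMSE figure of merit quoted throughout this section is simply the root-mean-square difference between the recovered and ground-truth phase maps:

```python
import numpy as np

def rmse(phase_est, phase_true):
    """Root-mean-square error (in rad) between two phase maps of equal size."""
    return float(np.sqrt(np.mean((phase_est - phase_true) ** 2)))
```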

To evaluate the performance of the proposed method on different types of noise, a set of data severely affected by salt-and-pepper noise was also tested; the results are shown in figure 9. The predicted Zernike polynomial coefficients are similar to the actual distribution, and the RMSE between the recovered phase and the actual phase is merely 0.0153 rad, indicating that the prediction error is extremely small. The trained network model is thus shown to have a good unwrapping performance on test data with different types of noise.

Figure 9. The unwrapping result of a wrapped phase with salt and pepper noise (noise density is 0.3). (a) Wrapped phase map, (b) ground truth, (c) unwrapped phase with RMSE = 0.0153 rad, and (d) the true (in blue) and predicted Zernike polynomial fitting coefficients (in red).

To further verify the performance of the network model and its advantages over other algorithms, the unwrapping performance of the proposed method was compared with that of two other Zernike polynomial fitting-based methods, D-Net [21] and DZPF [15], under different noise conditions. The results are shown in figures 10–12.

Figure 10. A comparison of the unwrapping performance under low-noise conditions. (Speckle noise and Gaussian noise; the mean and standard deviation of Gaussian noise are 0.0050 and 0.0101, respectively.) (a) Wrapped phase map, (b) ground truth, (c) unwrapped phase by the proposed method with RMSE = 0.1917 rad, (d) unwrapped phase using D-Net with RMSE = 0.9223 rad, and (e) unwrapped phase using DZPF with RMSE = 1.8314 rad.

Figure 11. A comparison of the unwrapping performance under heavy noise conditions. (Speckle noise and Gaussian noise; the mean and standard deviation of Gaussian noise are 0.0501 and 0.0911, respectively.) (a) Wrapped phase map, (b) ground truth, (c) unwrapped phase by the proposed method with RMSE = 0.1107 rad, (d) unwrapped phase using D-Net with RMSE = 0.7126 rad, and (e) unwrapped phase using DZPF with RMSE = 6.7326 rad.

Figure 12. A comparison of the unwrapping performance under heavy noise conditions and non-uniform visibility. (Speckle noise and Gaussian noise; the mean and standard deviation of Gaussian noise are 0.1010 and 0.1997, respectively.) (a) Wrapped phase map, (b) ground truth, (c) unwrapped phase by the proposed method with RMSE = 0.1917 rad, (d) unwrapped phase using D-Net with RMSE = 0.9223 rad, and (e) unwrapped phase using DZPF with RMSE = 9.0731 rad.

Figure 10 shows a comparison of the unwrapping performances of the three methods under low-noise conditions. The RMSE between the unwrapped phase (figure 10(c)) and the actual phase (figure 10(b)) of the proposed method is merely 0.1021 rad, which indicates a higher accuracy than the results of D-Net (RMSE = 0.6792 rad) and the DZPF method (RMSE = 1.8314 rad). Figure 11(a) shows a wrapped phase map disturbed by relatively high noise. With the increase in noise, the performance of DZPF is greatly affected and its RMSE reaches 6.7326 rad, which indicates a failed unwrapping. The RMSE of D-Net is 0.7126 rad. In contrast, the RMSE of the proposed method is 0.1107 rad, which shows that the proposed method is much less affected by noise. Figure 12 shows the unwrapping results under high-noise conditions and non-uniform visibility. The overall noise of the wrapped data (figure 12(a)) is large, especially in the upper left corner, where the signal-to-noise ratio is low and the fringe information cannot be accurately distinguished. The DZPF method cannot effectively fit the phase under this condition. The result of D-Net is shown in figure 12(d); there is a clear difference between its phase distribution in the upper left corner and the actual phase distribution (figure 12(b)), and its RMSE is 0.9223 rad, whereas the RMSE of the proposed method is 0.1917 rad, which maintains a high accuracy. These results show that the performance of DZPF is greatly affected by the noise level, so it is difficult to obtain reliable results under large-noise conditions. The deep learning-based methods exhibit good performance on data at different noise levels. Moreover, the proposed method, which is based on a transformer network, performs better in all cases than the D-Net method, which is based on a CNN structure. This shows that the improved transformer network has a better anti-noise ability than the CNN: the transformer module can model the dependencies between the extracted regions at different scales, reducing the weight assigned to uninformative (noisy) regions and thus effectively suppressing the interference of local noise.

3.2. Results on real data

To evaluate the practical unwrapping performance of the proposed method, several sets of real wrapped-phase data were processed by the proposed method, D-Net, and the DZPF method. The first dataset, shown in figure 13(a), is a noisy open-fringe wrapped phase with a size of 232 $ \times $ 232 pixels, obtained using digital holographic interferometry for a dime after a small horizontal tilt [30]. The image was directly input into the network for processing. Figures 13(b)–(d) show the results of the three methods. Owing to the large noise interference, the results of D-Net and DZPF show clear deviations (their fringe distributions do not conform to the original small horizontal tilt phase), whereas the proposed method still obtains a reliable phase. This shows that the proposed method has the best performance on real data and hence a better generalization ability than the D-Net model.

Figure 13. The unwrapping results for real data with a size of 232 $ \times $ 232 pixels. (a) Wrapped phase, (b) unwrapped phase using the proposed method, (c) unwrapped phase using D-Net, and (d) unwrapped phase using the DZPF method.

The second dataset, shown in figure 14(a), is a noisy closed-fringe wrapped phase with a size of 256 $ \times $ 256 pixels, obtained by electronic speckle pattern interferometry (ESPI) for a rough plane surface after a small out-of-plane deformation (the phase variation range is about $10\pi $). The image was directly input into the network for processing. Figures 14(b)–(d) show the results of the three methods. The closed-fringe distribution in the middle of the D-Net result clearly deviates from the actual fringes, and the phase variation range of DZPF is about $6\pi $, which does not match the actual fringe density. In contrast, the proposed method provides a more reliable result, indicating that this method also has a better unwrapping performance for closed fringes.

Figure 14. The unwrapping results for real data with a size of 256 $ \times $ 256 pixels. (a) Wrapped phase, (b) unwrapped phase using the proposed method, (c) unwrapped phase using D-Net, and (d) unwrapped phase using the DZPF method.

In addition, the performance of the three methods was tested using 1024 $ \times $ 1024 large-scale data, shown in figure 15(a), which were obtained by ESPI for a rough plane surface after a small out-of-plane deformation. Because the 1024 $ \times $ 1024 size differs from that of the training data (256 $ \times $ 256), the real data were resized to 256 $ \times $ 256 before being fed into the network; note that the image size remains unchanged when testing the DZPF method. Figures 15(b)–(d) show the unwrapping results of the three methods. There is a clear deviation in the fringe distribution in the upper left corner of the D-Net result (figure 15(c)) and a certain deviation in the phase variation range of DZPF, which indicates that both the D-Net and DZPF methods fail. In contrast, the proposed method performs best, indicating that it is also suitable for large-scale experimental data processing.
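A minimal sketch of this preprocessing step, assuming the model and helpers introduced earlier; the paper does not state which interpolation scheme was used for the resizing, so bilinear interpolation is only an assumption.

```python
import torch
import torch.nn.functional as F

# psi_1024: a (1024, 1024) wrapped-phase tensor (placeholder name).
psi_small = F.interpolate(psi_1024.view(1, 1, 1024, 1024),
                          size=(256, 256), mode="bilinear", align_corners=False)
coeffs = model(psi_small).squeeze()                       # 34 predicted coefficients
# Reconstruct at the original resolution, since the Zernike expansion is continuous.
unwrapped = zernike_phase(coeffs.detach().numpy(), nm_pairs, size=1024)
```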

Figure 15. The unwrapping results for real data with a size of 1024 $ \times $ 1024 pixels. (a) Wrapped phase, (b) unwrapped phase using the proposed method, (c) unwrapped phase using D-Net, and (d) unwrapped phase using the DZPF method.

In conclusion, the proposed method exhibits the best unwrapping performance in the different situations examined. In particular, under severe noise interference, the proposed method exhibits better anti-noise ability and robustness than the other Zernike polynomial fitting-based methods. Overall, the reasons why the Swin-Transformer network outperforms a traditional CNN can be summarized as follows:

  • (a)  
    The Swin-Transformer network can extract local and global features at different scales through its regional multi-scale partition and merging modules. As a result, the transformer-based method has better size generalization than the CNN-based method, which was verified by experimental tests with data of different resolutions.
  • (b)  
    The transformer module can model the dependencies between the extracted regions at different scales, reducing the weight assigned to uninformative (noisy) regions and thus effectively suppressing the interference of local noise. As a result, the transformer-based method has higher accuracy and noise robustness than the CNN-based method, which was verified by numerous tests with simulated and experimental data.

4. Conclusion

In this paper, a robust phase unwrapping method was proposed based on Zernike polynomial fitting and a transformer network. Phase unwrapping is treated as a regression task: combined with the Swin-Transformer network architecture, the Zernike polynomial fitting coefficients can be accurately estimated from the wrapped phase data, and hence the actual phase distribution can be reconstructed. The main advantage of this method is that the transformer network can correlate local and global information and is sensitive to contour information. The correlation between the fringe information and the Zernike polynomial fitting coefficients is thereby constructed, which effectively resists the interference of noise. As a result, the proposed method exhibits good accuracy and robustness to noise. The unwrapping performance of this method was verified using simulated and experimental data. In particular, under high-noise conditions, the proposed method has clear advantages over the other Zernike polynomial fitting methods. Even for data with large noise and non-uniform visibility, this method can still provide reliable predictions. In conclusion, the proposed method can effectively improve the phase-unwrapping accuracy under large-noise conditions. It has good application prospects in the field of quantitative phase measurement, especially in severe-noise situations such as speckle interferometry.

Acknowledgments

This work was financially supported by the National Natural Science Foundation of China (Nos. 52175516 and 51705404), the Natural Science Basic Research Plan in Shaanxi Province of China (No. 2020JQ-055), and the Key Industrial Technology Innovation Projects of Suzhou (No. SYG201922). We would like to thank Editage (www.editage.cn) for English language editing.

Data availability statement

The data that support the findings of this study are available upon reasonable request from the authors.
