Abstract
Algebraic factorization methods applied to Computerized Tomography (CT) medical image reconstruction involve a high computational cost. Since these techniques are significantly slower than the traditional analytical ones, and time is critical in this field, we need parallel implementations that exploit the machine resources and obtain efficient reconstructions.
In this paper, we analyze the performance of the sparse QR decomposition implemented in the SuiteSparseQR factorization package applied to the CT reconstruction problem. We explore both the parallelism provided by BLAS threads and the use of Householder reflections to reconstruct multiple slices at once efficiently. By combining both strategies, we can boost the performance of the reconstructions and implement a reliable, competitive method that produces high-quality CT images.
1 Introduction and Background
In recent years, medical tests such as Magnetic Resonance Imaging (MRI) [1] have gained prominence in clinical practice. MRI is not harmful to the patient, since the image is produced by applying magnetic fields to the body. In contrast, Computerized Tomography (CT) [2] projects X-rays, which induce a radiation dose that can be harmful to the patient. Despite this drawback, CT scans are still necessary.
On the one hand, CT obtains better images than MRI for certain objects of interest (bones and tumors), while magnetic resonance is mostly applied to soft tissues, where it achieves greater contrast between different tissues. On the other hand, not all people are suitable for both tests. MRI is not recommended for patients who have a pacemaker or metal implants in their body, although they can undergo a CT scan; conversely, CT is contraindicated for pregnant patients, infants and children, due to the radiation dose induced by the test.
For all the above, although the current perception is that CT is losing ground to MRI, this is not entirely true, and we believe it is necessary to continue improving both CT scanners and the image reconstruction techniques they use.
In our previous works [3,4,5,6,7], we have studied the option of working with algebraic methods instead of the traditional analytical methods to reconstruct CT images. In this way, we can solve the problem mathematically by either iterative or direct algebraic algorithms. While other works focus on reducing radiation by applying X-rays with lower voltage [8], we focus on taking fewer shots. With this approach, we can work with a low number of projections and still get high-quality images. This means that we could reduce the radiation dose to which the patient is exposed, which is our main objective.
Using iterative methods, we can reduce the number of views or projections to a very small number if we make a good selection of the projection angles [6, 9, 10]. With the direct methods [7, 11], however, we have concluded that the system matrix of the CT must have full rank, which determines the number of projections required for the image resolution we want to achieve. Regardless, both approaches need fewer X-ray projections than other methods.
However, the types of methods we use require more computational resources. In addition, the time needed to reconstruct the images is much higher than with the analytical methods (minutes or hours versus milliseconds). Therefore, if we want the algebraic approach to be adopted, we have to reduce the reconstruction time as much as possible. For this, we can use High-Performance Computing (HPC) techniques, exploiting the hardware resources of the machine through parallel implementations of the algorithms.
In this paper, we focus on analyzing the performance of the QR factorization [12] employed to reconstruct CT images. To do so, we use the SuiteSparseQR factorization package [13], analyzing the effect of using different numbers of BLAS threads in its operations. In addition, we compare the time efficiency of reconstructing a single slice against reconstructing a volume formed by multiple slices, which better matches a real situation.
In Sect. 2.1, we describe the CT image reconstruction problem, together with a brief description of the scanner we simulate and the DeepLesion dataset, which we use as reference. In Sects. 2.2 and 2.3 we explain how to perform the CT reconstructions using the explicit QR factorization or the Householder form. The results of the study are discussed in Sect. 3, analyzing the performance of the factorization step with different numbers of threads (Sect. 3.1) and the performance of the reconstruction step (Sect. 3.2). Additionally, in Sect. 3.3 we show the obtained images and measure their quality. Finally, in Sect. 4 we summarize the work done and discuss future lines of work.
2 Materials and Methods
2.1 Algebraic CT Image Reconstruction
When dealing with the reconstruction of CT images in an algebraic way, it is necessary to model the associated problem. Initially, we only have the data obtained through the scanner using X-rays. Therefore, it will be necessary to transform this data into an image that represents the projected object or body part.
We pose the problem as the system of linear equations proposed in Eq. (1). Here, g is the data acquired by the scanner, usually called the projections vector or, for fan-beam CTs, the sinogram. As we see in (2), g is a vector of size M, which depends on the physical parameters of the scanner: we calculate M as the product of the number of detectors of the CT and the number of projections taken. It is worth mentioning that one projection (one X-ray shot) obtains as many data values as there are detectors.
The system matrix A is defined in Eq. (4). It is a sparse weight matrix whose entries \(a_{ij}\) represent the influence of each traced ray beam (i) on each image pixel (j). The number of rows of A is M, and the number of columns is N, the number of pixels of the reconstructed image we want to get.
In our case, both the matrix and the projections vector are simulated. We calculate both using the forward-projection ray-tracing algorithm proposed by Joseph [14], simulating a CT scanner with 1025 detectors and taking equiangular projections around the 360 degrees of rotation. The images we project are a selection from the DeepLesion dataset [15], which contains thousands of CT images of numerous patients for the study of different types of lesions.
Finally, u is the solution of our equations system: the reconstructed image. If we store it in vector form, as we see in (3), its size is the total number of pixels of the image. For instance, for a final image of resolution \(256\times 256\) pixels, u is a vector of size \(256^2 \times 1\). It is very easy to go from vector form to image form and vice versa.
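Written out from the description above (a reconstruction of the referenced Eqs. (1)–(4)):

```latex
A u = g, \qquad
g \in \mathbb{R}^{M}, \qquad
u \in \mathbb{R}^{N}, \qquad
A = (a_{ij}) \in \mathbb{R}^{M \times N},
```

with \(M = (\text{number of detectors}) \times (\text{number of projections})\) and \(N\) the number of pixels of the reconstructed image.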
2.2 QR Factorization Applied to the CT Problem
Ideally, the previously modeled problem could be solved very simply by obtaining the inverse of the system matrix, as shown in Eq. (5), provided the matrix was square and had full rank. But in our application, the matrix is rectangular (more rows than columns) and its dimensions can be very large for the highest resolutions. Computing the inverse explicitly is not feasible, since it requires a high computational cost and very advanced hardware. Besides, it is a highly unstable computation, so the errors could spoil the resulting image.
For this reason, we need to solve the problem either with an iterative method, as we did in [3, 6], or by applying a factorization to the matrix so we can solve it directly. The factorization we propose is the QR with pivoting [16], as described in (6). Here, Q is an orthonormal matrix (its columns are orthogonal unit vectors, so \(Q^tQ = I\)), R is an upper triangular matrix, and P is the permutation matrix used to reduce the fill-in.
With this decomposition done, we can emulate the inverse of the matrix A as shown in (7), or the pseudoinverse if the matrix is non-invertible [17, 18], and solve the problem as in (8). Since the factorization can be performed only once and stored for future use, every time we need to reconstruct we only have to perform a matrix product, a permutation, and the solution of one upper triangular system of equations. This means faster reconstructions.
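Collected in one place, and assuming the usual pivoting convention \(AP = QR\), the steps described in Eqs. (5)–(8) read:

```latex
u = A^{-1} g \;\;(5), \qquad
A P = Q R \;\;(6), \qquad
A^{-1} \approx P R^{-1} Q^{t} \;\;(7), \qquad
u = P R^{-1} Q^{t} g \;\;(8),
```

so each new reconstruction only requires the product \(Q^{t} g\), one triangular solve with \(R\), and the permutation \(P\).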
In addition, here g can be a matrix \(\mathbb{R}^{M\times S}\) with S columns (the number of slices we want to reconstruct), so we can get multiple images within the same operation, which also reduces the time per slice.
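As an illustrative sketch of this factor-once, reconstruct-many scheme, the following Python snippet uses dense NumPy/SciPy routines as a stand-in for the sparse SuiteSparseQR calls; the sizes and the random test system are toy assumptions, not the paper's data:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

# toy stand-in for the sparse system matrix A (more rows than columns, full rank)
rng = np.random.default_rng(0)
m, n, S = 60, 40, 5                 # rays, pixels, number of slices (toy sizes)
A = rng.standard_normal((m, n))
U_true = rng.standard_normal((n, S))
G = A @ U_true                      # simulated sinograms, one column per slice

# factor once: A P = Q R with column pivoting (Eq. (6))
Q, R, piv = qr(A, mode='economic', pivoting=True)

# reconstruct all S slices in one shot (Eq. (8)):
# one matrix product, one triangular solve, one permutation
Y = Q.T @ G
U = np.empty((n, S))
U[piv] = solve_triangular(R, Y)     # undo the column permutation

assert np.allclose(U, U_true)
```

Because the factorization is reused, each additional slice only costs the product with \(Q^t\), the triangular solve, and the permutation, which is what makes the multiple-slice case attractive.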
2.3 Q-Less Factorization Using Householder Reflections
As we mentioned, our matrices can get relatively big depending on the desired image resolution, so the computational resources needed to compute the decomposition can be extensive. Even if the system matrix A is sparse, Q can be dense. As a way to spare main memory, we can decide not to calculate the Q matrix explicitly. Instead, we can perform the factorization using Householder reflections and store only the set of Householder vectors, which describe the transforms to be applied as shown in (9). Here, \(H_i\) are the successive reflection matrices to apply to our right-hand side g. Apart from saving main memory, this technique also reduces the computation time.
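A minimal Python sketch of this idea (my own simplified implementation, without the column pivoting or sparsity handling that SuiteSparseQR applies): the factorization keeps only R and the Householder vectors, with \(Q = H_1 H_2 \cdots H_n\), so \(Q^t g\) is accumulated reflection by reflection without ever forming Q.

```python
import numpy as np

def householder_qr(A):
    """Q-less QR of A (m >= n): returns upper triangular R and the list of
    Householder vectors v_k defining Q = H_1 H_2 ... H_n, H_k = I - 2 v_k v_k^t."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    reflectors = []
    for k in range(n):
        x = A[k:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        v /= np.linalg.norm(v)
        reflectors.append(v)
        # apply H_k to the trailing submatrix without forming H_k explicitly
        A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])
    return np.triu(A[:n, :]), reflectors

def apply_qt(reflectors, G):
    """Compute Q^t G for right-hand sides G of shape (m, S), never forming Q."""
    G = np.array(G, dtype=float)
    for k, v in enumerate(reflectors):
        G[k:] -= 2.0 * np.outer(v, v @ G[k:])
    return G

# solve A U = G for several slices at once using only R and the reflectors
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
G = A @ rng.standard_normal((5, 3))
R, vs = householder_qr(A)
U = np.linalg.solve(R, apply_qt(vs, G)[:5])
assert np.allclose(A @ U, G)
```

Note that the stored reflectors take roughly the same space as A itself, whereas the explicit Q of a sparse rectangular matrix can be far denser.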
2.4 SuiteSparseQR Factorization Package
SuiteSparseQR [13] is an implementation of the multifrontal sparse QR factorization method. To exploit parallelism, it uses both BLAS and Intel's Threading Building Blocks, a shared-memory programming model for modern multicore architectures. The package is written in C++ with user interfaces for MATLAB, C, and C++, and works for both real and complex sparse matrices.
In our case, we are using the MATLAB interface, making use of BLAS parallelism and working with real sparse matrices.
3 Results and Discussion
To test the performance of the method, we run both the matrix factorization and the reconstructions on a server with four Intel Xeon E5-4620 8c/16T processors (8 cores per processor, 32 cores per node) and 256 GB of DDR3 RAM (a ratio of 8 GB/core).
We use a matrix corresponding to a resolution of \(256 \times 256\) pixels and 90 projections. Thus, the size of the matrix is \(90200 \times 65536\), with 40871478 non-zero elements and a density of 0.0069. In the next subsections we show the experimental time results when using different numbers of BLAS threads for the two variants of the factorization method, with one thread per physical core.
3.1 Factorization Step
In the first step, we need to compute the factorization shown in Eq. (6). As explained in Sect. 2.2, the factorization of the matrix can be done once, before the reconstruction of the images, so its speed is not as critical. Even so, it is desirable to reduce the factorization time as much as possible, both to save computing time and to avoid possible system failures that could abort the process and spoil hours of work.
As we can see in Table 1, it is much faster to store the factorization in Householder form than to form the Q matrix explicitly: regardless of the number of BLAS threads employed, it is around 3 times faster, which is a significant difference.
In Figs. 1 and 2 we can see the Speedup and Efficiency of both methods. We compute the Speedup for p processors as \(S_{p}=T_{1}/T_{p}\), where \(T_{1}\) is the time with 1 processor and \(T_{p}\) the time with p processors. The Efficiency is \(E_{p}=S_{p}/p\). In a perfect parallel algorithm, the Speedup equals the number of processors and the Efficiency is 1, which means we are taking advantage of 100\(\%\) of the resources.
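These definitions are easy to encode; a small helper with purely hypothetical timings (not the measured values from Table 1):

```python
def speedup(t1, tp):
    """S_p = T_1 / T_p."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E_p = S_p / p; a value of 1 means 100% of the resources are exploited."""
    return speedup(t1, tp) / p

# hypothetical example: 100 s serial time, 25 s with 8 processors
print(speedup(100.0, 25.0))        # 4.0
print(efficiency(100.0, 25.0, 8))  # 0.5
```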
However, we observe lower values. The Householder factorization has slightly better performance except when using 32 processors, where its Speedup is 8 versus the 10.2 of the explicit Q factorization. In Fig. 2 we can see that with more than 4 processors we are using less than 50\(\%\) of the computational resources. Since we are working with sparse matrices of low density, it is usual to obtain lower efficiency than with dense matrices.
3.2 Reconstruction Step
We also need to verify which method is better in the reconstruction phase and whether we get good performance when using more resources. In Table 2 we show the results for reconstructing just 1 slice (a single right-hand-side vector) or 128 slices, as per Eq. (8). When we want to reconstruct only 1 slice, the explicit Q factorization gets lower computational times, and they improve slightly when using more processors, up to 16; with the Householder form, we only get better performance with up to 8 processors. However, the performance here is worse than in the factorization step, as we can see in Figs. 3 and 4: the Speedup is very low regardless of the number of threads used, and the highest Efficiency we get is only 0.5, using 2 threads.
When we are dealing with multiple right-hand sides, in this case 128, the behavior is different. We can see in the table that it is now faster to perform the reconstruction using the Householder reflections: we get the reconstructions around twice as fast (2.8 times faster for 16 threads). On the other hand, the Speedup and Efficiency of each method are not better than in the previous case, as we can see in Figs. 5 and 6. We get very low Efficiency, wasting resources.
3.3 Image Quality
In our preliminary work [7], we concluded that if we work with a full-rank system matrix, the images reconstructed by the QR factorization have very high quality. To verify this for the matrix we study here, we show the quality results, measured with the Mean Absolute Error (MAE), the Peak Signal-to-Noise Ratio (PSNR) [19], and the Structural Similarity Index (SSIM). Since we are reconstructing 128 slices, we show the minimum, maximum and average results of both reconstruction techniques.
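For reference, the two reported error metrics follow the standard definitions; a short sketch (the `peak` parameter is the image dynamic range, assumed normalized to 1 here):

```python
import numpy as np

def mae(ref, img):
    """Mean Absolute Error between reference and reconstruction."""
    return float(np.mean(np.abs(ref - img)))

def psnr(ref, img, peak=1.0):
    """Peak Signal-to-Noise Ratio in dB: 10 log10(peak^2 / MSE)."""
    mse = np.mean((ref - img) ** 2)
    return float(10.0 * np.log10(peak**2 / mse))

# toy example: a reconstruction off by a constant 0.01 everywhere
ref = np.zeros((4, 4))
img = ref + 0.01
print(mae(ref, img))   # 0.01
print(psnr(ref, img))  # ~40 dB
```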
In Figs. 7 and 8 we can see the PSNR and MAE results, respectively. We do not show the SSIM results, since all the images get an SSIM equal to 1. This means that all the reconstructions have the same structure as the reference image (we are not losing significant internal structures). As we observe in the figures, the reconstructions obtained through the Householder form have better quality, which reflects the better numerical stability of the factorization: their PSNR is around 10 dB higher and their error lower. But both methods achieve very high quality. Notice that we are getting MAEs of order \(10^{-12}\), which is almost negligible.
In Fig. 9 we show the reference CT image of an abdomen and the images reconstructed with both techniques. As we can see, they are almost identical to the human eye, since we cannot discern significant differences or information loss.
4 Conclusion
In this work, we have conducted a study of the efficiency of applying a direct algebraic technique to the CT image reconstruction problem. We have compared two methods, the QR factorization forming the matrix Q explicitly, or using the matrix Q in the form of Householder reflections. To verify the performance of the methods we have used the SuiteSparseQR library, which includes a parallel implementation of these algorithms. We have used a server to get the experimental time performance, using up to 32 processors.
Regarding the quality obtained by both methods, we have determined that the Householder factorization is more numerically stable, so the images have less error. However, these differences involve very small magnitudes that are not perceptible to the human eye.
On the other hand, we have verified that the time needed to perform the factorization of the system matrix is much higher if we form the matrix Q explicitly. This longer runtime also increases the exposure to system failures, power outages, etc. In addition, we have verified that the reconstructions are faster when we reconstruct several images simultaneously: we get less time per slice in general, and less time using the Householder form in particular. Since in practice we are going to reconstruct more than one slice at a time (a CT study usually has at least 100 slices), we conclude that it is more efficient to use Householder reflections for our application. In this way we can reconstruct volumes with high quality in less than 30 min.
Finally, we have determined that the efficiency that is achieved using fine-grained parallelism in many-core servers is not good. We observe that we do not take advantage of all the allocated resources, obtaining very low Speedups. Still, the time used to solve 128 slices with 32 threads is slightly lower than using the 32 threads to solve each slice independently using coarse-grain parallelism.
At this point, it is still necessary to employ HPC computers, since this implementation requires a large amount of main memory; that is the reason we are not able to compute the problem for higher image resolutions. As future work, we plan to use out-of-core techniques that read the data stored in blocks from the hard drive when a particular block is needed for the computation, instead of keeping it always loaded in main memory. In this way, we could reconstruct bigger problems on workstations with less RAM and thus lower cost.
References
1. Brown, R.W., Haacke, E.M., Cheng, Y.C.N., Thompson, M.R., Venkatesan, R.: Magnetic Resonance Imaging: Physical Principles and Sequence Design. Wiley, Hoboken (2014)
2. Brooks, R., Chiro, G.D.: Principles of computer assisted tomography (CAT) in radiographic and radioisotopic imaging. Phys. Med. Biol. 21(5), 689–732 (1976)
3. Flores, L., Vidal, V., Verdú, G.: Iterative reconstruction from few-view projections. Procedia Comput. Sci. 51, 703–712 (2015)
4. Parcero, E., Flores, L., Sánchez, M., Vidal, V., Verdú, G.: Impact of view reduction in CT on radiation dose for patients. Radiat. Phys. Chem. 137, 173–175 (2017)
5. Flores, L.A., Vidal, V., Mayo, P., Rodenas, F., Verdú, G.: Parallel CT image reconstruction based on GPUs. Radiat. Phys. Chem. 95, 247–250 (2014)
6. Chillarón, M., Vidal, V., Segrelles, D., Blanquer, I., Verdú, G.: Combining grid computing and Docker containers for the study and parametrization of CT image reconstruction methods. Procedia Comput. Sci. 108, 1195–1204 (2017)
7. Chillarón, M., Vidal, V., Verdú, G., Arnal, J.: CT medical imaging reconstruction using direct algebraic methods with few projections. In: Shi, Y., et al. (eds.) ICCS 2018. LNCS, vol. 10861, pp. 334–346. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93701-4_25
8. Padole, A., Ali Khawaja, R.D., Kalra, M.K., Singh, S.: CT radiation dose and iterative reconstruction techniques. Am. J. Roentgenol. 204(4), W384–W392 (2015)
9. Andersen, A.H., Kak, A.C.: Simultaneous algebraic reconstruction technique (SART): a superior implementation of the ART algorithm. Ultrason. Imaging 6(1), 81–94 (1984)
10. Yu, W., Zeng, L.: A novel weighted total difference based image reconstruction algorithm for few-view computed tomography. PLoS ONE 9(10), 1–10 (2014). https://doi.org/10.1371/journal.pone.0109345
11. Rodríguez-Alvarez, M.J., Sanchez, F., Soriano, A., Moliner, L., Sanchez, S., Benlloch, J.M.: QR-factorization algorithm for computed tomography (CT): comparison with FDK and conjugate gradient (CG) algorithms. IEEE Trans. Radiat. Plasma Med. Sci. 2(5), 459–469 (2018)
12. Golub, G.H., Van Loan, C.F.: Matrix Computations, vol. 3. Johns Hopkins University Press, Baltimore (2013)
13. Davis, T.A.: Algorithm 915, SuiteSparseQR: multifrontal multithreaded rank-revealing sparse QR factorization. ACM Trans. Math. Softw. 38(1), 8:1–8:22 (2011)
14. Joseph, P.: An improved algorithm for reprojecting rays through pixel images. IEEE Trans. Med. Imaging 1(3), 192–196 (1982)
15. Yan, K., Wang, X., Lu, L., Summers, R.M.: DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. J. Med. Imaging 5(3), 036501 (2018)
16. Golub, G.H., Ortega, J.M.: Scientific Computing: An Introduction with Parallel Computing. Academic Press Professional Inc., Cambridge (1993)
17. Arridge, S., Betcke, M., Harhanen, L.: Iterated preconditioned LSQR method for inverse problems on unstructured grids. Inverse Prob. 30(7), 075009 (2014)
18. Hansen, P.C.: The L-curve and its use in the numerical treatment of inverse problems (1999)
19. Hore, A., Ziou, D.: Image quality metrics: PSNR vs. SSIM. In: 2010 20th International Conference on Pattern Recognition, pp. 2366–2369. IEEE (2010)
Acknowledgements
This research has been supported by “Universitat Politècnica de València”, “Generalitat Valenciana” under PROMETEO/2018/035 co-financed by FEDER funds, as well as ACIF/2017/075 predoctoral grant, and the “Spanish Ministry of Economy and Competitiveness” under Grant TIN2015-66972-C5-4-R and TIAMHA co-financed by FEDER funds.
Chillarón, M., Vidal, V., Verdú, G. (2019). Parallel CT Reconstruction for Multiple Slices Studies with SuiteSparseQR Factorization Package. In: Rodrigues, J.M.F., et al. Computational Science – ICCS 2019. ICCS 2019. Lecture Notes in Computer Science(), vol 11538. Springer, Cham. https://doi.org/10.1007/978-3-030-22744-9_12