Introduction

The prevalence of hepatic steatosis is increasing rapidly worldwide. This is largely attributed to the association with obesity and insulin resistance in non-alcoholic fatty liver disease (NAFLD) [1, 2].

Detection and quantification of hepatic steatosis are clinically important. In NAFLD, steatosis is the hepatic manifestation of the metabolic syndrome and the earliest biomarker for the development of liver fibrosis in the more severe condition of non-alcoholic steatohepatitis (NASH). Early diagnosis and treatment of NASH can prevent the potential development of cirrhosis and hepatocellular carcinoma (HCC) [3–5]. In hepatitis C, steatosis is associated with more severe fibrosis and rapid disease progression [6, 7]. In liver transplantation surgery, the presence of steatosis impairs the regenerative capacity of the liver in both donor and recipient [8, 9].

Liver biopsy remains the reference test for the evaluation of hepatic steatosis, despite well-established drawbacks regarding its invasiveness and sampling error due to small sample size and inter-observer variability [10].

Many studies have focused on the role of imaging techniques as a non-invasive alternative to liver biopsy for detecting and quantifying hepatic steatosis [11–13]. The reported sensitivities and specificities vary substantially, both between imaging techniques and between studies investigating the same technique. Although magnetic resonance spectroscopy (1H-MRS), generally considered the best technique, is increasingly used as a reference standard instead of liver biopsy, no evidence-based consensus currently exists on this topic.

The purpose of this systematic review therefore was to summarise the available literature on the accuracy of ultrasound (US), computed tomography (CT), magnetic resonance imaging (MRI) and 1H-MRS for the evaluation of hepatic steatosis with histopathology as the reference test. Subsequently, we aimed to identify the most accurate technique by meta-analysis.

Materials and methods

Literature search and study selection

We searched the MEDLINE (January 1966–November 2009), EMBASE (January 1980–November 2009), CINAHL and Cochrane databases without language restrictions with the assistance of an experienced clinical librarian. We combined Medical Subject Headings (MeSH) terms and accompanying entry terms for the patient group (patients with hepatic steatosis) and the index test (US, CT, MRI, 1H-MRS). The search strategy is described in detail in Online Resource 1.

Two reviewers (A.B. and J.v.W.) read the titles and abstracts of all the articles obtained to select potentially relevant papers (original papers that addressed the diagnostic accuracy of US, CT, MRI or 1H-MRS for detecting hepatic steatosis in humans with histopathology as the reference test in ≥10 individuals). The reference lists of selected papers and of narrative reviews were screened to complete the search.

The full texts of potentially relevant papers were reviewed for inclusion by the same reviewers independently. Inclusion criteria were: (a) hepatic steatosis was evaluated with US, CT, MRI and/or 1H-MRS; (b) imaging techniques met the minimum technical requirements of grey scale and real time for US; ≤120 kV for CT and ≥1 Tesla for MRI; (c) histopathology as the reference test; (d) evaluation of ≥10 human individuals; (e) criteria for a positive index test were clearly explained; (f) examination method of steatosis on liver biopsy was clearly explained; (g) data on diagnostic accuracy were reported. Exclusion criteria were: (a) duplicate publication; (b) reporting of combined data for different imaging techniques or data on the single technique could not be extracted; (c) no original research. Papers were not blinded with regard to authors’ names, affiliations or journal. The reviewers resolved all disagreements about inclusion and data extraction by consensus after face-to-face discussion.

Data extraction

From the articles included, the reviewers (A.B. and J.v.W.) independently recorded data using a standardised form. Papers were translated if necessary.

Methodological quality

Methodological quality was assessed based on the Quality Assessment of Studies of Diagnostic Accuracy included in Systematic Reviews (QUADAS) guidelines [14]. To be reasonably sure that the condition of the liver did not change between the two tests we chose a period of 1 month as a quality indicator [15, 16]. Additionally, we noted whether the study design was prospective or retrospective.

Patient characteristics

For each study, we extracted data on (a) sample size; (b) male-female ratio; (c) patient age; (d) body mass index (BMI) and (e) patient spectrum.

Imaging features and evaluation

For each imaging technique we recorded (a) the number of patients; (b) the criteria used for steatosis evaluation and (c) the cut-off values used. We noted whether cut-off values were defined prospectively or retrospectively. Additionally, for US we noted the type and frequency of the probe(s) used. For CT we noted: (a) the type of CT and (b) the imaging parameters (kV and mAs). For MRI, we noted: (a) magnetic field strength; (b) imaging sequence; (c) imaging parameters used; (d) whether breath holds were used; and (e) correction for T2* effects. For 1H-MRS, we noted: (a) magnetic field strength; (b) imaging sequence; (c) imaging parameters; (d) voxel size; (e) whether breath holds were used and (f) correction for T1 or T2 effects.

Reference test

For liver biopsy we included data on: (a) the number of patients biopsied; (b) cut-off values for steatosis grades and whether the presence of (c) fibrosis (d) inflammation and (e) iron was evaluated in the biopsy specimen.

Data for calculation of diagnostic accuracy

We extracted available data on true-positives (TP), false-negatives (FN), false-positives (FP) and true-negatives (TN) for detecting steatosis with the selected imaging technique to construct 2 × 2 contingency tables. Available 3 × 3 and 4 × 4 tables were dichotomised. Many different cut-off values for positive results (steatosis present) on liver biopsy were compared to the imaging techniques. We therefore grouped accuracy results from cut-off values that were almost equal into four subgroups to enable meta-analysis:

  1. Group 1: cut-off values of >0%, >2% and >5% steatosis on biopsy;

  2. Group 2: cut-off values of >10%, >15% and >20% steatosis on biopsy;

  3. Group 3: cut-off values of >25%, >30% and >33% steatosis on biopsy and the qualitative designation of “moderate or severe” steatosis;

  4. Group 4: cut-off values of >50%, >60% and >66% steatosis on biopsy and the qualitative designation of “severe” steatosis.
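The grouping above can be expressed as a simple lookup. The following Python sketch is illustrative only: the function name, the data structures and the handling of cut-off values that fall outside the four groups are assumptions and are not taken from the original analysis.

```python
# Illustrative mapping of biopsy cut-off values (% steatosis) to the four
# groups listed above. Names and the treatment of unlisted values are
# assumptions made for this sketch.
GROUPS = {
    1: {0, 2, 5},
    2: {10, 15, 20},
    3: {25, 30, 33},
    4: {50, 60, 66},
}

# Qualitative designations named in the group definitions.
QUALITATIVE = {"moderate or severe": 3, "severe": 4}


def cutoff_group(cutoff):
    """Return the group number (1-4) for a biopsy cut-off, or None if unlisted."""
    if isinstance(cutoff, str):
        return QUALITATIVE.get(cutoff.strip().lower())
    for group, values in GROUPS.items():
        if cutoff in values:
            return group
    return None
```

Returning None for an unlisted value (for example 40%) leaves the decision on such datasets to the analyst rather than forcing them into the nearest group.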

During analysis we corrected for dependent data such as results presented for different readers or for multiple imaging techniques within one study population. For CT and MRI, we did not use data obtained by subjective visual evaluation of examinations for analysis. If raw data in terms of 2 × 2 tables were unavailable, we attempted to contact authors for completion or verification of data.
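To illustrate how a graded 3 × 3 table can be collapsed into the 2 × 2 format used for analysis, a minimal Python sketch follows. It assumes that the same grade threshold is applied to both the biopsy and the imaging result; the exact dichotomisation rule used per study is not specified here, and the counts in the example table are invented for demonstration.

```python
def dichotomise(table, threshold):
    """Collapse a square graded table into (TP, FN, FP, TN).

    table[i][j] is the number of patients with biopsy grade i and imaging
    grade j, with grades ordered from least to most severe. Grades at or
    above `threshold` count as positive for both tests (an assumption).
    """
    tp = fn = fp = tn = 0
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            biopsy_positive = i >= threshold
            imaging_positive = j >= threshold
            if biopsy_positive and imaging_positive:
                tp += n
            elif biopsy_positive:
                fn += n
            elif imaging_positive:
                fp += n
            else:
                tn += n
    return tp, fn, fp, tn


# Hypothetical 3 x 3 table (rows: biopsy grade, columns: imaging grade).
example = [
    [30, 5, 1],
    [6, 20, 4],
    [1, 3, 10],
]

mild_or_worse = dichotomise(example, threshold=1)  # grade 0 versus grades 1-2
severe_only = dichotomise(example, threshold=2)    # grades 0-1 versus grade 2
```

Each threshold yields a different 2 × 2 table from the same patients, which is why one study can contribute datasets to more than one cut-off group.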

Data analysis

We performed bivariate random-effects analysis for the pooled sensitivities and specificities per cut-off value group for each imaging technique [17]. In this analysis, the logit-transformed sensitivities and logit-transformed specificities from individual studies in a meta-analysis are assumed to follow a bivariate normal distribution around a mean logit sensitivity and a mean logit specificity. After antilogit transformation, summary estimates of sensitivity and specificity with 95% confidence intervals were calculated. Additionally, we calculated the natural logarithm (ln) of the diagnostic odds ratio (DOR) as the sum of the logit sensitivity and the logit specificity: lnDOR = logit(sensitivity) + logit(specificity). The DOR is a single indicator of test performance [18]. A higher lnDOR value indicates a better discriminatory test performance. If the lnDOR is not significantly different from 0, a test does not discriminate between patients with the disorder and those without it. We performed z-tests to compare sensitivities, specificities and lnDORs between imaging techniques. A p value < 0.05 was considered statistically significant. All meta-analyses were performed with SAS statistical software (version 9.1, SAS Institute, Cary, NC, USA).
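The quantities defined above can be illustrated for a single 2 × 2 table. The Python sketch below is not the bivariate random-effects model itself, which was fitted in SAS; it only shows the logit transformation, the lnDOR identity and a generic two-sided z-test. The z-test assumes independent estimates with known standard errors, which is an assumption of this sketch, and the counts are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a proportion p (undefined for p = 0 or p = 1)."""
    return math.log(p / (1.0 - p))


def accuracy_from_counts(tp, fn, fp, tn):
    """Sensitivity, specificity and lnDOR from one 2 x 2 table.

    Tables with a zero cell would need a continuity correction, which is
    not applied here.
    """
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ln_dor = logit(sens) + logit(spec)  # equals ln((TP * TN) / (FP * FN))
    return sens, spec, ln_dor


def z_test(est_a, se_a, est_b, se_b):
    """Two-sided z-test for the difference between two independent estimates."""
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p value
    return z, p


# Hypothetical counts, for illustration only.
sens, spec, ln_dor = accuracy_from_counts(tp=37, fn=7, fp=6, tn=30)
```

Because lnDOR is the sum of two logits, a test can reach the same lnDOR with high sensitivity and moderate specificity or the reverse, so it is best read alongside the separate sensitivity and specificity estimates.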

Results

Literature search and study selection

The literature search yielded 6992 unique references (Fig. 1). After reading the titles and abstracts, the reviewers selected 179 potentially relevant articles, of which 46 papers were finally included [15, 19–63]. Twenty-eight evaluated US, twelve evaluated CT, ten evaluated MRI and five evaluated 1H-MRS. Eight papers compared two imaging techniques [22, 23, 43, 47, 54, 58, 61, 62] and one evaluated three techniques within the same population [56] (Table 1). Two studies were published in German [28, 45] and the remainder in English.

Fig. 1 Flow diagram of the articles included

Table 1 Patient and design characteristics of 46 papers included

Data extraction

Methodological quality

An overview of the results is given in Fig. 2. In general, 13 out of 46 (28%) studies fulfilled at least 10 of the 13 methodological criteria [19, 20, 25, 29, 33, 34, 36, 38, 41, 44, 48, 49, 52]. A complete table of individual study scores is available upon request from the authors.

Fig. 2 Study design characteristics of the 46 studies included

Patient characteristics

The 46 papers comprised 4715 patients with a median study size of 81 patients (range 20–589). Mean age reported by 37 studies was 44.5 years (range 11–89). Mean BMI reported by 18 studies was 26.6 kg/m² (range 15–54 kg/m²). The male-to-female ratio reported in 40 studies was 1.62:1. Potential living liver donors constituted 34% of the total population (1593/4715). If specified, the disease spectrum comprised most frequently chronic hepatitis C (n = 1040) and NAFLD/NASH (n = 710). See also Table 1.

Imaging features and evaluation

US imaging features are outlined in Online Resource 2. More than half (15/28) of the studies used the widely accepted criteria for subjective visual steatosis evaluation of bright liver with increased liver-kidney contrast; blurring of intrahepatic vessels and diaphragm and loss of echoes of posterior hepatic segments [57, 60]. Two studies evaluated quantitative methods to assess liver steatosis [30, 59].

CT imaging features are shown in Online Resource 3. Two papers evaluated both contrast-enhanced CT and unenhanced CT [22, 38]. All other papers evaluated unenhanced CT. Average Hounsfield Units (HU) in selected regions of interests (ROIs) from the liver were compared with average HU values of ROIs from the spleen. The spleen was used as an internal reference in all the papers included, either by measuring the liver-minus-spleen attenuation value (L-S) or the liver-to-spleen ratio (L/S). Two studies also evaluated steatosis by subtracting hepatic blood attenuation from the total hepatic attenuation using an algorithm. Four papers had defined cut-off values prospectively [22, 43, 58, 61]. Six papers defined optimal cut-off values retrospectively [15, 35, 38, 44, 52, 62]. The chosen cut-off values, however, varied substantially.

MRI characteristics are outlined in Online Resource 4. Magnetic field strength for all included papers was 1.5 Tesla. Sequences used were T1-weighted dual spin echo, T1-weighted dual gradient echo or T1-weighted spoiled gradient echo for in-phase and out-of-phase (IP/OP) chemical shift imaging. Two studies also evaluated T2-weighted fast spin echo imaging with and without fat suppression (±FS). Liver steatosis was evaluated by the amount of signal intensity (SI) loss on OP images compared with IP images and by SI difference between FS and non-FS images. Exact measuring methods, however, differed. Correction for T2* effects was performed by d’Assignies et al [24]. Cho et al were the only authors to define cut-off values prospectively [22].

1H-MRS imaging characteristics are outlined in Online Resource 5. Magnetic field strengths were 1.5 T (3/5) and 3 T (2/5). Four papers used a point-resolved spectroscopic sequence (PRESS), one in combination with chemical-shift selective water suppression (CHESS). Krššák et al used a stimulated echo acquisition mode (STEAM) sequence [41]. Voxel sizes varied from 18 × 18 × 18 mm to 30 × 30 × 40 mm. Hepatic steatosis was evaluated by the ratio of lipid versus water peaks and by the choline-to-lipid ratio. One paper each corrected for T2 effects or T1 and T2 effects [24, 41]. All included studies defined cut-off values retrospectively.

Reference test

Details of the reference test are outlined in Online Resource 6. Cut-off values for grading steatosis severity differed among the articles included, complicating their comparison. Two studies compared semi-quantitative visual analysis of steatosis with the automatic vacuole segmentation method [24] or gas-liquid chromatography [41]. Thirty studies examined the presence of fibrosis, 24 examined the presence of inflammatory activity and eight studies examined the presence of iron.

Data for calculation of diagnostic accuracy

Extraction of 2 × 2 accuracy data resulted in 48 complete datasets for US; 33 for CT; 15 for MRI and 7 for 1H-MRS. For both CT and MRI, 5 datasets were not included for analysis because the examinations were visually assessed. This resulted in 28 datasets being analysed for CT and 10 datasets being analysed for MRI. Datasets with TP and FN only were also included for analysis: 4 for US and 1 for CT. The number of datasets per cut-off group is noted in Table 2. Six authors were contacted for completion or verification of data; three answered, one of whom supplied additional datasets. Three studies reported datasets for multiple readers [30, 31, 39].

Table 2 Summary estimates of US, CT, MRI and 1H-MRS per combined cut-off value group

Data analysis

Sensitivity and specificity values including 95% confidence intervals (CI) and significant differences (p < 0.05) for the imaging techniques are presented in Table 2 and in more detail in the Online Resources 7–10.

Group 1 (cut-off values of >0%, >2% and >5% steatosis)

Sensitivity and specificity estimates were 73.3% and 84.4% for US; 46.1% and 93.5% for CT; 82.0% and 89.9% for MRI and 88.5% and 92.0% for 1H-MRS, respectively. The sensitivity of 1H-MRS was significantly higher than that of US (p = 0.04) and CT (p < 0.01) and the sensitivity of MRI was significantly higher than that of CT (p = 0.02). No significant differences in specificity were found. The lnDOR of 1H-MRS was significantly higher compared with US (p = 0.02) and CT (p = 0.04).

Group 2 (cut-off values of >10%, >15% and >20% steatosis)

Sensitivity and specificity estimates were 90.5% and 69.6% for US; 57.0% and 88.1% for CT; 90.0% and 95.3% for MRI and 82.6% and 94.3% for 1H-MRS, respectively. CT had a significantly lower sensitivity compared with US, MRI and 1H-MRS (p < 0.01, p < 0.01 and p = 0.02 respectively). Although US had a sensitivity comparable to MRI and 1H-MRS, the specificity was significantly lower than CT (p < 0.01), MRI (p < 0.01) and 1H-MRS (p = 0.01). The lnDOR of MRI was significantly higher than the lnDOR for both US (p = 0.05) and CT (p < 0.01). 1H-MRS had a significantly higher lnDOR than CT (p = 0.03).

Group 3 (cut-off values of >25%, >30% and >33% steatosis)

Sensitivity and specificity estimates were 85.7% and 85.2% for US; 72.0% and 94.6% for CT; 97.4% and 76.1% for MRI and 72.7% and 95.7% for 1H-MRS, respectively. The sensitivity of MRI was significantly higher than that of CT (p = 0.01) and 1H-MRS (p = 0.03). The specificity of MRI, however, was significantly lower than that of both CT (p = 0.02) and 1H-MRS (p = 0.04). Further, the sensitivity of US was significantly higher than that of CT (p = 0.03), whereas the specificity of US was significantly lower than that of CT (p = 0.03). Analysis of the lnDOR did not show any significant differences.

Group 4 (cut-off values of >50%, >60% and >66% steatosis)

For this group, data analysis was possible for US only. Sensitivity and specificity estimates were 91.1% and 91.9% respectively.

Figure 3 shows the diagnostic performances (lnDOR) of all imaging techniques per cut-off value group, illustrating the better performance for both MRI and 1H-MRS compared with US and CT.
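As a rough cross-check of this pattern, the lnDOR can be approximated from the summary sensitivities and specificities reported above for each cut-off group, using lnDOR = logit(sensitivity) + logit(specificity). The Python sketch below does this; the values it produces are approximations derived from rounded percentages and may differ slightly from those in Table 2 and Fig. 3. The variable and function names are chosen for this illustration.

```python
import math

# Summary estimates (sensitivity %, specificity %) as reported in the text
# for each cut-off value group.
SUMMARY = {
    "Group 1": {"US": (73.3, 84.4), "CT": (46.1, 93.5),
                "MRI": (82.0, 89.9), "1H-MRS": (88.5, 92.0)},
    "Group 2": {"US": (90.5, 69.6), "CT": (57.0, 88.1),
                "MRI": (90.0, 95.3), "1H-MRS": (82.6, 94.3)},
    "Group 3": {"US": (85.7, 85.2), "CT": (72.0, 94.6),
                "MRI": (97.4, 76.1), "1H-MRS": (72.7, 95.7)},
    "Group 4": {"US": (91.1, 91.9)},
}


def ln_dor(sens_pct, spec_pct):
    """Approximate lnDOR from percentage sensitivity and specificity."""
    sens, spec = sens_pct / 100.0, spec_pct / 100.0
    return math.log(sens / (1.0 - sens)) + math.log(spec / (1.0 - spec))


LN_DOR = {
    group: {tech: ln_dor(*values) for tech, values in techs.items()}
    for group, techs in SUMMARY.items()
}

# Print the techniques in each group ranked by approximate lnDOR.
for group, techs in LN_DOR.items():
    ranked = sorted(techs.items(), key=lambda item: item[1], reverse=True)
    print(group, ", ".join(f"{tech}: {value:.2f}" for tech, value in ranked))
```

This ranking reflects point estimates only; whether a difference is statistically significant depends on the confidence intervals and z-tests reported per group.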

Fig. 3 Comparison of logarithmic diagnostic odds ratios of US, CT, MRI and 1H-MRS

Discussion

Our results show that MRI and 1H-MRS perform better than US and CT over the total range of cut-off values that were analysed. For the lower cut-off ranges, we found significant differences in favour of both MRI and 1H-MRS.

These findings suggest that MRI and 1H-MRS also perform better than US and CT for detecting separate disease grades, especially for mild disease (<30% steatosis). This is of value in clinical practice when an accurate estimation of the amount of hepatic steatosis is needed. Additional benefits of MRI and 1H-MRS over US are the quantitative measurements which are less subject to inter- and intraobserver variability [64]. For CT, drawbacks are the radiation exposure and factors affecting the accuracy of the results, such as imaging parameters or iron accumulation [11, 65].

Several limitations of our study must be considered. First, the studies included showed great heterogeneity regarding patient spectrum, reference test, index test and data reporting. Therefore, comparison of separate disease grades and sub-analysis of different aetiologies of steatosis (e.g. NAFLD/NASH versus HCV) was precluded. Standardisation of future study designs is needed to enable these comparisons. Moreover, no studies compared all four imaging techniques within the same population, which would be the ideal study design. We were therefore restricted to summarising accuracy data for each technique separately across all the studies included. These indirect comparisons of studies, which showed substantial methodological heterogeneity, might have biased our results.

Second, we had to make the decision to group accuracy results from different cut-off values into four subgroups to enable meta-analysis and to reduce the number of summary estimates and comparisons. The ideal situation would have been to analyse accuracy results for each cut-off value separately.

Third, a standard method for meta-analysis of diagnostic studies is the summary Receiver Operating Characteristic (sROC). For the sROC approach, a negative correlation between the logit sensitivity and the logit specificity is required [17]. As we did not find this negative correlation in our data, plotting of sROC curves was not possible. We therefore used the lnDOR to summarise our results.

Fourth, we did not analyse 3 × 3 or 4 × 4 data as the reporting thereof was scarce. By dichotomising the results, we lost information on the capability of imaging techniques to diagnose the degree of steatosis.

A fifth limitation was that we chose to exclude articles with 1H-MRS as the reference standard [66–73]. 1H-MRS has been increasingly used as a reference standard for steatosis quantification since the results from the Dallas Heart Study were published by Szczepaniak et al in 2005 [74]. However, no clear consensus on this topic currently exists. The articles that were excluded all compared MRI with 1H-MRS and showed good correlations. As a result of their exclusion, only a small number of datasets were available for analysis of MRI. Additionally, the included articles for MRI did not evaluate triple-echo, multi-echo or multi-interference techniques, whereas the aforementioned excluded articles did. Guiu et al recently suggested that these new techniques should replace the classical dual-echo chemical shift imaging methods, which are not reliable for quantification of liver fat in the case of liver iron overload because of T2* effects [75]. We believe that the small number of available datasets in combination with the techniques used could have negatively influenced our accuracy results for MRI.

We therefore recommend that consensus on the role of 1H-MRS as the reference standard needs to be established. For liver biopsy evaluation, we recommend using the classification from Kleiner et al for a uniform grading of hepatic steatosis [76].

In conclusion, we have shown that MRI and 1H-MRS are most accurate for the detection of hepatic steatosis. For future research, it is important to improve the study design and reporting of accuracy results.