Introduction

Over the last several decades, occlusal surfaces have been found to be one of the most prevalent sites for caries development in children and adolescents, mainly due to their anatomical susceptibility [1,2,3,4,5,6]. Because a valid and reproducible caries diagnosis and assessment could not be made by visual examination (VE) alone, there was a consistent demand for additional diagnostic devices for caries detection and diagnostics in pits and fissures. In addition to VE, conventional bitewing radiography (conventional BWR), digital bitewing radiography (digital BWR) and laser fluorescence (LF) measurements [7] were used in clinical practice or specifically introduced on the dental market in order to overcome the limitations of visual and/or tactile examination as well as to image and/or quantify the caries process to a certain degree [8]. On the basis of the acquired diagnostic information, the clinician should be enabled to make individual decisions about caries monitoring, prevention and/or operative intervention [9,10,11].

Numerous in vitro and in vivo caries detection, diagnostic, assessment and/or monitoring studies have been designed, conducted and published during the last few decades to describe the diagnostic performance of test methods in terms of validity (the diagnostic accuracy in relation to a reference standard) and intra-/inter-examiner reliability (the reproducibility of a diagnosis between different time points and examiners). Most recently, systematic reviews and meta-analyses have merged the available data and drawn conclusions mainly separately for each diagnostic method [12,13,14,15,16]. In addition, this author group [13,14,15] has mentioned substantial heterogeneity between the included diagnostic studies, and problematically, little attention has been paid to this important methodological issue so far; therefore, potential methodological sources of bias might be undetected and, furthermore, may also potentially skew meta-analysis data. Regarding this aspect, each diagnostic trial should ideally be designed similarly and should use equal scientific standards and protocols to generate comparable results that decrease the risk of bias (RoB) as much as possible. In contrast, previously published systematic reviews describe and report heterogeneity but do not exclude studies with a potentially high RoB. Therefore, this systematic review of the literature and meta-analysis was aimed, first, to identify caries diagnostic studies on pits and fissures that are tested with commonly used diagnostic methods, second, to evaluate study quality and identify only those studies with low/moderate RoB and, finally, to provide meta-analytic data on the diagnostic performance of clinically relevant detection and diagnostic methods.

Material and methods

The methodology of this systematic review was influenced by several recommendations or guidelines. The QUADAS 2 tool [17, 18], which was designed for the quality assessment of diagnostic accuracy studies, provided the basis for the RoB assessment. Here, the most recently published draft of the ‘Cochrane Handbook for Diagnostic Test Accuracy Reviews’ was also used [19]. The writing of this systematic review strictly followed the PRISMA-DTA statement (Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy Studies) for diagnostic studies in its latest version [20]. The PRISMA-DTA group developed criteria to evaluate the validity and applicability of diagnostic studies and to enhance the replicability of systematic reviews in this area. The present systematic review was registered on the PROSPERO platform (CRD42017069894).

Search strategy

The research question, inclusion and exclusion criteria and search strategy were developed on the basis of the PIRD concept [21]. Basically, this systematic review of the literature included in vitro and in vivo diagnostic studies that tested the diagnostic accuracy and/or reliability of different diagnostic methods for primary caries detection and assessment in human permanent posterior teeth (premolars and molars). In vivo studies were included regardless of the age of the population and the number of included patients or teeth. Studies containing information on primary teeth or teeth with restorations, secondary caries or artificially induced caries lesions were excluded. With respect to their clinical relevance, the following index tests were included in the search: VE, conventional BWR, digital BWR, LF measurements (DIAGNOdent 2095 or 2190, KaVo, Biberach, Germany), fibre-optic transillumination (FOTI, IC Lercher, Stockach, Germany) and quantitative light-induced fluorescence (QLF, Inspektor Research Systems, Amsterdam, The Netherlands). Other index test methods were not considered in this review. An essential characteristic of studies on diagnostic accuracy was the inclusion of a reference test, frequently also named the ‘gold standard’ or ‘reference standard’. The included in vitro studies had to use a histological technique to validate the ‘true’ caries extension; otherwise, the studies were excluded. Under in vitro conditions, several well-established histological techniques, e.g. slices, grinding, hemisection or microradiography, fulfil this prerequisite. In clinical studies, cavity preparation or biopsy can be considered equivalent for providing proof of the presence of any (dentin) caries [22]. As dental radiography was commonly applied under clinical conditions as well, it was also included [23, 24].

In relation to the previously formulated aims and the corresponding inclusion and exclusion criteria, a structured search of the literature was initiated in accordance with the mnemonic PIRD recommendations [21]. This concept included information about the study material or population, the selected index tests, possible reference tests and diagnoses of interest (outcomes). The final consented search items are shown in Table 1.

Table 1 Documentation of keywords according to the PIRD concept (Campbell et al. 2015)

Basic literature search and study selection according to PRISMA recommendations

The systematic search of the literature was performed in the MEDLINE (via PubMed) and EMBASE (via Ovid) electronic literature databases using the consented search terms (Table 1) according to standard procedures [20, 25]. The search included all publications that were listed in the databases until 31 December 2018 and were written in English. Grey literature was not included. Additionally, reference lists of included studies and reviews were screened to identify any studies that may have been missed. A few studies (N = 4) were found as a result of manual searches.

Identification of the relevant literature

All identified bibliographies (PubMed N = 946, EMBASE N = 836), including titles and abstracts, were exported to a bibliographic software package (X7.8 for Windows, Thomson Reuters). The imported set of records from each database, including hand searches, was merged into one core database to remove duplicate records and to facilitate retrieval of relevant articles. In the next step, duplicates (N = 696) were removed, and the title (and, if needed, the abstract of each bibliography) was checked as to whether it met the inclusion criteria; otherwise, the study was excluded. After the primary identification of includable studies and the removal of duplicates, 1090 records were identified.

Screening and eligibility check

The titles and abstracts were screened by two reviewers (SK, MJR) independently. The reviewers were not blinded to the names of the authors, institutions, journal or results of each publication. All records were counterchecked in relation to the initially consented inclusion and exclusion criteria. If papers met the inclusion criteria completely or partially, their full-text documents were obtained. Doubts or disagreements were continuously resolved by discussion with an experienced researcher (JK). After review of the titles and abstracts, records that were found to be irrelevant were excluded from further proceedings (N = 894). At this step, 196 records were identified for full-text reading. Studies (N = 56) that were found to be irrelevant after their full texts were read were excluded from further analysis (supplemental Table S0). Finally, 140 studies met the inclusion criteria and were read in detail.

Data collection from the selected studies

Following the recommendation for diagnostic test accuracy [26], the following relevant items were extracted: study type (in vivo or in vitro studies), study population and teeth (number and age of patients, type and number of permanent teeth used in the study), index test methods (methods, scoring criteria and cut-offs), reference standard method (type of histological validation method, scoring criteria and cut-offs), validity and/or intra- and inter-examiner reliability data for the overall caries detection level (D0 versus D1-D4; Marthaler 1966), dentin caries detection level (D0-2 versus D3-4, Marthaler 1966) [27] and 1/3 dentin caries detection level (D0-2 versus D3-4, Ekstrand et al. 1997) [28]. Two reviewers (SK, MJR) independently extracted the required data from all primary studies. Any doubts or disagreements were continuously resolved by discussion with an experienced researcher (JK) until a consensus was reached. All data were systematically entered into an EpiData database [29] (EpiData software version 2.0.9.57, EpiData Association, Denmark).

RoB assessment

To date, no suitable set of criteria exists for assessing RoB among caries diagnostic studies. Therefore, existing checklists and proposals [21, 30,31,32] were analysed and adapted to clinical/laboratory caries diagnostic studies. The developed set of criteria includes 16 signalling questions divided into four main domains used for RoB assessments during the review (supplemental Table S7). Using the RoB assessment tool, all included studies were re-evaluated and assessed independently by two reviewers (SK, MJR). An additional and blind assessment was performed by two other colleagues from the workgroup (FE, SM). All RoB assessments are listed in supplemental Tables S8a/b–S13a/b.

In addition to the initially performed systematic search and selection of the literature, all identified papers were further selected according to their RoB status. Here, seven core domains were selected (tooth selection, index test criteria, reference test criteria, incorporation bias, partial verification bias, differential verification bias, bias in the analysis), and each study had to show a low or moderate RoB in these domains; otherwise, the study was excluded from further analysis. In the next step, the remaining studies were crosschecked for the availability of sufficient validity data, i.e. cross-tabulation, sensitivity (SE), specificity (SP), positive predictive values (PPV), negative predictive values (NPV) or areas under the receiver operating characteristic curve (AUC).
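The selection rule described above can be sketched as a simple filter. This is only an illustration under stated assumptions: the function and variable names are hypothetical, the ratings are invented examples rather than data from this review, and a domain without a rating is treated here as failing the rule.

```python
# Hypothetical sketch of the RoB-based selection rule; not the authors' tool.
CORE_DOMAINS = (
    "tooth selection",
    "index test criteria",
    "reference test criteria",
    "incorporation bias",
    "partial verification bias",
    "differential verification bias",
    "bias in the analysis",
)
ACCEPTABLE = {"low", "moderate"}


def passes_rob_filter(ratings):
    """Return True only if every core domain is rated low or moderate.

    Assumption: a missing rating counts as not acceptable.
    """
    return all(ratings.get(domain) in ACCEPTABLE for domain in CORE_DOMAINS)


# Invented example ratings (not taken from the supplemental tables).
example_ok = {domain: "low" for domain in CORE_DOMAINS}
example_high = dict(example_ok, **{"incorporation bias": "high"})

print(passes_rob_filter(example_ok))    # all seven domains acceptable
print(passes_rob_filter(example_high))  # one domain rated high
```

A single high rating in any of the seven core domains is enough to exclude a study under this reading of the rule.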

Data handling, statistical procedures and meta-analysis

All data were entered into a database and later exported to an Excel spreadsheet (Excel 2010, Microsoft Corporation, Redmond, WA, USA). Descriptive analyses were performed using Microsoft Excel 2010 and the statistical package mada (version 0.5.9) [33] in RStudio [34]. If the included studies provided contingency tables, the data were used directly. If not, the numbers of true positives, false positives, false negatives and true negatives were calculated from the SE, SP, PPV and NPV values reported in the original publication. If this calculation was not possible, the corresponding study was excluded. Tables with zero cells were also corrected; when, for example, the number of true positives is zero, the software applies a continuity correction by changing the zero to 0.5, because the logit-based model cannot deal with zero cells. In some reports, statistical information was given for more than one examiner. In those cases, a mean was calculated by logit transformation.
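The data handling steps above can be illustrated with a short Python sketch. It is not the mada implementation: the function names are hypothetical, the continuity correction is assumed to be added to all four cells of a table that contains a zero, and the example counts are invented.

```python
import math


def accuracy_measures(tp, fp, fn, tn, correction=0.5):
    """SE, SP, PPV and NPV from a 2x2 contingency table.

    Assumption: if any cell is zero, `correction` is added to all four
    cells of that table; the exact rule used by the software may differ.
    """
    if min(tp, fp, fn, tn) == 0:
        tp, fp, fn, tn = (x + correction for x in (tp, fp, fn, tn))
    return {
        "SE": tp / (tp + fn),    # true positive rate
        "SP": tn / (tn + fp),    # true negative rate
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }


def logit(p):
    return math.log(p / (1 - p))


def inv_logit(x):
    return 1 / (1 + math.exp(-x))


def logit_mean(proportions):
    """Average proportions on the logit scale, then back-transform."""
    return inv_logit(sum(logit(p) for p in proportions) / len(proportions))


# Invented counts: 40 TP, 10 FP, 10 FN, 40 TN.
print(accuracy_measures(40, 10, 10, 40))
# Invented SE values of two examiners, averaged on the logit scale.
print(logit_mean([0.70, 0.85]))
```

Averaging on the logit scale keeps the result inside the interval (0, 1) and treats values near the boundaries less linearly than an arithmetic mean would.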

Meta-analytic statistics were calculated for all included diagnostic test methods and commonly used diagnostic thresholds. Diagnostic accuracy measures, i.e. SE, SP and the diagnostic odds ratio (DOR), and their 95% confidence intervals (95% CI) were calculated from the pooled data from all included studies. A bivariate diagnostic random-effects meta-analysis suggested by Reitsma et al. [35] was used to provide pooled estimates of SE and SP for the respective subgroups along with their 95% CI. This method can take the heterogeneity between studies into account by jointly analysing the logit transformation of SEs and SPs [36]. Finally, the pooled DOR was calculated using a random-effects model following the approach by DerSimonian and Laird [37] to describe the performance of the included diagnostic tests. An uninformative test shows a DOR value of 1; as the DOR increases, the test has more discriminatory power [38]. The area under the curve (AUC) of summary receiver operating characteristics (sROC) was reported to create an overview of the results within each subgroup. The AUC value quantifies the overall ability of a diagnostic test to discriminate between individuals with the disease and those without the disease [39]. The ideal test would have an AUC value of 1, whereas a random guess would have an AUC of 0.5; the larger the area under the ROC curve, the more accurate the diagnostic test. In addition, sROC plots and forest plots were computed to illustrate the diagnostic performance and heterogeneity, respectively [39].
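To make the DOR pooling step more concrete, the following Python sketch applies the DerSimonian and Laird estimator to log-transformed DORs. It is an illustration only and not the code used in this review: the function names are hypothetical, the input tables are invented, the tables are assumed to contain no zero cells (i.e. any continuity correction has already been applied), and the bivariate model of Reitsma et al. is not reproduced here.

```python
import math


def log_dor(tp, fp, fn, tn):
    """Log diagnostic odds ratio and its approximate variance for one study."""
    y = math.log((tp * tn) / (fp * fn))
    v = 1 / tp + 1 / fp + 1 / fn + 1 / tn
    return y, v


def pool_dor_dersimonian_laird(tables):
    """Random-effects pooled DOR with a 95% CI (DerSimonian-Laird)."""
    effects = [log_dor(*t) for t in tables]
    y = [e[0] for e in effects]
    w = [1 / e[1] for e in effects]          # inverse-variance weights
    k = len(y)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-study variance, truncated at zero; undefined for one study.
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 else 0.0
    w_star = [1 / (e[1] + tau2) for e in effects]
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return {
        "DOR": math.exp(pooled),
        "CI95": (math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se)),
        "tau2": tau2,
    }


# Invented 2x2 tables (TP, FP, FN, TN) for three hypothetical studies.
tables = [(40, 10, 10, 40), (30, 5, 20, 45), (25, 15, 5, 55)]
print(pool_dor_dersimonian_laird(tables))
```

A table with equal counts in all four cells yields a DOR of 1, which corresponds to the uninformative test described above.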

Results

According to the workflow recommended by the PRISMA guidelines, 140 (108 in vitro and 32 in vivo) studies were initially identified (Fig. 1). After further consideration of the results from the RoB assessment (supplemental Tables S8a/b–S13a/b), an additional 103 publications needed to be excluded due to high RoB or insufficient data reporting (supplemental Tables S8c/d–S13c/d); the summary graphs from the RoB assessment are depicted in Fig. 2. Finally, 29 in vitro and 8 in vivo studies [40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76] were selected according to the described stepwise process and were found to fulfil the inclusion criteria for meta-analysis (Fig. 1, Table 2). Only two studies were identified to use FOTI, and none used QLF.
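The record counts reported in the Methods and Results can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the function name is hypothetical, and it assumes that the four manually identified records were added before duplicates were removed, which is the reading that reproduces the reported total of 1090 records.

```python
def record_flow():
    """Recompute the study selection flow from the counts given in the text."""
    pubmed, embase, manual = 946, 836, 4
    duplicates = 696
    # Assumption: hand-search records are merged before duplicate removal.
    screened = pubmed + embase + manual - duplicates
    full_text = screened - 894      # excluded on title/abstract
    eligible = full_text - 56       # excluded after full-text reading
    meta_analysis = eligible - 103  # excluded for high RoB or missing data
    return {
        "screened": screened,
        "full_text": full_text,
        "eligible": eligible,
        "meta_analysis": meta_analysis,
    }


flow = record_flow()
print(flow)
# Consistency with the in vitro / in vivo split reported in the Results.
print(flow["eligible"] == 108 + 32, flow["meta_analysis"] == 29 + 8)
```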

Fig. 1 Flow diagram detailing our search and study selection process applied during the systematic literature search (1st step) and study quality assessment (2nd step)

Fig. 2 RoB graph across included in vivo (A) and in vitro (B) caries diagnostic studies for occlusal surfaces. Item no. 1 (patient selection bias) is only available for clinical diagnostic studies

Table 2 Overview of the identified diagnostic studies in relation to the method used and characteristics of the study set-up with stepwise included studies for meta-analysis

Meta-analytic validity data are presented for all included caries detection and diagnostic methods in relation to the three chosen caries detection levels for laboratory and clinical studies in Tables 3 and 4, respectively. Most data sets originated from in vitro studies (N = 29, Table 3) rather than clinical investigations (N = 8, Table 4). In the in vitro results for all diagnostic methods at the caries detection and dentin caries level, a higher SP than SE value was typically found (Table 3). AUCs were characteristically higher for additional diagnostic methods, e.g. radiography or LF, than for VE. The highest diagnostic performance was observed for VE at the 1/3 dentin caries detection level (AUC = 0.89). The DOR values ranged from 1.94 to 37.77 (dentin caries detection level/in vitro, Table 3), 2.14 to 60.37 (caries detection level/in vivo, Table 4) and 11.79 to 127.56 (dentin caries detection level/in vivo, Table 4).

Table 3 Bivariate diagnostic random-effects meta-analysis for the finally included in vitro studies for all diagnostic methods at different caries detection levels
Table 4 Bivariate diagnostic random-effects meta-analysis for the finally included in vivo studies for all diagnostic methods at different caries detection levels

A meta-analysis was conducted for in vivo studies as well (Table 4). Here, SE (0.70) was registered to be higher than SP (0.47) for VE at the caries detection level. The SE (0.72) and SP (0.77) were higher at the 1/3 dentin caries detection level. The meta-analytic diagnostic performance of conventional bitewing radiography (F-speed) and LF was found to be excellent.

Although comparisons between in vitro and in vivo studies should be made with caution because of the imbalance in the number of included studies, a few trends were observed. On the one hand, the diagnostic performance of VE tended to be higher under laboratory conditions than in clinical settings; on the other hand, the diagnostic performance of VE was not perfect and was lower than that of additional diagnostic methods. Here, conventional radiography (E-speed) and LF measurements showed higher performance data under clinical conditions. Furthermore, for all methods, there seemed to be a tendency towards a higher SE in clinical studies. SP was found to be comparable under laboratory and clinical conditions; only in the case of VE were higher values registered in vitro. Again, full comparisons could not be made due to incompleteness of the data (Tables 3 and 4). In addition, sROC curves and forest plots were computed and are presented in the additional online material (supplemental Tables S14–S17).

Discussion

This study project summarized the diagnostic accuracy of occlusal caries lesion detection, diagnostic, assessment and/or monitoring methods that were investigated under in vitro and in vivo conditions in permanent posterior teeth. To this end, a systematic search of the literature was conducted; potential sources of bias were considered; and finally, a meta-analysis was performed to compare commonly used caries diagnostic methods instead of analysing each method separately [12,13,14,15,16, 77,78,79,80,81]. When considering the quantity and quality of the systematically searched literature, it should be noted that there was a remarkable reduction in includable studies with each additional selection step (Fig. 1). Finally, 37 studies were included in the meta-analysis [40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76], and unfortunately, these studies were not equally distributed over all test methods, study setups and considered thresholds (Tables 2, 3 and 4). Most studies were conducted under laboratory conditions (Fig. 1, Table 2) and investigated the diagnostic accuracy using the dentin caries detection threshold (Tables 3 and 4). VE, BWR and LF were tested more frequently than other additional diagnostic methods. This heterogeneous pattern of information suggests that caries diagnostic studies that include different test methods and thresholds on pits and fissures are needed. This demand is even more crucial for clinical studies.

The diverging methodology of each trial (technologies, thresholds, index and reference test criteria; supplemental Tables S1–S6) and several sources of bias (Fig. 2, supplemental Tables S7–S13b) resulted in the exclusion of numerous studies, which ultimately lowered the number of includable studies and illustrated the heterogeneity between studies. This fact underlines the need for standardization and the necessity to conduct well-designed and well-powered caries diagnostic and detection studies in the future.

Regarding the meta-analytic diagnostic performance of the included diagnostic methods (Tables 3 and 4), it must be emphasized that for some methods, only a limited number of studies were identified. Exceptions were VE, BWR and LF (Tables 3 and 4). When viewing these data, a few trends can be discussed, but it should be mentioned from the outset that the results of this meta-analysis should not be overrated due to the limited number of includable studies for each of the relevant caries detection categories (Table 2). Nevertheless, a few conclusions can be drawn from the available data. The data support the generally and repeatedly published assumption that VE of pits and fissures is not perfect and needs to be accompanied by additional diagnostic methods. Nevertheless, more recently published criteria (ICDAS, UniViSS) that summarize the whole spectrum of non-cavitated caries lesions may help to overcome this drawback [16, 82,83,84]. Under in vitro conditions, VE showed mostly high SP values, while SE varied between the different methods and thresholds. A large difference between SE values was registered for VE under in vitro and in vivo conditions (Tables 3 and 4), which was also reported by Gimenez et al. [15]. Therefore, VE under in vitro conditions results in higher SP values. Vice versa, clinical evaluations probably include more details, which may result in higher diagnostic SE values especially for enamel caries.

It should be further noted that VE is the method that enables the clinician to collect important diagnostic co-variables, e.g. presence of biofilm or lesion appearance, enables differential diagnoses and, finally, provides information about caries lesion activity [85, 86]. The latter aspect potentially influences the individual caries management strategy, and its consideration has become mandatory in clinical practice [87,88,89]. However, given the methodological difficulties and missing standards for validating caries activity, it was decided to exclude activity assessment from the present systematic search of the literature and meta-analysis.

In vitro data from Ekstrand and co-workers [28, 90, 91] indicated that the histologically assessed depth of non-cavitated occlusal lesions was either restricted to the enamel or, where the lesion penetrated the dentin, restricted to its outer 1/3. To raise the accuracy, e.g. in terms of SE and SP, Ekstrand et al. [28] suggested moving the standard threshold (enamel versus dentin caries) to lesions reaching the middle or inner 1/3 of the dentin. With this threshold, the combined SE and SP values amounted to 175 [91]. The new threshold is much more relevant to clinicians than the old one, as non-cavitated lesions without an obvious shadow should receive non-operative care if the lesions are assessed as active, while more mature active lesions should receive operative care [16].

BWR is the most commonly used additional caries lesion detection method in daily dental practice. However, its validity on occlusal surfaces is often questioned, especially in the early stages of caries [92]. Here, the anatomy of the tooth crown results in superimposed images on the two-dimensional (bitewing) radiographs, making the detection of early dentin caries lesions harder than on proximal surfaces [93]. Surprisingly, the results of the present meta-analysis did not show a striking difference in SE and SP values between the different X-ray types assessed in this review, whereas the difference in accuracy parameters was obvious compared to those of LF. However, due to the limited number of studies belonging to each BWR category, these results should be interpreted with caution. Unlike previously published reviews [13], this review considered studies using conventional film-based BWR and digital BWR (including their different modalities) separately with the aim of reducing bias. Unfortunately, this approach resulted in a low number of includable studies in each category.

LF has been used as an adjunct caries detection method for incipient lesions that otherwise could not be detected by VE alone [94]. The results of our study revealed high SE and SP values for LF under in vitro conditions, which is in line with previously reported findings by Gimenez et al. [14] and Rosa et al. [12]. When considering the small number of includable data from in vivo studies (Table 4), these data should be treated with caution, but they are still comparable to previous findings from Pinheiro et al. [94]. In contrast to these reassuring results, LF alone is not sufficient for the correct diagnosis of caries and good standardization is essential to avoid overtreatment and false-positive readings due to other fluorescence sources [12, 14, 81, 84, 94].

The present study has strengths and limitations. First, commonly used diagnostic methods for occlusal caries detection and diagnostics were analysed in one meta-analysis. Second, there was a strict study selection protocol, which was based on principles for performing systematic reviews and, in addition, a tailored RoB assessment that included only studies with a low or moderate RoB and excluded probably heterogeneous publications. Third, the present study considered different thresholds independently for in vitro and in vivo studies. As a main limitation of the study selection process used, the exclusion of reports with a potentially high RoB from the meta-analysis and the possible subjectivity of the selection criteria may be debated, given the very low number of included studies, especially in clinical research. To our knowledge, such strict selection has not previously been performed because it is not part of the current recommendations for conducting a meta-analysis. While this step may result in the analysis of a homogeneous pool of studies, it also resulted in a substantial reduction in includable studies. It is further worth mentioning that several reports needed extensive discussion with respect to missing data or information. Therefore, the inclusion or exclusion of a single study remained in some cases a subjective procedure that could not be fully objectified. Because of the limited number of includable studies and the low sample size, the results from this meta-analysis should be interpreted with caution. This fact underlines the urgent need for well-designed and well-powered diagnostic studies that use multiple diagnostic procedures and different caries thresholds.

Conclusions

There is an overall need for high-quality, well-designed and standardized studies on the detection, diagnosis, assessment and/or monitoring of occlusal surface caries. This need must be emphasized for diagnostic studies under in vivo conditions due to the limited number of clinical trials and the documented heterogeneity between published reports. When considering the meta-analytic results, VE, BWR and LF provide acceptable measures for their diagnostic performance on occlusal surfaces. Again, the present results should be interpreted with caution with respect to the limited data in many diagnostic categories.