1 Introduction

Advances in research and development of methodologies for big data mining [1] powered by Artificial Intelligence (AI) [2, 3], which seek to discover meaningful and explorable patterns in data, have enabled and motivated their application in digital forensics (DF) investigation.Footnote 1 Digital artifacts are collections of digital data that are frequently large, complex, and heterogeneous. Despite concerns about the ability of “black-box” AI models [4] to generate reliable and verifiable digital evidence [5], the assumption that cognitive methodologies used in big data analysis will succeed when applied to DF analysis has fueled a decade-long surge of research into the application of AI in DF. Note that our references to AI methods in this paper include machine learning (ML) [6, 7] and deep learning (DL) [170] methods, with distinctions made where necessary.

To begin, a misunderstanding exists regarding the colloquial use of the terms “Forensics AI” and “AI Forensics” within the forensics community (and beyond), with some using the phrases interchangeably as referring to the application of AI in DF. While both phrases are self-explanatory, it is vital to clarify common misconceptions and distinguish the two concepts. On the one hand, according to [8], a word preceding ‘forensics’ in the DF domain denotes the target (tool or device) to be analyzed (e.g., cloud forensics, network forensics, memory forensics, etc.). As a result, the author refers to “AI Forensics” as a forensic analysis of AI tools or methods, rather than forensic investigation applying AI techniques. In the same vein, the authors in [9] refer to AI Forensics as “scientific and legal tools, techniques, and protocols for the extraction, collection, analysis, and reporting of digital evidence pertaining to failures in AI-enabled systems.” To summarize their definition, AI Forensics is the analysis of the sequence of events and circumstances that led to the failure of an intelligent system, including assessing whether or not the failure was caused by malicious activity and identifying the responsible entity(ies) in such scenarios.

In contrast to the previously described concept, a comprehensive review of research databases such as Google Scholar, IEEE Xplore, and Scopus for the terms “Forensics AI” or “Digital Forensics AI” reveals that the majority of resources are based on DF analysis methods assisted by AI techniques. In this paper, however, we refer to Digital Forensics AI (hereafter DFAI) as a generic, broader concept of automated systems that encompasses the scientific and legal tools, models, and methods, including the evaluation, standardization, optimization, interpretability, and understandability of AI techniques (or AI-enabled tools), deployed in the digital forensics domain. We also refer to “digital evidence mining” as the process of automatically identifying, detecting, extracting, and analyzing digital evidence with AI-driven techniques. The term “mining” is borrowed from the notion of data mining, which embodies procedures and components that can be applied in the analysis of digital evidence.

Importantly, as accurate and precise as most AI algorithms are, owing to the considerable research focus and resources recently dedicated to them, their application to digital forensics requires significant caution and consideration of domain-specific intricacies. Clearly, the results of a business-oriented AI task will be evaluated differently from those of a forensic investigation. Additionally, the bulk of AI algorithms are based on statistical probabilities, which commonly results in non-deterministic outputs. The challenge, therefore, is to establish the correctness of the outcomes and to communicate the probabilistic conclusion of a forensic examination in the simplest and most understandable manner possible in order for it to be admissible in legal proceedings.

As a result, in this work, we emphasize the importance of three scientific instruments in the application of AI in digital forensics: evaluation; standardization; and optimization of the approaches used to accomplish the tasks. In subsequent sections of this work, we will discuss the significance of these instruments and their components.

This paper makes the following contributions:

  • We present various AI model evaluation approaches, emphasizing their importance for DFAI methodologies and the forensic tasks to which they are best suited.

  • We propose a confidence scale (C-Scale) for evaluating the strength of evidence that is adaptive to AI-generated probabilistic results.

  • We discuss numerous optimization techniques that may be appropriate for certain forensic analyses, comparing their strengths and drawbacks, including their time complexity for DFAI tasks.

The remainder of the paper is organized as follows. Section 2 covers the methods for evaluating DFAI techniques. In Sect. 3, the methods for standardizing DFAI techniques are discussed, while Sect. 4 elaborates on the optimization of these techniques. Finally, in Sect. 5, we discuss future directions and conclusions.

2 Methods for Evaluating DFAI Techniques

During a forensic investigation, examiners develop an initial hypothesis based on observed evidence. Following that, the hypothesis is evaluated against all other competing hypotheses before final assertions are made [10]. The issue is that, as highlighted in [11], in an attempt to make sense of what they observe (sometimes coercively to ensure that it fits the initial assumption), investigators subconsciously: (1) seek findings that support their assertions; (2) interpret relevant and vague data in relation to the hypothesis; and (3) disregard or give less weight to data that contradict the working hypothesis. Numerous factors may contribute to this bias, including but not limited to: confidence (as a result of the presumption of guilt), emotional imbalance, concern about long-term consequences (e.g., loss of prestige), and personality characteristics (e.g., dislike for uncertainty or a proclivity to over-explore various scenarios) [12]. Consequently, before a forensic investigation can reach a conclusion, each component of the initial hypothesis must be independently and thoroughly tested (or evaluated) to ascertain the degree of confidence in the methodology that produced the fact. Evaluation, therefore, is the process of judging the strength of evidence adduced for opposing assertions, as well as their relative plausibility and probability [13].

Expert examiners can evaluate forensic examination data using a variety of techniques, some based on predefined scientific standards and others on logical deductions supported by experience or subjective reasoning. However, in the context of DFAI, forensic evaluation is performed by evaluating the AI algorithms deployed in the forensic investigation. This deployment requires metrics and measurements that are compatible with AI model evaluation. The evaluation of DFAI models can be carried out on the algorithm’s functional parameters (i.e., individual modules) or on their outputs. Unlike conventional approaches for evaluating ML or DL models, which apply standard metrics associated with the task or learning algorithm, gaining confidence in the outcome of a DFAI analysis may require additional human observation of the output. Numerous studies in DF have revealed that forensic practitioners frequently issue inconsistent or biased results [13, 14]. In addition, the majority of AI-based approaches lack the necessary clarity and replicability to allow investigators to assess the accuracy of their output [15]. Thus, a forensically sound processFootnote 2 is one that integrates automated investigative analysis, evaluated through scientific (accuracy and precision) metrics, with human assessments of the outcome. For example, a DF investigation into Child Sexual Exploitation Material (CSEM) [16, 17] may seek to automatically detect and classify images of people found on a seized device as adult or underage (based on automatically estimated age). Because of possible misrepresentation in the dataset, misclassification (i.e., false positives), misinterpretation of features, and the omission of critical features during the classification process that could have served as evidence (false negatives; e.g., an underage person wearing adult facial makeup) may occur [18]. In this case, merely addressing bugs in algorithmic code may not be sufficient, as the classification errors may be subconsciously inherited and propagated through data. Similarly, the work described in [19] is a temporal analysis of e-mail exchange events to detect whether suspicious deletions of communication between suspects occurred and whether the deletions were intended to conceal evidence of discussion about certain incriminating subjects. One significant drawback of that analysis is the model’s inability to thoroughly investigate whether the suspicious message(s) were initiated or received by the user or were deliberately sent by an unauthorized hacker remotely accessing the user’s account to send such incriminating messages. To reach a factual conclusion in this case, various other fragmented, unstructured activity data (perhaps unrelated to e-mail) must be analyzed and reconstructed. Depending on the design, a robust AI-based system can uncover various heretofore unrecognized clues. If these new revelations (even though relevant) are not properly analyzed and evaluated, they may lead investigators to believe that the outputs dependably fulfil their needs [15]. As a result, an extensive review of the output of DFAI will be required (supposedly provided by human experts) to arrive at a factually correct conclusion. This has also been highlighted as an important instrument for examining digital evidence in [10]. Additionally, expert knowledge that has been codified as facts (or rules) in a knowledge base can be used in place of direct human engagement to draw logical inferences from evidence data.

As with the output of any other forensic tool capable of extracting and analyzing evidence from digital artifacts, which frequently requires additional review and interpretation compatible with the working hypothesis, the results of forensic examinations conducted using DFAI should be viewed as “recommendations” that must be interpreted in the context of the overall forensic observation and investigation [15]. In addition, the evaluation apparatus must be verifiable, appropriate for the task it seeks to solve, and compatible with the other contextual analyses of the investigative model. Taking this into consideration, the methods for evaluating DFAI techniques can be viewed in terms of two significant instruments: performance evaluation and forensic evaluation. Below, we discuss the significance and components of each of these instruments. These two instruments, in our opinion, are quite essential for a sound digital forensic process based on DFAI.

2.1 Methods for Evaluating the Performance of DFAI Models

In a machine-driven system, evaluation produces a value as a measure of the model’s performance in accomplishing the task for which it was commissioned, which may be used to influence decision-making [10]. Depending on the problem the model attempts to solve, evaluation may be: a set of thresholds formulated as binary (i.e., ‘yes’ or ‘no’, or 0 or 1) or categorical (qualitative; one of a finite set of possible outcomes), as the case may be; discrete (an enumeration of strength; e.g., a range between 0 and 10); or continuous (e.g., probability distributions of real values between 0 and 1). Consequently, evaluating the performance of a DFAI model built to recognize specific faces in CSEM is distinct from evaluating the performance of a model meant to classify faces as underage or adolescent. Similarly, distinct metrics are required for models that detect spam e-mails and those that attempt to infer intent from e-mail content. The majority of DFAI tasks will fall into one of three categories: classification, regression, or clustering. The scientific methods used for evaluating the performance of these three categories are discussed below. It is worth mentioning, however, that these are standard metrics for ML tasks. Hence, we offer only a brief review of the methods, emphasizing the intersection and relevance of each metric to DFAI (including the weaknesses and strengths that make them appropriate or otherwise) where necessary. Therefore, readers are encouraged to consult additional publications on ML metrics for complete details.

2.1.1 Evaluating Classification Algorithms in DFAI

Classification models are predictive in nature, identifying the class to which a set of input samples belongs. Classification tasks are evaluated by comparing predicted class samples to ground-truth samples. In a vast majority of cases, classification model design will include both positive and negative examples. The former represent true samples obtained from data, whilst the latter are fictitious samples that do not exist in the real sense. A classification task is commonly modelled in ML as a binary representation that predicts a Bernoulli probability distribution [21] for each sample. Bernoulli distributions are a type of discrete probability distribution in which events have binary outcomes such as 0 or 1. Therefore, the performance of a classification model is measured by its ability to correctly predict (assign a high probability value to) the class of positive samples and to assign a very low probability value to non-existent samples.

Prior to deploying a DFAI model, it is necessary to examine the characteristics of the investigation to determine whether the model is appropriate for that purpose. Practitioners are expected to be aware of the unique characteristics of learning algorithms and to use them appropriately. For instance, in a forensic investigation involving facial classification, two main techniques are applicable: verification and identification. Verification entails comparing an unknown face to a known face directly (One-vs-One) [22] and computing their similarity score. This can be adapted as a binary classification task, in which the system predicts whether or not two faces share a high degree of similarity, based on a predetermined threshold. On the other hand, identification involves a One-vs-Rest [23] comparison, in which an unknown face is compared to the faces in a database of known persons. The identification task is typically a “Multi-Class Classification” [24] problem, in which samples are classified into one of a set of known classes. Other classification models are: multi-label classification [25] and imbalanced classification [26].

Metrics such as accuracy, precision, recall, and F-Measure are all relevant depending on the investigation’s characteristics. The measure of “accuracy” can be seen as the validity measure of a model. It is the ratio of the correctly classified samples to the total samples. Accuracy tells whether a model was correctly trained and how well it will function in general. However, caution should be exercised when using this information alone to reach a general conclusion in a forensic investigation, as it provides little information about its application to the problem and performs poorly in circumstances of severe class imbalance, that is, if the dataset is asymmetric, e.g., if the proportion of false positives is not equal (or nearly equal) to the proportion of false negatives. Accuracy is calculated in terms of a confusion matrix when performing a binary classification task, such as predicting whether an e-mail is “spam” or “not-spam.” The confusion matrix [27, 28] is applied to a set of test data for which the true values are known. What a classifier seeks to minimize is the number of false positives and false negatives. A true positive (tp) is an outcome in which the model correctly predicts a positive sample, while a true negative (tn) indicates a correctly predicted negative sample. Similarly, a false positive (fp) occurs when the model incorrectly predicts a negative sample as positive, whereas a false negative (fn) occurs when the model incorrectly predicts a positive sample as negative. Therefore, in terms of the confusion matrix, the accuracy measure is represented as:

$$\begin{aligned} Accuracy = \frac{tp\;+\;tn}{tp\;+\; tn\;+\;fp\;+\;fn} \end{aligned}$$
(1)
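To make the computation concrete, the minimal sketch below derives the confusion-matrix entries and Eq. (1) for a hypothetical spam/not-spam task using scikit-learn; the label vectors are invented placeholders rather than data from any real case.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical ground-truth and predicted labels (1 = "spam", 0 = "not-spam").
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# For binary labels, the confusion matrix is laid out as [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy as defined in Eq. (1): (tp + tn) / (tp + tn + fp + fn).
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"tp={tp}, tn={tn}, fp={fp}, fn={fn}")
print(f"accuracy={accuracy:.2f}, sklearn accuracy={accuracy_score(y_true, y_pred):.2f}")
```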

To ascertain the reliability of a DFAI model, the precision metric [29] is critical. It provides additional assurance by posing the question: “how frequently is the model correct when it predicts a positive sample?” With precision, we affirm the classifier’s ability not to label a negative sample as positive. Given that the outcome of a forensic investigation may be critical to an inculpatory or exculpatory proceeding, the cost of a high rate of false positives may be detrimental.

Additionally, there are situations where the cost of a false negative is potentially catastrophic, such as a facial recognition investigation to discover criminal materials via training examples. While the system is capable of identifying and classifying a large number of positive samples, it may be necessary to ascertain how many faces were correctly identified from the predicted samples. This is where recall [29] plays a critical role in DFAI. Recall is crucial for evaluating working hypotheses and can help answer some potentially damning questions during court proceedings. Recall facilitates informed decisions on false negatives, for example, by highlighting crucial details that should not be overlooked.

To take advantage of the evaluative strength of both precision and recall, the F-Measure (or F-Score) can be employed to measure the model’s accuracy. It takes both false positives and false negatives into consideration; keeping both low yields a high (good) F-Measure. This has the potential to aid in the reduction of false assumptions during forensic investigations.
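For reference, precision, recall, and the F-Measure can be written in terms of the confusion-matrix entries defined above as:

$$\begin{aligned} Precision = \frac{tp}{tp\;+\;fp}, \quad Recall = \frac{tp}{tp\;+\;fn}, \quad F_{1} = 2\;\times \;\frac{Precision\;\times \;Recall}{Precision\;+\;Recall} \end{aligned}$$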

Another relevant metric for measuring a classifier’s capacity to distinguish between classes is the Area Under the Curve (AUC) [30], which serves as a summary of the Receiver Operating Characteristic (ROC) curve [31]. The ROC curve is constructed by plotting the tp rate against the fp rate at various threshold values. The AUC and Average Precision (AP) [32] are also the quality measures used to evaluate the performance of link prediction models, as well as the probability of a relationship between hypothetical variables.
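As a hedged sketch, both measures can be computed directly from a model’s predicted scores with scikit-learn; the score vector below is a made-up placeholder rather than the output of any particular DFAI model.

```python
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

# Hypothetical ground truth and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9]

auc = roc_auc_score(y_true, y_score)           # area under the ROC curve
ap = average_precision_score(y_true, y_score)  # summary of the precision-recall curve

# The ROC curve itself: tp rate vs. fp rate at the candidate thresholds.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(f"AUC={auc:.3f}, AP={ap:.3f}")
```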

There are instances when evaluating accuracy becomes preferable to the F-measure; this is especially true when the cost of false positives and false negatives is similar, meaning that the consequences are not negligible. If the situation is reversed, it is reasonable to evaluate the F-measure. However, some critical concerns about the F-measure’s weaknesses are discussed in [33, 34]. Notable among them are its bias towards the majority class and its underlying assumption that the actual and predicted distributions are identical. Additionally, caution should be exercised when evaluating performance on classified samples that involve the assignment of a threshold (as is the case in some logistic regression models). Increasing or decreasing the threshold value (in a classification model) has a major effect on the precision and recall results. In contrast to a model developed to optimize business decisions, it may be prudent to avoid including any threshold in DFAI, as it would be appropriate to have a realistic view of the analysis’ outcome, unless there is certainty that doing so will not have a detrimental impact on the outcome. Nonetheless, accuracy is crucial, so a threshold can be considered provided the trade-offs can be quantified and sufficiently justified.
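The short sketch below illustrates the threshold sensitivity discussed above: sweeping an assumed decision threshold over the same kind of hypothetical scores visibly shifts the balance between precision and recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9])

for threshold in (0.25, 0.5, 0.75):
    # Convert continuous scores to hard class labels at the chosen threshold.
    y_pred = (y_score >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.2f}: precision={p:.2f}, recall={r:.2f}")
```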

2.1.2 Evaluating Regression Algorithms in DFAI

In contrast to classification models, which predict the classes of input samples, regression models predict from an infinite set of possible (continuous, real-valued, e.g., integer or floating-point) outcomes. In DFAI, regression analysis can be utilized for two conceptually distinct purposes: forecasting and prediction; and inference of causal relationships between dependent (observed) and independent (predictor) variables. Before a regression analysis can be commissioned, the examiner must be convinced that the correlations present in the data possess the predictive power to infer a new context or that these correlations can induce a causal interpretation based on observational data [35, 36]. This is particularly important for forensic investigations. A significant factor that can improve the predictive capabilities of a regression model is when the input variables are arranged chronologically (according to event time), a notion referred to as time series forecasting. This is important for forensic tasks such as detecting deviations (anomalies), forecasting crime, predicting probable connections between data, and reconstructing events. Furthermore, while working with regression models, interpolation and extrapolation [37] are critical concepts to understand. Often, the former is preferable, as it involves the prediction of values within the range of data points in the dataset used to fit the model. The latter, on the other hand, depending on the task, might not be fully desirable for DFAI. Extrapolation is based on regression assumptions and requires predicting values outside the observed data range. Extrapolating over a range that is significantly larger than the actual data is risky and is a sign of likely model failure.

A regression model’s performance is measured as an error in prediction, i.e., how close the predictions were to the ground truth. To do this, the following error measures are frequently used: Mean Squared Error (MSE) [38,39,40], Root Mean Squared Error (RMSE) [41], Mean Absolute Error (MAE) [40], and Mean Absolute Percentage Error (MAPE) [42]. Although several other error metrics are available, the choice is determined by the type of error being evaluated. We briefly discuss the above-mentioned metrics below.
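For reference, given n ground-truth values \(y_{i}\) and predictions \(\hat{y}_{i}\), these error measures take their standard forms:

$$\begin{aligned} MSE = \frac{1}{n}\sum _{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}, \quad RMSE = \sqrt{MSE}, \quad MAE = \frac{1}{n}\sum _{i=1}^{n}|y_{i}-\hat{y}_{i}|, \quad MAPE = \frac{100\%}{n}\sum _{i=1}^{n}\left| \frac{y_{i}-\hat{y}_{i}}{y_{i}}\right| \end{aligned}$$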

MSE can be used to evaluate the quality of a predictor or an estimator. However, in DFAI it is better suited as a predictor, since it can map arbitrary inputs to a sample of random variables. An MSE of zero indicates a perfectly accurate prediction; however, this is rarely possible [43]. Other measures have sometimes been preferred to MSE due to its disproportionate weighting of outliers [44]. This occurs because large errors are magnified more than small ones, since each error value is squared.

An extension of the MSE is the RMSE, which is always non-negative. A value of zero (0) is almost unrealistic; if it does occur, it indicates that the model is trivial. RMSE is highly susceptible to outliers, as larger errors are significantly weighted. In DFAI tasks, it may be prudent to establish a baseline RMSE for the working dataset by predicting the mean target value of the training dataset using a naive predictive modelFootnote 3. This can be accomplished by transforming or scaling the dataset’s feature vectors between 0 and 1 (i.e., normalization).

In contrast to the previously stated error measures, which require squaring the differences, MAE changes are linear, intuitive, and interpretable; each error simply contributes in proportion to its absolute value. MAE calculates the error difference between paired observations expressing the same event, i.e., it is scale-dependent; it uses the same scale as the data being measured.Footnote 4 Moreover, it does not give greater or lesser weight to particular errors and hence provides a realistic view of the main prediction errors; thus, it is strongly recommended for DFAI. Additionally, it is a frequently used metric for forecasting error in time series analysis [45], which may be beneficial when examining event reconstruction problems.

While MAPE appears to be well-suited for prediction, particularly when adequate data is available [46], caution should be exercised to prevent the ’one divided by zero’ problem. Additionally, MAPE penalizes negative-valued errors significantly more than positive-valued errors; as a result, when utilized in a prediction task, it favours methods with extremely low forecasts, making it ineffective for evaluating tasks with large errors [46].

There are other error measures for regressors, such as Max Error [47], which calculates the maximum residual error and detects worst-case errors [15], and \(R^{2}\) (also known as R-Squared, goodness of fit, or the coefficient of determination) [48,49,50], which measures the proportion of variance explained by the regressor.

Following the description of each of these error measures for regression problems and their associated limitations, selecting the one most appropriate for a specific forensic task can be somewhat puzzling. However, as demonstrated in [51], the RMSE is unreliable and unsuitable for determining the correctness of a time series analysis (such as temporal event reconstruction). Additionally, the studies in [44, 52] stated that RMSE possessed “disturbing characteristics,” rendering it ineffective as an error measure; MSE and all other squared errors were also deemed unsuitable for evaluation purposes in those studies. The work described in [53] somewhat challenged these conclusions by presenting arguments in support of RMSE. Nevertheless, MAE has been recommended in the majority of cases, which is understandable. As previously stated, the MAE metric is a consistent and compatible evaluation technique for DFAI; it is a more natural representation of the model’s average error magnitude [52] that appropriately depicts the model’s performance. The \(R^{2}\) is another metric that deserves a role in DFAI. A recent comparison of regression analysis error measures is discussed in [54]. \(R^{2}\) exhibits desirable features, including interpretability in terms of the data’s information content and sufficient generality to span a relatively broad class of models [55]. Although a negative \(R^{2}\) indicates a worse fit than the average line, this representation may be critical for determining how the learning model fits the dataset. Furthermore, regardless of whether an examiner reports the \(R^{2}\) score, or whether it is used to determine the performance of a regressor, it is a highly effective technique for evaluating a regression analysis and is highly recommended for DFAI analysis.
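A minimal sketch of how these regression error measures might be computed with scikit-learn is given below; the value arrays are illustrative placeholders, not the output of any particular DFAI regressor.

```python
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             max_error, r2_score)

# Hypothetical ground-truth values and regressor predictions.
y_true = [3.0, 5.5, 2.0, 7.0, 4.5]
y_pred = [2.5, 6.0, 2.2, 6.4, 4.8]

mse = mean_squared_error(y_true, y_pred)
rmse = mse ** 0.5                          # RMSE as the square root of MSE
mae = mean_absolute_error(y_true, y_pred)  # recommended above for DFAI
worst = max_error(y_true, y_pred)          # largest residual (worst-case error)
r2 = r2_score(y_true, y_pred)              # coefficient of determination

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}, "
      f"MaxError={worst:.3f}, R2={r2:.3f}")
```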

2.1.3 Evaluating Clustering Algorithms in DFAI

Evaluating a clustering method can be challenging because it is mostly used in unsupervised learning [56, 57], which means that no ground-truth labels are available. Clustering in a supervised (learning) [58] setting, on the other hand, can be evaluated using supervised learning metrics. One significant downside of unsupervised learning that fact-finders should be aware of is that blindly applying clustering analysis to a dataset will categorize the data into clusters (even if the data is random), as this is the algorithm’s expected function. As a result, before deciding on a clustering approach, examiners must verify the non-random structure of the data. Three critical factors should be considered in clustering: (1) clustering tendency; (2) the number of clusters, k; and (3) clustering quality. We give a brief explanation of these factors below.

1. Clustering tendency: tests the spatial randomness of the data by measuring the probability that a given dataset is generated by a uniform data distribution. If the data is sparsely random, clustering techniques may be meaningless. It is critical (especially in DFAI) for examiners to conduct this preliminary assessment, in part because it can help reduce the amount of time required to analyze artifacts. A method for determining a dataset’s cluster tendency is the Hopkins statistic [59], which is a type of sparse sampling test. The Hopkins statistic is used to test the null hypothesis (\(H_{0}\)) against the alternative hypothesis (\(H_{a}\)). If the Hopkins statistic is close to 1, i.e., \(H>0.5\), we can reject the null hypothesis and infer that there are significant clusters in the data (a minimal sketch follows this list).

2. Number of clusters: obtaining the ideal number, k, of clusters is critical in clustering analysis; while there is no definitive method for doing so, it can rely on the shape of the distribution, the size of the data set, and the examiner’s preference. If k is set to a value that is too high, each data point has a chance of forming a cluster, whereas a value that is too low may result in inaccurate clusters. Additionally, the following approaches can help forensic examiners determine the cluster number:

  • Prior domain knowledge—prior experience with the use case can provide insight into the optimal number of clusters to choose.

  • Data driven approach—employs mathematical methods to determine the correct value, such as the rule-of-thumb method, the elbow method [60, 61], and gap statistics [62] (see the sketch following this list).

3. Clustering quality: characterised by minimal intra-cluster distance and maximal inter-cluster distance.
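As a hedged illustration of the preliminary checks above (referenced in items 1 and 2), the sketch below estimates the Hopkins statistic on synthetic placeholder data and then applies the elbow heuristic using k-means inertia; the data, the sample size m, and the candidate range of k are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Placeholder data: two artificial blobs standing in for real feature vectors.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

def hopkins(X, m=50, rng=rng):
    """Estimate the Hopkins statistic; values near 1 (H > 0.5) suggest clusterable data."""
    n, d = X.shape
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # u: distances from m uniformly random points to their nearest real data point.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0].sum()
    # w: distances from m sampled data points to their nearest neighbour in X
    #    (second neighbour, because the first is the point itself).
    sample = X[rng.choice(n, m, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1].sum()
    return u / (u + w)

print(f"Hopkins statistic: {hopkins(X):.2f}")

# Elbow heuristic: inertia (within-cluster sum of squares) for candidate k values;
# the "elbow" where the decrease flattens suggests a reasonable number of clusters.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: inertia={km.inertia_:.1f}")
```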

To evaluate the performance of a clustering task, two validation statistics are key, namely: internal cluster validation and external cluster validation.

Internal cluster validation: evaluates a clustering structure’s goodness without reference to external data. It frequently reflects the compactness, connectedness, and separation of the clusters. The silhouette coefficient (SC) [63, 64] and Dunn index (DI) [65] can be used to evaluate how well the algorithm performs with respect to its internal clusters. By comparing the average distance between an observation and the members of its own cluster against its average distance to the nearest neighbouring cluster, the SC determines how well observations are clustered. SC has been applied in a variety of forensics-related clustering methodologies, including document forensics [173], image source identification [174, 175], and text forensics (e.g., authorship) [176, 177].
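A brief sketch of how the SC might be obtained for a clustering result with scikit-learn follows; the feature vectors and cluster count are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Placeholder feature vectors: two separated groups of points.
X = np.vstack([rng.normal(0, 1, (80, 3)), rng.normal(5, 1, (80, 3))])

labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

# Mean silhouette coefficient over all samples: close to +1 means well-separated
# clusters, around 0 means overlapping clusters, negative suggests misassignment.
print(f"silhouette coefficient: {silhouette_score(X, labels):.2f}")
```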

However, if computational cost is not an issue, the DI can be utilized. A practical application of DI in computer forensics is reported in [178], where it aids in the evaluation of ransomware sample similarity. There are further indices (for example, the Davies-Bouldin index [66]), but the silhouette and Dunn indices provide, in principle, the closest compatibility with DFAI in general, and specifically in terms of interpretability.

External cluster validation: compares and quantifies a cluster analysis’ results against externally known benchmarks (e.g., externally provided gold standard labels). Such benchmarks are made up of a collection of pre-classified items, which are often created by (expert) humans. The evaluation approach quantifies the degree to which the clustering analysis result corresponds to predefined ground truth classes. To evaluate external cluster indices, the Rand index [67], the Purity index [68], the F-measure (with precision and recall, as indicated in the classification task), and the Fowlkes-Mallows index [69] can be utilized. As a matter of fact, it remains unclear how external cluster validation could improve DFAI. To elaborate, given that the majority of digital artifacts from which evidence can be derived are sparse, unconventional, and previously unseen, having a ground truth label with which to compare may be impracticable. Moreover, given that the majority of DF analyses are crime-specific (or relate to a particular case), the question is whether it is appropriate to compare crime-related data analysis against general-task ground truth labels. However, if gold standard, case-based labels are available, such as those for videos and photos in [70] or (though limited in scope and diversity) the “Computer Forensic Reference Dataset Portal (CFReDS)Footnote 5” or “Datasets for Cyber Forensics,Footnote 6” then suitable comparisons can be established.
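Where such case-based gold-standard labels do exist, external indices can be computed directly; the sketch below uses scikit-learn’s adjusted Rand and Fowlkes-Mallows scores on invented label vectors.

```python
from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score

# Hypothetical gold-standard classes and the labels produced by a clustering run.
ground_truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cluster_labels = [1, 1, 1, 0, 0, 2, 2, 2, 2]

# Both indices are invariant to cluster label permutations.
print(f"adjusted Rand index: {adjusted_rand_score(ground_truth, cluster_labels):.2f}")
print(f"Fowlkes-Mallows index: {fowlkes_mallows_score(ground_truth, cluster_labels):.2f}")
```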

2.2 Forensic Evaluation

Upon the establishment of facts through a forensic investigation, decision-making follows, which is the adoption of a hypothesis as a conclusion [71]. While evaluation of forensic outcomes is usually discussed in court contexts, review of forensic decisions is appropriate at all phases of the investigation [72]. It begins with evaluation of the individual hypothesis against all competing claims; the accuracy (including quantification of error rates) of the results obtained through the automated tools used in the analysis; the extent to which experience and domain knowledge were helpful; and the ease with which the entire investigative process can be explained to a non-expert. Because automated systems are not self-contained and thus cannot take everything into account [15], it is possible that multiple DFAI approaches were used to find solutions to all competing hypotheses. As a result, forensic evaluation in this case will entail weighing the differing claims against the overall investigative problem. One way of determining this is to assign an evidential weight (strength of evidence) or “Likelihood Ratio” (LR) [73,74,75] to all contending claims. Although the LR was originally created as a framework for evaluating forensic (science) evidence, the concept can be adopted to help make the DFAI’s outcome more intelligible. Contrary to the factually deterministic requirements of evidence in a criminal or civil case, the majority of AI-based algorithms and their outputs are mostly probabilistic. However, forensic examiners do not pronounce judgments or issue final decisions; rather, they provide expert testimony (or an opinion) or a report of their findings to fact finders (attorneys, judges, etc.). Succinctly reporting forensic investigation findings remains a challenge [76], and while it may be comprehensible to state an opinion on a hypothesis and its alternatives as true (or false), such an approach lacks the transparency and logical informativeness necessary to reach a verdict in a legal proceeding. Consequently, reporting DF findings in terms of weights or LRs enables the decision maker to assign the evidence an appropriate level of confidence [15]. LRs represent examiners’ assessment of the relative probability of observed features under various hypotheses concerning a particular case. Furthermore, the European Network of Forensic Science Institutes (ENFSI) [75] recommends the LR (simply in terms of numbers) even when examiners must make subjective decisions [75], because it makes the examiner’s belief and inferential process explicit and transparent, facilitating the evaluation of strengths and weaknesses for those who rely on it [76]. While expressing subjective decisions in terms of LRs has grown widespread in Europe, doubts have been raised in support of empirical data instead [73]. In other contexts, verbal expressions of LRs have been proposed; for example, according to [73], consider LR expressions of the form “at least 1,000 times more likely” and “far more probable.” The former is likely to receive scepticism regarding the basis for that figure, whereas the latter has a stronger possibility of acceptance [73].
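For concreteness, the LR is conventionally expressed as the ratio of the probabilities of the observed evidence \(E\) under two competing hypotheses, for example a prosecution hypothesis \(H_{p}\) and a defence hypothesis \(H_{d}\):

$$\begin{aligned} LR = \frac{P(E \mid H_{p})}{P(E \mid H_{d})} \end{aligned}$$

Values greater than 1 support \(H_{p}\) over \(H_{d}\), and the further the ratio departs from 1 in either direction, the stronger the support for the corresponding hypothesis.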

Consequently, given the probabilistic (or stochastic) nature of the results of DFAI models, and the fact that these models have been empirically verified as accurate and well-suited for analytical purposesFootnote 7, as well as the inclusion of an “expert-in-the-middleFootnote 8,” it is still necessary to find the most appropriate way to report the results in the clearest and most understandable manner possible, albeit as recommendations. The LR formulation recommended by the UK’s Association of Forensic Science Providers (AFSP) in its “standard for the formulation of evaluative forensic science expert opinion” is available in [77].

However, in 2016, the US President’s Council of Advisors on Science and Technology [78] recommended that forensic examiners reveal the error rates observed in black-box validation when reporting or testifying on forensic comparisons. Thus, error rates have become an intrinsic element of investigative outcome reporting, and with them, fact-finders have a greater logical and empirical understanding of the probative value of the examiner’s conclusion [73]. It is not straightforward to express likelihood ratios in ways that are consistent with probabilistic distributions or error estimates (usually real values between 0 and 1). An approach based on the combination of prior probabilities and the likelihood ratio was proposed in [79]. However, when the conditional components of a hypothesis are transposed, evaluating its probability might be logically fallacious [72]. Probabilities are rarely acceptable in legal decisions, because an 80% probability amounts to saying that one in five cases would be decided wrongly [80]. Given that probability is relative to certainty (or otherwise), we can align our DFAI evaluation intuition with the “Certainty Scale”, or “Confidence Scale” (C-Scale), proposed in [72, 81, 82], which is reasonably appropriate for assigning strength of evidence to continuous values with respect to the hypothesis. As noted by [72]: “...the strength of evidence does not exist in an abstract sense, and is not an inherent property of the evidence; it only exists when a forensic practitioner assigns value to the evidence in light of the hypothesis.” Therefore, in light of each working hypothesis resolved via DFAI, Table 1 represents a proposed C-Scale for expressing the strength of evidence that is compatible with DFAI analysis.

This is by no means a standard evaluation, but rather a tentative proposition that will need to be refined as research in this field progresses. Additionally, unlike the LR recommendation and the C-Scale proposals, which are based on hypothesis (or strength of hypothesis) about source identification during a forensic investigation, the DFAI C-scale evaluation method is fairly generic (for hypothesis and AI models) and applicable in a wide variety of situations, including strength of evidence. Furthermore, the FP and FN rating scales in Table 1 can be adjusted according to investigative tasks, as there are instances when a 50% to 60% false positive/negative rate would indicate “weak support”.

Table 1 A proposed AI-adaptive C-Scale evaluation of strength of evidence for DFAI

As previously stated, human expert interpretation and evaluation are key components of DFAI in a partially automated setup, because it is difficult to predetermine all of the reasoning required to perform forensic investigation work [15]. However, in a fully automated scenario, learning algorithms in conjunction with contextually structured expert systems can incorporate domain-specific knowledge-base rules. An expert system can also be built to evaluate every hypothesis at each modular level and make recommendations based on codified LRs.

3 Standardization in DFAI

The issue of standardization in digital forensics has persisted for several years: first, because standard guidelines have been unable to keep up with the dynamic pace of technological sophistication, and second, because forensic stakeholders have been unable to agree on certain rules and standards, resulting in conflicts of interest [83]. Additionally, the distinctiveness of investigations, the domain’s diversity, and the existence of disparate legislative frameworks are all reasons cited as impediments to the standardization of the DF field [85, 86]. Nowadays, when it comes to standardization, the majority of what is available (in the form of guidelines) are check boxes, since the notion is that the more detail, the better the standard [87]. Nonetheless, the “Forensic Science Regulator”, in a 2016 guidance draft, highlighted the validation of forensic methods as a standard, rather than the software tool [84]. This method validation entails a number of assessments, including the evaluation of data samples, which are relatively small in DF [88]. Standardization in DF (as well as DFAI) is a broad and intricate area of study, as every component of DF requires it. However, as part of the advancement of DFAI (for which further study is envisaged), we examine standardization within the context of forensic datasets and error rates.

3.1 DFAI Datasets

Datasets (or data samples) are a critical component of AI, as they define the validity of an AI model to a great extent. A dataset is a set of related, discrete elements that have varying meanings based on the context and are used in some type of experiment or analysis [89]. To evaluate or test novel approaches or to replicate existing procedures, similar data sets are required; for example, investigations on facial recognition require human facial sample data. Similarly, an inquiry into message spamming necessitates the collection of e-mail samples. Datasets are often beneficial in the following ways, according to the National Institute of Standards and Technology (2019)Footnote 9:

  • For training purposes: a dataset is generated for training purposes, i.e., the simulation of case scenarios in order to train a model to learn the specifics of that environment, and to facilitate practitioners’ training on case handling so that their ability to identify, examine, and interpret information can be assessed.

  • Tool validation: wherein a dataset is utilized to determine the completeness and correctness of a tool when it is deployed in a given scenario.

  • Familiarity with tool behavior: for instance, a dataset collected from users’ software interaction traces. As a result, such datasets are crucial for deciphering how certain software behaves on a device and for assisting in the interpretation of digital traces left by usage [86].

The process of creating a dataset is critical, even more so in the domain of DF, where each component must be verifiable, fit for purpose, and compliant with some set of standards. Therefore, the created dataset must be realistic and reliable [90]. This also entails having a high-quality, correctly labeled dataset that is identical to the real-world use case for testing and evaluation purposes, substantial enough for adequate learning, and is accessible to ensure reproducibility [89]. In the context of DFAI, there are a few considerations that must be made in order to conduct a forensically sound operation with respect to datasets.

Due to limited availability of datasets in DF, practitioners frequently overuse a single data corpus in developing several tools and methodologies, resulting in solutions gradually adapting to a dataset over time. For example, the Enron corpus has developed into a research treasure for a variety of forensic solutions, including e-mail classification [91,92,93], communication network analysis [19, 94], and other forensic linguistics works [95,96,97]. However, proving that a solution based on a single corpus is sufficiently generalizable to establish a conclusion in a forensic investigation will be difficult. Nevertheless, this is a widely recognized issue among stakeholders, and while it may be excusable in peer reviews, it is a major issue in the standardization of DF that requires immediate resolution. Similarly, while a workable DF dataset is constantly being sought, it is worth emphasizing that using a (single) dataset to assess the validity of a tool or method may not appropriately represent the general case scenario.

Datasets are created as a “mock-up” of a specific scenario, representing the activities/events that occur within an environment, supposedly within a specified time period. Each use case is time-dependent; as such, the continued relevance of a particular use case (from a previous period) in a future period may be debatable. This is particularly true in the domain of DF. For instance, given the advancements in computer network architecture, it may be illogical to use a dataset of network traffic from the 1990s to model an intrusion detection system today. This is also a point made in [98]. Similarly, it may seem counter-intuitive to argue that a model trained on images retrieved from older (e.g., 2000) CCTV footage or cameras is helpful for identifying objects in a contemporary crime scene image - technology has improved. However, in an ideal circumstance and for a robust model, updating the dataset with a collection of new features compatible with recent realities, rather than completely discarding the old dataset, should be viable.

Criminal cases such as hate speech [99] may involve local nuances [101], and while a global dimension may not be impossible [100], investigations should take regional differences into account. For instance, in a typical forensic linguistics investigation [95,96,97] (e.g., cyberbullying [102]), a language corpus plays a vital role. However, native speakers’ use of a language (for example, English) may differ greatly from that of non-native speakers. Language, in usage and writing, varies across borders. An AI model trained to identify instances of bullying using a message corpus derived from British databases may not be completely representative of the same use case in Anglophone Africa – some English phrases are offensive to native speakers but inconsequential to non-natives. As such, a DFAI training dataset should accurately represent the use case (in terms of geographical location and dimensionality) for which application is intended.

Lastly, the demand for synthetically generated datasets is increasing in the DF domain, and rightly so. The issues of privacy, unavailability, and non-sharing policies continue to be a barrier to obtaining forensically viable datasets for the purpose of training, testing, and validating forensic tools. Synthetic data, first introduced in [103, 104], is described as artificially generated data that contains the statistical properties of the original data. While synthetic data can be extremely beneficial for research and education, the question is whether any novel technique can be tested on fictitious data [105], and particularly for DF, whether a perfect simulation of a crime event can be achieved. Nonetheless, several studies (not related to DF) have demonstrated the usefulness of synthetic data in comparison to actual data [106, 107], in which a model was trained on synthetic data and tested on real data. The results indicated that the accuracy of a variety of ML approaches was slightly decreased and varied when a synthetic dataset was used. Synthetic data can be used to augment or enhance an existing dataset, as well as to adjust for data imbalances created by an event’s rarity. In DFAI, modeling with synthetic data is sometimes useful, but not always. Synthetic data generation requires a purpose-built dataset that may be too narrow for general-purpose solutions; demonstrating the results’ applicability to real-world crime data may be difficult. This point is highlighted in [108], while some other challenges are emphasized in [109]. Furthermore, synthetic datasets are randomised, which means that the data do not follow a regular pattern. We foresee an extended challenge if such a dataset is used to train an unsupervised neural network model – the model may learn non-interpretable patterns. While it is natural to assume that random data is less biased, there is no means to verify this claim. Thus, while synthetic datasets may be advantageous for solving specific ML problems, their usage in DFAI should be carefully considered.

3.2 DFAI Error Rates

As critical as accuracy is in determining the correctness of an evidence mining process, so also is the error rate. The error rate not only indicates the probability that a particular result is correct, or the strength of a technique, but also its limitations. According to the Scientific Working Group on Digital Evidence (SWGDE) [110], the term “error” does not allude to a mistake or blunder, but rather to the inevitable uncertainty inherent in scientific measurements. Numerous factors can influence these uncertainties, including algorithmic flaws, statistical probability, physical measurements, and human error [110]. One of the criteria for validating scientific methods under the Daubert standardFootnote 10 is the assessment of the error rate. Indeed, some of the other requirements (in the Daubert standard) are heavily weighted around the error rate. For example, the Daubert standard requires the validation (or testing) of a theory or methodology. The question is: how can we validate a hypothesis and its alternatives, or a method, without determining the rate of uncertainty? Additionally, peer-reviewed publication of the method(s) used in the forensic examination of digital artifacts is critical. Peer review enables scientific validation of the technique and quantification of methodological uncertainties. This demonstrates the importance of publishing error rates for forensic methods alongside accuracy values. Thus, in contrast to conventional approaches to AI/ML methods that place a premium on accuracy (or precision), we propose that the results of DFAI algorithms include additional information regarding the method’s errors and uncertainties. That way, the method’s limitations are known in advance, allowing for an assessment of whether the method’s outcomes are sufficiently (and scientifically) valid as evidence.

In alignment with the guidelines offered in [110], the uncertainty associated with any DFAI technique can be assessed in two ways: random and systematic [111]. Random uncertainties are related to the technique’s algorithm and are commonly associated with measurements, whereas systematic uncertainties are typically associated with implementation and occur in tools. DF tools represent implementations of a technique, and their functionality varies according to the task they seek to resolve. It is not uncommon for software to contain intrinsic bugs [112], which are caused by logical flaws or incorrect instructions. For instance, an erroneous string search algorithm can cause a tool to report certain critical evidence incompletely. In this case, the tool will extract some relevant strings but will likely under-report their extent. Because these flaws are not random, the tool frequently produces the same output when given the same input, which may be inadvertently deceptive to an examiner. Consequently, additional error mitigation methods may be required to detect and fix the error.

Due to the probabilistic nature of DFAI algorithms (the outcomes of which may be random), the error rates are expressed in terms of false positive and false negative rates (which we discussed earlier). Depending on the percentages of these errors, and as long as adequate confidence in the algorithm’s optimality exists, the error rates may only indicate the technique’s limitations, not its true efficiency. It is critical to report and publish error rates for techniques in the DF domain [113], and this should be especially true for DFAI. This increases the technique’s transparency and ensures that, in the event of method replication, the intended outcome is known. Additionally, disclosing error rates provides prospective researchers with a baseline understanding of the components that function efficiently and where improvements are anticipated, as well as preventing potential biases in interpretation. Mitigating these errors may not be straightforward scientifically, as it depends on a variety of factors; however, algorithm optimization, sufficient datasets, accurate labelling (in supervised settings), and strong domain knowledge (for proper interpretation) are some of the ways to achieve fairly reasonable success. Additional mitigation strategies for systematic errors include training, written procedures, documentation, peer review, and testing [110].

4 Methods for Optimizing DFAI Techniques

Developing an AI/ML model involves initializing and optimizing weight parameters via an optimization method until the objective functionFootnote 11 tends towards a minimum value, or the accuracy approaches a maximum value [114]. In addition to learning in predictive models, optimization is necessary at several stages of the process, and it includes selecting: (1) the model’s hyper-parameters (HPs) [115]; (2) the transformation techniques to apply to the model prior to modelling; and (3) the modelling pipeline to apply. This section will not explore the depth of optimization in AI, but will instead describe hyper-parameter optimization (HPO) [116] as a component of DFAI models.

Two kinds of parameters are critical in ML models: (1) the model parameters, which can be initialized and updated during the learning process; and (2) the HPs, which cannot be estimated directly from data learning and must be set prior to training an ML model, because they define the model’s architecture [117]. Understanding which HPs are required for a given task is critical in a variety of scenarios, ranging from experimental design to automated optimization processes. The traditional method, which is still used in research but requires knowledge of the ML algorithm’s HP configurations, entails manually tuning the HPs until the desired result is achieved [11]. This is ineffective in some cases, particularly for complex models with non-linear HP interactions [118]. Numerous circumstances may necessitate the use of HPO techniques [119]; we highlight a few of them below, specifically focusing on DFAI tasks.

  1. Conducting a digital forensic investigation requires an inordinate amount of time, and minimizing this time has been a primary focus of research in this domain for years. Similarly, machine-driven techniques can be time consuming, depending on the size of the dataset or the number of HPs. Applying AI techniques to already complicated forensic investigations almost always adds complexity. HPO can significantly reduce the amount of human effort required to tune these HPs, hence considerably shortening the entire forensic analysis time.

  2. We have already highlighted the importance of performance in the context of DFAI procedures. ML methods require a range of HP settings to obtain optimal performance on a variety of datasets and problems. Numerous HPO techniques exist to assist in optimizing the performance of AI-based models by searching over different optimization spaces in quest of the global optimum for a given problem.

  3. As previously stated, reproducibility is a necessary condition for a standard DF technique. HPO can assist in a variety of ways in achieving this goal. When evaluating the efficacy of several AI algorithms on a certain analysis, for example, adopting the same HP settings across all models establishes a fair comparison process. This can also be used to determine the optimal algorithm for a particular problem. Reporting these HP configurations can be advantageous in the event of DFAI model replication.

As with conventional AI models, when developing a DFAI model with HPO in mind, the process will include the following: an estimator (a classifier or regressor) with its objective function, a search (configuration) space, an optimization method for identifying suitable HP combinations, and an evaluation function for comparing the performance of various HP configurations [118]. A typical HP configuration can be continuous (e.g., multiple learning rate values), discrete (e.g., the number of clusters, k), binary (e.g., whether to use early stopping or not), or categorical (type of optimizer), all of which can be combined to produce an optimized model. Because the majority of ML algorithms have well-defined open-source frameworks (such as scikit-learnFootnote 12) that can assist in solving problems by tuning (changing the values of) some already pre-set HPs, we will focus on HPOs related to DL models, because they require self/auto-tuning of un-set parameters. HPs in DL are set and tuned according to the complexity of the dataset and the task, and they are proportional to the number of hidden layers and neurons in each layer [120]. The initial parameter setting for a DL model is to specify the loss function (binary cross-entropy [121], multi-class cross-entropy [122], or squared error loss) appropriate for the problem type. Then comes the type of activation function (e.g., ReLU [123], sigmoid,Footnote 13 etc.) that describes how the weighted sum of the input is transformed into the output. Finally, the optimizer type is specified, which may be stochastic gradient descent (SGD) [124], Adaptive Moment Estimation (Adam) [125], or root mean square propagation (RMSprop) [126]. In what follows, we describe several optimization techniques that can be vital to the optimization of a DFAI model.
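As an illustrative, hedged sketch, the snippet below wires together these initial choices (loss function, activation functions, and optimizer) for a small binary classifier using the Keras API; the layer sizes, input dimensionality, and learning rate are arbitrary assumptions rather than recommended settings.

```python
from tensorflow import keras

n_features = 32  # assumed input dimensionality

model = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(64, activation="relu"),    # activation function choice
    keras.layers.Dense(1, activation="sigmoid"),  # sigmoid output for binary labels
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # optimizer type (Adam)
    loss="binary_crossentropy",                           # loss appropriate to the task
    metrics=["accuracy"],
)
model.summary()
```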

4.1 Methods for Hyper-Parameter Optimization in DFAI

A. Trial and error method

This method involves tuning parameters manually. It entails testing a large number of HP values based on experience, guesswork, or analysis of prior results. The approach is to improve parameter guesses iteratively until a satisfying result is obtained. This approach may be impractical for a variety of problems, particularly those involving DF analysis, which may involve a large number of HPs or complex models [118]. However, this technique can improve interpretability by allowing the model's various working parts to be assessed as the parameters are tuned.

B. Grid search (GS)

This is a frequently used technique for exploring the HP configuration space [127]. It performs a parallel search of the configuration space and is suitable within a limited search space; otherwise, it may suffer from the "curse of dimensionality" [129].

When the DF examiner has sufficient knowledge of the (finite) set of HPs to specify [95] for the search space, GS is preferable. Because computational intensity is one of GS's drawbacks [128], its usage in DFAI is mostly focused on comparing the performance of several ML algorithms [169] in order to identify which one achieves the best result on a given forensic task. The authors in [130] described a botnet detection method using GS optimization techniques.
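As an illustration, the sketch below runs a grid search with scikit-learn's GridSearchCV over a small, finite HP grid. The estimator, the grid values, and the synthetic data are stand-ins for an actual forensic feature set and task, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for an already-extracted forensic feature set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A small, finite HP grid: the setting in which GS is preferable.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,             # cross-validation keeps the comparison fair
    scoring="f1",     # choose a metric aligned with the forensic task
    n_jobs=-1,        # grid points are independent and evaluate in parallel
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```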

C. Random search (RS)

RS was proposed in [131] as a way to circumvent GS's limitations. Unlike GS, RS randomly selects a predefined number of candidate samples between specified upper and lower bounds and trains them until the budget is exhausted or the target accuracy is reached, allocating resources to the best-performing regions through parallelization [132].

Due to the simplicity with which RS parallelizes, it is an ideal choice for DFAI tasks involving convolutional networks (CNN) [133], such as multimedia forensics (e.g., sound and video), image forensics, and so on, in which (low-dimensional) features are mapped from one layer to the next. Training such networks can, however, be time- and memory-intensive. To optimize the process, a batching strategy [135] can be implemented that exploits the batch size and learning rate to reduce training time without compromising performance. In this case, RS may be useful for determining the ideal range of values for these parameters [134], since only the search space must be specified. Additionally, RS's use in optimizing multimedia forensics analysis suggests that it may be valuable for recurrent neural networks (RNN) [136], although RS has the disadvantage of not taking past results into account during evaluation [118]. As a result, using RS in recursive tasks such as event reconstruction in DFAI may yield less-than-optimal outcomes.
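A corresponding random search sketch with scikit-learn's RandomizedSearchCV is shown below. The estimator, the sampling distributions, and the evaluation budget are illustrative assumptions rather than settings for a specific multimedia forensics pipeline.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Distributions over bounded ranges rather than a fixed grid.
param_distributions = {
    "alpha": loguniform(1e-6, 1e-1),
    "max_iter": randint(200, 2000),
}

search = RandomizedSearchCV(
    SGDClassifier(random_state=0),
    param_distributions,
    n_iter=25,         # the predefined evaluation budget
    cv=5,
    n_jobs=-1,         # candidates are independent, so they parallelize easily
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```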

D. Gradient descent (GD)

Gradient descent [137] computes the gradient of the objective with respect to continuous variables in order to determine the most promising direction toward the optimum. Gradient-based optimization techniques converge to a local minimum faster than the previously described techniques, but they are only applicable to continuous HPs, such as the learning rate in a NN [138], as other types of HPs (e.g., categorical ones) lack a gradient direction. The application of GD in DFAI approaches is almost ubiquitous, as it underlies virtually all DL models. It is also one of the most straightforward optimization schemes to understand and interpret. However, the findings published in [172] demonstrated the occurrence of "catastrophic forgetting" when gradient descent is used, which is particularly problematic for reproduction: when trained on a new task, ML models trained with gradient descent alone may forget what they learned on a previous task. Combining gradient descent with dropout [172] is recommended to mitigate this.
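The update rule itself is simple. The toy sketch below applies plain gradient descent to a one-dimensional convex objective whose gradient is known in closed form; in real DL models the gradients are obtained by automatic differentiation and this loop is handled internally by the chosen optimizer (SGD, Adam, RMSprop).

```python
# Minimal gradient descent on a toy convex objective f(w) = (w - 3)^2,
# whose gradient is f'(w) = 2 * (w - 3). Illustrative only: in practice the
# gradient is taken with respect to model parameters (or continuous HPs
# such as the learning rate) rather than a single scalar.
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0              # initial guess
learning_rate = 0.1
for step in range(100):
    w = w - learning_rate * grad(w)   # the gradient descent update rule

print(f"converged value of w: {w:.4f}")  # approaches the minimum at w = 3
```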

E. Bayesian Optimization (BO)

BO [139, 140] is an iterative algorithm that selects future evaluation points based on prior results. It is a general-purpose model for global optimization, designed to become less incorrect as more data are observed [141]. BO identifies optimal HP combinations faster and is applicable regardless of whether the objective function is stochastic, discrete, continuous, convex, or non-convex. The Gaussian process (GP) [142], Sequential Model-based Algorithm Configuration (SMAC) [143], and the Tree-structured Parzen Estimator (TPE) [144] are examples of common BO algorithms. BO is especially useful in tools such as the Waikato Environment for Knowledge Analysis (WEKA) [145], an open-source tool with collections of ML and data processing algorithms. Numerous DF analysis methods [146,147,148] have been proposed or conducted using WEKA, leveraging its robust data mining capabilities and the possibility to choose from, or compare, a variety of extensible base learning algorithms for a specific forensic task. Selecting the right algorithm and HPs for optimal performance and accuracy in a WEKA-based DFAI analysis can be challenging; in this case, the properties of BO can aid in choosing the ML method and HP settings that minimize analytical errors.

The works presented in [149] and [150] demonstrate how BO (more precisely, SMAC and TPE) can be used as meta-learning to guide the choice of ML algorithms and HPO settings that outperform conventional selections on a classification task.
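As a brief illustration of the TPE flavour of BO, the sketch below uses the hyperopt library (assuming it is installed); the objective is a stand-in that would, in practice, train a forensic classifier with the proposed HPs and return its validation loss.

```python
from hyperopt import STATUS_OK, fmin, hp, tpe

# Stand-in objective: in practice, train the chosen estimator with the
# proposed HPs and return a validation loss to be minimized.
def objective(params):
    loss = (params["learning_rate"] - 0.01) ** 2 + 0.1 * params["num_layers"]
    return {"loss": loss, "status": STATUS_OK}

space = {
    "learning_rate": hp.loguniform("learning_rate", -7, 0),  # e^-7 .. e^0
    "num_layers": hp.choice("num_layers", [1, 2, 3, 4]),
}

# TPE builds a surrogate of the objective and proposes the next HPs
# based on all previous evaluations.
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)
```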

F. Multi-fidelity optimization (MFO)

MFO techniques are frequently used to overcome the time constraints that other HPO techniques face when confronted with huge configuration spaces and datasets. MFO evaluates practical applications by combining low- and high-fidelity measures [151]. In low-fidelity evaluation, a relatively small subset is evaluated at low cost and with poor generalization performance; in high-fidelity evaluation, a larger subset is examined at a higher cost and with improved generalization performance [152].

MFO techniques include "bandit-based" [153] methods that allocate computational resources to the "best-arm" (most promising) HP settings. Successive halving (SHA) and Hyperband (HB) are the two most commonly used bandit-based algorithms [152, 154].
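A minimal plain-Python sketch of successive halving is given below. The dummy evaluation function and the budget schedule are assumptions of ours; in practice the budget would correspond to training epochs or the fraction of data used at each fidelity level.

```python
import random

def evaluate(config, budget):
    """Placeholder: train `config` for `budget` units (e.g., epochs or a data
    fraction) and return a validation score. Dummy values for illustration."""
    return random.random() * budget

def successive_halving(configs, min_budget=1, eta=2, rounds=4):
    budget = min_budget
    for _ in range(rounds):
        # Evaluate all surviving configurations at the current (low) fidelity.
        scored = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
        # Keep the most promising fraction and increase the budget (higher fidelity).
        configs = scored[: max(1, len(scored) // eta)]
        budget *= eta
        if len(configs) == 1:
            break
    return configs[0]

candidates = [{"learning_rate": random.uniform(1e-4, 1e-1)} for _ in range(16)]
print("selected configuration:", successive_halving(candidates))
```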

The application of MFO techniques to DFAI can be exemplified with transfer learning (TL) [155], the process by which previously stored knowledge is used to solve different but related problems. TL has been deployed in a variety of DFAI methods [156, 157], most notably in image forensics and detection problems using labeled samples. Thus, low- or high-fidelity optimization can help determine the optimal solution depending on the size of the stored knowledge (dataset), the investigative problem, and the available computational resources. The work in [158] describes the detection of (signature-based and unknown) malware-infected domains from HTTPS traffic, using TL optimized with Hyperband. Additionally, a state-of-the-art HPO technique called Bayesian Optimization Hyperband (BOHB) [159], which combines BO and HB to maximize the benefits of both, is gaining attention, and it will be interesting to see how DF research employs this promising technique in the future.

G. Metaheuristic algorithms

Metaheuristic algorithms are a popular class of optimization techniques, primarily inspired by biological evolution and genetic mutation. They are capable of solving problems that are non-continuous, non-convex, or non-smooth [118]. Population-based optimization algorithms (POAs) [160] are a notable example of metaheuristic algorithms, since they update and evaluate each generation within a population until the global optimum is found. The two most frequently utilized types of POA are genetic algorithms (GA) [161] and particle swarm optimization (PSO) [162]. PSO, specifically, works by allowing a group of particles (a swarm) to traverse the search space in a semi-random fashion [116], while discovering the optimal solution through information sharing across the swarm.

Network forensics with DL is an ideal application for PSO, as training such models can be time-consuming, since it requires identifying complex patterns in large amounts of data. Detecting network intrusions or attacks requires iteratively reverse-engineering parsers and network traffic logs, which can be challenging for humans [163]. The work described in [163] shows the efficacy of PSO as an instrument to minimize or maximize an objective function and to determine the optimal HPs (such as the number of epochs, the learning rate, and the batch size) that contribute to the deep forensic model's AUC and to the reduction of the false alarm rate.
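The sketch below is a minimal NumPy implementation of the basic PSO update on a toy two-dimensional objective standing in for a validation metric over (learning rate, batch size). The bounds, swarm size, and coefficients are illustrative assumptions, not the settings used in [163].

```python
import numpy as np

rng = np.random.default_rng(0)
lower, upper = np.array([1e-4, 16.0]), np.array([1e-1, 256.0])  # HP bounds

def objective(x):
    """Stand-in for a (negated) validation metric such as AUC: in practice each
    particle position encodes HPs and this would train the deep forensic model."""
    return np.sum((x - np.array([0.01, 64.0])) ** 2, axis=1)

n_particles, n_dims, n_iters = 20, 2, 50
w, c1, c2 = 0.7, 1.5, 1.5                    # inertia and acceleration coefficients

pos = rng.uniform(lower, upper, size=(n_particles, n_dims))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), objective(pos)
gbest = pbest[np.argmin(pbest_val)]

for _ in range(n_iters):
    r1, r2 = rng.random((2, n_particles, n_dims))
    # Semi-random movement: inertia plus pulls toward personal and global bests.
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lower, upper)   # keep particles inside the HP bounds
    vals = objective(pos)
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[np.argmin(pbest_val)]

print("best HP vector found (learning rate, batch size):", gbest)
```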

Table 2 Comparison of HPO techniques (n denotes the number of HP values and k the number of HPs)

4.2 General Discussion on HPO in DFAI

It is worth emphasizing that the techniques discussed here are by no means exhaustive in terms of definition, components, and applicability. These few were chosen for their popularity and as a means of briefly discussing optimization techniques in the context of DFAI models; in-depth discussions of HPO are available in [114, 118]. In general, depending on the size of the data, the complexity of the model (e.g., the number of hidden layers in a neural network (NN) [164,165,166] or the number of neighbours in a \(k\)-Nearest Neighbors (KNN) model [167, 168]), and the available computational resources, an HP configuration may lengthen the time required to complete a task. Furthermore, in most ML methods only a few HPs have a substantial effect on the model's performance [118], yet having many HP configurations exponentially increases the complexity of the search space. With DL, HPO techniques require significant resources, particularly when dealing with large datasets. Considering all of these complexities, especially in the context of DFAI, where timeliness, transparency, and interpretability are critical, a well-chosen HPO technique should aid rapid convergence and avoid random results. Given that DF analyses are case-specific, often distinctive, and have interpretability as a fundamental requirement, decomposing complexity should be a priority. Thus, unless forensic investigators have sufficient computing resources and a working knowledge of the parameter settings for the various HPO techniques, they may consider the default HP settings in major open-source ML libraries, or make use of a simple linear model with reduced complexity where appropriate. In the case of a self-defined DNN model, basic HP settings and early stopping techniques can be considered (see the sketch below). Finally, Table 2 compares the HPO algorithms discussed thus far and their respective strengths and drawbacks, as adapted from [118] and extended with additional inputs.
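As a minimal sketch of that last recommendation, and assuming TensorFlow/Keras is available, the example below trains a small self-defined network with the default optimizer settings and an early stopping callback; the synthetic data stand in for extracted forensic features.

```python
import numpy as np
import tensorflow as tf

# Synthetic binary-classification data stands in for extracted forensic features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

# A deliberately simple model, keeping the default 'adam' optimizer settings.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping: halt training once the validation loss stops improving,
# and restore the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop], verbose=0)
```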

5 Conclusion and Future Works

In this paper, we addressed common misunderstandings about "AI Forensics" and "Digital Forensics AI" (DFAI). We presented the notion of AI Forensics as specified in the literature, while also providing a conceptual description of "Digital Forensics AI" as a generic term referring to all the components and instruments used in the application of AI in digital forensics. We then examined techniques and methods for evaluating the effectiveness of classification, regression, and clustering algorithms employed in digital forensics investigation, focusing on indicators that should not be disregarded when evaluating a predictive model's correctness. Additionally, we examined forensic (decision) evaluation and proposed an AI-adaptive confidence scale reporting system that takes into account the error rates associated with false positives and negatives in a forensic output. With regard to standardization, we laid particular emphasis on the datasets and error rates of AI-based programs used in digital forensics.

Finally, we conducted a comparative review of the key optimization techniques used in machine learning models, focusing on their application (and suitability) to digital forensics. We summarized these techniques and their various strengths and drawbacks, as well as their corresponding time complexities. Additionally, we presented our view on the use of hyper-parameter optimization in AI-based DF analysis in the discussion section.

As this is an attempt to formalize the concept of DFAI with all its prospective components, future work will strive to expand standardization beyond the two areas addressed thus far: datasets and error rates. Furthermore, expanding the methods for evaluating DFAI techniques to include comparative analysis of the various methods in practical settings appears to be a promising development for the domain, and it will be fascinating to see how it evolves. Additionally, the explainability/interpretability and understandability of AI models employed in forensic investigation (and, more widely, in general) remain a concern. This is also a critical instrument of DFAI for which resources can be expanded; hence, our future work will seek to broaden the research focus in this direction.