Stochastic process mining: Earth movers’ stochastic conformance
Introduction
Process mining aims to analyse event data in a process-centric manner and can be used to identify, predict, and address performance and compliance problems [1]. The uptake of process mining in industry has accelerated in recent years. Currently, there are more than 35 commercial offerings of process mining software (for instance Celonis, Disco, ProcessGold, myInvenio, PAFnow, Minit, QPR, Mehrwerk, Puzzledata, LanaLabs, StereoLogic, Everflow, TimelinePI, Signavio and Logpickr). These products still focus on process discovery. However, the perceived importance of conformance checking is clearly growing (see, for example, the recent surveys by Gartner [2]).
Conformance checking techniques aim to compare observed behaviour in the form of an event log with modelled behaviour. Models may be expressed using BPMN, transition systems, Petri nets, process trees, statecharts, etc. Such models may have been made by hand or learned from event data using process discovery techniques. The first comprehensive conformance checking techniques used token-based replay to count produced, consumed, missing, and remaining tokens [3]. Over the last decade, alignment-based techniques replaced token-based replay in process mining research. Alignments are used to directly relate observed traces to the corresponding closest paths through the model [4], [5]. Many conformance measures have been proposed throughout the years [1], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13]. These cover different quality dimensions. In [1], four major quality dimensions were identified: recall, precision, generalisation, and simplicity. Most of the conformance measures focus on the first two dimensions. Recall measures quantify the fraction of the log that “fits” the model. This intuitive notion can be operationalised in different ways, e.g., the percentage of observed traces that can be generated by the model or the number of missing and remaining tokens during replay [1], [4], [5]. Precision measures complement recall: they aim to quantify the fraction of modelled behaviour that was actually observed in the event log. Many precision notions have been proposed, but, unfortunately, most turned out to be problematic [14], [15]. There are several reasons for this; here, we name the two most important problems encountered using most of the measures. First, a model with loops allows for infinitely many traces, making it difficult to define the “fraction” of observed behaviour (i.e., the observed percentage of modelled traces is zero by definition). Second, precision depends on recall: when many of the observed traces are not fitting, we cannot talk about precision in a meaningful way (i.e., precision is not orthogonal to but is influenced by recall).
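The recall notion sketched above can be made concrete in a few lines. The sketch below assumes, purely for illustration, that the model's behaviour is given as a finite set of accepted trace variants; real models (e.g., Petri nets with loops) have infinitely many traces and require replay or alignment techniques instead of a set-membership test.

```python
from collections import Counter

def trace_recall(log, accepted):
    """Fraction of log traces (weighted by frequency) reproducible by the model.

    `log` is a list of traces (tuples of activity labels, duplicates allowed);
    `accepted` is the set of trace variants the model can generate. This
    enumerated-set view is a simplification used only for illustration.
    """
    counts = Counter(log)
    fitting = sum(freq for trace, freq in counts.items() if trace in accepted)
    return fitting / len(log)

# hypothetical log: 8 fitting traces, 2 deviating ones
log = [("a", "b", "c")] * 8 + [("a", "c")] * 2
model = {("a", "b", "c"), ("a", "b", "b", "c")}
print(trace_recall(log, model))  # 0.8
```

Note that the converse question, which fraction of `accepted` was observed, is exactly the precision notion that breaks down once the model has infinitely many traces.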
Therefore, we first justify our choice to focus on stochastic conformance checking using a small toy example. Intuitively, most measures reason in terms of the “fraction of observed behaviour” and the “fraction of modelled behaviour”. The first fraction is easy to quantify, because the event log is finite and observed behaviours have a frequency. The second fraction is difficult to define, leading to the two problems related to precision mentioned before. How can we talk about the “fraction of modelled behaviour” covered by the event data when the number of possible traces is infinite or many observed traces are non-fitting? Note that the event log contains only a sample of behaviour, and it is odd to assume that for precision one would need to see all possible behaviour. The analysis in this paper shows that the absence of probabilities in the process model is a direct cause of these problems. Adding probabilities allows us to better reason about the “fraction of modelled behaviour” covered by the event log. This paper shows that these problems indeed disappear when using probabilistic models. Therefore, we advocate the use of stochastic conformance checking and provide a new approach to compare event logs and process models.
This paper extends the work presented in [16] where we introduced the first stochastic conformance checking technique. In [16], we proposed a measure for quantifying the difference between a process model having probabilities and a standard event log capturing the frequencies of traces. Both the process model and the event log are mapped onto stochastic languages that are then compared using the so-called earth movers’ distance (EMD). Using EMD we can quantify the difference between a model and a log in a more objective manner. However, in [16] we only presented the EMD-based measure without providing diagnostics. In this paper, we extend our approach to also provide diagnostics projected onto the event log and the process model. We annotate logs and models with information explaining the differences. This will help to diagnose deviations. To do this, we also had to extend the concept of the so-called reallocation matrix to deal with paths in the model. The reason is twofold. First of all, we want to handle duplicate and invisible activities: the same activity may appear at different locations in the process model and parts of the process may be skipped. In Petri net terms, this corresponds to multiple transitions having the same activity label or transitions that do not have a label (that is, so-called τ-activities that are not recorded in the event log). Second, to localise deviations in models we need to reconstruct the model states in which these deviations occur. Therefore, we added the new concept of stochastic trace alignments that relates observed traces to paths in the model in detail, and added projections of these trace alignments on the event log and on the model. The novel EMD-based conformance checking technique based on the extended reallocation matrix has been implemented as a ProM plug-in and can be obtained by downloading the Earth Movers’ Stochastic Conformance Checking package from http://promtools.org.
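The log side of this mapping is straightforward to illustrate: an event log induces a stochastic language by normalising trace-variant frequencies. The sketch below shows only this log side; obtaining the model's stochastic language requires replaying or unfolding the stochastic model, which is beyond a few lines.

```python
from collections import Counter

def stochastic_language(log):
    """Map an event log to a stochastic language: trace variant -> probability."""
    counts = Counter(log)
    total = len(log)
    return {trace: freq / total for trace, freq in counts.items()}

# hypothetical log with two trace variants
log = [("a", "b", "c")] * 3 + [("a", "c", "b")]
print(stochastic_language(log))
# {('a', 'b', 'c'): 0.75, ('a', 'c', 'b'): 0.25}
```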
We believe that it is important to consider the stochastic perspective as a first-class citizen for the following reasons:
- Current conformance checking techniques are asymmetric, because the frequencies of traces in the event log are taken into account without having a comparable notion on the model side. This causes foundational problems when defining, for example, precision (handling loops and event logs that are relatively small or that contain deviating behaviour). As a result, conformance checking measures and diagnostics tend to be misleading. Consequently, process discovery techniques cannot be compared properly.
- Another reason to include the stochastic perspective is the obvious link to simulation, prediction and recommendation [1]. Simulation, prediction and recommendation models inherently require probabilities. For example, to predict the remaining process time of a running case, one needs to know the likelihood of the different paths the case can take. Also, in simulation models, we need to assign probabilities to choices [17]. Therefore, the quality of a process model is not only determined by its control-flow structure but also by its stochastic perspective.
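The prediction example above can be sketched in a few lines: the expected remaining time of a running case is the probability-weighted average over the paths the case can still take. The paths, probabilities and durations below are hypothetical.

```python
def expected_remaining_time(path_predictions):
    """Expected remaining time of a running case.

    `path_predictions` maps each remaining path to a pair
    (probability, remaining time); probabilities must sum to 1.
    """
    probs = [p for p, _ in path_predictions.values()]
    assert abs(sum(probs) - 1.0) < 1e-9, "path probabilities must sum to 1"
    return sum(p * t for p, t in path_predictions.values())

# hypothetical paths with probabilities and remaining times in days
paths = {
    ("check", "approve"): (0.7, 2.0),           # fast path
    ("check", "audit", "approve"): (0.3, 5.0),  # audit loop
}
print(expected_remaining_time(paths))  # ≈ 2.9 days
```

Without path probabilities, no meaningful weighted estimate of this kind can be computed, which is exactly the argument for models carrying a stochastic perspective.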
Event logs typically have a clear Pareto distribution: it is quite common that less than 20% of the trace variants cover over 80% of the traces in the event log. In an event log with thousands of traces, a deviating trace variant that happened hundreds of times is clearly more severe than a deviating trace variant that happened only once. When models have no probabilities, the decision to include additional, less likely, paths in the model may have devastating effects on precision. If the model distinguishes between “highways” (paths in the model that have a high probability) and “dirt roads” (paths in the model that have a low probability), then it is less severe that a dirt road is not observed in the event log. However, if highways are completely missing in the event log, then this is more severe. Conversely, the decision to include a dirt road in the model or not should have limited impact on conformance.
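The Pareto effect described above is easy to quantify on a concrete log. The sketch below (the log is invented for illustration) computes the smallest fraction of trace variants needed to cover a given fraction of the traces:

```python
from collections import Counter

def variant_coverage(log, target=0.8):
    """Smallest fraction of trace variants covering at least `target` of traces."""
    counts = sorted(Counter(log).values(), reverse=True)
    total, covered = len(log), 0
    for k, freq in enumerate(counts, 1):
        covered += freq
        if covered / total >= target:
            return k / len(counts)
    return 1.0

# hypothetical log: one dominant "highway" variant and twenty rare "dirt roads"
log = [("a", "b", "c")] * 80 + [(chr(100 + i),) for i in range(20)]
print(variant_coverage(log))  # ≈ 0.048: under 5% of variants cover 80% of traces
```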
Existing conformance techniques are highly sensitive to what is and is not included in the model. Using existing measures, a model may seem similar to the actual process while it is not; conversely, a model may seem very different while it is actually very close to what is observed.
Stochastic conformance checking assumes the presence of probabilities in process models. Existing models typically do not have such probabilities. Fortunately, by using replay techniques, it is relatively easy to add probabilities to process models [17], [18]. These can be used as a first estimate and should be refined by domain experts after seeing conformance diagnostics. Given the problems mentioned, we feel that modellers should add probabilities to process models and that discovery techniques should directly return models with probabilities.
This paper extends [16] with support for silent and duplicate activities (e.g., skipping parts of the model or activities occurring in different parts of the process) and with detailed log- and model-based diagnostics, based on the new concept of stochastic trace alignments. Furthermore, a new, more efficient implementation with formal correctness proofs was added, and the evaluation was extended with several case studies.
The remainder of this paper is organised as follows. We start by providing a small motivating example in Section 2. Section 3 discusses related work. Notions such as event logs, process models, and stochastic languages are introduced in Section 4. Section 5 introduces the Earth Movers’ Stochastic Conformance (EMSC) notion to compare stochastic languages and presents the reallocation matrix. Based on this, stochastic trace alignments are computed, which serve as input for diagnostics projected on the event log and process model. Section 6 evaluates the approach, which has been implemented in ProM. Section 7 discusses various open challenges. Section 8 concludes the paper.
Motivating example
To motivate the need for stochastic conformance checking, we use a small toy example. Consider the process model in Fig. 1 and the following five event logs: The process model is expressed in terms of an accepting Petri net with an initial marking and a
Related work
This section discusses related work. Section 3.1 discusses techniques for conformance checking in process mining; note that most existing techniques ignore the stochastic perspective of process models. Section 3.2 presents two case studies in process mining and, more broadly, in Business Process Management, that can directly benefit from stochastic conformance checking.
Preliminaries
In this section, we introduce existing concepts to be used in the remainder of the paper.
Earth movers’ stochastic conformance checking
In this section, we introduce our approach for stochastic conformance checking of event logs and stochastic process models: the Earth Movers’ Stochastic Conformance (EMSC). Intuitively, EMSC mimics the earth movers’ distance: consider both the log and the model as piles of sand, each having a particular shape. Then, the earth movers’ distance is the minimal effort to transform one pile into the other, that is, the amount of sand that needs to be moved multiplied by the distance over which it
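The EMD idea can be illustrated on a tiny finite case. The sketch below is not the paper's reallocation-matrix implementation: it assumes both stochastic languages decompose into N atoms of equal mass 1/N (traces repeated by frequency), in which case the transportation problem reduces to a minimum-cost perfect matching, solved here by brute force over permutations (viable only for tiny N). The ground distance is a normalised edit distance between traces.

```python
from itertools import permutations

def levenshtein(s, t):
    """Classic edit distance between two traces (sequences of activity labels)."""
    dp = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, ct in enumerate(t, 1):
            cur = min(dp[j] + 1, dp[j - 1] + 1, prev + (cs != ct))
            prev, dp[j] = dp[j], cur
    return dp[len(t)]

def norm_dist(s, t):
    """Edit distance normalised to [0, 1] by the longer trace."""
    return levenshtein(s, t) / max(len(s), len(t), 1)

def emd_unit_atoms(log_atoms, model_atoms):
    """Earth movers' distance when both languages split into equal unit masses.

    Each list holds N atoms of mass 1/N; the optimal transportation plan is
    then a minimum-cost matching of atoms (brute-forced over permutations).
    """
    n = len(log_atoms)
    assert len(model_atoms) == n
    best = min(
        sum(norm_dist(log_atoms[i], model_atoms[p[i]]) for i in range(n))
        for p in permutations(range(n))
    )
    return best / n

# log language {abc: 0.75, acb: 0.25} vs. model language {abc: 0.5, acb: 0.5}
log = [("a", "b", "c")] * 3 + [("a", "c", "b")]
model = [("a", "b", "c")] * 2 + [("a", "c", "b")] * 2
d = emd_unit_atoms(log, model)
print(d, 1 - d)  # distance ≈ 0.167, conformance value ≈ 0.833
```

Here, a quarter of the log's mass (one of four atoms) must be moved from trace abc to trace acb at cost 2/3, giving a distance of 1/6; taking one minus the distance yields a similarity-style conformance value.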
Evaluation
In this section, we evaluate the Earth Movers’ Stochastic Conformance (EMSC) checking technique as presented in this paper using four experiments. First, we illustrate the necessity for conformance checking techniques to consider stochastic information. Second, we illustrate the influence of unfolding infinite behaviour on the proposed measure. Third, we show the feasibility of the approach on real-life event logs and models. Fourth, we illustrate the applicability of the log projections on a
Open challenges
In this section, we describe several remaining open challenges of the technique described in this paper.
Conclusion
The conformance checking technique presented in this paper considers the stochastic perspective as a first-class citizen. The main reason is to address the asymmetry between event logs and process models. A unique trace that cannot be replayed by the model is typically assumed to be less severe than a deviating trace that appears many times in the event log. Therefore, most conformance checking techniques take trace frequencies into account. Probabilities in process models can be seen as the
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
We thank the Alexander von Humboldt (AvH) Stiftung for supporting our research, and Merih Seran Uysal for her useful comments on the description of the implementation and its proofs. Artem Polyvyanyy was in part supported by the Australian Research Council project DP180102839.
References (60)
- et al., Conformance checking of processes based on monitoring real behavior, Inf. Syst. (2008)
- et al., The imprecisions of precision measures in process mining, Inform. Process. Lett. (2018)
- et al., Process compliance measurement based on behavioral profiles, Inf. Syst. (2011)
- et al., Conformance checking and performance improvement in scheduled processes: A queueing-network perspective, Inf. Syst. (2016)
- Probabilistic automata, Inf. Control (1963)
- et al., Computational experience with exterior point algorithms for the transportation problem, Appl. Math. Comput. (2004)
- Process Mining: Data Science in Action (2016)
- Gartner market guide for process mining, research note G00353970 (2018)
- et al., Replaying history on process models for conformance checking and performance analysis, WIREs Data Min. Knowl. Discovery (2012)
- et al., Conformance Checking: Relating Processes and Models (2018)