Information Systems

Volume 102, December 2021, 101724

Stochastic process mining: Earth movers’ stochastic conformance

https://doi.org/10.1016/j.is.2021.101724

Highlights

  • Frequencies – how often paths are used in a process – matter in process mining.

  • This stochastic perspective is often ignored in conformance checking techniques.

  • We introduce stochastic log–log, log–model and model–model comparison techniques.

  • These are feasible, and enable detailed insights into processes’ differences.

Abstract

Initially, process mining focused on discovering process models from event data, but in recent years the use and importance of conformance checking has increased. Conformance checking aims to uncover differences between a process model and an event log. Many conformance checking techniques and measures have been proposed. Typically, these take into account the frequencies of traces in the event log, but do not consider the probabilities of these traces in the model. This asymmetry leads to various complications. Therefore, we define conformance for stochastic process models taking into account frequencies and routing probabilities. We use the earth movers’ distance between stochastic languages representing models and logs as an intuitive conformance notion. In this paper, we show that this form of stochastic conformance checking enables detailed diagnostics projected on both model and log. To pinpoint differences and relate these to specific model elements, we extend the so-called ‘reallocation matrix’ to consider paths. The approach has been implemented in ProM and our evaluations show that stochastic conformance checking is possible in real-life settings.

Introduction

Process mining aims to analyse event data in a process-centric manner and can be used to identify, predict and to address performance and compliance problems [1]. The uptake of process mining in industry has accelerated in recent years. Currently, there are more than 35 commercial offerings of process mining software (for instance Celonis, Disco, ProcessGold, myInvenio, PAFnow, Minit, QPR, Mehrwerk, Puzzledata, LanaLabs, StereoLogic, Everflow, TimelinePI, Signavio and Logpickr). These products still focus on process discovery. However, the perceived importance of conformance checking is clearly growing (see for example the recent surveys by Gartner [2]).

Conformance checking techniques aim to compare observed behaviour in the form of an event log with modelled behaviour. Models may be expressed using BPMN, transition systems, Petri nets, process trees, statecharts, etc. Such models may have been made by hand or learned from event data using process discovery techniques. The first comprehensive conformance checking techniques used token-based replay in order to count produced, consumed, missing and remaining tokens [3]. Over the last decade, alignment-based techniques replaced token-based replay in process mining research. Alignments are used to directly relate observed traces to the corresponding closest paths through the model [4], [5]. Many conformance measures have been proposed throughout the years [1], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13]. These cover different quality dimensions. In [1], four major quality dimensions were identified: recall, precision, generalisation, and simplicity. Most of the conformance measures focus on the first two dimensions. Recall measures quantify the fraction of the log that “fits” the model. This intuitive notion can be operationalised in different ways, e.g., the percentage of observed traces that can be generated by the model or the number of missing and remaining tokens during replay [1], [4], [5]. Precision measures complement recall. Precision aims to quantify the fraction of modelled behaviour that was actually observed in the event log. Many precision notions have been proposed, but, unfortunately, most turned out to be problematic [14], [15]. There are several reasons for this. Here, we name the two most important problems encountered using most of the measures. First of all, a model with loops allows for infinitely many traces making it difficult to define the “fraction” of observed behaviour (i.e., the observed percentage of modelled traces is zero by definition). Second, precision depends on recall. 
When many of the observed traces are not fitting, we cannot talk about precision in a meaningful way (i.e., precision is not orthogonal to but is influenced by recall).
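The simplest operationalisation of recall mentioned above, the weighted fraction of observed traces that the model can generate, can be sketched in a few lines. The model language below is a hypothetical four-trace model, and the log mirrors one from the motivating example; this is an illustration, not any of the cited measures.

```python
def trace_recall(log, model_traces):
    """Weighted fraction of observed traces the model can generate:
    the simplest trace-level recall notion mentioned above."""
    total = sum(log.values())
    fitting = sum(f for trace, f in log.items() if trace in model_traces)
    return fitting / total

# hypothetical model language; the log mirrors L2 from the motivating example
model_traces = {('a','b','d','e'), ('a','d','b','e'),
                ('a','c','d','e'), ('a','d','c','e')}
L2 = {('a','b','d','e'): 245, ('a','d','b','e'): 245,
      ('a','c','d','e'): 5, ('a','d','c','e'): 5, ('a','b','e'): 500}
print(trace_recall(L2, model_traces))  # 0.5
```

Note how the 500 occurrences of the non-fitting trace ⟨a,b,e⟩ halve recall: frequencies, not just trace variants, determine the score.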

Therefore, we first justify our choice to focus on stochastic conformance checking using a small toy example. Intuitively, most measures reason in terms of the “fraction of observed behaviour” and the “fraction of modelled behaviour”. The first fraction is easy to quantify, because the event log is finite and observed behaviours have a frequency. The second fraction is difficult to define, leading to the two precision problems mentioned before. How can one talk about the “fraction of modelled behaviour” covered by the event data when the number of possible traces is infinite or many observed traces are non-fitting? Note that the event log contains only a sample of behaviour, and it is odd to assume that, for precision, one would need to see all possible behaviour. The analysis in this paper shows that the absence of probabilities in the process model is a direct cause of these problems. Adding probabilities allows us to reason more accurately about the “fraction of modelled behaviour” covered by the event log. This paper shows that these problems indeed disappear when using probabilistic models. Therefore, we advocate the use of stochastic conformance checking and provide a new approach to compare event logs and process models.

This paper extends the work presented in [16] where we introduced the first stochastic conformance checking technique. In [16], we proposed a measure for quantifying the difference between a process model having probabilities and a standard event log capturing the frequencies of traces. Both the process model and the event log are mapped onto stochastic languages that are then compared using the so-called earth movers’ distance (EMD). Using EMD we can quantify the difference between a model and a log in a more objective manner. However, in [16] we only presented the EMD-based measure without providing diagnostics. In this paper, we extend our approach to also provide diagnostics projected onto the event log and the process model. We annotate logs and models with information explaining the differences. This will help to diagnose deviations. To do this, we also had to extend the concept of the so-called reallocation matrix to deal with paths in the model. The reason is twofold. First of all, we want to handle duplicate and invisible activities: the same activity may appear at different locations in the process model and parts of the process may be skipped. In Petri net terms, this corresponds to multiple transitions having the same activity label or transitions that do not have a label (that is, so-called τ-activities that are not recorded in the event log). Second, to localise deviations in models we need to reconstruct the model states in which these deviations occur. Therefore, we added the new concept of stochastic trace alignments that relates observed traces to paths in the model in detail, and added projections of these trace alignments on the event log and on the model. The novel EMD-based conformance checking technique based on the extended reallocation matrix has been implemented as a ProM plug-in and can be obtained by downloading the Earth Movers’ Stochastic Conformance Checking package from http://promtools.org.

We believe that it is important to consider the stochastic perspective as a first-class citizen for the following reasons:

  • Current conformance checking techniques are asymmetric: the frequencies of traces in the event log are taken into account without a comparable notion on the model side. This causes foundational problems when defining, for example, precision (handling loops, and event logs that are relatively small or contain deviating behaviour). As a result, conformance checking measures and diagnostics tend to be misleading, and process discovery techniques cannot be compared properly.

  • Another reason to include the stochastic perspective is the obvious link to simulation, prediction and recommendation [1]. Simulation, prediction and recommendation models inherently require probabilities. For example, to predict the remaining process time of a running case, one needs to know the likelihood of the different paths the case can take. Also, in simulation models, we need to assign probabilities to choices [17]. Therefore, the quality of a process model is not only determined by its control-flow structure but also by its stochastic perspective.

Event logs typically have a clear Pareto distribution: it is quite common that less than 20% of the trace variants cover over 80% of the traces in the event log. In an event log with thousands of traces, a deviating trace variant that happened hundreds of times is clearly more severe than a deviating trace variant that happened only once. When models have no probabilities, the decision to include additional, less likely, paths in the model may have devastating effects on precision. If the model distinguishes between “highways” (paths in the model that have a high probability) and “dirt roads” (paths in the model that have a low probability), then it is less severe that a dirt road is not observed in the event log. However, if highways are completely missing in the event log, then this is more severe. Conversely, the decision to include a dirt road in the model or not should have limited impact on conformance.

Existing conformance techniques are highly sensitive to what is included in the model and what is not. Using existing measures, a model may appear similar to the actual process while it is not; conversely, a model may appear very different while it is actually very close to what is observed.

Stochastic conformance checking assumes the presence of probabilities in process models. Existing models typically do not have such probabilities. Fortunately, by using replay techniques, it is relatively easy to add probabilities to process models [17], [18]. These can be used as a first estimate and should be refined by domain experts after seeing conformance diagnostics. Given the problems mentioned, we feel that modellers should add probabilities to process models and that discovery techniques should directly return models with probabilities.
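As a rough illustration of such frequency-based estimation, the sketch below derives routing probabilities for an XOR-choice by counting activity occurrences in the log. This is a crude simplification of the replay-based techniques in [17], [18], and the helper and its signature are hypothetical.

```python
def branch_probabilities(log, branch_activities):
    """Crude frequency-based estimate of routing probabilities at an
    XOR-choice: count how often each competing activity occurs in the log.
    (A stand-in for full replay-based estimation.)"""
    counts = {a: 0 for a in branch_activities}
    for trace, freq in log.items():
        for a in branch_activities:
            counts[a] += freq * trace.count(a)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

# log L1 from the motivating example; the choice is between b and c
L1 = {('a','b','d','e'): 490, ('a','d','b','e'): 490,
      ('a','c','d','e'): 10, ('a','d','c','e'): 10}
print(branch_probabilities(L1, ['b', 'c']))  # {'b': 0.98, 'c': 0.02}
```

Such estimates are only a starting point: as stated above, they should be refined by domain experts after inspecting conformance diagnostics.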

This paper extends [16] with support for silent and duplicate activities (e.g., skipping parts of the model or activities occurring in different parts of the process) and detailed log- and model-based diagnostics, based on the new concept of stochastic trace alignments. Furthermore, a new, formally proven and more efficient implementation was added, and the evaluation was extended with several case studies.

The remainder of this paper is organised as follows. We start by providing a small motivating example in Section 2. Section 3 discusses related work. Notions such as event logs, process models, and stochastic languages are introduced in Section 4. Section 5 introduces the Earth Movers’ Stochastic Conformance (EMSC) notion to compare stochastic languages and presents the reallocation matrix. Based on this, stochastic trace alignments are computed, which serve as input for diagnostics projected on the event log and process model. Section 6 evaluates the approach, which has been implemented in ProM. Section 7 discusses various open challenges. Section 8 concludes the paper.

Section snippets

Motivating example

To motivate the need for stochastic conformance checking, we use a small toy example. Consider the process model in Fig. 1 and the following five event logs: L1=[⟨a,b,d,e⟩^490, ⟨a,d,b,e⟩^490, ⟨a,c,d,e⟩^10, ⟨a,d,c,e⟩^10], L2=[⟨a,b,d,e⟩^245, ⟨a,d,b,e⟩^245, ⟨a,c,d,e⟩^5, ⟨a,d,c,e⟩^5, ⟨a,b,e⟩^500], L3=[⟨a,b,d,e⟩^489, ⟨a,d,b,e⟩^489, ⟨a,c,d,e⟩^10, ⟨a,d,c,e⟩^10, ⟨a,b,e⟩^2], L4=[⟨a,b,d,e⟩^500, ⟨a,d,b,e⟩^500], L5=[⟨a,c,d,e⟩^500, ⟨a,d,c,e⟩^500]. The process model is expressed in terms of an accepting Petri net with an initial marking [p1] and a …
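The trace frequencies in the logs above induce stochastic languages by simple normalisation. A minimal sketch, representing traces as Python tuples of activity labels:

```python
def stochastic_language(log):
    """Turn an event log (multiset of trace variants with frequencies)
    into a stochastic language: each trace mapped to its relative frequency."""
    total = sum(log.values())
    return {trace: freq / total for trace, freq in log.items()}

# L1 from above
L1 = {('a','b','d','e'): 490, ('a','d','b','e'): 490,
      ('a','c','d','e'): 10, ('a','d','c','e'): 10}
print(stochastic_language(L1)[('a','b','d','e')])  # 0.49
```

The same representation applies to models: a stochastic process model also defines a probability per trace, which is what makes a direct log-model comparison possible.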

Related work

This section discusses related work. Section 3.1 discusses techniques for conformance checking in process mining; note that most existing techniques ignore the stochastic perspective of process models. Section 3.2 presents two case studies in process mining and, more broadly, in Business Process Management that can directly benefit from stochastic conformance checking.

Preliminaries

In this section, we introduce existing concepts to be used in the remainder of the paper.

Earth movers’ stochastic conformance checking

In this section, we introduce our approach for stochastic conformance checking of event logs and stochastic process models: the Earth Movers’ Stochastic Conformance (EMSC). Intuitively, EMSC mimics the earth movers’ distance: consider both the log and the model as piles of sand, each having a particular shape. Then, the earth movers’ distance is the minimal effort to transform one pile into the other, that is, the amount of sand that needs to be moved multiplied by the distance over which it …
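To make this intuition concrete, the sketch below computes the earth movers’ distance between two finite stochastic languages by casting it as a min-cost flow (transportation) problem, with a normalised Levenshtein distance between traces. This is an illustrative re-implementation under simplifying assumptions (probability mass discretised into units of 1/scale; the model distribution in the example is hypothetical), not the paper’s ProM implementation.

```python
from collections import deque

def levenshtein(a, b):
    # classic edit distance between two traces (tuples of activity labels)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def trace_dist(a, b):
    # normalised edit distance in [0, 1]
    return levenshtein(a, b) / max(len(a), len(b), 1)

def min_cost_flow(n, edges, src, sink):
    # successive-shortest-path min-cost flow; exact for integer capacities
    graph = [[] for _ in range(n)]
    for u, v, cap, cost in edges:
        graph[u].append([v, cap, cost, len(graph[v])])     # forward edge
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])  # residual edge
    total = 0.0
    while True:
        dist, prev = [float('inf')] * n, [None] * n
        dist[src] = 0.0
        queue, in_q = deque([src]), [False] * n
        in_q[src] = True
        while queue:                      # Bellman-Ford/SPFA shortest path
            u = queue.popleft()
            in_q[u] = False
            for i, (v, cap, cost, _) in enumerate(graph[u]):
                if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                    dist[v], prev[v] = dist[u] + cost, (u, i)
                    if not in_q[v]:
                        queue.append(v)
                        in_q[v] = True
        if prev[sink] is None:            # no augmenting path left
            return total
        flow, v = float('inf'), sink
        while v != src:                   # bottleneck capacity along the path
            u, i = prev[v]
            flow = min(flow, graph[u][i][1])
            v = u
        v = sink
        while v != src:                   # push flow, update residuals
            u, i = prev[v]
            graph[u][i][1] -= flow
            graph[v][graph[u][i][3]][1] += flow
            v = u
        total += flow * dist[sink]

def emd(lang_a, lang_b, scale=1000):
    # earth movers' distance between two finite stochastic languages,
    # with probability mass discretised into units of 1/scale
    A, B = list(lang_a.items()), list(lang_b.items())
    src, sink = 0, len(A) + len(B) + 1
    edges = [(src, 1 + i, round(p * scale), 0.0) for i, (_, p) in enumerate(A)]
    edges += [(1 + len(A) + j, sink, round(p * scale), 0.0)
              for j, (_, p) in enumerate(B)]
    for i, (ta, _) in enumerate(A):
        for j, (tb, _) in enumerate(B):
            edges.append((1 + i, 1 + len(A) + j, scale, trace_dist(ta, tb)))
    return min_cost_flow(sink + 1, edges, src, sink) / scale

# stochastic language of log L1 versus a *hypothetical* model distribution
log = {('a','b','d','e'): 0.49, ('a','d','b','e'): 0.49,
       ('a','c','d','e'): 0.01, ('a','d','c','e'): 0.01}
model = {('a','b','d','e'): 0.45, ('a','d','b','e'): 0.45,
         ('a','c','d','e'): 0.05, ('a','d','c','e'): 0.05}
print(emd(log, model))  # 0.02, i.e. a conformance of 1 - 0.02 = 0.98
```

The optimal flow found here plays the role of a (trace-level) reallocation matrix: its entries record how much probability mass of each observed trace is explained by each modelled trace, which is exactly the information the diagnostics in this paper project back onto log and model.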

Evaluation

In this section, we evaluate the Earth Movers’ Stochastic Conformance (EMSC) checking technique as presented in this paper using four experiments: we first illustrate the necessity of conformance checking techniques to consider stochastic information. Second, we illustrate the influence of unfolding infinite behaviour on the proposed measure. Third, we show the feasibility of the approach on real-life event logs and models. Fourth, we illustrate the applicability of the log projections on a …

Open challenges

In this section, we describe several remaining open challenges of the technique described in this paper.

Conclusion

The conformance checking technique presented in this paper considers the stochastic perspective as a first-class citizen. The main reason is to address the asymmetry between event logs and process models. A unique trace that cannot be replayed by the model is typically assumed to be less severe than a deviating trace that appears many times in the event log. Therefore, most conformance checking techniques take trace frequencies into account. Probabilities in process models can be seen as the …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We thank the Alexander von Humboldt (AvH) Stiftung for supporting our research, and Merih Seran Uysal for her useful comments on the description of the implementation and its proofs. Artem Polyvyanyy was in part supported by the Australian Research Council project DP180102839.

References (60)

  • Adriansyah, A. et al. Conformance checking using cost-based fitness analysis

  • van Dongen, B.F. et al. A unified approach for measuring precision and generalization based on anti-alignments

  • van Dongen, B.F. et al. Aligning modeled and observed behavior: A compromise between computation complexity and quality

  • Garcia-Banuelos, L. et al. Complete and interpretable conformance checking of business processes. IEEE Trans. Softw. Eng. (2018)

  • Mannhardt, F. et al. Balanced multi-perspective checking of process conformance. Computing (2016)

  • Munoz-Gama, J. et al. A fresh look at precision in process conformance

  • De Weerdt, J. et al. A multi-dimensional quality assessment of state-of-the-art process discovery algorithms using real-life event logs. Inf. Syst. (2012)

  • De Weerdt, J. et al. A robust F-measure for evaluating discovered process models

  • van der Aalst, W.M.P. Relating process models and event logs: 21 conformance propositions

  • Leemans, S.J.J. et al. Earth movers’ stochastic conformance checking

  • Rozinat, A. et al. Discovering simulation models. Inf. Syst. (2009)

  • Rogge-Solti, A. et al. Discovering stochastic Petri nets with arbitrary delay distributions from event logs

  • Ajmone Marsan, M. et al. Modelling with Generalized Stochastic Petri Nets (1995)

  • Rogge-Solti, A. et al. In log and model we trust? A generalized conformance checking framework

  • Munoz-Gama, J. et al. Enhancing precision in process conformance: Stability, confidence and severity

  • Adriansyah, A. et al. Measuring precision of modeled behavior. Inf. Syst. e-Bus. Manage. (2015)

  • Mannhardt, F. et al. Measuring the precision of multi-perspective process models

  • Goedertier, S. et al. Robust process discovery with artificial negative events. J. Mach. Learn. Res. (2009)

  • vanden Broucke, S.K.L.M. et al. Determining process model precision and generalization with weighted artificial negative events. IEEE Trans. Knowl. Data Eng. (2014)

  • Leemans, S.J.J. et al. Scalable process discovery and conformance checking. Softw. Syst. Model. (2018)