1 Introduction

Two-sample testing is important in many areas of science and commerce and has been extensively studied in the statistics literature, particularly in the univariate case. The maximum mean discrepancy (MMD) test (Borgwardt et al. 2006; Gretton et al. 2012a) is a nonparametric kernelised two-sample test that, when using a characteristic kernel (Sriperumbudur et al. 2010; Fukumizu et al. 2009) on data in \(\mathbb {R}^{d}\), can detect any change in a distribution between the two samples (Fukumizu et al. 2004, 2008). An expression for the MMD test statistic is given in Eq. (1) in Sect. 2, from which it is apparent that the computational complexity of computing the MMD statistic for two samples containing a total of \(n\) observations from \(d\)-dimensional data is \(\mathcal {O}(dn^2 )\). When the total number of observations \(n\) is large, the cost of computing the MMD can therefore be prohibitively expensive. There have been several proposed approaches for speeding up the computation of the MMD statistic (Gretton et al. 2012a; Zaremba et al. 2013; Zhao and Meng 2015); note, however, that these are all approximate and have worse statistical performance.

In this paper, we derive an exact method for computing the MMD statistic in \(\mathcal {O}(n\log n)\) for the case where the \(n\) observations are univariate. To do so, we use the Laplacian kernel, which is a popular characteristic kernel (Fukumizu et al. 2009; Sriperumbudur et al. 2010). This novel algorithm is named euMMD, for efficient univariate Maximum Mean Discrepancy.

There are two cases where such a computational speed up would have utility. The first is when an analysis involves a single test and the number of observations \(n\) is large; then the improvement in speed is clear as shown in Fig. 3 and Table 1. A second, less obvious, case is when an analysis requires many tests even though the number of observations is of moderate size; for example, in cyber-security applications the same two-sample test would be applied a large number of times across a population of entities (Neil et al. 2013), and in the medical literature there are situations where sample-size calculations are done via simulations which compute the test statistic for many different samples (Landau and Stahl 2013). In such cases, moderate gains in efficiency for each test would lead to a large overall improvement in runtime.

Table 1 Computational complexity of MMD and euMMD, with \(n=n_1 + n_2\) samples and \(L\) permutations.
Fig. 1

A comparison of Type I error and statistical power for (i) the \(\textrm{MMD}\) with the Laplacian kernel, (ii) the \(\textrm{MMD}\) with the Gaussian kernel and (iii) the energy distance, for samples \(\textbf{x}= \{x_1, x_2, \dots , x_{n_1} \}\) and \(\textbf{y}= \{y_1, y_2, \dots , y_{n_2} \}\). (Left) the Type I error is computed when the underlying distributions of \(\textbf{x}\) and \(\textbf{y}\) are both i.i.d. \(\Gamma (1, 1)\). (Right) the statistical power is computed when \(\textbf{x}\) is i.i.d. \(\Gamma (1, 1)\) and \(\textbf{y}\) is i.i.d. \(\text {N}(1, 1)\). We note that in this experiment the performance of the \(\textrm{MMD}\) approach is almost identical, whether the Laplacian or Gaussian kernel is used. Moreover, the \(\textrm{MMD}\) test has more power than the energy distance test, at least for this data. In both cases, a significance threshold of \(\alpha =0.05\) is used with \(L=1000\) permutations to compute a p-value, and the performance shown is the average performance over 100 trials

One concern may be that reliance on the Laplacian kernel is overly restrictive, and another kernel may be preferred; in Fig. 1 it is shown that for univariate, real-valued data the performance of the MMD two-sample test with the Laplacian kernel is broadly the same as when the popular Gaussian kernel is used.

A greater influence on performance is, in fact, the choice of value for the kernel parameter, as shown in Sect. S1 of the Supp. Material. A common method for setting the kernel parameter value is the so-called median heuristic; Sect. 3.1 discusses how for univariate data the kernel parameter can be set using this heuristic in \(\mathcal {O}(n\log n)\) using existing algorithms from the literature.

Along with the MMD, another popular two-sample test with an extensive literature is the energy distance (Baringhaus and Franz 2004; Székely and Rizzo 2004, 2013). Its formulation is very similar to that of the MMD, and recent work has shown that the two tests are closely related (Sejdinovic et al. 2013; Shen and Vogelstein 2018). However, given the choice, one may prefer to use the MMD since there are cases where the MMD is more powerful than the energy distance (Sejdinovic et al. 2013); see Fig. 4 and Sect. 2 for a discussion.

There is in fact a class of “energy statistics” (Székely and Rizzo 2013), one of which is the distance correlation (Székely et al. 2007). Inspired by an efficient method for computing Kendall’s \(\tau \) coefficient (Knight 1966), recent work (Huo and Székely 2016) has shown that the distance correlation can be computed efficiently for univariate real-valued data after first sorting the data. That work prompted the current paper, in order to see whether a similar strategy could be used to obtain an efficient MMD two-sample test.

While the proposed method is exact for univariate data, it does not easily generalise to the \(d\)-dimensional multivariate case. However, one could use random projections (Cuesta-Albertos et al. 2006; Rahimi and Recht 2007; Wei et al. 2016) to project the data onto one dimension and obtain an average value for the MMD over multiple projections; this was done in Huang and Huo (2017) for the Energy Distance. Another approach is to apply a univariate test to the distances between the observations and fixed “centre points” in \(d\)-dimensional space (Heller and Heller 2016). These two different approaches to using a univariate test on multivariate data are explored in Sect. 4 using euMMD as the base univariate two-sample test, although several experiments are relegated to the appendix as this comparison is not the primary motivation for this work.

The rest of the paper is organized as follows: Sect. 2 provides definitions of the MMD statistic and the Laplacian kernel. Section 3 describes our proposed algorithm, euMMD, as well as related work and experiments showing the improved speed-up on synthetic data. Section 4 describes the approximate extension to the \(d\)-dimensional case, and provides experiments showing the good performance of the approximate method. Several proofs and additional experiments are contained in the Supplementary Material. Note that any reference to computational complexity refers to worst-case computational complexity (see, e.g. Cormen et al. (2009)).

2 Background

Given a space \(\mathcal {X}\) and two samples \(\textbf{x}, \textbf{y}\subset \mathcal {X}\), with \(\textbf{x}= \{x_1, x_2, \dots , x_{n_1} \}\) and \(\textbf{y}= \{y_1, y_2, \dots , y_{n_2} \}\), and a kernel \(k: \mathcal {X}\times \mathcal {X}\rightarrow \mathbb {R}\), the minimum variance unbiased estimate of the squared maximum mean discrepancy (MMD) statistic was defined in Borgwardt et al. (2006); Gretton et al. (2012a) as

$$\begin{aligned} \begin{aligned} \textrm{MMD}^2(\textbf{x}, \textbf{y})&= \frac{1}{n_1(n_1- 1)} \sum _{i=1}^{n_1} \sum _{ \begin{array}{c} j=1 \\ j \ne i \end{array} }^{n_1} k( x_{i} , x_{j} ) \\&+ \frac{1}{n_2(n_2- 1)} \sum _{p=1}^{n_2} \sum _{\begin{array}{c} q=1 \\ q \ne p \end{array} }^{n_2} k(y_{p} , y_{q} ) \\&- \frac{2}{n_1n_2} \sum _{i=1}^{n_1} \sum _{p=1}^{n_2} k( x_{i} , y_{p} ). \end{aligned} \end{aligned}$$
(1)

Note that it is more convenient to deal with \(\textrm{MMD}^2(\textbf{x}, \textbf{y})\), the squared MMD statistic, rather than the actual (unsquared) MMD statistic, and in this paper we shall often refer to the MMD statistic when we really mean \(\textrm{MMD}^2(\textbf{x}, \textbf{y})\).

The statistic in Eq. (1) is unbiased and is denoted by \(\textrm{MMD}_{u}^2\) in Gretton et al. (2012a, Eq. (3)), to distinguish it from a very similar biased version \(\textrm{MMD}_{b}^2\). The choice of version makes no difference to the computational complexity, and we shall focus on the unbiased version, simply denoting it by \(\textrm{MMD}^2(\textbf{x}, \textbf{y})\), or \(\textrm{MMD}^2\) when the underlying sets are understood. If one assumes that evaluating the kernel function is \(\mathcal {O}(d)\), then it is clear that the computational complexity of computing \(\textrm{MMD}^2\) directly from Eq. (1) is \(\mathcal {O}(d(n_1^2 + n_2^2 + n_1n_2) )\), which is the same as \(\mathcal {O}(dn^2 )\), after defining \(n= n_1+ n_2\).

For the MMD statistic to be most effective, it is important to choose a kernel that is characteristic on \(\mathbb {R}^{d}\) (Sriperumbudur et al. 2010; Fukumizu et al. 2009). This is a property that means the kernel can be used to detect any change in a distribution (Fukumizu et al. 2004, 2008).

Two popular characteristic kernels are the Gaussian kernel and Laplacian kernel (Fukumizu et al. 2009; Sriperumbudur et al. 2010). For real \(d\)-dimensional vectors \(z, z'\in \mathbb {R}^{d}\), with \(z= \{z_1, z_2, \dots , z_d\}\) and \(z'= \{z'_1, z'_2, \dots , z'_d\}\), for parameters \(\beta , \gamma > 0\) the Laplacian kernel \(k_{\textrm{L}, \beta }\) and the Gaussian kernel \(k_{\textrm{G}, \gamma }\) are defined as

$$\begin{aligned} k_{\textrm{L}, \beta }(z, z')&= \exp \left( -\beta \Vert z- z' \Vert _1 \right) \nonumber \\&= \exp \left( -\beta \sum _{\alpha =1}^{d} \left| z_\alpha - z'_\alpha \right| \right) , \end{aligned}$$
(2)
$$\begin{aligned} k_{\textrm{G}, \gamma }(z, z')&= \exp \left( -\gamma \Vert z- z' \Vert _2^2 \right) \nonumber \\&= \exp \left( -\gamma \sum _{\alpha =1}^d (z_\alpha - z'_\alpha )^2 \right) . \end{aligned}$$
(3)

In the univariate case with observations \(z, z'\in \mathbb {R}\), the Laplacian kernel in Eq. (2) reduces to

$$\begin{aligned} k_{\textrm{L}, \beta }(z, z')=\exp (-\beta \left|z- z' \right|). \end{aligned}$$
(4)

In Sect. 3 this kernel is used to define an exact \(\mathcal {O}(n\log n)\) method for computing \(\textrm{MMD}^2\) in Eq. (1) for univariate real-valued data. We call this algorithm euMMD. Note that in the remainder of this section and Sect. 3 the focus will be on univariate sets, while in Sect. 4 we look at approaches for extending this work to multivariate sets. In each case, we make explicit whether a value satisfies \(z\in \mathbb {R}\) or \(z\in \mathbb {R}^{d}\) for \(d> 1\).
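For concreteness, a direct \(\mathcal {O}(n^2)\) computation of Eq. (1) with the univariate Laplacian kernel of Eq. (4) can be sketched in Python as follows; this is only a naive reference implementation (the function names are ours, not those of the eummd packages) and serves as the baseline that euMMD improves upon.

```python
import math

def laplacian_kernel(z1, z2, beta):
    # Univariate Laplacian kernel, Eq. (4)
    return math.exp(-beta * abs(z1 - z2))

def mmd2_naive(x, y, beta):
    """Unbiased MMD^2 of Eq. (1), computed directly in O(n^2) time."""
    n1, n2 = len(x), len(y)
    term_xx = sum(laplacian_kernel(x[i], x[j], beta)
                  for i in range(n1) for j in range(n1) if i != j)
    term_yy = sum(laplacian_kernel(y[p], y[q], beta)
                  for p in range(n2) for q in range(n2) if p != q)
    term_xy = sum(laplacian_kernel(xi, yp, beta) for xi in x for yp in y)
    return (term_xx / (n1 * (n1 - 1))
            + term_yy / (n2 * (n2 - 1))
            - 2.0 * term_xy / (n1 * n2))
```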

2.1 The null distribution of \(\textrm{MMD}^2\)

Under the null hypothesis that the data \(\textbf{x}\) and \(\textbf{y}\) are observations of i.i.d. random variables sampled from the same distribution, the statistic \(\textrm{MMD}^2(\textbf{x}, \textbf{y})\) can be shown in the limit to follow the distribution of an infinite sum of weighted (non-central) chi-squared random variables, which can be approximated by moment-matching with Pearson curves (Gretton et al. 2009, 2012a). However, computing the coefficients needed to use these approximations is \(\mathcal {O}(n^3)\), where, as above, \(n= n_1+ n_2\), and \(\textbf{x}\) and \(\textbf{y}\) have \(n_1\) and \(n_2\) elements, respectively.

Computationally, and perhaps theoretically, it is preferable to use permutations to compute the empirical distribution of the statistic given the data, since this approach does not rely on approximations. One then obtains a p-value by ranking the statistic within the empirical distribution; see Sect. S4 in the Supp. Material for details. In Sect. 3 it is shown that computing the \(\textrm{MMD}^2(\textbf{x}, \textbf{y})\) statistic with the euMMD algorithm is \(\mathcal {O}(n\log n)\), and so if \(L\) permutations are used, computing a p-value using permutations is \(\mathcal {O}(Ln\log n)\).

2.2 Choice of kernel and other tests

The Gaussian kernel in Eq. (3) is often a popular choice when computing the \(\textrm{MMD}\), and this preference may be due to the two-norm being invariant to rotations of multivariate data. However, we do not require such a property for univariate data, and then a preferred choice of kernel may be less clear.

In Fig. 1, the right panel shows that the performance of the \(\textrm{MMD}\) test with the Laplacian and Gaussian kernels is almost identical for detecting a change in distribution between a \(\text {N}(1, 1)\) and \(\Gamma (1, 1)\) distribution; note this is a case where the two distributions both have mean and variance equal to 1, but the shape of the distributions is different. The energy distance (Baringhaus and Franz 2004; Székely and Rizzo 2004, 2013) is also included in this comparison; the energy distance statistic has the same form as the \(\textrm{MMD}^2\) in Eq. (1) with \(k(z, z') = - \Vert z- z' \Vert _2\). The left panel of Fig. 1 shows the Type I error of the three tests when the underlying distributions for both \(\textbf{x}\) and \(\textbf{y}\) are a \(\Gamma (1, 1)\) distribution. In both panels a significance threshold of \(\alpha =0.05\) is used with \(L=100\) permutations to compute the p-value, and each point on the plot represents the average Type I error/power over 1000 trials. The kernel parameter value is set using the median heuristic as described in Sect. 3.1.

This experiment provides two insights: (a) The performance of the \(\textrm{MMD}\) for univariate data is very similar whether the Laplacian or Gaussian kernel is used, and (b) The \(\textrm{MMD}\) has more power than the energy distance, at least in this case where the data \(\textbf{x}\) and \(\textbf{y}\) are sampled from the \(\text {N}(1, 1)\) and \(\Gamma (1, 1)\) distributions; Fig. S9 in the Supp. Material shows a similar result for normally-distributed data, except that the energy distance has slightly higher power than the \(\textrm{MMD}\) methods.

3 Proposed algorithm: euMMD

This section describes the proposed euMMD algorithm for efficiently computing \(\textrm{MMD}^2\) for univariate data. We start with a basic result, which holds for any \(m\ge 1\).

Fig. 2

The kernel matrix for the set \(\{x_1, x_2, \dots , x_{n_1}, y_1, y_2, \dots , y_{n_2}\}\) and kernel \(k\). Notice how \(T_1\), \(T_2\) and \(T_3\) are obtained by adding the visible terms in the upper-left, lower-right and lower-left submatrices, respectively (but excluding the main diagonal terms for \(T_1\) and \(T_2\))

Lemma 1

Given a set \(\{z_1, z_2, \dots , z_{m}\} \subset \mathbb {R}\) and a symmetric function \(f: \mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\), then

$$\begin{aligned} \sum _{i=1}^{m} \sum _{j=1}^{i} f(z_i, z_j) = \sum _{i=1}^{m} \sum _{j=i}^{m} f(z_i, z_j). \end{aligned}$$
(5)

The proof follows immediately from the summation laws, and is included in Sect. S5.1 of the Supp. Material. However, Lemma 1 is a key step in later results. In particular, it leads directly to

Lemma 2

Given a set \(\{z_1, z_2, \dots , z_{m}\} \subset \mathbb {R}\) and a symmetric function \(f: \mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\), then

$$\begin{aligned} \sum _{i=1}^{m} \sum _{\begin{array}{c} j=1 \\ j \ne i \end{array}}^{m} f(z_i, z_j)&= 2\sum _{i=1}^{m} \sum _{j=1}^{i-1} f(z_i, z_j). \end{aligned}$$
(6)

Proof

See Sect. S5.2 of the Supp. Material. \(\square \)

Now, since every kernel function is symmetric (Shawe-Taylor and Cristianini 2004), Lemma 2 allows us to rewrite \(\textrm{MMD}^2\) as follows:

$$\begin{aligned}&\textrm{MMD}^2 (\textbf{x}, \textbf{y}) = 2 \left( \alpha _1 T_1+ \alpha _2 T_2+ \alpha _3 T_3\right) , \end{aligned}$$
(7)

where

$$\begin{aligned} T_1&= \sum _{i=1}^{n_1} \sum _{j=1}^{i-1} k(x_{i},x_{j}), \qquad \alpha _1 = \frac{1}{n_1(n_1- 1)}, \\ T_2&= \sum _{p=1}^{n_2} \sum _{q=1}^{p-1} k(y_{p},y_{q}), \qquad \alpha _2 = \frac{1}{n_2(n_2- 1)}, \\ T_3&=\sum _{i=1}^{n_1}\sum _{p=1}^{n_2}k(x_{i},y_{p}), \qquad \alpha _3 = -\frac{1}{n_1n_2}. \end{aligned}$$

Remark 3

Essentially, we have simply shown that only the lower-triangular elements of the kernel matrix are needed to compute the \(\textrm{MMD}\) test statistic, but this observation will be key in deriving the new efficient method.

Using Eq. (7) for \(\textrm{MMD}^2(\textbf{x}, \textbf{y})\), let us look at the lower triangular terms of the kernel matrix for the set \(\{x_1,x_2,\dots ,x_{n_1},y_1,y_2,\dots ,y_{n_2}\}\), as shown in Fig. 2.

Notice that adding together the terms below the main diagonal in the top-left submatrix gives \(T_1\). Similarly, adding the terms below the main diagonal in the bottom-right submatrix gives \(T_2\). \(T_3\) is obtained by adding all the terms in the bottom-left submatrix; however, if we define \(T_{4}\) to be the sum of all the terms in the entire kernel matrix below the main diagonal, then

$$\begin{aligned} T_{4}= T_1+ T_2+ T_3\qquad \Rightarrow T_3= T_{4}- T_1- T_2. \end{aligned}$$
(8)

Therefore, if we could find an efficient method for computing the sum of lower-triangular terms in a kernel matrix, we would have an efficient method for computing \(\textrm{MMD}^2\). One next realises that, when computing these lower-triangular sums, the order in which the terms are added does not matter. This observation is described in

Lemma 4

Given a set \(\{ z_{1}, z_{2}, \dots , z_{m} \} \subset \mathbb {R}\), a symmetric function \(f: \mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\) and a permutation \(\sigma : \{1,2,\dots ,m\} \rightarrow \{1,2,\dots ,m\}\), then:

$$\begin{aligned} \sum _{i=1}^{m} \sum _{j=1}^{i-1} f(z_{i}, z_{j} )&= \sum _{i=1}^{m} \sum _{j=1}^{i-1} f( z_{\sigma (i)} , z_{\sigma (j)} ). \end{aligned}$$
(9)

Proof

See Sect. S5.3 of the Supp. Material. \(\square \)

Since a permutation \(\sigma : \{1,2,\dots ,m\} \rightarrow \{1,2,\dots ,m\}\) is simply a bijection on \(\{1,2,\dots ,m\}\), Lemma 4 holds in particular for the permutation that orders a specific set of elements \(\left\{ z_{1}, z_{2}, \dots , z_{m} \right\} \) in increasing order, i.e. the permutation \(\sigma \) with \(\sigma (i) = (i)\), where \(z_{(1)} \le z_{(2)} \le \dots \le z_{(m-1)}\le z_{(m)}\). This leads to the key result using the Laplacian kernel, described in

Proposition 5

Given a set of observations \(\{ z_{1}, z_{2}, \dots , z_{m} \} \subset \mathbb {R}\), let \(z_{(i)}\) denote the ith smallest element of the set, so that \(z_{(1)} \le z_{(2)} \le \dots \le z_{(m)}\). Then, using the Laplacian kernel with parameter \(\beta > 0\), the quantity

$$\begin{aligned} S\underset{\textrm{def}}{=}\sum _{i=1}^{m} \sum _{j=1}^{i-1} k_{\textrm{L}, \beta }(z_{i}, z_{j})&= \sum _{i=1}^{m} \sum _{j=1}^{i-1} \exp { \left( -\beta \left[ z_{(i)} - z_{(j)} \right] \right) }, \end{aligned}$$

can be computed recursively by defining \(S= S_{m}\) and then using the equations \(S_{1} = R_{1}=0\) and, for \(k\in \{2, \dots , m\}\),

$$\begin{aligned} \begin{aligned} D_{k}&= \exp \left( -\beta \left[ z_{(k)}-z_{(k-1)} \right] \right) , \\ R_{k}&= \left( R_{k-1} + 1 \right) \cdot D_{k}, \\ S_{k}&= S_{k-1} + R_{k} . \end{aligned} \end{aligned}$$
(10)

Proof

Using Lemma 4,

$$\begin{aligned} S&= \sum _{i=1}^{m}\sum _{j=1}^{i-1} k_{\textrm{L}, \beta }(z_{(i)},z_{(j)}) \\&=\sum _{i=1}^{m} \sum _{j=1}^{i-1} \exp (-\beta \big ( z_{(i)}-z_{(j)} \big ) ), \end{aligned}$$

since \(z_{(i)} - z_{(j)} \ge 0\) for \(j < i\). For \(k\in \{1, 2, \dots , m\}\), if we define \(D_{k}\) as in Eq. (10) and

$$\begin{aligned} S_{k}&=\sum _{i=1}^{k} \sum _{j=1}^{i-1} \exp (-\beta \big ( z_{(i)}-z_{(j)} \big ) ), \\ R_{k}&= \sum _{j=1}^{k-1} \exp (-\beta \big ( z_{(k)}-z_{(j)} \big ) ), \end{aligned}$$

one immediately obtains \(S_{1} = R_{1}=0\) and derives the recursive equations in Eq. (10). For the full details, see Sect. S5.4 of the Supp. Material. \(\square \)

Proposition 5 leads directly to

Corollary 6

Given a set of observations \(\{ z_{1}, z_{2}, \dots , z_{m} \} \subset \mathbb {R}\), and a parameter \(\beta > 0\), then

$$\begin{aligned} S= \sum _{i=1}^{m} \sum _{j=1}^{i-1} k_{\textrm{L}, \beta }(z_{i},z_{j}) \end{aligned}$$

can be computed in \(\mathcal {O}(m\log m)\) time.

Proof

The observations are first sorted to \(z_{(1)}, \dots , z_{(m)}\), using a worst-case \(\mathcal {O}(m\log m)\) sorting algorithm such as merge sort. The sequential update equations for \(S\), given in Eq. (10), are all \(\mathcal {O}(1)\), so together the \(m-1\) recursive updates are \(\mathcal {O}(m)\), and so overall the algorithm is \(\mathcal {O}(m\log m)\). \(\square \)

Remark 7

In plain terms, Corollary 6 shows that one can compute the sum of lower-triangular terms in a kernel matrix, for univariate data and while using the Laplacian kernel, in \(\mathcal {O}(m\log m)\).
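As an illustration of Proposition 5 and Corollary 6, a minimal Python sketch of the recursion in Eq. (10) could be written as follows, assuming the input has already been sorted in increasing order (the function name trissl is ours; the actual pseudocode is given in Algorithm 1 below).

```python
import math

def trissl(z_sorted, beta):
    """Sum of lower-triangular Laplacian kernel terms for sorted data, Eq. (10).

    Computes S = sum_i sum_{j<i} exp(-beta * (z_(i) - z_(j))) in O(m) time,
    assuming z_sorted is in increasing order.
    """
    S = 0.0   # S_1 = 0
    R = 0.0   # R_1 = 0
    for k in range(1, len(z_sorted)):         # corresponds to k = 2, ..., m
        D = math.exp(-beta * (z_sorted[k] - z_sorted[k - 1]))  # D_k
        R = (R + 1.0) * D                     # R_k = (R_{k-1} + 1) * D_k
        S = S + R                             # S_k = S_{k-1} + R_k
    return S
```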

Algorithm 1

TriSSL: Triangular Sorted Sum of Laplacians

Algorithm 2

euMMD: Efficient computation of univariate \(\textrm{MMD}^2\). Note that pseudocode for MergeTwoAlreadySorted can be found in Sect. S7 of the Supp. Material

Pseudocode implementing the recursive update equations given in Eq. (10) is given in the TriSSL subroutine in Algorithm 1. We can now state the main result:

Proposition 8

Given two samples \(\textbf{x}, \textbf{y}\subset \mathbb {R}\) containing a total of \(n\) observations, the euMMD algorithm computes the exact \(\textrm{MMD}^2(\textbf{x}, \textbf{y})\) statistic defined in Eq. (1) using the Laplacian kernel in \(\mathcal {O}(n\log n)\) time and \(\mathcal {O}(n)\) space.

Proof

Let \(\textbf{x}= \{ x_{1}, x_{2}, \dots , x_{n_1} \}\) and \(\textbf{y}= \{ y_{1}, y_{2}, \dots , y_{n_2} \}\) denote the two samples and define \(n= n_1+ n_2\). First, the values in \(\textbf{x}\) are sorted to obtain the sorted set \(\{ x_{(1)}, x_{(2)}, \dots , x_{(n_1)} \}\) and then the recursive formulae in Eq. (10) of Prop. 5 are applied to these sorted values to compute \(T_1\). The same steps are followed for the values in \(\textbf{y}\) to compute \(T_2\). This is summarised by lines 1-4 in the pseudocode in Algorithm 2, which describes the procedure in detail; the TriSSL subroutine, as described in Algorithm 1, implements the update equations in Eq. (10). (Note that the notation \(\textbf{x}\) is used in the pseudocode to denote an array containing the values \(x_{1}, \dots , x_{n_1}\), while here \(\textbf{x}\) denotes the set.) The sorted values are then merged into a sorted set \(\textbf{z}\) by the MergeTwoAlreadySorted algorithm in \(\mathcal {O}(n)\) time (see Sect. S7 of the Supp. Material), and then TriSSL computes \(T_{4}\) from \(\textbf{z}\). Equation (8) then gives \(T_3= T_{4}- T_1- T_2\). With \(T_1\), \(T_2\) and \(T_3\) computed, Eq. (7) gives \(\textrm{MMD}^2(\textbf{x}, \textbf{y})\). To analyse the complexity of euMMD, we look at the pseudocode in Algorithm 2. The merge sorts in lines 1-2 are \(\mathcal {O}(n\log n)\) (Cormen et al. 2009).

Lines 3-6 are all \(\mathcal {O}(n)\), and lines 7-8 are all \(\mathcal {O}(1)\). Therefore, euMMD is \(\mathcal {O}(n\log n)\) in time overall. Since merge sort and TriSSL both have space complexity \(\mathcal {O}(n)\), euMMD also has space complexity \(\mathcal {O}(n)\). \(\square \)
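For illustration, the following Python sketch mirrors the structure of the euMMD computation described in the proof above; it reuses the trissl sketch from before and uses Python's built-in sort in place of merge sort and MergeTwoAlreadySorted, which changes the constants but not the \(\mathcal {O}(n\log n)\) complexity. It is a sketch under our own naming, not the reference implementation of Algorithm 2.

```python
import math

def trissl(z_sorted, beta):
    # Lower-triangular Laplacian kernel sum for already-sorted data (Eq. (10)).
    S, R = 0.0, 0.0
    for k in range(1, len(z_sorted)):
        D = math.exp(-beta * (z_sorted[k] - z_sorted[k - 1]))
        R = (R + 1.0) * D
        S += R
    return S

def eummd_statistic(x, y, beta):
    """Exact MMD^2 of Eq. (1) with the Laplacian kernel in O(n log n) time."""
    n1, n2 = len(x), len(y)
    xs, ys = sorted(x), sorted(y)     # O(n log n) sorts
    T1 = trissl(xs, beta)             # within-sample sum for x
    T2 = trissl(ys, beta)             # within-sample sum for y
    zs = sorted(xs + ys)              # the paper merges the two sorted lists in O(n)
    T4 = trissl(zs, beta)             # lower-triangular sum for the pooled sample
    T3 = T4 - T1 - T2                 # Eq. (8)
    a1 = 1.0 / (n1 * (n1 - 1))
    a2 = 1.0 / (n2 * (n2 - 1))
    a3 = -1.0 / (n1 * n2)
    return 2.0 * (a1 * T1 + a2 * T2 + a3 * T3)   # Eq. (7)
```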

Remark 9

Note that it is not necessary to use the MergeSort algorithm and the MergeTwoAlreadySorted subroutine; any other \(\mathcal {O}(n\log n)\) sorting procedure could be used for \(\textbf{x}\) and \(\textbf{y}\), which could be concatenated to form \(\textbf{z}\), and then \(\textbf{z}\) could be sorted with the same algorithm. The computational complexity would still be \(\mathcal {O}(n\log n)\). However, the \(\mathcal {O}(n)\) MergeTwoAlreadySorted step takes advantage of \(\textbf{x}\) and \(\textbf{y}\) being already sorted.

Remark 10

When analysing the euMMD algorithm one notices that, after the data is sorted, the remaining part of the algorithm is \(\mathcal {O}(n)\). In other words, the only \(\mathcal {O}(n\log n)\) step is the initial sorting of the data, where a sorting algorithm with worst-case time complexity \(\mathcal {O}(n\log n)\), such as merge sort (Knuth 1997b; Cormen et al. 2009), is used; it is well-known that \(\Omega (n\log n)\) is the lower bound on the time complexity of comparison-based sorting of general real-valued data (Cormen et al. 2009). However, one then realises that in special cases where the two samples are already sorted, for example when the data are times recorded in increasing order, the euMMD algorithm is linear in n. Another special case is when the data belong to some restricted set, e.g. the data are integers in a known finite range, such as for a multinomial distribution with a known number of categories (for which the \(\textrm{MMD}\) two-sample test would be valid); then a linear-time sorting method such as counting sort (Cormen et al. 2009) could be used, giving the algorithm linear-time computational complexity overall. We emphasise, however, that these are special cases and do not focus on them here.

Remark 11

Note that there is an efficient method for computing the Energy Distance (Huang and Huo 2017), related to the approach in Huo and Székely (2016), which is very similar to the algorithm for the proposed efficient MMD presented here. However, there are at least a few key differences: our approach to the computation of the mixed \(T_3\) term is much simpler, our sequential update approach utilizing the TriSSL algorithm is much clearer, and this sequential approach allows the use of the Laplacian kernel, which would be nontrivial to employ in the algorithm in Huang and Huo (2017); furthermore, a careful proof showing the correctness of euMMD is provided.

3.1 Setting the kernel parameter

Most kernels require a user-defined parameter to be selected in advance of the computation. For the Laplacian kernel, this is a parameter \(\beta > 0\). While it remains an open question how to select the optimal value of \(\beta \) (or indeed the value of other kernels’ parameters), a commonly-used approach is to use the median heuristic, where for a set of values \(\{z_1, z_2, \dots , z_{n} \}\), one computes

$$\begin{aligned} h = \text {median} \left\{ \left|z_i - z_j \right| \, : \, i, j \in \{1, 2, \dots , n\}, \, i < j \right\} \end{aligned}$$

and then one sets \(\beta = \frac{1}{h}\). A naive computation of h would require the computation of all \(\tfrac{n}{2}(n-1)\) differences, resulting in a computational complexity of \(\mathcal {O}(n^2)\). However, a similar quantity has been previously considered as a measure of dispersion (Shamos 1976, Theorem 3.6), (Bickel and Lehmann 1979, Example 9) (Rousseeuw and Croux 1993), and there exists an efficient method for computing h in \(\mathcal {O}(n\log n)\) (Croux and Rousseeuw 1992, Sect. 3), based on Johnson and Mizoguchi (1978). Note that although this method is worst-case \(\mathcal {O}(n\log n)\), it is based on the repeated use of the selection algorithm for finding the kth smallest value among n values in worst-case \(\mathcal {O}(n)\) (Cormen et al. 2009, Sect. II.9.3). Needing to compute the median heuristic value first makes the algorithm slightly more computationally expensive overall compared to only computing the \(\textrm{MMD}^2\) statistic, despite both algorithms being \(\mathcal {O}(n\log n)\), because of larger leading coefficients hidden by the big-O notation. However, when used in combination with the computation of the p-value via permutations, this difference ends up being very slight; see Fig. S12 in the Supp. Material.
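For illustration only, a naive \(\mathcal {O}(n^2)\) computation of the median heuristic is sketched below; the efficient \(\mathcal {O}(n\log n)\) algorithm of Croux and Rousseeuw (1992) is considerably more involved and is not reproduced here (the function name is ours).

```python
import statistics

def median_heuristic_naive(z):
    """Naive O(n^2) median of pairwise absolute differences; returns beta = 1/h.

    Assumes h > 0, i.e. not all values in z are identical.
    """
    n = len(z)
    diffs = [abs(z[i] - z[j]) for i in range(n) for j in range(i + 1, n)]
    h = statistics.median(diffs)
    return 1.0 / h
```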

While it has been shown that the median heuristic may not necessarily provide the “best” choice for the parameter (Ramdas et al. 2015), i.e. the parameter value that leads to the greatest power for the \(\textrm{MMD}\) test, it generally works well; see Sect. S1 in the Supp. Material for additional experiments. There are other schemes for selecting \(\beta \) (Sriperumbudur et al. 2009; Gretton et al. 2012b), but these amount to testing various values and selecting the one which “works best” for the given data. We do not pursue the question of how to set the kernel parameter further here.

Fig. 3

A runtime comparison of Python implementations of the standard \(\mathcal {O}(n^2)\) MMD and the proposed \(\mathcal {O}(n\log n)\) euMMD on univariate data, as the size of the samples, \(n_1=n_2\), increases and \(n=n_1+n_2\). (Left) linear scale. (Right) log-log scale. Table 2 shows the relative speedup for \(n_1=n_2=2000\)

3.2 Computing p-values

The previous sections have shown that the MMD statistic for univariate data can be computed in \(\mathcal {O}(n\log n)\), along with the median heuristic for setting the kernel parameter in \(\mathcal {O}(n\log n)\). In Sect. 2.1 it is discussed how it may be preferable to use \(L\) permutations to compute a p-value for the MMD test statistic, which would result in an algorithm that is \(\mathcal {O}(Ln\log n)\). This approach involves first computing the median heuristic to determine the kernel parameter \(\beta \) and then computing the test statistic \(\textrm{MMD}^2(\textbf{x}, \textbf{y})\). Then, one concatenates \(\textbf{x}\) and \(\textbf{y}\) to form the vector \(\textbf{z}\), randomly permutes the values in \(\textbf{z}\) to create vector \(\widetilde{\textbf{z}}\), before subsetting this vector to \(\widetilde{\textbf{x}}\) (first \(n_1\) values) and \(\widetilde{\textbf{y}}\) (last \(n_2\) values). One calls euMMD to compute \(\textrm{MMD}^2(\widetilde{\textbf{x}}, \widetilde{\textbf{y}})\). Repeating this process for a total of \(L\) permutations, one counts the number of \(\textrm{MMD}^2(\widetilde{\textbf{x}}, \widetilde{\textbf{y}})\) values smaller than \(\textrm{MMD}^2(\textbf{x}, \textbf{y})\) as \(c\), and then computes \(p= \max \{ 1 -\left|1 - 2c/(L+1) \right|, p_{\min } \}\), where \(p_{\min }\) is set as \(1/(2(L+1))\) to avoid p-values of 0 (this would give issues when using Fisher’s method to combine p-values in the multivariate approximate method). It is apparent that this method is \(\mathcal {O}(Ln\log n)\). Although the median heuristic only needs to be computed once, as the kernel parameter is the same for all permutations, this method requires sorting each pair of permuted samples \(\widetilde{\textbf{x}}\) and \(\widetilde{\textbf{y}}\).
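A sketch of this basic permutation scheme, using the p-value formula exactly as stated above, could look as follows; the statistic argument is a placeholder for any two-sample statistic, e.g. the euMMD computation with the kernel parameter fixed once via the median heuristic.

```python
import random

def permutation_pvalue(x, y, statistic, L=1000):
    """Permutation p-value for a two-sample statistic such as MMD^2.

    `statistic` is any function mapping two samples to a real number;
    p = max(1 - |1 - 2c/(L+1)|, p_min) with p_min = 1/(2(L+1)), as in the text.
    """
    obs = statistic(x, y)
    z = list(x) + list(y)
    n1 = len(x)
    c = 0
    for _ in range(L):
        random.shuffle(z)                       # random permutation of the pooled data
        if statistic(z[:n1], z[n1:]) < obs:     # count permuted values below the observed
            c += 1
    p_min = 1.0 / (2.0 * (L + 1))
    return max(1.0 - abs(1.0 - 2.0 * c / (L + 1)), p_min)
```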

Table 2 Runtime comparison of MMD and euMMD on univariate data with \(n_1=n_2=2000\) in Python

However, we can improve on this approach. Motivated by the observation that the euMMD algorithm is \(\mathcal {O}(n)\) if the data is already sorted, we devise a simple approach to form permuted samples such that \(\widetilde{\textbf{x}}\) and \(\widetilde{\textbf{y}}\) are already sorted. First, the kernel parameter (via the median heuristic) and \(\textrm{MMD}^2(\textbf{x}, \textbf{y})\) are computed with euMMD in an \(\mathcal {O}(n\log n)\) step. A result of this computation is the vector \(\textbf{z}\), which is a sorted vector containing the \(n\) values from \(\textbf{x}\) and \(\textbf{y}\). Instead of permuting the values in \(\textbf{z}\), an indicator vector \(\varvec{\iota }= (1, \dots , 1, 0, \dots , 0)\) is formed, where the first \(n_1\) values are 1 and the remaining \(n_2\) values are 0. For each permutation, the indicator vector \(\varvec{\iota }\) is permuted to \(\varvec{\widetilde{\iota }}\) rather than permuting the data in the vector \(\textbf{z}\); i.e. \(\textbf{z}\) remains sorted. Defining the function \(s\) such that \(s(j, \varvec{\widetilde{\iota }}) = k\) means the jth 1 in \(\varvec{\widetilde{\iota }}\) is at index k, the vector \(\widetilde{\textbf{x}}\) is defined as \(\widetilde{\textbf{x}}= \left( z_{s(1, \varvec{\widetilde{\iota }})}, \dots , z_{s(n_1, \varvec{\widetilde{\iota }})} \right) \). The vector \(\widetilde{\textbf{y}}\) is similarly defined using \(s'\) where \(s'(j, \varvec{\widetilde{\iota }}) = k\) means the jth 0 in \(\varvec{\widetilde{\iota }}\) is at index k. Since \(\textbf{z}\) is sorted and \(s\) and \(s'\) are increasing functions in their first argument, the vectors \(\widetilde{\textbf{x}}\) and \(\widetilde{\textbf{y}}\) are both sorted. Therefore, computing each \(\textrm{MMD}^2(\widetilde{\textbf{x}}, \widetilde{\textbf{y}})\) will be \(\mathcal {O}(n)\) and the overall complexity of computing the p-value after \(L\) permutations is \(\mathcal {O}(n\log n+ Ln)\). The pseudocode for this procedure is shown in Algorithm S2 in Sect. S6 in the Supp. Material (Table 2).
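A sketch of the indicator-vector step is given below; since the pooled vector \(\textbf{z}\) stays sorted and only the indicator is permuted, each permuted pair is produced already sorted (a sketch only; the full procedure, including the p-value computation, is Algorithm S2).

```python
import random

def sorted_permuted_samples(z_sorted, n1):
    """Return one permuted pair (x_tilde, y_tilde), both already sorted,
    by permuting an indicator vector instead of the pooled data."""
    n = len(z_sorted)
    iota = [1] * n1 + [0] * (n - n1)
    random.shuffle(iota)                           # permute the indicator, not the data
    x_tilde = [z for z, flag in zip(z_sorted, iota) if flag == 1]
    y_tilde = [z for z, flag in zip(z_sorted, iota) if flag == 0]
    return x_tilde, y_tilde                        # both sorted, since z_sorted is sorted
```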

3.3 Runtime comparison

As discussed in Sect. 2, for \(n= n_1+n_2\), computing \(\textrm{MMD}^2\) directly from Eq. (1) is \(\mathcal {O}(n^2)\) for univariate data. On the other hand, Sect. 3 shows that \(\textrm{MMD}^2\) can be computed in \(\mathcal {O}(n\log n)\). Note that since euMMD is exact for univariate real-valued data, there is no need to compare its statistical performance to that of the naive computation of \(\textrm{MMD}^2\), denoted MMD; see for example Gretton et al. (2012a) for a study of the statistical performance of \(\textrm{MMD}\).

Figure 3 shows the improved performance of euMMD over MMD as \(n\) increases; the right-hand panel displays the same data as in the left-hand panel, but on a log-log scale. This makes it easier to see the orders of magnitude difference between the two methods. Note that these are the average speeds over 10 trials, and error bars of one standard deviation are plotted, but are too small to see (i.e. little variation in time between trials).

Furthermore, Table 2 compares the data in Fig. 3 when \(n=2000\), and shows that in this case euMMD is \(\sim 1000\times \) faster than MMD. Note also that, in order to provide a fair comparison of the algorithms, we implemented the merge sort algorithm from Panny and Prodinger (1995) in Python.

Section S6 in the Supp. Material provides additional runtime experiments. One experiment shows that using the built-in Numpy sorting function would yield an extra \(\sim 4\times \) speedup over using the “hand-coded” merge sort; however, this is probably an unfair comparison, since the Numpy sort function is implemented in C++. There are also runtime experiments for implementations in C++, which show more modest improvements in speed, up to about \(124\times \).

While the results in Fig. 3 are illustrative of the improvement in computational complexity of euMMD over the standard MMD, another experiment considers a more practical implementation, (a) using C++, (b) using the median heuristic to set the kernel parameter of the Laplacian kernel, and (c) using either 100 or 1000 permutations to compute the p-value; this shows more modest improvements in speed of between \(25\times \) and \(328\times \); see Sect. S6.2 in the Supp. Material.

3.4 Statistical performance of MMD on univariate data

We conduct a number of experiments to evaluate the statistical performance of \(\textrm{MMD}\) on univariate data, and compare it to other univariate two-sample testing methods: the Kolmogorov-Smirnov test (Kolmogorov 1933; Smirnov 1948), the Cramér-von Mises test (Cramér 1928; von Mises 1928), the Wilcoxon-Mann–Whitney test (Wilcoxon 1945; Mann and Whitney 1947), the energy distance (Baringhaus and Franz 2004; Székely and Rizzo 2004, 2013) and Student’s t-test (Student 1908). Figure 4 shows the results when there is a distribution change: the data in the first sample \(\textbf{x}\) are independent observations of a \(\Gamma (1, 1)\) distribution, while the second sample \(\textbf{y}\) consists of independent observations of a \(\text {N}(1, 1)\) distribution (both samples have mean and variance equal to 1), the same setting as for Fig. 1. Figure 4 shows that the \(\textrm{MMD}\) has, in this case, higher power than the other methods.

Section S11 contains additional experiments, for another type of distribution shift, as well as a variance shift and mean shifts. The MMD performs the best for distribution shifts and joint-best for variance shifts, but does not perform as well as the other methods for mean shifts. Section S11 also contains figures showing the Type I error with the y-axis on the scale [0, 0.1].

Fig. 4

A comparison of Type I error and statistical power for (i) the \(\textrm{MMD}\) with the Laplacian kernel, (ii) the \(\textrm{MMD}\) with the Gaussian kernel, (iii) the energy distance, (iv) the Kolmogorov-Smirnov test (KS), (v) the Cramér-von Mises test (CvM), (vi) the Wilcoxon-Mann–Whitney test (WMW), and (vii) Student’s t-test, for samples \(\textbf{x}= \{x_1, x_2, \dots , x_{n_1} \}\) and \(\textbf{y}= \{y_1, y_2, \dots , y_{n_2} \}\). (Left) the Type I error is computed when the underlying distributions of \(\textbf{x}\) and \(\textbf{y}\) are both i.i.d. \(\Gamma (1, 1)\). (Right) the statistical power is computed when \(\textbf{x}\) is i.i.d. \(\Gamma (1, 1)\) and \(\textbf{y}\) is i.i.d. \(\text {N}(1, 1)\). We note that in this experiment the performance of the \(\textrm{MMD}\) approach is almost identical, whether the Laplacian or Gaussian kernel is used. In both cases, a significance threshold of \(\alpha =0.05\) is used with \(L=1000\) permutations to compute a p-value, and the performance shown is the average performance over 100 trials

4 Extension to multivariate data: MEA-MMD

While euMMD is an exact method for computing the \(\textrm{MMD}^2\) statistic for univariate data, this approach does not easily extend to an exact method in the multivariate setting. However, in this section we extend our approach to approximate methods for computing the \(\textrm{MMD}^2\) statistic for multivariate data, named MEA-MMD: Multivariate Efficient Approximate MMD.

We explore two approaches from the literature for adapting a univariate test to multivariate data: (a) using random projections as in Huang and Huo (2017), and (b) computing distances to a reference point as in Heller and Heller (2016). Both approaches involve projecting the observations from \(\mathbb {R}^{d}\rightarrow \mathbb {R}\) and then performing the univariate test. For the remainder of this section, let \(d\in \left\{ 2, 3, \dots \right\} \) be the dimension and suppose we have samples \(\textbf{x}= \{x_1, x_2, \dots , x_{n_1} \} \subset \mathbb {R}^{d}\) and \(\textbf{y}= \{y_1, y_2, \dots , y_{n_2} \} \subset \mathbb {R}^{d}\).

4.1 Using random projections

Let \(\mathcal {S}^{d-1}\) denote the unit sphere in \(\mathbb {R}^{d}\), i.e.

$$\begin{aligned} \mathcal {S}^{d-1} = \left\{ u \in \mathbb {R}^{d}\, \, : \, \, \left\| u \right\| _2 = 1 \right\} \end{aligned}$$

Following the approach in Huang and Huo (2017), instead of computing \(\textrm{MMD}^2(\textbf{x}, \textbf{y})\), we can compute \(\textrm{MMD}^2(u_k^{T} \textbf{x}, u_k^{T} \textbf{y})\), for any \(u_k \in \mathcal {S}^{d-1}\), where \(u_k^{T} \textbf{x}= \{u_k^{T} x_1, u_k^{T} x_2, \dots , u_k^{T} x_{n_1} \} \subset \mathbb {R}\) and \(u_k^{T}\textbf{y}= \{u_k^{T}y_1, u_k^{T}y_2, \dots , u_k^{T}y_{n_2} \} \subset \mathbb {R}\) are now univariate sets, using the euMMD algorithm. We can do this for \(K\) projections \(u_k \in \mathcal {S}^{d-1}\), and then construct our statistic by averaging across the \(K\) projections, as in

$$\begin{aligned} \overline{\textrm{MMD}^2_K}(\textbf{x}, \textbf{y}) = \frac{1}{K} \sum _{k=1}^{K} \textrm{MMD}^2(u_k^{T} \textbf{x}, u_k^{T} \textbf{y}). \end{aligned}$$
(11)

We call this algorithm MEA-MMD-Proj. The theoretical results in Huang and Huo (2017) for the energy distance can be adapted to demonstrate that \(\overline{\textrm{MMD}^2_K}(\textbf{x}, \textbf{y})\) converges to \({\textrm{MMD}^2}(\textbf{x}, \textbf{y})\) as \(K\) and the sample size increase.
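A sketch of the MEA-MMD-Proj computation in Eq. (11) is given below, drawing the directions \(u_k\) uniformly on \(\mathcal {S}^{d-1}\) by normalising standard Gaussian vectors; the univariate_mmd2 argument is a placeholder for a univariate \(\textrm{MMD}^2\) routine such as the euMMD computation, and the function name is ours.

```python
import numpy as np

def mea_mmd_proj(X, Y, univariate_mmd2, K=20, rng=None):
    """Average MMD^2 over K random one-dimensional projections, Eq. (11).

    X is (n1, d), Y is (n2, d); univariate_mmd2 maps two 1-d arrays to a float.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    total = 0.0
    for _ in range(K):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)               # uniform random direction on the unit sphere
        total += univariate_mmd2(X @ u, Y @ u)
    return total / K
```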

4.2 Computing distances to a reference point

In Heller and Heller (2016), the multivariate vectors are projected onto univariate vectors by computing distances between the different individual observations. For \(i \in \left\{ 1, 2, \dots , n_1 \right\} \), one computes the distances to \(x_i\) as follows:

$$\begin{aligned} \textbf{x}^{(i)}&= [ \Vert x_1 - x_i \Vert _{2} , \dots , \Vert x_{i-1} - x_i \Vert _{2} , \Vert x_{i+1}-x_i \Vert _{2} , \dots , \Vert x_{n_1} - x_i \Vert _{2} ], \\ \textbf{y}^{(i)}&= [ \Vert y_1 - x_i \Vert _{2} , \Vert y_2 - x_i \Vert _{2} , \dots , \Vert y_{n_2} - x_i \Vert _{2} ]. \end{aligned}$$

Note that \(\textbf{x}^{(i)}, \textbf{y}^{(i)}\subset \mathbb {R}\), with \(n_1-1\) and \(n_2\) observations, respectively; the ith component of \(\textbf{x}\) is removed rather than included as zero. Similarly, for \(n_1< i \le n\), one can compute the distances between the observations and \(y_{i-n_1}\) to obtain \(\textbf{x}^{(i)}, \textbf{y}^{(i)}\subset \mathbb {R}\) (where now \(\textbf{x}^{(i)}\) has \(n_1\) observations and \(\textbf{y}^{(i)}\) has \(n_2-1\) observations). Overall, there are \(n_1+n_2=n\) projections. In Heller and Heller (2016) this was used to apply the univariate Kolmogorov-Smirnov and Anderson-Darling tests to multivariate data; here this approach will be used with the univariate \(\textrm{MMD}\) tests.

For notational convenience, we now assume there is a function \(g:\mathbb {R}\rightarrow [0,1]\) that computes a p-value given the value of a statistic \(\textrm{MMD}^2(\textbf{x}^{(i)}, \textbf{y}^{(i)})\), although in practice this will be done via permutations of the \(\textbf{x}^{(i)}\) and \(\textbf{y}^{(i)}\). We also use \(h: [0,1]^{n} \rightarrow [0,1]\) to denote the p-value combination method due to Hommel (1983), which is briefly described in Sect. S9.4 of the Supp. Material. Then, after computing \(p_i = g( \textrm{MMD}^2(\textbf{x}^{(i)}, \textbf{y}^{(i)}))\) for \(i=1, 2, \dots , n\), the final p-value is \(p = h(p_1, p_2, \dots , p_{n})\). This method will be denoted MEA-MMD-Dist.
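A sketch of MEA-MMD-Dist is given below; the pvalue_fn and combine_pvalues arguments are placeholders for a univariate permutation test (e.g. euMMD with permutations) and a p-value combination rule such as Hommel's method (Sect. S9.4 of the Supp. Material), respectively, and the function name is ours.

```python
import numpy as np

def mea_mmd_dist(X, Y, pvalue_fn, combine_pvalues):
    """Distances-to-reference-point approach of Sect. 4.2 (sketch).

    X is (n1, d), Y is (n2, d); pvalue_fn maps two 1-d arrays to a p-value;
    combine_pvalues maps a list of p-values to a single combined p-value.
    """
    n1 = X.shape[0]
    Z = np.vstack([X, Y])
    pvalues = []
    for i in range(Z.shape[0]):
        ref = Z[i]
        dx = np.linalg.norm(X - ref, axis=1)   # distances from each x_j to the reference
        dy = np.linalg.norm(Y - ref, axis=1)   # distances from each y_j to the reference
        if i < n1:
            dx = np.delete(dx, i)              # drop the zero self-distance
        else:
            dy = np.delete(dy, i - n1)
        pvalues.append(pvalue_fn(dx, dy))
    return combine_pvalues(pvalues)
```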

Of course, MEA-MMD-Dist and MEA-MMD-Proj are only approximations to \(\textrm{MMD}^2(\textbf{x}, \textbf{y})\), so in addition to their runtime we need to evaluate their statistical performance when the statistics are used in a test, which we do in the next sections.

4.3 Statistical performance

We present simulations to investigate the Type I error and statistical power. In particular, we consider two samples \(\textbf{x}= \{x_1, x_2, \dots , x_{n_1} \}\) and \(\textbf{y}= \{y_1, y_2, \dots , y_{n_2} \}\), where \(x_{i}=(x_{i,1}, x_{i,2}, \dots , x_{i,d})\) and \(y_{i}=(y_{i,1}, y_{i,2}, \dots , y_{i,d})\); each \(x_{i,j}\) is an observation of \(X_{i,j}\), where the \(X_{i,j} \sim \text {N}(\mu _1, \sigma _1^2)\) are i.i.d., and similarly each \(y_{i,j}\) is an observation of \(Y_{i,j}\), where the \(Y_{i,j} \sim \text {N}(\mu _2, \sigma _2^2)\) are i.i.d. For computing the Type I error, we set \(\mu _1=\mu _2=0\) and \(\sigma _1=\sigma _2=1\). For computing the power, we set \(\mu _1=0\), \(\sigma _1=\sigma _2=1\) and \(\mu _2=\mu _1 + 0.5\sigma _1=0.5\). Note that the number of permutations used is \(L=100\), the threshold used is \(\alpha =0.05\), and the values are obtained over 100 independent trials. The results of this experiment are displayed in Fig. 5. Note that we have set the number of projections to be \(K=20\), in which case MEA-MMD-Proj should be computationally more efficient than MMD, as shown in Fig. 7. Section S8.1 in the Supp. Material contains a similar figure to Fig. 5, but with \(K=100\). Section S8.5 in the Supp. Material looks at variations of the MEA-MMD-Dist approach. We notice from Fig. 5 that MEA-MMD-Proj has slightly greater power than MMD for this setting, at least for \(n\le 50\), while both have similar Type I errors, close to the threshold \(\alpha \). We have included MMD with both the Laplacian and Gaussian kernels, which have very similar performance.

Fig. 5

A comparison of Type I error and statistical power for (i) the \(\textrm{MMD}\) with the Laplacian kernel, (ii) the \(\textrm{MMD}\) with the Gaussian kernel, (iii) the proposed \(\texttt {MEA-MMD-Proj}{}\) and (iv) the proposed \(\texttt {MEA-MMD-Dist}{}\), for multivariate samples \(\textbf{x}= \{x_1, x_2, \dots , x_{n_1} \} \subset \mathbb {R}^{d}\) and \(\textbf{y}= \{y_1, y_2, \dots , y_{n_2} \}\subset \mathbb {R}^{d}\) for \(d=10\). (Left) the Type I error is computed when the underlying distributions of each component of the \(x_{i}\) and \(y_{j}\) vectors are i.i.d. \(\text {N}(0, 1)\). (Right) the statistical power is computed when each component of the \(x_{i}\) is i.i.d. \(\text {N}(0, 1)\) and each component of the \(y_{j}\) is i.i.d. \(\text {N}(0.5, 1)\). In both cases, a significance threshold of \(\alpha =0.05\) is used with \(L=100\) permutations to compute a p-value, the number of projections for MEA-MMD-Proj is \(K=20\), and the performance shown is the average performance over 100 trials

4.3.1 Statistical performance: \(d> n\)

The experiment shown in Fig. 5 indicates one result of particular interest: the power of \(\texttt {MEA-MMD-Proj}{}\) is higher than the power of MMD for small \(n_1=n_2\).

We investigate this further by restricting \(n_1, n_2\in \{5, 10, 20, 30, 40\}\) with \(n_1=n_2\) and increasing the dimension to \(d=100\), in order to consider the case \(d> n\), where \(n=n_1+n_2\).

Again, \(\textbf{x}= \{x_1, x_2, \dots , x_{n_1} \}\) and \(\textbf{y}= \{y_1, y_2, \dots , y_{n_2} \}\) are observations sampled as before, but for the power experiment, we set \(\mu _2=\frac{1}{\sqrt{40}}\) in order to keep the Kullback–Leibler divergence between the two distributions the same for this case where \(d=100\) as in the case where \(d=10\) in Fig. 5; see Sect. S8.2 in the Supp. Material for details.
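As a brief check of this choice (the full details are in Sect. S8.2 of the Supp. Material; the calculation here uses the standard formula for the Kullback–Leibler divergence between isotropic Gaussians with unit variance): if each of the \(d\) components of \(\textbf{y}\) has its mean shifted by \(\mu _2 - \mu _1\) relative to \(\textbf{x}\), then

$$\begin{aligned} \text {KL}\big ( \text {N}(\mu _1 \textbf{1}, I_d) \, \Vert \, \text {N}(\mu _2 \textbf{1}, I_d) \big ) = \frac{d (\mu _2 - \mu _1)^2}{2}, \end{aligned}$$

so the case \(d=10\), \(\mu _2-\mu _1=0.5\) gives \(10 \times 0.25 / 2 = 1.25\), and the case \(d=100\), \(\mu _2-\mu _1 = 1/\sqrt{40}\) gives \(100 \times (1/40) / 2 = 1.25\), matching the choice above.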

Figure 6 shows that MEA-MMD-Proj has power \(1-\beta > 0.8\) for \(d=100\) and \(n_1=n_2=10\) and power close to 1 for \(d=100\) and \(n_1=n_2=20\), while the power for MMD is much lower in those cases, being around 0.2 and 0.4, respectively. This provides some evidence that MEA-MMD-Proj may have utility in the case \(d> n\).

The poor performance of MEA-MMD-Dist may be due, as noted before, to the fact that the Hommel p-value combination procedure is conservative; see Sect. S8.5 in the Supp. Material for further details.

Fig. 6

A comparison of Type I error and statistical power for (i) the proposed \(\texttt {MEA-MMD-Proj}{}\), (ii) the \(\textrm{MMD}\) with the Laplacian kernel, (iii) the \(\textrm{MMD}\) with the Gaussian kernel, (iv) the Crossmatch test, (v) the Multivariate Runs test, (vi) the energy distance, and (vii) Hotelling’s multivariate T-test, for multivariate samples \(\textbf{x}= \{x_1, x_2, \dots , x_{n_1} \} \subset \mathbb {R}^{d}\) and \(\textbf{y}= \{y_1, y_2, \dots , y_{n_2} \}\subset \mathbb {R}^{d}\) for \(d=100\) and \(n_1, n_2\in \left\{ 10, 20, 30, 40 \right\} \), with \(n_1=n_2\), so \(d> n=n_1+n_2\). (Left) the Type I error is computed when the underlying distributions of each component of the \(x_{i}\) and \(y_{j}\) vectors are i.i.d. \(\text {N}(0, 1)\). (Right) the statistical power is computed when each component of the \(x_{i}\) is i.i.d. \(\text {N}(0, 1)\) and each component of the \(y_{j}\) is i.i.d. \(\text {N}(\mu _2, 1)\), where \(\mu _2=1/\sqrt{40}\). In both cases, a significance threshold of \(\alpha =0.05\) is used with \(L=100\) permutations to compute a p-value, the number of projections for MEA-MMD-Proj is \(K=20\), and the performance shown is the average performance over 100 trials

4.3.2 Comparison to other multivariate two-sample tests

We compare the performance of the proposed \(\texttt {MEA-MMD-Proj}{}\) to the following multivariate two-sample tests: the Crossmatch test (Rosenbaum 2005), the Multivariate Runs test (Friedman and Rafsky 1979; Henze and Penrose 1999), and Hotelling’s multivariate T-test (Hotelling 1931). We do not consider \(\texttt {MEA-MMD-Dist}{}\) any longer, since it has worse performance compared to \(\texttt {MEA-MMD-Proj}{}\). We use the \(\textrm{MMD}\) with the Laplacian and Gaussian kernels and the energy distance as benchmarks. We discuss the implementations of the multivariate methods in Sect. S12 of the appendix.

Figure 6 compares these multivariate methods for a mean shift, with the number of dimensions \(d=100\). In this case, \(\texttt {MEA-MMD-Proj}{}\) has higher power than the other methods. The Multivariate Runs test has power equal to 0, possibly because \(n_1+ n_2< d\). The Hotelling test also has power equal to 0, because \(n_1+ n_2< d\) and the sample covariance matrix is singular. In Sect. S12 of the Supp. Material, the same experiment is run for \(d=10\), and \(\texttt {MEA-MMD-Proj}{}\) again has superior power over these methods for small sample sizes.

Further experiments for a change in the mean show that \(\texttt {MEA-MMD-Proj}{}\) performs well in comparison to the multivariate methods. For a change in variance or distribution, though, \(\texttt {MEA-MMD-Proj}{}\) does not perform well, and it is shown that only the \(\textrm{MMD}\) performs well. Following a suggestion from one of the referees, we have applied the same random projections approach to the other univariate tests used in Sect. 3.4. These random projection methods perform similarly to \(\texttt {MEA-MMD-Proj}{}\), performing well for a change in the mean, but not for a change in variance or distribution. These experiments are also in Sect. S12 of the Supp. Material.

In summary, \(\texttt {MEA-MMD-Proj}{}\) may have utility for detecting a change in the mean, particularly when the number of dimensions is greater than the number of samples, but the multivariate \(\textrm{MMD}\) should be preferred for other cases.

Fig. 7

Runtime comparison of C++ implementations of the multivariate (i) \(\textrm{MMD}\) with the Laplacian kernel, (ii) \(\textrm{MMD}\) with the Gaussian kernel, (iii) the proposed \(\texttt {MEA-MMD-Proj}{}\) and (iv) the proposed \(\texttt {MEA-MMD-Dist}{}\), as \(n_1=n_2\) varies, with \(d=10\) and the number of projections for MEA-MMD-Proj set to \(K=20\). (Left) linear scale. (Right) log-log scale. Note that the p-values are computed using \(L=100\) permutations, but without using the median heuristic method to set the value of \(\beta \)

4.4 Runtime comparison

Analysing the runtime of MEA-MMD-Proj via the computation of Eq. (11): since computing \(u_k^{T} \textbf{x}\) and \(u_k^{T} \textbf{y}\) is \(\mathcal {O}(dn)\), and each univariate \(\textrm{MMD}^2\) computation is \(\mathcal {O}(n\log n)\), the overall computational complexity is \(\mathcal {O}(Kn\log n+ Kdn)\).

For MEA-MMD-Dist, since there are \(n\) computations of \(\textrm{MMD}^2(\textbf{x}^{(i)}, \textbf{y}^{(i)})\) (each of which is an \(\mathcal {O}(n\log n+ dn)\) step), and a single \(\mathcal {O}(n\log n)\) step for the Hommel p-value combination procedure, the whole algorithm is \(\mathcal {O}(n^2 \log n+ dn^2)\). This might seem counterproductive; after all, the standard \(\textrm{MMD}\) method is \(\mathcal {O}(dn^2 )\). However, as shown in Figs. 5 and 6 in Sect. 4.3, there are cases where these approximate approaches can have greater statistical power than the standard \(\textrm{MMD}\).

Note that we are neglecting the number of permutations from the computational complexities, since we are assuming all methods will use the permutation approach. However, to be precise, MEA-MMD-Dist is \(\mathcal {O}(Ldn^2 \log n)\), where \(L\) is the number of permutations.

Figure 7 investigates the runtime of MEA-MMD-Proj compared to the other methods by varying \(n\) and the number of projections \(K\), and shows that when \(K\) is not too large, MEA-MMD-Proj can be significantly faster than MMD. In Fig. 7 we have used \(K=20\), as that seems to be sufficient for good performance; see Sect. S8.4 in the Supp. Material for details. Note also that Fig. 7 shows the runtime for when the p-values are computed, but without using the median heuristic; Fig. S14 in the Supp. Material shows that when the median heuristic is used, MEA-MMD-Proj is slightly slower than MMD for \(n\le 2000\). This is because the median heuristic algorithm needs to be computed for every projection. However, as shown in Sects. 4.3 and 4.3.1, we see that the utility of MEA-MMD-Proj is not its increased efficiency, but rather an increase of power in cases where \(n\) is small, and even when \(n< d\).

4.4.1 Increasing the dimension \(d\) or the number of projections \(K\)

All the algorithms have computational complexity that is linear in the dimension of the data, \(d\).

The computational complexity is clearly linear in \(K\), but additional experiments in Sect. S8.4 in the Supp. Material show that for \(K\ge 20\), the statistical performance of MEA-MMD is relatively constant, i.e. does not significantly improve for larger \(K\).

4.5 Other approximate methods

Since MMD (Gretton et al. 2012a) has computational complexity \(\mathcal {O}(dn^2)\), several approximate methods have been proposed which are more efficient.

In Gretton et al. (2012a) a linear-time approximate method, MMD-Linear, was proposed with computational complexity \(\mathcal {O}(dn)\), but this method has a higher variance than MMD, and so worse statistical performance. The MMD-Block method, known as B-test in Zaremba et al. (2013), approximates the \(\textrm{MMD}\) statistic by partitioning the kernel matrices into blocks; the size of the blocks, B, is a parameter for the method and we use their suggested heuristic \(B=\sqrt{n_1}\), when \(n_1=n_2\) (as is the case in our simulation setup), and in this case its computational complexity is \(\mathcal {O}(dn^{1.5})\). The MMD-Fourier method, known as FastMMD in Zhao and Meng (2015), uses the random Fourier features approach from Rahimi and Recht (2007); it has a parameter L, with a suggested heuristic for its value based on the sample size \(n\), and it has computational complexity \(\mathcal {O}(dnL)\).

However, while these methods are computationally more efficient, their statistical performance is worse than that of the original MMD, and so they are not considered in our comparison of statistical performance.

5 Discussion

While a direct computation of the \(\textrm{MMD}\) statistic for \(n\) univariate, real-valued samples is \(\mathcal {O}(n^2)\), we present the algorithm euMMD, which can compute the exact \(\textrm{MMD}\) statistic with computational complexity \(\mathcal {O}(n\log n)\) and storage complexity \(\mathcal {O}(n)\), where these are worst-case complexities. This approach can lead to runtime improvements of several orders of magnitude.

Additionally, two approximate methods, MEA-MMD-Proj and MEA-MMD-Dist, based on euMMD, are explored for computing the \(\textrm{MMD}\) statistic in the multivariate setting for \(d\)-dimensional data. The MEA-MMD-Proj approach depends on a user-defined number of random projections \(K\), and for \(K=20\) this algorithm will have improved computational complexity and runtime performance for large \(n\). However, of greater interest are the cases with small \(n\), and even cases where \(n< d\), where MEA-MMD-Proj has better statistical performance than the exact MMD method.

So, while the univariate case is exact and has improved computational complexity, the multivariate approximate method MEA-MMD-Proj has greater statistical power in cases where \(n\) is small, and even works well in cases where \(n< d\).

One interesting question is whether the \(\mathcal {O}(n\log n)\) computational complexity of euMMD can be improved for univariate real-valued data, i.e. is the computational complexity \(\Omega (n\log n)\)? In Remark 10 it is mentioned that if the data is known to be integer-valued, then an existing \(\mathcal {O}(n)\) sort allows the algorithm to be \(\mathcal {O}(n)\) overall; however, this is a special case.

5.1 Code and figures

The euMMD and MEA-MMD-Proj algorithms are available in the R package named eummd on CRAN and in the Python package named eummd on PyPI. Both packages share the same underlying core code written in C++.

Section S12 in the Supp. Material discusses the implementation of the multivariate methods.

All experiments were run on an Apple Macbook Air with an M1 processor and 16 GB of RAM.

Plots were produced in Python using the matplotlib (Hunter 2007) library.