1 Introduction

Finding simple patterns that describe the data is one of the main problems in data mining. The data mining literature offers many different techniques for this general task, but one of the most common pattern-finding techniques is rarely classified as such. Matrix factorizations (or decompositions; the two terms are used interchangeably in this paper) represent the given input matrix \(\varvec{A}\) as a product of two (or more) factor matrices, \(\varvec{A} \approx \varvec{B}\varvec{C}\). This standard formulation makes the pattern-mining nature of matrix factorizations less obvious, but let us write the matrix product as a sum of rank-1 matrices, \(\varvec{B}\varvec{C} = \sum _{i=1}^{k} \varvec{b}_i \varvec{c}_i\), where \(\varvec{b}_i \varvec{c}_i\) is the outer product of the \(i\)th column of \(\varvec{B}\) and the \(i\)th row of \(\varvec{C}\). Now it becomes clear that the rank-1 matrices are the “simple patterns”, and the matrix factorization is finding k such patterns whose sum is a good approximation of the original data matrix.

This so-called “component interpretation” (Skillicorn 2007) is more appealing with some factorizations than with others. For example, the classical singular value decomposition (SVD) does not easily admit such an interpretation, as the components are not easy to interpret without knowing the earlier components. On the other hand, the motivation for the nonnegative matrix factorization (NMF) often comes from the component interpretation, as can be seen, for example, in the famous “parts of faces” figures of Lee and Seung (1999). The “parts-of-whole” interpretation is at the heart of NMF: every rank-1 component adds something to the overall decomposition and never removes anything. This aids the interpretation of the components, and is also often claimed to yield sparse factors, although the latter point is more contentious (see e.g. Hoyer 2004).

Perhaps the reason why matrix factorization methods are not often considered as pattern mining methods is that the rank-1 matrices are summed together to build the full data. Hence, it is rare for any rank-1 component to explain any part of the input matrix alone. But the use of summation as a way to aggregate the rank-1 components can be considered to be “merely” a consequence of the fact that we are using the standard algebra. If we change the algebra—in particular, if we change how we define the summation—we change the operator used for the aggregation. In this work, we propose to use the maximum operator to define the summation over the nonnegative matrices, giving us what is known as the subtropical algebra. As the aggregation of the rank-1 factors is now the element-wise maximum, we obtain what we call the “winner-takes-it-all” interpretation: the final value of each element in the approximation is determined solely by the largest of the corresponding elements in the rank-1 matrices. This can be considered a staple of the subtropical structure—for each element in the data we can find a single rank-1 pattern, the “winner”, that determines its value exactly. This is in contrast to the NMF structure, where each pattern only makes a “contribution” to the final value.
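To make the contrast concrete, the following NumPy sketch (our own illustration; the variable names are not from the paper) builds two rank-1 blocks and aggregates them once with the standard sum and once with the element-wise maximum of the subtropical algebra:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two rank-1 "patterns": outer products of a column of B and a row of C.
b1, c1 = rng.random(4), rng.random(5)
b2, c2 = rng.random(4), rng.random(5)
block1 = np.outer(b1, c1)
block2 = np.outer(b2, c2)

# Standard (NMF-style) aggregation: every block contributes to every entry.
A_sum = block1 + block2

# Subtropical aggregation: each entry is decided by a single "winning" block.
A_max = np.maximum(block1, block2)

# Winner-takes-it-all: every entry of A_max equals the corresponding entry
# of exactly one of the blocks.
winners = np.argmax(np.stack([block1, block2]), axis=0)
assert np.allclose(A_max, np.where(winners == 0, block1, block2))
```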

Not only does the subtropical algebra give us the intriguing winner-takes-it-all interpretation, it also provides guarantees about the sparsity of the factors, as we will show in Sect. 3.2. Furthermore, a different algebra means that we are finding different factorizations compared to NMF (or SVD). The emphasis here is on the word different: the factorizations can be better or worse in terms of the reconstruction error, but the patterns they find are usually different from those found by NMF. It is also worth mentioning that the same dataset often contains both kinds of structure, in which case subtropical and NMF patterns are complementary to each other, and depending on the application, one or the other can be more useful. One practical advantage of the subtropical methods, though, is that they tend to find a more concise representation of the patterns in the data, while NMF often splits them into several smaller components, making it harder to see the big picture.

Fig. 1

Example results by and on the dataset. For each method, two selected columns from the left-hand factor matrix are shown on a map. The values are normalized to the unit interval, and darker shades indicate higher values. a high precipitation. b hot summer days. c high precipitation. d summer daily maximum temperature

To illustrate this, and more generally the kind of structure subtropical matrix factorization can reveal and how it differs from that of NMF, we show example results on European climate data (Fig. 1).

The data contains weather records for Europe between 1960 and 1990, and it was obtained from the global climate data repository.Footnote 1 The data has \(2\,575\) rows that correspond to 50-by-50 kilometer squares of land where measurements were made and 48 columns corresponding to observations. More precisely, the first 12 columns represent the average low temperature for each month, the next 12 columns the average high temperature, and the next 12 columns the daily mean. The last 12 columns represent the mean monthly precipitation for each month. We preprocessed every column of the data by first subtracting its mean, dividing by the standard deviation, and then subtracting its minimum value, so that the smallest value becomes 0. We compare the results of our subtropical matrix factorization algorithm, called , to those of an NMF algorithm, called , that obtained the best reconstruction error on this data (see Table 2 in Sect. 5). For both methods, we chose two factors: one that best identifies the areas of high precipitation and another that reflects summer (i.e. June, July, and August) daily maximum temperatures. To be able to validate the results of the algorithms, we also include the average annual precipitation and average summer maximum temperature in Fig. 2a, b, respectively.

Both and identify the areas of high precipitation and their corresponding left-hand factors are shown in Fig. 1a, c, respectively. There are, however, two significant differences between their interpretations. First, emphasizes the wettest areas, while shows a much smoother transition similar to the original data. The second difference is that, unlike , does not identify either the UK or Norwegian coast as areas of (particularly) high precipitation. A potential explanation is that there are many overlaps with other factors (see Fig. 15b), and hence having large values in any of them might lead to overcovering the original data. , as a subtropical method, does not suffer from this issue, as there the reconstructed precipitation is entirely based on the factor with the highest values.

Fig. 2

Average climate data to be compared with the factors in Fig. 1. a Average annual precipitation in mm. b Average daily maximum temperature in summer in degrees Celsius

In order to make the above argument more concrete, let us see what happens when we try to combine ’s factors using the standard algebra instead of the subtropical one. Recall that if is a rank-k matrix decomposition of , then we have , where each pattern is an outer product of the \(s\)th column of and the \(s\)th row of . If for some l and t we have , then also , since all values are nonnegative. It is therefore generally undesirable for any subset of the patterns to overcover values in the original data, as there would be no way of decreasing these values by adding more patterns. As an example, we combine the patterns corresponding to ’s factors from Fig. 1a, b. To obtain the actual rank-1 patterns, we first need to compute the outer products of these factors with the corresponding rows of the right-hand side matrix. Now if we denote the obtained patterns by and , then the elements of the matrix show by how much the combination of and overcovers the original data . We then plot the average value of every row of this overcover matrix, scaled by the average value in the original data (Fig. 3a). Since each row corresponds to a location on the map, this shows the average amount by which we would overcover the data, were we to use the standard algebra to combine the factors. It is evident that this approach produces many values that are too high (mostly around the Alps and other high-precipitation areas). On the other hand, when we perform the same procedure using the subtropical algebra (Fig. 3b), there is almost no overcovering.
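A minimal NumPy sketch of this comparison (our own illustration on hypothetical data; the two rank-1 blocks stand in for the selected patterns):

```python
import numpy as np

def overcover_per_row(A, blocks, subtropical=False):
    """Average per-row overcovering of A by the combined blocks,
    scaled by the mean value of A."""
    combined = np.maximum.reduce(blocks) if subtropical else np.sum(blocks, axis=0)
    over = np.clip(combined - A, 0.0, None)   # count only values exceeding A
    return over.mean(axis=1) / A.mean()

# Hypothetical data and two rank-1 patterns (outer products of factor columns
# with the corresponding rows of the right-hand factor matrix).
rng = np.random.default_rng(1)
A = rng.random((6, 4))
blocks = [np.outer(rng.random(6), rng.random(4)) for _ in range(2)]

print(overcover_per_row(A, blocks, subtropical=False))  # standard algebra
print(overcover_per_row(A, blocks, subtropical=True))   # subtropical algebra
```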

A somewhat similar behaviour is seen with the maximal daily temperatures in summer. finds a factor that, with the exception of the Scandinavian peninsula, closely mimics the original data and maintains a smooth gradient of decreasing temperatures when moving towards the north (Fig. 1d). In contrast, identifies the areas where summer days are hot, while practically disregarding other parts (Fig. 1b).

Fig. 3

Comparison of the overcovering when combining the factors from Fig. 1a, b using the standard (a) and the subtropical (b) algebras. All values are divided by the average value in the original matrix; negative values are ignored. Darker shades indicate higher values. a Overcovering when using the standard algebra. b Overcovering when using the subtropical algebra

It is worth mentioning that, although the UK and the coastal regions of Norway are not prominent in the ’s factor shown above, they actually belong to some of its other factors (see Fig. 15b). In other words, the high-precipitation pattern is split into several parts and partially merged with other factors. This is likely a consequence of the pattern-splitting nature of NMF mentioned earlier. Using the subtropical structure, on the other hand, we were able to isolate the high-precipitation pattern and present it in a single factor.

While the above discussion shows that the subtropical model can be a useful complement to NMF, it is generally difficult to claim that either of them is superior. For example, generally provided a more concise representation of the patterns in the climate data, outlining its most prominent properties, while ’s strength was recovering the smooth transitions between values.

Contributions and a roadmap In this paper, we study the use of subtropical decompositions for data analysis.Footnote 2 We start by studying the theoretical aspects of the problem (Sect. 3), showing that the problem is NP-hard even to approximate, but also that sparse matrices have sparse dominated subtropical decompositions.

In Sect. 4, we develop a general framework, called Equator, for finding approximate, low-rank subtropical decompositions, and we present two instances of this framework, tailored towards different types of data and noise, called Capricorn and Cancer. Capricorn assumes discrete data with noise that randomly flips values to random numbers, whereas Cancer assumes continuous-valued data with standard Gaussian noise.

Our experiments (Sect. 5) show that both Capricorn and Cancer work well on datasets that have the kind of noise they are designed for, and they outperform SVD and different NMF methods when the data has subtropical structure. On real-world data, is usually the better of the two, although in terms of reconstruction error, neither of the methods can challenge SVD. On the other hand, we show that both Capricorn and Cancer return interpretable results that show different aspects of the data compared to factorizations made under the standard algebra.

2 Notation and basic definitions

Basic notation Throughout this paper, we will denote a matrix by upper-case boldface letters (\(\varvec{A}\)), and vectors by lower-case boldface letters (\(\varvec{a}\)). The \(i\)th row of matrix is denoted by and the \(j\)th column by . The matrix with the \(i\)th column removed is denoted by , and is the respective notation for with a removed row. Most matrices and vectors in this paper are restricted to the nonnegative real numbers \(\mathbb {R}_+\).

We use the shorthand [n] to denote the set \(\{1, 2, \ldots , n\}\).

Algebras In this paper we consider matrix factorization over the so-called max-times (or subtropical) algebra. It differs from the standard algebra of the real numbers in that addition is replaced by the operation of taking the maximum, and the domain is restricted to the set of nonnegative real numbers.

Definition 1

The max-times (or subtropical) algebra is the set \(\mathbb {R}_+\) of nonnegative real numbers together with the operations \(a \oplus b = \max \lbrace a, b\rbrace \) (addition) and \(a \otimes b = ab\) (multiplication), defined for any \(a, b \in \mathbb {R}_+\). The identity element for addition is 0 and for multiplication it is 1.

In what follows, we will use the notations \(a \oplus b\) and \(\max \lbrace a, b\rbrace \), as well as the names max-times and subtropical, interchangeably. It is straightforward to see that the max-times algebra is a dioid, that is, a semiring with idempotent addition (\(a \oplus a = a\)). It is important to note that the subtropical algebra is anti-negative, that is, there is no subtraction operation.

A very closely related algebraic structure is the max-plus (tropical) algebra (see e.g. Akian et al. 2007).

Definition 2

The max-plus (or tropical) algebra is defined over the set of extended real numbers \(\overline{\mathbb {R}} = \mathbb {R}\cup \lbrace -\infty \rbrace \) with the operations \(a \oplus b = \max \lbrace a, b\rbrace \) (addition) and \(a \otimes b = a + b\) (multiplication). The identity elements for addition and multiplication are \(-\infty \) and 0, respectively.

The tropical and subtropical algebras are isomorphic (Blondel et al. 2000), which can be seen by taking the logarithm of the subtropical algebra or the exponential of the tropical algebra (with the conventions that \(\log 0 = -\infty \) and \(\exp (-\infty ) = 0\)). Thus, most of the results we prove for the subtropical algebra can be extended to their tropical analogues, although caution is needed when dealing with approximate matrix factorizations, because, as we will see in Theorem 4, the reconstruction error of an approximate matrix factorization does not transfer directly between the two algebras.

Matrix products and ranks The matrix product over the subtropical algebra is defined in the natural way:

Definition 3

The max-times matrix product of two matrices \(\varvec{B} \in \mathbb {R}_+^{n \times k}\) and \(\varvec{C} \in \mathbb {R}_+^{k \times m}\) is defined as

$$\begin{aligned} (\varvec{B} \boxtimes \varvec{C})_{ij} = \max _{1 \le s \le k} \varvec{B}_{is}\varvec{C}_{sj}. \end{aligned}$$
(1)

We will also need the matrix product over the tropical algebra.

Definition 4

For two matrices \(\varvec{B} \in \overline{\mathbb {R}}^{n \times k}\) and \(\varvec{C} \in \overline{\mathbb {R}}^{k \times m}\), their tropical matrix product is defined as

$$\begin{aligned} (\varvec{B} \boxplus \varvec{C})_{ij} = \max _{1 \le s \le k} \lbrace \varvec{B}_{is} + \varvec{C}_{sj}\rbrace . \end{aligned}$$
(2)
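Both products are straightforward to implement; the sketch below (ours, not from the paper) also checks the log/exp isomorphism mentioned above on a small example:

```python
import numpy as np

def max_times_product(B, C):
    """Subtropical (max-times) matrix product: entrywise max_s B[i, s] * C[s, j]."""
    # Broadcast to shape (n, k, m), multiply, then take the max over k.
    return np.max(B[:, :, None] * C[None, :, :], axis=1)

def max_plus_product(B, C):
    """Tropical (max-plus) matrix product: entrywise max_s (B[i, s] + C[s, j])."""
    return np.max(B[:, :, None] + C[None, :, :], axis=1)

B = np.array([[1.0, 0.5], [0.2, 2.0]])
C = np.array([[0.3, 1.0, 0.0], [1.5, 0.1, 0.4]])

# The two products are linked by the log/exp isomorphism (with log 0 = -inf).
with np.errstate(divide="ignore"):
    lhs = max_times_product(B, C)
    rhs = np.exp(max_plus_product(np.log(B), np.log(C)))
assert np.allclose(lhs, rhs)
```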

The matrix rank over the subtropical algebra can be defined in many ways, depending on which definition of the normal matrix rank is taken as the starting point. We will discuss different subtropical ranks in detail in Sect. 3.4. Here we give the main definition of the rank we are using throughout this paper, the so-called Schein (or Barvinok) rank of a matrix.

Definition 5

The max-times (Schein or Barvinok) rank of a matrix \(\varvec{A} \in \mathbb {R}_+^{n \times m}\) is the least integer k such that \(\varvec{A}\) can be expressed as an element-wise maximum of k rank-1 matrices, \(\varvec{A} = \max \lbrace \varvec{X}_1, \dots , \varvec{X}_k\rbrace \). A matrix has subtropical (Schein/Barvinok) rank 1 if there exist column vectors \(\varvec{b}\) and \(\varvec{c}\) such that it equals the outer product \(\varvec{b}\varvec{c}^T\). Matrices with subtropical Schein (or Barvinok) rank of 1 are called blocks.

When it is clear from the context, we will use the term rank (or subtropical rank) without other qualifiers to denote the subtropical Schein/Barvinok rank.

Special matrices The final concepts we need in this paper are pattern matrices and dominating matrices.

Definition 6

A pattern of an n-by-m matrix \(\varvec{A}\) is the n-by-m binary matrix \(\varvec{P}\) such that \(\varvec{P}_{ij} = 1\) if and only if \(\varvec{A}_{ij} \ne 0\), and \(\varvec{P}_{ij} = 0\) otherwise. We denote the pattern of \(\varvec{A}\) by .

Definition 7

Let \(\varvec{X}\) and \(\varvec{Y}\) be matrices of the same size, and let \(\varGamma \) be a subset of their indices. Then, if \(\varvec{X}_{ij} \ge \varvec{Y}_{ij}\) for all indices \((i, j) \in \varGamma \), we say that \(\varvec{X}\) dominates \(\varvec{Y}\) within \(\varGamma \). If \(\varGamma \) spans the entire size of \(\varvec{X}\) and \(\varvec{Y}\), we simply say that \(\varvec{X}\) dominates \(\varvec{Y}\). Correspondingly, \(\varvec{Y}\) is said to be dominated by \(\varvec{X}\).
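These two notions are easy to compute; a short sketch (our own helper names):

```python
import numpy as np

def pattern(A):
    """Binary pattern of A: 1 where A is nonzero, 0 elsewhere."""
    return (A != 0).astype(int)

def dominates(X, Y, gamma=None):
    """True if X >= Y on the index set gamma (all indices by default)."""
    mask = np.ones(X.shape, dtype=bool) if gamma is None else gamma
    return bool(np.all(X[mask] >= Y[mask]))

A = np.array([[2.0, 0.0], [1.0, 3.0]])
B = np.array([[1.5, 0.0], [0.5, 3.0]])
print(pattern(A))       # [[1 0] [1 1]]
print(dominates(A, B))  # True: B is dominated by A
```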

Main problem definition Now that we have sufficient notation, we can formally introduce the main problem considered in the paper.

Problem 1

(Approximate subtropical rank-k matrix factorization) Given a matrix \(\varvec{A} \in \mathbb {R}_+^{n \times m}\) and an integer \(k>0\), find factor matrices \(\varvec{B} \in \mathbb {R}_+^{n \times k}\) and \(\varvec{C} \in \mathbb {R}_+^{k \times m}\) minimizing

$$\begin{aligned} \Vert \varvec{A} - \varvec{B} \boxtimes \varvec{C}\Vert . \end{aligned}$$
(3)

Here we have deliberately not specified any particular norm. Depending on the circumstances, different matrix norms can be used, but in this paper we will consider the two most natural choices—the Frobenius and \(L_1\) norms.

3 Theory

Our main contributions in this paper are the algorithms for subtropical matrix factorization. Before we present them, however, it is important to understand the theoretical aspects of subtropical factorizations. We will start by studying the computational complexity of Problem 1, showing that it is NP-hard even to approximate. After that, we will show that the dominated subtropical factorizations of sparse matrices are sparse. Then we compare the subtropical factorizations to factorizations over other algebras, analyzing how the error of an approximate decomposition behaves when moving from the tropical to the subtropical algebra. Finally, we briefly summarize different ways to define the subtropical rank, and how these different ranks can be used to bound each other, as well as the Boolean rank of a binary matrix.

3.1 Computational complexity

The computational complexity of different matrix factorization problems varies. For example, SVD can be computed in polynomial time (Golub and Van Loan 2012), while NMF is NP-hard (Vavasis 2009). Unfortunately, the subtropical factorization is also NP-hard.

Theorem 1

Computing the max-times matrix rank is an NP-hard problem, even for binary matrices.

The theorem is a direct consequence of the following theorem by Kim and Roush (2005):

Theorem 2

(Kim and Roush 2005) Computing the max-plus (tropical) matrix rank is NP-hard, even for matrices that take values only from \(\{-\infty , 0\}\).

While computing the rank deals with exact decompositions, its hardness automatically makes any approximation algorithm with provable multiplicative guarantees unlikely to exist, as the following corollary shows.

Corollary 1

It is NP-hard to approximate Problem 1 to within any polynomially computable factor.

Proof

Any algorithm that can approximate Problem 1 to within a factor \(\alpha \) must find a decomposition of error \(\alpha \cdot 0 = 0\) if the input matrix has an exact max-times rank-k decomposition. As this implies computing the max-times rank, by Theorem 1 it is only possible if \(\mathrm {P} = \mathrm {NP}\). \(\square \)

3.2 Sparsity of the factors

It is often desirable to obtain sparse factor matrices when the original data is sparse, and the sparsity of the factors is frequently mentioned as one of the benefits of using NMF (see, e.g. Hoyer 2004). In general, however, the factors obtained by NMF need not be sparse; but if we restrict ourselves to dominated decompositions, Gillis and Glineur (2010) showed that the sparsity of the factors cannot be less than the sparsity of the original matrix.

The proof of Gillis and Glineur (2010) relies on anti-negativity, and hence it is easy to adapt to the max-times setting. Let the sparsity of an n-by-m matrix \(\varvec{A}\), denoted \(s(\varvec{A})\), be defined as

$$\begin{aligned} s(\varvec{A}) = \frac{nm - \mathrm {nnz}(\varvec{A})}{nm}, \end{aligned}$$
(4)

where \(\mathrm {nnz}(\varvec{A})\) is the number of nonzero elements in \(\varvec{A}\). Now we have

Theorem 3

Let matrices \(\varvec{B} \in \mathbb {R}_+^{n \times k}\) and \(\varvec{C} \in \mathbb {R}_+^{k \times m}\) be such that their max-times product \(\varvec{B} \boxtimes \varvec{C}\) is dominated by an n-by-m matrix \(\varvec{A}\). Then the following estimate holds:

$$\begin{aligned} s(\varvec{B}) + s(\varvec{C}) \ge s(\varvec{A}). \end{aligned}$$
(5)

Proof

The proof follows that of Gillis and Glineur (2010). We first prove (5) for \(k = 1\). Let \(\varvec{b} \in \mathbb {R}_+^{n}\) and \(\varvec{c} \in \mathbb {R}_+^{m}\) be such that \((\varvec{b}\varvec{c}^T)_{ij} \le \varvec{A}_{ij}\) for all \(i \in [n]\) and \(j \in [m]\). Since \((\varvec{b}\varvec{c}^T)_{ij}>0\) if and only if \(\varvec{b}_i > 0\) and \(\varvec{c}_j > 0\), we have that

$$\begin{aligned} \mathrm {nnz}(\varvec{b}\varvec{c}^T) = \mathrm {nnz}(\varvec{b})\,\mathrm {nnz}(\varvec{c}). \end{aligned}$$
(6)

By (4) we have \(1 - s(\varvec{b}\varvec{c}^T) = \mathrm {nnz}(\varvec{b}\varvec{c}^T)/(nm)\), \(1 - s(\varvec{b}) = \mathrm {nnz}(\varvec{b})/n\), and \(1 - s(\varvec{c}) = \mathrm {nnz}(\varvec{c})/m\). Plugging these expressions into (6) we obtain \(1 - s(\varvec{b}\varvec{c}^T) = (1-s(\varvec{b}))(1-s(\varvec{c}))\). Hence, the sparsity of a rank-1 dominated approximation \(\varvec{b}\varvec{c}^T\) of \(\varvec{A}\) satisfies

$$\begin{aligned} s(\varvec{b}) + s(\varvec{c}) \ge s(\varvec{b}\varvec{c}^T). \end{aligned}$$
(7)

From (7) and the fact that the number of nonzero elements in \(\varvec{b}\varvec{c}^T\) is no greater than that in \(\varvec{A}\), it follows that

$$\begin{aligned} s(\varvec{b}) + s(\varvec{c}) \ge s(\varvec{A}). \end{aligned}$$
(8)

Now let \(\varvec{B} \in \mathbb {R}_+^{n \times k}\) and \(\varvec{C} \in \mathbb {R}_+^{k \times m}\) be such that \(\varvec{B} \boxtimes \varvec{C}\) is dominated by \(\varvec{A}\). Then, for each \(l \in [k]\), the outer product of the \(l\)th column of \(\varvec{B}\) and the \(l\)th row of \(\varvec{C}\) is dominated by \(\varvec{A}\) as well. To complete the proof, observe that \(s(\varvec{B})\) and \(s(\varvec{C})\) are the averages of the sparsities of the columns of \(\varvec{B}\) and of the rows of \(\varvec{C}\), respectively, and that estimate (8) holds for each such column–row pair. \(\square \)
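A quick numerical check of Theorem 3 on random dominated decompositions (our own illustration; the bound uses the sparsity of Eq. (4)):

```python
import numpy as np

def sparsity(M):
    """Fraction of zero entries, as in Eq. (4)."""
    return 1.0 - np.count_nonzero(M) / M.size

rng = np.random.default_rng(2)
n, m, k = 40, 30, 5

# Sparse nonnegative factors ...
B = rng.random((n, k)) * (rng.random((n, k)) < 0.3)
C = rng.random((k, m)) * (rng.random((k, m)) < 0.3)

# ... and a data matrix A that (trivially) dominates their max-times product.
A = np.max(B[:, :, None] * C[None, :, :], axis=1)

# Theorem 3: s(B) + s(C) >= s(A) for dominated decompositions.
assert sparsity(B) + sparsity(C) >= sparsity(A)
print(sparsity(B), sparsity(C), sparsity(A))
```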

3.3 Relation to other algebras

Let us now study how the max-times algebra relates to other algebras, especially the standard, the Boolean, and the max-plus algebras. For the first two, we compare the ranks, and for the last, the reconstruction error.

Let us start by considering the Boolean rank of a binary matrix. Finding the Boolean (Schein or Barvinok) rank is the following problem:

Problem 2

(Boolean rank) Given a binary matrix \(\varvec{A} \in \lbrace 0,1\rbrace ^{n \times m}\), find the smallest integer k such that there exist binary matrices \(\varvec{B} \in \lbrace 0,1\rbrace ^{n \times k}\) and \(\varvec{C} \in \lbrace 0,1\rbrace ^{k \times m}\) that satisfy \(\varvec{A} = \varvec{B} \circ \varvec{C}\), where \(\circ \) is the Boolean matrix product, \((\varvec{B} \circ \varvec{C})_{ij} = \bigvee _{s=1}^{k} \varvec{B}_{is}\varvec{C}_{sj}\).

Lemma 1

If is a binary matrix, then its Boolean and subtropical ranks are the same.

Proof

We will prove the claim by first showing that the Boolean rank of a binary matrix is no less than its subtropical rank, and then showing that it is no larger, either. For the first direction, let the Boolean rank of \(\varvec{A}\) be k, and let \(\varvec{B}\) and \(\varvec{C}\) be binary matrices such that \(\varvec{B}\) has k columns and \(\varvec{A} = \varvec{B} \circ \varvec{C}\). It is easy to see that \(\varvec{B} \circ \varvec{C} = \varvec{B} \boxtimes \varvec{C}\) for binary matrices, and hence the subtropical rank of \(\varvec{A}\) is no more than k.

For the second direction, we will actually show a slightly stronger claim: Let and let be its pattern. Then the Boolean rank of is never more than the subtropical rank of . As for a binary , the claim follows. To prove the claim, let have subtropical rank of k and let and be such that . Let (ij) be such that . By definition, , and hence

(9)

On the other hand, if (ij) is such that , then there exists l such that and consequently,

(10)

Combining (9) and (10) gives us

(11)

showing that the Boolean rank of is at most k. \(\square \)

Notice that Lemma 1 also furnishes us with another proof of Theorem 1, as computing the Boolean rank is NP-hard (see, e.g. Miettinen 2009). Notice also that while the Boolean rank of the pattern is never more than the subtropical rank of the original matrix, it can be much less. This is easy to see by considering a matrix with no zeroes: it can have arbitrarily large subtropical rank, but its pattern has Boolean rank 1.

Unfortunately, the Boolean rank does not help us with effectively estimating the subtropical rank, as its computation is an NP-hard problem. The standard rank is (relatively) easy to compute, but the standard rank and the max-times rank are incommensurable, that is, there are matrices that have smaller max-times rank than standard rank and others that have higher max-times rank than standard rank. Let us consider an example of the first kind,

As the decomposition shows, this matrix has max-times rank 2, while its standard rank is easily verified to be 3. Indeed, it is easy to see that the complement of the n-by-n identity matrix, that is, the matrix that has 0s on the diagonal and 1s everywhere else, has max-times rank \(O(\log n)\) while its standard rank is n (the result follows from similar results regarding the Boolean rank, see, e.g. Miettinen 2009).

As we have discussed earlier, the max-plus and max-times algebras are isomorphic, and consequently for any matrix \(\varvec{A} \in \mathbb {R}_+^{n \times m}\) its max-times rank agrees with the max-plus rank of the matrix \(\log \varvec{A}\), where the logarithm is taken elementwise with \(\log 0 = -\infty \). Yet the errors obtained in approximate decompositions do not have to (and usually will not) agree. In what follows we characterize the relationship between max-plus and max-times errors. We denote by \(\overline{\mathbb {R}}\) the extended real line \(\mathbb {R}\cup \lbrace -\infty \rbrace \).

Theorem 4

Let , and . Let \(M = \exp \{N\}\), where

If an error can be bounded in max-plus algebra as

(12)

then the following estimate holds with respect to the max-times algebra:

(13)

Proof

Let . From (12) it follows that there exists a set of numbers such that for any ij we have \((A_{ij} - \alpha _{ij})^2 \le \lambda _{ij}\) and \(\sum _{ij} \lambda _{ij} = \lambda \). By the mean-value theorem, for every i and j we obtain

for some . Hence,

The estimate for the max-times error now follows from the monotonicity of the exponent:

proving the claim. \(\square \)

3.4 Different subtropical matrix ranks

The definition of the subtropical rank we use in this work is the so-called Schein (or Barvinok) rank (see Definition 5). As in the standard linear algebra, this is not the only possible way to define the (subtropical) rank. Here we review a few other forms of subtropical rank that allow us to bound the Schein/Barvinok rank of a matrix. Unless otherwise mentioned, the definitions are by Guillon et al. (2015); naturally, results without citations are ours. Following Guillon et al., we present the definitions in this section over the tropical algebra. Recall that, due to the isomorphism, these definitions transfer directly to the subtropical case.

We begin with the tropical equivalent of the subtropical Schein/Barvinok rank:

Definition 8

The tropical Schein/Barvinok rank of a matrix , denoted , is defined to be the least integer k such that there exist matrices and for which .

Analogous to the standard case, we can also define the rank as the number of linearly independent rows or columns. The following definition of linear independence of a family of vectors in a tropical space is due to Gondran and Minoux (1984b).

Definition 9

A set of vectors \(\varvec{x}_1, \dots , \varvec{x}_k\) from is called linearly dependent if there exist disjoint sets \(I, J \subset [k]\) and scalars \(\{\lambda _i\}_{i\in I \cup J}\), such that \(\lambda _i \ne -\infty \) for all i and

$$\begin{aligned} \max _{i\in I} \{\lambda _i + \varvec{x}_i\} = \max _{j\in J} \{\lambda _j + \varvec{x}_j\}. \end{aligned}$$
(14)

Otherwise the vectors \(\varvec{x}_1, \dots , \varvec{x}_k\) are called linearly independent.

This gives rise to the so-called Gondran–Minoux ranks:

Definition 10

The Gondran–Minoux row (column) rank of a matrix is defined as the maximal k such that has k independent rows (columns). They are denoted by and respectively.

Another way to characterize the rank of the matrix is to consider the space its rows or columns can span.

Definition 11

A set is called tropically convex if for any vectors \(\varvec{x}, \varvec{y} \in X\) and scalars , we have \(\max \{\lambda + \varvec{x}, \mu + \varvec{y}\} \in X\).

Definition 12

The convex hull \(H(\varvec{x}_1, \dots , \varvec{x}_k)\) of a finite set of vectors is the set of all tropical linear combinations \(\max _{i \in [k]} \lbrace \lambda _i + \varvec{x}_i\rbrace \) with scalars \(\lambda _i \in \overline{\mathbb {R}}\).

Definition 13

The weak dimension of a finitely generated tropically convex subset of is the cardinality of its minimal generating set.

We can define the rank of the matrix by looking at the weak dimension of the (tropically) convex hull its rows or columns span.

Definition 14

The row rank and the column rank of a matrix are defined as the weak dimensions of the convex hulls of the rows and the columns of respectively. They are denoted by and .

Unlike in the standard algebra, the above definitions of rank do not in general coincide (see Akian et al. 2009). We can, however, give a partial ordering of the ranks:

Theorem 5

(Guillon et al. 2015; Akian et al. 2009) Let \(\varvec{A} \in \overline{\mathbb {R}}^{n \times m}\). Then the following relations hold among the above definitions of the rank of \(\varvec{A}\):

(15)

The row and column ranks of an n-by-n tropical matrix can be computed in \(O(n^3)\) time (Butkovič 2010), allowing us to bound the Schein/Barvinok rank from above. Unfortunately, no efficient algorithm for the Gondran–Minoux rank is known. On the other hand, Guillon et al. (2015) presented what they called the ultimate tropical rank, which lower-bounds the Gondran–Minoux rank and can be computed in time \(O(n^3)\). We can also check whether a matrix has full Schein/Barvinok rank in time \(O(n^3)\) (see Butkovič and Hevery 1985), even though computing any other value of that rank is NP-hard.

These bounds, together with Lemma 1, yield the following corollary on bounding the Boolean rank of a square binary matrix:

Corollary 2

Given an n-by-n binary matrix \(\varvec{A}\), its Boolean rank can be bounded from below using the ultimate tropical rank, and from above using the tropical column and row ranks, in time \(O(n^3)\).

4 Algorithms

There are some unique challenges in subtropical matrix factorization that stem from the lack of linearity and smoothness of the max-times algebra. One such issue is that dominated elements in a decomposition have no impact on the final result. Namely, if we consider the subtropical product of two matrices \(\varvec{B}\) and \(\varvec{C}\), each entry \((\varvec{B} \boxtimes \varvec{C})_{ij}\) is completely determined by a single winning index that attains the maximum; all other terms do not contribute at all to the final decomposition. To see why this is a problem, observe that many optimization methods used in matrix factorization algorithms rely on local information to choose the direction of the next step (e.g. various forms of gradient descent). In the case of the subtropical algebra, however, this local information is practically absent, and hence we need to look elsewhere for effective optimization techniques.

A common approach to matrix decomposition problems is to update the factor matrices alternatingly, which exploits the fact that the problem is biconvex, that is, convex in either factor when the other one is fixed. Unfortunately, the subtropical matrix factorization problem does not have the biconvexity property, which makes alternating updates less useful.

Here we present a different approach that, instead of doing alternating factor updates, constructs the decomposition by adding one rank-1 matrix at a time, following the idea by Kolda and O’Leary (2000). The corresponding algorithm is called Equator (Algorithm 1).

First observe that the max-times product can be represented as an elementwise maximum of rank-1 matrices (blocks),

$$\begin{aligned} \varvec{B} \boxtimes \varvec{C} = \max _{s \in [k]} \varvec{b}_s \varvec{c}_s, \end{aligned}$$
(16)

where \(\varvec{b}_s\) is the \(s\)th column of \(\varvec{B}\), \(\varvec{c}_s\) is the \(s\)th row of \(\varvec{C}\), and the maximum is taken elementwise.

Hence, Problem 1 can be split into k subproblems of the following form: given a rank-\((l-1)\) decomposition of a matrix \(\varvec{A}\), find a column vector \(\varvec{b}\) and a row vector \(\varvec{c}\) such that the error

(17)

is minimized. We assume by definition that the rank-0 decomposition is an all-zero matrix of the same size as \(\varvec{A}\). The problem of rank-k subtropical matrix factorization is then reduced to solving (17) k times. One should of course remember that this scheme is just a heuristic, and finding optimal blocks on each iteration does not guarantee convergence to a global minimum.

One prominent issue with the above approach is that an optimal rank-\((k-1)\) decomposition might not be very good when considered as a part of a rank-k decomposition. This is because for smaller ranks we generally have to cover the data more crudely, whereas when the rank increases we can afford to use smaller and more refined blocks. In order to deal with this problem, we find and then update the blocks repeatedly, in a cyclic fashion. That means that after discovering the last block, we go all the way back to block one. The input parameter M defines the number of full cycles we make.


On a high level, Equator (Algorithm 1) works as follows. First the factor matrices are initialized to all zeros (line 2). Since the algorithm makes iterative changes to the current solution that might in some cases worsen the results, it also stores the best reconstruction error and the corresponding factors found so far; they are initialized with the starting solution on lines 3–4. The main work is done in the loop on lines 5–10, where on each iteration we update a single rank-1 matrix in the current decomposition using the routine (line 7), and then check whether the update improves the best result so far (lines 8–10).
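A schematic Python sketch of this framework (our own illustration; `add_block` stands for the noise-specific block-update routine described below, and all names are ours):

```python
import numpy as np

def max_times(B, C):
    """Subtropical matrix product: entrywise max_s B[i, s] * C[s, j]."""
    return np.max(B[:, :, None] * C[None, :, :], axis=1)

def equator(A, k, add_block, M=10):
    """Greedy framework: build a rank-k subtropical decomposition by updating
    one rank-1 block at a time, cycling M times over the k blocks.
    `add_block(A, B, C, l)` returns a new column B[:, l] and row C[l, :]."""
    n, m = A.shape
    B, C = np.zeros((n, k)), np.zeros((k, m))        # start from all zeros
    best_err = np.linalg.norm(A - max_times(B, C))   # error of the zero model
    best_B, best_C = B.copy(), C.copy()
    for it in range(M * k):
        l = it % k                                   # block updated this round
        B[:, l], C[l, :] = add_block(A, B, C, l)
        err = np.linalg.norm(A - max_times(B, C))
        if err < best_err:                           # keep the best solution seen
            best_err, best_B, best_C = err, B.copy(), C.copy()
    return best_B, best_C, best_err
```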

We will present two versions of this function, one called Capricorn and the other Cancer. Capricorn is designed to work with discrete (or flipping) noise, where some of the elements in the data are randomly changed to different values; in this setting the level of noise is the proportion of flipped elements relative to the total number of nonzeros. Cancer, on the other hand, is robust against continuous noise that affects many elements (e.g. Gaussian noise). We will discuss both of them in detail in the following subsections. In the rest of the paper, especially when presenting the experiments, we use the names Capricorn and Cancer not only for the specific variants of this function, but also for the algorithms that use them.

4.1 Capricorn

We first describe Capricorn, which is designed to solve the subtropical matrix factorization problem in the presence of discrete noise, and which minimizes the \(L_{1}\) norm of the error matrix. The main idea behind the algorithm is to spot potential blocks by considering ratios of matrix rows. Consider an arbitrary rank-1 block \(\varvec{b}\varvec{c}\), where \(\varvec{b}\) is a column vector and \(\varvec{c}\) is a row vector. For any indices i and j such that \(\varvec{b}_i>0\) and \(\varvec{b}_j>0\), the \(i\)th row of the block is the \(j\)th row multiplied by \(\varvec{b}_i/\varvec{b}_j\). This is a characteristic property of rank-1 matrices—all rows are multiples of one another. Hence, if a block dominates some region \(\varGamma \) of a matrix, then within \(\varGamma \) the rows of that matrix should all be multiples of each other. These rows might have different lengths due to block overlap, in which case the rule only applies to their common part.

starts by identifying the index of the block that has to be updated at the current iteration (line 2). In order to find the best new block we need to take into account that some parts of the data have already been covered, and we must ignore them. This is accomplished by replacing the original matrix with a residual that represents what there is left to cover. The building of the residual (line 3) reflects the winner-takes-it-all property of the max-times algebra: if an element of is approximated by a smaller value, it appears as such in the residual; if it is approximated by a value that is at least as large, then the corresponding residual element is NaN, indicating that this value is already covered. We then select a seed row (line 4), with an intention of growing a block around it. We choose the row with the largest sum as this increases the chances of finding the most prominent block. In order to find the best block that the seed row passes through, we first find a binary matrix that represents the pattern of (line 5). Next, on lines 6–9 we choose an approximation of the block pattern with index sets \({b\_idx}\) and \(c\_idx\), which define what elements of \(\varvec{b}\) and \(\varvec{c}\) should be nonzero. The next step is to find the actual values of elements within the block with the function (line 10). Finally, we inflate the found core block with (line 11).


The function (Algorithm 3) finds the pattern of a new block. It does so by comparing a given seed row to the other rows of the matrix and extracting sets of columns where the ratio of the rows is almost constant. As mentioned before, if two rows locally represent the same block, then one should be a multiple of the other, and the ratios of their corresponding elements should stay constant. The routine processes the input matrix row by row using the function , which for every row outputs the most likely set of indices where it is correlated with the seed row (lines 4–6). Since the seed row is obviously the most correlated with itself, we compensate for this by replacing its pattern with that of the second most correlated row (lines 7–8). Finally, we drop some of the least correlated rows after comparing their correlation value \(\phi \) to that of the second most correlated row (after the seed row). The correlation function \(\phi \) is defined as follows

(18)

The parameter \(\tau \) is a threshold determining whether a row should be discarded or retained. The auxiliary function (Algorithm 4) compares two vectors and finds the largest set of indices where their ratio remains almost constant. It does so by sorting the log-ratio of the input vectors into buckets of a fixed width and then choosing the bucket with the most elements. The notation \(\varvec{u} \mathbin {./} \varvec{v}\) on line 2 denotes the elementwise ratio of the vectors \(\varvec{u}\) and \(\varvec{v}\).

It accepts two additional parameters: and \(\delta \). If the largest bucket has fewer than elements, the function will return an empty set—this is done because very small patterns do not reveal much structure and are mostly accidental. The width of the buckets is determined by the parameter \(\delta \).
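A simplified sketch of this bucketing idea (our own illustration; the parameter names `delta` and `min_size` mirror \(\delta \) and the minimum-bucket-size parameter described in the text):

```python
import numpy as np

def correlated_indices(u, v, delta=0.01, min_size=3):
    """Largest set of indices where u and v have an (almost) constant ratio,
    found by bucketing log(u ./ v) into buckets of width delta."""
    valid = (u > 0) & (v > 0)                 # ratios are defined only here
    if not np.any(valid):
        return np.array([], dtype=int)
    idx = np.where(valid)[0]
    buckets = np.floor(np.log(u[idx] / v[idx]) / delta).astype(int)
    values, counts = np.unique(buckets, return_counts=True)
    if counts.max() < min_size:               # tiny patterns are mostly accidental
        return np.array([], dtype=int)
    best = values[np.argmax(counts)]
    return idx[buckets == best]

# Rows that are multiples of each other on a common support:
u = np.array([2.0, 4.0, 0.0, 6.0, 1.0])
v = np.array([1.0, 2.0, 5.0, 3.0, 4.0])
print(correlated_indices(u, v))   # [0 1 3]: u = 2*v on these indices
```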


At this point we know the pattern of the new block, that is, the locations of its non-zeros. To fill in the actual values, we consider the submatrix defined by the pattern, and find the best rank-1 approximation of it. We do this using the function (Algorithm 5). It begins by setting all elements outside of the pattern to 0 as they are irrelevant to the block (line 2). Then it chooses one row to represent the block (lines 3–4), which will be used to find a good rank-1 cover.

Finally, we find the optimal column vector for the block by computing the best weights to be used for covering different rows of the block with its representing row (line 5). Here we optimize with respect to the Frobenius norm, rather than the \(L_1\) matrix norm, since it allows us to solve the optimization problem in closed form.
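For a fixed representative row \(\varvec{c}\), the Frobenius-optimal weight for each row has the familiar least-squares closed form; a sketch (ours, with a hypothetical helper name):

```python
import numpy as np

def recover_block_values(A_block, c):
    """Given the block's submatrix A_block (zeros outside the pattern) and a
    representative row c, return the column vector b minimizing
    ||A_block - b c||_F.  Row-wise least squares: b_i = <A_i, c> / <c, c>."""
    denom = float(c @ c)
    if denom == 0.0:
        return np.zeros(A_block.shape[0])
    return (A_block @ c) / denom

A_block = np.array([[2.0, 4.0, 6.0],
                    [1.0, 2.0, 3.0],
                    [0.0, 0.0, 0.0]])
c = np.array([1.0, 2.0, 3.0])
b = recover_block_values(A_block, c)
print(b)               # [2. 1. 0.]
print(np.outer(b, c))  # reproduces the first two rows exactly
```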


Since blocks often heavily overlap, we are susceptible to finding only fragments of patterns in the data—some parts of a block can be dominated by another block and subsequently not recognized. Hence, we need to expand found blocks to make them complete. This is done separately for rows and columns in the method called (Algorithm 6), which, given a starting block and the original matrix , tries to add new nonzero elements to \(\varvec{b}\). It iterates through all rows of and adds those that would make a positive impact on the objective without unnecessarily overcovering the data. In order to decide whether a given row should be added, it first extracts a set \(V_i\) of indices where this row is a multiple of the row vector \(\varvec{c}\) of the block (if they are not sufficiently correlated, then the row does not belong to the block) (line 4). A row is added if the evaluation of the following function (line 8)

(19)

is below the threshold \(\theta \). In (19) the numerator measures by how much the new row would overcover the original matrix, and the denominator reflects the improvement in the objective compared to a zero row.


Parameters Capricorn has four parameters in addition to the common parameters of the Equator framework: , \(\delta >0\), \(\theta >0\), and \(\tau \in [0,1]\). The first one determines the minimum number of elements in two rows that must have “approximately” the same ratio for them to be considered for building a block. The parameter \(\delta \) defines the bucket width when computing row correlations. When expanding a block, \(\theta \) is used to decide whether to add a row (or column) to it—the decision is positive whenever the expression (19) is at most \(\theta \). Finally, \(\tau \) is used during the discovery of correlated rows; its value belongs to the closed unit interval, and the higher it is, the more rows will be added.

4.2 Cancer

We now present our second algorithm, Cancer, which is a counterpart of Capricorn specifically designed to work in the presence of high levels of continuous noise. The reason why Capricorn cannot deal with continuous noise is that it expects the rows in a block to have an “almost” constant elementwise ratio, which is not the case when too many entries in the data are disturbed. For example, even low levels of Gaussian noise make the ratios vary enough to hinder Capricorn's ability to spot blocks. With Cancer we take a new approach based on polynomial approximation of the objective. We also replace the \(L_1\) matrix norm, which was used as the objective for Capricorn, with the Frobenius norm. The reason is that when the noise is continuous, its level is defined as the total deviation of the noisy data from the original, rather than as a count of altered elements; this makes the Frobenius norm a good estimator of the amount of noise. Cancer conforms to the general Equator framework (Algorithm 1), and differs from Capricorn only in how it finds the blocks and in the objective function.

Observe that in order to solve subproblem (17) we need to find a column vector \(\varvec{b}\) and a row vector \(\varvec{c}\) that provide the best rank-1 approximation of the input matrix given the current factorization. The objective function is not convex in either \(\varvec{b}\) or \(\varvec{c}\) and is generally hard to optimize directly, so we have to simplify the problem, which we do in two steps. First, instead of doing a full optimization of \(\varvec{b}\) and \(\varvec{c}\) simultaneously, we update only a single element of one of them at a time. This way the problem is reduced to single-variable optimization. Even then the objective is hard to minimize, so we replace it with a polynomial approximation, which is easy to optimize directly.

The Cancer version of the function is described in Algorithm 7. It alternatingly updates the vectors \(\varvec{b}\) and \(\varvec{c}\) using the routine. Both \(\varvec{b}\) and \(\varvec{c}\) will be updated \(\lfloor f (n+m)/2\rfloor \) times. The routine starts by finding the index of the block that has to be changed (line 2). Since the purpose is to find the best rank-1 matrix to replace the current block, we also need to compute the reconstructed matrix without this block, which is done on line 3. We then find the number of times the single-element update will be called (line 4) and change the degree of the polynomials used for approximating the objective function (line 5). This is needed because high-degree polynomials are better at finalizing a solution that is already reasonably good, but tend to overfit the data and cause the algorithm to get stuck in local minima at the beginning. It is therefore beneficial to start with polynomials of lower degree and then gradually increase it. The actual changes to \(\varvec{b}\) and \(\varvec{c}\) happen in the loop (lines 7–9), where we update them using .

The function (Algorithm 8) updates a single entry in either a column vector \(\varvec{b}\) or a row vector \(\varvec{c}\). Let us consider the case when \(\varvec{b}\) is fixed and \(\varvec{c}\) varies. In order to decide which element of \(\varvec{c}\) to change, we need to compare the best changes to all m entries and then choose the one that yields the most improvement to the objective. A single element \(\varvec{c}_l\) affects the error only along the column l. Assume that we are currently updating the block with index q, and let denote the reconstructed matrix without this block, that is . Minimizing the error with respect to \(\varvec{c}_l\) is then equivalent to minimizing

(20)

Instead of minimizing (20) directly, we use polynomial approximation in the routine (line 4). It returns the (approximate) error \({\textit{err}}\) and the value x achieving it. The polynomial approximation is obtained by evaluating the objective function at \(deg+1\) points generated uniformly at random from the interval [0, 5] and then fitting a polynomial to the obtained values. The upper bound of 5 has no special meaning; it was chosen by trial and error. This procedure is a heuristic and does not necessarily find the global minimum of the objective function. Moreover, in rare cases it might even cause an increase in the objective value. In such cases it would, in theory, make sense to just keep the value prior to the update, as then the objective at least does not increase; in practice, however, this phenomenon helps to escape local minima. Since we are only interested in the improvement of the objective achieved by updating a single entry of \(\varvec{c}\), we compute the improvement of the objective after the change (line 5). After trying every entry of \(\varvec{c}\), we update only the one that yields the largest improvement.


The function \(\gamma \) that we need to minimize in order to find the best change to the vector \(\varvec{c}\) is hard to work with directly since it is not convex, and it is also not smooth because of the presence of the maximum operator. To alleviate this, we approximate the error function \(\gamma \) with a polynomial g of degree deg. Notice that when updating \(\varvec{c}_l\), the other variables of \(\gamma \) are fixed and we only need to consider the resulting single-variable function \(\gamma '\). To build g we sample \(deg+1\) points from (0, 1) and fit g to the values of \(\gamma '\) at these points. We then find the value x that minimizes g(x) and return g(x) (the approximate error) and x (the optimal value).
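A simplified sketch of this polynomial-approximation step (our own illustration; `poly_min` and the toy objective are hypothetical, not the paper's code):

```python
import numpy as np

def poly_min(gamma, deg=6, lo=0.0, hi=1.0, rng=None):
    """Fit a degree-`deg` polynomial to `gamma` at deg+1 random points in
    [lo, hi] and return (approximate minimum value, minimizing argument)."""
    rng = np.random.default_rng() if rng is None else rng
    xs = rng.uniform(lo, hi, deg + 1)
    coeffs = np.polyfit(xs, [gamma(x) for x in xs], deg)
    # Candidate minimizers: real critical points inside [lo, hi] plus endpoints.
    crit = np.roots(np.polyder(coeffs))
    crit = crit[np.isreal(crit)].real
    candidates = np.concatenate(([lo, hi], crit[(crit >= lo) & (crit <= hi)]))
    values = np.polyval(coeffs, candidates)
    best = int(np.argmin(values))
    return values[best], candidates[best]

# Toy example: error along one column when a single entry c_l is varied,
# gamma(x) = sum_i (A_il - max(R_il, b_i * x))^2.
A_col = np.array([3.0, 1.0, 2.0])
R_col = np.array([0.5, 0.5, 2.5])   # reconstruction without the current block
b = np.array([1.0, 0.5, 1.0])
gamma = lambda x: float(np.sum((A_col - np.maximum(R_col, b * x)) ** 2))
print(poly_min(gamma, deg=6, hi=5.0))
```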

Parameters Cancer has two parameters, \(t>2\) and \(0<f<1\), that control its execution. The first one, t, is the maximum allowed degree of the polynomials used for approximating the objective, which we set to 16 in all our experiments. The second parameter, f, determines the number of single-element updates we make to the row and column vectors of a block in each call of the block-update routine. To demonstrate that the chosen parameter values are reasonable, we performed a grid search over various parameter values (see Fig. 4 in Sect. 5).

Generalized Cancer The algorithm can be adapted to optimize other objective functions. Its general polynomial-approximation framework allows for a wide variety of possible objectives, the only constraint being that they have to be additive (we call a function additive if there exists a mapping such that for all and we have ). Some examples of such functions are the \(L_1\) and Frobenius matrix norms, as well as the Kullback–Leibler and Jensen–Shannon divergences. In order to use the generalized form of Cancer, one simply has to replace the Frobenius norm with another cost function wherever the error is evaluated.

4.3 Time complexity

The main work in Equator is performed inside the block-update routine, which is called Mk times. Since M is a constant parameter, the complexity of Equator is k times the complexity of that routine. In the following we derive theoretical bounds on its execution time for both Capricorn and Cancer.

  In the case of there are three main contributors to (Algorithm 2): , , and . compares every row to the seed row, each time calling , which in turn has to process all m elements of both rows. This results in the total complexity of being O(nm). To find the complexity of , first observe that any “pure” block can be represented as , where and with \(n'\le n\) and \(m'\le m\). selects \(\varvec{c}\) from the rows of and then finds the corresponding column vector \(\varvec{b}\) that minimizes . In order to select the best row, we have to try each of the \(n'\) candidates, and since finding the corresponding \(\varvec{b}\) for each of them takes time \(O(n'm')\), this gives the runtime of as \(O(n')O(n'm') = O(n^2m)\). The most computationally expensive parts of are (line 4), finding the mean (line 7), and computing the impact (line 8), which all run in O(m) time. All of these operations have to be repeated O(n) times, and hence the runtime of is O(nm). Thus, we can now estimate the complexity of to be \(O(nm)+O(n^2m)+O(nm) = O(n^2m)\), which leads to the total runtime of to be \(O(n^2mk)\).

Cancer Here (Algorithm 7) is a loop that calls \(\lfloor f(n+m) \rfloor \) times. In the contributors to the complexity are computing the base error (line 3) and a call to (line 4). Both of them are performed n or m times depending on whether we supplied the column vector \(\varvec{b}\) or the row vector \(\varvec{c}\) to . Finding the base error takes time O(m) for \(\varvec{b}\) and O(n) for \(\varvec{c}\). The complexity of boils down to that of evaluating the max-times objective at \(deg+1\) points and then minimizing a degree deg polynomial. Hence, runs in time O(m) or O(n) depending on whether we are optimizing \(\varvec{b}\) or \(\varvec{c}\), and the complexity of is O(nm).

Since is called \(\lfloor f(n+m)/2 \rfloor \) times and f is a fixed parameter, this gives the complexity \(O\bigl ((n+m)nm\bigr )\) for and \(O\bigl ((n+m)nmk\bigr ) = O(\max \{n,m\}nmk)\) for .

Empirical evaluation of the time complexity is reported in Sect. 5.3.

5 Experiments

We tested both Capricorn and Cancer on synthetic and real-world data. In addition, we also compare against a variation of Cancer that optimizes the Jensen–Shannon divergence, which we call . The purpose of the synthetic experiments is to evaluate the properties of the algorithms in controlled environments where we know the data has max-times structure. They also demonstrate on what kind of data each algorithm excels and what their limitations are. The purpose of the real-world experiments is to confirm that these observations also hold on real-world data, and to study what kinds of datasets actually have max-times structure. The source code of Capricorn and Cancer and the scripts that run the experiments in this paper are freely available for academic use.Footnote 3

Parameters of . Both variations of use the same set of parameters. For the synthetic experiments we used \(M=14\), \(t=16\), and \(f=0.1\). For the real world experiments we set \(t=16\), \(f=0.1\), and \(M=40\) (except for , where we used \(M=50\) and , where we set \(M=8\)). Increasing M, which controls the number of cycles of execution of , almost invariably improves the results. At some point though, the gains become marginal, and the value of \(M=40\) is chosen so as to reach the point where increasing M further would not yield much improvement. Sometimes though, this moment can be reached faster—for example the smaller choice of M for is motivated by the fact that quickly reached a point where it could no longer make significant progress, despite being the largest dataset. The relationship of the other two parameters and the quality of decomposition is more complex. We see in Fig. 4a that the dependence on f and t is not monotone, and it is hard to pinpoint the best combination exactly. Moreover, the optimal values can differ depending on the dataset; for example, Fig. 4b features an almost monotone dependence on f that flattens out before f reaches 0.1. From our experience, however, the values of \(t=16\) and \(f=0.1\) seem to be a good choice.

Parameters of . In both synthetic and real-world experiments we used the following default set of parameters: \(M=4\), , \(\delta =0.01\), \(\theta =0.5\), and \(\tau =0.5\). As with , there is a complex dependency of the results on the parameters, but the values chosen above seem to produce good results in most cases. We do not show a comparison table, as we did with , due to a bigger number of parameters.

Fig. 4

Results of with different values of parameters t and f. a . b Synthetic experiment with Gaussian noise

Fig. 5

Results of with different regularization parameters for the two factor matrices, denoted by \(\alpha \) and \(\beta \) respectively. a . b Synthetic experiment with Gaussian noise

5.1 Other methods

We compared our algorithms against SVD and five versions of NMF. For SVD, we used Matlab’s built-in implementation. The first form of NMF is a sparse NMF algorithm by Hoyer (2004),Footnote 4 which we call . It defines the sparsity of a vector \(\varvec{x} \in \mathbb {R}^{n}\) as

$$\begin{aligned} {\text {sparsity}}(\varvec{x}) = \frac{\sqrt{n} - \Vert \varvec{x}\Vert _1 / \Vert \varvec{x}\Vert _2}{\sqrt{n} - 1} \end{aligned}$$
(21)

and returns factorizations where the sparsity of the factor matrices is user-controllable. Note that the above definition of sparsity is different from the one we use elsewhere (see Equation (4)). In order to run it, we used the sparsity of ’s factors (as defined by (21)) as its sparsity parameter. We also compare against a standard alternating least squares algorithm called (Cichocki et al. 2009). Next we have two versions of NMF that are essentially the same as , but use \(L_1\) regularization for increased sparsity (Cichocki et al. 2009), that is, they aim at minimizing \(\Vert \varvec{A} - \varvec{B}\varvec{C}\Vert _F^2 + \alpha \Vert \varvec{B}\Vert _1 + \beta \Vert \varvec{C}\Vert _1\).
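For reference, a small function computing Hoyer’s sparsity measure (21) (our own sketch):

```python
import numpy as np

def hoyer_sparsity(x):
    """Hoyer (2004) sparsity: 1 for a vector with a single nonzero entry,
    0 for a vector whose entries all have equal magnitude."""
    x = np.asarray(x, dtype=float)
    n = x.size
    l1, l2 = np.abs(x).sum(), np.linalg.norm(x)
    if l2 == 0:
        raise ValueError("sparsity is undefined for the zero vector")
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

print(hoyer_sparsity([0, 0, 3, 0]))   # 1.0
print(hoyer_sparsity([1, 1, 1, 1]))   # 0.0
```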

The first method is called and uses regularizer coefficients \(\alpha =\beta =1\), and the other, called , has regularizer coefficients \(\alpha =\beta =5\). It is natural to ask how would fare with other parameter values. In Fig. 5 we perform a grid search for the best parameter combination. While the experiment with has a very uneven surface without much structure apart from a couple of spikes, the synthetic dataset demonstrates that high values of \(\alpha \) and \(\beta \) can have serious adverse effects on the reconstruction error. It therefore seems safest to set \(\alpha =\beta =0\), which corresponds to the method. It is worth mentioning that in many of our experiments larger values of \(\alpha \) and \(\beta \) resulted in the factors becoming close to zero, or in some elements of the factors getting enormous values due to numerical instability. This was also the case for some other real-world experiments, such as , which is another indication that the parameter values \(\alpha =\beta =0\) should be used.

The last NMF algorithm, by Li and Ngom (2013), is designed to work with missing values in the data.

5.2 Synthetic experiments

The purpose of the synthetic experiments is to prove the concept, that is, that our algorithms are capable of identifying max-times structure when it is there. In order to test this, we first generate data with pure max-times structure, then pollute it with some level of noise, and finally run the methods. The noise-free data is created by first generating random factors of a given density, with the nonzero elements drawn from a uniform distribution on the interval [0, 1], and then multiplying them using the max-times matrix product.

We distinguish two types of noise. The first one is the discrete (or tropical) noise, which is introduced in the following way. Assume that we are given an input matrix of size n-by-m. We first generate an n-by-m noise matrix with elements drawn from a uniform distribution on the [0, 1] interval. Given a level of noise l, we then turn \(\lfloor (1 - l)nm \rfloor \) random elements of to 0, so that its resulting density is l. Finally, the noise is applied by taking elementwise maximum between the original data and the noise matrix . This is the kind of noise that was designed to handle, so we expect it to be better than and other comparison algorithms.

We also test against continuous noise, as it is arguably more common in the real world. For that we chose Gaussian noise with zero mean, where the noise level is defined to be its standard deviation. Since adding this noise to the data might result in negative entries, we truncate all negative values in the resulting matrix to zero.
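The data-generation procedure is easy to replicate in outline; a sketch (ours; the tropical-noise helper uses Bernoulli masking rather than zeroing exactly \(\lfloor (1-l)nm\rfloor \) entries):

```python
import numpy as np

def synthetic_max_times(n, m, k, density, rng):
    """Noise-free data with pure max-times structure."""
    B = rng.random((n, k)) * (rng.random((n, k)) < density)
    C = rng.random((k, m)) * (rng.random((k, m)) < density)
    return np.max(B[:, :, None] * C[None, :, :], axis=1), B, C

def add_tropical_noise(A, level, rng):
    """Discrete 'flipping' noise: elementwise max with a sparse uniform
    random matrix whose density is approximately the noise level."""
    N = rng.random(A.shape) * (rng.random(A.shape) < level)
    return np.maximum(A, N)

def add_gaussian_noise(A, std, rng):
    """Continuous noise: add zero-mean Gaussian noise and truncate at zero."""
    return np.clip(A + rng.normal(0.0, std, A.shape), 0.0, None)

rng = np.random.default_rng(3)
A, B, C = synthetic_max_times(1000, 800, 10, density=0.3, rng=rng)
A_trop = add_tropical_noise(A, level=0.1, rng=rng)
A_gauss = add_gaussian_noise(A, std=0.08, rng=rng)

rel_err = lambda X, X_hat: np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(rel_err(A, A_trop), rel_err(A, A_gauss))  # relative Frobenius errors
```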

Unless specified otherwise, all matrices in the synthetic experiments are of size 1000-by-800 with true max-times rank 10. All results presented in this section are averaged over 10 instances. For the reconstruction error tests, we compared our algorithms , , and against , , , , , and . The error is measured as the relative Frobenius norm \(\Vert \varvec{A} - \tilde{\varvec{A}}\Vert _F / \Vert \varvec{A}\Vert _F\), where \(\varvec{A}\) is the data and \(\tilde{\varvec{A}}\) its approximation, as that is the measure both and aim at minimizing. We also report the sparsity s of the factor matrices obtained by the algorithms, which is defined as the fraction of zero elements in the factor matrices, see (4). For the experiments with tropical noise, the reconstruction errors are reported in Fig. 6 and factor sparsity in Fig. 7. For the Gaussian noise experiments, the reconstruction errors and factor sparsity are shown in Figs. 8 and 9, respectively.

Varying density with tropical noise In our first experiment we studied the effects of varying the density of the factor matrices in presence of the tropical noise. We changed the density of the factors from 10 to 100% with an increment of 10%, while keeping the noise level at 10%. Figure 6a shows the reconstruction error and Fig. 7a the sparsity of the obtained factors. is consistently the best method, obtaining almost perfect reconstruction; only when the density approaches 100% does its reconstruction error deviate slightly from 0. This is expected since the data was generated with the tropical (flipping) noise that is designed to optimize. Compared to all other methods clearly underperform, with being the second best. With the exception of , all NMF methods obtain results similar to those of , while having a somewhat higher reconstruction error than . That and NMF methods (except ) start behaving better at higher levels of density indicates that these matrices can be explained relatively well using standard algebra. and also have the highest sparsity of factors, with exhibiting a decrease in sparsity as the density of the input increases. This behaviour is desirable since ideally we would prefer to find factors that are as close to the original ones as possible. For NMF methods there is a trade-off between the reconstruction error and the sparsity of the factors—the algorithms that were worse at reconstruction tend to have sparser factors.

Fig. 6

Reconstruction errors on synthetic data with tropical noise. x-axis is the parameter varied and y-axis is the relative Frobenius norm. All results are averages over 10 random matrices and the width of the error bars is twice the standard deviation. a Varying density test. b Varying noise test. c Varying noise with high density. d Varying rank test with 10% noise and 30% factor density. e Varying rank test with 50% noise and 30% factor density. f Varying rank test with 10% noise and 60% factor density

Varying tropical noise The amount of noise is always with respect to the number of nonzero elements in a matrix, that is, for a matrix with nonzero elements and noise level \(\alpha \), we flip elements to random values. There are two versions of this experiment—one with factor density 30% and the other with 60%. In both cases we varied the noise level from 0 to 110% with increments of 10%. Figure 6b, c shows the respective reconstruction errors and Fig. 7b, c the corresponding sparsities of the obtained factors. In the low-density case, is consistently the best method with essentially perfect reconstruction for up to 80% of noise. In the high-density case, however, the noise has more severe effects, and in particular after 60% of noise, , , and all versions of NMF are better than . The severity of the noise is, at least partially, explained by the fact that in the denser data we flip more elements than in sparser data: for example when the data matrices are full, at 50% of noise, we have already replaced half of the values in the matrices with random values. Further, the quick increase of the reconstruction error for hints strongly that the max-times structure of the data is mostly gone at these noise levels. also produces clearly the sparsest factors for the low density case, and is mostly tied with and when the density is high. It should be noted, however, that generally has the highest reconstruction error among all the methods, which suggests that its sparse factors come at the cost of recovering little structure from the data.

Fig. 7

Sparsity (fraction of zeroes) of the factor matrices for synthetic data with tropical noise. x-axis is the parameter varied and y-axis is the sparsity of the factors. The markers are averages of 10 random matrices and the width of the error bars is twice the standard deviation. a Varying density test. b Varying noise test. c Varying noise with high density. d Varying rank test with 10% noise and 30% factor density. e Varying rank test with 50% noise and 30% factor density. f Varying rank test with 10% noise and 60% factor density

Varying rank with tropical noise Here we test the effects of the (max-times) rank, with the assumption that higher-rank matrices are harder to reconstruct. The true max-times rank of the data varied from 2 to 20 with increments of 2. There are three variations of this experiment: with 30% factor density and 10% noise (Fig. 6d), with 30% factor density and 50% noise (Fig. 6e), and with 60% factor density and 10% noise (Fig. 6f). The corresponding sparsities are shown in Fig. 7d–f.  has a clear advantage in all settings, obtaining nearly perfect reconstruction.  is generally second best, except for the high-noise case, where it is mostly tied with several NMF methods. It also has a relatively high variance. To see why this happens, consider that  always updates one element of the factor matrices at a time. This update depends entirely on the values of a single row (or column) and is therefore sensitive to the spikes that tropical noise introduces into some elements. Interestingly, in the last two plots the reconstruction error actually drops for , , and the NMF-based methods. This is a strong indication that at this point they can no longer extract meaningful structure from the data, and the improvement of the reconstruction error is largely due to the uniformization of the data caused by the high density and high noise levels.

Varying Gaussian noise Here we investigate how the algorithms respond to different levels of Gaussian noise, which was varied from 0 to 0.14 with increments of 0.01. The noise level is the standard deviation of the Gaussian noise used to generate the noise matrix, as described earlier. The factor density was kept at 50%. The results are given in Figs. 8a (reconstruction error) and 9a (sparsity of factors).
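A rough sketch of this perturbation is given below. The clipping of negative entries at zero is our assumption, made only to keep the matrix nonnegative; the exact noise-matrix construction is the one described earlier in the text.

```python
import numpy as np

def add_gaussian_noise(A, sigma, rng=None):
    """Add elementwise N(0, sigma^2) noise to A.

    Clipping at zero is an assumption of this sketch, not a detail
    taken from the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    return np.maximum(A + rng.normal(0.0, sigma, size=A.shape), 0.0)
```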

Fig. 8

Reconstruction error (Frobenius norm) for synthetic data with Gaussian noise. The markers are averages of 10 random matrices and the width of the error bars is twice the standard deviation. a Varying noise with density 50%. b Varying density with low Gaussian noise. c Varying density with high Gaussian noise. d Varying rank; 50% density and low Gaussian noise

Here is generally the best method in reconstruction error, and second in sparsity only to . The only time it loses to any method is when there is no noise, and obtains a perfect decomposition. This is expected since is by design better at spotting pure subtropical structure.

Varying density with Gaussian noise In this experiment we studied how the density of the factor matrices used in the data generation affects the algorithms’ performance. For this purpose we varied the density from 10 to 100% with increments of 10% while keeping the other parameters fixed. There are two versions of this experiment, one with a low noise level of 0.01 (Figs. 8b, 9b) and a noisier case with level 0.08 (Figs. 8c, 9c).

provides the lowest reconstruction error in this experiment, being clearly the best until the density reaches 70%, from which point on it is tied with  and the NMF-based methods (the only exception being the least dense high-noise case, where  obtains a slightly better reconstruction error).  is the worst by a wide margin, but this is not surprising, as the data does not follow its assumptions. On the other hand, it does generally produce the sparsest factorizations, but these are of little use given its poor reconstruction error.  produces the sparsest factors among the remaining methods, except in the first few cases where  is sparser (and worse in reconstruction error), meaning that  produces factors that are both the most accurate and very sparse.

Fig. 9

Sparsity (fraction of zeroes) of the factor matrices for synthetic data with Gaussian noise. The markers are averages of 10 random matrices and the width of the error bars is twice the standard deviation. a Varying noise with density 50%. b Varying density with low Gaussian noise. c Varying density with high Gaussian noise. d Varying rank; 50% density and low Gaussian noise

Varying rank with Gaussian noise The purpose of this test is to study the performance of the algorithms on data of different max-times ranks. We varied the true rank of the data from 2 to 20 with increments of 2. The factor density was fixed at 50% and the Gaussian noise at 0.01. The results are shown in Figs. 8d (reconstruction error) and 9d (sparsity of factors). The results are similar to those considered above, with  returning the most accurate and second sparsest factorizations.

Optimizing the Jensen–Shannon divergence By default  optimizes the Frobenius reconstruction error, but the objective can be replaced by an arbitrary additive cost function. We performed experiments with the Jensen–Shannon divergence, which is given by the formula

\(JS(\varvec{A}, \tilde{\varvec{A}}) = \sum _{i, j} \left( \tfrac{1}{2} A_{ij} \log \tfrac{2 A_{ij}}{A_{ij} + \tilde{A}_{ij}} + \tfrac{1}{2} \tilde{A}_{ij} \log \tfrac{2 \tilde{A}_{ij}}{A_{ij} + \tilde{A}_{ij}} \right) . \quad (22)\)

It is easy to see that (22) is an additive function, and hence it can be plugged into . Figure 10 shows how this version of  compares to the other methods. The setup is the same as in the corresponding experiments in Fig. 8. In all these experiments it is apparent that this version of  is inferior to the one optimizing the Frobenius error, but it is generally on par with  and the NMF-based methods. For the varying density test (Fig. 10b) it also produces better reconstruction errors than  and all the NMF methods until the density reaches 50%, after which they become tied.
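To make the additivity concrete, here is a small NumPy sketch of an elementwise Jensen–Shannon objective of this kind; the exact expression used in the experiments is (22), and the smoothing constant eps is our addition to avoid taking logarithms of zero.

```python
import numpy as np

def js_divergence(A, B, eps=1e-12):
    """Elementwise Jensen-Shannon divergence between two nonnegative
    matrices, summed over all entries (hence an additive objective)."""
    M = 0.5 * (A + B)
    term_a = A * np.log((A + eps) / (M + eps))
    term_b = B * np.log((B + eps) / (M + eps))
    return 0.5 * np.sum(term_a + term_b)
```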

Fig. 10

Comparison of with Jensen–Shannon objective and other methods on synthetic data with Gaussian noise. x-axis is the parameter varied and y-axis is the relative Frobenius error. All results are averages over 10 random matrices and the width of the error bars is twice the standard deviation. a Varying noise test. b Varying density test. c Varying rank test with 10% noise and 30% factor density

Prediction In this experiment we choose a random holdout set and remove it from the data (the elements of this set are marked as missing values). We then try to learn the structure of the data from its remaining part using the algorithms, and finally test how well they predict the values inside the holdout set. The factors are drawn uniformly at random from the integers in an interval [0, a] with a predefined density of 30%, and then multiplied using the subtropical matrix product. We use two different values of a, 10 and 3. With \(a=10\) the input matrices have values in the range [0, 100], and with \(a=3\) the range is [0, 9]. We then apply noise to the obtained matrices and feed them to the algorithms. Since all input matrices are integer-valued, while the data recovered by the algorithms can be continuous-valued, we round the latter to the nearest integer. We report two measures of the prediction quality: the prediction rate, defined as the fraction of correctly guessed values in the holdout set, and the root mean square error (RMSE). We tested this setup with both tropical noise (Fig. 11) and Gaussian noise (Fig. 12).
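A short sketch of these two measures over a holdout mask; the names (holdout_scores, A_true, A_pred, mask) are ours, and computing the RMSE on the rounded predictions is our reading of the description above.

```python
import numpy as np

def holdout_scores(A_true, A_pred, mask):
    """Prediction rate and RMSE over a boolean holdout mask.

    A_true holds the original integer values, A_pred the (possibly
    continuous) reconstruction, which is rounded as described in the text.
    """
    truth = A_true[mask]
    pred = np.rint(A_pred[mask])
    prediction_rate = np.mean(pred == truth)
    rmse = np.sqrt(np.mean((pred - truth) ** 2))
    return prediction_rate, rmse
```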

Fig. 11

Prediction rate on synthetic data with tropical noise. The x-axis represents the size of the holdout set. The y-axis is the correct prediction rate in (a) and (c), and RMSE in (b) and (d). The range is the interval that the values in input matrices are restricted to. All results are averages over 10 random matrices and the width of the error bars is twice the standard deviation. a Prediction rate, range [0, 100]. b RMSE, range [0, 100]. c Prediction rate, range [0, 9]. d RMSE, range [0, 9]

Fig. 12

Prediction rate on synthetic data with Gaussian noise. The x-axis represents the size of the holdout set. The y-axis is the correct prediction rate in (a) and (c), and RMSE in (b) and (d). The range is the interval that the values in input matrices are restricted to. All results are averages over 10 random matrices and the width of the error bars is twice the standard deviation. a Prediction rate, range [0, 100]. b RMSE, range [0, 100]. c Prediction rate, range [0, 9]. d RMSE, range [0, 9]

gives by far the best prediction rate when using the higher [0, 100] range of values in the input matrices (Figs. 11a, 12a). Especially interesting is that it also beats all other methods in the presence of Gaussian noise. In terms of RMSE it generally lands somewhere in the middle of the pack of NMF methods. Such a large difference between these measures is caused by  not really being an approximation algorithm: it extracts subtropical patterns where they exist, while ignoring parts of the data where they cannot be found. This results in it either predicting the integer values exactly or missing by a wide margin. With the [0, 9] range of values the results of  become worse, which is especially evident with Gaussian noise. Although this behaviour might seem counterintuitive, it is simply a consequence of noise having a larger effect when the values in the data are smaller.  shows the opposite behaviour to  in that it benefits from the smaller value range and Gaussian noise, where it consistently outperforms all other methods. Unlike ,  approximates the values in the input data, which allows it to score a high number of hits on the [0, 9] range after rounding. On the [0, 100] interval, though, it is liable to guess many values incorrectly, since a much higher level of precision is required. For many prediction tasks, like predicting user ratings, ’s approach seems more useful, as the input values are usually drawn from a relatively small range (for example, in , all ratings are from [0, 5]). The other competing methods generally do not perform well, with the exception of  taking first place in the RMSE measure for the high-range experiments (Figs. 11b, 12b). This illustrates once again that  is a good approximation method, but that this does not help its prediction accuracy. In all other experiments the first place is held by either  or . As a general guideline, when choosing between  and  for value prediction, one should consider that  usually gives superior performance, while  tends to be better at exactly guessing values drawn from a wider range.

Discussion The synthetic experiments confirm that both  and  are able to recover matrices with max-times structure. The main practical difference between them is that  is designed to handle the tropical (flipping) noise, while  is meant for data that is perturbed with white (Gaussian) noise. While  is clearly the best method when the data has only the flipping noise, and is capable of tolerating very high noise levels, its results deteriorate when we apply Gaussian noise. Hence, when the exact type of noise is not known a priori, it is advisable to try both methods. It is also important to note that  is actually a framework of algorithms, as it can optimize various objectives. In order to demonstrate that, we performed experiments with the Jensen–Shannon divergence as the objective and obtained results that, while inferior to those of the version optimizing the Frobenius error, are still slightly better than those of the rest of the algorithms. Overall we can conclude that  and the NMF-based methods generally cannot recover the structure from subtropical data; that is, we cannot use the existing methods as a substitute for finding the max-times structure, neither for the reconstruction nor for the prediction tasks.

5.3 Real-world experiments

The main purpose of the real-world experiments is to study to what extent  and  can find max-times structure in various real-world datasets. Having established with the synthetic experiments that both algorithms are capable of finding the structure when it is present, here we look at what kind of results they obtain on real-world data.

It is probably unrealistic to expect real-world datasets to have “pure” max-times structure, as in the synthetic experiments. Rather, we expect  to be the best method (in terms of reconstruction error), and our algorithms to obtain reconstruction errors comparable to those of the NMF-based methods. We will also verify that the results on the real-world datasets are intuitive.

The datasets

represents a linear program.Footnote 5 It is available from the University of Florida Sparse Matrix CollectionFootnote 6 (Davis and Hu 2011).

is a brute force disjoint product matrix in tree algebra on n nodes.Footnote 7 It can be obtained from the same repository as .

contains weather records for various locations in Europe (full description can be found in Sect. 1).

is a nerdiness personality test that uses different attributes to determine the level of nerdiness of a person.Footnote 8 It contains answers by 1418 respondents to a set of 36 questions that asked them to self-assess various statements about themselves on a scale of 1 to 7. We preprocessed the input matrix by dividing each column by its standard deviation and subtracting its mean. To make sure that the data is nonnegative, we subtracted the smallest value of the obtained normalized matrix from every element.
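A minimal NumPy sketch of this column normalization (the function name is ours; subtracting the column mean after the division is equivalent to standard z-scoring):

```python
import numpy as np

def normalize_columns_nonnegative(X):
    """Divide each column by its standard deviation, subtract its mean,
    then shift the whole matrix so that its smallest value becomes zero."""
    X = X / X.std(axis=0, keepdims=True)
    X = X - X.mean(axis=0, keepdims=True)
    return X - X.min()
```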

is a subset of the Extended Yale Face collection of face images (Georghiades et al. 2000). It consists of 32-by-32-pixel images under different lighting conditions. We used the preprocessed data by Xiaofei He et al.Footnote 9 We selected a subset of pictures with lighting from the left, and then preprocessed the input matrix by first subtracting from every column its smallest element and then dividing the column by its standard deviation.

is a subset of the 20Newsgroups dataset,Footnote 10 containing the usage of 800 words over 400 posts for 4 newsgroups.Footnote 11 Before running the algorithms we represented the dataset as a TF-IDF matrix, and then scaled it by dividing each entry by the greatest entry in the matrix.

is a land registry house price index.Footnote 12 Rows represent months, columns are locations, and entries are residential property price indices. We preprocessed the data by first dividing each column by its standard deviation and then subtracting its minimum, so that each column has minimum 0.

is a collection of user ratings for a set of movies. The original datasetFootnote 13 consists of 100 000 ratings from 1000 users on 1700 movies, with ratings ranging from 1 to 5. In order to be able to perform cross-validation on it, we had to preprocess it by removing users that rated fewer than 10 movies and movies that were rated fewer than 5 times. After that we were left with 943 users, 1349 movies, and 99 287 ratings.
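A sketch of this filtering step for a users-by-movies matrix with 0 marking a missing rating (the function name and the single-pass simplification are ours; a single pass may need to be repeated until no row or column falls below the thresholds):

```python
import numpy as np

def filter_ratings(R, min_user_ratings=10, min_movie_ratings=5):
    """Drop users with too few ratings and movies rated too few times."""
    rated = R > 0
    keep_users = rated.sum(axis=1) >= min_user_ratings
    keep_movies = rated.sum(axis=0) >= min_movie_ratings
    return R[np.ix_(keep_users, keep_movies)]
```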

The basic properties of these data sets are listed in Table 1.

Table 1 Basic properties of the real-world datasets

5.3.1 Quantitative results: reconstruction error, sparsity, convergence, and runtime

The following experiments are meant to test and , and how they compare to other methods, such as and NMF. Table 2 provides the relative Frobenius reconstruction errors for various real-world data sets, as well as ranks used for factorization.Footnote 14 Since there is no ground truth for these datasets, the ranks are chosen based mainly on the size of the data and our intuition on what the true rank should be. is, as expected, consistently the best method, followed by and . generally lands in the middle of the pack of the NMF methods, which suggests that it is capable of finding max-times structure that is comparable to what NMF-based methods provide. Consequently, we can study the max-times structure found by , knowing that it is (relatively) accurate. On the other hand, has a high reconstruction error. The discrepancy between ’s and ’s results indicates that the datasets used cannot be represented using “pure” subtropical structure. Rather, they are either a mix of NMF and subtropical patterns or have relatively high levels of continuous noise.

Table 2 Reconstruction error for various real-world datasets

The sparsity of the factors for real-world data is presented in Table 3 (we do not include the sparsities for and as they were all 0). Here, often returns the second-sparsest factors (behind only ), but with and , and obtains sparser decompositions.

Table 3 Factor sparsity for various real-world datasets

We also studied the convergence behavior of  using some of the real-world data sets. The results can be seen in Fig. 13, where we plot the relative error with respect to the iterations over the main for-loop in . As we can see, in both cases  obtains a good reconstruction error already after a few full cycles, with the remaining iterations providing only minor improvements. We can deduce that  quickly reaches an acceptable solution.

Fig. 13

Convergence rate of  for two real-world datasets. Each iteration is a single run of ; that is, if a factorization has rank k, then one full cycle corresponds to k iterations. a . b

To give some idea of the speed of the algorithms, we ran each of the competing methods on some of the real-world datasets. The runtime of each algorithm (in seconds) is shown in Table 4, where we report the mean and the standard deviation over 5 runs. All tests were performed on a Linux machine with an Intel Xeon E5530 CPU with 16 cores at 2.40 GHz, although  and  were only utilizing one core. As we can see, the simplest methods, such as  and , are also the fastest, while more involved algorithms, such as  or , take much longer to run. It is worth noting that  and  are written in Matlab, and their performance could potentially be improved significantly by implementing the time-critical parts in C or another low-level programming language.

Table 4 The average runtime in seconds and standard deviation of the algorithms for various real-world datasets

Prediction

Here we investigate how well both  and  can predict missing values in the data. We used three real-world datasets: a user–movie rating matrix , a brute force disjoint product matrix in tree algebra and , which represents a linear program. All these matrices are integer-valued, and hence we also round the results of all methods to the nearest integer. We compare the results of our methods against  and . The choice of  is motivated by its ability to ignore missing elements in the input data and its generally good performance in the previous tests. There is only one caveat:  sometimes produces very high spikes for some elements in the matrix. These spikes do not cause much of a problem for prediction, but they seriously deteriorate the results of  with respect to the various distance measures. For this reason we always ignore such elements. While this way of comparing is obviously not completely fair towards the other methods, it can serve as a rough upper bound for the performance that is possible with NMF-based algorithms. Comparing against the other methods is obviously not fair, as they are not designed to deal with missing values, but we still present the results of  for completeness.

On we perform standard cross-validation tests, where a random selection of elements is chosen as a holdout set and removed from the data. The data has 943 users, each having rated from 19 to 648 movies. A holdout set is chosen by sampling uniformly at random 5 ratings from each user. We run the algorithms, while treating the elements from the holdout set as missing values, and then compare the reconstructed matrices to the original data. This procedure is repeated 10 times.
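One way to draw such a per-user holdout mask (the helper name and the use of 0 for a missing rating are our assumptions):

```python
import numpy as np

def sample_holdout(R, per_user=5, rng=None):
    """Pick `per_user` rated entries per user uniformly at random and
    return them as a boolean mask over R (0 = missing rating)."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros_like(R, dtype=bool)
    for u in range(R.shape[0]):
        rated = np.flatnonzero(R[u] > 0)
        mask[u, rng.choice(rated, size=per_user, replace=False)] = True
    return mask
```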

To get a more complete view of how good the predictions are, we report various measures of quality: Frobenius error, root mean square error (RMSE), reciprocal rank, Spearman’s \(\rho \), mean absolute error (MAE), Jensen–Shannon divergence (JS), optimistic reciprocal rank, Kendall’s \(\tau \), and prediction accuracy. The prediction accuracy allows us to see whether the methods are capable of recovering the missing user ratings. The remaining tests can be divided into two categories. The first one, which comprises the Frobenius error, root mean square error, mean absolute error, and Jensen–Shannon divergence, aims to quantify the distance between the original data and the reconstructed matrix. The second group of tests measures the correlation between the rankings of movies for each user. It includes Spearman’s \(\rho \), Kendall’s \(\tau \), the reciprocal rank, and the optimistic reciprocal rank.

All these measures are well known, with perhaps only the reciprocal rank requiring some explanation. Let us first denote by U the set of all users. In the following, for each user \(u\in U\) we only consider the set M(u) of movies that this user has rated and that belong to the holdout set. The ratings by user u induce a natural ranking on M(u). On the other hand, the algorithms produce approximations \(r'(u, m)\) to the true ratings \(r(u, m)\), which also induce a corresponding ranking of the movies. The reciprocal rank is a convenient way of comparing the rankings obtained by the algorithms to the original one. For any user \(u \in U\), denote by H(u) the set of movies that this user rated the highest (that is, \(H(u) = \lbrace m\in M(u) \, : \, r(u, m) = \max _{m'\in M(u)} r(u, m') \rbrace \)). The reciprocal rank for user u is now defined as

(23)

where \(R(u, m)\) is the rank of the movie m within M(u) according to the rating approximations given by the algorithm in question. The mean reciprocal rank is then defined as the average of the reciprocal ranks over all users. When computing the ranks \(R(u, m)\), all tied elements receive the same rank, which is computed by averaging. That means that if, say, movies \(m_1\) and \(m_2\) are tied for ranks 2 and 3, then they both receive rank 2.5. An alternative is to always assign the smallest possible rank; in the above example both \(m_1\) and \(m_2\) would receive rank 2. When the ranks \(R(u, m)\) are computed in this way, equation (23) defines the optimistic reciprocal rank.
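For illustration, a sketch of a per-user reciprocal rank with both tie-handling variants; since (23) is not reproduced above, averaging \(1/R(u, m)\) over H(u) is our assumption, as are the function and argument names.

```python
import numpy as np
from scipy.stats import rankdata

def reciprocal_rank(true_r, pred_r, optimistic=False):
    """Reciprocal rank for a single user over their holdout movies M(u).

    Assumption: we average 1/R(u, m) over the user's top-rated movies
    H(u); ties in the predicted ranking are averaged, or assigned the
    smallest possible rank in the "optimistic" variant.
    """
    method = 'min' if optimistic else 'average'
    R = rankdata(-np.asarray(pred_r), method=method)  # rank 1 = highest prediction
    H = np.asarray(true_r) == np.max(true_r)          # the user's top-rated movies
    return float(np.mean(1.0 / R[H]))
```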

Table 5 Comparison between the predictive power of different methods on the data

For each test, Table 5 shows the mean and the standard deviation of the results of each algorithm. In addition we report the p-value based on the Wilcoxon signed-rank test, which shows whether the advantage of one method over another is statistically significant. We say that a method A is significantly better than a method B if the p-value is \(<0.05\). It is unreasonable to report the p-value for every method pair; instead we only show p-values involving the best method. For each method, the value given next to it is the p-value of the comparison between this method and the best method.
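Such a paired test can be run, for instance, with SciPy; the score vectors below are hypothetical and only illustrate the call.

```python
from scipy.stats import wilcoxon

# Paired per-repetition scores of the best method and one competitor
# (hypothetical numbers, for illustration only).
best  = [0.212, 0.208, 0.215, 0.210, 0.209, 0.211, 0.214, 0.207, 0.213, 0.210]
other = [0.221, 0.219, 0.224, 0.218, 0.220, 0.222, 0.225, 0.217, 0.223, 0.219]

stat, p_value = wilcoxon(best, other)
significant = p_value < 0.05   # the threshold used in the text
```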

is significantly better for the Frobenius error, root mean square error, mean absolute error, Jensen–Shannon divergence, and accuracy. For the remaining tests the results are less clear, with  winning on the reciprocal rank,  taking the optimistic reciprocal rank, and  being better on the Spearman’s \(\rho \) and Kendall’s \(\tau \) tests. It should be noted, though, that the victories of  on the Spearman’s \(\rho \) and Kendall’s \(\tau \) tests, as well as ’s on the reciprocal rank, are not statistically significant, as the p-values are quite high. In summary, our experiments show that  is significantly better in the tests that measure the direct distance between the original and the reconstructed matrices, as well as in the prediction accuracy, whereas for the ranking experiments it is difficult to give any of the algorithms an edge.

Table 6 Comparison between the predictive power of different methods on the data

For  and  we also perform cross-validation, where on each fold we take 10% of the nonzero values in the data as a holdout set and then try to predict them. In total there are 5 folds. Unlike , the  and  datasets are not as readily interpretable, as they were generated from abstract mathematical structures. Ranking, in particular, makes no sense for them, and hence we do not perform any ranking-associated experiments. The results for  are shown in Table 6 and for  in Table 7. It is apparent that on   performs significantly better than any other method, winning on all metrics. As discussed earlier, however, this should be taken with a grain of salt, as we ignore the elements where  produced unreasonably large values; without this preprocessing its results are much worse than those of . This presents evidence, although not conclusive, that the dataset has less subtropical structure than . In  the p-value is 0.004 for all metrics, which is a result of the particular number of folds (5) that we used; that this number appears everywhere in the table simply indicates that  was better than any other method on every fold with respect to all measures. With  the roles are reversed, and this time  is clearly the best method, winning on all metrics and on all folds, just as  did on .

Table 7 Comparison between the predictive power of different methods on the data

5.3.2 Interpretability of the results

The crux of using max-times factorizations instead of standard (nonnegative) ones is that the factors are supposed to exhibit the “winner-takes-it-all” structure instead of the “parts-of-whole” structure. To demonstrate this, we analysed the results on four different datasets: , , , and . The  dataset is explained below.

We plotted the left factor matrices for the  data for  and  in Fig. 14. At first, it might look like  provides more interpretable results, as most factors are easily identifiable as faces. This, however, is not a very interesting result: we already knew that the data has faces, and many factors in ’s result are simply “prototypical” faces. The results of  are harder to identify at first sight. Upon closer inspection, though, one can see that they identify areas that are lighter in the different images, that is, have higher grayscale values. These factors capture the variation in lighting across the different photos, and can reveal information we did not know a priori. In addition, almost every one of ’s factors contains one or two main features of the face (such as the nose, left eye, or right cheek). In other words, while NMF’s patterns are for the most part close to fully formed faces,  finds independent fragments that indicate the direction of the lighting and/or contain some of the main features of a face.

Further, as seen in Table 2, obtains a better reconstruction error than with this data, confirming that these factors are indeed useful to recreate the data.

Fig. 14

finds the dominant patterns from the data. Pictured are the left factor matrices for the data. a . b

Table 8 Top three attributes for the first two factors of
Fig. 15

factors in the data. The factor vectors are normalized to take values from the unit interval and darker shades indicate higher values. a left-hand factors. b left-hand factors. c right-hand factors. d right-hand factors

In order to interpret the  results, we first observe that each column represents a single personality attribute. Denote by  the obtained approximation of the original matrix. For each rank-1 factor and each column, we define the score \(\sigma (i)\) as the number of elements in that column that are determined (won) by the factor. By sorting the attributes in descending order of \(\sigma (i)\) we obtain a relative ranking of the attributes for a given factor. The results are shown in Table 8. The first factor clearly shows introverted tendencies, while the second one can be summarized as having interests in fiction and games.
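A sketch of this scoring for a max-times model with factors B (n-by-k) and C (k-by-m); the names are ours, and ties between components are broken arbitrarily by argmax.

```python
import numpy as np

def winner_counts(B, C):
    """For each rank-1 component i and each column j, count how many
    entries of column j of the max-times reconstruction are determined
    (won) by component i."""
    n, k = B.shape
    blocks = B[:, :, None] * C[None, :, :]      # shape (n, k, m): the k rank-1 matrices
    winners = np.argmax(blocks, axis=1)         # shape (n, m): winning component per entry
    counts = np.zeros((k, C.shape[1]), dtype=int)
    for i in range(k):
        counts[i] = (winners == i).sum(axis=0)  # sigma(i) for every column/attribute
    return counts
```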

Figure 15 shows all of the factors for the  data, as obtained by  and  (the best NMF method in Table 2). Figure 15a, b shows the left-hand factors found by  and , respectively, plotted on the map. Darker colours indicate higher values, which can be interpreted as “more important”. The right-hand factors are presented in Fig. 15c, d, respectively. Here, each row corresponds to a factor, and each column to a single observation column from the original data (that is, columns 1–12 represent average low temperatures for each month, columns 13–24 average high temperatures, columns 25–36 daily means, and columns 37–48 average monthly precipitation). Again, higher values can be seen as having more importance. Recall that a pattern is formed by taking the outer product of a single left-hand factor and the corresponding right-hand factor. It is easy to see that the largest (and thus most important) values in a pattern are those that are products of high values in both the left-hand and the right-hand factors.

The  factors have fewer high values (dark colours; all factors are normalized to the unit interval), while for  there are more large values in each factor. This highlights the difference between the subtropical and the normal algebra: in the normal algebra, if you sum two large values, the result is even larger, whereas in the subtropical algebra the result is no larger than the largest of the summands. In decompositions, this means that  cannot have overlapping high values in its factors; instead it has to split its factors into mostly non-overlapping parts. , on the other hand, can have overlap, and hence its factors can share some phenomena. For instance, the seventh factor of  clearly indicates areas of high precipitation (cf. Fig. 1). The same phenomenon is split into many factors by  (at least the third, sixth, and seventh factors), each mostly explaining areas with higher precipitation during different parts of the year. While many elements in the right-hand factors of  are nonzero, that does not mean that all of them are of equal importance: because some of them are dominated by larger features, they do not influence the final outcome. Generally, since larger values are more likely to make a contribution than smaller ones, they should be considered more important when interpreting the data.

The possibility for the factors to overlap is not always desirable, but in some applications it can be almost necessary. Consider, for example, mammal species’ co-location data. This dataset, called , is a matrix whose rows and columns correspond to locations in Europe, and for every column-row pair, the corresponding entry represents the degree to which the sets of mammals inhabiting them overlap. This datasetFootnote 15 was obtained from the original binary location-species matrix (see Mitchell-Jones et al. 1999) by multiplying it with its transpose and then normalizing by dividing each column by its maximal element. The obtained matrix has 2670 rows and columns and a density of 91%. Due to its special nature, we use it only in this experiment to provide intuition about the subtropical factorizations.
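A sketch of this construction for a binary location-by-species matrix S (the names are ours):

```python
import numpy as np

def colocation_matrix(S):
    """Location-by-location co-location matrix from a binary
    location-by-species matrix S: multiply by the transpose and
    divide each column by its maximal element (assumed nonzero)."""
    C = (S @ S.T).astype(float)
    return C / C.max(axis=0, keepdims=True)
```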

Fig. 16

Values in the factors by in the data plotted on a map. Every factor is normalized to take values from the unit interval and darker shades indicate higher values

Fig. 17

Values in the factors by in the data plotted on a map. Every factor is normalized to take values from the unit interval and darker shades indicate higher values

The factors obtained by  on the  data are depicted in Fig. 16, where we can see that many of these factors cover the central parts of the European Plain, extending a bit south to cover most of Germany. There are, naturally, many mammal species that inhabit the whole European Plain, and the east–west change is gradual. This gradual change is easier to model in the subtropical algebra, as we do not have to worry about the sums of the factors getting too large. Factors 1–6 model various aspects of the east–west change, emphasizing either the south–west, central, or eastern parts of the plain. Similarly, the ninth factor explains mammal species found in the UK and southern Scandinavia, while the tenth factor covers species found in Scotland, Scandinavia, and the Baltic countries, indicating that these areas have roughly the same biome. If we compare these results to those of  (Fig. 17), it becomes evident that the latter tries to find relatively disjoint factors and avoids factor overlap whenever possible. This is because in NMF any feature that is nonzero at a given data point is always “active”, in the sense that it contributes to the final value. That being said,  does find some interesting patterns, such as the rather distinct factors representing France and the Scandinavian peninsula.

6 Related work

Here we present earlier research related to subtropical matrix factorization. We start by discussing classic methods, such as SVD and NMF, that have long been used for various data analysis tasks, and then continue with approaches that use idempotent structures. Since the tropical algebra is very closely related to the subtropical algebra, and since there has been a lot of research on it, we dedicate the last subsection to discussing it in more detail.

6.1 Matrix factorization in data analysis

Matrix factorization methods play a crucial role in data analysis as they help to find low-dimensional representations of the data and uncover the underlying latent structure. A classic example of a real-valued matrix factorization is the singular value decomposition (SVD) (see e.g. Golub and Van Loan 2012), which is very well known and finds extensive applications in various disciplines, such as signal processing and natural language processing. The SVD of a real n-by-m matrix is a factorization of the form \(\varvec{A} = \varvec{U} \varvec{\Sigma } \varvec{V}^{T}\), where \(\varvec{U}\) and \(\varvec{V}\) are orthogonal matrices and \(\varvec{\Sigma }\) is a rectangular diagonal matrix with nonnegative entries. An important property of SVD is that it provides the best low-rank approximation of a given matrix with respect to the Frobenius norm (Golub and Van Loan 2012), giving rise to the so-called truncated SVD. This property is frequently used to separate important parts of the data from the noise. For example, it was used by Jha and Yadava (2011) to remove noise from sensor data in electronic nose systems. Another prominent usage of the truncated SVD is in dimensionality reduction (see for example Sarwar et al. 2000; Deerwester et al. 1990).

Despite SVD being so ubiquitous, there are some restrictions to its usage in data mining due to the possible presence of negative elements in the factors. In many applications negative values are hard to interpret, and thus other methods have to be used. Nonnegative matrix factorization (NMF) is a way to tackle this problem. For a given nonnegative real matrix, the NMF problem is to find a decomposition into two factor matrices that are also nonnegative. Its applications are extensive and include text mining (Pauca et al. 2004), document clustering (Xu et al. 2003), pattern discovery (Brunet et al. 2004), and many others. This area drew considerable attention after a publication by Lee and Seung (1999), where they provided an efficient algorithm for solving the NMF problem. It is worth mentioning that even though the paper by Lee and Seung is perhaps the most famous in the NMF literature, it was not the first one to consider this problem; earlier works include Paatero and Tapper (1994) (see also Paatero 1997), Paatero (1999), and Cohen and Rothblum (1993). Berry et al. (2007) provide an overview of NMF algorithms and their applications. There exist various flavours of NMF that impose different constraints on the factors; for example, Hoyer (2004) used sparsity constraints. Though both NMF and SVD perform approximations of a fixed rank, there are also other ways to enforce a compact representation of the data. For example, in maximum-margin matrix factorization constraints are imposed on the norms of the factors. This approach was exploited by Srebro et al. (2004), who showed it to be a good method for predicting unobserved values in a matrix. The authors also indicate that posing constraints on the factor norms, rather than on the rank, yields a convex optimization problem, which is easier to solve.

6.2 Idempotent semirings

The concept of the subtropical algebra is relatively new, and as far as we know, its applications in data mining are not yet well studied. Indeed, its only usage for data analysis that we are aware of was by Weston et al. (2013), where it was used as a part of a model for collaborative filtering. The authors modeled users as a set of vectors, where each vector represents a single aspect about the user (e.g. a particular area of interest). The ratings are then reconstructed by selecting the highest scoring prediction using the \(\max \) operator. Since their model uses \(\max \) as well as the standard plus operation, it stands on the border between the standard and the subtropical worlds.

Boolean algebra, despite being limited to the binary set \(\{0,1\}\), is related to the subtropical algebra by virtue of having the same operations, and is thus a restriction of the latter to \(\{0,1\}\). By the same token, when both factor matrices are binary, their subtropical product coincides with the Boolean product, and hence the Boolean matrix factorization can be seen as a degenerate case of the subtropical matrix factorization problem. The dioid properties of the Boolean algebra can be checked trivially. The motivation for the Boolean matrix factorization comes from the fact that in many applications data is naturally represented as a binary matrix (e.g. transaction databases), which makes it reasonable to seek decompositions that preserve the binary character of the data. The conceptual and algorithmic analysis of the problem was done by Miettinen (2009), with the focus mainly on the data mining perspective of the problem. For a linear algebra perspective see Kim (1982), where the emphasis is put on the existence of exact decompositions. A number of algorithms have been proposed for solving the BMF problem (Miettinen et al. 2008; Lu et al. 2008; Lucchese et al. 2014; Karaev et al. 2015).
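This degenerate-case relationship is easy to check numerically; the snippet below compares the two products on random 0/1 factors (all names are ours).

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.integers(0, 2, size=(5, 3))   # binary left factor
C = rng.integers(0, 2, size=(3, 4))   # binary right factor

# Subtropical (max-times) product: max over k of B[i, k] * C[k, j]
subtropical = np.max(B[:, :, None] * C[None, :, :], axis=1)

# Boolean product: OR over k of (B[i, k] AND C[k, j])
boolean = (B[:, :, None] & C[None, :, :]).any(axis=1).astype(int)

assert np.array_equal(subtropical, boolean)  # identical for 0/1 factors
```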

6.3 Tropical algebra

Another close cousin of the max-times algebra is the max-plus, or so-called tropical, algebra, which uses plus in place of multiplication. It is also a dioid due to the idempotent nature of the \(\max \) operation. As was mentioned earlier, the two algebras are isomorphic, and hence many of their properties are identical (see Sects. 2 and 3 for more details).

Despite the theory of the tropical algebra being relatively young, it has been thoroughly studied in recent years. The reason for this is that it finds extensive applications in various areas of mathematics and other disciplines. An example of such a field is discrete event systems (DES) (Cassandras and Lafortune 2008), where the tropical algebra is ubiquitously used for modeling (see e.g. Baccelli et al. 1992; Cohen et al. 1999). Other mathematical disciplines where the tropical algebra plays a crucial role are optimal control (Gaubert 1997), asymptotic analysis (Dembo and Zeitouni 2010; Maslov 1992; Akian 1999), and decidability (Simon 1978, 1994).

Research on tropical matrix factorization is of interest to us because of the above-mentioned isomorphism between the two algebras. However, as was explained in Sect. 3, approximate matrix factorizations are not directly transferable, as the errors can differ dramatically. It should be mentioned that in the general case the problem of tropical matrix factorization is NP-hard (see e.g. Shitov 2014). De Schutter and De Moor (2002) demonstrated that if the max-plus algebra is extended in such a way that there is an additive inverse for each element, then it is possible to solve many of the standard matrix decomposition problems. Among other results the authors obtained max-plus analogues of the QR and SVD decompositions. They also claimed that the techniques they propose can be readily extended to other types of classic factorizations (e.g. the Hessenberg and LU decompositions).

The problem of solving tropical linear systems of equations arises naturally in numerous applications, and is also closely related to matrix factorization. In order to illustrate this connection, assume that we are given a tropical matrix and one of the factors . Then the other factor can be found by solving the following set of problems

(24)

Each problem in (24) requires “approximately” solving a system of tropical linear equations. The minus operation in (24) does not belong to the tropical semiring, so the approximation here should be understood in terms of minimizing the classical distance. The general form of tropical linear equations

(25)

is not always solvable (see e.g. Gaubert 1997); however, various techniques exist for checking the existence of a solution for particular cases of (25).

For equations of the form  the feasibility can be established, for example, through the so-called matrix residuation. There is a general result that for an n-by-m matrix over a complete idempotent semiring, the existence of a solution can be checked in O(nm) time (see Gaubert 1997). Although the tropical algebra is not complete, there is an efficient way of determining whether a solution exists (Cuninghame-Green 1979; Zimmermann 2011). It was shown by Butkovič (2003) that this type of tropical equation is equivalent to the set cover problem, which is known to be NP-hard. This directly affects the max-times algebra through the above-mentioned isomorphism and makes the problem of precisely solving max-times linear systems of the form  infeasible for high dimensions.

Homogeneous equations can be solved using the elimination method, which is based on the fact that the set of solutions of a homogeneous system is a finitely generated semimodule (Butkovič and Hegedüs 1984) (independently rediscovered by Gaubert 1992). If only a single solution is required, then according to Gaubert (1997), a method by Walkup and Borriello (1998) is usually the fastest in practice.

Now let \(\varvec{A}\) be a tropical square matrix of size \(n\times n\). For complete idempotent semirings a solution to the equation \(\varvec{x} = \varvec{A}\varvec{x} \oplus \varvec{b}\) is given by \(\varvec{x} = \varvec{A}^{*}\varvec{b}\) (see e.g. Salomaa and Soittola 2012), where the star operator is defined as \(\varvec{A}^{*} = \varvec{I} \oplus \varvec{A} \oplus \varvec{A}^{2} \oplus \cdots \)

Since the tropical semiring is not complete (it is missing the \(\infty \) element), \(\varvec{A}^{*}\) cannot always be computed. However, when there are no positive-weight circuits in the graph defined by \(\varvec{A}\), we have \(\varvec{A}^{*} = \varvec{I} \oplus \varvec{A} \oplus \cdots \oplus \varvec{A}^{n-1}\), and all entries of \(\varvec{A}^{*}\) belong to the tropical semiring (Baccelli et al. 1992). Computing the star operator takes time \(O(n^3)\) (see e.g. Gondran and Minoux 1984a; Gaubert 1997).
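A sketch of one standard way to compute this closure is a Floyd–Warshall-style \(O(n^3)\) iteration in the max-plus semiring; the function name is ours, and the no-positive-circuits assumption from above is required for the result to be finite where expected.

```python
import numpy as np

def maxplus_star(A):
    """Kleene star A* = I (+) A (+) A^2 (+) ... in the max-plus semiring,
    computed with a Floyd-Warshall-style O(n^3) iteration. Assumes the
    graph of A has no positive-weight circuits; use -inf for "no edge"."""
    S = A.astype(float).copy()
    n = S.shape[0]
    for k in range(n):                                # allow paths through node k
        S = np.maximum(S, S[:, [k]] + S[[k], :])
    np.fill_diagonal(S, np.maximum(np.diag(S), 0.0))  # add the identity: 0 on the diagonal
    return S
```

A solution of \(\varvec{x} = \varvec{A}\varvec{x} \oplus \varvec{b}\) is then obtained as the max-plus product of this closure with \(\varvec{b}\), that is, componentwise \(\max _j (S_{ij} + b_j)\).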

Another important direction of research is the tropical eigenvalue problem \(\varvec{A}\varvec{x} = \lambda \varvec{x}\). Tropical analogues of the Perron–Frobenius theorem (see e.g. Vorobyev 1967; Maslov 1992) and of the Collatz–Wielandt formula (Bapat et al. 1995; Gaubert 1992) have been developed. For a general overview of the results in \((\max , +)\) spectral theory, see for example Gaubert (1997).

Tropical algebra and tropical geometry were used by Gärtner and Jaggi (2008) to construct a tropical analogue of an SVM. Unlike in the classical case, tropical SVMs are localized, in the sense that the kernel at any given point is not influenced by all the support vectors. Their work also utilizes the fact that tropical hyperplanes are somewhat more complex than their counterparts in classical geometry, which makes it possible to do multi-category classification with a single hyperplane.

7 Conclusions

Subtropical low-rank factorizations are a novel approach for finding latent structure in nonnegative data. The factorizations can be interpreted using the winner-takes-it-all interpretation: the value of an element in the final reconstruction depends only on the largest of the values in the corresponding elements of the rank-1 components (compare this to NMF, where the value in the reconstruction is the sum of the corresponding elements). That the factorizations are different does not necessarily mean that they are better in terms of reconstruction error, although they can yield a lower reconstruction error than even SVD. It does mean, however, that they find different structure from the data. This is an important advantage, as it allows the data analyst to use both the classical factorizations and the subtropical factorizations to get a broader understanding of the kinds of patterns that are present in the data.

Working in the subtropical algebra is harder than in the normal algebra, though. The various definitions of the rank, for example, do not agree, and computing many of them, including the subtropical Schein rank, which is arguably the most useful one for data analysis, is computationally hard. That said, our proposed algorithms,  and , can find the subtropical structure when it is present in the data. Not every dataset has subtropical structure, though, and due to the complexity of finding the optimal subtropical factorization we cannot distinguish between the cases where our algorithms fail to find the latent subtropical structure and those where it does not exist. Based on our experiments with synthetic data, our hypothesis is that the failure to find a good factorization more probably indicates the lack of subtropical structure than a failure of the algorithms. Naturally, more experiments using data with known subtropical structure should improve our confidence in the correctness of this hypothesis.

The presented algorithms are heuristics. Developing algorithms that achieve better reconstruction error is naturally an important direction of future work. In our framework, this hinges on the task of finding the rank-1 components. In addition, the scalability of the algorithms could be improved. A potential direction could be to take into account the sparsity of the factor matrices in dominated decompositions. This could allow one to concentrate only on the non-zero entries in the factor matrices.

The connection between Boolean and (sub-)tropical factorizations raises potential directions for future work. The continuous framework could allow for easier optimization in the Boolean algebra. Also, the connection allows us to model combinatorial structures (e.g. cliques in a graph) using subtropical matrices. This could enable novel approaches to finding such structures using continuous subtropical factorizations.