
Legendre decomposition for tensors

Mahito Sugiyama, Hiroyuki Nakahara and Koji Tsuda

Published 20 December 2019 © 2019 The Author(s). Published by IOP Publishing Ltd on behalf of SISSA Medialab srl
Citation: Mahito Sugiyama et al J. Stat. Mech. (2019) 124017. DOI: 10.1088/1742-5468/ab3196


Abstract

We present a novel nonnegative tensor decomposition method, called Legendre decomposition, which factorizes an input tensor into a multiplicative combination of parameters. Thanks to the well-developed theory of information geometry, the reconstructed tensor is unique and always minimizes the KL divergence from an input tensor. We empirically show that Legendre decomposition can more accurately reconstruct tensors than other nonnegative tensor decomposition methods.


Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Matrix and tensor decomposition is a fundamental technique in machine learning; it is used to analyze data represented in the form of multi-dimensional arrays and appears in a wide range of applications such as computer vision (Vasilescu and Terzopoulos 2002, 2007), recommender systems (Symeonidis 2016), signal processing (Cichocki et al 2015), and neuroscience (Beckmann and Smith 2005). The current standard approaches include nonnegative matrix factorization (Lee and Seung 1999, 2001) for matrices and CANDECOMP/PARAFAC (CP) decomposition (Harshman 1970) or Tucker decomposition (Tucker 1966) for tensors. CP decomposition compresses an input tensor into a sum of rank-one components, and Tucker decomposition approximates an input tensor by a core tensor multiplied by matrices. To date, matrix and tensor decomposition has been extensively analyzed, and there are a number of variations of such decomposition (Kolda and Bader 2009), where the common goal is to approximate a given tensor by a smaller number of components, or parameters, in an efficient manner.

However, despite the recent advances in decomposition techniques, a learning theory that can systematically define decomposition for tensors of any order, including vectors and matrices, is still under development. Moreover, it is well known that CP and Tucker decomposition involve non-convex optimization, so global convergence is not guaranteed. Although there are a number of extensions that transform the problem into a convex one (Liu et al 2013, Tomioka and Suzuki 2013), they require additional assumptions on the data, such as a bounded variance.

Here we present a new paradigm of matrix and tensor decomposition, called Legendre decomposition, based on information geometry (Amari 2016), to solve the above open problems of matrix and tensor decomposition. In our formulation, an arbitrary-order tensor is treated as a discrete probability distribution in a statistical manifold as long as it is nonnegative, and Legendre decomposition is realized as a projection of the input tensor onto a submanifold composed of reconstructable tensors. The key to this formulation is the partial order (Davey and Priestley 2002, Gierz et al 2003) on indices, which allows us to treat tensors of any order as probability distributions in the information geometric framework.

Legendre decomposition has the following remarkable property: it always finds the unique tensor that minimizes the Kullback–Leibler (KL) divergence from an input tensor. This is because Legendre decomposition is formulated as convex optimization, and hence we can directly use gradient descent, which guarantees global convergence, and the optimization can be further sped up by using a natural gradient (Amari 1998), as demonstrated in our experiments. Moreover, Legendre decomposition is flexible, as it can decompose sparse tensors by removing arbitrary entries beforehand, for example zeros or missing entries.

Furthermore, our formulation has a close relationship with statistical models, and can be interpreted as an extension of the learning of Boltzmann machines (Ackley et al 1985). This interpretation gives new insight into the relationship between tensor decomposition and graphical models (Yılmaz et al 2011, Yılmaz and Cemgil 2012, Chen et al 2018) as well as the relationship between tensor decomposition and energy-based models (LeCun et al 2007). In addition, we show that the proposed formulation belongs to the exponential family, where the parameter $\theta$ used in our decomposition is the natural parameter, and $\eta$, used to obtain the gradient of $\theta$, is the expectation of the exponential family.

The remainder of this paper is organized as follows. We introduce Legendre decomposition in section 2. We define the decomposition in section 2.1, formulate it as convex optimization in section 2.2, describe algorithms in section 2.3, and discuss the relationship with other statistical models in section 2.4. We empirically examine the performance of our method in section 3, and summarize our contribution in section 4.

2. The Legendre decomposition

We introduce Legendre decomposition for tensors. We begin with a nonnegative Nth-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}_{\geqslant 0}$. To simplify the notation, we write the entry $x_{i_1 i_2 \dots i_N}$ as $x_v$ with the index vector $v = (i_1, i_2, \dots, i_N) \in [I_1] \times [I_2] \times \dots \times [I_N]$, where each $[I_k] = \{1, 2, \dots, I_k\}$. To treat $\mathcal{X}$ as a discrete probability mass function in our formulation, we normalize $\mathcal{X}$ by dividing each entry by the sum of all entries, yielding $\mathcal{P} = \mathcal{X} / \sum_{v} x_{v}$. In the following, we always work with a normalized tensor $\mathcal{P}$.
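
As a concrete illustration of this setup, the following minimal sketch (our own, assuming NumPy; the array X and its shape are hypothetical, not from the paper) normalizes a nonnegative tensor into the probability mass function used throughout the paper.

```python
import numpy as np

# Minimal sketch (not the authors' code): normalize a nonnegative tensor X
# into a discrete probability mass function P, as described above.
rng = np.random.default_rng(0)
X = rng.random((3, 4, 5))      # hypothetical nonnegative third-order tensor
P = X / X.sum()                # p_v = x_v / sum_v x_v

assert abs(P.sum() - 1.0) < 1e-12   # P sums to one over the whole index set
```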

2.1. Definition

Legendre decomposition always finds the best approximation of a given tensor $\mathcal{P}$. Our strategy is to choose an index set $B \subseteq [I_1] \times [I_2] \times \dots \times [I_N]$ as a decomposition basis, where we assume $(1, 1, \dots, 1) \not\in B$ for a technical reason, and to approximate the normalized tensor $\mathcal{P}$ by a multiplicative combination of parameters associated with $B$.

First we introduce the relation '$\leqslant$' between index vectors $u = (j_1, \dots, j_N)$ and $v = (i_1, \dots, i_N)$ as $u \leqslant v$ if $j_1 \leqslant i_1$, $j_2 \leqslant i_2$, ..., $j_N \leqslant i_N$. It is clear that this relation gives a partial order (Gierz et al 2003); that is, $\leqslant$ satisfies the following three properties for all $u, v, w$: (1) $v \leqslant v$ (reflexivity), (2) $u \leqslant v$, $v \leqslant u \Rightarrow u = v$ (antisymmetry), and (3) $u \leqslant v$, $v \leqslant w \Rightarrow u \leqslant w$ (transitivity). Each tensor is treated as a discrete probability mass function with the sample space $\Omega \subseteq [I_1] \times \dots \times [I_N]$. While it is natural to set $\Omega = [I_1] \times \dots \times [I_N]$, our formulation allows us to use any subset $\Omega \subseteq [I_1] \times \dots \times [I_N]$. Hence, for example, we can remove unnecessary indices, such as missing or zero entries of an input tensor $\mathcal{P}$, from $\Omega$. We define $\Omega^+ = \Omega \setminus \{(1, 1, \dots, 1)\}$.
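
In code, this partial order is a component-wise comparison of index vectors. The helper below is a small sketch of our own (the function name leq is hypothetical) that later sketches in this section reuse.

```python
def leq(u, v):
    """Partial order on index vectors: u <= v iff u_k <= v_k for every mode k."""
    return all(j <= i for j, i in zip(u, v))

# (1, 2, 1) precedes (2, 2, 3), while (2, 1, 1) and (1, 2, 1) are incomparable.
assert leq((1, 2, 1), (2, 2, 3))
assert not leq((2, 1, 1), (1, 2, 1)) and not leq((1, 2, 1), (2, 1, 1))
```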

We define a tensor $\mathcal{Q} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}_{\geqslant 0}$ as fully decomposable with $B \subseteq \Omega^+$ if every entry $q_{v}$ with $v \in \Omega$ can be represented in the form of

Equation (1)

$$q_{v} = \frac{1}{\exp(\psi(\theta))} \prod_{u \in B,\ u \leqslant v} \exp(\theta_{u}) = \exp\Biggl(\,\sum_{u \in B,\ u \leqslant v} \theta_{u} - \psi(\theta)\Biggr)$$

using $|B|$ parameters $(\theta_v)_{v \in B}$ with $\theta_{v} \in \mathbb{R}$ and the normalizer $\psi(\theta) \in \mathbb{R}$, which is always uniquely determined from the parameters $(\theta_v)_{v \in B}$ as

$$\psi(\theta) = \log \sum_{v \in \Omega} \exp\Biggl(\,\sum_{u \in B,\ u \leqslant v} \theta_{u}\Biggr),$$

so that $\sum_{v \in \Omega} q_v = 1$.

This normalization does not have any effect on the decomposition performance, but rather it is needed to formulate our decomposition as an information geometric projection, as shown in the next subsection. There are two extreme cases for a choice of a basis $B$: if $B = \emptyset$, a fully decomposable $\mathcal{Q}$ is always uniform; that is, $q_{v} = 1 / |\Omega|$ for all $v \in \Omega$. In contrast, if $B = \Omega^+$, any input $\mathcal{P}$ itself becomes decomposable.
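
To make equation (1) concrete, the following sketch (our own; function and variable names are hypothetical) builds a fully decomposable tensor Q on a sample space Ω from a basis B and parameters θ, computing the normalizer ψ(θ) so that the entries sum to one.

```python
import numpy as np
from itertools import product

def leq(u, v):
    # partial order from the previous sketch
    return all(j <= i for j, i in zip(u, v))

def decomposable_tensor(shape, B, theta, omega=None):
    """Sketch of equation (1): q_v = exp(sum_{u in B, u <= v} theta_u - psi(theta))."""
    if omega is None:
        # default sample space: all index vectors (1-based)
        omega = list(product(*[range(1, I + 1) for I in shape]))
    log_q = {v: sum(theta[u] for u in B if leq(u, v)) for v in omega}  # unnormalized
    psi = np.log(sum(np.exp(s) for s in log_q.values()))               # normalizer
    Q = np.zeros(shape)
    for v in omega:
        Q[tuple(i - 1 for i in v)] = np.exp(log_q[v] - psi)            # back to 0-based
    return Q, psi

# Toy example: a 2 x 3 matrix with two parameters.
Q, psi = decomposable_tensor((2, 3), B=[(1, 2), (2, 1)],
                             theta={(1, 2): 0.5, (2, 1): -0.3})
print(Q.sum())   # 1.0 up to rounding
```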

We now define Legendre decomposition as follows: given an input tensor $\mathcal{P} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}_{\geqslant 0}$, the sample space $\Omega \subseteq [I_1] \times [I_2] \times \dots \times [I_N]$, and a parameter basis $B \subseteq \Omega^+$, Legendre decomposition finds the fully decomposable tensor $\mathcal{Q} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}_{\geqslant 0}$ with $B$ that minimizes the KL divergence $D_{{\rm KL}}(\mathcal{P}, \mathcal{Q}) = \sum_{v \in \Omega} p_v \log (p_v / q_v)$ (figure 1(a)). In the next subsection, we introduce an additional parameter $(\eta_v)_{v \in B}$ and show that this decomposition is always possible via the dual parameters $(\,(\theta_v)_{v \in B}, (\eta_v)_{v \in B}\,)$ using information geometric analysis. Since $\theta$ and $\eta$ are connected via the Legendre transformation, we call our method Legendre decomposition.


Figure 1. (a) Overview of Legendre decomposition. (b) Illustration of $\theta$ and $\eta$ for a second-order tensor (matrix) when $B = [I_1] \times [I_2]$.


Legendre decomposition for second-order tensors (that is, matrices) can be viewed as a low rank approximation not of an input matrix $\mathcal{P}$ but of its entry-wise logarithm $\log \mathcal{P}$. To see this, define the parameter matrix $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2}$ of a fully decomposable matrix $\mathcal{Q}$ by $t_{v} = \theta_{v}$ if $v \in B$, $t_{(1, 1)} = -\psi(\theta)$, and $t_{v} = 0$ otherwise; that is, we fill zeros into the entries in $\Omega^+ \setminus B$. Then we have $\log q_{v} = \sum_{u \leqslant v} t_{u}$, meaning that the rank of $\log\mathcal{Q}$ coincides with the rank of $\mathcal{T}$. Therefore, if we use a decomposition basis $B$ that includes only $l$ rows (or columns), $\mathop{{\rm rank}}(\log\mathcal{Q}) \leqslant l$ always holds.
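
This rank bound can be checked numerically. The sketch below (our own, with hypothetical random parameter values) chooses a basis supported on the first two rows of a 5 × 6 matrix and verifies that the entry-wise logarithm of the resulting decomposable matrix has rank at most 2.

```python
import numpy as np
from itertools import product

def leq(u, v):
    return all(j <= i for j, i in zip(u, v))

I1, I2 = 5, 6
rng = np.random.default_rng(1)
# basis restricted to the first two rows (excluding (1, 1))
B = [(i1, i2) for i1 in (1, 2) for i2 in range(1, I2 + 1) if (i1, i2) != (1, 1)]
theta = {u: rng.normal() for u in B}

omega = list(product(range(1, I1 + 1), range(1, I2 + 1)))
log_q = {v: sum(theta[u] for u in B if leq(u, v)) for v in omega}
psi = np.log(sum(np.exp(s) for s in log_q.values()))
logQ = np.array([[log_q[(i1, i2)] - psi for i2 in range(1, I2 + 1)]
                 for i1 in range(1, I1 + 1)])

print(np.linalg.matrix_rank(logQ))   # at most 2, the number of rows used by B
```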

2.2. Optimization

We solve Legendre decomposition by formulating it as a convex optimization problem. Let us first assume that $B = \Omega^+ = \Omega \setminus \{(1, 1, \dots, 1)\}$, which means that any tensor is fully decomposable. Our definition in equation (1) can then be re-written as

Equation (2)

$$\log q_{v} = \sum_{u \in \Omega,\ u \leqslant v} \theta_{u}$$

with $- \psi(\theta) = \theta_{(1, 1, \dots, 1)}$, and the sample space $\Omega$ is a poset (partially ordered set) with respect to the partial order '$\leqslant$' with the least element $\bot = (1, 1, \dots, 1)$. Therefore our model belongs to the log-linear model on posets introduced by Sugiyama et al (2016, 2017), which is an extension of the information geometric hierarchical log-linear model (Amari 2001, Nakahara and Amari 2002). Each entry $q_v$ and the parameters $(\theta_v)_{v \in \Omega^+}$ in equation (2) directly correspond to those in equation (8) in Sugiyama et al (2017). According to theorem 2 in Sugiyama et al (2017), if we introduce $(\eta_v)_{v \in \Omega^+}$ such that

Equation (3)

$$\eta_{v} = \sum_{u \in \Omega,\ u \geqslant v} q_{u}$$

for each $v \in \Omega^+$ (see figure 1(b)), the pair $(\,(\theta_v)_{v \in \Omega^+}, (\eta_v)_{v \in \Omega^+}\,)$ is always a dual coordinate system of the set of normalized tensors $\boldsymbol{\mathcal{S}} = \{\mathcal{P} \mid 0 < p_v < 1~{\rm and}~\sum_{v \in \Omega} p_v = 1\}$ with respect to the sample space $\Omega$, as they are connected via the Legendre transformation. Hence $\boldsymbol{\mathcal{S}}$ becomes a dually flat manifold (Amari 2009).
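
Equation (3) translates directly into code: each η-coordinate is the total probability mass of the up-set of its index vector. A small sketch of our own (names hypothetical):

```python
import numpy as np
from itertools import product

def leq(u, v):
    return all(j <= i for j, i in zip(u, v))

def eta_coordinates(P, B):
    """Sketch of equation (3): eta_v = sum of p_u over all u >= v (1-based indices)."""
    omega = list(product(*[range(1, I + 1) for I in P.shape]))
    return {v: sum(P[tuple(i - 1 for i in u)] for u in omega if leq(v, u)) for v in B}

# Example on a random normalized 3 x 3 matrix.
rng = np.random.default_rng(2)
P = rng.random((3, 3))
P /= P.sum()
print(eta_coordinates(P, B=[(1, 2), (2, 1), (2, 2)]))
```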

Here we formulate Legendre decomposition as a projection of a tensor onto a submanifold. Suppose that $B \subseteq \Omega^+$ and let $\boldsymbol{\mathcal{S}}_{B}$ be the submanifold such that

$$\boldsymbol{\mathcal{S}}_{B} = \bigl\{\, \mathcal{Q} \in \boldsymbol{\mathcal{S}} \mid \theta_{v} = 0~{\rm for~all}~v \in \Omega^+ \setminus B \,\bigr\},$$

which is the set of fully decomposable tensors with $B$ and is an e-flat submanifold, as it has constraints on the $\theta$ coordinates (Amari (2016), chapter 2.4). Furthermore, we introduce another submanifold $\boldsymbol{\mathcal{S}}_{\mathcal{P}}$ for a tensor $\mathcal{P} \in \boldsymbol{\mathcal{S}}$ and $A \subseteq \Omega^+$ such that

$$\boldsymbol{\mathcal{S}}_{\mathcal{P}} = \bigl\{\, \mathcal{Q} \in \boldsymbol{\mathcal{S}} \mid \eta_{v} = \hat{\eta}_{v}~{\rm for~all}~v \in A \,\bigr\},$$

where $\hat{\eta}_v$ is given by equation (3) with $q_u$ replaced by $p_u$, which is an m-flat submanifold, as it has constraints on the $\eta$ coordinates.

The dually flat structure of $\boldsymbol{\mathcal{S}}$ with the dual coordinate systems $(\,(\theta_v)_{v \in \Omega^+}, (\eta_v)_{v \in \Omega^+}\,)$ leads to the following strong property: if $A = B$, that is, $(\Omega^+ \setminus B) \cup A = \Omega^+$ and $(\Omega^+ \setminus B) \cap A = \emptyset$, the intersection $\boldsymbol{\mathcal{S}}_{B} \cap \boldsymbol{\mathcal{S}}_{\mathcal{P}}$ is always a singleton; that is, the tensor $\mathcal{Q}$ such that $\mathcal{Q} \in \boldsymbol{\mathcal{S}}_{B} \cap \boldsymbol{\mathcal{S}}_{\mathcal{P}}$ always uniquely exists, and $\mathcal{Q}$ is the minimizer of the KL divergence from $\mathcal{P}$ (Amari (2009), theorem 3):

Equation (4)

$$\mathcal{Q} = \mathop{{\rm argmin}}_{\mathcal{R} \in \boldsymbol{\mathcal{S}}_{B}} D_{{\rm KL}}(\mathcal{P}, \mathcal{R}).$$

The transition from $\mathcal{P}$ to $\mathcal{Q}$ is called the m-projection of $\mathcal{P}$ onto $\boldsymbol{\mathcal{S}}_{B}$, and Legendre decomposition coincides with this m-projection (figure 2). In contrast, if some fully decomposable tensor $\mathcal{R} \in \boldsymbol{\mathcal{S}}_{B}$ is given, finding the intersection $\mathcal{Q} \in \boldsymbol{\mathcal{S}}_{B} \cap \boldsymbol{\mathcal{S}}_{\mathcal{P}}$ is called the e-projection of $\mathcal{R}$ onto $\boldsymbol{\mathcal{S}}_{\mathcal{P}}$. In practice, we use e-projection because the number of parameters to be optimized is $|B|$ in e-projection while it is $|\Omega \setminus B|$ in m-projection, and $|B| \leqslant |\Omega \setminus B|$ usually holds.


Figure 2. Projection in a statistical manifold.


The e-projection is always convex optimization, as the e-flat submanifold $\boldsymbol{\mathcal{S}}_B$ is convex with respect to $(\theta_v)_{v \in \Omega^+}$. More precisely,

$$D_{{\rm KL}}(\mathcal{P}, \mathcal{Q}) = \sum_{v \in \Omega} p_{v} \log \frac{p_{v}}{q_{v}} = \psi(\theta) - \sum_{v \in B} \hat{\eta}_{v}\theta_{v} - H(\mathcal{P}),$$

where $H(\mathcal{P}) = -\sum_{v \in \Omega} p_v \log p_v$ is the entropy of $\mathcal{P}$ and is independent of $(\theta_v)_{v \in \Omega^+}$. Since we have

$$\frac{\partial^2 \psi(\theta)}{\partial \theta_{u}\, \partial \theta_{v}} = \sum_{w \in \Omega,\ w \geqslant u,\ w \geqslant v} q_{w} - \eta_{u}\eta_{v},$$

which is an entry of a covariance matrix and hence yields a positive semidefinite Hessian,

$\psi(\theta)$ is convex and $D_{{\rm KL}}(\mathcal{P}, \mathcal{Q})$ is also convex with respect to $(\theta_v)_{v \in \Omega^+}$.

2.3. Algorithm

Here we present our two gradient-based optimization algorithms to solve the KL divergence minimization problem in equation (4). Since the KL divergence $D_{{\rm KL}}(\mathcal{P}, \mathcal{Q})$ is convex with respect to each $\theta_v$, the standard gradient descent shown in algorithm 1 can always find the global optimum, where $\epsilon > 0$ is a learning rate. Starting with some initial parameter set $(\theta_v)_{v \in B}$, the algorithm iteratively updates the set until convergence. The gradient with respect to $\theta_w$ for each $w \in B$ is obtained as

$$\frac{\partial}{\partial \theta_{w}} D_{{\rm KL}}(\mathcal{P}, \mathcal{Q}) = \frac{\partial}{\partial \theta_{w}} \Biggl(\psi(\theta) - \sum_{v \in B} \hat{\eta}_{v}\theta_{v} - H(\mathcal{P})\Biggr) = \eta_{w} - \hat{\eta}_{w},$$

where the last equality uses the fact that $\partial \psi(\theta) / \partial \theta_w = \eta_w$ from theorem 2 in Sugiyama et al (2017). This equation also shows that the KL divergence $D_{{\rm KL}}(\mathcal{P}, \mathcal{Q})$ is minimized if and only if $\eta_v = \hat{\eta}_v$ for all $v \in B$. The time complexity of each iteration is $O(|\Omega| |B|)$, as that of computing $\mathcal{Q}$ from $(\theta_v)_{v \in B}$ (line 5 in algorithm 1) is $O(|\Omega| |B|)$ and that of computing $(\eta_v)_{v \in B}$ from $\mathcal{Q}$ (line 6 in algorithm 1) is $O(|\Omega|)$. Thus the total complexity is $O(h |\Omega| |B|^2)$ with the number of iterations $h$ until convergence.

Algorithm 1. Legendre decomposition by gradient descent.
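
A minimal NumPy sketch of the gradient-descent e-projection (our own reimplementation under the assumptions above, not the authors' released code; all names are hypothetical): it repeatedly decodes Q from θ via equation (1), encodes η via equation (3), and moves each θ_w against the gradient η_w − η̂_w.

```python
import numpy as np
from itertools import product

def leq(u, v):
    return all(j <= i for j, i in zip(u, v))

def legendre_decomposition_gd(P, B, lr=0.1, tol=1e-6, max_iter=100000):
    """Gradient-descent sketch of algorithm 1 for a normalized tensor P and a basis B
    of 1-based index vectors (with (1, ..., 1) excluded). Returns the reconstruction Q."""
    omega = list(product(*[range(1, I + 1) for I in P.shape]))
    idx = {u: tuple(i - 1 for i in u) for u in omega}
    eta_hat = {v: sum(P[idx[u]] for u in omega if leq(v, u)) for v in B}
    theta = {v: 0.0 for v in B}
    for _ in range(max_iter):
        # decode Q from theta (equation (1))
        log_q = {v: sum(theta[u] for u in B if leq(u, v)) for v in omega}
        psi = np.log(sum(np.exp(s) for s in log_q.values()))
        q = {v: np.exp(log_q[v] - psi) for v in omega}
        # encode eta from Q (equation (3)) and take a gradient step
        eta = {v: sum(q[u] for u in omega if leq(v, u)) for v in B}
        grad = {v: eta[v] - eta_hat[v] for v in B}
        for v in B:
            theta[v] -= lr * grad[v]
        if max(abs(g) for g in grad.values()) < tol:
            break
    Q = np.zeros(P.shape)
    for v in omega:
        Q[idx[v]] = q[v]
    return Q

# Usage sketch: P = X / X.sum(); Q = legendre_decomposition_gd(P, B)
```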

Although gradient descent is an efficient approach, in Legendre decomposition we need to repeat 'decoding' from $(\theta_v)_{v \in B}$ and 'encoding' to $(\eta_v)_{v \in B}$ in each iteration, which may lead to a loss of efficiency if the number of iterations is large. To reduce the number of iterations and gain efficiency, we propose to use a natural gradient (Amari 1998), which is a second-order optimization method, shown in algorithm 2. Again, since the KL divergence $D_{{\rm KL}}(\mathcal{P}, \mathcal{Q})$ is convex with respect to $(\theta_v)_{v \in B}$, the natural gradient can always find the global optimum. More precisely, our natural gradient algorithm is an instance of the Bregman algorithm applied to a convex region, which is well known to always converge to the global solution (Censor and Lent 1981). Let $B = \{v_1, v_2, \dots, v_{|B|}\}$, $\boldsymbol{\theta} = (\theta_{v_1}, \dots, \theta_{v_{|B|}}){}^{\rm T}$, and $\boldsymbol{\eta} = (\eta_{v_1}, \dots, \eta_{v_{|B|}}){}^{\rm T}$. In each update of the current $\boldsymbol{\theta}$ to $\boldsymbol{\theta}_{{\rm next}}$, the natural gradient method uses the relationship

$$\boldsymbol{\eta} - \hat{\boldsymbol{\eta}} \simeq {\bf G}\,(\boldsymbol{\theta} - \boldsymbol{\theta}_{{\rm next}}),$$

which leads to the update formula

$$\boldsymbol{\theta}_{{\rm next}} = \boldsymbol{\theta} - {\bf G}^{-1}(\boldsymbol{\eta} - \hat{\boldsymbol{\eta}}),$$

where ${\bf G} = (g_{uv}) \in \mathbb{R}^{|B| \times |B|}$ is the Fisher information matrix such that

Equation (5)

$$g_{uv} = \frac{\partial \eta_{u}}{\partial \theta_{v}} = \sum_{w \in \Omega,\ w \geqslant u,\ w \geqslant v} q_{w} - \eta_{u}\eta_{v},$$

as given in theorem 3 in Sugiyama et al (2017). Note that the natural gradient coincides with Newton's method in our case, as the Fisher information matrix ${\bf G}$ corresponds to the (negative) Hessian matrix:

$$g_{uv} = \frac{\partial^2 \psi(\theta)}{\partial \theta_{u}\, \partial \theta_{v}} = \frac{\partial^2}{\partial \theta_{u}\, \partial \theta_{v}} D_{{\rm KL}}(\mathcal{P}, \mathcal{Q}).$$

Algorithm 2. Legendre decomposition by natural gradient.
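
A sketch of our own reading of algorithm 2 (hypothetical names, not the authors' code): the Fisher information matrix of equation (5) is assembled at each iteration and the update θ_next = θ − G⁻¹(η − η̂) is applied. In a practical implementation a small ridge term could be added to G if it is close to singular; that detail is our assumption, not part of the paper.

```python
import numpy as np
from itertools import product

def leq(u, v):
    return all(j <= i for j, i in zip(u, v))

def legendre_decomposition_ng(P, B, tol=1e-9, max_iter=100):
    """Natural-gradient (Newton) sketch of algorithm 2 for a normalized tensor P."""
    omega = list(product(*[range(1, I + 1) for I in P.shape]))
    idx = {u: tuple(i - 1 for i in u) for u in omega}
    eta_hat = np.array([sum(P[idx[u]] for u in omega if leq(v, u)) for v in B])
    theta = np.zeros(len(B))
    for _ in range(max_iter):
        # decode Q from theta (equation (1))
        log_q = {v: sum(t for t, u in zip(theta, B) if leq(u, v)) for v in omega}
        psi = np.log(sum(np.exp(s) for s in log_q.values()))
        q = {v: np.exp(log_q[v] - psi) for v in omega}
        # encode eta (equation (3)) and assemble the Fisher information matrix G (equation (5))
        eta = np.array([sum(q[u] for u in omega if leq(v, u)) for v in B])
        if np.max(np.abs(eta - eta_hat)) < tol:
            break
        G = np.array([[sum(q[w] for w in omega if leq(u, w) and leq(v, w))
                       for v in B] for u in B]) - np.outer(eta, eta)
        theta -= np.linalg.solve(G, eta - eta_hat)   # theta_next = theta - G^{-1}(eta - eta_hat)
    Q = np.zeros(P.shape)
    for v in omega:
        Q[idx[v]] = q[v]
    return Q
```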

The time complexity of each iteration is $O(|\Omega| |B| + |B|^3)$, where $O(|\Omega| |B|)$ is needed to compute $\mathcal{Q}$ from $\boldsymbol{\theta}$ and $O(|B|^3)$ to compute the inverse of ${\bf G}$, resulting in the total complexity $O(h' |\Omega| |B| + h'|B|^3)$ with the number of iterations $h'$ until convergence.

2.4. Relationship to statistical models

We demonstrate interesting relationships between Legendre decomposition and statistical models, including the exponential family and the Boltzmann (Gibbs) distributions, and show that our decomposition method can be viewed as a generalization of Boltzmann machine learning (Ackley et al 1985). Although the connection between tensor decomposition and graphical models has been analyzed by Yılmaz et al (2011), Yılmaz and Cemgil (2012), and Chen et al (2018), our analysis adds new insight, as we focus not on the graphical model itself but on the sample space of the distributions generated by the model.

2.4.1. Exponential family.

We show that the set of normalized tensors $\boldsymbol{\mathcal{S}} = \{\mathcal{P} \in \mathbb{R}_{> 0}^{I_1 \times I_2 \times \dots \times I_N} \mid \sum\nolimits_{v \in \Omega} p_{v} = 1\}$ is included in the exponential family. The exponential family is defined as

$$p(x; \boldsymbol{\theta}) = \exp\Biggl(\,\sum_{i} \theta_{i} k_{i}(x) + r(x) - C(\boldsymbol{\theta})\Biggr)$$

for natural parameters $\boldsymbol{\theta}$. Since our model in equation (1) can be written as

$$q_{v} = \exp\Biggl(\,\sum_{u \in \Omega^+} \zeta(u, v)\, \theta_{u} - \psi(\theta)\Biggr), \qquad \zeta(u, v) = \begin{cases} 1 & {\rm if}~u \leqslant v,\\ 0 & {\rm otherwise}, \end{cases}$$

with $\theta_u = 0$ for $u \in \Omega^+ \setminus B$, it is clearly in the exponential family, where $\zeta$ and $\psi(\theta)$ correspond to $k$ and $C(\boldsymbol{\theta})$, respectively, and $r(x) = 0$. Thus, the $(\theta_v)_{v \in B}$ used in Legendre decomposition are interpreted as natural parameters of the exponential family. Moreover, we can obtain $(\eta_v)_{v \in B}$ by taking the expectation of $\zeta$:

$$\eta_{v} = {\bf E}\bigl[\zeta(v, u)\bigr] = \sum_{u \in \Omega} q_{u}\, \zeta(v, u) = \sum_{u \in \Omega,\ u \geqslant v} q_{u}.$$

Thus Legendre decomposition of $\mathcal{P}$ can be understood as finding a fully decomposable $\mathcal{Q}$ that has the same expectation as $\mathcal{P}$ with respect to a basis $B$.

2.4.2. Boltzmann machines.

A Boltzmann machine is represented as an undirected graph $G = (V, E)$ with a vertex set $V$ and an edge set $E \subseteq V \times V$, where we assume that $V = [N] = \{1, 2, \dots, N\}$ without loss of generality. This $V$ is the set of indices of $N$ binary variables. A Boltzmann machine $G$ defines a probability distribution $P$, where the probability of each $N$-dimensional binary vector $\boldsymbol{x} \in \{0, 1\}^{N}$ is given as

$$P(\boldsymbol{x}; \theta) = \frac{1}{Z(\theta)} \exp\Biggl(\,\sum_{i \in V} \theta_{i} x_{i} + \sum_{\{i, j\} \in E} \theta_{ij} x_{i} x_{j}\Biggr),$$

where $\theta_i$ is a bias, $\theta_{ij}$ is a weight, and $Z(\theta)$ is a partition function.

To translate a Boltzmann machine into our formulation, let $\Omega = \{1, 2\}^N$ and suppose that

$$B(V) = \bigl\{\, (i_1, \dots, i_N) \in \Omega \;\bigm|\; i_{k} = 2~{\rm if}~k = i,~i_{k} = 1~{\rm otherwise},~{\rm for~some}~i \in V \,\bigr\},$$
$$B(E) = \bigl\{\, (i_1, \dots, i_N) \in \Omega \;\bigm|\; i_{k} = 2~{\rm if}~k \in \{i, j\},~i_{k} = 1~{\rm otherwise},~{\rm for~some}~\{i, j\} \in E \,\bigr\};$$

that is, $B(V)$ collects the index vectors with a single entry equal to $2$ at a vertex position, and $B(E)$ those with exactly two entries equal to $2$ at the endpoints of an edge.

Then it is clear that the set of probability distributions, or Gibbs distributions, that can be represented by the Boltzmann machine $G$ is exactly the same as $\boldsymbol{\mathcal{S}}_{B}$ with $B = B(V) \cup B(E)$ and $\exp(\psi(\theta)) = Z(\theta)$; that is, the set of fully decomposable $N$th-order tensors defined by equation (1) with the basis $B(V) \cup B(E)$ (figure 3). Moreover, let a given $N$th-order tensor $\mathcal{P} \in \mathbb{R}_{\geqslant 0}^{2 \times 2 \times \dots \times 2}$ be an empirical distribution estimated from data, where each $p_{v}$ is the probability for a binary vector $v - (1, \dots, 1) \in \{0, 1\}^N$. The tensor $\mathcal{Q}$ obtained by Legendre decomposition with $B = B(V) \cup B(E)$ coincides with the distribution learned by the Boltzmann machine $G = (V, E)$. The condition $\eta_v = \hat{\eta}_v$ in the optimization of the Legendre decomposition corresponds to the well-known learning equation of Boltzmann machines, where $\hat{\eta}$ and $\eta$ correspond to the expectation of the data distribution and that of the model distribution, respectively.
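
As a concrete illustration, the sketch below (our own; function names are hypothetical) constructs B(V) ∪ B(E) for the three-variable Boltzmann machine of figure 3. The resulting basis could be passed, together with an empirical tensor on Ω = {1, 2}³, to one of the decomposition sketches above.

```python
from itertools import product

N = 3
V = [1, 2, 3]
E = [(1, 2), (2, 3)]
omega = list(product((1, 2), repeat=N))   # sample space {1, 2}^N

def basis_vertices(V, N):
    # one index vector per vertex i: a single 2 at position i (bias theta_i)
    return [tuple(2 if k == i else 1 for k in range(1, N + 1)) for i in V]

def basis_edges(E, N):
    # one index vector per edge {i, j}: 2s at positions i and j (weight theta_ij)
    return [tuple(2 if k in (i, j) else 1 for k in range(1, N + 1)) for (i, j) in E]

B = basis_vertices(V, N) + basis_edges(E, N)
print(B)   # [(2,1,1), (1,2,1), (1,1,2), (2,2,1), (1,2,2)]
```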


Figure 3. Boltzmann machine with $V = \{1, 2, 3\}$ and $E = \{\{1, 2\}, \{2, 3\}\}$ (left) and its sample space (center), which corresponds to a tensor (right). Gray circles are the domain of parameters $\theta$ .


Therefore our Legendre decomposition is a generalization of Boltzmann machine learning in the following three aspects:

  • 1.  
    The domain is not limited to binary but can be ordinal; that is, $\{0, 1\}^N$ is extended to $[I_1] \times [I_2] \times \dots \times [I_N]$ for any $I_1, I_2, \dots, I_N \in \mathbb{N}$.
  • 2.  
    The basis B with which parameters $\theta$ are associated is not limited to $B(V) \cup B(E)$ but can be any subset of $[I_1] \times \dots \times [I_N]$ , meaning that higher-order interactions (Sejnowski 1986) can be included.
  • 3.  
    The sample space of probability distributions is not limited to $\{0, 1\}^N$ but can be any subset of $[I_1] \times [I_2] \times \dots \times [I_N]$, which enables us to perform efficient computation by removing unnecessary entries such as missing values.

Hidden variables are often used in Boltzmann machines to increase the representation power, such as in restricted Boltzmann machines (Smolensky 1986, Hinton 2002) and deep Boltzmann machines (Salakhutdinov and Hinton 2009, 2012). In Legendre decomposition, including a hidden variable corresponds to including an additional dimension. Hence if we include $H$ hidden variables, the fully decomposable tensor $\mathcal{Q}$ has order $N + H$. This is an interesting extension to our method and an ongoing research topic, but it is not a focus of this paper, since our main aim is to find a lower dimensional representation of a given tensor $\mathcal{P}$.

In the learning process of Boltzmann machines, approximation techniques for the partition function $Z(\theta)$ are usually required, such as annealed importance sampling (Salakhutdinov and Murray 2008) or variational techniques (Salakhutdinov 2008). This is because exact computation of the partition function requires summation over the entire sample space $\Omega$, which in Boltzmann machine learning is always fixed to $2^V$ for the set $V$ of variables and is therefore not tractable. Our method does not require such techniques, as $\Omega$ is a subset of the indices of an input tensor and the partition function can always be computed directly.

3. Experiments

We empirically examine the efficiency and the effectiveness of Legendre decomposition using synthetic and real-world datasets. We used Amazon Linux AMI release 2018.03 and ran all experiments on 2.3 GHz Intel Xeon CPU E5-2686 v4 with 256 GB of memory. The Legendre decomposition was implemented in C++ and compiled with icpc 18.0.04.

Throughout the experiments, we focused on the decomposition of third-order tensors and used three types of decomposition bases of the form $B_1 = \{v \mid i_1 = i_2 = 1\} \cup \{v \mid i_2 = i_3 = 1\} \cup \{v \mid i_1 = i_3 = 1\}$, $B_2(l) = \{v \mid i_1 = 1, i_2 \in C_2(l)\} \cup \{v \mid i_1 \in C_1(l), i_2 = 1\}$ with $C_k(l) = \{c\lfloor I_k / l \rfloor \mid c \in [l]\}$, and $B_3(l) = \{v \mid (i_1, i_2) \in H_{i_3}(l)\}$ with $H_{i_3}(l)$ being the set of indices of the top $l$ entries of the $i_3$th frontal slice in terms of probability. Among these bases, $B_1$ works as a normalizer for each mode, $B_2$ works as a normalizer for the rows and columns of each slice, and $B_3$ highlights entries with high probabilities. We always assume that $(1, \dots, 1)$ is not included in the above bases. The cardinality of a basis corresponds to the number of parameters used in the decomposition. We used $l$ to vary the number of parameters for decomposition in our experiments.
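
For reference, the following sketch shows one way these basis families could be constructed for a third-order tensor (our own reading of the definitions above; all function names are hypothetical, and the range of the unconstrained third index in B2 is our interpretation).

```python
import numpy as np
from itertools import product

def C(I, l):
    # C_k(l) = { c * floor(I_k / l) : c = 1, ..., l }
    return [c * (I // l) for c in range(1, l + 1)]

def basis_B1(shape):
    I1, I2, I3 = shape
    return ({(i1, 1, 1) for i1 in range(1, I1 + 1)} |
            {(1, i2, 1) for i2 in range(1, I2 + 1)} |
            {(1, 1, i3) for i3 in range(1, I3 + 1)}) - {(1, 1, 1)}

def basis_B2(shape, l):
    I1, I2, I3 = shape
    return ({(1, i2, i3) for i2 in C(I2, l) for i3 in range(1, I3 + 1)} |
            {(i1, 1, i3) for i1 in C(I1, l) for i3 in range(1, I3 + 1)}) - {(1, 1, 1)}

def basis_B3(P, l):
    # top-l entries (i1, i2) of each frontal slice P[:, :, i3 - 1]
    I1, I2, I3 = P.shape
    B = set()
    for i3 in range(1, I3 + 1):
        top = np.argsort(P[:, :, i3 - 1], axis=None)[::-1][:l]
        rows, cols = np.unravel_index(top, (I1, I2))
        B |= {(int(r) + 1, int(c) + 1, i3) for r, c in zip(rows, cols)}
    return B - {(1, 1, 1)}
```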

To examine the efficiency and the effectiveness of tensor decomposition, we compared Legendre decomposition with two standard nonnegative tensor decomposition techniques, nonnegative Tucker decomposition (Kim and Choi 2007) and nonnegative CANDECOMP/PARAFAC (CP) decomposition (Shashua and Hazan 2005). Since both of these methods are based on least squares objective functions (Lee and Seung 1999), we also included a variant of CP decomposition, CP alternating Poisson regression (CP-APR) (Chi and Kolda 2012), which uses the KL divergence as its objective function. We used the TensorLy implementation (Kossaifi et al 2016) for the nonnegative Tucker and CP decompositions and the tensor toolbox (Bader and Kolda 2007, Bader et al 2017) for CP-APR. In the nonnegative Tucker decomposition, we always employed rank-$(m, m, m)$ Tucker decomposition with a single number $m$, and we used rank-$n$ decomposition in the nonnegative CP decomposition and CP-APR. Thus rank-$(m, m, m)$ Tucker decomposition has $(I_1 + I_2 + I_3)m + m^3$ parameters, and rank-$n$ CP decomposition and CP-APR have $(I_1 + I_2 + I_3)n$ parameters.

Results on synthetic data.

First we compared our two algorithms, the gradient descent in algorithm 1 and the natural gradient in algorithm 2, to evaluate the efficiency of these optimization algorithms. We randomly generated a third-order tensor of size $20 \times 20 \times 20$ from the uniform distribution and measured the running time and the number of iterations. We set $B = B_3(l)$ and varied the number of parameters $|B|$ by increasing $l$. In algorithm 2, we used the outer loop (lines 3–8) as one iteration for a fair comparison, and we fixed the learning rate $\epsilon = 0.1$.

Results are plotted in figures 4(a) and (b), which clearly show that the natural gradient is dramatically faster than gradient descent. When the number of parameters is around 400, the natural gradient is more than six orders of magnitude faster than gradient descent. The speed-up comes from the reduction in the number of iterations: the natural gradient requires only two or three iterations until convergence in all cases, while gradient descent requires more than $10^5$ iterations to reach the same result. In the following, we consistently use the natural gradient for Legendre decomposition.


Figure 4. Experimental results on synthetic data. (a) and (b) Comparison of natural gradient (algorithm 2) and gradient descent (algorithm 1), where both algorithms produce exactly the same results. (c) Comparison of Legendre decomposition (natural gradient) and other tensor decomposition methods.


Next we examined the scalability compared to other tensor decomposition methods. We used the same synthetic datasets and increased the tensor size from $20 \times 20 \times 20$ to $500 \times 500 \times 500$. Results are plotted in figure 4(c). Legendre decomposition is slower than the Tucker and CP decompositions as the tensors get larger, while the plots show that the running time of Legendre decomposition is linear in the tensor size (figure 5). Moreover, Legendre decomposition is faster than CP-APR if the tensor size is not large.


Figure 5. Experimental results on the face image dataset (a), (b) and MNIST (c), (d).

Results on real data.

Next we demonstrate the effectiveness of Legendre decomposition on real-world datasets of third-order tensors. We evaluated the quality of decomposition by the root mean squared error (RMSE) between the input and the reconstructed tensors. We also examined the scalability of our method in terms of the number of parameters.
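
The evaluation metric is straightforward; a minimal sketch of our own (assuming NumPy arrays of identical shape; whether the reconstruction is rescaled back to the input's total mass is our assumption, not stated here):

```python
import numpy as np

def rmse(X, X_hat):
    """Root mean squared error between an input tensor and its reconstruction."""
    X, X_hat = np.asarray(X), np.asarray(X_hat)
    return float(np.sqrt(np.mean((X - X_hat) ** 2)))

# If the decomposition was run on the normalized tensor P = X / X.sum(),
# the reconstruction Q can be rescaled before comparison (an assumption):
# error = rmse(X, Q * X.sum())
```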

First we examine Legendre decomposition and the three competing methods on the face image dataset. We picked the first entry of the fourth mode (corresponding to lighting) and the first 20 faces from the third mode, resulting in a single third-order tensor of size $92 \times 112 \times 20$, where the first two modes correspond to image pixels and the third mode to individuals. We set the decomposition basis $B$ as $B = B_1 \cup B_2(l) \cup B_3(l)$. For every decomposition method, we gradually increased $l$, $m$, and $n$ and checked the performance in terms of RMSE and running time.

Results are plotted in figures 5(a) and (b). In terms of RMSE, Legendre decomposition is superior to the other methods when the number of parameters is small (up to 2000), and it is competitive with nonnegative CP decomposition and inferior to CP-APR for larger numbers of parameters. The reason is that Legendre decomposition exploits the order of the indices, which reflects the structure of the face images (pixels); that is, rows or columns cannot be exchanged with each other in the data. In terms of running time, it is slower than the Tucker and CP decompositions as the number of parameters increases, while it is still faster than CP-APR.

Next we used the MNIST dataset (LeCun et al 1998), which consists of a collection of images of handwritten digits and has been used as a standard benchmark dataset in a number of recent studies, such as those on deep learning. We picked the first 500 images for each digit, resulting in 10 third-order tensors of size $28 \times 28 \times 500$, where the first two modes correspond to image pixels. In Legendre decomposition, the decomposition basis $B$ was simply set to $B = B_3(l)$, and zero entries were removed from $\Omega$. Again, for every decomposition method, we varied the number of parameters by increasing $l$, $m$, and $n$ and evaluated the performance in terms of RMSE.

Means $\pm$ standard error of the mean (SEM) across all digits from '0' to '9' are plotted in figures 5(c) and (d). Results for all digits are presented in the supplementary material (available online at stacks.iop.org/JSTAT/19/124017/mmedia). Legendre decomposition clearly shows the smallest RMSE, and the difference is larger when the number of parameters is smaller. The reason is that Legendre decomposition ignores zero entries and decomposes only the nonzero entries, while such a decomposition is not possible with the other methods. The running time shows the same trend as for the face dataset; that is, Legendre decomposition is slower than the other methods when the number of parameters increases.

4. Conclusion

In this paper, we have proposed Legendre decomposition, which incorporates tensor structure into information geometry. A given tensor is converted into the dual parameters $(\theta, \eta)$ connected via the Legendre transformation, and the optimization is performed in the parameter space instead of directly treating the tensor. We have theoretically shown the desirable properties of Legendre decomposition, namely that its result is well defined, unique, and globally optimal, in that it always finds the decomposable tensor that minimizes the KL divergence from the input tensor. We have also shown the connection between Legendre decomposition and Boltzmann machine learning.

We have experimentally shown that Legendre decomposition can reconstruct input tensors more accurately than three standard tensor decomposition methods (nonnegative Tucker decomposition, nonnegative CP decomposition, and CP-APR) using the same number of parameters. Since the shape of the decomposition basis $B$ is arbitrary, Legendre decomposition has the potential to achieve even more accurate decomposition; for example, one can incorporate domain knowledge into the set $B$ when specific entries of the input tensor are known to dominate the other entries.

Our work opens the door to both further theoretical investigation of information geometric algorithms for tensor analysis and a number of practical applications such as missing value imputation.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Nos. JP16K16115, JP16H02870, and JST, PRESTO Grant No. JPMJPR1855, Japan (MS); JSPS KAKENHI Grant Nos. 26120732 and 16H06570 (HN); and JST CREST JPMJCR1502 (KT).

