Introduction

Inverse problems consist in inferring information about the structure of a system from the observation of its configurations. Cases where the system’s variables \(s_i\) are binary can be studied in the framework of the inverse Ising model, whose parameters \(\{J_{ij},h_i\}\) are tuned to describe the observed configurations according to the Boltzmann weight \(P(s) \sim \exp (\sum _{i<j} J_{ij} s_i s_j + \sum _i h_i s_i)\). This is the simplest distribution emerging from the maximum-entropy approach when one requires that the one- and two-point statistics of the data be reproduced exactly. Successful applications of this method arise in biology1, immunology2 and neuroscience3,4, as well as in the study of collective behaviors5 and financial time series6,7,8. In general, inferring the parameters of the model is a challenging problem, since maximizing the likelihood involves the computation of the partition function \(Z=\sum _s \exp (\sum _{i<j} J_{ij} s_i s_j + \sum _i h_i s_i)\), which is intractable in most realistic cases. On the other hand, when dealing with time series, it is possible to use a simpler approach based on the dynamic (kinetic) version of the Ising model, analysed in9, optimized in10 and generalized in11,12,13. A recent review on this subject can be found in14.

The original approach to the problem is a gradient-descent algorithm known as Boltzmann learning15. This method is impractical on large systems unless heuristic methods, like Monte Carlo (MC) sampling, are used to estimate the correlations16. Nevertheless, MC sampling is slow, and many sophisticated techniques from statistical mechanics and machine learning have been proposed as alternatives17,18,19,20,21,22,23,24,25,26,27,28. These methods, however, share one or both of the following shortcomings: (1) they require a large number of observations and (2) their overall performance drops significantly when the dataset is structured. This is often the case when data is produced from a (sub)set of many attractive states or is collected in different regimes, e.g. quiescent and spiking regimes in neural networks. This problem becomes particularly relevant at low temperatures and has already been studied both in the sparse29 and in the dense case30,31. Pseudo-likelihood32 based methods31 were shown to be the best option over a wide range of temperatures. Here, we present two algorithms that compete with the existing state of the art by posing the problem in a Bayesian framework using the Thouless-Anderson-Palmer (TAP) equations33 and the Restricted Boltzmann Machine (RBM)34,35. Our TAP-based algorithm will be shown to achieve better results from far fewer observed configurations in the clusterized phase. Moreover, it allows us to consider larger system sizes than those studied in30,31.

Results

We consider a dataset with many clusters, obtained by drawing configurations from the Hopfield model36,37. Given a set of N-dimensional independent binary patterns \(\{\underline{\zeta }^{\mu }\}_{\mu =1,\dots ,P}\), the teacher’s patterns, the coupling matrix of the associated Hopfield model is defined as \(J_{ij} = N^{-1} \sum _{\mu } \zeta ^{\mu }_i \zeta ^{\mu }_j\) and its Hamiltonian is \(H_{\zeta }(\underline{s})=-1/2 \sum _{ij} J_{ij} s_i s_j\). We construct a set of equilibrium configurations \({\mathscr {D}} = \{\underline{s}^a\}_{a=1,\dots ,M}\) sampled from the Boltzmann distribution

$$\begin{aligned} P(\underline{s}) = Z^{-1}\exp [-\beta H_{\zeta }(\underline{s})], \end{aligned}$$
(1)

where \(\beta\) is the inverse temperature. The task of the student is to infer the teacher’s patterns from the observation of \({\mathscr {D}}\). This task differs from the one in30,31, whose focus was the inference of the coupling matrix J only.

For \(P=1\), the Hopfield model is nothing but a Curie–Weiss model. In this case the posterior distribution is

$$\begin{aligned} P(\underline{\xi }|{\mathscr {D}}) =Z(\underline{\xi })^{-M}\exp \left[ \frac{\beta }{2N} \sum _{ij} \sum _{a=1}^M s^a_i s^a_j \xi _i \xi _j \right] , \end{aligned}$$
(2)

where \(\underline{\xi }\) denotes the student’s pattern, and the problem is known as the dual Hopfield model38,39. This is readily established by absorbing the \(\xi\)-dependence of the partition function into a redefined set of variables via \(s'_i=\xi _i s_i\). On the other hand, for \(P>1\), this transformation is not feasible and the posterior follows from the log-likelihood

$$\begin{aligned} {\mathscr {L}} (\{\underline{\xi }^{\mu }\}_{\mu =1,\dots ,P}|{\mathscr {D}}) = \log P(\{\underline{\xi }^{\mu }\}_{\mu =1,\dots ,P}|{\mathscr {D}})= \frac{\beta }{2N} \sum _{ij} \sum _{\mu =1}^P \sum _{a=1}^M s^a_i s^a_j \xi ^{\mu }_i \xi ^{\mu }_j -M \log Z (\{\xi ^{\mu }_i\}_{\mu =1,\dots ,P}) . \end{aligned}$$
(3)

We propose an algorithm based on TAP equations to estimate the posterior associated with Eqs. (2) and (3). In the direct problem, i.e. the study of the Boltzmann distribution of Eq. (1), TAP equations40,41,42,43,44 describe the local magnetizations \(m_i=\left<s_i \right>\) of the equilibrium states and their use as an inference method has been pioneered in43,45,46,47,48,49 since they can be used to approximate the intractable partition function appearing in the likelihood. These works paved the way to their applications in a number of problems such as error correcting codes, compressed sensing and learning in neural networks, as discussed in the recent review50.

Figure 1

Overlap between the teacher’s pattern and the pattern recovered by the student using Eq. (4) with \(P=1\), as a function of the number of samples M, at different temperatures. The system size is \(N=1000\). When \(\beta <1\), there exists a critical value of \(M\sim O(N)\) below which it is impossible to infer the pattern and above which only a finite number of samples is needed.

Even though TAP and mean-field methods have already been used to solve inverse problems23,29,30,31,51, the present approach is completely different, since we directly apply TAP to the posterior distribution (the dual model) to improve the quality of the reconstructed network. In the dual model, the role of spins and patterns is exchanged: the variables (spins) are now the \(\xi\)’s and the M sampled configurations play the role of dual patterns; we thus use TAP equations to study the local magnetizations \(m_i = \left<\xi _i\right>\). We notice that a similar approach has been independently proposed in52 for an RBM with 2 hidden binary units, using Belief Propagation.

Single pattern

We start by considering the simplest case \(P=1\). We introduce a naive time indexing for an iterative scheme of the TAP equations,

$$\begin{aligned} m^{t+1}_i = \tanh \left( \beta \sum _{j=1}^N J_{ij} m^t_j - \frac{\alpha \beta }{1-\beta (1-q^t) } m^t_i \right) , \end{aligned}$$
(4)

where \(J_{ij} = N^{-1} \sum _a s^a_i s^a_j\), \(\alpha =P/N\) and \(Nq^t=\sum _i (m_i^t)^2\). The entire set of magnetizations \(\underline{m}^t\) is updated in parallel to obtain \(\underline{m}^{t+1}\). In principle, any sophisticated time-indexing scheme can be employed as long as it converges to a physical state. In particular, the so-called Approximate Message Passing (AMP) equations have been the focus of many studies in inference problems50. This scheme is inspired by the convergence issues of the naive indexing in the SK model, which appear even in the replica-symmetric phase53. An explanation of this behavior can be found in54, where a less trivial time-index setting is shown to improve the convergence properties outside the glassy phase. The AMP equations exhibit convergence issues for the Hopfield model in the direct problem when the initial condition is chosen at random, thus in the following we adopt Eq. (4). These issues are discussed in detail in the Methods through simulations and analytical insights.
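For concreteness, a minimal NumPy sketch of the parallel update of Eq. (4) is given below; the function name `tap_iterate`, the stopping tolerance and the maximum number of iterations are illustrative choices (the 0.001 threshold follows the stopping rule quoted in the Methods), not a reproduction of the original implementation.

```python
import numpy as np

def tap_iterate(J, beta, alpha, m0, tol=1e-3, max_iter=1000):
    """Naive parallel iteration of the TAP equations, Eq. (4).

    J     : (N, N) coupling matrix, here J_ij = (1/N) sum_a s^a_i s^a_j
    beta  : inverse temperature
    alpha : P / N
    m0    : initial magnetizations (e.g. random values with |m_i| in [0.1, 1])
    """
    m = m0.copy()
    for _ in range(max_iter):
        q = np.mean(m ** 2)                                  # q^t = (1/N) sum_i (m_i^t)^2
        onsager = alpha * beta / (1.0 - beta * (1.0 - q))    # Onsager reaction coefficient
        m_new = np.tanh(beta * (J @ m) - onsager * m)        # parallel update of all spins
        if np.mean(np.abs(m_new - m)) < tol:                 # average-difference stopping rule
            return m_new
        m = m_new
    return m
```

The fixed point \(\underline{m}\) is then used directly as the student’s estimator of the pattern, and the teacher–student overlap of Fig. 1 is simply \(|N^{-1}\sum _i m_i \zeta _i|\).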

Figure 2

Overlap between the best TAP solutions and the teacher’s patterns. The system size is \(N=1000\); the teacher generates \(P=20\) patterns at \(\beta =2\). Inference is done with \(P'=25\) students observing only \(M=200\) samples, i.e. 10 per state. At each iteration step \(\tau =1,\ldots ,P'\), we pick the best TAP solution and plot its overlap with all of the teacher’s patterns. We clearly observe that the students are able to retrieve all of the teacher’s patterns.

Once Eq. (4) is solved, we use \(m_i\) as the student’s estimator of the pattern component \(\zeta _i\). In Fig. 1, we present the teacher–student overlap \(q(\underline{m},\underline{\zeta }) =| N^{-1}\sum _i m_i \zeta _i|\), where \(\underline{m}\) is the solution of Eq. (4). We observe that in the ferromagnetic-retrieval phase, \(\beta >1\), a perfect reconstruction can be achieved already with a small number of samples. This is due to the large signal contained in the correlation matrix of the data, \(\overline{c}\). In particular, we notice that in the ferromagnetic phase the student is able to find a pattern correlated with the teacher’s one even at \(M=1\). On the other hand, in the paramagnetic phase the signal in \(\overline{c}\) is weaker and reconstruction is possible only by exploiting finite-size effects, at the price of observing an extensive number of samples. As discussed in39, the critical fraction M/N of samples necessary for reconstruction corresponds to the paramagnetic-spin glass transition of the direct problem.

Figure 3

Evolution of the error \(\epsilon\) and of the simplified likelihood \(\hat{{\mathscr {L}}}\), as defined in the text, with the iterations of the algorithm. Different lines refer to different values of \(M=100,200,500,1000\) at \(\beta =2\), \(P=20\), \(N=1000\). The error decreases with M and reaches zero for \(M=1000\), although even with very few samples the errors are very small and, as shown in Fig. 2, the patterns are perfectly recovered. The dependence of \(\hat{{\mathscr {L}}}\) on M is negligible. \(\hat{{\mathscr {L}}}\) is rescaled in order to fit in the figure.

Figure 4

Average error as a function of M/P for a system of size \(N=1000\) with a number of patterns \(P=10\) (top panel) and \(P=30\) (bottom panel). The reconstruction is done using \(P'=2P\) students. The error is computed by stopping the algorithm with the criterion described in the text. Each point represents an average over 100 independent trials. In the retrieval phase, the error goes to zero as M increases.

Multiple patterns

The \(P>1\) case is more difficult because of the presence of the term in Eq. (3) coming from the partition function. However, we argue (see Methods) that this term is effectively a (soft) orthogonality constraint over the inferred patterns. This observation allows us to design an inference algorithm accordingly. First, we construct a time evolution of the coupling matrix \(J^{\tau }_{ij}\) with initial condition \(J^{\tau =0}_{ij} = N^{-1} \sum _{a} s_i^a s_j^a\). At each time step \(\tau\), we consider \(P'\) TAP students trying to learn the teacher’s patterns independently. Namely, the magnetizations \(m_i^{\mu }=\left< \xi _i^{\mu } \right>\) of each student evolve according to Eq. (4) from a randomly initialized configuration. To escape the (unstable) fixed point at \(m=0\), the absolute value of the initial local magnetizations is chosen in the interval [0.1, 1]. Upon convergence, we evaluate the \(P'\) solutions with the score \(S_{\mu } = \sum _{ij} \sum _{a=1}^M s^a_i s^a_j m^{\mu }_i m^{\mu }_j\). These scores characterize the quality of the TAP fixed points and we pick the one with the largest score. The magnetizations selected by this criterion at time \(\tau\) are denoted by \(\underline{m}^{\tau }\). This trick is closely reminiscent of the algorithm presented in28, where the iterative steps are performed by evaluating the likelihood gain obtained by moving in different directions and choosing the one with the largest payoff. Finally, in order to learn the remaining contributions, we remove the rank-1 part associated with the retrieved state \(\underline{m}^\tau\) from the coupling matrix. When the student knows the actual number of patterns, this corresponds to the rule \(J^{\tau +1}_{ij} = J^{\tau }_{ij} - \gamma N^{-1} m^\tau _i m^\tau _j\), where \(\gamma =M/P\) (assuming that the different states are uniformly sampled in the dataset). We repeat these steps until no further patterns are found.
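The full multi-pattern procedure can be summarized by the following schematic sketch, which reuses the hypothetical `tap_iterate` function above; the interface and the choice of passing \(\gamma = M/P\) explicitly are our own illustrative conventions.

```python
import numpy as np

def infer_patterns(samples, beta, alpha, P_prime, gamma, n_steps):
    """Deflation scheme described in the text: at each step tau, P' TAP students
    start from random initial conditions, the fixed point with the largest score
    S = sum_ij sum_a s^a_i s^a_j m_i m_j is kept, and its rank-1 contribution is
    subtracted from the coupling matrix.

    samples : (M, N) array of +/-1 configurations
    gamma   : M / P (uniform sampling of the states is assumed)
    """
    M, N = samples.shape
    J = samples.T @ samples / N                        # J^{tau=0}_ij = (1/N) sum_a s^a_i s^a_j
    retrieved = []
    for _ in range(n_steps):
        best_m, best_score = None, -np.inf
        for _ in range(P_prime):
            # random start with |m_i| in [0.1, 1] to escape the unstable m = 0 fixed point
            m0 = np.random.choice([-1.0, 1.0], N) * np.random.uniform(0.1, 1.0, N)
            m = tap_iterate(J, beta, alpha, m0)        # sketch from the previous snippet
            score = np.sum((samples @ m) ** 2)         # S_mu = sum_a (sum_i s^a_i m_i)^2
            if score > best_score:
                best_m, best_score = m, score
        retrieved.append(best_m)
        J = J - gamma / N * np.outer(best_m, best_m)   # remove the learned rank-1 part
    return np.array(retrieved)
```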

We stress that our algorithm finds solutions correlated with the patterns without any prior information, i.e. we start iterating the TAP equations from a random initial configuration. This is a rather remarkable property in comparison to the method used in29, where the BP equations were guided to converge to the fixed points associated with the patterns by a reinforcement term aligned with the magnetizations of the states. In Fig. 2 we compare the P teacher’s patterns with the \(P'\) TAP fixed points learned from data generated in the retrieval phase44. We clearly see that all the P patterns are successfully retrieved by the first \(P<P'\) students. In addition, let us define two performance measures, the simplified likelihood

$$\begin{aligned} \hat{{\mathscr {L}}} = \frac{1}{2N} \sum _{ij} \sum _{\mu =1}^{P'} \sum _{a=1}^M s^a_i s^a_j m^{\mu }_i m^{\mu }_j, \end{aligned}$$
(5)

and the reconstruction error \(\epsilon = [N(N-1)/2]^{-1} \sqrt{\sum _{i<j} (J^{\tau }_{ij}-J^*_{ij})^2 / \sum _{i<j} (J^*_{ij})^2 }\), where \(J^*\) denotes the teacher’s coupling matrix and \(J^{\tau }\) is the inferred matrix at time \(\tau\), \(J^{\tau }_{ij}=N^{-1}\sum _{t=1}^{\tau }m_i^{t} m_j^{t}\). The simplified likelihood is defined by neglecting the partition function in Eq. (3). In Fig. 3 their behaviors are reported as a function of the iteration time. As expected, \(\epsilon\) decreases as the students learn the patterns but then increases when the students start to learn the remaining noise. Similarly, the simplified likelihood \(\hat{{\mathscr {L}}}\) develops a kink at the point where the students have learned all the patterns, which can be used as a stopping condition for the algorithm. In Fig. 4 we study the behavior of \(\epsilon\), computed with the stopping criterion discussed above, as a function of the number of observations at different temperatures: as \(\beta\) varies, the system sweeps through different regions of the phase diagram. Data is generated with a sequential Glauber dynamics and the states are equally sampled. As expected, perfect reconstruction is obtained only in the retrieval phase.
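Both performance measures are straightforward to evaluate from the TAP fixed points; a possible implementation (we use the relative form of the error, leaving out the overall \([N(N-1)/2]^{-1}\) prefactor, which only rescales the curves) is sketched below, with names of our own choosing.

```python
import numpy as np

def simplified_likelihood(samples, Ms):
    """Eq. (5): (1/2N) sum_ij sum_mu sum_a s^a_i s^a_j m^mu_i m^mu_j.
    samples : (M, N) configurations, Ms : (P', N) magnetizations of the students."""
    N = samples.shape[1]
    return np.sum((samples @ Ms.T) ** 2) / (2.0 * N)

def reconstruction_error(J_inferred, J_teacher):
    """Relative error between inferred and teacher couplings over the pairs i < j."""
    iu = np.triu_indices_from(J_teacher, k=1)
    return np.sqrt(np.sum((J_inferred[iu] - J_teacher[iu]) ** 2)
                   / np.sum(J_teacher[iu] ** 2))
```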

Figure 5

Upper panel: pseudo-likelihood \({\mathscr {S}}\) and error \(\epsilon\) during learning. Data is produced by a teacher Hopfield model with \(N=1000\), \(P=10\), \(M=1000\) at \(\beta =2\). Learning is done with an RBM with \(N_v=N\) visible units and \(N_h=15\) hidden units. Lower panel: emergence of the singular values \(\sigma\) during learning for the same dataset; \(P=10\) modes emerge. Inset: error \(t_{\alpha }\) for the different modes at the beginning of learning (blue line) and at the end of learning (orange line).

Another approach to perform inference is to use the equivalence between the Hopfield model and the RBM39,44,55,56,57,58,59,60. In fact, the likelihood of Eq. (3) can be rewritten as

$$\begin{aligned} P(\{W^{\mu }_i\}_{\mu =1,\dots ,P}|{\mathscr {D}}) \propto \prod _{a=1}^M Z^{-1}(\underline{W}) \int d\underline{\lambda } e^{-\sum _\mu \lambda _{\mu }^2/2 + \sum _{i,\mu } W_i^{\mu } s^a_i \lambda _{\mu }}, \end{aligned}$$
(6)

with \(W_i^{\mu } = \sqrt{\frac{\beta }{N}} \xi ^{\mu }_i\), which defines an RBM with \(N_v=N\) binary visible units and \(N_h=P\) Gaussian hidden units. Following standard practice61, the weights \(W_i^{\mu }\) are learned by maximizing the log-likelihood with the Persistent Contrastive Divergence (PCD-10) algorithm62, using 10 Monte Carlo steps to estimate the part of the log-likelihood derivative involving the partition function. Compared to existing methods19,21,23,24,26,27,28, the RBM is both time and space efficient, as the number of parameters scales as \(N_vN_h\) rather than \(N^2\). The number of hidden units \(N_h\) plays the role of \(P'\), thus we consider the general setting \(N_h \ge P\) in the following. The RBM learns a set of weights from which we build the coupling matrix \(J^r_{ij} = \beta ^{-1} \sum _{\mu } W_i^{\mu } W_j^{\mu }\), which we can compare with the teacher’s coupling matrix. The error between \(J^r_{ij}\) and \(J^*_{ij}\) decreases during learning but never reaches the values found with TAP. In order to monitor learning, we study the pseudo-likelihood \({\mathscr {S}}\)32, i.e. a proxy for the likelihood that can be easily computed (see Methods). In Fig. 5, we show the behavior of these quantities for a dataset generated by a teacher with P patterns, using an RBM with \(N_h=P'>P\) hidden units. The minimum of \(\epsilon\) is achieved when the pseudo-likelihood flattens. This happens when all of the relevant (P) modes of the data have been learned.
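Once the weights have been trained, the comparison with the teacher only requires rebuilding the pairwise couplings from W; a minimal sketch (zeroing the diagonal before the comparison is our own choice, since the error is computed over \(i<j\) anyway) reads:

```python
import numpy as np

def coupling_from_weights(W, beta):
    """Reconstruct the pairwise couplings from the RBM weights,
    J^r_ij = (1/beta) sum_mu W_i^mu W_j^mu (Gaussian hidden units).
    W : (N_h, N_v) weight matrix."""
    J = W.T @ W / beta
    np.fill_diagonal(J, 0.0)   # self-couplings are irrelevant for the i < j error
    return J
```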

Unless learning starts in the vicinity of the teacher’s patterns, the final RBM weights do not reproduce them, contrary to the TAP-based algorithm discussed above. In fact, the Hopfield model is invariant under rotations in pattern space30, and the student RBM can learn, at most, the subspace spanned by the teacher’s patterns. To show this, we consider the Singular Value Decomposition (SVD) of the dataset and the SVD of the weights. We denote by \(\{\sigma ^{\alpha }\}\) the singular values of the matrix \(W^{\mu }_i\) and by \(t^{\alpha }\) the error in reconstructing the data singular vector indexed by \(\alpha\) using only the singular vectors of the weight matrix. In Fig. 5, we show the emergence of the different modes during learning. When the singular values \(\sigma ^{\alpha }\) of the coupling matrix emerge, the error \(t^{\alpha }\) decreases. The first P principal modes of the dataset are well represented by the subspace spanned by the singular vectors of the weight matrix W.

Methods

Linear stability analysis of TAP and AMP in the paramagnetic state

Here we present the linear stability analysis of the TAP and AMP equations in the paramagnetic state. We focus on the direct problem, where \(J_{ij}\) is constructed from a set of random patterns. While the complete analysis is possible for arbitrary \(\alpha\), we find it more instructive to focus on the limit \(\alpha \rightarrow 0\), as it greatly simplifies the discussion. As will be shown below, our results are valid if \(|1-\beta | \sim O(1)\), which is larger than \(O(\alpha )\).

From now on, we denote by TAP the simple iterative updating scheme discussed in the text, reported here for convenience

$$\begin{aligned} m^{t+1}_i = \tanh \left( \beta \sum _{j=1}^N J_{ij} m^t_j - \frac{\alpha \beta }{1-\beta (1-q^t) } m^t_i \right) \end{aligned}$$
(7)

and by AMP the iterative scheme derived in44,

$$\begin{aligned} H_i^{t+1} = \frac{1}{1-u^t}\left[ \sum _{j\ne i} J_{ij} m_j^t -u^t H_i^t - \frac{\alpha u^t}{1-u^{t-1}} m_i^{t-1} \right] , \end{aligned}$$
(8)

where \(m^t_i=\tanh (\beta H_i^t)\), \(u^t=\beta (1-q^t)\) and \(Nq^t=\sum _i (m^t_i)^2\). This time-index setting naturally emerges from the expansion of the BP equations in the large-connectivity limit43.
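For comparison with the sketch of Eq. (4) given earlier, a purely illustrative implementation of the AMP scheme of Eq. (8) could read as follows; warm-starting \(m^{t-1}\) and \(u^{t-1}\) from the initial condition is our own convention, not taken from the paper.

```python
import numpy as np

def amp_iterate(J, beta, alpha, m0, n_iter=200):
    """Parallel AMP iteration of Eq. (8), with m^t = tanh(beta H^t)."""
    m = np.clip(m0.astype(float), -0.99, 0.99)
    H = np.arctanh(m) / beta                      # H^t consistent with the initial m^t
    m_prev = m.copy()                             # m^{t-1}, warm-started
    u_prev = beta * (1.0 - np.mean(m ** 2))       # u^{t-1}, warm-started
    for _ in range(n_iter):
        q = np.mean(m ** 2)
        u = beta * (1.0 - q)                      # u^t = beta (1 - q^t)
        H = ((J @ m - np.diag(J) * m)             # sum over j != i of J_ij m^t_j
             - u * H
             - alpha * u / (1.0 - u_prev) * m_prev) / (1.0 - u)
        m_prev, m = m, np.tanh(beta * H)          # m^{t+1} = tanh(beta H^{t+1})
        u_prev = u
    return m
```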

Let us first consider the linear stability of Eq. (7). Near the paramagnetic state, \(m_i \sim 0\), this equation may be expanded as

$$\begin{aligned} m^{t+1}_i \simeq \beta \sum _{j=1}^N J_{ij} m^t_j + O(\alpha ) \end{aligned}$$
(9)

where the Onsager reaction term has been neglected, since it is of \(O(\alpha )\). Changing coordinates to the basis of eigenvectors of \(J_{ij}\), one obtains

$$\begin{aligned} {\tilde{m}}^{t+1}_\lambda \simeq \beta \lambda {\tilde{m}}^t_\lambda , \end{aligned}$$
(10)

where \(\lambda\) is an eigenvalue of \(J_{ij}\). This implies that the paramagnetic solution becomes unstable when \(\beta \lambda _{\text {max}} > 1\). The spectrum of the coupling matrix follows the Marchenko–Pastur law44: P eigenvalues are \(1 + O(\sqrt{\alpha })\) and their eigenvectors span the same space as the set of patterns, while the remaining \(N-P\) eigenvalues are zero. Thus we find that the critical temperature is \(T_c = 1 + O(\sqrt{\alpha })\) (the true value, \(T_c = 1/(1+\sqrt{\alpha })\), is found by expanding the TAP equations beyond the \(\alpha \rightarrow 0\) limit and is identical to the result of replica theory44).
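The instability threshold can be checked numerically in a few lines; the sizes used below are arbitrary illustrative values.

```python
import numpy as np

# Illustrative check of the stability argument of Eq. (10): the paramagnetic
# solution of Eq. (7) becomes unstable when beta * lambda_max > 1.
N, P = 2000, 20
patterns = np.random.choice([-1, 1], size=(P, N))
J = patterns.T @ patterns / N                       # direct-problem coupling matrix
lam_max = np.linalg.eigvalsh(J).max()               # largest of the P non-zero eigenvalues
print("paramagnetic state of Eq. (7) unstable for T <", lam_max)   # close to 1 + O(sqrt(P/N))
```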

Similarly, AMP Eq. (8) can be expanded as follows:

$$\begin{aligned} {\tilde{m}}_\lambda ^{t+1} = \frac{\beta }{1-\beta }\left[ (\lambda -1 ){\tilde{m}}_\lambda ^t - O(\alpha ) \right] . \end{aligned}$$
(11)

Because of the \(\lambda -1\) factor, in the limit \(\alpha \rightarrow 0\) the \(N-P\) zero eigenvalues give the largest, O(1), contribution to the instability of the paramagnetic fixed point. In particular, the modes associated with the patterns, whose eigenvalues are \(1 + O(\sqrt{\alpha })\), give a vanishing contribution. Starting from the infinite-temperature limit, the first point at which this equation becomes unstable is given by \(-\frac{\beta }{1-\beta } = -1\), i.e. \(\beta _c = 1/2\). Nevertheless, this unstable direction is orthogonal to the patterns and the magnetization either converges to an unphysical state or never converges (see Fig. 6). The negative value of the leading eigenvalue for \(\beta \in [1/2,1]\) leads to an oscillating behavior when starting from the paramagnetic solution, as can be seen in the second plot of Fig. 6. Similar issues with the parallel updating of the AMP equations were discussed in50, and they can be alleviated by updating the spins sequentially and introducing a strong damping. Nevertheless, sequential updating leads to a much slower algorithm, without any improvement in the quality of the inference with respect to the parallel updating scheme of Eq. (7).

A different updating scheme for the TAP equations has been recently proposed in63. This approach does not require taking the fully connected limit of the BP equations and is suitable for systems with dense random coupling matrices. It is based on a dynamical mean-field theory, which allows one to study the dynamics of iterative algorithms in the thermodynamic limit by averaging over the noise contained in the couplings. For the Hopfield model, the updating scheme turns out to be

$$\begin{aligned} m_i^{t+1}&= \tanh \left( z_i^t + A_t m_i^t\right) \end{aligned}$$
(12)
$$\begin{aligned} z_i^t&= A_t \left( \sum _j J_{ij} m_j^t -m_i^t \right) + \alpha (1-q^t) A_t z_i^{t-1} \end{aligned}$$
(13)

where \(A_t = \beta / (1+\alpha u_t)\) and \(u_t\) is the same quantity introduced in the AMP Eq. (8). By repeating the \(\alpha \rightarrow 0\) analysis presented above, it is possible to see that this updating scheme does not suffer from the issues of the AMP algorithm.
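A sketch of this alternative scheme, written with the same conventions as the snippets above (and not to be taken as the reference implementation of63), is:

```python
import numpy as np

def dmft_iterate(J, beta, alpha, m0, n_iter=200):
    """Iteration of Eqs. (12)-(13), with A_t and u_t as defined in the text."""
    m = m0.copy()
    z_prev = np.zeros_like(m)                       # z^{t-1}
    for _ in range(n_iter):
        q = np.mean(m ** 2)
        u = beta * (1.0 - q)                        # u_t, as in Eq. (8)
        A = beta / (1.0 + alpha * u)                # A_t
        z = A * (J @ m - m) + alpha * (1.0 - q) * A * z_prev   # Eq. (13)
        m = np.tanh(z + A * m)                      # Eq. (12)
        z_prev = z
    return m
```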

In Fig. 7, we compare the performance of these three algorithms for different system sizes. We define \(P_c\) as the probability of converging to one of the patterns of the system with overlap greater than 0.7 when the initial condition is chosen at random. The sequential AMP equations were iterated with a damping term d, i.e. \(m_i^{t+1} = (1-d) \tanh \beta H_i^{t+1} + d m_i^t\), with \(d=0.95\). For the two parallel TAP schemes (Eqs. 7 and 13), the iteration is stopped when the average difference between \(m_i^{t+1}\) and \(m_i^{t}\) is smaller than 0.001. For the sequential AMP algorithm, the iteration is stopped when the average difference between \(\tanh \beta H_i^{t+1}\) and \(\tanh \beta H_i^{t}\) is smaller than 0.001. In all cases we observe that convergence to the patterns is achieved in the retrieval phase. For small values of N, due to finite-size effects, the convergence regime extends into the metastable retrieval phase as well.

Figure 6

Trajectories of the N magnetizations \(m_i^t\) under the updating schemes of TAP, Eq. (7), and AMP, Eq. (8), for three different temperatures at \(N=1000\), \(P=1\). Most of the trajectories are very similar and thus indistinguishable. The critical value is at \(\beta =1\). The starting point is chosen at random with the absolute value of the local magnetization equal to one. In the first steps, both TAP and AMP destroy the initial condition and create very small magnetization values. Then, once close to the paramagnetic fixed point \(m=0\), the AMP equations escape from it for \(\beta >0.5\), while the TAP equations do not until \(\beta >1\). Moreover, when leaving the paramagnetic state, the direction chosen by AMP is completely random, while TAP moves towards the pattern.

Figure 7

Probability of converging to a pattern when iterating Eq. (7) (left), Eq. (8) (center), and Eq. (13) (right), starting from a random initial condition in the direct problem. This probability is estimated by running 1000 independent experiments with different realizations of the patterns and different initial conditions and counting the number of times the equations converged to one of the patterns of the system with overlap greater than 0.7, in order to exclude mixture states. The sequential updating of the AMP equations is done with a damping term equal to 0.95. The performance of all these algorithms is similar, with the second one being much slower. The initial absolute value of the local magnetizations is mostly irrelevant in the first two cases (and is chosen to be 1), but needs to be small at low temperatures in the third case (and is chosen to be 0.1).

The instability issue of the AMP equations presented above holds for the direct problem, but it can also be extended to the inverse, dual problem. In the latter case, where \(J_{ij}=N^{-1}\sum _{a=1}^M s^a_i s^a_j\), if there is enough signal in the data and \(\lambda _{\text {max}} > 2\), inference is possible also with the parallel AMP equations. Nevertheless, the analysis shows that obtaining the time indices from BP does not necessarily lead to good algorithms. TAP equations, as well as BP equations, describe only the fixed points of the associated free energy and, in principle, any updating scheme could be used to solve them iteratively, as shown in63. The relevance of this observation for other problems requires further analysis and, given that the AMP convergence issues are usually mitigated by a sequential updating with strong damping, it would be interesting to study whether a similar improvement is achieved when the TAP equations with the naive time indexing are iterated sequentially and with strong damping, in problems where their parallel updating fails.

Posterior for P>1

We discuss the role of the difficult term arising in the posterior distribution when \(P>1\). We show that, in the simple case \(P=2\), it has a clear interpretation as a constraint on the orthogonality of the inferred patterns. In fact, let us consider

$$\begin{aligned} Z(\{\xi ^1_i,\xi ^2_i\}) = \sum _{s} e^{\frac{\beta }{2N}\sum _{ij}(\xi ^1_i \xi ^1_j+\xi ^2_i \xi ^2_j) s_i s_j}, \end{aligned}$$
(14)

and let us define \(S=\{i : \xi ^1_i = \xi ^2_i \}\), such that \(|S|= N(1+q)/2\), where q is the mutual overlap between the two patterns, \(Nq = \sum _i \xi ^1_i \xi ^2_i\). The exponent in Eq. (14) reads

$$\begin{aligned} H_{\xi }(s) = \frac{ \beta }{N}\sum _{i\in S, j \in S}\xi ^1_i \xi ^1_j s_i s_j + \frac{ \beta }{N}\sum _{i\in \overline{S}, j \in \overline{S}}\xi ^1_i \xi ^1_j s_i s_j , \end{aligned}$$
(15)

where we indicate with \(\overline{S}\) the complement of set S. Using again the gauge transformation \(s'_i=\xi ^1_i s_i\), Eq. (15) leads to

$$\begin{aligned} Z(\{\xi ^1_i,\xi ^2_i\}) = Z_{cw,\beta (1+q)}^{N(1+q)/2} Z_{cw,\beta (1-q)}^{N(1-q)/2}, \end{aligned}$$
(16)

where we indicate with \(Z_{cw,\beta }^N\) the partition function of a ferromagnetic Curie–Weiss model with N spins at inverse temperature \(\beta\). We observe that the partition function depends on the two patterns only through their mutual overlap. If we define \(\phi = -N^{-1} \log Z\), we obtain

$$\begin{aligned} \phi (q) = \frac{1+q}{2}f_{cw}(\beta (1+q)) + \frac{1-q}{2}f_{cw}(\beta (1-q)) \end{aligned}$$
(17)

where \(f_{cw}(\beta )\) is the free energy of the Curie–Weiss model at inverse temperature \(\beta\). It is easy to check that \(\phi (q)\) is a convex function with a minimum at \(q = 0\). Thus the term \(- M\log Z(\underline{\xi })\) in the posterior can be interpreted as a soft regularizer enforcing the orthogonality of the patterns.

Restricted Boltzmann machine

A restricted Boltzmann machine (RBM) is a particular kind of Boltzmann machine in which the units are divided into two layers, formed by visible \(\{s_i\}\) and hidden \(\{\lambda _{\mu }\}\) units, and only interactions \(W^{\mu }_i\) between units of different layers are allowed, so that the joint probability distribution reads

$$\begin{aligned} P(\underline{s},\underline{\lambda }|\{W\})=Z^{-1}(\{W\}) e^{-E_W(\underline{s},\underline{\lambda })} \end{aligned}$$
(18)

where \(E_W(\underline{s},\underline{\lambda }) = - \sum _{i,\mu } W^{\mu }_i s_i \lambda _{\mu }\), and \(Z(\{W\})\) is the partition function,

$$\begin{aligned} Z(\{W\}) = \int \prod _{\mu =1}^{N_h} d P(\lambda _{\mu }) \sum _{\underline{s}} e^{-E_W(\underline{s},\underline{\lambda })}. \end{aligned}$$
(19)

For our purposes, \(P(\underline{\lambda })\) denotes a generic distribution over the hidden units, while the visible units are \(\pm 1\) binary variables. We indicate with \(N_v\) the number of visible units and with \(N_h\) the number of hidden units. The RBM has the property that the two conditional probabilities, \(P(\underline{s}|\underline{\lambda },\{W\})\) and \(P(\underline{\lambda }|\underline{s},\{W\})\), factorize over the visible (resp. hidden) units. These machines are used to learn weights such that the distribution over the visible units reproduces the distribution of the data. In other words,

$$\begin{aligned} P(\underline{s}|\{W\}) = \int \prod _{\mu =1}^{N_h} d P(\lambda _{\mu }) P(\underline{s},\underline{\lambda }|\{W\}) \end{aligned}$$
(20)

should reproduce \(P_{D}(\underline{s}) = M^{-1} \sum _{a=1}^M \delta _{\underline{s},\underline{s}^{a}}\) as closely as possible. The weights can be found by minimizing the KL divergence between the two distributions, which is equivalent to maximizing the likelihood \(\prod _{a=1}^M P(\underline{s}^a|\{W\})\) or the log-likelihood

$$\begin{aligned} {\mathscr {L}}=\frac{1}{M} \sum _{a=1}^M\left( -\log Z(\underline{W})+ \log \int d P(\underline{\lambda }) e^{-E_W(\underline{s}^a,\underline{\lambda })} \right) . \end{aligned}$$
(21)

Optimal weights can be learned by gradient ascent:

$$\begin{aligned} W^{\mu }_i = W^{\mu }_i + \left( \langle \lambda _{\mu } s_i\rangle _{D} - \langle \lambda _{\mu } s_i\rangle _{RBM} \right) \end{aligned}$$
(22)

where the first average, usually referred to as positive phase, is

$$\begin{aligned} \langle \lambda _{\mu } s_i \rangle _{D} = M^{-1} \sum _a^{M} \int d P(\underline{\lambda }) P(\underline{\lambda }|\underline{s}^a,\{W\}) \lambda _{\mu } s_i^a \end{aligned}$$
(23)

and the second average, usually referred to as negative phase, is

$$\begin{aligned} \langle \lambda _{\mu } s_i \rangle _{RBM} = \frac{\partial }{\partial W_i^{\mu }} \log Z(\underline{W}). \end{aligned}$$
(24)

The second term is known to be difficult and must be computed with approximate methods. One way to estimate it is to use Monte Carlo (MC) sampling. Depending on the number T of steps of the MC Markov chain, this method is referred to as CD-T, where CD stands for Contrastive Divergence. In the text, we discuss results obtained with \(T=10\). When the positive term is computed over a subset (mini-batch) of the dataset, the direction indicated by the gradient does not correspond to the one obtained from the whole dataset. This trick introduces a source of randomness in the path to the solution, and the associated learning algorithm is called stochastic gradient ascent. In our experiments we use a mini-batch size equal to 100. Since the mini-batch samples are independent, different parallel MC chains can be used; in our experiments we used 100 MC chains, one per mini-batch sample. Their initial conditions can be chosen to be the samples themselves, but this quickly results in overfitting the parameters, since the MC dynamics then spends all its time in the regions of phase space close to the samples. When the initial condition of the MC dynamics is chosen at random and the chain positions are kept across different mini-batches and epochs, the method is called Persistent CD (PCD). Our results are obtained using PCD.
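For illustration, one PCD-k update for an RBM with \(\pm 1\) visible units and unit-variance Gaussian hidden units may be sketched as follows; the learning rate, the batch handling and the function names are hypothetical choices, not the exact training code used for the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

def pcd_step(W, batch, chains, lr=0.01, k=10):
    """One PCD-k update (Eq. (22)) for +/-1 visible units and Gaussian hidden units.
    W      : (N_h, N_v) weights
    batch  : (B, N_v) data mini-batch
    chains : (C, N_v) persistent fantasy configurations (float array), updated in place."""
    # positive phase: <lambda_mu s_i>_D, using <lambda | s> = W s for unit-variance Gaussians
    pos = (batch @ W.T).T @ batch / batch.shape[0]

    # negative phase: k steps of block Gibbs sampling on the persistent chains
    v = chains
    for _ in range(k):
        lam = v @ W.T + rng.standard_normal((v.shape[0], W.shape[0]))  # sample lambda | s
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * (lam @ W)))                # P(s_i = +1 | lambda)
        v = np.where(rng.random(v.shape) < p_plus, 1.0, -1.0)
    chains[:] = v
    neg = (v @ W.T).T @ v / v.shape[0]

    W += lr * (pos - neg)      # gradient ascent step on the log-likelihood
    return W
```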

As stated above, the likelihood function cannot be easily computed. Thus, we introduce the pseudo-likelihood that, for a model with hidden units, is defined by \({\mathscr {S}} =\sum _{r=1}^{N_v} {\mathscr {S}}_r\), where

$$\begin{aligned} {\mathscr {S}}_r = \frac{1}{M} \sum _{a=1}^{M} \log \left\langle p(s_r^{a} | \underline{\lambda }) \right\rangle _{P(\underline{\lambda }|\underline{s}^a,\{W\} )} , \end{aligned}$$
(25)

where the term inside the \(\log\) is defined by

$$\begin{aligned} \left\langle p(s_r^{a} | \underline{\lambda }) \right\rangle _{P(\underline{\lambda }|\underline{s}^a,\{W\} )} = \int d P(\underline{\lambda }) p(s_r^{a} | \underline{\lambda } ) P(\underline{\lambda }|\underline{s}^a,\{W\}) \end{aligned}$$
(26)

and it is equal to

$$\begin{aligned} \left\langle p(s_r^{a} | \underline{\lambda }) \right\rangle _{P(\underline{\lambda }|\underline{s}^a,\{W\} )} = {\mathscr {N}}^{-1}_a \int d P(\underline{\lambda }) e^{ \sum _{\mu k} W_k^{\mu } s_k^a \lambda _{\mu }}, \end{aligned}$$
(27)

where \({\mathscr {N}}_a\) is a sample dependent normalization factor,

$$\begin{aligned} {\mathscr {N}}_a = \sum _{s_r^a} \int d P(\underline{\lambda }) e^{ \sum _{k \mu } W_k^{\mu } s_k^a \lambda _{\mu }}. \end{aligned}$$
(28)

The pseudo-likelihood is optimized by the same set of parameters \(\{W\}\) that optimizes the likelihood. In order to show this property, we take the derivatives of \({\mathscr {S}}\):

$$\begin{aligned} \frac{\partial {\mathscr {S}}_r}{\partial W_{r}^{\nu }} = \frac{1}{M} \sum _{a=1}^M \left( \frac{ \int \prod _{\mu } d P(\lambda _{\mu })s_r^a \lambda _{\nu } e^{\sum _{k \mu } W_k^{\mu } s_k^a \lambda _{\mu }} }{ \int d P(\underline{\lambda }) e^{ \sum _{k \mu } W_k^{\mu } s_k^a \lambda _{\mu }} } - \frac{\partial }{\partial W_r^{\nu }} \log \sum _{s_r^a} \int d P(\underline{\lambda }) e^{ \sum _{k \mu } W_k^{\mu } s_k^a \lambda _{\mu }}\right) . \end{aligned}$$
(29)

The definition

$$\begin{aligned} P(\underline{\lambda }|\underline{s}^a,\{W\} ) = \frac{ e^{\sum _{k,\mu } W_k^{\mu } \lambda _{\mu } s_k^a} }{ \int d P(\underline{\lambda }) e^{ \sum _{k, \mu } W_k^{\mu } \lambda _{\mu }s_k^a } } \end{aligned}$$
(30)

allows us to write the first term of Eq. (29) as

$$\begin{aligned} \langle \lambda _{\nu } s_r \rangle _{D} = \frac{1}{M} \sum _{a=1}^M \int d P(\underline{\lambda }) \lambda _{\nu } s^a_r P(\underline{\lambda }|\underline{s}^a,\{W\}). \end{aligned}$$
(31)

The second term is given by the average over samples of

$$\begin{aligned} {\mathscr {N}}_a^{-1} \frac{\partial {\mathscr {N}}_a}{\partial W^{\nu }_r} = \frac{\sum _{s_r^a} \int d P(\underline{\lambda }) s_r^a \lambda _{\nu } e^{ \sum _{k \mu } W_k^{\mu } s_k^a \lambda _{\mu }}}{\sum _{s_r^a} \int d P(\underline{\lambda }) e^{ \sum _{k \mu } W_k^{\mu } s_k^a \lambda _{\mu }} } \end{aligned}$$
(32)

and similar manipulations on the second term lead to

$$\begin{aligned} \frac{\partial {\mathscr {S}}}{\partial W_{r}^{\nu }} = \frac{1}{M} \sum _{a=1}^M s_r^a \left\langle \lambda _{\nu } \right\rangle _{P(\underline{\lambda }|\underline{s}^a,\{W\})} - \frac{1}{M} \sum _{a=1}^M \left\langle \lambda _{\nu } \tanh \sum _{\mu } W^{\mu }_r \lambda _{\mu }\right\rangle _{P(\underline{\lambda }|\underline{s}^a,\{W\})} =\left\langle \lambda _{\nu } s_r \right\rangle _{D} - \left\langle \lambda _{\nu } \tanh \sum _{\mu } W^{\mu }_r \lambda _{\mu } \right\rangle _{D}. \end{aligned}$$
(33)

In the infinite sampling limit,

$$\begin{aligned} \lim _{M\rightarrow \infty } \left\langle \lambda _{\nu } \tanh \sum _{\mu } W^{\mu }_r \lambda _{\mu } \right\rangle _{D} = \langle \lambda _{\nu } s_r \rangle _{RBM} . \end{aligned}$$
(34)

In fact, it is easy to show that

$$\begin{aligned} \left\langle \lambda _{\nu } \tanh \sum _{\mu } W^{\mu }_r \lambda _{\mu } \right\rangle _{RBM} = \langle \lambda _{\nu } s_r \rangle _{RBM} \end{aligned}$$
(35)

and, on the other hand, that

$$\begin{aligned} \lim _{M\rightarrow \infty }\left\langle \lambda _{\nu } \tanh \sum _{\mu } W^{\mu }_r \lambda _{\mu }\right\rangle _{D} =\left\langle \lambda _{\nu } \tanh \sum _{\mu } W^{\mu }_r \lambda _{\mu }\right\rangle _{RBM} . \end{aligned}$$
(36)

Thus the gradient of the pseudo-likelihood \({\mathscr {S}}\) vanishes on the same set of parameters \(\{W\}\) that solves \(0=\partial _{W} {\mathscr {L}}\), which is why the pseudo-likelihood can be used to monitor the learning. In practice, the probability in Eq. (26) can be estimated after one Monte Carlo step:

$$\begin{aligned} \left<p(s_r^a|\underline{\lambda })\right>_{P(\underline{\lambda }|\underline{s}^a,\{W\})} \sim \frac{e^{s_r^a \sum _{\mu } W^{\mu }_r \lambda _{\mu }}}{2 \cosh \sum _{\mu }W^{\mu }_r \lambda _{\mu } } \end{aligned}$$
(37)

where \(\underline{\lambda }\) is sampled from the distribution \(P(\underline{\lambda }|\underline{s}^a,\{W\})\).
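A compact sketch of this one-step estimate of the pseudo-likelihood, written under the same conventions as the PCD snippet above, is:

```python
import numpy as np

rng = np.random.default_rng(1)

def pseudo_likelihood(W, samples):
    """Estimate S = sum_r S_r from Eqs. (25) and (37) with a single Gaussian
    sample of the hidden units per configuration.
    W : (N_h, N_v) weights, samples : (M, N_v) +/-1 configurations."""
    lam = samples @ W.T + rng.standard_normal((samples.shape[0], W.shape[0]))  # lambda ~ P(lambda | s^a)
    fields = lam @ W                                           # sum_mu W_r^mu lambda_mu at every site r
    log_p = samples * fields - np.logaddexp(fields, -fields)   # log of Eq. (37); log(2 cosh x) written stably
    return log_p.sum(axis=1).mean()                            # sum over sites r, average over samples a
```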

Finally, we discuss the learning of the RBM compared with our TAP-based algorithm. As mentioned in the text, unless learning starts in the vicinity of the teacher’s patterns, the final RBM weights do not reproduce them. In fact, the Hopfield model is invariant under rotations in pattern space. The dataset \({\mathscr {D}}\) analyzed by the student could have been produced by another set of patterns \(\{\underline{{\hat{\zeta }}}^{\mu }\}_{\mu =1,\dots ,P}\) given by \({\hat{\zeta }}^{\mu }_i = \sum _{\nu =1}^P O_{\mu ,\nu }\zeta ^{\nu }_i\), where O is an orthogonal matrix. This symmetry implies that the student RBM cannot learn exactly the teacher’s patterns. One could think that the singular vectors of W should at least learn the principal vectors of the data (which, given the rotational symmetry, are not necessarily aligned with the teacher’s patterns), as discussed in59. Nevertheless, this is true only during the initial steps of learning, when the couplings are small. This is reminiscent of the results discussed in30, where the posterior of the problem is analyzed in a perturbative expansion: at first order, corresponding to the small-coupling regime, the student’s patterns are aligned with the singular vectors of the data at zeroth order; however, when higher-order corrections are computed, this relation breaks down.

In the following we show that the RBM learns the subspace spanned by the singular vectors of the data. To show this, we consider the singular value decomposition (SVD) of the dataset, \(D=U_{D} \Sigma _{D} V_{D}^T\), where, assuming \(N<M\), D is the \(N \times M\) matrix whose columns are the samples of \({\mathscr {D}}\), \(U_D\) is an orthogonal \(N \times N\) matrix, \(\Sigma _D\) is an \(N \times M\) matrix with only N non-zero diagonal elements, and \(V_D\) is an orthogonal \(M \times M\) matrix. Similarly, we consider the SVD of the weight matrix, \(W^T=U_{W} \Sigma _{W} V_{W}^T\), where \(W^T\) is \(N_v \times N_h\), \(U_{W}\) is an \(N_v \times N_v\) orthogonal matrix, \(\Sigma _{W}\) is an \(N_v \times N_h\) matrix with \(N_h\) non-zero diagonal elements, and \(V_{W}\) is an \(N_h \times N_h\) orthogonal matrix. We consider \(N_v=N\) and decompose each data mode \(u^{(\alpha ),D}_i= [U_{D}]_{i \alpha }\) onto the subspace spanned by the first \(N_h\) singular vectors of the weights, \(u^{(\mu ),W}_i= [U_{W}]_{i \mu }\), \(\mu =1,\ldots ,N_h\):

$$\begin{aligned} \vec {u}^{(\alpha ),D} = \sum _{\mu =1}^{N_h} c_{\mu }^{\alpha } \vec {u}^{(\mu ),W} + e^{\alpha }(W) , \qquad c_{\mu }^{\alpha } = \left<\vec {u}^{(\mu ),W},\vec {u}^{(\alpha ),D}\right> , \end{aligned}$$
(38)

where \(\{ \vec {u}^{(\mu ),W} \}_{\mu =1,\ldots ,N_h}\) are orthonormal vectors. We measure the behavior of

$$\begin{aligned} t_{\alpha }=\sum _i|e^{\alpha }_i|/\sum _i|u^{(\alpha ),D}_i| \end{aligned}$$
(39)

at the beginning and at the end of learning. This quantity measures the relative difference between the original data mode and its projection onto the subspace spanned by the basis \(\{ \vec {u}^{(\mu ),W} \}_{\mu =1,\dots, N_h}\). The results of this analysis are shown in the inset of Fig. 5, where we plot \(t_{\alpha }\) at the initial stage of learning and at the end.
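The quantity \(t_{\alpha }\) can be computed directly from the two SVDs; a short sketch with our own naming conventions is:

```python
import numpy as np

def mode_errors(D, W):
    """Error t_alpha of Eq. (39): residual of each data mode u^{(alpha),D} after
    projection onto the subspace spanned by the first N_h weight modes u^{(mu),W}.
    D : (N, M) data matrix (one sample per column), W : (N_h, N_v) weight matrix."""
    U_D = np.linalg.svd(D, full_matrices=False)[0]       # data modes (columns)
    U_W = np.linalg.svd(W.T, full_matrices=False)[0]     # weight modes (columns), N_v x N_h
    C = U_W.T @ U_D                                       # coefficients c_mu^alpha
    E = U_D - U_W @ C                                     # residuals e^alpha(W)
    return np.abs(E).sum(axis=0) / np.abs(U_D).sum(axis=0)
```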

Conclusions

In summary, we discussed a new method to solve inverse problems with clusterized datasets. We analyzed the fully connected Hopfield model in a teacher–student scenario and proposed an inference method based on the TAP equations, working directly on the posterior distribution, i.e. on the dual problem. We discussed a retrieval algorithm based on the parallel updating of the TAP equations with a naive time indexing, showing that in our case it gives good results. Contrary to previous methods, our algorithm is able to retrieve the patterns, besides the couplings, because the TAP equations reduce the continuous rotational symmetry to a simple permutation symmetry over the pattern labels. Finally, we compared these results with those obtained with an RBM, exploiting its analogy with the Hopfield model. The RBM is a good candidate model for performing inference with many variables, a task that would require much longer execution times for methods based on optimizing the pseudo-likelihood of an associated pairwise Ising model. Its ability to perform inference tasks systematically, as well as its performance in inferring sparse models, will be addressed elsewhere.