1 Introduction and Statement of the Main Results

1.1 Bayesian Inference

Statistical inference aims to update beliefs about uncertain parameters as more information becomes available. Bayesian inference, one of the most successful methods used in decision theory, builds on Bayes' theorem:

$$\begin{aligned} \text {Prob}(H\mid E) = \frac{\text {Prob} (E\mid H) \cdot \text {Prob}(H) }{\text {Prob}(E)} = \frac{\text {Prob} (E\mid H)}{\text {Prob}(E)} \cdot \text {Prob}(H) \end{aligned}$$
(1)

which expresses the conditional probability of the hypothesis H given the event E in terms of the probability that the event or evidence E occurs given the hypothesis H. In the previous expression, the posterior probability \(\text {Prob}(H\mid E)\) is inferred as an outcome of the prior probability \(\text {Prob}(H)\) on the hypothesis, the model evidence \(\text {Prob}(E)\) and the likelihood \(\text {Prob} (E\mid H)\). Bayes' theorem has been widely used as an inductive learning model to transform prior and sample information into posterior information and, consequently, in decision theory. One should not confuse \(\text {Prob} (E\mid H)\) with \(\text {Prob} (H\mid E)\). Let us provide a simple example. Suppose one is tested for covid-19, and the test turns out to be positive. If the test is \(99 \%\) accurate, this means that \(\text {Prob}(\text {Positive test}\mid \text {Covid-}19)=0.99\). However, the most relevant information is \(\text {Prob}(\text {Covid-}19 \mid \text {Positive test})\), namely the probability of having covid-19 once one is tested positive, which is related to the former conditional probability by  (1). If the proportion \(\text {Prob}(\text {Covid-}19)\) of infected persons in the total population is 0.001, it is possible to compute the normalizing term \(\text {Prob}(\text {Positive test})\) and to conclude that \(\text {Prob}(\text {Covid-}19\mid \text {Positive test})=0.5\), which provides different and rather relevant information (see e.g. [13] for all computations in a similar example). The conclusion is that both the prior and the data contain important information, and so neither should be neglected.
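For concreteness, the computation behind such an example can be carried out in a few lines; in the sketch below the false positive rate of the test is an extra assumption (it is not specified above), and different choices of it and of the prevalence produce markedly different posterior probabilities.

```python
# Minimal sketch of the computation behind Bayes' theorem (1) for a diagnostic
# test.  The false positive rate is an assumed value, not taken from [13].
def posterior(prior, sensitivity, false_positive_rate):
    # Prob(Positive test) by the law of total probability
    prob_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    # Prob(Covid-19 | Positive test) by Bayes' theorem (1)
    return sensitivity * prior / prob_positive

for prior in (0.01, 0.001):
    print(prior, posterior(prior, sensitivity=0.99, false_positive_rate=0.01))
# prior 0.01  -> posterior 0.50: only half of the positives are true positives
# prior 0.001 -> posterior ~0.09: the prior dominates even a 99%-accurate test
```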

The process of drawing conclusions from available information is called inference. However, in many physical phenomena the available information is often insufficient to reach certainty through logical reasoning. In these cases, one may use different approaches to inductive inference, the most common methods being those involving probability theory and entropic inference (cf. [13]). The frequentist interpretation advocates that the probability of a random event is given by the relative number of occurrences of the event in a sufficiently large number of identical and independent trials. An alternative approach is given by the Bayesian interpretation, which has become more popular in recent decades and holds that a probability reflects the degree of belief of an agent in the truth of a proposition. Citing [13], “the crucial aspect of Bayesian probability measures is that different agents may have different degrees of belief in the truth of the very same proposition, a fact that is described by referring to Bayesian probability measures as being subjective".

In the framework of parametric Bayesian statistics, one is interested in updating beliefs, or the degree of confidence, on the space of parameters \(\Theta \), which play the role of the variable H in the expression  (1) above. In rough terms, the formula  (1) expresses that the belief on a certain set of parameters is updated from the original belief, after an event E, by how likely such event is for all parameterized models. This supports the idea that while frequentists say the data are random and the parameters are fixed, Bayesians say the data are fixed and the parameters are random. The basic idea in classical Bayesian inference is the updating of a prior belief distribution to a posterior belief distribution when the parameter of interest is connected to observations via the likelihood function. In [7], Bissiri et al. propose a general framework for Bayesian inference, arguing that a valid update of a prior belief distribution to the posterior one can be made for parameters which are connected to observations through a loss function which accumulates information as time passes, rather than through the likelihood function. In their framework, the classical inference process corresponds to the special case where the loss function is expressed as the negative log likelihood function. In this more general framework, the choice of loss function determines the way that the analyzed data contribute to the mechanism of updating the belief distribution on the space of parameters, and such a choice is often subjective and depends on the kind of feature one desires to highlight from the data. Moreover, the purpose is that the successive updated belief distributions, called posterior distributions, either converge or concentrate around the unknown targeted parameters. We refer the reader to [1, 21, 22, 49, 51, 54] for more information on the classical Bayesian inference formalism.

The Bayesian inference in the context of observations arising from dynamical systems faces some natural challenges. The first one is that the process of taking time series (via Birkhoff's theorem) lacks independence: if \(T: (Y,\nu )\rightarrow (Y,\nu )\) is a measure preserving map and \(\phi : Y\rightarrow {\mathbb {R}}\) is an observable then the sequence of random variables \((\phi \circ T^n)_{n\geqslant 1}\) is identically distributed but the random variables are not even pairwise independent. The second one concerns the choice of the loss function used to update beliefs on the space of parameters. From the Physics and the Dynamical Systems viewpoints it is natural that loss functions should value some of the geometric or chaotic properties of the dynamical system, identified either in terms of Lyapunov exponents, joint spectral radius of matrix cocycles, entropy or estimates on the Carathéodory, box-counting or Hausdorff dimension of repellers and attractors, with applications in wavelets and multifractal analysis, just to mention a few. These concepts, central in mathematical physics (see e.g. [4,5,6, 8, 10, 20, 27, 29, 32, 33] and references therein), appear naturally as limits of Birkhoff averages of potentials, or of sub-additive or almost-additive sequences of potentials (or several other versions of non-additivity, to be defined in Sect. 3.2). As a first example, if T is a \(C^1\)-smooth volume preserving and ergodic diffeomorphism on a surface then its largest Lyapunov exponent is

given by the random product of \(SL(2,{\mathbb {R}})\)-matrices as

$$\begin{aligned} \lambda _+(T,\text {Leb})=\lim _{n\rightarrow \infty } \frac{1}{n} \log \Vert A(T^{n-1}(y))\dots A(T(y)) A(y)\Vert \end{aligned}$$

for Lebesgue almost every \(y\in Y\), where \(A=DT: Y \rightarrow TY\) is the derivative cocycle. In general, the sequence of observables \(\Phi =(\varphi _n)_{n\geqslant 1}\) defined by \(\varphi _n(y)=\log \Vert A(T^{n-1}(y))\dots A(T(y)) A(y)\Vert \) is sub-additive and, in the special case that the linear cocycle has an invariant cone-field, this sequence is actually almost-additive (cf. [28]). A second example concerns the Shannon-McMillan-Breiman formula for entropy on one-sided subshifts of finite type \(\sigma :\Omega \rightarrow \Omega \), where the set \(\Omega \subset \{1,2,...,q\}^{\mathbb {N}}\) is a \(\sigma \)-invariant set determined by a transition matrix \(M_\Omega \in {\mathcal {M}}_{q\times q}(\{0,1\})\) and

$$\begin{aligned} h_{\mu }(\sigma )=\lim _{n\rightarrow \infty } -\frac{1}{n} \log \mu (C_n(x)), \quad \text { for } \mu \text {-a.e. }x \end{aligned}$$
(2)

where \(C_n(x)\subset \Omega \) denotes the n-cylinder set containing the sequence \(x=(x_1, x_2, x_3, \dots )\). The sequence of observables \(\Phi =(\varphi _n)_{n\geqslant 1}\) defined by \(\varphi _n(x)=-\log \mu (C_n(x))\), which is non-additive in general, is additive and almost-additive in the relevant classes of Bernoulli and Gibbs measures, respectively (see Lemma 3.3).
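As a sanity check of (2) in the additive (Bernoulli) case, the cylinder weights factorize and the limit can be simulated directly; the weight vector in the sketch below is an assumption made only for illustration.

```python
import numpy as np

# Sketch of the Shannon-McMillan-Breiman limit (2) for a Bernoulli measure on
# the full shift over three symbols; the weight vector p is an assumed example.
rng = np.random.default_rng(0)
p = np.array([0.2, 0.3, 0.5])
n = 200_000
x = rng.choice(len(p), size=n, p=p)      # a mu-typical sequence x
phi_n = -np.sum(np.log(p[x]))            # -log mu(C_n(x)) = -sum_j log p_{x_j}
print(phi_n / n)                         # approaches the entropy as n grows
print(-np.sum(p * np.log(p)))            # h(mu) = 1.0297... in closed form
```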

Finally, it is worth mentioning that sub-additive and almost-additive sequences also appear naturally in applications to several other areas of knowledge, for instance in the study of factorial languages by Thue, Morse and Hedlund at the beginning of the twentieth century (see [52] and references therein).

In this article, inspired by the relevant physical quantities arising from non-additive sequences of potentials, we will establish a bridge between the non-additive thermodynamic formalism of dynamical systems and Gibbs posterior inference in statistics (to be defined in Sect. 1.2 below), two areas of research in connection with statistical physics. We refer the interested reader to the introduction of [47] for a careful and wonderful exposition on the link between Bayesian inference and thermodynamic formalism, and a list of cornerstone contributions. We will mostly be interested in the parametric formulation of Bayesian inference, as described below. Let \(\sigma : \Omega \rightarrow \Omega \) be a subshift of finite type. This will serve as the underlying dynamical system, with respect to which samplings are obtained along its finite orbits \(\{y, \sigma (y), \dots , \sigma ^{n-1}(y)\}\), \(y\in \Omega \). We take a family of Gibbs probability measures \(\{\mu _\theta \}_{\theta \in \Theta }\) as the models in the inference procedure because of their relevance and ubiquity in the thermodynamic formalism of dynamical systems; they are also of crucial importance in several other fields, such as the study of the randomness of time-series, decision theory, quantum information and information gain, just to mention a few (cf. [2, 13, 30, 36, 46]). In our context, Gibbs measures appear as fixed points of the dual of certain transfer operators. Let us be more precise. For any Lipschitz continuous potential \(A: \Omega \rightarrow {\mathbb {R}}\), the Ruelle-Perron-Frobenius transfer operator associated to A is defined by

$$\begin{aligned} {\mathcal {L}}_A (\varphi ) (x) = \sum _{\sigma (y) =x} e^{A(y)} \varphi (y). \end{aligned}$$

The potential A is called normalized if \( {\mathcal {L}}_A(1)=1\), and in this case, it is natural to write \(A =\log J\), and we call J the Lipschitz continuous Jacobian. A Gibbs measure \(\mu \) is any \(\sigma \)-invariant probability measure obtained as a fixed point of the dual operator \( {\mathcal {L}}_{\log J}^*\) acting on the space of probability measures on \(\Omega \), for some Lipschitz continuous and normalized Jacobian J. In this way, it is natural to parametrize Gibbs probabilities by the space of normalized Lipschitz continuous Jacobians J, hence this space can be regarded as an infinite dimensional Riemannian analytic manifold [35, 45, 46]. Invariant Gibbs measures are equilibrium states, namely they satisfy a variational relation (cf. Sect. 1.3 for more details). Given a prior probability measure \(\Pi _0\) on the space \(\Theta \) of parameters and sampling according to a Gibbs measure \(\mu _{\theta _0}\), the posterior probability (i.e. updated belief distribution) is determined using the loss functions \( \ell _n : \Theta \times \Omega \times \Omega \rightarrow {\mathbb {R}}, \) where \(\ell _n(\theta ,x,y)\) encodes the information on the parameter \(\theta \) accumulated along the sampling \(\{y, \sigma (y), \dots , \sigma ^{n-1}(y)\}\) and influenced by the measurements along the orbit \(\{x, \sigma (x), \dots , \sigma ^{n-1}(x)\}\). The Shannon-McMillan-Breiman formula  (2) suggests the use of loss functions to collect the information of the measure on cylinder sets in \(\Omega \) (cf. expressions (4),  (5) and  (9) below). The relative entropy, also called Kullback-Leibler divergence and defined by (27), compares the measures of cylinders with respect to two different Gibbs measures. This notion is of paramount importance in Physics and will be used to interconnect log likelihood inference with the direct observation analysis of Gibbs probability measures. Our main results guarantee posterior consistency for certain classes of loss functions determined by almost-additive sequences of potentials: the posterior distributions asymptotically concentrate around the unknown targeted parameter \(\theta _0\), often with exponential speed (we refer the reader to Theorems A,  B and  C for the precise statements). The main ingredient to obtain quantitative estimates on the convergence to the parameter \(\theta _0\) is the use of large deviations for non-additive sequences of potentials [57].
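In the simplest situation of a full shift on two symbols and a potential A depending only on the first two symbols, the operator \({\mathcal {L}}_A\) preserves the two-dimensional space of functions of the first symbol and all the objects above can be written down explicitly; the sketch below, with assumed numerical values, checks the normalization \({\mathcal {L}}_A(1)=1\) and exhibits the dual fixed point.

```python
import numpy as np

# Sketch: full shift on {0,1}, potential A depending on the first two symbols.
# Write Q[i, j] = e^{A(y)} for y in the 2-cylinder [i, j], where i is the
# prepended symbol and j the first symbol of x; then L_A acts on functions of
# the first symbol as the matrix Q^T.  The entries of Q are assumed values.
Q = np.array([[0.7, 0.4],
              [0.3, 0.6]])

# normalization L_A(1) = 1: summing Q[i, j] over the prepended symbol i gives 1
print(Q.sum(axis=0))                     # [1. 1.]

# dual fixed point L_A^* mu = mu: its marginal pi on 1-cylinders solves Q pi = pi,
# i.e. mu is the stationary Markov (Gibbs) measure with these weights
eigvals, eigvecs = np.linalg.eig(Q)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi /= pi.sum()
print(pi, Q @ pi)                        # Q pi = pi, so mu([i]) = pi[i]
```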

Our results are strongly inspired by, and should be compared with, those by McGoff, Mukherjee and Nobel [47], where the authors established posterior consistency of (hidden) Gibbs processes on mixing subshifts of finite type using properties of Gibbs measures. For that purpose, they consider a more general framework, where the dynamical system \(T: Y \rightarrow Y\) on a Polish space does not necessarily coincide with the subshift of finite type \(\sigma : \Omega \rightarrow \Omega \). In particular, the sampling is determined by a T-invariant and ergodic probability measure \(\nu \), which could be unrelated to the Gibbs measures \(\{\mu _\theta \}_{\theta \in \Theta }\) for the shift. If the loss functions are additive (i.e. \(\ell _n=\sum _{j=0}^{n-1} \ell (\theta , \sigma ^j(x), T^j(y))\) for some function \(\ell : \Theta \times \Omega \times Y \rightarrow {\mathbb {R}}\) satisfying a mild regularity condition) then the main results in [47] ensure that it is possible to formulate the problem as a limiting variational problem and to identify the parameters, obtained as minimizing parameters of a lower semicontinuous function \(V:\Theta \rightarrow {\mathbb {R}}\), for which the posterior consistency holds: if \(\Theta _{\min } =\text {argmin}_{\theta \in \Theta } V(\theta )\) then the posterior distributions \(\Pi _n(\cdot \mid y)\), defined by  (6), satisfy

$$\begin{aligned} \lim _{n \rightarrow \infty } \Pi _n(\Theta \setminus U\,|y) =0 \end{aligned}$$

for each open neighborhood U of \(\Theta _{\min }\) and for \(\nu \)-almost every \(y \in {Y}\) (cf. [47, Theorem 2]). The proof of this result requires the use of joinings (or couplings) of the model system and the observed system, and results on fibered entropy. Our framework corresponds to the special case of direct observation, in which the dynamical system T coincides with the subshift of finite type \(\sigma \) and the target parameter is a single \(\theta _0\in \Theta \), with the subtler difference that our assumptions ensure that \(\mu _\theta \ne \mu _{{\tilde{\theta }}}\) for every distinct \(\theta ,{\tilde{\theta }}\in \Theta \). Our results complement the ones in [47] in the sense that the information can be collected by more general loss functions \(\ell _n\). Furthermore, the more direct use of large deviation techniques allows us to prove an exponential speed of convergence in the posterior consistency (cf. Theorem A), which was not known even in the context of direct observation (cf. [47, Theorem 2 and Remark 8]). Summarizing, the three main novelties are the extension to non-additive loss functions, the exponential rate of convergence, and a proof which is not based on joinings and fiber entropy. It is also worth noticing that, more recently, Su and Mukherjee [55] used a large deviations approach for posterior consistency as well, relying on Varadhan's large deviation principle for stochastic processes. A different point of view of the Bayesian a priori and a posteriori formalism will appear in [26], where results on thermodynamic formalism for plans are used (see [42, 43]). In [36] the author considered log-likelihood estimators in classical thermodynamic formalism; there the inference concerns Hölder potentials and not probabilities.

To finalize, one should mention that there is increasing interest in exploring the strong connection between Statistical Inference and Physics in general. There are several such connections, including a Bayesian approach to the dynamics of the classical ideal gas [58, Section 31.3] and prior sensitivity, in the Bayesian model selection context, for some galaxy data sets [11]. In the monograph [13], the author clarifies the conceptual foundations of Physics by deriving the fundamental laws of statistical mechanics and of quantum mechanics as examples of inductive inference, while also advocating that, in view of the fact that models may need to change as time evolves, it may be the case that all areas of Physics can be modeled using inductive inference.

1.2 Gibbs Posterior Inference

According to the Gibbs posterior paradigm [7, 37], the beliefs should be updated according to the Gibbs posterior distribution. Let us recall the formulation of this posterior measure following [47].

1.2.1 Observed System

Assume that Y is a complete and separable metric space and that \(T:Y \rightarrow Y\) is a Borel measurable map endowed with a T-invariant, ergodic probability measure \(\nu \). This dynamical system represents the observed system and will be used to update information for the model. This is the analogue of the data in the context of Statistics. The updated belief, given by the a posteriori measure, is obtained by feeding data obtained from the observed system on a model by means of a loss function.

1.2.2 Model Families

Consider a transitive subshift of finite type \(\sigma :\Omega \rightarrow \Omega \) where \(\sigma \) denotes the right-shift map, acting on a compact invariant set \(\Omega \subset \{1,2,...,q\}^{\mathbb {N}}\) determined by a transition matrix \(M_\Omega \in {\mathcal {M}}_{q\times q}(\{0,1\})\). The map \(\sigma \) presents different statistical behaviors (e.g. measured in terms of different convergences for Cesàro averages of continuous observables) according to any of its equilibrium states associated to Lipschitz continuous observables, each of which satisfies a Gibbs property (see e.g. Remark 1 in [48, Section 2] or [41]).

Consider a compact metric space \(\Theta \) and a family of \(\sigma \)-invariant probability measures

$$\begin{aligned} {\mathscr {G}}= \big \{\mu _\theta :\theta \in \Theta \big \} \end{aligned}$$

so that: (i) for every \(\theta \in \Theta \) the probability measure \(\mu _\theta \) is a Gibbs measure associated to a Lipschitz continuous potential \(f_\theta : \Omega \rightarrow {\mathbb {R}}\), that is, there exists \(K_\theta >1\) and \(P_\theta \in {\mathbb {R}}\) so that

$$\begin{aligned} \frac{1}{K_\theta } \leqslant \frac{\mu _\theta (C_n(x))}{e^{-n P_\theta + S_nf_\theta (x)}} \leqslant K_\theta , \qquad \forall n\geqslant 1, \end{aligned}$$
(3)

where \(S_n f_\theta =\sum _{j=0}^{n-1} f_\theta \circ \sigma ^j\) and \(C_n(x)\subset \Omega \) denotes the n-cylinder set in the shift space \(\Omega \) containing the sequence \(x=(x_1, x_2, x_3, \dots )\); and (ii) the family \(\Theta \ni \theta \mapsto f_\theta \) is continuous (in the Lipschitz norm). We assume Gibbs measures to be normalized, hence probability measures. It is well known that the previous conditions ensure the continuity of the pressure function \(\Theta \ni \theta \mapsto P_\theta \) and of the map \(\Theta \ni \theta \mapsto \mu _\theta \) (in the weak\(^*\) topology) [48]. In particular, one can take a uniform constant \(K>0\) in  (3). The problem to be considered here involves a formulation and analysis of an iterative procedure (based on sampling and updated information) on the family \({\mathscr {G}}\) of models.
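The simplest instance of (3) is a Bernoulli measure on the full shift: taking \(f_\theta (x)=\log p_{x_1}\) gives a normalized potential with \(P_\theta =0\) and \(\mu _\theta (C_n(x))=e^{S_nf_\theta (x)}\), so (3) holds with \(K_\theta =1\). A minimal numerical check, with an assumed weight vector:

```python
import numpy as np

# Check of the Gibbs property (3) for a Bernoulli measure on the full shift:
# with f(x) = log p_{x_1} one has P = 0 and mu(C_n(x)) = exp(S_n f(x)) exactly,
# so (3) holds with K = 1.  The weights p and the word x are assumed examples.
p = np.array([0.2, 0.3, 0.5])
x = np.array([2, 0, 1, 1, 2, 0])         # the first n = 6 symbols of a point x
mu_cylinder = np.prod(p[x])              # mu(C_n(x))
S_n_f = np.sum(np.log(p[x]))             # Birkhoff sum S_n f(x)
print(mu_cylinder, np.exp(S_n_f))        # identical up to rounding
```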

1.2.3 Loss Functions and Gibbs Posterior Distributions

Consider the product space \(\Theta \times \Omega \) endowed with the metric d defined as \( d(\,(\theta , x),(\theta ' , x ')\,)= \max \{ d_\Theta (\theta , \theta ') , d_\Omega (x, x')\, \}.\) A fully supported probability measure \(\Pi _0\) on \(\Theta \) describes the a priori uncertainty on the Gibbs measure.

Given such an a priori probability measure \(\Pi _0\) on the space of parameters \(\Theta \) and a sample of size n (determined by the observed system T) we will get the a posteriori probability measure \(\Pi _n\) on the space of parameters \(\Theta \), taking into account the updated information from the data. More precisely, given \(\Pi _0\) and a family \((\mu _\theta )_{\theta \in \Theta }\), consider the probability measure \(P_0\) on the product space \(\Theta \times \Omega \) given by

$$\begin{aligned} P_0 (E)= \int \int {\mathbf {1}}_E (\theta ,x) \,d\mu _\theta (x) \,d\Pi _0 (\theta ) \end{aligned}$$

for all Borel sets \(E \subset \Theta \times \Omega \). In other words, \(P_0\) has the a priori measure \(\Pi _0\) as marginal on \(\Theta \) and admits a disintegration on the partition by vertical fibers where the fibered measures are exactly the Gibbs measures \((\mu _\theta )_{\theta \in \Theta }\). There is no action of the dynamics T on this product space; instead, the a posteriori measures are defined using loss functions. For each \(n \in {\mathbb {N}}\) consider a continuous loss function \(\ell _n\) of the form

$$\begin{aligned} \ell _n : \Theta \times \Omega \times Y \rightarrow {\mathbb {R}}, \end{aligned}$$

and consider the measure \(P_n\) on \(\Theta \times \Omega \) given by

$$\begin{aligned} P_n (E\mid y)= \int \int {\mathbf {1}}_E (\theta ,x) e^{ - \, \ell _n (\theta , x,y)}\,d\mu _\theta (x) \,d\Pi _0 (\theta ) \end{aligned}$$
(4)

for all Borel sets \(E \subset \Theta \times \Omega \), and set

$$\begin{aligned} Z_n(y)=\int _\Theta \int _\Omega \, e^{ - \, \ell _n (\theta , x,y)}\,d \mu _\theta (x)\, d \Pi _0 (\theta ), \end{aligned}$$
(5)

where \(x=(x_1,x_2,...,x_n,\dots )\in \Omega \) and \(y\in Y\). In the special case that \(Y=\Omega \), that \(-\ell _n: \Theta \times \Omega \times \Omega \rightarrow {\mathbb {R}}\) coincides with an n-Birkhoff sum of a fixed observable \(\psi \) with respect to T, and that \(\Pi _0\) is a Dirac measure, the expression  (5) resembles the partition function in statistical mechanics, whose exponential asymptotic growth rate coincides with the topological pressure of T with respect to \(\psi \).

Given \(y \in Y\) and \(n\geqslant 1\), the a posteriori Borel probability measure \(\Pi _n (\cdot \,|\, y)\) on the parameter space \(\Theta \) (at time n and determined by the sample of y) is defined by

$$\begin{aligned} \,\Pi _n (B \,|\, y) =\frac{1}{Z_n(y)} \int _B \int _\Omega {e^{ - \, \ell _n (\theta , x,y)}d \mu _\theta (x)}d \Pi _0 (\theta )\,, \end{aligned}$$
(6)

for every measurable \(B\subset \Theta \), and appears as the marginal on \(\Theta \) of the probability measure \(\frac{1}{Z_n(y)}P_n(\cdot \mid y)\) given above.

The general question is to describe the family of probability measures \(\,\Pi _n (\cdot \,|\, y)\) on the parameter space \(\Theta \), namely whether they converge, and to formulate the locus of convergence in terms of some variational principle or as points of maximization of a certain function (see e.g. [47, Theorem 2] for a context where the loss functions are chosen so that such measures concentrate on the minimization locus of a certain rate function).

The main problem we are interested in is to understand whether a sampling process according to a fixed probability measure can help to identify it through a recursive process involving Bayesian inference. Assume that \(Y=\Omega \), that \(T=\sigma \) is the shift and that one is interested in a specific probability measure \(\mu _{\theta _0}\in {\mathscr {G}}\), where \(\theta _0 \in \Theta \). If \(\nu =\mu _{\theta _0}\) then the sampling \(\{y, T(y), T^2(y), \dots , T^{n-1}(y)\}\) is distributed according to this probability measure. From the Birkhoff time series it is possible to successively update the initial a priori probability measure \(\Pi _0\) in order to get a sequence of probability measures \(\Pi _n(\cdot \mid y)\) on \(\Theta \) (the a posteriori probability measure at time n) as described above. We ask the following:

\(\circ \) Does the limit \(\lim _{n \rightarrow \infty } \Pi _n\) exist?

\(\circ \) If the previous question has an affirmative answer:

\(\circ \) is it the Dirac measure \(\delta _{\theta _0}\) on \(\theta _0\in \Theta \)?

\(\circ \) is it possible to estimate the speed of convergence to the limiting measure?

In this paper we answer the previous questions for loss functions that do not necessarily arise from Birkhoff averaging but retain some almost-additivity property. For that reason our approach makes use of results from non-additive thermodynamic formalism, hence it differs from the one considered in [47]. We refer the reader to [16] for a related work which does not involve Bayesian statistics.
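Before describing the organization of the paper, the inference loop behind these questions can be illustrated numerically in the simplest direct-observation situation: a full shift on two symbols, a finite grid of Bernoulli models \(\mu _\theta \), a uniform prior, sampling from \(\mu _{\theta _0}\) and the cylinder loss (9) below, for which \(\Pi _n(\cdot \mid y)\) is proportional to \(\mu _\theta (C_n(y))\,\Pi _0(d\theta )\) (cf. (11)). The grid, the prior and the sampling parameter in the sketch are assumptions made only for illustration.

```python
import numpy as np

# Toy version of the direct-observation inference loop: models
# mu_theta = Bernoulli(theta, 1 - theta) on a finite grid, uniform prior,
# sampling from mu_{theta_0}; the posterior weight of theta is proportional
# to mu_theta(C_n(y)).  Grid, prior and theta_0 are illustrative assumptions.
rng = np.random.default_rng(1)
grid = np.linspace(0.05, 0.95, 19)                 # the parameter grid Theta
theta0 = 0.30                                      # sampling parameter
n = 2_000
y = rng.choice(2, size=n, p=[theta0, 1 - theta0])  # y sampled from mu_{theta_0}
zeros = np.count_nonzero(y == 0)
log_cyl = zeros * np.log(grid) + (n - zeros) * np.log(1 - grid)  # log mu_theta(C_n(y))
post = np.exp(log_cyl - log_cyl.max())             # stable unnormalized posterior
post /= post.sum()
print(grid[np.argmax(post)])                       # typically 0.30, i.e. theta_0
print(post[np.abs(grid - theta0) < 0.06].sum())    # posterior mass near theta_0, close to 1
```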

This paper is organized as follows. In the rest of this first section we formulate the precise setting we are interested in and state the main results. In Sect. 2 we present several examples and applications of our results. Section 3 is devoted to some preliminaries on relative entropy, large deviations and non-additive thermodynamic formalism. Finally, the proofs of the main results are given in Sect. 4.

1.3 Setting and Main Results

Let \(\sigma :\Omega \rightarrow \Omega \) be a subshift of finite type endowed with the metric \(d_\Omega (x,y)=2^{-n(x,y)}\), where \(n(x,y)=\inf \{n\geqslant 1:x_n\ne y_n\}\), and denote by \({\mathcal {M}}_\sigma (\Omega )\) the space of \(\sigma \)-invariant probability measures. The space \({\mathcal {M}}_\sigma (\Omega )\) is metrizable and we consider the usual topology on it (compatible with weak\(^*\) convergence). Let \(D_\Omega \) be a metric on \({\mathcal {M}}_\sigma (\Omega )\) compatible with the weak\(^*\) topology. The set \({\mathcal {G}}\subset {\mathcal {M}}_\sigma (\Omega )\) of Gibbs measures for Lipschitz continuous potentials is dense in \({\mathcal {M}}_\sigma (\Omega )\) (see for instance [39]). Given a Lipschitz continuous potential \(A:\Omega \rightarrow {\mathbb {R}}\) we denote by \(\mu _A\) the associated Gibbs measure. We say that the Lipschitz continuous potential \(A:\Omega \rightarrow {\mathbb {R}}\) is normalized if \({\mathcal {L}}_A (1)=1\), where

$$\begin{aligned} {\mathcal {L}}_A :\text {Lip}(\Omega ,{\mathbb {R}}) \rightarrow \text {Lip}(\Omega ,{\mathbb {R}}) \quad \text {given by}\quad {\mathcal {L}}_A g(x)=\sum _{\sigma (y)=x} \, e^{A(y)} \, g(y) \end{aligned}$$

is the usual Ruelle-Perron-Frobenius transfer operator (cf. [48, Chapter 2]). We will always assume that potentials are normalized and write \(J=e^A>0\) (or alternatively \(A=\log J\)) as the Jacobian of the associated probability measure \(\mu _A=\mu _{\log J}.\) That is, \({\mathcal {L}}_{\log J}^* (\mu _{\log J})= \mu _{\log J}\) and, equivalently, \(\mu _{\log J}(\sigma (E)) = \int _E J \, d\mu _{\log J}\) for every measurable set \(E\subset \Omega \) so that \(\sigma \mid _E\) is injective. We consider the Lipschitz norm \(|\,.\,|=\Vert \cdot \Vert _\infty +|\cdot |_{Lip}\) on the space of Lipschitz continuous potentials A, where \( |A|_{Lip} = \sup _{x\ne y} \frac{|A(x)-A(y)|}{d_\Omega (x,y)}. \) Moreover, it is a classical result in thermodynamic formalism (see e.g. [48]) that the following variational principle holds

$$\begin{aligned} \sup _{\mu \in {\mathcal {M}}_\sigma (\Omega )} \Big \{\, h(\mu ) \,+\, \int \log J \, d \mu \,\Big \}= 0 \end{aligned}$$
(7)

for any Lipschitz and normalized potential \(\log J\). A particularly relevant context is given by the space of stationary Markov probability measures on shift spaces (cf. Example 2.1). One should emphasize that, replacing the metric on \(\Omega \), it is possible to deal instead with the space of Hölder continuous potentials (cf. [48, Chapter 1]).

In the direct observation context, the sampling in the Bayesian inference is determined by \(T=\sigma \) and a fixed T-invariant Gibbs measure \(\nu \) on \(\Omega \) associated to a normalized potential \(\log J\). The sampling describes the interaction (expressed in terms of the loss functions) with certain families of potentials (and Gibbs measures) parameterized over a compact set. More precisely, consider the set of parameters \(\Theta \subset {\mathbb {R}}^k\) of the form

$$\begin{aligned} \Theta = [a_1,b_1] \times [a_2,b_2] \times ...\times [a_k,b_k], \end{aligned}$$

endowed with the metric \(d_\Theta \) given by \(d_\Theta (\theta _1,\theta _2) = \,\Vert \theta _1 - \theta _2\Vert \), \(\forall \theta _1,\theta _2\in \Theta \), and denote by \(f:\Theta \rightarrow {\mathcal {G}}\subset {\mathcal {M}}_\sigma (\Omega )\) a continuous family of potentials parameterized over \(\Theta \) such that:

(1) f is a homeomorphism onto its image;

(2) for each \(\theta \) the potential \(f(\theta )\) is normalized (we use the notation \(f(\theta ) = \log J_\theta \)).

The assumptions guarantee that for each \(\theta \in \Theta \) there exists a unique invariant Gibbs measure \(\mu _\theta \) with respect to the associated normalized potential \(f(\theta )\), and that these vary continuously in the weak\(^*\) topology. Moreover, as the parameter space \(\Theta \) is compact and \(f:\Theta \rightarrow {\mathcal {G}}\) is a continuous function (expressed in the form \(f(\theta )= \log J_\theta \), where f is a continuous function on \(\theta \in \Theta \) and \(J_\theta >0\)), we deduce that the quotient

$$\begin{aligned} \frac{ J_{\theta _1} (x) }{J_{\theta _2} (x) }>0 \end{aligned}$$
(8)

is uniformly bounded for every \(x \in \Omega \) and all \(\theta _1,\theta _2 \in \Theta .\)

Remark 1.1

At this moment we are not requiring the probability measure \(\nu \) of the observed system \(Y=\Omega \) to belong to the family of probability measures \((\mu _\theta )_{\theta \in \Theta }\). We refer the reader to Example 2.5 for an application in the special case that \(\nu =\mu _{\theta _0}\), for some \(\theta _0\in \Theta \).

The statistics is described by an a priori Bayes probability measure \(\Pi _0\) on the space of parameters \(\Theta \) satisfying Hypothesis A:

$$\begin{aligned} \Pi _0 (d z_1, d z_2,...,d z_k)=\Pi _0 (d \theta ) \quad&\text {is a fixed continuous strictly positive density } \\&\text {fully supported on the compact set }\Theta . \end{aligned}$$
(A)

In many examples the a priori measure appears as the Lebesgue or an equidistributed measure on the parameter space. We refer the reader to Sect. 2 for examples.

The previous full support assumption not only expresses the uncertainty on the choice of the parameters, but also ensures that all parameters in \(\Theta \) will be taken into account in the inference independently of the initial belief (the distribution \(\Pi _0\)). In this case of direct observation of Gibbs measures, let \(\theta _0\in \Theta \) be fixed. The probability measure \(\mu _{\theta _0}\) will play the role of the measure \(\nu \) (on the observed system Y) considered abstractly in the previous subsection. We will consider the loss functions \(\ell _n : \Theta \times \Omega \times \Omega \rightarrow {\mathbb {R}}\), \(n\geqslant 1\), given by

$$\begin{aligned} \ell _n(\theta , x,y)= {\left\{ \begin{array}{ll} \log \Big ( \mu _{\theta _0} \,( C_n(y) )\, \Big ) &{}, \text {if } x\in C_n(y), \\ +\infty &{}, \text {if } x\not \in C_n(y). \end{array}\right. } \end{aligned}$$
(9)

If one denotes by \({\mathbf {1}}_{C_n(y)}\) the indicator function of the n-cylinder set centered at y and defined by \({C_n(y)}=\{(x_j)_{j\geqslant 1} :x_j=y_j, \forall 1\leqslant j \leqslant n\}\), such choice of loss functions ensures that

$$\begin{aligned} Z_n(y)&= \int _\Theta \int e^{ - \, \ell _n (\theta , x,y)} \,d \mu _\theta (x)\, d \Pi _0 (\theta ) = \int _\Theta \int _{C_n(y)} e^{ - \, \ell _n (\theta , x,y)} \,d \mu _\theta (x)\, d \Pi _0 (\theta ) \\&= \int _\Theta \int \frac{ \,\,{\mathbf {1}}_{C_n(y)} (x )\,}{\mu _{\theta _0} \,(\,C_n(y)\,) } \,d \mu _\theta (x)\, d \Pi _0 (\theta ) = \int _\Theta \, \frac{\mu _{\theta } \,(\,C_n(y) \,)}{\mu _{\theta _0} \,(\,C_n(y)\,) } \,d \Pi _0 (\theta ) \end{aligned}$$

for each \(y\in Y\). Therefore, using equalities (25) and (27) (see Sect. 3.1 below), Jensen's inequality and the monotone convergence theorem, one obtains that

$$\begin{aligned} \limsup _{n \rightarrow \infty } \frac{1}{n} \log Z_n(y) \,\,&= \limsup _{n \rightarrow \infty } \frac{1}{n} \log \int _\Theta \frac{\mu _{\theta } \,(\,C_n(y) \,)}{\mu _{\theta _0} \,(\,C_n(y)\,) } d \Pi _0 (\theta ) \nonumber \\&\geqslant \limsup _{n \rightarrow \infty } \frac{1}{n} \int _\Theta \log \frac{\mu _{\theta } (C_n(y) )}{\mu _{\theta _0} (C_n(y)) } \, d \Pi _0 (\theta ) \nonumber \\&= - \int _\Theta h(\mu _{\theta _0} \mid \mu _{\theta } ) \, d \Pi _0 (\theta ) \nonumber \\&= \int _{\Theta } \Big [\, h( \mu _{\theta _0}) + \int _\Omega \log J_{\theta } \, d \mu _{\theta _0} \,\Big ] \,d \Pi _0 (\theta ) \end{aligned}$$
(10)

for \(\mu _{\theta _0}\)-almost every y.

In this context of direct observation we are interested in estimating the family of a posteriori measures

$$\begin{aligned} {\Pi _n (E \mid y)\,=\, \frac{ \int _E \mu _\theta (C_n (y) ) \,d \Pi _0 (\theta ) }{ \int _\Theta \mu _\theta (C_n (y) ) \, d \Pi _0 (\theta )},} \end{aligned}$$
(11)

on Borel sets \(E \subset \Theta \) which do not contain \(\theta _0\), where \(y\in \Omega \) is a point chosen according to \(\mu _{\theta _0}\). An equivalent form of (11) which may be useful is

$$\begin{aligned} \Pi _n (E \mid y)\,=\, \frac{ \int _E \, \frac{\mu _{\theta } \,(\,C_n(y) \,)}{\mu _{\theta _0} \,(\,C_n(y)\,) } \,d \Pi _0 (\theta )}{ \int _\Theta \, \frac{\mu _{\theta } \,(\,C_n(y) \,)}{\mu _{\theta _0} \,(\,C_n(y)\,) } \,d \Pi _0 (\theta )}. \end{aligned}$$
(12)

Actually, given such a set \(E \subset \Theta \), one can ask whether the limit

$$\begin{aligned} \lim _{n \rightarrow \infty } \Pi _n (E \mid y) = \lim _{n \rightarrow \infty } \frac{ \int _E \mu _\theta (C_n (y) ) d \Pi _0 (\theta ) }{ \int _\Theta \mu _\theta (C_n (y) ) d \Pi _0 (\theta )} \end{aligned}$$
(13)

exists for \(\mu _{\theta _0}\)-almost every y. The following result gives an affirmative answer to this question.

Theorem A

In the previous context,

$$\begin{aligned} \lim _{n \rightarrow \infty } \Pi _n (\cdot \mid y) = \delta _{\theta _0}, \quad \text {for }\mu _{\theta _0}\text {-a.e. }y\in \Omega . \end{aligned}$$

Moreover the convergence is exponentially fast: for every \(\delta >0\) there exists a constant \(c_\delta >0\) so that the ball \(B_\delta \) of radius \(\delta \) around \(\theta _0\) satisfies \( | \Pi _n (B_\delta \mid y)-1 | \leqslant e^{ - c_\delta \,n} \) for every large \(n\geqslant 1\).

The previous result guarantees that the parameter \(\theta _0\), or equivalently the sampling measure \(\mu _{\theta _0}\), is identified as the limit of the Bayesian inference process determined by the loss function (9). This result arises as a consequence of the quantitative estimates in Theorem 4.1, given in the proofs section below. The direct observation of Gibbs measures was also considered in [47, Section 2.1], although with a different approach. For a parameterized family of loss functions of the form \(\beta \cdot \ell _n(\theta , x,y)\), the zero temperature limit (ground states) is also analyzed in Section 3.7 of [47]; this is a topic which can be associated to ergodic optimization. Our results are related in some sense to the so-called Maximum Likelihood Identification described in [14,15,16,17,18].

The previous context fits in the wider scope of non-additive thermodynamic formalism, using almost-additive sequences of continuous functions (see Sect. 3.2 for the definition). Indeed, the loss functions \((\ell _n)_{n\geqslant 1}\) described in  (9) form an almost-additive family (cf. Definition 3.2 and Lemma 3.3). Furthermore, we will consider loss functions \(\ell _n :\Theta \times \Omega \times Y \rightarrow {\mathbb {R}}\) which form an almost-additive sequence of continuous functions, and for which one can write

$$\begin{aligned} \ell _n(\theta , x,y) = - \varphi _n(\theta , x,y), \end{aligned}$$
(14)

where \(\varphi _n :\Theta \times \Omega \times Y \rightarrow {\mathbb {R}}_+\) are continuous observables satisfying:

(A1) for \(\nu \)-almost every \(y\in Y\) the following limit exists:

$$\begin{aligned} \Gamma ^y(\theta ):=\lim _{n\rightarrow \infty } \frac{1}{n} \log \int _\Omega e^{\varphi _n(\theta , x,y)} \, d\mu _\theta (x); \end{aligned}$$

(A2) \(\Theta \ni \theta \mapsto \Gamma ^y(\theta )\) is upper semicontinuous.

Given \(y\in Y\) and the loss functions \(\ell _n\) satisfying (A1)-(A2), the a posteriori measures are

$$\begin{aligned} \Pi _n (E \mid y)\,=\, \frac{ \int _E \int _\Omega e^{\varphi _n(\theta ,x,y)} \, d\mu _\theta (x) \, d \Pi _0 (\theta )}{ \int _\Theta \int _\Omega e^{\varphi _n(\theta ,x,y)} \, d\mu _\theta (x) \, d \Pi _0 (\theta )}. \end{aligned}$$
(15)

Remark 1.2

The expression appearing in assumption (A1), which resembles the logarithm of the moment generating function for i.i.d. random variables, is in special cases referred to as the free energy function. Consider the special case where \(T=\sigma \) is the shift, \(\nu \) is an equilibrium state with respect to a Lipschitz continuous potential \(\psi \) and \(\varphi _n(\theta , x,y) =\varphi _{n,1}(\theta , x) + \varphi _{n,2}(\theta ,y)\), where \(\varphi _{n,1}(\theta , x)=\sum _{j=0}^{n-1} \phi _\theta \circ \sigma ^j(x)\), \(\phi _\theta :\Omega \rightarrow {\mathbb {R}}\) is Lipschitz continuous and \((\varphi _{n,2}(\theta , \cdot ))_{n\geqslant 1}\) is sub-additive. Then using the fact that the pressure function defined over the space of Lipschitz continuous observables is Gateaux differentiable and the sub-additive ergodic theorem one obtains that

$$\begin{aligned} \Gamma ^y(\theta )&= \lim _{n\rightarrow \infty } \frac{1}{n} \log \int _\Omega e^{\sum _{j=0}^{n-1} \phi _\theta (\sigma ^j(x))} \, d\mu _\theta (x) + \inf _{n\geqslant 1} \frac{1}{n} \int \varphi _{n,2}(\theta , \cdot ) \, d\nu \\&= P_{\text {top}}(\sigma , \log J_\theta + \phi _\theta ) - P_{\text {top}}(\sigma , \log J_\theta ) + \inf _{n\geqslant 1} \frac{1}{n} \int \varphi _{n,2}(\theta , \cdot ) \, d\nu \\&= P_{\text {top}}(\sigma , \log J_\theta + \phi _\theta ) + \inf _{n\geqslant 1} \frac{1}{n} \int \varphi _{n,2}(\theta , \cdot ) \, d\nu , \end{aligned}$$

for \(\nu \)-almost every \(y\in \Omega \), hence it is independent of y. We refer the reader to Sect. 3.2 for the concept of topological pressure and further information.
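In the locally constant case the identity above can be checked by hand: on the full shift, if \(\mu _\theta \) is a Bernoulli measure with weights \(p_i\) (so that \(\log J_\theta (x)=\log p_{x_1}\)), \(\phi _\theta (x)=c_{x_1}\) and \(\varphi _{n,2}\equiv 0\), then the integral factorizes and both sides equal \(\log \sum _i p_i e^{c_i}\). The weights and the values below are assumptions made for illustration.

```python
import numpy as np

# Check of Remark 1.2 in the locally constant case: mu_theta Bernoulli with
# weights p, phi_theta(x) = c_{x_1}, and no sub-additive part.  Both the free
# energy and the pressure P_top(sigma, log J_theta + phi_theta) reduce to
# log(sum_i p_i e^{c_i}).  The vectors p and c are assumed examples.
p = np.array([0.2, 0.3, 0.5])
c = np.array([0.1, -0.4, 0.7])

n = 7
integral = np.sum(p * np.exp(c)) ** n        # int e^{S_n phi} d mu_theta (exact factorization)
print(np.log(integral) / n)                  # the free energy, the same for every n
print(np.log(np.sum(np.exp(np.log(p) + c)))) # P_top for the first-symbol potential log p + c
```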

The following result guarantees that the previous Bayesian inference procedure accumulates on the set of probability measures on the parameter space \(\Theta \) which are supported on the maximality locus of the free energy function \(\Gamma ^y\). By assumption (A2) the set \(\text {argmax} \,\Gamma ^y:=\{\theta _0\in \Theta :\Gamma ^y(\theta ) \leqslant \Gamma ^y(\theta _0), \forall \theta \in \Theta \}\) is non-empty. Then we prove the following:

Theorem B

Assume \(\ell _n\) is a loss function of the form  (14) satisfying assumptions (A1)-(A2). There exists a full \(\nu \)-measure subset \(Y'\subset Y\) so that, for any \(\delta >0\) and \(y\in Y'\),

$$\begin{aligned} \lim _{n \rightarrow \infty } \Pi _n (\Theta \setminus B_\delta ^y \mid y) = 0 \quad \text {where} \quad B_\delta ^y=\{\theta \in \Theta :d_\Theta \big (\,\theta , \text {argmax} \, \Gamma ^y\,\big )<\delta \} \end{aligned}$$
(16)

is the open \(\delta \)-neighborhood of the maximality locus of \(\Gamma ^y\). In particular, if \(y\in Y'\) is such that \(\Theta \ni \theta \mapsto \Gamma ^y(\theta )\) has a unique point of maximum \(\theta _0^y\in \Theta \) then \( \lim _{n\rightarrow \infty } \Pi _n(\cdot \mid y)=\delta _{\theta ^y_0}. \)

Finally, inspired by the log-likelihood estimators in the context of Bayesian statistics, it is also natural to consider the loss functions \(\ell _n :\Theta \times \Omega \times Y \rightarrow {\mathbb {R}}\) defined by

$$\begin{aligned} \ell _n(\theta , x,y) = -\log \varphi _n(\theta , x,y) \end{aligned}$$
(17)

associated to an almost additive sequence \(\Phi =(\varphi _n)_{n\geqslant 1}\) of continuous observables \(\varphi _n :\Theta \times \Omega \times Y \rightarrow {\mathbb {R}}_+\) satisfying

(H1) for each \(\theta \in \Theta \) and \(x\in \Omega \) there exists a constant \(K_{\theta ,x}>0\) so that, for every \(y\in Y\) and every \(m,n\geqslant 1\),

$$\begin{aligned}&\varphi _n(\theta ,x,y) + \varphi _m(\theta ,x,T^n(y)) - K_{\theta ,x} \leqslant \varphi _{m+n}(\theta ,x,y) \leqslant \varphi _n(\theta ,x,y)\\&\quad + \varphi _m(\theta ,x,T^n(y)) + K_{\theta ,x}; \end{aligned}$$

(H2) \(\int K_{\theta ,x} \, d\mu _{\theta }(x)<\infty \) for every \(\theta \in \Theta \).

In this context, the loss functions induce the a posteriori measures

$$\begin{aligned} \Pi _n (E \mid y)\,=\, \frac{ \int _E \psi _n(\theta ,y)\, d \Pi _0 (\theta )}{ \int _\Theta \psi _n(\theta ,y) \, d \Pi _0 (\theta )}, \quad \text {where}\quad \psi _n(\theta ,y)=\int _\Omega \varphi _n(\theta ,x,y) \, d\mu _\theta (x).\nonumber \\ \end{aligned}$$
(18)

Even though the loss functions are not almost-additive, due to the logarithmic term, we have the following result for these non-additive loss functions:

Theorem C

Assume that the loss function of the form  (17) satisfies assumptions (H1)-(H2) above. There exists a non-negative function \(\psi _*:\Theta \rightarrow {\mathbb {R}}_+\) (depending on \(\Psi ^\theta =(\psi _n(\theta ,\cdot ))_{n\ge 1}\)) so that for \(\nu \)-almost every \(y\in Y\) the a posteriori measures \((\Pi _n (\cdot \mid y))_{n\geqslant 1}\) are convergent and

$$\begin{aligned} \Pi _n (\cdot \mid y)\,=\, \frac{ \int _{\cdot } \psi _n(\theta ,y)\, d \Pi _0 (\theta )}{ \int _\Theta \psi _n(\theta ,y) \, d \Pi _0 (\theta )} \longrightarrow \Pi _*(\cdot ):=\frac{(\psi _* \Pi _0)(\cdot )}{(\psi _* \Pi _0)(\Theta )} \end{aligned}$$

as \(n\rightarrow \infty \). Moreover, if \(T=\sigma \) is a subshift of finite type, \(\nu \in {\mathcal {M}}_\sigma (\Omega )\) is a Gibbs measure with respect to a Lipschitz continuous potential \(\varphi \) and \(\inf _{\theta \in \Theta }\psi _*(\theta )>0\), then for each \(g\in C(\Theta , {\mathbb {R}})\) there exists \(c>0\) so that

$$\begin{aligned}&\limsup _{n \rightarrow \infty } \frac{1}{n}\,\, \log \, \nu ( \{\,y \in \Omega : \Big |\int g\, d\Pi _n (\cdot \mid y) - \int g\, d\Pi _*\Big |\geqslant \delta \}) \nonumber \\&\quad \leqslant \sup _{\theta \in \Theta } \sup _{\{\eta :|{{\mathcal {F}}}(\eta ,\Psi ^\theta ) -\psi _*(\theta )|\geqslant c \delta \}} \Big \{-P(\sigma ,\varphi )+h_\eta (\sigma ) + \int \varphi \, d\eta \Big \}, \end{aligned}$$
(19)

where \({{\mathcal {F}}} (\eta , \Psi ^\theta ) := \lim _{n \rightarrow \infty } \frac{1}{n} \int \psi _n(\theta ,\cdot ) \, d \eta .\) If, additionally, the map \(\Theta \ni \theta \mapsto {\mathcal {F}}(\eta ,\Psi ^\theta )\) is continuous for each \(\eta \in {\mathcal {M}}_\sigma (\Omega )\) then the right hand-side in  (19) is strictly negative.

The previous theorem ensures that, in the context of loss functions of the form  (17) satisfying properties (H1) and (H2) above, the a posteriori measures do converge exponentially fast to a probability measure on the parameter space which is typically fully supported. We refer the reader to Example 2.2 for more details in the special case where the loss function depends exclusively on one parameter.

Remark 1.3

For completeness, let us mention that the results by Kifer [38] suggest that level-2 large deviations estimates (i.e., the rate of convergence of \(\Pi _n (\cdot \mid y)\) to \(\Pi _*\) on the space of probability measures on \(\Theta \)) are likely to hold under the assumption that the limit \( \lim _{n\rightarrow \infty } \frac{1}{n} \log \int e^{\varphi _n} \,d\nu \) exists for all almost-additive sequences \(\Phi =(\varphi _n)_{n\geqslant 1}\) of continuous observables and defines a non-additive free energy function which is related to the non-additive topological pressure. This goes beyond the scope of our interest here.

2 Examples

In what follows we give some examples which illustrate the intuition and utility of the Bayesian inference and also the meaning of the a priori measures.

Example 2.1

The space of all stationary Markov probability measures \(\mu \) in \(\Omega =\{1,2\}^{\mathbb {N}}\) is described by the space of column stochastic transition matrices P with all positive entries. These matrices P can be parameterized by the open square \(\Theta =(0,1) \times (0,1)\) through the parameterization

$$\begin{aligned} M_{(a,b)}= \left( \begin{array}{cc} P_{11} &{} P_{12} \\ P_{21}&{} P_{22} \end{array} \right) =\left( \begin{array}{cc} a &{} 1-b \\ 1-a&{} b \end{array} \right) , \qquad (a,b) \in (0,1) \times (0,1). \end{aligned}$$

In this case the associated normalized Jacobian \(J_{(a,b)}(w)\) has constant value on cylinders of size two. More precisely, for w on the cylinder \([i, j]\subset \Omega \) we get \(J = \frac{\pi _i \,P_{i,j}}{\pi _j}\), where \((\pi _1,\pi _2)\) is the initial invariant probability vector. For each value (a, b) denote by \(\mu _{(a,b)}\) the stationary Markov probability measure associated to the stochastic matrix \( M_{(a,b)}\). In this case we get that \( h(\mu _{(a,b)}) \,+\, \int \log J_{(a,b)} \, d \mu _{(a,b)} =0\) and \({\mathcal {L}}_{\log J_{(a,b)}}^* (\mu _{(a,b)})= \mu _{(a,b)}\) (see [41, 53]). We refer the reader to [23,24,25, 56] for applications of the use of the maximum likelihood estimator in this context of Markov probability measures. One possibility would be to take the probability measure \(\Pi _0\) on the \(\Theta \) space as the Lebesgue probability measure on \((0,1) \times (0,1).\) Different choices of loss functions would lead to different solutions for the claim of Theorem B.
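A minimal numerical sketch of this parameterization, with an assumed pair (a, b): it builds the stochastic matrix, its stationary vector and the measure of cylinders, and checks stationarity and \(\sigma \)-invariance.

```python
import numpy as np

# Sketch of Example 2.1: from (a, b) build the column-stochastic matrix
# M_(a,b), its stationary vector pi and the cylinder weights
# mu([x_1 ... x_n]) = pi[x_1] * prod_k M[x_{k+1}, x_k].
# The values of (a, b) are an illustrative assumption.
a, b = 0.7, 0.4
M = np.array([[a, 1 - b],
              [1 - a, b]])                   # columns sum to 1
pi = np.array([1 - b, 1 - a]) / (2 - a - b)  # solves M pi = pi with pi_1 + pi_2 = 1

def mu(word):
    """mu_(a,b) of the cylinder determined by `word` (symbols 0 and 1)."""
    value = pi[word[0]]
    for current, nxt in zip(word, word[1:]):
        value *= M[nxt, current]
    return value

print(M @ pi, pi)                            # stationarity: M pi = pi
print(mu([0]), mu([0, 0]) + mu([0, 1]))      # consistency of cylinder weights
print(mu([0]), mu([0, 0]) + mu([1, 0]))      # sigma-invariance of mu on [0]
```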

The first examples below are very simple and illustrate some trivial contexts. Whenever the parameter space \(\Theta \) (or Y) is a singleton the Bayesian inference is trivial, hence it carries no information. The first example we shall consider is the one in which the loss function depends exclusively on a single variable. Nevertheless, as the loss functions are non-additive, these results could not be handled with the previous literature on the subject.

Example 2.2

Assume that \(\Theta \subset {\mathbb {R}}^d\) is a compact set, \(Y=\Omega \) and \(T=\sigma : \Omega \rightarrow \Omega \) is a subshift of finite type. In the case that the loss functions \(\ell _n :\Theta \times \Omega \times Y \rightarrow {\mathbb {R}}\) are generated by an almost-additive sequence of continuous observables \(\Phi =(\varphi _n)_{n\geqslant 1}\) via \( \ell _n(\theta , x,y) = -\log \varphi _n(y) \), hence are independent of \(\theta \) and x, the loss functions give no information on the parameter space. For that reason it is natural that the a posteriori measures are

$$\begin{aligned} \Pi _n (E \mid y)\,=\, \frac{ \int _E \varphi _n(y) \,d \Pi _0 (\theta ) }{ \int _\Theta \varphi _n(y) \, d \Pi _0 (\theta )} = \Pi _0 (E) \end{aligned}$$
(20)

for every sampling \(y, T(y), \dots , T^{n-1}(y) \in Y\).

Now, assuming alternatively that the loss function is given by \( \ell _n(\theta , x,y) = -\log \varphi _n(\theta ), \) which is independent of both x and y, a simple computation shows that

$$\begin{aligned} \Pi _n (E \mid y)\,=\, \frac{ \int _E \varphi _n(\theta ) \,d \Pi _0 (\theta ) }{ \int _\Theta \varphi _n(\theta ) \, d \Pi _0 (\theta )}. \end{aligned}$$
(21)

In this case the loss function neglects the observed dynamical system T, hence the a posteriori measures are independent of the sampling. Yet, as the family \(\Phi \) is almost-additive, it is easy to check that there exists \(C>0\) so that \(\{\varphi _n+C\}_{n\geqslant 1}\) is sub-additive. In particular, a simple application of Fekete's lemma (cf. Lemma 3.2) ensures that the limit \(\lim _{n\rightarrow \infty }\frac{\varphi _n(\theta )}{n}\) does exist and coincides with \(\varphi _*(\theta ):=\inf _{n\geqslant 1} \frac{\varphi _n(\theta )}{n}\), for every \(\theta \in \Theta \). In consequence,

$$\begin{aligned} \lim _{n\rightarrow \infty } \Pi _n (E \mid y)= \lim _{n\rightarrow \infty } \frac{ \int _E \frac{\varphi _n(\theta )}{n} \,d \Pi _0 (\theta ) }{ \int _\Theta \frac{\varphi _n(\theta )}{n} \, d \Pi _0 (\theta )} = \Pi (E):= \frac{ \int _E \varphi _*(\theta ) \,d \Pi _0 (\theta ) }{ \int _\Theta \varphi _*(\theta ) \, d \Pi _0 (\theta )}, \end{aligned}$$
(22)

independently of the sampling y. In particular the limit measure \(\Pi \) is fully supported on \(\Theta \) if and only if \(\varphi _*(\theta )>0\) for every \(\theta \in \Theta \).

Finally, given an almost-additive sequence of continuous observables \(\Phi =(\varphi _n)_{n\geqslant 1}\) on \(\Omega \), consider for each \(n\geqslant 1\) the loss function

$$\begin{aligned} \ell _n(\theta , x,y) = -\log \varphi _n(x). \end{aligned}$$

In this case a simple computation shows that one obtains a posteriori measures

$$\begin{aligned} \Pi _n (E \mid y)\,=\, \frac{ \int _E \psi _n(\theta )\, d \Pi _0 (\theta )}{ \int _\Theta \psi _n(\theta ) \, d \Pi _0 (\theta )}, \end{aligned}$$
(23)

where the sequence \(\psi _n(\theta )=\int _\Omega \varphi _n(x) \, d\mu _\theta (x)\) is almost additive. Indeed, the \(\sigma \)-invariance of \(\mu _\theta \) and the almost-additivity condition \( \varphi _{n}(x) + \varphi _{m}(\sigma ^n(x)) - C \leqslant \varphi _{m+n}(x) \leqslant \varphi _{n}(x) + \varphi _{m}(\sigma ^n(x)) +C \) ensure that \( \psi _{n}(\theta ) + \psi _{m}(\theta ) - C \leqslant \psi _{m+n}(\theta ) \leqslant \psi _{n}(\theta ) + \psi _{m}(\theta ) + C \) for every \(m,n\geqslant 1\) and \(\theta \in \Theta \). Hence, even though the feed of information is given through the x-variable, the a posteriori measures are of the form  (21), and their convergence is described by Lemma 3.2. In particular, this example shows that the situation is much simpler to describe when the loss functions depend exclusively on a single variable.

In the following two simple examples, we make explicit computations of the limit of the posterior distributions which show that assumption (A) on the space of parameters and on the a priori distribution cannot be removed. In particular, these examples show that the posterior distributions \(\Pi _n(\cdot \mid y)\) may converge, but not to a Dirac measure at a parameter \(\theta _0\) corresponding to the measure with respect to which the sampling occurs.

Example 2.3

Set \(\Theta =\{-1,1\}\), let \(T=\sigma : \{0,1\}^{{\mathbb {N}}} \rightarrow \{0,1\}^{{\mathbb {N}}}\) be the full shift and let \({\mathbb {B}}(a,b)\) denote the Bernoulli measure \(\nu \) with \(\nu [0]=a\) and \(\nu [1]=b\), where \(a+b=1\), \(0<a<1\). If \(\phi : \{0,1\}^{{\mathbb {N}}} \rightarrow {\mathbb {R}}\) is a locally constant normalized potential so that \(\phi \mid _{[0]}=c<0\) then it is not hard to deduce (see e.g. [9]) that \(\phi \mid _{[1]}=\log (1-e^c)\) and the unique equilibrium state for \(\sigma \) with respect to \(\phi \) is the probability measure \({\mathbb {B}}(e^{c}, 1-e^c)\). Assume that \(\mu _{-1}={\mathbb {B}}(\frac{1}{3},\frac{2}{3})\) and \(\mu _{1}={\mathbb {B}}(\frac{2}{3},\frac{1}{3})\), which are the unique equilibrium states for the potentials

$$\begin{aligned} \phi _{-1}(x):= {\left\{ \begin{array}{ll} -\log 3, x\in [0] \\ -\log \frac{3}{2}, x\in [1] \end{array}\right. } \quad \text {and}\quad \phi _{1}(x):= {\left\{ \begin{array}{ll} -\log \frac{3}{2}, x\in [0] \\ -\log 3, x\in [1], \end{array}\right. } \end{aligned}$$

respectively. Take \(\Pi _0=\frac{1}{2}\delta _{-1}+\frac{1}{2}\delta _{1}\) and \(\nu ={\mathbb {B}}(\frac{1}{2},\frac{1}{2})\) and notice that \(\nu \) does not belong to the family \((\mu _\theta )_{\theta }\). On the context of direct observation we are interested in describing the a posteriori measures

$$\begin{aligned} {\Pi _n (E \mid y)\,=\, \frac{ \int _E \mu _\theta (C_n (y) ) \,d \Pi _0 (\theta ) }{ \int _\Theta \mu _\theta (C_n (y) ) \, d \Pi _0 (\theta )},} \end{aligned}$$

for samplings over \(\nu \). The ergodic theorem ensures that

$$\begin{aligned} \lim _{n\rightarrow \infty }\frac{1}{n}\#\{0\leqslant j \leqslant n-1 :\sigma ^j(y) \in [0]\} = \lim _{n\rightarrow \infty }\frac{1}{n}\#\{0\leqslant j \leqslant n-1 :\sigma ^j(y) \in [1]\} = \frac{1}{2} \end{aligned}$$

for \(\nu \)-almost every y. The Bernoulli property of \(\mu _{\pm 1}\) then implies that, for \(\nu \)-a.e. y,

$$\begin{aligned} \frac{\mu _{1} (C_n (y) )}{\mu _{-1} (C_n (y) )} \rightarrow 1 \quad \text {as }n\rightarrow \infty \end{aligned}$$

and, consequently, the sequence of probability measures \(\Pi _n (\cdot \mid y)\) on \(\{-1,1\}\) is convergent as

$$\begin{aligned} \lim _{n\rightarrow \infty }\Pi _n (\{\pm 1\} \mid y) \,=\, \lim _{n\rightarrow \infty } \frac{ \mu _{\pm 1} (C_n (y) ) }{\mu _{-1} (C_n (y) ) + \mu _{1} (C_n (y) )} = \frac{1}{2}. \end{aligned}$$

In other words, \(\lim _{n\rightarrow \infty }\Pi _n (\cdot \mid y)=\frac{1}{2}\delta _{-1}+\frac{1}{2}\delta _{1}=\Pi _0\). This convergence reflects the fact that \(\int \phi _{-1}\, d\nu =\int \phi _{1}\, d\nu \). Finally, it is not hard to check that for any a priori measure \(\Pi _0=\alpha \delta _{-1}+(1-\alpha )\delta _{1}\) for some \(0<\alpha <1\) it still holds that \(\lim _{n\rightarrow \infty }\Pi _n (\cdot \mid y)=\Pi _0\).

Example 2.4

In the context of Example 2.3, assume that the sampling is done with respect to a non-symmetric Bernoulli measure \({\hat{\nu }}={\mathbb {B}}(\alpha ,1-\alpha )\) for some \(0<\alpha <\frac{1}{2}\). The ergodic theorem guarantees that, for \({\hat{\nu }}\)-a.e. y,

$$\begin{aligned} \frac{\mu _{1} (C_n (y) )}{\frac{2^{\alpha n}}{3^n}} \rightarrow 1 \quad \text {and}\quad \frac{\mu _{-1} (C_n (y) )}{\frac{2^{(1-\alpha ) n}}{3^n}} \rightarrow 1 \quad \text {as }n\rightarrow \infty \end{aligned}$$

and, consequently, \(\mu _{1} (C_n (y) ) / \mu _{-1} (C_n (y) ) \rightarrow 0\) as \(n\rightarrow \infty \). Altogether we get

$$\begin{aligned} \lim _{n\rightarrow \infty }\Pi _n (\{1\} \mid y) \,=\, \lim _{n\rightarrow \infty } \frac{ \mu _{ 1} (C_n (y) ) }{\mu _{-1} (C_n (y) ) + \mu _{1} (C_n (y) )} \,=\, \lim _{n\rightarrow \infty } \frac{ \frac{ \mu _{ 1} (C_n (y) ) }{\mu _{-1} (C_n (y) )} }{1+\frac{ \mu _{ 1} (C_n (y) ) }{\mu _{-1} (C_n (y) )}} =0 \end{aligned}$$

and \(\lim _{n\rightarrow \infty }\Pi _n (\{-1\} \mid y)=1\). In other words, \(\lim _{n\rightarrow \infty }\Pi _n (\cdot \mid y)=\delta _{-1}\) for \({\hat{\nu }}\)-almost every y, which reflects the fact that \(\int \phi _{1}\, d{\hat{\nu }}<\int \phi _{-1}\, d{\hat{\nu }}\).
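The conclusion of Example 2.4 is easy to confirm numerically; the value of \(\alpha \) in the sketch below is an assumed illustration.

```python
import numpy as np

# Numerical check of Example 2.4: sampling from B(alpha, 1 - alpha) with
# alpha < 1/2, the posterior mass at theta = 1 tends to 0.
rng = np.random.default_rng(2)
alpha, n = 0.3, 5_000
y = rng.choice(2, size=n, p=[alpha, 1 - alpha])     # a nu_hat-typical sample
zeros = np.count_nonzero(y == 0)
log_mu_plus = zeros * np.log(2 / 3) + (n - zeros) * np.log(1 / 3)   # log mu_{+1}(C_n(y))
log_mu_minus = zeros * np.log(1 / 3) + (n - zeros) * np.log(2 / 3)  # log mu_{-1}(C_n(y))
ratio = np.exp(log_mu_plus - log_mu_minus)          # -> 0 as n grows
print(ratio / (1 + ratio))                          # Pi_n({1} | y), essentially 0
```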

Example 2.5

Take \(\Theta =[0,1]\) and let \(\Pi _0\) be the Lebesgue measure. Take two normalized Lipschitz continuous Jacobians \(J_0,J_1 : \{1,2,...,q\} ^{\mathbb {N}} \rightarrow {\mathbb {R}}_+\) and consider the family of Lipschitz continuous potentials

$$\begin{aligned} f_\theta = \log J_\theta :=\log (\theta J_1 + (1-\theta ) J_0), \qquad \theta \in [0,1]. \end{aligned}$$

For each \(\theta \in [0,1]\) let \(\mu _\theta \) be the unique Gibbs measure associated to the Lipschitz continuous potential \(f_\theta \) (see also section 6 in [30] for a related work). Assume further that the observed probability measure associated to the sampling is \(\nu =\mu _{\theta _0}\) for some \(\theta _0\in [0,1]\).

The probability measure \(\Pi _0\) describes our ignorance of the exact value \(\theta _0\) among all possible choices \(\theta \in [0,1]\). For each \(n \in {\mathbb {N}}\) consider a continuous loss function \( \ell _n : \Theta \times \Omega \times Y \rightarrow {\mathbb {R}} \) expressed as

$$\begin{aligned} \ell _n(\theta , x, y) =- \sum _{j=0}^{n-1} \log J_{\theta } (\sigma ^j (x)) + \sum _{j=0}^{n-1} \log J_{\theta _0}(\sigma ^j (y)) + \, \theta \log \theta -\theta \log \theta _0\,. \end{aligned}$$

Similar expressions are often referred to as cross-entropy loss functions. By compactness of the parameter space \(\Theta \) we conclude that the third and fourth terms above are uniformly bounded, hence \((\ell _n)_{n\geqslant 1}\) forms an almost-additive family in the y-variable and fits in the context of Theorem B. In particular, we conclude that the a posteriori measures \(\Pi _n(\cdot \mid y)\) converge to the Dirac measure \(\delta _{\theta _0}\) as n tends to infinity, for \(\mu _{\theta _0}\)-almost every y. Alternatively, consider the continuous loss function \( \ell _n : \Theta \times \Omega \times Y \rightarrow {\mathbb {R}} \) given by

$$\begin{aligned} \ell _n(\theta , x, y) = - \sum _{j=0}^{n-1} \log J_{\theta } (\sigma ^j (x)) + \sum _{j=0}^{n-1} \log J_{\theta _0}(\sigma ^j (y)) - \Vert \, \theta - \theta _0\,\Vert ^2. \end{aligned}$$

The minimization of \(- \ell _n\) corresponds, in rough terms, to what is known in statistics as the minimization of the mean squared error on the set of parameters. As the previous loss function is also almost-additive in the y-variable, Theorem B ensures that the corresponding a posteriori measures \(\Pi _n(\cdot \mid y)\) converge exponentially fast to the Dirac measure \(\delta _{\theta _0}\) at the sampling parameter as n tends to infinity, \(\mu _{\theta _0}\)-almost everywhere (we refer the reader to [47], whose methods provide an alternative argument leading to the same conclusion).
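The following minimal numerical sketch illustrates the posterior consistency claimed in Example 2.5 under additional, purely illustrative assumptions: the alphabet is \(\{1,2\}\), \(J_0\) and \(J_1\) are Bernoulli Jacobians with parameters \(p_0=0.2\) and \(p_1=0.9\), and the \(\theta \)-independent term and the uniformly bounded term \(\theta \log \theta -\theta \log \theta _0\) of the loss are dropped, since neither affects the exponential concentration. For this Bernoulli family the remaining weights \(\exp \big (\sum _{j<n}\log J_\theta (\sigma ^j(y))\big )\) coincide with the cylinder weights \(\mu _\theta (C_n(y))\).

```python
import numpy as np

# Numerical sketch of Example 2.5 (illustration only).
# Assumptions not stated in the text: J_i(x) = p_i if x_1 = 1 and 1 - p_i otherwise,
# so J_theta is the Bernoulli Jacobian with parameter theta*p_1 + (1-theta)*p_0 and
# mu_theta is the corresponding Bernoulli measure.

rng = np.random.default_rng(1)
p0, p1 = 0.2, 0.9
theta0 = 0.4
p_true = theta0 * p1 + (1 - theta0) * p0      # parameter of the sampling measure
n = 4000
y = rng.random(n) < p_true                    # y_j = 1 with probability p_true

thetas = np.linspace(1e-3, 1 - 1e-3, 501)     # grid playing the role of Theta = [0,1]
p_theta = thetas * p1 + (1 - thetas) * p0
k = y.sum()                                   # number of symbols equal to 1
# Birkhoff sum  sum_{j<n} log J_theta(sigma^j y)  on the grid
log_like = k * np.log(p_theta) + (n - k) * np.log(1 - p_theta)
post = np.exp(log_like - log_like.max())      # unnormalized a posteriori density
post /= post.sum() * (thetas[1] - thetas[0])  # normalize w.r.t. the Lebesgue prior

print("posterior mode:", thetas[np.argmax(post)], " true theta0:", theta0)
print("posterior mass of |theta - theta0| < 0.05:",
      post[np.abs(thetas - theta0) < 0.05].sum() * (thetas[1] - thetas[0]))
```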

Example 2.6

Let \(\sigma :\{1,2\}^{{\mathbb {N}}} \rightarrow \{1,2\}^{{\mathbb {N}}}\) be the full shift and, for each \(\theta =(\theta _1, \theta _2)\) in the parameter space \(\Theta :=[-\varepsilon , \varepsilon ]^2\), let \((\mu _\theta )_{\theta \in \Theta }\) be a continuous family of Bernoulli measures. These are equilibrium states for a continuous family of potentials. Consider also the locally constant linear cocycle \(A_\theta : \Omega \rightarrow SL(2,{\mathbb {R}})\) given by

$$\begin{aligned} A_\theta \mid _{[i]}= \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \,\cdot \, \begin{pmatrix} \cos \theta _i & -\sin \theta _i \\ \sin \theta _i & \cos \theta _i \end{pmatrix}, \quad \text {for every }i=1, 2. \end{aligned}$$

Given \(n\geqslant 1\) and \((x_1, \dots , x_n) \in \{1,2\}^n\), take the matrix product

$$\begin{aligned} A_\theta ^{(n)}(x_1, \dots , x_n):= A_{\theta _{x_n}} \dots A_{\theta _{x_2}} A_{\theta _{x_1}}. \end{aligned}$$

The limit \( \lambda _{\theta ,i} := \lim _{n\rightarrow \infty } \frac{1}{n} \log \Vert A_\theta ^{(n)}(x) \, v_i \Vert \) defines the Lyapunov exponents along the orbit of x; it is well defined for \(\nu \)-almost every x and is independent of the vector \(v_i\in E^i_{\theta ,x} \setminus \{0\}\) \((i=\pm )\) (cf. Sect. 3.2.3 for more details). Somewhat dual to the context of the joint spectral radius [8], the problem here is the selection of a certain Gibbs measure from the information on the norm of the products of matrices along orbits of typical points. More precisely, take the loss function \(\ell _n(\theta , x,y) = -\log \Vert A^{(n)}_{\theta }(x)\Vert \) and notice that, for \(\nu \)-almost every \(y\in Y\) and every \(\theta \in \Theta \),

$$\begin{aligned} \int _\Omega e^{\varphi _{n+m}(\theta , x,y)} \, d\mu _\theta (x)&= \int _\Omega \Vert A^{(n+m)}_{\theta }(x)\Vert \, d\mu _\theta (x) = \sum _{C_{n+m}(z)} \Vert A^{(n+m)}_{\theta }(z)\Vert \, \mu _\theta (C_{n+m}(z)) \\&\leqslant \sum _{C_n(z)} \Vert A^{(n)}_{\theta }(z)\Vert \Vert A^{(m)}_{\theta }(\sigma ^n(z))\Vert \, \mu _\theta (C_{n}(z)) \, \mu _\theta (C_{m}(\sigma ^n(z))) \\&\leqslant \int _\Omega e^{\varphi _{n}(\theta , x,y)} \, d\mu _\theta (x) \cdot \int _\Omega e^{\varphi _{m}(\theta , x,y)} \, d\mu _\theta (x) \end{aligned}$$

for every \(m,n\geqslant 1\), where we used that \(\mu _\theta \) is a \(\sigma \)-invariant Bernoulli measure. In particular, Fekete’s lemma implies that the limit

$$\begin{aligned} \Gamma ^y(\theta ):=\lim _{n\rightarrow \infty } \frac{1}{n} \log \int _\Omega e^{\varphi _n(\theta , x,y)} \, d\mu _\theta (x)&= \inf _{n\geqslant 1} \frac{1}{n} \log \int _\Omega \Vert A^{(n)}_{\theta }(x)\Vert \, d\mu _\theta (x) \end{aligned}$$

exists and is independent of y. As the right-hand side above is the infimum of continuous functions of the parameter \(\theta \), the limit function \(\Theta \ni \theta \mapsto \Gamma ^y(\theta )\) is upper semicontinuous. We remark that \(\theta _0=(0,0)\) is the unique parameter for which the Lyapunov exponent is the largest possible (see Lemma 3.5). Hence, as assumptions (A1) and (A2) are satisfied, Theorem B implies that

$$\begin{aligned} \Pi _n (E \mid y)\,=\, \frac{ \int _E \int \Vert A^{(n)}_{\theta }(x)\Vert d\mu _\theta (x) \,d \Pi _0 (\theta ) }{ \int _\Theta \int \Vert A^{(n)}_{\theta }(x)\Vert d\mu _\theta (x) \,d \Pi _0 (\theta )} \longrightarrow {\left\{ \begin{array}{ll} 1, \text {if} \; (0,0) \in E \\ 0, \text {otherwise} \end{array}\right. } \end{aligned}$$

for every measurable subset \(E\subset \Theta \). In other words, the a posteriori measures converge to the Dirac measure \(\delta _{(0,0)}\). In particular, one has posterior consistency in the problem of determining the measure with largest Lyapunov exponent.

Alternatively, taking the loss function \(\ell _n(\theta , x,y) =-\varphi _n(\theta , x,y) = -\log \Vert A^{(n)}_{\theta }(y)\Vert \), note that the a posteriori measures are given by

$$\begin{aligned} \Pi _n (E \mid y)\,=\, \frac{ \int _E \Vert A^{(n)}_{\theta }(y)\Vert \,d \Pi _0 (\theta ) }{ \int _\Theta \Vert A^{(n)}_{\theta }(y)\Vert \,d \Pi _0 (\theta )} \end{aligned}$$

and that, by the Oseledets theorem and the sub-additive ergodic theorem,

$$\begin{aligned} \lim _{n\rightarrow \infty } \frac{1}{n} \log \int _\Omega e^{\varphi _n(\theta , x,y)} \, d\mu _\theta (x) = \lim _{n\rightarrow \infty } \frac{1}{n} \log \Vert A^{(n)}_{\theta }(y)\Vert = \inf _{n\geqslant 1} \frac{1}{n} \int \log \Vert A^{(n)}_{\theta }(\cdot )\Vert \,d\nu \end{aligned}$$

for \(\nu \)-almost every \(y\in Y\). The map \( \Theta \ni \theta \mapsto \Gamma ^y(\theta ) := \inf _{n\geqslant 1} \frac{1}{n} \int \log \Vert A^{(n)}_{\theta }(\cdot )\Vert \,d\nu \) is upper semicontinuous because it is the infimum of continuous maps. In particular, Theorem B implies once more that for \(\nu \)-almost every \(y\in Y\)

$$\begin{aligned} \lim _{n\rightarrow \infty }\Pi _n (\cdot \mid y)\,=\, \lim _{n\rightarrow \infty } \frac{ \int _\cdot \Vert A^{(n)}_{\theta }(y)\Vert \,d \Pi _0 (\theta ) }{ \int _\Theta \Vert A^{(n)}_{\theta }(y)\Vert \,d \Pi _0 (\theta )} = \delta _{(0,0)} \end{aligned}$$
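A Monte Carlo sketch of the mechanism behind Example 2.6: it estimates the largest Lyapunov exponent \(\lambda ^+(A_\theta ,\nu )\) along one long \(\nu \)-typical orbit and checks that the estimate is maximal at \(\theta =(0,0)\), which is the content of Lemma 3.5 and the reason why the a posteriori measures select \(\delta _{(0,0)}\). The choices \(\nu ={\mathbb {B}}(1/2,1/2)\), \(\varepsilon =0.3\) and the orbit length are illustrative assumptions.

```python
import numpy as np

# Monte Carlo sketch for Example 2.6 (illustration only).
# Assumptions: nu is the (1/2, 1/2)-Bernoulli measure and eps = 0.3.

rng = np.random.default_rng(2)
A = np.array([[2.0, 1.0], [1.0, 1.0]])

def rotation(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

def lyapunov(theta, n=100_000):
    """Estimate of (1/n) log ||A_theta^{(n)}(y)|| along a random word y of length n."""
    mats = [A @ rotation(theta[0]), A @ rotation(theta[1])]   # A_theta restricted to [1], [2]
    v = np.array([1.0, 0.0])
    log_norm = 0.0
    for symbol in rng.integers(0, 2, size=n):
        v = mats[symbol] @ v
        norm = np.linalg.norm(v)
        log_norm += np.log(norm)      # accumulate the log-growth and renormalize v
        v /= norm
    return log_norm / n

print("theta=(0,0):      ", lyapunov((0.0, 0.0)))   # ~ log((3+sqrt(5))/2) ~ 0.9624
print("theta=(0.3,0.3):  ", lyapunov((0.3, 0.3)))   # strictly smaller (Lemma 3.5)
print("theta=(0.3,-0.3): ", lyapunov((0.3, -0.3)))
```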

Example 2.7

In the context of Example 2.6, noticing that all matrices take values in \(SL(2,{\mathbb {R}})\) (so that \(\Vert A^{(n)}_{\theta }\Vert \geqslant 1\) and \(\varphi _n\geqslant 0\)), it makes sense to consider alternatively the loss function \( \ell _n(\theta , x,y) = -\log \varphi _n(\theta ,x,y)=- \log \log \Vert A^{(n)}_{\theta }(y)\Vert , \) and to observe that \(\varphi _n(\theta ,x,y)\) is almost-additive, meaning that it satisfies (H1)-(H2) with a constant K uniform in \(\theta \). This loss function induces the a posteriori measures

$$\begin{aligned} \Pi _n (E \mid y)\,=\, \frac{ \int _E \int \log \Vert A^{(n)}_{\theta }(x)\Vert d\mu _\theta (x)\, d \Pi _0 (\theta )}{ \int _\Theta \int \log \Vert A^{(n)}_{\theta }(x)\Vert d\mu _\theta (x)\, d \Pi _0 (\theta )}. \end{aligned}$$
(24)

A simple computation involving Fekete’s lemma guarantees that, for each \(\theta \in \Theta \), the annealed Lyapunov exponent

$$\begin{aligned} \lambda (\theta ):=\lim _{n\rightarrow \infty } \frac{1}{n} \int \log \Vert A^{(n)}_{\theta }(x)\Vert d\mu _\theta (x) =\inf _{n\geqslant 1} \frac{1}{n} \int \log \Vert A^{(n)}_{\theta }(x)\Vert d\mu _\theta (x) \geqslant 0 \end{aligned}$$

does exist. Theorem C implies that the a posteriori measures  (24) converge and

$$\begin{aligned} \lim _{n\rightarrow \infty } \Pi _n (E \mid y)\,=\, \frac{\int _E \lambda (\theta ) \, d\Pi _0(\theta )}{\int _\Theta \lambda (\theta ) \, d\Pi _0(\theta )} \end{aligned}$$

for every measurable subset \(E\subset \Theta \). In particular, the limit measure is absolutely continuous with respect to the a priori measure \(\Pi _0\), with density given by the normalized Lyapunov exponent function. Moreover, the continuous dependence of the Lyapunov exponents on the parameter \(\theta \) yields the exponential large deviations estimates in Theorem C.
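In contrast with the previous examples, the limit measure here is not a Dirac mass. The short sketch below estimates \(\lambda (\theta )\) on a small grid of parameters and prints the corresponding normalized density. The specific (constant, hence continuous) Bernoulli family \(\mu _\theta \equiv {\mathbb {B}}(1/2,1/2)\), the value \(\varepsilon =0.3\) and the discretization of \(\Pi _0\) are illustrative assumptions; for an ergodic \(\mu _\theta \) the quenched limit along one typical orbit coincides with \(\lambda (\theta )\), which justifies the single-orbit estimator.

```python
import numpy as np

# Sketch for Example 2.7 (illustration only): the limit of the a posteriori measures
# has density lambda(theta)/int lambda dPi_0 with respect to Pi_0.
# Assumptions: mu_theta = B(1/2,1/2) for every theta, eps = 0.3, and Pi_0 is replaced
# by normalized counting measure on a small grid of parameters.

rng = np.random.default_rng(5)
A = np.array([[2.0, 1.0], [1.0, 1.0]])

def rotation(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

def lyapunov(theta, n=50_000):
    mats = [A @ rotation(theta[0]), A @ rotation(theta[1])]
    v, log_norm = np.array([1.0, 0.0]), 0.0
    for symbol in rng.integers(0, 2, size=n):
        v = mats[symbol] @ v
        norm = np.linalg.norm(v)
        log_norm += np.log(norm)
        v /= norm
    return log_norm / n

grid = [(a, b) for a in (-0.3, 0.0, 0.3) for b in (-0.3, 0.0, 0.3)]
lam = np.array([lyapunov(theta) for theta in grid])
density = lam / lam.sum()        # density of the limit measure w.r.t. the discretized Pi_0
for theta, l, d in zip(grid, lam, density):
    print(f"theta={theta}:  lambda ~ {l:.3f}   density ~ {d:.3f}")
```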

3 Preliminaries

3.1 Relative Entropy

Let us recall some relevant concepts of entropy in the context of shifts. Given \(x= (x_1,x_2,...,x_k ,...)\in \Omega \) and \(n\geqslant 1\), recall that

$$\begin{aligned} C_n (x)=\{ y\in \Omega \,|\,y_j=x_j, j=1,2,..,n\} \end{aligned}$$

is the n-cylinder in \(\Omega \) that contains the point x. The concept of relative entropy will play a key role in the analysis. Let \(\phi :\Omega \rightarrow {\mathbb {R}}\) be a Lipschitz continuous potential and let \(\mu _\phi \) be its unique Gibbs measure, which is ergodic. Following [16, Section 3], given an ergodic probability measure \(\mu \in {\mathcal {M}}_\sigma (\Omega )\) the limit

$$\begin{aligned} h(\mu \mid \mu _\phi ):=\lim _{n \rightarrow \infty } \frac{ 1 }{ n} \log \Big (\frac{ \mu ( C_n (x))}{\mu _\phi ( C_n (x) )} \Big ) \end{aligned}$$
(25)

exists and is non-negative for \(\mu \)-almost every \(x=(x_1,x_2,...,x_n,..)\in \Omega \), and it is called the relative entropy of \(\mu \) with respect to \(\mu _\phi \). Notice that any two distinct ergodic probability measures are mutually singular, hence no Radon-Nikodym derivative is well defined. In  (25), a sequence of nested cylinder sets is used as an alternative to compute the relative entropy when Radon-Nikodym derivatives are not well defined (see [16] for more details). Moreover,

$$\begin{aligned} h(\mu \mid \mu _\phi ) = P_{\text {top}}(\sigma ,\phi ) - \int \phi \, d \mu - h(\mu ) \end{aligned}$$
(26)

and, by the variational principle and uniqueness of equilibrium states for Lipschitz continuous potentials, \(h(\mu \mid \mu _\phi )=0\) if and only if \(\mu =\mu _\phi \) (cf. Subsection 3.2 in [16]). Furthermore, if \(\mu =\mu _\psi \) is a Gibbs measure then \(h(\mu \mid \mu _\phi )=0\) if and only if \(\phi \) and \(\psi \) are cohomologous, i.e., if there exists a Lipschitz continuous function \(u: \Omega \rightarrow {\mathbb {R}}\) so that \(\phi =\psi + u\circ \sigma - u\). The relative entropy is also known as the Kullback-Leibler divergence. For proofs of general results on the topic in the context of shifts we refer the reader to [16] and [44], which deal with finite and compact alphabets, respectively. We refer the reader to [34] for an application of the Kullback-Leibler divergence in statistics.
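The following small numerical check illustrates formula (25): for Bernoulli measures the cylinder ratios converge to the per-symbol Kullback-Leibler divergence of the marginals, which plays the role of the relative entropy. The particular marginals below are illustrative assumptions.

```python
import numpy as np

# Small numerical check of (25) (illustration only), assuming mu and mu_phi are the
# Bernoulli measures B(0.7, 0.3) and B(0.4, 0.6) on the full 2-shift.  For Bernoulli
# measures the limit in (25) equals  sum_i p_i log(p_i / q_i),  used as reference value.

rng = np.random.default_rng(3)
p = np.array([0.7, 0.3])          # marginals of mu (the measure generating x)
q = np.array([0.4, 0.6])          # marginals of mu_phi

kl_exact = float(np.sum(p * np.log(p / q)))

n = 100_000
x = rng.choice(2, size=n, p=p)    # mu-typical point x
log_ratio = np.cumsum(np.log(p[x] / q[x]))   # log( mu(C_n(x)) / mu_phi(C_n(x)) )

for m in (100, 1_000, 10_000, 100_000):
    print(f"n={m:6d}  (1/n) log ratio = {log_ratio[m-1]/m:.4f}   KL = {kl_exact:.4f}")
```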

Remark 3.1

In the special case that \((\mu _\theta )_{\theta \in \Theta }\) is a parameterized family of Gibbs measures associated to normalized potentials, for \(\mu _{\theta _0}\)-almost every \(x=(x_1,x_2,...,x_n,..)\in \Omega \) we have

$$\begin{aligned} \frac{\mu _\theta \,(C_n(x))}{\mu _{\theta _0} \,( C_n(x)) }\,\sim \, e^{ -n\, h(\mu _{\theta _0} \mid \mu _\theta )}\,\,\rightarrow \,\,0,\, \end{aligned}$$

as \(n \rightarrow \infty \), whenever \(f_\theta \) and \(f_{\theta _0}\) are not cohomologous. Furthermore, as the pressure function is zero in this context the relative entropy \(h(\mu _{\theta _0} \mid \mu _\theta )\) can be written as

$$\begin{aligned} h(\mu _{\theta _0} \mid \mu _\theta ) = - h (\mu _{\theta _0}) - \int \log J_\theta \, d \mu _{\theta _0}. \end{aligned}$$
(27)

Expression  (26) allows one to obtain uniform estimates on the relative entropy of nearby invariant measures. More precisely:

Lemma 3.1

Let \(\phi : \Omega \rightarrow {\mathbb {R}}\) be a Lipschitz continuous potential and let \(\mu _\phi \) be its unique Gibbs measure. Then, for any small \(\varepsilon >0\) there exists \(\delta >0\) such that

$$\begin{aligned} \inf _{\mu \in {\mathcal {M}}_\sigma (\Omega ) } \Big \{h(\mu \mid \mu _\phi ) \,:\, D_\Omega (\mu , \mu _\phi )>\, \delta \,\Big \}> \varepsilon . \end{aligned}$$

Proof

Fix \(\varepsilon >0\). By the continuity of the map \(\mu \mapsto \int \phi \, d\mu \), the upper semicontinuity of the entropy map \(\mu \mapsto h(\mu )\) and the uniqueness of the equilibrium state, there exists \(\delta >0\) so that any invariant probability measure \(\mu \) with \(D_\Omega (\mu ,\mu _\phi )>\delta \) satisfies \( h(\mu ) + \int \phi \, d \mu < P_{\text {top}}(\sigma ,\phi ) - \varepsilon . \) This, together with (26), proves the lemma. \(\square \)

3.2 Non-additive Thermodynamic Formalism

As mentioned before, we are mostly interested in non-additive loss functions which satisfy some almost additivity condition. Let us recall some of the basic notions associated to the non-additive thermodynamic formalism.

3.2.1 Basic Notions

There are several notions of non-additive sequences which appear naturally in the description of thermodynamic objects. Let us recall some of these notions.

Definition 3.2

A sequence \(\Psi := \{\psi _n\}_{n\geqslant 1}\) of continuous functions \(\psi _n:\Omega \rightarrow {\mathbb {R}}\) is called:

  1. (1)

    almost additive if there exists \(C>0\) such that

    $$\begin{aligned} \psi _n +\psi _m \circ \sigma ^n - C \leqslant \psi _{m+n} \leqslant \psi _n + \psi _m \circ \sigma ^n + C, \quad \forall m,n \geqslant 1; \end{aligned}$$
  2. (2)

    asymptotically additive if for any \(\xi >0\) there is a continuous function \(\psi _\xi \) so that

    $$\begin{aligned} \limsup _{n\rightarrow \infty } \frac{1}{n} \left\| \psi _n - S_n \psi _\xi \right\| <\xi ; \end{aligned}$$
  3. (3)

    sub-additive if

    $$\begin{aligned} \psi _{m+n}\leqslant \psi _{m}+\psi _{n}\circ \sigma ^m, \quad \forall m,n \geqslant 1. \end{aligned}$$

The convergence in the case of constant functions, i.e., sub-additive sequences of real numbers, is given by the following well-known lemma.

Lemma 3.2

(Fekete’s lemma) Let \((a_n)_{n\geqslant 1}\) be a sequence of real numbers so that \(a_{n+m} \leqslant a_{n}+ a_{m}\) for every \(n,m\geqslant 1\). Then the sequence \((\frac{a_n}{n})_{n\geqslant 1}\) converges to \(\inf _{n\geqslant 1} \frac{a_n}{n}\).
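A toy numerical illustration of Fekete's lemma, in the same spirit as its use for the annealed sequences of Examples 2.6 and 2.7; the test matrix below is an arbitrary, purely hypothetical choice.

```python
import numpy as np

# Toy illustration of Fekete's lemma (not tied to a specific sequence above):
# a_n = log ||M^n|| is sub-additive since ||M^{n+m}|| <= ||M^n|| ||M^m||, so a_n / n
# converges to inf_n a_n / n, which here equals the log of the spectral radius of M.

M = np.array([[2.0, 1.0],
              [0.0, 1.0]])                     # hypothetical test matrix
log_spectral_radius = np.log(np.max(np.abs(np.linalg.eigvals(M))))

P = np.eye(2)
ratios = []
for n in range(1, 41):
    P = P @ M
    ratios.append(np.log(np.linalg.norm(P, 2)) / n)   # a_n / n

print("a_1/1, a_5/5, a_40/40 :", ratios[0], ratios[4], ratios[-1])
print("inf_n a_n/n (n<=40)   :", min(ratios))
print("log spectral radius   :", log_spectral_radius)
```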

In order to recall the variational principle and equilibrium states for sequences of potentials we need an almost sure convergence result. Given an invariant probability measure \(\rho \in {\mathcal {M}}_\sigma (\Omega )\), Kingman’s sub-additive ergodic theorem ensures that, for any almost additive or sub-additive sequence \(\Psi := \{\psi _n\}_{n\geqslant 1}\) of continuous functions, the following limit exists:

$$\begin{aligned} {{\mathcal {F}}} (\rho , \Psi ) = \lim _{n \rightarrow \infty } \frac{1}{n} \int \psi _n \, d \rho \,. \end{aligned}$$
(28)

Definition 3.3

We denote by \(P_{\text {top}} (\sigma ,\Phi )\) the topological pressure of the almost additive family \(\Phi =\{\varphi _n\}_{n\geqslant 1}\), defined by

$$\begin{aligned} P_{\text {top}} (\sigma ,\Phi ) = \sup _{\rho \in {\mathcal {M}}_\sigma (\Omega )} \Big \{\, h(\rho ) \,+\,{{\mathcal {F}}} (\rho , \Phi ) \,\Big \}. \end{aligned}$$

A probability measure \(\mu =\mu _\Phi \in {\mathcal {M}}_\sigma (\Omega )\) is called a Gibbs measure for the almost additive family \(\Phi \) if it attains the supremum.

The previous topological pressure for non-additive sequences can also be defined, in the spirit of information theory, as the maximal topological complexity of the dynamics with respect to such sequences of observables (cf. [3]). The unique Gibbs measure associated to the family \(\Phi =(\varphi _n)_{n\geqslant 1}\), \(\varphi _n =\sum _{j=0}^{n-1} \log J_{\theta _0}\circ \sigma ^j,\) \(n \in {\mathbb {N}}\), is \(\mu _{\theta _0}\). Moreover, in this case \(P_{\text {top}} (\sigma ,\Phi )=0.\) For this family \(\Phi :=\{\varphi _n\}\) the claim falls within the scope of the classical thermodynamic formalism, as described before by expression (7). In this case

$$\begin{aligned} P_{\text {top}} (\sigma ,\Phi ) = \sup _{\mu \in {\mathcal {M}}_\sigma (\Omega )} \{\, h(\mu ) \,+\, \int \log J_{\theta _0} d \mu \,\}= 0. \end{aligned}$$
(29)

Remark 3.4

In [19], the author proved that any sequence \(\Psi \) of almost additive or asymptotically additive potentials is equivalent to a standard additive potential: there exists a continuous potential \(\varphi \) with the same topological pressure, equilibrium states, variational principle, weak Gibbs measures, level sets (and irregular set) for the Lyapunov exponent, and large deviations properties. Yet, it is still unknown whether any such sequence of Lipschitz continuous potentials admits a Lipschitz continuous additive representative.

3.2.2 Almost-Additive Potentials Related to Entropy

The next lemma says that Gibbs measures determine in a natural way some sequences of almost additive potentials.

Lemma 3.3

Given \(\theta \in \Theta \), the family \(\psi _{n,1}^\theta (y):= \log \mu _\theta (\,C_{n} (y)\, )\), \(n \in {\mathbb {N}},\) is almost additive.

Proof

Recall that all potentials \(f_\theta \) are normalized, thus each \(\mu _\theta \) satisfies the Gibbs property  (3) with \(P_\theta =0\). Hence, for each \(\theta \in \Theta \) there exists \(K_\theta >0\) such that for all \(m,n\geqslant 1\) and \(x\in \Omega \)

$$\begin{aligned} \mu _\theta (C_{m+n}(x))&\leqslant K_\theta ^{3} \; \mu _\theta (C_n(x)) \, \mu _\theta (\sigma ^n (C_{m+n}(x))\,) \\&= K_\theta ^{3} \; \mu _\theta (C_n(x)) \, \mu _\theta (C_{m}(\sigma ^n(x))\,). \end{aligned}$$

Similarly, \(\mu _\theta (C_{m+n}(x)) \geqslant K_\theta ^{-3} \,\mu _\theta (C_n(x)) \, \mu _\theta (C_{m}(\sigma ^n(x))\,)\) for all \(m,n\geqslant 1\). Therefore, the family \(\psi _{n,1}^\theta ( y)= \log \mu _\theta (\,C_n (y)\, )\) satisfies

$$\begin{aligned} \psi _{n,1}^\theta + \psi _{m,1}^\theta \circ \sigma ^n - 3 \, \log K_\theta \leqslant \psi _{(n+m),1}^\theta \leqslant \psi _{n,1}^\theta + \psi _{m,1}^\theta \circ \sigma ^n + 3 \, \log K_\theta \end{aligned}$$

for all \(m, n\geqslant 1\), hence it is almost-additive. \(\square \)
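The almost-additivity defect in Lemma 3.3 can be computed explicitly for Markov measures, which are Gibbs measures for locally constant normalized potentials. The sketch below checks numerically that the defect stays within a uniform bound; the transition matrix is an illustrative assumption.

```python
import numpy as np

# Numerical check of Lemma 3.3 (illustration only), assuming mu_theta is the Markov
# measure on {0,1}^N with transition matrix P and stationary vector pi.  For such
# measures psi_{n,1}(y) = log mu(C_n(y)) = log pi(y_1) + sum_j log P(y_j, y_{j+1}),
# and the defect  psi_{n+m,1}(y) - psi_{n,1}(y) - psi_{m,1}(sigma^n y)  equals
# log( P(y_n, y_{n+1}) / pi(y_{n+1}) )  (1-based indexing), which is uniformly bounded.

rng = np.random.default_rng(4)
P = np.array([[0.7, 0.3],
              [0.2, 0.8]])
pi = np.array([0.4, 0.6])        # stationary vector: pi @ P == pi

def sample_path(length):
    y = [rng.choice(2, p=pi)]
    for _ in range(length - 1):
        y.append(rng.choice(2, p=P[y[-1]]))
    return np.array(y)

def log_mu_cylinder(word):
    return np.log(pi[word[0]]) + np.log(P[word[:-1], word[1:]]).sum()

y = sample_path(2000)
defect_bound = np.max(np.abs(np.log(P / pi[None, :])))   # uniform constant C
for n, m in [(10, 10), (100, 50), (500, 1000)]:
    defect = (log_mu_cylinder(y[:n + m])
              - log_mu_cylinder(y[:n])
              - log_mu_cylinder(y[n:n + m]))
    print(f"n={n:4d} m={m:4d}  defect={defect:+.4f}  (uniform bound {defect_bound:.4f})")
```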

Note that the natural family

$$\begin{aligned} y \rightarrow -\,\log \,\int _E \, \frac{\mu _\theta (\,C_n (y)\, )}{ \mu _{\theta _0} (\,C_n (y)\, ) } \, d \Pi _0 (\theta ), \end{aligned}$$

\(n \in {\mathbb {N}},\) which seems at first useful, may not be almost additive, as one first evaluates the fluctuations in the different ways the measures see cylinders and only afterwards takes the logarithm. We consider alternatively the sequence of potentials given below.

Lemma 3.4

For any fixed \(y\in Y\) and any Borel set \(E\subset \Theta \), the family

$$\begin{aligned} \psi _{n}(y)= \psi _{n}^{E} (y) = -\,\,\int {\mathbf {1}}_E (\theta )\, \log \mu _\theta (\,C_n (y)\, ) \, d \Pi _0 (\theta ), \end{aligned}$$

\(n \in {\mathbb {N}},\) is almost additive. In particular, for each \(\theta _0\in \Theta \) and \(E\subset \Theta \), the family \(\Psi ^E := \{\Psi _n^E\}_n, \)

$$\begin{aligned} y \rightarrow \Psi _{n}^{E} (y) := -\,\,\int _E\, \log \Big (\frac{\mu _\theta (\,C_n (y)\, )}{ \mu _{\theta _0} (\,C_n (y)\, ) }\Big ) \, d \Pi _0 (\theta ), \end{aligned}$$
(30)

is almost additive.

Proof

The first assertion is a direct consequence of the previous lemma and linearity of the integral. For the second one just notice that

$$\begin{aligned} -\,\,\int _E\, \log \Big (\frac{\mu _\theta (\,C_n (y)\, )}{ \mu _{\theta _0} (\,C_n (y)\, ) }\Big ) \, d \Pi _0 (\theta ) = \psi _n(y) + \Pi _0(E)\, \psi _{n,1}^{\theta _0}(y) \end{aligned}$$

is the sum of two almost-additive sequences, hence almost additive. \(\square \)

3.2.3 Almost-Additive Potentials Related to Lyapunov Exponents

Let \(\sigma :\Omega \rightarrow \Omega \) be a subshift of finite type and for each \(\theta =(\theta _1, \theta _2)\in [-\varepsilon , \varepsilon ]^2\) consider the locally constant linear cocycle \(A_\theta : \Omega \rightarrow SL(2,{\mathbb {R}})\) given by

$$\begin{aligned} A_\theta \mid _{[i]}= \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \,\cdot \, \begin{pmatrix} \cos \theta _i & -\sin \theta _i \\ \sin \theta _i & \cos \theta _i \end{pmatrix}, \quad \text {for every }i=1, 2. \end{aligned}$$

To each \(n\geqslant 1\) and \((x_1, \dots , x_n) \in \{1,2\}^n\) one associates the product matrix

$$\begin{aligned} A_\theta ^{(n)}(x_1, \dots , x_n):= A_{\theta _{x_n}} \dots A_{\theta _{x_2}} A_{\theta _{x_1}}. \end{aligned}$$

If \(\varepsilon >0\) is chosen small, the previous family of matrices preserves a constant cone field in \({\mathbb {R}}^2\), hence admits a dominated splitting. Furthermore, if \(\mu \in {\mathcal {M}}_\sigma (\Omega )\) is ergodic the Oseledets theorem ensures that for \(\mu \)-almost every \(x\in \Omega \) there exists a cocycle-invariant splitting \({\mathbb {R}}^2=E^+_{\theta ,x} \oplus E^-_{\theta ,x}\) so that the limit

$$\begin{aligned} \lambda _{\theta ,i} := \lim _{n\rightarrow \infty } \frac{1}{n} \log \Vert A_\theta ^{(n)}(x) \, v_i \Vert \end{aligned}$$

exists and is independent of the vector \(v_i\in E^i_{\theta ,x} \setminus \{0\}\) \((i=\pm )\). Actually, the Oseledets theorem also ensures that the largest Lyapunov exponent can be obtained by means of sub-additive sequences, as

$$\begin{aligned} \lambda ^+(A_\theta ,\mu )=\lambda _{\theta ,+}=\inf _{n\geqslant 1} \frac{1}{n} \log \Vert A^{(n)}_{\theta }(x)\Vert , \end{aligned}$$

for \(\mu \)-almost every x. Since all matrices preserve a cone field, for each \(\theta \in [-\varepsilon ,\varepsilon ]^2\) the sequence \((\log \Vert A^{(n)}_{\theta }(x)\Vert )_{n\geqslant 1}\) is known to be almost-additive in the x-variable (cf. [28]). Somewhat surprisingly, in this simple context the largest annealed Lyapunov exponent

$$\begin{aligned} \lim _{n\rightarrow \infty } \frac{1}{n} \int \log \Vert A^{(n)}_{\theta }(x)\Vert \,d\nu (x) \end{aligned}$$

varies analytically with the parameter \(\theta \) (cf. [50]). We will need the following localization result.

Lemma 3.5

\(\lambda ^+(A_{(0,0)},\nu ) > \lambda ^+(A_\theta ,\nu )\) for every \(\theta \in [-\varepsilon ,\varepsilon ]^2 \setminus \{(0,0)\}\).

Proof

First observe that, as all matrices are obtained by a rotation of the original hyperbolic matrix, we have that \(\log \Vert A_\theta \Vert =\log (\frac{3+\sqrt{5}}{2})\) for all \(\theta \in [-\varepsilon ,\varepsilon ]^2\). Second, it is clear from the definition that \(\lambda ^+(A_{(0,0)},\nu )\) is the logarithm of the largest eigenvalue of the unperturbed hyperbolic matrix, hence it is \(\log (\frac{3+\sqrt{5}}{2})\). Finally, Furstenberg [31] proved that

$$\begin{aligned} \lambda ^+(A_\theta ,\nu ) = \int \int _{{\mathbb {S}}^1} \log \frac{\Vert A_\theta (x)\cdot v\Vert }{\Vert v\Vert } \, d{\mathbb {P}}(v)\,d\nu (x) \end{aligned}$$

where \({\mathbb {S}}^1\) stands for the projective space of \({\mathbb {R}}^2\) and \({\mathbb {P}}\) is a \(\nu \)-stationary measure, meaning that \(\nu \times {\mathbb {P}}\) is invariant by the projectivization of the map \(F(x,v)=(\sigma (x), A_{\theta }(x)\, v)\) for \((x,v)\in \Omega \times {\mathbb {R}}^2\). Altogether this guarantees that

$$\begin{aligned} \lambda ^+(A_\theta ,\nu ) = \log (\frac{3+\sqrt{5}}{2}) \quad \text {if and only if}\quad {\mathbb {P}}=\delta _{v_+} \end{aligned}$$

where \(v_+\) is the leading eigenvector of \(A_{(0,0)}\), which cannot occur because \(\nu \times \delta _{v_+}\) is not invariant by the projectivized cocycle. This proves the lemma. \(\square \)

3.3 Large Deviations: Speed of Convergence

Large deviations estimates are commonly used in decision theory (see e.g. [12, 30, 56]). In the context of dynamical systems, the exponential rates of convergence in large deviations are defined in terms of rate functions, often described by thermodynamic quantities such as pressure and entropy. In the case of level-1 large deviation estimates these can be defined as follows. Given a family \(\Psi ^E := \{\psi _n^E\},\) where \(\psi _n^E:\Omega \rightarrow {\mathbb {R}},\) E is a Borel set of parameters, \(n \in {\mathbb {N}}\) and \(-\infty \leqslant c < d \leqslant \infty \), we define

$$\begin{aligned} {\overline{R}}_\nu ( \Psi ^E , [c,d] )= \limsup _{n \rightarrow \infty } \frac{1}{n}\,\, \log \,\nu ( \{\,y \in \Omega : \frac{1}{n} \,\,\psi _n^E(y) \in [c,d] \}) \end{aligned}$$

and

$$\begin{aligned} {\underline{R}}_\nu ( \Psi ^E , (c,d) )= \liminf _{n \rightarrow \infty } \frac{1}{n}\,\, \log \nu ( \{\,y \in \Omega : \frac{1}{n} \,\,\psi _n^E(y) \in (c,d) \}). \end{aligned}$$

Since the subshift dynamics satisfies the transitive specification property (also referred to as the gluing orbit property), [57, Theorem B] ensures the following large deviations principle for the subshift and either asymptotically additive or certain sequences of sub-additive potentials.

Theorem 3.6

Let \(\Phi =\{\varphi _n\}\) be an almost additive family of potentials with \(P(\sigma ,\Phi )>-\infty \) and let \(\nu \) be a Gibbs measure for \(\sigma \) with respect to \(\Phi \). Assume that either:

  1. (a)

    \(\Psi =\{\psi _n\}\) is an asymptotically additive family of potentials, or;

  2. (b)

    \(\Psi =\{\psi _n\}\) is a sub-additive family of potentials such that:

    1. i.

      \(\Psi =\{\psi _n\}\) satisfies the weak Bowen condition: there exists \(\delta >0\) so that

      $$\begin{aligned} \limsup \limits _{n\rightarrow \infty } \frac{\sup \{|\psi _n(y)-\psi _n(z)|:\ y,z \in B_n(x,\delta )\}}{n} =0; \end{aligned}$$
    2. ii.

      \(\inf _{n\geqslant 1} \frac{\psi _n(x)}{n}>-\infty \) for all \(x\in \Omega \); and

    3. iii.

      the sequence \(\{\psi _n/n\}\) is equicontinuous.

Given \(c \in {\mathbb {R}}\), it holds that:

  1. (1)

    \({\overline{R}}_{\nu }(\Psi , [c,\infty )) \leqslant \sup \big \{-P(\sigma ,\Phi )+h_\eta (\sigma ) + {{\mathcal {F}}}(\eta ,\Phi ) \big \}\), where the supremum is over all \(\eta \in {{\mathcal {M}}}_\sigma (\Omega )\) such that \({{\mathcal {F}}}(\eta ,\Psi ) \geqslant c\);

  2. (2)

    \({\underline{R}}_{\nu }(\Psi , (c,\infty )) \geqslant \sup \big \{-P(\sigma ,\Phi )+h_\eta (\sigma ) + {{\mathcal {F}}}(\eta ,\Phi ) \big \}\) where the supremum is taken over all \(\eta \in {{\mathcal {M}}}_\sigma (\Omega )\) satisfying \({{\mathcal {F}}}(\eta ,\Psi ) > c\).

While in the previous theorem both invariant measures and sequences of observables may be generated by non-additive sequences of potentials (we refer the reader e.g. to [3] for the construction of equilibrium states associated to almost-additive sequences of potentials) we will be mostly concerned with Gibbs measures generated by a single Lipschitz continuous potential. In the special case of the almost-additive sequences considered in Sect. 3.2.2 the previous theorem reads as follows:

Corollary 1 Let \(\Phi =\{\varphi _n\}\) be defined by \(\varphi _n =\sum _{j=0}^{n-1} \log J_{\theta _0}\circ \sigma ^j,\) \(n \in {\mathbb {N}},\) and let \(\mu _{\theta _0}\) denote the corresponding Gibbs measure. For a given Borel set \(E\subset \Theta \), take \(\Psi ^E := \{\psi _n^E\},\) where \(\psi _n^E\), \(n \in {\mathbb {N}},\) was defined in Lemma 3.4. Then, given \(\infty \geqslant d>c\geqslant - \infty \) we have:

  1. a.

    \({\overline{R}}_{\mu _{\theta _0}} ( \Psi ^E , [c,d] ) \leqslant \sup \Big \{ h(\eta ) + \int \log J_{\theta _0}\, d \eta \, :{\eta \in {\mathcal {S}}(Y)\,\,\text {so that}\,{{\mathcal {F}}} (\eta , \Psi ^E) \in [c,d] } \Big \}\)

  2. b.

    \({\underline{R}}_{\mu _{\theta _0}} ( \Psi ^E , (c,d) ) \geqslant \sup \Big \{ h(\eta ) + \int \log J_{\theta _0}\, d \eta \, :{\eta \in {\mathcal {S}}(Y)\,\,\text {so that}\,{{\mathcal {F}}} (\eta , \Psi ^E) \in (c,d) } \Big \}\)

As the entropy function of the subshift is upper semicontinuous, any sequence of invariant measures whose free energies associated to a continuous potential tend to the topological pressure accumulates on the space of equilibrium states. Thus, in the special case that there exists a unique equilibrium state, any such sequence converges to the equilibrium state. Altogether the previous argument gives the following:

Lemma 3.7

Consider the sequence of functions \(\Phi =\{\varphi _n\}_{n\geqslant 1}\) where \(\varphi _n(y) = \sum _{j=0}^{n-1} \log J_{\theta _0}(\sigma ^j(y))\) and \(\log J_{\theta _0}\) is Lipschitz continuous, and let \(\mu _\Phi \) denote the corresponding Gibbs measure. If U is an open neighborhood of the Gibbs measure \(\mu _\Phi \) then there exists \(\alpha _1>0\) such that

$$\begin{aligned} \sup _{\mu \in {\mathcal {M}}_\sigma (\Omega )\setminus U} \{\, h(\mu ) \,+\, \int \log J_{\theta _0} d \mu \} = \sup _{\mu \in {\mathcal {M}}_\sigma (\Omega )\setminus U} \{\, h(\mu ) \,+\,{{\mathcal {F}}} (\mu , \Phi ) \,\}\,<\,- \alpha _1. \end{aligned}$$

We are particularly interested in the \(\delta \)-neighborhood of the parameter \(\theta _0\in \Theta \) defined by

$$\begin{aligned} B_\delta = \{ \,\theta \in \Theta \,|\, d_\Theta (\theta , \theta _0)<\delta \,\},\quad \text { for some }\delta >0. \end{aligned}$$
(31)

The next result establishes large deviations estimates for relative entropy associated to Gibbs measures close to \(\mu _{\theta _0}\). More precisely:

Proposition 3.8

Let \(\Psi ^E\) be defined by  (30). For any \(\delta >0\) there exists \(d_\delta >0\) satisfying

$$\begin{aligned} \,{{\mathcal {F}}} (\mu _{\theta _0}, \Psi ^{B_\delta }) \,\,< d_\delta < {{\mathcal {F}}} (\mu _{\theta _0}, \Psi ^\Theta )= \int _\Theta h(\mu _{\theta _0} \mid \mu _\theta ) \, d \Pi _0 (\theta ). \end{aligned}$$

The constant \(d_\delta \) can be taken small provided \(\delta \) is small.

Moreover, for every small \(\delta >0\) there exists \(\alpha _1>0\) so that

$$\begin{aligned}&\limsup _{n \rightarrow \infty } \frac{1}{n}\,\, \log \Big [\,\mu _{\theta _0} \Big ( \{\, y\in \Omega \,:\, - \frac{1}{n} \int _{B_\delta } \log \Big (\frac{\mu _\theta (\,C_n (y)\, )}{ \mu _{\theta _0} (\,C_n (y)\, ) }\Big )\, \, d \Pi _0 (\theta ) \in [d_\delta , \infty ) \} \Big ) \,\Big ] \\&\quad \leqslant - \alpha _1. \end{aligned}$$

Proof

Recall that, given \(\eta \in {\mathcal {M}}_\sigma ( \Omega )\) and \(E\subset \Theta \),

$$\begin{aligned} -\lim _{n \rightarrow \infty } \frac{1}{n} \int \,\int _E \,\log \Big (\frac{\mu _\theta (\,C_n (y)\, )}{ \mu _{\theta _0} (\,C_n (y)\, ) }\Big )\, d \Pi _0 (\theta )\, d \eta (y) = \lim _{n \rightarrow \infty } \frac{1}{n} \int \psi _n^E (y) \, d \eta (y)={{\mathcal {F}}} (\eta , \Psi ^E). \end{aligned}$$

Taking \(\eta = \mu _{\theta _0}\) and \(E=\Theta \) we get from (25), (27) and Lemma 3.1 that

$$\begin{aligned} {{\mathcal {F}}} (\mu _{\theta _0}, \Psi ^\Theta )&= \int _\Theta \int - \lim _{n \rightarrow \infty } \frac{1}{n} \log \Big (\frac{\mu _\theta (\,C_n (y)\, )}{ \mu _{\theta _0} (\,C_n (y)\, ) }\Big )\, \, d \mu _{\theta _0} (y)\, \,d \Pi _0 (\theta ) \nonumber \\&= \int _\Theta h(\mu _{\theta _0} \mid \mu _\theta ) \,d \Pi _0 (\theta )\nonumber \\&= - h (\mu _{\theta _0}) - \int _\Theta \int \log J_\theta \,d \mu _{\theta _0} \,d \Pi _0. \end{aligned}$$
(32)

Similarly, one obtains \( {{\mathcal {F}}} (\mu _{\theta _0}, \Psi ^E) = - h (\mu _{\theta _0}) \, \Pi _0(E)- \int _{E} \int \log J_\theta \,d \mu _{\theta _0} \,d \Pi _0 \) for any \(E \subset \Theta \). Using that \(h(\mu _{\theta _0} \mid \mu _\theta )>0\) for all \(\theta \ne \theta _0\) and that \(\Pi _0\) is fully supported on \(\Theta \), Lemma 3.1 ensures that

$$\begin{aligned} \int _{\Theta \setminus B_\delta }h(\mu _{\theta _0} \mid \mu _\theta ) \,d \Pi _0 (\theta ) > 0 \end{aligned}$$

for every small \(\delta \). In consequence,

$$\begin{aligned}&{{\mathcal {F}}} (\mu _{\theta _0}, \Psi ^{B_\delta }) = \int _{B_\delta }\, h(\mu _{\theta _0}\mid \mu _\theta )\, d \Pi _0 (\theta )\,\\&\quad < \int _{B_\delta } \,\int - \lim _{n \rightarrow \infty } \frac{1}{n} \log (\frac{\mu _\theta (\,C_n (y)\, )}{ \mu _{\theta _0} (\,C_n (y)\, ) }) \, d \mu _{\theta _0} (y)\,\,d \Pi _0 (\theta )\, \\&\qquad + \int _{\Theta \setminus B_\delta } \,\int - \lim _{n \rightarrow \infty } \frac{1}{n} \log (\frac{\mu _\theta (\,C_n (y)\, )}{ \mu _{\theta _0} (\,C_n (y)\, ) }) \, d \mu _{\theta _0} (y)\,\,d \Pi _0 (\theta ) \\&\quad = \int _\Theta h(\mu _{\theta _0} \mid \mu _\theta ) d \Pi _0 (\theta ) = {{\mathcal {F}}} (\mu _{\theta _0}, \Psi ^{\Theta }). \end{aligned}$$

for every small \(\delta \), hence there exists \(d_\delta >0\) so that

$$\begin{aligned} {{\mathcal {F}}} (\mu _{\theta _0}, \Psi ^{B_\delta })< d_\delta <{{\mathcal {F}}} (\mu _{\theta _0}, \Psi ^{\Theta }). \end{aligned}$$
(33)

Now, on the one hand, by continuity of \(\eta \mapsto {{\mathcal {F}}} (\eta , \Psi ^{B_\delta })\), the set \(U=\{\,\eta \in {\mathcal {M}}_\sigma (\Omega )\,\,:\,\,{{\mathcal {F}}} (\eta , \Psi ^{B_\delta })< d _\delta \}\) is an open neighborhood of \(\mu _{\theta _0}.\) On the other hand, according to Lemma 3.7 there exists \(\alpha _1>0\) such that

$$\begin{aligned} \sup _{\eta \in {\mathcal {M}}_\sigma (\Omega ) \setminus U } \Big \{ h(\eta ) + \int \log J_{\theta _0}\, d \eta \,\Big \} \leqslant - \alpha _1<0. \end{aligned}$$

Therefore, from Theorem 3.6,

$$\begin{aligned}&\limsup _{n \rightarrow \infty } \frac{1}{n}\,\, \log \mu _{\theta _0} \Big ( \{\, y \in \Omega \,|\, - \frac{1}{n} \int _{B_\delta }\log (\frac{\mu _\theta (\,C_n (y)\, )}{ \mu _{\theta _0} (\,C_n (y)\, ) }) \, d \Pi _0 (\theta )\, \in [d_\delta , \infty ) \} \Big ) \\&\quad \leqslant \sup _{\{\eta \in {\mathcal {M}}_\sigma (\Omega )\,:\,{{\mathcal {F}}} (\eta , \Psi ^{B_\delta }) > d_\delta \}} \{ h(\eta ) + \int \log J_{\theta _0}\, d \eta \,\} \leqslant \,-\,\alpha _1<0. \end{aligned}$$

\(\square \)

Remark 3.9

From Hypothesis A the value \(d_\delta >0\) can be taken small, if \(\delta >0\) is small, because \( \,{{\mathcal {F}}} (\mu _{\theta _0}, \Psi ^{B_\delta }) \,=\, \int _{B_\delta }\, h(\mu _{\theta _0}\mid \mu _\theta )\, d \Pi _0 (\theta )\, . \)

Corollary 3.10

Given \(\delta >0\) small let \(B_\delta \subset \Theta \) be the \(\delta \)-open neighborhood of \(\theta _0\) defined in (31) and let \(d_\delta >0\) be given by Proposition 3.8. The following holds:

$$\begin{aligned} \limsup _{n \rightarrow \infty } - \frac{1}{n}\log \int _{B_\delta } \frac{ \mu _{\theta } (\,C_n (y)\, ) }{ \mu _{\theta _0} (\,C_n (y)\, )}\, d \Pi _0 (\theta ) \leqslant d_\delta \end{aligned}$$
(34)

for \(\mu _{\theta _0}\)-almost every point y. Moreover, for \(\mu _{\theta _0}\)-almost every point y

$$\begin{aligned} \liminf _{n \rightarrow \infty } \frac{1}{n} \log \,\, \int _{B_\delta } \mu _{\theta } (\,C_n (y)\, )\, d \Pi _0 (\theta ) \geqslant - \Pi _0 (B_\delta )\cdot h(\mu _{\theta _0})- d_\delta .\end{aligned}$$
(35)

Proof

For each \(n\geqslant 1\) consider the set

$$\begin{aligned} A_n= \Big \{\, y\in \Omega \,|\, - \frac{1}{n} \int _{B_\delta } \log \frac{ \mu _\theta (\,C_n (y)\, ) }{ \mu _{\theta _0} (\,C_n (y)\, )} \, d \Pi _0 (\theta ) \in [d_{\delta }, \infty ) \Big \} . \end{aligned}$$

By Proposition 3.8, \(\mu _{\theta _0} (A_n)\) decays exponentially fast, so \(\sum _n \mu _{\theta _0} (A_n) <\infty .\) It follows from the Borel-Cantelli lemma that for \(\mu _{\theta _0}\)-almost every point \(y\in \Omega \) there exists N such that \(y\notin A_n\) for all \(n>N\). Equivalently, \(- \frac{1}{n} \int _{B_\delta } \log ( \frac{ \mu _{\theta } (\,C_n (y)\, ) }{ \mu _{\theta _0} (\,C_n (y)\, )}\,)\, d \Pi _0 (\theta )< d_\delta \) for all \(n>N\), which proves (34). Therefore, by Jensen's inequality, we get for \(\mu _{\theta _0}\)-almost every \(y\in \Omega \) and every large \(n\geqslant 1\)

$$\begin{aligned}&\frac{1}{n} \log \,\, \int _{B_\delta } \, \mu _{\theta }\ (\,C_n (y)\, )\, d \Pi _0 (\theta )\,\, - \frac{1}{n} \int _{B_\delta } \log ( \mu _{\theta _0} (\,C_n (y)\, )\,)\, d \Pi _0 (\theta ) \nonumber \\&\quad \geqslant \frac{1}{n} \int _{B_\delta } \log (\, \mu _{\theta } (\,C_n (y)\, ) \, d \Pi _0 (\theta ) - \frac{1}{n} \int _{B_\delta } \log ( \mu _{\theta _0} (\,C_n (y)\, )\,)\, d \Pi _0 (\theta ) \nonumber \\&= \frac{1}{n} \int _{B_\delta } \log \frac{ \mu _{\theta } (\,C_n (y)\, ) }{ \mu _{\theta _0} (\,C_n (y)\, )}\, d \Pi _0 (\theta ) \geqslant - d_\delta . \end{aligned}$$
(36)

Moreover, as \(\lim _{n\rightarrow \infty } - \frac{1}{n} \log ( \mu _{\theta _0} (\,C_n (y)\, )=h(\mu _{\theta _0})\) for \(\mu _{\theta _0}\)-almost every y, it follows from the previous inequalities that

$$\begin{aligned} \liminf _{n \rightarrow \infty } \frac{1}{n} \log \,\, \int _{B_\delta } \, \mu _{\theta } (C_n (y))\, d \Pi _0 (\theta )\,\, + \Pi _0 (B_\delta )\, h(\mu _{\theta _0}) \geqslant - d_\delta \end{aligned}$$

for \(\mu _{\theta _0}\) almost every y, which proves  (35), as desired. \(\square \)

Remark 3.11

The previous corollary ensures that for any \(\zeta >0\) and \(\mu _{\theta _0}\)-a.e. \(y\in \Omega \)

$$\begin{aligned} \int _{B_\delta } \, \mu _{\theta } (\,C_{n} (y)\, )\, d \Pi _0 (\theta ) \geqslant e^{- [d_\delta + \Pi _0 (B_\delta )\,\, h(\mu _{\theta _0})+\zeta ] \,{n}} \quad \text {for every large }n\geqslant 1. \end{aligned}$$

Moreover, Remark 3.9 guarantees that \(d_\delta >0\) can be chosen small provided that \(\delta \) is small. In particular, the absolute continuity assumption on the a priori measure \(\Pi _0\) (hypothesis A) implies that \(\Pi _0 (B_\delta )\, h(\mu _{\theta _0})+ d_\delta \) can be taken arbitrarily small, provided that \(\delta \) is small.

Lemma 3.12

For small \(\delta >0\) and \(\mu _{\theta _0}\)-almost every \(y\in \Omega \)

$$\begin{aligned} \limsup _{n \rightarrow \infty } \frac{1}{n}\,\log \int _{\Theta \setminus B_\delta } \mu _\theta (C_n (y) ) \,d\Pi _0 (\theta ) \leqslant \sup _{\theta \in \Theta \setminus B_\delta } \int \log J_\theta d \mu _{\theta _0}<0. \end{aligned}$$
(37)

Moreover, \(\sup _{\theta \in \Theta \setminus B_\delta } \int \log J_\theta \,d \mu _{\theta _0}\rightarrow - h(\mu _{\theta _0})\) as \(\delta \rightarrow 0\).

Proof

Recalling the Gibbs property  (3) for \(\mu _\theta \), the continuous dependence of the constants \(K_\theta \) and the compactness of \(\Theta \), we conclude that there exist uniform constants \(c_1, c_2>0\) so that

$$\begin{aligned} {c_1} \leqslant \frac{\mu _\theta (C_n(x))}{e^{-n P_\theta + S_nf_\theta (x)}} \leqslant {c_2} \qquad \forall \theta \in \Theta , \forall x\in \Omega , \forall n\geqslant 1. \end{aligned}$$
(38)

Furthermore, as the potentials are assumed to be normalized, \(P_\theta =0\) for every \(\theta \in \Theta \). Therefore, there exist \(C_1>0\) and \(C_2>0\) such that, for all \(y\in \Omega \), \(\theta \in \Theta \) and \(n\geqslant 1\)

$$\begin{aligned} C_1< \frac{ \mu _\theta (C_n (y) ) }{ \mu _{\theta _0} (C_n (y) ) } \frac{ e^{ \sum _{j=0}^{n-1} \log J_{\theta _0} (\sigma ^j( y)) }}{e^{ \sum _{j=0}^{n-1} \log J_\theta (\sigma ^j( y))} } <C_2. \end{aligned}$$

Then,

$$\begin{aligned} \limsup _{n \rightarrow \infty } \frac{1}{n} \log C_1&\leqslant \limsup _{n \rightarrow \infty } \frac{1}{n} \Big [ \log \frac{ \mu _\theta (C_n (y) ) }{ \mu _{\theta _0} (C_n (y) ) } + \sum _{j=0}^{n-1} \log J_{\theta _0} (\sigma ^j( y)) - \sum _{j=0}^{n-1} \log J_\theta (\sigma ^j( y)) \Big ] \\&\leqslant \limsup _{n \rightarrow \infty } \frac{1}{n} \log C_2. \end{aligned}$$

Note that the constants \(C_1\) and \(C_2\) above do not depend on y.

In consequence, using the ergodic theorem and that \(h(\mu _{\theta _0} )=\int -\log J_{\theta _0}\, d\mu _{\theta _0}\), one gets

$$\begin{aligned} \limsup _{n \rightarrow \infty } \frac{1}{n} \log \frac{ \mu _\theta (C_n (y) ) }{ \mu _{\theta _0} (C_n (y) ) } \leqslant h(\mu _{\theta _0} ) + \int \log J_\theta d \mu _{\theta _0}, \quad \text {for }\mu _{\theta _0}\text {-a.e. }y. \end{aligned}$$

Fix \(\zeta >0\) arbitrary and small. The previous expression ensures that, for \(\mu _{\theta _0}\)-a.e. \(y\in \Omega \), there exists \(N=N(\zeta ,y)\) such that

$$\begin{aligned} \frac{ \mu _\theta (C_n (y) ) }{ \mu _{\theta _0} (C_n (y) ) } \leqslant e^{n\, (h(\mu _{\theta _0} ) + \int \log J_\theta d \mu _{\theta _0} +\zeta )} \quad \text {for every }n\geqslant N(\zeta ,y). \end{aligned}$$

Given a small \(\delta >0\), by uniqueness of the equilibrium state for \(\log J_\theta \), we have that

$$\begin{aligned} \rho _\delta :=h(\mu _{\theta _0} ) + \sup _{\theta \in \Theta \setminus B_\delta } \int \log J_\theta d \mu _{\theta _0}= \sup _{\theta \in \Theta \setminus B_\delta } [h(\mu _{\theta _0} ) + \int \log J_\theta d \mu _{\theta _0}] <0, \end{aligned}$$

and that \(\rho _\delta \) tends to zero as \(\delta \rightarrow 0\). Then, for \(\mu _{\theta _0}\)-almost every point y, and \(n\ge N(\zeta ,y)\),

$$\begin{aligned} \int _{\Theta \setminus B_\delta } \frac{ \mu _\theta (C_n (y) ) }{ \mu _{\theta _0} (C_n (y) ) } d \Pi _0 (\theta ) \leqslant \int _{\Theta \setminus B_\delta } e^{n\, (h(\mu _{\theta _0} ) + \int \log J_\theta d \mu _{\theta _0} +\zeta )} d \Pi _0 (\theta ) \leqslant \Pi _0 (\Theta \setminus B_\delta ) e^{n (\rho _\delta +\zeta )}, \end{aligned}$$

which implies, for arbitrary small \(\zeta >0\),

$$\begin{aligned} \limsup _{n \rightarrow \infty } \frac{1}{n}\,\log \int _{\Theta \setminus B_\delta } \frac{ \mu _\theta (C_n (y) ) }{ \mu _{\theta _0} (C_n (y) ) } d \Pi _0 (\theta ) \leqslant \rho _\delta +\zeta . \end{aligned}$$

Since \(\lim _{n\rightarrow \infty } - \frac{1}{n} \log \mu _{\theta _0} (\,C_n (y)\,) = h(\mu _{\theta _0})\) for \(\mu _{\theta _0}\)-almost every y, we conclude that, for \(\mu _{\theta _0}\)-almost every point y,

$$\begin{aligned}&\limsup _{n \rightarrow \infty } \frac{1}{n}\,\log \int _{\Theta \setminus B_\delta } \mu _\theta (C_n (y) ) d \Pi _0 (\theta ) \leqslant - h(\mu _{\theta _0}) + \rho _\delta + \zeta \\&\quad = \sup _{\theta \in \Theta \setminus B_\delta } \int \log J_\theta d \mu _{\theta _0}+ \zeta <0. \end{aligned}$$

For fixed y, as \(\zeta >0\) is arbitrary, we obtain (37). \(\square \)

Proposition 3.13

For \(\mu _{\theta _0}\)-almost every \(y\in \Omega \)

$$\begin{aligned} 0\leqslant -\limsup _{n \rightarrow \infty } \frac{1}{n} \log Z_n(y) \,\, \leqslant - \int _{\Theta } \int _\Omega \log J_\theta (x)\, d \mu _{\theta _0}(x)\,d \Pi _0 (\theta ) - h( \mu _{\theta _0}). \end{aligned}$$

Proof

For \(\mu _{\theta _0}\)-almost every y, if

$$\begin{aligned} 0< \limsup _{n \rightarrow \infty } \frac{1}{n}\,\log Z_n(y)= \limsup _{n \rightarrow \infty } \frac{1}{n} \log \int _\Theta \frac{\mu _{\theta } \,(\,C_n(y) \,)}{\mu _{\theta _0} \,(\,C_n(y)\,) } d \Pi _0 (\theta ), \end{aligned}$$

taking \(\delta \rightarrow 0\) in (37), we would reach a contradiction. \(\square \)

The statement of the second inequality in the above Proposition is nothing more than the expression (10).

4 Proof of the Main Results

4.1 Proof of Theorem A

We proceed to show that the a posteriori measures in Theorem A do converge, for \(\mu _{\theta _0}\)-typical points y. In order to prove that \(\Pi _n(\cdot \mid y) \rightarrow \delta _{\theta _0}\) (in the weak\(^*\) topology) it is sufficient to prove that, for every \(\delta >0\), one has \(\Pi _n(\Theta \setminus B_\delta \mid y) \rightarrow 0\) as \(n\rightarrow \infty \). This is the content of the following theorem.

Theorem 4.1

Let \(\Pi _n (\cdot \mid y)\) be the a posteriori measures defined by  (11) and let \(B_\delta \) be the \(\delta \)-neighborhood of \(\theta _0\) defined by  (31). Then, for every small \(\delta >0\) and \(\mu _{\theta _0}\)-a.e. y,

$$\begin{aligned} \Pi _n (B_\delta \mid y)=\frac{\int _{B_\delta } \mu _{\theta } \,(\,C_n (y))\, d \Pi _0 (\theta ) \,}{\int _\Theta \mu _{\theta }(C_n (y)\,)d \Pi _0 (\theta ) \, } \rightarrow 1 \end{aligned}$$
(39)

exponentially fast as \(n \rightarrow \infty .\)

Proof

Fix \(\delta >0\) small. We claim that \(\Pi _n (\Theta \setminus B_\delta \mid y) \) tends to zero exponentially fast as \(n \rightarrow \infty \). We have to estimate

$$\begin{aligned} \limsup _{n \rightarrow \infty } \frac{1}{n} \log \,\int _{B_\delta } \mu _\theta (C_n (y) ) d \Pi _0 (\theta ) \end{aligned}$$

and

$$\begin{aligned} - \limsup _{n \rightarrow \infty } \frac{1}{n} \log \int _{\Theta \setminus B_\delta } \mu _\theta (C_n (y) ) d \Pi _0 (\theta ). \end{aligned}$$

From (35), for \(\mu _{\theta _0}\) almost every point y

$$\begin{aligned} \limsup _{n \rightarrow \infty } \frac{1}{n} \log \,\, \int _{B_\delta } \, \mu _{\theta } (C_n (y))\, d \Pi _0 (\theta )\,\, \geqslant - h(\mu _{\theta _0})\, \Pi _0 (B_\delta )- d_\delta , \end{aligned}$$
(40)

where \(d_\delta \) can be taken small if \(\delta >0\) is small. Fix \(0<\zeta <\frac{h(\mu _{\theta _0})}{2}\). Therefore, from Remark 3.11 we get that for \(\mu _{\theta _0}\) almost every point y

$$\begin{aligned} \int _{B_\delta } \, \mu _{\theta } (\,C_{n} (y)\, )\, d \Pi _0 (\theta ) \geqslant e^{- [d_\delta + \Pi _0 (B_\delta )\,\, h(\mu _{\theta _0})+\zeta ] \,{n}} \quad \text {for every large }n\geqslant 1. \end{aligned}$$
(41)

Observe that \(\sup _{\theta \in \Theta \setminus B_\delta } \int \log J_\theta d \mu _{\theta _0}\) increases to \(- h(\mu _{\theta _0})\) as \(\delta \) decreases to zero (cf. Lemma 3.12). On the other hand, \(- h(\mu _{\theta _0})\, \Pi _0 (B_\delta )- d_\delta \) tends to zero as \(\delta \rightarrow 0\) (cf. Remark 3.11). Thus,

$$\begin{aligned} \sup _{\theta \in \Theta \setminus B_\delta } \int \log J_\theta d \mu _{\theta _0} < - h(\mu _{\theta _0})\, \Pi _0 (B_\delta )- d_\delta -\zeta \end{aligned}$$
(42)

for every small \(\delta >0\). As

$$\begin{aligned} \frac{\int _{B_\delta } \mu _{\theta } \,(\,C_n (y)\,)\, d \Pi _0 (\theta ) \,}{\int _\Theta \mu _{\theta }(\,C_n (y)\,)\,d \Pi _0 (\theta ) \, } + \frac{\int _{\Theta \setminus B_\delta } \mu _{\theta } \,(\,C_n (y)\,)\,d \Pi _0 (\theta ) \,}{\int _ \Theta \mu _{\theta }(\,C_n (y)\,)\,d \Pi _0 (\theta ) \, } =1 \end{aligned}$$

we just have to show that

$$\begin{aligned} \frac{\int _{B_\delta } \mu _{\theta } \,(\,C_n (y))\, d \Pi _0 (\theta ) \,}{\int _{\Theta \setminus B_\delta } \mu _{\theta }(C_n (y)\,)d \Pi _0 (\theta ) \, } \rightarrow \infty , \end{aligned}$$

when \(n \rightarrow \infty .\)

Indeed,

$$\begin{aligned} \frac{\int _{\Theta \setminus B_\delta } \mu _{\theta } \,(\,C_n (y)\,)\,d \Pi _0 (\theta ) \,}{\int _ \Theta \mu _{\theta }(\,C_n (y)\,)\,d \Pi _0 (\theta ) \, }= \frac{1\,}{1+\frac{\int _ { B_\delta } \mu _{\theta }(\,C_n (y)\,)\,d \Pi _0 (\theta )}{\int _{\Theta \setminus B_\delta } \mu _{\theta } \,(\,C_n (y)\,)\,d \Pi _0 (\theta ) } \, }. \end{aligned}$$

Now, equations (37) and (41) and the choice of \(\delta \) in (42) ensure that, for \(\mu _{\theta _0}\)-almost every \(y\in \Omega \),

$$\begin{aligned} \frac{\int _{B_\delta } \mu _{\theta } \,(\,C_n (y))\, d \Pi _0 (\theta ) \,}{\int _{\Theta \setminus B_\delta } \mu _{\theta }(C_n (y)\,)d \Pi _0 (\theta ) \, } \geqslant \frac{ e^{ - [h(\mu _{\theta _0})\, \Pi _0 (B_\delta )+ d_\delta +\zeta ]\, \,n}}{ e^{ n\, \sup _{\theta \in \Theta \setminus B_\delta } \int \log J_\theta d \mu _{\theta _0}} } \end{aligned}$$

which tends to infinity as \(n\rightarrow \infty \). Finally the previous expression also ensures that

$$\begin{aligned}&| \Pi _n (B_\delta \mid y)-1 | =\frac{\int _{\Theta \setminus B_\delta } \mu _{\theta } \,(\,C_n (y))\, d \Pi _0 (\theta ) \,}{\int _\Theta \mu _{\theta }(C_n (y)\,)d \Pi _0 (\theta ) \, } \\&\quad \leqslant e^{ n\,[ \sup _{\theta \in \Theta \setminus B_\delta } \int \log J_\theta d \mu _{\theta _0} + h(\mu _{\theta _0})\, \Pi _0 (B_\delta )+ d_\delta +\zeta ] } \end{aligned}$$

decreases exponentially fast with exponential rate that can be taken uniform for all small \(\delta >0\). This finishes the proof of the theorem. \(\square \)

4.2 Proof of Theorem B

By assumption, there exists a full \(\nu \)-measure subset \(Y'\subset Y\) so that the limit

$$\begin{aligned} \Gamma ^y(\theta ):=\lim _{n\rightarrow \infty } \frac{1}{n} \log \int _\Omega e^{\varphi _n(\theta , x,y)} \, d\mu _\theta (x) \end{aligned}$$

exists for every \(y\in Y'\). Given an arbitrary \(y\in Y'\) we proceed to estimate the asymptotic behavior of the a posteriori measures \(\Pi _n (\cdot \mid y)\) given by  (23).

Given \(\delta >0\), by upper semicontinuity of \(\Gamma ^y(\cdot )\) the function \(\Gamma ^y\) attains its maximum value \(\alpha ^y:=\max _{\theta \in \Theta } \Gamma ^y(\theta )\), and there exists \(d_\delta >0\) (which may be chosen to converge to zero as \(\delta \rightarrow 0\)) so that the \(\delta \)-neighborhood of the set of maximizers

$$\begin{aligned} B_\delta ^y=\big \{\theta \in \Theta :d_\Theta \big (\,\theta , \text {argmax} \, \Gamma ^y\,\big )< \delta \big \} \end{aligned}$$

is a non-empty open subset and \(\Gamma ^y(\theta )<\alpha ^y -d_\delta \) for every \(\theta \in \Theta \setminus B_\delta ^y\).

There are two cases to consider. On the one hand, if \(\Gamma ^y(\cdot )\equiv \alpha ^y\) is constant then \(B_\delta ^y=\Theta \) and we conclude that \(\Pi _n (B_\delta ^y \mid y) = 1\) for all \(n\geqslant 1\), so the convergence in  (16) is trivially satisfied. On the other hand, if \(\Gamma ^y\) is not constant then, as \(\Pi _0\) is fully supported and absolutely continuous, \(\int _\Theta \Gamma ^y(\theta ) \, d\Pi _0(\theta )<\alpha ^y\). Actually, this allows one to estimate the double integral

$$\begin{aligned} \int _{\Theta \setminus B^y_\delta } \int _\Omega e^{\varphi _n(\theta , x,y)} \, d\mu _\theta (x) \, d\Pi _0(\theta ) \end{aligned}$$

without making use of the features of the set \(B^y_\delta \). More precisely, using Jensen inequality and taking the limsup under the sign of the integral,

$$\begin{aligned}&\limsup _{n\rightarrow \infty } \frac{1}{n} \log \int _{\Theta \setminus B^y_\delta } \int _\Omega e^{\varphi _n(\theta , x,y)} \, d\mu _\theta (x) \, d\Pi _0(\theta )\\&\quad \leqslant \int _{\Theta \setminus B^y_\delta } \limsup _{n\rightarrow \infty } \frac{1}{n} \log \int _\Omega e^{\varphi _n(\theta , x,y)} \, d\mu _\theta (x) \, d\Pi _0(\theta ) \\&\quad = \int _{\Theta \setminus B^y_\delta } \Gamma ^y(\theta ) \, d\Pi _0(\theta ). \end{aligned}$$

As the \(\varphi _n\) are assumed to be non-negative, we conclude that \(\Gamma ^y(\cdot )\) is a non-negative function and

$$\begin{aligned} \limsup _{n\rightarrow \infty } \frac{1}{n} \log \int _{\Theta } \int _\Omega e^{\varphi _n(\theta , x,y)} \, d\mu _\theta (x) \, d\Pi _0(\theta )&\leqslant \int _{\Theta } \Gamma ^y(\theta ) \, d\Pi _0(\theta )<\alpha ^y. \end{aligned}$$
(43)

In consequence, if \(0<\zeta <\frac{1}{2}\big [\alpha ^y-\int _{\Theta } \Gamma ^y(\theta ) \, d\Pi _0(\theta )\big ]\) then

$$\begin{aligned} \int _{\Theta } \int _\Omega e^{\varphi _n(\theta , x,y)} \, d\mu _\theta (x) \, d\Pi _0(\theta ) \leqslant e^{(\alpha ^y-\zeta )n} \end{aligned}$$

for every large \(n\geqslant 1\). Now, in order to estimate the measures \(\Pi _n (\cdot \mid y)\) on the nested family \((B_\delta ^y)_{\delta >0}\) we observe that \( \int _\Omega e^{\varphi _n(\theta , x,y)} \, d\mu _\theta (x) \geqslant e^{(\alpha ^y-d_\delta )n}, \; \forall \theta \in B_\delta ^y, \) thus

$$\begin{aligned} \int _{B^y_\delta } \int _\Omega e^{\varphi _n(\theta , x,y)} \, d\mu _\theta (x) \, d\Pi _0(\theta ) \geqslant e^{(\alpha ^y-d_\delta )n} \Pi _0(B^y_\delta ) \end{aligned}$$

for every large \(n\geqslant 1\). In particular, if \(\delta >0\) is small so that \(0<d_\delta <\zeta \), putting together the last expression, inequality (43) and the fact that \(0<\Pi _0(B_\delta ^y)<1\), one concludes that

$$\begin{aligned} \Pi _n (\Theta \setminus B_\delta ^y \mid y)&\,=\, \frac{ \int _{\Theta \setminus B_\delta ^y} \int _\Omega e^{\varphi _n(\theta ,x,y)} \, d\mu _\theta (x) \, d \Pi _0 (\theta )}{ \int _\Theta \int _\Omega e^{\varphi _n(\theta ,x,y)} \, d\mu _\theta (x) \, d \Pi _0 (\theta )} \\&\leqslant \frac{ \int _{\Theta \setminus B_\delta ^y} \int _\Omega e^{\varphi _n(\theta ,x,y)} \, d\mu _\theta (x) \, d \Pi _0 (\theta )}{ \int _{B^y_\delta } \int _\Omega e^{\varphi _n(\theta ,x,y)} \, d\mu _\theta (x) \, d \Pi _0 (\theta )} \\&\leqslant \frac{1}{\Pi _0(B^y_\delta )} e^{-(\zeta -d_\delta )n} \end{aligned}$$

tends exponentially fast to zero, as claimed. Hence, any accumulation point of \((\Pi _n(\cdot \mid y))_{n\geqslant 1}\) (in the weak\(^*\) topology) is supported on the compact set \(\text {argmax} \, \Gamma ^y\), which proves the first statement in the theorem. As the second assertion is immediate from the first one, this concludes the proof of the theorem. \(\square \)

4.3 Proof of Theorem C

Consider the family of loss functions \(\ell _n :\Theta \times X \times Y \rightarrow {\mathbb {R}}\) defined by  (17) associated to an almost additive sequence \(\Phi =(\varphi _n)_{n\geqslant 1}\) of continuous and non-negative observables \(\varphi _n :\Theta \times X \times Y \rightarrow {\mathbb {R}}_+\) satisfying assumptions (H1)-(H2).

  1. (H1)

    for each \(\theta \in \Theta \) and \(x\in X\) there exists a constant \(K_{\theta ,x}>0\) so that, for every \(y\in Y\),

    $$\begin{aligned}&\varphi _n(\theta ,x,y) + \varphi _m(\theta ,x,T^n(y)) - K_{\theta ,x} \leqslant \varphi _{m+n}(\theta ,x,y) \leqslant \varphi _n(\theta ,x,y)\\&\quad + \varphi _m(\theta ,x,T^n(y)) + K_{\theta ,x} \end{aligned}$$
  2. (H2)

    \(\int K_{\theta ,x} d\mu _{\theta }(x)<\infty \) for every \(\theta \in \Theta \).

The a posteriori measures are

$$\begin{aligned} \Pi _n (E \mid y)\,=\, \frac{ \int _E \psi _n(\theta ,y)\, d \Pi _0 (\theta )}{ \int _\Theta \psi _n(\theta ,y) \, d \Pi _0 (\theta )}, \end{aligned}$$
(44)

where the sequence \(\psi _n(\theta ,y)=\int _\Omega \varphi _n(\theta ,x,y) \, d\mu _\theta (x)\) is almost additive in the y-variable. Indeed, this family satisfies

$$\begin{aligned}&\psi _{n}(\theta ,y) + \psi _{m}(\theta , T^n(y)) - \int K_{\theta ,x} d\mu _{\theta }(x) \leqslant \psi _{m+n}(\theta ,y) \leqslant \psi _{n}(\theta ,y) + \psi _{m}(\theta , T^n(y)) \\&\quad + \int K_{\theta ,x} d\mu _{\theta }(x) \end{aligned}$$

for every \(m,n\geqslant 1\), every \(\theta \in \Theta \) and \(y\in Y\). Now, for each fixed \(\theta \in \Theta \), we note that the sequence of observables

$$\begin{aligned} \Big (\psi _n(\theta ,\cdot ) + \int K_{\theta ,x} d\mu _{\theta }(x)\Big )_{n\geqslant 1} \end{aligned}$$

is subadditive. Hence, Kingman’s subadditive ergodic theorem ensures that the limit \(\lim _{n\rightarrow \infty } \frac{\psi _n(\theta , y)}{n}\) exists and is \(\nu \)-almost everywhere constant, equal to the non-negative function \(\psi _*(\theta ):= \inf _{n\geqslant 1} \frac{1}{n} \int {\psi _n(\theta , y)}\, d\nu (y)\). The function \(\psi _*\) is measurable and integrable, because it satisfies \(0\leqslant \psi _*\leqslant \psi _1\). Thus, taking the limit under the sign of the integral and noticing that the denominator is a normalizing term we conclude that

$$\begin{aligned} \lim _{n\rightarrow \infty }\Pi _n (E \mid y) \,=\, \frac{ \int _E \psi _*(\theta )\, d \Pi _0 (\theta )}{ \int _\Theta \psi _*(\theta )\, d \Pi _0 (\theta )} \,=\, \frac{ \int 1_E \psi _*(\theta )\, d \Pi _0 (\theta )}{ \int _\Theta \psi _*(\theta )\, d \Pi _0 (\theta )} \end{aligned}$$
(45)

for every measurable subset \(E\subset \Theta \). This proves the first statement of the theorem.

We proceed to prove the level-1 large deviations estimates on the convergence of the a posteriori measures \(\Pi _n (\cdot \mid y)\) to \(\Pi _*\), whenever T is a subshift of finite type and \(\nu \) is a Gibbs measure associated to a Lipschitz continuous potential \(\varphi \). We will make use of the following instrumental lemma, whose proof is left as a simple exercise to the reader.

Lemma 4.2

Given arbitrary functions \(A,B: \Omega \rightarrow {\mathbb {R}}_+\) and constants \(a,b,\delta >0\) and \(0<\xi <b\), the following holds:

$$\begin{aligned} \Big \{\Big | \frac{A(y)}{B(y)} - \frac{a}{b}\Big |>\delta \Big \} \subset S_1 \;\cup \; S_2 \;\cup \; S_3 \end{aligned}$$

where \( S_1= \Big \{\Big | {B(y)} - {b}\Big |>\xi \Big \}, \quad S_2= \Big \{\frac{1}{b-\xi }\Big | {A(y)} - {a}\Big |>\frac{\delta }{2} \Big \} \) and \( S_3= \Big \{\frac{a}{b(b-\xi )}\Big | {B(y)} - {b}\Big |>\frac{\delta }{2} \Big \}. \)

Let us return to the proof of the large deviation estimates. Given \(g\in C(\Theta ,{\mathbb {R}})\) it is not hard to check using (44) and  (45) that

$$\begin{aligned} \int g\, d\Pi _n(\cdot \mid y)= \frac{ \int g(\theta ) \frac{\psi _n(\theta ,y)}{n}\, d \Pi _0 (\theta )}{ \int _\Theta \frac{\psi _n(\theta ,y)}{n} \, d \Pi _0 (\theta )} \quad \text {and}\quad \int g\, d\Pi _*= \frac{ \int g(\theta ) \psi _*(\theta )\, d \Pi _0 (\theta )}{ \int _\Theta \psi _*(\theta )\, d \Pi _0 (\theta )}.\nonumber \\ \end{aligned}$$
(46)

Fix \(\delta >0\). In order to provide an upper bound for

$$\begin{aligned} \limsup _{n \rightarrow \infty } \frac{1}{n}\,\, \log \,\nu ( \{\,y \in \Omega : \Big |\int g\, d\Pi _n (\cdot \mid y) - \int g\, d\Pi _*\Big |>\delta \}) \end{aligned}$$

we will estimate the set \(\Big \{\Big |\int g\, d\Pi _n (\cdot \mid y) - \int g\, d\Pi _*\Big |>\delta \Big \}\) as in Lemma 4.2. For that purpose, fix \(0<\xi <\min _{\theta \in \Theta } \psi _*(\theta )\). For each fixed \(\theta \in \Theta \) the family \(\Psi ^\theta :=(\psi _n(\theta ,\cdot ))_n\) is almost-additive. Hence Theorem 3.6 implies that

$$\begin{aligned}&\limsup _{n \rightarrow \infty } \frac{1}{n}\,\, \log \,\nu ( \{\,y \in \Omega : \Big | \frac{\psi _n(\theta ,y)}{n} - \psi _*(\theta )\Big |\geqslant \xi \}) \\&\quad \leqslant \sup _{{\mathcal {P}}^1_{\theta ,\xi ,\delta }} \Big \{-P(\sigma ,\varphi )+h_\eta (\sigma ) + \int \varphi \, d\eta \Big \} \end{aligned}$$

where \({\mathcal {P}}^1_{\theta ,\xi ,\delta }\subset {{\mathcal {M}}}_\sigma (\Omega )\) is the space of invariant probability measures \(\eta \) such that \(|{{\mathcal {F}}}(\eta ,\Psi ^\theta ) -\psi _*(\theta )|\geqslant \xi \). In consequence,

$$\begin{aligned}&\limsup _{n \rightarrow \infty } \frac{1}{n} \log \,\nu \Big ( \Big \{\,y \in \Omega : \Big | \int _\Theta \frac{\psi _n(\theta ,y)}{n} \, d \Pi _0 (\theta ) - \int _\Theta \psi _*(\theta )\, d \Pi _0 (\theta )\Big |\geqslant \xi \Big \}\Big ) \nonumber \\&\quad \leqslant \limsup _{n \rightarrow \infty } \frac{1}{n}\,\, \log \,\nu \Big ( \Big \{\,y \in \Omega : \int _\Theta \Big | \frac{\psi _n(\theta ,y)}{n} - \psi _*(\theta )\,\Big |\, d \Pi _0 (\theta )\geqslant \xi \Big \}\Big ) \nonumber \\&\quad \leqslant \limsup _{n \rightarrow \infty } \frac{1}{n}\,\, \log \,\nu \Big ( \Big \{\,y \in \Omega : \Big | \frac{\psi _n(\theta ,y)}{n} - \psi _*(\theta )\,\Big |\, \geqslant \xi , \; \text {for some}\; \theta \in \Theta \Big \}\Big ) \nonumber \\&\quad \leqslant \sup _{\theta \in \Theta } \sup _{{\mathcal {P}}^1_{\theta ,\xi ,\delta }} \Big \{-P(\sigma ,\varphi )+h_\eta (\sigma ) + \int \varphi \, d\eta \Big \}. \end{aligned}$$
(47)
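
The first two inequalities in (47) are consequences of elementary properties of the probability measure \(\Pi _0\): writing \(f_{n,y}(\theta ):=\frac{\psi _n(\theta ,y)}{n}-\psi _*(\theta )\), one has

$$\begin{aligned} \Big | \int _\Theta f_{n,y}(\theta )\, d \Pi _0 (\theta )\Big | \leqslant \int _\Theta \big | f_{n,y}(\theta )\big |\, d \Pi _0 (\theta ), \end{aligned}$$

which gives the inclusion of the first event into the second, while if \(|f_{n,y}(\theta )|<\xi \) for every \(\theta \in \Theta \) then \(\int _\Theta |f_{n,y}(\theta )|\, d \Pi _0 (\theta )<\xi \), so that the second event is contained in the third.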

Analogously,

$$\begin{aligned}&\limsup _{n \rightarrow \infty } \frac{1}{n} \log \,\nu \Big ( \Big \{\,y \in \Omega : \frac{1}{\int _\Theta \psi _*(\theta )\, d \Pi _0 (\theta )-\xi } \Big | \int g(\theta ) \frac{\psi _n(\theta ,y)}{n}\, d \Pi _0 (\theta ) \nonumber \\&\quad - \int _\Theta g(\theta ) \psi _*(\theta )\, d \Pi _0 (\theta )\Big |\geqslant \frac{\delta }{2} \Big \}\Big ) \nonumber \\&\quad \leqslant \limsup _{n \rightarrow \infty } \frac{1}{n} \log \,\nu \Big ( \Big \{\,y \in \Omega : \int _\Theta \Big | \frac{\psi _n(\theta ,y)}{n} - \psi _*(\theta )\,\Big |\, d \Pi _0 (\theta )\nonumber \\&\quad \geqslant \frac{\int _\Theta \psi _*(\theta )\, d \Pi _0 (\theta )-\xi }{2\Vert g\Vert _\infty } \delta \Big \}\Big ) \nonumber \\&\quad \leqslant \sup _{\theta \in \Theta } \sup _{{\mathcal {P}}^2_{\theta ,\xi ,\delta }} \Big \{-P(\sigma ,\varphi )+h_\eta (\sigma ) + \int \varphi \, d\eta \Big \}, \end{aligned}$$
(48)

where \(\eta \in {\mathcal {P}}^2_{\theta ,\xi ,\delta } \subset {{\mathcal {M}}}_\sigma (\Omega )\) if and only if \(|{{\mathcal {F}}}(\eta ,\Psi ^\theta ) -\psi _*(\theta )|\geqslant \frac{\int _\Theta \psi _*(\theta )\, d \Pi _0 (\theta )-\xi }{2\Vert g\Vert _\infty } \delta \); here the first inequality in (48) uses the bound \(\big |\int _\Theta g(\theta )\big (\frac{\psi _n(\theta ,y)}{n}-\psi _*(\theta )\big )\, d \Pi _0 (\theta )\big |\leqslant \Vert g\Vert _\infty \int _\Theta \big |\frac{\psi _n(\theta ,y)}{n}-\psi _*(\theta )\big |\, d \Pi _0 (\theta )\), and the second one follows exactly as in (47). The third set in the decomposition of Lemma 4.2 involves the same quantity as (47), only with a different threshold, and the same argument gives

$$\begin{aligned}&\limsup _{n \rightarrow \infty } \frac{1}{n} \log \,\nu \Big ( \Big \{\,y \in \Omega : \Big | \int _\Theta \frac{\psi _n(\theta ,y)}{n} \, d \Pi _0 (\theta ) - \int _\Theta \psi _*(\theta )\, d \Pi _0 (\theta )\Big |\nonumber \\&\quad \geqslant \frac{(\int _\Theta \psi _*(\theta )\, d \Pi _0 (\theta )-\xi )^2}{ 2\int _\Theta \psi _*(\theta )\, d \Pi _0 (\theta )} \delta \Big \}\Big ) \nonumber \\&\quad \leqslant \sup _{\theta \in \Theta } \sup _{{\mathcal {P}}^3_{\theta ,\xi ,\delta }} \Big \{-P(\sigma ,\varphi )+h_\eta (\sigma ) + \int \varphi \, d\eta \Big \}, \end{aligned}$$
(49)

where \(\eta \in {\mathcal {P}}^3_{\theta ,\xi ,\delta } \subset {{\mathcal {M}}}_\sigma (\Omega )\) if and only if \(|{{\mathcal {F}}}(\eta ,\Psi ^\theta ) -\psi _*(\theta )|\geqslant \frac{(\int _\Theta \psi _*(\theta )\, d \Pi _0 (\theta )-\xi )^2}{ 2\int _\Theta \psi _*(\theta )\, d \Pi _0 (\theta )} \delta \). Altogether, if \(0<\delta <1\) and \(\xi =\delta \cdot \min \{ \inf _{\theta \in \Theta }\psi _*(\theta ),\int _\Theta \psi _*(\theta )\, d \Pi _0 (\theta )\} >0\), estimates (47)-(49) imply that there exists \(c>0\) so that

$$\begin{aligned}&\limsup _{n \rightarrow \infty } \frac{1}{n}\,\, \log \, \nu \Big ( \Big \{\,y \in \Omega : \Big |\int g\, d\Pi _n (\cdot \mid y) - \int g\, d\Pi _*\Big |\geqslant \delta \Big \}\Big ) \\&\quad \leqslant \sup _{\theta \in \Theta } \max _{1\leqslant i \leqslant 3} \sup _{{\mathcal {P}}^i_{\theta ,\xi ,\delta }} \Big \{-P(\sigma ,\varphi )+h_\eta (\sigma ) + \int \varphi \, d\eta \Big \} \\&\quad \le \sup _{\theta \in \Theta } \; \sup _{\{\eta :|{{\mathcal {F}}}(\eta ,\Psi ^\theta ) -\psi _*(\theta )|\geqslant c\delta \}} \Big \{-P(\sigma ,\varphi )+h_\eta (\sigma ) + \int \varphi \, d\eta \Big \} \end{aligned}$$
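
One admissible (by no means optimal) choice of such a constant, writing \(I:=\int _\Theta \psi _*(\theta )\, d \Pi _0 (\theta )\) and \(m:=\min \{ \inf _{\theta \in \Theta }\psi _*(\theta ),\, I\}\), and assuming \(g\) is not identically zero (otherwise there is nothing to prove), is

$$\begin{aligned} c=\min \Big \{ m,\; \frac{I(1-\delta )}{2\Vert g\Vert _\infty },\; \frac{I(1-\delta )^2}{2} \Big \}>0, \end{aligned}$$

since with \(\xi =\delta m\) each of the thresholds defining \({\mathcal {P}}^1_{\theta ,\xi ,\delta }\), \({\mathcal {P}}^2_{\theta ,\xi ,\delta }\) and \({\mathcal {P}}^3_{\theta ,\xi ,\delta }\) is bounded from below by \(c\delta \), so that \({\mathcal {P}}^i_{\theta ,\xi ,\delta }\subset \{\eta :|{{\mathcal {F}}}(\eta ,\Psi ^\theta ) -\psi _*(\theta )|\geqslant c\delta \}\) for \(i=1,2,3\) and every \(\theta \in \Theta \).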

Finally, it remains to guarantee that the right-hand side above is strictly negative. Notice that, since \({{\mathcal {F}}}(\nu ,\Psi ^\theta )=\psi _*(\theta )\), the uniqueness of the equilibrium state (which is an invariant Gibbs measure) for the potential \(\varphi \) and the continuity of the map \(\eta \mapsto {{\mathcal {F}}}(\eta ,\Psi ^\theta )\) imply that the set \({\mathcal {B}}_\theta (\delta ):=\{\eta \in {\mathcal {M}}_\sigma (\Omega ):|{{\mathcal {F}}}(\eta ,\Psi ^\theta ) -\psi _*(\theta )| \geqslant c\delta \}\) is compact and disjoint from \(\{\nu \}\); hence \(d_{{\mathcal {M}}_\sigma (\Omega )}\big (\nu , {\mathcal {B}}_\theta (\delta )\big ) > 0\) for each \(\theta \in \Theta \). Therefore, under the additional assumption that both maps \(\theta \mapsto {{\mathcal {F}}}(\eta ,\Psi ^\theta )=\inf _{n\geqslant 1}\frac{1}{n} \int \psi _n(\theta ,\cdot )\,d\eta \) and \(\theta \mapsto \psi _*(\theta )={{\mathcal {F}}}(\nu ,\Psi ^\theta )\) are continuous, we conclude that

$$\begin{aligned} \min _{\theta \in \Theta } d_{{\mathcal {M}}_\sigma (\Omega )}\big (\nu , {\mathcal {B}}_\theta (\delta )\big ) > 0 \end{aligned}$$

and, consequently, since the map \(\eta \mapsto h_\eta (\sigma )+\int \varphi \, d\eta \) is upper semicontinuous (entropy varies upper semicontinuously because \(\sigma \) is expansive), is bounded above by \(P(\sigma ,\varphi )\) and attains this value only at the unique equilibrium state \(\nu \), its supremum over the compact set of measures \(\eta \) with \(d_{{\mathcal {M}}_\sigma (\Omega )}(\eta ,\nu )\geqslant \min _{\theta \in \Theta } d_{{\mathcal {M}}_\sigma (\Omega )}\big (\nu , {\mathcal {B}}_\theta (\delta )\big )\), a set which contains every \({\mathcal {B}}_\theta (\delta )\), is attained and strictly smaller than \(P(\sigma ,\varphi )\). This yields

$$\begin{aligned} \sup _{\theta \in \Theta } \; \sup _{\{\eta :|{{\mathcal {F}}}(\eta ,\Psi ^\theta ) -\psi _*(\theta )|\geqslant c\delta \}} \Big \{-P(\sigma ,\varphi )+h_\eta (\sigma ) + \int \varphi \, d\eta \Big \}<0, \end{aligned}$$

which finishes the proof of the theorem. \(\square \)