Bayesian Methods

Analysis of Neural Data

Part of the book series: Springer Series in Statistics ((SSS))


Abstract

Few results are as consequential for data analysis as Bayes’ Theorem. The theorem itself, which we introduced in Sections 3.1.4 and 4.3.3, is a simple re-formulation of conditional probability and is easy to derive.


Notes

  1.

    Exceptions occur for improper priors; see Section 16.1.4.

  2.

    See Kass (2011) for an elaboration of the current philosophical pragmatism among most practicing statisticians.

  3.

    Adding two successes and two failures has been advocated as a way of achieving good frequentist coverage probability of approximate 95 % CIs, i.e., intervals based on (7.22) with \(\hat{p}\) replaced by \((x+2)/(n+4)\). See Agresti and Caffo (2000).
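    As a sketch of this adjustment, the interval based on (7.22) with \(\hat{p}\) replaced by \((x+2)/(n+4)\) can be coded as follows (the function name and example counts are ours; following Agresti and Caffo, \(n+4\) is also used in the standard error):

```python
import math

def agresti_coull_ci(x, n, z=1.96):
    """Approximate 95 % CI for a binomial proportion with two successes
    and two failures added: the Wald interval with p-hat replaced by
    (x + 2)/(n + 4) and n replaced by n + 4, as in Agresti and Caffo."""
    p_tilde = (x + 2) / (n + 4)
    se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return p_tilde - z * se, p_tilde + z * se

# With x = 0 the plain Wald interval collapses to a single point;
# the adjusted interval remains informative.
lo, hi = agresti_coull_ci(x=0, n=20)
```

    One appeal of the adjustment is visible in the `x = 0` case above: the unadjusted interval would have zero width, while the adjusted one does not.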

  4.

    A similar argument may be applied to the case of estimating a normal standard deviation \(\sigma \) when the mean \(\mu \) is assumed known, and this produces a uniform prior \(\pi _{\xi }(\xi )=1\) on \(\xi =\log \sigma \), with the change of variables formula (see the theorem on p. 62) giving \(\pi (\sigma )=1/\sigma \).
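    Written out, the change-of-variables step referred to here is a single line:

```latex
\pi(\sigma)
  \;=\; \pi_{\xi}(\log\sigma)\,\Bigl|\frac{d\xi}{d\sigma}\Bigr|
  \;=\; 1 \cdot \frac{1}{\sigma}
  \;=\; \frac{1}{\sigma}.
```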

  5.

    In order for the right-hand side to integrate to 1 we must have \(k=\sqrt{-\ell ^{\prime \prime }(\hat{\theta })/2\pi }\).

  6.

    Additional details may be found in many sources including Kass and Vos (1997, Theorem 2.2.13) and Chen (1985).

  7.

    The name dates from the period when computer-based simulation methods were first being developed: Monte Carlo was the site of a famous gambling establishment, which was frequented by the uncle of one of the developers of these methods. See Metropolis (1987).

  8.

    Because we are assuming discrete time the memoryless distribution of durations becomes geometric rather than exponential, as we noted on p. 120.
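    The memoryless property of the geometric distribution can be checked directly (the values of \(q\), \(s\), and \(t\) below are arbitrary illustrations):

```python
# Discrete time makes the memoryless waiting-time distribution geometric
# rather than exponential: P(T > s + t | T > s) = P(T > t).
# For T geometric with per-step success probability q, P(T > k) = (1 - q)**k.
q, s, t = 0.3, 4, 7

def surv(k):
    """Survival function P(T > k) of the geometric distribution."""
    return (1 - q) ** k

cond = surv(s + t) / surv(s)   # P(T > s + t | T > s)
```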

  9.

    The chain must be irreducible (if the chain is in state \(i\) at time \(t\) it is possible for it to get to state \(j\) in the future), aperiodic (the chain does not cycle deterministically through the states), and recurrent (if the chain is in state \(i\) at time \(t\) it will eventually return to state \(i\) in the future); see, for example, Ross (1996, Theorem 4.3.3).

  10.

    It is not too difficult to derive the formula, but we omit the arithmetic. For large \(t\) the probability of the channel being open is

    $$ P(X_t=1) \approx \frac{P_{01}}{P_{01}+P_{10}} $$

    which depends on the probability \(P_{01}\) of switching from closed to open relative to the probability \(P_{10}\) of switching from open to closed.
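    A quick numerical check of this formula (the switching probabilities below are illustrative choices, not values from the text):

```python
import random

# Two-state channel chain: state 0 = closed, 1 = open, with switching
# probabilities P01 (closed -> open) and P10 (open -> closed).
# The long-run fraction of time spent open should approach
# P01 / (P01 + P10).
P01, P10 = 0.02, 0.05
random.seed(0)
x, open_steps, T = 0, 0, 200_000
for _ in range(T):
    if x == 0:
        x = 1 if random.random() < P01 else 0
    else:
        x = 0 if random.random() < P10 else 1
    open_steps += x
frac_open = open_steps / T   # should be near 0.02 / 0.07
```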

  11.

    It is also very important in many other situations, where Markov chains are used as statistical models.

  12.

    Details concerning the continuous case, for a parameter \(\theta \), may be found in many sources (for example, Robert and Casella 2004).

  13.

    The noise random variable \(\epsilon _i\) in the regression model (12.1) is unobservable, but would not typically be called latent. To exclude such cases a random variable could be called latent only if it cannot be written in terms of observable random variables. Thus, under this definition, because (12.1) implies \(\epsilon _i=Y_i - f(x_i)\), \(\epsilon _i\) would not be a latent variable. See Bollen (2002).

  14.

    To speed computation Tokdar et al. chose to work with the inter-spike intervals instead of the variables \(Y_t\) we have defined here.

  15.

    Nonparametric methods (Section 13.3) are based on statistical models of a more general form that do not depend on a finite-dimensional parameter vector.

  16.

    In principle this process can continue, with \(\lambda \) distributed according to a family of densities, and so on, but such deeper hierarchies do not arise very often in practice.

  17.

    See, for example, his 1973 article “Exploring data analysis as part of a larger whole,” reprinted in Tukey (1987).

  18.

    The data analyzed here were based on a pre-publication draft of the Sklar and Strauss paper and are slightly different from those reported in the final version. Because strain 11 had such a large uncertainty, the authors replicated their experiment for strain 11 with a much larger sample and obtained results that were much more consistent with the other strains.

  19.

    Alternatively, this likelihood may be integrated over \(\mu \) and then maximized over \(\tau \), which produces a slightly different and sometimes preferable estimate often known as the REML estimate of \(\tau \), for restricted maximum likelihood estimate.

  20.

    In the jargon of computer science we would say that the hyperparameter \(\tau \) was learned from the data, as opposed to fixed within the estimation algorithm.

  21.

    In the discrete case \(f_X(x)=\sum _y f(x,y)\) and the sum of positive quantities is positive. In the continuous case the integral of a positive function is positive.

  22.

    Theoretical analysis of Gibbs sampling shows that the conditions mentioned in footnote 9 are satisfied (see Robert and Casella 2004). The name comes from Geman and Geman (1984), who applied it to image restoration, where there is a close analogy with the Gibbs distribution in statistical mechanics.
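    A minimal Gibbs-sampling sketch for a case where both full conditionals are available in closed form, a standard bivariate normal with correlation \(\rho \) (an illustrative target of ours, not an example from the text): each conditional is univariate normal, \(X \mid Y=y \sim N(\rho y,\, 1-\rho ^2)\), and symmetrically for \(Y \mid X\).

```python
import random

# Alternate draws from the two full conditionals of a standard
# bivariate normal with correlation rho; the sampler's long-run
# output has the bivariate normal as its stationary distribution.
rho = 0.8
sd = (1 - rho ** 2) ** 0.5
random.seed(1)
x = y = 0.0
xs = []
for i in range(50_000):
    x = random.gauss(rho * y, sd)   # draw from X | Y = y
    y = random.gauss(rho * x, sd)   # draw from Y | X = x
    if i >= 1_000:                  # discard burn-in
        xs.append(x)
mean_x = sum(xs) / len(xs)                              # should be near 0
var_x = sum(v * v for v in xs) / len(xs) - mean_x ** 2  # should be near 1
```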

  23.

    For a univariate normal variance parameter \(\sigma ^2\) the conjugate prior family is called inverse-gamma because \(\sigma ^{-2}\) follows a gamma distribution. The multivariate extension is called inverse-Wishart. For a \(p \times p\) variance matrix the inverse-Wishart itself has \(1+p(p+1)/2\) free parameters that must be selected. Kass and Natarajan (2006) suggested a method of doing so in the context of hierarchical models.

  24.

    A rough estimate of \(\Sigma \) may be obtained by setting \(V_1=\cdots = V_k=V^*\) in (16.43), where \(V^*\) is some kind of average of the \(V_i\) matrices (such as the inverse of the mean of the inverse matrices) and then applying the method of moments.

  25.

    In fact, according to this assumption on the \(Y_t\) variables, the neuron’s spike trains flip back and forth between two discrete-time versions of Poisson processes, a bursting Poisson process and a non-bursting Poisson process; see Chapter 19.

  26.

    For a review of these ideas together with commentary on algorithms see Brockwell et al. (2007).

  27.

    Often movement velocity is used, and sometimes direction, velocity, and acceleration are all used as components of \(X_t\).

  28.

    In practice one also runs a version of the Kalman filter backwards in time, beginning at the last time point \(t=T\) and ending with \(t=1\). This pair of forward and backward algorithms is called the Kalman smoother.
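    The forward/backward pair can be sketched for a scalar state-space model, \(X_t = a X_{t-1} + w_t\) with \(w_t \sim N(0,q)\) and \(Y_t = X_t + v_t\) with \(v_t \sim N(0,r)\); the backward recursion below is the Rauch-Tung-Striebel form, and all parameter values are illustrative:

```python
import random

def kalman_smoother(ys, a=0.9, q=0.1, r=0.5, m0=0.0, v0=1.0):
    """Forward Kalman filter followed by a backward smoothing pass."""
    fm, fv = [], []                     # filtered means and variances
    m, v = m0, v0
    for y in ys:                        # forward pass, t = 1, ..., T
        mp, vp = a * m, a * a * v + q   # one-step prediction
        k = vp / (vp + r)               # Kalman gain
        m, v = mp + k * (y - mp), (1 - k) * vp
        fm.append(m)
        fv.append(v)
    sm, sv = fm[-1], fv[-1]             # backward pass starts at t = T
    sms, svs = [sm], [sv]
    for t in range(len(ys) - 2, -1, -1):
        vp = a * a * fv[t] + q
        j = a * fv[t] / vp              # smoother gain
        sm = fm[t] + j * (sm - a * fm[t])
        sv = fv[t] + j * j * (sv - vp)
        sms.append(sm)
        svs.append(sv)
    return fm, fv, sms[::-1], svs[::-1]

random.seed(2)
ys = [random.gauss(0.0, 1.0) for _ in range(100)]
fm, fv, sm, sv = kalman_smoother(ys)
```

    Because the smoother conditions on the whole series rather than only the past, the smoothed variances can never exceed the filtered ones.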

  29.

    A possible issue is the extent to which strain 12 was selected post hoc, after the data had been examined. It is possible to correct the Bayes factor for such post hoc selection, analogously to (though differently than) the way \(p\)-values may be adjusted (see Section 11.3). The investigators repeated the experiment on strain 12 and found similar results, which provided strong confirmation of \(H_0\).

  30.

    Mathematically the situation is reversed: an elegant theorem due to Doob establishes the consistency of the posterior distribution, and thus of Bayes factors, under weak conditions. Equation (11.12) then provides consistency of BIC. For precise statements see Schervish (1995, Section 7.2.1) and the references in Kass and Raftery (1995, Section 4.1.3).

  31.

    In Eq. (10.24) the statistic \(Q\) could follow a standard distribution, such as a \(t_{\nu }\)-distribution, in which case the calculation would be based on the distribution of \(Q\). However, Eq. (10.24) may be rewritten as

    $$ p = \int _R f_0(x)dx $$

    where \(R=\{x: Q \ge q_{obs}\}\).
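    This rewritten form is easy to evaluate numerically; as an illustration, under the assumption (ours, for concreteness) that \(f_0\) is the standard normal density, \(p\) is the upper tail area \(1-\Phi (q_{obs})\):

```python
import math

# p = integral of f_0 over R = {x : x >= q_obs}, with f_0 taken to be
# the standard normal density, so p = 1 - Phi(q_obs).
def upper_tail_p(q_obs):
    return 0.5 * math.erfc(q_obs / math.sqrt(2.0))

p = upper_tail_p(1.96)   # about 0.025
```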

  32.

    For the normal testing problem of Section 10.3.1, one may consider the class of all pdfs that are symmetric around \(\mu =\mu _0\), and also have their mode at \(\mu =\mu _0\). Sellke et al. (2001) reported results based on this assumption. They also considered the distribution of the \(p\)-value. Under \(H_0\) this distribution is uniform (see Section 10.4.1) and under \(H_A\) they assumed it to take the form \(f(p)=\xi p^{\xi -1}\) for some \(\xi \), which provided another way to formalize the family of alternatives and compute the minimum value of the Bayes factor.
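    Under the family \(f(p)=\xi p^{\xi -1}\), the minimum Bayes factor has a simple closed form: minimizing \(1/(\xi p^{\xi -1})\) over \(0<\xi <1\) gives the Sellke et al. (2001) bound \(-e\,p\log p\) for \(p<1/e\), which can be computed directly:

```python
import math

# Minimum Bayes factor in favour of H0 over the alternatives
# f(p) = xi * p**(xi - 1): the minimum of 1 / (xi * p**(xi - 1))
# over 0 < xi < 1 is -e * p * log(p) for p < 1/e, and 1 otherwise.
def min_bayes_factor(p):
    if p >= 1.0 / math.e:
        return 1.0
    return -math.e * p * math.log(p)

b = min_bayes_factor(0.05)   # about 0.41
```

    The value for \(p=0.05\) illustrates the point of the footnote: even the most favourable alternative in this family yields a Bayes factor of only about 0.41 against \(H_0\), much weaker evidence than the \(p\)-value alone might suggest.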

Author information

Correspondence to Robert E. Kass.


Copyright information

© 2014 Springer Science+Business Media New York

Cite this chapter

Kass, R.E., Eden, U.T., Brown, E.N. (2014). Bayesian Methods. In: Analysis of Neural Data. Springer Series in Statistics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-9602-1_16
