Linear Models for Systematics and Nuisances

Rodrigo Luger; Daniel Foreman-Mackey; David W. Hogg

doi:10.3847/2515-5172/aa96b5

1. Introduction

The target of many astronomical studies is the recovery of tiny astrophysical signals living in a sea of uninteresting (but usually dominant) noise. In many contexts (i.e., stellar time-series, or high-contrast imaging, or stellar spectroscopy), there are structured components in this noise caused by systematic effects in the astronomical source, the atmosphere, the telescope, or the detector. More often than not, evaluation of the true physical model for these nuisances is computationally intractable and dependent on too many (unknown) parameters to allow rigorous probabilistic inference.

Sometimes, housekeeping data—and often the science data themselves—can be used as predictors of the systematic noise. Linear combinations of these predictors (or linear combinations of nonlinear functions of these predictors) are often used as computationally tractable models that can capture the nuisances. These models can be used to fit and subtract systematics prior to investigation of the signals of interest, or they can be used in a simultaneous fit of the systematics and the signals. For our purposes, a linear model for a column vector of data ${\boldsymbol{y}}$ can be written in the form

$\begin{eqnarray}&&{\boldsymbol{y}}={\boldsymbol{\mu }}({\boldsymbol{\theta }})+{\boldsymbol{A}}{\boldsymbol{w}}+\mathrm{noise}\end{eqnarray} \tag{ 1 }$

where ${\boldsymbol{\mu }}({\boldsymbol{\theta }})$ is the column vector expectation or mean model (the part of the model that we care about), ${\boldsymbol{A}}$ is a design matrix, whose columns are basis vectors (predictors) for the systematics, and ${\boldsymbol{w}}$ is the vector of weights or amplitudes, one for each basis vector.

Similar models have been used to describe the systematics in astrophysical time series data (Smith et al. 2012; Luger et al. 2016; Wang et al. 2016), galaxy or stellar spectra (Tsalmantza & Hogg 2012; Ness et al. 2015), and imaging (Fergus et al. 2014; Wang et al. 2017). One issue with flexible data-driven models is their tendency to overfit and reduce the astrophysical signal of interest. This is generally tackled using a dimensionality reduction technique like principal component analysis or by applying strong priors or a regularization to the weights vector ${\boldsymbol{w}}$ .

In this Note, we show that if a Gaussian prior is placed on the weights ${\boldsymbol{w}}$ of the linear components, the weights can be marginalized out with an operation in pure linear algebra, which can (often) be made fast. We illustrate this model by demonstrating the applicability of a linear model for the nonlinear systematics in K 2 time-series data, where the dominant noise source for many stars is spacecraft motion and variability.

2. The Problem

Consider a data set ${\boldsymbol{y}}$ of N measurements y_i with covariance matrix ${\boldsymbol{C}}$ . In the common case of data collected with measurement error ${\sigma }_{i}$ on individual data points but no correlation across measurements, ${\boldsymbol{C}}$ is a diagonal matrix with ${{\boldsymbol{C}}}_{{ij}}={\sigma }_{i}{\delta }_{{ij}}$ , although in general the off-diagonal elements capture the covariance between different measurements. Given a linear model as in Equation (1), the probability of the data under the model is given by a normal distribution with mean ${\boldsymbol{\mu }}({\boldsymbol{\theta }})+{\boldsymbol{A}}{\boldsymbol{w}}$ and covariance ${\boldsymbol{C}}$ :

$\begin{eqnarray}&&p({\boldsymbol{y}}| {\boldsymbol{\theta }},{\boldsymbol{w}})={ \mathcal N }({\boldsymbol{y}};{\boldsymbol{\mu }}({\boldsymbol{\theta }})+{\boldsymbol{A}}{\boldsymbol{w}},{\boldsymbol{C}}).\end{eqnarray} \tag{ 2 }$

However, we are specifically not interested in the value of ${\boldsymbol{w}}$ . Instead, we will marginalize over it. To perform this marginalization we must place a prior on ${\boldsymbol{w}}$ that we will assume to be Gaussian:

$\begin{eqnarray*}&&p({\boldsymbol{w}})={ \mathcal N }({\boldsymbol{w}};0,{\boldsymbol{\Lambda }}).\end{eqnarray*}$

With this prior and the likelihood in Equation (2), our goal is to marginalize out the weights ${\boldsymbol{w}};$ that is, we want to compute the marginalized likelihood,

$\begin{eqnarray}&&p({\boldsymbol{y}}| {\boldsymbol{\theta }})={\int }_{-\infty }^{\infty }p({\boldsymbol{w}})p({\boldsymbol{y}}| {\boldsymbol{\theta }},{\boldsymbol{w}})d{\boldsymbol{w}}.\end{eqnarray} \tag{ 3 }$

In doing so, we would like to avoid explicitly solving for the weights ${\boldsymbol{w}}$ while also avoiding the evaluation of numerical integrals.

3. The Solution

As we show in the Appendix, the marginalized likelihood (Equation (3)) may be expressed as:

$\begin{eqnarray}&&p({\boldsymbol{y}}| {\boldsymbol{\theta }})={ \mathcal N }({\boldsymbol{y}};{\boldsymbol{\mu }}({\boldsymbol{\theta }}),{\boldsymbol{C}}+{\boldsymbol{A}}{\boldsymbol{\Lambda }}\,{{\boldsymbol{A}}}^{\top }).\end{eqnarray} \tag{ 4 }$

This marginalized likelihood function can be numerically maximized to find the maximum likelihood parameters ${\theta }^{\star }$ , or it can be multiplied by a prior $p(\theta )$ and used for posterior inference. In either case, the evaluation of the model will include the effects of marginalizing over ${\boldsymbol{w}}$ in the linear model and any uncertainties in those values will be propagated to the results.

It is often useful to compute the value of the linear model so that we can "remove" systematics from the data. To derive this, we recognize that Equation (4) is the likelihood of a Gaussian Process. This means that conditioned on the data and a choice of the parameters ${\boldsymbol{\theta }}$ , the systematics will have a Gaussian distribution with mean ${\boldsymbol{m}}$ and covariance ${{\boldsymbol{\Sigma }}}_{{\boldsymbol{m}}}$ given by (Rasmussen & Williams 2006)

$\begin{eqnarray}{\boldsymbol{m}} & = & \mu ({\boldsymbol{\theta }})+{\boldsymbol{A}}{\boldsymbol{\Lambda }}\,{{\boldsymbol{A}}}^{\top }{\left[{\boldsymbol{C}}+{\boldsymbol{A}}{\boldsymbol{\Lambda }}{{\boldsymbol{A}}}^{\top }\right]}^{-1}\left[{\boldsymbol{y}}-\mu ({\boldsymbol{\theta }})\right]\\ {{\boldsymbol{\Sigma }}}_{{\boldsymbol{m}}} & = & {\boldsymbol{A}}{\boldsymbol{\Lambda }}\,{{\boldsymbol{A}}}^{\top }-{\boldsymbol{A}}{\boldsymbol{\Lambda }}\,{{\boldsymbol{A}}}^{\top }{\left[{\boldsymbol{C}}+{\boldsymbol{A}}{\boldsymbol{\Lambda }}{{\boldsymbol{A}}}^{\top }\right]}^{-1}{{\boldsymbol{A}}}^{\top }{\boldsymbol{\Lambda }}\,{\boldsymbol{A}}.\end{eqnarray} \tag{ 5 }$

4. The Implications

In the previous section, we presented an expression that can be used to compute the likelihood function for a linear model marginalized over the weights vector. Linear models have been used throughout the astrophysics literature as data-driven descriptions of complicated physical processes but, in some cases, this analytic marginalization could be applied to improve performance—both computational and statistical—of the models. Linear models become more expressive as more basis components are added, but they also become prone to overfitting. A prior can be used to mitigate overfitting while maintaining the flexibility of the model and the trick described in this Note can be used to efficiently compute the likelihood marginalized over the many linear parameters ${\boldsymbol{w}}$ .

Figure 1 shows an example where the marginalized likelihood function described here is used to fit a data-driven systematics model to a light curve from the K 2 mission. The details of this model appear elsewhere (Luger et al. 2016, 2017), but the basic idea is that this linear model can be used to describe the noise introduced into the light curve by motion of the spacecraft's pointing. This can be combined with a physical model of a transiting planet to characterize the planet even when the signal is not visible in the raw data.

It is a pleasure to thank Patrick Cooper, Boris Leistedt, Bernhard Schölkopf, and Dun Wang for helping us understand all of this.

Appendix:

The marginalized likelihood may be expressed as follows:

$\begin{eqnarray}&&p({\boldsymbol{y}}| {\boldsymbol{\theta }})=\displaystyle \frac{1}{| 2\pi {\boldsymbol{\Lambda }}{| }^{\tfrac{1}{2}}| 2\pi {\boldsymbol{C}}{| }^{\tfrac{1}{2}}}{\int }_{-\infty }^{\infty }\exp \left[-\displaystyle \frac{1}{2}z\right]d{\boldsymbol{w}}\end{eqnarray} \tag{ 6 }$

where

$\begin{eqnarray}&&z={{\boldsymbol{w}}}^{\top }{{\boldsymbol{\Lambda }}}^{-1}{\boldsymbol{w}}+{\left({\boldsymbol{r}}-{\boldsymbol{A}}{\boldsymbol{w}}\right)}^{\top }{{\boldsymbol{C}}}^{-1}({\boldsymbol{r}}-{\boldsymbol{A}}{\boldsymbol{w}})\end{eqnarray} \tag{ 7 }$

and ${\boldsymbol{r}}={\boldsymbol{y}}-{\boldsymbol{\mu }}({\boldsymbol{\theta }})$ . The integral is easier to evaluate if we complete the square and write:

$\begin{eqnarray}&&z={\left({\boldsymbol{w}}-{\boldsymbol{h}}\right)}^{\top }{{\boldsymbol{\Sigma }}}^{-1}({\boldsymbol{w}}-{\boldsymbol{h}})+k,\end{eqnarray} \tag{ 8 }$

where, by comparison with Equation (7), it can be shown that

$\begin{eqnarray}&&{{\boldsymbol{\Sigma }}}^{-1}={{\boldsymbol{\Lambda }}}^{-1}+{{\boldsymbol{A}}}^{\top }{{\boldsymbol{C}}}^{-1}{\boldsymbol{A}}\end{eqnarray} \tag{ 9 }$

$\begin{eqnarray}&&{\boldsymbol{h}}={\boldsymbol{\Sigma }}{{\boldsymbol{A}}}^{\top }{{\boldsymbol{C}}}^{-1}{\boldsymbol{r}}\end{eqnarray} \tag{ 10 }$

$\begin{eqnarray}&&k={{\boldsymbol{r}}}^{\top }\left({{\boldsymbol{C}}}^{-1}-{{\boldsymbol{C}}}^{-1}{\boldsymbol{A}}{\boldsymbol{\Sigma }}{{\boldsymbol{A}}}^{\top }{{\boldsymbol{C}}}^{-1}\right){\boldsymbol{r}}.\end{eqnarray} \tag{ 11 }$

We may thus write

$\begin{eqnarray}&&p({\boldsymbol{y}}| {\boldsymbol{\theta }})=\displaystyle \frac{1}{| 2\pi {\boldsymbol{\Lambda }}{| }^{\tfrac{1}{2}}| 2\pi {\boldsymbol{C}}{| }^{\tfrac{1}{2}}}\exp \left[-\displaystyle \frac{k}{2}\right]{\int }_{-\infty }^{\infty }\exp \left[-\displaystyle \frac{1}{2}{\left({\boldsymbol{w}}-{\boldsymbol{h}}\right)}^{\top }{{\boldsymbol{\Sigma }}}^{-1}({\boldsymbol{w}}-{\boldsymbol{h}})\right]d{\boldsymbol{w}}.\end{eqnarray} \tag{ 12 }$

The integral is that of a Gaussian, which evaluates to $| 2\pi {\boldsymbol{\Sigma }}{| }^{\tfrac{1}{2}}$ . By the Matrix Determinant Lemma and the Woodbury Identity (for example, Woodbury 1950; Harville 1997),

$\begin{eqnarray}| {\boldsymbol{\Sigma }}| & = & {| {{\boldsymbol{\Sigma }}}^{-1}| }^{-1}=\displaystyle \frac{| {\boldsymbol{\Lambda }}| | {\boldsymbol{C}}| }{| {\boldsymbol{C}}+{\boldsymbol{A}}{\boldsymbol{\Lambda }}{{\boldsymbol{A}}}^{\top }| }\\ k & = & {{\boldsymbol{r}}}^{\top }{\left({\boldsymbol{C}}+{\boldsymbol{A}}{\boldsymbol{\Lambda }}{{\boldsymbol{A}}}^{\top }\right)}^{-1}{\boldsymbol{r}}.\end{eqnarray} \tag{ 13 }$

Combining these results, the expression in Equation (12) simplifies to

$\begin{eqnarray}&&p({\boldsymbol{y}}| {\boldsymbol{\theta }})=\displaystyle \frac{1}{| 2\pi ({\boldsymbol{C}}+{\boldsymbol{A}}{\boldsymbol{\Lambda }}{{\boldsymbol{A}}}^{\top }){| }^{\tfrac{1}{2}}}\exp \left[-\displaystyle \frac{1}{2}{\left({\boldsymbol{y}}-{\boldsymbol{\mu }}({\boldsymbol{\theta }})\right)}^{\top }{\left({\boldsymbol{C}}+{\boldsymbol{A}}{\boldsymbol{\Lambda }}{{\boldsymbol{A}}}^{\top }\right)}^{-1}\left({\boldsymbol{y}}\,-\,{\boldsymbol{\mu }}({\boldsymbol{\theta }})\right)\right].\end{eqnarray} \tag{ 14 }$

This is a normal distribution with mean ${\boldsymbol{\mu }}({\boldsymbol{\theta }})$ and covariance ${\boldsymbol{C}}+{\boldsymbol{A}}{\boldsymbol{\Lambda }}{{\boldsymbol{A}}}^{\top }$ :

$\begin{eqnarray}&&p({\boldsymbol{y}}| {\boldsymbol{\theta }})={ \mathcal N }({\boldsymbol{y}};{\boldsymbol{\mu }}({\boldsymbol{\theta }}),{\boldsymbol{C}}+{\boldsymbol{A}}{\boldsymbol{\Lambda }}{{\boldsymbol{A}}}^{\top }).\end{eqnarray} \tag{ 15 }$

Linear Models for Systematics and Nuisances

Article metrics

Permissions

Author affiliations

ORCID iDs

Dates

1. Introduction

2. The Problem

3. The Solution

4. The Implications

Appendix:

Linear Models for Systematics and Nuisances

Article metrics

Permissions

Share this article

Author affiliations

ORCID iDs

Dates

1. Introduction

2. The Problem

3. The Solution

4. The Implications

Appendix: