1 Introduction

The interpretation of collected research data is a frequent and important step of the data analysis procedure in many scientific fields. A common task in this interpretation is the estimation of parameters using predictions obtained from analytic calculations or simulation programs, usually referred to as fitting. Frequently used fitting algorithms rely on numerical methods and utilize iterative function minimization algorithms [1,2,3,4,5,6,7]. The availability of more computational power and the development of improved algorithms such as machine learning techniques [8, 9] have led to more comprehensive statistical methods that can be employed for the estimation of parameters [10,11,12,13]. At the same time, simulation programs have become more complex and more computationally demanding, placing constraints on the inference algorithms. For example, in the field of particle physics, such simulations provide fully differential predictions in perturbative quantum chromodynamics and the electroweak theory (see Refs. [14,15,16,17,18,19,20,21,22] for some recent advances) and may take several hundred up to a few thousand years on a single modern CPU core. Sometimes, the simulation of the physical process is followed by a detailed simulation of the experimental apparatus in order to provide synthetic data as close as possible to the recorded raw data. At the experiments at the CERN Large Hadron Collider (LHC), the simulated data include the physics of the proton–proton collision [23,24,25] and the simulation of processes that take place subsequently in the detector material [26]. This results in the simulation of about 100 million electronic channels [27], which are processed similarly to real data [28]. Consequently, these simulated data cannot be used in iterative optimization algorithms because of their computational cost. In addition, these predictions can be provided only for a few selected values of the theory parameters of interest.

Besides the restrictions associated with computationally demanding simulations, there exist other reasons why less complex simulations cannot be used by certain optimization algorithms: numerical instabilities or statistical fluctuations in the predictions can result in fit instabilities; input quantities for the predictions are sometimes only available for selected discrete values of the parameters of interest [for example, parameterizations of the parton distribution functions of the proton (PDFs), which are only available for a few values of the strong coupling constant \(\alpha _s\) [29, 30]]; or there are technical limitations when interfacing a simulation program to the inference algorithm, such that the parameter(s) of interest cannot be made explicit to the minimization algorithm. A further frequent limitation for simulation-based inference is related to the intellectual property of the simulation program, where the computer code is not publicly available, but the obtained predictions are available for a given set of reference values. In short, in many cases, the inference algorithm can only use predictions that have been provided previously, since a recomputation by the inference algorithm is not possible.

Several variants of simulation-based template fits, or template-like fits, are nowadays used in high-energy physics [13, 31,32,33,34,35,36,37,38,39,40,41,42,43,44]. These make use of polynomial interpolation or regression between the reference points or apply related techniques [45,46,47]. Some algorithms use un-binned quantities, while others make use of summary statistics. Furthermore, different algorithms to find the best estimator are employed; for example, numerical methods or iterative minimization algorithms are used to find the extremum of a likelihood or \(\chi ^{2}\) function. In general, typical iterative minimization algorithms can be considered to be template fits, since every iteration generates a new template at a new reference point, and these templates are then used to find the extremum of the likelihood function.

In this article, the equations of the Linear Template Fit (LTF) are derived, which can be written in a closed matrix equation form. The Linear Template Fit provides an analytic solution for the determination of the best estimator in a least-squares problem under the assumption that the predictions are available only for a finite set of values of the parameters of interest. The equations are obtained from a two-step marginalization of the underlying statistical model, which assumes normally distributed uncertainties. The first step provides a linearized, but continuous, estimate of the prediction, and the second step provides the best estimator of the fitting problem. The Linear Template Fit is suited for a wide range of parameter estimation problems, where the input data can be cross section measurements, event counts, or other summary statistics like histograms.

This article is structured as follows. After a brief review of the method of template fits in Sect. 2, the equation of the Linear Template Fit is derived in Sect. 3 for a univariate problem. While the Linear Template Fit turns out to be a simple matrix equation, to the author’s knowledge it has not been published in its closed form before. The emphasis of this article is on variations of the Linear Template Fit, its applicability, and the propagation of uncertainties. The multivariate Linear Template Fit is discussed in Sect. 5, and the Linear Template Fit with relative uncertainties is presented in Sect. 7. A detailed discussion on error treatment and propagation is given in Sect. 8. A (detector) response matrix is inserted into the Linear Template Fit in Sect. 9. Several considerations for the applicability of the Linear Template Fit are provided in Sect. 10, as well as a simple algorithm for cross-checks and the relation to other algorithms. While the prerequisite of the Linear Template Fit is the linearity of the prediction in the parameters of interest, potential nonlinear effects are estimated and discussed in Sect. 11. There, the Quadratic Template Fit is introduced, an algorithm with fast convergence using second-degree polynomials for the parameter dependence of the model. Three toy examples are discussed in Sects. 4, 6, and 13, and are also remarked upon occasionally in between. A comprehensive and real example application of the Linear Template Fit is given in Sect. 14, where the value of the strong coupling constant \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\) is determined from inclusive jet cross section data obtained by the CMS experiment at the LHC. The best estimators are compared with previously published results, obtained with other inference algorithms. Section 15 provides a summary. Additional details are collected in the Appendix, where a table can be found of the notation adopted in this article.

2 Template fits

The objective function of a multivariate optimization problem assuming normally distributed random variables is a likelihood function calculated as the joint probability distribution of Gaussians

$$\begin{aligned} {\mathcal {L}} = \prod _{i}^n\frac{1}{\sqrt{2\pi \sigma _i^2}}\exp \left( \frac{-(d_i-\lambda _i(\varvec{\alpha }))^2}{2\sigma _i^2}\right) \,, \end{aligned}$$
(1)

where \(d_i\) are the i data values, \(\sigma ^2_i\) the variances, and \(\lambda _i\) is the model prediction, which depends on the parameters of interest \(\varvec{\alpha }\). Gaussian probability distributions are often appropriate assumptions for real data due to the central limit theorem [48, 49]. For numerical computation and optimization algorithms, it is convenient to rewrite \({\mathcal {L}}\) in terms of a least-squares equation using \(\chi ^{2}=-2\log {\mathcal {L}}\) and omitting constant terms. In matrix notation, and using a covariance matrix, the objective function becomes

$$\begin{aligned} \chi ^{2}(\varvec{\alpha })= (\varvec{d} - \varvec{\lambda }(\varvec{\alpha }))^{{\text {T}}} V^{-1} (\varvec{d} - \varvec{\lambda }(\varvec{\alpha }))\,. \end{aligned}$$
(2)

The maximum likelihood estimators (MLE) of the parameters \(\varvec{\alpha }\) are found by minimizing \(\chi ^{2}\)

$$\begin{aligned} \varvec{{\hat{\alpha }}}=\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{\alpha }}\chi ^{2}(\varvec{d};\varvec{\alpha })\,. \end{aligned}$$
(3)

For this task, iterative function optimization algorithms are commonly employed (see e.g. Refs. [4,5,6, 50]). The least-squares method is then equivalent to the maximum likelihood method in Eq. (1).

A common problem that physicists often face at this point is that the model \(\varvec{\lambda }(\varvec{\alpha })\) may be computationally intensive, time-consuming to calculate, or not available in a compatible implementation. Hence, the objective function \(\chi ^{2}\) cannot be evaluated repeatedly and for arbitrary values of \(\varvec{\alpha }\), as would be required in gradient methods. In contrast, it is often the case that the model is available for a set of j reference values of the model parameters, \(\varvec{{\dot{\alpha }}}_{(j)}\). These predictions \(\varvec{y}_{(j)}:=\varvec{\lambda }(\varvec{{\dot{\alpha }}}_{(j)})\) are denoted as templates for the given reference values \(\varvec{{\dot{\alpha }}}_{(j)}\) in the following. Consequently, a template fit exploits only previously available information and can be considered as a two-step algorithm:

  1.

    In a first step, a (continuous) representation of the model in the parameter \(\varvec{\alpha }\) needs to be derived from the templates. This can be done by interpolation, regression, or any other approximation of the true model with some reasonable function \(\varvec{{\mathbf {y}}}(\varvec{\alpha },{\varvec{{\hat{\theta }}}})\), like

    $$\begin{aligned} \varvec{\lambda }(\varvec{\alpha }) \approx {\mathbf {y}}(\varvec{\alpha },{\varvec{{\hat{\theta }}}})\,, \end{aligned}$$
    (4)

    where \(\varvec{{\mathbf {y}}}(\varvec{\alpha },{\varvec{{\hat{\theta }}}})\) is dependent on \(\varvec{\alpha }\) and several parameters \(\varvec{\theta }\). The values \(\varvec{\theta }\) are then determined from the templates, for example with a least-squares method and a suitable \(\chi ^{2}\), similar to

    $$\begin{aligned} \varvec{{\hat{\theta }}} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{\theta }}\chi ^{2}(\varvec{y}_{(0)},\ldots ,\varvec{y}_{(j)};\varvec{\theta })\,. \end{aligned}$$
    (5)

    Such a procedure can be done for each point i separately, or for all points simultaneously.

  2.

    In a second step, which is often decoupled from the first, the best estimators of the parameter of interest are determined using the approximated true model from the first step. It can be expressed as

    $$\begin{aligned} \varvec{{\hat{\alpha }}}=\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{\alpha }}\chi ^{2}(\varvec{d};\varvec{\alpha })\,, \end{aligned}$$
    (6)

    where the objective function could be a least-squares expression, such as the one of Eq. (2), but using the approximated model \({\mathbf {y}}(\varvec{\alpha },{\varvec{{\hat{\theta }}}})\) from the first step:

    $$\begin{aligned} \chi ^{2}(\varvec{\alpha })= (\varvec{d} - {\mathbf {y}}(\varvec{\alpha },{\varvec{{\hat{\theta }}}}) )^{{\text {T}}} V^{-1} (\varvec{d} - {\mathbf {y}}(\varvec{\alpha },{\varvec{{\hat{\theta }}}}) )\,. \end{aligned}$$
    (7)

Such a two-step algorithm appears to be cumbersome, and in the following the equation of the Linear Template Fit is introduced, which combines the two steps into a single equation, under certain assumptions.

3 The basic method of the Linear Template Fit

The basic methodology of the Linear Template Fit is introduced, and in order to simplify the discussion, a univariate model \(\varvec{\lambda }(\alpha )\) is considered, where the parameter of interest is denoted as \(\alpha \). It is assumed that the model is available for several values \({{\dot{\alpha }}}_{(j)}\) of the parameter of interest, and these j predictions are denoted as the templates, \(\varvec{\lambda }({\dot{\alpha }}_{(j)})\). These templates will be confronted with the vector of the data \(\varvec{d}\) in order to obtain the best estimator \({\hat{\alpha }}\).

We consider a basic optimization problem based on Gaussian probability distributions, and the objective function is written in terms of a least-squares equation

$$\begin{aligned} \chi ^{2}(\alpha )= (\varvec{d} - \varvec{\lambda }({\alpha }))^{{\text {T}}} V^{-1} (\varvec{d} - \varvec{\lambda }({\alpha }))\,. \end{aligned}$$
(8)

In a first step, the model \(\varvec{\lambda }(\alpha )\) is approximated linearly from the template distributions, and in every entry i, \(\lambda _i(\alpha )\) is approximated through a linear function \(\mathrm {y}_i(\alpha ;\theta _0,\theta _1)\) like

$$\begin{aligned} \lambda _i(\alpha ) \approx \mathrm {y}_i(\alpha ;\varvec{{\hat{\theta }}}) ~~~~\text {using}~~~~ \mathrm {y}_i(\alpha ;\varvec{\theta }_{(i)}) := \theta ^{(i)}_0 + \theta ^{(i)}_1 \alpha \,. \end{aligned}$$
(9)

The best estimators of the function parameters, \(\varvec{{\hat{\theta }}}\), are obtained from the templates by linear regression (“straight line fit,” see e.g. Refs. [51,52,53]), whose formalism also follows the statistical concepts that were briefly outlined above. Each of the templates is representative of a certain value of \(\alpha \), which will be denoted in the following as reference points \({\dot{\alpha }}_j\), and all reference values form the j vector \(\varvec{{\dot{\alpha }}}\). Next, an extended Vandermonde design matrix [54], the regression matrix, is constructed from a unit column vector and \(\dot{\varvec{\alpha }}\):

$$\begin{aligned} M := \begin{pmatrix}\varvec{1}&\varvec{{\dot{\alpha }}} \end{pmatrix} = \left( {\begin{matrix} 1 &{} {\dot{\alpha }}_1 \\ \vdots &{} \vdots \\ 1 &{} {\dot{\alpha }}_j \end{matrix}}\right) \,. \end{aligned}$$
(10)

From the method of least squares, the best estimators of the polynomial parameters for the ith entry are [55]

$$\begin{aligned} {\varvec{{\hat{\theta }}}}_{(i)} = \begin{pmatrix}{\hat{\theta }}_0\\ {\hat{\theta }}_1\end{pmatrix}_{(i)} = M^+_{(i)}\begin{pmatrix} y_{(1),i} \\ \vdots \\ y_{(j),i} \end{pmatrix} ~~~\text {using}~~~ M^+_{(i)} = (M^{{\text {T}}}{\mathcal {W}}_{(i)}M)^{-1}M^{{\text {T}}}{\mathcal {W}}_{(i)}\,, \end{aligned}$$
(11)

where the matrix \(M^+_{(i)}\) is a g-inverse of a least-squares problem [54, 56], \(y_{(j),i}\) denotes the ith entry of the jth template vector, and \({\mathcal {W}}_{(i)}\) is an inverse covariance matrix which represents uncertainties in the templates, e.g. \({\mathcal {W}}_{(i),j,j}=\sigma _{y_{(j),i}}^{-2}\). However, for the purpose of the Linear Template Fit, an important simplification is obtained since the special case of an unweighted linear regression is applicable to a good approximation. The (inverse) covariance matrix \({\mathcal {W}}_{(i)}\) in this problem has two features: it is a diagonal matrix, since the templates are generated independently, and secondly, it has, to a very good approximation, equally sized diagonal elements, \({\mathcal {W}}_{(i),j,j}\approx {\mathcal {W}}_{(i),j+1,j+1}\). This simplification is commonly well applicable, for example, if the model is obtained from an exact calculation or if Monte Carlo event generators were employed and the templates all have a similar statistical precision. Consequently, the size of the uncertainties in the templates \(\sigma _{y_{(j),i}}\) factorizes from \({\mathcal {W}}_{(i)}\) and subsequently cancels in the calculation of \(M^+_{(i)}\), Eq. (11). Thus, an unweighted polynomial regression is applicable, and the g-inverse simplifies to a left-side Moore–Penrose pseudoinverse matrix [57,58,59],

$$\begin{aligned} M^+_{(i)}\simeq M^+= (M^{{\text {T}}}M)^{-1}M^{{\text {T}}}\,, \end{aligned}$$
(12)

with M as defined in Eq. (10). It is observed that the matrix \(M^+\) is universal: it is independent of the quantities \(y_{(j),i}\) and \({\mathcal {W}}_{(i)}\) and is calculated only from the reference points \(\varvec{{\dot{\alpha }}}\). Consequently, it is equally applicable in every bin i.

Henceforth, we decompose the matrix \(M^+\) (Eq. (12)), which is a \(2\times j\) matrix, into two row vectors, like

$$\begin{aligned} M^+ =: \begin{pmatrix} \varvec{{\bar{m}}}^{{\text {T}}}\\ \varvec{{\tilde{m}}}^{{\text {T}}} \\ \end{pmatrix} \,,~~~~~\text {so}~~~~~~ \begin{array}{l} \varvec{{\bar{m}}} = (M^+)^{{\text {T}}}\left( {\begin{matrix}1\\ 0\end{matrix}}\right) \\ \varvec{{\tilde{m}}} = (M^+)^{{\text {T}}}\left( {\begin{matrix}0\\ 1\end{matrix}}\right) \end{array}\,. \end{aligned}$$
(13)

This introduces the two j-dimensional column vectors \(\varvec{{\bar{m}}}\) and \(\varvec{{\tilde{m}}}\).

We further introduce the template matrix Y, an \(i\times j\) matrix, which is constructed from the column vectors of the template distributions and can be written as

$$\begin{aligned} Y := \begin{pmatrix} Y_{1,1} &{} \ldots &{} Y_{1,j} \\ \vdots &{} \ddots &{} \vdots \\ Y_{i,1} &{} \ldots &{} Y_{i,j} \\ \end{pmatrix} = \begin{pmatrix} \varvec{y}_{({\dot{\alpha }}_1)}&\ldots&\varvec{y}_{({\dot{\alpha }}_j)} \end{pmatrix} = \begin{pmatrix} y_{(1),1} &{} \ldots &{} y_{(j),1} \\ \vdots &{} \ddots &{} \vdots \\ y_{(1),i} &{} \ldots &{} y_{(j),i} \end{pmatrix}\,. \end{aligned}$$
(14)

Hence, substituting the best estimators \(\varvec{{\hat{\theta }}}\) (Eq. (11)) into the linearized model \({\mathbf {y}}(\alpha ,\varvec{{\hat{\theta }}})\) (Eq. (9)), and using \(\varvec{{\bar{m}}}\) and \(\varvec{{\tilde{m}}}\) (Eq. (13)) and Y (Eq. (14)), the model can be expressed as a matrix equation:

$$\begin{aligned} \varvec{\lambda }(\alpha )\approx {\mathbf {y}}(\alpha ,\varvec{{\hat{\theta }}}) = Y\varvec{{\bar{m}}} + Y\varvec{{\tilde{m}}}\alpha \,. \end{aligned}$$
(15)

It is seen that the polynomial parameters \(\varvec{\theta }\) are no longer explicit in this expression. Consequently, the objective function (Eq. (8)) becomes a linear least-squares expression

$$\begin{aligned} \chi ^{2}(\alpha ) = (\varvec{d}- Y\varvec{{\bar{m}}}- Y\varvec{{\tilde{m}}}\alpha )^{{\text {T}}} W (\varvec{d}- Y\varvec{{\bar{m}}}- Y\varvec{{\tilde{m}}}\alpha )\,, \end{aligned}$$
(16)

where W is the inverse covariance matrix, \(W=V^{-1}\).

In the second step of the derivation of the Linear Template Fit, the best estimator \({\hat{\alpha }}\) is determined. Since \({\mathbf {y}}\) is linear in \(\alpha \), the \(\chi ^{2}\) function is a quadratic form in \(\alpha \), and the best-fit value is defined by the stationary point of \(\chi ^{2}\) [49, 55, 60]; the best linear unbiased estimator of the parameter of interest is

$$\begin{aligned} {\hat{\alpha }} = \frac{(Y\varvec{{\tilde{m}}})^{{\text {T}}}W }{(Y\varvec{{\tilde{m}}})^{{\text {T}}}WY\varvec{{\tilde{m}}}} (\varvec{d}-Y\varvec{{\bar{m}}})\,. \end{aligned}$$
(17)

When introducing the g-inverse of least squares, a \(1\times i\) matrix,

$$\begin{aligned} F := \frac{(Y\varvec{{\tilde{m}}})^{{\text {T}}}W}{ (Y\varvec{{\tilde{m}}})^{{\text {T}}}WY\varvec{{\tilde{m}}}}\,, \end{aligned}$$
(18)

the master formula of the Linear Template Fit is found to be

$$\begin{aligned} {\hat{\alpha }}=F ( \varvec{d}- Y\varvec{{\bar{m}}})\,, \end{aligned}$$
(19)

where Eqs. (13), (14), and (18) were used, and \(\varvec{d}\) denotes the vector of the data.

Since the data are assumed to be normally distributed, it follows from the linearity of the least-squares problem that the estimators are also normally distributed. From error propagation, the variance of the best estimator is

$$\begin{aligned} \sigma _{{\hat{\alpha }}}^2= FVF^{{\text {T}}} = \left( (Y\varvec{{\tilde{m}}})^{{\text {T}}}WY\varvec{{\tilde{m}}}\right) ^{-1}\,. \end{aligned}$$
(20)

Given that the approximation in Eq. (15) holds, the estimator \({\hat{\alpha }}\) is unbiased and, according to the Gauß–Markov theorem [49, 55], its variance is the smallest among all linear unbiased estimators. Since \(\varvec{d}\) does not enter the calculation of the variance, Eq. (20), the uncertainties are equivalent to the expected uncertainties from the Asimov [61] data \(\varvec{\lambda }({\hat{\alpha }})\).

When using the right-hand expressions of Eq. (13), it can be directly seen that the best estimator is indeed obtained from a single matrix equation using only the template quantities (Y, \(\varvec{{\dot{\alpha }}}\)) and the data and its uncertainties (\(\varvec{d}\), W), and the Linear Template Fit can alternatively be written as

$$\begin{aligned} {\hat{\alpha }} = \left( \left( \varUpsilon \left( {\begin{matrix}0\\ 1\end{matrix}}\right) \right) ^{{\text {T}}} W \varUpsilon \left( {\begin{matrix}0\\ 1\end{matrix}}\right) \right) ^{-1} \left( \varUpsilon \left( {\begin{matrix}0\\ 1\end{matrix}}\right) \right) ^{{\text {T}}} W \left( \varvec{d}- \varUpsilon \left( {\begin{matrix}1\\ 0\end{matrix}}\right) \right) ~~~\text {using} \end{aligned}$$
(21)
$$\begin{aligned} \varUpsilon&=Y(M^+)^{{\text {T}}}=Y \begin{pmatrix}\varvec{1}&\varvec{{\dot{\alpha }}}\end{pmatrix} \left( \begin{pmatrix}\varvec{1}&\varvec{{\dot{\alpha }}}\end{pmatrix} \begin{pmatrix}\varvec{1}&\varvec{{\dot{\alpha }}}\end{pmatrix}^{{\text {T}}} \right) ^{-1} \,. \end{aligned}$$
(22)

In contrast, when the (commonly well-justified) approximation of the unweighted regression in each bin is not applicable [see left side of Eq. (12)], the bin-dependent regression matrices \(M_{(i)}^+\) remain explicit. Such a case may arise when a Monte Carlo event generator is used for the generation of the templates and the statistical uncertainties have to be considered, for instance because one sample was generated with higher statistical precision than the others, since it also serves further purposes like unfolding. When using such a weighted regression, the best estimator may still be expressed through Eq. (17), but the elements of the two vectors \(Y\varvec{{\bar{m}}}\) and \(Y\varvec{{\tilde{m}}}\) need to be calculated separately and become

$$\begin{aligned} (Y\varvec{{\bar{m}}})_i = \left( {\begin{matrix}1\\ 0\end{matrix}}\right) ^{{\text {T}}} M^+_{(i)}\varvec{x}_{(i)} ~~~~\text {and}~~~~ (Y\varvec{{\tilde{m}}})_i = \left( {\begin{matrix}0\\ 1\end{matrix}}\right) ^{{\text {T}}} M^+_{(i)}\varvec{x}_{(i)} \,, \end{aligned}$$
(23)

where the vectors \(\varvec{x}_{(i)}\) denote the ith row of the template matrix, \(\varvec{x}_{(i)}=Y_i^{{\text {T}}}\), and the matrices \(M^+_{(i)}\) may include uncertainties in the templates \(\varvec{x}_{(i)}\) through \({\mathcal {W}}_{(i)}\) (see Eq. (11), right). This case, however, will not be considered any further in this article.

In the following sections, an example application of the Linear Template Fit is presented, and subsequently, generalizations of the Linear Template Fit are discussed, and formulae for error propagation are presented.

4 Example 1: the univariate Linear Template Fit

An example application is constructed from a random number generator within the ROOT analysis framework [62]. The physics model is a normal distribution with a standard deviation of 6.0. The mean value, denoted as \(\alpha \), is subject to the inference. In analogy to particle physics, a counting experiment is simulated, and the (pseudo-)data are generated from 500 events at a true mean value of \(\alpha =170.2\). Some measurement distortions are simulated by using a standard deviation of 6.2 for the pseudo-data. Limited acceptance of the experimental setup is simulated by considering only one side of the normal distribution, restricting the measurement to values larger than 169, and altogether 14 “bins” with unity width are “measured.” Let us assume that from a previous measurement it is known that the physics parameter of interest has a value of about \(\alpha \approx 170.5\pm 1.0\). Therefore, seven templates at values from \({\dot{\alpha }}=169.0\) to 172.0 in steps of 0.5 are generated, each using 40,000 random numbers. From Eq. (20) it is then found that the pseudo-data will be able to determine the value of \(\alpha \) with an uncertainty of approximately \(\pm 0.5\).

Fig. 1 Two views of the template matrix Y of an example application of the Linear Template Fit. Left: the template distributions Y as a function of the observable. The full circles display a pseudo-data set with statistical uncertainties, and the dotted line indicates the estimated model after the fit, \({{\mathbf {y}}}(\hat{\alpha })\). Right: the template distributions Y as a function of the reference values of the parameter of interest \(\alpha \). Only nine selected “bins” i are displayed. The red line shows the linearized model \(\hat{\mathrm {y}}_i(\alpha )\), Eq. (15), in the respective bin, and the full circles are again the pseudo-data. The dotted line indicates a weighted linear regression, which is not used in the Linear Template Fit

The Linear Template Fit using these seven templates is illustrated in Fig. 1. The Linear Template Fit from Eq. (19) reports a value of \({\hat{\alpha }}=170.03 \pm 0.41\), which is fully consistent within the statistical uncertainty with the generated true value of 170.2, although a somewhat different standard deviation was used. In Fig. 1 (right), the result from a hypothesized weighted linear regression is shown (cf. Eq. (23)), but it is almost indistinguishable from the unweighted linear regression as used in the Linear Template Fit. This is because the generation of the individual templates always employs the same methodology, as argued above. Furthermore, the estimated best model is displayed in Fig. 1 (left), which is defined from Eq. (15) by substituting the best estimator, \({{\hat{{\mathbf {y}}}}}={{\mathbf {y}}}(\hat{\alpha })\). This results in \(\chi ^{2}=5.6\) for 14 data points and one free parameter. Other seeds for the random number generator of course result in different values. It should be noted that this example is purposefully constructed with large (statistical) uncertainties in order to obtain a visually clearer presentation in the figures, although in some bins the assumption of normally distributed random variables is then a rather poor approximation of the Poisson distribution.

5 The multivariate Linear Template Fit

Phenomenological models often depend on multiple parameters, and thus it is a common task to determine multiple parameters at a time. Such k parameters of interest are referred to as \(\alpha _1\), ..., \(\alpha _k\), or simply \(\varvec{\alpha }\), and the best estimators are denoted as \(\hat{\alpha }_1\), ..., \(\hat{\alpha }_k\), or \(\varvec{{\hat{\alpha }}}\). The linear representation of the model is a hyperplane \(\mathrm{y}_i(\varvec{\alpha })\) in each of the i data points, defined as

$$\begin{aligned} \varvec{\lambda }(\varvec{\alpha }) \approx {\mathbf {y}}(\varvec{\alpha };{\varvec{{\hat{\theta }}}}) = {\hat{\theta }}_0 + {\hat{\theta }}_{(1,1)} \alpha _1 + \dots + {\hat{\theta }}_{(1,k)} \alpha _k\,, \end{aligned}$$
(24)

where a constant \(\theta _0\) and the first-degree parameters \(\theta _{(1,k)}\) are considered. Since higher-degree terms or interference terms are not included, the fit parameters need to be (sufficiently) independent, or they must have been made orthogonal by applying a variable transformation beforehand.

In the multivariate case, each template \(\varvec{y}_{(j)}\) is representative of a reference point \(\varvec{{\dot{\alpha }}}_{(j)}\) in the k-dimensional space. The regressor matrix M is constructed as a \(j\times (1+k)\) design matrix:

$$\begin{aligned} M := \left( \begin{array}{cccc} 1 &{} {\dot{\alpha }}_{(1),1} &{} \dots &{} {\dot{\alpha }}_{(1),k} \\ \vdots &{} \vdots &{} \vdots &{} \vdots \\ 1 &{} {\dot{\alpha }}_{(j),1} &{} \dots &{} {\dot{\alpha }}_{(j),k} \\ \end{array}\right) \,. \end{aligned}$$
(25)

As in the univariate case, the pseudoinverse \(M^+\) is calculated from Eq. (12), and also the same considerations for the justification of the unweighted multiple regression are applicable. Instead of a single vector \(\varvec{{\tilde{m}}}\), there is now a vector for each regression parameter \(\theta _k\). Therefore, the \(j\times k\) matrix \({\tilde{M}}\) is introduced, which is defined by decomposing \(M^+\) like

$$\begin{aligned} M^+ =: \begin{pmatrix} \varvec{{\bar{m}}}^{{\text {T}}}\\ {\tilde{M}}^{{\text {T}}} \\ \end{pmatrix}\,. \end{aligned}$$
(26)

Hence, the best estimator for the linearized multivariate model \({\mathbf {y}}(\varvec{\alpha };{\varvec{\theta }})\) becomes

$$\begin{aligned} \varvec{\lambda }(\varvec{\alpha }) \approx {\hat{{\mathbf {y}}}}(\varvec{\alpha }) = Y\varvec{{\bar{m}}}+ Y{\tilde{M}}\varvec{\alpha }\,, \end{aligned}$$
(27)

where the \(\theta \) parameters are again no longer explicit.

When using the linear approximation of the model, \(\varvec{\lambda }(\varvec{\alpha })\approx {\hat{{\mathbf {y}}}}(\varvec{\alpha })\), the \(\chi ^{2}\) function (Eq. (2)) for the multivariate Linear Template Fit becomes

$$\begin{aligned} \chi ^{2}&\simeq \left( \varvec{d}-{\hat{{\mathbf {y}}}}(\varvec{\alpha })\right) ^{{\text {T}}} W \left( \varvec{d}- {\hat{{\mathbf {y}}}}(\varvec{\alpha })\right) \end{aligned}$$
(28)
$$\begin{aligned}&= \left( \varvec{d}- Y\varvec{{\bar{m}}}- Y{\tilde{M}}\varvec{\alpha }\right) ^{{\text {T}}} W \left( \varvec{d}- Y\varvec{{\bar{m}}}- Y{\tilde{M}}\varvec{\alpha }\right) \end{aligned}$$
(29)
$$\begin{aligned}&= \left( \varvec{d}- {\textstyle \sum }_l{\hat{\epsilon }}_l \varvec{s}_{(l)} - {\hat{{\mathbf {y}}}}(\varvec{\alpha })\right) ^{{\text {T}}} {\mathbb {V}}^{-1} \left( \varvec{d}- {\textstyle \sum }_l{\hat{\epsilon }}_l \varvec{s}_{(l)} - {\hat{{\mathbf {y}}}}(\varvec{\alpha })\right) + {\textstyle \sum }_l {\hat{\epsilon }}_l^2\,. \end{aligned}$$
(30)

The last equation introduces an equivalent expression in terms of nuisance parameters [63]. The factors \({\hat{\epsilon }}_l\) are related to uncertainties with full bin-to-bin correlations when writing the covariance matrix as

$$\begin{aligned} V ={\mathbb {V}}+V_\text {corr} ~~~\text { using}~~~ V_\text {corr}=\textstyle \sum _l \varvec{s}_{(l)}^{ }\varvec{s}^{{\text {T}}}_{(l)}\,, \end{aligned}$$
(31)

where the index l runs over all uncertainty sources with full bin-to-bin correlations, and the vectors \(\varvec{s}_{(l)}\) denote the individual systematic uncertainties (also called shifts), while the matrix \({\mathbb {V}}\) includes all other uncertainty components. It is common practice that the systematic shifts are calculated from relative uncertainties and multiplied with the measured data. Implications of this practice are discussed in Sects. 7 and 8.7, and care must be taken that the result does not become biased [64,65,66,67,68,69]; a common technique to avoid this bias is discussed in Sect. 8.7.

Equation (30) is again a linear least-squares expression, and the best linear unbiased estimators for the parameters \(\varvec{{\hat{\alpha }}}\) and the nuisance parameters \(\varvec{{\hat{\epsilon }}}\) are obtained from the stationary point. Hence, the best estimators from the multivariate Linear Template Fit become

$$\begin{aligned} \varvec{{\hat{a}}}= \begin{pmatrix}\varvec{{\hat{\alpha }}}\\ \varvec{{\hat{\epsilon }}}\end{pmatrix} = {\mathcal {F}} ( \varvec{d}-Y\varvec{{\bar{m}}})\,, \end{aligned}$$
(32)

where \(\varvec{{\hat{a}}}\) was introduced, and the shorthand notations \({\mathcal {F}}\) (a g-inverse of the least-squares problem) and \({\mathcal {D}}\) are defined as

$$\begin{aligned} {\mathcal {F}} := {\mathcal {D}}^{-1} \begin{pmatrix}Y{\tilde{M}}&~S\end{pmatrix}^{{\text {T}}} {\mathbb {W}} ~~~\text {and}~~~ {\mathcal {D}} := \begin{pmatrix}Y{\tilde{M}}&~S\end{pmatrix}^{{\text {T}}}{\mathbb {W}}\begin{pmatrix}Y{\tilde{M}}&~S\end{pmatrix} + \begin{pmatrix} {0}_k &{} 0 \\ 0 &{} {1}_l\end{pmatrix}\,, \end{aligned}$$
(33)

using \({\mathbb {W}}={\mathbb {V}}^{-1}\). The matrix S of the fully bin-to-bin correlated uncertainties is composed of the column vectors of the systematic shifts \(\varvec{s}_{(l)}\) as

$$\begin{aligned} S := \begin{pmatrix}\varvec{s}_{(1)}&\ldots&\varvec{s}_{(l)} \end{pmatrix}\,, \end{aligned}$$
(34)

and the symbol \(\begin{pmatrix}Y{\tilde{M}}&~S\end{pmatrix}\) denotes an \(i\times (k+l)\) matrix composed of \(Y{\tilde{M}}\) and S, while \({0}_{k}\) and \({1}_{l}\) denote a \(k\times k\) zero matrix and an \(l\times l\) unit matrix, respectively. The matrix \({\mathcal {D}}\) is commonly a full-rank symmetric matrix and thus invertible. A brief discussion of a more efficient numerical calculation of \({\mathcal {D}}^{-1}\) is given in Appendix D. If all uncertainty components are represented through a single covariance matrix V, cf. left side of Eq. (31), the multivariate Linear Template Fit simplifies to

$$\begin{aligned} \varvec{{\hat{\alpha }}} = \left( (Y{\tilde{M}})^{{\text {T}}}WY{\tilde{M}}\right) ^{-1}(Y{\tilde{M}})^{{\text {T}}} W ( \varvec{d}-Y\varvec{{\bar{m}}})\,. \end{aligned}$$
(35)

6 Example 2: the multivariate Linear Template Fit

In Example 1, only the mean value of the model (which is a normal distribution) was determined, although the pseudo-data had a slightly different standard deviation than the model. In the following example, a multivariate Linear Template Fit is performed and the same pseudo-data as in Example 1 are used, but both parameters of the model are determined: the mean value and the standard deviation of the Gaussian. Therefore, templates are generated for some selected values of the mean (between 169.5 and 171) and the standard deviation (between 5.8 and 6.4) of a Gaussian, and again, 40,000 events are used for each template. The four “extreme” variations of the two parameters are omitted. The multivariate Linear Template Fit using all these templates is illustrated in Fig. 2.

Fig. 2 Two views of the template matrix Y of an example application of the multivariate Linear Template Fit. Left: the template distributions Y as a function of the observable. More details are given in the caption of Fig. 1. Right: the template distributions Y (open medium-sized circles) as a function of the two reference values for four selected “bins” i. The plane displays the linearized model \(\hat{\mathrm {y}}_i(\varvec{\alpha })\), Eq. (27), in the respective bin. The large open circles are again the pseudo-data, and small circles indicate the projection onto the plane \(\hat{\mathrm {y}}\)

The best estimators from the multivariate Linear Template Fit (Eq. (35)) are \(169.91\pm 0.46\) for the mean and \(6.28\pm 0.38\) for the standard deviation. Their correlation coefficient is found to be \(-0.3\). Within the statistical uncertainties, both values are in very good agreement with the simulated input values of 170.2 and 6.2, respectively. Thus, this example illustrates the application of the multivariate Linear Template Fit with two free parameters, where a visual representation of the linearized model is still possible. However, any number of free parameters is in principle possible, and for an n-parameter fit, the minimum number of linearly independent templates is just \(n+1\).

7 The Linear Template Fit with relative uncertainties

In this section we present the equations for the Linear Template Fit when the estimators obey a log-normal distribution. This could be the case for data when the determination of a variable is affected by a number of multiplicative factors that are subject to uncertainties. Consequently, the variable follows a log-normal distribution due to the central limit theorem [48, 49]. Also, when the value of an observed variable is a random proportion of the previous observation, it follows a log-normal distribution [49]. An example would be the measurement of the electron energy in a calorimeter, which is affected by a number of fractional energy losses and corrections [10]. Another example would be measurements that are dominated by systematic multiplicative uncertainties: a prominent multiplicative error is due to uncertainties of the luminosity measurement in particle collider experiments, which result in a relative uncertainty. Contrary to additive errors, multiplicative errors cannot change the sign of the variable, and a positively defined observable always remains positive, which is an important prerequisite for several physical quantities such as cross sections. A comparison of the log-normal, the normal, and the Poisson probability distribution functions for two selected values of their mean is displayed in Fig. 3.

The likelihood of a set of i measurements is a joint function of log-normal probability distributions

$$\begin{aligned} {\mathcal {L}} = \prod _{i}^n\frac{1}{\sqrt{2\pi \varsigma _i^2d_i^2}}\exp \left( \frac{-(\log {d_i}-\mu _i)^2}{2\varsigma _i^2}\right) \,. \end{aligned}$$
(36)

When writing \(\mu _i=\log m_i\) [70, 71], it can be directly seen from the residuals that the ratios \(d_i/m_i\) enter the likelihood calculation instead of their differences. The variance \(\varsigma \) denotes a relative quantity; thus it represents relative uncertainties. In fact, the log-normal distribution is equivalent to considering random variables with normally distributed relative uncertainties. For small relative errors, \(\varsigma _i\lesssim 20\,\%\), which is a common case in particle physics, the log-normal, the normal, and the Poisson distributions become similar. For larger relative errors, the log-normal distribution provides some approximation of the Poisson distribution, and for practical purposes one benefits from its strict positiveness, in contrast to the often-used normal distribution.

Similarly to the case for normally distributed data, we choose as a first-order approximation of the model in every bin

$$\begin{aligned} \mu _i(\varvec{\alpha }) \approx \theta ^{(i)}_0 + \theta ^{(i)}_{(1,1)} \alpha _1 +\cdots + \theta ^{(i)}_{(1,k)} \alpha _k\,, \end{aligned}$$
(37)

and the best estimators for the \(\theta \) parameters are again determined from the templates; consequently, the best estimator of the approximated model is

$$\begin{aligned} \varvec{\mu }(\varvec{\alpha }) \approx {\hat{\varvec{\mu }}}(\varvec{\alpha }) = \log (Y) \varvec{{\bar{m}}}+ \log (Y) {\tilde{M}}\varvec{\alpha }\,. \end{aligned}$$
(38)

The term \(\log (Y)\) denotes a coefficient-wise application of the logarithm to the elements of the template matrix Y, and it is used to match the logarithmized data (\(\log {d_i}\)) in the likelihood. Equation (38) is an equally valid first-order approximation of the model as compared with the default linear approximation in Eq. (24).

Fig. 3 The log-normal, normal, and Poisson probability distribution functions with mean values of 10 (left) and 100 (right) on a linear and logarithmic axis. These values correspond to relative uncertainties of 32 % and 10 %, respectively. For larger uncertainties, the log-normal distribution provides a good approximation of the Poisson distribution around the peak. The log-normal and the Poisson distributions are strictly positive. For large N, all distributions become similar

As an illustration, the linearized model from Eq. (38) is displayed in Fig. 4 for some selected bins of Example 1 (Sect. 4) and compared with the approximation of Eq. (27). The two are reasonably similar in this example, since the templates are present in the vicinity of the best estimator and in a range where a linear approximation is valid. Although the parameter dependence in Example 1 is in fact truly linear, given the generated statistical precision of the templates, the templates cannot discriminate between these two best model estimators.

When using log-normally distributed variables, which corresponds to relative uncertainties, the objective function \(\chi ^{2}\) of the Linear Template Fit becomes

$$\begin{aligned} \chi ^{2} = \left( \varvec{\log d} -{\textstyle \sum }_l\epsilon _l {\mathbf {s}}_{(l)} - {\hat{\varvec{\mu }}}(\varvec{\alpha }) \right) ^{{\text {T}}} {\mathbb {W}} \left( \varvec{\log d} -{\textstyle \sum }_l\epsilon _l {\mathbf {s}}_{(l)} - {\hat{\varvec{\mu }}}(\varvec{\alpha }) \right) + {\textstyle \sum }_l\epsilon _l^2\,. \end{aligned}$$
(39)

The covariance matrix \({\mathbb {V}}={\mathbb {W}}^{-1}\) comprises normally distributed relative uncertainties, and the uncertainties with bin-to-bin correlations \({\mathbf {s}}\) also enter the calculation through their relative values. The notation \(\varvec{\log d}\) denotes a coefficient-wise application of the logarithm. A common relation between relative and absolute uncertainties would be

$$\begin{aligned} \mathrm {s}_{(l),i} \simeq \frac{s_{(l),i}}{d_{i}}\,, \end{aligned}$$
(40)

and although this equation does not strictly translate between normal and log-normal uncertainties, it is often used in practical applications. Note that we use a Roman font s for relative uncertainties, and italic s for absolute uncertainties.

This \(\chi ^{2}\) function is again a linear least-squares problem, and the best estimators for \(\varvec{\alpha }\) and \(\varvec{\epsilon }\) in the Linear Template Fit are

$$\begin{aligned} \varvec{{\hat{a}}}:= \left( \begin{array}{l}\varvec{{\hat{\alpha }}}\\ {\hat{\varvec{\epsilon }}}\\ \end{array}\right) = {\mathcal {F}} \big ( \varvec{\log d} - \log (Y)\varvec{{\bar{m}}}\big )\,, \end{aligned}$$
(41)

where the matrix \({\mathcal {F}}\) is defined similarly to Eq. (33), but from relative uncertainties in \({\mathbb {V}}\) and \(\mathrm {s}\) (S), and employing the substitution \(Y{\tilde{M}}\rightarrow \log (Y){\tilde{M}}\). Due to the treatment of systematic uncertainties as relative uncertainties, the Linear Template Fit with log-normal probability distributions may be preferred in practical applications over the Linear Template Fit with absolute uncertainties.

Fig. 4 The best estimate of the parameter dependence of the model, Eq. (38), for four selected bins of Example 1, when performing a Linear Template Fit with log-normally distributed estimators. Note that the y-axis indicates the logarithmized values of the templates. The pseudo-data are displayed for comparison at the best estimator \({\hat{\alpha }}\) of the Linear Template Fit. For comparison, the linearized model from Eq. (15) is also displayed, which is otherwise not used when log-normally distributed estimators are considered


8 Errors and error propagation

In many applications and fields, like particle physics, it is of great importance to provide detailed uncertainties together with the fit parameters and to propagate uncertainties from all relevant input quantities. Due to its analytic and closed form, the Linear Template Fit provides unique opportunities for detailed error propagation and analysis that are otherwise not easily possible in other inference algorithms. For example, every single uncertainty component can be propagated separately and analytically to the fit results. In the following, the equations for error propagation are discussed, based on the equation of the Linear Template Fit in Eq. (32).

8.1 Statistical uncertainties and uncertainties without bin-to-bin correlations

Statistical uncertainties, or more generally aleatoric uncertainties, in the data are represented by the covariance matrix \({\mathbb {V}}\). The matrix \({\mathbb {V}}\) may include further uncertainty components with or without bin-to-bin correlations, like systematic or epistemic uncertainties. From linear error propagation [60], the covariance matrix is propagated to the best estimators \(\varvec{{\hat{a}}}\) as

$$\begin{aligned} \hat{{\mathbb {V}}}_{\varvec{{\hat{a}}}} = {\mathcal {F}} {\mathbb {V}} {\mathcal {F}}^{{\text {T}}}\,, \end{aligned}$$
(42)

using \({\mathcal {F}}\) as defined in Eq. (33). Due to the Gauß–Markov theorem, the estimators \(\varvec{{\hat{a}}}\) for the given objective function \(\chi ^{2}\), Eqs. (30) and (39), have the least variance among all linear unbiased estimators.

If \({\mathbb {V}}\) is a sum of multiple individual matrices, like \({\mathbb {V}} = {\mathbb {V}}_\text {stat.} +\sum {\mathbb {V}}_\text {uncorr.} + \sum {\mathbb {V}}_\text {corr.}+\cdots \), each covariance matrix can be propagated separately using Eq. (42). By doing so, the contribution from each uncertainty component to the full uncertainty is determined separately.

8.2 Systematic uncertainties or uncertainties with bin-to-bin correlations

Systematic uncertainties associated with the input data often have full bin-to-bin correlations and are represented either as systematic shifts, \(\varvec{s}_{(l)}\) (Eq. (34)), or as covariance matrices, \({\mathbb {V}}_l^\text {corr} = \varvec{s}_{(l)}^{ }\varvec{s}_{(l)}^{{\text {T}}}\). Using linear error propagation, these are propagated to the best estimators \(\varvec{{\hat{a}}}\) as

$$\begin{aligned} \varvec{\sigma }_{\varvec{{\hat{a}}}}^{(\varvec{s}_{(l)})}= {\mathcal {F}}\varvec{s}_{(l)}\,. \end{aligned}$$
(43)

Alternatively, Eq. (42) can be employed. For the Linear Template Fit with relative uncertainties, one literally has the same equation but using relative values \({\mathbf {s}}_{(l)}\) (compare Eq. (40)).


When systematic uncertainties are included in the Linear Template Fit, their corresponding nuisance parameters \(\epsilon _l\) are constrained by the data, and the obtained reduction of the systematic uncertainty may be used for further analyses. In applications where the fitted parameters are dominated by systematic uncertainties in the input data, it is good practice to constrain the associated systematic shifts by adding additional measurements to the fitting problem [31, 35, 72, 73]. This provides additional constraints on the related parameters \(\epsilon _l\), and smaller experimental uncertainties can be obtained. The uncertainties in the nuisance parameters then become smaller than unity.

8.3 External uncertainties

The generalized transfer matrix \({\mathcal {F}}\) provides the straightforward opportunity for linear error propagation; see Eqs. (42) or (43). While these equations are predominantly employed to propagate the uncertainties that are included in the Linear Template Fit, equivalent equations are also applicable to propagate further uncertainties that are not used in the fit, but which would still need to be propagated to the fit result. For example, these may be uncertainties in the underlying theory prediction which cannot be constrained by the data. An actual case for such an uncertainty is given below for the non-perturbative correction factors (cf. Fig. 10).

8.4 Unconstrained uncertainties with full bin-to-bin correlations

A special case of an uncertainty with full bin-to-bin correlations may be considered through an unconstrained systematic uncertainty. These are defined by including the respective systematic shift \(s_{(l)}\) in the Linear Template Fit, but in contrast to common constrained systematic uncertainties, they have a zero value (instead of unity) in the respective diagonal element of the \(l\times l\) matrix \({1}_l\) in Eq. (33). As a consequence, the Linear Template Fit can exploit the degree of freedom of that systematic shift, but the data do not provide any constraint on that uncertainty source. A possible application would be a fit to data whose normalization is unknown, so that the normalization is treated as a free, unconstrained systematic uncertainty.

8.5 Uncertainties in the templates

The template distributions Y may be associated with uncertainties from various sources. These can be statistical uncertainties in the elements \(Y_{ij}\) or systematic uncertainties with correlations among all entries of Y. One may think of Y being determined from two distinct models, or of templates that were generated with different parameters, and the resulting difference in Y may be considered as a systematic uncertainty.

The uncertainties in \(\varvec{{\hat{a}}}\) due to uncertainties in Y are obtained from linear error propagation. The uncertainty in a single element of Y is denoted as \(\sigma _{Y_{ij}}\), and using the shorthand notation \({\mathcal {D}}\) from Eq. (33), the partial derivative of Eq. (32) is

$$\begin{aligned} \frac{\partial \varvec{{\hat{a}}}}{\partial Y_{ij}}&= {\mathcal {D}}^{-1} \Bigg [ \begin{pmatrix}{\mathbb {1}}_{(ij)}{\tilde{M}}&{0}_{il} \end{pmatrix}^{{\text {T}}} {\mathbb {W}}( \varvec{d}- Y \varvec{{\bar{m}}})\nonumber \\&\quad - \begin{pmatrix}Y{\tilde{M}}&S \end{pmatrix}^{{\text {T}}} {\mathbb {W}}{\mathbb {1}}_{(ij)} \varvec{{\bar{m}}}\nonumber \\&\quad - \left[ \begin{pmatrix} Y {\tilde{M}}&S \end{pmatrix}^{{\text {T}}} {\mathbb {W}}\begin{pmatrix}{\mathbb {1}}_{(ij)}{\tilde{M}}&{0}_{il}\end{pmatrix}\right. \nonumber \\&\left. \quad + \begin{pmatrix}{\mathbb {1}}_{(ij)}{\tilde{M}}&{0}_{il} \end{pmatrix}^{{\text {T}}} {\mathbb {W}}\begin{pmatrix} Y {\tilde{M}}&S \end{pmatrix} \right] \varvec{{\hat{a}}}\Bigg ]\,, \end{aligned}$$
(44)

where \({\mathbb {1}}_{(ij)}\) denotes an \(i \times j\) zero matrix with only the element (ij) being unity, and \({0}_{il}\) denotes an \(i\times l\) zero matrix. It should be noted that the usage of the matrices \({\mathbb {1}}_{(ij)}\) and \({0}_{il}\) in numerical calculations is discouraged, since a naive matrix multiplication becomes numerically inefficient; it remains instructive, however, to write the equations in this extensive format. The resulting size of the uncertainty of the best estimators due to \(\sigma _{Y_{ij}}\) becomes

$$\begin{aligned} \varvec{\sigma }^{(Y_{ij})}_{\varvec{{\hat{a}}}} = \frac{\partial \varvec{{\hat{a}}}}{\partial Y_{ij}} \sigma _{Y_{ij}}\,. \end{aligned}$$
(45)

Hence, uncertainties without entry-to-entry correlations, such as statistical uncertainties, are propagated to the estimated parameters as

$$\begin{aligned} \varvec{\sigma }^{(Y_\text {uncorr.})}_{\varvec{{\hat{a}}}} = \sqrt{ \sum _{i,j} \bigg ( \frac{\partial \varvec{{\hat{a}}}}{\partial Y_{ij}} \sigma _{Y_{ij}} \bigg )^2}\,, \end{aligned}$$
(46)

where the square root and the exponentiation are performed coefficient-wise, while uncertainties that have full entry-to-entry correlations are propagated as

$$\begin{aligned} \varvec{\sigma }^{(Y_\text {corr.})}_{\varvec{{\hat{a}}}} = \sum _{i,j} \frac{\partial \varvec{{\hat{a}}}}{\partial Y_{ij}}\sigma _{Y_{ij}} \,. \end{aligned}$$
(47)

The equations for error propagation of the Linear Template Fit with relative uncertainties (Sect. 7) are altogether similar to the presented equations and do not need to be repeated. However, since relative uncertainties are considered, the partial derivatives for error propagation in Eqs. (46) and (47) become

$$\begin{aligned} \frac{\partial \varvec{{\hat{a}}}}{\partial Y_{ij}} \rightarrow \frac{ \partial \varvec{{\hat{a}}}}{\partial \log {Y_{ij}}}\,. \end{aligned}$$
(48)

Note that the uncertainties in Y are not considered within the fit, but propagated separately, since it is assumed that these are negligibly small and independent of the model parameter \(\varvec{\alpha }\), and cannot be constrained by the data.

8.6 Full uncertainty

The covariance matrix for the best estimators, including all uncertainty components that are considered in the Linear Template Fit, is calculated from Eqs. (42) and (43):

$$\begin{aligned} V_{\varvec{{\hat{a}}}} = {\mathbb {V}}_{\varvec{{\hat{a}}}} + {\textstyle \sum }_l \varvec{\sigma }_{\varvec{{\hat{a}}}}^{(\varvec{s}_{(l)})} (\varvec{\sigma }_{\varvec{{\hat{a}}}}^{(\varvec{s}_{(l)})})^{{\text {T}}} \end{aligned}$$
(49)

Further uncertainties, which are not explicitly considered in the fit, can be propagated to the result by using Eqs. (42) and (43) or also Eqs. (46) and (47).

8.7 Error (re-)scaling

When multiple data points are considered in the fit and they are correlated through some (systematic) uncertainties, the best-fit estimators may become biased [64,65,66,67,68,69]. This is mainly because uncertainties with bin-to-bin correlations are commonly valid as relative uncertainties, so-called multiplicative uncertainties, such as normalization uncertainties, but they are included in the fit as absolute values (i.e., as variances). Since the data are subject to statistical fluctuations, or simply because the values differ in size, for example when considering different data sets, the fit then effectively prefers smaller values and the result may become biased. While such a bias is largely reduced, or even absent from first principles, when working with relative uncertainties, it is still useful to discuss possible solutions to this problem, several of which are discussed in the literature (see e.g. Ref. [69]).

A common solution is to consider (multiplicative) uncertainties with bin-to-bin correlations through their relative values in the \(\chi ^{2}\) computation, where their relative values are multiplied coefficient-wise with the prediction(s) rather than the data values [68]. Since the minimization of \(\chi ^{2}\) is then no longer linear in the fit parameters, there is no closed analytic solution for the best estimators. However, an approximately equivalent result is obtained when calculating the covariance matrices and systematic shifts (cf. Eq. (31)) from the theoretical predictions [66], where the coefficients of the shifts are recalculated according to

$$\begin{aligned} s_{(l),i} \rightarrow \frac{s_{(l),i}\, y_i}{d_i} = \mathrm {s}_{(l),i}\, y_i\,, \end{aligned}$$
(50)

and analogously for covariance matrices. Various options for the predictions \(\varvec{y}\) for this error rescaling may be considered, and the prediction \(\varvec{y}\) could be chosen to be

  • One of the template distributions, \(\varvec{y}_{(j)}\);

  • The prediction of the linearized model with ad hoc selected parameters \(\varvec{\alpha }\), \({\hat{\mathbf {y}}}(\varvec{\alpha })\);

  • The best estimator of the linear model, \({\hat{\mathbf {y}}}(\varvec{{\hat{a}}})\); or

  • The Linear Template Fit may be iteratively repeated with the best estimator, \({\hat{\mathbf {y}}}(\varvec{{\hat{a}}})\).

The rather compact equations of the Linear Template Fit are well suited for an iterative algorithm, as in the sketch below. Such procedures are equivalent to an alternative formulation of the \(\chi ^{2}\) quantity, but the convergence of such iterative algorithms should be critically assessed.

9 Linear Template Fit with a detector response matrix

Many measurements need to be corrected for detector effects like resolution or acceptance in order to compare the measurement with theoretical predictions. This is referred to as unfolding (for reviews see e.g. [10, 74,75,76,77,78,79]) and is commonly performed by first simulating the detector response and representing it in terms of a response matrix A,

$$\begin{aligned} \varvec{y} = A\varvec{x}\,, \end{aligned}$$
(51)

where \(\varvec{y}\) denotes the measured detector-level distribution and \(\varvec{x}\) the “true” underlying distribution. However, it is not straightforward to apply the inverse response matrix to the distribution \(\varvec{y}\): the unfolding problem represents an ill-posed inverse problem, and the matrix A would need to be square to be invertible, so more sophisticated unfolding algorithms need to be employed to determine the “true” distribution \(\varvec{x}\) [80,81,82,83,84,85]. When performing a parameter determination, however, it is equivalent whether one uses the unfolded distribution \(\varvec{x}\) together with templates of the predictions, or the detector-level measurement \(\varvec{y}\) while applying the response matrix to the templates. These two procedures are equivalent if the unfolding problem and algorithm are unbiased; the latter is referred to as folding, up-folding, or forward folding. Because the simulation of the detector effects and the determination of the response matrix A may be computationally expensive, as is the case for complex particle physics experiments, it may be reasonable to determine the matrix A only once and with high precision, and then apply it to the templates.

The inclusion of the detector effects in the Linear Template Fit is straightforward and is realized by the simple substitution

$$\begin{aligned} Y&\rightarrow AX \end{aligned}$$
(52)

in Eqs. (32) and (33), where X represents the template matrix without detector effects at the “truth” or “particle” level. When using relative uncertainties, the substitution becomes

$$\begin{aligned} \log (Y) \rightarrow \log (AX)\,, \end{aligned}$$
(53)

and the full expression of the Linear Template Fit becomes

$$\begin{aligned} \begin{pmatrix}\varvec{{\hat{\alpha }}}\\ \varvec{{\hat{\epsilon }}}\end{pmatrix}= & {} {\mathcal {F}} \big ( \varvec{\log d} - \log (AX)\varvec{{\bar{m}}}\big ) ~~~\text {using}\nonumber \\ {\mathcal {F}}= & {} \left( \begin{pmatrix}\log (AX){\tilde{M}}&~S\end{pmatrix}^{{\text {T}}} {\mathbb {W}}\begin{pmatrix}\log (AX){\tilde{M}}&~S\end{pmatrix}\right. \nonumber \\&\left. + \begin{pmatrix} {0}_k &{} 0 \\ 0 &{} {1}_l\end{pmatrix} \right) ^{-1} \begin{pmatrix}\log (AX){\tilde{M}}&~S\end{pmatrix}^{{\text {T}}} {\mathbb {W}}\,. \end{aligned}$$
(54)

Since Y and AX are \(i\times j\) matrices, the index t for the number of entries (“bins”) at the truth level is introduced: A is an \(i\times t\) matrix and X has dimension \(t\times j\). In order to obtain a higher resolution, it may be reasonable to consider a non-square response matrix A, i.e., \(i\ne t\). It is important that the detector response A is independent of the physics parameters of interest.
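A minimal sketch of this substitution for a univariate fit is given below; it reuses the illustrative linear_template_fit from the sketch above, and all names are assumptions for this example.

```python
# A minimal sketch of the substitution Y -> AX (Eq. (52)) in a univariate fit,
# reusing linear_template_fit from the sketch above; names are illustrative.
import numpy as np

def template_fit_folded(d, W, A, X, alpha_ref):
    """Fit at detector level: fold the particle-level templates X (t x j)
    with the (possibly non-square) response matrix A (i x t)."""
    return linear_template_fit(d, W, A @ X, alpha_ref)[0]
```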

The response matrix A is typically associated with uncertainties, which may exhibit entry-to-entry correlations (like systematic or “model” uncertainties) or may be without such correlations (like statistical uncertainties). Note that the response matrix is commonly normalized, and therefore the statistical uncertainties are without entry-to-entry correlations only in an approximation.

Uncertainties of the best estimators that arise from uncertainties in A can be obtained from linear error propagation. After applying Eq. (52), the partial derivative of Eq. (32) is

$$\begin{aligned} \frac{\partial \varvec{{\hat{a}}}}{\partial A_{it}}&= {\mathcal {D}}^{-1} \Bigg [ \begin{pmatrix}{\mathbb {1}}_{(it)}X{\tilde{M}}&{0}_{il} \end{pmatrix}^{{\text {T}}} {\mathbb {W}}( \varvec{d}- AX \varvec{{\bar{m}}})\nonumber \\&\quad - \begin{pmatrix}AX{\tilde{M}}&S \end{pmatrix}^{{\text {T}}} {\mathbb {W}}{\mathbb {1}}_{(it)} X \varvec{{\bar{m}}}\nonumber \\&\quad - \left[ \begin{pmatrix} AX {\tilde{M}}&S \end{pmatrix}^{{\text {T}}} {\mathbb {W}}\begin{pmatrix}{\mathbb {1}}_{(it)}X{\tilde{M}}&{0}_{il} \end{pmatrix}\right. \nonumber \\&\left. \quad + \begin{pmatrix}{\mathbb {1}}_{(it)}X{\tilde{M}}&{0}_{il} \end{pmatrix}^{{\text {T}}} {\mathbb {W}}\begin{pmatrix} AX {\tilde{M}}&S \end{pmatrix} \right] \varvec{{\hat{a}}}\Bigg ]\,. \end{aligned}$$
(55)

Consequently, uncertainties in the response matrix A without or with entry-to-entry correlations are propagated to the fitted results as

$$\begin{aligned} \varvec{\sigma }^{(A_\text {uncorr.})}_{\varvec{{\hat{a}}}}= & {} \sqrt{ \sum _{i,t} \left( \frac{\partial \varvec{{\hat{a}}}}{\partial A_{it}} \sigma _{A_{it}} \right) ^2} ~~~\text {and}~~~ \nonumber \\ \varvec{\sigma }^{(A_\text {corr.})}_{\varvec{{\hat{a}}}}= & {} \sum _{i,t} \frac{\partial \varvec{{\hat{a}}}}{\partial A_{it}} \sigma _{A_{it}} \,, \end{aligned}$$
(56)

respectively, where \(\sigma _{A_{it}}\) denotes the size of the uncertainty in the element \(A_{it}\), and the power and square-root operations are applied coefficient-wise. When using a response matrix, the uncertainties in the template matrix X (cf. Sect. 8.5) are propagated using

$$\begin{aligned} \frac{\partial \varvec{{\hat{a}}}}{\partial X_{tj}}&= {\mathcal {D}}^{-1} \Bigg [ \begin{pmatrix}A{\mathbb {1}}_{(tj)}{\tilde{M}}&{0}_{il} \end{pmatrix}^{{\text {T}}} {\mathbb {W}}\left( \varvec{d}- AX \varvec{{\bar{m}}}\right) \nonumber \\&\quad - \begin{pmatrix}AX{\tilde{M}}&S \end{pmatrix}^{{\text {T}}} {\mathbb {W}}A{\mathbb {1}}_{(tj)} \varvec{{\bar{m}}}\nonumber \\&\quad - \left[ \begin{pmatrix} AX {\tilde{M}}&S \end{pmatrix}^{{\text {T}}} {\mathbb {W}}\begin{pmatrix}A{\mathbb {1}}_{(tj)}{\tilde{M}}&{0}_{il} \end{pmatrix}\right. \nonumber \\&\left. \quad + \begin{pmatrix}A{\mathbb {1}}_{(tj)}{\tilde{M}}&{0}_{il} \end{pmatrix}^{{\text {T}}} {\mathbb {W}}\begin{pmatrix} AX {\tilde{M}}&S \end{pmatrix} \right] \varvec{{\hat{a}}}\Bigg ]\,. \end{aligned}$$
(57)

Note that for the Linear Template Fit with relative uncertainties, it is reasonable to consider uncertainties in \(A_{it}\) as absolute rather than relative, because the elements \(A_{it}\) already represent relative quantities and may be interpreted as probabilities. Therefore, one employs the partial derivative \(\partial \varvec{{\hat{a}}}/\partial A_{it}\) for error propagation, instead of \(\partial \varvec{{\hat{a}}}/\partial \log {A_{it}}\), although relative uncertainties are considered elsewhere. This partial derivative is straightforward to derive. The propagation of uncertainties in the templates \(X_{tj}\), however, becomes somewhat more complicated when using a response matrix A, since one has to calculate the partial derivative \(\partial \varvec{{\hat{a}}}/\partial \log {X_{tj}}\) for error propagation of the relative uncertainties \(\sigma _{X_{tj}}\).Footnote 4
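As a numerical cross-check of this error propagation, the partial derivatives of Eq. (55) may be approximated by finite differences of the fit result itself. The sketch below does this for a univariate fit, reusing the illustrative template_fit_folded from above; it is not the analytic propagation of the text, and all names are assumptions.

```python
# A minimal numerical cross-check of Eqs. (55) and (56): propagate uncertainties
# of the response-matrix elements to alpha-hat via finite differences instead of
# the analytic derivative. Reuses template_fit_folded; names are illustrative.
import numpy as np

def sigma_ahat_from_A(d, W, A, X, alpha_ref, sigma_A, correlated, eps=1e-7):
    a0 = template_fit_folded(d, W, A, X, alpha_ref)
    terms = []
    for (i, t), sig in np.ndenumerate(sigma_A):
        Avar = A.copy()
        Avar[i, t] += eps                      # vary one element A_it
        dads = (template_fit_folded(d, W, Avar, X, alpha_ref) - a0) / eps
        terms.append(dads * sig)               # (d ahat / d A_it) * sigma_A_it
    if correlated:
        return sum(terms)                      # fully correlated: linear sum
    return np.sqrt(sum(x * x for x in terms))  # uncorrelated: sum in quadrature
```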

10 Considerations, validations, and cross-checks

As in any optimization problem, the result of the Linear Template Fit needs to be validated and cross-checked. Due to its close similarity to the iterative Gauß–Newton minimization algorithm for nonlinear least squares [5, 50, 51, 86, 87], the Linear Template Fit is expected to yield an unbiased and optimal result, in particular if the templates provide a sufficiently accurate approximation of the model and its Jacobian matrix at the best estimator. In fact, when assuming that the model \(\varvec{\lambda }(\varvec{\alpha })\) could be used with continuous values of \(\varvec{\alpha }\) for inference, the best estimator from the Linear Template Fit becomes equivalent to the best estimator that would be obtained when using \(\varvec{\lambda }(\varvec{\alpha })\) directly, if two approximations are exactly fulfilled,Footnote 5

$$\begin{aligned}&\varvec{\lambda }(\varvec{{\hat{\alpha }}}) \approx {\hat{\mathbf {y}}}({\hat{\varvec{\alpha }}}) = Y{\bar{\mathbf {m}}} + Y{\tilde{M}}\varvec{{\hat{\alpha }}} ~~~~~\text {and}~ \end{aligned}$$
(58)
$$\begin{aligned}&\frac{\partial \varvec{\lambda }(\varvec{\alpha })}{\partial \varvec{\alpha }}\bigg |_{\varvec{\alpha }=\varvec{{\hat{\alpha }}}} \approx \frac{\partial {\hat{\mathbf {y}}}(\varvec{\alpha })}{\partial \varvec{\alpha }}\bigg |_{\varvec{\alpha }=\varvec{{\hat{\alpha }}}} =Y{\tilde{M}}\,. \end{aligned}$$
(59)

The first equation requires that the model is correctly represented by the interpolated templates at the best estimator, which is needed for the value of the best estimator, and the second equation demands that the first derivatives are correctly approximated, which is a requirement for the error propagation and minimization. These equations taken together represent just a linear approximation of \(\varvec{\lambda }(\varvec{\alpha })\) at \(\varvec{{\hat{\alpha }}}\), but using the templates only. However, since for the motivation of the Linear Template Fit it was assumed that \(\varvec{\lambda }(\varvec{\alpha })\) cannot be used for inference itself, these two approximations also cannot be validated directly. In the following, some possible studies for the validation of the results and the practical application of the Linear Template Fit are discussed.

10.1 Template range

If the model is nonlinear in the parameters \(\varvec{\alpha }\), an unbiased result may only be obtained if the best estimator \({\hat{\varvec{\alpha }}}\) lies within the interval of the reference values,

$$\begin{aligned} {\hat{\varvec{\alpha }}} \in [\varvec{{\dot{\alpha }}}_{(1)},\varvec{{\dot{\alpha }}}_{(j_\text {max})}]\,. \end{aligned}$$
(60)

This is because the templates provide the approximation of the model and its first derivatives, but an extrapolation beyond the template range is inappropriate for a nonlinear model. If Eq. (60) does not hold, additional templates at further reference points should be generated and the Linear Template Fit should be repeated with an improved selection of reference values. For similar reasons, the templates should be in the close vicinity of the best estimator, such that the linear approximations are justified. As a rule of thumb, the distance between adjacent reference points should not greatly exceed the uncertainty of the best estimator,

$$\begin{aligned} \varvec{{\dot{\alpha }}}_{(j)} - \varvec{{\dot{\alpha }}}_{(j+1)} \lesssim \varvec{\sigma }_{\varvec{{\hat{\alpha }}}}\,. \end{aligned}$$
(61)

For similar reasons, the best estimator should ideally be found in the center of the template range, rather than at its boundary,

$$\begin{aligned} \varvec{{\hat{\alpha }}}-\varvec{{\dot{\alpha }}}_{(1)} \approx \varvec{{\dot{\alpha }}}_{(j_\text {max})}-\varvec{{\hat{\alpha }}}\,. \end{aligned}$$
(62)

For rather linear parameter dependencies, however, these requirements may be relaxed.

The choice of the reference values may also be related to the size of the uncertainties of the best estimator. If one considers the uncertainties of the best estimator to be normally distributed,Footnote 6 the model also needs to exhibit a sufficiently linear behavior at least within a few \(\sigma \). As a rule of thumb, one could demand that the reference values lie approximately within 3 standard deviations:

$$\begin{aligned} \varvec{{\dot{\alpha }}}_{(j)} \in [\varvec{{\hat{\alpha }}}-3\varvec{\sigma }_{\varvec{{\hat{\alpha }}}}, \varvec{{\hat{\alpha }}}+3\varvec{\sigma }_{\varvec{{\hat{\alpha }}}}]\,. \end{aligned}$$
(63)
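The rules of thumb of this subsection are simple to automate. The following small sketch checks Eqs. (60), (61), and (63) for a univariate fit; it is illustrative only, and the soft inequalities ("should not greatly exceed") are approximated by hard comparisons.

```python
# A small sketch of the template-range diagnostics, Eqs. (60), (61), and (63),
# for a univariate fit; the soft inequalities are approximated by <=.
import numpy as np

def check_template_range(alpha_ref, ahat, sigma):
    a = np.sort(np.asarray(alpha_ref, dtype=float))
    inside = a[0] <= ahat <= a[-1]                         # Eq. (60)
    spacing = np.max(np.diff(a)) <= sigma                  # Eq. (61), roughly
    window = (a >= ahat - 3 * sigma).all() and \
             (a <= ahat + 3 * sigma).all()                 # Eq. (63)
    return inside, spacing, window
```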

10.2 The value of \({\hat{\chi }}^{2}\), its uncertainty, and the partial \(\chi ^{2}\)

The value of \(\chi ^{2}\) at the minimum, \({\hat{\chi }}^{2}\), is a meaningful quantity to assess the agreement between the data and the model. Since the calculation of \(\chi ^{2}\) is not explicit in the Linear Template Fit, its value needs to be calculated separately; \({\hat{\chi }}^{2}\) is calculated from the data and the templates at \(\varvec{{\hat{a}}}\) using the somewhat simplified equation

$$\begin{aligned} {\hat{\chi }}^{2}= \left( \varvec{d}- Y\varvec{{\bar{m}}}- Y{\tilde{M}}\varvec{{\hat{a}}}\right) ^{{\text {T}}} W \left( \varvec{d}-Y\varvec{{\bar{m}}}\right) \,. \end{aligned}$$
(64)

Since the data and the templates are associated with uncertainties, the value of \({\hat{\chi }}^{2}\) consequently may also be considered as uncertain. The uncertainty in \({\hat{\chi }}^{2}\) is obtained from linear error propagation of the input uncertainties and calculated as

$$\begin{aligned} \sigma _{{\hat{\chi }}^{2}} = \sqrt{\varvec{\xi }^{{\text {T}}} {\mathbb {V}}\varvec{\xi }+ \sum _l(\varvec{\xi }^{{\text {T}}}\varvec{s}_{(l)})^2}\,, \end{aligned}$$
(65)

where the i-vector \(\varvec{\xi }\) is introduced, whose coefficients represent the partial derivatives \(\tfrac{\partial {\hat{\chi }}^{2}}{\partial d_i}\) and are calculated as \(\xi _i = 2W_i (\varvec{d}- {\hat{\mathbf {y}}}(\varvec{{\hat{a}}}))\), with \(W_i\) denoting the ith row of the matrix W. The uncertainties in the templates Y may be propagated to \({\hat{\chi }}^{2}\) with similar equations. The interpretation of \(\sigma _{{\hat{\chi }}^{2}}\) is not trivial, but this quantity may have interesting features and may serve as an additional quality parameter in the future.

The value of \(\chi ^{2}\) is inherently obtained from a sum of various sources of uncertainties. It is instructive to calculate the contribution from any uncertainty source \(\ell \) to \({\hat{\chi }}^{2}\) separately using [92]

$$\begin{aligned} {\hat{\chi }}^{2}_\ell = (\varvec{d}-{\hat{\mathbf {y}}}(\varvec{{\hat{a}}}))^{{\text {T}}} W V_\ell W (\varvec{d}-{\hat{\mathbf {y}}}(\varvec{{\hat{a}}}))\,, \end{aligned}$$
(66)

where \(V_\ell \) is any of the covariance matrices, see Eq. (31). It is obvious that \({\hat{\chi }}^{2}=\sum _\ell {\hat{\chi }}^{2}_\ell \). This may become particularly useful when multiple measurements are considered in the fit, and thus their individual contributions to \({\hat{\chi }}^{2}\), i.e. the partial \(\chi ^{2}\)’s, may be calculated by summing the respective \({\hat{\chi }}^{2}_\ell \) values.
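The sketch below collects these diagnostics for a univariate fit: \({\hat{\chi }}^{2}\), its uncertainty from Eq. (65), and the partial \(\chi ^{2}\) values of Eq. (66). It assumes that the covariance matrices in V_list sum to the inverse of W and that s_list holds the systematic shift vectors; all names are illustrative.

```python
# A minimal sketch of the chi2 diagnostics: chi2-hat, its uncertainty
# (Eq. (65)), and the partial chi2 values (Eq. (66)). Assumes sum(V_list)
# equals the inverse of W, so that the partial chi2 values sum to chi2-hat.
import numpy as np

def chi2_diagnostics(d, W, y_hat, V_list, s_list=()):
    r = d - y_hat                               # residual at the best estimator
    chi2 = r @ W @ r                            # agrees with Eq. (64) at ahat
    xi = 2.0 * (W @ r)                          # xi_i = d(chi2-hat)/d(d_i)
    V = sum(V_list)
    sigma_chi2 = np.sqrt(xi @ V @ xi + sum((xi @ s) ** 2 for s in s_list))
    partial = [r @ W @ Vl @ W @ r for Vl in V_list]   # Eq. (66), sums to chi2
    return chi2, sigma_chi2, partial
```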

10.3 Interpretation of the \(\chi ^{2}_j\) values as a cross-check

A simple validation of the result from the Linear Template Fit is performed by comparing \({\hat{\chi }}^2\) with the \(\chi ^{2}\) values calculated from the individual templates, defined as

$$\begin{aligned} \chi ^{2}_j= & {} (\varvec{d}- \varvec{y}_{(j)})^{{\text {T}}} W (\varvec{d}-\varvec{y}_{(j)})~~~\text {or}~~~ \nonumber \\ \chi ^{2}_j= & {} (\varvec{\log \varvec{d}} - \varvec{\log {\varvec{y}}}_{(j)})^{{\text {T}}} W (\varvec{\log {\varvec{d}}} -\varvec{\log {\varvec{y}}}_{(j)})\,, \end{aligned}$$
(67)

when absolute or relative normally distributed uncertainties are considered, respectively.

An alternative best estimator, denoted in the following as \(\varvec{\check{\alpha }}\), is obtained from the interpretation of all of the \(\chi ^{2}_j\) values, and serves as a validation of the best estimator from the Linear Template Fit (\(\varvec{{\hat{\alpha }}}\)). Based on the assumption that the \(\chi ^{2}\) distribution \(\chi ^{2}(\varvec{\alpha })\) follows a second-degree polynomial, the so-called \(\chi ^{2}\)  parabola, its minimum is interpreted as the best estimator. For a univariate problem, when introducing

$$\begin{aligned}&\varvec{\chi ^{2}} = \begin{pmatrix} \chi ^{2}_1 \\ \vdots \\ \chi ^{2}_j \end{pmatrix} \,,~~~ C_\text {(pol-2)}= \begin{pmatrix} 1 &{} \dot{\alpha }_1 &{} \dot{\alpha }_1^2 \\ \vdots &{} \vdots &{} \vdots \\ 1 &{} \dot{\alpha }_j &{} \dot{\alpha }_j^2 \\ \end{pmatrix} ~~~\text {and}~~~\nonumber \\&\varvec{\vartheta } = \begin{pmatrix} \vartheta _0 \\ \vartheta _1 \\ \vartheta _2 \end{pmatrix}\,, \end{aligned}$$
(68)

with \(\chi ^{2}_j\) from Eq. (67), the best estimator of the model parameter is given by the stationary point of the polynomial

$$\begin{aligned} \check{\alpha } = \frac{{\hat{\vartheta }}_1}{-2{\hat{\vartheta }}_2}\,, \end{aligned}$$
(69)

where the best estimators of the (three) polynomial parameters are \(\varvec{{\hat{\vartheta }}} = (C^{{\text {T}}}C)^{-1}C^{{\text {T}}}\varvec{\chi ^{2}}\). An illustration for Example 1 is shown in Fig. 5. The uncertainty in \(\check{\alpha }\) is given by the \(\chi ^{2}_\text {min}+1\) criterion [10] and hence

$$\begin{aligned} \sigma _{\check{\alpha }}= & {} \frac{\sqrt{{\hat{\vartheta }}_1^2 - 4{\hat{\vartheta }}_2({\hat{\vartheta }}_0 - \check{\chi }^2 - 1)}}{-2{\hat{\vartheta }}_2} ~~~\text {using}~~~ \nonumber \\ \check{\chi }^2= & {} {\hat{\vartheta }}_0 - \frac{{\hat{\vartheta }}_1^2}{4{\hat{\vartheta }}_2}\,, \end{aligned}$$
(70)

The equations for the multivariate case are straightforward and not repeated here. This alternative best estimator, \(\check{\alpha }\), as well as its variance \(\sigma _{\check{\alpha }}\) and the minimum \(\chi ^{2}\), should be sufficiently consistent with the results from the Linear Template Fit:

$$\begin{aligned} \check{\alpha }\sim {\hat{\alpha }}\,, ~~~ \sigma _{\check{\alpha }}\sim \sigma _{{\hat{\alpha }}} ~~~\text {and}~~~ \check{\chi }^2\sim {\hat{\chi }}^2\,, \end{aligned}$$
(71)

and may serve as a valuable cross-check. As a quick and practical visual test, if the \(\chi ^{2}_j\) values of the templates line up on the fitted parabola (as seen in Fig. 5), the problem is essentially linear, and the Linear Template Fit is expected to provide an unbiased and optimal best estimator. Note that for the calculation of Eq. (69), at least three templates are required, whereas the minimum number of templates for the univariate Linear Template Fit is only two.
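The following minimal sketch implements this cross-check for absolute, normally distributed uncertainties; the names are illustrative, and \(\check{\chi }^2\) is evaluated as the minimum of the fitted parabola.

```python
# A minimal sketch of the parabola cross-check, Eqs. (67)-(70), for a
# univariate fit with absolute, normally distributed uncertainties.
import numpy as np

def parabola_crosscheck(d, W, Y, alpha_ref):
    chi2_j = np.array([(d - y) @ W @ (d - y) for y in Y.T])   # Eq. (67)
    C = np.column_stack([np.ones_like(alpha_ref), alpha_ref, alpha_ref ** 2])
    t0, t1, t2 = np.linalg.solve(C.T @ C, C.T @ chi2_j)       # theta-hat
    alpha_check = -t1 / (2.0 * t2)                            # Eq. (69)
    chi2_check = t0 - t1 ** 2 / (4.0 * t2)                    # parabola minimum
    sigma_check = 1.0 / np.sqrt(t2)           # chi2_min + 1 criterion, Eq. (70)
    return alpha_check, sigma_check, chi2_check
```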

Fig. 5 Illustration of various \(\chi ^{2}/n_\text {df}\) values in the scope of the Linear Template Fit for Example 1 (Sect. 4). The red star indicates the best-fit \(\chi ^{2}\) of the Linear Template Fit, \({\hat{\chi }}^{2}\). The full circles display the \(\chi ^{2}\) values of the individual templates, and the red line displays the \(\chi ^{2}\) parabola calculated from them; its minimum is indicated with an open circle (\(\check{\chi }^2\), see text)

10.4 Power law factor for the linear regression

In the Linear Template Fit, the functions \({\hat{\mathbf {y}}}(\varvec{\alpha })\) are (multidimensional) linear functions, see Eqs. (15), (27), and (38). Although the use of higher-degree polynomials or multivariate polynomials with interference terms could in principle be considered (see Sect. 11), in such cases the optimization problem cannot be solved analytically, and no closed matrix expression for the best estimators exists. However, an analytic solution is still obtained when only power law factors are introduced, and in a univariate Linear Template Fit the model approximation becomes

$$\begin{aligned} \mathrm {y}_i(\alpha ;\theta _0,\theta _1) = \theta ^{(i)}_0 + \theta ^{(i)}_1 \alpha ^\gamma \end{aligned}$$
(72)

with \(\gamma >0\) being a real number. The matrix M, Eq. (10), then has elements \(M_{j,0}=1\) and \(M_{j,1}={\dot{\alpha }}_j^{\gamma }\). In the multivariate case, different power law factors for each fit parameter can be introduced, and one writes \(\gamma _k\) and has \(M_{j,(k+1)}={\dot{\alpha }}_{(j),k}^{\gamma _k}\). The best estimators of the multivariate Linear Template Fit, Eq. (32), then become

$$\begin{aligned} {\hat{\alpha }}_k= \bigg ( {\mathcal {F}}( \varvec{d}-Y\varvec{{\bar{m}}}) \bigg )_k^{\frac{1}{\gamma _k}}\,. \end{aligned}$$
(73)

This solution can immediately be seen by a variable substitution of \(\alpha ^\gamma \) in Eqs. (32) and (72). The application of the power law factor(s) \(\gamma \) when working with relative uncertainties (see Sect. 7) is straightforward, since all equations remain linear in \(\alpha _k\). Similarly, the equations for the error propagation (see Sect. 8) can easily be adapted with that variable substitution. For instance, the elements of the systematic shifts (see Eq. (43)) become

$$\begin{aligned} (\varvec{\sigma }_{\varvec{{\hat{a}}}}^{(\varvec{s}_{(l)})})_k = \frac{\left( {\mathcal {F}}\varvec{s}_{(l)}\right) _k }{\gamma _k{\hat{\alpha }}_k^{\gamma _k-1} }\,, \end{aligned}$$
(74)

and similarly for other uncertainty components or covariance matrices. The power law factor \(\gamma \) may be of practical use in order to validate the assumptions of the linearity of the regression analysis, or to improve the fit, such as when it is known that the model is proportional to \(\alpha ^2\). This would be the case in determinations of the strong coupling constant from n-jet production cross sections at the LHC, which are proportional to \(\alpha ^n_s(m_Z)\) at the leading order in perturbative quantum chromodynamics (QCD).
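Because the power law factor amounts to a variable substitution, it requires only a few lines on top of the univariate fit. The sketch below reuses the illustrative linear_template_fit from above and assumes positive reference values and a positive substituted estimator.

```python
# A minimal sketch of the power-law substitution, Eqs. (72)-(73): the fit is
# performed in the substituted variable alpha^gamma and inverted at the end.
# Reuses linear_template_fit; assumes positive reference values and estimator.
import numpy as np

def power_law_template_fit(d, W, Y, alpha_ref, gamma):
    ahat_sub = linear_template_fit(d, W, Y, alpha_ref ** gamma)[0]
    return ahat_sub ** (1.0 / gamma)
```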

10.5 Finite differences

In a use case where a larger number of templates is available, various subsets of these templates can serve as input to the Linear Template Fit, and some such choices have already been suggested in the previous paragraphs. In the case where only two (or three) templates are provided as input to the Linear Template Fit, the selected templates have a special meaning, because they can also be considered as a numerical first derivative through finite differences. Such numerically calculated derivatives are the essence of many gradient descent optimization methods [5, 49, 50, 86, 87], and it is instructive to study the relation.

As an example, let us consider a univariate problem where the objective function \(\chi ^{2}\) is given by Eq. (8). Close to the minimum, \(\chi ^{2}\) is nearly quadratic, and the Newton step, \(\varDelta \alpha =-(\chi ^{2})^\prime /(\chi ^{2})^{\prime \prime }\), exhibits good convergence [5, 50, 51, 87]. The first and second derivatives at an expansion point \(\alpha _0\) are

$$\begin{aligned} (\chi ^{2})^\prime= & {} -2\varvec{\lambda }^{\prime \mathrm{{T}}}(\alpha _0) W (\varvec{d} - \varvec{\lambda }(\alpha _0)) ~~~\text {and}~~~ \nonumber \\ (\chi ^{2})^{\prime \prime }= & {} 2 \varvec{\lambda }^{\prime \mathrm{{T}}}(\alpha _0) W \varvec{\lambda }^\prime (\alpha _0)\,, \end{aligned}$$
(75)

where the second derivatives \(\varvec{\lambda }^{\prime \prime }\) are zero or neglected in the latter equation. Hence, one obtains the estimator from the Newton step as

$$\begin{aligned} {\hat{\alpha }}_\text {(Newton)}= & {} \alpha _0 + \left[ \varvec{\lambda }^{\prime \mathrm{{T}}}(\alpha _0) W \varvec{\lambda }^\prime (\alpha _0) \right] ^{-1}\nonumber \\&\times \varvec{\lambda }^{\prime \mathrm{{T}}}(\alpha _0) W (\varvec{d}-\varvec{\lambda }(\alpha _0))\,, \end{aligned}$$
(76)

and the calculation of the first derivative could be done numerically using a finite difference with nonzero step size h

$$\begin{aligned} \varvec{\lambda }^\prime (\alpha _0) = \frac{\varvec{\lambda }(\alpha _0+h) - \varvec{\lambda }(\alpha _0)}{h}\,. \end{aligned}$$
(77)

The analogous Linear Template Fit makes use of the templates at \(\alpha _0\) and \(\alpha _0+h\) and defines the following matrices:

$$\begin{aligned}&Y = \begin{pmatrix} \varvec{\lambda }(\alpha _0)&~\varvec{\lambda }(\alpha _0+h) \end{pmatrix} ~~\text {and}~~ M=\begin{pmatrix} 1 &{} \alpha _0 \\ 1 &{} ~\alpha _0+h \end{pmatrix}\,, ~\text {hence}~ \nonumber \\&M^+ = \begin{pmatrix} \tfrac{\alpha _0+h}{h} &{} -\tfrac{\alpha _0}{h}\\ -\tfrac{1}{h} &{} \tfrac{1}{h} \end{pmatrix} = \begin{pmatrix} \varvec{{\bar{m}}}^{{\text {T}}} \\ \varvec{{\tilde{m}}}^{{\text {T}}} \end{pmatrix}\,. \end{aligned}$$
(78)

From the Linear Template Fit, Eq. (17), when resolving \(\varvec{{\bar{m}}}\), one obtains the best estimator

$$\begin{aligned} {\hat{\alpha }}= \alpha _0 + \left[ (Y\varvec{{\tilde{m}}})^{{\text {T}}} W (Y\varvec{{\tilde{m}}})\right] ^{-1} (Y\varvec{{\tilde{m}}})^{{\text {T}}} W(\varvec{d} - \varvec{\lambda }(\alpha _0))\,, \end{aligned}$$
(79)

and since \(Y\varvec{{\tilde{m}}}=\varvec{\lambda }^{\prime }(\alpha _0)\), it can be directly seen that it is equivalent to the estimator \({\hat{\alpha }}_\text {(Newton)}\) from the Newton step using a numerical finite difference.
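The equivalence can be verified numerically with a few lines of code. The sketch below implements the Gauss-Newton step with the finite difference, and the comment shows how the same number follows from the illustrative linear_template_fit defined earlier; all names are assumptions for this example.

```python
# A minimal sketch of Eqs. (76)-(79): one Gauss-Newton step with a finite
# difference is numerically identical to a two-template Linear Template Fit.
import numpy as np

def newton_step_finite_difference(d, W, lam0, lam_h, alpha0, h):
    lam_prime = (lam_h - lam0) / h                     # Eq. (77)
    step = (lam_prime @ W @ (d - lam0)) / (lam_prime @ W @ lam_prime)
    return alpha0 + step                               # Eq. (76)

# The same number follows from linear_template_fit (defined above) with the
# two templates lam0 and lam_h at the reference values alpha0 and alpha0 + h:
#   linear_template_fit(d, W, np.column_stack([lam0, lam_h]),
#                       np.array([alpha0, alpha0 + h]))[0]
```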

11 Nonlinear model approximation

The Linear Template Fit can be regarded as fitting linear functions to every row i of the template matrix Y.Footnote 7 In the following, nonlinear functions are considered for the model \(\lambda _i(\varvec{\alpha })\) in each “bin” i. For simplicity, some formulae are given only for a univariate model, while the multivariate case is commonly straightforward. Also, the concepts of the \(\gamma \) factor, relative uncertainties, or “systematic shifts” \(\varvec{s}_l\) can equally be considered in a multivariate formulation.

The model can be approximated from the templates in every bin i with an nth-degree polynomial function

$$\begin{aligned}&\lambda _i(\alpha ) \approx \mathrm {y}_i(\alpha ;\varvec{{\hat{\theta }}}_{(i)}) =: \hat{\mathrm {y}}_i(\alpha ) ~~~\text {using}~~~\nonumber \\&\mathrm {y}_i(\alpha ;\varvec{\theta }_{(i)})= \theta _{0}^{(i)} + \theta _{1}^{(i)}\alpha + \cdots + \theta _{n}^{(i)}\alpha ^n\,. \end{aligned}$$
(80)

The best estimators for the polynomial parameters are obtained from regression analysis,

$$\begin{aligned} {\varvec{{\hat{\theta }}}_{(i)}} = \left( {\begin{matrix}{\hat{\theta }}_{0}\\ {\hat{\theta }}_{1}\\ \vdots \\ {\hat{\theta }}_{n}\end{matrix}}\right) _{(i)} = {\mathcal {M}}^+\begin{pmatrix} y_{(1),i} \\ \vdots \\ y_{(j),i} \end{pmatrix}\,, \end{aligned}$$
(81)

where \(y_{(j),i}\) is the ith value of the jth template, and the g-inverse is

$$\begin{aligned}&{\mathcal {M}}^+ = \left( {\mathcal {M}}^{{\text {T}}} {\mathcal {M}}\right) ^{-1} {\mathcal {M}}^{{\text {T}}} ~~~~\text {using here}\nonumber \\&~~~~ {\mathcal {M}} = \begin{pmatrix} 1 &{} \dot{\alpha }_1 &{}\ldots &{} \dot{\alpha }_1^n \\ \vdots &{} \vdots &{} \vdots &{} \vdots \\ 1 &{} \dot{\alpha }_j &{} \ldots &{} \dot{\alpha }_j^n \\ \end{pmatrix}\,. \end{aligned}$$
(82)

For reasons outlined in Sect. 3, an unweighted regression is considered. Using this model approximation in the objective function of the fitting problem, Eq. (2), the \(\chi ^{2}\) expression for a univariate model becomes of order \({\mathcal {O}}(2n)\) in \(\alpha \),

$$\begin{aligned} \chi ^2_{{\mathcal {O}} {(2n)}}&:= \left( \varvec{d} - \varvec{{\hat{\mathrm {y}}}}(\alpha )\right) ^{{\text {T}}} W \left( \varvec{d} - \varvec{{\hat{\mathrm {y}}}}(\alpha )\right) \end{aligned}$$
(83)
$$\begin{aligned}&= \left( \varvec{d} - Y\varvec{m_0} - Y\varvec{m_1}\alpha - \cdots - Y\varvec{m_n}\alpha ^n\right) ^{{\text {T}}} \nonumber \\&\quad \times W \left( \varvec{d} - Y\varvec{m_0} - Y\varvec{m_1}\alpha - \cdots - Y\varvec{m_n}\alpha ^n\right) \,, \end{aligned}$$
(84)

where the vectors \(\varvec{m}\) were used as the representation of \({\mathcal {M}}^+\) according to

$$\begin{aligned} {\mathcal {M}}^+ =: \left( \begin{array}{c} \varvec{m_0}^{{\text {T}}} \\ \varvec{m_1}^{{\text {T}}} \\ \vdots \\ \varvec{m_n}^{{\text {T}}} \\ \end{array}\right) \,. \end{aligned}$$
(85)

It is obvious that for \(n=1\), this is the Linear Template Fit, and the generalizations to the multivariate fit (Eq. (30)), and with relative uncertainties (Eq. (39)), are straightforward. Note that for different degrees n, the vectors \(\varvec{m}\) differ, unless the equations are rewritten in terms of orthogonal polynomials [93].

Since the first derivative of \(\chi ^2_{{\mathcal {O}} {(2n)}}\) is nonlinear in \(\alpha \), no analytic solution of the optimization problem is found. Therefore, in the following, two iterative algorithms to find the minimum of the objective function \(\chi ^2_{{\mathcal {O}} {(2n)}}\) are discussed, which correspond to the Gauß–Newton and the Newton algorithm (see also Refs. [5, 50, 51, 87]). In both cases, if the starting value is identified with the best estimator from the Linear Template Fit, \(\alpha ={\hat{\alpha }}\), the step size of the next iterative step can be interpreted as a convergence criterion or an uncertainty of the Linear Template Fit. Reasonable values for n are \(n=2\), 3, or at most 4 (since \({\mathcal {M}}\) quickly becomes ill-conditioned for large n, see, e.g., Refs. [94, 95] and references therein), while for \(n=1\), results consistent with the Linear Template Fit are found.
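The per-bin polynomial regression of Eqs. (80)–(82) is compactly expressed in code. The following minimal sketch returns the rows of the g-inverse and a callable model approximation; the names are illustrative, and NumPy-array inputs are assumed.

```python
# A minimal sketch of the per-bin polynomial regression, Eqs. (80)-(82),
# returning the rows m_0 ... m_n of the g-inverse and a callable model.
import numpy as np

def polynomial_model(Y, alpha_ref, n=2):
    Mcal = np.vander(alpha_ref, n + 1, increasing=True)  # columns: 1, a, ..., a^n
    Mplus = np.linalg.solve(Mcal.T @ Mcal, Mcal.T)       # rows: m_0^T ... m_n^T
    def y_hat(alpha):
        powers = np.array([alpha ** k for k in range(n + 1)])
        return Y @ (Mplus.T @ powers)                    # Eq. (80) in every bin
    return Mplus, y_hat
```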

11.1 Linearization of the nonlinear approximation \(\hat{\mathrm {y}}(\alpha )\)

An iterative minimization algorithm based on first derivatives is defined through linearly approximating the (nonlinear) function \({\hat{{{\textbf {y}}}}}(\alpha )\), Eq. (80). For the purpose of validating the result from the Linear Template Fit, we consider \({\hat{\alpha }}\) as the expansion point, although this could be any other value as well. Hence, the linear approximation of \({\hat{{{\textbf {y}}}}}(\alpha )\) in every bin i is

$$\begin{aligned} {\tilde{y}}_i(\alpha )&\approx \hat{\mathrm {y}}_i({\hat{\alpha }}) + \frac{\partial \hat{\mathrm {y}}_i(\alpha )}{\partial \alpha }\bigg |_{\alpha ={\hat{\alpha }}} (\alpha -{\hat{\alpha }}) \nonumber \\&= Y\left( \textstyle {\sum }_{\nu =0}^n (1-\nu )\varvec{m}_{(\nu )}{\hat{\alpha }}^{\nu }\right) + Y\left( \textstyle {\sum }_{\nu =0}^n \nu \varvec{m}_{(\nu )}{\hat{\alpha }}^{\nu -1} \right) \alpha \nonumber \\&=: Y\varvec{\tilde{{\bar{m}}}} + Y\varvec{\tilde{{\tilde{m}}}}\alpha \,. \end{aligned}$$
(86)

The last term defines the shorthand notations \(\varvec{\tilde{{\bar{m}}}}\) and \(\varvec{\tilde{{\tilde{m}}}}\). From comparison with Eq. (15), it is seen that an alternative best estimator of the template fit (see Eq. (19)) is obtained as

$$\begin{aligned} \tilde{{\hat{\alpha }}} = \frac{(Y\varvec{\tilde{{\tilde{m}}}})^{{\text {T}}}W}{(Y\varvec{\tilde{{\tilde{m}}}})^{{\text {T}}}WY\varvec{\tilde{{\tilde{m}}}} } ( \varvec{d}- {Y\varvec{\tilde{{\bar{m}}}}})\,. \end{aligned}$$
(87)

The difference \({\hat{\alpha }}-\tilde{{\hat{\alpha }}}\) is supposed to become small for a valid application of the Linear Template Fit, and one should find

$$\begin{aligned} |{\hat{\alpha }} - \tilde{{\hat{\alpha }}}| \ll \sigma _{{\hat{\alpha }}}\,. \end{aligned}$$
(88)

If Eq. (88) holds for a particular application, then nonlinear effects are negligible, and the best estimator of the Linear Template Fit is insensitive to nonlinear effects in the model \(\lambda \). If Eq. (88) does not hold, it should be studied whether the nonlinearity is present in the vicinity of \({\hat{\alpha }}\) or whether it is rather introduced by a long-range nonlinear behavior of \(\lambda (\alpha )\). In the latter case, the range of the templates should be reduced such that approximately

$$\begin{aligned} {\dot{\alpha }}_\text {max}-{\dot{\alpha }}_\text {min} \lesssim 2\sigma _{{\hat{\alpha }}}\, \end{aligned}$$
(89)

and the Linear Template Fit should be repeated. Note that if the templates are subject to statistical fluctuations and if only a few templates are provided, then a nonlinear approximation is more sensitive to those fluctuations than linear approximations, and \(|{\hat{\alpha }} - \tilde{{\hat{\alpha }}}|\) may become larger. In such an application, \(\sigma _{{\hat{\alpha }}}^Y\) should also be considered as a relevant source of uncertainty (see Eqs. (46) and (47)).
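The cross-check of Eqs. (86)–(88) is sketched below for a univariate fit; it reuses the illustrative polynomial_model and the shorthand vectors defined above, and all names are assumptions for this example.

```python
# A minimal sketch of the linearization cross-check, Eqs. (86)-(88): the
# degree-n regression is linearized at alpha-hat and refitted. Reuses
# polynomial_model; |ahat - atil| << sigma validates the linear fit.
import numpy as np

def linearized_crosscheck(d, W, Y, alpha_ref, ahat, n=2):
    Mplus, _ = polynomial_model(Y, alpha_ref, n)
    nu = np.arange(n + 1)
    mbar_t = Mplus.T @ ((1 - nu) * ahat ** nu)            # tilde-bar-m, Eq. (86)
    dcoef = nu * ahat ** np.maximum(nu - 1, 0)            # nu = 0 term vanishes
    mtil_t = Mplus.T @ dcoef                              # tilde-tilde-m, Eq. (86)
    ybar, ytil = Y @ mbar_t, Y @ mtil_t
    atil = (ytil @ W @ (d - ybar)) / (ytil @ W @ ytil)    # Eq. (87)
    return atil                                # compare |ahat - atil|, Eq. (88)
```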


Furthermore, the notation of Eq. (86) suggests an alternative interpretation of the nonlinear behavior of the model. The vectors \(\varvec{\tilde{{\bar{m}}}}\) and \(\varvec{\tilde{{\tilde{m}}}}\) may be interpreted in terms of uncertainties of \(\varvec{\bar{{m}}}\) and \(\varvec{\tilde{{m}}}\) when writing

$$\begin{aligned} \varvec{\sigma }_{\varvec{\bar{{m}}}} := \varvec{\tilde{{\bar{m}}}} - \varvec{\bar{{m}}} ~~~~\text {and}~~~~ \varvec{\sigma }_{\varvec{\tilde{{m}}}} := \varvec{\tilde{{\tilde{m}}}} - \varvec{\tilde{{m}}}\,. \end{aligned}$$
(90)

Through linear error propagation, one can define an uncertainty in \({\hat{\alpha }}\) as

$$\begin{aligned} \sigma _{{\hat{\alpha }}}^{(m)}&= \sum _j \bigg ( -FY{\mathbb {1}}_j\sigma _{{\bar{m}}_j} + D^{-1} \left[ (Y{\mathbb {1}}_j)^{{\text {T}}}W ( d-Y{\bar{m}}) \right. \nonumber \\&\left. \quad - \left[ (Y{\mathbb {1}}_j)^{{\text {T}}}W Y{\tilde{m}} + (Y{\tilde{m}})^{{\text {T}}}WY{\mathbb {1}}_j\right] {\hat{a}} \right] \sigma _{{\tilde{m}}_j} \bigg ) \nonumber \\&= \sum _i \bigg ( -F{\mathbb {1}}_i\sigma _{{\overline{Ym}}_i} + D^{-1} \big [ {\mathbb {1}}_i^{{\text {T}}}W ( d-Y{\bar{m}}) \nonumber \\&\quad - \left[ {\mathbb {1}}_i^{{\text {T}}}W Y{\tilde{m}} + (Y{\tilde{m}})^{{\text {T}}}W {\mathbb {1}}_i\right] {\hat{a}} \big ] \sigma _{{\widetilde{Ym}}_i} \bigg )\,. \end{aligned}$$
(91)

Note that the sum runs over j in the first and over i in the second expression. The second expression makes use of

$$\begin{aligned} \varvec{\sigma }_{\varvec{{\overline{Ym}}}} := Y\varvec{\sigma }_{\varvec{{\bar{m}}}} ~~~~\text {and}~~~~ \varvec{\sigma }_{\varvec{{\widetilde{Ym}}}} := Y\varvec{\sigma }_{\varvec{\tilde{{m}}}} \end{aligned}$$
(92)

and it has the advantage that the summands represent a useful quantifier for every bin i. For a sufficiently linear problem, both the linear sum (Eq. (91)) and the quadratic sum should be small. In particular, one should find \(\sigma _{{\hat{\alpha }}}^{(m)}\approx \tilde{{\hat{\alpha }}}-{\hat{\alpha }}\) for reasonable nth-degree polynomials, i.e., \(n=2\), and also, according to Eq. (88),

$$\begin{aligned} |\sigma _{{\hat{\alpha }}}^{(m)}| \ll |\sigma _{{\hat{\alpha }}}|\,. \end{aligned}$$
(93)

11.2 Minimization of \(\chi ^2_{{\mathcal {O}} {(2n)}}\)

When \(\hat{\alpha }\) is obtained from the Linear Template Fit, the objective function \(\chi ^2_{{\mathcal {O}} {(2n)}}\), Eq. (84), is expected to exhibit a minimum in the close vicinity of \(\hat{\alpha }\). This provides a good motivation for the application of the Newton optimization at \(\hat{\alpha }\), which is based on second derivatives of \(\chi ^2_{{\mathcal {O}} {(2n)}}\), i.e., a parabolic approximation. The Newton distance to the minimum is given by

$$\begin{aligned} \varDelta \alpha = -\frac{(\chi ^{2}_{{\mathcal {O}} {(2n)}})^\prime }{(\chi ^{2}_{{\mathcal {O}} {(2n)}})^{\prime \prime }} = -\left( \left. \frac{\partial ^2\chi ^{2}_{{\mathcal {O}} {(2n)}}}{\partial \alpha ^2}\right| _{\alpha ={\hat{\alpha }}}\right) ^{-1} \left. \frac{\partial \chi ^{2}_{{\mathcal {O}} {(2n)}}}{\partial \alpha }\right| _{\alpha ={\hat{\alpha }}} \nonumber \\ \end{aligned}$$
(94)

and in the multivariate case by [5, 55]

$$\begin{aligned} \varvec{\varDelta \alpha } = \left. -H^{-1}\varvec{g}\right| _{\varvec{\alpha }=\varvec{{\hat{\alpha }}}}\,. \end{aligned}$$
(95)

The gradient vector \(\varvec{g}\) and the Hessian matrix H are defined by the first and second derivatives of \(\chi ^2_{{\mathcal {O}} {(2n)}}\), respectively. Both quantities can be calculated directly at an expansion point \(\varvec{\alpha _0}\), which was identified in the above equations with \(\varvec{{\hat{\alpha }}}\), since \(\chi ^{2}_{{\mathcal {O}} {(2n)}}\) is just a known polynomial of degree 2n that depends only on M, Y, \(\varvec{d}\), S, and W. The value of \(\varvec{\varDelta \alpha }\) can be interpreted as the expected distance to the minimum (EDM) based on the nonlinear model approximation.Footnote 8

For practical purposes and to validate the result of the Linear Template Fit, one should require that its value be smaller than the uncertainty from the fit

$$\begin{aligned} |\varDelta \alpha | \ll \sigma _{{\hat{\alpha }}}\,. \end{aligned}$$
(96)

The most reasonable degree for this purpose is \(n=2\) (which is exploited in the following in greater detail), and one expects that \(\varDelta \alpha \) will be negligibly small. This observation can also be interpreted such that the result from the Linear Template Fit and an alternative result from a minimization of \(\chi ^2_{{\mathcal {O}} {(2n)}}\) are consistent. If the reference points are in the vicinity of \({\hat{\alpha }}\), this is to be expected because of the underlying statistical model of (log-)normally distributed estimators. In contrast, if \(\varDelta \alpha \) becomes large, it may well be that not even the reported uncertainties in the best estimator, \(\sigma _{{\hat{\alpha }}}\), can be considered to be normally distributed.

12 The Quadratic Template Fit

Despite all the considerations and cross-checks discussed in the previous sections, in some problems the linearized model may simply be an inaccurate approximation of the true model (cf. Eq. (27)). Reasons could be that the templates cannot be generated in a sufficiently close vicinity of the best estimator, perhaps for technical reasons; that in multivariate problems some parameters are highly nonlinear; that a dimension is poorly constrained and the reference points span a large range; or that interference terms between the parameters are non-negligible. In such cases, the quantities \(\varvec{\tilde{{\hat{\alpha }}}}-\varvec{{\hat{\alpha }}}\), \(\varvec{\sigma _{{\hat{\alpha }}}^{(m)}}\), or \(\varvec{\varDelta \alpha }\) (Eqs. (88), (91) and (95)) become non-negligible when they are calculated for second-degree polynomials (\(n=2\)).

An improved approximation of the model is obtained when using a second-order approximation, which may also include interference terms. The equations are summarized in Appendix C (Eqs. (C.7)–(C.9)). Using such a second-degree polynomial model \({\hat{{{\textbf {y}}}}}(\varvec{\alpha })\) that includes terms \({\mathcal {O}}(\alpha _k)\), \({\mathcal {O}}(\alpha ^2_k)\), and \({\mathcal {O}}(\alpha _{k_1}\alpha _{k_2})\) in the optimization problem results in a nonlinear least-squares problem. Therefore, an iterative algorithm is required to obtain the best estimator.

A robust algorithm, denoted as the Quadratic Template Fit, to obtain the best estimator and to perform a full error propagation is defined as follows:

  1. The Linear Template Fit is performed to obtain a first estimator \(\varvec{{\hat{\alpha }}}_{(0)}\), for example using Eq. (32);

  2. The Newton algorithm [5, 50, 51, 87], Eq. (95), is employed with a few iterations m in order to obtain improved best estimators \(\varvec{{\hat{\alpha }}}_{(m)} = \varvec{{\hat{\alpha }}}_{(m-1)} + \varvec{\varDelta \alpha }_{(m)}\), for which the Hessian matrix H is directly analytically calculable;

  3. The best estimator and the error calculation are obtained using the linearized approximations for \(\varvec{\tilde{{\bar{m}}}}\) and \(\tilde{{\tilde{M}}}\) (see Eqs. (86) or (C.13)) in the equations of the Linear Template Fit.

The first step provides the starting point for the Newton algorithm, and since it is obtained with the Linear Template Fit, it is already close to the minimum. The second step provides fast and stable convergence for several reasons: (i) the starting point is already in the vicinity of the minimum; (ii) the Hessian matrix is commonly positive definite and, like the gradient vector, exactly analytically calculable;Footnote 9 and (iii) the Newton method has good convergence for nearly quadratic functions [5, 46, 47], which is the case for \(\chi ^2_{{\mathcal {O}} {(4)}}\). Although the algorithm is expected to converge after a few iterations, several stopping criteria are conceivable, for example when the EDM becomes negligible or when \(\varvec{\varDelta \alpha }_{(m)}\approx \varvec{\varDelta \alpha }_{(m-1)}\). The last step provides a further improved best estimator, but mainly gives access to the equations of the error propagation. Note that in the first step, the true model \(\varvec{\lambda }(\varvec{\alpha })\) is approximated linearly, and in the latter steps by quadratic functions.
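A minimal sketch of this algorithm for a univariate problem is given below. It reuses the illustrative linear_template_fit and polynomial_model from the earlier sketches and implements steps 1 and 2 only (the error propagation of step 3 is omitted); the gradient and Hessian of the degree-4 objective are computed exactly from the per-bin second-degree polynomials.

```python
# A minimal sketch of the univariate Quadratic Template Fit: a Linear Template
# Fit provides the start value, followed by exact Newton steps (Eq. (94)) on
# the degree-4 objective built from per-bin second-degree polynomials (n = 2).
import numpy as np

def quadratic_template_fit(d, W, Y, alpha_ref, n_iter=5, tol=1e-10):
    Mplus, _ = polynomial_model(Y, alpha_ref, n=2)
    m0, m1, m2 = (Y @ Mplus[k] for k in range(3))     # Y m_0, Y m_1, Y m_2
    ahat = linear_template_fit(d, W, Y, alpha_ref)[0] # step 1: linear fit
    for _ in range(n_iter):                           # step 2: Newton iterations
        r = d - (m0 + m1 * ahat + m2 * ahat ** 2)     # residual
        rp = -(m1 + 2.0 * m2 * ahat)                  # dr/dalpha
        grad = 2.0 * (rp @ W @ r)                     # (chi2)'
        hess = 2.0 * (rp @ W @ rp - 2.0 * (m2 @ W @ r))   # (chi2)'', exact
        step = -grad / hess                           # Newton step, Eq. (94)
        ahat += step
        if abs(step) < tol:                           # EDM negligible: stop
            break
    return ahat
```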


13 Example 3: the Quadratic Template Fit

As an application of the Quadratic Template Fit, an example is constructed where the model is nonlinear and the templates purposefully span a (too) large range. Therefore, the criteria for a valid application of the Linear Template Fit as discussed in Sect. 10 are largely not fulfilled.

As in Examples 1 and 2 (Sects. 4 and 6), the model is considered to be a normal distribution, and the same pseudo-data as in the previous examples are also used. In contrast to Examples 1 and 2, only the variance of the normal distribution shall be determined. Six templates are generated similarly to Examples 1 and 2, with variances between 3 and 8, and each template is generated using \(4\cdot 10^6\) events. The pseudo-data and the templates are displayed in Fig. 6. Recall that the pseudo-data were generated with a variance of 6.2 (due to statistical fluctuations, the true variance is about 6.3).

Fig. 6 Some illustrations of Example 3. Left: the template distributions Y as a function of the observable; more details are given in the caption of Fig. 1. Right: the templates Y for some selected bins. The blue line indicates a second-degree polynomial regression of the templates, as used by the Quadratic Template Fit. The green dashed line indicates the linearization at the best estimator, and the red dotted line indicates the linear regression of the templates as it would be used by the plain Linear Template Fit

In this example, the Linear Template Fit results in a best estimator of \(6.61\pm 0.32\), and is thus not quite consistent with the expectation. The EDM is comparably large, with a value of \(-0.27\), and \(\sigma _{{\hat{\alpha }}}^{(m)}=-0.27\), and the minimum of the \(\chi ^{2}\)-parabola results in \(6.52\pm 0.27\). Consequently, Eqs. (61), (62), (71), and (93) are not well fulfilled. In Fig. 6 (right), the parameter dependence in some selected bins is displayed, and it is clearly seen that the templates are not described by a linear function. It is also observed from the \(\chi ^{2}\) values of the individual templates that these cannot be described by a parabola as would be the case for a linear problem, as displayed in Fig. 7.

The Quadratic Template Fit, in contrast, results in a best estimator of \(6.36\pm 0.40\) and is thus in excellent agreement with the expectation. This is because the algorithm employs a second-degree polynomial in each bin to represent the model. This provides an adequate representation of the parameter dependence even of a nonlinear model, as also seen in Fig. 6 (right). A linearization at the value of the best estimator is used only for linear error propagation, and appears to be appropriate given the size of the final uncertainties, \(\pm 0.40\).

14 Validation study: determination of the strong coupling constant from inclusive jet cross section data

As an example application of the Linear Template Fit, the strong coupling constant in quantum chromodynamics (QCD), \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\), is determined from a measurement of inclusive jet cross sections in proton–proton collisions at a center-of-mass energy of \(\sqrt{s}=7\,\)TeV [96]. The data were taken with the CMS detector at the LHC at CERN. Inclusive jets were measured as a function of the transverse momentum \(p_\text {T}\) in five regions of absolute rapidity |y|. Altogether 133 data points are available, and 24 different sources of uncertainties are associated with the data.

The model is obtained from a calculation in next-to-leading order perturbative QCD (NLO pQCD) using the program nlojet++ [97, 98] that was interfaced to the tool fastNLO [99,100,101]. The latter makes it possible to provide templates for any value of \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\) and different parameterizations of parton distribution functions of the proton (PDFs) without rerunning the calculation of the pQCD matrix elements. Multiplicative non-perturbative (NP) [102] and electroweak corrections [103] are applied to the NLO predictions.

The value of \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\) was already determined earlier from these data and in NLO accuracy in two independent analyses [104, 105]. The two analyses differ in the assumption of the statistical model, the inference algorithm, and the uncertainties considered in the fit. This provides a comprehensive testing ground to compare the results from the Linear Template Fit.

Fig. 7 Visualization of the \(\chi ^{2}\) values of the templates in Example 3. A thin line connects these values. The red line displays a fitted parabola, which cannot describe the \(\chi ^{2}\) values adequately, indicative of a nonlinear problem. The star indicates the best estimator from the Quadratic Template Fit

In Ref. [104], the value of \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\) is determined from the minimum of a \(\chi ^{2}\) parabola, where the \(\chi ^{2}\) is derived from normally distributed PDFs. In order to avoid a bias in that procedure [64, 68, 69], some uncertainty components are treated as multiplicative [104], which means that the covariance matrix is calculated from relative uncertainties that are multiplied with theory predictions. We chose predictions obtained with \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}}) =0.116\) for that purpose, and the total covariance matrix in \(\chi ^{2}\) is calculated as described in Ref. [104].

The templates are generated in the range \(0.112\le \alpha _{\mathrm{s}} (m_{\mathrm{Z}}) \le 0.121\) in steps of 0.001 using the MSTW PDF set [106]. In this methodology, the model predictions are not available for continuous values of \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\), since the PDFs also exhibit an \(\alpha _{\mathrm{s}}\) dependence but are provided only in a limited range and for discrete \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\) values [106]. Therefore, this kind of \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\) inference is a typical use case for the Linear Template Fit. The templates and the linear model are displayed for some selected bins in Fig. 8.

The results from the Linear Template Fit (Eq. (32)), the Quadratic Template Fit (Sect. 12), and the parabolic fit (Sect. 10.3) are compared with the published results from CMS in Table 1. An illustration of the \(\chi ^{2}\) values is displayed in Fig. 9 (left).

Fig. 8 Visualization of the templates, the data, and the implicit linearization of the model in some selected bins of the Linear Template Fit to the CMS inclusive jet cross sections. More details are given in the caption of Fig. 1

Table 1 Results from a Linear Template Fit of NLO pQCD predictions using the MSTW PDF set to CMS inclusive jet cross section data. The results are compared with the Quadratic Template Fit, with the analytic calculation of the minimum of the \(\chi ^{2}\) parabola, and with the published results by the CMS collaboration in the last row. Shown are the best estimators for \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\), the quadratic sum of all experimental uncertainties (exp), the propagated PDF uncertainties (PDF), the propagated NP uncertainties (NP), and, in the last column, the \(\chi ^{2}\)/\(n_{\mathrm{dof}}\) values. The difference between the Linear and the Quadratic Template Fit is only 0.000017, but appears larger due to a rounding effect. Differences from the CMS result may also be due to numerical limitations of the input data, or to how the PDF uncertainties are symmetrized. The total uncertainties that are considered in the fit sum to \(\pm 0.0018\) in all cases
Fig. 9 Illustration of \(\chi ^{2}\)/\(n_{\mathrm{dof}}\) values in the fit of NLO pQCD predictions to CMS inclusive jet cross section data. Shown are the \(\chi ^{2}\)/\(n_{\mathrm{dof}}\) value of the Linear Template Fit at its best estimator (star), the \(\chi ^{2}\)/\(n_{\mathrm{dof}}\) values of the individual templates (full circles), and the calculated \(\chi ^{2}\) parabola (see Sect. 10.3) with its minimum (open circle). Left: \(\chi ^{2}\)/\(n_{\mathrm{dof}}\) values of the Linear Template Fit with normally distributed uncertainties. Right: \(\chi ^{2}\)/\(n_{\mathrm{dof}}\) values of the Linear Template Fit with log-normally distributed uncertainties. Note that the left figure uses the PDF set MSTW for the NLO predictions with a varying \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\) value, whereas the right one uses the PDF set NNPDF3.0

Fig. 10 Left: nuisance parameters of the systematic uncertainties in the \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\) fit to CMS inclusive jet cross section data. Details on these uncertainties are found in Refs. [96, 104]. The gray value(s) are not included in the \(\chi ^{2}\) computation but are propagated separately. Middle: contributions of the individual uncertainty components to the total uncertainty of the best estimator, \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\), in the Linear Template Fit to CMS inclusive jet cross section data. The color indicates the sign of the uncertainty, which, however, is of relevance only for multivariate fits. Right: partial \(\chi ^{2}\) from the individual uncertainty components

The best estimator for \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\), its uncertainties, and the value of \(\chi ^{2}\) are in good agreement with each other and with the published CMS values [104], with differences only in the rounding digit. It is also found that the total “fit” uncertainty is reproduced, which comprises the experimental and the PDF uncertainties, \(\delta \alpha _{\mathrm{s}} (m_{\mathrm{Z}}) =\delta \alpha _{\mathrm{s}} ^\text {(exp)}\oplus \delta \alpha _{\mathrm{s}} ^\text {(PDF)}=\pm 0.0018\). The small differences in the individual uncertainty components between Ref. [104] and the Linear Template Fit are likely because the error breakdown in Ref. [104] is only approximate, whereas the Linear Template Fit provides an analytic error propagation at the value of the best estimator. In fact, all 24 uncertainty components of the data, the 20 symmetrized PDF uncertainties, and the NP uncertainties are propagated separately (cf. Sect. 8) and are displayed in Fig. 10. It is observed that the largest individual experimental uncertainty sources are the luminosity and the statistical uncertainties. The nuisance parameters are quite similar to those from (yet another) fit in Ref. [104]. The uncertainty components with the largest contributions to the total uncertainty stem from the PDFs and the luminosity uncertainty. The statistical and uncorrelated uncertainties have the largest contribution to \({\hat{\chi }}^{2}\). The NP uncertainties are not included in the nominal \(\chi ^{2}\) computation and are propagated separately. The EDM of the Linear Template Fit is \(7\cdot 10^{-5}\) and thus smaller than the rounding digit of the result.

Fig. 11 Visualization of the templates similar to Fig. 8, but using relative uncertainties (log-normally distributed PDFs) in the Linear Template Fit

In Ref. [105], as an intermediate result, the value of \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\) was determined from CMS inclusive jet cross sections as well. In contrast to Ref. [104], the estimators are assumed to follow a log-normal distribution, and some theoretical uncertainties are further considered in the \(\chi ^{2}\) expression, which was then minimized with Minuit [4]. The NNPDF3.0 PDF sets [107] are used for the NLO predictions and provide a covariance matrix with PDF uncertainties. This inference provides a useful testing ground for the Linear Template Fit with relative uncertainties (Sect. 7). Templates are generated in the range \(0.112\le \alpha _{\mathrm{s}} (m_{\mathrm{Z}}) \le 0.121\), and the same uncertainty components as in Ref. [105] are considered. Some selected bins are displayed in Fig. 11. The results from the Linear Template Fit (Eq. (41)), the Quadratic Template Fit, and the parabolic fit are compared with the published results in Table 2, and an illustration of \(\chi ^{2}\) values, which here are calculated from normally distributed relative uncertainties (Eq. (39)), is displayed in Fig. 9 (right).

Table 2 Results from a Linear Template Fit with relative uncertainties, where NLO pQCD predictions using the NNPDF3.0 PDF set are fitted to CMS inclusive jet cross section data. The results are compared with the Quadratic Template Fit, with the analytic calculation of the minimum of the \(\chi ^{2}\) parabola, and with previously published results in the last row. Shown are the best estimators for \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\), the quadratic sum of all experimental uncertainties (exp), the propagated PDF uncertainties (PDF), the propagated NP uncertainties (NP), and, in the last column, the \(\chi ^{2}\)/\(n_{\mathrm{dof}}\) values

The best estimator, the uncertainties, and the value of \(\chi ^{2}\) are in good agreement with each other and with the results from Ref. [105]. The results differ only in the rounding digit, and the different sizes of the uncertainty breakdown (into (exp), (PDF), and (NP) uncertainties) may be explained by the fact that the error breakdown in Ref. [105] is only approximate; the total uncertainty is consistently \(\delta \alpha _{\mathrm{s}} =\pm 0.0026\).

It is worth noting that due to the factorization theorem [108], both the hard coefficients and the PDFs exhibit an \(\alpha _{\mathrm{s}} \) sensitivity. Consequently, in the two approaches discussed above, not only is the value of \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\) determined, but the PDFs are determined simultaneously as well [109], which is realized through the inclusion of the PDF uncertainties in the Linear Template Fit and is commonly referred to as PDF profiling. This technique avoids a possible bias that is present when the PDFs are kept fixed [110], and it is particularly valid since the PDFs and the value of \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\) are only weakly correlated in PDF determinations, and thus their correlations are negligible [111]. The different size of the uncertainties in Tables 1 and 2 is then a consequence of how the PDFs are considered in the fit. Since the PDFs themselves are determined from data as well, the CMS ansatz also exploits, to some extent, the \(\alpha _{\mathrm{s}}\) sensitivity of these further data by using the \(\alpha _{\mathrm{s}}\)-dependent PDFs, whereas the ansatz of Ref. [105] exploits only the jet data and their sensitivity to \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\) in the hard coefficients.

Consequently, these examples represent a comprehensive application of the Linear Template Fit, where in addition to determining the value of \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\) a PDF profiling is performed, as well as a simultaneous constraint of the systematic uncertainties. Such constraints are obtained when the uncertainties in the nuisance parameters become smaller than unity. A PDF determination from these data alone, instead of a PDF profiling of an existent PDF, would be possible by considering the PDF uncertainties as unconstrained uncertainties in the Linear Template Fit, as described in Sect. 8.4. Such a fit exploits the PDF parameter space of the PDF set that is used to derive the PDF uncertainties. For the given example of CMS inclusive jet cross sections, however, this is not possible, since these data do not have sufficient constraints on the separate PDF flavors.

15 Summary and conclusions

In this article, the equations of the Linear Template Fit were presented. The Linear Template Fit provides an analytic expression for a maximum likelihood estimator in a uni- or multivariate parameter estimation problem. The underlying statistical model is constructed from normal or log-normal probability distribution functions and refers to a maximum likelihood or a minimum \(\chi ^{2}\) method. For parameter estimation with the Linear Template Fit, the model needs to be provided at a few reference values in the parameter(s) of interest: the templates. The multivariate Linear Template Fit can be written as

$$\begin{aligned} \varvec{{\hat{\alpha }}} = \left( (Y{\tilde{M}})^{{\text {T}}}WY{\tilde{M}}\right) ^{-1}(Y{\tilde{M}})^{{\text {T}}} W ( \varvec{d}-Y\varvec{{\bar{m}}})\,, \end{aligned}$$

where Y is the template matrix, \(\varvec{d}\) is the data vector (random vector), W is the inverse covariance matrix, and \(\varvec{{\bar{m}}}\) and \({\tilde{M}}\) are calculated from the reference points (Eq. (26)).

The equation of the Linear Template Fit is derived using two key arguments: (i) in the vicinity of the best estimator the model is linear, and (ii) the templates for different reference values are all generated in the same manner. The first precondition is commonly justified if the templates are provided within a reasonably small range around the (to be expected) best estimator. From the second, it follows that the bin-wise polynomial regression is unweighted, and thus the identical regression matrix is applicable in all bins. Several quantities to validate the applicability of the Linear Template Fit in a particular application were discussed.

Table 3 Summary of the notation used. Capital letters denote a matrix, and small bold letters denote a column vector. The letters i, j, k, l, and t are indices. Throughout this article, a vector notation is used and, for instance, the vector \(\varvec{d}\) denotes a set of (random) variables, so \(\varvec{d}=\{d_i\}\). The same letters are used for indices and for the maximum value of an index. A subscript index denotes the entry of a vector (or matrix), and a bracketed index denotes one vector out of several. Some letters have multiple meanings, e.g., if normally or log-normally distributed random variables are considered, or if nonlinear effects are discussed, but the meaning always becomes clear from the context. Occasionally, a single entry of a vector is denoted as a bin, as if it were a histogram entry. The dot notation, “\({\dot{\alpha }}\),” denotes the reference values and looks similar to a “point” of a grid, but it should not be mistaken for a derivative

If the linearity of the model is not justified, the Quadratic Template Fit is a suitable alternative algorithm for parameter estimation. It considers second-degree polynomials for the parameter dependence of the model and employs the quickly converging exact Newton minimization algorithm. For error propagation, the equations of the Linear Template Fit are applicable.

Both the linear and the Quadratic Template Fit implicitly transform the discrete representation of the model (the templates) into analytic continuous functions using polynomial regression. These expressions themselves may also become useful in several other applications where templates are available but continuous functions would be needed.

The equations for error propagation of different sources of uncertainties were presented. The analytic nature of the Linear Template Fit allows the straightforward propagation of each uncertainty component separately, which provides additional insights into the parameter estimation analysis.

As an example application, previously published results on the determination of the strong coupling constant \(\alpha _{\mathrm{s}} (m_{\mathrm{Z}})\) from inclusive jet cross section data taken at the LHC were repeated. The Linear and the Quadratic Template Fits reproduce these previously published results within the rounding digit.

In summary, key features of the Linear and the Quadratic Template Fit are their well-founded statistical model based on normally or log-normally distributed PDFs, simple formulae, stable results, low computational demands, full analytic error propagation, and simple applicability. It is believed that these template fits may become useful for statistical inference in a large variety of problems, such as performance-critical applications, and also in several fields outside of high-energy particle physics.