MODELING THE SWIFT BAT TRIGGER ALGORITHM WITH MACHINE LEARNING*

Philip B. Graff; Amy Y. Lien; John G. Baker; Takanori Sakamoto

doi:10.3847/0004-637X/818/1/55

1. INTRODUCTION

Long gamma-ray bursts (GRBs) are related to core-collapse supernovae from the death of massive stars. These are important for studying star formation history, particularly in the early universe where other methods become difficult. The Swift space telescope (Gehrels et al. 2004) is able to detect and localize these out to large distances and quickly downlink the data to the ground. These abilities enable prompt ground-based follow-up observations that can provide redshift measurements of the GRBs. To date, Swift has detected over 900 GRBs, of which ∼30% have redshift measurements. From these observations, one can try to infer the intrinsic GRB rate that is connected to stellar evolution over the history of the universe. Many researchers have used Swift's observations to study intrinsic GRB redshift and luminosity distributions, as well as the implications for star formation history (e.g., Guetta & Della Valle 2007; Guetta & Piran 2007; Kistler et al. 2008; Pélangeon et al. 2008; Yüksel et al. 2008; Salvaterra et al. 2009; Butler et al. 2010; Campisi et al. 2010; Qin et al. 2010; Wanderman & Piran 2010; Virgili et al. 2011; Robertson & Ellis 2012; Salvaterra et al. 2012; Coward et al. 2013; Kanaan & de Freitas Pacheco 2013; Wang 2013; Howell et al. 2014; Lien et al. 2014; Yu et al. 2015; Pescalli et al. 2015; Petrosian et al. 2015).

Several studies have suggested that the GRB rate at high redshift (z ≳ 5) is larger than the expectation based on star formation rate (SFR) measurements (e.g., Le & Dermer 2007; Yüksel et al. 2008; Kistler et al. 2009; Butler et al. 2010; Ishida et al. 2011; Jakobsson et al. 2012; Tanvir et al. 2012; Lien et al. 2014). This result could imply several possibilites, such as a larger SFR in the early universe (e.g., Kistler et al. 2009; Tanvir et al. 2012), an evolving luminosity function (e.g., Virgili et al. 2011; Pescalli et al. 2015), or a different GRB to supernova ratio (i.e., a different scenario of stellar evolution) due to a different environment in the early universe (e.g., Woosley & Heger 2012).

However, it remains difficult to constrain the GRB rate. Though Swift has observed a large population of GRBs, only some of these have measured redshifts. Even with a relatively complete redshift sub-sample, there are complicated selection effects from the complex trigger algorithm adopted by the burst alert telescope (BAT) on board Swift and the difficulty in searching through a large parameter space. It is challenging to distinguish the luminosity function and the redshift distribution using the observational data. We address some of these issues with a machine learning (ML) approach to produce a fast, but reliable, treatment of Swift's instrumental selection effects, thereby enabling a robust Bayesian treatment of population model analysis.

ML is a field of research that involves designing algorithms (MLAs) for building models that learn from generic data. The models are fit to a set of training data in order to make predictions or decisions. Often, the original training data come from actual observations or simulations of a complex process. The models trained by MLAs can be evaluated very quickly for any new example after a one-time cost of training the model.

In this study, we look to aid the analysis of GRB data by using MLAs to train models that emulate the Swift trigger algorithm. Our training data comes from simulations of GRB populations computed by Lien et al. (2014).

The structure of this paper is as follows. In Section 2, we describe the aspects of Swift and its model for triggering on incident GRBs that are relevant to GRB population inferences. Then, in Section 3, we describe the ML algorithms used and compared in this study. Section 4 presents the results of training the different ML models on the training data from the Swift pipeline. We apply a trained ML model for accelerating Bayesian inference with faster likelihoods in Section 5, fitting the parameters of the intrinsic GRB rate distribution. Section 6 compares our study to previous work estimating the intrinsic distributions of long GRBs with Swift observations. Finally, in Sections 7 and 8, we summarize and propose future projects to follow-up.

2. THE Swift DETECTION ALGORITHM

BAT on board Swift adopts over 500 rate trigger criteria based on the photon count rate in the raw light curve. Moreover, the burst needs to pass the "image threshold" determined by the signal-to-noise ratio estimated from an image generated on board, based on the duration found by the rate trigger criteria. Each rate trigger criterion uses a different energy band, a different part of the detector plane, and different foreground and background durations for calculating the signal-to-noise ratio. In addition to the rate trigger algorithm, BAT also generates an image about every minute to search for bursts that are missed by the rate trigger method (which is the so-called "image trigger"; Fenimore et al. 2003, 2004; McLean et al. 2004; Palmer et al. 2004; Barthelmy et al. 2005).

This complex trigger algorithm successfully increases the number of GRB detections. However, it also increases the difficulty of estimating the detection theshold, which is crucial for probing many intrinsic GRB properties from the observations. To address this problem, Lien et al. (2014) developed a code that simulates the BAT trigger algorithm, and used it to study the instrinsic GRB rate and luminosity function. This "trigger simulator" follows the same trigger algorithm and criteria for the rate trigger as those adopted by BAT and mimics the image threshold and image trigger (see Lien et al. 2014 for detailed descriptions). Although the trigger simulator can be used to address the complex detection thresholds of BAT, it takes ∼10 s to a few minutes to simulate the trigger status of a burst using a common PC with the 2.7 GHz Intel Core processor (the speed mainly depends on the number of bins in the light curve). Therefore, it is computationally intensive to perform a large number of simulations to cover a wide parameter space. This is where ML is able to accelerate our analysis.

3. MACHINE LEARNING ALGORITHMS

To generate a fast emulator for the Swift trigger simulator, we consider a variety of supervised learning algorithms, where the goal is to infer a function from labeled training data. Each example consists of input properties, which are used to predict the output label.

Here we briefly describe each of the ML algorithms used in this study. We denote the set of input features by ${\boldsymbol{x}}$ and the ML model's predicted output is given by $y({\boldsymbol{x}})$ . The inputs are a set of 15 parameters describing the GRB and detector as detailed in Table 1. Depending on the MLA, the output may be a discrete label, e.g., {0, 1}, or it may be a continuous probability in [0, 1] and is designed to be the probability that a GRB, as specified by the features in ${\boldsymbol{x}}$ , is detected by Swift's BAT. The true output is given by ${\boldsymbol{t}}$ and is zero for a non-detection and one for a detection.

Table 1. Parameters Describing Each Simulated GRB

Parameter	Description
${\mathrm{log}}_{10}(L)$	luminosity of the GRB
z	redshift
r	distance from center of detector grid of peak
ϕ	azimuthal angle in detector grid of peak
bin_size_emit	source time bin size
α	band function parameter
β	band function parameter
${\mathrm{log}}_{10}({E}_{\mathrm{peak}})$	peak of the energy spectrum of the GRB
bgd_15–25 keV	background count rate in 15–25 keV band
bgd_15–50 keV	background count rate in 15–50 keV band
bgd_25–100 keV	background count rate in 25–100 keV band
bgd_50–350 keV	background count rate in 50–350 keV band
θ	incoming angle of GRB
${\mathrm{log}}_{10}({\rm{\Phi }})$	incident flux of GRB
ndet	number of active detector pixels (constant)
trigger_index	0 for non-detections and 1 for detections

Note. There are 15 inputs and the output class label. See Equation (4) in Lien et al. (2014) for details of α, β, and $\mathrm{log}({E}_{\mathrm{peak}})$ .

Download table as: ASCII Typeset image

3.1. Random Forests and AdaBoost

Random forests and AdaBoost both involve creating ensembles of decision trees, so we first introduce these as an ML model. In a decision tree, binary splits are performed on the training data input features, the dimensions of ${\boldsymbol{x}}$ . In training a tree on data, a series of splits are made that choose a dimension and a threshold that optimize some criterion. Examples of this criterion are the accuracy of the resulting classifications (maximize or equivalently minimize errors); the entropy of the resulting branches (minimize) given by

$\begin{eqnarray}&&E=-\displaystyle \sum _{i=\{0,\;1\}}{p}_{i}\mathrm{log}({p}_{i}),\end{eqnarray} \tag{ 1 }$

where p_i is the fraction of samples with each label in a branch; or the Gini impurity (minimize) which is given by

$\begin{eqnarray}&&G=1-\displaystyle \sum _{i=\{0,\;1\}}{f}_{i}^{2},\end{eqnarray} \tag{ 2 }$

where f_i is the fraction of correctly classified samples labeled with the value i. This measure aims to make each sub-set resulting from a branch as "pure" as possible in the class labels of its members. The choice of criterion will make only minor differences to the final tree built and we use the Gini impurity here because it is the default setting for the software package used.

Each split creates a pair of "branches," one with each class label. These splits are made until a stopping condition is reached (e.g., the samples are all of uniform class, a maximum number of splits has been reached, or the number of samples left to split among has fallen below a minimum value). This branch now becomes a "leaf" that assigns a class to all samples ending there. When a new event of unknown class is put into the tree, the tree will pass it through the learned splits/branches until it reaches a leaf, at which point it will be labeled according to the label of the leaf. An example tree fit to this data (with a hard limit of three in depth) is shown in Figure 1; it has a classification accuracy of 93.9% on the training data to which it was fit. Trees fit in the later models will be much larger and thus more accurate.

**Figure 1.** Decision tree fit to the *Swift* training data with a maximum depth of three. An accuracy of 93.9% is achieved. Each box shows the parameter chosen for branching and the threshold used, as well as the total number of samples used to make that decision. The Gini factor shown is that from Equation (2) for the subset at that location. At the leaves, the number of items with class 0 and class 1 are shown; the tree can assign class probabilities based on this split at the leaf that any new sample arrives at after following the branches down.
Download figure:
Standard image High-resolution image

3.1.1. Random Forests (RFs)

RFs (Breiman 2001) improve upon classical decision trees by training an ensemble of trees that vote on the final classification. A "strong learner" (the RF) is created from an ensemble of "weak learners" (decision trees). In an RF, many decision trees are trained on the data—often hundreds. To obtain many different trees, at each split in a tree, a random subset of the dimensions of ${\boldsymbol{x}}$ are chosen and the optimal binary split to be made out of these dimensions is made. Furthermore, each tree is trained on a bootstrap sample of the data; the original K points are sampled with replacement to form a new set of K points that may contain repeats. An RF thus guards against overfitting to the training data and potentially poorly performing individual trees. A single tree can provide a probabilistic classification, ${y}_{\mathrm{DT}}({\boldsymbol{x}})\in [0,\;1]$ , and combining many allows us to obtain a near-continuous probability, ${y}_{\mathrm{RF}}({\boldsymbol{x}})\in [0,\;1]$ by using

$\begin{eqnarray}&&{y}_{\mathrm{RF}}({\boldsymbol{x}})=\displaystyle \frac{1}{N}\displaystyle \sum _{n=1}^{N}{y}_{\mathrm{DT},n}({\boldsymbol{x}}),\end{eqnarray} \tag{ 3 }$

with N being the number of trees in the forest. This value we obtain as ${y}_{\mathrm{RF}}({\boldsymbol{x}})$ is simply the probability that the GRB described by ${\boldsymbol{x}}$ is detected by Swift.

We use the implementation of RFs in the scikit-learn ⁴ Python library (Pedregosa et al. 2011).

3.1.2. AdaBoost

AdaBoost is short for "Adaptive Boosting," a meta-algorithm for ML (Freund & Schapire 1997). It creates a single strong learner from an ensemble of weak learners, much like RFs. However, in the boosting framework, the decision trees are trained iteratively and when added together (as in Equation (3)) are weighted, typically based on their accuracy. Additionally, unlike for RFs, the training examples will not be all equally weighted when evaluating the accuracy. After an individual decision tree is added to the ensemble, the training data is reweighed so that examples that are misclassified increase in weighting and those classified correctly decrease in weighting. Therefore, future decision trees will attempt to better fit examples previously misclassified. In this way, the overall ensemble prediction may become more accurate.

Boosting may be applied to any ML algorithm, but in this work we apply it only to the decision tree weak learner (the other classifiers qualify as strong learners on their own and would thus likely not benefit significantly from boosting). We use the implementation of AdaBoost for decision tree classifiers in scikit-learn. We note that the predicted probability, ${y}_{\mathrm{AB}}({\boldsymbol{x}})\in [0,\;1]$ , is approximately continuous, similarly to ${y}_{\mathrm{RF}}({\boldsymbol{x}})$ .

3.2. Support Vector Machines (SVMs)

SVMs (Cortes & Vapnik 1995) are a tool for binary classification that finds the optimal hyper-plane for separating the two classes of training samples. Events are classified by which side of the hyper-plane they fall on. The hyper-plane that maximizes the separation from points in either class will (in general) have minimal generalization error for new data points.

In a linear SVM, we label the two classes with ${t}_{i}\in \{-1,\;1\}$ corresponding to an undetected GRB and a detected GRB, respectively. A hyper-plane separating the two classes will satisfy ${\boldsymbol{w}}\cdot {\boldsymbol{x}}-b=0$ , where ${\boldsymbol{w}}$ and b must be found by training on the data $\{{\boldsymbol{x}}\}$ . If the classes are separable, we can place two parallel hyper-planes that separate the points and have no points between them in the "margin." This can be seen for a toy example in Figure 2. We describe these hyper-planes mathematically as

$\begin{eqnarray}&&{\boldsymbol{w}}\cdot {\boldsymbol{x}}-b=\pm 1.\end{eqnarray} \tag{ 4 }$

Examples will lie on either side of the two planes such that

$\begin{eqnarray}&&{t}_{i}({\boldsymbol{w}}\cdot {{\boldsymbol{x}}}_{i}-b)\geqslant 1\end{eqnarray} \tag{ 5 }$

for all samples, ${{\boldsymbol{x}}}_{i}$ . Because the samples are typically not separable, we introduce slack variables ${\xi }_{i}\geqslant 0$ that measure the misclassification of ${{\boldsymbol{x}}}_{i}$ by setting

$\begin{eqnarray}&&{t}_{i}({\boldsymbol{w}}\cdot {{\boldsymbol{x}}}_{i}-b)\geqslant 1-{\xi }_{i}.\end{eqnarray} \tag{ 6 }$

We then seek to minimize

$\begin{eqnarray}&&{\rm{Cost}}({\boldsymbol{w}},{\boldsymbol{\xi }},b)=\displaystyle \frac{1}{2}\parallel {\boldsymbol{w}}{\parallel }^{2}+C\displaystyle \sum _{i}{\xi }_{i}\end{eqnarray} \tag{ 7 }$

subject to the constraint in Equation (6). The C parameter is a penalty factor for misclassification and this optimization will face the trade-off between a smaller margin and smaller misclassification error. The cost function seeks to maximize the distance between the two hyper-planes at the margin edges, which is given by $2/| | {\boldsymbol{w}}| |$ . This separation by hyper-plane is demonstrated for a toy example in Figure 2.

**Figure 2.** Maximum separating hyper-plane and the margin hyper-planes for a toy data set. The "support vectors" are the highlighted points along the margin hyper-planes. Image courtesy of Wikimedia Commons (Wikimedia Commons 2008).
Download figure:
Standard image High-resolution image

The two classes of points are generally not easily separated in the original parameter space of the problem. Therefore, we map the points into a higher-dimensional space where they may be more easily separated. To make this a computationally tractable problem, we consider mappings such that the dot product between pairs of points may be easily computed in terms of the original variables by a kernel function, $k({{\boldsymbol{x}}}_{i},{{\boldsymbol{x}}}_{j})$ . Hyper-planes in the higher-dimensional space are defined as surfaces on which the kernel is constant. If the kernel is defined such that $k({{\boldsymbol{x}}}_{i},{{\boldsymbol{x}}}_{j})$ decreases as the points ${{\boldsymbol{x}}}_{i}$ and ${{\boldsymbol{x}}}_{j}$ move away from one another, then the kernel is a measure of closeness. Thus, the sum of many kernels like this can be used to measure the proximity of a sample data point to data points in the two classes; this distance can then be used to classify the point into one class or the other. This mapping can result in a very convoluted hyper-plane separating the two sets of points—this can accurately model the true classification boundary, but we must be careful not to overfit this to the training data.

In order to perform a nonlinear separation, we employ a Gaussian kernel function (a.k.a., a radial basis function),

$\begin{eqnarray}&&k({{\boldsymbol{x}}}_{i},{{\boldsymbol{x}}}_{j})=\mathrm{exp}(-\gamma \parallel {{\boldsymbol{x}}}_{i}-{{\boldsymbol{x}}}_{j}{\parallel }^{2}),\end{eqnarray} \tag{ 8 }$

where γ is a tunable parameter reflecting the width of the Gaussian.

A point is classified by which side of the learned hyper-plane it falls on, as determined (in our notation) by

$\begin{eqnarray}&&y({\boldsymbol{x}})=\mathrm{sign}(K({\boldsymbol{w}},{\boldsymbol{x}})-b),\end{eqnarray} \tag{ 9 }$

where K is the aggregate kernel function that is a linear combination of the individual kernel functions (closeness to each of the other points). Minimizing the cost given by Equation (7) under the constraint of Equation (6) can be solved as a quadratic programming problem with the solution locally independent of all but a few data points, the "support vectors" of the model. These will be those samples closest to, or on, the margin in both classes and a weighted sum of distances from them will determine which class a new sample is in. Points that are not support vectors will have small or zero weight in the aggregate kernel function.

In this study, we use the implementation of SVMs in scikit-learn. A radial basis function is chosen and we perform five-fold cross-validation to optimize the hyper-parameters of the model, γ and C. The model is also trained to allow for the prediction of continuous class probabilities, ${y}_{\mathrm{SVM}}\in [0,\;1].$ ⁵

3.3. Artificial Neural Networks

Artificial neural networks are an ML method that is inspired by the function of a brain. A neural network (NN) consists of interconnected nodes, each of which processes information that it receives and passes this product on to other nodes via weighted connections. In a feed-forward NN, these nodes are organized into layers that pass information uniformly in a certain direction. The input layer passes information to an output layer via zero, one, or many "hidden" layers in between. Each node in the network performs a simple function, but their combined activity can model complex relationships. A useful introduction to NNs as well as their training and use can be found in MacKay (2003).

A single node takes an input vector of activations ${\boldsymbol{a}}\in {{\mathfrak{R}}}^{N}$ and maps it to a scalar output $f({\boldsymbol{a}};{\boldsymbol{w}},b)$ through

$\begin{eqnarray}&&f({\boldsymbol{a}};{\boldsymbol{w}},b)=g\left(b+\displaystyle \sum _{i=1}^{N}{w}_{i}{a}_{i}\right),\end{eqnarray} \tag{ 10 }$

where ${\boldsymbol{w}}$ and b are the parameters of the node, called the "weights" and "bias," respectively. The function, g, is the activation function of the node; we use the sigmoid, linear, and rectified linear activation functions in this work,

$\begin{eqnarray}g(z)=\left\{\begin{array}{lc}{(1+{e}^{-z})}^{-1} & {\rm{sigmoid}}\\ z & {\rm{linear}}\\ \mathrm{max}\{0,z\} & {\rm{rectified}}\;{\rm{linear}}.\end{array}\right.\end{eqnarray} \tag{ 11 }$

The sigmoid and rectified linear activations are used for hidden layer nodes and the linear activation is used for the output layer nodes to obtain values in $(-\infty ,\infty )$ . This is then converted into a probability by the softmax transform given by

$\begin{eqnarray}&&{y}_{j}({\boldsymbol{x}};{\boldsymbol{w}},b)\to \displaystyle \frac{\mathrm{exp}({y}_{j}({\boldsymbol{x}};{\boldsymbol{w}},b))}{{\displaystyle \sum }_{l=\{0,1\}}\mathrm{exp}({y}_{l}({\boldsymbol{x}};{\boldsymbol{w}},b))},\end{eqnarray} \tag{ 12 }$

where j indexes over the output nodes. After the softmax, all output values are in (0, 1) and sum to one. We show here the case where there are only two output nodes for a binary classification problem; these values are degenerate, but the setup of one output node per class generalizes to the multi-class problem.

The weights and biases of all nodes in the network are the parameters that must be optimized with respect to the training data. The number of input nodes is the number of features given by the data. The two output nodes are the values of the probabilities that the input GRB features would result in detection or non-detection. In this work, we will take ${y}_{\mathrm{NN}}({\boldsymbol{x}})$ to be the continuous probability given in the output node for the "detection" class. Thus, the output is the predicted probability that the given input GRB features correspond to a detected GRB.

The optimization algorithm seeks to minimize the cross-entropy of the predicted probabilities, given by

$\begin{eqnarray}&&{\rm{Cost}}({\boldsymbol{p}})=-\displaystyle \sum _{i}\displaystyle \sum _{k=\{0,\;1\}}{t}_{i,k}\mathrm{log}{y}_{k}({{\boldsymbol{x}}}_{i})\end{eqnarray} \tag{ 13 }$

where ${\boldsymbol{p}}$ is a parameter vector containing all of the weights and biases of the nodes in the NN. The index i is over all data samples in the training set and the index k is over the two output nodes corresponding to the non-detection and detection classes, respectively. t = {1, 0} for a non-detection and t = {0, 1} for a detection. This cost function pushes predicted probabilities toward their correct values with large penalties for incorrect predictions and is based on information theory. We take the value from the output node corresponding to the detection class as the probability that the input GRB, ${\boldsymbol{x}}$ , is detected by Swift, ${y}_{{\rm{NN}}}({\boldsymbol{x}})$ .

We use the SkyNet ⁶ algorithm (Graff et al. 2014) for training of the NN and refer the reader to that paper for more information on NNs, including the optimization function used, how the optimization is performed, and additional data processing that is performed. SkyNet provides an easy-to-use interface for training as well as an algorithm that will efficiently and consistently find the best-fit NN parameters for the training data provided.

3.4. Heuristics Used

For each model's optimal settings, we compute the accuracy of predictions using a naïve probability threshold of 0.5 for the output probability for the detection class; i.e., ${y}_{m}({\boldsymbol{x}})$ for the different models, m. This is later found to be close to optimal. We also plot the receiver operating characteristic (ROC) curves for the classifiers as seen in Figure 9. An ROC curve plots the true positive rate (a.k.a., recall) against the false positive rate. The F1-score is a useful metric for finding the optimal probability threshold to balance type I (false positive) and type II (false negative) errors. The F1-score takes values in [0, 1] and is maximized at the optimal probability threshold. These values are given by

$\begin{eqnarray*}\mathrm{TP} & = & {\rm{\#}}\;{\rm{positives}}\;{\rm{correctly}}\;{\rm{labeled}}\\ \mathrm{TN} & = & {\rm{\#}}\;{\rm{of}}\;{\rm{negatives}}\;{\rm{correctly}}\;{\rm{labeled}}\\ \mathrm{FP} & = & {\rm{\#}}\;{\rm{of}}\;{\rm{negatives}}\;{\rm{labeled}}\;{\rm{as}}\;{\rm{positive}}\\ \mathrm{FN} & = & {\rm{\#}}\;{\rm{of}}\;{\rm{positives}}\;{\rm{labeled}}\;{\rm{as}}\;{\rm{negative}}\end{eqnarray*}$

$\begin{eqnarray*}\mathrm{TPR} & = & \displaystyle \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}=\mathrm{recall}\\ \mathrm{FPR} & = & \displaystyle \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}\\ \mathrm{precision} & = & \displaystyle \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}\\ {\rm{F}}1-\mathrm{score} & = & 2\;\displaystyle \frac{\mathrm{precision}\times \mathrm{recall}}{\mathrm{precision}+\mathrm{recall}},\end{eqnarray*}$

where positives are detections and negatives are non-detections of GRBs. For a random classifier, the ROC will be a diagonal line from (0, 0) to (1, 1). Better classifiers will be above and to the left of this line. A common measure is the area under the curve (AUC), which is the integrated area under the ROC. Values closer to one indicate better classifiers.

In this study, we will use the ROC (with AUC) to find the MLA that best models the Swift pipeline. We can then use the F1-score to identify the best probability threshold for declaring a detection from the predictions of the model.

3.5. Cross-validation

To measure the performance for each type of MLA, we perform hyper-parameter optimization over a range of settings for each. To properly compare these settings against each other, we perform cross-validation. In this setup, with five folds, we split the data into five random subsets of equal size. In training, we train five models for each setting, using four of the five sets during fitting and then evaluating the model on the left-out set. Thus we make predictions on the entire set, but without having trained the model on the data it was predicting (this would lead to overfitting).

Once the optimal model settings are found for each MLA, the entire data set is used to re-train a model with those values. This model is then evaluated on the left-out validation data set so as to compare it with the other MLAs. This latter test is much more stringent because the evaluation data is from different populations than what was used in training. This better reflects how the ML model will be used in practice and is used to pick an MLA and model fit for use in Bayesian parameter estimation (Section 5).

4. MACHINE LEARNING RESULTS

In this section, we present the details of the MLA model fitting we performed. We describe the data set used for training and validation followed by results from hyper-parameter optimization searches performed for each classifier. The hyper-parameter optimization uses only the training data and evaluates different settings with cross-validation as described in Section 3.5. Once we obtain optimal settings for each MLA, we evaluate the models on a validation data set (separate from the training data) for final performance measurement and comparison.

4.1. Training Data Used

The data used in this analysis was generated by simulations of the Swift pipeline—as described in Section 2—for different settings of the GRB redshift and luminosity distribution functions (Equations (2) and (3) in Lien et al. 2014 and reproduced below).

$\begin{eqnarray}{R}_{\mathrm{GRB}}(z)={n}_{0}\left\{\begin{array}{lc}{(1+z)}^{{n}_{1}} & z\leqslant {z}_{1}\\ {(1+{z}_{1})}^{{n}_{1}-{n}_{2}}{(1+z)}^{{n}_{2}} & z\gt {z}_{1}\end{array}\right.\end{eqnarray} \tag{ 14 }$

$\begin{eqnarray}\phi (L)=\displaystyle \frac{{dN}}{{dL}}=\left\{\begin{array}{lc}{\left(\displaystyle \frac{L}{{L}_{\star }}\right)}^{x} & L\leqslant {L}_{\star }\\ {\left(\displaystyle \frac{L}{{L}_{\star }}\right)}^{y} & L\gt {L}_{\star }\end{array}\right.\end{eqnarray} \tag{ 15 }$

R_GRB(z) is the co-moving GRB rate, with units of ${\mathrm{Gpc}}^{-3}\;{\mathrm{yr}}^{-1}$ . In these data sets, the luminosity distribution function was held constant with x = −0.65, y = −3.00, and ${L}_{\star }={10}^{52.05}\;\mathrm{erg}\;{{\rm{s}}}^{-1}$ . Additionally, the break in the redshift distribution was also held constant at z₁ = 3.60. Therefore, we only varied values of n₁ and n₂ (n₀ is ignored for the purpose of generating training data as it is only a normalization parameter). In total, 38 data sets are combined for use in training. These data sets were originally generated for Lien et al. (2014) and do not cover the space systematically. We use 34 of the 38 data sets for training models, including optimization of hyper-parameters; each of these contains ∼4000 samples. The final four, which contain ∼10,000 samples each, are set aside for evaluating the final model from each MLA as the validation data. The distribution of parameters for each of these data sets is shown in Figure 3.

**Figure 3.** Values of the parameters for the redshift distribution function for sample GRB populations used to train ML models. Blue stars were used in training and optimization and red circles were used for final evaluation.
Download figure:
Standard image High-resolution image

We used this data for training as it was generated around the best-fit values from Lien et al. (2014) for the real Swift GRB redshift measurements of Fynbo et al. (2009). In the end, our goal is to fit the GRB rate model to these same observations.

A total of 15 parameters are taken from each simulated GRB in order to determine whether or not the GRB was detected by Swift. These are summarized in Table 1. These are used for classification of GRBs by MLAs. The target value is given by the trigger_index, which is zero for GRBs that are not detected by the Swift algorithm and one for those that are detected.

A pairwise plot of a few of the most significant parameters in determining detection is shown in Figure 4. Lighter points are GRBs that are detected by Swift in the trigger simulator (Lien et al. 2014) while darker ones are undetected GRBs. This plot shows a random subset of 5000 points from the entire training data set. From this figure, we see that a few of the parameters— ${\mathrm{log}}_{10}(L),{\mathrm{log}}_{10}({\rm{\Phi }})$ , and ${\mathrm{log}}_{10}({E}_{\mathrm{peak}})$ —are strongly correlated. We did not perform pruning of these strongly correlated features. Other packages, such as CERN's TMVA (Toolkit for Multivariate Data Analysis with ROOT⁷ ), provide automated tools for pruning that have been shown in some cases to speed up training and/or improve results (see Mingers 1989; Blum & Langley 1997; Kearns & Mansour 1998). An analysis of the impact of pruning may be performed in future work.

To determine how much training data is required, we evaluated the learning curve for the RF classifier. This plots the prediction accuracies, computed using five-fold cross-validation, as a function of the size of the training data. The "training data" in this case is the four-fifths of the data used for fitting the model and the "test data" is the one-fifth left out for evaluation. The learning curve was done after finding the optimal RF settings in Section 4.2 as a check. We thus examined if use of the entire data set benefits model fitting significantly. The data was randomly shuffled before performing this test.

The resulting learning curve is shown in Figure 5. For small sample sizes, there is overfitting of the training data that begins to flatten out by 3 × 10⁴ samples. The accuracy of the test set continues to increase as we add more data points, meaning that more data improves the generalizability of the model. Therefore, in all subsequent training, we will use the entire data set for fitting a model; using fewer points would increase the bias of subsequent predictions.

4.2. Random Forest

The RF model was optimized for combinations of the $\mathrm{min}\_\mathrm{samples}\_\mathrm{split}$ and $\mathrm{max}\_\mathrm{features}$ parameters. These govern the minimum number of samples needed to perform a branching split and the number of features considered at each split, respectively. Choices for each are as follows:⁸

$\begin{eqnarray*}\mathrm{min}\_\mathrm{samples}\_\mathrm{split} & \in & \{2,4,8,16,32,64\},\\ \mathrm{max}\_\mathrm{features} & \in & \{3,4,5,6,7,8,9,10,11,12,13,14\}.\end{eqnarray*}$

This choice of model hyper-parameters to optimize is package dependent and we could have also chosen to optimize the number of trees in the forest or the criterion for determining branching splits. We chose, however, to limit ourselves to two parameters so that a grid search would be possible. Different software packages may offer different strategies for hyper-parameter optimization.

Forests were trained with 500 trees, using the Gini impurity for deciding the optimal split at each branching point and with no limit on the number of branches before reaching a leaf. The five-fold cross-validation evaluates the test accuracy for each pairwise combination of the parameters; the set with the highest test accuracy is the optimal model. The optimal parameters found were $\mathrm{min}\_\mathrm{samples}\_\mathrm{split}$ = 4 and $\mathrm{max}\_\mathrm{features}$ = 5; however, it can be seen in Figure 6 that there is very little variation in accuracy with regard to the value of $\mathrm{max}\_\mathrm{features}$ . The minimum number of samples required to make a split is the dominant factor for improving the accuracy, where smaller values that naturally fine-tune the model further obtain better accuracy on the test set as well. The overall range in test set accuracy is not large and the worst model hyper-parameters still achieve an accuracy of >98%. The preference for lower values in $\mathrm{max}\_\mathrm{features}$ can be understood as increasing variability between trees in the forest and thus minimizing overfitting.

**Figure 6.** Test set accuracy for random forest classifier hyper-parameters. The optimal value is (4, 5). It is clear that $\mathrm{min}\_\mathrm{samples}\_\mathrm{split}$ is a much stronger influence on overfitting to the training data.
Download figure:
Standard image High-resolution image

**Figure 6.** Test set accuracy for random forest classifier hyper-parameters. The optimal value is (4, 5). It is clear that $\mathrm{min}\_\mathrm{samples}\_\mathrm{split}$ is a much stronger influence on overfitting to the training data.
Download figure:
Standard image High-resolution image

**Figure 7.** Test set accuracy for AdaBoost classifier hyper-parameters. The optimal value is (100, 0.001). All options are very close to each other, ranging only from 99.01% to 99.05% in accuracy.
Download figure:
Standard image High-resolution image

**Figure 8.** Test set accuracy for support vector machine classifier hyper-parameters. The optimal value is (C, γ) = (10^2.25, 1).
Download figure:
Standard image High-resolution image

Using the optimal model, we perform predictions on the validation data set. With a naïve threshold of 0.5 on the output probability for the detection class for declaring a GRB detected, these predictions have an accuracy of 97.5%. This is lower than the test accuracy obtained earlier as the test samples were from the same distribution as the training ones while this validation data presents new distributions. The ROC for this classifier is shown with the others in Figure 9 and has an AUC = 0.9935.

**Figure 9.** Receiver operating characteristic (ROC) curves for the classifiers. A dot is placed at the values for the optimal probability threshold found for each classifier. The ROC curve of a random classifier is shown in a dashed red line. A logarithmic scale for the x-axis is used to display the differences in the ROC curves.
Download figure:
Standard image High-resolution image

Analysis of the F1-score found no significant difference between the optimal probability threshold and the naïve threshold of 0.5.

4.3. AdaBoost

The AdaBoost model was optimized for combinations of the ${\rm{n}}\_\mathrm{estimators}$ and $\mathrm{learning}\_\mathrm{rate}$ parameters. The former describes the number of "weak learners" (decision trees) fit in each ensemble model and the latter describes the rate for adjusting the weighting of the weak learners as each is added to the ensemble. Settings for the individual decision trees were chosen to match those found as optimal for the RF classifier, with $\mathrm{min}\_\mathrm{samples}\_\mathrm{split}$ = 4 and $\mathrm{max}\_\mathrm{features}$ = 5. Choices for each were as follows:⁹

$\begin{eqnarray*}&&{\rm{n}}\_\mathrm{estimators}\in \{100,200,300,400,500\}\\ &&{\mathrm{log}}_{10}(\mathrm{learning}\_\mathrm{rate})\in \{-3,-2.5,-2,-1.5,-1,-0.5,0\}.\end{eqnarray*}$

The five-fold cross-validation found that the optimal parameters are (100, 0.001). However, the range in test set accuracies is extremely small, varying only between 99.01% and 99.05%, as can be seen in Figure 7. Therefore, any of these models would be nearly equally accurate.

Using the optimal model, we perform predictions on the validation data set. With a naïve threshold of 0.5 on the output probability for the detection class for declaring a GRB detected, these predictions have an accuracy of 97.4%. The ROC for this classifier is shown with the others in Figure 9 and has an AUC = 0.9921. Analysis of the F1-score found no significant difference between the optimal probability threshold and the naïve threshold of 0.5.

4.4. Support Vector Machines

The SVM model was trained using a Gaussian (radial basis function) kernel, as described in Equation (8). The input data values were all scaled to have zero mean and unit variance, so as to prevent undue bias in the kernel's distance measure. As errors in the predictions are allowed, there are thus two hyper-parameters to optimize, the penalty factor for errors, C, and the tunable parameter for the width of the Gaussian, γ. Choices examined for these were (after first searching over a larger grid with coarser spacing):

$\begin{eqnarray*}{\mathrm{log}}_{10}(C) & \in & \{1.25,1.5,1.75,2,2.25,2.5,2.75,3\}\\ {\mathrm{log}}_{10}(\gamma ) & \in & \{-1,-0.75,-0.5,-0.25,0,0.25,0.5,0.75,1\}.\end{eqnarray*}$

Five-fold cross-validation found optimal parameters of (C, γ) = (10^2.25, 1) with a test set accuracy of 99.0%. For smaller values of C, there is a much more limited range in γ that gives comparable results, if at all, as shown in Figure 8.

Using the optimal model and a 0.5 probability threshold for classification as a detection, the SVM has a prediction accuracy of 94.5%. The ROC for this classifier is shown in Figure 9 and has an AUC = 0.9348. From all of these measures, it is clear that the SVM model, in this scenario, does not generalize as well as the decision tree ensemble methods (RF and AdaBoost).

4.5. Neural Networks

Using SkyNet, we trained several NN architectures using either the sigmoid or rectified linear activation function for the hidden layer nodes. Training NNs is much more computationally expensive than any of the other models, despite the efficiencies in the training algorithm. Therefore, the size of our NN models (in both number and width of hidden layers) as well as the time spent training them, is limited. For each architecture,¹⁰ we employed five-fold cross-validation in order to asses its performance. We report in Table 2 the test set accuracies for each of the networks trained. They are all similar and are getting close to the 99% achieved by the previous MLAs. It is possible that more complex networks would achieve this level of accuracy.

Table 2. Test Set Accuracy from Five-fold Cross-validation from the Training of Neural Networks with SkyNet

Hidden Layers	Activation	Test Accuracy
25	sigmoid	97.89
25	rectified	97.25
50	sigmoid	98.33
50	rectified	97.57
100	sigmoid	98.47
100	rectified	98.00
1000	sigmoid	97.49
1000	rectified	98.28
25 + 25	sigmoid	98.33
25 + 25	rectified	97.65
50 + 50	sigmoid	98.73
50 + 50	rectified	98.16
100 + 30	sigmoid	98.47
100 + 30	rectified	98.27
100 + 50	sigmoid	98.64
100 + 50	rectified	98.41
100 + 100	sigmoid	97.95
100 + 100	rectified	98.35

Note. The activation functions are given in Equation (11).

Download table as: ASCII Typeset image

Due to the constraints on training, we consider the NN architecture with highest average test accuracy, considering both activation functions: the 100 + 50 architecture of hidden layers. This is retrained on the entire data set with both activation functions, and we find that the optimal model has hidden layers of 100 + 50 with the rectified linear unit activation function. We use this NN to make predictions on the validation data set with a naïve probability threshold of 0.5. This yields an accuracy of 96.9%. The ROC curve for this NN is shown in Figure 9 and has an AUC = 0.989. Analysis of the F1-score found no significant difference between the optimal probability threshold and the naïve threshold of 0.5.

4.6. Summary of Results

Here we summarize the results for the optimal model returned by each MLA. The accuracy, AUC, and optimal F1-score are all reported in Table 3. We also include in this comparison the use of a constant cut in GRB flux; GRBs with flux greater than a threshold value will be labeled as detected and those with lower flux are non-detections. Varying this flux threshold produces an ROC and we find an optimal cut at ${\mathrm{log}}_{10}({\rm{\Phi }})=-7.243\;\mathrm{erg}\;{{\rm{s}}}^{-1}\;{\mathrm{cm}}^{-2}$ (based on the F1-score) for which we measure the accuracy. It is clear that all ML classifiers except SVMs significantly out perform a flux threshold; SVMs still out perform a flux cut at optimal settings.

Table 3. Results for Measuring the Performance of the Classifiers Trained in This Study

Classifier	Threshold	Accuracy	AUC	F1-score
Random Forest	0.449	0.975	0.994	0.912
AdaBoost	0.362	0.975	0.992	0.910
Neural Net	0.459	0.969	0.989	0.890
SVM	0.028	0.947	0.935	0.824
Flux	−7.243	0.896	0.945	0.663

Note. The accuracy on the validation data, the area under the ROC curve, and the optimal F1-score are reported. The threshold values are probabilities for all models except the flux cut, which uses the ${\mathrm{log}}_{10}({\rm{\Phi }})$ .

Download table as: ASCII Typeset image

From this analysis, we see that the RF and AdaBoost classifiers performed the best in the classification task. NNs were very close behind, with SVMs performing the worst among the MLAs.¹¹

5. USE OF ACCELERATED PIPELINE FOR BAYESIAN INFERENCE

Here we demonstrate the use of the trained ML models in accelerating Bayesian inference, namely fitting the intrinsic redshift distribution of GRBs. We do so with best-fit RF, AdaBoost, and SkyNet NN models, these being the most accurate.

5.1. Likelihood Function

We first consider how we will evaluate the fit of a model to a set of GRB redshift observations, a.k.a., "the data." If we bin the observations to obtain a redshift density, then in each bin (with central redshift z_i) there will be an observed number of GRBs, N_obs(z_i). There is also an expected number of intrinsic GRBs occurring in Swift's field of view during the observation time in each redshift bin given by

$\begin{eqnarray}&&{N}_{{\rm{int}}}({z}_{i})=\displaystyle \frac{4\pi }{6}{\rm{\Delta }}{t}_{{\rm{obs}}}{R}_{{\rm{GRB}};{dz}}({z}_{i}){dz},\end{eqnarray} \tag{ 16 }$

where

$\begin{eqnarray}&&{R}_{{\rm{GRB}};{dz}}(z)=\displaystyle \frac{{R}_{{\rm{GRB}}}(z)}{1+z}\displaystyle \frac{{{dV}}_{{\rm{comov}}}}{d{\rm{\Omega }}{dz}}.\end{eqnarray} \tag{ 17 }$

${R}_{{\rm{GRB}};{dz}}(z)$ is the observed GRB rate that accounts for time dilation and the co-moving volume in addition to the co-moving rate, R_GRB(z). The ${}^{4\pi }{/}_{6}$ factor introduced here reflects that Swift observes only a sixth of the entire sky and ${\rm{\Delta }}{t}_{{\rm{obs}}}$ reflects the fraction of time (per year) that Swift is observing; this is taken as ${\rm{\Delta }}{t}_{{\rm{obs}}}\approx 0.8$ as calculated from related Swift log data. V_comov is the cosmological co-moving volume and Ω is the subtended sky angle.

Not all GRBs occurring in Swift's field of view will be detected; however, this is taken into account by an extra factor, F_det(z). This is the fraction of GRBs at redshift z that are detected by Swift and is further discussed in Section 5.1.1. Including this factor gives us the expected number of observed GRBs in each bin,

$\begin{eqnarray}&&{N}_{{\rm{exp}}}({z}_{i})=\displaystyle \frac{4\pi }{6}{\rm{\Delta }}{t}_{{\rm{obs}}}{R}_{{\rm{GRB}};{dz}}({z}_{i}){F}_{{\rm{det}}}({z}_{i}){dz}.\end{eqnarray} \tag{ 18 }$

The probability, then, of observing N_obs(z_i) GRBs when N_exp(z_i) are expected is given by the Poisson distribution. The bins can be treated as independent, so for K bins we can multiply their probabilities:

$\begin{eqnarray}\mathrm{Pr}(\{{N}_{{\rm{obs}}}({z}_{i})\};\{{N}_{{\rm{exp}}}({z}_{i})\}) & = & \displaystyle \prod _{i=1}^{K}\mathrm{Pr}({N}_{{\rm{obs}}}({z}_{i});{N}_{{\rm{exp}}}({z}_{i}))\\ & = & \displaystyle \prod _{i=1}^{K}\displaystyle \frac{{N}_{{\rm{exp}}}{({z}_{i})}^{{N}_{{\rm{obs}}}({z}_{i})}{e}^{-{N}_{{\rm{exp}}}({z}_{i})}}{{N}_{{\rm{obs}}}({z}_{i})!}.\end{eqnarray} \tag{ 19 }$

The log-likelihood is therefore the log of this probability,

$\begin{eqnarray}{ \mathcal L }({\boldsymbol{n}}) & = & \mathrm{log}(\mathrm{Pr}({N}_{{\rm{obs}}}({z}_{i});{N}_{{\rm{exp}}}({z}_{i})))\\ & = & \displaystyle \sum _{i=1}^{K}[{N}_{{\rm{obs}}}({z}_{i})\mathrm{log}({N}_{{\rm{exp}}}({z}_{i}))-{N}_{{\rm{exp}}}({z}_{i})\\ & & -\mathrm{log}({N}_{{\rm{obs}}}({z}_{i})!)]\end{eqnarray} \tag{ 20 }$

where ${\boldsymbol{n}}=\{{n}_{0},{n}_{1},{n}_{2},{z}_{1},x,y,{L}_{*}\}$ is the set of model parameters that let us obtain N_exp(z_i), which is really ${N}_{{\rm{exp}}}({z}_{i}| {\boldsymbol{n}})$ .

In the limit of a large number of bins, each bin will contain either zero or one detected GRBs so ${N}_{{\rm{obs}}}({z}_{i})!\;=1\Rightarrow \mathrm{log}({N}_{{\rm{obs}}}({z}_{i})!)=0$ . We can also split terms and rewrite Equation (20) as

$\begin{eqnarray}{ \mathcal L }({\boldsymbol{n}}) & = & \displaystyle \sum _{i=1}^{K}[{N}_{{\rm{obs}}}({z}_{i})\mathrm{log}({N}_{{\rm{exp}}}({z}_{i}))]-\displaystyle \sum _{i=1}^{K}{N}_{{\rm{exp}}}({z}_{i})\\ & = & -{N}_{{\rm{exp}}}+\displaystyle \sum _{\{i\}{}_{{\rm{det}}}}\mathrm{log}({N}_{{\rm{exp}}}({z}_{i})),\end{eqnarray} \tag{ 21 }$

where {i}_det are those bins with a detection. We can perform this calculation in the limit of infinite bins, essentially a continuous measurement. N_exp is the integrated expected rate of observations given by

$\begin{eqnarray}&&{N}_{{\rm{exp}}}={\displaystyle \int }_{0}^{10}{N}_{{\rm{exp}}}(z){dz}.\end{eqnarray} \tag{ 22 }$

This likelihood is the same as the C-statistic derived in Cash (1979) in the un-binned limit (see Equation (7) therein, where $C=-2{ \mathcal L }$ ). This likelihood function is also equivalent to that of Stevenson et al. (2015), which compares discrete intrinsic population models for binary black hole mergers as observed by advanced LIGO and Virgo using the observed mass distribution, if the latter is taken to the same limit of infinite bins of infinitesimal width. This is particularly notable because Stevenson et al. (2015) uses a Poisson probability for the total number of detections multiplied by a multinomial distribution describing the fractional distribution of detections among bins in mass space.

5.1.1. Detection Fraction

The detection fraction (also known as detection efficiency) F_det(z) is computed in advance of the analysis by utilizing the ML models trained to reproduce the Swift detection pipeline. 10⁶ GRBs are simulated at each of 10,001 redshift points in [0,10] in order to precisely measure the average detection fraction. These points are used as the basis for a spline interpolation to compute F_det(z) at any z. The detection fraction as a function of z from each of the three models used is shown in Figure 10. It is important to note that this F_det(z) is calculated under the assumption of the particular luminosity function used in this study; it may change significantly for other choices of the luminosity function parameters.

**Figure 10.** F_det(z) as computed by the three different MLAs used as well as the constant flux cut and an analytic form used in Howell et al. (2014). The detection fraction of all data provided for training and validation is also shown. This is calculated under the assumption of the particular luminosity function used in this study and may change significantly for other choices of the luminosity function parameters.
Download figure:
Standard image High-resolution image

We also show, for comparison, the detection fraction as computed by the constant flux cut and from an analytic fit used in Howell et al. (2014). This was computed using the data from Lien et al. (2014), so we are not surprised that it matches well in the low-redshift range where there is better sampling. The flux cut has discrepancies across the entire redshift range while the analytic fit is close until z = 5.96, after which the authors used a constant value.

These can all be compared against the detection fraction of the entire data set (training and validation) provided from using the original Swift pipeline of Lien et al. (2014). There is less resolution and large uncertainty on this curve as there are much fewer samples ( ${ \mathcal O }({10}^{5})$ versus ${ \mathcal O }({10}^{9})$ ), but we can see that RF, AB, and NN track it well.

5.2. Model, Parameters, and Prior

In our analysis, as the detection fraction is averaged over the luminosity distribution, we hold those parameters constant with x = −0.65, y = −3.00, and L_⋆ = 10^52.05 erg s⁻¹. The parameters describing the redshift distribution are allowed to vary with ranges and prior distribution given in Table 4.

Table 4. Prior Ranges and Distributions for the Redshift Distribution Model Parameters

Parameter	Min	Max	Prior
n₀	0.01	2.00	logarithmic
n₁	0.00	4.00	flat
n₂	−6.00	0.00	flat
z₁	0.00	10.00	flat

Download table as: ASCII Typeset image

The population generation code developed in Lien et al. (2014) was used to generate simulated data for testing purposes. In addition to the above-specified parameters, we also return the total number of GRBs, N_exp.

5.3. Parameter Estimation Tests

The BAMBI algorithm (Feroz & Hobson 2008; Feroz et al. 2009; Graff et al. 2012) is a general-purpose implementation of the nested sampling algorithm for Bayesian inference. We use it to perform Bayesian parameter estimation, measuring the full posterior probability distribution of the model parameters.

In the ideal case, any X% credible interval calculated from the posterior distribution should contain the true parameters ∼X% of the time. We sampled a large number of parameter values from the prior and obtained a posterior distribution from simulated data generated with each. For each parameter, we then computed the cumulative fraction of times the true value was found at a credible interval of p—as integrated up from the minimum value—as a function of p. This result was compared to a perfect one-to-one relation using the Kolmogorov–Smirnov test. All parameters passed this test, thus confirming the validity of returned credible intervals.

The posterior distribution for a particular realization of an observed GRB redshift distribution generated using $\{{n}_{0},{n}_{1},{n}_{2},{z}_{1}\}=\{0.42,2.07,-0.70,3.60\}$ (best-fit values from Lien et al. 2014) is shown in Figures 11–13 for the RF, AdaBoost, and SkyNet NN models, respectively. While the RF and AdaBoost posteriors are nearly identical, the NN posterior has small differences due to the difference in the detection fraction. However, these differences are not major. We can see that n₂ is effectively unconstrained due to the low number of observed GRBs with redshift greater than z₁. The true values are marked by the blue lines.

**Figure 11.** Posterior distribution for simulated data with $\{{n}_{0},{n}_{1},{n}_{2},{z}_{1}\}=\{0.42,2.07,-0.70,3.60\}$ using the random forest classifier for data generation and detection fraction. N_tot is the total number of GRBs in the universe per year. Blue lines indicate true values and dot–dash red lines indicate maximum likelihood (i.e., best-fit) values. 2D plots show contour lines every σ (68%, 95%, 99%). Vertical dashed lines in 1D plots show 5%, 50%, and 95% quantiles, with values given in the titles.
Download figure:
Standard image High-resolution image

**Figure 12.** Posterior distribution for simulated data with $\{{n}_{0},{n}_{1},{n}_{2},{z}_{1}\}=\{0.42,2.07,-0.70,3.60\}$ using the AdaBoost classifier for data generation and detection fraction. Same features as Figure 11.
Download figure:
Standard image High-resolution image

**Figure 13.** Posterior distribution for simulated data with $\{{n}_{0},{n}_{1},{n}_{2},{z}_{1}\}=\{0.42,2.07,-0.70,3.60\}$ using the SkyNet NN classifier for data generation and detection fraction. Same features as in Figure 11.
Download figure:
Standard image High-resolution image

We also plot in Figure 14 the distribution of model predictions as specified by the posterior (from RF). In both panels, we select 200 random models selected from the set of posterior samples (light blue lines) as well as the maximum ${ \mathcal L }({\boldsymbol{n}})$ point (black line). The upper panel shows R_GRB(z) (Equation (14)); the lower panel shows N_exp(z)/dz (Equation (18)) and N_int(z)/dz. The lower panel also plots a histogram of the simulated population of measured redshifts for observed GRBs. The upper panel clearly shows us the allowed variability in the high-redshift predictions of the model; in the lower panel, we see that the detection fraction and other factors constrain this variability to consistently low predictions.

**Figure 14.** Distribution of model predictions from the posterior (RF) for a simulated population of GRBs. 200 models with parameters chosen randomly from the posterior are shown by light blue lines in both panels. The maximum ${ \mathcal L }({\boldsymbol{n}})$ point is shown by the solid black line. The upper panel shows R_GRB(z) (Equation (14)) and the lower panel shows N_exp(z)/dz (Equation (18)). The lower panel also shows N_obs, a histogram (pink) of the simulated population of measured redshifts for observed GRBs (scaled to N_obs/dz), and the model intrinsic GRB rate, N_int(z)/dz, for the maximum ${ \mathcal L }({\boldsymbol{n}})$ point in dashed black.
Download figure:
Standard image High-resolution image

These tests show that we can trust the results of an analysis—under the model assumptions, we can recover the true parameters of a simulated GRB redshift distribution.

5.4. Analysis of Swift GRBs

In Lien et al. (2014), the authors use a sample of 66 GRBs observed by Swift whose redshift has been measured from afterglows only or afterglows and host galaxy observations. These observations are taken from the larger set of Fynbo et al. (2009) and the selection is done in order to remove bias toward lower-redshift GRBs in the fraction with measured redshifts (see Section 4.1 of Lien et al. 2014). In our final analysis, we use these 66 GRB redshift measurements as data that we fit with the models described in this paper.

Using RFs, AdaBoost, and NN ML models for the detection fraction, we find posterior probability distributions for n₀, n₁, n₂, and z₁, as seen in Figures 15–17, respectively. The maximum likelihood estimate and posterior probability central 90% credible interval are given in Table 5. We also plot in Figure 18 the distribution of model predictions as specified by the posterior (from RF) as we did in Figure 14 for the test population.

**Figure 15.** Posterior distribution for the real set of 66 *Swift* GRBs using the random forest classifier for the detection fraction. N_tot is the total number of GRBs in the universe per year. The dot–dash red lines indicate maximum likelihood (i.e., best-fit) values. 2D plots show contour lines every σ (68%, 95%, 99%). Vertical dashed lines in 1D plots show 5%, 50%, and 95% quantiles, with values given in the titles.
Download figure:
Standard image High-resolution image

**Figure 16.** Posterior distribution for the real set of 66 *Swift* GRBs using the AdaBoost classifier for the detection fraction. Similar to Figure 15.
Download figure:
Standard image High-resolution image

**Figure 17.** Posterior distribution for the real set of 66 *Swift* GRBs using the SkyNet NN classifier for the detection fraction. Similar to Figure 15.
Download figure:
Standard image High-resolution image

**Figure 18.** Distribution of model predictions from the posterior (RF) for the real set of 66 *Swift* GRBs (Fynbo et al. 2009). 200 models with parameters chosen randomly from the posterior are shown by light blue lines in both panels. The maximum ${ \mathcal L }({\boldsymbol{n}})$ point is shown in black. The upper panel shows R_GRB(z) (Equation (14)) and the lower panel shows N_exp(z)/dz (Equation (18)). The lower panel also shows N_obs, a histogram (pink) of the observed GRB redshifts in the Fynbo et al. (2009) sample (scaled to N_obs/dz), and the model intrinsic GRB rate, N_int(z)/dz, for the maximum ${ \mathcal L }({\boldsymbol{n}})$ point shown by the dashed black line.
Download figure:
Standard image High-resolution image

Table 5. Maximum Likelihood (i.e., Best-fit) Estimates and Central 90% Credible Intervals for the Redshift Distribution Parameters As Fit To The Real Set of 66 Swift GRBs (Fynbo et al. 2009; Lien et al. 2014) Using Each of the MLAs

Parameter	Method	Max Like	90% CI
	RF	0.480	[0.247, 0.890]
n₀	AB	0.489	[0.249, 0.902]
	NN	0.416	[0.238, 0.986]
	RF	1.700	[1.155, 2.261]
n₁	AB	1.681	[1.146, 2.273]
	NN	1.875	[1.030, 2.334]
	RF	−5.934	[−5.675, −0.238]
n₂	AB	−5.950	[−5.665, −0.230]
	NN	−0.483	[−5.598, −0.217]
	RF	6.857	[3.682, 9.654]
z₁	AB	6.682	[3.603, 9.622]
	NN	3.418	[3.215, 9.385]
	RF	4455	[2967, 6942]
N_exp	AB	4392	[2967, 6822]
	NN	3421	[2546, 5502]

Download table as: ASCII Typeset image

Parameters n₀, n₁, and N_tot show mostly Gaussian marginal distributions and some correlation between n₀ and n₁—larger values of the former lead to lower values of the latter in order to maintain a constant value for N_tot and similar values at the peak of the observed distribution. The data do not strongly constrain the high redshift part of the distribution, namely the n₂ parameter. The upper panel of Figure 18 clearly shows us the allowed variability in the high-redshift predictions of the model; in the lower panel, we see that the detection fraction and other factors constrain this variability to consistently low predicted numbers of GRB observations. We see a double-peak in z₁, not the clear single peak seen in the simulated data. One peak occurs around z₁ ≈ 3.6, the best-fit value from Lien et al. (2014) and is more prominent when using the NN model. This shows a sensitivity to the detection fraction for this set of GRB observations. A hint of this can be seen in the posterior plots of Section 5.3—Figures 11–13. All measured parameters are consistent with the best-fit values found by Lien et al. (2014).

5.5. Computational Cost

The main computational costs of this entire analysis procedure were

1.
producing the training data,
2.
performing MLA model fitting and hyper-parameter optimization, and
3.
using the MLA models to compute the detection fraction.

These steps are in roughly decreasing order of cost, from CPU weeks to days. However, all three are one-time initialization costs and can be run massively parallel to reduce wall-time.

After this initialization is complete, however, subsequent analysis of real or simulated data is performed extremely quickly. A single likelihood evaluation takes <0.1 ms, meaning that a Bayesian analysis can be computed in less than a minute on a laptop. Providing this same kind of accurate measurement of the detection fraction without the MLAs would take orders of magnitude more time; while ${ \mathcal O }({10}^{5})$ samples were used for training the MLA models, ${ \mathcal O }({10}^{10})$ evaluations were used in measuring the detection fraction as a function of redshift. The precision of the detection fraction would need to be reduced significantly to make the overall cost comparable. Furthermore, we now are equipped with accurate models of the Swift detection algorithm.

6. COMPARISON TO PREVIOUS WORK

We have developed an MLA model (emulator) for the detailed Swift BAT long GRB pipeline simulator developed in Lien et al. (2014). These techniques allow us to complete a thorough Bayesian analysis of the long GRB rate redshift dependence using the Fynbo et al. (2009) data set, improving on the more coarsely sampled study in Lien et al. (2014). Our results are compatible with those from Lien et al. (2014) with tight agreement for lower redshifts up to z ∼ 4 with compatible results and relatively narrow distributions for our n₁ and n₀ rate parameters. We find values of ${n}_{0}\sim {0.48}_{-0.23}^{+0.41}\;{{\rm{Gpc}}}^{-3}\;{{\rm{yr}}}^{-1}$ and ${n}_{1}\sim {1.7}_{-0.5}^{+0.6}$ , consistent with the best-fit values of n₀ = 0.42 and n₁ = 2.07 from Lien et al. (2014). For larger redshifts the model is less constrained; n₂ spans the prior range and z₁ is significantly constrained only at the low-z end. Our general agreement with Lien et al. (2014) supports their identification of differences between the long GRB redshift distribution and estimates of the SFR (Hopkins & Beacom 2006). Though our analysis indicates that the Fynbo et al. (2009) data do not provide strong constraints on the rate at high redshift, the results seem to indicate significant differences for z < 4. A follow-up Bayesian analysis comparing with a two-break model would allow a more direct comparison with SFR models. We can also note how our results compare with several other studies that use GRB observations and subsequent redshift measurements in order to estimate the redshift or luminosity distribution of GRBs in the universe.

The paper by Butler et al. (2010) used an extensive set of GRBs both with and without redshift measurements to fit intrinsic distributions for GRB redshift, luminosity, peak flux, and more. This fitting was performed using PyMC, a python package for Markov chain Monte-Carlo analyses, marginalizing over all redshifts when no measurement is available; the log-likelihood function used is un-binned, similar to the one used in our study. The detection fraction (a.k.a., the detection efficiency) used by Butler et al. (2010), however, is a probability dependent solely on the photon count rate. Their results for n₁, n₂, and z₁ are consistent with the 90% confidence intervals that we measure.

Wanderman & Piran (2010) performs a careful study of the GRB rate and luminosity distribution via a Monte-Carlo approach. This study adopts an empirical probability function to determine whether a burst is detectable based on the peak flux. In addition, they also introduce an empirical function to estimate the probability of obtaining a redshift measurement based on the GRB peak flux. Since we adopt the same functional form as Wanderman & Piran (2010), it is possible to compare the values of the same parameters. However, in this paper, we quantify the parameter uncertainties of the GRB rate, and assume an un-changed luminosity function from Lien et al. (2014), which is different from the one found in Wanderman & Piran (2010). The parameters found by Wanderman & Piran (2010) are n0 ∼ 1.25, n1 ∼ 2.07, and n₂ ∼ −1.36 (as listed in Table 2 of by Wanderman & Piran 2010). These values of n₁ and n₂ are consistent with our findings; the value of n₀ is at the upper end of our range, but this difference is likely due to the difference in luminosity distribution.

Salvaterra et al. (2012) constructs a sub-sample of Swift long GRBs that is complete in redshift by selecting bursts that satisfy certain observational criteria that are optimal for follow-up observations. In addition, these authors select only bright bursts with 1 s peak photon fluxes greater than 2.6 photons s⁻¹ cm⁻², in order to achieve a high completeness of 90% in redshift measurements. They use this sub-sample to estimate the luminosity function and GRB rate via maximum likelihood estimation—using the same likelihood as our study and marginalizing over a flat z distribution if no value was measured for a GRB—and found that either the rate or the luminosity function is required to evolve strongly with redshift, in order to explain the observational data. The Swift detection efficiency is modeled as a threshold on the GRB flux. The rate model fits of Salvaterra et al. (2012) are not directly comparable to ours due to a different functional form based off of the SFR.

The study of Howell et al. (2014) takes advantage of some of the work done by Lien et al. (2014) in using the detection efficiency computed from simulated GRB populations. The authors perform a time-dependent analysis that considers the rarest events—the largest redshift or the highest peak flux—and how these values progress over observational time. These are used to fit the intrinsic redshift and luminosity distributions of GRBs and infer 90% confidence intervals. Howell et al. (2014) measures a local GRB rate density consistent with our constraints on n₀. Other rate parameters were held fixed to values obtained by Lien et al. (2014) and are thus also consistent with our measurements.

Yu et al. (2015) and Petrosian et al. (2015) use subsets of observed GRBs at different redshifts to construct a more complete GRB sample and to account for observational biases. This method is called the Lynden-Bells c⁻ method. Each sub-sample is selected based on the minimum detectable GRB luminosity at each redshift. Both of these studies find significant luminosity evolution and a rather high GRB rate at low redshift in comparison to the one expected from previous SFR measurements. However, as noted in Yu et al. (2015) and Petrosian et al. (2015), several additional selection effects can be the cause of this discrepancy, including the potential bias toward redshift measurements of nearby GRBs and those with bright X-ray and optical afterglows. The rate evolution found by Yu et al. (2015) is not consistent with our results at low redshift, but is consistent at high redshift due to the large uncertainty in measuring n₂.

Our study is able to improve upon the methodology of these studies and may be extended to cover the same breadth of GRB source models to be fit. These improvements are not the same for all, but in summary involve using a fully Bayesian model fitting procedure with a likelihood function that does not involve any binning of observations. Furthermore, the detection efficiency of the Swift BAT detector can be better modeled using ML techniques that incorporate all available information (marginalizing over parameters not under consideration) than with probabilities dependent solely on the flux or photon counts. With both of these, not only will we be able to extract as much information as possible out of GRB detections and follow-up observations, but such analyses will incur minimal modeling bias while maintaining computational speed.

7. SUMMARY AND CONCLUSIONS

We have built a set of models emulating the Swift BAT detection algorithm for long GRBs using ML. Using a large set of simulated GRBs from the work of Lien et al. (2014) as training data, we used the RF, AdaBoost, SVM, and NN algorithms to optimize, fit, and validate models that simulate the Swift triggering algorithm to high accuracy. RF and AdaBoost perform best, achieving accuracies of 97.5%; NNs and SVMs have accuracies of 96.9% and 94.7%, respectively. These all out perform a threshold in GRB flux, which has an accuracy of 89.6%. The improved faithfulness to the full Swift triggering removes potential sources of bias when performing analyses based on the model.

Using these models, we computed the detection fraction (efficiency) of Swift as a function of redshift for a fixed luminosity distribution. Using this empirical detection fraction and a model for the GRB rate given by Equation (14), we fit the model parameters on both simulated redshift measurements and on the redshifts reported by Fynbo et al. (2009). We find best-fitting values and 90% credible intervals as reported in Table 5 for each of the top three MLAs. These, expectedly, are consistent with values found by Lien et al. (2014).

After incurring the initial costs of generating training data, fitting the models, and computing the detection fraction, we are able to perform Bayesian parameter estimation extremely rapidly. This allows us to explore the full parameter space of the model and determine not only the best-fit parameters, but also the uncertainty and degeneracies present.

8. FUTURE WORK

In performing this analysis, we identified several potential avenues for further work to improve our model and extend the analysis performed. Hence, we see this work as just the first step in advancing GRB research. The easiest extension is to analyze different samples of measured GRB redshifts that may contain different selection biases.

To improve our ML models, we would like to continue building on our training data set and making it more agnostic with respect to GRB parameters. This would allow for better modeling of the Swift detection algorithm and its dependency on different GRB characteristics. We may also test the impacts of pruning correlated features in the data set in order to accelerate analysis and potentially improve results. The GRB rate model can also be expanded to include a second break point in redshift. This would allow for more direct comparison with most fits of the SFR that use a double-broken power-law model. Bayesian model selection could then be used to compare these and other models.

The analyses can be extended to fitting the intrinsic luminosity distribution by including GRB luminosity or flux in the detection fraction, ${F}_{{\rm{det}}}({\mathrm{log}}_{10}(L),z)$ or ${F}_{{\rm{det}}}({\rm{\Phi }}({\mathrm{log}}_{10}(L),z),z)$ . The likelihood function can then jointly describe both the luminosity and redshift distributions—including luminosity distribution evolution with redshift—by analyzing measured GRB fluxes and redshifts; the redshift distribution can be marginalized over if there is no measured value for a particular GRB. The likelihood function can also be modified to account for known selection biases, including the probability of measuring a redshift for each GRB.

Beyond improving and extending the model used in this paper, a similar analysis can be performed for the study of short GRBs detected by Swift and other detectors. This work has demonstrated the value of ML for GRB data analysis and the algorithms and techniques may be extended to other problems in GRB follow-up and analysis.

The authors would like to thank Brad Cenko, Judith Racusin, and Neil Gehrels for helpful discussions. P.G. acknowledges support from NASA Grant NNX12AN10G and an appointment to the NASA Postdoctoral Program at the Goddard Space Flight Center, administered by Oak Ridge Associated Universities through a contract with NASA. J.B. acknowledges support from NASA Grant ATP11-00046.

MODELING THE SWIFT BAT TRIGGER ALGORITHM WITH MACHINE LEARNING^{^*}

Article metrics

Permissions

Author e-mails

Author affiliations

Dates

ABSTRACT

1. INTRODUCTION

2. THE Swift DETECTION ALGORITHM

3. MACHINE LEARNING ALGORITHMS

3.1. Random Forests and AdaBoost

3.1.1. Random Forests (RFs)

3.1.2. AdaBoost

3.2. Support Vector Machines (SVMs)

3.3. Artificial Neural Networks

3.4. Heuristics Used

3.5. Cross-validation

4. MACHINE LEARNING RESULTS

4.1. Training Data Used

4.2. Random Forest

4.3. AdaBoost

4.4. Support Vector Machines

4.5. Neural Networks

4.6. Summary of Results

5. USE OF ACCELERATED PIPELINE FOR BAYESIAN INFERENCE

5.1. Likelihood Function

5.1.1. Detection Fraction

5.2. Model, Parameters, and Prior

5.3. Parameter Estimation Tests

5.4. Analysis of Swift GRBs

5.5. Computational Cost

6. COMPARISON TO PREVIOUS WORK

7. SUMMARY AND CONCLUSIONS

8. FUTURE WORK

Footnotes

MODELING THE SWIFT BAT TRIGGER ALGORITHM WITH MACHINE LEARNING*

Article metrics

Permissions

Share this article

Author e-mails

Author affiliations

Dates

ABSTRACT

1. INTRODUCTION

2. THE Swift DETECTION ALGORITHM

3. MACHINE LEARNING ALGORITHMS

3.1. Random Forests and AdaBoost

3.1.1. Random Forests (RFs)

3.1.2. AdaBoost

3.2. Support Vector Machines (SVMs)

3.3. Artificial Neural Networks

3.4. Heuristics Used

3.5. Cross-validation

4. MACHINE LEARNING RESULTS

4.1. Training Data Used

4.2. Random Forest

4.3. AdaBoost

4.4. Support Vector Machines

4.5. Neural Networks

4.6. Summary of Results

5. USE OF ACCELERATED PIPELINE FOR BAYESIAN INFERENCE

5.1. Likelihood Function

5.1.1. Detection Fraction

5.2. Model, Parameters, and Prior

5.3. Parameter Estimation Tests

5.4. Analysis of Swift GRBs

5.5. Computational Cost

6. COMPARISON TO PREVIOUS WORK

7. SUMMARY AND CONCLUSIONS

8. FUTURE WORK

Footnotes

MODELING THE SWIFT BAT TRIGGER ALGORITHM WITH MACHINE LEARNING^{^*}