Machine learning classification of Gaia Data Release 2

Yu Bai; Ji-Feng Liu; Song Wang

doi:10.1088/1674-4527/18/10/118

1. Introduction

The ESA space mission Gaia is performing an all-sky astrometric, photometric and radial velocity survey at optical wavelengths (Gaia Collaboration et al. 2016). The primary objective of the Gaia mission is to survey more than one billion stars in order to investigate the origin and subsequent evolution of our Galaxy. Its second data release (Gaia DR2; Gaia Collaboration et al. 2018) includes ∼1.3 billion objects with valid parallaxes. These parallaxes are obtained with a complex iterative procedure that involves various assumptions (Lindegren et al. 2012). Such a procedure may also produce parallaxes for galaxies and quasi-stellar objects (QSOs), which should show no significant parallax (Liao et al. 2018).

In addition, Gaia observes with two fields of view, which, in principle, might lead to a global parallax bias (van Leeuwen 2005; Butkevich et al. 2017; Liao et al. 2018). Separating galaxies and QSOs from stars allows us to characterize the parallax bias in the Gaia catalog, and to provide a clean and accurate stellar sample for further investigation. Traditionally, the classification of objects involves magnitude and color criteria, but these criteria become too complex to be described with functions in a multidimensional parameter space. By contrast, this parameter space can be effectively explored with machine learning (ML) algorithms, which have aided astronomers in dealing with complex problems in modern astrophysics (Huertas-Company et al. 2008, 2009; Manteiga et al. 2009; Bai et al. 2018a; Pashchenko et al. 2018).

ML provides us with an alternative option to classify billions of objects that cannot be followed up spectroscopically. Bai et al. (2018a) applied supervised ML to star/galaxy/QSO classification based on the combination of the SDSS and LAMOST spectral surveys (the Sloan-LAMOST (SL) classifier). The class labels of the training objects come from spectroscopy and are regarded as true. Narrow line QSOs are classified as galaxies by both the SDSS and LAMOST pipelines because the QSO template in the pipelines is a theoretical one with broad emission lines. A classifier built with the random forest algorithm gave the best performance in terms of time cost and total accuracy. Several blind tests were also performed on objects observed by RAVE, 6dFGS and UVQS. Accuracies were higher than 99% for the stars and galaxies, and higher than 94% for the QSOs.

In this paper, we apply the SL classifier to Gaia DR2 to investigate potential extragalactic objects. The data and classification are described in Section 2. Section 3 gives the result and analysis, and a summary is presented in Section 4.

2. Data and Classification

In order to use the SL classifier, we build a nine-dimensional color space that includes g − r, r − i, i − J, J − H, H − K, K − W1, W1 − W2, w1mag1 − w1mag3 and w2mag1 − w2mag3 (Bai et al. 2018a). The optical colors are extracted from Data Release 1 of the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS 1; hereafter PS1) archive data. PS1 has carried out a set of imaging sky surveys including the 3π Steradian Survey, in which the mean 5σ point source limiting sensitivities are 23.3, 23.2 and 23.1 mag in the g, r and i bands, respectively (Chambers et al. 2016). We cross-matched Gaia DR2 with PS1 using panstarrs1_best_neighbour, the pre-computed PS1 cross-match table provided in the Gaia archive (Marrese et al. 2017). The table includes 810 359 898 entries, which represent the most likely matches between PS1 and Gaia DR2, as determined by angular distances, position errors, epoch differences and the density of sources in PS1.
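The construction of the nine colors can be illustrated with a minimal sketch. It assumes the magnitudes of each object are already available as arrays keyed by band name; the function name `build_color_matrix`, the dictionary layout and the example magnitudes are hypothetical and are not taken from the SL classifier code.

```python
import numpy as np

# Band pairs for the nine colors listed in the text (first band minus second).
COLOR_PAIRS = [
    ("g", "r"), ("r", "i"), ("i", "J"), ("J", "H"), ("H", "K"),
    ("K", "W1"), ("W1", "W2"),
    ("w1mag1", "w1mag3"), ("w2mag1", "w2mag3"),
]

def build_color_matrix(mags):
    """Return an (n_objects, 9) array of colors from a dict of magnitude arrays.

    `mags` is a hypothetical input layout: band name -> sequence of magnitudes.
    """
    columns = [
        np.asarray(mags[first], dtype=float) - np.asarray(mags[second], dtype=float)
        for first, second in COLOR_PAIRS
    ]
    return np.column_stack(columns)

# Two made-up objects; the magnitudes are purely illustrative.
example_mags = {
    "g": [18.0, 20.0], "r": [17.5, 19.0], "i": [17.25, 18.5],
    "J": [16.0, 17.0], "H": [15.5, 16.25], "K": [15.25, 15.5],
    "W1": [15.0, 14.5], "W2": [15.0, 13.5],
    "w1mag1": [15.5, 15.0], "w1mag3": [15.0, 14.25],
    "w2mag1": [15.5, 14.0], "w2mag3": [15.0, 13.25],
}
colors = build_color_matrix(example_mags)
```

Each row of `colors` is one object and each column one color, in the order given in the text, which is the kind of matrix a color-based classifier would take as input.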

In order to obtain infrared colors, we cross-matched Gaia DR2 with the AllWISE catalog using allwise_best_neighbour, which includes 300 207 917 matches (Marrese et al. 2017). Here we select objects with signal-to-noise ratios higher than 2 in the W1 and W2 bands. As a result, the cross-matches yield 85 613 922 objects with all nine colors. We feed the nine-color matrix to the SL classifier, and the classifier returns the types and the probabilities (P) for stars, galaxies and QSOs. The sum of P for the three types is 100%, and the type with the highest P is adopted by the SL classifier as the output type. Because three probabilities that sum to 100% cannot all be below one third, the P of the adopted type is at least 33.33%.
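The rule for adopting the output type can be sketched as follows. This is only an illustration that assumes the classifier output is an array with one row per object and one column per type; the column order, the function name `adopt_types` and the example probabilities are assumptions, not the interface of the SL classifier.

```python
import numpy as np

# Assumed column order of the probability array (hypothetical).
CLASS_NAMES = np.array(["star", "galaxy", "qso"])

def adopt_types(prob):
    """Return the adopted type and its probability for each object."""
    prob = np.asarray(prob, dtype=float)
    best = prob.argmax(axis=1)                      # index of the highest P per row
    p_adopted = prob[np.arange(len(prob)), best]    # the highest P itself
    return CLASS_NAMES[best], p_adopted

# Made-up probabilities; each row sums to 1 (that is, 100%).
prob = np.array([
    [0.90, 0.06, 0.04],
    [0.34, 0.33, 0.33],
    [0.10, 0.30, 0.60],
])
types, p_adopted = adopt_types(prob)
```

The second row shows the limiting case: even when the three types are almost equally likely, the adopted probability cannot fall below one third.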

Traditionally, QSOs are separated from other AGNs mainly by their absolute B magnitudes. The QSOs in the training data of the SL classifier are instead identified with the QSO spectral templates. Because of these different definitions, many objects in our sample that are classified as QSOs in the literature may be classified as galaxies by the SL classifier. Therefore, the galaxies and QSOs given by the SL classifier are combined and hereafter called galaxies. The results include 83 891 260 stars and 1 722 662 galaxies.
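Merging the two extragalactic types and checking the resulting fractions takes only a few lines. The sketch below assumes the types are stored as strings; the label spellings and the function name `merge_extragalactic` are hypothetical, while the three counts are the ones quoted in the text.

```python
import numpy as np

def merge_extragalactic(types):
    """Relabel QSOs as galaxies, leaving stars and galaxies unchanged."""
    types = np.asarray(types)
    return np.where(types == "qso", "galaxy", types)

merged = merge_extragalactic(["star", "qso", "galaxy", "star"])

# Counts quoted in the text.
N_STARS = 83_891_260
N_GALAXIES = 1_722_662
N_TOTAL = 85_613_922

star_fraction = N_STARS / N_TOTAL        # about 0.98
galaxy_fraction = N_GALAXIES / N_TOTAL   # about 0.02
```

The two counts add up to the full sample of 85 613 922 objects, so stars make up about 98% of the sample and the combined galaxy class about 2%.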

3. Result and Analysis

3.1. Comparison with Simbad

We cross-match these objects with the Simbad database in order to estimate the probability of incorrect classifications. The Simbad database returns 308 864 galaxies, 191 497 stars and 10 987 unclassified objects.

The distributions of the output probabilities are presented in Figure 1. We define the accuracy for a type as the fraction of objects assigned that type by the SL classifier that have the same type in Simbad. The total accuracy is 91.9%. More than 99.1% of the galaxies in our sample are also classified as galaxies in the Simbad database, and more than 83% of the stars in our sample are classified as stars.
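The comparison can be sketched as an agreement calculation between two label arrays. The reading used here, in which the accuracy of a type is the fraction of objects given that type by the classifier whose Simbad type agrees, is an interpretation of the definition above; the function name `agreement` and the labels are made up for illustration.

```python
import numpy as np

def agreement(ml_types, simbad_types):
    """Return (total agreement, per-type agreement keyed by ML type)."""
    ml = np.asarray(ml_types)
    simbad = np.asarray(simbad_types)
    match = ml == simbad
    per_type = {
        str(t): float(match[ml == t].mean())   # agreement among objects the ML calls t
        for t in np.unique(ml)
    }
    return float(match.mean()), per_type

# Six made-up objects with both an ML type and a Simbad type.
ml_types = ["galaxy", "galaxy", "star", "star", "star", "star"]
simbad_types = ["galaxy", "galaxy", "star", "star", "star", "galaxy"]
total_accuracy, per_type_accuracy = agreement(ml_types, simbad_types)
```

In this toy case both ML galaxies are confirmed, while one of the four ML stars is a Simbad galaxy, so the star accuracy is 75% and the total is five out of six.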

**Fig. 1** The ML probability distributions for galaxies (*left panel*) and stars (*right panel*). The *red lines* are the numbers of Simbad classifications higher than the corresponding probabilities. The y axis is displayed on a log scale for clarity. The *blue lines* are the accuracies compared to Simbad.

The classification accuracy of the stars is lower than that found for the spectroscopic samples in Bai et al. (2018a). Stars in the training sample of the SL classifier are mainly from LAMOST, which is dominated by stars located near the Galactic Anticenter. This selection effect may make the SL classifier perform better when applied to lightly reddened objects. Objects located in heavily reddened directions of the Galaxy are probably harder for the SL classifier to recognize.

3.2. Sky Distribution

We present the sky distributions of the classification results in Figure 2. As expected, the Galactic plane is dominated by stars, and the percentage of galaxies becomes higher at high latitudes. The relatively high percentages of galaxies in the most central part of the Galactic plane may be due to the low density of stars in this region (upper left panel in Fig. 2). This low density may in turn result from the low completeness of PS1 caused by high extinction (Chambers et al. 2016). Additionally, WISE photometry is limited by confusion near the Galactic plane due to high source density (Wright et al. 2010). In the distribution of galaxies, we find overdense areas corresponding to some galaxy clusters (Jarrett 2004), e.g., Abell 624, the Perseus-Pisces Supercluster and the Shapley Concentration.
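A map of the galaxy percentage can be sketched by counting objects in sky bins. The example below is an assumption about how such a map could be built, not the procedure used for Figure 2: it bins Galactic longitude against the sine of Galactic latitude so that every bin covers the same area, and the function name, bin counts and coordinates are hypothetical.

```python
import numpy as np

def galaxy_percentage_map(l_deg, b_deg, is_galaxy, n_l=36, n_b=18):
    """Percentage of galaxies in equal-area (l, sin b) bins; NaN where a bin is empty."""
    l_deg = np.asarray(l_deg, dtype=float)
    sin_b = np.sin(np.radians(np.asarray(b_deg, dtype=float)))
    is_galaxy = np.asarray(is_galaxy, dtype=bool)

    bins = [np.linspace(0.0, 360.0, n_l + 1), np.linspace(-1.0, 1.0, n_b + 1)]
    total, _, _ = np.histogram2d(l_deg, sin_b, bins=bins)
    galaxies, _, _ = np.histogram2d(l_deg[is_galaxy], sin_b[is_galaxy], bins=bins)

    percentage = np.full(total.shape, np.nan)
    filled = total > 0
    percentage[filled] = 100.0 * galaxies[filled] / total[filled]
    return percentage

# Four made-up objects: two stars near the plane, one galaxy and one star at high latitude.
l_deg = [12.0, 15.0, 202.0, 205.0]
b_deg = [5.0, 6.0, 80.0, 81.0]
is_galaxy = [False, False, True, False]
pct_map = galaxy_percentage_map(l_deg, b_deg, is_galaxy)
```

Dividing by the total count per bin, rather than plotting the galaxy density alone, is what makes a low stellar density show up as a high galaxy percentage, as discussed for the central Galactic plane.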

**Fig. 2** The distributions of classification results in Galactic coordinates. *Upper left panel*: the percentage of galaxies per degree². *Upper right panel*: the density of stars. *Lower panel*: the density of galaxies.

3.3. Relative Error of Parallax

The relative parallax uncertainty, σ_π/π, is an important parameter that can be used to constrain the bias caused by the Lutz–Kelker Effect (LKE; Trumpler & Weaver 1953; Lutz & Kelker 1973; Bai et al. 2018b). We present the stacked distributions of σ_π/π in Figure 3. Since the uncertainty is never negative, a negative σ_π/π means a negative parallax. The galaxies make up ∼2.5% of the objects with parallaxes less than zero. The percentage of stars decreases sharply in the range −0.6 < σ_π/π < 0.0, and reaches a minimum of 96.7%.

**Fig. 3** The stacked distributions of σ_π/π. *Left panel*: the distribution in the range between −2 and 3. *Right panel*: the distribution in the range between 0 and 1.

For objects with positive parallaxes, the percentage of stars declines as σ_π/π increases. The galaxies make up less than 1% of objects in the range 0 < σ_π/π < 0.6. The sample is nearly all stars (∼ 99.9%) when 0 < σ_π/π < 0.2. In this range, the bias caused by the LKE also becomes insignificant (Bai et al. 2018b). Using the threshold of 0 < σ_π/π < 0.2 could yield a very clean stellar sample that includes 27 500 769 stars and 18 674 galaxies.
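The selection 0 < σ_π/π < 0.2 and the purity it implies can be sketched as follows. The argument names `parallax` and `parallax_error` follow the Gaia archive column names, but the function `clean_stellar_mask` and the example values are hypothetical; the two counts in the purity check are the ones quoted above.

```python
import numpy as np

def clean_stellar_mask(parallax, parallax_error, max_ratio=0.2):
    """Boolean mask for objects with 0 < parallax_error / parallax < max_ratio.

    Written without a division so that zero parallaxes need no special handling:
    for a positive uncertainty the ratio is positive only if the parallax is positive.
    """
    parallax = np.asarray(parallax, dtype=float)
    parallax_error = np.asarray(parallax_error, dtype=float)
    return (
        (parallax > 0)
        & (parallax_error > 0)
        & (parallax_error < max_ratio * parallax)
    )

# Made-up parallaxes and uncertainties in mas; ratios are 0.05, 0.5, -0.2 and 0.1.
parallax = [2.0, 1.0, -0.5, 0.1]
parallax_error = [0.1, 0.5, 0.1, 0.01]
mask = clean_stellar_mask(parallax, parallax_error)

# Purity of the clean sample from the counts quoted in the text.
N_CLEAN_STARS = 27_500_769
N_CLEAN_GALAXIES = 18_674
purity = N_CLEAN_STARS / (N_CLEAN_STARS + N_CLEAN_GALAXIES)
```

Only the first and last example objects pass the cut, and the quoted counts give a stellar fraction of about 99.93%, consistent with the ∼ 99.9% stated in the text.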

4. Summary

We apply the SL classifier to 85 613 922 objects in Gaia DR2, based on colors built from PS1 and AllWISE. The classification shows that about 98% of the sample are stars and 2% are galaxies. This result is cross-matched with the Simbad database in order to estimate the probability of incorrect classifications, and the total accuracy is 91.9%. The Galactic plane is dominated by stars, and the percentage of galaxies becomes higher at high latitudes. We find that about 2.5% of the objects with negative parallaxes are galaxies, and that the threshold of 0 < σ_π/π < 0.2 could yield a very clean stellar sample including about 99.9% stars.

Acknowledgements

This work was supported by the National Program on Key Research and Development Project (Grant No. 2016YFA0400804) and the National Natural Science Foundation of China (Grant Nos. 11603038, 11333004, 11425313 and 11403056). Some of the data presented in this paper were obtained from the Mikulski Archive for Space Telescopes (MAST). STScI is operated by the Association of Universities for Research in Astronomy, Inc., under NASA contract NAS5-26555. Support for MAST for non-HST data is provided by the NASA Office of Space Science via grant NNX09AF08G and by other grants and contracts.

This work has made use of data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC, https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement.
