Identifying Kinematic Structures in Simulated Galaxies Using Unsupervised Machine Learning

Min Du; Luis C. Ho; Dongyao Zhao; Jingjing Shi; Victor P. Debattista; Lars Hernquist; Dylan Nelson

doi:10.3847/1538-4357/ab43cc

1. Introduction

The Hubble (1926) sequence, the most widely used system of morphological classification of galaxies (Sandage & Tammann 1981), has served as a powerful framework for understanding galaxy evolution. Notwithstanding their myriad complexities, at the most fundamental level galaxies are principally distinguished by two dominant structural/dynamical components: a fast-rotating, flattened disk and a pressure-supported, spheroidal bulge. The relative light fraction of these two components, traditionally determined through photometric decomposition (e.g., Peng et al. 2002; Méndez-Abreu et al. 2008; Erwin 2015; Gao et al. 2019), establishes the galaxy type. Early-type galaxies are pure spheroids or bulge-dominated disk systems, whereas late-type galaxies are increasingly disk-dominated and even bulgeless. The processes that build up bulges and disks underlie the physical basis of the Hubble sequence.

Galaxies often comprise additional structures. For example, many nearby galaxies have a thick disk, which is both older and more metal-poor with respect to the thin disk (Dalcanton & Bernstein 2002; Yoachim & Dalcanton 2006; Comerón et al. 2011, 2014; Elmegreen et al. 2017). Meanwhile, the morphology of bulges comes in more than one flavor, ranging from highly spherically symmetric to flat (Andredakis & Sanders 1994; Andredakis et al. 1995; Courteau et al. 1996; Méndez-Abreu et al. 2010). Classical bulges are dynamically hot and largely featureless, likely the end-products of galaxy major mergers (Toomre 1977). The more flattened, rotationally supported pseudo bulges, an outgrowth of internal secular evolution, generally coexist with complex central structures (e.g., Erwin 2004; Kormendy & Kennicutt 2004). The Milky Way is a prototypical spiral that has several components, including a thin and a thick disk, a boxy/peanut-shaped bulge, a bar, a stellar halo, and a nuclear star cluster (see review by Bland-Hawthorn & Gerhard 2016). Boxy/peanut-shaped bulges are generally considered as the vertically thickened part of bars. Whether the Milky Way has a classical bulge is still uncertain, but it is unlikely to have a massive one (Shen et al. 2010; Debattista et al. 2017). The rich diversity of substructures observed among nearby galaxies imprints the formation and evolutionary history of galaxies. Accurate recognition and decomposition of these underlying substructures is essential.

The rapid development of integral-field spectroscopy has enabled galaxies to be classified by their internal kinematics (e.g., Emsellem et al. 2007, 2011; Cappellari et al. 2011a, 2011b). Early-type galaxies can be classified into slow and fast rotators (see the review of Cappellari 2016, and references therein), with fast-rotator early types forming a parallel sequence to spiral galaxies. Zhu et al. (2018b) made the first attempt to decompose observed galaxies based on their kinematics. Using the orbit-superposition Schwarzschild method (e.g., Schwarzschild 1979; Valluri et al. 2004; van den Bosch et al. 2008), they reconstructed stellar orbits for galaxies in the CALIFA survey (Sánchez et al. 2012), decomposing them into cold, warm, and hot components (Zhu et al. 2018a, 2018c). However, given the limited information that can be extracted from spectra, it is still very difficult to decompose observed galaxies in detail.

Numerical simulations are powerful tools for studying the formation and evolution of galaxy structures. In recent years, significant progress has been made in modeling star formation and stellar feedback, leading to increasingly realistic galaxies with reasonable bulge-to-disk ratios (Agertz et al. 2011; Guedes et al. 2011; Aumer et al. 2013; Stinson et al. 2013; Marinacci et al. 2014; Roškar et al. 2014; Murante et al. 2015; Colín et al. 2016; Grand et al. 2017). The increase of simulation resolution has enabled us to generate galaxies with multiple structures that go much beyond the basic bulge+disk system, including vertical structures of disks (Brook et al. 2012; Ma et al. 2017; Navarro et al. 2018; Obreja et al. 2019), stellar halos (Cooper et al. 2010; Tissera et al. 2013; Pillepich et al. 2014; Elias et al. 2018; Monachesi et al. 2019), pseudo bulges (Guedes et al. 2013; Okamoto 2013), and bars (Algorry et al. 2017; Peschken & Łokas 2019).

Large-scale hydrodynamical cosmological simulations provide the opportunity to investigate the statistical properties of galaxies evolving in a fully cosmological context. Recent advances include Illustris (Genel et al. 2014; Vogelsberger et al. 2014a, 2014b), EAGLE (Crain et al. 2015; Schaye et al. 2015), and Horizon-AGN (Dubois et al. 2016). The IllustrisTNG simulations (Marinacci et al. 2018; Naiman et al. 2018; Nelson et al. 2018, 2019; Pillepich et al. 2018a, 2019; Springel et al. 2018) can reproduce galaxies that successfully emulate plausible visual morphologies, thanks to an updated galaxy physics model (Weinberger et al. 2017; Pillepich et al. 2018b). The optical morphologies of galaxies in the TNG100 run (the highest-resolution version currently available at z = 0) are in good agreement with observations of nearby galaxies (Huertas-Company et al. 2019; Rodriguez-Gomez et al. 2019). The realism of the mock galaxies inspires confidence that the latest simulations can be used for detailed statistical study. With the aid of numerical simulations in which information is known in all six dimensions of phase space, we can investigate the intrinsic properties of galaxy structures, as well as track their formation physics and evolutionary history.

The structures of simulated galaxies can be identified through the kinematic properties of their constituent stars. Abadi et al. (2003) proposed a circularity parameter = J_z/J_c, the ratio of the azimuthal angular momentum J_z and the maximum angular momentum J_c having the same binding energy E, that can separate effectively the spheroidal component from the disky component. In order to characterize different components in detail, Doménech-Moral et al. (2012) further introduced into consideration the binding energy E and the non-azimuthal angular momentum vector J_p = J − J_z, where J is the total angular momentum vector of the stellar particle. These parameters identify the clustering of particles in kinematic phase space that corresponds to intrinsic structures of a galaxy. Obreja et al. (2016, 2018) replaced the k-means clustering algorithm used in Doménech-Moral et al. (2012) with an unsupervised machine-learning algorithm, the Gaussian mixture model (GMM). The use of a GMM reduces the errors caused by the mixtures of different structures via soft assignment of stars.

This study extends the application of GMM and develops a method to decompose simulated galaxies automatically. We test the method by applying it to five prototype galaxies from the TNG100 simulation (now also publicly available; see Nelson et al. 2019), and we discuss prospects for forthcoming applications using larger samples of simulated galaxies.

2. Dynamical Decomposition Method

Stars belonging to the same physical structure naturally cluster in their kinematic phase space. The method of Doménech-Moral et al. (2012) and Obreja et al. (2018) offers a promising framework to decompose the complex internal structures of galaxies. Their method uses three-dimensional (3D) Gaussian distributions to represent structures identified through their kinematics. However, a few limitations still affect its application:

1.
The number of structures is determined artificially. This not only opens the possibility of human bias, but also renders impractical implementation to large samples of galaxies from cosmological simulations.
2.
Real galaxy structures may not follow simple single-Gaussian distributions. The distribution function of disks, possibly all structures in galaxies, are not single Gaussians. A given structure may be composed of more complex distribution functions. Moreover, any realistic, dynamic, evolving system inevitably contains some degree of finer substructure.

An automated method, such as the unsupervised machine-learning nature of GMM, is needed to explore this problem. To mitigate human bias, the number of Gaussian components should be inferred directly from data, and all galaxies must be treated with the same standard. The fits should be sufficiently detailed to resolve significant structures in galaxies, allowing the same structure to host more than one Gaussian component if necessary. Our fully automated methodology, auto-GMM, complies with all these requirements and is able to identify multiple kinematic structures. We do not ascribe any physical significance to each individual component, postponing to a later stage the interpretation of the components/sub-components and their association with known, observed structures.

2.1. 3D Kinematic Phase Space

For TNG100, we load the positions and velocities of all particles (including dark matter, gas, and star) within a selected subhalo. Then, the code of Obreja et al. (2018) is adopted to calculate the 3D kinematic phase space of j_z/j_c, j_p/j_c, and $e/| e{| }_{\max }$ of stars, which will be used as inputs to GMM. The quantities j_z, j_p, j_c, and e are the specific J_z, J_p, J_c, and E, respectively. The origin of the coordinate coincides with the galaxy center, which is defined as the minimum of the gravitational potential. The z-axis of the galaxy is oriented perpendicularly to the outer disk. The average angular momentum vector is calculated by stars whose radii are between 2.1 kpc (three times the softening radius of stars) and 0.1 times the virial radius. Then, the azimuthal term of the angular momentum j_z can be easily decomposed from j_p. In order to estimate j_c and e, the code recalculates the gravitational potential of the halo, under the assumption that the halo is isolated. This assumption is generally well satisfied, unless the galaxy is undergoing significant accretion, which is fairly rare at low redshifts. All of the dark matter, stellar, and gaseous masses are included to recalculate the gravitational potential. The quantity $| e{| }_{\max }$ is the absolute value of the energy of the most bound stellar particle in the halo. Thus, $e/| e{| }_{\max }$ describes how tightly bound or centrally concentrated a particle is. It is a dimensionless parameter that gives a typical value across all galaxy masses. Only particles with j_z/j_c ∈ [−1.5, 1.5], j_p/j_c ∈ [0, 1.5], and e∈[−1, 0] are considered in dynamical decomposition, consistent with the criteria used in Obreja et al. (2018). These criteria reject interlopers that are particles with clearly different kinematics from the galaxy.

2.2. Gaussian Mixture Models

Unsupervised machine-learning algorithms can be used to cluster data points into different groups. The PYTHON language (Pedregosa et al. 2011) offers several clustering methods. As suggested by Obreja et al. (2018), GMM is suitable for finding structures in the kinematic phase space of j_z/j_c, j_p/j_c, and $e/| e{| }_{\max }$ . In the updated scikit-learn package, the old GMM module is replaced by the GaussianMixture module. Each Gaussian component is a triaxial ellipsoid in kinematic phase space. In order to maximize the likelihood in the parameter space, an expectation-maximization algorithm iterates until the default criterion is satisfied, returning a matrix of probabilities. Each data point has a probability array of how likely it is that it belongs to a certain component.

As an example, Figures 1 and 2 illustrate the kinematic phase space of a galaxy with distinct spheroidal and disky structures. Three components are clearly visible in the j_z/j_c versus $e/| e{| }_{\max }$ diagram (Figure 1), namely a compact and slow-rotating spheroid or bulge, a diffuse and slow-rotating spheroid or halo, and a fast-rotating disk (details about the identification of kinematic and morphological structures are provided in Section 3.2). By contrast, only two components are clearly seen in the j_z/j_c versus j_p/j_c diagram (Figure 2). This is because both spheroidal components, dominated by random motions, have a wide range of j_p/j_c.

**Figure 1.** Distribution of j_z/j_c vs. $e/| e{| }_{\max }$ of a typical galaxy. Three components, likely corresponding to a disk, a bulge, and a halo, are visible. The color bar indicates the number of stellar particles in each bin. We fit the distribution of particles with increasing numbers of Gaussian components, from 2 to 9, as labeled in the upper-left corner of each panel. The overlaid ellipses represent 63% confidence regions of the Gaussian components found by the GMM fits. The fit improves by adding more components. Both the bulge (magenta ellipses) and disky components (j_z/j_c > 0.7; blue ellipses) are well fitted using n_c = 5–8. With the increase of n_c, the halo breaks up into multiple substructures (cyan ellipses).
Download figure:
Standard image High-resolution image

**Figure 1.** Distribution of j_z/j_c vs. $e/| e{| }_{\max }$ of a typical galaxy. Three components, likely corresponding to a disk, a bulge, and a halo, are visible. The color bar indicates the number of stellar particles in each bin. We fit the distribution of particles with increasing numbers of Gaussian components, from 2 to 9, as labeled in the upper-left corner of each panel. The overlaid ellipses represent 63% confidence regions of the Gaussian components found by the GMM fits. The fit improves by adding more components. Both the bulge (magenta ellipses) and disky components (j_z/j_c > 0.7; blue ellipses) are well fitted using n_c = 5–8. With the increase of n_c, the halo breaks up into multiple substructures (cyan ellipses).
Download figure:
Standard image High-resolution image

**Figure 2.** Distribution of j_z/j_c vs. j_p/j_c of the same galaxy shown in Figure 1. A spheroidal and a disky component are clearly visible. The color bar indicates the number of stellar particles in each bin. We fit the distribution of particles with increasing numbers of Gaussian components, from 2 to 9, as labeled in the upper-left corner of each panel. The Gaussian components found by GMM are represented by ellipses, using the same color scheme as in Figure 1.
Download figure:
Standard image High-resolution image

We fit the kinematic phase space with GMM, varying the number of Gaussian components n_c from 2 to 9. In each case, the fit is performed 10 times with different initializations by setting the keyword n_init = 10. We emphasize that running enough initializations is very important to obtain a stable fit. All initial parameters are generated with the k-means algorithm.

Figures 1 and 2 use colored ellipses to represent the 63% confidence ellipse of each Gaussian distribution obtained by the GMM fit. A disky and a spheroidal component can be roughly represented by setting n_c = 2 (top-left panel), but the kinematics of the galaxy are apparently more complex than such a simple, conventional bulge+disk decomposition. As expected, the fit improves by adding more components, but the data clearly become overfit when n_c ≥ 9. The kinematics of this galaxy are well reproduced with n_c = 5–8. Both the bulge (magenta ellipses) and disky components (j_z/j_c > 0.7; blue ellipses) are well fitted, while the halo breaks up into multiple substructures (cyan ellipses) with increasing n_c. The halo and bulge components show no distinct separation in the j_z/j_c versus j_p/j_c diagram. Thus, components identified purely in the j_z/j_c versus j_p/j_c plane are not as robust as those decomposed in the j_z/j_c versus $e/| e{| }_{\max }$ plane.

2.3. Bayesian Information Criterion

Instead of artificially choosing n_c, we derive a modified version of the Bayesian information criterion (BIC) for selecting n_c in GMM. The BIC, developed by Schwarz (1978) and widely used in analysis of clustering data, allows the user to infer an approximate posterior distribution over the parameters of a Gaussian mixture distribution. Its formal definition is

$\begin{eqnarray}&&\mathrm{BIC}=-2n\cdot \mathrm{ln}(\widehat{L})+k\cdot \mathrm{ln}(n),\end{eqnarray} \tag{ 1 }$

where ln $(\widehat{L})$ is the average log-likelihood of a given data set, n is the number of data points, and k is the number of free parameters to be estimated. Because the geometry of each Gaussian distribution is fully relaxed by allowing a free 3D covariance matrix, GMM adds 10 extra free parameters (1 weight, 3 means, and 6 covariances) for each additional Gaussian component (i.e., k = 10n_c). The BIC is a decreasing function of ln $(\widehat{L})$ and an increasing function of k. Hence, the second term on the right-hand side of Equation (1) is a penalty for the number of parameters introduced in the fit and serves to limit overfitting. A model having a smaller BIC is preferred, which implies either fewer free parameters or a better fit.

The mean BIC of each data point is

$\begin{eqnarray}&&\widehat{\mathrm{BIC}}(n,{n}_{c})=\displaystyle \frac{\mathrm{BIC}}{n}=-2\mathrm{ln}(\widehat{L}({n}_{c}))+\displaystyle \frac{10{n}_{c}\mathrm{ln}(n)}{n}.\end{eqnarray} \tag{ 2 }$

This form is more meaningful, as $\widehat{\mathrm{BIC}}$ quantifies how good a model is for each single stellar particle. However, it is not completely independent of n. We vary n_c from 2 to 15. Because n is generally rather large (≳10⁵), the penalty term is ≲0.01, estimated from the case of n = 10⁵ and n_c = 10. As a consequence, we cannot see a clear minimum $\widehat{\mathrm{BIC}};$ instead, $\widehat{\mathrm{BIC}}$ approaches an asymptotic value that changes little for n_c > 10. Additionally, as suggested in Section 2.2, using more than 10 Gaussian components in the fit is not well motivated physically. We define

$\begin{eqnarray}&&{\rm{\Delta }}\widehat{\mathrm{BIC}}=\widehat{\mathrm{BIC}}-{\widehat{\mathrm{BIC}}}_{\min },\end{eqnarray} \tag{ 3 }$

where ${\widehat{\mathrm{BIC}}}_{\min }={\sum }_{{n}_{c}=11}^{15}\widehat{\mathrm{BIC}}({n}_{c})/5$ is the mean value of $\widehat{\mathrm{BIC}}({n}_{c}\gt 10)$ . ${\rm{\Delta }}\widehat{\mathrm{BIC}}$ of every galaxy asymptotically reaches ∼0. The number of components can be chosen as the minimum value that satisfies ${\rm{\Delta }}\widehat{\mathrm{BIC}}\lt {C}_{\mathrm{BIC}}$ , where C_BIC is our criterion for a reasonable GMM model. The choice of C_BIC will be discussed in the following section.

Our approach of combining GMM with BIC, which we call auto-GMM, takes advantage of the unsupervised nature of GMM and allows n_c to be inferred objectively and automatically from the data, with no additional assumptions imposed.

3. Application of Auto-GMM to Prototype Galaxies from `IllustrisTNG`

Auto-GMM allows us to decompose galaxies automatically and efficiently, making it a powerful tool for large data sets. In order to test the efficiency of this method, we apply it to prototype galaxies at redshift 0 from the TNG100.

3.1. C_BIC Inferred from the `IllustrisTNG` Galaxies

The criterion C_BIC is the only parameter that needs to be chosen artificially when auto-GMM is used. A proper C_BIC can be inferred statistically from the large sample of galaxies in IllustrisTNG. To ensure that the galaxies have meaningful, well-resolved structures, we only use galaxies with stellar masses that exceed 10¹⁰ M_⊙, which corresponds to >10⁴ stellar particles. For each star, we specify the parameter ${\kappa }_{\mathrm{rot}}={v}_{\phi }^{2}/{v}^{2}$ , which measures the relative importance of its kinetic energy in ordered rotation. Then, the average value of this quantity for each galaxy, which gives an indication of its morphology and kinematics, is ${K}_{\mathrm{rot}}={\sum }_{i}{m}_{i}{\kappa }_{i,\mathrm{rot}}/{M}_{\star }$ (Sales et al. 2010), where m_i represents the mass of particle i and M_⋆ is the total stellar mass of the system. More massive galaxies become increasingly dominated by random motions, such that K_rot ≈ 0.3 for M_⋆ ≳ 10¹¹ M_⊙ (Figure 3; top panel). The mass ratio of spheroids f_sph, estimated by summing up stars with κ_rot < 0.5, likely increases with increasing M_⋆ (Figure 3; bottom panel). Both K_rot and f_sph are kinematic indicators of the morphology of the galaxies. All the parameters above are calculated using the stars of radius <30 kpc. In Figure 3, all galaxies of stellar mass ≥10¹⁰ M_⊙ are included, but only unbarred galaxies satisfying K_rot ≥ 0.5 are selected for inferring the C_BIC of disk galaxies. We regard K_rot = 0.5 as the criterion to separate elliptical and disk galaxies.

**Figure 3.** Fraction of the kinetic energy in ordered rotation, K_rot, of the stellar particles (top) and the fraction of the stellar mass in the spheroidal component (bottom), f_sph, as a function of stellar mass, M_⋆, for the TNG100 galaxies. Here all 6503 galaxies, including barred galaxies, are shown. Five prototypes are marked with solid dots, classified into three disk galaxies (D1, D2, D3) and two ellipticals (E1, E2) with the criterion K_rot = 0.5 (dashed lines). The color bar represents the number of galaxies in each bin. A total of 2994 unbarred disk (K_rot ≥ 0.5) galaxies are used for further analysis in Sections 3.1 and 3.2.
Download figure:
Standard image High-resolution image

A massive, long bar complicates the kinematic decomposition (see discussion in Section 3.4). D. Zhao et al. (2019, in preparation) find that a significant fraction of local disk galaxies in the TNG100 have formed a bar. We only focus on unbarred galaxies here, in order to obtain a clean result. The sample of unbarred galaxies is selected using their maximum ellipticity obtained from isophotal analysis of face-on images. Following standard convention (e.g., Marinova & Jogee 2007), a galaxy is considered unbarred if the maximum ellipticity is less than 0.25. We obtain a total of 2994 unbarred disk galaxies. This selected sample of unbarred disk galaxies is expected to have regular disky and spheroidal structures.

The ${\rm{\Delta }}\widehat{\mathrm{BIC}}$ profiles of the selected galaxies are shown in Figure 4. We vary the criterion C_BIC from 0.05 (blue dashed line) to 0.15 (red dashed line); the corresponding distribution of n_c obtained with each C_BIC is shown in Figure 5. It is apparent that C_BIC = 0.05 gives an unreasonable number of components (n_c ≈ 8–11) for most galaxies. Both C_BIC = 0.1 and 0.15 yield a reasonable number of components (n_c ≈ 4–8). In general, C_BIC = 0.1 results in one or two more components compared to C_BIC = 0.15.

**Figure 5.** Distribution of the number of components n_c chosen by criterion C_BIC = 0.05, 0.1, and 0.15 for the same galaxies shown in Figure 4.
Download figure:
Standard image High-resolution image

3.2. Intrinsic Structures Found in Unbarred Disk Galaxies from `IllustrisTNG`

A large library of GMM components in unbarred disk galaxies is built up by applying auto-GMM to TNG100. Then we need to associate these components to structures with which we are familiar from observations. Visual classification is not feasible for a large sample of galaxies. One reasonable way to classify GMM components automatically is by setting appropriate criteria on the mean values of j_z/j_c, j_p/j_c, and $e/| e{| }_{\max }$ of each component. The components belonging to the same structure should have similar properties, and hence should also cluster in kinematic phase space. Here we define the kinematic phase space of $\langle {j}_{z}/{j}_{c}\rangle$ , $\langle {j}_{p}/{j}_{c}\rangle$ , and $\langle e/| e{| }_{\max }\rangle$ as the mass-weighted mean values of j_z/j_c, j_p/j_c, and $e/| e{| }_{\max }$ , respectively, of each Gaussian component.

The 2D histogram of mean circularity $\langle {j}_{z}/{j}_{c}\rangle$ versus mean rescaled energy $\langle e/| e{| }_{\max }\rangle$ of all components of the unbarred disk galaxies is shown in the left panel of Figure 6. There are four clear, distinguishable clusters that are likely to correspond to intrinsic structures. We can easily classify the components into spheroidal and disky structures by setting a threshold circularity criterion $\langle {j}_{z}/{j}_{c}\rangle =0.5$ (thick dashed line). The spheroidal components can be classified further into bulges and halos by the criterion $\langle e/| e{| }_{\max }\rangle =-0.75$ (horizontal dashed line), while the disky components can be classified into cold disks and warm disks by $\langle {j}_{z}/{j}_{c}\rangle =0.85$ . Here j_p/j_c is not used in the classification, as it generally has quite a broad distribution for spheroids. To some extent, this is reasonable because spheroidal components may be composed of stars moving on highly radial orbits and in misaligned rotating orbits. The above strategy directly uses the data to statistically classify the components found by auto-GMM.

The middle and right panels of Figure 6 plot the contours of the number distribution of the components. The four series of contours correspond to the four kinds of structures. Both the cold and warm disks also cluster well in $\langle {j}_{p}/{j}_{c}\rangle$ , while halos and bulges have two sub-groups of $\langle {j}_{p}/{j}_{c}\rangle \approx 0.5$ and ∼0.2, respectively. For spheroids, the components with $\langle {j}_{p}/{j}_{c}\rangle \approx 0.2$ are dominated by radial motions, while those with $\langle {j}_{p}/{j}_{c}\rangle \approx 0.5$ have significant, but misaligned, rotation.

We have demonstrated that the statistical results can be used to objectively infer the intrinsic structures of galaxies. The application of auto-GMM not only can decompose galaxies but also classify components in a completely automatic way. A few illustrative examples are shown in the next section.

3.3. Examples of Auto-GMM Fits

We choose a few prototype galaxies with diverse morphological and kinematic properties to test the performance of auto-GMM. In all cases, we adopt C_BIC = 0.1. The five prototypes—three disk galaxies (D1, D2, D3) and two ellipticals (E1, E2)—all have the same stellar mass (10^10.5 M_⊙) but cover a range of K_rot (∼0.35–0.7) and f_sph (∼0.25–0.75).

Figures 7–11 show the fits for models D1, D2, D3, E1, and E2, respectively. The first row shows diagnostic plots of j_z/j_c, j_p/j_c, and $e/| e{| }_{\max }$ , with 63% confidence ellipses of all Gaussian components overlaid. The crosses mark their means. The 3D kinematic phase space of the five prototypes are well fit. From the second to the fourth row, we show, respectively, the face-on surface density, the edge-on surface density, and the line-of-sight velocity distribution for the edge-on view. Based on their properties, we classify the best-fit components into the structural families presented in Section 3.2: cold disk, warm disk, bulge, and halo. Their corresponding mass fractions are also given. Note that each identified structure can contain more than one component (e.g., two cold disk components in D1; two bulge components in D3), and we do not ascribe any particular interpretation to the physical nature of such substructures here.

**Figure 7.** Model D1. The top row shows the kinematic phase space of j_z/j_c, j_p/j_c, and $e/| e{| }_{\max }$ and the Gaussian components found using auto-GMM. The TNG100 ID and the number of components are labeled in the first panel. The log-scale color bars of the top panels show the number of stars per bin in phase space. Seven components are found by auto-GMM using C_BIC = 0.1. Their 63% confidence ellipses are overlaid, whose corresponding means are marked with crosses. The bottom three rows show the face-on and edge-on surface density distributions and the edge-on line-of-sight velocity distribution, respectively, of each component. These components are titled according to visual classification, and their corresponding mass fractions are labeled, using the same colors as those of the ellipses in the top row. For the line-of-sight velocity distribution, only bins having more than five particles are shown. The dimensions of the x and y axes are 60 kpc.
Download figure:
Standard image High-resolution image

**Figure 8.** Model D2. Seven components are found by auto-GMM. The figure uses the same conventions as Figure 7.
Download figure:
Standard image High-resolution image

**Figure 9.** Model D3. Seven components are found by auto-GMM. The figure uses the same conventions as Figure 7.
Download figure:
Standard image High-resolution image

**Figure 10.** Model E1. Seven components are found by auto-GMM. The figure uses the same conventions as Figure 7.
Download figure:
Standard image High-resolution image

**Figure 11.** Model E2. Seven components are found by auto-GMM. The figure uses the same conventions as Figure 7.
Download figure:
Standard image High-resolution image

Models D1 (Figure 7) and D2 (Figure 8) are largely dominated by disky structures. Components likely associated with cold disks and warm disks contribute 79% to the total stellar mass of D1 and 55% to the total stellar mass of D2. Only a small fraction of the mass in D1 arises from spheroidal components that we attribute to a bulge and halo. The two distinct "cold disk" components seen in the j_z/j_c versus $e/| e{| }_{\max }$ plot might share the same origin, with one portion being slightly dynamically hotter than the other, or they might originate from different gas accretion events. The two substructures of the bulge in D1 have similar compactness and weak rotation, differing principally only in their non-azimuthal angular momentum j_p. Such a difference is ignored in our classification, and they are considered as substructures of the same bulge.

Model D3 clearly has much more massive spheroidal components (Figure 9). Its disky components, including a cold (23%) and a warm (9%) disk, contribute only about one-third of the total stellar mass. The spheroidal components are impressively prominent. A clear pattern of rotation is still evident in the kinematic phase plot of D3. By contrast, the diagram of j_z/j_c versus $e/| e{| }_{\max }$ for model E1 is much more irregular and exhibits far fewer meaningful features (Figure 10). Violent mergers may have erased much of the substructure. Some mild rotation still exists, with K_rot ≃ 0.45, arising mostly from an intermediate-scale disky structure. Model E1 resembles moderate-mass ellipticals, which typically have disky isophotes and moderate rotation (e.g., Kormendy et al. 2009). Note that models E1 and D3 actually have similar rotation. However, D3 still maintains a clear disky morphology while E1 is quite spheroidal. E1 might be regarded as a lenticular galaxy given its mild rotation.

Cases with even lower values of K_rot are largely dominated by random motions. Model E2 is a typical elliptical galaxy with extremely weak rotation (K_rot ≃ 0.35). The pattern of E2's kinematic phase space is more regular than that of E1. The bulges and halos classified by the criteria from the sample of unbarred disk galaxies correspond to a compact nuclear component and a diffuse envelope, respectively. A tidal tail due to a recent minor merger is still visible in the halo. The differences between E1 and E2 indicate that they may have experienced different assembly histories. Auto-GMM is also able to decompose typical elliptical galaxies, such as E2. However, many elliptical galaxies have a featureless kinematic phase space (e.g., E1). There is no proper way to model a featureless distribution even with multiple Gaussians. Thus, physically meaningless multiple Gaussians will be used to recover the data. Care is required in applying the auto-GMM method to elliptical galaxies.

It is worth emphasizing that the relation between the structures decomposed by kinematics and those from morphological observations is still unclear. The morphologies of the structures found by auto-GMM here are roughly consistent with our expectations of thin disks, thick disks, bulges, and halos. However, there are some essential differences. On the one hand, the bulge defined in kinematics is the tightly bound/compact part of spheroids, while the halo is the diffuse part. Halos do contribute to the central density, which is indistinguishable in observations of most of external galaxies. Thus, the inner part of kinematic halos will be considered as part of (classical) bulges in observations. Whether bulges and halos are formed in the same way, namely through mergers, is beyond the scope of this paper. On the other hand, warm disks may be related to thick disks and pseudo bulges in observations, and may have formed via very diverse pathways. Forthcoming papers will statistically investigate the properties and evolution of the structures identified here.

3.4. The Failure of Auto-GMM Fits in Barred Galaxies

Particles moving on bar orbits have complex kinematics that are unlikely to be well described by the phase space of j_z/j_c, j_p/j_c, and $e/| e{| }_{\max }$ . Figure 12 shows an example of an auto-GMM fit of a typical barred galaxy from IllustrisTNG. At a given radius, particles moving on bar orbits rotate more slowly compared with those on circular orbits, and j_z/j_c decreases gradually with decreasing $e/| e{| }_{\max }$ . As a consequence, bar particles significantly pollute the components having moderate rotation, such as warm disks. At the same time, bar particles with j_z/j_c < 0.3 may also influence significantly the kinematic decomposition of slowly rotating components, probably even bulges. Under this circumstance, the mass of the bulge is clearly overestimated. Bar particles do not cluster well in this kinematic phase space, as shown in the first row of Figure 12. They instead drive significant mixture in the kinematic phase space between disks and spheroids. Therefore, the auto-GMM method fails to reliably decompose barred galaxies.

4. Summary

We have described an automated method, auto-GMM, that generalizes the GMMs to decompose the stellar kinematics of simulated unbarred galaxies. A modified version of the BIC is used to infer the optimal number of statistically significant Gaussian components to fit the data.

We demonstrate that the simulated galaxies display rich substructures that can be identified and decomposed effectively by auto-GMM in the kinematic phase space of the stellar particles. Each substructure is a 3D Gaussian component. The substructures belonging to the same structure also cluster in the diagram of the mean circularity versus the compactness (rescaled energy) of the Gaussian components. Taking advantage of a large sample of galaxies in the cosmological simulation IllustrisTNG, four kinds of intrinsic structures are identified: cold disks, warm disks, bulges, and halos. While the present study does not ascribe any rigorous physical interpretation to the decomposed individual components, we illustrate the power of the auto-GMM method to isolate features that can be plausibly associated with morphological components (cold disk, warm disk, bulge, halo) traditionally associated with structures in the Hubble sequence of galaxies.

Our proposed method is automated, fast, and effective. It is a powerful tool to analyze a large data set of galaxies from cosmological simulations to gain insights into the origin and nature of galaxy structure. In forthcoming work, we will statistically investigate the properties of structures in thousands of galaxies from IllutrisTNG. We hope that the results can help interpret observations and provide more insight into the formation and evolution of real galaxies.

This work was supported by the National Science Foundation of China (11721303) and the National Key R&D Program of China (2016YFA0400702). M.D. is also supported by the grants "National Postdoctoral Program for Innovative Talents" (#8201400810) and "Postdoctoral Science Foundation of China" (#8201400927) from the China Postdoctoral Science Foundation. D.Y.Z. and J.S. acknowledge the support by the Peking University Boya Fellowship. V.P.D. was supported by STFC Consolidated grant ST/R000786/1. The TNG100 simulation used in this work, one of the flagship runs of the IllustrisTNG project, has been run on the HazelHen Cray XC40-system at the High Performance Computing Center Stuttgart as part of project GCS-ILLU of the Gauss centers for Supercomputing (GCS). The authors thanks all the IllustrisTNG team for making the IllustrisTNG data available to us prior to the public release. We thank the anonymous referee for valuable comments. We also thank Dandan Xu and Aura Obreja for constructive discussions. This work is highly supported by the High-performance Computing Platform of Peking University, China. The analysis was performed using Pynbody (Pontzen et al. 2013).

Identifying Kinematic Structures in Simulated Galaxies Using Unsupervised Machine Learning

Article metrics

Permissions

Author e-mails

Author affiliations

ORCID iDs

Dates

Abstract

1. Introduction