
Estimating Photometric Redshifts Using Support Vector Machines

Published 2004 December 29. © 2004 The Astronomical Society of the Pacific. All rights reserved. Printed in U.S.A.
Citation: Yogesh Wadadekar 2005 PASP 117 79. DOI: 10.1086/427710


ABSTRACT

We present a new approach to obtaining photometric redshifts using a kernel learning technique called "support vector machines." Unlike traditional spectral energy distribution fitting, this technique requires a large and representative training set. When one is available, however, it is likely to produce results that are comparable to the best results obtained using template fitting and artificial neural networks. Additional photometric parameters such as morphology, size, and surface brightness can be easily incorporated. The technique is demonstrated using samples of galaxies from the Sloan Digital Sky Survey Data Release 2 and the hybrid galaxy formation code GalICS. The rms error in redshift estimation is below 0.03 for both samples. The strengths and limitations of the technique are assessed.


1. INTRODUCTION

In the coming decade, ongoing and planned surveys will lead to an exponential increase in the quality and quantity of data available to the astronomical community. Efficient and sensitive imaging and spectroscopic surveys, such as the Sloan Digital Sky Survey (SDSS; York et al. 2000), the VLT/VIRMOS survey (Le Fèvre et al. 2003), the VST survey, the Keck DEEP2 survey (Davis et al. 2003), and several others will enable observational cosmologists to map, with great accuracy and detail, the structure and evolution of the universe. For an effective analysis of these next‐generation data sets, a wide variety of new tools will need to be developed; an accurate and efficient redshift estimator is an important step in this effort.

In spite of the recent spectacular advances in multiobject spectroscopy, photometric methods for redshift estimation provide the most efficient use of telescope time for estimating redshifts of large numbers of galaxies. There are two broad approaches to the determination of photometric redshifts. In the spectral energy distribution (SED) fitting technique (e.g., Koo 1985; Sawicki et al. 1997; Fernández-Soto et al. 1999; Fontana et al. 2000), a library of template spectra is used. For redshift determination, each template is redshifted, the appropriate extinction correction is applied, and the resulting colors are compared with the observed ones. Usually, a χ² fit is used to obtain the optimal template/redshift pair for each galaxy. Such techniques are simple to implement and computationally inexpensive on modern computers. Several implementations are publicly available (e.g., HYPERZ; Bolzonella et al. 2000). The techniques in this category differ in their choice of template SEDs and in the fitting procedure. Template SEDs can be derived from population synthesis models (e.g., Bruzual & Charlot 1993) or based on spectra of real objects (e.g., Coleman et al. 1980, hereafter CWW) selected to span a range of galaxy morphologies and luminosities. Both kinds of templates have their failings: template SEDs from population synthesis models might include unrealistic combinations of parameters or exclude known cases, while the real galaxy templates are almost always constructed from data on bright, low-redshift galaxies and may be poor representations of the high-redshift galaxy population.

The alternative empirical best‐fit approach is feasible for a data set in which spectroscopic redshifts are available for a subsample of objects. In such cases, the spectroscopic data can be used to constrain the fit of a polynomial function mapping the photometric data to the redshift (e.g., Connolly et al. 1995; Brunner et al. 1997; Wang et al. 1998). The disadvantage of this approach is that it cannot be applied to purely photometric data sets. Additionally, it cannot easily be extrapolated to objects fainter than the spectroscopic limit. This limitation is particularly serious, because it is this regime that is often of the highest interest.

Such techniques have the advantage, however, of being automatically constrained by the properties of galaxies in the real universe and requiring no additional assumptions about their formation and evolution. Given the particular strengths and weaknesses of these interpolative techniques, they are ideally suited to exploit mixed data sets, such as the VLT/VIRMOS survey and the Keck DEEP2 survey, which will provide spectroscopic redshifts for more than 10⁵ galaxies. The SDSS, with its extensive spectroscopy, can also be exploited using such techniques.

Among the interpolative techniques, new possibilities based on machine learning have emerged. Firth et al. (2003), Tagliaferri et al. (2002), Vanzella et al. (2004), and Collister & Lahav (2004) propose methods to estimate the photometric redshift using artificial neural networks (ANNs). The level of accuracy achievable for photometric redshifts with ANNs is comparable to, if not better than, that achievable with SED fitting for cases in which moderately large training sets are available. Nevertheless, neural networks have some disadvantages. Their architecture has to be determined a priori or modified during training by some heuristic, which does not necessarily yield the optimal architecture. Also, neural networks can get stuck in local minima during the training stage. The number of weights depends on the number of layers and the number of nodes in each layer; as the number of layers and/or nodes increases, the training time also increases.

In this paper, we propose the use of a technique from a distinct class of machine learning methods known as kernel learning methods. The method is called support vector machines (SVMs) and, like the ANNs, is only applicable to "mixed" data sets in which a moderately large training set with photometry in the survey filters and spectroscopic redshifts for the same objects is available.

This paper is organized as follows. In § 2 we provide a brief overview of SVMs and an introduction to aspects relevant to this work. In §§ 3 and 4 we apply the technique to data from the SDSS Data Release 2 and GalICS simulations. Section 5 discusses the results and the regime of applicability of this technique.

2. SUPPORT VECTOR MACHINES

SVMs are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space and are trained with learning algorithms from optimization theory that implement a learning bias derived from statistical learning theory.

The input parameters (e.g., broadband colors) form a set of orthogonal vectors that define a hyperspace. Each galaxy would then represent a point in this hyperspace. The basic SVM is a linear classifier; i.e., training examples labeled either "yes" or "no" (e.g., the answer to the question, Is z>1?) are given, and a maximum‐margin hyperplane splits the "yes" and "no" training examples in the hyperspace, such that the distance from the closest examples to the hyperplane (the margin) is maximized. The use of the maximum‐margin hyperplane is motivated by statistical learning theory, which provides a probabilistic test error bound that is minimized when the margin is maximized. The parameters of the maximum‐margin hyperplane are derived by solving a quadratic programming (QP) optimization problem. Several specialized algorithms exist for quickly solving the QP problem that arises in SVMs.

The original optimal hyperplane algorithm was a linear classifier, and thus inapplicable to nonlinear problems. Vapnik (1995) suggested applying Mercer's theorem to the problem of finding maximum‐margin hyperplanes. The theorem states that any positive semidefinite kernel function can be expressed as a dot product in a high‐dimensional feature space. The resulting algorithm is formally similar, except that every dot product (the distance measure) in the feature space is replaced by a nonlinear kernel function operating on the input space. In this way, nonlinear classifiers can be created. The dimensionality of the feature space depends on the kernel function used; e.g., if the kernel used is a radial basis function, the corresponding feature space is a Hilbert space of infinite dimension. Maximum‐margin classifiers are well regularized, so the infinite dimension does not affect the results.

An additional complication involving real data could be that no hyperplane would exist that would cleanly split the "yes" and "no" examples. For such situations, a modified maximum‐margin idea was introduced; the "soft margin" method chooses a hyperplane that splits the examples as cleanly as possible while still maximizing the distance to the nearest cleanly split examples.
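
To make the classification picture above concrete, the short sketch below trains a soft-margin SVM classifier on a synthetic "is z>1?" problem. It is a minimal illustration using the scikit-learn library rather than the SVMTorch implementation employed later in this paper, and the fake "colors" are purely illustrative assumptions, not real photometry.

import numpy as np
from sklearn.svm import SVC

# Synthetic example: two "colors" loosely correlated with redshift (assumed toy data).
rng = np.random.default_rng(0)
n = 2000
z = rng.uniform(0.0, 2.0, n)
colors = np.column_stack([
    0.8 * z + rng.normal(0.0, 0.3, n),
    -0.3 * z + rng.normal(0.0, 0.3, n),
])
labels = (z > 1.0).astype(int)          # the "yes"/"no" training labels

# Soft-margin classifier with a Gaussian (RBF) kernel; C controls how strongly
# margin violations are penalized.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(colors[:1500], labels[:1500])
print("test accuracy:", clf.score(colors[1500:], labels[1500:]))
print("support vectors per class:", clf.n_support_)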

The estimation of a continuous output parameter such as redshift requires the extension of the SVM algorithm to handle regression. The support vector regression algorithm was proposed by Smola (1996). The model produced by support vector classification, as described above, depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by support vector regression depends only on a subset of the training data, because the cost function ignores any training data that lie close to the model prediction (within a threshold ε).
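
The regression variant can be sketched in the same way. In scikit-learn's SVR (again, not the implementation used in this paper) the ε threshold described above corresponds to the epsilon parameter: training points whose residual lies inside the ε tube do not contribute to the loss and are not retained as support vectors. The toy data below are an assumption for illustration only.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 0.5, (2000, 1))                  # stand-in for one photometric input
y = 2.0 * x.ravel() + rng.normal(0.0, 0.02, 2000)     # stand-in "redshift"

model = SVR(kernel="rbf", C=1.0, epsilon=0.02)        # epsilon-insensitive regression
model.fit(x[:1500], y[:1500])
resid = model.predict(x[1500:]) - y[1500:]
print("rms error:", np.sqrt(np.mean(resid**2)))
print("fraction of training points kept as support vectors:",
      len(model.support_) / 1500)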

In the last decade, extensive enhancements to all aspects of the SVM formulation have been introduced, and theoretical developments in the field continue to be extremely active. SVMs are being used in a wide variety of applications, such as text recognition, face identification, weather forecasting, financial market predictions, and gene data analysis. In astronomy, SVMs have recently been used for classifying variable stars (Wozniak et al. 2001), determining galaxy morphology (Humphreys et al. 2001), and distinguishing AGNs from stars and galaxies (Zhang & Zhao 2003).

Unlike ANNs, SVMs do not require the user to choose an architecture before training. Any number of input dimensions can be used. For a detailed explanation of the mathematical underpinnings of SVMs, see Vapnik (1995, 1998). For a more practical introduction, see Cristianini & Shawe‐Taylor (2000).

An important advantage of the SVM is that adding additional input parameters to the classifier leads to only a near‐linear increase in computational costs (Smola 1996). This is an advantage for the problem of estimating photometric redshifts, because one then has the potential to use additional photometric parameters that may be related to redshift (albeit in a very nonlinear manner). Such parameters, which can include size measures such as Petrosian or scale radii from de Vaucouleurs or exponential fits, central surface brightness, or fixed aperture magnitudes, are available in many modern survey catalogs, such as the SDSS catalog. On the other hand, parameters such as the disk‐to‐bulge luminosity ratio, which shows only a weak dependence on redshift, if any, are not likely to be useful in improving the accuracy of redshift estimation.

Several robust software implementations of the SVM algorithm are publicly available, and each has its own set of distinguishing features that make it optimal for a particular type of classification or regression problem. After surveying the capabilities of the available implementations, we decided to use the program SVMTorch II (Collobert & Bengio 2001) for this work. SVMTorch is a C++ implementation that works for both classification and regression problems. It has been specially tailored to large-scale problems (e.g., >20,000 examples, even for input dimensions >100). A special feature of SVMTorch is that it employs a RAM-based cache to store the values of the most used variables of the kernel matrix. The size of the cache used by SVMTorch has to be set by the user, depending on the free memory available.

Before using SVMs, the user needs to choose the kernel that maps the input data space to the feature space, in which the minimization is performed using linear learning machines. Choosing a kernel for the SVM is analogous to choosing the architecture of an ANN. As with ANNs, there is no simple heuristic for making this choice; some experimentation with the input data is essential. Commonly used kernel functions include polynomial, Gaussian, and sigmoidal forms. If these prove ineffective, more elaborate kernels can be utilized. For input data that are not strongly clustered with respect to the output variable, at least one of the above-mentioned kernels should work well. Some experimentation reveals that the Gaussian kernel with a σ of 1.0 gives the best results for our problem. All the results presented in this paper were obtained using a Gaussian kernel.

The second parameter that must be set in SVMTorch is the size of the error pipe. This basically influences the number of training iterations that are performed before the training is considered complete. Decreasing the error pipe width (eps) beyond a certain point increases training time considerably, with only a marginal improvement in the final results. Through trial and error, we determined that a value of eps = 0.02 was appropriate. Unless otherwise noted, this error pipe width has been used throughout this paper.
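
For readers who wish to reproduce these settings with a generic SVM library, a rough scikit-learn analogue of a Gaussian kernel with σ = 1.0 and an error-pipe width of 0.02 is sketched below. SVMTorch's exact kernel parameterization may differ; the gamma = 1/(2σ²) convention used here is an assumption, not a statement about SVMTorch's interface.

from sklearn.svm import SVR

sigma = 1.0
svm = SVR(kernel="rbf",
          gamma=1.0 / (2.0 * sigma**2),   # assumed mapping from the Gaussian width sigma
          epsilon=0.02)                   # the "error pipe" width adopted in this paper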

Once the choice of kernel and termination criterion has been made, the key components of the system are in place. Unlike some other learning systems, SVMs do not require a lengthy series of experiments in which various parameters are tweaked until satisfactory performance is achieved. In many cases, the most straightforward SVM implementations are known to perform as well as other learning techniques, without any need for further adaptation.

With this set of parameters, training on a set of 10,000 objects with five input parameters requires about 2 million iterations, which take about 20 minutes on an Athlon XP 1800+ processor with 512 MB of memory.

3. PHOTOMETRIC REDSHIFT FROM SDSS DATA

The SDSS consortium has publicly released more than 10⁵ spectroscopic galaxy redshifts in the Data Release 2 (DR2). In order to build the training and test sets, we first accessed the current version of the SDSS catalog database (BESTDR2) and selected all objects satisfying the following criteria: (1) the spectroscopic redshift confidence must be greater than 0.95, and there must be no redshift warning flags; (2) 0.01<z<0.5; and (3) r<17.5. These criteria resulted in a galaxy sample totaling 139,000 objects. The order of the catalog was randomized, and nonoverlapping training and testing sets of equal size (10,000 objects) were selected. The input parameters were the dereddened magnitudes in each of the five SDSS filters, and the output parameter was the redshift.
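
A sketch of this selection and split is given below, assuming the relevant BESTDR2 columns have been exported to a local table. The column names (zConf, zWarning, z, r, dered_u ... dered_z) follow SDSS naming conventions but should be treated as assumptions about that hypothetical export, and the file name is a placeholder.

import pandas as pd

cat = pd.read_csv("sdss_dr2_specphoto.csv")          # hypothetical local export of BESTDR2

mask = (
    (cat["zConf"] > 0.95) & (cat["zWarning"] == 0)   # reliable spectroscopic redshift
    & (cat["z"] > 0.01) & (cat["z"] < 0.5)
    & (cat["r"] < 17.5)
)
sample = cat[mask].sample(frac=1.0, random_state=0)  # randomize the catalog order

bands = ["dered_u", "dered_g", "dered_r", "dered_i", "dered_z"]
train, test = sample.iloc[:10000], sample.iloc[10000:20000]
X_train, y_train = train[bands].to_numpy(), train["z"].to_numpy()
X_test, y_test = test[bands].to_numpy(), test["z"].to_numpy()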

Figure 1 plots the SVM estimated photometric redshift against the spectroscopic redshift for each galaxy in the test set. The rms deviation σ_rms = ⟨(z_phot − z_spec)²⟩^(1/2) = 0.027. The mean deviation is 0.0006. The number of outliers is small, and there are no obvious systematic deviations. There is some tendency for redshift to be overestimated by the SVM for z<0.05. In addition, the scatter is larger for z>0.25. Both of these effects are caused by the presence of fewer training examples in the sample at very low z (because the cosmic volume sampled is small) and at higher z, where incompleteness sets in.

Fig. 1.— Comparison of photometric and spectroscopic redshifts using SDSS DR2 data. A training set of 10,000 objects was used. The SVMs were tested using a nonoverlapping set of 10,000 objects (plotted). The rms shown is that of Δz = z_spec − z_phot.
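
Continuing the sketch that built X_train and X_test above, training the regression machine on the five dereddened magnitudes and computing σ_rms and the mean deviation takes only a few lines. The gamma value encodes the assumed σ = 1 Gaussian-kernel mapping noted earlier.

import numpy as np
from sklearn.svm import SVR

svm = SVR(kernel="rbf", gamma=0.5, epsilon=0.02)   # gamma = 1/(2*sigma^2) with sigma = 1 (assumed mapping)
svm.fit(X_train, y_train)
z_phot = svm.predict(X_test)

dz = z_phot - y_test
print("sigma_rms      =", np.sqrt(np.mean(dz**2)))
print("mean deviation =", np.mean(dz))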

For galaxies with z<0.05, the relative fraction of training examples can be increased substantially by constructing a separate training set that is uniformly sampled in redshift. From our galaxy sample of 139,000 objects, we constructed nonoverlapping training and test sets, with the additional constraint that there should be an equal number of galaxies in each redshift bin of width 0.025. Our new training set had 7310 examples, and the test set had 7300 examples. The rms deviation was 0.029, somewhat larger than in the nonuniformly sampled case, presumably because we have fewer galaxies in our training sample. However, when we compared only those galaxies having z<0.05, we found that the rms deviation decreased from 0.031 to 0.029 in going from the nonuniformly sampled training/test sets to the uniformly sampled ones.
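
One way to build such a redshift-uniform set, continuing the pandas sketch above, is to bin the sample in redshift and draw the same number of galaxies from every bin. The per-bin cap below is set by the most poorly populated bin and is purely illustrative; the sizes quoted in the text (7310/7300 objects) depend on the catalog actually used.

import numpy as np
import pandas as pd

edges = np.arange(0.01, 0.5 + 0.025, 0.025)          # redshift bins of width 0.025
binned = sample.assign(zbin=pd.cut(sample["z"], edges))

grouped = binned.groupby("zbin", observed=True)
n_per_bin = grouped.size().min()                     # equal occupancy, capped by the emptiest bin
uniform = grouped.sample(n=n_per_bin, random_state=0)
uniform = uniform.sample(frac=1.0, random_state=1)   # reshuffle before splitting

# Nonoverlapping, redshift-uniform training and test halves.
half = len(uniform) // 2
uniform_train, uniform_test = uniform.iloc[:half], uniform.iloc[half:]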

In Table 1, we compare the rms value obtained using SVMs with those obtained by Csabai et al. (2003) and Collister & Lahav (2004) using a variety of SED template and empirical best-fit techniques on a similar sample of galaxies drawn from the SDSS Early Data Release. The SVM approach is clearly better than the template-fitting techniques (e.g., CWW and Bruzual-Charlot [1993]) and is nearly as good as the best empirical best-fit approach (ANNz).

3.1. Using Additional Input Parameters

One advantage of the empirical best‐fit approach to photometric redshift estimation is that additional parameters that can help in estimating the redshift can be easily incorporated as additional input columns. However, these parameters need to be chosen carefully such that they have a genuine dependence on the redshift. We found that choosing inappropriate parameters that have no obvious redshift dependence (e.g., galaxy ellipticity) leads to larger scatter in redshift estimation.

To illustrate this capability, we included the r‐band 50% and 90% Petrosian flux radii of our SDSS training sample as additional inputs to the SVM. These are the angular radii containing the stated fraction of the Petrosian flux. Each of these radii is a measure of the angular size of the galaxy, which is a redshift‐dependent property. Their ratio defines the "concentration index" of the galaxy, which is a measure of the steepness of its light profile. The index is (weakly) correlated with galaxy morphology (and therefore color).

The SVM was retrained with seven input parameters (the five filter magnitudes and the two Petrosian radii) for the 10,000 galaxies in our original training sample. When tested on the test set, the SVM produced a redshift estimate with an rms error of σ_rms = 0.023. This represents a nearly 15% improvement in the accuracy of the redshift estimation, and it illustrates how additional parameters can be incorporated with little effort.
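
In the running sketch, adding the two size parameters amounts to extending the feature list. The column names petroR50_r and petroR90_r follow SDSS naming but, like the rest of the hypothetical exported table, are assumptions.

import numpy as np
from sklearn.svm import SVR

features = bands + ["petroR50_r", "petroR90_r"]      # five magnitudes plus two Petrosian radii

svm7 = SVR(kernel="rbf", gamma=0.5, epsilon=0.02)
svm7.fit(train[features].to_numpy(), y_train)
dz7 = svm7.predict(test[features].to_numpy()) - y_test
print("sigma_rms with 7 inputs:", np.sqrt(np.mean(dz7**2)))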

3.2. Using Smaller Training Sets

The SDSS includes spectroscopic follow‐up for a substantial number of galaxies detected by its photometric component. Most other surveys lack the necessary resources for such extensive spectroscopic follow‐up. In such situations, large samples for SVM training will not be available. It is important, therefore, to explore the effectiveness of the SVM on smaller training sets.

We constructed training and test sets that were 1/10 and 1/100 the size of our original sample sets of 10,000 objects. When the SVM was run on these smaller data sets, the rms errors were σ_rms = 0.036 and 0.049, respectively. Clearly, 100 training examples chosen at random are insufficient to encapsulate the diversity of the SDSS; even 1000 galaxies are not quite good enough. This deterioration of SVM performance with smaller sample sizes is somewhat more severe than that observed by Collister & Lahav (2004) with their ANN-based approach.
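
The corresponding experiment in the running sketch is a simple loop over training-set sizes. For brevity it reuses the full 10,000-object test set rather than shrinking the test set as well, so it only approximates the setup described above.

import numpy as np
from sklearn.svm import SVR

for n in (1000, 100):                                # 1/10 and 1/100 of the original training set
    small = SVR(kernel="rbf", gamma=0.5, epsilon=0.02)
    small.fit(X_train[:n], y_train[:n])
    dz_n = small.predict(X_test) - y_test
    print(f"{n} training examples -> sigma_rms = {np.sqrt(np.mean(dz_n**2)):.3f}")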

3.3. Spectral Class from SDSS Data

SED template matching techniques provide useful supplementary information by assigning a spectral type to the galaxy, based on the type of the best‐fit galaxy SED. Firth et al. (2003) and Collister & Lahav (2004) have demonstrated how ANNs can be used to determine galaxy spectral type from broadband photometry.

The spectroscopic catalog of the SDSS includes a continuous parameter, eClass, that indicates a spectral type deduced from an analysis of the galaxy spectrum and ranges from about −0.5 (early type) to 1 (late type). We trained the SVM with the same 10,000 galaxies used for redshift estimation, using eClass as the output parameter in place of the redshift. When tested with the original test sample, the eClass was estimated with an rms error of σ_rms = 0.057 (Fig. 2). The error is comparable to that obtained by Collister & Lahav (2004) on their sample of SDSS galaxies (σ_rms = 0.052).

Fig. 2.— Results from using SVMs to predict the spectral type, as measured by the eClass parameter, for 10,000 galaxies from the SDSS DR2.
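
Within the running sketch, switching the regression target from redshift to spectral type is a one-column change; the eClass column is again an assumption about the hypothetical exported table.

import numpy as np
from sklearn.svm import SVR

svm_ec = SVR(kernel="rbf", gamma=0.5, epsilon=0.02)
svm_ec.fit(X_train, train["eClass"].to_numpy())      # same inputs, eClass as the output
d_ec = svm_ec.predict(X_test) - test["eClass"].to_numpy()
print("sigma_rms(eClass) =", np.sqrt(np.mean(d_ec**2)))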

4. PHOTOMETRIC REDSHIFT FROM GalICS SIMULATIONS

One of the limitations of the SDSS survey is that it contains very few galaxies with redshift z>0.4 or so. Training (and test) sets of adequate size beyond this redshift will be difficult to obtain, even when the survey is completed. In order to test the performance of SVMs beyond this redshift, one must use galaxy magnitudes computed from simulated models of the high‐redshift universe.

To generate a mock galaxy catalog to train and test SVMs, the hybrid numerical and semianalytic model GalICS (Galaxies in Cosmological Simulations) was used. In the model, dark matter evolution is traced using numerical simulations, and galaxy formation within the dark matter halos is treated using semianalytic recipes. These recipes attempt to parameterize galaxy formation within the framework of the hierarchical paradigm of galaxy formation. Once the distribution and physical properties of the baryons have been determined, a set of models is used to calculate the amount of light they produce. Luminosities at different wavelengths are calculated from stellar synthesis models that take into account the metallicity and age of the stellar population. Geometry- and metallicity-dependent models for the absorption and reemission of starlight by the dust and gas in the interstellar medium are included. Various lines of sight through the simulation box generate mock galaxy catalogs that have a realistic distribution of galaxy types, luminosities, colors, and redshifts. The GalICS model is described by Hatton et al. (2003). The mock galaxy catalog generation process is described by Blaizot et al. (2004).

From the GalICS 1 database (accessible through the Mock Map Facility [MoMaF]), we selected 6965 objects brighter than r_AB = 21.5 distributed over 1 deg² of sky. At this magnitude limit, GalICS is nearly free of incompleteness introduced by its limited mass resolution at high redshifts. At the same time, the limit is faint enough to allow us to obtain reasonably large training and test sets out to a redshift of z∼1. As with the real SDSS data, the input parameters were just the ugriz magnitudes in the five Sloan filters, and the output parameter was the redshift. Our training and test sets were constructed by splitting the GalICS sample into 3483 and 3482 objects, respectively. The training was done with an error pipe width of eps = 0.01. The somewhat lower choice of the eps parameter was motivated by the absence of photometric noise in the mock galaxy catalog.

Figure 3 plots the SVM estimated photometric redshift for the test sample against the redshift in the GalICS model. The rms scatter is σ_rms = 0.026. There are no systematic deviations and virtually no outliers. The near-complete absence of outliers is probably due to the fact that GalICS spectra are based on a restricted set of model templates.

Fig. 3.— Photometric redshift vs. GalICS model redshifts, using 3482 test set galaxies. A nonoverlapping set of 3483 galaxies was used for training.

5. DISCUSSION

The SVM technique presented here provides comparable performance to ANN‐based techniques. The major advantage over ANNs is that SVMs require less effort in training; e.g., with ANNs, the researcher has to make decisions about the optimal network architecture (number of layers, number of nodes in each layer, committee of networks required to minimize network variance, etc.). More complex network architectures have more free parameters (weights) and therefore allow a closer fit to the data, but are subject to the danger of overfitting. In addition, adding layers or nodes to the network leads to an increase in training time. The challenge for the ANN expert is to find the simplest possible architecture that will provide satisfactory results. SVM simplifies this process by replacing the "choice of architecture" problem with one of "choice of kernel" (and associated kernel function parameters). As we have seen, even simple kernel functions such as a Gaussian give a performance comparable to that obtained with finely tuned ANNs.

Like ANNs, the technique is well suited only for cases in which a training set is available that has the same general characteristics (of magnitude and color distributions) as the sample for which photometric redshifts are to be determined. This means that extrapolation beyond the spectroscopic limit is not straightforward, except in situations in which the color characteristics of the fainter population are independently constrained; e.g., Collister & Lahav (2004) show how photometric redshifts can be obtained for faint luminous red galaxies that are about a magnitude fainter than the spectroscopic limit, using a neural network trained on their brighter, lower-redshift counterparts. In this case, some extrapolation is possible, because these early-type galaxies show little spectral evolution with redshift. For a less constrained galaxy sample, such extrapolation is not possible. Operationally, this implies that spectroscopic follow-up that covers a limited area of sky but is very deep is preferable to a follow-up program that is shallow but covers a wider area. Of course, this may not be compatible with other aims of the survey. Secondly, the size of the training set has to be sufficiently large. To obtain redshifts in the range z = 0–3, we estimate that about 10⁴ galaxies in the training set will be needed, with photometric data in five optical bands. It may be possible to reduce the size of the training set if broadband photometry in additional filters is available. Ideally, the galaxies in the training set should be properly distributed in redshift bins. This never occurs in a flux-limited survey. Consequently, results at higher z, where the training examples are less numerous, will have larger errors in redshift estimation. At high redshifts, a combination scheme of independent photometric redshift estimation through SED fitting and SVM/ANN techniques may help restrict outliers to genuinely astrophysically distinct galaxies.

It must be noted that the rms errors in the redshift estimation reported in this paper apply only to the test set as a whole. Error bars on individual galaxy redshifts are not available. So although the redshifts are likely to be estimated correctly on average, the redshift of a particular galaxy may be off by a larger amount. This shortcoming applies to most "empirical best fit" types of approaches. These techniques are thus not well suited to finding rare objects in the test set that do not have numerous corresponding examples in the training set. On the other hand, they are well suited to problems that require the redshift distribution rather than accurate redshifts of individual galaxies (e.g., for mapping large-scale structure).

In principle, it is possible to train SVMs to the depth desired by using simulated catalogs from models such as GalICS, and then to apply the trained SVMs to photometric redshift estimation on real data. Such an approach with ANNs has been taken by Vanzella et al. (2004). We have not attempted the same with the GalICS simulations, since the current version is seriously incomplete at high z and low luminosities. In addition, the simulation box is too small to provide a large enough sample for training. A larger simulation with more particles and a larger box size is being developed for GalICS (GalICS 3).

Deep spectroscopic surveys such as DEEP2 and VLT/VIRMOS will provide the large training sets necessary for SVM applications. With the high quality of data obtained by these surveys, photometric redshift estimation should be quite accurate out to a redshift of z∼1.

SVMs may well be appropriate to other regression and classification problems in astronomy.

This work uses the GalICS/MoMaF Database of Galaxies.

Funding for the creation and distribution of the SDSS archive has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Aeronautics and Space Administration, the National Science Foundation, the US Department of Energy, the Japanese Monbukagakusho, and the Max Planck Society.

The SDSS is managed by the Astrophysical Research Consortium (ARC) for the Participating Institutions. The Participating Institutions are The University of Chicago, Fermilab, the Institute for Advanced Study, the Japan Participation Group, The Johns Hopkins University, Los Alamos National Laboratory, the Max‐Planck‐Institute for Astronomy (MPIA), the Max‐Planck‐Institute for Astrophysics (MPA), New Mexico State University, University of Pittsburgh, Princeton University, the United States Naval Observatory, and the University of Washington.

The author thanks the anonymous referee, whose insightful comments helped improve this paper. The author also thanks N. S. Philip for bringing the SVM to his attention. This research was initially supported by projects 1610‐1 and 1910‐1 of the Indo‐French Center for the Promotion of Advanced Research (CEFIPRA). Support for program AR 9540 was provided by NASA through a grant from the Space Telescope Science Institute, which is operated by the Association of Universities for Research in Astronomy, Inc., under NASA contract NAS 5‐26555.
