
Observation Strategy Optimization for Distributed Telescope Arrays with Deep Reinforcement Learning


Published 2023 May 10 © 2023. The Author(s). Published by the American Astronomical Society.
Citation: Peng Jia et al 2023 AJ 165 233. DOI: 10.3847/1538-3881/accceb


Abstract

Time-domain astronomy is an active research area that requires frequent observations of the whole sky to capture celestial objects with temporal variations. In the optical band, several telescopes at different locations can form a distributed telescope array to image celestial objects continuously. However, there are millions of celestial objects to observe each night, and only a limited number of telescopes are available. Moreover, the observation capacity of these telescopes is affected by factors such as the sky background and the seeing condition. It is therefore necessary to develop an algorithm that optimizes the observation strategy of telescope arrays according to scientific requirements. In this paper, we propose a novel framework that combines a digital simulation environment with a deep reinforcement learning algorithm to optimize the observation strategy of telescope arrays. Given predefined observation requirements and information on the observation environment, our framework obtains effective observation strategies. To test the performance of our algorithm, we simulate a scenario in which a distributed telescope array observes space debris. Results show that our algorithm obtains better results than a conventional sky survey strategy in both discovery and tracking of space debris. The framework proposed in this paper can serve as an effective strategy optimization framework for distributed telescope arrays, such as the Sitian project or the TIDO project.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

For a long time, astronomical observations were carried out by general-purpose telescopes controlled by astronomers to observe a few selected celestial objects, which has led to many new discoveries. In recent decades, the sky survey has become an important observation approach (Solar et al. 2016; Morris et al. 2018; Tonry et al. 2018; Bellm et al. 2019; Shimwell et al. 2019; Lacy et al. 2020). In sky survey projects, scientists use a telescope to collect images according to predefined strategies. These images are then shared through the internet and used to study the large-scale structure of the universe or to discover new transients. Nowadays, celestial objects with fast variations have attracted much attention and have become the main targets of time-domain astronomy. Because fast temporal variations are caused by astrophysical phenomena on very small spatial scales, celestial objects with fast variations are either astronomical phenomena with very high energy (such as mergers of compact stars, superflares, or tidal disruption events) or astronomical targets that are very close to us (such as near-Earth objects or artificial celestial objects). To observe these objects, scientists require a telescope with a large field of view, a large aperture, and the ability to observe continuously, which is beyond the capacity of any contemporary telescope design. Therefore, telescope arrays have become a possible approach to observing celestial objects with fast variations.

In recent years, telescope arrays of different configurations have been proposed by Abu-Zayyad et al. (2012), Tonry et al. (2018), Lokhorst et al. (2020), Abdalla et al. (2021), Acharyya et al. (2021), and Rinchiuso et al. (2021). We can classify these telescope arrays into two types according to the locations of the telescopes in the array. If all telescopes are installed in one observatory, the telescope array can be viewed as a new type of wide-field telescope, which observes celestial objects in an extremely large field of view and in multiple bands. If the telescopes are installed in different locations, the array can be viewed as a distributed telescope array, which observes different parts of the sky at different times. An array of the first type obtains images with higher temporal resolution or more color information, while an array of the second type obtains continuous observation data. Besides, since an array of the second type distributes its telescopes among different observatories, it is less affected by weather conditions. Thanks to these advantages, scientists have proposed several telescope arrays that adopt both designs: the telescopes are distributed among several observatories, with several telescopes in each observatory. If a telescope array contains different telescopes in different observatories, it is quite difficult to design an appropriate observation strategy for the following reasons:

  • 1.  
    Compared to the number of telescopes in a telescope array, there are too many celestial objects that need to be observed with adequate cadence.
  • 2.  
    Some astronomical phenomena can occur at any position on the sky (such as supernovae or tidal disruption events), and we need to allocate time to discover them.
  • 3.  
    Telescopes in a telescope array are sparsely distributed in different locations, which requires the observation strategy to consider locations, observation conditions, and requirements of different observation tasks.

Some methods have been proposed to optimize the observation strategy of sky survey telescopes. However, these methods are mainly used to allocate time for a single telescope and do not consider interactions between different telescopes in a telescope array (Africano et al. 2000; Solar et al. 2016; Bellm et al. 2019). Among these algorithms, greedy algorithms are widely discussed because they provide a convenient way to automatically schedule the observation strategy of a telescope (Rhodes 2011; Morris et al. 2018). A greedy algorithm computes a value for each possible target and selects the target with the highest value to observe. In real observations, the greedy algorithm iterates to continuously obtain observation results. For a telescope array with a small number of telescopes, such ordinary algorithms can increase observation efficiency (Seitzer et al. 2007). As the number of telescopes in the array increases, optimization of the observation strategy becomes more and more complex. Besides, optimization of the fine-scale observation strategy requires a lot of human intervention, and optimization of the large-scale observation strategy has a complexity proportional to the number of telescopes and celestial objects, which would exceed the capacity of any ordinary planning algorithm. In recent years, machine-learning-based strategy optimization algorithms have been widely discussed (Milani et al. 2012; Ferreira et al. 2016; Hinze et al. 2016; Frueh et al. 2018; Cai et al. 2020). To reduce the computation cost and complexity of these algorithms, many assumptions are made that deviate from real observation conditions. For example, telescopes are assumed to be ideal detectors, without considering the observation environment or the abilities of these telescopes. Besides, these observation strategies do not consider the real locations of telescopes and the distribution of celestial objects, which reduces their effectiveness (Jia et al. 2020, 2022b).

Nowadays, deep reinforcement learning (DRL) is widely studied as an effective method for the control of complex systems. A DRL system has four parts: the agent, the actuator, the sensor, and the environment. The agent obtains information about the environment from the sensor and drives the actuator to achieve predefined targets. Complex representation methods have been developed for the agent, such as deep neural networks or deep Monte Carlo trees, which have achieved remarkable performance (Mnih et al. 2013; Silver et al. 2016, 2017; Arulkumaran et al. 2019; Jumper et al. 2021). In real applications, DRL algorithms do not need to compute all the factors like greedy algorithms do. Instead, DRL algorithms learn the optimal policy automatically according to the state of the problem. Therefore, a DRL algorithm not only chooses the best known action in the current state (known as "greedy exploitation") but also has some chance of exploring new actions that may be better than the current one (known as "random exploration"). DRL algorithms make a trade-off between random exploration and greedy exploitation (known as the "epsilon-greedy strategy," which we will discuss in Section 4).

Training is necessary for DRL algorithms. We need a lot of off-line interaction data or on-line interactions with the real world to train a DRL algorithm, which is quite expensive for many applications. On one hand, devices in the real world would have to perform thousands of rounds of experiments to obtain data. These experiments require a lot of funding, and some experiments may damage devices or environments. On the other hand, experiments carried out in the real world take place in real time, which takes a long time, and we cannot control the outer environment, which reduces the training efficiency. Thanks to recent developments in numerical simulation and digital twin technologies (Makoviychuk et al. 2021; Jia et al. 2022a; Huang et al. 2022; Rojas et al. 2022), we can build a digital twin of the real world from many highly optimized blocks. With a digital twin of the real world, DRL algorithms can be developed faster and then used in real applications.

In this paper, we develop a framework to optimize the observation strategy of a telescope array, which includes a simulator and a DRL algorithm. Given predefined parameters of the telescopes, observation conditions, and properties of observation targets, our framework obtains an effective observation strategy after training. To test the performance of our framework, we consider a scenario that uses several telescopes located in different observatories around the world to observe space debris. Results show that our optimization framework obtains strategies for these telescopes to track known space debris and discover new space debris. The trained observation strategy performs better than ordinary methods. We discuss the design of our simulator in Section 2 and the design of the DRL algorithm in Section 3. In Section 4, we show the performance of our algorithm, and we draw our conclusions and anticipate our future work in Section 5.

2. The Telescope Array Simulator

Training of DRL algorithms is quite expensive for real-world tasks. We are facing a similar situation for optimization of the observation strategy for the telescope array: it is impractical to waste expensive telescope observation resources to train the DRL algorithm, not to mention that the training stage will take quite a long time. Nowadays, a high-fidelity simulation is a practical approach to reduce the training cost of DRL, and several applications have shown that it is possible to train the DRL algorithm with only simulated data and apply it directly after training (Rahimi et al. 2018; Yurtsever et al. 2020; Kiran et al. 2021). In this section, we propose a telescope array simulator to train the observation strategy. The telescope array simulator includes three models: the observation environment model, the telescope array model, and the celestial object model, as shown in Figure 1.


Figure 1. The telescope array simulator, which consists of three main components: the telescope array model, the observation environment model, and the celestial object model. We use the STK package to visualize results obtained by telescope arrays with different observation strategies.


Our simulator is a half Monte Carlo and half analytical simulation code written in Python 3. The simulator defines different models in an object-oriented way, and the models are connected according to the real light propagation procedure. For each model, we have designed functions for the physical processes necessary to simulate the imaging process, and we can easily add new functions or components to the simulator to include more effects. We run the simulator in discrete time steps to obtain observation results of the telescope array. Therefore, we can train DRL algorithms effectively through interactions with the simulator. Results obtained by the simulator are visualized with the STK package (McCamish & Romano 2007) to help scientists better review the performance of different observation strategies. We discuss details of the different parts of the simulator below.
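To make the structure concrete, the following is a minimal sketch of how the three models could be wired together in Python; the class and method names are illustrative assumptions, not the interface of the released code.

```python
from dataclasses import dataclass

@dataclass
class TelescopeArraySimulator:
    """Illustrative skeleton of the simulator (hypothetical interfaces)."""
    celestial_model: object      # targets: positions, magnitudes, values
    environment_model: object    # clouds, wind, seeing, sky background per site
    telescope_model: object      # optics, mount, and detector of each telescope
    time: float = 0.0            # current simulation time (e.g., Julian date)
    dt: float = 4.0 / 86400.0    # minimal time step (4 s expressed in days)

    def step(self, actions):
        """Advance the simulator by one discrete time step."""
        self.time += self.dt
        self.environment_model.update(self.time)               # move clouds, update sky
        targets = self.celestial_model.positions(self.time)    # apparent target positions
        detections = self.telescope_model.observe(
            actions, targets, self.environment_model)          # S/N cut, cloud obstruction
        reward = self.celestial_model.update_values(detections, self.time)
        return detections, reward
```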

2.1. The Celestial Object Model

The celestial object model generates an atlas of celestial objects for different telescopes. The values of these celestial objects at a particular time are defined according to the parameters of the telescope, the observation mode, star catalogs, and observation tasks. There are three types of celestial objects defined in the celestial object model:

  • 1.  
    Extended celestial objects, which introduce background noise into the observation data, such as the Sun and the Moon. Earth is viewed as an extended celestial object for space-based telescopes.
  • 2.  
    Fast variation celestial objects, which have fast moving speeds or fast magnitude variations, such as space debris, near-Earth objects, or variable stars.
  • 3.  
    Ordinary celestial objects, which have constant magnitudes and do not move in celestial coordinates during the simulation, such as stars or quasars.

We calculate the signal-to-noise ratio (S/N) of celestial objects as the evaluation criterion of data quality. Because astrometric and photometric accuracy can be derived directly from the S/N, we treat the second and third types of celestial objects as point sources and define their properties as shown in Table 1. Since the simulation code is run for many iterations and generating realistic streak images for moving targets would cost considerable additional computational resources, we assume that moving targets have point-like images whose S/Ns are slightly larger. We will add new properties of moving targets for the simulation of streak-like images in future work.
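For reference, and not taken from the paper, the photometric and astrometric uncertainties of a point source are commonly approximated from the S/N as

$$ \sigma_{\rm mag} \approx \frac{2.5}{\ln 10}\,\frac{1}{{\rm S/N}} \approx \frac{1.0857}{{\rm S/N}}, \qquad \sigma_{\rm pos} \sim \frac{{\rm FWHM}}{2\,{\rm S/N}}, $$

where FWHM is the full width at half maximum of the point-spread function; the second relation is only a rough order-of-magnitude centroiding estimate.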

Table 1. Properties Defined in the Celestial Object Model

Properties | Definition
R.A. | Right ascension
Decl. | Declination
Mag | Visual magnitude
Size | Visual size of the celestial object in arcseconds
Initial value | Value of the celestial object when it is observed for the first time
Value escalation index (M) | Growth rate of the value of the celestial target
Monitoring period (t0) | The value of a target is set to 0 for a period t0 after it has been observed by the telescope array
Value | Unitless value of a celestial target, which reflects the desirability of the target


R.A. and decl. describe the positions of the first and third types of celestial objects in the celestial coordinate system. When we run the simulator to train or test the observation strategy, we obtain the positions of celestial objects for a particular telescope given the position and observation time of that telescope. However, since celestial objects of the second type have different moving speeds, we use two-line element (TLE) data from the CelesTrak website to describe the positions of these targets (Vallado & Cefola 2012). At a particular time, we obtain the position and velocity of a celestial object of the second type, as seen by a particular telescope, through the SGP4/SDP4 model developed by the North American Aerospace Defense Command (NORAD). Mag is the visual magnitude of the celestial object, which is a constant for the first and third types of celestial objects, while it is a list (a magnitude curve) for celestial objects of the second type with variable magnitudes. We obtain the magnitudes of these objects by interpolating the magnitude curve. Size is the visual size of a celestial object in arcseconds, which is normally used to describe celestial objects of the first type.
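As an illustration of how a TLE can be propagated to a position and velocity at a given epoch, the snippet below uses the community sgp4 Python package; the paper does not state which SGP4/SDP4 implementation it uses, and the file path is a placeholder.

```python
from sgp4.api import Satrec, jday

# Read one object's two TLE lines, e.g., from a file downloaded from CelesTrak
# (placeholder path and file layout).
with open("debris_tle.txt") as f:
    line1, line2 = (f.readline().rstrip() for _ in range(2))

sat = Satrec.twoline2rv(line1, line2)
jd, fr = jday(2022, 4, 1, 12, 0, 0.0)      # 2022 April 1, 12:00:00 UTC
err, r_teme, v_teme = sat.sgp4(jd, fr)     # position (km) and velocity (km/s), TEME frame
if err == 0:
    print("position [km]:", r_teme)
```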

Since the telescope array is used to capture images of celestial objects, we set the values of different celestial objects according to their properties and the observation requirements. The value of a celestial object at a particular time t is defined as:

Equation (1)

where t is the observation time, t0 is the monitoring period, and M is the value escalation index. The value of a celestial object reflects the desirability of the object defined by the scientific requirements. For time-domain astronomy, we need to observe celestial objects within a predefined cadence (the monitoring period). Therefore, the value increases after we start to carry out observations with the telescope array. However, the value of a celestial object does not increase without bound: it is capped at the initial value. The initial value reflects the importance of the celestial object, and the monitoring period and the value escalation index reflect the required observation frequency of a particular celestial object. Some transients, such as supernovae or tidal disruption events, have very large initial values. Since some celestial objects require frequent observations, such as stars with exoplanets, variable stars, or space debris, we can set different monitoring periods and value escalation indices to reflect their temporal sampling requirements.
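Since Equation (1) is not reproduced in this text, the following sketch shows only one plausible form of the value function that matches the description above (value reset to 0 within the monitoring period after an observation, growth governed by the escalation index, capped at the initial value); the exact published expression may differ.

```python
def target_value(t, t_last_obs, initial_value, t0, M):
    """Assumed form of the value function described around Equation (1)."""
    elapsed = t - t_last_obs            # time since the last observation of this target
    if elapsed <= t0:
        return 0.0                      # value stays 0 for the monitoring period t0
    value = initial_value * ((elapsed - t0) / t0) ** M   # growth set by escalation index M
    return min(value, initial_value)    # capped at the initial value
```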

2.2. The Telescope Array Model

Although it would be possible to build a full Monte Carlo simulator for each telescope in the telescope array, it would require a lot of computational resources and memory. Therefore, we introduce a half Monte Carlo and half analytical telescope model for the telescope array. All telescopes in the telescope array are controlled as a whole system, and the properties of these telescopes are defined in Table 2.

Table 2. Properties Defined in the Telescope Array Model

Properties | Definition
Aperture size | Diameter of the telescope
Field of view | Field of view of the telescope
Pixel size | Pixel scale of the detector
Efficiency | Efficiency of the telescope in converting photons to electrons in the detector
CamDark | Dark current noise of the detector
CamRead | Readout noise of the detector
Az axis pointing | Pointing direction of the telescope in azimuth
Alt axis pointing | Pointing direction of the telescope in altitude
Axis moving speed | Moving speed of the telescope axis
Axis moving acceleration rate | Acceleration rate of the telescope axis
PSF | Point-spread function of the telescope


Telescopes in the telescope model have properties of three different types: properties of the optical system, properties of the mechanical system, and properties of the electronic system. We generate simulated images according to the properties of the optical and electronic systems. Meanwhile, we calculate the possible actions of a telescope according to the properties of the mechanical system. There are two actions that a telescope can carry out in the simulator: observing and moving. For the moving action, a telescope moves to a position according to the axis moving speed, the axis moving acceleration rate, and the moving time. For the observation action, we first obtain the sky areas observed by the telescope according to its field of view and the azimuth and altitude of its axes. Second, we either split the field of view into small patches and choose one star in each patch or directly select several celestial objects as references. Third, we use the wave front propagation method or the point-spread functions of the telescopes to obtain images of point sources according to the exposure time, the aperture of the telescope, the efficiency, and the noise from the detectors. At last, we calculate the S/N of these images and apply a detection threshold on the S/N (10 under ordinary conditions). All targets that are brighter than the detection threshold and not obstructed by clouds are marked as effective detections. We use the detection results to update the star catalog and the values obtained by the telescope array.
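The detection step can be illustrated with the textbook CCD signal-to-noise estimate; this is only a hedged sketch of the idea, and the simulator's actual noise model may include additional terms.

```python
import numpy as np

def point_source_snr(source_e, sky_e_per_pix, dark_e_per_pix, read_noise_e, n_pix):
    """Classic CCD equation for the S/N of a point source (all quantities in electrons)."""
    noise = np.sqrt(source_e
                    + n_pix * (sky_e_per_pix + dark_e_per_pix + read_noise_e ** 2))
    return source_e / noise

# A target counts as an effective detection if it is above the threshold
# (10 under ordinary conditions) and not obstructed by clouds.
snr = point_source_snr(source_e=5.0e3, sky_e_per_pix=50.0,
                       dark_e_per_pix=1.0, read_noise_e=1.0, n_pix=25)
detected = snr > 10.0
```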

2.3. The Observation Environment Model and the Simulation Procedure

In this subsection, we describe the observation environment model and the simulation procedure. The observation environment model defines the time and the observation conditions for the simulator, as shown in Table 3. Meanwhile, the observation environment model connects the celestial object model and the telescope array model to generate the final detection results.

Table 3. Properties Defined in the Observation Environment Model

Properties | Definition
Observatory location | Longitude, latitude, and altitude of the observatory
Cloud | Cloud distribution matrix over a particular observatory
Wind | Wind speed and direction at a particular observatory at a given time
Atmospheric turbulence | FWHM of the seeing disk
Atmospheric scattering | Scattering coefficient matrix
Atmospheric extinction | Extinction coefficient matrix
Sky background | Sky background noise introduced by light pollution from the ground


There are two types of environment properties: static properties and dynamic properties. The observatory location, the atmospheric scattering coefficients, and the atmospheric extinction coefficients are static properties. The cloud, the wind, and the atmospheric turbulence are dynamic properties, which change within a certain range during real observations. With these properties, we can carry out simulations accordingly. The simulation procedure at an instant time is shown in Figure 2. We first set the observation time of the simulator (the start time of the simulation). Second, we call the celestial object model to generate celestial objects with their magnitudes and positions in celestial coordinates. Meanwhile, we call the observation environment model to generate clouds, wind speed, wind direction, sky background, and atmospheric conditions for each observatory. Third, we obtain the positions of celestial objects for each observation site and calculate the background noise from the sky background or the extended celestial objects. Then, we call the telescope array model to observe these celestial objects and obtain values for each telescope. Finally, we update the value and the celestial object catalog for the telescope array.


Figure 2. Simulation procedure at an instant time. As shown in this figure, our method continuously calls the telescope model, the celestial object model, and the observation environment model to generate realistic observation results.


Since observations are carried out continuously, the simulator updates its state in discrete time steps, as shown in Figure 3. At each instant time, we first obtain the values and catalogs of celestial objects with the procedure discussed above (the simulation procedure at an instant time). Then, according to the values obtained by the telescope array, the observation strategy drives the telescopes in the array to take actions (pointing in a direction and taking images with an exposure time). Meanwhile, the environment model and the celestial object model update the states of the simulator according to the length of the exposure time and the moving time with the following steps. First, the positions of celestial objects are calculated, as well as the phases of the Sun and the Moon. Second, the clouds move according to the wind speed and wind direction. Third, the pointing direction of each telescope is updated according to the axis moving speed, the axis moving acceleration rate, and the moving time. After we update these states, we carry out the simulation procedure at an instant time discussed above to obtain the states and catalogs of the telescope array. We run the simulator for a given length of time to train the observation strategy, which we discuss in the next section.
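The discrete-time procedure of Figure 3 can be summarized by the following loop; the simulator and strategy interfaces are hypothetical stand-ins for the components described above.

```python
def run_episode(simulator, strategy, start_time, n_steps):
    """Sketch of the discrete-time simulation loop (hypothetical interfaces)."""
    state = simulator.reset(start_time)            # instant-time procedure of Figure 2
    for _ in range(n_steps):
        actions = strategy.select_actions(state)   # pointing and exposure for each telescope
        state, reward = simulator.step(actions)    # move clouds, Sun/Moon, and telescope axes,
                                                   # then repeat the instant-time procedure
        strategy.update(state, actions, reward)    # e.g., store the transition for training
    return strategy
```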


Figure 3. Simulation procedure in continuous time. As shown in this figure, the environment will update its state at each time step, and the control algorithm will update its parameters accordingly.


3. Observation Strategy Optimization with the DRL Algorithm

AlphaGo (Silver et al. 2016), released by Google's DeepMind team in 2016, shocked the world with a 4:1 victory over the world Go champion Lee Sedol. Since then, AlphaGo Master and AlphaGo Zero (Silver et al. 2017) have achieved even better results. DRL, which was used to optimize AlphaGo, has attracted widespread attention from the scientific community. Compared with classic control methods, such as gain scheduling, robust control, and nonlinear model predictive control (MPC), which have been widely discussed in the autonomous driving and robotics areas (Xu et al. 2018; Bertsekas 2021), DRL is an effective end-to-end controller with the following advantages in real applications: (1) through training, DRL can deal with very complex environments that are hard to describe; (2) with deep neural networks, DRL can generalize to previously unseen states that differ from the states in the training data. Because observations carried out by the telescope array are quite complex (the observation environment includes different effects, and there are many celestial objects to be observed by several telescopes), it is appropriate to use DRL to optimize the observation strategy of the telescope array. In this paper, we set the state, the action, and the reward for each telescope in the telescope array separately, and the agent controls the whole telescope array according to the celestial object catalog and the values obtained by different telescopes. We discuss the design of the DRL in the following subsections.

3.1. Design of the State, the Action, and the Reward in the DRL Algorithm

The state, the action, and the reward are necessary components of DRL. The state describes the agent's understanding of the environment and the task. It should be mentioned that the agent is not omniscient, which means that the information contained in the state can only be extracted from sensors. Based on this concept, we design the state with the following information:

  • 1.  
    Catalog of known celestial objects, which includes the positions and values of these celestial objects. Since there are many celestial objects on the sky, we split the sky area into 441 regions and, at each time, calculate the values of celestial objects in these regions.
  • 2.  
    Measurements of environment conditions, which include the distribution of clouds, the location of the observatory, and the positions of the Sun and the Moon. If any region is affected by clouds, the values in that region are directly set to zero. Meanwhile, the detection ability of a telescope is reduced if a region is affected by the sky background.
  • 3.  
    Measurements of the telescope conditions, which are the pointing direction of the telescope. The dimension of the state space is 441 + 2 + 2 = 445 for the agent. Figure 4 shows a schematic diagram of these states.


Figure 4. Schematic drawing of the state in the DRL used for observation strategy optimization.


The action defines the interactions between the agent and the environment. In this paper, we define the actions as the moving and exposure of each telescope in the telescope array. The telescopes can move in azimuth from 0° to 360° and in altitude from 10° to 85°, and these parameters can be modified according to the real design of different telescopes. Because different telescopes have different moving speeds and moving acceleration rates, the moving angles are different even for the same moving direction and time. The exposure time can be given either as an integer, which defines the length of the exposure, or as a Boolean, which indicates whether an image is taken with a fixed exposure time. There are five dimensions in the action space, as shown in Figure 5.


Figure 5. Schematic drawing of the action in the DRL used for observation strategy optimization. Satellites represent moving targets; six-point stars represent stars, quasars, and galaxies; the blue circle represents exposure time; and orange arrows represent moving directions.


The reward is defined to guide the DRL algorithm to obtain better strategies for a specific task. In this paper, we set the reward as the values of celestial objects obtained by the telescope array:

$$ R = \sum_{i=1}^{n} K_i \, \mathrm{value}_i, \qquad (2) $$

where n is the total number of celestial objects in the simulator, value_i is the value of the ith celestial object, and K_i indicates whether the ith target is observable, with 1 for observable targets and 0 for targets with low S/N. During the simulation, we calculate the S/Ns of all celestial objects and compare them with the detection threshold in each step. If the S/N is lower than the predefined threshold, we set K_i to 0, and to 1 otherwise. Meanwhile, we calculate the values of these celestial objects with Equation (1), and we obtain the reward by summing up all values of observable targets.
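Expressed in code, the reward of Equation (2) is simply the sum of the values of targets detected above the S/N threshold; the snippet below is a direct, illustrative translation.

```python
import numpy as np

def reward(values, snrs, threshold=10.0):
    """Reward of Equation (2): sum of value_i over targets with K_i = 1."""
    k = (np.asarray(snrs) >= threshold).astype(float)   # K_i = 1 if detectable, else 0
    return float(np.sum(k * np.asarray(values)))
```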

3.2. Design of the DRL Algorithm for Optimization of the Observation Strategy

Several different DRL algorithms have been proposed and applied in different areas (Mnih et al. 2013; Silver et al. 2016, 2017; Rahimi et al. 2018; Arulkumaran et al. 2019; Jumper et al. 2021; Kiran et al. 2021). In this paper, we choose Q-learning as the basic structure. Q-learning is a model-free, off-policy reinforcement learning algorithm, which selects the best action according to experience learned previously. The experience is stored in a Q-table of state-action pairs. If the dimensions of the state and action spaces are not too large, Q-learning is an effective DRL algorithm; otherwise, it fails owing to the curse of dimensionality. For the telescope array observation optimization problem, the action and state spaces are relatively simple, so it is possible to use the Q-learning algorithm to optimize the observation strategy. Since the Q-table is a table with discrete values and telescopes carry out continuous actions, we select a parameterized function Q(S, A) to approximate the Q-table, where S represents the state and A represents the action. By sampling enough (S, A) state-action pairs and Q values, we can fit the full Q(S, A) function. Given that the action value function would be quite large and complex, we use deep neural networks as the parameterized function. Q-learning with deep neural networks is also called a Deep Q-Network (DQN). With the simulator and the DQN, the schematic diagram of our framework is shown in Figure 6.
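For reference, the standard tabular Q-learning update toward which Q(S, A) is fitted is

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma\,\max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right], $$

where α is the learning rate and γ is the discount factor; the DQN replaces the table with a neural network trained toward the same target.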


Figure 6. Schematic diagram of the DQN and the simulator in our framework.


We adopt the experience replay mechanism in the DQN, which stores transitions in a replay buffer for subsequent training with random sampling (Mnih et al. 2013). With this design, we make better use of the data and reduce the correlation between consecutive samples, which lowers the variance. Besides, we also use the method proposed by Mnih et al. (2015), which copies the parameters of the value network to the target network after N rounds of training to further reduce variance. We use a five-layer multilayer perceptron (MLP) as the neural network in the DQN (Gardner & Dorling 1998), which can be further modified as needed. With these modifications, the schematic diagram of the DQN is shown in Figure 7.
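A minimal experience replay buffer of the kind described here could look like the following sketch (illustrative, not the released code).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=256):
        batch = random.sample(self.buffer, batch_size)   # random sampling breaks correlations
        return tuple(zip(*batch))                        # tuples of states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)
```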


Figure 7. Schematic diagram of the DQN used in this paper.


As shown in Figure 7, the sensor obtains states from the environment. The state is then sent to the value network, as well as to the replay buffer. The value network considers the output from the replay buffer and the state to output the Q value. The DQN loss function evaluates the Q value and outputs the loss gradient to train the value network. After several iterations, weights from the value network are copied to the target value network. The target value network outputs the maximum Q value according to the weights copied from the value network and the outputs from the replay buffer. The maximum Q value is used to optimize the observation strategy.
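The training step described above can be sketched as follows; PyTorch and a discount factor of 0.99 are assumptions, since the paper does not specify the deep-learning framework or the discount factor, and the network dimensions (445 inputs, 256 hidden units, 5 outputs) are taken from Section 4.1.

```python
import torch
import torch.nn as nn

def make_q_network(state_dim=445, hidden=256, n_outputs=5):
    """Five-layer MLP mapping a state to Q values (dimensions from Section 4.1)."""
    return nn.Sequential(
        nn.Linear(state_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_outputs),
    )

def dqn_update(value_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN update: fit Q(s, a) toward r + gamma * max_a' Q_target(s', a')."""
    states, actions, rewards, next_states, dones = batch      # tensors; actions are indices
    q_sa = value_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    loss = nn.functional.mse_loss(q_sa, targets)              # rms/MSE-style loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Every N training rounds, the weights of value_net would be copied to the target network, e.g., with target_net.load_state_dict(value_net.state_dict()).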

3.3. Application of the DQN to Optimization of the Space Debris Observation Strategy

In this subsection, we connect the simulator and the DQN to obtain effective observation strategies. The flowchart of the simulator and the DQN is shown in Figure 8. First, we set the celestial objects and the simulation environments. We set the start time and the end time using the Julian calendar and UTC time; they define the total duration of the observations carried out by the telescope array. Meanwhile, we initialize the celestial catalogs, the states of the telescopes, and the observatory environments with the parameters defined in the simulator. Then, we carry out simulations with the parameters defined above in minimal time steps of δt. All environmental parameters are updated according to T + n × δt, where n is the number of simulation steps and T is the start time. The simulator calculates the positions of the Sun and the Moon, the phase of the Moon, the sky background within the observation range of each telescope, and the cloud cover. The telescopes capture celestial objects according to the observation conditions. At the same time, the DQN algorithm accepts environment information from the sensor, which includes scores, the distribution of clouds, and new celestial objects observed by the telescopes. At last, the DQN algorithm outputs the best action and drives all telescopes in the telescope array to their preferred directions. After the procedure defined above, we advance the simulation by δt, until the simulator reaches the end of the simulated time.


Figure 8. The flowchart of the framework for optimization of the observation strategy for distributed telescope arrays.


4. Performance Test of the Framework in Observation of Space Debris

In this section, we test the performance of our framework in controlling a distributed telescope array to observe space debris. There are two tasks for the telescope array: observation of known space debris and discovery of unknown space debris. These two tasks require two different observation strategies. Observation of known space debris requires frequent tracking carried out by different telescopes. Discovery of unknown space debris requires surveys carried out by different telescopes in their spare time. For ordinary observations, scientists need to carefully design observation strategies and allocate most of the time to tracking known space debris, which reduces the efficiency of the telescope array in discovering new space debris. Since there are many pieces of space debris to observe and a limited number of telescopes, an observation strategy is required to maximize the observation ability of the telescope array. We use our framework to solve this problem. Considering the Moon phase, we simulate a month to train the DQN (from 2022 April 1 to 2022 May 1). The minimal time step is 4 s, and there are 21,600 time steps each day in the simulator. We randomly selected 200 pieces of space debris according to their TLE data from the CelesTrak website. The maximum value (value_max) of these space debris is 4, and their t0 and M are twice their periods. The locations of these telescopes are shown in red in Figure 9, and the parameters of these telescopes are set according to Table 4.


Figure 9. The location information of the telescopes. Observation sites in red represent the configuration of the observation sites in the training set, and observation sites in black represent the configuration of the observation sites in the test set.


Table 4. Parameters for Telescopes in the Telescope Array

Parameters | Value
Aperture size | 1 (m)
Field of view | 3 (deg)
Efficiency | 90%
Readout noise | 1 (e)
Dark current noise | 1 (e)
Pixel scale | 1 (arcsec)
Axis movement speed | 2 m s−1
Axis movement acceleration rate | 2 m s−2


4.1. Design and Training of the DQN

We use the five-layer MLP as the policy network Q and the target network Q in the framework. The input dimension of the MLP is the dimension of the state space, which is 445 in this task, and the output dimension of the MLP is the dimension of the action space, which is 5 in this task. We set the hidden layers of the MLP to a dimension of 256. We have tested neural networks with other structures, and their performance is close, so we choose the five-layer MLP for convenience. The update rate of the target network is 8, which means that we update the weights of the target network every eight steps. The number of training rounds is 200, and the batch size is 256. We set the experience replay buffer to a size of 10^6. The optimization algorithm is the Adam optimizer with an initial learning rate of 0.01, and the loss function is the rms loss function. In order to obtain higher rewards, the agent must choose the right action in different states based on previous experience: the agent needs to use learned actions to observe known space debris and take the necessary exploration to discover new space debris based on the epsilon-greedy strategy (Dabney et al. 2020). The epsilon-greedy strategy balances random exploration and greedy exploitation. In each step, it chooses a random action with probability epsilon and the best known action in the current state with probability 1 − epsilon. With this design, we explore different actions more and find better strategies in the early stages, while making more use of existing knowledge and stabilizing performance in the later stages. The value of epsilon differs between stages: at the start of the training stage it is close to 1, and at the end it is close to 0. Therefore, in our experiments, we set epsilon_start of the epsilon-greedy strategy to 0.99 and epsilon_end to 0.01, with a decay rate of 80,000. The epsilon of the xth action is updated according to

Equation (3)
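Equation (3) is not reproduced in this text; a standard exponential decay consistent with the quoted epsilon_start = 0.99, epsilon_end = 0.01, and decay rate of 80,000 is sketched below, and the published expression may differ.

```python
import math

def epsilon(x, eps_start=0.99, eps_end=0.01, decay=80_000):
    """Assumed epsilon schedule for the xth action: exponential decay from eps_start to eps_end."""
    return eps_end + (eps_start - eps_end) * math.exp(-x / decay)
```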

With the hyperparameters defined above, we train the framework for 200 episodes. The reward versus the number of episodes is shown in Figure 10. The blue curve represents the reward value in each step, and the yellow curve represents the mean reward value in each round. As shown in this figure, the reward is 0 at the start of training, which indicates that the telescope array could not find or track any space debris. As training proceeds, the reward increases, and after 175 rounds the reward value is stable. Since the reward is abstract, we further use the number of detected and the number of tracked space debris as evaluation criteria for the framework. As shown in Figure 11, blue circles represent the number of newly discovered space debris, and yellow circles represent the number of tracked space debris. Since we remove space debris from the catalog that cannot be tracked after several hours, the number of tracked space debris can decrease during the training stage, which causes some fluctuations in the numbers of tracked and newly detected space debris. In this figure, we find that the framework could detect or track no space debris at the start of training. The observation ability of the framework becomes stable after 150 rounds: it can discover about 115 pieces of space debris each day (about 57.5% of the total number) and track about 90 (about 45.0% of the total number).


Figure 10. The performance of the framework in detection of space debris, when we use reward to evaluate its performance. The blue line represents rewards in each step, and the orange line represents mean rewards in each round.


Figure 11. The performance of the framework in detection of space debris, when we use number of detected and tracked space debris to evaluate its performance.


It is interesting to investigate the actions carried out by the DQN in the framework. Because the DQN considers successive actions and the rewards obtained by these actions, the agent prefers to take fewer moves and to observe more valuable space debris. To show the actions carried out by the agent, we plot the trajectory of one telescope in the telescope array in Figure 12. The transparencies of the curves represent different observation times. The gray curves represent the trajectories of space debris with little observation value, and the yellow curve represents the trajectory of space debris with higher value. The blue curve represents the trajectory of the telescope controlled by the traditional sky survey method, and the red curve represents the trajectory of the telescope controlled by the DQN algorithm. As shown in this figure, the telescope scans the sky with a predefined strategy when it is controlled by the sky survey strategy. The DQN algorithm, however, automatically adjusts the observation strategy of the telescope when it finds a valuable target. Besides, we can clearly see that the DQN algorithm tries to predict the motion trajectory of the space debris and to move to an appropriate position.


Figure 12. The schematic diagram of the space debris track and observation track of the telescope.


For DRL algorithms (such as the DQN proposed in this paper), scientists normally use the length of the right action chain or the right action rate as an indicator of the learning ability of the agent. We also use the right action rate to show the learning ability of our algorithm. The right action rate is the fraction of right actions made by the agent, which is the reciprocal of the length of the right action chain. As shown in Figure 13, the right action rate increases during the training stage. Finally, the right action rate exceeds 10%, which means that the agent can obtain valid strategies after fewer than 10 actions.


Figure 13. The performance of the framework in detection of space debris, when we use right action rates to evaluate its performance. The blue line represents the ratio of the number of valid observations to the number of all actions in each step, and the orange line represents mean right action rates in each round.


4.2. Performance Test of the DQN in a New Observation Task

A trained DQN should be able to adapt to a new environment that has properties similar to those of the training environment. Therefore, we have generated four new data sets to test the performance of our framework, with the following changes:

  • 1.  
    We change the observation time to the period from 2022 May 1 to 2022 May 10.
  • 2.  
    We randomly select a new batch of space debris to observe.
  • 3.  
    We switch the telescopes from the observatories labeled in red to the virtual observatories labeled in black, as shown in Figure 9.

Meanwhile, we have built a sky survey strategy for the telescope array for comparison, which surveys the sky in a predefined sequence. With these four new data sets, we carried out observations over 10 days, and the final results are shown in Figures 14 and 15.


Figure 14. The performance of the strategy obtained by our framework and that of the sky survey strategy in the four test environments, evaluated with the reward value.


Figure 15. The performance of the strategy obtained by our framework and that of the sky survey strategy in the four test environments, evaluated with the numbers of tracked and newly discovered celestial objects.


We show the reward values, the number of newly discovered space debris, and the number of tracked space debris in this section to evaluate the performance of the trained DQN algorithm. Several different conditions are changed in the new environments, labeled "Change Time," "Change Debris," "Change Observatories," and "Change All." "Change Time" represents a different observation time, "Change Debris" represents different space debris, "Change Observatories" represents telescopes located in different observatories, and "Change All" represents changing all conditions. As shown in Figure 14, under these different observing conditions the sky survey is a relatively stable observing strategy with a reward value of 30–40. In comparison, the reward value of our framework basically stays at 500–600. Only when all observational conditions are changed is the reward value of our framework less than 500, which is still far better than that of the sky survey.

In Figure 15, we further show the number of newly discovered and tracked space debris obtained by the different observing strategies. As shown in this figure, our framework can discover 110–120 targets. About 90 targets can be tracked in the "Change Time" and "Change Debris" scenarios, and in the "Change Observatories" and "Change All" scenarios our method can track about 60 targets. Meanwhile, the sky survey strategy can find some new space debris and track some known space debris, but with worse results. We further investigate the reward value when one of these observation conditions changes. For example, we find that the variance of the reward is highest when we change observatories. This may be because, when the location of the observatory changes, some space debris have large variations in position or moving speed with respect to the observatory, resulting in a decrease in the telescope's ability to discover and monitor some of the targets.

5. Conclusions and Future Works

In this paper, we propose a framework to optimize the observation strategy of distributed telescope arrays. The framework contains a half Monte Carlo and half analytical simulation code, which can simulate observations carried out by telescope arrays with acceptable computation cost and memory requirements. Meanwhile, we propose to use the DQN to control the telescope array for observation. The DQN can be trained with the simulator and used directly for observation after training. We simulate a scenario that uses several telescopes located in different observatories to observe space debris. Results show that our framework provides an effective observation strategy after training, even if the locations of telescopes or the catalogs of space debris change.

Our framework is a novel method to optimize the observation strategy of telescope arrays. We will further study the framework and use it to control telescope arrays, such as the Sitian project (Liu et al. 2021) or the TIDO project (Kou 2019). However, there is still some work required before we can use the method in real applications: (1) we need to design an algorithm that uses an all-sky camera to obtain the positions of clouds for the DQN; (2) it is necessary to include more constraints in the simulator, such as effects brought by man-made objects (such as Starlink satellites); (3) we also need to design a method to segment the sky into different regions according to the distribution of celestial objects, which could help us better evaluate the performance of the telescope array.

This work is supported by the National Natural Science Foundation of China (NSFC) under funding Nos. 12173027 and 12173062 and by the Civil Aerospace Technology Research Project (D050105). We acknowledge the science research grants from the China Manned Space Project with No. CMS-CSST-2021-A01.

Software: We used Python (Van Rossum et al. 2007), numpy (Van Der Walt et al. 2011), astropy (Robitaille et al. 2013), gym (Brockman et al. 2016), pyephem (Rhodes 2011), scipy (Virtanen et al. 2020), matplotlib (Hunter 2007), and scikit-learn (Pedregosa et al. 2011). The code is released in the China-VO: 10.12149/101240.
