Controlling Randomization in Astronomy Simulations

Megan E. Schwamb; Jeremy Kubica; Mario Jurić; Drew Oldag; Maxine West; Melissa DeLucchi; Matthew J. Holman

doi:10.3847/2515-5172/ad1f6b

1. Introduction

Scientific analysis needs pseudo-randomized runs to produce independent samples. This applies anytime the user is looking to produce statistically meaningful data whether from repeated simulations or Monte Carlo samples for data fitting. Slightly different outputs will be produced for the same set of inputs, each time a simulator is run when it relies on randomized values. For testing and validation, simulators also need to have the ability to allow forced reproducibility. Otherwise testing can be flaky, prone to failing unpredictably, and cause wasted debugging time. A common approach to addressing this trade-off is to provide the ability to seed the random number generator with a user inputted value for testing and debugging. The seed starts the random pseudo-random number generator from a known position and thus produces deterministic samples. Outside of active software development and testing, the standard is to use the computer time (in seconds or milliseconds) to set the seed value, causing the random number generator to be initialized with a unique seed each time the software is executed for scientific analysis.

Typically, a software package or simulator will have all of its constituent modules/functions share the same random number generator so that a single value controls the reproducibility of the code. This strategy works well when implementing unit tests, where each function/module in the software is tested independently, or when checking the end-to-end output of a software package. This strategy does not work very well when attempting to develop a new function in a large software package or trying to debug within a specific function/module that uses randomized values, especially if it takes a significant amount of time for the entire software program to run. Even if using a single seed for the random number generator, the same functions and modules must run in the exact same order (the same number of requests to the random number generator occur) to get deterministic outputs from the function in question.

In this research note, we present the pseudo-random number generator strategy we implemented⁵ for Sorcha ⁶ (Merritt et al. 2023, in preparation), an open-source python solar system simulator for the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST; LSST Science Collaboration et al. 2009; Ivezić et al. 2019). Our method makes each of the simulator's functions and modules requiring randomization independently deterministic for testing and debugging with a single user-provided seed, regardless of which functions/modules within the software are executed or the order in which they are called.

2. Per Function/Module Random Number Generator

We developed a per-module/function random number generator (which we refer to as the PerModuleRNG python class) which creates individual independently seeded NumPy (Harris et al. 2020) random number generators⁷ for each function/module in a software package. An PerModuleRNG instance is initialized with a single base seed and returns new random number generators using NumPy's default BitGenerator(PCG64) algorithm (O'Neill 2014) seeded using the combination of the base seed and the MD5 hash⁸ of the function/module's name as the per-module seed:

$\begin{eqnarray}&&\begin{array}{l}\mathrm{per}\_\mathrm{module}\_\mathrm{seed}=(\mathrm{base}\_\mathrm{seed}\ \\ \qquad +\mathrm{md}5\_\mathrm{hash}(\mathrm{module}\_\mathrm{name})) \% ({2}^{31})\end{array}\end{eqnarray} \tag{ 1 }$

Using the hash ensures a high level of randomness of the generated seed, even for similarly named functions.

The code for the PerModuleRNG python class is shown below:

import hashlib
import numpy as np
class PerModuleRNG:
 def__init__(self, base_seed):
 self._base_seed = base_seed
 self._rngs = {}
 def getModuleRNG(self, module_name):
 if module_name in self._rngs:
 return self._rngs[module_name]
 hashed_name = hashlib.md5(module_name.encode())
 seed_offset = int(hashed_name.hexdigest(), base = 16)
 module_seed = (self._base_seed + seed_offset) % (2**31)
 new_rng = np.random.default_rng(module_seed)
 self._rngs[module_name] = new_rng
 return new_rng

The initialization function sets the base seed for future use. The function getModuleRNG() checks if a random number generator already exists for that module/function name and, if so, returns it. Otherwise, a new random number generator (with a per-module/function seed) is created and returned. This allows the individual modules or functions to perform deterministically when using a single base seed for testing/debugging regardless of the order in which they are run. We note that the same NumPy random number generator instance will be used if a given function/module is called more than once in the software. It would be relatively straightforward to adapt PerModuleRNG such that each time a function is called a unique seed is created by keeping track of the number of times the function has been called and adding that count as a string on to the function name before hashing.

3. Usage

There are several important considerations to keep in mind when setting the base seed value and initializing an PerModuleRNG instance within your code:

1.
Outside of the unit tests, you want to avoid any hard coded base seed values.
2.
By default, the base seed should be initialized as the computer time at runtime, but you also want to make it relatively straightforward for an expert user (such as someone contributing to the code base) to set the base seed for reproducible runs.
3.
You want to make it hard for a new user who might not fully understand pseudo-random number generation in computing to accidentally set the same fixed base seed for all their scientific runs.

Sorcha uses a configuration file and command line arguments to initialize the majority of the simulator's parameters and settings. It is very easy for a novice user to incorrectly set the base seed if it is accessible via these options. To prevent accidental settings, we use a hidden environmental configuration parameter for setting the base seed value. If the environment variable is not defined then by default the code uses the current system time as the base seed for PerModuleRNG. This strategy allows us to hide this functionality deep within our software's documentation such that an expert user will find it, but a new user will not. It also prevents users from accidentally forcing deterministic behavior by copying a test configuration file, an all-too common failure case observed with new users.

Assuming that the PerModuleRNG class is stored in PerModuleRNG.py, we demonstrate how to use the PerModuleRNG python class in this manner with an example main and function snippet below:

import numpy as np
from PerModuleRNG import PerModuleRNG
import time
import os
def function1(module_rngs, observations):
 rng = module_rngs.getModuleRNG("function1") # once initialized in the function, the returned
 # generator is used in the same manner as any other numpy random number generator.
 # if creating a per module random number generator instead use:
 # rng = module_rngs.getModuleRNG(__name__)
 uniform_distr = rng.random(len(observations)) # creates a numpy array the same length as the
 # observations array comprised of floats randomly uniformly distributed within [0, 1)
 ...
 if __name__=="__main__":
 if "MY_SEED" in os.environ: # checking if an environment variable for setting a fixed
 # base seed exists.
 base_seed = os.environ["MY_SEED"] # use the environment variable value as base seed
 else:
 base_seed = int(time.time()) # otherwise set base seed using computer time (in seconds)
 module_rngs = PerModuleRNG(base_seed)
 observations = np.zeros(100, dtype=float)
 function1(module_rngs, observations)
 ...

As needed for reproducible runs, the environment variable MY_SEED can be defined in the terminal beforehand, as "export MY_SEED = 42" for example if using a bash shell or Z-shell. In unit tests, the base seed can be directly hard coded and passed when creating the first instance of PerModuleRNG.

Acknowledgments

MES acknowledges support from UK STFC grants ST/V000691/1 and ST/X001253/1 and LSST Discovery Alliance LINCC Frameworks Incubator grant [2023-SFF-LFI-01-Schwamb]. Support was provided by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program. DIRAC Institute is supported through generous gifts from the Charles and Lisa Simonyi Fund for Arts and Sciences, the Washington Research Foundation, and individual supporters. This work has made use of NASA's Astrophysics Data System Bibliographic Services.

Software: NumPy (Harris et al. 2020).

Controlling Randomization in Astronomy Simulations

Article metrics

Author e-mails

Author affiliations

ORCID iDs

Dates

Abstract

1. Introduction

2. Per Function/Module Random Number Generator

3. Usage

Acknowledgments

Footnotes

Controlling Randomization in Astronomy Simulations

Article metrics

Share this article

Author e-mails

Author affiliations

ORCID iDs

Dates

Abstract

1. Introduction

2. Per Function/Module Random Number Generator

3. Usage

Acknowledgments

Footnotes