
Speeding-up the decision making of a learning agent using an ion trap quantum processor


Published 20 December 2018 © 2018 IOP Publishing Ltd
Citation: Th Sriarunothai et al 2019 Quantum Sci. Technol. 4 015014. DOI: 10.1088/2058-9565/aaef5e


Abstract

We report a proof-of-principle experimental demonstration of the quantum speed-up for learning agents utilizing a small-scale quantum information processor based on radiofrequency-driven trapped ions. The decision-making process of a quantum learning agent within the projective simulation paradigm for machine learning is implemented in a system of two qubits. The latter are realized using hyperfine states of two frequency-addressed atomic ions exposed to a static magnetic field gradient. We show that the deliberation time of this quantum learning agent is quadratically improved with respect to comparable classical learning agents. The performance of this quantum-enhanced learning agent highlights the potential of scalable quantum processors taking advantage of machine learning.


Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The past decade has seen the parallel advance of two research areas—quantum computation [1] and artificial intelligence [2]—from abstract theory to practical applications and commercial use. Quantum computers, operating on the basis of information coherently encoded in superpositions of states that could be considered classical bit values, hold the promise of exploiting quantum advantages to outperform classical algorithms, e.g. for searching databases [3], factoring numbers [4], or even for precise parameter estimation with quantum metrology [5, 6]. At the same time, artificial intelligence and machine learning have become integral parts of modern automated devices using classical processors [7–10]. Despite this seemingly simultaneous emergence and promise to shape future technological developments, the overlap between these areas still offers a number of unexplored problems [11, 12]. It is hence of fundamental and practical interest to determine how quantum information processing and autonomously learning machines can mutually benefit from each other.

Within the area of artificial intelligence, a central component of modern applications is the learning paradigm of an agent interacting with an environment [2, 13, 14] illustrated in figure 1(a), which is usually formalized as so-called reinforcement learning. This entails receiving perceptual input and being able to react to it in different ways. The learning aspect is manifest in the reinforcement of the connections between the inputs and actions, where the correct association is (often implicitly) specified by a reward mechanism, which may be external to the agent. In this very general context, an approach to explore the intersection of quantum computing and artificial intelligence is to equip autonomous learning agents with quantum processors for their deliberation procedure9. That is, an agent chooses its reactions to perceptual input by way of quantum algorithms or quantum random walks. The agent's learning speed can then be quantified in terms of the average number of interactions with the environment until targeted behavior (reactions triggering a reward) is reproduced by the agent with a desired efficiency. This learning speed cannot generically be improved by incorporating quantum technologies into the agent's design [17].

Figure 1.

Figure 1. Learning agent and quantum reflecting projective simulation (Q-RPS). (a) Learning agents receive perceptual input ('percepts') from and act on the environment. The projective simulation (PS) decision-making process draws from the agent's memory and can be modeled as a random walk in a clip network, which, in turn, is represented by a stochastic matrix P. Here the clips represent the elementary patches of episodic memory of prior experiences. (b) Q-RPS agents enhance the relative probability of (desired) actions (green columns) compared to other clips (gray) that may include undesired actions or percepts (blue) within the stationary distribution of P before sampling, achieving a quadratic speed-up with respect to classical RPS agents.


However, a recent model [20] for learning agents based on projective simulation (PS) [14] allows for a speed-up in the agent's deliberation time during each individual interaction. Theoretical work has shown that such a quantum improvement in the reaction speed should be possible within the reflecting projective simulation (RPS) variant of PS [20]. There, the desired actions of the agent are chosen according to a probability distribution that can be modified during the learning process. This is of particular relevance to adapt to rapidly changing environments [20], as we shall elaborate on in the next section. For this task, the deliberation time of classical RPS agents is proportional to the quantities 1/δ and 1/ε, where δ represents a spectral gap of a Markov chain and ε represents the probability to sample an action in a probability distribution. These characterize the time needed to generate the specified distribution in the agent's internal memory and the time to sample a suitable (e.g. rewarded rather than an unrewarded) action from it, respectively. A quantum RPS (Q-RPS) agent, in contrast, is able to obtain such an action quadratically faster, i.e. within a time of the order $1/\sqrt{\delta \epsilon }$, as is shown in the next section.

Here, we report on the first proof-of-principle experimental demonstration of a quantum-enhanced reinforcement learning system, complementing recent experimental work in the context of (un)supervised learning [21–23]. We implement the deliberation process of an RPS learning agent in a system of two qubits that are encoded in the energy levels of one trapped atomic ion each. Within experimental uncertainties, our results confirm the agent's action output according to the desired distributions and within deliberation times that are quadratically improved with respect to comparable classical agents. This laboratory demonstration of speeding up a learning agent's deliberation process can be seen as the first experiment combining novel concepts from machine learning with the potential of ion trap quantum computers where complete quantum algorithms have been demonstrated [24–27] and feasible concepts for scaling up [28–30] are vigorously pursued.

2. Theoretical framework of RPS

A generic picture for modeling autonomous learning scenarios is that of repeated rounds of interaction between an agent and its environment. In each round the agent receives perceptual input ('percepts') from the environment, processes the input using an internal deliberation mechanism, and finally acts upon (or reacts to) the environment, i.e. performs an 'action' [14]. Depending on the reward system in place and the given percept, such actions may be rewarded or not, which leads the agent to update its deliberation process: the agent learns.

Within the PS [14] paradigm for learning agents, the decision-making procedure is cast as a (physically motivated) stochastic diffusion process within an episodic compositional memory, that is, a (classical or quantum) random walk in a representation of the agent's memory containing the interaction history. One may think of the episodic compositional memory as a network of clips that can correspond to remembered percepts, remembered actions, or combinations thereof. That is, the clips represent the elementary patches of episodic memory. Mathematically, this clip network is described by a stochastic matrix (defining a Markov chain) $P=({p}_{{ij}})$, where the pij with $0\leqslant {p}_{{ij}}\leqslant 1$ and ${\sum }_{i}{p}_{{ij}}=1$ represent transition probabilities between the clips labeled i and j with $i,j\in \{1,2,\,\ldots ,\,N\}$. The learning process is implemented through an update of the N × N matrix P, which, in turn, serves as a basis for the random walks in the clip network. Different types of PS agents vary in their deliberation mechanisms, update rules, and other specifications.

In particular, one may distinguish between PS agents based on 'hitting' and 'mixing'. For the former type of PS agent, a random walk could, for instance, start from a clip c1 called by the initially received percept. The first 'step' of the random walk then corresponds to a transition to clips cj with probabilities p1j. The agent then samples from the resulting distribution ${\{{p}_{1j}\}}_{j}$. If such a sample provides an action, for instance, if the clip ck is 'hit', this action is selected as output, otherwise the walk continues on from the clip ck. An advanced variant of the PS model based on 'mixing' is RPS [20]. There, the Markov chain is first 'mixed', that is, an appropriate number10 of steps are applied until the stationary distribution is attained (approximately), before a sample is taken. This, or other implementations of random walks in the clip network provide the basis for the PS framework for learning. The classical PS framework can be used to solve standard textbook problems in reinforcement learning [31–33], and has recently been applied in advanced robotics [34], adaptive quantum computation [35], as well as in the machine-generated design of quantum experiments [36].
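
As a concrete illustration of the hitting-based deliberation described here, the following minimal sketch (illustrative only; the toy clip network and all names are made up, not taken from the paper) walks on a column-stochastic matrix P until an action clip is hit:

```python
# Minimal sketch of a hitting-based PS deliberation step (illustrative only,
# not the agents used in the paper). The clip network is a column-stochastic
# matrix P, i.e. P[i, j] is the probability of stepping from clip j to clip i,
# matching the convention sum_i p_ij = 1 used above.
import numpy as np

rng = np.random.default_rng(0)

def hitting_walk(P, start_clip, action_clips, max_steps=1000):
    """Random walk on the clip network until an action clip is hit."""
    clip = start_clip
    for step in range(max_steps):
        clip = rng.choice(P.shape[0], p=P[:, clip])   # one step of the chain
        if clip in action_clips:
            return clip, step + 1                     # hit: output this action
    return None, max_steps

# Toy clip network: clips 0 and 1 are percept-like, clips 2 and 3 are actions.
P = np.array([[0.1, 0.2, 0.0, 0.0],
              [0.3, 0.1, 0.0, 0.0],
              [0.4, 0.3, 1.0, 0.0],
              [0.2, 0.4, 0.0, 1.0]])
print(hitting_walk(P, start_clip=0, action_clips={2, 3}))
```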

Here, we focus on RPS agents, where the deliberation process based on mixing allows for a speed-up of Q-RPS agents with respect to their classical counterparts [20]. In contrast to basic hitting-based PS agents, the clip network of RPS agents is structured into several sub-networks, one for each percept clip, and each with its own stochastic matrix P. In addition to being stochastic, these matrices specify Markov chains which are ergodic [20], which ensures that the Markov chain in question has a unique stationary distribution, i.e. a unique eigenvector ${\boldsymbol{\alpha }}$ with eigenvalue +1, $P{\boldsymbol{\alpha }}={\boldsymbol{\alpha }}$. Starting from any initial state, continued application of P (or its equivalent in the quantized version) mixes the Markov chain, leaving the system in the stationary state.
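
Classically, mixing can be pictured as repeated application of P until the stationary distribution is reached; the number of applications needed scales as ${\rm{O}}(1/\delta )$ (cf. footnote 10). A minimal sketch with an arbitrary toy matrix, not taken from the paper:

```python
# Illustrative only: mixing an ergodic, column-stochastic toy matrix P into its
# stationary distribution alpha (P @ alpha = alpha) by repeated application.
import numpy as np

def mix(P, steps=200):
    alpha = np.full(P.shape[0], 1.0 / P.shape[0])   # arbitrary initial clip distribution
    for _ in range(steps):
        alpha = P @ alpha                           # one application of P
    return alpha

P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.3],
              [0.2, 0.2, 0.4]])
alpha = mix(P)
print(alpha, np.allclose(P @ alpha, alpha))         # stationarity check: P alpha = alpha
```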

As part of their deliberation process, RPS agents generate stationary distributions over their clip space, as specified by P, which is updated as the agent learns. These distributions have support over the whole sub-network clip space, and additional specifiers—flags—are used to ensure an output from a desired sub-set of clips. For instance, standard agents are presumed to output actions only, in which case only the actions are flagged using standard emoticons [14]. This ensures that an action will be output, while maintaining the relative probabilities of the actions. Put simply, flags provide a mechanism that can be used as a short-term memory, or to mark actions, to (temporarily) store additional information about the clip network besides that contained in the Markov chain. The same mechanism of flags can also be used to eliminate iterated attempts of actions which did not yield rewards in recent time-steps. This leads to a more efficient exploration of correct behavior.

In the quantum version of RPS, each clip ci is represented by a basis vector $| i\rangle $ in a Hilbert space ${ \mathcal H }$. In the most general case, the mixing process is then realized by a diffusion process on two copies of the original Hilbert space. On the doubled space ${ \mathcal H }\otimes { \mathcal H }$ a unitary operator W(P) (called the Szegedy walk operator [37, 38]) and a quantum state $| {\alpha }^{{\prime} }\rangle $ with $W(P)| {\alpha }^{{\prime} }\rangle =| {\alpha }^{{\prime} }\rangle $ take the roles of the classical objects P and ${\boldsymbol{\alpha }}$. Both W(P) and $| {\alpha }^{{\prime} }\rangle $ depend on a set of unitaries Ui on ${ \mathcal H }$ that act as ${U}_{i}| 0\rangle ={\sum }_{j}\sqrt{{p}_{{ij}}}| j\rangle $ for some reference state $| 0\rangle \in { \mathcal H }$. The more intricate construction of W(P) is given in detail in [39].

The feature of the quantum implementation of RPS that is crucial for us here is an amplitude amplification similar to Grover's algorithm [3], which incorporates the mixing of the Markov chain and allows outputting flagged actions after an average of ${\rm{O}}(1/\sqrt{\epsilon })$ calls to W(P), where ε is the probability of sampling an action from the stationary distribution. The algorithm achieving this is structured as follows. After an initialization stage where $| {\alpha }^{{\prime} }\rangle $ is prepared, a number of diffusion steps are carried out. Each such step consists of two parts. The first part is a reflection over the states corresponding to actions in the first copy of ${ \mathcal H }$. In the second part, an approximate reflection over the state $| {\alpha }^{{\prime} }\rangle $, the mixing, is carried out [20]. This second step involves ${\rm{O}}(1/\sqrt{\delta })$ calls to W(P).

The two-part diffusion steps are repeated ${\rm{O}}(1/\sqrt{\epsilon })$ times before a sample is taken from the resulting state by measuring in the basis $\{| i\rangle \}{}_{i=1,\ldots ,N}$. If an action is sampled, the algorithm concludes and that action is chosen as output. Otherwise, all steps are repeated. Since the algorithm amplifies the probability of sampling an action (almost) to unity, carrying out the deliberation procedure with the help of such a Szegedy walk hence requires an average of ${\rm{O}}(1/\sqrt{\delta \epsilon })$ calls to W(P). In comparison, a classical RPS agent would require an average of ${\rm{O}}(1/\delta )$ applications of P to mix the Markov chain, and an average of ${\rm{O}}(1/\epsilon )$ samples to find an action. Q-RPS agents could hence achieve a quadratic speed-up in their reaction time.

Here, it should be noted that, its elegance notwithstanding, the construction of the approximate reflection for general RPS networks is demanding for current quantum computational architectures. Most notably, this is due to the requirement of two copies of ${ \mathcal H }$, on which frequently updated11 coherent conditional operations need to be carried out [39, 41, 42]. However, for the special case of rank-one Markov chains P, the entire chain can be represented on one copy of ${ \mathcal H }$ by a single unitary ${U}_{P}={U}_{i}\ \forall i$, since all columns of P are identical. Conceptually, this simplification corresponds to a situation where each percept-specific clip network contains only actions and the Markov chain is mixed in one step ($\delta =1$). In such a case one uses flags to mark desired actions. Interestingly, these minor alterations also allow one to establish a one-to-one correspondence with the hitting-based basic PS using two-layered networks, into which all standard tabular reinforcement learning models such as Q-learning or SARSA can be subsumed when the update and transition rules have been appropriately amended [2]. In particular, basic PS using a two-layered network is already able to solve interesting classical tasks such as the mountain-car problem, grid-world, and many more [31–36].

Let us now discuss how the algorithm above can be performed for the rank-one case with the flagging mechanism in place. First, we restrict ${ \mathcal A }$ to be the subspace of the flagged actions only, assuming that there are $n\ll N$ of these, and we denote the corresponding probabilities within the stationary distribution by ${a}_{1},\,\ldots ,\,{a}_{n}$. In the initialization stage, the state $| \alpha \rangle ={\sum }_{i=1,\ldots ,N}\sqrt{{a}_{i}}| i\rangle $ is prepared. Then, an optimal number of k diffusion steps [3] is carried out, where

$k=\mathrm{round}\left(\tfrac{\pi }{4\sqrt{\epsilon }}-\tfrac{1}{2}\right)$,    (1)

and $\epsilon ={\sum }_{i=1,\ldots ,n}{a}_{i}$ is the probability to sample a flagged action from the stationary distribution. Within the diffusion steps, the reflections are performed only over all flagged actions, i.e.

${\mathrm{ref}}_{{ \mathcal A }}=2{\sum }_{i=1,\ldots ,n}| i\rangle \langle i| -{\mathbb{1}}.$    (2)

In the rank-one case, the reflection over the stationary distribution ${\boldsymbol{\alpha }}$ becomes an exact reflection

${\mathrm{ref}}_{\alpha }=2| \alpha \rangle \langle \alpha | -{\mathbb{1}},$    (3)

and can be carried out on one copy of ${ \mathcal H }$ [39]. After the diffusion steps, a sample is taken and the agent checks if the obtained action is marked with a flag. If this is the case, the action is chosen as output, otherwise the algorithm starts anew.
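
The rank-one deliberation loop just described can be illustrated with a plain statevector sketch: prepare $| \alpha \rangle $, apply k diffusion steps (a reflection over the flagged subspace followed by a reflection over $| \alpha \rangle $), and read out. The sign conventions, helper name and example distribution below are my own choices, not the circuit used in the experiment:

```python
# Hedged sketch of the rank-one Q-RPS deliberation (illustrative simulation only).
import numpy as np

def rank_one_qrps(a, n):
    """a: stationary distribution over N clips; the first n entries are flagged."""
    a = np.asarray(a, dtype=float)
    N = len(a)
    psi = np.sqrt(a)                                   # |alpha> = sum_i sqrt(a_i) |i>
    eps = a[:n].sum()                                  # probability of a flagged action
    k = int(round(np.pi / (4 * np.sqrt(eps)) - 0.5))   # optimal number of steps, cf. (1)

    ref_A = -np.eye(N)
    ref_A[:n, :n] = np.eye(n)                          # +1 on flagged clips, -1 elsewhere
    ref_alpha = 2 * np.outer(psi, psi) - np.eye(N)     # exact reflection over |alpha>

    for _ in range(k):
        psi = ref_alpha @ (ref_A @ psi)                # one diffusion step
    probs = psi ** 2                                   # amplitudes stay real here
    return eps, k, probs[:n].sum(), probs[:n] / probs[:n].sum()

# Two flagged actions among 100 clips, with stationary probabilities 0.02 and 0.01.
a = np.full(100, 0.97 / 98)
a[0], a[1] = 0.02, 0.01
print(rank_one_qrps(a, n=2))   # success probability near 1, 2:1 ratio preserved
```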

While a classical RPS agent requires an average of ${\rm{O}}(1/\epsilon )$ samples until obtaining a flagged action, this number reduces to ${\rm{O}}(1/\sqrt{\epsilon })$ for Q-RPS agents. This quantum advantage is particularly pronounced when the overall number of actions is very large compared to n and the environment is unfamiliar to the agent or has recently changed its rewarding pattern, in which case ε may be very small. Given some time, both agents learn to associate rewarded actions with a given percept, suitably add or remove flags, and adapt P (and by extension ${\boldsymbol{\alpha }}$). In the short run, however, classical agents may be slow to respond and the advantage of a Q-RPS agent becomes apparent. Despite the remarkable simplification of the algorithm for the rank-one case with flags, the quadratic speed-up is hence preserved12. This simplification also leads to a reduction in experimental complexity, in terms of the required number of two-qubit gates.

3. Experimental implementation of rank-one RPS

3.1. Quantum algorithm

The proof-of-principle experiment reported in this paper demonstrates the speed-up of quantum-enhanced learning agents. That is, we are able to empirically confirm both the quadratically improved scaling ${\rm{O}}(1/\sqrt{\epsilon })$ of the average number of calls to the diffusion operator before sampling one of the desired actions, and the correct output according to the tail of the stationary distribution. Here, ε denotes the initial probability of finding a flagged action within the stationary distribution ${\boldsymbol{\alpha }}=\{{a}_{i}\}$. The tail is defined as the first n components of ${\boldsymbol{\alpha }}$. By a correct output according to the tail of the stationary distribution, we mean that ${a}_{j}/{a}_{k}={b}_{j}/{b}_{k}\ \forall j,k\in \{1,\,\ldots ,\,n\}$, where bj denotes the final probability that the agent obtains the flagged action labeled j. Note that the Q-RPS algorithm enhances the overall probability of obtaining a flagged action such that

$\tilde{\epsilon }\equiv {\sum }_{i=1,\ldots ,n}{b}_{i}\approx 1,$    (4)

whilst maintaining the relative probabilities of the flagged actions according to the tail of ${\boldsymbol{\alpha }}$, as illustrated in figure 1(b).

For the implementation we hence need at least a three-dimensional Hilbert space that we realize in our experiment using two qubits encoded in the energy levels of two trapped ions (see the experimental setup section): two states to represent two different flagged actions (represented in our experiment by $| 00\rangle $ and $| 01\rangle $), and at least one additional state for all non-flagged actions ($| 10\rangle $ and $| 11\rangle $ in our experiment). The preparation of the stationary state is implemented by

Equation (5)

where ${R}_{j}(\theta ,\phi )$ is a single-qubit rotation on qubit j, i.e.

${R}_{j}(\theta ,\phi )=\exp \left[-{\rm{i}}\tfrac{\theta }{2}\left({X}_{j}\cos \phi +{Y}_{j}\sin \phi \right)\right].$    (6)

Here, Xj, Yj, and Zj denote the Pauli operators of qubit j. The total probability $\epsilon ={a}_{00}+{a}_{01}$ for a flagged action within the stationary distribution is then determined by ${\theta }_{1}$ via

Equation (7)

whereas ${\theta }_{2}$ determines the relative probabilities of obtaining one of the flagged actions via

Equation (8)

The reflection over the flagged actions ${\mathrm{ref}}_{{ \mathcal A }}$ is here given by a Z rotation, defined by ${R}_{j,z}(\theta )=\exp \left[-{\rm{i}}\tfrac{\theta }{2}{Z}_{j}\right]$, with rotation angle $-\pi $ for the first qubit,

${\mathrm{ref}}_{{ \mathcal A }}={R}_{1,z}(-\pi ).$    (9)

The reflection over the stationary distribution can be performed by a combination of single-qubit rotations determined by ${\theta }_{1}$ and ${\theta }_{2}$ and a CNOT gate given by

Equation (10)

which can be understood as two calls to ${U}_{P}$ (once in terms of ${U}_{P}^{\dagger }$) supplemented by fixed single-qubit operations [39]. The total gate sequence for a single diffusion step (consisting of a reflection over the flagged actions followed by a reflection over the stationary distribution) can hence be decomposed into single-qubit rotations and CNOT gates and is shown in figure 2. The speed-up of the rank-one Q-RPS algorithm with respect to a classical RPS agent manifests in terms of a quadratically smaller average number of calls to ${U}_{P}$ (or, equivalently, to the diffusion operator $D={\mathrm{ref}}_{\alpha }{\mathrm{ref}}_{{ \mathcal A }}$) until a flagged action is sampled. Since the final probability of obtaining a desired action is $\tilde{\epsilon }\equiv {\sum }_{i=1,\ldots ,n}{b}_{i}$, we require $1/\tilde{\epsilon }$ samples on average, each of which is preceded by the initial preparation of $| \alpha \rangle $ and k diffusion steps. The average number of uses of ${U}_{P}$ to sample correctly is hence

$C=\tfrac{2k(\epsilon )+1}{\tilde{\epsilon }},$    (11)

which we refer to as 'cost' in this paper. In what follows, it is this functional relationship between C and ε that we put to the test, along with the predicted ratio ${a}_{00}/{a}_{01}$ of occurrence of the two flagged actions.
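
For concreteness, a small helper (hypothetical, not the paper's analysis code) evaluating this cost bookkeeping: $k(\epsilon )$ diffusion steps per attempt, $1/\tilde{\epsilon }$ attempts on average, and $2k+1$ uses of ${U}_{P}$ per attempt (two per diffusion step for the reflection over $| \alpha \rangle $, plus one for the initial preparation of $| \alpha \rangle $):

```python
# Cost bookkeeping sketch for the rank-one Q-RPS agent.
import numpy as np

def k_opt(eps):
    return int(round(np.pi / (4 * np.sqrt(eps)) - 0.5))    # cf. (1)

def quantum_cost(eps, eps_tilde):
    return (2 * k_opt(eps) + 1) / eps_tilde                 # cf. (11)

eps = 0.0504                              # one experimental setting (cf. table 1, k = 3)
print(k_opt(eps))                         # -> 3
print(quantum_cost(eps, eps_tilde=0.77))  # quantum cost with the measured eps_tilde
print(1 / eps)                            # classical cost scales as 1/eps
```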

Figure 2.

Figure 2. Quantum circuit for Q-RPS. A rank-one Q-RPS is implemented using two qubits. The diffusion step consisting of reflections over the flagged actions and the stationary distribution (shown once each) is repeated k times, where k is given by (1) in section 2. The specific pulse sequence implementing this circuit is explained in section 3.2.


3.2. The experimental setup

Two 171Yb+ ions are confined in a linear Paul trap with axial and radial trap frequencies of $2\pi \,\times \,117\,\mathrm{kHz}$ and $2\pi \,\times \,590\,\mathrm{kHz}$, respectively. After Doppler cooling, the two ions form a linear Coulomb crystal, which is exposed to a static magnetic field gradient of 19 T m−1, generated by a pair of permanent magnets. The ion–ion spacing in this configuration is approximately 10 μm. Magnetic gradient induced coupling (MAGIC) between ions results in an adjustable qubit interaction mediated by the common vibrational modes of the Coulomb crystal [43]. In addition, qubit resonances are individually shifted as a result of this gradient and become position dependent. This makes the qubits distinguishable and addressable by their frequency of resonant excitation. The addressing frequency separation for this two-ion system is about 3.7 MHz. All coherent operations are performed using radio frequency (RF) radiation near 12.6 GHz, matching the respective qubit resonances [44]. The RF power is carefully adjusted for each ion in order to achieve an equal Rabi frequency of 20.92(3) kHz. A more detailed description of the experimental setup is given in [26, 43, 45].

The qubits are encoded in the hyperfine manifold of each ion's ground state, representing an effective spin 1/2 system. The qubit states $| 0\rangle $ and $| 1\rangle $ are represented by the energy levels $| {}^{2}{S}_{1/2},F=0\,\rangle $ and $| {}^{2}{S}_{1/2},F=1,{m}_{F}=+1\,\rangle $, respectively. The ions are Doppler cooled on the resonance $| {}^{2}{S}_{1/2},F=1\,\rangle $ ↔ $| {}^{2}{P}_{1/2},F=0\,\rangle $ with laser light near 369 nm. Optical pumping into long-lived meta-stable states is prevented using laser light near 935 and 638 nm. The vibrational excitation of the Doppler cooled ions is further reduced by employing RF sideband cooling for both the center of mass mode and the stretch mode [46]. This leads to a mean vibrational quantum number of $\langle n\rangle \leqslant 5$ for both modes. The ions are then initialized in the qubit state $| 0\rangle $ by state selective optical pumping with a 2.1 GHz blue-shifted Doppler-cooling laser on the $| {}^{2}{S}_{1/2},F=1\rangle $ ↔ $| {}^{2}{P}_{1/2},F=1\,\rangle $ resonance.

3.3. State preparation, conditional dynamics, and read-out

The desired qubit states are prepared by applying an RF pulse resulting in a coherent qubit rotation with precisely defined rotation angle and phase as given by (6)–(8). We replace ${R}_{z}\left(\tfrac{\pi }{2}\right)$ by $R\left(\tfrac{\pi }{2},\tfrac{\pi }{2}\right)R\left(\tfrac{\pi }{2},0\right)R\left(\tfrac{\pi }{2},\tfrac{3\pi }{2}\right)$ and ${R}_{z}\left(-\tfrac{\pi }{2}\right)$ by $R\left(\tfrac{\pi }{2},\tfrac{\pi }{2}\right)R\left(\tfrac{\pi }{2},\pi \right)R\left(\tfrac{\pi }{2},\tfrac{3\pi }{2}\right)$. The required number of diffusion steps is then applied to both qubits, using appropriate single-qubit rotations and a two-qubit ZZ-interaction given by

${U}_{{ZZ}}(\theta )=\exp \left[-{\rm{i}}\tfrac{\theta }{2}{Z}_{1}{Z}_{2}\right],$    (12)

which is directly realizable with MAGIC [43]. A CNOT gate (UCNOT) can then be performed by supplementing ${U}_{{ZZ}}(\pi /2)$ with appropriate single-qubit rotations.

The required number of single-qubit gates is optimized by merging appropriate single-qubit rotations from ${\mathrm{ref}}_{{ \mathcal A }}$ and ${\mathrm{ref}}_{\alpha }$ (see figure 2). Thus, we can simplify the algorithm to

Equation (13)

as shown in figure 3.
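
A quick numerical check of the ${R}_{z}$ pulse replacements quoted at the beginning of this subsection can be made by assuming the single-qubit rotation convention $R(\theta ,\phi )=\exp [-{\rm{i}}\tfrac{\theta }{2}(X\cos \phi +Y\sin \phi )]$ together with ${R}_{z}(\theta )=\exp [-{\rm{i}}\tfrac{\theta }{2}Z]$ and taking the pulses to be applied in the listed temporal order (so the matrix product runs right to left). This is only a consistency sketch; other sign conventions would interchange the roles of the two sequences:

```python
# Consistency check of the R_z -> three-pulse replacements under one explicit convention.
import numpy as np
from scipy.linalg import expm

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0]).astype(complex)

def R(theta, phi):
    return expm(-1j * theta / 2 * (np.cos(phi) * X + np.sin(phi) * Y))

def Rz(theta):
    return expm(-1j * theta / 2 * Z)

def seq(*pulses):                 # pulses in temporal order -> reversed matrix product
    U = np.eye(2, dtype=complex)
    for p in pulses:
        U = p @ U
    return U

U1 = seq(R(np.pi/2, np.pi/2), R(np.pi/2, 0),     R(np.pi/2, 3*np.pi/2))
U2 = seq(R(np.pi/2, np.pi/2), R(np.pi/2, np.pi), R(np.pi/2, 3*np.pi/2))
print(np.allclose(U1, Rz(np.pi/2)), np.allclose(U2, Rz(-np.pi/2)))   # -> True True
```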

Figure 3.

Figure 3. Experimental sequence for Q-RPS. RF1 and RF2 each indicate the time axis for one qubit. The qubits are prepared in the desired input states using single-qubit rotations implemented by applying RF pulses. For each RF pulse, the two parameters within the parentheses represent the specific rotation angle and phase according to (6)–(8). Dynamical decoupling (DD) during the conditional evolution ${U}_{{zz}}(\pi /2)$ (indicated by a green box) is implemented using RF pulses (indicated in yellow). Ten sets of 14 pulses each (UR14) [47] are applied during the evolution time $\tau =4.24$ ms with a J-coupling between the two ions of $2\pi \times 59\,\mathrm{Hz}$. The diffusion step is repeated k times according to (1) in section 2. Laser light near 369 nm is used for cooling and to initialize the ions in the qubit state $| 0\rangle \equiv | {}^{2}{S}_{1/2},F=0\,\rangle $. At the end of the coherent manipulation, laser light is used again for state selective detection and also for Doppler cooling. The process durations are: 30 ms for Doppler cooling, 100 ms for sideband cooling on the center-of-mass mode, 100 ms for sideband cooling on the stretch mode, 0.25 ms for initialization of the ions in state $| 0\rangle $, and 2 ms for detection.


During the evolution time of 4.24 ms for UZZ in each diffusion step, both qubits are protected from decoherence by applying universally robust (UR) dynamical decoupling (DD) pulses [47]. Ten sets (x = 10) of the UR14 sequence (N = 14 RF π-pulses each), equaling a total of 140 pulses, are applied. Each set is comprised of 14 error-canceling pulses (figure 3) with appropriately chosen phases ϕ as specified in [47].

Since the phases of the π-pulses are symmetrically arranged in time, only the first seven pulses are shown in figure 3. The last pulse is also shown to visualize the spacing of these pulses with respect to the start and end of evolution time, compared to the intermediate pulses. The maximum interaction time of 30 ms required to realize the deliberation algorithm (corresponding to 7 diffusion steps) is 60 times longer than the qubit coherence time. Such a long coherent interaction time is accomplished by the DD pulses applied to each qubit simultaneously.

Finally, projective measurements on both qubits are performed in the computational basis $\{| 0\rangle ,| 1\rangle \}$ by scattering laser light near 369 nm on the $| {}^{2}{S}_{1/2},F=1\,\rangle $ ↔ $| {}^{2}{P}_{1/2},F=0\,\rangle $ transition, and detecting spatially resolved resonance fluorescence using an electron multiplying charge coupled device to determine the relative frequencies ${b}_{00},{b}_{01},{b}_{10},{b}_{11}$ for obtaining the states $| 00\rangle $, $| 01\rangle $, $| 10\rangle $, and $| 11\rangle $, respectively. Two thresholds are used to distinguish between dark and bright states of the ions, thus discarding 10% of all measurements as ambiguous events with a photon count that lies in the region of two partially overlapping Poissonian distributions representing the dark and bright states of the ions [45, 48].
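
The double-threshold classification can be sketched as follows; the mean count rates, thresholds and sample sizes below are made-up illustrative numbers, not the experiment's calibration:

```python
# Illustrative double-threshold discrimination: counts at or below the lower
# threshold are classified as dark (|0>), counts at or above the upper threshold
# as bright (|1>), and counts in between are discarded as ambiguous.
import numpy as np

def classify(counts, lower, upper):
    counts = np.asarray(counts)
    result = np.full(counts.shape, -1)     # -1 marks a discarded (ambiguous) event
    result[counts <= lower] = 0            # dark  -> qubit state |0>
    result[counts >= upper] = 1            # bright -> qubit state |1>
    return result

rng = np.random.default_rng(1)
dark = rng.poisson(1.0, 1000)              # assumed photon counts for a dark ion
bright = rng.poisson(12.0, 1000)           # assumed photon counts for a bright ion
counts = np.concatenate([dark, bright])
labels = classify(counts, lower=2, upper=5)
print((labels == -1).mean())               # fraction of discarded events
```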

4. Experimental results

As discussed above, our goal is to test the two characteristic features of rank-one Q-RPS: (i) the scaling of the average cost C with ε, and (ii) the sampling ratio for the different flagged actions.

Therefore, our first set of measurements studies the behavior of the cost C as a function of the total initial probability ε. The second set of measurements studies the behavior of the output probability ratio ${r}_{f}={b}_{00}/{b}_{01}$ as a function of the input probability ratio ${r}_{i}={a}_{00}/{a}_{01}$.

For the former, a series of measurements is performed for different values of ε corresponding to k = 1 to k = 7 diffusion steps after the initial state preparation (table 1). To obtain the cost $C=(2k(\epsilon )+1)/\tilde{\epsilon }$, where $\tilde{\epsilon }={b}_{00}+{b}_{01}$, we measure the probabilities b00 and b01 after k diffusion steps and repeat the experiment 1600 times for each fixed ε. The average cost is then plotted against ε as shown in figure 4. The algorithm complexity is defined as the number of computational steps (equivalently, the number of calls to UP) until a flagged action is sampled, and can be expressed as ${\rm{O}}({\epsilon }^{-\xi })$. Ideally, classical RPS gives $\xi =1$ whereas Q-RPS gives $\xi =0.5$. A fit to the experimental data yields $\xi =0.57(5)$, in good agreement with the behavior expected for the ideal Q-RPS algorithm. In the range of probabilities ε chosen here, the experimental Q-RPS thus shows improved scaling compared to the expected classical RPS and clearly outperforms it, as shown in figure 4. The deviation from the ideal behavior is attributed to a small detuning of the RF pulses implementing coherent operations, as we discuss in section 4.1.
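
As a rough cross-check of this scaling, one can redo a simple, unweighted log-log fit of the cost against ε using the ε and measured $\tilde{\epsilon}$ values of table 1 and $C=(2k+1)/\tilde{\epsilon }$. The quoted $\xi =0.57(5)$ comes from the paper's own fit; this simplified version lands in the same region:

```python
# Unweighted log-log fit of the cost versus eps (cross-check sketch).
import numpy as np

k = np.arange(1, 8)
eps = np.array([0.2742, 0.0987, 0.0504, 0.0305, 0.0204, 0.0146, 0.0110])
eps_tilde = np.array([0.89, 0.70, 0.77, 0.76, 0.74, 0.76, 0.66])   # measured values
cost = (2 * k + 1) / eps_tilde

slope, intercept = np.polyfit(np.log(eps), np.log(cost), 1)
print(-slope)   # fitted exponent xi; the ideal quantum value would be 0.5
```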

Figure 4.

Figure 4. Scaling behavior of the learning agent's cost employing Q-RPS and RPS. After the preparation of $| \alpha \rangle $, k diffusion steps are applied before an action is sampled. This procedure is repeated until a flagged action is obtained, accumulating a certain cost C, whose average is shown on the vertical axis. Measurements are performed for different values of ε corresponding to k = 1 to k = 7 diffusion steps. The dashed black line and the solid blue line represent the behavior expected for ideal Q-RPS (${\epsilon }^{-0.5}$) and ideal classical RPS (${\epsilon }^{-1}$), respectively. The fit to the experimental data confirms an ${\epsilon }^{-0.57}$ scaling, and thus is consistent with Q-RPS. The data show that the experimental Q-RPS outperforms the classical RPS within the range of ε chosen in the experiment. Error bars represent the statistical errors.


For the second set of measurements, we select calculated probabilities a00 and a01 in order to obtain different values of the input ratio ${r}_{i}={a}_{00}/{a}_{01}$ between 0 and 2, whilst keeping $k(\epsilon )$ in a range between k = 1 and k = 3 (table 2). For these probabilities a00 and a01, the corresponding rotation angles ${\theta }_{1}$ and ${\theta }_{2}$ of RF pulses used for preparation are extracted using (7) and (8). We then carry out the Q-RPS algorithm for the specific choices of k and repeat it 1600 times to estimate the probabilities b00 and b01. We finally obtain the output ratio ${r}_{f}={b}_{00}/{b}_{01}$, which is plotted against the input ratio in figure 5. The experimental data follows a straight line with a small offset from the ideal behavior ${r}_{f}/{r}_{i}=1$. Therefore, the ratio of the number of occurrences of the two actions obtained at the end of the deliberation process is maintained with respect to the relative probabilities of the initial stationary distribution.

Figure 5.

Figure 5. Output distribution. A comparison of the output ratio of two flagged actions at the end of the algorithm with the corresponding input ratio is shown. Measurements are performed with k = 1 (red square) and k = 3 (blue circle) diffusion steps. The black dashed line shows the behavior of the ideal Q-RPS. The red and blue dashed lines, each representing a linear fit to the corresponding set of data, confirm that the initial probability ratio is maintained. Error bars represent statistical errors.


The slopes of the two fitted linear functions shown in figure 5 agree within their respective errors, showing that the deviation of the output ratio from the ideal result is independent of the number of diffusion steps. This indicates that the deviation is not caused by the quantum algorithm itself, but by the initial state preparation and/or by the final measurement process, where such a deviation can be caused by an asymmetry in the detection fidelity (see section 4.1). Indeed, the observed deviation is well explained by a typical asymmetry in the detection fidelity of 3% as encountered in the measurements presented here. This implies reliability of the quantum algorithm also for a larger number of diffusion steps.

4.1. Interpretational considerations

In this section, we discuss deviations of the experimental data from idealized theory predictions. In particular, for the chosen values of ε and the corresponding optimal $k(\epsilon )$, the probability of obtaining a flagged action is expected to be close to 100%. However, the success probability in our experiment lies between 66% (for k = 7) and 88% (for k = 1). In what follows, we discuss several reasons for this. First, we consider in detail experimental imperfections that affect the scaling of the cost C with ε as shown in figure 4. Then, we discuss how the input and output ratios (figure 5), ri and rf, are affected by an imbalanced detection efficiency for the two qubit states. In both cases the observed deviations from the ideal results are quantitatively explained by numerically simulating the quantum algorithm taking into account experimental imperfections.

4.1.1. Scaling of cost C

Even in an ideal scenario without noise or experimental imperfections, the success probability $\tilde{\epsilon }$, as defined in (4), after k diffusion steps is usually not equal to unity, and depends on the specific value of ε. This behavior originates from the step-wise increase of the number of diffusion steps $k=\mathrm{round}(\pi /(4\sqrt{\epsilon })-\tfrac{1}{2})$ in the algorithm. The success probability is hence only 100% if k is an integer without rounding. The change of the ideal success probability with deviations of ε from such specific values is largest for small numbers of diffusion steps (e.g. k = 1) and can drop down to 82% (neglecting the cases where it is not advantageous to use a quantum algorithm at all). For larger numbers of diffusion steps, the exact value of ε no longer plays an important role for the ideal success probability, provided that the correct number of diffusion steps is chosen. For example, for k = 6, the ideal success probability is larger than 98% independently of the exact value of ε. Throughout this paper, we have chosen ε in such a way that $(\pi /(4\sqrt{\epsilon })-\tfrac{1}{2})$ (see (1)) is always close to an integer (see table 1), such that the deviation from a 100% success probability due to the theoretically chosen ε is negligible compared to other error sources.
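
The ideal success probability referred to here follows the textbook amplitude-amplification expression ${P}_{\mathrm{success}}={\sin }^{2}[(2k+1)\arcsin \sqrt{\epsilon }]$ (my formulation, consistent with the discussion above but not quoted from the paper). A short sketch of how it varies with ε:

```python
# Ideal success probability of k-step amplitude amplification (textbook formula).
import numpy as np

def k_opt(eps):
    return int(round(np.pi / (4 * np.sqrt(eps)) - 0.5))

def ideal_success(eps):
    theta = np.arcsin(np.sqrt(eps))
    return np.sin((2 * k_opt(eps) + 1) * theta) ** 2

# Near-optimal eps values give ~1; an eps between optimal values (e.g. 0.15) shows the dip.
for eps in (0.2742, 0.15, 0.0504, 0.0146):
    print(eps, k_opt(eps), round(ideal_success(eps), 3))
```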

Table 1.  Experimentally realized success probabilities. Initial theoretical probabilities ε of finding a flagged action within the stationary distribution are shown for the various numbers of diffusion steps. Theoretically calculated and experimentally measured success probabilities ($\tilde{\epsilon }$) for diffusion steps k = 1–7 are also shown.

k | a00 (theory) | a01 (theory) | $\epsilon$ (theory) | b00 (theory) | b01 (theory) | $\tilde{\epsilon}$ (theory) | b00 (exp.) | b01 (exp.) | $\tilde{\epsilon}$ (exp.)
1 | 0.1371 | 0.1371 | 0.2742 | 0.4966 | 0.4966 | 0.9932 | 0.449(15) | 0.440(15) | 0.89(2)
2 | 0.0493 | 0.0493 | 0.0987 | 0.4996 | 0.4996 | 0.9993 | 0.347(15) | 0.353(15) | 0.70(2)
3 | 0.0252 | 0.0252 | 0.0504 | 0.4999 | 0.4999 | 0.9998 | 0.438(16) | 0.334(15) | 0.77(2)
4 | 0.0152 | 0.0152 | 0.0305 | 0.5000 | 0.5000 | 1.0000 | 0.422(15) | 0.336(15) | 0.76(2)
5 | 0.0102 | 0.0102 | 0.0204 | 0.5000 | 0.5000 | 1.0000 | 0.407(17) | 0.331(16) | 0.74(2)
6 | 0.0073 | 0.0073 | 0.0146 | 0.5000 | 0.5000 | 1.0000 | 0.431(17) | 0.324(16) | 0.76(2)
7 | 0.0055 | 0.0055 | 0.0110 | 0.5000 | 0.5000 | 1.0000 | 0.365(15) | 0.299(14) | 0.66(2)

Table 2.  Input and output distributions. Input and output ratios, ri and rf respectively, of the two flagged actions represented by the states $| 00\rangle $ and $| 01\rangle $ for diffusion steps k = 1 and k = 3 are shown.

k | a00 (theory) | a01 (theory) | ri (theory) | b00 (exp.) | b01 (exp.) | rf (exp.)
1 | 0.00271 | 0.27144 | 0.01 | 0.061(7) | 0.809(12) | 0.075(9)
1 | 0.07257 | 0.20159 | 0.36 | 0.290(14) | 0.583(15) | 0.50(3)
1 | 0.11383 | 0.16032 | 0.71 | 0.415(15) | 0.466(15) | 0.89(4)
1 | 0.14107 | 0.13309 | 1.06 | 0.488(15) | 0.389(15) | 1.25(6)
1 | 0.16040 | 0.11376 | 1.41 | 0.519(13) | 0.351(12) | 1.48(6)
1 | 0.17482 | 0.09933 | 1.76 | 0.566(15) | 0.305(14) | 1.85(10)
1 | 0.13708 | 0.13708 | 1.00 | 0.468(16) | 0.401(16) | 1.17(6)
3 | 0.00458 | 0.04578 | 0.10 | 0.127(10) | 0.718(14) | 0.176(14)
3 | 0.01633 | 0.03402 | 0.48 | 0.301(15) | 0.518(16) | 0.58(3)
3 | 0.02328 | 0.02707 | 0.86 | 0.442(16) | 0.451(16) | 0.98(5)
3 | 0.02788 | 0.02248 | 1.24 | 0.510(16) | 0.354(15) | 1.44(8)
3 | 0.03114 | 0.01922 | 1.62 | 0.551(16) | 0.305(14) | 1.81(10)
3 | 0.03357 | 0.01679 | 2.00 | 0.586(15) | 0.268(13) | 2.19(12)

However, in a real experiment, the initial state, and therefore ε, can only be prepared with a certain accuracy. This can lead to an inaccurate estimate of the optimal number of diffusion steps. As opposed to the ideal case, an assumed preparation accuracy of $\pm 1 \% $ in ε only has a small effect on the success probability $\tilde{\epsilon }$ (a drop of less than 5%) for $\epsilon \gg 0.01$, corresponding to $k\leqslant 3$. However, when ε does not fulfill the aforementioned condition and approaches $\approx 0.01$ from above, corresponding to k = 6, then the success probability drops down to $\tilde{\epsilon }=70 \% $ due to a non-optimal choice of k.

The preparation accuracy depends on the detuning ${\rm{\Delta }}\omega $ of the RF pulses for single-qubit rotations as well as on the uncertainty ΔΩ in the determination of the Rabi frequency Ω. The calibration of our experiment revealed Δω/Ω < 0.05 and ΔΩ/Ω = 0.0015, leading to an error in ε of $\pm 2.5\times {10}^{-3}$ and a decrease of the success probability $\tilde{\epsilon }$ of less than 0.04. The detuning Δω and the uncertainty of the Rabi frequency ΔΩ not only influence the state preparation at the beginning of the quantum algorithm, but also its fidelity, as is detailed in the next paragraph.

To prevent decoherence during conditional evolution, we use 140 RF π-pulses per diffusion step and ion. Therefore, already a small detuning influences the fidelity of the algorithm. Consequently, the error induced by detuning is identified as the main error source, leading, for example, to $\tilde{\epsilon }\approx 0.77$ for k = 6 and Δω/Ω = −0.04. This error is much larger than the error caused by dephasing (which is still present after DD is applied), or the detection error. In a separate measurement, we determined an exponential dephasing of γτ ≈ 1/14 for a single diffusion step of duration τ ≈ 4 ms, which alone would lead to $\tilde{\epsilon }\approx 0.90$ for k = 6. Here, γ indicates the experimentally diagnosed dephasing rate, and τ is the time of coherent evolution. The influence of the detuning on the cost of our algorithm is shown in figure 6. Here, we simulated the complete quantum algorithm, including the experimentally determined dephasing and detection errors, for ${\rm{\Delta }}\omega /{\rm{\Omega }}\in \{0,-0.04,-0.08\}$. The experimental data are consistent with an average negative detuning of Δω/Ω = −0.04. Note that the detuning not only influences the single-qubit rotations that are an integral part of the quantum algorithm, but also leads to errors during the conditional evolution when DD pulses are applied.

Figure 6.

Figure 6. Detuning affecting scaling cost C. The influence of the detuning of the RF pulses on the fidelity of the Q-RPS algorithm is shown for three different values of the relative detuning ${\rm{\Delta }}\omega /{\rm{\Omega }}$. Black markers indicate the results of numerical simulations of the complete Q-RPS algorithm taking different values of the relative detuning ${\rm{\Delta }}\omega /{\rm{\Omega }}$ into account. Most of the experimental data (red circles) lie close to a relative detuning of ${\rm{\Delta }}\omega /{\rm{\Omega }}=-0.04$.


4.1.2. Input and output ratios

In the ideal algorithm, the output ratio ${r}_{f}={b}_{00}/{b}_{01}$ of the two flagged actions represented by the states $| 00\rangle $ and $| 01\rangle $ at the end of the algorithm equals the input ratio ri. However, in the experiment we have observed deviations from ${r}_{f}/{r}_{i}=1$. During the measurements for the investigation of the scaling behavior (figure 4), we fixed ri = 1. The observed output ratios vary within $0.98\leqslant {r}_{f}/{r}_{i}\leqslant 1.33$. That is, the probability b00 to obtain the state $| 00\rangle $ is increased with respect to b01. Also during the measurements testing the output ratio, we observe output ratios larger than the input ratios.

An asymmetric detection error could be the cause for this observation. Typical errors in our experiment are given by the probability to detect a bright ion ($| 1\rangle $) with a probability of ${d}_{{\rm{B}}}=0.06$ as dark, and a dark ion ($| 0\rangle $) with a probability of ${d}_{{\rm{D}}}=0.03$ as bright. In figure 7 we compare the measured output ratios with the calculated output ratios assuming the above mentioned detection errors and two different detuning errors, ${\rm{\Delta }}\omega /{\rm{\Omega }}\in \{-0.015,-0.04\}$.
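
The effect of such an asymmetric readout on the measured ratio can be sketched by applying a per-qubit confusion matrix to an assumed population vector (only the error rates are taken from the text; the populations below are illustrative):

```python
# How an asymmetric detection error skews the apparent b00/b01 ratio (sketch).
import numpy as np

d_D, d_B = 0.03, 0.06
M = np.array([[1 - d_D, d_B],      # P(measure 0 | true 0), P(measure 0 | true 1)
              [d_D, 1 - d_B]])     # P(measure 1 | true 0), P(measure 1 | true 1)
M2 = np.kron(M, M)                 # independent readout of the two qubits

p_true = np.array([0.45, 0.45, 0.05, 0.05])   # assumed populations of |00>,|01>,|10>,|11>
p_meas = M2 @ p_true
print(p_true[0] / p_true[1], p_meas[0] / p_meas[1])   # true vs apparent b00/b01 ratio
```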

Figure 7.

Figure 7. Imbalanced detection. The measured values (red squares and blue circles) of the input and output ratios are compared to simulations (solid and dashed–dotted black lines) of the Q-RPS algorithm taking into account the experimentally determined detection error and detuning errors. The solid line corresponds to an expected output ratio taking into account an unbalanced detection error where ${d}_{{\rm{B}}}=0.06$ for bright ions and ${d}_{{\rm{D}}}=0.03$ for dark ions and detuning error ${\rm{\Delta }}\omega /{\rm{\Omega }}=-0.04$. The dashed–dotted line represents the same detection error and a detuning error of −0.015.


When the experimentally determined detection error is taken into account, the simulation with detuning error ${\rm{\Delta }}\omega /{\rm{\Omega }}=-0.04$ does not describe the experimental data well, neither for one diffusion step nor for three diffusion steps. The experimental data agree well with a simulation using an average detuning error ${\rm{\Delta }}\omega /{\rm{\Omega }}=-0.015$. This indicates that the detuning during these measurements was kept around ${\rm{\Delta }}\omega /{\rm{\Omega }}=-0.015$, leading to an average success probability of $\tilde{\epsilon }=85 \% $ for k = 3 diffusion steps, compared to $\tilde{\epsilon }=77 \% $ for k = 3 during the measurements investigating the scaling (see table 1). In addition, errors in the preparation of the input states play a role, especially when preparing very large or very small ratios, leading to either a00 or a01 being close to the preparation accuracy of $\leqslant 2.5\times {10}^{-3}$.

5. Conclusion

We have investigated a quantum-enhanced deliberation process of a learning agent implemented in an ion trap quantum processor. Our approach is centered on the PS [14] model for reinforcement learning. Within this paradigm, the decision-making procedure is cast as a stochastic diffusion process, that is, a (classical or quantum) random walk in a representation of the agent's memory.

The classical PS framework can be used to solve standard textbook problems in reinforcement learning [31–33], and has recently been applied in advanced robotics [34], adaptive quantum computation [35], as well as in the machine-generated design of quantum experiments [36]. We have focused on reflecting PS [20], an advanced variant of the PS model based on 'mixing', where the deliberation process allows for a quantum speed-up of Q-RPS agents with respect to their classical counterparts. In particular, we have considered the interesting special case of rank-one Q-RPS. This provides the advantage of the speed-up offered by the mixing-based approach, but is also in one-to-one correspondence with the hitting-based basic PS using two-layered networks, which has been applied in classical task environments [31–36]. In addition, rank-one Q-RPS can be used to encode all tabular reinforcement learning models including Q-learning and SARSA by appropriately amending the update and transition rules [2].

In a proof-of-principle experimental demonstration, we verify that the deliberation process of the quantum learning agent is quadratically faster compared to that of a classical learning agent. The experimental uncertainties in the reported results, which are in excellent agreement with a detailed model, do not interfere with this genuine quantum advantage in the agent's deliberation time. We achieve results for the cost C for up to 7 diffusion steps, corresponding to an initial probability ε = 0.01 to choose a flagged action. In this sense, our experimental realization of a rank-one Q-RPS decision-making algorithm, which differs from standard amplitude amplification already due to the reflection over the stationary (rather than a uniform) distribution, also provides a comprehensive test of the scaling behavior that goes beyond previous experiments [49–51], where standard amplitude amplification based on single diffusion steps has been carried out.

The systematic variation of the ratio ri between the input probabilities a00 and a01 for flagged actions, together with the measurement of the ratio rf between the learning agent's output probabilities b00 and b01 as a function of ri, shows that the quantum algorithm is reliable independent of the number of diffusion steps.

This experiment highlights the potential of a quantum computer in the field of quantum enhanced learning and artificial intelligence. A practical advantage, of course, will become evident once larger percept spaces and general rank-N Q-RPS are employed. Such extensions are, from the theory side, unproblematic given that the modularized nature of the algorithm makes it scalable, following [39]. An experimental realization of such large-scale quantum enhanced learning will be feasible with the implementation of scalable quantum computer architectures. Meanwhile, all essential elements of Q-RPS have been successfully demonstrated in the proof-of-principle experiment reported here.

Acknowledgments

SW thanks A Melnikov for fruitful discussions. TS, SW, GSG and CW acknowledge funding from Deutsche Forschungsgemeinschaft. GSG also acknowledges support from the European Commission's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement number 657261. HJB acknowledges support from the Austrian Science Fund (FWF) through the Grant No. SFB FoQuS F4012. NF acknowledges support from the Austrian Science Fund (FWF) through the project P 31339-N27, the START project Y879-N27, and the joint Czech-Austrian project MultiQUEST (I 3053-N27 and GF17-33780L).

Footnotes

  • 9 

    Other approaches that we will not further discuss here concern, among others, models where internal processes are sped up by annealing processes [15, 16]; or where the environment, and the agent's interaction with it, may be of quantum mechanical nature as well [17–19].

  • 10 

    The mixing time depends on the spectral gap δ of the Markov chain P, i.e. the difference between the two largest eigenvalues of P [20].

  • 11 

    Updating of the clip network may include, e.g. modifications of the weights associated with the edges of the graph corresponding to the clip network, in such a way that weights of connections between percepts and rewarded actions are increased. In addition, updates may involve the addition or deletion of clips, as well as more sophisticated mechanisms such as glow, generalization, etc.; see [20, 31, 40].

  • 12 

    Since $\delta =1$, the speed-up is only possible in, and is achieved for, the quantity ε.
