Introduction

Recently, artificial intelligence (AI) systems based on advanced machine learning algorithms have attracted a surge of interest for their potential to process the information hidden in large datasets1,2. Wave-based analog implementations of these schemes, exploiting microwave or optical neural networks, promise to revolutionize our ability to perform a large variety of challenging data processing tasks by allowing for power-efficient and fast neuromorphic computing at the speed of light. Indeed, wave-based analog processors work directly in the native domain of an analog signal, processing it while the wave propagates through an engineered artificial structure (metamaterials and metasurfaces)3,4,5,6, as previously established for simple linear operations such as image differentiation, signal integration, and the solution of integro-differential equations7,8,9,10,11,12,13,14,15,16,17. More complex processing tasks, for example image recognition or speech processing, require both nonlinearity and a high degree of interconnection between the elements, requirements that have led to various proposals of neuromorphic processors exploiting optical diffraction, coupled waveguide networks, disordered structures18,19,20,21,22,23,24,25,26,27, or coupled oscillator chains28,29. A particularly vexing challenge, however, is the implementation of nonlinear processing elements. While power-efficient neuromorphic schemes require pronounced nonlinearities of a specific form, optical nonlinearities, such as those of Kerr dielectrics, are typically weak at low intensities and cannot easily be controlled. This leads to sub-optimal systems that must operate at high input powers25,30,31,32. As an alternative, nonlinearities external to the wave-based processor have also been considered, for example by exploiting the intensity dependency of a sensor, which requires an additional electronic interconnection. Unfortunately, relying on such weak and poorly controllable nonlinearities drastically limits the performance of most machine learning schemes, and the relevance of wave-based platforms has so far been largely restricted to the implementation of simple linear matrix projections.

Here, we propose to leverage the physics of wave systems that are periodically modulated in time, the so-called time-Floquet systems33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48, to solve this vexing challenge by implementing a strong, controllable nonlinear entanglement between all the neuron signals. We propose to use a simple, thin, uniform dielectric slab whose refractive index is slowly and weakly modulated in time. With the addition of linear random scattering disorder, we implement very efficient recurrent neural network (RNN) schemes, namely the extreme learning machine (ELM) and reservoir computing (RC). We demonstrate the high accuracy of our Floquet extreme learning machine in challenging computing tasks, from the processing of one-dimensional data (learning nonlinear functions) to challenging multi-dimensional data (e.g., the abalone dataset classification problem). We also demonstrate the flexibility of our scheme, which can be multiplexed to tackle two unrelated classification tasks at the same time, simultaneously sorting COVID-19 X-ray lung images and handwritten digits. Finally, we validate our Floquet RC by predicting the time evolution of a chaotic system over a long time horizon (the Mackey-Glass time-series). The reservoir size of the proposed wave-based reservoir computing system is enhanced by leveraging both spatial and spectral domains, improving the learning performance compared to prior works without imposing additional filters or a larger computational overhead. Such extreme time-Floquet analog learning machines are not only fast, easy to train, power-efficient, and versatile, but also reach an accuracy comparable to that of the best digital schemes.

Results

We consider a particular class of neural networks, known as recurrent neural networks (RNNs). RNNs are ideal for processing intricate data thanks to the cyclic connections between their internal neurons, whose outputs depend on both the current inputs and the previous states of the neurons49. This memory effect allows RNNs to detect recursive relations in the data, which are relevant, for example, to the processing of temporal signals. In digital implementations, however, the heavy internal connectivity matrices involved in the training process make RNNs particularly computationally expensive and complicated50,51,52,53. To address these challenges, a number of alternative computing approaches such as long short-term memory (LSTM)54, echo state networks (ESNs)55, extreme learning machines (ELMs)56,57,58, and reservoir computing (RC)51,52,53,59 have emerged. These schemes are particularly well suited for wave-based implementations, because wave propagation inherently relies on the inertial memory of the medium, which can be enhanced and engineered by leveraging resonant elements or multiple scattering. In addition, wave interference is a particularly efficient way to create a high degree of interconnection between a large set of inputs.

Our time-Floquet neuromorphic processor implements an ELM, schematically shown in Fig. 1a. ELMs, and closely related methods based on random neural networks60 or support vector machines61, are a powerful scheme in which only the last layer of connections is trained (in blue). The fundamental mechanism is the use of the non-trained part of the network, whose layers are represented in gray and red in Fig. 1a, to establish a nonlinear mapping between the initial space of the dataset and a higher-dimensional feature space, where a properly trained classifier performs the separation and classification. In our case, this nonlinear mapping is performed by letting one of the non-trained layers (in red) be weakly modulated in time, at a frequency much lower than that of the signal and with a modulation phase that depends on the input state.

Fig. 1: Wave-based time-Floquet extreme learning machine.

a Schematic of a neural network including a time-Floquet layer made from neurons whose properties are modulated periodically in time, and traditional random layers. Only the last (output) layer is trainable. b Concrete implementation with electromagnetic waves. The input signals \({\zeta }_{n}^{{{{{{\mathrm{in}}}}}}}\) are modulated at ω1 and ω2. The sum of these frequency components forms input signals that are independently radiated into the surrounding space by an array of source antennas (black disks). As the waves propagate in the green region, they encounter a thin dielectric slab whose index of refraction is modulated at the frequency ωm = (ω1 − ω2)/2, as well as five sub-wavelength scatterers randomly located in the domain. The modulation phase depends on the input vector \({\zeta }_{n}^{{{{{{\mathrm{in}}}}}}}(t)\). The gray rectangle represents an absorbing boundary layer. The outputs \({\zeta }_{n}^{{{{{{\mathrm{out}}}}}}}(t)\) are fed into an adaptable dense layer (blue), and used for regression and classification. c Nonlinear phase entanglement. The modulated slab mixes signals at ω1 and ω2 into Floquet harmonics spaced by ωm, whose interferences depend non-linearly on the input vector.

A concrete implementation of this scheme in a wave platform is shown in Fig. 1b. It consists of three parts: (i) an array of monopole antennas that radiates the various components of the input vector into the surrounding medium; (ii) a propagation space composed of a few scatterers and a thin dielectric slab, called a scattering time-modulated slab (STMS), whose index of refraction is weakly modulated in time; and (iii) the output layer, made of an array of receiving antennas and a single dense layer, digitally trained to perform the desired regression or classification tasks. At the input layer, the input vector ζin with components \({\zeta }_{1}^{{{{{{{{\rm{in}}}}}}}}},...,{\zeta }_{N}^{{{{{{{{\rm{in}}}}}}}}}\) is first encoded into N signals \({s}_{i}^{{{{{{{{\rm{in}}}}}}}}}\), injected directly into the source antenna array (see further details in section Details of the proposed wave-based ELM architecture of the Methods). We assume that ζin is modulated at two distinct, close-by frequencies ω1 and ω2, such that:

$${s}_{i}^{{{{{{{{\rm{in}}}}}}}}}={\zeta }_{i}^{{{{{{{{\rm{in}}}}}}}}}\left(\sin ({\omega }_{1}t)+\sin ({\omega }_{2}t)\right).$$
(1)

The permittivity ϵr of the STMS is modulated with a depth δm and a phase ϕ, at a frequency ωm = (ω1 − ω2)/2, so that \({\epsilon }_{r}={\epsilon }_{s}+{\delta }_{m}\cos ({\omega }_{m}t+\phi )\). This choice of modulation frequency allows the two input frequencies to be efficiently mixed at the dominant Floquet harmonic (ω1 + ω2)/2 (see Fig. 1c). As we will now see, the reflection and transmission coefficients of the Floquet harmonics can show a strongly nonlinear dependency on the modulation phase, a key property that we will leverage to make the ELM very efficient.
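As an illustration of the encoding of Eq. (1) and of the slab modulation, the minimal Python sketch below builds the injected signals and the time-varying permittivity; the carrier frequencies, time window, and phase-scaling rule ϕ = γζ̄in are placeholder values consistent with the text, not the exact simulation parameters.

```python
import numpy as np

# Placeholder carrier and modulation frequencies (w_m = (w1 - w2)/2, as in the text)
w1, w2 = 2 * np.pi * 4.10e12, 2 * np.pi * 4.00e12   # two close-by input frequencies
wm = (w1 - w2) / 2                                   # modulation frequency of the slab

t = np.linspace(0, 2e-11, 4000)                      # time axis (arbitrary window)
zeta_in = np.random.rand(10)                         # normalized input vector, N = 10

# Eq. (1): each input component amplitude-modulates the two-tone carrier
s_in = zeta_in[:, None] * (np.sin(w1 * t) + np.sin(w2 * t))   # shape (N, len(t))

# Input-dependent modulation phase (phase entanglement, phi = gamma * mean(zeta_in))
gamma = 2 * np.pi
phi = gamma * zeta_in.mean()

# Time-modulated permittivity of the STMS: eps_r(t) = eps_s + delta_m * cos(wm t + phi)
eps_s, delta_m = 3.0, 0.3
eps_r = eps_s + delta_m * np.cos(wm * t + phi)
```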

To understand how time-Floquet systems can be used to induce a large nonlinear entanglement between the incident and reflected signals, let us consider the toy model of a generic two-port time-Floquet system, where the incident and reflected signals at ports 1 and 2 are represented by their time-varying complex amplitudes a1,2(t) and b1,2(t). This model applies to each plane wave incident on our STMS with transverse wave number k, on which the actual field can be decomposed. Assuming the modulation frequency ωm to be much smaller than the operation frequency ωk62,63, we can neglect dispersive effects and write the following instantaneous relation between the signals at each port62,63,64:

$$\left[\begin{array}{c}{a}_{1}(t)\\ {b}_{1}(t)\end{array}\right]=\tilde{{{\Psi }}}({\omega }_{k},t)\left[\begin{array}{c}{a}_{2}(t)\\ {b}_{2}(t)\end{array}\right],$$
(2)

where \(\tilde{{{\Psi }}}({\omega }_{k},t)\) is the transfer matrix at ωk, which varies slowly with time. Taking the Fourier transform of both sides yields

$$\left[\begin{array}{c}{A}_{1}(\omega )\\ {B}_{1}(\omega )\end{array}\right]= \,\tilde{{{\Psi }}}({\omega }_{k},\omega )* \left[\begin{array}{c}{A}_{2}(\omega )\\ {B}_{2}(\omega )\end{array}\right] \\ = \int \tilde{{{\Psi }}}({\omega }_{k},\omega -\omega ^{\prime} )\left[\begin{array}{c}{A}_{2}(\omega ^{\prime} )\\ {B}_{2}(\omega ^{\prime} )\end{array}\right]d\omega ^{\prime} ,$$
(3)

Since the scattering process into each Floquet harmonic component is linear, we can define the reflection and transmission coefficients into each harmonic as R0(ωk + nωm) = B1(ωk + nωm)/A1(ωk) and T0(ωk + nωm) = A2(ωk + nωm)/A1(ωk). A direct calculation shows that (see Sec. 1 of the Supplementary Material for detailed derivations):

$${R}_{\phi }({\omega }_{k}+n{\omega }_{m})={e}^{in\phi }{R}_{0}({\omega }_{k}+n{\omega }_{m})$$
(4)
$${T}_{\phi }({\omega }_{k}+n{\omega }_{m})={e}^{in\phi }{T}_{0}({\omega }_{k}+n{\omega }_{m}),$$
(5)

where we have used the notation Rϕ to highlight the dependency of the scattering coefficients on the modulation phase ϕ. These equations imply that upon adding a phase delay ϕ to the modulation, the generated frequency harmonic of order n acquires a phase shift of nϕ, both for the forward- and backward-scattered plane waves. The amplitude of the harmonic waves, on the other hand, remains constant when the phase delay is altered.
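This phase property can be checked numerically for the lowest harmonics (n = ±1) with a toy adiabatic model, in which the transfer of the slab is replaced by a simple slow amplitude modulation; the sketch below is such a self-contained check, with all values illustrative and not taken from the full STMS simulation.

```python
import numpy as np

N = 4096                      # samples over one observation window
n = np.arange(N)
k0, m0 = 200, 10              # integer cycle counts: carrier and modulation frequencies
phi = 0.7                     # modulation phase delay to be recovered
delta = 0.1                   # weak modulation depth

# Complex carrier slowly modulated in amplitude: an adiabatic two-tone toy model
carrier = np.exp(1j * 2 * np.pi * k0 * n / N)
modulation = 1 + delta * np.cos(2 * np.pi * m0 * n / N + phi)
y = carrier * modulation

Y = np.fft.fft(y)
phase = lambda k: np.angle(Y[k])

# The n-th Floquet harmonic (here n = +1, -1) acquires a phase shift of n*phi
print(np.round(phase(k0 + m0) - phase(k0), 3))   # ~ +phi
print(np.round(phase(k0 - m0) - phase(k0), 3))   # ~ -phi
```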

Now, consider the superposition of two incident plane waves at frequencies ω1 and ω2. Recalling our choice of modulation frequency, namely ωm = (ω1 − ω2)/2, we can write the reflected and transmitted waves for all Floquet harmonic components of frequency ω1 + nωm = ω2 + mωm by using the superposition principle:

$$\left|{R}_{\phi }^{\prime}\right|=\left|{e}^{in\phi }{R}_{0}({\omega }_{1},{\omega }_{1}+n{\omega }_{m})+{e}^{im\phi }{R}_{0}({\omega }_{2},{\omega }_{2}+m{\omega }_{m})\right|$$
(6)
$$\left|{T}_{\phi }^{\prime}\right|=\left|{e}^{in\phi }{T}_{0}({\omega }_{1},{\omega }_{1}+n{\omega }_{m})+{e}^{im\phi }{T}_{0}({\omega }_{2},{\omega }_{2}+m{\omega }_{m})\right|,$$
(7)

where n and m are the orders of the Floquet harmonics with respect to ω1 and ω2, respectively. A particular example is the harmonic located at the average frequency ω = (ω1 + ω2)/2, for which n = 1 = −m (orange spectrum in Fig. 1c). According to Eqs. 6 and 7, the relation between the modulation phase and the intensity of the scattered harmonic fields is highly nonlinear. In fact, we can control the amplitude of the Floquet harmonics simply by changing the modulation phase. In order to obtain a nonlinear input–output mapping, we must therefore entangle the phase delay with the input vector (i.e., ϕ = f(ζin)). This can be done by using a simple voltage-controlled phase shifter (VCP) (see further details in section Details of the proposed wave-based ELM architecture of the Methods). In other words, the value of the modulation phase is directly determined by the value of the input vector, which is fixed when the system is excited, automatically turning the scattering process into a highly nonlinear function of the input, regardless of the input power. This makes such time-Floquet nonlinear entanglement highly advantageous in neuromorphic computing schemes.
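To make the interference argument of Eq. (7) concrete, the short sketch below evaluates |T′φ| for the central harmonic (n = 1 = −m) as a function of an input-dependent phase ϕ = 2πζin; the two complex transmission coefficients are arbitrary illustrative values, not those of the simulated STMS.

```python
import numpy as np

# Arbitrary illustrative transmission coefficients into the central harmonic
T1 = 0.10 - 0.25j      # stands for T0(w1, w1 + wm)
T2 = 0.15 + 0.10j      # stands for T0(w2, w2 - wm)
n, m = 1, -1           # central harmonic: w = (w1 + w2)/2

zeta_in = np.linspace(0, 1, 200)       # normalized scalar input
phi = 2 * np.pi * zeta_in              # phase entanglement phi = f(zeta_in)

# Eq. (7): interference of the two Floquet pathways
T_central = np.abs(np.exp(1j * n * phi) * T1 + np.exp(1j * m * phi) * T2)

# |T'| oscillates nonlinearly with the input, even though the wave physics is linear
print(T_central.min(), T_central.max())
```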

To exemplify the strong nonlinear response of the proposed system, we plot the amplitude of the transmitted central harmonic (ω = (ω1 + ω2)/2) as a function of several variables, including the phase delay ϕ. The results are displayed in Fig. 2a–d. We fix one of the harmonics and plot \({T}_{\phi }^{\prime}\) versus the modulation phase and the real or imaginary part of the other transmitted harmonic, T(ω1 + ωm) (or T(ω2 − ωm)). As can be seen in Fig. 2a–d, we indeed obtain a complex, nonlinear, semi-sinusoidal dependence of \({T}_{\phi }^{\prime}\) on the modulation phase. The dependency on the real or imaginary part of the other transmitted harmonic is also always nonlinear.

Fig. 2: Nonlinear Floquet entanglement.

a–d Theoretical demonstration of the nonlinear dependency of the intensity \({T}_{\phi }^{\prime}\) of the central Floquet harmonic (ω = (ω1 + ω2)/2) on both the modulation phase ϕ and the real or imaginary part of one of the generated harmonics. The results are based on Eq. 7. The fixed parameters are: a T(ω1 + ωm) = 0.1 − 0.25i and {T(ω2 − ωm)} = 0.1. b T(ω2 − ωm) = 0.1 − 0.25i and {T(ω1 + ωm)} = 0.1. c T(ω1 + ωm) = 0.1 − 0.05i and {T(ω2 − ωm)} = 0.05. d T(ω2 − ωm) = 0.1 − 0.05i and {T(ω1 + ωm)} = 0.05. e, f Numerical demonstration of the linear/nonlinear Floquet entanglement of the central harmonic wave at different readout nodes as a function of input intensity, for a static (e) and an input-dependent (f) phase delay.

Next, we implement the entanglement with the input vector to demonstrate the complex nonlinear behavior of the Floquet system, using a full-wave finite-difference time-domain simulation of the setup of Fig. 1b (see Methods). We compute the intensity of the central harmonic with respect to the input intensity for two different scenarios: a static phase delay and an entangled phase delay. In the first scenario, the phase delay is fixed and does not depend on the input (ϕ = 0), and as shown in Fig. 2e, the harmonic intensities are linear in the input intensities. In the second scenario, the phase delay is a simple linear function of the input (i.e., ϕ = 2πζin). Figure 2f shows the resulting complex nonlinear response of the proposed system. The oscillating nonlinear mapping performed by the proposed system is completely different from any earlier approach. As we will show, it is surprisingly effective in transforming the input data space into a nearly linearly separable output data space.

Note that an alternative approach to reaching such a highly nonlinear input–output mapping is to entangle the input data with the modulation depth instead of the modulation phase. In this case, no phase shifters are needed. Section 2 of the Supplementary Material provides more details about this alternative, including a demonstration of its high performance in transforming the input data space into a nearly linearly separable output data space.

Learning highly nonlinear functions

We now demonstrate the performance of the Floquet ELM, starting with simple regression problems on a dataset generated with nonlinear relations. Such a dataset is often used as a standard benchmark in machine learning, since linear regression of a nonlinear function is impossible without a nonlinear transformation31,56. The input information (ζin) is a set of randomly generated numbers between −π and π, and the corresponding output labels (yi) are generated according to nonlinear functions, namely \({y}_{1}=\alpha \sin (4\pi {\zeta }^{in})(| {\zeta }^{in}| /\pi )\), y2 = rect(ζin) (pulse function), and \({y}_{3}=\sin (\pi {\zeta }^{in})/(\pi {\zeta }^{in})\). We use 1000 randomly generated samples, which lie in [−π, π] to cover the entire characteristic behavior of each function. We map each input value to a vector by multiplying it with a fixed random 1D vector (mask), here of dimension 1 × 10. In this task, we use 10 input and 20 readout nodes. By recording the intensity of the harmonics at the readout nodes for many input values, linear regression is performed on the output data (see Fig. 3a–c). A remarkable learning performance, with a very low root-mean-squared error (RMSE), is obtained for all three nonlinear functions. Interestingly, in the proposed wave-based neural network with a nonlinear time-Floquet layer, the multiple generated harmonic fields can be used to extend the dimension of the nonlinear mapping, and increasing their number improves the accuracy of classification/regression. This tendency is demonstrated in Fig. 3d–f, which plots the RMSE versus the number of considered Floquet harmonics. This mechanism is a clear advantage of the Floquet ELM: by involving a larger number of scattered harmonics, we can reduce the RMSE and enhance the learning accuracy at no additional computational cost. It should be noted that in order to compute the outputs at the decision layer, we simply rescale the linear regression weights, without having to use additional filters (see further details in section Training of readout of the Methods).
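The end-to-end training flow described above can be summarized by the minimal sketch below, in which the physical STMS is replaced by a surrogate nonlinear map standing in for the measured harmonic intensities; only the random input mask and the trainable linear readout follow the text, and the function name floquet_features is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed random layers (never trained): input mask and a complex mixing matrix
# standing in for propagation through the scatterers and the STMS.
mask = rng.uniform(-1, 1, size=10)                        # 1 x 10 random input mask
W_mix = rng.normal(size=(20, 10)) + 1j * rng.normal(size=(20, 10))

def floquet_features(zeta):
    """Hypothetical surrogate for the harmonic intensities measured at the
    20 readout nodes; the real features come from the FDTD-simulated STMS."""
    x = zeta * mask                                       # random input encoding
    phi = 2 * np.pi * np.mean(np.abs(x))                  # input-entangled phase
    a = W_mix @ x                                         # linear wave mixing
    return np.abs(np.exp(1j * phi) * a + np.exp(-1j * phi) * np.conj(a))

# Dataset: learn y3 = sin(pi*zeta)/(pi*zeta) on [-pi, pi]
zetas = rng.uniform(-np.pi, np.pi, size=1000)
y = np.sinc(zetas)                                        # np.sinc(x) = sin(pi x)/(pi x)

H = np.array([floquet_features(z) for z in zetas])        # feature matrix (1000 x 20)

# Train only the last (linear) layer, here by regularized least squares
lam = 1e-6
w_out = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

rmse = np.sqrt(np.mean((H @ w_out - y) ** 2))
print("training RMSE:", rmse)
```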

Fig. 3: Floquet extreme learning for highly nonlinear maps.

a–c Comparison between ground truth and predicted values for three different nonlinear functions: \({y}_{1}=\alpha \sin (4\pi {\zeta }^{in})(| {\zeta }^{in}| /\pi )\), y2 = rect(ζin), and \({y}_{3}=\sin (\pi {\zeta }^{in})/(\pi {\zeta }^{in})\), respectively. d–f The corresponding values of the root-mean-square error (RMSE) upon increasing the number of involved Floquet harmonics at the readout nodes.

We can explain the learning principle of the proposed computing system with well-known kernel methods. Kernel methods use kernels (or basis functions) to map the input data into a feature space. After this mapping, simple models can be trained on the new feature space, instead of the input space, which can result in an increase in the performance of the models65,66. We can describe the projections of input samples in the feature space by \({T}^{\prime}={H}_{{{{{{\mathrm{non}}}}}}}\left(p({\zeta }^{{{{{{\mathrm{in}}}}}}})\right)\), where p is the encoding function, here for example \(p({\zeta }^{{{{{{\mathrm{in}}}}}}})={\zeta }^{{{{{{\mathrm{in}}}}}}}\left({{{{{\mathrm{sin}}}}}}({\omega }_{1}t)+{{{{{\mathrm{sin}}}}}}({\omega }_{2}t)\right)\), and Hnon is a nonlinear and complex function associated with the time-Floquet entanglement. Essentially, Hnon can be seen as an explicit form of optical kernel function that contains both the multiple scattering occurring in the medium and the complex form of the nonlinearity. This kernel contains several polynomial basis functions, {x, x2, x3, x4, . . . }, as can be shown theoretically by Taylor expansion of Eqs. 6 and 7 (see the expansion sketched below). Hence, we have a combination of different orders of polynomial mappings with random coefficients in the feature space for each readout node, so that different features of the input data are transferred into the feature space. The proposed kernel is thus expected to be very efficient in performing all tasks, even when compared with a strongly nonlinear kernel such as the modulus-square operator (x2), typically found in the sensors and detectors used in prior art.
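As an illustration of this polynomial structure, consider the intensity of the central harmonic of Eq. (7) with n = 1 = −m and a linear phase entanglement ϕ = γζin for a scalar input, writing T1 and T2 as shorthand for the two transmission coefficients:

$$\left|{T}_{\phi }^{\prime}\right|^{2}=\left|{e}^{i\phi }{T}_{1}+{e}^{-i\phi }{T}_{2}\right|^{2}=|{T}_{1}|^{2}+|{T}_{2}|^{2}+2|{T}_{1}{T}_{2}|\cos \left(2\gamma {\zeta }^{{{{{{\mathrm{in}}}}}}}+\theta \right),\qquad \theta =\arg ({T}_{1}{T}_{2}^{* }).$$

Taylor-expanding the cosine in ζin then generates the polynomial basis {1, ζin, (ζin)2, (ζin)3, . . . } with fixed coefficients set by T1, T2, and γ, which differ from one readout node to the next.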

To prove this quantitatively, we compare the feature-space projection of our kernel with the most commonly used form of nonlinearity: a square law at the detector, x2. As a reference, we also consider the purely linear case. As an example, we consider again the nonlinear interpolation of y = sinc(x). In order to perform well, the data projected in the feature space should be highly nonlinear with respect to the feature coordinates. To visualize this, we use principal component analysis (PCA) to reduce the dimension of the output data, because it lives in a high-dimensional space (10 dimensions, set by the number of readout nodes). PCA is a linear projection that transforms correlated variables into new variables that are de-correlated from each other. These new variables are called “principal components” or principal axes67. In Fig. 4a, b, we calculate and plot the projected data in a 3D PCA space for three distinct cases: linear, x2 nonlinearity, and time-Floquet entanglement. Panels (a) and (b) show that in both the linear and x2 cases, the projected data lie on a line, whereas in the case of the time-Floquet entanglement, the data follow a highly nonlinear relation with respect to the principal axes. For this reason, the sinc(x) interpolation fails when using either a linear system or one with x2 nonlinearity at the detector (see Fig. 4c, d). Conversely, time-Floquet entanglement performs extremely well on the sinc problem (see Fig. 4e).
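A minimal sketch of this visualization step is given below, using surrogate feature maps (linear, square-law, and a Floquet-like interference map) built from a scalar input rather than the FDTD data, and scikit-learn's PCA as one possible implementation choice; none of these specific arrays are taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
zeta = rng.uniform(-np.pi, np.pi, size=1000)               # scalar inputs
W = rng.normal(size=10) + 1j * rng.normal(size=10)         # fixed random projection

# Surrogate feature maps to 10 readout nodes (illustrative, not the FDTD data)
base = np.outer(zeta, W.real)                              # purely linear system
H_linear = base
H_square = base ** 2                                       # square-law detector
phi = 2 * np.pi * np.abs(zeta)[:, None] / np.pi            # input-entangled phase
H_floquet = np.abs(np.exp(1j * phi) * W
                   + np.exp(-1j * phi) * np.conj(W) * zeta[:, None])

def first_three_pcs(H):
    """Project the readout features onto the first three principal axes."""
    return PCA(n_components=3).fit_transform(H)

pcs = {name: first_three_pcs(H)
       for name, H in (("linear", H_linear), ("x^2", H_square), ("Floquet", H_floquet))}
print({name: p.shape for name, p in pcs.items()})          # each is (1000, 3)
```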

Fig. 4: PCA analysis of the proposed optical kernel.

a, b Two different perspectives of the projected data for three different cases: (i) the linear case, (ii) square-law nonlinearity (x2), and (iii) the proposed nonlinearity. c–e sinc(ζin) interpolation results for the three aforementioned cases, respectively.

Abalone dataset

In the previous section, we used our Floquet ELM to learn nonlinear functions and demonstrated its interpolation capability. However, interpolation is not always the relevant task, especially in complex inference problems. Therefore, we now move to a more challenging multivariable problem: the abalone dataset. This dataset is one of the most commonly used benchmarks in machine learning and concerns the prediction of the age of sea snails from their physical parameters. It lists eight physical features of sea snails that can be used to predict their age. To tackle this problem with our Floquet ELM, we encode the 8 physical features of the sea snails on our input nodes (8 input nodes), and consider 50 readout nodes to feed the decision layer, which performs a linear regression. Figure 5a presents the true ages and the corresponding predictions; the figure indicates that the framework learns the ages of the abalone with remarkable accuracy. For a direct comparison, we plot the predicted values for 75 random input samples (Fig. 5c). The RMSE as a function of the number of harmonic waves is plotted in Fig. 5b. A remarkable accuracy (RMSE = 0.064) can be achieved by considering five generated harmonics. The achieved RMSE is smaller than the best value reported in prior art31.

Fig. 5: Floquet extreme learning for multivariable regression.

a, b Learning of the abalone dataset and corresponding RMSE for different numbers of considered harmonic waves, respectively. c Comparison between predicted data (blue) and ground truth (red).

Parallel image classifications

Another remarkable feature of time-Floquet systems is that, since the inputs are modulated at a certain carrier frequency, we can use several frequency bands and multiplex different signals to classify them simultaneously using the same system, at no additional cost in terms of power consumption. Let us now demonstrate this in a specific, complex parallel classification task. We examine the possibility of performing parallel image classification using inputs at two different wavelengths. We use two distinct datasets: the MNIST dataset of handwritten digits and the COVID-19 X-ray images (see Fig. 6a, b). We resize all of the images to 10 × 10 pixels, down-sampling them to decrease the number of input and readout nodes and the total size of our structure. In this task, we use 100 nodes to encode the images in the amplitude of the input waves, and 100 readout nodes. The MNIST data are encoded onto a (randomly selected) frequency range from 4 to 4.125 THz, and the COVID-19 data are encoded between 4.375 and 4.5 THz (see the red and blue frequency bands in Fig. 6c). In the output layer, we use softmax regression to perform the classifications (see Methods).
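A minimal sketch of the frequency-multiplexing idea is given below: two flattened 10 × 10 images are placed on separate carrier bands (the 4–4.125 THz and 4.375–4.5 THz ranges mentioned in the text) and summed into one multiplexed excitation per input node; the exact carrier positions inside each band and the time window are assumptions for illustration only.

```python
import numpy as np

n_pixels = 100                                     # flattened 10 x 10 images
img_mnist = np.random.rand(n_pixels)               # placeholder MNIST image (normalized)
img_covid = np.random.rand(n_pixels)               # placeholder COVID-19 X-ray image

# Each task gets its own pair of close-by carriers inside its band (band edges
# are from the text; the carrier positions inside each band are assumptions).
w1a, w2a = 2 * np.pi * 4.125e12, 2 * np.pi * 4.000e12     # MNIST band
w1b, w2b = 2 * np.pi * 4.500e12, 2 * np.pi * 4.375e12     # COVID-19 band

t = np.linspace(0, 2e-11, 8192)

# Following Eq. (1) for each band, then summing: both tasks share the same
# antennas and the same time-Floquet slab, at no extra power cost per task.
signals = (img_mnist[:, None] * (np.sin(w1a * t) + np.sin(w2a * t))
           + img_covid[:, None] * (np.sin(w1b * t) + np.sin(w2b * t)))
print(signals.shape)   # (100, 8192): one multiplexed waveform per source antenna
```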

Fig. 6: Floquet extreme learning for parallel image classification.

a, b Examples of realizations taken from the MNIST and COVID-19 X-ray datasets. c Simulated spectra of readout nodes for two different channels. d, e Evolution of the loss function for MNIST and COVID-19 classifications, respectively. f, g Corresponding confusion matrices, for condition classification.

The training results are shown in Fig. 6d–g. The observed test accuracies were 88.2% for the COVID-19 dataset and 85.3% for the MNIST dataset. These classification accuracies are competitive: for example, they are higher than those reported in reference50 for parallel image classification. They are also comparable with other relevant works, despite the reduced pixel count of the images53. In addition, this frequency-multiplexing technique is the first demonstration of wave-based parallel task processing with extreme learning. It enables the use of a wide bandwidth as a computational resource, which significantly boosts computation efficiency.

Nonlinear time-Floquet-based RC system for autonomous forecasting of chaotic time-series

To show the high versatility of the proposed nonlinear time-Floquet neuromorphic computing system, we slightly modify it to implement a reservoir computing (RC) scheme. Consider an input vector i(t) that is injected into a high-dimensional dynamical system called the reservoir. The reservoir is described by a vector h(t), and its initial state is defined randomly. Let the Wres matrix define the internal connections of the reservoir nodes and the Win matrix define the connections between the input and the reservoir nodes. Both matrices are initialized randomly and kept fixed during the whole RC training process. The state of each reservoir node is a scalar h(t), which evolves according to the following recursive relation:

$$h(t+\tau )=F\left({w}_{{{{{{\mathrm{in}}}}}}}i(t)+{w}_{{{{{{\mathrm{res}}}}}}}h(t)\right)$$
(8)

where τ is the discrete time-step of the input and F is a nonlinear function. From Eq. 8, we see that the reservoir is a dynamical system endowed with a unique memory property: each subsequent state of the reservoir contains some information about its previous states and about the inputs injected up to that time. In the training phase, the input i(t) is fed to the reservoir, and the corresponding reservoir states are recursively calculated. The final step of the information processing is to perform a simple linear regression that adjusts the Wout weights so as to minimize the RMSE. The output can then be computed as O(t) = Wouth(t). It should be noted that the output weights are the only parameters that are modified during training. The input and reservoir weights are fixed throughout the whole computational process, and they are used to randomly project the input into a high-dimensional space, which increases the linear separability of the inputs.
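For reference, a generic software reservoir following Eq. (8), with a tanh nonlinearity standing in for the physical map F and a ridge-regression readout, can be sketched as follows; this is a conventional echo-state-network style baseline (spectral-radius scaling and tanh are standard choices), not the wave-based implementation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, n_in = 100, 1                      # reservoir and input dimensions

# Random, fixed connection matrices (never trained)
W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W_res = rng.normal(size=(n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # scale spectral radius below 1

def run_reservoir(inputs):
    """Iterate Eq. (8) with F = tanh and collect the reservoir states."""
    h = np.zeros(n_res)
    states = []
    for u in inputs:
        h = np.tanh(W_in @ np.atleast_1d(u) + W_res @ h)
        states.append(h.copy())
    return np.array(states)

def train_readout(states, targets, lam=1e-6):
    """Ridge regression for the only trainable weights, W_out."""
    return np.linalg.solve(states.T @ states + lam * np.eye(n_res),
                           states.T @ targets)
```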

In our concrete scheme, we implement this memory using a feedback loop, and use the intensity of the harmonic waves as reservoir states. Reservoir computing in our scheme can be described by the following recursive relation:

$${T}_{\phi }^{\prime}(t+\tau )=F\left({w}_{{{{{{\mathrm{in}}}}}}}i(t)+{w}_{{{{{{\mathrm{res}}}}}}}{v}_{h}{T}_{\phi }^{\prime}(t)\right)$$
(9)

where F is the nonlinear function describing our system, vh is a tunable parameter that selects one (or more) harmonics as reservoir states, and \({T}_{\phi }^{\prime}\) is the intensity of the transmitted harmonic waves. In general, RC and its different implementations have proven very successful for various tasks, such as spoken-digit recognition, the temporal exclusive-OR task, and Mackey-Glass or Nonlinear Autoregressive Moving Average time-series prediction68,69.

We use the nonlinear time-Floquet RC for the prediction of chaotic time-series. Forecasting chaotic time-series is an extremely difficult task due to the accumulation of quantitative differences between the ground truth and the predicted value in subsequent predictions, which leads to exponentially growing errors at large times. Indeed, the positive Lyapunov exponent of chaotic systems leads to the exponential separation of close trajectories, so that even small prediction errors can quickly make the prediction diverge from the ground truth49. We test our system on the Mackey-Glass time-series defined by49,70

$$\frac{dy}{dt}=\beta \frac{y(t-\tau )}{1+{\left(y(t-\tau )\right)}^{n}}-\gamma y(t)$$
(10)

Although this delay differential equation is deterministic, predicting the resulting time-series for specific parameter values is difficult, which is why it is widely used as a benchmark for challenging forecasting tasks. To obtain chaotic dynamics, we set the parameters β = 0.2, γ = 0.1, τ = 18, n = 10. During the training phase, as soon as the reservoir states are calculated, a simple linear regression is executed to adjust the Wout weights such that their linear combination with the calculated reservoir states brings the actual output as close as possible to the next time-step of the input. Finally, to autonomously predict the future evolution of i(t), we create a feedback loop from the output to the input by replacing the next input i(t + 1) with the one-step prediction Wouto(t), as is done in conventional RC. The ability of the proposed RC system to predict time-series is tested using a reservoir with 100 input nodes and 50 readout nodes. We consider the middle harmonic as the reservoir state and use the input, \({\zeta }_{n}^{{{{{{\mathrm{in}}}}}}}\), to feed our RC system at each iteration (Eq. 9). All of the harmonic intensities and reservoir states are then applied to the readout layer (see Methods) to generate the predicted data for the next time-step. Figure 7 shows the training results obtained from the simulation. Excellent agreement between the target and the predicted value is obtained, indicating that the trained readout weights can correctly calculate the next time-step signal on the basis of the internal states of the reservoir. Further evidence of successful training can be found by examining the network performance in regression and phase space, as shown in Fig. 7b, c, where an excellent agreement can again be observed.
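The Mackey-Glass series used here can be generated numerically, for instance with a simple forward-Euler integration of the delay differential equation of Eq. (10) with the stated parameters (β = 0.2, γ = 0.1, τ = 18, n = 10); the step size and constant initial history below are illustrative choices, not values from the paper.

```python
import numpy as np

def mackey_glass(n_steps, beta=0.2, gamma=0.1, tau=18, n=10, dt=1.0, y0=1.2):
    """Forward-Euler integration of Eq. (10):
    dy/dt = beta * y(t - tau) / (1 + y(t - tau)**n) - gamma * y(t)."""
    delay = int(round(tau / dt))
    y = np.full(n_steps + delay, y0)          # constant initial history on [-tau, 0]
    for k in range(delay, n_steps + delay - 1):
        y_tau = y[k - delay]
        y[k + 1] = y[k] + dt * (beta * y_tau / (1 + y_tau ** n) - gamma * y[k])
    return y[delay:]                          # discard the initial history

series = mackey_glass(1000)
print(series.min(), series.max())             # irregular, chaotic-looking oscillations
```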

Fig. 7: Floquet reservoir computing for forecasting the chaotic Mackey-Glass time-series.

a Training results: The ground truth (blue) and the predicted output from the RC system (red) are plotted. b Corresponding results of linear regression. c Trace of time-series values in phase space, with respect to the previous time-step for both ground truth and predicted values.

The network is then used to forecast the time-series autonomously. After training for 400 time-steps, the output of the readout function, that is, the predicted data for the next time-step, is connected back to the reservoir as the new input, and the system autonomously produces the forecasted time-series. Figure 8 shows the results of autonomous time-series prediction using the proposed RC system. The autonomously generated output (from the 400th time-step onwards) still matches the ground truth very well, showing the ability of the proposed RC system to autonomously forecast the chaotic system. After more than 70 time-steps of autonomous prediction, the predicted signal starts to diverge from the correct value, which is unavoidable given the chaotic nature of the series. Increasing the size of the reservoir further, by using more nodes and more previous states, may reduce the prediction error, so that the length of accurate prediction can be increased. Another solution for long-term forecasting without increasing the dimension of the system is to use a periodic update procedure as in ref. 49. In section 3 of the Supplementary Material, we compare the computing performance of the proposed system with prior works for all tasks.
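Building on the generic software reservoir sketched earlier, the closed-loop (autonomous) phase amounts to a short feedback iteration in which each one-step prediction replaces the next input; W_in, W_res, and w_out here refer to the hypothetical matrices of that earlier sketch, not to the physical system.

```python
import numpy as np

def autonomous_forecast(h, w_out, W_in, W_res, n_steps):
    """Closed-loop rollout: feed each one-step prediction back as the next input."""
    preds = []
    for _ in range(n_steps):
        u = w_out @ h                                        # one-step prediction
        h = np.tanh(W_in @ np.atleast_1d(u) + W_res @ h)     # Eq. (8) with the fed-back input
        preds.append(u)
    return np.array(preds)
```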

Fig. 8: Autonomous forecasting of Mackey-Glass time-series.

Training and forecasting results: a The ground truth (blue) and the predicted output from the RC system (red) for the next 100 time-steps are plotted. b Trace of time-series values in phase space, with respect to the previous time-step, for the prediction phase.

In conclusion, we have shown how nonlinear Floquet entanglement can be used to enable wave-based neuromorphic computing, by allowing for a strong and tailored nonlinear mapping to a higher-dimensional space without involving any nonlinear material. Our nonlinear time-Floquet learning machine can process information to compute complex tasks that are traditionally only tackled by slower, more sophisticated digital deep neural networks. In our benchmarks, the proposed computing platform performs as well as its digital counterparts. With better energy efficiency than previous proposals and a path to high scalability, our nonlinear time-Floquet system provides a unique route toward supercomputer-level optical computation.

Methods

Numerical simulations

We use a two-dimensional finite-difference time-domain (FDTD) method for all simulations71,72. Figure 9 shows the rectangular layout of the employed setup. We set the parameters ϵs = 3, δm = 0.3, and ωm = (ω1 − ω2)/2. Furthermore, the height and width of the propagation space, as well as the thickness of the STMS, are set to 15λ0, 10λ0, and λ0/4, respectively. We use five high-permittivity dielectric sub-wavelength scatterers, randomly located in the propagation region. The time- and space-discretization steps are set to dt = dx,y/(2c) and dx,y = λ0/30, respectively (c is the speed of light). We use 10,000 time-steps to ensure convergence.
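A heavily simplified sketch of such a 2D FDTD (TMz) update with a time-modulated slab is given below; absorbing boundaries, the antenna array, and the random scatterers are omitted, the modulation frequency ωm and source placement are illustrative assumptions, and the slowly varying permittivity is folded directly into the Ez update coefficient as a quasi-static approximation (a rigorous treatment would update D and divide by ε(t)).

```python
import numpy as np

c0, eps0, mu0 = 3e8, 8.854e-12, 4e-7 * np.pi
lam0 = c0 / 4e12                       # operating wavelength (~4 THz carrier)
dx = lam0 / 30                         # spatial step, as in the Methods
dt = dx / (2 * c0)                     # time step, as in the Methods

nx, ny, n_steps = 300, 450, 10000      # 10*lam0 x 15*lam0 domain, 10,000 steps
w0 = 2 * np.pi * 4e12
wm, delta_m, eps_s, phi = 0.01 * w0, 0.3, 3.0, 0.0   # slow, weak modulation (wm assumed)

Ez = np.zeros((nx, ny)); Hx = np.zeros((nx, ny)); Hy = np.zeros((nx, ny))
slab = np.zeros((nx, ny), dtype=bool)
slab[:, 220:228] = True                # thin slab, ~lam0/4 thick

for step in range(n_steps):
    t = step * dt
    eps_r = np.ones((nx, ny))
    eps_r[slab] = eps_s + delta_m * np.cos(wm * t + phi)

    # H-field updates (central differences, dy = dx)
    Hx[:, :-1] -= dt / (mu0 * dx) * (Ez[:, 1:] - Ez[:, :-1])
    Hy[:-1, :] += dt / (mu0 * dx) * (Ez[1:, :] - Ez[:-1, :])

    # E-field update with the instantaneous permittivity
    Ez[1:-1, 1:-1] += dt / (eps0 * eps_r[1:-1, 1:-1] * dx) * (
        (Hy[1:-1, 1:-1] - Hy[:-2, 1:-1]) - (Hx[1:-1, 1:-1] - Hx[1:-1, :-2]))

    # Soft source: a two-tone excitation at one (assumed) antenna location
    Ez[150, 50] += np.sin(w0 * t) + np.sin(0.98 * w0 * t)
```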

Fig. 9: A detailed schematic of the proposed wave-based ELM architecture.

On the left, the input side includes a signal generator (SG) with variable attenuators (VOA) that encode the information. A voltage-controlled phase shifter (VCP) is used to change the phase of the modulation depending on the applied voltage, which is fixed for each input and does not change in time. AA denotes the array of source antennas. Vp is the voltage required to control the VCP (\({V}_{p}=\gamma {\bar{\zeta }}^{{{{{{\mathrm{in}}}}}}}\)), where \({\bar{\zeta }}^{{{{{{\mathrm{in}}}}}}}\) and γ are the mean of the input vector and a scaling factor, respectively.

Details of the proposed wave-based ELM architecture

We encode the input information with simple circuit elements, namely variable attenuators and a voltage-controlled phase shifter. These devices are set externally to a certain operating point once the input is selected, and do not change as the neural network processes a given input. Therefore, no dynamic tuning, or conversion between amplitude and phase, is needed between the temporal signal sent by the generator and the modulation signal: just like in standard learning machines, they are simply set externally once an input is selected to be processed by the system. More details can be found in Fig. 9. The amplitude-coding scheme that we employ is simple and physically feasible using variable attenuators (VOA). Since the values of the input vector \({\zeta }_{n}^{{{{{{\mathrm{in}}}}}}}\) are normalized between zero and one, the input information can be encoded in the amplitude of the temporal signal generated by a single signal generator (SG), as illustrated in Fig. 9. The variable attenuators provide the input node amplitudes depending on the external voltages applied. In addition, a lower-frequency oscillator with a voltage-controlled phase shifter (VCP) is used to drive the modulation of the time-Floquet layer. VCPs are tunable, and the applied voltage is simply calculated by the external user from the input vector. A discussion of realistic physical platforms for realizing the time-Floquet layer is provided in section 4 of the Supplementary Material.

Training of readout

Here, we show how to train the decision layer using the data of the temporal signals received at the readout nodes, without using extra filtering operations. Consider \({\zeta }_{q}^{{{{{{\mathrm{out}}}}}}}(t)\) as the temporal signal received at output antenna q, and the discrete Fourier transform of the discretized signal \({\bar{\zeta }}_{q}^{{{{{{\mathrm{out}}}}}}}={\left({\zeta }_{q}^{{{{{{\mathrm{out}}}}}}}(0),{\zeta }_{q}^{{{{{{\mathrm{out}}}}}}}({t}_{0}),...,{\zeta }_{q}^{{{{{{\mathrm{out}}}}}}}((n-1){t}_{0})\right)}^{T}\) defined by \({y}_{{f}_{j}}=\mathop{\sum }\nolimits_{k = 0}^{n-1}{\zeta }_{q}^{{{{{{\mathrm{out}}}}}}}({t}_{k}){{{{{\mathrm{exp}}}}}}(\frac{-2\pi i}{n}jk)\), where j = 0, . . . , n − 1, ( ⋅ )T is the transpose operation, and t0 is the sampling time. The relation between the Fourier coefficients \({y}_{{f}_{j}}\) and the discretized signal is described by the so-called Vandermonde matrix:

$$\left(\begin{array}{c}{y}_{{f}_{0}}\\ {y}_{{f}_{1}}\\ \vdots \\ {y}_{{f}_{n-1}}\end{array}\right)=\left(\begin{array}{cccc}1&1&\cdots &1\\ 1&\kappa &\cdots &{\kappa }^{n-1}\\ \vdots &\vdots &\ddots &\vdots \\ 1&{\kappa }^{n-1}&\cdots &{\kappa }^{{(n-1)}^{2}}\end{array}\right)\left(\begin{array}{c}{\zeta }_{q}^{{{{{{\mathrm{out}}}}}}}(0)\\ {\zeta }_{q}^{{{{{{\mathrm{out}}}}}}}({t}_{0})\\ \vdots \\ {\zeta }_{q}^{{{{{{\mathrm{out}}}}}}}((n-1){t}_{0})\end{array}\right)$$
(11)

where \(\kappa ={e}^{\frac{-2\pi i}{n}}\). Generally, we are only interested in the middle harmonic, whose component for antenna q can be calculated by the scalar product \({y}_{{f}_{j}}^{q}={F}_{j}{\bar{\zeta }}_{q}^{{{{{{\mathrm{out}}}}}}}\), where \({F}_{j}=\left(1,{\kappa }^{j},{\kappa }^{2j},...,{\kappa }^{(n-1)j}\right)\) and j corresponds to the desired harmonic. In order to train the readout function by linear regression (which is commonly defined by a linear matrix operation of the form Y = WXT, where W is the weight matrix), we simply compose both operations, multiplying the Fj and weight matrices:

$$Y=W{({F}_{j}X)}^{T}$$
(12)

where \(X=\left({\bar{\zeta }}_{1}^{{{{{{\mathrm{out}}}}}}},{\bar{\zeta }}_{2}^{{{{{{\mathrm{out}}}}}}},...,{\bar{\zeta }}_{q}^{{{{{{\mathrm{out}}}}}}}\right)\). Clearly, just like in a regular extreme learning machine (Y = WXT), the output of the Floquet extreme learning scheme involves a simple multiplication of matrices, without sensitive or complex filters. This can also be viewed as a mere rescaling of the weight matrix of the digital layer.
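A compact sketch of this readout, assuming the raw time traces of the readout antennas are stored row-wise in a matrix X of shape (n_antennas, n_samples), could read as follows; the harmonic index j, the use of the harmonic intensity as feature, and the regularized least-squares solver are illustrative choices.

```python
import numpy as np

def harmonic_row(n, j):
    """Row F_j = (1, kappa^j, kappa^(2j), ..., kappa^((n-1)j)) of the Vandermonde/DFT matrix."""
    kappa = np.exp(-2 * np.pi * 1j / n)
    return kappa ** (j * np.arange(n))

def readout_features(X, j):
    """Project each antenna's time trace onto the desired Floquet harmonic (Eq. 11)."""
    Fj = harmonic_row(X.shape[1], j)
    return np.abs(X @ Fj)                 # one harmonic intensity per antenna

# Training: a single linear layer on top of the harmonic features (Eq. 12),
# so the DFT row can simply be absorbed into a rescaled weight matrix.
def train_weights(feature_list, targets, lam=1e-6):
    H = np.vstack(feature_list)           # (n_examples, n_antennas)
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ targets)
```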

The relationship between ϕ and ζin is set to \(\phi =\gamma {\bar{\zeta }}^{{{{{{\mathrm{in}}}}}}}\), where \({\bar{\zeta }}^{{{{{{\mathrm{in}}}}}}}\) and γ are the mean of the input vector and a scaling factor, respectively. The value of γ is 1 for learning nonlinear functions and 2π for all other tasks.

For learning nonlinear functions, the abalone dataset, and forecasting chaotic time-series, we used a supervised learning algorithm, linear regression, to train the readout function. The predicted output is compared with the ground truth, and the error is calculated and used to update the weights in the readout network following the linear regression learning rule.

To train the readout network for the classification task (parallel image processing), we used the Python toolkit Keras, which provides a high-level application programming interface to access TensorFlow. A supervised learning algorithm, softmax regression, was used to train the readout network. A softmax function is used as the activation function of the readout network to calculate the probabilities corresponding to the different possible outputs. The cost is calculated using the categorical cross-entropy. A standard gradient-based optimization method is used to minimize the cost function and train the output network. There are several ways of converting images into one-dimensional representations; for simplicity, we used a flattened version of the downsampled images as the input vector.
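A minimal Keras sketch of such a softmax readout, assuming the harmonic-intensity features and one-hot labels of one channel have already been extracted into NumPy arrays, could look as follows; the layer sizes, optimizer, and placeholder data are illustrative and not the exact configuration used in the paper.

```python
import numpy as np
from tensorflow import keras

n_readout, n_classes = 100, 10          # e.g., MNIST channel: 100 readout nodes, 10 digits

# Single trainable dense layer with a softmax activation (the readout network)
model = keras.Sequential([
    keras.layers.Input(shape=(n_readout,)),
    keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Placeholder data standing in for the measured harmonic intensities and labels
features = np.random.rand(1000, n_readout)
labels = keras.utils.to_categorical(np.random.randint(n_classes, size=1000), n_classes)

model.fit(features, labels, epochs=20, batch_size=32, validation_split=0.2)
```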