Introduction

There is considerable evidence that the intrinsic mechanical and optical properties of cells change slightly upon firing of an action potential. Signatures of such changes have been reported as early as 1949 (ref. 1) and have since been studied in a variety of measurement modalities. Action potentials slightly alter the birefringence of cell membranes, on a relative level of 10–100 ppm for a single cell2,3,4,5. This is plausibly explained by a Kerr effect induced by molecular alignment in the electric field, or by changes in membrane thickness. Similar changes occur in light scattering, at a level of 1–1000 ppm for a single cell2,6. These scattering changes are less directly correlated with electrical activity and are presumably linked to motion or swelling of cells. This ‘intrinsic optical signal’7,8,9,10 has been widely employed for the study of networks of neurons, both in cell culture and in vivo in the retina11. Similar changes of transmission or reflection in near-infrared imaging of a living brain12,13,14 have been controversially reported as a ‘fast intrinsic signal’15,16. On the microscopic level, nanometer-scale motion of the cell membrane in response to an action potential has been observed by fiber-optic and piezoelectric sensors17, atomic force microscopy18 and optical interferometry19,20,21,22,23,24. More recently, such motion has been detected in non-interferometric microscope videos by image processing25,26, the technique we intend to advance with the present work.

Intrinsic changes in optical or mechanical properties are of interest for two reasons. First, mechanical motion of cell membranes can be involved in cellular communication, for instance driving synchronization of heart muscle cells27. Second, intrinsic signals could provide access to neural activity. In contrast to existing fluorescent indicators28, a method based on intrinsic signals would be label-free: it would not require genetic engineering and would not suffer from toxicity or photobleaching.

Previous studies have detected and quantified membrane micromotion by very simple schemes, such as manual tracking or subtraction of a static background image. The past decade has seen the emergence of numerous novel approaches to highlight small temporal changes in time series data, detecting for instance gravitational waves in interferometer signals29 and invisible motion in real-life videos30. In the present work, we will study whether these tools can improve detection of cellular micromotion in video recordings of living cells. We focus our study on three of the most common approaches: spectral filtering, matched filtering, and convolutional neural networks (CNNs).

Spectral filtering has long been a standard technique in the audio domain, where it is known as “equalizing”. A time-domain signal is Fourier-transformed into the frequency domain, multiplied by a filter function that highlights or suppresses specific frequency bands, and subsequently transformed back into the time domain. It is equally applicable to video recordings30, where it can detect and amplify otherwise invisible changes, such as the slight variation of skin color induced by blood circulation during a human heartbeat. Similarly, it should be able to detect micromotion of cells.
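As a minimal illustration, spectral filtering of a single time trace can be written in a few lines. The function name and the crude binary band selector below are our own illustrative choices; in practice a smoother filter function would be used:

```python
import numpy as np

def spectral_filter(signal, fs, filter_fn):
    """Equalizing: Fourier-transform, multiply by a frequency-dependent
    filter function, and transform back into the time domain."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return np.fft.irfft(spectrum * filter_fn(freqs), len(signal))

# Crude band selector between 2 and 15 Hz for a 50 fps trace:
# filtered = spectral_filter(trace, fs=50, filter_fn=lambda f: (f > 2) & (f < 15))
```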

Matched filtering can be understood as an extension of spectral filtering. Here, the filter applied in the frequency domain is the Fourier transform of an ideal template signal. Employing this transform as a filter has a convenient interpretation in the time domain: it is a deconvolution of the signal with the template, i.e. a search for occurrences of the template in an unknown time series. Originally developed for radar processing31, the technique has found ubiquitous applications. It is for instance employed to detect and count subthreshold events in gravitational wave detectors29. It has already been applied to the detection of mechanical deformation in videos32 and should equally be applicable to cellular micromotion. It does, however, require a priori knowledge of an “ideal” template signal.

This drawback is overcome by neural networks, which can autonomously learn complex patterns and detect their occurrence in time-series data, images or video recordings. We focus on “convolutional neural networks” (CNNs), a widely employed subclass of networks that can be understood as an extension of matched filtering. In a CNN, an unknown input signal is repeatedly convolved with a set of simple patterns (“filters”) and subjected to a nonlinear “activation function”. Repetition of this process greatly enlarges the range of patterns that can be detected, so that the technique can detect patterns even if they deviate from some fixed ideal signal. The pattern vocabulary of the network is learned in a training procedure, in our case in a “supervised” fashion where the network is optimized to detect known occurrences of a pattern in a separate training dataset. During training, the filters are continuously adapted to improve the detection fidelity. The result is equivalent to repeated application of matched filtering with “learned” filters, interleaved with nonlinear elements. CNNs have been implemented for datasets of various dimensions. One-dimensional CNNs have found use in time-series processing, most prominently speech recognition33, two-dimensional CNNs in image recognition34 and three-dimensional networks in video analysis35.

Methods

All techniques are trained and benchmarked on a dataset recorded as displayed in Fig. 1. A sample of HL-1 cardiac cells (originating from the Claycomb lab36) is recorded in a homebuilt dual-channel video microscope (Fig. 1a). These cells fire spontaneous action potentials every few seconds, which are accompanied by micromotion spanning a wide range of amplitudes (see below). Hence, they provide a convenient testbed to evaluate signal processing. One channel of the microscope performs imaging in transmission mode under strong brightfield illumination (Fig. 1b). We choose a camera with a high frame rate and full-well capacity to reduce photon shot noise and thus enhance sensitivity to small changes in the image; averaging over 10 consecutive frames of a 500 fps recording yields an effective frame rate of 50 fps with a full-well capacity of \(10^{5}\) electrons. Illumination is polarized, and detection is slightly polarization-selective, in order to be sensitive to small changes in cellular birefringence, although we did not find evidence for such a signal in the final data. A second channel records fluorescence of a Ca2+-active dye (Cal-520) staining the cells (Fig. 1b). Ca2+ transients correlating with the generation of action potentials are directly visible in this channel as spikes of fluorescence and serve as a “ground truth” signal for training of the signal processing algorithms. We will identify these transients with individual action potentials37, although our setup lacks electrophysiological means to strictly prove this connection. All following analysis is based on a video recording of 3:30 min length. This dataset is divided into two parts for validation (frames 1 to 3072) and training (frames 3073–10,570) of the processing schemes.
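The frame averaging underlying the effective 50 fps acquisition can be sketched as follows; bin_frames is a hypothetical helper, as the actual acquisition pipeline is not part of this text:

```python
import numpy as np

def bin_frames(video_500fps, n=10):
    """Average n consecutive frames: a 500 fps recording becomes an
    effective 50 fps recording with n times the electrons per frame,
    reducing the relative photon shot noise by sqrt(n)."""
    n_usable = (len(video_500fps) // n) * n
    v = video_500fps[:n_usable].reshape(-1, n, *video_500fps.shape[1:])
    return v.mean(axis=1)
```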

Figure 1
figure 1

Experimental setup and data. (a) Experimental setup. A correlative video microscope records a sample of cells in two channels: light transmitted under brightfield illumination and fluorescence of a Ca-active staining. LP long pass, SP short pass, pol. polarizer. (b) Resulting data. A region of several cells is visible in the transmission channel (scale bar: 20 µm). The same region displays spikes of Ca activity in the fluorescence channel. The fluorescence intensity of the whole region is summed to a time trace, which is employed as ground truth for supervised learning.

All signal processing schemes under study are tasked with the same challenge: to predict fluorescence activity from the transmission signal (Fig. 2a). The algorithms employed are summarized in Fig. 2. We implement spectral filtering by applying a temporal bandpass filter to every pixel of the recording (Fig. 2b). Changes in transmission intensity, as they are produced by motion of the cells, will pass this filter, while both the static background image and fast fluctuating noise are suppressed. The filter parameters have been manually tuned to match the timescale of cellular motion, resulting in a 3 dB passband from 2 to 15 Hz. Processing by this filter serves as an initial stage in all other algorithms, for reasons to be discussed below.
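A minimal sketch of this per-pixel temporal bandpass, assuming the recording is stored as a (frames, height, width) array; the Butterworth design is our illustrative choice, as the text does not specify the filter family:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_video(video, fs=50.0, f_lo=2.0, f_hi=15.0, order=3):
    """Apply a temporal bandpass to every pixel of a video.

    video: float array of shape (n_frames, height, width).
    Returns an array of the same shape; the static background (DC)
    and fast fluctuating noise outside 2-15 Hz are suppressed."""
    sos = butter(order, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    # zero-phase filtering along the time axis avoids phase distortion
    return sosfiltfilt(sos, video, axis=0)
```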

Figure 2
figure 2

Signal processing schemes to detect cellular micromotion. (a) Concept: signal processing is employed to predict fluorescence from micromotion cues in transmission data. (b–d) Schemes processing time-domain data from a single pixel. (e) 3D neural networks processing a whole recording in the temporal and spatial domains.

Matched filtering is implemented as an additional stage of processing, as displayed in Fig. 2c. We generate a pixel-wise template signal by averaging over all 51 action potentials in the training dataset, aligned by triggering on spikes in the fluorescence channel. Because templates are generated pixel by pixel, this scheme should capture arbitrary signal shapes, such as upward and downward excursions of the video signal that might occur on two sides of a moving cell membrane. The prediction is computed by pixel-wise deconvolution of the transmission signal with the template.
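A sketch of this stage, assuming the band-passed single-pixel traces and the spike frame indices from the fluorescence channel are available; the window length and function names are our own, and we realize the matched filter in the common frequency-domain form, multiplying by the conjugate template spectrum (circular correlation):

```python
import numpy as np

def pixel_template(filtered_px, spike_frames, half_width=64):
    """Average the band-passed single-pixel trace over windows
    centred on the known fluorescence spikes (training events)."""
    wins = [filtered_px[t - half_width:t + half_width]
            for t in spike_frames
            if t - half_width >= 0 and t + half_width <= len(filtered_px)]
    return np.mean(wins, axis=0)

def matched_filter(filtered_px, template):
    """Correlate the trace with its template in the frequency
    domain; the output has the same length as the input."""
    n = len(filtered_px)
    T = np.fft.rfft(template, n)   # zero-padded template spectrum
    X = np.fft.rfft(filtered_px, n)
    return np.fft.irfft(X * np.conj(T), n)
```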

One-dimensional convolutional neural networks equally serve as an additional stage of processing downstream of spectral filtering (Fig. 2d). We stack 20 convolutional layers, each with eight filters of three-frame width; the last four layers are dilated with an exponentially increasing rate to capture features extending over long timescales38 (see Table 1 for an exact description). The number of features (filters) is subsequently condensed to three and finally to only one, which provides the output prediction. Padding in the convolutional layers ensures that the temporal length of the data is preserved throughout the entire network, so that the output is a time series of the same size as the input. The network can therefore equally be understood as a cross-encoder architecture translating transmitted light into a prediction of fluorescence. The choice of this network architecture has been motivated by its simplicity, and by encouraging reports on conceptually similar fully convolutional neural networks that have proven successful in classification of electrocardiogram data39. Recurrent neural networks, frequently employed in speech recognition, would be another natural choice. They have not been pursued further in this study because of the widely held belief that they are more difficult to train39.
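A Keras sketch of this architecture; the exact configuration is given in Table 1, so the specific dilation rates and the ReLU activations below are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_1d_cnn(n_frames=None):
    """Sketch of the fully convolutional 1D network: 20 conv layers
    of 8 filters (kernel width 3), the last four dilated with
    exponentially increasing rate, condensed to 3 and then 1 feature.
    'same' padding preserves the temporal length throughout."""
    inp = keras.Input(shape=(n_frames, 1))   # one pixel's band-passed trace
    x = inp
    for i in range(20):
        dilation = 2 ** (i - 15) if i >= 16 else 1   # 2, 4, 8, 16 on last four
        x = layers.Conv1D(8, 3, padding="same",
                          dilation_rate=dilation, activation="relu")(x)
    x = layers.Conv1D(3, 3, padding="same", activation="relu")(x)   # 3 features
    out = layers.Conv1D(1, 3, padding="same")(x)   # predicted fluorescence
    return keras.Model(inp, out)
```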

Table 1 Time-domain 1D convolutional network.

The network is trained to predict the fluorescence intensity summed over the full frame from single-pixel input data. This is of course an impossible challenge for all those pixels which do not contain a trace of cell motion, such as background regions. However, we found that convergence of the network was stable despite this conceptual weakness. Networks have been initialized with Glorot-uniform weights and have been trained for 3 epochs at a learning rate of \(10^{-3}\) without weight decay, using the training data described above and mean-squared error as a loss function. Training has been carried out by the Adam algorithm, which locally adapts the learning rate for every weight and time step. We tried to train on raw data that had not undergone spectral filtering, but did not achieve convergence in this case. Due to the simple architecture, optimization of the hyperparameters is mainly limited to varying the number of layers. Here we found that performance generally improves with network depth, despite the fact that the patterns to be detected are relatively simple. This observation is in line with previous reports on time series classification by CNNs40. We also found dilations to provide a significant gain in performance. This suggests that the network successfully captures temporally extended patterns. We therefore did not venture into adding and optimizing pooling layers, which are frequently used for the same goal29.
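The corresponding training step in minimal form, continuing from the build_1d_cnn sketch above; x_train and y_train are hypothetical names for the band-passed single-pixel traces and the tiled global fluorescence trace, and the batch size is an assumption, as it is not stated:

```python
from tensorflow import keras

model = build_1d_cnn()   # from the sketch above
# Glorot-uniform is the Keras default kernel initializer, so no
# explicit setting is needed; Adam at 1e-3 without weight decay.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="mean_squared_error")
# x_train: (n_pixels, n_frames, 1) band-passed single-pixel traces;
# y_train: the summed fluorescence trace, repeated for every pixel.
model.fit(x_train, y_train, epochs=3, batch_size=32)
```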

As the most elaborate approach we apply three-dimensional neural networks (Fig. 2e). They operate on a 256-frame section of the video recording, predicting a 256-frame fluorescence trace. For 3D networks, limited training data is a major challenge, which we address by two means. First, we employ transfer learning, reusing the one-dimensional network (Fig. 2d) as an initial stage of processing. We terminate this 1D processing before the final layer, providing three output features for every pixel and timestep (see Table 2 for an exact description) that serve as input to a newly trained final layer. Second, we restrict ourselves to region-specific networks that cannot operate on video recordings of another set of cells. This simplifies processing of the spatial degrees of freedom, which is implemented by a simple dense layer connecting every feature in every pixel to one final output neuron computing the prediction. As in the case of matched filters, this stage can learn pixel-specific patterns, such as upward and downward excursions of light intensity during an action potential. The network was trained for 60 epochs by Adam with a learning rate of \(8 \cdot 10^{-7}\) without weight decay. The 1D processing layers were initialized with the 1D network described above (Fig. 2d) but were not held fixed during training. The final layer was initialized with the three final-layer weights of the 1D network divided by the number of pixels.
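A sketch of how such a region-specific 3D network could be assembled from the pretrained 1D model; the reshaping route via TimeDistributed is our own construction under the assumption that cnn_1d was built with a fixed n_frames, and all names are hypothetical:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_3d_net(cnn_1d, h, w, n_frames=256):
    """Region-specific 3D network: the pretrained 1D CNN (without its
    final layer) runs on every pixel trace, then one dense layer maps
    all per-pixel features to the output fluorescence trace."""
    # reuse the 1D network up to (and including) its 3-feature layer
    trunk = keras.Model(cnn_1d.input, cnn_1d.layers[-2].output)
    inp = keras.Input(shape=(n_frames, h, w))   # band-passed video block
    x = layers.Reshape((n_frames, h * w))(inp)  # flatten spatial dims
    x = layers.Permute((2, 1))(x)               # -> (pixels, frames)
    x = layers.Reshape((h * w, n_frames, 1))(x)
    x = layers.TimeDistributed(trunk)(x)        # shared 1D trunk per pixel
    x = layers.Permute((2, 1, 3))(x)            # -> (frames, pixels, 3)
    x = layers.Reshape((n_frames, h * w * 3))(x)
    # in the paper, this dense layer is initialised with the 1D network's
    # final-layer weights divided by the number of pixels
    out = layers.Dense(1)(x)                    # one prediction per frame
    return keras.Model(inp, out)
```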

Table 2 3D network.

All networks have been defined in Keras using the Tensorflow backend, and all training has been performed on Google Cloud.

Results

The performance of all one-dimensional processing schemes (Fig. 2b–d) is compared in Fig. 3. We employ cross-correlation as a score to benchmark how well the predicted fluorescence time trace of a specific pixel matches the global fluorescence intensity. Specifically, we normalize prediction and global fluorescence in a first step by subtracting the temporal mean and dividing by temporal standard deviation, since otherwise correct prediction of a constant background signal would be rated more important than correct prediction of spike signals. We then compute the full cross-correlation of both signals and employ the maximum as a figure of merit. This approach assigns a high score to predictions that are correct except for a constant temporal shift, which would be less detrimental to a human user than errors in detection of spikes.
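In code, the score could be computed as follows; correlation_score is a hypothetical helper operating on equal-length 1D arrays:

```python
import numpy as np

def correlation_score(pred, fluo):
    """Normalise both traces (zero mean, unit standard deviation),
    then take the maximum of their full cross-correlation as the
    figure of merit; a constant temporal shift between prediction
    and fluorescence is thus not penalised."""
    p = (pred - pred.mean()) / pred.std()
    f = (fluo - fluo.mean()) / fluo.std()
    xc = np.correlate(p, f, mode="full") / len(p)
    return xc.max()
```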

Figure 3
figure 3

Performance of time-domain signal processing. (a) Definition of the correlation score. Predicted fluorescence on the single-pixel level is correlated with observed fluorescence summed over the full region. The maximum correlation is used as a score to assess accuracy of the prediction. (b–e) Pixel-wise maps of the correlation score for (b) band-pass filtering, (c) matched filtering, (d) processing by a 1D CNN as defined in Fig. 2 and (e) fluorescence activity (ground truth). (f) Still frame from the transmission channel. Labels denote regions of interest displaying strong motion, weak motion and no visible motion that will serve as test cases in the following analysis (Fig. 4). Scale bar: 20 µm.

All processing schemes detect micromotion correlating with calcium activity, albeit with varying levels of success. The self-learned matched filter (Fig. 3c) does not improve performance over mere spectral filtering (Fig. 3b). This is likely caused by noise in the data: since the template is generated on the single-pixel level, it contains a higher level of noise than a smooth handcrafted filter or the filters of a neural network, which have been trained on a much larger amount of data (all pixels). This error is inherent to our training procedure rather than to matched filtering itself; with a better template, performance would likely improve. 1D neural networks offer a clear gain in performance (Fig. 3d), even in regions where signals are weak (lower right), consistent with the intuition that training on all pixels reduces noise artefacts.

Figure 4 analyzes performance in terms of three regions of interest (Fig. 3f). One region (center) contains cells that display strong beating motion of 600 nm amplitude (measured by visually tracking membrane motion). A second region (lower right) contains motion on a weak level that is barely noticeable to the naked eye, suggesting an amplitude of significantly less than one pixel (i.e. \(\ll\) 300 nm) or motion in an out-of-focus plane. A third “silent” region does not contain any visible motion (upper). Figure 4 shows the 1D predictions, binned over these regions of interest (ROI), as well as the prediction of the 3D network that has been separately trained and tested on each ROI. As in the correlation analysis (Fig. 3), the difference of performance is most striking in the region of weak beating, where neural networks deliver a clear gain in performance. 3D networks perform marginally better than one-dimensional approaches. No approach is able to reveal a meaningful signal in the silent region.

Figure 4
figure 4

Performance of all considered schemes. Signals of 1D predictions (upper three lines) have been summed over the regions of interest marked in Fig. 3. The output of filtering approaches (upper two lines) has been squared to produce unipolar data comparable to fluorescence. All approaches manage to correctly predict fluorescence in the strong beating region. Performance varies in the weak beating region, where neural networks yield a clear gain in accuracy. No approach is able to reveal a meaningful signal in the silent region. Length of the recording is 20 s.

We finally analyze the dense layers of the 3D neural networks by visualizing their weights (Fig. 5), reasoning that these encode “attention” to specific pixels where micromotion leaves a pronounced imprint. Strong weights are placed within a large homogeneous part of the strong beating region (Fig. 5a), presumably a single cell. The boundary of this region is not clearly visible in the raw microscope image, which might be due to the high confluency in this region, or to the cell sitting in a layer other than the image plane. Weights are placed very differently in the weak beating region (Fig. 5b), where attention is mostly drawn to a small area at the border of a cell or nucleus, presumably because small motions produce the most prominent signal change in this place. Interestingly, a similar behavior is observed in the silent region (Fig. 5c). Most strong weights are placed on one membrane, even though the network does not detect a meaningful signal. This might be a sign that the network is overfitting to fluctuations rather than a real signal, since these are stronger at a membrane. It might equally hint towards the existence of a small signal that could be revealed by further training on a more extensive dataset. Besides more extensive training, adding batch normalization and stronger dropout could be options to improve generalization in these more challenging regions.

Figure 5
figure 5

3D neural networks. Weights of the fully connected layer (“2D Dense” in Fig. 2e), connecting three activity maps to the final output neuron. Weights are encoded in color and overlayed onto a still frame of the transmission video. The color scale is adjusted for each region; max: maximum weight occurring in all three layers of one region. (a) Strong beating region. Weights are placed on a confined region, presumably a single cell. (b) Weak beating region. Weights are predominantly placed on the border of one cell or nucleus, where intensity is most heavily affected by membrane motion. (c) Silent region. While no meaningful prediction is obtained, the network does place weights preferentially on the border of one cell or nucleus, hinting towards micromotion.

Discussion

In summary, recent schemes of signal processing can effectively detect and amplify small fluctuations in video recordings, revealing tiny visual cues such as micromotion of cells correlating with calcium activity (and hence, presumably, with action potentials). Neural networks provide a clear gain in performance and flexibility over simpler schemes. They can be efficiently implemented, even with limited training data, if their architecture is sufficiently simplified. We achieve this simplification by one-dimensional processing of single-pixel time series, and by constructing more complex networks from these one-dimensional ancestors via transfer learning. While the schemes of this study successfully predict calcium activity from cellular micromotion for some cells, none of them is truly generic, i.e. able to provide a reliable prediction for any given cell. This is evidenced by the performance in the “silent” region, where no activity is detected despite a clear signal of calcium activity in fluorescence imaging. We foresee several experimental levers to overcome this limitation. Future experiments could improve detection of micromotion by recording at a higher (kHz) frame rate. This promises to reveal signals on a millisecond timescale, which is the timescale of electric activity and of some previously reported micromotion20,21,41. Illumination by coherent light could further enhance the signal of small motions42 and thus reduce the photon load on the cells, and more advanced microscopy schemes such as phase imaging or differential interference contrast could be employed to enhance the signal. The neural network architecture could be improved as well. Recurrent neural networks could be used for 1D processing, although the success of 1D CNNs renders this direction less interesting. Instead, a major goal of our future work will be pushing 3D convolutional networks to a truly generic technique that is no longer restricted to one region of cells. This could be achieved by more complex architectures, such as the use of spatial 2D convolutions within the network rather than a single dense connection in the final layer. Recording orders of magnitude more training data (hours instead of minutes) will be another straightforward experimental improvement that likely needs to be addressed to successfully work with these architectures.

The ultimate performance of all schemes can be estimated from the photon shot noise limit of the acquisition chain. This limit predicts the minimum relative fluctuation of intensity that can be detected in a signal binned over \(N_\text{frames}\) frames in the temporal dimension and a region of \(N_\text{px}\) pixels in the spatial dimension:

$$\frac{\sigma_I}{I} = \frac{1}{\sqrt{N_\text{frames}\, N_\text{px}\, N_\text{FWC}}}$$

Here, \(N_\text{FWC}\) is the full-well capacity of the camera, typically on the order of \(10^{4}\) photons. With current hardware, operating at 10 kHz frame rate, \(N_\text{frames} = 10\) frames could be acquired over a 1 ms long action potential. Tracking the motion of a cell border extending over \(N_\text{px} \approx 10^{2}\) pixels could thus reveal intensity fluctuations of \(\sigma_I / I \approx 3 \cdot 10^{-4}\). Assuming a pixel size of 300 nm, and a modulation of the light intensity by the membrane of 10%, this would correspond to motion of \(\approx 1\,\text{nm}\). Label-free detection of single action potentials, with a reported amplitude of several nanometers24, appears well within reach for future experiments.
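The arithmetic of this estimate can be checked directly, with all values taken from the text:

```python
import numpy as np

# Shot-noise limit: sigma_I / I = 1 / sqrt(N_frames * N_px * N_FWC)
N_frames, N_px, N_FWC = 10, 1e2, 1e4
rel_noise = 1 / np.sqrt(N_frames * N_px * N_FWC)    # ~3.2e-4

# A 10% intensity modulation across a 300 nm pixel converts the
# smallest detectable intensity change into a displacement.
pixel_nm, modulation = 300, 0.10
min_motion_nm = rel_noise / modulation * pixel_nm   # ~0.9 nm
print(f"{rel_noise:.1e} relative intensity -> {min_motion_nm:.1f} nm motion")
```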