1 Introduction

For many years, efficient software implementations of cryptographic algorithms for constrained embedded processors were mainly restricted to symmetric ciphers. In recent years, however, various libraries for elliptic curve cryptography (ECC) have been published that achieve acceptable runtime and code size even on microcontrollers with very limited computational resources, e.g., the 8-bit AVR ATmega series of processors. Notable examples of these ECC implementations are summarized in Table 1.

Table 1. Overview of ECC implementations for embedded AVR processors.

Because an adversary often has physical access to an embedded device performing ECC operations, implementation attacks, and in particular side-channel analysis (SCA), are severe threats in this scenario. Consequently, several libraries comprise countermeasures against SCA, for example, performing computations in constant time or using randomized projective coordinates. The protected implementations are further detailed in Table 1.

Many common SCA countermeasures assume that the adversary needs access to multiple traces (with identical scalar) to recover the secret key, which inherently protects protocols with ephemeral scalars. In this paper, we challenge this assumption and target fundamental building blocks of any ECC implementation, namely conditional moves and loads/stores from/to secret memory addresses. We show that, in both cases, template attacks allow recovering most of the secret scalar from a single trace of elliptic-curve scalar multiplication (ECSM), which in turn renders all currently published ECC implementations for the AVR (and likely other, similar architectures) insecure.

Note that although this paper focuses on implementations of ECC, our attacks also apply to exponentiation algorithms as used in, e.g., RSA, classical Diffie-Hellman, DSA, or ElGamal. We actually expect the attacks to work even better there, because group elements are larger and thus require more loads (or conditional moves). We leave this investigation for future work.

Related work. Carefully combining countermeasures like uniformity of modular operations, (re-)randomization of the projective representation of points, scalar blinding, point blinding, and random field (or curve) isomorphisms prevents classical side-channel attacks like timing [38], SPA [20], DPA [39], CPA [11], or collision attacks [25, 31]. These attacks require a fixed scalar across multiple measured power or electromagnetic traces. The main protection relies on the full randomization of intermediate data, including the input point, the scalar, and the group, during the execution of an ECSM [4, 19, 24]. In this work we consider implementations based on the Montgomery ladder algorithm, protected by scalar randomization (SR) and projective-coordinate randomization.

To overcome the aforementioned countermeasures, two kinds of attacks have emerged: template and horizontal attacks. Although in general template attacks [14] can be used to attack multiple traces that share the same scalar, we need to attack each ECSM trace independently because of the SR. Template attacks combine statistical modeling and power analysis, and consist of two phases. In the first phase, called profiling, the attacker builds templates by executing a sequence of instructions using a fixed scalar (with SR turned off). The second phase is called matching, in which the attacker matches the templates to single attacked traces (with SR turned on). The assumption is that the attacker possesses a profiling device for building templates that behaves the same as the target device and runs the same implementation.

Template attacks on ECC trace back to an attack on ECDSA demonstrated by Medwed and Oswald [44]. However, this attack requires an offline DPA on the ECSM during profiling in order to select the points of interest. Moreover, since the attack exploits data-dependent leakage, it requires profiling with multiple templates (i.e., 33), whereas two templates suffice for our attacks. Furthermore, the attack only needs to recover a few bits of multiple ephemeral scalars and can then employ ECDSA-specific lattice techniques to recover the long-term secret key [10]. This is not possible in the context of our work, since we do not target ECDSA: an attacker has only a single trace from which to recover, using SCA, sufficiently many bits of the randomized scalar to be able to compute the remaining bits.

Another template attack on ECC is presented in [30]. This attack follows a similar approach to ours, but instead of exploiting address-dependent leakage, it exploits register-location-based leakage using a high-resolution inductive EM probe. As a result, the attack is considerably more expensive to execute. A template attack on a wNAF ECC algorithm is presented in [61]. However, this attack is applied to an implementation that is protected with neither scalar randomization nor base-point randomization. Another approach to attacking ECC are the so-called online template attacks [5, 22]. These attacks work if SR is enabled, but not when point randomization is enabled.

The template attack from [16] targets load instructions. However, multiple traces are required in the attack phase; therefore, this attack does not work against implementations protected by SR. The template attack from [28] aims to extract a random multiplicative mask (base blinding) from a single measurement by exploiting data leakage; it is then possible to unmask all intermediate values and run DPA.

Horizontal attacks on RSA [6, 8, 9, 15, 17, 18, 29, 54, 55, 57] and ECC [7, 27] are emerging forms of side-channel attacks on exponentiation-based or scalar-multiplication-based algorithms. Their methodology allows recovering the exponent bits through the analysis of individual traces. Therefore, these attacks are effective against SR even when combined with point and group randomization. The attacks employ different common distinguishers: SPA, horizontal correlation analysis [18], Euclidean distance [57], horizontal collision-correlation [6, 7, 8, 17], horizontal cross-correlation [27], or clustering [29, 55].

An interesting horizontal address-based DPA attack on Montgomery multiplications is presented in [15]. The approach is similar to ours, but that attack exploits Hamming-weight leakage of addresses. Furthermore, the analysis in [15] lacks results for a full modular exponentiation (only a few iterations are attacked) and does not report success rates.

The main issue of horizontal attacks is that extracting leakage from a single unlabeled trace is usually heavily limited by noise. Therefore, we decided to attack our state-of-the-art implementations, which contain scalar and point randomization, using a more powerful attack paradigm from the point of view of the attacker setting, namely template attacks.

Contributions. The main contributions of this paper are threefold:

  1. First, by the example of a protected version of \(\mu \)NaCl, we show that the single-trace leakage of conditional moves within the Montgomery ladder can be exploited to recover the scalar.

  2. Second, we show that a similar attack applies to loads and stores from/to secret-dependent addresses. In doing so, we show that even implementations on embedded devices without cache cannot tolerate secret-dependent memory accesses.

  3. Finally, we generalize the method from [26] to tolerate a certain number of incorrectly recovered scalar bits without relying on normal or side-channel-enhanced exhaustive search. Furthermore, we present experimental results for our algorithm.

Organization of the paper. The remainder of this paper is structured as follows: in Sect. 2, we review the use of conditional moves in scalar-multiplication algorithms, together with possible countermeasures against side-channel analysis. Then, in Sect. 3, we describe the measurement setup and target implementations used for the attacks presented subsequently: while Sect. 4 deals with template attacks on the (arithmetic) conditional swap within the Montgomery ladder, Sect. 5 applies similar methods to recover the scalar by exploiting the leakage of secret load addresses. Section 6 discusses how to tolerate a certain number of incorrectly recovered scalar bits more efficiently than by simple exhaustive search. Finally, we conclude in Sect. 7 with directions for future work, in particular regarding countermeasures.

2 Scalar Multiplication and Conditional Moves

The most basic scalar-multiplication algorithm is the double-and-add algorithm, which scans through the bits of the scalar and performs a double operation for each zero bit and a double-and-add operation for each one bit. This algorithm is well known to be vulnerable to all kinds of side-channel attacks, including power analysis and timing attacks; a sketch making the secret-dependent branch explicit is given below.
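To make the scalar dependence explicit, consider the following minimal Python sketch (our illustration, not taken from any of the cited libraries; plain integers stand in for curve points, so doubling is multiplication by 2 and addition is integer addition):

```python
# Minimal sketch of left-to-right double-and-add; integers stand in for points.
def point_double(p):
    return 2 * p

def point_add(p, q):
    return p + q

def double_and_add(k, p):
    r = 0  # neutral element
    for i in reversed(range(k.bit_length())):
        r = point_double(r)
        if (k >> i) & 1:  # secret-dependent branch: leaks via timing/SPA
            r = point_add(r, p)
    return r

assert double_and_add(25, 3) == 25 * 3
```

The branch on each scalar bit is exactly what leaks: a one bit costs a double and an add, a zero bit only a double.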

The first step towards side-channel protection is to always perform the same sequence of finite-field operations, independent of the scalar. The most common approaches to achieve such a structure are either (fixed-window) double-and-add-always scalar multiplication or ladder-based approaches (typically the Montgomery ladder [45] or, for general Weierstrass curves, the Brier-Joye ladder [12]). Another layer of side-channel protection then adds randomization of the scalar (through one of various blinding methods) and of the internal representation of points (for example, through projective randomization, field isomorphisms, or curve isomorphisms). By re-randomizing before or after each ECSM loop iteration, most horizontal collision or cross-correlation attacks are thwarted.

Interestingly, even with all those countermeasures in place, scalar-multiplication algorithms contain operations that choose one out of two (or more) curve points depending on bit(s) of the scalar. An attacker who learns all of these choices from the side-channel information of just one trace learns all of the scalar bits used in this scalar multiplication and thus obtains the secret key. On microcontrollers with restricted register space, there are essentially two different ways to implement this conditional move (cmov): either by loading from (or storing to) addresses that depend on the secret scalar, or by using arithmetic operations to perform a conditional register-to-register move. The latter approach is very common on large processors with cache, where the former approach leaks through cache-timing information. Essentially, the idea is to replace a computation of the form \(R \leftarrow P[s]\), where s is a secret scalar bit, by a computation of the form \(R \leftarrow sP[1] + (1-s)P[0]\). Note that this approach does not require actual multiplications; it is much easier to expand s to a bit mask of all ones or all zeros and use bit-logical instructions, as the following sketch illustrates.
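As a concrete illustration, here is a hedged Python sketch of such a mask-based selection on 8-bit words (the function name cmov and the word size are our assumptions; real implementations operate limb-wise on complete field elements):

```python
def cmov(p0, p1, s):
    # Expand the secret bit s into an all-zeros/all-ones mask and select with
    # bit-logical operations only: no branch, no secret-dependent address.
    m = (-s) & 0xFF                  # s = 0 -> 0x00, s = 1 -> 0xFF
    return (p1 & m) | (p0 & ~m & 0xFF)

assert cmov(0xAA, 0x55, 0) == 0xAA   # R <- P[0]
assert cmov(0xAA, 0x55, 1) == 0x55   # R <- P[1]
```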

Most implementations of ECSM contain considerably more than just one secretly-indexed load, store, or conditional move. Sometimes this is a choice made by the implementors to improve performance (by avoiding otherwise unnecessary loads and stores); sometimes it is an inherent property of the ECSM algorithm. For example, the Montgomery ladder needs a conditional swap (cswap) of two points instead of a conditional move, which requires significantly more operations that involve the secret scalar bit than a simple cmov (for details, see Sect. 4).

The side-channel attacks described in the remainder of this paper target both implementations that make use of secretly indexed memory accesses (Sect. 5) and implementations that use the arithmetic cmov operation, or more specifically the cswap operation (Sect. 4). The idea of attacking loads from secret positions through side-channel information is not new: it is not only used in various cache-timing attacks (which do not apply to simple architectures such as the AVR), but it is also the underlying principle of address-bit DPA [34]. What is novel is the fact that we need only a single trace. This renders countermeasures such as scalar blinding and address randomization [35, 36] ineffective.

3 Attack Setup

In this section, we describe the targeted implementations, the utilized microcontroller, and our measurement setup. The trace pre-processing steps, frequency filtering and alignment, are described in the full paper [48].

3.1 Target Implementations

We target two protected ECSM implementations based on [49]. Both employ the Montgomery ladder, with the pseudocode given in Algorithm 1. The main difference between the two variants is the realization of the cmov (i.e., the function cswap_coords): The first implementation, described in more detail in Sect. 4.1, applies an arithmetic conditional swap to the respective coordinate values of the working points \(P_1 = (X_1: Z_1)\) and \(P_2 = (X_2: Z_2)\). The second, described in Sect. 5.1, replaces the arithmetic conditional swap by a conditional swap of pointers to the coordinate values. Both implementations utilize projective-coordinate re-randomization as the main side-channel countermeasure: a randomly generated \(\lambda \in \mathbb {F}_p\) is multiplied with the coordinates of \(P_1 = (X_1: Z_1)\) and \(P_2 = (X_2: Z_2)\) at the beginning of every ECSM iteration. The source code of both implementations is publicly available [47].

Algorithm 1. Montgomery ladder ECSM with conditional coordinate swap (cswap_coords).
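Since the pseudocode figure is not reproduced here, the following Python sketch (our illustration; integer arithmetic stands in for the XZ point operations, and the per-iteration re-randomization with \(\lambda \) is omitted) captures the ladder structure and the conditional swap driven by successive scalar bits:

```python
def cswap(a, b, s):
    # Stand-in for cswap_coords; Sect. 4.1 realizes it arithmetically,
    # Sect. 5.1 by swapping pointers.
    return (b, a) if s else (a, b)

def montgomery_ladder(k, p, nbits):
    r0, r1 = 0, p                    # integers stand in for (X:Z) points
    prev = 0
    for i in reversed(range(nbits)):
        bit = (k >> i) & 1
        r0, r1 = cswap(r0, r1, bit ^ prev)  # swap only when the bit changes
        prev = bit
        r0, r1 = 2 * r0, r0 + r1     # ladderstep: double and differential add
    r0, _ = cswap(r0, r1, prev)      # undo the final swap state
    return r0

assert montgomery_ladder(77, 5, 8) == 77 * 5
```

Regardless of the scalar, every iteration executes the same operation sequence; only the cswap condition depends on the secret, which is precisely the leakage our attacks target.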

3.2 Target Device and Measurement Setup

We carried out our experiments with an ATmega328P 8-bit microcontroller placed on the target board of the ChipWhisperer [51] side-channel evaluation platform. While the ChipWhisperer also provides the possibility to capture analog signals (e.g., power consumption or electromagnetic emanation), we used a separate oscilloscope (Picoscope 5203) due to the limited bandwidth, memory, and sample rate of the ChipWhisperer.

The targeted ATmega328P has 32 KB of Flash, 2 KB of SRAM, and 1 KB of EEPROM. The register file contains 32 registers (R0–R31), among which six serve as pointers for indirect 16-bit addressing and have the following aliases: X (R27:R26), Y (R29:R28), and Z (R31:R30). Arithmetic instructions take 1 cycle, with the exception of multiplication instructions, which take 2 cycles. Loads and stores from/to SRAM take 2 cycles; loads from Flash take 3 cycles. More technical details about the target device are given in the full paper [48].

4 Attacking Arithmetic Cswaps

In this section, we describe a template attack on conditional swaps (cswaps) in the Montgomery ladder step. In our case, the cswap is implemented using Boolean and arithmetic operations in constant time.

4.1 Target Implementation

In the Montgomery ladder (Algorithm 1), the function cswap_coords implements the cswap (based on the input bit s) by first creating a mask m, which is 0x00 or 0xFF for \(s = 0\) and \(s = 1\), respectively, by setting \(m = -s\) (assuming m and s are 8-bit values). Then, a (conditional) XOR swap is executed as follows:

Listing 1.1. Conditional XOR swap using the mask m.

In other words, if \(m =\) 0x00 (\(s = 0\)), then \(tt = 0\) and the XORs \(\mathrm {xx = xx}\oplus \mathrm {tt}\) and \(\mathrm {yy = yy}\oplus \mathrm {tt}\) leave the values unchanged. Otherwise, if \(m =\) 0xFF (\(s = 1\)), we have a standard XOR swap, i.e., \(\mathrm {xx = xx}\oplus \mathrm {xx} \oplus \mathrm {yy = yy}\) (and analogously for \(\mathrm {yy}\)). The following sketch renders this logic in Python.
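A sketch assuming 8-bit limbs stored as byte arrays; the AND in the loop body corresponds to the and instruction that our templates target in Sect. 4.2:

```python
def cswap_coords(xx, yy, s):
    # Conditional XOR swap of two byte arrays, mirroring the logic of
    # Listing 1.1; runs the same instruction sequence for s = 0 and s = 1.
    m = (-s) & 0xFF                  # 0x00 or 0xFF
    for i in range(len(xx)):
        tt = (xx[i] ^ yy[i]) & m     # the 'and' targeted by our templates
        xx[i] ^= tt
        yy[i] ^= tt
    return xx, yy

a, b = [1, 2, 3], [4, 5, 6]
assert cswap_coords(a, b, 1) == ([4, 5, 6], [1, 2, 3])
```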

4.2 Template Generation and Matching

We generated templates for the and instruction (line 5 of Listing 1.1), grouping the traces in the profiling set into two sets \(V_0\) and \(V_1\): traces in \(V_0\) are those where \(m =\) 0x00 (i.e., an AND with 0x00), while \(V_1\) contains the traces where \(m =\) 0xFF. Note that the traces were cut to only contain the clock cycle of the targeted and instruction, i.e., each trace is \(64\cdot 67 = 4288\) samples long (cf. Appendix 2 of the full paper [48]). For \(V_i\), \(i = 0,1\), we subsequently computed templates consisting of the pointwise mean vector \(\varvec{\mu }^{(i)}\) and the covariance matrix \({\varvec{\varSigma }}^{(i)}\) [14]. Note that the two possible leakages 0x00 (all bits zero) and 0xFF (all bits one) can be expected to be maximally (or at least to a large degree) different, which should facilitate template attacks in this particular case.

We matched the templates to the traces in the test set with the standard approach, i.e., computing the respective probabilities using the probability density function of the multivariate normal distribution and identifying the template with the highest probability to recover the respective bit of the scalar. A sketch of both phases is given below; the respective success rates with respect to the size of the profiling set are given in Sect. 4.3.
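For concreteness, a minimal NumPy/SciPy sketch of this profiling/matching pipeline (the toy dimensions, synthetic traces, and the small diagonal loading of the covariance estimate are our assumptions for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

def build_template(traces):
    # traces: (n_traces, n_samples) array of cut single-instruction segments
    mu = traces.mean(axis=0)
    # diagonal loading guards against a singular covariance estimate
    cov = np.cov(traces, rowvar=False) + 1e-9 * np.eye(traces.shape[1])
    return mu, cov

def match(trace, templates):
    # recovered bit = index of the template with the highest probability
    scores = [multivariate_normal.logpdf(trace, mean=mu, cov=cov)
              for mu, cov in templates]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
T0 = build_template(rng.normal(0.0, 1.0, (100, 16)))  # toy stand-in for V_0
T1 = build_template(rng.normal(2.0, 1.0, (100, 16)))  # toy stand-in for V_1
bit = match(rng.normal(2.0, 1.0, 16), (T0, T1))       # classified as 1
```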

Classification. For each template we computed the Euclidean distance between the sample vector and the template mean vector. The template (\(T_0\) or \(T_1\)) that results in the smallest distance is considered the best match for the sample vector. In this attack, the index of the closest template (0 or 1) corresponds to the swap bit.

Confidence score and confidence level. For the first classification method, we derived a simple confidence score for the recovered bit value based on the distances (\(d_0\) and \(d_1\)) to each template. For a given value of \(d_0 + d_1\), it varies linearly, ranging from 0 (no confidence) to 1 (full confidence):

$$\begin{aligned} \mathrm {conf\_score} = 2 \cdot \left| 0.5 - \frac{\min (d_0, d_1)}{d_0 + d_1}\right| \end{aligned}$$
(1)
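A direct transcription of Eq. (1) as a Python helper (illustrative only):

```python
def conf_score(d0, d1):
    # Eq. (1): 0 when d0 == d1 (no confidence), 1 when one distance vanishes
    return 2 * abs(0.5 - min(d0, d1) / (d0 + d1))

assert conf_score(1.0, 1.0) == 0.0
assert conf_score(0.0, 2.0) == 1.0
```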

We furthermore define the confidence level of a given trace (in the test set) as follows: let us call a recovered bit suspicious if its confidence score is less than the greatest confidence score of any falsely identified bit (where this threshold is determined experimentally in the profiling phase). Then, the confidence level is the percentage of bits that are not suspicious, i.e., that can be unambiguously recovered. Note that the average confidence level (over all traces in the test set) is always less than or equal to the average success rate, since an incorrectly recovered bit is always suspicious.

4.3 Attack Results

Figure 1 shows the average and best-case success rates (computed over all 255 scalar bits), together with the respective confidence levels, over the number of traces used for template generation and matching. Note that each full trace comprises 255 ECSM iterations, all of which were used for generating the templates; in other words, each full trace contributes 255 “effective” traces to the profiling set.

The traces used for template generation and matching were taken from different trace sets (coming from different capture sessions). The same number of traces was used for profiling and testing, i.e., a given value on the horizontal axis of Fig. 1 is the same for profiling and testing.

Fig. 1. Success rates for the template attack on cswap for different numbers of full traces.

Fig. 2. Results for the template attack on loads/stores for different numbers of full traces.

As evident from Fig. 1, already for 10 full traces (i.e., about 2,550 effective traces), the average success rate reaches 96.71%, i.e., we can recover most of the bits of the scalar. Furthermore, the best success rate reaches 99.6% with a confidence level of 76.1%. Increasing the number of traces changes both success rate and confidence level only minimally; due to the strong leakage of the targeted device, most information can already be extracted with a low trace count.

5 Attacking Secret-Dependent Memory Accesses

In general, ECC (and in particular NaCl-derived) implementations avoid loads from secret-dependent addresses altogether due to the possibility of cache-timing attacks. However, for embedded implementations without caches, secret load addresses are sometimes deemed acceptable. In this section, we show that template attacks can be employed to exploit this leakage.

5.1 Target Implementation

The targeted implementation replaces the cswap of the \((X_1:Z_1)\) and \((X_2:Z_2)\) coordinate values used in Algorithm 1 by working with pointers to those coordinates and conditionally swapping these pointers. Besides being slightly faster, this implementation also potentially exhibits less leakage, because it uses the secret-dependent mask m in an AND operation only twice for each pointer cswap, rather than 32 times as in the ECSM implementation based on the arithmetic cswap (cf. Sect. 4.1).

However, in implementations of finite-field operations, both input and output operands are pointers. The values of these pointers are addresses of the memory holding the actual field-element values, and those addresses directly depend on whether the swap occurred or not, which in turn depends on the value of the secret mask bit.

AVR memory access instructions internals. Memory access instructions (loads and stores) on an AVR take 2 clock cycles to execute. According to the ATmega328 datasheet [3], the effective address for such instructions is computed in the first cycle, while during the second cycle, the data word is read (load) or written (store) if the effective address is valid. Our attack focuses on the address leakage of memory access instructions, and any data dependency may negatively impact the attack success rate if not detected and mitigated. We therefore take advantage of this architectural feature and use only the samples from the first clock period of such instructions.

Targeted loads and stores. During each iteration of the Montgomery ladder, the actual field arithmetic occurs in the so-called ladderstep function (cf. Algorithm 1). We target the load and store addresses in the first three field operations in ladderstep, i.e., addition, subtraction, and addition. Each of these operations has two \(\mathbb {F}_p\) inputs (a and b) and one output r.

Finite-field addition and subtraction are implemented with reduction modulo \(2^{256}-38\). The reduction step also executes loads and stores, whose samples are likewise used for template creation and matching. Listing 1.2 shows a small segment of the execution trace containing the loads of the first bytes of the operands and the store of the first byte of the result (before reduction):

Listing 1.2. Segment of the execution trace: loads of the first operand bytes and store of the first result byte.

Our oscilloscope’s memory is divided into 255 segments, each 65 kSamples long; one memory segment holds the samples captured from a single ECSM iteration. Due to this 65-kSample limit per ECSM iteration, we were able to capture the samples from all loads and stores of the first field addition and the first field subtraction, but only half of the loads and stores from the arithmetic part of the second field addition. Note that this memory limitation is due to the relatively low-cost oscilloscope we used; high-end equipment would further facilitate the presented attack.

Table 2 shows the number of executed instructions of each type that are used in the attack. We used a total of 372 instructions, whose samples are concatenated into a single sample vector. After trace preprocessing, 67 power samples are available per clock cycle, and as only the first clock period of a memory access instruction is used, the sample vector per ECSM iteration has \(n_v = 372 \cdot 67 = 24{,}924\) samples.

Table 2. Number of executed instructions of each type that are used in the attack.

5.2 Template Generation

Each load or store instruction accesses at most two possible addresses. If it always accesses the same address, it does not provide leakage relevant for the attack. Considering only those loads and stores that may access two addresses, during any execution of the ladderstep only two distinct sequences of addresses can be accessed: \(A_{\mathrm {noswap}}\), containing the addresses accessed before the first pointer swap has taken place, i.e., in an even state (noswap state); and \(A_{\mathrm {swap}}\), containing the addresses accessed in an odd state (swap state).

First, we grouped the sample vectors into two sets: \(V_0\) consists of the load/store sample vectors for addresses in \(A_{\mathrm {noswap}}\), while \(V_1\) contains those originating from addresses in \(A_{\mathrm {swap}}\). Then, we computed various statistics for each sample index of \(V_i\), \(i = 0,1\): the mean \(\varvec{\mu }^{(i)}\), the standard deviation \(\varvec{\sigma }^{(i)}\), the median \(\varvec{md}^{(i)}\), as well as the lower \(\varvec{l}^{(i)}\) and upper \(\varvec{u}^{(i)}\) percentiles (the actual percentiles used are discussed in Sect. 5.3). The collections of these statistics for \(V_0\) and \(V_1\), called \(T_0\) and \(T_1\), are the two possible templates.

5.3 Point-of-Interest Selection

The POI selection uses the lower and upper percentile vectors \(\varvec{l}^{(i)}\) and \(\varvec{u}^{(i)}\) (\(i=0,1\)) to compute the intersection of the pair of intervals \([ l_j^{(0)}, u_j^{(0)} ]\) and \([ l_j^{(1)}, u_j^{(1)} ]\) for each sample index \(j=1,\cdots ,n_v\). The sample indices where the intersection is empty are selected as POIs.

Intuitively, the sample indices with an empty intersection are good distinguishers for the two templates, because at these points the samples tend to be clustered around the median (and typically also around the mean) of one template, rather than being scattered. A sketch of this interval test is given below.
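A sketch of the interval-intersection test (the function name poi_mask and the synthetic trace sets are our illustrations):

```python
import numpy as np

def poi_mask(V0, V1, lo=12.5, hi=87.5):
    # percentile interval per sample index, for each of the two groups
    l0, u0 = np.percentile(V0, [lo, hi], axis=0)
    l1, u1 = np.percentile(V1, [lo, hi], axis=0)
    # POI <=> the intervals [l0, u0] and [l1, u1] do not intersect
    return (u0 < l1) | (u1 < l0)

rng = np.random.default_rng(1)
V0 = rng.normal(0.0, 0.2, (200, 32))  # toy stand-ins for the two
V1 = rng.normal(1.0, 0.2, (200, 32))  # address-sequence groups
pois = poi_mask(V0, V1)               # Boolean vector reused during matching
```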

Different values for the lower and upper percentiles may give a different number of POIs, which directly affects the success rate and confidence level of the attack. Thus, we tested the attack for different pairs of values for these parameters, ranging from wide, more selective percentiles (12.5, 87.5) to narrow, less selective ones (40, 60). We emphasize that the POI selection is based exclusively on the samples of the traces used for template generation; it does not depend on the samples of the trace being attacked (i.e., the sample vector to classify). In fact, the POIs are represented as a Boolean vector used during template matching to select the samples from the target trace vector to be classified.

POI selection refinements. To improve the confidence level of the attack, we tested two refinements of the POI selection. First, we noticed that when using more selective percentile parameters, the selection method returned sample indices that were clustered in a few instructions, while most of the remaining instructions were not covered by any sample, although they should in theory contribute some leakage. To make the POIs more evenly distributed and exploit leakage from all useful instructions, we forced a minimum of one sample index per instruction to be included in the POI vector; if there was no sample index for a given instruction in the current POI vector, one was randomly selected. Second, also due to the clustering of the POIs in a few instructions, we limited the number of samples per instruction to one; in the case that sample indices had to be removed, we likewise selected them randomly.

5.4 Template Matching

At first, without any POI selection, we tried to use the standard multivariate Gaussian model, taking advantage of both the mean vector and the covariance matrix computed from \(V_0\) and \(V_1\) (also known as complete templates), similar to the approach of Sect. 4. However, in contrast to Sect. 4, the sample vectors to be classified and the template mean vectors are relatively long (24,924 samples) and relatively similar to each other (i.e., their Euclidean distance is very small); as a consequence, numerical instability issues due to almost-singular matrices arose during the computation of the probability density function. For these reasons, we decided to use reduced templates instead, which consist of the mean vectors only.

After applying POI selection, the matched sample vectors are much smaller, and full templates could then in principle be applied, as the covariance matrices would no longer lead to numerical instability. However, given the high success rates achieved using reduced templates, we decided not to use full templates, avoiding increased storage and computational requirements.

We also evaluated the effect on the attack success rate and confidence level of compressing the sample vector using normal and absolute sums for different window lengths. In addition, we applied a straightforward outlier detection to remove samples that have likely been subject to larger distortions: in the matching phase, we discarded all samples whose distance to the mean trace at the respective point in time exceeds a fixed multiple of the standard deviation. Using reduced templates, template matching boils down to computing the (squared) Euclidean distance between the sample vector to match and the template mean vectors; the lower that distance, the stronger the match. In this setting, other distinguishers can be used in a straightforward way, and thus we also tested the attack using the Pearson correlation coefficient.

Classification methods and confidence score. As a first classification method, we selected the template closest to the sample vector (cf. Sect. 4.2). We also tested majority-voting classification, where each sample is classified individually, also based on its distance to the corresponding element of the template mean vectors, and the majority vote wins. In both cases, as each template directly corresponds to a scalar bit value, the classification output is the recovered bit value. The confidence score was computed in the same way as in Sect. 4.2. Both methods can be sketched as follows.
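A sketch of both methods on reduced templates (V0, V1, and pois as in the sketch of Sect. 5.3; the function names are illustrative):

```python
import numpy as np

def classify_closest(v, mu0, mu1, pois):
    # reduced-template matching: the smaller Euclidean distance wins
    d0 = np.linalg.norm(v[pois] - mu0[pois])
    d1 = np.linalg.norm(v[pois] - mu1[pois])
    return (0 if d0 <= d1 else 1), d0, d1  # the distances feed Eq. (1)

def classify_majority(v, mu0, mu1, pois):
    # each selected sample votes for the template whose mean is closer
    votes1 = np.abs(v[pois] - mu1[pois]) < np.abs(v[pois] - mu0[pois])
    return int(votes1.sum() > votes1.size / 2)

mu0, mu1 = V0.mean(axis=0), V1.mean(axis=0)  # reduced templates (Sect. 5.4)
bit, d0, d1 = classify_closest(V1[0], mu0, mu1, pois)
```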

5.5 Attack Results

Figure 2 depicts the average and best-case success rates for the template attack on secret-dependent memory accesses. Again, as in Sect. 4.3, the trace sets used for template generation and matching were recorded in different capture sessions, and the same number of traces was used for each set. A limited number of profiling traces was again sufficient to reach success rates exceeding 90%; the best success rate reaches 95.3% (only 12 errors) with a confidence level of 78.8% (the 12 errors are included in the 54 suspicious bits). To investigate the effect of various pre-processing steps and attack parameters, we evaluated, using 10 traces, the average success rate and confidence level for various signal-frequency filtering options, POI selection methods, classification and compression methods, outlier filtering, and distinguishers; the results of this investigation are described in the full paper [48]. The best parameters we discovered were used for the main attack described in this section.

6 Error Detection and Correction

Due to noise, data leakage (note that we aim at exploiting the address leakage only), and other aspects that interfere with the side-channel analysis (misalignment, clock jitter, etc.), the scalar recovered from a single trace likely contains errors. If the number of wrong bits is sufficiently small, a brute-force attack may still be feasible. However, the attacker first needs a metric indicating the locations of the possibly wrong bits in the recovered scalar. The notion of suspicious bits (cf. Sect. 4.2) can be used to select the scalar bits for such a brute-force attack.

Let us consider the trace with the smallest number of suspicious bits from the experiment of Sect. 5; for this trace there are \(54\) suspicious bits, which comprise all falsely identified bits. Unfortunately, to recover a full randomized scalar, even in this case the attacker needs \(2^{54}\) operations, which is generally impractical. Note that we consider only the worst-case complexity and not the average case.

To improve upon the brute-force search complexity, there are two options. The first approach is to try to exploit the distribution of confidence scores for incorrectly (red) and correctly (blue) recovered suspicious bits (Fig. 3). While there is a clear trend for incorrect bits to have lower confidence scores, the overlap between correct and incorrect bits is large. Still, it may be possible to exploit the trend with an informed brute-force attack [40], prioritizing bits with the lowest confidence scores. Unfortunately, this attack works well only if the erroneous bits are adjacent to each other, which is not the case in our setting.

Fig. 3. Distribution of confidence scores over all traces for suspicious bits. Red: incorrectly recovered bits, blue: correctly recovered but suspicious bits. (Color figure online)

Alternatively (or combined with the informed brute-force search), we apply the second algorithm from [26], which was originally designed for square-and-multiply chains, to the Montgomery ladder. We describe how the algorithm works using the aforementioned example trace, which contains \(s=54\) suspicious bits. Let us represent the indices of these bits as a list sorted in descending order: \(i_s, \dots, i_1\), where each \(i_j \in \{0, \dots, 254\}\) and \(s \ge j \ge 1\); note that there are 255 bits in total. Let x denote the splitting position, namely \(x = i_{28}\) for the example trace. Let a be the number represented by the bit string corresponding to the left part of the scalar from position x (inclusive), and let b be the number corresponding to the bit string of the (least significant) right part. Furthermore, we know that \(R = [k] P\), where R is the resulting point, k the scalar to be recovered, and P the input point. Then, clearly \(R = [k] P = [a \cdot 2^{x} + b] P = [a] ([2^{x}] P) + [b] P\). If we denote \([2^{x}] P\) by H, then the above equation reduces to

$$\begin{aligned} R - [b] P = [a] H \end{aligned}$$
(2)

We can use Eq. 2 to check the correctness of our guess. Now, following [26], we use a time-memory trade-off technique to speed up the exhaustive search: consider all possible guesses for a. For each guess, we compute [a]H and store the pair (a, [a]H). We then sort all pairs by the value of [a]H and store them in an ordered table.

Next, we make a guess for b and compute \(z = R - [b] P\). If our guess for b is correct, then z is present in the second column of some row of the table we built, and the first column of that row is the corresponding a. Finding such a pair can be done using binary search, as the table is sorted by the second column. If z is present, we are done, since we have determined the scalar; otherwise, we make a new, different guess for b and continue. Since there are approximately \(2^{\frac{s}{2}}\) guesses for each of a and b, the time complexity is \(O(2^{\frac{s}{2}})\) operations; as the table has one entry per guess of a, the space complexity is \(O(2^{\frac{s}{2}})\) points (cf. [26] for a detailed complexity analysis). For the example trace, this amounts to \(2^{27}\) operations. The sketch below illustrates the procedure.
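The following self-contained Python toy illustrates the structure of the trade-off; multiplication modulo a prime q stands in for scalar multiplication (so \([k]P\) becomes \(k \cdot P \bmod q\)), the suspicious-bit positions and the key are made up, and a hash table replaces the sorted table with binary search:

```python
q = 2**31 - 1                       # toy group "order" (illustrative)
P = 7
k_true = 0b101101101101             # 12-bit toy scalar
suspicious = [1, 2, 9, 10]          # unknown bit positions, ascending
known = k_true & ~sum(1 << j for j in suspicious)  # SCA output, gaps zeroed
R = (k_true * P) % q                # the known ECSM result

right, left = suspicious[:2], suspicious[2:]  # split at the middle index
x = left[0]                         # splitting position
H = (pow(2, x) * P) % q             # H = [2^x]P

def with_bits(base, positions, guess):
    # write the bits of 'guess' into 'base' at the given bit positions
    for j, pos in enumerate(positions):
        base |= ((guess >> j) & 1) << pos
    return base

table = {}                          # [a]H -> a (the sorted table of [26])
for ga in range(1 << len(left)):
    a = with_bits(known >> x, [pos - x for pos in left], ga)
    table[(a * H) % q] = a
for gb in range(1 << len(right)):
    b = with_bits(known & ((1 << x) - 1), right, gb)
    z = (R - b * P) % q             # Eq. (2): a hit means R - [b]P = [a]H
    if z in table:
        k = (table[z] << x) | b
        break

assert k == k_true
```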

We do not know which trace contains the smallest number of suspicious bits, since we do not know the maximum confidence score of a falsely identified bit. However, to use the above algorithm, we assume that we know the number of suspicious bits to be brute-forced to recover the correct scalar. This number can be determined by using the templates to attack some traces for which the randomized key is known. Furthermore, note that if the attack fails, we can extend the execution to the next most suspicious bit and reuse the previously obtained data. Based on our experiments, we determined that \(54\) suspicious bits should cover all falsely identified bits for at least one trace. Our complete attack works as follows: we run the above algorithm sequentially for each of the n traces, and we stop as soon as the time-memory trade-off technique succeeds for one trace.

Since we run the attack n times, the complexity of the complete attack is multiplied by n: in total, \(O(n \cdot 2^{\frac{s}{2}})\) operations and \(O(n \cdot 2^{\frac{s}{2}})\) points in memory. For the attack from the previous section, this corresponds to \(100\cdot 2^{27} \approx 2^{34}\) operations. Therefore, we conclude that the scalar can be recovered successfully and efficiently even in the presence of multiple errors and uncertain bits (for experimental results, see Sect. 6.1). Furthermore, we believe that the above technique may be of independent interest, since it can be applied to a commonly used ECSM algorithm, i.e., the Montgomery ladder, even if errors are randomly spread across the scalar recovered by the SCA attack.

6.1 Algorithm Implementation and Experimental Results

The first challenge we faced is how to compute the point subtraction in Eq. 2. Curve25519 is a curve in Montgomery form and, as such, has an efficient formula for differential point addition using XZ coordinates, but, as far as we know, no efficient formula for a standard point addition. For that reason, we decided to perform the point addition in affine coordinates, which costs a field inversion and a few multiplications. However, to use affine coordinates we need to know the y-coordinates y(R) and y([b]P). The attack assumes that x(R) (the ECSM output) is known, but y(R) is not and thus has to be computed. To do so, we use the curve equation directly to compute the two possible values of y(R), at the cost of a field square root, an expensive operation that, however, has to be performed only once for each value of R. For y([b]P), an efficient algorithm by Okeya and Sakurai [52] costs one field inversion.

To generate the table of precomputed points \(A = [a]H\) and to compute \(B = [b]P\) in Eq. (2), the naive approach is to compute a full ECSM for each value of a and b. A more efficient method is to apply Gray coding to the suspicious bits in the scalars a and b. The defining property of such a code is that consecutive code words differ in just a single bit, which means that, in our context, we can generate \([k']P\) from [k]P using a single point addition (if the bit changed from 0 to 1) or point subtraction (if the change is from 1 to 0), where k and \(k'\) are scalars whose unknown bits are represented as Gray code words, and the code word in \(k'\) is the successor of the code word in k. To compute the sequence of points \([k_i]P\) (\(i=0,1,\ldots \)), we first construct the scalar \(k_0\) by setting the unknown bits to zero and the (assumed correct) recovered bits from the output of the SCA attack to their respective values. Then, we apply the full ECSM algorithm to compute \([k_0]P\), and from there we use the aforementioned method to generate the sequence of points \([k_1]P, [k_2]P, \dots \), at the cost of essentially one point addition per computed point. The walk is sketched below.
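The Gray-code walk can be sketched as follows (we track scalars instead of points, so the single point addition or subtraction per step appears as adding or subtracting \(2^j\); the positions and \(k_0\) are illustrative):

```python
def gray(i):
    return i ^ (i >> 1)

suspicious = [1, 2, 9, 10]       # example suspicious-bit positions
k = 0b100000000001               # k_0: suspicious bits set to zero
scalars = {0: k}                 # scalars stand in for the points [k_i]P
for i in range(1, 1 << len(suspicious)):
    t = gray(i - 1) ^ gray(i)    # consecutive code words: exactly one bit set
    w = t.bit_length() - 1       # which code-word bit flipped
    j = suspicious[w]            # corresponding scalar-bit position
    sign = 1 if (gray(i) >> w) & 1 else -1
    k += sign * (1 << j)         # [k']P = [k]P +/- [2^j]P in the real attack
    scalars[gray(i)] = k
```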

We implemented the key-recovery algorithm with the aforementioned arithmetic-level optimizations as a single-threaded program. We tested our implementation on a smaller scale, recovering 40 suspicious bits of a scalar on a PC with 8 GB of RAM in total (of which only 5 GB were available to the program) and an Intel i7-3740QM CPU running at 2.7 GHz. It took 1 h 23 min to recover the correct scalar, where about 1.5 ms is spent to add a single entry to the table and about 3 ms to test a possible value of b. Using these timings as a reference, we estimate that recovering a scalar with 60 suspicious bits using the current implementation would take around 18 days. The source code of the key-recovery implementation is publicly available [46].

7 Conclusions and Possible Countermeasures

In this paper, we show that the single-trace leakage of conditional moves can be exploited to recover the scalar using a template attack. We also show that a similar attack applies to the address leakage of loads and stores from/to secret-dependent addresses. Furthermore, we generalize the method from [26] to tolerate a certain number of incorrectly recovered scalar bits without relying on plain exhaustive search.

Now we discuss possible countermeasures against our attacks; we consider evaluating or improving our attacks against these countermeasures as future work. First of all, note that any countermeasure based on modifying the base point before or during the scalar multiplication does not protect against our attacks, since they exploit address-dependent and cswap leakage. Similarly, scalar blinding or splitting does not affect the attack, since we require only one trace and could hence recover the blinded or split scalar; knowledge of the randomized scalar (or the split scalars) is sufficient to either recover the original scalar or to compute the correct scalar-multiplication result. A potential countermeasure against our attack is presented in [50], which performs online data randomization during the exponentiation to prevent horizontal collision-correlation attacks. The main idea is to split the scalar into two parts and to randomly interleave the two scalar multiplications. However, we believe that our attack might still be mounted if four templates are used to recognize which bit is processed and during which ECSM.

The idea behind the memory-address countermeasure of Itoh et al. [34] is to store sensitive variables at different memory addresses that share the same Hamming weight. We believe that although this would make our attack less effective, the address leakage may still be identified by template matching. Randomizing the memory addresses of the coordinates used in the Montgomery ladder before the ECSM might also make our attack less effective, since the templates are prepared assuming fixed addresses. This countermeasure can be improved by randomizing not only the addresses but also the memory accesses [35, 36, 37].

The countermeasure of [30] protects against localized EM template attacks on the ECC Montgomery ladder. The main idea is to randomly swap the ladder registers at the end of a ladder iteration; the addressing of the registers within the loop is inverted according to whether the registers have been swapped. The countermeasure is uniform in its operation sequence, and hence our template attacks would in principle be infeasible against it. In addition, several randomization techniques protecting the Montgomery ladder are presented in [41]. Similarly to the countermeasure of [30], these techniques generate operation sequences independent of the scalar; we thus assume that our attack would be less effective or ineffective against them. We therefore regard evaluating and improving our attacks with respect to these latter three countermeasures as future work.