
1 Introduction

In recent years, low-end embedded devices have been deployed in increasing numbers and used in various applications, such as Radio Frequency Identification (RFID) tags and wireless sensor networks (WSNs). Providing security solutions to these widely used devices has attracted a lot of attention from cryptography researchers. These kinds of devices have very limited power budgets, constrained memory and computing capability, and thus applying traditional security solutions, such as TLS and IPsec, in these contexts is often impractical. Hence, lightweight cryptography has been developed in order to provide compact algorithms and protocols that fit in resource-constrained environments.

Numerous lightweight ciphers have appeared. Among them are a large number of block ciphers such as TEA [31], XTEA [26], PRESENT [9], KATAN and KTANTAN [11], LED [16], EPCBC [33], KLEIN [15], LBlock [32], Piccolo [29], Twine [30], and the more recent Simon and Speck [3]. There exist also some lightweight stream ciphers such as Trivium [12], Grain [17] and WG [25], which provide suitable security and small implementations for resource-constrained devices.

The recently proposed lightweight block ciphers, Simon and Speck [3], have led to papers concerning their security [1, 7, 10]. This is partially due to the fact that these ciphers are recognized to be the smallest block ciphers in each of the block/key size categories when used in resource-constrained environments. Simon is optimized for hardware implementation, while Speck is optimized for software. Inspired by the designs of Simon and Speck, we combine their good components in order to get a new block cipher family, called Simeck. We use a slightly modified version of Simon’s round function, and reuse it in the key schedule as Speck does. Moreover, we take advantage of Linear Feedback Shift Register (LFSR) based constants in the key schedule in order to further reduce the hardware implementation footprint. The new family of lightweight block ciphers Simeck aims to have comparable security levels but more efficient hardware implementations.

Based on the aforementioned motivations, our detailed design goals are as follows.

  • Hardware. First, we want to minimize the area and power consumption of the Application Specific Integrated Circuit (ASIC) implementations. We also want to allow a range of options in the area, throughput, and power consumption. Finally, we want to keep the maximum operating frequency as high as possible.

  • Applications. Taking passive RFID tags as an example application, Simeck should satisfy the following requirements in order to be used in practice: (1) The area of Simeck should be less than 2000 GEs [2, 18]. (2) The power consumption of Simeck should be very small. (3) The typical passive RFID tag’s operating frequency is 2 MHz and the data rate is 64 Kbps [14, 34], and thus the throughput per clock cycle is 64K/2M \(\approx \) 1/32 bit. Therefore, if the tag’s operating frequency is 100 KHz (for benchmarking purposes), the throughput of Simeck should be at least 100 K \(\cdot \) 1/32 bps \(\approx \) 3.1 Kbps.

  • Security. Although Simon and Speck were designed with small, simple round functions, they are iterated a sufficient number of times in order to resist traditional attacks. We follow the same strategy with Simeck, and due to its similarity with Simon, we benefit from the analysis carried out on it so far.

Table 1. Comparison of Hardware Implementations of Lightweight Block Ciphers

In this paper, we offer a wide range of options between area, throughput, and power consumption for the implementations of Simeck. All members of the Simeck family meet our security, hardware, and application design goals. We compare our results to the previous constructions with comparable block sizes and key sizes as given in Table 1. Table 1 gives our smallest area results for all the instances of Simeck before and after the Place and Route (P&R) in CMOS 130 nm and CMOS 65 nm ASICs. In addition, the corresponding throughput and power consumption after the Place and Route are also provided. In particular, Table 1 presents our hardware implementation results of Simon, which cost less area than the original results in [3]. Moreover, the hardware implementations of our Simeck block cipher family are even smaller than our implementations of Simon in terms of area and power consumption.

More specifically, in Table 1, we achieve a small area of 505 GEs before the Place and Route with a throughput of 5.6 Kbps and 0.417 \(\mu \)W power consumption for Simeck32/64 in CMOS 130 nm ASIC. In a fair comparison (before the Place and Route) in CMOS 130 nm, Simeck32/64 is 2.3 % smaller than our implementation of Simon32/64, and 3.4 % smaller than the original implementation of Simon32/64. Correspondingly, we obtain an even smaller area of 454 GEs before the Place and Route and 1.292 \(\mu \)W power consumption in CMOS 65 nm ASIC. In this case, Simeck32/64 is 2.6 % smaller than our implementation of Simon32/64.

Similarly, Simeck48/96 and Simeck64/128 are 2.5 % and 2.1 % smaller, respectively, than our implementations of Simon48/96 and Simon64/128, and 3.3 % and 3.5 % smaller, respectively, than the original implementations of Simon48/96 and Simon64/128 in CMOS 130 nm. Correspondingly, in CMOS 65 nm, Simeck48/96 and Simeck64/128 are 2.4 % and 2.0 % smaller, respectively, than our implementations of Simon48/96 and Simon64/128. Moreover, with only a little extra area (GEs) and power consumption, we can increase Simeck’s throughput substantially.

This paper is organized as follows. In Sect. 2, we describe the specifications and design rationales of the Simeck family. Section 3 first presents our metrics and design flow in CMOS 130 nm and CMOS 65 nm ASICs. Then, we give two different hardware architectures of Simeck in order to make a trade-off between area, throughput, and power consumption. Later, the hardware evaluations in CMOS 130 nm and CMOS 65 nm are given with a thorough analysis. In Sect. 4, we compare our results of Simeck and Simon with the results in [3]. Before concluding this paper, we provide a security analysis of our new block ciphers in Sect. 5.

2 Design Specifications and Rationales

In this section, we give the specifications, as well as design rationales, of our block cipher family Simeck. We use the following notations throughout the rest of the paper.

  • \(x \lll c\) denotes the cyclic shift of x to the left by c bits.

  • \(x \odot y\) is the bitwise AND of x and y.

  • \(x \oplus y\) is the exclusive-or (XOR) of x and y.

2.1 Specifications of Simeck

Our lightweight block cipher family Simeck is denoted Simeck2n/mn, where n is the word size, required to be 16, 24 or 32, 2n is the block size, and mn is the key size. More specifically, our Simeck family includes Simeck32/64, Simeck48/96, and Simeck64/128. For example, Simeck32/64 performs encryption or decryption on 32-bit message blocks using a 64-bit key. These three size choices aim to fit different applications of embedded systems, including RFID systems, and these sizes are also contained in the specifications of the Simon and Speck families of block ciphers.

Simeck is designed to have an extremely small hardware footprint and to be compact in software implementations as well. The round function and the key schedule algorithm follow the Feistel structure. A plaintext to be encrypted is first divided into two words \(l_0\) and \(r_0\), where \(l_0\) contains the most significant n bits, and \(r_0\) consists of the least significant n bits. Then these two words are processed by the Simeck round function for a certain number of rounds, and finally the two output words \(l_T\) and \(r_T\) are concatenated to form a complete ciphertext, where T denotes the total number of rounds.

Round Function. We define the round function (of the i-th round) as the following function,

$$\begin{aligned} R_{k_i}(l_i,\; r_i) = (r_i \oplus f(l_i) \oplus k_i,\; l_i), \end{aligned}$$

where \(l_i\) and \(r_i\) are the two words for the internal state of Simeck, \(k_i\) is the round key, and the function f is defined as

$$\begin{aligned} f(x) = (x \odot (x \lll 5)) \oplus (x \lll 1). \end{aligned}$$

Fig. 1 illustrates the operations of the round function \(R_{k_i}\).
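As a minimal sketch of the two equations above, the round function can be written in Python as follows. The helper names `rotl`, `f`, `round_fn`, and `inv_round_fn` are illustrative and not part of the specification, and words are assumed to be held as non-negative integers smaller than \(2^n\). The inverse is included only to show that the Feistel structure makes every round invertible for any round key.

```python
def rotl(x, c, n):
    """Cyclic left shift of the n-bit word x by c bits."""
    mask = (1 << n) - 1
    c %= n
    return ((x << c) | (x >> (n - c))) & mask


def f(x, n):
    """f(x) = (x AND (x <<< 5)) XOR (x <<< 1) on n-bit words."""
    return (x & rotl(x, 5, n)) ^ rotl(x, 1, n)


def round_fn(l, r, k, n):
    """One round: (l, r) -> (r XOR f(l) XOR k, l)."""
    return r ^ f(l, n) ^ k, l


def inv_round_fn(l_next, r_next, k, n):
    """Undo one round: the old left word is the new right word."""
    l = r_next
    r = l_next ^ f(l, n) ^ k
    return l, r
```

Because the shifts are fixed, the only word-level operations needed per round are one AND and three XORs, which matches the gate count given later for the parallel architecture.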

Fig. 1. The Round Function of Simeck

Fig. 2. The Key Expansion of Simeck, where \(R_{C \oplus (z_{j})_i}\) is the Simeck Round Function with \(C \oplus (z_{j})_i\) Acting as the Round Key

Key Schedule/Expansion. To generate the round key \(k_i\) from a given master key K, the master key K is first segmented into four words and loaded as the initial states \((t_2, t_1, t_0, k_0)\) of the feedback shift registers shown in Fig. 2. The least significant n bits of K are loaded into \(k_0\); while the most significant n bits are put into \(t_2\). To update the registers and generate round keys, we reuse the round function with a round constant \(C \oplus (z_{j})_i\) acting as the round key, i.e. \(R_{C \oplus (z_{j})_i}\). The updating operation can be expressed as

$$\begin{aligned} \left\{ \begin{array}{rl} k_{i+1} &{}= t_i, \\ t_{i+3} &{}= k_i \oplus f(t_i) \oplus C \oplus (z_j)_i, \end{array} \right. \end{aligned}$$

where \(0 \le i \le T - 1\). The value \(k_i\) is used as the round key of the i-th round.

The value of the constant C is defined by \(C = 2^n - 4\), where n is the word size. \((z_j)_i\) denotes the i-th bit of the sequence \(z_j\). Simeck32/64 and Simeck48/96 use the same sequence \(z_0\), i.e. \(j = 0\), which is an m-sequence with period 31 and can be generated by the primitive polynomial \(X^5 + X^2 + 1\) with the initial state (1, 1, 1, 1, 1). When the number of rounds is larger than 31, the sequence repeats itself. Simeck64/128 uses another m-sequence \(z_1\) with period 63, which is generated by the primitive polynomial \(X^6 + X +1\) with the initial state (1, 1, 1, 1, 1, 1).
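The two constant sequences can be sketched with a small LFSR model. The text fixes only the polynomials and the all-ones initial states, so the recurrence form used here, \(s_{t+5} = s_{t+2} \oplus s_t\) for \(z_0\) and \(s_{t+6} = s_{t+1} \oplus s_t\) for \(z_1\), as well as the bit ordering and the names `m_sequence` and `round_constant`, are assumptions made for illustration.

```python
def m_sequence(taps, degree, length):
    """Return s_0..s_{length-1}, where s_{t+degree} is the XOR of s_{t+e}
    for e in taps, starting from the all-ones state."""
    s = [1] * degree
    while len(s) < length:
        t = len(s) - degree
        bit = 0
        for e in taps:
            bit ^= s[t + e]
        s.append(bit)
    return s[:length]


# z_0 from X^5 + X^2 + 1 (assumed recurrence s_{t+5} = s_{t+2} XOR s_t)
Z0 = m_sequence(taps=(0, 2), degree=5, length=31)
# z_1 from X^6 + X + 1 (assumed recurrence s_{t+6} = s_{t+1} XOR s_t)
Z1 = m_sequence(taps=(0, 1), degree=6, length=63)


def round_constant(n, z, i):
    """C XOR (z_j)_i with C = 2^n - 4; z holds one full period."""
    return ((1 << n) - 4) ^ z[i % len(z)]
```

Since the two low bits of C are zero, the XOR with \((z_j)_i\) only ever toggles the least significant bit of the constant.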

Number of Rounds. The numbers of rounds T for Simeck32/64, Simeck48/96, and Simeck64/128 are 32, 36, and 44, respectively.
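Putting the round function, the key expansion, and the round counts together gives the following self-contained Python sketch of encryption and decryption. It follows the update rule \(k_{i+1} = t_i\), \(t_{i+3} = k_i \oplus f(t_i) \oplus C \oplus (z_j)_i\) from the text; the LFSR tap convention, the `PARAMS` table layout, and all function names are assumptions for illustration, and the sketch has not been checked against official test vectors.

```python
def rotl(x, c, n):
    return ((x << c) | (x >> (n - c))) & ((1 << n) - 1)


def f(x, n):
    return (x & rotl(x, 5, n)) ^ rotl(x, 1, n)


# word size n -> (rounds T, LFSR degree, assumed feedback taps)
PARAMS = {16: (32, 5, (0, 2)), 24: (36, 5, (0, 2)), 32: (44, 6, (0, 1))}


def z_bits(degree, taps, length):
    """Constant sequence from the all-ones state (assumed recurrence)."""
    s = [1] * degree
    while len(s) < length:
        t = len(s) - degree
        s.append(s[t + taps[0]] ^ s[t + taps[1]])
    return s[:length]


def expand_key(key_words, n):
    """key_words = (t2, t1, t0, k0), most significant word first."""
    rounds, degree, taps = PARAMS[n]
    z = z_bits(degree, taps, rounds)
    c = (1 << n) - 4
    t2, t1, t0, k = key_words
    t = [t0, t1, t2]
    round_keys = []
    for i in range(rounds):
        round_keys.append(k)                 # k_i
        t.append(k ^ f(t[i], n) ^ c ^ z[i])  # t_{i+3}
        k = t[i]                             # k_{i+1}
    return round_keys


def encrypt(l, r, round_keys, n):
    """l holds the most significant n bits of the block."""
    for k in round_keys:
        l, r = r ^ f(l, n) ^ k, l
    return l, r


def decrypt(l, r, round_keys, n):
    for k in reversed(round_keys):
        l, r = r, l ^ f(r, n) ^ k
    return l, r
```

For example, `expand_key((0x1918, 0x1110, 0x0908, 0x0100), 16)` returns 32 round keys whose first four entries are simply the key words from least to most significant, after which the feedback words appear.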

2.2 Design Rationales

In Simeck, we use a slightly simplified version of the round function of Simon. The round function of Simon can be expressed as

$$\begin{aligned} R'_{k_i}(l_i,\; r_i) = (((l_{i}\lll 1)\odot (l_{i}\lll 8)) \oplus (l_{i}\lll 2) \oplus r_{i} \oplus k_i,\; l_i), \end{aligned}$$

where \(l_i\) and \(r_i\) are the input words, and \(k_i\) is the round key. The operations of the round function only contain bitwise AND, XOR and cyclic shifts, and they are very efficient for hardware implementations. In particular, for Simeck, we change the shift numbers from (1, 8, 2) to (0, 5, 1). We choose our shift numbers in order to realize an acceptable trade-off between hardware performance and security. These modifications improve the efficiency of hardware implementations while retaining comparable security strength against certain attacks. More discussions will be given in the following sections.

For the key expansion/schedule algorithm of Simeck, we borrow from the design of Speck the idea of re-using the round function to update the round-key registers.

Concerning the number of rounds for Simeck, we choose the same numbers as the corresponding block ciphers in the Simon family, in order to have comparable security levels and fair hardware implementation evaluations.

To defeat certain self-similarity attacks such as slide attacks and rotational attacks, we add the round constants C and \((z_j)_i\) into the key expansion process. The constant \(C = 2^n - 4\) is also used in the key expansion of Simon. The polynomials for the two m-sequences \(z_0\) and \(z_1\) are chosen to have minimum numbers of components, such that their hardware implementations will have small footprints.

3 Hardware Implementations

We discuss the hardware implementations of the Simeck family of block ciphers in this section.

3.1 Metrics and Design Flow

We use the Synopsys Design Compiler Version D-2010.03-SP4 to synthesize the RTL of the designs into a netlist based on the STMicroelectronics CMOS 65 nm CORE65LPLVT_1.20V and IBM CMOS 130 nm CMR8SF-LPVT Process SAGE v2.0 standard cell libraries, both at a typical voltage of 1.2 V and a temperature of \(25^{\circ }\)C. Cadence SoC Encounter v09.12-s159_1 is used to finish the Place and Route phase in order to generate the layout of the designs. We use Mentor Graphics ModelSim SE 10.1a to conduct functional simulation of the designs and to perform timing simulation using the timing delay information generated from SoC Encounter. The areas of the designs after the logic synthesis are provided for comparisons with previous ciphers, and a more accurate area after the Place and Route is also provided for using the ciphers in practical cases. The densities used for the Place and Route phase for CMOS 130 nm and 65 nm are 0.92 and 0.93 respectively, in order to make a trade-off between area and maximum operating frequency when the densities are high enough. As usual, the area is measured in gate equivalents (GEs), and one GE is equivalent to the physical area required for the two-input one-output NAND gate with the lowest driving strength of the corresponding technology.

We use SoC Encounter v09.12-s159_1 to generate the accurate power consumption based on the activity information generated from the timing simulation with a frequency of 100 KHz and a duration of 0.1 s. We do so because the 100 KHz clock frequency is widely used for benchmarking purposes in resource-constrained applications and 0.1 s is long enough to provide accurate activity information for all the signals.

Moreover, the critical path is obtained after the Place and Route phase, which is more accurate than the estimated value obtained from logic synthesis. From it, we obtain the maximum clock frequency at which a specific design can operate.

Table 2. The Areas of Basic Gates in the Libraries

In fact, during the analysis of the previous results [3, 11, 24, 27, 28], we found that the ASIC results for various implementations differ not only in the basic gate technology but also in the types of flip-flops used. In order to compare our results fairly with the previous ones, we provide in Table 2 the areas of some basic gates in our specific libraries and in the library used in [3] by the researchers from the NSA for Simon. In addition, all the areas of basic gates provided here are the smallest ones in the library. We observe that, in terms of the areas of the basic gates, our IBM 130 nm library is almost the same as the IBM 130 nm library used by the researchers from the NSA [3], except for the scan flip-flops.

3.2 Two Different Hardware Architectures for Simeck

In this section, we target low-area implementations of Simeck and make a trade-off between area and throughput. Meanwhile, we still keep a very high operating frequency. We give two architectures for the implementations: one is a parallel architecture, and the other is a fully serialized architecture. Moreover, we provide a block diagram of the top-level I/O interface between the cipher and the outside environment in order to provide a benchmark for future implementations and comparisons with other ciphers.

Fig. 3. Parallel Architecture for Simeck

Parallel Architecture. The parallel architecture processes one round of the message in one clock cycle, and one round of the key schedule in the same clock cycle, as shown in Fig. 3. This architecture provides a very high throughput while keeping a compact design. The round function in Fig. 3(a) includes three parts: 2n flip-flops, an n-bit-wide 2-to-1 multiplexer, and a combinational circuit (dashed box) that computes the feedback data for the multiplexer. Of the 2n flip-flops, n are for the message block b, and the other n are for the message block a. The multiplexer selects either the initial plaintext or the feedback data from the combinational circuit for the message block b. The combinational circuit includes one n-bit AND gate, three n-bit XOR gates, and two shift modules (cyclic shift to the left by 5 bits and 1 bit). The shift modules cost no extra hardware resources, because they can be done by rewiring the corresponding signals. When the cipher runs, the n-bit data from the message block b shifts to message block a, and simultaneously, the message block b loads new n-bit data from the multiplexer until the cipher stops. The round key \(k_i\) in the combinational circuit for every round comes from the key schedule function, which generates a key for every round until the cipher outputs the ciphertext.

Different from the round function architecture, the key schedule in Fig. 3(b) has four n-bit key blocks and one input to the combinational circuit (dashed box) is different. This n-bit input to the key schedule is a combination of an \((n-1)\)-bit constant and a 1-bit signal generated from the control circuit.

All the flip-flops in the round function and key schedule are standard flip-flops without chip-enable in our architecture. In addition, there are only two n-bit-wide 2-to-1 multiplexers in total in our architecture to select the initial data or feedback data, where one is for the round function, and the other is for the key schedule. Moreover, the latency for generating a ciphertext using our parallel architecture is \(T + 4\) clock cycles, where T is the total number of rounds.

Partially Serialized Architecture. In order to make a trade-off between area, throughput, and power consumption, we provide a partially serialized architecture. This architecture processes only several bits in the round function and the key schedule during one clock cycle. The supported partially serialized sizes (par_sz) of Simeck are summarized as follows:

$$\begin{aligned} {\textsf {Simeck32/64}}&: 1, 2, 4, 8, \\ {\textsf {Simeck48/96}}&: 1, 2, 3, 4, 6, 8, 12, \\ {\textsf {Simeck64/128}}&: 1, 2, 4, 8, 16. \end{aligned}$$

Fig. 4. Fully Serialized Architecture for Simeck

Besides the round counter (i in Figs. 3 and 4) in the control circuit, there is another counter to control the rounds of the specific serialized size in the partially serialized architecture. The range of this serialized counter (l in Fig. 4) is between 0 and n / par_sz - 1. In total, the latency for generating a ciphertext is (n / par_sz) \(\cdot \) (\(T + 4\)), where T is the total number of rounds.
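The latency formula translates directly into a throughput estimate at a given clock frequency. The short Python sketch below evaluates it for the benchmark frequency of 100 KHz; the function names and the `ROUNDS` table are illustrative, and the throughput is taken to be the block size 2n divided by the latency in seconds, which is an assumption about how the reported figures are derived.

```python
ROUNDS = {16: 32, 24: 36, 32: 44}  # word size n -> number of rounds T


def latency_cycles(n, par_sz):
    """(n / par_sz) * (T + 4) clock cycles per ciphertext."""
    if n % par_sz:
        raise ValueError("par_sz must divide n")
    return (n // par_sz) * (ROUNDS[n] + 4)


def throughput_bps(n, par_sz, freq_hz=100_000):
    """Bits of ciphertext produced per second for a 2n-bit block."""
    return 2 * n * freq_hz / latency_cycles(n, par_sz)


if __name__ == "__main__":
    for par_sz in (1, 2, 4, 8, 16):
        cycles = latency_cycles(16, par_sz)
        kbps = throughput_bps(16, par_sz) / 1000
        print(f"Simeck32/64 par_sz={par_sz}: {cycles} cycles, {kbps:.1f} Kbps")
```

Under these assumptions, Simeck32/64 with par_sz = 1 needs 576 cycles per block, that is, about 5.6 Kbps at 100 KHz, which is above the 3.1 Kbps target stated in the design goals, and par_sz = 2 doubles this to about 11.1 Kbps.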

A fully serialized architecture is shown in Fig. 4. In this architecture, the multiplexer (MUX) and the combinational circuit (dashed box) are all 1 bit wide, which saves a lot of area. Compared to the parallel architecture, there are two more multiplexers. They are used to select the cyclic shift inputs. The MUX1 is used for the left shift by 1 bit, and MUX5 is used for the left shift by 5 bits. The MUX1 selects \(b_{n-1}\) as input when the serialized counter equals 0, and chooses \(a_{n-1}\) when the serialized counter is larger than 0. Similarly, the MUX5 selects \(b_{n-5}\) when the serialized counter is smaller than or equal to 4, and chooses \(a_{n-5}\) when the serialized counter is larger than 4.

The partially serialized architecture with par_sz larger than 1 is similar to the fully serialized architecture, where the multiplexer and combinational circuit are par_sz-bit width and the selection signals for the multiplexers (MUXes selection circuitry) are different for various values of par_sz.
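To illustrate the serialization idea at the functional level (not the register and MUX wiring of Fig. 4), the following Python sketch evaluates f in n / par_sz steps, producing par_sz output bits per step. The function names and the convention that bit 0 is the least significant bit are assumptions made for illustration. The modular index wraps around only for the lowest bit (shift by 1) and the lowest five bits (shift by 5), which appears to correspond to the cases in which MUX1 and MUX5 switch their source.

```python
def f_parallel(x, n):
    """Word-level f(x) = (x AND (x <<< 5)) XOR (x <<< 1)."""
    mask = (1 << n) - 1
    rot5 = ((x << 5) | (x >> (n - 5))) & mask
    rot1 = ((x << 1) | (x >> (n - 1))) & mask
    return (x & rot5) ^ rot1


def f_serial(x, n, par_sz=1):
    """Compute f(x) par_sz bits per step; returns (result, steps)."""
    if n % par_sz:
        raise ValueError("par_sz must divide n")
    out, steps = 0, 0
    for l in range(n // par_sz):              # serialized counter
        for i in range(l * par_sz, (l + 1) * par_sz):
            a = (x >> i) & 1                  # bit i of x
            b = (x >> ((i - 5) % n)) & 1      # bit i of x <<< 5
            c = (x >> ((i - 1) % n)) & 1      # bit i of x <<< 1
            out |= ((a & b) ^ c) << i
        steps += 1
    return out, steps
```

With par_sz = n the loop degenerates to a single step, mirroring the remark that the parallel architecture is a special case of the partially serialized one.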

Fig. 5. The Top-level I/O Interface between the Cipher and the Outside Environment

The Top-Level I/O Interface for Different Architectures. As discussed in Sect. 3.1, the area of the chip depends not only on the area of the basic gates, but also on the adopted types of flip-flops. We provide a top-level I/O interface between the cipher and the outside environment as shown in Fig. 5. We do not use a Finite State Machine (FSM) to control the circuit, in order to reduce the entire area as much as possible. In our top-level architecture, the cipher is always running and it is controlled by the outside signal i_mode. Therefore, we only have two modes in our architecture: loading phase and running phase. The cipher goes into the loading phase when \(\mathtt{i\_mode }\) equals 0, and it loads the initial data from the inputs \(\mathtt{Key }\) and \(\mathtt{Plaintext }\). Later on, the cipher begins the running phase when \(\mathtt{i\_mode }\) equals 1. The user obtains the \(\mathtt{Ciphertext }\) at the end of the running phase. Then, \(\mathtt{i\_mode }\) returns to 0 and the encryption of another \(\mathtt{Plaintext }\) begins. As our architecture never stops, all the flip-flops in the datapath are standard flip-flops without chip-enable signals. This property makes our design even smaller in terms of area. This architecture presents a benchmark ASIC implementation of Simeck and can be used for a fair comparison with the hardware results of other ciphers.

It is worth mentioning that the parallel architecture can be viewed as a special case of the partially serialized case when par_sz equals n. However, the two cases have different architectures as depicted in Figs. 3 and 4.

Our top-level architecture includes two parts: the control circuit and the datapath. The control circuit for the parallel architecture is used to provide the key constant from the LFSR as described in Sect. 2. However, an extra serialized counter in the control circuit is needed for the partially serialized architecture. The datapath includes the round function and the key schedule, as described above for the parallel architecture and the partially serialized architecture.

Recently, LFSR or NLFSR based counters have been used to replace binary counters in the control circuit in hardware implementations [20], because they only contain flip-flops and some combinational feedback logic without using a full-adder. Hence, this can reduce the area to some extent if the LFSR or NLFSR counter does not incur extra area in the datapath. However, the serialized counter in our partially serialized architecture is used in two ways: to count the serialized rounds in the control circuit, and to select the inputs of the two multiplexers (MUX1 and MUX5) in the datapath. After a theoretical and practical analysis of the effects of the LFSR or NLFSR counter in our partially serialized architecture, we found that the total area using a binary serialized counter is the smallest, because the LFSR or NLFSR counter results in more additional area in the datapath (i.e., the area of the multiplexer selection circuitry) than the area saved by replacing the binary counter with an LFSR or NLFSR counter in the control circuit. Therefore, the binary serialized counter is used for our partially serialized architecture.

3.3 Hardware Evaluations of Simeck

We use three different compilation techniques in the Design Compiler to perform hardware optimizations: simple compile, compile ultra and compile ultra with clock gating. The simple compile option provides the hierarchical architectures of the design and the areas of specific sub-modules. The compile ultra option performs deeper optimizations by optimizing the entire module together, thereby reducing the area and power consumption significantly [11, 20]. The clock gating technique can further reduce the area and power consumption [11]. However, we use all standard flip-flops without chip-enable signals for the parallel architecture. Only the LFSR generating the key constant in the control circuit uses flip-flops with chip-enable signals, which costs 5, 6, and 6 flip-flops for Simeck32/64, Simeck48/96, and Simeck64/128 respectively. Therefore, the clock gating optimization has only a small effect on our results in terms of area and power consumption. The ASIC implementation results of Simeck and Simon in CMOS 130 nm are shown in Tables 3 and 4, and the corresponding results of Simeck and Simon in CMOS 65 nm are shown in Tables 7 and 8. It is worth noting that these results are obtained without using scan registers.

Table 3. Our Implementation Results of Simeck32/64, 48/96, 64/128 in 130 nm
Table 4. Our Implementation Results of Simon32/64, 48/96, 64/128 in 130 nm

We provide the best area results before and after the Place and Route phase using compile ultra or compile ultra plus clock gating. These results can be used for comparison with other ciphers or for practical purposes. The maximum frequency corresponding to the best optimization technique is given, and it is calculated from the critical path. The calculated throughput is based on the latency in our architectures and it is the same as for Simon. The difference in total power consumption among the three different optimizations is marginal. Therefore, we only provide the total power consumption using compile ultra at 100 KHz, which is typical for benchmarking purposes. Since the operating frequency is so low, the static power consumption dominates the total power consumption. Moreover, the static power consumption is larger in CMOS 65 nm than in CMOS 130 nm, which is the reason why the total power consumption is larger in CMOS 65 nm, as shown in Tables 7 and 8.

Besides the very small area, another observation is that most of the area of all the architectures consists of sequential logic, especially for the fully serialized architecture. Take Simeck32/64 for example: 86 %, 85 %, 82 %, 76 %, and 70 % of the entire area is sequential logic for par_sz equal to 1, 2, 4, 8, and 16 respectively. From the data provided, we can conclude that the fully serialized architecture consists of about 90 % sequential logic. Similar conclusions can be obtained for Simeck48/96 and Simeck64/128.

We provide a range of options between the area, throughput, and power consumption in our ASIC implementations. Taking Simeck32/64 in CMOS 130 nm for illustration, we can achieve a throughput of 5.6 Kbps at the area cost of 505 GEs (before the Place and Route) and 549 GEs (after the Place and Route) with the power consumption of 0.417 \(\mu \)W. However, a two-fold throughput (11.1 Kbps) can be obtained with only 5 and 6 extra GEs (before and after the Place and Route respectively), and 0.014 \(\mu \)W extra power consumption. With more extra area and power consumption, we can get even higher throughput.

4 Result Comparisons Between Simeck and Simon

We compare our area results before the Place and Route of Simeck and Simon in CMOS 130 nm with the Simon results of the NSA researchers [3]. This is because the NSA researchers only provide the area results before the Place and Route. The comparison is shown in Fig. 6. We can observe that all our Simon results are smaller than the NSA’s results, and our Simeck results are even smaller than our Simon results for all the cases shown in Fig. 6.

Fig. 6. Comparisons of Areas (before the Place and Route) between the Implementation Results of the NSA Researchers and Ours in CMOS 130 nm

From the theoretical point of view, Simeck is designed to have a smaller area due to the following considerations: the simplified key schedule, the simplified LFSR to generate the key constant, and the decreased shift numbers in the round function. It is worth noting that the decreased shift numbers do not affect the area of the parallel architecture; they only affect the area of the partially serialized architecture.

The constructions of the combinational circuits in the key schedules of Simon32/64, 48/96, 64/128 and Simeck32/64, 48/96, 64/128 in the parallel architecture are shown as follows:

figure a

In general, one XOR gate is larger than one AND gate. Therefore, the key schedule of Simon is larger than that of Simeck. The LFSRs used to generate the key constants for Simon32/64 and Simon48/96 are defined by the primitive polynomial \(X^5 + X^4 + X^2 + X + 1\), and the LFSR for Simon64/128 is defined by \(X^5 +X^3 + X^2 + X +1\). They are all 2 XOR gates (4 GEs) bigger than the ones used in the corresponding Simeck instances, as described in Sect. 2. The decreased shift numbers of the round function and key schedule remove 1 MUX from the inputs to the combinational circuits of the round function and the key schedule respectively (2 MUXes in total, \(2 \cdot 2.25\) GEs/MUX = 4.5 GEs), as well as some logic to select the MUXes.

Table 5. Breakdown of the Implementation Results before the Place and Route in CMOS 130 nm

From the practical point of view, we break down the area results before the Place and Route in CMOS 130 nm for Simeck32/64, and Simon32/64 in our implementations, as shown in Table 5. For parallel architectures, the differences of the control circuits and the key combinational circuits between Simeck32/64 and Simon32/64 are 4 GEs (key constant) and 16 GEs respectively. The results are almost the same as the theoretical analysis. For the fully serialized architecture, the control circuit is reduced by 4 GEs (key constant), the key combinational circuit (dashed box in Fig. 4) is reduced by 3 GEs, and the 2 MUXes plus the MUXes selection circuitry are reduced by 9 GEs for Simeck32/64 (i.e., a total saving of 16 GEs), compared to that of Simon32/64. Therefore, the practical results match the theoretical analysis. Simeck is smaller than Simon for both parallel architecture and partially serialized architecture.

The main area cost for Simon comes from the registers storing the message block and the key. In order to design a smaller cipher than Simon, we can only reduce the areas of the round function, key schedule, key constant, and multiplexers. For the fully serialized architecture of Simon32/64 (see Table 5), the combined area of these blocks is 34.5 GEs (7 + 8 + 6 + 6 \(\cdot \) 2.25/MUX), which accounts for only about 6.5 % (34.5/533) of the total area. Simeck32/64 reduces this by 16 GEs, a saving of more than 46 %. This reduction leads to a 2.3 % smaller total area in comparison to our implementations of Simon32/64 in CMOS 130 nm, and 3.4 % smaller in comparison to the original Simon32/64 results (see Table 1). Similarly, the fully serialized architectures of Simeck48/96 and Simeck64/128 are 2.5 % and 2.1 % smaller, respectively, than our implementations of Simon48/96 and Simon64/128, and 3.3 % and 3.5 % smaller, respectively, than the original implementation results of Simon48/96 and Simon64/128 in CMOS 130 nm (see Table 1). For the parallel architectures of Simon, these blocks consume a larger fraction (about 29 %) of the total area (see Table 5). Simeck32/64, 48/96, 64/128 achieve savings of 3.7 %, 3.3 %, and 3.7 % respectively, compared to the original results of Simon32/64, 48/96, 64/128 (see Tables 3 and 4). The choice of the values of the shift numbers plays a significant role in the area reduction of the partially serialized architecture. Because the parallel architecture does not contain the MUXes for the inputs to the combinational circuit (dashed box), its total area reduction is only slightly greater than that of the fully serialized architecture.

From Tables 3 and 4, we can also observe that the power consumption of Simeck is smaller than Simon for all the cases in CMOS 130 nm using the same optimizations. This is easy to understand because the area of Simeck is smaller than Simon. This conclusion also holds for CMOS 65 nm in Tables 7 and 8.

In summary, Simeck is smaller than Simon in terms of area and power consumption in both CMOS 130 nm and CMOS 65 nm technologies.

5 Security Analysis

In this section, we give the security analysis of the Simeck family of block ciphers. Due to its similitude with Simon and Speck, most of the next analysis follow from the best known attacks against the Simon and Speck families of block ciphers. As we show in the following, the security level of Simeck is comparable to those of Simon, which is reasonable to be used in practice. Indeed, the number of rounds chosen for Simeck is sufficiently high with respect to the best known attacks on reduced versions. Moreover, it is worth noticing that the ARX (Addition-Rotation-XOR) design of Simeck borrowed from Speck, using the round function as key-schedule, did not lead to a weakness so far. In a recent paper [22], Kölbl et al. study the influence of the shifts in Simon-like ciphers. They provide some set of parameters that are optimal with respect to differential and linear properties, and diffusion. Our parameters seem comparable to theirs because we take also into account hardware efficiency and other types of cryptanalysis (e.g., impossible differential cryptanalysis).

Differential/Linear Attacks [6, 23]. Since the differential and linear behaviors of Simon and Simeck are very closely related, it makes sense to use the best known differential and linear attacks on Simon to evaluate the security of Simeck against these attacks. We have therefore essentially followed the procedure of [7] to evaluate the security of Simeck against differential cryptanalysis. It is then possible to perform an attack on 19 rounds of Simeck32/64 with time and data complexities of \(2^{34}\) and \(2^{31.5}\), respectively. It is also possible to attack 20 rounds out of 36 of Simeck48/96 with time and data complexities of \(2^{75}\) and \(2^{46}\), and 26 rounds out of 44 of Simeck64/128 with time and data complexities of \(2^{121}\) and \(2^{63}\).

For the best cryptanalytic result using linear attacks against Simon, we refer to [1]. Because of the similar structure of Simeck, we verified that those results also hold for Simeck. For Simeck32/64, we can cover 12 rounds with a data complexity of \(2^{31}\). For Simeck48/96, we can cover 15 rounds with a data complexity of \(2^{43}\). Finally, it is possible to perform a linear cryptanalysis of Simeck64/128 up to 19 rounds with \(2^{63}\) known plaintexts. All these attacks have a success probability of 0.997.

Table 6. Comparison of Impossible Differential Attacks against Simon and Simeck

Since the best known differential and linear trails found for Simeck and Simon cover only a reduced number of rounds, we believe that the full-round Simeck (any version) is sufficiently secure against differential and linear cryptanalysis.

Impossible Differential Attacks [4]. Impossible differential attacks against Simeck cover a few more rounds (depending on the version) than those against Simon, as can be seen in Table 6. This is due to the fact that the diffusion of a one-bit difference is one round slower for Simeck than for Simon. Nevertheless, this does not damage the overall security of the Simeck family, since the full versions have many more rounds than these attacks cover.
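The one-round gap in diffusion can be illustrated with a simple dependency-propagation model. The sketch below is not the analysis used in the paper: it assumes the shift amounts (1, 8, 2) for Simon and (0, 5, 1) for Simeck from the round-function definitions, treats every output bit of the round function as depending on each input bit that one of the three rotations moves into its position, and counts the rounds until a single input bit may influence all state bits. The function and constant names are ours.

```python
def rounds_to_full_diffusion(n, shifts, start_in_left):
    """Count Feistel rounds until one input bit may influence all 2n state bits.

    Simplified model (an assumption, not the paper's analysis): a bit of
    the round function output is treated as depending on every input bit
    that any of the three left rotations moves into its position.
    """
    full = set(range(n))
    # By rotational symmetry it is enough to start from bit position 0.
    left = {0} if start_in_left else set()
    right = set() if start_in_left else {0}
    rounds = 0
    while left != full or right != full:
        # Positions reached by rotating the active bits of the left word.
        spread = {(i + s) % n for i in left for s in shifts}
        # Feistel step: new left = right XOR f(left), new right = old left.
        left, right = right | spread, left
        rounds += 1
    return rounds


def worst_case_rounds(n, shifts):
    """Worst case over placing the starting bit in the left or right word."""
    return max(rounds_to_full_diffusion(n, shifts, side)
               for side in (True, False))


SIMON_SHIFTS = (1, 8, 2)    # assumed: f(x) = (x<<<1 & x<<<8) ^ x<<<2
SIMECK_SHIFTS = (0, 5, 1)   # assumed: f(x) = (x & x<<<5) ^ x<<<1

for n in (16, 24, 32):
    print(n, worst_case_rounds(n, SIMON_SHIFTS),
          worst_case_rounds(n, SIMECK_SHIFTS))
```

For the 16-bit word size of the 32-bit block versions, this model gives 7 rounds for the Simon shifts and 8 rounds for the Simeck shifts, which is consistent with the one-round difference stated above.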

Algebraic Degree [21]. We computed that after 5 rounds, the algebraic degree of Simeck (any version) is 13, the same as that of Simon. This is sufficient to ensure that, after a few more rounds, no attack can exploit properties of the algebraic degree, such as algebraic attacks or higher-order differential attacks.

Meet-in-the-Middle Attacks [13]. Because of the key schedule algorithm of Simeck, many bits of the master key are processed quickly in the round function of Simeck. This should ensure good resistance of Simeck against Meet-in-the-Middle (MITM) attacks. Moreover, until now Simon has not been shown to be a good candidate for MITM attacks. As the round function of Simeck is very similar to that of Simon, we believe that Simeck will also be resistant against MITM attacks.

Slide Attacks and Rotational Attacks [8, 19]. The round constant addition and the key schedule design prevent any efficient slide or rotational attacks.

Related-key Differential Attacks [5]. Although Simon and Speck have been extensively studied in the past years, no concrete attacks in the related-key setting have been shown. Like Speck, Simeck reuses its round function in the key schedule. It is therefore reasonable to think that Simeck also has good cryptographic properties in the related-key model.
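To make the reuse of the round function in the key schedule concrete, the following minimal sketch shows the idea for a 16-bit word size. It assumes the round function \(f(x) = (x \wedge (x \lll 5)) \oplus (x \lll 1)\) and a four-word key state (t2, t1, t0, k0) as in the specification of Simeck32/64. The constant bits `z_bits` are a caller-supplied placeholder and are not the LFSR sequence specified for Simeck, so the sketch does not reproduce official test vectors. All function names are ours.

```python
N = 16                      # word size, as in Simeck32/64
MASK = (1 << N) - 1


def rotl(x, r):
    """Rotate an N-bit word left by r positions (0 < r < N)."""
    return ((x << r) | (x >> (N - r))) & MASK


def f(x):
    """Nonlinear function assumed here: (x AND (x <<< 5)) XOR (x <<< 1)."""
    return (x & rotl(x, 5)) ^ rotl(x, 1)


def feistel_round(left, right, k):
    """One round; the same update serves the data path and the key path."""
    return right ^ f(left) ^ k, left


def expand_key(master_key, z_bits):
    """Derive one round key per entry of z_bits.

    master_key is (t2, t1, t0, k0). z_bits stands in for the LFSR-generated
    constant bits; the values used below are placeholders only.
    """
    t2, t1, t0, k = master_key
    c = MASK ^ 3            # 2^N - 4; the low bit is then taken from z_bits
    round_keys = []
    for z in z_bits:
        round_keys.append(k)
        # The round function is reused here, keyed by the round constant.
        new_t, k = feistel_round(t0, k, c ^ z)
        t0, t1, t2 = t1, t2, new_t
    return round_keys


def encrypt(block, round_keys):
    left, right = block
    for k in round_keys:
        left, right = feistel_round(left, right, k)
    return left, right


def decrypt(block, round_keys):
    left, right = block
    for k in reversed(round_keys):
        # Inverse of feistel_round.
        left, right = right, left ^ f(right) ^ k
    return left, right


placeholder_z = [1, 0] * 16                       # 32 arbitrary constant bits
rks = expand_key((0x0123, 0x4567, 0x89AB, 0xCDEF), placeholder_z)
pt = (0x1234, 0x5678)
ct = encrypt(pt, rks)
print([hex(w) for w in ct], decrypt(ct, rks) == pt)
```

Because `expand_key` calls the same `feistel_round` as `encrypt`, a hardware implementation can share the structure of the combinational logic between the two paths, which is the design choice the paragraph refers to.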

Table 7. Our Implementation Results of Simeck32/64, 48/96, 64/128 in 65 nm

6 Concluding Remarks

In this paper, we have presented Simeck, a new family of lightweight block ciphers. Simeck is very suitable for resource-constrained devices, such as passive RFID tags and wireless sensor networks. We have provided an extensive exploration of different hardware architectures in order to strike a balance between area, throughput, and power consumption for Simon and Simeck in both CMOS 130 nm and CMOS 65 nm technologies. We have shown that it is possible to design a smaller cipher than Simon in terms of area and power consumption. Moreover, we have improved the hardware implementations of Simon given in the original paper. In addition, the similarities between Simon/Speck and Simeck give us an idea of the actual security offered by Simeck. Even though the round function of Simeck is quite simple, it is iterated a sufficient number of times to provide adequate security against most known attacks. In conclusion, all of the instances in the Simeck family can meet the area, power consumption, and throughput requirements of passive RFID tags, and they are promising candidates for resource-constrained devices.

Table 8. Our Implementation Results of Simon32/64, 48/96, 64/128 in 65 nm

We have learnt many techniques for designing hardware-oriented ciphers in the process of completing the design of Simeck. It would be interesting to see whether we can devise a block cipher with an even smaller hardware footprint than Simeck. We are also interested in whether it is possible to design, from a theoretical point of view, the smallest possible block cipher, with the minimum number of components. This should help cryptography researchers gain deeper insights into designing and analyzing ciphers.