1 Introduction

Compression is the process of encoding information using fewer bits than an uncoded representation would require. A compression algorithm decreases the redundancy of the data, which in turn reduces the storage space required. Data compression also offers an important way to reduce transmission costs by using the existing bandwidth efficiently[1]. In the past decade, there has been an increase in the amount of data transmitted via the internet, representing text, images, speech, video, sound and computer data. Hence, there is a need for efficient compression algorithms that can be used effectively within the existing network bandwidth.

Security is an important factor in our digital life. Encryption is the art of achieving security by encoding messages so that they cannot be read by an intruder. To demonstrate the theoretical feasibility of such operations, algorithms that compress encrypted data have been implemented[24].

There is a need for compression combined with encryption. In most cases, compression followed by encryption is used. WinZip and WinRAR[5] are two packages which encrypt the data using the advanced encryption standard (AES) algorithm after compression. Due to improper injection of cryptography into compression, several attacks are possible[6, 7].

Another drawback is that text processing then has to be performed on compressed data[8, 9]. Finding keywords in compressed data using compressed pattern matching methods[10, 11] is one example of such processing.

An alternative approach is to unify compression and encryption to secure the data. Several compression algorithms support this, such as interval splitting in arithmetic coding[8, 12], multiple code trees in Huffman coding[13], encrypted dictionaries in dictionary based compression[1, 14], the adaptive character word length (ACW) algorithm[15], the differential method[16], dynamic Markov compression (DMC), prediction by partial matching (PPM), and the Burrows-Wheeler transform (BWT)[17-19]. However, these algorithms do not achieve a good compression ratio. Hence, there is a need for a better algorithm.

A preprocessing step can be applied to a source file prior to an existing compression algorithm. This transformation is designed to make the source data easier to compress[20]. Dictionary based encoding is generally used for this type of preprocessing transformation of the source text[21]. The dictionary is created from commonly used words expected in the input files[10, 22, 23]. It must be prepared in advance and must be known to both the source and the destination. In this technique, each word in the dictionary is replaced by an ASCII character[24-26]. Dictionary based encoding by itself is weak and vulnerable to attacks, but dictionary based encryption provides the required security.

The idea of text compression is to preprocess the input text and transform it into an intermediate form that can then be compressed and secured with better efficiency. The purpose of the proposed technique is to reduce the consumption of expensive resources and to reduce the data length. In this paper, a new hybrid technique for securing data is introduced. To attain better compression, a new compression and encryption cum compression algorithm is developed. In this approach, a new dictionary based compression algorithm, called the multi-dictionary compression algorithm, is applied to the source text, improving on the existing algorithms, and a further encryption cum compression step (reduced array based encryption) offers a sufficient level of security for the transmitted information. This method is therefore called new compression and encryption cum compression (CEC). The multi-dictionary based encryption provides the required security.

The four main steps of this technique are multi-dictionary based compression, BWT with run length encoding (RLE) based compression, the reduced array based encryption algorithm, and Huffman coding based compression. Words are extracted from the input files and formed into dictionary entries, and each word in the multi-dictionary is replaced by an ASCII character code. This multi-dictionary method increases the speed of encoding and decoding, because word retrieval is easy and fast. The multi-dictionary based compressed result is further compressed using BWT with RLE. After this step, the compressed data is encrypted using the reduced array based encryption algorithm, which performs both encryption and compression.

In a general encryption algorithm, plain text is converted into cipher text of the same or increased size. This algorithm, however, not only performs the encryption but also reduces the input array size by a factor of three. Two keys are generated. The primary key is generated from three values specified by the user. The secondary key is the TA-key, which represents the distinct symbols occurring in the data. The encrypted output is then further compressed with Huffman coding. This CEC technique combines the compression and encryption processes, and thus develops a better transformation yielding greater compression and added security. The reverse operation is performed on the receiver side, as shown in Fig. 1. The rest of the paper is organized as follows. Section 2 presents the multi-dictionary making algorithm. Section 3 presents the encoding algorithm. Section 4 presents the decoding algorithm. Section 5 provides a performance analysis. Section 6 concludes the paper.

Fig. 1 Block diagram of CEC compression

2 Multi-dictionary generation

The words are extracted from the input files and formed into dictionary entries. If the leading letter of a word is in upper case, it is changed into lower case, and the frequency of occurrence of each word is calculated. An individual dictionary is created for each starting letter, and all ASCII characters are categorized into a separate dictionary. This multi-dictionary method increases the speed of encoding and decoding, because word retrieval is easy and fast. ASCII characters (33-255) are used as codes: an ASCII character is assigned as the code for every word. Table 1 shows the codeword formation.

In Table 1, 170 single ASCII characters are assigned as codes for the first 170 words. For the next 170 words, the same 170 ASCII characters with a prefix of the character "a" are employed, so the code becomes a two-character code. The words (341-4250) take the combination of a prefix from (b, ⋯, z) and one of the 170 single ASCII characters. For the next 170 words (4251-4421), the letter "A" combines with the ASCII characters to form codes, and the words (4422-8840) take the combination of (B, ⋯, Z) and the single 170 ASCII characters. In general, N × 170 words are assigned two-character codes, where N = 52 is the number of alphabetic characters (a, ⋯, z) and (A, ⋯, Z).

Further words (8841-13261) take the combination of the 170 × N codes with a prefix of (a-z), and thus become three-character codes. For example, the first letter is "A", the second letter is from (A-Z), and the third character is one of the 170 ASCII codes; the other words in the dictionary form codes similarly, so that M × N × 170 words are assigned three-character codes. The combinations (a-z)(A-Z), (A-Z)(a-z), (A-Z)(A-Z) and (a-z)(a-z) with the 170 single ASCII characters create 459 680 codes.

Table 1 Codeword formation

Short codes are assigned to the most frequently used words, and longer codes are assigned to the less frequently used words[25].
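To make the codeword formation concrete, the mapping from a word's frequency rank to its code can be sketched as follows. This Python sketch is illustrative only: the paper does not specify which 170 of the ASCII characters 33-255 form the single-character code set, so the BASE range below is an assumption.

# Hypothetical sketch of the multi-dictionary codeword assignment of Table 1.
# Which 170 ASCII characters serve as base codes is an assumption here.
BASE = [chr(c) for c in range(86, 256)]                   # 170 assumed base codes
PREFIX = [chr(c) for c in range(ord('a'), ord('z') + 1)] + \
         [chr(c) for c in range(ord('A'), ord('Z') + 1)]  # 52 prefix letters

def codeword(rank):
    # Return the code string for the word of frequency rank `rank` (0-based).
    if rank < 170:                        # most frequent 170 words: 1-char codes
        return BASE[rank]
    rank -= 170
    if rank < 52 * 170:                   # next 52 x 170 words: 2-char codes
        return PREFIX[rank // 170] + BASE[rank % 170]
    rank -= 52 * 170                      # remaining words: 3-char codes
    p1, rest = divmod(rank, 52 * 170)     # 52 x 52 x 170 = 459 680 codes
    return PREFIX[p1] + PREFIX[rest // 170] + BASE[rest % 170]

During encoding, each dictionary word would be replaced by codeword(rank); decoding reverses the same table lookup.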

3 Encoding algorithm

3.1 Burrows-Wheeler transform and run length encoding

The input files are converted into code (ASCII characters) using multi-dictionary based compression. This code is given as the input to the BWT. Most compression methods operate in streaming mode, where one or several bytes at a time are processed until the end of the file is sensed. The Burrows-Wheeler method instead works in block mode: the input stream is read block by block, and each block is encoded separately as one string. This method is also referred to as block sorting. The BWT method is general purpose; it works well on images, sound and text, and it can achieve very high compression ratios.
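For illustration, a naive BWT can be written in a few lines. This quadratic-time sketch appends an assumed sentinel character and is far slower than the suffix-sorting implementations used in practice:

def bwt(s, eos='\0'):
    # Naive Burrows-Wheeler transform: sort all rotations of s + sentinel
    # and return the last column. O(n^2 log n); for illustration only.
    s = s + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(rot[-1] for rot in rotations)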

The BWT output is given as the input to RLE. After reading the first character, the count is 1 and the character is saved. Subsequent characters are compared with the one already saved. If they are identical to it, the repeat count is incremented. When a different character is read, the operation depends on the value of the repeat count. If it is small, the saved characters are written to the compressed file and the newly read character is saved. Otherwise, a "@" is written, followed by the repeat count and the saved character. A run of three characters results in no compression, so only runs longer than three characters are compressed[27].
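A minimal sketch of this RLE variant is given below. The paper does not specify how the repeat count is stored, so writing it as a decimal string after the "@" marker is an assumption:

def rle_encode(data, escape='@', min_run=4):
    # Replace runs longer than 3 identical characters by: escape, count, char.
    # Shorter runs are copied through unchanged, as described above.
    out, i = [], 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        run = j - i
        out.append(f'{escape}{run}{data[i]}' if run >= min_run else data[i] * run)
        i = j
    return ''.join(out)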

3.2 The proposed algorithm

The output of the run length coding is given as the input to the encryption with compression algorithm. In general encryption algorithms, plain text is converted into cipher text of the same or increased size. This algorithm not only performs the encryption but also reduces the input array size by a factor of three. Two keys are generated: a primary key and a secondary key. The primary key is built from three values, namely a starting value, a maximum value and a factor. During encryption, the secondary key is generated from the characters occurring in the input string. The encrypted output is reduced to 1/3 of the input array size. Using the primary key and the secondary key, the original input can be retrieved.

3.2.1 Key generation

Algorithm 1. Key generator

Input: Start value, MAX_VALUE, factor

Output: Stream of keys

Key size = 3
KEY(1) ← Start value
for i = 2, ⋯, Key size do
  KEY(i) ← KEY(i − 1) × MAX_VALUE + factor
end for

Let the primary key be denoted by KEY(i), where i = 1, 2 and 3. KEY(1) is the start value of the key. MAX_VALUE is the maximum value used to generate KEY(2), KEY(3), etc. The factor takes an integer value of 2, 3 or 4. KEY(i) is generated by multiplying the maximum value (MAX_VALUE) by KEY(i − 1), for i = 2 and 3, and then adding the factor.
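A direct transcription of Algorithm 1 into Python might look as follows. As a check, the example primary key [38, 2436, 155908] used in Section 4 is consistent with Start value = 38, MAX_VALUE = 64 and factor = 4, since 38 × 64 + 4 = 2436 and 2436 × 64 + 4 = 155908.

def generate_primary_key(start, max_value, factor, size=3):
    # KEY(1) = start; KEY(i) = KEY(i-1) * MAX_VALUE + factor, per Algorithm 1.
    key = [start]
    for _ in range(size - 1):
        key.append(key[-1] * max_value + factor)
    return key

# generate_primary_key(38, 64, 4) -> [38, 2436, 155908]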

3.2.2 Reduced array based encryption algorithm

Algorithm 2. Encryption coding for a stream of data based on the minimized array algorithm

Input: Key1 (primary key), RLE output X

Output: TA-key, encrypted data

% Compute TA-key of symbols
S_AC ← number of characters in X
for jAC = 1, ⋯, S_AC do
  S_2AC ← number of symbols in the TA-key
  flag ← 0
  for kAC = 1, ⋯, S_2AC do
    if TA-key(kAC) = X(jAC) then
      flag ← 1
    end if
  end for
  if flag = 0 then
    TA-key(S_2AC + 1) ← X(jAC)
  end if
end for

% Encrypt data
S ← number of characters in X
Pad zeros after the last position in the array, from X(S + 1) to X(S + 3)
i ← 1, L ← 1
while i ⩽ S do
  Sum ← 0
  for j = 0, 1, 2 do
    Sum ← Sum + X(i + j) × Key1(j + 1)
  end for
  Encrypted data(L) ← Sum
  L ← L + 1
  i ← i + 3
end while

The run length coding output is given as the input X to the reduced array based encryption algorithm. Let X be the input character array, S_AC the number of characters in X, TA-key (T) the secondary key, and S_2AC the number of characters in the TA-key. The TA-key represents the distinct symbols of the data and is later used to recover the exact data. The first character of X is the first character of the TA-key, and a character that is repeated in X occurs only once in the TA-key.

The algorithm then converts every three characters of X into a single number. Each set of three input characters is multiplied element-wise by the corresponding KEY(i), i = 1, 2, 3, and the three products are added; the sum is the encrypted output. The encrypted output array is therefore one third the size of the input X. The encrypted data length is not limited; it depends only on the length of X.
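A compact Python sketch of Algorithm 2 follows. It assumes that characters are mapped to their ordinal (ASCII) values before the weighted sum is taken, which the paper leaves implicit:

def encrypt(X, key):
    # Build the TA-key: each distinct character of X, in order of first occurrence.
    ta_key = []
    for ch in X:
        if ch not in ta_key:
            ta_key.append(ch)
    # Pad with zeros to a multiple of 3, then map each triple to one number.
    vals = [ord(c) for c in X]
    vals += [0] * (-len(vals) % 3)
    encrypted = [sum(vals[i + j] * key[j] for j in range(3))
                 for i in range(0, len(vals), 3)]
    return ta_key, encrypted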

3.3 Huffman coding

The encrypted output undergoes Huffman coding for further compression. The Huffman code is generated using a binary tree.

Consider the source symbols \(\{b_1, b_2, \ldots, b_n\}\) with frequencies \(\{y_1, y_2, \ldots, y_n\}\), where \(y_1 \geqslant y_2 \geqslant \cdots \geqslant y_n\) and symbol \(b_j\) has frequency \(y_j\). Using Huffman's algorithm, the codeword \(z_j\) for \(1 \leqslant j \leqslant n\), which is a binary string, can be obtained for symbol \(b_j\). Let us denote the Huffman code by \(C = \{z_1, \ldots, z_n\}\). Let the level of the root of the Huffman tree be zero, and the level of any other node be one more than the level of its parent. The codeword length \(l_j\) for \(b_j\) then equals the level of \(b_j\).

Assume the right edge corresponds to "0" and the left edge corresponds to "1". The codeword of a node \(j\), denoted \(z(j)\), is defined as the bit sequence corresponding to the path from the root to node \(j\). The codeword of a subtree \(T_j\), denoted \(z(T_j)\), is defined as the codeword of \(T_j\)'s root. The level of a subtree \(T_j\), denoted \(l(T_j)\), is defined as the level of \(T_j\)'s root. Given a string \(x = x_1 x_2 \cdots x_m\), we define the \(j\)th prefix of \(x\), for \(j = 1, \ldots, m\), as \(\operatorname{prefix}_j(x) = x_1 x_2 \cdots x_j\), with \(\operatorname{prefix}_0(x)\) the empty string. The Huffman procedure is based on two observations:

1) The symbols that occur more frequently will have shorter codewords than the symbols that occur less frequently.

2) The two symbols that occur least frequently will have codewords of the same length.
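A textbook Huffman table builder is sketched below, only to make the coding step concrete. It follows the convention above that the left edge is labelled "1" and the right edge "0"; placing the lower-frequency subtree on the left is an arbitrary choice here.

import heapq
from collections import Counter

def huffman_codes(data):
    # Build a Huffman code table (symbol -> bit string) for `data`.
    heap = [[freq, [sym, '']] for sym, freq in Counter(data).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)          # least frequent subtree
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = '1' + pair[1]       # left edge = "1" (Section 3.3)
        for pair in hi[1:]:
            pair[1] = '0' + pair[1]       # right edge = "0"
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heapq.heappop(heap)[1:])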

4 Decoding algorithm

The code obtained through Huffman coding is given to the Huffman decoder for decoding.

4.1 Reduced array based decryption algorithm

The encrypted and compressed data is decrypted using the primary key, the secondary key and the encrypted data itself.

Algorithm 3. Decryption for minimized data

Input: Encrypted data, Key1 (primary key), TA-key (secondary key)

Output: New_arr

S_Enc ← number of encrypted data values
S_TA ← number of symbols in the TA-key
L ← 1
for i = 1, ⋯, S_Enc do
  flag ← 1
  S1 ← 1, S2 ← 1, S3 ← 1
  while flag = 1 do
    T(1) ← TA-key(S1)
    T(2) ← TA-key(S2)
    T(3) ← TA-key(S3)
    EST ← 0
    for K2 = 1, 2, 3 do
      EST ← EST + T(K2) × Key1(K2)
    end for
    EST ← mod(EST, 255)
    if EST = Encrypted data(i) then
      flag ← 0
    else
      S1 ← S1 + 1
      if S1 > S_TA then
        S2 ← S2 + 1, S1 ← 1
      end if
      if S2 > S_TA then
        S3 ← S3 + 1, S2 ← 1
      end if
      if S3 > S_TA then
        S3 ← 1
      end if
    end if
  end while
  New_arr(L : L + 2) ← T(1 : 3)
  L ← L + 3
end for

Let the encrypted data be denoted by Encrypted data(i), and let S_Enc be the number of encrypted data values. The number of symbols in the TA-key is denoted by S_TA-key. EST is the predicted encrypted data. First, three entries of the TA-key are multiplied by the primary key values Key1(j), j = 1, 2, 3, and the products are added to form EST. EST is then compared with the first entry of the encrypted data. If they are equal, the chosen TA-key entries are the decrypted data. Otherwise, the TA-key contents are multiplied by the primary key in different combinations [e.g., TA-key(i, i, i + 1), TA-key(i, i + 1, i + 1) or TA-key(i, i, i)], and the resulting EST is compared again. When a match is found, the decrypted data is recovered from the corresponding TA-key contents.

The estimated array (New_arr) may have some zeros at its end, because zeros are padded automatically by the encryption algorithm.
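The search of Algorithm 3 can be sketched as a brute-force scan over TA-key triples. The mod-255 step in the paper's pseudocode is omitted here so that the sketch exactly inverts the encrypt() sketch of Section 3.2.2; it also assumes that the weighted sums are unique for the chosen key, which the search implicitly relies on:

from itertools import product

def decrypt(encrypted, key, ta_key):
    # For each encrypted value, try triples of TA-key characters (plus 0
    # for the padding) until the weighted sum matches, then emit the triple.
    candidates = [ord(c) for c in ta_key] + [0]
    out = []
    for value in encrypted:
        for t in product(candidates, repeat=3):
            if sum(t[j] * key[j] for j in range(3)) == value:
                out.extend(t)
                break
    return ''.join(chr(v) for v in out if v != 0)   # drop padding zeros

# Round trip with the sketches above:
# key = generate_primary_key(38, 64, 4)
# ta_key, enc = encrypt('banana', key)
# decrypt(enc, key, ta_key)  -> 'banana'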

4.2 Decoding from dictionary

The decrypted data is given to the RLE decoder and then to the inverse BWT. The RLE decoder expands the short sequences back into long sequences of symbols. After this conversion, the output is given as the input to the BWT reverse transform, which rearranges the data into the original order. In the output of the BWT, an upper case letter (A-Z) or lower case letter (a-z) combined with an ASCII character forms a code. Based on the starting character, the word is searched for and extracted from the respective dictionary. If two consecutive special characters accompany a letter from (a, ⋯, z) or (A, ⋯, Z), they are treated as different codes and are extracted from the dictionary separately.
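The inverse transform can likewise be sketched naively. This version rebuilds the sorted rotation table column by column, assuming the same sentinel as the bwt() sketch above; it is again for illustration only:

def ibwt(last, eos='\0'):
    # Invert the Burrows-Wheeler transform: repeatedly prepend the last
    # column and sort, then pick the row ending with the sentinel.
    table = [''] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(eos))
    return row[:-1]   # strip the sentinel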

As an example, a section of text from the Calgary corpus file paper 1 looks like this in the original text:

“Its performance is optimal without the need for blocking of input data. It encourages a clear separation between the model for representing data and the encoding of information with respect to that model. It accommodates adaptive models easily. It is computationally efficient.” (1)

Number of characters required = 420

Memory space = 420 bytes

Running text (1) through the multi-dictionary based encoder yields the following text:

ÒNM{LBNå|HNMS{LBNå|HNMS{LBNå|He¥CP@B{Ù O¯D{{E|ÔO°~AEA}KCC4E&. (2)

Number of characters required = 61

Memory space = 61 bytes

Time requirement = 0.355457 s.

Running text (2) through the BWT encoder yields the following text:

KNNN~EHÚÔCMMMH{SSSBEECP@C¯A{4|||}{{{Ò BBBDåååA°LLL¥eOO&|{NNN7. (3)

Number of characters = 61

Memory space = 61 bytes

Time requirement = 0.052289 s.

Running text (3) through the run length encoder yields the following text:

KNNN~EHHÙÔCMMMH{SSSBEECP@C¯A{4|||}{{{Ô BBBDåååA°LLL¥eOO&|{NNN7. (4)

Number of characters required = 61

Memory space = 61 bytes

Time requirement = 0.179045 s.

Running text (4) through the encryption yields the following result:

Primary key =[ 38 2436 155908]

Secondary key = KN~E4@HÙÔCM{SBP¯A|}Ò7Då°L¥e O&

Encrypted data = Ï'*8Y© ÷ ÉÛ+ *ü?3p©èB2 (5)

Number of characters required = 21

Memory space = 21 bytes

Time requirement = 0.010227 s.

Running text (5) through the Huffman coding yields the following result: #àú÷l©9¸ ®<c

0010100110010000011000100000100000111101001011100 10001111011100101110101001100001110110

Memory space = 11 bytes

Time requirement = 0.179045 s.

In the above example, the input data collected from test file paper 1 has 420 characters. The multi-dictionary based compression reduces it to 61 characters. BWT then changes the order of the data, and compression is done using RLE, again yielding 61 characters. This output is encrypted and compressed to \(\tfrac{1}{3}\) by the array reduction algorithm, reducing it to 21 characters. Finally, 11 characters are obtained using Huffman coding.

$$\begin{array}{*{20}{c}} {\text{Compression ratio} = \frac{11}{420} = 0.0262} \\ {\text{Bits per character} = \frac{11}{420} \times 8 = 0.2095.} \end{array}$$
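Putting the stage sketches together, an end-to-end CEC pipeline could be composed as below. The dictionary_encode argument is a hypothetical helper (e.g., a word-to-code lookup built on the codeword() mapping of Section 2), since the full dictionary is not reproduced here:

def cec_compress(text, primary_key, dictionary_encode):
    # Stage 1: multi-dictionary compression (assumed helper).
    coded = dictionary_encode(text)
    # Stage 2: BWT followed by run length encoding.
    transformed = rle_encode(bwt(coded))
    # Stage 3: reduced array based encryption (also compresses 3:1).
    ta_key, enc = encrypt(transformed, primary_key)
    # Stage 4: Huffman coding of the encrypted values.
    table = huffman_codes(enc)
    bits = ''.join(table[v] for v in enc)
    return bits, table, ta_key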

5 Performance analysis

Experiments were performed on the CEC transformation algorithms described in Sections 2-4 using the standard Calgary corpus test file collection[28].

In order to evaluate its performance, the CEC scheme was implemented in Matlab and used to compress test files from standard corpora such as the Calgary corpus. At this stage, care has been taken to optimize the runtime of the compression-decompression prototype code. Results are presented for the compression ratio, bits per character and compression time.

The performance metrics, compression ratio and bits per character (BPC), for this algorithm are compared with standard algorithms (arithmetic coding, Huffman coding, Lempel-Ziv-Storer-Szymanski (LZSS)), dictionary based encoding (DBE), multi-dictionary based compression, multi-dictionary BWT with RLE (MBR), MBR with Huffman coding (MBRH), MBR with the new reduced array based encryption (CE), and CE with Huffman coding (CEC). The results are shown graphically. They show that CEC outperforms the other techniques in compression ratio, bits per character, compression time and security.

$$\begin{array}{*{20}{c}} {\text{Compression ratio} = \frac{\text{Output file size}}{\text{Input file size}}} \\ {\text{Bits per character (BPC)} = \frac{\text{Output file size}}{\text{Input file size}} \times 8.} \end{array}$$

Details of the test files are shown in Table 2. This data is collected from the Calgary corpus[28]. The experiments compress the input files using selected existing compression algorithms as well as the new encryption cum compression algorithm, in order to determine the percentage decrease in text size achieved by the CEC transformation. CEC is compared against eight coding formats, namely arithmetic coding, Huffman coding, LZSS, DBE, multi-dictionary based compression, MBR, MBRH and CE.

Table 2 List of files used in experiments

The compression ratios, bits per character and compression time determined for 12 test files of various sizes from the Calgary corpus are given.

First, the input text from each test file is compressed by multi-dictionary based compression. Second, the resultant code is compressed by BWT and RLE. It is then encrypted as well as compressed using the CE scheme, and finally compressed by Huffman coding. The resultant compression ratio thus combines the compression ratios of all the CEC stages.

Results are shown in Tables 3-5. Tables 3 and 4 compare the proposed algorithm with the existing standard compression algorithms. The results are shown graphically in Figs. 2-5.

Table 3 Comparison of compression ratios
Table 4 Comparison of bits per character
Table 5 Comparison of compression times
Fig. 2 Comparison of compression ratios

Fig. 3 Comparison of compression ratios

Fig. 4 Comparison of bits per character

Fig. 5 Comparison of bits per character

In Table 3 and Fig. 2, the compression ratios of the standard algorithms, arithmetic coding, Huffman coding and LZSS, are compared with CEC. In Table 3 and Fig. 3, the compression ratio of our scheme is compared with the various intermediate schemes.

The CE coding achieves higher efficiency than the MBR and MBRH coding formats. However, the compression achieved using CE coding is still lower than that using CEC coding.

For example, for the bib text file, the compression ratios achieved by the arithmetic coding, Huffman coding, LZSS, DBE, multi-dictionary based compression, MBR, MBRH, CE and CEC coding formats are 0.654, 0.457, 0.627, 0.278, 0.277, 0.244, 0.192, 0.08 and 0.074, respectively. The biggest gain is achieved on progp (resulting mainly from the EOL-encoding) as well as on obj1 and progl.

The compression of geo is practically unaffected by CEC, because that file does not contain any textual data. Two files in the corpus that do contain some textual data are paper 3 and bib; CEC yields the same gain on both. The paper 1 and progc files also show the same gain.

In Table 4 and Fig. 4, the bits per character of the standard algorithms, arithmetic coding, Huffman coding and LZSS, are compared with CEC.

In Table 4 and Fig. 5, the bits per character of our scheme is compared with the various intermediate schemes. The CE coding achieves better bits per character than the statistical coding, DBE, multi-dictionary, MBR and MBRH coding formats. The results are listed in Table 4. They reveal that the bits per character achieved using CEC is lower than that using CE coding by only a very small margin; CEC coding thus provides nearly the same BPC performance as the CE algorithm. For example, for the paper 1 text file, the BPC achieved by the arithmetic coding, Huffman coding, LZSS, DBE, multi-dictionary based compression, MBR, MBRH, CE and CEC coding formats are 4.984, 3.616, 4.976, 2.256, 2.183, 2.028, 1.615, 0.676 and 0.617, respectively.

6 Conclusions

The CEC algorithm provides better results than the other algorithms considered. It has an admirable and viable performance, as it outperforms the other widely used data compression algorithms. The compression ratio depends on the number of words in the file, the size of the test file, and the frequencies and distribution of words within the file.

Nine coding formats have been investigated. The investigation reveals that the best compression ratio is achieved by CEC, as it performs encryption and compression at the same time. The CEC scheme can be used as a complementary scheme to any statistical or dictionary based lossless compression algorithm, such as static or adaptive Huffman coding, arithmetic coding, the LZSS algorithms, or any modified form of them. Our approach secures the message using the CEC technique, compresses it to reduce its length, and encrypts it using the new reduced array based encryption algorithm. The CEC results achieve an excellent improvement in data compression and security, without any size increase over the existing techniques. Our future work will focus on the performance of this scheme in compressing multimedia files.