Block cipher four implementation on field programmable gate array

Block ciphers are used to protect data in information systems from being leaked to unauthorized parties. One of the block cipher algorithms developed by Indonesian researchers is BCF (Block Cipher-Four), a block cipher with 128-bit input/output that accepts 128-bit, 192-bit, or 256-bit keys. BCF can be used in embedded systems that require a fast implementation. In this study, a BCF engine was designed and implemented on the DE2 FPGA board. It is the first research on BCF implementation in FPGA. The operations of the BCF engine were controlled by Nios II as the host processor. Our experiments showed that the BCF engine could compute 2,847 times faster than a BCF implementation using only Nios II/e. Our contribution is a description of the new block cipher BCF and its first implementation on FPGA using an efficient method.


Introduction
A block cipher is one of the cryptographic components used to protect information. This information may reside in internet networks, financial systems, military systems, and the IoT (internet of things). IoT is a network of interconnected objects in various forms, such as wireless sensor networks and electrical, electronic, and mechanical devices, together with their interaction with computer data via the internet [5]. In the IoT era, embedded devices are connected to the internet. The advent of IoT has put telecommunications and embedded systems at risk [6]. BCF is an encryption algorithm based on AES [13], Camellia [14], Twofish [15], and Khazad [12]. It has 128-bit input/output and supports keys of 128, 192, and 256 bits. BCF was designed by Indonesian researchers [1]. It has two advantages over AES. First, the key schedule in BCF is more secure than that of AES, because the master key is very difficult to find even when all sub-keys of BCF have been found. Second, the SBox of BCF changes depending on the key, while the SBox of AES is fixed. For these reasons, BCF is considered more secure than AES.
There are two types of BCF keys: the master key and the sub-keys. The key schedule function processes the master key into the sub-keys. Each sub-key is used to encrypt or decrypt part of the data in one round. Encryption converts plaintext into ciphertext, and decryption converts ciphertext back into plaintext.
Cryptanalysis is used to recover the key of a block cipher by means other than its intended use, or to test the security of a newly created cryptographic algorithm. Correlation power analysis, for instance, tries to find all of the sub-keys using the correlation between Hamming weights and the power consumed by the embedded device while it computes the encryption algorithm [7].
Hardware implementation is very important in terms of performance and security, especially as a countermeasure against timing attacks [8] in particular and side-channel attacks in general. This paper aims to introduce the BCF algorithm and an efficient implementation of it on FPGA. It proposes a hardware architecture of the BCF algorithm as a co-processor of the host (an encryption engine accelerator). The architecture was written in Verilog and tested on the Altera Cyclone IV EP4CE115F29 [9] using NIOS as the host processor. We compared the results with AES, Camellia, and TDEA data taken from SASEBO [10]. We also compared the BCF hardware accelerator with a software implementation to measure how much faster the BCF encryption engine accelerator computes than software.

BCF Algorithm
BCF uses the Feistel structure [11], in contrast to AES, which uses the SPN structure. The SPN structure requires fewer rounds than Feistel to achieve the same diffusion rate. The advantage of the Feistel structure over SPN is that the same structure serves both encryption and decryption, so the implementation requires less memory. SPN requires two different algorithms for encryption and decryption.
The BCF algorithm has two main components: the key schedule part and the randomization part. The key schedule generates the sub-keys, and the randomization encrypts or decrypts data using those sub-keys. The number of rounds at the randomization stage depends on the length of the key: 128-bit keys use 15 rounds, 192-bit keys use 16 rounds, and 256-bit keys use 18 rounds. In each round, the F0 function is applied. This function uses sub-keys to manipulate the input data for each round.
The main features of the BCF algorithm are:
1. The input and output data (plaintext and ciphertext) are 128 bits each.
2. The master key length has three variants: 128, 192, and 256 bits.
3. Key scheduling is done in 8 rounds using the F0 function.
4. The number of rounds at the randomizing stage (for encryption or decryption) depends on the length of the key.
The key schedule stage is carried out at the beginning to generate sub-keys for the randomizing stage, but in this paper we will begin by explaining the randomizing stage.

BCF Encryption
BCF uses the Feistel structure as in the Twofish algorithm, so it can use the same algorithm for encryption and decryption. BCF has 128-bit input/output. The pseudo code of the BCF encryption algorithm is presented as follows. The BCF algorithm contains an FO function that takes two data words x and two words of the sub-key k as input and produces two output words y, where one word is 32 bits. This function is the heart of BCF encryption/decryption.
The SBox has 1-byte input/output. Because each x consists of 8 bytes, there are 8 SBox operations for each input x = {x0, x1, x2, x3, x4, x5, x6, x7}. Four SBoxes are used in BCF. The substitution boxes (in hexadecimal) are shown in Tables 1-4. Table 5 shows the algorithm used to select the BCF SBox before encryption/decryption. P is the product of the matrix M and the input x, which aims to obtain optimal diffusion.

BCF Key Expansion
The BCF key schedule (key expansion) algorithm takes a 128-bit key (16 bytes or 4 words) as input and expands it into the sub-keys. The key expansion produces a total of 17 sub-keys, 15 sub-keys for the regular rounds (K0, K1, ..., K14) and 4 sub-keys for whitening (KW1, KW2, KW3, KW4). If the master key is 192 bits or 256 bits, the left side and the right side of the master key are XORed so that a 128-bit key is still passed to the key schedule.
At the beginning of the key schedule, the intermediate keys KA, KB, ..., KG are generated. From these intermediate keys, all sub-keys required for the encryption and decryption processes are generated. Figure 2 shows the beginning of the key expansion, which generates KA, KB, KC, and KD. Table 6 describes the complete key expansion process and should be read together with Figure 2.
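The folding of a long master key into a 128-bit key-schedule input can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fold_master_key` is hypothetical, and only the 256-bit case is handled, because the text does not say how the unequal halves of a 192-bit key are aligned before the XOR.

```python
def fold_master_key(key: bytes) -> bytes:
    """Reduce a master key to the 128-bit input of the key schedule.

    Assumption: a 256-bit key is split into two 16-byte halves that are
    XORed byte by byte. The 192-bit case is left out on purpose, since
    its alignment is not specified in the text.
    """
    if len(key) == 16:
        return key  # already 128 bits, passed through unchanged
    if len(key) == 32:
        left, right = key[:16], key[16:]
        return bytes(a ^ b for a, b in zip(left, right))
    raise ValueError("only 128-bit and 256-bit keys are handled in this sketch")


if __name__ == "__main__":
    folded = fold_master_key(bytes(range(32)))
    print(folded.hex())
```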

BCF Decryption
The BCF decryption procedure (Figure 3) can be performed in the same way as encryption, but with the sub-key order reversed. More details are shown in the following pseudo code.
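The property that decryption is encryption with the sub-key order reversed holds for any Feistel network, and can be sketched generically as follows. This is not the BCF cipher: `toy_f` is a placeholder round function invented for illustration, whitening is omitted, and the sub-keys are arbitrary integers. Only the 128-bit block, the 64-bit halves (two 32-bit words), and the round counts come from the text.

```python
MASK64 = (1 << 64) - 1

# Rounds per key length, as stated in the text.
ROUNDS = {128: 15, 192: 16, 256: 18}


def toy_f(x: int, k: int) -> int:
    """Placeholder round function (NOT the BCF F function): 64 bits in, 64 bits out."""
    x = (x ^ k) & MASK64
    x = (x * 0x9E3779B97F4A7C15) & MASK64
    return x ^ (x >> 29)


def feistel(block: int, subkeys: list) -> int:
    """Run a Feistel network over a 128-bit block, one round per sub-key."""
    left, right = block >> 64, block & MASK64
    for k in subkeys:
        left, right = right, left ^ toy_f(right, k)
    # Swap the halves at the end so the same routine is its own inverse
    # when the sub-keys are supplied in reverse order.
    return (right << 64) | left


def encrypt(block: int, subkeys: list) -> int:
    return feistel(block, subkeys)


def decrypt(block: int, subkeys: list) -> int:
    return feistel(block, subkeys[::-1])


if __name__ == "__main__":
    keys = list(range(1, ROUNDS[128] + 1))  # dummy sub-keys
    pt = 0x00112233445566778899AABBCCDDEEFF
    ct = encrypt(pt, keys)
    print(hex(ct), decrypt(ct, keys) == pt)
```

Because the round function is only ever XORed into the opposite half, it does not need to be invertible, which is why one datapath can serve both directions.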

BCF Core IP Design
The BCF IP core was designed in Verilog HDL using a top-down method. The design begins by defining the system, then the architecture, and finally the supporting modules. The BCF IP core symbol and pinout are shown in Figure 4. Table 7 describes the function of each pin. Figure 5 shows the general architecture of BCF. The Key_len pin sets the number of rounds, which depends on the size of the key: keys of 128, 192, and 256 bits take 15, 16, and 18 rounds, respectively. The decrypt pin determines whether BCF_Core encrypts or decrypts. The core of the BCF engine contains encryption, decryption, and the key schedule. Figure 6 illustrates the core algorithm of the BCF engine. This core architecture is shared between the randomizing stage and the key schedule, which results in a large latency. For the encryption and decryption process, the BCF engine required 316 clocks. Figure 7 shows the BCF timing diagram. The system is controlled by clk.
To implement BCF, we needed an FSM (Finite State Machine). The BCF FSM is shown in Figure 8. Figure 9 and Figure 10 illustrate the key schedule (key expansion) architecture and the FO module, respectively. A single FO core is shared between encryption, decryption, and the key schedule. The advantage of this design is its small area; the disadvantage is its large latency.
The FSM in Figure 8 and the FO Core in Figure 9 are described in more detail in Table 14 and Table 15 (appendix). Subbyte BCF operations were implemented using LUTs for ease of design and to minimize critical paths [4].
The FO module pins are described in Table 16 (appendix).
The substitution box number is obtained from the following equation:

si = ((K >> ( (2*i) + (8*r) ) ) & 8'h03) ^ ( (((K >> (8*r + 2)) & 8'h03) ^ ((K >> (8*r + 4)) & 8'h03) ^ ((K >> (8*r + 6)) & 8'h03)) & zero)
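The equation above can be transcribed into Python to make the bit selection easier to follow. This is a direct, hedged transcription: the function name `sbox_index` is hypothetical, `zero` is not defined in the surrounding text and is treated here as a caller-supplied 2-bit mask, and reading `r` as a byte selector and `i` as a 2-bit field selector within the key material `K` is an interpretation of the shifts, not something the text states.

```python
def sbox_index(K: int, r: int, i: int, zero: int) -> int:
    """Transcription of the s_i equation; returns a value in 0..3.

    Assumptions: K is an integer holding key material, r picks a byte
    (shift by 8*r), i picks a 2-bit field inside it (shift by 2*i), and
    `zero` is a 2-bit mask whose meaning is not given in the text.
    """
    base = (K >> (2 * i + 8 * r)) & 0x03
    mix = (
        ((K >> (8 * r + 2)) & 0x03)
        ^ ((K >> (8 * r + 4)) & 0x03)
        ^ ((K >> (8 * r + 6)) & 0x03)
    )
    return base ^ (mix & zero)


if __name__ == "__main__":
    K = 0x1B  # bit fields, low to high: 3, 2, 1, 0
    print([sbox_index(K, 0, i, 0) for i in range(4)])
```

With `zero` set to 0 the result is simply the i-th 2-bit field of the selected byte; with `zero` set to 3 the XOR of the three upper fields is folded in.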
The MixColumn operation used a systolic array architecture (Figure 11). The architecture used only eight processing elements in MixColumn and one processing element in Subbyte processing. The tradeoff of this architecture was that the latency increased from one clock to eight clocks. The processing elements are implemented using the architecture shown in Figure 13. Table 8 describes the Processing Element pinouts. Pinout R contains the result of the polynomial multiplication of the data from pinouts A and B. The XTime algorithm was used for polynomial multiplication in the MixColumn operation. For efficiency, the MixColumn operation was built from shift and XOR operations [3]. Figure 14 depicts the architecture of the XTime algorithm implemented in MixColumn. Table 9 describes the XTime pinouts. The XTime architecture used in this design has one data input and eight data outputs, so that the polynomial multiplication can be completed in one clock.
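The shift-and-XOR multiplication with one input and eight outputs can be sketched in Python as follows. The reduction polynomial used by BCF is not given in this section, so the sketch assumes the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B); the function names are illustrative and do not correspond to the Verilog module pins.

```python
REDUCTION_POLY = 0x11B  # ASSUMPTION: AES polynomial; BCF's polynomial is not stated here


def xtime(a: int) -> int:
    """Multiply a GF(2^8) element by x: shift left, reduce if bit 8 is set."""
    a <<= 1
    if a & 0x100:
        a ^= REDUCTION_POLY
    return a & 0xFF


def xtime_outputs(a: int) -> list:
    """Return the eight values a*x^0 .. a*x^7, mirroring one input and eight outputs."""
    outs = [a & 0xFF]
    for _ in range(7):
        outs.append(xtime(outs[-1]))
    return outs


def gf_mul(a: int, b: int) -> int:
    """Multiply a by b by XORing the xtime outputs selected by the bits of b."""
    result = 0
    for bit, value in enumerate(xtime_outputs(a)):
        if (b >> bit) & 1:
            result ^= value
    return result


if __name__ == "__main__":
    print(hex(gf_mul(0x57, 0x13)))
```

Producing all eight shifted values at once is what lets hardware finish the product in a single clock: the second operand only selects which of the precomputed values enter the XOR tree.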

BCF Integration on the Altera DE2-115 FPGA
DE2-115 is a development board whose main component is an Altera Cyclone® IV 4CE115 FPGA. The NIOS soft-core processor, a 32-bit RISC microprocessor, can be implemented on the FPGA. In this paper, the BCF module was clocked at 50 MHz and wrapped with the Avalon interconnect interface. Figure 15 shows the block diagram of the NIOS interface. To access the BCF IP, NIOS writes and reads the registers listed in Table 11. The functionality of the BCF IP was tested by comparing its results with those of a program running on a computer. Figures 16 and 17 show the results.
Based on the above test, the ciphertext and plaintext values generated by the BCF IP core were found to match those generated by the C program running on the computer. This means that the implementation of BCF on FPGA is functionally correct. One way to measure BCF performance is to compare the hardware and software implementations. If the speed of the hardware far exceeds the speed of the software, the hardware implementation can be considered successful. The measurement results are presented in Table 12, which shows that the computation time of the BCF software implementation depends on the processor architecture. The hardware BCF engine speeds up BCF computation by 488-2847 times compared to software, depending on processor architecture and BCF key length. The speeds of BCF and AES were also measured, and the comparison in Table 13 shows that the BCF engine was 44 times faster than the AES hardware accelerator, with both devices operating at a clock of 50 MHz. Based on these data, the BCF engine is suitable for devices with small computing resources such as IoT devices, which require a low clock for power saving but a high level of security for sending data to the internet [17].
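As a rough cross-check of the figures quoted in this paper, the per-block latency and throughput implied by 316 clocks at 50 MHz can be computed as follows. This is back-of-the-envelope arithmetic, not a measurement: it assumes that the 316 clocks cover one complete 128-bit block and that no bus or register-access overhead is included. The function names are illustrative.

```python
CLOCK_HZ = 50_000_000    # 50 MHz clock, from the text
CYCLES_PER_BLOCK = 316   # clocks per encryption/decryption, from the text
BLOCK_BITS = 128         # BCF block size


def block_latency_s(cycles: int = CYCLES_PER_BLOCK, clock_hz: int = CLOCK_HZ) -> float:
    """Time to process one block, in seconds."""
    return cycles / clock_hz


def throughput_bps(cycles: int = CYCLES_PER_BLOCK,
                   clock_hz: int = CLOCK_HZ,
                   block_bits: int = BLOCK_BITS) -> float:
    """Sustained throughput in bits per second, one block at a time."""
    return block_bits * clock_hz / cycles


if __name__ == "__main__":
    print(f"latency    : {block_latency_s() * 1e6:.2f} us per block")
    print(f"throughput : {throughput_bps() / 1e6:.2f} Mbit/s")
```

Under these assumptions the engine needs about 6.32 microseconds per block, or roughly 20 Mbit/s.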

Conclusion
This paper describes the BCF encryption algorithm, its implementation on the Altera DE2-115 FPGA, and its performance. On the Altera DE2-115 board, the hardware implementation was found to be 488-2847 times faster than the software implementations, depending on processor architecture and BCF key length. BCF is also fast enough to be implemented on devices with small resources such as IoT devices. In further research, we will perform a Correlation Power Analysis (CPA) attack on the proposed BCF device. The attack will be based on the previous paper [2].