# **Analysis and Design of a High Performance Radix-4 Booth Scheme in CMOS Technology**

Ali Rahnamaei

Department of Electrical Engineering, Ardabil Branch, Islamic Azad University, Ardabil, Iran. Email: [a\\_rahnamaei@iauardabil.ac.ir](mailto:a_rahnamaei@iauardabil.ac.ir)

Received: January 2021 Revised: March 2021 Accepted: May 2021

# **ABSTRACT:**

In this paper, a novel high performance structure has been demonstrated which can be widely used for circuit-level realization of radix-4 Booth scheme. The notable privilege of proposed scheme is its higher speed for generation of Partial Products (PPs) compared to the previous designs. The objective has been achieved by means of the modified truth table of Booth algorithm. Moreover, Pass-Transistor Logic (PTL) has been employed to reduce the middle stage capacitances which has considerably enhanced the operating frequency of the designed architecture. The thorough analysis over previously reported works has also been provided to help the authors for optimized implementation of the Booth circuitry. Simulation results for TSMC 0.18µm CMOS technology and 1.8V power supply using HSPICE indicate the correct operation of the proposed scheme. In addition, the best-reported works have been redesigned and simulated on the same conditions to provide a fair comparative environment with our designed scheme. The results demonstrate the superiority of our circuit over the selected structures.

**KEYWORDS:** Parallel Multiplier, Radix-4 Booth Algorithm, High-Speed, Partial Product, Low Power.

## **1. INTRODUCTION**

In a modern computer, the microprocessor is occasionally made of a small chip. Such system is known as the Central Processing Unit (CPU) and is mounted on the main board to perform the arithmetic operations [1]. One of the main responsibilities of CPU is to carry out the machine orders classified in one the following sections [2]:

- 1) Utilization of arithmetic unit
- 2) Data transfer between different parts
- 3) Decision making process

As a general concept, integer 2 constitutes the basis of mathematical calculations in a microprocessor. Such fact helped the researchers to introduce different algorithms for simplification of arithmetic operations. Until the development of Integrated Circuit (IC) technologies, those algorithms were only available for academic studies. But after that and because of the need for faster computational operations, they were widely employed in the body of Digital Signal Processors (DSPs) [3].

Among the arithmetic operations, binary multiplication is of high importance. Because of higher computational complexity than the binary summation, special design considerations have been brought up for this operation [4].

In a microprocessor, the multiplier is one of the

building blocks which lays in the critical delay path. As a consequence, it will directly affect the performance of whole system. To improve the efficiency of multiplication process, different algorithms have been introduced. Each method has its own privileges and drawbacks concerning the certain parameter to be optimized [5-8]. In some of them speed was the main concern while in the others power and active area consumption were the emphasized parameters. Along with the development of submicron technologies, the need for high-speed and low power processing blocks becomes vital. One of the impressive solutions in this criterion was the invention of parallel multipliers. Although they are more complex than their serial counterparts, the higher operating frequency feature has drawn the attention of circuit designers [5].

As illustrated in Fig. 1, a parallel multiplier is composed of three main sub systems. At the first stage, the Partial Products (PPs) are generated. Radix-4 Booth is the most common algorithm for implementation of this section.

DOI: https://doi.org/10.52547/mjtd.10.2.67

How to cite this paper: A. Rahnamaei "**Analysis and Design of a High Performance Radix-4 Booth Scheme in CMOS Technology**", Majlesi Journal of Telecommunication Devices, Vol. 10, No. 2, pp. 67-73, 2021.

Paper type: Research paper



**Fig. 1.** General architecture of a parallel multiplier.

With the help of this algorithm, the multiplication of two signed numbers becomes possible, while the rows of PPs are almost halved [9].

Hardware level implementation of Booth algorithm was into the contention of researchers over the recent years and many attempts have been done to improve the performance of this method [10-13]. For instance, in [12] and [13] two high-speed structures have been demonstrated in which the achieved gate-level latency was less than 2 XOR gates. Although there are other circuit-level realizations in the past four years, none of them could reach the latencies less than 3 XOR logic gates [14-18].

In this paper, the best-reported works have been investigated carefully to design a high performance structure for radix-4 Booth scheme. The result which is implemented in 0.18µm CMOS technology shows a higher speed behavior compared with the previous designs.

## **2. BOOTH ALGORITHM**

In 1951, an efficient algorithm was invented by Andrew Booth [6] which is widely utilized in parallel multiplier design criterion. Nowadays, the modified version known as radix-4 scheme constitutes the main part at the PP generation block [9].

In order to perform circuit-level optimizations, at the first step, the procedure should be investigated carefully. Therefore, it is assumed that the product of two signed N-bit binary numbers denoted by  $X$  and  $Y$  is going to be computed. By defining by  $X$  as the

multiplier and  $Y$  as the multiplicand,  $Y$  can be written as:

$$
Y = Y_{n-1} 2^{n-1} + \sum_{j=0}^{n-2} Y_j 2^j
$$
 (1)





Equation (1) is the basic representation of Booth algorithm which is populated as radix-2 method. The efficiency of algorithm can further be extended by rewriting  $(1)$  as:

$$
Y = \sum_{i=0}^{n} (-2Y_{2i+1} + Y_{2i} + Y_{2i-1}) 2^{2i}
$$
 (2)

Where $Y_{-1} = 0$ . By defining:

 $Y_i = -2Y_{2i+1} + Y_{2i} + Y_{2i-1}$  (3) The substitution of  $(3)$  in  $(2)$ , gives us:

$$
Y = \sum_{i=0}^{n} (Y_i) 2^{2i}
$$
 (4)

Which is the modification of Booth algorithm to its radix-4 version. The comparison between (1) and (4) depicts that by means of such abbreviation, the rows of PPs are almost halved. The result of the modification can also be summarized in a table where Table 1 shows the result.

|   | $b_{2i+1}$       | $b_{2i}$         | $b_{2i-1}$ | $d_i$          | Х        | N        | L        | Neg      |
|---|------------------|------------------|------------|----------------|----------|----------|----------|----------|
| Ш | $\theta$         | $\theta$         | $\theta$   | $\theta$       | $\theta$ |          | $\theta$ | $\theta$ |
|   | $\mathbf{0}$     | $\theta$         |            |                |          | $\theta$ | $\theta$ | $\theta$ |
|   | $\theta$         |                  | $\theta$   |                |          | $\theta$ | $\theta$ | $\theta$ |
| П | $\boldsymbol{0}$ |                  |            | $\overline{c}$ | $\theta$ | $\theta$ |          | $\theta$ |
| Ш |                  | $\boldsymbol{0}$ | $\theta$   | $-2$           | $\theta$ |          | $\theta$ |          |
|   |                  | $\boldsymbol{0}$ |            | -1             |          | $\theta$ | $\theta$ |          |
|   |                  |                  | $\theta$   | $-1$           |          | $\theta$ | $\theta$ |          |
| Π |                  |                  |            | $\theta$       | $\theta$ | $\theta$ |          |          |

**Fig. 2.** Designed truth table for tadix-4 scheme of [13].



**Fig. 3.** Designed architecture for tadix-4 scheme of [13].

## **3. ANALYSIS AND DESIGN**

As a start point, the radix-4 Booth architecture with the lowest reported latency [13] will be analyzed here. The corresponding truth table along for this work is shown in Fig. 2.

As it is evident, there must be four paths from the input signals to the output node. Hence, the speed will drastically be enhanced. The designed circuitry which is shown in Fig. 3, demonstrates latency almost equal to 3 transistors (or 1.5 XOR logic gates).

Although the speed of this architecture is high, there are non-uniform paths from inputs to the outputs. Therefore, the output waveforms will contain glitches and in order to eliminate such effect, en extra buffer is needed at the output node. Considering this, the gatelevel delay in Fig. 3 will be 2 XOR logic gates practically.

On the other hand, the transistor-level implementation has provided the possibility to remove the complement transistors. This has reduced the transistor count. In the other work reported in [12], due to the careful design consideration, there are uniform paths for signal transfer which omits the glitch effect. However, the transistor-level delay from inputs to the outputs is equal to 5 transistors (or 2.5 XOR logic gates).

As discussed before, there are also other works over recent years for hardware realization of radix-4 Booth scheme [14-18]. Most of those circuits are based on error tolerant systems. Although the speed performance seems to be enhanced, because of utilization of error correction structures, the final latency is 3 XOR gates for the best case (reported in [14]). Moreover, the existence of glitch in the output waveforms is inevitable in all of these circuits.

To resolve the glitch problem and achieve lower latencies, Table 1 should be reconsidered. In order to produce the scale factors  $\{2X, 1X, 0X, -1X, -2X\}$  in radix-4 Booth scheme, at least three control signals are needed. On the other hand, in an N-bit multiplier each output signal of encoder section must be applied to N number of decoders. As a result, if one tries to reduce the capacitances of middle stages, he should lower the number of control signals. The standard method is to employ three control parameters where in most cases they are chosen as  $Neg, X$ , and  $2X$ .

The parameter  $Neg$  is a representation of sign bit value and is assumed equal to  $Y_{2i+1}$  for more simplification. However, the investigation of previous works depicts that generation of  $X$  and  $2X$  scale factors results in a system with at least 3 XOR logic gate-level delay. In  $[12]$  and  $[13]$ , the 2X factor has been

neglected and instead, new parameters have been considered. The final result which was discussed thoroughly, was the achievement to lower latencies. Following the same concept, in this paper the modified truth table of [12] has been utilized.

| $\boldsymbol{d}_i$ | $b_{2i+1}$       | $b_{2i}$         | $\boldsymbol{b}_{2i-1}$ | Neg              | $\pmb{X}$        | $\pmb{Z}$        |
|--------------------|------------------|------------------|-------------------------|------------------|------------------|------------------|
| $\boldsymbol{0}$   | $\boldsymbol{0}$ | $\boldsymbol{0}$ | $\boldsymbol{0}$        | $\boldsymbol{0}$ | $\boldsymbol{0}$ | $\boldsymbol{0}$ |
| $\,1\,$            | $\boldsymbol{0}$ | $\boldsymbol{0}$ | $\mathbf{1}$            | $\boldsymbol{0}$ | $\mathbf{1}$     | $\,1$            |
| $\,1\,$            | $\boldsymbol{0}$ | $\,1$            | $\boldsymbol{0}$        | $\boldsymbol{0}$ | $\,1$            | $\,1\,$          |
| $\overline{2}$     | $\boldsymbol{0}$ | $\,1$            | $\mathbf{1}$            | $\boldsymbol{0}$ | $\boldsymbol{0}$ | $\mathbf{1}$     |
| $-2$               | $\,1$            | $\boldsymbol{0}$ | $\boldsymbol{0}$        | $\,1$            | $\boldsymbol{0}$ | $\mathbf{1}$     |
| $^{\rm -1}$        | $\,1\,$          | $\boldsymbol{0}$ | $\mathbf{1}$            | $\,1$            | $\mathbf{1}$     | $\,1$            |
| $^{\rm -1}$        | $\,1\,$          | $\,1$            | $\boldsymbol{0}$        | $\mathbf{1}$     | $\mathbf{1}$     | $\,1$            |
| $-0$               | $\,1$            | $\,1$            | $\mathbf{1}$            | $\,1$            | $\boldsymbol{0}$ | $\boldsymbol{0}$ |

**Table 2.** The modified truth table of Radix-4 Booth scheme.

Table 2 illustrates the modified truth table in which, a new parameter denoted as  $Z$  has been employed instead of  $2X$ . Such parameter is used to cover the neutral states of sign bit. For the complement state of parameter  $Z$  we have:

$$
\bar{Z} = b_{2i}.b_{2i-1}.b_{2i+1} + \overline{b_{2i}}. \overline{b_{2i-1}}. \overline{b_{2i+1}}
$$
 (5)

Based on (5), only for identical states of three input



bits the value of complement state will rise to high level voltage. Also, for scale factor  $X$ , it can be written:

$$
X = b_{2i} \oplus b_{2i-1} \tag{6}
$$

And finally, the scale factor  $2X$  will be equal to:

$$
2X = \overline{X}.Z \tag{7}
$$

With the help of the obtained relations, the architecture of Fig, 4 will be achieved for hardware realization of proposed Booth scheme [19]. As it is obvious,  $a_0, \ldots, a_n$  represent the bits of multiplier, while  $Neg$ ,  $X_1$ , and Z illustrate the coding parameters.

The gate-level delay of designed system starting from encoder section and finishing in the PPs of decoder stage will be 2 XOR logic gates.

It must be mentioned that in encoder stage the parameter  $Z$  is generated by means of an inverter gate after  $\overline{Z}$ . However, the inverter does not affect the total gate-level delay as the control parameter  $Z$  is applied to the gates of transistors in decoder block. The use of this parameter is for full level recovery at the output nodes of Transmission Gates (TGs).

For further enhancement of speed performance, the decoder section can also be modified. To do this, the parallel NMOS and PMOS transistors in every TG could be separated. With the help of such modification, the capacitances of contacted transistors will be reduced effectively.

The optimized scheme for the decoder section has been shown in Fig. 5 for generation of only one PP. To obtain other PPs, each block should be repeated in the decoder stage.



**Fig. 4.** The proposed architecture for radix-4 Booth scheme.



**Fig. 5.** Optimization of the decoder section for speed enhancement.



**Fig. 6.** Employed XOR/XNOR gates (retrieved from [20]).

The comparison between Fig. 4 and Fig. 5 depicts that detachment of NMOS and PMOS pairs in a TG, separates the parallel capacitances for each path. Therefore, the equivalent capacitance will be reduced which enhances the signal propagation speed.

Furthermore, the circuits of Fig. 6 have been used for the implementation of XOR gates which helped us to reduce total transistor count of the system. These gates were previously designed and an analyzed in [20] and [21].

The full swing gates have been utilized in decoder section and non-full swing ones were employed in encoder block to produce  $X_1$  and  $X_2$  control signals.

## **4. SIMULATION AND COMPARISON**

For the simulation of proposed radix-4 Booth scheme, TSMC standard 0.18µm CMOS library along with 1.8V power supply in HSPICE have been employed. For a fair comparative analysis, the architectures of  $[10]$ ,  $[11]$ ,  $[12]$ , and  $[13]$  have also been redesigned and simulated with the same gates used for implementation of the proposed architecture.

The 4-2 compressor of [22] has been used as a load to provide a real environment. The final results for the measurement of propagation latency, power consumption and the Power Delay Product (PDP) are shown in Fig. 7. Based on the results, the optimum performance of designed scheme can clearly be observed.

**Table 3.** Comparison of the Proposed Scheme and Previous Works.

| 11 U V 1 U U U                          |       |        |                |                |                 |  |
|-----------------------------------------|-------|--------|----------------|----------------|-----------------|--|
| Work                                    | [10]  | $[11]$ | [12]           | [13]           | <b>Proposed</b> |  |
| <b>Technology</b><br>$(\mu m)$          | 0.18  | 0.18   | 0.18           | 0.18           | 0.18            |  |
| Voltage (V)                             | 1.8   | 1.8    | 1.8            | .8             | 1.8             |  |
| <b>Transistor</b><br><b>Level Delay</b> | 8     | 8      | 5              | 4              | 4               |  |
| Glitch                                  | Yes   | Yes    | N <sub>0</sub> | N <sub>0</sub> | No              |  |
| <b>Propagation</b><br>Latency (ps)      | 432   | 564    | 269            | 234            | 198             |  |
| Power $(\mu W)$<br>@100MHz              | 1138  | 1025   | 317            | 285            | 297             |  |
| PDP (fJ)                                | 491.6 | 578.1  | 85.27          | 66.69          | 58.806          |  |



**Fig. 7.** Comparative results between proposed scheme and previous works considering propagation latency, power consumption and PDP.

It must be mentioned that the results have been obtained for the operating frequency of 100MHz. Finally, Table 3 summarizes the comparison results. According to the results, the PDP of proposed work is the lowest compared to the previous designs.

## **5. CONCLUSION**

In this paper a novel high performance structure for radix-4 Booth scheme has been presented in which the latency of critical path was equal to 4 transistors (2 XOR logic gates). Considering the fact that PTL has been utilized for implementation of the circuits, the capacitances of middle stages have been reduced effectively. As a result, the operating frequency can drastically be improved.

Simulation results for TSMC standard 0.18µm CMOS technology confirm the correct behavior of the proposed architecture. According to the results, the designed scheme can be widely used in high-speed parallel multipliers.

## **REFERENCES**

- [1] Adam Osborne, **"An Introduction to Microcomputers, Volume 1: Basic Concepts (2nd ed.),"** *Berkeley, California: Osborne-McGraw Hill*, 1980, ISBN 0-931988-34-9.
- [2] Ross Basset, **"When is a Microprocessor not a Microprocessor? The Industrial Construction of Semiconductor Innovation,"** *In Finn, Bernard, Exposing Electronics, Michigan State University Press*, p. 121, 2003, ISBN 0-87013-658-5.
- [3] Jeffrey Shallit, **"A Very Brief History of Computer Science,"** CS 134, University of Waterloo, summer of 1995.
- [4] M. Rafiquzzaman, **"Fundamentals of Digital Logic**

**and Microcomputer Design,"** *John Wiley & Sons*, pp. 251, 2005, ISBN 9780471733492.

- [5] D. Naresh and G. Babu Kande, **"High Speed Signed multiplier for Digital Signal Processing Applications,"** *IOSR Journal of Electrical and Electronics Engineering (IOSR-JEEE)*, Vol. 8, Issue 2, pp. 57-61.
- [6] Andrew D. Booth, **"A signed binary multiplication technique,"** *The Quarterly Journal of Mechanics and Applied Mathematics*, Vol. IV, Pt. 2, 1951.
- [7] C. S. Wallace, **"A suggestion for a fast multiplier,"** *IEEE Trans. on Electronic Comp. EC-*13(1), pp. 14- 17, 1964.
- [8] L. Dadda, **"Some schemes for parallel multipliers,"** *Alta Frequenza*. Vol. 34, pp. 349–356, 1965.
- [9] Parhami, Behrooz, **"Computer Arithmetic: Algorithms and Hardware Designs,"** *Oxford University Press, New York*, 2000, ISBN 0-19- 512583-5.
- [10] Shiann-Rong Kuang, Jiun-Ping Wang, and Cang-Yuan Guo, **"Modified Booth Multipliers with a Regular Partial Product Array,"** *IEEE Transactions on Circuits and Systems—II: Express Briefs*, Vol. 56, No. 5, pp. 404-408, 2009.
- [11] Ravindra P. Rajput, M.N. Shanmukha Swamy, **"High speed Modified Booth Encoder multiplier for signed and unsigned numbers,"** *14th International Conference on Modelling and Simulation*, pp. 649- 654, 2012.
- [12] Fathi, S. Azizian, R. Fathi, H.G. Tamar, **"Low latency, glitch-free booth encoder-decoder for high speed multipliers,"** *IEICE Electronics Express*, Vol. 9, No. 16, pp. 1335-1341, 2012.
- [13] A. Fathi, S. Azizian, Kh. Hadidi, A. Khoei, **"Ultra High Speed Modified Booth Encoding Architecture for High Speed Parallel Accumulations,"** *IEICE transactions on electronics*,

Vol. 95, No. 4, pp. 706-709, 2012.

- [14] Honglan Jiang, Jie Han, Fei Qiao, and Fabrizio Lombardi, **"Approximate Radix-8 Booth Multipliers for Low-Power and High-Performance Operation,"** *IEEE Transactions on Computers*, Vol. 65, No. 8, pp. 2638-2644, Aug 2016.
- [15] Saurabh Katariya, and Manish Singhal, **"A Hybrid 4 bit Radix-4 Low Power Booth Multiplier with High Performance,"** *IEEE International Conference on Recent Advances and Innovation in Engineering (ICRAIE -2016),* Dec 23-25, 2016.
- [16] A N Nagamani, R Nikhil, Manish Nagaraj, and Vinod Kumar Agrawal, **"Reversible Radix-4 Booth Multiplier for DSP Applications,"** *International Conference on Signal Processing and Communications (SPCOM)*, 2016.
- [17] Alberto A. Del Barrio, and Rom´an Hermida, **"A Slack-based Approach to Efficiently Deploy Radix 8 Booth Multipliers,"** *IEEE Design, Automation & Test in Europe Conference & Exhibition (DATE)*, 2017.
- [18] Weiqiang Liu, Liangyu Qian, Chenghua Wang, Honglan Jiang, Jie Han, and Fabrizio Lombardi, **"Design of Approximate Radix-4 Booth Multipliers for Error-Tolerant Computing,"** *IEEE*

*Transactions on Computers*, Vol. 66, No. 8, pp. 1435- 1441, 2017.

- [19] Ali Rahnamaei, Gholamreza Zare Fatin, and Abdollah Eskandarian, **"High speed Radix-4 Booth scheme in CNTFET technology for high performance parallel multipliers,"** *Int. J. Nano Dimens*., Vol.10, No. 3, pp. 281-290, 2019.
- [20] Amir Fathi, Sarkis Azizian, Khayrollah Hadidi, Abdollah Khoei and Amin Chegeni, **"CMOS Implementation of a Fast 4-2 Compressor for Parallel Accumulations,"** *2012 IEEE International Symposium on Circuits and Systems (ISCAS*), pp. 1476-1479, May 2012.
- [21] Amir Fathi, Behbood Mashoufi, and Sarkis Azizian, **"Very fast, high-performance 5-2 and 7-2 compressors in CMOS process for rapid parallel accumulations,"** *IEEE Transactions on Very Large Scale Integration Systems*, Vol. 28, No. 6, pp. 1403- 1412, 2020.
- [22] Amir Fathi, Sarkis Azizian, Khayrollah Hadidi and Abdollah Khoei, **"A Novel and Very Fast 4-2 Compressor for High Speed Arithmetic Operations,"** *IEICE transactions on electronics*, Vol. 95, No. 4, pp. 710-712, April 2012.