A Software Defined Radio Transceiver Based on Dynamic Partial Reconfiguration

Sherif Hosny\(^1\), Eslam Elnader\(^1\), Mostafa Gama\(^2\), Abdelrhman Hussien\(^2\), Ahmed H. Khalil\(^2\), and Hassan Mostafa\(^{2,3}\)

\(^1\)Mentor Graphics.
\(^2\)Electronics and Communications Engineering Department, Cairo University, Egypt.
\(^3\)Nanotechnology Department at Zewail City for Science and Technology, Cairo, Egypt.

\{sherif.hosny@mentor.com, islam.elnader@mentor.com, mostafa525.ma@gmail.com, abdelrhmanhussien31@gmail.com, ahmed.hussien60@gmail.com, hmostafa@uwaterloo.ca\}

Abstract—Dynamic Partial Reconfiguration (DPR) has been used extensively over the past few years to reconfigure Field Programmable Gate Arrays (FPGAs) at run time. With the aid of DPR, a multi-standard Software Defined Radio (SDR) system can be implemented with substantial savings in power and area. In this paper, an SDR supporting five wireless communication systems (Bluetooth, Wi-Fi, 2G, 3G, and LTE) is implemented on the same reconfigurable hardware. A test environment is established on a Zynq-7000 to measure the effectiveness of the technique. The total system area and power consumption are compared with and without DPR. Using DPR, this work reduces area and power by 10.19% and 76.71%, respectively, with an average switching time of 3.49 ms.

Keywords—Software Defined Radio, Dynamic Partial Reconfiguration, Field Programmable Gate Array, and System on Chip.

I. INTRODUCTION

Modern wireless communication systems are witnessing a new era of high data rates and, consequently, higher battery power consumption in mobile devices due to the demanding baseband signal processing. Researchers continually try to minimize the area and power consumed by signal processing in the various wireless communication standards with the aid of different algorithms and software techniques.

DPR is a promising technology that allows a specified partition of the FPGA to be reconfigured at run time, which helps in implementing a multi-standard SDR [1]. Such an SDR must address the reconfiguration time of the specified partition and the switching between different standards, so that the baseband signal processing is performed without affecting the overall performance of any of the standards [2].

An SDR using DPR technology allows the same specified hardware resources on the FPGA to be used for the different wireless communication standards implemented in mobile phones. This implies that only the specified partition on the FPGA is used for the baseband signal processing of the different standards, saving a significant amount of power and area. An implementation of dynamic cognitive radios with a high reconfiguration speed on the Zynq FPGA is presented in [3], where DPR is used to switch between two baseband standards: Digital Video Broadcasting Cable (DVB-C) and Satellite (DVB-S).

This paper is organized as follows: Section II reviews related work. Section III gives details about the overall system implementation. Section IV describes the FPGA environment, followed by the DPR technique in Section V. Simulation results are presented in Section VI. Finally, Section VII concludes the paper and outlines future work.

II. BACKGROUND AND PREVIOUS WORK

The DPR technique is used to switch between different configurations of LTE OFDM modulators in [4]. The variations are based on the IFFT size, number of subcarriers, cyclic prefix, and window length. The design, implemented on a Virtex-7, is divided into four reconfigurable partitions, while a single static partition is kept for the FFT. A similar design for the LTE FFT is proposed in [5], where the configuration depends on the FFT size.

The dynamic cognitive radio proposed in [6] implements the physical layer on the FPGA Programmable Logic (PL) and the Medium Access Control (MAC) layer on the ARM processor. Switching between different baseband modules is performed using a custom partial reconfiguration controller to achieve high reconfiguration speed. A Virtex-7 is used to host the physical layer blocks.

The design proposed in [7] is limited to three transmitter chains: Wi-Fi, 3G, and LTE. The contribution of this work is an SDR that implements the physical layer of five transceiver chains (Bluetooth V2.0, IEEE 802.11a, GSM, UMTS, and LTE) on the same reconfigurable hardware. Area and power optimization techniques are used to enhance the system performance.

III. SDR TRANSCEIVER DESIGN

The first row in Figure 1 shows the implemented physical layer blocks of the Bluetooth chain as described in [8]. The Wi-Fi blocks are listed in the second row [9]. The third, fourth, and fifth rows show the blocks of the 2G, 3G, and LTE chains as described in [10], [11], [12]. The grey blocks are new, optimized versions of the blocks implemented in [7]. The figure shows only the physical layer implementation; data is transferred to and from the MAC layer.

The Wi-Fi chain, illustrated in [9], has many Modulation Coding Schemes (MCS). The implemented chain supports six of them, MCS1 to MCS6. Table I shows the differences between the implemented schemes. The 3G chain also has variations in the type of CRC block used (CRC8, CRC12, CRC16, and CRC24). Only a single modulation scheme is implemented for Bluetooth, 2G, and LTE.

<table>
<thead>
<tr>
<th>MCS Number</th>
<th>Puncturing</th>
<th>Interleaver</th>
<th>Mapper</th>
</tr>
</thead>
<tbody>
<tr>
<td>MCS1</td>
<td>None</td>
<td>\(N_{CBPS} = 48\)</td>
<td>BPSK</td>
</tr>
<tr>
<td>MCS2</td>
<td>\(r = 3/4\)</td>
<td>\(N_{CBPS} = 48\)</td>
<td>BPSK</td>
</tr>
<tr>
<td>MCS3</td>
<td>None</td>
<td>\(N_{CBPS} = 96\)</td>
<td>QPSK</td>
</tr>
<tr>
<td>MCS4</td>
<td>\(r = 3/4\)</td>
<td>\(N_{CBPS} = 96\)</td>
<td>QPSK</td>
</tr>
<tr>
<td>MCS5</td>
<td>None</td>
<td>\(N_{CBPS} = 192\)</td>
<td>16-QAM</td>
</tr>
<tr>
<td>MCS6</td>
<td>\(r = 3/4\)</td>
<td>\(N_{CBPS} = 192\)</td>
<td>16-QAM</td>
</tr>
</tbody>
</table>

Each block in every chain has three input and three output signals. The input signals are data in, valid in, and enable. The output signals are data out, valid out, and finished. The valid out signal is set when the output is ready and is connected to the valid in signal of the succeeding block. The finished and enable signals synchronize adjacent blocks: finished is a feedback signal, connected to the enable input of the preceding block, that is set when the block is ready to receive data from the preceding block, as shown in Figure 2.
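The handshake described above can be illustrated with a small cycle-based behavioural model. This is only a sketch under stated assumptions, not the RTL of this work: the class `Block`, the function `run_chain`, the per-block latencies, and the per-word functions are all hypothetical, and the sink is assumed to be always ready.

```python
class Block:
    """Behavioural model of one chain block (hypothetical, not the paper's RTL)."""

    def __init__(self, latency, func):
        self.latency = latency    # assumed processing time in clock cycles
        self.func = func          # assumed per-word transformation
        self.word = None          # word currently held by the block
        self.remaining = 0        # cycles left until the held word is ready

    @property
    def finished(self):
        # Feedback signal: high when the block can accept a new word.
        return self.word is None

    @property
    def valid_out(self):
        # High when the held word has been fully processed.
        return self.word is not None and self.remaining == 0

    def accept(self, data_in):
        self.word = self.func(data_in)
        self.remaining = self.latency

    def release(self):
        data_out, self.word = self.word, None
        return data_out

    def tick(self):
        if self.remaining > 0:
            self.remaining -= 1


def run_chain(blocks, words, max_cycles=1000):
    """Push words through the chain and return what the sink receives."""
    pending, received = list(words), []
    for _ in range(max_cycles):
        # The sink is assumed to be always ready to take the last block's output.
        if blocks[-1].valid_out:
            received.append(blocks[-1].release())
        # Walk from the end of the chain towards the source.
        for up, down in zip(reversed(blocks[:-1]), reversed(blocks[1:])):
            enable = down.finished          # finished drives enable of the preceding block
            if up.valid_out and enable:     # valid out drives valid in of the next block
                down.accept(up.release())
        if pending and blocks[0].finished:
            blocks[0].accept(pending.pop(0))
        for block in blocks:
            block.tick()
        if not pending and all(block.finished for block in blocks):
            break
    return received


# A fast block feeding a slow block: the fast block stalls until the slow one is finished.
chain = [Block(latency=1, func=lambda x: x + 1), Block(latency=3, func=lambda x: x * 2)]
result = run_chain(chain, [1, 2, 3])
```

In this model the faster upstream block simply holds its output until the downstream block raises finished, so no word is dropped or reordered even though the two blocks run at different rates.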

Each chain works across multiple clock domains in order to match the input rate with the desired output rate. Dual-clock RAMs are used to overcome the Clock Domain Crossing (CDC) issues that lead to metastability. However, matching all clocks in each chain to avoid jitter is another challenge. To overcome it, a customized clock distribution network is integrated in each chain as an RTL design. The network accepts a single clock source from the ARM processor and generates identical clones for each system.

A parameterization technique is used to design the various modulation schemes of the Wi-Fi chain. Since the modulation schemes differ only in the type of puncturing, interleaving, or mapping, all blocks are implemented under the same source directory and switching between them is performed using simple parameters. The same approach is used in the 3G chain to switch between the different CRC blocks.
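As a rough software analogue of this parameterization, the six Wi-Fi schemes of Table I can be captured in a single lookup structure, so that selecting a scheme amounts to selecting one set of parameters. The design in this work is RTL; the Python names below (`McsConfig`, `MCS_TABLE`, `select_mcs`) are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class McsConfig:
    """Parameters that distinguish one Wi-Fi scheme from another (values from Table I)."""
    puncturing: Optional[str]  # punctured code rate, or None when no puncturing is applied
    n_cbps: int                # interleaver size: coded bits per OFDM symbol
    mapper: str                # constellation used by the mapper


# One entry per implemented scheme, MCS1 to MCS6.
MCS_TABLE = {
    1: McsConfig(None, 48, "BPSK"),
    2: McsConfig("3/4", 48, "BPSK"),
    3: McsConfig(None, 96, "QPSK"),
    4: McsConfig("3/4", 96, "QPSK"),
    5: McsConfig(None, 192, "16-QAM"),
    6: McsConfig("3/4", 192, "16-QAM"),
}


def select_mcs(index: int) -> McsConfig:
    """Return the parameter set for the requested scheme."""
    if index not in MCS_TABLE:
        raise ValueError(f"MCS{index} is not one of the implemented schemes")
    return MCS_TABLE[index]


config = select_mcs(4)
```

Because every scheme shares the same blocks and differs only in these three parameters, adding or switching a scheme does not require a separate code base.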

The basic idea of SDR is to have multiple communication standards share the same hardware under software control. To make this feasible, corresponding blocks in each chain should occupy almost the same area and consume almost the same power. The following area and power optimization techniques are deployed:

1) **Finite State Machine (FSM) extraction**: In order to achieve the best allocation for complex FSMs, the synthesizer must recognize them. This is achieved by writing the RTL code using the Xilinx FSM template.

2) **Using DSP Primitives**: Complex mathematical operations are mapped to DSPs either by using Xilinx RTL attributes or by instantiating the DSP primitive explicitly.

3) **Synthesizing in BRAMs**: Most of the chain blocks contain memories to store the symbol values. In order to save area, these memories are synthesized in BRAMs instead of LUTs.

4) **Implementing customized DFT**: The DFT IP offered by Xilinx is 24-point, whereas only a 14-point DFT is required, so using the IP results in high power consumption. A customized 14-point DFT is implemented instead to save power and area.

5) **Synthesis options**: Specific synthesis options are selected in order to maximize area optimization.

Figure 3 shows the percentage of area reduction in the Wi-Fi, 3G, and LTE transmitters implemented in this work compared to the design in [7]. According to [13], the minimum area of the Reconfigurable Partition (RP) is 400 LUTs, or 10 BRAMs, or 10 DSPs. Since each slice in Virtex-7 contains 4 LUTs, 400 LUTs correspond to 100 slices, so one BRAM or one DSP is weighted as 10 slices. The area in terms of slices is therefore given by the following equation:

\[
A_{Slices} = N_{Slices} + 10 \times N_{BRAMs} + 10 \times N_{DSPs} \tag{1}
\]

Table II summarizes the utilized area of all transceiver chains in terms of slices, BRAMs, and DSPs.

<table>
<thead>
<tr>
<th>Transceiver Chain</th>
<th>Slices</th>
<th>BRAMs</th>
<th>DSPs</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bluetooth</td>
<td>594</td>
<td>7</td>
<td>0</td>
</tr>
<tr>
<td>Wi-Fi</td>
<td>1576</td>
<td>7.5</td>
<td>18</td>
</tr>
<tr>
<td>GSM</td>
<td>400</td>
<td>4.5</td>
<td>0</td>
</tr>
<tr>
<td>UMTS</td>
<td>1324</td>
<td>21.5</td>
<td>3</td>
</tr>
<tr>
<td>LTE</td>
<td>3377</td>
<td>34</td>
<td>37</td>
</tr>
</tbody>
</table>
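As a worked example, Equation 1 can be applied to each row of Table II to obtain a single equivalent-slice figure per chain. The sketch below uses only the table values and the weights of Equation 1; the names `TABLE_II` and `equivalent_slices` are illustrative, and the figures cover the chains alone, without any static logic or reconfiguration overhead.

```python
# Resource usage per transceiver chain, copied from Table II: (slices, BRAMs, DSPs).
TABLE_II = {
    "Bluetooth": (594, 7, 0),
    "Wi-Fi": (1576, 7.5, 18),
    "GSM": (400, 4.5, 0),
    "UMTS": (1324, 21.5, 3),
    "LTE": (3377, 34, 37),
}

# Weights from Equation 1: one BRAM or one DSP counts as 10 slices.
SLICES_PER_BRAM = 10
SLICES_PER_DSP = 10


def equivalent_slices(slices, brams, dsps):
    """Area of a chain expressed in slices, following Equation 1."""
    return slices + SLICES_PER_BRAM * brams + SLICES_PER_DSP * dsps


areas = {name: equivalent_slices(*usage) for name, usage in TABLE_II.items()}
largest = max(areas, key=areas.get)

if __name__ == "__main__":
    for name, area in areas.items():
        print(f"{name:10s} {area:8.0f} equivalent slices")
    print("Largest chain:", largest)
```

Under this metric LTE is the largest chain at 4087 equivalent slices, followed by Wi-Fi (1831), UMTS (1569), Bluetooth (664), and GSM (445).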

IV. FPGA PROTOTYPING

The test environment used in [7] is deployed in this work as well, with some modifications to support testing of the five transceivers and to improve the overall process. Each communication standard specifies the frequency on which the transmitter modulates its symbols. Each chain has its own main input frequency, from which the required distributed clocks are generated in order to mimic the standard. The processor offers up to 4 different clocks. The main operating clocks provided by the processor to feed the communication chains are:

1) **200 MHz**: Used for the AXI bus and the 2G, Wi-Fi, and LTE chains.
2) **100 MHz**: Used for the Internal Configuration Access Port (ICAP).
3) **55.5 MHz**: Used for the Bluetooth chain.
4) **15.3 MHz**: Used for the 3G chain.

Testing the five chains is performed by running a C program that uses Xilinx APIs to apply the input data to each system and extract the results. Figure 4 shows the program flowchart. Its states are listed below:

- **Hardware Initialization**: The system is reset, then the processor sends the required parameters to the DMAs, the input interfaces, and the Xilinx Partial Reconfiguration Controller (PRC) [14].
- **Reconfiguring The Chosen System**: The processor triggers the PRC with the chosen system to be loaded.
- **Testing The System**: The test environment flow is launched.

V. DYNAMIC PARTIAL RECONFIGURATION

The system is migrated to DPR, while keeping the same test environment, by replacing the five systems (2G, 3G, LTE, Wi-Fi, and Bluetooth) with two black boxes, one for the transmitter and one for the receiver of the desired chain.

Among the DPR techniques discussed in [14], the Xilinx PRC is used because its throughput, when using the ICAP to communicate with the FPGA configuration memory, is close to the ideal throughput of 400 MB/s. Figure 5 shows an overview of the overall system.

The three metrics that measure the effectiveness of deploying the SDR system using the DPR flow are: area, power, and switching time. The switching time is the bottleneck of the technique since the system should switch rapidly between chains in order to achieve performance similar to the case of no DPR.

The size of the bit stream file is dependent on the allocated design area. The switching time of the Reconfigurable Partition (RP) is dependent on the size of the partial bit file. Thus, area optimization is one of the challenges to achieve small switching time. The optimization techniques mentioned earlier are used to achieve the best performance.

Since the maximum clock frequency of the ICAP is 100 MHz and the width of its data bus is 32 bits, the ideal switching time is calculated using the following equation:

\[
T = \frac{\text{Partial Bit File Size in Bits}}{\text{Bus Width} \times \text{Max Clock Frequency}} \tag{2}
\]

In real life, a wireless transmitter and receiver communicate remotely. The proposed partitioning approach therefore uses a single RP for the transmitter and another for the receiver. Since LTE is the most complex transceiver chain, it has the largest allocated area, so the two partitions are sized to fit all resources required by the LTE transceiver. The sizes of the partial bit files of the transmitter and the receiver are 523 KB and 837 KB, respectively. The switching time is measured in the C code with the aid of timers. The actual switching time for all chains is 3.49 ms, which is close to the theoretical value of 3.48 ms obtained by adding the calculated switching times of the transmitter and the receiver. Figure 6 shows the floorplanning of the allocated RPs on the FPGA.

![Fig. 6: SDR design floorplan](image)
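The theoretical switching time can be checked by evaluating Equation 2 with the reported partial bit file sizes. The sketch below assumes that 1 KB means 1024 bytes, which the text does not state explicitly; the function and constant names are illustrative.

```python
ICAP_BUS_WIDTH_BITS = 32     # ICAP data bus width stated in the text
ICAP_MAX_CLOCK_HZ = 100e6    # maximum ICAP clock frequency stated in the text
BYTES_PER_KB = 1024          # assumption: binary kilobytes


def ideal_switching_time_s(partial_bit_file_kb):
    """Ideal reconfiguration time in seconds for one partition, following Equation 2."""
    size_bits = partial_bit_file_kb * BYTES_PER_KB * 8
    return size_bits / (ICAP_BUS_WIDTH_BITS * ICAP_MAX_CLOCK_HZ)


tx_ms = ideal_switching_time_s(523) * 1e3   # transmitter partition
rx_ms = ideal_switching_time_s(837) * 1e3   # receiver partition
total_ms = tx_ms + rx_ms

if __name__ == "__main__":
    print(f"Transmitter: {tx_ms:.2f} ms")
    print(f"Receiver:    {rx_ms:.2f} ms")
    print(f"Total:       {total_ms:.2f} ms")
```

Under this assumption the transmitter and receiver contribute about 1.34 ms and 2.14 ms, which sum to the 3.48 ms theoretical value quoted above. The denominator, 32 bits at 100 MHz, is also the 400 MB/s ideal ICAP throughput.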

VI. SIMULATION RESULTS

Figure 7 compares the total utilized area of all systems, calculated using Equation 1, with and without DPR. The occupied area in the DPR case is the area of the largest chain (the LTE chain). The DPR technique reduces the total system area by 10.19%. Power measurements are performed for the worst-case scenario, where four chains are working simultaneously. The estimated average power with DPR is reduced by 76.71% compared to the case without DPR.

VII. CONCLUSION AND FUTURE WORK

The DPR technique proves effective in reducing the allocated area and the power consumption compared to current transceivers in mobile devices, with a reasonable switching time overhead. The proposed approach reduces the total system area and power by 10.19% and 76.71%, respectively. For system completeness, the data link and MAC layers will be implemented in future work.

VIII. ACKNOWLEDGEMENT

This work was partially funded by Mentor Graphics, ONE Lab at Cairo University, and Zewail City of Science and Technology.

REFERENCES


