Advancements in low cost FPGAs and medium performance DSPs have brought this solution to the center of the receiver design target.
By Dyson Wilkes
Glossary of Acronyms
3GPP — Third Generation Partnership Project
Many factors influence the choice of components for implementing wireless BTSs. ASICs offer the best cost per device, but carry a high upfront cost, both in terms of development time and NRE. ASICs also lack the flexibility offered by DSPs and FPGAs. DSPs offer perhaps the most flexibility and the easiest development, but the cost of the high performance DSPs required for chip-rate processing can discourage their use.
One of the more demanding functions on the baseband module in a radio basestation is the receiver and, in particular, the de-spreading of a signal after it has been subjected to a multipath channel. The rake receiver is a common solution to this problem, and it offers a good example of the use of low cost FPGAs and low cost, medium performance DSPs. This article describes a rake receiver reference design developed for the 3GPP WCDMA standard and targeted for implementation on an FPGA. The primary goal of the reference design was to demonstrate how many rake fingers could be implemented in a low cost, DSP-oriented FPGA device, and to identify an approach that reduces the overall cost of the design.
The problem of reception of a signal subjected to a multipath channel has been widely discussed in the literature for many years. The WCDMA standard was designed based on one approach to solving this problem: spreading data symbols using a higher frequency bit sequence. The sequences are selected so that they have properties that allow the resulting signal to be distinguished from a delayed version of itself. This property makes it possible to receive the signals from a set of paths with different delays and combine them, resulting in improved reliability and performance.
Price and Green, who invented the technique in 1958, coined the "rake" nomenclature because this method of reception involves the use of several correlators, each one of which can be thought of as a finger, or tine, of a garden rake.
CDMA signal reception also requires performing two other key tasks. One is to find the delays along the strongest paths, and the other is to determine the effect on the signal as it travels along its paths. The first task is known as path searching; the second, channel estimation. Both functions have been omitted from the design; the first, because the hardware is not very complex, and the second, because there are many different solutions to the problem of channel estimation. Furthermore, the channel is often found to be changing relatively slowly (update times of about 5 to 10 ms are not unreasonable), so the estimation algorithms are typically run in software in a DSP rather than implemented in hardware.
Some key parameters for WCDMA are:
Chip rate (after spreading) = 3.84 MHz.
Typical over-sampling rate of data input to rake = 2 (giving a sample rate of 7.68 Ms/s).
Spreading factor from four to 256 for data channels, fixed at 256 for the control channel.
Data and control channels are transmitted on the uplink (from the mobile) at the same time.
BTS System Overview
The OBSAI system partitioning for a basestation in a mobile wireless system has been used to provide context for the design. This divides the major functions into modules with interfaces called RPs. The rake receiver is shown in context in Figure 1. Notice that it is one of the baseband processing functions. A key point is the amount of data that needs to be buffered in a full system with multiple radio modules connected to one baseband module. FPGA support for DDR interfaces and the SDRAM controller IP proves useful when realizing a complete solution. Such a solution would handle data from several data streams coming from the radio modules. The sources of these streams are the radio signals received on a number of different carriers on a number of different antennas. The term "A-C" refers to these signal sources. Once the received signal from an A-C has been mixed down to IF, digitized and down-converted, typically by a DDC, the data rate is 7.68 Ms/s, assuming two times over-sampling of the 3.84 MHz spread data.
Figure 1. Overview of a mobile basestation showing location of the rake receiver. To keep the figure clear, not all baseband functions are shown.
Figure 1 is primarily a functional view of the basestation, although it does show the physical partitioning of the radio and baseband modules. The next section examines one way in which the rake design can be combined with other components to form a simple working system.
Figure 2 shows how the rake, along with a DSP chip and DDR SDRAM, could form a fully functional baseband subsystem. The rake was designed with a simple memory-mapped register interface to facilitate control by a DSP. Registers within the rake engine control the allocation of fingers to a given user, hold path delay values and define the parameters (e.g., spreading and scrambling codes) for the channel being used. In this example, the channel estimation is done by software in the DSP so the channel correction data is written by the DSP to the engine via the register interface. This diagram does not show the searcher, but the correlation peaks from this would form the basis for the initial relative path delays for each of the fingers. Because path delay can vary over time, and the bit error rate is sensitive to errors in alignment of the sampling, the engine has a tracking mechanism that updates these delays.
The frame buffer would use a suitable external memory, because one WCDMA 10 ms frame requires nearly 1 Mb of memory, which is beyond the capacity of low-cost FPGAs. An FPGA that supports DDR interfacing and has SDRAM controller IP available is a viable option for a larger frame buffer if multiple 7.68 Ms/s data streams (from more than one A-C) are to be supported. There are FPGAs with DDR interfaces that can run as fast as DDR400 (a 200 MHz clock), so a 32-bit wide data interface could support 12 A-Cs without difficulty.
Figure 2. A simple system using the rake receiver.
Figure 2 also illustrates the simplicity of the interfacing to the rake, which offers flexible use of memories. The output FIFO could be implemented on-chip using block RAM. For example, 16 channels of 8-bit data with the lowest spreading factor (=4) would require two block sysRAMs. A small amount of logic would be needed to implement the FIFO addressing. The "Chan_out" signal, which indicates which channel the data are for, would be used to form part of the memory address. Separate FIFO pointers would need to be maintained for each channel, but these could be packed into one distributed RAM (see discussion about time-slicing below).
The basic rake finger is rather simple in operation. Summing a series of binary products of the complex input data and the corresponding complex de-spreading/de-scrambling code performs the correlation. Without knowledge of the delay along the path from the mobile, this correlation would produce nothing more than noise. As mentioned earlier, a searcher can be used on a known transmitted signal (the pilot symbols that are included in the control channel signal) to find correlation peaks. The strongest peaks show where the strongest paths are in the propagation channel. These peaks are then used to set up the relative delays on each of the fingers allocated to a given channel. Different propagation channel conditions will apply to different mobile users, so the optimum number of fingers will vary from user to user and during the connection with a given user. Thus, one of the features to consider in the design is the allocation of fingers to channels.
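The correlation performed by a single finger can be sketched in a few lines of Python. This is only an illustration, not the reference design's RTL: the function names, the toy four-chip code and the BPSK test symbols are all assumptions, and the combined de-spreading/de-scrambling code is represented as one complex sequence.

```python
# Illustrative sketch only; names and the toy code below are hypothetical.

def spread(symbols, code):
    """Spread each symbol by multiplying it with every chip of the code."""
    return [s * c for s in symbols for c in code]

def despread(chips, code, delay=0):
    """Correlate the chip stream with the conjugated code, one sum per symbol.

    `delay` is the path delay in chips, as a searcher would supply it.
    """
    sf = len(code)  # spreading factor
    energy = sum((c * c.conjugate()).real for c in code)
    out = []
    for start in range(delay, len(chips) - sf + 1, sf):
        acc = sum(chips[start + k] * code[k].conjugate() for k in range(sf))
        out.append(acc / energy)
    return out

CODE = [1 + 1j, -1 + 1j, 1 - 1j, 1 + 1j]   # toy code, spreading factor 4
SYMBOLS = [1, -1, 1]                        # BPSK data on the real axis

# A path that delays the signal by three chips.
received = [0, 0, 0] + spread(SYMBOLS, CODE)

aligned = despread(received, CODE, delay=3)     # finger set to the true delay
misaligned = despread(received, CODE, delay=0)  # finger with the wrong delay
```

With the correct delay the finger recovers the transmitted symbols; with the wrong delay the correlation output no longer matches them, which is why the searcher's delay estimates matter.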
The different delays along the paths need to be equalized at some point in the design so that the de-spread symbols can be reunited in time and combined into one, more accurate result. This calls for a buffer somewhere in the design, but where is the best place to put it? Figure 3 shows two alternative positions for the buffer. The first, called sample buffering, is to buffer the samples as they come in. This makes for a simple control scenario and is, perhaps, the most obvious solution. The amount of memory required per antenna-carrier is also reasonable. Assuming the delay spread is less than 33 µs, or 128 chip periods (33 µs is equivalent to a 10 km path difference), and 8-bit resolution for I and Q input data, the buffer will require 4096 bits (4 kb) of two-port memory: 128 chips × 2 samples per chip × 16 bits per complex sample.
The other alternative, shown in Figure 3 (bottom), is to buffer the de-spread symbols before correction and combining. This step may use more memory than the first option, depending on the number of fingers and the bit width of the output of the de-spreader. Assuming the worst case, in which all fingers are de-spreading data with a spreading factor of four and a bit width of 2 × 16 for the complex de-spread data, 16 kb of buffer are needed. Also, because the relative path delay for a finger has to be apportioned between the symbol buffer and the code generator, the control and tracking scheme becomes more complex. So far, it would seem the sample buffer is the better choice on two counts: memory size and simplicity.
Figure 3. Sample buffering (top) vs. symbol buffering (bottom).
One block has not yet been discussed. The interpolator is optional when it comes to designing rake receiver architecture, but it has a significant effect on performance. The interpolator calculates sample points in between those in the input data stream. This allows the de-spreading to be performed with the data and code in better alignment to each other, resulting in a better overall bit error rate. In the sample buffer architecture, one interpolator is needed per finger, whereas symbol buffering allows just one interpolator to be used for all fingers processing data from the same A-C. The decision appears to hinge on how complex the interpolator is, and whether the hardware performing this function can be shared across many fingers. In fact, it turns out this is only one factor that swings the choice towards symbol buffering. It will be seen that symbol buffering also increases the flexibility of the design.
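To make the interpolator's role concrete, here is a minimal sketch that assumes plain linear interpolation between adjacent input samples. The article does not say which interpolation filter the reference design uses, so the method, the function name and the default factor are all assumptions; a factor of 8 on a two-times over-sampled input gives 16 points per chip.

```python
# Hypothetical linear interpolator; the real design's filter is not specified.

def interpolate_linear(samples, factor=8):
    """Insert factor-1 evenly spaced points between each pair of input samples."""
    out = []
    for a, b in zip(samples, samples[1:]):
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    out.append(samples[-1])   # keep the final input sample
    return out

fine = interpolate_linear([0.0, 8.0, 0.0])
```

A finger can then pick whichever of the finer-spaced points lies closest to its ideal sampling instant.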
Figure 4. Rake engine with tracking.
Each path in the propagation channel imposes a different phase rotation on the signal. If these signals were simply added together, the result would be no better, and in some cases worse, than the single path result. It is necessary to de-rotate the de-spread signal from each finger before combining them. The channel estimator determines the amount of phase rotation. For each finger (i.e., path), the channel estimator provides a complex number representing the relevant component of the channel impulse response. Multiplying each complex symbol value by the complex channel correction value performs the channel correction. This complex multiplication requires four multiplications, an add and a subtract, per symbol, in a given finger. However, for a given WCDMA channel, because data are mapped only onto either the real or imaginary component, only two multiplications and an add/subtract are needed. The symbol data after channel correction is accumulated in the combine block across all the fingers allocated to a given channel.
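The correction and combining step can be sketched as follows for a channel whose data sit on the real axis. The sketch assumes the correction value is the complex conjugate of the channel estimate, which is one common choice (maximal ratio combining); the article does not state the exact value used, and the function name and test channel are hypothetical.

```python
import cmath

def correct_and_combine(finger_symbols, channel_estimates):
    """Sum Re{conj(h) * y} over fingers: two multiplies and one add per finger."""
    combined = 0.0
    for y, h in zip(finger_symbols, channel_estimates):
        combined += h.real * y.real + h.imag * y.imag
    return combined

# Two paths with different gains and phase rotations (assumed values).
h = [cmath.rect(0.8, cmath.pi / 3), cmath.rect(0.5, -2.0)]
tx_symbol = 1.0
finger_out = [hk * tx_symbol for hk in h]   # noiseless de-spread outputs

naive = sum(finger_out).real                # simple addition, no de-rotation
combined = correct_and_combine(finger_out, h)
```

In this example the naive sum is smaller than the de-rotated sum, because the second path's rotation partly cancels the first.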
Figure 5. MATLAB plot of tracking and symbol timing.
Another point to note is that the received data are sampled only at 7.68 Ms/s and the chip rate is half that (i.e., 3.84 MHz). Given that current low-cost FPGAs can run at speeds well in excess of 200 MHz, there should be plenty of opportunities to time-share hardware. Furthermore, the channel correction hardware uses multipliers that could be shared across a number of fingers if they could run fast enough. For example, in theory, one multiplier running at only 15.36 MHz can serve 16 fingers. In fact the multiplier has to work faster than this because the symbols arrive in the buffer at arbitrary times that depend on the relative path delays. Increasing the buffer size so that symbol data points were not overwritten before they were used could alleviate this. LUT-based multipliers tend not to be fast enough for this application, and they consume a lot of resource. Embedded DSP blocks offer a cost-efficient way to implement the channel correction function. They can run at high speed and can be shared across several fingers. In this case it was decided to keep the buffer size to a minimum, because availability of multipliers was not a limiting factor for this design in the chosen device.
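The 15.36 MHz figure and the headroom for time-sharing can be checked with a short calculation. This sketch assumes one correction operation per symbol per finger and uses hypothetical function names; it ignores the arrival-time effects described above.

```python
CHIP_RATE_HZ = 3_840_000

def correction_rate_hz(fingers, spreading_factor):
    """Aggregate symbol rate that a shared correction multiplier must sustain."""
    return fingers * CHIP_RATE_HZ // spreading_factor

def clocks_per_sample(clock_hz, sample_rate_hz=2 * CHIP_RATE_HZ):
    """Whole clock cycles available per input sample at a given clock speed."""
    return clock_hz // sample_rate_hz

worst_case = correction_rate_hz(16, 4)      # 15_360_000, lowest spreading factor
slots = clocks_per_sample(200_000_000)      # 26 cycles per 7.68 Ms/s sample
```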
In fact, the entire rake engine can be time-shared to implement a number of fingers. The limitations on the degree of time-slicing are the maximum clock speed for the hardware in the target device, and the storage capacity for the state of each finger. Furthermore, if pipelining is needed in a given part of the design to reach the target speed, additional state storage capacity may be needed. Hence, there is a point at which increasing the number of time-shared fingers yields rapidly diminishing returns.
Figure 6. Constellations before and after correction and combining.
Time-sharing requires storage of the state, and small, distributed memory blocks, which provide 16 bits of memory organized as one bit by 16 words, are an ideal solution. One such block uses the LUT in a slice, so it is an efficient way to implement small, disparate blocks of memory. Without this feature, the designer is faced with either trying to pack the state storage into one or more large block memories, or using slice flip-flops. The first proves awkward, because the state needs to be updated each cycle and the granularity of the state vectors is smaller than the memory width (i.e., four lots of 8-bit vectors packed into a 32-bit wide memory). If each vector is put into a separate memory, this soon uses up all the available RAMs on a given chip, so packing is the only solution. Using slice flip-flops consumes slices at 16 times the rate of using distributed RAM and, again, a given FPGA would soon be filled up.
Using the distributed RAM introduces a quantum of 16 in choosing how many fingers to time-share. It turns out that a 16-finger rake engine is the best trade-off for ease of implementation and efficient use of FPGA resources. This 16-finger rake engine can be duplicated a number of times on the chip to produce scalable solutions. As an example, Table 1 shows the capacity of the Lattice ECP-DSP family with respect to this design.
Figure 7. Input and output data.
The ECP33 implementation provides enough capacity to implement a 64-user system in two devices and leaves enough spare resource (about 9 k slices) to allow the implementation of an input interface, optional DDR interface and an output buffer.
Figure 4 shows the top level architecture of the rake engine with some control signals shown, but several have been left out for the sake of clarity. The blocks inside the boundary form the core of the rake engine, which is time-shared across 16 fingers. The interpolator is not time-shared, but its output can be shared among several (N) rake engine cores to implement N times 16 fingers that process the signal from a given A-C. The control block orchestrates the activity in the whole engine. It contains a counter that relates to which finger is being processed in a given time slice, and its value is used to address various small memories that contain the state in each domain of control for all the fingers. Other small memories hold parameters for each finger: one 16-word memory per parameter, such as the spreading factor, spreading code and scrambling code. The allocation of fingers to users is also stored as a number from zero to 15 to support the maximum number of users where only one finger is used.
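The time-slicing idea, in which one datapath serves 16 fingers while a counter addresses small per-finger state memories, can be modelled roughly as below. The class and method names are hypothetical, and the single accumulator stands in for whatever state a real finger would hold.

```python
FINGERS = 16   # matches the 16-word depth of one distributed RAM

class TimeSlicedAccumulator:
    """One shared adder plus a 16-word state memory, one word per finger."""

    def __init__(self):
        self.state = [0] * FINGERS   # models a 16-word distributed RAM
        self.slot = 0                # control counter: finger being processed

    def clock(self, value):
        """Process one time slice: read state, update it, move to next finger."""
        self.state[self.slot] += value
        self.slot = (self.slot + 1) % FINGERS

engine = TimeSlicedAccumulator()
for t in range(32):        # two full passes over all 16 fingers
    engine.clock(t)
```

Each finger sees only every sixteenth clock cycle, so the shared hardware must run 16 times faster than a dedicated finger would.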
Figure 8. Simulation of tracking in RTL design.
As mentioned in the design overview, the alignment of the sampling point with the incoming chips has a significant effect on the system performance. To ensure that any drift in the timing between the mobile and basestation, or error in the initial path delay estimation, does not degrade the performance, a tracking loop is included in the design. The tracking loop is formed by the TED and the NCO. The NCO accumulates the output of the TED block and, when this exceeds one chip, it asserts the chip step signal to increment that finger's pointer by two rather than one. If the accumulation underflows, the pointer is not incremented for that finger.
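A rough behavioural model of the NCO and chip step logic is sketched below. It assumes the timing offset is held as an integer number of sixteenths of a chip and that the TED output has already been scaled to an integer step smaller than one chip; the loop gain, the names and the integer representation are assumptions, not details of the reference design.

```python
PHASES_PER_CHIP = 16   # assumed resolution: sixteenths of a chip

def nco_update(phase, ted_step):
    """Accumulate the TED output and derive the chip step for this finger.

    Returns (new_phase, chip_step): chip_step is 1 normally, 2 when the
    accumulator overflows past one chip, and 0 when it underflows.
    Assumes abs(ted_step) < PHASES_PER_CHIP.
    """
    acc = phase + ted_step
    if acc >= PHASES_PER_CHIP:
        return acc - PHASES_PER_CHIP, 2   # pointer advances by two
    if acc < 0:
        return acc + PHASES_PER_CHIP, 0   # pointer is not incremented
    return acc, 1                         # normal cycle

pointer, phase = 100, 15
phase, step = nco_update(phase, 1)   # crosses a chip boundary
pointer += step
```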
Simulation of the Design
The design was developed with the aid of a MATLAB model of the transmitter and radio propagation channel and radio module. Some sample results are presented here to illustrate the principles of the rake receiver, followed by equivalent results from RTL simulation for comparison.
Tracking symbol timing errors
For each rake finger, the output of the early/late gate detector is used to adjust an NCO controlling the exact sampling point of each chip for that finger. Figure 5 shows the dynamic tracking of two independent rake fingers. The top one starts with an initial timing offset that is too advanced by one chip; the bottom one is too retarded by one chip. The x-axis in each is the symbol number and the spreading factor is 256.
The first column, early/late error output, shows how the error converges towards zero as time progresses.
The second column, chip fractional offset, is the fraction of a chip by which the data are being delayed. This offset is used to select one of the 16 over-samples coming out of the interpolator on each chip cycle. In both cases, this offset is seen to wrap around. To compensate for this, an additional "chip_step" signal is generated (not shown) to either advance by two steps if too retarded, or freeze if too advanced, the scrambling/spreading code generators on the next cycle.
The third column shows the final symbol output values for each finger, with the tracking circuitry correcting the timing offset. Notice the opening of the "eye" for the output data. This shows the profound effect that a small timing error can have on the performance.
The initial offset errors in these plots are used to illustrate the function of the tracker. In a real scenario a one chip offset would not be encountered, as the searcher would yield a more precise initial offset.
Multipath delay and rake finger timing
As described in the introduction, a signal from a terminal may be received at the base station along more than one signal path. Each signal path will potentially have a different delay and phase rotation associated with it. Figure 6 shows the reception and combining of two such multipath components of a signal within the rake receiver. The plots illustrate the beneficial effect of combining the outputs of more than one rake finger. The combining is done after the effect of the propagation channel path has been removed by de-rotation. The improved performance can be seen in the "combined" plot, where all the points have moved away from the y-axis, reducing the ambiguity in the symbol value.
Table 1. Capacity of the design in the ECP-DSP family.
The top "integration" plots show the two multipath components after they have been de-scrambled, de-spread and integrated over the symbol period. The middle row of the plot shows the output of the "correct" block of the rake receiver, after the channel phase correction and a weighting value for each multipath component have been applied. The final row shows the combined sum of the two de-rotated finger outputs; the effect of combining, taking into account the larger x-axis range, has been to widen the gap between the constellation clouds. Note that the y-axis (imaginary) data on this plot have been discarded as the particular DPDCH channel being decoded would only appear on the real (I) axis.
RTL Simulation Results
The last two figures show some results from an RTL simulation and illustrate the actual implementation of the design. Figure 7 shows just the input I and Q data. Two sets of data are multiplexed together, but it is possible to discern them in the analog step style waveform. These are the data for the current control channel frame and the previous data channel frame. Two rake engines in parallel process the data, and their inputs are shared at the top of the design that was simulated. The reason for the delay in the data channel is that its spreading factor is only known once the parallel control frame has been decoded.
A burst of output data can be seen on the "Data0_out_dpcch" signal while the "Chan0_out_dpcch" shows that this data is for channels 0, 2, 3 and then 1.
Figure 8 shows the point at which the delay for finger No. 8 is stepped by the tracker across a chip boundary. The 2-bit "chip step" signal is one for a normal cycle, but in this case it is zero, meaning that the path has shortened such that the delay has crossed a chip boundary. The transition from one to zero and back to one can be seen just after 3 ms, but the duration of the zero is too short to be seen. The tracking of finger No. 7 is also shown.
This paper has shown how a WCDMA rake receiver can be efficiently implemented in an FPGA device family. The implementation includes flexible interfacing, allowing it to be incorporated into a complete system solution. Support for DDR helps keep component count and cost to a minimum. The design is scalable, allowing it to be used in applications ranging from pico-basestations up to high performance baseband modules.
For example, a system supporting 64 users (256 control channel plus 256 data channel fingers) can be implemented with just two ECP33 low-cost devices (< $75 per device) leaving about 9 k LUTs for implementing, for example, interfaces and output buffering. Tracking of symbol timing, which is critical to performance, is included in this design. By comparison, implementing 512 fingers on a DSP, even if the de-spreading is done by a special instruction, would take about four high cost (< $200 per device) DSPs running flat out at 600 MHz.
One ASIC-based solution offers acceleration of the receive path chip processing for 64 users, but this capacity drops to 45 if they are all using a spreading factor of 16, and only supports 12 users with a spreading factor of four. This design supports 32 users on one chip, no matter what the spreading factor is. An FPGA implementation means there is considerably more flexibility compared to an ASIC solution, and significantly lower cost compared to an all DSP-based solution.
About the Author
Dyson Wilkes is a Product Planning Engineer at Lattice Semiconductor. Previously, he was a Project Manager at Infineon and before that worked at Ericsson in various roles. He is a Chartered Engineer and received his B.S. in Physics from Bath University in the UK.