C2000 FFT: VCU, FPU or FixedPoint

From Texas Instruments Wiki

Introduction

This page describes the different FFT implementations on a C28x.

FFT Implementations

32-bit Fixed-Point
This implementation uses the on-chip 32-bit fixed-point math capabilities of the C28x fixed-point CPU. As a rule of thumb, an optimized 32-bit implementation takes ~20 cycles per FFT butterfly.
16-bit Fixed-Point
This implementation can use either the C28x fixed-point CPU or the C28x with VCU enhancements (C28x+VCU).
C28x Fixed-Point
This implementation uses the 16-bit math capabilities of the C28x fixed-point CPU. On the C28x CPU core alone, it takes ~16 cycles per FFT butterfly.
C28x with VCU
This implementation uses the 16-bit math capabilities of the C28x with VCU. The VCU provides optimized 16-bit complex math capabilities in addition to those of the fixed-point CPU. With the C28x+VCU enhancements, it takes ~5 cycles per FFT butterfly. There are currently two versions of the VCU: Type 0, more commonly referred to as VCU-I, and Type 2, referred to as VCU-II. The FFT is substantially faster on VCU-II, where a butterfly takes ~2.5 cycles on average.
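To make the "cycles per butterfly" figures more concrete, here is a minimal sketch of what one radix-2 butterfly computes in 16-bit (Q15) fixed-point, written in portable C. It is an illustration only: it is not the TI library code, it does not use any VCU instructions, and the type `cq15_t` and function `butterfly_q15` are hypothetical names. It assumes a radix-2 decimation-in-time structure with a 1/2 scale per stage, a twiddle factor of magnitude at most 1, and inputs of complex magnitude at most 1; saturation is omitted.

```c
#include <stdint.h>

/* Hypothetical complex Q15 type: real value = integer / 32768. */
typedef struct {
    int16_t re;
    int16_t im;
} cq15_t;

/*
 * One radix-2 decimation-in-time butterfly with a 1/2 scale:
 *   a' = (a + w*b) / 2
 *   b' = (a - w*b) / 2
 * Assumptions (not from the article): |w| <= 1 and |a|, |b| <= 1 in
 * complex magnitude, so no saturation logic is included.
 * Division is used instead of a right shift so the result is fully
 * defined for negative values in portable C.
 */
void butterfly_q15(cq15_t *a, cq15_t *b, cq15_t w)
{
    /* Complex product w*b, accumulated in 32 bits, scaled back to Q15. */
    int32_t t_re = ((int32_t)b->re * w.re - (int32_t)b->im * w.im) / 32768;
    int32_t t_im = ((int32_t)b->re * w.im + (int32_t)b->im * w.re) / 32768;

    /* Keep the original a, since a is overwritten before b is computed. */
    int32_t a_re = a->re;
    int32_t a_im = a->im;

    a->re = (int16_t)((a_re + t_re) / 2);
    a->im = (int16_t)((a_im + t_im) / 2);
    b->re = (int16_t)((a_re - t_re) / 2);
    b->im = (int16_t)((a_im - t_im) / 2);
}
```

For example, with a = 0.5 (16384, 0), b = 0.5 (16384, 0) and w = -j (0, -32768), the call leaves a = 0.25 - 0.25j (8192, -8192) and b = 0.25 + 0.25j (8192, 8192). The four multiplies and six additions in this function are the work that the per-butterfly cycle counts refer to.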
32-bit Floating-Point
This implementation uses the extended floating-point instruction set. It uses the 32-bit floating-point math capabilities of the C28x+FPU as well as the repeat block (RPTB) instruction. As a rule of thumb, it takes ~10 cycles per FFT butterfly. If the target is a floating-point device and 32-bit precision is required, this is the preferred implementation.
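The rule-of-thumb butterfly costs quoted so far (~20, ~16, ~5, ~2.5 and ~10 cycles) can be turned into rough whole-FFT estimates. The sketch below assumes a radix-2 complex FFT, which has (N/2)·log2(N) butterflies for a power-of-two N; the article does not state the radix, so treat that as an assumption. The function names are hypothetical, and the estimate covers butterflies only (no bit reversal, windowing, or magnitude and phase calculation).

```c
#include <stdint.h>

/*
 * Number of butterflies in an N-point radix-2 complex FFT:
 * (N/2) * log2(N). N is assumed to be a power of two.
 */
uint32_t fft_butterflies(uint32_t n)
{
    uint32_t stages = 0;
    uint32_t m = n;

    /* Count log2(n) by halving until 1 remains. */
    while (m > 1u) {
        m >>= 1;
        stages++;
    }
    return (n / 2u) * stages;
}

/*
 * Rough butterfly-only cycle estimate, using one of the
 * rule-of-thumb cycles-per-butterfly figures from this article.
 */
double fft_cycle_estimate(uint32_t n, double cycles_per_butterfly)
{
    return (double)fft_butterflies(n) * cycles_per_butterfly;
}
```

Under these assumptions a 1024-point FFT has 5120 butterflies, which gives roughly 102400 cycles for 32-bit fixed-point, 81920 for 16-bit on the C28x alone, 25600 on VCU-I, 12800 on VCU-II, and 51200 for 32-bit floating-point. These are order-of-magnitude guides, not benchmark results.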
32-bit Floating-Point with TMU
The Trigonometric Math Unit (TMU) is an extension of the 32-bit single-precision Floating Point Unit (FPU). It provides instructions that perform certain trigonometric and arithmetic functions in a cycle-efficient manner. The TMU-specific instructions speed up magnitude and phase calculations through efficient computation of square roots, divisions, and arc tangents. The TMU is enabled by setting the following compiler options:
 --float_support=fpu32 and --tmu_support=tmu0.
To have the compiler use the TMU from C code, the following additional option must also be set:
 --fp_mode=relaxed 
This causes the compiler to replace calls to standard C math library functions, such as sin or cos, with TMU instructions.



Conclusions

16-bit Implementation
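For a 16-bit FFT, the C28x+VCU implementation is the best option. On VCU-I, magnitude and phase calculations perform the same as on a fixed-point device, because VCU-I has no enhancements for these algorithms. VCU-II adds a single-cycle magnitude instruction for 16-bit fixed-point complex values but does not improve phase calculations.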
32-bit Fixed-Point FFT Performance
To improve the performance of a 32-bit fixed-point FFT:
  • Consider using a floating-point device. The FPU can double the performance. In addition, magnitude and phase calculations are faster because the FPU does a better job at this than 32-bit fixed-point math. The trade-off in resolution between a 32-bit fixed-point and 32-bit floating-point implementation is negligible.
  • If the application can tolerate a 16-bit implementation, then consider using the C28x+VCU. This would be faster compared to a 32-bit fixed-point implementation. The VCU does not, however, have instructions to improve the performance of a magnitude or phase calculation. These operations are best done in floating-point.
32-bit FPU vs 16-bit VCU
The performance difference between a 16-bit VCU and a 32-bit FPU implementation is not large. The VCU-I (Type 0) does not have enhancements to improve the performance of magnitude and phase calculations. VCU-II has a new instruction that computes the magnitude of a 16-bit fixed-point complex variable in a single cycle; it does not provide any improvement for phase calculations.
CLA
While the CLA itself is not well suited to a full FFT algorithm, it could be considered for magnitude and phase calculations, which would offload these operations from the main CPU. For example, on a device such as the 2806x, a floating-point FFT could be performed on the main C28x+FPU while the magnitude calculation is performed on the CLA.



Abbreviations

This is a list of some of the terms and abbreviations used in this article. For a more complete list please visit C2000 Terms and Abbreviations.

C28x
By itself it is the fixed-point CPU. It supports both 16-bit and 32-bit operations.
C28x + FPU
The C28x CPU with 32-bit floating-point extensions. Sometimes called C28x with FPU or shortened to FPU.
C28x + VCU
The C28x CPU with Viterbi, Complex Math and CRC extensions. Sometimes called C28x with VCU or shortened to VCU.
C28x + FPU + VCU
A C28x CPU with both the floating-point and VCU extensions.
C28x + FPU + TMU
The C28x CPU with 32-bit floating-point extensions including a Trigonometric Math Unit.
C28x + FPU + VCU + TMU
The C28x CPU with 32-bit floating-point extensions including a Trigonometric Math Unit, as well as VCU extensions.
CLA
Control Law Accelerator. Refer to:
CPU
Central Processing Unit. In C2000 we typically use this to refer to the main processor. For example, the C28x CPU.
FPU
Floating Point Unit. The FPU adds floating-point extensions to the main CPU instruction set.
VCU
Short for Viterbi, Complex Math and CRC Unit. The VCU adds instructions to support these operations to the main CPU instruction set.
TMU
Short for Trigonometric Math Unit. The TMU adds instructions to support certain trigonometric and arithmetic operations to the main CPU instruction set.