FFT Implementation With No Data Scaling
From Texas Instruments Embedded Processors Wiki
Contents |
Introduction
The FFT (DSP_fft16x16) and iFFT (DSP_ifft16x16) implementation provided with the C64x+ DSPLIB apply scaling of data to avoid overflow. See below from the user’s guide.
All stages are radix-4 except the last one, which can be radix-2 or radix-4, depending on the size of the FFT. All stages except the last one scale by two the stage output data
It is desirable in certain use cases that the data scaling is not applied in the FFT routines. This article suggests modifications to the provided FFT source in the DSPLIB such that the data is not scaled. Also, an example is provided to demonstrate the affect of suggested change.
Suggested Change
The change to both the routines (DSP_fft16x16 and DSP_ifft16x16) is similar. The below description suggests modifications to the serial assembly (SA) implementation of the kernels. The kernels are located at:
- [DSPLIB_INSTALLATION_DIR]\dsplib_v210\src\DSP_fft16x16\DSP_fft16x16_sa.sa
- [DSPLIB_INSTALLATION_DIR]\dsplib_v210\src\DSP_ifft16x16\DSP_ifft16x16_sa.sa
Change 1: Identify the below code in the SA files
;----------------------------------------------------------;
; Compute first set of outputs: ;
; ;
; x0[0]= xh0_0 + xh20_0 + 1 >> 1 ;
; x0[1]= xh1_0 + xh21_0 + 1 >> 1 ;
; x0[2]= xh0_1 + xh20_1 +1 >> 1 ;
; x0[3]= xh1_1 + xh21_1 +1 >> 1 ;
;----------------------------------------------------------;
AVG2.2 B_xh1_0_xh0_0, A_xh21_0_xh20_0, B_x_1o_x_0o
AVG2.2 B_xh1_1_xh0_1, A_xh21_1_xh20_1, B_x_3o_x_2°
Update the code to:
ADD2.2 B_xh1_0_xh0_0, A_xh21_0_xh20_0, B_x_1o_x_0o
ADD2.2 B_xh1_1_xh0_1, A_xh21_1_xh20_1, B_x_3o_x_2°
Note replacement of AVG2 instruction with ADD2
Change 2: Identify the below code in the SA files
;---------------------------------------------------------;
; The following code computes intermediate results for: ;
; ;
; si10' = -si10 twiddle table has -sin factors ;
; ;
; x2[h2 ] = (co10 * xt0_0 + si10'* yt0_0 + 0x8000) >> 16 ;
; x2[h2+1] = (co10 * yt0_0 - si10'* xt0_0 + 0x8000) >> 16 ;
; x2[h2+2] = (co11 * xt0_1 + si11'* yt0_1 + 0x8000) >> 16 ;
; x2[h2+3] = (co11 * yt0_1 - si11'* xt0_1 + 0x8000) >> 16 ;
;---------------------------------------------------------;
CMPYR .M1 A_co10_si10, B_yt1_0_xt1_0, A_xh2_1_0;
CMPYR .M1 A_co11_si11, B_yt1_1_xt1_1, A_xh2_3_2;
;---------------------------------------------------------;
; ;
; x2[l1 ] = (co20 * xt1_0 + si20'* yt1_0 + 0x8000) >> 16 ;
; x2[l1+1] = (co20 * yt1_0 - si20'* xt1_0 + 0x8000) >> 16 ;
; x2[l1+2] = (co21 * xt1_1 + si21'* yt1_1 + 0x8000) >> 16 ;
; x2[l1+3] = (co21 * yt1_1 - si21'* xt1_1 + 0x8000) >> 16 ;
; ;
; These four results are retained in registers and a ;
; double word is formed so that it can be stored with ;
; one STDW. ;
;---------------------------------------------------------;
; This equation ONLY has minus sign for x, y components
CMPYR .M1 A_co20_si20, A_myt0_0_mxt0_0, A_xl1_1_0;
CMPYR .M1 A_co21_si21, A_myt0_1_mxt0_1, A_xl1_3_2;
;---------------------------------------------------------;
; The following code computes intermediate results for: ;
; ;
; x2[l2 ] = (co30 * xt2_0 + si30'* yt2_0 + 0x8000) >> 16 ;
; x2[l2+1] = (co30 * yt2_0 - si30'* xt2_0 + 0x8000) >> 16 ;
; x2[l2+2] = (co31 * xt2_1 + si31'* yt2_1 + 0x8000) >> 16 ;
; x2[l2+3] = (co31 * yt2_1 - si31'* xt2_1 + 0x8000) >> 16 ;
;---------------------------------------------------------;
CMPYR .M2 B_co30_si30, B_yt2_0_xt2_0, B_xl2_1_0
CMPYR .M2 B_co31_si31, B_yt2_1_xt2_1, B_xl2_3_2
Update the code to:
;---------------------------------------------------------;
; The following code computes intermediate results for: ;
; ;
; si10' = -si10 twiddle table has -sin factors ;
; ;
; x2[h2 ] = (co10 * xt0_0 + si10'* yt0_0 + 0x8000) >> 16 ;
; x2[h2+1] = (co10 * yt0_0 - si10'* xt0_0 + 0x8000) >> 16 ;
; x2[h2+2] = (co11 * xt0_1 + si11'* yt0_1 + 0x8000) >> 16 ;
; x2[h2+3] = (co11 * yt0_1 - si11'* xt0_1 + 0x8000) >> 16 ;
;---------------------------------------------------------;
CMPYR1 .M1 A_co10_si10, B_yt1_0_xt1_0, A_xh2_1_0;
CMPYR1 .M1 A_co11_si11, B_yt1_1_xt1_1, A_xh2_3_2;
;---------------------------------------------------------;
; ;
; x2[l1 ] = (co20 * xt1_0 + si20'* yt1_0 + 0x8000) >> 16 ;
; x2[l1+1] = (co20 * yt1_0 - si20'* xt1_0 + 0x8000) >> 16 ;
; x2[l1+2] = (co21 * xt1_1 + si21'* yt1_1 + 0x8000) >> 16 ;
; x2[l1+3] = (co21 * yt1_1 - si21'* xt1_1 + 0x8000) >> 16 ;
; ;
; These four results are retained in registers and a ;
; double word is formed so that it can be stored with ;
; one STDW. ;
;---------------------------------------------------------;
; This equation ONLY has minus sign for x, y components
CMPYR1 .M1 A_co20_si20, A_myt0_0_mxt0_0, A_xl1_1_0;
CMPYR1 .M1 A_co21_si21, A_myt0_1_mxt0_1, A_xl1_3_2;
;---------------------------------------------------------;
; The following code computes intermediate results for: ;
; ;
; x2[l2 ] = (co30 * xt2_0 + si30'* yt2_0 + 0x8000) >> 16 ;
; x2[l2+1] = (co30 * yt2_0 - si30'* xt2_0 + 0x8000) >> 16 ;
; x2[l2+2] = (co31 * xt2_1 + si31'* yt2_1 + 0x8000) >> 16 ;
; x2[l2+3] = (co31 * yt2_1 - si31'* xt2_1 + 0x8000) >> 16 ;
;---------------------------------------------------------;
CMPYR1 .M2 B_co30_si30, B_yt2_0_xt2_0, B_xl2_1_0
CMPYR1 .M2 B_co31_si31, B_yt2_1_xt2_1, B_xl2_3_2
Note the CMPYR instruction has been replaced with CMPYR1 intrinsic The updates to the FFT routines can be incorporated in the application in two ways
- The DSPLIB SW can be recompiled so that the generated library includes the updated kernels. To do this, recompile the library project [DSPLIB_INSTALLATION_DIR] \dsplib_v210\dsplib64plus.pjt. The updated library will be generated at [DSPLIB_INSTALLATION_DIR]\Release\dsplib64plus_rebuild.lib.
- The updated kernels can be directly included in the application project. Including the updated kernels will override the kernels that are included in the dsplib library
Example Application
Find attached with this article an example that demonstrates the impact of the above suggested changes (fft_scaling_example.zip). The example should be unarchived at [DSPLIB_INSTALLATION_DIR]\dsplib_v210\example. Note the example assumes that the updated FFT and iFFT files are located at [DSPLIB_INSTALLATION_DIR]\dsplib_v210\src\DSP_fft16x16 and [DSPLIB_INSTALLATION_DIR] \dsplib_v210\src\DSP_ifft16x16. Following files are included:
- fft_example.pjt: Test project that demonstrates the impact of above suggested changes
- lnk.cmd: Linker command file
- fft_example.c: Test code implementation. Note the test file performs the following operations:
- Generates input data that is summation of Sine wave data of various freqs. The complex input data is separated into real and imag components for easier visualization (xin_real_16x16 and xin_imag_16x16)
- Generates twiddle factors for the FFT and iFFT kernels. Please note that the two kernels use different twiddle factors. Two separate twiddle factor kernels are provided with the DSPLIB release. For the FFT kernel, please use the function [DSPLIB_INSTALLATION_FOLDER]\dsplib_v210\src\DSP_fft16x16\gen_twiddle_fft16x16.c and for the iFFT kernel, please use the function [DSPLIB_INSTALLATION_FOLDER]\dsplib_v210\src\DSP_ifft16x16\gen_twiddle_ifft16x16.c
- Calls the FFT routine. The generated complex fft data is separated into real and imag components for easier visualization (y_real_16x16 and y_imag_16x16)
- Calls the iFFT routine. The generated complex output is separated into real and imag components for easier visualization (xout_real_16x16 and xout_imag_16x16)
- Images: The package includes images that help visualize the impact of the changes. Each image displays side by side three data buffers; xin_real_16x16, y_real_16x16 and xout_real_16x16. Three images are provided:
- FFTOutput_bothKernelsApplyScaling.JPG: Helps visualize the case where both FFT and iFFT routines apply scaling. This is the default operation
- FFTOutput_onlyIFFTApplyScaling.JPG: Helps visualize the case where only iFFT routines apply scaling. This is the case where the FFT routine has been updated to not apply any scaling
- FFTOutput_noScaling.JPG: Helps visualize the case where no scaling is applied. This is the case where both the FFT routine and the iFFT routine have been updated to not apply any scaling
Conclusions
Following conclusions can be derived:
- N = 256. Since 256 = 4^4, the processing will happen as follows: 3x radix 4 iterations in the main loop (scaling by 2) and 1x radix 4 iteration at the end. So the total scaling will be 2^3. This is because every stage applies scaling of 2 but the last stage doesn’t apply any scaling. Thus, we expect scaling by 2*2*2 = 8. The images confirm the understanding. Note the X Output magnitude increased by 8 times when scaling is removed from FFT and then from iFFT.
- When both the kernels are updated to not apply scaling, the X output magnitude is 256 times the input. This is correct as 1/N scaling is not applied overall. Thus, we expect X Output to be 256 times X Input
Thus, the suggested changes correctly remove the scaling from the two kernels
Comments
Comments on FFT Implementation With No Data Scaling

Hi. when debugging session im taking this error. _TI_EABI_ undefined. how can i build this example with code composer 3.3? how can enable eabi option? Thanks
--Ayhankm 03:02, 9 June 2011 (CDT)