C6000 CGT Optimization Lab - 1

From Texas Instruments Wiki
Jump to: navigation, search

Introduction

This lab demonstrates some common techniques to use when optimizing C code for C6000. It includes a description of the unroll and jam optimization.

The source files used are in this .zip file.

Presumptions

  • You have CCS v3.3 installed
  • You execute on a C64x+ system
  • You are familiar with basic CCS operations like loading a project, building a project, changing compiler options, loading and running code.

Add v6.1.5 Compiler Tools

  • Start at TI Express DSP Wiki (bookmark it!)
  • Click on "Code Generation Tools"
  • Click on "Compiler Releases"
  • Go to the link given there. Requires a my.ti.com account.
  • Download and install the package "C6000 Code Generation Tools v6.1.5 Windows"
  • OK to download any later 6.1.X version, if available
  • Tell CCS to use CGT v6.1.5 via Component Manager
  • Component Manager documentation located in CCS. Navigate to: Help | Contents | Before Starting | Component Manager

Notes on the Source

The function optimized is named Xcorr. It is in the lone C file tuning.c. Three versions are supplied in the source: original, development, and final. Make all of your changes to the development version, which is currently selected by setting the preprocessor symbol VERSION to DEVT_VERSION. The other two versions are supplied to show you where you start, and where you end up.

Initial Steps

  • Open tuning.pjt
  • Build
  • Run simply to insure everything is working properly

If you execute on a simulator, be patient! This first run could take 30-60 secs. The next ones will be much faster.

This lab was developed using the DM6443 Cycle Accurate Simulator and C6000 Compiler version 6.1.5. Cycle counts you see should be similar, though not exactly the same, to those shown here.

The initial results printed are: benchmark results : 15392593

After each of the steps given below, build and run to see the improvement.

Step 1: Build Options

Add these options:

  • -o2 : Optimization level 2
  • -s : Place optimizer generated comments in the assembly output
  • -mw : Place extra information about software pipeline loops in the assembly output
  • -mo : Place each section in its own sub-section

Remove this option:

  • -g : Generate debug information

Motivation for these option changes is given in the application note Hand-Tuning Loops and Control Code on the TMS320C6000 spra666 section 3.1.

Cycle count improves to 2757296.

Step 2: Change Type of sum

Make these source changes:

  • Insure <stdint.h> is included
  • Change type of sum from long long to int40_t

When the type of sum is long long (64-bits), the compiler must allocate a sequential register pair (like A5:A4) to contain it. This increases the number of registers the compiler must manage overall. In this case, it causes the compiler to make some poor register allocation decisions that lead to a loop carried dependency. It is better to use a type that can be contained in a single register. This becomes even more important when, in the next 2 steps, loop unrolling creates multiple copies of sum.

Use the int40_t type, defined in <stdint.h>, to indicate that a 40-bit wide type is needed. Of course, insure none of your results exceed 40-bits.

If your results permit, you could also use int32_t, and see similar improvement.

Cycle count improves to 1973988.

Step 3: Unroll and Jam

Make these source changes:

  • Add restrict keyword to pointers pdFirst and pdSecond
  • Add #pragma MUST_ITERATE(2,,2) to the outer loop
  • Add #pragma MUST_ITERATE(1) to the inner loop

The point of these changes is to enable an optimization called unroll and jam. This optimization exposes more parallelism in the inner loop.

Unroll and jam is best explained by example. Note these code fragments illustrate what the compiler can do automatically. Do not enter these changes in the source. Start with this:

for(i=0; i<iFirstLen; i++)
{
    sum = 0;
    for(j=0; j<iSecondLen; j++)
    {
        sum += pdFirst[i+j]*pdSecond[j];
    }
    pixCorrResult[i] = sum; 
}

The first step is to unroll the outer loop one time. That gives this:

for(i=0; i<iFirstLen; i += 2)
{
    sum = 0;
    for(j=0; j<iSecondLen; j++)
    {
        sum += pdFirst[i+j]*pdSecond[j];
    }
    pixCorrResult[i] = sum;                  // AAA
 
    sum = 0;
    for(j=0; j<iSecondLen; j++)
    {
        sum += pdFirst[i+1+j]*pdSecond[j];   // BBB
    }
    pixCorrResult[i+1] = sum; 
}

Then those two inner loops are "jammed" together:

for(i=0; i<iFirstLen; i += 2)
{
    sum0 = sum1 = 0;
    for(j=0; j<iSecondLen; j++)
    {
        sum0 += pdFirst[i+j]*pdSecond[j];
        sum1 += pdFirst[i+1+j]*pdSecond[j];  // BBB
    }
    pixCorrResult[i] = sum0;                 // AAA
    pixCorrResult[i+1] = sum1; 
}

Notice the two independent computations in the inner loop. Software pipelining can optimize such code with ease.

The compiler can perform the unroll and jam transformation automatically, if you give it a little bit of help. And that help is in the form of the minimal source changes given in the bulleted list at the start of this step.

Why is restrict needed? Notice how the first assignment to pixCorrResult[i] (marked AAA) is moved after the second computation with pdFirst and pdSecond (marked BBB). That reordering of memory references is safe only if the compiler knows that those pointers never refer to the same memory locations. Applying restrict to pdFirst and pdSecond is the only way to inform the compiler the memory references are independent. More information on the restrict keyword can be found in the Wiki article Restrict_Type_Qualifier, as well as section 4.1.2 of Hand-Tuning Loops and Control Code on the TMS320C6000 tidoc:spra666.

Why are the MUST_ITERATE pragmas needed? To perform unroll and jam, the compiler must know that the outer loop executes a multiple of 2 times, and that the inner loop executes at least once. These pragmas give the compiler exactly that information. The MUST_ITERATE pragma is described on page 19 of Hand-Tuning Loops and Control Code on the TMS320C6000 tidoc:spra666 .

Cycle count improves to 1521192.

Step 4: SIMD (Single Instruction Multiple Data) Operations

Make these source changes:

  • Add _nassert(((int)pdFirst % 8) == 0);
  • Add _nassert(((int)pdSecond % 8) == 0);
  • Change inner loop pragma to MUST_ITERATE(2,,2);
  • Change outer loop pragma to MUST_ITERATE(8,,8);

The first three changes are about enabling SIMD operations in the inner loop. The _nassert technique is discussed in section 4.1.4 of Hand-Tuning Loops and Control Code on the TMS320C6000 tidoc:spra666. The MUST_ITERATE on the inner loop indicates the loop always runs a multiple of 2 times. Because the outer loop is unrolled by 2, unrolling this inner loop by 2 means, overall, the loop is unrolled 4 times. What is so magic about the number 4? Well, the goal is to enable use of the 64-bit wide LDDW (or LDNDW) instructions to access memory. The pointers pdFirst and pdSecond point to 16-bit wide short values. Thus, 64/16 equals 4.

The MUST_ITERATE on the outer loop must be any value divisible by 2 to enable unroll and jam. Why 8? With values less than 8, the compiler generates multiple copies of the loop, each tailored to work best for a range of loop iteration counts that is checked at runtime. In this case, only the loop tailored for high loop iteration counts ever executes. The other loops amount to wasted code space. Feel free to experiment with higher and lower values on this MUST_ITERATE pragma to see the effects.

Cycle count improves to 682512.

Caution!

When you use mechanisms such as restrict, MUST_ITERATE, or _nassert, you are conveying extra information to the compiler. Always verify all such information is correct for every call. Suppose, for example, the outer loop really can execute less than 8 times. The compiler cannot catch that error, and it will result in incorrect code execution that is very difficult to debug.

Summary

This table summarizes the improvements.

Improvement Cycles Overall Delta
Initial 15392593    
Build Options 2757296 5.6 5.6
Change Type of Sum 1973988 7.8 1.4
Unroll and Jam 1521192 10.1 1.3
SIMD 682512 22.6 2.2
  • Improvement – What improvement is performed
  • Cycles – Number of cycles executed
  • Overall – How much improvement versus Initial
  • Delta – How much improvement versus the previous change