
C6000 Compiler: Tuning Software Pipelined Loops

From Texas Instruments Wiki

Introduction

A particular area where the C6000 processor family shines is its ability to speed through looped code. This is quite advantageous in digital signal processing, image processing, and other mathematical routines that tend to be loop-centric. A technique called software pipelining contributes the biggest boost to looped-code performance. Software pipelining is enabled only at –o2 or –o3. On the C6000 variants C62x, C67x, and C64x, software pipelining is completely disabled when the code size flags –ms2 and –ms3 (see C6000 Compiler: Recommended Compiler Options) are used. On C64x+, C674x, and C66x cores, software pipelining is enabled if and only if the loop buffer (see section 2.1 of Hand Tuning Loops and Control Code on the TMS320C6000) can be used.

Without software pipelining, loops are scheduled so that loop iteration i completes before iteration i + 1 begins. Software pipelining allows iterations to be overlapped. Thus, as long as correctness can be preserved, iteration i + 1 can start before iteration i finishes. This generally permits a much higher utilization of the machine's resources than might be achieved from non-software-pipelined scheduling techniques.

In a software-pipelined loop, even though a single loop iteration might take s cycles to complete, a new iteration is initiated every ii cycles.

[Figure: a new loop iteration starts every ii cycles while earlier iterations are still completing (spra666, Figure 2)]

In an efficient software-pipelined loop, where ii < s, ii is called the initiation interval; it is the number of cycles between starting iteration i and iteration i + 1. ii is equivalent to the cycle count for the software-pipelined loop body. s is the number of cycles for the first iteration to complete, or equivalently, the length of a single scheduled iteration of the software-pipelined loop.
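These two quantities determine the loop's total run time: ignoring prolog/epilog effects, n iterations take roughly s + (n - 1) * ii cycles when pipelined, versus s * n cycles when each iteration must finish before the next begins. This arithmetic can be sketched as follows (the numbers in the usage note below are illustrative, not measurements):

```c
/* Rough cycle model for a software-pipelined loop: the first iteration
   takes s cycles, and a new iteration starts every ii cycles after it. */
int pipelined_cycles(int s, int ii, int n)
{
    return s + (n - 1) * ii;   /* iterations overlap */
}

int unpipelined_cycles(int s, int n)
{
    return s * n;              /* iterations run back-to-back */
}
```

For example, with s = 7, ii = 2, and n = 100, the pipelined loop takes roughly 7 + 99 * 2 = 205 cycles, versus 700 cycles without pipelining.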

Because the iterations of a software-pipelined loop overlap, it can be difficult to understand the assembly code corresponding to the loop. If the source code is compiled with –mw, the software-pipelined loop information displays the scheduled instruction sequence for a single iteration of the software-pipelined loop. Examining this single scheduled iteration makes it easier to understand the compiler's output. This, in turn, makes tuning easier.

WARNING: Don't Lie to the Compiler!

The more information the compiler has, the better the optimization decisions that it can make. When using annotations, make sure the information being communicated is correct. If the information is not correct, the resulting code will not be correct either.

Establishing a Baseline

Given this function BasicLoop() ...

void BasicLoop(int *output, int *input1, int *input2, int n)
{
    int i;
    for (i=0; i<n; i++)
        output[i] = input1[i] + input2[i];
}

Compile it with the recommended options: -o -s -mw -mv6400.

Open up the assembly file and look at the software pipelining information for this loop:

;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop source line                 : 5
;*      Loop opening brace source line   : 5
;*      Loop closing brace source line   : 6
;*      Known Minimum Trip Count         : 1
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 7
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        1
;*      .D units                     2*       1
;*      .M units                     0        0
;*      .X cross paths               1        0
;*      .T address paths             2*       1
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          1        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        1
;*      Bound(.L .S .D .LS .LSD)     1        1
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 7  Schedule found with 1 iterations in parallel
...
;*        SINGLE SCHEDULED ITERATION
;*
;*        C25:
;*   0              LDW     .D1T1   *A4++,A3          ; |6|  ^
;*     ||           LDW     .D2T2   *B4++,B5          ; |6|  ^
;*   1      [ B0]   BDEC    .S2     C24,B0            ; |5|
;*   2              NOP             3
;*   5              ADD     .L1X    B5,A3,A3          ; |6|  ^
;*   6              STW     .D1T1   A3,*A5++          ; |6|  ^
;*   7              ; BRANCHCC OCCURS {C25}           ; |5|
;*----------------------------------------------------------------------------*
L1:    ; PD LOOP PROLOG
;** --------------------------------------------------------------------------*
L2:    ; PIPED LOOP KERNEL
           LDW     .D1T1   *A4++,A3          ; |6| <0,0>  ^
||         LDW     .D2T2   *B4++,B5          ; |6| <0,0>  ^

   [ B0]   BDEC    .S2     L2,B0             ; |5| <0,1>
           NOP             3
           ADD     .L1X    B5,A3,A3          ; |6| <0,5>  ^
           STW     .D1T1   A3,*A5++          ; |6| <0,6>  ^
;** --------------------------------------------------------------------------*
L3:    ; PIPED LOOP EPILOG
;** --------------------------------------------------------------------------*

The software-pipelined loop information includes the source lines from which the loop originates, a description of the resource and latency requirements for the loop, and whether the loop was unrolled (among other information). When compiling with –mw, the information also contains a copy of the single scheduled iteration. Details on the -mw comment block can be found in Chapter 4 of the C6000 Programmer's Guide. While this information is not completely up-to-date, it is the best available as of this writing.

The initiation interval (ii) is 7. This means that in the steady state, a result (equivalently, an original loop iteration) is computed every 7 CPU cycles. Therefore, the baseline CPU performance is 7 cycles/result.

Observe that the length of the single scheduled iteration is also 7. Thus, only one iteration is in flight at any given time: there is no overlap across iterations, which is generally a sign of a suboptimal schedule.

Eliminating Loop-Carried Dependencies

Look again at the software pipelining information in the assembly file. Where is the bottleneck? To find it, one must understand how the compiler computes a lower bound on the cycle count for the loop. This lower bound is the maximum of the loop-carried dependency bound and the resource bound. The loop-carried dependency bound is based on ordering constraints among the assembly instructions. For example, the two loads must complete before the add can proceed. The resource bound is based on hardware constraints, such as the required number of functional units of each given type. In actuality, there are two resource bounds: partitioned and unpartitioned. In this case, both are the same.

In this case, the partitioned resource bound is 2. Even if the assembly instructions could be executed out of order, at least two cycles would be required to execute all instructions in the loop body. However, the loop-carried dependency bound is 7.

;*      Loop Carried Dependency Bound(^) : 7
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 2

Thus, ii ≥ max(2,7). To reduce the ii and consequently the number of cycles/result, the loop-carried dependency bound must be reduced.

The loop-carried dependency bound arises because there is a cycle in the ordering constraints for the instructions. The maximum cycle length is the loop-carried dependency bound. To reduce or eliminate a loop-carried dependency, one must identify the cycle and then find a way to shorten or break it.

To identify the maximum loop-carried dependency cycle, refer to the instructions in the single scheduled iteration. The instructions involved in the critical cycle are marked with a caret (^). These instructions include the two loads, the add, and the store.

;*        SINGLE SCHEDULED ITERATION
;*
;*        C25:
;*   0              LDW     .D1T1   *A4++,A3          ; |6|  ^
;*     ||           LDW     .D2T2   *B4++,B5          ; |6|  ^
;*   1      [ B0]   BDEC    .S2     C24,B0            ; |5|
;*   2              NOP             3
;*   5              ADD     .L1X    B5,A3,A3          ; |6|  ^
;*   6              STW     .D1T1   A3,*A5++          ; |6|  ^
;*   7              ; BRANCHCC OCCURS {C25}           ; |5|

With this information and by looking at which instructions feed into each other, one can reconstruct the loop-carried dependency cycle. The nodes in the graph are precisely the instructions marked with a caret (^). The edges (arrows between node pairs) denote the ordering constraints. Each edge is annotated with the number of cycles needed between the source and destination instructions. In most cases, results are written to registers at the end of the cycle in which the instruction executes and are available on the following cycle. One of the few exceptions is that certain loads take 5 cycles for the data to be available in the target register.

[Figure: loop-carried dependency graph for the loop — each load feeds the add, the add feeds the store, and the store feeds back to the loads (spra666, Figure 3)]

In this graph, there are two critical cycles, each with length 7: load to add (5 cycles), add to store (1 cycle), and store back to load (1 cycle), giving 5 + 1 + 1 = 7. To reduce the loop-carried-dependency bound, the largest cycle in the graph must be shortened or eliminated. This can be accomplished by eliminating one of the edges in the cycle. To do so, one must understand the origin of the edges.

The edges from the load instructions to the add instructions are straightforward. The destination registers for the loads are the source registers for the add instruction. A load instruction takes 5 cycles to populate its destination register. Consequently, the add instruction cannot be executed until 5 cycles after the last of the two loads has been executed.

The edge from the add to the store is also straightforward since the destination of the add is the source of the store. The result of the add is available after 1 cycle. Consequently, the edge between the add and the store is annotated with a 1. The store can be executed on the cycle immediately following the add.

The edges from the store back to the loads are less obvious. How does one know to put them in? Why are they there? The answer to these questions is determined by process of elimination. Since there is no register dependence, there is most likely a memory dependence. In this case, the compiler does not know whether the input arrays could reference the same memory locations as the output, so it is conservative and assumes this is possible. The edge from the store to the loads ensures that the store from one iteration executes before the loads in the next iteration, just in case those loads try to read the data written to by the store. Whether this occurs in practice depends on the values of "input1", "input2" and "output" at run time. (For more background on data dependencies and graphs that represent them, see Memory Alias Disambiguation on the TMS320C6000.)

In reality, experienced programmers generally write code so that input parameter arrays and output parameter arrays are independent. The reason is that this makes their algorithm much more parallel, which in turn leads to better performance. Suppose that, across all call sites, neither "input1" nor "input2" ever accesses the same memory locations as "output". Tell this to the compiler and the back edges from the store to the loads will be eliminated. This is done either by using the –mt option or by using the restrict keyword.

void BasicLoop(int *restrict output,
               int *restrict input1,
               int *restrict input2,
               int n)
{
    int i;
    #pragma MUST_ITERATE(1)
    for (i=0; i<n; i++)
        output[i] = input1[i] + input2[i];
}

While it is sufficient for this example to restrict-qualify either both input pointers or the output pointer, it is recommended to restrict-qualify all parameters that can be restrict-qualified (and local pointer variables as well). First, this is usually quicker than determining which ones actually need the qualifier. Second, it documents the aliasing assumptions for other programmers who might maintain or modify this code base in the future. However, before inserting restrict keywords, be sure that the pointers being restrict-qualified cannot overlap with any other pointers. When writing a library routine and using restrict, be sure to document the parameter restrictions for library users.

After adding the restrict keyword, rebuild the function. Observe that the loop-carried dependency bound has disappeared. This means that each iteration is independent. New iterations are now initiated as soon as resources are available.

;*      Loop Carried Dependency Bound(^) : 0

Further note that a new iteration is initiated every 2 cycles. Thus, in the steady state, a new result is produced every 2 cycles (instead of every 7 cycles).

;*         ii = 2  Schedule found with 4 iterations in parallel.

Balancing Resources

Look at the software pipelining information for the loop. The limiting factor is now the number of functional units as indicated by the resource bound:

;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 2

Which functional unit is the bottleneck? To determine this, look at the detailed breakdown of functional units required to execute a single iteration. Recall that the C6000 architecture is partitioned into two nearly symmetric halves. The resource breakdown displayed in the software pipelining information is computed after the compiler has partitioned instructions to either the A-side or the B-side.

Look for the machine resources with a (*) after them. Notice which ones are most congested. In this case, the bottleneck is on the D unit and the T address path.

;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        1
;*      .D units                     2*       1
;*      .M units                     0        0
;*      .X cross paths               1        0
;*      .T address paths             2*       1
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          1        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        1
;*      Bound(.L .S .D .LS .LSD)     1        1
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 2  Schedule found with 4 iterations in parallel
...
;*        SINGLE SCHEDULED ITERATION
;*
;*        C26:
;*   0              LDW     .D1T1   *A5++,A4          ; |9|
;*   1              LDW     .D2T2   *B4++,B5          ; |9|
;*   2      [ B0]   BDEC    .S2     C26,B0            ; |8|
;*   3              NOP             3
;*   6              ADD     .L1X    B5,A4,A3          ; |9|
;*   7              STW     .D1T1   A3,*A6++          ; |9|
;*   8              ; BRANCHCC OCCURS {C26}           ; |8|

From the single scheduled iteration, it can be seen that the D units and T address paths are used for loads and stores, one D unit and one T address path for each. There are three memory access instructions but only two D units and two T address paths available on each cycle. Thus, the resource bound is two cycles. This means that in the steady state, at least two cycles are required per result.

Performance can be improved by better utilizing the D units and T address paths. Assume it is known that the number of iterations is always even. If the loop is unrolled 2x (so that the resulting loop contains two copies of the original loop body and executes half the number of iterations), the compiler could balance out the D units and T address paths and achieve better resource utilization.

The following diagrams show the concept of unrolling the loop for better (more balanced) utilization of the critical D unit resource (the situation would be analogous for the critical T address path). On the left side, four loop iterations are shown, as indicated by the double-arrows, producing eight results in four cycles. One D unit is unused every other cycle. The right side shows performance after unrolling the loop by 2x. Both D units are executing useful instructions in every cycle. Of course, the order of the loads and stores must be rearranged, but the compiler takes care of this.

[Figure: D-unit utilization before and after 2x unrolling — before, one D unit sits idle every other cycle; after, both D units do useful work on every cycle (spra666, Figure 4)]

One possibility for achieving this is to unroll the loop manually:

void BasicLoop(int *restrict output,
               int *restrict input1,
               int *restrict input2,
               int n)
{
    int i;
    #pragma MUST_ITERATE(1)
    for (i=0; i<n; i+=2) {
        output[i]   = input1[i]   + input2[i];
        output[i+1] = input1[i+1] + input2[i+1];
    }
}

Rebuilding yields the following results:

;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop source line                 : 8
;*      Loop opening brace source line   : 8
;*      Loop closing brace source line   : 11
;*      Known Minimum Trip Count         : 1
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 3
;*      Partitioned Resource Bound(*)    : 3
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        1
;*      .D units                     3*       2
;*      .M units                     0        0
;*      .X cross paths               2        0
;*      .T address paths             3*       3*
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          2        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        1
;*      Bound(.L .S .D .LS .LSD)     2        1
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 3  Schedule found with 4 iterations in parallel
...
;*        SINGLE SCHEDULED ITERATION
;*
;*        C26:
;*   0              LDW     .D2T2   *B5++(8),B6       ; |10|
;*   1              NOP             1
;*   2              LDW     .D1T1   *A6++(8),A3       ; |10|
;*     ||           LDW     .D2T2   *-B5(4),B4        ; |10|
;*   3              LDW     .D1T1   *-A6(4),A3        ; |10|
;*   4              NOP             1
;*   5      [ B0]   BDEC    .S2     C26,B0            ; |8|
;*   6              NOP             1
;*   7              ADD     .S1X    B6,A3,A4          ; |10|
;*   8              ADD     .L1X    B4,A3,A5          ; |10|
;*   9              NOP             1
;*  10              STNDW   .D1T1   A5:A4,*A7++(8)    ; |10|
;*  11              ; BRANCHCC OCCURS {C26}           ; |8|


The T address paths are now balanced. There is one fewer D unit than expected because the compiler chose a non-aligned double-word store (STNDW) instead of two aligned single-word stores. Recall that a non-aligned memory access uses both T address paths but only one D unit.

When the loop is unrolled 2x, each iteration takes longer, but the loop now generates two results per iteration instead of one. Thus, the unrolled loop requires 1.5 cycles/result in the steady state, whereas the non-unrolled version requires 2 cycles/result.

Although manually unrolling a loop achieves the required results, it can be rather tedious for a large loop. An alternative is to let the compiler do this. If the compiler knows that the trip count for the loop (in this case n) is a multiple of 2, the compiler unrolls the loop automatically, if deemed profitable. To tell the compiler that the trip count is a multiple of 2, modify the MUST_ITERATE pragma preceding the loop. The MUST_ITERATE pragma has the syntax:

#pragma MUST_ITERATE(lower_bound, upper_bound, factor)

The lower bound is the lowest possible value for n. The upper bound is the maximum possible value for n. The factor is a number that always divides n. Any of these parameters can be omitted.
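For illustration, a pragma using all three fields might look like this (the bounds here are invented for the example and are not claims about BasicLoop; on a non-TI compiler the pragma is simply ignored):

```c
/* Hypothetical example: the caller guarantees 8 <= n <= 256 and that
   n is always a multiple of 4.  The pragma passes this on to the
   compiler; it changes nothing about what the loop computes. */
void Scale(int *restrict out, int *restrict in, int n)
{
    int i;
    #pragma MUST_ITERATE(8, 256, 4)
    for (i = 0; i < n; i++)
        out[i] = in[i] * 2;
}
```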

Instead of unrolling the loop manually, simply modify the pragma:

void BasicLoop(int *restrict output,
               int *restrict input1,
               int *restrict input2,
               int n)
{
    int i;
    #pragma MUST_ITERATE(2,,2)
    for (i=0; i<n; i++) {
        output[i] = input1[i] + input2[i];
    }
}

Now rebuild. The throughput is the same as when the loop was unrolled manually, namely 1.5 cycles/result. The extra line in the software pipelining information communicates that the compiler unrolled the loop 2x:

;*      Loop Unroll Multiple             : 2x

If the compiler had not deemed unrolling to be profitable (despite the use of the MUST_ITERATE pragma), one could force the compiler to unroll by inserting an UNROLL pragma in addition to the MUST_ITERATE pragma:

#pragma UNROLL(2)

Similarly, if the compiler chooses to unroll, and it is preferred not to have the loop unrolled, the loop can be preceded with

#pragma UNROLL(1)

The compiler usually succeeds at selecting the best unroll factor. Occasionally, however, you can do better by selecting the unrolling factor manually. The reason is that the compiler must make an educated guess up front (during high-level optimization) as to how many times to unroll (if any). The actual software-pipelining is not done until low-level optimization. At this point, the unrolling decision is not reversible. In contrast, as a user, you have the opportunity to iteratively try out various unrolling factors and pick the best one.

Loop pragmas must appear immediately before a loop, without any other intervening source code. Note the compiler may ignore an UNROLL pragma, unless there is an accompanying MUST_ITERATE pragma. The MUST_ITERATE pragma must note that the trip count is divisible by the unroll factor and that the minimum trip count is at least the unroll factor.
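Putting both pragmas together for the example loop might look like this (a sketch; the factor of 2 assumes the caller guarantees an even trip count, and non-TI compilers will ignore both pragmas):

```c
/* Request 2x unrolling explicitly.  The MUST_ITERATE pragma supplies the
   facts the UNROLL pragma needs: n >= 2 and n is divisible by 2. */
void BasicLoopUnrolled(int *restrict output, int *restrict input1,
                       int *restrict input2, int n)
{
    int i;
    #pragma MUST_ITERATE(2, , 2)
    #pragma UNROLL(2)
    for (i = 0; i < n; i++)
        output[i] = input1[i] + input2[i];
}
```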

When targeting the C64x+, beware of over-unrolling. Otherwise, the loop may become too large for the compiler to exploit the loop buffer (see section 2.1 of Hand Tuning Loops and Control Code on the TMS320C6000). The loop buffer has power, code size, and performance benefits, but can only be used with loops that have an ii of 14 or less and a single scheduled iteration length of 48 or less.

Exploiting Wide Loads and Stores (SIMD)

Although a speedup of 4.7x has already been achieved, it is possible to do better. Note that the loop is still bottlenecked on the memory access instructions. Since the memory access instructions are balanced, further unrolling will not help. However, two improvements can be made:

  • Wider load instructions can be used instead of multiple loads to reduce the number of D unit resources.
  • Aligned memory access instructions can be used instead of non-aligned memory access instructions to reduce the number of T address paths.

The C64x and C64x+ processors support both aligned and non-aligned double-word loads and stores. (Recall that the C67x and C67x+ support only aligned double-word loads and no double-word stores. The C62x supports only word-wide loads and stores.) If it is known that the function parameters are double-word aligned, switching to aligned, double-word memory accesses saves both D units and T address paths.

How does one get the compiler to select the double-word versions of the memory access instructions? There are two options: (1) use intrinsics, or (2) tell the compiler that the memory accesses are aligned. The second method is simpler, so it is best to try that first.
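For completeness, option (1) might look roughly like the following. This is only a sketch: it assumes a C64x+ target, a trip count that is a multiple of 2, and double-word-aligned pointers, and it uses the TI intrinsics _amem8(), _itoll(), _hill(), and _loll(). On the C6000 the compiler provides these natively and maps _amem8() to aligned LDDW/STDW instructions; the stand-in macros below exist only so the sketch can be compiled and checked off-target.

```c
#include <stdint.h>

/* Portable stand-ins for the TI C6000 intrinsics (off-target checking
   only; the TI compiler supplies the real ones). */
#define _amem8(p)    (*(uint64_t *)(p))
#define _itoll(h, l) (((uint64_t)(uint32_t)(h) << 32) | (uint32_t)(l))
#define _hill(x)     ((uint32_t)((uint64_t)(x) >> 32))
#define _loll(x)     ((uint32_t)(x))

/* Sketch: force aligned double-word accesses with intrinsics.
   Assumes n is a multiple of 2 and all pointers are 8-byte aligned. */
void BasicLoopIntrin(int *restrict output, int *restrict input1,
                     int *restrict input2, int n)
{
    int i;
    for (i = 0; i < n; i += 2) {
        uint64_t a = _amem8(&input1[i]);          /* aligned LDDW */
        uint64_t b = _amem8(&input2[i]);
        _amem8(&output[i]) =                      /* aligned STDW */
            _itoll(_hill(a) + _hill(b), _loll(a) + _loll(b));
    }
}
```

In practice, communicating alignment usually produces the same instructions with far less source clutter, which is why it is the preferred starting point.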

To tell the compiler that the memory accesses are aligned on double-word (64-bit) boundaries, use _nasserts() inside the function just prior to the loop of interest:

_nassert((int) input1 % 8 == 0); // input1 is 64-bit aligned
_nassert((int) input2 % 8 == 0); // input2 is 64-bit aligned
_nassert((int) output % 8 == 0); // output is 64-bit aligned

The _nassert() communicates that the pointer is aligned at the point in the function where the _nassert() is located. Although communicating alignment information once per function usually suffices, it is recommended to reassert the information immediately before each loop of interest.

If the data is not already aligned on a double-word boundary, it might be possible to force alignment with the DATA_ALIGN pragma. An _nassert() makes a statement about the value of a variable at the point in the program where it appears; from this, the compiler can often derive the corresponding information about that variable at other locations in the program.

Rebuild. The resource bound, and consequently the ii, have been reduced to 2:

;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop source line                 : 13
;*      Loop opening brace source line   : 13
;*      Loop closing brace source line   : 15
;*      Loop Unroll Multiple             : 2x
;*      Known Minimum Trip Count         : 1
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        1
;*      .D units                     2*       1
;*      .M units                     0        0
;*      .X cross paths               2*       0
;*      .T address paths             2*       1
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          2        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        1
;*      Bound(.L .S .D .LS .LSD)     2*       1
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 2  Schedule found with 4 iterations in parallel
...
;*        SINGLE SCHEDULED ITERATION
;*
;*        C26:
;*   0              LDDW    .D2T2   *B6++,B5:B4       ; |14|
;*     ||           LDDW    .D1T1   *A3++,A7:A6       ; |14|
;*   1              NOP             1
;*   2      [ B0]   BDEC    .S2     C26,B0            ; |13|
;*   3              NOP             2
;*   5              ADD     .S1X    B4,A6,A4          ; |14|
;*   6              ADD     .L1X    B5,A7,A5          ; |14|
;*   7              STDW    .D1T1   A5:A4,*A8++       ; |14|
;*   8              ; BRANCHCC OCCURS {C26}           ; |13|

In the steady state, the loop is now generating one result per cycle (more accurately, two results every two cycles), a 7x speedup, compared to the baseline.

Rebalancing Resources

Although the incorporation of wider loads and stores sped up the loop, the loop is still bottlenecked on D units and T address paths. The number of memory accesses has been reduced to three, so the D units and T address paths are again unbalanced. As before, further improvement can be achieved by unrolling one more time, if legal. Assuming that the trip count is indeed a multiple of 4, modify the MUST_ITERATE pragma to communicate this to the compiler and rebuild. The resulting source code (which achieves optimal throughput) looks as follows:

void BasicLoop(int *restrict output,
               int *restrict input1,
               int *restrict input2,
               int n)
{
    int i;

    _nassert((int) input1 % 8 == 0); // input1 is 8-byte aligned
    _nassert((int) input2 % 8 == 0); // input2 is 8-byte aligned
    _nassert((int) output % 8 == 0); // output is 8-byte aligned

    #pragma MUST_ITERATE(4,,4)       // n >= 4, n % 4 = 0
    for (i=0; i<n; i++) {
        output[i]   = input1[i] + input2[i];
    }
}

The software-pipelining information for the resulting assembly code, which yields one result every 0.75 cycles (or 4 results every 3 cycles), is as follows:

;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop source line                 : 13
;*      Loop opening brace source line   : 13
;*      Loop closing brace source line   : 15
;*      Loop Unroll Multiple             : 4x
;*      Known Minimum Trip Count         : 1
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 3
;*      Partitioned Resource Bound(*)    : 3
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     1        0
;*      .D units                     3*       3*
;*      .M units                     0        0
;*      .X cross paths               2        2
;*      .T address paths             3*       3*
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          2        2     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             1        0
;*      Bound(.L .S .D .LS .LSD)     2        2
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 3   Schedule found with 3 iterations in parallel
...
;*        SINGLE SCHEDULED ITERATION
;*
;*        C26:
;*   0              LDDW    .D2T2   *B5++(16),B9:B8   ; |14|
;*     ||           LDDW    .D1T1   *A16++(16),A7:A6  ; |14|
;*   1              LDDW    .D2T2   *-B5(8),B7:B6     ; |14|
;*     ||           LDDW    .D1T1   *-A16(8),A9:A8    ; |14|
;*   2              NOP             1
;*   3      [ A0]   BDEC    .S1     C26,A0            ; |13|
;*   4              NOP             1
;*   5              ADD     .S1X    B9,A7,A5          ; |14|
;*   6              ADD     .L1X    B8,A6,A4          ; |14|
;*     ||           ADD     .L2X    B6,A8,B6          ; |14|
;*   7              ADD     .L2X    B7,A9,B7          ; |14|
;*   8              STDW    .D1T1   A5:A4,*A3++(16)   ; |14|
;*     ||           STDW    .D2T2   B7:B6,*++B4(16)   ; |14|
;*   9              ; BRANCHCC OCCURS {C26}           ; |13|

The loop has been sped up 9.3x compared to the original source code with no modifications other than the addition of restrict qualifiers, MUST_ITERATE pragmas and _nasserts().

For this example, only two of the three fields of the MUST_ITERATE pragma are exploited. The third (middle) field, which communicates an upper bound on the trip count, can also be useful for performance tuning. There are certain optimizations that are profitable when the loop trip count is large but can hurt performance when the trip count is small. By default, the compiler assumes that the expected trip count is large. Hence, if the upper bound on the trip count is small, it is best to communicate this to the compiler.

Reference

This article is derived from the application note Hand Tuning Loops and Control Code on the TMS320C6000.


