Please note as of Wednesday, August 15th, 2018 this wiki has been set to read only. If you are a TI Employee and require Edit ability please contact x0211426 from the company directory.

Optimized Sort Algorithms For DSP

From Texas Instruments Wiki
Jump to: navigation, search

Implementing an optimized sort algorithm on a DSP requires giving some thought to the algorithm used, just like on any CPU architecture, but it also requires paying careful attention to how it is possible to leverage some advantages of the Texas Instruments c64x+ DSP architecture such as:

  • the ability to execute of up to 8 instructions in parallel per CPU cycle,
  • the ability to specify a condition for individual instructions,
  • the SPLOOP mechanism for the optimization of loop execution,
  • the ability of the Texas Instruments compiler to "software pipeline" loops to leverage the DSP pipelined architecture,

The result of this study is a set of data points that will hopefully guide you in making an informed decision when picking and/or implementing your algorithm, and give you an order of magnitude of the performance you can expect to see in your real-time application, as well as code-size and memory requirements.

Sort Algorithms

In the scope of this algorithm and implementation review we have considered the Quicksort and Merge Sort algorithms, and we are only considering the problem of sorting unsigned 32-bits integers.

Much of these results do apply if we were to sort signed integers, but would likely vary if we were sorting 16-bits or 8-bits values as the optimization techniques used would likely vary.

Quick Sort Algorithm

The Quicksort algorithm has the following performance caracteristics:

  • best-case performance is 0(n.log(n))
  • worst-case performance is O(n^2)
  • average performance is O(n.log(n))

For a good discussion on this algorithm please refer to description of the Quick Sort algorithm on Wikipedia, or to the book The Art of Computer Programming by Donald Knuth which is an excellent reference on the subject.

Merge Sort Algorithm

The Merge Sort algorithm has the following performance caracteristics:

  • best-case performance is 0(n.log(n))
  • worst-case performance is O(n.log(n))
  • average performance is O(n.log(n))

A typical implementation of the Merge Sort algorithm involves:

  • Division of the elements to be sorted into subsets.
  • Sorting of each of the subset.
    • The same process (Merge Sort) can be applied recursively on each of these subsets to sort them.
  • Merge the resulting sorted lists of elements together.
    • Merging sorted lists is a process that is loop-friendly (and will leverage some of the DSP features), and the way memory is accessed by the program during the merge makes it quite cache-friendly as well.

When implementing this algorithm you need to choose at what level the division into subsets will stop.

For a good discussion on this algorithm please refer to description of the Merge Sort algorithm on Wikipedia, or to the book The Art of Computer Programming by Donald Knuth which is an excellent reference on the subject.

Variations on a Theme

When considering the uses a DSP application might do of a sort function it quickly becomes obvious that sorting alone will often not be the complete solution to the problem the application is trying to solve.

Index Ordering

For quite a few applications it will be interesting for the application to not only get as a result a sorted list, but also get an array that will give the original position of each item in the sorted list.

This is what is called here "Index Ordering", and as part of this study implementations with and without "Index Ordering" have been created.

Finding the Top K entries from an array

In some cases the real problem to be solved requires only finding the top K entries from an array of N values, rather than sorting the whole array.

If that is the case of your application then this opens the door for some additional optimizations, as quite some CPU cycles can be saved by only partially sorting the content of the array.

A good starting point to learn more about this type of algorithms is the Wikipedia page on Selection Algorithm.

We have decided in this article to implement an optimized version of the Quickselect algorithm described on Wikipedia. It is basically a modified version of the Quicksort algorithm.

Keeping a list always sorted

Will be documented later.

Rather than sorting an array of values at periodic intervals or even just when needed it might be interesting to keep the list of sorted at all times.

This means that the list has to be re-ordered every time a value is updated.

Benchmark Results

Here are benchmark results of the run of three implementations of sort. These benchmarks were run in the following configuration:

  • using a TCI6488 DSP,
  • code placed in L2 memory,
  • data placed in L2 memory,
  • L1 memory entirely configured as cache,
  • the cache was warm.

CPU Cyles

The results are CPU cycles for the execution of the sort algorithm, with index ordering enabled.

  • The horizontal scale is the size of the array being sorted (32 bits unsigned int). It goes from 1 to 256 elements to be sorted.
  • The vertical scale is the number cycles it takes to sort the array.
  • There are several measures displayed in these results:
    • Merge Sort is the worst case performance of the Merge sort implementation.
    • Qsort is almost a worst case performance of the unoptimized Quicksort implementation.
    • Qsort Best is the best case performance of the unoptimized Quicksort implementation.
    • Qsort Opt is almost a worst case performance of the optimized Quicksort implementation.
    • Qsort Opt Best is the best case performance of the optimized Quicksort implementation.

Please note that in our benchmarking the index ordering does not noticeably degrades the performance of algorithm. This can be explained by the fact that Texas Instruments DSPs have the ability to execute up to 8 instructions in parallel, and that the cost of index ordering is absorbed or hidden that way.

Merge sort vs qsort 1 to 256.png

From this graph we can draw the following conclusions on our implementations:

  • The fastest sort algorithm we have studied here is the Merge Sort, with a performance in CPU cycles of about 50 cycles per unsigned 32-bit word to be sorted (estimate for a number of elements to be sorted between 1 and 256).
  • The performance of the Merge Sort is varies only a little between best cases and worst cases scenarios.
  • The optimized Quicksort variant is coming in second at about twice the number of cycles.
  • The Quicksort is ranke third at about 3 times the number of cycles.
  • Both Quicksort implementation are prone to significant performance degradation (O(n^2) trend clearly visible on the graph) in worst case scenarios.

Code Size

To be documented later

Memory Requirements

Here is a summary of the extra memory these various algorithms require to work. We call here N the number of 32 bit unsigned words to be sorted:

  • Quicksort: No extra memory is required as all sorting is done in place.
  • Optimized Quicksort: N 32 bits words of extra memory are required by this implementation.
  • Merge Sort: N+log(N) 32 bits words of extra memory are required by this implementation.
  • Quicksort with Index Ordering: No extra memory is required as all sorting and index ordering is done in place.
  • Optimized Quicksort with Index Ordering: 2*N 32 bits words of extra memory are required by this implementation.
  • Merge Sort with Index Ordering: 2*(N+log(N) 32 bits words of extra memory are required by this implementation.

Implementations

We are providing here the software implementations of the algorithms we have evaluated in this study. Feel free to use these functions as a starting point for your own implementation.

Quicksort from <stdlib.h>

The qsort() function provided with the Texas Instruments runtime support library (interface defined in <stdlib.h>) gives you the ability to sort an array of elements of any type and size. However the genericity of this quicksort implementation has an impact on its performance which is clearly demonstrated by the results of our benchmarks.

If the performance of your sort function is likely to be on the critical path of your real-time application using the qsort() function is probably a good first step to get a functional application, but implementing your own sort function might be required at some point.

In the benchmark results our custom implementation Quick Sort algorithm is consistently outperforming the <stdlib.h> qsort() implementation. The reasons for that are simple:

  • The <stdlib.h> implementation supports sorting data types of any size which means that the code generated cannot leverage the fact that we are interested in sorting 'unsigned int's.
  • The comparison is implemented in the form of a function, which means additional overhead for each comparison done, a function call being made instead of just a simple comparison.
  • The way the pivot point selection is done in our implementation apparently differs from the RTS implementation which leads to differing performance on our sample test vectors.

Standard Quicksort Implementation

#include <stdlib.h>
int qsort_cmp(const void *e1, const void *e2) {
  unsigned int v1 = *(unsigned int *)e1;
  unsigned int v2 = *(unsigned int *)e2;
 
  if (v1 < v2) return -1;
  else if (v1 > v2) return 1;
  else return 0;
}
void Sort_stdlib32(unsigned int * restrict array, int size) {
 
  /* This is a simple implementation of the QuickSort algorithm.  */
  qsort((void *)array, size, 4, &qsort_cmp);
}

Quicksort

Quicksort Implementation

The following Quicksort implementation is very heavily based on the pseudo code found the Wikipedia entry for the Quicksort Algorithm.

inline int partition(unsigned int * restrict array, int left, int right, int pivotIndex) {
  int pivotValue = array[pivotIndex];
  int storeIndex, i;
  unsigned int temp;
 
  temp = array[pivotIndex];
  array[pivotIndex] = array[right]; // Move pivot to end
  array[right] = temp;
  storeIndex = left;
  for (i=left; i<right; i++) { // left <= i < right
    if (array[i] <= pivotValue) {
      temp = array[i];
      array[i] = array[storeIndex];
      array[storeIndex++] = temp;
    }
  }
  temp = array[right];
  array[right] = array[storeIndex]; // Move pivot to its final place
  array[storeIndex] = temp;
  return storeIndex;
}
 
static void quicksort(unsigned int * restrict array, int left, int right) {
  int pivotIndex, pivotNewIndex;
 
  if (right > left) {
    /* select a pivot index (e.g. pivotIndex := (left+right)/2) */
    pivotIndex = (left+right)/2;
    pivotNewIndex = partition(array, left, right, pivotIndex);
    quicksort(array, left, pivotNewIndex - 1);
    quicksort(array, pivotNewIndex + 1, right);
  }
}
 
void Sort_qsort32(unsigned int * restrict array, int size) {
 
  /* This is simple implementation of the QuickSort algorithm.  */
  quicksort(array, 0, size-1);
}

Quicksort Implementation with Index Ordering

The index ordering requires that the index be updated as the array is sorted. Because the Texas Instruments DSPs are able to execute multiple instructions in parallel the overhead introduced by the maintenance of the index is very limited.

static int ipartition(unsigned int * restrict array, unsigned int * restrict order, int left, int right, int pivotIndex) {
  int pivotValue = array[pivotIndex];
  int storeIndex, i;
  unsigned int temp;
  unsigned int itemp;

  temp = array[pivotIndex]; itemp = order[pivotIndex];
  array[pivotIndex] = array[right]; // Move pivot to end
  order[pivotIndex] = order[right];
  array[right] = temp;
  order[right] = itemp;
  storeIndex = left;
  for (i=left; i<right; i++) { // left <= i < right
    if (array[i] <= pivotValue) {
      temp = array[i];
      itemp = order[i];
      array[i] = array[storeIndex];
      order[i] = order[storeIndex];
      order[storeIndex] = itemp;
      array[storeIndex++] = temp;
    }
  }
  temp = array[right];
  itemp = order[right];
  array[right] = array[storeIndex]; // Move pivot to its final place
  order[right] = order[storeIndex];
  array[storeIndex] = temp;
  order[storeIndex] = itemp;
  return storeIndex;
}

static void quicksort(unsigned int * restrict array, unsigned int * restrict order, int left, int right) {
  int pivotIndex, pivotNewIndex;

  if (right > left) {
    /* select a pivot index (e.g. pivotIndex := (left+right)/2) */
    pivotIndex = (left+right)/2;
    pivotNewIndex = partition(array, order, left, right, pivotIndex);
    quicksort(array, order, left, pivotNewIndex - 1);
    quicksort(array, order, pivotNewIndex + 1, right);
  }
}

void SortIndex_qsort32(unsigned int * restrict array, int size, unsigned int * restrict index) {

	quicksort(array, index, 0, size-1);
}

Optimized Quicksort

It is possible to make a quite simple optimization of our base Quicksort implementation by using some scratch memory and a slight rewrite of the inner-loop of the Quicksort algorithm (found in the partition() function).

This change allows the compiler to generate code that is much more efficient and leverage the Texas Instruments C64x+ DSP features for the optimization of loops.

Optimized Quicksort Implementation

static int partition(unsigned int * restrict ping, unsigned int * restrict pong, int left, int right, int pivotIndex) {
  int pivotValue = ping[pivotIndex];
  int storeIndex, i, rightIndex;
  unsigned int temp;

  temp = ping[pivotIndex];
  ping[pivotIndex] = ping[right]; // Move pivot to end
  ping[right] = temp;
  storeIndex = left;
  rightIndex = right;
  /* This loop is optimized by the Texas Instruments compiler.  */
  for(i=left;i<right;i++) {
    if (ping[i] <= pivotValue) {
      pong[storeIndex++] = ping[i];
    } else {
      pong[rightIndex--] = ping[i];
    }
  }
  pong[storeIndex] = ping[right];
  /* Copy pivotValue back into the ping buffer.  */
  ping[storeIndex] = pong[storeIndex];
  return storeIndex;
}

static void quicksort(unsigned int * restrict ping, unsigned int * restrict pong, int left, int right, int depth, int idx) {
  int pivotIndex, pivotNewIndex;

  if (right <= left && (depth & 1)) {
    /* Copy array value back into the pong buffer.  */
    pong[idx] = ping[idx];
  }
  if (right > left) {
    /* select a pivot index (e.g. pivotIndex := (left+right)/2) */
    pivotIndex = (left+right)/2;
    pivotNewIndex = partition(ping, pong, left, right, pivotIndex);
    quicksort(pong, ping, left, pivotNewIndex - 1, depth+1, left);
    quicksort(pong, ping, pivotNewIndex + 1, right, depth+1, right);
  }
}

void Sort_qsort32(unsigned int * restrict array, int size, unsigned int * restrict scratch) {

  /* This implementation is using scratch memory to enable an optimization
     of the inner loop of the quicksort algorithm.  */
  quicksort(array, scratch, 0, size-1, 0, 0);
}

Optimized Quicksort Implementation with Index Ordering

One interesting thing to note is that the implementation with Index Ordering is only introducing a very small degradation in performance (we have observed it to be consistently under XXX %).

static int partition(unsigned int * restrict ping, unsigned int * restrict pong,
                     unsigned int * restrict iping, unsigned int * restrict ipong,
                     int left, int right, int pivotIndex) {
  int pivotValue = ping[pivotIndex];
  int storeIndex, i, rightIndex;
  unsigned int temp, itemp;

  temp = ping[pivotIndex]; itemp = iping[pivotIndex];
  ping[pivotIndex] = ping[right]; // Move pivot to end
  iping[pivotIndex] = iping[right]; // Move pivot to end
  ping[right] = temp; iping[right] = itemp;
  storeIndex = left;
  rightIndex = right;
  /* This loop is optimized by the Texas Instruments compiler.  */
  for(i=left;i<right;i++) {
    if (ping[i] <= pivotValue) {
      pong[storeIndex] = ping[i];
      ipong[storeIndex++] = iping[i];
    } else {
      pong[rightIndex] = ping[i];
      ipong[rightIndex--] = iping[i];
    }
  }
  pong[storeIndex] = ping[right];
  ipong[storeIndex] = iping[right];
  /* Copy pivotValue back into the ping buffer.  */
  ping[storeIndex] = pong[storeIndex];
  iping[storeIndex] = ipong[storeIndex];
  return storeIndex;
}

static void quicksort(unsigned int * restrict ping, unsigned int * restrict pong,
                      unsigned int * restrict iping, unsigned int * restrict ipong,
                      int left, int right, int depth, int idx) {
  int pivotIndex, pivotNewIndex;

  if (right <= left && (depth & 1)) {
    /* Copy array value back into the pong buffer.  */
    pong[idx] = ping[idx];
    ipong[idx] = iping[idx];
  }
  if (right > left) {
    /* select a pivot index (e.g. pivotIndex := (left+right)/2) */
    pivotIndex = (left+right)/2;
    pivotNewIndex = partition(ping, pong, iping, ipong, left, right, pivotIndex);
    quicksort(pong, ping, ipong, iping, left, pivotNewIndex - 1, depth+1, left);
    quicksort(pong, ping, ipong, iping, pivotNewIndex + 1, right, depth+1, right);
  }
}

void SortIndex_qsort32(unsigned int * restrict array, int size, unsigned int * restrict index, unsigned int * restrict scratch) {

  /* This is simple implementation of the QuickSort algorithm.  */
  quicksort(array, scratch, index, scratch+size, 0, size-1, 0, 0);
}

Merge Sort

The Merge Sort implementation presented here works as follows:

  • First Pass: the content of the array is sorted by small groups of four elements.
    • If the total number of elements in the array is not a multiple of four, then the last 1, 2 or 3 elements are sorted as well.
    • Sorting groups of 4 elements is done using the basic insertion algorithm. However this implementation of the insertion algorithm is written in such a way that the compiler is able to optimize the loop over all the array elements.
    • Index Ordering Note: the index is updated as the elements are sorted.
  • Second Pass: the small sorted list of 4 elements are merged together to form large sorted lists. This larger sorted lists are in turn merged, etc... until this all results in one sorted list.
    • This implementation is using functions to perform two-way, three-way and four-way merges (i.e. merge two lists, three lists or four lists into one).
    • Index Ordering Note: the indexes are merged as the elements are merged.

As you can see we have decided to opt for an implementation that is using an iterative approach rather than a recursive approach. This iterative approach in our experiments has proven to be much more optimization-friendly.

Merge Sort Implementation

The full implementation of our Optimized Merge Sort For DSP is documented separately.

Merge Sort Function

void Sort_merge32(unsigned int * restrict array, int size, unsigned int * restrict output, unsigned int * restrict scratch);

Two-Ways Merge

void Merge_twoWays32 (unsigned int **inputs, int *sizes,
		      unsigned int * restrict out, unsigned int * restrict scratch);

Three-Ways Merge

void Merge_threeWays32(unsigned int **inputs, int *sizes, unsigned int * restrict out, unsigned int * restrict scratch);

Four-Ways Merge

void Merge_fourWays32(unsigned int **inputs, int *sizes, unsigned int * restrict out, unsigned int * restrict scratch);

Merge Sort Implementation with Index Ordering

The full implementation of our Optimized Merge Sort For DSP With Index Ordering is documented separately.

Merge Sort Function with Index Ordering

void SortIndex_merge32(unsigned int * restrict array, int size, unsigned int * restrict index,
                       unsigned int * restrict output, unsigned int * restrict order, unsigned int * restrict scratch);

Two-Ways Merge With Index Ordering

void MergeIndex_twoWays32 (unsigned int **inputs, int *sizes,
		      unsigned int * restrict out, unsigned int * restrict scratch);

Three-Ways Merge With Index Ordering

void MergeIndex_threeWays32(unsigned int **inputs, int *sizes, unsigned int * restrict out, unsigned int * restrict scratch);

Four-Ways Merge With Index Ordering

void MergeIndex_fourWays32(unsigned int **inputs, int *sizes, unsigned int * restrict out, unsigned int * restrict scratch);

Top K Elements from an Array of N Elements

Based on the Quickselect algorithm described in the Selection Algorithm page from Wikipedia we have made an implementation optimized for Texas Instruments c64x+ DSPs. This implementation does perform #Index_Ordering as described in the

A number of optimizations have been built in the implementation of this algorithm among which:

  • Detection of worst-case performance and application of Median of Medians Algorithm, this prevents the algorithm from experiencing O(N^2) performance.
  • No recursion of the algorithm for 4 elements or less. Instead a fully unrolled implementation of a sort for 2, 3 and 4 elements is used.
  • Use of scratch memory to enable the acceleration of the partition algorithm.

Software Implementation

The software implementation of this algorithm can be found here: Top K Elements Sort For DSP With Index Ordering.


Benchmark Results

Here is a visualization of the results of a benchmark of the function in a case where we sort the top 20 elements of an array of up 600 elements.

Our Quickselect algorithm is compared to repeating 20 times the search for the maximum in the array. The maximum search function is an optimized one and is also provided with our example code.

Quick select vs max.png

References