Please note as of Wednesday, August 15th, 2018 this wiki has been set to read only. If you are a TI Employee and require Edit ability please contact x0211426 from the company directory.

Optimized In-Memory Bitmap Indexes

From Texas Instruments Wiki
Jump to: navigation, search
Construction Icon small.png This page is currently under construction. The content of this page is due to change quite frequently and thus the quality and accuracy are not guaranteed until this message has been removed. Please feel free to contribute to this page while construction is in progress.

This article documents an implementation of uncompressed Bitmap Indexes that supports an index size up to 1024 elements. The use of bitmap indexes enables very effective software optimizations techniques for some types of complex software applications (aka control code).

Many embedded software applications maintain databases of objects (arrays of structs) stored in memory, and the maintenance of these databases often involves operations such as:

  • iterate over a linked list of objects, and perform an operation on these objects.
  • iterate over all the database objects, and depending on a per-object set of logical conditions perform one or more operations

The problem with the implementation of such maintenance operation is that the execution performance is generally quite poor as the code most often does not get the best out of the:

  • software pipelining of loops.
  • parallel execution of multiple CPU instructions per CPU cycle.

This is a problem that is common to most CPU architectures, including architecture that have branch prediction built-in. (need references here: Monet-DB X100 presentation looks like a good one, need to find others)

For some background information on what Bitmap Indexes are please refer to Wikipedia that has a good article on Bitmap Indexes. Please note however that a significant part of the article is dedicated to compression of large bitmap indexes (read: millions of items in the index), where the present article is focused on the use of an uncompressed bitmap index implementation that has a fixed size of 1024.

The implementation presented here has been optimized for execution on the Texas Instruments C64x+ DSP Architecture as well as C66x Architecture.


Bitmap Indexes

Replacing structure fields with bitmap indexes

It is possible to use bitmap indexes to replace boolean fields in data-structures. The benefits of doing so are the following:

  • reduction of memory storage requirements for the same information. One bit only is used for each field, vs at least 8 bits in the initial approach.
  • boolean logic can be used to combine conditions and create more complex queries using very simple bit-wise operations.
  • the resulting software is much more "software pipeline"-friendly.

Replacing a linked list with a bitmap index

It is also possible to replace linked-lists with bitmap indexes:

  • when the order of the items in the list can easily be computed on the fly (for example by using a sort algorithm),
  • when the order does not matter (such as in the case of linked-lists used for resource allocation).

Optimized DSP Bitmap Index implementation

Here is an optimized implementation of uncompressed bitmap indexes that supports up to 1024 objects (these indexes have a fixed size). The limitation on size allows us to store the whole bitmap in memory (it is only 136 bytes long for one bitmap index), and to use many such bitmap indexes in one application.

This implementation has been developed and tested on Texas Instrument DSPs, but this code should be easy to adapt for other CPU architectures, as very few TI-specific intrinsics are used.

Bitmap Index data structure

The data-structure associated to a Bitmap Index is a very simple array, designed to fit in two cache lines. Its total size is 136 bytes.

typedef struct {
   uint32_t bitmap[32];  /* Bitmap, one bit per entry.  */
   uint32_t index;       /* Index of the words in the bitmap[] array that have
                            at least one bit set, one bit per word.  */
   uint32_t count;       /* Counter of the number of bits that are set in the bitmap.  */
} BitmapIndex;

The data-structure should be stored at a 64-bit aligned address for the maximum efficiency of the code that is generated.

Methods to initialize and modify a bitmap index

Here are a few commodity functions that allow you to create a new bitmap index, set, clear and get bits from it. These can be used as basis for a custom implementation of bitmap indexes.

#include <inttypes.h> /* C99 types.  */
#include <c6x.h>      /* C64x+, C66x intrinsics.  */
 
inline void BitmapIndex_init(BitmapIndex *bitmap_index) {
    int i;
 
    /* Initialize an empty bitmap index.  */
    for(i=0; i<32; i++) {
        bitmap_index->bitmap[i] = 0;
    }
    bitmap_index->index = 0;
    bitmap_index->count = 0;
}
 
inline void BitmapIndex_set(BitmapIndex * bitmap_index, int bit) {
    int word_position = bit / 32;
    int bit_position = bit % 32;
    uint32_t bitmap = bitmap_index->bitmap[word_position];
	uint32_t mask = (0x1 << bit_position);
 
    /* Set one bit in the bitmap.  */
    if (!(bitmap & mask)) { bitmap = bitmap ^ mask; }
    bitmap_index->bitmap[word_position] = bitmap;
}
 
inline void BitmapIndex_clear(BitmapIndex * bitmap_index, int bit) {
    int word_position = bit / 32;
    int bit_position = bit % 32;
    uint32_t bitmap = bitmap_index->bitmap[word_position];
	uint32_t mask = (0x1 << bit_position);
 
    /* Clear one bit in the bitmap.  */
    if (bitmap & mask) { bitmap = bitmap ^ mask; }
    bitmap_index->bitmap[word_position] = bitmap;
}
 
inline uint32_t BitmapIndex_get(BitmapIndex *bitmap_index, int bit) {
    int word_position = bit / 32;
    int bit_position = bit % 32;
 
    /* Get the value of one bit in the bitmap.  */
    return _extur(bitmap_index->bitmap[word_position], bit_position + bit_position * 32);
}
 
inline void BitmapIndex_copy(BitmapIndex *source, BitmapIndex *destination) {
    int i;
 
    /* Copy a full bitmap.  */
    for(i=0; i<32; i++) {
        destination->bitmap[i] = source->bitmap[i];
    }
    destination->index = source->index;
    destination->count = source->count;
}
 
inline void BitmapIndex_update(BitmapIndex *bitmap_index) {
    int i;
    uint32_t count = 0;
    uint32_t index = 0;
    uint32_t bitmap;
 
    /* Update the 'index' and 'count' fields of the Bitmap Index data-structure.
       This is useful when the bitmap is updated in a batch of operations.  */
    for (i=0; i<32; i++) {
        index = index >> 1;
        bitmap = bitmap_index->bitmap[i];
        if (bitmap != 0) {
            index |= 0x80000000;
        }
        /* Count the number of bits set to one in the bitmap.  */
        count += _dotpu4(_bitc4(bitmap),0x01010101);
    }
    bitmap_index->index = index;
    bitmap_index->count = count;
}

Iteration over a Bitmap Index

There are several ways to iterate efficiently on a bitmap index.

One-level loop iteration

To be documented.

Two-level loop iteration

The inner-loop in this method is software-pipelined by the compiler with the following characteristics:

  • ii = 3 Schedule found with 2 iterations in parallel
  • Total cycles (est.): 3 + trip_cnt * 3
  • Iteration over 1024 items (in a full bitmap index) costs about 3500 CPU cycles, or about 3.5 cyles/item.
static int BitmapIndex_iterator_bitmap[32];
static int BitmapIndex_iterator_position[32];
static int BitmapIndex_iterator_bitcount[32];
 
int bitmap_iteration_two_levels(BitmapIndex * bitmap_index, int * restrict vector) {
	int word_position, word_count = 0, bitcount = 0;
	int bit_position;
	int position, i, j;
	uint32_t index = _bitr(bitmap_index->index);
	uint32_t * restrict bitmap_array = bitmap_index->bitmap;
	uint32_t bitmap;
	int result = 0;
 
	while (index != 0) {
		word_position = _lmbd(1, index);
		index = index ^ (0x80000000 >> word_position);
		bitmap = bitmap_array[word_position];
		BitmapIndex_iterator_position[word_count] = word_position;
		BitmapIndex_iterator_bitcount[word_count] = _dotpu4(_bitc4(bitmap),0x01010101);
		BitmapIndex_iterator_bitmap[word_count++] = _bitr(bitmap);
	}
	for(i = 0; i < word_count; i++) {
		bitmap = BitmapIndex_iterator_bitmap[i];
		word_position = BitmapIndex_iterator_position[i];
		bitcount = BitmapIndex_iterator_bitcount[i];
		_nassert(bitcount > 0);
		for (j = 0; j < bitcount; j++) {
			bit_position = _lmbd(1, bitmap);
			bitmap = bitmap ^ (0x80000000 >> bit_position);
			position = word_position * 32 + bit_position;
 
			/* This where your code goes in.  */
			vector[result++] = position;
		}
	}
	return result;
}

Vectorized iteration

Let's assume that in the previous code you are placing your own functionality where the comment says "place your own code here".

In this case your functionality is subject to following constraints:

  • Some register pressure due to the code operating the loop.
  • No more than two iterations in parallel with an ii of 3 cycles.
  • Up to 32x the penalty of the loop preamble.

For complex functionality it might turn out to be very interesting to work in two passes as follows:

  • First create a vector that has the index of all the entries in the Bitmap Index.
  • Second iterate on the resulting vector in one go, and benefit from a much reduced register pressure, and the ability for the compiler to schedule much more iterations in parallel.

Compute the Vector

static int BitmapIndex_iterator_bitmap[32];
static int BitmapIndex_iterator_position[32];
static int BitmapIndex_iterator_bitcount[32];
 
int BitmapIndex_compute_vector(BitmapIndex * bitmap_index, int * restrict vector) {
	int word_position, word_count = 0, bitcount = 0;
	int bit_position;
	int position, i, j;
	uint32_t index = _bitr(bitmap_index->index);
	uint32_t * restrict bitmap_array = bitmap_index->bitmap;
	uint32_t bitmap;
	int vector_size = 0;
 
	while (index != 0) {
		word_position = _lmbd(1, index);
		index = index ^ (0x80000000 >> word_position);
		bitmap = bitmap_array[word_position];
		BitmapIndex_iterator_position[word_count] = word_position;
		BitmapIndex_iterator_bitcount[word_count] = _dotpu4(_bitc4(bitmap),0x01010101);
		BitmapIndex_iterator_bitmap[word_count++] = _bitr(bitmap);
	}
	for(i = 0; i < word_count; i++) {
		bitmap = BitmapIndex_iterator_bitmap[i];
		word_position = BitmapIndex_iterator_position[i];
		bitcount = BitmapIndex_iterator_bitcount[i];
		_nassert(bitcount > 0);
		for (j = 0; j < bitcount; j++) {
			bit_position = _lmbd(1, bitmap);
			bitmap = bitmap ^ (0x80000000 >> bit_position);
			position = word_position * 32 + bit_position;
 
			/* Fill in the vector.  */
			vector[vector_size++] = position;
		}
	}
	return vector_size;
}

Process the Vector

int example(BitmapIndex * bitmap_index) {
	int i, result, count;
 
	result = 0;
	count = BitmapIndex_compute_vector(bitmap_index, iteration_vector);
	for (i=0; i<count; i++) {
		/* Implement your own functionality here.  */
		result += database[iteration_vector[i]].metric * 7;
	}
	return result;
}

Boolean logic on Bitmap Indexes

Some applications need to perform boolean logic operations on bitmap indexes. This allows the applications to perform the equivalent of complex queries on a database of objects.

This is easily implemented by performing bitwise logical operations on several bitmaps as illustrated below, and this functionality can be optimized heavily by ensuring that bitmap indexes are stored at 64-bit aligned adresses.

inline void BitmapIndex_or(BitmapIndex *bitmap_index_a, BitmapIndex *bitmap_index_b,
      BitmapIndex *bitmap_index_dest) {
    int i;
    uint32_t * restrict bitmap_a = bitmap_index_a->bitmap;
    uint32_t * restrict bitmap_b = bitmap_index_b->bitmap;
    uint32_t * restrict bitmap_dest = bitmap_index_dest->bitmap;
    _nassert(((int)bitmap_a & 0x7) == 0);
    _nassert(((int)bitmap_b & 0x7) == 0);
    _nassert(((int)bitmap_dest & 0x7) == 0);
    for (i=0; i<32; i++) {
        bitmap_dest[i] = bitmap_a[i] | bitmap_b[i];
    }
    BitmapIndex_update(bitmap_index_dest);
}
 
inline void BitmapIndex_or_not(BitmapIndex *bitmap_index_a, BitmapIndex *bitmap_index_b,
      BitmapIndex *bitmap_index_dest) {
    int i;
    uint32_t * restrict bitmap_a = bitmap_index_a->bitmap;
    uint32_t * restrict bitmap_b = bitmap_index_b->bitmap;
    uint32_t * restrict bitmap_dest = bitmap_index_dest->bitmap;
    _nassert(((int)bitmap_a & 0x7) == 0);
    _nassert(((int)bitmap_b & 0x7) == 0);
    _nassert(((int)bitmap_dest & 0x7) == 0);
    for (i=0; i<32; i++) {
        bitmap_dest[i] = bitmap_a[i] | (~ bitmap_b[i]);
    }
    BitmapIndex_update(bitmap_index_dest);
}
 
inline void BitmapIndex_and(BitmapIndex *bitmap_index_a, BitmapIndex *bitmap_index_b,
      BitmapIndex *bitmap_index_dest) {
    int i;
    uint32_t * restrict bitmap_a = bitmap_index_a->bitmap;
    uint32_t * restrict bitmap_b = bitmap_index_b->bitmap;
    uint32_t * restrict bitmap_dest = bitmap_index_dest->bitmap;
    _nassert(((int)bitmap_a & 0x7) == 0);
    _nassert(((int)bitmap_b & 0x7) == 0);
    _nassert(((int)bitmap_dest & 0x7) == 0);
    for (i=0; i<32; i++) {
        bitmap_dest[i] = bitmap_a[i] & bitmap_b[i];
    }
    BitmapIndex_update(bitmap_index_dest);
}
 
inline void BitmapIndex_and_not(BitmapIndex *bitmap_index_a, BitmapIndex *bitmap_index_b,
      BitmapIndex *bitmap_index_dest) {
    int i;
    uint32_t * restrict bitmap_a = bitmap_index_a->bitmap;
    uint32_t * restrict bitmap_b = bitmap_index_b->bitmap;
    uint32_t * restrict bitmap_dest = bitmap_index_dest->bitmap;
    _nassert(((int)bitmap_a & 0x7) == 0);
    _nassert(((int)bitmap_b & 0x7) == 0);
    _nassert(((int)bitmap_dest & 0x7) == 0);
    for (i=0; i<32; i++) {
        bitmap_dest[i] = bitmap_a[i] & (~ bitmap_b[i]);
    }
    BitmapIndex_update(bitmap_index_dest);
}
 
inline void BitmapIndex_xor(BitmapIndex *bitmap_index_a, BitmapIndex *bitmap_index_b,
      BitmapIndex *bitmap_index_dest) {
    int i;
    uint32_t * restrict bitmap_a = bitmap_index_a->bitmap;
    uint32_t * restrict bitmap_b = bitmap_index_b->bitmap;
    uint32_t * restrict bitmap_dest = bitmap_index_dest->bitmap;
    _nassert(((int)bitmap_a & 0x7) == 0);
    _nassert(((int)bitmap_b & 0x7) == 0);
    _nassert(((int)bitmap_dest & 0x7) == 0);
    for (i=0; i<32; i++) {
        bitmap_dest[i] = bitmap_a[i] ^ bitmap_b[i];
    }
    BitmapIndex_update(bitmap_index_dest);
}

Example of use of Bitmap Indexes

Here is an example of software loop that can be transformed to significantly improve its performance by converting boolean fields to bitmap indexes.

Original code

#define DATABASE_SIZE 1024
 
typedef struct {
	int active;
	int urgent;
	int scheduled;
	int metric;
} DemoObject;
DemoObject database[1024];
 
int example(int * statistics) {
	int i, ignored, result;
 
	result = 0;
	ignored = 0;
	for (i=0; i < 1024; i++) {
		/* "Complex" control loop.  */
		if (database[i].active) {
			if (database[i].scheduled && !database[i].urgent) {
				/* Do something for scheduled and NOT urgent objects.  */
				result += database[i].metric * 7;
			} else if (database[i].scheduled && database[i].urgent) {
				/* Do something for scheduled and urgent objects.  */
				result += database[i].metric * 10;
			} else {
				ignored++;
			}
		}
	}
	*statistics += ignored;
	return result;
}

Optimized code using Bitmap Indexes

And here is what the same functionality looks like when the fields active, scheduled, urgent have been converted into bitmap indexes.

Benchmarking of this implementation shows a speed-up of the example() function of at least 8x!

#define DATABASE_SIZE 1024
 
#pragma DATA_MEM_BANK(indexes, 0);
BitmapIndex indexes[4];
 
BitmapIndex *active = &indexes[0];
BitmapIndex *urgent = &indexes[1];
BitmapIndex *scheduled = &indexes[2];
BitmapIndex *temporary = &indexes[3];
 
typedef struct {
	/* The boolean structure fields are no longer used, as they are replaced with bitmap indexes.  */
	/* Memory saving: 12K - 4 * 136 bytes = 11744 bytes, and in most cases a performance boost!  */
	int metric;
} DemoObject;
DemoObject database[1024];
 
/* Utility functions for boolean logic. Assumption is made that all bitmap indexes are 64-bit aligned.  */
inline void BitmapIndex_active_not_scheduled(BitmapIndex *bitmap_index_a, BitmapIndex *bitmap_index_b,
      BitmapIndex *bitmap_index_dest) {
    int i;
    uint32_t * restrict bitmap_a = bitmap_index_a->bitmap;
    uint32_t * restrict bitmap_b = bitmap_index_b->bitmap;
    uint32_t * restrict bitmap_dest = bitmap_index_dest->bitmap;
    _nassert(((int)bitmap_a & 0x7) == 0);
    _nassert(((int)bitmap_b & 0x7) == 0);
    _nassert(((int)bitmap_dest & 0x7) == 0);
    for (i=0; i<32; i++) {
        bitmap_dest[i] = bitmap_a[i] & (~ bitmap_b[i]);
    }
    BitmapIndex_update(bitmap_index_dest);
}
 
inline void BitmapIndex_active_urgent_scheduled(BitmapIndex *bitmap_index_a, BitmapIndex *bitmap_index_b,
      BitmapIndex *bitmap_index_c, BitmapIndex *bitmap_index_dest) {
    int i;
    uint32_t * restrict bitmap_a = bitmap_index_a->bitmap;
    uint32_t * restrict bitmap_b = bitmap_index_b->bitmap;
    uint32_t * restrict bitmap_c = bitmap_index_c->bitmap;
    uint32_t * restrict bitmap_dest = bitmap_index_dest->bitmap;
    _nassert(((int)bitmap_a & 0x7) == 0);
    _nassert(((int)bitmap_b & 0x7) == 0);
    _nassert(((int)bitmap_c & 0x7) == 0);
    _nassert(((int)bitmap_dest & 0x7) == 0);
    for (i=0; i<32; i++) {
        bitmap_dest[i] = bitmap_a[i] & bitmap_b[i] & bitmap_c[i];
    }
    BitmapIndex_update(bitmap_index_dest);
}
 
inline void BitmapIndex_active_not_urgent_scheduled(BitmapIndex *bitmap_index_a, BitmapIndex *bitmap_index_b,
      BitmapIndex *bitmap_index_c, BitmapIndex *bitmap_index_dest) {
    int i;
    uint32_t * restrict bitmap_a = bitmap_index_a->bitmap;
    uint32_t * restrict bitmap_b = bitmap_index_b->bitmap;
    uint32_t * restrict bitmap_c = bitmap_index_c->bitmap;
    uint32_t * restrict bitmap_dest = bitmap_index_dest->bitmap;
    _nassert(((int)bitmap_a & 0x7) == 0);
    _nassert(((int)bitmap_b & 0x7) == 0);
    _nassert(((int)bitmap_c & 0x7) == 0);
    _nassert(((int)bitmap_dest & 0x7) == 0);
    for (i=0; i<32; i++) {
        bitmap_dest[i] = bitmap_a[i] & bitmap_b[i] & (~ bitmap_c[i]);
    }
    BitmapIndex_update(bitmap_index_dest);
}
 
int example(int * statistics) {
	int i, ignored, result, count;
 
	result = 0;
	ignored = 0;
	BitmapIndex_active_not_urgent_scheduled(active, scheduled, urgent, temporary);
	count = BitmapIndex_compute_vector(temporary, iteration_vector);
	for (i=0; i<count; i++) {
		result += database[iteration_vector[i]].metric * 7;
	}
	BitmapIndex_active_urgent_scheduled(active, scheduled, urgent, temporary);
	count = BitmapIndex_compute_vector(temporary, iteration_vector);
	for (i=0; i<count; i++) {
		result += database[iteration_vector[i]].metric * 10;
	}
	BitmapIndex_active_not_scheduled(active, scheduled, temporary);
	ignored = temporary->count;
	*statistics += ignored;
	return result;
}

Software optimization techniques using In-Memory Bitmap Indexes

Here are some examples of structural optimizations that are made possible by the use of Bitmap Indexes.

convert use of boolean fields

convert use of linked lists

complex database queries optimization

convert to vectorized processing

use of the count field to further optimization by iterating multiples of 2, 4 times