Please note as of Wednesday, August 15th, 2018 this wiki has been set to read only. If you are a TI Employee and require Edit ability please contact x0211426 from the company directory.

OpenMP Accelerator Model User's Guide

From Texas Instruments Wiki
Jump to: navigation, search

TIBanner.png


OpenMPTM Accelerator Model v1.1.0


User's Guide



Introduction

The OpenMP ™ 4.0 specification enables the use of OpenMP on heterogeneous systems by adding support for a set of device constructs. The OpenMP community uses the term "OpenMP Accelerator Model" to refer to this set. OpenMP 4.0 defines a host device on which the OpenMP programs begin execution, and target devices onto which regions of code can be offloaded.

Texas Instruments' implementation of the OpenMP Accelerator Model targets the 66AK2H SoC. The 66AK2H SoC is composed of a ARM MPCore cluster with 4 Cortex-A15 CPUs (up to 1.4 Ghz) and an octa-core C66x DSP (up to 1.228 Ghz) cluster. The Cortex-A15 quad cores are fully cache coherent, while the DSP cores do not maintain cache coherency. The DSP cores do not have any virtual memory support as they do not have an MMU and do not share the ARM MMU. The DSP cores have 32KB of L1D and L1P and 1MB L2 cache each. Each of these caches are configurable and can be partitioned into scratchpad RAM (SRAM) as needed. The ARM cores also have 32 KB of L1D and L1P cache per core, but share a single 4 MB L2 cache. The Multicore Shared Memory Controller (MSMC) also provides 6 MB of SRAM which is shared by all ARM and DSP cores.

In this implementation, the host device is the ARM MPCore cluster running SMP Linux and the target device is the DSP cluster.

User's Guide

The User's Guide is located here.

Supported OpenMP Device Constructs

This product supports the the following OpenMP 4.0 device constructs and runtime library routines:

  • #pragma omp target
  • #pragma omp declare target
  • #pragma omp target data
  • #pragma omp target update
  • omp_set_default_device()
  • omp_get_default_device()
  • omp_get_num_devices()
  • omp_is_initial_device()

The following sections briefly introduce the supported constructs. Refer the OpenMP 4.0 specification for details.

#pragma omp target

The target construct is used to specify the region of code that should be offloaded for execution onto the target device. The construct also creates a device data environment by mapping host buffers to the target for the extent of the associated region. Map clauses on the target construct specify data movement from host to device before execution of the offloaded region, and device to host after execution of the offloaded region. Map types alloc, to, from and tofrom indicate data allocation and transfer direction.

float a[1024];
float b[1024];
float c[1024];
int size;

void vadd_openmp(float *a, float *b, float *c, int size)
{
    #pragma omp target map(to:a[0:size],b[0:size],size) map(from: c[0:size])
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < size; i++)
            c[i] = a[i] + b[i];

    }
}

In the example above, variables a, b, c and size initially reside in host (ARM Linux) memory. Upon encountering a target construct:

  • Space is allocated in device memory for variables a[0:size], b[0:size], c[0:size] and size.
  • Any variables annotated ‘to’ are copied from host memory -> device memory.
  • The target region is executed on the device. Note that #pragma omp parallel for is used to distribute iterations of the for loop across the 8 DSP cores.
  • Any variables annotated ‘from’ are copied from device memory -> host memory.


Mapping variables from host memory to device memory

To reduce the overhead of copying variables between host and target devices, refer the section on reducing offload overhead

#pragma omp declare target

The declare target construct specifies that variables and functions are mapped to a device. Each function specified in a declare target region must have definitions for both the host and target device. In the code snippet below, function DSPF_sp_fftSPxSP is called from within a target region. It's prototype must be placed in a declare target region.

#pragma omp declare target
/* There must be a host and accelerator target definition for this function */
void DSPF_sp_fftSPxSP(int N,
                      float *x, float *w, float *y,
                      unsigned char *brev,
                      int n_min, int offset, int n_max);
#pragma omp end declare target

void dsplib_fft(int N, int bufsize,
                float* x, float* w, float *y,
                int n_min, int offset, int n_max)
{
    #pragma omp target map(to: N, x[0:bufsize], w[0:bufsize], \
                               n_min, offset, n_max) \
                       map(from: y[0:bufsize])

    {
        DSPF_sp_fftSPxSP (N, x, w, y, 0, n_min, offset, n_max);
    }
}



#pragma omp target data

The target data construct creates a device data environment by mapping host buffers to the target for the extent of the associated region. Target regions may be executed within a device data environment. Furthermore, mapped buffers may be re-used by multiple enclosed target regions and are valid for the entire scope of the 'target data' region. Target data regions may also be nested.

/* Create device buffers for a, b, c and transfer data from Host -> Device for a,b */
#pragma omp target data map(to:a[0:size], b[0:size]) map(from:c[0:size])
{
   /* Existing device buffers are used and no data is transferred here */
   #pragma omp target
   {
       int i;
       #pragma omp parallel for
       for (i = 0; i < size; i++)
           c[i] += a[i] + b[i];
   }
} /* Device -> Host data transfer of buffer c is done here*/


#pragma omp target update

The target update construct is used to synchronize host or device buffers within a target data region as required. In the following example, buffer c is synchronized to the host in the first target update call and to the device in the second target update call. The update direction is specified using to and from clauses.

void operate_on_host(int* buffer);

#pragma omp declare target
void operate_on_device_initial(int* srcA, int* srcB, int* result);
void operate_on_device_final(int* srcA, int* srcB, int* result);
#pragma omp end declare target

/* Create device buffers for a, b, c and transfer data from Host -> Device for a,b */
#pragma omp target data map(to:a[0:size], b[0:size]) map(from:c[0:size])
{
   /* Existing device buffers are used and no data is transferred here */
   #pragma omp target
   {
      /* Here buffer c is modified with results while operating on input buffers a,b */
      operate_on_device_initial(a,b,c);
   }

   /* Copy modified target data to the host */
   #pragma omp target update from(c[0:size])

   /* Do some host side processing */
   operate_on_host(c);

   /* Synchronize c with target device data buffer */
   #pragma omp target update to(c[0:size])

   /* Offload more computation to target device */
   #pragma omp target
   {
      operator_on_device_final(a,b,c);
   }

} /* Device -> Host data transfer of buffer c is done here again*/



TI Extensions

TI has implemented 2 extensions to the OpenMP Accelerator Model:

  1. An API to reduce the overhead of offloading target regions
  2. A local map type to map variables to target scratchpad memory - this memory has much lower access times compared to MSMC SRAM or DDR.

Reducing offload overhead

Data synchronization between the host and target device can be a significant source of overhead. This overhead has implications for the amount of computation that needs to be performed by a target region to outweigh the data synchronization overhead. On the 66AK2H SoC, the host and target device share internal and external memory. However:

  • the target device does not have a memory management unit (MMU); and
  • there is no hardware cache coherency between the target and host device.

As a result, the host accesses shared memory using virtual addresses and the target accesses the shared memory using physical addresses. Moreover, host device variables can span multiple non-contiguous pages in Linux virtual memory whereas the target device operates on contiguous physical memory. When mapping variables from the Linux process space, the variables must be copied into contiguous memory for target operation. This copy is inefficient, especially for large variables. To eliminate this copy, the implementation provides a special purpose dynamic memory allocation API, malloc ddr and malloc msmc. The physical memory associated with this heap is contiguous and is mapped to a contiguous chunk of virtual memory on the host. If any host variables allocated via this API are mapped into target regions, the map clauses translate to cache management operations on the host, significantly reducing the overhead.

The following TI-provided functions may be used to allocate and free contiguous segments of memory that may be accessed by all ARM and DSP cores:

  • __malloc_ddr: Allocate a buffer in contiguous DDR memory
  • __free_ddr: Free associated with __malloc_ddr
  • __malloc_msmc: Allocate a buffer in contiguous MSMC memory
  • __free_msmc: Free associated with __malloc_msmc

These functions have the same interface as standard malloc/free. Note that when using these functions within an OpenMP application, any required memory consistency operations are automatically handled by the supporting runtime systems.

/* Allocate buffer in device memory */
float* a     = (float*) __malloc_ddr(bufsize);

/* Initialize buffer on the host */
for (int i=0; i < NumElements; ++i)
{
    a[i] = 1.0;
}


/* Map to is a cache write-back operation on host. No copy is performed
   Map from is a cache invalidate operation on host. No copy performed */

#pragma omp target map(to:a[0:size],size) map(from: a[0:size])
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < size; i++)
        a[i] *= 2.0;
}

/* Free buffer */
__free_ddr(a);


The table below lists data movement and cache operations that are performed by the Accelerator Model runtime for the various target constructs. This information is specific to TI's implementation of the OpenMP Accelerator Model.

Data Movement Operations
OpenMP 4.0 construct
target or target data map target update
Location of host variable To target From target To target From target
Linux paged memory (heap/stack/global) memcpy from host memory to target memory memcpy from target memory to host memory memcpy from host memory to target memory memcpy from target memory to host memory
Allocated via __malloc_ddr/__malloc_msmc ARM Cache Writeback of address range specified by the clause ARM Cache Invalidate of address range specified by the clause ARM Cache Writeback of address range specified by the clause ARM Cache Invalidate of address range specified by the clause


local map type

An additional local map type has been added, which maps a variable to the L2 scratchpad memory. In terms of data synchronization, such variables are treated as map-type alloc. They have an undefined initial value on entry to the target region and any updates to the variable in the target region cannot be reflected back to the host. Mapping host variables to target scratchpad memory provides significant performance improvements.

On the 66AK2H, each DSP core's 1MB L2 memory is configured as 128K cache, 768K scratchpad available to user programs and 128K reserved by the runtime. The 768K scratchpad is accessible via the local map type.

/* A 1KB scratch buffer */
char* scratch_buffer = malloc(1024);

#pragma omp target map(tofrom: a[0:size]) map(local: scratch_buffer[0:1024])
{
    /* a[] is copied to the Device */
    /* scratch_buffer[] is allocated in L2 SRAM,
       scratch_buffer[] is NOT copied to the Device */


    /* Perform operations on buffer a[] in DDR using the L2 SRAM scratch_buffer */
    operate_on(a, scratch_buffer);

} /* a[] is copied back to the Host,
     scratch_buffer[] is NOT copied back to Host */


Dynamic Memory Management within Target Regions

Dynamic memory requests via malloc/calloc/realloc within OpenMP target regions are allocated from a heap managed by the Multicore DSP runtime. This heap is located in DDR, shared by all the DSP cores and is 8MB in size.

If the target region requires a heap >8MB or a heap in MSMC SRAM, the runtime provides additional APIs to initialize and manage such heaps. Descriptions and examples of use of the API are provided in DSP Heap Management.


Change Log

Package Version Build Date Short Description Changes
1.1.0 09-10-2014 Production patch
  • Integrated TI's Usage and Load Monitor library (ULMLib). ULMLib collects specific OpenMP runtime events into DSP trace buffers. More details here
  • Non-OpenMP pragmas within target regions are now supported
  • default clause on constructs is now supported
  • Long long type for loop iteration variables is now supported for OpenMP loops
  • An empty statement is no longer required to follow a target update construct
1.0.0 07-30-2014 Production OpenMP Accelerator Model release
  • Non-zero lower bounds for array sections is now supported
  • Mapping sub-array sections of mapped array sections is now supported
  • Added examples dotprod_fileread, edmabw, local, sub_section, vecadd_complex, vecadd_lib
  • Device and if clauses are now supported on device constructs
  • Memory management functions __TI_omp_device_alloc and __TI_omp_device_free have been replaced by __malloc_ddr()/__malloc_msmc() and __free_ddr()/__free_msmc()
  • Creation of CLACC static libraries is now supported
  • Included man page for CLACC
0.3.3 04-25-2014 MCSDK GA Release
  • Fixed target region without map clause problem in parser, removed error description in readme
  • Added examples target_implicit_map, target_orphan_call, target_update, dotprod, openmpbench_C_v3
  • Performance optimization: CMEM cacheWb and CMEM_cacheInv's done only when needed
  • Supports applications written entirely in C
0.2.2 02-26-2014 Patch release
  • Changed order of -L,-I,-l to match order in which user specifies it
  • MPAX problem fixed in OpenCL, TI_OCL_DSP_NOMAP removed
0.2.1 02-24-2014 Patch release
  • Documented non-zero array start problem in readme.txt
  • Added printf_debug example back, as bug fixed in OpenCL
  • Fixed CLACC multiple -l, -L, -I problem
  • Added edmamgr example
0.2.0 02-17-2014 First release First release