C66x Heterogeneous Programming

From Texas Instruments Wiki

Heterogeneous Programming for c66x

This wiki is one in a series showing how to use c66x accelerator cards in commodity servers to achieve real-time, high capacity processing and analytics of multiple concurrent streams of media, signals and other data.


Using established technology and software stacks built by TI's third-party ecosystem, it's now possible to accelerate generic C/C++ programs by automatically creating TI and Intel co-dependent executables, aka "heterogeneous programming" for c66x and x86 cores. Based on OpenMP pragmas, the CIM® (Compute Intensive Multicore) Hyperpiler™ separates original source into c66x and x86 source code streams, augments these streams with necessary code for run-time data transfer and synchronization, and then builds, downloads, and runs the resulting executables on multiple cores.

This wiki focuses on the Hyperpiler, while others in the series show how to set up off-the-shelf HPC servers that combine tens of x86 cores and hundreds of c66x cores in order to run High Performance Virtual Machines (HPVMs) and computer vision (OpenCV). The Server HPC overview wiki has detailed information about tested servers, including OS and Hypervisor info, and measured power consumption and thermal statistics.

Underlying Technology

Following is a list of TI and third-party items required:

  1. c66x CPUs and build tools, TI
  2. 32-core or 64-core c66x PCIe accelerator cards, Advantech
  3. Standard off-the-shelf server running Ubuntu, CentOS, or Red Hat Linux (tested examples given below)
  4. DirectCore host drivers and libraries, Signalogic
  5. CIM Hyperpiler, Signalogic
  6. Application Demo Programs, Signalogic

c66x CPUs and Build Tools

Yes you read that right -- CPU, not DSP. Although TI marketing continues to label c66x devices as "DSPs", after some 30 years of advanced chip development by TI, this is no longer a precise label. The c66x architecture is a CPU architecture, similar in many ways to Intel x86, including external memory, internal memory subsystem (L1P, L1D, L2 cache, multicore shared memory), embedded PCIe and high-speed NIC peripherals, and inter-CPU communication.

TI build tools are available online.

Note that Code Composer Studio software and detailed knowledge of low-level TI chip details are not required. The TI build tools are command line tools, and the Hyperpiler demos described below automatically generate makefiles in standard format, suitable for use by both c66x and x86 platform build tools.

PCIe Cards

The Advantech PCIe cards supply the server horsepower. Each card has up to 64 cores, occupies a single slot (unlike GPU boards, which take two), has two 1 GbE NICs, and draws about 120W. Up to 256 cores can be installed in a standard 1U server, and twice that many in suitable 1U or 2U servers. This is a lot of CPU cores, and aligns well with emerging server architecture trends in virtualization, DPDK, and high bandwidth network I/O, as well as multicore programming models such as OpenMP and OpenACC.

Off-the-Shelf Linux Servers

Servers and OS tested with the CIM Hyperpiler include:

  • Servers: HP DL380 G8 and G9, Dell R710 and R720, Supermicro 6016GT or 1028Gx series, others
  • Linux OS: Ubuntu 12.04, 14.04, and 14.10; CentOS 6.2, 7, and 7.1; Red Hat 7
  • KVM Hypervisor and QEMU system emulator

Detailed information about tested servers, including pictures of c66x card installation, power consumption stats, and temperature stats, is located on the Server HPC Overview wiki.

Host and Guest Drivers

DirectCore drivers interact with c66x PCIe cards from either host instances or VMs. Host instances use a "physical" driver and VM instances use virtIO "front end" drivers.

Host and Guest Libs

DirectCore libraries provide a high level API for applications. DirectCore libraries abstract all c66x cores as a unified "pool" of cores, allowing multiple users / VM instances to share c66x resources, including NICs on the PCIe cards. This applies regardless of the number of PCIe cards installed in the server. The High Performance VM wiki has more information on allocating c66x cores and NIC resources to VMs (under Linux + KVM).

Example Source Code

Below is example source code for this wiki, compatible with the following platforms:

  • c66x
  • x86
  • combined c66x and x86 (heterogeneous programming)

The first two examples rely on OpenMP support in the platform build tools, either TI tools (c66x) or gcc tools (x86). The third example uses the CIM Hyperpiler in addition to both sets of platform build tools. (Note that native OpenMP pragmas can also be mixed with CIM pragmas.) In the source example below, platform #ifdef's determine which pragma keyword to use: "omp" for platform build tools and "cim" for the Hyperpiler.

/*
  File: conv.c

  Copyright (C) Signalogic, 2011-2015

  Convolution OpenMP example -- simple demonstration of parallel-for pragma for
  x86, c66x, and CIM(R) Hyperpiler(TM)

  Jeff Brower, Chris Johnson, Anish Mathew, Signalogic

  Supports x86 OpenMP using gcc compiler, c66x OpenMP using TI compiler, and
  combined x86 and c66x using the CIM(R) Hyperpiler(TM) (which generates code
  for gcc and TI compilers)

  Revision History:
    Created: 2011, Andrey Khrokin
    Modified 2015, Anish Mathew.  Added changes to the original conv.c to
    support x86 and c66x OpenMP, test to be compatible with new CIM additions,
    including taskc pragma
*/

#include <omp.h>
#include <stdio.h>
#include <string.h>

/* convolution array lengths */

#define H_LEN 5   /* filter length.  Tested values:  8000 5 2 */
#define X_LEN 20  /* input length.  Tested values:  40000 20 10 */

#ifdef _TI66XX  /* c66x platform */

  #include <xdc/runtime/System.h>  /* header files for System_printf, TSCL and cache function calls, AKM */
  #include "/opt/ti/ti-cgt-c6000_8.0.1/include/c6x.h"

  #pragma DATA_SECTION (test_variable,"L2SRAM")
  volatile int test_variable = 1;
  #pragma DATA_SECTION (timer_start,"L2SRAM")
  volatile unsigned long long timer_start[8] = { 0 };
  #pragma DATA_SECTION (timer_end,"L2SRAM")
  volatile unsigned long long timer_end;
  #pragma DATA_SECTION (frame_cyc,"L2SRAM")
  volatile unsigned long long frame_cyc[8] = { 0 };

#else  /* x86 platform or combined x86 and c66x (CIM) */

  #include <sys/time.h>

  #ifdef _CIM   /* CIM Hyperpiler (for example, x86 server with c66x accelerator PCIe card) */
    #include "hwlib.h"   /* DirectCore API library */
    #include "cimlib.h"  /* CIM API library */
    unsigned int numStreams = 0;
  #endif

  volatile int test_variable = 0x01;
  double start, end;
  struct timeval tp;
  struct timeval tp2;

#endif

void InitArrays();

/* declare arrays */

int x[X_LEN + 2*H_LEN-2];  /* input */
int h[H_LEN];              /* filter */
int y[X_LEN + H_LEN-1];    /* output */

int main (int argc, char *argv[]) {

int i, j, sum;

   InitArrays();  /* for test purposes, init arrays with constant values to produce a known result */

   test_variable |= 0x06;  /* debug only */

/* start profiling */

   #ifdef _TI66XX /* c66x platform */
      timer_start[DNUM] = _itoll(TSCH, TSCL);
   #elif defined(_CIM)  /* x86 server with c66x accelerator */
      gettimeofday (&tp, NULL);
      start = tp.tv_sec + tp.tv_usec / 1.0e6;
   #else  /* x86 server is default */
      gettimeofday (&tp, NULL);
      start = tp.tv_sec + tp.tv_usec / 1.0e6;
   #endif

/* use platform ifdef's to determine which OpenMP pragma keyword to use */

   #ifdef _CIM
   #pragma cim parallel for num_threads 5 private(i,j,sum) /* combined x86 and c66x (CIM).  Notes:  num_threads = number of c66x cores to use */
   #else
   #pragma omp parallel for private(i,j,sum) /* x86 only or c66x only */
   #endif
   for (i=0; i<X_LEN+H_LEN-1; i++) {  /* convolution nested for-loop */

      sum = 0;
      for (j=0; j<H_LEN; j++) sum += h[j] * x[i + j];
      y[i] = sum;
   }

/* measure and report profiling results */

   #ifdef _TI66XX /* c66x platform */
      timer_end = _itoll(TSCH, TSCL);
      frame_cyc[DNUM] = timer_end - timer_start[DNUM];
   #elif defined(_CIM)  /* CIM, x86 server with c66x accelerator */
      gettimeofday (&tp2, NULL);
      end = tp2.tv_sec + tp2.tv_usec / 1.0e6;
      printf("Start time: %f\n", start);
      printf("End time:   %f\n", end);
      printf("Elapsed time: %f s\n", end - start);
      for (i = 0; i < X_LEN + H_LEN - 1; i++) printf("%d ", y[i]);
   #else  /* x86 server is default */
      gettimeofday (&tp2, NULL);
      end = tp2.tv_sec + tp2.tv_usec / 1.0e6;
      printf("Start time: %f\n", start);
      printf("End time:   %f\n", end);
      printf("Elapsed time: %f s\n", end - start);
      for (i = 0; i < X_LEN + H_LEN - 1; i++) printf("%d ", y[i]);
   #endif

   test_variable |= 0x08; /* debug only */

   return 0;  /* exit, program complete */
}

/* for test purposes, init arrays with constant values to produce a known result */

void InitArrays() {

int x_len = X_LEN + 2*H_LEN - 2;
int i;

   for(i=0; i<X_LEN; i++) x[i] = 4;
   for(i=X_LEN; i<x_len; i++) x[i] = 0;
   for(i=0; i<H_LEN; i++) h[i] = 1;
}

CIM Hyperpiler Demos

Two Hyperpiler demos are described on this wiki:

  • Convolution
  • Real-time video streaming

Convolution Demo

The convolution demo is a short, simple example with one parallel-for pragma. Its purpose is to highlight both compatibility with platform OpenMP support and the Hyperpiler approach to combining heterogeneous platforms while staying within an OpenMP framework.

To build the convolution demo, give the following command lines:

 cd /install_path/Signalogic/CIM/apps/test_demo/conv
 make cim

where "install_path" is the path designated during Signalogic software installation. Note that running make with the cim target performs the following three steps:

  • CIM Hyperpiler process
  • c66x make
  • x86 make (e.g. Linux gcc make)

Here is what the cim target looks like inside the Makefile:

 ./../../../cimrt/cimpp -px86 -ati66 -cSIGC66XX-32 -s /install_path/Signalogic/CIM/apps/test_demo/conv/conv.c

Next, to run the convolution demo, enter:

 ./conv -m0x1f -f1400 -cSIGC66XX-32

Command line syntax for core list (-m), clock rate (-f), card designator (-c), and other options is detailed in the SigC667x Users Guide, downloadable here.

Here are some notes about running CIM generated executables:

  1. c66x executables are automatically downloaded, initialized, and run. There is no “.out” file or other executable file entry in the command line (unlike DirectCore test programs that are run manually)
  2. If you give a standard make command, then you are re-running the Linux make, not re-running the CIM process (and not regenerating x86 and c66x CPU source codes)
  3. When giving the card designator in the command line, if you are running multiple processes or sharing the card with other users, then it may be necessary to pay attention to the number of cores requested. For example, the convolution demo command line above has the following card designator entry:

 -cSIGC66XX-32

which requests 32 cores. If you enter:

 -cSIGC66XX-8

then only 8 cores are requested. Requested cores are allocated from cores previously reserved for the user's host or VM instance. For more information on VMs, see the High Performance VM wiki.

Below is a screen grab showing expected output from the convolution demo:

CIM Hyperpiler convolution demo output

Video Streaming Demo

The video streaming demo is a practical example, intended to demonstrate useful real-time processing under the Hyperpiler approach.

To build the video streaming demo, give the following command lines:

 cd /install_path/Signalogic/CIM/apps/test_demo/vid_streaming
 make cim

To run the video streaming demo, give the following command line:

 ./vid_streaming -m0x03 -f1600 -cSIGC66XX-64 -s2 -i/install_path/Signalogic/video_files/parkrun_720p_50fps_420fmt.yuv -x1280 -y720 -D10.0.1.63:45056:b4-ce-f6-9d-3f-36 -B1500000 -r15

Source code for the video streaming program is shown in the "Continuous Task Pragma" section below.

Continuous Task Pragma

For real-time systems, it's important that heterogeneous programming take into account:

  • Code that must run continuously, on event-driven or periodic (timing) basis
  • I/O latency and bandwidth
  • Double-buffering, circular queue, and other forms of real-time data transfer

The OpenMP standard is not far enough along yet for this, so the CIM Hyperpiler recognizes a "taskc" pragma and associated syntax to implement continuous tasks with performance-sensitive I/O.

Video Streaming Source Code

The source code below demonstrates the taskc pragma; note that the code performs fully functional, multistream video output (using IP/UDP/RTP streaming format). Note also how little source code is required.

/*
    H.264 file encoding and RTP streaming demo using CIM software and SigC6678 accelerator card

    This demo highlights taskc (continuous task) pragma:
      -runs data plane real-time functions and network I/O on target (accelerator) cores
      -runs control plane functions on host cores

    1) See streamTest API demo documentation for command line usage (command line options are same)
    2) file-to-file, file-to-stream, VM desktop to stream modes supported
    3) Received stream tested with various clients, including Linux servers, Android tablets, Surface Pro 3 with Ubuntu installed

  Copyright (C) Signalogic 2014-2015

  Revision History:
    Created 2014 - AKM
    Revised Feb-May 2015 - AKM, JHB
*/

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/socket.h>
#include <limits.h>
#include <unistd.h>

/* following header files required if DirectCore APIs are used */

#include "hwlib.h"
#include "cimlib.h"

/* following header files required depending on application type */

#include "test_programs.h"
#include "keybd.h"

/* following shared host/target CPU header files required depending on app type */

#include "streamlib.h"
#include "video.h"
#include "vdi.h"

/* vars available due to code generation */

extern HCARD  hCard;   /* card initialization and card handle assignment is handled by CIM processing */
extern char   szTargetExecutableFile[];
extern char   szCardDesigator[];
extern char   szCardDescription[];
extern QWORD  nCoreList;

#define DEBUG_OUTPUT  /* enable debug output (see below) */

unsigned char  inputbuf[MAXSTREAMS][MAVIDDESCRIPTORSIZE];
unsigned char  outputbuf[MAXSTREAMS][MAVIDDESCRIPTORSIZE];
VDIPARAMS      VDIParams[MAXSTREAMS] = { 0 };  /* video and streaming params filled in by command line, see cimlib.h */

unsigned int   numStreams = 0;
unsigned int   numFramesEncoded = 0;
int            numFileBytes = 0;
int            bytesToWrite = 0;
unsigned int   hostFramesWritten = 0;
unsigned int   numBytesPerFrame;
unsigned int   mode_flag = CIM_FUNCMODE_INIT;  /* mode flag supported by target CPU functions with taskc pragma.  We set to init mode.  See cimlib.h for more info */

void RunOnce(int, time_t);
void UpdateStats(HCARD, int);
unsigned int ExitLoop(int);
void SaveOutputFile(HCARD, int);

int main(int argc, char **argv) {

int         i, exitCode;
FILE*       fp = NULL;
char*       memBuffer = NULL;
time_t      timerInterval = 1000;  /* default timer setting:  1 msec rate in oneshot mode.  For continuous mode we set based on frame rate (below) */

/* process command line (onscreen error messages are handled) */

   if (!cimGetCmdLine(argc, argv, NULL, CIM_GCL_VDI | CIM_GCL_DISABLE_MANDATORIES, &CardParams, &VDIParams)) exit(EXIT_FAILURE);

/* print banner */

   printf("** CIM video streaming demo, Copyright (C) Signalogic, 2014-2015.  Card %s-%2.1fGHz, target executable file %s **\n", szCardDescription, CardParams.nClockRate/1e9, szTargetExecutableFile);

   numStreams = VdiNumStreams(VDIParams);
   numBytesPerFrame = (VDIParams[0].Video.width * VDIParams[0].Video.height * YUV12bits_per_pixel / CHAR_BIT);

/* initialize depending on mode specified in cmd line */

   if (StreamingMode(VDIParams) == STREAM_MODE_ONESHOT) {

      printf("Loading input video data from file %s... \n", VDIParams[0].Video.inputFilename);

      if ((numFileBytes = DSLoadDataFile(hCard, VDIParams[0].Video.inputFilename, TARGET_CPU_BUFFER_BASE_ADDR, 0, 0)) <= 0) {
         printf("Input video file not found\n");
         exit(EXIT_FAILURE);
      }

      VDIParams[0].Video.framesToEncode = numFileBytes/numBytesPerFrame;
      printf("Number of frames to encode %d\n", VDIParams[0].Video.framesToEncode);
   }
   else if (StreamingMode(VDIParams) == STREAM_MODE_CONTINUOUS) {

      fp = fopen(VDIParams[0].Video.inputFilename, "rb");
      VDIParams[0].Video.framesToEncode = 0;  /* zero indicates continuous (indefinite) operation */
      memBuffer = (char*)malloc(MAX_MEM_BUFFER_SIZE*sizeof(char));
      timerInterval = 1000000/VDIParams[0].Streaming.frameRate;  /* set timer to frame rate in continuous mode */
   }

/* start loop for file encoding or continuous RTP streaming */

   do {

      #pragma cim libpaths "streamlib.h" "vdi.h" "video.h"
      #pragma cim parallel for num_threads 1 nowait
      for (i=0; i<numStreams; i++) {  /* multiple streams supported */

       /* taskc params: num cores, period (in usec), delay, start condition, repeat condition */

         #pragma cim taskc 2, 30000, 1, hostflg, bufrdy, noshare
         {
            ReadStream(mode_flag | (VID << 16), (unsigned char*)inputbuf[i], numFramesEncoded, STREAM_ENDPOINT_TARGETCPUMEM | STREAM_FORMAT_YUV);
            H264Encode(mode_flag, (unsigned char*)inputbuf[i], (unsigned char*)outputbuf[i], (VDIPARAMS*)&VDIParams[i]);
            WriteStream(mode_flag | (VID << 16), (unsigned char*)outputbuf[i], bytesToWrite, STREAM_ENDPOINT_NETWORK | STREAM_ENDPOINT_TARGETCPUMEM | STREAM_FORMAT_RTP | STREAM_CODEC_H264);
         }
      }

      if (IsTimerEventReady()) {

         for (i=0; i<numStreams; i++) {

            if (StreamingMode(VDIParams) == STREAM_MODE_CONTINUOUS) {  /* read stream data from source, write to target CPU core(s) */

               streamRead((HANDLE*)&fp, i, memBuffer, STREAM_MODE_CONTINUOUS | STREAM_ENDPOINT_FILE | STREAM_RESEEK_TO_START, numBytesPerFrame, 0);
               streamWrite((HANDLE)((uintptr_t)hCard), i, memBuffer, STREAM_MODE_CONTINUOUS | STREAM_ENDPOINT_TARGETCPUMEM | VDIParams[0].Streaming.bufferingMode, numBytesPerFrame, 0);
            }
         }

         UpdateStats(hCard, numStreams);  /* print host and target frame counters, other stats if needed */
      }

      RunOnce(numStreams, timerInterval);  /* one-time items, including timer start */

   } while (!(exitCode = ExitLoop(numStreams)));

/* save output .h264 or .yuv file if (i) in oneshot mode or (ii) 'S' (save) key command was given */

   SaveOutputFile(hCard, exitCode);

/* print some debug info */

   cimDebugPrint(hCard, CIM_DP_FORMAT_SAMELINE | CIM_DP_FORMAT_SHOWSYMADDR, "Target CPU testrun", "testrun", DS_RM_SIZE32, 1, nCoreList);
   cimDebugPrint(hCard, CIM_DP_FORMAT_SAMELINE | CIM_DP_FORMAT_SHOWSYMADDR, "Target CPU errorCode", "errorCode", DS_RM_SIZE32, 1, nCoreList);
   cimDebugPrint(hCard, CIM_DP_FORMAT_SAMELINE | CIM_DP_FORMAT_SHOWSYMADDR, "Target CPU numFramesEncoded", "numFramesEncoded", DS_RM_SIZE32, 1, nCoreList);

/* clean up and exit */

   if (fp) fclose(fp);
   if (memBuffer) free(memBuffer);

   return 0;
}

Hyperpiler -- How it Works

The key steps performed by the CIM® Hyperpiler™ are:

  1. Generate separate source code streams for x86 and c66x cores, based on CIM pragmas (which use OpenMP syntax)
  2. Augment generated host source code streams and c66x target source code streams with APIs required for run-time synchronization and data transfer
  3. Using automatically generated Makefiles, build the resulting co-dependent x86 and c66x executables (using gcc tools for x86 and TI command line tools for c66x)

After Hyperpiler processing and build are complete, the resulting x86 executable can be run normally, and accepts application specific command line options, along with optional command line options to specify c66x accelerator clock rate and core allocation. When this executable runs, it automatically downloads, runs, and synchronizes the co-dependent c66x executables.

In the convolution source example above, platform #ifdef's are used to determine which pragma keyword to use: "omp" for platform build tools and "cim" for the Hyperpiler. As noted above, these keywords can be intermixed as needed. The cim keyword supports both a subset and a superset of OpenMP syntax:

  • A relatively small subset of OpenMP syntax is supported. The primary focus is on loops and parallel/concurrent code sections (additional syntax support can be implemented based on customer requirements)
  • A taskc (continuous task) pragma is supported for real-time operation. Input and output can be included within a taskc pragma section, in which case fully autonomous, low-latency, real-time processing occurs on the designated c66x cores. There is not yet an OpenMP standard equivalent for this