DirectCore

Overview

This wiki is part of a "cloud HPC" series showing how to use c66x co-CPU™ cards in commodity servers to achieve real-time, high capacity processing and analytics of multiple concurrent streams of media, signals and other data.

DirectCore® software is middleware in the cloud HPC software model (see diagram below), providing an interface between host CPU and co-CPU applications. DirectCore® can be used in two modes:

  • Transparent mode, for high-level Linux applications that abstract the underlying HPC hardware
  • Non-transparent mode, for applications that need close interaction with the hardware and c66x code

This wiki focuses on non-transparent mode, in which basic APIs are available to the programmer, including CPU initialization, reset, code download, c66x code symbol lookup, memory read/write, etc. The sections below describe DirectCore functionality and APIs, and give example source code.
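
For orientation, the condensed sketch below strings together the same APIs used in the full "minimum" example later on this page (cimGetCmdLine(), cimInitHardware(), DSLoadFileCores(), cimRunHardware(), cimCloseHardware()). Most error handling and all application-specific setup are omitted, so treat it as a sketch of the call sequence rather than a complete program.

#include <stdio.h>

#include "hwlib.h"    /* DirectCore API header file */
#include "cimlib.h"   /* CIM lib API header file */

int main(int argc, char *argv[]) {

   HCARD hCard;              /* card handle */
   CARDPARAMS CardParams;    /* card type, clock rate, executable file, core list (parsed from the command line) */
   QWORD nCoreList;          /* bitwise core list */

/* open/init/reset: parse command line, assign card handle, init and reset cores */

   if (!cimGetCmdLine(argc, argv, NULL, CIM_GCL_DEBUGPRINT, &CardParams, NULL)) return -1;
   if (!(hCard = cimInitHardware(CIM_IH_DEBUGPRINT, &CardParams))) return -1;

   nCoreList = CardParams.nCoreList;

/* load and run: download executable code, then run it on the cores in the core list */

   if (DSLoadFileCores(hCard, CardParams.szTargetExecutableFile, nCoreList))
      cimRunHardware(hCard, CIM_RH_DEBUGPRINT, &CardParams, NULL);

/* communicate: memory read/write, mailbox messages, etc (see the full example below) */

/* reset/close: reset cores and free the card handle */

   cimCloseHardware(hCard, CIM_CH_DEBUGPRINT, nCoreList, NULL);

   return 0;
}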

Other wikis in the cloud HPC series include:

Multiuser Operation

As expected of high-level Linux applications, DirectCore allows true multiuser operation, without time-slicing or batch jobs. Multiple host and VM instances can allocate and use c66x resources concurrently. How this works is described in the sections below (see DirectCore Host and Guest Drivers and DirectCore Libraries).

TI-RTOS and Bare Metal Support

DirectCore supports both TI-RTOS and bare metal applications.

Debug Capabilities

  • Both local reset and hard reset methods are supported. Hard reset can be used as a backup method when a c66x device has network I/O stuck in the "Tx descriptor stuck" state, or DDR3 memory tests are not passing
  • Core dumps and exception handling statistics can be logged and displayed. For bare metal applications, exception handling is provided separately from standard TI-RTOS functionality
  • Execution Trace Buffer (ETB) readout and display (note -- this capability is a work in progress)

Unified Core List

DirectCore merges all c66x cards in the system, regardless of the number of CPUs per card, and presents a unified list of cores. This is consistent with Linux multicore models and is required to support virtualization. Most DirectCore APIs accept a core list parameter (a bitwise list covering up to 64 cores), allowing API functionality to be applied to one or more cores as needed. Within one host or VM instance, for core lists longer than 64, multiple "logical card" handles can be opened using the DSAssignCard() API. A short sketch of core list construction is given below.
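
The sketch below shows, in plain C, how a bitwise core list spanning cores on one or more CPUs might be built and tested. It assumes QWORD is a 64-bit unsigned type, as in the example code later on this page; BuildCoreList() is an illustrative helper, not a DirectCore API.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t QWORD;  /* 64-bit bitwise core list, as used in the example code below (assumption) */

/* build a core list with nCores contiguous cores starting at nStartCore (illustrative helper, not a DirectCore API) */

QWORD BuildCoreList(int nStartCore, int nCores) {

   QWORD nCoreList = 0;
   int i;

   for (i = 0; i < nCores && (nStartCore + i) < 64; i++) nCoreList |= (QWORD)1 << (nStartCore + i);

   return nCoreList;
}

int main(void) {

   QWORD nCoreList = BuildCoreList(0, 8) | BuildCoreList(16, 4);  /* e.g. cores 0-7 on one CPU, cores 16-19 on another */

   printf("core list = 0x%llx, core 17 included: %s\n", (unsigned long long)nCoreList,
          (nCoreList & ((QWORD)1 << 17)) ? "yes" : "no");

   return 0;
}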

Symbol Lookup

DirectCore provides physical and logical addresses for c66x source code symbols (variables, structures, arrays, etc). A symbol lookup cache reduces overhead for commonly used symbols.
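
The exact symbol lookup API is not named on this page, so the fragment below uses DSGetSymbolAddress() purely as a hypothetical placeholder; "timer_count" is likewise an example symbol name. The DSReadMem() call and flags mirror the "minimum" example later on this page.

/* hypothetical fragment -- DSGetSymbolAddress() is a placeholder name, not a confirmed DirectCore API */

   DWORD dwCode_count_addr;
   int   timer_count;

   dwCode_count_addr = DSGetSymbolAddress(hCard, "timer_count");  /* look up target address of a c66x code symbol (placeholder call) */

   if (dwCode_count_addr != 0) {

   /* read the symbol's current value from target memory, as in the minimum example below */

      DSReadMem(hCard, DS_RM_LINEAR_PROGRAM | DS_RM_MASTERMODE, dwCode_count_addr, DS_GM_SIZE32, &timer_count, 1);
      printf("Timer count value = %d\n", timer_count);
   }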

DirectCore Host and Guest Drivers

DirectCore drivers interact with c66x PCIe cards from either host or guest (VM) instances. Host instances use a "physical" driver and VMs use virtIO "front end" drivers. Drivers are usually loaded at server boot, but can also be loaded and unloaded manually. Some notable driver capabilities:

  • c66x applications can share host memory with host applications, and can also share it between c66x cores
  • multiple PCIe lanes are used concurrently when applicable
  • both 32-bit and 64-bit BAR modes are supported

DirectCore Libraries

DirectCore libraries provide a high level API for applications, and are used identically by both host and guest (VM) instances. Some notes about DirectCore APIs:

  • DirectCore libraries abstract all c66x cores as a unified "pool" of cores, allowing multiple users / VM instances to share c66x resources, including NICs on the PCIe cards. This applies regardless of the number of PCIe cards installed in the server
  • Most APIs have a core list argument, allowing the API to operate on one or more cores simultaneously. This applies to target memory reads and writes; for reads, a "read multiple" mode is supported that reads each core's data into a separate offset in the specified host memory (see the sketch after this list)
  • APIs are fully concurrent between applications. The physical driver automatically maximizes PCIe bus bandwidth across multiple c66x CPUs
  • APIs are mostly synchronous; reading/writing target memory supports an asynchronous mode where subsequent reads/writes to the same core(s) will block if prior reads/writes have not completed
  • Mailbox APIs are supported, allowing asynchronous communication between host and target CPUs
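
The sketch below illustrates the per-core read pattern, under the assumption (taken from the minimum example later on this page) that DSSetCoreList() selects the core a subsequent DSReadMem() applies to. The flag names and buffer layout follow that example; the loop itself is illustrative, not the library's built-in "read multiple" mode, and ReadAllCores() is not a DirectCore API.

/* illustrative per-core read -- assumes DSSetCoreList() + DSReadMem() behave as in the minimum example below */

void ReadAllCores(HCARD hCard, QWORD nCoreList, DWORD dwBufferBaseAddr, DWORD* data, int nWordsPerCore) {

   int nCore, nSlot = 0;

   for (nCore = 0; nCore < 64; nCore++) {

      if (!(nCoreList & ((QWORD)1 << nCore))) continue;  /* skip cores not in the core list */

      DSSetCoreList(hCard, (QWORD)1 << nCore);  /* select one core */

   /* read this core's buffer into its own offset in host memory */

      DSReadMem(hCard, DS_RM_LINEAR_DATA | DS_RM_MASTERMODE, dwBufferBaseAddr, DS_GM_SIZE32, &data[nSlot*nWordsPerCore], nWordsPerCore);

      nSlot++;
   }
}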

Software Model

Below is a diagram showing where DirectCore libs and drivers fit in the cloud HPC software architecture for c66x CPUs.

HPC software model diagram

Some notes about the above diagram:

  • Application complexity increases from left to right (command line, open source library APIs, user code APIs, heterogeneous programming)
  • All application types can run concurrently in host or VM instances (see below for VM configuration)
  • c66x CPUs can make direct DMA access to host memory, facilitating use of DPDK. The host memory DMA capability is also used to share data between c66x CPUs, for example in an application such as H.265 (HEVC) encoding, where tens of cores must work concurrently on the same data set
  • c66x CPUs are connected directly to the network. Received packets are filtered by UDP port and distributed to c66x cores at wire speed

Minimum Host Application Source Code Example

For non-transparent, or "hands on," mode, the source example below uses a minimal set of DirectCore APIs to make a working program (a "Hello World" program of sorts). From a code flow perspective, the basic sequence is:

 open			assign card handles
 init
 reset
 load			download executable code
 run
 communicate		read/write memory, mailbox messages, etc.
 reset
 close			free card handles

In the source code example below, some things to look for:

  • obtaining a card handle (hCard in the source code)
  • DSLoadFileCores() API, which downloads executable files (produced by TI build tools) to one or more c66x cores. Different cores can run different executables
  • cimRunHardware() API, which runs code on one or more c66x cores, including the host/target synchronization required to sync values of shared-memory C code variables and buffers
  • use of a "core list" parameter in most APIs. The core list can span multiple CPUs
/*
   $Header: /root/Signalogic/DirectCore/apps/SigC641x_C667x/boardTest/cardTest.c
 
   Purpose:
 
     Minimum application example showing use of DirectCore APIs
 
   Description:
 
     host test program using DirectCore APIs and SigC66xx co-CPU hardware
 
   Copyright (C) Signalogic Inc. 2014-2016
 
   Revision History
 
     Created Nov 2014 AKM
     Modified Jan 2016, JHB.  Simplify for web page presentation (remove declarations, local functions, etc).  Make easier to read
 
*/
 
#include <stdio.h>
#include <sys/socket.h>
#include <limits.h>
#include <unistd.h>
#include <sys/time.h>
 
/* Signalogic header files */
 
#include "hwlib.h"    /* DirectCore API header file */
#include "cimlib.h"   /* CIM lib API header file */
 
/* following header files required depending on application type */
 
#include "test_programs.h"
#include "keybd.h"
 
/* following shared host/target CPU header files required depending on app type */
 
#ifdef APP_SPECIFIC
  #include "streamlib.h"
  #include "video.h"
  #include "ia.h"
#endif
 
/* Global vars */
 
QWORD        nCoreList = 0;   /* bitwise core list, usually given in command line */
bool         fCoresLoaded = false;
 
/* Local function prototypes */
 
int GetDisplayCore(QWORD nCoreList);
 
/* Start of main() */
 
int main(int argc, char *argv[]) {
 
HCARD       hCard = (HCARD)NULL;  /* handle to card.  Note that multiple card handles can be opened */
CARDPARAMS  CardParams;
int         nDisplayCore, timer_count, ch;  /* ch reads keyboard input; other per-core vars used below are declared in the full source (see header notes) */
DWORD       data[120];
 
/* Display program header */
 
   printf("DirectCore minimum API example for C66x host and VM accelerators, Rev 2.1, Copyright (C) Signalogic 2015-2016\n");
 
/* Process command line for basic target CPU items:  card type, clock rate, executable file */
 
   if (!cimGetCmdLine(argc, argv, NULL, CIM_GCL_DEBUGPRINT, &CardParams, NULL)) exit(EXIT_FAILURE);
 
/* Display card info */
 
   printf("Co-CPU card info: %s-%2.1fGHz, target executable file %s\n", CardParams.szCardDescription, CardParams.nClockRate/1e9, CardParams.szTargetExecutableFile);
 
/* Assign card handle, init cores, reset cores */
 
   if (!(hCard = cimInitHardware(CIM_IH_DEBUGPRINT, &CardParams))) {  /* use CIM_IH_DEBUGPRINT flag so cimInitHardware will print error messages, if any */
 
      printf("cimInitHardware failed\n");
      exit(EXIT_FAILURE);
   }
 
   nCoreList = CardParams.nCoreList;
 
/*
   If application specific items are being used, process the command line again using flags and
   structs as listed below (note -- this example gives NULL for the application specific struct)
 
   App                 Flag          Struct Argument (prefix with &)
   ---                 ----          -------------------------------
 
   VDI                 CIM_GCL_VDI   VDIParams
   Image Analytics     CIM_GCL_IA    IAParams
   Media Transcoding   CIM_GCL_MED   MediaParams
   Video               CIM_GCL_VID   VideoParams
   FFT                 CIM_GCL_FFT   FFTParams
*/
 
   if (!cimGetCmdLine(argc, argv, NULL, CIM_GCL_DEBUGPRINT, &CardParams, NULL)) goto cleanup;
 
/* Load executable file(s) to target CPU(s) */
 
   printf("Loading executable file %s to target CPU corelist 0x%lx\n", CardParams.szTargetExecutableFile, nCoreList);
 
   if (!(fCoresLoaded = DSLoadFileCores(hCard, CardParams.szTargetExecutableFile, nCoreList))) {
 
      printf("DSLoadFileCores failed\n");
      goto cleanup;
   }
 
/* Run target CPU hardware.  If required, give application type flag and pointer to application property struct, as noted in comments above */
 
   if (!cimRunHardware(hCard, CIM_RH_DEBUGPRINT | (CardParams.enableNetIO ? CIM_RH_ENABLENETIO : 0), &CardParams, NULL)) {
 
      printf("cimRunHardware failed\n");  /* use CIM_RH_DEBUGPRINT flag so cimRunHardware will print any error messages */
      goto cleanup;
   }
 
   nDisplayCore = GetDisplayCore(nCoreList);
   DSSetCoreList(hCard, (QWORD)1 << nDisplayCore);
 
   printf("Core list used for results display = 0x%llx\n", (unsigned long long)((QWORD)1 << nDisplayCore));
 
 
/* Start data acquisition and display using RTAF components */
 
   setTimerInterval((time_t)0, (time_t)1000);
   printf("Timer running...\n");
 
   while (1) {  /* we poll with IsTimerEventReady(), and use timer events to wake up and check target CPU buffer ready status */
 
      ch = getkey();  /* look for interactive keyboard commands */
 
      if (ch >= '0' && ch <= '9') {
 
         nDisplayCore = ch - '0';
 
         DSSetCoreList(hCard, (QWORD)1 << nDisplayCore);
      }
      else if (ch == 'Q' || ch == 'q' || ch == ESC) goto cleanup;
 
      if (IsTimerEventReady()) {
 
      /* check to see if next data buffer is available */
 
         if ((new_targetbufnum[nDisplayCore] = DSGetProperty(hCard, DSPROP_BUFNUM)) != targetbufnum[nDisplayCore]) {
 
            targetbufnum[nDisplayCore] = new_targetbufnum[nDisplayCore];  /* update local copy of target buffer number */
 
            printf("Got data for core %d... count[%d] = %d\n", nDisplayCore, nDisplayCore, count[nDisplayCore]++);
 
            if (dwCode_count_addr != 0) {
 
               DSReadMem(hCard, DS_RM_LINEAR_PROGRAM | DS_RM_MASTERMODE, dwCode_count_addr, DS_GM_SIZE32, &timer_count, 1);
               printf("Timer count value = %d\n", timer_count);
            }
 
         /* read data from target CPUs, display */
 
            if (DSReadMem(hCard, DS_RM_LINEAR_DATA | DS_RM_MASTERMODE, dwBufferBaseAddr + nBufLen * 4 * hostbufnum[nDisplayCore], DS_GM_SIZE32, (DWORD*)&data, sizeof(data)/sizeof(DWORD))) {
 
               hostbufnum[nDisplayCore] ^= 1;  /* toggle buffer number, let host know */
 
               DSSetProperty(hCard, DSPROP_HOSTBUFNUM, hostbufnum[nDisplayCore]);
 
               for (int i=0; i<120; i+=12) {
 
                  for (int j=0; j<12; j++) printf("0x%08x ", data[i+j]); printf("\n");
               }
            }
         }
      }
   }
 
cleanup:
 
   if (fCoresLoaded) SaveC66xLog(hCard);
 
   printf("Program and hardware cleanup, hCard = %d\n", hCard);
 
/* Hardware cleanup */
 
   if (hCard) cimCloseHardware(hCard, CIM_CH_DEBUGPRINT, nCoreList, NULL);
}
 
 
/* Local functions */
 
int GetDisplayCore(QWORD nCoreList) {
 
int nDisplayCore = 0;
 
   do {
 
      if (nCoreList & 1) break;
      nDisplayCore++;
 
   } while (nCoreList >>= 1);
 
   return nDisplayCore;
}

Mailbox Create Examples

Send and receive mailbox creation API examples are shown below. "Send" means host CPU cores sending mail to target CPU cores, and "receive" means host CPU cores receiving messages from target CPU cores.

/* Allocate send mailbox handle (send = transmit, or tx) */
 
   if (tx_mailbox_handle[node] == NULL) {
 
      tx_mailbox_handle[node] = malloc(sizeof(mailBoxInst_t));
 
      if (tx_mailbox_handle[node] == NULL) {
         printf("Failed to allocate Tx mailbox memory for node = %d\n", node);
         return -1;
      }
   }
 
/* Create send mailbox */
 
   mailBox_config.mem_start_addr = host2dspmailbox + (nCore * TRANS_PER_MAILBOX_MEM_SIZE);
   mailBox_config.mem_size = TRANS_PER_MAILBOX_MEM_SIZE;
   mailBox_config.max_payload_size = TRANS_MAILBOX_MAX_PAYLOAD_SIZE;
 
   if (DSMailboxCreate(hCard, tx_mailbox_handle[node], MAILBOX_MEMORY_LOCATION_REMOTE, MAILBOX_DIRECTION_SEND, &mailBox_config, (QWORD)1 << nCore) != 0) {
 
      printf("Tx DSMailboxCreate() failed for node: %d\n", node);
      return -1;
   }
 
/* Allocate receive mailbox handle (receive = rx) */
 
   if (rx_mailbox_handle[node] == NULL) {
 
      rx_mailbox_handle[node] = malloc(sizeof(mailBoxInst_t));
 
      if (rx_mailbox_handle[node] == NULL) {
         printf("Failed to allocate Tx mailbox memory for node = %d\n", node);
         return -1;
      }
   }
 
/* Create receive mailbox */
 
   mailBox_config.mem_start_addr = dsp2hostmailbox + (nCore * TRANS_PER_MAILBOX_MEM_SIZE);
 
   if (DSMailboxCreate(hCard, rx_mailbox_handle[node], MAILBOX_MEMORY_LOCATION_REMOTE, MAILBOX_DIRECTION_RECEIVE, &mailBox_config, (QWORD)1 << nCore) != 0) {
 
      printf("Rx DSMailboxCreate() failed for node: %d\n", node);
      return -1;
   }
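
This page does not show an API for actually sending a message on the tx mailbox created above, so the fragment below uses a hypothetical DSWriteMailbox() call, mirroring the DSReadMailbox() argument pattern in the next section, purely to illustrate the flow; DS_CMD_CREATE_SESSION is likewise an assumed command type (only the _ACK types appear on this page).

/* hypothetical fragment -- DSWriteMailbox() and DS_CMD_CREATE_SESSION are placeholder names, not confirmed DirectCore definitions */

   struct cmd_hdr header_out;
   unsigned char  tx_buffer[TRANS_MAILBOX_MAX_PAYLOAD_SIZE];
   unsigned int   size, trans_id = 0;

   header_out.type = DS_CMD_CREATE_SESSION;                /* assumed command type */
   memcpy(tx_buffer, &header_out, sizeof(struct cmd_hdr));
   size = sizeof(struct cmd_hdr);

   if (DSWriteMailbox(hCard, tx_buffer, size, trans_id, (QWORD)1 << nCore) != 0) {  /* placeholder call */

      printf("Tx mailbox write failed for core %d\n", nCore);
      return -1;
   }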

Mailbox Query and Read Examples

Source code excerpts with mailbox query and read API examples are shown below. Some code has been removed for clarity. These examples process "session" actions, for example in a media transcoding application. Other application examples include streams (video), nodes (analytics), etc.

/* query and read mailboxes on all active cores */
 
   nCore = 0;
 
   do {
 
      if (nCoreList & ((QWORD)1 << nCore)) {
 
         num_msgs = DSQueryMailbox(hCard, (QWORD)1 << nCore);  /* number of pending messages for this core, < 0 on error */
 
         if (num_msgs < 0) {
 
            fprintf(mailbox_out, "mailBox_query error: %d\n", num_msgs);
            num_msgs = 0;  /* skip the read loop for this core and move on to the next one */
         }
 
         while (num_msgs-- > 0) {
 
            ret_val = DSReadMailbox(hCard, rx_buffer, &size, &trans_id, (QWORD)1 << nCore);
 
            if (ret_val < 0) {
 
               fprintf(mailbox_out, "mailBox_read error: %d\n", ret_val);
               continue;
            }
 
            memcpy(&header_in, rx_buffer, sizeof(struct cmd_hdr));
 
            if (header_in.type == DS_CMD_CREATE_SESSION_ACK) {
               :
               :
            }
            else if (header_in.type == DS_CMD_DELETE_SESSION_ACK) {
               :
               :
            }
            else if (header_in.type == DS_CMD_EVENT_INDICATION) {
               :
               :
            }
         }
      }
 
      nCore++;
 
   } while (nCore < 64 && (nCoreList >> nCore));  /* continue while set bits remain at or above nCore */

CardParams Struct

The CARDPARAMS struct (the type of the CardParams variable used in the "minimum" source code example above) is given here.

typedef struct {
 
/* from command line */
 
  char          szCardDesignator[CMDOPT_MAX_INPUT_LEN];
  char          szTargetExecutableFile[CMDOPT_MAX_INPUT_LEN];
  unsigned int  nClockRate;
  QWORD         nCoreList;
 
/* derived from command line entries */
 
  char          szCardDescription[CMDOPT_MAX_INPUT_LEN];
  unsigned int  maxCoresPerCPU;
  unsigned int  maxCPUsPerCard;
  unsigned int  maxActiveCoresPerCard;
 
  unsigned int  numActiveCPUs;   /* total number of currently active CPUs (note: not max CPUs, but CPUs currently in use) */
  unsigned int  numActiveCores;  /* total number of currently active cores (note: not max cores, but cores currently in use) */
 
  bool          enableNetIO;     /* set if command line params indicate that network I/O is needed.  Various application-specific params are checked */
 
  WORD          wCardClass;
  unsigned int  uTestMode;
  bool          enableTalker;    /* not used for c66x hardware */
 
  CIMINFO       cimInfo[MAXCPUSPERCARD];
 
} CARDPARAMS;  /* common target CPU and card params */
 
typedef CARDPARAMS* PCARDPARAMS;
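
As a brief illustration, the fragment below prints some of the fields above after cimInitHardware() has filled them in; the field names come from the struct definition, while the assumption that all of them are populated at that point is not confirmed on this page.

/* illustrative use of CARDPARAMS fields (field names from the struct above; exact population timing is an assumption) */

   printf("Card: %s, clock rate = %2.1f GHz\n", CardParams.szCardDescription, CardParams.nClockRate/1e9);

   printf("Active CPUs = %u, active cores = %u (max %u cores/CPU, %u CPUs/card)\n",
          CardParams.numActiveCPUs, CardParams.numActiveCores, CardParams.maxCoresPerCPU, CardParams.maxCPUsPerCard);

   printf("Core list from command line = 0x%llx\n", (unsigned long long)CardParams.nCoreList);

   if (CardParams.enableNetIO) printf("Network I/O will be enabled\n");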

Installing / Configuring VMs

Below is a screen capture showing VM configuration for c66x co-CPU cards, using the Ubuntu Virtual Machine Manager (VMM) user interface:

VMM dialog showing VM configuration for c66x co-CPU cards

c66x core allocation is transparent to the number of PCIe cards installed in the system; just like installing memory DIMMs of different sizes, c66x cards can be mixed and matched.