
Porting GPP code to DSP and Codec Engine

From Texas Instruments Wiki
Introduction

The purpose of this project is to show how to port GPP code to the DSP side of DaVinci, OMAPL1 or OMAP35x platforms. This includes making the code XDAIS compliant, implementing an XDM interface, creating the RTSC package and running the code from a Linux application.

A JPEG decoder is used as an example to illustrate the steps. Source code for the implementation of the interfaces is provided. Source code for the algorithm itself is not yet available; however, the wrappers and wizards are the important part.

Note that steps 1 -> 6 are independent of the GPP operating system in an ARM+DSP environment, i.e. it does not matter whether you are using Linux, WinCE or another GPP OS. Step 7 does depend on the OS, because that is where you run standard GPP applications to exercise the result of steps 1 -> 6, and these applications differ across operating systems.

Step 1: Build, Run, Test Golden C code on GPP

The Golden C code should include the following:

1) Source Code Project to build the source code into a library

2) Application Code Project that includes a sample application to run the algorithm library. The project should include input vectors and reference output vectors.

The code should also be benchmarked on the GPP. The GPP performance will later be compared to the DSP performance.

In this example there is a test-harness and project in Microsoft Visual Studio. A Visual C++ Workspace (.dsw file) is available. We downloaded the Express (free) version of MSVC to run this.
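A GPP test harness along these lines can serve both purposes: checking the output against the reference vector and recording a baseline time. The following is a minimal sketch in portable C, not the harness shipped with the example. algo_under_test() is a hypothetical placeholder for the library entry point, and clock() measures CPU time at whatever resolution the host provides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for the algorithm under test: copies input to output. */
static void algo_under_test(const unsigned char *in, unsigned char *out, size_t n)
{
    memcpy(out, in, n);
}

/* Returns 1 when the produced output matches the reference vector byte for byte. */
static int matches_reference(const unsigned char *out, const unsigned char *ref, size_t n)
{
    assert(out != NULL && ref != NULL);
    return memcmp(out, ref, n) == 0;
}

/* Average CPU time in milliseconds over 'runs' calls, measured with clock(). */
static double benchmark_ms(const unsigned char *in, unsigned char *out, size_t n, int runs)
{
    assert(runs > 0);
    clock_t start = clock();
    for (int i = 0; i < runs; i++) {
        algo_under_test(in, out, n);
    }
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / runs;
}
```

Averaging over several runs smooths out timer granularity on a desktop OS; the same reference comparison can be reused unchanged once the code runs on the DSP.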


Step 2: Build, Run, Test Golden C code on DSP

In order to build and run the code on the DSP we used the Code Composer Studio IDE. A free trial version of CCS can be obtained from Free Evaluation Tools.

After building the code it is possible to run it on a hardware platform or on a simulator. Running the code on a hardware platform will require an emulator to connect CCS to the board. Information about how to connect CCS to an OMAP 3530 EVM can be found here.

A large variety of simulators are provided with CCS. Some simulate only the DSP core, others simulate the DSP core and some peripherals. The features supported are described in the CCS Setup window, when the simulator is selected (See more details next). The simulators are CPU and Memory cycle accurate, however for system benchmarking it is advised to run the code on a hardware platform.

CCS Setup

The following shows how to set up CCS to use the C64x+ Cycle Accurate Simulator.

  • Launch CCS Setup
  • Select Family: C64x+; Platform: simulator; Endianness: little
  • Selecting a simulator in the list provides the description of the features supported (right window).
  • Select C64x+ Cycle Accurate Simulator, click "<<Add" or drag it to the System Configuration window.
  • Save & Quit, then answer Yes to start Code Composer Studio


Gpptodsp01.jpg


CCS Projects

Code will be organized in the same way as on the GPP:

1) A CCS Project to create the library. In CCS this is called an 'archive'.

2) A CCS Project to run the algorithm library i.e. to create an executable.


Library CCS Project Creation

  • Start CCS
  • Project -> New
    • Define Project Name, Location (CCS will create a folder Location\Name. If this is not the desired final location, modify the path.)
    • Define Project Type: Library(.lib) will be used for the algorithm lib project.
    • Target will always be TMS320C64XX (for the heterogeneous platforms mentioned above)


Gpptodsp03.jpg


  • The CCS Project is created. Notice that there are two pre-defined configurations: Debug and Release


Gpptodsp04.jpg


  • Load Source Files to project: Right click project name -> Add Files to Project
  • Define Build Options: Project -> Build Options
    • Important Options: Compiler->Basic->Target Version: C64x+
    • Important Options: Compiler->Preprocessor->Include Search Path
    • Important Options: Compiler->Preprocessor->Pre-Define Symbol
    • Important Options: Archiver->Output Filename
  • Build the Debug library: Project->Build
  • Perform similar steps and build the Release library


Application CCS Project Creation

The following steps are specific to the application project.

  • Project Creation -> Project Type: Executable(.out)
  • Build Options: Project -> Build Options
    • Important Options: Linker->Basic - Define Linker Options (Heap Size, Stack Size ...)
    • Important Options: Linker->Libraries->Incl Libraries (.lib) - Include the Run Time Support Library rts64plus.lib. If the DSP Cache is used, include biosDM420.a64P. See Enabling 64x+ Cache.
    • Important Options: Linker->Libraries->Search Path
  • Add a linker command file to the project (Right click project name -> Add Files to Project). See next for more details about linker command files.
  • Make the Library CCS Project a Dependent Project (Right click on Dependent Projects -> Add Dependent Projects).


Linker Command File

A linker command file allows linking information to be kept in a file. The linker concatenates each section from all input files, allocating memory to each section based on its length and location as specified by the MEMORY and SECTIONS directives in the linker command file. The MEMORY directive defines the target memory configuration and the SECTIONS directive controls how sections are built and allocated.

Previously we have seen that Linker Options can be defined in CCS Build Options. They can also be defined in the linker command file.

/*******************************************************/
/*               Specify Linker Options                */
/*******************************************************/

-stack    0x2000     /* Primary stack size   */
-heap     0x1f0f000    /* Heap area size       */


The MEMORY directive in the following example defines a target memory system that has 64 megabytes (len = 0x4000000) at address 0x80000000. If the memory required by an application does not fit, the size of the target memory system can be increased as long as there is enough physical memory on the board.

/*******************************************************/
/*         Specify the Memory Configuration            */
/*******************************************************/

MEMORY
{
    /*target memory system*/
    DSPDATA        org = 0x80000000  len= 0x4000000
}

The SECTIONS directive gives great flexibility to control output sections. One of the purposes is to control where output sections are placed in memory. Read more about SECTIONS in the Linker chapter of TMS320C6000 Assembly Language Tools.


/*******************************************************/
/*             Specify the Output Sections             */
/*******************************************************/

SECTIONS
{
      .text :      >  DSPDATA
      .far :       >  DSPDATA
      .alignconst  > (DSPDATA align(20h))
      .jpd_table   >  DSPDATA
      .stack :     >  DSPDATA
      .sysmem :    >  DSPDATA
      .data :      >  DSPDATA
      .bss :       >  DSPDATA
      .cinit :     >  DSPDATA
      .cio :       >  DSPDATA
      .pinit :     >  DSPDATA
      .const :     >  DSPDATA
      .switch :    >  DSPDATA
      .bios:       >  DSPDATA

      .text:JPEGIDEC_TI_cSect > DSPDATA
      .const:JPEGIDEC_TI_dSect> DSPDATA
      ...

}


.text:JPEGIDEC_TI_cSect and .const:JPEGIDEC_TI_dSect are subsections of the .text and .const sections. Subsections can be allocated separately if needed. The following shows how to use the CODE_SECTION pragma directive, which allocates space for a specific symbol in the specified section. Read more about pragma directives in the TMS320C6000 Optimizing Compiler.


#pragma CODE_SECTION(JPEGDEC_TI_initDecoder, ".text:JPEGIDEC_TI_cSect");


/* ======================================================================== */
/* JPEGDEC_TI_initDecoder()                                                 */
/* This function initializes the JPEGDEC_TI_Obj elements                    */
/* This function needs to be called before every frame decoding             */
/*                                                                          */
/* ======================================================================== */

S16 JPEGDEC_TI_initDecoder( JPEGDEC_TI_Obj *jpegdec,
					JPEGDEC_TI_Input *pinterfaceIn)
{
    jpegdec->sof_marker = 0;
    jpegdec->sos_marker = 0;
    jpegdec->dht_marker = 0;

    ...
}

The following shows how to use the DATA_ALIGN and DATA_SECTION pragma directives.


#pragma DATA_ALIGN (JPEGDEC_TI_decNormOrderArray, 4);
#pragma DATA_SECTION(JPEGDEC_TI_decNormOrderArray, ".const:JPEGIDEC_TI_dSect");


const short JPEGDEC_TI_decNormOrderArray[] =
        {
          0,   1,   8,  16,   9,   2,   3,  10,
          17,  24,  32,  25,  18,  11,   4,   5,
          ...

        };

CCS Resources

  • Start CCS, Help -> Tutorial
  • See the CCS projects in the attached JPEGDEC_Wrapper.zip
  • See presentation Tips & Tricks


Step 3: Basic DSP Optimization, Profiling and Benchmarking

Now that the code runs on the DSP, we want to perform some basic optimizations and benchmark the code to make sure that porting the code to DSP will provide us with the expected performance.

Basic Optimization

At this point we will perform basic optimization. More advanced optimization methods will be discussed later.

Following are basic optimization steps:

  • Align data tables on 32-bit boundaries. See how to use pragma DATA_ALIGN above.
  • Use the DSP Cache:
    • Run code on hardware platform or simulator that supports the cache memory system.
    • See Application CCS Project Creation for CCS requirements when cache is used
    • Read Enabling 64x+ Cache
    • See how to use BCACHE APIs in attached JPEGDEC_Wrapper.zip, TestAppDecoder.c. It will show how to enable the MAR bits without DSP/BIOS.
  • Modify Compiler Options for Library Release configuration
    • Use following Basic Compiler Options:

Gpptodsp05.jpg

    • Use following Advanced Compiler Options:


Gpptodsp06.jpg

Profile Code on DSP using CCS

CCS has several profiling capabilities that can be used to understand the performance of the code (e.g. how many times each function is called, how long a function call takes ...). In order to fully understand how the profiler works please review the Tutorial provided in:

  • CCS Help->Tutorial->Application Code Tuning->Tools->Tuning Dashboard Tutorial

Here are the basic steps to start:

  • Open the CCS project and rebuild it with the Full Symbolic Debug (-g) option. (Optimization can still be enabled; however, full optimization will not be achieved because of the -g compiler option. See Debug Vs Optimization Tradeoff.)
  • Load Program
  • Profile->Setup, Select "Collect Application Level ..."
  • Activities Tab, Click on Clock to enable profiling (Enable/Disable Profiling)
  • Click on icon next to Clock to enable all functions.
  • The list of all the functions should appear. A function can be removed by selecting it and pressing the Space key.
  • Profile->Viewer
  • Run code (F5)
  • Save results

One must be aware of the following limitations when using the Profiler:

  • When using the Profiler on hardware, the cycle results are not accurate because the profiler uses the cache in the background.
  • When using the Profiler on a simulator, the cycle results are accurate when the code and data are placed in internal memory or cache. However, when external memory is used the cycle results may not be accurate.

The Profiler is very useful to determine how many times a function is called. Because of the above limitations, to measure the number of cycles in the optimized code it may be necessary to use the TSC register as described in the next section.

Benchmark Code on DSP using the Time Stamp Counter (TSC)

The Time Stamp Counter Registers (TSCL, TSCH) implement a 64-bit free running counter in the DSP. This counter can be used to count the number of cycles on the DSP. This is a very accurate method of benchmarking code on the DSP. For most applications the lower 32-bit (TSCL) counter is sufficient.

Slides 34-36 of C64p_cgt_optimization.pdf show how to use the Time Stamp Counter.

An example of the benchmarking technique is shown below:

#include <c6x.h>   // bring in references to TSCL, TSCH
#include <stdio.h> // bring in reference to printf
 
void main() {
   unsigned int t1, t2;
   ...
   TSCL = 0; // Initiate CPU timer by writing any val to TSCL
   ...
   t1 = TSCL; // benchmark snapshot of free-running ctr
   my_code_to_benchmark();
   t2 = TSCL; // benchmark snapshot of free-running ctr
 
   printf("# cycles == %u\n", (t2-t1));
}
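One detail worth spelling out is why a 32-bit snapshot is usually enough: unsigned subtraction is performed modulo 2^32, so the difference of two snapshots stays correct even if the counter wrapped once in between. The sketch below is host-side C with hypothetical helper names; it does not touch TSCL itself.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two 32-bit counter snapshots.
 * Unsigned arithmetic is modulo 2^32, so the result is still correct
 * if the counter wrapped at most once between start and end. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Longest interval, in whole milliseconds, that a 32-bit cycle counter
 * can measure without ambiguity at the given CPU clock. */
static uint32_t max_interval_ms(uint32_t cpu_mhz)
{
    assert(cpu_mhz > 0);
    return (uint32_t)((UINT64_C(1) << 32) / ((uint64_t)cpu_mhz * 1000u));
}
```

At 300 MHz the limit works out to roughly 14 seconds, so the full 64-bit value only becomes necessary for measurements longer than that.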

Another example of the benchmarking technique using the full 64-bit TSCL/TSCH value is shown here. Note that this is not protected from an interrupt routine that may read TSCL between the _itoll() reads of TSCL and TSCH.

#include <c6x.h>   // bring in references to TSCL, TSCH
#include <stdio.h> // bring in reference to printf
 
void main() {
   unsigned long long t1, t2;
   ...
   TSCL = 0; // Initiate CPU timer by writing any val to TSCL
   ...
   t1 = _itoll( TSCH, TSCL ); // benchmark snapshot of free-running ctr
   my_code_to_benchmark();
   t2 = _itoll( TSCH, TSCL ); // benchmark snapshot of free-running ctr
 
   printf("# cycles == %llu\n", (t2-t1));
}

In the attached JPEGDEC_Wrapper.zip, tsc_h.asm is provided. TSC_enable() and TSC_read() are implemented in assembly. These APIs are used as follows:

#include <c6x.h> // bring in references to TSCL, TSCH
 
void main() {
   ...
   TSC_enable(); // Initiate CPU timer by writing any val to TSCL
   ...
   t1 = TSC_read(); // benchmark snapshot of free-running ctr
   my_code_to_benchmark();
   t2 = TSC_read(); // benchmark snapshot of free-running ctr
 
   printf("# cycles == %d\n", (t2-t1));
}

These functions are only needed for true 64-bit benchmarks. The asm functions read the high and low 32 bits atomically, so that no interrupt taken between the two reads can skew the result.

Sample Results

Set Up Hardware Platform: OMAP3530 EVM (Samsung Memory) + XDS510USB

  • EVM Setup: Configure the OMAP board not to boot from Flash.
    • Set Switch SW4 as follows:

1=OFF 2=ON 3=ON 4=ON 5=OFF 6=OFF 7=OFF 8=OFF


Set up and connect to CCS

See steps on connecting to OMAP35x with CCS.


Benchmarking Environment

  • OMAP3530 running at 300MHz
  • Cache Configuration:
    • L1P = 32KB
    • L1D = 16KB
    • L2 = 64KB


Results


Optimization Level     Number of Cycles    Time (ms)
Debug + No Cache       2865979160          955
Debug + Cache          138761616           462
Release + No Cache     6983166             23


  • Note 1: Performance of the GPP code on a PC is in the 200ms range. However this is not an apples-to-apples comparison because an embedded GPP processor is different from a PC processor.
  • Note 2: Performance of a fully optimized JPEG decoder is around 12ms.
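A time in milliseconds follows from a cycle count once the CPU clock is known: at 300 MHz one millisecond is 300,000 cycles. The helper below is a small sketch of that arithmetic (the function name is hypothetical); for instance, 138761616 cycles at 300 MHz comes to about 462 ms.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to whole milliseconds for a CPU clock given in MHz.
 * One millisecond corresponds to cpu_mhz * 1000 cycles. */
static uint64_t cycles_to_ms(uint64_t cycles, uint32_t cpu_mhz)
{
    assert(cpu_mhz > 0);
    return cycles / ((uint64_t)cpu_mhz * 1000u);
}
```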

Step 4: Make the Code XDAIS Compliant

An excellent reference to begin with is XDAIS Docs -> Making DSP Algorithms Compliant with the TMS320 DSP Algorithm Standard.

XDAIS compliance is important because, as systems grow, consumers often add additional algorithms to their systems. If each algorithm allocates memory differently or they all have an external function named doIt then we quickly run into problems!

XDAIS Wizard

There is an XDAIS Wizard that can be used to generate the XDAIS wrappers. It is available at http://xdaiswiz.com for a low cost. This tool helps automate most of the XDAIS-wrapper and XDM-wrapper generation. It also generates the CCS library and application projects.

Algorithm Preparation

Before implementing the XDAIS interface, one needs to make sure that the implementation of the Algorithm obeys some rules that simply define good programming practices.

Here are some of the main rules. Details about these rules can be found in the TMS320 DSP Algorithm Standard User's Guide, Appendix A.

  • C-callable (Rule 1)
  • Re-entrant (Rule 2)
  • Relocatable (Rule 3,4)
  • Framework independent (Rule 9)
  • Hardware & I/O independent (Rule 6)
  • Able to run in any type of memory (Rule 27,33)
  • Naming conventions (Rule 8,10)
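To make the re-entrancy and naming rules concrete, the sketch below keeps all state in a caller-supplied instance object and prefixes every external name with a MODULE_VENDOR pattern. The module ACC and vendor ACME are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instance object: all state that must survive between calls. */
typedef struct ACC_ACME_Obj {
    long total;   /* running sum owned by this instance */
} ACC_ACME_Obj;

/* Names carry a MODULE_VENDOR_ prefix to avoid symbol clashes. */
static void ACC_ACME_init(ACC_ACME_Obj *obj)
{
    assert(obj != NULL);
    obj->total = 0;
}

/* Re-entrant: touches only its arguments, never a global or static variable. */
static long ACC_ACME_add(ACC_ACME_Obj *obj, const short *samples, size_t n)
{
    assert(obj != NULL && samples != NULL);
    for (size_t i = 0; i < n; i++) {
        obj->total += samples[i];
    }
    return obj->total;
}
```

Because nothing is shared, two instances can be interleaved (or preempt each other) without corrupting each other's results.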

Implementing the IALG interface

The IALG memory interface is fundamental to XDAIS. The premise is to divide software between components and system integration to provide optimal reuse partitioning, allowing:

  • System Integrator/framework: full control of system resources
  • Algorithm Author: to write components that can be used in any kind of system

Algorithms never ‘take’ memory directly:

  • Algos tell system its needs ( algNumAlloc(), algAlloc() )
  • Framework determines what memory to give/lend to algo (MEM_alloc() )
  • Framework tells algo what memories it may use ( algInit() )
  • Framework provides the memory typically via a malloc-like action
  • Algos may request internal or external RAM, but must function with either
  • Allows framework more control of system resources
  • Framework should note algo cycle performance can/will be affected
  • Algo authors can request memory as ‘scratch’ or ‘persistent’
    • Persistent : ownership of resource must persist during life of algo
    • Scratch : ownership of resource required only when algo is running

Defining the Private Instance Object

The private instance object structure contains the data which is not exposed to customers. The first field of this structure must be the IALG_Obj. The other members are algorithm specific. The instance object is defined in jpgdec_ti.h.

/*-------------------------------------------------------------------*/
/* The JPEG Object structure                                         */
/*-------------------------------------------------------------------*/
/*
 *  ======== JPEGDEC_TI_Obj ========
 */
typedef struct JPEGDEC_TI_Obj {
    IALG_Obj	alg;		/* MUST be first field of all JPEGDEC objs */
    unsigned int       *planar_buff;
    unsigned int       *intl_buff;
    unsigned int       *dct_buff;
    unsigned char      *data_str;
    unsigned char       *ptrCopyIntPersMem;
 
     . . .
 
} JPEGDEC_TI_Obj;
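The reason the generic object must come first is that C guarantees a pointer to a structure also points to its first member, so the framework's handle can be cast back to the private object. The sketch below shows the idea with simplified stand-in types (Demo_*); these are assumptions for illustration, not the real ialg.h definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins, NOT the real ialg.h definitions. */
typedef struct Demo_Fxns { int unused; } Demo_Fxns;
typedef struct Demo_AlgObj { const Demo_Fxns *fxns; } Demo_AlgObj;
typedef Demo_AlgObj *Demo_Handle;

typedef struct Demo_DecoderObj {
    Demo_AlgObj alg;      /* must stay first so the handle cast is valid */
    int         frames;   /* algorithm-specific private state */
} Demo_DecoderObj;

/* Recover the private object from the generic handle. */
static Demo_DecoderObj *decoder_from_handle(Demo_Handle h)
{
    assert(h != NULL);
    return (Demo_DecoderObj *)h;
}

/* Example method: update private state reached through the handle. */
static int decoder_count_frame(Demo_Handle h)
{
    Demo_DecoderObj *dec = decoder_from_handle(h);
    return ++dec->frames;
}
```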

IALG Interface

The IALG Interface is defined in ialg.h header file. This file can be found in recent XDAIS releases in xdais_X_YY/packages/ti/xdais/ialg.h (as well as being distributed with CCS in C:\CCStudio_v3.3\C6000\xdais\include\ialg.h). This interface is preserved in all versions of XDAIS releases. The IALG_Fxns structure defines the list of methods that must be implemented by all XDAIS algorithms.

  • algActivate() - initialize scratch memory buffers prior to processing
  • algAlloc() - get algorithm object's memory requirements.
  • algControl() - algorithm control and status.
  • algDeactivate() - save persistent data to non-scratch memory
  • algFree() - get algorithm object's initialized memory records.
  • algInit() - initialize algorithm's instance object
  • algMoved() - notify instance that instance memory has been relocated (optional, and often not implemented)
  • algNumAlloc() - number of memory allocations requests required
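The activate/deactivate pair is easiest to see with a small example: state lives in persistent memory between calls and is copied into scratch memory only while the algorithm runs. The following is a sketch with hypothetical Demo_* names and plain C buffers standing in for the memory the framework would grant.

```c
#include <assert.h>
#include <string.h>

#define HIST_LEN 4

/* Hypothetical instance object with one persistent and one scratch buffer. */
typedef struct Demo_FilterObj {
    short *persistHist;   /* persistent: owned for the life of the instance */
    short *scratchHist;   /* scratch: only valid between activate/deactivate */
} Demo_FilterObj;

/* Activate: load working state into scratch memory before processing. */
static void demo_activate(Demo_FilterObj *obj)
{
    assert(obj != NULL);
    memcpy(obj->scratchHist, obj->persistHist, HIST_LEN * sizeof(short));
}

/* Process: operate on the scratch copy only; newest sample goes in slot 0. */
static void demo_process(Demo_FilterObj *obj, short sample)
{
    memmove(obj->scratchHist + 1, obj->scratchHist, (HIST_LEN - 1) * sizeof(short));
    obj->scratchHist[0] = sample;
}

/* Deactivate: save state back, after which scratch may be reused by others. */
static void demo_deactivate(Demo_FilterObj *obj)
{
    assert(obj != NULL);
    memcpy(obj->persistHist, obj->scratchHist, HIST_LEN * sizeof(short));
}
```

After demo_deactivate() the scratch buffer can be handed to another algorithm; the next demo_activate() restores the history from the persistent copy.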

For the JPEG decoder, the implementations are provided in jpegdec_ti.c and the function table is defined in jpegdec_ti_vtab.c. For methods which are not implemented, NULL is passed in the vector table. Note that the IALG function table is extended with the JPEGDEC_TI_decode() and JPEGDEC_TI_control() functions. These functions are part of the XDM interface and will be discussed in more detail in the following sections.

/*
 *  ======== JPEGDEC_TI_IJPEGDEC ========
 *  This structure defines TI's implementation of the IJPEGDEC interface
 *  for the JPEGDEC_TI module.
 */
#define IALGFXNS                                                                \
    &JPEGDEC_TI_IALG,		    /* module ID */			        \
    JPEGDEC_TI_activate,	    /* activate */		                \
    JPEGDEC_TI_alloc,		    /* algAlloc */			        \
    NULL,			    /* control (NULL => no ctrl ops) */         \
    JPEGDEC_TI_deactivate,	    /* deactivate */		                \
    JPEGDEC_TI_free,		    /* free */			                \
    JPEGDEC_TI_initObj,		    /* init */				        \
    NULL,	    	            /* moved (NULL => not supported) */         \
    JPEGDEC_TI_numAlloc             /* numAlloc (NULL => IALG_DEFMEMRECS) */
 
#define IIMG_DEC_Fxns                                                           \
    IALGFXNS,                                                                   \
    JPEGDEC_TI_decode,                                                          \
    JPEGDEC_TI_control
 
IJPEGDEC_Fxns JPEGDEC_TI_IJPEGDEC = {
    /* module_vendor_interface */
    IIMG_DEC_Fxns
};

Algorithm Memory Requirements

The algorithm notifies the application how many memory blocks it needs through the algNumAlloc() API. For each memory block, the algorithm fills the following Memory Record Format through the algAlloc() API based on the static parameters provided in the call.

typedef struct IALG_MemRec {
    Uns             size;       /* size in MAU of allocation */
    Int             alignment;  /* alignment requirement (MAU) */
    IALG_MemSpace   space;      /* allocation space */
    IALG_MemAttrs   attrs;      /* memory attributes */
    Void            *base;      /* base address of allocated buf */
} IALG_MemRec;

These APIs are called in ALG_create() (Note that ALG_create() is not an IALG API. It is a "helper" API that invokes the IALG APIs). This topic explains the details of the various Framework Component modules (DSKT2) recommended for invoking XDAIS algorithms.

The algorithm fills out all the members of the structure except IALG_MemRec.base which will be filled by the application after it allocates the memory block. If the application allocates the memory successfully it will update IALG_MemRec.base.
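The negotiation can be sketched end to end on a host machine. The code below uses simplified stand-in types and functions (all Demo_*, demo_* and framework_* names are hypothetical, not the real IALG or DSKT2 API): the algorithm describes its blocks, and the framework allocates them and fills in base, undoing everything if any allocation fails.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for illustration, NOT the real ialg.h types. */
typedef enum { DEMO_SCRATCH, DEMO_PERSIST } Demo_MemAttrs;
typedef struct Demo_MemRec {
    size_t        size;       /* bytes requested by the algorithm */
    size_t        alignment;  /* requested alignment (not enforced here) */
    Demo_MemAttrs attrs;      /* scratch or persistent */
    void         *base;       /* filled in by the framework */
} Demo_MemRec;

#define DEMO_NUM_RECS 2

/* Algorithm side: how many blocks are needed. */
static int demo_numAlloc(void) { return DEMO_NUM_RECS; }

/* Algorithm side: describe each block; base is left for the framework. */
static int demo_alloc(int maxWidth, Demo_MemRec memTab[])
{
    assert(maxWidth > 0);
    memTab[0].size = sizeof(int);            /* instance object */
    memTab[0].alignment = sizeof(int);
    memTab[0].attrs = DEMO_PERSIST;
    memTab[0].base = NULL;
    memTab[1].size = (size_t)maxWidth * 3;   /* one RGB line buffer */
    memTab[1].alignment = 8;
    memTab[1].attrs = DEMO_SCRATCH;
    memTab[1].base = NULL;
    return DEMO_NUM_RECS;
}

/* Framework side: allocate every block, undo everything on failure. */
static int framework_create(int maxWidth, Demo_MemRec memTab[])
{
    int n = demo_alloc(maxWidth, memTab);
    assert(n <= demo_numAlloc());
    for (int i = 0; i < n; i++) {
        memTab[i].base = malloc(memTab[i].size);
        if (memTab[i].base == NULL) {
            while (i-- > 0) { free(memTab[i].base); memTab[i].base = NULL; }
            return -1;
        }
    }
    return n;   /* a real framework would now call the algorithm's init */
}

/* Framework side: release all blocks described in memTab. */
static void framework_delete(Demo_MemRec memTab[], int n)
{
    for (int i = 0; i < n; i++) { free(memTab[i].base); memTab[i].base = NULL; }
}
```

A real framework would also honor the alignment and memory-space requests; here malloc() is assumed to return memory that is aligned well enough for the demo.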

Testing XDAIS Compliance

The XDAIS Compliance can be tested using the QualiTI XDAIS Compliance Tool.

Qualiti ScreenShot043.jpg

This tool checks the compliance rules and generates a report. An algorithm that passes these tests is considered XDAIS compliant.

For any given rule failure, the QualiTI wiki topic and the tool itself tell you:

  • what went wrong
  • why it went wrong
  • possible solutions to fixing the problem.

Further XDAIS Examples

An example of an XDAIS FIR algorithm can be found in the XDAIS sample algorithm topic. This is useful because it is a complete, buildable example. Note however that it does not implement an XDM interface.


Step 5 : Make the Code XDM Compliant

An algorithm should always implement the latest version of XDM. See more information about XDM versioning.


XDM: An extension of XDAIS

XDM defines a set of IALG extensions which are required in order to use an algorithm with Codec Engine. These interfaces are defined for multimedia algorithms:

  • video encode/decode: IVIDENC(1), IVIDDEC(2)
  • imaging encode/decode: IIMGENC(1), IIMGDEC(1)
  • speech encode/decode: ISPHENC(1), ISPHDEC(1)
  • audio encode/decode: IAUDENC(1), IAUDDEC(1)
  • video analytics: IVIDANALYTICS
  • transcoding: IVIDTRANSCODE

Gpptodsp07.jpg


These interfaces are provided as part of the XDAIS package in \xdais_#_##\packages\ti\xdais\dm

An XDM interface uses the standard XDAIS APIs, except for algControl(), which is replaced by a new control() API. In addition, XDM defines a new process() API. An XDAIS algorithm must support the XDM parameters and implement the new APIs to become XDM compliant.

In the attached JPEGDEC_Wrapper.zip, the control() and process() APIs are called JPEGDEC_TI_control() and JPEGDEC_TI_decode() respectively.

Extending XDM

An algorithm that requires more functionality than a specific XDM interface provides can extend the interface. There are some constraints when using Codec Engine; see Extending data structures in xDM. In the same topic, read also about the perils of extending inArgs, outArgs.

Gpptodsp08.jpg

For example, let's consider imaging decode algorithms. The XDM IIMGDEC interface was designed for this class of algorithms.

The following creation parameters are supported for this interface (defined in \xdais_#_##\packages\ti\xdais\dm\iimgdec.h)

/**
 *  @brief      Defines the creation time parameters for
 *              all IIMGDEC instance objects.
 *
 *  @remarks    The application should set the parameters to 0 to use
 *              the algorithm's default values.
 *
 *  @extensibleStruct
 */
typedef struct IIMGDEC_Params {
    XDAS_Int32 size;            /**< @sizeField */
    XDAS_Int32 maxHeight;       /**< Maximum image height. */
    XDAS_Int32 maxWidth;        /**< Maximum image width. */
    XDAS_Int32 maxScans;        /**< Maximum number of scans. */
    XDAS_Int32 dataEndianness;  /**< Endianness of output data.
                                 *
                                 *   @sa    XDM_DataFormat
                                 */
    XDAS_Int32 forceChromaFormat;/**< @copydoc XDM_ChromaFormat
                                 *
                                 *   @sa XDM_ChromaFormat.
                                 */
}IIMGDEC_Params;

Let's assume we are implementing a JPEG decoder which requires some additional creation parameters. We would need to extend the IIMGDEC interface.

In the attached JPEGDEC_Wrapper.zip, see file JPEGDEC_Wrapper\AlgApi\Interface\ISA\C64x\include\ijpegdec.h

The following shows how IIMGDEC_Params is extended.

typedef struct IJPEGDEC_Params {
    IIMGDEC_Params imgdecParams;
    XDAS_Int32     progressiveDecFlag;
    XDAS_Int32     outImgRes;
    XDAS_UInt8     RGB_Output;
} IJPEGDEC_Params;

Note that the example above uses XDM 0.9, which is an older version of the interface. Current algorithms should implement XDM 1.x, which is the latest version of the interface. See more information about XDM versioning.
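The size field is what lets an algorithm tell base parameters from extended ones at run time: the caller sets it to the size of the structure actually passed, and the algorithm reads the extended fields only when the size says they are present. The sketch below uses simplified stand-in types (Demo_*, with int32_t in place of XDAS_Int32) and a hypothetical default value; it is not taken from the JPEG decoder sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for illustration, NOT the real iimgdec.h types. */
typedef struct Demo_BaseParams {
    int32_t size;        /* size of the structure actually passed */
    int32_t maxHeight;
    int32_t maxWidth;
} Demo_BaseParams;

typedef struct Demo_ExtParams {
    Demo_BaseParams base;          /* base structure must come first */
    int32_t progressiveDecFlag;    /* extended field */
} Demo_ExtParams;

/* Return the extended flag, or a default when only base params were passed. */
static int32_t demo_get_progressive(const Demo_BaseParams *params)
{
    assert(params != NULL);
    if (params->size >= (int32_t)sizeof(Demo_ExtParams)) {
        return ((const Demo_ExtParams *)params)->progressiveDecFlag;
    }
    return 0;   /* hypothetical default when the extension is absent */
}
```

An application that knows nothing about the extension can still create the algorithm by passing the base structure with size set to sizeof(Demo_BaseParams).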


If we have a GPP algorithm that belongs to one of these XDM categories, making the algorithm XDM compliant means implementing one of the XDM interfaces and possibly extending it. If the algorithm does not belong to any of these categories (e.g. Face Recognition, Barcode Scanner), then it would be easier to use the IUNIVERSAL interface instead of implementing one of the XDM interfaces. The IUNIVERSAL wiki topic also comes with complete sample code.


XDM Resources

  • For more information about XDM see XDM FAQ.
  • XDM codecs can be obtained from TI eStore

Step 6 : Create RTSC package and Unit Server for Codec Engine

At this point, we should have an XDM compliant library and the next goal is to invoke the algorithm from a Linux or WinCE application using Codec Engine and running on the ARM.

Codec Engine requires RTSC packages to understand how to run the DSP executable. The package includes information called meta-data that is used by the Codec Engine. Both the codec and the DSP server must be packaged in the RTSC format. This can be done using the RTSC Codec and Server Package Wizards. The Wizards run on both Linux and Windows.

Step 7 : Test the package, server with DMAI GPP sample app and DVTB

After building the Unit-Server package, a server executable will be generated. This server includes the codec, DSP/BIOS, and all other software modules required to run the codec on the DSP.

There are currently several Linux applications that can be used to test an XDM compliant codec.

  • DMAI provides sample applications for all the XDM interfaces. They are very useful to test XDM-compliant codecs and to understand the general flow of an application which uses XDM compliant codecs.
  • The DVTB is a Linux application that is used as a testbench to test XDM codecs. It does not support all the XDM interfaces.

WinCE applications will be available in the near future.

Step 8 : Advanced Optimization of the DSP Code

There are many strategies to improve the performance of code running on a DSP. However, all these strategies require that the original code is architected to use block processing. An algorithm that uses sample by sample (or macroblock by macroblock) processing will not take advantage of the DSP architecture and will probably have the same performance on the DSP and on the GPP. For such an algorithm, the architecture must be modified to use block processing before any further optimization is attempted.
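The difference between the two styles can be illustrated with a simple Q15 gain. The sketch below uses hypothetical function names: the first function is what a sample-by-sample design calls once per sample, while the second processes a whole buffer in one call with a simple counted loop inside.

```c
#include <assert.h>
#include <stddef.h>

/* Sample-by-sample style: one call per sample, so call overhead dominates
 * and the compiler sees no loop to optimize. */
static short scale_sample(short x, short gain_q15)
{
    return (short)(((int)x * gain_q15) >> 15);
}

/* Block style: one call per buffer, with a simple counted loop inside
 * that an optimizing DSP compiler has a chance to pipeline.
 * (Negative inputs rely on the compiler using an arithmetic right shift.) */
static void scale_block(const short *in, short *out, size_t n, short gain_q15)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i] = (short)(((int)in[i] * gain_q15) >> 15);
    }
}
```

Both produce the same results; only the second exposes the loop structure that the later optimization steps build on.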

Use Compiler Intrinsics to Access Assembly Language Statements

Compiler intrinsics allow C code to access certain assembly instructions through a function-like API. When using intrinsics, the compiled code uses specific assembly instructions, which increases the speed of the DSP code.

See TMS320C6000 Optimizing Compiler v6.1 section 7.5.4

The following excerpt from jpegdec_rgb.c shows how the unsigned _spacku4(int src1, int src2) intrinsic is used.

Y = _spacku4(0,*iDctInPtr_Y++);
if (even == uv_inc) V = _spacku4(0,*iDctInPtr_V++);
if (even == uv_inc) U = _spacku4(0,*iDctInPtr_U++);
if (even == uv_inc) even = 0;
even++;

It is possible to run the same intrinsics-based code on the GPP by using the C intrinsics host library, which implements the intrinsic functions in C. See Run Intrinsics Code Anywhere.

Note: this particular example code base was written before the above intrinsics library existed, hence it has a custom implementation. We advise customers to use the version linked above since it is maintained separately.
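To give a feel for what such a host-side implementation looks like, here is a sketch of a C model of _spacku4. It reflects our understanding of the instruction (four signed 16-bit values saturated to unsigned 8-bit and packed, with src1 supplying the two upper bytes) and should be checked against the C64x+ instruction set reference; host_spacku4() and unpack_byte() are hypothetical names, not part of the TI host library.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate a signed 16-bit value to the unsigned 8-bit range 0..255. */
static uint32_t sat_u8(int16_t v)
{
    if (v < 0)   return 0u;
    if (v > 255) return 255u;
    return (uint32_t)v;
}

/* Host-side model: src1 supplies the two upper result bytes, src2 the two
 * lower ones. The casts to int16_t assume the usual two's complement
 * wrap-around conversion. */
static uint32_t host_spacku4(int32_t src1, int32_t src2)
{
    uint32_t u1 = (uint32_t)src1, u2 = (uint32_t)src2;
    return (sat_u8((int16_t)(u1 >> 16))     << 24) |
           (sat_u8((int16_t)(u1 & 0xFFFFu)) << 16) |
           (sat_u8((int16_t)(u2 >> 16))     <<  8) |
            sat_u8((int16_t)(u2 & 0xFFFFu));
}

/* Extract one packed byte; lane 0 is the least significant byte. */
static uint32_t unpack_byte(uint32_t packed, int lane)
{
    assert(lane >= 0 && lane < 4);
    return (packed >> (8 * lane)) & 0xFFu;
}
```

In the excerpt above, passing 0 as the first argument means only the two lower bytes carry data: each call clamps two 16-bit IDCT outputs to the 0..255 pixel range.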

Inline Functions

For best performance, the DSP code should contain as few function calls as possible, especially inside loops. Inline functions wherever practical.
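A common way to do this in C is to define small helpers as static inline so the compiler can expand them at the call site. The sketch below uses hypothetical function names; whether the compiler actually inlines a given call still depends on the optimization options in use.

```c
#include <assert.h>

/* Small helper defined as 'static inline' so the compiler can expand it
 * at the call site instead of emitting a call. */
static inline int clamp_u8(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

/* Once the helper is inlined, the loop body contains no call, which
 * leaves the compiler free to schedule the whole loop. */
static void clamp_block(const int *in, unsigned char *out, int n)
{
    assert(in != 0 && out != 0 && n >= 0);
    for (int i = 0; i < n; i++) {
        out[i] = (unsigned char)clamp_u8(in[i]);
    }
}
```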

MUST_ITERATE Pragma

The compiler can schedule loops better if it has loop count information. Use the MUST_ITERATE pragma to provide it. See the example in Pragmas You Can Understand.
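As a brief sketch of how the pragma is placed, the function below promises the compiler that the loop runs at least 8 times and always a multiple of 8 times. The function name and the trip counts are assumptions for illustration. The pragma is guarded so the same file also builds with a host compiler, where it is simply left out.

```c
#include <assert.h>

/* Sum n samples; the caller guarantees n is a non-zero multiple of 8. */
static int sum_block(const short *in, int n)
{
    int sum = 0;
    int i;
    assert(in != 0 && n >= 8 && n % 8 == 0);
#ifdef __TI_COMPILER_VERSION__
    /* minimum trip count 8, no stated maximum, trip count a multiple of 8 */
    #pragma MUST_ITERATE(8, , 8);
#endif
    for (i = 0; i < n; i++) {
        sum += in[i];
    }
    return sum;
}
```

The promise must hold for every call: if the loop can ever run with a count that violates it, the generated code may be wrong.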

Use Assembly Optimized Functions from the DSPLIB and IMGLIB

The C64x+ DSPLIB and C64x+ IMGLIB provide a set of optimized functions commonly used in digital signal and imaging applications (FFT, FIR, IIR, IDCT ...). These functions can be integrated in the code to improve the performance.

Using the DMA Engine

The DMA can be used to offload memory copies from the DSP. Usually ping/pong buffers are used: the DSP works on buffer ping while data is copied to pong. The ping/pong buffers are usually allocated in fast internal memory.
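The control flow of a ping/pong scheme can be sketched on a host machine with memcpy() standing in for the DMA transfer. All names below are hypothetical, and the copy here is synchronous; on real hardware the transfer would run in the background and the code would have to wait for its completion before swapping buffers.

```c
#include <assert.h>
#include <string.h>

#define BLK 4   /* samples per block; hypothetical size */

/* Stand-in for starting a DMA transfer; a real system would return at once
 * and complete the copy in the background. */
static void fake_dma_copy(short *dst, const short *src)
{
    memcpy(dst, src, BLK * sizeof(short));
}

/* The "algorithm": sums one block held in a working buffer. */
static long process_block(const short *buf)
{
    long acc = 0;
    for (int i = 0; i < BLK; i++) acc += buf[i];
    return acc;
}

/* Sum 'nblocks' blocks from (slow) src using two (fast) working buffers. */
static long sum_with_ping_pong(const short *src, int nblocks)
{
    short ping[BLK], pong[BLK];
    short *work = ping, *fill = pong;
    long total = 0;
    assert(src != NULL && nblocks > 0);

    fake_dma_copy(work, src);                          /* prime first buffer */
    for (int b = 0; b < nblocks; b++) {
        if (b + 1 < nblocks) {
            fake_dma_copy(fill, src + (b + 1) * BLK);  /* fetch next block */
        }
        total += process_block(work);                  /* work on current */
        short *tmp = work; work = fill; fill = tmp;    /* swap roles */
    }
    return total;
}
```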

Further Optimizations

See TI C6000 Compiler Optimization Techniques

FAQ

What happens to global symbols in XDAIS?

All global symbols must comply with the XDAIS Rules (8,9) to avoid name space collisions. However, in many algorithms not all global symbols need to be exposed. It is possible to hide some of the global symbols using partial linking.

See Partial linking in CCS

Can I use malloc() in XDAIS?

XDAIS does not allow algorithms to allocate memory directly. This enables memory to be shared among several algorithms. The algorithm will use the IALG interface to notify the application about its memory needs. The application will try to allocate this memory and, if successful, will provide to the algorithm the pointer to it.

So, an XDAIS compliant algorithm should not use malloc().

How do I make C++ code XDAIS compatible?

The XDAIS interface is C based and there is no support for C++ XDAIS algorithms. However, XDAIS algorithms can be consumed in a C++ environment.

How do I use printf() with Codec Engine?

Codec Engine has a trace module that has similar functionality: TraceUtil

How will the performance be affected when I call my algorithm from the GPP through Codec Engine?

See Codec Engine Overhead

When I run my code on the DSP with CCS how can I follow which functions are being called?

Use the CCS Call Stack. In CCS, View->Call Stack. For best results, use the following Build Options:

  • Basic: Generate Debug Info - Full Symbolic Debug (-g)
  • Basic: Opt Level - None

Attachments

Media:JPEGDEC_Wrapper.zip