CodecEngineCodeOverlays

This page shows how to use code overlays to improve performance in Codec Engine and DVSDK applications. It applies to codec writers and to combo/server producers and system integrators.

Motivation

Some third parties / ASPs want a 'mode' in which they can get the best performance out of their codec by overlaying the critical algorithm code into the fastest on-chip memory at runtime. This usually means an EDMA copy from external DDR2 memory to L1PSRAM, the fast zero-wait-state on-chip memory. This bypasses the cache and is therefore theoretically the fastest method, short of hardcoding the allocation directly into internal memory.

Methodology

This absolutely must be a 'mode', i.e. a configurable parameter. Requiring the system integrator to execute the overlays makes integration very difficult in the presence of tens of codecs. Hence this topic shows (a) how to do the overlays and (b) how the client can flip between standard non-overlays (code in DDR2 that gets cached) and advanced overlays without modifications to the codec.

Codec modifications

The following code is used in the G711DEC c64P example attached to this topic.

/*
 * The following code is conditionally overlaid, typically from
 * DDR2 -> IRAM or L1PSRAM. As it stands, we use a memcpy() i.e.
 * a CPU copy. That limits us to doing DDR2 -> IRAM (L2) since
 * only EDMA can be used for external memory -> L1P transfers.
 *
 * NOTE - if you update this code to use EDMA then *PLEASE
 * DOCUMENT THE EDMA RESOURCES USED!!!! E.G. TRANSFER COMPLETION
 * CODES USED ETC*. Hardcoding TCCs leads to near-impossible debug
 * scenarios in the presence of many other codecs.
 *
 * If doing a CPU copy from DDR2 -> L2 we need to
 * invalidate L1P. c64+ has no snooping between L2 & L1P. With CPU copy
 * there could be an issue with line sharing, and we additionally need to issue
 * an L1D writeback. Normally, the CPU reads the code and writes it to L2 SRAM
 * through the write buffer. However, if the line written to is for some reason
 * in L1D cache, code gets stuck in L1D where the CPU doesn't see it. A line
 * could be in L1D cache because of line sharing across code/data section boundaries
 * (since L1D line size is 64-byte and section alignment is 32-byte).
 * EDMA copies from DDR2 -> L1PSRAM don't need a BCACHE_wbInv but we do it
 * harmlessly (wastes just a few cycles) anyway to avoid nasty
 * memory-type-conditional code.
 *
 * If the overlays are EDMA'ed into L2 SRAM, there's no issue with line sharing
 * since only L1P cache needs to be invalidated, i.e. data is not affected
 * here at all. Even if code and data share a 64-byte L1D cache line there's
 * still no problem (one half of the line could be code and the other could
 * be data): If this line is in L1D Cache then at the time of the DMA write
 * the line is snooped (updated in L1D cache), bringing the new code in
 * L1D cache. I.e. at the time of a dirty eviction the line won't corrupt
 * the new code with old code.
 *
 * Another problem is that BCACHE calls are not part of the allowed APIs in
 * the XDAIS spec. This will be considered in a future XDAIS spec update.
 *
 * If you do *NOT* want to do overlays simply remove the run = X
 * allocation in the linker contribution. In this case the load & run
 * addresses will be the same, and no copy/cache-invalidate will occur.
 */
if (&G711DEC_SUN_runFunc_load_addr != &G711DEC_SUN_runFunc_run_addr) {
    memcpy(&G711DEC_SUN_runFunc_run_addr, &G711DEC_SUN_runFunc_load_addr, (int)&G711DEC_SUN_runFunc_load_size);
    BCACHE_wbInv(&G711DEC_SUN_runFunc_run_addr, (int)&G711DEC_SUN_runFunc_load_size, TRUE);
}
G711DEC_SUN_runFunc(g711Dec, in, out, frameLen);

The comments say it all. In this example we use memcpy() since the copy is just from DDR2 -> IRAM (L2 memory). L2 is half the speed of L1. If you need to overlay to L1 then the c64P architecture requires that you use EDMA; see Table B-2 of the c64P Cache User's Guide. If you elect to do this, PLEASE thoroughly document the resources used by the EDMA transaction, for example the transfer completion code (TCC) numbers. Failure to document the resources used could (honestly!) lead to, e.g., an audio driver on the ARM side of the DM6446 suddenly failing! Since EDMA resources are shared globally, any 'hardcoding' can have disastrous side effects!

With a CPU copy from DDR2 -> IRAM, the cache write-back-invalidate is always required. If this step is skipped, a call to the newly copied routine executes whatever stale code remains in the cache from before, not what is now in L2 IRAM.
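The line-sharing hazard described in the code comment comes down to simple address arithmetic: a section whose start or end does not fall on a 64-byte boundary can share an L1D line with its neighbour. The following is a minimal sketch of that check, not code from the attached example. The helper name `region_may_share_l1d_line` is hypothetical, and the 64-byte line size is taken from the comment above.

```c
#include <stddef.h>
#include <stdint.h>

/* L1D line size as quoted in the codec comment above (64 bytes). */
#define L1D_LINE_SIZE 64u

/*
 * Hypothetical helper (not part of BIOS, XDAIS or the attached example).
 * Returns 1 if the region [start, start + size) begins or ends part-way
 * through an L1D cache line, so that a neighbouring section could share
 * a line with it. Returns 0 if both ends sit on line boundaries.
 */
static int region_may_share_l1d_line(uintptr_t start, size_t size)
{
    uintptr_t end = start + size;

    return (start % L1D_LINE_SIZE != 0u) || (end % L1D_LINE_SIZE != 0u);
}
```

A section that is only 32-byte aligned, e.g. one starting at an address ending in 0x20, is flagged by this check, which is the case where the L1D write-back in the codec example matters. Aligning and padding the overlaid section to 64 bytes would be one way to rule the hazard out, at the cost of a little memory.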

Codec linker cmd file modifications

The following code is used in the G711DEC c64P example attached to this topic.


/* link_overlays.xdt */

/* guarantee that the load, run, size symbols we need are linked in to the app */
-u _G711DEC_SUN_runFunc_load_addr
-u _G711DEC_SUN_runFunc_run_addr
-u _G711DEC_SUN_runFunc_load_size

SECTIONS
{

   /* overlay this code - let the sys integrator decide src & dst memories */
   .text:G711DEC_SUN_runFunc {
   } load = `this.G711DEC.overlaysLoadCodeSection`, run = `this.G711DEC.overlaysRunCodeSection`,
                           LOAD_START(_G711DEC_SUN_runFunc_load_addr),
                           RUN_START (_G711DEC_SUN_runFunc_run_addr),
                           LOAD_SIZE (_G711DEC_SUN_runFunc_load_size)

}

This is a standard overlay with TI Code Generation Tools.

The -u introduces an undefined symbol into the application. In layman's terms it ensures that these symbols are always linked into the application regardless of whether an application uses them or not.

The load and run syntax is the way to say "load the code in memory X, but execute it from memory Y assuming somebody copies the code from X to Y at runtime".

The `this.G711DEC.overlaysLoadCodeSection` syntax is a neat way in RTSC/XDC tooling to say "let the system integrator decide what that memory name should be so that I don't need to hardcode it in my platform-agnostic codec".

How do we handle the case of "my client decided not to use overlays, so how can I make sure I don't do the memcpy and BCACHE_wbInv in this case?" Well, it's not pretty! We use linker tricks to do this. Here is the non-overlays link.xdt file:


/* link.xdt */

/* guarantee that the load, run, size symbols we need are linked in to the app */
-u _G711DEC_SUN_runFunc_load_addr
-u _G711DEC_SUN_runFunc_run_addr
-u _G711DEC_SUN_runFunc_load_size

SECTIONS
{

   .text:G711DEC_SUN_runFunc {
   } load = `this.G711DEC.overlaysLoadCodeSection`,
                           LOAD_START(_G711DEC_SUN_runFunc_load_addr),
                           RUN_START (_G711DEC_SUN_runFunc_run_addr),
                           LOAD_SIZE (_G711DEC_SUN_runFunc_load_size)

}

It's identical to the overlays case except that we don't have a run address, so the load address equals the run address. As you can see in the Codec modifications section, we then use that to perform the following check in the algorithm:

if (&G711DEC_SUN_runFunc_load_addr != &G711DEC_SUN_runFunc_run_addr) {

If the load & run addresses are identical we don't do the overlay (or the BCACHE call).
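The same decision can be factored into a small helper that is easy to exercise on a host PC. The following is a minimal sketch under stated assumptions, not the code shipped in the attachment: `overlay_load` and `overlay_cache_fn` are hypothetical names, and the cache maintenance step is passed in as a callback so that the DSP build can supply a wrapper around a call such as BCACHE_wbInv() while a host test passes NULL.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical hook for cache maintenance. On the DSP this is where a
 * wrapper around a call such as BCACHE_wbInv() would be supplied.
 */
typedef void (*overlay_cache_fn)(void *addr, size_t size);

/*
 * Copies code from its load address to its run address only when the
 * two differ. Returns 1 if a copy was made, 0 if the section already
 * runs in place (the non-overlay link.xdt case, where load == run).
 */
static int overlay_load(void *run_addr, const void *load_addr, size_t size,
                        overlay_cache_fn cache_sync)
{
    if (run_addr == load_addr) {
        return 0;                    /* nothing to copy or invalidate */
    }

    memcpy(run_addr, load_addr, size);   /* CPU copy, as in the example */

    if (cache_sync != NULL) {
        cache_sync(run_addr, size);      /* make the new code visible */
    }
    return 1;
}
```

In the codec, the arguments would be the addresses of the linker-defined load, run and size symbols, exactly as in the inline check above. Keeping the cache call behind a callback is a design choice for this sketch only; it keeps the platform-specific call out of the copy logic.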

Codec configuration additions

The following code is used in the G711DEC c64P example attached to this topic.

/*!
 *  ======== G711DEC.xdc : overlays ========
 *  This config param allows the user to indicate whether to
 *  use code overlays or not.
 */
config Bool overlays = false;
 
config String overlaysLoadCodeSection;
config String overlaysRunCodeSection;

Again this is a neat feature in RTSC/XDC: we create configuration parameters in the codec that get set by the system integrator. This enables us to have one version of the codec and one version of the combo/server source, which uses either the overlay or the non-overlay method. How else would we do this without such tooling? It would probably be a jumble of #ifdef OVERLAYS etc!

We default the overlays to false. This is the safest. As noted above overlays have side-effects!

We then create two new config parameters to allow the client to choose the load & run memory types. There are no 'sensible defaults' we can give them because the codec has no clue about platform-dependent memory names.

Later we'll see how these config params get used in the combo/server stage...

Codec supplying the right linker template

The following code is used in the G711DEC c64P example attached to this topic.

/*
 *  ======== package.xs : getSects ========
 */
function getSects()
{
    var template = null;
 
    if (Program.build.target.isa == "64P") {
        if ( this.G711DEC.overlays == false ) {
            template = "ti/sdo/codecs/g711dec/link.xdt";
        }
        else {
            template = "ti/sdo/codecs/g711dec/link_overlays.xdt";
        }
 
    }
 
    return (template);
}

Here we tell the client "if you don't want overlays (the default) I'll link in the standard code, data sections etc. However, if you do want overlays I'll bring in a different linker template that defines these overlays".

Basically the server integrator will get either link.xdt (non-overlays) or link_overlays.xdt depending on how he/she sets the overlays configuration boolean.

Server/combo configuration additions

(Note that the terms server and combo are used interchangeably to mean a DSP-side, Codec Engine based executable in an ARM+DSP context. A unit-server is simply a combo with just one codec in it.)

The following code is used in the G711DEC unitserver c64P example attached to this topic.

/*
 *  "Use" the various codec modules; i.e., implementation of codecs.
 *  All these "xdc.useModule" commands provide a handle to the codecs,
 *  which we'll use below to add them to the Server.algs array.
 */
var G711DEC = xdc.useModule('ti.sdo.codecs.g711dec.ce.G711DEC');
 
// Package Config
G711DEC.alg.watermark = false;
G711DEC.alg.codeSection = 'DDR2';
G711DEC.alg.udataSection = 'DDR2';
G711DEC.alg.dataSection = 'DDR2';
 
try {
    G711DEC.alg.overlays = true;
    G711DEC.alg.overlaysLoadCodeSection = 'DDR2';
    G711DEC.alg.overlaysRunCodeSection = 'IRAM';
} catch (e) {
    print("\nConfigured for standard server/combo - no overlays\n");
}

The useModule() and watermark settings are standard boilerplate configuration for codecs generated by the RTSC Package Wizard.

The three overlays-based config parameters are additions. Here the system integrator says "I do want to do overlays and I want to copy the code from DDR2 to IRAM".

To turn overlays off just do

   G711DEC.alg.overlays = false;

Et voilà! You now have conditional overlays!

Final thoughts

  • This stuff is dangerous! As noted above, calling BCACHE, choosing which EDMA TCCs to use, etc. can lead to tricky integration problems! We'd prefer ASPs/3Ps to spend time getting a good link.xdt layout that does 'a decent job' of minimizing L1P cache conflicts. Topics will follow shortly indicating how to do this.

Attachments Details

The attachments below are in compressed tar format. They show the code for a G711DEC example in which we overlay a function from DDR2 to IRAM (L2 memory). It's derived from one of the Codec Engine examples.

Note also that you could use this algorithm as a sample algorithm that meets the typical Codec Engine flow. Source code is provided so you can see how it is built.

The server package uses a try-catch mechanism to gracefully omit the overlays if you don't need them. Hence, again, the server package can be used as a reference. The server was built with the RTSC Package Server Wizard.

The tools and Target Content used to build this for the EVMDM6446 platform were:

  1. Codec Engine 2.10 (pre-release) - note that you should upgrade to the final version of CE 2.10
  2. Codec Engine 2.10 (pre-release) cetools - note that you should upgrade to the final version of CE 2.10
  3. RTSC XDC Tools 3.00.00.05
  4. DSPBIOS 5.31.08 (use 5.32.x with CE 2.10 release)
  5. CGT 6.0.16