
Codec Engine Overhead

From Texas Instruments Wiki

Introduction

This article addresses runtime overhead of Codec Engine (CE).

Initially, it just focuses on the cycle overhead of local and remote algorithms. However, it could grow to include other resource overhead (e.g. memory, system semaphores, etc).

Single-core (or local algorithm) Overhead

CE was designed from the beginning to be optimized for local algorithm execution. As such, there's almost zero overhead introduced when the algorithm is on the same processor as the application.

To help explain what goes on underneath the VISA calls, here is the VIDDEC_process() implementation - with comments, tracing, and "Checked Build" code removed - from the CE 2.10 release. Note that the full source to VIDDEC_process() is provided in CE 1.20 and later.

XDAS_Int32 VIDDEC_process(VIDDEC_Handle handle, XDM_BufDesc *inBufs,
    XDM_BufDesc *outBufs, VIDDEC_InArgs *inArgs, VIDDEC_OutArgs *outArgs)
{
    XDAS_Int32 retVal = VIDDEC_EFAIL;
 
    if (handle) {
        IVIDDEC_Fxns *fxns =
            (IVIDDEC_Fxns *)VISA_getAlgFxns((VISA_Handle)handle);
        IVIDDEC_Handle alg = VISA_getAlgHandle((VISA_Handle)handle);
 
        if (fxns && (alg != NULL)) {
            VISA_enter((VISA_Handle)handle);
            retVal = fxns->process(alg, inBufs, outBufs, inArgs, outArgs);
            VISA_exit((VISA_Handle)handle);
        }
    }
 
    return (retVal);
}

The VISA_getAlgFxns() and VISA_getAlgHandle() calls abstract whether the algorithm is local or remote (if it's remote, the returned fxns will be pointers to the stub). In the case where it's local, the alg's IALG_Fxns are returned.
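
The dispatch pattern described above can be sketched in plain C. This is a minimal illustration, not CE source: `DemoFxns`, `DemoVisa`, `demo_local_process`, `demo_stub_process` and `demo_process` are hypothetical stand-ins. The point is only that the caller makes the same indirect call whether the function table belongs to the algorithm or to a stub.

```c
#include <stddef.h>

/* Hypothetical function table, standing in for a codec class's *_Fxns. */
typedef struct DemoFxns {
    int (*process)(void *alg, const int *in, int *out);
} DemoFxns;

/* Hypothetical handle: holds whichever table was selected at create time. */
typedef struct DemoVisa {
    const DemoFxns *fxns;
    void           *alg;
} DemoVisa;

/* Local case: the algorithm's own process function runs on this core. */
static int demo_local_process(void *alg, const int *in, int *out)
{
    (void)alg;
    *out = *in * 2;          /* pretend "decode" */
    return 0;
}

/* Remote case: a real stub would marshal the arguments and message the
 * other core. This one only counts calls and fakes the same result. */
static int demo_stub_calls = 0;

static int demo_stub_process(void *alg, const int *in, int *out)
{
    (void)alg;
    demo_stub_calls++;
    *out = *in * 2;
    return 0;
}

static const DemoFxns DEMO_LOCAL_FXNS = { demo_local_process };
static const DemoFxns DEMO_STUB_FXNS  = { demo_stub_process };

/* Caller-side code is identical for the local and remote cases. */
static int demo_process(DemoVisa *h, const int *in, int *out)
{
    if (h == NULL || h->fxns == NULL || h->fxns->process == NULL) {
        return -1;
    }
    return h->fxns->process(h->alg, in, out);
}
```

A handle built with `DEMO_LOCAL_FXNS` costs one extra pointer dereference per call, which is why the local path adds so little overhead. A handle built with `DEMO_STUB_FXNS` goes through the stub instead, without any change to the calling code.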

VISA_enter() and VISA_exit() are calls to activate/deactivate the algorithm. In BIOS-based systems, where DSKT2 is used, VISA_enter() and VISA_exit() are calls into DSKT2_activate() and DSKT2_deactivate(), and therefore also benefit from DSKT2's lazy deactivate feature.
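
To show why lazy deactivation helps, here is a simplified C model of the idea. It is a sketch under stated assumptions, not DSKT2's implementation: `DemoAlg`, `DemoGroup`, `demo_activate` and `demo_deactivate` are hypothetical names. The model assumes one scratch group whose last user keeps ownership until a different algorithm needs the scratch memory.

```c
#include <stddef.h>

/* Hypothetical algorithm record; the counters make the effect observable. */
typedef struct DemoAlg {
    int activations;     /* times the real activate work ran   */
    int deactivations;   /* times the real deactivate work ran */
} DemoAlg;

/* Hypothetical scratch group: remembers which alg still owns the scratch. */
typedef struct DemoGroup {
    DemoAlg *lastActive; /* lazily deactivated owner, or NULL */
} DemoGroup;

static void demo_activate(DemoGroup *g, DemoAlg *alg)
{
    if (g->lastActive == alg) {
        return;          /* same alg ran last: scratch is still valid */
    }
    if (g->lastActive != NULL) {
        g->lastActive->deactivations++;  /* deferred deactivate runs now */
    }
    alg->activations++;
    g->lastActive = alg;
}

static void demo_deactivate(DemoGroup *g, DemoAlg *alg)
{
    /* Lazy: do nothing yet. The alg stays the owner until another alg
     * in the same group is activated. */
    (void)g;
    (void)alg;
}
```

In this model, an application that calls the same codec repeatedly pays for one activation in total. The deferred deactivation runs only when a second codec in the group is activated.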

ARM-side algorithm-provided Memory

The default memory being provided to local algorithms running on a Linux/WinCE ARM-based system is non-cached, CMEM-based memory. This is to ensure that algorithms which use hardware accelerators (e.g. VICP, DMA, etc) get physically contiguous memory that's not cached. However, for simple algorithms that don't use hardware accelerators, this non-cached memory is much slower than cached memory.

The Codec Engine Cache Per Alg article describes how to modify this default behavior, and provide a mix of cached and non-cached memory for different ARM-side algorithms.

Related System Overheads

Note that in high-level OSes such as Linux and WinCE, the algs run in user mode, so they may incur address translation and kernel/user transitions caused by interrupts from DMA and HW accelerators. The app and algs are also subject to the OS's scheduler, and I/O drivers compete with them for MIPS.

Multi-core Overhead

Multi-core Architecture Background

In a multi-core processor environment, both heterogeneous (e.g. DM644x, OMAP3) and homogeneous (e.g. C6472, C6474), processing a frame of data captured on the app processor requires multiple steps. The following shows a traditional heterogeneous multicore system (e.g. DM644x) with the app on the ARM and the alg on the DSP, but the same steps apply in a homogeneous multicore system as well:

  1. [Potential] Address translation from ARM-side virtual to DSP-side physical (fast)
  2. Transitioning execution from ARM-side to DSP-side processing (fast, < 100 microseconds)
  3. Invalidating cache of the buffers so the DSP-side sees the right data (slow, especially with very large data buffers)
  4. Activating, processing, deactivating the codec (activate/deactivate are typically fast, but can vary with the alg)
  5. Writing back the cache of the buffers so the ARM sees the right data (dependent on buffer size, slow with very large data buffers)
  6. Transitioning execution from DSP to ARM-side processing (fast, < 100 microseconds)
  7. [Potential] Address translation back from DSP-side physical to ARM-side virtual (fast)

The first and last steps can be skipped if the ARM is running an OS without virtual memory.

Analysis

As a concrete example, a customer's video decode application was benchmarked using CE 1.02. TI found the following:

  Steps 1 + 2 + 6 + 7 -   150 microseconds    ~  0.7%
  Step 3              -   500 microseconds    ~  2.3%
  Step 4              - 21000 microseconds    ~ 95.0%
  Step 5              -   450 microseconds    ~  2.0%

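The findings suggest that cache maintenance is the dominant framework overhead in this multi-core architecture: Steps 3 and 5 together cost 950 microseconds, against 150 microseconds for address translation and processor transitions.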

Findings also indicate that this cache overhead scales with size of the buffer. So HD video-sized buffers will incur even more overhead.
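
The arithmetic behind these findings can be checked with a few lines of C. The durations are the ones reported in the benchmark above. The enum and helper names (`US_TRANSITION`, `demo_share_tenths_pct`, and so on) are hypothetical and exist only for this sketch.

```c
/* Measured durations from the benchmark above, in microseconds. */
enum {
    US_TRANSITION = 150,    /* steps 1 + 2 + 6 + 7 */
    US_INVALIDATE = 500,    /* step 3 */
    US_CODEC      = 21000,  /* step 4 */
    US_WRITEBACK  = 450     /* step 5 */
};

/* Total time of one remote process() call. */
static long demo_total_us(void)
{
    return (long)US_TRANSITION + US_INVALIDATE + US_CODEC + US_WRITEBACK;
}

/* Everything that is not the codec itself. */
static long demo_overhead_us(void)
{
    return demo_total_us() - US_CODEC;
}

/* Share of part in whole, in tenths of a percent, rounded to nearest. */
static long demo_share_tenths_pct(long part_us, long whole_us)
{
    return (part_us * 1000L + whole_us / 2) / whole_us;
}
```

With these numbers the call takes 22100 microseconds, of which 1100 microseconds (about 5.0%) is framework overhead. Cache maintenance (Steps 3 and 5) accounts for about 86.4% of that overhead.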

Cache Optimization Techniques

Since Codec Engine 1.02, CE has taken several steps to address the cache overhead.

IVIDENC-specific Configuration

In CE 1.02.01 and 1.20+ (not in any 1.10 releases), CE added a config param to the ti.sdo.ce.video.IVIDENC interface called ".manageReconBufCache". This can be configured on a per-codec basis, for XDM 0.9 video encoder codecs only (i.e., those implementing IVIDENC).

If you know your video encoder does not require its reconstruction buffers to be cache managed, you can set this to false in your server's .cfg file and save a couple hundred microseconds (depending on the buffer size). The server script would look something like this:

H264ENC = xdc.useModule('mycompany.mycodec.H264ENC');
H264ENC.manageReconBufCache = false;

Updates to XDM Buffer Descriptors

In newer XDM interfaces (e.g. XDM 1.x), the buffer descriptors (e.g. XDM1_SingleBufDesc) have a .accessMask field which the codec uses to indicate how it accessed the buffers using the DSP CPU (i.e. read, write, both, neither). Using these indicators, frameworks (like CE's XDM skeletons) can more efficiently manage the cache of buffers after the *_process() call. See the next section (CE Skeleton Config) for details on eliminating cache maintenance before the *_process() call, as well.

So... if an XDM codec never performs CPU writes to the output buffers (i.e. it uses DMA), Step 5 goes to zero.

Codec producers, ensure your codecs set this mask appropriately! Not only will the value of this mask affect cache overhead - and therefore performance - but it will affect functionality. If the .accessMask is incorrect, cache coherency may not be maintained.
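
As an illustration of how a framework might act on such a mask after the process call, here is a minimal C sketch. It is not the CE skeleton code. The `DEMO_ACCESS_*` bits, `DemoBufDesc`, `demo_cache_wb` and `demo_post_process` are hypothetical stand-ins (the real XDM headers define their own access-mode macros), and the cache write-back is replaced by a byte counter so the behavior is visible.

```c
#include <stddef.h>

/* Hypothetical access bits, modeled on the idea behind .accessMask. */
#define DEMO_ACCESS_READ   0x1u
#define DEMO_ACCESS_WRITE  0x2u

/* Hypothetical buffer descriptor. */
typedef struct DemoBufDesc {
    unsigned char *buf;
    size_t         bufSize;
    unsigned int   accessMask;  /* set by the codec during process() */
} DemoBufDesc;

/* Stand-in for a real cache write-back: only counts the bytes. */
static size_t demo_bytes_written_back = 0;

static void demo_cache_wb(const void *addr, size_t size)
{
    (void)addr;
    demo_bytes_written_back += size;
}

/* After process(): write back only the buffers the CPU actually wrote.
 * Returns the number of buffers that were written back. */
static int demo_post_process(const DemoBufDesc *bufs, int numBufs)
{
    int i;
    int n = 0;

    for (i = 0; i < numBufs; i++) {
        if (bufs[i].buf != NULL &&
            (bufs[i].accessMask & DEMO_ACCESS_WRITE) != 0u) {
            demo_cache_wb(bufs[i].buf, bufs[i].bufSize);
            n++;
        }
    }
    return n;
}
```

A buffer filled purely by DMA leaves its mask clear and is skipped, which is the case where Step 5 costs nothing. The sketch also shows the functional risk: a codec that writes a buffer with the CPU but forgets to set the write bit gets no write-back, and the other core may read stale data.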

CE Skeleton Config

In CE 2.00, CE added new config options that let Server Integrators tell the skeletons when they can skip cache maintenance for codecs implementing IVIDENC, IVIDDEC, IIMGENC, and IIMGDEC. (These interfaces were chosen as they typically have large buffers.) The XDAIS spec requires frameworks (like CE) to ensure physical memory and cache are coherent prior to calling an algorithm's IMOD functions (e.g. process()). However, as an optimization, CE allows system integrators to ignore this rule on a per-alg basis.

For the above classes, cache maintenance can be controlled separately for the inBufs and outBufs passed to the *_process() function calls. Furthermore, for the video-related classes, the same fine-grained control is possible for reconstruction buffers (IVIDENC) and display buffers (IVIDDEC).

The most common use-case for these config options is to disable cache maintenance for a particular buffer when the codec only accesses it using the DMA (i.e. the code in the codec algorithm does not directly read/write the buffer). This eliminates unnecessary cache invalidations and write-backs that are associated with these buffers.

Note that if you disable cache maintenance on a buffer that actually needs it, you will run into intermittent data corruption problems. Before disabling it, confirm that the codec never reads or writes that buffer with the CPU.
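
The effect of such per-buffer flags before the process call can be sketched in C as follows. This is a hedged model, not the CE skeleton source: `DemoCodecCfg`, `demo_cache_inv` and `demo_pre_process` are hypothetical, the flag array only mirrors the idea of a per-buffer "manage cache" setting, and the invalidate is replaced by a byte counter.

```c
#include <stddef.h>

#define DEMO_MAX_BUFS 16

/* Hypothetical per-codec settings: one "manage cache" flag per buffer. */
typedef struct DemoCodecCfg {
    int manageInBufsCache[DEMO_MAX_BUFS];  /* nonzero: maintain cache */
} DemoCodecCfg;

/* Stand-in for a real cache invalidate: only counts the bytes. */
static size_t demo_bytes_invalidated = 0;

static void demo_cache_inv(const void *addr, size_t size)
{
    (void)addr;
    demo_bytes_invalidated += size;
}

/* Before process(): invalidate only the input buffers whose flag is set.
 * Returns the number of buffers that were invalidated. */
static int demo_pre_process(const DemoCodecCfg *cfg,
                            unsigned char *const bufs[],
                            const size_t sizes[], int numBufs)
{
    int i;
    int n = 0;

    for (i = 0; i < numBufs && i < DEMO_MAX_BUFS; i++) {
        if (bufs[i] != NULL && cfg->manageInBufsCache[i]) {
            demo_cache_inv(bufs[i], sizes[i]);
            n++;
        }
    }
    return n;
}
```

With the flag for buffer #1 cleared, only buffer #0 is invalidated. The saving is proportional to the size of the skipped buffer, and the sketch makes the risk visible too: if the codec did read buffer #1 with the CPU, it would see whatever stale lines were left in the cache.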

An example config for an IVIDENC-implementing codec looks like this:

// Get handle to the codec's 'Module' so it's brought into the system and we can configure it
myEncoder = xdc.useModule('mycompany.mycodec.H264ENC');
 
// do not flush/manage cache for the 2nd input buffer (#1) b/c the alg only uses DMA to access it:
myEncoder.manageInBufsCache[1] = false;
 
// do not manage cache for any of up to 16 possible "reconstruction" buffers (IVIDENC only!)
for (var i = 0; i < 16; i++) {
    myEncoder.manageReconBufsCache[i] = false;
}
 
// see more options in <Codec Engine>/packages/ti/sdo/ce/video*/IVID*.xdc

In CE 2.10, similar support was added for codecs implementing IVIDDEC1, IVIDENC1, IVIDDEC2, IIMGENC1 and IIMGDEC1. The Server Integrator can use these class-specific config params to disable cache maintenance before the *_process() call is made. As these newer XDM interfaces use buffer descriptors with .accessMask indicators (see previous section), the cache maintenance after the *_process() call is handled automatically:

/*
 * Get handle to the codec's 'Module' so it's brought into the system and we can configure it
 * Note, this alg implements IVIDENC1, so cache maintenance _after_ the process() call is
 * handled automatically.
 */
myEncoder = xdc.useModule('mycompany.mycodec.H264ENC1');
 
// do not invalidate the cache for any of the 16 inBufs _before_ the process() call
for (var i = 0; i < 16; i++) {
    myEncoder.manageInBufsCache[i] = false;
}

Custom Skeletons

As described in the Overriding stubs and skeletons article, skeletons (the DSP-side of the RPC) can be completely replaced. This can eliminate all overhead, but should be used very carefully as it can also result in broken functionality.

Skeleton Caching Policy

In CE 2.25.02, the Codec Engine skelCachingPolicy feature was introduced to provide the system integrator with further optimization options.

Future Directions

TI welcomes other suggestions and techniques - feel free to post them below, or on the TI E2E forums.
