Codec Engine Overhead
From Texas Instruments Embedded Processors Wiki
Introduction
This article addresses runtime overhead of Codec Engine (CE).
Initially, it focuses only on the cycle overhead of local and remote algorithms. It may later grow to cover other resource overhead (e.g. memory, system semaphores, etc.).
Single-core (or local algorithm) Overhead
CE was designed from the beginning to be optimized for local codec execution. As such, there's very little overhead introduced.
To help explain what goes on underneath the VISA calls, here is the VIDDEC_process() implementation - with comments, tracing, and "Checked Build" code removed - from the CE 2.10 release. Note that the full source to VIDDEC_process() is provided in CE 1.20 and later.
    XDAS_Int32 VIDDEC_process(VIDDEC_Handle handle, XDM_BufDesc *inBufs,
        XDM_BufDesc *outBufs, VIDDEC_InArgs *inArgs, VIDDEC_OutArgs *outArgs)
    {
        XDAS_Int32 retVal = VIDDEC_EFAIL;
        VIDDEC_InArgs refInArgs;

        if (handle) {
            IVIDDEC_Fxns *fxns =
                (IVIDDEC_Fxns *)VISA_getAlgFxns((VISA_Handle)handle);
            IVIDDEC_Handle alg = VISA_getAlgHandle((VISA_Handle)handle);

            if (fxns && (alg != NULL)) {
                VISA_enter((VISA_Handle)handle);
                retVal = fxns->process(alg, inBufs, outBufs, inArgs, outArgs);
                VISA_exit((VISA_Handle)handle);
            }
        }

        return (retVal);
    }
The VISA_getAlgFxns() and VISA_getAlgHandle() calls abstract whether the codec is local or remote (if it's remote, the returned fxns will be pointers to the stub). In the case where it's local, the codec's IALG_Fxns are returned. VISA_enter() and VISA_exit() are calls into algActivate() and algDeactivate().
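The dispatch pattern described above can be sketched in plain C. This is a minimal illustration, not Codec Engine source: every name in it (SketchAlg, SketchFxns, sketch_process, LOCAL_FXNS, STUB_FXNS) is hypothetical, and the "remote" stub simply calls the local function where a real stub would marshal the arguments and send a message to the DSP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from Codec Engine. */
typedef struct SketchAlg SketchAlg;

typedef struct {
    int (*process)(SketchAlg *alg, const int *in, int *out, size_t n);
} SketchFxns;

struct SketchAlg {
    const SketchFxns *fxns; /* local codec functions, or the remote stub */
    int entered;            /* times the enter (activate) hook ran */
    int exited;             /* times the exit (deactivate) hook ran */
    int remoteCalls;        /* calls that were routed through the stub */
};

/* "Local" codec: runs directly in the caller's context. */
static int local_process(SketchAlg *alg, const int *in, int *out, size_t n)
{
    (void)alg;
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * 2;
    }
    return 0;
}

/* "Remote" stub: a real stub would marshal arguments and message the DSP. */
static int stub_process(SketchAlg *alg, const int *in, int *out, size_t n)
{
    alg->remoteCalls++;
    return local_process(alg, in, out, n); /* stand-in for the remote side */
}

static const SketchFxns LOCAL_FXNS = { local_process };
static const SketchFxns STUB_FXNS  = { stub_process };

/* The caller never needs to know which function table is installed. */
int sketch_process(SketchAlg *alg, const int *in, int *out, size_t n)
{
    int ret = -1;

    if (alg != NULL && alg->fxns != NULL) {
        alg->entered++;                            /* analogue of VISA_enter() */
        ret = alg->fxns->process(alg, in, out, n);
        alg->exited++;                             /* analogue of VISA_exit() */
    }
    return ret;
}
```

An application would initialize a SketchAlg with either &LOCAL_FXNS or &STUB_FXNS and call sketch_process() identically in both cases. For the local table, the only added cost is one indirect call plus the enter/exit hooks.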
Algorithm-provided Memory
The default memory being provided to local algorithms is non-cached, CMEM-based memory. This is to ensure that algorithms which use hardware accelerators (e.g. VICP, DMA, etc) get physically contiguous memory that's not cached. However, for simple algorithms that don't use hardware accelerators, this non-cached memory is much slower than cached memory.
There is a global setting in the ti.sdo.ce.alg.Settings module to provide cached memory to all local algorithms, which can improve performance for some algorithms. Note that cacheable/non-cacheable memory cannot be selected on a per-algorithm basis; it is a global setting:
    algSettings = xdc.useModule('ti.sdo.ce.alg.Settings');
    algSettings.useCache = true;  // the default is false
There is some further discussion in this archived email thread.
Related System Overheads
Note that in high-level OSes like Linux and WinCE, the codec runs in user mode, so it may incur address translation and kernel/user transitions caused by interrupts from DMA and HW accelerators. In addition, the app and codec(s) are subject to the OS's scheduler, and I/O drivers compete with them for MIPS.
Dual-core Overhead
Dual-core Architecture Background
In a dual-core processor environment like DM644x and DM6467, processing a frame of data captured on the ARM requires:
1. Address translation from ARM-side virtual to DSP-side physical (fast)
2. Transitioning execution from ARM to DSP-side processing (fast, < 100 microseconds)
3. Invalidating cache of the buffers so the DSP sees the right data (slow, especially with very large data buffers)
4. Activating, processing, deactivating the codec (typically fast, but variable based on the codec)
5. Writing back the cache of the buffers so the ARM sees the right data (slow, especially with very large data buffers)
6. Transitioning execution from DSP to ARM-side processing (fast, < 100 microseconds)
7. Address translation back from DSP-side physical to ARM-side virtual (fast)
The first and last steps can be avoided if the ARM is running an OS without virtual memory.
Analysis
As a concrete example, a customer's video decode application was benchmarked using CE 1.02. TI found the following:
    Steps            Time (microseconds)   Share of total
    1 + 2 + 6 + 7    150                   ~0.7%
    3                500                   ~2.3%
    4                21000                 ~95.0%
    5                450                   ~2.0%
The findings suggest that, outside the codec itself, cache maintenance is the dominant overhead in this dual-core architecture.
The findings also indicate that this cache overhead scales with the size of the buffer, so HD video-sized buffers will incur even more overhead.
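The arithmetic behind these shares can be reproduced from the raw microsecond figures. The sketch below is only a worked calculation: the struct and function names are hypothetical, and the numbers are the ones reported for this single benchmark.

```c
#include <assert.h>

/* Per-call timings in microseconds, copied from the benchmark above. */
typedef struct {
    double transitions; /* steps 1 + 2 + 6 + 7 */
    double invalidate;  /* step 3 */
    double codec;       /* step 4 */
    double writeback;   /* step 5 */
} CallTimings;

static const CallTimings BENCHMARK = { 150.0, 500.0, 21000.0, 450.0 };

/* Total time for one remote process() call. */
double total_us(const CallTimings *t)
{
    return t->transitions + t->invalidate + t->codec + t->writeback;
}

/* Everything that is not the codec itself. */
double overhead_us(const CallTimings *t)
{
    return t->transitions + t->invalidate + t->writeback;
}

/* Share of the whole call taken by one component, in percent. */
double share_of_total_percent(double part_us, const CallTimings *t)
{
    double total = total_us(t);
    assert(total > 0.0);
    return 100.0 * part_us / total;
}

/* Share of the non-codec overhead spent on cache maintenance, in percent. */
double cache_share_of_overhead_percent(const CallTimings *t)
{
    double overhead = overhead_us(t);
    assert(overhead > 0.0);
    return 100.0 * (t->invalidate + t->writeback) / overhead;
}
```

With these inputs, cache invalidation and write-back together account for 950 of the 1100 microseconds spent outside the codec, which is why the optimizations in this article concentrate on cache maintenance rather than on the ARM/DSP transitions.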
Cache Optimization Techniques
Since Codec Engine 1.02, CE has taken several steps to address the cache overhead.
IVIDENC-specific Configuration
In CE 1.02.01 and 1.20+ (not in any 1.10 releases), CE added a config param to the ti.sdo.ce.video.IVIDENC interface called ".manageReconBufCache". This can be configured on a per-codec basis, for XDM 0.9 video encoder codecs only (i.e., those implementing IVIDENC).
If you know your video encoder does not require its reconstruction buffers to be cache managed, you can set this to false in your server's .cfg file and save a couple hundred microseconds (depending on the buffer size). The server script would look something like this:
    H264ENC = xdc.useModule('mycompany.mycodec.H264ENC');
    H264ENC.manageReconBufCache = false;
Updates to XDM Buffer Descriptors
In newer XDM interfaces (e.g. XDM 1.x), the buffer descriptors (e.g. XDM1_SingleBufDesc) have a .accessMask field which the codec uses to indicate how it accessed the buffers using the DSP CPU (i.e. read, write, both, neither). Using these indicators, frameworks (like CE's XDM skeletons) can more efficiently manage the cache of buffers after the *_process() call. See the CE Skeleton Config section for details on eliminating cache maintenance before the *_process() call, as well.
So if an XDM codec never performs CPU writes to the output buffers (i.e. it uses DMA), Step 5 goes to zero.
Codec producers, ensure your codecs set this mask appropriately! Not only will the value of this mask affect cache overhead - and therefore performance - but it will affect functionality. If the .accessMask is incorrect, cache coherency may not be maintained.
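A framework's use of the access mask after the *_process() call can be pictured roughly as follows. This is a sketch under stated assumptions: the descriptor type, the bit values, and the function names are hypothetical stand-ins for illustration, not the definitions from xdm.h, and the actual CE skeleton logic may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout; consult xdm.h for the real access-mode macros. */
enum {
    SKETCH_ACCESS_READ  = 1 << 0, /* codec read the buffer with the CPU */
    SKETCH_ACCESS_WRITE = 1 << 1  /* codec wrote the buffer with the CPU */
};

/* Hypothetical stand-in for a single-buffer descriptor. */
typedef struct {
    uint8_t *buf;
    int32_t  bufSize;
    int32_t  accessMask; /* filled in by the codec during process() */
} SketchBufDesc;

/* Only CPU writes leave dirty cache lines that must reach memory. */
bool needs_writeback(const SketchBufDesc *desc)
{
    return (desc->accessMask & SKETCH_ACCESS_WRITE) != 0;
}

/* Count the output buffers a skeleton would have to write back. */
int count_writebacks(const SketchBufDesc *descs, int numBufs)
{
    int count = 0;

    assert(numBufs >= 0);
    for (int i = 0; i < numBufs; i++) {
        if (needs_writeback(&descs[i])) {
            count++; /* a real skeleton would write back descs[i] here */
        }
    }
    return count;
}
```

A codec that fills its output purely through DMA would leave the write bit clear on every descriptor, so count_writebacks() returns 0 and no write-back work is done. A codec that sets the mask incorrectly would cause a needed write-back to be skipped, which is the coherency risk described above.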
CE Skeleton Config
In CE 2.00, CE added new config options that let Server Integrators tell the skeletons when they can skip cache maintenance for codecs implementing IVIDENC, IVIDDEC, IIMGENC, and IIMGDEC. (These interfaces were chosen because they typically have large buffers.) The XDAIS spec requires frameworks (like CE) to ensure physical memory and cache are coherent prior to calling an algorithm's IMOD functions (e.g. process()). However, as an optimization, CE allows system integrators to ignore this rule on a per-codec basis.
For the above classes, cache maintenance can be controlled separately for the inBufs and outBufs passed to the *_process() function calls. Furthermore, for the video-related classes, the same fine-grained control is possible for reconstruction buffers (IVIDENC) and display buffers (IVIDDEC).
The most common use-case for these config options is to disable cache maintenance for a particular buffer when the codec only accesses it using the DMA (i.e. the code in the codec algorithm does not directly read/write the buffer). This eliminates unnecessary cache invalidations and write-backs that are associated with these buffers.
Note that if you inadvertently disable cache maintenance on a buffer that really needs it, you will run into occasional data corruption problems. So it is important to confirm that the codec really does access the buffer only via DMA before disabling cache maintenance for it.
An example config for an IVIDENC-implementing codec looks like this:
    // Get handle to the codec's 'Module' so it's brought into the system and we can configure it
    myEncoder = xdc.useModule('mycompany.mycodec.H264ENC');

    // do not flush/manage cache for the 2nd input buffer (#1) b/c the alg only uses DMA to access it:
    myEncoder.manageInBufsCache[1] = false;

    // do not manage cache for any of up to 16 possible "reconstruction" buffers (IVIDENC only!)
    myEncoder.manageReconBufsCache[ 16 ] = [
        false, false, false, false, false, false, false, false,
        false, false, false, false, false, false, false, false,
    ];

    // see more options in <Codec Engine>/packages/ti/sdo/ce/video*/IVID*.xdc
In CE 2.10, similar support was added for codecs implementing IVIDDEC1, IVIDENC1, IVIDDEC2, IIMGENC1 and IIMGDEC1. The Server Integrator can use these class-specific config params to disable cache maintenance before the *_process() call is made. As these newer XDM interfaces use buffer descriptors with .accessMask indicators (see previous section), the cache maintenance after the *_process() call is handled automatically:
    /*
     * Get handle to the codec's 'Module' so it's brought into the system and we can configure it
     * Note, this alg implements IVIDENC1, so cache maintenance _after_ the process() call is
     * handled automatically.
     */
    myEncoder = xdc.useModule('mycompany.mycodec.H264ENC1');

    // do not invalidate the cache for inBufs or outBufs _before_ the process() call
    myEncoder.manageInBufsCache[ 16 ] = [
        false, false, false, false, false, false, false, false,
        false, false, false, false, false, false, false, false,
    ];
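On the DSP side, the effect of these per-buffer settings before the *_process() call can be pictured with a small C model. It is an illustration only: the types and functions are hypothetical, the all-true default is an assumption inferred from the text above, and the model tallies bytes instead of performing a real cache invalidate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_BUFS 16 /* mirrors the 16-entry arrays in the config above */

/* Hypothetical per-codec settings, as a skeleton might see them. */
typedef struct {
    bool manageInBufsCache[SKETCH_MAX_BUFS];
} SketchCodecConfig;

/* Hypothetical buffer record. */
typedef struct {
    void  *ptr;
    size_t size;
} SketchBuf;

/* Assumed default: cache is managed for every buffer unless disabled. */
void sketch_config_defaults(SketchCodecConfig *cfg)
{
    for (int i = 0; i < SKETCH_MAX_BUFS; i++) {
        cfg->manageInBufsCache[i] = true;
    }
}

/*
 * Walk the input buffers before process() and return how many bytes would
 * be invalidated. Buffers whose flag is false are skipped entirely.
 */
size_t invalidate_inputs(const SketchCodecConfig *cfg,
                         const SketchBuf *bufs, int numBufs)
{
    size_t bytes = 0;

    assert(numBufs >= 0 && numBufs <= SKETCH_MAX_BUFS);
    for (int i = 0; i < numBufs; i++) {
        if (cfg->manageInBufsCache[i]) {
            /* a real skeleton would invalidate bufs[i] in the cache here */
            bytes += bufs[i].size;
        }
    }
    return bytes;
}
```

Since the measured overhead scales with buffer size, the returned byte count is a rough proxy for the time saved: every buffer whose flag is false contributes nothing to the invalidate step.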
Custom Skeletons
As described in the Overriding stubs and skeletons article, skeletons (the DSP-side of the RPC) can be completely replaced. This can eliminate all overhead, but should be used very carefully as it can also result in broken functionality.
Skeleton Caching Policy
In CE 2.25.02, the Codec Engine skelCachingPolicy feature was introduced to provide the system integrator with further optimization options.
Future Directions
TI welcomes other suggestions and techniques - feel free to post them here, or send them to the davinci mailing list.

