Please note as of Wednesday, August 15th, 2018 this wiki has been set to read only. If you are a TI Employee and require Edit ability please contact x0211426 from the company directory.

Codec Engine Profiling

From Texas Instruments Wiki
Jump to: navigation, search

Introduction

System integrators have an acute need to profile the CPU activity in their applications. Especially in Arm + DSP environments it can be quite difficult to e.g. pinpoint the reason for a dropped frame. Many developers don't even have JTAG emulators and DSP IDE environments such as Code Composer Studio.

Codec Engine does however contain hooks to get key profiling data. Often this can be done from the command line i.e. without requiring a rebuild of either DSP or GPP code.

This topic attempts to collect all such Codec Engine profiling techniques under 1 roof.

We encourage contributions since this is an active area - everybody has their own profiling 'tricks'.

Profiling Techniques

Various techniques will be described here as sub-topics. For example understanding the time-deltas reported by CE DEBUG, pinpointing process() call duration, and analyzing cache impact.

Profiling DSP-side process() calls in an ARM+DSP environment

A typical question is How do I benchmark my codec process() call?

The easiest way is to simply leverage CE_DEBUG.

For example, on the DM6446EVM platform we ran the following command: -

CE_DEBUG=1 CE_DSP0TRACE="CV=5;GT_time=2" ./decode -v davincieffect_ntsc.mpeg4 -t 3

Before we look at the output, let's understand the command details: -

  • CE_DEBUG=1 - this is the minimal level of CE_DEBUG tracing. It causes all warnings and errors (both on the application processor and any remote processors, if they exist) to be printed on standard output. We don't need CE_DEBUG=2 or higher in this use-case since it produces more 'noise' than we need for this profiling scenario.
  • CE_DSP0TRACE - this is the initial mask for DSP-side tracing.
    • CV=5 - benchmarks are obtained via Level 5. CV is the module abbreviation for Codec Engine VISA APIs. We don't do CV=01234567 because we don't need to see function enter/exit reporting (Level 0), internal Codec Engine messages (Levels 1-4), nor extra warnings & fatal errors (Levels 6-7).
    • GT_time=2 - GT_time controls the format of timestamps in CE_DEBUG trace lines. Setting this to 2 shows the delta (from the previous traced call) in microseconds.
  • decode - this is the standard DVSDK demos. In this particular topic we used a codec combo with only an MPEG4 decoder built into it for simplicity.
    • -v davincieffect_ntsc.mpeg4 - the video clip we use for test
    • -t 3 - run the demos for 3 seconds only

Now let's look at a snapshot of the output trace: -

[DSP] @+003,829us: [+5 T:0x8fa52584] CV - VISA_enter(visa=0x8fa51fb0): algHandle = 0x8fa51fe0
[DSP] @+000,023us: [+5 T:0x8fa52584] CV - VISA_exit(visa=0x8fa51fb0): algHandle = 0x8fa51fe0
[DSP] @+003,508us: [+5 T:0x8fa52584] CV - VISA_enter(visa=0x8fa51fb0): algHandle = 0x8fa51fe0
[DSP] @+016,335us: [+5 T:0x8fa52584] CV - VISA_exit(visa=0x8fa51fb0): algHandle = 0x8fa51fe0

These VISA_enter/exit calls bracket the control() and process() calls of the algorithm interface. Control calls are typically short - hence in the above example it's clear that the 2nd VISA_exit call timestamp represents the enter -> process() -> exit duration.

So...is the benchmark time for the MPEG4 decode processing 16335 microseconds? Unfortunately there are a couple more steps to get the absolute time: -

  • multiply the reported value by 256 to get true 'ticks'. i.e. @+016,335us * 256 = 4181760. The reason this is not automatically done for you is that the c64+ core has a 64-bit timestamp split via TSCL (lo 32 bits) and TSCH (hi 32 bits). A good description of these from a benchmarking perspective can be found in this File:C64p cgt optimization.pdf. Using TSCL on its own would wrap too quickly (it increments every cycle). Using the upper 32 bits (TSCH) would increment too slowly. So Codec Engine essentially takes 32 bits "from the middle" of those 64 bits via an internal DSP/BIOS API C64P_getltime8(). Which 32 bits? It is TSCH/TSCL >> 8. Hence to get true 'ticks' you need to multiply by 256.
  • divide by the DSP clock-frequency. On the current DM6446EVM this is typically 594mhz i.e. 4181760 / 594 = 7040 microseconds. This is the duration of the MPEG4 decode processing call for that particular frame.

Since Codec Engine uses a standard tracing format you could also script this calculation.

Instrumenting your own code w/ GT trace profiling

Let's say you want to get more detailed instrumentation. Or you just want to verify the numbers in the above section are accurate. How can you add your own benchmarking and make it appear in the CE_DEBUG trace log? Step 1 is to read up on Overriding stubs and skeletons and Printing in stubs and skeletons. The skeleton (DSP-side wrapper for VISA calls) is what we care about in this Arm+DSP profiling scenario.

// benchmarking the process() DSP call
t0 = CLK_gethtime();
 
VISA_enter((VISA_Handle)handle);
retVal = fxns->process(alg, inBufs, outBufs, inArgs, outArgs);
VISA_exit((VISA_Handle)handle);
 
t1 = CLK_gethtime();
cpuCycles = (LgUns)((t1 - t0) * CLK_cpuCyclesPerHtime());
/* calculate absolute time in milliseconds */
timeAbsolute = cpuCycles / GBL_getFrequency();
 
GT_1trace(CURTRACE, GT_5CLASS, "BENCHMARK> process() call : delta=%ld\n", (t1-t0));
GT_1trace(CURTRACE, GT_5CLASS, "BENCHMARK> process() call : abs time in msec=%d\n", timeAbsolute);

Again, to understand the details: -

  • CLK_gethtime() - DSP/BIOS API which returns the hi-resolution time. We bracket the processing with separate calls to measure the delta.
  • cpuCycles - stores the resultant CPU cycles of the delta taking into account any possible scaling factor for htime ticks .v. CPU cycles (note that on c64+ since the 64 bit timestamp counter is used this ratio is actually 1:1)
  • timeAbsolute - divide the CPU cycles by the DSP CLK frequency (594mhz in this case).
  • GT_1trace(CURTRACE, GT_5CLASS - trace, with 1 argument, using the CURTRACE GT_Mask to define the scope, and use benchmarking Level 5.

To add this specific GT_1trace to the output you can do: -

CE_DEBUG=1 CE_DSP0TRACE="CV=5;ti.sdo.ce.video.VIDDEC=5;GT_time=2" ./decode -v davincieffect_ntsc.mpeg4 -t 2

The add-on to the CE_DSP0TRACE mask allows us to see the output of our additional GT_1trace instrumentation, resulting in: -

[DSP] @+004,348us: [+5 T:0x8fa52584] CV - VISA_enter(visa=0x8fa51fb0): algHandle = 0x8fa51fe0
[DSP] @+000,023us: [+5 T:0x8fa52584] CV - VISA_exit(visa=0x8fa51fb0): algHandle = 0x8fa51fe0
[DSP] @+003,426us: [+5 T:0x8fa52584] CV - VISA_enter(visa=0x8fa51fb0): algHandle = 0x8fa51fe0
[DSP] @+016,383us: [+5 T:0x8fa52584] CV - VISA_exit(visa=0x8fa51fb0): algHandle = 0x8fa51fe0
[DSP] @+000,086us: [+5 T:0x8fa52584] ti.sdo.ce.video.VIDDEC - BENCHMARK> process() call : delta=4236411
[DSP] @+000,001us: [+5 T:0x8fa52584] ti.sdo.ce.video.VIDDEC - BENCHMARK> process() call : abs time in msec=7

As per the calculations in the previous section you can see that the abs time in msec=7 matches up nicely (7040 microseconds above). Hence we've now validated the Codec Engine DSP tracing benchmarks and shown how to add arbitrary benchmark segments.

See Also