Stack issues

From Texas Instruments Wiki

This page is about problems and crashes that occur due to insufficient stack space in DSP server threads, and how to find them.

Introduction

A large portion of mysterious crashes occur due to stack overflows of codec threads running on the DSP. How do you know if your DSP threads have too little stack space?

Why is this important? The value returned by a codec's getStackSize() Codec Engine method is used when the DSP/BIOS task for that codec is created dynamically (in the SoC scenario). It must therefore be large enough for both the Codec Engine context and the algorithm's processing method; otherwise you'll get a nasty crash!

One solution

The easiest way to deal with stack space issues is: allocate plenty of stack space! Allocate way more than absolutely necessary, perhaps 32K or more per thread, especially for video and imaging codecs. All the stack memory is external, and only what is used is cached, so the only penalty for an unnecessarily large stack is wasting a few dozen K of cheap external memory per codec instance (and instances are few), for codecs that eat megabytes to process their data. The potential benefit is preventing a crash a few years down the road because an unusual input caused the codec to overflow its stack.

Even so, it is common to see 8K or 16K stack allocations, chosen to save an additional 32K or so, for codecs that require 8MB of data for intermediate buffers.
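To put the numbers above in proportion, here is a minimal arithmetic sketch. The function name is hypothetical, and the sketch assumes the "K" and "MB" figures in the text are binary units (1024-based).

```python
KIB = 1024
MIB = 1024 * KIB

def stack_overhead_percent(extra_stack_bytes, codec_data_bytes):
    """Extra stack as a percentage of the codec's intermediate buffers."""
    return 100.0 * extra_stack_bytes / codec_data_bytes

# Figures taken from the text: about 32K of extra stack versus 8MB of buffers.
overhead = stack_overhead_percent(32 * KIB, 8 * MIB)

if __name__ == "__main__":
    print(f"extra stack is {overhead:.2f}% of the codec's buffer memory")
```

Under these assumptions the extra stack is well under half a percent of the memory the codec already needs for its data.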

Finding out dynamic stack usage

Unless your application crashed (e.g. because the threads overran their stack!), CE's tracing facilities can display the stack usage of each remote algorithm thread when the remote algorithm is deleted.

Note: The stack usage display only applies to remote algorithms. Local algorithms use the stack of the calling thread, so applications will have to use OS primitives for their given OS (e.g. TSK_stat() on BIOS).

If using CE 2.00 or later, you can use the CE_DEBUG feature to enable all trace, then search the output for the string stack.

Here's an example session for CE 2.x:

root@146.252.161.13:~/r/ce-h18x# CE_DEBUG=2 ./app.out  | grep stack
[DSP] @0,051,117tk: [+4 T:0x8fa463cc] OT - Thread_create > name: "videnc_copy#0", pri:  -1, stack size:  1024, stack seg: 0
[DSP] @0,065,397tk: [+4 T:0x8fa463cc] OT - Thread_create > name: "viddec_copy#1", pri:  -1, stack size:  1024, stack seg: 0
@0,641,660us: [+5 T:0x4003a6d8] CE - Engine_deleteNode(0x3f998): algName = videnc_copy, algHandle = 0x8fa47398, stack size = 1024, stack used = 731(72%)
@0,644,959us: [+5 T:0x4003a6d8] CE - Engine_deleteNode(0x3fae0): algName = viddec_copy, algHandle = 0x8fa47ce0, stack size = 1024, stack used = 739(73%)

And here's similar output for CE 3.x (note the trace formatting is slightly different, but the information is the same):

root@arago:/tmp# CE_DEBUG=2 ./app_remote.xv5T -m omap3530_memmap.txt  | grep stack
[t=0x000be6fd] [tid=0x4001e9a0] ti.sdo.ce.Engine: Engine_deleteNode(0xd5d58): algName = universal_copy, algHandle = 0x8788f598, stack size = 4096, stack used = 2028(50%)
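When many codecs are involved, it can help to post-process the trace rather than read it by eye. The following is a minimal sketch that extracts the figures from Engine_deleteNode lines like those shown above. The regular expression is based only on the sample output on this page, and the 70% warning threshold is an arbitrary assumption, not a CE recommendation.

```python
import re

# Matches the tail of an Engine_deleteNode trace line, as seen in the
# CE 2.x and CE 3.x samples on this page. Other CE versions may differ.
STACK_RE = re.compile(
    r"algName = (?P<alg>\w+),.*?"
    r"stack size = (?P<size>\d+), stack used = (?P<used>\d+)\("
)

def stack_report(trace_lines, warn_ratio=0.7):
    """Return (alg, size, used, is_tight) for each line reporting stack usage.

    warn_ratio is a hypothetical threshold chosen for illustration.
    """
    report = []
    for line in trace_lines:
        m = STACK_RE.search(line)
        if m is None:
            continue  # e.g. Thread_create lines carry no usage figure
        size, used = int(m["size"]), int(m["used"])
        report.append((m["alg"], size, used, used >= warn_ratio * size))
    return report

# Lines copied from the example sessions above.
SAMPLE = [
    '[DSP] @0,051,117tk: [+4 T:0x8fa463cc] OT - Thread_create > name: "videnc_copy#0", pri:  -1, stack size:  1024, stack seg: 0',
    "@0,641,660us: [+5 T:0x4003a6d8] CE - Engine_deleteNode(0x3f998): algName = videnc_copy, algHandle = 0x8fa47398, stack size = 1024, stack used = 731(72%)",
    "@0,644,959us: [+5 T:0x4003a6d8] CE - Engine_deleteNode(0x3fae0): algName = viddec_copy, algHandle = 0x8fa47ce0, stack size = 1024, stack used = 739(73%)",
    "[t=0x000be6fd] [tid=0x4001e9a0] ti.sdo.ce.Engine: Engine_deleteNode(0xd5d58): algName = universal_copy, algHandle = 0x8788f598, stack size = 4096, stack used = 2028(50%)",
]

if __name__ == "__main__":
    for alg, size, used, tight in stack_report(SAMPLE):
        flag = "TIGHT" if tight else "ok"
        print(f"{alg}: {used}/{size} bytes used [{flag}]")
```

With the sample lines and the assumed 70% threshold, the two copy codecs from the CE 2.x session are flagged and universal_copy is not. Keep in mind that these are dynamic figures: they reflect only the code paths that actually ran.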

Changing dynamic stack allocation

Set the .stackSize field to the desired value for each algorithm thread listed, and also for the system thread, which calls each algorithm's create() function in its own context. Example:

Server.algs = [
    {
        name: "h264enc",
        mod: H264ENC,
        groupId: 0,
        threadAttrs: {
            stackSize:  65536,  // stack size for each instance of this codec
            stackMemId: 0,
            priority: Server.MINPRI + 1
        }
    },
    {
        name: "h264dec",
        mod: H264DEC,
        groupId: 1,
        threadAttrs: {
            stackSize:  65536,  // stack size for each instance of this codec
            stackMemId: 0,
            priority: Server.MINPRI + 1
        }
    },
];
 
// stack size for the system thread, used to create/delete codecs, as well as
// handle general Server API requests like 'getCpuLoad()'
Server.threadAttrs.stackSize = 65536;

This setting will override the default that the algorithm suggests (via the getStackSize() fxn in its codec package), so make sure you're allocating more than the algorithm thinks it needs, not less.

One way to find out what the default allocations are is to look at the generated Server Data Sheet HTML file located in the Server's package/info directory.

Finding out static stack usage

What's the problem with just using dynamic stack usage? It comes down to the call graph. A static call graph is constructed via some static analysis of the code. In particular, the code is not executed. A static call graph shows all the possible calls that could occur. A dynamic call graph is constructed via information gathered when the code executes. A dynamic call graph shows which calls actually do occur.

Dynamic call graphs are very accurate, but only for the code which actually executes. There is no guarantee that, even with worst case input, the functions that use the most stack get called.

Generally speaking, a static call graph shows all the possible calls, and thus is a better choice for exploring system limits than a dynamic call graph. Always understand the accuracy constraints of the static call graph tool you are using, and the impact those constraints have on measurement of system limits. This topic also outlines such limitations.

The TI supplied cg_xml utility package performs processing of XML files that can be generated by TI Codegen tools. Several tools are in the package. The executable call_graph builds a static call graph.

C:\dir>ofd6x -x -g ex.out | call_graph
Reading from stdin ...
Call Graph for ex.out
****************************************************
_c_int00 : wcs = 1240
|  __args_main : wcs = 1240
|  |  _main : wcs = 1232
|  |  |  _cond_is_false : wcs = 1216
|  |  |  |  _output : wcs = 1208
|  |  |  |  |  _printf : wcs = 1176
|  |  |  |  |  |  __printfi : wcs = 1160
|  |  |  |  |  |  |  __pproc_fflags : wcs = 0
|  |  |  |  |  |  |  __pproc_fwp : wcs = 40
|  |  |  |  |  |  |  |  _atoi : wcs = 0
. . .

Note: wcs stands for worst case stack, the total amount of stack needed from that point in the call graph downward.
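The idea behind a worst case stack figure can be sketched as follows: each function needs its own frame plus the deepest chain among the functions it may call. This is only an illustration of the principle, not call_graph's actual implementation. The function names, frame sizes, and call edges below are hypothetical.

```python
def worst_case_stack(func, frame_size, callees, _active=()):
    """Worst-case stack (bytes) needed from the entry of `func` downward."""
    if func in _active:
        # A static bound does not exist without knowing the recursion depth.
        raise ValueError(f"recursion through {func}: no static bound")
    deepest_child = max(
        (worst_case_stack(c, frame_size, callees, _active + (func,))
         for c in callees.get(func, ())),
        default=0,  # leaf function: nothing below it
    )
    return frame_size[func] + deepest_child

# Hypothetical frame sizes (bytes) and call edges, for illustration only.
FRAME_SIZE = {"main": 16, "parse": 64, "emit": 32, "log": 200}
CALLEES = {"main": ["parse", "emit"], "parse": ["log"], "emit": []}

if __name__ == "__main__":
    for name in FRAME_SIZE:
        print(name, ": wcs =", worst_case_stack(name, FRAME_SIZE, CALLEES))
```

Read this way, the difference between a parent's wcs and that of its deepest child is the parent's own frame. In the listing above, for instance, __args_main (1240) and _main (1232) differ by 8 bytes, assuming _main is the deepest callee of __args_main.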

The cg_xml package is available for Windows and Linux x86 systems. It comes wrapped in a GUI installer, and requires clicking a license agreement.

The executable call_graph is implemented in Perl, and full source code is located in the directory install_root/ofd, where install_root is the location where cg_xml is installed. The binary executables are located in install_root/bin. Executing call_graph does not require the system to have Perl installed. Note that running call_graph does add some files to the system temporary directory.

The input for call_graph is an XML file representation of the system executable. Create that XML file with the Object File Display (OFD) utility. OFD is a component of the TI Code Generation Tools. Examples include ofd6x from the C6000 toolset, and ofd470 from the ARM toolset.

In some cases the XML file size can be very large, 200 MB or more, which causes call_graph to run very slowly. The solution is to use OFD options which reduce the XML size. The documentation for call_graph has all the details. For the example above, the best options to use are:

-xg --xml_indent=0 --obj_display=none,header,optheader,symbols --dwarf_display=none,dinfo

How does call_graph work? It processes the debug information: the same information that the compiler creates and that the debugger uses when you set breakpoints, look at variables, etc. The debug information is in Dwarf format.

call_graph can analyze a library and show the maximum amount of stack it may use.

C:\dir>call_graph --stack_max rts6400.xml
_strftime : wcs = 1248

Hence codec producers can use the --stack_max option to determine what to plug into the CE-required getStackSize() method (set in the codec package's <MODULE>.xs file). Note that this is what the RTSC Codec Packaging Wizard does behind the scenes.

Note that the QualiTI XDAIS Compliance Tool runs this exact sequence to report the stack size.

Caveats of static stack usage

Indirect calls

Most function calls are direct calls to a function with a specific name. Sometimes, the address of the function being called is passed in as a parameter, or looked up in a table. Such calls are termed indirect calls.

Indirect calls are difficult to handle when generating a static call graph. In practice, any given indirect call actually calls only a small number of functions. However, static analysis generally cannot find those functions.

When call_graph finds a function that makes an indirect call, it notes that fact in the output. You can supply configuration files which list out, for that parent function which makes the indirect call, all the child functions it possibly calls.

For example, the compiler runtime support library code makes indirect function calls. The configuration files ti_rts_indirect.txt, c60_rts_indirect.txt, arm_rts_indirect.txt, and c55_rts_indirect.txt supply information for these indirect function calls. The file ti_rts_indirect.txt supplies information that is common across those different RTS libs. For full correctness, it must be used in combination with a target specific file. For example ...

   call_graph -i=ti_rts_indirect.txt -i=c60_rts_indirect.txt c60_app.xml

These files also serve as examples of how you can write your own such configuration files.

Note dynamic call graphs do not have this problem. Because dynamic call graphs get their information based on actual execution, it is trivial to follow an indirect call to the function actually called.
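To see why missing indirect-call information matters, consider the following minimal sketch. It models the call graph as plain dictionaries and compares the worst case stack with and without a user-supplied list of indirect call targets. All names and frame sizes are hypothetical, and the dictionaries are not the format of the call_graph configuration files.

```python
def wcs(func, frames, calls):
    """Own frame plus the deepest callee chain (assumes no recursion)."""
    below = [wcs(child, frames, calls) for child in calls.get(func, ())]
    return frames[func] + max(below, default=0)

def with_indirect(direct, indirect):
    """Return a new call map with the indirect targets added as callees."""
    merged = {parent: list(children) for parent, children in direct.items()}
    for parent, children in indirect.items():
        merged.setdefault(parent, []).extend(children)
    return merged

# Hypothetical data: dispatch() calls through a function pointer, so static
# analysis alone sees no callees for it.
FRAMES = {"dispatch": 24, "fast_path": 40, "slow_path": 512}
DIRECT_CALLS = {"dispatch": []}
INDIRECT_TARGETS = {"dispatch": ["fast_path", "slow_path"]}

if __name__ == "__main__":
    print("direct only  :", wcs("dispatch", FRAMES, DIRECT_CALLS))
    print("with indirect:",
          wcs("dispatch", FRAMES, with_indirect(DIRECT_CALLS, INDIRECT_TARGETS)))
```

In this made-up example the figure grows from 24 to 536 bytes once the indirect targets are known, which is why a stack size derived from an incomplete call graph should be treated as a lower bound.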

Far calls

Unfortunately, code built with CGT 6.0.x does not distinguish between near, far, and indirect calls in the generated Dwarf information. Hence if you build your library with -ml3, call_graph will treat each far call as an indirect call. The resulting stack size may then be too small because the true call tree is not visible. call_graph attempts to detect when the stack size is too small and emits a warning:

The function which uses the most stack also makes no direct function calls.
  The max stack size is almost certainly too low.  Please view the online
  documentation with the command 'perldoc call_graph.pl' for possible
  explanations.
  _DMJPGE_dspRun : wcs = 272

The best solution to this problem is to not use -ml3 or far calls in your c6x library. -ml3 has been deprecated since 2005. Instead, rely on linker trampolines to patch up the reach between functions that are further apart than the 21-bit near addressing mode allows. 99 times out of 100 this should yield better performance than making everything far, because there are typically relatively few trampolines in comparison to total function calls in large algorithms.

FYI this issue is fixed in CGT 6.1.x. The script will automatically do the right thing, so you won't need an updated script.

Libraries built with -g

The script determines the amount of stack used by one function from the Dwarf attribute DW_AT_frame_base. This attribute is output by the compiler in default builds, but not when building for debug with -g. In that case the call graph will be accurate, but the stack usage will not.

Most codec vendors release production libraries without -g since it drastically inhibits optimization - so typically this limitation is not a problem.

Stack analysis conclusions

  • As indicated initially, be generous with stack size. We've never seen an algorithm use more than 32K, so this may be a reasonable cap.
  • At the same time, one downside of too large a stack may be inflexibility in DSP-only systems. For example, the same codec package is often reused for DM6446, DM648 and DM6437; in the latter two cases you may want a smaller stack size so that it can be placed in internal memory.
  • Employ both dynamic & static stack analysis. Be aware of the caveats in each method.

See Also

  • The BIOS FAQ has additional stack debugging details
  • The BIOS Debugging Tips article describes techniques for identifying stack overflows