Changing the DVEVM memory map

From Texas Instruments Wiki

Mastering the Art of Memory Map Configuration for DaVinci-based Systems


Introduction

This document describes how to configure Codec Engine based audio/video applications on DM6446 (DaVinci) for use in a system that has less than the 256 MB of DDR2 memory that the evaluation board provides. Specifically, we present the steps for shrinking memory requirements down to 64MB, but the principles apply to any amount of DDR2.

Developers who build audio/video applications on TI's DaVinci platform have a rich software stack available for use with the DVEVM evaluation board. At the core of the DVSDK (Digital Video Software Development Kit), as this stack is called, is Codec Engine (CE), an application programming layer that allows Arm-side applications to execute video and other algorithms on the DSP for faster processing. The DVSDK stack also includes the MontaVista Linux OS, Arm-side Codec-Engine-using demo applications that encode and decode video and sound, and DSP-side executables running the actual video encoding and decoding algorithms. While the supplied software gives a strong starting point for creating custom applications, one of the challenges DaVinci developers face is how to adapt all these components so they fit in a production system.

The DVEVM board is equipped with 256 megabytes of DDR2 memory. The DVSDK software stack is by default configured to use all of that memory: since video algorithms are memory-hungry, this configuration supports most complex video processing scenarios and does not require the user to deal with the nontrivial issue of changing the memory map. By default, the total DVEVM memory is partitioned into 120MB for Linux, 8MB for video and other input/output buffers exchanged between the ARM and the DSP, and 128MB for DSP algorithms. Of the latter, 6MB is set aside for code and data, and a full 122MB for the intermediate processing buffers of video algorithms, organized in a heap. This setting allows several instances of video, image, and other encoders and decoders for different video formats to run at the same time.

In the first part of the document, after giving some background on the various elements we are working with, we discuss how to determine the minimum memory required and how to partition that memory. The rest consists of hands-on instructions on how to configure and rebuild the various components to fit the chosen memory map. We demonstrate the principles and execute the actual steps on the Codec Engine video_copy example, whose sources every DVSDK user can access; at the end we showcase the techniques on a real-world production application.

Why do Codec Engine applications out of the box consume so much memory?

The DaVinci EVM board comes with 256MB of external memory installed (the maximum amount currently addressable by the part). All the out-of-the-box software (DSP codecs and ARM-side apps) is spread out over all of that space for the developer's convenience. This way you don't have to worry about running out of space when allocating buffers or creating memory-hungry instances of video-processing algorithms.

However, since production platforms based on the DM6446 processor will likely be made with less than 256MB of external memory available, the developer must be able to shrink the memory used by his applications to whatever his target platform provides.

There is a separate article with more details on the memory map employed by the Codec Engine Examples.

Physical sharing of DDR2 memory between the ARM and the DSP

This 256MB of physical DDR2 memory is shared between the ARM and the DSP, i.e. both processors can access all of the DDR2. The ARM, however, views this memory through virtual addresses translated by an MMU (Memory Management Unit), while the DSP uses physical addresses directly. Linux uses virtual addressing to provide memory protection between processes, making sure a process only accesses memory it is permitted to access. If a Linux user process accesses an address it is not permitted to access, a segmentation fault (segfault) occurs and the OS kills the process.

Since the DSP has no MMU, it cannot be restricted to certain memory addresses. This means that a 'rogue pointer' in DSP-side code can write not just all over the DSP's DDR memory but also over the ARM (Linux) side code and data. Such bugs can be very difficult to find.

The physical memory addresses are the same for the ARM and the DSP on DM6446: the 256MB of DDR2 spans 0x80000000 up to (but not including) 0x90000000.

Linux partition

Linux differs from an RTOS like DSP/BIOS in that it manages all resources in the system on behalf of the application. The application requests access to a resource, and Linux grants it depending on UNIX permissions and availability. This means that all the memory you give to Linux is "owned" by Linux and is out of your direct control. The Linux DDR memory partition is divided into pages (4KB in size on ARM Linux), and a page is the smallest unit in which Linux hands out physical memory. So when you call malloc() to reserve memory for your application, Linux backs that memory with a sequence of 4KB pages. Not only do you have no control over where in physical memory those pages come from, you don't even know whether they are physically contiguous (the MMU makes the memory look virtually contiguous to the process).

This is normally a great feature, but it becomes a problem when you want to share a buffer between the ARM and the DSP, because the DSP needs physically contiguous memory to work with. This is why the CMEM kernel module was created: to provide physically contiguous buffers to be shared between the ARM and the DSP. CMEM is also useful for buffers that are accessed by DMA or by the DM6446 hardware resizer.

The Linux partition is also used for various internal I/O buffers and application caching features, so the bigger this partition is, the better.

CMEM: Contiguous Memory Allocator

CMEM was created to make it possible to share buffers between ARM processes (application control) and the DSP (algorithm acceleration). It takes a physical memory region you specify when the driver is loaded and carves it up into pools of contiguous buffers according to your specifications. The buffers are typically not cached on the ARM side (Codec Engine handles the DSP-side caching of these buffers).

A separate article gives a CMEM overview, and further details follow below. Note that on Linux, once you have inserted the cmemk.ko kernel module, you can always execute "cat /proc/cmem" to get the status of the buffers and pools managed by CMEM.

The DDRALGHEAP and DDR sections

The DDRALGHEAP section contains the heap from which the active codecs allocate all their dynamic memory. This section can be quite large, especially if video codecs are used.

Users of Codec Engine 1.20 and newer can pass in a physically contiguous (e.g. CMEM allocated) memory block to Codec Engine to be used as "DDRALGHEAP". See the Server_redefineHeap() API call. There are also API calls for querying heap usage during run time, see Server_getNumMemSegs() and Server_getMemStat().

The DDR section contains DSP side code and static data for all the codecs plus the system (i.e. DSP/BIOS and Codec Engine). This section is called "DDR2" for CE 1.20 and later.

The DSPLINKMEM section

This section is used by the DSP Link IPC (Inter Processor Communication) software from TI. Codec Engine uses this software module for communicating between the ARM and the DSP as well as loading the DSP with code and controlling it.

The RESET_VECTOR segment

This segment contains the DSP reset vector, i.e. the vector table that the DSP's ISTP register points to when DSP Link releases the DSP from reset. The reset vector code moves the vector table elsewhere by changing the ISTP, but this is where the table is located at boot. This segment must start on a 1MB boundary and is 128 bytes in size.

Designing the Memory Map

This section describes how to design your memory map to optimize memory usage for your system. When this phase is complete, we will have a piece of paper laying out the most compact memory map we can have, with the name, origin, and size of each segment needed by all the players in the system. We then use that piece of paper as the input for the next phase (described in the next section), where we edit various text files to apply that information to our system's build.

Our goal here is to make the memory needed by Codec Engine and its Arm-side support software, such as CMEM and DSP Link, as small as possible. We want this because any portion of the memory set aside for Codec Engine that is not used remains unused forever, whereas any amount of memory given to Linux is always put to good use: Linux uses all the memory it can get for caching disk and network data, so increasing the memory for Linux improves the overall performance of the system.

To put it in an equation,

total memory = DSP Server memory + CMEM memory + Linux memory

from which it follows that

Linux memory = total memory (e.g. 64MB) - DSP Server memory - CMEM memory

Since total memory is fixed and is determined by the hardware design of our board, we strive to give to DSP and CMEM only as much as necessary so that Linux gets as much as possible.
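As a worked check, plugging in the default DVEVM partitioning from the introduction (256MB in total, 128MB for the DSP server, 8MB for ARM/DSP exchange buffers) reproduces the 120MB that Linux gets out of the box:

```latex
\text{Linux memory} = 256\,\text{MB} - 128\,\text{MB} - 8\,\text{MB} = 120\,\text{MB}
```

On a 64MB board, the same 136MB for the DSP server and CMEM would not even fit, which is why each segment has to be measured and trimmed.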

The procedure

In essence, the procedure for determining the memory map is this: we make our system work with a luxurious memory map (e.g. the original 256MB), then for each segment we calculate or measure its actual requirements, reduce the segment size, and rerun the application. Once we have minimized the size of each segment, we compact the map and fit everything into the block of memory we want our final system to have (e.g. 128MB, or 64MB, or 32MB, etc.)

  1. Start with the original 256MB memory map
  2. Include all the algorithms you need in your DSP server (via the .cfg file), and rebuild the Server
  3. Determine the size of DDRALGHEAP
    1. calculate, or
    2. run our ARM-side CE application and measure worst case
  4. Determine the size of the DDR segment
  5. Determine the size of CMEM
    1. calculate, or
    2. run our ARM-side CE application and measure worst case
  6. Move RESET_VECTOR into the same MB as DSPLINKMEM or DDR
  7. Order the segments correctly and place them at proper start addresses
  8. Compute Linux memory size = our device total memory - DDR - DDRALGHEAP - RESET_VECTOR - DSPLINKMEM - CMEM

Below is an example of a before and after scenario when this procedure is applied:

[Figure 1: memory map before and after applying the procedure]

In the "after" picture we have the actual minimum sizes for each segment; knowing that, we can fit everything in a device with less system memory, 128MB in the above example.

We now look at how to calculate or measure these sizes, but first we set some expectations.

It is (mostly) all about video

Recall that the DSP server has four segments:

  • Two smaller system segments called DSPLINKMEM and RESET_VECTOR, with total size of about 1MB (that can be shrunk to 512KB if really necessary, as we mention in a section below)
  • One medium-size segment, DDR (called DDR2 in CE 1.20 and later), typically sized at 1-3MB, that contains the code and static data for all the codecs plus the system
  • One large segment called DDRALGHEAP, sized at anywhere from 2MB to 200MB, that holds all the dynamic memory allocated by each active codec instance running on the DSP.

The sizes of the two system segments, DSPLINKMEM and RESET_VECTOR, are independent of the run-time characteristics of the system. The size of the DDR segment depends on which codecs we include in the system, and it is fixed once we know the functionality we wish to support -- we include only the codecs we need and none of the ones we don't, which requires a fixed amount of code and static data space. But the size of the DDRALGHEAP segment depends heavily on which instances of those codecs the ARM side creates, and when.

When the system first starts, there are no codec instances and the total heap allocated in DDRALGHEAP is 0. When the ARM side creates a DSP codec instance, for example via VIDENC_create(), the instance allocates dynamic memory according to its data sheet; the amount usually depends on the codec creation parameters. Video processing at full resolution may require several MBytes of DDRALGHEAP for a single instance.

When an instance is deleted (e.g. VIDENC_delete()), all of its dynamic memory is reclaimed.

Video codecs (encoders and decoders) by far need the most dynamic memory, often several MBs, followed by imaging codecs, then audio codecs, followed by speech codecs, which typically need very little. Therefore, the way the video codecs are used by the ARM side determines how big the DDRALGHEAP segment must be.

The CMEM segment's size is also very dependent on which codecs run in the system and when. The purpose of the CMEM segment is to exchange input and output codec data between the ARM and the DSP codec instances. Video buffers allocated via CMEM are much larger than the speech buffers.

The bottom line is that our total required memory depends on how we use the codecs, and if we don't use all available codecs at the same time -- which some classes of applications do and some don't -- the total memory required will be less than the simple sum of the parts.

Here is an example to illustrate this: assume our system is a digital video camera with a "record" button and a "playback" button. We could design our system so that the ARM side creates a video encoder instance and a video decoder instance on the DSP at boot time, and have them run side by side. Both codec instances hold their dynamic data in DDRALGHEAP. When the user presses the "record" button, the ARM side passes raw image data via CMEM to the video encoder; when the user presses the "playback" button, the ARM side passes compressed frames to the video decoder and receives uncompressed images back. The combined DDRALGHEAP and CMEM memory required for these two may be, say, 3MB for the decoder + 2MB for the encoder = 5MB.

Alternatively, we can design our system to wait until the user presses the "record" button, then create a video encoder instance on the DSP, and delete the instance when the user presses the stop button. Similarly, we create a decoder instance when the user presses the "playback" button, and delete it when the user exits playback mode.

Because the user cannot record and play back at the same time, the encoder instance and the decoder instance never exist at the same time. Therefore our total memory needs become MAX(3MB, 2MB) = 3MB instead of 5MB. And since creating a codec instance is a very fast operation, this does not noticeably affect the system's speed or power consumption.

Determining the size of "DDRALGHEAP"

Total "DDRALGHEAP" size depends on which codecs (of which type, from which vendor), how many instances of those codecs exist at the same time, and possibly with which parameters the codecs were created, e.g. D1 vs. CIF max video resolution changes memory requirements.

In theory we can calculate the amount of memory needed by looking at the codec data sheets; those should list how much memory a codec instance requires based on the mode of operation.

In practice, it is better to measure the actual usage at peak time and use the specs only to confirm that the expected numbers roughly match the measured results. This is because data sheets show worst-case size requirements, and depending on your codec configuration, your actual requirements may be lower.

Measuring DDRALGHEAP size via Engine_getUsedMem() API [All CE versions]

The simplest way to measure memory usage is this:

  1. In the Arm app, call Engine_getUsedMem() immediately after the first call to Engine_open().
  2. Call Engine_getUsedMem(engineHandle) again after creating the codecs with the heaviest memory requirements.
  3. The delta between the two numbers is roughly the required size of "DDRALGHEAP" (it is in fact slightly larger as it includes the growth of "DDR", but the latter grows only by a few KB per instance; e.g. if the delta is 6.4 MB, the real "DDRALGHEAP" may be 6.395 MB)

Measuring DDRALGHEAP size via Server_getMemStat() API [CE 1.20 and later]

The Server_getMemStat() API in CE 1.20 and later allows us to query each segment specifically, so we can do that for "DDRALGHEAP".

Assuming a CE Engine handle is in hEngine, run the code below at peak load time (worst-case codecs created and active) to find out how big "DDRALGHEAP" needs to be:

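#include <stdio.h>
#include <string.h>
#include <xdc/std.h>
#include <ti/sdo/ce/Engine.h>
#include <ti/sdo/ce/Server.h>

/* hEngine was previously acquired via Engine_open() */
Void printAlgHeapUsage(Engine_Handle hEngine)
{
    Server_Handle  hServer;
    Int            i;
    Int            numSegs;
    Server_MemStat memStat;

    hServer = Engine_getServer(hEngine);
    if (hServer == NULL) {
        return;   /* this engine has no DSP server */
    }
    if (Server_getNumMemSegs(hServer, &numSegs) != Server_EOK) {
        return;
    }

    for (i = 0; i < numSegs; i++) {
        if (Server_getMemStat(hServer, i, &memStat) != Server_EOK) {
            continue;
        }
        if (strcmp(memStat.name, "DDRALGHEAP") == 0) {
            /* used/size are unsigned 32-bit values: print them as unsigned */
            printf("DDRALGHEAP usage is %lu out of %lu available\n",
                   (unsigned long)memStat.used, (unsigned long)memStat.size);
        }
    }
}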

Measuring DDRALGHEAP size

Using the ALGUTIL module [All CE versions]

A collection of tools called 'servertools', available at https://www-a.ti.com/downloads/sds_support/applications_packages/servertools/index.htm, includes, among others, a utility that instruments the DSP server and reports the exact breakdown of each algorithm's memory needs by memory type.

  • Download the tools and locate algUtil, a DSP-side utility module that prints out the heap memory allocated for each algorithm as the algorithm is instantiated.
  • Insert the module into the codec server using xdc.useModule('ti.sdo.apps.algutil.ALGUTIL') in myServer.cfg.
  • Call ALGUTIL_init() in the DSP codec server main() function.
  • Enable tracing in your CE application and enable algUtil tracing when invoking the application:

TRACEUTIL_DSP0TRACEMASK="ti.sdo.apps.algutil.ALGUTIL=4"; ./myapp

  • For more details on CE trace, see the CE user documentation.
  • Invoke the app with one algorithm instance created at a time until data has been collected for all algorithms.
  • The output has two lines of special interest, as in the following example:
@0x000898fd:[T:0x8fc45144] ti.sdo.apps.algutil.ALGUTIL - EXTERNAL scratch total: best case:(0x0), worst case:(0x0)
@0x000899a6:[T:0x8fc45144] ti.sdo.apps.algutil.ALGUTIL - External persist total: best case:(0x29d5c), worst case:(0x29d94)
  • Both External Scratch and External persistent memory are typically allocated from DDRALGHEAP.
  • Add up the heap usage of your "worst case codec combination" to determine your total heap requirement.

Using the DSKT2 trace [CE 2.0 or above]

Newer versions of CE and Framework Components have trace built into the DSKT2 module, which helps you measure heap allocation when you run the application. DSKT2 is the module responsible for querying the algorithm/codec for its memory requirements and allocating that memory from DDRALGHEAP.

Simply set DSKT2.trace to true in your server's .cfg file and rebuild the server. Then set the environment variables TRACEUTIL_DSP0TRACEMASK="ti.sdo.fc.dskt2=01234567", TRACEUTIL_DSP0TRACEFILE="" and TRACEUTIL_REFRESHPERIOD=200 before running your application, and the DSKT2 trace will appear on stdout (you may also route the output to a file for closer inspection by providing a filename in TRACEUTIL_DSP0TRACEFILE). The resulting trace contains information similar to what you get from ALGUTIL.

Alternatively, setting CE_DEBUG=2 when running your application would be another convenient way to get the trace on stdout. However, this approach results in a more verbose trace, and you'll need to scan the output to find the DSKT2-related details.

The trace will contain the following information when an algorithm/codec instance is being created:

@0x000fd446:[T:0x88b665f4] ti.sdo.fc.dskt2 - DSKT2_createAlg3> Num memory recs requested 9
@0x000fd483:[T:0x88b665f4] ti.sdo.fc.dskt2 - DSKT2_createAlg3> Requested memTab[0]: size=0x380, align=0x80, space=IALG_EXTERNAL, attrs=IALG_PERSIST
@0x000fd4df:[T:0x88b665f4] ti.sdo.fc.dskt2 - DSKT2_createAlg3> Requested memTab[1]: size=0xfd00, align=0x80, space=IALG_DARAM0, attrs=IALG_SCRATCH
@0x000fd539:[T:0x88b665f4] ti.sdo.fc.dskt2 - DSKT2_createAlg3> Requested memTab[2]: size=0x138600, align=0x80, space=IALG_EXTERNAL, attrs=IALG_PERSIST
@0x000fd595:[T:0x88b665f4] ti.sdo.fc.dskt2 - DSKT2_createAlg3> Requested memTab[3]: size=0x50000, align=0x80, space=IALG_EXTERNAL, attrs=IALG_SCRATCH
@0x000fd5f1:[T:0x88b665f4] ti.sdo.fc.dskt2 - DSKT2_createAlg3> Requested memTab[4]: size=0x7f00, align=0x80, space=IALG_EXTERNAL, attrs=IALG_PERSIST
@0x000fd64d:[T:0x88b665f4] ti.sdo.fc.dskt2 - DSKT2_createAlg3> Requested memTab[5]: size=0x1000, align=0x80, space=IALG_EXTERNAL, attrs=IALG_PERSIST
@0x000fd6a8:[T:0x88b665f4] ti.sdo.fc.dskt2 - DSKT2_createAlg3> Requested memTab[6]: size=0x5e00, align=0x80, space=IALG_EXTERNAL, attrs=IALG_PERSIST
@0x000fd703:[T:0x88b665f4] ti.sdo.fc.dskt2 - DSKT2_createAlg3> Requested memTab[7]: size=0x1580, align=0x80, space=IALG_EXTERNAL, attrs=IALG_PERSIST
@0x000fd75e:[T:0x88b665f4] ti.sdo.fc.dskt2 - DSKT2_createAlg3> Requested memTab[8]: size=0x580, align=0x80, space=IALG_EXTERNAL, attrs=IALG_PERSIST

Add up the sizes and alignments of all requests with space=IALG_EXTERNAL, plus any other space that DSKT2 is configured to allocate from DDRALGHEAP (check the server's .cfg file to see how DSKT2 has been configured). The total gives you the worst-case DDRALGHEAP size requirement for this algorithm/codec instance.

Invoke your application by creating one codec instance at a time until data has been collected for all algorithms.

Add up the heap usage of your "worst case codec combination" to determine your total heap requirement.

Determining the size of the "DDR" section

The "DDR" segment holds codec and system code as well as static data. (It is called "DDR2" in CE 1.20 and later but it is the same segment.) We find out its required size simply by looking at the linker map.

The procedure is as follows:

  • Add all the codecs you intend to use to your DSP server's .cfg file, and no others.
  • Build your DSP server.
  • Look at the generated .map file for the codec server.
  • The .map file is under directory package/cfg.
  • See how much is used for DDR; in this example, it is 0x90168 bytes:
         name            origin    length      used     unused   attr
----------------------  --------  ---------  --------  --------  ----
  ARM_RAM               10008000   00004000  00000000  00004000  RWIX
  CACHE_L2              11800000   00010000  00000000  00010000  RWIX
  CACHE_L1P             11e08000   00008000  00000000  00008000  RWIX
  L1DSRAM               11f04000   00010000  00010000  00000000  RWIX
  CACHE_L1D             11f14000   00004000  00000000  00004000  RWIX
  DDRALGHEAP            88000000   07a00000  07a00000  00000000  RWIX
  DDR                   8fa00000   00400000  00090168  0036fe98  RWIX
  DSPLINKMEM            8fe00000   00100000  00000000  00100000  RWIX
  • Add a little to the "used" value to allow your code and data to grow some during development, and that is the minimal DDR size.

Sizing and partitioning CMEM memory

The module we call CMEM, as we have seen, enables us to allocate large chunks of physically contiguous memory from Linux-Arm and place data buffers in them for DSP to process. There are two aspects to configuring CMEM:

  1. Knowing the total amount of memory we need for buffer exchange between Arm and DSP: this is CMEM size
  2. Knowing the exact size and count of each type of exchange buffers our application needs: this is CMEM partitioning into pools of buffers

Again, we can calculate or measure required CMEM sizes, and it is always best to do both -- using the calculations to verify the measurements.

Calculating CMEM size and partitions

As an example, assume we run one video encoder at D1 resolution and two audio encoders. To exchange buffers with these codecs, we need for the video encoder one input D1 sized buffer for the raw image and one output buffer for the encoded frame; we also need one input buffer for raw audio data and one output buffer for compressed audio for each audio codec.

We calculate the size of the raw D1 image buffer from the image format; let us assume it is 812K. The size of the video encoder output buffer depends on the compression format, but it is typically recommended to make it the same size as the input buffer, i.e. 812K in our case. For audio, assume we similarly use one 4K input and one 4K output buffer for each codec.

Our total CMEM need is then: 812K x 2 + 4K x 2 + 4K x 2 = 1640K.

Our CMEM pool needs are: one pool with 2 buffers of 812K, and one pool with 4 buffers of 4K:

[.....812K.....][.....812K.....][4K][4K][4K][4K]

CMEM pool partitioning is important: if done improperly, it prevents us from getting the buffers we need, even if there is enough total space. (This is the disadvantage of pools; the advantage is that it prevents fragmentation where it could happen, i.e. if the app were allocating and releasing many buffers of different sizes constantly.)

In our example, we have one video encoder and two audio encoder instances, all running at the same time. Let us take a look at how the exchange occurs:

  • Before application starts, the system integrator loads the CMEM module. Assuming our CMEM area starts at 0x88000000, and is sized as above, the command to load the module is:

insmod cmemk.ko phys_start=0x88000000 phys_end=0x88200000 pools=2x831488,4x4096

(we have set aside full 2MB for CMEM, and split it into 2 x 812K and 4 x 4K buffer pools)

  • Our Arm application allocates its two 812K buffers and four 4K CMEM buffers via Memory_contigAlloc();
  • The Arm application stores a raw image in one 812K buffer and raw audio blocks in two 4K buffers, passes the two 812K video buffers in a call to VIDENC_process() and the 4K audio buffers in calls to AUDENC_process(), and reads the compressed video and audio frames from their output buffers.
  • When the application closes, it frees up all of its CMEM buffers via Memory_contigFree().

Imagine now that in addition to all of the above, our application also uses a video decoder, but never at the same time as the video encoder (recall our video camera example that supports record and playback modes but only one at a time). Assuming our video decoder also processes D1 sized images -- getting compressed frames and producing raw images -- we'd need two 812K buffers for the video decoder as well. But since we never have a situation that both the encoder and the decoder process their input data at the same time, we can use the same 812K input and output CMEM buffers. Therefore our total CMEM needs remain the same, and even the partitioning looks the same.

Note that it is not even necessary that the Arm app destroys one video codec on the DSP before it switches to another: the two instances can be active on the DSP at the same time, but if we never call VIDENC_process( input 812K buf, output 812K buf ) while we are waiting for VIDDEC_process(input 812K buf, output 812K buf), we are safe. We create and destroy codec instances on the DSP as needed only in order to save on memory needs on the DSP, for DDRALGHEAP.

Now for a counterexample: our application may need both record and playback running at the same time. In that case, our total CMEM needs are 2 x 812K (video encode) + 2 x 812K (video decode) + 2 x 4K (audio encode) + 2 x 4K (audio decode), which is 3264K. Our command line to load the CMEM module would then be:

insmod cmemk.ko phys_start=0x88000000 phys_end=0x88340000 pools=4x831488,4x4096

(we set aside 3.25MB, slightly more than the 3264K, or about 3.2MB, that we need).

Measuring CMEM size and partitions

In rare cases, your application's use of input and output buffers is so complex that it is easier to measure its CMEM needs by running the application itself.

  • Start with the CMEM module loaded and partitioned to allow plenty of memory and plenty of buffers: one or two really big ones, a few large ones, a number of medium ones, and many small ones. For example:

insmod cmemk.ko phys_start=0x88000000 phys_end=0x8A000000 pools=2x4100000,10x1100000,50x130000,100x17000

i.e. two buffers of roughly 4MB, ten of roughly 1MB, fifty of roughly 128K, and a hundred of roughly 16K, for a total of 27,400,000 bytes, placed in a 32MB region.

  • At your application peak memory usage time, or several times through its life cycle, put the following in your C code:

system( "/bin/cat /proc/cmem" );

and record the standard output

Accessing /proc/cmem like above will cause CMEM to produce detailed info regarding exactly how many buffers it uses and of what sizes. This will give you the precise statistics for accurate sizing and partitioning.

Of course, it is always good to add headroom to the exact numbers, whether measured or calculated: application code may change and a new or larger buffer may be needed at certain moments, but someone may forget to update the "insmod cmemk.ko" line. Use your engineering judgment to decide how much headroom to leave in the number and size of the CMEM buffers your application needs.

(Note also that alignment and Linux page-boundary requirements may make the total area larger than the sum of the parts.)

Optional: Using CMEM to dynamically size DDRALGHEAP

In CE 1.20 and above, there are new APIs Server_redefineHeap() and Server_restoreHeap() that let you change the DSP-side heaps at run time:

  • The memory passed to Server_redefineHeap() needs to be contiguous (CMEM allocated).
  • The heap needs to be created in the DSP server’s BIOS configuration file, but can be 0 bytes initially.

This feature lets the user ‘reuse’ the memory used for "DDRALGHEAP" when the system is doing less stressful DSP tasks, e.g. ARM-side PDF file reading.

To use this feature, you will need to allocate an extra buffer the size of "DDRALGHEAP" in the CMEM segment, on top of the CMEM requirements you determined in the sections on calculating and measuring CMEM size above.

Optional: reducing the DSPLINKMEM segment size from 1MB to 512KB

The "DSPLINKMEM" segment on the DSP is the system segment needed by DSPLink. By default, approximately 512 KB of the 1 MB segment is used for shared buffers and control structures; the defaults in the CE examples are larger to anticipate extra memory that CE may require in future releases. For simplicity, it is fine to leave the DSPLINK system defaults unchanged, but if you need to save an extra 512K, you can do so by reducing the size of this segment without worrying about the details of DSPLINK itself.

Arranging the segments in correct order and alignment

If we have followed the outlined procedure, we now have, on paper, the minimal measured and/or calculated size for each segment: CMEM, "DDRALGHEAP", "DDR", and possibly "DSPLINKMEM" if we decided to cut it in half. Segment "RESETCTRL" (the reset-vector segment, RESET_VECTOR) has a fixed size of 128 bytes.

At this point we now need to decide what the start address of each segment will be.

  1. Know (or decide) how much total system memory we have: 64MB or 128MB etc.
  2. Place "RESETCTRL" segment at the highest-addressed 1MB in the map – i.e. this segment must be 1 MB aligned and we choose it to be the very last MB in the memory map.
  3. Place "DSPLINKMEM" segment immediately after "RESETCTRL", so it gets 1MB - 128B (which is still fine). Now the last MB is occupied by "RESETCTRL" + "DSPLINKMEM".
  4. Place "DDR" before (i.e. just below) "RESETCTRL". Make the DDR size at least a multiple of 4KB, or rounder if you can (i.e. don't use 2,432,131 bytes for the "DDR" size; use 2.5MB, i.e. a size of 0x280000 bytes rather than 0x251C83 bytes). If possible, leave more memory for DDR than required, in case future code changes increase the code size. This saves you from having to shift everything in your memory map just to accommodate small code size changes.
  5. Place "DDRALGHEAP" before "DDR". Again, use round hex numbers and leave a safety margin if you can.
  6. Place "CMEM" before "DDRALGHEAP". Same recommendation -- use round hex numbers for origin and size if you can, both to avoid alignment surprises and to make the map easier for humans to understand and maintain.
  7. Linux gets the rest.
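The placement rules in steps 1 through 7 can also be worked out with a small script instead of on paper. The sketch below is written in JavaScript, the language of the .tcf and .cfg files used later in this procedure, and runs under Node.js on the host.

`planMemoryMap` and its parameters are hypothetical and not part of the DVSDK. The sketch assumes that DDR2 starts at physical address 0x80000000 and rounds each size up to a multiple of 4 KB. The example input reproduces the 64 MB video_copy map used later in this article.

```javascript
// Hypothetical planning helper (not part of the DVSDK): lays out the segments
// top-down following steps 1-7 above. All sizes are in bytes.
const KB = 1024;
const MB = 1024 * KB;
const DDR2_BASE = 0x80000000; // physical start of DDR2 on DM6446

// Round n up to a multiple of align. Plain arithmetic is used because
// JavaScript bitwise operators would turn addresses above 0x7FFFFFFF negative.
function roundUp(n, align) {
  return Math.ceil(n / align) * align;
}

function hex(n) {
  return "0x" + n.toString(16).toUpperCase().padStart(8, "0");
}

function planMemoryMap({ totalBytes, ddrBytes, algHeapBytes, cmemBytes }) {
  if (totalBytes % MB !== 0) throw new Error("total memory must be a whole number of MB");
  const lastMb = DDR2_BASE + totalBytes - MB; // step 2: last MB, 1 MB aligned
  const segments = [
    { name: "RESETCTRL", base: lastMb, size: 0x80 },
    { name: "DSPLINKMEM", base: lastMb + 0x80, size: MB - 0x80 }, // step 3
  ];
  // Steps 4-6: each segment goes directly below the one placed before it,
  // with its size rounded up to a 4 KB multiple.
  let cursor = lastMb;
  const below = [["DDR", ddrBytes], ["DDRALGHEAP", algHeapBytes], ["CMEM", cmemBytes]];
  for (const [name, size] of below) {
    const rounded = roundUp(size, 4 * KB);
    cursor -= rounded;
    segments.push({ name, base: cursor, size: rounded });
  }
  // Step 7: Linux gets everything from the start of DDR2 up to CMEM.
  segments.push({ name: "Linux", base: DDR2_BASE, size: cursor - DDR2_BASE });
  return segments.sort((a, b) => a.base - b.base);
}

const plan = planMemoryMap({ totalBytes: 64 * MB, ddrBytes: 3 * MB, algHeapBytes: 4 * MB, cmemBytes: 4 * MB });
for (const s of plan) {
  console.log(`${s.name.padEnd(10)} base ${hex(s.base)}  size ${hex(s.size)}`);
}
```

To get the safety margin recommended above, pass a coarser granularity to `roundUp` (64 KB or 1 MB instead of 4 KB).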

It is advisable to make "RESETCTRL" + "DSPLINKMEM" occupy the last 1MB of memory, especially if you use DSPLink 1.30. Then you will only have to rebuild DSP Link once (which is a tedious procedure and you don't want to repeat it more than absolutely necessary).

Also, it is convenient to have CMEM and "DDRALGHEAP" adjacent to each other, so you can resize one at the expense of the other without touching other segments. Both CMEM and DDRALGHEAP are normally unused when there are no active codecs, even though the DSP may be up and running and ready to create a codec when instructed.

If you are using the new feature in CE 1.20 and above to dynamically pass a CMEM buffer to the DSP to serve as its "DDRALGHEAP", simply combine "DDRALGHEAP" into the CMEM segment.

Drawn on a piece of paper, the result of this phase might look something like this:

[Figure 2: example memory map]

Memory-map Adaptation Instructions

This section provides the procedure to follow in order to set up the memory map you have designed. As an example, we have modified the video_copy example in Codec Engine to match the following memory map:

[Figure 3: memory map used for the modified video_copy example]

If you are familiar with the video_copy example, you will notice that each segment here is allocated far more memory than the application needs. However, the goal of this example is simply to show which files need to change, so that you can better follow the steps outlined below.

The modified video_copy code example comes in two flavors:

  • DSPLINK 1.30 and CE 1.02 based
  • DSPLINK 1.40 and CE 1.20 based

See the readme.txt file that accompanies the code for details on the contents.

To unzip/expand the example into a directory of your choice, either use the 'unzip' command in Linux or use WinZip in Windows.

When following this procedure for your own application, substitute your own memory map and use its sizes and base addresses.

Determining the version of Codec Engine

Knowing the version of CE (and DSP Link) is important because it determines which steps to follow. CE 1.20 and above use DSP Link 1.40 or later, while earlier versions of CE use DSP Link 1.30.

If you do not know which version of CE you are using, look at the name of the CE installation directory. For example, a directory named codec_engine_1_20 means you are using CE 1.20. At the time of writing, DVEVM and DVSDK software versions 1.20 and earlier bundle versions of CE older than CE 1.20.
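The directory-name check can also be scripted. The sketch below is hypothetical. It assumes the installation directory follows the codec_engine_<major>_<minor> naming shown above, with a two-digit minor number. It then applies the boundary described in this section: CE 1.20 and above use DSP Link 1.40 or later, and older CE uses DSP Link 1.30.

```javascript
// Hypothetical helper: infers the Codec Engine version from a directory name
// such as "codec_engine_1_20" (extra patch fields like "_01" are ignored).
function ceVersionFromDir(dirName) {
  const m = /codec_engine_(\d+)_(\d+)/.exec(dirName);
  if (!m) return null; // not a recognizable CE directory name
  return { major: Number(m[1]), minor: Number(m[2]) };
}

// Maps a CE version to the DSP Link procedure described in this article.
function dspLinkProcedure(version) {
  const atLeast120 = version.major > 1 || (version.major === 1 && version.minor >= 20);
  return atLeast120
    ? "DSP Link 1.40+: no rebuild; update the Arm application's .cfg memory table"
    : "DSP Link 1.30: rebuild DSP Link with the edited CFG_Davinci.TXT";
}

for (const dir of ["codec_engine_1_02", "codec_engine_1_20"]) {
  console.log(dir, "->", dspLinkProcedure(ceVersionFromDir(dir)));
}
```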

Rebuilding DSP Link 1.30

DSP Link is a software component that enables the ARM and the DSP to communicate. Version 1.30 of DSP Link requires rebuilding the entire DSP Link whenever the DSP memory map changes. DSP Link version 1.40, used by Codec Engine 1.20 and above, is configured dynamically and requires no rebuilding.

If your DSP Link version is 1.40 or higher, you can skip to the next section. You will instead have one much simpler replacement step to perform, described later in this procedure. Rebuilding DSP Link is the most involved step in the sequence. Its sub-steps are:

  • cd to <DVEVM>/dsplink_1_30_*/packages/dsplink directory. All the paths in the remainder of this section will be given relative to this directory.
  • Open the DSP Link configuration file in a text editor: config/all/CFG_Davinci.TXT.
  • Search for a "RESUMEADDR" text entry. You will see, by default, the value of 0x8FF00020. Change that number to the beginning of our "RESET_VECTOR" segment + 0x20. In our case, it should be 0x83F00020.
  • Search for the RESETVECTOR entry. Change its value to the beginning of our "RESET_VECTOR" segment: 0x83F00000.

A word of caution: large hex numbers with many zeros are commonly mistyped with one zero missing. Make sure each hex number is exactly eight hex digits wide, not counting the 0x prefix.

  • Search for the "MEMTABLE0" set of entries. Some of these entries resemble our memory map and some don't. The ones to look for are "DSPLINKMEM", "RESETCTRL" (the same as "RESET_VECTOR"), and "DDR". Change their addresses (ADDRDSPVIRTUAL and ADDRPHYSICAL, which are identical) and sizes to match our new memory map.

    Do not worry that "DDRALGHEAP" isn't listed. DSP Link doesn't need to know about it, because its contents exist only while the DSP runs and are never accessed by the ARM. The result should look like this:
[MEMTABLE0]

 [0]
 ENTRY           | N |   0                  # Entry number
 ABBR            | S |   DSPLINKMEM         # Abbreviation of the table name
 ADDRDSPVIRTUAL  | H |   0x83F00080         # DSP virtual address
 ADDRPHYSICAL    | H |   0x83F00080         # Physical address
 SIZE            | H |   0x000FFF80         # Size of the memory region
 MAPINGPP        | B |   TRUE               # Map in GPP address space?
 [/0]

 [1]
 ENTRY           | N |   1                  # Entry number
 ABBR            | S |   RESETCTRL          # Abbreviation of the table name
 ADDRDSPVIRTUAL  | H |   0x83F00000         # DSP virtual address
 ADDRPHYSICAL    | H |   0x83F00000         # Physical address
 SIZE            | H |   0x00000080         # Size of the memory region
 MAPINGPP        | B |   TRUE               # Map in GPP address space?
 [/1]

 [2]
 ENTRY           | N |   2                  # Entry number
 ABBR            | S |   DDR                # Abbreviation of the table name
 ADDRDSPVIRTUAL  | H |   0x83C00000         # DSP virtual address
 ADDRPHYSICAL    | H |   0x83C00000         # Physical address
 SIZE            | H |   0x00300000         # Size of the memory region
 MAPINGPP        | B |   TRUE               # Map in GPP address space?
 [/2]

Do not worry about other segments listed in the file.
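A single dropped zero is enough to break the DSP Link configuration, so it can be worth checking the edited values mechanically. The sketch below is a hypothetical host-side check, not a DSP Link tool. You copy the values into it by hand from CFG_Davinci.TXT. It then checks that:

- every hex value has exactly eight digits (so a value written as 0xFFF80 is flagged);
- RESUMEADDR equals RESETVECTOR + 0x20;
- the RESETCTRL entry starts at RESETVECTOR;
- the MEMTABLE0 entries do not overlap.

```javascript
// Hypothetical sanity check for values typed into CFG_Davinci.TXT
// (copied here by hand for the 64 MB example map).
const cfg = {
  RESUMEADDR: "0x83F00020",
  RESETVECTOR: "0x83F00000",
  MEMTABLE0: [
    { abbr: "DSPLINKMEM", addr: "0x83F00080", size: "0x000FFF80" },
    { abbr: "RESETCTRL",  addr: "0x83F00000", size: "0x00000080" },
    { abbr: "DDR",        addr: "0x83C00000", size: "0x00300000" },
  ],
};

function checkCfg(c) {
  const problems = [];
  const eightDigits = /^0x[0-9A-Fa-f]{8}$/; // catches a dropped (or extra) digit
  const values = [c.RESUMEADDR, c.RESETVECTOR];
  for (const e of c.MEMTABLE0) values.push(e.addr, e.size);
  for (const v of values) {
    if (!eightDigits.test(v)) problems.push(`${v} is not 8 hex digits`);
  }
  const num = (s) => parseInt(s, 16); // parseInt accepts the 0x prefix with radix 16
  if (num(c.RESUMEADDR) !== num(c.RESETVECTOR) + 0x20) {
    problems.push("RESUMEADDR must equal RESETVECTOR + 0x20");
  }
  const reset = c.MEMTABLE0.find((e) => e.abbr === "RESETCTRL");
  if (!reset || num(reset.addr) !== num(c.RESETVECTOR)) {
    problems.push("RESETCTRL entry must start at RESETVECTOR");
  }
  // Entries sorted by address must not overlap.
  const sorted = c.MEMTABLE0
    .map((e) => ({ abbr: e.abbr, start: num(e.addr), end: num(e.addr) + num(e.size) }))
    .sort((a, b) => a.start - b.start);
  for (let i = 1; i < sorted.length; i++) {
    if (sorted[i].start < sorted[i - 1].end) problems.push(`${sorted[i - 1].abbr} overlaps ${sorted[i].abbr}`);
  }
  return problems;
}

console.log(checkCfg(cfg)); // an empty list means no problems were found
```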

  • On the Linux host, edit make/Linux/davinci_mvlpro4.0.mk, which contains the DSPLINK build instructions for its ARM binaries. Set the following fields to match your DVEVM installation, noting the locations of the Linux kernel and the Arm compiler tools:

BASE_BUILDOS: location of the Linux kernel; the directory usually ends with "/lsp/ti-davinci".

BASE_CGTOOLS: location of the Arm tools; the directory usually ends with "arm/v5t_le/bin".

  • On the Linux host, edit make/DspBios/c64xxp_5.xx_linux.mk, which contains the DSPLINK build instructions for its DSP binaries. Set the following fields to match your DVEVM and DSP/BIOS installation:

BASE_SABIOS: location of your DSP/BIOS installation; the directory usually ends with "/bios_5_21_01" or a similar version number.

BASE_CGTOOLS: location of your C64P compiler tools that run on Linux; directory can end in different ways, but it invariably contains subdirectories "bin", "include", and "lib".

  • Set the environment variable DSPLINK to directory <DVEVM>/dsplink_1_30_*/packages/dsplink.
  • From the current ($DSPLINK) directory, type:

gmake -C gpp/src

gmake -C dsp/src

  • Find the newly built DSPLINK kernel module in gpp/export/BIN/Linux/Davinci/RELEASE/dsplinkk.ko and copy it to your DVEVM file system.

This builds a DSP Link configured specifically for our memory layout. Keep in mind that this DSPLINK build works only with DSP servers that use this exact memory map. It will not work for other servers you build with different layouts.

If you have more than one server and they have different memory configurations, one approach you may use is to clone the entire top-level DSPLINK directory under a different name, then apply all the steps above in that directory, and you will have a DSPLINK build dedicated entirely to one specific memory map.

If you choose to do so, remember to specify in the XDCPATH which DSPLINK build you are using:

- If you build only the Codec Engine examples, set it in the xdcpaths.mak file in the Codec Engine examples.
- If you build real DSP servers, set it in the Rules.make file in the DVEVM installation directory.

The kernel module (dsplinkk.ko) is likewise tied to one specific memory layout.

It is because of this complexity that DSPLINK 1.40 eliminates all these steps and only uses one dsplinkk.ko kernel driver and one build for any DSP memory layout.

Rebuilding the DSP server

Every DSP server has a BIOS configuration file (.tcf file) that defines the memory layout on the DSP, among other things. It also has a Codec Engine configuration file (.cfg file) which lists which codecs to include in the image.

Our DSP server is in the Codec Engine examples/servers/video_copy directory. In more recent versions of CE it is in examples/ti/sdo/ce/examples/servers/video_copy; adjust the path accordingly for the remainder of this procedure.

The server configuration file (video_copy.cfg) lists what codecs to include. There are only two in the list, and we need both, so we don't change anything in this file.

If the codecs were real, however, our first step would be to edit this file and remove all the codecs we don't need. That would reduce the size of the DDR segment and allow us to make it smaller than the default of 4MB.

The only file we need to edit right now is video_copy.tcf. If you open it in a text viewer, you will see that it imports the contents of another DSP server's .tcf file (../all_codecs/all.tcf), because the contents are the same for both servers. Since we want to modify only the video_copy example, do the following:

  • cd to the Codec Engine examples/servers/video_copy directory.
  • From inside the video_copy/ directory, copy ../all_codecs/all.tcf to video_copy.tcf.
  • In video_copy.tcf, edit the "mem_ext" array to match our newly chosen memory map. The code should look like this:
var mem_ext = [
{
    comment:    "DDRALGHEAP: off-chip memory for dynamic algmem allocation",
    name:       "DDRALGHEAP",
    base:       0x83800000,   // 56 MB
    len:        0x00400000,   //  4 MB
    space:      "code/data"
},
{
    comment:    "DDR: off-chip memory for application code and data",
    name:       "DDR",
    base:       0x83C00000,   // 60 MB
    len:        0x00300000,   //  3 MB
    space:      "code/data"
},
{
    comment:    "RESET_VECTOR: off-chip memory for the reset vector table",
    name:       "RESET_VECTOR",
    base:       0x83F00000,   //  63 MB
    len:        0x00000080,   // 128 B
    space:      "code/data"
},
{
    comment:    "DSPLINK: off-chip memory reserved for DSPLINK code and data",
    name:       "DSPLINKMEM",
    base:       0x83F00080,   // 63 MB + 128 B
    len:        0x000FFF80,   //  1 MB - 128 B
    space:      "code/data"
},
];
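Before rebuilding, the mem_ext values can be sanity-checked outside the build. The sketch below is hypothetical, plain JavaScript run with Node.js; it is not part of the .tcf build and does not use any DSP/BIOS API. It assumes a 64 MB part whose memory ends at 0x84000000. It reports overlapping segments, gaps between segments (in this layout the segments are meant to be packed, so a gap is worth a second look even if it may be intentional), and a map that does not end at the top of memory.

```javascript
// Hypothetical host-side check of the mem_ext segments (not run inside the .tcf).
const mem_ext = [
  { name: "DDRALGHEAP",   base: 0x83800000, len: 0x00400000 },
  { name: "DDR",          base: 0x83C00000, len: 0x00300000 },
  { name: "RESET_VECTOR", base: 0x83F00000, len: 0x00000080 },
  { name: "DSPLINKMEM",   base: 0x83F00080, len: 0x000FFF80 },
];
const TOP_OF_MEMORY = 0x80000000 + 64 * 1024 * 1024; // assumed 64 MB system

function checkSegments(segs, top) {
  const findings = [];
  const sorted = [...segs].sort((a, b) => a.base - b.base);
  for (let i = 1; i < sorted.length; i++) {
    const prevEnd = sorted[i - 1].base + sorted[i - 1].len;
    if (sorted[i].base < prevEnd) findings.push(`${sorted[i].name} overlaps ${sorted[i - 1].name}`);
    else if (sorted[i].base > prevEnd) findings.push(`gap before ${sorted[i].name}`);
  }
  const last = sorted[sorted.length - 1];
  const end = last.base + last.len;
  if (end !== top) findings.push(`map ends at 0x${end.toString(16)}, expected 0x${top.toString(16)}`);
  return findings;
}

console.log(checkSegments(mem_ext, TOP_OF_MEMORY)); // an empty list means no findings
```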

Note! CE 1.20 uses "DDR2" instead of "DDR" in its examples.

Because we have defined and sized DDRALGHEAP solely to hold algorithm memory requests, you should place all other sections, including "BIOSOBJSEG", "MALLOCSEG" and "STACKSEG", in DDR (if this is not already done):

/* ==========================================================
 *  Set all data sections to use DDR
 *  ==========================================================*/
bios.setMemDataNoHeapSections (prog, bios.DDR);
bios.setMemDataHeapSections (prog, bios.DDR);
 
/*  ==========================================================
 *  MEM : Global
 *  ==========================================================*/
//prog.module("MEM").BIOSOBJSEG = bios.DDR;    //comment line out if present
//prog.module("MEM").MALLOCSEG  = bios.DDR;  //comment line out if present
 
/*  ==========================================================
 *  TSK : Global
 *  ==========================================================*/
//prog.module("TSK").STACKSEG = bios.DDR;        //comment line out if present


Optional: Splitting the DDR section to reduce trampoline occurrences by isolating code from data

Trampolines are linker-generated code stubs that let code call functions more than 4MB away (i.e. far calls). They can occur in DSP servers with large code and data, when pieces of code that call each other are separated by large chunks of static data and other code. Trampolines may cost some performance or cause other problems.

It is therefore better to tell the linker to place all code separately from all data. The code is then more compact, and fewer trampolines are needed. Separating code from data may be beneficial for other reasons as well (e.g. better cache utilization).
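The reasoning can be made concrete with a deliberately simplified model. The sketch assumes, as stated above, that a direct call reaches about 4 MB. The real linker rule (exact branch range and how distances are measured) is more detailed. The addresses and layouts below are made up for illustration.

```javascript
// Simplified model (an assumption, not the linker's exact rule): a direct call
// needs a trampoline when caller and callee are more than REACH bytes apart.
const REACH = 4 * 1024 * 1024; // the "4MB" far-call threshold quoted above

function needsTrampoline(callerAddr, calleeAddr) {
  return Math.abs(calleeAddr - callerAddr) > REACH;
}

// Span from the lowest code address to the highest. If the span is within
// REACH, no call between two code addresses can be out of reach in this model.
function codeSpan(codeRanges) {
  const start = Math.min(...codeRanges.map((r) => r.start));
  const end = Math.max(...codeRanges.map((r) => r.start + r.size));
  return end - start;
}

// Hypothetical layout: two 1 MB code chunks separated by 5 MB of static data...
const interleaved = [
  { start: 0x80000000, size: 0x00100000 },
  { start: 0x80600000, size: 0x00100000 },
];
// ...versus the same 2 MB of code packed into one DDRCODE-style range.
const packed = [{ start: 0x80600000, size: 0x00200000 }];

console.log(codeSpan(interleaved) > REACH); // true: far calls are possible
console.log(codeSpan(packed) > REACH);      // false: every call is in reach
console.log(needsTrampoline(0x80000000, 0x80600000)); // true
```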

To find out how much code versus data you have, you can use a tool called 'cg_xml' (available at https://www-a.ti.com/downloads/sds_support/applications_packages/cg_xml/index.htm). It is a collection of Perl scripts that complements the code generation tools by providing more information about compiled binaries.

One of the scripts provided as part of 'cg_xml' is sectti.pl, which lists all output sections in a given compiled binary. If you run this script on your DSP image:

ofd6x -x myImage.x64P | perl sectti.pl

the script summarizes your code and data sizes. An example:

------------------------------------------------------------
Totals by section type
------------------------------------------------------------
  Uninitialized Data :     961084  0x000EAA3C
    Initialized Data :     462114  0x00070D22
                Code :    1169504  0x0011D860

The totals may need to be adjusted to exclude sections not placed in DDR. Also, DSP/BIOS may introduce sections marked as type "N/A" that are counted as neither data nor code. If they are placed in the DDR segment, add them to the total data size.
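Using the totals above, a short sketch can check whether a proposed DDR/DDRCODE split is large enough and compute where DDRCODE starts. `checkSplit` and the 64 KB rounding are illustrative assumptions, not part of cg_xml. The `otherInDdr` field is where you would add any "N/A" sections placed in DDR. The proposed split is the one used in the mem_ext change shown in this section: 0x1B0000 for data and 0x150000 for code.

```javascript
// Hypothetical sizing helper: checks a proposed DDR (data) / DDRCODE (code)
// split against the "Totals by section type" printed by sectti.pl.
const totals = { uninitData: 961084, initData: 462114, code: 1169504, otherInDdr: 0 }; // bytes

function roundUp(n, align) {
  return Math.ceil(n / align) * align;
}

function checkSplit(t, ddrBase, dataLen, codeLen) {
  // Uninitialized data, initialized data and any "N/A" sections all go in DDR.
  const dataNeeded = t.uninitData + t.initData + t.otherInDdr;
  return {
    dataNeeded,
    codeNeeded: t.code,
    dataFits: dataNeeded <= dataLen,
    codeFits: t.code <= codeLen,
    codeBase: ddrBase + dataLen,                  // DDRCODE follows DDR directly
    dataMinimum64K: roundUp(dataNeeded, 0x10000), // smallest 64 KB multiple
    codeMinimum64K: roundUp(t.code, 0x10000),
  };
}

const r = checkSplit(totals, 0x83C00000, 0x001B0000, 0x00150000);
console.log(r.dataFits, r.codeFits, "0x" + r.codeBase.toString(16).toUpperCase()); // true true 0x83DB0000
```

Comparing `dataMinimum64K` and `codeMinimum64K` (0x160000 and 0x120000 here) with the chosen lengths shows how much headroom the split leaves for future growth.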

Coming back to the .tcf file, to separate code from data, split “DDR” segment into two segments: “DDR” that contains data only, and “DDRCODE” that contains code only.

To carve off a portion of DDR strictly for code placement, replace the DDR declaration in the "mem_ext" array above with the following:

{
    comment:    "DDR: off-chip memory for data",
    name:       "DDR",
    base:       0x83C00000,
    len:        0x001B0000, // 1.7MB
    space:      "code/data"
}, {
    comment:    "DDRCODE: off-chip mem. for code",
    name:       "DDRCODE",
    base:       0x83DB0000,
    len:        0x00150000, // 1.3MB
    space:      "code/data"
}

Our effective memory map would become:

[Figure 4: memory map with DDR split into DDR and DDRCODE]

Note that it is not necessary to inform DSP Link of this change, as DDR and DDRCODE are contiguous and can be treated as a monolithic segment of writeable external memory for the DSP Link loader.

After splitting the DDR section, change the following line so that it places most code sections in DDRCODE:

bios.setMemCodeSections (prog, bios.DDRCODE);

Running sectti.pl from the Code Generation Tools XML Output Utility Scripts on the resulting executable can show all code sections that are not yet placed in DDRCODE. All sections in the video_copy example are already placed using the line above. However, in case your application has some extra non-placed sections (due to the section name missing the prefix ‘.text:’), here’s an example output from sectti.pl for your reference:

Name        :  Size (dec)    Size (hex)   Type   Load Addr   Run Addr
----        :  ----------    ----------   ----   ----------  ---------
.randomCode :  54816         0x0000d620   CODE   0x8fd99740  0x8fd99740

Assuming you have a section similar to .randomCode which lies outside of DDRCODE, you would need to manually place it in DDRCODE in the server's link.cmd file.

SECTIONS {
    .randomCode > DDRCODE
}
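Instead of scanning the sectti.pl output by eye, the listing can be filtered with a short script. This is a hypothetical sketch. It assumes the per-section rows have the "Name : Size (dec) Size (hex) Type Load Addr Run Addr" layout shown above. The .randomCode row is the one from the example output; the .text and .far rows are made up for illustration. It reports CODE sections whose run address range is not inside DDRCODE.

```javascript
// Hypothetical filter over sectti.pl's per-section listing: reports CODE
// sections whose run address range falls outside the DDRCODE segment.
const DDRCODE = { base: 0x83DB0000, len: 0x00150000 };

const sectti = `
Name        :  Size (dec)    Size (hex)   Type   Load Addr   Run Addr
----        :  ----------    ----------   ----   ----------  ---------
.text       :  123136        0x0001e100   CODE   0x83dc2080  0x83dc2080
.randomCode :  54816         0x0000d620   CODE   0x8fd99740  0x8fd99740
.far        :  204920        0x00032078   UDATA  0x83c00000  0x83c00000
`;

function misplacedCode(listing, seg) {
  const result = [];
  for (const line of listing.split("\n")) {
    const parts = line.split(":");
    if (parts.length !== 2) continue; // skip blank lines
    const name = parts[0].trim();
    // Fields: size (dec), size (hex), type, load address, run address.
    const fields = parts[1].trim().split(/\s+/);
    if (fields.length < 5 || fields[2] !== "CODE") continue; // headers, data sections
    const size = parseInt(fields[1], 16);
    const run = parseInt(fields[4], 16);
    if (run < seg.base || run + size > seg.base + seg.len) result.push(name);
  }
  return result;
}

console.log(misplacedCode(sectti, DDRCODE)); // [ '.randomCode' ]
```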

After you are done with all necessary modifications, save and close the file. Rebuild the server by typing this from the current directory:

    make clean
    make

Then copy the rebuilt server image (video_copy.x64P) to your target file system.

Rebuilding the Arm-side application - if you use DSP Link 1.40 (or above)

Users of CE 2.x can use a newer feature, Engine.createFromServer, to communicate the memory map used by DSP/BIOS to DSP Link. Using Engine.createFromServer ensures that the memory map used by DSP Link stays in sync with DSP/BIOS; see the Codec Engine documentation for details. The rest of this section does not apply to you.

Users of CE 1.20 and DSP Link 1.40 do not have to rebuild DSP Link, but they do have to rebuild their Arm-side application.

Specifically, the change goes in examples/apps/video_copy/dualcpu/ceapp.cfg, the application configuration file. This file must contain a setting that specifies the memory map.

  • Open the ceapp.cfg file and add or otherwise make sure the following code exists in the file:
// osalGlobal is the ti.sdo.ce.osal.Global module, normally obtained earlier in ceapp.cfg
osalGlobal.armDspLinkConfig = {
    memTable: [
        ["DDRALGHEAP",   {addr: 0x83800000, size: 0x00400000, type: "other"}],
        ["RESET_VECTOR", {addr: 0x83F00000, size: 0x00000080, type: "reset"}],
        ["DDR2",         {addr: 0x83C00000, size: 0x00300000, type: "main" }],
        ["DSPLINKMEM",   {addr: 0x83F00080, size: 0x000FFF80, type: "link" }],
    ],
};

Then save and close the file.

  • Rebuild the application by executing make.
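The memTable entries must mirror the server's mem_ext array, so one option is to derive one from the other. The sketch below is hypothetical, plain JavaScript run with Node.js, and not part of Codec Engine's configuration API. The name-to-type mapping (DDRALGHEAP to "other", RESET_VECTOR to "reset", DDR2 to "main", DSPLINKMEM to "link") is taken from the ceapp.cfg example above, which uses CE 1.20's "DDR2" naming. The script prints entries you can paste into ceapp.cfg.

```javascript
// Hypothetical generator: derives memTable entries for
// osalGlobal.armDspLinkConfig from a .tcf-style mem_ext array.
const TYPE_BY_NAME = {
  DDRALGHEAP: "other",
  RESET_VECTOR: "reset",
  DDR2: "main",
  DSPLINKMEM: "link",
};

function toMemTable(memExt) {
  return memExt.map((seg) => {
    const type = TYPE_BY_NAME[seg.name];
    if (!type) throw new Error(`no DSP Link type known for segment ${seg.name}`);
    return [seg.name, { addr: seg.base, size: seg.len, type }];
  });
}

const hex = (n) => "0x" + n.toString(16).toUpperCase().padStart(8, "0");

const memExt = [
  { name: "DDRALGHEAP",   base: 0x83800000, len: 0x00400000 },
  { name: "DDR2",         base: 0x83C00000, len: 0x00300000 },
  { name: "RESET_VECTOR", base: 0x83F00000, len: 0x00000080 },
  { name: "DSPLINKMEM",   base: 0x83F00080, len: 0x000FFF80 },
];

// Print lines in the same shape as the ceapp.cfg memTable snippet above.
for (const [name, e] of toMemTable(memExt)) {
  console.log(`["${name}", {addr: ${hex(e.addr)}, size: ${hex(e.size)}, type: "${e.type}"}],`);
}
```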

Copying other necessary files to target file system

In the final steps, we copy the remaining bits and pieces of the video_copy application to the target file system:

  1. cd to the Codec Engine examples/apps/video_copy/dualcpu/ directory. This is where the Arm application is.
  2. Copy the app.out executable to the target file system. Note that you do not have to rebuild it if you have not changed the Linux kernel supplied with the DVEVM/DVSDK software (unless you use DSP Link 1.40).
  3. Copy the in.dat file, a sample input file for the application, from the current directory to the target file system.
  4. Have your cmemk.ko CMEM kernel module available on your target file system. It must already be built for your Linux kernel for any Codec Engine application to run. If you haven't changed your Linux kernel, you can use the copy of cmemk.ko in the Codec Engine examples/apps/system_files/davinci directory.
  5. If you are using DSPLINK 1.40, copy your dsplinkk.ko kernel module to your target file system. You might have rebuilt it for your Linux kernel in order to run any other Codec Engine application. If you haven't changed your Linux kernel, you can use a copy of dsplinkk.ko in the Codec Engine examples/apps/system_files/davinci directory.
  6. Have your kernel modules loading script (loadmodules.sh) available on your target file system. You can also find a copy of the script in the Codec Engine examples/apps/system_files/davinci directory.

Modifying the loadmodules.sh script

loadmodules.sh loads the kernel module dsplinkk.ko and can tell it where the DDR segment is. This reflects the limited flexibility DSPLINK 1.30 allows:

- The "DDR" segment can be anywhere and of any length, and is announced to DSPLINK when the kernel module is loaded.
- "DDRALGHEAP", which DSPLINK does not know about, can also be anywhere and of any length.
- Only the "DSPLINKMEM" and "RESET_VECTOR" segments cannot be moved or resized without rebuilding DSPLINK.

Edit the loadmodules.sh script and remove the arguments that follow insmod dsplinkk.ko, so that the command reads:

insmod dsplinkk.ko

A note about loadmodules.sh and dsplinkk.ko arguments: DSPLINK 1.30 supports an optional argument pair ddr_start and ddr_size that allows you to load and run DSP images that have different DDR segment than the default; but it still expects "DSPLINKMEM" and "RESET_VECTOR" segments to match what DSP Link 1.30's configuration file says. This is useful when you want to experiment with increasing your DSP image's size of the DDR segment at the expense of other segments’ sizes (excluding DSPLINKMEM and RESET_VECTOR), or vice versa, without having to go through the process of rebuilding DSP Link 1.30 every time; but it doesn't help if you need to change the limits of the entire memory map. In our case, having changed that text configuration file and rebuilt DSPLINK, the ddr_start and ddr_size arguments are no longer necessary since the DSPLINK 1.30 memory map configuration matches the memory map of the DSP image -- though you can still use them if you subsequently want to experiment with changing the position and size of DDR.

With DSP Link 1.40, which supports dynamic memory map configuration, changing the start address and size of the DDR section is only a matter of modifying the ARM-side application’s .cfg file and rebuilding the application itself. Hence there is no need to specify these parameters in loadmodules.sh.

Next, you have to change the CMEM memory description that follows as the arguments to the insmod cmemk.ko command. Specify phys_start and phys_end to match your new CMEM address and size, then specify pools to match the buffer requirement of your application. The pools are configured using an NxSize syntax where N is the number of buffers in the pool, and Size is the size of these buffers. For the video_copy example, the following configuration would be more than sufficient for CMEM:

insmod cmemk.ko phys_start=0x83400000 phys_end=0x83800000 pools=20x4096,10x131072
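The pool arithmetic for an insmod line like the one above can be checked before booting. The sketch below is hypothetical. It parses the NxSize pool list and compares the total with the phys_start..phys_end region. It ignores any per-buffer alignment padding CMEM might add, so keep some slack.

```javascript
// Hypothetical checker for cmemk.ko arguments: parses an NxSize pool list
// and confirms that the pools fit between phys_start and phys_end.
function parsePools(pools) {
  return pools.split(",").map((p) => {
    const [count, size] = p.trim().split("x").map(Number);
    return { count, size };
  });
}

function checkCmem(physStart, physEnd, pools) {
  const region = physEnd - physStart;
  const used = parsePools(pools).reduce((sum, p) => sum + p.count * p.size, 0);
  return { region, used, free: region - used, fits: used <= region };
}

console.log(checkCmem(0x83400000, 0x83800000, "20x4096,10x131072"));
// { region: 4194304, used: 1392640, free: 2801664, fits: true }
```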

Changing the boot argument in your Linux bootloader

When the Linux kernel boots, the mem= boot argument limits the physical memory available to the kernel. If you use U-Boot, use the setenv command to change that portion of the bootargs variable to mem=52M:

> setenv bootargs 'console=ttyS0,115200n8 root=/dev/nfs mem=52M nfsroot=192.168.1.101:/opt/montavista/pro/devkit/arm/v5t_le/target,nolock'

Note! This step is critical. If Linux uses memory above 52MB, Linux and CMEM will overwrite each other's data: the kernel corrupts the CMEM buffers, and data written to those buffers corrupts the kernel. Fortunately, this usually results in a quick crash.
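The mem= value follows directly from where CMEM starts. The sketch below is hypothetical. It assumes DDR2 starts at physical address 0x80000000 and that Linux owns everything from there up to CMEM's phys_start. For simplicity it only accepts a whole number of megabytes.

```javascript
// Hypothetical helper: derives the mem= boot argument from CMEM's phys_start.
const DDR2_BASE = 0x80000000;
const MB = 1024 * 1024;

function memBootArg(cmemPhysStart) {
  const linuxBytes = cmemPhysStart - DDR2_BASE;
  if (linuxBytes <= 0 || linuxBytes % MB !== 0) {
    throw new Error("CMEM phys_start should be a whole number of MB above 0x80000000");
  }
  return `mem=${linuxBytes / MB}M`;
}

console.log(memBootArg(0x83400000)); // mem=52M
```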

Rebooting and running the application

After the system boots, type:

sh loadmodules.sh

./app.out

Look for this line of application output to confirm the procedure worked:

Application finished successfully.

Troubleshooting

At the end of the procedure, if you did everything correctly, the application should run and you can skip this section. Otherwise, this section provides a few troubleshooting tips to find out more about your system with the new memory map.

Checking how much cmem memory is available or used

After running loadmodules.sh, directly entering the command

/bin/cat /proc/cmem

at the command prompt in Linux can show you whether you have set up CMEM with the correct buffer pools.

If you are interested in verifying the amount of memory allocated from the CMEM pools at any point in time in your application (e.g. when a Memory_contigAlloc call failed), you can add this line:

system( "/bin/cat /proc/cmem" );   /* system() is declared in <stdlib.h> */

to your ARM-side application’s source code to obtain this information.

Memory map mismatch

If the memory map in the DSP/BIOS .tcf file does not match the DSP Link configuration, the first Engine_open() call will often fail; this is the call that loads the DSP server executable onto the DSP. If this happens, compare the settings in the mem_ext array in the .tcf file with those in the DSP Link configuration. That configuration is in the CFG_Davinci.TXT file in DSP Link 1.30, or in your application's .cfg file in DSP Link 1.40. A mismatch between the two is a likely cause.

In fact, it is a good practice to double-check the two configurations after you have gone through all the steps in the procedure of memory map configuration. It can save valuable debugging time down the line.
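The double-check can be partly automated. The sketch below is a hypothetical comparison, with both tables typed in by hand. It compares segments that appear in both the .tcf mem_ext array and the DSP Link table. The alias pairs (RESETCTRL and RESET_VECTOR, DDR2 and DDR) are assumptions based on the names used in this article. Segments found in only one table, such as DDRALGHEAP in DSP Link 1.30, are skipped.

```javascript
// Hypothetical cross-check between the .tcf mem_ext array and the DSP Link
// memory table (MEMTABLE0 in DSP Link 1.30, memTable in DSP Link 1.40).
const ALIASES = { RESETCTRL: "RESET_VECTOR", DDR2: "DDR" }; // assumed name pairs
const canonical = (name) => ALIASES[name] || name;

function compareMaps(tcfSegs, linkSegs) {
  const link = new Map(linkSegs.map((s) => [canonical(s.name), s]));
  const mismatches = [];
  for (const seg of tcfSegs) {
    const other = link.get(canonical(seg.name));
    if (!other) continue; // present in only one table
    if (other.base !== seg.base || other.len !== seg.len) {
      mismatches.push(
        `${seg.name}: tcf 0x${seg.base.toString(16)}+0x${seg.len.toString(16)}` +
        ` vs link 0x${other.base.toString(16)}+0x${other.len.toString(16)}`);
    }
  }
  return mismatches;
}

const tcf = [
  { name: "DDRALGHEAP",   base: 0x83800000, len: 0x00400000 },
  { name: "DDR",          base: 0x83C00000, len: 0x00300000 },
  { name: "RESET_VECTOR", base: 0x83F00000, len: 0x00000080 },
  { name: "DSPLINKMEM",   base: 0x83F00080, len: 0x000FFF80 },
];
const dspLink = [
  { name: "DSPLINKMEM", base: 0x83F0080,  len: 0x000FFF80 }, // a zero was dropped
  { name: "RESETCTRL",  base: 0x83F00000, len: 0x00000080 },
  { name: "DDR",        base: 0x83C00000, len: 0x00300000 },
];

console.log(compareMaps(tcf, dspLink)); // reports the DSPLINKMEM mismatch
```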

Heap sizes too small

One common problem is that estimated heap sizes might be too small. This could happen if the sizes were underestimated or miscalculated, resulting in memory allocation failures on the DSP. Looking at the CE trace files and locating the point of failure should give you some indication of the heap that ran out of space. For example, if the trace shows a failure while creating an algorithm using DSKT2 (part of the Framework Components), this points to a potential lack of space in the "DDRALGHEAP". If the failure occurs while allocating/creating some other object, then it is likely that the DDR heap is too small, etc. You can turn on the highest verbosity level in CE trace by specifying the following command line when running your executable:

CE_TRACE="*=01234567" TRACEUTIL_DSP0TRACEMASK="*=01234567" TRACEUTIL_DSP0TRACEFILE="cedsp0log.txt" CE_TRACEFILEFLAGS="w" CE_TRACEFILE="cearmlog.txt" TRACEUTIL_REFRESHPERIOD=200 ./app.out

Replace app.out with the name of your application executable. This produces two log files, one for the ARM and one for the DSP, which you can inspect after the application has run. More details on using CE trace can be found in the Codec Engine Developer's User Guide.

A Real World Example

This section describes an actual case in which a particular customer tried to resize their memory map.

The application is a four-channel CIF MPEG4 Simple Profile (or H.264 Baseline Profile) Digital Video Recorder (DVR) based on DM6446 with 64MB of DDR2. Four channels of CIF video are encoded and one channel of CIF video is decoded. The MPEG4 (or H.264) and audio encoders and decoders conform to xDM.

In this discussion we focus on the video codecs, so four CIF encoder instances and one CIF decoder instance are created by calling the VISA APIs. This example is based on Codec Engine 1.02 and DSPLINK 1.30.08.02. Below is the system block diagram:

[Figure 5: system block diagram]

The final 64MB memory map looks like:

0x80000000 .. 0x83200000-1 (0-50MB; size 50MB): Linux: booted with mem=50M
0x83200000 .. 0x83A00000-1 (50-58MB; size 8MB): CMEM: shared ARM/DSP I/O buffers
0x83A00000 .. 0x83C00000-1 (58-60MB; size 2MB): DDRALGHEAP: codec dynamic memory
0x83C00000 .. 0x83E00000-1 (60-62MB; size 2MB): DDR: code, stack, system data
0x83E00000 .. 0x83F00000-1 (62-63MB; size 1MB): DSPLINKMEM: memory for DSPLINK
0x83F00000 .. 0x83F00080-1 (63MB; size 128B): RESET_VECTOR: reset vectors
0x83F00080 .. 0x84000000-1 (63-64MB; size 1MB - 128B): Unused memory

How did we arrive at this memory map? First, 1MB is the default DSPLINKMEM size in DSPLINK 1.30.08.02. Next, we must allocate the right amount of memory for CMEM, DDRALGHEAP and DDR, so that enough space remains for both the DSP software and the Linux OS.

The diagram below illustrates the application's data flow:

- Encoding: the VPFE puts the video input data (the data to be encoded) in CMEM. The encoder running on the DSP outputs the encoded data to CMEM, and the encoded data is finally stored on the hard disk.
- Decoding: the data to be decoded is first copied from the hard disk to CMEM. The decoder then decompresses it and writes the result into the decoded data buffers.
- Resizing: the VPBE output resolution is CIF, but the resizer of the VPFE peripheral on DM6446 can be used to obtain D1 resolution. We therefore also need to allocate a buffer for the resizer results in CMEM.

[Figure 6: application data flow]

Allocating CMEM Memory Space

For the to-be-encoded data buffers, the size is ((352 * 288) * 4 * 2B) * 3 = 2433024B. Here 352*288 is the CIF resolution, 4 is the number of channels, and one pixel in YUV4:2:2 needs 2 bytes. Three such (352 * 288) * 4 * 2B buffers are allocated for the encoder algorithm. The decoded data buffers have the same size: 2433024B.

Because we encode 4 CIF channels, you can estimate the size of the encoded data buffers as 50% of a D1 frame in YUV4:2:0 ((720 * 576 * 3 / 2) / 2 = 303.75KB), or from a standard MPEG4 compression ratio. Based on experience, we allocate 256KB (262144B) for the encoded data buffers, somewhat less than 303.75KB. Accordingly, we configure three 256KB buffers (786432B) for the to-be-decoded data.

The resizer result buffer needs 720 * 576 * 2B = 829440B in YUV4:2:2. The insmod cmemk command therefore looks like:

insmod cmemk.ko phys_start=0x83200000 phys_end=0x83A00000 pools=1x262144,2x2433024,1x829440,1x786432
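The CMEM arithmetic above can be reproduced with a short script, which makes it easy to redo when the channel count or resolutions change. This sketch only restates the figures in the text. The variable names are illustrative. The printed pool list contains no spaces, so it can be passed to insmod as a single argument.

```javascript
// Reproduces the CMEM sizing for the 4-channel CIF DVR described above.
const CIF = { w: 352, h: 288 };
const D1 = { w: 720, h: 576 };
const CHANNELS = 4;

const encodeInput = CIF.w * CIF.h * CHANNELS * 2 * 3;   // three 4-channel YUV 4:2:2 frames
const decodeOutput = encodeInput;                        // same size as the encoder input
const encodedOut = 256 * 1024;                           // chosen from experience
const decodeInput = 3 * 256 * 1024;                      // three 256 KB bitstream buffers
const resizerOut = D1.w * D1.h * 2;                      // one D1 YUV 4:2:2 frame
const halfD1Yuv420KB = (D1.w * D1.h * 3) / 2 / 2 / 1024; // estimate behind the 256 KB choice

const pools = [
  [1, encodedOut],
  [2, encodeInput],  // one buffer for encoder input, one for decoder output (decodeOutput)
  [1, resizerOut],
  [1, decodeInput],
];
const total = pools.reduce((sum, [n, size]) => sum + n * size, 0);
const cmemRegion = 0x83A00000 - 0x83200000; // phys_end - phys_start

console.log("pools=" + pools.map(([n, size]) => `${n}x${size}`).join(","));
console.log(`total ${total} B of ${cmemRegion} B (half-D1 estimate ${halfD1Yuv420KB} KB)`);
```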

Allocating DDRALGHEAP Memory Space

DDRALGHEAP holds the codecs' dynamic memory requests. Both the encoder and the decoder process and accept YUV4:2:0 data. One CIF channel in YUV4:2:0 is 352 * 288 * 3 / 2 B, since one YUV4:2:0 pixel needs 3/2 bytes.

The encoder and decoder algorithms need the current frame and the previous frame. Compressing or decompressing one CIF channel therefore requires 352 * 288 * 3 / 2 * 2 B of memory.

Because 4 CIF channels are encoded and 1 CIF channel is decoded:

- the encoders need 352 * 288 * 3 / 2 * 2 * 4B (about 1.16MB) of dynamic memory;
- the decoder needs 352 * 288 * 3 / 2 * 2 * 1B (about 297KB).

The total is about 1.45MB, so this example allocates 2MB for DDRALGHEAP.
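The same estimate in script form is shown below. Like the text, it counts only the current and previous reference frames per codec instance. The 2 MB allocation leaves headroom beyond that figure.

```javascript
// Reproduces the DDRALGHEAP estimate: each encoder or decoder instance keeps
// a current and a previous CIF frame in YUV 4:2:0 (1.5 bytes per pixel).
const CIF_PIXELS = 352 * 288;
const frameBytes = (CIF_PIXELS * 3) / 2; // one CIF frame, YUV 4:2:0
const perInstance = frameBytes * 2;      // current + previous frame

const encoderBytes = perInstance * 4;    // four encoder instances
const decoderBytes = perInstance * 1;    // one decoder instance
const totalBytes = encoderBytes + decoderBytes;

const MB = 1024 * 1024;
console.log((encoderBytes / MB).toFixed(2), "MB for encoders");     // 1.16
console.log((decoderBytes / 1024).toFixed(0), "KB for the decoder"); // 297
console.log((totalBytes / MB).toFixed(2), "MB total, allocated as 2 MB"); // 1.45
```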

Allocating DDR Memory Space

DDR is the DSP-side segment that holds all the system code, data, stack and heaps, as well as the code and static data for the codecs. Even the most complex video codecs have a code size of less than several hundred KB. We can use the sectti.pl script to determine the DDR section size:

ofd6x -x codec_server.x64P | perl c:\temp\cg_xml\ofd\sectti.pl > codec_server.x64P.sectti.csv

The script generates a report file showing about 416 KB of data and code in total, so 2MB of DDR is enough for this application:

REPORT FOR FILE: codec_server.x64P
Name      :  Size (dec)  Size (hex)  Type   Load Addr   Run Addr
MPEG4ENC  :  23840       0x00005d20  CODE   0x83c71000  0x83c71000
MPEG4DEC  :  10784       0x00002a20  CODE   0x83c82000  0x83c82000
.bss      :  910         0x0000038e  UDATA  0x83c88000  0x83c88000
.hwi_vec  :  512         0x00000200  CODE   0x83c70c00  0x83c70c00
.far      :  204920      0x00032078  UDATA  0x83c00000  0x83c00000
.bios     :  22912       0x00005980  CODE   0x83c76d20  0x83c76d20
.text     :  123136      0x0001e100  CODE   0x83c52080  0x83c52080
.cinit    :  8196        0x00002004  DATA   0x83c84a20  0x83c84a20
.sysinit  :  1792        0x00000700  CODE   0x83c70180  0x83c70180
.const    :  21288       0x00005328  DATA   0x83c7c6a0  0x83c7c6a0
.stack    :  4096        0x00001000  UDATA  0x83c86a28  0x83c86a28

Totals by section type (about 416KB)
Uninitialized Data :  212958  0x00033fde
  Initialized Data :  30080   0x00007580
              Code :  182976  0x0002cac0

Allocating Linux OS Memory space

We computed the Linux allocation by calculating the DSP needs first and subtracting them from the total amount of memory available. Our production system has only 64MB of memory. We need:

- 1MB for "DSPLINKMEM";
- 2MB for "DDR";
- 2MB for "DDRALGHEAP";
- 1MB for "RESET_VECTOR" plus unused memory;
- 8MB for CMEM.

That totals 14MB for the DSP and shared buffers, leaving 50MB for Linux.
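The budget and the resulting address ranges can be reproduced with a short script. The sketch lists the segments in address order, starting at the DDR2 base of 0x80000000 used in the final memory map above. Linux's size is marked as the remainder. The output matches the ranges in that map.

```javascript
// Reproduces the 64 MB budget: DSP-side and shared segments are sized first,
// Linux receives the remainder, and segments are laid out in address order.
const DDR2_BASE = 0x80000000;
const MB = 1024 * 1024;
const TOTAL = 64 * MB;

const segments = [
  ["Linux", null],                 // null: size is whatever remains
  ["CMEM", 8 * MB],
  ["DDRALGHEAP", 2 * MB],
  ["DDR", 2 * MB],
  ["DSPLINKMEM", 1 * MB],
  ["RESET_VECTOR + unused", 1 * MB],
];

function layout(totalBytes, segs) {
  const reserved = segs.reduce((sum, [, size]) => sum + (size === null ? 0 : size), 0);
  let addr = DDR2_BASE;
  return segs.map(([name, size]) => {
    const len = size === null ? totalBytes - reserved : size;
    const seg = { name, start: addr, len };
    addr += len;
    return seg;
  });
}

const finalMap = layout(TOTAL, segments);
for (const s of finalMap) {
  const last = s.start + s.len - 1;
  console.log(`0x${s.start.toString(16).toUpperCase()} .. 0x${last.toString(16).toUpperCase()}  ${s.len / MB}MB  ${s.name}`);
}
```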

Conclusion

Memory map configuration for DaVinci-based systems can be performed systematically once you have designed a memory map that suits the amount of memory available. For the procedure to go smoothly, remember to:

  1. Know your system. Plan the memory map based on how many and which codec instances will need to be available at the same time in different modes of execution in the application. Calculate or measure the size for each segment and write down the desired memory map.
  2. Be thorough. Apply the mechanical steps to adapt the DSP server, ARM application, DSP Link, CMEM and boot loader to match the desired memory map. Always double-check the changes to ensure all numbers agree with each other.