System Analyzer Tutorial 4A

System Analyzer Tutorial 4A:

How to Build Embedded Applications that Support Multicore Event Correlation

Up until this point in the tutorials, we've been working with programs that run on a single CPU, with the UIA events being timestamped using the local CPU's timestamp. The CPU timestamp counters on a multicore device are typically not synchronized with each other, which means that at any given instant of time, each CPU will report a different value for its timestamp. Factors that cause the CPU timestamps to have different values include:

  • the type of CPU (e.g. heterogeneous multicore devices can have a mix of different types of CPUs, each running at a different clock frequency)
  • the clock that is driving the CPU (the clocks for the different CPUs may not be phase-aligned, may run at different frequencies, or may be turned off to save power).
  • when the CPU comes out of reset
  • when the software on the CPU first started the local timestamp running
  • whether or not the CPU has halted (e.g. due to a breakpoint)
  • whether or not the CPU has changed its clock frequency or entered into an idle mode that affects the timestamp

One way to provide a uniform timestamp for events logged from the various cores on the device is to timestamp all events with a timestamp that is provided by a shared resource, e.g. a shared timer that all CPUs can access. The problem with this approach is that accessing such a shared resource usually adds overhead to every event that is logged: instead of reading a dedicated register provided by the CPU, the code has to go outside the CPU core and access a peripheral via a common bus, which can cause delays due to bus contention.

The approach taken in UIA is to continue to timestamp the events using the (fast) local CPU timestamp, but to additionally log "sync point events" that contain, as event parameters, the value of the local CPU timestamp and the value of a global timestamp provided by a common shared timer resource. These two timestamp values are captured as close together in time as possible, so that they can act as 'synchronization points'. The frequencies of the CPU timestamp and the global timestamp timer are also logged by the sync point event, so that a "Global Timebase Server" component in System Analyzer has enough information to translate CPU timestamp values into the equivalent global timestamp values. This allows events from all CPUs on the device to be correlated with each other and positioned on a common global timeline, so you can see the timing relationship between events logged on different CPUs.
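
The translation described above can be illustrated with a little arithmetic. The sketch below is not the System Analyzer implementation: it only assumes that a sync point carries the four values mentioned above (local timestamp, global timestamp and the two frequencies) and that the CPU timestamp frequency has not changed since the sync point was logged. The record layout, field names, function name and frequency values are all hypothetical.

```javascript
// Hypothetical sync point record: field names are illustrative, not UIA's.
// Timestamps are 64-bit counters, so BigInt is used to avoid precision loss.
const syncPoint = {
  localTimestamp: 1000000n,   // CPU timestamp captured at the sync point
  globalTimestamp: 250000n,   // global timer value captured at (nearly) the same instant
  localFreqHz: 700000000n,    // CPU timestamp frequency (assumed 700 MHz)
  globalFreqHz: 175000000n,   // global timer frequency (assumed value)
};

// Translate a local CPU timestamp into the equivalent global timestamp,
// assuming the CPU frequency is unchanged since the sync point was logged.
function toGlobalTimestamp(localTs, sp) {
  const localDelta = localTs - sp.localTimestamp;  // negative for earlier events
  const globalDelta = (localDelta * sp.globalFreqHz) / sp.localFreqHz;
  return sp.globalTimestamp + globalDelta;
}

// An event logged 700 CPU ticks (1 microsecond at 700 MHz) after the sync
// point lands 175 global ticks (1 microsecond at 175 MHz) after it.
const eventGlobalTs = toGlobalTimestamp(1000700n, syncPoint);
console.log(eventGlobalTs.toString()); // prints "250175"
```

Multiplying before dividing keeps the integer arithmetic exact for as long as possible. A real translator would also have to pick the right sync point for each event, which is why a new sync point is needed whenever the target has been halted or has changed clock frequency.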

Global Timestamp Modules

The UIA package comes with a number of Global Timestamp modules that make it easier to use the timer peripherals provided by the various types of multicore devices:

  • ti\uia\family\c66
      • TimestampC66XXGlobal: supports Keystone Architecture 1 devices (e.g. 6670, 6678)
          • You should not have to configure any parameters for this module.
          • A dedicated 64-bit global timestamp counter is provided as part of the C66XX device's PLL Controller to ensure that there are no conflicts with any shared timers.
          • The module's timer frequency is derived from the Platform.xdc file that the project is built with.
  • ti\uia\family\dm
      • TimestampDM816XTimer: supports DM816X devices
          • Please see the module's cdoc documentation for information on how to configure this module.
  • ti\uia\family\c64p
      • TimestampC6472Timer: supports the TMS320C6472 and TCI6486 devices
          • Uses Timer 11 by default.
          • Example: How to configure the C6472 Global Timestamp module to use timer 4 (base address 0x2620000) instead of timer 11:

                var TimestampC6472Timer = xdc.useModule('ti.uia.family.c64p.TimestampC6472Timer');
                TimestampC6472Timer.timerBaseAdrs = TimestampC6472Timer.TimerInstance_Timer4;

      • TimestampC6474Timer: supports the TMS320C6474 and TCI6488 devices
          • Uses Timer 2 by default. To use a different timer, you will need to configure the timerBaseAdrs config option.
          • For UIA releases uia_1_00_03_25 and earlier, please:
              • download c6474timer.zip and unzip the contents into the <uia_install_dir>\packages\ti\uia\family\c64p folder
              • download LogSyncXs.zip and unzip the contents into the <uia_install_dir>\packages\ti\uia\runtime folder

Note: The above timestamp modules are designed to work independently of SYS/BIOS. Because the timer they use will still appear to SYS/BIOS as an unused resource, take care to ensure that SYS/BIOS does not try to use the same timer.

The LogSync Module

The ti.uia.runtime.LogSync module provides the following APIs:

  • LogSync_writeSyncPoint: API to log a sync point event
      • The sync point events are defined in the ti.uia.events.UIASync module.
  • LogSync_isSyncEventRequired: API to determine if a sync point event needs to be logged because the target has been halted and has resumed execution since the last sync point was logged


In order to use these APIs, you will need to add the LogSync module to your application by inserting the following line into the project's XDC configuration file:

   var LogSync = xdc.useModule('ti.uia.runtime.LogSync');

If the target is a C6472, C6474 or C66X device, no further configuration statements are needed - the LogSync module's .xs script will automatically take care of configuring the global timestamp module.

By default, the LogSync module creates a 256 byte logger instance that is dedicated to logging sync point events, so that these important events are not overwritten by other, less important events. The following configuration options are provided:

  • LogSync.defaultSyncLoggerSize: configures the default sync logger's buffer size
  • LogSync.syncLogger: configures the LogSync module to use a specific logger when logging sync point events. If left null (the default setting), a 256 byte logger instance is created for the exclusive use of the LogSync module.
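
As an illustration, the buffer size option might be set as shown below. This is a configuration fragment for the project's .cfg file rather than a standalone program, and the value 512 is an arbitrary example, not a recommendation. It also assumes the size is expressed in the same units as the 256 byte default; check the module's cdoc documentation before relying on it.

```javascript
// XDC configuration script (.cfg) fragment: this runs inside the XDCtools
// configuration step, not as a standalone JavaScript program.
var LogSync = xdc.useModule('ti.uia.runtime.LogSync');

// Enlarge the dedicated sync point logger beyond its 256 byte default.
// 512 is an arbitrary illustrative value (assumption, not from the tutorial).
LogSync.defaultSyncLoggerSize = 512;
```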


Creating Multicore Programs: Single Image Programs and Custom Platform Files

One of the challenges of multicore devices is finding a way to fit the various programs and data for the different CPUs into memory without having them interfere with each other. If you are familiar with how to do this, you may want to skip ahead to Tutorial 4B.

Single Image Programs

One way to do this is to create a 'single image' program that is shared by all CPUs on the device. In this type of program, the software conditionally executes operations based on which CPU core it is running on. An example of this is the MessageQ project that can be created from the New CCS Project wizard:

[Image: Tutorial4C MessageQProject.gif]

(Note: To convert this project to run on the TMS320C6474, please follow the steps described in TMS320C6474L Build Configuration)

You'll also see an example of a simple single image multicore program in Tutorial 4B.

Memory Partitioning and Custom Platform Files

Another approach is to create separate programs for the various CPUs and to partition the memory so that the programs can all reside in memory side-by-side at the same time. This can be accomplished by creating a custom 'platform' file that defines separate memory sections that can be used by e.g. a "master" CPU and the other "slave" CPUs.

Let's take a look at the platform file for the TMS320C6472 EVM. If you use a text editor to open <xdctools install dir>/packages/ti/platforms/evm6472/Platform.xdc, you can see that it configures the CPU with 0K L2 cache and includes the following lines in the "instance:" section:

   override readonly config xdc.platform.IPlatform.Memory
       externalMemoryMap[string] = [
           ["DDR2", {name: "DDR2", base: 0xe0000000, len: 0x10000000}],
       ];

This defines an external memory section named DDR2 that has a length of 0x10000000 and starts at 0xe0000000. We can partition this into two or more regions, dedicated to specific CPUs, so that the programs for the different CPUs do not 'step on' each other's memory.

Don't edit the files that ship with XDCTOOLS, however. It's better to use the CCS RTSC Platform Wizard to create your own custom platform files (File / New / Other... / RTSC / New RTSC Platform). This will generate a custom 'platform' that is similar to the ones provided with XDCTOOLS but is customized for your particular needs.

Let's create a custom platform file for the C6472 that uses the first half of DDR2 for the master CPU and the second half for the slave CPU. We'll also enable L2 cache so that the code that is in DDR2 will run faster.

  • After entering the 'basic information' on the first page of the wizard and clicking 'Next', you will get to the 'Device' page where you can configure the various memory ranges.
  • Enter the clock speed for your device (e.g. 700 MHz for the evm6472).
  • Check the L2 cache box and select the desired amount of memory to use as cache.
      • This amount of memory will be removed from the LL2RAM (Local L2 RAM) section, reducing the amount of memory available to run code directly from this memory region.
  • Right-click on the 'external memory' table and select 'insert row'.
  • Add a memory section named MasterDDR2 with base = 0xe0000000 and len = 0x08000000.
  • Set the default memory section for code, data and stack to LL2RAM for best performance.
      • We will configure some of the memory sections for the master and slave CPUs to use DDR2 in the .cfg file.

[Image: Tutorial4C CustomPlatform.gif]
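
The address arithmetic behind this split can be sketched as follows. MasterDDR2 matches the entry added in the wizard steps above. The SlaveDDR2 name and the helper function are hypothetical, and the slave region's base and length are simply derived from the "second half of DDR2" split described earlier.

```javascript
// Split one memory region into equal, contiguous partitions.
// The helper and the "SlaveDDR2" name are illustrative assumptions.
function partitionRegion(base, len, names) {
  const partLen = Math.floor(len / names.length);
  return names.map((name, i) => ({ name, base: base + i * partLen, len: partLen }));
}

// Format a number as an 8-digit hex string, e.g. 0x08000000.
function hex32(value) {
  return "0x" + value.toString(16).padStart(8, "0");
}

// DDR2 on the evm6472 platform: base 0xe0000000, len 0x10000000.
const parts = partitionRegion(0xe0000000, 0x10000000, ["MasterDDR2", "SlaveDDR2"]);

for (const p of parts) {
  console.log(`${p.name}: base = ${hex32(p.base)}, len = ${hex32(p.len)}`);
}
// prints:
// MasterDDR2: base = 0xe0000000, len = 0x08000000
// SlaveDDR2: base = 0xe8000000, len = 0x08000000
```

The second row is what would be entered in the 'external memory' table for the slave CPU's region, under whatever name you choose for it.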

You'll then need to add <Your Documents and Settings folder>\myRepository\packages to the RTSC Repositories used by any projects that will need it (Build options / General / RTSC / click the add button). The platform you created will then appear in the list of platforms that you can select for your project.

You'll see an example of a multicore program that has been split up into separate 'master' and 'slave' CPU programs in Tutorial 4C.

Creating Multicore Programs: Resource Management, Cache Considerations, Bus Throughput...

When developing multicore programs, there are a number of other important considerations to keep in mind as well:

Resource management

  • Device-level resources such as interrupts, hardware semaphores, etc. need to be managed so as to avoid conflicts between programs running on different CPUs.
  • For more information, please see the appropriate MCSDK User Guide.

Cache coherency issues and False Sharing

  • You may need to explicitly program cache coherency operations when using DMA or shared data structures, to ensure that the CPU sees what is actually in memory instead of what is in cache. Alternatively, you can locate shared data structures, or memory regions that are being used by DMA, in non-cached memory.
  • Some GEL scripts that run on target startup turn off L2 cache to prevent unexpected cache coherency problems. To turn cache back on, you may need to explicitly configure the MAR registers in your code.
  • False sharing occurs when two programs have data structures that partially reside in the same cache line in shared memory. Updates to these data structures from one CPU will cause the cache line to be flushed, adding hidden execution overhead to the operation of the other CPU.
  • For more information on the use of cache in multicore devices, please see the appropriate Cache User Guide.
  • The CCS Cache View provides visibility into the use of cache by your program, and can be configured to show differences between what is in cache and what is in memory.
      • To open the view, from the CCS main menu select Windows / Show Views / Other... / Debug / Cache.
      • For more information, search for Cache View in the CCS online help.
  • The CCS Memory Browser provides a number of cache visibility features as well:
      • Cache Line Boundary Markers: Right-click on the memory view, select "Configure..." and select the type of cache you wish to see the line boundaries for.
      • Memory Analyzers: Click on the drop-down arrow to the right of the Memory Analysis button (MemoryAnalysisButton.gif) and enable "Memory Analysis". Then right-click on the memory view and select which type of analysis you wish to have performed.
          • Analyzers highlight memory addresses where the cache contents differ from the underlying memory, or where the memory belongs to a dirty cache line or a Least Recently Used (LRU) cache line.
          • Memory values are shown in bold if they meet one of the selected criteria.
          • Tooltips provide information on why an address was shown in bold.

[Image: MemoryAnalysis.gif]
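
Whether two data structures are candidates for false sharing comes down to whether their address ranges touch a common cache line. A minimal sketch of that check follows. The function names and addresses are hypothetical, and the 128 byte line size is an assumption that should be checked against the Cache User Guide for your device and cache level.

```javascript
// Return the index of the cache line that contains a byte address.
function cacheLineIndex(addr, lineSize) {
  return Math.floor(addr / lineSize);
}

// True if two buffers (start address and size in bytes, size > 0)
// touch at least one common cache line.
function sharesCacheLine(addrA, sizeA, addrB, sizeB, lineSize) {
  const firstA = cacheLineIndex(addrA, lineSize);
  const lastA = cacheLineIndex(addrA + sizeA - 1, lineSize);
  const firstB = cacheLineIndex(addrB, lineSize);
  const lastB = cacheLineIndex(addrB + sizeB - 1, lineSize);
  return firstA <= lastB && firstB <= lastA;
}

// Round a size up to a whole number of cache lines, so that a structure
// placed on a line boundary never shares its last line with a neighbor.
function padToCacheLine(size, lineSize) {
  return Math.ceil(size / lineSize) * lineSize;
}

const LINE_SIZE = 128; // assumed line size in bytes; device dependent
const BASE = 0xe0000000; // hypothetical address of the first structure

// Two 40-byte structures packed back to back share a cache line.
const packed = sharesCacheLine(BASE, 40, BASE + 40, 40, LINE_SIZE);

// Placing the second structure on the next line boundary separates them.
const padded = sharesCacheLine(BASE, 40, BASE + padToCacheLine(40, LINE_SIZE), 40, LINE_SIZE);

console.log(packed, padded); // prints: true false
```

Padding and aligning per-CPU structures to a cache line boundary costs some memory but removes the hidden flush overhead described above.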

CP_Tracer: the "Common Platform Tracer"

  • The C66X devices support hardware bus monitoring through "CP_Tracer", which provides visibility into bus-related bottlenecks and provides throughput analysis to identify bus hogs, etc.

In the next part of this tutorial, we'll get back to focusing on System Analyzer and how it can provide visibility into what is going on in your multicore applications.

Next: Tutorial 4B: How to enable multicore event correlation when using JTAG transports
