MIDAS Ultrasound v3.0 Demo


MIDAS Versions

Please note that the newer MIDAS Ultrasound version 4.0 is available now.

Please see the main MIDAS wiki page to compare different versions of MIDAS and see what suits your use-case the best.


The MIDAS Ultrasound v3.0 demo showcases a system-level implementation of midend and backend ultrasound signal processing on Texas Instruments (TI) multicore devices, including the homogeneous C6472 six-core DSP and the heterogeneous OMAP3530 System-on-Chip, which consists of one ARM core and one DSP core. We use the TI C6472 Low-cost EVM interfaced via an Ethernet-Ethernet connection to the TI OMAP3530 Mistral EVM. The C6472 consists of six high-performance C64x+ DSP cores, each of which can run at up to 700 MHz, connected by a high-speed switch fabric. The C6472 provides 768 KB of shared L2 memory and an internal mechanism for sharing data among cores. The on-chip DMA engine allows automated movement of data between peripherals and memory. The device also contains high-speed I/O ports such as SRIO and Gigabit Ethernet. The OMAP3530 consists of a 720 MHz ARM Cortex-A8 general-purpose processor and a 520 MHz C64x+ DSP core, which together provide a power-efficient solution to handle both system controller and back-end processing functions in diagnostic ultrasound imaging systems.

The figure below showcases the function of the cores on the C6472 and the OMAP3530. As shown in the diagram, the B-mode Processing Unit (BPU) which includes envelope detection and log compression, and the Doppler Processing Unit (DPU) which includes ensemble aggregation, wall filter and flow estimation, are implemented on separate cores on the C6472 device. The Scan Conversion Unit (SCU) is implemented on the OMAP’s C64x+ DSP core. All the algorithm kernels implemented in this system are from TI's Embedded Processor Software Toolkit for Medical Imaging Applications. In addition, the demo showcases the use of TI's DMAI APIs on ARM as well as the use of Qt and tslib to create a touchscreen-based Graphical User Interface (GUI) on the SoC that allows users to interact with the ARM and DSP in real-time.

Figure 1: Functional Overview of MIDAS Ultrasound Demo v3.0

The sample raw data used in the demo is from a scan of a carotid artery, consisting of 69 frames of post-RF-demodulated data. Each frame of input data includes 256 scanlines at 512 samples/scanline of B-mode data, and 48 scanlines at 256 samples/scanline with 10 ensembles of Doppler (color flow) data. This input data is initially stored on the NFS/SD card on the OMAP3530. During initialization, all 69 frames of input data are sent to the C6472, which stores them in DDR. When the user presses the 'Start' button on the user interface, the input data is fed in at a set acquisition rate (20 fps in this demo, but customizable) and processed through the BPU and DPU (on the C6472) and the SCU (on the OMAP3530's DSP) modules as shown above; the final scan-converted image is displayed on the OMAP3530's LCD touchscreen.

This document covers various aspects of the demo, including a discussion of the software design as well as step-by-step instructions to obtain the source code and set up your development environment to build and run the demo.

GForge Project Page

MIDAS source code is hosted on TI's GForge project portal. The MIDAS project page is located at https://gforge.ti.com/gf/project/med_ultrasound/

Software Implementation

This section discusses the software implementation for Ultrasound v3.0 and showcases how TI's software components including the Multicore Software Development Kit (MCSDK), Software Development Kit for OMAP3530 (DVSDK), Codec Engine (CE), Digital Media Application Interface (DMAI) and iUniversal APIs can be leveraged by developers to create applications for homogeneous and heterogeneous multicore systems like the C6472 and OMAP3530 respectively.

Both the SoC and Multicore DSP software implementations leverage TI's Codec Engine (CE), a standard software architecture and interface for algorithm execution. CE eases multicore programming and, in the case of an SoC like the OMAP3530, abstracts the DSP for GPP (ARM) programmers. CE is based on a client-server architecture: for the SoC implementation, the ARM acts as the client and the DSP as the server (hosting one or more algorithms), while for the multicore DSP, one DSP core acts as the client (the master core) and the other DSP cores act as servers. CE allows easy algorithm plug-in: DSP developers supply XDAIS-compliant algorithms, and GPP (ARM) developers integrate these DSP algorithms and make remote procedure calls from their ARM applications. Interprocessor communication and DSP resources (such as DMA and memory) are managed by CE under the hood, so no manual coding of IPC is required.

The block diagram below showcases the TI production software components that the Multicore DSP application relies on.

US3 Demo MCDSP Components.PNG

TI's SYS/BIOS 6.x is a highly configurable, real-time operating system that caters to a variety of embedded processors and is included as part of TI’s Code Composer Studio integrated development environment. SYS/BIOS provides some key features that enable easy memory management, preemptive multitasking and real-time analysis. Based on the application's requirements, developers can optimize their final runtime image by including/excluding specific SYS/BIOS modules.

The Multicore Software Development Kit (MCSDK) includes key components that ease multicore development, including the chip support library, low-level drivers, platform software (PDK), Network Developer's Kit (NDK), etc. The Codec Engine, which we describe in more detail later, provides a framework and APIs to easily plug-and-play algorithms, and handles Inter-Processor Communication (IPC) under the hood.

The figure below showcases the software components available for development on the System-on-Chip, that our application on OMAP3530 relies on.

US3 Demo SoC Components.PNG

Multicore DSP (Mid End)

The software application that runs on the C6472 is based on a Master/Slave model, where Core 0 acts as the centralized controlling core (the Master core), and Core 1 and Core 2 act as the Slave cores. Although this demo uses only three cores, since they are enough to demonstrate our use-case, it is of course feasible to use the same programming model to extend the application to all six cores. The processing modules are statically assigned to Core 1 (BPU) and Core 2 (DPU), while Core 0, which serves as the master, takes care of synchronization and sets up the buffer pointers. Note, however, that even though distribution is done statically, the assignment of algorithms to cores is done outside of the main application. This allows easy reconfiguration, and at the application software level the developer can be agnostic to which core is running which algorithm.

Functional Overview

The software application on C6472's Core 0 is designed to integrate the following functional blocks: Front End Interface, Mid End Controller, Mid End Processing and Back End Interface.

US3 Demo Application Design.PNG 

Mid End Controller

As the name suggests, the Mid End Controller is responsible for initializing and initiating the other blocks. Commands received from the OMAP3530 are also interpreted here.

Front End Interface

The Front End Interface serves two primary functions: it provides periodic events that mark the availability of incoming input data, and it provides functions to access the data that has arrived. Since this version of the demo, Ultrasound v3.0, showcases processing blocks post IQ demodulation and has no front-end implementation, it becomes necessary to mimic the function of the ultrasound front-end in a real system, where input frames would be continuously received at a set acquisition frame rate. The Front End Interface serves this role, wherein it fires an INPUT_RDY event every (1/acquisition rate) seconds. The clock ticks are derived from the SYS/BIOS Timer module. Note that although this design uses a frame-based processing model, where the frame boundary defines the input block size, it is also possible to use partial frames as input boundaries.
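As a rough illustration of the timing described above, the sketch below computes how many timer ticks separate two INPUT_RDY events and how a stored dataset index would wrap around. It is a minimal sketch, not the demo's code: the function names, the tick-rate parameter and the rounding choice are all assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical helper (not from the demo source): number of timer ticks
 * between two INPUT_RDY events for a given tick rate and acquisition rate. */
uint32_t frame_period_ticks(uint32_t tick_hz, uint32_t acq_fps)
{
    if (acq_fps == 0u) {
        return 0u;                      /* acquisition disabled */
    }
    /* Round to the nearest tick; 64-bit math avoids overflow in the sum. */
    return (uint32_t)(((uint64_t)tick_hz + acq_fps / 2u) / acq_fps);
}

/* Hypothetical helper: index of the next stored input frame, wrapping back
 * to 0 so that a finite dataset (69 frames in the demo) can loop forever. */
uint32_t next_frame_index(uint32_t current, uint32_t num_frames)
{
    if (num_frames == 0u) {
        return 0u;                      /* no frames loaded */
    }
    return (current + 1u) % num_frames;
}
```

With a 1 kHz tick and the demo's 20 fps acquisition rate, frame_period_ticks(1000, 20) gives one event every 50 ticks, i.e. every 50 ms.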

Mid End Processing

The Mid End Processing function block pends on the INPUT_RDY event from the Front End Interface and, as soon as a new frame is “received,” initiates processing on that input block. Mid End Processing acts as the Client in the Codec Engine (CE) framework and uses the iUniversal interface to call upon Algorithm Servers that correspond to various functions within the ultrasound midend processing signal chain. In this implementation, there are two algorithm servers, one for each slave core (Core 1 and Core 2).

Let us now look at the primary execution threads that define the data flow through the Mid End Processing block. The figure below shows four primary tasks that utilize the MessageQ IPC module for message passing. The messages provide pointers to the data and trigger the execution of tasks in the receiving functions. The actual message buffer is set up in shared memory that both the message sender and the message receiver can access. In this case, the MidEnd_scatterTask() pends on a new input frame. When data becomes available, the MidEnd_scatterTask() allocates memory for the message from the heap and points the message at the B-mode input data. It similarly allocates memory for a message that points to the Color flow input data. Using the MessageQ_put call, the MidEnd_scatterTask() passes the input data pointers for B-mode and Color to the BPUcluster_task() and DPUcluster_task(). The BPUcluster_task() and DPUcluster_task() use the corresponding MessageQ_get calls to pend on the incoming B-mode and Color data frames. Once a new frame is received, these tasks call the UNIVERSAL_process API provided by the iUniversal interface to invoke the BPU and DPU processing algorithms on Core 1 and Core 2, respectively. Once the B-mode and Color data are processed, the BPUcluster_task() and DPUcluster_task() use the MessageQ_put API to pass the output data pointers to the MidEnd_gatherTask(). The MidEnd_gatherTask() ensures that data atomicity is maintained, so that the B-mode and Color data that correspond to a particular frame always stay together.
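The gather step's atomicity requirement can be sketched in plain C, independent of MessageQ. The sketch below is an assumption-laden illustration rather than the demo's MidEnd_gatherTask(): the GatherSlot type, the PartKind enum and both function names are hypothetical. It only shows the idea of holding a frame back until both its B-mode and Color outputs have arrived, and refusing parts that belong to a different frame.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one frame being gathered. */
typedef struct {
    uint32_t    frame_id;   /* frame this slot is collecting            */
    const void *bmode;      /* BPU output pointer, NULL until received  */
    const void *color;      /* DPU output pointer, NULL until received  */
} GatherSlot;

typedef enum { PART_BMODE, PART_COLOR } PartKind;

/* Start collecting the two halves of a new frame. */
void gather_reset(GatherSlot *slot, uint32_t frame_id)
{
    slot->frame_id = frame_id;
    slot->bmode = NULL;
    slot->color = NULL;
}

/* Record one processed part. Returns true only when both parts of the
 * slot's frame are present; a part tagged with another frame id is
 * rejected and leaves the slot unchanged. */
bool gather_submit(GatherSlot *slot, uint32_t frame_id,
                   PartKind kind, const void *data)
{
    if (data == NULL || frame_id != slot->frame_id) {
        return false;
    }
    if (kind == PART_BMODE) {
        slot->bmode = data;
    } else {
        slot->color = data;
    }
    return slot->bmode != NULL && slot->color != NULL;
}
```

In a real system the two parts would arrive as messages from different cores in either order; the caller would forward the pair to the back end only when gather_submit() returns true.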

Note that the control tasks that involve data scattering and gathering, viz. MidEnd_scatterTask() and MidEnd_gatherTask() are running on Core 0, the Master core, and the processing tasks, viz. BPUcluster_task() and DPUcluster_task() are running on Core 1 and Core 2, respectively. It is important to note here that the IPC between cores is handled under the hood via CE and by using MessageQ, the developer is agnostic to this fact and can simply call the MessageQ APIs for passing data pointers between cores.

US3 Demo Data Flow.PNG

Back End Interface

Once the output B-mode and Color data are ready for scan conversion, they are passed to the OMAP3530, which handles the backend processing and display. The Mid End application's Back End Interface block provides functions to communicate with the OMAP3530 back end. For this example implementation, we use the Ethernet ports on both device EVMs to interface the two together. To implement the communication protocol in software, we use RDSP, an application written on top of the MCSDK's Network Developer's Kit (NDK), that allows easy passing of data and parameters between the C6472 and the OMAP3530.

CE Implementation Details

To better understand this, we delve into a brief discussion on CE. The CE framework is essentially a set of APIs used to instantiate and run XDAIS-compliant algorithms. XDAIS is an algorithm standard that DSP programmers should follow to ensure that their algorithms easily plug-and-play with other algorithms and can be called using CE APIs. CE requires two essential components to operate in tandem: a CE Client and a CE Algorithm Server. In this demo, the master core, Core 0, serves as the CE Client and uses CE APIs to make “remote procedure calls” to CE Algorithm Server executables that reside on DSP Core 1 and Core 2. Essentially, the CE Algorithm Server combines the core codec (BPU in the case of Core 1 and DPU for Core 2) with the other infrastructure pieces (SYS/BIOS, IPC, etc.) to produce an executable (.x64P) that is callable by Core 0, the CE Client. The application on Core 0 invokes the BPU and DPU remote algorithms on Core 1 and Core 2 using the iUniversal interface, which is a set of APIs that provide an easy way for XDAIS-compliant, non-VISA (Video, Image, Speech, Audio) algorithms to run using CE.

CE provides some unique features that significantly ease the multicore DSP software development process. One of the primary advantages is that CE eliminates the need for the developer to manually code any Inter-Processor Communication (IPC) details. Once the developer configures memory for IPC, CE takes care of the rest of IPC under the hood. This is illustrated later in this section with code snippets. CE also captures some key TI hardware features, where resource management for memory and EDMA is done via CE. CE also enables code reuse and faster time to market, since applications can be easily ported from TI's current generation C64x+ based multicore DSPs like the C6472 to next-generation devices based on the C66x architecture like the C6678. For more information on CE and C66x devices, please click on the relevant links in the References section.

Performance Details

It is of interest to find the speed-up, i.e. the ratio of single-core cycles to multi-core cycles, to see how much parallelism the multi-core implementation achieves and to understand the overheads of making the application multi-core. Currently the real-time requirements of the carotid-artery example can be met even by running all the processing on a single core, so the speed-up measurement is not needed for meeting real-time needs. Its purpose is to compare actual performance with what is expected and to show where the overheads of this multi-core implementation lie.

The multi-core implementation parallelizes the execution of the bpu (B-mode processing) and dpu (color processing) components. Each component is executed monolithically on its own core. The expected parallel cycle count is therefore the larger of the two components' cycle counts plus some multi-core overhead.

Overheads and Related Optimizations in Multi-core Implementation

The overheads can be categorized into two components:

1. Communication and synchronization overhead between the master and each of the slave cores running the bpu/dpu components, including any cache operations for coherency. Caching overheads were minimized by configuring the infrastructure (Codec Engine) to disable any caching operations at the boundaries of master and slave cores, because both the bpu and dpu algorithms use only DMA for I/O.

2. DDR contention overhead. Each component overlaps DDR memory I/O with CPU kernel execution, and no cache is used for I/O. When the bpu and the dpu each run alone, both are CPU bound, but the bpu has a higher margin (the difference between CPU cycles and the CPU-cycle equivalent of DMA I/O) than the dpu. The dpu is almost matched in I/O and CPU, mainly because of the RF input transpose operation in the DMA, which takes more cycles than a linear transfer. The extra cycles come from the scatter operation in internal L2 RAM during the DMA transfer, which makes inefficient use of the wide on-chip buses to internal memories. When the bpu and dpu run simultaneously on different cores, both cores hit DDR memory at the same time, and because DDR is shared, the accesses are serialized. Each execution therefore suffers a greater I/O transfer penalty than when only one is running. Measurements with the same DMA priority for bpu and dpu show that the dpu becomes significantly I/O limited, with an increase of as much as 400 K cycles over standalone operation, which results in high overall multi-core cycles. As a result of this observation, the software was modified to prioritize the dpu server core over the bpu server core for EDMA access. This was accomplished with suitable programming of the channel to TC mapping, the TC to queue mapping and the system queue priorities. This reduced the dpu DDR degradation from 400,000 cycles per frame to 27,000 cycles per frame, with a similar improvement in overall multi-core performance.

Performance Measurements

The full application includes NDK software for communicating processed frames from the 6472 to the OMAP. Because a real product is unlikely to use an Ethernet interface for such communication (here we are constrained by the EVM hardware), it is not realistic to include a network stack in the measurement. So for the purposes of measurement, the application was built and run stand-alone on the 6472 without any Ethernet communication with the OMAP. This can be done by defining MIDEND_STANDALONE and MIDEND_AUTOINIT_PARAMS in the project properties and setting the variable midend_standalone to 1 in the midendapp.cfg file. Furthermore, single-core operation can be simulated by uncommenting the following in the midend.c file:


The above simply selects the local CE algorithms for bpu and dpu, i.e., those running on the master core. Everything else (tasking) is the same as in multi-core operation.

BPU-only or DPU-only execution can be configured by commenting out the appropriate lines below in the same file:


Running a single algorithm using the defines above enables estimation of the degradation due to DDR contention in simultaneous operation.

The DSP cycle measurements are shown in the table below, all performed on the master core which runs different tasks. All numbers represent one frame of processing.

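Configuration      Max latency   BPU Cycles   DPU Cycles
Multi-Core (MC)    1903650       1789126      1880914
Single-Core (SC)   -             1769786      1840944
MC BPU only        -             1785276      -
SC BPU only        -             1771337      -
MC DPU only        -             -            1853723
SC DPU only        -             -            1839771

(-: not applicable or not available)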

The max latency column above is obtained by reading the global variable gPerformanceStats (in the watch window) and computing

max(gPerformanceStats.bmodeLatencyCycles, gPerformanceStats.colorLatencyCycles).

These latencies are measured in the code between entry of each type of frame into the system and its exit. So the maximum is the processing time of the stand-alone application.

The individual component cycles (the bpu and dpu columns in the table above) are obtained by taking the difference between time-stamps in the trace log (gLogBuf), reading backwards from gLogBufIndx, between the BPU(DPU)_PRE_UNIV_PROCESS and BPU(DPU)_PRE_NOTIFY_PROCESS_DONE points. The trace log holds a time-stamp for each logged event (the time-stamp is 0 at FRONT_END_PRE_EVENT_POST), so it shows which events happen at what times and therefore what is being parallelized.
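The backward scan over the trace log can be sketched as follows. The actual layout of gLogBuf is not shown in this document, so the TraceEntry structure, the TraceEvent enum and the function name are all hypothetical; the sketch also assumes a linear (non-wrapping) log with time-stamps that increase monotonically.

```c
#include <stdint.h>

/* Hypothetical event identifiers; the demo's real ids are not shown here. */
typedef enum {
    EV_OTHER,
    EV_PRE_UNIV_PROCESS,          /* just before the UNIVERSAL_process call  */
    EV_PRE_NOTIFY_PROCESS_DONE    /* just before completion is signalled     */
} TraceEvent;

/* Hypothetical log record: an event id plus a cycle time-stamp. */
typedef struct {
    TraceEvent event;
    uint32_t   timestamp;
} TraceEntry;

/* Scan backwards from index `last` for the most recent `end` event, then
 * keep scanning backwards for the `start` event that precedes it.
 * Returns the cycle difference, or 0 if the pair is not found. */
uint32_t component_cycles(const TraceEntry *entries, int last,
                          TraceEvent start, TraceEvent end)
{
    int i = last;
    uint32_t end_ts;

    while (i >= 0 && entries[i].event != end) {
        i--;
    }
    if (i < 0) {
        return 0u;
    }
    end_ts = entries[i].timestamp;

    while (i >= 0 && entries[i].event != start) {
        i--;
    }
    if (i < 0) {
        return 0u;
    }
    return end_ts - entries[i].timestamp;
}
```

A real trace buffer is often circular, in which case the index arithmetic would need to wrap; that case is deliberately left out of this sketch.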

The single-core application cycles cannot be used directly to find the speed-up, because the multi-core implementation carries some tasking overheads that a naturally written single-core implementation would not have. Some overheads are also hidden in the multi-core application because the algorithms execute in parallel. For example, the bpu's overhead for preparing and issuing the processing call, as well as freeing I/O buffer memory, is parallelized with the dpu's execution (we set the dpu task priority higher than the bpu's, knowing that the dpu takes more processing cycles than the bpu). While this is a good optimization on the multi-core, it gets serialized in the equivalent single-core execution. On the other hand, a naturally implemented single-core application would not have some of the overhead associated with having multiple tasks. A fair approximation is to apply the multi-core overhead to the single-core case. The multi-core overhead is the difference between the max latency and the larger of the bpu and dpu cycles: 1903650 – max{1789126, 1880914} = 22736 cycles. The single-core max latency is then 22736 + bpu cycles of single-core + dpu cycles of single-core = 22736 + 1769786 + 1840944 = 3633466. Thus,

Speed-up = 3633466/1903650 = 1.91.

The maximum possible theoretical speed-up is the one where there is no overhead in the single-core or multi-core case, which is simply the ratio below

Max theoretical speed-up

= (SC bpu cycles + SC dpu cycles)/max{SC bpu cycles, SC dpu cycles}
= (1769786 + 1840944)/max{1769786,1840944}
= 1.96
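The speed-up arithmetic above can be reproduced with a few lines of C. This is only a sketch of the calculation described in the text; the CycleCounts structure and the function names are hypothetical, while the formulas follow the derivation above (multi-core overhead applied to the single-core sum).

```c
#include <stdint.h>

/* Hypothetical container for the measured per-frame cycle counts. */
typedef struct {
    uint64_t mc_latency;  /* multi-core max latency          */
    uint64_t mc_bpu;      /* bpu cycles, multi-core run      */
    uint64_t mc_dpu;      /* dpu cycles, multi-core run      */
    uint64_t sc_bpu;      /* bpu cycles, single-core run     */
    uint64_t sc_dpu;      /* dpu cycles, single-core run     */
} CycleCounts;

static uint64_t max_u64(uint64_t a, uint64_t b)
{
    return (a > b) ? a : b;
}

/* Overhead = max latency minus the longer of the two parallel components. */
uint64_t mc_overhead(const CycleCounts *c)
{
    return c->mc_latency - max_u64(c->mc_bpu, c->mc_dpu);
}

/* Estimated single-core latency: serialized components plus the same
 * overhead that was measured on the multi-core run. */
uint64_t sc_latency_estimate(const CycleCounts *c)
{
    return mc_overhead(c) + c->sc_bpu + c->sc_dpu;
}

/* Actual speed-up: estimated single-core latency over multi-core latency. */
double actual_speedup(const CycleCounts *c)
{
    return (double)sc_latency_estimate(c) / (double)c->mc_latency;
}

/* Upper bound: no overhead at all in either configuration. */
double max_speedup(const CycleCounts *c)
{
    return (double)(c->sc_bpu + c->sc_dpu)
         / (double)max_u64(c->sc_bpu, c->sc_dpu);
}
```

Feeding in the counts quoted above (1903650, 1789126, 1880914, 1769786, 1840944) yields an overhead of 22736 cycles and an estimated single-core latency of 3633466 cycles, matching the figures in the text.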

We can derive overhead numbers related to communication and DDR loading described earlier as follows:

1. Communication overhead (Codec Engine’s IPC or Inter Processor Communication) for dpu = dpu cycles of MC dpu only – dpu cycles of SC dpu only = 1853723 – 1839771 = 13952
2. Communication overhead (Codec Engine’s IPC or Inter Processor Communication) for bpu = bpu cycles of MC bpu only – bpu cycles of SC bpu only = 1785276 – 1771337 = 13939
3. Degradation in dpu due to DDR contention with bpu = dpu cycles of MC case – dpu cycles of MC dpu only case = 1880914 – 1853723 = 27191
4. Degradation in bpu due to DDR contention with dpu = bpu cycles of MC case – bpu cycles of MC bpu only case = 1789126 – 1785276 = 3850


The following table summarizes the results (DSP speed is 700 MHz):

Metric                    Value
Actual Speed-up           1.91 (max possible 1.96)
Communication Overhead    13900 cycles (0.02 ms)
DDR Contention Overhead   27191 cycles (0.04 ms)
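To put the cycle counts in perspective, they can be converted to time at the DSP clock rate and compared with the per-frame budget. The helpers below are a sketch with hypothetical names; the only inputs taken from the text are the 700 MHz clock, the 20 fps rate and the cycle counts.

```c
#include <stdint.h>

/* Convert a cycle count to milliseconds at the given clock frequency. */
double cycles_to_ms(uint64_t cycles, double clock_hz)
{
    return (double)cycles / clock_hz * 1000.0;
}

/* Fraction of one frame period (1/fps seconds) consumed by `cycles`. */
double budget_fraction(uint64_t cycles, double clock_hz, double fps)
{
    double frame_ms = 1000.0 / fps;
    return cycles_to_ms(cycles, clock_hz) / frame_ms;
}
```

For example, budget_fraction(27191 + 13952, 700e6, 20.0) is well below one percent of the 50 ms frame period, which is the sense in which the overheads are small at this frame rate.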

The use of Codec Engine infrastructure, appropriate optimized implementations of the algorithms and system level tuning (DMA priorities, cache placements) allows a reasonable realization of multi-core performance. The overheads are insignificant at 20 fps (50 ms per frame) processing rate of the application.

System-on-Chip (Back End)

This section discusses the software design for the System-on-Chip application and showcases how TI's SDK for OMAP3530 including the Codec Engine, DMAI and iUniversal APIs can be used to ease SoC software development.

As previously mentioned, the SoC's role is to do backend processing and display, and provide mechanisms for data input/output and a graphical user interface.

The figure below summarizes the implementation.

US3 Demo SoC Application Design.PNG

As shown, the OMAP3530's ARM receives data from the C6472 that has undergone B-mode estimation and Color Flow processing. The OMAP3530’s C64x+ DSP core runs the DSP/BIOS real-time operating system and runs the Scan Conversion Unit (SCU) algorithm on this received data. The SCU converts both B-mode and Color Flow data from the acquired polar/Cartesian co-ordinates to the display Cartesian co-ordinates. Just as in the Multicore DSP implementation, the algorithm module (SCU in this case) is packaged into a 'DSP Algorithm Server' that is managed by the Codec Engine and executed on the DSP core. On the ARM, the demo application uses Codec Engine APIs to make a remote procedure call to this 'Algorithm Server'.

The ARM core runs the Linux operating system, and all peripherals are controlled through Linux device drivers, which are part of the PSP. The ARM application resides on this Linux filesystem and uses a multithreaded framework to achieve data FIFO management, and to manage the Qt GUI on the On Screen Display (OSD) and service user interrupts via touchscreen events.

Functional Overview

The ARM application consists of a main function and four execution threads running in parallel, viz. the acquisition, process, display and control threads. All four threads use preemptive, priority-based scheduling.
• The main function draws the Qt GUI and then becomes the event handling loop
• The acquire thread (initiated as part of the RDSP application) reads the raw input ultrasound data (from C6472) into the SoC
• The process thread engages the C64x+ DSP for scan conversion
• The display thread transfers ultrasound image frames to the frame buffer of the FBDev display driver
• The control thread computes and displays the ARM and DSP loading
Fixed-size buffers are exchanged between acquisition-process, and process-display threads for data movement. The thread where the data originates creates and maintains the set of data buffers, and FIFOs are used to put and get buffer pointers. In the following subsections we will delve into each of the threads discussed above. It might be useful to follow along with the source code as we discuss these elements.
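The buffer-pointer exchange can be pictured as a small ring buffer of pointers. The sketch below is not the FIFO implementation used by the demo; PtrFifo, FIFO_CAPACITY and the three functions are hypothetical, and the code is single-threaded for clarity. A FIFO shared between real threads would additionally need locking and a way to block on empty or full.

```c
#include <stdbool.h>
#include <stddef.h>

#define FIFO_CAPACITY 4   /* hypothetical depth, chosen for illustration */

/* Fixed-capacity FIFO that carries buffer pointers, not buffer contents. */
typedef struct {
    void  *slots[FIFO_CAPACITY];
    size_t head;    /* index of the oldest queued pointer */
    size_t count;   /* number of queued pointers          */
} PtrFifo;

void fifo_init(PtrFifo *f)
{
    f->head = 0;
    f->count = 0;
}

/* Queue a buffer pointer; returns false if the FIFO is full. */
bool fifo_put(PtrFifo *f, void *buf)
{
    if (f->count == FIFO_CAPACITY) {
        return false;
    }
    f->slots[(f->head + f->count) % FIFO_CAPACITY] = buf;
    f->count++;
    return true;
}

/* Dequeue the oldest buffer pointer; returns false if the FIFO is empty. */
bool fifo_get(PtrFifo *f, void **buf)
{
    if (f->count == 0) {
        return false;
    }
    *buf = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return true;
}
```

Because only pointers travel through the FIFO, the producing thread keeps ownership of the underlying buffer pool, which mirrors the arrangement described above where the originating thread creates and maintains the buffers.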


Main Function

The main function in main.cpp performs the necessary initialization tasks, which include initiating a connection between the 6472 and the OMAP3530, sending the input files via TFTP from the SD card on the OMAP3530 to the 6472's DDR, sending configuration parameters for the algorithm modules (BPU, DPU) on the C6472, initiating the acquisition, process, display and control threads, and setting up the user interface. Once the main window is set up, this main thread becomes the event handling loop that makes function callbacks in response to touchscreen events. All trigger events (signals) and their associated callback functions (slots) are defined in mainwindow.cpp. The image() function is the most important slot in mainwindow.cpp and is triggered when the user touches the 'Start/Stop' button to begin/end image processing.


Acquisition Thread

The acquisition thread is responsible for receiving raw, pre-scan-converted ultrasound data from the C6472. The Mid End Interface consists of the MidEndIf_getBuffer and MidEndIf_putBuffer functions. The RDSP client application layer calls these functions in the acquisition thread it spawns. MidEndIf_getBuffer allocates buffer space for the incoming data. All data buffers that are shared between the OMAP3530's ARM and DSP need to be allocated in physically contiguous blocks of memory, which is what the BufTab_create() API call helps achieve for these input buffers. The MidEndIf_putBuffer call pushes the buffer pointers into a FIFO, Acq2ProcFifo. Once the process thread finishes processing the input frame, it releases the used buffer, which the acquisition thread can then reuse for the next input frame. Note that the acquisition thread continues to receive data at the acquisition frame interval defined in the C6472 application's Front End Interface. This mimics how an actual ultrasound system works, where the acquisition rate and the display rate are independent of each other.


Process Thread

The process thread function defined in UsProcess.c interacts with the C64x+ and engages the DSP to run the SCU processing algorithm module. As with the acquisition thread, since the output buffers will be shared between ARM and DSP, the process thread also allocates contiguous memory for these. When imaging starts, UsProcess_thrFxn begins and starts receiving input buffers from the acquisition thread on the Acq2ProcFifo/2 FIFOs. It passes the input and output buffer pointers to the UsProcess_scanConvertB and UsProcess_scanConvertColor calls for the B-mode and Color data, respectively. The UsProcess_scanConvertB and UsProcess_scanConvertColor functions invoke the DSP-side scan conversion method using the IUNIVERSAL UNIVERSAL_process() API, which in turn invokes the corresponding functions on the DSP side. Once scan conversion is complete, the process thread places the pointer to the final output buffer on the Proc2DispFifo. When the process thread is fully primed, it releases the display thread so that it can start accepting buffers on the Proc2DispFifo. Once free running, the thread continues to process input buffers as fast as it can and sends them to the display thread.


Display Thread

The display thread's function defined in UsDisplay.c transfers the final scan-converted ultrasound image frames that it receives on the Proc2DispFifo to the frame buffer of the FBDev display device driver. The Display_create() DMAI method opens the display device driver, and the display thread uses a handle to this to copy output buffers to the driver. To ensure execution efficiency, the display thread uses the H/W resizer, rather than memcpy(), to copy the output buffer to the display device driver. DMAI’s Framecopy API is used to perform this buffer copy. The display refresh rate is set at 60 fps, and the thread incorporates logic that skips a frame if frames are received from the process thread too fast, or repeats a frame if no new frame is available from the process thread. Once the display thread is triggered, the acquisition, process and display threads continue to run until the user touches the 'Stop' button. Since this is a demo with only a limited number of frames, the C6472 loops through the same input dataset, which the C6472 and OMAP3530 DSP cores continue to process and output in real time.
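The skip/repeat behaviour at each display refresh can be sketched as a small decision function. This is an illustration under assumptions, not the logic in UsDisplay.c: it assumes frames carry increasing sequence numbers, and the DisplayAction, DisplayDecision and choose_frame names are hypothetical.

```c
/* What the display should do on one refresh tick. */
typedef enum { SHOW_NOTHING, SHOW_REPEAT, SHOW_NEW } DisplayAction;

typedef struct {
    DisplayAction action;
    int frame;     /* sequence number to show, -1 if none          */
    int skipped;   /* frames produced but never shown on this tick */
} DisplayDecision;

/* last_shown: sequence number shown on the previous tick (-1 if none).
 * newest_available: newest sequence number produced so far (-1 if none). */
DisplayDecision choose_frame(int last_shown, int newest_available)
{
    DisplayDecision d;

    if (newest_available < 0) {
        d.action = SHOW_NOTHING;           /* nothing produced yet        */
        d.frame = -1;
        d.skipped = 0;
    } else if (newest_available == last_shown) {
        d.action = SHOW_REPEAT;            /* producer is slower: repeat  */
        d.frame = last_shown;
        d.skipped = 0;
    } else {
        d.action = SHOW_NEW;               /* producer is faster: jump to */
        d.frame = newest_available;        /* the newest, skip the rest   */
        d.skipped = newest_available - last_shown - 1;
    }
    return d;
}
```

Called once per refresh, this keeps the display rate independent of the processing rate: a slow producer leads to repeated frames and a fast one to skipped frames.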

Control Thread

The control thread calculates the ARM CPU load and DSP CPU load for the OMAP3530 using the Codec Engine calls to the getArmCpuLoad() and Engine_getCpuLoad() API functions, respectively. The CPU loading is updated every few seconds on the GUI's 'Home' tab and is represented as a percentage. The GUI also displays the DSP loading for the six cores on the C6472, which is sent from the C6472 as part of the B-mode frame's header.

DSP Module Integration

In this section we outline some of the steps we took to integrate the BPU, DPU and SCU algorithm modules to plug-and-play with the Codec Engine (CE) using IUNIVERSAL. As CE application developers, we can use simple API calls to pass data and configuration parameters between the CE Client and CE Server cores.

The BPU, DPU and SCU modules, as provided in the Medical Imaging Software Toolkit, are all XDAIS compliant. You can find many articles on this TI EP wiki that guide you through the process of making your codecs XDAIS compliant.

Since the steps for BPU, DPU and SCU modules are similar, we will focus on the SCU module to illustrate how IUNIVERSAL is setup in this scenario where the CE Client is the ARM core on OMAP3530 and the CE Server is the DSP core on the OMAP3530.

The application capitalizes on the IUNIVERSAL API's capability to make a remote procedure call from the ARM to the DSP algorithm, without the need to write any system software for the C64x+. To initiate the ARM-DSP communication, the ARM creates a Codec Engine instance with an Engine_open() call that resets, loads and starts the DSP Engine and returns a handle to it. Using this handle, hEng, the UNIVERSAL_create() API creates an SCU algorithm instance using parameters from the ISCU_Params structure that specify the size of memory to allocate on the DSP. It is important to note that the SCU header file at “midas/ultrasound/algos/scu/src/scu.h” is shared between the DSP and the ARM, which makes it possible for the two cores to share the same interpretation of SCU-specific structures and datatypes, even though they are compiled using different compilers. The UNIVERSAL_create() call returns a handle to the IUNIVERSAL algorithm instance, hAlg, as:

hAlg = UNIVERSAL_create(hEng, algName, (IUNIVERSAL_Params*)&ISCU_ALLOC_PARAMS);

Next, the ARM populates the SCU configuration structure, scuConfig_t, with parameter values from the user-provided configuration file/s, and with the tissue and flow color mapping lookup tables specified in "midas/ultrasound/userdata/LUTs". The SCU configuration structure is also defined in the SCU header file that ARM and DSP share. To send this configuration information to the DSP, the ARM passes a pointer to the scuConfig_t structure and the handle to the SCU algorithm instance hAlg, using the UNIVERSAL_control() API as shown in the code section below. UNIVERSAL_control() calls the corresponding SCU configuration function SCU_TI_control() on the DSP.

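/* Describe the single buffer that carries the SCU configuration to the DSP */
universalStatus.data.numBufs = 1;
universalStatus.data.descs[0].bufSize = sizeof(scuConfig_t);
universalStatus.data.descs[0].buf = (XDAS_Int8 *)(hUsProcess->pScuConfigB);
/* Send the configuration; on the DSP this invokes SCU_TI_control() */
status = UNIVERSAL_control(hUsProcess->hAlg, XDM_SETPARAMS, &universalDynParams, &universalStatus);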

It is important to note here that the ARM allocates any buffers it shares with the DSP in contiguous memory. This is necessary because, unlike the ARM, the DSP does not have a virtual memory manager and therefore assumes that the buffer is aligned to a 64-bit boundary and is contiguous. In this demo implementation, the allocated buffers reside in CMEM, a contiguous memory manager by TI. When the ARM allocates contiguous memory using the Memory_contigAlloc() or BufTab_create() DMAI API, a CMEM pool that fits the requested buffer size is reserved for this buffer. The number and total size of CMEM pools are defined in the 'loadmodules.sh' script (from /opt/midas/. on the target), which the ARM application runs during initialization. Since both ARM and DSP have access to this CMEM memory space, they only need to exchange buffer pointers.
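Two ideas from the paragraph above, the 64-bit alignment assumption and choosing a pool that fits the request, can be sketched in a few lines. This is a simplified model and not the CMEM API: the Pool structure, pick_pool() and is_aligned_64bit() are hypothetical, and the "smallest free pool buffer that fits" policy is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one pool of equally sized buffers. */
typedef struct {
    size_t   buf_size;    /* size in bytes of each buffer in the pool */
    unsigned free_bufs;   /* buffers still available in the pool      */
} Pool;

/* Return the index of the pool with the smallest buffer size that still
 * fits `request` and has a free buffer, or -1 if no pool qualifies. */
int pick_pool(const Pool *pools, size_t num_pools, size_t request)
{
    int best = -1;
    size_t i;

    for (i = 0; i < num_pools; i++) {
        if (pools[i].free_bufs == 0 || pools[i].buf_size < request) {
            continue;
        }
        if (best < 0 || pools[i].buf_size < pools[best].buf_size) {
            best = (int)i;
        }
    }
    return best;
}

/* True if the pointer sits on a 64-bit (8-byte) boundary. */
bool is_aligned_64bit(const void *p)
{
    return ((uintptr_t)p % 8u) == 0u;
}
```

A sizing consequence of this model is that each distinct buffer size used by the application should have a pool at least that large with enough free buffers, otherwise pick_pool() returns -1 and the allocation would fail.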

Once the SCU algorithm instance on the DSP is configured with the parameters it requires, it is ready to begin scan conversion processing. The ARM uses the UNIVERSAL_process() API to call the DSP-side SCU_TI_process() function; based on the scan conversion mode in the configuration, the DSP runs the corresponding processing function. Again, all input and output buffer pointers that the ARM and DSP exchange point to buffers allocated in CMEM.

status = UNIVERSAL_process(hUsProcess->hAlg, &inpBufDesc, &outBufDesc, NULL, &inArgs, &outArgs);

Finally, when the application exits, the UNIVERSAL_delete() API deletes the SCU algorithm instance hAlg, which deallocates all the dynamic memory associated with that instance. The algorithm instance deletion is accompanied by an Engine_close() call, which deletes the Codec Engine instance created for ARM-DSP interaction.


To summarize, the CE and IUNIVERSAL APIs allow application developers to plug XDAIS-compliant algorithm modules into their ARM application seamlessly. As illustrated in the previous section, IUNIVERSAL and CE are used in a similar way on the C6472 to ease multicore programming.
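
The ARM-side call order described in this section (open the engine, create the algorithm instance, send the configuration, process, delete, close the engine) can be summarized as a small state machine. The sketch below is only a model of that ordering for illustration: every type and function name in it is hypothetical, none of them are Codec Engine or IUNIVERSAL symbols, and it makes no calls to the DSP. Allowing the configuration step to be repeated after the first configuration is an assumption, not something the text above states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states of the ARM-side session (illustrative names only). */
typedef enum {
    LC_CLOSED,      /* no engine open */
    LC_ENGINE_OPEN, /* engine opened, no algorithm instance yet */
    LC_CREATED,     /* algorithm instance created, not yet configured */
    LC_CONFIGURED   /* configuration sent, ready to process */
} lifecycle_state;

/* Hypothetical events, one per API step named in the text. */
typedef enum {
    EV_ENGINE_OPEN, EV_CREATE, EV_CONTROL,
    EV_PROCESS, EV_DELETE, EV_ENGINE_CLOSE
} lifecycle_event;

/* Applies one event. Returns 1 and updates *state when the step is allowed
 * in the current state; returns 0 and leaves *state unchanged otherwise. */
static inline int lifecycle_step(lifecycle_state *state, lifecycle_event ev)
{
    assert(state != NULL);
    switch (ev) {
    case EV_ENGINE_OPEN:
        if (*state != LC_CLOSED) return 0;
        *state = LC_ENGINE_OPEN;
        return 1;
    case EV_CREATE:
        if (*state != LC_ENGINE_OPEN) return 0;
        *state = LC_CREATED;
        return 1;
    case EV_CONTROL:
        /* Assumption: reconfiguring an already configured instance is OK. */
        if (*state != LC_CREATED && *state != LC_CONFIGURED) return 0;
        *state = LC_CONFIGURED;
        return 1;
    case EV_PROCESS:
        /* Processing is only valid once the instance is configured. */
        return *state == LC_CONFIGURED;
    case EV_DELETE:
        if (*state != LC_CREATED && *state != LC_CONFIGURED) return 0;
        *state = LC_ENGINE_OPEN;
        return 1;
    case EV_ENGINE_CLOSE:
        if (*state != LC_ENGINE_OPEN) return 0;
        *state = LC_CLOSED;
        return 1;
    }
    return 0;
}
```

A wrapper around the real calls could consult such a model in debug builds to catch, for instance, a process call issued before the configuration has been sent.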

Get Ultrasound v3.0

Demo Application Source Code

Hardware Setup


  1. Mistral OMAP3530 EVM Rev G
  2. C6472 EVM
  3. Linux Development PC with Ubuntu 10.04 LTS
  4. Windows PC
  5. 3 Regular Ethernet Cables
  6. Gigabit Ethernet Switch
  7. Serial RS232 Cable
  8. USB cable for on-board emulator OR XDS510/XDS560 or similar


  1. Connect one Ethernet cable from the OMAP's Ethernet port to the Gigabit Ethernet switch.
  2. Connect another Ethernet cable from the C6472's lower Ethernet port to the Gigabit Ethernet switch.
  3. Connect the on-board emulator (or XDS510 or similar) to the C6472 and the Windows PC.
  4. Connect one end of the serial cable to the OMAP's UART1/2 port and the other end to the Linux PC (if using minicom).
  5. If using NFS, the Linux PC and the office/home network should also be connected to the same switch via Ethernet.
  6. Ensure that the EVM SW4 settings are set as appropriate for NFS boot.

MIDAS Ultrasound3 Setup.jpg

Software Setup

A. Multicore DSP (Mid End)

All development for the multicore DSP (C6472) is done on the Windows PC.

  1. Environment Variables
    a. Windows XP: Right-click on My Computer --> Properties --> Advanced tab --> Environment Variables --> New
    b. Define new user variable, where variable name is TI_INSTALL_DIR and variable value is C:\ti
    c. Define another new user variable, where variable name is IQMATH_LIB_DIR and variable value is C:\ti\c64xplus-iqmath_2_01_04_00

  2. TI Code Composer Studio IDE (CCS)
    Download CCS v4.2.1 from http://processors.wiki.ti.com/index.php/Download_CCS#Code_Composer_Studio_Version_4_Downloads and install as C:\ti\ccsv4

  3. Perl and 7-Zip
    Perl and 7-Zip are required to run the automated build script that is discussed later.
    a. Install ActivePerl from http://www.perl.org/get.html. Installs as C:\Perl by default.
    b. Once installation is complete, please ensure that C:\Perl\site\bin and C:\Perl\bin are in your Path. To check this, right click on My Computer --> Properties --> Advanced tab --> Environment Variables. The system variable Path should include these paths.
    c. Install 7-Zip from http://www.7-zip.org/. Installs as C:\Program Files\7-Zip by default. Please ensure that

  4. TI Software Components Setup
    Note that it is assumed in the following steps that C:\ti is TI_INSTALL_DIR

    a. Code Generation Tools 7.0.3
    Download from https://www-a.ti.com/downloads/sds_support/TICodegenerationTools/download.htm
    Double-click the installer and let this install in the default location i.e. C:\Program Files\Texas Instruments\C6000 Code Generation Tools 7.0.3

    b. XDC Tools
    Install as C:\ti\xdctools_3_20_06_81

    c. Codec Engine
    Extract as C:\ti\codec_engine_3_20_01_18

    d. SYS/BIOS
    Install as C:\ti\bios_6_30_03_46

    e. Framework Components
    Extract as C:\ti\framework_components_3_20_01_26

    f. IPC
    Install as C:\ti\ipc_1_22_04_25

    g. MCSDK
    Install as C:\ti\mcsdk_1_00_00_08
    (When asked to choose components, leave default setting i.e. all components checked)

    h. IQMath Library 2.1.4
    Install as C:\ti\c64xplus-iqmath_2_01_04_00. Note that this corresponds to IQMATH_LIB_DIR defined above.

    At the end of these steps your TI_INSTALL_DIR folder should look as follows:
    Software Dependencies

  5. If you have not already done so, download the source code for MIDAS Ultrasound v3.0 from the GForge page. Unzip midas_usound_demo3_rel.zip as C:\ti\midas_usound_demo3_rel.
  6. Start CCS. On startup, CCS will ask you to choose a workspace. Select "C:\ti\midas_usound_demo3_rel\release\usound_demo3\ccsv4_workspace". If CCS is starting for the first time, it will autodiscover the installed software components. To make sure that all software components have been found, go to Window->Preferences->CCS->RTSC->Products. The snapshot should look similar to the one below, with TI_INSTALL_DIR (C:/ti) listed under "Tool Discovery Path" and all components listed under "Discovered Tools".

    RTSC Components

  7. Build, Load and Run
    a. Open midas_usound_demo3_rel\common_xdcpaths.mak and check that the paths defined are as expected. If you followed the recommended paths in the instructions above, you should not have to make any changes to this file.
    b. Use Wordpad (or a similar text editor) to open the file midas_usound_demo3_rel\build6472.pl and check that the paths defined are as expected. If you followed the recommended paths in the instructions above, you should not have to make any changes to this file.
    c. Open a command window and 'cd' to the midas_usound_demo3_rel directory.
    d. Type 'perl build6472.pl' to execute the Perl autobuild script. This will automatically build the .out files for all six cores of the C6472, viz. midendapp.out, server1.out, server2.out, server3.out, server4.out and server5.out. Note that midendapp.out corresponds to the main midend application, which runs on core 0; server1.out corresponds to the BPU block and server2.out to the DPU block, while server3.out, server4.out and server5.out have no significance in the current demo.
    e. To load the cores with the relevant .out files, open CCS --> Load the target configuration that corresponds to your emulator --> 'Connect' all cores of the C6472 EVM target --> Load cores with corresponding .out files.
    f. Once all cores are loaded, hit 'Run.' This will execute the program on C6472 and you should see the following message on the CCS console:
HPMP EVM On-board Networking Demo
EVM in StaticIP mode at
Set IP address of PC to
5:48 ( 7%) 4:96 ( 12%) 2:128 ( 8%) 0:256

1:512 ( 16%) 0:1536 0:3072

(12288/49152 mmAlloc: 12/0/0, mmBulk: 0/0/0)

1 blocks alloced in 512 byte page
4 blocks alloced in 96 byte page
2 blocks alloced in 128 byte page
5 blocks alloced in 48 byte page

emac_init: core 0, port 0, total number of channels/MAC addresses: 1/1
MAC addresses configured for channel 0:
emac_open core 0 port 0 successfully
Registration of the EMAC Successful, waiting for link up ..
Network Added: If-1:
Service Status: HTTP : Enabled : : 000
Mid End process started
MidEnd_gatherTask started
Successfully created BPU instance
Successfully created DPU instance
Port 0 Link Status: 1000Mb/s Full Duplex on PHY 24
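
The relationship between cores and .out files described in step 7d above can be written down as a small lookup table, for example when scripting or documenting the load step. This is a sketch only: the type and function names are hypothetical, and the mapping of serverN.out to core N is an assumption (the text above only states that midendapp.out runs on core 0).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define C6472_NUM_CORES 6

/* Hypothetical record: which image a core loads and what it does. */
typedef struct {
    const char *image; /* .out file produced by build6472.pl */
    const char *role;  /* block the image implements in this demo */
} core_image;

/* Index = core number. Assumption: serverN.out is loaded on core N. */
static const core_image k_core_images[C6472_NUM_CORES] = {
    { "midendapp.out", "main midend application" },
    { "server1.out",   "BPU" },
    { "server2.out",   "DPU" },
    { "server3.out",   "unused in this demo" },
    { "server4.out",   "unused in this demo" },
    { "server5.out",   "unused in this demo" },
};

/* Returns the table entry for a core, or NULL when core is out of range. */
static inline const core_image *image_for_core(int core)
{
    if (core < 0 || core >= C6472_NUM_CORES) {
        return NULL;
    }
    return &k_core_images[core];
}

/* Returns the core number that loads `image`, or -1 when it is not listed. */
static inline int core_for_image(const char *image)
{
    assert(image != NULL);
    for (int core = 0; core < C6472_NUM_CORES; ++core) {
        if (strcmp(k_core_images[core].image, image) == 0) {
            return core;
        }
    }
    return -1;
}
```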

B. System-on-Chip (Back End)

All development for the System-on-Chip (OMAP3530) is done on the Linux PC (Ubuntu 10.04).

  1. DVSDK Setup
    A. -- Download --
    File is 'dvsdk_omap3530-evm_4_01_00_09_setuplinux' from http://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/dvsdk/DVSDK_4_00/4_01_00_09/index_FDS.html
    B. -- Install --
    'linuxHost$ chmod +x dvsdk_omap3530-evm_4_01_00_09_setuplinux'
    'linuxHost$ ./dvsdk_omap3530-evm_4_01_00_09_setuplinux'
    Follow instructions as directed to setup DVSDK with Network Filesystem. If using defaults, this should install as ~/ti-dvsdk_omap3530-evm_4_01_00_09.
    C. -- Rebuild DVSDK and Setup Environment --
    'linuxHost$ cd ~/ti-dvsdk_omap3530-evm_4_01_00_09'
    'linuxHost$ make clean && make'
    'linuxHost$ ./setup.sh'
    After this step, among other development environment necessities, you should have NFS set up at ~/workdir/filesys, and minicom set up on your Linux host with access to the OMAP terminal.
  2. IQMath Setup
    Download IQMath from http://focus.ti.com/docs/toolsw/folders/print/sprc542.html and install
  3. Download MIDAS package and unzip
    Go to https://gforge.ti.com/gf/project/med_ultrasound/frs/, and download MIDASUltrasound3.0 source code package (midas_usound_demo3_rel.zip)
    'linuxHost$ unzip midas_usound_demo3_rel.zip'
  4. Setup Environment
    Edit Paths as per your environment in file ~/ti_midas/miDAS/ultrasound/demo3/backend/Paths.mak
  5. Build MIDAS
    'linuxHost$ cd ~/ti_midas/miDAS/ultrasound/demo3/backend'
    'linuxHost$ make setup'
    'linuxHost$ make backend'
    This will build both the 'server.x64P' and the OMAP application 'ultrasound' and will copy them to the ${EXEC_INSTALL_DIR} specified in Paths.mak.
  6. Calibrate Touchscreen
    'target$ ts_calibrate'
    Touch the crosshairs that appear on the screen to help record touchscreen calibration information
  7. Run MIDAS
    Run the C6472 application from CCS as described in the earlier section. Ensure that you see the last message "Port 0 Link Status: 1000Mb/s Full Duplex on PHY 24" on the CCS console. Once you see this message, perform the following on the OMAP console (on minicom):
    'target$ cd /opt/midas'
    'target$ ./ultrasound -qws -display transformed:Rot270 &'
    You will see the MIDAS welcome screen with the message "Initializing" on the OMAP's LCD screen. After some time you should see a user interface on the OMAP screen. Press the "Start" button on the OMAP to begin the demo. While the demo is running, the B-mode/Color button allows you to switch between B-mode and Color Flow modes. The Contrast drop-down menu allows you to choose between three contrast settings (which are internally linked to different parameters for the BPU module). The Hide/Show button allows you to hide the user interface. The UI transparency control switches the user interface between transparent, translucent and opaque.

Algorithm Source Code

The algorithm source code for BPU, DPU and SCU and other ultrasound processing modules is available at http://www.ti.com/tool/s2meddus


Post questions to TI's Support Portal at e2e.ti.com. Please include the keyword 'MIDAS' in your title.


System and equipment manufacturers and designers are responsible to ensure that their systems (and any TI devices incorporated in their systems) meet all applicable safety, regulatory and system-level performance requirements. All application-related information on this website (including application descriptions, suggested TI devices and other materials) is provided for reference only. This information is subject to customer confirmation, and TI disclaims all liability for system designs and for any applications assistance provided by TI. Use of TI devices in life support and/or safety applications is entirely at the buyer's risk, and the buyer agrees to defend, indemnify and hold harmless TI from any and all damages, claims, suits or expense resulting from such use.

All software is licensed under BSD with the following terms and conditions:

Copyright (C) 2011 Texas Instruments Incorporated - http://www.ti.com/
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of Texas Instruments Incorporated nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
This software is provided by the copyright holders and contributors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the copyright owner or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.