Beagle Board Challenge: SuperBeagle

From Texas Instruments Wiki

Introduction

The Beagleboard contains an ARM processor, a DSP, and a graphics accelerator for compute-intensive multimedia applications. However, there is currently very little support for using the DSP on the Beagleboard. Our project, DSP for Dummies, aims to provide the support, documentation, and code infrastructure that a beginner needs to use DSP/Bridge to interface with the DSP on a Beagleboard.

Why is this work important?

I don't think anyone understands DSP/Bridge well enough
-A TI Beagleboard engineer


Our goal was to change that perception. For the most part, we succeeded.

DSP for Dummies Overview


DSP for Dummies (DFD) is a framework designed to allow programmers to easily interface their programs to the DSP.

DSP coding comes in two parts:

The binary that runs on the DSP

DFD takes care of cache coherency, message parsing, and similar details, and presents a simple interface to the writer. You simply edit dummy_dsp.h and add the function(s) that you would like to run on the DSP. Each function is associated with a message opcode, which you also define in that file. Each function receives the DMA input buffer, the DMA output buffer, a global variable structure that is shared among all functions, and two 32-bit arguments that are passed in from the ARM-side software.

DFD also provides an optional "idle" function that is run when the DSP is waiting for messages to arrive.
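The DSP-side interface described above can be pictured as a table that maps opcodes to functions. The following is only a sketch under assumptions: the names dfd_globals, dfd_handler, OP_VECTOR_ADD, and dfd_dispatch are hypothetical and are not taken from dummy_dsp.h, whose actual declarations may look different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state shared by all DSP functions. */
struct dfd_globals {
    uint32_t calls;  /* number of handlers run so far */
};

/* Hypothetical handler signature: DMA input buffer, DMA output buffer,
 * shared globals, and the two 32-bit message arguments. */
typedef uint32_t (*dfd_handler)(const void *in, void *out,
                                struct dfd_globals *g,
                                uint32_t arg1, uint32_t arg2);

/* Hypothetical opcodes; one per DSP function. */
enum { OP_VECTOR_ADD = 0, OP_COUNT };

/* Adds two vectors of arg1 elements stored back to back in the input
 * buffer and writes the sums to the output buffer. arg2 is unused. */
uint32_t vector_add(const void *in, void *out, struct dfd_globals *g,
                    uint32_t arg1, uint32_t arg2)
{
    const int32_t *a = in;
    const int32_t *b = a + arg1;
    int32_t *c = out;
    uint32_t i;

    (void)arg2;
    for (i = 0; i < arg1; i++)
        c[i] = a[i] + b[i];
    g->calls++;
    return arg1;  /* number of results written */
}

/* Opcode-to-function table. */
static const dfd_handler handlers[OP_COUNT] = { vector_add };

/* Runs the function registered for the opcode; returns 0 if unknown. */
uint32_t dfd_dispatch(uint32_t opcode, const void *in, void *out,
                      struct dfd_globals *g, uint32_t arg1, uint32_t arg2)
{
    if (opcode >= OP_COUNT || handlers[opcode] == NULL)
        return 0;
    return handlers[opcode](in, out, g, arg1, arg2);
}
```

Adding a new DSP function in this sketch means writing one more handler, adding one opcode to the enum, and adding one entry to the table.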

The binary that runs on the ARM processor

On the ARM side of things, you need to include the required header files and call the DSP initialization code. For everything else, just write your code as usual. The main() function is located in dummy_arm.c.

When you wish to perform some work on the DSP, just write the data into the DMA input buffer and compose a message. Messages consist of a command (which DSP function to run, specified in dummy_dsp.h) and two arguments (arg1, arg2). One use of an argument is to tell the DSP how much data is in the DMA input buffer, since the buffer itself has a constant size no matter how much of it is filled. Then, call dsp_send() to send the data and message to the DSP.
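As a rough illustration of the step before dsp_send(), the sketch below fills a stand-in input buffer and builds a message whose arg1 carries the element count. Everything here is an assumption made for illustration: struct dfd_msg, dfd_prepare, dma_in, and the 4096-byte buffer size are hypothetical, and the real signature of dsp_send() should be taken from the source code, not from this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DMA_BUF_BYTES 4096u  /* assumed fixed buffer size */

/* Hypothetical message layout: a command plus two 32-bit arguments. */
struct dfd_msg {
    uint32_t cmd;
    uint32_t arg1;
    uint32_t arg2;
};

/* Stand-in for the DMA input buffer shared with the DSP. */
static uint8_t dma_in[DMA_BUF_BYTES];

/* Copies n 32-bit values into the input buffer and fills in the message.
 * arg1 tells the DSP how many elements are valid, because the buffer
 * size itself never changes. Returns 0 on success, -1 if the data does
 * not fit. */
int dfd_prepare(struct dfd_msg *msg, uint32_t cmd,
                const int32_t *data, uint32_t n)
{
    if (n > DMA_BUF_BYTES / sizeof(int32_t))
        return -1;
    memcpy(dma_in, data, n * sizeof(int32_t));
    msg->cmd = cmd;
    msg->arg1 = n;
    msg->arg2 = 0;
    return 0;
}
```

After a successful dfd_prepare() call, the ARM program would hand the message to dsp_send().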

DFD implements two kinds of receive functions: a blocking receive and a non-blocking receive. A blocking receive waits until the DSP completes the operation and replies, and only then returns. A non-blocking receive returns immediately with a boolean value that tells you whether the DSP's reply has arrived yet. If not, you can continue doing some work on the ARM processor while waiting for the DSP to complete its task. This enables you to easily run the DSP and ARM processor in parallel, maximizing performance.


DSP for Dummies Video

DSP Programming Model

This section will discuss various methods to effectively utilize the DSP within our infrastructure.

The figure below shows the original code that does not use the DSP. Everything is executed on the ARM core:

Drawing1.png



To improve performance, we would like to run tasks on both the ARM core and the DSP. In the simplest programming model, the ARM processor is the master and the DSP is the slave. Whenever the ARM core has work for the DSP to do, it sends a command to the DSP along with the data to be processed. The DSP performs the operation and sends the result back to the ARM core. The easiest thing to do is to send the work to the DSP and wait for the DSP to finish before continuing:

Drawing2.png


However, this assumes that the task that you sent to the DSP runs much faster on the DSP than on the ARM core. Frequently, this is not a good assumption. DSP code can be tricky to write. Depending on the algorithm and implementation, your final code might run at the same speed as on the ARM core, or even slower. Keep in mind that there is some overhead for the data to be sent to the DSP and for the DSP reply data to return.

Therefore, what we would really like to do is to send a task to the DSP, and then continue performing work on the ARM core until the DSP finishes its task. Then, we can send more tasks to the DSP. This means that when the ARM core calls the dsp_receive() function, the function must not wait for the DSP to finish before returning; a function that waits in this way is called a blocking receive. Instead, we need a non-blocking receive, which returns immediately and tells you whether or not the DSP is complete. If the DSP is complete, it behaves the same way as the blocking receive. Both blocking and non-blocking receives have been implemented in DSP for Dummies. An example is shown below:

Drawing3.png
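The difference between the two receive styles can also be sketched in code. The sketch below does not use the DFD API: the DSP is simulated by a counter, and fake_dsp, fake_try_receive, fake_receive, and overlap are hypothetical names chosen for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated DSP: the reply becomes available after polls_left polls. */
struct fake_dsp {
    uint32_t polls_left;
    uint32_t result;
};

/* Non-blocking receive: returns immediately. The return value is true
 * once the reply has arrived, in which case *result is filled in. */
bool fake_try_receive(struct fake_dsp *d, uint32_t *result)
{
    if (d->polls_left > 0) {
        d->polls_left--;
        return false;
    }
    *result = d->result;
    return true;
}

/* Blocking receive built on top: spins until the reply arrives. */
uint32_t fake_receive(struct fake_dsp *d)
{
    uint32_t r = 0;

    while (!fake_try_receive(d, &r))
        ;  /* nothing useful happens on the ARM side while waiting */
    return r;
}

/* Overlapped version: one unit of ARM-side work per unsuccessful poll.
 * Returns how many units of ARM work were done while waiting. */
uint32_t overlap(struct fake_dsp *d, uint32_t *result)
{
    uint32_t arm_units = 0;

    while (!fake_try_receive(d, result))
        arm_units++;  /* placeholder for real ARM-side work */
    return arm_units;
}
```

With the blocking version the waiting time is lost. With the overlapped version the same waiting time is spent on ARM-side work.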


This method is good, but what if you want to send a large number of tasks to the DSP at once, with each task taking a very different amount of time? The previous schemes are only effective when sending one large task to the DSP. What if you have multiple small tasks? Communication latency between the DSP and ARM core can be a performance bottleneck, so we would like to hide it. In this case, instead of the DSP only doing work when the ARM core asks it to do so, the DSP keeps a task queue that holds all of the tasks waiting to be completed. The DSP's main function is a loop in which the following operations are performed:

  1. Message check: have any new tasks from the ARM core arrived? If so, place these tasks at the end of the task queue and acknowledge the ARM core.
  2. Get task: dequeue the first task from the task queue.
  3. Complete task: do the work required by the task.
  4. Put results in result queue: Since the jobs are potentially small, we don't want to constantly send results back as well (this can hurt performance). Instead, we can aggregate the results and only send them out once the queue contains over a certain number of results.
  5. Result queue threshold check: if the queue contains more results than the threshold number, we send the results in bulk back to the ARM core.


DSP for Dummies supports this method as well: simply use the idle function to run the loop algorithm described above. Although this is the most complicated of the three schemes presented here, it also offers the highest performance.

Drawing4.png
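The five steps of the task queue loop can be sketched as follows. This is a simplified model and not the DFD implementation: the queue capacity, the threshold, the placeholder "work" (squaring the argument), and all names are assumptions. The sketch flushes as soon as the threshold is reached so that the result array cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_CAP        16u  /* assumed task queue capacity */
#define RESULT_THRESHOLD 4u   /* assumed batch size for results */

struct task {
    uint32_t opcode;
    uint32_t arg;
};

struct dsp_state {
    struct task queue[QUEUE_CAP];        /* ring buffer of pending tasks */
    uint32_t head;                       /* index of the oldest task */
    uint32_t count;                      /* number of pending tasks */
    uint32_t results[RESULT_THRESHOLD];  /* results not yet sent */
    uint32_t n_results;
    uint32_t batches_sent;               /* stand-in for messages to the ARM */
};

/* Step 1: place a newly arrived task at the end of the queue.
 * Returns 0 on success, -1 if the queue is full. */
int enqueue_task(struct dsp_state *s, struct task t)
{
    if (s->count == QUEUE_CAP)
        return -1;
    s->queue[(s->head + s->count) % QUEUE_CAP] = t;
    s->count++;
    return 0;
}

/* Steps 2 to 5: one loop iteration.
 * Returns 1 if a task was run, 0 if the queue was empty. */
int run_one_task(struct dsp_state *s)
{
    struct task t;

    if (s->count == 0)
        return 0;
    t = s->queue[s->head];                       /* step 2: get task */
    s->head = (s->head + 1) % QUEUE_CAP;
    s->count--;
    s->results[s->n_results++] = t.arg * t.arg;  /* steps 3 and 4 */
    if (s->n_results >= RESULT_THRESHOLD) {      /* step 5: threshold check */
        s->batches_sent++;                       /* send results in bulk */
        s->n_results = 0;
    }
    return 1;
}
```

In a DFD program, a loop like this would be driven from the idle function, with step 1 triggered whenever a message arrives from the ARM core.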

Performance Analysis

To analyze the performance of the DSP compared to the ARM core, we implemented a simple matrix multiplication algorithm on both the DSP and ARM core. The matrices contained 32-bit integers. The performance results were very surprising:

Mmul perf.png

Indeed, without any optimizations, DSP performance is sorely lacking compared to just running on the ARM core. However, after adding in a few optimizations, performance improved such that for large matrix sizes, the DSP performed better than the ARM core. Therefore, we can conclude the following: first, DSP coding is tricky and requires special considerations to achieve acceptable performance. Second, using the "non-blocking receives" or "task queue" programming models described above is the only way to achieve great performance: even if the DSP performs at the same speed as the ARM core, we can run tasks on both in parallel, thus potentially cutting overall runtime in half.

Note: the strange results could also be due to a defective board, which we have reason to suspect is the case.
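For reference, a straightforward matrix multiply over 32-bit integers looks roughly like the sketch below. It is not the kernel that was benchmarked; the function name matmul_i32 and the row-major layout are assumptions, and the optimized variant is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Computes c = a * b for n x n row-major matrices of 32-bit integers. */
void matmul_i32(const int32_t *a, const int32_t *b, int32_t *c, size_t n)
{
    size_t i, j, k;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            int32_t sum = 0;

            /* Dot product of row i of a with column j of b. */
            for (k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

The inner loop walks down a column of b with a stride of n elements, which is the kind of access pattern that typically needs attention when tuning a kernel like this.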

Usage Instructions

First of all, you need to install a Linux operating system that supports DSP/Bridge. This almost certainly means that you will need to recompile the Linux kernel. Almost all of the help available online on this subject (how to compile Linux for the DSP) is woefully out of date, so be sure to follow instructions that are current!

Next, you'll need to download the proper compilers. You'll need a total of 3 compilers: a GNU/Linux ARM compiler, an EABI ARM compiler, and a DSP compiler. Follow the directions here for where to download them and how to install them. Also, follow the directions to reformat your SD card to contain two partitions: a boot partition and a rootfs partition. However, the other directions on the site are woefully out of date! Don't follow them, because they won't work. Believe me, I tried!

Next, follow the directions here to compile Linux and copy the DSP/Bridge drivers to your Beagleboard. If you have any problems running the commands, be sure to first try them with 'sudo'. The person in charge of this buildflow is Robert Nelson. You can find him almost any day on the Beagleboard IRC channel (his handle is rcn-ee). He is extremely helpful and knowledgeable, so be sure to go to him for help (I recommend going on the IRC channel first so that you won't bother him too much).

Afterwards, you should be able to see dspbridge when you type 'cat /proc/interrupts' and 'lsmod' on the Beagleboard. Now you're ready to use DSP for Dummies! Simply download the source code below and follow the directions. Be sure to download doffbuild (the URL is contained in the README) and edit the Makefile to point to the C6X compiler, which you downloaded earlier. Type 'make' to build the example. Then, copy 'dummy' and 'dummy.dll64P' to your Beagleboard's rootfs as directed by the README.

To run the example, type 'sudo ./dummy <matrix size> <0 for running on the DSP> <algorithm>'. To run with a 128x128 matrix on the DSP using the optimized algorithm, run 'sudo ./dummy 128 0 3'. The unoptimized algorithm is 0 and a simple vector add algorithm is 1.

Happy code writing!

Source Code

The source code is an extensively modified version of Felipe Contreras' dsp-dummy code.

File:Dsp-for-dummies.tar.gz

The Overall Experience

The problem with embedded processors seems to be the overall lack of cohesive documentation on the web. When we first started the project, people had still not gotten the DSP running on Ubuntu. We struggled with compiling the Linux kernel for the first time ever, and then wrestled with applying kernel patches. There were numerous errors in the old wiki page (Beagleboard_Ubuntu), and it was frustrating when nothing worked. However, the IRC channel came in handy. People like Robert Nelson were very helpful in figuring out what was going wrong. In many cases, we were able to help them debug some problems that they didn't know about.

However, the whole experience was very rewarding, especially since not many people have experience in this subject. Robert was able to point us to seemingly the only person in TI who has extensive experience with DSP/Bridge, Omar Ramirez.


Acknowledgements

We'd like to thank Cathy Wicks and Tenequa from TI, as well as the other people behind the scenes, for hosting the contest. Also, we'd like to thank Robert Nelson and Omar Ramirez for their help in debugging the numerous problems that we encountered. We'd also like to thank Felipe Contreras for writing dsp-dummy and providing some assistance.

Contact Us

Dan Zhang - dan.zhang@utexas.edu
Jenny Huynh - jennyxhuynh@gmail.com