Tuning Audio Latency on C6747

From Texas Instruments Wiki

Introduction

Latency is an important consideration in any DSP audio system. This article describes techniques for tuning your system to balance the competing requirements of responsiveness and general processing efficiency. We'll approach the problem from three starting points:

  1. An example application that comes with the C6747 PSP driver package
  2. An audio application generated by the C6Flo graphical development tool
  3. A low-level audio application that controls the audio and EDMA devices without any intermediate driver

PSP Audio Example Application

For more information on how to download and install the PSP drivers, please refer to the C6747 getting started guide. You should be able to find the PSP audio example application in a path similar to this:

C:\OMAPL137_dsp_1_XX_XX_XX\pspdrivers_01_XX_XX_XX\packages\ti\pspiom\examples\evm6747\audio\build\audioSample.pjt

All of the changes we'll make are in a single source file, audioSample_io.c. Take a moment to familiarize yourself with the contents of this file. The only processing done by this application is a simple memcpy call, which could easily be replaced by "real" processing between the input and output buffers.

Cutting Out Extra Audio Buffers

Before we do anything else, make sure that the macro NUM_BUFS is set to 2.

#define NUM_BUFS    2   /* Num Bufs to be issued and reclaimed */

The default value, 4, adds a lot of extra latency without any real benefit. Using 2 buffers is the classic "ping pong" arrangement, where one buffer is filled/consumed by the audio peripheral while the second is being processed. In this application, the audio driver and EDMA handle the first buffer without any CPU loading.
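The ping-pong arrangement can be sketched in a few lines of C. This is only an illustration of the ownership swap, not the PSP driver's implementation: the PingPong type, pingpong_init, and pingpong_swap are hypothetical names, and BUFLEN is kept tiny for readability.

```c
#include <string.h>

#define NUM_BUFS 2   /* ping and pong */
#define BUFLEN   8   /* kept tiny for illustration only */

/* Hypothetical bookkeeping for a ping-pong pair; not a PSP driver type. */
typedef struct {
    int data[NUM_BUFS][BUFLEN];
    int io_idx;      /* buffer currently owned by the peripheral/EDMA */
} PingPong;

/* Start with buffer 0 owned by the peripheral and both buffers cleared. */
void pingpong_init(PingPong *pp)
{
    memset(pp, 0, sizeof(*pp));
}

/* Call when the peripheral finishes its buffer. Ownership swaps, and the
 * index of the buffer that is now free for the CPU to process is returned. */
int pingpong_swap(PingPong *pp)
{
    int ready = pp->io_idx;
    pp->io_idx = (pp->io_idx + 1) % NUM_BUFS;
    return ready;
}
```

Each call to pingpong_swap hands the just-completed buffer to the CPU while the peripheral moves on to the other one, so the CPU always has one full buffer period to finish its processing.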

Changing the Audio Buffer Size

The most basic tradeoff between latency and performance is the selection of an audio buffer size. In this application, the buffer length is controlled by another macro, BUFLEN.

#define BUFLEN      2560 /* number of samples in the frame   */
#define BUFALIGN    128  /* alignment of buffer for L2 cache */

Note: Do not change BUFALIGN! 128-byte alignment is required by the audio driver for any buffer size.

Since we're operating at a sampling frequency of 44.1 kHz, the default buffer length works out to over 58 milliseconds. This is a pretty long time in audio terms. Our overall latency can be described with a simple formula:

L = 2 * T + d

Where L is our audio latency, T is our buffer period, and d is some additional delay introduced by the audio hardware itself (codec chip, etc.). Even assuming zero hardware delay (unlikely), the default buffer length gives a lag of at least 116 ms. This is very noticeable, even for basic applications. Table 1 shows the actual audio latency you can achieve with this application simply by reducing the buffer length.
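As a quick check of the arithmetic, the formula can be written as a pair of small C helpers. The function names are illustrative only and are not part of the example application.

```c
#include <stdio.h>

/* Buffer period T in milliseconds for `samples` samples at `fs_hz`. */
double buffer_period_ms(unsigned samples, double fs_hz)
{
    return 1000.0 * (double)samples / fs_hz;
}

/* Estimated latency L = 2 * T + d, in milliseconds.
 * `hw_delay_ms` is d, the delay added by the codec and other hardware. */
double latency_ms(unsigned samples, double fs_hz, double hw_delay_ms)
{
    return 2.0 * buffer_period_ms(samples, fs_hz) + hw_delay_ms;
}

/* Print the estimate for the default BUFLEN of 2560 at 44.1 kHz,
 * assuming zero hardware delay. */
void print_default_latency(void)
{
    printf("T = %.2f ms, L >= %.2f ms\n",
           buffer_period_ms(2560, 44100.0),
           latency_ms(2560, 44100.0, 0.0));
}
```

With BUFLEN at 2560 and a 44.1 kHz sampling rate, the period comes out just over 58 ms and the latency bound just over 116 ms, matching the figures quoted above.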

Table 1: PSP Example Latency and CPU Load
Buffer Length (samples)   Latency (ms)   CPU Load
512                       25.8           1.86%
256                       14.2           2.46%
128                       8.24           4.03%
64                        5.4            8.66%
32                        3.88           13.3%
16                        3.2            25.3%

Note that the smallest buffer size listed is 16 samples, which is the smallest size allowed by our audio driver. The table also includes the CPU load required just to maintain the audio stream at each buffer size. The smaller your buffers become, the more often the DSP needs to call the SIO APIs that keep the buffers loaded.
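To see why the load climbs, it helps to work out how often buffers turn over. The helpers below are a rough sketch with illustrative names. The per-buffer figure assumes the reported load is an average over whole buffer periods, which the measurements above do not state explicitly.

```c
/* How many buffers per second the application must issue and reclaim. */
double buffers_per_second(unsigned samples, double fs_hz)
{
    return fs_hz / (double)samples;
}

/* Rough CPU time spent per buffer, in microseconds, for a given load
 * percentage. Assumes the load is averaged over whole buffer periods. */
double cpu_us_per_buffer(unsigned samples, double fs_hz, double load_percent)
{
    double period_us = 1.0e6 * (double)samples / fs_hz;
    return period_us * load_percent / 100.0;
}
```

At 44.1 kHz, a 512-sample buffer turns over roughly 86 times per second, while a 16-sample buffer turns over more than 2700 times per second, so any fixed per-buffer cost is paid about 32 times as often.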

Disabling the McASP Hardware FIFO

There's one additional change we can make to reduce our latency. The audio data enters our application via the McASP peripheral. It collects in a FIFO before being copied to normal memory via EDMA. If we disable the FIFO, audio samples will be copied directly to memory instead. This can slightly improve our latency at the cost of smaller, more frequent EDMA transactions. You can disable the FIFO very easily by changing one element in each of the McASP input/output channel parameter structs.

Mcasp_ChanParams  mcasp_chanparam[Audio_NUM_CHANS]=
{
    {
        0x0001,
        {Mcasp_SerializerNum_0, },
        (Mcasp_HwSetupData *)&mcaspRcvSetup,
        TRUE,
        Mcasp_OpMode_TDM,
        Mcasp_WordLength_32,
        NULL,
        0,
        NULL,
        NULL,
        1,
        Mcasp_BufferFormat_INTERLEAVED,
        FALSE,  /* McASP FIFO enable */
        TRUE
    },
    {
        0x0001,
        {Mcasp_SerializerNum_5,},
        (Mcasp_HwSetupData *)&mcaspXmtSetup,
        TRUE,
        Mcasp_OpMode_TDM,
        Mcasp_WordLength_32,
        NULL,
        0,
        NULL,
        NULL,
        1,
        Mcasp_BufferFormat_INTERLEAVED,
        FALSE,  /* McASP FIFO enable */
        TRUE
    }
};

With this change, we can achieve the latency listed in the following table. (Note that CPU load is not affected by this change; only EDMA load.)

Table 2: PSP Example Latency and CPU Load (McASP FIFO Disabled)
Buffer Length (samples)   Latency (ms)   CPU Load
512                       24.2           1.86%
256                       12.6           2.46%
128                       6.78           4.03%
64                        3.88           8.66%
32                        2.40           13.3%
16                        1.72           25.3%
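One way to read Tables 1 and 2 is to rearrange L = 2 * T + d and solve for d, the part of the measured latency not explained by buffering. The helper below is a sketch with a hypothetical name. It assumes the 44.1 kHz sampling rate used throughout this article and that the formula holds exactly, which real measurements will only approximate.

```c
/* Sampling rate used throughout this article. */
#define FS_HZ 44100.0

/* Solve L = 2 * T + d for d, given a measured latency in milliseconds
 * and a buffer length in samples. */
double implied_hw_delay_ms(double measured_latency_ms, unsigned samples)
{
    double period_ms = 1000.0 * (double)samples / FS_HZ;
    return measured_latency_ms - 2.0 * period_ms;
}
```

For example, implied_hw_delay_ms(25.8, 512) uses the 512-sample row of Table 1 and implied_hw_delay_ms(24.2, 512) uses the same row of Table 2. The difference between the two results is simply the 1.6 ms gap between the measured latencies.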

C6Flo Audio Application

For more information on installing and using the C6Flo tool, please refer to the C6Flo main page. For this article, we'll start with the C6747 audio filter example application and cut out the processing blocks between the audio input and audio output blocks. The diagram should look like this before you generate your application code:

[Image: C6Flo C6747 audio application diagram (C6flo c6747 audio app ss.png)]

Note: You may want to "Save As..." before generating code to avoid overwriting the original diagram.

We'll be making changes to two source files: c6747_audio_app_blocks.c and c6747_audio_app_threads.c.

Cutting Out Extra Audio Buffers

Similar to the PSP example application, our C6Flo application starts out using 4 buffers per audio driver handle (input and output). We'll need to change that to 2 buffers to get good audio latency in our system. Look for the following 6 lines in c6747_audio_app_blocks.c and change the number 4 to 2 in each one:

int ti_c6flo_evmc6747_audioin_v1_create(ti_c6flo_evmc6747_audioin_v1_hdl blockp)
{
    // ...
    sio_attrs.nbufs = 2; // was 4
    // ...
    for (i = 0; i < 2; i++) // was 4
    // ...
    for (i = 0; i < 2; i++) // was 4
    // ...
}
 
int ti_c6flo_evmc6747_audioout_v1_create(ti_c6flo_evmc6747_audioout_v1_hdl blockp)
{
    // ...
    sio_attrs.nbufs = 2; // was 4
    // ...
    for (i = 0; i < 2; i++) // was 4
    // ...
    for (i = 0; i < 2; i++) // was 4
    // ...
}

Changing the Audio Buffer Size

Changing buffer sizes in a C6Flo application can be done in the GUI by adjusting the buffer_size and buffer_length parameters. However, if we change the parameters and re-generate our application code, we'll overwrite the changes we just made to cut out our extra audio buffers. Fortunately, it's also easy to change these parameters in our C application code. Look for the following lines in c6747_audio_app_threads.c:

// Thread parameter structs
C6Flo_std_thread_obj thread0_obj = {
    /* buffer size (bytes)      = */ 1024,
    /* buffer length (elements) = */ 256,
    /* buffer alignment (bytes) = */ 128,
    /* thread index             = */ 0
};

Note that buffer_size is equal to four times buffer_length because we're representing our data as single-precision floating point (i.e. 4 bytes per sample). When changing one value, be sure to change the other so that they maintain this ratio.
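A small helper can keep the two values in step when you edit them by hand. This is only a sketch: BYTES_PER_SAMPLE, buffer_size_bytes, and buffer_params_consistent are hypothetical names and are not part of the generated C6Flo code.

```c
#include <stddef.h>

/* Single-precision floating point: 4 bytes per sample. */
#define BYTES_PER_SAMPLE 4u

/* Buffer size in bytes for a given buffer length in elements. */
size_t buffer_size_bytes(size_t buffer_length)
{
    return buffer_length * BYTES_PER_SAMPLE;
}

/* Returns 1 if a (size, length) pair keeps the 4:1 ratio, 0 otherwise. */
int buffer_params_consistent(size_t size_bytes, size_t buffer_length)
{
    return size_bytes == buffer_size_bytes(buffer_length);
}
```

For the default values shown above, buffer_size_bytes(256) gives 1024, which matches the first entry in thread0_obj.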

This application uses a slightly different McASP configuration and does a little more work between audio input and audio output, so our latency and CPU loading look a little different than they did for the PSP example application. The following table summarizes the performance of our C6Flo application for different buffer sizes.

Table 3: C6Flo App Latency and CPU Load
Buffer Length (samples)   Latency (ms)   CPU Load
512                       28.2           2.82%
256                       16.6           4.24%
128                       10.8           6.56%
64                        7.16           11.2%
32                        4.34           19.4%
16                        2.98           36.9%

Priming the Audio Driver

Due to our McASP configuration, we can't just turn off the FIFO for this application. However, there is one more trick we can use to lower our latency. The C6Flo application code as generated follows a somewhat convoluted process to "prime" the audio input and output buffers at the start of the application:

  1. Allocate audio input buffers
  2. Create audio input driver handle
  3. Prime audio input driver handle
  4. Allocate audio output buffers
  5. Create audio output driver handle
  6. Prime audio output driver handle

The separation between steps 3 and 6 introduces unnecessary latency to our application, so it's in our best interest to move them closer together. Fortunately, there's a pretty easy way to do this thanks to the structure of our application. All C6Flo-generated applications begin by calling "create" functions for each block, followed by "init" functions for each block. Currently, steps 1-6 take place in the create functions. We can move steps 3 and 6 to the init functions with a little work (and without breaking anything). Here's what you should end up with when you're done:

int ti_c6flo_evmc6747_audioin_v1_create(ti_c6flo_evmc6747_audioin_v1_hdl blockp)
{
    SIO_Attrs sio_attrs;
    int size, count, align, i;
 
    // get max buffer length, alignment
    count = blockp->std.thread->buffer_length;
    align = blockp->std.thread->buffer_align;
 
    // internal buffers must be big enough to hold pairs of 16-bit value (i.e. count * 2 * 2 bytes)
    size = count << 2;
 
    // initialize audio driver
    audio_evm_init();
 
    // create driver handle
    sio_attrs       = SIO_ATTRS;
    sio_attrs.nbufs = 2;
    sio_attrs.align = align;
    sio_attrs.model = SIO_ISSUERECLAIM;
 
    blockp->stream_hdl = SIO_create("/dioAudioIN", SIO_INPUT, size, &sio_attrs);
    if (blockp->stream_hdl == NULL)
    {
        LOG_printf(&trace,"[audioin]: could not create driver handle");
        return C6Flo_EGENERIC;
    }
 
    // allocate internal buffers
    for (i = 0; i < 2; i++)
    {
        C6Flo_MEM_alloc(&(blockp->buffers[i]), C6Flo_MEM_NORMAL, C6Flo_MEM_PERSIST, size, align, &blockp->std);
        if (blockp->buffers[i] == NULL)
        {
            LOG_printf(&trace,"[audioin]: buffer allocation error");
            return C6Flo_EALLOC;
        }
    }
 
    return C6Flo_EOK;
}
 
int ti_c6flo_evmc6747_audioout_v1_create(ti_c6flo_evmc6747_audioout_v1_hdl blockp)
{
    SIO_Attrs sio_attrs;
    int size, count, align, i;
 
    // get max buffer length, alignment
    count = blockp->std.thread->buffer_length;
    align = blockp->std.thread->buffer_align;
 
    // internal buffers must be big enough to hold pairs of 16-bit value (i.e. count * 2 * 2 bytes)
    size = count << 2;
 
    // initialize audio driver
    audio_evm_init();
 
    // create driver handle
    sio_attrs       = SIO_ATTRS;
    sio_attrs.nbufs = 2;
    sio_attrs.align = align;
    sio_attrs.model = SIO_ISSUERECLAIM;
 
    blockp->stream_hdl = SIO_create("/dioAudioOUT", SIO_OUTPUT, size, &sio_attrs);
    if (blockp->stream_hdl == NULL)
    {
        LOG_printf(&trace,"[audioout]: could not create driver handle");
        return C6Flo_EGENERIC;
    }
 
    // allocate (and clear) internal buffers
    for (i = 0; i < 2; i++)
    {
        C6Flo_MEM_alloc(&(blockp->buffers[i]), C6Flo_MEM_NORMAL, C6Flo_MEM_PERSIST, size, align, &blockp->std);
        if (blockp->buffers[i] == NULL)
        {
            LOG_printf(&trace,"[audioout]: buffer allocation error");
            return C6Flo_EALLOC;
        }
        // clear the buffer only after the allocation has been checked
        memset(blockp->buffers[i], 0, size);
    }
 
    return C6Flo_EOK;
}
 
int ti_c6flo_evmc6747_audioin_v1_init(ti_c6flo_evmc6747_audioin_v1_hdl blockp)
{
    int size, count, status, i;
 
    count = blockp->std.thread->buffer_length;
    size = count << 2;
 
    // prime driver (issue internal buffers)
    for (i = 0; i < 2; i++)
    {
        status = SIO_issue(blockp->stream_hdl, blockp->buffers[i], size, NULL);
        if (status != SYS_OK)
        {
            LOG_printf(&trace,"[audioin]: buffer issue error (prime)");
            return C6Flo_EALLOC;
        }
    }
 
    return C6Flo_EOK;
}
 
int ti_c6flo_evmc6747_audioout_v1_init(ti_c6flo_evmc6747_audioout_v1_hdl blockp)
{
    int size, count, status, i;
 
    count = blockp->std.thread->buffer_length;
    size = count << 2;
 
    // prime driver (issue internal buffers)
    for (i = 0; i < 2; i++)
    {
        status = SIO_issue(blockp->stream_hdl, blockp->buffers[i], size, NULL);
        if (status != SYS_OK)
        {
            LOG_printf(&trace,"[audioout]: buffer issue error (prime)");
            return C6Flo_EALLOC;
        }
    }
 
    return C6Flo_EOK;
}

This change will improve your latency to the values listed in the following table without affecting your CPU load at all.

Table 4: C6Flo App Latency and CPU Load (Driver priming moved to init functions)
Buffer Length (samples)   Latency (ms)   CPU Load
512                       24.9           2.82%
256                       13.3           4.24%
128                       7.52           6.56%
64                        4.56           11.2%
32                        3.12           19.4%
16                        2.40           36.9%

Low-Level Audio Application

Unlike the previous two examples, this application is not part of (or generated by) any standard software release from Texas Instruments. To get started, download the application source code from the following URL:

This package contains a CCS3.3 project file (*.pjt), a DSP/BIOS 5 configuration file (*.tcf), and several C source and header files. The application is self-contained; you don't need any other libraries or software packages to build and run.

Note: The following data reflects version 3 of the application.

Changing the Audio Buffer Size

This application is much closer to achieving optimal performance as-is, so there's little we can do to improve it. However, like all audio applications, our overall latency and CPU load will depend on our audio buffer size. The buffer size can be adjusted in this application using a macro near the top of the audio.c source file:

#define SAMPLES_PER_BUF     128

The following table lists latency and CPU load for several possible buffer lengths. Note that this application allows even smaller buffer sizes than the previous examples. The minimum sample count in those applications was set by the PSP driver. Since this application does not use the PSP driver, it does not share that limitation.

Table 5: Low-Level Application Latency and CPU Load
Buffer Length (samples)   Latency (ms)   CPU Load
512                       22.2           0.82%
256                       11.5           0.91%
128                       6.20           1.07%
64                        3.52           1.40%
32                        2.20           2.05%
16                        1.52           3.36%
8                         1.20           5.88%