From Texas Instruments Wiki
(Redirected from Cortex A8)
Jump to: navigation, search

ARM Cortex-A8 Overview & Introduction

Architecture Overview

Available Tools

There are a variety of tools:

  • Microsoft Visual Studio 2005 + Platform Builder plugin for the IDE debugger ( WinCE application Debug)
  • Lauterbach (good for low-level ARM and DSP debug, experienced with previous OMAP/AM products)
  • GreenHills's MULTI (good for low-level ARM and DSP debug)
  • MontaVista's Devrocket (good for Linux application debug)

If you need a tool that understands Linux, CodeSourcery and MontaVista are the way to go; at present CodeSourcery tool-chain has better support for Cortex-A8 found in OMAP 35x / AM 35x devices.

What is Neon?

According to ARM, the Neon block of the Cortex-A8 core includes both the Neon and VFP accelerators. Neon is a SIMD (Single Instruction Multiple Data) accelerator processor integrated in as part of the ARM Cortex-A8. What does SIMD mean? It means that during the execution of one instruction the same operation will occur on up to 16 data sets in parallel. It is also synonymous with the term vector processor. Since there is parallelism inside the Neon, you can get more MIPS or FLOPS out of Neon than you can a standard SISD processor running at the same clock rate. Many Neon benchmarks are shown as ARM takes N instructions while Neon takes less than N instructions. This shows how much parallelism can be achieved for that benchmark. Reducing instruction count will reduce the number of clocks used to perform the same task. A simple rule of thumb for how fast Neon will speed up a specific loop is to look at the data size of the operation. Since the largest Neon register is 128 bits, if you are performing an operation on 8-bit values you can perform 16 operations simultaneously. On the other end of the spectrum, if you are using 32 bit data, then you can perform 4 operation simultaneously. However remember that there are always other considerations that affect execution speed such as memory throughput and loop overhead. Neon instructions are mainly for numerical, load/store, and some logical operations. Neon operations will be executing in the NEON pipline while other instruction such as branching will occur in the main ARM core pipeline. (See reference to Cortex-A8 Architecture above for a description of the ARM Cortex-A8 and NEON pipelines)

What are the advantages of Neon

  • Aligned and unaligned data access allows for efficient vectorization of SIMD operations.
  • Support for both integer and floating point operations ensures adaptability to a broad range of applications, from compression decoding to 3D graphics.
  • Tight coupling to the ARM core provides a single instruction stream and a unified view of memory, presenting a single development platform target with a simpler tool flow.
  • The large Neon register file with its multiple views enables efficient handling of data and minimizes access to memory, enhancing data throughput performance.

Neon Advantages details

How to develop code for Neon

Unfortunately, in most cases you can not simply compile general C code and get a huge speed up using Neon. But if you truly want to utilize the power of Neon there are some basic steps you can follow. You need some basic understanding of what it means to vectorize the code. You need to know how to enable Neon in the Cortex-A8. Also, you need to have L2 cache enabled to get appreciable speed increases.

  1. Compiler Options - you can direct the compiler to auto-vectorize: The compiler generates Neon code. See the Neon auto vectorization example.
  2. Neon intrinsics - Compileable macros that give low level access to Neon operations
  3. Assembly Code - Write your own assembly or link highly optimized libraries.

For Intrinsics, ARM has created a NEON support library called NE10 to make your jump into NEON easier:

What does a Neon assembly instruction look like

A Neon instruction would look like one of the following:

VMUL.I16 q0,q0,q1

  • VMUL - multiply assembly instruction
  • .I16 - Indicates this instruction operates on 16 bit integers. This would be a "short int" in C code.
  • q0,q1 - Neon registers. A 'q' register is 128 bits wide and will hold 8 short ints.

This Neon instruction would simultaneously multiply the 8 operands in q0 with the 8 operands in q1 and store the 8 results in q0.

How to enable NEON

The NEON/VFP unit comes up disabled on power on reset. To enable Neon requires some co-processor commands. Below is the assembly code required to enable NEON/VFP. It is in the gcc type syntax. ARM code tools use a slightly different syntax.

     MRC p15, #0, r1, c1, c0, #2 ; r1 = Access Control Register
     ORR r1, r1, #(0xf << 20) ; enable full access for p10,11
     MCR p15, #0, r1, c1, c0, #2 ; Access Control Register = r1
     MOV r1, #0
     MCR p15, #0, r1, c7, c5, #4 ; flush prefetch buffer because of FMXR below
     ; and CP 10 & 11 were only just enabled
     ; Enable VFP itself
     MOV r0,#0x40000000
     FMXR FPEXC, r0 ; FPEXC = r0

Neon Auto Vectorization Compiler directives and Example

Compiler tools Autovectorization compiler directives
Code Composer Studio "-o3 -mv7a8 --neon -mf "
CodeSourcery (gcc) "-march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=softfp"
Realview "--cpu=Cortex-A8 -O3 -Otime --vectorize"

Here is a simple example of autovectorizing a small C function using the Realview compiler

File: itest.c
void NeonTest(short int * __restrict a, short int * __restrict b, short int * __restrict z)
 int i;
 for (i = 0; i < 200; i++) {
 z[i] = a[i] * b[i];

ARM only code

generated by ARM/Thumb C/C++ Compiler, RVCT3.1 [Build 616]

commandline armcc [-c --asm --interleave --cpu=Cortex-A8 itest.c]

loop iterations: 200

000000  e92d4030          PUSH     {r4,r5,lr}
000004  e3a03000          MOV      r3,#0
000008  e0804083          ADD      r4,r0,r3,LSL #1
00000c  e0815083          ADD      r5,r1,r3,LSL #1
000010  e1d440b0          LDRH     r4,[r4,#0]
000014  e1d550b0          LDRH     r5,[r5,#0]
000018  e1640584          SMULBB   r4,r4,r5
00001c  e0825083          ADD      r5,r2,r3,LSL #1
000020  e2833001          ADD      r3,r3,#1
000024  e35300c8          CMP      r3,#0xc8
000028  e1c540b0          STRH     r4,[r5,#0]
00002c  bafffff5          BLT      |L1.8|
000030  e8bd8030          POP      {r4,r5,pc}

ARM + Neon code

generated by ARM/Thumb NEON C/C++ Compiler with Crescent Bay VAST 10.7z8 ARM NEON, RVCT3.1 [Build 616]

commandline armcc [-c --asm --interleave --cpu=Cortex-A8 -O3 -Otime --vectorize itest.c]

loop iterations: 25

Neon instructions

000000  e3a03019          MOV      r3,#0x19
000004  f4200a4d          VLD1.16  {d0,d1},[r0]!
000008  e2533001          SUBS     r3,r3,#1
00000c  f4212a4d          VLD1.16  {d2,d3},[r1]!
000010  f2100952          VMUL.I16 q0,q0,q1
000014  f4020a4d          VST1.16  {d0,d1},[r2]!
000018  1afffff9          BNE      |L1.4|
00001c  e12fff1e          BX       lr

Neon Intrinsics

You may find that using the autovectorizing compiler does not always work well for more complex functions. Intrinsics are a combination of assembly code and C code. They give you direct control over the Neon SIMD functionality similar to coding in assembly. They also give you C level compiler errors to warn you if you are not matching type inputs and output consistently.

Here is a small function in C which adds together the 200 corresponding elements from arrays x and y and stores each result in array z:
void NeonTest(int * x, int * y, int * z)
 int i;
 for(i=0;i<200;i++) {
 z[i] = x[i] + y[i];

Here is the equivalent function using intrinsics:

#include "arm_neon.h"
void intrinsics(uint32_t *x, uint32_t *y, uint32_t *z)
  int i;
  uint32x4_t x4,y4; // These 128 bit registers will contain 4 values from the x array and 4 values from the y array
  uint32x4_t z4;    // This 128 bit register will contain the 4 results from the add intrinsic
  uint32_t *ptra = x; // pointer to the x array data
  uint32_t *ptrb = y; // pointer to the y array data
  uint32_t *ptrz = z; // pointer to the z array data

  for(i=0; i < 200/4; i++)
    x4 = vld1q_u32(ptra);  // intrinsic to load x4 with 4 values from x
    y4 = vld1q_u32(ptrb);  // intrinsic to load y4
    z4=vaddq_u32(x4,y4);   // intrinsic to add z4=x4+y4
    vst1q_u32(ptrz, z4);   // store the 4 results to z
    ptra+=4; // increment pointers

Here is the output of the intrinsic function compiled with: GCC: (CodeSourcery Sourcery G++ Lite 2007q3-51) 4.2.1

22 0000 323E81E2              add      r3, r1, #800
23                    .L2:
24 0004 8F4A21F4              vld1.32  {d4-d5}, [r1]
25 0008 101081E2              add      r1, r1, #16
26 000c 8F6A20F4              vld1.32  {d6-d7}, [r0]
27 0010 030051E1              cmp      r1, r3
28 0014 446826F2              vadd.i32 q3, q3, q2
29 0018 8F6A02F4              vst1.32  {d6-d7}, [r2]
30 001c 100080E2              add      r0, r0, #16
31 0020 102082E2              add      r2, r2, #16
32 0024 F6FFFF1A              bne      .L2
33 0028 1EFF2FE1              bx       lr


Coding in Assembly is a last resort. If autovectorization and intrinsics are not getting the desired results, then hand coding in assembly can be the way to maximize a functions performance. Coding in assembly that will improve on compiled code is a skill that may require a considerable learning curve.

Compiler Comparison

Compiler capabilities Autovectorization Intrinsics Assembly
Code Composer Studio Yes No Yes
CodeSourcery (gcc) Yes Yes Yes
Realview Yes Yes Yes

Note: For Code Composer you need version 4.6.x or greater of the TMS470 Compiler Tools

What is VFP?

VFP is a floating point hardware accelerator. It is not a parallel architecture like Neon. Basically it performs one operation on one set of inputs and returns one output. It's purpose is to speed up floating point calculations. If a processor like ARM does not have floating hardware, then it relies on software math libraries which can prohibitively slow down floating point calculations. The VFP supports both single and double precision floating point calculations compliant with IEEE754. Further, the VFP is not fully pipelined like Neon, so it will not have equivalent performance to Neon.

How to compile and run VFP code

What is the relationship between Neon and VFP?

Neon and VFP share the same large register file inside of the Cortex-A8. These registers are separate from the ARM core registers. The Neon/VFP register file is 256 bytes as shown in the diagram.

Neon/VFP register File

The Neon Register file has a dual view:

  • 32 - 64 bit registers (The Dx registers)
  • 16 - 128 bit registers (The Qx registers)

The VFP Register file also has a dual view:

  • 32 - 64 bit registers (The Dx registers)
  • 32 - 32 bit registers (The Sx registers - Only 1/2 of the registers may be viewed as 32 bit)

From the Neon point of view: register Q0 may be accessed as either Q0 or D0:D1

From the VFP point of view: register D0 may be accessed as either D0 or S0:S1

There are 2 paths or pipelines through Neon:

  • Integer and fixed point (supports 8 bit, 16 bit, 32 bit integers)
  • Single precision floating point (supports 32 bit floating point)

VFP has a single path:

  • Single or double precision floating point (supports 32 bit and 64 bit floating point)

(Note that Neon does not support double precision floating point operations)

(Note that Neon and VFP both support single precision floating point operations)

Neon and VFP both support floating point, which should I use?

  • The VFPv3 is fully compliant with IEEE 754
  • Neon is not fully compliant with IEEE 754, so it is mainly targeted for multimedia applications

Here is an example of showing how Neon pipelining will outperform VFP:

Taking the same C function from earlier, but using floating point types instead:

void NeonTest(float * __restrict a, float * __restrict b, float * __restrict z)
    int i;
    for(i=0;i<200;i++) {
       z[i] = a[i] * b[i];

Compile the above code using CodeSourcery: GCC: (CodeSourcery Sourcery G++ Lite 2007q3-51) 4.2.1

Compile the above function for both Neon and VFP and compare results:

  • arm-none-linux-gnueabi-gcc -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp
  • arm-none-linux-gnueabi-gcc -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=vfp -ftree-vectorize -mfloat-abi=softfp

Running on OMAP3EVM under Linux with a Cortex-A8 clock speed of 600MHz

VFP/NEON Time to execute this function 500,000 times
VFP 7.36 seconds
Neon 0.94 seconds

Useful documentation

Useful links

TI Open Source Projects - Cutting edge information

ARM Cortex-A8 Terminology

  • SIMD - A processor capable of Single Instruction Multiple Data. For example during one single operation such as an "add", up to 16 sets of data will be added in parallel.
  • Superscalar - An architecture which employs instruction level parallelism. The Cortex-A8 has dual in-order instruction issue.
  • Vector Processor - Synonymous with SIMD processor