ARM compiler optimizations

From Texas Instruments Wiki

Overview

The TI ARM compiler has been optimized for use with our Cortex microcontrollers. Optimizations have been developed and tuned using a wide variety of customer benchmarks and code. Key characteristics of these benchmarks include:

  • Dominated by control code (i.e. code consisting of mostly function calls and conditional branches with few loops)
  • Auto-generated code
  • Bitfield manipulations
  • 16-bit arithmetic
  • Single precision floating point operations

Key optimizations

  • High-level optimizations critical for auto-generated code, including common subexpression elimination, value propagation, and copy propagation.
  • Removal of unneeded sign extension instructions when performing 16-bit arithmetic
  • Using 16x16 multiplication instructions
  • Using MOVW and MOVT instructions for loading literals to avoid flash memory latencies (enabled at -mf3, i.e. --opt_for_speed=3, and higher)
  • Utilizing the bitfield manipulation instructions on Cortex devices like UBFX, SBFX, and BFI
  • Utilizing predicated (conditionally executed) instructions to avoid branch latencies.
  • Improving floating point performance by utilizing the VFP instruction set and by providing a relaxed floating point mode (--fp_mode=relaxed) that trades accuracy for speed.
  • Link time optimization (-o4, i.e. --opt_level=4), which allows the compiler to optimize across file boundaries. This makes many optimizations more effective; some key opportunities for ARM are:
    • Increased opportunity for inlining functions.
    • Global variable grouping, which allows the compiler to reduce the number of variable address loads which can dramatically improve both code size and performance.
    • Function specialization

Optimizing code with the TI ARM compiler

The TI ARM compiler has several options that control the amount and types of optimizations that are performed. The options fall roughly into two categories: processor options and optimization options. The most important options to set in order to get the best generated code are listed below.

Processor Options

The processor options control which ARM variant the compiler generates code for. It is critical that these options are chosen correctly. If you are compiling in CCS, these options should already be set correctly based on the device you selected when creating the project.

  • --silicon_version
    • Select the ARM variant the compiler should target, such as Cortex-R4, Cortex-A8, etc.
  • --code_state=16|32
    • Select the ARM (32) or Thumb (16) instruction set. For Cortex devices, --code_state=16 selects the Thumb-2 instruction set.
  • --float_support
    • Select whether VFP (hardware floating point) is supported. If your code relies heavily on floating point, this option must be set correctly to get the best performance.

Optimization Options

The optimization options are used to control the scope and types of optimizations that are performed.

  • --opt_level=0-4
    • Selects the scope used to perform optimizations. The higher the number, the greater the scope.
  • --opt_for_speed=0-5
    • Selects whether the optimizations should focus on code size or performance. The higher the number, the greater the performance at the expense of code size.
  • --optimize_with_debug
    • This option instructs the compiler to perform aggressive optimizations even if full debugging (--symdebug:dwarf) is on. Starting with the 5.0 release, this option is automatically selected if --opt_level=2 or higher is used. In older releases, this option must be selected to obtain the best performance with debug turned on.

Recommended performance options for Cortex-R4F devices

Non-cache devices

--silicon_version=7R4 --float_support=VFPv3D16 --code_state=16 --opt_level=[3|4] --opt_for_speed=5

Cache devices

--silicon_version=7R4 --float_support=VFPv3D16 --code_state=16 --opt_level=[3|4] --opt_for_speed=1 --opt_for_cache

Tips for getting the best performance

The optimization options that typically provide the best performance are --opt_level=3 --opt_for_speed=5. Using --opt_level=4, also referred to as link-time optimization, will usually improve performance and code size at the expense of link time, so it is worth trying if --opt_level=3 does not deliver enough performance. Keep in mind that --opt_for_speed drives heuristics, and the compiler will sometimes make a poor choice. It can be useful to experiment with a few different settings to see whether performance improves.

The other important consideration is whether to use the ARM or Thumb-2 instruction set. Typically the best performance will be achieved using the Thumb-2 instruction set. This is due to several factors, but some key points are:

  • Thumb-2 supports integer divide instructions
  • Thumb-2 results in smaller code size which can allow more aggressive speed optimizations
  • Thumb-2 allows the processor to prefetch more instructions due to better code density.

Optimizing for devices with instruction cache

For devices that have an instruction cache, optimizing for speed can degrade performance because of the increase in code size. Our findings show that --opt_for_speed=3 or higher will usually result in lower performance on these devices, so it is usually best to use --opt_for_speed=1|2. The 5.0 compiler supports a new --opt_for_cache option that is intended to be used with --opt_for_speed=1|2. The option enables some speed optimizations that provide significant benefits even though code size may increase slightly.

Tips for floating point performance

See more tips at Floating Point Optimization

Many customers struggle to achieve acceptable performance with floating point code. The most common problem is using double precision operations instead of single precision. There is a significant performance penalty when using double precision. For instance, on a Cortex-R4, the result latency of a single precision multiply is 2 cycles, whereas that of a double precision multiply is 9 cycles.

In the TI ARM compiler (and all other EABI ARM compilers) the C/C++ type double is used for double precision (64-bit) data and float is used for single precision (32-bit) data. Some other hardware vendors specify the double type as being 32 bits, which can lead to performance degradation when porting code from a different platform. You must ensure that all of your floating point data is declared as float in order to generate single precision floating point instructions.

Once all of your data is defined as float, there are still cases where you may unknowingly cause the compiler to generate double precision operations. The most common issue is when floating point literals such as 3.14159 are used. In C these literals are of type double, and if they are used in an expression consisting of single precision operands, the operations will be promoted to double precision. The proper way to specify a single precision literal is to use an 'f' suffix, 3.14159f.
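
The literal promotion described above can be sketched in a few lines of C. The function names here are hypothetical and used only for illustration. The two sizeof helpers report the type of each expression without evaluating it, which is a quick way to check whether an expression has been promoted to double.

```c
#include <stddef.h>

#define PI_F 3.14159f   /* 'f' suffix: single precision literal */

/* Unsuffixed literals are double, so radius is converted to double,
   the multiplies are done in double precision, and the result is
   converted back to float on return. */
float circumference_double(float radius)
{
    return 2.0 * 3.14159 * radius;
}

/* All operands are float, so the expression stays in single precision. */
float circumference_single(float radius)
{
    return 2.0f * PI_F * radius;
}

/* sizeof yields the size of the expression's type; x is not evaluated. */
size_t mixed_expr_size(float x)  { return sizeof(x * 3.14159);  }
size_t single_expr_size(float x) { return sizeof(x * 3.14159f); }
```

On a target where float is 32 bits and double is 64 bits, mixed_expr_size() returns the size of a double and single_expr_size() returns the size of a float.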

The functions declared in math.h such as sin(), cos(), sqrt(), etc. are double precision routines. This means that calling these functions from single precision code will result in significant overhead. The C99 standard specifies single precision versions of these routines, which are implemented in the TI ARM compiler. Each one has the same name as the double precision version with an 'f' suffix: sinf(), cosf(), sqrtf(), etc. Note that these routines are not TI specific; they are part of the C99 standard.

The ARM VFP hardware supports both single and double precision square root instructions. Nevertheless, calls to sqrt() and sqrtf() result in function calls to the RTS routines, because the C standard requires these functions to set errno if the input is negative. If your program makes several calls to sqrt(), you can avoid the overhead of the function call by using the __sqrt() and __sqrtf() intrinsics, which cause the instruction to be inserted directly in the code.

--float_operations_allowed

The --float_operations_allowed option is available in the 5.0 version of the compiler. It controls the precision of the floating point operations allowed in the compilation. The arguments are: none, 32, 64, all. If --float_operations_allowed=32 is specified on the command line, the compiler will issue an error if a double precision operation would be generated. This can be used to ensure that double precision operations are not accidentally introduced into an application.