Floating Point Optimization

From Texas Instruments Wiki

The programmer's responsibility

Floating-point arithmetic is inherently trickier than integer or fixed-point arithmetic. There are a lot more performance and precision gotchas, so the compiler is not as free to optimize the code automatically. For this reason, the user must be far more aware of the properties of floating-point arithmetic to get good performance out of the compiler.

Beware: this topic is much deeper than it would seem; this page barely scratches the surface of things that the floating-point programmer needs to know.

Issues to consider are:

  • rounding modes
  • floating-point exceptions
  • do not expect absolute precision; not possible in a finite format
  • negation is an operation; negation is not part of float (or integer) constants
  • printf rounds
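
The last point in the list can be illustrated with a minimal sketch. It assumes an IEEE-32 float and a hosted C library that provides snprintf; the helper name format_tenth is made up for this example. The default %f conversion rounds to six decimal places, which hides the representation error that a longer conversion reveals.

```c
#include <stdio.h>

/* Hypothetical helper: format 0.1f with the requested number of decimals.
   Returns the number of characters written, as snprintf does. */
int format_tenth(char *buf, size_t size, int decimals)
{
    float tenth = 0.1f;  /* the nearest IEEE-32 value is slightly above 0.1 */

    /* A float argument is promoted to double when passed to snprintf. */
    return snprintf(buf, size, "%.*f", decimals, tenth);
}
```

With 6 decimals (the default for %f) the text reads 0.100000. Asking for 20 decimals shows nonzero digits after the eighth place, because the stored value is not exactly 0.1.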

The compiler's responsibility

The compiler must faithfully translate floating-point arithmetic so that the computed value remains the same. By default, the compiler is not allowed to perform any optimization which might affect the result. This behavior can sometimes severely limit optimization potential. See below for options that give more aggressive optimization.

Quality of implementation

The TI compiler strives to provide IEEE-754 support, but there are some limitations. See a general statement about the compiler's adherence to IEEE-754. This is a quality-of-implementation (QoI) issue. In particular, the compiler's run-time support (RTS) library for some ISAs doesn't always handle special values, rounding, or accuracy correctly. For devices which have IEEE-754 floating-point arithmetic support, the compiler can take advantage of it and it will be as accurate as possible. However, some functions must still be handled in the library. TI is striving to improve the QoI of the floating-point handling. At this time, there is no other specific statement of the correctness or speed of the TI compiler's floating-point capabilities. A proper characterization of correctness and speed is one of the things to be done.

Devices without floating-point arithmetic hardware

Some devices do not have floating-point arithmetic hardware, so C floating-point types must be emulated in software. The emulation routines are provided in the compiler's run-time support (RTS) library. They are far slower than floating-point hardware, so you will see poor performance if you use floating-point arithmetic on a device which does not support it in hardware.

Some ISAs have a variety of devices which may or may not support floating-point arithmetic. If you are using a device which does support it, you must inform the compiler or it will not be able to take advantage of it. Consult the C/C++ Compiler User Guide for your ISA for a complete list of options.

Sample FP-enabling options

ISA      Option
ARM      --float_support=VFPv3
C2800    --float_support=fpu32
C6000    -mv6740

float vs. double vs. long double

Many customers struggle to achieve acceptable performance with floating-point code. The most common problem is using double precision operations where single precision would do. Double precision carries a significant performance penalty. For instance, on a Cortex-R4, the result latency of a single precision multiply is 2 cycles, whereas that of a double precision multiply is 9 cycles.

In the TI ARM compiler (and all other EABI ARM compilers) the C/C++ type double is used for double precision (64-bit) data and float is used for single precision (32-bit) data. Some other hardware vendors specify the double type as being 32 bits which can lead to performance degradations when porting code from a different platform. You must ensure that all data types you are using are of type float in order to generate single precision floating point instructions.

Once all of your data is defined as float, there are still cases where you may unknowingly cause the compiler to generate double precision operations. The most common issue is when floating point literals such as 3.14159 are used. In C these literals are of type double, and if they are used in an expression consisting of single precision operands, the operations will be promoted to double precision. The proper way to specify a single precision literal is to use an 'f' suffix, 3.14159f.
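
The promotion caused by an unsuffixed literal can be checked without running any arithmetic, because sizeof reports the size of an expression's type without evaluating it. The sketch below uses made-up function names and assumes a target where float is narrower than double, as on ARM.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration. */

/* The unsuffixed literal is a double, so diameter is converted and the
   multiply is performed in double precision before narrowing to float. */
float circumference_double_math(float diameter)
{
    return diameter * 3.14159;
}

/* The 'f' suffix keeps the whole expression in single precision. */
float circumference_float_math(float diameter)
{
    return diameter * 3.14159f;
}

/* sizeof does not evaluate its operand; it only reports the size of the
   type the expression would have. */
size_t size_of_unsuffixed_product(void)
{
    float x = 1.0f;
    return sizeof(x * 3.14159);   /* expression type is double */
}

size_t size_of_suffixed_product(void)
{
    float x = 1.0f;
    return sizeof(x * 3.14159f);  /* expression type is float */
}
```

Both circumference functions return a float, so the difference is invisible at the call site; only the generated instructions change.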

The functions declared in math.h such as sin(), cos(), sqrt(), etc. are double precision routines, so calling them with single precision data incurs significant overhead. The C99 standard specifies single precision versions of these routines, which are implemented in the TI ARM compiler. Each one takes the name of the double precision version with an 'f' suffix: sinf(), cosf(), sqrtf(), etc. These routines are not TI specific; they are part of the C99 standard.
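
The difference between the two families shows up in the type of the call expression. The following sketch, with hypothetical function names, uses sizeof so that no math routine is actually called; it assumes float is narrower than double on the target.

```c
#include <math.h>
#include <stddef.h>

/* sizeof never evaluates its operand, so neither routine is called here. */

size_t size_of_sin_result(void)
{
    float angle = 0.5f;
    /* angle is converted to double and the result is a double. */
    return sizeof(sin(angle));
}

size_t size_of_sinf_result(void)
{
    float angle = 0.5f;
    /* Argument and result both stay in single precision. */
    return sizeof(sinf(angle));
}
```

In application code the fix is simply to call sinf(angle) instead of sin(angle) wherever angle is a float.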

Standard C has three real, floating-point types:

  • float
  • double
  • long double

If floating-point precision or speed are important to your application, you need to be aware of the properties of each type for the ISA you are using, and you also need to be aware of the type of each expression.

TI ISAs use either IEEE-32 or IEEE-64 to represent these types. On a given device, IEEE-32 is faster but less precise than IEEE-64.

IEEE-32 or IEEE-64 (width in bits)

ISA           float    double    long double
ARM           32       64        64
C2800         32       32        64
C6000         32       64        64
MSP (COFF)    32       32        32
MSP (EABI)    32       64        64

Note that some ISAs have 32-bit double or long double for legacy compatibility reasons. The C standard requires double and long double to provide at least 10 decimal digits of precision, which IEEE-32 cannot, so these targets do not conform to the C standard with respect to these types.
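
Because the width of double depends on the ISA, portable code may want to check it rather than assume it. A minimal sketch using only the standard float.h macros follows; the function names are hypothetical, and the bit counts in the comments assume a binary (radix 2) format.

```c
#include <float.h>

/* Number of significand bits: 24 for IEEE-32, 53 for IEEE-64. */
int float_significand_bits(void)
{
    return FLT_MANT_DIG;
}

int double_significand_bits(void)
{
    return DBL_MANT_DIG;
}

/* Returns 1 when double is wider than float on this target, and 0 when
   both types share one format (as on ISAs with a 32-bit double). */
int double_is_wider_than_float(void)
{
    return DBL_MANT_DIG > FLT_MANT_DIG;
}
```

The same macros can be used in a preprocessor #if to reject a build on a target whose double is too narrow for the algorithm.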

Making sure your program doesn't use double precision

The option --float_operations_allowed controls the precision of floating point operations allowed in the compilation. The arguments are: none, 32, 64, all. If --float_operations_allowed=32 is specified on the command line, the compiler will issue an error if a double precision operation will be generated. This can be used to ensure that double precision operations are not accidentally introduced into an application.

Special values

IEEE floating-point representation has some special values. For a complete description, see IEEE-754 (or ISO/IEC/IEEE-60559) and C99 (ISO/IEC 9899:1999).

  • NaN (not a number)
  • Inf (positive infinity)
  • -Inf (negative infinity)
  • -0.0 (negative zero)
  • denormal (aka subnormal) numbers

These values may behave strangely in an arithmetic expression, so it may be desirable to avoid an expression which will create one.

In particular, avoid generating a NaN value.

NaN (not a number) represents the fact that no information is known about the value. It is the result of an expression that has no reasonable interpretation, such as 0/0 or Inf/Inf. When a NaN is involved in an arithmetic expression, the result is always NaN. When a NaN is compared to another value, NaN is not equal to anything, including itself. Thus NaN==X is false, and NaN!=X is true, even if X is NaN.
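
These NaN rules can be sketched in a few lines. The function names are made up for illustration, and the code assumes the compiler is not running in a mode that is allowed to ignore NaN (which would let it fold x != x to false).

```c
#include <math.h>   /* only for the NAN macro; no library call is made */

/* NaN is the only value that compares unequal to itself. */
int is_nan_by_self_compare(float x)
{
    return x != x;
}

/* Any arithmetic involving NaN produces NaN. */
float add_one(float x)
{
    return x + 1.0f;
}

/* Equality is false whenever either operand is NaN. */
int floats_equal(float a, float b)
{
    return a == b;
}
```

For example, floats_equal(NAN, NAN) returns 0, and is_nan_by_self_compare(add_one(NAN)) returns 1.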

Not all algebraic idioms are valid

Floating-point arithmetic is full of cases where simple algebraic rules like X==X or (X*Y)*Z==X*(Y*Z) do not hold. Typically, this is because one of the inputs is a special value, but sometimes even normal numbers can cause this. For this reason, the compiler is not allowed to perform every algebraic simplification that might seem obvious. Here is a partial list of algebraic rules that are true for integer arithmetic, but may not be true for floating-point arithmetic.

  • X==X is not equivalent to true if X could be NaN
  • X!=X is not equivalent to false if X could be NaN
  • (X*Y)*Z is not equivalent to X*(Y*Z) for some values of X,Y,Z (see below)
  • -(X-Y) is not equivalent to Y-X if X and Y could be equal finite values (for example, both 0.0), because the first then yields -0.0 and the second 0.0
  • X-X is not equivalent to 0.0 if X could be +Inf, -Inf, or NaN (the result is NaN)
  • X/X is not equivalent to 1.0 if X could be 0.0, -0.0, +Inf, -Inf, or NaN (the result is NaN)
  • X*0 is not equivalent to 0.0 if X could be NaN, +Inf, or -Inf (the result is NaN), or if X could be -0.0 or any negative number (the result is -0.0)
  • X<Y is not equivalent to !(X>=Y) if either X or Y could be NaN
  • ((X<0)?-X:X) is not equivalent to fabs(X) if X could be -0.0
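
A few of the cases in this list can be observed directly. The sketch below uses hypothetical helper names and assumes float is a 32-bit IEEE-32 value, so that the sign bit is the top bit of its representation.

```c
#include <math.h>    /* only for the INFINITY macro */
#include <stdint.h>
#include <string.h>

/* Returns the sign bit of x (1 for negative values, including -0.0).
   Assumes float is a 32-bit IEEE-32 value. */
int sign_bit_of(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (int)(bits >> 31);
}

/* The "obvious" absolute value. It returns -0.0 unchanged, because the
   comparison -0.0 < 0 is false. */
float abs_by_compare(float x)
{
    return (x < 0) ? -x : x;
}

/* Inf - Inf has no meaningful value, so the result is NaN rather than 0.0. */
float x_minus_x(float x)
{
    return x - x;
}

/* A zero product takes its sign from the operands, so a negatively signed
   x (including -0.0) gives -0.0. */
float x_times_zero(float x)
{
    return x * 0;
}
```

Note that -0.0 == 0.0 is true, so the difference is only visible through the sign bit or through a later operation such as division.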

Some idioms do hold, but only under restricted circumstances.

  • X/Y is equivalent to X*(1/Y) where Y is a floating-point constant and 1/Y is exactly representable. 1/2 is exactly representable, but 1/3 is not; thus, the optimizer will convert X/2 to X*0.5, but it will not convert X/3 to X*0.333333 (but see below)
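
The distinction between an exact and an inexact reciprocal can be sketched as follows. The function names are hypothetical, and the behaviour described assumes that float expressions are evaluated in IEEE-32 single precision with round-to-nearest.

```c
/* Division by 3: a single correctly rounded operation. */
float div_by_three(float x)
{
    return x / 3.0f;
}

/* Multiplication by the reciprocal: 1.0f / 3.0f is already rounded, so the
   product can land on a neighbouring float. */
float mul_by_third(float x)
{
    return x * (1.0f / 3.0f);
}

/* 0.5f is exactly representable, so these two always agree. */
float div_by_two(float x)
{
    return x / 2.0f;
}

float mul_by_half(float x)
{
    return x * 0.5f;
}
```

Under those assumptions, x = 5.0f gives results from div_by_three and mul_by_third that differ in the last bit, while x = 3.0f happens to give the same result from both.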

In summary, don't expect the compiler to perform "obvious" algebraic transformations on your floating-point expressions. In many cases, you will need to write the expression exactly as you expect it to be executed.


Algebraic re-association is changing (X*Y)*Z to X*(Y*Z). Unfortunately, in floating-point arithmetic, this could change the result.

For example,

  • (3.0f * (1.0f/3.0f)) * 7.0f == 7.0f
  • 3.0f * ((1.0f/3.0f) * 7.0f) == 7.0000005f

In the first expression, 3.0f * (1.0f/3.0f) rounds to exactly 1.0f. In the second expression, the actual value of (1.0f/3.0f) * 7.0f exceeds the precision of IEEE-32, so it gets rounded off, and the multiplication by 3.0f preserves that rounding error.

By default, the compiler is not allowed to make this transformation because it could change the result. Unfortunately this means that the compiler is not generally free to make profitable loop transformations that would effectively re-associate a floating-point expression. See below for options that give more aggressive optimization.
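
The effect of re-association can be reproduced with a short sketch. The function names are made up, the operands 3.0f, 1.0f/3.0f, and 7.0f are chosen for illustration, and the result described assumes IEEE-32 single-precision evaluation with round-to-nearest.

```c
/* (x * y) * z: multiply the first two operands first. */
float multiply_left_first(float x, float y, float z)
{
    return (x * y) * z;
}

/* x * (y * z): the re-associated grouping of the same operands. */
float multiply_right_first(float x, float y, float z)
{
    return x * (y * z);
}
```

With x = 3.0f, y = 1.0f/3.0f, and z = 7.0f, the first grouping returns exactly 7.0f because 3.0f * y rounds to 1.0f. The second grouping rounds y * z first and returns the next float above 7.0f.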

Better performance vs. strict IEEE correctness

By default, the compiler is severely limited with respect to floating-point optimizations, such as re-associating floating-point expressions, because such optimizations could slightly change the result. For programs which can tolerate this small change, you can use compiler options to instruct the compiler to more aggressively optimize your code. You must take care to make sure that your program will tolerate the loss of precision. Be aware that small errors can accumulate into larger errors as the result is fed into further expressions, especially in a loop.

Consult the C/C++ Compiler User Guide for your ISA for the default settings of these modes.


The main compiler option is the --fp_mode option. This option controls the compiler's overall floating-point optimization strategy.


Relaxed mode (--fp_mode=relaxed) prioritizes speed over strict correctness. In relaxed mode, the compiler may perform speed optimizations at the expense of reducing the precision of some calculations, typically by a tiny amount. For instance, (X/3) is not precisely equivalent to (X*(1.0/3)), but in relaxed mode, the compiler is allowed to make this transformation anyway, as multiplication is much faster than division.


Strict mode enforces strict IEEE-754 semantics, disabling all unsafe optimizations. The compiler will still perform optimizations that are provably safe, such as (X/2) -> (X*0.5). Using --fp_mode=strict sets --fp_reassoc=off by default.


The --fp_reassoc option controls whether the compiler is allowed to re-associate floating-point expressions. This is an important optimization for ISAs which can perform more than one floating-point operation per cycle, such as those with vector hardware.
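
To see why re-association matters for such hardware, consider a summation loop. The sketch below writes by hand the kind of split-accumulator form that re-association makes legal; whether the TI compiler produces exactly this shape is an assumption, and the function names are hypothetical.

```c
#include <stddef.h>

/* Adds the elements strictly in order, as the source code specifies. */
float sum_in_order(const float *values, size_t count)
{
    float total = 0.0f;
    for (size_t i = 0; i < count; i++) {
        total += values[i];
    }
    return total;
}

/* Two independent accumulators: the two additions in each iteration do not
   depend on each other, so they can overlap in hardware. The operands are
   grouped differently, so the rounded result can differ. */
float sum_two_accumulators(const float *values, size_t count)
{
    float even_total = 0.0f;
    float odd_total = 0.0f;
    size_t i = 0;

    for (; i + 1 < count; i += 2) {
        even_total += values[i];
        odd_total += values[i + 1];
    }
    if (i < count) {
        even_total += values[i];  /* leftover element when count is odd */
    }
    return even_total + odd_total;
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f} in IEEE-32 arithmetic, sum_in_order returns 1.0f because the first 1.0f is absorbed by 1e8f, whereas sum_two_accumulators returns 2.0f.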


Re-association mode (--fp_reassoc=on) allows the compiler to freely re-associate floating-point expressions. However, this can slightly change the result.


Using --fp_reassoc=off prevents the compiler from doing this specific optimization.

Further reading