Cortex-A8 Features

From Texas Instruments Wiki

Content is no longer maintained and is being kept for reference only!

Cortex-A8 Highlights

  • Dual-issue, in-order, superscalar architecture delivering high performance
    • First implementation of the ARMv7 instruction-set architecture, including the Advanced SIMD media instructions (NEON™)
    • Advanced dynamic branch prediction
  • Integrated, 256 KB unified L2 cache
    • Dedicated, low-latency, high-bandwidth interface to L1 cache
  • NEON™ : 64/128-bit Hybrid SIMD Engine for Multimedia
    • Supports both Integer and Floating Point SIMD
  • Enhanced VFPv3: doubles the number of double-precision registers and adds new instructions to convert between fixed and floating point
  • Efficient Run Time Compilation Target
    • Jazelle RCT: target for Java. Memory footprint reduced by up to 3x
    • Can also target languages such as Microsoft .NET MSIL, Perl, Python

Superscalar Cortex-A8 Core

  • In-order dual instruction issue
    • less complex than out-of-order
    • fewer structures means lower power
    • less need for custom design
  • can maintain high IPC with
    • fully symmetric ALU pipelines
    • all critical forwarding paths supported
    • dual-issue of dependent instruction pairs
  • Static scheduling with instruction replay on memory stall
    • low-power consumption due to early availability of gate enables
    • fire-and-forget instruction issue removes critical paths from the design
  • Net result
    • high-frequency design that approaches out-of-order performance while keeping in-order clock frequency and power consumption
    • Average CPI of 0.9 across 150+ ARM and industry benchmarks
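
To put the quoted CPI in terms of the dual-issue pipeline, the short sketch below converts it to instructions per cycle (IPC) and compares it with the theoretical peak of two instructions per cycle. Only the 0.9 figure comes from the text above; the variable names are illustrative.

```python
# Convert the quoted average CPI into IPC and compare it with the dual-issue peak.
average_cpi = 0.9        # figure quoted in the list above
peak_issue_width = 2     # dual-issue core: at most 2 instructions per cycle

average_ipc = 1.0 / average_cpi
issue_utilisation = average_ipc / peak_issue_width

print(f"average IPC       = {average_ipc:.2f}")
print(f"issue utilisation = {issue_utilisation:.0%}")
```

A CPI below 1.0 is only possible because some cycles issue two instructions; a single-issue core cannot do better than a CPI of 1.0.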

Cortex-A8 Technologies

Cortex-A8 Technologies | Description
TrustZone Security | Device Integrity / Secure Transactions
Jazelle RCT Acceleration / Thumb-2EE Instruction Set | Fast & Responsive Java Applications
Thumb-2 Instruction Set | Greater Performance With Less Code Size
NEON™ Advanced SIMD (+VFPv3) | Enhanced Multimedia Experience
Superscalar ARMv7 Core | Highest-performance mobile processor

NEON: Advanced SIMD

  • 64/128-bit Hybrid SIMD architecture
    • A single instruction performs the same operation on multiple elements that are packed within registers
  • Independent Register file with 2 aliased views:
    • 32 x 64-bit registers (D0-D31)
    • 16 x 128-bit registers (Q0-Q15)
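
The two aliased views can be illustrated with a toy model. In the ARMv7 Advanced SIMD register file, Qn occupies the same storage as the pair D(2n) and D(2n+1), with D(2n) as the low half. The class and method names below are hypothetical and exist only for this sketch; it models the aliasing, not the hardware.

```python
import struct

class NeonRegisterFile:
    """Toy model: one 256-byte store viewed as D0-D31 or as Q0-Q15."""

    def __init__(self):
        self.storage = bytearray(32 * 8)  # 32 registers x 8 bytes

    def write_d(self, n, value):
        assert 0 <= n < 32
        struct.pack_into("<Q", self.storage, n * 8, value)

    def read_d(self, n):
        assert 0 <= n < 32
        return struct.unpack_from("<Q", self.storage, n * 8)[0]

    def read_q(self, n):
        # Qn aliases D(2n) (low 64 bits) and D(2n+1) (high 64 bits).
        assert 0 <= n < 16
        low, high = struct.unpack_from("<QQ", self.storage, n * 16)
        return (high << 64) | low

regs = NeonRegisterFile()
regs.write_d(2, 0x1111111111111111)
regs.write_d(3, 0x2222222222222222)
q1 = regs.read_q(1)  # same bytes as D2 and D3, seen as one 128-bit value
print(hex(q1))
```

Writing through one view and reading through the other needs no copy, because both names refer to the same storage.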

Neon regs.png

  • Integer and SP Floating-point processing
    • 8, 16, 32, 64-bit Integers
    • Single-precision Floating-point

Neon reg width.png

  • Encoded in ARM and Thumb-2
  • Accelerates audio, video, and 3D-graphics

NEON: SIMD Instructions

  • NEON™ Instructions are based on “Packed SIMD” processing
    • Registers are considered as vectors of elements of the same data type
    • Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. float
    • Instructions perform the same operation in all lanes
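
A minimal sketch of packed SIMD addition, assuming unsigned lanes with modular (wrap-around) arithmetic, which is how a plain integer vector add such as VADD behaves. The helper names are illustrative. A 64-bit value stands in for a D register, and the same bit pattern gives different results depending on the lane width chosen.

```python
def split_lanes(reg, lane_bits, reg_bits=64):
    """Return the lanes of `reg`, lowest lane first."""
    mask = (1 << lane_bits) - 1
    return [(reg >> (i * lane_bits)) & mask for i in range(reg_bits // lane_bits)]

def join_lanes(lanes, lane_bits):
    """Pack lanes (lowest first) back into a single register value."""
    mask = (1 << lane_bits) - 1
    reg = 0
    for i, lane in enumerate(lanes):
        reg |= (lane & mask) << (i * lane_bits)
    return reg

def packed_add(a, b, lane_bits, reg_bits=64):
    """Add lane by lane; carries never cross a lane boundary."""
    mask = (1 << lane_bits) - 1
    lanes_a = split_lanes(a, lane_bits, reg_bits)
    lanes_b = split_lanes(b, lane_bits, reg_bits)
    return join_lanes([(x + y) & mask for x, y in zip(lanes_a, lanes_b)], lane_bits)

a = 0x0001_0002_0003_FFFF
b = 0x0010_0020_0030_0001

as_four_16bit_lanes = packed_add(a, b, lane_bits=16)
as_one_64bit_lane = packed_add(a, b, lane_bits=64)
print(hex(as_four_16bit_lanes))  # lowest lane wraps to 0, no carry into the next lane
print(hex(as_one_64bit_lane))    # carry propagates through the whole register
```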

SIMD.PNG

SIMD Load/Store Structure

  • Native support for structures
    • e.g. complex numbers, pixels, coordinates
    • Memory treated as an array of structures (AoS)

Load-store.PNG

  • Eliminates ‘shuffling’ overhead
    • Optimised memory access as single transfer
    • Data arranged for efficient SIMD processing
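
The effect of a structure load can be sketched as a de-interleave: an array of 3-element pixel structures in memory ends up as one register per field, similar in spirit to what the NEON VLD3 and VST3 instructions do. The function names and the list-based "registers" below are assumptions made for illustration only.

```python
def load_structures(memory, fields, count):
    """De-interleave `count` records of `fields` elements into one list per field."""
    return [memory[f:fields * count:fields] for f in range(fields)]

def store_structures(registers):
    """Inverse operation: interleave per-field lists back into AoS order."""
    return [elem for record in zip(*registers) for elem in record]

# Four RGB pixels stored as an array of structures: R0 G0 B0 R1 G1 B1 ...
pixels = [10, 20, 30,
          11, 21, 31,
          12, 22, 32,
          13, 23, 33]

red, green, blue = load_structures(pixels, fields=3, count=4)
print(red, green, blue)

# Each field can now be processed with one lane-wise operation, then stored back.
brighter_red = [r + 1 for r in red]
print(store_structures([brighter_red, green, blue]))
```

Without this support, the same arrangement would need separate permute instructions after a plain load, which is the 'shuffling' overhead the list above refers to.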

Key NEON Capabilities

  • Two Integer 64-bit ALUs operating in parallel
    • Together they can perform the equivalent of a 128-bit ALU operation in one cycle
  • 64-bit datapath with data types up to 128 bits
  • Supports 128-bit data streaming from both L1D$ and L2$
    • Byte permute function allows for on-the-fly data shuffling
  • Two Integer Multipliers of 32x16
    • Each can perform one 32x16, two 16x16 or four 8x8 operations in a single pass
    • Support 32x32 operation in two passes
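
The two-pass 32x32 multiply follows from splitting one operand into 16-bit halves. The sketch below shows the arithmetic for unsigned operands only; how the hardware sequences and accumulates the two passes is not described in the text, so the function names and structure here are assumptions.

```python
def mul32x16(a32, b16):
    """One pass through a 32x16 multiplier (unsigned model)."""
    assert 0 <= a32 < (1 << 32) and 0 <= b16 < (1 << 16)
    return a32 * b16

def mul32x32_two_pass(a, b):
    """Build a 32x32 product from two 32x16 passes: a*b = a*b_lo + (a*b_hi << 16)."""
    low_pass = mul32x16(a, b & 0xFFFF)   # pass 1: low half of b
    high_pass = mul32x16(a, b >> 16)     # pass 2: high half of b
    return (high_pass << 16) + low_pass

a, b = 0xDEADBEEF, 0x12345678
product = mul32x32_two_pass(a, b)
print(hex(product))
```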

Thumb-2 Instruction Set

  • Combined 32-bit and 16-bit instruction set
  • 16-bit instructions include the original Thumb instruction set
  • Some new 16-bit instructions for key code-size wins
  • Virtually all instructions in the ARM ISA are available in Thumb-2
  • In principle, can stand alone as a complete ISA
  • Unified assembly language for ARM and Thumb-2; the same source can target either ISA
  • Conditional execution made available via the IT instruction

Jazelle RCT Acceleration

  • Beneficial to Java and a wide range of emerging languages
    • Microsoft .NET MSIL, Perl, Python, etc.
  • Enables high performance in smallest memory footprint
    • Optimal balance between speed and code density with run-time compilers
  • Low cost and low power
    • Less than 8K gates and small memory footprint result in lower power

Thumb-2EE: Basis of Jazelle RCT

  • Thumb-2EE (Thumb-2 Execution Environment) is a variant of Thumb-2 with instructions to support JIT and AOT runtime compilers
  • Targets any OO bytecode language such as Java and MS .NET IL
  • 16-bit instructions for common AOT/JIT compilation routines
    • Smaller code
  • Smaller code size means recompiled methods can be kept in memory
  • Less recompilation means faster performance and no start up delays

Thumb-2-perf.PNG

Memory System on Cortex-A8

  • Harvard Level 1 caches, both 16 KByte, 4-way set associative
    • single-cycle load-use penalty
    • Virtually indexed, physically tagged (VIPT)
    • Level 1 Data cache is blocking
      • Non-NEON read misses in the cache cause replay of subsequent instructions
      • Reduces complexity in later pipeline stages
        • Good for power and clock frequency
      • NEON data is not allocated to L1 (but will read/update in L1 if necessary)
  • Integrated 256 KB unified Level 2 cache, 8-way set associative
    • Dedicated low-latency, high-bandwidth interface to the Level 1 cache
    • Line length of 64 bytes
    • Physically indexed, physically tagged (PIPT)
    • Minimum latency of 8 cycles
    • Streams to the NEON processing unit; up to 16 GByte/s bandwidth
  • 128-bit data streaming from both L1D$ and L2$
  • 64 bit AMBA AXI interconnect to external memory
    • Supports multiple outstanding memory transactions to minimize memory latencies
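
The Level 2 figures above (256 KB, 8-way, 64-byte lines) fix the number of sets and how a physical address splits into tag, index and offset. The sketch below works that out, and also shows what clock the quoted 16 GByte/s would imply if one 128-bit transfer completed every cycle. That clock is inferred for illustration, not stated in the text, and the function names are hypothetical.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Derive set count and address field widths for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    return {
        "sets": sets,
        "offset_bits": (line_bytes - 1).bit_length(),
        "index_bits": (sets - 1).bit_length(),
    }

def split_address(addr, geometry):
    """Split a physical address into (tag, set index, byte offset)."""
    offset_bits = geometry["offset_bits"]
    index_bits = geometry["index_bits"]
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

l2 = cache_geometry(size_bytes=256 * 1024, ways=8, line_bytes=64)
tag, index, offset = split_address(0x80012345, l2)
print(l2, hex(tag), index, offset)

# Bandwidth check: 128 bits per transfer is 16 bytes per cycle.
bytes_per_cycle = 128 // 8
implied_clock_hz = 16e9 / bytes_per_cycle  # assumes one transfer every cycle
print(f"implied clock = {implied_clock_hz / 1e9:.1f} GHz")
```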

TrustZone Security

Trustzone.png
  • TrustZone adds a parallel world to run secure OS and applications
  • Normal and Secure worlds have different memory views, enforced by hardware
  • Memory tagged as secure and non-secure by the system
    • Only the CPU running in the Secure world can access the secure memory and peripherals
  • Secure Monitor is a software “gatekeeper” between the two worlds
  • Device integrity, Digital Rights Management, Electronic payment, etc
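
The access rule described above can be sketched as a small software model. In the real system the check is enforced by hardware; the names below (`World`, `access_allowed`) are hypothetical and only illustrate the rule that secure-tagged memory is visible from the Secure world alone.

```python
from enum import Enum

class World(Enum):
    NORMAL = "normal"
    SECURE = "secure"

def access_allowed(cpu_world, region_is_secure):
    """Secure-tagged regions are reachable only from the Secure world."""
    return cpu_world is World.SECURE or not region_is_secure

for world in World:
    for region_is_secure in (False, True):
        tag = "secure" if region_is_secure else "non-secure"
        print(f"{world.value:>6} world -> {tag:>10} region: "
              f"{access_allowed(world, region_is_secure)}")
```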

OMAP ARM Cores Performance: Dhrystone V2.1

Perf.PNG