Cortex-A8 Features
From Texas Instruments Embedded Processors Wiki
Cortex-A8 Highlights
- Dual-issue, in-order, superscalar architecture delivering high performance
- First implementation of the ARMv7 instruction-set architecture, including the Advanced SIMD media instructions (NEON™)
- Advanced dynamic branch prediction
- Integrated, 256 KB unified L2 cache
- Dedicated, low-latency, high-bandwidth interface to L1 cache
- NEON™ : 64/128-bit Hybrid SIMD Engine for Multimedia
- Supports both Integer and Floating Point SIMD
- Enhanced VFPv3: doubles the number of double-precision registers and adds new instructions to convert between fixed and floating point
- Efficient Run Time Compilation Target
- Jazelle-RCT: Target for Java. Memory footprint reduced by up to 3x
- Can also target languages such as Microsoft .NET MSIL, Perl, Python
Superscalar Cortex-A8 Core
- In-order dual instruction issue
- less complex than out-of-order
- fewer structures means lower power
- less need for custom design
- can maintain high IPC with
- fully symmetric ALU pipelines
- all critical forwarding paths supported
- dual-issue of dependent instruction pairs
- Static scheduling with instruction replay on memory stall
- low-power consumption due to early availability of gate enables
- fire-and-forget instruction issue removes critical paths from the design
- Net result
- high-frequency design with performance comparable to out-of-order, but with the clock frequency and power consumption of an in-order design
- Average CPI of 0.9 across 150+ ARM and industry benchmarks
Cortex-A8 Technologies
| Cortex-A8 Technologies | Description |
| TrustZone Security | Device Integrity / Secure Transactions |
| Jazelle RCT Acceleration / Thumb-2EE Instruction Set | Fast & Responsive Java Applications |
| Thumb-2 Instruction Set | Greater Performance With Less Code Size |
| NEON™ Advanced SIMD (+VFPv3) | Enhanced Multimedia Experience |
| Superscalar ARMv7 Core | Highest-performance mobile processor |
NEON: Advanced SIMD
- 64/128-bit Hybrid SIMD architecture
- A single instruction performs the same operation on multiple elements that are packed within registers
- Independent Register file with 2 aliased views:
- 32 x 64-bit registers (D0-D31)
- 16 x 128-bit registers (Q0-Q15)
- Integer and SP Floating-point processing
- 8, 16, 32, 64-bit Integers
- Single-precision Floating-point
- Encoded in ARM and Thumb-2
- Accelerates audio, video, and 3D-graphics
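The two aliased views of the register file can be illustrated with a small model. This is only a Python sketch, not NEON code, and the class and method names are hypothetical. It assumes the architectural mapping in which Qn overlays D(2n) and D(2n+1), with D(2n) holding the low 64 bits.

```python
class NeonRegisterFile:
    """Toy model: 256 bytes of storage viewed as 32 D or 16 Q registers."""

    def __init__(self):
        # 32 x 64-bit and 16 x 128-bit are the same 256 bytes.
        self.storage = bytearray(256)

    def write_d(self, n, value):
        """Write a 64-bit unsigned value to Dn (n in 0..31)."""
        if not 0 <= n < 32:
            raise ValueError("D register index must be 0..31")
        self.storage[8 * n:8 * n + 8] = value.to_bytes(8, "little")

    def read_d(self, n):
        """Read Dn as a 64-bit unsigned value."""
        if not 0 <= n < 32:
            raise ValueError("D register index must be 0..31")
        return int.from_bytes(self.storage[8 * n:8 * n + 8], "little")

    def write_q(self, n, value):
        """Write a 128-bit unsigned value to Qn (n in 0..15)."""
        if not 0 <= n < 16:
            raise ValueError("Q register index must be 0..15")
        self.storage[16 * n:16 * n + 16] = value.to_bytes(16, "little")

    def read_q(self, n):
        """Read Qn as a 128-bit unsigned value."""
        if not 0 <= n < 16:
            raise ValueError("Q register index must be 0..15")
        return int.from_bytes(self.storage[16 * n:16 * n + 16], "little")


regs = NeonRegisterFile()
regs.write_d(2, 0x1111111111111111)
regs.write_d(3, 0x2222222222222222)
# Q1 now reads back as D3 (high half) concatenated with D2 (low half).
q1 = regs.read_q(1)
```

Writing through one view and reading through the other is what lets code mix 64-bit and 128-bit operations on the same data without copies.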
NEON: SIMD Instructions
- NEON™ Instructions are based on “Packed SIMD” processing
- Registers are considered as vectors of elements of the same data type
- Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single-precision float
- Instructions perform the same operation in all lanes
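As a rough illustration of packed SIMD, the Python sketch below treats a 64-bit integer as a vector of equal-width lanes and adds two such vectors lane by lane. The function names are hypothetical and the code models only the arithmetic, not the hardware; the wrap-around behaviour is meant to resemble a modular integer add such as VADD.I8.

```python
def split_lanes(reg, lane_bits, reg_bits=64):
    """Unpack a register value into a list of unsigned lanes, lane 0 first."""
    mask = (1 << lane_bits) - 1
    return [(reg >> (i * lane_bits)) & mask for i in range(reg_bits // lane_bits)]


def join_lanes(lanes, lane_bits):
    """Pack lanes back into one register value, truncating each to lane_bits."""
    mask = (1 << lane_bits) - 1
    reg = 0
    for i, lane in enumerate(lanes):
        reg |= (lane & mask) << (i * lane_bits)
    return reg


def simd_add(a, b, lane_bits, reg_bits=64):
    """Lane-wise modular add: a carry never crosses a lane boundary."""
    lanes_a = split_lanes(a, lane_bits, reg_bits)
    lanes_b = split_lanes(b, lane_bits, reg_bits)
    return join_lanes([x + y for x, y in zip(lanes_a, lanes_b)], lane_bits)


# Same bit patterns, different element type: the lane width decides
# whether the carry out of the low byte is dropped or propagated.
as_bytes = simd_add(0x00FF, 0x0001, lane_bits=8)       # low byte wraps to 0
as_halfwords = simd_add(0x00FF, 0x0001, lane_bits=16)  # carry stays in lane 0
```

The same register contents give different results depending on the element type chosen, which is why each instruction specifies the data type of its lanes.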
SIMD Load/Store Structure
- Native support for structures
- e.g. complex numbers, pixels, coordinates
- Memory treated as an array of structures (AoS)
- Eliminates ‘shuffling’ overhead
- Optimised memory access as single transfer
- Data arranged for efficient SIMD processing
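The effect of a structure load can be sketched in Python as a de-interleave: an array of structures in memory becomes one vector per field. The function names are hypothetical, and plain lists stand in for registers; on NEON this role is played by structure load/store instructions such as VLD3 and VST3.

```python
def load_deinterleaved(memory, fields, count):
    """Split `count` structures of `fields` elements into one list per field."""
    needed = fields * count
    if len(memory) < needed:
        raise ValueError("not enough elements in memory")
    # Field f of every structure sits at offsets f, f + fields, f + 2*fields, ...
    return [list(memory[f:needed:fields]) for f in range(fields)]


def store_interleaved(registers):
    """Inverse operation: merge per-field lists back into an array of structures."""
    out = []
    for elements in zip(*registers):
        out.extend(elements)
    return out


# Four RGB pixels stored as R, G, B, R, G, B, ...
pixels = [10, 20, 30, 11, 21, 31, 12, 22, 32, 13, 23, 33]
red, green, blue = load_deinterleaved(pixels, fields=3, count=4)
```

After the load, each list holds a single colour channel, so one lane-wise operation can process the same field of several pixels at once.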
Key NEON Capabilities
- Two Integer 64-bit ALUs operating in parallel
- Can perform the equivalent of a 128-bit ALU operation in 1 cycle
- 64-bit datapath with data types up to 128 bits
- Supports 128-bit data streaming from both L1D$ and L2$
- Byte permute function allows for on-the-fly data shuffling
- Two Integer Multipliers of 32x16
- Each can perform one 32x16, two 16x16 or four 8x8 operations in a single pass
- Support 32x32 operation in two passes
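One way to see why a 32x32 multiply fits in two passes of a 32x16 multiplier is the arithmetic identity a × b = a × b_lo + (a × b_hi << 16). The Python sketch below shows that identity for unsigned operands only; the function names are hypothetical, and how the hardware actually splits and accumulates the operands is an assumption not described above.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul_32x16(a, b16):
    """One pass: unsigned 32-bit by 16-bit multiply."""
    if not (0 <= a <= MASK32 and 0 <= b16 <= MASK16):
        raise ValueError("operands out of range for a 32x16 multiply")
    return a * b16


def mul_32x32_two_pass(a, b):
    """Unsigned 32x32 multiply built from two 32x16 partial products."""
    if not 0 <= b <= MASK32:
        raise ValueError("b must fit in 32 bits")
    low_pass = mul_32x16(a, b & MASK16)   # a times the low half of b
    high_pass = mul_32x16(a, b >> 16)     # a times the high half of b
    return low_pass + (high_pass << 16)


product = mul_32x32_two_pass(0x12345678, 0x9ABCDEF0)
```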
Thumb-2 Instruction Set
- Combined 32 and 16 bit instruction set:
- 16 bit instructions include the original Thumb instruction set
- Some new 16 bit instructions for key code size wins
- Virtually all instructions available in the ARM ISA are also available in Thumb-2
- In principle, can stand alone as a complete ISA
- Unified assembly language for ARM and Thumb-2: the same source can be targeted to either ISA
- Conditional execution made available via the IT (If-Then) instruction
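Because 16-bit and 32-bit instructions are mixed in one stream, a decoder has to determine each instruction's length from its first halfword. The Python sketch below uses the Thumb-2 encoding rule that a first halfword whose top five bits are 0b11101, 0b11110 or 0b11111 begins a 32-bit instruction; the function names are hypothetical.

```python
def thumb2_instruction_size(first_halfword):
    """Return the instruction length in bytes (2 or 4) from its first halfword."""
    if not 0 <= first_halfword <= 0xFFFF:
        raise ValueError("expected a 16-bit halfword")
    top5 = first_halfword >> 11
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def split_instructions(halfwords):
    """Group a stream of halfwords into tuples, one tuple per instruction."""
    instructions = []
    i = 0
    while i < len(halfwords):
        length = thumb2_instruction_size(halfwords[i]) // 2
        if i + length > len(halfwords):
            raise ValueError("stream ends in the middle of an instruction")
        instructions.append(tuple(halfwords[i:i + length]))
        i += length
    return instructions


# 16-bit PUSH, 32-bit BL (two halfwords), 16-bit POP
stream = [0xB510, 0xF000, 0xF800, 0xBD10]
decoded = split_instructions(stream)
```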
Jazelle RCT Acceleration
- Beneficial to Java and a wide range of emerging languages
- Microsoft .NET MSIL, Perl, Python, etc.
- Enables high performance in smallest memory footprint
- Optimal balance between speed and code density with run-time compilers
- Low cost and low power
- Less than 8K gates and small memory footprint result in lower power
Thumb-2EE: Basis of Jazelle RCT
- Thumb-2EE (Thumb-2 Execution Environment) is a variant of Thumb-2 with instructions to support JIT and AOT runtime compilers
- Targets any OO bytecode language such as Java and MS .NET IL
- 16-bit instructions for common AOT/JIT compilation routines
- Smaller code
- Smaller code size means recompiled methods can be kept in memory
- Less recompilation means faster performance and no start up delays
Memory System on Cortex-A8
- Harvard Level 1 caches: both 16 KB, 4-way set associative
- single-cycle load-use penalty
- Virtually indexed, physically tagged (VIPT)
- Level 1 Data cache is blocking
- Non-NEON read misses in the cache cause replay of subsequent instructions
- Reduces complexity in later pipeline stages
- Good for power and clock frequency
- NEON data is not allocated to L1 (but will be read or updated in L1 if necessary)
- Integrated 256 KB unified Level 2 Cache, 8-way set associative
- Dedicated low-latency, high-bandwidth interface to the Level 1 cache
- Line length of 64 bytes
- Physically indexed, physically tagged (PIPT)
- Minimum latency of 8 cycles
- Streams to the NEON processing unit; up to 16 GByte/s bandwidth
- 128-bit data streaming from both L1D$ and L2$
- 64-bit AMBA AXI interconnect to external memory
- Supports multiple outstanding memory transactions to minimize memory latencies
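The L2 figures above (256 KB, 8-way, 64-byte lines) determine how a physical address is split into tag, set index and line offset. The Python sketch below works that out and also checks the streaming figure; the function names are hypothetical. The 1 GHz clock used for the bandwidth check is an inference from 16 bytes per cycle giving 16 GByte/s, not a number stated above.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Derive the set count and address-field widths of a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    for name, value in (("line_bytes", line_bytes), ("sets", sets)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a power of two, got {value}")
    return {
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,
        "index_bits": sets.bit_length() - 1,
    }


def split_address(addr, geometry):
    """Split a physical address into (tag, set index, line offset)."""
    offset_bits = geometry["offset_bits"]
    index_bits = geometry["index_bits"]
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


def stream_bandwidth_bytes_per_s(bits_per_cycle, clock_hz):
    """Peak bandwidth if one transfer of bits_per_cycle completes every cycle."""
    return bits_per_cycle // 8 * clock_hz


l2 = cache_geometry(256 * 1024, ways=8, line_bytes=64)
tag, index, offset = split_address(0x12345, l2)
# Assumed clock: 1 GHz (not stated in the text above).
peak = stream_bandwidth_bytes_per_s(128, 1_000_000_000)
```

With these parameters the L2 has 512 sets, so an address uses 6 offset bits and 9 index bits, with the remaining upper bits forming the tag. The same function could be applied to the L1 caches once a line length is assumed for them.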
TrustZone Security
- TrustZone adds a parallel world to run secure OS and applications
- Normal and Secure worlds have different memory views, enforced by hardware
- Memory tagged as secure and non-secure by the system
- Only a CPU running in the Secure world can access the secure memory & peripherals
- Secure Monitor is a software “gatekeeper” between the two worlds
- Device integrity, Digital Rights Management, Electronic payment, etc.
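The access rule can be written down as a tiny model. This Python sketch is illustrative only: the names are hypothetical, the real check is enforced by hardware rather than software, and it assumes the usual TrustZone behaviour in which the Secure world can also reach non-secure memory.

```python
from enum import Enum


class World(Enum):
    NORMAL = "normal"
    SECURE = "secure"


def access_allowed(cpu_world, region_is_secure):
    """Secure regions are reachable only from the Secure world.

    Non-secure regions are assumed reachable from both worlds.
    """
    return cpu_world is World.SECURE or not region_is_secure


# The one forbidden combination: Normal world touching secure memory.
blocked = not access_allowed(World.NORMAL, region_is_secure=True)
```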
OMAP ARM Cores Performance Dhrystone V2.1
Comments on Cortex-A8 Features
Terryanderson said ...
We obtained the "official" Dhrystone V2.1 benchmark source code from http://www.netlib.org/benchmark/dhry-c and compiled and ran it. We converted the Dhrystone figure to DMIPS by dividing by 1757, which we understand to be the standard conversion.
On our OMAP 3525 running LynxOS SE at 500 MHz and compiling with the LynuxWorks cross compiler, we get 237.1 DMIPS. To compare, we used an older board with the OMAP 3430 (which we used prior to 3525 availability) under Monta Vista mobile Linux with their cross-compiler and got 258.7 DMIPS.
We realize that Dhrystone measurements will vary somewhat due to differences in compilers and OSs, but we would not expect to get only around 25% of TI's figure.
Can you suggest why we might be getting such drastically lower figures?
--Terryanderson 08:10, 11 February 2011 (CST)
Jefflance01 said ...
Terry,
Please see this post: http://e2e.ti.com/support/dsp/omap_applications_processors/f/447/p/93345/324880.aspx#324880
--Jefflance01 11:05, 11 February 2011 (CST)
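For reference, the conversion described in the comment thread can be sketched in Python. The function names are hypothetical; 1757 is the Dhrystones-per-second score of the VAX 11/780 used as the 1 DMIPS baseline, and the per-MHz figure uses only the 500 MHz clock and the 237.1 DMIPS result quoted for the OMAP 3525.

```python
VAX_11_780_DHRYSTONES_PER_S = 1757  # baseline that defines 1 DMIPS


def dmips(dhrystones_per_second):
    """Convert a raw Dhrystone score to DMIPS."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_S


def dmips_per_mhz(dmips_value, clock_mhz):
    """Normalise a DMIPS figure by clock frequency."""
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    return dmips_value / clock_mhz


# OMAP 3525 result from the comment: 237.1 DMIPS at 500 MHz.
omap3525_per_mhz = dmips_per_mhz(237.1, 500)
```

Normalising by clock frequency makes results from boards running at different speeds easier to compare, although compiler and OS differences still affect the raw score.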