Cortex-A8 Architecture

From Texas Instruments Wiki


Cortex-A8 Pipeline Diagram

Cortex-A8Pipeline.png



Cortex-A8 Control Diagram

  • High-quality branch prediction results in fewer replays and lower power
  • Branch prediction maintains 95% accuracy across a wide range of code

CortexA8-ControlFlow.png

  • Dynamic branch predictor components
    • 512-entry 2-way BTB
    • 4K-entry GHB indexed by branch history and PC
    • 8-entry return stack
  • Branch resolution
    • All branches are resolved in a single stage
    • Maintains speculative and non-speculative versions of branch history and return stack
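
The GHB lookup described above can be illustrated with a small software model. This is only a sketch: the page says the GHB has 4K entries and is indexed by branch history and PC, but it does not give the hash function, the history length, or the counter width. The XOR hash, the 10-bit history, and the 2-bit saturating counters below are assumptions in the style of a textbook gshare predictor, and all names (ghb_t, ghb_predict, ghb_update) are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define GHB_ENTRIES  4096u  /* 4K entries, as listed above */
#define HISTORY_BITS 10u    /* assumed history length, not from the source */

typedef struct {
    uint8_t  counter[GHB_ENTRIES]; /* assumed 2-bit saturating counters, 0..3 */
    uint32_t history;              /* recent outcomes, newest in bit 0 */
} ghb_t;

/* Assumed hash: XOR the word-aligned PC with the branch history. */
static uint32_t ghb_index(const ghb_t *g, uint32_t pc)
{
    return ((pc >> 2) ^ g->history) & (GHB_ENTRIES - 1u);
}

static void ghb_init(ghb_t *g)
{
    for (uint32_t i = 0; i < GHB_ENTRIES; i++)
        g->counter[i] = 1; /* start at weakly not-taken */
    g->history = 0;
}

/* Predict taken when the selected counter is in its upper half. */
static bool ghb_predict(const ghb_t *g, uint32_t pc)
{
    return g->counter[ghb_index(g, pc)] >= 2;
}

/* Train the selected counter, then shift the outcome into the history. */
static void ghb_update(ghb_t *g, uint32_t pc, bool taken)
{
    uint8_t *c = &g->counter[ghb_index(g, pc)];
    if (taken && *c < 3)
        (*c)++;
    if (!taken && *c > 0)
        (*c)--;
    g->history = ((g->history << 1) | (taken ? 1u : 0u))
                 & ((1u << HISTORY_BITS) - 1u);
}
```

In this model a branch that is always taken needs a few updates before the history register settles and the same counter is trained repeatedly. The sketch keeps a single copy of the history; the speculative and non-speculative versions mentioned above are not modelled.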

Cortex-A8 Instruction Decode

CortexA8-instDecode.png

  • Instruction decode highlights
    • 4-entry pending queue reduces fetch stalls and increases pairing opportunities
    • Replay queue keeps instructions for reissue on a memory system stall
    • Scoreboard predicts register availability using static scheduling techniques
    • Cross-checks in D3 allow issue of dependent instruction pairs

Cortex-A8 Instruction Issuing

  • Two ALU instructions
  • One ALU instruction and one load/store instruction
  • One multiply/MAC instruction with one of the following:
    • ALU instruction
    • load/store instruction
    • NEON data processing instruction
  • Two NEON data processing instructions
  • One NEON data processing instruction with one of the following:
    • Load/store instruction
    • ALU instruction

Some instructions issue only to pipeline 0:

  • Multiply/MAC instructions
  • Load/store multiple and other multi-cycle instructions
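
The pairing rules above can be written as a small lookup table. This is a minimal sketch that assumes the listed combinations are exhaustive, so that two load/store instructions or two multiply instructions do not pair. The enum and function names are hypothetical, and the four classes are a simplification: multi-cycle instructions such as load/store multiple are not modelled separately.

```c
#include <stdbool.h>

/* Coarse instruction classes taken from the list above. */
typedef enum {
    INSN_ALU,
    INSN_LOAD_STORE,
    INSN_MULTIPLY,   /* multiply/MAC */
    INSN_NEON        /* NEON data processing */
} insn_class_t;

/* Returns true if the two classes appear together in the list above.
 * Assumption: combinations that are not listed cannot dual-issue. */
static bool can_dual_issue(insn_class_t a, insn_class_t b)
{
    /* Rows and columns are in enum order: ALU, load/store, multiply, NEON. */
    static const bool pair_ok[4][4] = {
        /* ALU        */ { true, true,  true,  true },
        /* load/store */ { true, false, true,  true },
        /* multiply   */ { true, true,  false, true },
        /* NEON       */ { true, true,  true,  true },
    };
    return pair_ok[a][b];
}

/* Multiply/MAC instructions are listed as issuing only to pipeline 0. */
static bool requires_pipeline0(insn_class_t c)
{
    return c == INSN_MULTIPLY;
}
```

For example, can_dual_issue(INSN_ALU, INSN_LOAD_STORE) is true, while can_dual_issue(INSN_LOAD_STORE, INSN_LOAD_STORE) is false under the stated assumption.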

Cortex-A8 Instruction Execution

CortexA8-InstExec.png

  • Execution pipeline highlights
    • 2 symmetric ALU pipelines: Shift/ALU/SAT
    • Load/store pipe used by instructions in either pipeline
    • Multiply instructions are tied to pipe 0
    • All key forwarding paths supported
    • Static scheduling allows for extensive clock gating

Cortex-A8 Memory System

  • Harvard Level 1 caches: instruction and data, each 16 KB, 4-way set associative
    • Single-cycle load-use penalty
    • Virtually indexed, physically tagged (VIPT)
    • Level 1 data cache is blocking
      • Non-NEON read misses in the cache cause replay of subsequent instructions
      • Reduces complexity in later pipeline stages
      • Good for power and clock frequency
      • NEON data is not allocated to L1 (but is read or updated in L1 if necessary)
  • Integrated 256 KB unified Level 2 Cache, 8-way set associative
    • Dedicated low-latency, high-bandwidth interface to the Level 1 cache
    • Line length of 64 bytes
    • Physically indexed, physically tagged (PIPT)
    • Minimum latency of 8 cycles
    • Streams to the NEON processing unit; up to 16 GB/s bandwidth
  • 128-bit data streaming from both L1D$ and L2$
  • 64-bit AMBA AXI interconnect to external memory
    • Supports multiple outstanding memory transactions to minimize memory latencies
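
The L2 figures above determine how a physical address is split into offset and set index, and the 128-bit streaming width gives a peak bandwidth for a given clock. The following sketch works through that arithmetic. The helper names are hypothetical, and the 1 GHz clock used in the usage note is an assumption chosen only because it makes a 128-bit path match the 16 GB/s figure quoted above.

```c
#include <stdint.h>

typedef struct {
    uint32_t sets;        /* number of sets */
    uint32_t offset_bits; /* address bits selecting a byte within a line */
    uint32_t index_bits;  /* address bits selecting a set */
} cache_geometry_t;

/* Integer log2 for power-of-two inputs. */
static uint32_t log2_u32(uint32_t x)
{
    uint32_t n = 0;
    while (x > 1u) {
        x >>= 1;
        n++;
    }
    return n;
}

/* sets = size / (ways * line length) for a set-associative cache. */
static cache_geometry_t cache_geometry(uint32_t size_bytes, uint32_t ways,
                                       uint32_t line_bytes)
{
    cache_geometry_t g;
    g.sets = size_bytes / (ways * line_bytes);
    g.offset_bits = log2_u32(line_bytes);
    g.index_bits = log2_u32(g.sets);
    return g;
}

/* Peak bytes per second for a data path that moves one beat per clock. */
static uint64_t peak_bandwidth(uint32_t bus_bits, uint64_t clock_hz)
{
    return (uint64_t)(bus_bits / 8u) * clock_hz;
}
```

With the L2 parameters listed above, cache_geometry(256 * 1024, 8, 64) gives 512 sets, 6 offset bits and 9 index bits. A 128-bit path at an assumed 1 GHz gives peak_bandwidth(128, 1000000000) = 16,000,000,000 bytes per second.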

Cortex-A8 Control Coprocessor

  • The processor does not have an external coprocessor interface, but it does implement two internal coprocessors, CP14 and CP15
  • CP14 coprocessor: also known as the debug coprocessor
    • Used for various debug functions
  • CP15 coprocessor: also known as the system control coprocessor
    • Used to control and provide status information for the functions implemented in the processor

CortexA8-coProc.png

Cortex-A8 CP15 Register Groups

Function                    CP15 Registers
System Configuration        c0
System Control              c1
Translation Base Control    c2
Domain Access Control       c3
Faults                      c5/c6
Cache Operations            c7
TLB Operations              c8/c10
Performance Monitor         c9
L2 Control                  c9
Pre-load Engine             c11
Interrupts                  c12
Process ID                  c13
Memory Arrays               c15
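
CP15 registers are accessed with the MRC (read) and MCR (write) instructions, where the primary register from the table above is the CRn operand. As an illustration, the sketch below builds the 32-bit ARM-state encoding of an unconditional MRC or MCR from its operands. The function name is hypothetical, and the field layout is the standard ARM coprocessor register transfer encoding as best recalled here; check it against the ARM Architecture Reference Manual before relying on it.

```c
#include <stdint.h>

/* Encode "MRC/MCR p<coproc>, <opc1>, r<rt>, c<crn>, c<crm>, <opc2>"
 * with the AL (always) condition in ARM state.
 * is_read != 0 selects MRC (coprocessor to ARM register), 0 selects MCR. */
static uint32_t encode_cp_access(int is_read, uint32_t coproc, uint32_t opc1,
                                 uint32_t rt, uint32_t crn, uint32_t crm,
                                 uint32_t opc2)
{
    return (0xEu << 28)                    /* condition: always */
         | (0xEu << 24)                    /* coprocessor register transfer */
         | ((opc1 & 0x7u) << 21)
         | ((is_read ? 1u : 0u) << 20)     /* L bit: 1 = MRC, 0 = MCR */
         | ((crn & 0xFu) << 16)
         | ((rt & 0xFu) << 12)
         | ((coproc & 0xFu) << 8)
         | ((opc2 & 0x7u) << 5)
         | (1u << 4)
         | (crm & 0xFu);
}
```

For example, reading the System Control register (c1 in the table) with "MRC p15, 0, r0, c1, c0, 0" corresponds to encode_cp_access(1, 15, 0, 0, 1, 0, 0), which yields 0xEE110F10.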

Cortex-A8 Performance Monitor Unit

  • Controlled through CP15 registers (c9)
  • Four Counters
  • Count events
    • Cache misses
    • TLB misses
    • Branch mispredictions
    • Exceptions
    • External events
    • Others
  • Interrupt output on overflow

Register                         Description
Performance monitor control      Controls the operation of the count registers
Count Enable Set                 Enables PMU count registers
Count Enable Clear               Disables PMU count registers
Overflow Flag Status             Enables/disables PMU count overflow flags
Software Increment               Increments the count of a PMU count register
Performance counter selection    Selects a PMU counter
Cycle Count                      Reads/writes the PMU cycle count register
Event selection                  Selects the event for the PMU to count
Performance Monitor Count        Reads/writes the 4 PMU event count registers
User Enable                      Allows user mode to access the PMU
Interrupt Enable Set             Enables overflow interrupts
Interrupt Enable Clear           Disables overflow interrupts
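
To make the roles of the control and Count Enable Set registers concrete, the sketch below builds the values a driver might write to them to reset and start the counters. The bit positions (enable in bit 0, event counter reset in bit 1, cycle counter reset in bit 2, divide-by-64 in bit 3, cycle counter enable in bit 31 of Count Enable Set) are not given on this page. They are assumptions taken from ARM's Cortex-A8 documentation as best recalled, and should be checked against the Technical Reference Manual. All macro and function names are hypothetical, and the code only computes values; on the target they would be written with MCR instructions to the c9 group.

```c
#include <stdint.h>

/* Assumed bit layout of the Performance monitor control register. */
#define PMNC_ENABLE        (1u << 0)  /* enable all counters */
#define PMNC_RESET_EVENTS  (1u << 1)  /* reset the event counters */
#define PMNC_RESET_CYCLES  (1u << 2)  /* reset the cycle counter */
#define PMNC_DIV64         (1u << 3)  /* cycle counter ticks every 64 cycles */

/* Assumed layout of Count Enable Set: bits 0..3 select the four event
 * counters, bit 31 selects the cycle counter. */
#define CNTEN_CYCLE        (1u << 31)
#define CNTEN_EVENT_MASK   0xFu

/* Value that resets every counter and starts counting. */
static uint32_t pmnc_start_value(int divide_by_64)
{
    uint32_t v = PMNC_ENABLE | PMNC_RESET_EVENTS | PMNC_RESET_CYCLES;
    if (divide_by_64)
        v |= PMNC_DIV64;
    return v;
}

/* Mask for Count Enable Set; event_counters is a bitmask of counters 0..3. */
static uint32_t cnten_mask(uint32_t event_counters, int cycle_counter)
{
    uint32_t m = event_counters & CNTEN_EVENT_MASK;
    if (cycle_counter)
        m |= CNTEN_CYCLE;
    return m;
}

/* Elapsed count between two reads of a 32-bit counter. Unsigned
 * subtraction stays correct across a single wrap-around. */
static uint32_t counter_delta(uint32_t start, uint32_t end)
{
    return end - start;
}
```

A typical sequence would write cnten_mask(0x1, 1) to Count Enable Set, write pmnc_start_value(0) to the control register, run the code under test, and then use counter_delta on two Cycle Count reads. If more than one wrap is possible, the overflow interrupt or the divide-by-64 option would be needed.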

Cortex-A8 L2 Preload engine

  • The PLE is not the same as the Direct Memory Access (DMA) engine used in previous ARM processor families, but it has a similar programming interface.
  • Moves cache lines to/from L2
  • Two channels
  • Maximum number of cache lines is limited by the cache way size (16K in OMAP3)
  • Set of registers to control the PLE (through CP15) in secure privileged mode
  • Transfers only dirty data from L2
  • Supports the ability to lock data to a specific L2 cache way.
  • Generates output signals
    • nDMAIRQ
    • nDMASIRQ
    • nDMAEXTERRIRQ
  • Different from PLD, which is an actual single-cycle instruction that preloads a line of data into L2. Unlike in the ARMv6 architecture, the PLD instruction does not preload data into the L1 cache.
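
From C, the PLD instruction is usually reached through the GCC/Clang builtin __builtin_prefetch, which those compilers typically lower to PLD on ARM targets. The sketch below issues one prefetch per 64-byte L2 line while summing an array. The function name and the prefetch distance of four lines are hypothetical; a suitable distance depends on memory latency and has to be measured on the target.

```c
#include <stddef.h>
#include <stdint.h>

#define L2_LINE_BYTES        64u /* L2 line length listed in the memory system section */
#define PREFETCH_LINES_AHEAD 4u  /* hypothetical distance, tune on target */

/* Sum an array while hinting upcoming lines to the cache.
 * The prefetch is a hint only and does not change the result. */
static uint32_t sum_with_preload(const uint32_t *data, size_t count)
{
    const size_t words_per_line = L2_LINE_BYTES / sizeof(uint32_t);
    const size_t ahead = PREFETCH_LINES_AHEAD * words_per_line;
    uint32_t sum = 0;

    for (size_t i = 0; i < count; i++) {
        /* One hint per line, and only for addresses inside the array. */
        if (i % words_per_line == 0 && i + ahead < count)
            __builtin_prefetch(&data[i + ahead], 0, 3);
        sum += data[i];
    }
    return sum;
}
```

Because PLD on this core fills L2 rather than L1, a loop like this may still see L1 misses on the prefetched data; the hint shortens the miss by avoiding the trip to external memory.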