Cortex-A8 Architecture
From Texas Instruments Embedded Processors Wiki
Cortex-A8 Pipeline Diagram
Cortex-A8 Control Diagram
- High quality branch prediction results in fewer replays and lower power
- Branch prediction maintains 95% accuracy over a wide codebase
- Dynamic branch predictor components
- 512-entry 2-way BTB
- 4K-entry GHB indexed by branch history and PC
- 8-entry return stack
- Branch resolution
- All branches are resolved in a single stage
- Maintains speculative and non-speculative versions of branch history and return stack
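The source does not say how the GHB combines branch history with the PC, so the following is only a minimal sketch of one common scheme (a gshare-style XOR hash over 2-bit saturating counters). The type `ghb_model`, the function names, and the hash itself are assumptions for illustration, not the Cortex-A8 implementation; only the 4K entry count comes from the list above.

```c
#include <stdint.h>

#define GHB_ENTRIES 4096u /* 4K-entry GHB, per the list above */

/* Hypothetical model: one 2-bit saturating counter per GHB entry. */
typedef struct {
    uint8_t  counter[GHB_ENTRIES]; /* 0..1 predict not taken, 2..3 predict taken */
    uint16_t history;              /* recent branch outcomes, newest in bit 0 */
} ghb_model;

/* Assumed gshare-style hash: XOR the history with word-aligned PC bits. */
static unsigned ghb_index(const ghb_model *g, uint32_t pc)
{
    return ((pc >> 2) ^ g->history) & (GHB_ENTRIES - 1u);
}

/* Returns 1 when the indexed counter predicts "taken". */
static int ghb_predict(const ghb_model *g, uint32_t pc)
{
    return g->counter[ghb_index(g, pc)] >= 2;
}

/* Train the indexed counter, then shift the outcome into the history. */
static void ghb_update(ghb_model *g, uint32_t pc, int taken)
{
    uint8_t *c = &g->counter[ghb_index(g, pc)];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    g->history = (uint16_t)(((g->history << 1) | (taken ? 1u : 0u))
                            & (GHB_ENTRIES - 1u));
}
```

A zero-initialised model predicts "not taken" everywhere. After a branch has been taken often enough for the history to settle, the same index is reinforced on every update and the prediction flips to "taken".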
Cortex-A8 Instruction Decode
- Instruction decode highlights
- 4-entry pending queue reduces fetch stalls and increases pairing opportunities
- Replay queue keeps instructions for reissue on a memory system stall
- Scoreboard predicts register availability using static scheduling techniques
- Cross-checks in D3 allow issue of dependent instruction pairs
Cortex-A8 Instruction Issuing
- Two ALU instructions
- One ALU instruction and one load-store instruction
- One MAC instruction with one of:
  - ALU instruction
  - Load-store instruction
  - NEON data processing instruction
- Two NEON data processing instructions
- One NEON data processing instruction with one of:
  - Load-store instruction
  - ALU instruction
- Multi-cycle instructions issue only to pipeline 0
- Complex operations (CP15 operations, SVC) are single-issue
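The pairing rules above can be read as a lookup table over instruction classes. The sketch below encodes only the combinations the list names. The enum, the table, and `can_dual_issue` are illustrative names, and the model deliberately ignores operand dependencies, issue order, and the pipeline 0 restriction on multi-cycle instructions.

```c
#include <stdbool.h>

/* Instruction classes named in the list above; the enum itself is illustrative. */
typedef enum {
    IC_ALU, IC_LOAD_STORE, IC_MAC, IC_NEON_DP, IC_COMPLEX, IC_COUNT
} insn_class;

/* pair_ok[a][b] is true when classes a and b appear together in the list.
 * Pairs the list does not mention (for example two load-stores) stay false. */
static const bool pair_ok[IC_COUNT][IC_COUNT] = {
    [IC_ALU]        = { [IC_ALU] = true, [IC_LOAD_STORE] = true,
                        [IC_MAC] = true, [IC_NEON_DP] = true },
    [IC_LOAD_STORE] = { [IC_ALU] = true, [IC_MAC] = true, [IC_NEON_DP] = true },
    [IC_MAC]        = { [IC_ALU] = true, [IC_LOAD_STORE] = true,
                        [IC_NEON_DP] = true },
    [IC_NEON_DP]    = { [IC_ALU] = true, [IC_LOAD_STORE] = true,
                        [IC_MAC] = true, [IC_NEON_DP] = true },
    /* IC_COMPLEX row stays all-false: CP15 operations and SVC are single-issue. */
};

/* True when the two classes may issue together according to the table. */
static bool can_dual_issue(insn_class a, insn_class b)
{
    return pair_ok[a][b];
}
```

For example, `can_dual_issue(IC_ALU, IC_LOAD_STORE)` is true, while `can_dual_issue(IC_COMPLEX, IC_ALU)` is false.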
Cortex-A8 Instruction Execution
- Execution pipeline highlights
- 2 symmetric ALU pipelines: Shift/ALU/SAT
- Load/store pipe used by instructions in either pipeline
- Multiply instructions are tied to pipe 0
- All key forwarding paths supported
- Static scheduling allows for extensive clock gating
Cortex-A8 Memory System
- Harvard Level 1 Caches – both 16KByte, 4 way set associative
- Single-cycle load-use penalty
- Virtually indexed, physically tagged (VIPT)
- Level 1 data cache is blocking
- Non-NEON read misses in the cache cause replay of subsequent instructions
- Reduces complexity in later pipeline stages
- Good for power and clock frequency
- NEON data is not allocated to L1 (but will be read or updated in L1 if necessary)
- Integrated 256 KB unified Level 2 Cache, 8-way set associative
- Dedicated low-latency, high-bandwidth interface to the Level 1 cache
- Line length of 64 bytes
- Physically indexed, physically tagged (PIPT)
- Minimum latency of 8 cycles
- Streams to the NEON processing unit; up to 16 GB/s bandwidth
- 128-bit data streaming from both L1D$ and L2$
- 64-bit AMBA AXI interconnect to external memory
- Supports multiple outstanding memory transactions to minimize memory latencies
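As a worked example of the L2 figures above (256 KB, 8-way, 64-byte lines), the sketch below derives the number of sets and the address bits used for the line offset and set index. The struct and function names are illustrative. The L1 line length is not given in the list, so only the L2 numbers are used.

```c
#include <stdint.h>

typedef struct {
    uint32_t sets;        /* number of sets */
    uint32_t offset_bits; /* address bits selecting a byte within a line */
    uint32_t index_bits;  /* address bits selecting a set */
} cache_geometry;

/* Integer log2 for power-of-two inputs. */
static uint32_t log2_u32(uint32_t x)
{
    uint32_t n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

/* Derive the geometry from total size, associativity, and line length. */
static cache_geometry cache_geometry_of(uint32_t size_bytes, uint32_t ways,
                                        uint32_t line_bytes)
{
    cache_geometry g;
    g.sets = size_bytes / (ways * line_bytes);
    g.offset_bits = log2_u32(line_bytes);
    g.index_bits = log2_u32(g.sets);
    return g;
}

/* Set selected by a (physical) address in a physically indexed cache. */
static uint32_t cache_set_index(cache_geometry g, uint32_t addr)
{
    return (addr >> g.offset_bits) & (g.sets - 1u);
}
```

Calling `cache_geometry_of(256u * 1024u, 8u, 64u)` gives 512 sets, 6 offset bits, and 9 index bits.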
Cortex-A8 Control Coprocessor
- The processor does not have an external coprocessor interface but it does implement two internal coprocessors, CP14 and CP15
- CP14 coprocessor: also known as the debug coprocessor
- Used for various debug functions
- CP15 coprocessor: also known as the system control coprocessor
- Used to control, and provide status information for, the functions implemented in the processor
Cortex-A8 CP15 Register Groups
| Function | CP15 Registers |
| System Configuration | c0 |
| System Control | c1 |
| Translation Base Control | c2 |
| Domain Access Control | c3 |
| Faults | c5/c6 |
| Cache Operations | c7 |
| TLB Operations | c8/c10 |
| Performance Monitor | c9 |
| L2 Control | c9 |
| Pre-load Engine | c11 |
| Interrupts | c12 |
| Process ID | c13 |
| Memory Arrays | c15 |
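Software reaches these registers with the ARM `MRC` (read) and `MCR` (write) coprocessor instructions, where the primary register from the table is the `CRn` operand. As a rough illustration, the sketch below builds the ARM-state encoding of `MRC p15, opc1, Rt, CRn, CRm, opc2` with the "always" condition. The helper name `encode_mrc_cp15` is hypothetical.

```c
#include <stdint.h>

/* Encode "MRC p15, opc1, Rt, CRn, CRm, opc2" (ARM state, condition AL). */
static uint32_t encode_mrc_cp15(unsigned opc1, unsigned rt, unsigned crn,
                                unsigned crm, unsigned opc2)
{
    return 0xEE100010u             /* cond = AL, coprocessor read, bit 4 set */
         | ((opc1 & 0x7u) << 21)
         | ((crn  & 0xFu) << 16)   /* primary register: c0..c15 in the table */
         | ((rt   & 0xFu) << 12)   /* destination ARM register */
         | (15u << 8)              /* coprocessor number: CP15 */
         | ((opc2 & 0x7u) << 5)
         | (crm   & 0xFu);
}
```

For instance, `encode_mrc_cp15(0, 0, 1, 0, 0)` encodes `MRC p15, 0, r0, c1, c0, 0`, a read from the System Control group (c1).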
Cortex-A8 Performance Monitor Unit
- Controlled through CP15 registers (c9)
- Four Counters
- Count events
- Cache misses
- TLB misses
- Branch mispredictions
- Exceptions
- External events
- Others
- Interrupt output on overflow
| Register | Description |
| Performance monitor control | Controls the operation of the count registers |
| Count Enable Set | Enables PMU count registers |
| Count Enable Clear | Disables PMU count registers |
| Overflow Flag Status | Enables/Disables PMU count overflow flags |
| Software Increment | Increments the count of PMU count register |
| Performance counter selection | Selects a PMU counter |
| Cycle Count | Reads/writes the PMU cycle count register |
| Event selection | Selects the event for the PMU to count |
| Performance Monitor Count | Reads/writes the 4 PMU event count registers |
| User Enabled | Allows user mode to access the PMU |
| Interrupt Enable Set | Enables overflow Interrupts |
| Interrupt Enable Clear | Disables overflow interrupts |
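A minimal sketch of how the control, Count Enable Set, and Cycle Count registers might be used together from privileged code. The bit positions and the `c9` opcode combinations reflect my reading of ARM's Cortex-A8 documentation and should be checked against the Technical Reference Manual before use. All macro and function names are illustrative, and the inline assembly is compiled only for ARM targets with GCC-style syntax.

```c
#include <stdint.h>

/* Assumed bit positions; verify against the Cortex-A8 TRM. */
#define PMNC_ENABLE       (1u << 0)  /* E: enable all counters */
#define PMNC_RESET_EVENTS (1u << 1)  /* P: reset the event counters */
#define PMNC_RESET_CYCLES (1u << 2)  /* C: reset the cycle counter */
#define CNTEN_CYCLE       (1u << 31) /* cycle counter bit in Count Enable Set/Clear */

/* Mask for Count Enable Set: cycle counter plus event counters 0..n-1 (n <= 4). */
static uint32_t pmu_enable_mask(unsigned n_event_counters)
{
    uint32_t events = (n_event_counters >= 4) ? 0xFu
                                              : ((1u << n_event_counters) - 1u);
    return CNTEN_CYCLE | events;
}

/* Elapsed count between two reads of a 32-bit counter; correct across one wrap. */
static uint32_t pmu_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

#if defined(__arm__) && defined(__GNUC__)
/* Reset and enable the counters (privileged mode unless User Enable is set). */
static inline void pmu_start(unsigned n_event_counters)
{
    uint32_t ctrl = PMNC_ENABLE | PMNC_RESET_EVENTS | PMNC_RESET_CYCLES;
    uint32_t mask = pmu_enable_mask(n_event_counters);
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(ctrl)); /* control */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(mask)); /* Count Enable Set */
}

/* Read the Cycle Count register. */
static inline uint32_t pmu_read_cycles(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));
    return v;
}
#endif
```

A typical measurement would call `pmu_start(4)`, read the cycle counter before and after the code of interest, and pass both readings to `pmu_elapsed`.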
Cortex-A8 L2 Preload Engine
- The PLE is not the same as the Direct Memory Access (DMA) engine used in previous ARM processor families, but it has a similar programming interface.
- Moves cache lines to/from L2
- Two channels
- Maximum number of cache lines is limited by the cache way size (16 KB in OMAP3)
- Set of registers to control the PLE (through CP15) in secure privileged mode
- Transfer only dirty data from L2
- Supports the ability to lock data to a specific L2 cache way.
- Generates output signals
- nDMAIRQ
- nDMASIRQ
- nDMAEXTERRIRQ
- The PLE differs from PLD, which is an actual single-cycle instruction that preloads L2 with a line of data. Unlike in the ARMv6 architecture, the PLD instruction does not preload data into the L1 cache.
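From C, a PLD hint is usually requested through the GCC/Clang builtin `__builtin_prefetch`, which compilers typically lower to `PLD` on ARM targets (this depends on the compiler and target and is an assumption here). The sketch below prefetches one 64-byte line ahead while summing an array. The function name and the prefetch distance are illustrative and would need tuning for a real workload.

```c
#include <stddef.h>
#include <stdint.h>

/* 16 x 4-byte elements = 64 bytes, one L2 line; the distance is illustrative. */
#define PREFETCH_AHEAD 16u

/* Sum an array while hinting the line that will be needed shortly. */
static uint32_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Arguments: address, 0 = read access, 3 = keep in cache. */
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
        }
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint: the result is identical with or without it, and any benefit depends on the access pattern and on memory latency.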





