OMAP-L1x/C674x/AM1x SoC Constraints

From Texas Instruments Wiki


Hardware Latency

Each master-to-slave transaction passes through elements (SCRs and bridges) in the system interconnect. Each of these elements, in addition to the master and slave themselves, can contribute to the overall latency of the transaction. For example, consider accesses to Timer64P0 and Timer64P2 from the C674x DSP in the OMAPL138/C6748 topology shown below (black dotted lines).

[Figure: Hardwarelatency.png]


A read transaction issued by the DSP to Timer64P0 incurs latency in two stages: first the read request travels from the DSP to the Timer MMR (via SCR2 and BR6), and then the read data returns from the Timer to the DSP (also via BR6 and SCR2).

Similarly, for the DSP to access Timer64P2, a read/write request would traverse through SCR2, BR5, and SCRF6.

Traversing a larger number of bridges/SCRs does not necessarily imply a higher cycle count. The latency also depends on bridge characteristics such as command/data FIFO depths (especially for multiple/burst transactions) and on the clock domain in which a particular endpoint resides. In the illustrated examples where the DSP writes to the two Timers, a write transaction to Timer64P0 takes on the order of 35-40 DSP cycles, and a write transaction to Timer64P2 takes on the order of 10-15 DSP cycles. The primary reason for the higher write latency to Timer64P0 is that it is in the slowest clock domain (AUXCLK or PLL Bypass clock), whereas Timer64P2 is in a faster clock domain (DSP/2 clock).
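To put the cycle counts above in wall-clock terms, the following minimal sketch converts DSP cycles to nanoseconds. It assumes a 300 MHz DSP clock, borrowed from the operating point used in the bandwidth table later in this article; the latency figures may have been measured at a different frequency. The names `cycles_to_ns` and `DSP_CLOCK_MHZ` are illustrative only.

```python
def cycles_to_ns(cycles: float, clock_mhz: float) -> float:
    """Convert a cycle count at the given clock frequency to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


# Assumption: DSP clock of 300 MHz (not stated for these measurements).
DSP_CLOCK_MHZ = 300.0

# Approximate write-latency ranges quoted in the text, in DSP cycles.
write_latency_cycles = {
    "Timer64P0": (35, 40),
    "Timer64P2": (10, 15),
}

for timer, (low, high) in write_latency_cycles.items():
    low_ns = cycles_to_ns(low, DSP_CLOCK_MHZ)
    high_ns = cycles_to_ns(high, DSP_CLOCK_MHZ)
    print(f"{timer}: roughly {low_ns:.1f} to {high_ns:.1f} ns per write")
```

Under that assumption, the slower clock domain of Timer64P0 costs roughly an extra 80 ns per write compared with Timer64P2.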

In general, the topology is optimized to minimize latency between critical masters and slaves. For example, notice that the LCDC is located after SCR1 (main SCR), and it is closer to the DDR2/mDDR memory controller in order to minimize the latency of accesses made by the LCDC to memory.

System peripherals typically have multi-cycle latency because requests and data have to pass through interconnect components. The latency of a transaction is further affected by "head of line blocking" and by whether the transaction is a read or a write. These concepts are briefly explained in the following subsections.


Head of Line Blocking

Bridges implement a command first-in-first-out (FIFO) scheme to queue read/write commands from masters/initiators. All requests are queued strictly in arrival order; bridges do not reorder the commands. As a result, a high priority request at the tail of a queue can be blocked by lower priority commands ahead of it in the queue. This scenario is called bridge head of line blocking. In the figure below, the command FIFO size is 4. The FIFO is completely filled with low priority (7) requests before a higher priority request (0) comes in.


[Figure: Bridge Head of Line Blocking]


In this case, the high priority request is blocked until all four lower priority (7) requests are serviced. When there are multiple masters vying for the same end point (or end points shared by the same bridge), the bridge head of line blocking is one factor that can affect system throughput and a master's ability to service read/write requests targeting a slave peripheral/memory.
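The scenario above can be sketched with a small model of a strict-FIFO bridge. This is an illustrative simulation only, not a description of the actual bridge hardware: the function `service_order`, the request names, and the one-command-per-step servicing are all assumptions made for the example.

```python
from collections import deque


def service_order(fifo_depth, requests):
    """Return the order in which a strict-FIFO bridge services requests.

    `requests` is a list of (name, priority) tuples in arrival order.
    The priority is carried along but never used for reordering, which
    is exactly what produces head-of-line blocking.
    """
    pending = deque(requests)  # commands not yet accepted by the bridge
    fifo = deque()             # the bridge's command FIFO
    serviced = []
    while pending or fifo:
        # Accept as many commands as the FIFO has room for.
        while pending and len(fifo) < fifo_depth:
            fifo.append(pending.popleft())
        # Service exactly one command, always from the head of the queue.
        serviced.append(fifo.popleft())
    return serviced


# Four low priority (7) requests arrive before one high priority (0) request.
requests = [("low0", 7), ("low1", 7), ("low2", 7), ("low3", 7), ("high", 0)]
order = service_order(fifo_depth=4, requests=requests)
print([name for name, _ in order])
```

In this model the priority 0 request is serviced last, after all four priority 7 requests, regardless of its priority value.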


Reads vs Writes

Read transactions are usually more costly (in clock cycles) than writes. For writes, the command and data flow together and can be thought of as "fire-and-forget" in nature. Once a write transaction leaves the master/initiator boundary (for example, when it is sitting in a bridge or in an end point's buffer or FIFO), the initiator can proceed to the next write, even before the previous write reaches its final destination. For reads, a read command pends until the read response/data returns. In general, the initiator therefore cannot issue a new read/write command until the previous read command's response reaches the master/initiator. This is why polling on registers can prove to be very expensive.

NOTE: The above is more prominent for the 32-bit buses. The 64-bit buses are capable, to some extent, of issuing multiple outstanding read and write commands. For details on the buses, please refer to the section on Interconnect Buses in the OMAPL1x/c674x SoC Architecture Overview article.
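The difference between posted writes and blocking reads can be illustrated with a deliberately simplified cost model. The cycle values below (`ISSUE_CYCLES`, `ROUND_TRIP_CYCLES`) are hypothetical placeholders, not measured device numbers, and the model assumes that downstream buffers never fill and that only one read can be outstanding at a time.

```python
def posted_write_cost(n_writes, issue_cycles):
    """Initiator-side cycles for n posted writes.

    Only the cost of handing each write to the interconnect is paid;
    the model assumes downstream buffers never fill up.
    """
    return n_writes * issue_cycles


def blocking_read_cost(n_reads, round_trip_cycles):
    """Initiator-side cycles for n reads with one outstanding read at a time.

    Each read stalls the initiator for the full request/response round trip.
    """
    return n_reads * round_trip_cycles


# Hypothetical placeholder values, chosen only for illustration.
ISSUE_CYCLES = 1
ROUND_TRIP_CYCLES = 40

polls = 100  # e.g. polling a peripheral status register 100 times
print("100 posted writes :", posted_write_cost(polls, ISSUE_CYCLES), "cycles")
print("100 blocking reads:", blocking_read_cost(polls, ROUND_TRIP_CYCLES), "cycles")
```

Even with these rough numbers, a polling loop stalls the initiator far longer than an equal number of writes, which is why interrupt-driven or event-driven approaches are often preferable to polling.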

On-Chip vs Off-Chip Memory

On-chip memory accesses experience less latency when compared to off-chip memory accesses. Off-chip memory is susceptible to extra latency contributed by refresh cycles, CAS latency, etc. If possible, frequently used code should be kept in on-chip memory.


Maximum Bandwidths

Memory bandwidth affects the overall system throughput. This section illustrates how to calculate the maximum theoretical bandwidth for some critical on-/off-chip memories.


Maximum Theoretical Memory Bandwidths (device operating @ 300 MHz)

| Memory | Theoretical Max Bandwidth (MBytes/sec) | Calculation | Notes |
|---|---|---|---|
| c674x L2/L1D | 1200 | 150 MHz x 64 bit [c674x SDMA port frequency x c674x SDMA port bus width] | • SDMA port frequency is always device frequency divided by 2 • Bandwidth for accesses from outside the megamodule (e.g., EDMA or UHPI) |
| Shared RAM | 1200 | 150 MHz x 64 bit [Shared RAM frequency x Shared RAM port bus width] | • Shared RAM frequency is always device frequency divided by 2 |
| EMIFA (SDRAM) | 200 | 100 MHz x 16 bit [EMA_CLK (max) frequency x SDRAM memory bus width] | |
| EMIFA Asynchronous Memories | 66.67 | 33.33 MHz x 16 bit [(EMA_CLK)/(Setup+Strobe+Hold) x 16 bit] | • Illustrated calculations assume a Setup/Strobe/Hold value of 1 cycle each • Assumes a 16-bit async interface |
| EMIFB (SDRAM) | 532 | 133 MHz x 32 bit [EMB_CLK (max) frequency x SDRAM memory bus width] | • Applicable to OMAPL137/c6747/45/43 and AM17xx |
| mDDR/DDR2 | 600 | 150 MHz x 2 x 16 bit [DDR_CLK x double data rate x mDDR/DDR2 memory bus width] | • Applicable to OMAPL138/c6748/46 and AM18xx • Max mDDR/DDR2 frequency might depend on device operating frequency |


NOTE: Theoretical maximum bandwidth is the maximum possible bandwidth, calculated purely on the basis of memory clock and bus width. This calculation does not take into consideration any additional system-level inefficiencies such as additional latency in the interconnect, additional cycles for off-chip access due to memory configuration/characteristics, additional latency incurred by competing traffic, prioritization, and use of shared resources (for example, a single bridge buffering read/write commands in a path to multiple slave end points).
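The calculations in the table above all follow the same pattern: clock frequency times bus width, with an extra factor of 2 for double data rate memory. A minimal sketch that reproduces the table values is shown below. The function name and the `transfers_per_clock` parameter are illustrative assumptions, and the 100 MHz EMA_CLK used for the asynchronous case is inferred from the 33.33 MHz figure with Setup + Strobe + Hold = 3 cycles.

```python
def max_bandwidth_mbytes_per_s(clock_mhz, bus_width_bits, transfers_per_clock=1):
    """Theoretical peak bandwidth in MBytes/sec.

    clock_mhz * transfers_per_clock gives millions of transfers per second;
    bus_width_bits / 8 gives the number of bytes moved per transfer.
    """
    return clock_mhz * transfers_per_clock * bus_width_bits / 8


DEVICE_MHZ = 300.0    # operating point used by the table
EMA_CLK_MHZ = 100.0   # max EMA_CLK from the table
SETUP, STROBE, HOLD = 1, 1, 1  # cycles, as assumed in the table notes

bandwidths = {
    "c674x L2/L1D (SDMA port)": max_bandwidth_mbytes_per_s(DEVICE_MHZ / 2, 64),
    "Shared RAM": max_bandwidth_mbytes_per_s(DEVICE_MHZ / 2, 64),
    "EMIFA (SDRAM)": max_bandwidth_mbytes_per_s(EMA_CLK_MHZ, 16),
    "EMIFA async": max_bandwidth_mbytes_per_s(EMA_CLK_MHZ / (SETUP + STROBE + HOLD), 16),
    "EMIFB (SDRAM)": max_bandwidth_mbytes_per_s(133.0, 32),
    "mDDR/DDR2": max_bandwidth_mbytes_per_s(150.0, 16, transfers_per_clock=2),
}

for memory, mbytes in bandwidths.items():
    print(f"{memory:26s} {mbytes:8.2f} MBytes/sec")
```

As the note above explains, these are upper bounds only; achieved throughput in a real system will be lower.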