PCI Throughput on C64x+ DSPs
From Texas Instruments Embedded Processors Wiki
Overview
This Wiki article provides PCI throughput information under ideal conditions. The information in this Wiki article is not indicative of the performance one can expect from PCI drivers or specific PCI systems. In particular, performance achievable in typical desktop PCs will be significantly less than what is possible in ideal conditions. This article applies to the following DSPs: C6455, DM648/7, DM643x, C6424/21, and DM6467.
Many factors affect the throughput performance of a PCI device. These factors include, but are not limited to, PCI bus speed, DSP speed, transfer source/destination, EMIF/DDR speed, and PCI burst length. Other factors, such as on-chip activity and traffic on the PCI bus, also affect PCI performance.
The intent of this wiki article is not to describe in detail the effect of all these variables on the performance of the C64x+ PCI. Instead, this wiki article presents theoretical performance numbers under ideal conditions. With this information the reader can get an idea of the maximum theoretical performance to expect from the C64x+ PCI.
Master Performance
The PCI included on C64x+ devices can act as a PCI bus master. The PCI supports all memory read/write commands. To initiate a PCI bus master memory read/write, an on-chip master such as the EDMA or the DSP must initiate a transfer to PCI memory space (see the memory map information in your device data sheet). The PCI will in turn initiate an access on the PCI bus.
PCI devices access each other, system memory, and external memory through burst transfers. A burst transfer is characterized by having an initialization or address phase followed by two or more data phases. The number of data phases is determined by the initiator of the burst transfer. However, the target of the transfer and the system arbiter are allowed to terminate the transfer.
As a master, the number of data phases in a burst transfer initiated by the C64x+ PCI is limited by the size of the command issued by the on-chip master (the EDMA or the DSP). When the EDMA is used to move data to/from the PCI (as is the case in most systems), the size of this command is up to 64 bytes (32 bytes for DM6467), or the equivalent of 16 PCI data phases (8 for DM6467).
For example, a 2Kbyte transfer configured in the EDMA will be broken down into 32 64-byte commands. The C64x+ PCI will issue a burst transfer for each of these commands. The overhead associated with starting and stopping multiple burst transfers reduces the performance of the PCI.
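The split described above can be sketched in Python. This is a minimal illustration, not TI code: the function name `split_into_commands` and its parameters are hypothetical, the command sizes are the 64-byte and 32-byte figures quoted above, and the handling of a remainder as one shorter final command is an assumption.

```python
def split_into_commands(transfer_bytes, command_bytes=64):
    """Return the size of each command (one PCI burst each) for a transfer.

    command_bytes is the maximum command size quoted in the text:
    64 bytes on most devices, 32 bytes on DM6467.
    """
    if transfer_bytes < 0 or command_bytes <= 0:
        raise ValueError("transfer size must be >= 0 and command size > 0")
    full, remainder = divmod(transfer_bytes, command_bytes)
    sizes = [command_bytes] * full
    if remainder:
        sizes.append(remainder)  # assumption: leftover goes out as a shorter command
    return sizes


commands = split_into_commands(2048)
print(len(commands))       # number of bursts for a 2 Kbyte transfer: 32
print(commands[0] // 4)    # 32-bit data phases in a full 64-byte burst: 16
```

Passing `command_bytes=32` models the DM6467 case, where the same 2 Kbyte transfer needs twice as many bursts.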
The table below shows the maximum throughput that can be achieved using the C64x+ PCI in an ideal system. Note that an ideal system assumes no additional delays are introduced. In a real system, the performance of the PCI will be lower than this.
Maximum C64x+ PCI Master Throughput in Ideal System
| PCI Clock Speed (MHz) | Write Performance (Mbytes/sec) | Read Performance (Mbytes/sec) |
| 33 | 98 | 77* |
| 66** | 195 | 153 |
* Exception is DM6467 which has a best case performance of 57 Mbytes/sec (~43% bus utilization). The DM6467 has lower read performance because of an incorrect bridge setting in its architecture.
** 66MHz is only supported on C6455 and DM648.
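The table figures can be related to bus utilization with a short calculation. The sketch below assumes a 32-bit (4-byte) PCI data bus and nominal 33 MHz and 66 MHz clocks, so raw bus bandwidth is taken as clock × 4 bytes; the function name is illustrative only. Under these assumptions, the DM6467 read figure of 57 Mbytes/sec works out to roughly 43%, consistent with the footnote.

```python
PCI_BUS_BYTES = 4  # assumption: 32-bit PCI data bus


def bus_utilization(throughput_mbytes_per_s, pci_clock_mhz):
    """Fraction of raw PCI bandwidth used at the given throughput."""
    raw_bandwidth = pci_clock_mhz * PCI_BUS_BYTES  # Mbytes/sec
    return throughput_mbytes_per_s / raw_bandwidth


# Figures taken from the table and its footnote.
table = [
    ("33 MHz write", 98, 33),
    ("33 MHz read", 77, 33),
    ("33 MHz read (DM6467)", 57, 33),
    ("66 MHz write", 195, 66),
    ("66 MHz read", 153, 66),
]
for label, mbytes, mhz in table:
    print(f"{label}: {bus_utilization(mbytes, mhz):.0%}")
```

With these assumptions, writes use about 74% of the raw bus bandwidth and reads about 58% at both clock speeds.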
EDMA & PCI Configuration for Maximum Performance
It is important to note that the EDMA command size is equivalent to the maximum EDMA default burst size (DBS). To ensure the EDMA uses the maximum DBS, follow these rules:
- EDMA optimization rules must be taken into account when defining a PCI master transfer, otherwise the EDMA will not use the maximum DBS and performance will suffer. EDMA optimization rules are defined in the EDMA User’s Guide for the respective device. Recommended setup: A-sync transfer, ACNT = 1 to 4096, BCNT = 1, SRCIDX = DSTIDX = ACNT.
- Most devices have a configurable DBS for each transfer controller (TC). The DBS of the TC being used for PCI data movement should be set to the maximum of 64 bytes whenever possible. For devices with a fixed DBS, use the TC with the largest DBS setting for PCI data movement. Note that, due to an incorrect bridge setting on DM6467, the PCI will burst a maximum of 32 bytes even if the DBS is set to 64 bytes.
- Expect similar bus utilization for accesses to L2, L1D, and DDR (with no other traffic).
- Most devices have configurable system priorities. Tune these priorities carefully to ensure the critical transfers in the system always have priority.
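The recommended setup in the first rule can be captured as a simple check. This is only a sketch: the function and its argument names mirror the EDMA parameter names used above (ACNT, BCNT, SRCIDX, DSTIDX) but are not part of any TI driver API, and the check covers only the recommendation stated here, not the full optimization rules in the EDMA User's Guide.

```python
def follows_recommended_setup(acnt, bcnt, srcidx, dstidx, a_sync=True):
    """Check a transfer against the recommended PCI master setup:
    A-sync transfer, ACNT = 1 to 4096, BCNT = 1, SRCIDX = DSTIDX = ACNT.
    """
    return bool(
        a_sync
        and 1 <= acnt <= 4096
        and bcnt == 1
        and srcidx == acnt
        and dstidx == acnt
    )


# A single 2 Kbyte A-sync transfer matches the recommendation.
print(follows_recommended_setup(acnt=2048, bcnt=1, srcidx=2048, dstidx=2048))
# The same 2 Kbytes expressed as 32 arrays of 64 bytes does not.
print(follows_recommended_setup(acnt=64, bcnt=32, srcidx=64, dstidx=64))
```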
Slave Performance
The PCI can also act as a PCI bus slave. PCI bus masters can access DSP L2, L1, and external memory using memory read/write commands. DSP memory-mapped registers can also be accessed. The PCI supports six slave windows, which can be configured to point to different ranges within the DSP memory map. To initiate a memory read/write, an off-chip PCI master must initiate a transfer to the PCI. The PCI will in turn initiate an access within the DSP.
As a PCI slave, the C64x+ PCI places no limit on burst length: external PCI masters can burst an unbounded amount of data to or from it. However, on reads the C64x+ PCI takes an initial 12 PCI cycles before it starts driving data in a burst transfer. Using large burst transfers minimizes the impact of this initial read latency.
The initial 12-cycle read latency matters when two C64x+ DSPs are connected together via PCI. Recall that as a master the PCI issues multiple burst transfers; each of these transfers incurs the 12-cycle read latency. In this case, PCI reads yield a best-case performance of 53 Mbytes/sec for a 33MHz bus and 106 Mbytes/sec for a 66MHz bus (assuming the master device follows the optimization rules described above). On a DM6467-to-DM6467 system, read performance of 36 Mbytes/sec and 71 Mbytes/sec, respectively, can be expected. The factors mentioned at the beginning of this article will further reduce performance.
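To illustrate why longer bursts dilute the initial read latency, the sketch below uses a deliberately simplified model: each burst costs a fixed number of latency cycles plus one PCI cycle per 32-bit data phase. The model and the function name are assumptions made for illustration. It ignores arbitration, address phases, turnaround cycles, and on-chip delays, so it does not reproduce the measured figures quoted in this article.

```python
def read_efficiency(burst_bytes, latency_cycles=12, bytes_per_phase=4):
    """Fraction of PCI cycles that carry data in one read burst,
    under the simplified model: latency cycles + one cycle per data phase.
    """
    data_phases = -(-burst_bytes // bytes_per_phase)  # ceiling division
    return data_phases / (data_phases + latency_cycles)


for size in (64, 256, 1024, 4096):
    print(f"{size:5d}-byte burst: {read_efficiency(size):.1%}")
```

In this model a 64-byte burst spends only 16 of its 28 cycles moving data (about 57%), while the fixed 12-cycle cost becomes a small fraction of a multi-kilobyte burst.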
PCI Configuration for Maximum Performance
To get the best performance out of the PCI slave, follow these guidelines:
- Avoid transferring small blocks of data. Bus utilization is low for small transfers due to transfer overhead; it is better for large transfers (up to 93% bus utilization for 4096 bytes).
- For reads, use the PCI memory read multiple command (MRMCMD) whenever possible to do bulk reads from the DSP; this command has the best throughput. The PCI command used is selected by the master of the transaction, not the TI DSP; consult the data sheet of the master device for information on command selection.
- Cache line size has a big impact when using the PCI memory read line command (MRLCMD). Use a cache line setting of 128 bytes whenever possible (see the cache line size register in the PCI user guide). The PCI command used is selected by the master of the transaction, not the TI DSP; consult the data sheet of the master device for information on command selection.
Impact of Read Latency on PCI Performance
The performance of the PCI in an actual system depends on many variables, but initiator and target latency are particularly significant for the C64x+ DSP PCI.
Initiator and target latency is the amount of time from when the master starts the transaction to when the target is ready to transfer the first data item. According to the PCI specification, the target is limited to 16 PCI clock cycles to complete the first data transfer. Similarly, the master is limited to a maximum of 8 PCI clock cycles. If, for any reason, a device cannot meet these requirements, the target must issue a retry. The retry terminates the transaction prematurely, thereby freeing the PCI bus for use by other devices. After a minimum of two clock cycles, the initiator may reattempt to transfer data.
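The latency limits above can be expressed as a small check. This sketch encodes only the two limits stated in this section (16 PCI clock cycles for the target's first data transfer, 8 for the master); the constant and function names are hypothetical and do not correspond to any PCI or TI API.

```python
TARGET_INITIAL_LATENCY_LIMIT = 16  # PCI clock cycles, per the text above
MASTER_LATENCY_LIMIT = 8           # PCI clock cycles, per the text above


def target_must_retry(cycles_to_first_data):
    """True if the target cannot complete the first data transfer in time."""
    return cycles_to_first_data > TARGET_INITIAL_LATENCY_LIMIT


def master_within_limit(cycles):
    """True if the master meets its latency limit."""
    return cycles <= MASTER_LATENCY_LIMIT


# The 12-cycle initial read latency of the C64x+ PCI slave is within the limit.
print(target_must_retry(12))
# A read that needs 20 cycles to return its first data item forces a retry.
print(target_must_retry(20))
```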
For write transfers, data is ready immediately; therefore, the latency is one PCI cycle (assuming the target is ready to receive the data). For read transfers, the latency can be several cycles long, especially if the transfer must propagate through a series of bridges before reaching the target.
It has been observed and reported that the read performance of the PCI operated in master mode in a PC-based system is very low. This is due to a combination of high read latency and the fact that the PCI master can burst only up to 64 bytes. Note that each 64-byte burst incurs the read latency penalty.
Whenever possible, it is recommended that the PCI on C64x+ DSPs be used as a slave. This assumes, of course, that the other master has better performance than the DSP PCI.
