Demystifying Ethernet Performance
Ethernet is the most popular high-speed networking technology, and the Ethernet interface on TI devices is expected to deliver line rate at all times. In practice, the Ethernet throughput numbers on fully loaded application devices often do not scale to the desired level. This write-up attempts to put the throughput numbers of the Ethernet interface in perspective by demystifying the interworking of the Ethernet hardware (controller) and the software driver. It further enumerates and emphasizes the need for up-front planning in the selection of CPU horsepower and operating system to support the targeted applications, functionalities and network traffic on a device.
This write-up kick-starts the discussion by presenting the "behind the scenes" theoretical numbers for a 1 Gbps pipe. These numbers are then used to analyze and demystify the realistic Ethernet performance numbers that are seen on TI devices.
The number game
The table below indicates the theoretical number of Ethernet frames that would be transacted on a 1000 Mbps or 1 Gbps Ethernet pipe for various frame sizes.
| Frame size (bytes) | One way / half duplex (frames per second) | Two way / full duplex (frames per second) |
The numbers in Table 1 are arrived at by adding the 7 bytes of preamble, 1 byte of start frame delimiter (SFD) and 12 bytes of inter-frame gap (see figure 1) to the respective frame sizes. An attentive reader will quickly notice that even though the bandwidth of the data transaction does not change across the rows, the number of frames that a hardware controller is required to handle falls drastically from a high of ~1488K frames per second to a mere ~81K frames per second. So it is intuitive that a system handling a constant bandwidth of 1000 Mbps has to perform far fewer iterations for large frames than for small frames. Fewer iterations imply fewer interrupts and context switches, which in turn means that the CPU has more time to pursue meaningful functionality (refer to the section on interrupt thrashing for details). As a result, a system will show higher performance numbers for large frames. Thus, it is very important to ascertain the frame or packet size of the traffic when considering any network throughput numbers.
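The arithmetic behind such a table can be sketched in a few lines of Python. This is only an illustration: the list of frame sizes is an assumed set of common Ethernet sizes rather than the rows of the original table, and the two-way figure is assumed to be simply twice the one-way figure.

```python
# Per-frame overhead on the wire, in bytes (values taken from the text above).
PREAMBLE = 7
SFD = 1
INTER_FRAME_GAP = 12
WIRE_OVERHEAD = PREAMBLE + SFD + INTER_FRAME_GAP  # 20 bytes in total

LINE_RATE_BPS = 1_000_000_000  # 1 Gbps


def frames_per_second(frame_size_bytes, line_rate_bps=LINE_RATE_BPS):
    """Theoretical one-way frame rate for a given Ethernet frame size."""
    bits_on_wire = (frame_size_bytes + WIRE_OVERHEAD) * 8
    return line_rate_bps / bits_on_wire


if __name__ == "__main__":
    # Assumed, illustrative frame sizes (not the rows of the original table).
    for size in (64, 128, 256, 512, 1024, 1518):
        one_way = frames_per_second(size)
        # Assumption: full duplex carries the same rate in each direction.
        print(f"{size:>5} bytes: {one_way:>10.0f} fps one way, "
              f"{2 * one_way:>10.0f} fps two way")
```

For 64-byte frames this gives about 1488K frames per second, and for 1518-byte frames about 81K, in line with the range quoted above.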
The Operating System factor
It can be noted from Table 1 that for an Ethernet frame of size 1518 bytes, there can be a maximum of 81275 frames per second on a 1000 Mbps (or 1 Gbps), half-duplex (one way) Ethernet pipe. In other words, the Ethernet network interface is subjected to a frame transaction every 12.3 us. So, in order to sustain the line rate of 1 Gbps, the software on the host CPU must finish handling and processing a frame within 12.3 us and then be available to handle the next arriving frame. Such time-bound execution ensures that the software works in tandem with the Ethernet hardware. If, for any reason, the software is not able to handle and process a frame or packet within the stipulated 12.3 us, it will not be able to strike the much needed equilibrium with the Ethernet hardware to sustain the line rate. The mismatch between software and hardware will create either a backlog (on RX) or starvation (on TX) in the hardware. A sustained backlog will eventually force the Ethernet hardware to overrun and tail-drop frames from the wire, whereas starvation will instantly cause the hardware to under-run and leave the wire underutilized.
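The RX backlog case can be illustrated with a toy model. The sketch below is not a model of any particular TI Ethernet controller: the function name `simulate_rx`, the fixed per-frame service time and the ring size of 64 are all assumptions chosen for illustration.

```python
from collections import deque


def simulate_rx(num_frames, arrival_us, service_us, ring_size):
    """Return (accepted, dropped) for a fixed-rate RX stream.

    arrival_us : time between frames on the wire
    service_us : time the software needs per frame (assumed constant)
    ring_size  : frames the hardware can hold while the software is busy
    """
    in_flight = deque()  # completion times of frames still held in the ring
    cpu_free_at = 0.0
    accepted = dropped = 0
    for i in range(num_frames):
        now = i * arrival_us
        # Release ring entries whose processing has already finished.
        while in_flight and in_flight[0] <= now:
            in_flight.popleft()
        if len(in_flight) >= ring_size:
            dropped += 1  # ring full: the frame is tail-dropped
            continue
        start = max(now, cpu_free_at)
        cpu_free_at = start + service_us
        in_flight.append(cpu_free_at)
        accepted += 1
    return accepted, dropped


if __name__ == "__main__":
    print(simulate_rx(10_000, 12.3, 10.0, 64))  # software keeps up
    print(simulate_rx(10_000, 12.3, 20.0, 64))  # software falls behind
```

With a service time below the 12.3 us frame interval nothing is dropped. Once the service time exceeds the interval, the ring fills and frames start to be dropped, however large the ring is made.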
Quantitatively, a CPU running at 297 MHz and assumed to execute (effectively) one instruction per cycle will be able to perform roughly 3650 instructions (297 x 12.3) within the time period of 12.3 us. Whether roughly 3650 instructions are adequate to handle and process a network packet is purely a matter of qualitative assessment and depends on the type and function of the software that runs on the CPU.
If a generic operating system like Linux is used, then this budget of roughly 3650 instructions, or in other words the time period of 12.3 us, can easily be exhausted not only by packet handling and processing but also by context switching, user space / kernel space transitions, data copies (load/store) and interrupt thrashing. And there would definitely be still more work that the CPU has to do before it can come back to the Ethernet hardware to pick up or provide the next frame. On the other hand, if a tailored, customized and relatively lightweight operating system like TI BIOS is used, the same budget may yield improved numbers for handling and processing network frames or packets. Better still, a microkernel or similar tight-loop software, with no Internet Protocol stack, whose sole function is to handle network packets with very minimal custom processing logic, might be able to sustain the line rate with a mere 3650 or so instructions. In all probability, such software, or the part of it that handles and processes the network packets, would eventually be locked into a sufficiently sized instruction cache of the CPU. So, in the absence of any cache thrashing, interrupt thrashing, context switching and data copying, the CPU will work optimally, perhaps well enough to hold onto the line rate of 1000 Mbps.
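The per-frame budget used in this section can be reproduced as a back-of-the-envelope calculation. The function names below are illustrative, and the one-instruction-per-cycle figure is the same simplifying assumption made in the text, not a property of any specific CPU.

```python
def frame_interval_us(frames_per_second):
    """Time between back-to-back frames, in microseconds."""
    return 1_000_000 / frames_per_second


def instruction_budget(cpu_mhz, interval_us, instructions_per_cycle=1.0):
    """Instructions the CPU can execute in one frame interval.

    MHz multiplied by microseconds gives cycles directly.
    """
    return cpu_mhz * interval_us * instructions_per_cycle


if __name__ == "__main__":
    interval = frame_interval_us(81275)  # about 12.3 us
    print(f"interval: {interval:.1f} us")
    print(f"budget at 297 MHz: {instruction_budget(297, 12.3):.0f} instructions")
```

With the rounded 12.3 us interval this comes to roughly 3650 instructions, the same order as the budget discussed above. Passing a lower `instructions_per_cycle` is one way to account for stalls and cache misses.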
The qualitative discussion in the previous paragraph establishes that the nature, the architecture and the overall functionality of the software on the host CPU directly impact the feasibility of the needed equilibrium between software and hardware. So, expecting a general purpose operating system to match the throughput delivered by a dedicated microkernel is not a practical proposition. These are two different pieces of software, so they will behave differently and, importantly, will crunch network data at different speeds.
The discussion in this section makes it clear that it is imperative to understand the flavor, type and functionality of the software being executed before the throughput numbers displayed by different software can be set up for comparison.
The Headroom factor
The software on the host CPU is responsible for retrieving the Ethernet frame from the network interface, moving the network packet embedded in the Ethernet frame into the operating system (kernel and network stack) for processing and eventually copying the data to user space (assuming the packet is for local consumption and not for forwarding through routing / bridging). For a packet going out onto the network, the CPU copies the data from user space, processes the network packet and moves the Ethernet frame encapsulating the packet to the network interface for transmission. A demand for higher network performance means that the CPU is required to do more work in the same time. The CPU may be able to handle additional traffic provided it has spare cycles. Generally, though, the CPU begins to choke before the throughput gets close to the desired value, and the CPU utilization or CPU load hovers dangerously close to 100%. The CPU load is a good measure of the time that a CPU must spend to support a given throughput (for a given packet size) and comes in very handy for budgeting in the system design.
For a short quantitative analysis in this section, the CPU is set to operate at a much higher frequency, 1200 MHz or 1.2 GHz, which is slightly more than four times the frequency of the CPU considered in the previous section. Setting the CPU to operate at 1.2 GHz dramatically changes the equation at run time. Now there are 14760 (= 1200 x 12.3) instructions available to the CPU before it has to return to handle the next Ethernet frame on a 1 Gbps pipe. To a first approximation, the throughput supported by the 1.2 GHz processor will be about four times that provided by the CPU operating at 297 MHz.
So, the discussion in the previous paragraph seems to indicate that if a higher network throughput is to be supported, it is good to have a CPU with higher speed. Quite intuitive and reasonable! But wait a minute: "Is network throughput the only feature or functionality of a consumer electronics device in a digital home?" The answer is mostly "no". There are certain devices inside the digital home that are characterized solely by their network throughput numbers. Examples of such devices are routers and bridges, which are required to forward packets from one network interface to another and fall in a different class (often called network elements). Network elements invariably have two or more network interfaces, whereas consumer electronics (CE) devices typically have one network interface and packet forwarding is mostly ruled out. Importantly, the functionalities of a consumer electronics device are well defined, and support for a network interface happens to be just one item in the extended feature list of that device.
CE devices are characterized by their distinct applications. Such applications are critical to the functioning of the box (CE product) and cannot be easily dispensed with. In fact, applications are the real "differentiators" and many vendors want to pack in a plethora of applications to stand out from the crowd. This is all understood, but how are applications on CE devices related to this discussion? Well, they are closely related: applications need the CPU for execution, and so does the network traffic on the device.
If the CPU cycles on a CE device are just good enough to sustain the desired network throughput, it essentially means that there are no spare CPU cycles, also called headroom, to run mission critical applications. A device with no headroom is unlikely to be accepted in the market. So, the system designer not only has to plan the CPU frequency to support the network throughput but also needs to create ample headroom for the "kick-ass" applications. Similarly, for a given CPU frequency, the desired network throughput has to be weighed, in terms of the CPU load needed to sustain the traffic, against the headroom required to run the "differentiating" applications.
Interrupt thrashing
In a traditional or conventional software Ethernet driver implementation, the Ethernet HW interrupts the CPU for every single frame transacted on the wire. To represent this quantitatively, Table 1 indicates that to sustain a throughput of 1 Gbps with large frames of 1518 bytes, the CPU will receive a whopping ~81K interrupts per second. It also implies that the system has to invoke interrupt service routines (ISRs) ~81K times in just one second. Intuitively, if the CPU executes ~81K iterations of the ISR per second to handle the Ethernet frames, it will not have enough time left for the other required processing of the network packets and for the critical applications. This condition, in which sustained interrupts prevent a CPU from making any meaningful progress, is called interrupt thrashing: the CPU has to relinquish the current execution path more often than desired to handle ISRs and in effect is unable to make progress on the desired functionalities.
Yet another aspect that needs to be considered while working with an ISR is the associated overhead. A jump to an ISR is coupled with context switches and cache (re-)fills. Context switches in an operating system like Linux can be costly. Furthermore, cache misses or re-fills can delay the execution path. Overall, the system can become pretty slow if the CPU has to incur this overhead for each of the ~81K interrupts in a second.
So, for a system to be responsive and to perform well, it needs to experience a lower rate of interrupts, not only to avoid interrupt thrashing but also to gain from the reduced overhead associated with context switching and cache re-fills. In short, the interrupts from sources like the Ethernet hardware must be "paced". Interrupt "pacing" works on the principle of accumulating interrupts before delivering them to the CPU. The accumulation can be based on individual or combined factors such as interrupt rate, time elapsed and number of packets transacted. Once the "paced" interrupt is delivered to the CPU, it is highly recommended that a single iteration of the ISR handles a predefined number of frames from the HW. This mechanism reduces the overall number of context switches and also benefits the system through cache locality of the ISR code across multiple Ethernet frames. TI HW implements the interrupt "pacing" mechanism. Linux also supports a SW interrupt pacing mechanism called NAPI, which is loosely based on ISR scheduling, the assigned quota of packets and the number of packets handled from hardware. Both HW interrupt pacing and NAPI yield considerable performance improvements. Any modern network driver should support NAPI.
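The effect of pacing on the interrupt rate can be sketched with a toy model. This is neither the TI hardware pacing logic nor the Linux NAPI API: the function `count_interrupts` and its parameters `max_frames` and `max_delay_us` are hypothetical names. They stand for a policy that raises one interrupt per accumulated batch, bounded by a frame count and a time limit.

```python
def count_interrupts(arrivals_us, max_frames=1, max_delay_us=0.0):
    """Count interrupts raised for a sorted list of frame arrival times.

    An interrupt fires when max_frames frames have accumulated, or when
    max_delay_us has elapsed since the first pending frame, whichever
    comes first. The defaults model one interrupt per frame.
    """
    interrupts = 0
    pending = 0
    first_pending_at = None
    for t in arrivals_us:
        # Flush a batch whose timer expired before this frame arrived.
        if pending and t - first_pending_at >= max_delay_us:
            interrupts += 1
            pending = 0
        if pending == 0:
            first_pending_at = t
        pending += 1
        if pending >= max_frames:
            interrupts += 1
            pending = 0
    if pending:
        interrupts += 1  # the timer eventually flushes the tail
    return interrupts


if __name__ == "__main__":
    # One second of back-to-back 1518-byte frames, one every 12.3 us.
    arrivals = [i * 12.3 for i in range(81274)]
    print("unpaced:", count_interrupts(arrivals))
    print("paced  :", count_interrupts(arrivals, max_frames=16,
                                       max_delay_us=1000.0))
```

In this model, batching 16 frames per interrupt cuts the interrupt rate by roughly a factor of 16, at the cost of extra latency for the first frame in each batch. The time limit bounds that latency when traffic is light.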
Performance improvement options
Discussions in the previous sections have established that the network throughput of a terminal CE device depends not only on the Ethernet HW but also on other factors like host CPU frequency, flavor of operating system and interrupt thrashing. This section looks at the possible options to improve the throughput numbers. It is strongly recommended that a network driver take advantage of the framework provided by the OS to delay and "soft" handle the interrupts. Furthermore, the choice of operating system should be commensurate with the desired throughput numbers and the functionality that the overall software needs to support. Beyond these fundamentals, the following is a list of possible options for performance improvement.
- Tweak and modify the Linux drivers / kernel, making them stray from the established guidelines of the Linux community. This is definitely not a foolproof or one-time solution. Importantly, in these times of open source SW, this approach is not going to be appreciated by customers or by the Linux community. Furthermore, it would result in maintenance nightmares for the engineering teams: tweaking would become the norm for every release to offset any imbalance that has crept in because of changes elsewhere in the system. But this should not stop engineers from experimenting and pursuing a generic and efficient solution that can benefit the open source community at large.
- HW enhancements: there are enhancements that are possible in the DMA part of the Ethernet hardware; DMA engines copy data between SDRAM and the Ethernet MAC. There is also further scope to off-load some of the packetization aspects from software into the Ethernet hardware. These enhancements would definitely free up some CPU cycles.
- Increase the speed of the (host) CPU: even though the benefits of this approach are far reaching and definitive, this method is always going to be a debatable one. If a CPU with higher speed is available, OEMs are invariably going to attempt to pack in new applications and steal the CPU cycles that were planned and added to handle higher network throughput. As a discipline, the system architects must work with customers / vendors to budget the CPU load for network performance and applications.
The analysis in this write-up shows that the Ethernet throughput numbers on a system depend not only on the Ethernet hardware / software but also on the overall composition and functionality of the system. As each system is unique, there is no "one size fits all" solution. Further, the challenges of sustaining network throughput are acknowledged across the industry. A widely accepted industry rule of thumb states that 1 Hz of CPU power is required to handle and process 1 bit of network data in 1 second, i.e. 1 bit per second per Hz, for large frame sizes. That pretty much sets the tone for working out the CPU budget to support a desired network throughput.
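The rule of thumb lends itself to a first-cut budgeting calculation. The sketch below is only an estimate under that rule; the function names and the `hz_per_bps` parameter are illustrative, and real numbers depend on the operating system, driver and frame sizes discussed earlier.

```python
def cpu_mhz_for_throughput(throughput_mbps, hz_per_bps=1.0):
    """CPU frequency (MHz) consumed by networking under the rule of thumb."""
    return throughput_mbps * hz_per_bps


def headroom_percent(cpu_mhz, throughput_mbps, hz_per_bps=1.0):
    """Share of the CPU left for applications, as a percentage.

    Returns 0.0 when the CPU cannot sustain the requested throughput.
    """
    needed = cpu_mhz_for_throughput(throughput_mbps, hz_per_bps)
    return max(0.0, 100.0 * (1.0 - needed / cpu_mhz))


if __name__ == "__main__":
    for mhz in (297, 1200):
        for mbps in (100, 1000):
            print(f"{mhz:>5} MHz CPU, {mbps:>5} Mbps: "
                  f"{headroom_percent(mhz, mbps):5.1f}% headroom")
```

Under this rule a 297 MHz CPU has no headroom left at 1 Gbps, while a 1.2 GHz CPU keeps about a sixth of its cycles for applications.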
So, the key takeaway from this analysis is to identify the data flows for the use cases in the system that demand high and guaranteed network bandwidth. Once the data path within the system is established, the CPU utilization must be estimated and budgeted for network throughput, for the core and other functionalities of the CE device and for planned headroom for future expansion. These budgets will help in choosing one or more of the aforesaid options, or perhaps something altogether new.
-- Suraj Iyer, 03 Apr 2009