- 1 Overview
- 2 NFV Voice and Video Transcoding
- 3 Underlying Technology
- 4 Capacity Figures
- 5 TI CPUs vs. GPUs and x86 "Software Only"
- 6 Examples / Demos
- 7 Software Model
- 8 Virtualization
- 9 Embedded System Compatibility
Overview
This wiki is part of a "cloud HPC" series showing how to use c66x coCPU™ cards in commodity servers to achieve real-time, high capacity processing and analytics of multiple concurrent streams of media, signals and other data.
The focus of this wiki is virtualized high capacity voice transcoding and video transcoding for telecom applications. Other wikis in the cloud HPC series include:
- Cloud HPC Overview
- Transcoding Test/Demo and Performance Measurement
- High Performance VMs
- Computer Vision (OpenCV)
- Heterogeneous Programming
- DirectCore® coCPU Interface
The Cloud HPC Overview wiki has information about specific tested servers.
NFV Voice and Video Transcoding
NFV (Network Functions Virtualization) definitions and use cases now include high capacity media transcoding with coCPU cards. For example, on page 22 of this NFV Acceleration Technologies report (etsi.org), "Accelerator Hardware" is shown as a "Hardware Resource", at the same level as Computing Hardware, Storage Hardware, and Network Hardware. An excerpt from the report's NFV Reference Architecture diagram looks like this:
While server-based NFV standards continue to be refined, TI-based transcoding technology has also advanced. Using software stacks built up by the third-party ecosystem around c66x coCPUs, it's now possible to combine TI and Intel cores in a server, allowing each to do what it does best. The end result is an elegant heterogeneous core transcoding solution, combining 10s of x86 cores and 100s of c66x cores together within an off-the-shelf server and Linux + KVM framework, providing real-time, low-latency media transcoding on multiple concurrent sessions -- and at the same time, making it a mainstream, easy to use solution.
Underlying Technology
Following is a list of TI and third-party items required:
- c66x coCPUs and build tools, TI
- 32-core or 64-core c66x coCPU cards, Advantech
- DirectCore host and guest drivers and libraries, Signalogic
- mediaTest demo program, Signalogic
c66x coCPUs and Build Tools
Yes you read that right -- CPU, not DSP. TI marketing continues to label c66x devices as "DSPs", and that term continues to be widely used in the telecom community, where TI devices have a long history, including TI's acquisition of Telogy Networks in 1999. But after some 30 years of advanced chip development by TI, the term DSP is no longer an accurate label. The c66x architecture is in fact a CPU architecture, similar in many ways to Intel x86, including external memory, internal memory subsystem (L1P, L1D, L2 cache, multicore shared memory), embedded PCIe and high-speed NIC peripherals, and inter-CPU communication. In addition, from its DSP heritage, the c66x architecture retains compute-oriented advantages, including VLIW, software pipelined loops, multiple SIMD operations per clock cycle, specialized signal processing intrinsics, and extensive DMA capabilities.
Note that Code Composer Studio software and detailed knowledge of low-level TI chip details are not required. Application demo software described below uses TI command line tools and standard makefiles. TI build tools are available online.
The Advantech coCPU cards supply the server horsepower. Each card has 64 cores, takes up a single PCIe slot (unlike GPU boards that take 2 slots), has two 1 GbE NICs, and draws about 120W. Up to 256 cores can be installed in a standard 1U server, and twice that many in suitable 1U or 2U servers. This is a lot of CPU cores, and aligns perfectly with emerging server architecture trends in DPDK and virtualization, and multicore programming models such as OpenMP and OpenACC.
Below are images showing c66x coCPU cards installed in Dell and HP servers. Unlike GPU boards, the cards are single-slot thickness, allowing full riser utilization.
The images show:
- Dell R720 server with 16 x86 cores and two (2) c66x coCPU cards installed, or a total of 128 c66x cores (two (2) Xeon E5-2670 CPUs rated at 2.6 GHz, eight (8) C6678 CPUs rated at 1.25 GHz)
- HP DL380 G9 server with 16 x86 cores and two (2) c66x coCPU cards installed, or a total of 128 c66x cores (two (2) Xeon E5-2680v3 CPUs rated at 2.5 GHz, eight (8) C6678 CPUs rated at 1.25 GHz)
The above images show full-length PCIe cards; half-length cards are also available.
Host and Guest Drivers
DirectCore drivers interact with coCPU cards from either host instances or VMs. Host instances use a "physical" driver and VM instances use virtIO "front end" drivers. In the case of VMs, the KVM Hypervisor is supported. Below is an excerpt from the NFV Acceleration Technologies report cited above, showing virtIO drivers as the preferred method of interfacing to coCPU hardware:
Note: the DirectCore virtIO software stack is the first to provide Hypervisor support for TI CPUs in standard x86 Linux servers (and is the subject of another wiki).
The Virtualization section below describes how to configure VMs, including coCPU core and NIC allocation using the standard VMM (Virtual Machine Manager) user interface.
DirectCore libraries provide a high level API and session management mailbox interface for media transcoding applications. DirectCore libraries abstract all coCPU cores as a unified "pool" of cores, allowing multiple users / VM instances to share coCPU resources, including NICs on the coCPU cards. This applies regardless of the number of coCPU cards installed.
Application test and demo programs include command-line options and interactive keyboard commands for:
- transcoding session setup and tear-down
- codec unit test
- stats readout, including packet statistics, session statistics, and core usage
c66x codecs are optimized implementations of a wide range of telecom media codecs, as listed in the table in Capacity Figures, below.
Capacity Figures
A capacity table is given below for a partial list of voice and video codecs. Some notes about the table:
- Results are measured with Dell, HP, and Supermicro 1U and 2U servers, using 128 c66x cores (two 64-core cards with C6678 CPUs clocked at 1.4 GHz). For detailed information on server type and configuration, see the HPC Overview wiki
- All figures incorporate media session framework processing, including jitter buffer management (JBM), ptime handling, sampling rate conversion, DTMF inband and out-of-band, tone detection and generation, RTP packet and payload format parsing, RTCP, logging, and advanced memory management. Echo cancellation is an option (up to 128 msec)
- Voice codec figures assume a transcoding session with G711u or G711 wideband (8 or 16 kHz sampling rate), whichever will minimize processing due to sampling rate conversion, unless noted otherwise in the "Comments" field. Each session is bi-directional (i.e. "Capacity" field figures are bi-directional sessions). The following formula can be used to estimate capacity when transcoding between codec types:
max sessions = A/(A/B+1)
where A and B are capacity figures from the table (see notes below to derive this).
- Not all bitrates and options (DTX, echo cancellation, etc) are shown
- Video codec figures are measured at 720p 30 fps, 1.5 Mbps, unless noted otherwise in the "Comments" field. For H.264, high profile is used. Not all profiles, bitrates, and resolutions are shown
- All voice codecs are XDAIS compatible
|TI||Transcoding with AMR-NB|
|Opus||7168||85||TI||16 kHz sampling rate, 16 kbps|
|EVS||3584||95||Signalogic||8 kHz, 13.2 kbps, DTX disabled. Currently being optimized (EVS codec product page)|
|G711a||43520||61||Signalogic||Capacity limited by 4 GbE total NIC|
|H.264||32||90||TI, modifications Signalogic||HP, 720p, 30 fps, 1.5 Mbps|
|H.265||2-4||90||TI||Integration in work (Signalogic)|
The media session framework has been written and optimized for c66x by Signalogic, and contains no legacy Telogy code or other third-party code. In addition to the functions mentioned above, the media session framework also supports:
- Echo cancellation, VAD, DTMF, CNG, G711.1 (wideband), G711 Appendix I and II, and other voice algorithms (provided by TI)
- Conferencing (provided by Signalogic)
- Analytics functions, including speech recognition and image analytics (for example, OpenCV, as described here)
- OpenMP (when used with a subset of cores not running codecs or other media algorithms)
Deriving the transcoding equation:
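1) Let n be the number of sessions by which the lower capacity codec (B) is reduced. The freed capacity supports n(A/B) sessions of the higher capacity codec (A). Set this equal to the remaining number of B sessions:
n(A/B) = B-n
2) Solve for n:
n = B/(A/B +1)
3) Express max sessions as B-n:
B-n = B - B/(A/B +1) = A/(A/B +1)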
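As a quick check of the transcoding equation, here is a minimal Python sketch. The function name is ours, not part of any TI or Signalogic software, and the example assumes the second column of the Opus and EVS table rows (7168 and 3584) is the "Capacity" field.

```python
def max_transcoding_sessions(a: float, b: float) -> float:
    """Estimate sessions when transcoding between two codec types.

    a, b: "Capacity" figures for the two codecs, taken from the table.
    Implements max sessions = A/(A/B + 1), which equals A*B/(A + B).
    """
    if a <= 0 or b <= 0:
        raise ValueError("capacity figures must be positive")
    return a / (a / b + 1)


# Example: Opus (7168) to EVS (3584), using the table rows as-is
opus, evs = 7168, 3584
estimate = max_transcoding_sessions(opus, evs)

# Cross-check against the derivation: n = B/(A/B + 1), max sessions = B - n
n = evs / (opus / evs + 1)
print(int(estimate), int(evs - n))
```

Note the result is symmetric in A and B, so it does not matter which codec is labeled the higher capacity one when using the formula. The estimate is only as good as the table figures, which depend on bitrate and options as noted above.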
TI CPUs vs. GPUs and x86 "Software Only"
x86 "software only" and GPU vendors persist in one-sided comparisons with "DSP" solutions, for example as shown here. These comparisons tend to be based on one or more of the following assumptions: (i) TI devices are "hardware" solutions found only in embedded systems, not servers and clouds, (ii) the latest TI multicore devices are fundamentally not CPUs and can't run open source C/C++ code, and (iii) TI solutions are "hard to use", requiring special expertise. These assumptions, if valid in years past, do not apply when a software stack fully compatible with Linux, KVM, and DPDK is applied.
With the software stack in place, comparisons with x86 and/or GPU solutions can be made on a per-server basis. As noted, the table above gives figures for 128 cores in a single server; however, the solution is scalable beyond that. For example, in an average 1U server (e.g. 12 x86 cores, 800W, approx $2500 cost) up to 256 TI CPU cores can be installed, providing a combined performance, power efficiency, and per-session cost benefit of more than 10-to-1 vs. x86 and GPU solutions.
The technical reality is that a fully developed software stack allows the remarkable SWaP advantages of TI CPUs to be applied in commodity servers. There is no point in waiting years for servers with 100+ CPU cores when it can be done already using off-the-shelf, mainstream software and hardware components.
Examples / Demos
The mediaTest program handles multiple concurrent streams, applying per stream transcoding and RTP streaming (packet I/O using either motherboard or coCPU card NICs). Here is an example mediaTest command line:
./mediaTest -m0xff -f1400 -emedia_transcoding.out -cSIGC66XX-8 -s2 -imedia_files/stv_8c.INP -Cconfig.txt -omedia_files/stv_8c.wav
Given the above command, mediaTest sets up sessions as specified by the -C command line option (session configuration file), and performs network I/O and transcoding using Linux, DPDK, or coCPU cores. While mediaTest is running, interactive keyboard commands can be used for stats readout. Here is another example command line, showing destination IP address info being specified for coCPU cores:
./mediaTest -m0xff -f1400 -emedia_transcoding.out -cSIGC66XX-8 -s2 -i/home/Signalogic/media_files/stv_8c.INP -D192.168.1.61:45056:60-af-6d-75-75-f1 -Cconfig.txt -o/home/Signalogic/media_files/stv_8c.wav
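For scripting around mediaTest, it can help to validate the -D argument before launching. Below is a minimal Python sketch that splits the value from the example above. The field layout (IPv4 address, UDP port, MAC address, separated by colons) is our reading of the example, not a documented format, and the names Destination and parse_destination are hypothetical.

```python
import ipaddress
from typing import NamedTuple


class Destination(NamedTuple):
    ip: ipaddress.IPv4Address
    port: int
    mac: str


def parse_destination(arg: str) -> Destination:
    """Split an 'ip:port:mac' string as it appears after -D in the example."""
    # Assumed layout: exactly three colon-separated fields
    ip_text, port_text, mac_text = arg.split(":")
    port = int(port_text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    octets = mac_text.split("-")
    if len(octets) != 6 or any(len(o) != 2 for o in octets):
        raise ValueError(f"unexpected MAC format: {mac_text}")
    int("".join(octets), 16)  # raises ValueError if not hexadecimal
    return Destination(ipaddress.IPv4Address(ip_text), port, mac_text.lower())


dest = parse_destination("192.168.1.61:45056:60-af-6d-75-75-f1")
print(dest.ip, dest.port, dest.mac)
```

A malformed value raises ValueError, which a wrapper script can catch before starting any sessions.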
When coCPU card NICs are being used for network I/O instead of server NICs, then the following notes apply:
- coCPUs can run UDP/TCP, RTP, RTCP, ARP, and other protocol stacks
- For compressed video, RTP streaming is used
- Received packets are distributed to coCPU cores via UDP port filtering, in the case of c66x handled by the PA (packet accelerator) at wire-speed
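To illustrate the idea of UDP port filtering, here is a toy Python model of a port-to-core lookup. This is only a conceptual sketch: on c66x the filtering is done by the PA in hardware, and the round-robin assignment, port range, and function names below are assumptions made for illustration.

```python
from typing import Dict, Optional


def build_port_table(first_port: int, sessions: int, cores: int) -> Dict[int, int]:
    """Assign one UDP port per session, spreading sessions round-robin over cores."""
    return {first_port + s: s % cores for s in range(sessions)}


def core_for_packet(table: Dict[int, int], dst_port: int) -> Optional[int]:
    """Return the core index for a destination port, or None if no filter matches."""
    return table.get(dst_port)


# Hypothetical setup: 16 sessions on 8 cores, ports starting at 45056
table = build_port_table(first_port=45056, sessions=16, cores=8)
print(core_for_packet(table, 45056))  # first session
print(core_for_packet(table, 45065))  # tenth session
print(core_for_packet(table, 5060))   # port with no filter entry
```

The point of the model is that the lookup is a constant-time match on the destination port, which is why it can be done at wire speed without involving the cores that run the codecs.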
mediaTest can also run codecs in unit test mode, and save either intermediate compressed bitstream data, or decoded data as a .wav file, as shown in the following example:
./mediaTest -m0xff -f1400 -emedia_transcoding.out -cSIGC66XX-8 -s2 -i/home/Signalogic/media_files/stv_8c.INP -o/home/Signalogic/media_files/stv_8c.wav
A mediaTest demo is included in the Streaming Resource Functions limited SDK. The demo can be used with or without a coCPU card. Several more mediaTest example command lines are shown on the mediaTest Getting Started page.
Software Model
Below is a diagram showing the software model for the cloud HPC solution. Notes about this diagram:
- Application complexity increases from left to right (command line, open source library APIs, user code APIs, heterogeneous programming). mediaTest is at the third level
- All application types can run concurrently in host or VM instances (see below for VM configuration)
- coCPUs can make direct DMA access to host memory, facilitating use of DPDK
- coCPUs are connected directly to the network. Received packets are filtered by UDP port and distributed to coCPU cores at wire speed
The host memory DMA capability is also used to share data between coCPUs, for example in TI c66x applications such as H.265 (HEVC) encoding, where 10s of cores must work concurrently on the same data set.
Virtualization
The DirectCore drivers and libs are fully virtualized, supporting the KVM Hypervisor and QEMU system emulator (tested on CentOS, Ubuntu, and Red Hat). In the above command lines, the "-8" suffix to the card designator requests 8 cores. Another, simultaneous host or VM instance can give a similar command line specifying "-N" cores, and that user would be allocated an additional N cores.
Below is a screen capture showing VM configuration for coCPU cards:
In addition, this works across card boundaries, making transparent the number of coCPU cards installed.
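The pooled allocation described above can be pictured with a small Python model. This is not the DirectCore API. The CorePool class, its allocation order, and the user names are hypothetical, and only the "-N" suffix convention and the 64 cores per card come from this wiki.

```python
from typing import Dict, List, Tuple

CORES_PER_CARD = 64  # per the card description earlier in this wiki


def parse_core_request(designator: str) -> int:
    """Return N from a card designator such as 'SIGC66XX-8'."""
    _, _, count = designator.rpartition("-")
    n = int(count)
    if n <= 0:
        raise ValueError("core count must be positive")
    return n


class CorePool:
    """Toy model of a unified pool spanning all installed cards."""

    def __init__(self, num_cards: int) -> None:
        # Each free core is identified by (card index, core index on that card)
        self.free: List[Tuple[int, int]] = [
            (card, core)
            for card in range(num_cards)
            for core in range(CORES_PER_CARD)
        ]
        self.owned: Dict[str, List[Tuple[int, int]]] = {}

    def allocate(self, user: str, designator: str) -> List[Tuple[int, int]]:
        """Grant the next N free cores to a host or VM instance."""
        n = parse_core_request(designator)
        if n > len(self.free):
            raise RuntimeError("not enough free cores")
        granted, self.free = self.free[:n], self.free[n:]
        self.owned.setdefault(user, []).extend(granted)
        return granted


pool = CorePool(num_cards=2)
host = pool.allocate("host", "SIGC66XX-8")
vm = pool.allocate("vm1", "SIGC66XX-60")
print(host[0], host[-1], vm[0], vm[-1], len(pool.free))
```

In this model the second request spills from the first card onto the second, and the requester only sees a list of cores. That is the sense in which the number of installed cards is transparent to the user.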
It's worth noting that as a general rule, concurrent multi-user HPC VMs are difficult to implement in commodity boxes. For example, GPU technology requires time-slicing within GPU devices, and with x86 cores alone there is simply not enough horsepower, and VMs contend for network I/O resources. Combining TI and Intel CPU technology together in a complementary manner makes HPC VMs straightforward and extremely effective.
Embedded System Compatibility
For TI embedded systems customers, it's important to note the coCPU OpenCV compute model described here is scalable down as well as up. For example, we have configured a dual-core Atom x86 motherboard (in mini-ITX form factor) with Ubuntu and a half-length coCPU card (32 coCPU cores) and verified the test programs work as-is. Including the 32-core card, the overall enclosure is about 8" x 8" x 3" (pictures here).
It's also conceivable to port DirectCore drivers and libs to ARM and run all software on one TI SoC.