- 1 Overview
- 2 NFV Voice and Video Transcoding
- 3 Underlying Technology
- 4 Capacity Figures
- 5 TI CPUs vs. GPUs and x86 "Software Only"
- 6 Examples / Demos
- 7 Software Model
- 8 Virtualization
- 9 Embedded System Compatibility
Overview
This wiki is part of a "cloud HPC" series showing how to use c66x co-CPU™ cards in commodity servers to achieve real-time, high capacity processing and analytics of multiple concurrent streams of media, signals and other data.
The focus of this wiki is virtualized high capacity voice transcoding and video transcoding for telecom applications. Other wikis in the cloud HPC series include:
- Cloud HPC Overview
- Transcoding Test/Demo and Performance Measurement
- High Performance VMs
- Computer Vision (OpenCV)
- Heterogeneous Programming
- DirectCore® c66x Interface
The Cloud HPC Overview wiki has information about specific tested servers.
NFV Voice and Video Transcoding
NFV (Network Functions Virtualization) definitions and use cases now include high capacity media transcoding with co-CPU cards. For example, on page 22 of this NFV Acceleration Technologies report (etsi.org), "Accelerator Hardware" is shown as a "Hardware Resource", at the same level as Computing Hardware, Storage Hardware, and Network Hardware. An excerpt from the report's NFV Reference Architecture diagram looks like this:
While server-based NFV standards continue to be refined, TI-based transcoding technology has also advanced. Using software stacks built up by the third-party ecosystem around c66x CPUs, it's now possible to combine TI and Intel cores in a server, allowing each to do what it does best. The end result is an elegant heterogeneous core transcoding solution, combining 10s of x86 cores and 100s of c66x cores together within an off-the-shelf server and Linux + KVM framework, providing real-time, low-latency media transcoding on multiple concurrent sessions -- and at the same time, making it a mainstream, easy to use solution.
Underlying Technology
Following is a list of TI and third-party items required:
- c66x CPUs and build tools, TI
- 32-core or 64-core c66x co-CPU cards, Advantech
- DirectCore host and guest drivers and libraries, Signalogic
- mediaTest demo program, Signalogic
c66x CPUs and Build Tools
Yes, you read that right -- CPU, not DSP. TI marketing continues to label c66x devices as "DSPs", and that term continues to be widely used in the telecom community, where TI devices have a long history, including TI's acquisition of Telogy Networks in 1999. But after some 30 years of advanced chip development by TI, the term DSP is no longer an accurate label. The c66x architecture is in fact a CPU architecture, similar in many ways to Intel x86, including external memory, an internal memory subsystem (L1P, L1D, L2 cache, multicore shared memory), embedded PCIe and high-speed NIC peripherals, and inter-CPU communication. In addition, from its DSP heritage, the c66x architecture retains compute-oriented advantages, including VLIW, software pipelined loops, multiple SIMD operations per clock cycle, specialized signal processing intrinsics, and extensive DMA capabilities.
Note that Code Composer Studio software and detailed knowledge of low-level TI chip details are not required. Application demo software described below uses TI command line tools and standard makefiles. TI build tools are available online.
The Advantech co-CPU cards supply the server horsepower. Each 64-core card takes up a single PCIe slot (unlike GPU boards that take 2 slots), has two 1 GbE NICs, and draws about 120W. Up to 256 cores can be installed in a standard 1U server, and twice that many in suitable 1U or 2U servers. This is a lot of CPU cores, and aligns well with emerging server architecture trends in DPDK and virtualization, and with multicore programming models such as OpenMP and OpenACC.
Host and Guest Drivers
DirectCore drivers interact with c66x co-CPU cards from either host instances or VMs. Host instances use a "physical" driver and VM instances use virtIO "front end" drivers. In the case of VMs, the KVM Hypervisor is supported. Below is an excerpt from the NFV Acceleration Technologies report cited above, showing virtIO drivers as the preferred method of interfacing to co-CPU hardware:
Note: the DirectCore virtIO software stack is the first ever to support TI CPUs under a Hypervisor in standard x86 Linux servers (and the subject of another wiki).
The Virtualization section below describes how to configure VMs for TI CPUs, including c66x core and NIC allocation using the standard VMM (Virtual Machine Manager) user interface.
DirectCore libraries provide a high level API and session management mailbox interface for media transcoding applications. DirectCore libraries abstract all c66x cores as a unified "pool" of cores, allowing multiple users / VM instances to share c66x resources, including NICs on the co-CPU cards. This applies regardless of the number of co-CPU cards installed.
Application test and demo programs include command-line options and interactive keyboard commands for:
- transcoding session setup and tear-down
- codec unit test
- stats readout, including packet statistics, session statistics, and core usage
c66x codecs are optimized implementations of a wide range of telecom media codecs, as listed in the table in Capacity Figures, below.
Capacity Figures
A capacity table is given below for a partial list of voice and video codecs. Some notes about the table:
- Results are measured with Dell, HP, and Supermicro 1U and 2U servers, using 128 c66x cores (two 64-core cards with C6678 CPUs clocked at 1.4 GHz). For detailed information on server type and configuration, see the Cloud HPC Overview wiki
- All figures incorporate media session framework processing, including jitter buffer management (JBM), ptime handling, sampling rate conversion, DTMF inband and out-of-band, tone detection and generation, RTP packet and payload format parsing, RTCP, logging, and advanced memory management. Echo cancellation is an option (up to 128 msec)
- Voice codec figures assume a transcoding session with G711u or G711 wideband (8 or 16 kHz sampling rate), whichever will minimize processing due to sampling rate conversion, unless noted otherwise in the "Comments" field. Each session is bi-directional (i.e. "Capacity" field figures are bi-directional sessions). The following formula can be used to estimate capacity when transcoding between codec types:
max sessions = A/(A/B+1)
where A and B are capacity figures from the table (see notes below to derive this).
- Not all bitrates and options (DTX, echo cancellation, etc) are shown
- Video codec figures are measured at 720p 30 fps, 1.5 Mbps, unless noted otherwise in the "Comments" field. For H.264, high profile is used. Not all profiles, bitrates, and resolutions are shown
- All voice codecs are XDAIS compatible
|TI||Transcoding with AMR-NB|
|Opus||7168||85||TI||16 kHz sampling rate, 16 kbps|
|EVS||3584||95||Signalogic||8 kHz, 13.2 kbps, DTX disabled. Currently being optimized (EVS codec product page)|
|G711a||43520||61||Signalogic||Capacity limited by 4 GbE total NIC|
|H.264||32||90||TI, modifications Signalogic||HP, 720p, 30 fps, 1.5 Mbps|
|H.265||2-4||90||TI||Integration in work (Signalogic)|
The media session framework has been written and optimized for c66x by Signalogic, and contains no legacy Telogy code or other third-party code. In addition to the functions mentioned above, the media session framework also supports:
- Echo cancellation, VAD, DTMF, CNG, G711.1 (wideband), G711 Appendix I and II, and other voice algorithms (provided by TI)
- Conferencing (provided by Signalogic)
- Analytics functions, including speech recognition and image analytics (for example, OpenCV, as described here)
- OpenMP (when used with a subset of cores not running codecs or other media algorithms)
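Deriving the transcoding equation:
1) Let n be the number of sessions of the lower capacity codec (B) that are given up. The capacity this frees supports n(A/B) sessions of the higher capacity codec (A). Set that equal to the remaining number of B sessions:
n(A/B) = B-n
2) Solve for n:
n = B/(A/B +1)
3) Express max sessions as B-n:
max sessions = B - B/(A/B+1) = A/(A/B+1)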
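The estimate can be checked numerically. The sketch below is illustrative only: the function name is hypothetical, and the inputs are the Opus and EVS capacity figures from the table above. It evaluates A/(A/B+1), which is algebraically the same as AB/(A+B), so the order of the two codecs does not matter.

```python
def transcoding_capacity(a: float, b: float) -> float:
    """Estimate max sessions when transcoding between two codecs.

    a and b are the per-codec capacity figures from the table.
    Returns A/(A/B + 1), which equals A*B/(A + B).
    """
    if a <= 0 or b <= 0:
        raise ValueError("capacity figures must be positive")
    return a / (a / b + 1)


if __name__ == "__main__":
    opus, evs = 7168, 3584  # capacity figures taken from the table
    sessions = transcoding_capacity(opus, evs)
    # Round down, since a fractional session cannot be set up.
    print(f"Opus <-> EVS estimate: {int(sessions)} sessions")
```

With these inputs A/B is 2, so the estimate is one third of the Opus figure. The result is always lower than the smaller of the two capacity figures, as expected when each session has to run both codecs.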
TI CPUs vs. GPUs and x86 "Software Only"
x86 "software only" and GPU vendors persist in one-sided comparisons with "DSP" solutions, for example, as shown here. These comparisons tend to be based on one or more of the following assumptions: (i) TI devices are "hardware" solutions found only in embedded systems, not servers and clouds, (ii) the latest TI multicore devices are fundamentally not CPUs and can't run open source C/C++ code, and (iii) TI solutions are "hard to use", requiring special expertise. These assumptions, if valid in years past, do not apply when a software stack fully compatible with Linux, KVM, and DPDK is applied.
With the software stack in place, comparisons with x86 and/or GPU solutions can be made on a per-server basis. As noted, the table above gives figures for 128 cores in a single server; however, the solution is scalable beyond that. For example, in an average 1U server (e.g. 12 x86 cores, 800W, approx $2500 cost) up to 256 TI CPU cores can be installed, providing a combined performance, power efficiency, and per-session cost benefit of more than 10-to-1 vs. x86 and GPU solutions.
The technical reality is that a fully developed software stack allows the remarkable SWaP advantages of TI CPUs to be applied in commodity servers. There is no point in waiting years for servers with 100+ CPU cores when it can be done already using off-the-shelf, mainstream software and hardware components.
Examples / Demos
The mediaTest program handles multiple concurrent streams, applying per-stream image analytics, H.264 compression and RTP streaming (packet egress on the card NIC). Here is an example mediaTest command line:
./mediaTest -m0xff -f1400 -emedia_transcoding.out -cSIGC66XX-8 -s2 -i/home/Signalogic/media_files/stv_8c.INP -Sconfig.txt -o/home/Signalogic/media_files/stv_8c.wav
Given the above command, mediaTest sets up sessions as specified by the -S command line option (session configuration file), processes network I/O packets using Linux or DPDK cores, and transfers IP/UDP/RTP packets to/from c66x cores (via PCIe) for transcoding. While mediaTest is running, interactive keyboard commands can be used for stats readout.
Optionally, the c66x card can process IP/UDP/RTP packets using its onboard NICs (via the -D, or destination, command line option):
./mediaTest -m0xff -f1400 -emedia_transcoding.out -cSIGC66XX-8 -s2 -i/home/Signalogic/media_files/stv_8c.INP -D192.168.1.61:45056:60-af-6d-75-75-f1 -Sconfig.txt -o/home/Signalogic/media_files/stv_8c.wav
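The -D argument in the command above appears to pack three colon-separated fields. The sketch below assumes these are an IPv4 address, a UDP port, and a MAC address written with hyphens. That reading is inferred from the example alone, not from mediaTest documentation, and the helper names are hypothetical. It may be useful for validating a destination string in a wrapper script before launching mediaTest.

```python
import ipaddress
import re
from typing import NamedTuple


class Destination(NamedTuple):
    ip: ipaddress.IPv4Address
    port: int
    mac: str


# Six hex byte pairs separated by hyphens, as in the example command line.
MAC_RE = re.compile(r"^[0-9a-fA-F]{2}(-[0-9a-fA-F]{2}){5}$")


def parse_destination(value: str) -> Destination:
    """Split an assumed 'ip:port:mac' destination string and validate it."""
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected ip:port:mac, got {value!r}")
    ip_text, port_text, mac = parts
    ip = ipaddress.IPv4Address(ip_text)  # raises a ValueError subclass if invalid
    port = int(port_text)                # raises ValueError if not a number
    if not 0 < port < 65536:
        raise ValueError(f"UDP port out of range: {port}")
    if not MAC_RE.match(mac):
        raise ValueError(f"malformed MAC address: {mac!r}")
    return Destination(ip, port, mac.lower())


if __name__ == "__main__":
    dest = parse_destination("192.168.1.61:45056:60-af-6d-75-75-f1")
    print(dest.ip, dest.port, dest.mac)
```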
If card NICs are being used for network I/O instead of server NICs, then the following notes apply:
- UDP/TCP, RTP, RTCP, ARP, and other protocol stacks run on c66x CPUs
- For compressed video, RTP streaming is used
- Received packets are distributed to c66x cores via UDP port filtering, handled by the c66x PA (packet accelerator) at wire-speed
mediaTest can also run codecs in unit test mode, and save either intermediate compressed bitstream data, or decoded data as a .wav file, as shown in the following example:
./mediaTest -m0xff -f1400 -emedia_transcoding.out -cSIGC66XX-8 -s2 -i/home/Signalogic/media_files/stv_8c.INP -o/home/Signalogic/media_files/stv_8c.wav
To obtain demo software please send e-mail to info [at] signalogic [dot] com. After verifying that you have a supported c66x co-CPU card we'll send a link to a secure page with demo binaries and an automated install script.
Software Model
Below is a diagram showing the software model for the cloud HPC solution. Notes about this diagram:
- Application complexity increases from left to right (command line, open source library APIs, user code APIs, heterogeneous programming). mediaTest is at the third level
- All application types can run concurrently in host or VM instances (see below for VM configuration)
- c66x CPUs can make direct DMA access to host memory, facilitating use of DPDK
- c66x CPUs are connected directly to the network. Received packets are filtered by UDP port and distributed to c66x cores at wire speed
The host memory DMA capability is also used to share data between c66x CPUs, for example in an application such as H.265 (HEVC) encoding, where 10s of cores must work concurrently on the same data set.
Virtualization
The DirectCore drivers and libs are fully virtualized, supporting the KVM Hypervisor and QEMU system emulator (tested on CentOS, Ubuntu, and Red Hat). In the above command lines, the "-8" suffix to the card designator requests 8 cores. Another, simultaneous host or VM instance can give a similar command line specifying "-N" cores, and that user would be allocated an additional N cores.
Below is a screen capture showing VM configuration for c66x co-CPU cards:
In addition, this works across card boundaries, making transparent the number of co-CPU cards installed.
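To illustrate the pooled allocation described above, here is a minimal sketch of a core pool that spans several cards. It is not the DirectCore API: the class, method, and function names are hypothetical, and the first-fit allocation order is an assumption chosen for simplicity. The only details taken from the text are the "-N" suffix on the card designator and the idea that a request can be satisfied across card boundaries.

```python
class CorePool:
    """One pool of c66x cores spanning every installed card (illustrative only)."""

    def __init__(self, num_cards: int = 2, cores_per_card: int = 64) -> None:
        # Each core is identified by (card index, core index on that card).
        self._free = [(card, core)
                      for card in range(num_cards)
                      for core in range(cores_per_card)]
        self._owned = {}

    @property
    def free_count(self) -> int:
        return len(self._free)

    def allocate(self, user: str, count: int) -> list:
        """Give `user` the next `count` free cores, regardless of which card they are on."""
        if count <= 0:
            raise ValueError("core count must be positive")
        if count > len(self._free):
            raise RuntimeError("not enough free cores in the pool")
        granted, self._free = self._free[:count], self._free[count:]
        self._owned.setdefault(user, []).extend(granted)
        return granted

    def release(self, user: str) -> None:
        """Return all cores held by `user` to the pool."""
        self._free.extend(self._owned.pop(user, []))


def parse_card_designator(designator: str) -> tuple:
    """Split a designator such as 'SIGC66XX-8' into (card name, requested cores)."""
    name, sep, count = designator.rpartition("-")
    if not sep or not count.isdigit() or int(count) == 0:
        raise ValueError(f"expected NAME-N, got {designator!r}")
    return name, int(count)


if __name__ == "__main__":
    pool = CorePool(num_cards=2, cores_per_card=64)
    _, n = parse_card_designator("SIGC66XX-8")
    host_cores = pool.allocate("host", n)
    vm_cores = pool.allocate("vm1", 60)  # larger than what is left on card 0
    cards_used = sorted({card for card, _ in vm_cores})
    print(f"host got {len(host_cores)} cores; vm1 got {len(vm_cores)} cores on cards {cards_used}")
    print(f"{pool.free_count} cores still free")
```

In this sketch the second request is filled partly from the first card and partly from the second, which is the behavior the text describes as making the number of installed cards transparent to the user.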
It's worth noting that as a general rule, concurrent multi-user HPC VMs are difficult to implement in commodity boxes. For example, GPU technology requires time-slicing within GPU devices, while x86 cores alone simply do not have enough horsepower, and VMs contend for network I/O resources. Combining TI and Intel CPU technology in a complementary manner makes HPC VMs straightforward and extremely effective.
Embedded System Compatibility
For TI embedded systems customers, it's important to note the c66x OpenCV compute model described here is scalable down as well as up. For example, we have configured a dual-core Atom x86 motherboard (in mini-ITX form-factor) with Ubuntu and a half-length co-CPU card (32 c66x cores) and verified the test programs work as-is. Including the 32-core card, the overall enclosure is about 8" x 8" x 3" (pictures here).
It's also conceivable to port DirectCore drivers and libs to ARM and run all software on one TI SoC.