High Performance VMs

From Texas Instruments Wiki

High Performance Virtual Machines (HPVMs)

This wiki is one in a series showing how to use c66x accelerator cards in commodity servers to achieve real-time, high capacity processing and analytics of multiple concurrent streams of media, signals and other data.

Overview

Using established technology and software stacks built by TI's third-party ecosystem, it's now possible to combine TI and Intel cores to create heterogeneous HPC server solutions. Using off-the-shelf servers running Linux + KVM, up to 10s of x86 cores and 100s of c66x cores can work together on applications including image analytics, video content delivery, and media transcoding.

This solution meets three key objectives required for HPVMs: (i) each VM must process multiple streams or data sets concurrently and in real time, (ii) multiple VMs must run concurrently with high performance, and (iii) configuration of c66x VMs must be integrated with the VMM (Virtual Machine Manager), like any other machine resource.

This wiki focuses on HPVMs, while others in the series show how to set up off-the-shelf HPC servers in order to run computer vision (OpenCV) and perform c66x Heterogeneous Programming. The Server HPC overview wiki has detailed information about tested servers.

Underlying Technology

Following is a list of TI and third-party items required:

  1. c66x CPUs and build tools, TI
  2. 32-core or 64-core c66x PCIe accelerator cards, Advantech
  3. Standard off-the-shelf server running Ubuntu, CentOS, or Red Hat Linux (tested examples given below)
  4. DirectCore host drivers and libraries, Signalogic
  5. DirectCore guest drivers and patches for QEMU, libvirt, and virt-manager, Signalogic
  6. Application Demo Programs, Signalogic

c66x CPUs and Build Tools

Yes you read that right -- CPU, not DSP. Although TI marketing continues to label c66x devices as "DSPs", after some 30 years of advanced chip development by TI, this is no longer a precise label. The c66x architecture is in fact a CPU architecture, similar in many ways to Intel x86, including external memory, internal memory subsystem (L1P, L1D, L2 cache, multicore shared memory), embedded PCIe and high-speed NIC peripherals, and inter-CPU communication. In addition, from its DSP heritage, the c66x architecture retains compute-oriented advantages, including VLIW, software pipelined loops, and multiple operations per clock cycle.

TI build tools are available online.

Note that Code Composer Studio software and detailed knowledge of low-level TI chip details are not required. The TI build tools are command line tools, and the demo software described below includes standard makefiles.

PCIe Cards

The Advantech PCIe cards supply the server horsepower. Each card has 64 cores, takes up a single slot (unlike GPU boards that take 2 slots), has two (2) 1 GbE NICs, and draws about 120W. Up to 256 cores can be installed in a standard 1U server, and twice that many in suitable 1U or 2U servers. This is a lot of CPU cores, and aligns perfectly with emerging server architecture trends in virtualization, DPDK, and high bandwidth network I/O, as well as multicore programming models such as OpenMP and OpenACC.

Off-the-Shelf Linux Servers

Servers and OS tested with HPVMs include:

  • Servers: HP DL380 G8 and G9, Dell R720 and R730, Supermicro 6016GT or 1028Gx series, others
  • Linux OS: Ubuntu 12.04 or 14.04; CentOS 6.2, 7, or 7.1; or Red Hat 7
  • KVM Hypervisor and QEMU system emulator

Detailed information about tested servers, including pictures of c66x card installation, power consumption stats, and temperature stats, is located on the Server HPC Overview wiki.

Host and Guest Drivers

DirectCore drivers interact with c66x PCIe cards from either host instances or VMs. Host instances use a "physical" driver and VM instances use virtIO "front end" drivers.

Host and Guest Libs

DirectCore libraries provide a high level API for applications. DirectCore libraries abstract all c66x cores as a unified "pool" of cores, allowing multiple users / VM instances to share c66x resources, including NICs on the PCIe cards. This applies regardless of the number of PCIe cards installed in the server.

Installing / Configuring VMs

Below is a screen capture showing VM configuration for c66x accelerator cards, using the Ubuntu Virtual Machine Manager (VMM) user interface:

VMM dialog showing VM configuration for c66x accelerator cards

c66x core allocation is transparent to the number of PCIe cards installed in the system; just as memory DIMMs of different sizes can be installed together, c66x cards can be mixed and matched. Unless there is an application-specific reason for affinity, physical cores and physical NICs are not assigned.

Application Demo Programs

Application test and demo programs include command-line options for:

  • image analytics processing
  • continuous streaming (H.264 compression using TI multimedia codecs and streaming over IP/UDP/RTP; multiple streams)
  • ffmpeg command line emulation

Command Line Demo Programs

Following are command line examples for application demos that use pre-built c66x executable files.

1) streamTest. The streamTest program handles multiple concurrent streams, applying per-stream image analytics, H.264 compression, and RTP streaming (packet egress on the card NIC). Here is an example streamTest command line:

 ./streamTest -m0xff -f1400 -estream.out -cSIGC66XX-8 -s2 -i/home/Signalogic/video_files/parkrun_720p_50fps_420fmt.yuv -x1280 -y720 -r30 -D192.168.1.61:45056:60-af-6d-75-75-f1 -B1500000 -oparkrun_test.h264

The above command line runs in "continuous mode" and outputs both an RTP stream over the card NIC (the -D, or destination, command line option) and a compressed file (the -o command line option). Multiple streams can be specified by adding more instances to the command line.

2) iaTest. The iaTest program handles multiple concurrent YUV video files, applying OpenCV image analytics and continuously storing output YUV data to an HDD file.

Here is an example iaTest command line:

./iaTest -m1 -f1250 -eia.out -cSIGC66XX-8 -s0 -i/home/Signalogic/video_files/CCTV_640x360p_30fps_420fmt.yuv -x640 -y360 -r30 -l0x1100003 -occtv_test5.yuv

In the above command lines, -x and -y give the resolution, -r the frame rate, and -B the bitrate. Also, not shown are command line options to specify the codec profile, CBR vs. VBR, qp values, and more.

In the above command lines, the "-8" suffix to the card designator requests 8 cores. Another host or VM instance running simultaneously can give a similar command line specifying "-N" cores, and that user would be allocated an additional N cores.

To obtain demo programs please send e-mail to info [at] signalogic [dot] com. After verifying that you have a supported c66x PCIe card we'll send a link to a secure page with demo binaries and an automated install script.

CIM Hyperpiler Demo

Also available is a CIM® Hyperpiler™ demo, which uses OpenMP syntax in C/C++ source code to:

  • generate separate source code streams for x86 and c66x
  • augment generated host source code streams and c66x target source code streams with APIs required for run-time synchronization and data transfer
  • build and run x86 and c66x executables

Below is C source code with a simple convolution example; the area highlighted in red shows the use of #ifdefs to control whether the source is compiled for the following platforms:

  • c66x only
  • x86 only
  • both c66x and x86 (using the CIM® Hyperpiler™)

Heterogeneous c66x and x86 programming example using convolution C source code

More information is available on the c66x heterogeneous programming wiki.

Virtualized c66x -- How it Works

Modifications to QEMU, libvirt, and virt-manager allow DirectCore drivers and libraries to be fully virtualized. Here are some notes about the HPVM solution:

  • Supports the KVM Hypervisor and QEMU system emulator, and has been tested on CentOS, Ubuntu, and Red Hat
  • VM instances use virtIO front-end drivers; host instances use a physical driver
  • Any combination of host and VM instances can run concurrently; both allocate cores from the total system pool of c66x cores

It's worth noting that, as a general rule, concurrent multi-user HPC VMs are difficult to implement in commodity boxes. For example, GPU technology requires time-slicing within GPU devices, and x86 cores alone do not provide enough horsepower or uncontested NIC ports. Combining TI and Intel CPU technology in a complementary manner boosts the number of cores and NIC ports, making it straightforward to create HPC VMs.