- 1 Overview
- 2 HPC with c66x CPUs
- 3 Underlying Technology
- 4 Software Model
- 5 Installing / Configuring VMs
- 6 Application Demo Programs
Overview
This wiki is one in a series showing how to use c66x co-CPU cards in commodity servers to achieve real-time, high capacity processing and analytics of multiple concurrent streams of media, signals and other data.
This wiki is the cloud HPC and server HPC overview, and contains information about server type and configuration, overall software architecture, and virtualization support. The following wikis go into application specific detail:
- High Performance VMs
- Computer Vision (OpenCV)
- NFV Media Transcoding
- Heterogeneous Programming
- DirectCore API and Source Code Examples
HPC with c66x CPUs
Using established technology and software stacks built by TI's third-party ecosystem, it is now possible to combine TI and Intel cores to create heterogeneous HPC server solutions. In off-the-shelf servers running Linux + KVM, tens of x86 cores and hundreds of c66x cores can work together on applications including high performance VMs (HPVMs), image analytics, video content delivery, and media transcoding.
The following TI and third-party items are required:
- c66x CPUs and build tools, TI
- 32-core or 64-core c66x co-CPU cards, Advantech
- Standard off-the-shelf server running Ubuntu, CentOS, or Red Hat Linux (tested configurations given below)
- DirectCore host drivers and libraries, Signalogic
- DirectCore guest drivers and patches for QEMU, libvirt, and virt-manager, Signalogic
- Application Demo Programs, Signalogic
Underlying Technology
c66x CPUs and Build Tools
The c66x architecture is an advanced CPU architecture, similar in many ways to Intel x86, including external memory, internal memory subsystem (L1P, L1D, L2 cache, multicore shared memory), embedded PCIe and high-speed NIC peripherals, and inter-CPU communication. In addition, from its DSP heritage, the c66x architecture retains compute-oriented advantages, including VLIW, software pipelined loops, advanced DMA functionality, and multiple operations per clock cycle. Because it's a CPU it works well in combination with x86 CPUs inside servers.
TI build tools generate optimized code from C/C++ source. Porting open source C/C++ to c66x is a straightforward process, one documented example being OpenCV (open computer vision). The command line version of the build tools is available online.
Note that neither Code Composer Studio nor detailed knowledge of low-level TI chip internals is required. All demo software described in the HPC series of wikis uses the command line tools and standard Makefiles.
c66x co-CPU Cards
The Advantech co-CPU cards supply the server horsepower. Each 64-core card takes up a single slot (unlike GPU boards, which take 2 slots), has two (2) 1 GbE NICs, and draws about 120 W. Up to 256 cores can be installed in a standard 1U server, and twice that many in suitable 1U or 2U servers. This high core count aligns well with emerging server architecture trends in virtualization, DPDK, and high bandwidth network I/O, as well as multicore programming models such as OpenMP and OpenACC.
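The core counts quoted above follow from simple multiplication. The following is a minimal sketch, assuming 64-core cards; the function names `c66x_cores` and `cards_needed` are hypothetical and chosen for illustration only.

```python
CORES_PER_CARD = 64  # assumption: 64-core cards, per the card description


def c66x_cores(num_cards: int, cores_per_card: int = CORES_PER_CARD) -> int:
    """Total c66x cores supplied by a number of identical co-CPU cards."""
    if num_cards < 0:
        raise ValueError("num_cards must be non-negative")
    return num_cards * cores_per_card


def cards_needed(target_cores: int, cores_per_card: int = CORES_PER_CARD) -> int:
    """Smallest number of cards that supplies at least target_cores."""
    return -(-target_cores // cores_per_card)  # ceiling division


print(c66x_cores(2))      # cores supplied by two cards
print(cards_needed(256))  # cards behind the 256-core 1U figure
```

Under these assumptions, two cards give 128 cores, and the 256-core figure for a standard 1U server corresponds to four cards.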
Off-the-Shelf Linux Servers
Servers and OS tested with c66x HPC solutions include:
- Servers: HP DL380 G8 and G9, Dell R720 and R730, Supermicro 6016GT or 1028Gx series, others
- Linux OS: Ubuntu 12.04, 14.04, CentOS 6.2, 7, 7.1, or Red Hat 7
- KVM Hypervisor and QEMU system emulator (VMware support is planned)
Below are images showing c66x co-CPU cards installed in Dell and HP servers. Unlike GPU boards, the cards are single-slot thickness, allowing full riser utilization.
Below is an image showing a Dell R720 server with 16 x86 cores and two (2) c66x co-CPU cards installed, for a total of 128 c66x cores. The x86 cores are supplied by two (2) Xeon E5-2670 CPUs rated at 2.6 GHz, and the c66x cores by sixteen (16) eight-core C6678 CPUs (eight per card) rated at 1.25 GHz:
Below is an image showing an HP DL380 G9 server with 16 x86 cores and two (2) c66x co-CPU cards installed, for a total of 128 c66x cores. The x86 cores are supplied by two (2) Xeon E5-2680v3 CPUs rated at 2.5 GHz, and the c66x cores by sixteen (16) eight-core C6678 CPUs (eight per card) rated at 1.25 GHz:
Server Power Consumption and Temperature
For a Dell R720 with 16 Intel / 128 TI cores, the average power draw is around 700 W. The images below show R720 instantaneous temperature and power stats for 128 cores:
Note that power variation depends mostly on what the x86 cores are doing.
For an HP DL380 G9 with 16 Intel / 128 TI cores, the average power draw is around 650 W. The image below shows DL380 G9 temperature stats for 128 cores:
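To put the server figures in context, the arithmetic below relates them to the approximate 120 W per-card draw quoted earlier. It is a rough sketch only: it assumes the per-card figure applies during these measurements, which the measurements above do not confirm, and the helper name `card_share` is hypothetical.

```python
CARD_POWER_W = 120    # approximate draw per card, from the card description
CORES_PER_CARD = 64   # assumption: 64-core cards


def card_share(server_power_w: float, num_cards: int) -> float:
    """Fraction of average server power attributable to the co-CPU cards."""
    return num_cards * CARD_POWER_W / server_power_w


watts_per_core = CARD_POWER_W / CORES_PER_CARD
print(f"{watts_per_core:.2f} W per c66x core")
for name, power in (("Dell R720", 700), ("HP DL380 G9", 650)):
    print(f"{name}: cards are about {card_share(power, 2):.0%} of {power} W")
```

Under these assumptions the two cards account for roughly a third of the average server draw, at just under 2 W per c66x core.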
A "Power and Thermal Evaluation" app note is available explaining the test methodology used for precise TI chip level measurements.
Host and Guest Drivers
DirectCore drivers interact with c66x cards from either host instances or VMs. Host instances use a "physical" driver and VM instances use virtIO "front end" drivers.
Host and Guest Libs
DirectCore libraries provide a high level API for applications. The libraries abstract all c66x cores as a unified "pool" of cores, allowing multiple users / VM instances to share c66x resources, including NICs. This applies regardless of the number of cards installed in the server. The DirectCore API and Source Code Examples wiki has API details and source code examples.
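The "pool" idea can be illustrated with a toy model. The sketch below is not the DirectCore API; the class `CorePool`, its methods, and the user names are all hypothetical and only show how cores from several cards might be presented as one shared resource.

```python
class CorePool:
    """Toy model of a unified core pool; not the DirectCore API."""

    def __init__(self, cores_per_card):
        # Flatten every (card, core) pair into a single free list.
        self.free = [(card, core)
                     for card, n in enumerate(cores_per_card)
                     for core in range(n)]
        self.owned = {}  # user name -> list of (card, core) pairs

    def allocate(self, user, count):
        """Grant `count` cores to `user`, regardless of which card they sit on."""
        if count > len(self.free):
            raise RuntimeError("not enough free cores")
        granted, self.free = self.free[:count], self.free[count:]
        self.owned.setdefault(user, []).extend(granted)
        return granted

    def release(self, user):
        """Return all of a user's cores to the pool."""
        self.free.extend(self.owned.pop(user, []))


pool = CorePool([64, 64])          # two 64-core cards
vm_a = pool.allocate("vm-a", 80)   # request spans both cards
vm_b = pool.allocate("vm-b", 40)
print(len(pool.free))              # cores still unallocated
```

In this model a request larger than one card simply spills onto the next card, so callers never need to know how many cards are installed.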
Software Model
Below is a diagram showing the software model for the cloud and server HPC solution. Notes about this diagram:
- Application complexity increases from left to right (command line, open source library APIs, user code APIs, heterogeneous programming)
- All application types can run concurrently in host or VM instances (see below for VM configuration)
- c66x CPUs can make direct DMA access to host memory, facilitating use of DPDK
- c66x CPUs are connected directly to the network. Received packets are filtered by UDP port and distributed to c66x cores at wire speed
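The UDP port filtering mentioned in the last note can be pictured as a mapping from destination port to core. The sketch below is illustrative only: the real filtering runs on the c66x side at wire speed, and the one-port-per-core rule, the port numbers, and the function names are assumptions rather than documented behavior.

```python
import struct


def udp_dest_port(udp_segment: bytes) -> int:
    """Read the destination port from a UDP header (bytes 2-3, big-endian)."""
    _src, dst, _length, _checksum = struct.unpack("!HHHH", udp_segment[:8])
    return dst


def core_for_port(port, base_port, num_cores):
    """Hypothetical rule: one port per core, starting at base_port.

    Returns None for ports outside the range (the packet is filtered out).
    """
    offset = port - base_port
    return offset if 0 <= offset < num_cores else None


# Header-only UDP segment: source port 40000, destination port 10005.
segment = struct.pack("!HHHH", 40000, 10005, 8, 0)
print(core_for_port(udp_dest_port(segment), base_port=10000, num_cores=128))
```

A packet whose port falls outside the configured range maps to `None` here, standing in for a packet that the filter drops.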
The host memory DMA capability is also used to share data between c66x CPUs, for example in an application such as H.265 (HEVC) encoding, where tens of cores must work concurrently on the same data set.
Installing / Configuring VMs
Below is a screen capture showing VM configuration for c66x co-CPU cards, using the Ubuntu Virtual Machine Manager (VMM) user interface:
c66x core allocation is independent of the number of cards installed in the system; just as memory DIMMs of different sizes can be installed together, c66x cards can be mixed and matched.
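One way to picture mixing cards of different sizes is a pool-wide core index that is resolved to a specific card and on-card core. The sketch below is a minimal illustration, assuming one 32-core and one 64-core card; `locate_core` is a hypothetical helper, not part of any driver or library.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_core(logical_index, cores_per_card):
    """Map a pool-wide core index to (card index, core index on that card)."""
    ends = list(accumulate(cores_per_card))  # e.g. [32, 96] for cards of 32 and 64
    if not 0 <= logical_index < ends[-1]:
        raise IndexError("core index outside the pool")
    card = bisect_right(ends, logical_index)
    start = ends[card] - cores_per_card[card]
    return card, logical_index - start


cards = [32, 64]   # one 32-core card and one 64-core card
print(locate_core(31, cards))   # last core on the first card
print(locate_core(32, cards))   # first core on the second card
```

The caller works only with indexes 0 through 95, and the per-card sizes matter only inside the lookup.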
Application Demo Programs
Application test and demo programs are available and described in detail on other wikis in the cloud and server HPC series, including:
- Image analytics processing examples, located on the c66x computer vision (OpenCV) wiki
- Continuous streaming examples (H.264 compression using TI multimedia codecs and streaming over IP/UDP/RTP), located on the High Performance VMs wiki
- Media transcoding examples, located on the NFV Transcoding wiki
- Hyperpiler demo (c66x heterogeneous programming) examples, located on the c66x Heterogeneous Programming wiki