This wiki is one in a series showing how to use c66x accelerator cards in commodity servers to achieve real-time, high-capacity processing and analytics of multiple concurrent streams of media, signals and other data. Topics covered include porting OpenCV to TI c66x CPUs, required software components, and VM / Hypervisor support.
Other wikis in the server HPC series show how to set up off-the-shelf HPC servers that combine x86 and c66x cores in order to run High Performance Virtual Machines (HPVMs) and NFV Transcoding, and perform c66x Heterogeneous Programming. The Server HPC overview wiki has detailed information about tested servers.
OpenCV on TI c66x
OpenCV has emerged as a widely accepted, open source image analytics and image processing platform. It has a C/C++ codebase and does not depend on extensive hand-optimized x86 asm language for high performance, like some older projects such as ffmpeg and fftw. Instead, it relies on OpenMP, CUDA, OpenCL, or other multicore compute models, and the OpenCV authors appear to be committed to standards-based heterogeneous core compute models, running on commodity servers.
Using established technology and software stacks built up by the third-party ecosystem around c66x, it's now possible to combine TI and Intel cores in a server, allowing each to do what it does best. The end result is an elegant heterogeneous core HPC solution, combining 10s of x86 cores and 100s of c66x cores together within an off-the-shelf server and Linux + KVM framework, providing high performance real-time image analytics on multiple concurrent streams -- and at the same time, making it a mainstream, simple to use solution.
Following is a list of TI and third-party items required:
- c66x CPUs and build tools, TI
- 32-core or 64-core c66x PCIe accelerator cards, Advantech
- DirectCore host and guest drivers and libraries, Signalogic
- c66x OpenCV port and host OpenCV interface layer, Signalogic
c66x CPUs and Build Tools
Yes you read that right -- CPU, not DSP. Although TI marketing continues to label c66x devices as "DSPs", after some 30 years of advanced chip development by TI, this is no longer a precise label. The c66x architecture is in fact a CPU architecture, similar in many ways to Intel x86, including external memory, internal memory subsystem (L1P, L1D, L2 cache, multicore shared memory), embedded PCIe and high-speed NIC peripherals, and inter-CPU communication. In addition, from its DSP heritage, the c66x architecture retains compute-oriented advantages, including VLIW, software pipelined loops, multiple operations per clock cycle, and extensive DMA capabilities.
TI build tools are available online.
Note that Code Composer Studio software and detailed knowledge of low-level TI chip details are not required. The demo software described below uses TI command line build tools and standard Makefiles.
The Advantech PCIe cards supply the server horsepower. Each card has 64 cores (eight C6678 CPUs), takes up a single slot (unlike GPU boards that take 2 slots), has two 1 GbE NICs, and draws about 120W. Up to 256 cores can be installed in a standard 1U server, and twice that many in suitable 1U or 2U servers. This is a lot of CPU cores, and aligns perfectly with emerging server architecture trends in DPDK and virtualization, and multicore programming models such as OpenMP and OpenACC.
DirectCore drivers make it possible to control and feed image analytics streams from either host instances or VMs. In the case of VMs, the KVM Hypervisor is supported. Note: this is the first Hypervisor ever to support TI CPUs in standard x86 Linux servers (and the subject of another wiki).
DirectCore libraries provide both an OpenCV-compatible high-level API and a hardware-level API. DirectCore drivers and libraries view all c66x cores in the server as a "unified pool" of cores, allowing multiple users / VM instances to share c66x resources, including NICs on the PCIe cards. This applies regardless of the number of PCIe cards installed in the server (i.e. in terms of card resources, boundaries between cards are transparent).
Host test and demo programs include command-line options for:
- YUV video download to c66x cores
- image analytics processing
- continuous streaming (H.264 compression with RTP streaming)
- multiple streams
The c66x OpenCV port provides API functionality to (i) user-defined C/C++ programs running on c66x, and (ii) host programs that use the DirectCore interface layer mentioned above.
Example c66x C/C++ programs are provided that demonstrate motion detection, morphological functions, edge detection, finding contours, filtering, contour manipulation, YUV conversion, region of interest, and more. Here is sample c66x C code with OpenCV API calls:
Porting OpenCV to c66x
Porting OpenCV to TI c66x proved to be a straightforward process (we ported OpenCV version 2.4.10). The TI build tools handle complex C++ source with no problems and they generate highly optimized code. The major steps were:
- memory allocation had to be reworked and adapted, because OpenCV has its own internal memory management
- in order to take advantage of c66x performance, we modified key internal C/C++ matrix/image loops with TI pragmas and/or intrinsics to ensure these loops were software pipelined
- in some cases we obtained higher performance by replacing OpenCV functions internally with TI VLIB functions, for example the morphology functions erode and dilate (note: there are likely additional VLIB replacements that can be applied, but in some cases the compiled OpenCV functions are already fast using only loop optimization, with no replacement required)
- we added optimized YUV-RGB and YUV-YCrCb conversion functions, which OpenCV lacks for some reason
- we added H.264 compression and RTP streaming options (note: these use the TI H.264 codec; we may consider adding H.265 as an update)
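As a rough illustration of the loop-level changes and color conversion functions listed above, here is a minimal sketch in plain C. It is not the code from the port. The function name yuv420_row_to_rgb24, the BT.601 limited-range integer coefficients, and the MUST_ITERATE arguments are assumptions chosen for illustration. The restrict qualifiers and the TI MUST_ITERATE pragma are the kind of hints that can help the TI compiler software pipeline a loop; the pragma is guarded so the code also builds with other compilers.

```c
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit range. */
static inline uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Convert one row of planar YUV 4:2:0 to packed RGB24.
   y:   width luma samples
   u,v: width/2 chroma samples each (one per pair of pixels)
   rgb: 3*width output bytes, in R,G,B order
   Assumptions: BT.601 limited-range video, width even and >= 2. */
void yuv420_row_to_rgb24(const uint8_t *restrict y,
                         const uint8_t *restrict u,
                         const uint8_t *restrict v,
                         uint8_t *restrict rgb,
                         int width)
{
    int x;

    /* Trip-count hint for the TI compiler only: at least 2 iterations,
       always a multiple of 2. Other compilers never see the pragma. */
#ifdef __TI_COMPILER_VERSION__
    #pragma MUST_ITERATE(2, , 2)
#endif
    for (x = 0; x < width; x++) {
        int c = y[x] - 16;
        int d = u[x >> 1] - 128;
        int e = v[x >> 1] - 128;

        /* Fixed-point BT.601 coefficients scaled by 256. Division is used
           instead of a right shift so negative intermediates are portable;
           they are clipped to 0 either way. */
        rgb[3 * x + 0] = clip_u8((298 * c + 409 * e + 128) / 256);
        rgb[3 * x + 1] = clip_u8((298 * c - 100 * d - 208 * e + 128) / 256);
        rgb[3 * x + 2] = clip_u8((298 * c + 516 * d + 128) / 256);
    }
}
```

Calling this once per image row, with the chroma row pointers advancing every second row, would convert a full 4:2:0 frame. Whether a given loop actually software pipelines depends on the compiler version and options, so the compiler's loop feedback should be checked rather than assumed.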
Currently we are enabling c66x OpenMP to allow multiple core performance to be applied to individual OpenCV APIs. This does not work yet, as it requires utilizing a subset of c66x CPU cores (e.g. 6 cores, starting with OpenMP master core index = 2), but we expect to get it working with TI's help.
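To make the idea concrete, the following is a minimal sketch of how a single image operation might be split across a subset of cores using standard OpenMP. It is an illustration only, not the c66x OpenMP configuration under development: threshold_rows and NUM_WORKER_CORES are hypothetical names, the value 6 simply mirrors the example subset size mentioned above, and the selection of a specific master core index is not expressed here because standard OpenMP pragmas have no portable way to do that.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_WORKER_CORES 6  /* hypothetical: example subset size from the text */

/* Binary threshold of an 8-bit image, with rows divided among threads.
   Each row is independent, so no synchronization is needed inside the loop.
   Without OpenMP enabled at build time the pragma is ignored and the loop
   runs serially with identical results. */
void threshold_rows(const uint8_t *src, uint8_t *dst,
                    int width, int height, uint8_t thresh)
{
    int row;

    #pragma omp parallel for num_threads(NUM_WORKER_CORES)
    for (row = 0; row < height; row++) {
        const uint8_t *s = src + (size_t)row * width;
        uint8_t *d = dst + (size_t)row * width;

        for (int col = 0; col < width; col++)
            d[col] = (uint8_t)(s[col] > thresh ? 255 : 0);
    }
}
```

Row-wise partitioning is chosen here because most per-pixel OpenCV operations have no dependencies between rows, which keeps the parallel version simple to reason about.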
Examples / Demos
1) streamTest. The streamTest program handles multiple concurrent streams, applying per stream image analytics, H.264 compression and RTP streaming (packet egress on the card NIC). Here is an example streamTest command line with comments:
./streamTest -m0xff -f1400 -estream.out -cSIGC66XX-8 -s2 -i/home/Signalogic/video_files/parkrun_720p_50fps_420fmt.yuv -x1280 -y720 -r30 -D192.168.1.61:45056:60-af-6d-75-75-f1 -B1500000 -oparkrun_test.h264
The above command line runs in "continuous mode" and outputs both an RTP stream over the card NIC (the -D, or destination, command line option) and a compressed file (the -o command line option). Multiple streams can be specified by adding more instances to the command line.
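For readers scripting around the demo, here is a small sketch of how a destination string like the -D argument above could be split into its parts. The field layout assumed here (IPv4 address, port, MAC address, separated by colons, with the MAC written using hyphens) is inferred from the example command line and is not confirmed by the source; rtp_dest_t and parse_dest are hypothetical names and are not part of streamTest or DirectCore.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical container for a parsed destination. */
typedef struct {
    char     ip[16];   /* dotted-quad IPv4 address, NUL-terminated */
    unsigned port;     /* destination port */
    uint8_t  mac[6];   /* destination MAC address */
} rtp_dest_t;

/* Parse "a.b.c.d:port:xx-xx-xx-xx-xx-xx" (assumed layout).
   Returns 0 on success, -1 if the string does not match. */
int parse_dest(const char *s, rtp_dest_t *dest)
{
    unsigned m[6];
    int i;
    int n = sscanf(s, "%15[0-9.]:%u:%2x-%2x-%2x-%2x-%2x-%2x",
                   dest->ip, &dest->port,
                   &m[0], &m[1], &m[2], &m[3], &m[4], &m[5]);

    /* All eight fields must convert, and the port must fit in 16 bits. */
    if (n != 8 || dest->port > 65535)
        return -1;

    for (i = 0; i < 6; i++)
        dest->mac[i] = (uint8_t)m[i];
    return 0;
}
```

This sketch only checks the overall shape of the string; it does not validate that each octet of the IP address is in range.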
2) iaTest. The iaTest program handles multiple concurrent YUV video files, applying image analytics and storing output YUV data continuously to a file on HDD.
Here is an example iaTest command line with comments:
./iaTest -m1 -f1250 -eia.out -cSIGC66XX-8 -s0 -i/home/Signalogic/video_files/CCTV_640x360p_30fps_420fmt.yuv -x640 -y360 -r30 -l0x1100003 -occtv_test5.yuv
In the above command lines, -x and -y give the resolution, -r the frame rate, and -B the bitrate. Not shown are command line options to specify the codec profile, CBR vs. VBR, QP values, scan type (progressive or interlaced), and more.
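To put these options in perspective, the sketch below works out the raw data rate implied by a given resolution and frame rate, which can then be compared with the -B bitrate. It assumes 8-bit planar YUV 4:2:0 input, as suggested by the "420fmt" file names; the function names are hypothetical.

```c
#include <stdint.h>

/* Bytes in one 8-bit planar YUV 4:2:0 frame: a full-resolution luma plane
   plus two chroma planes at quarter resolution, i.e. 1.5 bytes per pixel. */
uint64_t yuv420_frame_bytes(uint32_t width, uint32_t height)
{
    return (uint64_t)width * height * 3 / 2;
}

/* Uncompressed bit rate for a stream at the given frame rate. */
uint64_t raw_bits_per_second(uint32_t width, uint32_t height, uint32_t fps)
{
    return yuv420_frame_bytes(width, height) * 8 * fps;
}
```

For the streamTest example (-x1280 -y720 -r30), one frame is 1,382,400 bytes and the raw rate is 331,776,000 bits per second, so -B1500000 corresponds to a compression ratio of roughly 221:1. For the iaTest example (-x640 -y360), one frame is 345,600 bytes.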
To obtain demo programs please send e-mail to info [at] signalogic [dot] com. After verifying that you have a supported c66x PCIe card we'll send a link to a secure page with demo binaries and an automated install script.
Here is a link to an example iaTest output .yuv file. In this example, one (1) c66x core is doing a Gaussian filter, detecting motion, finding contours, and annotating image frames with analysis stats in about 3x real-time.
A screen grab from a surveillance video tracking algorithm, implemented with c66x OpenCV, is shown below:
In this case, the algorithm is concurrently tracking all people in the video who have a backpack or are carrying something (e.g. a bag). Regions marked in green are "candidates", regions marked in other colors are rejected. Statistics are printed in each frame.
In an HPC server application, the objective is to run 100+ simultaneous video live feeds -- in real-time -- using a c66x accelerated server. This level of image analytics throughput is simply not possible without having 128 CPU cores or more available. Being able to connect the live feeds directly to the accelerator NICs is also an advantage, more so if the server is virtualized.
The DirectCore drivers and libs are fully virtualized, supporting the KVM Hypervisor and QEMU system emulator (tested on CentOS, Ubuntu, and Red Hat). In the above command lines, the "-8" suffix to the card designator requests 8 cores. Another, simultaneous host or VM instance can give a similar command line specifying "-N" cores, and that user would be allocated an additional N cores.
Below is a screen capture showing VM configuration for c66x accelerator cards:
In addition, this works across card boundaries, making transparent the number of cards installed in the server.
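The "unified pool" behavior can be pictured with a short sketch. This is a conceptual model only, not the DirectCore implementation or API: all names are hypothetical, 64-core cards are assumed throughout, and a simple sequential grant policy is assumed in place of whatever allocation strategy the drivers actually use.

```c
#include <stdbool.h>

#define CORES_PER_CARD 64  /* assumption: every installed card has 64 cores */

/* Hypothetical pool spanning all cards in the server. */
typedef struct {
    int total_cores;  /* cores across all installed cards */
    int next_free;    /* lowest global core index not yet granted */
} core_pool_t;

/* Hypothetical result of a request such as the "-8" suffix. */
typedef struct {
    int first_core;   /* global index of the first granted core */
    int num_cores;    /* number of cores granted */
} core_grant_t;

void pool_init(core_pool_t *pool, int num_cards)
{
    pool->total_cores = num_cards * CORES_PER_CARD;
    pool->next_free = 0;
}

/* Grant n cores to one host or VM instance.
   Returns false if fewer than n cores remain in the pool. */
bool pool_request(core_pool_t *pool, int n, core_grant_t *grant)
{
    if (n <= 0 || n > pool->total_cores - pool->next_free)
        return false;
    grant->first_core = pool->next_free;
    grant->num_cores = n;
    pool->next_free += n;
    return true;
}

/* Translate a global core index into a card number and a core on that card.
   Callers of pool_request never need this; it shows that a grant may
   straddle a card boundary without the requester being aware of it. */
void core_location(int global_core, int *card, int *core_on_card)
{
    *card = global_core / CORES_PER_CARD;
    *core_on_card = global_core % CORES_PER_CARD;
}
```

In this model, with two cards installed, a first instance requesting 8 cores receives global cores 0 to 7, and a second instance requesting 60 receives cores 8 to 67, the last four of which sit on the second card.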
It's worth noting that as a general rule, concurrent multi-user HPC VMs are difficult to implement in commodity boxes. For example, GPU technology requires time-slicing within GPU devices, while x86 cores alone simply do not provide enough horsepower, and VMs contend for network I/O resources. Combining TI and Intel CPU technology together in a complementary manner makes HPC VMs straightforward and extremely effective.
Embedded System Compatibility
For TI embedded systems customers, it's important to note the c66x OpenCV compute model described here is scalable down as well as up. For example, we have configured a dual-core Atom x86 motherboard (in mini-ITX form-factor) with Ubuntu and a half-length PCIe card (32 c66x cores) and verified the test programs work as-is. Including the 32-core card, the overall enclosure is about 8" x 8" x 3" (pictures here).
It's also conceivable to port DirectCore drivers and libs to ARM and run all software on one TI SoC.