Please note as of Wednesday, August 15th, 2018 this wiki has been set to read only. If you are a TI Employee and require Edit ability please contact x0211426 from the company directory.

Small AI server

From Texas Instruments Wiki

This wiki introduces small AI servers for automation, robotics, security, and IoT applications. Small AI servers are scaled down versions of cloud AI servers, able to perform "Local AI" processing at the edge of the network, independent of the cloud, in very small form-factors, and with very low power consumption. Included below is a description of a demo that provides local AI speech recognition to control privacy of home and business based products such as the Echo Dot.

The Need for Local AI

Edge and IoT AI applications face key problems:

  • Training and compression -- deep learning models must be trained and compressed, and inference fully tested, before running on the embedded system
  • Bandwidth -- multiple cameras, even with compressed video data, can exceed the modest upload speeds of many home and business Internet connections
  • Reliability -- Internet connection to the cloud may be unreliable or lost, with backup operation required
  • Privacy -- video and voice data should be streamed continuously to the cloud only as needed, avoiding "Always On" operation that may compromise privacy and security

Local AI can help solve these problems, but requires small, quiet, low power AI servers that are functionally similar to their cloud AI counterparts. A typical local AI server has the following characteristics:

  • Small size -- approximate dimensions of a large book (e.g. a dictionary)
  • Cloud server functional compatibility, including x86_64 Linux, x86 open source software, and high performance technologies such as DPDK and PCIe
  • High performance -- around 1 TFLOPS
  • Low power consumption -- under 100 W, with a minimal operating mode of 25 to 35 W

Cloud Compatibility -- Same, But Smaller

Basic problems with an "Always On" Internet connection are well recognized -- privacy, bandwidth limits, and unpredictable WiFi reliability are at the top of the list. But there is more to it than that. What many people don't realize is that the ARM chips inside mobile and consumer devices cannot run deep learning algorithms such as speech recognition, image analytics, and neural network classification -- or at best can run them only in limited form. These devices function by streaming raw data to the cloud, where inference occurs and results are sent back. To get some idea of where ARM devices sit on the AI processing scale: in some cases dozens or even hundreds of cloud servers, using 1000 or more Xeon x86 cores and GPU cores, might be involved in speech recognition for just one conversation. Consequently, deep learning models are first designed, trained, and tested / debugged on cloud servers.

This results in a need for small, unobtrusive local AI servers that can operate independently of the cloud, yet are functionally compatible with cloud servers. The process of porting AI software from cloud servers to local AI servers follows this sequence:

  • Deep learning models are trained and tested on cloud servers by matching the number of Atom x86 cores and coCPU cores available in the local AI server. As noted below, coCPU cores (on PCIe cards) can also be used in cloud servers, so training and testing can be precise and expected to match local AI server results
  • Deep learning models are compressed, using regularization techniques, and inference performance is fully characterized
  • After inference testing in the cloud, combined software is moved to the local AI server, without changes
  • Local AI server software may be augmented with WiFi, audio, video, and other modules required for local operation
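The compression step above can be sketched in code. The snippet below shows magnitude-based weight pruning, one common compression technique; it is illustrative only -- actual model compression also involves regularization during retraining and full re-characterization of inference, and the `prune_weights` function here is a hypothetical helper, not part of any framework named in this wiki.

```python
# Sketch of magnitude-based weight pruning: zero out the smallest-magnitude
# weights until a target fraction of the tensor is zero. Illustrative only.
import numpy as np

def prune_weights(weights, sparsity=0.5):
    """Return a copy of `weights` with the smallest-magnitude entries
    zeroed so that approximately `sparsity` fraction are zero."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 64))
    p = prune_weights(w, sparsity=0.75)
    print(f"sparsity achieved: {np.mean(p == 0):.2f}")
```

After pruning, inference performance must be fully characterized again (the second bullet above) before the combined software is moved to the local AI server.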

As this process must be fast and efficient, with possibly numerous iterations, adding intermediate steps to port to ARM or any other fundamentally different CPU and/or platform architecture adds costly and time-consuming effort.

High Performance Using coCPUs™

Unlike GPUs, DSPs, FPGAs, ASICs, and other specialized chips, coCPUs™ can be added to small form-factor servers such as mini-ITX to create cloud-independent, small, local AI servers with 30 to 40 total CPU cores, while maintaining cloud server compatibility and low power consumption. The local AI server described below has 32 c66x coCPU™ cores and two x86_64 cores.

coCPUs can also be used in the cloud to scale up Xeon x86 based commodity servers to have 100s of cores. The HPC wiki discusses this and shows examples of HP and Dell servers with 16 x86 cores and 128 c66x cores, providing 3.2 TFLOPS with 250 W additional power consumption.

From a concept point of view, combining x86 and c66x CPUs, and running the software components needed for AI applications (such as H.264 decode, OpenCV, and TensorFlow), amounts to another form of "AI Accelerator". The architecture described here favors fast, reliable development: mature semiconductors, tools, and support (TI and Intel), open source software, a standard server format, and a wide range of easy-to-use peripherals and storage.

Power Consumption

A small local AI server should be capable of both a minimal mode, around 25 to 35 W, and a high performance mode, up to 100 W. The mini-ITX example described below meets this objective.

Small Form Factor Example

The pictures below show a local AI server with:

  • Mini-ITX motherboard and case
  • Dual core Atom (C2358, 1.7 GHz), 4x GbE interfaces, 8 GB DDR3 mem, 1333 MHz
  • 32 coCPU cores (C6678, 1.6 GHz), GbE interface, 8 GB DDR3 mem, 1333 MHz, x8 PCIe
  • 4x USB interfaces
  • IPMI (dedicated GbE)
  • Audio I/O interface (via USB)
  • VGA optional display

 

Local AI server, mini-ITX enclosure, top view

Local AI server in mini-ITX enclosure, with 32 coCPU cores and dual-core Atom CPU. coCPU cores
and Atom CPU cores have independent GbE interfaces. Dimensions are approximately 8" x 9" x 3".
Power consumption from 35 W to 100 W depending on mode of operation. The case can sit flat or
stand vertically.

 

Small AI server, iso view with top of case removed

Local AI server in mini-ITX enclosure with case lid removed. PCIe card with 32 c66x coCPU cores
is installed with a standard PCIe riser. Dual-core Atom CPU and memory are just to the right of
the card. Note 100 W power supply at left side of the enclosure.

 

Small AI server, top view

Local AI server in mini-ITX enclosure with older D510 Atom CPU. Note the airflow situation, with
the coCPU card at the top of the enclosure and a path for Atom CPU airflow to the right of the
card.

 

Notes on Atom PCIe Performance

In addition to the C2358 CPU based system specified above, an Atom D510 based system has also been tested:

  • Dual core Atom (D510, 1.66 GHz), dual GbE interfaces, 2 GB DDR2 mem, 667 MHz
  • 32 coCPU cores (C6678, 1.6 GHz), GbE interface, 8 GB DDR3 mem, 1333 MHz, x8 PCIe
  • 6x USB interfaces
  • Audio I/O interface
  • VGA optional display

With Ubuntu 16.04 installed, the D510 proved to have insufficient PCIe performance, a limitation of the ICH8M bridge chip required to provide its PCIe connectivity. Due to this result, only Atom CPUs with on-chip PCIe are recommended. This forum thread has more information.

Ultra Small Form-Factors

Combining a nano-ITX Atom motherboard with c66x coCPUs on a mini-PCIe card would produce a box with approximate dimensions 6" x 4" x 2". Such a mini-PCIe card is not currently available, but potentially could be manufactured by companies such as Advantech and Z3 Technologies, who already have c66x hardware products.

Embedded AI Comparison

The advantages of a small Atom-based server with 32 to 64 coCPU cores are compelling: higher performance, the ability to run cloud software immediately with no code rewrites, and a wide range of easy-to-use, low-cost peripherals and storage options (i.e. like any other server). The disadvantage is power consumption. Even though this approach consumes relatively low power, there are emerging, dedicated "embedded AI" boxes with limited performance, typically containing numerous ARM cores, that can reduce power consumption to the 20 to 30 W range and potentially to under 10 W.

Achieving portability in the 30 to 75 W range requires a sizable lithium battery, which increases overall solution weight. A typical "generator style" battery available online provides 100 W over 41 amp-hours, with dimensions 6" x 3" x 7" and a weight of 3.3 lb. As always, such SWaP tradeoffs depend on application requirements.

Demo: Privacy Control for the Echo Dot Using Two-Level Wakeword

To maintain home privacy, and avoid "always on" streaming of all conversation to the cloud, the local AI server performs speech recognition to switch AC power to the Echo Dot on and off. When the Dot is powered on, it uses cloud AI for normal operation. The first level wakeword should be a somewhat longer phrase unlikely to occur in ordinary conversation or during normal use of the Dot, for example "Alexa wake up". Here is a diagram showing the setup:

 

Setup diagram for demo showing two-level wakeword for Echo Dot

 

Notes about the demo:

1. Controlling either the microphone or power to the device-under-control (DUC) guarantees privacy; other methods, such as manually controlling the mute button, are prone to human error. A patent-pending method controls the DUC microphone by masking it with white noise generated by the local AI server. The power-control based approach has an intrinsic problem: the boot time of the DUC. After saying the first level wakeword, the user has to wait for the DUC to boot before giving the second level wakeword (for example, "Alexa"). With the Amazon Echo Dot, boot time can take upwards of 1 minute. The microphone control method allows DUC power to remain "always on" and is considered more user-friendly.

2. The range and accuracy of speech recognition in the local AI server are far less than what the cloud can do, but are sufficient for the purpose. For this demo the local AI server only needs to recognize the first level wakeword, although additional speech recognition could be added.
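The demo's control flow can be sketched as a simple loop: power the DUC on when the first level wakeword is recognized, and power it back off after a period of inactivity. The `recognize_wakeword` and `set_duc_power` callables below are hypothetical placeholders for the local AI server's speech recognition engine and AC power switch, and the idle timeout value is illustrative, not from the demo.

```python
# Illustrative control loop for the two-level wakeword demo. The recognizer
# and power-switch functions are hypothetical placeholders.
import time

FIRST_LEVEL_WAKEWORD = "alexa wake up"
IDLE_TIMEOUT_SEC = 300  # illustrative: power the DUC off after 5 min idle

def control_loop(recognize_wakeword, set_duc_power, now=time.monotonic):
    """Power the device-under-control (DUC) on when the first level
    wakeword is heard; the DUC's own cloud service handles the second
    level wakeword ("Alexa") once the DUC has booted."""
    powered_on = False
    last_active = now()
    while True:
        phrase = recognize_wakeword()  # blocks until a phrase is decoded
        if phrase is None:
            break  # audio stream ended
        if phrase == FIRST_LEVEL_WAKEWORD and not powered_on:
            set_duc_power(True)   # user then waits for DUC boot (~1 min)
            powered_on = True
            last_active = now()
        elif powered_on and now() - last_active > IDLE_TIMEOUT_SEC:
            set_duc_power(False)  # restore privacy when idle
            powered_on = False
```

Note the boot-time problem from note 1 above is visible here: after `set_duc_power(True)` the user must wait for the DUC to boot, which the microphone-masking method avoids.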

Software

As with a cloud AI server, local AI servers can run image analytics software such as OpenCV and deep learning / neural network classification software such as TensorFlow, AlexNet, and GoogLeNet. Currently OpenCV v2.4 has been ported to c66x coCPU cores.
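As a stand-in for the kind of image analytics these packages provide, the sketch below implements simple frame differencing for motion detection using only NumPy; it is not OpenCV code, and the threshold values are illustrative.

```python
# Minimal frame-differencing motion detector, a NumPy stand-in for the
# OpenCV-style image analytics described above. Thresholds are illustrative.
import numpy as np

def motion_mask(prev_frame, frame, threshold=25):
    """Boolean mask of pixels that changed by more than `threshold`
    between two 8-bit grayscale frames."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold

def motion_detected(prev_frame, frame, min_changed_fraction=0.01):
    """True if more than `min_changed_fraction` of pixels changed."""
    return np.mean(motion_mask(prev_frame, frame)) > min_changed_fraction
```

On the architecture described here, per-frame work of this kind is the sort of load that would be distributed across c66x coCPU cores, with x86 cores handling orchestration.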

Architecture Diagram

Below is an architecture diagram showing image analytics and AI processing performed by x86 CPUs and c66x coCPUs.

Image analytics software architecture diagram

Software Model

Below is a diagram showing the general coCPU software model for AI and other applications. Notes about this diagram:

  • Application complexity increases from left to right (command line, open source library APIs, user code APIs, heterogeneous programming)
  • All application types can run concurrently in host or VM instances (see below for VM configuration)
  • c66x CPUs can make direct DMA access to host memory, facilitating use of DPDK
  • c66x CPUs are connected directly to the network. Received packets are filtered by UDP port and distributed to c66x cores at wire speed
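The UDP-port-based distribution in the last bullet can be illustrated in software. In the sketch below each simulated core owns one UDP port, and a sender selects the target core simply by choosing the destination port; on the c66x hardware this filtering happens at wire speed in the network path, so this host-side Python code only illustrates the concept, and the port numbers are arbitrary.

```python
# Software sketch of UDP-port-based packet distribution: core i listens on
# base_port + i, so the destination port alone selects the target core.
import socket

def make_core_sockets(base_port, num_cores, host="127.0.0.1"):
    """Bind one UDP socket per simulated core."""
    socks = []
    for i in range(num_cores):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind((host, base_port + i))
        s.settimeout(1.0)
        socks.append(s)
    return socks

def send_to_core(base_port, core_index, payload, host="127.0.0.1"):
    """Dispatch a packet to a core by choosing its UDP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, (host, base_port + core_index))
```

This mirrors the model above: no per-packet host-side demultiplexing is needed, because the port number itself routes the data.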

 

HPC software model diagram

 

The host memory DMA capability is also used to share data between c66x CPUs, for example in an application such as H.265 (HEVC) encoding, where 10s of cores must work concurrently on the same data set.

Additional Demo Programs

Application test and demo programs are available and described in detail on other wikis in the cloud and server HPC series.

Concurrent Applications and Virtualized Environment

A software stack including drivers, libraries, and virtIO components integrates coCPUs under Linux, whether in local or cloud AI servers. In a bare-metal environment, concurrent applications are supported. In a KVM + QEMU virtualized environment, cores and network I/O interfaces appear as resources that can be allocated between VMs. VM and host users can also share cores, as the available pool of cores is managed by a physical-layer back-end driver. This flexibility allows AI and HPC applications to scale between cloud, enterprise, and remote vehicle/location servers.

Host and Guest Drivers

DirectCore drivers interact with c66x cards from either host instances or VMs. Host instances use a "physical" driver and VM instances use virtIO "front end" drivers.

Host and Guest Libs

DirectCore libraries provide a high level API for applications. DirectCore libraries abstract all c66x cores as a unified "pool" of cores, allowing multiple users / VM instances to share c66x resources, including NICs. This applies regardless of the number of cards installed in the server. This page has DirectCore API and source code examples.
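The "unified pool" abstraction can be modeled as below: cores from any number of cards go into one pool, and users or VM instances allocate from it without knowing which physical card serves them. The class and method names here are hypothetical illustrations, not the actual DirectCore API.

```python
# Illustrative model of a unified core pool spanning multiple coCPU cards.
# Hypothetical names -- not the DirectCore API.
class CorePool:
    def __init__(self, cards):
        # cards: core count per card, e.g. [32, 64] for mixed cards
        self.free = [(card, core)
                     for card, n in enumerate(cards)
                     for core in range(n)]
        self.allocated = {}  # user -> list of (card, core) tuples

    def allocate(self, user, num_cores):
        """Hand `num_cores` cores to `user`, drawing from any card."""
        if num_cores > len(self.free):
            raise RuntimeError("not enough free cores in pool")
        cores, self.free = self.free[:num_cores], self.free[num_cores:]
        self.allocated.setdefault(user, []).extend(cores)
        return cores

    def release(self, user):
        """Return all of `user`'s cores to the pool."""
        self.free.extend(self.allocated.pop(user, []))
```

An allocation larger than any single card is satisfied transparently from multiple cards, which is the behavior the "mixed and matched" card configuration below relies on.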

Installing / Configuring VMs

Below is a screen capture showing VM configuration for c66x coCPU cards, using the Ubuntu Virtual Machine Manager (VMM) user interface:

VMM dialog showing VM configuration for c66x coCPU cards

c66x core allocation is transparent to the number of cards installed in the system; just like installing memory DIMMs of different sizes, c66x cards can be mixed and matched.