AM335x Crypto Performance






This page is under temporary revision

Introduction

Starting with SDK 5.05.00.00, cryptographic acceleration is available for AM335x. The Linux drivers are included in the SDK's pre-built kernel and need no additional setup. The OpenSSL demos under Matrix are automatically accelerated by the available crypto hardware module. The benchmarks on this page were performed at the 800MHz OPP frequency and the 1GHz OPP frequency.


Crypto Performance

This page is intended to provide a summary of cryptographic performance data with AM335x.  It is not a comprehensive list of cryptographic functions but rather a comparison of the functions that are hardware accelerated vs software only cryptographic performance.

There are two aspects of performance related to hardware accelerated cryptography. 

  • Throughput - The overall speed at which the calculation is performed, measured in kB/sec.
  • CPU bandwidth - The percentage of Cortex A8 CPU usage needed to perform the function.


DMCrypt Installation

To perform several of these benchmarks, DMCrypt functionality is required through the userspace tool cryptsetup. DMCrypt is a disk encryption subsystem in the Linux kernel; it is used here to test the kernel's use of the CryptoAPI.

A PDF describing step by step cross compilation and installation instructions for DMCrypt is shown below:

[Installing and Enabling DMCrypt]

AM335x 800MHz OPP Frequency

OpenSSL Performance

Each of these benchmarks was performed using SDK 5.07.00.00 at a CPU clock speed of 800MHz and a DDR3 clock speed of 303 MHz. The command used to run each benchmark is listed above its table.


time -v openssl speed -elapsed -evp aes-128-cbc

Algorithm: aes-128-cbc

Data Block Size (bytes)   With HW Acceleration (kB/sec)   Without HW Acceleration (kB/sec)
16                        503.88                          879.08
64                        1997.67                         2793.28
256                       6234.37                         6341.89
1024                      13145.77                        9264.47
8192                      24223.74                        10742.44
CPU Usage                 41%                             68%
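
The table shows the software path winning at small block sizes and the hardware path winning at large ones. As a rough illustration, the sketch below computes the hardware-to-software throughput ratio for each block size and reports the first block size at which hardware acceleration is faster. The function names are made up for this example; the numbers are copied from the aes-128-cbc table above.

```shell
#!/bin/sh
# Throughput values (kB/sec) copied from the aes-128-cbc table above:
# block size, with HW acceleration, without HW acceleration.
aes128_800mhz() {
cat <<'EOF'
16 503.88 879.08
64 1997.67 2793.28
256 6234.37 6341.89
1024 13145.77 9264.47
8192 24223.74 10742.44
EOF
}

# Print "block_size ratio", where ratio = HW throughput / software throughput.
speedup() {
    awk '{ printf "%s %.2f\n", $1, $2 / $3 }'
}

# Print the smallest block size at which the HW path is faster.
crossover() {
    awk '$2 > $3 { print $1; exit }'
}

aes128_800mhz | speedup
echo "HW acceleration first wins at block size: $(aes128_800mhz | crossover)"
```

A ratio below 1 means the per-request overhead of the hardware path outweighs its speed for that block size.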


time -v openssl speed -elapsed -evp aes-192-cbc

Algorithm: aes-192-cbc

Data Block Size (bytes)   With HW Acceleration (kB/sec)   Without HW Acceleration (kB/sec)
16                        503.41                          841.27
64                        1987.71                         2642.50
256                       6275.67                         5709.82
1024                      13272.06                        8118.95
8192                      24455.85                        9268.74
CPU Usage                 41%                             68%


time -v openssl speed -elapsed -evp aes-256-cbc

Algorithm: aes-256-cbc

Data Block Size (bytes)   With HW Acceleration (kB/sec)   Without HW Acceleration (kB/sec)
16                        498.57                          830.55
64                        1987.99                         2523.03
256                       6311.94                         5177.17
1024                      13282.65                        7189.85
8192                      24376.66                        8129.19
CPU Usage                 41%                             69%


OpenSSL Performance VS CPU Usage

The performance of the software and hardware crypto engines is partially related to the CPU usage allotted to the crypto process. To demonstrate OpenSSL performance across a wide range of CPU usage percentages, the Linux scheduling program "nice" is used. Nice allows the user to specify the priority of a process relative to others. Since nice is not included in SDK 5.07.00.00, coreutils must be cross-compiled to perform this benchmark. Note that this cross-compilation assumes the dependencies needed to compile coreutils are already installed. See the coreutils documentation for more information on the dependencies required for compilation.

export PATH="<SDK INSTALL DIR>/linux-devkit/bin:$PATH"
source <SDK INSTALL DIR>/linux-devkit/environment-setup
cd <SDK install path>/linux-devkit/arm-arago-linux-gnueabi/
wget http://ftp.gnu.org/gnu/coreutils/coreutils-8.21.tar.xz
tar -xJf coreutils-8.21.tar.xz
cd coreutils-8.21
./configure --host=arm-arago-linux-gnueabi --prefix=<mount-point of sd-card root> 
cp -v Makefile{,.orig}
sed -e 's/^#run_help2man\|^run_help2man/#&/' \
  -e 's/^\##run_help2man/run_help2man/' Makefile.orig > Makefile
make
cp ./src/nice <mount-point of sd-card root>/usr/bin
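
Before moving on, it may be worth confirming on the target that the newly installed nice behaves as expected. The following is a minimal check, relying on the fact that nice prints the current niceness when run without arguments. It only raises the niceness, which needs no special privileges; the negative values used by the benchmark require root.

```shell
#!/bin/sh
# With no arguments, nice prints the niceness it is currently running at.
base=$(nice)

# Launch a child five levels "nicer" and let it report its own niceness.
child=$(nice -n 5 nice)

echo "current niceness: $base, child niceness: $child"
```

The child value should be five higher than the current one, capped at 19.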

Once this is performed, save the following code as the file "cpu_nice_benchmark" and chmod it as executable (chmod +x cpu_nice_benchmark)

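#!/bin/bash
# Run "openssl speed" once at every nice level and log the raw results.
rm -rf benchmark_data
echo "Enter the cipher you wish to test and press enter (ex: aes-128-cbc)"
read cipher
: ${cipher:="aes-128-cbc"}   # default when nothing is entered
for ((i=-20;i<21;i++));
do
          echo "OpenSSL Nice: $i"
          # Keep only the throughput line (starts with the cipher name) and
          # the "Percent of CPU" line printed by time -v (ends with %).
          nice -n $i time -v openssl speed -elapsed -evp $cipher  2>&1 | grep "^$cipher\|%$"  >> benchmark_data
done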

This script tests OpenSSL crypto performance across the range of nice priority levels and writes the results in a raw, unparsed format to the file benchmark_data in the same folder. The loop requests values from -20 to 20; Linux niceness tops out at 19, so the final request is clamped to 19.
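
To make the raw format concrete, the sketch below runs the script's grep filter over a few illustrative lines. The line layout follows what openssl speed and time -v normally print, and the numbers are the hardware-accelerated aes-128-cbc figures from the table earlier on this page. The function names are made up for the example.

```shell
#!/bin/sh
# Keep only the openssl result line for the given cipher and the
# "Percent of CPU" line from time -v (the line that ends in %).
filter_benchmark_output() {
    grep "^$1\|%\$"
}

# Illustrative input, not a captured log.
sample_output() {
cat <<'EOF'
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc        503.88k     1997.67k     6234.37k    13145.77k    24223.74k
    Command being timed: "openssl speed -elapsed -evp aes-128-cbc"
    Percent of CPU this job got: 41%
EOF
}

sample_output | filter_benchmark_output aes-128-cbc
```

Each benchmark run therefore contributes two lines to benchmark_data: the throughput line followed by the CPU usage line.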

Afterward, to parse this data into a format that is easy to paste into a spreadsheet program, save the following code as the file "parse_cpu_benchmarks" and make it executable (chmod +x parse_cpu_benchmarks). Running this script writes the parsed data to the file "parsed_data" in the same directory.

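#!/bin/bash
# Convert benchmark_data (a throughput line followed by a CPU line for each
# nice level) into tab-separated rows in parsed_data.
linenumber=`wc -l benchmark_data | awk '{print $1}'`
tabs 11
rm -f parsed_data
for ((i=1;i<linenumber;i=i+2));
do
          # Line i+1 holds "Percent of CPU this job got: NN%"; field 7 is NN%.
          cpu=`head -n $((i+1)) benchmark_data | tail -1 | awk '{print $7}' |  sed 's/[^0-9.]//g'`
          echo -e -n $cpu "\t" >> parsed_data
          # Line i holds the cipher name followed by five throughput figures.
          for ((j=2;j<7;j++));
          do
                    throughput=`head -n $i benchmark_data | tail -1 | awk -v k=$j '{print $k}' |  sed 's/[^0-9.]//g'`
                    echo -e -n $throughput "\t" >> parsed_data
          done
          echo -e -n '\n' >> parsed_data
done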
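
The same pairing of lines can also be done in a single awk pass, which avoids rereading the file for every field. This is an alternative sketch rather than the script used for the published charts: parse_pairs is a name chosen for the example, it takes the last field of the CPU line instead of field 7, and it does not emit the trailing tab that the script above leaves on each row.

```shell
#!/bin/sh
# Read benchmark_data-style lines on stdin: a throughput line followed by a
# CPU usage line, repeated. Emit one tab-separated row per pair: the CPU
# percentage, then the five throughput figures, with "k" and "%" stripped.
parse_pairs() {
    awk '
        NR % 2 == 1 {                  # throughput line: fields 2 to 6
            row = ""
            for (j = 2; j <= 6; j++) {
                v = $j
                gsub(/[^0-9.]/, "", v)
                row = row "\t" v
            }
        }
        NR % 2 == 0 {                  # CPU line: last field, e.g. "41%"
            cpu = $NF
            gsub(/[^0-9.]/, "", cpu)
            print cpu row
        }
    '
}

# Demonstration on one pair of lines built from the aes-128-cbc table.
printf '%s\n' \
    'aes-128-cbc 503.88k 1997.67k 6234.37k 13145.77k 24223.74k' \
    '    Percent of CPU this job got: 41%' | parse_pairs
```

On the target it would be used as: parse_pairs < benchmark_data > parsed_data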

The format of the parsed benchmark file will be as shown:

OpenSSL CPU Usage (%)   16 Byte Block Size   64 Byte Block Size   256 Byte Block Size   1024 Byte Block Size   8192 Byte Block Size

Finally, the nice values only have a significant effect when another process is competing for the CPU, so start the following busy loop in the background before running the benchmark:

while true; do true; done&
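
The busy loop keeps running until it is killed, so it should be stopped once the benchmark finishes. One way to do that, sketched here with a variable name chosen for the example, is to record its process ID when it is started:

```shell
#!/bin/sh
# Start the competing busy loop and remember its process ID.
while true; do true; done &
busy_pid=$!

# ... run ./cpu_nice_benchmark here ...

# Stop the busy loop once the benchmark has finished.
kill "$busy_pid"
wait "$busy_pid" 2>/dev/null || true
echo "busy loop $busy_pid stopped"
```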


The following are charts produced from the previous benchmarking utilities:

OPENSSL NO HWA.png


OPENSSL DMCRYPT NO HWA2.png

These graphs can be used to compare HWA performance with software crypto performance at various CPU usage percentages. An example comparison is shown below:

HWA VS NOHWA.png

DM_Crypt Performance

Each of these benchmarks was performed using SDK 5.07.00.00 at a CPU clock speed of 800MHz and a DDR3 clock speed of 303 MHz. To prevent peripheral bus speed bottlenecks from affecting the benchmarks, the 16MB /dev/ram0 device was used as an encrypted partition. Hdparm was used to measure the cached read rate and buffered disk read rate for the encrypted device. The commands used to run each benchmark are listed above its table.


cryptsetup --cipher aes-cbc-null --key-size 128 luksFormat /dev/ram0
cryptsetup luksOpen /dev/ram0 enc-pv
mke2fs -T ext2 /dev/mapper/enc-pv
hdparm -tT /dev/mapper/enc-pv

No Crypto Hardware Acceleration Drivers

Measurement                           Trial One   Trial Two   Trial Three   Average
Timing cached reads (MB/sec)          128.96      134.61      134.32        132.63
Timing buffered disk reads (MB/sec)   8.37        8.38        8.35          8.37
CPU Usage (%)                         11          11          11            11



modprobe omap4-sham
modprobe omap4-aes
cryptsetup --cipher aes-cbc-null --key-size 128 luksFormat /dev/ram0
cryptsetup luksOpen /dev/ram0 enc-pv
mke2fs -T ext2 /dev/mapper/enc-pv
hdparm -tT /dev/mapper/enc-pv

With Crypto Hardware Acceleration Drivers

Measurement                           Trial One   Trial Two   Trial Three   Average
Timing cached reads (MB/sec)          142.2       138.94      131.65        137.60
Timing buffered disk reads (MB/sec)   11.85       11.92       11.77         11.85
CPU Usage (%)                         11          11          11            11
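
The averages in the two tables above, and the relative gain in buffered disk reads, can be reproduced with a few lines of shell and awk. This is only a worked calculation on the published trial values; the helper names are made up for the example.

```shell
#!/bin/sh
# Print the mean of the arguments to two decimal places.
mean() {
    printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.2f\n", s / NR }'
}

# Print the percentage gain of $2 over $1 to one decimal place.
gain_percent() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new / base - 1) * 100 }'
}

# Buffered disk read trials (MB/sec) from the two tables above.
sw=$(mean 8.37 8.38 8.35)
hw=$(mean 11.85 11.92 11.77)
echo "software: $sw MB/sec, hardware: $hw MB/sec, gain: $(gain_percent "$sw" "$hw")%"
```

CPU usage is the same 11% in both tables, so the gain in buffered read rate comes at no extra CPU cost in this test.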


OpenSSL and DM_Crypt Concurrent Performance

Each of these benchmarks was performed using SDK 5.07.00.00 at a CPU clock speed of 800MHz and a DDR3 clock speed of 303 MHz. To prevent peripheral bus speed bottlenecks from affecting the benchmarks, the 16MB /dev/ram0 device was used as an encrypted partition. These tests show OpenSSL benchmark results while a file is being written concurrently to an encrypted DM_Crypt partition. The commands used to run each benchmark are listed above its table.


cryptsetup --cipher aes-cbc-null --key-size 128 luksFormat /dev/ram0							
cryptsetup luksOpen /dev/ram0 enc-pv							
mke2fs -T ext2 /dev/mapper/enc-pv							
mount /dev/mapper/enc-pv /mnt							
cd ~							
dd if=/dev/zero of=file.txt bs=1048576 count=14							
./infinite_loop&							
openssl *command*

Where infinite_loop is a shell script containing the following:

#!/bin/bash      
while :      
do      
        cp -f ~/file.txt /mnt      
done 

Note that file.txt is a fourteen-megabyte file, so it fits snugly into the sixteen-megabyte RAM partition.
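
The sizes involved can be checked with shell arithmetic. The sketch below uses only the dd parameters and the ramdisk size quoted above; the variable names are chosen for the example.

```shell
#!/bin/sh
# Size of the file written by: dd if=/dev/zero of=file.txt bs=1048576 count=14
file_bytes=$((1048576 * 14))
# Size of the 16MB /dev/ram0 device used for the encrypted partition.
ram_bytes=$((16 * 1024 * 1024))

echo "file:     $file_bytes bytes"
echo "ramdisk:  $ram_bytes bytes"
echo "headroom: $(( (ram_bytes - file_bytes) / 1024 )) KiB before LUKS and ext2 overhead"
```

Part of that headroom is taken up by the LUKS header and the ext2 metadata, which is why the file cannot be much larger.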


time -v openssl speed -elapsed -evp aes-128-cbc

Algorithm: aes-128-cbc

Data Block Size (bytes)   With HW Acceleration (kB/sec)   Without HW Acceleration (kB/sec)
16                        270.19                          449.59
64                        1171.88                         1416.23
256                       2969.34                         3122.77
1024                      9216.34                         4378.62
8192                      21703.34                        6148.08
CPU Usage                 30%                             37%


time -v openssl speed -elapsed -evp aes-192-cbc

Algorithm: aes-192-cbc

Data Block Size (bytes)   With HW Acceleration (kB/sec)   Without HW Acceleration (kB/sec)
16                        291.44                          444.93
64                        1274.99                         1351.79
256                       3308.89                         2804.91
1024                      8910.17                         3763.88
8192                      20698.45                        4317.18
CPU Usage                 32%                             36%


time -v openssl speed -elapsed -evp aes-256-cbc

Algorithm: aes-256-cbc

Data Block Size (bytes)   With HW Acceleration (kB/sec)   Without HW Acceleration (kB/sec)
16                        316.41                          440.62
64                        1137.58                         1296.26
256                       3275.78                         2518.67
1024                      8870.57                         3352.66
8192                      20821.33                        3929.43
CPU Usage                 32%                             36%


OpenSSL and DM_Crypt Concurrent Performance VS CPU Usage

Using the procedures documented in the section "OpenSSL Performance VS CPU Usage" and the very active DM_Crypt partition described in the section "OpenSSL and DM_Crypt Concurrent Performance," one can produce the following charts describing the crypto performance of OpenSSL at various CPU usages while other programs are attempting to use the crypto facilities.


OPENSSL DMCRYPT NO HWA3.png


OPENSSL DMCRYPT HWA2.png

Multithreaded OpenSSL and DM_Crypt Concurrent Performance VS CPU Usage

The performance of the software and hardware crypto engines is partially related to the CPU usage allotted to the crypto process. To demonstrate OpenSSL performance across a wide range of CPU usage percentages, the Linux scheduling program "nice" is used. Nice allows the user to specify the priority of a process relative to others. Since nice is not included in SDK 5.07.00.00, coreutils must be cross-compiled to perform this benchmark. See the section "OpenSSL Performance VS CPU Usage" above for information related to this task. In addition, a more complete version of "ps" is required to obtain CPU usage, since the program "time" is incompatible with the child processes created when openssl tests multiple threads. The instructions for cross-compiling "procps", which contains ps, are included below.

The overall test case discussed below involves an eight-threaded OpenSSL benchmark running at the same time as an eight-threaded copy command to a DM_Crypt encrypted RAM partition.

First, download the beta build of procps from http://procps.cvs.sourceforge.net/procps/procps/

Untar it into the folder <SDK install path>/linux-devkit/arm-arago-linux-gnueabi/


export PATH="<SDK INSTALL DIR>/linux-devkit/bin:$PATH"
source <SDK INSTALL DIR>/linux-devkit/environment-setup
cd <SDK install path>/linux-devkit/arm-arago-linux-gnueabi/<newly extracted procps folder>
make
cp -f ./ps/ps <mount-point of sd-card root>/usr/bin
cp -f ./proc/libproc-3.2.8.so <mount-point of sd-card root>/usr/lib

Boot the ARM device and cd into ~.
Once this is done, save the following code as the file "cpu_multithread" and make it executable (chmod +x cpu_multithread).

#!/bin/bash
rm -rf benchmark_data

echo "Enter the cipher you wish to test and press enter (ex: aes-128-cbc)"
read cipher
: ${cipher:="aes-128-cbc"}   # default when nothing is entered

for ((i=-20;i<21;i++));
do
        echo "OpenSSL Nice: $i"
        # Run the eight-process benchmark in the background, logging only its result line.
        nice -n $i openssl speed -multi 8 -elapsed -evp $cipher  2>&1 | grep "^evp" >> benchmark_data&
        APID=$!

        # Sample CPU usage four times while the benchmark runs. Each sample is
        # the %CPU column of the third line of the CPU-sorted ps listing.
        string=`nice -n $i ps -p ${APID}`
        if [ "$string" ]; then
                cpu1=`nice -n $i ps -eo pcpu,pid,user,args | sort -k1 -r | head -3 | tail -1 | awk '{print $1 }'`
        fi
        string=`nice -n $i ps -p ${APID}`
        if [ "$string" ]; then
                cpu2=`nice -n $i ps -eo pcpu,pid,user,args | sort -k1 -r | head -3 | tail -1 | awk '{print $1 }'`
        fi
        string=`nice -n $i ps -p ${APID}`
        if [ "$string" ]; then
                cpu3=`nice -n $i ps -eo pcpu,pid,user,args | sort -k1 -r | head -3 | tail -1 | awk '{print $1 }'`
        fi
        string=`nice -n $i ps -p ${APID}`
        if [ "$string" ]; then
                cpu4=`nice -n $i ps -eo pcpu,pid,user,args | sort -k1 -r | head -3 | tail -1 | awk '{print $1 }'`
        fi
        wait $APID
        # Log the average of the last two samples below the throughput line.
        { echo scale=3; echo "($cpu3 + $cpu4) / 2"; } | bc >> benchmark_data
done

This script tests OpenSSL crypto performance across the range of nice priority levels and writes the results in a raw, unparsed format to the file benchmark_data in the same folder. The loop requests values from -20 to 20; Linux niceness tops out at 19, so the final request is clamped to 19.
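
One detail of the script is worth spelling out: when a pipeline is sent to the background, $! holds the process ID of the last command in the pipeline, so APID refers to grep rather than to openssl. Waiting on it still works, because grep exits only once its input closes. The sketch below shows the same behavior with stand-in commands; sleep and cat are placeholders and not part of the benchmark, and the lookup through /proc assumes a Linux system.

```shell
#!/bin/sh
# Stand-in pipeline: sleep plays the role of openssl, cat the role of grep.
sleep 3 | cat &
pipeline_pid=$!

# Give the shell a moment to start both commands, then look up which
# command the saved PID belongs to.
sleep 1
if [ -r "/proc/$pipeline_pid/comm" ]; then
    last_cmd=$(cat "/proc/$pipeline_pid/comm")
else
    last_cmd=$(ps -o comm= -p "$pipeline_pid" | tr -d ' ')
fi
echo "\$! refers to: $last_cmd"

wait "$pipeline_pid"
```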

Afterward, to parse this data into a format that is easy to paste into a spreadsheet program, save the following code as the file "parse_benchmark_data_multi" and make it executable (chmod +x parse_benchmark_data_multi). Running this script writes the parsed data to the file "parsed_data" in the same directory.

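#!/bin/bash
# Convert benchmark_data (a throughput line followed by an averaged CPU
# figure for each nice level) into tab-separated rows in parsed_data.
linenumber=`wc -l benchmark_data | awk '{print $1}'`
tabs 11
rm -f parsed_data
for ((i=1;i<linenumber;i=i+2));
do
        # Line i+1 holds the averaged CPU usage written by bc.
        cpu=`head -n $((i+1)) benchmark_data | tail -1 | awk '{print $1}' |  sed 's/[^0-9.]//g'`
        echo -e -n $cpu "\t" >> parsed_data
        # Line i holds the "evp" label followed by five throughput figures.
        for ((j=2;j<7;j++));
        do
                throughput=`head -n $i benchmark_data | tail -1 | awk -v k=$j '{print $k}' |  sed 's/[^0-9.]//g'`
                echo -e -n $throughput "\t" >> parsed_data
        done
        echo -e -n '\n' >> parsed_data
done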

The format of the parsed benchmark file will be as shown:

OpenSSL CPU Usage (%)   16 Byte Block Size   64 Byte Block Size   256 Byte Block Size   1024 Byte Block Size   8192 Byte Block Size

To create eight DM_Crypt copy threads in the background, create a shell script called "infinite_loop_8" containing the following:

#!/bin/bash
while :
do
        cp -f ~/1mb /mnt/1mb1&
        cp -f ~/1mb /mnt/1mb2&
        cp -f ~/1mb /mnt/1mb3&
        cp -f ~/1mb /mnt/1mb4&
        cp -f ~/1mb /mnt/1mb5&
        cp -f ~/1mb /mnt/1mb6&
        cp -f ~/1mb /mnt/1mb7&
        cp -f ~/1mb /mnt/1mb8&

        wait
done
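
The eight cp lines can also be generated by a loop, which makes the number of parallel copies a parameter. The following is an illustrative variant, not the script used for the published charts; copy_pass and its arguments are names chosen for the example, and the demonstration runs in a scratch directory with a small file so it can be tried off-target.

```shell
#!/bin/sh
# Start COUNT parallel copies of SRC into DEST_DIR, named after SRC plus an
# index (1mb1, 1mb2, ...), then wait for all of them to finish.
copy_pass() {
    src=$1
    dest_dir=$2
    count=$3
    n=1
    while [ "$n" -le "$count" ]; do
        cp -f "$src" "$dest_dir/$(basename "$src")$n" &
        n=$((n + 1))
    done
    wait
}

# Demonstration in a scratch directory. The source file here is only 4 KiB;
# on the target it would be the 1 MB file created with dd.
scratch=$(mktemp -d)
dd if=/dev/zero of="$scratch/1mb" bs=1024 count=4 2>/dev/null
mkdir "$scratch/mnt"
copy_pass "$scratch/1mb" "$scratch/mnt" 8
ls "$scratch/mnt"
```

On the target, the equivalent of infinite_loop_8 would be: while :; do copy_pass ~/1mb /mnt 8; done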

Now to start the overall benchmarks, create a DM_Crypt encrypted ram partition and run the previous scripts as shown in the following:

cryptsetup --cipher aes-cbc-null --key-size 128 luksFormat /dev/ram0
cryptsetup luksOpen /dev/ram0 enc-pv
mke2fs -T ext2 /dev/mapper/enc-pv
mount /dev/mapper/enc-pv /mnt
cd ~
rm -rf /mnt/*
dd if=/dev/zero of=1mb bs=1048576 count=1
./infinite_loop_8&
./cpu_multithread
./parse_benchmark_data_multi
cat parsed_data

The following graphs were produced from implementing the above procedure:

OPENSSL DMCRYPT NOHWA 800mhz3.png

OPENSSL DMCRYPT WHWA 800mhz.png



AM335x 1GHz OPP Frequency

OpenSSL Performance

Each of these benchmarks was performed using SDK 6.00.00.00 at a CPU clock speed of 1GHz and a DDR3 clock speed of 400 MHz. The command used to run each benchmark is listed above its table.


time -v openssl speed -elapsed -evp aes-128-cbc

Algorithm: aes-128-cbc

Data Block Size (bytes)   With HW Acceleration (kB/sec)   Without HW Acceleration (kB/sec)
16                        811.31                          1536.68
64                        3261.95                         4623.62
256                       8151.38                         9577.64
1024                      19412.65                        13022.21
8192                      32702.46                        14688.26
CPU Usage                 46%                             83%


time -v openssl speed -elapsed -evp aes-192-cbc

Algorithm: aes-192-cbc

Data Block Size (bytes)   With HW Acceleration (kB/sec)   Without HW Acceleration (kB/sec)
16                        827.07                          1405.03
64                        3264.09                         4197.78
256                       8185.26                         8387.84
1024                      20320.94                        11181.40
8192                      33153.02                        12389.03
CPU Usage                 46%                             86%


time -v openssl speed -elapsed -evp aes-256-cbc

Algorithm: aes-256-cbc

Data Block Size (bytes)   With HW Acceleration (kB/sec)   Without HW Acceleration (kB/sec)
16                        823.89                          1362.68
64                        3263.25                         3978.35
256                       8047.10                         7599.02
1024                      20371.46                        9909.59
8192                      33120.26                        10823.78
CPU Usage                 44%                             86%


OpenSSL Performance VS CPU Usage

The performance of the software and hardware crypto engines is partially related to the CPU usage allotted to the crypto process. To demonstrate OpenSSL performance across a wide range of CPU usage percentages, the Linux scheduling program "nice" is used. Nice allows the user to specify the priority of a process relative to others. Since nice is not included in SDK 5.07.00.00, coreutils must be cross-compiled to perform this benchmark. Note that this cross-compilation assumes the dependencies needed to compile coreutils are already installed.


To prevent redundancy, further information relating to running this benchmark is included in the section "OpenSSL Performance VS CPU Usage" for the AM335x 800MHz OPP.


The following are charts produced from the previous benchmarking utilities at the 1GHz OPP:

OPENSSL NOHWA 1GHz.png


OPENSSL HWA 1GHz.png

DM_Crypt Performance

Each of these benchmarks was performed using SDK 6.00.00.00 at a CPU clock speed of 1GHz and a DDR3 clock speed of 400 MHz. To prevent peripheral bus speed bottlenecks from affecting the benchmarks, the 16MB /dev/ram0 device was used as an encrypted partition. Hdparm was used to measure the cached read rate and buffered disk read rate for the encrypted device. The commands used to run each benchmark are listed above its table.


cryptsetup --cipher aes-cbc-null --key-size 128 luksFormat /dev/ram0
cryptsetup luksOpen /dev/ram0 enc-pv
mke2fs -T ext2 /dev/mapper/enc-pv
hdparm -tT /dev/mapper/enc-pv

No Crypto Hardware Acceleration Drivers

Measurement                           Trial One   Trial Two   Trial Three   Average
Timing cached reads (MB/sec)          274.03      283.81      274.77        277.54
Timing buffered disk reads (MB/sec)   13.52       13.52       13.51         13.52
CPU Usage (%)                         15          15          15            15



modprobe omap4-sham
modprobe omap4-aes
cryptsetup --cipher aes-cbc-null --key-size 128 luksFormat /dev/ram0
cryptsetup luksOpen /dev/ram0 enc-pv
mke2fs -T ext2 /dev/mapper/enc-pv
hdparm -tT /dev/mapper/enc-pv

With Crypto Hardware Acceleration Drivers

Measurement                           Trial One   Trial Two   Trial Three   Average
Timing cached reads (MB/sec)          268.68      285.56      281.74        278.66
Timing buffered disk reads (MB/sec)   17.6        17.38       17.29         17.42
CPU Usage (%)                         15          15          15            15


OpenSSL and DM_Crypt Concurrent Performance

Each of these benchmarks was performed using SDK 6.00.00.00 at a CPU clock speed of 1GHz and a DDR3 clock speed of 400 MHz. To prevent peripheral bus speed bottlenecks from affecting the benchmarks, the 16MB /dev/ram0 device was used as an encrypted partition. These tests show OpenSSL benchmark results while a file is being written concurrently to an encrypted DM_Crypt partition. The commands used to run each benchmark are listed above its table.


cryptsetup --cipher aes-cbc-null --key-size 128 luksFormat /dev/ram0							
cryptsetup luksOpen /dev/ram0 enc-pv							
mke2fs -T ext2 /dev/mapper/enc-pv							
mount /dev/mapper/enc-pv /mnt							
cd ~							
dd if=/dev/zero of=file.txt bs=1048576 count=14							
./infinite_loop&							
openssl *command*

Where infinite_loop is a shell script containing the following:

#!/bin/bash      
while :      
do      
        cp -f ~/file.txt /mnt      
done 

Note that file.txt is a fourteen-megabyte file, so it fits snugly into the sixteen-megabyte RAM partition.


time -v openssl speed -elapsed -evp aes-128-cbc

Algorithm: aes-128-cbc

Data Block Size (bytes)   With HW Acceleration (kB/sec)   Without HW Acceleration (kB/sec)
16                        513.02                          767.98
64                        2032.76                         2426.41
256                       4163.70                         5249.96
1024                      9231.02                         7467.01
8192                      27615.23                        8708.10
CPU Usage                 32%                             43%


time -v openssl speed -elapsed -evp aes-192-cbc

Algorithm: aes-192-cbc

Data Block Size (bytes)   With HW Acceleration (kB/sec)   Without HW Acceleration (kB/sec)
16                        507.25                          790.54
64                        1920.66                         2439.59
256                       4359.94                         4957.53
1024                      11735.38                        6619.14
8192                      29018.79                        7457.45
CPU Usage                 33%                             44%


time -v openssl speed -elapsed -evp aes-256-cbc

Algorithm: aes-256-cbc

Data Block Size (bytes)   With HW Acceleration (kB/sec)   Without HW Acceleration (kB/sec)
16                        507.75                          760.34
64                        2032.50                         2237.40
256                       4418.56                         4441.39
1024                      11909.8                         5842.58
8192                      29155.33                        6498.99
CPU Usage                 33%                             45%


OpenSSL and DM_Crypt Concurrent Performance VS CPU Usage

Using the procedures documented in the section "OpenSSL Performance VS CPU Usage" and the very active DM_Crypt partition described in the section "OpenSSL and DM_Crypt Concurrent Performance," one can produce the following charts describing the crypto performance of OpenSSL at various CPU usages while other programs are attempting to use the crypto facilities.


OPENSSL 1thr DMCRYPT NOHWA 1Ghz.png


OPENSSL 1thr DMCRYPT HWA 1Ghz.png

Multithreaded OpenSSL and DM_Crypt Concurrent Performance VS CPU Usage

Using the instructions found in the section "Multithreaded OpenSSL and DM_Crypt Concurrent Performance VS CPU Usage" for the 800MHz OPP, the following measurements are produced for the 1GHz OPP.

OPENSSL DMCRYPT NOHWA 1Ghz.png

OPENSSL DMCRYPT HWA 1Ghz.png