AM335x Crypto Performance




= This page is under temporary revision =

Introduction
Starting with SDK 5.05.00.00, cryptographic acceleration is available for AM335x. The Linux drivers in this SDK are included in the pre-built kernel and ready to use. The demos using OpenSSL under Matrix are automatically accelerated by the available crypto hardware module. Benchmarks were performed at the 800MHz OPP frequency and the 1GHz OPP frequency.

Crypto Performance
This page provides a summary of cryptographic performance data on AM335x. It is not a comprehensive list of cryptographic functions but rather a comparison of hardware-accelerated functions against their software-only equivalents.

There are two aspects of performance related to hardware accelerated cryptography.


 * Throughput - The overall speed at which the calculation is performed, measured in kB/sec.
 * CPU bandwidth - The percentage of Cortex-A8 CPU usage needed to perform the function.

= DMCrypt Installation =

To perform several of these benchmarks, DMCrypt functionality is required through the userspace tool cryptsetup. DMCrypt is a disk encryption subsystem in the Linux kernel; it is used here to test the kernel's use of the CryptoAPI. A PDF with step-by-step cross-compilation and installation instructions for DMCrypt is linked below:

[Installing and Enabling DMCrypt]

= AM335x 800MHz OPP Frequency =

OpenSSL Performance
Each of these benchmarks was performed using SDK 5.07.00.00 at a CPU clock speed of 800MHz and a DDR3 Clock Speed of 303 MHz. Listed above the chart for each algorithm are the code snippets used to run each benchmark test.

time -v openssl speed -elapsed -evp aes-128-cbc

time -v openssl speed -elapsed -evp aes-192-cbc

time -v openssl speed -elapsed -evp aes-256-cbc

OpenSSL Performance VS CPU Usage
The performance of the software and hardware crypto engines is partially related to the CPU usage allotted to the crypto process. Thus, to demonstrate a wide range of OpenSSL performance under various CPU usage percentages, the Linux scheduling program "nice" is used. Nice allows the user to specify the priority of a process relative to others. Since nice is not currently included in SDK 5.07.00.00, it is necessary to cross compile coreutils to perform this particular benchmark. Note that this cross-compilation assumes the dependencies needed to compile coreutils are installed. See the coreutils documentation for more information on the dependencies required for compilation.

    export PATH="<SDK INSTALL DIR>/linux-devkit/bin:$PATH"
    source <SDK INSTALL DIR>/linux-devkit/environment-setup
    cd <SDK install path>/linux-devkit/arm-arago-linux-gnueabi/
    wget http://ftp.gnu.org/gnu/coreutils/coreutils-8.21.tar.xz
    tar -xJf coreutils-8.21.tar.xz
    cd coreutils-8.21
    ./configure --host=arm-arago-linux-gnueabi --prefix=<mount-point of sd-card root>
    cp -v Makefile{,.orig}
    sed -e 's/^#run_help2man\|^run_help2man/#&/' \
        -e 's/^\##run_help2man/run_help2man/' Makefile.orig > Makefile
    make
    cp ./src/nice <mount-point of sd-card root>/usr/bin

Once this is done, save the following code as the file "cpu_nice_benchmark" and make it executable (chmod +x cpu_nice_benchmark).

    #!/bin/bash
    rm -rf benchmark_data
    echo "Enter the cipher you wish to test and press enter (ex: aes-128-cbc)"
    read cipher
    # Fall back to aes-128-cbc when nothing is entered
    : ${cipher:="aes-128-cbc"}

    for ((i=-20;i<21;i++)); do
        echo "OpenSSL Nice: $i"
        # Keep only the throughput summary line and the "Percent of CPU" line
        nice -$i time -v openssl speed -elapsed -evp $cipher 2>&1 | grep "^$cipher\|%$" >> benchmark_data
    done

This file tests OpenSSL crypto performance across the entire range of nice priority levels (-20 to 20) and writes the results in a raw, unparsed format to the file benchmark_data in the same folder.

Afterward, to parse this data into a format that is easy to enter into a spreadsheet program, save the following code as the file "parse_cpu_benchmarks" and make it executable (chmod +x parse_cpu_benchmarks). After running this program, the parsed data is written to the file "parsed_data" in the same directory.

    #!/bin/bash
    linenumber=`wc -l benchmark_data | awk '{print $1}'`
    tabs 11
    rm -f parsed_data
    # benchmark_data alternates: throughput line (i), CPU usage line (i+1)
    for ((i=1;i<linenumber;i=i+2)); do
        cpu=`head -n $((i+1)) benchmark_data | tail -1 | awk '{print $7}' | sed 's/[^0-9.]//g'`
        echo -e -n $cpu "\t" >> parsed_data
        # Fields 2 to 6 hold the throughput for the five block sizes
        for ((j=2;j<7;j++)); do
            throughput=`head -n $i benchmark_data | tail -1 | awk -v k=$j '{print $k}' | sed 's/[^0-9.]//g'`
            echo -e -n $throughput "\t" >> parsed_data
        done
        echo -e -n '\n' >> parsed_data
    done

The format of the parsed benchmark file will be as shown:

OpenSSL CPU Usage (%) | 16 Byte Block Size | 64 Byte Block Size | 256 Byte Block Size | 1024 Byte Block Size | 8192 Byte Block Size
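The parsing script rereads benchmark_data with head and tail once per field. As a hedged alternative, the same table could probably be produced in a single awk pass. The sketch below assumes the input alternates between an "openssl speed" summary line (cipher name followed by five rates) and a "Percent of CPU this job got" line from time -v, as the benchmark script produces. The function name parse_benchmark_data is hypothetical, and unlike the original the sketch does not emit a trailing tab on each row.

```shell
#!/bin/bash
# Hypothetical single-pass replacement for the head/tail parsing loop.
# Reads raw benchmark data on stdin, writes tab-separated rows on stdout:
#   cpu_percent  rate_16  rate_64  rate_256  rate_1024  rate_8192
parse_benchmark_data() {
    awk '
        # CPU line from "time -v", e.g. "Percent of CPU this job got: 99%"
        /%$/ {
            cpu = $NF
            gsub(/[^0-9.]/, "", cpu)      # drop the percent sign
            if (row != "") print cpu row  # emit only when a rate line preceded
            row = ""
            next
        }
        # Summary line from "openssl speed": cipher name, then five rates
        NF >= 6 {
            row = ""
            for (j = 2; j <= 6; j++) {
                v = $j
                gsub(/[^0-9.]/, "", v)    # drop the trailing "k"
                row = row "\t" v
            }
        }
    '
}
```

A possible invocation, assuming the file names used on this page: `parse_benchmark_data < benchmark_data > parsed_data`.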

Finally, run the following command in the background so that the nice values have a significant effect:

    while true; do true; done &
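A bare background loop keeps running after the benchmark finishes unless it is killed by hand. As a minimal sketch, the loop could be wrapped in a pair of helpers that remember its PID and stop it afterwards. The names start_cpu_load, stop_cpu_load, and CPU_LOAD_PID are hypothetical and not part of the SDK.

```shell
#!/bin/bash
# Hypothetical helpers around the competing busy loop.

# Start one busy loop in the background and remember its PID.
start_cpu_load() {
    while true; do true; done &
    CPU_LOAD_PID=$!
}

# Stop the busy loop started by start_cpu_load, if one is running.
stop_cpu_load() {
    [ -n "$CPU_LOAD_PID" ] || return 0
    kill "$CPU_LOAD_PID" 2>/dev/null || true
    wait "$CPU_LOAD_PID" 2>/dev/null || true   # reap it; ignore the kill status
    CPU_LOAD_PID=
}
```

With these in place, a run might look like `start_cpu_load; ./cpu_nice_benchmark; stop_cpu_load`, which leaves no stray loop competing with later measurements.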

The following are charts produced from the previous benchmarking utilities:





Thus, these graphs can be used to compare hardware-accelerated (HWA) performance with software crypto performance at various CPU usage percentages. An example comparison is shown below:

DM_Crypt Performance
Each of these benchmarks was performed using SDK 5.07.00.00 at a CPU clock speed of 800MHz and a DDR3 clock speed of 303 MHz. To prevent peripheral bus speed bottlenecks from affecting the benchmarks, the 16MB /dev/ram0 device was used as an encrypted partition. Hdparm was used to measure the cached read rate and buffered disk read rate of the encrypted device. Listed above the chart for each algorithm are the code snippets used to run each benchmark test.

    cryptsetup --cipher aes-cbc-null --key-size 128 luksFormat /dev/ram0
    cryptsetup luksOpen /dev/ram0 enc-pv
    mke2fs -T ext2 /dev/mapper/enc-pv
    hdparm -tT /dev/mapper/enc-pv

    modprobe omap4-sham
    modprobe omap4-aes
    cryptsetup --cipher aes-cbc-null --key-size 128 luksFormat /dev/ram0
    cryptsetup luksOpen /dev/ram0 enc-pv
    mke2fs -T ext2 /dev/mapper/enc-pv
    hdparm -tT /dev/mapper/enc-pv
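After loading the modules, it can be useful to confirm which implementations the kernel has registered for the cipher under test. The kernel lists them in /proc/crypto, and the crypto API normally prefers the entry with the highest priority. The sketch below is one way to filter that listing; the function name crypto_drivers is hypothetical, and the driver names that appear depend on the kernel build.

```shell
#!/bin/bash
# Hypothetical helper: print "driver<TAB>priority" for every /proc/crypto
# entry whose "name" field equals the first argument.
# Reads /proc/crypto-style text on stdin.
crypto_drivers() {
    awk -v want="$1" -F ' *: *' '
        $1 == "name"     { name = $2 }
        $1 == "driver"   { driver = $2 }
        # "priority" follows "name" and "driver" within each entry
        $1 == "priority" && name == want { print driver "\t" $2 }
    '
}
```

On the target this might be run as `crypto_drivers 'cbc(aes)' < /proc/crypto`. If only a generic software driver is listed after the modprobe commands, the hardware driver has probably not registered.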

OpenSSL and DM_Crypt Concurrent Performance
Each of these benchmarks was performed using SDK 5.07.00.00 at a CPU clock speed of 800MHz and a DDR3 clock speed of 303 MHz. To prevent peripheral bus speed bottlenecks from affecting the benchmarks, the 16MB /dev/ram0 device was used as an encrypted partition. These tests show OpenSSL's benchmark results while a file is being concurrently written to an encrypted DM_Crypt partition. Listed above the chart for each algorithm are the code snippets used to run each benchmark test.

    cryptsetup --cipher aes-cbc-null --key-size 128 luksFormat /dev/ram0
    cryptsetup luksOpen /dev/ram0 enc-pv
    mke2fs -T ext2 /dev/mapper/enc-pv
    mount /dev/mapper/enc-pv /mnt
    cd ~
    dd if=/dev/zero of=file.txt bs=1048576 count=14
    ./infinite_loop &
    openssl *command*

Where infinite_loop is a shell script containing the following:

    #!/bin/bash
    while :
    do
        cp -f ~/file.txt /mnt
    done

Note that file.txt is a fourteen megabyte file, so it fits snugly into the sixteen megabyte RAM partition.

time -v openssl speed -elapsed -evp aes-128-cbc

time -v openssl speed -elapsed -evp aes-192-cbc

time -v openssl speed -elapsed -evp aes-256-cbc

OpenSSL and DM_Crypt Concurrent Performance VS CPU Usage
Using the procedures documented in the section "OpenSSL Performance VS CPU Usage" and the very active DM_Crypt partition described in the section "OpenSSL and DM_Crypt Concurrent Performance," one can produce the following charts describing the crypto performance of OpenSSL at various CPU usages while other programs are attempting to use the crypto facilities.





Multithreaded OpenSSL and DM_Crypt Concurrent Performance VS CPU Usage
The performance of the software and hardware crypto engines is partially related to the CPU usage allotted to the crypto process. Thus, to demonstrate a wide range of OpenSSL performance under various CPU usage percentages, the Linux scheduling program "nice" is used. Nice allows the user to specify the priority of a process relative to others. Since nice is not currently included in SDK 5.07.00.00, it is necessary to cross compile coreutils to perform this particular benchmark. See the section "OpenSSL Performance VS CPU Usage" above for information related to this task. In addition, a more complete version of "ps" is required to obtain CPU usage, since the program "time" is incompatible with the child processes created when openssl tests multiple threads. The instructions for cross compiling "procps", which contains ps, are included below. The overall test case discussed below involves an eight-threaded OpenSSL benchmark running concurrently with an eight-threaded copy command to a DM_Crypt encrypted RAM module.

First, download the beta build of procps from http://procps.cvs.sourceforge.net/procps/procps/

Extract it into the folder <SDK install path>/linux-devkit/arm-arago-linux-gnueabi/

    export PATH="<SDK INSTALL DIR>/linux-devkit/bin:$PATH"
    source <SDK INSTALL DIR>/linux-devkit/environment-setup
    cd <SDK install path>/linux-devkit/arm-arago-linux-gnueabi/<newly extracted procps folder>
    make
    cp -f ./ps/ps <mount-point of sd-card root>/usr/bin
    cp -f ./proc/libproc-3.2.8.so <mount-point of sd-card root>/usr/lib

Boot the ARM device and cd into ~. Once this is done, save the following code as the file "cpu_multithread" and make it executable (chmod +x cpu_multithread).

    #!/bin/bash
    rm -rf benchmark_data
    echo "Enter the cipher you wish to test and press enter (ex: aes-128-cbc)"
    read cipher
    # Fall back to aes-128-cbc when nothing is entered
    : ${cipher:="aes-128-cbc"}

    for ((i=-20;i<21;i++)); do
        echo "OpenSSL Nice: $i"
        nice -$i openssl speed -multi 8 -elapsed -evp $cipher 2>&1 | grep "^evp" >> benchmark_data &
        APID=$!
        # Sample the CPU usage four times while the benchmark is still running
        string=`nice -$i ps -p ${APID}`
        if [ "$string" ]; then
            cpu1=`nice -$i ps -eo pcpu,pid,user,args | sort -k1 -r | head -3 | tail -1 | awk '{print $1 }'`
        fi
        string=`nice -$i ps -p ${APID}`
        if [ "$string" ]; then
            cpu2=`nice -$i ps -eo pcpu,pid,user,args | sort -k1 -r | head -3 | tail -1 | awk '{print $1 }'`
        fi
        string=`nice -$i ps -p ${APID}`
        if [ "$string" ]; then
            cpu3=`nice -$i ps -eo pcpu,pid,user,args | sort -k1 -r | head -3 | tail -1 | awk '{print $1 }'`
        fi
        string=`nice -$i ps -p ${APID}`
        if [ "$string" ]; then
            cpu4=`nice -$i ps -eo pcpu,pid,user,args | sort -k1 -r | head -3 | tail -1 | awk '{print $1 }'`
        fi
        wait $APID
        # Only the last two samples are averaged and recorded
        { echo scale=3; echo "($cpu3 + $cpu4) / 2"; } | bc >> benchmark_data
    done

This file tests OpenSSL crypto performance at the entire range of Nice priority levels (-20 to 20) and outputs the information in a raw, unparsed format to the file benchmark_data in the same folder.
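The multithreaded script averages two of its ps samples with bc, and that expression becomes invalid if a sample is empty (for example when openssl exits before a sample is taken). As a hedged sketch, the averaging step could instead be done by a small helper that skips empty samples. The function name average_samples is hypothetical; the three-decimal output is chosen to match the scale=3 setting in the script.

```shell
#!/bin/bash
# Hypothetical helper: print the mean of its non-empty arguments with three
# decimal places; print nothing when no usable sample was given.
average_samples() {
    printf '%s\n' "$@" | awk '
        $1 != "" { sum += $1; n++ }
        END { if (n > 0) printf "%.3f\n", sum / n }
    '
}
```

In the script this could stand in for the bc line, for instance `average_samples "$cpu3" "$cpu4" >> benchmark_data`.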

Afterward, to parse this data into a format that is easy to enter into a spreadsheet program, save the following code as the file "parse_benchmark_data_multi" and make it executable (chmod +x parse_benchmark_data_multi). After running this program, the parsed data is written to the file "parsed_data" in the same directory.

    #!/bin/bash
    linenumber=`wc -l benchmark_data | awk '{print $1}'`
    tabs 11
    rm -f parsed_data
    # benchmark_data alternates: throughput line (i), averaged CPU usage line (i+1)
    for ((i=1;i<linenumber;i=i+2)); do
        cpu=`head -n $((i+1)) benchmark_data | tail -1 | awk '{print $1}' | sed 's/[^0-9.]//g'`
        echo -e -n $cpu "\t" >> parsed_data
        # Fields 2 to 6 hold the throughput for the five block sizes
        for ((j=2;j<7;j++)); do
            throughput=`head -n $i benchmark_data | tail -1 | awk -v k=$j '{print $k}' | sed 's/[^0-9.]//g'`
            echo -e -n $throughput "\t" >> parsed_data
        done
        echo -e -n '\n' >> parsed_data
    done

The format of the parsed benchmark file will be as shown:

OpenSSL CPU Usage (%) | 16 Byte Block Size | 64 Byte Block Size | 256 Byte Block Size | 1024 Byte Block Size | 8192 Byte Block Size

To create eight DM_Crypt copy threads in the background, create a shell script called "infinite_loop_8" containing the following:

    #!/bin/bash
    while :
    do
        cp -f ~/1mb /mnt/1mb1 &
        cp -f ~/1mb /mnt/1mb2 &
        cp -f ~/1mb /mnt/1mb3 &
        cp -f ~/1mb /mnt/1mb4 &
        cp -f ~/1mb /mnt/1mb5 &
        cp -f ~/1mb /mnt/1mb6 &
        cp -f ~/1mb /mnt/1mb7 &
        cp -f ~/1mb /mnt/1mb8 &
        wait
    done

Now, to start the overall benchmarks, create a DM_Crypt encrypted RAM partition and run the previous scripts as follows:

    cryptsetup --cipher aes-cbc-null --key-size 128 luksFormat /dev/ram0
    cryptsetup luksOpen /dev/ram0 enc-pv
    mke2fs -T ext2 /dev/mapper/enc-pv
    mount /dev/mapper/enc-pv /mnt
    cd ~
    rm -rf /mnt/*
    dd if=/dev/zero of=1mb bs=1048576 count=1
    ./infinite_loop_8 &
    ./cpu_multithread
    ./parse_benchmark_data_multi
    cat parsed_data

The following graphs were produced by the above procedure:





= AM335x 1GHz OPP Frequency =

OpenSSL Performance
Each of these benchmarks was performed using SDK 6.00.00.00 at a CPU clock speed of 1GHz and a DDR3 clock speed of 400 MHz. Listed above the chart for each algorithm are the code snippets used to run each benchmark test.

time -v openssl speed -elapsed -evp aes-128-cbc

time -v openssl speed -elapsed -evp aes-192-cbc

time -v openssl speed -elapsed -evp aes-256-cbc

OpenSSL Performance VS CPU Usage
The performance of the software and hardware crypto engines is partially related to the CPU usage allotted to the crypto process. Thus, to demonstrate a wide range of OpenSSL performance under various CPU usage percentages, the Linux scheduling program "nice" is used. Nice allows the user to specify the priority of a process relative to others. Since nice is not currently included in SDK 5.07.00.00, it is necessary to cross compile coreutils to perform this particular benchmark. Note that this cross-compilation assumes the dependencies needed to compile coreutils are installed.

To avoid redundancy, further information on running this benchmark is given in the section "OpenSSL Performance VS CPU Usage" for the AM335x 800MHz OPP.

The following are charts produced from the previous benchmarking utilities at the 1GHz OPP:





DM_Crypt Performance
Each of these benchmarks was performed using SDK 6.00.00.00 at a CPU clock speed of 1GHz and a DDR3 clock speed of 400 MHz. To prevent peripheral bus speed bottlenecks from affecting the benchmarks, the 16MB /dev/ram0 device was used as an encrypted partition. Hdparm was used to measure the cached read rate and buffered disk read rate of the encrypted device. Listed above the chart for each algorithm are the code snippets used to run each benchmark test.

    cryptsetup --cipher aes-cbc-null --key-size 128 luksFormat /dev/ram0
    cryptsetup luksOpen /dev/ram0 enc-pv
    mke2fs -T ext2 /dev/mapper/enc-pv
    hdparm -tT /dev/mapper/enc-pv

    modprobe omap4-sham
    modprobe omap4-aes
    cryptsetup --cipher aes-cbc-null --key-size 128 luksFormat /dev/ram0
    cryptsetup luksOpen /dev/ram0 enc-pv
    mke2fs -T ext2 /dev/mapper/enc-pv
    hdparm -tT /dev/mapper/enc-pv
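To compare hdparm results across runs (with and without the crypto modules, or between the 800MHz and 1GHz OPPs), it may help to reduce each report to its two read rates. The sketch below assumes the usual "Timing cached reads: ... = N MB/sec" and "Timing buffered disk reads: ... = N MB/sec" lines printed by hdparm -tT. The function name hdparm_rates is hypothetical, and lines reported in a unit other than MB/sec are ignored.

```shell
#!/bin/bash
# Hypothetical helper: reduce "hdparm -tT" output on stdin to
# "cached<TAB>rate" and "buffered<TAB>rate" lines (rates in MB/sec).
hdparm_rates() {
    awk '/Timing/ && /MB\/sec$/ {
        # Field 2 is "cached" or "buffered"; the rate precedes "MB/sec"
        print $2 "\t" $(NF - 1)
    }'
}
```

One possible use is `hdparm -tT /dev/mapper/enc-pv | hdparm_rates`, appending the output of each run to a file for later comparison.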

OpenSSL and DM_Crypt Concurrent Performance
Each of these benchmarks was performed using SDK 6.00.00.00 at a CPU clock speed of 1GHz and a DDR3 clock speed of 400 MHz. To prevent peripheral bus speed bottlenecks from affecting the benchmarks, the 16MB /dev/ram0 device was used as an encrypted partition. These tests show OpenSSL's benchmark results while a file is being concurrently written to an encrypted DM_Crypt partition. Listed above the chart for each algorithm are the code snippets used to run each benchmark test.

    cryptsetup --cipher aes-cbc-null --key-size 128 luksFormat /dev/ram0
    cryptsetup luksOpen /dev/ram0 enc-pv
    mke2fs -T ext2 /dev/mapper/enc-pv
    mount /dev/mapper/enc-pv /mnt
    cd ~
    dd if=/dev/zero of=file.txt bs=1048576 count=14
    ./infinite_loop &
    openssl *command*

Where infinite_loop is a shell script containing the following:

    #!/bin/bash
    while :
    do
        cp -f ~/file.txt /mnt
    done

Note that file.txt is a fourteen megabyte file, so it fits snugly into the sixteen megabyte RAM partition.

time -v openssl speed -elapsed -evp aes-128-cbc

time -v openssl speed -elapsed -evp aes-192-cbc

time -v openssl speed -elapsed -evp aes-256-cbc

OpenSSL and DM_Crypt Concurrent Performance VS CPU Usage
Using the procedures documented in the section "OpenSSL Performance VS CPU Usage" and the very active DM_Crypt partition described in the section "OpenSSL and DM_Crypt Concurrent Performance," one can produce the following charts describing the crypto performance of OpenSSL at various CPU usages while other programs are attempting to use the crypto facilities.





Multithreaded OpenSSL and DM_Crypt Concurrent Performance VS CPU Usage
Using the instructions found in the section "Multithreaded OpenSSL and DM_Crypt Concurrent Performance VS CPU Usage" for the 800MHz OPP, the following measurements are produced for the 1GHz OPP.