As introduced in another document, a Renegade 4GB SBC was obtained, set up, and made operational for the purpose of running benchmarks (specifically in comparison to the Raspberry Pi 3 Model B).
This document will cover the benchmarking phase of the endeavor.
Benchmarking is a process of measuring things for comparison, often with respect to performance, efficiency, power utilization, or whatever the desired metric may be.
At the same time, benchmarking is often flawed: it is a careful simulation of comparative activities, and the tests performed aren't necessarily indicative of real-world use.
It is important to keep that in mind: just because something has shown certain benchmark values does not mean it will always conform to those values. Estimated gas mileage on cars is much the same: those figures are benchmarks too, often measured as “city driving” (flat terrain with occasional stops) and “highway” (flat terrain with no stops). Take that same car on extensive hills (perhaps your real-world use), and you will find actual performance differs greatly from the advertised benchmarks.
Still, benchmarking is an excellent practice to sharpen your observation, experimentation, consideration, analytical, and even visualization skills, endeavoring to isolate enough aspects of a thing for the sake of a comparison.
And in this case, it is likely to give us a useful impression of the difference in hardware.
Also, when analyzing results, perspective is important. There are two common ways to read the results of many technology benchmarks: lower is better, and higher is better.
Which one applies DEPENDS on what is being measured. Clearly, if you're measuring how long it takes to DO a task, that'll be in seconds, and better performance means accomplishing the task in less time (lower is better). On the other hand, that task can be accomplished faster if more data can be processed at a time; that would be measured not in seconds but in a storage unit, like megabytes (typically per second), and therefore a higher measured value would be better.
It is important to understand the importance of the measurement, and the units being measured, to properly gauge the impact of the endeavor (otherwise you're just babbling memorized numbers and are clueless about their meaning).
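To make the "lower is better" versus "higher is better" distinction concrete, here is a minimal Python sketch. The function name `times_better` and the sample values are illustrative assumptions, not measurements from this document.

```python
def times_better(candidate, baseline, lower_is_better):
    """Return how many times better `candidate` is than `baseline`.

    A result above 1.0 means the candidate wins; below 1.0 means it loses.
    (Hypothetical helper, introduced only for illustration.)
    """
    if candidate <= 0 or baseline <= 0:
        raise ValueError("measurements must be positive")
    # For durations a smaller number wins, so the ratio is inverted.
    if lower_is_better:
        return baseline / candidate
    return candidate / baseline


# Finishing a task in 5 s instead of 10 s (a time, so lower is better) ...
print(times_better(5.0, 10.0, lower_is_better=True))
# ... and moving 200 MiB/s instead of 100 MiB/s (a rate, so higher is better)
# both describe the same twofold gain.
print(times_better(200.0, 100.0, lower_is_better=False))
```

Keeping the direction explicit as a parameter is a design choice: it forces whoever does the comparison to decide what the unit means before quoting a ratio.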
It is my belief, through reading the technical documentation on the renegade sbc and what I know from the raspberry pi, that:
Some things we know:
Some things I discovered:
Some variables I intend to test:
Those are the biggies for now. Other opportunities will likely crop up as we go along.
The renegade has hardware extensions to handle some cryptographic operations. I am not sure whether support for them has been implemented yet, but I found a cryptographic tool with a benchmarking capability, so that is a good place to start.
The cryptsetup tool:
Description: disk encryption support - startup scripts
Cryptsetup provides an interface for configuring encryption on block devices (such as /home or swap partitions), using the Linux kernel device mapper target dm-crypt.
While we're not looking to do actual disk encryption operations (although, truth be told, that would be another good angle to benchmark), the tool itself has a benchmark option, which is what we'll be using.
Running cryptsetup benchmark on the renegade board yields the following results:
# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       186446 iterations per second for 256-bit key
PBKDF2-sha256     361577 iterations per second for 256-bit key
PBKDF2-sha512     137680 iterations per second for 256-bit key
PBKDF2-ripemd160  128754 iterations per second for 256-bit key
PBKDF2-whirlpool   18886 iterations per second for 256-bit key
# Algorithm | Key | Encryption | Decryption
aes-cbc      128b   285.3 MiB/s   342.0 MiB/s
serpent-cbc  128b           N/A           N/A
twofish-cbc  128b           N/A           N/A
aes-cbc      256b   248.6 MiB/s   314.3 MiB/s
serpent-cbc  256b           N/A           N/A
twofish-cbc  256b           N/A           N/A
aes-xts      256b   304.1 MiB/s   310.7 MiB/s
serpent-xts  256b           N/A           N/A
twofish-xts  256b           N/A           N/A
aes-xts      512b   288.2 MiB/s   292.4 MiB/s
serpent-xts  512b           N/A           N/A
twofish-xts  512b           N/A           N/A
Notice some of the common units of measurement (iterations per second for the key derivation tests, MiB/s for the ciphers), along with the different encryption algorithms, and the use of those algorithms for encryption vs. decryption.
Let us see how the raspberry pi 3 did:
Running cryptsetup benchmark on the pi3b board yields the following results:
# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       114975 iterations per second for 256-bit key
PBKDF2-sha256     159843 iterations per second for 256-bit key
PBKDF2-sha512     114975 iterations per second for 256-bit key
PBKDF2-ripemd160  104025 iterations per second for 256-bit key
PBKDF2-whirlpool   23239 iterations per second for 256-bit key
# Algorithm | Key | Encryption | Decryption
aes-cbc      128b    26.2 MiB/s    29.4 MiB/s
serpent-cbc  128b           N/A           N/A
twofish-cbc  128b           N/A           N/A
aes-cbc      256b    21.5 MiB/s    22.7 MiB/s
serpent-cbc  256b           N/A           N/A
twofish-cbc  256b           N/A           N/A
aes-xts      256b    28.1 MiB/s    28.2 MiB/s
serpent-xts  256b           N/A           N/A
twofish-xts  256b           N/A           N/A
aes-xts      512b    22.8 MiB/s    22.2 MiB/s
serpent-xts  512b           N/A           N/A
twofish-xts  512b           N/A           N/A
More for fun, I also ran a number of these benchmarks on lab46, if only for subjective comparison. Most of us are used to the performance and feel of an Intel CPU; ARM is still very much catching up, as we will likely see.
Here are the numbers for lab46:
# cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       914987 iterations per second for 256-bit key
PBKDF2-sha256    1054905 iterations per second for 256-bit key
PBKDF2-sha512     910222 iterations per second for 256-bit key
PBKDF2-ripemd160  679129 iterations per second for 256-bit key
PBKDF2-whirlpool  522199 iterations per second for 256-bit key
Required kernel crypto interface not available.
Interestingly, lab46 does not have the required kernel crypto interface available, so we won't get the cipher throughput numbers (which, per the tool's own note, are memory-only and involve no storage I/O). Still, the numbers we did get are telling.
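Let's place some of these values side-by-side (iterations per second, higher is better):
|machine||PBKDF2-sha1||PBKDF2-sha256||PBKDF2-sha512||PBKDF2-ripemd160||PBKDF2-whirlpool|
|renegade||186446||361577||137680||128754||18886|
|pi3b||114975||159843||114975||104025||23239|
|lab46||914987||1054905||910222||679129||522199|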
As largely expected, the renegade board is markedly more powerful on a number of these algorithms: on PBKDF2-sha256 it is more than twice as fast as the pi (361577 vs. 159843 iterations per second), and on PBKDF2-sha1 it leads by about 60%. As the algorithms get more complex (I assume that is how they are ranked), the lead between the two boards shrinks to roughly 20% on PBKDF2-sha512 and PBKDF2-ripemd160.
Then curiously, the “whirlpool” test sees the pi3b with a slight edge. From what I can tell looking it up, whirlpool is potentially a different class of algorithm, which makes sense considering how much more rigorous it is on both machines compared to the rest.
When considering encryption, there's also the factor of usability (you don't want to make it TOO EASY for attackers to attack, yet you also don't want to hugely inconvenience the user with overbearing processing requirements). Something like the PBKDF2-sha512 or PBKDF2-ripemd160 would be prime considerations for production on these systems (as they are neither the worst nor best performing).
And, we can see that lab46 trounces the two ARM boards in performance. Even on the more intensive whirlpool algorithm (the 'worst' performing on all three machines), it is more than a factor of 20 better than either board.
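The comparisons above can be checked with a little arithmetic. The following Python sketch takes the PBKDF2 iteration counts from the three cryptsetup listings and computes the ratios; the dictionary layout and the function name `ratio` are my own choices for illustration.

```python
# Iterations per second from the cryptsetup listings above (higher is better).
PBKDF2 = {
    "renegade": {"sha1": 186446, "sha256": 361577, "sha512": 137680,
                 "ripemd160": 128754, "whirlpool": 18886},
    "pi3b":     {"sha1": 114975, "sha256": 159843, "sha512": 114975,
                 "ripemd160": 104025, "whirlpool": 23239},
    "lab46":    {"sha1": 914987, "sha256": 1054905, "sha512": 910222,
                 "ripemd160": 679129, "whirlpool": 522199},
}


def ratio(machine, reference, algorithm):
    """How many times more iterations per second `machine` manages than `reference`."""
    return PBKDF2[machine][algorithm] / PBKDF2[reference][algorithm]


for algorithm in PBKDF2["renegade"]:
    print(f"PBKDF2-{algorithm:10s}"
          f" renegade/pi3b: {ratio('renegade', 'pi3b', algorithm):5.2f}x"
          f"   lab46/pi3b: {ratio('lab46', 'pi3b', algorithm):5.2f}x")
```

A ratio below 1.0 in the renegade/pi3b column marks the one test (whirlpool) where the pi3b comes out ahead.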
The other results are showing data throughput:
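|machine||aes-cbc128 E,D (MiB/s)||aes-cbc256 E,D (MiB/s)||aes-xts256 E,D (MiB/s)||aes-xts512 E,D (MiB/s)|
|renegade||285.3, 342.0||248.6, 314.3||304.1, 310.7||288.2, 292.4|
|pi3b||26.2, 29.4||21.5, 22.7||28.1, 28.2||22.8, 22.2|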
And here there is no contest: the renegade soundly bests the pi3b by at least a factor of 10. For encrypted disk operations, the more data the machine can process per second, the better the performance (i.e., interactivity wouldn't appear to suffer as much). We would definitely feel the difference on the sluggish pi3b if we tried to do encrypted disk operations.
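As a quick check of the "factor of 10" claim, this Python sketch divides each renegade AES figure by the matching pi3b figure. The values are copied from the two cryptsetup listings; the names `AES_MIB_S` and `throughput_ratios` are assumptions made for the example.

```python
# (encryption, decryption) throughput in MiB/s, higher is better.
AES_MIB_S = {
    "aes-cbc 128b": {"renegade": (285.3, 342.0), "pi3b": (26.2, 29.4)},
    "aes-cbc 256b": {"renegade": (248.6, 314.3), "pi3b": (21.5, 22.7)},
    "aes-xts 256b": {"renegade": (304.1, 310.7), "pi3b": (28.1, 28.2)},
    "aes-xts 512b": {"renegade": (288.2, 292.4), "pi3b": (22.8, 22.2)},
}


def throughput_ratios(results):
    """Map each cipher to (encryption ratio, decryption ratio), renegade over pi3b."""
    ratios = {}
    for cipher, boards in results.items():
        enc = boards["renegade"][0] / boards["pi3b"][0]
        dec = boards["renegade"][1] / boards["pi3b"][1]
        ratios[cipher] = (enc, dec)
    return ratios


for cipher, (enc, dec) in throughput_ratios(AES_MIB_S).items():
    print(f"{cipher}: encryption {enc:.1f}x, decryption {dec:.1f}x")
```

Every ratio lands somewhere between roughly 10x and 14x, so the smallest gap is still more than a factor of 10.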
Even though the renegade's kernel may not be fully optimized for its hardware cryptographic functions, there are clear performance advantages almost across the board in favor of the renegade board. In the lone area where the pi3b currently leads, the algorithm is far less practical than the others, and even then the margin is modest: the pi3b is about 23% faster at whirlpool (put the other way, the renegade is about 19% slower).
It will be interesting to see how the renegade will perform once proper support is implemented for its hardware resources.
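A percentage difference depends on which machine is treated as the baseline, which is easy to get backwards. The Python sketch below works through the whirlpool numbers both ways; the helper names are hypothetical and exist only for this illustration.

```python
# PBKDF2-whirlpool iterations per second from the cryptsetup listings.
PI3B = 23239
RENEGADE = 18886


def percent_faster(fast, slow):
    """Percentage by which `fast` exceeds `slow`, using `slow` as the baseline."""
    return (fast / slow - 1.0) * 100.0


def percent_slower(slow, fast):
    """Percentage by which `slow` falls short of `fast`, using `fast` as the baseline."""
    return (1.0 - slow / fast) * 100.0


print(f"pi3b is {percent_faster(PI3B, RENEGADE):.1f}% faster than the renegade")
print(f"renegade is {percent_slower(RENEGADE, PI3B):.1f}% slower than the pi3b")
```

The same gap reads as two different percentages, so it helps to say which board is the reference whenever a figure is quoted.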
Less CPU-specific, measuring network performance speaks to underlying I/O (or, as we will also see, how important it is to have proper support for the hardware so we can adequately measure it).
As stated, in the renegade's 4.4 kernel, the network hardware is NOT fully supported; I actually had to force the speed down to 100Mb so that I wouldn't encounter instabilities. I've also read that improved driver support has been merged as of the 4.14/4.15 kernels, and I've got a 4.14 kernel to potentially test things with (once I successfully boot it, I will be going back and adding additional entries for it).
iperf is a tool to perform network throughput tests, an ideal way to test how well things may be working (and what they are capable of). While not indicative of everyday use and performance, it gives us an appreciation of its performance range.
To use this, we need to set up a client and a server.
On the tested machine, I started the server as follows:
# iperf -s -i 1
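Also on the tested machine, I started the client as follows (client and server share a host here, so the traffic never crosses the wire; the Gbit-range results below reflect the local network stack rather than the physical link):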
# iperf -c IP.AD.RE.SS -i 1
First up, the renegade board:
------------------------------------------------------------
Client connecting to IP.AD.RE.SS, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local IP.AD.RE.SS port 37268 connected with IP.AD.RE.SS port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   290 MBytes  2.43 Gbits/sec
[  3]  1.0- 2.0 sec   294 MBytes  2.46 Gbits/sec
[  3]  2.0- 3.0 sec   293 MBytes  2.46 Gbits/sec
[  3]  3.0- 4.0 sec   293 MBytes  2.46 Gbits/sec
[  3]  4.0- 5.0 sec   294 MBytes  2.46 Gbits/sec
[  3]  5.0- 6.0 sec   294 MBytes  2.46 Gbits/sec
[  3]  6.0- 7.0 sec   294 MBytes  2.47 Gbits/sec
[  3]  7.0- 8.0 sec   294 MBytes  2.47 Gbits/sec
[  3]  8.0- 9.0 sec   294 MBytes  2.47 Gbits/sec
[  3]  9.0-10.0 sec   295 MBytes  2.47 Gbits/sec
[  3]  0.0-10.0 sec  2.87 GBytes  2.46 Gbits/sec
Next up, the raspberry pi:
------------------------------------------------------------
Client connecting to IP.AD.RE.SS, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local IP.AD.RE.SS port 37268 connected with IP.AD.RE.SS port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   700 MBytes  5.87 Gbits/sec
[  3]  1.0- 2.0 sec   720 MBytes  6.04 Gbits/sec
[  3]  2.0- 3.0 sec   508 MBytes  4.26 Gbits/sec
[  3]  3.0- 4.0 sec   504 MBytes  4.22 Gbits/sec
[  3]  4.0- 5.0 sec   517 MBytes  4.34 Gbits/sec
[  3]  5.0- 6.0 sec   511 MBytes  4.29 Gbits/sec
[  3]  6.0- 7.0 sec   516 MBytes  4.33 Gbits/sec
[  3]  7.0- 8.0 sec   516 MBytes  4.33 Gbits/sec
[  3]  8.0- 9.0 sec   522 MBytes  4.38 Gbits/sec
[  3]  9.0-10.0 sec   523 MBytes  4.39 Gbits/sec
[  3]  0.0-10.0 sec  5.41 GBytes  4.64 Gbits/sec
And, to show the scale of different hardware categories, here is lab46:
# iperf -c 10.80.2.46 -i 1
------------------------------------------------------------
Client connecting to 10.80.2.46, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 10.80.2.46 port 59176 connected with 10.80.2.46 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  6.88 GBytes  59.1 Gbits/sec
[  3]  1.0- 2.0 sec  6.77 GBytes  58.2 Gbits/sec
[  3]  2.0- 3.0 sec  6.64 GBytes  57.0 Gbits/sec
[  3]  3.0- 4.0 sec  6.78 GBytes  58.2 Gbits/sec
[  3]  4.0- 5.0 sec  6.71 GBytes  57.6 Gbits/sec
[  3]  5.0- 6.0 sec  6.79 GBytes  58.3 Gbits/sec
[  3]  6.0- 7.0 sec  6.63 GBytes  56.9 Gbits/sec
[  3]  7.0- 8.0 sec  6.72 GBytes  57.7 Gbits/sec
[  3]  8.0- 9.0 sec  6.52 GBytes  56.0 Gbits/sec
[  3]  9.0-10.0 sec  6.78 GBytes  58.2 Gbits/sec
[  3]  0.0-10.0 sec  67.2 GBytes  57.7 Gbits/sec
This also raises the question: how much is minimally needed for a useful computing experience? Lab46 likely represents an obscene excess, especially considering our outbound internet connections. Who here has a 55Gb internet connection? I barely have a 15Mb connection at home, and our campus connection to the LAIR is only 20Mb.
Still, recognizing the capacity for bandwidth also describes the general state of the machine. The more data it can haul, the more it takes to saturate it, so as long as that capacity stays consistently above our actual bandwidth usage, our experience is a pleasant one (which would make even the raspberry pi an ideal platform for our networking endeavors).
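The iperf listings mix two unit systems, which matters when reading them: the Transfer column counts bytes in binary multiples, while the Bandwidth column counts bits in decimal multiples. The Python sketch below reproduces the Bandwidth figure from the Transfer figure under that assumption, which is consistent with the rows shown above. The function names are mine.

```python
MIB = 1024 ** 2   # bytes in one "MByte" as shown in the Transfer column (assumed binary)
GIB = 1024 ** 3   # bytes in one "GByte" as shown in the Transfer column (assumed binary)


def gbits_per_sec(transferred_bytes, seconds):
    """Convert a byte count over an interval into decimal gigabits per second."""
    return transferred_bytes * 8 / seconds / 1e9


def mbytes_to_gbits_per_sec(mbytes, seconds=1.0):
    """Convenience wrapper for the one-second rows, which report MBytes."""
    return gbits_per_sec(mbytes * MIB, seconds)


# First one-second row of the renegade and pi listings, then lab46 (in GBytes).
print(f"{mbytes_to_gbits_per_sec(290):.2f} Gbits/sec")
print(f"{mbytes_to_gbits_per_sec(700):.2f} Gbits/sec")
print(f"{gbits_per_sec(6.88 * GIB, 1.0):.1f} Gbits/sec")
```

Because the Transfer column is itself rounded for display, a recomputed value can differ from the printed Bandwidth in the last digit.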
I discovered a sysbench benchmarking tool, which has a suite of different tests to perform.
We will start with the cpu tests, which involve a prime number calculation (of all things! Many of us have considerable experience and familiarity with such things).
Another factor that comes into play with CPUs (especially modern CPUs) is the number of cores/execution units/threads.
This tool lets me specify the number of threads, so I have done so for the values of 1, 2, 4, 8, 16. I expect we will see a “sweet spot” of performance, where adding on more threads will not improve performance, and at some point will start to diminish performance.
The specific test is invoked by:
# sysbench --test=cpu --num-threads=# run
And produces output of the form:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: #
Doing CPU performance benchmark
Threads started!
Done.
Maximum prime number checked in CPU test: 10000
Test execution summary:
    total time:                          X.YYYYs
    total number of events:              10000
    total time taken by event execution: X.YYYY
    per-request statistics:
        min:                             X.YYms
        avg:                             X.YYms
        max:                             X.YYms
        approx. 95 percentile:           X.YYms
Threads fairness:
    events (avg/stddev):                 XXXX.0000/0.00
    execution time (avg/stddev):         X.YYYY/0.00
I am specifically going to be comparing the “total time:” values for the different thread values, across the different test environments.
I simply redirected all output for each run into a text file named sysbench.cpu.out.# (where # is the number of threads, zero-padded to two digits).
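The runs described above could be scripted rather than typed by hand. This is a minimal Python sketch, assuming `sysbench` is on the PATH and accepts the 0.4.12-style options shown earlier; the helper names `cpu_command`, `output_name`, and `run_all` are hypothetical, and this is not necessarily how the original runs were driven.

```python
import subprocess

# Thread counts used for the cpu test in this document.
THREAD_COUNTS = [1, 2, 4, 8, 16]


def cpu_command(threads):
    """Build the sysbench invocation for a given thread count."""
    return ["sysbench", "--test=cpu", f"--num-threads={threads}", "run"]


def output_name(threads):
    """File name for a run, zero-padded so the files sort in thread order."""
    return f"sysbench.cpu.out.{threads:02d}"


def run_all(thread_counts=THREAD_COUNTS):
    """Run each configuration in turn, saving all output to its own file."""
    for threads in thread_counts:
        with open(output_name(threads), "w") as out:
            subprocess.run(cpu_command(threads), stdout=out,
                           stderr=subprocess.STDOUT, check=True)
```

Calling `run_all()` on a machine with sysbench installed would leave behind the five files that the `grep -H 'total time:'` step then reads.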
On the renegade:
# grep -H 'total time:' sysbench.cpu.out.*
sysbench.cpu.out.01: total time: 147.0570s
sysbench.cpu.out.02: total time: 73.8700s
sysbench.cpu.out.04: total time: 36.9689s
sysbench.cpu.out.08: total time: 37.0156s
sysbench.cpu.out.16: total time: 36.8539s
On the pi3b:
# grep -H 'total time:' sysbench.cpu.out.*
sysbench.cpu.out.01: total time: 139.0519s
sysbench.cpu.out.02: total time: 69.8045s
sysbench.cpu.out.04: total time: 34.8948s
sysbench.cpu.out.08: total time: 34.8903s
sysbench.cpu.out.16: total time: 34.8745s
On lab46:
# grep -H 'total time:' sysbench.cpu.out.*
sysbench.cpu.out.01: total time: 12.0414s
sysbench.cpu.out.02: total time: 6.0233s
sysbench.cpu.out.04: total time: 3.3988s
sysbench.cpu.out.08: total time: 3.3668s
sysbench.cpu.out.16: total time: 3.3630s
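Placing values together in a table (in seconds, lower is better):
|threads||renegade||pi3b||lab46|
|1||147.0570||139.0519||12.0414|
|2||73.8700||69.8045||6.0233|
|4||36.9689||34.8948||3.3988|
|8||37.0156||34.8903||3.3668|
|16||36.8539||34.8745||3.3630|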
It should once again be quite obvious how much of a novelty these ARM boards are when it comes to general computing. Lab46 tends to be a factor of 10 to 12 better, across the board.
Comparing the two ARM boards, they are generally, and consistently, on par. I was surprised to see the pi edging out the renegade by a few seconds each iteration. Considering the renegade is clocked faster, this is curious. I'll feel more satisfied when I see results with an optimized kernel. Note that I'm not expecting the renegade to significantly outperform the pi, but you would expect a slight improvement (perhaps ahead by as much as the non-optimized one is currently behind). Again, this operation is largely CPU-bound, and we're looking at very similar CPUs.
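To see both the thread "sweet spot" and the gap to lab46, the total times can be turned into ratios. This Python sketch assumes the three grep listings above are, in order, the renegade, the pi3b, and lab46 (an inference from the surrounding discussion); the function names are illustrative.

```python
# Total times in seconds from the grep output (lower is better).
THREADS = [1, 2, 4, 8, 16]
TOTAL_TIME = {
    "renegade": [147.0570, 73.8700, 36.9689, 37.0156, 36.8539],
    "pi3b":     [139.0519, 69.8045, 34.8948, 34.8903, 34.8745],
    "lab46":    [12.0414, 6.0233, 3.3988, 3.3668, 3.3630],
}


def speedups(times):
    """Speedup of each run relative to the single-threaded run."""
    return [times[0] / t for t in times]


def slowdown_vs(machine, reference="lab46"):
    """How many times longer `machine` takes than `reference`, per thread count."""
    return [m / r for m, r in zip(TOTAL_TIME[machine], TOTAL_TIME[reference])]


for machine, times in TOTAL_TIME.items():
    row = ", ".join(f"{n} threads: {s:.2f}x" for n, s in zip(THREADS, speedups(times)))
    print(f"{machine:9s} {row}")

for machine in ("renegade", "pi3b"):
    row = ", ".join(f"{x:.1f}x" for x in slowdown_vs(machine))
    print(f"{machine} vs lab46: {row}")
```

On both boards the speedup flattens out at 4 threads, which is what one would expect if each has four cores to keep busy.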
Also, in a more proper setting, we would have run these tests a number of times, and taken an average of the times. This is because other things could have been happening on the system at any given time, and an average would help factor out some of those rough spots (like the renegade board on 8 threads; running that test a second time yielded a value of 36.8052s, which puts it somewhat more in-line with the subtle trend we are seeing in the values).
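The averaging idea can be sketched with the only repeated measurement available here, the two renegade runs at 8 threads. This Python example uses the standard `statistics` module; with just two samples it is an illustration of the method, not a meaningful statistic.

```python
from statistics import mean, stdev

# The two renegade total times (seconds) recorded for 8 threads.
RENEGADE_8_THREADS = [37.0156, 36.8052]


def summarize(samples):
    """Return (mean, sample standard deviation) for a list of timings."""
    spread = stdev(samples) if len(samples) > 1 else 0.0
    return mean(samples), spread


average, spread = summarize(RENEGADE_8_THREADS)
print(f"mean {average:.4f}s, stddev {spread:.4f}s over {len(RENEGADE_8_THREADS)} runs")
```

With more runs per configuration, the spread gives a sense of whether a gap of a few hundredths of a second between thread counts is a real trend or just noise.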
Next up is a more I/O specific benchmark: file I/O.
sysbench provides six different top-level fileio tests (seqwr, seqrewr, seqrd, rndrd, rndwr, and rndrw):
This touches on two important aspects of file access:
As well as the two common means of file access:
We see read/write specs plastered all over drives as marketing sell-points, but again, that's just a benchmark in and of itself… we're pulling back the covers a little bit and adding a bit more detail.
sysbench also lets us specify the number of threads, which can also play a role in fileio. We will be collecting results with thread counts of 1, 2, 4, 8, 16, 32, and 64 (powers of 2 are common examples, as a lot of processing resources come in powers of 2).
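The combination of test modes and thread counts adds up quickly, so it can help to enumerate the matrix before running anything. The Python sketch below builds the command lines using sysbench 0.4.12-style fileio options; the exact options used for the original runs are not shown in this document, so the 512M total size (taken from the RAMdisk discussion) and the helper name `fileio_command` are assumptions.

```python
from itertools import product

# The six sysbench fileio test modes and the thread counts listed above.
MODES = ["seqwr", "seqrewr", "seqrd", "rndrd", "rndwr", "rndrw"]
THREADS = [1, 2, 4, 8, 16, 32, 64]


def fileio_command(mode, threads, total_size="512M", action="run"):
    """Build one sysbench fileio invocation (assumed options, 0.4.12 syntax)."""
    return ["sysbench", "--test=fileio",
            f"--file-total-size={total_size}",
            f"--file-test-mode={mode}",
            f"--num-threads={threads}",
            action]


matrix = [fileio_command(mode, threads) for mode, threads in product(MODES, THREADS)]
print(f"{len(matrix)} runs per storage environment")
print(" ".join(matrix[0]))
```

The fileio test also expects a `prepare` step to create the test files before `run` and a `cleanup` step afterwards, which the `action` parameter can express.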
Also, there are some additional angles we will be focusing on:
Then, with systems like the renegade, not only do we have the factor of non-optimized vs. optimized kernel, potentially offering up significant differences in performance, but also the presence of the eMMC, in addition to the SD card. So: a lot of different environments to grab these metrics in.
Again, for now we're generally doing this just to get a sampling of what the renegade is capable of, and pulling in other systems as a means of comparison: the pi3b to show improvements over, and a system like lab46 to show that no matter how much improved we are over the pi3b, we're still in a rather niche computing envelope. At the same time, where do personal computing needs fall between the renegade and a system like lab46?
I was expecting a performance hit compared to the RAMdisk, but was surprised at just how much I had to scale back. For sanity and care of the machine, I opted not to go beyond thread counts of 8… loads were otherwise easily into the double digits.
One thought that came to mind is that the quota system may have been introducing some performance overhead. So as another variable to test, I did the same thing with quota disabled (actually on the lab46 backup system).
In the end we see that quota's presence has virtually no impact on performance of this benchmark. Good to know.
To set up a RAMdisk, I did the following:
# mkdir /mnt/ramdisk
# mount -t tmpfs -o size=1024M tmpfs /mnt/ramdisk
I'm only making use of 512MiB of test data for sysbench, so this could have been a lot closer to 512MiB than 1024MiB… I'll certainly be doing that on the pi3b, since it has far less RAM available.
What'll be interesting to see is how well the renegade performs here. It uses DDR4, as does lab46, so we may see a surprising surge in performance in this category.
So, here are the values when the manipulated files are on a RAMdisk:
I suspect here is where we will see the pi start to fall flat… I/O is definitely NOT its strong suit. The benchmarks ran for an eternity on the pi, compared to even the renegade:
Created with the same recipe I used on the renegade board.
Curiously, the pi3b held up, performing on par with the renegade. This is surprising at face value, considering the pi uses older LPDDR2 memory. Maybe there's some level of support that still needs to be implemented to unlock the technology benefits (or, the task at hand doesn't heavily rely on technology uniqueness, and the pi3's memory may not be as encumbered as its other storage I/O).
Again, since these are times (in seconds), lower is better.
The following sites were explored while performing this endeavor: