I'm hardly the first one to notice this, but I was having a discussion in ##crypto on Freenode about the Linux kernel CSPRNG performance. It was mentioned that the kernelspace CSPRNG was "horrendously slow". Personally, I found the performance sufficient for me needs, but I decided to entertain his definition. I'm glad I did; I wasn't disappointed.
Pull up a terminal, and run the following command, passing 10GB of data from /dev/urandom to /dev/null:
$ dd if=/dev/urandom of=/dev/null bs=1M count=1024 iflag=fullblock 1024+0 records in 1024+0 records out 1073741824 bytes (1.1 GB) copied, 80.1537 s, 13.4 MB/s $ pv < /dev/urandom > /dev/null # cancel in a different terminal, unless you have "-S" 1.02GB 0:01:20 [13.3MB/s] [ < => ]
13.4 MBps of throughput for reading data directly out of the kernelspace CSPRNG. But, can we do better?
In the ##crypto channel, and as should be across development mailing lists, forums, groups, and discussion channels, I recommend that developers should not generally develop their own userspace CSPRNG. There are all sorts of pitfalls and traps waiting for you when you attempt it. Unless you know what you're doing, you could end up with a CSPRNG that isn't actually cryptographically secure (the "CS" in "CSPRNG").
However, what happens when I do actually run a userspace CSPRNG on the same machine? What can I expect out of performance? For example, I could implement AES-128 in CTR mode as a CSPRNG. In fact, we can do this with OpenSSL:
$ dd if=/dev/zero bs=10M count=1024 iflag=fullblock 2> /dev/null | openssl enc -aes-128-ctr -pass pass:"sHgEOKTB8bo/52eDszkHow==" -nosalt | dd of=/dev/null 20971520+0 records in 20971520+0 records out 10737418240 bytes (11 GB) copied, 15.3137 s, 701 MB/s $ openssl enc -aes-128-ctr -pass pass:"sHgEOKTB8bo/52eDszkHow==" -nosalt < /dev/zero | pv > /dev/null 31.9GB 0:00:34 [ 953MB/s] [ < => ]
700-950 MBps (notice that dd(1) incurs a performance penalty). That's 52-70x the speed of reading the kernelspace CSPRNG directly. That's more than a full order of magnitude faster. However, this is on a box with AES-NI. What about disabling AES-NI on the same box? How badly does it damage performance, and how does it compare to reading the kernelspace CSPRNG? We can use OpenSSL speed(1SSL) to benchmark algorithms.
First, with AES-NI enabled:
$ openssl speed -elapsed -evp aes-128-ctr 2> /dev/null (...snip...) The 'numbers' are in 1000s of bytes per second processed. type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-ctr 468590.43k 1174849.02k 1873606.83k 2178642.60k 2244471.47k
And with AES-NI disabled:
$ OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -evp aes-128-ctr 2> /dev/null (...snip...) The 'numbers' are in 1000s of bytes per second processed. type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-ctr 74272.21k 83315.43k 340393.30k 390135.47k 391279.96k
In this case, we see about a 5x performance improvement when using the AES-NI instruction set as compared to when not using it. That's significant. And even with AES-NI disabled in userspace, we're still outperforming /dev/urandom by almost 30x.
Interestingly enough, even the OpenBSD CSPRNG (different hardware than previously tested), which uses ChaCha20, outperforms the Linux CSPRNG (although its userspace CSPRNG with openssl(1) doesn't outperform kernelspace):
% dd if=/dev/urandom of=/dev/null bs=1M count=1024 1024+0 records in 1024+0 records out 1073741824 bytes transferred in 13.630 secs (78775541 bytes/sec) % dd if=/dev/zero bs=1M count=1024 2> /dev/null | openssl enc -aes-128-ctr -pass pass:"sHgEOKTB8bo/52eDszkHow==" -nosalt | dd of=/dev/null 2097152+0 records in 2097152+0 records out 1073741824 bytes transferred in 33.498 secs (32052998 bytes/sec) % openssl speed -elapsed -evp aes-128-ctr 2> /dev/null (...snip...) The 'numbers' are in 1000s of bytes per second processed. type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-ctr 41766.37k 46930.74k 49593.54k 50669.32k 50678.33k
Roughly 78 MBps for OpenBSD on an Intel Xeon CPU running at 2.80GHz. Basically, six times the speed of the Linux kernel CSPRNG on an Intel Xeon CPU running at 2.67GHz.
So why is the Linux CSPRNG so slow? And, what can we do about it? Well, first, the kernel is using SHA-1 for its cryptographic primitive. In very loose terms, the CSPRNG hashes the input pool with SHA-1, and spits out the output to /dev/urandom. It's output is also its input, so its digesting its own output.
But, that's not all it's doing actually. The first function actually adds data into the input pool without increasing the entropy estimate. Then, after adding those bytes, the input pool is mixed with a Skein-like mixing function. Then some math is done to credit the entropy estimator, and the system is polled for data to add to the input entropy pool. Things like disk IO, CPU timings, interrupts, and user activity. Finally, we're ready to hash the data. This is done by extracting the data out of the input pool, and hashing it with SHA-1. But, we don't want any recognizable output, so the output is left-rotated and folded in half. Then, and only then, is the data ready for consumption.
Unfortunately, the Linux kernel CSPRNG is not based on any sound theoretical security design. It's very much a hodge-podge home-brew design by developers who think they know what they're doing, when in reality, they don't. In 2013, a security audit and analysis was performed on the Linux kernel CSPRNG (PDF), and concluded that not only is it not robust, but it has some weaknesses:
In the literature, four security notions for a PRNG with input have been proposed: resilience (RES), forward security (FWD), backward security (BWD) and robustness (ROB), with the latter being the strongest notion among them.
Distributions Used in Attacks based on the Entropy Estimator As shown in Section 5.4, LINUX uses an internal Entropy Estimator on each input that continuously refreshes the internal state of the PRNG. We show that this estimator can be fooled in two ways. First, it is possible to define a distribution of zero entropy that the estimator will estimate of high entropy, secondly, it is possible to define a distribution of arbitrary high entropy that the estimator will estimate of zero entropy. This is due to the estimator conception: as it considers the timings of the events to estimate their entropy, regular events (but with unpredictable data) will be estimated with zero entropy, whereas irregular events (but with predictable data) will be estimated with high entropy.
As shown in Section 5.7, it is possible to build a distribution D0 of null entropy for which the estimated entropy is high (cf. Lemma 3) and a distribution D1 of high entropy for which the estimated entropy is null (cf. Lemma 4). It is then possible to mount attacks on both /dev/random and /dev/urandom, which show that these two generators are not robust.
We have proposed a new property for PRNG with input, that captures how it should accumulate the entropy of the input data into the internal state. This property actually expresses the real expected behavior of a PRNG after a state compromise, where it is expected that the PRNG quickly recovers enough entropy. We gave a precise assessment of Linux PRNG /dev/random and /dev/urandom security. In particular, we prove that these PRNGs are not robust. These properties are due to the behavior of the entropy estimator and the mixing function used to refresh its internal state. As pointed by Barak and Halevi [BH05], who advise against using run-time entropy estimation, we have shown vulnerabilities on the entropy estimator due to its use when data is transferred between pools in Linux PRNG. We therefore recommend that the functions of a PRNG do not rely on such an estimator.
Finally, we proposed a construction that meets our new property in the standard model and we showed that it is noticeably more efficient than the Linux PRNGs. We therefore recommend to use this construction whenever a PRNG with input is used for cryptography.
TL;DR? The Linux CSPRNG does not meet the definitions of a secure CSPRNG per the PDF. It's not that it's theoretically broken, it's just not theoretically secure either. It's really nothing theoretically at all. This isn't great.
A replacement for random.c in the kernel would be to ditch the homebrew entropy collection, mixing, and output mangling, and instead, stick with AES-128 in CTR mode. Of course, as per the PDF, the entropy collectors need serious work, but if AES-128-CTR was deployed as the CSPRNG instead of SHA-1, then the generator could take advantage of hardware AES performance, which as I've shown, is exceptionally superior. It's frustrating, because the kernel already ships AES, so the code is already there. It's just not being utilized.
The Linux kernel could have 1 GBps in CSPRNG output, but is deliberately choosing not to. That's like having a V12 turbo-charged sleeper, without the turbo, and only firing on 3 of the 12 cylinders, with a duct taped muffler on the back.
Why does 1 GBps of performance matter? How about wiping hard drives or secure data removal in general? With 20 MBps, we can't even saturate a single drive in IOPS. With 1 GBps, we could saturate many simultaneously. As someone who wipes old employee workstations when they leave the company, backup servers with dozens of drives, or old decommissioned hardware, I see great benefit here.
Or, how about HTTPS web sites for a shared web hosting provider? I have seen countless times HTTPS and SSH connections lag due to waiting on the CSPRNG. Not that it's being intentionally blocked, but because the load is so intense on the server, it just can't generate enough cryptographic randomness to keep up with requests.
I'm sure there are plenty of other examples where end userspace applications could benefit with improved performance of the CSPRNG. And, as shown, it can't be that difficult to implement correctly. The real question is, of course, who will do the work and submit the patch?