
CPU Jitter Entropy for the Linux Kernel

Normally, I keep a sharp eye on all things cryptography-related in the Linux kernel. However, in 4.2, I missed something fantastic: jitterentropy_rng.ko. This is a Linux kernel module that measures the jitter of the high-resolution timer available in modern CPUs, and uses this jitter as a source of true randomness. Using the CPU timer as a source of true randomness isn't anything new. If you've read my blog for some time, you're already familiar with haveged(8). This daemon also collects CPU jitter and feeds the collected data into the kernel's random number generator.

The main site is at http://www.chronox.de/jent.html and the PDF paper describing the CPU jitter entropy can be found on the site.

So why the blog post about jitterentropy_rng.ko? Because now that we have something in the mainline kernel, we get a few benefits:

  1. More eyes are looking at the code, and can adjust, analyze, and refine the entropy gathering process, making sure it's neither too aggressive nor too conservative in its approach.
  2. We now have something that can collect entropy much earlier in the boot sequence, even before the random number generator has been initialized. This means we can have a properly seeded CSPRNG when the CSPRNG is initialized.
  3. While not available now, we could have a kernelspace daemon collecting entropy and feeding it to the CSPRNG without the need for extra software.
  4. This isn't just for servers, desktops, and VMs, but anything that runs the Linux kernel on a modern CPU, including Android phones, embedded devices, and SoCs.
  5. While haveged(8) has been a good solution for a long time, it has been heavily criticized, and it seems development on it has stalled. Here is another software solution for true randomness without the need for potentially dangerous 3rd-party USB random number generators.
  6. You don't need Intel's RDRAND. Any modern CPU with a high resolution timer will work. AMD, SPARC, ARM, MIPS, PA-RISC, Power, etc.

As mentioned in the list, unfortunately, loading the kernel module doesn't automatically top off the entropy estimate of the internal state of the CSPRNG (/proc/sys/kernel/random/entropy_avail). As such, /dev/random will still block when the estimate is low or exhausted, so you'll still need to run a userspace daemon to prevent this behavior. The author also ships a clean, light userspace daemon that just reads the data provided by the jitterentropy_rng.ko kernel module, and uses ioctl(2) to increase the estimate. The jitterentropy_rng.ko module provides about 10 KBps of random data.
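
If you want to watch the module do its work, load it and keep an eye on the kernel's entropy estimate (these paths are standard on any modern Linux system):

# modprobe jitterentropy_rng
$ watch -n 1 cat /proc/sys/kernel/random/entropy_avail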

Again, this isn't anything that something like haveged(8) doesn't already have access to. However, by taking advantage of a loaded kernel module, we can ensure that randomness is collected before the CSPRNG is initialized. So, when CSPRNG initialization happens, we can ensure that it is properly seeded on first boot, minimizing the likelihood that identical keys will be created on distinct systems. This is something haveged(8) can't provide, as it runs entirely in userspace.

Unfortunately, jitterentropy-rngd(8) isn't available in the Debian repositories yet, so you'll need to download the compressed tarball from the author's website, and manually compile and install it yourself. However, he does ship a systemd(8) service file, which makes it easy to get the daemon up and running on boot with minimal effort.
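
The build is the usual source-install dance; something like the following, where the exact tarball and service unit names are from memory and may differ by version, so adjust to what you actually downloaded:

$ tar xf jitterentropy-rngd-*.tar*
$ cd jitterentropy-rngd-*/
$ make
# make install
# systemctl enable jitterentropy.service
# systemctl start jitterentropy.service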

I've had the jitterentropy_rng.ko module installed with the jitterentropy-rngd(8) userspace daemon running all day today, without haveged(8), and needless to say, I'm pleased. It keeps the CSPRNG entropy estimate sufficiently topped off for software that still relies on /dev/random (developers, please stop doing this and start using /dev/urandom) and provides adequate performance. As near as I can tell, no character device is created when loading the kernel module, so you can't access the unbiased data before it is fed into the CSPRNG. As such, I don't have a way to test its randomness quality. Supposedly, there is a way to access this via debugfs, but I haven't found it.

Anyway, I would recommend using jitterentropy_rng.ko and jitterentropy-rngd(8) over haveged(8) as the source for your randomness.

Weechat Relay With Let's Encrypt Certificates

I've been on IRC for a long time. Not as long as some, granted, but likely longer than most. I've had my hand in a number of IRC clients, mostly terminal-based. Yup, I used the ircII client (briefly), then (also briefly) BitchX. Then I found irssi, and stuck with that for a long time. Search irssi help topics on this blog, and you'll see just how long. Then, after getting hired at XMission in January 2012, I switched full-time to WeeChat. I haven't looked back. This IRC client is amazing.

One of the outstanding features of WeeChat is the relay, effectively turning your IRC client into a bouncer. This feature isn't unique; irssi has a proxy too. However, the irssi proxy does not support SSL (as of 2009). The WeeChat relay does. And with Let's Encrypt certificates freely available, this is the perfect opportunity to use TLS with a trusted certificate.

This post assumes that you are running WeeChat on a box whose firewall you control. In my case, I run WeeChat inside tmux on an externally available SSH server. With Let's Encrypt certificates, you will need to provide a FQDN for your Common Name (CN); this is all part of the standard certificate verification procedure. I purchased a domain that points to the IP of that server, and you will need to do the same.

The official Let's Encrypt "certbot" package used for creating Let's Encrypt certificates is already available in Debian unstable. A simple "apt install certbot" will get that up and running for you. Once installed, you will need to create your certificate.

$ certbot certonly --standalone -d weechat.example.com -m aaron.toponce@gmail.com

Per Let's Encrypt documentation, you need ports 80 and 443 open to the world when creating and renewing your certificate. The execution will create four files, found under the "live" directory:

# ls -l /etc/letsencrypt/
total 24
drwx------ 3 root root 4096 May 19 12:36 accounts/
drwx------ 3 root root 4096 May 19 12:39 archive/
drwxr-xr-x 2 root root 4096 May 19 12:39 csr/
drwx------ 2 root root 4096 May 19 12:39 keys/
drwx------ 3 root root 4096 May 19 12:39 live/
drwxr-xr-x 2 root root 4096 May 19 12:39 renewal/
# ls -l /etc/letsencrypt/live/weechat.example.com/
total 0
lrwxrwxrwx 1 root root 43 May 19 12:39 cert.pem -> ../../archive/weechat.example.com/cert1.pem
lrwxrwxrwx 1 root root 44 May 19 12:39 chain.pem -> ../../archive/weechat.example.com/chain1.pem
lrwxrwxrwx 1 root root 48 May 19 12:39 fullchain.pem -> ../../archive/weechat.example.com/fullchain1.pem
lrwxrwxrwx 1 root root 46 May 19 12:39 privkey.pem -> ../../archive/weechat.example.com/privkey1.pem

The "cert.pem" file is your public certificate for your CN. The "chain.pem" file in the Let's Encrypt intermediate certificate. The "fullchain.pem" file is the "cert.pem" and "chain.pem" files combined. Of course, the "privkey.pem" file is your private key. For the WeeChat relay, it needs the "privkey.pem" and "fullchain.pem" files combined into a single file.

Because the necessary directories under "/etc/letsencrypt/" are accessible only by the root user, you will need root access to copy the certificates out and make them available to WeeChat, which hopefully isn't running as root. Also, Let's Encrypt certificates need to be renewed no sooner than every 60 days and no later than every 90 days. So, not only will you want to automate renewing the certificate, but you'll probably want to automate moving it into the right directory when the renewal is complete.

As you can see from above, I set up my certificate on a Thursday at 12:39. So weekly, on Thursday, at 12:39, I'll check to see if the certificate needs to be renewed. Because it won't renew any more frequently than every 60 days, but it has to be renewed every 90 days, this gives me a 30-day window in which to get the certificate updated. So, I'll keep checking weekly. If a renewal isn't needed, the certbot(1) tool will gracefully exit. If a renewal is needed, the tool will update the certificate. Unfortunately, certbot(1) does not provide a useful exit code when renewals aren't needed, so rather than parsing text, I'll just copy the certs into my WeeChat directory every week, regardless of whether they were renewed.

So, in my root's crontab, I have the following:

39 12 * * 4 /usr/local/sbin/renew.sh

Where the contents of "/usr/local/sbin/renew.sh" are:

#!/bin/bash

# Renew quietly; certbot exits gracefully if the cert isn't due yet
certbot renew -q

# Combine the private key and full chain into the single file the relay expects
cat /etc/letsencrypt/live/weechat.example.com/privkey.pem \
    /etc/letsencrypt/live/weechat.example.com/fullchain.pem > \
    ~aaron/.weechat/ssl/relay.pem

# Make the combined file readable by the (non-root) WeeChat user
chown aaron:aaron ~aaron/.weechat/ssl/relay.pem

Now the only thing left to do is set up the relay itself in WeeChat. So, from within the client:

/relay sslcertkey
/relay add ssl.weechat 8443
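
Note that the relay reads the combined certificate and key from the file named by the relay.network.ssl_cert_key option, which defaults to "%h/ssl/relay.pem" ("%h" being your WeeChat home, normally ~/.weechat), exactly where the renewal script above writes it. If you keep the file elsewhere, point the option at the right path:

/set relay.network.ssl_cert_key "%h/ssl/relay.pem"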

You will need port 8443 open in your firewall, of course.

That's it. It seems I have had some problems with certificate caching in WeechatAndroid. So far, I have had to manually restart the relay in WeeChat, then flush the cache in WeechatAndroid and restart it, to pick up the new certificate (I was previously using a self-signed certificate). Hopefully, this can also be automated, so I don't have to keep manually restarting the relay in WeeChat and flushing the cache in WeechatAndroid.

Regardless, this is how you use Let's Encrypt certificates with WeeChat SSL relay. Hopefully this is beneficial to someone.

Say Allo To Insecurity

Yesterday, Google announced two new encrypted messaging apps called "Allo" and "Duo". There has been some talk about the security of Allo's end-to-end encryption and incognito mode. Most of it was speculation, until Thai Duong blogged about it. Well, it's time to see what he said, and see if Allo stands up to scrutiny.

"Allo offers two chat modes: normal and incognito. Normal is the default, but incognito can be activated with one touch. I want to stress that both modes encrypt chat messages when they are in transit or at rest. The Allo clients talk to Google servers using QUIC or TLS 1.2. When messages are temporarily stored on our servers waiting for delivery they are also encrypted, and will be deleted as soon as they're delivered."

There are a few things in this paragraph that need some explanation. First, "both modes encrypt chat messages when they are in transit or at rest". This is good, but the devil is in the details. In transit, Thai explains how they're encrypted: "Allo clients talk to Google servers using QUIC or TLS 1.2". This has a couple of ramifications. For one, this isn't end-to-end encryption (E2E). This is client-server encryption, which means both the server and the client are encrypting and decrypting the data. As a result, any Google employee with the appropriate privileges can read the messages delivered to Google servers. That's the whole point of why E2E encryption exists: to prevent this from happening.

Second, kudos for storing the messages encrypted on disk. But, realize that Google has the master key to decrypt these messages if needed. Also, kudos for deleting them off of Google's servers as soon as they're delivered. However, just like VPN service providers promising they don't log your connections, Google promising not to log your messages sort of falls into this category. That is, although they might not be storing the messages right now, they may store them later, especially if presented with a warrant from law enforcement. So, Google promising not to store your messages really doesn't amount to much, other than maybe they don't want to unnecessarily chew through disk unless forced. Just remember, Google isn't going to go to jail for you, so they will comply with law enforcement.

"In normal mode, an artificial intelligence run by Google (but no humans including the Allo team or anyone at Google) can read your messages. This AI will use machine learning to analyze your messages, understand what you want to do, and give you timely and useful suggestions. For example, if you want to have dinner, it'll recommend restaurants or book tables. If you want to watch movies, it can buy you tickets.

Like it or not, this AI will be super useful. It's like having a personal assistant that can run a lot of errands for you right in your pocket. Of course, to help it help you you'll have to entrust it with your chat messages. I really think that this is fine, because your chat messages are used to help you and you only, and contrary to popular beliefs Google never sells your personal information to anyone."

Herein lies the real reason why E2E is not enabled by default: Google would like to mine your messages, on your phone, and present you with real-time ads. Ads not just on your phone, but likely when you're logged into your Google account on your desktop or laptop as well. If the data is E2E encrypted, this poses a problem for a company that has made Big Bucks off of advertising. With incognito mode, you are enabling E2E encryption, and the AI no longer has access to this data. Application and browser ads become generic, or must get their mining elsewhere. Because Google is allowing an AI to mine Allo messages for targeted ads, could it be possible that this same AI could be mining other data on your phone for the same goal? Could this AI be mining your email, Twitter, Facebook, photos, and other data? Will this AI be shipping solely with Allo, or will it be a separate service in Android N?

While Google might not be selling your data, they are making a percentage of sales that come from ads. The more targeted the ads become, the more likely you are to make a purchase, and the more likely Google will be to get a percentage of that sale. Google isn't selling your data, but they are making money off of it.

"But what if I want to stay off the grid? What if I don't want even the AI or whatever to see my messages?

"That's fine. We understand your concerns. Everybody including me has something to hide. This is why we develop the incognito mode. In this mode, all messages are further encrypted using the Signal protocol, a state of the art end-to-end chat encryption protocol which ensures that only you and your recipients can read your messages."

WhatsApp, acquired by Facebook, and pushing nearly one billion active messaging accounts, recently enabled E2E encryption as well, also with the Signal Protocol. The difference being, with WhatsApp, E2E is the default for every account when they update their app. E2E is not the default for Allo, and is only enabled in incognito mode. So, if "everybody including me has something to hide", then why isn't E2E the default with Allo?

Thai then quotes a survey explaining that users want self-destructing messages more than E2E. He explains that survey with (emphasis mine):

"So to most users what matters the most is not whether the NSA can read their messages, but the physical security of their devices, blocking unwanted people, and being able to delete messages already sent to other people. In other words, their threat model doesn't include the NSA, but their spouses, their kids, their friends, i.e., people around and near them. Of course it's very likely that users don't care because they don't know what the NSA has been up to. If people know that the NSA is collecting their dick pics, they probably want to block them too. At any rate, NSA is just one of the threat sources that can harm normal users."

Sure, my threat model is also losing my phone. I find that much more likely than either the NSA confiscating my phone, issuing a warrant to collect my data, or decrypting my traffic in real-time (which isn't practical anyway). However, while the NSA isn't in my threat model, the NSA should be in Google's threat model. In other words, Google should be worrying about the NSA for me.

This is why I created d-note, which is running at https://secrets.xmission.com. As a system administrator, I don't want to turn over logs to the NSA or any other organization. As such, the messages are encrypted server-side before being stored to disk, and destroyed immediately upon viewing. The goal isn't necessarily to protect the end user, but to protect the server administrator. Legitimately not being able to provide logs or data when a warrant is issued is extremely valuable.

Google should be protecting the "dick pics" of users from getting into the NSA's hands. Apple recently made a strong stand here against the FBI regarding Syed Farook's iPhone. Apple technically could not help the FBI, because of the protections that Apple baked into their product. Apple's hands were tied. As such, the FBI wanted to set a precedent about enabling government backdoors into the OS for future releases, so they would no longer be blocked from access. Apple is protecting the "dick pics" of its users from the NSA, FBI, and everyone else. Why isn't Google? As we mentioned earlier, the answer to that question is data mining and advertising revenue.

"This is why I think end-to-end encryption is not an end in itself, but rather a means to a real end which is disappearing messaging. End-to-end encryption without disappearing messaging doesn't cover all the risks a normal user could face, but disappearing messaging without end-to-end encryption is an illusion. Users need both to have privacy in a way that matters to them."

Emphases mine. So, Thai recognizes that disappearing messaging without E2E encryption is an illusion. So, why isn't it the default? The higher powers that be, likely. He mentions in his conclusion that he would like E2E to be the default, with a single tap. Something of an option with "Always start in incognito", thus always starting with E2E and always having self-destructing messages. However, rather than opt-in, it should be opt-out. If the prior message history is more important to you than the security of E2E encryption and self-destructing messages, then it should be something that you switch. If SnapChat is so popular because of self-destructing messages, and WhatsApp has one billion users with E2E encryption by default, then Google, a company larger than both combined, should be able to do the same.

Finally, one point that Thai does not mention in his post: Allo is proprietary closed-source software. From a security perspective, this is problematic. First, because you don't have access to the source, you cannot audit it to make sure it holds up to the security claims it makes. As security and software engineers, not having access to the source code should be a major blocker when considering the use of non-free software.

Second, without access to the source code, you cannot create reproducible builds. Even if you did have access to the source code, are you sure the binary you have installed matches the binary you can build? If not, how do you know the binary isn't spying on you? Or compromised? Or just compiled incorrectly, causing undesired behavior? Not being able to create reproducible builds of software means not being able to verify the integrity of the shipped binary. Debian is making it a high priority to ship packages with reproducible builds. It's important to Debian, because they want to be transparent with their userbase. If you don't trust Debian is doing what they claim, you can rebuild the binaries and packages yourself, and they should match what Debian has shipped.

I know this sounds very Richard Stallman and GNU, but proprietary closed-source software is scary when it comes to security. While your immediate threat model might just be those you interact with on a daily basis, the immediate threat model to Google, Apple, SnapChat, and others, are well-funded organizations that have legal weight. Ultimately, they're after your data, which in the end, puts them in your threat model. There are no safety or security guarantees with proprietary closed-source software. You are at the mercy of the software vendor to Do The Right Thing, and many companies just don't.

So, while Allo might be the new kid on the block with E2E encrypted and self-destructing messages, as I've shown, it can't be trusted for your security and privacy. You're best off ignoring it, not recommending it to family and friends, and sticking with Free Software alternatives where E2E messages are default.

How To Always Encrypt Chromium Saved Passwords On GNU/Linux - No Matter What

One of the things that has always bothered me about the Chromium project (the project the Google Chrome browser is based on) is that passwords are encrypted if, and only if, your operating system provides an authentication API through your account login. For example, on Windows, this is accomplished through the "CryptProtectData" function. This function uses your existing account credentials when logging into your computer as a "master key" to encrypt the passwords on your hard drive. For Mac OS X, this is accomplished with Keychain, and for GNU/Linux users, KWallet if you're running KDE or GNOME Keyring if you're running GNOME.

In all those cases, your saved passwords will be encrypted before getting saved to disk. But, what if you're like me, and do not fall into any of those situations? Now, granted, GNU/Linux and BSD users (you're welcome) make up about 3% of the desktop installs.

Graph showing operating system market share.

Of that 3%, although I don't have any numbers, maybe 2/3 run GNOME or KDE. That leaves 1 out of every 100 users where Chromium is not encrypting passwords on disk by default. For me, who lands in that 1%, this is unacceptable. So, I wanted a solution.

Before I go any further, let me identify the threat and adversary. The threat is offline disk analysis. I'm going to assume that you're keeping your operating system up-to-date with the latest security patches, and that your machine is not infected with malware. Instead, I'm going to assume that when you are finished using your machine, upgrade the hardware, or a hard drive fails, the disk is discarded. I'm further going to assume that you either can't or didn't digitally wipe or physically destroy the drive once decommissioned. So, the threat is someone getting a hold of that drive, or laptop, or computer, and imaging the drive for analysis. This means that our adversary is a global adversary: it could be anyone.

Now, the obvious solution would be to run an encrypted filesystem on that drive. dm-crypt with or without LUKS makes this possible. But, let's assume you're not running FDE. Any options? In my case, I run eCryptfs, and store the Chromium data there, symbolically linking to it from the default location.

By default, Chromium stores its passwords in ~/.config/chromium/Default/Login\ Data. This is an SQLite 3.x database, and as mentioned, the passwords are stored in plaintext. A simple solution is to create an eCryptfs private directory, and symlink the database to that location. However, Chromium also stores cookies, caches, and other data in ~/.config/chromium/ that might be worth encrypting as well. So, you can just symlink the entire ~/.config/chromium/ directory to the eCryptfs mount.

I'll assume you've already setup eCryptfs and have it mounted to ~/Private/. If not, run the "ecryptfs-setup-private" command, and follow the prompts, then run "ecryptfs-mount-private" to get it mounted to ~/Private/.
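
On Debian and its derivatives, the whole setup looks something like this:

$ sudo apt install ecryptfs-utils
$ ecryptfs-setup-private
$ ecryptfs-mount-private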

Make sure Chromium is not running and move the ~/.config/chromium/ directory to ~/Private/. Then create the necessary symlink, so Chromium does not create a new profile:

$ mv ~/.config/chromium/ ~/Private/
$ ln -s ~/Private/chromium/ ~/.config/

At this point, all your Chromium data is now stored in your eCryptfs encrypted filesystem, and Chromium will follow the symlink, reading and writing passwords in the encrypted mount. This means that whether you are using KWallet, GNOME Keyring, or nothing at all, your passwords will always be encrypted on disk. Of course, in the SQLite 3.x database, the passwords are still in plaintext, but the database file is encrypted in eCryptfs, thus giving us the security we're looking for.
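
If you want to convince yourself that the data really is opaque at rest, compare the cleartext view with the encrypted lower directory (~/.Private/ in the default layout that ecryptfs-setup-private creates):

$ ls ~/Private/chromium/Default/   # cleartext view while the private directory is mounted
$ ls ~/.Private/                   # what actually lands on disk: encrypted names and contents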

However, there is a caveat which needs to be mentioned. The entire security of the encryption rests solely on the entropy of your eCryptfs passphrase. If that passphrase does not have sufficient entropy to withstand a sophisticated attack from a well-funded organization (our global adversary), then all bets are off. Essentially, this eCryptfs solution is acting like a "master password", and all encryption strength rests on your ability to use a strong password, as defined by Shannon entropy. Current best practice to guard against an offline password cracking attack is to pick a password with at least 128 bits of entropy. You can use zxcvbn.js from Dropbox to estimate your passphrase entropy, which I have installed at http://ae7.st/ent/ (no, I'm not logging passphrases; save the page offline, pull your network cable, and run it locally if you don't believe me).
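
If you need a passphrase with that kind of entropy, one easy approach is to pick words uniformly at random from a large dictionary (the dictionary path varies by distribution). With a 100,000-word list, each word contributes about log2(100000), or roughly 16.6 bits, so eight words get you around 132 bits:

$ shuf --random-source=/dev/urandom -n 8 /usr/share/dict/words | tr '\n' ' '; echo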

Opera, VPNs, and Security

Yesterday, Opera announced that they are bundling a VPN with the latest release of their browser. This is what the release says:

Why we are adding free VPN in Opera

Bringing this important privacy improvement marks another step in building a browser that matches up to people’s expectations in 2016. When you think about it, many popular options offered by desktop browsers today were invented (quite frequently by Opera) many years ago. The innovation energy in the industry has been recently so focused on mobile, even if the desktop is still thriving.

In January, we were reviewing our product plans, and we realized that people need new features in order to browse the web efficiently in 2016. It also became apparent to us that what people need are not the same features that were relevant for their browsers ten years ago. This is why we today have more engineers than ever before working on new features for our desktop browser.

So far we have the native ad blocker. And, we’re introducing another major feature in just a matter of a few weeks; a native, unlimited and free VPN client, right inside your browser!

Enhanced privacy online with Opera’s free VPN

According to Global Web Index*, more than half a billion people (24% of the world’s internet population) have tried or are currently using VPN services. According to the research, the primary reasons for people to use a VPN are:

– To access better entertainment content (38%)
– To keep anonymity while browsing (30%)
– To access restricted networks and sites in my country (28%)
– To access restricted sites at work (27%)
– To communicate with friends/family abroad (24%)
– To access restricted news websites in my country (22%)

According to the research, young people are leading the way when it comes to VPN usage, with almost one third of people between 16-34 having used a VPN.

Better than traditional VPNs

Until now, most VPN services and proxy servers have been limited and based on a paid subscription. With a free, unlimited, native VPN that just works out-of-the-box and doesn’t require any subscription, Opera wants to make VPNs available to everyone.

That’s why Opera’s built-in free VPN feature is easy to use. To activate it, Mac users just need to click the Opera menu, select “Preferences” and toggle the feature VPN on, while Windows and Linux users need to go to the “Privacy and Security” section in “Settings” and enable VPN there. A button will appear in the browser address field, from which the user can see and change location (more locations will appear later), check whether their IP is exposed and review statistics for their data used. It’s free and unlimited to use, yet it offers several must-have options available in paid VPNs, such as:

  • Hide your IP address – Opera will replace your IP address with a virtual IP address, so it’s harder for sites to track your location and identify your computer. This means you can browse the web more privately.
  • Unblocking of firewalls and websites – Many countries, schools and workplaces block video-streaming sites, social networks and other services. By using a VPN you can access your favorite content, no matter where you are.
  • Public Wi-Fi security – When you’re surfing the web on public Wi-Fi, intruders can easily sniff data. By using a VPN, you can improve the security of your personal information.

There were a couple things that stuck out to me rather quickly when reading this press release:

  1. Is it a true VPN, or just an HTTP proxy?
  2. If either a VPN or an HTTP proxy, how is it handling DNS requests?
  3. If an HTTP proxy, is the request through a transparent TLS connection to Opera?
  4. Why is the press release specifically absent about logs and tracking?

Well, some of these questions have been answered. First, it's not a true VPN. Instead, it's just an HTTP/HTTPS proxy. Here are the details:

How the “VPN” works

Once the user enables the feature in settings, Opera VPN sends API requests to https://api.surfeasy.com to obtain credentials and proxy IPs. The browser then talks to a proxy like de0.opera-proxy.net, and its IP address can only be resolved from within Opera when the VPN feature is turned on. It’s an HTTP/S proxy that requires authentication.

When the Opera browser with enabled VPN loads a page, it sends many requests to de0.opera-proxy.net with a Proxy-Authorization request header.

The Proxy-Authorization header decoded:
CC68FE24C34B5B2414FB1DC116342EADA7D5C46B:9B9BE3FAE67
4A33D1820315F4CC94372926C8210B6AEC0B662EC7CAD611D86A3

Since we’re talking about a proxy, these credentials can be used with de0.opera-proxy.net even when connecting from a different machine. This means that if you use the proxy on a computer with no Opera installed, you’ll get the same IP as when using Opera’s VPN.

From this, we can learn that it's not a VPN at all. In fact, it's not even deploying a TLS tunnel for the HTTP/S proxy. So, traditional HTTP requests will still be in the clear, just with a different target. And while a school or library might be filtering requests based on DNS, this HTTP/S proxy in Opera doesn't address more active smart filtering based on content.
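
You don't have to take anyone's word for it, either. Browse through the proxy while watching your own interface, and the plain HTTP requests show up in cleartext (adjust the interface name for your machine):

# tcpdump -i eth0 -A -s0 'tcp port 80'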

Unfortunately, Help Net Security also suggests the use of a general VPN service provider (emphasis mine):

"What Opera offers is not a VPN as such. It's just a proxy for the browser. You still need a full VPN if privacy is what you care about (and you should care about your privacy). Other tools you use, including for example email clients like Outlook, won’t use this 'VPN'," Špaček told Help Net Security.

VPN service providers are scary. Sven Slootweg posted a "Don't use VPN services" Gist where he addresses some real concerns with using VPN service providers (I don't agree with a couple points):

  • VPN service providers log connections and other metadata.
  • VPN service providers have full accounting and payment information of their customers.
  • VPNs really are just glorified proxies, and don't provide any meaningful security or privacy.
  • VPNs don't obfuscate your IP address like Tor, and your IP address is meaningless to trackers anyway.
  • VPN service providers exist, because it's easy money.

I don't fully agree with a couple points (IP addresses are extremely valuable to trackers), but I think the overall topics Sven is trying to drive home are the following: know how VPNs work, know who has access to data at the VPN endpoint(s), and know your security and privacy risks when using a VPN. There are valid times for using a VPN. Data is encrypted between your VPN client and the provider, so it is an easy way to get around restrictive firewalls, which you would think Opera would be trying to address with their HTTP/S proxy. You may also need to access your corporate internal network when "on the road", in which case using your corporate VPN server is needed.

But in both cases, understand the security and privacy concerns when using the VPN. Your VPN provider isn't going to go to jail for you. If the FBI catches unsavory traffic coming out of the VPN provider, you can rest assured they'll give the authorities all the logging, account, and payment information to comply with the request. You can rest assured that if your employer catches you breaking policy with the VPN, you will lose your VPN access, and possibly your job.

So, what to do? Well, realistically, if you want to obfuscate your traffic dynamically, securely, and pseudonymously, then use Tor. Install a Tor client on your machine, install a Tor proxy extension in your browser, and when you want to get around restrictive firewalls, flip the proxy switch, and get on Tor.

Of course, Tor isn't a security and privacy panacea. You still need to understand the risks associated with using Tor. For example, the extension you installed may not tunnel DNS requests through Tor. Of course, HTTP traffic is still in the clear when it leaves the Tor exit relay. Tor clients and extensions may contain vulnerabilities that reveal metadata about you. Basically, don't be ignorant or stupid with your Tor connection.
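
Whether DNS leaks, for example, depends on how the application hands hostnames to the SOCKS proxy. With curl against a local Tor client on its default port, --socks5-hostname lets Tor do the resolution, while --socks5 resolves locally and leaks the lookup to your ISP:

$ curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/
$ curl --socks5 127.0.0.1:9050 https://check.torproject.org/   # DNS resolved outside Tor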

Regardless, I think we can take a few things away from this post:

  1. The Opera VPN is just an HTTP/S proxy.
  2. Opera is very likely logging all your traffic.
  3. Your Opera VPN browsing habits are likely unique enough for Opera to identify you.
  4. VPN service providers should be avoided.
  5. VPN service providers also are likely logging all your traffic.
  6. Your VPN service provider won't go to jail for you.
  7. When in doubt, use Tor, just understand the risks.

Tor and the CloudFlare Problem

Before I go anywhere with this post, let me make three things very clear:

  1. I do not work for CloudFlare.
  2. I work for a small local ISP in Utah.
  3. I have been using Tor probably almost as long as many of you have been alive.

I first blogged about Tor in 2006. I had discovered it around 2004, only a couple of years after its first release. I had used it as a way to prevent my ISP and my employer from tracking what I'm doing with my Internet connection. I would set up a simple SOCKS proxy in Firefox, then switch to it when I wanted to get on the Tor network, and switch away from it when I didn't. Oh, and you think latencies are bad on Tor now? You should have been on it back then.

Here is a metrics graph showing the time it took to download a 50 KiB file over the Tor network. Unfortunately, they don't have the data back when I started using the network, but you get a rough idea of what it was like:

Graph from metrics.torproject.org showing the latencies of downloading a 50 KiB file.

This makes a good deal of sense, because back then, ISPs didn't provide a lot of bandwidth to customers (it can be argued they still don't), and there weren't a lot of exit nodes in the Tor network to handle the bandwidth (again, it can be argued there still aren't enough):

Graph from metrics.torproject.org showing the bandwidth of of relays in the network.

Graph from metrics.torproject.org showing the size of the Tor network

Spend some time on metrics.torproject.org looking over the historical data, and you'll get a good sense that using Tor in 2004 was a lot like getting data over dial-up. It was anything but pleasant.

What's the point? The point is that while things can still be improved (we need more exit nodes, and we need more bandwidth on each exit node), the Tor network's latencies, bandwidth, and relay count are in a good position compared to 12 years ago when I started using it. So running large-scale attacks through the network is now practical.

So, where does CloudFlare fit into this? CloudFlare deploys captchas when you wish to consume a service behind the CloudFlare CDN while connected to Tor. For example, while connected to Tor, visit medium.com, and you will be presented with a captcha, similar to something like this:

Screenshot of Chromium showing the need to solve a visual captcha while trying to browse medium.com while connected to Tor.

This has gotten a lot of criticism from the "cypherpunk" millennials who feel that Tor access should be unrestricted. If you follow the "#dontblocktor" hashtag on Twitter, you will see the continued repeated criticism of CloudFlare deploying these captchas to Tor users on their CDN. Some of the arguments include:

  • Solving the captcha may only bring up another, repeatedly, never being able to consume the website in question.
  • Visually impaired users cannot solve the visual captchas.
  • Non-native English speakers will not be able to solve the audio version of the captcha.
  • People using browsers that disable JavaScript will not be able to reach the page.
  • There may be other security concerns where the choice of Tor is preferred over not using Tor.

No doubt, all captchas on the Web should be reconsidered. Personally, for JavaScript-enabled browsers, I think forcing a proof-of-work puzzle onto the browser is transparent, and provides exactly the sort of rate-limiting needed for mitigating large-scale malicious attacks (a sketch of the idea follows below). For non-JavaScript browsers, captchas seem to be the best alternative. But, I'm sure that as a society, we can find alternatives for non-JavaScript browsers (such as network-based proof-of-work puzzles).
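
To make the proof-of-work idea concrete, here is a hashcash-style sketch in shell (the challenge string and difficulty here are arbitrary): the client burns CPU searching for a counter whose hash carries a fixed prefix of zeros, while the server verifies the result with a single hash:

$ c=0; until printf 'challenge:%d' "$c" | sha256sum | grep -q '^000'; do c=$((c+1)); done; echo "$c"
$ printf 'challenge:%d' "$c" | sha256sum   # server-side check: one hash, prefix must be 000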

No doubt physical limitations, such as visual or audible impairments, can make solving a visual and audible captcha challenging, if not impossible. I don't have good solutions here, except for JavaScript-based proof-of-work puzzles. But the real question that needs to be addressed is: why is CloudFlare deploying captchas for Tor users at all?

CloudFlare addressed this due to the ongoing criticism a select few on Twitter have been giving the company. The blog post "The Trouble with Tor" basically comes down to the following:

  1. You must pick two between: security, anonymity, and convenience.
  2. CloudFlare is a large CDN that deals regularly with malicious traffic sourced from Tor exit relays.
  3. Captchas are a compromise, allowing Tor users to remain anonymous, while also getting access to the website.
  4. A CloudFlare CDN customer has an option in their control panel to whitelist Tor or captcha Tor.
  5. CloudFlare is investigating "blind token" proof-of-work client puzzles for something long-term.

I don't see anything unreasonable here. As a system administrator and security engineer for XMission, I understand and sympathize with CloudFlare's stance toward captchas, even if I don't agree with the implementation of the captcha itself. I have had to fight off malicious Tor traffic from our network many times during my employment, such as DNS and NTP amplification attacks, HTTP POST DDoS attacks, SQL injection and XSS attacks, and many others.

So, even as CloudFlare put it in their reasonable post, how do you allow honest Tor users with high degrees of convenience to consume the website while also minimizing and proactively mitigating malicious Tor traffic?

Again, I don't care for captchas, and wish they would die in a fire. But, what should CloudFlare do? Should they abandon the captcha altogether? If so, how should they proactively prevent malicious Tor traffic from negatively impacting their customer base? It's easy and knee-jerky to post screenshots to Twitter with the "#dontblocktor" hashtag, and shame CloudFlare and the customer using the CDN. I don't think that's the right approach, personally (nevermind that a captcha isn't a block (yes, semantics are important)). I'm curious how many of those who are reacting to CloudFlare captchas are actual system or network administrators that have to deal with these attacks. Instead, I would try to architect solutions to the problem.

Personally, I see the following:

  • Consume CloudFlare without Tor. There are no captchas, but you sacrifice a level of anonymity.
  • Consume CloudFlare behind Tor, but understand the compromise you are making to solve captchas sacrificing convenience.
  • Consume CloudFlare behind a VPN, thus providing both anonymity and convenience.

If it really bothers you that you have to solve a captcha to reach a CloudFlare website, then rather than shaming CloudFlare, it might be worth your time to reach out to the site operator, and let them know about whitelisting Tor. If they engage in conversation, they may not have been aware of the configuration option, or they may have reasons why they want you to solve the captcha. Either way, you've come out ahead without the knee-jerking of #dontblocktor.

I guess in conclusion, while I hate captchas as much as the next guy, what would you do if you were employed by CloudFlare and in charge of this problem? What is a reasonable solution to keeping customers happy by mitigating malicious Tor traffic while also allowing honest Tor users to consume the website with high levels of convenience? Let's engage in discussion about how to create and architect these solutions, so we get as many people happy as possible: CloudFlare network admins, customers, and clients.

A final note about the term "block". The CloudFlare captcha is not blocking you from reading the website. Instead, it's rate-limiting you. Some will argue that you get caught in endless captcha loops, consistently solving them over and over, never actually reaching the service. Personally, I have never encountered this, but others swear it exists. At most, I've had to solve 3 captchas in a row, usually because I did not solve them quickly enough. I guess the effect is the same, but as already mentioned, the "#dontblocktor" hashtag is a knee-jerk, and incorrectly placed. Semantics are important, because CloudFlare is not actually blocking Tor, like Akamai does with "Access denied". It's one thing to provide a 502 HTTP error; it's quite another to rate-limit requests.

Two OCB Block Cipher Mode Patents Expired Due To Nonpayment

Peter Gutmann on the "[Cryptography]" mailing list wrote some thoughts about the impending crypto monoculture of all-things-Bernstein that seems to be currently sweeping the crypto world. In his post, he mentions the following (emphasis mine):

The remaining mode is OCB, which I'd consider the best AEAD mode out there (it shares CBC's graceful-degradation property in which reuse or misuse of the IV doesn't lead to a total loss of security, only the authentication property breaks but not the confidentiality). Unfortunately it's patented, and even though there are fairly broad exceptions allowing it to be used in many situations, the legal minefield that ensues makes it untouchable for most potential users. For example does the prohibition on military use cover the situation where an open-source crypto package is used in a vendor library that's used in a medical insurance app that's used by the US Navy, or where banking transactions protected by TLS may include ones of a military nature (both of these are actual examples that affected decisions not to use OCB). Since no-one wants to call in lawyers every time a situation like this comes up, and indeed can't call in lawyers when the crypto is several levels away in the service stack, OCB won't be used even though it may be the best AEAD mode out there.

Dr. Matthew Green also wrote about authenticated encryption and block cipher modes. He had this to say about OCB mode (emphasis mine):

In performance terms Offset Codebook Mode blows the pants off of all the other modes I mention in this post. It's 'on-line' and doesn't require any real understanding of Galois fields to implement** -- you can implement the whole thing with a block cipher, some bit manipulation and XOR. If OCB was your kid, he'd play three sports and be on his way to Harvard. You'd brag about him to all your friends.

I've known that OCB mode was patented, and as a result, why it has not been included in OpenSSL and other cryptographic protocol implementations. Peter said it correctly: it is a legal minefield. However, I wanted to read up on the patents, their design, operation, etc., mostly because I wanted to get out of doing the dishes. Imagine my shock when I stumbled upon the following:

  • Patent 7,046,802 - Method and apparatus for facilitating efficient authenticated encryption
    • Status: Lapsed
  • Patent 7,200,227 - Method and apparatus for facilitating efficient authenticated encryption
    • Status: Lapsed

Not fully understanding what "Lapsed" means, I went to the official source: The United States Patent and Trademark Office website. I searched for those two patent numbers, and got the following:

  • Patent 7,046,802 - Method and apparatus for facilitating efficient authenticated encryption
    • Status: Patent Expired Due to NonPayment of Maintenance Fees Under 37 CFR 1.362
    • Status Date: 06-06-2014
  • Patent 7,200,227 - Method and apparatus for facilitating efficient authenticated encryption
    • Status: Patent Expired Due to NonPayment of Maintenance Fees Under 37 CFR 1.362
    • Status Date: 05-04-2015

Sure enough, Phillip Rogaway's first two patents regarding the OCB block cipher mode of encryption have expired due to nonpayment. I had to tweet this.

Patents 7,949,129 and 8,321,675 (both also titled "Method and apparatus for facilitating efficient authenticated encryption") are still valid, however. I'm not sure how this applies to Charanjit Jutla's IAPM mode patents, now owned by IBM. Also, I don't know exactly which OCB modes patents 7,046,802 and 7,200,227 cover (OCB1 and OCB2?). If someone can comment here, that would be great.

So, what does this mean for the cryptography world? It means that the OCB mode covered by those two patents can now be implemented royalty-free, without fear of legal entanglements, in Free Software as well as proprietary and commercial software. OpenSSL, LibreSSL, BoringSSL, OpenPGP, Open Whisper Systems Signal, and so many other protocols, projects, and software should be able to implement OCB now.

All because Phillip Rogaway did not make the payments necessary to keep the patent valid. Two more software patents bite the dust.

Linux Kernel CSPRNG Performance

I'm hardly the first one to notice this, but I was having a discussion in ##crypto on Freenode about the Linux kernel CSPRNG performance. It was mentioned that the kernelspace CSPRNG was "horrendously slow". Personally, I found the performance sufficient for my needs, but I decided to entertain his definition. I'm glad I did; I wasn't disappointed.

Pull up a terminal, and run the following command, passing 1 GB of data from /dev/urandom to /dev/null:

$ dd if=/dev/urandom of=/dev/null bs=1M count=1024 iflag=fullblock  
1024+0 records in  
1024+0 records out  
1073741824 bytes (1.1 GB) copied, 80.1537 s, 13.4 MB/s
$ pv < /dev/urandom > /dev/null # cancel in a different terminal, unless you have "-S"
1.02GB 0:01:20 [13.3MB/s] [                   < =>                              ]

13.4 MBps of throughput for reading data directly out of the kernelspace CSPRNG. But, can we do better?

In the ##crypto channel, as should be the case across development mailing lists, forums, groups, and discussion channels, I recommend that developers not develop their own userspace CSPRNG. There are all sorts of pitfalls and traps waiting for you when you attempt it. Unless you know what you're doing, you could end up with a CSPRNG that isn't actually cryptographically secure (the "CS" in "CSPRNG").

However, what happens when I do actually run a userspace CSPRNG on the same machine? What can I expect out of performance? For example, I could implement AES-128 in CTR mode as a CSPRNG. In fact, we can do this with OpenSSL:

$ dd if=/dev/zero bs=10M count=1024 iflag=fullblock 2> /dev/null | openssl enc -aes-128-ctr -pass pass:"sHgEOKTB8bo/52eDszkHow==" -nosalt | dd of=/dev/null
20971520+0 records in
20971520+0 records out
10737418240 bytes (11 GB) copied, 15.3137 s, 701 MB/s
$ openssl enc -aes-128-ctr -pass pass:"sHgEOKTB8bo/52eDszkHow==" -nosalt < /dev/zero | pv > /dev/null
31.9GB 0:00:34 [ 953MB/s] [                                  < =>               ]

700-950 MBps (notice that dd(1) incurs a performance penalty). That's 52-70x the speed of reading the kernelspace CSPRNG directly. That's more than a full order of magnitude faster. However, this is on a box with AES-NI. What about disabling AES-NI on the same box? How badly does it damage performance, and how does it compare to reading the kernelspace CSPRNG? We can use OpenSSL speed(1SSL) to benchmark algorithms.

First, with AES-NI enabled:

$ openssl speed -elapsed -evp aes-128-ctr 2> /dev/null  
(...snip...)
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-ctr     468590.43k  1174849.02k  1873606.83k  2178642.60k  2244471.47k

And with AES-NI disabled:

$ OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -evp aes-128-ctr 2> /dev/null  
(...snip...)  
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-ctr      74272.21k    83315.43k   340393.30k   390135.47k   391279.96k

In this case, we see about a 5x performance improvement when using the AES-NI instruction set as compared to when not using it. That's significant. And even with AES-NI disabled in userspace, we're still outperforming /dev/urandom by almost 30x.

Interestingly enough, even the OpenBSD CSPRNG (different hardware than previously tested), which uses ChaCha20, outperforms the Linux CSPRNG (although its userspace CSPRNG with openssl(1) doesn't outperform kernelspace):

% dd if=/dev/urandom of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 13.630 secs (78775541 bytes/sec)
% dd if=/dev/zero bs=1M count=1024 2> /dev/null | openssl enc -aes-128-ctr -pass pass:"sHgEOKTB8bo/52eDszkHow==" -nosalt | dd of=/dev/null 
2097152+0 records in
2097152+0 records out
1073741824 bytes transferred in 33.498 secs (32052998 bytes/sec)
% openssl speed -elapsed -evp aes-128-ctr 2> /dev/null
(...snip...)
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-ctr      41766.37k    46930.74k    49593.54k    50669.32k    50678.33k

Roughly 78 MBps for OpenBSD on an Intel Xeon CPU running at 2.80GHz. Basically, six times the speed of the Linux kernel CSPRNG on an Intel Xeon CPU running at 2.67GHz.

So why is the Linux CSPRNG so slow? And, what can we do about it? Well, first, the kernel is using SHA-1 for its cryptographic primitive. In very loose terms, the CSPRNG hashes the input pool with SHA-1, and spits out the output to /dev/urandom. Its output is also fed back in as input, so it's digesting its own output.

But, that's not actually all it's doing. The first function adds data into the input pool without increasing the entropy estimate. Then, after adding those bytes, the input pool is mixed with a Skein-like mixing function. Then some math is done to credit the entropy estimator, and the system is polled for data to add to the input entropy pool: things like disk IO, CPU timings, interrupts, and user activity. Finally, we're ready to hash the data. This is done by extracting the data out of the input pool, and hashing it with SHA-1. But, we don't want any recognizable output, so the output is left-rotated and folded in half. Then, and only then, is the data ready for consumption.

W.T.F.

Unfortunately, the Linux kernel CSPRNG is not based on any sound theoretical security design. It's very much a hodge-podge home-brew design by developers who think they know what they're doing, when in reality, they don't. In 2013, a security audit and analysis was performed on the Linux kernel CSPRNG (PDF), and concluded that not only is it not robust, but it has some weaknesses:

In the literature, four security notions for a PRNG with input have been proposed: resilience (RES), forward security (FWD), backward security (BWD) and robustness (ROB), with the latter being the strongest notion among them.

(...snip...)

Distributions Used in Attacks based on the Entropy Estimator As shown in Section 5.4, LINUX uses an internal Entropy Estimator on each input that continuously refreshes the internal state of the PRNG. We show that this estimator can be fooled in two ways. First, it is possible to define a distribution of zero entropy that the estimator will estimate of high entropy, secondly, it is possible to define a distribution of arbitrary high entropy that the estimator will estimate of zero entropy. This is due to the estimator conception: as it considers the timings of the events to estimate their entropy, regular events (but with unpredictable data) will be estimated with zero entropy, whereas irregular events (but with predictable data) will be estimated with high entropy.

(...snip...)

As shown in Section 5.7, it is possible to build a distribution D0 of null entropy for which the estimated entropy is high (cf. Lemma 3) and a distribution D1 of high entropy for which the estimated entropy is null (cf. Lemma 4). It is then possible to mount attacks on both /dev/random and /dev/urandom, which show that these two generators are not robust.

(...snip...)

We have proposed a new property for PRNG with input, that captures how it should accumulate the entropy of the input data into the internal state. This property actually expresses the real expected behavior of a PRNG after a state compromise, where it is expected that the PRNG quickly recovers enough entropy. We gave a precise assessment of Linux PRNG /dev/random and /dev/urandom security. In particular, we prove that these PRNGs are not robust. These properties are due to the behavior of the entropy estimator and the mixing function used to refresh its internal state. As pointed by Barak and Halevi [BH05], who advise against using run-time entropy estimation, we have shown vulnerabilities on the entropy estimator due to its use when data is transferred between pools in Linux PRNG. We therefore recommend that the functions of a PRNG do not rely on such an estimator.

Finally, we proposed a construction that meets our new property in the standard model and we showed that it is noticeably more efficient than the Linux PRNGs. We therefore recommend to use this construction whenever a PRNG with input is used for cryptography.

TL;DR? The Linux CSPRNG does not meet the definitions of a secure CSPRNG per the PDF. It's not that it's theoretically broken, it's just not theoretically secure either. It's really nothing theoretically at all. This isn't great.

A replacement for random.c in the kernel would be to ditch the homebrew entropy collection, mixing, and output mangling, and instead, stick with AES-128 in CTR mode. Of course, as per the PDF, the entropy collectors need serious work, but if AES-128-CTR was deployed as the CSPRNG instead of SHA-1, then the generator could take advantage of hardware AES performance, which as I've shown, is exceptionally superior. It's frustrating, because the kernel already ships AES, so the code is already there. It's just not being utilized.

The Linux kernel could have 1 GBps in CSPRNG output, but is deliberately choosing not to. That's like having a V12 turbo-charged sleeper, without the turbo, and only firing on 3 of the 12 cylinders, with a duct taped muffler on the back.

Why does 1 GBps of performance matter? How about wiping hard drives, or secure data removal in general? At 20 MBps, we can't even saturate a single drive's write throughput. At 1 GBps, we could saturate many drives simultaneously. As someone who wipes old employee workstations when they leave the company, backup servers with dozens of drives, and old decommissioned hardware, I see great benefit here.
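
Until the kernel catches up, the workaround for wiping is the same userspace trick from earlier: stretch a throwaway key into an AES-CTR keystream and write that to the drive, instead of reading /dev/urandom directly (replace /dev/sdX with the target disk; the key is deliberately never stored):

# openssl enc -aes-128-ctr -pass pass:"$(head -c 20 /dev/urandom | base64)" -nosalt < /dev/zero | dd of=/dev/sdX bs=1M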

Or, how about HTTPS web sites for a shared web hosting provider? I have seen HTTPS and SSH connections lag countless times due to waiting on the CSPRNG. Not that it's being intentionally blocked, but because the load is so intense on the server, it just can't generate enough cryptographic randomness to keep up with requests.

I'm sure there are plenty of other examples where end userspace applications could benefit with improved performance of the CSPRNG. And, as shown, it can't be that difficult to implement correctly. The real question is, of course, who will do the work and submit the patch?

Cryptographic Hashing, Part I- Introduction

Introduction

Lately, I've been seeing some discussion online about cryptographic hashing functions, along with some confusion between a cryptographic digest, a cryptographic signature, and a message authentication code. At least in that last post, I think I did well defining and clarifying the differences between those terms, but I also feel like I could take this discussion a lot further. So, I decided to dedicate a series to generic cryptographic hashing functions, which will include building compression frameworks with security proofs, specific cryptographic hashing functions, and some implementations of these functions. So, without further ado, let's get started.

Collisions

When we talk about a hashing function (cryptographic or otherwise), we are referring to any function that can take an arbitrary length of data and compress it into a fixed-length digest. Typically, this digest is called a "fingerprint", a "checksum", or a "hash". The goal is that any time we input the same data, our function outputs the same digest. Further, it's important that not only can I produce that digest, but anyone can produce the same digest. This gets us prepared for the Random Oracle, but we still have some ground to cover first.

Because our hashing function has a fixed-length output, say 128-bits, an ideal function would map its inputs evenly across every one of those outputs. In other words, our function maps each element in the domain (our data to be hashed) to exactly one element in the range (our actual hash). So, if our function produces 128-bit digests, then there are a total of 2^128 possible digests in the range, and ideally, every one of them is reachable. Again, we're speaking about an ideal hashing function.

However, we know that there are many more inputs than just 2^128; there are infinitely many, actually. But think about it for a second. Take the number zero, and send it through our hashing function. Increment that number by 1, then hash that number. Continue in this manner, assuming infinite computing resources and infinite time, until you've hashed every number from 0 through 2^128-1. Ideally, you've produced exactly 2^128 unique digests, exhausting every element in the range. But what happens when you hash 2^128? Now we have what is called a collision. In other words, two distinct inputs were hashed to the same output. To put it formally:

Definition: A collision is when two distinct pieces of data hash to the same digest, checksum, or fingerprint.
Theorem: For any fixed-length hashing function, there are infinitely many collisions.
Proof: This can be proven using the pigeonhole principle. Given a hashing function with n bits of output, hashing any 2^n + 1 distinct inputs must produce at least one collision in the range. Because the domain is infinite and the range is finite, infinitely many distinct inputs must map to the same digests. Q.E.D.

I don't think I need to tell you how much larger infinity is than 128-bits. As a result, collisions are overwhelming. In fact, would you like to see a collision in practice? Below are 2 different hexadecimal strings. The differences are very subtle (only a handful of bytes differ), but they are indeed distinct. Here, we'll take the two strings, and hash them with the well-known MD5 algorithm. Then, just to show I'm not cheating, we'll hash the same strings with SHA-1. While we produce a collision in MD5, we have distinct digests with SHA-1. Go ahead, and verify that you get the same results.

$ INPUT1=d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f89\
55ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5b\
d8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0\
e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70
$ INPUT2=d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f89\
55ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5b\
d8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0\
e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70
$ printf "$INPUT1" | xxd -r -p | md5sum
79054025255fb1a26e4bc422aef54eb4  -
$ printf "$INPUT2" | xxd -r -p | md5sum
79054025255fb1a26e4bc422aef54eb4  -
$ printf "$INPUT1" | xxd -r -p | sha1sum
a34473cf767c6108a5751a20971f1fdfba97690a  -
$ printf "$INPUT2" | xxd -r -p | sha1sum 
4283dd2d70af1ad3c2d5fdc917330bf502035658  -

Crazy, right? Intuitively, finding a collision in an ideal 128-bit hashing function should require a brute force search on the order of the entire 128-bit range. Unfortunately, a brute force collision search turns out to be much faster than that, even against an ideal function, thanks to the Birthday Paradox. The Birthday Paradox says the following:

In a room of just 23 people, there is a 50% probability that at least two of them share the same birthday. In a room of just 75 people, there is a 99.9% probability that at least two of them share the same birthday.

Wait, what? Uhm, last I checked, there are 366 days in a year, assuming a leap year. Soooo, if there are 23 people in a room, then there should be a 23/366, or about a 6% probability, that two people share the same birthday. Unfortunately, this isn't how it works. There may be a 6% chance someone shares your birthday, but there is a 50% chance that two arbitrary people in the room share the same birthday. Now do you see the problem? Not only must you compare your birthday to everyone else's, but so must everyone else. This is a case of combinations. So, with 23 people in the room, there are actually 253 possible pairs to compare (23*22/2). The math gets a little hairy, and to be honest, it's a bit outside the scope of this post, and this series (it's going to be long enough as it is). Refer to the Wikipedia article if you want to work through the theory and the proof.
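
You can at least verify the 50% figure numerically, even if you skip the proof. The probability that 23 people all have distinct birthdays is the product (365/365)(364/365)...(343/365), and one minus that is the collision probability:

$ awk 'BEGIN { p = 1; for (i = 0; i < 23; i++) p *= (365 - i) / 365; printf "%.4f\n", 1 - p }'
0.5073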

We can use this Birthday Paradox to work out an attack on finding two distinct inputs that produce an identical digest. This is called the Birthday Attack, and it's the primary driver in finding collisions. The attack basically says something like this:

To find a collision in an n-bit range with approximately 50% probability, you only need to search about the square root of the number of elements in the range- that is, about 2^(n/2) inputs from the domain.

So, for a 128-bit digest (2^128 possible distinct outputs), using the Birthday Attack, I only need to search about 2^64 possible inputs to have approximately a 50% probability of finding a collision. If you don't think 2^64 is very small, consider that the Bitcoin network is currently mining 2^64 SHA-256 digests about every 20 seconds.
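
That Bitcoin claim is easy to sanity check with bc(1)- 2^64 hashes at one quintillion (10^18) hashes per second:

$ echo '2^64 / 10^18' | bc -l    # roughly 18.4 seconds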

Blind, preimage, and second preimage collisions

Armed with this knowledge, we can now formalize some definitions of collision attacks. This might be confusing, so I'll define it first, then give some examples.

Collision attack:
A blind search, where two distinct inputs produce the same digest.
Preimage attack:
A search to find an input that matches a defined digest.
Second preimage attack:
A search to find a second file that matches the digest of a defined file.

Let's break these down individually. A collision search is literally a blind search, without any respect to inputs or outputs. You don't know what the inputs will be nor do you know what their outputs will be. You only know that you have found two distinct inputs that collide to the same output, all of which is entirely arbitrary.

A preimage attack is where you have a digest in your possession, but you would like to find an input that matches it. In this case, while the input is completely arbitrary, the output is static. For example, suppose you have the 256-bit hexadecimal digest "ec58d903a9f9dcc9d783da72401b1c94fc8fb9d9623d7141b8b90997382088f9". A preimage attack would be successfully finding the input that produced it. In this case, it was "Cz3eJlm4I2I2rHt8hioZ7evonLyukwlz".

A second preimage attack means having both the input "Cz3eJlm4I2I2rHt8hioZ7evonLyukwlz" and its 256-bit hexadecimal digest "ec58d903a9f9dcc9d783da72401b1c94fc8fb9d9623d7141b8b90997382088f9", and finding a second input that produces that same digest.

Usually, when breaking cryptographic hash functions, the first thing to break is the compression function, which I'll cover in later posts. Once the compression function is broken, the next step is searching for blind collisions. This is generally done by analyzing weaknesses in the mathematics, finding bias in the output, observing the quality of the avalanche effect, and so forth. You eventually learn where the hashing function is weak, and where you can take "shortcuts" to get to your goal. Eventually, the algorithm is broken to the point that finding blind collisions is practical. MD5 is broken in this regard.

After breaking the compression function, and weakening the algorithm to the point of practical collision attacks, preimage attacks become the next focus of analysis. However, a broken compression function, as in the case of SHA-1, is a strong sign to start moving away from the algorithm, long before collisions are found. So, analysis tends to slow down after collisions have been found, because no one should be using the function anymore. This also means second preimage attacks get even less attention.

Avalanche Effect

The final property of cryptographic hashing functions that needs to be addressed is the "avalanche effect". It is absolutely critical in cryptographic hashing functions that even though inputs may be sequential, their outputs do not show that to be the case. For example, consider the SHA-256 digests of the integers 1 through 10:

$ for I in {1..10}; do printf "$I: "; printf "$I" | sha256sum -; done
1: 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b  -
2: d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35  -
3: 4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce  -
4: 4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a  -
5: ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d  -
6: e7f6c011776e8db7cd330b54174fd76f7d0216b612387a5ffcfb81e6f0919683  -
7: 7902699be42c8a8e46fbbb4501726517e86b22c56a189f7625a6da49081b2451  -
8: 2c624232cdd221771294dfbb310aca000a0df6ac8b66b696d90ef06fdefb64a3  -
9: 19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7  -
10: 4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5  -

Notice that there is no visible relationship between the sequential inputs and their digests. For all practical purposes, the outputs are truly randomized, despite the sequential input (merely incrementing each input by one from the previous). However, can we formally define the avalanche effect? What would be ideal is that with each bit change in the input, every bit in the digest output has as close to a 50% chance of being flipped as theoretically possible.

I'll talk more about "rounds" in future posts when I talk about specific implementations and designs. Suffice it to say that a cryptographic hashing function will iterate through its compression function a certain number of times before outputting the state. On each round, each bit in the output should have a 50% chance of being flipped. So, on the output of each round iteration, close to half of the bits have been flipped in some pseudorandom manner. After a certain number of rounds, the final output should be indistinguishable from true random noise.

So, how about this as a formal definition:

When a single input bit is flipped, each output bit should change with a 50% probability.

Of course, the cryptographic strength doesn't rest solely on the avalanche effect; there are deeper mathematical properties that determine that. But, the output should be completely unpredictable. You could apply the "next bit test": there should be no algorithm that can predict the next bit of the output without actually compromising the internal state (a test usually applied to cryptographically secure pseudorandom number generators).
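
You can get an informal feel for the avalanche effect with the digests above. Here is a quick sketch (assuming GNU awk, for its xor() builtin) that counts how many bits differ between the SHA-256 digests of "1" and "2":

$ A=$(printf 1 | sha256sum | cut -d ' ' -f 1)
$ B=$(printf 2 | sha256sum | cut -d ' ' -f 1)
$ gawk -v a="$A" -v b="$B" 'BEGIN {
    for (i = 1; i <= length(a); i++) {
        x = index("0123456789abcdef", substr(a, i, 1)) - 1
        y = index("0123456789abcdef", substr(b, i, 1)) - 1
        for (v = xor(x, y); v > 0; v = int(v / 2)) bits += v % 2
    }
    print bits " of " length(a) * 4 " bits flipped"
}'

With a strong hashing function, the count should hover around 128 of 256, no matter which two inputs you pick.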

Unfortunately, all we have to test the avalanche effect is standard randomness tests, such as the chi-square distribution, Monte Carlo for Pi, and the Birthday Paradox, among others. This doesn't say anything about the cryptographic strength of the hashing function, but it says a lot about its randomness properties (non-cryptographic hashing functions can also exhibit strong randomness qualities).

There are a couple of software utilities we can use to test and analyze cryptographic hashing functions. First, we have standard randomness tests, such as Dieharder and the FIPS 140-2 suite. But, for something more specific to analyzing cryptographic primitives, I would recommend Cryptol. On the one hand, this isn't an out-of-the-box software solution for just running a battery of tests and analysis; it's a domain-specific language that requires a bit of a learning curve. On the other hand, it's Free Software, and you'll probably learn more about cryptanalysis with this tool than by just playing with randomness tests.

Conclusion

This was just a primer post to get you thinking about cryptographic hashes, specifically thinking about their output, and the task of finding collisions. The rest of the posts in the series will cover specific functions such as MD5, SHA-1, -2, and -3, as well as some others. We'll talk about hashing constructions, and where you'll find cryptographic functions in practice (I think you'll be surprised). I may even throw in a post or two about random oracles, and how we want cryptographic hashing functions to not only imitate them, but be proven secure under the "Random Oracle Model".

Regardless, this post will get you started, and hopefully excited for what is to come.

Manual Authenticated File Encryption With OpenSSL

One thing that bothers me about OpenSSL is the lack of commandline support for AEAD ciphers, specifically AES in CCM and GCM block modes. Why does this matter? Suppose you want to save an encrypted file to disk, without GnuPG, because you don't want to get into key management. Further, suppose you want to send this data to a recipient, or store it on a server outside of your full control. Authenticated encryption is important here; otherwise, the ciphertext is malleable and vulnerable to bit flipping.

So, when you get to the shell, you may try using AES in GCM mode with OpenSSL's "enc(1)" command, only to be left wanting. Here, we generate a key from /dev/urandom, convert it to hexadecimal, and provide the key as an argument on the command line.

$ LC_CTYPE=C tr -cd 'abcdefghjkmnpqrstuvwxyz23456789-' < /dev/urandom | head -c 20; echo
sec2tk24ppprcze33ucs
$ echo sec2tk24ppprcze33ucs | xxd -p
73656332746b323470707072637a6533337563730a
$ openssl enc -aes-256-gcm -k 73656332746b323470707072637a6533337563730a -out file.txt.aes -in file.txt
AEAD ciphers not supported by the enc utility
$ echo $?
1

So, rather than using GCM, we can build the authentication tag manually with HMAC-SHA-512, which OpenSSL does support. This means using a non-authenticated block cipher mode, such as CTR, as a first step, then authenticating the ciphertext manually as a second step (encrypt-then-MAC).

Using our same password from the previous example, we'll do this in two steps now:

$ openssl enc -aes-256-ctr -k 73656332746b323470707072637a6533337563730a -out file.txt.aes -in file.txt
$ openssl dgst -sha512 -binary -mac HMAC -macopt hexkey:73656332746b323470707072637a6533337563730a -out file.txt.aes.mac file.txt.aes

Now you have three files- your plaintext file, your AES encrypted ciphertext file, and your HMAC-SHA-512 authentication file:

$ ls -l file.txt*
-rw-rw-r--. 1 aaron aaron 1050 Feb 27 10:26 file.txt
-rw-rw-r--. 1 aaron aaron 1066 Feb 27 10:27 file.txt.aes
-rw-rw-r--. 1 aaron aaron   64 Feb 27 10:28 file.txt.aes.mac

When sending or remotely storing the "file.txt.aes" file, you'll want to make sure the "file.txt.aes.mac" authentication file accompanies it. Unfortunately, the OpenSSL dgst(1) command does not support verifying message authentication codes, so you'll have to script this manually: generate a second MAC file, maybe "file.txt.tmp.mac", then compare the two. If they match, you can decrypt the "file.txt.aes" ciphertext file. If not, discard the data.
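
A minimal verification sketch, reusing the same hex key from above, might look like this (note that cmp(1) is not a constant-time comparison- fine for files at rest, but worth knowing about):

$ openssl dgst -sha512 -binary -mac HMAC -macopt hexkey:73656332746b323470707072637a6533337563730a -out file.txt.tmp.mac file.txt.aes
$ cmp -s file.txt.aes.mac file.txt.tmp.mac; echo $?
0

An exit status of 0 means the tags match, and it's safe to decrypt with "openssl enc -d -aes-256-ctr"; anything else means the ciphertext was tampered with, and should be discarded.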

This isn't elegant, and I wish enc(1) supported AEAD, but as it stands, it doesn't. So, you'll have to stick with doing things manually. However, this is simple enough to script, and it provides both data confidentiality and authenticity, which should be the goal of every ciphertext.

Digest Algorithms in Google Spreadsheets

I can't imagine there are a lot of uses for digest algorithms in spreadsheets, but I came up with one, and I really wished I had access to them. Seeing as most spreadsheet applications don't ship them, I figured I would create my own.

Mostly, I use Google for my document processing and spreadsheet use, and I had a spreadsheet of Louis L'Amour books. My grandfather gave me his entire selection of Louis L'Amour books last year, and I made a goal to read them all during 2016. I grew up listening to them on audiotape in the truck when I was on the road with him laying carpet, heading up to the family cabin in Idaho, and other things, so, I have memories of many of the stories. It will be fun to read them.

So, what do digest algorithms have to do with Louis L'Amour novels? Well, after the spreadsheet was created to track what I've read and what I have left, as well as a pace (I have to be reading at least 60 pages every day), I wanted to start reading the books in a random order. Sure, I'll read the Sackett, Hopalong Cassidy, Talon & Chantry, and Kilkenny series first, but when I'm finished with those series, I want to read the novels in random order. Why? Because I don't want to get caught up in publication order, watching him change as a writer, and going in alphabetical order is boring. Randomness is exciting!

Now, I could have used the =RAND() function in the spreadsheet, but when I sort the columns, the numbers change. So, I would need to copy and paste their values, then sort the columns. Besides, is RAND() even cryptographically secure (indistinguishable from true random noise)? Alternatively, I could pull ASCII data off of /dev/urandom and paste those results into the column, then sort off of that. But that requires an external tool. However, I could also use a digest algorithm to calculate the digest of each book title, then sort by the digest. Because digest algorithms aren't part of the default Google Spreadsheets functions, my OCD kicked in, and I had to create them.

Here is what I came up with. You'll notice that MD2(), MD5(), and SHA1() are included, even though they're not cryptographically secure for today's modern cryptographic applications. However, for this specific use case, such as sorting columns, they are fine. Also, notice that SHA256(), SHA384(), and SHA512() exist, but not SHA224(). This is because "Utilities.DigestAlgorithm" does not export a "SHA_224" algorithm, which in my opinion is just odd. MD4 is also not available, nor are any of the SHA-3 functions. Regardless, all the digest algorithms supported by the API are available.

// Cryptographic hash functions for use in Google Spreadsheets
// Use with =MD5(string)
// A string or cell (or concatenation of cells) can be provided
// Output in hex
// Released to the public domain

function toHexString_(digest) {
  // computeDigest() returns an array of signed bytes; normalize each
  // to 0-255 and zero-pad to two hex characters. The trailing underscore
  // keeps this helper from being exposed as a spreadsheet function.
  var hexstr = '';
  for (var i = 0; i < digest.length; i++) {
    var val = (digest[i] + 256) % 256;
    hexstr += ('0' + val.toString(16)).slice(-2);
  }
  return hexstr;
}
function MD2(s) {
  return toHexString_(Utilities.computeDigest(Utilities.DigestAlgorithm.MD2, s));
}
function MD5(s) {
  return toHexString_(Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, s));
}
function SHA1(s) {
  return toHexString_(Utilities.computeDigest(Utilities.DigestAlgorithm.SHA_1, s));
}
function SHA256(s) {
  return toHexString_(Utilities.computeDigest(Utilities.DigestAlgorithm.SHA_256, s));
}
function SHA384(s) {
  return toHexString_(Utilities.computeDigest(Utilities.DigestAlgorithm.SHA_384, s));
}
function SHA512(s) {
  return toHexString_(Utilities.computeDigest(Utilities.DigestAlgorithm.SHA_512, s));
}

So, how do we apply it? Because this is a script applied per spreadsheet, you need to add it manually every time you create a new spreadsheet document. Supposedly, you can permanently add it through the Chrome Web Store (I created a project that is currently pending review), but for the time being, copying and pasting works.

Navigate to "Tools > Script Editor", remove the default function Google provides, and add the code above. Save it as a project, then it will be available to your spreadsheet. Now you can use it.

For example,

To calculate the MD5 of cell "A1":
    =MD5(A1)

To calculate the SHA1 of cells "A1" through "D1":
    =SHA1(CONCATENATE(A1:D1))

Retrieve the 12 left-most characters from a SHA512 digest of cells "A1" through "D1":
    =LEFT(SHA512(CONCATENATE(A1:D1)), 12)

I'm sure there are a few uses for digest algorithms in spreadsheets; they're just not very common. At least, web searching for their use gives me very scant results. If you find these helpful, even if there are other solutions, I would be interested in hearing how you used them in the comments below.

Oh, and here's my Louis L'Amour spreadsheet tracking my reading progress. 🙂

My Strange Tweets

You may have noticed some tweets from me that look... strange. Probably something like these:


First, let me provide some background. When Twitter was announced, a couple of Free Software developers got together to create a self-hosted Free Software alternative. They called that alternative "Identica", both because it was hosted in Canada (the identi.ca domain) and because it was a way to establish your social identity. It made sense, and the Free Software and Open Source ecosystem ate it up. Within no time, it was a thriving online social network, involving mostly those from the Free Software and Open Source world, with all sorts of very influential developers and people creating accounts.

One account that seemed to catch the eye of many was @key. It posted what appeared to be MD5 checksums every 2 hours, regularly and consistently. Plenty of people were following the account, yet it wasn't following anyone. People replied to the tweets, asking what it was posting, who it was, why it was doing what it was doing, if it was a government account, etc. No one could figure it out, and even if the posts were MD5 checksums, no one could reproduce them. It was a social enigma, and it kept people enthralled and engaged.

I thought this was exceptionally creative, and I was quite jealous that I didn't think of it first. The best I figured was that it was hashing the timestamp of the tweet with a custom salt. At least, that is what I would have done. It couldn't be an MD5 of random data; otherwise, why not just post the random data? Or is that exactly what it is? So, instead, I decided to play with the Identica API and roll my own, using my own account. I had already set up the Identica-Twitter bridge, so anything I posted to my Identica account would get posted to Twitter automatically.

But, I have to be different. So, rather than a random digest that no one could figure out (I'm sure it's a timestamp), I wanted something a little more transparent. I started with taking the SHA-1 of the Unix epoch (the number of seconds since Jan 1, 1970 00:00:00 UTC) at 13:37 local time, because it's leet. This was easily accomplished with a bit of shell code:

$ EPOCH=$(date --date="today 13:37" +%s); printf "$EPOCH: "; printf "$EPOCH" | sha1sum - | cut -d ' ' -f 1

This was the first tweet:


Later, however, I wanted something even more creative. I go by the online nick "eightyeight" on IRC, because I play the piano. However, some Asian cultures see the number "8" as lucky. So, with "Chinese" fortune cookies in mind, I figured I would "encrypt" a fortune at 08:08 local time. Again, I decided to do this with a bit of shell code:

$ fortune -s -n 70 | gzip -c | base64 | rot13 | paste -sd ''
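
Reversing it is the same pipeline backwards. Assuming rot13 is the usual tr(1) alias, and $TWEET holds the text copied from the timeline, something like this recovers the fortune:

$ printf '%s' "$TWEET" | tr 'a-zA-Z' 'n-za-mN-ZA-M' | base64 -d | gzip -dc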

The first tweet to hit that was (testing the API, so this one actually wasn't on 08:08):


However, Identica started going downhill. First, we had big challenges fighting bot spam. Despite repeated bug reports and discussion on the network, very little was changing in the code to combat the spam (for future reference, just use Hashcash tokens as a proof-of-work for form submissions). Then, after getting venture capital and attempting to appeal to the mass market, things started changing. First, it rebranded itself as "Status.Net", and we lost threaded replies. The API was no longer Twitter-compatible (at least, some things were different), and the branding got real weird. Then it rebranded itself again, under a completely new code rewrite, as "pump.io", and that is the status today. At this last rebranding, the API was no longer functional, and my scripts stopped. I didn't want to work with the Twitter API, so I didn't bother setting it up again.

It wasn't until some time ago that I decided to resurrect my cryptic tweets. However, I made some changes. Instead of using SHA-1, I decided to use RIPEMD-160. Although it hasn't had the mountains of analysis SHA-1 has had, RIPEMD-160 is still considered secure, although with its 160-bit digest size, the security margin might be a bit too slim for some. However, I stuck with the same Unix epoch timestamp, automated at 13:37 local time.
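
The one-liner is essentially the same as before, with the digest swapped out- openssl(1) is one of the few common tools that ships RIPEMD-160:

$ EPOCH=$(date --date="today 13:37" +%s); printf "$EPOCH: "; printf "$EPOCH" | openssl dgst -ripemd160 -r | cut -d ' ' -f 1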

Then, after developing my own playing card cipher, and refining it with the help of @timshadel, I decided to actually attempt a legitimate (if still insecure) cipher with Talon. It's still a fortune (BOFH style) and it's still published at 08:08 local time for the same reasons. If you want a crack at decrypting it, check out my playing card cipher repository at https://github.com/atoponce/cardciphers. There should be a new one every day, but it may be possible that the fortune is 1 character too long, and as a result, it doesn't get posted (I've accounted for this, but I'm sure I've missed something).

What's the point? Nothing more than just a bit of fun. It's probably not something you're interested in seeing on your timeline, and I don't blame you. Granted, there will be one of each every day. If you don't have a busy timeline, I guess it could get a bit old. But, I don't plan on stopping, nor using a separate account.

Checksums, Digital Signatures, and Message Authentication Codes, OH MY!

I recently submitted a bug to the Vim project about its Blowfish encryption not using authentication. Bram Moolenaar, the lead developer of Vim, responded about using checksums and digital signatures. I hope he doesn't mind me using him as an example here, but I want to quote the relevant bits (emphasis mine):

The encryption is meant to avoid other people, who don't have the key, from reading the text. It does not have the goal of protecting manipulation of the text, that is something else. You could add a checksum even when not using encryption. I believe it's called signing.

Unfortunately, Bram is confusing checksums, digital signatures, and message authentication codes, all rolled up into one. I don't blame him. This is a topic that is not well understood by those not intimately familiar with cryptography. In a nutshell, each provides data integrity at its core. Where they differ is whether or not encryption keys are used, and whether those keys are symmetric or asymmetric. So, in this post, I would like to break it down.

Checksums

Checksums do not require any sort of encryption key. They are simply digests, or "fingerprints" that represent some data. When you download a piece of software from the Internet, there may be a file with an MD5, SHA-1, or SHA-256 hash of the file. This is the software vendor providing a way for you to verify that you got all the correct bits when the download completes.

For example, suppose you wish to download the latest Debian 8.3.0 amd64 ISO from https://mirrors.xmission.com/debian-cd/8.3.0/amd64/iso-cd/. Notice that there are the following files: MD5SUMS, SHA1SUMS, SHA256SUMS, & SHA512SUMS. Part of the SHA256SUMS file looks like this:

$ head SHA256SUMS
1dae8556e57bb04bf380b2dbf64f3e6c61f9c28cbb6518aabae95a003c89739a  debian-8.3.0-amd64-CD-1.iso
89facfbb5039e49d4e3eeff1cca6ab55e9121ff46affeb46ed510c11731acf41  debian-8.3.0-amd64-CD-10.iso
7f6bc807d3636975374b937c2724353f7468ecd7a61e60f2a8b71f92eeefe629  debian-8.3.0-amd64-CD-11.iso
bd99b7c274ea400b50960ab9e46dd23bad76f87574d2ceee1e8e43859fbd045b  debian-8.3.0-amd64-CD-12.iso
e85679304a509593526cffa77ff0d675329565eb4430444ee2c0d2cdd87842a8  debian-8.3.0-amd64-CD-13.iso
69f727bceb0460957bbd5023fe79749c6bf9f0e3a1b89945e6c63c6b3f04f509  debian-8.3.0-amd64-CD-14.iso
d1dab389f8cb794013986d2da8a6dc72c0be8bc932fcc6d7291cb09b418724d5  debian-8.3.0-amd64-CD-15.iso
913b5d89322b500a02f699d44778901cb59aae909f09bff64963115143c2a6ca  debian-8.3.0-amd64-CD-16.iso
0638aca6f59a8f5bec6d1cd4d272cea01758c2b2d6ec1412048ecb78ef684a77  debian-8.3.0-amd64-CD-17.iso
6f17742fbc82828f04da39f66647e958b0ac667cb4d2a40c9888c749680f1eb8  debian-8.3.0-amd64-CD-18.iso

So, when downloading "debian-8.3.0-amd64-CD-1.iso", I can use the sha256sum(1) command to verify the file:

$ sha256sum debian-8.3.0-amd64-CD-1.iso
1dae8556e57bb04bf380b2dbf64f3e6c61f9c28cbb6518aabae95a003c89739a  debian-8.3.0-amd64-CD-1.iso

The digest matches, so the download was successful and all the correct bits exist. Another way would be to download the SHA256SUMS file, and use the "-c" switch for the utility to verify the checksum automatically, rather than you eyeballing it:

$ sha256sum -c SHA256SUMS 
debian-8.3.0-amd64-CD-1.iso: OK

The important thing to understand about checksums is that they are completely and totally anonymous. There is no secret shared between the server where I downloaded the software and myself, and there is no identity attached to the checksum. This means that anyone can change the original file and recalculate the checksum. So, if transferring data over the Internet, there is nothing preventing a man-in-the-middle from replacing the bits you're downloading with something else, while also replacing the checksum.

In other words, checksums provide data integrity, but they do not offer any sort of authentication. However, there are a number of checksum hashing functions, both cryptographically secure and not, such as CRC, MurmurHash, MD5, SHA-1, SHA-2, SHA-3, and so forth. For non-authenticated data integrity, a cryptographically secure hash function isn't always desirable, which is why non-cryptographic hash functions exist.

Digital Signatures

Digital signatures are a form of checksum, in that they provide data integrity, but they require asymmetric cryptography to also provide authenticity. Digital signatures are a way to attach an identity to the checksum. This implies a level of trust between you and the 3rd party, such as Debian in our example above. If you have met with the 3rd party, or dealt with them enough to establish some level of trust, then you can install the 3rd party's public key into your system. Then, when they provide data attached with a digital signature, you can verify that the data did in fact come from the 3rd party, and no other source.

Notice that a man-in-the-middle attack is no longer valid here, if I already have the 3rd party's public key installed on my system. Going back to our example with Debian, I have the Debian signing public key already installed. So, I can now download the MD5SUMS.sign, SHA1SUMS.sign, SHA256SUMS.sign, or SHA512SUMS.sign file, along with the checksums file I already downloaded, and verify that the checksums are those intended by Debian:

$ gpg --verify SHA256SUMS.sign                   
gpg: assuming signed data in `SHA256SUMS'
gpg: Signature made Sun 24 Jan 2016 11:08:33 AM MST using RSA key ID 6294BE9B
gpg: Good signature from "Debian CD signing key <debian-cd@lists.debian.org>"
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
Primary key fingerprint: DF9B 9C49 EAA9 2984 3258  9D76 DA87 E80D 6294 BE9B

If we look at the contents of the SHA256SUMS.sign file, we get the following:

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQIcBAABCAAGBQJWpRMhAAoJENqH6A1ilL6bYz4P/3ZNCR8N+rrlSSgTN/AkpSVt
WXWg2BTflY3cPYmKK/osJUvLT7HTPDhabPiuQY2jJxrHYJhq5sCOrhbgc4eSRmIf
IsSm7OxQ9TXqde4mg9DVsxmIRui/rVbhEjkAVu47A0eGDUrRxczgJUo14En3jO0Z
qhXypCIN90y8HWaqy6OMe+eCsPyGxmXpWRT1XEH9tOX21wCAaxUl6ZHkiqNqdt8u
Erojls77nlaBR/tvB9CHXTkUmqsocdYD+n5UsvtmLlYN0nz85b7NhrLEW2QtugLd
MJngeI5eJvI4Hjyas0HfSlsdoBAvF+Uw3Dn9aHiTIeWVIeCYUKhdXmLww0dL0n95
jVuBSuMavQwOKKRGTbvG++RET9s/2U/G95wK0Vfx5fsf1neKVJgYf9q9iyObgcH8
dRLAqkgWJBNkvm9oXmpcy7jAq8jlXzDfaPz8plAyqDuIXoOSCHpJ5KAbAS1cYLIT
9U2cQLKTbCPrWJT5xZzOMuCPWu1CzfluDEafFsNzurWG5vCmFEJ+vV9strkeEIuX
tFeKVDkkhVEZYQKSbIlidXBa/WP2Q0g1KvlKXb+nsnWDtWAjLUPD621F3ZjUcjlX
aDPv3J+7kqfryA/7qYMVTH67KY3DwKIDKt6XtquxSf7HuYqEwXKIXp2De7zCCEqH
csWVPFNUQyOdetIC/l/w
=TjXr
-----END PGP SIGNATURE-----

The details of a PGP signature aren't important for this blog post. Suffice it to say that it requires the signer's private key to produce the signature, and the signer's public key to verify it. The sender hashes the message, signs the resulting digest with their private key, and attaches the signature to the message. When the recipient receives the message, they separate the signature from the message, hash the message themselves, and use the sender's public key to verify that the signature matches their computed digest. If it verifies, the recipient knows that only the sender signed the data, and no one else.
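
To see both halves of that process with gpg(1), the signer creates a detached signature with their private key, and anyone holding the public key verifies it (a sketch- only Debian holds the actual CD signing key, of course):

$ gpg --armor --detach-sign --output SHA256SUMS.sign SHA256SUMS
$ gpg --verify SHA256SUMS.sign SHA256SUMS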

Because the private key is required to create the signature, and because only the 3rd party should have access to the private key, a man-in-the-middle attack is no longer effective. A miscreant should not be able to remove the signature, apply a new signature, and have the recipient still verify the signature as "good" from Debian, unless that miscreant also had access to Debian's private signing key.

To make an example of an existing software vendor, Arch Linux took heat over exactly this. A core developer strongly disagreed with digitally signing packages, so software was provided from their repositories with MD5 checksums only. The packages were not digitally signed.

So, when your local Arch Linux installation requested packages from the Arch Linux software repository, unless served over HTTPS, a man-in-the-middle could substitute their own bits, with their own MD5 checksum. Your pacman(8) package manager would verify that the MD5 was valid, and proceed to install the software with root privileges, because that is what you told it to do. By also digitally signing the package with an Arch Linux signing key, this attack is no longer possible.

Eventually, Arch Linux fixed the vulnerability, and closed a very large security hole, by digitally signing their packages.

As with checksums, digital signatures really should use a cryptographically secure hashing function as part of the protocol. This can include RIPEMD-160, SHA-2, SHA-3, BLAKE2, Skein, and others. MD5 and SHA-1 are no longer considered cryptographically secure, and should not be used with digital signatures (thus the move to SHA-256 SSL certificates instead of SHA-1).

Message Authentication Codes

Finally, message authentication codes (also called "MAC tags") are another way to provide data integrity with authentication, but this time using symmetric encryption. Where digital signatures imply a physical identity behind the authentication, MAC tags provide anonymous authentication. Generally, symmetric keys don't have identities associated with them. They're usually short-lived and shared via complex key exchanges, such as the Diffie-Hellman key exchange.

A MAC tag is keyed, meaning a shared secret is used when calculating the digest. There are a number of different implementations of MACs, such as CBC-MAC, HMAC, UMAC, & Poly1305, among others. The differences between each of those isn't important for this post. What is important, is how they are calculated, and how they are used with encryption.

The sender of some message applies a keyed cryptographic hashing function to the message, appends (or prepends) the resulting MAC tag, and sends the full payload off. Because both the ciphertext and the MAC were calculated with a shared secret key, a man-in-the-middle cannot strip the MAC tag and apply their own without knowing the shared secret. When the recipient receives the payload, they strip off the MAC tag, rehash the message with the same keyed hashing function, and see if the two MAC tags match. If they match, the message can be acted upon. If they do not match, something happened to the data in transit, and the payload can be safely ignored.

There are three main ways to apply MAC tags to messages: encrypt-then-MAC, MAC-then-encrypt, and encrypt-and-MAC. The first, encrypt-then-MAC, is considered "best practice" for message authentication. First the message is encrypted, then the ciphertext is authenticated, and the resulting MAC is appended to the ciphertext. This provides both ciphertext and plaintext integrity. The big advantage to this approach is that if the MAC tag does not match the newly calculated MAC tag during verification, the ciphertext does not need to be decrypted. This is the default approach in IPsec and modern versions of OpenSSH, and RFC 7366 standardizes it for TLS (yet to be implemented by OpenSSL, last I checked). It is also an ISO/IEC 19772:2009 standard.

Encrypt-then-MAC flowchart.

The next approach, MAC-then-encrypt, means authenticating the plaintext, appending the resulting MAC tag to the plaintext, and then encrypting the full plaintext and MAC tag payload. While this approach offers plaintext data integrity, it does not offer ciphertext integrity. As such, the ciphertext must be decrypted before the MAC tag can be verified. This is the default behavior in older versions of OpenSSH.

MAC-then-encrypt flowchart.

Finally, encrypt-and-MAC means authenticating the plaintext first, then encrypting the plaintext, with the resulting MAC tag appended to the ciphertext. Again, like MAC-then-encrypt, this approach offers plaintext integrity, but not ciphertext integrity. So, you must detach the MAC tag first, then decrypt the ciphertext, then verify the MAC tag against the decrypted plaintext. This is the default behavior with OpenSSL.

Encrypt-and-MAC

As I understand it, there are no known vulnerabilities in MAC-then-encrypt or encrypt-and-MAC themselves. However, having both ciphertext and plaintext integrity with encrypt-then-MAC, as well as not needing to decrypt the ciphertext on a failed verification, is why encrypt-then-MAC is the preferred way to handle message authentication.

As with digital signatures, MACs should be calculated with cryptographically secure hashing functions, such as RIPEMD-160, SHA-2, SHA-3, BLAKE2, Skein, etc. MD5 and SHA-1 would not qualify (although we could get into a discussion about HMAC-MD5 and HMAC-SHA1, but we won't).

Conclusion

No doubt, it's confusing to separate checksums from digital signatures from message authentication codes. Things get even a bit more hairy with blind signatures (used primarily in digital currencies) and Merkle trees (used primarily in peer-to-peer networks and copy-on-write filesystems), but they're special cases of the primary three functions discussed above. However, if you can get checksums, digital signatures, and message authentication codes cleared up, then you're that much closer to implementing cryptographic protocols correctly.

Bitcoin Mining Rate and Waste

Recently, the Bitcoin mining rate surpassed 1 exahash per second, or 1 quintillion SHA-256 hashes per second.

Bitcoin mining graph showing gigahashes per second over time.

If we do some quick math, we can determine the following:

  • If SHA-1 collisions can be found in 2^65.3 hashes, that's one SHA-1 collision found every 45 seconds.
  • Every combination of bits in an 84-bit keyspace can be searched every year.
  • If mining is done strictly with ASICs, and each ASIC can produce 1 trillion hashes per second, that's 1,000,000 ASICs.
  • If each ASIC above consumes 650 Watts of power, that's 650 Megawatts of power consumed.
  • At 650 Megawatts, that's 650,000 kWh consumed, and roughly 1.3 million pounds of CO2 released into the atmosphere, every hour, if using fossil fuels.
  • The current global rate is about 160 Bitcoins mined per hour.
  • At $0.15 USD per kWh, that's $609 spent on electricity per Bitcoin mined. Bitcoin is currently trading at $376/BTC.
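
If you want to double-check a few of those figures, bc(1) handles the arithmetic:

$ echo 'e(l(2) * 65.3) / 10^18' | bc -l          # seconds per SHA-1 collision: roughly 45
$ echo 'l(10^18 * 60*60*24*365) / l(2)' | bc -l  # bits searchable per year: roughly 84.7
$ echo '650000 * 0.15 / 160' | bc -l             # electricity cost per BTC: $609.375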

That's "back of the envelope" calculations, with some big assumptions made about the mining operation (how it's powered, who is powering it, etc.).

Of course, not all mining uses fossil fuels, not all miners are using ASICs, not all ASICs can do 1 trillion hashes per second (some more, some less), not all ASICs consume that much wattage for that rate, and the cost of electricity is strictly a U.S. figure. And if you're using a GPU, or worse, a CPU, for your mining, then you're expending more electricity per hash than ASICs are. That may help balance out some of the miners who are using renewable energy, such as solar power, for their mining. Many Chinese and Russian mining data centers certainly have lower electricity overhead. You get the point- we've made some big assumptions to come to some very rough "ballpark" figures. I don't think we're too far off.

So, according to those numbers, unless you're using renewable energy or cheaper electricity, or Bitcoin trading goes north of $610 USD/BTC, mining for Bitcoin is a net loss. This comes at the expense of 1.3 million pounds of CO2 released into the atmosphere every hour. I would argue that Bitcoin is the worst idea to come out of Computer Science in the history of mankind.

Using Your Monitors As A Cryptographically Secure Pseudorandom Number Generator

File this under the "I'm bored and have nothing better to do" category. While coming into work this morning, I was curious if I could use my monitors as a cryptographically secure pseudorandom number generator (CSPRNG). I don't know what use this would have, if any, as your GNU/Linux operating system already ships a CSPRNG with /dev/urandom. So, in reality, there is really no need to write a userspace CSPRNG. But what the hell, let's give it a try anywho.

The "cryptographically secure" piece of this will come from the SHA-512 function. Basically, the idea is this:

  • Take a screenshot of your monitors.
  • Take the SHA-512 of that screenshot.
  • Resize the screenshot to 10% of its original size.
  • Take the SHA-512 of that resized file.
  • Take the SHA-512 of your previous two SHA-512 digests.
  • Take the last n-bits of that final digest as your random number.

Most GNU/Linux systems come with ImageMagick pre-installed, as well as the sha512sum(1) utility. So, thankfully, we won't need to install any software. Here's a simple shell script that can achieve our goals:

#!/bin/sh
# Produces random numbers in the range of [0, 65535].
# Not licensed. Released to the public domain.

cd ~/Private # assuming you have an encrypted filesystem mounted here

TS1=$(date +%Y%m%d%H%M%S%N)
import -window root ${TS1}.png
sha512sum ${TS1}.png > /tmp/SHA512SUMS

TS2=$(date +%Y%m%d%H%M%S%N)
convert -scale 10% ${TS1}.png ${TS2}.png
sha512sum ${TS2}.png >> /tmp/SHA512SUMS

DIGEST=$(sha512sum /tmp/SHA512SUMS)
# convert the last 4 hex characters (16 bits) of the digest to decimal
printf "%d\n" 0x$(printf "$DIGEST" | awk '{print substr($1, 125, 4)}')

shred ${TS1}.png ${TS2}.png
rm ${TS1}.png ${TS2}.png
shred /tmp/SHA512SUMS
rm /tmp/SHA512SUMS

Running it for 10 random numbers:

$ for i in {1..10}; do sh monitor-csprng.sh; done
15750
36480
64651
7942
2367
10905
53889
9346
52726
63570

A couple things to note:

  • This is slow, due to taking the screenshot, and resizing it.
  • The data on your monitors should be sufficiently random. Chats, social updates, etc. The security of this will depend entirely on the entropy of the initial screenshot.
  • You really should be saving your screenshots to an encrypted filesystem, such as eCryptfs.
  • We're using timestamps with nanosecond accuracy to provide some additional entropy for the final SHA-512 digest.
  • This uses the last 4 hexadecimal characters of the digest, converted to decimal. In reality, it could be anything, including some convoluted dynamic search algorithm through the string.

It's worth noting that the entropy of the initial screenshot is critical, and that entropy is actually difficult to accurately measure. So, it may help to have one or more chat windows open, with recent chat logs. The same could be said for social update "walls" showing the most recent updates (Twitter, Facebook, Goodreads, etc.). Having a clock with seconds ticking in a status bar can also help (not unpredictable, but at least semi-unique). Tabs in browsers, running applications, etc. The more unpredictable your workspace in the screenshot, the better off you'll be. But, people in general suck at randomness, so I'm not advocating this as something you should rely on for a cryptographically secure random number generator.

If you wanted, you could add this to a terminal, giving you a sort of "disco rave" before taking the screenshot:

#!/bin/sh
# Disco lights in your terminal
# No license. Released to the public domain

while true; do
    printf "\e[38;5;$(($(od -d -N 2 -A n /dev/urandom)%$(tput colors)))m•\e[0m"
done

Get that running first, then take your screenshot. But then, if you're reading data off of /dev/urandom, you might as well do that for your random numbers anyway...