Image of the glider from the Game of Life by John Conway
Skip to content

1,000 Books Read In One Year? No, Not By A Long Shot

Recently, Goodreads sent out a tweet about how to remove social media and the Internet from your life, so you can focus on reading 1,000 books in one year. The post follows this sort of math:

  1. The average person reads 400 words per minute.
  2. The typical non-fiction books have around 50,000 words.
  3. Reading 200 books will take you 417 hours.
  4. The average person spends 608 hours on social media annually.
  5. The average person spends 1,642 hours watching TV annually.
  6. Giving up 2,250 hours annually will allow you to read 1,000 books in one year.

This blew my mind. I'm a very avid reader. Since signing up for Goodreads in 2013, I've been hitting at least 20,000 pages read every year, and I'm on track to read 25,000 pages this year. But, I'm only putting down 75 books each year. Now granted, a spare 2,250 hours per year ÷ 365 days per year is just over 6 hours per day of reading. I'm not reading 6 hours per day. I don't watch TV, I have a job, kids and a wife to take care of, and other things that keep me off the computer most of my time at home (I'm writing this blog post after midnight).

No doubt, 6 hours per day is a lot of reading. But I average 2 hours per day, and I'm only putting down 75 books annually. 6 hours of reading per day would only put me around 225 books each year, a far cry from the 1,000 I should be hitting. What gives?

Well, it turns out, Charles Chu is being a bit ... liberal with his figures. First off, the average person does not read 400 words per minute. Try about half, at only 200 words per minute, according to Iris Reading, a company that sells a product on improving your reading speed and memory comprehension. This cuts our max books from 1,000 in a year to 500.

Second, Chu claims the average non-fiction book is 50,000 words in length. I can tell you that 50,000 words is a very slim novel. This feels like a Louis L'Amour western length to me. Most books that I have read are probably closer to twice that length. However, according to HuffPost which quotes Amazon Text Stats, the average book is 64,000 words in length. But, according to this blog post by Writers Workshop, the average "other fiction" novel length is 70,000 to 120,000 words. This feels much more in line with what I've read personally, so I'll go with about 100,000 words in a typical non-fiction novel.

So now that brings our annual total down from 500 to 250 books. That's reading 200 words per minute, for 6 hours every day, with non-fiction books that average 100,000 words in length. I claimed that I would probably come in around 225 books, so this seems to be a much closer ballpark figure.

But, does it line up? Let's look at it the another way, and see if we can agree that 200-250 books annually, reading 6 hours per day, is more realistic.

I claimed I'm reading about 2 hours per day. I read about 3 hours for 4 of the 7 days in a week while commuting to work. For the other three days in the week, I can read anywhere from 1 hour to 3 hours, depending on circumstances. So my week can see anywhere from 13 hours to 15 hours on average. That's about 2 hours per day.

During 2016, I read 24,048 pages. That's about 65 pages per day, which feels right on target. But, how many words are there per page? According to this Google Answers answer, which offers a couple citations, a novel averages about 250 words per page.

But, readinglength.com shows that many books I've read are over 300 words per page, and some denser at 350 words per page, with the average sitting around 310. So 250 words per page at 65 pages per day is 16,250 words per day, and 310 words per page at 65 pages per day 20,150 pages that I'm reading.

Because I'm only reading about 2 hours per day, that means I'm reading at a meager 135 to 168 words per minute, based on the above stats. I guess I'm a slow reader.

If I highball it at 168 words per minute, then in 6 hours, I will have read 60,480 words. After a year of reading, that's 22,075,200 words. An independent blog post confirms this finding of 250-300 words per page, but also uses that to say that most adult books are 90,000 - 100,000 words in length (additional confirmation from earlier), and young adult novels target the 55,000 word length that Chu cited (maybe Chu likes reading young adult non-fiction?). As such, I can expect to read 22,075,200 words per year ÷ 100,000 words per book, or about 220 books in a year of reading 6 hours every day.

Bingo.

So, what can we realistically expect from reading?

  1. Readers average 200 words per minute.
  2. A page averages 250 words.
  3. A novel averages 100,000 words.
  4. One hour of reading per day can hit 30-40 books per year.
  5. Six hours of reading per day can hit 200-250 books per year.
  6. To read 1,000 books in a year, you need to read 22 hours per day.

This is reading average length adult non-fiction books at an average speed of 200 words per minute. The calculus completely changes if your average reading speed is faster than 200 wpm, you read primarily graphic novels with little text, or read shorter non-fiction novels. Fourth-grade chapter books? Yeah, I could read 1,000 of those in a year. 🙂

Password Best Practices I - The Generator

This is the first in a series of posts about password best practices. The series will cover best practices from a few different angles- the generator targeted at developers creating those generators, the end user (you, mom, dad, etc.) as you select passwords for accounts from those generators, and the service provider storing passwords in the database for accounts that your users are signing up for.

Motivation

When end users are looking for passwords, they may turn to password generators, whether they be browser extensions, websites, or offline installable executables. Regardless, as a developer, you will need to ensure that the passwords you your providing for your users are secure. Unfortunately, that's a bit of a buzzword, and can be highly subjective. So, we'll motivate what it means to be "secure" here:

  • The generator is downloaded via HTTPS, whether it's a website, executable ZIP, or browser extension.
  • The generator uses a cryptographically secure random number generator.
  • The generator provides at least 70-bits of entropy behind the password.
  • The generator is open source.
  • The generator generates passwords client-side, not server-side.
  • The generator does not serve any ads or client-side tracking software.

I think most of us can agree on these points- the software should be downloaded over HTTPS to mitigate man-in-the-middle attacks. A cryptographically secure RNG should be used to ensure unpredictability in the generated password. In addition to that, the CRNG should also be uniformly distributed across the set, so no elements of the password are more likely to appear than any other. Creating an open source password generator ensures that the software can be audited for correctness and instills trust in the application. Generating passwords client-side, means the server hos now possible way of knowing what passwords were generated, unless the client is also calling home (the code should be inspected). And of course, we don't want any adware or malware installed in the password generating application to further compromise the security of the generator.

Okay. That's all well and good, but what about this claim to generate passwords from at least 70-bits in entropy? Let's dig into that.

Brute force password cracking

Password cracking is all about reducing possibilities. Professional password crackers will have access to extensive word lists of previously compromised password databases, they'll have access to a great amount of hardware to rip through the password space, and they'll employ clever tricks in the password cracking software, such as Hashcat or MDXFind, to further reduce the search space, to make finding the passwords more likely. In practice, 90% of leaked hashed password databases are reversed trivially. With the remaining 10%, half of that space takes some time to find, but those passwords are usually recovered. The remaining few, maybe 3%-5%, contain enough entropy that the password cracking team likely won't recover those passwords in a week, or a month, or even a year.

So the question is this- what is that minimum entropy value that thwarts password crackers? To answer this question, let's look at some real-life brute force searching to see if we can get a good handle on the absolute minimum security margin necessary to keep your client's leaked password hash out of reach.

Bitcoin mining

Bitcoin mining is the modern-day version of the 1849 California Gold Rush. As of right now, Bitcoin is trading at $3,665.17 per BTC. As such, people are fighting over each other to get in on the action, purchasing specialized mining hardware, called "Bitcoin ASICs", to find those Bitcoins as quickly as possible. These ASICs are hashing blocks of data with SHA-256, and checking a specific difficulty criteria to see if it meets the requirements as a valid Bitcoin block. If so, the miner that found that block is rewarded that Bitcoin and it's recorded in the never-ending, ever-expanding, non-scalable blockchain.

How many SHA-256 hashes is the word at large calculating? As of this writing, the current rate is 7,751,843.02 TH/s, which is 7,751,843,020,000,000,000 SHA-256 hashes per second. At one point, it peaked at 8,715,000 THps, and there is no doubt in my mind that it will pass 10,000,000 THps before the end of the year. So let's run with that value, of 10,000,000,000,000,000,000 SHA-256 hashes per second, or 1019 SHA-256 hashes per second.

If we're going to talk about that in terms of bits, we need to convert it to a base-2 number, rather than base-10. Thankfully, this is easy enough. All we need to calculate is the log2(X) = log(X)/log(2). Doing some math, we see that Bitcoin mining is roughly flipping every combination of bits in a:

  • 63-bit number every second.
  • 69-bit number every minute.
  • 74-bit number every hour.
  • 79-bit number every day.
  • 84-bit number every month.
  • 88-bit number every year.

What does this look like? Well, the line is nearly flat. Here in this image, the x-axis is the number of days spent mining for Bitcoin, starting from 0 through a full year of 365 days. The y-axis is the search space exhaustion in bits. So, you can see that in roughly 45 days, Bitcoin mining have calculated enough SHA-256 hashes to completely exhaust an 85-bit search space (click to enlarge):

Plot showing log(x*10^19*86400)/log(2) for Bitcoin mining.

Real-world password cracking

That's all fine and dandy, but I doubt professional password crackers have access to that sort of hardware. Instead, let's look at a more realistic example.

Recently, Australian security researcher Troy Hunt, the guy that runs https://haveibeenpwned.com/, released a ZIP of 320 million SHA-1 hashed passwords that he's collected over the years. Because the passwords were hashed with SHA-1, recovering them should be like shooting fish in a barrel. Sure enough, a team of password crackers got together, and made mincemeat of the dataset.

In the article, it is mentioned that they had a peak password cracking speed of 180 GHps, or 180,000,000,000 SHA-1 hashes per second, or 18*1010 SHA-1 hashes per second. The article mentions that's the equivalent of 25 NVidia GTX1080 GPUs working in concert. To compare this to Bitcoin mining, the team was flipping every combination of bits in a:

  • 41-bit number every second.
  • 47-bit number every minute.
  • 53-bit number every hour.
  • 58-bit number every day.
  • 63-bit number every month.
  • 66-bit number every year.

As we can see, this is a far cry from the strength of Bitcoin mining. But, are those numbers larger than you expected? Let's see how it looks on the graph, compared to Bitcoin (click to enlarge):

Plot showing log(x*18*10^10*86400)/log(2) for this cluster of password cracking hobbyists.

So, it seems clear that our security margin is somewhere above that line. Let's look at one more example, a theoretical one.

Theoretical password cracking by Edward Snowden

Before Edward Snowden became known to the world as Edward Snowden, he was known to Laura Poitras as "Citizenfour". In emails back-and-forth between Laura and himself, he told her (emphasis mine):

"Please confirm that no one has ever had a copy of your private key and that it uses a strong passphrase. Assume your adversary is capable of one trillion guesses per second. If the device you store the private key and enter your passphrase on has been hacked, it is trivial to decrypt our communications."

But one trillion guesses per second is only about 5x the collective power of our previous example of a small team of password cracking hobbyists. That's only about 125 NVidia GTX1080 GPUs. Certainly interested adversaries would have more money on hand to invest in more computing power than that. So, let's increase the rate to 10 trillion guesses per second. 1,250 NVidia GTX1080 GPUs would cost our adversary maybe $500,000. A serious investment, but possibly justifiable, and certainly not outside the $10 billion annual budget of the NSA. So let's roll with it.

At 1013 password hashes per second, we are flipping every combination of bits in a:

  • 43-bits every second.
  • 49-bits every minute.
  • 54-bits every hour.
  • 59-bits every day.
  • 64-bits every month.
  • 68-bits every year.

Plotting this on our chart with both Bitcoin mining and clustered hobbyist password cracking, we see (click to enlarge):

Plot of log(x*86400*10^13)/log(2)

The takeaway

What does all this math imply? That as a developer of password generator software, you should be targeting a minimum of 70-bits of entropy with your password generator. This will give your users the necessary security margins to steer clear of well-funded adversaries, should some service provider's password database get leaked to the Internet, and they find themselves as a target.

As a general rule of thumb, for password generator developers, these are the sort of security margins your can expect with entropy:

  • 70-bits or more: Very secure.
  • 65-69 bits: Moderately secure.
  • 60-64 bits: Weakly secure.
  • 59 bits or less: Not secure.

Colored recommendation of the previous plot showing all brute force attempts.

What does this mean for your generator then? This means that the number of size of the password or passphrase that you are giving users should be at least:

  • Base-94: 70/log2(94)=11 characters
  • Base-64: 70/log2(64)=12 characters
  • Base-32: 70/log2(32)=14 characters
  • Base-16: 70/log2(16)=18 characters
  • Base-10: 70/log2(10)=22 characters
  • Diceware: 70/log2(7776)=6 words

Now, there is certainly nothing wrong with generating 80-bit, 90-bit, or even 128-bit entropy. The only thing you should consider with this, is the size of the resulting password and passphrases. For example, if you were providing a minimum of 128-bit security for your users with the password generator, then things would look like:

  • Base-94: 128/log2(94)=20 characters
  • Base-64: 128/log2(64)=22 characters
  • Base-32: 128/log2(32)=26 characters
  • Base-16: 128/log2(16)=32 characters
  • Base-10: 128/log2(10)=39 characters
  • Diceware: 128/log2(7776)=10 words

As you can see, as you increase the security for your users, the size of the generated passwords and passphrases will also increase.

Conclusion

It's critical that we are doing right by our users when it comes to security. I know Randall Munroe of XKCD fame created the "correct horse battery staple" comic, advising everyone to create 4-word passphrases. This is fine, provided that those 4 words meets that minimum 70-bits of entropy. In order for that to happen though, the word list needs to be:

   4 = 70/log2(x)
=> 4 = 70/log(x)/log(2)
=> 4 = 70*log(2)/log(x)
=> 4*log(x) = 70*log(2)
=> log(x) = 70/4*log(2)
=> x = 1070/4*log(2)
=> x ~= 185,364

You would need a word list of at least 185,364 words to provide at least 17.5-bits of entropy per word, which brings us to required 70-bits of total entropy for 4 words. All too often, I see generators providing four words, but the word list is far too small, like around Diceware size, which is only around 51-bits of entropy. As we just concluded, that's not providing the necessary security for our users.

So, developers, when creating password and passphrase generators, make sure they are at least targeting the necessary 70-bits of entropy, in addition to the other qualifications that we outlined at the beginning of this post.

Colorful Passphrases

Since the development of my passphrase and password generator, I started working toward improving the other online generators out there on the web. I created a Google Spreadsheet to work toward that goal, by doing reasonable audits to "rank" each generator, and see how they stacked up against the rest. Then, I started submitting patches in hopes of making things better.

One passphrase generator that was brought to my attention was Pass Plum. Pass Plum supplies an example word list to use for generating your passphrases, if you choose to install the software on your own server. Unfortunately, the list is only 140 words in size, so if you choose to use that for your word list, then you only get about 7.13-bits of entropy per word. Sticking to the default configuration 4 words given to the user, that's a scant 28-bits of security on your passphrase, which is trivially reversed. I submitted a pull request to extend it to 4,096 words, providing exactly 13-bits of entropy per word, or about 52-bits of entropy for a 4-word passphrase- a significant improvement.

I noticed, however, that the default list was nothing but color names, and that got me thinking- what if not only the generator provided color names for passphrases, but also colored the word that color name? Basically, a sort of false visual synesthesia. What I want to know is this, is it easier to remember passphrases when you can associate each word with a visual color?

So, over the past several nights, and during weekends, I've been putting this together. So, here is is- colorful passphrases.

Collage of 4 separate screenshots of what the color passphrase generator looks like.

Head over to my site to check it out. If a color is too light (its luma value is very high), then the word is outlined with CSS. Every word is bold, to make the word even more visible on the default white background.

As I mentioned, the idea is simple: people struggle remembering random meaningless strings of characters for passwords, so passphrases are a way to make a random series of words easier to recall. After all, it should be easier to remember "gnu hush gut modem scamp giddy" than it is to remember "$5hKXuE[\NK". It's certainly easier to type on mobile devices, and embedded devices without keyboards, like smart TVs and video game consoles.

But, even then, there is nothing that is really tying "gnu hush gut modem scamp giddy" together, so you force yourself in some sort of mnemonic to recall it. Visually stimulated color passphrases have the benefit of not only using a mnemonic to recall the phrase, but an order of colors as well. For example, you might not recall "RedRobin Pumpkin Revolver DeepPuce Lucky Crail TealDeer", but you may remember its color order of roughly "red orange black purple gold brown teal". "A RedRobin is red. A pumpkin is orange. A revolver (gun) is black. DeepPuce is a purple. Lucky coins are gold. Crail, Soctand has brown dirt. TealDeer are teal."

However, it also comes with a set of problems. First, what happens if you actually have visual synesthesia? Will seeing these colors conflict with your mental image of what the color should be for that word? Second, many of the words are very obscure, such as "Crail" or "Tussock" or "Tuatara" (as all seen in the previous screenshot collage). Finally, what happens when you have a color passphrase where two similar colors are adjacent to each other? Something like "Veronica Affair Pipi DeepOak Atoll BarnRed RedOxide"? Both "BarnRed" and "RedOxide" are a deep reddish color. Will it be more difficult to recall which comes first?

A screenshot showing a color collision between "BarnRed" and "RexOxide"

As someone who is interested in password research, I wanted to see what sort of memory potential visually colorful passphrases could have. As far as I know, this has never been investigated before (at least I could find any research done in this area, and I can't find any passphrase generators doing it). This post from Wired investigates alternatives to text entry for password support, such as using color wheels, but doesn't say anything about visual text. Here is a browser extension that colors password form fields on websites, with the SHA-1 hash of your password as you type it. You know if it's correct, by recognizing if the pattern is the same it always is when logging in.

Long story short, I think I'm wading into unknown territory here. If you find this useful, or even if you don't, I would be very interested in your feedback.

A Practical and Secure Password and Passphrase Generator

The TL;DR

Go to https://ae7.st/g/ and check out my new comprehensive password and passphrase generator. Screenshots and longer explanation below.

Introduction

Sometime during the middle of last summer, I started thinking about password generators. The reason for this, was that I noticed a few things when I used different password generators, online or offline:

  1. The generator created random meaningless strings.
  2. The generator created XKCD-style passphrases.
  3. The generator gave the user knobs and buttons galore to control
    • Uppercase characters
    • Lowercase characters
    • Digits
    • Nonalphanumeric characters
    • Pronounceable passwords
    • Removing ambiguous characters
    • Password Length

The Problem

Here is just one example of what I'm talking about:

Screenshot showing a "secure" password generator from a website.

This password generator has a lot of options for tweaking your final password.

Ever since Randal Munroe published https://xkcd.com/936/, people started creating "XKCD-style" passphrase generators. Here's a very simple one that creates a four-word passphrase. No knobs, bells, or whistles. Just a button to generate a new XKCD passphrase. Ironically, the author provides an XKCD passphrase generator for you to use, then tells you not to use it. 🙂

On the other hand, why not make the XKCD password generation as complex as possible? Here at https://xkpasswd.net/s/, not only do you have an XKCD password generator, but you have all the bells, whistles, knobs, buttons, and control to make it as ultimately complex as possible. Kudos to the generator even make entropy estimates about the generated passwords!

Screenshot showing a very complex control board of an XKCD style password generator.

Why not add all the complexity of password generation to XKCD passwords?

What bothers me about the "XKCD password" crowd, however, is that no one knows that Diceware was realized back in 1995, making passphrases commonplace. Arnold Reinhold created a list of 7,776 words, enough for every combination of a 6-sided die rolled 5 times. Arnold explains that the passphrase needs to be chosen from a true random number generator (thus the dice) and as a result each word in the list will have approximately 12.9-bits of entropy. Arnold recommends throwing the dice enough times to create a five-word Diceware passphrase. That would provide about 64-bits of entropy, a modestly secure result.

A five-word Diceware passphrase could be:

  • soot laid tiger rilly feud pd
  • 31 al alibi chick retch bella
  • woven error rove pliny dewey quo

My Solution

While these password generators are all unique, and interesting, and maybe even secure, it boils down to the fact that my wife, never mind my mom or grandma, isn't going to use them. They're just too complex. But worse, they give the person using them a false sense of security, and in most cases, they're not secure at all. I've talked with my wife, family, and friends about what it requires to have a strong password, and I've asked them to give me examples. You can probably guess what I got.

  • Spouse's first name with number followed by special character. EG: "Alan3!"
  • Favorite sports team in CamelCase. EG: "UtahUtes"
  • Keyboard patterns. EG: "qwertyasdf"

The pain goes on and on. Usually, the lengths of each password is somewhere around 6-7 characters. However, when you start talking about some of these generators, and they see passwords like "(5C10#+b" or "V#4I5'4c", their response is usually "I'm never going to remember that!". Of course, this is a point of discussion about password managers, but I'll save that for another post.

So I wanted to create a password and passphrase generator that met everyone's needs:

  • Simplicity of use
  • Length and complexity
  • Provably secure
  • Desktop and mobile friendly

If you've been a subscriber to my blog, you'll know that I post a lot about Shannon entropy. Entropy is maximized when a uniform unbiased random function controls the output. Shannon entropy is just a fancy way for estimating the total number of possibilities something could be, and it's measured in bits. So, when I say a Diceware passphrase as approximately 64-bits of entropy, I'm saying that the passphrase that was generated is 1 in 2^64 or 18,446,744,073,709,551,616 possibilities. Again, this is only true if the random function is uniform and unbiased.

So, I built a password generator around entropy, and entropy only. The question became, what should the range be, and what's my threat model? I decided to build my threat model after offline brute force password cracking. A single computer with a few modest GPUs can work through every 8-character password built from all 94 graphical characters on the ASCII keyboard hashed with SHA-1 in about a week. That's 94^8 or 6,095,689,385,410,816 total possibilities. If chosen randomly, Shannon entropy places any password built from that set at about 52-bits. If the password chosen randomly from the same set of 94 graphical characters was 9 characters long, then the password would have about 59-bits of Shannon entropy. This would also take that same GPU password cracking machine 94 weeks to fully exhaust every possibility.

This seemed like a good place to start the range. So, for simplicity sake, I started the entropy range at 55-bits, then incremented by 5 bits until the maximum of 80-bits. As you can see from the screenshot of the entropy toolbar, 55-bits is red as we are in dangerous territory of an offline password cracker with maybe a cluster of GPUs finding the password. But things get exponentially expensive very quickly. Thus, 60-bits is orange, 65-bits is yellow, and 70-bits and above are green. Notice that the default selection is 70-bits.

Screenshot showing the entropy toolbar of my password generator.

The entropy toolbar of my password generator, with 70-bits as the default.

When creating the generator, I realized that some sites will have length restrictions on your password, such as not allowing more than 12 characters, or not allowing certain special characters, or forcing at least one uppercase character and one digit, and so forth. Some service providers, like Google, will allow you any length with any complexity. But further, people remember things differently. Some people don't need to recall the passwords, as they are using password managers on all their devices, with a synced database, and can just copy/paste. Others want to remember the password, and others yet want it easy to type.

So, it seemed to me that not only could I build a password generator, but also a passphrase generator. However, I wanted this to be portable, so rather than create a server-side application, I made a client-side one. This does mean that you download the wordlists as you need them to generate the passphrases, and the wordlists are anything but light. However, you only download them as you need them, rather than downloading all on page load.

To maximize Shannon entropy, I am using the cryptographically secure pseudorandom number generator from the Stanford Javascript Crypto Library. I'm using this, rather than the web crypto API, because I use some fairly obscure browsers, that don't support it. It's only another 11KB download, which I think is acceptable. SJCL does use the web crypto API to seed its generator, if the browser supports it. If not, a entropy collector listener event is launched, gathering entropy from mouse movements. The end result, is that Shannon entropy is maximized.

Passphrases

There are 5-types of passphrases in my generator:

  • Alternate
  • Bitcoin
  • Diceware
  • EFF
  • Pseudowords

Diceware

For the Diceware generator, I support all the languages that you'll find on the main Diceware page, in addition to the Beale word list. As of this writing, that's Basque, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Esperanto, Finnish, French, German, Italian, Japanese (Romaji), Maori, Norwegian, Polish, Portuguese, Russian, Slovenian, Spanish, Swedish, and Turkish. There are 7,776 words in each word list, providing about 12.9248-bits of entropy per word.

EFF

For the EFF generator, I support the three word lists that the EFF has created- the short word list, the long word list, and the "distant" word list, where every work has an edit distance of at least three from the others in the list. The long list is similar to the Diceware list, in that it is 7,776 words providing about 12.9248-bits of entropy per word. However, the number of characters in each word in the word list are longer on average, at around 7 characters per word than the Diceware word list, at around 4.3 characters per word. So, for the same entropy estimate, you'll have a longer EFF passphrase than a Diceware passphrase. The short word list contains only 1,296 words, to be used with 4 dice, instead of 5, and the maximum character length of any word is 5 characters. The short word list provides about 10.3399-bits of entropy per word. Finally, the "distant" word list is short in number of words also at 1,296 words, but longer in character count, averaging 7 characters per word.

Bitcoin

For the Bitcoin generator, I am using the BIP-0039 word lists to create the passphrase. These lists are designed to be a mnemonic code or sentence for generating deterministic Bitcoin wallets. However, because they are a list of words, they can be used for building passphrases too. Each list is 2,048 words, providing exactly 11-bits of entropy per word. Like Diceware, I support all the languages of the BIP-0039 proposal, which as of this writing includes Simplified Chinese, Traditional Chinese, English, French, Italian, Japanese (Hiragana), Korean (Hangul), and Spanish.

Alternate

Elvish

In the Alternate generator, I have a few options that provide various strengths and weaknesses. The Elvish word list is for entertainment value only. The word list consists of 7,776 words, making it suitable for Diceware, and provides about 12.9248-bits of entropy per word. However, because the generator is strictly electronic, and I haven't assigned dice roll values to each word, I may bump this up to 8,192 words providing exactly 13-bits of entropy per word. The word list was built from the Eldamo lexicon.

Klingon

Another passphrase generator for entertainment value is the Klingon generator. This word list comes from the Klingon Pocket Dictionary, and my word list provides exactly 2,604 unique words from the 3,028 words in the Klingon language. Thus, each word provides about 11.3465-bits of entropy.

PGP

The PGP word list was created to make reading hexadecimal strings easier to speak and phonetically unambiguous. It comprises of exactly 256 words providing exactly 8-bits of entropy per word. This generator works well in noisy environments, such as server rooms, where passwords need to be spoken from one person to another to enter into a physical terminal.

Simpsons

The Simpson's passphrase generator consists of 5,000 words, providing about 12.2877-bits of entropy per word. The goal of this generator is not only educational to show that any source of words can be used for a password generator, such as a television series of episodes, but also more memorable. Because this list contains the most commonly spoken 5,000 words from the Simpson's episodes, a good balance of verbs, nouns, adjectives, etc. are supplied. As such, the generated passphrases seem to be easier to read, and less noun-heavy than the Diceware or EFF word lists. These passphrases may just be the easiest to recall, aside from the Trump word list.

Trump

And now my personal favorite. The Trump generator was initially built for entertainment purposes, but ended up having the advantage of providing a good balanced passphrase of nouns, verbs, adjectives, etc. much like the Simpson's generator. As such, these passphrases may be easier to recall, because they are more likely to read as valid sentences than the Diceware or EFF generators. This list is pulled from Donald J. Trump's Twitter account. The list is always growing, currently at 5,343 words providing about 12.3404-bits of entropy per word.

Pseudowords

The pseudowords generator is a cross between unreadable/unpronounceable random strings and memorable passphrases. They are pronounceable, even if the words themselves are gibberish. They are generally shorter in practice than passphrases, and longer than pure random strings. The generators are here to show what you can do with random pronounceable strings.

Bubble Babble

Bubble Babble is a hexadecimal encoder, with builtin checksumming, initially created Antti Huima, and implemented in the original proprietary SSH tool (not the one by the OpenSSH developers). Part of the specification is that every encoded string begins and ends with "x". However, rather than encode data from the RNG, it is randomly generating 5-characters words in the syntax of "". As such, each 5-character word, except for the end points, provides 21521521=231,525 unique combinations, or about 17.8208-bits of entropy. The end points are in the syntax of "x" or "x, which is about 21521*5=11,025 unique combinations, or about 13.4285-bits of entropy.

Secret Ninja

This generator comes from a static character-to-string assignment that produces pronounceable Asian-styled words. As such, there are only 26 assignments, providing about 4.7004-bits of entropy per string. There are three strings concatenated together per hyphenated word.

Cosby Bebop

I was watching this YouTube video with Bill Cosby and Stewie from Family Guy, and about half-way through the skit, Bill Cosby starts using made-up words as part of his routine. I've seen other skits by comedians where they use made-up words to characterize Bill Cosby, so I figured I would create a list of these words, and see how they fell out. There are 32 unique words, providing exactly 5-bits of entropy per word. Unlike the Bubble Babble and Secret Ninja generators, this generator uses both uppercase and lowercase Latin characters.

Korean K-pop

In following with the Bill Cosby Bebop generator, I created a Korean "K-pop" generator that used the 64-most common male and female Korean names, providing exactly 6-bits of entropy per name. I got the list of names from various sites listing common male and female Korean names.

Random

These are random strings provided as a last resort for sites or accounting software that have very restrictive password requirements. These passwords will be some of the shortest generated while meeting the same minimum entropy requirement. Because these passwords are not memorable, they should be absolutely stored in a password manager (you should be using one anyway).

  • Base-94: Uses all graphical U.S. ASCII characters (does not include horizontal space). Each character provides about 6.5546-bits of entropy. This password will contain ambiguous characters.
  • Base-64- Uses all digits, lowercase and uppercase Latin characters, and the "+" and "/". Each character provides exactly 6-bits of entropy. This password will contain ambiguous characters.
  • Base-32: Uses the characters defined in RFC 4648, which strives to use an unambiguous character set. Each character provides exactly 5-bits of entropy.
  • Base-16: Uses all digits and lowercase characters "a" through "f". Each character provides exactly 4-bits of entropy. This password will contain fully unambiguous characters.
  • Base-10: Uses strictly the digits "0" through "9". This is mostly useful for PINs or other applications where only digits are required. Each digits provides about 3.3219-bits of entropy. This password will contain fully unambiguous characters.
  • Emoji: There are 881 emoji glyphs provided by that font, yielding about 9.7830-bits per glyph. One side-effect, is that even though there is a character count in the generator box, each glyph may be more than 1 byte, so some input forms may count that glyph as more than 1 character. Regardless, the minimum entropy is met, so the emoji password is still secure.

I want to say something a bit extra about the Emoji generator. With the rise of Unicode and the UTF-8 standard, and the near ubiquitous popularity of smartphones and mobile devices, having access to non-Latin character sets is becoming easier and easier. As such, password forms are more likely supporting UTF-8 on input to allow Cyrillic, Coptic, Arabic, and East Asian ideographs. So, if Unicode is vastly becoming the norm, why not take advantage of it while having a little fun?

I opted for the black-and-white font, as opposed to the color font, to stay consistent with the look and feel of the other generators. This generator uses the emoji character sets provided by Google's Noto Emoji fonts, as that makes it easy for me to support the font in CSS 3, allowing every browser that supports CSS 3 to take advantage of the font and render the glyphs in a standard fashion. The license is also open so that I can redistribute the font without paying royalties, and others can do the same.

Screenshots

The post wouldn't be complete without some screenshots. The generator is both desktop friendly, fitting comfortably in a 1280x800 screen resolution, as well a mobile friendly, working well on even some of the oldest mobile devices.

Desktop screenshot of my password generator.

Desktop screenshot.

First mobile screenshot of my password generator.

First mobile screenshot.

Second mobile screenshot of my password generator.

Second mobile screenshot.

Random Passphrases Work, Even If They're Built From Known Passwords

Just this morning, security researcher Troy Hunt released a ZIP containing 306 million passwords that he's collected over the years from his ';--have i been pwned? service. As an extension, he created a service to provide either a password or a SHA-1 hash to see if your password has been pwnd.

Screenshot showing the entry text field to test your password from the new password checking service Troy Hunt has provided.

In 2009, the social network RockYou was breached, and 32 million accounts of usernames and passwords was released to the public Internet. No doubt those 32 million passwords are included in Troy Hunt's password dump. What's interesting, and the point of this post, is individually, each password from the RockYou breach will fail.

Collage of six screenshots of individual RockYou passwords failing Troy Hunt's password check. They were: caitlin, peachy, keeley, doreen, ursulet, & juggalo.

However, what would happen if you took 6 random RockYou passwords, and created a passphrase? Below is screenshot demonstrating just that using the above 6 randomly chosen RockYou passwords. Individually, they all fail. Combined, they pass.

Screenshot showing how creating the passphrase 'caitlinpeachykeeleydoreenursuletjuggalo' passes Troy Hunt's check.

Now, to be fair, I'm choosing these passwords from my personalized password generator. The list is the top 7,776 passwords from the 32 million RockYou dump. As such, you could use this list as a Diceware replacement with 5 dice. Regardless, each password is chosen at random from that list, and enough passwords are chosen to reach a 70-bits of entropy target, which happen to be 6 passwords. Mandatory screenshot below:

Screenshot showing my password generator using the RockYou passwords as a source for a strong passphrase. One screenshot is hyphenating the passwords, the other is not.

The point of this post, is that you can actually build a secure password for sites and services using previously breached passwords for your word list source, in this case, RockYou. The only conditions is that you have a word list large enough create a reasonable passphrase with few selections, and that the process picking the passwords for you is cryptographically random.

Electronic Slot Machines and Pseudorandom Number Generators

TL;DR

An Austrian casino company used a predictable pseudorandom number generator, rather than a cryptographically secure one, and people are taking advantage of it, and cashing out big.

The Story

Wired reported on an article about an amazing operation at beating electronic slot machines, by holding your phone to the slot machine screen for a time while playing, leaving the slot machine, then coming back an additional time, and cashing in big.

Unlike most slots cheats, he didn’t appear to tinker with any of the machines he targeted, all of which were older models manufactured by Aristocrat Leisure of Australia. Instead he’d simply play, pushing the buttons on a game like Star Drifter or Pelican Pete while furtively holding his iPhone close to the screen.

He’d walk away after a few minutes, then return a bit later to give the game a second chance. That’s when he’d get lucky. The man would parlay a $20 to $60 investment into as much as $1,300 before cashing out and moving on to another machine, where he’d start the cycle anew.

These machines were made by Austrian company Novomatic, and when Novomatic engineers learned of the problem, after a deep investigation, the best thing they could come up with, was that the random number generator in the machine was predictable:

Novomatic’s engineers could find no evidence that the machines in question had been tampered with, leading them to theorize that the cheaters had figured out how to predict the slots’ behavior. “Through targeted and prolonged observation of the individual game sequences as well as possibly recording individual games, it might be possible to allegedly identify a kind of ‘pattern’ in the game results,” the company admitted in a February 2011 notice to its customers.

The article, focused on a single incident in Missouri, mentions that the state vets the machines before they go into production:

Recognizing those patterns would require remarkable effort. Slot machine outcomes are controlled by programs called pseudorandom number generators that produce baffling results by design. Government regulators, such as the Missouri Gaming Commission, vet the integrity of each algorithm before casinos can deploy it.

On random number generators

I'll leave you to read the rest of the article. Suffice it to say, the Novomatic machines were using a predictable pseudorandom number generator after observing its output for a period of time. This poses some questions that should immediately start popping up in your head:

  1. What is the vetting process by states to verify the quality of the pseudorandom number generators in solt machines?
  2. Who is on that vetting commission? Is it made up of mathematicians and cryptographers? Or just a board of executives and politicians?
  3. Why aren't casino manufacturers using cryptographically secure pseudorandom number generators?

For me, that third item is the most important. No doubt, as the Wired article states, older machines just cannot be fixed. They need to be taken out of production. So long as they occupy casinos, convenience stores, and gas stations, they'll be attacked, and the owner will lose money. So let's talk about random number generators for a second, and see what the gambling industry can do to address this problem.

You can categorize random number generators into four categories:

  1. Nonsecure pseudorandom
  2. Cryptographically secure pseudorandom
  3. Chaotic true random
  4. Quantum true random

What I would be willing to bet, is that most electronic machines out there are of the "nonsecure pseudorandom" type of random number generator, and Novomatic just happened to pick a very poor one. Again, there likely isn't anything they can do about existing machines in production now, but what can they do moving forward? They should start using cryptographically secure pseudorandom number generators (CSPRNGs).

In reality, this is trivial. There are plenty of CSPRNGs to choose from. CSPRNGs can be broken down further into three subcategories:

  1. Designs based on cryptographic primitives.
  2. Number theoretic designs.
  3. Special-purpose designs.

Let's look at each of these in turn.

Designs based on cryptographic primitives.

These are generators that use things like block ciphers, stream ciphers, or hashing functions for the generator. There are some NIST and FIPS standardized designs:

  • NIST SP 800-90A rev. 1 (PDF): CTR_DRBG (a block cipher, such as AES in CTR mode), HMAC_DRBG (hash-based message authentication code), and Hash_DRBG (based on cryptographically secure hashing functions such as SHA-256).
  • ANSI X9.31 Appendix A.2.4: This is based on AES, and obsoletes ANSI X9.17 Appendix C, which is based on 3DES. It requires a high-precision clock to initially seed the generator. It was eventually obsoleted by ANSI X9.62-1998 Annex A.4.
  • ANSI X9.62-2005 Annex D: This standard is defines an HMAC_DRBG, similar to NIST SP 800-90A, using an HMAC as the cryptographic primitive. It obsoletes ANSI X9.62-1998 Annex A.4, and also requires a high-precision clock to initially seed the generator.

It's important that these designs are backtracking resistant, meaning that if you know the current state of the RNG, you cannot construct all previous states of the generator. The above standards are backtracking resistant.

Number theoretic designs

There are really only two current designs, that are based on either the factoring problem or the discrete logarithm problem:

  • Blum-Blum-Shub: This is generator based on the fact that it is difficult to compute the prime factors of very large composites (on the order of 200 or more digits in length). Due to the size of the prime factors, this is a very slow algorithm, and not practical generally.
  • Blum-Micali: This is a generator based on the discrete logarithm problem, when given two known integers "b" and "g", it is difficult to find "k" where "b^k = g". Like Blum-Blum-Shub, this generator is also very slow, and not practical generally.

Special-purpose designs

Thankfully, there are a lot of special purpose designs designed by cryptographers that are either stream ciphers that can be trivially ported to a CSPRNG, or deliberately designed CSPRNGs:

  • Yarrow: Created by cryptographer Bruce Schneier (deprecated by Fortuna)
  • Fortuna: Also created by Bruce Schneier, and obsoletes Yarrow.
  • ISAAC: Designed to address the problems in RC4.
  • ChaCha20: Designed by cryptographer Daniel Bernstein, our crypto Lord and Savior.
  • HC-256: The 256-bit alternative to HC-128, which is part of the eSTREAM portfolio.
  • eSTREAM portfolio: (7 algorithms- 3 hardware, 4 software)
  • Random123 suite: Contains four highly parallelizable counter-based algorithms, only two of which are cryptographically secure.

The solution for slot machines

So now what? Slot machine manufacturers should be using cryptographically secure algorithms in their machines, full stop. To be cryptographically secure, the generator:

  • Must past the next-bit test (you cannot predict the next bit any better than 50% probability).
  • Must withstand a state compromise (you cannot reconstruct past states of the generator based on the current state).

If those two properties are met in the generator, then the output will be indistinguishable from true random noise, and the generator will be unbiased, not allowing an adversary, such as someone with a cellphone monitoring the slot machine, to get the upperhand on the slot machine, and prematurely cash out.

However, the question should then be raised- "How do you properly seed the CSPRNG, so it starts in an unpredictable state, before release?" Easy, you have two options here:

  • Seed the CSPRNG with a hardware true RNG (HWRNG), such as a USB HWRNG, or....
  • Build the machine such that it collects environmental noise as entropy

The first point is much easier to achieve than the second. Slot machines likely don't have a lot of interrupts built into the system-on-a-chip (SoC). So aside from a microphone, video camera, or antenna recording external events, you're going to be hard-pressed to get any sort of high-quality entropy into the generator. USB TRNGs are available all over the web, and cheap. When the firmware is ready to be deployed, read 512-bits out of the USB generator, hash it with SHA-256, and save the resulting hash on disk as an "entropy file".

Then all that is left is when the slot machine boots up and shuts down:

  • On startup, read the "entropy file" saved from the previous shutdown, to seed the CSPRNG.
  • On shutdown, save 256-bits of data out of the generator to disk as an "entropy file".

This is how most operating systems have solved the problem with their built-in CSPRNGs. Provided that the very first "entropy file" was initially seeded with a USB true HWRNG, the state of every slot machine will be always be different, and will always be unpredictable. Also, 256-bits is more than sufficient to make sure the initial state of the generator is unpredictable; physics proves it.

Of course, the SoC could have a HWRNG onboard, but then you run the risk of hardware failure, and the generator becoming predictable. This risk doesn't exist with software-based CSPRNGs, so provided you can always save the state of the generator on disk at shutdown, and read it on startup, you'll always have an unpredictable slot machine.

Adblockers Aren't Part Of The Problem- People Are

Troy Hunt, a well-respected security researcher, and public speaker, wrote a blog post recently about how adblockers are part of the bad experience of the web. His article is about a sponsorship banner he posts at the top of his site, just below the header. It's not flashy, intrusive, loud, obnoxious, or a security or privacy concern. He gets paid better for the sponsorship strip than he does for ads, and the strip is themed with the rest of his site. It's out of the way of the site content, and scrolls with the page. In my opinion, it's in perfectly good taste. See for yourself:

Screenshot of Troy Hunt's homepage, showing the sponsorship strip just below the header.

Troy was surprised to find out, however, that his sponsorship strip is not showing when AdBlock Plus or UBlock Origin ad blockers are installed and enabled in the browser. He is understandably upset, as he is avoiding everything that piss off the standard web user when it comes to ads. He reached out to ABP about whitelisting his strip, and they've agreed it's hardly violating web user experience. However, someone added it to the EasyList filters, which means any ad blocker outside of ABP, will filter the sponsorship strip.

So, here's my question- are users wrong in filtering it?

Let's look at the state of web ads over the past couple decades. First, there was the ad popup, where the web page you were visiting would popup an ad right in front of the page. Sometimes they were difficult to close, and sometimes closing one would open a different one. Some pages would open dozens of popups, some fullscreen. It wasn't long before browsers across the board blocked popups by default, baked right into the browser.

Screenshot showing a Windows XP desktop littered with ad popups.

After popups were unanimously blocked across every browser, advertisers turned to ad banners. These were just as obnoxious as the popups, even if you didn't have to close a window. The flashed, blinked, falsely promised free trips and gadgets, and even sometimes auto-played videos. They were rarely relevant to the site content, but web page owners were promised a revenue per click, regardless. So, the more you could fit on the page, the more likely someone would click on an ad, and you would get paid. Web page owners placed these obnoxious ads above the header, below the header, in the sidebars, in the middle of the pages breaking up paragraphs in posts, in the footers. In some cases, the screen real estate dedicated to ads was more than the actual content on the site.

An image showing a collection of annoying banner ads.

Some HTML5 and CSS3 solutions now include overlays, that have to be manually closed or escaped, in order to continue parsing the site content. Unfortunately, ad blockers don't do a great job at blocking these. While they're great at finding and filtering out elements, blocking CSS overlay popups seems to be too difficult, as they are prevalent on the web, much to the chagrin of many ad block users.

Screenshot showing a CSS overlay on a web page showing an ad.

Ad blockers then became a mainstay. Web users were pissed off due to Flash crashing the browser (most ads were Flash-based), slowing down their connection to download additional content (at the time, most were on dial-up on slow DSL), and in general just getting in the way. It got so bad, that DoubleClick's "privacy chief" wrote a rant about ad blockers, and how they were unethical, and ad blocker users were stealing revenue.

As web page analytics started becoming a thing, more and more website owners wanted to know how traffic was arriving at their site, so they could further increase that traffic, and in addition, increase ad revenue. Already, page counters like StatCounter existed, to help site owners understand partially how traffic was hitting them, where they came from, what time, what search engine they used, how long they stayed, etc. Well, advertisers started putting these analytics in their ads. So, not only did the website owner know who you were, the advertising company did too. And worse, while the website owner might not be selling that tracking data, the advertiser very likely is.

The advertiser also became a data broker.

But here's the tricky part- ad blocking was no longer enough. Now website owners were adding JavaScript trackers to their HTML. They're not visible on the page, so the ad blocker isn't hiding an element. It's not enough to block ads any longer. Privacy advocates begin warning about "browser fingerprinting" based on the specific details in your browser that can uniquely identify you. Those unique bits are then tracked with these tracking scripts, and set to advertisers and data brokers, which change many hands along the way. The EFF created a project to help users understand how unique they appeared on the web through the Panopticlick Project.

Screenshot of my browser results after testing at https://panopticlick.eff.org.

As a result, other browser extensions dedicated to blocking trackers started showing up. Things like Ghostery, Disconnect, Privacy Badger, and more. Even extensions that completely disable JavaScript and Flash became popular. Popular enough, that browsers implemented a "click-to-play" setting, where flash and other plugin content was blocked by default, and you would need to click the element to display it. It's not uncommon now to visit a web page where you tracking blocker will block a dozen or more trackers.

Screenshot of Ghostery blocking 20 trackers on a web page.

I wish I could stop here, but alas, many advertisers have made a turn for the ugly. Now, web ads are the most common way to get malware installed on your computer. Known as "malvertising", it is more common at infecting your computer than shady porn sites. Even more worrisome, is that this trend is shifting away from standard desktops to mobile. Your phone is now more valuable than your desktop, and advertisers know it. Never mind shady apps that compromise your device, ads in legitimate "safe" apps are compromising devices as well.

Infographic showing the threat of malvertising on mobile in 2014.

So, to summarize, the history of ads has been:

  1. Annoying popups.
  2. Annoying banners.
  3. Annoying CSS overlays.
  4. Transparent trackers.
  5. Malvertising.

So, to Troy Hunt, here's my question: Given the awful history of advertisements on the web, are you honestly surprised that users don't trust a sponsorship strip?

Consider the following analogy: Suppose I brought a bunch of monkeys to your home, and they trashed the place. Smashed dishes, tore up furniture, destroyed computers and televisions, ruined floors, broke windows, and generally just destroyed anything and everything in sight. Then, after cleaning the place up, not only do I bring the monkeys back, but this time, they have digital devices (cameras, microphones, etc.) that report back to me about what your house looks like, where you live, what you're doing in response to the destruction. Again, you kick them out, clean up the place, and I return with everything as before, with some of them carrying a contagious disease that can get you and your family sick. I mean, honestly, one visit of these monkeys is enough, but they've made three visits, each worse than before.

Now, you show up at my doorstep, with a well-trained, leashed, groomed, clean, tame monkey, and I'm supposed to trust that it isn't anything like the past monkeys I've experienced before? As tame as it may be, call me rude, but I'm not trusting of monkeys right now. I've installed all sorts of alarm and monitoring systems, to warn me when monkeys are nearby, and nuke them with lasers. I've had too many bad experiences with monkeys in the past, to trust anyone bringing a new monkey to the premises.

So, you can see, it's not ad blockers that are the problem. It's the people behind the advertising firms and it's the people not trusting the Internet. The advertising c-level executives are trying to find ways to get their ad in front of your eyes, and are using any sort of shady means necessary to do it. The average web user is trying to find ways to have a pleasant experience on the web, without getting tracked, infected with malware, shouted at by a video, while still being able to consume the content.

People arguably don't trust ads. The people in the advertising firms have ruined that trust. You may have a clean privacy-aware non-intrusive sponsorship strip, but you can't blame people for not trusting it. We've just had too long of a history of bad ad experiences. So, while reaching out to the ad blocker developers to whitelist the sponsorship strip is a good first step, ultimately, if people don't trust it, and want to block, you can't blame them. Instead, continue focusing on what makes you successful, for your revenue from the ad blockers- blogging, speaking, developing, engaging. Your content, who you are, how you handle yourself is your most valuable ad.

Breaking HMAC

Okay. The title might be click bait, just a little, but after you finish reading this post, I think you'll be a bit more careful picking your HMAC keys. After learning this, I know I will be. However, HMAC is not broken. It just has an interesting ... property that's worth knowing about.

First off, let's remind ourselves what HMAC is. HMAC, or Hashed Message Authentication Codes, are the ability to authenticate a cryptographic message. This is done through an asymmetric key agreement protocol, such as Diffie-Hellman, where two parties securely share symmetric keys. These keys are used to encrypt messages as well as authenticate data. HMAC tags prevent chosen plaintext attacks, where the attacker can insert malicious data into the payload (send $1,000,000 to my account), and HMAC tags prevent adaptive chosen ciphertext attacks where the attacker can send encrypted data to the server, and learn what is being protected (is "password" in the payload?).

Authenticated messages are absolutely essential to modern cryptographic software. If you're writing cryptographic software, and you're not authenticating your ciphertext, you're doing it wrong. It doesn't matter if the data is at rest or in motion. Authenticate your ciphertext. This is where HMAC fits in.

So, best practice, when not using native AEAD ciphers, such as AES-GCM, or ChaCha20-Poly1305, is to encrypt the plaintext then authenticate it with HMAC, and prepend or append the digest (called a "MAC tag" or just "tag") to the ciphertext.

Something like this pseudocode:

ciphertext = AES-256-CTR(nonce, plaintext, key1)
tag = HMAC-SHA-512(ciphertext, key2)
ciphertext = tag || ciphertext

Then we ship off the newly bundled ciphertext and MAC tag. When it arrives at our destination, the recipient can verify if the ciphertext is what it should be, before decrypting it, by verifying the MAC tag, which is the whole point:

tag1 = ciphertext[:64]
data = ciphertext[64:]
tag2 = HMAC-SHA-512(ciphertext, key2)
hmac_check = 0

for char1, char2 in zip(tag1, tag2):
    hmac_check |= ord(char1) ^ ord(char2)

if hmac_check == 0:
    plaintext = AES-256-CTR(nonce, data, key1)
else:
    return False
return True

Notice that we're doing constant time comparison of the shipped tag and the calculated tag. Last thing we want is to introduce a timing attack by doing "if tag1 != tag2". However, after the constant time comparison, if tag1 and tag2 match, we can decrypt the data, and retrieve the plaintext.

So, now you have the background on HMAC, let's look at some examples with Python, and see where HMAC breaks. After all, that's the click bait title, no?

1
2
3
4
5
6
7
>>> import os
>>> import hmac
>>> import hashlib
>>> key = os.urandom(16)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg).hexdigest()
'd5a94f051b1e6ff67065b6f4c3a60130'

In this example, the default HMAC is HMAC-MD5. HMAC-MD5 is still considered cryptographically secure, even though vanilla MD5 is broken. Regardless, it'll suffice for this example, and we'll look at SHA-1 and SHA-2 also.

In RFC 2104, where HMAC is standardized, section 3 has this odd little tidbit (emphasis mine):

The key for HMAC can be of any length (keys longer than B bytes are
first hashed using H
). However, less than L bytes is strongly
discouraged as it would decrease the security strength of the
function. Keys longer than L bytes are acceptable but the extra
length would not significantly increase the function strength. (A
longer key may be advisable if the randomness of the key is
considered weak.)

The "B bytes" length is the block size of the underlying HMAC operations. If the key is shorter than this block size, zeroes need to be appended to the key. If the key is longer than this block size, then it is hashed with the HMAC cryptographic hash.

The block size is 64 bytes for the following HMACs:

  • HMAC-MD5
  • HMAC-RIPEMD128
  • HMAC-RIPEMD160
  • HMAC-SHA1

In other words, HMAC wants to key to be exactly one block in length. If it's longer, it's hashed with zeros appended to fit exactly into one block.

Can we test this?

1
2
3
4
5
6
>>> key = os.urandom(65) # longer than one block
>>> msg = os.urandom(256)
>>> hmac.new(key, msg).hexdigest()
'f887a4146e94ed47405c97931798885d'
>>> hmac.new(hashlib.md5(key).digest(),msg).hexdigest()
'f887a4146e94ed47405c97931798885d'

We have a collision. In other words:

For:
* 'H' a cryptographic hash
* 'k' a private key
* 'm' a message
* 'B' an HMAC block size

    HMAC(k, m) == HMAC(H(k), m)

for all 'k', where len(k) > B

Does this work with HMAC-SHA1?

1
2
3
4
5
6
>>> key = os.urandom(65)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.sha1).hexdigest()
'1070312944223b36928382d7a53ca54f7204ad4a'
>>> hmac.new(hashlib.sha1(key).digest(), msg, hashlib.sha1).hexdigest()
'1070312944223b36928382d7a53ca54f7204ad4a'

How about all the SHA-2 functions? (SHA-224, SHA-256, SHA-384, & SHA-512)?

SHA-224:

1
2
3
4
5
6
7
>>> key = os.urandom(65)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.sha224).hexdigest()
'9ea8c3f667e55e6e9c5d63c5dd1b569ca69e2cc69f5e3fa3f87e94ba'
>>> hmac.new(hashlib.sha224(key).digest(), msg, hashlib.sha224).hexdigest
()
'9ea8c3f667e55e6e9c5d63c5dd1b569ca69e2cc69f5e3fa3f87e94ba'

SHA-256:

1
2
3
4
5
6
7
>>> key = os.urandom(65)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.sha256).hexdigest()
'2aa02e678fcfe7ecaa1475efb70fe284fe91cc81e5a9c543433b70f5f5112c4b'
>>> hmac.new(hashlib.sha256(key).digest(), msg, hashlib.sha256).hexdigest
()
'2aa02e678fcfe7ecaa1475efb70fe284fe91cc81e5a9c543433b70f5f5112c4b'

SHA-384:

1
2
3
4
5
6
7
8
9
>>> key = os.urandom(65)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.sha384).hexdigest()
'0941e6502233a72d01beeec729eaa7db2469f8ce96339cd5b3b2c9a4684501e6a7025fac
6c9c20a511c48df76b453ec3'

>>> hmac.new(hashlib.sha384(key).digest(), msg, hashlib.sha384).hexdigest
()
'5380905fb89fee68836be076ebfccff600e3b89c6554840fe61fed01b049d6a6a77423d2
f5f4be1afb9d1c6f63b8b7fc'

SHA-512:

1
2
3
4
5
6
7
8
9
>>> key = os.urandom(65)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.sha512).hexdigest()
'36063bdc2d02ce8ea4b01b40ba040094c640959e0cc5716f7a75f119cbc348aa93d555f8
6bfcdaee5dad4ec2e5d53ed4362f9df0720ec0e1272288d49a912f7e'

>>> hmac.new(hashlib.sha512(key).digest(), msg, hashlib.sha512).hexdigest
()
'8bfe675e3ca35a8680243d3747b5d3ce7ded1731e1a307cf5d1b00ae9243395ab94039f2
2585b417d7cbdf09f3d8dcf39c85ce147ff77c901c1a21f8de981b6a'

So it appears that the block size is indeed 64 bytes for MD5, SHA-1, SHA-224, and SHA-256, but for SHA-384 and SHA-512, it doesn't appear to be working. That is because the block size has changed to 128 bytes for these two functions. So, if our key is 129 bytes, we should be able to replicate collisions:

SHA-384 with a 129-byte key:

1
2
3
4
5
6
7
8
9
>>> key = os.urandom(129)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.sha384).hexdigest()
'bda2b586637e3bd73a27919601d7d5a1c1743f1f9f5cb72a0aa874f832046f4bc396ff8e
307f9318dc404c4b432ca491'

>>> hmac.new(hashlib.sha384(key).digest(), msg, hashlib.sha384).hexdigest
()
'bda2b586637e3bd73a27919601d7d5a1c1743f1f9f5cb72a0aa874f832046f4bc396ff8e
307f9318dc404c4b432ca491'

SHA-512 with a 129-byte key:

1
2
3
4
5
6
7
8
9
>>> key = os.urandom(129)
>>> msg = os.urandom(256)
>>> hmac.new(key, msg, hashlib.sha512).hexdigest()
'd0153f8bb6a549539abbcff8ee5ac7592c48c082bbbb7b3cc95dcb2166f162e5c59bb7bb
3316e65d1481bd8697e8d3bc91deb46ad44845b972c57766f45c54bd'

>>> hmac.new(hashlib.sha512(key).digest(), msg, hashlib.sha512).hexdigest
()
'd0153f8bb6a549539abbcff8ee5ac7592c48c082bbbb7b3cc95dcb2166f162e5c59bb7bb
3316e65d1481bd8697e8d3bc91deb46ad44845b972c57766f45c54bd'

This isn't a poor Python implementation of HMAC. Try it in your favorite language, and you should be able to replicate the collisions. This is a "bug" in HMAC. If the key is longer than the block size, it's hashed with the HMAC cryptographic hash, then appended with zeros to fit the single block.

So what does this mean? It means that when choosing your HMAC keys, you should stay within one block size of bytes- 64 bytes or less for MD5, RIPEMD-128/160, SHA-1, SHA-224, SHA-256, and 128-bytes or less for SHA-384 and SHA-512. If you do this, you'll be fine.

Then again, you should probably be using NaCl or libsodium rather than piecing these cryptographic primitives manually yourself. These sorts of pitfalls are already handled for you.

This post is the result of a Twitter discussion between Scott Arciszewski, myself, and some others.

UPDATE: I'm not aware of any actual security implications where this is a problem that two distinct inputs produce the same HMAC digest. Ultimately, what are you trying to get out of HMAC? It's used as an authentication and integrity mechanism. So what if a long key, and the hash of that key produce the same HMAC digest? At 64 bytes, or 512-bits, the amount of work required at guessing the right key is anything but practical. Still, this is interesting.

Further Investigation Into Scrypt and Argon2 Password Hashing

Introduction

In my previous post, I didn't pay close attention to the memory requirements of Argon2 when running my benchmarks. Instead, I just ran them until I got tired of waiting around. Further, I really didn't do justice to either scrypt nor Argon2 when showing the parallelization factor. So, as a result, I did a lot more benchmarks on both, so you could more clearly see how the cost affects the time calculating the password hash, and how parallelization can affect that time.

More scrypt Benchmarks and Clarification

So, let's run some more scrypt benchmarks, and take a closer look at what's going on. Recall that the cost parameters for scrypt are:

  • N: The CPU and RAM cost
  • r: The mixing loop memory block size
  • p: The product factor

The recommended cost factors from the Colin Percival are:

  • N: 16384 (214)
  • r: 8
  • p: 1

To calculate the amount of memory used, we use the following equation:

Memory in bytes = (N * r * 128) + (r * p * 128)

So, according to the recommendation:

(214 * 8 * 128) + (8 * 1 * 128) = 16,778,240 bytes

Or, about 16 megabytes. According to Anthony Ferrara, you should be using at least 16 MiB with scrypt. At 4 Mib or less, it is demonstrably weaker than bcrypt. So, when you're looking at the images below, you'll notice that the first memory result is in red text, to show that 8 MiB is the weak point with those cost factors, and the 16 MiB is green, showing cost factors from there up are preferred. As a result, anything between 8 MiB and 16 MiB is default black, in that you should probably be cautious using these cost factors. While you might be demonstrably stronger than bcrypt with these work loads, you're not at the developer's recommendation of 16 MiB.

So, knowing that, let's look at the results. Notice that when we increase our product factor, our execution time increases by that factor. Despite urban legend, this isn't a parallelization constant (well, it is, but it's not breaking up the problem into smaller ones- it's multiplying it). The idea is that once you've reached a reasonable memory cost, you can increase the execution time by creating more mixing loops with the same cost on each loop. So, instead of one mixing loop costing you 16 MiB, you can have two mixing loops costing you 16 MiB. We didn't divide the problem, we multiplied it. As such, our execution time will double from one mixing loop to two mixing loops.

This seems strange, and indeed it is, but you should start with "p=1" at the memory cost you can afford, then increase the proudct factor if you can donate more time to the execution. In other words, the product factor is designed for hardware limited scenarios. In general, you'll want to look at your execution time, and let the memory come in as an after thought (provided it's at or more than 16 MiB).

As in the last post, I have highlighted the green cells with interactive password sessions targeting half-of-a-second and red cells with symmetric key derivation targeting a full five seconds.


Scrypt table showing memory block sizes of 1 and 2 with product multipliers of 1, 2, and 4.

Scrypt table showing memory block sizes of 4 and 6 with product multipliers of 1, 2, and 4.

Scrypt table showing memory block sizes of 8 and 10 with product multipliers of 1, 2, and 4.

Scrypt table showing memory block sizes of 12 and 14 with product multipliers of 1, 2, and 4.

Scrypt table showing memory block sizes of 16 and 18 with product multipliers of 1, 2, and 4.

More Argon2 Benchmarks

When showing Argon2 in my last post, I did a poor job demonstrating several additional cost factors, and I didn't make it clear how the parallelization played a roll when keeping the same cost factor. As a result, I ran additional benchmarks to make it more clear exactly how CPU and RAM play with each other in your work load.

As a reminder, the cost parameters for Argon2 are as follows:

  • n: The number of iterations on the CPU.
  • m: The memory work load.
  • p: The parallelization factor.

Unlike scrypt, where a single factor manipulates both the CPU and RAM cost, Argon2 separates them out. You deliberately have two knobs to play with- "n" for the CPU and "m" for the RAM. But, one affects the other. If you are targeting a specific execution time, and you increase your memory factor by two, then your CPU work factor will be decreased by half. On the reverse, if you increase your CPU work factor by two, then your memory work factor will be decreased by half. So, affecting one work factor affects the other.

Why is this important? Well, let's consider setting your memory requirement into the gigabyte range. At 1 GiB, for an interactive login session of .5 seconds, you would need at least both cores working on the hash, and you would only get a single iteration. In other words, your work is entirely memory dependent without any significant CPU cost. Maybe you're trying to thwart FPGAs or ASICs with the large memory requirement. However, is it possible that an adversary has 1 GiB of on-die cache? If so, because you're entirely memory-dependent, and no CPU work load, you've been able to cater to the adversary, without significant hardware cost.

On the reverse, you could get CPU heavy with 2048 iterations to hit your .5 seconds execution time, but then you would only be using 256 KiB of memory. You're likely not defeating the FPGAs and ASICs that Argon2 is designed for, as you're almost entirely processor-driven.

So, what to do? It would probably be a good idea to target a balance- require a significant amount of memory, even if it doesn't break the on-die cache barrier, while also requiring a significant amount of processor work. Sticking with Colin's recommendation of 16 MiB (214) of memory and 32 iterations on 4 cores for interactive logins is probably a good balance. Then again, it will all depend on your hardware, what you can expect in customer execution time, load, and other variables.

However, here are additional timings of Argon2, just like with scrypt, so you can see how parallelization affects identical costs. Again, green cells are targeting .5 seconds for interactive logins, and red cells are targeting 5 seconds for symmetric key derivation.


Argon2 table showing memory requirements of 256 KiB and 512 KiB with parallel factors of 1, 2, and 4.

Argon2 table showing memory requirements of 1 MiB and 2 MiB with parallel factors of 1, 2, and 4.

Argon2 table showing memory requirements of 4 MiB and 8 MiB with parallel factors of 1, 2, and 4.

Argon2 table showing memory requirements of 16 MiB and 32 MiB with parallel factors of 1, 2, and 4.

Argon2 table showing memory requirements of 64 MiB and 128 MiB with parallel factors of 1, 2, and 4.

Argon2 table showing memory requirements of 256 MiB and 512 MiB with parallel factors of 1, 2, and 4.

Argon2 table showing memory requirements of 1 GiB and 2 GiB with parallel factors of 1, 2, and 4.

Conclusion

Hopefully, this will help you make a more educated decision about your cost factors when deploying either scrypt or Argon2 as your password hash or symmetric key derivation function. Remember, that you have a few things to consider when picking your costs:

  • Make it expensive for both CPU and memory.
  • Target a realistic execution time for the situation.
  • Guarantee that you can always meet these goals after deployment.

Also, don't deploy Argon2 into production quite yet. Let is bake for a while. If it still stands secure in 2020, then you're probably good to go. Otherwise, deploy scrypt, or the other functions mentioned in the prior post.

Let's Talk Password Hashing

TL;DR

In order of preference, hash passwords with:

  1. scrypt
  2. bcrypt
  3. Argon2
  4. sha512crypt
  5. sha256crypt
  6. PBKDF2

Do not hash passwords with:

  1. MD5
  2. md5crypt
  3. UNIX crypt(3)
  4. SHA-1/2/3
  5. Skein
  6. BLAKE2
  7. Any general purpose hashing function.
  8. Any encryption algorithm.
  9. Your own design.
  10. Plaintext

Introduction

Something that comes up frequently in crypto circles, aside from the constant database leaks of accounts and passwords, are hashing passwords. Because of the phrase "hashing passwords", developers who may not know better will think of using generic one-way fixed-length collision-resistant cryptographic hashing functions, such as MD5, SHA-1, SHA-256, or SHA-512, without giving a second thought to the problem. Of course, using these functions is problematic, because they are fast. Turns out, we don't like fast hashing functions, because password crackers do like fast hashing functions. The faster they can do, the sooner they can recover the password.

The Problem

So, instead of using MD5, SHA-1, SHA-256, SHA-512, etc., the cryptographic community got together, and introduced specifically designed password hashing functions, where a custom work factor is included as a cost. Separately, key derivation functions were also designed for creating cryptographic keys, where a custom work factor was also included as a cost here. So, with password-based key derivation functions and specifically designed password hashing functions, we came up with some algorithms that you should be using instead.

The Solution

The most popular algorithms of this type would include, in my personal order of preference from most preferred to least preferred:

  1. scrypt (KDF)
  2. bcrypt
  3. Argon2 (KDF)
  4. sha512crypt
  5. sha256crypt
  6. PBKDF2 (KDF)

The only difference between a KDF and a password hashing function, is that the digest length can be arbitrary with KDFs, whereas password hashing functions will have a fixed length output.

For the longest time, I was not a fan of scrypt as a password hashing function. I think I've changed my mind. Even though scrypt is sensitive to the parameters picked, and it suffers from a time-memory trade-off (TMTO), it's still considered secure, provided you pick sane defaults. I also place bcrypt over Argon2, because Argon2 was just recently announced as the Password Hashing Contest winner. As with all cryptographic primitives, we need to time to analyze, attack, and pick apart the design. If after about 5 years, it still stands strong and secure, then it can be recommended as a solution for production. In the meantime, it's something certainly worth testing, but maybe not for production code. Finally, I prefer sha512crypt and sha256crypt over PBKDF2, mostly because they are included with every GNU/Linux distribution by default, they are based on the strong SHA-2 hashing function, which has had years and mountains of analysis, and unlike PBKDF2, you know exactly which hashing function is used. PBKDF2 could be using SHA-2 functions by default, or it could be using SHA-1. You'll need to check your library to be sure.

Different Strokes for Different Folks

Regardless, all of the above functions include cost parameters for manipulating how long it takes to calculate the hash from a password. It's less important exactly what the cost parameters are, and more important that you are targeting an appropriate time to work through the cost, and create the hash. This means you need to identify your threat model and your adversary.

The two common scenarios you'll find yourself in, are:

  1. Password storage
  2. Encryption keys

For password storage, your threat model is likely the password database getting leaked to the Internet, and password crackers around the world working on the hashes in the database to recover passwords. Thus, your adversary is malware, Anonymous, and password crackers. For encryption keys, your threat model is likely private encrypted keys getting compromised and side-channel attacks. Thus, your adversary is also malware, poor key exchanges, or untrusted networks. Knowing your threat model and your adversary changes how you approach the problem.

With password storage, you may be dealing with an interactive login, such as through a website. As such, you probably want the password hashing time to be quick, while still maintaining a work factor that would discourage large distributed attacks on your leaked database. Possibly, .5 seconds. This means if the database was leaked, the password cracker could do no more than 2 passwords per second. When you compare this to the millions of hashes per second a GPU could execute on Windows NTLM passwords, 2 passwords per second is extremely attractive. For encryption keys, you probably don't need to worry about interactive sessions, so taking 5 seconds to create the key from the password probably isn't a bad thing. So key crackers spending 5 seconds per guess trying to recover the password that created the encrypted private key is really nice.

bcrypt, sha256crypt, sha512crypt, & PBKDF2

So, knowing the work factors, what would it look like for the above algorithms? Below, I look at bcrypt, sha256crypt, sha512crypt, and PBKDF2 with their appropriate cost. I've highlighted the row green where a possible work factor could mean spending 0.5 seconds on hashing the password, and a red row where a possible work factor could mean spending 5 full seconds on creating a password-based encryption key.

Spreadsheet table showing various cost factors for bcrypt, sha256crypt, sha512crypt, and PBKDF2.

Notice that for bcrypt, this means for password hashing, a factor of 13 would provide a cost of about 0.5s to hash the password, where a factor of 16 would get me close to my cost of about 5 seconds for creating a password-based key. For sha256crypt, sha512crypt, and PBKDF2, that seems to be about 640,000 and 5,120,000 iterations respectively.

scrypt

When we move to scrypt, things get a touch more difficult. With bcrypt, sha256crypt, sha512crypt, and PBKDF2, our cost is entirely a CPU load factor. Unfortunately, while possibly problematic for fast GPU clusters, they still fall victim to algorithm-specific FPGAs and ASICs. In order to combat this, we need to also include a memory cost, seeing as though memory on these devices is expensive. However, having both a CPU and a RAM cost, means multiple knobs to tweak. So, Colin Percival, the designer of scrypt, decided to bundle both the CPU and the RAM cost three factors: "N", "r", and "p". The resulting memory usage is calculated as follows:

Memory in bytes = (N * r * 128) + (r * p * 128)

There are a lot of suggestions out there about what's "best practice". It seems that you should at least have the following cost factors with scrypt, which provides a 16 MiB memory load:

  • N: 16384 (214)
  • r: 8
  • p: 1

While you should be aware of the sensitivity of scrypt parameters, provided you are working with at least 16 MiB of RAM, you aren't any worse than other password hashing functions or KDFs. So, in the following tables, I increase the memory cost needed for the hash by tweaking the three parameters.

Update 2016-06-29: I've clarified these parameters in a follow-up post, which you should most definitely read at https://pthree.org/2016/06/29/further-investigation-into-scrypt-and-argon2-password-hashing/.

First table showing the cost factors of scrypt. Second table showing different cost factors of scrypt. Third table showing yet different cost factors of scrypt.

Because I only have access to a single-socket-quad core CPU in this testing machine, I wanted to limit my "p" cost to 1, 2, and 4, which is displayed in those tables. Further, I'm limited on RAM, and don't want to disrupt the rest of the applications and services running on the box, so I've limited my "r" cost to 4, 8, and 16 multiplied by 128 bytes (512 bytes, 1024 bytes, and 2048 bytes).

Interestingly enough, Colin Precival recommends 16 MiB (N=16384 (214), r=8, p=1) for interactive logins and 16 MiB (N=131072 (217), r=1, p=1) for symmetric key derivation. If I were targeting my 0.5s password hashing time, then I could improve that to 256 MiB (N=65536 (216), r=8, p=1), or 2 GiB (N=2097152 (221), r=8, p=1), if targeting just slightly more than 5 seconds for symmetric key derivation.

Argon2

Finally, we look at Argon2. Argon2 comes in two flavors- Argon2d and Argon2i; the first of which is data (d)ependent and the latter is data (i)independent. The former is supposed to be resistant against GPU cracking while the latter is supposed to be resistant against side-channel attacks. In other words, Argon2d would be suitable for password hashing, while Argon2i would be suitable for encryption key derivation. However, regardless of Argon2d or Argon2i, the cost parameters will perform the same, so we'll treat them as a single unit here.

Like scrypt, Argon2 has both a CPU and a RAM cost. However, both are handled separately. The CPU cost is handled through standard iterations, like with bcrypt or PBKDF2, and the RAM cost is handled through specifically ballooning the memory. When I started playing with it, I found that just manipulating the iterations felt very much like bcrypt, but I could affect the overall time it took to calculate the hash by just manipulating the memory also. When combining the two, I found that iterations affected the cost more than the RAM, but both had significant say in the calculation time, as you can see in the tables below. As with scrypt, it also has a parallelization cost, defining the number of threads you want working on the problem:

First table showing the cost factors of Argon2. Second table showing different cost factors of Argon2. Third table showing yet different cost factors of Argon2.

Note the RAM cost between 256 KiB and 16 MiB, in addition to the number of iterations and the processor count cost. As we balloon our RAM, we can bring our iteration cost down. As we require more threads to work on the hash, we can bring that iteration count down even further. Regardless, we are trying to target 0.5s for an interactive password login, and a full 5 seconds for password-based encryption key derivation.

Conclusion

So, what's the point? When hashing passwords, whether to store them on disk, or to create encryption keys, you should be using password-based cryptographic primitives that were specifically designed for this problem. You should not be using general purpose hashing functions of any type, because of their speed. Further, you should not be rolling out your own "key-stretching" algorithm, such as recursively hashing your password digest and additional output.

Just keep in mind- if the algorithm was specifically designed to handle passwords, and the cost is sufficient for your needs, threat model, and adversary, then you're doing just fine. Really, you can't go wrong with any of them. Just avoid any algorithm not specifically designed around passwords. The goal is security through obesity.

Best practice? In order of preference, use:

  1. scrypt
  2. bcrypt
  3. Argon2
  4. sha512crypt
  5. sha256crypt
  6. PBKDF2

Do not use:

  1. MD5
  2. md5crypt
  3. UNIX crypt(3)
  4. SHA-1/2/3
  5. Skein
  6. BLAKE2
  7. Any general purpose hashing function.
  8. Any encryption algorithm.
  9. Your own design.
  10. Plaintext

The Physics of Brute Force

Introduction

Recently, MyDataAngel launched a Kickstarter project to sell a proprietary encryption algorithm and software with 512-bit and 768-bit symmetric keys. The motivation was that 128-bit and 256-bit symmetric keys just isn't strong enough, especially when AES and OpenSSL are older than your car (a common criticism they would mention in their vlogs). Back in 2009, Bruce Schneier blogged about Crypteto having a 49,152-bit symmetric key. As such, their crypto is 100% stronger, because their key is 100% bigger (than 4096-bit keys?). Meganet, which apparently still exists, has a 1 million-bit symmetric key!

It's hard to take these encryption products seriously, when there are no published papers on existing primitives, no security or cryptography experts on your team, and you're selling products with ridiculous key lengths (to be fair, 512-bit and 768-bit symmetric keys aren't really that ridiculous). Nevermind that your proprietary encryption algorithm is not peer-reviewed nor freely available to the public. Anyone can create a symmetric encryption algorithm that they themselves cannot break. The trick is releasing your algorithm for peer review, letting existing cryptography experts analyze the design, and still coming out on top with a strong algorithm (it wouldn't hurt if you analyzed existing algorithms and published papers yourself).

So with that, I want to talk a bit about the length of symmetric keys, and what it takes to brute force them. Bruce Schneier addressed this in his "Applied Cryptography" book through the laws of thermodynamics. Unfortunately, he got some of the constants wrong. Although the conclusion is basically the same, I'm going to give you the same argument, with updated constants, and we'll see if we come to the same conclusion.

Counting Bits

Suppose you want to see how many bits you can flip in one day by counting in binary every second. Of course, when you start counting, you would start with "0", and your first second would flip your first bit to "1". Your second second would flip your second bit to "1" while also flipping your first bit back to "0". Your third second would flip the first bit back to "1", and so forth. Here is a simple GIF (pronounced with a hard "G") counting from 0 to 127, flipping bits each second.

Bit counter animation counting from 0 to 127.

By the end of a 24-hour period, I would have hit 86,400 seconds, which is represented as a 17-bit number. In other words, every 24 hours, flipping 1 bit per second, I can flip every combination of bits in a 16-bit number.

Binary representation of 86400

By the end of a single year, we end up with a 25-bit number, which means flipping a single bit every second can flip every combination of 24-bits every year.

Binary representation of 31536000

So, the obvious question is then this- what is the largest combination of bits that I can flip through to exhaustion? More importantly, how many computers would I need to do this work (what is this going to cost)?

Some Basic Physics

One of the consequences of the second law of thermodynamics, is that it requires energy to do a certain amount of work. This could be anything from lifting a box over your head, to walking, to even getting out of bed in the morning. This also includes computers and hard drives. When the computer wishes to store data on disk, energy is needed to do that work. This is expressed with the equation:

Energy = kT

Where "k" is Boltzmann's constant of 1.38064852×10−16 ergs per Kelvin, and "T" is the temperature of the system. I'm going to use ergs as our unit, as we are speaking about work, and an "erg" is a unit of energy. Of course, a "Kelvin" is a unit of temperature, where 0 Kelvin is defined as a system devoid of energy; also known as "absolute zero".

It would make the most sense to get our computer as absolutely cool as possible to maximize our output while also minimizing our energy requirements. Current background radiation in outer space is about 2.72548 Kelvin. To run a computer cooler than that would require a heat pump, which means adding additional energy to the system than what is needed for our computation. So, we'll run this ideal computer at 2.72548 Kelvin.

As a result, this means that to flip a single bit with our ideal computer, it requires:

Energy = (1.38064852×10−16 ergs per Kelvin) * (2.72548 Kelvin) = 3.762929928*10-16 ergs

Some Energy Sources

The Sun

Now that we know our energy requirement, let's start looking at some energy sources. The total energy output from our star is about 1.2*1034 Joules per year. Because one Joule is the same as 1*107 ergs, then the total annual energy output of the Sun is about 1.2*1041 ergs. So, doing some basic math:

Bits flipped = (1.2*1041 ergs) / (3.762929928*10-16 ergs per bit) = 3.189004374*1056 bits

3.189004374*1056 bits means I can flip every combination of bits in a 2187-bit number, if I could harness 100% of the solar energy output from the sun each year. Unfortunately, our Sun is a weak star.

A Supernova

A supernova is calculated to release something around 1044 Joules or 1051 ergs of energy. Doing that math:

Bits flipped = (1051 ergs) / (3.762929928*10-16 ergs per bit) = 2.657503608*1066 bits

2.657503608*1066 bits is approximately 2220-bits. Imagine flipping every bit in a 220-bit number in an orgy of computation.

A Hypernova

A hypernova is calculated to release something around 1046 Joules or 1053 ergs of energy. Doing that math:

Bits flipped = (1053 ergs) / (3.762929928*10-16 ergs per bit) = 2.657503608*1068 bits

2.657503608*1068 bits is approximately 2227-bits. This is a computation orgy turned up to 11.

Of course, in all 3 cases, I would have to harness 100% of that energy into my ideal computer, to flip every combination of these bits. Never mind finding transportation to get me to that hypernova, the time taken in that travel (how many millions of light years away is it?), and the cost of the equipment to harness the released energy.

Bitcoin Mining

As a comparative study, Bitcoin mining has almost surpassed 2 quintillion SHA-256 hashes per second. If you don't think this is significant, it is. That's processing all of a 60-bit number (all 260 bits) every second, or an 85-bit number (all 285 bits) every year. This is hard evidence, right now, of a large scale 256-bit brute force computing project, and it's barely flipping all the bits in an 85-bit number every year. The hash rate would have to double (4 quintillion SHA-256 hashes every second) to surpass flipping all the bits in an 86-bit number every year.

Further, we do not have any evidence of any clustered supercomputing project that comes close to that processing rate. It can be argued that the rate of Bitcoin mining is the upper limits of what any group of well-funded organizations could afford (I think it's fair to argue several well-funded organizations are likely Bitcoin mining). To produce a valid conspiracy theory to counteract that claim, you would need to show evidence of organizations that have their own semiconductor chip manufacturing, that has outpaced ARM, AMD, Intel and every other chip maker on the market, by several orders of magnitude.

Regardless, we showed the amount of energy needed anyway to flip every bit in a 256-bit number, and the laws of thermodynamics strongly imply that it's just not physically possible.

Asymmetric Cryptography

Things change when dealing with asymmetric cryptography. Now, instead of creating a secret 256-bit number, you're using mathematics, such as prime number factorization or elliptic curve equations. This changes things drammatically when dealing with key lengths, because even though we assume some mathematical problems are easy to calculate, but hard to reverse, we need to deal with exceptionally large numbers to give us the security margins necessary to prove that hardness.

As such, it because less of a concern about energy, and more a concern about time. Of course, key length is important up to a point. We just showed with the second law of thermodynamics, that brute forcing your way from 0 to 2256 is just physically impossible. However, finding the prime factors of that 256-bit number is a much easier task, does not require as much energy, and can be done by only calculating no more than half of the square root amount of numbers (in this case, 2127, assuming we're only testing prime numbers).

As such, we need to deal with prime factors that are difficult to find. It turns out that it's not enough to just have a 512-bit private key to prevent the Bad Guys from finding your prime factors. This is largely because there are efficient algorithms for calculating and testing prime numbers. So, it must also be expensive to calculate and find those primes. Currently, best practice seems to be generating 2 1024-bit prime factors to produce a 2048-bit private RSA key.

Table showing recommendations of key length from various authorities

Fixed-length Collision-resistant Hashing

Fixed-length collision-resistant hashing puts a different twist on brute force searching. The largest problem comes from the Birthday Attack. This states that if you have approximately the square root of 2 times 365 people in the room (about 23 people), the chances that any two people share the same birthday is 50%. Notice that this comes from any two people in the room. This means that you haven't singled out 1 person, and the odds that the other 22 people in the room have that same birthday is 50%. This isn't a pre-collision search. This is a blind search. You ask the first person what their birthday is, and compare it with the other 22 people in the room. Then you ask the second person what their birthday is, and compare it with the remaining 21 people in the room. And so on and so forth. After working through all 23 people comparing everyone's birthday to everyone else's birthday, the odds you found a match between two random people is 50%.

Why is this important? Suppose you are processing data with SHA-1 (160-bit output). You only need to calculate 280 SHA-1 hashes before your odds of finding a duplicate hash out of the currently calculated hashes reaches 50%. As we just learned with Bitcoin, this is practical within one year with a large orchestrated effort. Turns out, SHA-1 is weaker that that (we only need to calculate 264 hashes for a 50% probability), which is why the cryptographic community has been pushing so hard to get everyone and everything away from SHA-1.

Now you may understand why 384-bit and 512-bit (and more up to 1024-bit) cryptographically secure fixed-length collision-resistant hashing functions exist. Due to the Birthday Attack, we can make mince meat of our work.

Conclusion

As clearly demonstrated, the second law of thermodynamics provides a clear upper bound on what can be found with brute force searches. Of course, brute force searches are the least effective way to find the private keys you're looking for, and indeed, there are more efficient ways to get to the data. However, if you provide a proprietary encryption algorithm with a closed-source implementation, that uses ridiculously long private keys, then it seems clear that you don't understand the physics behind brute force. If you can't grasp the simple concept of these upper bounds, why would I want to trust you and your product in other areas of security and data confidentiality?

Quantum computing does give us some far more efficient algorithms that classical computing cannot achieve, but even then, 256-bits still remains outside of the practical realm of mythical quantum computing when brute force searching.

As I've stated many times before- trust the math.

Webcam Random Number Generation

A couple weeks ago, I purchased a lava lamp for $5 at a thrift store. It was in brand spanking new condition, and worked like a charm. The only thing going through my head at the time? I can't wait to point my webcam at it, and start generating some random numbers! Okay, well that, and mood lighting for the wife.

Anyway, I wrote a quickie Python script which will capture a frame from the webcam, hash it with a keyed BLAKE2, and output the result to a FIFO file to be processed. The BLAKE2 digest of the frame also becomes the key for the next BLAKE2 instance, making this script very CBC-like in execution (the first function is keyed from /dev/urandom, and each digest keys the next iteration).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
#!/usr/bin/python

# Create true random seeds (near as we can tell) with your webcam.
#
# This script will use your webcam pointed at a source of entropy, keyed with
# random data from the OS CSPRNG. You could point the camera at:
#
#   * Lava lamps
#   * Plasma globes
#   * Double pendulums
#   * Rayleigh-Benard convection
#   * Brownian motion
#
# Performance is ~ 2 KiB/s.
# Requires pyblake2: https://pypi.python.org/pypi/pyblake2
#
# Released to the public domain.

import os
import cv2
import pyblake2

cap = cv2.VideoCapture(0)
webcamfile = '/tmp/webcamfile.fifo'
key = os.urandom(64)

try:
    os.mkfifo(webcamfile)
except OSError, e:
    print "Cannot create FIFO: {0}".format(e)
else:
    fifo = open(webcamfile, 'w+')

while True:
    ret, frame = cap.read()
    if not ret:
        break

    b2sum = pyblake2.blake2b(key)
    b2sum.update(frame)
    digest = b2sum.digest()
    key = digest

    fifo.write(digest)
    fifo.flush()

    cv2.imshow('webcamlamp', frame)
    k = cv2.waitKey(1) & 0xFF
    if k == 27:
        break

fifo.close()
os.remove(webcamfile)
cap.release()
cv2.destroyAllWindows()

As you'll notice in the code, you should point your webcam at a source of either chaotic randomness, like a lava lamp, or quantum randomness, like a plasma globe. Because the frame is whitened with a keyed BLAKE2, it could be considered as a true random number generator, or you could use it as a seed for a cryptographically secure pseudorandom number generator, such as those shipped with modern operating systems. If you do use this as a TRNG, realize that it's slow- it only operates at about 2 KiBps.

Here is a screenshot of the webcam itself looking at a USB desk plasma globe, that you can purchase of ThinkGeek for $10.

Webcam view of a plasma globe in operation.

The data is sent to a FIFO in /tmp/. If you don't do anything with the data, and let the buffer fill, the script will hang, until you read data out of the FIFO. As such, you could do something like this to reseed your CSPRNG (of course, it's not increasing the entropy estimate, just reseeding the generator):

$ < /tmp/webcamrng.fifo > /dev/random

Lava lamps and plasma globes are only the beginning. Anything quantum or chaotic that can be visually observed also works. Things like:

  • Double pendulums
  • Brownian motion
  • Rayleigh-Benard convection
  • CCD noise from the webcam itself
  • A bouncing ball on a sinusoidal vibrating table

So, there you have it. Plasma globes and lava lamps providing sufficiently random data via a webcam, either to be used as a secret seed, or as a TRNG itself. Any other systems that could be used to point a webcam at, or suggestions for improvement in the Python code, let me know in the comments.

CPU Jitter Entropy for the Linux Kernel

Normally, I keep a sharp eye on all things cryptographic-related with the Linux kernel. However, in 4.2, I missed something fantastic: jitterentropy_rng.ko. This is a Linux kernel module that measures the jitter of the high resolution timing available in modern CPUs, and uses this jitter as a source of true randomness. In fact, using the CPU timer as a source of true randomness isn't anything new. If you're read my blog for some time, you're already familiar with haveged(8). This daemon also collects CPU jitter and feeds the collected data into the kernel's random number generator.

The main site is at http://www.chronox.de/jent.html and the PDF paper describing the CPU jitter entropy can be found on the site.

So why the blog post about jitterentropy_rng.ko? Because now that we have something in the mainline kernel, we get a few benefits:

  1. More eyes are looking at the code, and can adjust, analize, and refine the entropy gathering process, making sure it's not to aggressive nor conservative in its approach.
  2. We now have something that can collect entropy much earlier in the boot sequence, even before the random number generator has been initialized. This means we can have a properly seeded CSPRNG when the CSPRNG is initialized.
  3. While not available now, we could have a kernelspace daemon collecting entropy and feeding it to the CSPRNG without the need for extra software.
  4. This isn't just for servers, desktops, and VMs, but anything that runs the Linux kernel on a modern CPU, including Android phones, embedded devices, and SoC.
  5. While haveged(8) has been a good solution for a long time, it has been heavily criticized, and it seems development on it has stalled. Here is another software solution for true randomness without the need of potentially dangerous 3rd party USB random number generators.
  6. You don't need Intel's RDRAND. Any modern CPU with a high resolution timer will work. AMD, SPARC, ARM, MIPS, PA-RISC, Power, etc.

As mentioned in the list, unfortunately, loading the kernel doesn't automatically top-off the entropy estimate of the internal state of the CSPRNG (/proc/sys/kernel/random/entropy_avail). As such, /dev/random will still block when the estimate is low or exhausted. So you'll still need to run a userspace daemon to prevent this behavior. The author has also shipped a clean, light userspace daemon that just reads the data provided by the jitterentropy_rng.ko kernel module, and uses ioctl(2) to increase the estimate. The jitterentropy_rng.ko module provides about 10 KBps of random data.

Again, this isn't anything that something like haveged(8) doesn't already have access to. However, by taking advantage of a loaded kernel module, we can ensure that randomness is being collected before the CSPRNG is initialized. So, when CSPRNG initialization happens, we can ensure that it is properly seeded on first boot, minimizing the likelihood that exact keys will be created on distinct systems. This is something haveged(8) can't provide, as it runs entirely in userspace.

Unfortunately, jitterentropy-rngd(8) isn't available in the Debian repositories yet, so you'll need to download the compressed tarball from the author's website, manually compile and install yourself. However, he does ship a systemd(8) service file, which makes it easy to get the daemon up and running on boot with minimal effort.

I've had the jitterentropy_rng.ko module installed with the jitterentropy-rngd(8) userspace daemon running all day today, without haveged(8), and needless to say, I'm pleased. It keeps the CSPRNG entropy estimate sufficiently topped off for software that still relies on /dev/random (please stop doing this developers- start using /dev/urandom please) and provides adequate performance. Near as I can tell, there is not a character device created when loading the kernel module, so you can't access the unbiased data before feeding it into the CSPRNG. As such, I don't have a way to test its randomness quality. Supposedly, there is a way to access this via debugfs, but I haven't found it.

Anyway, I would recommend using jitterentropy_rng.ko and jitterentropy-rngd(8) over haveged(8) as the source for your randomness.

Weechat Relay With Let's Encrypt Certificates

I've been on IRC for a long time. Not as long as some, granted, but likely longer than most. I've had my hand in a number of IRC clients, mostly terminal-based. Yup, I was (shortly) using the ircII client, then (also shortly) BitchX. Then I found irssi, and stuck with that for a long time. Search irssi help topics on this blog, and you'll see just how long. Then, after getting hired at XMission in January 2012, I switched full-time to WeeChat. I haven't looked back. This IRC client is amazing.

One of the outstanding features of WeeChat is the relay, effectively turning your IRC client into a bouncer. This feature isn't unique- it's in irssi also. However, the irssi proxy does not support SSL (2009). The WeeChat relay does. And with Let's Encrypt certificates freely available, this is the perfect opportunity to use TLS with a trusted certificate.

This post assumes that you are running WeeChat on a box that you can control the firewall to. In my case, I run WeeChat on an externally available SSH server behind tmux. With Let's Encrypt certificates, you will need to provide a FQDN for your Common Name (CN). This is all part of the standard certificate verification procedure. I purchased a domain that points to the IP of that server, and you will need to do the same.

The official Let's Encrypt "certbot" package used for creating Let's Encrypt certificates is already available in Debian unstable. A simple "apt install certbot" will get that up and running for you. Once installed, you will need to create your certificate.

$ certbot certonly --standalone -d weechat.example.com -m aaron.toponce@gmail.com

Per Let's Encrypt documentation, you needs ports 80 and 443 open to the world when creating and renewing your certificate. The execution will create four files:

# ls -l /etc/letsencrypt/
total 24
drwx------ 3 root root 4096 May 19 12:36 accounts/
drwx------ 3 root root 4096 May 19 12:39 archive/
drwxr-xr-x 2 root root 4096 May 19 12:39 csr/
drwx------ 2 root root 4096 May 19 12:39 keys/
drwx------ 3 root root 4096 May 19 12:39 live/
drwxr-xr-x 2 root root 4096 May 19 12:39 renewal/
# ls -l /etc/letsencrypt/live/weechat.example.com/
total 0
lrwxrwxrwx 1 root root 43 May 19 12:39 cert.pem -> ../../archive/weechat.example.com/cert1.pem
lrwxrwxrwx 1 root root 44 May 19 12:39 chain.pem -> ../../archive/weechat.example.com/chain1.pem
lrwxrwxrwx 1 root root 48 May 19 12:39 fullchain.pem -> ../../archive/weechat.example.com/fullchain1.pem
lrwxrwxrwx 1 root root 46 May 19 12:39 privkey.pem -> ../../archive/weechat.example.com/privkey1.pem

The "cert.pem" file is your public certificate for your CN. The "chain.pem" file in the Let's Encrypt intermediate certificate. The "fullchain.pem" file is the "cert.pem" and "chain.pem" files combined. Of course, the "privkey.pem" file is your private key. For the WeeChat relay, it needs the "privkey.pem" and "fullchain.pem" files combined into a single file.

Because the necessary directories under "/etc/letsencrypt/" are accessible only by the root user, you will need root access to copy the certificates out and make them available to WeeChat, which hopefully isn't running as root. Also, Let's Encrypt certificates need to be renewed no sooner than every 60 days and no later than every 90 days. So, not only will you want to automate renewing the certificate, but you'll probably want to automate moving it into the right directory when the renewal is complete.

As you can see from above, I setup my certificate on a Thursday at 12:39. So weekly, on Thursday, at 12:39, I'll check to see if the certificate needs to be nenewed. Because it won't renew any more frequently than every 60 days, but I have to have it renewed every 90 days, this gives be a 30-day window in which to get the certificate updated. So, I'll keep checking weekly. If a renewal isn't needed, the certbot(1) tool will gracefully exit. If a renewal is needed, the tool will update the certificate. Unfortunately, certbot(1) does not provide a useful exit code when renewals aren't needed, so rather than parsing text, I'll just copy the new certs into my WeeChat directory, regardless if they get updated or not.

So, in my root's crontab, I have the following:

39 12 * * 4 /usr/local/sbin/renew.sh

Where the contents of "/usr/local/sbin/renew.sh" are:

#!/bin/bash

certbot renew -q
cat /etc/letsencrypt/live/weechat.example.com/privkey.pem \
    /etc/letsencrypt/live/weechat.example.com/fullchain.pem > \
    ~aaron/.weechat/ssl/relay.pem
chown aaron.aaron ~aaron/.weechat/ssl/relay.pem

Now the only thing left to do is setup the relay itself in WeeChat. So, from within the client:

/relay sslcertkey
/relay add ssl.weechat 8443

You will need port 8443 open in your firewall, of course.

That's it. I have had some problems with certificate caching in WeechatAndroid it seems. So far, I have had to manually restart the relay in WeeChat, and flush the cache in WeechatAndroid and restart it to get the new certificate (I was previously using a self-signed certificate). Hopefully, this can also be automated, so I don't have to manually keep restarting the relay in WeeChat and flushing the cache in WeechatAndroid.

Regardless, this is how you use Let's Encrypt certificates with WeeChat SSL relay. Hopefully this is beneficial to someone.

Say Allo To Insecurity

Yesterday, Google announced two new encrypted messaging apps called "Allo" and "Duo". There has been some talk about the security of Allo's end-to-end encryption and incognito mode. Most of it was speculation, until Thai Duong blogged about it. Well, it's time to see what he said, and see if Allo stands up to scrutiny.

"Allo offers two chat modes: normal and incognito. Normal is the default, but incognito can be activated with one touch. I want to stress that both modes encrypt chat messages when they are in transit or at rest. The Allo clients talk to Google servers using QUIC or TLS 1.2. When messages are temporarily stored on our servers waiting for delivery they are also encrypted, and will be deleted as soon as they're delivered."

There are a few things in this paragraph that need some explanation. First, "both modes encrypt chat messages when they are in transit or at rest". This is good, but the devil is in the details. In transit, Thai explains how they're encrypted: "Allo clients talk to Google servers using QUIC or TLS 1.2". This has a couple of ramifications. First, this isn't end-to-end encryption (E2E). This is client-server encryption, which means both the server and the client are encrypting and decrypting the data. As a result, any Google employee with the appropriate privileges can read the messages delivered to Google servers. That's sort of the point of why E2E encryption exists- to prevent this from happening.

Second, kudos for storing the messages encrypted on disk. But, realize that Google has the master key to decrypt these messages if needed. Also, kudos for deleting them off of Google's servers as soon as they're delivered. However, just like VPN service providers promising they don't log your connections, Google promising not to log your message sort of falls into this category. That is, although they might not be storing the messages right now, they may store them later, especially if presented with a warrant from law enforcement. So, Google promising to not store your messages really doesn't amount to much other than maybe they don't want to unnecessarily chew through disk unless forced. Just remember, Google isn't going to go to jail for you, so they will comply with law enforcement.

"In normal mode, an artificial intelligence run by Google (but no humans including the Allo team or anyone at Google) can read your messages. This AI will use machine learning to analyze your messages, understand what you want to do, and give you timely and useful suggestions. For example, if you want to have dinner, it'll recommend restaurants or book tables. If you want to watch movies, it can buy you tickets.

Like it or not, this AI will be super useful. It's like having a personal assistant that can run a lot of errands for you right in your pocket. Of course, to help it help you you'll have to entrust it with your chat messages. I really think that this is fine, because your chat messages are used to help you and you only, and contrary to popular beliefs Google never sells your personal information to anyone."

Herein lies the real reason why E2E is not enabled by default- Google would like to mine your messages, on your phone, and present you with real-time ads. Ads not just on your phone, but likely when you're logged into your Google account on your desktop or laptop as well. If the data is E2E encrypted, this poses a problem for a company that has made Big Bucks off of advertising. With incognito mode, you are enabling E2E encryption, and the AI no longer has access to this data. Application and browser ads become generic, or must get their mining elsewhere. Because Google is allowing an AI to mine Allo messages for targeted ads, could it be possible that this same AI could be mining other data on your phone for the same goal? Could this AI be mining your email, Twitter, Facebook, photos, and other data? Will this AI be shipping solely with Allo, or will it be a separate service in Android N?

While Google might not be selling your data, they are making a percentage of sales that come from ads. The more targeted the ads become, the more likely you are to make a purchase, and the more likely Google will be to get a percentage of that sale. Google isn't selling your data, but they are making money off of it.

"But what if I want to stay off the grid? What if I don't want even the AI or whatever to see my messages?

"That's fine. We understand your concerns. Everybody including me has something to hide. This is why we develop the incognito mode. In this mode, all messages are further encrypted using the Signal protocol, a state of the art end-to-end chat encryption protocol which ensures that only you and your recipients can read your messages."

WhatsApp, acquired by Facebook, and pushing nearly one billion active messaging accounts, recently enabled E2E encryption also with the Signal Protocol. The difference being, with WhatsApp, E2E is default for every account when they update their app. E2E is not default for Allo, and only enabled for incognito mode. So, if "everybody including me has something to hide", then why isn't E2E default with Allo?

Thai then quotes a survey explaining that users want self-destructing messages more than E2E. He explains that survey with (emphasis mine):

"So to most users what matters the most is not whether the NSA can read their messages, but the physical security of their devices, blocking unwanted people, and being able to delete messages already sent to other people. In other words, their threat model doesn't include the NSA, but their spouses, their kids, their friends, i.e., people around and near them. Of course it's very likely that users don't care because they don't know what the NSA has been up to. If people know that the NSA is collecting their dick pics, they probably want to block them too. At any rate, NSA is just one of the threat sources that can harm normal users."

Sure, my threat model is also losing my phone. I find that much more likely than either the NSA confiscating my phone, issuing a warrant to collect my data, or decrypting my traffic in real-time (which isn't practical anyway). However, while the NSA isn't in my threat model, the NSA should be in Google's threat model. In other words, Google should be worrying about the NSA for me.

This is why I created d-note is is running at https://secrets.xmission.com. As a system administrator, I don't want to turn over logs to the NSA or any other organization. As such, the messages are encrypted server-side before stored to disk, and destroyed immediately upon viewing. The goal isn't necessarily to protect the end user, but to protect the server administrator. By legitimately not being able to provide logs or data when a warrant is issued is extremely valuable.

Google should be protecting the "dick pics" of users from getting into the NSA hands. Apple recently made a strong stand here against the FBI regarding Syed Farook's iPhone. Apple technically could not help the FBI, because of the protections that Apple baked into their product. Apple's hands were tied. As such, the FBI wanted to set a precedent about enabling government backdoors into the OS for future releases, so they would no longer be blocked from access. Apple is protecting the "dick pics" of its users from the NSA, FBI, and everyone else. Why isn't Google? As we mentioned earlier, the answer to that question is data mining and advertising revenue.

"This is why I think end-to-end encryption is not an end in itself, but rather a means to a real end which is disappearing messaging. End-to-end encryption without disappearing messaging doesn't cover all the risks a normal user could face, but disappearing messaging without end-to-end encryption is an illusion. Users need both to have privacy in a way that matters to them."

Emphases mine. So, Thai recognizes that disappearing messaging without E2E encryption is an illusion. So, why isn't it default? The higher powers that be, likely. He mentions in his conclusion that he would like E2E to be default, with a single tap. Something of an option with "Always start in incognito", thus always starting with E2E and always having self-destructing messages. However, rather than opt-in, it should be opt-out. If the prior message history is more important to you than the security of E2E encryption and self-destructing messages, then it should be something that you switch. If SnapChat is so popular because of self-destructing massages, and WhatsApp has one billion users with E2E encryption be default, Google, a company larger than both combined, should be able to do the same.

Finally, one point that Thai does not mention in his post. Allo is proprietary closed-source software. From a security perspective, this is problematic. First, because you don't have access to the source, you cannot audit it to make sure it holds up to the security claims that it has. As security and software engineers, not having access to the source code should be a major block when considering the use of non-free software.

Second, without access to the source code, you cannot create reproducible builds. Even if you did have access to the source code, are you sure the binary you have installed matches the binary you can build? If not, how do you know the binary isn't spying on you? Or compromised? Or just compiled incorrectly, causing undesired behavior? Not being able to create reproducible builds of software means not being able to verify the integrity of the shipped binary. Debian is making it a high priority to ship packages with reproducible builds. It's important to Debian, because they want to be transparent with their userbase. If you don't trust Debian is doing what they claim, you can rebuild the binaries and packages yourself, and they should match what Debian has shipped.

I know this sounds very Richard Stallman and GNU, but proprietary closed-source software is scary when it comes to security. While your immediate threat model might just be those you interact with on a daily basis, the immediate threat model to Google, Apple, SnapChat, and others, are well-funded organizations that have legal weight. Ultimately, they're after your data, which in the end, puts them in your threat model. There are no safety or security guarantees with proprietary closed-source software. You are at the mercy of the software vendor to Do The Right Thing, and many companies just don't.

So, while Allo might be the new kid on the block with E2E encrypted and self-destructing messages, as I've shown, it can't be trusted for your security and privacy. You're best off ignoring it, not recommending it to family and friends, and sticking with Free Software alternatives where E2E messages are default.