## Solving Wordle with Regular Expressions

It seems like everyone is solving the daily Wordle puzzle these days, myself included. If you're not familiar with Wordle, it's a simple, fun, stress-free game. Your goal is to guess the secret 5-letter word within six tries. The rules are really quite simple:

1. Gray letters are not in the solution.
2. Yellow letters are in the solution, but in a different position (or positions).
3. Green letters are in the solution and in the correct position.

That's it. Let's look at the puzzle for January 1, 2022 from first guess to last to see how it works:

Here we solved the puzzle in 4 guesses:

1. "STARE": "S", "R", and "E" are in the solution, but in different positions. "T" and "A" are not in the solution.
2. "EUROS": "E", "U", and "R" are in the solution, but in different positions. "O" is not in the solution. "S" is the last character in the solution.
3. "URGES": "U", "R", and "E" are in the solution, but in different positions. "G" is not in the solution. "S" is the last character in the solution.
4. "REBUS" is correctly guessed as the solution.

I'm always curious if I can stretch my skills as a system administrator and software developer, and one of those critical skills is regular expressions. While I'm good with the basics, such as line and word anchors, character sets, quantification, and simple back references, I wanted to see if I could dig deeper using the /usr/share/dict/words file on my Debian laptop.

First off, I already know I can get a list of all 5-character alphabetic words ignoring proper nouns with the following regex:

`\$ grep -P '^[a-z]{5}\$' /usr/share/dict/words`

This returns 4,681 unique words on my system. I just need to pick a good starting word that increases my chances of guessing the solution quickly. Knowing about character frequency, I can pick a word that maximizes frequently used characters in English. A solid pick would be something that uses "E", "S", "T", "A", "I", "R", etc. As such, I always start with "STARE".

The clues tell us immediately that "S", "R", and "E" are in the solution, just in different positions. We also learn that "T" and "A" are not in the solution. So my limited knowledge of regular expressions would have me doing something like this:

`\$ grep -P '^[a-z]{5}\$' /usr/share/dict/words | grep 's' | grep 'r' | grep 'e' | grep '[^ta]'`

Surely regular expressions can handle this logic without unnecessary pipes. The obvious question is whether or not PCRE supports logical AND. Basically, "match words that have 'S' and 'R' and 'E' and not 'T' and not 'A'". The answer is "yes" using lookarounds.

Regular expression lookarounds come in four flavors:

1. (?=foo): Look ahead. "foo" immediately follows the current position in the string.
2. (?<bar): Look behind. "bar" immediately precedes the current position in the string.
3. (?!baz): Negative look ahead. What immediately follows the current position in the string is not "baz".
4. (?<!qux): Negative look behind. What immediatley precedes the current position in the string is not "qux".

I want "S", and "R" and "E" to appear in the string. So I can use look aheads to make that possible. Because "foo" can be any regular expression pattern, then I can do:

`\$ grep -P '^(?=.*s.*)(?=.*r.*)(?=.*e.*)[a-z]{5}\$' /usr/share/dict/words | grep '[^ta]'`

Further, I know that "T" and "A" are not in the solution, and further I know that "S" will not be in the 1st position, "R" will not be in the 4th, and "E" will not be in the 5th. So, I can make five character classes with negative matches. Our final regular expression looks like:

`\$ grep -P '^(?=.*s.*)(?=.*r.*)(?=.*e.*)(?=[a-z]{5})[^sta][^ta][^ta][^tar][^tae]\$' /usr/share/dict/words`

This returns 80 words, but I can fiddle it down a touch further (for the moment) by removing words that have repeated characters. For example, "preps" is returned in that list, but repeats "p". In order to maximize my win ratio, I should probably be taking advantage of words with strictly unique characters. So I'll add the negative look ahead of (?!.*(.).*\1) to my list:

`\$ grep -P '^(?=.*s.*)(?=.*r.*)(?=.*e.*)(?=[a-z]{5})(?!.*(.).*\1)[^sta][^ta][^ta][^tar][^tae]\$' /usr/share/dict/words`

This returns 66 words which is easier to parse. I should probably pick something high on the character frequency list. "EUROS" seems like a good pick, so let's try that:

The clues this time tell us that "E", "U", and "R" are in the solution, but in different positions, "O" is not in the solution, and "S" was correctly picked for its position. We can modify our expression above with the new data. We know that for the:

1. First character: Cannot be "S", "T", "A", "E", or "O".
2. Second character: Cannot be "T", "A", "U", or "O".
3. Third character: Cannot be "T", "A", "R", or "O".
4. Fourth character: Cannot be "T", "A", "R", or "O".
5. Fifth character: "S".

So we modify our expression:

`\$ grep -P '^(?=.*r.*)(?=.*e.*)(?=.*u.*)(?=[a-z]{5})(?!.*(.).*\1)[^staeo][^tauo][^taro][^taro]s\$' /usr/share/dict/words`

We're down to three words. Now we still have that filter that no word can have a repeating character. It might be worthwhile to remove it as our returned list is manageable. Interestingly enough, removing that filter returns the same three words.

`\$ grep -P '^(?=.*r.*)(?=.*e.*)(?=.*u.*)(?=[a-z]{5})[^staeo][^tauo][^taro][^taro]s\$' /usr/share/dict/words`

We know the final word is either "GRUES" or "REBUS" at this point, and I have 3 guesses left to win. But let's see if updating the regular expression reveals the solution. Again, given the clues, we know "U", "R", and "E" are in the solution, but in different spots. "G" is not in the solution, and "S" is the final character. Knowing that "GRUES" has "G", then the solution should be "REBUS". Let's see if regex agrees:

`\$ grep -P '^(?=.*r.*)(?=.*e.*)(?=.*u.*)(?=[a-z]{5})[^staeoug][^tauorg][^tarog][^taroeg]s\$' /usr/share/dict/words`

Indeed it does. "REBUS" is the only valid word with this dictionary and Wordle confirms it's the correct solution.

Let's look at each piece in this regular expression, just to make sure we know what it's doing:

• grep -P: Use Perl Compatible Regular Expressions with GNU grep.
• '^: Anchor the match at the beginning of the line.
• (?=.*r.*): Match "r" anywhere in the word.
• (?=.*e.*): Match "e" anywhere in the word.
• (?=.*u.*): Match "u" anywhere in the word.
• (?=[a-z]{5}): Match five lowercase alphabetic characters only.
• (?!.*(.).*\1): Do not match words with repeated characters.
• [^staeoug]: The first character cannot contain "s", "t", "a", "e", "o", "u", or "g".
• [^tauorg]: The second character cannot contain "t", "a", "u", "o", "r", or "g".
• [^tarog]: The third character cannot contain "t", "a", "r", "o", or "g".
• [^taroeg]: The fourth character cannot contain "t", "a", "r", "o", "e", or "g".
• s: The fifth character must be "s".
• \$': Anchor the match at the end of the line.

Yes, it's ugly. Yes, it's probably more efficient to pipe several times. Yes, this regular expression is error-prone and could be difficult to debug. BUT, you learned lookarounds which may come handy for pattern matching in the future unrelated to Wordle.

Next step is discovering whether or not you could store the gray characters (not in the solution) in a capture group and refer back to them in the character classes. That would help to reduce typing and error. I'll leave that as an exercise for the reader (and author).

Your email is never published nor shared.