split-s: Breaking the code

As a follow-up to my previous post, I'd like to describe how the I broke the GCHQ code. All it took was a bit of luck, some detective work and some programming.

The first time I saw the challenge, I skimped over the names to get a feel for it, without any intention of actually trying to solve it. It was pretty obvious to me, though, that the names had been encrypted with some kind of substitution cypher.

And then I saw the tip-off... BZGZD A'GAANZ. That had to be an irish-like name, so the A would be and encrypted O. Consequently, the last name would have to have the form O'xOOxx. Peter O'Toole immediately came to mind, although I didn't initially remember who he was. But the name sounded so familiar. I googled for it and it turned out he was an actor. Then I remembered :), and I was hooked.

I looked at the other names carefully, trying to spot easy ones, like KHP FNWHQBEV. With three distinct letters in the first name, I couldn't think of anything else than "Kim". Kim Basinger came to mind, and it fit nicely.

It was then that I decided to write a program to automate the discovery. I needed the program to take a name as input and match it to each name in the list. The question was how to do this efficiently. I realized that when dealing with substitution cyphers, an unencrypted string has the same form as the encrypted version. For example, PETER could be described as a sequence of characters [a, b, c, d, e] where a = P, b = d = E, c = T, e = R. It can be rewritten as [a, b, c, b, e], or "abcbe". In a sense, this sequence is a canonical form for the string. This (Ruby) code snippet will do the transformation:

def canonicalForm(chars)
     map = {}
     currentIndex = 64
     result = ""
     chars.each_byte { |char|
         if not Regexp.new("[' ]") =~ char.chr
             index = map[char] ||= (currentIndex += 1)
         else
             index = char
         end
         result << index
     }
     return result
end

A string and its encrypted version will both have the same canonical form. For instance, for PETER O'TOOLE and BZGZD A'GAANZ, the canonical form is ABCBD E'CEEFB.

The original version was quite primitive... if you fed it a name it would tell you which, if any, encrypted names matched. I discovered a few names with this approach. At first I thought the actors and actresses were Oscar winners, but I quickly dropped that hypothesis when I encountered "Kirsten Dunst". Given that she performed in Spiderman, I decided to try "Tobey Maguire". Direct hit! So my new theory was that an actor and actress that had performed in the same movie would be paired together. But I had to test it.

That's when I came up with the second version of the program, one that would build the cypher given a string and its encrypted version. For example, if I provided it with TOBEY MAGUIRE and DYFJN WCLEQPJ, it would produce

CF..J.L.Q...W.Y..P.DE...N.
The dots represent unknown substitutions. If I did the same with Kirsten Dunst, I'd get

...HJ...Q.U..X...PIDE.....
If you take a close look at the cyphers, you'll notice that they look awfully similar (the J, Q, P, D, and E are in the same positions). If you merge them, you get

CF.HJ.L.Q.U.WXY..PIDE...N.
Also, you'll note that the merged cypher is very regular and that the letters seem to be, up to a point, in alphabetical order. We can confidently fill in some of the missing spaces, for only one letter seems to fit:

CFGHJKL.Q.UVWXY..PIDE...N.
The last portion is suspiciously close to Spiderman, the name of the movie. That was one of the tell-tale signs that confirmed that the cyphers were unique to each pair, and that they were built from the name of the movie. Completing the cypher with the missing letters from Spiderman we get

CFGHJKL.Q.UVWXY.SPIDERMAN.
That leaves only a few letters left, which, again, we can detemine unequivocally:

CFGHJKLOQTUVWXYZSPIDERMANB

Some names were still missing their pair. I was able to infer a few by applying the cyphers of the still-unpaired names to the remaining encrypted names. For example, the cypher for Sean Connery produced .ONOR...AC..AN when decrypting one of the names. .ONOR could be nothing else than "Honor", so I queried www.imdb.com for all the people named that way. The first entry in the list was "Honor Blackman". Another direct hit!

At this point I was out of luck. I had no idea how to decrypt the remaining names and I had tried with all the actors and actresses I could remember. That's when I decided to investigate if IMDb had a programmatic interface. It did not, but it was possible to download the list of all known actors and actresses in a text file. Oh yeah, baby!

The third incarnation of the program would parse the text files and attempt to decrypt each of the remaining names with each of the names in the file. Surprisingly, some names produced unambiguous translations, but others resulted in dozens of alternatives. And I was not going to try one by one.

During the few minutes I spent thinking about how to best tackle the problem ahead I contemplated a couple of alternatives and I made an interesting discovery: if I picked any name (with Ae and Ad being the encrypted and decrypted versions, respectively) and another name B (with Be and Bd following the same rule as A), such that A and B are encrypted with the same cypher, the concatenation of Ae with Be has the same canonical form as the concatenation of Ad with Bd. If you stop to think about it for a second it, it makes perfect sense. Just consider what happens within a name. The encrypted and decrypted versions of the first name have the same canonical form. If we add the last name, which is written in terms of the standard alphabet in the decrypted version, and in a non-standard alphabet resulting applying the cypher on the standard alphabet, the strings are still isomorphic (i.e., they have the same canonical form). Thus, it follows that if we add a full name to each of the strings, each in their respective alphabet, the result will still be two isomorphic strings.

In the fourth incarnation, the program would pair up all the alternatives for the names that I hadn't been able to decrypt and make sure that their concatenation was isomorphic with the contatenation of the encrypted versions. Amazingly, the program produce one and only one pairing for each name! I was done with the first step.

Next, I had to figure out what were the names of the movies, since I had a hunch that the hidden quotation would be buried somewhere within the cyphers. I could've written a program to extract this information from the IMDb text files, but they are not well structured and I considered that it would take me longer to figure out how to parse them than to query IMDd. It did take some going back and forth in IMDb, but it wasn't that bad. Plus, I had already figured out most of the names, anyway.

Finding the hidden quotation took me a while, though. I tried everything I could possibly think of, to no avail. Then, it simply clicked. I looked at the release dates of the movies and noticed they were all different, and ranged from the 40's up to the present day. That could imply only one thing: they were meant to be sorted! And so I did. I sorted the cyphers based on the release date, in ascending order. And it didn't take me long to realize that the quote was in the diagonal of the matrix formed by the cyphers.

split-s

January 29, 2005

Breaking the code

4 Comments:

Previous Posts