Suppose that you were presented with the following polyalphabetic cipher and told that you'd never be able to discover the keyword or plaintext.
debujh rxgzze buhlkg szzecr rsahso llworg prfeld ejjwdw cdwugv xerbru slenzz fchczj lifgpw iutsxj orgxit tsxdrf qiausm linlen rxfqgv jxbckh hbxuum cnhwli khjgbc ukcjgs weikit jazvzx cxznlr whqwzm kdmpgv xsewip ebxkol gxvjtg gyhrug xfydeb jrbwsh bhgjzr vxjhlq
For the sake of time, let's assume that you performed the tests explained in the first polyalphabetic cipher lesson and determined that it is indeed a polyalphabetic cipher. And since the person told you that there is a keyword, you know that you'll have to find the period…which means that we can start using what we just learned.
The first step is to search the message for any repeated polygraphs and note the spacing between each pair. This can be done by hand, or you can use the program that we made to speed up the process. Whichever method you use, you'll find that the following digraphs and trigraphs appear in the message with this amount of spacing:
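If you don't have the program at hand, the search can be sketched in a few lines of Python. This is only a minimal illustration, not the program from the earlier lesson; the names `repeated_polygraphs` and `spacings` are made up for this example, and positions are counted from 0.

```python
from collections import defaultdict

# The ciphertext from the top of the lesson, with the spaces removed.
CIPHERTEXT = (
    "debujh rxgzze buhlkg szzecr rsahso llworg prfeld ejjwdw cdwugv xerbru "
    "slenzz fchczj lifgpw iutsxj orgxit tsxdrf qiausm linlen rxfqgv jxbckh "
    "hbxuum cnhwli khjgbc ukcjgs weikit jazvzx cxznlr whqwzm kdmpgv xsewip "
    "ebxkol gxvjtg gyhrug xfydeb jrbwsh bhgjzr vxjhlq"
).replace(" ", "")


def repeated_polygraphs(text, n):
    """Map every n-letter group that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(text) - n + 1):
        positions[text[i:i + n]].append(i)
    return {group: where for group, where in positions.items() if len(where) > 1}


def spacings(positions):
    """Distances between consecutive occurrences of one polygraph."""
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


if __name__ == "__main__":
    for n in (3, 2):
        for group, where in sorted(repeated_polygraphs(CIPHERTEXT, n).items()):
            print(group, where, spacings(where))
```

Run on the ciphertext above with `n=3`, it reports 'tsx' repeating 10 letters apart and 'gvx' repeating 120 letters apart.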
If you typed in 222 (the number of letters in the message) and pressed enter, you will have found that, by chance alone, about 23 digraphs should appear twice, 2 digraphs three times, and possibly one four times.
Assuming that is true, we have 30 polygraphs that were actually caused by the same plaintext being enciphered with the same cipher alphabet, and about 25 that are just random noise. How do we tell the difference?
The difference between a true duplicate and one caused by noise can be seen in the spacing between the digraph pairs. If you remember, the spacing should be a multiple of the period, since the keyword must have been repeated in full for the same key letters to be applied to the same plaintext. So the majority of the digraphs should sit at intervals that are multiples of the same number. The following table shows the multiples…
It's fairly obvious that the keyword could not be 89 letters long, so it's easy to eliminate that digraph. Similar intervals…such as other large prime numbers…can also be crossed out quickly. And, by looking at the remaining polygraphs, it's fairly clear that the keyword has either 5 or 10 letters (notice how many polygraphs have those numbers as factors).
If the keyword were 10 letters long, that would mean that about ## of those polygraphs were caused by noise. That is too many, though, so it's much more likely that 5 is the answer. This would mean we could trash ## polygraphs, which is just the right number.
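The factor check can also be done mechanically: for every candidate period, count how many of the spacings it divides. The sketch below is only an illustration under stated assumptions. The function name `period_votes` is made up here, and `example_spacings` is a small partial sample rather than the full table: 10 and 120 are the gaps between the repeated trigraphs 'tsx' and 'gvx' in the ciphertext, and 89 is the spacing ruled out above.

```python
def period_votes(spacings, max_period=20):
    """For each candidate period, count the spacings that are a multiple of it."""
    votes = {}
    for period in range(2, max_period + 1):
        votes[period] = sum(1 for s in spacings if s % period == 0)
    return votes


# Partial, illustrative sample of spacings (not the complete table).
example_spacings = [10, 120, 89]

if __name__ == "__main__":
    votes = period_votes(example_spacings)
    # List the candidates from most to least supported.
    for period, count in sorted(votes.items(), key=lambda kv: -kv[1]):
        print(period, count)
```

On this tiny sample 5 and 10 tie (and so does 2), which is exactly why the noise estimate is needed to choose between them. With the full list of spacings the counts become much more telling.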
And now that we know 5 is the period, we change gears and go into monoalphabetic mode.
Assuming that the period is five, the 1st, 6th, 11th, 16th, etc… letters should all come from the same cipheralphabet (likewise the 2nd, 7th, 12th, and so on). If we move the ciphertext into a large table five columns wide, each column will be composed of letters from one cipheralphabet.
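In Python the table can be built with a slice per column. This is a minimal sketch; `split_columns` is a name invented for this example, and `sample` is just the first 18 letters of the ciphertext.

```python
def split_columns(text, period):
    """Return one string per column: letters 1, 1+period, 1+2*period, ... and so on."""
    return [text[i::period] for i in range(period)]


# First 18 letters of the ciphertext, spaces removed.
sample = "debujhrxgzzebuhlkg"

if __name__ == "__main__":
    for number, column in enumerate(split_columns(sample, 5), start=1):
        print(number, column)
```

For the sample, the first column comes out as `dhzl`, that is, the 1st, 6th, 11th and 16th letters.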
Now, using frequency distributions, we can attack each column as an individual monoalphabetic substitution. We just have to keep in mind how our decisions from one alphabet will influence another, which is often very helpful.
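Counting the letters in each column is easy to automate. The following is a small sketch using `collections.Counter` from the Python standard library; the function name `column_frequencies` is an assumption of this example, and `sample` is again only the first 18 letters of the ciphertext, far too short for a meaningful distribution.

```python
from collections import Counter


def column_frequencies(text, period):
    """One Counter per cipher alphabet, built from every period-th letter."""
    return [Counter(text[i::period]) for i in range(period)]


# First 18 letters of the ciphertext, spaces removed.
sample = "debujhrxgzzebuhlkg"

if __name__ == "__main__":
    for number, counts in enumerate(column_frequencies(sample, 5), start=1):
        # most_common() lists letters from most to least frequent.
        print(number, counts.most_common())
```

Feeding it the whole ciphertext instead of `sample` gives one frequency count per alphabet.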
Looking at the frequency distribution for each column would yield the following:
The ciphertext letter standing for 'e' is not clear in all of them, although it is a safe bet that it is 'l' in the first alphabet and 'x' in the third. This doesn't leave us with much, but it does give us enough.
The next attack to try comes from the two trigraphs found in the ciphertext, 'gvx' and 'tsx'. The 'x' in the second one is almost certainly 'e', and the 's' and 't' occur with medium/high frequency in their respective alphabets. Although it might be a long shot, 'the' is the most common word, so it wouldn't be surprising to find a trigraph with it. The 'x' and the 'e' match up perfectly, and the 'ts' and 'th' frequencies are reasonable. We substitute the appropriate letters for the rest of the message's 't's, 's's and 'x's in those alphabets and move on. Although this could be wrong, we have to start somewhere.
The other trigraph, 'gvx', has no letters which we can immediately determine from the frequency distributions. However, we can make educated guesses. Perhaps, like 'the', 'gvx' is another popular three-letter word, such as 'and'. The 'a' makes perfect sense in place of 'g'. In the frequency distribution, 'g' is the second most popular letter, which doesn't exactly match our high-frequency order (etoanirsh), but it's close enough to warrant an attempt. Furthermore, the 'v' and 'x' for 'n' and 'd' also fit within the expected frequency distributions.
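To see how far these guesses get us, each alphabet's tentative substitutions can be applied to the whole message, leaving a dot wherever a letter is still unknown. This is only a sketch: `partial_decrypt` and the `guesses` table are names invented for this example, and the table holds nothing but the guesses made so far (alphabets are numbered from 0, so alphabet 0 is the first column).

```python
CIPHERTEXT = (
    "debujh rxgzze buhlkg szzecr rsahso llworg prfeld ejjwdw cdwugv xerbru "
    "slenzz fchczj lifgpw iutsxj orgxit tsxdrf qiausm linlen rxfqgv jxbckh "
    "hbxuum cnhwli khjgbc ukcjgs weikit jazvzx cxznlr whqwzm kdmpgv xsewip "
    "ebxkol gxvjtg gyhrug xfydeb jrbwsh bhgjzr vxjhlq"
).replace(" ", "")

# Tentative ciphertext -> plaintext guesses, one dictionary per alphabet.
guesses = {
    0: {"l": "e", "t": "t"},  # 'e' from frequencies, 't' from 'tsx' = 'the'
    1: {"s": "h"},            # from 'tsx' = 'the'
    2: {"x": "e", "g": "a"},  # 'e' from frequencies, 'a' from 'gvx' = 'and'
    3: {"v": "n"},            # from 'gvx' = 'and'
    4: {"x": "d"},            # from 'gvx' = 'and'
}


def partial_decrypt(ciphertext, guesses, period, unknown="."):
    """Replace every letter we have a guess for; mark the rest as unknown."""
    out = []
    for i, letter in enumerate(ciphertext):
        alphabet = guesses.get(i % period, {})
        out.append(alphabet.get(letter, unknown))
    return "".join(out)


if __name__ == "__main__":
    print(partial_decrypt(CIPHERTEXT, guesses, 5))
```

Every new guess is just another entry in the table, so a wrong guess is cheap to undo: delete the entry and run it again.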
We could continue this lengthy explanation of how to solve simultaneous monoalphabetic substitutions, but most of the work is trial and error with the probable word method…all with constant checking back to the frequency table. It's an exponential process: once several letters are known, the words become easier to spot and the letter meanings come faster.
As it turned out, we were right about the 'the' and the 'and'. And if we continued to test, we would have found the following text:
we crossed the creek at the head of the island by means of a skiff and ascending the high grounds on the shore of the main land proceeded in a north westernly direction throug a tract of country excessively wild and desolate where no trace of a human footstep was to be seen.
And in the process, we would also have recovered the 5 cipheralphabets used in the encoding. When they are lined up in order, the keyword is fairly obvious, and fitting.
Cipheralphabets lined up with EDGAR appearing…