CHAPTER XXIII
Investigating the Unknown Cipher
When the type of encipherment is unknown, the decryptor’s first problem may concern the probable language used in the plaintext, and this he is usually able to determine from the source and history of the cryptogram.
His second problem is the major classification, and this, too, is usually simple, since transposition, as a rule, can be recognized by its appearance. A suspected transposition must, however, respond to a group-test, and for cases in which this is needed, the approximate percentages for English can be taken as follows:
| Letter group | Approximate share | Notes |
| --- | --- | --- |
| Vowels, with or without Y | about 40% | Variation limits: 35% to 45% |
| Consonants L N R S T | about 30% | Variation limits: 25% to 35% |
| Consonants J K Q X Z | about 2% | May be influenced by nulls |
The 5% variation is suggested in the Parker Hitt Manual. In this connection, it should be pointed out that an apparent transposition with exactly 40% of vowels and 100% evenness in their distribution is suspicious. Many of the checkerboard systems result in this way, and also some of the codes based on pronounceable five-letter groups. Then, too, it is easily possible to construct a simple substitution cipher alphabet in such a way that the resulting cryptograms will resemble transposition, and even respond satisfactorily to a group-test. It should be carefully ascertained that a supposed transposition cryptogram does not contain the many repeated sequences which belong to simple substitution. As to those transpositions which do show an appreciable number of repeated digrams, they will probably have undergone one of the route transpositions, especially one in which columns were taken off in alternating directions.
Concerning the characteristics of simple substitution, these have been seen throughout the text; we have normal frequencies attached to the wrong letters, and we have those numerous repetitions of various lengths, occurring at all kinds of intervals, which are never found in a transposition. Here, too, we may apply a group-test, based only on the relative frequencies of letters. The five most frequent are supposed to represent the letters E T A O N or their equivalents, and should total about 45% of the text. The nine most frequent should total about 70%; the eleven most frequent well over 75%; the five of lowest frequency (which would include all of those totally absent) should correspond to the normal behavior of the group J K Q X Z.
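The frequency-rank version of the group-test can be sketched in the same way. The function name `rank_group_test` is hypothetical. The code sums the counts of the most frequent and least frequent cipher letters without regard to their identities, and it counts absent letters as zeros so that they fall into the lowest group, as the text requires.

```python
from collections import Counter
import string


def rank_group_test(text):
    """Return the share of text held by the top 5, 9, 11 and bottom 5 letters."""
    letters = [c for c in text.upper() if c in string.ascii_uppercase]
    if not letters:
        raise ValueError("text contains no letters")
    counts = Counter(letters)
    # One count per alphabet letter, absent letters included as zero,
    # sorted from most frequent to least frequent.
    freqs = sorted((counts.get(c, 0) for c in string.ascii_uppercase),
                   reverse=True)

    def pct(k):
        return 100.0 * k / len(letters)

    return {
        "top5": pct(sum(freqs[:5])),      # expected about 45%
        "top9": pct(sum(freqs[:9])),      # expected about 70%
        "top11": pct(sum(freqs[:11])),    # expected well over 75%
        "bottom5": pct(sum(freqs[-5:])),  # compare with the J K Q X Z group
    }
```

The expected figures in the comments are those quoted in the paragraph above; on short cryptograms the observed values can be expected to wander from them.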
If the simple substitution frequency count is present without the repeated sequences, then we probably have a combination of simple substitution with transposition. It becomes necessary to rewrite the cryptogram into various new arrangements until one is found which will bring back the repeated sequences. Ordinarily, the simplest kinds of transposition will have been used; sometimes the transposition will have taken place in a complete-unit block, and there will be a clue in the total number of letters present in the cryptogram.
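The search for an arrangement that restores the repeated sequences can be mechanized in a rough way. The sketch below makes two assumptions that are not in the text: the transposition is a plain columnar one with the columns taken off in their natural order, and the block is completely filled, so that only widths dividing the total length need be tried. All three function names are hypothetical.

```python
from collections import Counter


def repeated_ngrams(text, n=3):
    """Count surplus occurrences of n-letter sequences (each repeat counts once)."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return sum(count - 1 for count in grams.values() if count > 1)


def read_rows_from_columns(text, width):
    """Undo a column-by-column take-off from a completely filled block."""
    if len(text) % width:
        raise ValueError("length is not a multiple of the width")
    rows = len(text) // width
    # Column j occupies text[j*rows:(j+1)*rows]; rebuild the block row by row.
    return "".join(text[j * rows + i]
                   for i in range(rows) for j in range(width))


def rank_block_widths(text, n=3):
    """Score every width that divides the length, best score first."""
    length = len(text)
    widths = [w for w in range(2, length) if length % w == 0]
    scored = [(repeated_ngrams(read_rows_from_columns(text, w), n), w)
              for w in widths]
    return sorted(scored, reverse=True)
```

A width that brings back noticeably more repeated sequences than the others is a candidate for the true block. Route transpositions, or columns taken in a keyed order, would need further arrangements that this sketch does not attempt.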
When all letters are present in the frequency count (or all but one or two in the possible cases of 25-letter and 24-letter alphabets), a period-investigation is usually indicated. The case of periodics has been seen at considerable length, though a final hint might be added for the detection of a possible Porta encipherment. One of our many collaborators, F. R. Carter, suggests that any Porta cryptogram, periodic or otherwise, ought to show from 52% to 53% of letters N to Z — the opposite of normal.
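Carter's suggestion amounts to a single count, sketched below under the hypothetical name `second_half_share`. The 52% to 53% figure is his and is not verified here; the code only measures the share of letters N to Z so that it can be compared with that figure.

```python
def second_half_share(text):
    """Return the percentage of letters falling in the range N to Z."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        raise ValueError("text contains no letters")
    return 100.0 * sum(c >= "N" for c in letters) / len(letters)
```

A value near 52% or 53% would be consistent with the suggestion, though on a short cryptogram a difference of a few percent is probably within chance variation.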
The characteristics of digram-encipherment have been mentioned. Other polygram ciphers show corresponding characteristics, according to the polygram length, though the trail grows fainter as polygrams grow longer. A trigram-system, for instance, might be present when the cryptogram is evenly divisible into three-letter groups; it might suggest period 3, and might even show repeated sequences whose length is a multiple of 3 and which begin at serial positions such as 1, 4, 7, 10, which are the beginnings of trigrams. A great many of the trigram systems will show only repeated digrams beginning at these serial positions, or separated by intervals which are divisible by 3.
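The interval observation for trigram systems can also be checked by count. The sketch below, with the hypothetical name `digram_interval_profile`, lists every pair of occurrences of the same digram and reports how many pairs are separated by a multiple of the unit length. It is offered as an aid to inspection only and assumes the cryptogram has already been stripped of spaces.

```python
from collections import defaultdict
from itertools import combinations


def digram_interval_profile(text, unit=3):
    """Return (pairs at intervals divisible by unit, all repeated pairs)."""
    positions = defaultdict(list)
    for i in range(len(text) - 1):
        positions[text[i:i + 2]].append(i)
    divisible = total = 0
    for places in positions.values():
        # Every pair of occurrences of one digram gives one interval.
        for first, second in combinations(places, 2):
            total += 1
            if (second - first) % unit == 0:
                divisible += 1
    return divisible, total
```

If the intervals fell by chance, roughly one third of them would be divisible by 3, so a proportion well above one third is what would point toward a trigram system.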