The Fantastical Devices of Pete the Mad Scientist: May 2012

Wednesday, 9 May 2012

Parts of Speech in the Voynich Manuscript

As previously mentioned, I've been trying to do some analysis on The Voynich Manuscript. There are three basic schools of thought about what the language in the manuscript might be-

A cipher for some natural language
A conlang
Gibberish

As a conlanger myself, I tend towards the second hypothesis. If that's correct, the key to understanding it will lie in working out what the grammar of that language is. As a first stage, I've been trying to cluster words according to the environments in which they occur. The idea is that words that occur in similar environments are likely to have similar grammatical roles. Suppose we have a sequence of words A X B, then the tuple (A,B) can be considered the environment of X. However, since there are a lot of different possible environments, that would produce a very low signal to noise ratio. Therefore, for each environment, I found the set of words that could occur in that environment, and compared those sets between pairs of environments using the Tanimoto metric. I clustered the environments using a nearest neighbour approach (each environment belonged to the same cluster as its nearest neighbour). For each distinct word in the text, I then calculated the probability of it occurring in any environment found in each of the clusters described above. Comparing vectors of these probabilities with a Pearson metric, I performed nearest neighbour clustering. This initially produced 92 clusters, so I created a sequence of cluster memberships for each word in the text, performed clustering on that, and then merged clusters accordingly. I repeated this until no further reduction in the number of clusters was possible. I ended up with five clusters, about the right number to represent parts of speech. My results can be seen here (I do need to fix the formatting, though). A quick statistical summary is

Cluster	Number of words	Number of instances
0	4291	15790
1	1470	7040
2	747	3460
3	379	2270
4	4420	12142

But what to these categories mean? How do they relate to each other? That will be the subject of my next experiment.