The Mysterious Voynich Manuscript


Ian James
© February 2009


Quick Intro

For background on the history and layout of the document, basic insights into it, and further links, see Other info, Wikipedia, etc.

In summary, the Voynich MS is a document of unknown origin, written in an unknown script and unknown language, by almost all accounts still completely undeciphered. Dated to roughly the Renaissance period, its earliest known mention is in records from the Bohemian court in Prague of the 16th century. There are 1,900 paragraphs and 37,000 words across 228 pages, with illustrations of plants, stars, and quasi-biological figures.

A paragraph from folio 67:

f67r1 extract

A transliteration of same using the EVA system of chars:


The EVA system

In this system, each char (symbol) has been mapped to a Latin char, based on similarity of shape and with a satisfactory mix of consonants and vowels, to enable the original chars to be easily recognized and the words to appear vaguely readable. The chars appearing 100 times or more (together accounting for 99.94% of the total of 180,000 char occurrences) are:
vms chars a-l vms chars m-y
special combos
vms combo chars
Capital letters are sometimes used in EVA to signify variations involving ligatures. Another dozen chars appear rarely, some of them only once. Some of those appear solo, and may be abbreviations or decorative symbols. The others, given below, may be variations of more common chars, or simply stand for rare phonemes or morphemes:

vms rare chars

In addition, EVA has editorial symbols including:

* unreadable or uncertain char
. word break
, possible word break
= end of paragraph
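
The editorial symbols above are enough to split a transliterated line into words. The following is a minimal Python sketch, not part of the EVA specification: the function names and the sample line are made up for illustration, and the choice to join the pieces around a possible word break by default is an assumption.

```python
import re

def split_eva_line(line, split_uncertain=False):
    """Split one EVA-transliterated line into words."""
    # '=' marks the end of a paragraph; drop it before splitting.
    text = line.strip().rstrip("=")
    # '.' is a word break; ',' is only a possible word break.
    pattern = r"[.,]" if split_uncertain else r"\."
    words = re.split(pattern, text)
    if not split_uncertain:
        # Assumption: treat a possible break as no break and join the pieces.
        words = [w.replace(",", "") for w in words]
    return [w for w in words if w]

def is_fully_readable(word):
    """True when the word contains no '*' (unreadable or uncertain char)."""
    return "*" not in word

# Made-up sample line, not a real transcription.
sample = "daiin.ol,ar.qo*y.dy="
print(split_eva_line(sample))
print(split_eva_line(sample, split_uncertain=True))
```

Running both variants over the same text gives a quick sense of how much the word statistics depend on the uncertain breaks.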

There are many difficult-to-read chars, and there may actually be more or fewer foundational chars in the set, and thus a lingering confusion over how best to identify and define them. For example, in the small text sample above (originals at Yale University) it can be seen that

q can look like y
d seems to have 2 distinct forms
s by itself and within sh seem quite unrelated
y can look like e with a long separate tail, etc.

Content and Reason for Being

What we know

It is important to note that until even just a single word has been translated, all theories, arguments, and expert contributions remain empty conjecture & noise. The common complaints over lack of information and imperfections in the original are also somewhat distracting. The manuscript is the work of a human communicator working within a mental framework all humans share. Even if only 50% of the text were available, or 5%, or one page, the amount of indisputable orthographic evidence would still be enough that we should find some meaning. Failing to do so shows up a weakness in our understanding of human communication. Back to work, everyone.

Some observers have suggested the document contains:

  • gibberish
  • an academic challenge or hoax
  • encrypted words
  • a record of an exotic language
  • a synthetic language
  • alchemical or magical secrets

Unlikely to be gibberish or a hoax

The organization of the chars is elaborate and consistent, resembling natural language. It passes various statistical linguistic tests, and is non-random, curiously varied, intentionally structured. It runs smoothly off the pen. As it stands, however, a straight reading of the chars as an alphabetic system is unusual and baffling.

It has proved to be an enormous challenge over the centuries. But this does not imply the author designed it as such, or as a joke, or to feign knowledge of some strange magical tongue. It certainly required a great deal of effort and time to produce, and for a document with only very dull or very bizarre illustrations and no clear show of typically “magical” content, casual readers would hardly be bothered with it. Except for the writing system itself.

Code-breakers disappointed

Many have attempted decipherment of the text by applying techniques of cryptography and mathematical analysis. The lack of success may be due not to the script system being intentionally obfuscated, but conversely, to it being meaningful of itself. In the same way that natural language cannot be coldly decoded, the text may be expressing phonemes or words or ideas directly, somehow. There is as yet no known system of generic translation from one language-set (or idea-set) to another.

There is also some suggestion the text includes contributions from more than one hand. This would imply a mutual understanding and agreement on how to use the language, and a straightforward means of teaching or transferring it. Even if the author is one person, the variations in handwriting would suggest different periods, and a long span of work, and a straightforward personal method which defied change, evolution, and tedium. The slight differences found by some statistical analyses may come from different authors preferring slightly different grammatical paradigms (eg. passive rather than active voice), or from one author developing new preferences (perhaps a swifter, terser expression) over time, or simply from wording dependent on different topics. The final bundling of folios does not reflect the original timeline, for either subject exploration or style or additions.

Many patterns which emerge in the search for phonemes or syllables suggest either too great a set, or results which do resemble gibberish. It seems a deeper structure is involved, or the units refer to something other than spoken sounds, as in an artificial language. Or the language is natural, but particularly exotic.

Why record a foreign language this way?

From our present knowledge of world languages and script systems, the language of the (single-instance) Voynich document is unique. If someone had presented it as an example from a strange country of their writing, we would certainly know of other instances by now.

Or if a monk or scholar was visiting a strange country and wanted to study and record local knowledge (as has happened), such a form might be invented for the purpose, especially if the foreign language had no script of its own. Modern examples of this include Cree and Hmong. But the writer would surely seek to make it more accessible to the readers back home, not less so. Perhaps the intended key to the system became lost, and the work sadly (for the author) unreadable to others. But even then, surely something of the local culture would have appeared in the illustrations, perhaps even in the letter-forms, some token of his journey & new environment. Indeed it seems to resist local input, from the writing itself down to the depiction of non-mundane, even unreal activities.

Another option is that the author himself[1] is foreign, and is recording his own (very different) language using an invented script based loosely in the Roman/classical tradition of his new home. Script invention is not uncommon in private communications[2].

An invented language

The script is reminiscent of those used in manuscripts written in Europe, but without any correspondences in the detail. One important observation is that while the author did not produce finely drafted illustrations, and appears from those to lack sophisticated (scientific) knowledge of botany, biology and astronomy, he does seem to display a profound interest in language and script design. To create a sophisticated encoding system and be able to express himself in elegant and consistent fashion over the length (and timespan) of the document holds the greatest fascination. And thus also relevance to the Language for the World project.

But not only does the invention of a new script warrant admiration, it seems by virtue of the difficulties in decipherment that the language it expresses may not be a natural language of phonemes or syllables, but a synthetic language. There have been several significant attempts at creating an artificial or universal language expressing ideas rather than sounds[3], and this might rank as an early example. Typically, “words” are used as descriptions or pointers of location within a comprehensive taxonomy of concepts. Such a system, however, would be difficult to use in everyday fashion, and would certainly revert to shortcuts or arbitrary assignments. And it would make translation (or review at a later date) near impossible without a key.

Another option is that the synthesis does not involve taxonomic lookup, but grammatical or morphological construction. One might, for example, look at Latin, Greek or Arabic (with historically well-analyzed grammars) and invent a simplified version of such grammars. Or perhaps apply their paradigms to one’s own (very different) native language, to create a personal hybrid. This would be much easier to maintain and use.

A further option is a combination of these approaches, in the manner of Chinese or Sumerian: semantic units with grammatical modifiers. An important possible clue to this option is the way many words appear together which are spelt the same but for one or two chars. This will be seen whenever the (written) vocabulary is based on the same semantic root. For example, a sentence in Chinese about metalwork will have an inordinate number of characters containing the radical (sub-character) metal.

In any case, it would have had to make sense to the author; and perhaps as a welcome side effect—enhancing privacy—it would make little sense to casual observers. And all these things highlight the author’s interest in language/linguistics and the expression of uncommon knowledge.

Secrets worth unlocking?

Even though the text is certainly not a jumble of chars with slim connection to human language, it remains to be seen whether the content is something we might actually understand, or want to read. It may be the ravings of an alchemist high on quicksilver fumes, or a maverick thinker locked in a world of his own.

One possible clue to its having some value to seekers of esoteric knowledge is the fact that some pages are missing. Some have been cut out, some may have fallen out. If these contained keys to the translation and/or epitomes of the content, perhaps a “secret society” is keeping the useful knowledge to themselves. And certainly, alchemists/magicians/thinkers of the calibre of Athanasius Kircher and John Dee may have looked over the manuscript; they and/or their many alchemical peers in Renaissance Europe (who were closer to the source) may have had a very good idea about the value of the document, and recognized not only the gems it contained, but also various shared methods for decoding it.

Mystic philosophy

If the contents are rather the personal musings of a radical thinker, or a collection of exotic ideas gathered from foreign sources, there is still the possibility it holds great interest philosophically. For example, and judging merely by the naive and abstract illustrations, the author is likely not expressing details of natural phenomena, but concepts based on them.

At casual first glance, and assuming deep philosophical interest either intrinsic to the material itself or held by an ever-hopeful modern observer, the following sets of illustrations may be supplementing certain esoteric topics:

  • plants:
    • vegetational aspects of human consciousness
    • plant spirits, fairies, devas
    • entheogenic qualities of plants
    • deep understanding of the plant realm (eg. Amazonian)
  • internal organs & female figures in containers:
    • Daoist internal alchemy, from China
    • “female” energies or points of pleasure or desire within the body (eg. Indian or Tibetan yoga/tantra)
    • Sufic meditational states, from various Islamic locations (naked females representing heavenly houris)
    • shamanic landscapes & journeys, from traditions of North Asia or the Americas
  • stars & astronomical or astrological arrangements:
    • “heavenly” meditational states
    • various esoteric aspects of astrology
    • a new understanding of the solar system and/or astronomical cycles
    • a new understanding of general cosmology (cf. Giordano Bruno, 16th century scientist/mystic)[4]

Assuming the author was sober and intelligent, and the ideas he came across (in foreign lands or his own mind) were this radical, and thus potentially dangerous in the intellectual environment of, say, Christian or Muslim Eurasia, having to encode them in the meantime makes sense.

But even if the content is interesting & relevant to us today, it may still remain cryptic for us by virtue of the use of abbreviations and jargon and archaic references, even after the text has been translated into a linguistically feasible form – just as alchemical texts remain for non-alchemists. In any case, it remains a fascinating case of linguistic invention, with enough curiosity value to invite at least some interpretation.

Word Structure

Frequencies of appearance

In a simple preliminary breakdown of char frequency, getting ready for “code-breaking” efforts, we find (ignoring chars with fewer than 100 occurrences):
All chars frequency%
Initial char frequency
Final chars
Not initial: e i n m h
Not final: q e i a k t p f c h
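
A breakdown of this kind can be reproduced from any word list. The sketch below is only illustrative: the function names are invented here, each Latin letter of the transliteration is counted as one char (so combos such as ch are not treated as units), and a non-empty word list is assumed.

```python
from collections import Counter

def char_percentages(words, min_count=100):
    """Percent frequency of each char, ignoring chars seen fewer than min_count times."""
    counts = Counter("".join(words))
    total = sum(counts.values())
    # most_common() yields chars from most to least frequent.
    return {c: 100.0 * n / total for c, n in counts.most_common() if n >= min_count}

def coverage_percent(words, min_count=100):
    """Share of all char occurrences covered by the chars that pass the threshold."""
    counts = Counter("".join(words))
    total = sum(counts.values())
    kept = sum(n for n in counts.values() if n >= min_count)
    return 100.0 * kept / total
```

The second function gives the kind of coverage figure quoted for the common chars, while the first gives the per-char percentages.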

One way to categorize the chars is by tendency of position within a word:

Initial and not final: k t p f c a
Initial and final: o d y s l r
Neither initial nor final: e i h
Strictly initial: q
Strictly final: n m

Only some chars appear in multiples, namely ii iii ee eee and rarely iiii eeee oo ss ll dd.
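
The positional tendencies and the multiples can both be tallied mechanically. This is a rough Python sketch under stated assumptions: the names are hypothetical, each Latin letter counts as one char, and a one-char word counts as both initial and final.

```python
from collections import Counter
import re

def position_counts(words):
    """Count how often each char appears initially, finally, and in between."""
    initial, final, medial = Counter(), Counter(), Counter()
    for w in words:
        if not w:
            continue
        initial[w[0]] += 1
        final[w[-1]] += 1
        # Everything strictly between the first and last char.
        medial.update(w[1:-1])
    return initial, final, medial

def repeated_runs(words):
    """Count runs of one char repeated two or more times, such as 'ii' or 'eee'."""
    runs = Counter()
    for w in words:
        for match in re.finditer(r"(.)\1+", w):
            runs[match.group(0)] += 1
    return runs
```

A char would then be called, say, "strictly final" when its initial and medial counts are negligible next to its final count; where to draw that line is a judgment call.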

Orthographic families

It is interesting to look at the chars themselves, and how they may have come to be chosen and used. The simplest shapes are i e o, and several chars can be seen as relations of them. These can also be arranged by direction of initial pen-stroke and addition of tails etc:

Columns: plain | tail up | tail looped | tail down | with crossbar

I, downward: i n j
I, upward: r m l I
E, downward: e d o y a h
E, upward: c s d q g c S
O: o d y a qo O
verticals: q p f t k q P F T K
other: v

Some chars can fit more than one spot in the table, and those options are dependent on movement of the pen. It is difficult to know whether a char is conceived the same way it is constructed.

Some key points:

  1. a looks like ei ; this might increase the multiples count of both e and i
  2. final chars are always tailed (except for o if that is not a looped e)
  3. the crossbar exemplified by ch may be a place of insertion for <nothing> or p f t k or the tail which makes sh (sometimes transcribed as ')
  4. the crossbar may be separated out, which transforms ch into e--e ; this would make e as much or more prone to multiples as i ; indeed sometimes c is drawn with a downward stroke and the crossbar added after
  5. y often looks like e with a long tail which might be another variant of downward i, and so may be a variant of a
  6. qo may be considered a single char, since q rarely appears with other chars or solo

Calligraphic additions

Some observers have suggested that some of the base (18 common) chars are no more than calligraphic variations of each other, and hold no separate value in the system. It is certainly possible that some of the other, rarer symbols have additions with only decorative value. But it is unlikely that the person who has carefully constructed an entire writing system—and probably its language as well—would then wantonly abandon the complex & orderly invention, without adding meaning.

Orthographic references

The similarity of many chars to Latin chars (and hence the choices that came to be made for the EVA system) would have been apparent also to the author, if he was based in the European Renaissance or Medieval writing tradition. This may have aided his own translation into the system; and we may have some chance of translating back if we can imagine his thought process in reverse.

Probably the most inviting reference is seen in the multiple use of i which reminds us of the Roman numeral system. Indeed the existence of multiples of i and e and sometimes o strongly suggests counting or labeling with indices. Counts or index-numbers may have been included:

  • to number a syllabic or phonemic instance (which would hide any simple alphabetic structure to the casual observer)
  • to number a part of speech, as with conjugations or declensions in Latin
  • to refer to a table/database of words or word-roots

Another small clue may lie in the use of the o and e shapes as found in musical notation of the Medieval period:

Rhythmic mode:
o: circle; perfect, triplet, 3s
e: semicircle; imperfect, duplet, 2s

This might then leave i to represent (quite naturally) a level of singletons. It is important to note that i and e are the only chars to appear (overwhelmingly) in neither initial nor final position.

Another point is that the set of chars k t p f might play a semantic role of flagging, raising or accenting, by virtue of their height above the median channel. In a similar way, final-position chars have tails: a common feature of writing where a flourish is the physical rounding-off of a thought.

It has been suggested the gaps reminiscent of word breaks actually do not delimit words, but are arbitrary (to intentionally confuse), or based on space belonging to what looks like a final char (in the manner of Arabic unlinked letters /d/, /r/, /w/). Almost universally, gaps delimit either words or phrases, but could feasibly delimit syllables as well. The VMS may either be very efficient and abstracted, forming compact phrasal units, or full of redundancy in simpler units like phonemes and syllables. If the gaps were arbitrary or atypical, it would imply the author had a clear method of discerning the proper caesurae for himself, likely quite obvious to him from the series of chars and its overriding structure.

Word Frequency

It can appear that the Voynich MS has too much variety and too little repetition to be a viable text. But look at the top 7 most frequent words, and compare their percent frequency with that of English (Brown Corpus):
Those words of English often become affixes in other languages, so their frequency peaks do not make Voynichese seem too random at all. Consider also the vocabulary density of the MS; dividing total words by number of unique words = 36300 / 6992 = 5.2, which would be regarded as “above average” to “dense” in terms of richness, and fairly typical of a philosophical treatise, for example. This brings us to what the word units might represent: phonemic, syllabic, root+affix, or semantic indices.
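
The density figure is simple arithmetic, and the same calculation can be run on any word list. In this minimal sketch the function names are invented for illustration and a non-empty word list is assumed; only the two totals come from the text above.

```python
from collections import Counter

def vocabulary_density(words):
    """Total word count divided by the number of distinct words."""
    return len(words) / len(set(words))

def top_words_percent(words, n=7):
    """The n most frequent words, each with its share of all words in percent."""
    counts = Counter(words)
    total = len(words)
    return [(w, 100.0 * c / total) for w, c in counts.most_common(n)]

# Figures quoted in the text: 36300 words in total, 6992 of them distinct.
print(round(36300 / 6992, 1))  # 5.2
```

Applying top_words_percent to an English corpus and to a transliteration side by side gives the kind of comparison described above.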

Notes on Morphology

Generic words

Each word of the Voynich MS had to be written down following some system, ideally avoiding tedious lookup and encryption, and facilitating review by the author. It seems unlikely then that its words are units matching some idea-hoard one-to-one. More useful (in terms of language use and re-use) is a system which incorporates a grammar and/or gives words additional markers to extend more general roots semantically.

The tendency-of-position table above gives some guidelines as to the generic word form, and perhaps a first approximation of:

q o d y s l r
k t p
f c a
But not surprisingly, there emerge many rules about which chars may precede or follow which other chars. See the components table for a full summary of commonest forms/pathways. Among the clearest forms are suffixes surrounding i and following a:
+ d k t
e r s
l o
 a  i
. r.
where . represents a word break and + possible further pathways. It can be seen the chars are all written with an I-family stroke. If words have more than one part, each part within the word would have a beginning and ending, or some divider or pivot between the parts. It seems for example that only I-chars can follow a which may in turn be a head for components like this.
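
The claim that only I-chars follow a can be checked against a word list. The sketch below is an assumption-laden illustration: the names are hypothetical, and the I-family set is read off the I rows of the orthographic families table (i n j r m l), which may not match the intended grouping exactly.

```python
from collections import Counter

# Assumed membership of the I family, taken from the I rows of the families table.
I_FAMILY = set("injrml")

def followers(words, char):
    """Count which chars come immediately after `char`; '.' stands for a word break."""
    counts = Counter()
    for w in words:
        for i, c in enumerate(w):
            if c == char:
                counts[w[i + 1] if i + 1 < len(w) else "."] += 1
    return counts

def share_in_family(counts, family=I_FAMILY):
    """Fraction of followers (ignoring word breaks) that belong to `family`."""
    inside = sum(n for c, n in counts.items() if c in family)
    total = sum(n for c, n in counts.items() if c != ".")
    return inside / total if total else 0.0
```

A share close to 1.0 for followers(words, "a") would support the observation; the same two functions work for any other char and family.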

Minimum words

The only 1- and 2-char words appearing more than 10 times are listed below (the most frequent word overall is daiin, appearing 892 times, and 66% of all distinct words appear only once):

s y r o l d m k
ol ar or dy al am sy
os qo ky dl om sh ty
lo do ry ly ot lr ls
These would typically be either grammatical particles which commonly resist semantic breakdown, such as “and” or “may” or “even though” etc, or the smallest possible semantic constructions, showing us for example a bare root or null affix or small root + small affix. See the shortest words table for a fairly complete list of 2 and 3 char words.
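
A list of short frequent words, and the share of words seen only once, can be pulled from a word list as follows. This is a sketch only: the function names are invented, and reading the once-only percentage as a share of distinct words (rather than of all word occurrences) is an assumption.

```python
from collections import Counter

def short_frequent_words(words, max_len=2, more_than=10):
    """Words of at most max_len chars appearing more than `more_than` times."""
    counts = Counter(words)
    picked = [w for w, c in counts.items() if len(w) <= max_len and c > more_than]
    # Most frequent first; ties in alphabetical order.
    return sorted(picked, key=lambda w: (-counts[w], w))

def once_only_share(words):
    """Fraction of distinct words that occur exactly once."""
    counts = Counter(words)
    once = sum(1 for c in counts.values() if c == 1)
    return once / len(counts)
```

Raising max_len to 3 gives the raw material for a shortest-words table of the kind mentioned above.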

[--work in progress--]


  1. The author is likely male, since most scholars, travelers, monks, alchemists, mystics & writers were, at that time. Famous exceptions to this rule were Hildegard of Bingen (12th century Christian mystic) and Rabi‘a of Basra (8th century Islamic mystic).
  2. For modern examples see Omniglot’s collection of alternative and newly constructed scripts.
  3. A significant example is that of George Dalgarno (17th century). See also Umberto Eco, The Search for the Perfect Language, 1995.
  4. These last three points are of course relevant to Sky Knowledge.


All material on this page © Ian James, unless otherwise stated.
Last modified Feb. 22, 2009