Understanding the Second-Order Entropies of Voynich Text

by Dennis J. Stallings

May 11, 1998


Abstract

The anomalous second-order entropies of Voynich text are among its most puzzling features. h1-h2, the difference between the first- and second-order conditional entropies, equals H1-h2, the difference between the first-order absolute entropy and the second-order conditional entropy. h1-h2 (or H1-h2) is a theoretically significant number: it denotes the average information that the first character of a digraph carries about the second. Therefore it was chosen as a simple measure of what is being sought, although the whole entropy profile of each text sample was also considered.
Tests show that Voynich text does not have its low h2 measures solely because of a repetitious underlying text, that is, one that often repeats the same words and phrases. Tests also show that the low h2 measures are probably not due to an underlying low-entropy natural language. A verbose cipher, one which substitutes several ciphertext characters for one plaintext character, can produce the entropy profile of Voynich text.



Table of Contents


Introduction

William Ralph Bennett first applied the entropy concept to the study of the Voynich Manuscript in his Scientific and Engineering Problem Solving with the Computer (Englewood Cliffs: Prentice-Hall, 1976). His book has introduced many people to the VMs.
The repetitive nature of VMs text is obvious even on casual examination. Entropy is one possible numerical measure of a text's repetitiousness: the more repetitious the text, the lower its second-order entropy (the information carried in letter pairs). Bennett noted that only some Polynesian languages have second-order entropies as low as VMs text; typical ciphers do not have low second-order entropies either.
This paper examines other possible explanations for the low second-order entropy of Voynich texts: a verbose cipher or a repetitious underlying text. It also examines the low-entropy natural languages Hawaiian and Japanese for further insight into the low-entropy-language hypothesis.

Measures of Relative Second-Order Entropy

Jacques Guy's MONKEY program was used to calculate second-order entropies. (Note: the bug-free, "sensible" MONKEY on the EVMT Project Home Page was used; the author believes that the version of MONKEY on Garbo as of this writing has bugs.) Note that MONKEY in its present form only takes the first 32,000 characters in a file. Some long texts were divided up into portions so that MONKEY could analyze them separately.
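The splitting step is mechanical. A minimal sketch in Python (an illustration only, not MONKEY itself; the function name is mine):

```python
def split_for_monkey(text, limit=32000):
    """MONKEY reads only the first `limit` characters of a file,
    so longer texts are cut into consecutive portions that it
    can analyze separately."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]
```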

The conditional entropies were used, as is customary on the Voynich E-mail list. Say that H1 is the absolute first-order entropy and H2 is the absolute second-order entropy. Then h1 and h2 are the first- and second-order conditional entropies. h2 = H2-H1, since it is conditional on more than one character. h1 = H1, since it depends on only single characters; thus h1 is really not conditional.
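These definitions can be restated as a short calculation. The sketch below is an assumption about how such measures are computed, not MONKEY's actual code: it estimates the absolute entropies H1 and H2 from character and digraph frequencies and derives h0, h1, and h2 as defined above.

```python
from collections import Counter
from math import log2

def entropy_profile(text):
    """Estimate h0, h1, h2, and h1-h2 (in bits) for a text sample,
    treating every character, including the space, as a symbol."""
    singles = Counter(text)                    # character frequencies
    digraphs = Counter(zip(text, text[1:]))    # adjacent-pair frequencies

    def H(counts):                             # Shannon entropy of a table
        n = sum(counts.values())
        return -sum(c / n * log2(c / n) for c in counts.values())

    H1, H2 = H(singles), H(digraphs)           # absolute entropies
    h0 = log2(len(singles))                    # log2 of alphabet size
    h1 = H1                                    # first order: h1 = H1
    h2 = H2 - H1                               # conditional second order
    return h0, h1, h2, h1 - h2
```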
The following measures were considered:

h0: zero-order entropy (log2 of the number of different characters)
h1: first-order conditional (= absolute) entropy
h2: second-order conditional entropy
h1-h2: the difference between the first- and second-order conditional entropies, which (since h1 = H1) equals H1-h2, the difference between the first-order absolute entropy and the second-order conditional entropy


As will be seen, there is a need here to compare systems with very different numbers of characters, to scale the statistics somehow to the size of the character set. h1-h2 or H1-h2 is a theoretically significant number; it denotes the average information carried by the first character in a digraph about the second one. It is perhaps the best single, simple measure of what is being sought.
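The claim that h1-h2 measures the information the first character of a digraph carries about the second can be checked directly: since h1-h2 = H1 - (H2-H1) = 2*H1 - H2, it is (up to edge effects) exactly the mutual information between the two positions of a digraph. A small self-contained demonstration; the sample string is arbitrary, chosen to start and end with the same letter so that the two positions have identical marginal distributions:

```python
from collections import Counter
from math import log2

def H(counts):
    """Shannon entropy (bits) of a frequency table."""
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())

text = "banana band banana b"          # first char == last char
pairs = Counter(zip(text, text[1:]))   # digraph distribution
firsts = Counter(text[:-1])            # first position of each digraph
seconds = Counter(text[1:])            # second position of each digraph

mutual_info = H(firsts) + H(seconds) - H(pairs)   # I(X;Y)
h1_minus_h2 = 2 * H(firsts) - H(pairs)            # H1 - (H2 - H1)

assert abs(mutual_info - h1_minus_h2) < 1e-9
```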
The % of the maximum absolute second-order entropy might also have been used: one could calculate the observed H2 as a percentage of the maximum H2 that each alphabet could deliver. For digraphs over an alphabet of m characters, H2(max) is:

log2(m^2)

and the %H2(max) is:

100 * (H2 / log2(m^2))
However, the H2(max) depends tremendously on m, the size of the character set chosen. For Voynich text, Currier has 36 characters and Basic Frogguy has 23 characters. Characters that are hardly ever used have little effect on h1 and h2, but could make a tremendous difference in H2(max). Therefore, this measure was not used.
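That sensitivity to m is easy to quantify with a small helper (the function name is mine, and the H2 value fed to it below is an arbitrary illustration, not a measured figure):

```python
from math import log2

def pct_h2_max(H2, m):
    """Observed absolute H2 as a percentage of the maximum
    log2(m^2) attainable over an m-character alphabet."""
    return 100 * H2 / log2(m ** 2)

# The same observed H2 scores very differently against the
# 23-character Basic Frogguy and 36-character Currier alphabets:
frogguy_pct = pct_h2_max(6.0, 23)   # larger percentage
currier_pct = pct_h2_max(6.0, 36)   # smaller percentage
```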
To start the discussion, here are some data from the English King James Bible:

Table 1: English King James Bible - 1 Kings

Passage Beginning at   # ch.   File Size   h0      h1      h2      h1-h2
1:1                    27      32000       4.755   4.022   3.068   0.953
8:19                   27      32000       4.755   4.028   3.090   0.939
15:27                  27      32000       4.755   3.998   3.092   0.906
Average of three       27      96000       4.755   4.016   3.083   0.933

The h1-h2 range for different portions of the same text is 0.906-0.953.
And here are data on the corresponding portions of the Latin Vulgate Bible:

Table 2: Latin Vulgate Bible - 1 Kings

Passage Beginning at   # ch.   File Size   h0      h1      h2      h1-h2
1:1                    24      32000       4.585   4.002   3.309   0.692
8:19                   24      32000       4.585   3.994   3.287   0.707
15:27                  24      32000       4.585   4.005   3.304   0.700
Average of three       24      96000       4.585   4.000   3.300   0.700

The average h1-h2 is 0.700, compared with 0.933 for the English text. This is presumably because English uses more combinations of two or more letters to represent single phonemes than Latin does. The range of h1-h2 for the Latin text, 0.692-0.707, is narrower than for the English text.

The next table gives the h1-h2 statistic for assorted files in various languages and notations, and illustrates how that statistic can reveal unexpected information. For instance, in phonemic notation Hawaiian and Japanese have low h2 values that approach those of Voynich text; their h1-h2 values, however, are far lower than those of Voynich text.

Table 3: h1-h2 Statistics for Selected Texts

File                                              # ch.   File Size   h0      h1      h2      h1-h2
Latin - Vulgate Bible, 1 Kings, first 32K         24      32000       4.585   4.002   3.309   0.692
Hawaiian (Bennett, limited phonemic)              13      15000       3.700   3.200   2.454   0.746
Hawaiian newspaper (full phonemic)                19      13473       4.248   3.575   2.650   0.925
English - King James Bible - Genesis, first 32K   27      32000       4.755   3.969   3.020   0.949
Japanese Tale of Genji - Section 1 (romaji)       22      32000       4.459   3.763   2.677   1.086
Japanese Tale of Genji - Section 1 (kana)         71      20622       6.150   4.764   3.393   1.370
Voynich Herbal-B (Currier)                        34      13858       5.087   3.796   2.267   1.529
Voynich Herbal-B (EVA)                            21      16061       4.392   3.859   2.081   1.778

Entropies of Voynich Texts

Here are entropy results for Voynich texts: a sample of Herbal-A and a sample of Herbal-B. The Herbal-A sample's h1-h2 ranges from 1.479 to 1.945, depending on which transcription alphabet is used; the Herbal-B sample's ranges from 1.529 to 1.897. All of these are far greater than the 0.93 for English and the 0.70 for Latin.
The choice of transcription alphabet thus makes an enormous difference: from Currier to Frogguy, h1-h2 ranges from about 1.5 to 1.9. The direction is what one would expect. Currier is the most synthetic alphabet, while Frogguy is the most analytical, decomposing single Currier characters into several Frogguy characters; thus Currier Q = Frogguy cqpt.

Table 4: Voynich Texts

Type of Voynich Text   Transcription Alphabet   # ch.   File Size   h0      h1      h2      h1-h2
Herbal-A               Currier                  33      9804        5.044   3.792   2.313   1.479
Herbal-A               FSG                      24      10074       4.585   3.801   2.286   1.515
Herbal-A               EVA                      21      12218       4.392   3.802   1.990   1.812
Herbal-A               Frogguy                  21      13479       4.392   3.826   1.882   1.945
Herbal-B               Currier                  34      13858       5.087   3.796   2.267   1.529
Herbal-B               FSG                      24      14203       4.585   3.804   2.244   1.560
Herbal-B               EVA                      21      16061       4.392   3.859   2.081   1.778
Herbal-B               Frogguy                  21      17909       4.392   3.846   1.949   1.897

The samples of Voynich text are relatively small. The following statistics for samples of a single known Latin text give some idea of how much difference sample size might make.

Table 5: Texts from Latin Vulgate Bible, 1 Kings, for Study of the Effect of Sample Size on Entropy Data (passages all begin at 1:1)

Passage Ending at   # ch.   File Size   h0      h1      h2      h1-h2
2:18                23      8929        4.524   3.994   3.263   0.731
4:21                24      18623       4.585   3.995   3.298   0.697
7:17                24      29647       4.585   4.003   3.309   0.694

It is doubtful whether h1-h2 or any other single measure can tell us all we want. However, the representation system is probably the heart of the issue. The following discussion of verbose ciphers is a case in point.

Verbose Ciphers

A verbose cipher, one that substitutes several ciphertext characters for one plaintext character, can produce the entropy profile seen in Voynich text. One such system is Cat Latin C, which is applied to Latin plaintext. Vowels and consonants were added roughly in proportion to their occurrence in Latin; this keeps h1 roughly the same as for Latin and for Voynich text in FSG. The repeated digraphs are what reduce h2 to the desired level. If q is followed by u, the pair reads as in normal Latin; otherwise the q fits one of the consonant patterns, so the scheme is unambiguous. This scheme does produce VMs-like entropies!

This table shows the Cat Latin C verbose cipher:

Table 6: Cat Latin C

Plaintext   Ciphertext
a           a
b           bqbababa
c           c
d           dqdede
e           e
f           fqfififi
g           gqgogogo
h           h
i           i
j           jqjajaja
k           k
m           mqmememe
n           nqninini
o           o
p           pqpopopo
qu          qu
r           rqrarara
s           sqsesese
t           tqtititi
u           u
v           v
w           w
x           xqxoxoxo
y           y
z           zqzazaza
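Table 6 is mechanical enough to implement directly. A sketch of the encipherment (an illustration, not the program actually used for the statistics in this paper; the table lists no entry for l or for q outside qu, and in the sketch such characters simply pass through unchanged, which matches the Cat Latin sample quoted later):

```python
# Cat Latin C verbose cipher (Table 6): each plaintext letter maps to a
# fixed ciphertext group; 'qu' is kept literal and must be matched first.
CAT_LATIN_C = {
    'a': 'a', 'b': 'bqbababa', 'c': 'c', 'd': 'dqdede', 'e': 'e',
    'f': 'fqfififi', 'g': 'gqgogogo', 'h': 'h', 'i': 'i',
    'j': 'jqjajaja', 'k': 'k', 'm': 'mqmememe', 'n': 'nqninini',
    'o': 'o', 'p': 'pqpopopo', 'qu': 'qu', 'r': 'rqrarara',
    's': 'sqsesese', 't': 'tqtititi', 'u': 'u', 'v': 'v',
    'w': 'w', 'x': 'xqxoxoxo', 'y': 'y', 'z': 'zqzazaza',
}

def encipher(plaintext):
    out, i = [], 0
    while i < len(plaintext):
        if plaintext[i:i + 2] == 'qu':   # 'qu' stays as itself
            out.append('qu')
            i += 2
        else:                            # unlisted chars (spaces, l, ...)
            out.append(CAT_LATIN_C.get(plaintext[i], plaintext[i]))
            i += 1                       # pass through unchanged
    return ''.join(out)

# encipher('et rex') == 'etqtititi rqrararaexqxoxoxo', matching the
# start of the Cat Latin version of 1 Kings quoted below.
```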

For comparison here are VMs results in FSG, since the size of that character set is closest to Latin.

Table 7: Verbose Cipher Compared to Voynich Text

File                                 # ch.   File Size   h0      h1      h2      h1-h2
Voynich Herbal-A (FSG)               24      10074       4.585   3.801   2.286   1.515
Voynich Herbal-B (FSG)               24      14203       4.585   3.804   2.244   1.560
Latin Vulgate, 1 Kings, 1:1 - 2:11   23      8232        4.524   3.996   3.262   0.734
Above passage, Cat Latin C           23      28754       4.524   3.873   2.278   1.595

However, it's clear that this is not the same pattern as Voynich text. It might be best to look for patterns subjectively. Here are some text samples.
The start of the Voynich Herbal-A sample file (f29v, lines 1-9), in EVA:


kshol qoocph shor pshocph shepchy qoty dy shory
ykcholy qoty chy dy qokchol chor tchy qokchody cheor o
chor chol chy choiin
tshoiin cheor chor o chty qotol sheol shor daiin qoty
otol chol daiin chkaiin shoiin qotchey qotshey daiiin
daiin chkaiin
pchol oiir chol tsho daiin sho teo chy chtshy dair am
okain chan chain cthor dain yk chy daiin cthol
sot chear chl s choly dar
 

The beginning of a Hawaiian sample file, from a Hawaiian newspaper, to be discussed later:
kepakemapa mei puke kepakemapa mei mahalo 'ia ka 'Olelo hawai'i e nA mAka' na ho'Olanani kim ma ka lA o malaki ua noa ka pAka 'o kapi'olani no ke anaina na lAkou ke kuleana 'o ka mAlama 'ana ma ka 'Olelo 'ana aku i ka 'Olelo hawai'i ma laila nO i 'Akoakoa ai ka po'e haumAna ka po'e kumu ka po'e mAkua a me ka po'e hoa o kElA 'ano kEia 'ano o ka 'Olelo hawai'i a ma laila nO ho'i i launa ai ka po'e ma o ka 'Olelo hawai'i kapa 'ia kEia lA hoihoi 'o ka lA 'ohana
Finally, the beginning of the Latin Vulgate 1 Kings in Cat Latin C:
etqtititi rqrararaexqxoxoxo dqdedeavidqdede sqseseseenqnininiuerqrararaatqtititi habqbababaebqbababaatqtititique aetqtititiatqtititiisqsesese pqpopopolurqrararaimqmememeosqsesese dqdedeiesqsesese cumqmememeque opqpopopoerqrararairqrararaetqtititiurqrarara vesqsesesetqtititiibqbababausqsesese nqnininionqninini calefqfififiiebqbababaatqtititi dqdedeixqxoxoxoerqrararaunqnininitqtititi erqrararagqgogogoo ei sqseseseerqrararavi ...
Look at these samples and think about the kind of repetition involved in each case! The "Cat Latin C" verbose cipher is clearly not the same thing as Voynichese.
Here are the entropy values for these samples:

Table 8: Statistics on Text Samples

File                                              # ch.   File Size   h0      h1      h2      h1-h2
Voynich Herbal-A (EVA)                            21      12218       4.392   3.802   1.990   1.812
Hawaiian newspaper (full phonemic)                19      13473       4.248   3.575   2.650   0.925
Latin Vulgate, 1 Kings, 1:1 - 2:11, Cat Latin C   23      28754       4.524   3.873   2.278   1.595

The author's personal opinion is that the rigid internal structure of Voynich text accounts for the low h2 measures. The majority of Voynich "words" follow a paradigm; Robert Firth (Work Note #24) and Jorge Stolfi (Voynich Page) have both identified paradigms, and Captain Prescott Currier (Currier's Papers) identified several other kinds of internal structure in Voynich text.

Repetitive Texts

From time to time, some have suggested that the Voynich Manuscript is simply a very repetitious text. Here is a repetitious magical spell in Old High German:
 
         eiris sazun idisi             sazun her duoder
         suma hapt heptidun            suma heri lezidun
         suma clubodun                 umbi cuoniouuidi
         insprinc haptbandun           inuar uigandun

         phol ende uuodan              uuorun zi holza
         du uuart demo balderes uolon  sin uuoz birenkit
         thu biguol en sinthgunt       sunna era suister
         thu biguol en friia           uolla era suister
         thu biguol en uuodan          so he uuola conda
         sose benrenki                 sose bluotrenki
         sose lidirenki
         ben zi bena                   bluot zi bluoda
         lid zi geliden                sose gelimida sin

Merseburger Zaubersprüche (Magic Spells from Merseburg) in Old High German. Note: 'uu' = 'w'.
An experiment to test this idea is to take samples of known repetitious texts (food recipes, religious texts, catalogs) and compare their second-order entropies with those of known texts that should be less repetitious (prose fiction, essays).
Note that some long texts exceeded MONKEY's 32,000-character limit; in those cases MONKEY took only the first 32,000 characters, and some long texts were divided into separate portions that MONKEY could analyze.
Jacobean English. Ever since its publication, many commentators have noted how repetitious the Book of Mormon is. The Bible itself is, of course, somewhat repetitious. A (relatively) non-repetitious text in Jacobean English is the Essays of Sir Francis Bacon.
The Book of Mormon appears to be the most repetitious: h1-h2 for its excerpts ranges from 0.931 to 0.980. The King James Bible is next, at 0.904-0.983. The non-repetitious Essays of Francis Bacon range from 0.827 to 0.837. Taking averages, h1-h2 for the most repetitious text versus the least is 0.951 versus 0.831, a difference of 0.120.

Table 9: Jacobean English Texts of Varying Repetition

File                             # ch.   File Size   h0      h1      h2      h1-h2
Book of Mormon - 1 Nephi         27      32000       4.755   4.033   3.090   0.942
Book of Mormon - Alma            27      32000       4.755   4.041   3.109   0.931
Book of Mormon - Ether           27      32000       4.755   4.009   3.029   0.980
King James Bible - Genesis       27      32000       4.755   3.969   3.020   0.949
King James Bible - Joshua        27      32000       4.755   4.012   3.029   0.983
King James Bible - Acts          27      32000       4.755   4.041   3.137   0.904
Francis Bacon's Essays, Part 1   27      32000       4.755   4.048   3.220   0.827
Francis Bacon's Essays, Part 2   27      32000       4.755   4.042   3.214   0.828
Francis Bacon's Essays, Part 3   27      32000       4.755   4.066   3.229   0.837

Latin (Late Classical). Samples of the Vulgate Bible and Boethius' Consolatio Philosophiae were analyzed. There is little difference in the statistics between the Vulgate Bible and the presumably less repetitious Consolatio Philosophiae.

Table 10: Latin Texts of Varying Repetition

File                                               # ch.   File Size   h0      h1      h2      h1-h2
1 Kings, Vulgate, 1:1                              24      32000       4.585   4.002   3.309   0.692
1 Kings, Vulgate, 8:19                             24      32000       4.585   3.994   3.287   0.707
1 Kings, Vulgate, 15:27                            24      32000       4.585   4.005   3.304   0.700
Boethius - Consolatio Philosophiae - Books 3 & 4   25      32000       4.644   3.971   3.272   0.699

Modern English. The repetitive texts were food recipes (chicken and Cajun), a catalog of technical standards, and a Roman Catholic litany; the non-repetitious text was a short story, "The Blue Hotel" by Stephen Crane.
The non-repetitious short story has an h1-h2 of 0.826, while the repetitious Roman Catholic litany has an h1-h2 of 0.968; the difference is 0.968 - 0.826 = 0.142. The other texts mostly fall in between, although the presumably repetitious Cajun recipe has an h1-h2 of 0.827, almost identical to the short story's.

Table 11: Modern English Texts of Varying Repetition

File                                                             # ch.   File Size   h0      h1      h2      h1-h2
Modern English - Roman Catholic litany                           26      9492        4.700   4.071   3.103   0.968
Modern English - ISO 14000 catalog                               27      6696        4.755   4.076   3.137   0.939
Modern English - The Blue Hotel by Stephen Crane (short story)   27      32000       4.755   4.073   3.247   0.826
Modern English - Cajun recipe                                    27      27363       4.755   4.124   3.297   0.827
Modern English - Chicken recipe                                  27      18461       4.755   4.131   3.193   0.938

For comparison, here are data for Voynich texts in FSG, which has the character set closest in size to the ordinary Latin alphabet.

Table 12: Voynich Texts in FSG

Type of Voynich Text   Transcription Alphabet   # ch.   File Size   h0      h1      h2      h1-h2
Herbal-A               FSG                      24      10074       4.585   3.801   2.286   1.515
Herbal-B               FSG                      24      14203       4.585   3.804   2.244   1.560

When one compares the spread in h1-h2 due to repetition in English texts (0.968 - 0.826 = 0.142 for modern English; 0.951 - 0.831 = 0.120 for Jacobean English) with the h1-h2 values for Voynich text (1.515 and 1.560 in FSG), it becomes clear that a repetitious underlying format or subject matter could not turn a text in a normal European language into a Voynich text. Thus Voynich text clearly does not owe its low h2 measures solely to a repetitious underlying text, that is, one that often repeats the same words and phrases.

Schizophrenic Language

In an important paper discussing the Voynich Manuscript, Professor Sergio Toresella says that the VMs author had a psychiatric disturbance. In one of the works Toresella cites in this connection, Creativity by Silvano Arieti, Arieti discusses the distorted language of schizophrenics, though not other language phenomena.
At the Kooks Museum web site, in the Schizophrenic Wing, there is a sample of schizophrenic language: a transcript of flyers by Francis E. Dec, Esquire, containing two rants. Here is an excerpt from Rant #2:

"Computer God computerized brain thinking sealed robot operating arm surgery cabinet machine removal of most of the frontal command lobe of the brain, gradually, during lifetime and overnight in all insane asylums after Computer God kosher bosher one month probation period creating helpless, hopeless Computer God Frankenstein Earphone Radio parroting puppet brainless slaves, resulting in millions of hopeless helpless homeless derelicts in all Jerusalem, U.S.A. cities and Soviet slave work camps. Not only the hangman rope deadly gangster parroting puppet scum-on-top know this top medical secret, even worse, deadly gangster Jew disease from deaf Ronnie Reagan to U.S.S.R. Gorbachev know this oy vay Computer God Containment Policy top secret. Eventual brain lobotomization of the entire world population for the Worldwide Deadly Gangster Communist Computer God overall plan, an ideal worldwide population of light-skinned, low hopeless and helpless Jew-mulattos, the communist black wave of the future."
The samples and discussion of schizophrenic talk in Arieti resemble Francis Dec's rants in their repeated but disconnected ideas, alliteration, and the like.
MONKEY was run on the two rants and the results were compared with examples of normal English text:

Table 13: Schizophrenic Rant Compared to Other English Texts

File                                                             # ch.   File Size   h0      h1      h2      h1-h2
Schizophrenic rant                                               27      12967       4.755   4.182   3.428   0.755
King James Bible - Genesis                                       27      32000       4.755   3.969   3.020   0.949
Francis Bacon's Essays, Part 1                                   27      32000       4.755   4.048   3.220   0.827
Modern English - Roman Catholic litany                           26      9492        4.700   4.071   3.103   0.968
Modern English - The Blue Hotel by Stephen Crane (short story)   27      32000       4.755   4.073   3.247   0.826

The second-order entropy of the schizophrenic rants is definitely higher, and their h1-h2 lower, than those of any of the ordinary texts. As with the repetitive texts, the nature of the underlying text by itself would not explain the puzzling nature of VMs text.

Low-Entropy Natural Languages

One may write Japanese in Latin characters (romaji) or in syllabic scripts (hiragana and katakana, collectively the kana). In romaji, Japanese is a low-entropy language because of a relatively small phonemic inventory and severe phonotactic constraints: a Japanese syllable may begin with zero or one consonant (counting ts, ry, and ky as single consonants), has exactly one vowel, and may end only in nothing or -n (although the following syllable's consonant may be doubled). (Japanese also distinguishes at least some long and short vowels, which complicates this a little.)
However, the very fact of these severe phonotactic constraints means that only a limited number of syllables is possible in Japanese, which makes a syllabic script such as kana feasible. One would expect Japanese in kana to have a higher relative h2 (a lower h1-h2) than Japanese in romaji.
Hawaiian has even more severe phonotactic constraints, so one ought to be able to write Hawaiian in a syllabic script as well. In Hawaiian a syllable may begin with zero or one consonant, has exactly one vowel, and may only end in nothing! Hawaiian also has a much more limited phonemic inventory than Japanese. Hawaiian is especially significant because Bennett compared Voynichese to it and noted that the two had similar second-order entropies; Bennett said that some Polynesian languages are the only natural languages with second-order entropies as low as Voynichese's.
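The Hawaiian constraint is strict enough to state as a pattern. A sketch, using the full phonemic notation adopted later in this paper (capitals for long vowels, ' for the glottal stop); the regular expression is my own formulation of the (C)V rule described above:

```python
import re

# Every Hawaiian word is a string of (C)V syllables: an optional
# consonant from h k l m n p w ' followed by one short or long vowel.
SYLLABLE_WORD = re.compile(r"^(?:[hklmnpw']?[aeiouAEIOU])+$")

def is_hawaiian_shaped(word):
    return bool(SYLLABLE_WORD.match(word))
```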
Therefore, in order to gain insight on these issues, Hawaiian and Japanese are compared in syllabic as well as phonemic notation.

Japanese

The classic Japanese novel Tale of Genji is written almost entirely in kana. Gabriel Landini kindly adapted this text both into romaji and into a kana notation that MONKEY could analyze.

Table 14: Entropies of Japanese in Romaji and Kana

File                        Orthography   # ch.   File Size   h0      h1      h2      h1-h2
Tale of Genji - Section 1   Romaji        22      32000       4.459   3.763   2.677   1.086
Tale of Genji - Section 2   Romaji        20      31505       4.322   3.751   2.627   1.124
Tale of Genji - Section 3   Romaji        20      29474       4.322   3.749   2.639   1.110
Tale of Genji - Section 4   Romaji        20      32000       4.322   3.750   2.641   1.109
Tale of Genji - Section 5   Romaji        20      27064       4.322   3.744   2.630   1.114
Tale of Genji - Overall     Romaji        22      152043      4.459   3.751   2.643   1.108
Tale of Genji - Section 1   Kana          71      20622       6.150   4.764   3.393   1.370
Tale of Genji - Section 2   Kana          71      20622       6.150   4.764   3.393   1.370
Tale of Genji - Section 3   Kana          70      18574       6.129   4.709   3.410   1.298
Tale of Genji - Section 4   Kana          70      20386       6.129   4.716   3.464   1.252
Tale of Genji - Section 5   Kana          70      17096       6.129   4.698   3.362   1.337
Tale of Genji - Overall     Kana          71      97300       6.150   4.730   3.404   1.326

As one would expect, the absolute h0, h1, and h2 numbers for kana are much higher than those for romaji. However, the differences for h1-h2 are consistently higher for kana, which one would not expect.

Hawaiian

Bennett did his Hawaiian study with a limited orthography that did not mark vowel length or the glottal stop. Therefore, statistics were run on Hawaiian both in limited phonemic and syllabic spellings (long and short vowels not distinguished, glottal stop not indicated) and in full phonemic and syllabic notation.
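The limited orthography can be derived mechanically from the full one by deleting the glottal stop and merging long vowels with short; a minimal sketch (the function name is mine, and the sample phrase comes from the newspaper text quoted later in this paper):

```python
def to_limited_phonemic(text):
    """Collapse full phonemic Hawaiian into the limited (Bennett-style)
    orthography: delete the glottal stop (') and merge the long vowels
    (written here as capitals) with their short counterparts."""
    return text.replace("'", "").translate(str.maketrans("AEIOU", "aeiou"))
```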


Hawaiian has the following phonemes:

Consonants: h k l m n p w ' (glottal stop)
Vowels: a e i o u A E I O U (capitals indicate long vowels)
 
Bennett used a "lossy" Hawaiian orthography that distinguished neither the long vowels nor the glottal stop (call this Hawaiian limited phonemic). He also had his own Voynich transcription alphabet, and he compared only the absolute h2 values, not relative measures such as h1-h2. This illustrates as well as anything the problems of comparison here.
Here is a sample, in Bennett's notation, of the Hawaiian newspaper text used for statistics in this paper:
ma ka la o malaki ua noa ka paka o kapiolani no ke anaina na lakou ke kuleana o ka malama ana ma ka olelo ana aku i ka olelo hawaii ma laila no i Akoakoa ai ka poe haumana ka

And here is the same text in full phonemic notation:
ma ka lA o malaki ua noa ka pAka 'o kapi'olani no ke anaina na lAkou ke kuleana 'o ka mAlama 'ana ma ka 'Olelo 'ana aku i ka 'Olelo hawai'i ma laila nO i 'Akoakoa ai ka po'e haumAna ka

Here are the entropy values:

Table 15: Entropies of Hawaiian Texts in Different Orthographies

File                 Orthography        # ch.   File Size   h0      h1      h2      h1-h2
Hawaiian (Bennett)   limited phonemic   13      15000       3.700   3.200   2.454   0.746
Hawaiian newspaper   limited phonemic   13      13097       3.700   3.224   2.437   0.787
Hawaiian newspaper   limited syllabic   39      9533        5.285   3.816   2.929   0.887
Hawaiian newspaper   full phonemic      19      13473       4.248   3.575   2.650   0.925
Hawaiian newspaper   full syllabic      77      9160        6.267   4.361   3.162   1.200

And here are data for Bennett's and this paper's Voynich texts for comparison:

Table 16: Voynich Texts for Comparison with Hawaiian

Type of Voynich Text   Transcription Alphabet   # ch.   File Size   h0      h1      h2      h1-h2
Voynich (Bennett)      Bennett                  21      10000       4.392   3.660   2.220   1.440
Herbal-A               Currier                  33      9804        5.044   3.792   2.313   1.479
Herbal-A               FSG                      24      10074       4.585   3.801   2.286   1.515
Herbal-A               EVA                      21      12218       4.392   3.802   1.990   1.812
Herbal-A               Frogguy                  21      13479       4.392   3.826   1.882   1.945
Herbal-B               Currier                  34      13858       5.087   3.796   2.267   1.529
Herbal-B               FSG                      24      14203       4.585   3.804   2.244   1.560
Herbal-B               EVA                      21      16061       4.392   3.859   2.081   1.778
Herbal-B               Frogguy                  21      17909       4.392   3.846   1.949   1.897

Bennett compared his Voynich text in a 21-character transcription to Hawaiian in a 13-character orthography (including the space character). He got h2 values of 2.220 for Voynich text and 2.454 for his Hawaiian text. However, a sample of Hawaiian text in a full phonemic orthography, with 19 characters including spaces, has h2 of 2.650, even higher. A comparison of h1-h2 values shows a dramatic difference between Hawaiian and Japanese on one hand and Voynichese on the other. h1-h2 equals 1.8 for Voynichese in EVA. h1-h2 is 0.746 for Bennett's Hawaiian data, 0.925 for Hawaiian in full phonemic notation, and 1.1 for Japanese romaji. These figures are all very different from Voynichese.

Discussion of Phonemic versus Syllabic Notation

While perhaps not germane to the Voynich Manuscript problem, it is odd that h1-h2 increases from phonemic to syllabic notation, both for Japanese and Hawaiian. In syllabic notation, given the first character, the second character is more predictable than it is in phonemic notation. This is quite puzzling. How can we explain these results for Hawaiian and Japanese?

The Size of the Character Set

In going from phonemic to syllabic notation, the text becomes shorter and more information is packed into each character, but that is accomplished by using a larger character set: the syllabic notations use more than three times as many characters as the phonemic ones. The measure h1-h2 was chosen to minimize the effect of character-set size, but it surely does not remove that effect entirely.


The Effect of Word Divisions

Perhaps one loses predictability because the proportion of space characters is greater in syllabic notation than in phonemic. If that were the case, leaving out the spaces ought to decrease h1-h2 more for syllabic notation than for phonemic notation. MONKEY runs were made without the spaces to test this. However, the h1-h2 results for syllabic notation decrease less than those for phonemic notation do.

Table 17: The Effect of Word Divisions on Statistics for Japanese and Hawaiian

File                                 Orthography     Spaces Included   # ch.   File Size   h0      h1      h2      h1-h2
Japanese Tale of Genji - Section 1   Romaji          Yes               22      32000       4.459   3.763   2.677   1.086
Japanese Tale of Genji - Section 1   Romaji          No                21      26106       4.392   3.803   2.935   0.868
Japanese Tale of Genji - Section 1   Kana            Yes               71      20622       6.150   4.764   3.393   1.370
Japanese Tale of Genji - Section 1   Kana            No                70      14051       6.129   5.666   4.330   1.337
Hawaiian newspaper                   Full Phonemic   Yes               19      13473       4.248   3.575   2.650   0.925
Hawaiian newspaper                   Full Phonemic   No                18      10433       4.170   3.622   2.935   0.687
Hawaiian newspaper                   Full Syllabic   Yes               77      9160        6.267   4.361   3.162   1.200
Hawaiian newspaper                   Full Syllabic   No                76      6120        6.248   5.156   3.982   1.174

Redundancy

Gabriel Landini, who did graduate studies in Japan, notes that the redundancy of Japanese is only apparent: the language is actually rather ambiguous. In writing, the ambiguity is resolved with ideographs (kanji); in speech, it is resolved by context and by rigid structures (set phrases and expressions).
However, Jacques Guy (who holds a doctorate in Polynesian languages and was once fluent in Tahitian) notes that Tahitian, which is similar to Hawaiian, is no more ambiguous than English or French! So redundancy is not likely the explanation.

The Effect of Syllable Divisions

Could the (relatively) high h1-h2 values for syllabic Hawaiian and Japanese mean that combinations of two syllables (e.g., yama in Japanese, wiki in Hawaiian) are as repetitious and fixed as combinations of phonemes within syllables?
The phonemic versus syllabic problem is more complex than that. Take "yamamoto" in romaji and in kana: (ya)(ma)(mo)(to). When analyzing the second-order entropy in romaji, one looks at the distributions of the digraphs "ya", "am", "ma", "mo", "ot", and "to"; in kana they are "(ya)(ma)", "(ma)(mo)", and "(mo)(to)". For roughly half of the romaji digraphs, one deals with combinations of letters ("am", "ot") that are never represented in kana. So the second-order entropy of one type of text is not strictly comparable with that of the other. The second-order entropy of the romaji text is in principle close in meaning to the first-order entropy of the kana text, but only about half of the romaji digraphs correspond to kana characters.
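The mismatch can be made concrete by listing the digraphs each notation actually produces; a minimal sketch:

```python
def digraphs(seq):
    """All adjacent pairs in a sequence: a string of phonemes,
    or a list of syllables."""
    return [tuple(seq[i:i + 2]) for i in range(len(seq) - 1)]

romaji = "yamamoto"
kana = ["ya", "ma", "mo", "to"]    # syllabified form

romaji_pairs = digraphs(romaji)    # includes ('a','m') and ('o','t'),
                                   # pairs that straddle syllable breaks
kana_pairs = digraphs(kana)        # only the syllable-to-syllable pairs
```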
While the differences in statistics between syllabic and phonemic notation are interesting, they are not necessarily relevant to the Voynich Manuscript. They are chiefly interesting in raising questions about the use of the entropy concept.


Final Thoughts on Low-Entropy Natural Languages

Consider again the start of the Herbal-A sample file (f29v, lines 1-9), in EVA:
kshol qoocph shor pshocph shepchy qoty dy shory
ykcholy qoty chy dy qokchol chor tchy qokchody cheor o
chor chol chy choiin
tshoiin cheor chor o chty qotol sheol shor daiin qoty
otol chol daiin chkaiin shoiin qotchey qotshey daiiin
daiin chkaiin
pchol oiir chol tsho daiin sho teo chy chtshy dair am
okain chan chain cthor dain yk chy daiin cthol
sot chear chl s choly dar

 
And then the beginning of the Hawaiian newspaper sample file:

kepakemapa mei puke kepakemapa mei mahalo 'ia ka 'Olelo hawai'i e nA mAka' na ho'Olanani kim ma ka lA o malaki ua noa ka pAka 'o kapi'olani no ke anaina na lAkou ke kuleana 'o ka mAlama 'ana ma ka 'Olelo 'ana aku i ka 'Olelo hawai'i ma laila nO i 'Akoakoa ai ka po'e haumAna ka po'e kumu ka po'e mAkua a me ka po'e hoa o kElA 'ano kEia 'ano o ka 'Olelo hawai'i a ma laila nO ho'i i launa ai ka po'e ma o ka 'Olelo hawai'i kapa 'ia kEia lA hoihoi 'o ka lA 'ohana
One sees that the low h2's of Hawaiian and Japanese are due to their very strict consonant-vowel alternation. The EVA Voynich sample shows that the consonant-vowel alternation of Voynichese (as determined by the Sukhotin vowel-recognition algorithm) is not as strict.
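For reference, here is a common formulation of Sukhotin's vowel-recognition algorithm, sketched under its usual assumptions (the most "sociable" symbol is a vowel, and vowels and consonants tend to alternate); this is an illustration, not necessarily the exact variant that was run on the Voynich transcriptions:

```python
from collections import defaultdict

def sukhotin_vowels(words):
    """Classify symbols as vowels by Sukhotin's algorithm: repeatedly
    promote the symbol with the highest remaining adjacency sum to
    vowel status, then penalize its neighbors."""
    adj = defaultdict(int)
    letters = set()
    for w in words:
        letters.update(w)
        for a, b in zip(w, w[1:]):
            if a != b:                   # ignore doubled letters
                adj[a, b] += 1
                adj[b, a] += 1
    sums = {c: sum(adj[c, d] for d in letters) for c in letters}
    vowels = set()
    while True:
        rest = {c: s for c, s in sums.items() if c not in vowels}
        if not rest:
            break
        best = max(rest, key=rest.get)
        if rest[best] <= 0:              # no remaining symbol is vowel-like
            break
        vowels.add(best)
        for d in letters:                # neighbors of a new vowel now
            if d not in vowels:          # look more consonant-like
                sums[d] -= 2 * adj[best, d]
    return vowels
```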
Once again, h1-h2 equals 1.8 for Voynichese in EVA. h1-h2 is 0.746 for Bennett's Hawaiian data, 0.925 for Hawaiian in full phonemic notation, and 1.1 for Japanese romaji. These figures are all very different from Voynichese.
For these reasons, it seems unlikely that an underlying low-entropy natural language explains the low h2 measures of Voynich text.
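The h1-h2 statistic quoted above is straightforward to estimate from counts. The following is a minimal sketch, not the MONKEY program used for the actual measurements; it takes h2 = H(digraph) - H1, ignoring the small end effect of using the whole-text character distribution as the digraph's first-position marginal.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a frequency table."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def h1(text):
    """First-order entropy H1, in bits per character."""
    return entropy(Counter(text))

def h2(text):
    """Second-order conditional entropy h2 = H(digraph) - H1:
    the average uncertainty of a character given its predecessor."""
    return entropy(Counter(zip(text, text[1:]))) - h1(text)

# A deterministic alternation gives h2 near zero, since each
# character fully determines the next.
sample = "ab" * 100
print(round(h1(sample), 3), round(h2(sample), 3))
```

On such a strict alternation h1-h2 approaches h1 itself; real texts fall between that extreme and h1-h2 = 0 (characters independent of their predecessors).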


Suggestions for Further Work

The various h2 measures are only crude, partial measures of all the factors that interest us. Nevertheless, the entropy measure will continue to be useful. It would be helpful to have a program that could calculate the entropies of files larger than 32K and estimate higher-order entropies more accurately.


The author believes that the "paradigms" and other structural restrictions of Voynichese explain the low h2 measures. Further study of these structural constraints will be most useful.

Acknowledgments

Many of these ideas and data were previously discussed on the Voynich e-mail list. Special thanks go to Gabriel Landini and Rene Zandbergen for their assistance.


References for Electronic Texts

  1. Voynich Text
     Rene Zandbergen kindly provided samples of Herbal-B and Herbal-A from voynich.now.
     Herbal-B: 26r, 26v, 31r, 31v, 33r, 33v, 34r, 34v, 39r, 39v, 40r, 40v, 41r, 41v, 43r, 43v, 46r, 46v, 48r, 48v, 50r, 50v, 55r, 55v, 57r
     Selected Herbal-A: 28v, 29r, 29v, 30r, 30v, 32r, 32v, 35r, 35v, 36r, 36v, 37r, 37v, 38r, 38v, 42r, 42v, 44r, 44v, 45r, 45v, 47r, 47v, 49r, 49v

  2. Jacobean English
     Book of Mormon
     Bible, KJV
     Sir Francis Bacon, Essays

  3. Late Classical Latin
     Vulgate Latin Bible (Estragon or Gopher)
     Boethius: Consolatio Philosophiae: Book 3 & Book 4

  4. Modern English
     Catholic Litany
     ISO Standard Catalog
     "The Blue Hotel", by Stephen Crane
     Chicken Recipe
     Cajun Recipes, Part 1 and Part 2

  5. Japanese Text
     Gabriel Landini kindly prepared this. The text is from the first four parts of the Genji monogatari [Tale of Genji, a classic Japanese novel mostly written in hiragana]: 01 Kiritsubo, 02 Hahakigi, 03 Utsusemi, 04 Yugao.
     The "kana" output is not kana, of course, but an arbitrary substitution for kana so that MONKEY could be applied.

  6. Hawaiian
     The author prepared the Hawaiian texts. Hawaiian has the following phonemes:
     Consonants: h k l m n p w ' (glottal stop)
     Vowels: a e i o u A E I O U (capitals mark long vowels)
     However, the difference between long and short vowels is often not indicated, and the glottal stop is often not written. Both clearly need to be written, since even with them Hawaiian has a rather limited phonemic inventory!
     The Hawaiian text came from all the articles in the following issue of a Hawaiian newspaper:
     Na Maka o Kana
     Puke 5, Pepa 5
     15 Malaki, 1997
     The text was converted to the notation above. All numbers and all English, Japanese, and other foreign words were removed until the character set (the number of characters MONKEY showed) matched the Hawaiian notation. A syllabic script for Hawaiian, using characters that MONKEY recognizes, was also devised.

  7. Schizophrenic Language
     At the Kooks Museum, in the Schizophrenic Wing, there is a transcript of flyers by Francis E. Dec containing two schizophrenic rants:
     Francis E. Dec, Esquire, Transcripts of flyers

Printed References

  1. Arieti, Silvano. Creativity: The Magic Synthesis. New York: Basic Books, c1976. Library of Congress call number: BF408.A64

  2. Bennett, William Ralph. Scientific and Engineering Problem Solving with the Computer. Englewood Cliffs: Prentice-Hall, 1976. [Contains a chapter on the VMs.]

  3. D'Imperio, M. E. The Voynich Manuscript--An Elegant Enigma. National Security Agency, 1978. Aegean Park Press, 1978?

  4. Toresella, Sergio. "Gli erbari degli alchimisti." [Alchemical herbals.] In Arte farmaceutica e piante medicinali -- erbari, vasi, strumenti e testi dalle raccolte liguri [Pharmaceutical art and medicinal plants -- herbals, jars, instruments and texts of the Ligurian collections], Liana Saginati, ed. Pisa: Pacini Editore, 1996, pp. 31-70. [Profusely illustrated. Fits the VMs into an "alchemical herbal" tradition.]

Copyright © 1998 by Dennis J. Stallings, all rights reserved.