Understanding the Second-Order Entropies of Voynich Text
by Dennis J. Stallings
May 11, 1998
Abstract
The anomalous second-order entropies of Voynich text are among its most
puzzling features. h1-h2, the difference between the conditional first- and
second-order entropies, equals H1-h2, the difference between the first-order
absolute entropy and the second-order conditional entropy. h1-h2 (or H1-h2)
is a theoretically significant number; it denotes the average information
carried by the first character in a digraph about the second one. Therefore
it was chosen as a simple measure of what is being sought, although the
whole entropy profile of text samples was considered.
Tests show that Voynich text does not have its low h2 measures solely because
of a repetitious underlying text, that is, one that often repeats the same
words and phrases. Tests also show that the low h2 measures are probably
not due to an underlying low-entropy natural language. A verbose cipher,
one which substitutes several ciphertext characters for one plaintext character,
can produce the entropy profile of Voynich text.
Table of Contents
- Introduction
- Measures of Relative Second-Order Entropy
- Entropies of Voynich Texts
- Verbose Ciphers
- Repetitive Texts
- Schizophrenic Language
- Low-Entropy Natural Languages
- Japanese
- Hawaiian
- Discussion of Phonemic versus Syllabic Notation
- The Size of the Character Set
- The Effect of Word Divisions
- Redundancy
- The Effect of Syllable Divisions
- Final Thoughts on Low-Entropy Natural Languages
- Suggestions for Further Work
- Acknowledgments
- References for Electronic Texts
- Printed References
Introduction
William Ralph Bennett first applied the entropy concept to the study of
the Voynich Manuscript in his Scientific and Engineering Problem Solving
with the Computer (Englewood Cliffs: Prentice-Hall, 1976). His book
has introduced many people to the VMs.
The repetitive nature of VMs text is obvious to casual examination. Entropy
is one possible numerical measure of a text's repetitiousness. The higher
the text's repetitiousness, the lower the second-order entropy (information
carried in letter pairs). Bennett noted that only some Polynesian languages
have second-order entropies as low as VMs text. Typical ciphers do not
have a low second-order entropy either.
This paper examines other possible reasons for the low second-order entropy
of Voynich texts: a verbose cipher or a repetitious underlying text. It
also examines the low-entropy natural languages Hawaiian and Japanese for
further insight into the hypothesis that the underlying language is a
low-entropy natural language.
Measures of Relative Second-Order Entropy
Jacques Guy's MONKEY program was used to calculate second-order entropies.
(Note: the bug-free, "sensible" MONKEY on the EVMT Project Home Page was
used; the author believes that the version of MONKEY on Garbo as of this
writing has bugs.)
Note that MONKEY in its present form only takes the first 32,000 characters
in a file. Some long texts were divided up into portions so that MONKEY
could analyze them separately.
The conditional entropies were used, as is customary on the Voynich E-mail
list. Say that H1 is the absolute first-order entropy and H2 is the absolute
second-order entropy. Then h1 and h2 are the first- and second-order conditional
entropies. h2 = H2-H1, since it is conditional on more than one character.
h1 = H1, since it depends on only single characters; thus h1 is really
not conditional.
The following measures were considered:
h0: zero-order entropy (log2 of the number of different characters)
h1: first-order conditional or absolute entropy
h2: second-order conditional entropy
h1-h2: the difference between the conditional first- and second-order
entropies, which equals H1-h2, the difference between the first-order
absolute entropy and the second-order conditional entropy
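The definitions above can be sketched in a few lines of code. This is a minimal illustration (Python here, a hypothetical stand-in for MONKEY, not the original program); frequencies are estimated directly from the sample:

```python
from collections import Counter
from math import log2

def entropies(text):
    """Return h0, h1, h2, and h1-h2, following the definitions above:
    h1 = H1 (the absolute first-order entropy), h2 = H2 - H1."""
    chars = Counter(text)
    digraphs = Counter(zip(text, text[1:]))
    n1, n2 = sum(chars.values()), sum(digraphs.values())
    H1 = -sum(c / n1 * log2(c / n1) for c in chars.values())
    H2 = -sum(c / n2 * log2(c / n2) for c in digraphs.values())
    h0 = log2(len(chars))  # log2 of the number of distinct characters
    h1, h2 = H1, H2 - H1
    return h0, h1, h2, h1 - h2

# Example call on a toy string, not a real corpus:
print(entropies("daiin daiin chkaiin"))
```

On a real text file one would read at most the first 32,000 characters, mirroring MONKEY's limit.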
As will be seen, there is a need here to compare systems with very different
numbers of characters, to scale the statistics somehow to the size of the
character set. h1-h2 or H1-h2 is a theoretically significant number; it
denotes the average information carried by the first character in a digraph
about the second one. It is perhaps the best single, simple measure of
what is being sought.
The % of the second-order maximum absolute entropy might have been used.
One could calculate the percentage of the maximum H2 that each alphabet
could deliver. Using digraphs with an alphabet of m characters, H2(max)
is:
log2(m^2)
and %H2(max) is:
(H2/log2(m^2)) * 100
However, H2(max) depends tremendously on m, the size of the character
set chosen. For Voynich text, Currier has 36 characters and Basic Frogguy
has 23 characters. Characters that are hardly ever used have little effect
on h1 and h2, but could make a tremendous difference in H2(max). Therefore,
this measure was not used.
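A quick calculation shows how strongly this rejected measure depends on m. The sketch below (Python) applies the formula to the same entropy value under the two Voynich alphabet sizes; the value 2.267 is the Herbal-B Currier h2 reported later in this paper, used purely for illustration:

```python
from math import log2

def pct_H2_max(H2, m):
    """Percentage of the maximum digraph entropy log2(m^2)
    delivered by an alphabet of m characters."""
    return (H2 / log2(m ** 2)) * 100

# The same entropy value scores quite differently under different alphabet sizes:
print(pct_H2_max(2.267, 36))  # Currier, 36 characters
print(pct_H2_max(2.267, 23))  # Basic Frogguy, 23 characters
```

The smaller alphabet yields the larger percentage, even though the underlying text statistic is unchanged.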
To start the discussion, here are some data from the English King James
Bible:
Table 1: English King James Bible - 1 Kings

Passage Beginning at | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
1:1                  |    27 |     32000 | 4.755 | 4.022 | 3.068 | 0.953
8:19                 |    27 |     32000 | 4.755 | 4.028 | 3.090 | 0.939
15:27                |    27 |     32000 | 4.755 | 3.998 | 3.092 | 0.906
Average of three     |    27 |     96000 | 4.755 | 4.016 | 3.083 | 0.933
The h1-h2 range for different portions of the same text is 0.906-0.953.
And here are data on the corresponding portions of the Latin Vulgate Bible:
Table 2: Latin Vulgate Bible - 1 Kings

Passage Beginning at | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
1:1                  |    24 |     32000 | 4.585 | 4.002 | 3.309 | 0.692
8:19                 |    24 |     32000 | 4.585 | 3.994 | 3.287 | 0.707
15:27                |    24 |     32000 | 4.585 | 4.005 | 3.304 | 0.700
Average of three     |    24 |     96000 | 4.585 | 4.000 | 3.300 | 0.700
The average h1-h2 is 0.700, compared to 0.933 for the English text. This
is undoubtedly due to the fact that English uses more combinations of two
or more letters to represent single phonemes than Latin does. The range
of h1-h2 for the Latin text is 0.692-0.707, narrower than for the English
text.
The next table gives the h1-h2 statistic for assorted files in various
languages and notations, and illustrates how this statistic can reveal
unexpected information. For instance, Hawaiian and Japanese in phonemic
notation have low h2 values, approaching those of Voynich text. However,
the h1-h2 values for Hawaiian and Japanese are far less than those of
Voynich text.
Table 3: h1-h2 Statistics for Selected Texts

File                                            | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
Latin - Vulgate Bible, 1 Kings, first 32K       |    24 |     32000 | 4.585 | 4.002 | 3.309 | 0.692
Hawaiian (Bennett, limited phonemic)            |    13 |     15000 | 3.700 | 3.200 | 2.454 | 0.746
Hawaiian newspaper (full phonemic)              |    19 |     13473 | 4.248 | 3.575 | 2.650 | 0.925
English - King James Bible - Genesis, first 32K |    27 |     32000 | 4.755 | 3.969 | 3.020 | 0.949
Japanese Tale of Genji - Section 1 (romaji)     |    22 |     32000 | 4.459 | 3.763 | 2.677 | 1.086
Japanese Tale of Genji - Section 1 (kana)       |    71 |     20622 | 6.150 | 4.764 | 3.393 | 1.370
Voynich Herbal-B (Currier)                      |    34 |     13858 | 5.087 | 3.796 | 2.267 | 1.529
Voynich Herbal-B (EVA)                          |    21 |     16061 | 4.392 | 3.859 | 2.081 | 1.778
Entropies of Voynich Texts
Here are entropy results for Voynich texts: a sample of Herbal-A and one
of Herbal-B. The Herbal-A sample's h1-h2 ranges from 1.479 to 1.945,
depending on which transcription alphabet is used; the Herbal-B sample's
ranges from 1.529 to 1.897. All these are far greater than the 0.93 for
English and 0.70 for Latin.
The choice of transcription alphabet also makes an enormous difference.
From Currier to Frogguy the range of h1-h2 is 1.5-1.9. The direction is
what one would expect. Currier is the most synthetic, while Frogguy is
the most analytical, decomposing single Currier characters into several
Frogguy characters. Thus Currier Q = Frogguy cqpt.
Table 4: Voynich Texts

Type of Voynich Text | Transcription Alphabet | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
Herbal-A             | Currier                |    33 |      9804 | 5.044 | 3.792 | 2.313 | 1.479
Herbal-A             | FSG                    |    24 |     10074 | 4.585 | 3.801 | 2.286 | 1.515
Herbal-A             | EVA                    |    21 |     12218 | 4.392 | 3.802 | 1.990 | 1.812
Herbal-A             | Frogguy                |    21 |     13479 | 4.392 | 3.826 | 1.882 | 1.945
Herbal-B             | Currier                |    34 |     13858 | 5.087 | 3.796 | 2.267 | 1.529
Herbal-B             | FSG                    |    24 |     14203 | 4.585 | 3.804 | 2.244 | 1.560
Herbal-B             | EVA                    |    21 |     16061 | 4.392 | 3.859 | 2.081 | 1.778
Herbal-B             | Frogguy                |    21 |     17909 | 4.392 | 3.846 | 1.949 | 1.897
The samples of Voynich text are relatively small. The following statistics
for samples of a single known Latin text give some idea of how much difference
this might make.
Table 5: Texts from Latin Vulgate Bible, 1 Kings, for Study of the Effect
of Sample Size on Entropy Data. Passages All Begin at 1:1.

Passage Ending at | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
2:18              |    23 |      8929 | 4.524 | 3.994 | 3.263 | 0.731
4:21              |    24 |     18623 | 4.585 | 3.995 | 3.298 | 0.697
7:17              |    24 |     29647 | 4.585 | 4.003 | 3.309 | 0.694
It is doubtful whether h1-h2 or any other single measure can tell us all
we want. However, the representation system is probably the heart of the
issue. The following discussion of verbose ciphers is a case in point.
Verbose Ciphers
A verbose cipher, one that substitutes several ciphertext characters for
one plaintext character, can produce the entropy profile seen in Voynich
text. One such system is Cat Latin C, which is applied to Latin plaintext.
Vowels and consonants were added roughly in proportion to their occurrence
in Latin, which keeps h1 roughly the same as for Latin and for Voynich
text in FSG. The repeated digraphs are what reduce h2 to the desired level.
If q is followed by u, it is read as in normal Latin; otherwise it fits
one of the consonant patterns, so the scheme is unambiguous. This scheme
does produce VMs-like entropies!
This table shows the Cat Latin verbose cipher:
Table 6: Cat Latin C

Plaintext | Ciphertext
a         | a
b         | bqbababa
c         | c
d         | dqdede
e         | e
f         | fqfififi
g         | gqgogogo
h         | h
i         | i
j         | jqjajaja
k         | k
m         | mqmememe
n         | nqninini
o         | o
p         | pqpopopo
qu        | qu
r         | rqrarara
s         | sqsesese
t         | tqtititi
u         | u
v         | v
w         | w
x         | xqxoxoxo
y         | y
z         | zqzazaza
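The Cat Latin C mapping is simple enough to sketch as a short program. This is one reading of the table (Python; not the author's original tool), assuming lowercase plaintext in which unlisted letters and spaces pass through unchanged and "qu" is matched before single letters:

```python
# Expansions from the Cat Latin C table; unlisted letters pass through unchanged.
CAT_LATIN_C = {
    'b': 'bqbababa', 'd': 'dqdede',   'f': 'fqfififi', 'g': 'gqgogogo',
    'j': 'jqjajaja', 'm': 'mqmememe', 'n': 'nqninini', 'p': 'pqpopopo',
    'r': 'rqrarara', 's': 'sqsesese', 't': 'tqtititi', 'x': 'xqxoxoxo',
    'z': 'zqzazaza',
}

def encipher(plaintext):
    out, i = [], 0
    while i < len(plaintext):
        if plaintext[i:i + 2] == 'qu':  # 'qu' maps to itself, as in normal Latin
            out.append('qu')
            i += 2
        else:
            out.append(CAT_LATIN_C.get(plaintext[i], plaintext[i]))
            i += 1
    return ''.join(out)

print(encipher('et rex david senuerat'))
# -> etqtititi rqrararaexqxoxoxo dqdedeavidqdede sqseseseenqnininiuerqrararaatqtititi
```

The sample call reproduces the opening of the Cat Latin C passage shown later in this section, so this reading of the table appears consistent with the author's.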
For comparison here are VMs results in FSG, since the size of that character
set is closest to Latin.
Table 7: Verbose Cipher Compared to Voynich Text

File                               | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
Voynich Herbal-A (FSG)             |    24 |     10074 | 4.585 | 3.801 | 2.286 | 1.515
Voynich Herbal-B (FSG)             |    24 |     14203 | 4.585 | 3.804 | 2.244 | 1.560
Latin Vulgate, 1 Kings, 1:1 - 2:11 |    23 |      8232 | 4.524 | 3.996 | 3.262 | 0.734
Above passage, Cat Latin C         |    23 |     28754 | 4.524 | 3.873 | 2.278 | 1.595
However, it's clear that this is not the same pattern as Voynich text.
It might be best to look for patterns subjectively. Here are some text
samples.
The start of the Voynich Herbal-A sample file (f29v, lines 1-9), in EVA:
kshol qoocph shor pshocph shepchy qoty dy shory
ykcholy qoty chy dy qokchol chor tchy qokchody cheor o
chor chol chy choiin
tshoiin cheor chor o chty qotol sheol shor daiin qoty
otol chol daiin chkaiin shoiin qotchey qotshey daiiin
daiin chkaiin
pchol oiir chol tsho daiin sho teo chy chtshy dair am
okain chan chain cthor dain yk chy daiin cthol
sot chear chl s choly dar
The beginning of a Hawaiian sample file, from a Hawaiian newspaper, to
be discussed later:
kepakemapa mei puke kepakemapa mei mahalo 'ia ka 'Olelo hawai'i e nA mAka'
na ho'Olanani kim ma ka lA o malaki ua noa ka pAka 'o kapi'olani no ke
anaina na lAkou ke kuleana 'o ka mAlama 'ana ma ka 'Olelo 'ana aku i ka
'Olelo hawai'i ma laila nO i 'Akoakoa ai ka po'e haumAna ka po'e kumu ka
po'e mAkua a me ka po'e hoa o kElA 'ano kEia 'ano o ka 'Olelo hawai'i a
ma laila nO ho'i i launa ai ka po'e ma o ka 'Olelo hawai'i kapa 'ia kEia
lA hoihoi 'o ka lA 'ohana
Finally, the beginning of the Latin Vulgate 1 Kings in Cat Latin C:
etqtititi rqrararaexqxoxoxo dqdedeavidqdede sqseseseenqnininiuerqrararaatqtititi
habqbababaebqbababaatqtititique aetqtititiatqtititiisqsesese pqpopopolurqrararaimqmememeosqsesese
dqdedeiesqsesese cumqmememeque opqpopopoerqrararairqrararaetqtititiurqrarara
vesqsesesetqtititiibqbababausqsesese nqnininionqninini calefqfififiiebqbababaatqtititi
dqdedeixqxoxoxoerqrararaunqnininitqtititi erqrararagqgogogoo ei sqseseseerqrararavi
...
Look at these samples and think about the kind of repetition involved
in each case! The "Cat Latin C" verbose cipher is clearly not the same
thing as Voynichese.
Here are the entropy values for these samples:
Table 8: Statistics on Text Samples

File                                            | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
Voynich Herbal-A (EVA)                          |    21 |     12218 | 4.392 | 3.802 | 1.990 | 1.812
Hawaiian newspaper (full phonemic)              |    19 |     13473 | 4.248 | 3.575 | 2.650 | 0.925
Latin Vulgate, 1 Kings, 1:1 - 2:11, Cat Latin C |    23 |     28754 | 4.524 | 3.873 | 2.278 | 1.595
The author's personal opinion is that the rigid internal structure of Voynich
text accounts for the low h2 measures. The majority of Voynich "words"
follow a paradigm. Robert Firth (Work Note #24) and Jorge Stolfi (Voynich
Page) both have identified paradigms. Captain Prescott Currier (Currier's
Papers) identified several other kinds of internal structure in Voynich
text.
Repetitive Texts
From time to time, some have suggested that the Voynich Manuscript is simply
a very repetitious text. Here is a repetitious magical spell in Old High
German:
eiris sazun idisi sazun her duoder
suma hapt heptidun suma heri lezidun
suma clubodun umbi cuoniouuidi
insprinc haptbandun inuar uigandun
phol ende uuodan uuorun zi holza
du uuart demo balderes uolon sin uuoz birenkit
thu biguol en sinthgunt sunna era suister
thu biguol en friia uolla era suister
thu biguol en uuodan so he uuola conda
sose benrenki sose bluotrenki
sose lidirenki
ben zi bena bluot zi bluoda
lid zi geliden sose gelimida sin
Merseburger Zaubersprüche (Magic Spells from Merseburg), in Old High German.
Note: 'uu' = 'w'.
An experiment to test this idea is to take samples of known repetitious
texts (food recipes, religious texts, catalogs) and compare their second-order
entropies with those of known texts that should be less repetitious (prose
fiction, essays).
Note that some long texts were larger than MONKEY's 32,000-character limit;
in those cases MONKEY simply took the first 32,000 characters. Other long
texts were divided into separate portions that MONKEY could analyze.
Jacobean English. Ever since its publication, many commentators
have noted how repetitious the Book of Mormon is. The Bible
itself is, of course, somewhat repetitious. A (relatively) non-repetitious
text in Jacobean English is the Essays of Sir Francis Bacon.
The Book of Mormon appears to be the most repetitious: h1-h2 for
the Book of Mormon excerpts ranges 0.931-0.980. The King James
Bible is next, 0.904-0.983. The non-repetitious Essays of Francis
Bacon have 0.827-0.837. Taking averages, the difference in h1-h2 between
the most repetitious text and the least is 0.951 versus 0.831, a difference
of 0.120.
Table 9: Jacobean English Texts of Varying Repetition

File                           | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
Book of Mormon - 1 Nephi       |    27 |     32000 | 4.755 | 4.033 | 3.090 | 0.942
Book of Mormon - Alma          |    27 |     32000 | 4.755 | 4.041 | 3.109 | 0.931
Book of Mormon - Ether         |    27 |     32000 | 4.755 | 4.009 | 3.029 | 0.980
King James Bible - Genesis     |    27 |     32000 | 4.755 | 3.969 | 3.020 | 0.949
King James Bible - Joshua      |    27 |     32000 | 4.755 | 4.012 | 3.029 | 0.983
King James Bible - Acts        |    27 |     32000 | 4.755 | 4.041 | 3.137 | 0.904
Francis Bacon's Essays, Part 1 |    27 |     32000 | 4.755 | 4.048 | 3.220 | 0.827
Francis Bacon's Essays, Part 2 |    27 |     32000 | 4.755 | 4.042 | 3.214 | 0.828
Francis Bacon's Essays, Part 3 |    27 |     32000 | 4.755 | 4.066 | 3.229 | 0.837
Latin (Late Classical). Samples of the Vulgate Bible and
Boethius' Consolation of Philosophy were analyzed. There is little
difference in the statistics between the Vulgate Bible and the presumably
less repetitious Consolatio Philosophiae.
Table 10: Latin Texts of Varying Repetition

File                                             | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
1 Kings, Vulgate, 1:1                            |    24 |     32000 | 4.585 | 4.002 | 3.309 | 0.692
1 Kings, Vulgate, 8:19                           |    24 |     32000 | 4.585 | 3.994 | 3.287 | 0.707
1 Kings, Vulgate, 15:27                          |    24 |     32000 | 4.585 | 4.005 | 3.304 | 0.700
Boethius - Consolatio Philosophiae - Books 3 & 4 |    25 |     32000 | 4.644 | 3.971 | 3.272 | 0.699
Modern English. Repetitive texts: food recipes (chicken and Cajun),
a catalog of technical standards, and a Roman Catholic litany. For a non-repetitious
text: a short story, "The Blue Hotel" by Stephen Crane.
The non-repetitious short story "The Blue Hotel" has an h1-h2 of 0.826,
while the repetitious Roman Catholic Litany has an h1-h2 of 0.968. The
difference is 0.968 - 0.826 = 0.142. The other texts mostly fall in between,
although the presumably repetitious Cajun recipe has an h1-h2 of 0.827,
almost identical to the short story.
Table 11: Modern English Texts of Varying Repetition

File                                                           | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
Modern English - Roman Catholic litany                         |    26 |      9492 | 4.700 | 4.071 | 3.103 | 0.968
Modern English - ISO 14000 catalog                             |    27 |      6696 | 4.755 | 4.076 | 3.137 | 0.939
Modern English - The Blue Hotel by Stephen Crane (short story) |    27 |     32000 | 4.755 | 4.073 | 3.247 | 0.826
Modern English - Cajun recipe                                  |    27 |     27363 | 4.755 | 4.124 | 3.297 | 0.827
Modern English - Chicken recipe                                |    27 |     18461 | 4.755 | 4.131 | 3.193 | 0.938
For comparison, here are data for Voynich texts in FSG, which has the character
set closest in size to the ordinary Latin alphabet.
Table 12: Voynich Texts in FSG

Type of Voynich Text | Transcription Alphabet | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
Herbal-A             | FSG                    |    24 |     10074 | 4.585 | 3.801 | 2.286 | 1.515
Herbal-B             | FSG                    |    24 |     14203 | 4.585 | 3.804 | 2.244 | 1.560
When one compares the differences due to repetition in English texts
(0.968 - 0.826 = 0.142 for modern English and 0.951 - 0.831 = 0.120 for
Jacobean English) with the h1-h2 values for Voynich text (1.515 or 1.560),
it becomes clear that a repetitious underlying format or subject matter
could not turn a text in a normal European language into a Voynich text!
Thus, Voynich text clearly does not have its low h2 measures solely because
of a repetitious underlying text, that is, one that often repeats the same
words and phrases.
Schizophrenic Language
In an important paper that discusses the Voynich Manuscript, Professor
Sergio Toresella says that the VMs author had a psychiatric disturbance.
In one of the works cited by Toresella in this connection, Creativity
by Silvano Arieti, Arieti talks about the distorted language of schizophrenics
but not other language phenomena.
At the Kooks Museum web site, there is a sample of schizophrenic language:
in its Schizophrenic Wing is a transcript of flyers by Francis E. Dec,
Esquire, containing two Rants. Here is an excerpt from Rant #2:
"Computer God computerized brain thinking sealed robot operating arm surgery
cabinet machine removal of most of the frontal command lobe of the brain,
gradually, during lifetime and overnight in all insane asylums after Computer
God kosher bosher one month probation period creating helpless, hopeless
Computer God Frankenstein Earphone Radio parroting puppet brainless slaves,
resulting in millions of hopeless helpless homeless derelicts in all Jerusalem,
U.S.A. cities and Soviet slave work camps. Not only the hangman rope deadly
gangster parroting puppet scum-on-top know this top medical secret, even
worse, deadly gangster Jew disease from deaf Ronnie Reagan to U.S.S.R.
Gorbachev know this oy vay Computer God Containment Policy top secret.
Eventual brain lobotomization of the entire world population for the Worldwide
Deadly Gangster Communist Computer God overall plan, an ideal worldwide
population of light-skinned, low hopeless and helpless Jew-mulattos, the
communist black wave of the future."
The samples and discussion of schizophrenic talk in Arieti resemble Francis
Dec's, in repeated but disconnected ideas, alliteration, etc.
MONKEY was run on the two Rants and the results were compared with examples
of normal English text:
Table 13: Schizophrenic Rant Compared to Other English Texts

File                                                           | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
Schizophrenic rant                                             |    27 |     12967 | 4.755 | 4.182 | 3.428 | 0.755
King James Bible - Genesis                                     |    27 |     32000 | 4.755 | 3.969 | 3.020 | 0.949
Francis Bacon's Essays, Part 1                                 |    27 |     32000 | 4.755 | 4.048 | 3.220 | 0.827
Modern English - Roman Catholic litany                         |    26 |      9492 | 4.700 | 4.071 | 3.103 | 0.968
Modern English - The Blue Hotel by Stephen Crane (short story) |    27 |     32000 | 4.755 | 4.073 | 3.247 | 0.826
The second-order entropy of the schizophrenic rants is definitely higher,
and their h1-h2 lower, than those of any of the ordinary texts. As with
the repetitive texts, the nature of the underlying text would not by itself
explain the puzzling character of VMs text.
Low-Entropy Natural Languages
One may write Japanese in Latin characters (romaji) or in syllabic scripts
(hiragana and katakana, collectively the kana). In romaji, Japanese is a
low-entropy language because of a relatively small phonemic inventory and
severe phonotactic constraints. A Japanese syllable may begin with zero or
one consonant (counting ts, ry, and ky as one consonant), has one vowel,
and ends with nothing or -n (although the following syllable's consonant
may be doubled). (There are at least some long and short vowels in Japanese,
which complicates this a little.)
However, these same severe phonotactic constraints mean that only a limited
number of syllables is possible in Japanese, which makes a syllabic script
such as kana feasible. One would expect Japanese in kana to have a higher
relative h2 (lower h1-h2) than Japanese in romaji.
Hawaiian has even more severe phonotactic constraints, and thus one ought
to be able to write Hawaiian in a syllabic script. In Hawaiian a syllable
may begin in zero or one consonant, have only one vowel, and may only end
in nothing! Hawaiian has a much more limited phonemic inventory than Japanese.
Hawaiian is especially significant because Bennett compared Voynichese
to Hawaiian and noted that they had similar second-order entropies. Bennett
said that some Polynesian languages are the only natural languages with
second-order entropies as low as Voynichese.
Therefore, in order to gain insight on these issues, Hawaiian and Japanese
are compared in syllabic as well as phonemic notation.
Japanese
The classic Japanese novel Tale of Genji is written almost entirely
in kana. Gabriel Landini kindly adapted this both into romaji and into
a kana notation that MONKEY could analyze.
Table 14: Entropies of Japanese in Romaji and Kana

File                      | Orthography | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
Tale of Genji - Section 1 | Romaji      |    22 |     32000 | 4.459 | 3.763 | 2.677 | 1.086
Tale of Genji - Section 2 | Romaji      |    20 |     31505 | 4.322 | 3.751 | 2.627 | 1.124
Tale of Genji - Section 3 | Romaji      |    20 |     29474 | 4.322 | 3.749 | 2.639 | 1.110
Tale of Genji - Section 4 | Romaji      |    20 |     32000 | 4.322 | 3.750 | 2.641 | 1.109
Tale of Genji - Section 5 | Romaji      |    20 |     27064 | 4.322 | 3.744 | 2.630 | 1.114
Tale of Genji - Overall   | Romaji      |    22 |    152043 | 4.459 | 3.751 | 2.643 | 1.108
Tale of Genji - Section 1 | Kana        |    71 |     20622 | 6.150 | 4.764 | 3.393 | 1.370
Tale of Genji - Section 2 | Kana        |    71 |     20622 | 6.150 | 4.764 | 3.393 | 1.370
Tale of Genji - Section 3 | Kana        |    70 |     18574 | 6.129 | 4.709 | 3.410 | 1.298
Tale of Genji - Section 4 | Kana        |    70 |     20386 | 6.129 | 4.716 | 3.464 | 1.252
Tale of Genji - Section 5 | Kana        |    70 |     17096 | 6.129 | 4.698 | 3.362 | 1.337
Tale of Genji - Overall   | Kana        |    71 |     97300 | 6.150 | 4.730 | 3.404 | 1.326
As one would expect, the absolute h0, h1, and h2 numbers for kana are much
higher than those for romaji. However, the differences for h1-h2 are consistently
higher for kana, which one would not expect.
Hawaiian
Bennett did his Hawaiian study with a limited, "lossy" Hawaiian orthography
that did not distinguish the long vowels and did not write the glottal stop
(call this Hawaiian limited phonemic). He also had his own Voynich
transcription alphabet, and he compared only the absolute h2 values, not
relative measures such as h1-h2; it is as good an illustration as any of
the problems here. Therefore, statistics were run on Hawaiian both in
limited phonemic and syllabic spellings, with long/short vowels not
separated and the glottal stop not indicated, and in full phonemic and
syllabic notation.
Hawaiian has the following phonemes:
Consonants: h k l m n p w ' (glottal stop)
Vowels: a e i o u A E I O U (capitals mean long)
Here is a sample of the Hawaiian newspaper text used in this paper for
statistics, given in Bennett's notation:
ma ka la o malaki ua noa ka paka o kapiolani no ke anaina na lakou ke kuleana
o ka malama ana ma ka olelo ana aku i ka olelo hawaii ma laila no i Akoakoa
ai ka poe haumana ka
And here is the same text in full phonemic notation:
ma ka lA o malaki ua noa ka pAka 'o kapi'olani no ke anaina na lAkou ke
kuleana 'o ka mAlama 'ana ma ka 'Olelo 'ana aku i ka 'Olelo hawai'i ma
laila nO i 'Akoakoa ai ka po'e haumAna ka
Here are the entropy values.
Table 15: Entropies of Hawaiian Texts in Different Orthographies

File               | Orthography      | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
Hawaiian (Bennett) | limited phonemic |    13 |     15000 | 3.700 | 3.200 | 2.454 | 0.746
Hawaiian newspaper | limited phonemic |    13 |     13097 | 3.700 | 3.224 | 2.437 | 0.787
Hawaiian newspaper | limited syllabic |    39 |      9533 | 5.285 | 3.816 | 2.929 | 0.887
Hawaiian newspaper | full phonemic    |    19 |     13473 | 4.248 | 3.575 | 2.650 | 0.925
Hawaiian newspaper | full syllabic    |    77 |      9160 | 6.267 | 4.361 | 3.162 | 1.200
And here are data for Bennett's and this paper's Voynich texts for comparison:
Table 16: Voynich Texts for Comparison with Hawaiian

Type of Voynich Text | Transcription Alphabet | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
Voynich (Bennett)    | Bennett                |    21 |     10000 | 4.392 | 3.660 | 2.220 | 1.440
Herbal-A             | Currier                |    33 |      9804 | 5.044 | 3.792 | 2.313 | 1.479
Herbal-A             | FSG                    |    24 |     10074 | 4.585 | 3.801 | 2.286 | 1.515
Herbal-A             | EVA                    |    21 |     12218 | 4.392 | 3.802 | 1.990 | 1.812
Herbal-A             | Frogguy                |    21 |     13479 | 4.392 | 3.826 | 1.882 | 1.945
Herbal-B             | Currier                |    34 |     13858 | 5.087 | 3.796 | 2.267 | 1.529
Herbal-B             | FSG                    |    24 |     14203 | 4.585 | 3.804 | 2.244 | 1.560
Herbal-B             | EVA                    |    21 |     16061 | 4.392 | 3.859 | 2.081 | 1.778
Herbal-B             | Frogguy                |    21 |     17909 | 4.392 | 3.846 | 1.949 | 1.897
Bennett compared his Voynich text in a 21-character transcription to Hawaiian
in a 13-character orthography (including the space character). He got h2
values of 2.220 for Voynich text and 2.454 for his Hawaiian text. However,
a sample of Hawaiian text in a full phonemic orthography, with 19 characters
including spaces, has an h2 of 2.650, even higher. A comparison of h1-h2
values shows a dramatic difference between Hawaiian and Japanese on the one
hand and Voynichese on the other. h1-h2 equals 1.8 for Voynichese in EVA,
but is 0.746 for Bennett's Hawaiian data, 0.925 for Hawaiian in full phonemic
notation, and 1.1 for Japanese romaji. These figures are all very different
from Voynichese.
Discussion of Phonemic versus Syllabic Notation
While perhaps not germane to the Voynich Manuscript problem, it is odd
that h1-h2 increases from phonemic to syllabic notation, both for
Japanese and Hawaiian. In syllabic notation, given the first character,
the second character is more predictable than it is in phonemic
notation. This is quite puzzling. How can we explain these results for
Hawaiian and Japanese?
The Size of the Character Set
In going from phonemic to syllabic notation, the text becomes shorter: more
information is packed into fewer characters, but that is accomplished by
using a larger character set. The character sets for the syllabic notations
are more than three times the size of those for the phonemic notations. The
measure h1-h2 was chosen to minimize the effect of the size of the character
set, but it is surely not entirely successful in doing so.
The Effect of Word Divisions
Perhaps one loses predictability because the number of space characters
in relation to the total is greater for syllabic notation than for phonemic.
If that were the case, leaving out the spaces ought to decrease h1-h2 for
syllabic notation more than for phonemic notation. MONKEY runs were made
leaving out the spaces to test this. However, the h1-h2 results for syllabic
notation decrease less than those for phonemic notation do.
Table 17: The Effect of Word Divisions on Statistics for Japanese and Hawaiian

File                               | Orthography   | Spaces Included | # ch. | File Size |    h0 |    h1 |    h2 | h1-h2
Japanese Tale of Genji - Section 1 | Romaji        | Yes             |    22 |     32000 | 4.459 | 3.763 | 2.677 | 1.086
Japanese Tale of Genji - Section 1 | Romaji        | No              |    21 |     26106 | 4.392 | 3.803 | 2.935 | 0.868
Japanese Tale of Genji - Section 1 | Kana          | Yes             |    71 |     20622 | 6.150 | 4.764 | 3.393 | 1.370
Japanese Tale of Genji - Section 1 | Kana          | No              |    70 |     14051 | 6.129 | 5.666 | 4.330 | 1.337
Hawaiian newspaper                 | Full Phonemic | Yes             |    19 |     13473 | 4.248 | 3.575 | 2.650 | 0.925
Hawaiian newspaper                 | Full Phonemic | No              |    18 |     10433 | 4.170 | 3.622 | 2.935 | 0.687
Hawaiian newspaper                 | Full Syllabic | Yes             |    77 |      9160 | 6.267 | 4.361 | 3.162 | 1.200
Hawaiian newspaper                 | Full Syllabic | No              |    76 |      6120 | 6.248 | 5.156 | 3.982 | 1.174
Redundancy
Gabriel Landini, who did graduate studies in Japan, noted that the redundancy
of Japanese is only apparent, that it is actually rather ambiguous. In
writing this is overcome with ideographs (kanji), while in speech it is
overcome with the context of the speech and with rigid structures (phrases
and expressions).
However, Jacques Guy (who holds a doctorate in Polynesian languages and
was once fluent in Tahitian) notes that Tahitian (similar to Hawaiian) is
no more ambiguous than English or French! So redundancy is not likely the
explanation.
The Effect of Syllable Divisions
Could the (relatively) high h1-h2 values for syllabic Hawaiian and Japanese
mean that combinations of two syllables (e.g., yama in Japanese, wiki in
Hawaiian) are as repetitious and fixed as combinations of phonemes within
syllables?
The phonemic vs. syllabic problem here is more complex than this. Take
"yamamoto" in romaji and in kana: (ya)(ma)(mo)(to). When analyzing the
second-order entropy in romaji, one looks at the distributions of
"ya" "am" "mo" "ot" "to", while for kana it is "(ya)(ma)" "(ma)(mo)"
"(mo)(to)". For half (or so) of the romaji digraphs, one deals with
combinations of letters ("am", "ot") that are never represented in kana.
So the second-order entropy of one type of text is not strictly comparable
with the second-order entropy of the other. The second-order entropy of the
romaji text is in principle "near" in meaning to the first-order entropy of
the kana, but only about half of the digraphs correspond to kana.
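The mismatch between romaji and kana digraphs can be made concrete with a small sketch (Python here; each kana is treated as a single token):

```python
def digraphs(tokens):
    """Return the ordered list of adjacent-token pairs."""
    return list(zip(tokens, tokens[1:]))

romaji = "yamamoto"              # eight phonemic characters
kana = ["ya", "ma", "mo", "to"]  # four syllabic tokens

print(digraphs(romaji))  # includes ('a','m') and ('o','t'), which straddle
                         # syllable boundaries and have no kana counterpart
print(digraphs(kana))    # only syllable-to-syllable pairs
```

Of the seven romaji digraphs, only those beginning at a syllable boundary correspond to a kana pair; the rest are internal to or straddle syllables.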
While the differences in statistics between syllabic and phonemic notation
are interesting, they are not necessarily relevant to the Voynich Manuscript.
They are chiefly interesting in raising questions about the use of the
entropy concept.
Final Thoughts on Low-Entropy Natural Languages
Consider again the start of the Herbal-A sample file (f29v, lines 1-9),
in EVA:
kshol qoocph shor pshocph shepchy qoty dy shory
ykcholy qoty chy dy qokchol chor tchy qokchody cheor o
chor chol chy choiin
tshoiin cheor chor o chty qotol sheol shor daiin qoty
otol chol daiin chkaiin shoiin qotchey qotshey daiiin
daiin chkaiin
pchol oiir chol tsho daiin sho teo chy chtshy dair am
okain chan chain cthor dain yk chy daiin cthol
sot chear chl s choly dar
And then the beginning of the Hawaiian newspaper sample file:
kepakemapa mei puke kepakemapa mei mahalo 'ia ka 'Olelo hawai'i e nA mAka'
na ho'Olanani kim ma ka lA o malaki ua noa ka pAka 'o kapi'olani no ke
anaina na lAkou ke kuleana 'o ka mAlama 'ana ma ka 'Olelo 'ana aku i ka
'Olelo hawai'i ma laila nO i 'Akoakoa ai ka po'e haumAna ka po'e kumu ka
po'e mAkua a me ka po'e hoa o kElA 'ano kEia 'ano o ka 'Olelo hawai'i a
ma laila nO ho'i i launa ai ka po'e ma o ka 'Olelo hawai'i kapa 'ia kEia
lA hoihoi 'o ka lA 'ohana
One sees that the low h2's of Hawaiian and Japanese are due to their very
strict consonant-vowel alternation. The EVA Voynich sample shows that the
consonant-vowel alternation of Voynichese (as determined by the Sukhotin
vowel-recognition algorithm) is not as strict.
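For readers unfamiliar with the Sukhotin vowel-recognition algorithm mentioned above, here is a minimal sketch of its standard formulation (not the actual program used for the Voynichese analysis): vowels are assumed to be the symbols that most often alternate with other symbols.

```python
from collections import defaultdict

def sukhotin_vowels(text):
    """Sukhotin's algorithm: start with all symbols classed as consonants,
    then repeatedly promote the symbol with the largest positive adjacency
    sum to vowel, discounting its links to the remaining symbols."""
    # Symmetric adjacency counts between distinct neighbouring symbols.
    adj = defaultdict(int)
    for a, b in zip(text, text[1:]):
        if a != b:
            adj[a, b] += 1
            adj[b, a] += 1
    symbols = set(text)
    sums = {s: sum(adj.get((s, t), 0) for t in symbols) for s in symbols}
    vowels = []
    while sums:
        v = max(sums, key=sums.get)
        if sums[v] <= 0:
            break
        vowels.append(v)
        del sums[v]
        for c in sums:
            sums[c] -= 2 * adj.get((c, v), 0)
    return vowels

print(sukhotin_vowels("bananapapamama"))  # ['a']
```

On realistic sample sizes the algorithm recovers the vowels of most alphabetic texts; on very short strings it can misclassify.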
Once again, h1-h2 equals 1.8 for Voynichese in EVA. h1-h2 is 0.746 for
Bennett's Hawaiian data, 0.925 for Hawaiian in full phonemic notation,
and 1.1 for Japanese romaji. All of these figures fall far below the
Voynichese value.
For these reasons, it seems unlikely that an underlying low-entropy natural
language explains the low h2 measures of Voynich text.
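The quantity compared above can be computed directly from character and digraph counts. Since h1 equals the absolute first-order entropy H1, and h2 = H2 - H1 where H2 is the absolute digraph entropy, h1-h2 reduces to 2*H1 - H2. A minimal sketch (Shannon entropies in bits; the function names are illustrative):

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a frequency table."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def h1_minus_h2(text):
    """Average information (bits) that the first character of a digraph
    carries about the second: h1 - h2 = H1 - (H2 - H1) = 2*H1 - H2."""
    H1 = entropy(Counter(text))                  # single characters
    H2 = entropy(Counter(zip(text, text[1:])))   # digraphs
    return 2 * H1 - H2
```

For a strictly alternating text such as "ababab...", the first character determines the second almost completely, and h1-h2 approaches the full 1 bit of H1; for Voynichese in EVA it reaches 1.8 bits.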
Suggestions for Further Work
The various h2 measures are only crude, partial measures of all the factors
that interest us. However, the entropy measure will continue to be useful.
It would be nice to have a program that could calculate the entropies of
files larger than 32K and calculate higher-order entropies more accurately.
The author believes that the "paradigms" and other structural restrictions
of Voynichese explain the low h2 measures. Further study of these structural
constraints will be most useful.
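The first of these suggestions is straightforward to realize: reading the file in overlapping chunks removes any size limit, since memory then depends only on the number of distinct n-grams. The following Python sketch is illustrative (it is not a description of MONKEY; the function name and chunking scheme are assumptions); conditional entropies follow as h_n = H_n - H_(n-1).

```python
import math
from collections import Counter

def ngram_entropy(path, n, chunk=1 << 16):
    """Absolute n-th order entropy Hn (bits) of a text file of any size.
    The file is read in chunks; the last n-1 characters of each chunk are
    carried over so that n-grams spanning chunk boundaries are counted."""
    counts = Counter()
    tail = ""
    with open(path) as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            block = tail + data
            counts.update(block[i:i + n] for i in range(len(block) - n + 1))
            tail = block[-(n - 1):] if n > 1 else ""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```

Because each n-gram is counted exactly once regardless of where chunk boundaries fall, the result is identical for any chunk size.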
Acknowledgments
Many of these ideas and data were previously discussed on the Voynich E-mail
list. A special thanks to Gabriel Landini and Rene Zandbergen for their
assistance.
References for Electronic Texts
-
Voynich Text
-
Rene Zandbergen kindly provided samples of Herbal-B and Herbal-A from voynich.now.
-
Herbal-B: 26r, 26v, 31r, 31v, 33r, 33v, 34r, 34v, 39r, 39v, 40r, 40v, 41r,
41v, 43r, 43v, 46r, 46v, 48r, 48v, 50r, 50v, 55r, 55v, 57r
-
Selected Herbal-A: 28v, 29r, 29v, 30r, 30v, 32r, 32v, 35r, 35v, 36r, 36v,
37r, 37v, 38r, 38v, 42r, 42v, 44r, 44v, 45r, 45v, 47r, 47v, 49r, 49v
-
Jacobean English
-
Book of Mormon
-
Bible, KJV
-
Sir Francis Bacon, Essays
-
Late Classical Latin
-
Vulgate Latin Bible (Estragon or Gopher)
-
Boethius: Consolatio Philosophiae, Books 3 & 4
-
Modern English
-
Catholic Litany
-
ISO Standard Catalog
-
"The Blue Hotel", by Stephen Crane
-
Chicken Recipe
-
Cajun Recipes, Parts 1 and 2
-
Japanese Text
-
Gabriel Landini kindly prepared this. The text is from the first four
parts of the Genji monogatari [Tale of Genji, a classic Japanese novel
written mostly in hiragana]: 01 Kiritsubo, 02 Hahakigi, 03 Utsusemi, 04 Yugao.
-
The "kana" output is not kana, of course, but an arbitrary substitution
for kana so that MONKEY could be applied.
-
Hawaiian
-
The author prepared the Hawaiian texts. Hawaiian has the following phonemes:
-
Consonants: h k l m n p w '(glottal stop)
-
Vowels: a e i o u A E I O U (capitals denote long vowels)
-
However, the difference between long and short vowels is often not indicated,
and the glottal stop is often not written. Both clearly need to be written,
since even with them Hawaiian has a rather limited phonemic inventory!
-
The Hawaiian text came from all the articles in this issue of a Hawaiian
newspaper:
Na Maka o Kana
Puke 5, Pepa 5, 15 Malaki, 1997
-
The text was changed to the notation above. All numbers and all English,
Japanese, and other foreign words were removed until the character set (the
number of characters MONKEY showed) matched the Hawaiian notation. A syllabic
script for Hawaiian using characters that MONKEY recognizes was also devised.
-
Schizophrenic Language
-
At the Kooks Museum, in the Schizophrenic Wing, there is a transcript of
flyers by Francis E. Dec, containing two schizophrenic Rants:
Francis E. Dec, Esquire
Transcripts of flyers
Printed References
-
Arieti, Silvano. Creativity: The Magic Synthesis. New York: Basic
Books, c1976. Library of Congress call number: BF408.A64
-
Bennett, William Ralph. Scientific and Engineering Problem Solving with
the Computer. Englewood Cliffs: Prentice-Hall, 1976. [Contains a chapter
on the VMs.]
-
D'Imperio, M. E.
The Voynich Manuscript--An Elegant Enigma. National
Security Agency, 1978. Aegean Park Press, 1978?
-
Toresella, Sergio. "Gli erbari degli alchimisti." [Alchemical herbals.]
In Arte farmaceutica e piante medicinali -- erbari, vasi, strumenti
e testi dalle raccolte liguri [Pharmaceutical art and medicinal plants
-- herbals, jars, instruments and texts of the Ligurian collections], Liana
Saginati, ed. Pisa: Pacini Editore, 1996, pp. 31-70. [Profusely illustrated.
Fits the VMs into an "alchemical herbal" tradition.]
Copyright © 1998 by Dennis J. Stallings, all rights reserved.