*May 26, 2004.* Added more Case 2 statistics.

*May 25, 2004.* Added Case 2.

*May 14, 2004.* Added comments on base case and added Case 1.

*May 12, 2004.* Opened statistics page. Placed statistics for base case.

Here are statistics on the diary transcription. Note that they do not take into account the # mark for uncertain characters.

- List and counts of all graphemes in transcription.
- A Lotus .wk1 file of this information.
- A comparison of discrepancies between these counts and those given in Stamp's paper.

- LTCT Output - single-grapheme distribution, digram distribution, doubled characters, and other statistics.
- VFQ Output - other single-grapheme statistics and vowel identifications by the Sukhotin algorithm. Characters with a number in the first column are vowels according to the Sukhotin algorithm.
- HMM/Sukhotin comparison - a comparison of vowel identifications by the Sukhotin algorithm and the Hidden Markov Model in Stamp's paper.
- The most frequent trigrams. (Thanks to Jeff Haley.)
- The most frequent 5-grapheme strings. (Thanks to Bruce Grant.)
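For readers unfamiliar with the Sukhotin algorithm mentioned above, here is a minimal sketch of it. It is not the VFQ program itself. It assumes the transcription is available as a list of grapheme tokens and treats the whole text as one continuous stream, which may differ from what VFQ does; the function name `sukhotin_vowels` is made up for illustration.

```python
from collections import defaultdict


def sukhotin_vowels(tokens):
    """Return the graphemes classed as vowels, in the order they are picked.

    tokens: list of grapheme strings (assumed input format).
    """
    # Symmetric adjacency counts; pairs of identical graphemes are ignored,
    # which corresponds to a zero diagonal in the usual matrix description.
    adj = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            adj[a][b] += 1
            adj[b][a] += 1

    # Every grapheme starts out as a consonant with its row sum.
    sums = {g: sum(adj[g].values()) for g in set(tokens)}
    consonants = set(sums)
    vowels = []

    while consonants:
        # Pick the remaining consonant with the largest sum
        # (ties broken alphabetically so the result is repeatable).
        best = max(sorted(consonants), key=lambda g: sums[g])
        if sums[best] <= 0:
            break  # nothing left that looks like a vowel
        vowels.append(best)
        consonants.remove(best)
        # Remaining consonants lose twice their adjacency with the new vowel.
        for g in consonants:
            sums[g] -= 2 * adj[g].get(best, 0)

    return vowels
```

On the toy input `list("banana")` this picks only `a` as a vowel. The order of the returned list plays the same role as the number in the first column of the VFQ output.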

The Sukhotin/HMM comparison table shows that the Sukhotin vowel algorithm and the Hidden Markov Model used in Stamp's paper may in fact be saying the same thing. For standard and phonemic English, the HMM placed consonants in state 0 and vowels in state 1. If we assume, however, that the Sukhotin algorithm does just the opposite, the two are in complete agreement, provided one further assumption is made.

The assumption is that a model value ratio of 1.6 or more is sufficient to place a result definitely in one of the two states. Stamp stated that a ratio of 10 would be necessary, but these results call that into question. The samples of English were very large (around 6 million characters or phonemes) compared to the Hamptonese sample here (about 29,000 graphemes), which may well make the results less definite. Stamp did state that a sample of 10,000 characters is sufficient to get valid results for English, but under what requirements?
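To make the threshold idea concrete, here is a small sketch of how such a ratio test could be applied. It assumes that the "model value" for a grapheme is one number per HMM state (for example an emission probability); that reading, the function name `assign_state`, and the numbers in the usage note are illustrative assumptions, not values taken from Stamp's paper or from the comparison table.

```python
import math


def assign_state(p0, p1, threshold=1.6):
    """Return 0 or 1 if one state's value dominates by `threshold`, else None.

    p0, p1: the model values for the grapheme in state 0 and state 1
    (assumed to be non-negative numbers).
    """
    hi, lo = max(p0, p1), min(p0, p1)
    if hi == 0:
        return None  # grapheme has no weight in either state
    ratio = math.inf if lo == 0 else hi / lo
    if ratio < threshold:
        return None  # too close to call at this threshold
    return 0 if p0 > p1 else 1
```

With invented values, `assign_state(0.09, 0.05)` puts the grapheme in state 0 at the 1.6 threshold, while `assign_state(0.09, 0.05, threshold=10)` leaves it undecided, which is the difference between the two criteria discussed above.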

Another notable feature is the degree to which the /Y3 vv/ digram dominates the distributions. Further investigation shows that even the /Y3 vv Y3 vv/ string is rather dominant. Also, 90% of the occurrences of /Ki/ are in the digram /Ki Ki/. We shall therefore assume that these groups need to be treated as single graphemes, which leads to Case 1.
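A figure like the 90% for /Ki/ can be checked with a few lines of code. The sketch below assumes the transcription is a list of grapheme tokens; the function names and the sample in the usage note are made up for illustration and are not the tools used to produce the statistics on this page.

```python
from collections import Counter


def digram_counts(tokens):
    """Count every pair of adjacent graphemes."""
    return Counter(zip(tokens, tokens[1:]))


def share_in_doubled(tokens, g):
    """Fraction of occurrences of g that sit next to another g."""
    positions = [i for i, t in enumerate(tokens) if t == g]
    if not positions:
        return 0.0
    paired = 0
    for i in positions:
        before = i > 0 and tokens[i - 1] == g
        after = i + 1 < len(tokens) and tokens[i + 1] == g
        if before or after:
            paired += 1
    return paired / len(positions)
```

For the invented sample `"Ki Ki Y3 vv Ki Y3 vv".split()`, two of the three /Ki/ tokens are part of a /Ki Ki/ pair, and `digram_counts` reports the pair (Y3, vv) twice.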

- /Y3 vv/ --> /Y3v/
- /Y3 vv Y3 vv/ --> /Y6/
- /Ki Ki/ --> /Ki2/ {/KiKi/ is too ambiguous}

However, this excludes the same graphemes as in the Base Case LTCT and VFQ results, and also 13, qL, and HH.
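The Case 1 substitutions can be sketched as a simple left-to-right rewrite of the token list. This is only one plausible way to apply them: trying the longest group first is an assumption, as are the names `CASE1_RULES` and `merge_groups`.

```python
# Case 1 groupings, written as token tuples mapped to a single new grapheme.
CASE1_RULES = {
    ("Y3", "vv", "Y3", "vv"): "Y6",
    ("Y3", "vv"): "Y3v",
    ("Ki", "Ki"): "Ki2",
}


def merge_groups(tokens, rules=CASE1_RULES):
    """Replace each matching group of graphemes with its single-grapheme name."""
    # Try longer patterns first so /Y3 vv Y3 vv/ wins over /Y3 vv/.
    ordered = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(tokens):
        for pattern in ordered:
            if tuple(tokens[i:i + len(pattern)]) == pattern:
                out.append(rules[pattern])
                i += len(pattern)
                break
        else:
            # No rule matched here: keep the grapheme unchanged.
            out.append(tokens[i])
            i += 1
    return out
```

For example, `merge_groups("Y3 vv Y3 vv Y3 vv Ki Ki Ki J".split())` gives `['Y6', 'Y3v', 'Ki2', 'Ki', 'J']`. A further rule can be added to the dictionary in the same way.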

Here are the resulting statistics:

- LTCT Output.
- VFQ Output.
- 5-Grapheme Strings.

- /qL3 vv/ --> /qLv/

However, this excludes the same graphemes as in Case 1, and also J.

This is the result:

- 5-Grapheme Strings.
- 6-Grapheme Strings.
- 7-Grapheme Strings.
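Lists of frequent n-grapheme strings like those above can be produced with a short routine. This is a minimal sketch assuming the text is a list of grapheme tokens; `top_ngrams` is a hypothetical helper, not the program used by the contributors of these lists.

```python
from collections import Counter


def top_ngrams(tokens, n, k=10):
    """Return the k most frequent runs of n consecutive graphemes with counts."""
    # zip over n shifted copies of the list yields every run of length n.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(k)
```

Calling `top_ngrams(tokens, 5)`, `top_ngrams(tokens, 6)` and `top_ngrams(tokens, 7)` on the same token list would give the three kinds of list in turn.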

For later.
