Statistics on the Transcription of James Hampton's Diary

    June 3, 2004.  Removed Case 1 and 2 statistics, since there seem to have been some errors with them.  Still investigating.

    May 26, 2004.  Added more Case 2 statistics.

    May 25, 2004.  Added Case 2.

    May 14, 2004.  Added comments on base case and added Case 1.

    May 12, 2004.  Opened statistics page.  Placed statistics for base case.

Base Case

    Here are statistics on the diary transcription, although these do not take into account the # mark for uncertain characters.

    Here are more statistics.  The single-character and digram files do not include * (unreadable characters), the # mark for uncertain characters, or the uncommon characters J2, 4, O1, 10, L, 15, Y1, qL1, n, qL0, P3, e, P4, v, Y5, K1 and Y0.
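
    As a rough illustration of how such counts can be produced, here is a minimal Python sketch.  It assumes the transcription is a sequence of whitespace-separated grapheme tokens, and it assumes that any n-gram containing an excluded token is simply dropped; the actual files may have been built differently.  The name ngram_counts and the short exclusion set are hypothetical.

```python
from collections import Counter

# Hypothetical exclusion set: the unreadable and uncertain marks plus
# two of the uncommon graphemes, for illustration only.
EXCLUDED = {"*", "#", "J2", "O1"}

def ngram_counts(tokens, n, excluded=EXCLUDED):
    """Count n-grams of adjacent tokens, skipping any that contain an excluded token."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if any(t in excluded for t in gram):
            continue  # assumption: a single excluded token invalidates the whole n-gram
        counts[gram] += 1
    return counts

# Made-up sample, not taken from the diary.
sample = "Y3 vv Y3 vv * Ki Ki J2 Y3 vv".split()
singles = ngram_counts(sample, 1)
digrams = ngram_counts(sample, 2)
```

    Calling singles.most_common() or digrams.most_common() then lists the graphemes or digrams from most to least frequent.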


    Study of the base case is still ongoing, but some things seem obvious.

    The Sukhotin/HMM comparison table suggests that the Sukhotin vowel algorithm and the Hidden Markov Model used in Stamp's paper may in fact be saying the same thing.  For standard and phonemic English, the HMM placed consonants in state 0 and vowels in state 1.  If we assume, however, that the Sukhotin algorithm does just the opposite (its vowels line up with state 0 and its consonants with state 1), the two are in complete agreement, provided one makes a further assumption.
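
    For reference, Sukhotin's algorithm is short enough to sketch.  The version below is the commonly described form, not necessarily the implementation behind the comparison table.  It takes a list of sequences (strings of letters, or lists of grapheme tokens), counts adjacencies between different symbols, and repeatedly promotes the symbol with the largest positive remaining sum to "vowel".  The alphabetical tie-break is an assumption added to make the result deterministic.

```python
from collections import defaultdict

def sukhotin(sequences):
    """Return the set of symbols that Sukhotin's algorithm classifies as vowels."""
    adj = defaultdict(lambda: defaultdict(int))
    symbols = set()
    for seq in sequences:
        symbols.update(seq)
        for a, b in zip(seq, seq[1:]):
            if a != b:  # the diagonal of the adjacency matrix is kept at zero
                adj[a][b] += 1
                adj[b][a] += 1

    # Every symbol starts as a consonant with its row sum.
    sums = {s: sum(adj[s].values()) for s in symbols}
    vowels = set()
    while True:
        remaining = {s: v for s, v in sums.items() if s not in vowels}
        if not remaining:
            break
        best = max(remaining, key=lambda s: (remaining[s], s))
        if remaining[best] <= 0:
            break  # no consonant has a positive sum left
        vowels.add(best)
        # Adjacencies to the new vowel now count against the other consonants.
        for s in symbols:
            if s not in vowels:
                sums[s] -= 2 * adj[s][best]
    return vowels
```

    On the toy input ["banana", "tomato"] this returns the set containing "a" and "o".  For the diary, each sequence would be a list of grapheme tokens rather than a string of letters.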

    The assumption is that a model value ratio of 1.6 or more is sufficient to place a result definitely in one of the two states.  Stamp stated that a ratio of 10 would be necessary, but these results call that into question.  The samples of English were very large (around 6 million characters or phonemes) compared to the Hamptonese sample here (about 29,000 graphemes), which may well make the results less definite.  Stamp did state that a sample of 10,000 characters is sufficient to get valid results for English, but it is not clear under what requirements.
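
    The threshold rule can be made concrete with a small sketch.  It assumes that the "model value ratio" is the ratio between a symbol's two per-state model values (larger over smaller); that reading, along with the names assign_state and min_ratio, is an assumption and is not taken from Stamp's paper.

```python
def assign_state(p0, p1, min_ratio=1.6):
    """Return 0 or 1 if one state's value dominates by at least min_ratio, else None."""
    hi, lo = max(p0, p1), min(p0, p1)
    if hi <= 0:
        return None  # no evidence for either state
    if lo > 0 and hi / lo < min_ratio:
        return None  # ratio too small to decide
    return 0 if p0 > p1 else 1

# Made-up values for illustration.
clear = assign_state(0.08, 0.01)                     # ratio 8, decided under 1.6
undecided = assign_state(0.02, 0.03)                 # ratio 1.5, left undecided
strict = assign_state(0.08, 0.01, min_ratio=10)      # ratio 8, undecided under 10
```

    With min_ratio=10, many symbols that are decided at 1.6 would be left unassigned, which is the trade-off discussed above.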

    The other obvious thing is the degree to which the /Y3 vv/ digram dominates the distributions.  Further investigation shows that even the /Y3 vv Y3 vv/ string is rather dominant.  Also, 90% of the occurrences of /Ki/ are in the digram /Ki Ki/.  We shall therefore assume that these groups need to be treated as single graphemes, which leads to Case 1.
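
    A figure like the 90% for /Ki/ can be checked with a few lines of Python.  This sketch assumes a list of grapheme tokens and counts an occurrence as "in the digram" when it has an identical neighbour on either side; the function name share_in_repeat is hypothetical, and the counting rule actually used may differ.

```python
def share_in_repeat(tokens, symbol):
    """Fraction of occurrences of symbol that sit next to another copy of symbol."""
    total = 0
    paired = 0
    for i, t in enumerate(tokens):
        if t != symbol:
            continue
        total += 1
        prev_same = i > 0 and tokens[i - 1] == symbol
        next_same = i + 1 < len(tokens) and tokens[i + 1] == symbol
        if prev_same or next_same:
            paired += 1
    return paired / total if total else 0.0

# Made-up sample: four of the five Ki tokens are part of a Ki Ki pair.
sample = "Ki Ki Y3 vv Ki Ki vv Ki".split()
share = share_in_repeat(sample, "Ki")
```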

Case 1

    The following transformations were applied to the transcription:
   /Y3 vv/        -->  /Y3v/
   /Y3 vv Y3 vv/  -->  /Y6/
   /Ki Ki/        -->  /Ki2/   {/KiKi/ is too ambiguous}
    However, this did not include the graphemes that were excluded from the Base Case LTCT and VFQ results, nor 13, qL, and HH.
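
    The transformations above can be sketched as a single left-to-right pass over the token list.  The order of the rules is an assumption: the longest pattern is tried first, since otherwise /Y3 vv/ would be merged before /Y3 vv Y3 vv/ could ever match.  The names RULES and merge_groups are hypothetical.

```python
# Longest pattern first, so that Y3 vv Y3 vv becomes Y6 and not Y3v Y3v.
RULES = [
    (("Y3", "vv", "Y3", "vv"), "Y6"),
    (("Y3", "vv"), "Y3v"),
    (("Ki", "Ki"), "Ki2"),
]

def merge_groups(tokens, rules=RULES):
    """Replace each matching token group with its single merged grapheme."""
    out = []
    i = 0
    while i < len(tokens):
        for pattern, merged in rules:
            n = len(pattern)
            if tuple(tokens[i:i + n]) == pattern:
                out.append(merged)
                i += n
                break
        else:
            # No rule matched at this position; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out

# Made-up sample, not taken from the diary.
merged = merge_groups("Y3 vv Y3 vv Ki Ki Y3 vv J".split())
```

    Here merged is the list Y6, Ki2, Y3v, J.  Further rules can be added to the list in the same form.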

    Here are the resulting statistics:


    It is obvious that /qL3 vv/ is also an important influence on the 5-gram statistics.  This leads immediately to Case 2.

Case 2

    This transformation is applied:
   /qL3 vv/  --> /qLv/
   However, this excludes the same graphemes as in Case 1, and also J.

    This is the result:


    Still considering these results.  Perhaps there is a calculation error.

    For later.