A Statistical Discussion


[in progress]

I am not a statistician, and much of what will appear on this page sits at the limits of my knowledge and understanding. There will be dodgy data and egregious errors. On the other hand, why let a lack of expertise get in the way? Citations refer to the Bibliography on the homepage.

My introductory notes on the homepage include a brief comparative statistical overview.  Here I have put that information into a table alongside some additional data.

No. of songs represents the number of songs included in the corpus, which may not be all the songs in each artist’s canon. In the case of Bob Dylan, Khalifa notes that “it is impossible even to approach exhaustiveness: there are 401 songs in total, which doesn’t quite cover the whole body of the officially recorded songs. Still, for my purposes in this study, this selection will be taken as representative enough, if not close enough to completeness.” (p.162).

Size of text means the number of words in total (in linguistics these are called “tokens”).

Vocabulary size means the number of unique words that are used in the text (called “types”).
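To make the token/type distinction concrete, here is a minimal sketch in Python. The tokenizer (lowercasing, then keeping runs of letters, digits and apostrophes) is my own simplifying assumption, not the method used by any of the sources, and the sample sentence is invented.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

sample = "The cat sat on the mat and the dog sat on the cat"

tokens = tokenize(sample)   # every running word, repeats included
types = set(tokens)         # each distinct word counted once

print(len(tokens))  # size of text: 13
print(len(types))   # vocabulary size: 7
```

Different tokenizing rules (how to treat hyphens, apostrophes, or capitalised names) will give different counts for the same lyrics, which is one reason figures from different sources rarely match exactly.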

The Type/token ratio (TTR) is the result of dividing the number of different words in a text by the total number of words in the text. I’ve rounded to four decimal places. A TTR of 1 would mean that no words were repeated in the text. Although this ratio is often found in linguistics texts, it’s well known that it’s not really that great as an indicator of anything, because the longer the text, the more words get repeated and the lower the TTR is likely to be. There’s no harm in publishing the data so long as it’s understood to be largely useless. You have to start somewhere, don’t you?
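The text-length problem can be seen in a small sketch. The function name and the toy text are my own; repeating a text verbatim is an artificial extreme, but it shows the direction of the effect: the vocabulary stays the same while the word count doubles.

```python
def type_token_ratio(tokens):
    """Number of distinct words divided by the total number of words."""
    if not tokens:
        raise ValueError("cannot compute a TTR for an empty text")
    return len(set(tokens)) / len(tokens)

short_text = "the cat sat on the mat".split()   # 6 tokens, 5 types
long_text = short_text * 2                      # 12 tokens, still 5 types

ttr_short = round(type_token_ratio(short_text), 4)
ttr_long = round(type_token_ratio(long_text), 4)

print(ttr_short)  # 0.8333
print(ttr_long)   # 0.4167
```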

Hapax. is an abbreviation of the phrase hapax legomena, a term in linguistics for words that appear only once in a given text or corpus. I give the total number and percentage of the total text size. As I stated on the homepage, “according to Leech, Rayson and Wilson’s Word Frequencies in Written and Spoken English (2001), based on the British National Corpus, 52.44% of word forms occur only once in the BNC.”
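Counting hapax legomena is straightforward once the words have been tallied. The sketch below uses an invented sentence and my own variable names. Because a hapax percentage can be taken against either the total number of words or the number of distinct word forms (the BNC quotation is phrased in terms of word forms), it reports both so the denominator is explicit.

```python
from collections import Counter

tokens = "the cat sat on the mat and the dog sat on the cat".split()

counts = Counter(tokens)  # word -> number of occurrences
hapaxes = sorted(word for word, n in counts.items() if n == 1)

# Percentage relative to all running words (tokens)
pct_of_tokens = round(100 * len(hapaxes) / len(tokens), 2)
# Percentage relative to distinct word forms (types)
pct_of_types = round(100 * len(hapaxes) / len(counts), 2)

print(hapaxes)        # ['and', 'dog', 'mat']
print(pct_of_tokens)  # 23.08
print(pct_of_types)   # 42.86
```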

Guiraud’s Index is a measure of “lexical richness”, calculated by dividing the vocabulary size by the square root of the total text size. It’s an attempt to adjust TTR for differences in text size, and is widely used, but it’s apparently not considered particularly successful either.
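As a sketch of the calculation just described, assuming the vocabulary size and text size have already been counted (the function name and the example numbers are mine, chosen only to make the arithmetic easy to follow):

```python
import math

def guiraud_index(vocabulary_size, text_size):
    """Vocabulary size (types) divided by the square root of text size (tokens)."""
    if text_size <= 0:
        raise ValueError("text size must be positive")
    return vocabulary_size / math.sqrt(text_size)

# A hypothetical text of 100 words using 50 different words
print(guiraud_index(50, 100))  # 5.0
```

The same hypothetical text has a TTR of 0.5. Dividing by the square root rather than by the raw word count means the index shrinks more slowly than TTR as a text grows.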




Source:          | The Flickering Lexicon | Campbell & Murphy (1980) | Khalifa (2007)
No. of songs     |                        |                          |
Size of text     |                        |                          |
Vocabulary size  |                        |                          |
Type/token ratio |                        |                          |
Hapax. # (%)     | 4557 (47.4%)           | 935 (39.85%)             |
Guiraud’s Index  |                        |                          |

According to Khalifa, Shakespeare’s vocabulary was 25000-30000 words; James Joyce’s Dubliners contains 67000 words and has a vocabulary size of 7600 (type/token ratio = 0.1134). Note that Daniel Schmidtke’s figures for Bob Dylan are a bit different to Khalifa’s. However, I prefer Khalifa’s data over Schmidtke’s.
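As a quick check on the Dubliners figures, the sketch below applies the type/token definition given earlier on this page to the 67000 words and 7600 distinct words quoted from Khalifa. Dividing types by tokens gives roughly 0.1134. Dividing the other way round, tokens by types, gives roughly 8.82, which is the average number of times each distinct word is used rather than a type/token ratio. The variable names are mine.

```python
dubliners_tokens = 67000   # size of text, as quoted from Khalifa
dubliners_types = 7600     # vocabulary size, as quoted from Khalifa

# Type/token ratio: distinct words divided by total words
ttr = round(dubliners_types / dubliners_tokens, 4)

# The inverse: average number of uses per distinct word
uses_per_type = round(dubliners_tokens / dubliners_types, 4)

print(ttr)            # 0.1134
print(uses_per_type)  # 8.8158
```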



<to be continued>


Last update: 9 November 2014