It’s a fairly straightforward question to ask, one to which most literary scholars could provide a halfway decent answer based on their own reading: Ernest Hemingway, Samuel Beckett and Gertrude Stein are more likely to use short words, James Joyce, Marcel Proust and Virginia Woolf longer ones, with the rest falling somewhere between the two extremes.
Most Natural Language Processing textbooks or introductions to quantitative literary analysis demonstrate how the frequencies of the most common words in a corpus fall off in rough inverse proportion to their rank (Zipf’s law), i.e. the most frequently occurring term will appear about twice as often as the second, three times as often as the third, and so on. I was curious to see whether another process was at work for word lengths, and whether we can see a similar decline at work in modernist novels, or whether more ‘experimental’ authors visibly buck the trend. With some fairly elementary analysis in NLTK, and the resulting data frames exported into R, I generated a visualisation which looked nothing like this one.*
In narrowing down the number of authors I was going to plot, I inclined towards those I thought would be more variegated, getting rid of the ‘strong centre’ of modernist writing: authors not quite as prosodically charged as Marcel Proust, but not as brutalist as Stein either. I also put in a couple of contemporary writers for comparison, such as Will Self and Eimear McBride.
As we can see, after the rather scattered percentages for one-letter words, with McBride and Hemingway on top at around 25% and Stein a massive outlier at 11%, things become increasingly harmonious, and the longer the words get, the more the lines converge.
Self and Hemingway dip rather egregiously with regard to their use of two-letter words (which is almost definitely because of a mutual disregard for a particular word, I’m almost sure of it), but it is Stein who sharply increases her usage of two- and three-letter words. As my previous analyses have found, Stein is an absolute outlier in every analysis.
By the time the words are ten letters long, true to form, it’s Self whose writing is the only one above 1%.