Introduction and Methodology
(Skip to the results if you want to miss the boring parts, or look here for a more granular, in-depth account, including the code itself. If you code, yeah, I’m so sorry, I’ll make it more elegant soon.)
This post documents a statistical analysis carried out on a corpus of 500 novels. 250 of these texts are generally categorised as ‘realist’ and will be used as a benchmark against which we might define modernist literary style, a mode of writing which arose in the early twentieth century (though it should be noted that this chronology is increasingly subject to revision due to the work of new modernist scholars).
The first text in the realist corpus, chronologically speaking, is Jane Austen’s Lady Susan, which was written in 1794. The final one is Thomas Hardy’s Jude the Obscure, which was published in 1895. This corpus contains the complete prose works (a phrase here encompassing novels, novellas and short story collections) of fifteen writers: Jane Austen, Emily, Anne and Charlotte Brontë, Stephen Crane, Honoré de Balzac, Charles Dickens, Fyodor Dostoevsky, George Eliot, Gustave Flaubert, Elizabeth Gaskell, Thomas Hardy, William Makepeace Thackeray, Leo Tolstoy and Émile Zola.
The corpus of 250 modernist novels begins in 1869, with Henry James’ first bloc of short stories, and continues all the way to Samuel Beckett’s 1988 novella ‘Stirrings Still’, so there is some overlap between the two corpora’s start and end points. The modernist corpus consists of the complete works of nineteen writers: Djuna Barnes, Samuel Beckett, Jorge Luis Borges, Elizabeth Bowen, Joseph Conrad, William Faulkner, F. Scott Fitzgerald, Ford Madox Ford, Ernest Hemingway, Henry James, James Joyce, Franz Kafka, D.H. Lawrence, Katherine Mansfield, Flann O’Brien, Marcel Proust, Gertrude Stein, Edith Wharton and Virginia Woolf.
This disproportion between the two corpora, with fifteen realists versus nineteen modernists, may seem disconcerting at first, but what the statistical analyses require is that the number of observations be equal, rather than the number of novelists. The realist authors wrote more novels each, on average, than the modernist authors, which made it impossible to retain the same number of authors on each end of the generic spectrum.
One other aspect to consider is the international dimension. The realist corpus includes ten novelists who wrote in English, but there are also two Russian and three French realists, two of whom, Zola and the aforementioned Balzac, were far more prolific than any other writer in either corpus. Zola and Balzac composed 86 and 34 novels, short story collections or novellas respectively. The consequence is that well over half of the realist corpus is in translation, in comparison to just under 10% of the modernist corpus. I intend to address this at a later stage of my research. There has been some work published on the issues surrounding the quantification of literature in translation and across languages, but I do not yet possess a sufficient breadth of knowledge in this field to comment intelligently on the matter. I do think it is important to have French and Russian writers included in the realist corpus, on the basis that many of them, be they Tolstoy, Flaubert or Balzac, exerted a significant influence on their modernist successors.
Whether or not these are ‘the best’ or most accurate translations is somewhat beside the point. From the reading I have done around the issue of literary translation, it is in the nature of translations to change over time, as texts are received and re-constituted in different eras for different communities of readers (this discussion between Will Self and Kafka’s translators is particularly illuminating in this context; please do not be put off by Self, he gives the translators plenty of space to discuss the process, and you really should watch it). The germane point here is that the translations being analysed could not be considered the most contemporary. There might be an argument for retaining these older translations on the basis that they are more likely to be the versions which were circulating in the early twentieth century, and therefore the ones modernist authors were more likely to have read. Making this claim would require more evidence, though, such as what languages each author read novels in and what their reading habits were more generally.
So, to turn to the analysis. My research is directed towards the quantitative analysis of grammar. The rationale is that by examining the varying quantities of particular categories of words, such as verbs, adjectives or prepositions, we could develop an understanding of how literary fiction changes from the beginning of the nineteenth century until the end of the twentieth and, more specifically, how literary modernism departs from, or perhaps remains contiguous with, the previous generation of novel writing. The tagging was carried out using a part-of-speech (POS) tagger from the Natural Language Toolkit (NLTK) in Python.
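The counting step behind the figures that follow can be sketched roughly as below. This is a minimal illustration rather than the code actually used for the analysis. It assumes each text has already been turned into (word, tag) pairs with Penn Treebank tags, which is the shape of output that NLTK’s pos_tag function returns. The function names tag_percentages and mean_sentence_length are hypothetical, and the toy sentences are tagged by hand so that the example runs without NLTK installed.

```python
from collections import Counter


def tag_percentages(tagged_tokens):
    """Return each POS tag's share of all tokens, as a percentage."""
    counts = Counter(tag for _word, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 100 * count / total for tag, count in counts.items()}


def mean_sentence_length(sentences):
    """Return the mean number of tokens per sentence (0.0 if there are none)."""
    if not sentences:
        return 0.0
    return sum(len(sentence) for sentence in sentences) / len(sentences)


# Hand-tagged toy data standing in for tagger output (Penn Treebank tags):
# PRP = personal pronoun, VBD = past-tense verb, NN = noun, RB = adverb.
toy_sentences = [
    [("She", "PRP"), ("walked", "VBD"), ("home", "NN"), ("slowly", "RB")],
    [("He", "PRP"), ("slept", "VBD")],
]
all_tokens = [pair for sentence in toy_sentences for pair in sentence]

percentages = tag_percentages(all_tokens)
average_length = mean_sentence_length(toy_sentences)
```

Whether punctuation tokens count towards the totals is a choice this sketch leaves open; the toy data simply contains none.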
From realism to modernism:
- Average sentence length decreases by 4 words, from an average of 22 words to 18 words per sentence.
- Personal pronouns (I, you, he, she, it, we, they, me, him, her, us and them) increase by 1 percentage point, from 5% to 6%. Interrogative pronouns (such as who and what) decrease by 0.01 percentage points, from 0.03% to 0.02%.
- Verbs in the past tense increase by 1 percentage point, from 6% to 7%.
- Adverbs increase by 0.5 percentage points, from 4.5% to 5%.
- Prepositions (such as after, in, to, on and with) decrease by 0.4 percentage points, from 10.9% to 10.5%.
- Wh-determiners (words beginning with wh, such as ‘which’ or ‘what’, acting to modify the noun phrase) decrease by 0.2 percentage points, from 0.6% to 0.4%.
- Particles (parts of speech with a grammatical function but no meaning of their own, such as ‘up’ in the phrase ‘I tidied up the room’) increase by 0.1 percentage points, from 0.4% to 0.5%.
- Non-third-person singular present verbs (present-tense verbs in the first or second person, or the third-person plural) decrease by 0.1 percentage points, from 1.6% to 1.5%.
- Existentials (words such as ‘there’ which indicate that something exists) increase by 0.04 percentage points, from 0.17% to 0.21%.
- Superlative adjectives (adjectives such as ‘best’, ‘biggest’, ‘worst’) decrease by 0.01 percentage points, from 0.14% to 0.13%.
It will not have escaped your attention that a lot of these percentages are quite small. The extent to which any given text is made up of these hyper-specific categories is pretty minimal in the first place, which is why many of these quantities seem so laughably tiny. Rest assured that they are statistically significant. This does not mean that they are important, which would require a greater burden of proof, more analyses and more exploration, only that they are noteworthy considering the quantities involved.
One boxplot which might be of interest is the one below, which shows the ‘spread’ of the data for average sentence length between realism and modernism.
What we see on the left is the variation of the sentence length data (the term ‘variation’ here meaning the general ‘dispersedness’ of the data) for realism, which goes from 10 to roughly 35 words per sentence, with an outlier or two on either end. If we consider modernism, we have everything from zero (Samuel Beckett’s novel How It Is, which has no full stops in it) up to the mid-forties, with far more outliers on the higher end. Higher outliers are data points with values more than 1.5 times the interquartile range above the third quartile; lower outliers, of which there are three, are more than 1.5 times the interquartile range below the first quartile. For reference, the modernist outliers for sentence length are:
- William Faulkner’s Absalom, Absalom! (46.4) and Intruder in the Dust (42.3).
- Marcel Proust’s Swann’s Way (42.9), In a Budding Grove (40.2), Time Regained (38), The Prisoner (37.2), The Captive (35.7), The Guermantes Way (34.1) and Sodom and Gomorrah (30.9).
- Samuel Beckett’s Texts for Nothing and The Unnamable have 40.5 and 32.9 words per sentence respectively.
- Gertrude Stein’s The Making of Americans and Everybody’s Autobiography have 33.9 and 33.5 respectively.
- Henry James’ The Ivory Tower scores 31.8 and Ford Madox Ford’s The Young Lovell scores 29.
- The three lower outlier values for sentence length are all by Beckett: the aforementioned How It Is, Worstward Ho (4.9) and Ill Seen Ill Said (7).
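The outlier rule used for these boxplots (anything more than 1.5 times the interquartile range beyond the first or third quartile) can be sketched as follows. This is an illustration under stated assumptions, not the code behind the plots. The function names are hypothetical, the toy values are invented rather than taken from the corpus, and the quartiles come from Python’s statistics.quantiles with the ‘inclusive’ method, which may differ slightly from the convention a given plotting library uses.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) using the k * IQR rule."""
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, k=1.5):
    """Return (low_outliers, high_outliers) lying beyond the fences."""
    lower, upper = tukey_fences(values, k)
    low = [v for v in values if v < lower]
    high = [v for v in values if v > upper]
    return low, high


# Invented toy sentence-length averages, not values from the corpus.
toy_lengths = [10, 12, 14, 16, 18, 20, 22, 24, 46]
lower_fence, upper_fence = tukey_fences(toy_lengths)
low_outliers, high_outliers = split_outliers(toy_lengths)
```

With these toy values the quartiles are 14 and 22, so the fences sit at 2 and 34, and only the value 46 counts as a high outlier.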
It can be tempting, I think, when we see these sorts of names surface so prominently, in conjunction with a visual confirmation of the existence of an avant-garde, to think that modernism in its purest form was a kind of relentless maximalism, an uncompromising movement towards longer sentences and more pronouns, and that all other manifestations of it are inadequate or insufficient in some way. This is a boring and masculinist overview of the genre which takes, I think, too many of the claims made by its most dogmatic adherents at face value, and it’s not a modernism I’m particularly interested in defending or instantiating. There can also, of course, be a regressive or rearguard aspect to modernism, which is perceptible in the following boxplot, which displays the distribution of past tense verbs.
As was pointed out above, modernism displays an increase in past tense verbs overall, but here we see a large number of outlier values moving against the overall trend. These novels are:
- James Joyce’s Ulysses (4.3%) and Finnegans Wake (2.7%)
- William Faulkner’s As I Lay Dying (4.2%) and Requiem for a Nun (3.6%)
- Samuel Beckett’s Malone Dies (3.9%), Fizzles (2.5%), Company (2%), Texts for Nothing (1.8%), The Unnamable (1.7%), Worstward Ho (1.6%), Ill Seen Ill Said (1.4%) and a corpus of his miscellaneous and unpublished short fiction (2.2%).
- Joseph Conrad and Ford Madox Ford’s collaborative novel The Nature of a Crime (2.6%)
- Virginia Woolf’s The Waves (2.4%)
- Gertrude Stein’s Tender Buttons (1.7%)
The one higher modernist outlier is Virginia Woolf’s 1937 novel The Years (10%), and the one lower realist outlier is Balzac’s 1841 novel Letters of Two Brides (2.7%).
In this way we can see that modernism is not just a unidirectional commitment to a narrow sequence of stylistic changes. Instead, it’s a contradictory movement in which a number of different stylistic markers jostle against and subvert one another. In this particular instance, for example, we can perceive the authors generally understood to be among the most uncompromising (Joyce, Beckett, Stein, Woolf and Faulkner) resisting the overall trend.
From the two boxplots I’ve generated so far, you might have noticed that modernism tends to generate a greater number of outliers. I can confirm that this trend, a greater degree of grammatical heterogeneity in modernist novel-writing than in realist novel-writing, persists across the other categories of grammar, which you can validate by looking at the complete analysis here.
This struck me as an important development, so I quantified the extent of each data point’s outlier-ness and then grouped these values by author. The values were then divided by the number of outlier data points, because some of these novelists have only a small number of novels in the corpus compared with others; Austen’s complete works would be totally outnumbered by Balzac’s, for instance. The results appear below:
Please do note the values on the y-axis; Jane Austen is barely above zero because the only outlier text she wrote is Mansfield Park, which marks itself out by its disproportionate use of adjectives. I thought it better not to exclude her from the plot, though, because I didn’t want it to turn into even more of a boys’ club than it might otherwise be. It would be useful, and exciting I think, to conceive of this plot as an indication of early breaches with conventional form, perhaps some nineteenth-century anticipations of modernism. Reading Dostoevsky, Zola and Balzac in this manner would be consistent with changes taking place in the study of modernism now, but reading Thackeray and Eliot in these terms might be a more surprising development, and I’d be interested to read these texts in light of what we’re seeing here.
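The per-author deviation score behind these plots could be computed along the following lines. This is only a sketch, and it rests on assumptions: the post does not say exactly how ‘outlier-ness’ was measured, so here it is taken to be the distance by which a value falls beyond the nearer boxplot fence, summed per author and divided by that author’s number of outlier data points. The function name, the fence values and the records are all hypothetical.

```python
from collections import defaultdict


def author_outlier_scores(records, lower_fence, upper_fence):
    """Average distance beyond the fences, per author, over outlier points only.

    records is an iterable of (author, value) pairs. Authors with no
    outlier values are left out of the result.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for author, value in records:
        if value > upper_fence:
            distance = value - upper_fence
        elif value < lower_fence:
            distance = lower_fence - value
        else:
            continue  # not an outlier, contributes nothing
        totals[author] += distance
        counts[author] += 1
    return {author: totals[author] / counts[author] for author in totals}


# Invented toy records, with assumed fences of 2.0 and 34.0.
toy_records = [("A", 40.0), ("A", 36.0), ("B", 20.0), ("C", 1.0)]
scores = author_outlier_scores(toy_records, lower_fence=2.0, upper_fence=34.0)
```

In the toy data, author A has two outliers lying 6 and 2 beyond the upper fence, for a score of 4; author C has one lying 1 below the lower fence; author B has none and so does not appear.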
The modernism plot for deviation appears below:
From this plot we can see that the most avant-gardist prose writers, considered from the perspective of their grammar, appear to be Beckett, Stein, Woolf, Conrad and Joyce. Of course, this is nowhere near a definitive answer as to what modernist style is, or who its most innovative practitioners were; these measurements are atomistic and quantify individual words. But style is not just words in isolation. Style is agglomerations of words, spaces between words, and the clandestine networks and relations that the phrases these words add up to compose in the mind of the reader. If these digital methodologies are to have any chance of illustrating this shift (an inadequate term in the first instance, since what concerns us here is an accumulation of changes distributed over a broad corpus rather than a sudden or transformational one), it is in these cumulative terms that style must be quantified, in order to avoid drifting into the reductive and schematic scientism that numerical analyses of this kind are frequently accused of perpetuating.