Ernest Hemingway’s novel Torrents of Spring is not very good. An author’s note at its end informs us that it was written in ten days, and if this account of its composition is true, it very much shows. The reason I am choosing to inflict this reading experience on myself is my PhD research; my stylometric analysis of nineteenth and twentieth century literature, which involves identifying words which are particular to each author, informs me that Torrents of Spring marks a departure from Hemingway’s usual range of expression, away from words like ‘hell’, ‘bottle’, ‘drink’, ‘hit’ and ‘you’re’, to words like ‘wife’, ‘woman’, ‘happiest’, ‘agreed’, and ‘herself’, words Hemingway does not use anywhere else in his oeuvre. It’s worth pointing out that Hemingway wrote Torrents of Spring as a satire of Sherwood Anderson’s novel Dark Laughter. How successful it is in this regard I don’t know; what I am more interested in tracing is the opportunity Hemingway takes to launch a broadside against the modernist project at large.
There’s not a page that goes by that Hemingway does not satirise the prose styles of either Gertrude Stein (‘Yogi Johnson walking down the silent street with his arm around the little Indian’s shoulder. The big Indian walking along beside them. The cold night. The shuttered houses of the town. The little Indian, who has lost his artificial arm. The big Indian, who was also in the war. Yogi Johnson, who was in the war too’), the more folksy thoughts of Leopold Bloom (‘What is that old writing fellow Shakespear says?’) or the weightier thematising of D.H. Lawrence (‘In some ways the pump-factory had hardened him. His speech had become more clipped. More like these hardy Northern workers’). More important than these individual examples, however, is the broader alienation or discomfort integral to modern life in the industrialised Anglosphere after the First World War, summed up in the persistent refrain: ‘What was it all about? Where was it taking her?’
This is a familiar story underpinning literary modernism’s emergence, and we know well the formal strategies which emerge as a means of containing the modern sensibility, be it fragmentation, referentiality, or the drawing on literary antecedents as a guarantor of one’s own fundamental seriousness. None of these emerge unscathed either. The waitress at the diner to whom one of the main characters becomes engaged is from the Lake District (‘Wordsworth’s country’ comes the helpful gloss), for instance.
The culmination of all this comes in Hemingway’s solicitous addresses to the reader, which are interleaved throughout the text: about his luncheons with John Dos Passos, how F. Scott Fitzgerald has just been by, how difficult it was to research the history of Native American tribes in the last chapter, and how, if the reader has a manuscript themselves, they should drop it by one of the cafés, etc. Subtlety is obviously not what Hemingway’s about here, but it’s an interesting observation on the tension between what the modernist project said about itself and how Hemingway regarded it in practice. Rather than being founded on autonomy and transcendence, it inculcates a cult of the author and whatever mastery they exhibit over their materials. The insular gossip culture of Paris travels as far as Petoskey, Michigan, with factory workers and waitresses trading anecdotes about Henry James’ last words and Ford Madox Ford’s encounters with high society. It is this sublimated, parasocial aspect to the modern that is most noteworthy in Torrents of Spring, and it certainly appears to be the most enduring, based on how the vast majority of modernist authors are now marketed (‘just think of it, H.G. Wells talking about you right in our home. Anyway, H.G. Wells’). The final author’s note, enjoining the reader to tell their friends about the book if they enjoyed it because of how hard it is to shift units these days, makes clear in precise terms what these poses, aesthetic though they may be, are really all about.
(Skip to results if you want to miss the boring parts, or look here for a more granular, in depth account, including the code itself. If you code, yeah, I’m so sorry, I’ll make it more elegant soon)
This post will document a statistical analysis which was carried out on a corpus of 500 novels. 250 of these texts are generally categorised as ‘realist’ and will be used as a benchmark against which we might define modernist literary style, a mode of writing which arose in the early twentieth century (though it should be noted that this chronology is increasingly subject to revision due to the work of new modernist scholars).
The first novel in the realist corpus, chronologically speaking, is Jane Austen’s Lady Susan, which was written in 1794. The final one is Thomas Hardy’s Jude the Obscure, which was published in 1895. This corpus contains the complete prose works, a phrase here encompassing novels, novellas and short story collections, of fifteen writers: Jane Austen, Emily, Anne and Charlotte Brontë, Stephen Crane, Honoré de Balzac, Charles Dickens, Fyodor Dostoevsky, George Eliot, Gustave Flaubert, Elizabeth Gaskell, Thomas Hardy, William Makepeace Thackeray, Leo Tolstoy and Émile Zola.
The corpus of 250 modernist novels begins in the year 1869, with Henry James’ first bloc of short stories, and continues all the way to Samuel Beckett’s 1988 novella ‘Stirrings Still’, so there is some overlap between these two corpora’s starting and end points. This modernist corpus consists of the complete works of nineteen writers: Djuna Barnes, Samuel Beckett, Jorge Luis Borges, Elizabeth Bowen, Joseph Conrad, William Faulkner, F. Scott Fitzgerald, Ford Madox Ford, Ernest Hemingway, Henry James, James Joyce, Franz Kafka, D.H. Lawrence, Katherine Mansfield, Flann O’Brien, Marcel Proust, Gertrude Stein, Edith Wharton and Virginia Woolf.
This disproportion between the two corpora, with fifteen realists versus nineteen modernists, may seem disconcerting at first, but what is required in order for the statistical analyses to function is for the number of observations to be equal, rather than the number of novelists. Unfortunately, realist authors wrote more novels than modernist authors, and this compromised our ability to retain the same number of authors on each end of the generic spectrum.
One other aspect to consider is the international dimension. The realist corpus includes ten novelists who wrote in English, but there are also two Russian and three French realists, two of whom, Zola and the aforementioned Balzac, were far more prolific than any other writer in either corpus. Zola and Balzac composed 86 and 34 novels, short story collections or novellas respectively. This has the consequence that well over half of the realist corpus is in translation from another language, in comparison to just under 10% of the modernist corpus. I intend to address this at a later stage in my research. There has been some work published on the issues surrounding the quantification of literature in translation and across languages, but I do not yet possess a sufficient breadth of knowledge in this field to comment intelligently on the matter. I do think it is important to have French and Russian writers included in the realist corpus on the basis that many of them, be they Tolstoy, Flaubert or Balzac, exerted a significant influence on their modernist successors.
Whether or not these are ‘the best’ or most accurate translations is sort of beside the point. From the reading I have done around the issue of literary translation, their being subject to change over time is in the nature of how text is received and re-constituted in different eras for different communities of readers (this discussion between Will Self and Kafka’s translators is particularly illuminating in this context; please do not be put off by Self, he gives the translators so much space to discuss the process, you really should watch it). The germane point here is that the translations being analysed in this instance could not be considered to be the most contemporary. There might be an argument for retaining these older translations on the basis that they are more likely to be the versions of the text which would have been circulating in the early twentieth century, and therefore the translations modernist authors would have been more likely to have read, but making this claim would require a greater burden of proof, such as what languages each author read novels in and what their reading habits were more generally.
So, to turn to the analysis. My research is directed towards the quantitative analysis of grammar, the rationale being that we could, by examining varying quantities of particular categories of words, such as verbs, adjectives or prepositions, develop an understanding of how literary fiction changes from the beginning of the nineteenth century until the end of the twentieth, and, more specifically, how literary modernism departs from, or, perhaps remains contiguous with, this previous generation of novel writing. This was carried out using a POS tagger from the Natural Language Toolkit in Python.
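As a rough illustration of the kind of measurement described here, the sketch below computes what share of a text each part-of-speech tag accounts for. It assumes the text has already been tagged into (word, tag) pairs with Penn Treebank tags, which is the shape NLTK's `pos_tag` returns; the function name `tag_proportions` and the hand-tagged sample are hypothetical and are not taken from the actual analysis.

```python
from collections import Counter

def tag_proportions(tagged_tokens):
    """Return each POS tag's share of all tokens, as a percentage."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 100 * n / total for tag, n in counts.items()}

# Hand-tagged sample standing in for real tagger output (hypothetical data)
sample = [("She", "PRP"), ("closed", "VBD"), ("the", "DT"), ("door", "NN")]
proportions = tag_proportions(sample)
```

Running this once per novel would give one percentage per tag per text, which is the sort of observation the comparisons in this post rest on.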
From realism to modernism:
Average sentence length decreases by 4 words, from an average of 22 words to 18 words per sentence.
Personal pronouns (I, you, he, she, it, we, they, me, him, her, us, and them) increase by 1% from 5% to 6%. Interrogative pronouns (who and where), meanwhile, decrease by 0.01% from 0.03% to 0.02%.
Verbs in the past tense increase by 1% from 6% to 7%.
Adverbs increase by 0.5% from 4.5% to 5%.
Prepositions (after, in, to, on, and with) decrease by 0.4% from 10.9% to 10.5%.
Wh Determiners (words beginning with wh, such as ‘where’ or ‘who’ acting to modify the noun phrase) decrease by 0.2% from 0.6% to 0.4%.
Particles (parts of speech with grammatical function with no meaning such as ‘up’ in the phrase ‘I tidied up the room’) increase by 0.1% from 0.4% to 0.5%.
Non third-person singular present verbs (verbs in first or second person) decrease by 0.1% from 1.6% to 1.5%.
Existentials (words such as ‘there’ which indicate that something exists) increase by 0.04%, from 0.17% to 0.21%.
Superlative adjectives (adjectives such as ‘best’, ‘biggest’, ‘worst’) decrease by 0.01% from 0.14% to 0.13%.
It will not have escaped your attention that a lot of these percentages are quite small. The extent to which any given text is made up of these hyper-specific categories is pretty minimal in the first place, which is why many of these quantities seem so laughably tiny. Rest assured that they are statistically significant. This does not mean that they are important, which would require a greater burden of proof, more analyses and more exploration, but it does mean that they are noteworthy considering the quantities involved.
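To make "statistically significant" a little more concrete: the post does not say which test was used, so the following is only a sketch of one common option, a permutation test on the difference between two group means. The function name, the default settings and the sample values are all hypothetical.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign the group labels at random and recompute the difference
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:size_a]) - statistics.mean(pooled[size_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero
    return (extreme + 1) / (n_permutations + 1)

# Made-up words-per-sentence values, one per novel
realist = [22, 23, 21, 24, 22]
modernist = [18, 17, 19, 18, 16]
p_value = permutation_p_value(realist, modernist)
```

A small p-value only says that a gap this large would rarely arise from shuffling the labels; as noted above, it says nothing about whether the gap matters.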
One boxplot which might be of interest is the one below, which shows the ‘spread’ of the data for average sentence length between realism and modernism.
What we see on the left is the variation of the sentence length data (the term ‘variation’ here meaning the general ‘dispersedness’ of the data) for realism, which goes from 10 to roughly 35 words per sentence with an outlier or two on either end. If we consider modernism, by contrast, we have everything from zero (Samuel Beckett’s novel How It Is, which has no full stops in it) up to forty-five, with far more outliers on the higher end. Higher outliers are data points with values more than 1.5 times the interquartile range above the third quartile; lower outliers, of which there are three, are more than 1.5 times the interquartile range below the first quartile. For one’s own general knowledge, the modernist outliers for sentence length are:
William Faulkner’s Absalom, Absalom! (46.4) and Intruder in the Dust (42.3).
Marcel Proust’s Swann’s Way (42.9), In a Budding Grove (40.2), Time Regained (38), The Prisoner (37.2), The Captive (35.7), The Guermantes Way (34.1) and Sodom and Gomorrah (30.9).
Samuel Beckett’s Texts for Nothing and The Unnamable have 40.5 and 32.9 words per sentence respectively
Gertrude Stein’s novels The Making of Americans and Everybody’s Autobiography have 33.9 and 33.5 respectively.
Henry James’ The Ivory Tower and Ford Madox Ford’s The Young Lovell score 31.8 and 29 respectively.
The three lower outlier values for sentence length all belong to texts written by Beckett: the aforementioned How It Is, Worstward Ho (4.9) and Ill Seen Ill Said (7).
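The boxplot outlier rule described above can be sketched as follows. This is a minimal version using only Python's standard library; the function name `tukey_outliers` and the sample values are hypothetical, and plotting packages compute quartiles in slightly different ways, so the fences here may not match a given boxplot exactly.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Split values into (low, high) outliers using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    low = [v for v in values if v < lower_fence]
    high = [v for v in values if v > upper_fence]
    return low, high

# Made-up words-per-sentence values, one per novel
lengths = [4.9, 15, 16, 17, 18, 18, 19, 20, 21, 22, 46.4]
low, high = tukey_outliers(lengths)
```

In this toy sample the quartiles are 16.5 and 20.5, so the fences sit at 10.5 and 26.5, and only the two extreme values fall outside them.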
It can be tempting, I think, when we see these sorts of names surface so prominently, in conjunction with a visual confirmation of the existence of an avant-garde, to think that modernism in its most pure form was a kind of relentless maximalism, an uncompromising movement towards longer sentences and more pronouns, and that all other manifestations of it are inadequate or insufficient in some way. This is a rather boring and masculinist overview of the genre, which takes, I think, too many of the claims made by its most dogmatic adherents at face value, and it’s not a modernism I’m particularly interested in defending or instantiating. There can also, of course, be a regressive or rearguard aspect to modernism, which is perceptible in the following boxplot, which displays the distribution of past tense verbs.
As was pointed out above, modernism displays an increase in past tense verbs overall, but here we see a large number of outlier values moving against the overall trend. These novels are:
James Joyce’s Ulysses (4.3%) and Finnegans Wake (2.7%)
William Faulkner’s As I Lay Dying (4.2%) and Requiem for a Nun (3.6%)
Samuel Beckett’s Malone Dies (3.9%), Fizzles (2.5%), Company (2%), Texts for Nothing (1.8%), The Unnamable (1.7%), Worstward Ho (1.6%), Ill Seen Ill Said (1.4%) and a corpus of his miscellaneous and unpublished short fiction (2.2%).
Joseph Conrad and Ford Madox Ford’s collaborative novel The Nature of a Crime (2.6%)
Virginia Woolf’s The Waves (2.4%)
Gertrude Stein’s Tender Buttons (1.7%)
The upper modernist outlier is Virginia Woolf’s 1937 novel The Years (10%) and the lower realist outlier is Balzac’s 1841 novel Letters of Two Brides (2.7%).
In this way we can see that modernism is not just a unidirectional commitment to a narrow sequence of stylistic changes. Instead, it’s a contradictory movement in which a number of different stylistic markers jostle against and subvert one another. In this particular instance, for example, we can perceive the authors most generally understood to be among the most uncompromising (Joyce, Beckett, Stein, Woolf and Faulkner) resisting the overall trend.
From the two boxplots I’ve generated so far, you might have noticed that modernism tends to generate a greater number of outliers, and I can confirm that this trend of a greater degree of grammatical heterogeneity manifesting itself in modernist novel-writing than in realist novel-writing persists across the other categories of grammar, which you can validate by looking at the complete analysis here.
This struck me as an important development, so I quantified the extent of each data point’s outlier-ness, and then grouped the results according to author. These values were then divided by the number of outlier data points, because some of these novelists have only a small number of novels in the corpus compared with others; Austen’s complete works would be totally outnumbered by Balzac’s, for instance. The results appear below:
Please do note the values on the y-axis; Jane Austen is barely above zero because the only outlier text she wrote is Mansfield Park, which marks itself out for its disproportionate use of adjectives. I thought it better not to exclude her from the plot though, because I didn’t want it to turn into even more of a boys’ club than it might otherwise be. It would be useful, and exciting I think, to conceive of this plot as an indication of early breaches with conventional form, perhaps some nineteenth century anticipations of modernism. Reading Dostoevsky, Zola and Balzac in this manner would be consistent with changes taking place in the study of modernism now, but reading Thackeray and Eliot in these terms might be a more surprising development, and I’d be interested to read these texts in light of what we’re seeing here.
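The per-author deviation score is not spelled out in detail here, so the following is only one plausible reading of it: measure how far each outlier sits beyond the boxplot fence, total those distances per author, and divide by that author's number of outlier texts. The function name, the fence values and the records are all hypothetical, and the sketch covers a single grammatical category.

```python
from collections import defaultdict

def mean_outlier_distance(records, lower_fence, upper_fence):
    """Average distance beyond the fences per author, over outlier texts only."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for author, value in records:
        if value > upper_fence:
            distance = value - upper_fence
        elif value < lower_fence:
            distance = lower_fence - value
        else:
            continue  # not an outlier, so it does not contribute
        totals[author] += distance
        counts[author] += 1
    return {author: totals[author] / counts[author] for author in totals}

# Made-up (author, words per sentence) pairs with assumed fences of 10 and 35
records = [("Beckett", 2.0), ("Beckett", 40.0), ("Austen", 20.0), ("Proust", 43.0)]
scores = mean_outlier_distance(records, lower_fence=10, upper_fence=35)
```

Dividing by the count of outlier texts rather than by all of an author's texts is what keeps a prolific writer from dominating the plot simply by volume.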
The modernism plot for deviation appears below:
From this plot we can see that the most avant-gardist prose writers, considered from the perspective of their grammar, appear to be Beckett, Stein, Woolf, Conrad and Joyce. Of course, this is nowhere near a definitive answer as to what modernist style is, or who its most innovative practitioners were; these measurements are atomistic and are quantifying individual words. But style is not just words in isolation, style is agglomerations of words, spaces between words, the clandestine networks and relations the phrases these words add up to compose in the mind of the reader, and, if these digital methodologies are to have any chance of illustrating this shift (an inadequate term in the first instance, since it is more an accumulation of changes distributed over a broad corpus than a sudden or transformational one that we are here concerned with) it is in these cumulative terms that style must be quantified, in order to avoid drifting into the reductive and schematic scientism that numerical analyses of this kind are frequently accused of perpetuating.
It’s a fairly straightforward question to ask, one which most literary scholars would be able to provide a halfway decent answer to based on their own readings. Ernest Hemingway, Samuel Beckett and Gertrude Stein are more likely to use short words, James Joyce, Marcel Proust and Virginia Woolf longer ones, with the rest falling somewhere between the two extremes.
Most Natural Language Processing textbooks or introductions to quantitative literary analysis demonstrate how the frequency of the most common words in a corpus falls off in rough inverse proportion to their rank (Zipf’s law), i.e. the most frequently occurring term will appear about twice as often as the second and three times as often as the third, and so on. I was curious to see whether another process was at work for word lengths, and whether we can see a similar decline at work in modernist novels, or whether more ‘experimental’ authors visibly buck the trend. With some fairly elementary analysis in NLTK, and with the data frames moved over into R, I generated a visualisation which looked nothing like this one.*
In narrowing down the number of authors I was going to plot, I inclined towards authors that I thought would be more variegated, getting rid of the ‘strong centre’ of modernist writing, which is not quite as prosodically charged as Marcel Proust, but not as brutalist as Stein either. I also put in a couple of contemporary writers for comparison, such as Will Self and Eimear McBride.
As we can see, after the rather disconnected percentages of corpora that use one letter words, with McBride and Hemingway on top at around 25%, and Stein a massive outlier at 11%, things become increasingly harmonious, and the longer the words get, the more the lines of the vectors coalesce.
Self and Hemingway dip rather egregiously with regard to their use of two-letter words (which is almost definitely because of a mutual disregard for a particular word, I’m almost sure of it), but it is Stein who exponentially increases her usage of two and three letter words. As my previous analyses have found, Stein is an absolute outlier in every analysis.
By the time the words are ten letters long, true to form, it’s Self whose writing is the only one above 1%.
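The word-length percentages discussed above could be computed along the following lines. This is a simplified sketch, not the code behind the plot: the function name and sample sentence are hypothetical, and the tokenisation (runs of unaccented letters only) is an assumption that splits contractions such as ‘don’t’ in two.

```python
import re
from collections import Counter

def word_length_percentages(text):
    """Return the percentage of words at each length, keyed by length."""
    # Assumption: a "word" is any run of ASCII letters
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return {}
    counts = Counter(len(word) for word in words)
    return {length: 100 * n / len(words) for length, n in sorted(counts.items())}

# Made-up sentence standing in for a whole novel
percentages = word_length_percentages("I saw a cat on the mat today")
```

Computing one such dictionary per author and plotting percentage against length gives one line per author, which is the shape of comparison described in this post.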
I have recently begun to experiment with Natural Language Processing to determine how particular words in modernist texts are correlated. I’m still getting my head around Python and NLTK, but so far I’m finding it much more user-friendly than similar packages in R.
Long-term I hope to represent these collocations in a high-dimensional vector space, so that I can graph them, but for the moment, I’m interested in noting the prevalence of the term ‘young man’, Self and Baume being the only authors that have female adjective-noun phrases, and the usage of titles which convey particular social hierarchies; Joyce, Woolf and Bowen’s collocations are almost exclusively composed of these, as are Stein’s, with the qualification that Stein’s appear shorn of their ‘Mr.’, ‘Miss.’ or ‘Doctor’.
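For readers who want a feel for how lists of this kind are produced, here is a deliberately simplified sketch in plain Python. It is not the NLTK routine used for these results: it ranks adjacent word pairs by raw frequency rather than by a statistical association measure, and the stopword list, function name and sample sentence are all hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list; a real run would use a much fuller one
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "he", "she", "at"}

def top_bigrams(tokens, n=20, min_count=2):
    """Return the most frequent adjacent word pairs that contain no stopwords."""
    lowered = [token.lower() for token in tokens]
    pairs = Counter(
        (first, second)
        for first, second in zip(lowered, lowered[1:])
        if first.isalpha() and second.isalpha()
        and first not in STOPWORDS and second not in STOPWORDS
    )
    return [" ".join(pair) for pair, count in pairs.most_common(n) if count >= min_count]

# Made-up sentence standing in for a tokenised corpus
tokens = "the young man looked at the old man and the young man smiled".split()
collocations = top_bigrams(tokens)
```

Pairs are formed before stopwords are filtered out, so two words only count as a collocation if they were genuinely adjacent in the text.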
Here are all the collocations in the modernist corpus:
young man; robert jordan; new york; gertrude stein; old man; could see; henry martin; every one; years ago; first time; long time; hugh monckton; great deal; come back; david hersland; good deal; every day; edward colman; came back; alfred hersland
Canonical modernist texts:
young man; robert jordan; gertrude stein; henry martin; new york; every one; old man; could see; years ago; long time; hugh monckton; first time; great deal; david hersland; come back; good deal; every day; edward colman; alfred hersland; mr. bettesworth
Contemporary texts:
fat controller; phar lap; von sasser; first time; per cent; could see; old man; one another; even though; years ago; new york; front door; young man; either side; someone else; dave rudman; last night; living room; steering wheel; every time
Djuna Barnes
frau mann; nora said; english girl; someone else; long ago; leaned forward; london bridge; come upon; could never; god knows; doctor said; sweet sake; first time; five francs; terrible thing; francis joseph; hôtel récamier; orange blossoms; bowed slightly; would say
Eimear McBride
kentish town; someone else; first time; last night; jesus christ; something else; years ago; five minutes; every day; hail mary; take care; next week; arms around; never mind; every single; little girl; little boy; two years; soon enough; come back
Elizabeth Bowen
mrs kerr; lady waters; mrs heccomb; major brutt; mme fisher; lady naylor; miss fisher; good deal; said mrs; first time; lady elfrida; one another; young man; colonel duperrier; aunt violet; last night; ann lee; one thing; sir robert; sir richard
Ernest Hemingway
robert jordan; old man; could see; colonel said; gran maestro; catherine said; jordan said; richard gordon; long time; pilar said; thou art; pablo said; nick said; bill said; girl said; captain willie; young man; automatic rifle; mr. frazer; david said
F. Scott Fitzgerald
new york; young man; years ago; first time; sally carrol; several times; fifth avenue; ten minutes; minutes later; richard caramel; thousand dollars; five minutes; young men; evening post; old man; next day; saturday evening; long time; last night; come back
Gertrude Stein
gertrude stein; every one; david hersland; alfred hersland; angry feeling; family living; independent dependent; jeff campbell; julia dehning; mrs. hersland; daily living; whole one; bottom nature; madeleine wyman; good deal; mary maxworthing; middle living; miss mathilda; mabel linker; every day
James Joyce
buck mulligan; said mr.; martin cunningham; aunt kate; says joe; mary jane; corny kelleher; ned lambert; mrs. kearney; stephen said; mr. henchy; ignatius gallaher; father conmee; nosey flynn; mr. kernan; myles crawford; cissy caffrey; ben dollard; mr. cunningham; miss douce
Marcel Proust
young man; faubourg saint-germain; long ago; caught sight; first time; every day; one day; great deal; des laumes; young men; could see; quite well; next day; one another; would never; nissim bernard; victor hugo; would say; louis xiv; long time
Samuel Beckett
said camier; said mercier; miss counihan; lord gall; miss carridge; mr. kelly; panting stops; said belacqua; mr. endon; said wylie; said neary; one day; otto olaf; dr. killiecrankie; come back; vast stretch; mrs gorman; push pull; something else; ground floor
Sara Baume
even though; tawny bay; living room; old man; passenger seat; bird walk; maggot nose; shut-up-and-locked room; stone fence; food bowl; lonely peephole; low chair; old woman; kennel keeper; rearview mirror; shih tzu; shore wall; safe space; every day; oneeye oneeye
Virginia Woolf
miss barrett; mrs. ramsay; mrs. hilbery; young man; st. john; could see; years ago; peter walsh; mrs. thornbury; miss allan; said mrs.; young men; mrs. swithin; human beings; wimpole street; mrs. flushing; mr. ramsay; mrs. manresa; sir william; door opened
Anne Enright
new york; per cent; eliza lynch; dear friend; years old; even though; first time; came back; years ago; long time; michael weiss; señor lópez; living room; every time; looked like; could see; one day; said constance; pat madigan; mrs hanratty
Will Self
fat controller; phar lap; von sasser; one another; old man; could see; first time; per cent; dave rudman; let alone; front door; young man; skip tracer; quantity theory; jane bowen; los angeles; young woman; either side; charing cross; long since
Flann O’Brien
father fahrt; good fairy; father cobble; said shanahan; mrs crotty; said furriskey; said lamont; mrs laverty; one thing; sergeant fottrell; said slug; old mathers; public house; far away; cardinal baldini; monsignor cahill; mrs furriskey; red swan; black box; said shorty
Ford Madox Ford
henry martin; hugh monckton; edward colman; privy seal; mr. bettesworth; mr. fleight; young man; mr. sorrell; sergius mihailovitch; young lovell; new york; jeanne becquerel; lady aldington; kerr howe; anne jeal; miss peabody; mr. pett; great deal; marie elizabeth; robert grimshaw
Jorge Luis Borges
ts’ui pên; buenos aires; pierre menard; eleventh volume; richard madden; nils runeberg; yiddische zeitung; stephen albert; hundred years; erik lönnrot; firing squad; henri bachelier; madame henri; orbis tertius; vincent moon; paint shop; seventeenth century; anglo-american cyclopaedia; fergus kilpatrick; years ago
Joseph Conrad
mrs. travers; mrs verloc; mrs. fyne; peter ivanovitch; doña rita; miss haldin; mrs. gould; assistant commissioner; charles gould; san tomé; chief inspector; years ago; captain whalley; could see; van wyk; old man; dr. monygham; gaspar ruiz; young man; mr. jones
D.H. Lawrence
young man; st. mawr; mr. may; mrs. witt; blue eyes; miss frost; could see; one another; mrs bolton; ‘all right; come back; said alvina; two men; of course; good deal; long time; mr. george; next day
William Faulkner
uncle buck; aleck sander; miss reba; years ago; dewey dell; mrs powers; could see; white man; four years; old man; ned said; division commander; general compson; miss habersham; new orleans; uncle buddy; let alone; one another; united states; old general