I began my doctorate with an investigation into a literary trend which was at that stage already beginning to wind down, in favour of the resurgence of a critical-theory-inflected magical realism, which I would probably argue has now achieved hegemonic status. In and around 2014, five or so Irish and British writers, as well as their critics, were using the word ‘modernism’ to talk about their more recent work, and I’m thinking here in particular of Will Self, Eimear McBride, Anne Enright and Sara Baume. I was interested in investigating whether or not this trend could be detected on a quantitative level, and which words were indicative of the more obvious points of comparison, twentieth-century modernism as compared to twenty-first-century modernism, as well as the more implicit co-ordinates, such as twentieth-century realism or twenty-first-century realism. For various reasons, primarily institutional, my area of study has changed quite significantly, but I feel I would be remiss if I did not in some respect answer the question I began with, now that I am actually equipped to do so from a logistical point of view. The following few paragraphs talk about the adopted method, so if you’re a stranger to some of this stuff or, like me a few years ago, you’re broadly ignorant of statistical and regression methods, feel free to skip to the results section.
The first problem which confronts us in a study such as this is the definition of a baseline of modernist style, against which we can locate our contemporary modernists. Once we’ve done that, we can identify the degree to which any given text deviates from this ‘norm’. The most established means of quantifying the literary style of any given text is to perform distance clustering on its normalised relative word frequencies, i.e., the percentage each word commands in the text’s overall length, converted into z-scores. Transforming numbers into z-scores involves altering them such that their mean is 0 and their standard deviation is 1, so that each number indicates how many standard deviations it lies from this mean. Performing distance clustering on numerical vectors which represent novels is called the ‘Delta’ method and I talk a bit more about it and how well it works here. Below is an image of a frequency table which gives some indication of how these frequencies look.
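To make the z-score step concrete, here is a minimal Python sketch. The analysis in this post was run in R, so the function names and the toy figures below are illustrative assumptions, not the code actually used. It also assumes that z-scores are taken word by word across the corpus, which is the usual convention for Delta.

```python
from statistics import mean, pstdev


def relative_frequencies(tokens, vocabulary):
    """Percentage of the text's length taken up by each word in `vocabulary`."""
    total = len(tokens)
    return [100.0 * tokens.count(word) / total for word in vocabulary]


def z_scores(values):
    """Rescale values so that their mean is 0 and their standard deviation is 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the column
    return [(value - mu) / sigma for value in values]


# Toy figures (assumed): relative frequency of one word in four hypothetical texts.
the_freqs = [3.65, 4.70, 4.10, 5.15]
the_z = z_scores(the_freqs)
```

Each entry of `the_z` then says how many standard deviations a given text sits above or below the corpus average for that one word.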
On the far left we see author, title and date of publication and in each of the cells we see the relative frequency for seven of the most frequent words in our corpus. As we would expect, these are words like ‘the’, ‘and’, ‘to’, etc. If we look at the figure in the top left, we see that the word ‘the’ has a relative frequency of 3.65 in Anne Bronte’s Agnes Grey, whereas it has a relative frequency of 4.7 in Louisa May Alcott’s A Modern Cinderella. As far as the word ‘the’ goes, then, A Modern Cinderella exists at a distance of 1.05 from Agnes Grey (4.7–3.65 = 1.05). Now imagine that this process happens for every word (5000) between every pair of novels in the corpus (1173), with the absolute differences summed and then divided by the total number of words we extracted (again, 5000). This is what is at the basis of Delta distance.
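The arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `delta` is mine, the three-word vectors are made up, and in practice the inputs would be z-scores for all 5000 words, not raw frequencies.

```python
def delta(scores_a, scores_b):
    """Mean absolute difference between two equal-length vectors of word scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both texts need a score for the same list of words")
    total = sum(abs(a - b) for a, b in zip(scores_a, scores_b))
    return total / len(scores_a)


# The single-word case from the text: 'the' in the two novels.
single_word_gap = abs(4.7 - 3.65)

# Three made-up word scores per text, purely for illustration.
text_a = [0.5, -1.0, 2.0]
text_b = [1.5, -0.5, 0.0]
distance = delta(text_a, text_b)  # (1.0 + 0.5 + 2.0) / 3
```

A text compared with itself has a distance of zero, and the more two texts diverge in their use of common words, the larger the figure becomes.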
We have a relatively even spread of nineteenth century fiction (568) versus twentieth century fiction (605). There’s also one eighteenth century text, written by Maria Edgeworth, which I labelled as nineteenth. At an early stage I anticipated trying to divide these two categories into modernist, anti-modernist and proto-modernist as opposed to classical realist versus continuity realism, but given the current state of the discourse, wherein what was revanchist Victorianism is now modernism etc., I decided not to, and to adopt time as a less contentious variable instead. Effectively, then, we are tracing the stylistic change from the nineteenth to the twentieth century. This is a slight adjustment to the goal posts in terms of the aim of this study and reflects the assumption that what we trace when we analyse the change from the nineteenth to the twentieth century will organically correspond to a modernist signal. As far as the actual contents of the corpus go, the contents in the image above are symptomatic. I’ve gone for standard bearers of nineteenth century and twentieth century literature; whoever you can name off the top of your head I probably have in there: Woolf, Dickens, Lewis, Barnes, Joyce, Conrad, Mansfield, Stein, Wells, Kipling etc. It is quite skewed towards canonical texts, but in my defence, it’s almost impossible to find digital copies of texts by non-canonical authors.
Since we are interested in the words which come into prominence from one century to the next, one potential method which was considered was the t-test, which is used to assess whether or not the difference between the means of two numerical vectors is significant. We could loop t-tests along our data, identifying whether from the nineteenth century to the twentieth century the words ‘the’, ‘we’, ‘of’, ‘days’ or ‘thought’ change in their relative frequencies. We would then identify the words which do manifest a significant change, whether this is an increase or a decrease. However, there are complicating factors here, not least that we don’t have an equal number of samples from each century, which complicates the most straightforward form of the test. If we are not interested in randomly sampling the twentieth century down to size, we would have to omit the surplus texts. Large numbers of t-tests also give us back large numbers of false positives, even with a false discovery rate correction applied to our results after the fact.
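For the curious, one common variant of the t-test that tolerates unequal group sizes is Welch's. It was not the route taken in this post, but a minimal Python sketch of its test statistic shows what a single per-word test would compute. The two samples are made-up numbers, and `welch_t` is a name chosen for illustration. Turning the statistic into a p-value needs the t distribution, which the standard library does not provide (SciPy's `scipy.stats.ttest_ind` with `equal_var=False` does the whole job).

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples of possibly unequal size and variance."""
    mean_a, mean_b = mean(sample_a), mean(sample_b)
    # Standard error built from each group's own sample variance and size.
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean_a - mean_b) / standard_error


# Made-up relative frequencies of one word in five and four texts respectively.
nineteenth = [1.0, 2.0, 3.0, 4.0, 5.0]
twentieth = [2.0, 4.0, 6.0, 8.0]
t_statistic = welch_t(nineteenth, twentieth)
```

Repeating this for 5000 words is exactly where the false-positive problem mentioned above comes from: each test carries its own small chance of a spurious hit.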
Regression, then, seemed to provide the best chance of a result, given that we are dealing with what is effectively an either/or problem: was this novel written in a style more indicative of century a or century b? Regression is a method for investigating the relationship which exists between one variable and another. We might, for example, wish to investigate the relationship which exists between the age and the height of fifty people. We plot our data, then we place a regression line through it. There is a very slight upward slope here, which would seem to indicate that there is a relationship between how old you are and how tall you are.
This is a stupid example of course, but it gives an indication of what regression is supposed to do, namely, investigate the relationship between two variables and fit a line or model which offers the most robust explanation. If you look at how the data points scatter along the regression line, we can see that it makes a decent stab at predicting how 30–45% of the data falls out. It’s really wide of the mark at predicting that there is a 35 year-old in our dataset who is 3 feet tall; there is quite a significant distance there between the predicted value (4′ 10) and the actual observed value. This distance is called a residual. When all the residuals are squared and then summed, the result is referred to as the sum of the squared residuals, and it is the aim of regression of this type to minimise this figure as much as possible by coming as close as we can to as many of the observed values as possible.
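A minimal Python sketch of this kind of regression, fitting a straight line by least squares and then adding up the squared residuals, might look as follows. The four age and height pairs are invented for illustration (heights in inches) and are not the fifty people plotted above. The function names are my own.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Square each gap between observed and predicted value, then add them up."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Invented data: ages in years, heights in inches.
ages = [20.0, 30.0, 40.0, 50.0]
heights = [64.0, 66.0, 65.0, 69.0]

slope, intercept = fit_line(ages, heights)
ssr = sum_squared_residuals(ages, heights, slope, intercept)
```

Any other line drawn through these four points, steeper or shallower, gives a larger `ssr` than the fitted one, which is the sense in which the regression line is the best available.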
However, before throwing our data into a linear regression, we need to ask ourselves if this really suits the problem. In the example above, both variables are continuous: age is any number between 18 and 60. Our dependent variable, by contrast, is categorical, i.e. it is either ‘nineteenth’ or ‘twentieth’ century. This is an either/or problem, and the answer the model gives is a probability between zero and one. Logistic regression is therefore the best means of approaching it. Complications remain, however. We have a large number of variables here (relative frequencies of about 5000 words) and we don’t know which ones are important and which ones are not. It’s relatively straightforward to regress for a categorical outcome when you have a relatively small number of variables, but here we have a lot, all of which might be potentially interesting. If we throw thousands and thousands of variables into our logistic regression, though, we will get what is referred to as an overfit model. Rather than creating a model which can capture and identify borderline cases, the corpus will separate absolutely into nineteenth and twentieth century, which sounds like it would be a good thing, but would actually result in an overly rigid template unfit to make actual judgements on texts it hasn’t seen. Therefore we attenuate the influence of the variables by penalising the size of their coefficients; this is called regularisation, and the strength of the penalty is embodied in our lambda value, which is chosen by cross-validation so as to minimise the model’s prediction error on held-out texts.
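To give a feel for what a penalty does, here is a small Python sketch of a logistic regression with a lasso-style (L1) penalty, fitted by a simple gradient step followed by shrinkage. Several things here are assumptions: the post does not say which penalty was used, the data are invented, and glmnet itself fits its models by coordinate descent, not by this procedure. The point is only to show an uninformative variable being shrunk to exactly zero while an informative one survives.

```python
from math import exp


def sigmoid(z):
    """Map any number onto a probability between zero and one."""
    return 1.0 / (1.0 + exp(-z))


def soft_threshold(value, amount):
    """Shrink value towards zero by amount; anything smaller becomes exactly 0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_l1_logistic(rows, labels, lam, learning_rate=0.5, steps=2000):
    """Proximal gradient descent for L1-penalised logistic regression."""
    n, p = len(rows), len(rows[0])
    weights, intercept = [0.0] * p, 0.0
    for _ in range(steps):
        probs = [
            sigmoid(intercept + sum(w * x for w, x in zip(weights, row)))
            for row in rows
        ]
        errors = [prob - label for prob, label in zip(probs, labels)]
        grad_intercept = sum(errors) / n
        grad_weights = [
            sum(error * row[j] for error, row in zip(errors, rows)) / n
            for j in range(p)
        ]
        intercept -= learning_rate * grad_intercept  # the intercept is not penalised
        weights = [
            soft_threshold(w - learning_rate * g, learning_rate * lam)
            for w, g in zip(weights, grad_weights)
        ]
    return intercept, weights


# Invented data: column 0 tracks the label, column 1 is unrelated to it.
rows = [[-2.0, 1.0], [-2.0, -1.0], [-1.0, 1.0], [-1.0, -1.0],
        [1.0, 1.0], [1.0, -1.0], [2.0, 1.0], [2.0, -1.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

_, light = fit_l1_logistic(rows, labels, lam=0.1)
_, heavy = fit_l1_logistic(rows, labels, lam=0.3)
_, extreme = fit_l1_logistic(rows, labels, lam=1.0)
```

With a light penalty the informative column keeps a sizeable weight and the unrelated column is dropped; raising lambda shrinks the surviving weight further, and a large enough lambda removes every variable.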
A lot of what I’ve been describing in the previous paragraph happens, for the most part, in the backend of R; most statistical libraries that carry out regularised regressions contain standard implementations. So, this is the type that we use: a cross-validated model obtained from glmnet, such that each variable is regularised according to what minimises our cross-validated error. We can then extract the most significant predictors, the words which are best suited to identifying a text written in the nineteenth century as opposed to the twentieth. Initially we tried to use the predict() function, which provides us with a figure between zero and one indicating the certainty of a particular judgement. We would then correlate this vector of numbers with our word frequencies and identify which words are most closely correlated with relative certainty. Unfortunately in this instance there were no high effect sizes, so we looked instead at our co-efficients given optimal lambda, that is, the lambda which minimises the cross-validated error. Now, on some level we should be wary of these co-efficients, since they are selected almost at random in order to explain the most data variation, but they’re better than nothing and furthermore interesting from the perspective of content.
It is interesting to note first of all just how parsimonious this model is; cv.glmnet() manages to reduce us down to just 138 words as opposed to the 5000 we present to the model. Secondly, it is interesting to note that there are far more predictors for the nineteenth century (82) than for the twentieth (56). This suggests that the nineteenth century possesses a far more coherent style, whereas the twentieth century is pulling in too many heterogeneous directions to be summarised to the same extent. Before we talk about them in detail, in roughly descending order of importance, I’ll readily admit that yes, how we interpret these can vary. Some nouns are verbs, some verbs are nouns, some are both, and separating one from the other has everything to do with context. There are broad generalisations on offer here, but this seems to me to be both the fundamental hazard and the asset of CLS in general.
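Once a model like this has been fitted, sorting its surviving words into the two camps is simple bookkeeping. The Python sketch below assumes that a negative coefficient points towards the nineteenth century and a positive one towards the twentieth; that sign convention, the function name and the coefficient values are all invented for illustration and are not the model's actual output.

```python
def split_predictors(coefficients):
    """Drop zeroed-out words, then split the rest by sign, strongest first."""
    nonzero = {word: c for word, c in coefficients.items() if c != 0.0}
    nineteenth = sorted(
        (word for word, c in nonzero.items() if c < 0),
        key=lambda word: abs(nonzero[word]),
        reverse=True,
    )
    twentieth = sorted(
        (word for word, c in nonzero.items() if c > 0),
        key=lambda word: abs(nonzero[word]),
        reverse=True,
    )
    return nineteenth, twentieth


# Invented coefficients, NOT results: the real model kept 138 of 5000 words.
toy_coefficients = {
    "vexation": -0.9,
    "reproach": -0.4,
    "moustache": 0.7,
    "the": 0.0,
    "electric": 0.2,
}

nineteenth_words, twentieth_words = split_predictors(toy_coefficients)
```

Counting the two lists on the real model is what gives the 82 versus 56 split discussed above.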
The nineteenth century vocabulary breaks down into a few different categories. The first are words to do with emotions, the overwhelming majority of which seem to be on the negative end: ‘vexation’, ‘reproach’, ‘despair’, ‘dismal’, ‘misfortune’, ‘sorrow’, ‘spite’ and ‘tears’. Only ‘delight’ represents an exception to this rule.
Present participles, the sort of things most characters in these novels find themselves doing, are difficult to synthesise but all seem within the realm of what people in novels spend most of their time doing: ‘entering’, ‘declaring’, ‘noticing’, ‘pointing’, ‘throwing’. We also have the infinitives of ‘resist’, ‘tread’, ‘wish’, ‘allow’, ‘deceive’, ‘fetch’, ‘comprehend’, ‘give’, ‘take’, ‘lend’ and ‘induce’, all of which seem to suggest the general traffic of social interaction and interchange.
We have some past tense verbs including ‘proposed’, ‘treated’, ‘obtained’, ‘seated’, ‘ascended’, ‘fastened’, ‘obliged’, ‘expressed’, ‘consented’, ‘fancied’, ‘quitted’, ‘cried’, ‘accompanied’, ‘returned’, ‘took’, ‘darted’, ‘promised’ and ‘taken’. We also have ‘retired’, which I found very satisfying, being as it is within the realm of the sorts of verbs Joyce uses in his parodies of nineteenth century writing.
The nouns on offer in nineteenth century writing seem to vary slightly, breaking down into vague references to the immediate environment, with words such as ‘heap’, ‘circumstances’, ‘particulars’, ‘companion(s)’, as well as clearer references to social contracts and milieu: ‘occupation’, ‘character’, ‘account’, ‘intellect’, ‘deal’, ‘manner’, ‘fortune’, ‘heir’, ‘prospects’, ‘promises’ and ‘present’. The adjectives break down into good (‘earnest’, ‘good-natured’ and ‘respectable’) against bad (‘low’). We also see a few more abstract or idealistic nouns associated with otherworldly values, such as ‘temptation’.
Nouns to the fore in the twentieth century are far more concrete and seem to foreground a commodity economy; they carry less significance in themselves and open up less onto broader values: ‘moustache’, ‘electric’, ‘apple’, ‘hat’, ‘chimney’ and ‘wire’. More abstract tendencies are manifested in words like ‘jesus’, ‘adventure’, ‘problem’, ‘response’, ‘comment’, ‘personality’, ‘humour’ and ‘vision’.
Present participles drop off quite significantly, and those that remain are far less active in any sense; we get far less moving around in an environment and much more in the way of ‘wearing’ and ‘slipping’. ‘Whistle’ also appears. Past tense verbs like ‘picked’, ‘faced’, ‘slipped’, ‘smiled’, ‘sighed’, ‘protested’, ‘knew’ and ‘realised’ all emphasise social interchange, but also seem to point towards more of an inward focalisation.
Colloquial words like ‘anyhow’, ‘weren’t’ and ‘aren’t’ seem to be predictors here, as well as adjectives which are far more toned down: aside from ‘amazing’, which is the exception, we have ‘normal’, ‘decent’, ‘grey’, ‘responsible’, ‘main’, ‘quality’ and ‘different’.
Finally, we have words which make overt references to the passing of time, such as ‘dusk’, ‘later’, ‘latest’, ‘afternoon’ and ‘spring’.
Grouping all these findings impressionistically, it would seem as though twentieth century literature can be defined by i) its attenuated affect, ii) more of an interior disposition, iii) a movement away from physical action, iv) a concurrent movement away from the material facts of social relations in toto in favour of their symptoms in the form of commodities, and v) the introduction of colloquial language.
Some of these trends can be seen in macro detail on the barplot below:
We then took this model, trained to prise nineteenth and twentieth century literature apart, and applied it to the contemporary modernists: the complete works (at time of writing) of Anne Enright, Eimear McBride, Will Self and Sara Baume were presented to the model. Now, some of you may have noticed the problem with this approach. We have trained the model to differentiate nineteenth century fiction from twentieth, and therefore it’s hardly well set up to differentiate twenty-first century fiction influenced by modernism from twenty-first century fiction not influenced by modernism. It’s a fair point, and if I were writing my thesis on this subject, training a proper model would be what I was doing here. However, I’m not committing as much of a statistical no-no as might at first be thought. For instance, a key part of my analysis of this modernist resurgence has to do with its status as a revanchist, rather than a revolutionary, modernism. And I do mean this more particularly for Will Self and some of Eimear McBride’s most well-placed, and misinformed, critics; these are the only ones truly on record as saying ‘this is modernism’ ad nauseam. Take Self’s observation that post-modernism offers no classicism from which a truly novel aesthetic can be formulated. This is not an aesthetic which emerges concurrently with a period of social and political revolution and which affords some degree of insight into the newly emergent bourgeois individual in the proletarianised urban environment. Rather, it attempts to scoop up the literary prestige associated with modernist literature, understood as Woolf, Joyce and one or two others, the hegemonic criterion by which literature departments, publishers and literary monthlies assess ‘worth’, and sell it back to you wholesale against YA, Netflix or whatever else it is you have to set yourself against in order to be a serious reader.
I expected that the passage of time would fill the gap and that all these novels would be judged as modernist, but in fact the opposite happened; only Anne Enright’s novel What Are You Like? came back as such. Trying to find out why this was the case, we used glmnet’s predict() function, which gives us a figure between 0 and 1 indicating the level of certainty one way or the other. We then correlated this figure with all the word frequencies we have, in order to identify the source of the model’s certainty that the contemporary modernists are in fact quite traditional in their approach.
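The correlation step described above can be sketched in Python as follows. Everything concrete here is an assumption made for illustration: the function names, the four made-up probabilities (read as the model's certainty that a text is twentieth century), and the made-up frequencies for two words. Only the procedure, a Pearson correlation between the predicted figure and each word's frequency, followed by a ranking, mirrors the text.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def rank_words_by_correlation(probabilities, frequencies_by_word):
    """Order words by how strongly their frequency tracks the predicted figure."""
    scores = {
        word: pearson(probabilities, freqs)
        for word, freqs in frequencies_by_word.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up figures for four hypothetical texts, NOT results from the study.
probabilities = [0.1, 0.2, 0.4, 0.9]
frequencies_by_word = {
    "walked": [0.30, 0.25, 0.20, 0.05],
    "wire": [0.01, 0.02, 0.02, 0.03],
}

ranked = rank_words_by_correlation(probabilities, frequencies_by_word)
```

In this toy case ‘walked’ comes out on top with a strong negative correlation, meaning that the more a text uses it, the less twentieth century the model takes that text to be.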
Words which were decisive in identifying these texts as nineteenth century in the overwhelming majority of cases include their use of past tense verbs such as ‘walked’, ‘opened’, ‘married’, ‘tried’, ‘liked’, ‘talked’, ‘watched’, ‘decided’, ‘kissed’, ‘lifted’, ‘pushed’, ‘stayed’, ‘slept’, ‘slipped’, ‘ate’, ‘wiped’ and ‘spoiled’.
Also decisive were adjectives like ‘easy’, ‘middle’, ‘clever’, ‘ordinary’, ‘foolish’, ‘fierce’, ‘sober’ and ‘irish’, pronouns such as ‘she’ and ‘herself’ and, finally, nouns like ‘side’, ‘dress’, ‘floor’, ‘sorrow’, ‘blame’, ‘cloth’, ‘veil’, ‘rail’ and ‘treasure’.
In conclusion, then, we might say that contemporary modernism fails to embody modernism’s stylistic disposition in a number of key ways and in fact harkens back to a pre-modernist stylistic tendency in its investment in action verbs in the past tense. The relative absence of modern technology also seems to be a feature here, and a more pronounced affective turn likewise seems to undermine these novels in their aspiration, real or formulated, towards a modernist aesthetic. It is finally interesting to reflect a bit on What Are You Like?; within Enright’s career it marks a crux, a turn from the magical realism of her short stories and The Wig my Father Wore towards quite an affectless reflection on identity and psychology. I’ll update this post with more examples once I have a copy of the book to hand; for the moment you’ll just have to trust me on that. It is interesting to note as well that towards the end of the novel the main characters’ mother delivers a soliloquy from hell in a way quite reminiscent of Faulkner’s As I Lay Dying, an encouraging parallel within a study of this kind.