This blog post provides some notes on the methodology underpinning my doctoral research. For my research project I will model 640 novels and short story collections within a consensus network, in order to project a potential definition of modernist literary style through both qualitative and quantitative means. In the fullness of time I will have a full and replicable account up on RPubs and Github; for the moment this general introduction will have to do.
The quantitative analysis of literature has had a fraught history. Since the cultural turn of the sixties and seventies, when the political revisionisms of feminism, queer theory and critical race theory were gaining increasing currency, the concept of ‘style’, some quintessence of the work which could be instrumentally distilled from the text, became increasingly untenable. Context became the predominant means through which literature is understood in Anglo-American literature departments. Indeed, the very idea of style would seem to recall the belles lettres approach of the nineteenth century.
Computational literary criticism, out of necessity, treats literary materials in more pragmatic terms. When filling a spreadsheet, something must be entered into each cell, and no quantitative conversation is possible outside of these terms. This stands in contrast to contemporary literary studies, in which one can quite happily have a long and involved discussion about what a text is not saying. More recent developments within new modernist studies and neo-Victorianism have expanded the temporal and spatial limits of their respective objects of study: into the present day, far into the past, and beyond the metropoles of London, New York and Paris. In aiming to de-tether the implicit value judgements of their respective categorisations from the more problematic aspects of modernity or colonialism, these developments have only polarised the two positions further.
This leaves quantitative literary critics in something of a quandary. Some of the field's more vociferous advocates claim that the application of computational logic to literary materials represents a definitive paradigm shift of which the discipline at large should take more account, yet their epistemological conservatism is often reflected in their political conservatism. The notion of style as a combination of quantifiable features seems to underpin an uncritical celebration of formal competence, and has been intriguingly read both as an example of ‘third way’ knowledge production and as a backlash against politically oriented cultural criticism.
I would argue that falling into retrograde modes of thought is certainly a risk of analyses of this kind, but it is not a necessity. Networks, with their capacity to regard texts as embedded within a broader ecosystem, offer the possibility of bringing the new modernist studies dispensation into dialogue with quantitative literary criticism.
The quantitative analysis of literature can be said to have been kicking around ever since monks first devised manual concordances of the Bible. Every digital humanist will be familiar with the work of Roberto Busa, but the history of the statistical analysis of literature is a more decentralised phenomenon than the big-tent digital humanities. The earliest example I can find is Louis Tonko Milic’s A Quantitative Approach to the Style of Jonathan Swift, published in 1967. Milic, bless him, seems to be under the impression that he stands at the brink of a newly invigorated formalism, one which can mobilise computation to reveal literary works as they truly are, bypassing the impressionism which elsewhere characterises appraisals of style within the field. Unfortunately, literary critics are not terribly well known for their command of statistics, and Milic’s tendency to reproduce pages and pages of tables without assessing their significance, with a Student’s t-test for instance, is symptomatic. Many articles in the earliest digital humanities journals simply reproduce the raw data and advance interpretations based on visual impressions rather than on mathematical findings.
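The kind of check that was missing here is simple enough to sketch. The snippet below (in Python rather than R, and with invented per-sample mean sentence lengths standing in for real measurements) computes a Welch's two-sample t statistic, the sort of significance test that would tell us whether two sets of tabulated figures actually differ:

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (unequal variances assumed)."""
    m1, m2 = mean(sample_a), mean(sample_b)
    v1, v2 = variance(sample_a), variance(sample_b)
    n1, n2 = len(sample_a), len(sample_b)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical mean sentence lengths (in words) from samples of two authors
swift = [18.2, 21.4, 19.8, 20.1, 22.5, 18.9]
other = [24.1, 26.3, 23.8, 25.0, 27.2, 24.9]
t = welch_t(swift, other)
```

A strongly negative t here would suggest the first author's sentences really are shorter, rather than the difference being an artefact of the particular pages sampled; the statistic would then be compared against the t distribution for a p-value.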
The development of analyses based on the richness of a text’s vocabulary (the number of unique words divided by the total number of words), hapax richness (the number of words which appear only once divided by the total number of words), or average sentence and word length represented an improvement on this approach, but not by much. These measures may be understood as indexes of style, but as before they were placed in tables and ‘read’ in much the same way literary critics usually read texts. There were no systematic attempts to assess, say, sentence length across a broader corpus, nor was any benchmark established for the assessment of significant differences.
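These measures are trivial to compute. A minimal sketch, in Python rather than R and with a toy sentence standing in for a full text, of the type-token ratio and hapax richness defined above:

```python
from collections import Counter

def richness_metrics(text):
    """Type-token ratio and hapax richness for a whitespace-tokenised text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    ttr = len(counts) / len(tokens)             # unique words / total words
    hapaxes = sum(1 for c in counts.values() if c == 1)
    hapax_richness = hapaxes / len(tokens)      # once-only words / total words
    return ttr, hapax_richness

ttr, hapax = richness_metrics("the cat sat on the mat and the dog sat too")
```

Real texts would of course need proper tokenisation (punctuation, case, hyphenation), and both measures are sensitive to text length, which is one reason they travel badly across a corpus of novels of very different sizes.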
The first quantitative analysis of literature which yielded replicable results was developed by the Australian literary critic J.F. Burrows. Rather than focusing on the more evocative or longer words to which literary critics usually direct their attention, his Delta method aims to uncover stylistic signal by quantifying the relative occurrences of high-frequency terms such as ‘the’, ‘an’, ‘a’, ‘and’ or ‘said’. Burrows’ original method used only the 150 most frequent words (MFWs), but subsequent analyses have demonstrated that successful authorship attribution increases all the way up to 5000 MFWs. The more of these particle words are analysed, in effect, the better.
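The logic of Delta is straightforward to sketch. The Python below is an illustrative reconstruction, not Stylo's own code: it z-scores the relative frequencies of the corpus's most frequent words and takes the mean absolute difference of those z-scores between two texts.

```python
from collections import Counter
from statistics import mean, stdev

def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]

def burrows_delta(corpus, text_a, text_b, n_mfw=5):
    """Burrows' Delta: mean absolute difference of z-scored MFW frequencies."""
    vocab = [w for w, _ in
             Counter(t for doc in corpus for t in doc).most_common(n_mfw)]
    table = [relative_freqs(doc, vocab) for doc in corpus]
    means = [mean(col) for col in zip(*table)]
    sds = [stdev(col) for col in zip(*table)]
    def zscores(doc):
        return [(f - m) / s if s else 0.0
                for f, m, s in zip(relative_freqs(doc, vocab), means, sds)]
    return mean(abs(a - b) for a, b in zip(zscores(text_a), zscores(text_b)))

# three toy 'texts', tokenised into words
corpus = ["the cat sat on the mat".split(),
          "a dog ran in the park and the dog barked".split(),
          "she said the day was long and she said no more".split()]
```

A text compared with itself scores 0; larger values mean greater stylistic distance. The z-scoring is what stops ubiquitous words like ‘the’ from drowning out everything else.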
This leaves us with the problem of what scale to analyse texts at. Eder has noted that analysing texts at different scales broadcasts different stylistic signals, with discomfiting amounts of variation between them. I’ve noted this phenomenon myself when analysing individual words as opposed to combinations of words in twos (‘the man’, ‘she said’, ‘over there’) or threes (‘she also said’, ‘over by the’), or even sequences of individual characters (‘th ‘, ‘a’, ‘n he’). Rybicki and Eder’s solution is to quantify all 5000 words six times, culling them in increments of twenty; rather than finding a single ‘best’ fit, we throw everything in and obtain the average level of similarity existing between each pair of texts, subject to particular conditions. I propose a similar approach, analysing single words, bigrams, trigrams, quadgrams and quingrams in both word and character form. This is all done through the ‘stylo’ package, a library written in the R language.
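The n-grams themselves are simple to generate. A sketch (Python, not stylo's own implementation) of a single function that works on either a list of words or a raw character string:

```python
def ngrams(seq, n):
    """All contiguous n-grams of a sequence of words or characters."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

text = "she also said"
word_bigrams = ngrams(text.split(), 2)   # pairs of adjacent words
char_trigrams = ngrams(text, 3)          # runs of three characters
```

Character n-grams straddle word boundaries (spaces included), which is partly why they carry a different stylistic signal from word n-grams of the same order.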
Once all these analyses have been run, R outputs a list of edges into the working directory, which will form the basis of the network. It looks like this:
Each row here represents a relationship between two texts, ‘Source’ and ‘Target’; each row is effectively a line drawn from column A to column B. The third column, ‘Weight’, signifies the intensity of the relationship, the weakest being 1 and the strongest being ~1125. This seems to be the maximum value possible, so I suspect the algorithm which creates this table cuts off the similarity calculation past a certain point. To return to the table, we can see that the rows run in descending order of intensity, and that Anne Brontë’s novel Agnes Grey is by far most like her other novel The Tenant of Wildfell Hall. From there, there is a pronounced drop-off from a weight of 902 to 226, the next most similar novel being James Joyce’s Finnegans Wake.
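For concreteness, here is how an edge list in that Source, Target, Weight form can be read and sorted in Python; the file contents below are an illustrative stand-in echoing the values discussed above, not the real output.

```python
import csv
import io

# A minimal edge list in the Source,Target,Weight shape described above
edge_csv = """Source,Target,Weight
AgnesGrey,TenantOfWildfellHall,902
AgnesGrey,FinnegansWake,226
AgnesGrey,JaneEyre,190
"""

edges = list(csv.DictReader(io.StringIO(edge_csv)))
# sort in descending order of intensity, as in the table
edges.sort(key=lambda row: int(row["Weight"]), reverse=True)
strongest = edges[0]
```

With a real file on disk, `io.StringIO(edge_csv)` would simply be replaced by `open("edges.csv")`.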
An edge list of this kind is outputted for every one of the scales mentioned above. These are then combined into a single massive list of edges (about 14,720 rows in all). Because there are about ten edge lists, there are ten different weights for each relationship. These are averaged into a single ‘edge’, and this forms the basis of the network, which I’ll discuss in a subsequent post.
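The averaging step can be sketched as follows; this Python is a stand-in for what I actually do in R, with two invented per-scale edge lists in place of the ten real ones.

```python
from collections import defaultdict

def merge_edge_lists(edge_lists):
    """Average the weight of each (source, target) pair across several
    edge lists, yielding one consensus edge per pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for edges in edge_lists:
        for source, target, weight in edges:
            key = (source, target)
            sums[key] += weight
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# two hypothetical per-scale edge lists
scale_1 = [("A", "B", 900.0), ("A", "C", 200.0)]
scale_2 = [("A", "B", 800.0), ("A", "C", 300.0)]
consensus = merge_edge_lists([scale_1, scale_2])
```

Since stylistic similarity is symmetric, a fuller version would also normalise (A, B) and (B, A) to a single key before averaging, so that the same relationship is never counted twice.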