Monthly Archives: November 2018

Books I would like to read that don’t exist

On the ruinous tweeness of webcomics, popular existentialism, Kraftwerk’s role in ideologically laundering the EU project, Marxism and paranoia, a sociology of creative writing workshops, the rich, anti-communism in Irish literature, a literary history of Fine Gael and the avant-garde, on the hauntology of tacky dubstep, how the fash/Heideggerian ‘authentic relationship with nature’ was commodified, a novel that reads like an encyclopedia/history of an invented country, Alice Spawls on the Brontës, Aphex Twin’s Cornishness, a collection of the best (justified) literary hatchet jobs, a non-anthropocentric treatment of experiments with animal communication and the consequences of cross-species communication, Enzo Traverso on Irish left melancholia and Judith Butler’s Lives of the Saints.

A collection of essays considering the crossover between illness and the avant-garde, a book that lays the blame for Brexit at the feet of Ian McEwan and Martin Amis’ oeuvres, a book that blames Jonathan Franzen for Trump, a book on the Britpop psyop, a dystopian sci-fi novel where art is a means of oppression rather than the straightforward force of resistance it’s usually represented as.

The Gathering if Anne Enright had written it as the Faulkneresque three generations of the Free State it started out as, as well as the magical realist one she wrote at UEA about Colley Cibber adapting Shakespeare, a non-contrived historiographical metafiction, an Emma Donoghue Hood sequel, Deleuze and Guattari critiquing cryptids, the novel Lucia Joyce wrote that her brother burned on her death, the original draft of Nightwood, Derrida’s response to Gadamer, the novel Joyce would’ve actually written had he lived to be a hundred and lived in Iowa (Don DeLillo said this of the Warren Report).

Maggie Nelson on purple.


The quantitative analysis of literature in theory


This blog post provides some notes on the methodology underpinning my doctoral research. The project models 640 novels and short story collections within a consensus network in order to project a potential definition of modernist literary style through both qualitative and quantitative means. In the fullness of time I will have a full and replicable account up on RPubs and GitHub; for the moment this general introduction will have to do.

The quantitative analysis of literature has had a fraught history. Since the cultural turn of the sixties and seventies, when the political revisionisms of feminism, queer theory and critical race theory were gaining increasing currency, the concept of ‘style’, some quintessence of the work which could be instrumentally distilled from the text, has become increasingly untenable. Context has become the predominant means through which literature is understood in Anglo-American literature departments; indeed, the very idea of ‘style’ would seem to recall the belles lettres approach of the nineteenth century.

Computational literary criticism, out of necessity, treats literary materials in more pragmatic terms. When filling a spreadsheet, things need to be entered into cells, and no quantitative conversation is possible outside of those cells. This stands in contrast to contemporary literary studies, in which one can quite happily have a long and involved discussion about what a text is not saying. The gap has only widened with the more recent developments within new modernist studies and neo-Victorianism, which have expanded the temporal and spatial limits of their respective objects of study, into the present day, far into the past and beyond the metropoles of London, New York and Paris, aiming to de-tether the implicit value judgements of their respective categorisations from the more problematic aspects of modernity or colonialism. As a result, the two positions have only become more polarised.

This leaves quantitative literary critics in something of a quandary. Some of its more vociferous advocates claim that the application of computational logic to literary materials represents a definitive paradigm shift which the discipline at large should take more account of, yet their epistemological conservatism is often reflected in their political conservatism. The notion of style as a combination of quantifiable features seems to underpin an uncritical celebration of formal competence, and has been intriguingly read both as an example of ‘third way’ knowledge production and as a backlash against politically oriented cultural criticism.

I would argue that falling into retrograde modes of thought is certainly a risk of analyses of this kind, but it is not an inevitability; networks, with their capacity to regard texts as embedded within a broader ecosystem, offer the possibility of bringing the new modernist studies dispensation into dialogue with quantitative literary criticism.

The quantitative analysis of literature has been kicking around for as long as monks have devised manual concordances of the Bible. Every digital humanist will be familiar with the work of Roberto Busa, but the history of the statistical analysis of literature is a more decentralised phenomenon than big-tent digital humanities. The earliest example I can find is Louis Tonko Milic’s A Quantitative Approach to the Style of Jonathan Swift, published in 1967. Milic, bless him, seems to be under the impression that he stands at the brink of a newly invigorated formalism which can mobilise computation to reveal literary works as they truly are, bypassing the impressionism which elsewhere characterises appraisals of style within the field. Unfortunately, literary critics are not terribly well known for their command of statistics, and Milic’s tendency to reproduce pages and pages of tables without assessing their significance, with a Student’s t-test for instance, is symptomatic. Many of the earliest digital humanities journal articles simply reproduce the raw data and advance interpretations based on visual impressions rather than mathematical findings.

The development of analyses based on the richness of a text’s vocabulary (number of unique words / total number of words), hapax richness (number of words which appear only once / total number of words), or average sentence and word length represents an improvement on this approach, but not by much. These measures may be understood as indexes of style, but as before they were placed in tables and often ‘read’ in the way literary critics usually read texts. There were no systematic attempts to assess, say, sentence length across a broader corpus, nor was any benchmark established for the assessment of significant differences.
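These richness measures are simple enough to compute. A minimal Python sketch, purely illustrative (the analyses described in this post are done in R; the function name and toy sentence are my own):

```python
from collections import Counter

def richness_measures(text: str) -> dict:
    """Simple lexical richness indexes: type-token ratio and hapax richness."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {
        # unique words / total words
        "vocabulary_richness": len(counts) / total,
        # words appearing exactly once / total words
        "hapax_richness": sum(1 for c in counts.values() if c == 1) / total,
    }

sample = "the cat sat on the mat and the dog sat too"
print(richness_measures(sample))
# vocabulary_richness ≈ 0.727 (8 types / 11 tokens), hapax_richness ≈ 0.545 (6 / 11)
```

As the post notes, numbers like these only become meaningful against a corpus-wide benchmark; on their own they are just another table to be ‘read’.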

The first quantitative analysis of literature which yielded replicable results was developed by the Australian literary critic J.F. Burrows. His Delta method, rather than focusing on the more evocative or longer words that literary critics usually attend to, aimed to uncover stylistic signal by quantifying the relative occurrences of high-frequency terms such as ‘the’, ‘an’, ‘a’, ‘and’ or ‘said’. Burrows’ original method used just the 150 most frequent words (MFWs), but subsequent analyses have demonstrated that successful authorship attribution improves all the way up to 5000 MFWs. The more of these particle words are analysed, in effect, the better.
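The principle behind Delta can be sketched briefly: take the corpus-wide most frequent words, z-score each word’s relative frequency across the texts, and measure the mean absolute difference of z-scores between any two texts. A toy Python version of that idea follows; it is a sketch of the general principle only, not Burrows’ exact procedure or the implementation used in this project, and the toy corpus is invented:

```python
from collections import Counter
import statistics

def relative_freqs(tokens, vocab):
    """Relative frequency of each vocab word in a token list."""
    counts = Counter(tokens)
    n = len(tokens)
    return [counts[w] / n for w in vocab]

def burrows_delta(corpus, n_mfw=30):
    """Pairwise Delta-style distances over the n_mfw most frequent words.

    corpus maps text names to pre-tokenised word lists.
    """
    # 1. most frequent words across the whole corpus
    all_tokens = [w for toks in corpus.values() for w in toks]
    mfw = [w for w, _ in Counter(all_tokens).most_common(n_mfw)]
    # 2. relative frequency of each MFW per text
    freqs = {name: relative_freqs(toks, mfw) for name, toks in corpus.items()}
    # 3. z-score each word's frequency across the corpus
    zscores = {name: [] for name in corpus}
    for i in range(len(mfw)):
        col = [freqs[name][i] for name in corpus]
        mu, sigma = statistics.mean(col), statistics.pstdev(col)
        for name in corpus:
            zscores[name].append(0.0 if sigma == 0 else (freqs[name][i] - mu) / sigma)
    # 4. Delta = mean absolute difference of z-scores between two texts
    names = sorted(corpus)
    return {(a, b): statistics.mean(abs(x - y) for x, y in zip(zscores[a], zscores[b]))
            for a in names for b in names if a < b}

corpus = {
    "a": "the cat sat on the mat".split(),
    "b": "the cat sat on the mat".split(),
    "c": "dog ran and dog ran far".split(),
}
print(burrows_delta(corpus, n_mfw=10))
```

Identical texts score 0; the further apart two texts’ function-word profiles, the larger the value.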

This leaves us with a problem as to the scale at which we analyse texts. Eder has noted that analysing words at different scales broadcasts different stylistic signals, with discomfiting amounts of variation between them. I’ve noted this phenomenon myself when analysing individual words as opposed to combinations of words in twos (‘the man’, ‘she said’, ‘over there’) or threes (‘she also said’, ‘over by the’), or even on the level of individual characters (‘th ’, ‘a’, ‘n he’). Rybicki and Eder’s solution is to quantify all 5000 words six times, culling them in increments of twenty; rather than finding a single ‘best’ fit, we throw everything in and obtain the average level of similarity existing between each pair of texts, subject to particular conditions. I propose a similar approach, analysing single words, bigrams, trigrams, quadgrams and quingrams in both word and character form. This is all done through the ‘stylo’ package, a library written in the R language.
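For clarity, the word and character n-grams being counted at these scales look like the following (a hypothetical Python illustration; stylo generates its own n-grams internally):

```python
def word_ngrams(text, n):
    """Sliding window of n consecutive words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """Sliding window of n consecutive characters (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(word_ngrams("she also said hello", 2))
# ['she also', 'also said', 'said hello']
print(char_ngrams("then", 2))
# ['th', 'he', 'en']
```

Note that character n-grams straddle word boundaries, which is part of why they carry a different stylistic signal from word n-grams.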

Once all these analyses have been run, R outputs a list of edges into the working directory, which will form the basis of the network. It looks like this:

[Screenshot: a sample of the edge list, with ‘Source’, ‘Target’ and ‘Weight’ columns]

Each row here represents a relationship between two texts, a ‘Source’ and a ‘Target’; each row is effectively a line drawn from column A to column B. The third column, ‘Weight’, signifies the intensity of the relationship, the weakest being 1 and the strongest being ~1125. This seems to be the maximum value possible, so I suspect the algorithm which creates this table cuts off the similarity calculation past a certain point. To return to the table, we can see that the rows run in descending order of intensity, and that Anne Brontë’s novel Agnes Grey is by far the most like her other novel The Tenant of Wildfell Hall. From there, there is a pronounced drop-off from a weight of 902 to 226, the next most similar novel being James Joyce’s Finnegans Wake.

A list of this kind is effectively outputted for every scale mentioned above. These are then combined into a single massive list of edges (about 14720 rows in all). Because there are about ten edge lists, there are ten different weights for each relationship. These are averaged into a single ‘edge’, and this forms the basis of the network, which I’ll discuss in a subsequent post.
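The averaging step itself is straightforward. A Python sketch of the idea (the function name and data layout here are my own; the actual pipeline works on the edge-list files stylo writes out):

```python
from collections import defaultdict

def average_edge_lists(edge_lists):
    """Collapse several (Source, Target, Weight) edge lists into one,
    averaging the weight of every Source->Target pair across the lists."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for edges in edge_lists:
        for source, target, weight in edges:
            sums[(source, target)] += weight
            counts[(source, target)] += 1
    return [(s, t, sums[(s, t)] / counts[(s, t)]) for (s, t) in sums]

# e.g. the same pair of texts measured at two different n-gram scales
lists = [
    [("Agnes Grey", "The Tenant of Wildfell Hall", 902.0)],
    [("Agnes Grey", "The Tenant of Wildfell Hall", 1100.0)],
]
print(average_edge_lists(lists))
# [('Agnes Grey', 'The Tenant of Wildfell Hall', 1001.0)]
```

Each pair of texts thus ends up with one consensus weight, averaged across all the scales at which it was measured.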