A (Proper) Statistical analysis of the prose works of Samuel Beckett


Content warning: If you want to get to the fun parts, the results of an analysis of Beckett’s use of language, skip to sections VII and VIII. Everything before that is navel-gazing methodology stuff.

If you want to know how I carried out my analysis, and utilise my code for your own purposes, here’s a link to my R code on my blog, with step-by-step instructions, because not enough places on the internet include that.

I: Things Wrong with my Dissertation’s Methodology

For my master's, I wrote a 20,000-word dissertation that took as its subject an empirical analysis of the works of Samuel Beckett. I had a corpus of his entire works with the exception of his first novel, Dream of Fair to Middling Women, which is a forgivable lapse, because he ended up cannibalising it for his collection of short stories, More Pricks than Kicks.

Quantitative literary analysis is generally carried out in one of two open-source programming languages, Python or R. The former you’re more likely to have heard of, being one of the few languages designed with usability in mind. The latter, R, would be more familiar to specialists, or people who work in the social sciences, as it is more abstruse than Python, doesn’t have many language cousins and has a very unfriendly learning curve. But I am attracted to difficulty, so I am using it for my PhD analysis.

I had about four months to carry out my analysis, so the idea of taking on a programming language in a self-directed learning environment was not feasible, particularly since I wanted to make a good go at the extensive body of secondary literature written on Beckett. I therefore made use of a corpus analysis tool called Voyant. This was a couple of years ago, so this was before its beta release, when it got all tricked out with some qualitative tools and a shiny new interface, which would have been helpful. Ah well. It can be run out of any browser, if you feel like giving it a look.

My analysis was also chronological, in that it looked at changes in Beckett’s use of language over time, with a view to proving the hypothesis that he used a narrower vocabulary as his career continued, in pursuit of his famed aesthetic of nothingness or deprivation. As I wanted to chart developments in his prose over time, I dated the composition of each text, and built a corpus for each year, from 1930–1987, excluding, of course, years in which he wrote only drama or poetry, which wouldn’t be helpful to quantify in conjunction with one another. Which didn’t stop me doing so for my master's analysis. It was a disaster.

II: Uniqueness

Uniqueness, the measurement used to quantify the general spread of Beckett’s vocabulary, was obtained by the generally accepted formula below:

unique word tokens / total words
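The ratio above can be sketched in a few lines of Python. This is only an illustration, not the R script linked earlier: the regex tokeniser and the function names `tokenize` and `uniqueness` are assumptions made for the example, and the sample sentence is borrowed from the passage of 'The Way' quoted later in the post.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())

def uniqueness(tokens):
    """Number of distinct words divided by total words; 0.0 for an empty text."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

sample = "Nor of how. Nor of whom. None of anything."
tokens = tokenize(sample)
# 9 words in total, 6 of them distinct
print(len(tokens), len(set(tokens)), round(uniqueness(tokens), 3))  # prints: 9 6 0.667
```

A different tokeniser (one that keeps hyphenated words together, say) would give slightly different counts, so scores are only comparable when every text is tokenised the same way.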

There is a problem with this measurement, in that it takes no account of a text’s relative length. As a text gets longer, the likelihood that any given word has already been used approaches 1, so each additional word is less and less likely to be a new one. Therefore, a text gets less unique as it gets bigger. I have the correlations to prove it:

[Figure: correlations between uniqueness and text length]

There have been various solutions proposed to this quandary, which somewhat stymies our comparative analyses. One among them is the use of vectorised measurements, which plot the text’s declining uniqueness against its word count, so we see a more impressionistic graph, such as this one, which should allow us to compare James Joyce’s novel A Portrait of the Artist as a Young Man and his short story collection Dubliners.

[Figure: uniqueness plotted against word count for A Portrait of the Artist as a Young Man and Dubliners]
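A curve of this kind can be computed by measuring uniqueness on ever-longer prefixes of a text. The sketch below is a Python illustration under assumptions: the name `uniqueness_curve` and the `step` parameter are hypothetical, and the plotting itself is left out.

```python
def uniqueness_curve(tokens, step=1000):
    """Return (word_count, uniqueness) pairs measured on growing prefixes of the text."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        # Record a point every `step` words, and once more at the very end.
        if i % step == 0 or i == len(tokens):
            points.append((i, len(seen) / i))
    return points

# Toy input: the ratio falls as words start to repeat.
print(uniqueness_curve("a b a c a b d a".split(), step=2))
```

Each text yields one list of points, so two texts can be overlaid on the same axes by plotting their lists separately.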

All well and good for two or maybe even five texts, but one can see how, with large scale corpora, this sort of thing can get very incoherent very quickly. Furthermore, if one were to examine the numbers on the y-axis, one would see that the differences here are tiny. This is another idiosyncrasy of stylostatistical methods; because of the way syntax works, the margins of difference wouldn’t be regarded as significant by most statisticians. These issues relating to the measurement are exacerbated by the fact that ‘particles,’ the atomic structures of literary speech (it, is, the, a, an, and, said, etc.), make up most of a text. In pursuit of greater statistical significance for their papers, digital literary critics remove these particles from their texts, which is another unforgivable thing that we do anyway. I did not, because I was concerned that I was complicit in the neoliberalisation of higher education. I also wrote a 4000 word chapter that outlined why what I was doing was awful.

IV: Ambiguity

Ambiguity was calculated with the following formula:

number of indefinite pronouns/total word count

I derived this measurement from Dr. Ian Lancashire’s study of the works of Agatha Christie, and counted Beckett’s use of a set of indefinite pronouns, ‘everyone,’ ‘everybody,’ ‘everywhere,’ ‘everything,’ ‘someone,’ ‘somebody,’ ‘somewhere,’ ‘something,’ ‘anyone,’ ‘anybody,’ ‘anywhere,’ ‘anything,’ ‘no one,’ ‘nobody,’ ‘nowhere,’ and ‘nothing.’ Those of you who know that there are more indefinite pronouns than just these are correct; I had found an incomplete list of indefinite pronouns, and I assumed that that was all. This is just one of the many things wrong with my study. My theory was that there would be correlations to be detected between Beckett’s decreasing vocabulary and his increasing deployment of indefinite pronouns, relative to the total word count. I called the vocabulary measure ‘uniqueness,’ and the indefinite pronouns measure I called ‘ambiguity.’ This is tenuous, I know; indefinite pronouns advance information as they elide the provision of information. It is, like so much else in the quantitative analysis of literature, totally unforgivable, yet we do it anyway.
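As a rough Python illustration of this count (again, not the original script), the sketch below uses exactly the sixteen pronouns listed above. The function name `ambiguity`, the regex tokeniser, and the decision to count ‘no one’ as one hit spread over two tokens are all assumptions made for the example.

```python
import re

# The single-word pronouns from the list above; "no one" is handled separately
# because it spans two tokens.
SINGLE_WORD = {
    "everyone", "everybody", "everywhere", "everything",
    "someone", "somebody", "somewhere", "something",
    "anyone", "anybody", "anywhere", "anything",
    "nobody", "nowhere", "nothing",
}

def ambiguity(text):
    """Indefinite pronouns divided by total word count; 0.0 for an empty text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in SINGLE_WORD)
    # Count each adjacent pair "no", "one" as one pronoun.
    hits += sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == ("no", "one"))
    return hits / len(tokens)

# "somewhere" and "someone" are on the list, "somehow" is not: 2 hits in 6 words.
print(ambiguity("Somewhere again. Somehow again. Someone again."))
```

The pair match will also catch phrases such as ‘no one thing,’ where ‘no one’ is not a pronoun, so a careful count would need to check those by hand.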

V: Hapax Richness

I initially wanted to take into account another phenomenon known as the hapax score, which charts occurrences of words that appear only once in a text or corpus. The formula to obtain it would be the following:

number of words that appear once/total word count
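In Python, a minimal version of this score might look like the following. The function name `hapax_score` is hypothetical, and the input is assumed to be a list of lowercased word tokens.

```python
from collections import Counter

def hapax_score(tokens):
    """Words that occur exactly once, divided by total word count; 0.0 for an empty text."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(tokens)

tokens = "nor of how nor of whom none of anything".split()
# "how", "whom", "none" and "anything" each occur once: 4 hapaxes in 9 words
print(hapax_score(tokens))
```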

I believe that the hapax count would be of significance to a Beckett analysis because of the points at which his normally incompetent narrators have sudden bursts of loquaciousness, like when Molloy says something like ‘digital emunction and the peripatetic piss,’ before lapsing back into his ‘normal’ tone of voice. Once again, because I was often working with a pen and paper, this became impossible, but now that I know how to code, I plan to go over my master's analysis, and do it properly. The hapax score will form a part of this new analysis.

VI: Code & Software

A much more accurate way of analysing vocabulary for comparative purposes, when your texts are of different lengths, would therefore be to randomly sample them. Obviously not very easy when you’re working with a corpus analysis tool online, but far more straightforward when working through a programming language. A formula for representative sampling was found, and integrated into the code. My script is essentially a series of nested loops and if/else statements that randomly and sequentially sample a text, calculate the uniqueness, indefiniteness and hapax density ten times, store the results in a variable, and then calculate the mean value for each by dividing the result by ten, the number of times that the first loop runs. I inputted each value into the statistical analysis program SPSS, because it makes pretty graphs with less effort than R requires.
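The sampling loop described above might be sketched in Python as follows. This is not the R script itself, and several things are assumptions: ‘randomly and sequentially’ is read here as a contiguous run of words beginning at a random offset, the sample size is passed in by hand because the post does not give the sampling formula, and the names `mean_over_samples` and `sample_size` are hypothetical.

```python
import random

def uniqueness(tokens):
    """Distinct words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_over_samples(tokens, measure, sample_size, runs=10, seed=None):
    """Average `measure` over `runs` contiguous windows that start at random offsets."""
    if sample_size <= 0 or sample_size > len(tokens):
        raise ValueError("sample_size must be between 1 and len(tokens)")
    rng = random.Random(seed)  # pass a seed to make a run repeatable
    total = 0.0
    for _ in range(runs):
        start = rng.randint(0, len(tokens) - sample_size)
        window = tokens[start:start + sample_size]
        total += measure(window)
    return total / runs

# Toy text: every 10-word window holds the same two words,
# so each sample scores 2/10 whatever the offset.
toy = ["a", "b"] * 50
print(mean_over_samples(toy, uniqueness, sample_size=10, seed=1))
```

Because every text is measured on windows of the same size, the scores no longer depend on how long the whole text is; any function that takes a list of tokens can be passed as `measure`.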

VII: Results

I used SPSS’ box plot function first to identify any outliers for uniqueness, hapax density and ambiguity. 1981 was the only year which scored particularly high for relative usage of indefinite pronouns.


It should be said that this measure, too, is correlated with the length of the text, which only stands to reason; as a text gets longer the relative incidence of a particular set of words will decrease. Therefore, as the only texts Beckett wrote this year, ‘The Way’ and ‘Ceiling,’ together add up to about 582 words (the fifth lowest year for prose output in his life), one would expect indefiniteness to be somewhat higher in comparison to other years. However, this doesn’t wholly account for its status as an outlier value. Towards the end of his life Beckett wrote increasingly short prose pieces. Comment C’est (How It Is) was his last novel, and was written almost thirty years before he died. This probably has a lot to do with his concentration on writing and directing his plays, but in his letters he attributed it to a failure to progress beyond the third novel in his so-called trilogy of Molloy, Malone meurt (Malone Dies) and L’Innommable (The Unnamable). It is in the year 1950, the year in which L’inno was completed, that Beckett began writing the Textes pour rien (Texts for Nothing), scrappy, disjointed pieces, many of which seem to be taking up from where L’inno left off, as do the Fizzles and the Faux Départs. ‘The Way,’ I think, is an outgrowth of a later phase in Beckett’s prose writing, which dispenses with the peripatetic loquaciousness and the understated lyricism of the trilogy and replaces them with a more brute and staccato syntax, one which is often dependent on the repetition of monosyllables:

No knowledge of where gone from. Nor of how. Nor of whom. None of whence come to. Partly to. Nor of how. Nor of whom. None of anything. Save dimly of having come to. Partly to. With dread of being again. Partly again. Somewhere again. Somehow again. Someone again.

Note also the prevalence of particle words, which will have been stripped out for the analysis, and the ways in which words with a ‘some’ prefix are repeated as a sort of refrain. This essential structure persists in the work, or at least in the artefact of the work that the code produces, and hence makes of it the outlier that it is.

[Figure: chart of the results]

From plotting all the values together at once, we can see that uniqueness is partially dependent on hapax density; the words that appear only once in a particular corpus would be important in driving up the score for uniqueness. While there could be said to be a case for the hypothesis that Beckett’s texts get less unique and more ambiguous up until 1944, when he completed his novel Watt, and if we’re feeling particularly risky, up until 1960 when Comment C’est was completed, it would be wholly disingenuous to advance it beyond this point, when his style becomes far too erratic to categorise definitively. Comment C’est is Beckett’s most uncompromising prose work. It has no punctuation, no capitalisation, and narrates the story of two characters, in a kind of love, who communicate with one another by banging kitchen implements off one another:

as it comes bits and scraps all sorts not so many and to conclude happy end cut thrust DO YOU LOVE ME no or nails armpit and little song to conclude happy end of part two leaving only part three and last the day comes I come to the day Bom comes YOU BOM me Bom ME BOM you Bom we Bom

VIII: Conclusion

I would love to say that the general tone is what my model is being attentive to, which is why it identified Watt and How It Is as nadirs in Beckett’s career, but I think their presence on the chart is more a product of their relative length, as novels, versus the shorter pieces which he moved towards in his later career. Clearly, Beckett’s decision to write shorter texts makes this means of summing up his oeuvre in general insufficient. Whatever changes Beckett made to his aesthetic over time, we might not need such complicated graphs to map them, and I could have just used a word processor to find the one that matters: length. Bom and Pim aside, for whatever reason after having written L’inno none of Beckett’s creatures presented themselves to him in novelistic form again. The partiality of vision and modal tone which pervades the post-L’inno works demonstrates, I think, far more effectively what it was that Beckett was ‘pitching’ for, a new conceptual aspect to his prose, which re-emphasised its bibliographic aspects, the most fundamental of which was their brevity, or the appearance of an incompleteness, by virtue of being honed to sometimes less than five hundred words.

Differing categories of words seem like a radical, and the most fun, thing to quantify in the analysis of literary texts, as the words are what we came for, but the problem is similar to one that overtakes one who attempts to read a literary text word by word by word, and unpack its significance as one goes: overdetermination. Words are kaleidoscopic, and the longer you look at them, the more threatening their darkbloom becomes, the more they swallow, excrete, the more alive they are, all round. Which is fine. Letting new things into your life is what it should be about, until their attendant drawbacks become clear, and you start to become ambivalent about all the fat and living things you have in your head. You start to wish you read poems instead, rather than novels, which make you go mad, and worse, start to write them. The point is words breed words, and their connections are too easily traced by computer. There’s something else about knowing their exact correlations to a decimal point. They seem so obvious now.

Political Context to the Queen’s Theatre Visualisation Project

Here’s another blog post I did in which I try to sum up some one hundred years of Irish history in 500 words. I mostly fail; I think the most telling part is when I stop to admit what I’ve been saying has little pertinence to the overall project, which can be found here. I also have a few inaccuracies and incorrectly used words, but I do slam de Valera, which is fun.

This blog post provides a historical context for the Queen’s Theatre by outlining Ireland’s political and economic situation in the first half of the twentieth century. Events such as the 1916 Rising and the ensuing Civil War cast a long shadow over Irish political discourse even today, as can be seen by the ongoing controversy as to how best to celebrate the 1916 Rising, or whether such an event should even be celebrated.

In 1914, the failures of constitutional parliamentarians such as John Redmond to both secure a definite deal on Home Rule with the British government and assuage the anxieties of Unionists in the North of Ireland led to a situation that more fringe minorities could take advantage of, as is demonstrated by the formation of both the Ulster Volunteer Force and the Irish Volunteers. In this environment, the Irish Republican Brotherhood became increasingly radicalised, as exemplified by Patrick Pearse’s inflammatory rhetoric at Fenian leader Jeremiah O’Donovan Rossa’s funeral in 1915: “Life springs from death, and from the graves of patriotic men and women spring living nations.” A minority were determined to take advantage of the timing of the Great War. Others, such as the Irish Volunteers’ chief-of-staff Eoin MacNeill, were reluctant to adopt violence as a means to independence: “To my mind, those who feel impelled towards military action on any of the grounds that I have stated are really impelled by a sense of feebleness or dependency or fatalism, or by an instinct of satisfying their own emotions or escaping from a difficult…situation.”

Reactions to the Rising were multiple and varied. Many urban dwellers seized the opportunity in the immediate aftermath to loot a number of shops in the surrounding area. For some members of a younger generation, such as then-medical student and later IRA officer Ernie O’Malley, the occasion was stirring and brought about an increase in Volunteers. It was not until subsequent events relating to the Rising that public opinion began to soften with regard to the actions of the Volunteers. Among these events were J.C. Bowen-Colthurst’s shooting of Francis Sheehy-Skeffington during the Rising, the excessive measures of the British government against those responsible (fifteen executions) and Dublin Castle’s attempts to pin responsibility for the outbreak of violence on moderate parliamentarians.

In the Irish Free State created in the aftermath of the civil war, the maintenance of income from agriculture was regarded as crucial to further prosperity. An economic policy of protectionism was adopted, albeit an incoherent one. Tariffs on imported goods were established but with no attempt made to create a domestic industry of production. This policy, combined with a lack of funding for the development of employment schemes, led to widespread emigration. De Valera’s vision for rural Ireland as being made up of self-sustaining, frugal and anti-materialist family units ignored the metropolitan and anglicised lifestyle in urban centres such as Dublin, where 21.1% were employed in finance, 12% in administration, 13.7% in personal services and 32.2% in agricultural production. Economic growth remained sluggish throughout ‘the Emergency,’ for the obvious reasons.

How the Queen’s Theatre fits into a survey of Irish history of this kind can be difficult to quantify. Pearse’s uncompromising vision of an independent Ireland and ideologically driven economic mismanagement can seem to have little bearing on the function of the Queen’s Theatre as a venue for light entertainment. However, what is important to recall is that the Queen’s remained a site of cultural practice throughout many generations, and during one of the most tumultuous periods in Irish history, until it closed in 1966. It furthermore remained a Dublin landmark until 1969. When The Plough and the Stars (1926) was staged in the Queen’s, its political contentiousness perhaps did not match that of the earlier productions in the Abbey, when widows of victims of the Rising, including Hanna Sheehy-Skeffington, disrupted the performance, but it was in a city that within living memory had been the site of a divisive conflict. When the Abbey Theatre Company took up residence in the Queen’s, the Irish Free State was only twenty-nine years old. For projects like this Queen’s Theatre Visualisation Project, it is important that the space inhabited by the theatre, whether that space is physical or social, be reconstructed also.

Information on the history of the Abbey Theatre Company at the Queen’s can be found here.

Further Reading

Literary Context to the Queen’s Theatre Visualisation Project

The following is a blog post intended to establish the literary context to the Queen’s Theatre visualisation project, which I undertook as part of my MPhil in Digital Humanities and Culture. The project itself can be found here. I make an argument about a strong literary tradition being in some way a bad thing. I’m not sure what I was thinking. Very little.

The intention of this blog post is to provide a literary context for the Queen’s Theatre Project. This post deals with the Irish literary and cultural scene in the early twentieth century, which can seem to have a somewhat tangential relationship to the Queen’s Theatre itself. Nevertheless, it is hoped that this brief survey will prove illuminating to those who are unfamiliar with the development of Irish cultural nationalism. Furthermore, the range of this cultural watershed is not limited to the years in which it could be said to have taken place. Critics such as Anthony Cronin have argued that the movement set in motion by Lady Gregory, William Butler Yeats and others had a stultifying influence on the literary generations that followed. From the biographies and works produced by authors such as Flann O’Brien, Brendan Behan and Patrick Kavanagh, one can see the negative effects of a powerful literary tradition resonate into the 1950s.

In September 1897, Yeats, folklorist Lady Gregory and writer Edward Martyn began to plan the creation of an Irish National Theatre. It should be remembered that discussions of a cultural renaissance involving organisations such as this literary theatre or the Gaelic League reflect a political agenda shaped by a minority grouping of urban intelligentsia, while, as R.F. Foster writes, “life went on in eighteenth-century tenements [in Dublin city] bereft of water or sanitation.” Furthermore, the activities of the Irish National Theatre Society similarly reflect the niche interests of a small segment of society. Yeats’ intended audience was “that limited public which gives understanding,” and he records that he would “not mind greatly if others are bored.” Attendance at productions such as The Playboy of the Western World (1907) and Cathleen Ní Houlihan (1902) was far outstripped by the public’s interest in light-opera and music hall performances. As Christopher Morash writes in his A History of Irish Theatre 1601-2000 (2002), “on that same December night, as Maire Ní Shiubhlaigh was playing Cathleen Ní Houlihan…across the Liffey almost two thousand people were howling for the informer’s blood in Whitbread’s Sarsfield at the Queen’s.” This is, at least partially, the rationale for projects of this kind. By drawing attention to the more popular forms of Irish cultural life, it is possible that the oversights of Irish historiography can be corrected and the milieu of mid-twentieth century Dublin life can be reflected more accurately.

The events surrounding the reception of J.M. Synge’s The Playboy of the Western World further point to Yeats’ talents as regards the art of self-promotion. Yeats took a dim view of those who disrupted the second performance of the play, dismissing them as “commonplace and ignorant people,” who “had no books in their houses.” He also brought a sectarian dimension to the affair, drawing a line between the behaviour of the owners of the Irish Literary Theatre, mostly Protestants, and those disruptive members of the audience – and the public in general – objecting to the content of the play. For Yeats, their behaviour was indicative of characteristics inherent to members of the Catholic religion: “We have not such pliant bones, and did not learn in the houses that bred us a so suppliant knee.”

Much of the information we have about the Dublin literary scene at the time of the Celtic Revival and beyond has been obtained from the unpublished manuscript written by the architect and theatre fanatic Joseph Holloway. Holloway’s Impressions of a Dublin Playgoer (1895-1944) is a massive and rich resource containing a number of manuscript volumes in which he wrote extensive reviews and information about various performances he attended in almost all of Dublin’s theatres, such as the Abbey, the Queen’s and the Antient Concert Rooms. Holloway also designed the Abbey for the purposes of the Irish Literary Theatre and was commissioned to do so by Annie Horniman, a theatre manager and patron. For further information on Irish theatre, it is recommended to consult Holloway’s diaries and the texts provided in the Further Reading section below.

A Defense of Pragmatic Approaches to TEI mark-up

Hypertext essay

The Use of The Irish Language in the 1641 depositions

My Dissertation

I finished my dissertation – a quantitative analysis of the works of Samuel Beckett. There’s a copy available in Hodges & Figgis because I left one there.

Alternatively, here is the PDF.

