Tag Archives: Digital Humanities

A Statistical Analysis of the narrators of ‘Ulysses’ or ‘why ‘Ulysses’ isn’t wisdom literature’

The second time I read Ulysses, in advance of an undergraduate seminar, it was around the ninetieth anniversary of the original text’s publication. The newspapers were printing archive material relating to the novel, extended supplements about its importance from the usual quarters, as well as reviews of recently published monographs from both young and established scholars. Unfortunately, the critical trend of the time was to read Ulysses as wisdom literature. Critics urged prospective readers of the novel to wrest Joyce from the scholars and bring him ‘back to the people’. This school of thought treated Leopold Bloom as a model of the way in which the contemporary urban subject should be living: aloof, polite, well-intentioned but not dogmatic on political issues; moderately informed, but more often wrong; a reader, but not self-serious; an everyman. Ulysses’ structural indebtedness to cornerstones of The Canon such as William Shakespeare’s Hamlet and Homer’s The Odyssey frequently undergirds this line of argument, demonstrative in itself of how easily high literary art and everyday life may be set next to one another. This generally requires critics to treat the characters of Bloom and Stephen Dedalus as two opposites in need of each other. Each has a little to impart on life, love and literature, whether it be to reflect a little deeper on themselves or their marriage, move past their respective losses or to find in each other their lost son/father.

This interpretation of the novel reads it along a linear trajectory, as Stephen and Bloom come together to form Blephen and Stoom. Through computation it may be possible to examine the writing style of later chapters, and determine whether or not they bear formal witness to this change in character. We must first, however, consider the difficulty of locating where Joyce’s narrators actually are. Part of what makes Joyce’s writing style so distinctive is his use of free indirect discourse, a mode of writing in which the reality of the text is inflected by the consciousness(es) of the beholder(s). As such, putting a category on each episode of Ulysses as though it were narrated by one person or a combination of persons might seem reductive; it very much is. But in fusing computation and literature, certain assumptions have to be made.

In carrying out this analysis, I made use of R’s ‘Stylo’ package, which contains tools for breaking a number of texts into equal sizes, removing words which are not common to most samples, calculating the relative frequencies of these words, transforming these observations into new combinations of variables called ‘components’ with greater explanatory potential, and clustering them together. These words appear below:

These might seem like boring terms; as literary critics we tend to look past them to more evocative ones like ‘serpentine’ or ‘columbanus’. Unfortunately, in computational terms, it is the relative frequencies of these ‘particles’ or ‘function words’ that provide the most secure means of modelling a writer’s particular idiom. These samples were then plotted on a correlation matrix, which can be taken as an index of similarity, based on where they cluster:

The six different narrators of Ulysses appearing in the index above are:

‘Anon’, who narrates the episode ‘Cyclops’

‘Blephen’, a composite delineation for episodes in which both characters feature, such as ‘Circe’, ‘Eumaeus’, ‘Ithaca’ and ‘Oxen of the Sun’

Bloom, who narrates ‘Hades’, ‘Calypso’, ‘Lestrygonians’ and ‘The Lotus Eaters’

Gerty, who narrates at least half of ‘Nausicaa’ (this is a controversial point within the literature; it might be Bloom who is narrating for her)

Molly, who narrates the book’s final chapter ‘Penelope’,

and finally Stephen, who narrates the first three episodes ‘Telemachus’, ‘Nestor’ and ‘Proteus’, as well as the novel A Portrait of the Artist as a Young Man, which has been thrown in here for comparison.
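For readers curious about the mechanics, the sampling and counting steps described earlier (equal-sized samples, culling words that are not common to most samples, relative frequencies) can be sketched in a few lines. This is only an illustration in plain Python, not Stylo's own code: the function names, the sample size and the 80% cut-off are illustrative assumptions, and a simple Pearson correlation between samples stands in for the component analysis.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and return its runs of the letters a-z as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def equal_samples(tokens, size):
    """Cut a token list into consecutive samples of `size` words.

    Any shorter remainder at the end is dropped.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def common_words(samples, min_share=0.8):
    """Keep the words that occur in at least `min_share` of the samples.

    The 0.8 default is an assumption made for this sketch.
    """
    sample_counts = Counter()
    for sample in samples:
        sample_counts.update(set(sample))
    needed = min_share * len(samples)
    return sorted(w for w, n in sample_counts.items() if n >= needed)


def relative_frequencies(sample, vocabulary):
    """Percentage of the sample taken up by each vocabulary word."""
    counts = Counter(sample)
    return [100 * counts[w] / len(sample) for w in vocabulary]


def pearson(xs, ys):
    """Pearson correlation between two equal-length frequency profiles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a flat profile has no defined correlation
    return cov / (spread_x * spread_y)


# Usage sketch on a tiny stand-in text; real samples would run to thousands of words.
tokens = tokenize("the cat sat on the mat and the dog sat on the rug " * 40)
samples = equal_samples(tokens, 100)
vocabulary = common_words(samples)
profiles = [relative_frequencies(s, vocabulary) for s in samples]
similarity = pearson(profiles[0], profiles[1])
```

Samples whose profiles correlate strongly would sit close together on a plot of this kind; samples narrated in a different idiom would drift apart.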

Here’s the same plot as above with the labels more clearly indicated

The first thing we could note is the gender divide. Molly and Gerty both spread over to the right, with Molly as an outlier. Both are closer to the A Portrait samples, which are all taken from the earlier parts of that novel, than to any others, suggesting that Joyce writes women and young children using the same common words at similar rates. As the Gerty samples move through the episode, they move closer and closer to the Bloom cluster, visually confirming that the episode starts in Gerty’s voice before he takes over, and that Bloom doesn’t think much of women’s intelligence in the main either.

Overall we can say that there doesn’t look to be a fusing of perspectives here as such. Rather than the Blephen episodes meeting halfway between Stephen and Bloom, Stephen and Bloom already seem quite comfortably clustered at the novel’s outset. Based on the divide between Stephen’s episodes of Ulysses and A Portrait, we might say that the way in which Stephen narrates A Portrait is very different from the way in which he narrates Ulysses. This is justified, I think, by how sensitive the analysis is to changes in narrator, demonstrated by the Gerty/Bloom example already discussed, as well as the fact that the earlier part of ‘Aeolus’, in which Bloom is present, clusters with his samples, whereas the second part, after Stephen has entered, clusters with the Stephen samples.

Below is the plot with the Portrait samples removed:

Words Stephen’s narration is most likely to use in comparison to Bloom
Words Bloom’s narration is more likely to use in comparison to Stephen

There are a number of ways one could use these results to interrogate the notion of Ulysses as wisdom literature. We could begin by asking after the gendered aspects of the adjective ‘wise’, and ask why so many of these books which teach us how one might best live are written by men (and how tone-deaf this argument can sound because to read Ulysses one might almost think married women weren’t let out of the house) or we could ask what interests an Irish model of bourgeois respectability might serve, along the lines of an Irish ‘keep calm and carry on’ poster.

Ulysses as a guide to life risks rendering it a novel of parts coming together, the middle-class intellectual and the middle-class working stiff holding hands across whatever barricade is supposed to be dividing them. Not that I would go to the other extreme and frame it as one of dissolution. Ulysses’ shape is one I would be loath to put a vector to in fact; to say that Stephen and Bloom’s relationship moves from state a) to state b) would be too easy by half.

What makes Ulysses an interesting novel to me is its self-referentiality, the dialogue it establishes between the novel and its supposed referent of ‘real Dublin’, which is made most clear in ‘Circe’, but also in the book’s other failed attempts to understand itself, as in the cases of the characters referenced as being in particular places at particular times who may or may not be Bloom, the McIntosh mystery or the puzzle of crossing Dublin without passing a pub. In this context, I think ‘Eumaeus’ appearing as a stylistic outlier is significant.

It is in this episode that we get information about a sequence of coincidences, and resonant differences between Bloom and Stephen’s lives. The depth of these coincidences (which I won’t provide a summary of here, because I think they’re among the most poignant parts of the novel) gestures towards something a bit more cosmically ordered than the rest of the novel even as they take place within the circumscribed rituals of Irish urban middle-class life in the early twentieth century. ‘Eumaeus’ is written in a chill tone which most closely resembles that of a scientific paper, eliding the indirect discourse which ostensibly defines the rest of the text, and it is in the fact that these connections are raised here rather than anywhere else that the true interest in their relationship, such as it is, is to be found.

These connections, which remain unrealised by the two, rather than bring us to some Forsterian notion of connection, should raise instead questions of alienation and of their unity in separation. They present problems both epistemological and political, about how our reality is structured, the means through which it is circumscribed and how it is more defined by how little of it we are aware of rather than how much. Rather than teaching us ‘how to live’, Ulysses shows us how we do not live, how we probably won’t live and how it could so easily have been otherwise. It is no more an explanation for life than it is an explanation of itself, or Homer, or Ireland.


Joanna Walsh’s ‘Seed’

The first thing one notices about Joanna Walsh’s online novella Seed is the quality of the design. Seed’s aesthetic is very consistent, and was obviously designed with an eye to the material at hand. For all this we have its illustrator Charlotte Hicks to thank, as well as the digital publishing company responsible for designing the platform on which the text is hosted. Seed is optimised for iOS, and, as the site tells us, is probably better viewed there, but it can also be read on a laptop or a PC.

The reader begins by being presented with seventeen different plants which open up onto different lexia, with suggestive and minimalistic titles such as ‘Baby’, ‘Touch’ or ‘Red’. Each one gives a brief insight into the life of an eighteen-year-old woman living in a middle-class housing estate in suburban England, coming to terms with herself, her environment, the people around her and the reality of her incipient young adulthood. By presenting the reader with seventeen different starting points (ignoring the opening explanatory remarks for a moment), and the means of proceeding in any way they might choose, the text emulates the same provisional and tentative steps that the narrator concurrently takes in the development of her own identity. In an interview, Walsh explains that the rhizoidal orientation of the text provided her with the opportunity to disorientate the reader, and perhaps engender in them the same uncertainty that the protagonist of the novella may be feeling at any given time, so that the reader has:

no sense of reading left to right, of the weight of the book, of how far they were through, or, sometimes, of the direction within the narrative.

Seed is therefore doing very deliberate and self-conscious things with the particularities of its format, typical of texts which, overtly or otherwise, draw attention to their digitality. Insofar as a firm distinction can be drawn between these two facets of the work, Seed therefore introduces a coherence/tension between its form and its content.

In a design quirk which enables this sense of openness that Seed conveys, the reader has the option of changing the text’s visual interface in order to display differently-coloured vines intertwined between each of the plants. The colours refer to each lexia’s subject matter, and invert the standardised and industrial nature of colour-coding, a tendency, or obsession, that the narrator exhibits throughout the text:

Fruits in the supermarket. They’re a different species. Those strawberries all white in the middle all the year round, like crunchy peaches. Everything so shiny. Not a speck of earth anywhere. Why would there be? It goes straight from the formica shed to our formica kitchen. Once cut my mother wraps it in cling film and puts it in the fridge.

The narrator’s sustained attention to post-industrial artefacts, the symptoms of contemporary, or then-contemporary, suburban living, is the strongest aspect of Seed. The narrator’s oscillation between a tone of matter-of-fact inventory and syntax-rupturing anxiety enacts the very process of interpretation, and the fact that so much narrative time is deployed in coming to terms with such quotidian objects, made to seem strange by their presence in a narrative medium known for attention to other, less strange things, intensifies the effect:

The doves in our garden say something else no they say somewhere else from their tall perspective looking down on lawns mowed with stripes, somewhere nature isn’t the same kind we have round here.

The site’s drawing together of Seed’s structure and content finds a corollary in the text’s actual word usage. Walsh uses leitmotifs, particularly the names of plants or descriptions of colours, in order to string the units of text together in more subtle ways, without making use of an overt visual interface.

It should be noted that the text is not as radically discontinuous as it might at first seem, or certainly was not regarded as such by Walsh, who said the following in an interview:

I’ve been thinking about the authority I’m still claiming as an ‘author’ in Seed; despite the degree of reader-control offered by the project, it’s still a fairly traditional ‘authorial’ work.

I had to write Seed as a linear text to ensure it will read ok for anyone who wants to follow the temporal narrative. That said, I never write in a ‘linear’ fashion, but in one that resembles the Seed reading experience: I write phrases, notes, paragraphs, then brings them together on shuffle, until they work.

Walsh’s comments may be surprising for those familiar with her writing methodology, which involves the use of cut-ups, or other aleatoric methods which introduce an element of chance into the composition process. It is surprising also, for those who are familiar with the somewhat niche history of digital or hypertextual literature. For many of hypertext’s trailblazing practitioners, such as Shelley Jackson or Michael Joyce, the crux of hypertextual literature was the game-playing that new digital formats allowed the author to engage in as an absent centre of meaning, which expedited the then-extremely trendy dalliances with post-structuralist philosophy and critical theory in a digital context. Within Seed’s units of text after all, there is no opportunity for interaction, except insofar as the text requires you to turn the page. In an interview with Review31, Walsh described how Seed barely resembles a hypertext in the original sense of the term at all, and that it is much better understood as a traditional work focalised around the author’s vision.

This is true, firstly for the structural reasons already outlined, but also because Seed’s formal architecture is best understood as functioning in the same way as literary works in print do, in that they imply, or gesture, far more readily than they state directly. This is axiomatic for all novels worthy of the name, but it presents an interesting means of thinking about how narrative works in the context of Seed in particular. While it might seem to present some amount of freedom or capacity for interaction, Seed is in fact circumscribing you even as it offers the chance of liberation. This has a nice visual metaphor in Seed’s visual interface which deliberately places a number of other flowers beyond the reader’s reach in darkness, suggesting both the thwarted ambition to move beyond the text that we’re presented with, and, as I’ve said already, the myopia of the narrator in her own environment:

it’s a fairly tight work, and I’ve said what I wanted to say in it. I love the idea of locked passages: part of my intent was to create a feeling of implied space beyond what is described (isn’t that the intent of most novels, to create, in however abstract a sense, a ‘world’, even if ‘world’ means a set of conceptual parameters?). I’d like to do a print edition to see if and how the circle of nonlinearity could be squared.

Though we have the ability to read Seed in any order we might like, each section is up to five pages long, and therefore requires us to read chronologically for a far greater length of time than hypertexts of the nineties do. Whether this can be attributed to the now mainstream nature of micro-textual formats, which requires literature to aspire to something else is probably a question for others to answer. Personally speaking, if writers working digitally can produce works as good as Seed, I won’t be unduly detained by the sociological reasonings why.

How big are the words modernists use?

It’s a fairly straightforward question to ask, one which most literary scholars would be able to provide a halfway decent answer to based on their own readings: Ernest Hemingway, Samuel Beckett and Gertrude Stein are more likely to use short words, James Joyce, Marcel Proust and Virginia Woolf longer ones, with the rest falling somewhere between the two extremes.

Most Natural Language Processing textbooks or introductions to quantitative literary analysis demonstrate how the frequencies of the most common words in a corpus fall off roughly in inverse proportion to their rank (Zipf’s law), i.e. the most frequently occurring term will appear about twice as often as the second and about three times as often as the third, and so on. I was curious to see whether another process was at work for word lengths, and whether we can see a similar decline at work in modernist novels, or whether more ‘experimental’ authors visibly buck the trend. With some fairly elementary analysis in NLTK, and data frames carried over into R, I generated a visualisation which looked nothing like this one.*

*The previous graph had twice as many authors and was far too noisy, with not enough distinction between the colours to make it anything other than a headwreck to read.

In narrowing down the number of authors I was going to plot, I did incline myself more towards authors that I thought would be more variegated, getting rid of the ‘strong centre’ of modernist writing, not quite as prosodically charged as Marcel Proust, but not as brutalist as Stein either. I also put in a couple of contemporary writers for comparison, such as Will Self and Eimear McBride.

As we can see, after the rather disconnected percentages of corpora that use one letter words, with McBride and Hemingway on top at around 25%, and Stein a massive outlier at 11%, things become increasingly harmonious, and the longer the words get, the more the lines of the vectors coalesce.

Self and Hemingway dip rather egregiously with regard to their use of two-letter words (which is almost definitely because of a mutual disregard for a particular word, I’m almost sure of it), but it is Stein who exponentially increases her usage of two and three letter words. As my previous analyses have found, Stein is an absolute outlier in every analysis.

By the time the words are ten letters long, true to form it’s Self whose writing is the only one above 1%.

Literary Cluster Analysis

I: Introduction

My PhD research will involve arguing that there has been a resurgence of modernist aesthetics in the novels of a number of contemporary authors. These authors are Anne Enright, Will Self, Eimear McBride and Sara Baume. All these writers have at various public events and in the course of many interviews, given very different accounts of their specific relation to modernism, and even if the definition of modernism wasn’t totally overdetermined, we could spend the rest of our lives defining the ways in which their writing engages, or does not engage, with the modernist canon. Indeed, if I have my way, this is what I will spend a substantial portion of my life doing.

It is not in the spirit of reaching a methodology of greater objectivity that I propose we analyse these texts through digital methods; having begun my education in statistical and quantitative methodologies in September of last year, I can tell you that these really afford us no *better* a view of any text than just reading it would, but fortunately I intend to do that too.

This cluster dendrogram was generated in R, and owes its existence to Matthew Jockers’ book Text Analysis with R for Students of Literature, from which I developed a substantial portion of the code that creates the output above.

What the code is attentive to is the words that these authors use the most. When analysing literature qualitatively, we tend to have a magpie sensibility, zoning in on words which produce more effects or stand out in contrast to the literary matter which surrounds them. As such, the ways in which a writer would use the words ‘the’, ‘an’, ‘a’, or ‘this’ tend to pass us by, but they may be far more indicative of a writer’s style, or at least of style in the way that a computer would be attentive to it; sentences that are ‘pretty’ are generally statistically insignificant.

II: Methodology

Every corpus that you can see in the above image was read into R, and then run through a script which counted the number of times every word was used in the text. The resulting figure is called the word’s frequency, and was then converted to its relative frequency by dividing the figure by the total number of words in the corpus and multiplying the result by 100. Every word with a relative frequency above a certain threshold was put into a matrix, and a function was used to cluster the corpora together based on the similarity of the figures they contained, according to a Euclidean metric I don’t fully understand.

The final matrix was 21 X 57, and compared these 21 corpora on the basis of their relative usage of the words ‘a’, ‘all’, ‘an’, ‘and’, ‘are’, ‘as’, ‘at’, ‘be’, ‘but’, ‘by’, ‘for’, ‘from’, ‘had’, ‘have’, ‘he’, ‘her’, ‘him’, ‘his’, ‘I’, ‘if’, ‘in’, ‘is’, ‘it’, ‘like’, ‘me’, ‘my’, ‘no’, ‘not’, ‘now’, ‘of’, ‘on’, ‘one’, ‘or’, ‘out’, ‘said’, ‘she’, ‘so’, ‘that’, ‘the’, ‘them’, ‘then’, ‘there’, ‘they’, ‘this’, ‘to’, ‘up’, ‘was’, ‘we’, ‘were’, ‘what’, ‘when’, ‘which’, ‘with’, ‘would’, and ‘you’.
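To make the Euclidean step a little less opaque, here is a minimal sketch in plain Python rather than the R code behind the dendrogram. Each corpus becomes a row of relative frequencies, and the distance between two rows is the square root of the summed squared differences between their figures. The function names and the toy data are assumptions made for illustration; a full hierarchical clustering would go on merging groups after the first pair.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def relative_frequency_row(tokens, feature_words):
    """One matrix row: each feature word's count per 100 words of the corpus."""
    counts = Counter(tokens)
    return [100 * counts[word] / len(tokens) for word in feature_words]


def euclidean(row_a, row_b):
    """Square root of the summed squared differences between two rows."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(row_a, row_b)))


def closest_pair(rows):
    """Return the names of the two corpora whose rows are nearest.

    `rows` maps a corpus name to its row; the pair returned is the first
    merge an agglomerative clustering would make.
    """
    return min(
        combinations(sorted(rows), 2),
        key=lambda pair: euclidean(rows[pair[0]], rows[pair[1]]),
    )


# Usage sketch with three invented mini-corpora and three feature words.
features = ["the", "and", "a"]
rows = {
    "first": relative_frequency_row("the cat and the dog".split(), features),
    "second": relative_frequency_row("the bird and the fish".split(), features),
    "third": relative_frequency_row("a a a a rose".split(), features),
}
nearest = closest_pair(rows)
```

Two corpora that use every feature word at exactly the same rate sit at distance zero, and every difference in usage pushes them further apart.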

Anyway, now we can read the dendrogram.

III: Interpretation

Speaking about the dendrogram in broad terms can be difficult for precisely the reason that I indicated above; quantitative/qualitative methodologies for text analysis are totally opposed to one another, but what is obvious is that Eimear McBride and Gertrude Stein are extreme outliers, and comparable only to each other. This is in one way unsurprising, because of their brutish, repetitive styles, and is in other ways very surprising, because McBride is on record as dismissing Stein’s work for being ‘too navel-gaze-y.’

Jorge Luis Borges and Marcel Proust have branched off in their own direction, as has Sara Baume, which I’m not quite sure what to make of. Franz Kafka, Ernest Hemingway and William Faulkner have formed their own nexus. More comprehensible is the Anne Enright, Katherine Mansfield, D.H. Lawrence, Elizabeth Bowen, F. Scott Fitzgerald and Virginia Woolf cluster; one could make admittedly sweeping judgements about how this could be said to be modernism’s extreme centre, in which the radical experimentalism of its more revanchiste wing was fused rather harmoniously with nineteenth-century social realism, which produced a kind of indirect discourse, at which I think each of these authors excels.

These revanchistes are well represented in the dendrogram’s right wing, with Flann O’Brien, James Joyce, Samuel Beckett and Djuna Barnes having clustered together, though I am not quite sure what to make of Ford Madox Ford/Joseph Conrad’s showing at all, being unfamiliar with the work.

IV: Conclusion

The basic rule in interpreting dendrograms is that the closer the ‘leaves’ reach the bottom, the more similar they can be said to be. Therefore, Anne Enright and Will Self are the contemporary modernists most closely aligned to the forebears, if indeed forebears they can be said to be. It would be harder, from a quantitative perspective, to align Sara Baume with this trend in a straightforward manner, and McBride only seems to correlate with Stein because of how inalienably strange their respective prose styles are.

The primary point to take away here, if there is one, is that more investigations are required. The analysis is hardly unproblematic. For one, the corpus sizes vary enormously. Borges’ corpus is around 46 thousand words, whereas Proust reaches somewhere around 1.2 million. In one way, the results are encouraging: Borges and Barnes, two authors with only one text in their corpus, aren’t prevented from being compared to novelists with serious word counts. In another way, it is pretty well impossible to derive literary measurements from texts without taking their length into account. The next stage of the analysis will probably involve breaking the corpora up into units of 50 thousand words, so that the results for individual novels can be compared.

Can a recurrent neural network write good prose?

At this stage in my PhD research into literary style I am looking to machine learning and neural networks, and moving away from stylostatistical methodologies, partially out of fatigue. Statistical analyses are intensely process-based and always open, it seems to me, to fairly egregious ‘nudging’ in the name of reaching favourable outcomes. This brings a kind of bathos to some statistical analyses, as they account, to a greater extent than I’d like, for methodology and process, with the result that the novelty these approaches might have brought us is neglected. I have nothing against this emphasis on process necessarily, but I do also have a thing for outcomes, as well as the mysticism and relativity machine learning can bring, alienating us as it does from the process of the script’s decision making.

I first heard of the sci-fi writer from a colleague of mine in my department. It’s Robin Sloan’s plug-in for the text editor Atom which allows you to ‘autocomplete’ texts based on your input. After sixteen hours of installing, uninstalling, moving directories around and looking up stackoverflow, I got it to work. I typed in some Joyce and got stuff about Chinese spaceships as output, which was great, but science fiction isn’t exactly my area, and I wanted to train the network on a corpus of modernist fiction. Fortunately, I had the complete works of Joyce, Virginia Woolf, Gertrude Stein, Sara Baume, Anne Enright, Will Self, F. Scott Fitzgerald, Eimear McBride, Ernest Hemingway, Jorge Luis Borges, Joseph Conrad, Ford Madox Ford, Franz Kafka, Katherine Mansfield, Marcel Proust, Elizabeth Bowen, Samuel Beckett, Flann O’Brien, Djuna Barnes, William Faulkner & D.H. Lawrence to hand.

My understanding of this recurrent neural network, such as it is, runs as follows. The script reads the entire corpus of over 100 novels, and calculates the distance that separates every word from every other word. The network then hazards a guess as to what word follows the word or words that you present it with, then validates this against what actually follows. It then does so over and over and over, getting ‘better’ at predicting each time. The size of the corpus is significant in determining the length of time this will take, and mine required something around twelve days. I had to cut it off after twenty-four hours because I was afraid my laptop wouldn’t be able to handle it. At this point it had carried out the process 135000 times, just below 10% of the full process. Once I get access to a computer with better hardware I can look into getting better results.
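The guess, check and adjust loop described above can be illustrated with something far smaller than a recurrent neural network. The sketch below is emphatically not the plug-in's model: it is a character-level stand-in that predicts the next character from the current one alone, with no memory of anything earlier. The function name, learning rate and step count are all assumptions; the point is only to show predictions improving as the loop repeats.

```python
import math
import random


def train_char_bigram(text, steps=2000, learning_rate=0.5, seed=0):
    """Learn to guess the next character from the current one.

    Returns the character list, the learned score table and the loss
    recorded at each step (lower loss means a better guess).
    """
    chars = sorted(set(text))
    index = {c: i for i, c in enumerate(chars)}
    size = len(chars)
    # scores[i][j]: how strongly character j is expected after character i
    scores = [[0.0] * size for _ in range(size)]
    pairs = [(index[a], index[b]) for a, b in zip(text, text[1:])]
    rng = random.Random(seed)
    losses = []
    for _ in range(steps):
        current, actual = rng.choice(pairs)
        row = scores[current]
        # Guess: turn the scores into probabilities (softmax).
        top = max(row)
        exps = [math.exp(s - top) for s in row]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Check: how surprised was the model by what actually came next?
        losses.append(-math.log(probs[actual]))
        # Adjust: nudge the scores towards the character that did follow.
        for j in range(size):
            target = 1.0 if j == actual else 0.0
            row[j] -= learning_rate * (probs[j] - target)
    return chars, scores, losses


# Usage sketch: the average loss falls as the loop repeats.
chars, scores, losses = train_char_bigram("the cat sat on the mat " * 20)
early = sum(losses[:200]) / 200
late = sum(losses[-200:]) / 200
```

A recurrent network runs the same kind of loop, but carries a running summary of everything it has read so far, which is what lets it sustain phrases rather than single letter pairs.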

How this will feed into my thesis remains nebulous; I might move in a sociological direction and take survey data on how closely readers reckon the final result approximates literary prose. But at this point I’m interested in what impact it might conceivably have on my own writing. I am currently trying to sustain progress on my first novel alongside my research, so, in a self-interested enough way, I pose the question, can neural networks be used in the creation of good prose?

There have been many books written on the place of cliometric methodologies in literary history. I’m thinking here of William S. Burroughs’ cut-ups, Mallarmé’s infinite book of sonnets, and the brief flirtation the literary world had with hypertext in the 90’s, but beyond the avant-garde, I don’t think I could think of an example of an author who has foregrounded their use of numerical methods of composition. A poet friend of mine has dabbled in this sort of thing but finds it expedient to not emphasise the aleatory aspect of what she’s doing, as publishers tend to give a frosty reception when their writers suggest that their work is automated to some extent.

And I can see where they’re coming from. No matter how good they get at it, I’m unlikely to get to a point where I’ll read automatically generated literary art. Speaking for myself, when I’m reading, it is not just about the words. I’m reading Enright or Woolf or Pynchon because I’m as interested in them as I am in what they produce. How synthetic would it be to set Faulkner and McCarthy in conversation with one another if their congruencies were wholly manufactured by outside interpretation or an anonymous algorithmic process as opposed to the discursive tissue of the literary sphere, if a work didn’t arise from material and actual conditions? I know I’m making a lot of value-based assessments here that wouldn’t have a place in academic discourse, and on that basis what I’m saying is indefensible, but the probabilistic infinitude of it bothers me too. When I think about all the novelists I have yet to read I immediately get panicky about my own death, and the limitless possibilities of neural networks to churn out tomes and tomes of literary data in seconds just seems to me to exacerbate the problem.

However, speaking outside of my reader-identity, as a writer, I find it invigorating. My biggest problem as a writer isn’t writing nice sentences; given enough time I’m more than capable of that. The difficulty is finding things to wrap them around. Mood, tone, image, aren’t daunting, but a text’s momentum, the plot, I suppose, eludes me completely. It’s not something that bothers me; I consider plot to be a necessary evil, and resent novels that suspend information in a deliberate, keep-you-on-the-hook sort of way, but the ‘what next’ of composition is still a knotty issue.

The generation of text could be a useful way of getting an intelligent prompt that stylistically ‘borrows’ from a broad base of literary data, smashing words and images together in a generative manner to get the associative faculties going. I’m not suggesting that these scripts would be successful were they autonomous, I think we’re a few years off one of these algorithms writing a good novel, but I hope to demonstrate that my circa 350 generated words would be successful in facilitating the process of composition:

be as the whoo, put out and going to Ingleway effect themselves old shadows as she was like a farmers of his lake, for all or grips — that else bigs they perfectly clothes and the table and chest and under her destynets called a fingers of hanged staircase and cropping in her hand from him, “never married them my said?” know’s prode another hold of the utals of the bright silence and now he was much renderuched, his eyes. It was her natural dependent clothes, cattle that they came in loads of the remarks he was there inside him. There were she was solid drugs.

“I’m sons to see, then?’ she have no such description. The legs that somewhere to chair followed, the year disappeared curl at an entire of him frwented her in courage had approached. It was a long rose of visit. The moment, the audience on the people still the gulsion rowed because it was a travalious. But nothing in the rash.

“No, Jane. What does then they all get out him, but? Or perfect?”

“The advices?”

Of came the great as prayer. He said the aspect who, she lay on the white big remarking through the father — of the grandfather did he had seen her engoors, came garden, the irony opposition on his colling of the roof. Next parapes he had coming broken as though they fould

has a sort. Quite angry to captraita in the fact terror, and a sound and then raised the powerful knocking door crawling for a greatly keep, and is so many adventored and men. He went on. He had been her she had happened his hands on a little hand of a letter and a road that he had possibly became childish limp, her keep mind over her face went in himself voice. He came to the table, to a rashes right repairing that he fulfe, but it was soldier, to different and stuff was. The knees as it was a reason and that prone, the soul? And with grikening game. In such an inquisilled-road and commanded for a magbecross that has been deskled, tight gratulations in front standing again, very unrediction and automatiled spench and six in command, a

I don’t think I’d be alone in thinking that there’s some merit in parts of this writing. I wonder if there’s an extent to which Finnegans Wake has ‘tainted’ the corpus somewhat, because stylistically, I think that’s the closest analogue to what could be said to be going on here. Interestingly, it seems to be formulating its own puns, words like ‘unrediction,’ ‘automatiled spench’ (a tantalising meta-textual reference I think) and ‘destynets’, I think, would all be reminiscent of what you could expect to find in any given section of the Wake, but they don’t turn up in the corpus proper, at least according to a ctrl + f search. What this suggests to me is that the algorithm is plotting relationships on the level of the character, as well as phrasal units. However, I don’t recall the sci-fi model turning up paragraphs that were quite so disjointed and surreal — they didn’t make loads of sense, but they were recognisable, as grammatically coherent chunks of text. Although this could be the result of working with a partially trained model.

So, how might they feed our creative process? Here’s my attempt at making nice sentences out of the above.

— I have never been married, she said. — There’s no good to be gotten out of that sort of thing at all.

He’d use his hands to do chin-ups, pull himself up over the second staircase that hung over the landing, and he’d hang then, wriggling across the awning it created over the first set of stairs, grunting out eight to ten numbers each time he passed, his feet just missing the carpeted surface of the real stairs, the proper stairs.

Every time she walked between them she would wonder which of the two that she preferred. Not the one that she preferred, but the one that were more her, which one of these two am I, which one of these two is actually me? It was the feeling of moving between the two that she could remember, not his hands. They were just an afterthought, something cropped in in retrospect.

She can’t remember her sons either.

Her life had been a slow rise, to come to what it was. A house full of men, chairs and staircases, and she wished for it now to coil into itself, like the corners of stale newspapers.

The first thing you’ll notice about this is that it is a lot shorter. I started off by translating the above, as far as possible, into ‘plain words’ while remaining faithful to the n-grams I liked, like ‘bright silence’, ‘old shadows’ and ‘great as prayer’. In order to create images that play off one another, and to account for the dialogue, sentences that seemed to be doing similar things began to cluster together, so paragraphs organically started to shrink. Ultimately, once the ‘purpose’ of what I was doing started to come out (a critique of bourgeois values, memory loss), the nice phrasal units started to become spurious, and the eight or so paragraphs collapsed into the three and a half above. This is also one of my biggest writing issues: I’ll type three full pages and after the editing process they’ll come to no more than 1.5 paragraphs, maybe?

The thematic sense of dislocation and fragmentation could be a product of the source material, but most things I write are about substance-abusing depressives with broken brains cos I’m a twenty-five year old petit-bourgeois male. There’s also a fairly pallid Enright vibe to what I’ve done with the above, I think the staircases line could come straight out of The Portable Virgin.

Maybe a more well-trained corpus could provide better prompts, but overall, if you want better results out of this for any kind of creative praxis, it’s probably better to be a good writer.

Modelling Humanities Data Blog Post #1: Deleuze, Descartes and Data to Knowledge

While dealing with the distinctions between data, knowledge and information in class, a pyramidal hierarchy was proposed, which can be seen on the left. This diagram discloses the process of making data (which have been defined as ‘facts’ which exist in the world) into information, and thereafter knowledge. These shifts from one state to another are not as neat as the diagram might suggest; it is just one interpretation giving shape to a highly dynamic and unsettled process, and any movement from one of these levels to another is fraught. It is ‘a bargaining system,’ as every dataset has its limitations and aporias, not to speak of the process of interpretation or subsequent dissemination. This temporal dimension to data, its translation from a brute state, is too often neglected within certain fields of study, fields in which data is more often understood as unambiguous, naturally hierarchicalised, and not open to contextualisation or debate.

This blog post aims to consider these issues within the context of a dataset obtained from The Central Statistics Office. The dataset contains information relating to the relative risk of falling into poverty based on one’s level of education between the years 2004 and 2015 inclusive. The data was analysed through use of the statistical analysis interface SPSS.

The purpose of the CSO is to compile and disseminate information relating to economic and social conditions within the state in order to give direction to the government in the formulation of policy. Therefore it was decided that the most pertinent information to be derived from the dataset would be the correlations between level of education and the likelihood of falling into poverty. The results appear below.

Correlation Between Risk of Poverty and Level of Education Achieved

Correlation Between Consistent Poverty (%) and Level of Education Received

Correlation Between Deprivation Rate (%) and Level of Education Received

Poverty Risk Based on Education Level

Deprivation Rate Based on Education Level

Consistent Poverty Rate based on Education Level

It can be seen that there is a very strong negative correlation between one’s level of education and one’s risk of exposure to poverty; the higher one ascends through the education system, the less likely it is one will fall into economic liminality. This is borne out both in the bar charts and the correlation tables, the latter of which yield p-values reported by SPSS as .000 (that is, p < .001), underlining the statistical significance of the finding. It should be noted that both graphing the data and detecting correlations through use of Spearman’s rho are elementary statistical procedures, but as the trend revealed here is consistent with more elaborate modelling of the relationship,[1] the parsimonious analysis carried out here is all that is required.

It should not be assumed that, just because these graphs are informative, it is impossible to garner information from data in any other way. Even in its primary state, as it appears on the website, one could obtain information from a dataset through qualitative means. It is unlikely that this information will be as coherent as that which can be gleaned from even the most basic graph, but it is important to emphasise the fact that the border that separates data from information is fluid.

It is unlikely to be a novel finding that those who have a third level education have higher incomes than those who do not; there is a robust body of research detailing the many benefits of attending university.[2] Therefore, can it be said that the visualisation of the dataset above has contributed to knowledge? One would answer this question relative to one’s initial research question, and how the information complicates or advances it. If the causal relationship between exposure to poverty and level of education has been confirmed, and a government agency makes the recommendation that further investment in educational support programmes is necessary, it is somewhere in this process that the boundary separating information from knowledge has been crossed.

The above diagram actualises the temporal nature of data to a greater extent than the pyramid, but in doing so it perpetuates a linearisation of the process, a line along which René Descartes’ notion of thought could be said to align. Descartes understood thought as a positive function which tends towards the good and toward truth. This ‘good sense’, allows us to ‘judge correctly and to distinguish the true from the false’.[3] Gilles Deleuze believes Descartes instantiates a model of thought which is oppressive, and which perceives thinking relative to external needs and values rather than in its actuality: ‘It cannot be regarded as fact that thinking is the natural exercise of a faculty, and that this faculty is possessed of a good nature and a good will.’[4]

In Deleuze’s conception, thought takes on a sensual disposition, reversing the Cartesian notion of mental inquiry beginning from a state of disinterestedness in order to arrive at a moment at which one recognises ‘rightness’. Deleuze argues that there is no such breakthrough moment or established methodology to thought, and argues for regarding it as more invasive, or unwelcome, a point of encounter when ‘something in the world forces us to think.’[5]

Rather than taking the neat, schematic movement from capturing data to modelling to interpreting for granted, Deleuze is engaged by these moments of crisis, points just before or just after the field of our understanding is qualitatively transformed into something different:

How else can one write but of those things which one doesn’t know, or know badly?…We write only at the frontiers of our knowledge, at the border which separates our knowledge from our ignorance and transforms one into the other.[6]

Deleuze’s comments have direct bearing upon our understanding of data, and how they should be understood within the context of the wider questions we ask of them. Deleuze argues that, ‘problems must be considered not as ‘givens’ (data) but as ideal ‘objecticities’ possessing their own sufficiency and implying acts of constitution and investment in their respective symbolic fields.’[7] While it is possible that Deleuze would risk overstating the case, were we to apply his theories to this dataset, it is nonetheless crucial to recall that data, and the methodologies we use to unpack and present them participate in wider economies of significance, ones with indeterminate horizons.

Notes

[1] Department for Business, Innovation and Skills, ‘BIS Research Paper №146: The Benefits of Higher Education and Participation for Individuals and Society: Key Findings and Reports’, (Department for Business, Innovation and Skills: 2013) https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/254101/bis-13-1268-benefits-of-higher-education-participation-the-quadrants.pdf

[2] OECD, Education Indicators in Focus, (OECD: 2012) https://www.oecd.org/education/skills-beyond-school/Education%20Indicators%20in%20Focus%207.pdf

[3] Descartes, René, Discourse on the Method of Rightly Conducting the Reason, and Seeking Truth in the Sciences (Gutenberg: 2008), http://www.gutenberg.org/files/59/59-h/59-h.htm

[4] Deleuze, Gilles, Difference and Repetition (Bloomsbury Academic: 2016), p.175

[5] Ibid.

[6] Ibid, p. xviii

[7] Ibid, p.207

Bibliography

Deleuze, Gilles, Difference and Repetition (Bloomsbury Academic: 2016)

Department for Business, Innovation and Skills, ‘BIS Research Paper №146: The Benefits of Higher Education and Participation for Individuals and Society: Key Findings and Reports’, (Department for Business, Innovation and Skills: 2013) https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/254101/bis-13-1268-benefits-of-higher-education-participation-the-quadrants.pdf

Descartes, René, Discourse on the Method of Rightly Conducting the Reason, and Seeking Truth in the Sciences (Gutenberg: 2008), http://www.gutenberg.org/files/59/59-h/59-h.htm

OECD, Education Indicators in Focus, (OECD: 2012) https://www.oecd.org/education/skills-beyond-school/Education%20Indicators%20in%20Focus%207.pdf

Statistical Correlations in avant-garde prose writing

The question that this blog post sets itself is: what differences and similarities can be detected in modernist and contemporary authors on the basis of three stylistic variables (hapax, unique and ambiguity), and how are these stylistic variables related to one another?

I: The Data

The data to be analysed in this project were derived from an analysis of twenty-one corpora of avant-garde literary prose through use of the open-source programming language R. The complete works of the authors James Joyce, Virginia Woolf, Gertrude Stein, Sara Baume, Anne Enright, Will Self, F. Scott Fitzgerald, Eimear McBride, Ernest Hemingway, Jorge Luis Borges, Joseph Conrad, Ford Madox Ford, Franz Kafka, Katherine Mansfield, Marcel Proust, Elizabeth Bowen, Samuel Beckett, Flann O’Brien, Djuna Barnes, William Faulkner & D.H. Lawrence were used.

Seventeen of these writers were active between the years 1895 and 1968, a period of time associated with a genre of writing referred to as ‘modernist’ within the field of literary criticism. The remaining four are still living, and have published novels as early as 1991 and as late as 2016. These novelists are known for their identification as latter-day modernists, and perceive their novels as re-engaging with the modernist aesthetic in a significant way.

I.II Uniqueness

The unique variable is a generally accepted measurement used within digital literary criticism to quantify the ‘richness’ of a particular text’s vocabulary. Uniqueness is obtained by dividing the number of distinct word types in a text by the total number of words. For example, if a novel contained 20000 word types, but 100000 total words, this text’s uniqueness would be calculated as follows:

Uniqueness = 20000 / 100000 = 0.2
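The study itself was carried out in R; purely as an illustration, a minimal Python sketch of the same ratio might look like the following. The tokenisation rule (lowercasing, keeping runs of letters and internal apostrophes) and the function names are my own assumptions, not the scheme used to produce the figures in this post.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (an assumed, very simple scheme)."""
    return re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())

def uniqueness(text):
    """Distinct word types divided by total words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)

sample = "the cat sat on the mat and the dog sat too"
# 8 distinct types over 11 tokens
print(uniqueness(sample))
```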

I.III Ambiguity

Ambiguity is a measure used to calculate the approximate obscurity of a text, or the extent to which it is composed of indefinite pronouns. The indefinite pronouns quantified in this study are as follows: ‘another’, ‘anybody’, ‘anyone’, ‘anything’, ‘each’, ‘either’, ‘enough’, ‘everybody’, ‘everyone’, ‘everything’, ‘little’, ‘much’, ‘neither’, ‘nobody’, ‘no one’, ‘nothing’, ‘one’, ‘other’, ‘somebody’, ‘someone’, ‘something’, ‘both’, ‘few’, ‘everywhere’, ‘somewhere’, ‘nowhere’, ‘anywhere’, ‘many’, ‘others’, ‘all’, ‘any’, ‘more’, ‘most’, ‘none’, ‘some’, ‘such’. The formula for ambiguity is:

number of indefinite pronouns / number of total words
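As a rough Python sketch of this ratio (again, not the R code behind the study), one could count tokens against the list above. The tokeniser is an assumption, and so is the decision to leave ‘no one’ out of the lookup set: because ‘one’ is itself on the list, every ‘no one’ is already counted once through its second word.

```python
import re

# Single-word indefinite pronouns from the list in the post.
# 'no one' is deliberately omitted: its second word, 'one', is already counted.
INDEFINITE = {
    "another", "anybody", "anyone", "anything", "each", "either", "enough",
    "everybody", "everyone", "everything", "little", "much", "neither",
    "nobody", "nothing", "one", "other", "somebody", "someone", "something",
    "both", "few", "everywhere", "somewhere", "nowhere", "anywhere", "many",
    "others", "all", "any", "more", "most", "none", "some", "such",
}

def tokenize(text):
    """Split text into lowercase word tokens (an assumed, very simple scheme)."""
    return re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())

def ambiguity(text):
    """Indefinite pronouns divided by total words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in INDEFINITE)
    return hits / len(tokens)

sample = "Nothing was said and no one knew anything about either of them"
# hits: nothing, one, anything, either (4 of 12 tokens)
print(ambiguity(sample))
```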

I.IV Hapax

Finally, the hapax variable calculates the density of hapax legomena, words which appear only once in a particular author’s oeuvre. The formula for this variable is:

number of hapax legomena / number of total words
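A hedged Python sketch of the hapax calculation, using the standard library's `Counter` to find words that occur exactly once. The tokeniser and function names are illustrative assumptions; the study's own figures came from R.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (an assumed, very simple scheme)."""
    return re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())

def hapax_density(text):
    """Words occurring exactly once, divided by total words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hapaxes = [word for word, n in counts.items() if n == 1]
    return len(hapaxes) / len(tokens)

sample = "the cat sat on the mat and the dog sat too"
# hapaxes: cat, on, mat, and, dog, too (6 of 11 tokens)
print(hapax_density(sample))
```

For an author-level figure, the text passed in would be the whole oeuvre concatenated, so that a word counts as a hapax only if it appears once across all of that author's works.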

a bar chart giving an overview of the data

II: Data Overview

Even before analysing the data in any depth, it stands to reason that these variables are interrelated. Hapax and unique are best understood as indications of a text’s heterogeneity: if a text is hapax-rich, its score for uniqueness will be similarly elevated. Ambiguity, as it counts a set of pre-defined words, can be considered a measure of a text’s homogeneity, and if the occurrences of these commonplace words increase, hapax and uniqueness will be negatively affected. The aim of this study is to determine, first, how these measures vary according to the time frame in which the texts were written, i.e. across modern and contemporary corpora; second, which correlations between stylistic variables exist; and third, which of the three is most subject to the fluctuations of another.

more overviews for each variable

IV.I: The Three Groups Hypothesis

A number of things are clear from these representations of the data. The first finding is that the authors fall into approximately three distinct groups. The first is the base-level of early twentieth-century modernist authors, who are all relatively undifferentiated. These are Ernest Hemingway, Virginia Woolf, William Faulkner, Elizabeth Bowen, Marcel Proust, F. Scott Fitzgerald, D.H. Lawrence, Joseph Conrad and Ford Madox Ford. They are all below the mean for the hapax and unique variables.

boxplot of outliers for the unique hapax variable

The second group reach into more extreme values for unique and hapax. These are Djuna Barnes, Jorge Luis Borges, Franz Kafka, Flann O’Brien, James Joyce, Eimear McBride and Sara Baume. Three of these authors are even outliers for the hapax variable, which can be seen in the box plot.

Joyce’s position as an extreme outlier in this context is probably due to his novel Finnegans Wake (1939), which was written in an amalgam of English, French, Irish, Italian and Norwegian. It’s no surprise then, that Joyce’s value for hapax is so high. The following quotation may be sufficient to give an indication of how eccentric the language of the novel is:

La la la lach! Hillary rillarry gibbous grist to our millery! A pushpull, qq: quiescence, pp: with extravent intervulve coupling. The savest lauf in the world. Paradoxmutose caring, but here in a present booth of Ballaclay, Barthalamou, where their dutchuncler mynhosts and serves them dram well right for a boors’ interior (homereek van hohmryk) that salve that selver is to screen its auntey and has ringround as worldwise eve her sins (pip, pip, pip)

Though Borges’ and Barnes’ prose may not be as far removed from modern English as Finnegans Wake, both of these authors are known for their highly idiosyncratic use of language; Borges for his use of obscure terms derived from archaic sources, and Barnes for reversing normative grammatical and syntactic structures in unique ways.

The third and final group may be thought of as an intermediary between these two extremes, and these are Katherine Mansfield, Samuel Beckett, Will Self and Anne Enright. These authors share characteristics of both groups, in that their values for ambiguity remain stable, while their uniqueness and hapax counts are far more pronounced than those of the first group without reaching the values of the second.

boxplot displaying stein as an extreme outlier for ambiguity

Gertrude Stein is the only author whose stylistic profile doesn’t quite fit into any of the three groups. She is perhaps best thought of as most closely analogous to the first group of early twentieth-century modernists, but her extreme value for ambiguity should be sufficient to distinguish her in this regard.

The value for ambiguity remains fairly stable throughout the dataset: the standard deviation is 0.03, and if Stein’s values are removed, it narrows to 0.01.
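To show how a single extreme score can inflate a standard deviation in this way, here is a small Python sketch. The scores and author labels are invented for illustration and are not the study's data; `stdev` from the standard library computes the sample standard deviation.

```python
from statistics import stdev

# Invented ambiguity scores for illustration only; not the study's data.
scores = {
    "Author A": 0.05,
    "Author B": 0.06,
    "Author C": 0.05,
    "Author D": 0.07,
    "Author E": 0.06,
    "Outlier": 0.15,
}

def spread_with_and_without(scores, excluded):
    """Return the sample standard deviation with, then without, one author."""
    everyone = list(scores.values())
    rest = [value for name, value in scores.items() if name != excluded]
    return stdev(everyone), stdev(rest)

with_outlier, without_outlier = spread_with_and_without(scores, "Outlier")
print(round(with_outlier, 3), round(without_outlier, 3))
```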

Two disclaimers need to be made about this general account from the descriptive statistics and graphs. The first is that there is a fundamental issue with making such a schematic account of these texts. The grouping approach that this project has taken thus far is insufficiently nuanced as it could probably be argued that McBride could just as easily fit into the third group as the second. Therefore, the stylistic variables do not adequately distinguish modern and contemporary corpora from one another.

IV.II Word Count

word count for the most prolific authors

It should not escape our attention that the authors who score lowest for each variable, the first group of early twentieth-century authors, are also the most prolific. The correlation between word count and the stylistic variables was therefore calculated.

Pearson correlation for word count and stylistic variables

Both the Pearson correlation and Spearman’s rho suggest that word count is highly negatively correlated with hapax and unique (as word count increases, hapax and unique decrease, and vice versa), but not with ambiguity.

Spearman’s rho for word count and stylistic variables

The fact that the Spearman’s rho is markedly higher than the Pearson correlation suggests that the relationship between the two is non-linear. This can be seen in the scatter plot.

scatter plot showing the relationship between word count and uniqueness

In the case of both variables, the correlation is obviously negative, but the data points fall in a non-linear way, suggesting that the Spearman’s rho is the better measure for calculating the relationship. In both cases it would seem that Joyce is the outlier, and most likely to be the author responsible for distorting the correlation.
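The gap between the two coefficients can be reproduced on toy data. The Python sketch below uses invented word counts and a made-up decay curve (uniqueness falling with the square root of length), not the study's figures, and implements both coefficients by hand rather than through SPSS. Spearman's rho is simply Pearson's r computed on ranks, so a relationship that always falls but does so along a curve scores perfectly on rho while r lags behind.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1 upwards; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the ranked data."""
    return pearson(ranks(xs), ranks(ys))

# Invented data: uniqueness decays along a curve as word count grows.
word_counts = [50_000, 100_000, 200_000, 400_000, 800_000, 1_600_000]
uniqueness_scores = [10 / wc ** 0.5 for wc in word_counts]

print("Pearson r:", round(pearson(word_counts, uniqueness_scores), 3))
print("Spearman rho:", round(spearman(word_counts, uniqueness_scores), 3))
```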

scatter plot displaying the relationship between word count and hapax density

Pearson correlations for word count and each stylistic variable

SPSS flags the correlation between hapax and unique as being significant, and this is clearly the most noteworthy relationship between the three stylistic variables. The Spearman’s rho exceeded the Pearson correlation by a marginal amount, and it was therefore decided that the relationship was non-linear, which is confirmed by the scatter plot below:

Spearman’s rho correlation for word count and stylistic variables

The stylistic variables of unique and hapax are therefore highly correlated.

VI: Conclusion

As was said already, the notion that stylistic variables are correlated stands to reason. However, it was not until the correlation tests were carried out that the extent to which uniqueness and hapax are determined by one another was made clear.

The biggest issue with this study is one that persists within digital comparative analyses of literature generally: our apparent incapacity to compare texts of differing lengths. Attempts have been made elsewhere to account for the huge difference that a text’s length clearly makes to measures of its vocabulary, such as vectorised analyses that take measurements in 1000-word windows, but none have yet been wholly successful in accounting for this difference. This study is therefore one among many which presents its results with some caveats, given the extent to which corpora of similar lengths clustered together. The only author who violated this trend was Joyce, who, despite a lengthy corpus of 265500 words, has the highest values for hapax and uniqueness, which marks his corpus out as idiosyncratic. Joyce is therefore the only one of the twenty-one authors whose writing style can be meaningfully distinguished from the others on the basis of the stylistic variables, because he so egregiously reverses the trend.
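For readers curious what a windowed measure looks like, here is a minimal Python sketch of one such approach, the mean type-token ratio over consecutive 1000-word windows. It was not used in this study; the function name and the toy corpora are assumptions made for illustration.

```python
def windowed_uniqueness(tokens, window=1000):
    """Mean type-token ratio over consecutive, non-overlapping windows.

    A trailing partial window is ignored; returns None if the text is
    shorter than one window.
    """
    ratios = []
    for start in range(0, len(tokens) - window + 1, window):
        chunk = tokens[start:start + window]
        ratios.append(len(set(chunk)) / window)
    if not ratios:
        return None
    return sum(ratios) / len(ratios)

# Two invented 'corpora' with the same 50-word vocabulary, cycled in order,
# but very different lengths.
short_text = [f"w{i % 50}" for i in range(1_000)]
long_text = [f"w{i % 50}" for i in range(10_000)]

# The raw ratio is dragged down by length alone...
print(len(set(short_text)) / len(short_text), len(set(long_text)) / len(long_text))
# ...while the windowed version gives both texts a comparable score.
print(windowed_uniqueness(short_text), windowed_uniqueness(long_text))
```

The trade-off is that a windowed score says nothing about vocabulary that only accumulates across a whole oeuvre, which is part of why no such correction has been wholly satisfactory.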

But we hardly needed an analysis of this kind to say Joyce writes differently from most authors, did we?