Tag Archives: Digital Arts & Humanities

Network Analysis

Once the network has been imported into Gephi, we can colour each text according to the century in which it was written, with twentieth-century texts in pink and nineteenth-century novels in green.

Screenshot 2018-11-06 at 11.45.26

No, the resolution isn’t great here, since WordPress has limits on what it can accept, but you get the picture: there’s a clear separation. There’s also some interesting intermingling of particular authors. In the upper part of the network we can see the novels of Stephen Crane, an American writing in the late nineteenth century, being drawn into a cluster of classic modernist writers such as Woolf, Joyce and Ford, as below:

Screenshot 2018-11-06 at 11.48.37

What’s interesting about this is that the kind of fiction Crane is conventionally understood to have written, naturalism, is increasingly discussed in recent literary criticism as a modernist or proto-modernist form, as opposed to the low, popular or proletarian traditions it was associated with in the past.

More importantly, though, we can run community detection algorithms on the network. Rather than using associated metadata to determine the nature of the network visualisation, we can use the weights between the novels to tell us how similar the writing styles of these authors are. The network appears below; the potentially more illustrative key follows.

Screenshot 2018-11-06 at 11.54.00.png

 


Screenshot 2018-11-06 at 11.55.27
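The post does not say which community detection algorithm was run, so the following is only a minimal sketch of the general idea, using weighted label propagation as a deliberately simple stand-in rather than whatever Gephi applied. The function name, the node names and the weights are all hypothetical. Each text repeatedly adopts the label that carries the most edge weight among its neighbours, so heavily weighted groups end up sharing a label.

```python
from collections import defaultdict

def label_propagation(edges, max_rounds=20):
    """Give each node the label carrying the most edge weight among its neighbours."""
    neighbours = defaultdict(dict)
    for source, target, weight in edges:
        # The network is treated as undirected (an assumption).
        neighbours[source][target] = weight
        neighbours[target][source] = weight
    labels = {node: node for node in neighbours}  # every node starts alone
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbours):
            totals = defaultdict(float)
            for other, weight in neighbours[node].items():
                totals[labels[other]] += weight
            # Ties are broken alphabetically so the result is reproducible.
            best = min(totals, key=lambda label: (-totals[label], label))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels

# Hypothetical toy network: two tight groups joined by one weak edge.
edges = [
    ("A1", "A2", 900), ("A2", "A3", 800), ("A1", "A3", 850),
    ("B1", "B2", 900), ("B2", "B3", 800), ("B1", "B3", 850),
    ("A3", "B1", 5),
]
communities = label_propagation(edges)
```

On this toy input the three "A" texts share one label and the three "B" texts another, because the weak bridge between the groups never outweighs the strong edges within them.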


The quantitative analysis of literature in theory

Screenshot 2018-11-06 at 11.34.35

This blog post provides some notes towards the methodology underpinning my doctoral research. In completing my research project I will model 640 novels and short story collections within a consensus network, in order to project a potential definition of modernist literary style through both qualitative and quantitative means. In the fullness of time I will have a full and replicable account up on RPubs and GitHub; for the moment this general introduction will have to do.

The quantitative analysis of literature has had a fraught history. Since the cultural turn of the sixties and seventies, when the political revisionisms of feminism, queer theory and critical race theory were gaining increasing currency, the concept of ‘style’, some quintessence of the work which could be instrumentally distilled from the text, has become increasingly untenable. Context became the predominant means through which literature is understood in Anglo-American literature departments. Indeed, the very idea of style would seem to recall the belles lettres approach of the nineteenth century.

Computational literary criticism, out of necessity, treats literary materials in more pragmatic terms. When filling a spreadsheet, things need to be entered into cells, and no conversation in quantitative terms is possible outside of that constraint. This stands in contrast to contemporary literary studies, in which one can quite happily have a long and involved discussion of what the text is not saying. The two positions have only become more polarised since the recent developments within new modernist studies and neo-Victorianism. Both fields have expanded the temporal and spatial limits of their objects of study, into the present day, far into the past and beyond the metropoles of London, New York and Paris, aiming to de-tether the implicit value judgements of their categorisations from the more problematic aspects of modernity or colonialism.

This leaves quantitative literary critics in something of a quandary. Although some of its more vociferous advocates claim that the application of computational logic to literary materials represents a definitive paradigm shift of which the discipline at large should take more account, their epistemological conservatism is often reflected in their political conservatism. The notion of style as a combination of quantifiable features seems to underpin an uncritical celebration of formal competence, and has been intriguingly read as an example of ‘third way’ knowledge production, as well as a backlash against politically oriented cultural criticism.

I would argue that falling into retrograde modes of thought is certainly a risk of analyses of this kind, but it need not be inevitable. Networks, with their capacity to regard texts as embedded within a broader ecosystem, offer the possibility of bringing the new modernist studies dispensation into dialogue with quantitative literary criticism.

The quantitative analysis of literature can be said to have been kicking around ever since monks first devised manual concordances of the Bible. Every digital humanist will be familiar with the work of Roberto Busa, but the history of the statistical analysis of literature is a more decentralised phenomenon than the big tent digital humanities. The earliest example I can find is Louis Tonko Milic’s A Quantitative Approach to the Style of Jonathan Swift, which was published in 1967. Milic, bless him, seems to be under the impression that he stands at the brink of a newly invigorated formalism which can mobilise computation to reveal literary works as they truly are, bypassing the impressionism which elsewhere characterises appraisals of style within the field. Unfortunately, literary critics are not terribly well known for their command of statistics, and Milic’s tendency to reproduce pages and pages of tables without assessing their significance, with a Student’s t-test for instance, is symptomatic. Many of the earliest digital humanities journals simply reproduce the raw data in binary form and advance interpretations based on visual impressions, rather than mathematical findings.

The development of analyses based on the richness of a text’s vocabulary (number of unique words/total number of words), hapax richness (number of words which appear only once in the text/total number of words), average sentence length or average word length represents an improvement on this approach, but not by much. These may be understood as indexes of style, but as before they were placed in tables and often ‘read’ in the same way literary critics usually read. There were no systematic attempts to assess sentence length across a broader corpus, nor was any benchmark established for the assessment of significant differences.
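The indexes in the paragraph above are simple ratios, so a short sketch may make them concrete. The function name, the crude tokenisation (lower-cased runs of letters and apostrophes, sentences split on full stops, question marks and exclamation marks) and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def style_indexes(text):
    """Compute vocabulary richness, hapax richness and mean lengths for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    hapaxes = [word for word, n in counts.items() if n == 1]
    return {
        "type_token_ratio": len(counts) / len(words),     # unique words / total words
        "hapax_richness": len(hapaxes) / len(words),      # words used once / total words
        "mean_sentence_length": len(words) / len(sentences),
        "mean_word_length": sum(len(w) for w in words) / len(words),
    }

sample = "The cat sat. The dog sat on the mat."
indexes = style_indexes(sample)
```

The sample has nine words, six of them distinct and four used only once, spread over two sentences. All of these ratios depend on the length of the text, which is one reason they are hard to compare across a corpus without a benchmark.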

The first quantitative analysis of literature which yielded replicable results was developed by the Australian literary critic J.F. Burrows. His Delta method, rather than focusing on the more evocative or longer words that literary critics usually focus their attention on, aimed to uncover stylistic signal by quantifying the relative occurrences of high-frequency terms, such as ‘the’, ‘an’, ‘a’, ‘and’ or ‘said’. Burrows’ original method used just the 150 most frequent words (MFWs), but subsequent analyses have demonstrated that the rate of successful authorship attribution increases all the way up to 5000 MFWs. The more of these particle words are analysed, in effect, the better.
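The post describes Delta only in outline. A minimal sketch of the usual formulation (z-score each frequent word's relative frequency across the corpus, then take the mean absolute difference between two texts' z-scores) might look like this. The function names and the toy corpus are made up for illustration, and details such as how the word list is chosen or which standard deviation is used vary between implementations; this is not Stylo's code.

```python
from collections import Counter
from statistics import mean, pstdev

def relative_frequencies(tokens, vocabulary):
    """Occurrences per hundred words for each word in the vocabulary."""
    counts = Counter(tokens)
    return [100 * counts[word] / len(tokens) for word in vocabulary]

def burrows_delta(corpus, n_mfw=150):
    """corpus maps a text's name to its list of tokens; returns pairwise Delta scores."""
    totals = Counter()
    for tokens in corpus.values():
        totals.update(tokens)
    vocabulary = [word for word, _ in totals.most_common(n_mfw)]
    freqs = {name: relative_frequencies(tokens, vocabulary)
             for name, tokens in corpus.items()}
    # z-score each word's frequency against its mean and spread across the corpus
    z = {name: [] for name in corpus}
    for i in range(len(vocabulary)):
        column = [freqs[name][i] for name in corpus]
        mu, sigma = mean(column), pstdev(column)
        for name in corpus:
            z[name].append((freqs[name][i] - mu) / sigma if sigma else 0.0)
    names = sorted(corpus)
    # Delta is the mean absolute difference between two texts' z-scores
    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for i, a in enumerate(names) for b in names[i + 1:]
    }

# Hypothetical toy corpus: text_a and text_b differ by a single word.
toy_corpus = {
    "text_a": "the man said the man and the dog".split(),
    "text_b": "the man said the man and the cat".split(),
    "text_c": "of a of a of a she was".split(),
}
deltas = burrows_delta(toy_corpus, n_mfw=8)
```

A lower Delta means a closer stylistic match, so here `text_a` and `text_b` score as nearer to each other than either is to `text_c`.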

This leaves us with a problem as to the scale at which we analyse texts. Eder has noted that analysing words at different scales broadcasts different stylistic signals, with discomfiting amounts of variation between them. I’ve noted this phenomenon myself when analysing individual words as opposed to combinations of words in twos (‘the man’, ‘she said’, ‘over there’), threes (‘she also said’, ‘over by the’) or even on the level of individual characters (‘th ’, ‘a’, ‘n he’). Rybicki and Eder’s solution is to quantify all 5000 words six times, culling them in increments of twenty; rather than finding a single ‘best’ fit, we just throw everything in and obtain the average level of similarity existing between each text, subject to particular conditions. I propose a similar approach, analysing single words, bigrams, trigrams, quadgrams and quingrams in both word and character form. This is all done through the ‘Stylo’ package, a custom-made library written in the R language.
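To make the different scales concrete, here is a small sketch of word and character n-grams. The helper names are hypothetical and the splitting is naive (whitespace only); the analysis itself was run in Stylo, not with this code.

```python
def word_ngrams(text, n):
    """Overlapping sequences of n consecutive words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """Overlapping sequences of n consecutive characters, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "she said the man over there"
bigrams = word_ngrams(sentence, 2)
char_trigrams = char_ngrams("the man", 3)
```

Character n-grams cut across word boundaries (`"e m"`, `" ma"`), which is why they pick up a different signal from word n-grams drawn from the same sentence.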

Once all these analyses have been run, R outputs a list of edges into the working directory, which will form the basis of the network. It looks like this:

Screenshot 2018-11-06 at 11.09.06

Each row here represents a relationship between two texts, a ‘Source’ and a ‘Target’; it is effectively a line drawn from column A to column B. The third column, marked ‘Weight’, signifies the intensity of the relationship, the weakest being 1 and the strongest being ~1125. This seems to be the maximum value possible, so I suspect the algorithm which creates this table cuts off the similarity calculation past a certain point. To return to the table, we can see that the rows run in descending order of intensity, and that Anne Brontë’s novel Agnes Grey is by far most like her other novel, The Tenant of Wildfell Hall. From there, there’s a pronounced drop-off from a weight of 902 to 226 for the next most similar novel, James Joyce’s Finnegans Wake.

A list of this kind is outputted for every scale mentioned above. The lists are then combined into a single massive list of edges (about 14720 rows in all). Because there are about ten edge lists, there are ten different weights for each relationship. These are averaged into a single ‘edge’, and this forms the basis of the network, which I’ll talk about in a subsequent post.
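The averaging step can be sketched as follows. This is an illustration under stated assumptions rather than the script actually used: the function name is made up, each relationship is treated as undirected, and a pair that is missing from some lists is averaged only over the lists in which it appears. The weights 902 and 226 come from the table above; the 800 is invented for the example.

```python
from collections import defaultdict
from statistics import mean

def average_edges(edge_lists):
    """Average the weights reported for each pair of texts across several edge lists."""
    weights = defaultdict(list)
    for edges in edge_lists:
        for source, target, weight in edges:
            # Sorting the pair makes A -> B and B -> A count as the same edge (assumption).
            pair = tuple(sorted((source, target)))
            weights[pair].append(weight)
    return {pair: mean(values) for pair, values in weights.items()}

word_unigram_edges = [
    ("Agnes Grey", "The Tenant of Wildfell Hall", 902),
    ("Agnes Grey", "Finnegans Wake", 226),
]
char_trigram_edges = [
    ("The Tenant of Wildfell Hall", "Agnes Grey", 800),  # invented weight
]
combined = average_edges([word_unigram_edges, char_trigram_edges])
```

Here the two weights for the Brontë pair average to 851, while the Finnegans Wake edge, present in only one list, keeps its weight of 226.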


How big are the words modernists use?

It’s a fairly straightforward question to ask, one which most literary scholars would be able to provide a halfway decent answer to based on their own readings: Ernest Hemingway, Samuel Beckett and Gertrude Stein are more likely to use short words, James Joyce, Marcel Proust and Virginia Woolf longer ones, with the rest falling somewhere between the two extremes.

Most Natural Language Processing textbooks or introductions to quantitative literary analysis demonstrate how the frequencies of the most common words in a corpus decline roughly in inverse proportion to their rank (Zipf’s law), i.e. the most frequently occurring term will appear about twice as often as the second and about three times as often as the third, and so on. I was curious to see whether another process was at work for word lengths, and whether we can see a similar decline in modernist novels, or whether more ‘experimental’ authors visibly buck the trend. With some fairly elementary analysis in NLTK, and the resulting data frames brought over into R, I generated a visualisation which looked nothing like this one.*

*The previous graph had twice as many authors and was far too noisy, with not enough distinction between the colours to make it anything other than a headwreck to read.
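The measurement behind the graph is the percentage of a corpus made up of words of each length. A minimal sketch, with a hypothetical function name and a crude tokeniser that keeps only runs of letters (so ‘don’t’ would count as ‘don’ and ‘t’), could run as follows; the original analysis used NLTK and R rather than this code.

```python
import re
from collections import Counter

def word_length_percentages(text):
    """Map each word length to the percentage of words in the text with that length."""
    words = re.findall(r"[A-Za-z]+", text)
    lengths = Counter(len(word) for word in words)
    return {n: 100 * count / len(words) for n, count in sorted(lengths.items())}

sample = "I saw a man and he saw me"
distribution = word_length_percentages(sample)
```

Plotting one such dictionary per author, with word length on the x-axis and percentage on the y-axis, gives the kind of line chart discussed below.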

In narrowing down the number of authors I was going to plot, I inclined more towards authors that I thought would be more variegated, getting rid of the ‘strong centre’ of modernist writing: authors not quite as prosodically charged as Marcel Proust, but not as brutalist as Stein either. I also put in a couple of contemporary writers for comparison, such as Will Self and Eimear McBride.

As we can see, after the rather disconnected percentages of corpora that use one letter words, with McBride and Hemingway on top at around 25%, and Stein a massive outlier at 11%, things become increasingly harmonious, and the longer the words get, the more the lines of the vectors coalesce.

Self and Hemingway dip rather egregiously with regard to their use of two-letter words (which is almost definitely because of a mutual disregard for a particular word, I’m almost sure of it), but it is Stein who exponentially increases her usage of two and three letter words. As my previous analyses have found, Stein is an absolute outlier in every analysis.

By the time the words are ten letters long, true to form, it’s Self whose writing is the only one above 1%.

Collocations in Modernist Prose

Screen Shot 2017-07-24 at 14.51.47

I have recently begun to experiment with Natural Language Processing to determine how particular words in modernist texts are correlated. I’m still getting my head around Python and NLTK, but so far I’m finding it much more user-friendly than similar packages in R.

Long-term I hope to plot these collocations in a high-dimensional vector space, but for the moment I’m interested in noting the prevalence of the term ‘young man’, the fact that Self and Baume are the only authors with female adjective-noun phrases, and the usage of titles which convey particular social hierarchies; Joyce, Woolf and Bowen’s collocations are almost exclusively composed of these, as are Stein’s, with the qualification that Stein’s appear shorn of their ‘Mr.’, ‘Miss.’ or ‘Doctor’.
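The post does not say which NLTK routine produced the lists that follow. NLTK ships its own collocation finder; the sketch below is a deliberately simpler stand-in that counts adjacent word pairs, drops any pair containing a stopword, and ranks by raw frequency. The function name, the tiny stopword list and the sample text are all assumptions.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be far longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "with", "was", "he", "she", "it"}

def top_collocations(text, n=20):
    """Return the n most frequent adjacent word pairs that contain no stopword."""
    words = re.findall(r"[a-z']+", text.lower())
    pairs = Counter(
        (first, second)
        for first, second in zip(words, words[1:])
        if first not in STOPWORDS and second not in STOPWORDS
    )
    return [" ".join(pair) for pair, _ in pairs.most_common(n)]

sample = (
    "The young man saw the old man. The young man went to New York, "
    "and the old man stayed in New York with a young man."
)
collocations = top_collocations(sample, n=3)
```

Ranking by raw count favours common phrases; an association measure such as a likelihood ratio would instead reward pairs that occur together more often than their individual frequencies predict, which is how proper names tend to float to the top.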

Here’s all the collocations in the modernist corpus:

young man; robert jordan; new york; gertrude stein; old man; could see; henry martin; every one; years ago; first time; long time; hugh monckton; great deal; come back; david hersland; good deal; every day; edward colman; came back; alfred hersland

Canonical modernist texts:

young man; robert jordan; gertrude stein; henry martin; new york; every one; old man; could see; years ago; long time; hugh monckton; first time; great deal; david hersland; come back; good deal; every day; edward colman; alfred hersland; mr. bettesworth

Contemporary texts, Enright, Self, Baume, McBride:

fat controller; phar lap; von sasser; first time; per cent; could see; old man; one another; even though; years ago; new york; front door; young man; either side; someone else; dave rudman; last night; living room; steering wheel; every time

Djuna Barnes

frau mann; nora said; english girl; someone else; long ago; leaned forward; london bridge; come upon; could never; god knows; doctor said; sweet sake; first time; five francs; terrible thing; francis joseph; hôtel récamier; orange blossoms; bowed slightly; would say

Eimear McBride

kentish town; someone else; first time; last night; jesus christ; something else; years ago; five minutes; every day; hail mary; take care; next week; arms around; never mind; every single; little girl; little boy; two years; soon enough; come back

Elizabeth Bowen

mrs kerr; lady waters; mrs heccomb; major brutt; mme fisher; lady naylor; miss fisher; good deal; said mrs; first time; lady elfrida; one another; young man; colonel duperrier; aunt violet; last night; ann lee; one thing; sir robert; sir richard

Ernest Hemingway

robert jordan; old man; could see; colonel said; gran maestro; catherine said; jordan said; richard gordon; long time; pilar said; thou art; pablo said; nick said; bill said; girl said; captain willie; young man; automatic rifle; mr. frazer; david said

F. Scott Fitzgerald

new york; young man; years ago; first time; sally carrol; several times; fifth avenue; ten minutes; minutes later; richard caramel; thousand dollars; five minutes; young men; evening post; old man; next day; saturday evening; long time; last night; come back

Gertrude Stein

gertrude stein; every one; david hersland; alfred hersland; angry feeling; family living; independent dependent; jeff campbell; julia dehning; mrs. hersland; daily living; whole one; bottom nature; madeleine wyman; good deal; mary maxworthing; middle living; miss mathilda; mabel linker; every day

James Joyce

buck mulligan; said mr.; martin cunningham; aunt kate; says joe; mary jane; corny kelleher; ned lambert; mrs. kearney; stephen said; mr. henchy; ignatius gallaher; father conmee; nosey flynn; mr. kernan; myles crawford; cissy caffrey; ben dollard; mr. cunningham; miss douce

Marcel Proust

young man; faubourg saint-germain; long ago; caught sight; first time; every day; one day; great deal; des laumes; young men; could see; quite well; next day; one another; would never; nissim bernard; victor hugo; would say; louis xiv; long time

Samuel Beckett

said camier; said mercier; miss counihan; lord gall; miss carridge; mr. kelly; panting stops; said belacqua; mr. endon; said wylie; said neary; one day; otto olaf; dr. killiecrankie; come back; vast stretch; mrs gorman; push pull; something else; ground floor

Sara Baume

even though; tawny bay; living room; old man; passenger seat; bird walk; maggot nose; shut-up-and-locked room; stone fence; food bowl; lonely peephole; low chair; old woman; kennel keeper; rearview mirror; shih tzu; shore wall; safe space; every day; oneeye oneeye

Virginia Woolf

miss barrett; mrs. ramsay; mrs. hilbery; young man; st. john; could see; years ago; peter walsh; mrs. thornbury; miss allan; said mrs.; young men; mrs. swithin; human beings; wimpole street; mrs. flushing; mr. ramsay; mrs. manresa; sir william; door opened

Anne Enright

new york; per cent; eliza lynch; dear friend; years old; even though; first time; came back; years ago; long time; michael weiss; señor lópez; living room; every time; looked like; could see; one day; said constance; pat madigan; mrs hanratty

Will Self

fat controller; phar lap; von sasser; one another; old man; could see; first time; per cent; dave rudman; let alone; front door; young man; skip tracer; quantity theory; jane bowen; los angeles; young woman; either side; charing cross; long since

Flann O’Brien

father fahrt; good fairy; father cobble; said shanahan; mrs crotty; said furriskey; said lamont; mrs laverty; one thing; sergeant fottrell; said slug; old mathers; public house; far away; cardinal baldini; monsignor cahill; mrs furriskey; red swan; black box; said shorty

Ford Madox Ford

henry martin; hugh monckton; edward colman; privy seal; mr. bettesworth; mr. fleight; young man; mr. sorrell; sergius mihailovitch; young lovell; new york; jeanne becquerel; lady aldington; kerr howe; anne jeal; miss peabody; mr. pett; great deal; marie elizabeth; robert grimshaw

Jorge Luis Borges

ts’ui pên; buenos aires; pierre menard; eleventh volume; richard madden; nils runeberg; yiddische zeitung; stephen albert; hundred years; erik lönnrot; firing squad; henri bachelier; madame henri; orbis tertius; vincent moon; paint shop; seventeenth century; anglo-american cyclopaedia; fergus kilpatrick; years ago

Joseph Conrad

mrs. travers; mrs verloc; mrs. fyne; peter ivanovitch; doña rita; miss haldin; mrs. gould; assistant commissioner; charles gould; san tomé; chief inspector; years ago; captain whalley; could see; van wyk; old man; dr. monygham; gaspar ruiz; young man; mr. jones

D.H. Lawrence

young man; st. mawr; mr. may; mrs. witt; blue eyes; miss frost; could see; one another; mrs bolton; ‘all right; come back; said alvina; two men; of course; good deal; long time; mr. george; next day

William Faulkner

uncle buck; aleck sander; miss reba; years ago; dewey dell; mrs powers; could see; white man; four years; old man; ned said; division commander; general compson; miss habersham; new orleans; uncle buddy; let alone; one another; united states; old general

Literary Cluster Analysis

I: Introduction

My PhD research will involve arguing that there has been a resurgence of modernist aesthetics in the novels of a number of contemporary authors. These authors are Anne Enright, Will Self, Eimear McBride and Sara Baume. All these writers have, at various public events and in the course of many interviews, given very different accounts of their specific relation to modernism, and even if the definition of modernism weren’t totally overdetermined, we could spend the rest of our lives defining the ways in which their writing engages, or does not engage, with the modernist canon. Indeed, if I have my way, this is what I will spend a substantial portion of my life doing.

It is not in the spirit of reaching a methodology of greater objectivity that I propose we analyse these texts through digital methods; having begun my education in statistical and quantitative methodologies in September of last year, I can tell you that these really afford us no *better* a view of any text than just reading it would, but fortunately I intend to do that too.

This cluster dendrogram was generated in R, and owes its existence to Matthew Jockers’ book Text Analysis with R for Students of Literature, from which I developed a substantial portion of the code that creates the output above.

What the code is attentive to is the words that these authors use the most. When analysing literature qualitatively, we tend to have a magpie sensibility, zoning in on words which produce more effects or stand out in contrast to the literary matter which surrounds them. As such, the ways in which a writer uses the words ‘the’, ‘an’, ‘a’ or ‘this’ tend to pass us by, but they may be far more indicative of a writer’s style, at least in the way that a computer would be attentive to it; sentences that are ‘pretty’ are generally statistically insignificant.

II: Methodology

Every corpus that you can see in the above image was read into R and then run through a script which counted the number of times every word was used in the text. The resulting figure is called the word’s frequency; it was then converted to a relative frequency by dividing it by the total number of words and multiplying the result by 100. Every word with a relative frequency above a certain threshold was put into a matrix, and a function was used to cluster the rows of the matrix together based on the similarity of the figures they contained, according to a Euclidean metric I don’t fully understand.

The final matrix was 21 X 57, and compared these 21 corpora on the basis of their relative usage of the words ‘a’, ‘all’, ‘an’, ‘and’, ‘are’, ‘as’, ‘at’, ‘be’, ‘but’, ‘by’, ‘for’, ‘from’, ‘had’, ‘have’, ‘he’, ‘her’, ‘him’, ‘his’, ‘I’, ‘if’, ‘in’, ‘is’, ‘it’, ‘like’, ‘me’, ‘my’, ‘no’, ‘not’, ‘now’, ‘of’, ‘on’, ‘one’, ‘or’, ‘out’, ‘said’, ‘she’, ‘so’, ‘that’, ‘the’, ‘them’, ‘then’, ‘there’, ‘they’, ‘this’, ‘to’, ‘up’, ‘was’, ‘we’, ‘were’, ‘what’, ‘when’, ‘which’, ‘with’, ‘would’, and ‘you’.
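The Euclidean metric is just the straight-line distance between two rows of the matrix: subtract the two corpora's relative frequencies word by word, square the differences, add them up and take the square root. A minimal sketch, using a made-up three-word vocabulary and two toy ‘corpora’ (the function names are hypothetical, and the original work was done in R):

```python
import math
from collections import Counter

def relative_frequencies(tokens, vocabulary):
    """Count of each vocabulary word, divided by total words and multiplied by 100."""
    counts = Counter(tokens)
    return [100 * counts[word] / len(tokens) for word in vocabulary]

def euclidean(row_a, row_b):
    """Straight-line distance between two rows of relative frequencies."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_a, row_b)))

vocabulary = ["the", "a", "said"]
row_one = relative_frequencies("the man said the word".split(), vocabulary)
row_two = relative_frequencies("a man said a word".split(), vocabulary)
distance = euclidean(row_one, row_two)
```

The two rows agree on ‘said’ and differ on ‘the’ and ‘a’, so only those two columns contribute to the distance. A clustering function then repeatedly merges the rows, or groups of rows, that sit closest together, and the dendrogram records the order of those merges.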

Anyway, now we can read the dendrogram.

III: Interpretation

Speaking about the dendrogram in broad terms can be difficult for precisely the reason that I indicated above; quantitative and qualitative methodologies for text analysis are totally opposed to one another. What is obvious, though, is that Eimear McBride and Gertrude Stein are extreme outliers, and comparable only to each other. This is in one way unsurprising, because of their brutish, repetitive styles, and is in other ways very surprising, because McBride is on record as dismissing Stein’s work for being ‘too navel-gaze-y.’

Jorge Luis Borges and Marcel Proust have branched off in their own direction, as has Sara Baume, which I’m not quite sure what to make of. Franz Kafka, Ernest Hemingway and William Faulkner have formed their own nexus. More comprehensible is the Anne Enright, Katherine Mansfield, D.H. Lawrence, Elizabeth Bowen, F. Scott Fitzgerald and Virginia Woolf cluster; one could make admittedly sweeping judgements about how this could be said to be modernism’s extreme centre, in which the radical experimentalism of its more revanchiste wing was fused rather harmoniously with nineteenth-century social realism, producing a kind of indirect discourse at which I think each of these authors excels.

These revanchistes are well represented in the dendrogram’s right wing, with Flann O’Brien, James Joyce, Samuel Beckett and Djuna Barnes having clustered together, though I am not quite sure what to make of Ford Madox Ford/Joseph Conrad’s showing at all, being unfamiliar with the work.

IV: Conclusion

The basic rule in interpreting dendrograms is that the closer the ‘leaves’ reach the bottom, the more similar they can be said to be. Therefore, Anne Enright and Will Self are the contemporary modernists most closely aligned to the forebears, if indeed forebears they can be said to be. It would be harder, from a quantitative perspective, to align Sara Baume with this trend in a straightforward manner, and McBride only seems to correlate with Stein because of how inalienably strange their respective prose styles are.

The primary point to take away here, if there is one, is that more investigations are required. The analysis is hardly unproblematic. For one, the corpus sizes vary enormously. Borges’ corpus is around 46 thousand words, whereas Proust’s reaches somewhere around 1.2 million. In one way the results are encouraging: Borges and Barnes, two authors with only one text in their corpus, aren’t prevented from being compared to novelists with serious word counts. In another way, it is pretty well impossible to derive literary measurements from texts without taking their length into account. The next stage of the analysis will probably involve breaking the corpora up into units of 50 thousand words, so that the results for individual novels can be compared.
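Breaking a corpus into equal units is straightforward; a minimal sketch, with a hypothetical function name and a stand-in list of 120 thousand ‘words’, might be:

```python
def chunk_words(words, size=50_000):
    """Split a list of words into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

stand_in_corpus = list(range(120_000))  # placeholder for a real list of words
chunks = chunk_words(stand_in_corpus)
chunk_sizes = [len(chunk) for chunk in chunks]
```

The final chunk will usually be shorter than the rest, so a decision is still needed about whether to drop it, merge it into its neighbour or keep it and accept the length difference.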

Can a recurrent neural network write good prose?

At this stage in my PhD research into literary style I am looking to machine learning and neural networks, and moving away from stylostatistical methodologies, partially out of fatigue. Statistical analyses are intensely process-based and always open, it seems to me, to fairly egregious ‘nudging’ in the name of reaching favourable outcomes. This brings a kind of bathos to some statistical analyses, as they account, to a greater extent than I’d like, for methodology and process, with the result that the novelty these approaches might have brought us is neglected. I have nothing against this emphasis on process necessarily, but I do also have a thing for outcomes, as well as the mysticism and relativity machine learning can bring, alienating us as it does from the process of the script’s decision-making.

I first heard of the ‘sci-fi writer’ from a colleague in my department. It’s Robin Sloan’s plug-in for the text editor Atom, which allows you to ‘autocomplete’ texts based on your input. After sixteen hours of installing, uninstalling, moving directories around and looking things up on Stack Overflow, I got it to work. I typed in some Joyce and got stuff about Chinese spaceships as output, which was great, but science fiction isn’t exactly my area, and I wanted to train the network on a corpus of modernist fiction. Fortunately, I had the complete works of Joyce, Virginia Woolf, Gertrude Stein, Sara Baume, Anne Enright, Will Self, F. Scott Fitzgerald, Eimear McBride, Ernest Hemingway, Jorge Luis Borges, Joseph Conrad, Ford Madox Ford, Franz Kafka, Katherine Mansfield, Marcel Proust, Elizabeth Bowen, Samuel Beckett, Flann O’Brien, Djuna Barnes, William Faulkner & D.H. Lawrence to hand.

My understanding of this recurrent neural network, such as it is, runs as follows. The script reads the entire corpus of over 100 novels, and calculates the distance that separates every word from every other word. The network then hazards a guess as to what word follows the word or words that you present it with, then checks this guess against the word that actually follows. It then does so over and over and over, getting ‘better’ at predicting each time. The size of the corpus is significant in determining the length of time this will take, and mine required something around twelve days. I had to cut it off after twenty-four hours because I was afraid my laptop wouldn’t be able to handle it. At this point it had carried out the process 135000 times, just below 10% of the full process. Once I get access to a computer with better hardware I can look into getting better results.

How this will feed into my thesis remains nebulous; I might move in a sociological direction and take survey data on how closely readers reckon the final result approximates literary prose. But at this point I’m interested in what impact it might conceivably have on my own writing. I am currently trying to sustain progress on my first novel alongside my research, so, in a self-interested enough way, I pose the question: can neural networks be used in the creation of good prose?

There have been many books written on the place of cliometric methodologies in literary history. I’m thinking here of William S. Burroughs’ cut-ups, Mallarmé’s infinite book of sonnets, and the brief flirtation the literary world had with hypertext in the 90s, but beyond the avant-garde, I don’t think I could name an author who has foregrounded their use of numerical methods of composition. A poet friend of mine has dabbled in this sort of thing but finds it expedient not to emphasise the aleatory aspect of what she’s doing, as publishers tend to give a frosty reception when their writers suggest that their work is automated to some extent.

And I can see where they’re coming from. No matter how good they get at it, I’m unlikely to get to a point where I’ll read automatically generated literary art. Speaking for myself, when I’m reading, it is not just about the words. I’m reading Enright or Woolf or Pynchon because I’m as interested in them as I am in what they produce. How synthetic would it be to set Faulkner and McCarthy in conversation with one another if their congruencies were wholly manufactured by outside interpretation or an anonymous algorithmic process as opposed to the discursive tissue of literary sphere, if a work didn’t arise from material and actual conditions? I know I’m making a lot of value-based assessments here that wouldn’t have a place in academic discourse, and on that basis what I’m saying is indefensible, but the probabilistic infinitude of it bothers me too. When I think about all the novelists I have yet to read I immediately get panicky about my own death, and the limitless possibilities of neural networks to churn out tomes and tomes of literary data in seconds just seems to me to exacerbate the problem.

However, speaking outside of my reader-identity, as a writer, I find it invigorating. My biggest problem as a writer isn’t writing nice sentences, given enough time I’m more than capable of that, the difficulty is finding things to wrap them around. Mood, tone, image, aren’t daunting, but a text’s momentum, the plot, I suppose, eludes me completely. It’s not something that bothers me, I consider plot to be a necessary evil, and resent novels that suspend information in a deliberate, keep-you-on-the-hook sort of way, but the ‘what next’ of composition is still a knotty issue.

The generation of text could be a useful way of getting an intelligent prompt that stylistically ‘borrows’ from a broad base of literary data, smashing words and images together in a generative manner to get the associative faculties going. I’m not suggesting that these scripts would be successful were they autonomous, I think we’re a few years off one of these algorithms writing a good novel, but I hope to demonstrate that my circa 350 generated words would be successful in facilitating the process of composition:

be as the whoo, put out and going to Ingleway effect themselves old shadows as she was like a farmers of his lake, for all or grips — that else bigs they perfectly clothes and the table and chest and under her destynets called a fingers of hanged staircase and cropping in her hand from him, “never married them my said?” know’s prode another hold of the utals of the bright silence and now he was much renderuched, his eyes. It was her natural dependent clothes, cattle that they came in loads of the remarks he was there inside him. There were she was solid drugs.

“I’m sons to see, then?’ she have no such description. The legs that somewhere to chair followed, the year disappeared curl at an entire of him frwented her in courage had approached. It was a long rose of visit. The moment, the audience on the people still the gulsion rowed because it was a travalious. But nothing in the rash.

“No, Jane. What does then they all get out him, but? Or perfect?”

“The advices?”

Of came the great as prayer. He said the aspect who, she lay on the white big remarking through the father — of the grandfather did he had seen her engoors, came garden, the irony opposition on his colling of the roof. Next parapes he had coming broken as though they fould

has a sort. Quite angry to captraita in the fact terror, and a sound and then raised the powerful knocking door crawling for a greatly keep, and is so many adventored and men. He went on. He had been her she had happened his hands on a little hand of a letter and a road that he had possibly became childish limp, her keep mind over her face went in himself voice. He came to the table, to a rashes right repairing that he fulfe, but it was soldier, to different and stuff was. The knees as it was a reason and that prone, the soul? And with grikening game. In such an inquisilled-road and commanded for a magbecross that has been deskled, tight gratulations in front standing again, very unrediction and automatiled spench and six in command, a

I don’t think I’d be alone in thinking that there’s some merit in parts of this writing. I wonder if there’s an extent to which Finnegans Wake has ‘tainted’ the corpus somewhat, because stylistically, I think that’s the closest analogue to what could be said to be going on here. Interestingly, it seems to be formulating its own puns. Words like ‘unrediction,’ ‘automatiled spench’ (a tantalising meta-textual reference, I think) and ‘destynets’ would all be reminiscent of what you could expect to find in any given section of the Wake, but they don’t turn up in the corpus proper, at least according to a ctrl + f search. What this suggests to me is that the algorithm is plotting relationships on the level of the character, as well as phrasal units. However, I don’t recall the sci-fi model turning up paragraphs that were quite so disjointed and surreal; they didn’t make loads of sense, but they were recognisable as grammatically coherent chunks of text. This could, though, be the result of working with a partially trained model.
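For anyone who would rather not lean on ctrl + f, the same check can be sketched in a few lines of Python. This is a minimal sketch and not the script behind this post: the function name `coined_words` is my own, and the two short strings below are placeholders standing in for the real corpus and the real generated text. It also matches whole words, whereas ctrl + f matches substrings, so a word it flags could still turn up inside a longer word in the corpus.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into words, keeping internal
    apostrophes and hyphens (so "know's" stays one token)."""
    return re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())

def coined_words(generated, corpus):
    """Return the words in `generated` that never occur as whole words
    in `corpus`, in order of first appearance and without repeats."""
    corpus_vocab = set(tokenize(corpus))
    seen = set()
    novel = []
    for word in tokenize(generated):
        if word not in corpus_vocab and word not in seen:
            seen.add(word)
            novel.append(word)
    return novel

# Placeholder strings only: in practice `corpus_text` would be the full
# training corpus read from disk, and `generated_text` the model's output.
corpus_text = "He was very much in command of the six men, and a bright silence fell."
generated_text = "very unrediction and automatiled spench and six in command"

print(coined_words(generated_text, corpus_text))
```

Anything this returns is a candidate for a word the model has assembled character by character rather than copied, though it would be worth eyeballing the list for OCR errors and odd spellings before reading too much into it.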

So, how might they feed our creative process? Here’s my attempt at making nice sentences out of the above.

— I have never been married, she said. — There’s no good to be gotten out of that sort of thing at all.

He’d use his hands to do chin-ups, pull himself up over the second staircase that hung over the landing, and he’d hang then, wriggling across the awning it created over the first set of stairs, grunting out eight to ten numbers each time he passed, his feet just missing the carpeted surface of the real stairs, the proper stairs.

Every time she walked between them she would wonder which of the two that she preferred. Not the one that she preferred, but the one that were more her, which one of these two am I, which one of these two is actually me? It was the feeling of moving between the two that she could remember, not his hands. They were just an afterthought, something cropped in in retrospect.

She can’t remember her sons either.

Her life had been a slow rise, to come to what it was. A house full of men, chairs and staircases, and she wished for it now to coil into itself, like the corners of stale newspapers.

The first thing you’ll notice about this is that it is a lot shorter. I started off by translating the above, as much as possible, into ‘plain words’ while remaining faithful to the n-grams I liked, like ‘bright silence’, ‘old shadows’ and ‘great as prayer’. In order to create images that play off one another, and to account for the dialogue, sentences that seemed to be doing similar things began to cluster together, so paragraphs organically started to shrink. Ultimately, once the ‘purpose’ of what I was doing started to come out (a critique of bourgeois values, memory loss), the nice phrasal units started to become spurious, and the eight or so paragraphs collapsed into the three and a half above. This is also one of my biggest writing issues: I’ll type three full pages and after the editing process they’ll come to no more than 1.5 paragraphs, maybe?

The thematic sense of dislocation and fragmentation could be a product of the source material, but most things I write are about substance-abusing depressives with broken brains cos I’m a twenty-five-year-old petit-bourgeois male. There’s also a fairly pallid Enright vibe to what I’ve done with the above; I think the staircases line could come straight out of The Portable Virgin.

Maybe a better-trained model could provide better prompts, but overall, if you want better results out of this for any kind of creative praxis, it’s probably better to be a good writer.

Modelling Humanities Data Blog Post #1: Deleuze, Descartes and Data to Knowledge

While dealing with the distinctions between data, knowledge and information in class, a pyramidal hierarchy was proposed, which can be seen on the left. This diagram discloses the process of making data (which have been defined as ‘facts’ which exist in the world) into information, and thereafter knowledge. These shifts from one state to another are not as neat as the diagram might suggest; it is just one interpretation giving shape to a highly dynamic and unsettled process, and any movement from one of these levels to another is fraught. It is ‘a bargaining system,’ as every dataset has its limitations and aporias, not to speak of the process of interpretation or subsequent dissemination. This temporal dimension to data, its translation from a brute state, is too often neglected within certain fields of study, fields in which data is more often understood as unambiguous, naturally hierarchicalised, and not open to contextualisation or debate.

This blog post aims to consider these issues within the context of a dataset obtained from The Central Statistics Office. The dataset contains information relating to the relative risk of falling into poverty based on one’s level of education between the years 2004 and 2015 inclusive. The data was analysed through use of the statistical analysis interface SPSS.

The purpose of the CSO is to compile and disseminate information relating to economic and social conditions within the state in order to give direction to the government in the formulation of policy. Therefore it was decided that the most pertinent information to be derived from the dataset would be the correlations between level of education and the likelihood of falling into poverty. The results appear below.

Correlation Between Risk of Poverty and Level of Education Achieved

Correlation Between Consistent Poverty (%) and Level of Education Received

Correlation Between Deprivation Rate (%) and Level of Education Received

Poverty Risk Based on Education Level

Deprivation Rate Based on Education Level

Consistent Poverty Rate Based on Education Level

It can be seen that there is a very strong negative correlation between one’s level of education and one’s risk of exposure to poverty; the higher one ascends through the education system, the less likely it is one will fall into economic liminality. This is borne out both in the bar charts and the correlation tables, the latter of which yield p-values that SPSS reports as .000 (that is, p < .001), indicating that the correlations are statistically significant. It should be noted that both graphing the data and detecting correlations through use of Spearman’s rho are elementary statistical procedures, but as the trend revealed here is consistent with more elaborate modelling of the relationship,[1] the parsimonious analysis carried out here is all that is required.
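For readers without access to SPSS, the calculation behind Spearman’s rho can be sketched in plain Python: rank each variable, then take the Pearson correlation of the ranks. What follows is a minimal sketch under stated assumptions and not the analysis reported above. The function names are my own, the numbers are hypothetical placeholders rather than the CSO figures, and it does not compute the p-value that SPSS reports alongside the coefficient.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of the
    ranks they would otherwise occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)

# Hypothetical placeholder values, NOT the CSO data: education level as
# an ordinal code (1 = lowest, 5 = highest) against a poverty-risk rate (%).
education_level = [1, 2, 3, 4, 5]
poverty_risk = [30.0, 24.0, 19.0, 12.0, 6.0]

print(spearman_rho(education_level, poverty_risk))
```

Because rho works on ranks rather than raw values, it suits an ordinal variable like level of education, where the gaps between categories have no fixed size.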

It should not be assumed that just because these graphs are informative it is impossible to garner information from data in any other way. Even in its primary state, as it appears on the website, one could obtain information from a dataset through qualitative means. It is unlikely that this information will be as coherent as that which can be gleaned from even the most basic graph, but it is important to emphasise the fact that the border that separates data from information is fluid.

It is unlikely to be a novel finding that those who have a third level education have higher incomes than those who do not; there is a robust body of research detailing the many benefits of attending university.[2] Therefore, can it be said that the visualisation of the dataset above has contributed to knowledge? One would answer this question relative to one’s initial research question, and how the information complicates or advances it. If the causal relationship between exposure to poverty and level of education has been confirmed, and a government agency makes the recommendation that further investment in educational support programmes is necessary, it is somewhere in this process that the boundary separating information from knowledge has been crossed.

The above diagram actualises the temporal nature of data to a greater extent than the pyramid, but in doing so it perpetuates a linearisation of the process, a line along which René Descartes’ notion of thought could be said to align. Descartes understood thought as a positive function which tends towards the good and towards truth. This ‘good sense’ allows us to ‘judge correctly and to distinguish the true from the false’.[3] Gilles Deleuze believes Descartes instantiates a model of thought which is oppressive, and which perceives thinking relative to external needs and values rather than in its actuality: ‘It cannot be regarded as fact that thinking is the natural exercise of a faculty, and that this faculty is possessed of a good nature and a good will.’[4]

In Deleuze’s conception, thought takes on a sensual disposition, reversing the Cartesian notion of mental inquiry beginning from a state of disinterestedness in order to arrive at a moment at which one recognises ‘rightness’. Deleuze argues that there is no such breakthrough moment or established methodology to thought, and regards it instead as more invasive, or unwelcome, a point of encounter when ‘something in the world forces us to think.’[5]

Rather than taking the neat, schematic movement from capturing data to modelling to interpreting for granted, Deleuze is engaged by these moments of crisis, points just before or just after the field of our understanding is qualitatively transformed into something different:

How else can one write but of those things which one doesn’t know, or know badly?…We write only at the frontiers of our knowledge, at the border which separates our knowledge from our ignorance and transforms one into the other.[6]

Deleuze’s comments have direct bearing upon our understanding of data, and how they should be understood within the context of the wider questions we ask of them. Deleuze argues that ‘problems must be considered not as ‘givens’ (data) but as ideal ‘objecticities’ possessing their own sufficiency and implying acts of constitution and investment in their respective symbolic fields.’[7] While it is possible that Deleuze would risk overstating the case, were we to apply his theories to this dataset, it is nonetheless crucial to recall that data, and the methodologies we use to unpack and present them, participate in wider economies of significance, ones with indeterminate horizons.

Notes

[1] Department for Business, Innovation and Skills, ‘BIS Research Paper №146: The Benefits of Higher Education and Participation for Individuals and Society: Key Findings and Reports’, (Department for Business, Innovation and Skills: 2013) https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/254101/bis-13-1268-benefits-of-higher-education-participation-the-quadrants.pdf

[2] OECD, Education Indicators in Focus, (OECD: 2012) https://www.oecd.org/education/skills-beyond-school/Education%20Indicators%20in%20Focus%207.pdf

[3] Descartes, René, Discourse on the Method of Rightly Conducting the Reason, and Seeking Truth in the Sciences (Gutenberg: 2008), http://www.gutenberg.org/files/59/59-h/59-h.htm

[4] Deleuze, Gilles, Difference and Repetition (Bloomsbury Academic: 2016), p. 175

[5] Ibid.

[6] Ibid., p. xviii

[7] Ibid., p. 207

Bibliography

Deleuze, Gilles, Difference and Repetition (Bloomsbury Academic: 2016)

Department for Business, Innovation and Skills, ‘BIS Research Paper №146: The Benefits of Higher Education and Participation for Individuals and Society: Key Findings and Reports’, (Department for Business, Innovation and Skills: 2013) https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/254101/bis-13-1268-benefits-of-higher-education-participation-the-quadrants.pdf

Descartes, René, Discourse on the Method of Rightly Conducting the Reason, and Seeking Truth in the Sciences (Gutenberg: 2008), http://www.gutenberg.org/files/59/59-h/59-h.htm

OECD, Education Indicators in Focus, (OECD: 2012) https://www.oecd.org/education/skills-beyond-school/Education%20Indicators%20in%20Focus%207.pdf