The attached PDF (used because WordPress doesn’t support even .txt files, let alone .r ones) contains a 100-some-line R script, written in RStudio, that I’ve used for some basic forays into textual analysis.
The script is essentially useless, but it quantifies three textual phenomena: richness of vocabulary, density of indefinite pronouns, and density of hapax legomena (words that appear only once). All three measures are computed over sequential samples of words, with the sample size based on the length of the text that is ‘fed’ into the code.
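For readers who want a rough idea of what these three measures look like in R without opening the PDF, the following is a minimal sketch and not the attached script. Several things in it are assumptions: the function names (`tokenize`, `chunk_measures`, `sample_measures`), the short `indefinite_pronouns` list, the definition of richness as distinct words divided by total words, and the choice to divide the hapax count by the number of words in the sample. The sample size is passed in by hand as `chunk_size`, because the rule that derives it from the length of the text is not reproduced here.

```r
# Hypothetical pronoun list for illustration; the attached script may use a different one.
indefinite_pronouns <- c("all", "another", "any", "anybody", "anyone", "anything",
                         "each", "everybody", "everyone", "everything", "few",
                         "many", "nobody", "none", "nothing", "one", "several",
                         "some", "somebody", "someone", "something")

# Lowercase the text and split on runs of non-word characters.
# Note that this splits contractions ("don't" becomes "don" and "t").
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "\\W+"))
  words[words != ""]
}

# Compute the three measures for one sample of words.
chunk_measures <- function(words) {
  n <- length(words)
  freqs <- table(words)  # frequency of each distinct word in the sample
  data.frame(
    tokens = n,
    richness = length(freqs) / n,  # distinct words / total words (assumed definition)
    indefinite_density = sum(words %in% indefinite_pronouns) / n,
    hapax_density = sum(freqs == 1) / n  # words occurring exactly once / total words
  )
}

# Cut the text into sequential samples of chunk_size words and measure each one.
# A trailing partial sample is dropped so that every sample has the same size.
sample_measures <- function(text, chunk_size) {
  words <- tokenize(text)
  n_chunks <- length(words) %/% chunk_size
  stopifnot(n_chunks >= 1)
  words <- words[seq_len(n_chunks * chunk_size)]
  chunk_id <- rep(seq_len(n_chunks), each = chunk_size)
  do.call(rbind, lapply(split(words, chunk_id), chunk_measures))
}

# Example: two sequential samples of six words each.
example_text <- "Nothing is certain and nothing is simple, but someone said all is well."
print(sample_measures(example_text, chunk_size = 6))
```

Each row of the resulting data frame is one sample, in the order the samples occur in the text, so the columns can be plotted directly against the row index.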
The measures are essentially useless because all of the variables are contingent on one another. If uniqueness goes up, indefinite pronoun density has to go down, and hapax density goes up, though not by as much as indefinite pronoun density falls. Since these last two are arbitrary groupings of words, their increases of course come at uniqueness’s expense. Mostly I just needed to get some code up and running for a statistics project.
Comments are included to give a sense of what each line is doing, because not enough people using R for literary analysis do that.
Thanks are due to Matthew L. Jockers for his book, Text Analysis with R for Students of Literature, which I found literally indispensable.