Software and data sets

My GitHub home page


The Threatening English Language (TEL) corpus

DOI

TEL, the Threatening English Language corpus, is a collection of 309 written texts compiled from the publicly available portion of CTARC (the Communicated Threat Assessment Research Corpus, compiled by Tammy Gales), MFT (the Malicious Forensic Texts corpus, compiled by Andrea Nini), and the written portion of CoJO (the Corpus of Judicial Opinions, compiled by Julia Muschalik). Additional texts come from ForensicLing.com (the forensic linguistic data site hosted by Tammy Gales and Dakota Wing). Basic metadata is supplied for each text where known from the original case research.


The Jack the Ripper Corpus

DOI

The Jack the Ripper corpus contains all the letters and postcards transcribed in the Appendix of

Evans S. P., Skinner K. (2001). Jack the Ripper: Letters from Hell. Stroud: Sutton.

The letters were OCR-scanned and manually checked. The corpus consists of 209 texts and 17,463 word tokens. The average length of a text in the corpus is 83 word tokens (min = 7, max = 648, SD = 67.4).


The Malicious Forensic Texts Corpus

DOI

The Malicious Forensic Texts (MFT) corpus is a corpus of authentic malicious forensic texts that has been compiled in order to study their register variation, where a malicious forensic text is defined as “a text that is a piece of written evidence in a forensic case that involves threat, abuse, defamation or a combination of the above”.


Multidimensional Analysis Tagger 

The Multidimensional Analysis Tagger is a program for Windows that replicates the tagger of Biber’s (1988) Variation across Speech and Writing for the multidimensional functional analysis of English texts, generally applied in studies of text type or genre variation. The program can generate a grammatically annotated version of the selected corpus as well as the statistics needed to perform a text-type or genre analysis. The program plots the input text or corpus on Biber’s (1988) Dimensions and determines its closest text type, as proposed in Biber (1989) A Typology of English Texts. Finally, the program offers a tool for visualising the Dimension features of an input text.
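As a rough illustration of what plotting a text on a Dimension involves, the sketch below computes a dimension score in the general manner of Biber (1988): each feature's normalised rate is standardised against reference-corpus values, and the z-scores of negatively loaded features are subtracted from those of positively loaded ones. This is not the tagger's own code. The function name, the feature names, and all numbers are hypothetical placeholders chosen for the example.

```python
def dimension_score(rates, reference, positive, negative):
    """Sum of z-scores of positive features minus sum of z-scores of negative features.

    rates:     feature -> normalised rate in the input text (assumed per 1,000 words)
    reference: feature -> (mean, standard deviation) in a reference corpus
    positive / negative: features loading positively / negatively on the dimension
    """
    def z(feature):
        mean, sd = reference[feature]
        return (rates[feature] - mean) / sd

    return sum(z(f) for f in positive) - sum(z(f) for f in negative)


# Hypothetical toy values, not taken from Biber (1988) or from the tagger.
toy_rates = {"feature_a": 12.0, "feature_b": 3.0}
toy_reference = {"feature_a": (10.0, 2.0), "feature_b": (5.0, 1.0)}

score = dimension_score(toy_rates, toy_reference, ["feature_a"], ["feature_b"])
print(score)
```

A text's closest text type could then be found by comparing its scores across all Dimensions with each text type's typical scores, although the distance measure used for that comparison is not stated here.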


The Great American Word Mapper

The Great American Word Mapper allows users to map the relative frequencies of the 97,246 most common words in an 8.9-billion-word corpus of 890 million geocoded Tweets collected from across the contiguous United States between 11 October 2013 and 22 November 2014. The original app was created by Jack Grieve, Andrea Nini, and Diansheng Guo for the Trees and Tweets project, funded by AHRC/ESRC/JISC/IMLS as part of Digging into Data 3. This Quartz version was redesigned by Nikhil Sonnad. The four word-by-county regional data matrices used for Word Mapper are available for download here. The website offers one matrix containing the relative frequencies per billion words of the 97,246 words measured across 3,075 counties, and three matrices containing the corresponding Getis-Ord Gi* z-scores, calculated using three different nearest-neighbour spatial weights matrices. See the papers and talks at the link above for more information. If you use the data in your research, please cite one or more of the papers.
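For readers unfamiliar with the two quantities in these matrices, the following sketch shows how a relative frequency per billion words and a Getis-Ord Gi* z-score can be computed, using the standard Ord and Getis formulation of the statistic. It is an illustration under stated assumptions, not the project's code: the binary k-nearest-neighbour weights, the choice of k, the function names, and the toy coordinates and values are all assumptions, and planar Euclidean distance stands in for whatever distance the original weights used.

```python
import math


def relative_frequency_per_billion(word_count, total_words):
    """Occurrences of a word per billion words of text."""
    return word_count * 1_000_000_000 / total_words


def knn_weights(coords, k):
    """Binary k-nearest-neighbour weights (an assumed weighting scheme).

    Each location also counts as its own neighbour, which is what
    distinguishes Gi* from Gi.
    """
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.hypot(coords[j][0] - xi, coords[j][1] - yi))
        weights[i][i] = 1.0
        for j in others[:k]:
            weights[i][j] = 1.0
    return weights


def getis_ord_gi_star(values, weights):
    """Gi* z-score for every location.

    Assumes the values are not all identical and that no location
    has every location as a neighbour (both would give a zero denominator).
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for w in weights:
        sum_w = sum(w)
        sum_w2 = sum(x * x for x in w)
        numerator = sum(wj * vj for wj, vj in zip(w, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


# Toy data: five locations on a line, with high values clustered on the left.
toy_coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
toy_values = [10.0, 10.0, 10.0, 1.0, 1.0]
toy_scores = getis_ord_gi_star(toy_values, knn_weights(toy_coords, k=1))
print(toy_scores)
```

Positive scores mark locations surrounded by high values and negative scores mark locations surrounded by low values, so mapping the z-scores rather than the raw relative frequencies smooths out local noise and highlights regional patterns.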