Corpus linguistics

- instructor: Costas Gabrielatos

Costas Gabrielatos will give two sessions: 'Keyword analysis' and 'Beyond word frequency'.

Keyword analysis

In this session Gabrielatos will explore definitions of the terms keyword and keyness, and discuss appropriate metrics, focusing on the distinction between effect size and statistical significance. He will also show how to derive true keywords (i.e. keywords identified by effect size) while still taking statistical significance into account. This matters because all current corpus tools but one (the exception being Sketch Engine) use an inappropriate metric, log-likelihood, which indicates only statistical significance.
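The distinction between effect size and statistical significance can be illustrated with a minimal Python sketch. It assumes the standard two-corpus log-likelihood formula as the significance measure and %DIFF (the percentage difference between normalized frequencies, as discussed in Gabrielatos & Marchi, 2012) as the effect-size measure. The function names and all frequency counts are hypothetical and are not taken from the session materials.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-corpus log-likelihood (G2): a measure of statistical significance."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the word were equally frequent in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def percent_diff(freq_study, size_study, freq_ref, size_ref):
    """%DIFF: percentage difference of normalized frequencies (effect size)."""
    nf_study = freq_study / size_study
    nf_ref = freq_ref / size_ref
    if nf_ref == 0:
        return math.inf  # the word is absent from the reference corpus
    return (nf_study - nf_ref) * 100 / nf_ref


# Hypothetical counts: the same relative difference in large and small corpora.
large = (1200, 10_000_000, 1000, 10_000_000)
small = (12, 100_000, 10, 100_000)
for label, counts in (("large", large), ("small", small)):
    print(label, round(percent_diff(*counts), 2), round(log_likelihood(*counts), 2))
```

In both hypothetical cases the word is 20% more frequent in the study corpus, so the effect size is the same. Only the pair of large corpora yields a log-likelihood value above 3.84 (the critical value for p < 0.05 with one degree of freedom), because the log-likelihood score grows with the amount of data as well as with the size of the difference.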

The corpus tools for this session will be WordSmith Tools (so that participants can use their own datasets), supplemented with Excel.

Assigned reading

Gabrielatos, C. & Marchi, A. (2012). Keyness: Appropriate metrics and practical issues. CADS International Conference, Bologna, Italy, 13-15 September 2012.

Kilgarriff, A. (2009). Simple maths for keywords. In Mahlberg, M., González-Díaz, V. & Smith, C. (eds.) Proceedings of the Corpus Linguistics Conference CL2009. University of Liverpool, UK, 20-23 July 2009.

Beyond word frequency

Overall, this session will focus on a more comprehensive view of 'frequency'. It will discuss how the normalized word frequency in a corpus may not always be the best way to count instances of a linguistic feature, and why it is better to view the normalized frequency of a linguistic unit as the number of instances of a feature out of the total number of opportunities for it to appear (Ball, 1994). The session will also show how the total number of instances (however measured) may be misleading on its own, and may need to be supplemented with metrics of dispersion/spread. Regarding word frequency, the session will show how token and type frequencies can be examined in combination – not collapsed into a single type-token ratio metric, but visualised two-dimensionally in a scatterplot.
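Two of these points can be sketched in a few lines of Python. The example below assumes DP (deviation of proportions, one of the dispersion measures discussed in Gries, 2008) as the dispersion metric, and contrasts a per-word rate with a per-opportunity rate. The function names and all counts are hypothetical and serve only as illustration.

```python
def deviation_of_proportions(part_freqs, part_sizes):
    """DP: 0 means perfectly even spread; values near 1 mean strong clumping."""
    total_freq = sum(part_freqs)
    total_size = sum(part_sizes)
    if total_freq == 0:
        raise ValueError("the item does not occur in the corpus")
    # Compare each part's share of the item with its share of the corpus.
    return 0.5 * sum(
        abs(freq / total_freq - size / total_size)
        for freq, size in zip(part_freqs, part_sizes)
    )


def rate_per_opportunity(instances, opportunities):
    """Instances of a feature out of the slots where it could have appeared."""
    return instances / opportunities


# Hypothetical corpus of four equally sized parts; both items occur 40 times.
part_sizes = [1000, 1000, 1000, 1000]
print(deviation_of_proportions([10, 10, 10, 10], part_sizes))  # evenly spread
print(deviation_of_proportions([40, 0, 0, 0], part_sizes))     # one part only

# Hypothetical feature: 50 instances in 10,000 words, but only 200 contexts
# in which the feature could have occurred at all.
print(rate_per_opportunity(50, 10_000))  # rate per word
print(rate_per_opportunity(50, 200))     # rate per opportunity
```

The two items have identical total frequencies but very different dispersion, and the same 50 instances look rare per word yet common per opportunity, which is why a raw or per-word count can mislead when read on its own.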

The corpus tool for this session will be WordSmith Tools, supplemented with Excel.

Assigned reading

Ball, C.N. (1994). Automated text analysis: Cautionary tales. Literary and Linguistic Computing 9(4): 265-302.

Gries, S.Th. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4): 403-437.

About the instructor

Costas Gabrielatos is Senior Lecturer in English Language at the Department of English & History, Edge Hill University, UK. His research interests are in the use and development of corpus approaches to issues in descriptive, theoretical and applied linguistics, particularly as regards the English language. Gabrielatos has published and presented widely on the nature, techniques, metrics and applications of corpus linguistics – in particular as regards conditionals and modality, learner language, and critical discourse studies. His most recent book, co-authored with Paul Baker and Tony McEnery, is Discourse Analysis and Media Attitudes: The Representation of Islam in the British Press, published by Cambridge University Press, 2013. Website: http://repository.edgehill.ac.uk/profile/3068