[Winter 2010] Tutorial on quantitative corpus linguistics and R

Tue 27 December 2011 by Adrian Brasoveanu

Why corpus linguistics? Here’s part of the conclusion of Abney (1996), “Statistical methods and linguistics”; see also chapter 1 of Manning & Schütze (1999), “Foundations of Statistical Natural Language Processing”:

In closing, let me repeat the main line of argument as concisely as I can. Statistical methods—by which I mean primarily weighted grammars and distributional induction methods—are clearly relevant to language acquisition, language change, language variation, language generation, and language comprehension. Understanding language in this broad sense is the ultimate goal of linguistics. The issues to which weighted grammars apply, particularly as concerns perception of grammaticality and ambiguity, one may be tempted to dismiss as performance issues. However, the set of issues labelled “performance” are not essentially computational, as one is often led to believe. Rather, “competence” represents a provisional narrowing and simplification of data in order to understand the algebraic properties of language. “Performance” is a misleading term for “everything else”. Algebraic methods are inadequate for understanding many important properties of human language, such as the measure of goodness that permits one to identify the correct parse out of a large candidate set in the face of considerable noise. […] The focus in computational linguistics has admittedly been on technology. But the same techniques promise progress at long last on questions about the nature of language that have been mysterious for so long. The time is ripe to apply them.

This tutorial is based primarily on Gries (2009); see the references in the R scripts for additional sources.

  1. Intro to R: CLG-meeting-1.r

  2. Indexing, slicing, logical & set operators, counting: CLG-meeting-2.r

  3. Raw input, basic graphics, intro to data frames: CLG-meeting-3.r

  4. Subsetting, editing and ordering data frames, lists, more graphics, introducing the Brown corpus: CLG-meeting-4.r, brown-text-categories.txt, brown-tag-set.txt

  5. Elementary programming constructs (if, while, repeat, next, break, for loops, etc.), taking advantage of the vectorized nature of R, general programming tips: CLG-meeting-5.r

  6. Character/string processing, searching & replacing without regular expressions, intro to regular expressions (disjunction, character classes, wildcard, negation, various abbreviations for character classes, quantifiers, non-greedy matching, back-referencing, look-around), putting it all together: CLG-meeting-6.r

  7. Using R in corpus linguistics: frequency lists for unannotated corpora, reverse frequency lists for unannotated corpora, frequency lists for annotated corpora, frequency lists of word-tag sequences, frequency lists of word pairs: CLG-meeting-7.r

  8. Concordances for unannotated corpora and for annotated (SGML POS-tagged) corpora, linguistic applications (potential prepositional verbs with “in”, preposition stranding in questions, potential verb-particle constructions, potential cognate object constructions): CLG-meeting-8.r, exact-matches.r

  9. Lemma-based concordances for POS-tagged and lemmatized corpora, collocations: CLG-meeting-9.r
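
To give a flavor of what meetings 7 and 8 cover, here is a minimal base-R sketch of a word frequency list, a word-pair frequency list, and a simple concordance for an unannotated corpus. It is not taken from the CLG-meeting scripts: the toy corpus and the helper names `tokenize` and `kwic` are hypothetical, and the tokenizer (lowercase, split on anything that is not a-z) is a deliberate simplification.

```r
# Toy corpus: three made-up sentences (illustrative data only,
# not from the tutorial scripts or the Brown corpus).
corpus <- c("The cat sat on the mat.",
            "The dog sat in the sun.",
            "A cat slept in the sun.")

# Hypothetical helper: lowercase a sentence, split it on runs of
# non-letters, and drop any empty strings the split leaves behind.
tokenize <- function(s) {
  w <- unlist(strsplit(tolower(s), "[^a-z]+"))
  w[nzchar(w)]
}

# One character vector of tokens per sentence, plus a flat token vector.
sentences <- lapply(corpus, tokenize)
tokens <- unlist(sentences)

# Frequency list: table() counts types, sort() orders by frequency.
freq <- sort(table(tokens), decreasing = TRUE)

# Word-pair frequency list: pair each word with its successor,
# sentence by sentence, so pairs never span a sentence boundary.
bigrams <- unlist(lapply(sentences, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Hypothetical helper: a simple concordance (keyword in context) with
# up to `span` words of context on each side of the node word.
# Returns NULL if the node word does not occur.
kwic <- function(sentences, node, span = 2) {
  rows <- list()
  for (w in sentences) {
    for (i in which(w == node)) {
      left  <- tail(w[seq_len(i - 1)], span)
      right <- head(w[-seq_len(i)], span)
      rows[[length(rows) + 1]] <- data.frame(
        left  = paste(left, collapse = " "),
        node  = node,
        right = paste(right, collapse = " "),
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

print(freq)
print(bigram_freq)
print(kwic(sentences, "sat"))
```

On a real corpus the same pattern applies after reading the files in with, for example, `scan()` or `readLines()`; tagged corpora need a tokenizer that keeps the word-tag pairs together.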