About this document

These are the lightly edited handouts from Kie Zuraw and Matt Wagers’s tutorial at AIMM3, the American International Morphology Meeting 3, at the University of Massachusetts, Amherst (2-4 October 2015).

The first handout guides you through building a web-scraped corpus of a given language – here, Chamorro – using Unix command-line tools and the BootCaT system.

The second handout describes behavioral surveys and experiments used to collect comparable data about word frequency and word-form preferences. It guides the reader through the analysis of these data and their use in cross-validating the language data collected from the web (using R, RStudio and grep).

Abstract

The goal of this tutorial is to showcase some computational and experimental approaches for exploring the morphological structure of languages that lack curated linguistic resources (typically, but not always, smaller or understudied languages). We will demonstrate how to use off-the-shelf software, such as BootCaT (Baroni, Bernardini, Zanchetta, Ljubešić, and Shaoul) and ordinary Unix command-line tools, to build a web-scraped corpus and to begin answering questions about word form (co-)occurrence patterns. We will also demonstrate some experimental designs and analyses adapted for smaller language communities; these can be used to collect information about word familiarity, the conditional probability of using particular word forms, and the processing correlates of morphological complexity. Throughout, we will be interested in the complementarity of the two approaches, with an eye toward cross-validating the measures we derive. This tutorial will focus on examples from Austronesian languages, particularly Chamorro and Tagalog. To the extent possible, it will be ‘live’: data will be compiled as the tutorial proceeds. Students will be provided with the necessary resources to follow along.