1 Overview of what we’re going to do
2 What you’ll need
3 Tokenize the seed corpus (=extract all the words and their frequencies)
4 Choose a set of seed words
5 Make a list of essential Chamorro words
6 Run BootCaT: suggested choices to make on each screen
7 Repeat the tokenization procedure on your new corpus
8 Breather: where are we?
9 Part 2

1 Overview of what we’re going to do

Create a corpus of Chamorro from the web
Use it to estimate word frequencies
Compare to behavioral data

2 What you’ll need

Computer with internet connection
MS Excel, OpenOffice Calc, LibreOffice Calc, or other spreadsheet program
BootCaT frontend, which comes with Strawberry Perl: bootcat.sslmit.unibo.it/?section=download
Java: java.com/en/download
Windows Azure Marketplace Account Key
- First requires a Windows Live ID: signup.live.com
- www.bing.com/toolbox/bingsearchapi
- You get only 5,000 free searches per month, so don’t use them all at once!
Unix-like terminal
- On a Mac, you already have it: the “terminal”
- For Windows, install Cygwin: www.cygwin.com
If using Windows, Notepad++: notepad-plus-plus.org
Optional: R (www.r-project.org) and RStudio (www.rstudio.com)
Our data files:
- aimm3.tutorial.zip file

3 Tokenize the seed corpus (=extract all the words and their frequencies)

Matt has already collected a fair number of newspaper, Bible, and other texts. We pasted them all into one file: SeedCorpus_utf8.txt. Make a new folder somewhere, put a copy of the file in that folder, and open it up to have a look.
Get ready
- Open Cygwin or your Mac terminal (or a Linux window)
- Navigate to where you have the SeedCorpus_utf8.txt file stored
- cd c: will navigate to your C drive, for instance
- cd Users/Kie/Dropbox/PROJECT_ChamorroWebCorpus/ would then take you to the right folder if you were on my computer
- cd .. will take you up one directory level, if you ever need that
Look at the top of the file
- cat SeedCorpus_utf8.txt | head
Collapse upper and lower case using the tr program
- tr 'A-ZÅÑÁÉÍÓÚ' 'a-zåñáéíóú' < SeedCorpus_utf8.txt > SeedCorpus_case_collapsed.txt
- tr means “translate”
- Can you figure out how to look at the top of the resulting file?
Change anything that’s not a letter or apostrophe into a line break
- tr -sc "a-zåñáéíóú'" '\n' < SeedCorpus_case_collapsed.txt > SeedCorpus_one_per_line.txt
- -c means “translate the complement”: anything that’s not one of the listed letters or the apostrophe gets translated to a line break
- -s means “squeeze”: if you would end up with a bunch of line breaks in a row, turn them into a single one
Sort
- LC_ALL='C' sort < SeedCorpus_one_per_line.txt > SeedCorpus_sorted.txt
- Setting LC_ALL='C' makes sort compare raw byte values, so non-ASCII characters are ordered the same way on every system
Collapse identical tokens
- LC_ALL='C' uniq -c < SeedCorpus_sorted.txt > SeedCorpus_uniq.txt
- -c means “include a count of how many tokens were collapsed”
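Sort by frequency
- LC_ALL='C' sort -rn < SeedCorpus_uniq.txt > freq_list_from_seed_corpus.txt
- -n sorts by the numeric count at the start of each line; -r puts the most frequent words first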

There’s also a way to do it all in one step

cat SeedCorpus_utf8.txt | tr 'A-ZÅÑÁÉÍÓÚ' 'a-zåñáéíóú' | tr -sc "a-zåñáéíóú'" '\n' | LC_ALL='C' sort | LC_ALL='C' uniq -c | LC_ALL='C' sort -rn > freq_list_from_seed_corpus.txt

If you do this for another language, you’ll have to change certain things
- Is there case, or something analogous? If so, which characters need to collapse?
- What counts as a word-break character for you, besides white space? Apostrophe? Hyphen? How about numerals?
- If you’re working on a language written without spaces, you’re in trouble unless good resources already exist for guessing word boundaries.
If this didn’t work for you, freq_list_from_seed_corpus.txt is included in the zip file
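
If you would rather use a script than a pipeline, the same steps can be sketched in Python. This is only a minimal sketch and not part of the workshop files: the names count_tokens and frequency_list are made up for illustration, the letter inventory is copied from the tr commands above, and the sample sentence is invented. It also assumes the text uses precomposed characters (å stored as a single code point).

```python
import re
from collections import Counter

# Letter inventory copied from the tr commands in this section.
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÑÁÉÍÓÚ"
LOWER = "abcdefghijklmnopqrstuvwxyzåñáéíóú"
CASE_MAP = str.maketrans(UPPER, LOWER)

# A token is an unbroken run of lower-case letters and apostrophes.
TOKEN_RE = re.compile("[a-zåñáéíóú']+")


def count_tokens(text):
    """Collapse case, split on everything that is not a letter or apostrophe, count."""
    lowered = text.translate(CASE_MAP)
    return Counter(TOKEN_RE.findall(lowered))


def frequency_list(text):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    counts = count_tokens(text)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = "Håfa adai! Håfa tatatmanu hao?"  # invented sample text
    for word, n in frequency_list(sample):
        print(n, word)
```

To run it on the real file, read the text with open("SeedCorpus_utf8.txt", encoding="utf-8").read() and pass the result to frequency_list.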

4 Choose a set of seed words

Open freq_list_from_seed_corpus.txt in Excel or another spreadsheet program—open as fixed width, and as Unicode (UTF-8).
- You should have one column with frequencies and one with words
- Save this as a spreadsheet file
Make a new column with a formula to code whether the word is at least 4 letters
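- =IF(LEN(B1)>=4,"long enough",0)
- Type the quotation marks as straight quotes; curly quotes pasted from a document will make the formula fail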
- Copy this formula down through the whole column
Sort the file by this new length column
Starting with the most-frequent words that are long enough, see whether the word is sufficiently Chamorro-distinguishing. If so, mark 1 in the next column; if not, mark 0.
- For example, you might already know that para is a common word in Spanish, so you can just exclude it
- But you might not be sure about siha, so do a web search (use quotation marks or + to avoid searching merely-similar strings) and see if the first page of results looks Chamorro
Keep going till you have about 100 words marked as usable
Save these words (one per line) in a new UTF-8 text file, best_words_hand_chosen.txt
- Or, just copy our version into your folder
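
The length filter and the frequency sort can also be done outside the spreadsheet. The sketch below assumes each line of the frequency list has the layout produced by uniq -c (a count, then the word); the function names and the demo counts are invented for illustration. It only produces candidates: deciding whether a word is sufficiently Chamorro-distinguishing still has to be done by hand.

```python
def parse_freq_lines(lines):
    """Turn lines like '     12 siha' into (count, word) pairs."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:  # skip blank lines and counts that have no word
            continue
        rows.append((int(parts[0]), parts[1]))
    return rows


def long_candidates(lines, min_len=4):
    """Keep words of at least min_len letters, most frequent first."""
    rows = [(c, w) for c, w in parse_freq_lines(lines) if len(w) >= min_len]
    return sorted(rows, key=lambda row: -row[0])


if __name__ == "__main__":
    # Invented counts, for illustration only.
    demo = ["     30 i", "     12 siha", "      7 para", "      2 yan"]
    for count, word in long_candidates(demo):
        print(count, word)
```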

5 Make a list of essential Chamorro words

The idea is to identify words that no Chamorro document could lack—very-high-frequency function words.
Sort your spreadsheet file by frequency again.
Save the top 15 or so most-frequent words in a new UTF-8 text file, whitelist.txt

6 Run BootCaT: suggested choices to make on each screen

Project definition
- Corpus name: whatever you like
- Language: Unspecified
- More options: “Use whitelist”, and select whitelist.txt
Types: 5 – Tokens: 20 – Ratio 0.1
- That means a page has to contain at least 5 different words from the whitelist (types) and at least 20 whitelist tokens in total, and at least 10% of the tokens on the page have to be from the whitelist.
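
To make the three thresholds concrete, here is a small Python sketch of one way to read them. It is not BootCaT's actual code: passes_whitelist and its parameter names are hypothetical, and the whitelist words in the test are only examples.

```python
def passes_whitelist(tokens, whitelist, min_types=5, min_tokens=20, min_ratio=0.1):
    """Return True if a page's tokens meet all three whitelist thresholds."""
    if not tokens:
        return False
    hits = [t for t in tokens if t in whitelist]
    enough_types = len(set(hits)) >= min_types        # Types: 5
    enough_tokens = len(hits) >= min_tokens           # Tokens: 20
    enough_ratio = len(hits) / len(tokens) >= min_ratio  # Ratio: 0.1
    return enough_types and enough_tokens and enough_ratio
```

Under this reading, a page of 50 tokens with 20 whitelist tokens spread over 5 different whitelist words passes, while a page that repeats a single whitelist word 25 times does not.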
How do you want to proceed?
- Simple mode
Insert one seed per line
- Paste in contents of best_words_hand_chosen.txt
- Check “I’m done editing seeds”
The tuples that will be used as queries
- Tuple length: 4
- N. of tuples: 500
- You get 5,000 free queries per month, and BootCaT may simply hang if you exceed that, so don’t spend them all at once
- 500 is going to give you a small corpus, but once you are happy with the results you’re getting, you can run things again with the same settings and use up all your remaining searches for the month
- Click Generate tuples
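
To see what a tuple is, here is a rough Python sketch of drawing random seed combinations. It illustrates the general idea only and is not BootCaT's own code; make_tuples and its parameters are hypothetical. With 100 seeds and tuple length 4 there are 3,921,225 possible combinations, so 500 tuples sample only a tiny fraction of them.

```python
import random
from math import comb


def make_tuples(seeds, tuple_length=4, n_tuples=500, rng_seed=0):
    """Draw up to n_tuples distinct random combinations of seed words."""
    seeds = sorted(set(seeds))  # remove duplicate seeds
    rng = random.Random(rng_seed)  # fixed seed so the demo is repeatable
    # Never ask for more tuples than there are possible combinations.
    limit = min(n_tuples, comb(len(seeds), tuple_length))
    seen = set()
    while len(seen) < limit:
        combo = tuple(sorted(rng.sample(seeds, tuple_length)))
        seen.add(combo)
    return sorted(seen)


if __name__ == "__main__":
    demo_seeds = ["seed%d" % i for i in range(10)]  # placeholder seed words
    for query in make_tuples(demo_seeds, n_tuples=3):
        print(" ".join(query))
```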
Paste your Windows Azure Marketplace Account Key in the box
Next page (no title)
- Limit search to the following Internet domain: leave blank
- Exclude the following Internet domains: leave blank
- Adult filter: moderate
- Maximum number of URLs to return for each tuple: 10
- Click Collect URLs. Takes a few minutes
- When bar says 100%, click Next.
Here you can verify and remove individual URLs from the list
- In theory this could yield 10 × 500 = 5,000 URLs, but in practice some query combinations won’t get any results, and duplicate URLs are then removed. So the result will be far fewer than 5,000.
- Since your list will be small, it’s worth looking at the URLs by hand to exclude bad ones
- archive.org results are usually no good
- If you actually visit any links, make sure your anti-virus/malware defenses are up!
Next page
- Click Build corpus. Takes some more time.
- Result will be saved in something like My Documents/BootCaT Corpora
- Or choose File > My Corpora to open the folder where they are
What to do if the Build corpus step hangs up and gives you some kind of Java memory error
- The culprit is whatever web page was being downloaded and boilerplate-stripped at that moment
- Make a copy of your list of URLs (url_list_edited.txt) in Notepad++ and delete the offending URL (save it with a new name)
- Re-open BootCaT and try again
- But this time, choose Custom URLs mode, and use the new URL-list file
- Repeat until BootCaT can get through the whole list

7 Repeat the tokenization procedure on your new corpus

You’ll have to navigate to the folder where corpus.txt is
You’ll also have to change the name of the input file in your command to corpus.txt, and the name of the output file to something like tokenized_corpus.txt
Challenge: Make a scatterplot of the most-frequent words’ frequency in the seed corpus vs. in your corpus
- The easiest way is probably to do it in a spreadsheet program
- I suggest Excel’s function VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]). Be sure to set range_lookup to FALSE.
- If you’re proficient in R, Python, awk, or the like, you can probably think of a way to do it with one of those.
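
For the Python route, one possible approach is sketched below. It does the same job as VLOOKUP with an exact match: for each of the most frequent seed-corpus words, look up the count in the web corpus and use 0 if the word is missing. The function names and the demo counts are invented. Because the two corpora differ in size, the sketch also includes a helper that converts raw counts to frequencies per million tokens.

```python
def per_million(counts):
    """Convert raw counts to frequencies per million tokens."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: 1_000_000 * c / total for word, c in counts.items()}


def pair_frequencies(seed_counts, web_counts, top_n=100):
    """Pair each of the top_n seed-corpus words with its web-corpus count (0 if absent)."""
    top = sorted(seed_counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return [(word, c, web_counts.get(word, 0)) for word, c in top]


if __name__ == "__main__":
    # Invented counts, for illustration only.
    seed = {"i": 30, "siha": 12, "para": 7}
    web = {"i": 10, "para": 5}
    for word, x, y in pair_frequencies(seed, web):
        print(word, x, y, sep="\t")
```

The tab-separated rows can be pasted into a spreadsheet, or passed to any plotting tool, with the seed-corpus value on one axis and the web-corpus value on the other.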
Because you only did 500 searches, your corpus will probably be smaller than the seed corpus
Open up our corpus file (corpus.txt, inside the Chamorro_4_Sept_2015A folder), which was made with 4,000 queries
- Tokenize it
Challenge: first remove all the lines with the URL names, so that URL parts don’t contribute to word frequencies
Make another scatterplot, comparing the seed corpus to this larger corpus
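
For the URL-removal challenge, here is one possible Python sketch. It assumes that the URL lines in corpus.txt start with the marker CURRENT URL or with http:// or https://. That is an assumption about the file layout, so open your own corpus.txt and check before relying on it. The function name and the sample lines are made up for illustration.

```python
# Assumed prefixes of URL lines; check them against your own corpus.txt.
URL_LINE_PREFIXES = ("CURRENT URL", "http://", "https://")


def drop_url_lines(lines, prefixes=URL_LINE_PREFIXES):
    """Return only the lines that do not start with one of the URL prefixes."""
    return [line for line in lines if not line.lstrip().startswith(prefixes)]


if __name__ == "__main__":
    sample = [
        "CURRENT URL http://example.com/a",  # invented sample lines
        "Håfa adai",
        "https://example.com/b",
    ]
    for line in drop_url_lines(sample):
        print(line)
```

The cleaned lines can then be written to a new file and tokenized in the same way as before.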

8 Breather: where are we?

We now have a bigger corpus of Chamorro, made from a variety of websites (newspapers, blogs, discussion forums, Wikipedia…)
It does have a fair amount of noise, from pages that are actually in English and other languages
There are also pages written in a mix of Chamorro and English. This probably just reflects the sociolinguistic reality for this language
This noise won’t affect the frequencies of Chamorro words, except for words that also exist in other languages that are significant contaminants in the corpus
Now, let’s use our word-frequency estimates…

Building digital resources for research on under-resourced languages: Corpus-based and behavioral approaches

Part I: Building a corpus

presented by Matt Wagers & Kie Zuraw, AIMM3 @ UMass

2 October, 2015
