SeedCorpus_utf8.txt. Make a new folder somewhere, put a copy in that folder, and open it up to have a look.
cd c:
will navigate to your C drive, for instance
cd Users/Kie/Dropbox/PROJECT_ChamorroWebCorpus/
would then take you to the right folder if you were on my computer
cd ..
will take you up one directory level, if you ever need that
cat SeedCorpus_utf8.txt | head
tr program
tr '[A-ZÅÑÁÉÍÓÚ]' '[a-zåñáéíóú]' < SeedCorpus_utf8.txt > SeedCorpus_case_collapsed.txt
tr
means “translate”
tr -sc "[a-zåñáéíóú']" '\n' < SeedCorpus_case_collapsed.txt > SeedCorpus_one_per_line.txt
-c
means “translate the complement”: anything that’s not one of the letters in square brackets gets translated to a line break
-s
means “squeeze”: if you would end up with a bunch of line breaks in a row, turn them into a single one
LC_ALL='C' sort < SeedCorpus_one_per_line.txt > SeedCorpus_sorted.txt
LC_ALL
set to 'C', forces plain byte-by-byte comparison, so non-ASCII characters are dealt with in a consistent way
LC_ALL='C' uniq -c < SeedCorpus_sorted.txt > SeedCorpus_uniq.txt
-c
means “include a count of how many tokens were collapsed”
LC_ALL='C' sort < SeedCorpus_uniq.txt > freq_list_from_seed_corpus.txt
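To see what each step does, it can help to run them on something tiny first. The following is a minimal sketch, not part of the original steps: the one-line sample text and the toy_*.txt file names are made up for illustration, and the character sets are quoted so the shell passes them to tr unchanged.

```shell
#!/bin/sh
# Toy walk-through of the five steps on a made-up one-line corpus.
# The sample sentence and all toy_*.txt names are invented for this example.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' "Håfa adai! HÅFA Adai, håfa?" > toy_corpus.txt

# Step 1: collapse upper case to lower case
tr '[A-ZÅÑÁÉÍÓÚ]' '[a-zåñáéíóú]' < toy_corpus.txt > toy_case_collapsed.txt

# Step 2: turn every run of non-letter characters into a single line break
tr -sc "[a-zåñáéíóú']" '\n' < toy_case_collapsed.txt > toy_one_per_line.txt

# Step 3: sort so that identical words end up on adjacent lines
LC_ALL='C' sort < toy_one_per_line.txt > toy_sorted.txt

# Step 4: collapse identical adjacent lines, prefixing each with its count
LC_ALL='C' uniq -c < toy_sorted.txt > toy_uniq.txt

# Step 5: sort the counted lines
LC_ALL='C' sort < toy_uniq.txt > toy_freq_list.txt

# Shows adai with a count of 2, then håfa with a count of 3
# (the amount of padding before the counts depends on your version of uniq)
cat toy_freq_list.txt
```

Opening each toy_*.txt file in turn shows the intermediate result of the corresponding step.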
There’s also a way to do it all in one step:
cat SeedCorpus_utf8.txt | tr '[A-ZÅÑÁÉÍÓÚ]' '[a-zåñáéíóú]' | tr -sc "[a-zåñáéíóú']" '\n' | LC_ALL='C' sort | LC_ALL='C' uniq -c | LC_ALL='C' sort > freq_list_from_seed_corpus.txt
If this didn’t work for you, freq_list_from_seed_corpus.txt is included in the zip file.
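A quick way to check the frequency list is to look at the most frequent words first. The sketch below assumes the list has the count-then-word layout that uniq -c writes. The toy list and its counts are made up so that the example runs on its own, and sort -rn (reverse numeric sort) is a swapped-in option that the steps above do not use.

```shell
#!/bin/sh
# Made-up frequency list in the "count word" layout that uniq -c writes.
toy_list=$(mktemp)
printf '%7d %s\n' 2 adai 12 na 3 håfa > "$toy_list"

# -n compares the leading counts as numbers, -r puts the largest first;
# head -n 2 keeps only the top two lines.
top_two=$(LC_ALL='C' sort -rn < "$toy_list" | head -n 2)

# Shows na (12) first, then håfa (3)
printf '%s\n' "$top_two"
```

On the real file, the same idea would be LC_ALL='C' sort -rn < freq_list_from_seed_corpus.txt | head, which prints the ten most frequent words.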
=if(len(b1)>=4,"long enough",0)
best_words_hand_chosen.txt
whitelist.txt
whitelist.txt
My Documents/BootCaT Corpora
corpus.txt, and the name of the output file to something like tokenized_corpus.txt
VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]). Be sure to set range_lookup to FALSE.
(corpus.txt, inside the Chamorro_4_Sept_2015A folder), which was made with 4,000 queries