This interactive tutorial demonstrates how the tidylex package transforms dictionary text data, specifically data in the “backslash code” format. We will apply the functions introduced in the Index to an actual dictionary file, and combine them with functions from tidyr to show how you can use tidylex to your advantage.
We will be using a subset of the Rotokas dictionary: a Toolbox file containing all entries that start with ‘k’ (about 12,000 lines). Download the file from the page into a directory of your choice, and make sure that your working directory in RStudio has access to it. The first 15 lines of the file look like this:
## [1] "\\_sh v3.0 400 Rotokas Dictionary"
## [2] "\\_DateStampHasFourDigitYear"
## [3] ""
## [4] "\\lx kaa"
## [5] "\\ps V"
## [6] "\\pt A"
## [7] "\\ge gag"
## [8] "\\tkp nek i pas"
## [9] "\\dcsv true"
## [10] "\\vx 1"
## [11] "\\sc ???"
## [12] "\\dt 29/Oct/2005"
## [13] "\\ex Apoka ira kaaroi aioa-ia reoreopaoro."
## [14] "\\xp Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok."
## [15] "\\xe Apoka is gagging from food while talking."
Each section below demonstrates one of the following functions:
read_lexicon()
Read your dictionary text file directly into a dataframe with read_lexicon(). With no additional arguments, read_lexicon() just inserts a column of line numbers.
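To make the shape of that result concrete, here is a minimal base R sketch of a dataframe that pairs each line with its line number. It does not call tidylex; the sample lines are copied from the file excerpt above, and the column names `line` and `data` are illustrative assumptions rather than documented read_lexicon() output.

```r
# Stand-in for the first few lines of the dictionary file (copied from the
# excerpt above); in practice these would come from readLines() on your file.
lexicon_lines <- c(
  "\\_sh v3.0 400 Rotokas Dictionary",
  "\\_DateStampHasFourDigitYear",
  "",
  "\\lx kaa",
  "\\ps V"
)

# One row per line of text, with its position in the file kept alongside it.
# The column names `line` and `data` are assumptions made for illustration.
lexicon_df <- data.frame(
  line = seq_along(lexicon_lines),
  data = lexicon_lines,
  stringsAsFactors = FALSE
)

print(lexicon_df)
```

Keeping the line number next to each raw line means a later parsing problem can be traced back to an exact position in the source file.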
By specifying a regular expression (regex) and the column names (into), you can separate lines into their components. For now, let’s just isolate the backslash codes from the values. Notice how the metadata lines (lines 1-3) have been filtered out, since they do not satisfy our regular expression.
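The separation described above can be sketched in base R as follows. This is an illustration of the idea, not tidylex's implementation: the pattern, the sample lines, and the column names `line`, `code`, and `value` are assumptions chosen for this example.

```r
# Sample lines copied from the file excerpt above
lexicon_lines <- c(
  "\\_sh v3.0 400 Rotokas Dictionary",
  "\\_DateStampHasFourDigitYear",
  "",
  "\\lx kaa",
  "\\ps V",
  "\\ge gag"
)

# A literal backslash, a code made of lowercase letters, one space, then the
# value. Metadata codes begin with "_", so they cannot match [a-z]+.
pattern <- "^\\\\([a-z]+) (.*)$"

# regexec() records the full match plus each capture group; regmatches()
# returns an empty character vector for lines that do not match.
matches <- regmatches(lexicon_lines, regexec(pattern, lexicon_lines))
matched <- lengths(matches) == 3

# Keep only the matching lines, remembering their original line numbers
parsed <- data.frame(
  line  = which(matched),
  code  = vapply(matches[matched], `[`, character(1), 2, USE.NAMES = FALSE),
  value = vapply(matches[matched], `[`, character(1), 3, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

print(parsed)
```

With these sample lines, the two header lines and the blank line fail the pattern and drop out, so the first surviving row keeps its original line number, 4.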