Overview

This interactive tutorial demonstrates the functions and capabilities of the tidylex package for transforming dictionary text data, specifically data in the “backslash code” format. We will apply the functions introduced in the Index to an actual dictionary file, using them in conjunction with other tidyverse functions to show how you can use tidylex to your advantage.

0. Access the dictionary file

We will be using a subset of the Rotokas dictionary: a Toolbox file containing all entries starting with ‘k’ (roughly 12,000 lines). Download the file from the page into a directory of your choice, making sure your RStudio working directory can access it.

library(dplyr)  # provides the %>% pipe used below

rtk_lines <- readLines("../src/rotokas.dic")

rtk_lines %>% head(15)
##  [1] "\\_sh v3.0  400  Rotokas Dictionary"                                 
##  [2] "\\_DateStampHasFourDigitYear"                                        
##  [3] ""                                                                    
##  [4] "\\lx kaa"                                                            
##  [5] "\\ps V"                                                              
##  [6] "\\pt A"                                                              
##  [7] "\\ge gag"                                                            
##  [8] "\\tkp nek i pas"                                                     
##  [9] "\\dcsv true"                                                         
## [10] "\\vx 1"                                                              
## [11] "\\sc ???"                                                            
## [12] "\\dt 29/Oct/2005"                                                    
## [13] "\\ex Apoka ira kaaroi aioa-ia reoreopaoro."                          
## [14] "\\xp Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok."
## [15] "\\xe Apoka is gagging from food while talking."

Each of the sections below exemplifies one of the following functions:

  1. read_lexicon()
  2. add_group_col()
  3. compile_grammar()

1. Read lexicon files with read_lexicon()

Read your dictionary text file directly into a dataframe with read_lexicon(). With no additional arguments, read_lexicon() just inserts a column of line numbers.

By specifying a regular expression (regex) and column names (into), you can separate each line into its components. For now, let’s just separate the backslash codes from their values. Notice that the metadata lines (lines 1-3) are filtered out, since they do not match our regular expression.
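A minimal sketch of both calls (the regex here is an assumption that matches lines like `\lx kaa`; adjust it to your data):

```r
library(tidylex)
library(dplyr)

# With no extra arguments: one row per dictionary line, plus a line-number column
rtk_df <- read_lexicon("../src/rotokas.dic")

# With a regex and `into` column names: split each line into code and value.
# "\\\\([a-z]+) (.*)" captures the backslash code (e.g. "lx") and the rest of
# the line; metadata lines such as "\\_sh ..." begin with an underscore, so
# they do not match and are dropped.
rtk_df <- read_lexicon(
  "../src/rotokas.dic",
  regex = "\\\\([a-z]+) (.*)",
  into  = c("code", "value")
)
```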

1b. Filtering

Having structured the dictionary lines into useful categories, you can use the filter() function to target specific lines, as in the following examples.
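For instance, assuming the `code` and `value` columns produced in the previous section, filter() can pick out particular kinds of lines (a sketch):

```r
library(dplyr)

# All part-of-speech lines
rtk_df %>% filter(code == "ps")

# All lines whose value is the unknown-data placeholder "???"
rtk_df %>% filter(value == "???")
```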

The last dataframe, containing the lines with unknown data values, isn’t very informative on its own: ideally, we also want each line’s parent headword, so we can identify which entries need fixing.

2. Group lines together with add_group_col()

By appending a grouping column to the dataframe, we can retain information on parent headwords before applying the filter() function.
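A sketch of the idea (the argument names here are assumptions, not the package’s documented interface; see ?add_group_col). Each new group starts at a `\lx` line, so every row inherits its headword’s group label:

```r
library(dplyr)

# Assumed interface: start a new group at every "\lx" line and label the
# group with the headword. Check ?add_group_col for the real arguments.
rtk_grouped <- rtk_df %>%
  add_group_col(lx_group, regex = "\\\\lx (.*)")

# Filtering now keeps the parent headword with every matching line
rtk_grouped %>% filter(value == "???")
```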

3. Validate dictionary structures using compile_grammar()

This function lets you define the expected grammatical structure of an entry and test it against your dictionary. The following method uses it to find all entries that contain ungrammatical lines.

## # A tibble: 9 x 2
##   lx_group       code_seq                                          
##   <chr>          <chr>                                             
## 1 2089: kapisi   lx, ge, ex, xp, xe, ex, xp, xe, ex, xp, xe, ex, xp
## 2 4341: kavatao  lx, ge, ex, xp, xe, ex, xp, xe, ex, xp            
## 3 4432: kavee    lx, ge, ex, xp, xe, ex, xp, xe, ex, xp            
## 4 4825: kavu     lx, ge, ex, xp, xe, ex, xp, xp                    
## 5 4980: keakeato lx, ge, ex, xp, xe, xp, xe                        
## 6 5620: kepito   lx, ge, ex, xp, xe, ex, xp                        
## 7 5793: keravo   lx, ge, ex, xp, xe, ex, xp                        
## 8 8487: kokoroku lx, ge, ex, xp, xe, ex, xp, xe, ex, xp, xe, ex, xp
## 9 8583: kokoruu  lx, ge, ex, xp
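To see the shape of this check without the package, the same idea can be approximated with plain dplyr and a regular expression over each entry’s code sequence (a rough sketch, assuming a grouped dataframe like the one from add_group_col(), here called `rtk_grouped`; compile_grammar() is the package’s own, more robust mechanism):

```r
library(dplyr)

# Collapse each entry's backslash codes into a single sequence string, then
# keep only the entries whose sequence does not match an expected shape.
# The pattern below is illustrative: a headword, a gloss, then complete
# (ex, xp, xe) example triples.
rtk_grouped %>%
  group_by(lx_group) %>%
  summarise(code_seq = paste(code, collapse = ", ")) %>%
  filter(!grepl("^lx, ge(, ex, xp, xe)*$", code_seq))
```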

3b. How much of the dictionary does your grammar cover?

paste("Lines processed:", sum(processed_lex$n))
## [1] "Lines processed: 6660"
paste("Number of lines in dictionary:", sum(rtk_codes$n))
## [1] "Number of lines in dictionary: 12132"
paste0("Grammar coverage: ", round(100 * sum(processed_lex$n) / sum(rtk_codes$n), 2), "%")
## [1] "Grammar coverage: 54.9%"