This (interactive) tutorial demonstrates the functions and capabilities of the tidylex package for transforming dictionary text data, specifically data in the “backslash code” format. We will apply the functions introduced in the Index to an actual dictionary file, using them in conjunction with other tidyverse functions to show how you can use tidylex to your advantage.
We will be using a subset of the Rotokas dictionary: a Toolbox file of all entries starting with ‘k’ (roughly 12,000 lines). Download the file from the page into a directory of your choice, making sure your RStudio working directory has access to it.
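First, load the packages used throughout this tutorial (tidylex for the dictionary functions, dplyr for the pipelines), and preview the first fifteen lines of the raw file. The preview call is just one way to inspect the file; the path assumes you saved it where the later chunks expect it.
library(tidylex)
library(dplyr)

readLines("../src/rotokas.dic")[1:15]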
## [1] "\\_sh v3.0 400 Rotokas Dictionary"
## [2] "\\_DateStampHasFourDigitYear"
## [3] ""
## [4] "\\lx kaa"
## [5] "\\ps V"
## [6] "\\pt A"
## [7] "\\ge gag"
## [8] "\\tkp nek i pas"
## [9] "\\dcsv true"
## [10] "\\vx 1"
## [11] "\\sc ???"
## [12] "\\dt 29/Oct/2005"
## [13] "\\ex Apoka ira kaaroi aioa-ia reoreopaoro."
## [14] "\\xp Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok."
## [15] "\\xe Apoka is gagging from food while talking."
Each section below demonstrates one of the following functions: read_lexicon(), add_group_col(), and compile_grammar().
read_lexicon()
Read your dictionary text file directly into a dataframe with read_lexicon(). With no additional arguments, read_lexicon() just inserts a column of line numbers.
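For example, the no-argument form might be used like this (a minimal sketch; the names of the default columns are whatever tidylex chooses, so inspect the result rather than relying on them):
read_lexicon("../src/rotokas.dic") %>% head()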
By specifying a regular expression (regex) and the column names (into), you can separate lines into their components. For now, let's just isolate the backslash codes from the values. Notice that the metadata lines (lines 1-3) do not match our regular expression, so their code and value columns come out as NA (we filter such lines out later when tallying codes).
rtk_df <- read_lexicon(
  file  = "../src/rotokas.dic",
  regex = "\\\\([a-z]+) (.*)", # Note two capture groups, in parentheses
  into  = c("code", "value")   # Captured data placed, respectively, in 'code' and 'value' columns
)
rtk_df %>% DT::datatable()
Having structured the dictionary lines into useful categories, you can use the filter() function to target specific lines, as in the following examples.
# List of all entry lines, English gloss lines, and Tok Pisin translations
rtk_df %>% filter(code %in% c("lx", "ge", "tkp")) %>% DT::datatable()
# List of all unknown data value lines
rtk_df %>% filter(value %>% stringr::str_detect("^\\?")) %>% DT::datatable()
The last dataframe, listing the unknown data values, isn't so informative on its own: ideally, we also want information about each line's parent headword, so we can identify which entries need fixing.
add_group_col()
By appending a grouping column to the dataframe, we can retain information on parent headwords before applying the filter() function.
# Grouping entries by parent headword ("lx_group")
rtk_df <-
  rtk_df %>%
  add_group_col(
    name  = lx_group,                  # Name of the new grouping column
    where = code == "lx",              # When to fill with a new value, i.e. when *not* to inherit it
    value = paste0(line, ": ", value)  # What the value should be when the above condition is true
  )
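So, for the first entry, the \lx line at file line 4 sets lx_group to "4: kaa", and every following line inherits that value until the next \lx line starts a new group.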
# Filtering again to retain only unknown data values
rtk_df %>%
  filter(value %>% stringr::str_detect("^\\?")) %>%
  # Adding a tally to see which entries have the most unknown values
  group_by(lx_group) %>%
  add_tally() %>%
  arrange(desc(n)) %>%
  DT::datatable()
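If you are curious what add_group_col() is doing conceptually, the following is a rough dplyr + tidyr equivalent (illustrative only; the real function's implementation and edge cases may differ):
library(tidyr)

rtk_manual <-
  rtk_df %>%
  ungroup() %>%
  mutate(lx_group = ifelse(code == "lx", paste0(line, ": ", value), NA)) %>%
  fill(lx_group) %>% # Inherit the most recent headword's value downward
  group_by(lx_group) # Re-group lines under their parent headword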
compile_grammar()
This function allows you to define a grammar of entry structures and test it against your dictionary. The following steps use it to find all entries that contain ungrammatical lines.
# Creating a skeleton grammar (that only validates five codes)
rtk_skeleton <-
'entry -> hword usage:+
usage -> gloss:+ example:?
example -> (rtk tkp eng):+
hword -> "lx" # Entry word
gloss -> "ge" # Gloss line
rtk -> "ex" # Rotokas example
tkp -> "xp" # Tok Pisin translation
eng -> "xe" # English translation
'
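Reading the quantifiers like their regular-expression counterparts, X:+ means one or more of X and X:? means an optional X: an entry is a headword followed by one or more usages, each consisting of at least one gloss and an optional example block of (rtk tkp eng) triples.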
# Applying compile_grammar()
rtk_parser <- compile_grammar(rtk_skeleton)
# Identifying the codes used in the grammar
skeleton_codes <-
  stringr::str_extract_all(rtk_skeleton, '"(.*?)"') %>%
  unlist() %>%
  stringr::str_remove_all('"')
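Printing skeleton_codes should confirm the five codes quoted in the grammar, in their order of appearance:
skeleton_codes
## [1] "lx" "ge" "ex" "xp" "xe"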
# Running dictionary through grammar
rtk_parsed <-
  rtk_df %>%                           # Remember, all lines are grouped by their lx_group
  filter(code %in% skeleton_codes) %>% # Keeping only lines specified by the grammar
  mutate(                              # Adding a column showing line grammaticality (T/F/NA)
    code_ok = rtk_parser$parse_str(code, return_labels = TRUE)
  )
rtk_parsed %>% DT::datatable()
# Isolating all entries with erroneous lines,
# listing each entry's (ungrammatical) code sequence
rtk_invalid <-
  rtk_parsed %>%
  filter(
    any(code_ok == FALSE, # Lines out of order
        is.na(code_ok))   # Lines left incomplete (expected lines missing afterwards)
  ) %>%
  summarise(code_seq = paste0(code, collapse = ", "))

rtk_invalid
## # A tibble: 9 x 2
## lx_group code_seq
## <chr> <chr>
## 1 2089: kapisi lx, ge, ex, xp, xe, ex, xp, xe, ex, xp, xe, ex, xp
## 2 4341: kavatao lx, ge, ex, xp, xe, ex, xp, xe, ex, xp
## 3 4432: kavee lx, ge, ex, xp, xe, ex, xp, xe, ex, xp
## 4 4825: kavu lx, ge, ex, xp, xe, ex, xp, xp
## 5 4980: keakeato lx, ge, ex, xp, xe, xp, xe
## 6 5620: kepito lx, ge, ex, xp, xe, ex, xp
## 7 5793: keravo lx, ge, ex, xp, xe, ex, xp
## 8 8487: kokoroku lx, ge, ex, xp, xe, ex, xp, xe, ex, xp, xe, ex, xp
## 9 8583: kokoruu lx, ge, ex, xp
# Listing all backslash codes used, sorted by frequency
rtk_codes <-
  rtk_df %>%
  filter(!is.na(code)) %>%
  group_by(code) %>%
  tally() %>%
  arrange(desc(n)) %>%
  mutate(weight = round(n * 100 / sum(n), 2))
# How many lines have been validated by the grammar?
processed_lex <- rtk_codes %>% filter(code %in% skeleton_codes)
paste("Lines processed:", sum(processed_lex$n))
## [1] "Lines processed: 6660"
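The remaining two figures can be computed along these lines (a sketch, which assumes rtk_df has kept the unmatched NA-code lines, so that nrow(rtk_df) equals the file's total line count):
paste("Number of lines in dictionary:", nrow(rtk_df))
paste0("Grammar coverage: ", round(sum(processed_lex$n) * 100 / nrow(rtk_df), 1), "%")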
## [1] "Number of lines in dictionary: 12132"
## [1] "Grammar coverage: 54.9%"