This (interactive) tutorial demonstrates the functions and capabilities of the tidylex package for transforming dictionary text data, specifically data in the “backslash code” format. We will apply the functions introduced in the Index to an actual dictionary file, using them in conjunction with other tidyverse functions to show how you can use tidylex to your advantage.
We will be using a subset of the Rotokas dictionary: a Toolbox file of all entries starting with ‘k’ (roughly 12,000 lines). Download the file from the page into a directory of your choice, making sure your RStudio working directory has access to it.
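First, load the packages used throughout this tutorial (tidylex for the dictionary functions, dplyr for the pipelines), and preview the first fifteen lines of the raw file. The preview call is just one way to inspect the file; the path assumes you saved it where the later chunks expect it.
library(tidylex)
library(dplyr)

readLines("../src/rotokas.dic")[1:15]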
## [1] "\\_sh v3.0 400 Rotokas Dictionary"
## [2] "\\_DateStampHasFourDigitYear"
## [3] ""
## [4] "\\lx kaa"
## [5] "\\ps V"
## [6] "\\pt A"
## [7] "\\ge gag"
## [8] "\\tkp nek i pas"
## [9] "\\dcsv true"
## [10] "\\vx 1"
## [11] "\\sc ???"
## [12] "\\dt 29/Oct/2005"
## [13] "\\ex Apoka ira kaaroi aioa-ia reoreopaoro."
## [14] "\\xp Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok."
## [15] "\\xe Apoka is gagging from food while talking."
Each section below demonstrates one of the following functions: read_lexicon(), add_group_col(), and compile_grammar().
read_lexicon()
Read your dictionary text file directly into a dataframe with read_lexicon(). With no additional arguments, read_lexicon() just inserts a column of line numbers.
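For example, the no-argument form might be used like this (a minimal sketch; the names of the default columns are whatever tidylex chooses, so inspect the result rather than relying on them):
read_lexicon("../src/rotokas.dic") %>% head()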
By specifying a regular expression (regex) and the column names (into), you can separate lines into their components. For now, let's just isolate the backslash codes from the values. Notice that the metadata lines (lines 1-3) do not match our regular expression, so their code and value columns come out as NA (we filter such lines out later when tallying codes).
rtk_df <- read_lexicon(
  file  = "../src/rotokas.dic",
  regex = "\\\\([a-z]+) (.*)", # Note two capture groups, in parentheses
  into  = c("code", "value")   # Captured data placed, respectively, in 'code' and 'value' columns
)
rtk_df %>% DT::datatable()
Having structured the dictionary lines into useful categories, you can use the filter() function to target specific lines, as in the following examples.
# List of all entry lines, English gloss lines, and Tok Pisin translations
rtk_df %>% filter(code %in% c("lx", "ge", "tkp")) %>% DT::datatable()
# List of all unknown data value lines
rtk_df %>% filter(value %>% stringr::str_detect("^\\?")) %>% DT::datatable()
The last dataframe, listing the unknown data values, isn't so informative on its own: ideally, we also want information about each line's parent headword, so we can identify which entries need fixing.
add_group_col()
By appending a grouping column to the dataframe, we can retain information on parent headwords before applying the filter() function.
# Grouping entries by parent headword ("lx_group")
rtk_df <-
  rtk_df %>%
  add_group_col(
    name  = lx_group,                  # Name of the new grouping column
    where = code == "lx",              # When to fill with a new value, i.e. when *not* to inherit it
    value = paste0(line, ": ", value)  # What the value should be when the above condition is true
  )
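So, for the first entry, the \lx line at file line 4 sets lx_group to "4: kaa", and every following line inherits that value until the next \lx line starts a new group.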
# Filtering again to retain only unknown data values
rtk_df %>%
  filter(value %>% stringr::str_detect("^\\?")) %>%
  # Adding a tally to see which entries have the most unknown values
  group_by(lx_group) %>%
  add_tally() %>%
  arrange(desc(n)) %>%
  DT::datatable()
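If you are curious what add_group_col() is doing conceptually, the following is a rough dplyr + tidyr equivalent (illustrative only; the real function's implementation and edge cases may differ):
library(tidyr)

rtk_manual <-
  rtk_df %>%
  ungroup() %>%
  mutate(lx_group = ifelse(code == "lx", paste0(line, ": ", value), NA)) %>%
  fill(lx_group) %>% # Inherit the most recent headword's value downward
  group_by(lx_group) # Re-group lines under their parent headword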
compile_grammar()
This function allows you to define a grammar of entry structures and test it against your dictionary. The following steps use it to find all entries that contain ungrammatical lines.
# Creating a skeleton grammar (that only validates five codes)
rtk_skeleton <-
'entry -> hword usage:+
usage -> gloss:+ example:?
example -> (rtk tkp eng):+
hword -> "lx" # Entry word
gloss -> "ge" # Gloss line
rtk -> "ex" # Rotokas example
tkp -> "xp" # Tok Pisin translation
eng -> "xe" # English translation
'
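Reading the quantifiers like their regular-expression counterparts, X:+ means one or more of X and X:? means an optional X: an entry is a headword followed by one or more usages, each consisting of at least one gloss and an optional example block of (rtk tkp eng) triples.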
# Applying compile_grammar()
rtk_parser <- compile_grammar(rtk_skeleton)
# Identifying the codes used in the grammar
skeleton_codes <-
  stringr::str_extract_all(rtk_skeleton, '"(.*?)"') %>%
  unlist() %>%
  stringr::str_remove_all('"')
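Printing skeleton_codes should confirm the five codes quoted in the grammar, in their order of appearance:
skeleton_codes
## [1] "lx" "ge" "ex" "xp" "xe"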
# Running dictionary through grammar
rtk_parsed <-
  rtk_df %>%                           # Remember, all lines are grouped by their lx_group
  filter(code %in% skeleton_codes) %>% # Keeping only lines specified by the grammar
  mutate(                              # Adding a column showing line grammaticality (T/F/NA)
    code_ok = rtk_parser$parse_str(code, return_labels = TRUE)
  )
rtk_parsed %>% DT::datatable()
# Isolating all entries with erroneous lines,
# listing each entry's (ungrammatical) code sequence
rtk_invalid <-
  rtk_parsed %>%
  filter(
    any(code_ok == FALSE, # Lines out of order
        is.na(code_ok))   # Lines left incomplete (expected lines missing afterwards)
  ) %>%
  summarise(code_seq = paste0(code, collapse = ", "))

rtk_invalid
## # A tibble: 9 x 2
## lx_group code_seq
## <chr> <chr>
## 1 2089: kapisi lx, ge, ex, xp, xe, ex, xp, xe, ex, xp, xe, ex, xp
## 2 4341: kavatao lx, ge, ex, xp, xe, ex, xp, xe, ex, xp
## 3 4432: kavee lx, ge, ex, xp, xe, ex, xp, xe, ex, xp
## 4 4825: kavu lx, ge, ex, xp, xe, ex, xp, xp
## 5 4980: keakeato lx, ge, ex, xp, xe, xp, xe
## 6 5620: kepito lx, ge, ex, xp, xe, ex, xp
## 7 5793: keravo lx, ge, ex, xp, xe, ex, xp
## 8 8487: kokoroku lx, ge, ex, xp, xe, ex, xp, xe, ex, xp, xe, ex, xp
## 9 8583: kokoruu lx, ge, ex, xp
# Listing all backslash codes used, sorted by frequency
rtk_codes <-
  rtk_df %>%
  filter(!is.na(code)) %>%
  group_by(code) %>%
  tally() %>%
  arrange(desc(n)) %>%
  mutate(weight = round(n * 100 / sum(n), 2))
# How many lines have been validated by the grammar?
processed_lex <- rtk_codes %>% filter(code %in% skeleton_codes)
paste("Lines processed:", sum(processed_lex$n))
## [1] "Lines processed: 6660"
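The remaining two figures can be computed along these lines (a sketch, which assumes rtk_df has kept the unmatched NA-code lines, so that nrow(rtk_df) equals the file's total line count):
paste("Number of lines in dictionary:", nrow(rtk_df))
paste0("Grammar coverage: ", round(sum(processed_lex$n) * 100 / nrow(rtk_df), 1), "%")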
## [1] "Number of lines in dictionary: 12132"
## [1] "Grammar coverage: 54.9%"