Overview

The purpose of tidylex is to provide a collaborative, open-source, cross-platform tool for tidying dictionary data stored as Toolbox-style backslash-coded data (a broad convention for serializing lexicographic data in a human-readable and -editable manner). This format is commonly used in the description of under-documented languages, many of which are also highly endangered.

The example below shows a toy French-to-English dictionary with three entries (rouge, bonjour, and parler), each carrying various lexicographic information (lx: lexeme, ps: part of speech, de: definition, xv: example in the vernacular source language, xe: English translation of the example). Tidylex makes it easy to make assertions about how these entries should be structured, test whether or not they are well-structured (examples provided below), and, most importantly, communicate the results of these tests to the relevant parties.

\lx rouge
\ps adjective
\de red
\xv La chaise est rouge
\xe The chair is red

\lx bonjour
\de hello
\ps exclamation

\lx parler
\ps verb
\de speak
\xv Parlez-vous français?
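
To make the format concrete, here is a minimal Python sketch (independent of tidylex and its R API) that reads the toy lexicon above into one record per coded line. The names `read_lexicon` and `LINE_PATTERN` are hypothetical, and grouping lines under the most recent `lx` line is an assumption about how entries are delimited.

```python
import re

# Toy lexicon from the example above (raw string so backslashes stay literal).
TOY_LEXICON = r"""\lx rouge
\ps adjective
\de red
\xv La chaise est rouge
\xe The chair is red

\lx bonjour
\de hello
\ps exclamation

\lx parler
\ps verb
\de speak
\xv Parlez-vous français?
"""

# A coded line: a backslash, the code, then optional whitespace and a value.
LINE_PATTERN = re.compile(r"^\\(\w+)\s*(.*)$")


def read_lexicon(text):
    """Return one record per coded line: line number, code, value, entry group."""
    records = []
    group = None
    for line_no, line in enumerate(text.splitlines(), start=1):
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # skip blank or uncoded lines
        code, value = match.groups()
        if code == "lx":
            # Assumption: every lx line opens a new entry group.
            group = f"{line_no}: {value}"
        records.append(
            {"line": line_no, "code": code, "value": value, "group": group}
        )
    return records


records = read_lexicon(TOY_LEXICON)
```

Because blank lines count towards the line numbers, the three entries in this sketch fall into the groups `1: rouge`, `7: bonjour`, and `11: parler`.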

Why is tidylex needed?

Because the dictionary data have been hand-edited over many years, often by multiple contributors, these plain-text files tend to contain a lot of structural inconsistency. Given this structural variation, the knowledge about these languages is effectively ‘locked up’ in terms of machine-processability. Tidylex provides a set of functions to iteratively work towards a well-structured, or ‘tidy’, lexicon, and to maintain the tidiness of the lexicon when used within a Continuous Testing setting (e.g. with Travis CI, or GitLab pipelines).

Installation

You can install tidylex from GitHub with:

Examples

Formally define a well-formed entry and test entries against the definition

Tidylex lets you define and use basic Nearley grammars within R to test the well-formedness of sequences of backslash codes.

For sequences such as those above, we can define a context-free grammar (equivalent to phrase structure rules) within the Nearley notation below (:?, :+ are quantifiers indicating, respectively, ‘zero or one’ and ‘one or more’ of the preceding entity). We use the compile_grammar function to generate code that can be used to test whether a series of values (e.g. those within the code column) conforms to the sequence expected by some grammar.

We can see from the data frame above that the sequence of codes for entry group 1: rouge (lx ps de xv xe) conforms to the grammar, while that of group 7: bonjour (lx de ps) does not: code_ok has the value FALSE for the de line (line 8), the first code that departs from the expected sequence.
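
The idea behind this check can be sketched in Python without Nearley or compile_grammar. The sketch swaps in a hand-written transition table (a small finite-state check) for the grammar, and the rule it encodes is an assumption: an entry is `lx ps de`, optionally followed by one or more `xv xe` pairs. The names `NEXT`, `ACCEPTING`, and `check_entry` are hypothetical.

```python
# Hypothetical transition table: for each code, the codes allowed to follow it.
# It encodes the ASSUMED rule: lx ps de, then zero or more (xv xe) pairs.
NEXT = {
    None: {"lx"},  # an entry must start with lx
    "lx": {"ps"},
    "ps": {"de"},
    "de": {"xv"},
    "xv": {"xe"},
    "xe": {"xv"},
}
ACCEPTING = {"de", "xe"}  # codes on which an entry may end

# (line number, code) pairs for the three toy entries, keyed by entry group.
ENTRIES = {
    "1: rouge": [(1, "lx"), (2, "ps"), (3, "de"), (4, "xv"), (5, "xe")],
    "7: bonjour": [(7, "lx"), (8, "de"), (9, "ps")],
    "11: parler": [(11, "lx"), (12, "ps"), (13, "de"), (14, "xv")],
}


def check_entry(lines):
    """Return (status, line_number) for a list of (line_number, code) pairs."""
    state = None
    for line_no, code in lines:
        if code not in NEXT[state]:
            # First code that departs from the expected sequence.
            return ("unexpected", line_no)
        state = code
    if state not in ACCEPTING:
        # Every code was allowed, but the entry stops too early.
        return ("incomplete", lines[-1][0] if lines else None)
    return ("ok", None)


results = {group: check_entry(lines) for group, lines in ENTRIES.items()}
```

Under these assumptions, rouge passes and bonjour is flagged at its de line (line 8), matching the result described above. The sketch also reports parler as incomplete, because its xv line has no matching xe; that outcome follows from the assumed rule rather than from anything stated here about the actual grammar.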


Footnotes

  1. At the moment tidylex can’t work with all Nearley grammars, because the R V8 package uses an older version of the V8 engine for cross-compatibility reasons. As a result, Nearley grammars that make use of ES6 features won’t compile in V8 3.14.