Code definitions

code1 description codes_expected
alt list of alternative pronunciations, or orthographic alternatives, to the headword alt; ealt
ant list of antonyms ant; eant
cf list of words to compare with the headword cf; ecf
cm a comment or note cm; ecm
cmp comparative linguistic note cmp; ecmp
csl refers to an entry in Kendon’s Sign Language dictionary csl; ecsl
def a formal definition of the headword def; edef
dm semantic domain dm; edm
eg beginning of a block of example sentences eg
eeg end of a block of example sentences eeg
et English translation of a we (Warlpiri example sentence) et; ewe
et English translation of a wed (Warlpiri definition) et; ewed
gl short, one word or simple phrase, glosses gl; egl
glo old gloss, kept for data provenance glo; eglo
lat Latin Name lat; elat
lato old Latin name, kept for data provenance lato; elato
me start of a block for a main entry me
eme end of a block for a main entry eme
nlat contains up-to-date (2016) corrected Latin names of plants/animals nlat; enlat
note marks a note to compiler to check data note; enote
org information about a word’s origin org; eorg
pdx start of a block for a paradigm example pdx
epdx end of a block for a paradigm example epdx
pdxs start of a block for a paradigm example pdxs
epdxs end of a block for a paradigm example epdxs
pvl list of Preverbs that have been cited with verb in main entry pvl; epvl
ref reference to relevant work, often bibliographic reference ref; eref
refa used to refer to Appendices, Tables, etc. refa; erefa
rul a label showing a grammatical or lexical rule or regular pattern that applies rul; erul
rv reversal for making the English-to-Warlpiri finder list rv; erv
se start of a block for a sense within a main entry se
ese end of a block for a sense within a main entry ese
sse start of a block for a subentry sse
esse end of a block for a subentry esse
sub start of a block for a sense within a subentry sub
esub end of a block for a sense within a subentry esub
syn list of synonyms syn; esyn
we Warlpiri example sentence we
wed Warlpiri definition or encyclopaedic information on headword wed
xme cross reference to a synonymous headword xme; exme
xs used to indicate additional sources for word xs; exs
xsse indicates that the meaning is the same as that of some subentry xsse; exsse
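
As a rough illustration of how the codes_expected column might be used, the base R sketch below checks that the backslash codes found on a line are exactly those expected for its first code. The lookup list codes_expected and the helpers extract_codes() and has_expected_codes() are hypothetical names that cover only a few rows of the table. That opening and closing codes sit on the same line is an assumption, not something the table states.

```r
# Hypothetical lookup built from a few rows of the table above
codes_expected <- list(
    alt = c("alt", "ealt"),
    gl  = c("gl", "egl"),
    eg  = c("eg"),
    we  = c("we")
)

# Extract every backslash code on a line, in order of appearance
extract_codes <- function(line) {
    hits <- regmatches(line, gregexpr("\\\\([a-z]+)", line))[[1]]
    sub("^\\\\", "", hits)
}

# TRUE when the codes on a line are exactly those expected for its first code
has_expected_codes <- function(line) {
    found <- extract_codes(line)
    if (length(found) == 0) return(FALSE)
    expected <- codes_expected[[found[1]]]
    !is.null(expected) && identical(found, expected)
}

has_expected_codes("\\gl ^wide open \\egl")   # TRUE
has_expected_codes("\\gl ^wide open")         # FALSE: closing \egl is missing
```

Source codes such as \[kn59] are not picked up by extract_codes() because they do not consist of lowercase letters, so they do not interfere with the comparison.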

Parts of speech values

value description
-COMP Complementiser Suffix
-V Dependent Verb stem
AUX:CLITIC Auxiliary Clitic
AUX:COMP Auxiliary Complementiser
AUX:PRON Auxiliary Pronominal Clitic
CASE Case
CONJ Conjunction
ENCL Enclitic
EXCL Exclamatory marker
INF Infinitive Verb
INF-SFX Suffix on Infinitive Verb
N Nominal
N- Dependent Nominal stem
N-DAT Dative Case-marked Nominal
N-DAT-SFX Suffix on Dative Case-marked Nominal
N-ERG Ergative Case-marked Nominal
N-LOC Locative Case-marked Nominal
N-SFX Nominal Suffix
Nc Nominal of Cardinal direction subcategory
Nc-SFX Suffix on Nominal of Cardinal direction subcategory
Nd Nominal of Determiner subcategory
Nd-SFX Suffix on Nominal of Determiner subcategory
Nk Nominal of Kin subcategory
Nk- Dependent Kin Nominal stem
Nk-SFX Suffix on Kin Nominal
Np Nominal of Pronominal subcategory
Np-SFX Suffix on Nominal of Pronominal subcategory
Nq Nominal of Question/Quantifier subcategory
Nq-ERG Ergative Case-marked Nominal of Question/Quantifier subcategory
Nt Nominal of Temporal subcategory
PRT Particle
PN Proper noun
PV Preverb
SFX Suffix
V Verb
V-ENCL Enclitic to Verb
V-SFX Suffix on Verb

Block attributes grammar

# Note that whitespace, '_WS', is significant!

# Example:    ' (H,La,Wi,Y) EXT: FIG: NEO: (lit. some text) (BT)'

attributes    -> dialectInfo:? semanticInfo:? literalInfo:?    registerInfo:?

dialectInfo   -> _WS "(" dialects ")"          # whitespace, dialects in parentheses

semanticInfo  -> _WS semanticType
              |  _WS semanticType semanticInfo # space-separated sem. types: ' EXT: IDIOM:'
             
registerInfo  -> _WS "(" registers ")"         # whitespace, registers in parentheses

literalInfo   -> _WS "(lit." [^\)]:+ ")"       # whitespace, '(lit.', followed by anything except ")", then a ')'

dialects      -> dialect                       # single dialect, e.g.  'H'
              |  dialect "," dialects          # comma-separated, e.g. 'La,Y'

registers     -> register                      # single register, e.g. 'BT'
              |  register "," registers        # comma-separated, e.g. 'BT,SL'

_WS           -> " "

dialect      -> "E"                            # Eastern Warlpiri
              | "H"                            # Hansen River
              | "La"                           # Lajamanu
              | "Ny"                           # Nyirrpi
              | "P"                            # Papunya
              | "Wi"                           # Willowra (Wirliyajarrayi)
              | "WW"                           # Wakirti Warlpiri (Alekarange/Tennant Creek)
              | "Y"                            # Yuendumu (Yurntumu)

semanticType -> "EXT:"                         # extended meaning
              | "EXT: ASSOC:"                  # extended meaning, on basis of association (e.g. 'head' used for 'hat')
              | "FIG:"                         # figurative meaning
              | "FUNCT:"                       # functional meaning (e.g. 'ear' meaning 'ability to hear well')
              | "IDIOM:"                       # idiom
              | "NEO:"                         # neologism
              | "SYMB:"                        # symbolic

register     -> "BT"                           # Baby Talk
              | "SL"                           # Special Register Language

Entry structure grammar

mainEntry       ->   "me" entryBlock "eme"  mainEntrySense:* subEntry:*

subEntry        ->  "sse" entryBlock "esse" subEntrySense:*

mainEntrySense  ->  "se" entryBlock "ese"

subEntrySense   ->  "sub" entryBlock "esub"

paradigmExample ->  "pdx" entryBlock "epdx"
                |  "pdxs" entryBlock "epdxs"

entryBlock       -> "org":? "dm":* "def":? "lat":? "gl":? "rv":? "cm":*
                      (exampleBlock:+ | paradigmExample:+):?
                      crossRefs

exampleBlock    -> "eg" "cm":* examplePair:+ "eeg"

examplePair     -> ("we" | "wed") "et"

                   # Cross-reference codes listed in alphabetical order
crossRefs       -> "ant":? "cf":? "csl":? "pvl":? "syn":?
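
The exampleBlock and examplePair rules can be checked in the same spirit by joining the sequence of codes into one string and matching it against a pattern. The sketch below is a minimal illustration in base R. It assumes the input is a character vector holding the opening code of each line in order, and the names example_block_regex and is_example_block() are hypothetical.

```r
# exampleBlock -> "eg" "cm":* examplePair:+ "eeg"
# examplePair  -> ("we" | "wed") "et"
example_block_regex <- "^eg( cm)*( (we|wed) et)+ eeg$"

# TRUE when the code sequence forms a well-formed example block
is_example_block <- function(codes) {
    grepl(example_block_regex, paste(codes, collapse = " "))
}

is_example_block(c("eg", "we", "et", "wed", "et", "eeg"))   # TRUE
is_example_block(c("eg", "we", "eeg"))                      # FALSE: 'we' lacks its 'et'
```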

Regular expressions

Each regular expression defined below can be retrieved by name with the use_wlp_regex() function, e.g. str_extract(string = "\\me jaala (PV): (H,Wi,Y)", pattern = use_wlp_regex("me_sse_value")).

wlp_regexes <- list(
    first_code = stringr::regex("^\\s*\\\\([a-z]+)"),

    last_code  = stringr::regex("\\\\([a-z]+)\\s*$"),

    all_codes  = stringr::regex("\\\\([a-z]+)"),

    # case I:  '^cry'        -> 'cry'
    # case II: '^[cry]cried' -> 'cry'
    eng_parent  = stringr::regex("
        (                   # match, either case I:
            \\^                 # a caret character
            [^\\s|\\[|\\)]+     # 1 or more characters which are NOT whitespace, |, [, or )
        ) | (               # or, case II:
            \\^                 # a caret character followed by
            \\[.+?\\]           # 1 or more of any characters between square brackets [...]
        )
    ", comments = TRUE),

    # '\me jaala (PV): (H,Wi,Y)'  -> 'jaala'
    # '\sse jakarn-karri-mi (V):' -> 'jakarn-karri-mi'
    me_sse_value = stringr::regex("
        (?<=\\\\(me|sse)\\s)
        .+?                 # 1 or more of any character, up to
        (?=\\s\\([-|A-Z])   # a space, an open parenthesis, then a hyphen, '|', or capital letter
                            # i.e. the start of a part of speech, e.g. ' (N', ' (-V', etc.
    ", comments = TRUE),

    # '\gl' '\egl'
    gloss_codes = stringr::regex("
        \\\\                # backslash code
        e?                  # optional e prefix
        gl                  # gloss
    ", comments = TRUE),

    # 'car, boat'     -> c('car', 'boat')
    # 'of arm@, legs' -> c('of arm@, legs')
    gloss_delim = stringr::regex("
        (?<!@)              # negative lookbehind for '@' escape
        ,                   # comma
    ", comments = TRUE),

    # 'jaal(pa) (PV): (Y)'         -> '(PV)'
    # 'jaaljaal(pa) (N) (PV): (Y)' -> '(N) (PV)'
    pos_chunk = stringr::regex("
        \\s                 # obligatory space!
        \\(                 # open parenthesis
        -?[A-Z]             # uppercase character, optionally prefixed with '-'
        .*?                 # anything (non-greedy)
        \\)                 # close parenthesis
        :                   # obligatory colon!
    ", comments = TRUE),

    # '(PV)'     -> c('PV')
    # '(N) (PV)' -> c('N', 'PV')
    # '(N,V)'    -> c('N,V')
    pos_value = stringr::regex("
        (?<=\\()            # open parenthesis, positive look-behind
        .*?                 # anything (non-greedy)
        (?=\\))             # close parenthesis, positive look-ahead
    ", comments = TRUE),

    # '\[kn59]' '\[PPJ 10/87]'
    source_codes = stringr::regex("
        \\\\                # backslash code
        \\[                 # open square bracket
        .*?                 # anything, non-greedy
        \\]                 # close square bracket
    ", comments = TRUE)
)
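
To show how two of these patterns behave on the examples given in their comments, the sketch below re-expresses the gloss delimiter and the part-of-speech look-arounds in base R (perl = TRUE) so that it runs without stringr. The helpers split_glosses() and extract_pos() are hypothetical names, not functions exported by the package.

```r
# Split a gloss string on commas that are not escaped with '@'
split_glosses <- function(x) {
    trimws(strsplit(x, "(?<!@),", perl = TRUE)[[1]])
}

split_glosses("car, boat")       # "car" "boat"
split_glosses("of arm@, legs")   # "of arm@, legs"

# Pull the part-of-speech values out of a headword line:
# first isolate the ' (N) (PV):' chunk, then take what sits between parentheses
extract_pos <- function(x) {
    chunk <- regmatches(x, regexpr("\\s\\(-?[A-Z].*?\\):", x, perl = TRUE))
    if (length(chunk) == 0) return(character(0))   # no part of speech found
    regmatches(chunk, gregexpr("(?<=\\().*?(?=\\))", chunk, perl = TRUE))[[1]]
}

extract_pos("jaal(pa) (PV): (Y)")           # "PV"
extract_pos("jaaljaal(pa) (N) (PV): (Y)")   # "N" "PV"
```

The obligatory space and colon in the first pattern are what keep the '(pa)' in the headword and the trailing dialect list '(Y)' from being mistaken for parts of speech.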

Blacklisted characters

The table below lists characters and character sequences that are blacklisted because they break the processing pipeline (e.g. by making the resulting XML invalid).

regex (R) description
\u0002 Start of Text (STX) character
\u0005 Enquiry (ENQ) character
(\*|%)#(\*|%) Placeholder sense/homophone
\\(?![a-z]|\[) Stray backslash
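
A minimal sketch of how lines could be screened against this table is given below, in base R. The vector blacklist and the helper find_blacklisted() are hypothetical names. The patterns are the table's regexes written as R string literals, and it is an assumption that they are meant to be applied line by line.

```r
# Patterns from the table above, escaped as R strings
blacklist <- c(
    stx             = "\u0002",
    enq             = "\u0005",
    placeholder     = "(\\*|%)#(\\*|%)",
    stray_backslash = "\\\\(?![a-z]|\\[)"
)

# Return the names of the blacklisted patterns found in a line
find_blacklisted <- function(line) {
    hits <- vapply(blacklist, function(p) grepl(p, line, perl = TRUE), logical(1))
    names(blacklist)[hits]
}

find_blacklisted("\\gl ^car \\egl")        # character(0)
find_blacklisted("\\gl ^car \\ \\egl")     # "stray_backslash"
find_blacklisted("\\me jaala*#* (PV):")    # "placeholder"
```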