Code definitions

code1	description	codes_expected
alt	list of alternative pronunciations, or orthographic alternatives, to the headword	alt; ealt
ant	list of antonyms	ant; eant
cf	list of words to compare with the headword	cf; ecf
cm	a comment or note	cm; ecm
cmp	comparative linguistic note	cmp; ecmp
csl	refers to an entry in Kendon’s Sign Language dictionary	csl; ecsl
def	a formal definition of the headword	def; edef
dm	semantic domain	dm; edm
eg	beginning of a block of example sentences	eg
eeg	end of a block of example sentences	eeg
et	English translation of a we field	et; ewe
et	English translation of a wed field	et; ewed
gl	short glosses (a single word or simple phrase)	gl; egl
glo	old gloss, kept for data provenance	glo; eglo
lat	Latin name (updated in 2016)	lat; elat
lato	old Latin name, kept for data provenance	lato; elato
me	start of a block for a main entry	me
eme	end of a block for a main entry	eme
note	marks a note for the compiler to check the data	note; enote
org	information about a word’s origin	org; eorg
pdx	start of a block for a paradigm example	pdx
epdx	end of a block for a paradigm example	epdx
pdxs	start of a block for a paradigm example	pdxs
epdxs	end of a block for a paradigm example	epdxs
pvl	list of preverbs that have been cited with the verb in the main entry	pvl; epvl
ref	reference to relevant work, often bibliographic reference	ref; eref
refa	used to refer to Appendices, Tables, etc.	refa; erefa
rul	a label showing a grammatical or lexical rule or regular pattern that applies	rul; erul
rv	reversal for making the English-to-Warlpiri finder list	rv; erv
se	start of a block for a sense within a main entry	se
ese	end of a block for a sense within a main entry	ese
sse	start of a block for a subentry	sse
esse	end of a block for a subentry	esse
sub	start of a block for a sense within a subentry	sub
esub	end of a block for a sense within a subentry	esub
syn	list of synonyms	syn; esyn
we	Warlpiri example sentence	we
wed	Warlpiri definition or encyclopaedic information on headword	wed
xme	cross reference to a synonymous headword	xme; exme
xs	indicates additional sources for the word	xs; exs
xsse	indicates that the meaning is the same as that of some subentry	xsse; exsse
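The codes_expected column can be read as the sequence of backslash codes expected on a line that starts with a given code. The following is a minimal sketch of such a check in base R. The lookup list holds only a few rows of the table, and the helper names extract_codes() and line_is_well_formed() are hypothetical, not functions of the package.

```r
# Hypothetical lookup built from a few rows of the table above:
# first code on a line -> codes expected on that line, in order.
codes_expected <- list(
  gl  = c("gl", "egl"),
  syn = c("syn", "esyn"),
  me  = "me",
  eme = "eme"
)

# Extract every backslash code on a line, without the leading backslash.
extract_codes <- function(line) {
  hits <- regmatches(line, gregexpr("\\\\[a-z]+", line))[[1]]
  sub("^\\\\", "", hits)
}

# TRUE when the codes found on the line equal the expected sequence.
line_is_well_formed <- function(line) {
  found <- extract_codes(line)
  if (length(found) == 0) return(FALSE)
  expected <- codes_expected[[found[1]]]
  !is.null(expected) && identical(found, expected)
}

line_is_well_formed("\\gl car, boat \\egl")  # closing \egl present
line_is_well_formed("\\gl car, boat")        # closing \egl missing
```

Source codes such as \[kn59] are skipped by this sketch because the pattern only accepts lowercase letters after the backslash.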

Parts of speech values

value	description
-COMP	Complementiser Suffix
-V	Dependent Verb stem
AUX:CLITIC	Auxiliary Clitic
AUX:COMP	Auxiliary Complementiser
AUX:PRON	Auxiliary Pronominal Clitic
CASE	Case
CONJ	Conjunction
ENCL	Enclitic
EXCL	Exclamatory marker
INF	Infinitive Verb
INF-SFX	Suffix on Infinitive Verb
N	Nominal
N-	Dependent Nominal stem
N-DAT	Dative Case-marked Nominal
N-DAT-SFX	Suffix on Dative Case-marked Nominal
N-ERG	Ergative Case-marked Nominal
N-LOC	Locative Case-marked Nominal
N-SFX	Nominal Suffix
Nc	Nominal of Cardinal direction subcategory
Nc-SFX	Suffix on Nominal of Cardinal direction subcategory
Nd	Nominal of Determiner subcategory
Nd-SFX	Suffix on Nominal of Determiner subcategory
Nk	Nominal of Kin subcategory
Nk-	Dependent Kin Nominal stem
Nk-SFX	Suffix on Kin Nominal
Np	Nominal of Pronominal subcategory
Np-DAT-SFX	Suffix on Dative Case-marked Nominal of Pronominal subcategory
Np-SFX	Suffix on Nominal of Pronominal subcategory
Nq	Nominal of Question/Quantifier subcategory
Nq-ERG	Ergative Case-marked Nominal of Question/Quantifier subcategory
Nt	Nominal of Temporal subcategory
P	Postposition
PRT	Particle
PN	Proper noun
PV	Preverb
PVa	Adverbial Preverb
PV-ENCL	Enclitic to Preverb
SFX	Suffix
V	Verb
V-ENCL	Enclitic to Verb
V-SFX	Suffix on Verb

Block attributes grammar

# Note that whitespace, '_WS', is significant!

# Example:    ' (H,La,Wi,Y) EXT: FIG: NEO: (lit. some text) (BT)'

attributes    -> dialectInfo:? semanticInfo:? literalInfo:? registerInfo:?

dialectInfo   -> _WS "(" dialects ")"          # whitespace, dialects in parentheses

semanticInfo  -> _WS semanticType
              |  _WS semanticType semanticInfo # space-separated sem. types: ' EXT: IDIOM:'
             
registerInfo  -> _WS "(" registers ")"         # whitespace, registers in parentheses

literalInfo   -> _WS "(lit." [^\)]:+ ")"       # whitespace, '(lit.', followed by anything except ")", then a ')'

dialects      -> dialect                       # single dialect, e.g.  'H'
              |  dialect "," dialects          # comma-separated, e.g. 'La,Y'

registers     -> register                      # single register, e.g. 'BT'
              |  register "," registers        # comma-separated, e.g. 'BT,SL'

_WS           -> " "

dialect      -> "E"                            # Eastern Warlpiri
              | "H"                            # Hansen River
              | "La"                           # Lajamanu
              | "Ny"                           # Nyirrpi
              | "P"                            # Papunya
              | "Wi"                           # Willowra (Wirliyajarrayi)
              | "WW"                           # Wakirti Warlpiri (Alekarange/ Tennant Creek)
              | "Y"                            # Yuendumu (Yurntumu)

semanticType -> "EXT:"                         # extended meaning
              | "EXT: ASSOC:"                  # extended meaning, on basis of association (eg. 'head' used for 'hat')
              | "FIG:"                         # figurative meaning
              | "FUNCT:"                       # functional meaning (eg. 'ear' meaning 'ability to hear well')
              | "IDIOM:"                       # idiom
              | "NEO:"                         # neologism
              | "SYMB:"                        # symbolic

register     -> "BT"                           # Baby Talk
              | "SL"                           # Special Register Language
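Because the attributes grammar has no recursion beyond simple lists, it can be approximated by a single regular expression. The sketch below is one way to do that in base R with PCRE (perl = TRUE). The names attributes_pattern and parse_attributes() are hypothetical, and the pattern is an approximation of the grammar rather than the package's parser.

```r
# Alternatives copied from the grammar above; the longer "EXT: ASSOC:" is
# listed before "EXT:" so that it is tried first.
dialect  <- "(?:E|H|La|Ny|P|Wi|WW|Y)"
register <- "(?:BT|SL)"
sem_type <- "(?:EXT: ASSOC:|EXT:|FIG:|FUNCT:|IDIOM:|NEO:|SYMB:)"

attributes_pattern <- paste0(
  "^",
  "(?: \\((", dialect, "(?:,", dialect, ")*)\\))?",    # group 1: dialects
  "((?: ", sem_type, ")+)?",                           # group 2: semantic types
  "(?: \\(lit\\.([^)]+)\\))?",                         # group 3: literal meaning
  "(?: \\((", register, "(?:,", register, ")*)\\))?",  # group 4: registers
  "$"
)

# Hypothetical helper: returns NULL when the string does not fit the grammar.
parse_attributes <- function(x) {
  m <- regmatches(x, regexec(attributes_pattern, x, perl = TRUE))[[1]]
  if (length(m) == 0) return(NULL)
  list(
    dialects  = strsplit(m[2], ",", fixed = TRUE)[[1]],
    semantic  = regmatches(m[3], gregexpr(sem_type, m[3], perl = TRUE))[[1]],
    literal   = trimws(m[4]),
    registers = strsplit(m[5], ",", fixed = TRUE)[[1]]
  )
}

parsed <- parse_attributes(" (H,La,Wi,Y) EXT: FIG: NEO: (lit. some text) (BT)")
parsed$dialects
parsed$semantic
```

As in the grammar, the leading space before each component is required, so a string without it is rejected.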

Entry structure grammar

mainEntry       ->   "me" entryBlock "eme"  mainEntrySense:* subEntry:*

subEntry        ->  "sse" entryBlock "esse" subEntrySense:*

mainEntrySense  ->  "se" entryBlock "ese"

subEntrySense   ->  "sub" entryBlock "esub"

paradigmExample ->  "pdx" entryBlock "epdx"
                |  "pdxs" entryBlock "epdxs"

entryBlock       -> "org":? "dm":* "def":? "lat":? "gl":? "rv":? "cm":*
                      (exampleBlock:+ | paradigmExample:+):?
                      crossRefs

exampleBlock    -> "eg" "cm":* examplePair:+ "eeg"

examplePair     -> ("we" | "wed") "et"

                   # Cross-reference codes listed in alphabetical order
crossRefs       -> "ant":? "cf":? "csl":? "pvl":? "syn":?
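One consequence of the entry structure grammar is that every block-opening code must be closed by its "e"-prefixed counterpart, with inner blocks closed before outer ones. The sketch below checks only this pairing, using a stack over a vector of codes. It does not check field order or that senses follow their main entry, and the function name blocks_balanced() is hypothetical.

```r
# Block-opening codes from the grammar above; each is closed by "e" + code.
block_open  <- c("me", "sse", "se", "sub", "pdx", "pdxs", "eg")
block_close <- paste0("e", block_open)

# TRUE when every block is closed by its matching code in the right order.
# Codes that neither open nor close a block (gl, egl, we, et, ...) are ignored.
blocks_balanced <- function(codes) {
  stack <- character(0)
  for (code in codes) {
    if (code %in% block_open) {
      stack <- c(stack, code)                                     # push
    } else if (code %in% block_close) {
      n <- length(stack)
      if (n == 0 || paste0("e", stack[n]) != code) return(FALSE)  # mismatch
      stack <- stack[-n]                                          # pop
    }
  }
  length(stack) == 0
}

blocks_balanced(c("me", "gl", "egl", "eg", "we", "et", "ewe", "eeg", "eme",
                  "se", "def", "edef", "ese"))
blocks_balanced(c("me", "eg", "eme"))  # example block never closed
```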

Regular expressions

Each regular expression defined below can be retrieved by name with the use_wlp_regex() function, e.g. str_extract(string = "\\me jaala (PV): (H,Wi,Y)", pattern = use_wlp_regex("me_sse_value")) extracts the headword "jaala".

wlp_regexes <- list(
    first_code = stringr::regex("^\\s*\\\\([a-z]+)"),

    last_code  = stringr::regex("\\\\([a-z]+)\\s*$"),

    all_codes  = stringr::regex("\\\\([a-z]+)"),

    # case I:  '^cry'        -> 'cry'
    # case II: '^[cry]cried' -> 'cry'
    eng_parent  = stringr::regex("
        (                   # match, either case I:
            \\^                 # a caret character
            [^\\s|\\[|\\)]+     # 1 or more, which are NOT a space, [, or ) character
        ) | (               # or, case II:
            \\^                 # a caret character followed by
            \\[.+?\\]           # 1 or more of any characters between square brackets [...]
        )
    ", comments = TRUE),

    # '\me jaala (PV): (H,Wi,Y)'  -> 'jaala'
    # '\sse jakarn-karri-mi (V):' -> 'jakarn-karri-mi'
    me_sse_value = stringr::regex("
        (?<=\\\\(me|sse)\\s)
        .+?                 # 1 or more of any character, up to
        (?=\\s\\([-|A-Z])   # a space, open parenthesis, then a hyphen or a capital letter
                            # i.e. part of speech, e.g. ' (N', ' (-V', etc.
    ", comments = TRUE),

    # '\gl' '\egl'
    gloss_codes = stringr::regex("
        \\\\                # backslash code
        e?                  # optional e prefix
        gl                  # gloss
    ", comments = TRUE),

    # 'car, boat'     -> c('car', 'boat')
    # 'of arm@, legs' -> c('of arm@, legs')
    gloss_delim = stringr::regex("
        (?<!@)              # negative lookbehind for '@' escape
        ,                   # comma
    ", comments = TRUE),

    # 'jaal(pa) (PV): (Y)'         -> '(PV)'
    # 'jaaljaal(pa) (N) (PV): (Y)' -> '(N) (PV)'
    pos_chunk = stringr::regex("
        \\s                 # obligatory space!
        \\(                 # open parenthesis
        -?[A-Z]             # uppercase character, optionally prefixed with '-'
        .*?                 # anything (non-greedy)
        \\)                 # close parenthesis
        :                   # obligatory colon!
      ", comments = TRUE),

    # '(PV)'     -> c('PV')
    # '(N) (PV)' -> c('N', 'PV')
    # '(N,V)'    -> c('N,V')
    pos_value = stringr::regex("
        (?<=\\()            # open parenthesis, positive look-behind
        .*?                 # anything (non-greedy)
        (?=\\))             # close parenthesis, positive look-ahead
        ", comments = TRUE),

    # '\[kn59]' '\[PPJ 10/87]'
    source_codes = stringr::regex("
        \\\\                # backslash code
        \\[                 # open square bracket
        .*?                 # anything, non-greedy
        \\]                 # close square bracket
    ", comments = TRUE)
)
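The input and output pairs documented for the gloss delimiter and the part-of-speech value can be reproduced without the package. The sketch below uses single-line base R (PCRE) patterns written to be consistent with those documented examples. The patterns and the helper names split_glosses() and extract_pos() are assumptions for illustration, not the package's own objects.

```r
# Single-line base R (PCRE) approximations, written for illustration only.
gloss_delim <- "(?<!@),"             # a comma not preceded by the '@' escape
pos_value   <- "(?<=\\().*?(?=\\))"  # text between '(' and the next ')'

# Split a gloss field on unescaped commas and trim surrounding whitespace.
split_glosses <- function(x) {
  trimws(strsplit(x, gloss_delim, perl = TRUE)[[1]])
}

# Pull out each parenthesised part-of-speech value.
extract_pos <- function(x) {
  regmatches(x, gregexpr(pos_value, x, perl = TRUE))[[1]]
}

split_glosses("car, boat")      # two glosses
split_glosses("of arm@, legs")  # one gloss, comma escaped by '@'
extract_pos("(N) (PV)")         # two part-of-speech values
```

The whitespace trimming in split_glosses() is a choice made in this sketch; the documented delimiter only identifies where to split.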

Blacklisted characters

The table below lists characters and patterns that are blacklisted because they break the processing pipeline (e.g. they cause the resulting XML to be invalid).

regex (R)	description
\u0002	Start of Text (STX) character
\u0005	Enquiry (ENQ) character
(\\|%)#(\\|%)	Placeholder sense/homophone
\\(?![a-z]|\[)	Stray backslash (not followed by a lowercase letter or an opening square bracket)
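A scan for the blacklisted patterns can be sketched in base R as follows. The pattern names and the function find_blacklisted() are hypothetical. The stray-backslash pattern is written here with a plain | between its two look-ahead alternatives, which is an assumption about the intended pattern (a backslash followed by neither a lowercase letter nor an opening square bracket).

```r
# Blacklist patterns as R strings for use with perl = TRUE.
blacklist <- c(
  stx         = "\\x{0002}",           # Start of Text character
  enq         = "\\x{0005}",           # Enquiry character
  placeholder = "(\\\\|%)#(\\\\|%)",   # placeholder sense/homophone
  stray_slash = "\\\\(?![a-z]|\\[)"    # backslash not starting a code or source
)

# Return, for each pattern with at least one hit, the indices of matching lines.
find_blacklisted <- function(lines) {
  hits <- lapply(blacklist, function(p) which(grepl(p, lines, perl = TRUE)))
  hits[lengths(hits) > 0]
}

# Invented test lines, not taken from the dictionary data.
lines <- c(
  "\\gl car \\egl",
  paste0("\\cm note", intToUtf8(2), " \\ecm"),
  "\\gl %#% \\egl",
  "\\gl a \\ b \\egl",
  "\\we text \\[kn59] \\et text \\ewe"
)
find_blacklisted(lines)
```

Reporting line indices rather than a single TRUE or FALSE makes it easier to locate and repair the offending lines before the data is converted.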

Warlpiri dictionary structures

Nay San