vignettes/wlp-structures.Rmd

| code1 | description | codes_expected |
|---|---|---|
| alt | list of alternative pronunciations, or orthographic alternatives, to the headword | alt; ealt |
| ant | list of antonyms | ant; eant |
| cf | list of words to compare with the headword | cf; ecf |
| cm | a comment or note | cm; ecm |
| cmp | comparative linguistic note | cmp; ecmp |
| csl | refers to an entry in Kendon’s Sign Language dictionary | csl; ecsl |
| def | a formal definition of the headword | def; edef |
| dm | semantic domain | dm; edm |
| eg | beginning of a block of example sentences | eg |
| eeg | end of a block of example sentences | eeg |
| et | English translation of a we field | et; ewe |
| et | English translation of a wed field | et; ewed |
| gl | short glosses (one word or a simple phrase) | gl; egl |
| glo | old gloss, kept for data provenance | glo; eglo |
| lat | Latin Name (updated in 2016) | lat; elat |
| lato | old Latin name, kept for data provenance | lato; elato |
| me | start of a block for a main entry | me |
| eme | end of a block for a main entry | eme |
| note | marks a note to compiler to check data | note; enote |
| org | information about a word’s origin | org; eorg |
| pdx | start of a block for a paradigm example | pdx |
| epdx | end of a block for a paradigm example | epdx |
| pdxs | start of a block for a paradigm example | pdxs |
| epdxs | end of a block for a paradigm example | epdxs |
| pvl | list of Preverbs that have been cited with verb in main entry | pvl; epvl |
| ref | reference to relevant work, often bibliographic reference | ref; eref |
| refa | used to refer to Appendices, Tables, etc. | refa; erefa |
| rul | a label showing a grammatical or lexical rule or regular pattern that applies | rul; erul |
| rv | reversal for making the English-to-Warlpiri finder list | rv; erv |
| se | start of a block for a sense within a main entry | se |
| ese | end of a block for a sense within a main entry | ese |
| sse | start of a block for a subentry | sse |
| esse | end of a block for a subentry | esse |
| sub | start of a block for a sense within a subentry | sub |
| esub | end of a block for a sense within a subentry | esub |
| syn | list of synonyms | syn; esyn |
| we | Warlpiri example sentence | we |
| wed | Warlpiri definition or encyclopaedic information on headword | wed |
| xme | cross reference to a synonymous headword | xme; exme |
| xs | used to indicate additional sources for word | xs; exs |
| xsse | indicates that the meaning is the same as that of some subentry | xsse; exsse |
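
One way to read this table is that `codes_expected` lists the backslash codes expected on a line that begins with `code1`. Under that assumption (it is an interpretation, not something the table states), a single line can be checked with a short base R sketch. The lookup below copies only four rows from the table, and `line_has_expected_codes()` is a hypothetical helper, not a function from the package.

```r
# A few rows copied from the table: names are code1, values are codes_expected.
codes_expected <- list(
  gl  = c("gl", "egl"),
  syn = c("syn", "esyn"),
  eg  = "eg",
  we  = "we"
)

# Hypothetical helper: TRUE if the backslash codes found on `line` are exactly
# those expected for its first code, FALSE if they differ, and NA if the line
# has no code or its first code is not in the lookup.
line_has_expected_codes <- function(line) {
  # Every run of lowercase letters directly preceded by a backslash
  found <- regmatches(line, gregexpr("(?<=\\\\)[a-z]+", line, perl = TRUE))[[1]]
  if (length(found) == 0 || !(found[1] %in% names(codes_expected))) {
    return(NA)
  }
  identical(found, codes_expected[[found[1]]])
}

line_has_expected_codes("\\gl car, boat \\egl")  # opening and closing code present
line_has_expected_codes("\\gl car, boat")        # closing code missing
```

Source codes such as `\[kn59]` are ignored by this pattern because the character after the backslash is not a lowercase letter.
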

| value | description |
|---|---|
| -COMP | Complementiser Suffix |
| -V | Dependent Verb stem |
| AUX:CLITIC | Auxiliary Clitic |
| AUX:COMP | Auxiliary Complementiser |
| AUX:PRON | Auxiliary Pronominal Clitic |
| CASE | Case |
| CONJ | Conjunction |
| ENCL | Enclitic |
| EXCL | Exclamatory marker |
| INF | Infinitive Verb |
| INF-SFX | Suffix on Infinitive Verb |
| N | Nominal |
| N- | Dependent Nominal stem |
| N-DAT | Dative Case-marked Nominal |
| N-DAT-SFX | Suffix on Dative Case-marked Nominal |
| N-ERG | Ergative Case-marked Nominal |
| N-LOC | Locative Case-marked Nominal |
| N-SFX | Nominal Suffix |
| Nc | Nominal of Cardinal direction subcategory |
| Nc-SFX | Suffix on Nominal of Cardinal direction subcategory |
| Nd | Nominal of Determiner subcategory |
| Nd-SFX | Suffix on Nominal of Determiner subcategory |
| Nk | Nominal of Kin subcategory |
| Nk- | Dependent Kin Nominal stem |
| Nk-SFX | Suffix on Kin Nominal |
| Np | Nominal of Pronominal subcategory |
| Np-DAT-SFX | Suffix on Dative Case-marked Nominal of Pronominal subcategory |
| Np-SFX | Suffix on Nominal of Pronominal subcategory |
| Nq | Nominal of Question/Quantifier subcategory |
| Nq-ERG | Ergative Case-marked Nominal of Question/Quantifier subcategory |
| Nt | Nominal of Temporal subcategory |
| P | Postposition |
| PRT | Particle |
| PN | Proper noun |
| PV | Preverb |
| PVa | Adverbial Preverb |
| PV-ENCL | Enclitic to Preverb |
| SFX | Suffix |
| V | Verb |
| V-ENCL | Enclitic to Verb |
| V-SFX | Suffix on Verb |
# Note that whitespace, '_WS', is significant!
# Example: ' (H,La,Wi,Y) EXT: FIG: NEO: (lit. some text) (BT)'
attributes -> dialectInfo:? semanticInfo:? literalInfo:? registerInfo:?
dialectInfo -> _WS "(" dialects ")" # whitespace, dialects in parentheses
semanticInfo -> _WS semanticType
| _WS semanticType semanticInfo # space-separated sem. types: ' EXT: IDIOM:'
registerInfo -> _WS "(" registers ")" # whitespace, registers in parentheses
literalInfo -> _WS "(lit." [^\)]:+ ")" # whitespace, '(lit.', followed by anything except ")", then a ')'
dialects -> dialect # single dialect, e.g. 'H'
| dialect "," dialects # comma-separated, e.g. 'La,Y'
registers -> register # single register, e.g. 'BT'
| register "," registers # comma-separated, e.g. 'BT,SL'
_WS -> " "
dialect -> "E" # Eastern Warlpiri
| "H" # Hansen River
| "La" # Lajamanu
| "Ny" # Nyirrpi
| "P" # Papunya
| "Wi" # Willowra (Wirliyajarrayi)
| "WW" # Wakirti Warlpiri (Alekarange/Tennant Creek)
| "Y" # Yuendumu (Yurntumu)
semanticType -> "EXT:" # extended meaning
| "EXT: ASSOC:" # extended meaning, on basis of association (e.g. 'head' used for 'hat')
| "FIG:" # figurative meaning
| "FUNCT:" # functional meaning (e.g. 'ear' meaning 'ability to hear well')
| "IDIOM:" # idiom
| "NEO:" # neologism
| "SYMB:" # symbolic
register -> "BT" # Baby Talk
| "SL" # Special Register Language
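
The attribute grammar above can be approximated with a single regular expression. The following base R sketch builds one from the dialect, semantic type, and register alternatives listed in the grammar and splits an attribute string into its four optional parts. It is an illustration only: `attributes_pattern` and `parse_attributes()` are hypothetical names, and the pattern is a hand translation of the grammar, not output from a grammar compiler.

```r
dialect  <- "(?:E|H|La|Ny|P|Wi|WW|Y)"
sem_type <- "(?:EXT:(?: ASSOC:)?|FIG:|FUNCT:|IDIOM:|NEO:|SYMB:)"
register <- "(?:BT|SL)"

# Each part starts with a single space, mirroring the significant '_WS'.
attributes_pattern <- paste0(
  "^",
  "(?: \\((", dialect, "(?:,", dialect, ")*)\\))?",    # dialectInfo
  "((?: ", sem_type, ")+)?",                           # semanticInfo
  "(?: \\(lit\\.([^)]+)\\))?",                         # literalInfo
  "(?: \\((", register, "(?:,", register, ")*)\\))?",  # registerInfo
  "$"
)

# Hypothetical helper: returns NULL when the string does not fit the pattern.
parse_attributes <- function(x) {
  m <- regmatches(x, regexec(attributes_pattern, x, perl = TRUE))[[1]]
  if (length(m) == 0) {
    return(NULL)
  }
  list(
    dialects  = strsplit(m[2], ",", fixed = TRUE)[[1]],
    semantics = trimws(m[3]),
    literal   = trimws(m[4]),
    registers = strsplit(m[5], ",", fixed = TRUE)[[1]]
  )
}

parse_attributes(" (H,La,Wi,Y) EXT: FIG: NEO: (lit. some text) (BT)")
```

Because every part is optional, an empty string also matches and yields empty components, while a string with an unknown dialect such as `" (X)"` returns `NULL`.
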
mainEntry -> "me" entryBlock "eme" mainEntrySense:* subEntry:*
subEntry -> "sse" entryBlock "esse" subEntrySense:*
mainEntrySense -> "se" entryBlock "ese"
subEntrySense -> "sub" entryBlock "esub"
paradigmExample -> "pdx" entryBlock "epdx"
| "pdxs" entryBlock "epdxs"
entryBlock -> "org":? "dm":* "def":? "lat":? "gl":? "rv":? "cm":*
(exampleBlock:+ | paradigmExample:+):?
crossRefs
exampleBlock -> "eg" "cm":* examplePair:+ "eeg"
examplePair -> ("we" | "wed") "et"
# Cross-reference codes listed in alphabetical order
crossRefs -> "ant":? "cf":? "csl":? "pvl":? "syn":?
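
The `mainEntry`, `subEntry`, and sense rules above imply a fixed ordering of the block delimiters. As a minimal sketch, assuming the backslash codes of one entry have already been extracted into a character vector in document order, that ordering can be checked by keeping only the delimiters and matching them against a regular expression that mirrors those rules. `is_valid_entry_skeleton()` is a hypothetical helper; it deliberately ignores what is inside each block (`entryBlock`, example blocks, and paradigm examples).

```r
# Block delimiters named in the mainEntry, subEntry and sense rules
block_codes <- c("me", "eme", "se", "ese", "sse", "esse", "sub", "esub")

# Hypothetical helper: TRUE if the block delimiters in `codes` appear in the
# order main entry, main-entry senses, then subentries with their own senses.
is_valid_entry_skeleton <- function(codes) {
  skeleton <- paste(codes[codes %in% block_codes], collapse = " ")
  grepl("^me eme( se ese)*( sse esse( sub esub)*)*$", skeleton)
}

# A main entry with one sense and one subentry that has one sense
is_valid_entry_skeleton(c(
  "me", "gl", "egl", "eme",
  "se", "eg", "we", "et", "ewe", "eeg", "ese",
  "sse", "esse",
  "sub", "esub"
))

# A subentry sense with no enclosing subentry does not fit the rules
is_valid_entry_skeleton(c("me", "eme", "sub", "esub"))
```
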
Each regular expression in the list defined below can be retrieved by name with the `use_wlp_regex()` function, e.g. `str_extract(string = "\\me jaala (PV): (H,Wi,Y)", pattern = use_wlp_regex("me_sse_value"))`.
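
For readers without the package installed, the base R sketch below illustrates the same kind of extraction on a headword line. Its patterns are simplified PCRE adaptations written for this example, not the ones returned by `use_wlp_regex()`, and `parse_headword_line()` is a hypothetical helper. The input strings are adapted from the example strings used in this section.

```r
# Hypothetical helper, not part of the package: split a '\me' or '\sse' line
# into its headword and part-of-speech values using base R (PCRE) patterns.
parse_headword_line <- function(line) {
  # Headword: text after the code, up to ' (' plus an optional '-' and a capital
  headword <- sub("^\\\\(?:me|sse)\\s(.+?)\\s\\(-?[A-Z].*$", "\\1", line, perl = TRUE)

  # Part-of-speech chunk: from ' (' to the first '):', e.g. ' (N) (PV):'
  chunk <- regmatches(line, regexpr("\\s\\(-?[A-Z].*?\\):", line, perl = TRUE))

  # Values inside each pair of parentheses in the chunk
  pos <- regmatches(chunk, gregexpr("(?<=\\()[^()]+(?=\\))", chunk, perl = TRUE))

  list(headword = headword, pos = unlist(pos))
}

parse_headword_line("\\me jaala (PV): (H,Wi,Y)")
parse_headword_line("\\me jaaljaal(pa) (N) (PV): (Y)")
```

If a line has no part-of-speech chunk, `sub()` returns the line unchanged and `pos` is `NULL`, so callers would need to check for that case.
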
wlp_regexes <- list(
first_code = stringr::regex("^\\s*\\\\([a-z]+)"),
last_code = stringr::regex("\\\\([a-z]+)\\s*$"),
all_codes = stringr::regex("\\\\([a-z]+)"),
# case I: '^cry' -> 'cry'
# case II: '^[cry]cried' -> 'cry'
eng_parent = stringr::regex("
( # match, either case I:
\\^ # a caret character
[^\\s|\\[|\\)]+ # 1 or more characters that are NOT whitespace, |, [, or )
) | ( # or, case II:
\\^ # a caret character followed by
\\[.+?\\] # 1 or more of any characters between square brackets [...]
)
", comments = TRUE),
# '\me jaala (PV): (H,Wi,Y)' -> 'jaala'
# '\sse jakarn-karri-mi (V):' -> 'jakarn-karri-mi'
me_sse_value = stringr::regex("
(?<=\\\\(me|sse)\\s)
.+? # 1 or more of any character, up to
(?=\\s\\([-|A-Z]) # a space, open parenthesis, then a hyphen, pipe, or capital letter
# i.e. start of the part of speech, e.g. ' (N', ' (-V', etc.
", comments = TRUE),
# '\gl' '\egl'
gloss_codes = stringr::regex("
\\\\ # backslash code
e? # optional e prefix
gl # gloss
", comments = TRUE),
# 'car, boat' -> c('car', 'boat')
# 'of arm@, legs' -> c('of arm@, legs')
gloss_delim = stringr::regex("
(?<!@) # negative lookbehind for '@' escape
, # comma
", comments = TRUE),
# 'jaal(pa) (PV): (Y)' -> '(PV)'
# 'jaaljaal(pa) (N) (PV): (Y)' -> '(N) (PV)'
pos_chunk = stringr::regex("
\\s # obligatory space!
\\( # open parenthesis
-?[A-Z] # uppercase character, optionally prefixed with '-'
.*? # anything (non-greedy)
\\) # close parenthesis
: # obligatory colon!
", comments = TRUE),
# '(PV)' -> c('PV')
# '(N) (PV)' -> c('N', 'PV')
# '(N,V)' -> c('N,V')
pos_value = stringr::regex("
(?<=\\() # open parenthesis, positive look-behind
.*? # anything (non-greedy)
(?=\\)) # close parenthesis, positive look-ahead
", comments = TRUE),
# '\[kn59]' '\[PPJ 10/87]'
source_codes = stringr::regex("
\\\\ # backslash code
\\[ # open square bracket
.*? # anything, non-greedy
\\] # close square bracket
", comments = TRUE)
)

The table below lists characters and patterns that are blacklisted because they break the processing pipeline (e.g. by causing the resulting XML to be invalid).

| regex (R) | description |
|---|---|
| \u0002 | Start of Text (STX) character |
| \u0005 | Enquiry (ENQ) character |
| (\*|%)#(\*|%) | Placeholder sense/homophone |
| \\(?![a-z]|\[) | Stray backslash |
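
As a rough illustration of how such a blacklist could be applied, the base R sketch below tests a line against each of the four patterns and reports which ones match. The vector `blacklist`, its names, and `find_blacklisted()` are hypothetical and written for this example; they are not taken from the package.

```r
# The four patterns from the table, written as R strings for use with perl = TRUE
blacklist <- c(
  stx             = "\u0002",             # Start of Text (STX) character
  enq             = "\u0005",             # Enquiry (ENQ) character
  placeholder     = "(\\*|%)#(\\*|%)",    # placeholder sense/homophone
  stray_backslash = "\\\\(?![a-z]|\\[)"   # backslash not followed by a-z or '['
)

# Hypothetical helper: names of the blacklisted patterns found in `line`
find_blacklisted <- function(line) {
  hits <- vapply(blacklist, function(p) grepl(p, line, perl = TRUE), logical(1))
  names(blacklist)[hits]
}

find_blacklisted("\\gl car, boat \\egl")    # clean line
find_blacklisted("\\gl *#* \\egl")          # placeholder
find_blacklisted("\\gl 50\\% off \\egl")    # backslash before '%'
```
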