vignettes/wlp-structures.Rmd
code1 | description | codes_expected |
---|---|---|
alt | list of alternative pronunciations, or orthographic alternatives, to the headword | alt; ealt |
ant | list of antonyms | ant; eant |
cf | list of words to compare with the headword | cf; ecf |
cm | a comment or note | cm; ecm |
cmp | comparative linguistic note | cmp; ecmp |
csl | refers to an entry in Kendon’s Sign Language dictionary | csl; ecsl |
def | a formal definition of the headword | def; edef |
dm | semantic domain | dm; edm |
eg | beginning of a block of example sentences | eg |
eeg | end of a block of example sentences | eeg |
et | English translation of a we field | et; ewe |
et | English translation of a wed field | et; ewed |
gl | short, one word or simple phrase, glosses | gl; egl |
glo | old gloss, kept for data provenance | glo; eglo |
lat | Latin Name (updated in 2016) | lat; elat |
lato | old Latin name, kept for data provenance | lato; elato |
me | start of a block for a main entry | me |
eme | end of a block for a main entry | eme |
note | marks a note to compiler to check data | note; enote |
org | information about a word’s origin | org; eorg |
pdx | start of a block for a paradigm example | pdx |
epdx | end of a block for a paradigm example | epdx |
pdxs | start of a block for a paradigm example | pdxs |
epdxs | end of a block for a paradigm example | epdxs |
pvl | list of Preverbs that have been cited with verb in main entry | pvl; epvl |
ref | reference to relevant work, often bibliographic reference | ref; eref |
refa | used to refer to Appendices, Tables, etc. | refa; erefa |
rul | a label showing a grammatical or lexical rule or regular pattern that applies | rul; erul |
rv | reversal for making the English-to-Warlpiri finder list | rv; erv |
se | start of a block for a sense within a main entry | se |
ese | end of a block for a sense within a main entry | ese |
sse | start of a block for a subentry | sse |
esse | end of a block for a subentry | esse |
sub | start of a block for a sense within a subentry | sub |
esub | end of a block for a sense within a subentry | esub |
syn | list of synonyms | syn; esyn |
we | Warlpiri example sentence | we |
wed | Warlpiri definition or encyclopaedic information on headword | wed |
xme | cross reference to a synonymous headword | xme; exme |
xs | used to indicate additional sources for word | xs; exs |
xsse | indicates that same as some subentry in meaning | xsse; exsse |
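
As a rough illustration of how the `codes_expected` column might be used, the base-R sketch below pulls the backslash codes out of a single line and compares them with the sequence expected for the opening code. The helper names `extract_codes()` and `has_expected_codes()` and the three-row `codes_expected` list are hypothetical and not part of the package. Only the code names themselves come from the table above.

```r
# Hypothetical helper: return the backslash codes found in one line of
# lexicon text, in order of appearance and without the leading backslash.
extract_codes <- function(line) {
  hits <- regmatches(line, gregexpr("\\\\[a-z]+", line, perl = TRUE))[[1]]
  sub("^\\\\", "", hits)
}

# A few rows of the table above, stored as code1 -> codes_expected
# (illustrative subset only).
codes_expected <- list(
  gl  = c("gl", "egl"),
  syn = c("syn", "esyn"),
  eg  = "eg"
)

# Hypothetical check: does the line contain exactly the codes expected
# for its opening code, in the expected order?
has_expected_codes <- function(line) {
  found <- extract_codes(line)
  if (length(found) == 0) return(FALSE)
  identical(found, codes_expected[[found[1]]])
}

has_expected_codes("\\gl car, boat \\egl")  # opening and closing code present
has_expected_codes("\\gl car, boat")        # closing \egl is missing
```

The lookup is keyed on the first code in the line, so a line that opens with a code outside the illustrative subset is simply reported as not matching.
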
value | description |
---|---|
-COMP | Complementiser Suffix |
-V | Dependent Verb stem |
AUX:CLITIC | Auxiliary Clitic |
AUX:COMP | Auxiliary Complementiser |
AUX:PRON | Auxiliary Pronominal Clitic |
CASE | Case |
CONJ | Conjunction |
ENCL | Enclitic |
EXCL | Exclamatory marker |
INF | Infinitive Verb |
INF-SFX | Suffix on Infinitive Verb |
N | Nominal |
N- | Dependent Nominal stem |
N-DAT | Dative Case-marked Nominal |
N-DAT-SFX | Suffix on Dative Case-marked Nominal |
N-ERG | Ergative Case-marked Nominal |
N-LOC | Locative Case-marked Nominal |
N-SFX | Nominal Suffix |
Nc | Nominal of Cardinal direction subcategory |
Nc-SFX | Suffix on Nominal of Cardinal direction subcategory |
Nd | Nominal of Determiner subcategory |
Nd-SFX | Suffix on Nominal of Determiner subcategory |
Nk | Nominal of Kin subcategory |
Nk- | Dependent Kin Nominal stem |
Nk-SFX | Suffix on Kin Nominal |
Np | Nominal of Pronominal subcategory |
Np-DAT-SFX | Suffix on Dative Case-marked Nominal of Pronominal subcategory |
Np-SFX | Suffix on Nominal of Pronominal subcategory |
Nq | Nominal of Question/Quantifier subcategory |
Nq-ERG | Ergative Case-marked Nominal of Question/Quantifier subcategory |
Nt | Nominal of Temporal subcategory |
P | Postposition |
PRT | Particle |
PN | Proper noun |
PV | Preverb |
PVa | Adverbial Preverb |
PV-ENCL | Enclitic to Preverb |
SFX | Suffix |
V | Verb |
V-ENCL | Enclitic to Verb |
V-SFX | Suffix on Verb |
# Note that whitespace, '_WS', is significant!
# Example: ' (H,La,Wi,Y) EXT: FIG: NEO: (lit. some text) (BT)'
attributes -> dialectInfo:? semanticInfo:? literalInfo:? registerInfo:?
dialectInfo -> _WS "(" dialects ")" # whitespace, dialects in parentheses
semanticInfo -> _WS semanticType
| _WS semanticType semanticInfo # space-separated sem. types: ' EXT: IDIOM:'
registerInfo -> _WS "(" registers ")" # whitespace, registers in parentheses
literalInfo -> _WS "(lit." [^\)]:+ ")" # whitespace, '(lit.', followed by anything except ")", then a ')'
dialects -> dialect # single dialect, e.g. 'H'
| dialect "," dialects # comma-separated, e.g. 'La,Y'
registers -> register # single register, e.g. 'BT'
| register "," registers # comma-separated, e.g. 'BT,SL'
_WS -> " "
dialect -> "E" # Eastern Warlpiri
| "H" # Hanson River
| "La" # Lajamanu
| "Ny" # Nyirrpi
| "P" # Papunya
| "Wi" # Willowra (Wirliyajarrayi)
| "WW" # Wakirti Warlpiri (Alekarange/ Tennant Creek)
| "Y" # Yuendumu (Yurntumu)
semanticType -> "EXT:" # extended meaning
| "EXT: ASSOC:" # extended meaning, on the basis of association (e.g. 'head' used for 'hat')
| "FIG:" # figurative meaning
| "FUNCT:" # functional meaning (e.g. 'ear' meaning 'ability to hear well')
| "IDIOM:" # idiom
| "NEO:" # neologism
| "SYMB:" # symbolic
register -> "BT" # Baby Talk
| "SL" # Special Register Language
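
The attributes grammar above can be approximated by a single regular expression. The base-R sketch below is one possible rendering, not the parser the package uses: the names `attributes_regex` and `parse_attributes()` are hypothetical, and the pattern is written for PCRE (`perl = TRUE`). It preserves the rule that each component must be preceded by exactly one space.

```r
# Terminals taken from the grammar above
dialect  <- "(?:E|H|La|Ny|P|Wi|WW|Y)"
register <- "(?:BT|SL)"
sem_type <- "(?:EXT: ASSOC:|EXT:|FIG:|FUNCT:|IDIOM:|NEO:|SYMB:)"

# attributes -> dialectInfo:? semanticInfo:? literalInfo:? registerInfo:?
attributes_regex <- paste0(
  "^",
  "(?: \\((", dialect, "(?:,", dialect, ")*)\\))?",    # dialectInfo
  "((?: ", sem_type, ")*)",                            # semanticInfo
  "(?: \\(lit\\.([^)]+)\\))?",                         # literalInfo
  "(?: \\((", register, "(?:,", register, ")*)\\))?",  # registerInfo
  "$"
)

# Hypothetical helper: split an attributes string into its components,
# or return NULL if the string does not conform to the grammar.
parse_attributes <- function(x) {
  m <- regmatches(x, regexec(attributes_regex, x, perl = TRUE))[[1]]
  if (length(m) == 0) return(NULL)
  list(
    dialects  = strsplit(m[2], ",", fixed = TRUE)[[1]],
    semantics = regmatches(m[3], gregexpr(sem_type, m[3], perl = TRUE))[[1]],
    literal   = trimws(m[4]),
    registers = strsplit(m[5], ",", fixed = TRUE)[[1]]
  )
}

parse_attributes(" (H,La,Wi,Y) EXT: FIG: NEO: (lit. some text) (BT)")
parse_attributes("(H) EXT:")  # NULL: the leading space is missing
```
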
mainEntry -> "me" entryBlock "eme" mainEntrySense:* subEntry:*
subEntry -> "sse" entryBlock "esse" subEntrySense:*
mainEntrySense -> "se" entryBlock "ese"
subEntrySense -> "sub" entryBlock "esub"
paradigmExample -> "pdx" entryBlock "epdx"
| "pdxs" entryBlock "epdxs"
entryBlock -> "org":? "dm":* "def":? "lat":? "gl":? "rv":? "cm":*
(exampleBlock:+ | paradigmExample:+):?
crossRefs
exampleBlock -> "eg" "cm":* examplePair:+ "eeg"
examplePair -> ("we" | "wed") "et"
# Cross-reference codes listed in alphabetical order
crossRefs -> "ant":? "cf":? "csl":? "pvl":? "syn":?
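
One consequence of the entry grammar above is that every block-opening code is closed by its `e`-prefixed partner before an enclosing block ends. The base-R sketch below checks only that pairing, using a simple stack. It does not validate the ordering of fields inside `entryBlock`, and the names `block_pairs` and `blocks_balanced()` are hypothetical rather than part of the package.

```r
# Block-opening codes and their closing partners, as used in the grammar above
block_pairs <- c(
  me = "eme", se = "ese", sse = "esse", sub = "esub",
  eg = "eeg", pdx = "epdx", pdxs = "epdxs"
)

# Hypothetical check: are block codes properly paired and nested?
# 'codes' is a character vector of codes in document order; codes that do
# not open or close a block (e.g. 'gl', 'egl') are ignored.
blocks_balanced <- function(codes) {
  stack <- character(0)
  for (code in codes) {
    if (code %in% names(block_pairs)) {
      stack <- c(stack, code)                         # push the opener
    } else if (code %in% block_pairs) {
      if (length(stack) == 0) return(FALSE)           # closer with no opener
      top <- stack[length(stack)]
      if (block_pairs[[top]] != code) return(FALSE)   # wrong closer
      stack <- stack[-length(stack)]                  # pop
    }
  }
  length(stack) == 0                                  # nothing left open
}

blocks_balanced(c("me", "gl", "egl", "eg", "we", "et", "ewe", "eeg", "eme",
                  "se", "def", "edef", "ese"))
blocks_balanced(c("me", "eg", "eme"))  # 'eg' is never closed
```
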
The list of regular expressions defined below can be retrieved using the `use_wlp_regex()` function, e.g. `str_extract(string = "\\me jaala (PV): (H,Wi,Y)", pattern = use_wlp_regex("me_sse_value"))`.
wlp_regexes <- list(
first_code = stringr::regex("^\\s*\\\\([a-z]+)"),
last_code = stringr::regex("\\\\([a-z]+)\\s*$"),
all_codes = stringr::regex("\\\\([a-z]+)"),
# case I: '^cry' -> 'cry'
# case II: '^[cry]cried' -> 'cry'
eng_parent = stringr::regex("
( # match, either case I:
\\^ # a caret character
[^\\s|\\[|\\)]+ # 1 or more characters which are NOT a space, |, [, or )
) | ( # or, case II:
\\^ # a caret character followed by
\\[.+?\\] # 1 or more of any characters between square brackets [...]
)
", comments = TRUE),
# '\me jaala (PV): (H,Wi,Y)' -> 'jaala'
# '\sse jakarn-karri-mi (V):' -> 'jakarn-karri-mi'
me_sse_value = stringr::regex("
(?<=\\\\(me|sse)\\s)
.+? # 1 or more of any character, up to
(?=\\s\\([-|A-Z]) # a space, open parenthesis, then a hyphen, '|', or capital letter
# i.e. part of speech, e.g. ' (N', ' (-V', etc.
", comments = TRUE),
# '\gl' '\egl'
gloss_codes = stringr::regex("
\\\\ # backslash code
e? # optional e prefix
gl # gloss
", comments = TRUE),
# 'car, boat' -> c('car', 'boat')
# 'of arm@, legs' -> c('of arm@, legs')
gloss_delim = stringr::regex("
(?<!@) # negative lookbehind for '@' escape
, # comma
", comments = TRUE),
# 'jaal(pa) (PV): (Y)' -> ' (PV):'
# 'jaaljaal(pa) (N) (PV): (Y)' -> ' (N) (PV):'
pos_chunk = stringr::regex("
\\s # obligatory space!
\\( # open parenthesis
-?[A-Z] # uppercase character, optionally prefixed with '-'
.*? # anything (non-greedy)
\\) # close parenthesis
: # obligatory colon!
", comments = TRUE),
# '(PV)' -> c('PV')
# '(N) (PV)' -> c('N', 'PV')
# '(N,V)' -> c('N,V')
pos_value = stringr::regex("
(?<=\\() # open parenthesis, positive look-behind
.*? # anything (non-greedy)
(?=\\)) # close parenthesis, positive look-ahead
", comments = TRUE),
# '\[kn59]' '\[PPJ 10/87]'
source_codes = stringr::regex("
\\\\ # backslash code
\\[ # open square bracket
.*? # anything, non-greedy
\\] # close square bracket
", comments = TRUE)
)
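
To show how some of these patterns behave on the example strings given in the comments, here is a minimal base-R sketch. It re-expresses three of the patterns for PCRE (`perl = TRUE`) instead of calling stringr, and uses a capture group in place of the look-behind in `me_sse_value`, so it is an approximation, not the package's own code. The function names `headword()`, `pos_values()` and `split_glosses()` are hypothetical.

```r
# Headword of a \me or \sse line: the text between the code and the
# first ' (' that introduces a part of speech.
headword <- function(line) {
  m <- regmatches(
    line,
    regexec("\\\\(?:me|sse)\\s(.+?)(?=\\s\\([-A-Z])", line, perl = TRUE)
  )[[1]]
  if (length(m) == 0) NA_character_ else m[2]
}

# Part-of-speech values: locate the POS chunk (space, parentheses, colon),
# then take whatever sits between each '(' and the next ')'.
pos_values <- function(line) {
  chunk <- regmatches(line, regexpr("\\s\\(-?[A-Z].*?\\):", line, perl = TRUE))
  if (length(chunk) == 0) return(character(0))
  regmatches(chunk, gregexpr("(?<=\\().*?(?=\\))", chunk, perl = TRUE))[[1]]
}

# Split glosses on commas that are not escaped by a preceding '@'.
split_glosses <- function(glosses) {
  trimws(strsplit(glosses, "(?<!@),", perl = TRUE)[[1]])
}

headword("\\me jaala (PV): (H,Wi,Y)")
pos_values("\\me jaaljaal(pa) (N) (PV): (Y)")
split_glosses("car, boat")
split_glosses("of arm@, legs")
```
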
The table below lists characters and patterns that are blacklisted because they break the processing pipeline (e.g. they cause the resulting XML to be invalid).
regex (R) | description |
---|---|
\u0002 | Start of Text (STX) character |
\u0005 | Enquiry (ENQ) character |
(\*|%)#(\*|%) | Placeholder sense/homophone |
\\(?![a-z]|\[) | Stray backslash |
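
A check against this table could look something like the base-R sketch below. The names `blacklist` and `find_blacklisted()` are hypothetical, the patterns are the table's regexes transcribed for PCRE (`perl = TRUE`), and the function handles a single line at a time.

```r
# Patterns from the table above, written as R strings for PCRE
blacklist <- c(
  stx         = "\\x{0002}",          # Start of Text (STX) character
  enq         = "\\x{0005}",          # Enquiry (ENQ) character
  placeholder = "(\\*|%)#(\\*|%)",    # placeholder sense/homophone
  stray_slash = "\\\\(?![a-z]|\\[)"   # backslash not starting a code or source
)

# Hypothetical helper: names of the blacklisted patterns found in one line
find_blacklisted <- function(line) {
  hit <- vapply(blacklist, grepl, logical(1), x = line, perl = TRUE)
  names(blacklist)[hit]
}

find_blacklisted("\\gl car, boat \\egl")      # clean line
find_blacklisted("\\me jaala*#* (PV):")       # placeholder
find_blacklisted("some text \\ more text")    # stray backslash
find_blacklisted("bad\x02char")               # STX control character
```
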