Words in the OGR are defined as minimal syntactically independent units, and tokenization is standardized across the corpus texts. Note in particular that all clitics, including consonantal enclitics such as l in del or al, are treated as separate tokens. The normalized text view shows all tokens separated by spaces; modernized punctuation (apostrophes, middle dots, etc.) is not used.
Most annotation in the OGR is attached to the Word and is available in all versions of the corpus.
Four broad categories of annotation can be distinguished:
Type | Tag | Description | Link |
---|---|---|---|
graphical | word | normalized form | below |
graphical | dipl | diplomatic transcription | below |
graphical | wd_div | manuscript word division following the token | below |
morphosyntactic | pos | base part-of-speech tag | below |
morphosyntactic | pos_syn | context-dependent part-of-speech tag | below |
morphosyntactic | morph | inflection | below |
lemmatization | lemma | lemma from any available source | below |
lemmatization | lemma_src | source of lemma | below |
lemmatization | lemma_dmf | DMF/ATILF lemma, if available | below |
phonological (TXM) | phon | string of seg_phonemes in word | Segment annotation |
phonological (TXM) | syllabified | string of seg_phonemes with syllable structure markup | Metrical annotation |
metrical (TXM) | prosody | stress pattern | Metrical annotation |
metrical (TXM) | line_met | metre of the line (line_met) | Metrical annotation |
metrical (TXM) | metpos | metrical position of syllables, counting forwards | Metrical annotation |
metrical (TXM) | soptem | metrical position of syllables, counting backwards | Metrical annotation |
other (TXM) | ref | citable reference (line_ref) | Metrical annotation |
word gives a normalized graphical form of the word, similar to that found in a print edition but without apostrophes or diacritics (except, at present, in Alexis).
dipl and wd_div provide the diplomatic transcription of the text. dipl indicates the diplomatic form of the token and wd_div the type of word division, including punctuation, which follows it.
Resolved abbreviations are given in [square brackets]. Superfluous letters are given between (parentheses). Rare cases of editorial correction are given in the following format: word[=corrected].
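
The three conventions above can be unpacked mechanically. The following is a minimal Python sketch and not part of the OGR tooling: the helper name expand_dipl and its regular expressions are assumptions derived only from the conventions just listed, and the strings used to exercise it are invented rather than taken from the corpus.

```python
import re

# Patterns assumed from the three conventions described above.
CORRECTION = re.compile(r"^(?P<ms>.*)\[=(?P<fix>[^\]]*)\]$")  # word[=corrected]
ABBREVIATION = re.compile(r"\[([^\]=][^\]]*)\]")               # [resolved abbreviation]
SUPERFLUOUS = re.compile(r"\(([^)]*)\)")                       # (superfluous letters)


def expand_dipl(dipl: str) -> dict:
    """Split a dipl value into different readings (hypothetical helper)."""
    correction = None
    match = CORRECTION.match(dipl)
    if match:
        # Separate the transcribed form from the editorial correction.
        dipl, correction = match.group("ms"), match.group("fix")
    # Letters as written: resolved abbreviations removed, superfluous letters kept.
    # The abbreviation marks themselves cannot be recovered from the string.
    manuscript = SUPERFLUOUS.sub(r"\1", ABBREVIATION.sub("", dipl))
    # Edited reading: abbreviations resolved, superfluous letters dropped.
    edited = ABBREVIATION.sub(r"\1", SUPERFLUOUS.sub("", dipl))
    return {"manuscript": manuscript, "edited": edited, "correction": correction}
```

Calling expand_dipl on a value with a bracketed abbreviation returns both the bare letters and the resolved form, so either can be compared against word as needed.
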
The following special characters are used to denote manuscript word division:
- +: agglutination to following word
- _: space
- |: line break

The part-of-speech tags are based on the CATTEX-09 tagging system used by the Base de français médiéval.
pos and pos_syn differ only in non-lexicalized cases of conversion. For example, a nominalized infinitive will have [pos="VERinf"] (verb) but [pos_syn="NOMcom"] (noun).
Please see bfm.ens-lyon.fr for full documentation.
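
Because pos and pos_syn differ only in cases of conversion, comparing them token by token is one way to list those cases. The sketch below assumes that tokens have been exported as Python dictionaries keyed by tag name; that export format, the function name conversions and the sample tokens are all hypothetical, and only the tag values VERinf and NOMcom come from the text above.

```python
def conversions(tokens):
    """Yield tokens whose context-dependent tag differs from the base tag."""
    for token in tokens:
        if token["pos"] != token["pos_syn"]:
            yield token


# Invented sample: the same infinitive used as a verb and as a noun.
sample = [
    {"word": "chanter", "pos": "VERinf", "pos_syn": "VERinf"},
    {"word": "chanter", "pos": "VERinf", "pos_syn": "NOMcom"},
]

converted = list(conversions(sample))
for token in converted:
    print(token["word"], token["pos"], "->", token["pos_syn"])
```
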
The OGR tagset introduces a distinction between clitics, i.e. atonic object pronouns before the finite verb and all object pronouns after a clause-initial finite verb, and NP pronouns, e.g. nominative (subject) pronouns and disjunctive oblique pronouns.
Cattex | OGR |
---|---|
PROadv | PRCadv i and en, always clitic |
PROper | PRCper clitic personal pronoun, PRNper NP personal pronoun |
PROdem | PRCdem southern GR clitic o, PRNdem NP demonstratives |
others (PROrel, PROind, PROint, …) | PRNrel, PRNind, PRNint … |
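
Since every OGR pronoun tag in the table differs from its Cattex counterpart only in the third letter, the correspondence can be sketched as a simple rewrite. This is an illustration under that assumption, not an official conversion tool: the function names ogr_to_cattex and is_clitic are hypothetical, and compound or otherwise unlisted tags are not handled.

```python
import re

# PRC* (clitic) and PRN* (NP pronoun) both correspond to Cattex PRO*.
OGR_PRONOUN = re.compile(r"^PR[CN](?=[a-z])")


def ogr_to_cattex(tag: str) -> str:
    """Map an OGR pronoun tag back to its Cattex tag; other tags pass through."""
    return OGR_PRONOUN.sub("PRO", tag)


def is_clitic(tag: str) -> bool:
    """True for the clitic pronoun tags introduced by the OGR tagset."""
    return tag.startswith("PRC")
```

For instance, ogr_to_cattex("PRCper") and ogr_to_cattex("PRNper") both give PROper, which makes it possible to compare OGR counts with counts from a corpus tagged with plain Cattex.
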
morph specifies the inflectional morphology of a form and is based on the Cattex-max tagset proposed, but not implemented, by the BFM team. The following grammatical categories are marked:
- *pos, *per: person-number-gender-case
- VERppe: number-gender-case

The grammatical categories are annotated in the following way:
lemma contains a lemma form from the dictionary listed in lemma_src. Where multiple lemmas are distinguished in the dictionary by a trailing number, e.g. ne1, ne2, only numbers greater than 1 are included, i.e. DMF ne1 becomes simply ne.
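
The numbering rule matters when looking up a dictionary headword in the corpus. As a minimal sketch, assuming headwords carry their homograph index as trailing digits exactly as in the ne1/ne2 example, the conversion could look like this (the function name ogr_lemma is hypothetical):

```python
import re

# A lemma form ending in a non-digit, followed by a numeric homograph index.
TRAILING_INDEX = re.compile(r"^(?P<form>.*?\D)(?P<index>\d+)$")


def ogr_lemma(dictionary_lemma: str) -> str:
    """Drop a trailing index of 1; keep any index greater than 1."""
    match = TRAILING_INDEX.match(dictionary_lemma)
    if match and int(match.group("index")) == 1:
        return match.group("form")
    return dictionary_lemma
```

With this helper, the dictionary entry ne1 is searched for as ne, while ne2 is searched for unchanged.
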
lemma is provided for every word, but forms are not always drawn from the same dictionary. In particular, the base dictionaries for northern and southern Gallo-Romance are different. lemma should therefore be queried in conjunction with lemma_src, which has the following main values:
- DMF: Dictionnaire du Moyen Français (http://www.atilf.fr/dmf). Preferred dictionary for northern Gallo-Romance texts.
- DOM: Dictionnaire de l’occitan médiéval (http://www.dom-en-ligne.de/). Preferred dictionary for southern Gallo-Romance texts.
- TL: Tobler-Lommatzsch Altfranzösisches Wörterbuch. Fallback dictionary.

lemma_dmf gives DMF lemmas for all texts, including southern Gallo-Romance. Where the word or its cognate is not found in the DMF, lemma_dmf is left blank.
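
The practical consequence is that a lemma search is either tied to one source dictionary or made through lemma_dmf when it should span both northern and southern texts. The sketch below illustrates both strategies on tokens represented as Python dictionaries; that representation, the function names and the placeholder values (lemA, lemB, w1 to w3) are invented for illustration and are not real corpus data.

```python
def by_lemma(tokens, lemma, source):
    """Tokens with the given lemma, restricted to one source dictionary."""
    return [t for t in tokens if t["lemma"] == lemma and t["lemma_src"] == source]


def by_dmf_lemma(tokens, lemma):
    """Cross-dialect lookup through lemma_dmf; blank values never match."""
    return [t for t in tokens if t.get("lemma_dmf") and t["lemma_dmf"] == lemma]


# Placeholder tokens: one northern, two southern (one without a DMF cognate).
sample = [
    {"word": "w1", "lemma": "lemA", "lemma_src": "DMF", "lemma_dmf": "lemA"},
    {"word": "w2", "lemma": "lemB", "lemma_src": "DOM", "lemma_dmf": "lemA"},
    {"word": "w3", "lemma": "lemB", "lemma_src": "DOM", "lemma_dmf": ""},
]

southern_only = by_lemma(sample, "lemB", "DOM")
both_dialects = by_dmf_lemma(sample, "lemA")
```

Here the search restricted to DOM finds only the southern tokens, whereas the lemma_dmf search finds the northern token together with the southern token that has a DMF cognate.
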