[1. Tokenization] [2. Summary of annotation] [3. Graphical properties] [4. Morphosyntactic properties] [5. Lemmatization]
Words in the OGR are defined as minimal syntactically independent units, and tokenization is standardized across the corpus texts. Note in particular that all clitics, including consonantal enclitics such as l in del or al, are treated as separate tokens. The normalized text view shows all tokens separated by spaces; modernized punctuation (apostrophes, middle dots, etc.) is not used.
Most annotation in the OGR is attached to the Word and is available in all versions of the corpus.
Four broad categories of annotation can be distinguished:
Type | Tag | Description | Link |
---|---|---|---|
graphical | word | normalized form | below |
graphical | dipl | diplomatic transcription | below |
graphical | wd_div | manuscript word division following the token | below |
graphical | lang | language of the token | below |
morphosyntactic | pos | base part-of-speech tag | below |
morphosyntactic | morph | inflection | below |
lemmatization | lemma | lemma from any available source | below |
lemmatization | lemma_src | source of lemma | below |
lemmatization | lemma_dmf | DMF/ATILF lemma, if available | below |
phonological (TXM) | phon | string of seg_phonemes in word | Segment annotation |
phonological (TXM) | syllabified | string of seg_phonemes with syllable structure markup | Metrical annotation |
metrical (TXM) | prosody | stress pattern | Metrical annotation |
metrical (TXM) | line_met | metre of the line (line_met) | Metrical annotation |
metrical (TXM) | metpos | metrical position of syllables, counting forwards | Metrical annotation |
metrical (TXM) | soptem | metrical position of syllables, counting backwards | Metrical annotation |
other (TXM) | ref | citable reference (line_ref) | Metrical annotation |
word: a normalized graphical form of the word, similar to that found in a print edition but without apostrophes or diacritics (except, at present, in Alexis).
dipl and wd_div provide the diplomatic transcription of the text. dipl indicates the diplomatic form of the token and wd_div the type of word division, including punctuation, which follows it.
Resolved abbreviations are given in [square brackets].
Superfluous letters are given between (parentheses).
Rare cases of editorial correction are given in the following format: word[=corrected].
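The three dipl conventions above can be unpacked mechanically. The following is a minimal sketch rather than part of the OGR tooling: the function name `parse_dipl`, the returned dictionary keys and the sample forms are hypothetical, and it assumes that a correction, when present, always closes the string.

```python
import re

# word[=corrected]: editorial correction at the end of the form (assumed position)
CORRECTION = re.compile(r"^(?P<form>.*)\[=(?P<corr>[^\]]*)\]$")
# [..]: resolved abbreviation; (..): superfluous letters
ABBREVIATION = re.compile(r"\[([^\]]*)\]")
SUPERFLUOUS = re.compile(r"\(([^)]*)\)")


def parse_dipl(dipl):
    """Unpack the markup of a dipl string (hypothetical helper)."""
    correction = None
    match = CORRECTION.match(dipl)
    if match:
        # Separate the manuscript form from the editor's correction.
        dipl, correction = match.group("form"), match.group("corr")
    abbreviations = ABBREVIATION.findall(dipl)
    superfluous = SUPERFLUOUS.findall(dipl)
    # Reading form: keep the expanded abbreviations, drop superfluous letters.
    reading = ABBREVIATION.sub(r"\1", dipl)
    reading = SUPERFLUOUS.sub("", reading)
    return {
        "reading": reading,
        "abbreviations": abbreviations,
        "superfluous": superfluous,
        "correction": correction,
    }


# Invented sample forms, not taken from the corpus.
for form in ["q[ue]", "ter(r)e", "word[=corrected]"]:
    print(form, parse_dipl(form))
```

Keeping the correction separate from the reading form leaves it to the caller to decide whether the manuscript form or the editorial form is wanted.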
The following special characters are used to denote manuscript word division:
+: agglutination to following word
_: space
|: line break
The lang tag indicates the language of the word based on its graphical form. Note that the lang tag does not always match the matrix language of the line or the sentence. In particular, words using Latin rather than Romance orthography within a section of Romance text are also tagged as lang=lat. These are shown in italics in the editions.
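As an illustration of how dipl and wd_div work together, the manuscript layout can be rebuilt from a token sequence using the word-division characters +, _ and | described above. This is a sketch under stated assumptions: tokens are represented as hypothetical (dipl, wd_div) pairs, a wd_div value may hold more than one character, and any character other than the three special ones is treated as punctuation and copied literally.

```python
# Word-division characters and the text they stand for.
SEPARATORS = {
    "+": "",    # agglutination to following word: nothing in between
    "_": " ",   # space
    "|": "\n",  # line break
}


def rebuild_layout(tokens):
    """Rebuild the manuscript layout from (dipl, wd_div) pairs (hypothetical input)."""
    parts = []
    for dipl, wd_div in tokens:
        parts.append(dipl)
        for char in wd_div:
            # Assumption: characters other than + _ | are punctuation.
            parts.append(SEPARATORS.get(char, char))
    return "".join(parts)


# Invented sample, not taken from the corpus.
sample = [("de", "+"), ("l", "_"), ("rei", ".|"), ("e", "_")]
print(rebuild_layout(sample))
```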
This part-of-speech tag is based on the CATTEX-09 tagging system used by the Base de français médiéval.
The OGR tagset introduces a distinction between clitics, i.e. atonic object pronouns before the finite verb and all object pronouns after a clause-initial finite verb, and strong pronouns, e.g. nominative (subject) pronouns and disjunctive oblique pronouns.
Cattex | OGR |
---|---|
PROadv | PRCadv i and en, always clitic |
PROper | PRCper clitic personal pronoun, PRNper strong personal pronoun |
PROdem | PRCdem southern GR clitic o, PRNdem other demonstratives |
others (PROrel, PROind, PROint, …) | PRNrel, PRNind, PRNint … |
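Because every OGR pronoun tag in the table differs from its Cattex counterpart only in the prefix (PRC or PRN instead of PRO), OGR tags can be mapped back to Cattex with a simple substitution, for instance when comparing results with a Cattex-tagged corpus. The sketch below relies on that observation; the function names are hypothetical and not part of any OGR tool.

```python
import re

# PRC (clitic) or PRN (strong) followed by a lower-case subtype, e.g. PRCper.
OGR_PRONOUN = re.compile(r"^PR[CN](?=[a-z])")


def ogr_to_cattex(pos):
    """Collapse the OGR clitic/strong distinction back to Cattex PRO* tags."""
    return OGR_PRONOUN.sub("PRO", pos)


def is_clitic(pos):
    """True for OGR clitic pronoun tags (PRCadv, PRCper, PRCdem)."""
    return pos.startswith("PRC")


for tag in ["PRCadv", "PRCper", "PRNper", "PRCdem", "PRNdem", "PRNrel", "NOMcom"]:
    print(tag, "->", ogr_to_cattex(tag), "| clitic:", is_clitic(tag))
```

The mapping is lossy in one direction only: going from Cattex to OGR requires the syntactic information described above and cannot be done from the tag alone.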
This specifies the inflectional morphology of a form and is based on the Cattex-max tagset proposed, but not implemented, by the BFM team. The following grammatical categories are marked:
(*pos, *per): person-number-gender-case
(VERppe): number-gender-case
The grammatical categories are annotated in the following way:
lemma contains a lemma form from the dictionary listed in lemma_src. Where multiple lemmas are distinguished in the dictionary by a trailing number, e.g. ne1, ne2, only numbers greater than 1 are included, i.e. DMF ne1 becomes simply ne.
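When looking up a dictionary headword in the corpus, the numbering rule above has to be applied to the headword first. A minimal sketch, with a hypothetical function name:

```python
import re

# A headword followed by a homograph number, e.g. ne1, ne2.
NUMBERED = re.compile(r"(.*\D)(\d+)")


def to_ogr_lemma(dictionary_lemma):
    """Drop a trailing homograph number 1; keep numbers greater than 1."""
    match = NUMBERED.fullmatch(dictionary_lemma)
    if match and int(match.group(2)) == 1:
        return match.group(1)
    return dictionary_lemma


for headword in ["ne1", "ne2", "ne"]:
    print(headword, "->", to_ogr_lemma(headword))
```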
lemma is provided for every word, but forms are not always drawn from the same dictionary. In particular, the base dictionaries for northern and southern Gallo-Romance are different. lemma should therefore be queried in conjunction with lemma_src, which has the following main values:
DMF: Dictionnaire de moyen français (http://www.atilf.fr/dmf). Preferred dictionary for northern Gallo-Romance texts.
DOM: Dictionnaire de l’occitan médiéval (http://www.dom-en-ligne.de/). Preferred dictionary for southern Gallo-Romance texts.
wikipedia.fr: Preferred source for French proper nouns.
oc.wikipedia.org: Preferred source for Occitan proper nouns.
wiktionary: Preferred source for Latin lemmas.
TL denotes a lemma from the Tobler-Lommatzsch dictionary and AND a lemma from the Anglo-Norman Dictionary.
lemma_dmf gives DMF lemmas for all texts, including southern Gallo-Romance. Where the word or its cognate is not found in the DMF, lemma_dmf is left blank.
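One way to work with the three lemma tags together is sketched below. It assumes tokens have been exported as Python dictionaries keyed by tag name, which is a hypothetical representation, and the sample values are invented for illustration rather than taken from the corpus. The first function pairs lemma with lemma_src, as recommended above; the second builds a grouping key that prefers lemma_dmf, so that northern and southern cognates fall together, and falls back to the source-qualified lemma when lemma_dmf is blank.

```python
def matches_lemma(token, lemma, lemma_src):
    """Match on lemma and its source together, never on lemma alone."""
    return token.get("lemma") == lemma and token.get("lemma_src") == lemma_src


def comparison_key(token):
    """Key for grouping tokens across northern and southern texts."""
    dmf = token.get("lemma_dmf", "")
    if dmf:
        # lemma_dmf is shared by all texts when a DMF entry or cognate exists.
        return ("DMF", dmf)
    # Blank lemma_dmf: fall back to the lemma qualified by its source.
    return (token["lemma_src"], token["lemma"])


# Invented sample tokens (assumed export format and values).
tokens = [
    {"word": "amor", "lemma": "amor", "lemma_src": "DOM", "lemma_dmf": "amour"},
    {"word": "amur", "lemma": "amour", "lemma_src": "DMF", "lemma_dmf": "amour"},
    {"word": "xyz", "lemma": "xyz", "lemma_src": "DOM", "lemma_dmf": ""},
]

for token in tokens:
    print(token["word"], comparison_key(token))
```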