Annotation 3: Words

[1. Tokenization] [2. Summary of annotation] [3. Graphical properties] [4. Morphosyntactic properties] [5. Lemmatization]

1. Tokenization

Words in the OGR are defined as minimal syntactically independent units, and tokenization is standardized across the corpus texts. Note in particular that all clitics, including consonantal enclitics such as l in del or al, are treated as separate tokens. The normalized text view shows all tokens separated by spaces; modernized punctuation (apostrophes, middle dots, etc.) is not used.

2. Summary of annotation

Most annotation in the OGR is attached to the Word and is available in all versions of the corpus.

Four broad categories of annotation can be distinguished:

Graphical properties: normalized and diplomatic form, manuscript word division.
Morphosyntactic properties: part-of-speech tag and inflectional morphology.
Lemmatization: lemma(s) and their source.
Simplified annotation of other units for the TXM version.

Type	Tag	Description	Link
graphical	`word`	normalized form	below
graphical	`dipl`	diplomatic transcription	below
graphical	`wd_div`	manuscript word division following the token	below
graphical	`lang`	language of the token	below
morphosyntactic	`pos`	base part-of-speech tag	below
morphosyntactic	`morph`	inflection	below
lemmatization	`lemma`	lemma from any available source	below
lemmatization	`lemma_src`	source of lemma	below
lemmatization	`lemma_dmf`	DMF/ATILF lemma, if available	below
phonological (TXM)	`phon`	string of `seg_phonemes` in word	Segment annotation
phonological (TXM)	`syllabified`	string of `seg_phonemes` with syllable structure markup	Metrical annotation
metrical (TXM)	`prosody`	stress pattern	Metrical annotation
metrical (TXM)	`line_met`	metre of the line (`line_met`)	Metrical annotation
metrical (TXM)	`metpos`	metrical position of syllables, counting forwards	Metrical annotation
metrical (TXM)	`soptem`	metrical position of syllables, counting backwards	Metrical annotation
other (TXM)	`ref`	citable reference (`line_ref`)	Metrical annotation

3. Graphical properties

word

A normalized graphical form of the word similar to that found in a print edition, but without apostrophes or diacritics (except, at present, in Alexis).

dipl and wd_div

dipl and wd_div provide the diplomatic transcription of the text. dipl indicates the diplomatic form of the token and wd_div the type of word division, including punctuation, which follows it.

Resolved abbreviations are given in [square brackets]. Superfluous letters are given between (parentheses). Rare cases of editorial correction are given in the following format: word[=corrected].
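The bracket conventions above can be unpacked mechanically. The following is a minimal Python sketch, not part of the OGR tooling: the function name `reading_form` is hypothetical, the sample strings in the usage note are invented for illustration, and it assumes that a "reading form" keeps resolved abbreviations, drops superfluous letters and prefers the editorial correction.

```python
import re

def reading_form(dipl):
    """Turn a dipl value into a plain reading form (hypothetical helper).

    Assumptions, not confirmed by the OGR documentation:
    - word[=corrected] -> the corrected form is returned as-is
    - [abc]            -> resolved abbreviation, letters are kept
    - (abc)            -> superfluous letters, dropped
    """
    correction = re.fullmatch(r"(.*)\[=(.*)\]", dipl)
    if correction:
        return correction.group(2)
    text = re.sub(r"\[([^\]]*)\]", r"\1", dipl)  # keep resolved abbreviations
    text = re.sub(r"\([^)]*\)", "", text)        # drop superfluous letters
    return text
```

For instance, with invented inputs, `reading_form("q[ue]")` gives `que` and `reading_form("fut[=fu]")` gives `fu`.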

The following special characters are used to denote manuscript word division:

+ agglutination to following word
_ space
| line break
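As an illustration of how dipl and wd_div fit together, the sketch below rebuilds a diplomatic text from a sequence of tokens. It is only a sketch under stated assumptions: the function name `render_diplomatic` is hypothetical, tokens are assumed to arrive as `(dipl, wd_div)` pairs, and wd_div is assumed to be a short string in which the three special characters may be combined with punctuation marks, which are copied through unchanged.

```python
# Rendering of the three special wd_div characters listed above.
WD_DIV_RENDERING = {
    "+": "",    # agglutination to following word: nothing is inserted
    "_": " ",   # space
    "|": "\n",  # line break
}

def render_diplomatic(tokens):
    """Join (dipl, wd_div) pairs into a diplomatic text (hypothetical helper).

    Any wd_div character that is not a special character is assumed to be
    punctuation and is emitted as-is.
    """
    parts = []
    for dipl, wd_div in tokens:
        parts.append(dipl)
        for char in wd_div:
            parts.append(WD_DIV_RENDERING.get(char, char))
    return "".join(parts)
```

With the invented input `[("d", "+"), ("el", "_"), ("rei", "|")]`, the enclitic is written together with its host and the result is `"del rei\n"`.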

lang

The lang tag indicates the language of the word based on its graphical form. Note that the lang tag does not always match the matrix language of the line or the sentence. In particular, words using Latin rather than Romance orthography within a section of Romance text are tagged as lang=lat. These are shown in italics in the editions.

4. Morphosyntactic properties

pos

This part-of-speech tag is based on the CATTEX-09 tagging system used by the Base de français médiéval.

PRO > PRC, PRN

The OGR tagset introduces a distinction between clitics, i.e. atonic object pronouns before the finite verb and all object pronouns after a clause-initial finite verb, and strong pronouns, e.g. nominative (subject) pronouns and disjunctive oblique pronouns.

Cattex	OGR
`PROadv`	`PRCadv` i and en, always clitic
`PROper`	`PRCper` clitic personal pronoun, `PRNper` strong personal pronoun
`PROdem`	`PRCdem` southern GR clitic o, `PRNdem` other demonstratives
others (`PROrel`, `PROind`, `PROint`, …)	`PRNrel`, `PRNind`, `PRNint` …
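The correspondence in the table can be expressed as a small mapping function. This is a hedged sketch rather than the procedure actually used to build the corpus: the function name `cattex_to_ogr` and the `is_clitic` flag are hypothetical, and deciding whether a given pronoun is clitic (per the definition above) is left to the caller.

```python
def cattex_to_ogr(tag, is_clitic=False):
    """Map a Cattex pronoun tag to the OGR PRC/PRN split (hypothetical helper).

    is_clitic is supplied by the caller; for PROdem it is assumed to be
    true only for the southern Gallo-Romance clitic o.
    """
    if not tag.startswith("PRO"):
        return tag                    # non-pronoun tags are left unchanged
    subtype = tag[3:]
    if subtype == "adv":
        return "PRCadv"               # i and en are always clitic
    if subtype in ("per", "dem"):
        return ("PRC" if is_clitic else "PRN") + subtype
    return "PRN" + subtype            # PROrel, PROind, PROint, ...
```

For example, `cattex_to_ogr("PROper", is_clitic=True)` returns `PRCper`, while `cattex_to_ogr("PROrel")` returns `PRNrel`.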

morph

This specifies the inflectional morphology of a form and is based on the Cattex-max tagset proposed, but not implemented, by the BFM team. The following grammatical categories are marked:

nominals without person: number-gender-case
nominals with person (*pos, *per): person-number-gender-case
conjugated verb: mood-tense-person
past participles (VERppe): number-gender-case

The grammatical categories are annotated in the following way:

number (1 character): s (singular) or p (plural)
gender (1): m (masculine), f (feminine) or n (neuter)
case (1): n (nominative) or r (oblique). For third-person clitic personal pronouns, a (accusative/COD) or i (dative, COI)
person (1): 1-6 (not 1s, 1p, etc.)
mood (3): ind (indicative), sub (subjunctive), imp (imperative), con (conditional)
tense (3): pst (present), psp (preterite), pqp (< Latin simple pluperfect), ipf (imperfect), fut (future)
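The codes above lend themselves to a simple decoder. The sketch below is illustrative only and rests on an assumption that is not stated in this documentation: that a morph value is the concatenation of the fixed-width codes in the order given (for example a 7-character value for a conjugated verb), optionally separated by hyphens. The function name `decode_morph` and the sample values are hypothetical.

```python
NUMBER = {"s": "singular", "p": "plural"}
GENDER = {"m": "masculine", "f": "feminine", "n": "neuter"}
CASE = {"n": "nominative", "r": "oblique", "a": "accusative", "i": "dative"}
MOOD = {"ind": "indicative", "sub": "subjunctive",
        "imp": "imperative", "con": "conditional"}
TENSE = {"pst": "present", "psp": "preterite", "pqp": "pluperfect",
         "ipf": "imperfect", "fut": "future"}

def decode_morph(morph):
    """Decode a morph value into labelled categories (hypothetical helper).

    Assumes fixed-width codes in the documented order; hyphens, if any,
    are ignored. Raises ValueError for layouts it does not recognize.
    """
    code = morph.replace("-", "")
    if len(code) == 7:   # conjugated verb: mood-tense-person
        return {"mood": MOOD[code[0:3]], "tense": TENSE[code[3:6]],
                "person": int(code[6])}
    if len(code) == 4:   # nominal with person: person-number-gender-case
        return {"person": int(code[0]), "number": NUMBER[code[1]],
                "gender": GENDER[code[2]], "case": CASE[code[3]]}
    if len(code) == 3:   # nominal without person or past participle
        return {"number": NUMBER[code[0]], "gender": GENDER[code[1]],
                "case": CASE[code[2]]}
    raise ValueError(f"unrecognized morph value: {morph!r}")
```

Under these assumptions, `decode_morph("indpst3")` yields indicative, present, person 3, and `decode_morph("s-m-n")` yields singular, masculine, nominative.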

5. Lemmatization

lemma contains a lemma form from the dictionary listed in lemma_src. Where multiple lemmas are distinguished in the dictionary by a trailing number, e.g. ne1, ne2, only numbers greater than 1 are included, i.e. DMF ne1 becomes simply ne.
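The numbering rule can be sketched in a few lines of Python. The function name `normalize_lemma` is hypothetical and the sketch only covers the case described above, a trailing homograph number on the dictionary headword.

```python
import re

def normalize_lemma(dictionary_lemma):
    """Drop a trailing homograph number 1, keep larger numbers (hypothetical helper)."""
    match = re.fullmatch(r"(.*?)(\d+)", dictionary_lemma)
    if match and match.group(1) and int(match.group(2)) == 1:
        return match.group(1)
    return dictionary_lemma
```

Here `normalize_lemma("ne1")` returns `ne`, while `normalize_lemma("ne2")` is returned unchanged.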

lemma is provided for every word, but forms are not always drawn from the same dictionary. In particular, the base dictionaries for northern and southern Gallo-Romance are different. lemma should therefore be queried in conjunction with lemma_src, which has the following main values:

DMF: Dictionnaire de moyen français (http://www.atilf.fr/dmf). Preferred dictionary for northern Gallo-Romance texts.
DOM: Dictionnaire de l’occitan médiéval (http://www.dom-en-ligne.de/). Preferred dictionary for southern Gallo-Romance texts.
- Acute and grave accents are used to distinguish mid-open and mid-close vowels, e.g. éis for DOM ẹis; likewise, capital N replaces DOM ṉ.
wikipedia.fr: Preferred source for French proper nouns.
oc.wikipedia.org: Preferred source for Occitan proper nouns.
wiktionary: Preferred source for Latin lemmas.

TL denotes a lemma from the Tobler-Lommatzsch dictionary and AND a lemma from the Anglo-Norman Dictionary.

lemma_dmf gives DMF lemmas for all texts, including southern Gallo-Romance. Where the word or its cognate is not found in the DMF, lemma_dmf is left blank.