Next: Form of the delivery Up: Short summary of the Previous: Corpus characteristics

Annotation: Schemata used and procedures applied

The annotation available in the corpus covers morphosyntax, grammatical functions (as far as they are relevant for the description of verbal predicates), and semantic features attached to the heads of relevant phrases.

For morphosyntax, the annotation follows the EAGLES/PAROLE guidelines. For the syntactic layer, the SPARKLE/MATE guidelines have been followed. For the semantic annotation, it was decided, at least for the Italian part of the corpus, to have a double annotation:

1. a "word sense" annotation, through the linking of each corpus occurrence of a verb and its arguments to EuroWordNet readings;
2. a "semantic" annotation, through the assignment of a semantic type from the SIMPLE ontology to each corpus occurrence of a verb and its complement nouns.

Since WordNet data on German verbs were not available at the time the corpus was compiled, only the second (ontological) annotation was performed for German.
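The two annotation layers can be pictured as two optional fields on each annotated token. The following is a minimal sketch, not the corpus format itself: the class names, field names, and the identifier strings (`SENSE_ID_1`, `TYPE_A`, and so on) are hypothetical placeholders, not actual EuroWordNet readings or SIMPLE types.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Token:
    """One annotated token: a verb or the head of one of its arguments."""
    form: str
    word_sense: Optional[str] = None     # placeholder for a EuroWordNet reading
    semantic_type: Optional[str] = None  # placeholder for a SIMPLE ontology type


@dataclass
class VerbOccurrence:
    """A corpus occurrence of a verb together with its argument heads."""
    verb: Token
    arguments: List[Token] = field(default_factory=list)

    def is_doubly_annotated(self) -> bool:
        """True if the verb and every argument carry both annotation layers."""
        return all(
            t.word_sense is not None and t.semantic_type is not None
            for t in [self.verb, *self.arguments]
        )


# Italian-style occurrence: both layers present (all IDs are invented).
italian = VerbOccurrence(
    verb=Token("mangiare", word_sense="SENSE_ID_1", semantic_type="TYPE_A"),
    arguments=[Token("mela", word_sense="SENSE_ID_2", semantic_type="TYPE_B")],
)

# German-style occurrence: only the ontological layer is filled in.
german = VerbOccurrence(
    verb=Token("essen", semantic_type="TYPE_A"),
    arguments=[Token("Apfel", semantic_type="TYPE_B")],
)
```

Keeping the two layers as independent optional fields lets the same structure hold both the doubly annotated Italian data and the German data, where only the semantic type is present.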

For German, the following steps were performed automatically: selection of relevant sentences, part-of-speech tagging, and annotation of grammatical functions. A stochastic part-of-speech tagger was used ([Schmid 1994]); the tagged data were then processed by means of an LFG-based grammar of German, which assigns grammatical function labels. The grammatical function annotation from LFG was mapped onto the SPARKLE/MATE standards proposal, and the resulting MATE-conformant annotation was converted to XML.
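The last two steps (label mapping and XML conversion) might look roughly like the sketch below. It is an illustration under stated assumptions only: the mapping table, the target labels, and the XML element and attribute names are hypothetical and do not reproduce the actual SPARKLE/MATE label set or the corpus DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from LFG grammatical functions to target-scheme labels.
# The real SPARKLE/MATE labels are not given in this document.
LFG_TO_TARGET = {"SUBJ": "subject", "OBJ": "object", "OBL": "oblique"}


def map_functions(analysis):
    """Map (head, lfg_function, dependent) triples onto target labels.

    Functions without an entry in the table are labelled "unmapped" so that
    they can be found during manual correction.
    """
    return [
        (head, LFG_TO_TARGET.get(gf, "unmapped"), dep)
        for head, gf, dep in analysis
    ]


def to_xml(sentence_id, relations):
    """Serialise the mapped relations as one XML <sentence> element."""
    sent = ET.Element("sentence", id=sentence_id)
    for head, label, dep in relations:
        ET.SubElement(sent, "relation", type=label, head=head, dependent=dep)
    return ET.tostring(sent, encoding="unicode")


# Usage: a toy analysis of "Der Hund sieht die Katze".
analysis = [("sieht", "SUBJ", "Hund"), ("sieht", "OBJ", "Katze")]
xml_string = to_xml("s1", map_functions(analysis))
```

Marking unknown functions explicitly rather than dropping them keeps the automatic step lossless, which matters when a manual correction pass follows.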

The scripts used to perform the individual annotation steps are available, so that more material can in principle be processed according to the same procedures whenever this becomes necessary.

A completely manual correction step was necessary after syntactic analysis (elimination of wrong analyses and of useless ambiguities). Moreover, semantic annotation was carried out entirely by hand, owing in particular to the lack of semantic resources for German.

For Italian, part-of-speech tagging was performed automatically.

This was followed by the marking of grammatical relations and the assignment and manual selection of the relevant EuroWordNet and SIMPLE tags (for word-sense and semantic type, respectively).


Hannah Kermes
2/8/2001