Parole – ILSP

Start Date: 01/04/1996
End Date:
Funding: LE II (LE2 4017 - 10379)
Project Leader: Gavriilidou Maria

The aim of the PAROLE project was the compilation of large, generic and re-usable Written Language Resources for all EU Languages, comprising more specifically:

- general-language text corpora of 20,000,000 words each in 14 languages (Belgian French, Catalan, Danish, Dutch, English, French, Finnish, German, Greek, Irish, Italian, Norwegian, Portuguese and Swedish), and
- computational lexicons of 20,000 lemmas each in 12 languages (Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish).

The value of these resources lies not only in their size and in the number of languages covered by the project, but also in the fact that they were built according to common standards and specifications:

- The text corpora have been compiled and annotated following the same guidelines:
  - texts have been selected on the basis of specified common parameters for time of production (after 1970) and proportionate representation of the textual material according to publication medium (Book, Newspaper, Periodical and Miscellaneous);
  - all texts have been annotated using the same mark-up format (PAROLE DTD) as regards bibliographical information and text structure (annotation at the level of paragraph);
  - a subset of the corpus (250,000 words) has been morphosyntactically annotated according to a common core PAROLE tagset, extended with a set of language-specific features.
- For the lexicons, harmonisation was achieved by developing a common model (the PAROLE model) that caters for the encoding of morphological and syntactic information in all languages. As a result, all the lexicons have been built according to the same design principles and linguistic specifications and are encoded in the same representation format.

Since the completion of the project, a subset of the resources for each language has been available to the research community, either through the European Language Resources Association (ELRA) or directly through the project participants:

- a subset of the text corpus (3,000,000 words), including the morphosyntactically annotated subcorpus, and
- the computational lexicon.

For more information on the project, please visit the PAROLE/SIMPLE web site: http://www.ub.es/gilcub/SIMPLE/simple.html.

Departments

Natural Language Processing and Language Infrastructures