TimeEL: Recognition of Temporal Expressions in Greek texts
|Authors:||Prokopis Prokopidis; Elina Desipri; Harris Papageorgiou; G. Markopoulos|
|Book title:||Proceedings of the 9th International Conference on Greek Linguistics|
In this paper we present work on the development of TimeEL, a rule-based software module that recognizes temporal expressions (TIMEXes) in Greek texts. In the first section of the paper, we provide details on the corpus used for development and evaluation. The corpus, which amounts to 26.5K tokens and 1.7K sentences, comprises financial web documents, sports-related articles and transcribed documentaries about political affairs. An adaptation of the TIDES scheme for Greek was compiled by three postgraduate linguistics students, who used the Callisto annotation tool to mark the extent of 613 relevant time expressions (including 243 dates and 225 durations) in the corpus. Each TIMEX annotation also included attributes such as VAL (the normalized, ISO 8601 compatible format of the expression) and MOD (a representation of temporal modification by lexical items like περίπου and στην αρχή), among others.

The TimeEL module is discussed in the following sections of the paper. Raw input is enriched with annotations generated automatically by an NLP pipeline that integrates a tokenizer, a sentence splitter, a POS tagger, a lemmatizer and a surface syntactic analyzer that recognizes non-recursive chunks. The core resource of TimeEL is a grammar that was developed and tested using the JAPE framework [Cunningham, 2008] for writing cascades of pattern-based rules over lexical items and annotations. The grammar contains macros covering, among others, parts of days, duration adjectives and adverbial pre- and post-modifiers of temporal expressions. The main rules are grouped in three stages: a) markup of (combinations of) trigger lemmas, b) expansion of markables to the extent of chunks recognized by the syntactic analyzer, and c) post-processing of certain expressions (for example, splitting TIMEXes that represent time ranges into two distinct expressions).
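The three rule stages described above can be illustrated with a minimal Python sketch. This is not the JAPE grammar of TimeEL. The trigger lemmas, range connectors, year pattern, function names and the English toy sentence are all hypothetical stand-ins, and the chunk spans are assumed to come from an upstream syntactic analyzer.

```python
import re

# Hypothetical trigger lemmas and range connectors (English stand-ins, not TimeEL's resources).
TRIGGER_LEMMAS = {"year", "month", "week", "day"}
RANGE_CONNECTORS = {"-", "to", "until"}
YEAR_PATTERN = re.compile(r"^(1[89]|20)\d{2}$")  # assumed pattern for four-digit years


def mark_triggers(lemmas):
    """Stage (a): mark trigger lemmas, merging adjacent triggers into one span."""
    spans = []
    for i, lemma in enumerate(lemmas):
        if lemma in TRIGGER_LEMMAS or YEAR_PATTERN.match(lemma):
            if spans and spans[-1][1] == i:
                spans[-1] = (spans[-1][0], i + 1)  # extend the previous markable
            else:
                spans.append((i, i + 1))
    return spans


def expand_to_chunks(spans, chunks):
    """Stage (b): widen each markable to the chunk that contains it, if any."""
    expanded = []
    for start, end in spans:
        for c_start, c_end in chunks:
            if c_start <= start and end <= c_end:
                start, end = c_start, c_end
                break
        if (start, end) not in expanded:  # markables in the same chunk collapse
            expanded.append((start, end))
    return expanded


def split_ranges(spans, lemmas):
    """Stage (c): split a markable at the first range connector strictly inside it."""
    result = []
    for start, end in spans:
        cut = next(
            (i for i in range(start + 1, end - 1) if lemmas[i] in RANGE_CONNECTORS),
            None,
        )
        if cut is None:
            result.append((start, end))
        else:
            result.extend([(start, cut), (cut + 1, end)])
    return result


def recognize(lemmas, chunks):
    """Run the three stages in sequence and return (start, end) token spans."""
    return split_ranges(expand_to_chunks(mark_triggers(lemmas), chunks), lemmas)


# Toy input: token spans are half-open [start, end) index pairs.
tokens = ["sales", "rose", "during", "the", "years", "2005", "-", "2007"]
lemmas = ["sale", "rise", "during", "the", "year", "2005", "-", "2007"]
chunks = [(0, 1), (1, 2), (3, 8)]  # assumed output of a chunker

timexes = [" ".join(tokens[s:e]) for s, e in recognize(lemmas, chunks)]
print(timexes)  # ['the years 2005', '2007']
```

In this toy run, the two markables found in stage (a) fall inside the same chunk, so stage (b) collapses them into one span, which stage (c) then splits at the connector into two expressions.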
In the evaluation section of the paper we report encouraging results (recall of 88.5%, precision of 82.7%) on TIMEX extent recognition, using strict exact match and testing on the whole corpus. The corresponding figures for a subset of the corpus containing only financial documents, where vague TIMEXes are less frequent, are 96.1% recall and 91.3% precision. In the error analysis section, we show that most false positives partially overlap with missed true positives. We conclude with a discussion of ongoing work on the normalization of recognized expressions, and of plans for the manual annotation and automatic identification of temporal relations between events and TIMEXes.
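As an illustration of the evaluation setting, the following Python sketch shows one way strict exact-match precision and recall over TIMEX extents could be computed, together with the check for false positives that partially overlap missed gold expressions. The function names and the toy offsets are assumptions made for this example. They are not the paper's evaluation code or data.

```python
def exact_match_scores(gold, predicted):
    """Precision and recall where a prediction counts only if both boundaries match."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


def overlapping_false_positives(gold, predicted):
    """False positives that partially overlap a gold span the system missed."""
    missed = set(gold) - set(predicted)
    false_positives = set(predicted) - set(gold)
    return sorted(
        fp
        for fp in false_positives
        if any(fp[0] < m[1] and m[0] < fp[1] for m in missed)
    )


# Invented half-open (start, end) offsets, for illustration only.
gold = [(0, 2), (5, 8), (10, 12), (15, 16)]
predicted = [(0, 2), (5, 7), (10, 12), (20, 21)]

precision, recall = exact_match_scores(gold, predicted)
print(precision, recall)                             # 0.5 0.5
print(overlapping_false_positives(gold, predicted))  # [(5, 7)]
```

Under strict matching, the prediction `(5, 7)` counts as both a false positive and a missed gold expression, even though it overlaps the gold span `(5, 8)`. This is the kind of boundary error the error analysis refers to.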