AI for Modern Greek Dialects

Advancing AI Research for Processing Modern Greek Dialects

This webpage provides consolidated information about the resources and neural models developed by Athena RC for the processing and documentation of the dialects of the Greek language.

Hellenistic Koine has, among other things, bequeathed a rich system of dialectal varieties of the Greek language. Each dialect has its own linguistic characteristics, a long history, and great cultural value. The features of the dialects that are absent from Standard Modern Greek, combined with the limited availability of dialectal data, create both challenges and opportunities for Artificial Intelligence and Language Technology.

At Athena RC (Institute of Language and Speech Processing and Research Unit Archimedes), we develop Artificial Intelligence and Natural Language Processing methods and tools for Modern Greek dialects by transferring knowledge from the technology we have developed for Standard Modern Greek. We develop resources and models as follows:

  • We conduct field research to collect dialectal speech from native speakers in authentic conversational conditions.
  • With the spoken data, we develop neural speech-to-text (STT) models and normalize the transcriptions with methods specifically adapted to each dialect. The STT models can then be used to transcribe additional oral data from the same or a similar dialect.
  • When dialectal texts are available, we enrich them with the oral data.
  • We develop treebanks with detailed morphosyntactic and morphophonological annotation by transferring knowledge from treebanks of Standard Modern Greek.
  • We use the treebanks to train neural models that can perform morphosyntactic annotation of new dialectal texts from the same or a similar dialect.

The research results include neural models for Speech-to-Text and morphosyntactic analysis, oral data, treebanks, and specifications for morphosyntactic annotation, all openly available.

To date, we have developed oral and textual resources for dialects spoken in Eastern Crete, Lesvos, and Messenia, as well as for Standard Modern Greek. In the framework of international collaborations, we also study other varieties, such as the Greek dialects of Southern Italy, Cypriot, and Pontic. At the same time, we explore the creation of synthetic data through Large Language Models (LLMs) and the comparative study of the dialects, aiming to enhance their presence in the era of Artificial Intelligence.

Corpora (annotated according to the UD schema)

The annotation follows the Universal Dependencies (UD) schema. The treebanks can be retrieved from the UD repository, which also holds the formatting specifications of each treebank. For each treebank, we give its name, the link to the UD repository, and the preferred citation format.
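
UD treebanks are distributed as CoNLL-U files, in which each token occupies one line of ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). As a rough illustration of how such a file can be read, the Python sketch below parses a small sample. The sample sentence and its annotation are invented for this example and are not taken from any of the treebanks listed here; the `Token` class and the `parse_conllu` function are hypothetical helpers, not part of any released tool. Multiword-token and empty-node lines are stored like any other line.

```python
from dataclasses import dataclass
from typing import Dict, List

# Invented three-token sample in CoNLL-U layout (illustration only,
# not taken from any released treebank).
SAMPLE = (
    "# text = Το παιδί τρέχει\n"
    "1\tΤο\tο\tDET\t_\tCase=Nom|Gender=Neut|Number=Sing\t2\tdet\t_\t_\n"
    "2\tπαιδί\tπαιδί\tNOUN\t_\tCase=Nom|Gender=Neut|Number=Sing\t3\tnsubj\t_\t_\n"
    "3\tτρέχει\tτρέχω\tVERB\t_\tNumber=Sing|Person=3\t0\troot\t_\t_\n"
    "\n"
)


@dataclass
class Token:
    """Hypothetical container for the columns used in this sketch."""
    id: str
    form: str
    lemma: str
    upos: str
    feats: Dict[str, str]
    head: str
    deprel: str


def parse_feats(field: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def parse_conllu(text: str) -> List[List[Token]]:
    """Split CoNLL-U text into sentences, each a list of Token objects."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():          # a blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment or metadata
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(Token(cols[0], cols[1], cols[2], cols[3],
                             parse_feats(cols[5]), cols[6], cols[7]))
    if current:                       # file did not end with a blank line
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)
for tok in sentences[0]:
    print(tok.id, tok.form, tok.upos, tok.head, tok.deprel)
```

To read a downloaded treebank instead of the sample, the file contents can be passed to `parse_conllu` after opening the file with `encoding="utf-8"`.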

Models

For each model, we explain how it was developed and provide the preferred citation format.

Speech-to-text models

Neural models for morphosyntactic annotation

Team members

The following researchers have contributed to the work described above:

  • Antonis Anastasopoulos, Assistant Professor, George Mason University & Archimedes/Athena RC
  • Stella Markantonatou, Research Director, ILSP/Athena RC & Archimedes/Athena RC
  • Angela Ralli, Professor Emerita of Linguistics, University of Patras & Archimedes/Athena RC
  • Giorgos Paraskevopoulos, Senior Researcher, ILSP/Athena RC
  • Chara Tsoukala, Senior Researcher, ILSP/Athena RC
  • Vivian Stamou, PostDoc Researcher, Archimedes/Athena RC
  • Stavros Bompolas, PostDoc Researcher, Archimedes/Athena RC
  • Antonis Dimakis, PhD Student, Archimedes/Athena RC
  • Yannis Kazos, Electrical Engineer, Undergraduate Student, NTUA & Archimedes/Athena RC
  • Socrates Vakirtzian, Katerina Mouzou, MSc Students, NKUA & ILSP/Athena RC