Discovering Parallel Language Resources for Training MT Engines
Authors: Vassilis Papavassiliou; Prokopis Prokopidis; Stelios Piperidis
Book title: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Web crawling is an efficient way to compile the monolingual, parallel and/or domain-specific corpora needed for machine translation and other human language technology (HLT) applications. These corpora can be automatically processed to generate second-order or synthesized derivative resources, including bilingual (general or domain-specific) lexica and terminology lists. In this submission, we discuss the architecture and use of the ILSP Focused Crawler (ILSP-FC), a system developed by researchers of the ILSP/Athena RIC for the acquisition of such resources and currently used in the European Language Resource Coordination (ELRC) effort. ELRC aims to identify and gather language and translation data relevant to public services and governmental institutions across 30 European countries participating in the Connecting Europe Facility (CEF).