WorldCist'18 - 6th World Conference on Information Systems and Technologies


Statistical approach to noisy-parallel and comparable corpora filtering for the extraction of bi-lingual equivalent data at sentence-level

Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems and other text processing tasks that require bilingual data. In this study, we propose a language-independent bi-sentence filtering approach based on Polish-to-English translation. The approach was developed using a noisy TED Talks corpus and tested on a Wikipedia-based comparable corpus; however, it can be extended to any text domain or language pair. The proposed method uses various statistical measures for sentence comparison and can also be used for in-domain data adaptation tasks. Data loss was minimized through parameter adaptation. We discuss the improvement in MT system scores obtained with text processed by our tool and present in-domain data adaptation results. We also discuss measures to improve performance, such as bootstrapping and comparison model pruning. The results show a significant improvement in filtering, as measured by MT quality.
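The abstract does not name the specific statistical measures used, so the following is only an illustrative sketch of what threshold-based bi-sentence filtering can look like, not the authors' method. It assumes two simple measures chosen for illustration (a token length ratio and a bilingual-lexicon overlap score); the function names, the toy lexicon, and the thresholds `min_ratio` and `min_overlap` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def length_ratio(src: str, tgt: str) -> float:
    """Ratio of the shorter to the longer token count, in [0, 1]."""
    a, b = len(src.split()), len(tgt.split())
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def lexicon_overlap(src: str, tgt: str, lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source tokens with a lexicon translation present in the target."""
    src_tokens = src.lower().split()
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_tokens)
    return hits / len(src_tokens)


def filter_pairs(
    pairs: List[Tuple[str, str]],
    lexicon: Dict[str, Set[str]],
    min_ratio: float = 0.5,    # hypothetical threshold, would need tuning
    min_overlap: float = 0.3,  # hypothetical threshold, would need tuning
) -> List[Tuple[str, str]]:
    """Keep only the pairs that pass both illustrative measures."""
    return [
        (s, t)
        for s, t in pairs
        if length_ratio(s, t) >= min_ratio
        and lexicon_overlap(s, t, lexicon) >= min_overlap
    ]


# Toy Polish-to-English lexicon and candidate pairs (made up for illustration).
toy_lexicon = {"kot": {"cat"}, "pies": {"dog"}, "śpi": {"sleeps", "sleeping"}}
candidates = [
    ("kot śpi", "the cat sleeps"),
    ("pies śpi", "stock prices fell sharply today again"),
]
kept = filter_pairs(candidates, toy_lexicon)
```

Loosening or tightening the two thresholds trades data loss against noise, which is the kind of parameter adaptation the abstract mentions; the actual measures and settings would have to come from the full paper.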

Krzysztof Wołk
Polish-Japanese Academy of Information Technology

