WorldCist'17 - 5th World Conference on Information Systems and Technologies


Augmenting SMT with generated pseudo-parallel corpora from monolingual news resources

Several natural languages have been processed extensively, but the problem of limited linguistic resources remains. Manual creation of parallel corpora is expensive and very time-consuming. In addition, the language data required for statistical machine translation (SMT) does not exist in sufficient quantity for its statistical information to be used to start the research process. On the other hand, applying unsubstantiated approaches to build parallel resources from sources such as comparable or quasi-comparable corpora is very complicated and produces rather noisy output. That output must later be reprocessed, and in-domain adaptation is also required. To optimize the performance of these algorithms, it is essential to train the end-to-end procedure on a high-quality parallel corpus. In the present research, we developed a methodology that generates an accurate parallel corpus from monolingual resources by calculating the compatibility between the outputs of several machine translation systems. We translate large monolingual resources with multiple translation systems and strictly measure the compatibility of the resulting translations using rules based on the Levenshtein distance. The results produced by this approach are very favorable. All monolingual resources were taken from the WMT16 conference data for Czech, and the parallel corpus generated from them improved translation performance.
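The compatibility check described above can be illustrated with a minimal sketch. The abstract does not give the actual rules, so everything here is an assumption: the distance is computed over lowercased, whitespace-split tokens, it is normalized by the longer hypothesis, a sentence is kept only when every pair of system outputs reaches a hypothetical `threshold` of 0.8, and the first system's output is used as the target side. The function names `levenshtein`, `similarity`, and `filter_pseudo_parallel` are illustrative and do not come from the paper.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two sequences (insert/delete/substitute, cost 1 each)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(hyp_a, hyp_b):
    """Token-level similarity in [0, 1]; 1.0 means identical token sequences."""
    tokens_a = hyp_a.lower().split()
    tokens_b = hyp_b.lower().split()
    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(tokens_a, tokens_b) / longest


def filter_pseudo_parallel(sources, system_outputs, threshold=0.8):
    """Keep (source, translation) pairs on which all systems agree.

    sources: list of monolingual source sentences.
    system_outputs: one list of translations per MT system, aligned with sources.
    threshold: assumed agreement cut-off, not a value from the paper.
    """
    kept = []
    for idx, source in enumerate(sources):
        hypotheses = [outputs[idx] for outputs in system_outputs]
        agree = all(similarity(x, y) >= threshold
                    for x, y in combinations(hypotheses, 2))
        if agree:
            # Assumption: the first system's output serves as the target side.
            kept.append((source, hypotheses[0]))
    return kept
```

Under these assumptions, a stricter threshold trades corpus size for lower noise. With a single translation system there are no pairs to compare, so every sentence would pass; the sketch is only meaningful with two or more systems.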

Author(s):

Krzysztof Wołk    
Polish-Japanese Academy of Information Technology
Poland

Agnieszka Wołk    
Polish-Japanese Academy of Information Technology
Poland

 
