WorldCist'18 - 6th World Conference on Information Systems and Technologies

Mixing textual data selection methods for improved in-domain data adaptation

Advanced data selection techniques are changing how efficiently machine translation (MT) training data can be used. These techniques extract sentences from broad-domain corpora and adapt them for use as in-domain MT training data. In this research, we attempt to improve in-domain data adaptation methodologies. We focus on three techniques for selecting sentences. The first is term frequency–inverse document frequency (TF-IDF), which originated in information retrieval (IR). The second, drawn from the language modeling literature, is a perplexity-based approach. The third, which we discuss herein, is based on the Levenshtein distance. We propose an effective combination of the three data selection techniques, applied at the corpus level. The results of this study show that the individual techniques are not particularly successful in practical applications. However, combining an IR-based methodology with multilingual resources was found to be an effective approach.
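To make the three selection criteria concrete, the following is a minimal Python sketch, not the author's implementation. It assumes pre-tokenized sentences, an add-one smoothed unigram model as a stand-in for the language model behind the perplexity score, and a simple sum of ranks as the combination rule. The abstract does not specify how the techniques are combined, so the function `select` and its ranking scheme are hypothetical.

```python
import math
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two sequences (tokens or characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) for non-empty tokenized sentences."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: (c / len(d)) * idf[t] for t, c in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def unigram_perplexity(sentence, counts, total, vocab_size):
    """Perplexity under an add-one smoothed unigram model (an assumption)."""
    logp = sum(math.log((counts[t] + 1) / (total + vocab_size))
               for t in sentence)
    return math.exp(-logp / len(sentence))

def ranks(values, reverse=False):
    """Rank positions (0 = best) for a list of scores."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    result = [0] * len(values)
    for pos, i in enumerate(order):
        result[i] = pos
    return result

def select(in_domain, pool, k):
    """Hypothetical combination: pick the k pool sentences with the best
    summed rank across TF-IDF similarity, perplexity, and edit distance."""
    vecs = tfidf_vectors(in_domain + pool)
    dom_vecs, pool_vecs = vecs[:len(in_domain)], vecs[len(in_domain):]
    counts = Counter(t for s in in_domain for t in s)
    total = sum(counts.values())
    vocab_size = len({t for s in in_domain + pool for t in s})

    sim = [max(cosine(p, d) for d in dom_vecs) for p in pool_vecs]
    ppl = [unigram_perplexity(s, counts, total, vocab_size) for s in pool]
    lev = [min(levenshtein(s, d) / max(len(s), len(d)) for d in in_domain)
           for s in pool]

    # Higher similarity is better; lower perplexity and distance are better.
    combined = [a + b + c for a, b, c in
                zip(ranks(sim, reverse=True), ranks(ppl), ranks(lev))]
    best = sorted(range(len(pool)), key=lambda i: combined[i])[:k]
    return [pool[i] for i in best]

in_domain = [["the", "patient", "received", "treatment"],
             ["the", "doctor", "prescribed", "medicine"]]
pool = [["the", "patient", "received", "medicine"],
        ["stock", "markets", "fell", "sharply", "today"]]
selected = select(in_domain, pool, 1)
```

In this toy setup the medical sentence in `pool` ranks first on all three criteria, so it is the one selected. A practical system would likely replace the unigram model with a stronger language model and tune how the three scores are weighted.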

Krzysztof Wolk
Polish-Japanese Academy of Information Technology

