Welcome to Multi Source


The open-source project for readily combining Computer-Assisted Translation (CAT)
and interactive Neural Machine Translation (NMT).
We hope to encourage and streamline a global fourth empowerment with this software.


About the software


Multi Source combines the emerging possibilities of web-based CAT systems and interactive NMT to deliver a complete package for document translation and discussion in many languages. The project stands on the shoulders of the open-source ecosystems of the Django-based Weblate (localization), PyTorch (deep learning) and OPUS (open translation corpora and models). Its approach is comparable to the interactive, statistical machine translation offered by MateCat. Have a look at the translation space to test the prototype.


CAT


The usage of CAT systems has improved the quality and speed of work in the translation industry.
As the translation pattern we use a multilingual design that enables several forms of assistance:

Every CAT system needs its own post-editing guide. We encourage the use of LanguageTool as an automatic filter, grammar checker and stylistic assistant (see the first sketch after this list).
The biggest issues during correction are single words in the source and target sentence. This can be eased with a multilingual dictionary integration.
Further assistance comes from corrected languages used as additional sources: on the one hand as an extra knowledge source for the post-editor, and on the other hand as enriched input for the machine translation.
The multilingual translation pattern allows us to rank the suggestions by quality with the universal quality estimation. The calculated metric score has a correlation coefficient over 0.8 with BLEU and is language independent (a hedged sketch follows this list).
The interface, adapted to the backend, offers interactive suggestions and lets the post-editor exchange critical words or auto-complete the sentence (see the last sketch below).
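
To make the LanguageTool item above concrete, here is a minimal sketch of an automatic grammar and style check, assuming the community language_tool_python wrapper rather than Multi Source's actual integration:

    import language_tool_python

    # One LanguageTool instance per target language; reused across sentences.
    tool = language_tool_python.LanguageTool("en-US")

    def check_target(text: str) -> list[str]:
        """Return human-readable grammar and style findings for a target sentence."""
        return [f"{m.ruleId}: {m.message} (suggestions: {m.replacements[:3]})"
                for m in tool.check(text)]

    for finding in check_target("This sentence are grammatically wrong."):
        print(finding)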
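
The universal quality estimation itself is not specified here; as a stand-in, the following sketch approximates reference-free quality estimation with the cosine similarity of multilingual sentence embeddings. The sentence-transformers model name is an assumption for illustration, not the project's metric:

    from sentence_transformers import SentenceTransformer, util

    # Multilingual encoder; the model name is an assumption for this sketch.
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    def qe_score(source: str, suggestion: str) -> float:
        """Cosine similarity between source and suggestion embeddings as a QE proxy."""
        src, tgt = encoder.encode([source, suggestion], convert_to_tensor=True)
        return util.cos_sim(src, tgt).item()

    suggestions = ["Das ist ein Test.", "Das ist eine Katze."]
    ranked = sorted(suggestions,
                    key=lambda s: qe_score("This is a test.", s),
                    reverse=True)
    print(ranked)  # the better translation should rank first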
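
And for the auto-completion item, a minimal sketch of prefix-seeded decoding with the publicly released mBART-50 checkpoint on Hugging Face; the model name and the prefix handling are assumptions, not the project's actual backend API:

    import torch
    from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

    name = "facebook/mbart-large-50-many-to-many-mmt"
    tokenizer = MBart50TokenizerFast.from_pretrained(name, src_lang="en_XX", tgt_lang="de_DE")
    model = MBartForConditionalGeneration.from_pretrained(name)

    source = "The usage of CAT systems has improved the quality in the translation industry."
    accepted_prefix = "Der Einsatz von CAT-Systemen"  # prefix already confirmed by the post-editor

    inputs = tokenizer(source, return_tensors="pt")
    # mBART-50 decoders start with </s> followed by the target language code,
    # so we seed the decoder with [</s>, de_DE, accepted prefix tokens].
    prefix_ids = tokenizer(text_target=accepted_prefix, add_special_tokens=False).input_ids
    decoder_input_ids = torch.tensor([[model.config.eos_token_id,
                                       tokenizer.lang_code_to_id["de_DE"]] + prefix_ids])
    output = model.generate(**inputs, decoder_input_ids=decoder_input_ids, max_length=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))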

NMT


With the well-tested Transformer architecture, various preprocessing and training techniques are used to build a feature-rich, interactive neural machine translation system.
The following timeline displays each step of the NMT open-sourcing based on mBART:

Collecting monolingual, bilingual and multilingual corpora, mostly from Common Crawl, WikiMatrix, Tatoeba, the UN Corpus and OPUS.
Training a classification model with fastText to enhance the data with domain labels. If the translation domain isn't given, we can use this model to automatically set the domain category as an input feature (a sketch follows this timeline).
Creating a multidomain and multilingual corpus for training. Non-existing language pairs are created with the Human Language Project.
For the tagged joint training we use data augmentation with synonym replacement (sketched after this timeline). The training consists of multiple inputs for translating from multiple languages at once, automatic post-editing and translation memory integration. These synthetic inputs are validated, like the final translation, with the classification model to ensure domain adaptation. Moreover, the XLM-R language model is used for noisy channel modeling.
A final exercise completes and releases the positive dialectic of the Enlightenment with the 25 supported languages.
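
As a concrete illustration of the domain-labelling step, here is a small sketch with the official fastText Python bindings; the training file name and the label set are assumptions:

    import fasttext

    # domain_train.txt holds one sentence per line in fastText format, e.g.
    # "__label__legal The parties agree to the following terms ..."
    model = fasttext.train_supervised(input="domain_train.txt",
                                      lr=0.5, epoch=10, wordNgrams=2)

    # Tag an unlabelled sentence with its most likely domain.
    labels, probabilities = model.predict("The patient was administered 20 mg daily.")
    print(labels[0], probabilities[0])  # e.g. __label__medical 0.97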
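
The synonym replacement mentioned in the augmentation step could look like the following sketch, using WordNet as an assumed synonym source in place of whatever the project actually uses:

    import random
    import nltk
    from nltk.corpus import wordnet

    nltk.download("wordnet", quiet=True)

    def synonym_replace(sentence: str, n: int = 1) -> str:
        """Replace up to n words that have WordNet synonyms with a random synonym."""
        words = sentence.split()
        candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
        for i in random.sample(candidates, min(n, len(candidates))):
            synonyms = {lemma.name().replace("_", " ")
                        for synset in wordnet.synsets(words[i])
                        for lemma in synset.lemmas()}
            synonyms.discard(words[i])
            if synonyms:
                words[i] = random.choice(sorted(synonyms))
        return " ".join(words)

    print(synonym_replace("The translator corrected the sentence quickly."))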

Contact and Feedback


Please contact Kalle Hilsenbek via fourth_empowerment@protonmail.com with your feature or corpus requests.