Welcome to Multi Source

The open-source project for the seamless combination of Computer Assisted Translation (CAT)
and interactive Neural Machine Translation (NMT)

About the software

Multi Source combines the emerging possibilities of web-based CAT systems and interactive NMT to deliver a complete package for document translation in many languages and domains. The project stands on the shoulders of the open-source ecosystems around Weblate (Django-based localization) and PyTorch (deep learning). Its approach is comparable to the interactive, statistical machine translation offered by MateCat.


The usage of CAT systems has improved the quality and speed of work in the translation industry.
As the translation pattern we use a multilingual design that enables several forms of assistance:

Every CAT system needs its own post-editing guide. We encourage using LanguageTool as an automatic filter, grammar checker and stylistic assistant (see the first sketch after this list).
The biggest issues during correction are single words in the source and target sentence. This can be eased by integrating a multilingual dictionary.
Further assistance comes from already corrected languages serving as additional sources: for the post-editor as a further source of knowledge, and for the machine translation as enriched input.
The multilingual translation pattern allows us to rank the suggestions by quality with universal quality estimation. The calculated metric score has a correlation coefficient of over 0.8 with BLEU and is language independent (a ranking sketch follows this list).
The interface, tightly coupled with the backend, provides interactive suggestions, the exchange of critical words, and sentence auto-completion (a decoding sketch follows this list).
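
A minimal sketch of the LanguageTool integration mentioned above, assuming the language_tool_python package (the project text does not prescribe a specific client):

```python
# Sketch: check a post-edited target sentence with LanguageTool.
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")

def check_sentence(sentence):
    """Return grammar and style issues for one target sentence."""
    matches = tool.check(sentence)
    for m in matches:
        # Each match carries the rule that fired and suggested replacements.
        print(m.ruleId, "-", m.message, "->", m.replacements[:3])
    return matches

check_sentence("This are a example sentence with mistakes.")
```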
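
A sketch of ranking suggestions by quality estimation; qe_score is a hypothetical stand-in for the universal quality-estimation metric described above, replaced here by a dummy length-ratio score so the sketch stays runnable:

```python
# Sketch: rank candidate translations so the post-editor sees the best first.
def qe_score(source, suggestion):
    """Hypothetical stand-in for the universal quality-estimation metric."""
    return min(len(source), len(suggestion)) / max(len(source), len(suggestion), 1)

def rank_suggestions(source, suggestions):
    return sorted(suggestions, key=lambda s: qe_score(source, s), reverse=True)

print(rank_suggestions("Guten Morgen", ["Good morning", "Hi", "Morning good"]))
```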
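
And a sketch of the interactive auto-completion: the decoder keeps the prefix the post-editor has already accepted and greedily completes the rest. Here `step` is a hypothetical stand-in for one decoding step of the NMT backend:

```python
# Sketch: prefix-constrained greedy completion of a partially edited sentence.
def autocomplete(prefix, step, max_len=20, eos="</s>"):
    """step(tokens) is assumed to return (token, score) candidates."""
    out = list(prefix)  # the accepted prefix stays untouched
    while len(out) < max_len:
        token, _ = max(step(out), key=lambda ts: ts[1])
        if token == eos:
            break
        out.append(token)
    return out

# Dummy 'model' for the sketch: proposes 'world', then ends the sentence.
dummy = lambda toks: ([("</s>", 1.0)] if toks[-1] == "world"
                      else [("world", 0.9), ("</s>", 0.8)])
print(autocomplete(["hello"], dummy))  # ['hello', 'world']
```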


Built on the well-tested Transformer architecture, various preprocessing and training techniques are used to create a full-featured, interactive neural machine translation system. The following timeline displays each step of open-sourcing the NMT:

Collecting monolingual, bilingual and multilingual corpora, mostly from Common Crawl, WikiMatrix and OPUS.
Training a classification model with fastText to enrich the data with domain labels. If the translation domain isn't known, we can use this model to automatically set the domain category as an input feature (see the fastText sketch after this timeline).
Creating a multi-domain, multilingual corpus for (pre-)training. Non-existing language pairs are created with the Human Language Project.
MASS pretraining of the decoder on top of the XLM encoder. This step facilitates every low-resource translation direction among the 100 supported languages (a masking sketch follows the timeline). The released, general sequence-to-sequence model can be used for other tasks such as summarization, paraphrasing or dialogue generation.
For the tagged joint training we use data augmentation with synonym replacement (a replacement sketch follows the timeline). The training consists of multiple inputs for translating from multiple languages at once, automatic post-editing and translation memory integration. These synthetic inputs are validated, like the final translation, with the classification model to ensure domain adaptation.
We open-source the trained model under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license so it can be widely leveraged, and evaluate it with a simple interface and public API.
A final exercise completes and releases the UN Corpus with all 100 supported languages. The initial XLM-R language model is used for noisy channel modeling (a scoring sketch follows the timeline).
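
A sketch of the domain-labelling step with the fastText Python bindings; train.txt is an assumed training file with one __label__<domain>-prefixed sentence per line:

```python
# Sketch: train a domain classifier and predict a label for a new sentence.
import fasttext

# Assumed training data, e.g.:
# __label__legal The parties agree to the following terms .
# __label__medical The patient was given 5 mg of the drug .
model = fasttext.train_supervised(input="train.txt", epoch=25, wordNgrams=2)

labels, probs = model.predict("The court dismissed the appeal .")
print(labels[0], float(probs[0]))  # e.g. __label__legal 0.97
```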
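
To illustrate the MASS objective: the encoder sees the sentence with a contiguous span masked out, and the decoder must reconstruct exactly that span. A minimal data-preparation sketch:

```python
# Sketch: build one MASS-style pretraining example from a tokenized sentence.
import random

MASK = "<mask>"

def mass_example(tokens, span_ratio=0.5):
    """Mask a contiguous span in the encoder input; the decoder target
    is the masked span itself."""
    n = max(1, int(len(tokens) * span_ratio))
    start = random.randrange(len(tokens) - n + 1)
    encoder_input = tokens[:start] + [MASK] * n + tokens[start + n:]
    decoder_target = tokens[start:start + n]
    return encoder_input, decoder_target

print(mass_example("the quick brown fox jumps over the lazy dog".split()))
```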
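
A sketch of the synonym-replacement augmentation, here with NLTK's WordNet as an assumed synonym source (the project text does not name one):

```python
# Sketch: replace a few tokens with random WordNet synonyms.
# Requires a one-time nltk.download("wordnet").
import random
from nltk.corpus import wordnet as wn

def synonym_replace(tokens, n_swaps=2):
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if wn.synsets(t)]
    random.shuffle(candidates)
    for i in candidates[:n_swaps]:
        synonyms = {l.name().replace("_", " ")
                    for s in wn.synsets(out[i]) for l in s.lemmas()}
        synonyms.discard(out[i])
        if synonyms:
            out[i] = random.choice(sorted(synonyms))
    return out

print(synonym_replace("the committee approved the new budget".split()))
```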
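
Finally, a sketch of noisy-channel reranking, which scores a candidate y for source x as log P(x|y) + λ·log P(y); channel_logprob and lm_logprob are hypothetical stand-ins for the reverse translation model and the XLM-R-based language model, with dummy values so the sketch runs:

```python
# Sketch: rerank candidate translations with a noisy-channel score.
def channel_logprob(source, target):
    """Hypothetical: log P(source | target) from a reverse translation model."""
    return -abs(len(source) - len(target))  # dummy value for the sketch

def lm_logprob(target):
    """Hypothetical: log P(target) from the language model."""
    return -len(target.split())  # dummy value for the sketch

def noisy_channel_score(source, target, lam=0.5):
    return channel_logprob(source, target) + lam * lm_logprob(target)

candidates = ["Good morning", "Morning good everyone"]
print(max(candidates, key=lambda y: noisy_channel_score("Guten Morgen", y)))
```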

Contact and Feedback

Please contact Kalle Hilsenbek via fourth_empowerment@protonmail.com with your feature or corpora requests.