Difference between revisions of "Microsoft Translator"

From Lingoport Wiki
m
(add documentation on the custom translator portal)
 
Line 33: Line 33:
   
 
==== Category ID ====
 
==== Category ID ====
The category ID is an identifier that allows you to use the [[#Train_a_Custom_Translator_for_dictionaries_and_trained_models|Custom Translator]] feature of Microsoft Translator for custom models or dictionaries. A category ID is not required; leave the entry blank to use Microsoft’s built-in translator.
+
The category ID is an identifier that allows you to use the [[#Use_the_Custom_Translator_portal_for_dictionaries_and_trained_models|Custom Translator]] feature of Microsoft Translator for custom models or dictionaries. A category ID is not required; leave the entry blank to use Microsoft’s built-in translator.
   
 
=== Test Connection ===
 
=== Test Connection ===
Line 45: Line 45:
 
For more information on ''Access Level'' and linking this vendor to your project see [[L10n_Vendors_and_Integration#Standard_Vendor_Info|Standard Vendor Info]].
 
For more information on ''Access Level'' and linking this vendor to your project see [[L10n_Vendors_and_Integration#Standard_Vendor_Info|Standard Vendor Info]].
   
== Train a Custom Translator for dictionaries and trained models ==
+
== Use the Custom Translator portal for dictionaries and trained models ==
  +
The [https://portal.customtranslator.azure.ai Custom Translator portal] provides a web interface that the department responsible for setting up the Microsoft Translator service can use to manage and customize the machine translation output. [[#Train_your_own_model|Custom trained models]] provide context-sensitive customization by fine-tuning one of Microsoft Translator’s base models on your company’s own pre-existing translations, but they require large amounts of data to train. [[#Train_a_Custom_Translator_with_a_dictionary|Dictionaries]], on the other hand, work with datasets of any size and ignore context: they force specific sentences or phrases to always receive a desired translation.
We recommend following Microsoft’s [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/custom-translator/quickstart-build-deploy-custom-model Quickstart Tutorial] on the subject to learn more about how to train and deploy a custom model for translation.
 
  +
  +
To use either of these features, a workspace must first be created in the Custom Translator portal; see [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/custom-translator/quickstart-build-deploy-custom-model Microsoft’s Custom Translator quickstart] for more details on workspaces. Any number of projects can be created in this workspace. Each project gets a [[#Category_ID|Category ID]], which is used when configuring the vendor. A project can have only one Custom Translator deployed at a time, so create multiple projects if you plan to have multiple Custom Translators deployed simultaneously. Files for dictionaries or custom trained models can then be uploaded to a project for training.
  +
  +
=== Train your own model ===
  +
Training your own model, or rather fine-tuning one of Microsoft Translator’s built-in models on your company’s own pre-existing translations, can produce domain-specific machine translation results that are also context-sensitive and more natural sounding. The cost is that a large dataset is needed: Microsoft Translator requires at least 10,000 sentence pairs to train a model. See [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/custom-translator/document-formats-naming-convention this list] from Microsoft’s documentation for a full rundown of supported file formats. We recommend using TMX files when possible, as they are supported by many translation vendors.
  +
  +
Upload the document(s) for training to the [https://portal.customtranslator.azure.ai/ Custom Translator portal] by clicking on the Documents tab and then the Upload files button. Fill in the relevant information in the upload dialog that pops up (if using TMX you can skip over the Parallel Data section), and then click on the Upload button.
  +
  +
Once all the documents are uploaded, the model can be trained. Click on the Projects tab and select the project you want to train the model for. Use the checkboxes next to the documents that you have uploaded to select which ones will be used in the training of the model. Then click the Create model button to start the training process. Once the training has finished, the model can be deployed on that project.
  +
  +
=== Train a Custom Translator with a dictionary ===
  +
Dictionaries are a good way to make the machine translator more domain-specific because they work with datasets of any size. Dictionaries are files of sentence (or phrase) pairs, just like the files used to train a model, but they are marked as dictionaries upon upload. When a Custom Translator is trained with a dictionary file, the sentence or phrase pairs in that dictionary are always translated as specified, regardless of context.
  +
  +
A Custom Translator can contain both a custom model and dictionaries: select all of the training files as well as the dictionary files before training the Custom Translator. It is also possible to train a Custom Translator with dictionary files only, leaving out the custom model altogether. This results in a much faster (and therefore cheaper) training run, and it allows domain-specific translation without having to meet the minimum sentence pair count. When a Custom Translator is trained with dictionaries only, it uses the Microsoft Translator base model for all translations that are not present in the dictionaries.

Latest revision as of 21:08, 19 May 2021

Localyzer integration with the Microsoft Azure Cognitive Services Translator

How the process works

For information on Machine Translation and how the process works see Localyzer Integration with Machine Translation.

Supported Localyzer File Types

Not all file types are supported for machine translation. For a list of supported types see Supported File Types.

Setting up your Localyzer group or project to use Microsoft Translator

To machine translate your resource files using Microsoft Translator, you need access to the Lingoport Jenkins Server. The vendor setup is located in the Jenkins System Configuration.

Jenkins -> Manage Jenkins -> Configure System

Add Microsoft Translator Vendor

AddMicrosoftVendor.png

The Add Microsoft Translator Vendor button can be found under the LRM L10n Vendor Setup section of the Jenkins Configure System page. After you click it, the Standard Info and Microsoft Translator Credentials field groups are displayed.

MicrosoftGroupVendor.png

Standard Info

This is the information that is standard to all vendor types.

For more information on setting up the standard vendor info see Standard Vendor Info.

Microsoft Translator Credentials

All of the Microsoft Translator credential information should be available through the department responsible for setting up your Microsoft Azure Cognitive Services Translator.

Subscription Key

The subscription key, also sometimes referred to as the API key or the credential key, is a key-code, not meant to be shared publicly, that grants access to the Microsoft Translator resource.

Category ID

The category ID is an identifier that allows you to use the Custom Translator feature of Microsoft Translator for custom models or dictionaries. A category ID is not required; leave the entry blank to use Microsoft’s built-in translator.
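
As a rough illustration of how the subscription key and category ID are used, the sketch below builds a request for the Translator Text API v3 translate endpoint. The helper name build_translate_request, the region argument, and the sample values are assumptions made for this example; it is not necessarily how Localyzer issues the call. When the category ID is blank, the category parameter is omitted and the built-in translator is used.

```python
import json
import urllib.parse
import urllib.request

# Global Translator Text API v3 endpoint (regional endpoints also exist).
ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"


def build_translate_request(text, to_lang, subscription_key,
                            category_id="", region=""):
    """Build (but do not send) a translate request.

    Hypothetical helper for illustration only.
    """
    params = {"api-version": "3.0", "to": to_lang}
    if category_id:
        # Only sent when a Custom Translator project should be used.
        params["category"] = category_id

    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/json; charset=UTF-8",
    }
    if region:
        # Needed for regional and multi-service resources.
        headers["Ocp-Apim-Subscription-Region"] = region

    body = json.dumps([{"Text": text}]).encode("utf-8")
    url = ENDPOINT + "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, data=body, headers=headers,
                                  method="POST")
```

Passing the returned object to urllib.request.urlopen with a valid key sends the request; keep the key out of source control, since it is not meant to be shared publicly.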

Test Connection

The Test Connection button tests the validity of the entered credentials. If the credentials are valid, the message "Network Access Test Succeeded!" is displayed; otherwise, an error message is displayed.

Saving vendor info and linking the vendor to your project

When the Save button is clicked, the vendor information will be stored in the Jenkins com.lingoport.plugins.jenkinsgyzrlrmplugin.global.GlobalSettings.xml config file.

If the Access Level is Group, the /var/lib/jenkins/Lingoport_Data/L10nStreamlining/<group name>/config/config_l10n_vendor.properties file is created containing the vendor information. Otherwise, config_l10n_vendor.properties is not created until the vendor is used by a project.
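
To check whether the group-level file has been written, the path can be assembled from the group name. The following is a minimal sketch; the function name group_vendor_config and the group name in the usage note are hypothetical, and the default Jenkins home is taken from the path shown above.

```python
from pathlib import PurePosixPath


def group_vendor_config(group_name, jenkins_home="/var/lib/jenkins"):
    """Return the expected location of a group's vendor properties file.

    Hypothetical helper; the directory layout follows the path given above.
    """
    return (PurePosixPath(jenkins_home) / "Lingoport_Data" / "L10nStreamlining"
            / group_name / "config" / "config_l10n_vendor.properties")
```

For example, os.path.exists(str(group_vendor_config("MyGroup"))) run on the Jenkins server would report whether the file exists for a group called MyGroup.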

For more information on Access Level and linking this vendor to your project see Standard Vendor Info.

Use the Custom Translator portal for dictionaries and trained models

The Custom Translator portal provides a web interface that the department responsible for setting up the Microsoft Translator service can use to manage and customize the machine translation output. Custom trained models provide context-sensitive customization by fine-tuning one of Microsoft Translator’s base models on your company’s own pre-existing translations, but they require large amounts of data to train. Dictionaries, on the other hand, work with datasets of any size and ignore context: they force specific sentences or phrases to always receive a desired translation.

To use either of these features, a workspace must first be created in the Custom Translator portal; see Microsoft’s Custom Translator quickstart for more details on workspaces. Any number of projects can be created in this workspace. Each project gets a Category ID, which is used when configuring the vendor. A project can have only one Custom Translator deployed at a time, so create multiple projects if you plan to have multiple Custom Translators deployed simultaneously. Files for dictionaries or custom trained models can then be uploaded to a project for training.

Train your own model

Training your own model, or rather fine-tuning one of Microsoft Translator’s built-in models on your company’s own pre-existing translations, can produce domain-specific machine translation results that are also context-sensitive and more natural sounding. The cost is that a large dataset is needed: Microsoft Translator requires at least 10,000 sentence pairs to train a model. See this list from Microsoft’s documentation for a full rundown of supported file formats. We recommend using TMX files when possible, as they are supported by many translation vendors.
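
Before uploading, it can be useful to check whether a TMX file is likely to meet the 10,000 sentence pair minimum. The sketch below counts translation units that contain non-empty segments in both languages. The function name count_pairs is hypothetical, and language codes are compared exactly (ignoring case), so "en" and "en-US" are treated as different languages. Microsoft may also filter or deduplicate sentences during training, so treat the result as an estimate.

```python
import xml.etree.ElementTree as ET

MIN_SENTENCE_PAIRS = 10_000  # minimum stated in the paragraph above

# TMX 1.4 uses xml:lang on <tuv>; older files may use a plain lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def count_pairs(tmx_source, source_lang, target_lang):
    """Count <tu> elements with a non-empty <seg> in both languages.

    tmx_source may be a file name or a binary file object.
    """
    wanted = {source_lang.lower(), target_lang.lower()}
    count = 0
    # iterparse streams the file, so large translation memories fit in memory.
    for _, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        found = set()
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None and "".join(seg.itertext()).strip():
                found.add(lang)
        if wanted <= found:
            count += 1
        elem.clear()  # free the processed translation unit
    return count
```

A file for which count_pairs(path, "en", "de") is below MIN_SENTENCE_PAIRS is probably better suited to the dictionary approach described in the next section.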

Upload the document(s) for training to the Custom Translator portal by clicking on the Documents tab and then the Upload files button. Fill in the relevant information in the upload dialog that pops up (if using TMX you can skip over the Parallel Data section), and then click on the Upload button.

Once all the documents are uploaded, the model can be trained. Click on the Projects tab and select the project you want to train the model for. Use the checkboxes next to the documents that you have uploaded to select which ones will be used in the training of the model. Then click the Create model button to start the training process. Once the training has finished, the model can be deployed on that project.

Train a Custom Translator with a dictionary

Dictionaries are a good way to make the machine translator more domain-specific because they work with datasets of any size. Dictionaries are files of sentence (or phrase) pairs, just like the files used to train a model, but they are marked as dictionaries upon upload. When a Custom Translator is trained with a dictionary file, the sentence or phrase pairs in that dictionary are always translated as specified, regardless of context.

A Custom Translator can contain both a custom model and dictionaries: select all of the training files as well as the dictionary files before training the Custom Translator. It is also possible to train a Custom Translator with dictionary files only, leaving out the custom model altogether. This results in a much faster (and therefore cheaper) training run, and it allows domain-specific translation without having to meet the minimum sentence pair count. When a Custom Translator is trained with dictionaries only, it uses the Microsoft Translator base model for all translations that are not present in the dictionaries.
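
Since a dictionary is simply a file of phrase pairs, a small one can be generated from a term list. The sketch below writes the pairs as a minimal TMX 1.4 document, the format recommended above. The function name build_dictionary_tmx and the header values such as "example-script" are placeholders chosen for this example. The resulting file still has to be marked as a dictionary in the upload dialog of the portal.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_dictionary_tmx(pairs, source_lang, target_lang):
    """Return a TMX 1.4 document (as a string) with one <tu> per phrase pair.

    Hypothetical helper; header values are illustrative placeholders.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",
        "creationtoolversion": "1.0",
        "segtype": "phrase",
        "o-tmf": "none",
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source_text, target_text in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((source_lang, source_text),
                           (target_lang, target_text)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

For instance, build_dictionary_tmx([("resource file", "Ressourcendatei")], "en", "de") produces a one-entry dictionary; the term pair is made up for illustration.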