Microsoft Trains World’s Largest Transformer Language Model With 17 Billion Parameters

Transformer-based language generation models are mainly used in the field of natural language processing. The Transformer architecture handles ordered sequences of data, such as natural language, for tasks like machine translation and text summarization, and it does not require that the sequence be processed in order.

Microsoft AI & Research has shared new research on a Transformer-based language generation model that it describes as the largest ever made. The company has also open-sourced DeepSpeed, a deep learning library that makes distributed training of large models easier.



DeepSpeed contains the Zero Redundancy Optimizer (ZeRO) for training models with 100 million parameters or more at scale, which Microsoft used to train its Transformer-based language generation model.
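For a sense of how DeepSpeed plugs into a PyTorch training loop, here is a minimal sketch, assuming a toy model and an illustrative ZeRO stage-1 configuration; none of the values below reflect Microsoft's actual training setup, and the script would be launched with the deepspeed command-line runner.

```python
import torch
import deepspeed

# Toy stand-in model; T-NLG itself is a 78-layer Transformer.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6
)

# Illustrative DeepSpeed config: mixed precision plus ZeRO stage 1,
# which partitions optimizer states across data-parallel workers.
ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 1},
}

# deepspeed.initialize wraps the model in an engine that manages
# data parallelism, ZeRO partitioning, and fp16 loss scaling.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# Dummy data standing in for a real corpus (seq_len, batch, d_model).
batch = torch.randn(10, 16, 512, dtype=torch.half, device=model_engine.device)

output = model_engine(batch)
loss = output.float().pow(2).mean()  # placeholder loss for illustration
model_engine.backward(loss)          # engine-managed backward pass
model_engine.step()                  # optimizer step on partitioned states
```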

Microsoft has built the Turing Natural Language Generation model (T-NLG), which has 17 billion parameters across 78 Transformer layers. It outperforms the state-of-the-art models available today and achieves strong results on many downstream natural language processing (NLP) applications.




Comparing T-NLG with earlier models: it is twice as large as NVIDIA's Megatron-LM, the previous largest Transformer-based language generation model, and it has 10 times as many parameters as OpenAI's GPT-2.

T-NLG is a Transformer-based generative language model: it generates words to complete open-ended textual tasks. It can naturally summarize or answer questions about a personal document or email thread, and it can even generate direct answers to questions and summaries of input documents.
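T-NLG itself has not been released, but the same open-ended completion behavior can be sketched with a much smaller public model such as GPT-2 via the Hugging Face transformers library; this is a stand-in for illustration, not Microsoft's model.

```python
from transformers import pipeline

# GPT-2 (here the small 124M-parameter variant) is used purely as a
# stand-in to illustrate open-ended text completion.
generator = pipeline("text-generation", model="gpt2")

prompt = "Microsoft has trained a language model that can"
result = generator(prompt, max_new_tokens=30, num_return_sequences=1)
print(result[0]["generated_text"])
```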




To train the model, researchers at Microsoft used an NVIDIA DGX-2, the world's first 2-petaFLOPS system, which houses multiple NVIDIA V100 GPUs interconnected with InfiniBand. The training data was similar to that used for NVIDIA's Megatron-LM models.

Tensor slicing was applied to shard the model across four NVIDIA V100 GPUs on the NVIDIA Megatron-LM framework. The team also used DeepSpeed with ZeRO to reduce the model-parallelism degree from 16 to 4, quadruple the batch size per node, and cut training time by a factor of three.
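A back-of-the-envelope calculation shows why ZeRO allows the model-parallelism degree to drop from 16 to 4. The byte counts below assume standard mixed-precision Adam training and 32 GB V100s; they are illustrative estimates, not Microsoft's published figures.

```python
PARAMS = 17e9      # T-NLG parameter count
GPU_MEM_GB = 32    # assuming the 32 GB V100 variant

# Mixed-precision Adam keeps, per parameter: 2 B fp16 weights,
# 2 B fp16 gradients, and 12 B of optimizer state (fp32 master
# weights, momentum, and variance).
WEIGHTS_GRADS_B = 4  # replicated on every data-parallel replica
OPT_STATE_B = 12     # partitioned across replicas under ZeRO stage 1

def per_gpu_gb(model_parallel, data_parallel, zero):
    opt = OPT_STATE_B / (data_parallel if zero else 1)
    return PARAMS * (WEIGHTS_GRADS_B + opt) / model_parallel / 1e9

# Without ZeRO, optimizer state is fully replicated, so 16-way model
# parallelism is needed just to fit the model states on one GPU:
print(per_gpu_gb(16, 16, zero=False))  # ~17 GB -> fits in 32 GB
print(per_gpu_gb(4, 64, zero=False))   # ~68 GB -> does not fit

# With ZeRO stage 1 sharding the optimizer state across 64
# data-parallel replicas, 4-way model parallelism suffices, leaving
# headroom for a larger per-node batch size:
print(per_gpu_gb(4, 64, zero=True))    # ~17.8 GB -> fits
```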

The result, T-NLG, can improve systems that leverage NLP for chatbots, document understanding, and sentence/paragraph completion tasks. Among its capabilities, T-NLG is able to simplify and summarize text to provide direct answers to search queries.

For example, instead of returning a paragraph containing the answer to a search query (as many search engines traditionally do), T-NLG returns the direct answer. Similarly, the model can answer zero-shot questions, that is, questions posed without an accompanying context passage.

In terms of numbers and benchmark tests, Microsoft's new T-NLG posts better figures than Megatron-LM 8.3B on both LAMBADA and WikiText-103. Its ROUGE scores are also promising, and T-NLG outperformed an LSTM baseline (CopyNet) in human evaluations of grammatical and factual correctness.
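For context on the metrics: WikiText-103 results are conventionally reported as perplexity (lower is better), and LAMBADA as accuracy at predicting the final word of a passage. A minimal sketch of how perplexity is derived from per-token log-probabilities:

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model assigning probability 0.1 to every token has perplexity 10:
print(perplexity([math.log(0.1)] * 5))  # 10.0
```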


The power of T-NLG is that it is already so adept at understanding text that it doesn't need much supervision to outperform techniques employed previously.

To make T-NLG as versatile as possible for summarizing different types of text, the team of Microsoft researchers fine-tuned the T-NLG model in a multi-task fashion on nearly all publicly available summarization datasets, amounting to approximately four million training instances.

The company reports ROUGE scores (a proxy for how well a generated summary matches the unigrams and bigrams of a reference summary) to compare T-NLG with PEGASUS, another recent Transformer-based language model, and with previous state-of-the-art models.
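To make the metric concrete, here is a simplified ROUGE-N recall computed from raw n-gram overlap; production evaluations use full implementations (with stemming and other normalization), so this toy version is only illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, candidate, n):
    """Fraction of the reference's n-grams that appear in the candidate."""
    ref, cand = ngrams(reference.split(), n), ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)

reference = "microsoft trained a 17 billion parameter language model"
candidate = "microsoft trained a large 17 billion parameter model"
print(rouge_n_recall(reference, candidate, 1))  # ROUGE-1 recall: 7/8
print(rouge_n_recall(reference, candidate, 2))  # ROUGE-2 recall: 4/7
```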

T-NLG has advanced the state of the art in natural language generation, providing new opportunities for Microsoft and its customers. Beyond saving users time by summarizing documents and emails, T-NLG can enhance experiences with the Microsoft Office suite by offering writing assistance to authors and answering questions that readers may ask about a document.

Furthermore, it paves the way for more fluent chatbots and digital assistants, as natural language generation can help businesses with customer relationship management and sales by conversing with customers.
