Introduction to Multi-Task Deep Neural Networks for Natural Language Understanding
BERT (Devlin et al., 2018) achieved state-of-the-art results on multiple NLP problems in 2018. It leveraged the Transformer architecture to learn contextualized word embeddings, so that the resulting vectors carry better meaning across different domains and problems. To extend BERT, Liu et al. proposed Multi-Task Deep Neural Networks (MT-DNN), which achieved new state-of-the-art results on multiple NLP problems. In MT-DNN, BERT builds a shared text representation, while the fine-tuning stage leverages multi-task learning.
This story discusses Multi-Task Deep Neural Networks for Natural Language Understanding (Liu et al., 2019) and covers the following:
- Multi-task Learning
- Data
- Architecture
- Experiment
Multi-task Learning
Multi-task learning is a form of transfer learning. When learning several related things, we do not need to learn everything from scratch; we can apply knowledge learned from other tasks to shorten the learning curve.
Take skiing and snowboarding as an example: you do not need to spend a lot of time learning to snowboard if you have already mastered skiing, because both sports share some skills and you only need to learn the parts that differ. Recently, a friend who had mastered snowboarding told me he only needed about one month to master skiing.
Coming back to data science, researchers and scientists believe that transfer learning can be applied when learning text representations. GenSen (Subramanian et al., 2018) demonstrated that multi-task learning improves sentence embeddings. Part of the text representation can be learned from different tasks, and the shared parameters can be updated through back-propagation to learn better weights.
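To make the idea concrete, here is a tiny PyTorch sketch of hard parameter sharing (not from the paper; the layer sizes and tasks are made up): a shared layer feeds two task-specific heads, and both task losses send gradients back into the same shared weights.

```python
import torch
import torch.nn as nn

# Toy sketch of hard parameter sharing: a shared layer feeds two
# task-specific heads, and gradients from both task losses flow back
# into the same shared weights. Shapes and task choices are illustrative.
shared = nn.Linear(16, 8)              # shared representation layer
head_cls = nn.Linear(8, 2)             # head for a classification task
head_reg = nn.Linear(8, 1)             # head for a regression task

x_cls, y_cls = torch.randn(4, 16), torch.randint(0, 2, (4,))
x_reg, y_reg = torch.randn(4, 16), torch.randn(4, 1)

loss = nn.CrossEntropyLoss()(head_cls(shared(x_cls)), y_cls) \
     + nn.MSELoss()(head_reg(shared(x_reg)), y_reg)
loss.backward()                        # both tasks contribute to shared.weight.grad
```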
Data
The input is a word sequence, which can be a single sentence or two sentences combined with a separator. As in BERT, the sentence(s) are tokenized and transformed into initial word embeddings, segment embeddings and position embeddings. After that, multiple bidirectional Transformer layers are used to learn contextual word embeddings. The difference is that multi-task learning is used to learn the text representation, which is then applied to each individual task in the fine-tuning stage.
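As a quick illustration (using the Hugging Face `transformers` tokenizer, which is an assumption on my part rather than something the post relies on), this is roughly how a single sentence or a sentence pair becomes input ids and segment ids:

```python
from transformers import BertTokenizer

# Illustrative input construction with the Hugging Face `transformers`
# library: a sentence pair is joined with [SEP], and segment ids mark
# which sentence each token belongs to.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

single = tokenizer("The movie was great.")                 # single-sentence input
pair = tokenizer("A man is playing a guitar.",             # sentence-pair input
                 "Someone is playing an instrument.")

print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))  # [CLS] ... [SEP] ... [SEP]
print(pair["token_type_ids"])  # 0 for the first sentence, 1 for the second (segment embeddings)
```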
Architecture of MT-DNN
MT-DNN is trained in two stages. The first stage is pre-training of the Lexicon Encoder and the Transformer Encoder; following BERT, both encoders are trained with masked language modeling and next-sentence prediction. The second stage is fine-tuning, where mini-batch-based stochastic gradient descent (SGD) is applied.

Different from single-task learning, MT-DNN computes losses across different tasks and applies the updates to the same model.
The loss differs across tasks. For classification tasks, cross-entropy loss is used; for text similarity tasks, mean squared error is used; and for ranking tasks, negative log-likelihood is used.
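Putting the two ideas together, a simplified fine-tuning loop might look like the sketch below. Here `shared_encoder`, `heads`, `task_loaders` and `optimizer` are placeholders, and the ranking loss assumes the positive candidate sits at index 0; the real MT-DNN implementation differs in its details.

```python
import random
import torch.nn.functional as F

# Hypothetical sketch of MT-DNN-style fine-tuning: each mini-batch comes
# from a single task, and the task decides which loss is applied.
def task_loss(task, logits, target):
    if task == "classification":   # e.g. sentence classification: cross-entropy
        return F.cross_entropy(logits, target)
    if task == "similarity":       # e.g. text similarity: mean squared error
        return F.mse_loss(logits.squeeze(-1), target)
    if task == "ranking":          # relevance ranking: negative log-likelihood
        return -F.log_softmax(logits, dim=-1)[..., 0].mean()  # assumes positive candidate at index 0
    raise ValueError(task)

def train_epoch(shared_encoder, heads, task_loaders, optimizer):
    # Merge mini-batches from all tasks and shuffle them, as in multi-task SGD.
    batches = [(task, batch) for task, loader in task_loaders.items()
                             for batch in loader]
    random.shuffle(batches)
    for task, (inputs, target) in batches:
        optimizer.zero_grad()
        representation = shared_encoder(inputs)   # shared BERT layers
        logits = heads[task](representation)      # task-specific output layer
        loss = task_loss(task, logits, target)
        loss.backward()                           # updates shared and task-specific weights
        optimizer.step()
```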



In the architecture figure from the paper, the shared layers transform text into contextual embeddings via BERT. After the shared layers, the input goes through a different sub-flow to learn a representation for each specific task. The task-specific layers are trained for specific problems such as single-sentence classification and pairwise text similarity.
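A minimal PyTorch sketch of this layout, assuming the Hugging Face `BertModel` for the shared layers; the head sizes and task names are made up, not the paper's exact configuration:

```python
import torch.nn as nn
from transformers import BertModel

# Illustrative MT-DNN-style model: BERT provides the shared layers, and a
# small task-specific head sits on top for each task.
class MultiTaskModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")   # shared layers
        hidden = self.bert.config.hidden_size
        self.heads = nn.ModuleDict({
            "single_sentence_classification": nn.Linear(hidden, 2),
            "pairwise_text_similarity": nn.Linear(hidden, 1),
        })

    def forward(self, task, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls_vector = out.last_hidden_state[:, 0]   # [CLS] contextual embedding
        return self.heads[task](cls_vector)        # task-specific layer
```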

Experiment
MT-DNN is based on the PyTorch implementation of BERT, and the hyperparameters are (see the sketch after this list):
- Optimizer: Adamax
- Learning rate: 5e-5
- Batch size: 32
- Maximum epoch: 5
- Dropout rate: 0.1
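A minimal sketch of how these hyperparameters map onto PyTorch; the model below is only a stand-in to keep the snippet runnable, not the actual MT-DNN model:

```python
import torch
import torch.nn as nn

# Wiring the listed hyperparameters into PyTorch. The stand-in model is
# illustrative; in practice it would be the shared-BERT multi-task model.
model = nn.Sequential(nn.Linear(768, 768), nn.Dropout(p=0.1), nn.Linear(768, 2))

optimizer = torch.optim.Adamax(model.parameters(), lr=5e-5)  # Adamax, learning rate 5e-5
batch_size = 32
max_epochs = 5
```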
