Transformers: extend the capability of neural networks in NLP

Business Manager
Mon Feb 22, 2021

The transformer I am talking about is not the "Transformers" toy line and film franchise famous for Michael Bay's pyrotechnics, but in a sense it can be just as powerful, or even more so: it is changing the world of natural language processing (NLP) by learning our complex human languages.

Introduced in 2017, the transformer is a fairly new deep learning model, originally designed for natural language translation, that does not require word sequences to be processed in order and can therefore take advantage of parallel computing during training. Hugging Face, an AI company that provides packages and APIs for machine learning practitioners to build, train, and deploy state-of-the-art NLP models, has recently made some exciting transformer releases. Before diving deep into transformers and Hugging Face's new releases, let's first discuss how NLP has been working and developing.

In the last decade, NLP practitioners have often implemented neural networks to find interesting patterns in language. Before the boom of neural networks, NLP tasks relied mainly on term frequencies, on human labor to create and maintain dictionary-like corpora, and on word comparisons against a corpus, approaches that usually miss nuance and newly coined word meanings. Although these techniques are still in use and can solve common language tasks to a great extent, neural networks, which can learn from unlabeled text, unlock some of the limits in understanding our complex human languages.

NLP training often adopts recurrent neural network (RNN) models. An RNN is an artificial neural network in which the output of each step is looped back into the hidden layer, where it is analyzed together with the input of the next step. The output at each new step therefore carries a memory of the entire history of the sequence so far.
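The looping idea above can be sketched in a few lines of NumPy. This is a toy illustration, not any particular library's API, and the layer sizes are made up for the example: the hidden state `h` is the "memory" that gets combined with each new input, one step at a time.

```python
import numpy as np

# Toy sizes for illustration only (assumptions, not from the article).
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8

W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden: the loop
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """Mix the current input with the memory of all previous steps."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a toy 5-word sequence; note the loop is inherently serial.
sequence = rng.normal(size=(5, input_dim))
h = np.zeros(hidden_dim)
for x_t in sequence:
    h = rnn_step(x_t, h)  # h now summarizes every word seen so far

print(h.shape)  # (8,)
```

Because each step needs the previous step's output, the words cannot be processed in parallel, which is exactly the bottleneck discussed below.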

Among these variants, long short-term memory (LSTM) is a popular architecture for NLP training. In a hundred-word paragraph, not every word carries the contextual meaning needed to deliver the paragraph's message. The LSTM architecture has recurrent gates, including a forget gate, that retain only useful information along a word sequence. The model shows promising results in bridging long gaps between significant words. However, LSTM models still rely heavily on the sequential order of words, which makes training slow because it cannot benefit from parallel processing. Even the bidirectional version, Bi-LSTM, cannot fully capture the true meaning of source words: a Bi-LSTM learns the contextual value of a word sequence from left to right and from right to left separately, and then simply concatenates the two.
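The gating idea can also be sketched. The snippet below is a simplified fragment of an LSTM cell-state update, with made-up sizes (it omits the output gate and hidden-state update of a full cell): the forget gate squashes to a value between 0 and 1 that decides how much old memory survives each step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy sizes for illustration only; not a complete LSTM cell.
rng = np.random.default_rng(1)
input_dim, hidden_dim = 4, 8
concat_dim = input_dim + hidden_dim

W_f = rng.normal(size=(hidden_dim, concat_dim)) * 0.1  # forget gate weights
W_i = rng.normal(size=(hidden_dim, concat_dim)) * 0.1  # input gate weights
W_c = rng.normal(size=(hidden_dim, concat_dim)) * 0.1  # candidate content weights

def cell_state_update(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)        # 0..1: how much old memory to keep
    i = sigmoid(W_i @ z)        # 0..1: how much new information to write
    c_tilde = np.tanh(W_c @ z)  # candidate new content
    return f * c_prev + i * c_tilde  # gated memory update

c = cell_state_update(rng.normal(size=input_dim),
                      np.zeros(hidden_dim),
                      np.ones(hidden_dim))
print(c.shape)  # (8,)
```

The multiplicative forget gate is what lets significant words survive long stretches of filler, but the update is still one word at a time.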

Transformer models were created to cope with the limits of LSTM models. Transformers use encoder, decoder, or encoder-decoder architectures to process the words of a sequence simultaneously, and can deeply understand context with the help of multi-head attention and stacked network layers. Bidirectional Encoder Representations from Transformers (BERT) is a family of encoder-based transformer models that turn a word sequence into word embeddings reflecting the position, segment, and context of each word. The Generative Pre-trained Transformer (GPT-n) series, the newest of which is GPT-3, is a decoder-based transformer with billions of parameters whose language model is pre-trained "on a diverse corpus of unlabelled text, followed by discriminative fine-tuning on each specific task," according to OpenAI. However, training a transformer from scratch can be very computationally expensive. For example, according to Lambda Lab's estimate, training the largest GPT-3 model, with 175 billion parameters, would cost $4.6 million in infrastructure and power, and would take 355 years using the lowest-priced GPU cloud on the market.
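The core mechanism behind that simultaneous processing is scaled dot-product attention. A minimal NumPy sketch, with toy sizes chosen for illustration: every position compares itself against every other position in one matrix multiplication, so nothing has to wait for a previous step.

```python
import numpy as np

# Toy sequence length and head dimension (assumptions for the sketch).
rng = np.random.default_rng(2)
seq_len, d_k = 5, 16

Q = rng.normal(size=(seq_len, d_k))  # queries: what each word is looking for
K = rng.normal(size=(seq_len, d_k))  # keys: what each word offers
V = rng.normal(size=(seq_len, d_k))  # values: the content to mix together

scores = Q @ K.T / np.sqrt(d_k)      # all-pairs similarity, in one shot
scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
output = weights @ V                 # each row: a context-aware blend of the sequence

print(output.shape)  # (5, 16)
```

A multi-head layer simply runs several such attention computations in parallel and concatenates the results, which is what makes transformers such a good fit for GPUs.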

To take advantage of researchers’ efforts, general NLP practitioners can use pre-trained models available online, either by cloning the models or via API. Hugging Face is one of the few renowned groups that have developed a pipeline for implementing pre-trained models in a few lines of code. With the motto “Democratizing NLP, one commit at a time!” shown on their LinkedIn profile, it currently offers 41 transformer models, each with different model sizes and varying features. Other than the BERT base models, there are ALBERT, a light-weight model developed by the Google research team; RoBERT, a model that considers the hierarchical structure of a document; and more. On February 18, 2021, Hugging Face announced a new mBART-50 model that can inter-translate 50 languages. A week earlier, it released a new Retrieval-Augmented Generation (RAG) model for open-source use. Developed by Facebook AI and supported by a high-performance distributed execution framework, RAG can use external documents to augment and enrich the model’s language understanding. Very exciting, right?
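"A few lines of code" is not an exaggeration. A minimal sketch using Hugging Face's `transformers` package (assuming it is installed; the first call downloads a default pre-trained model, and the exact model chosen may vary by library version):

```python
from transformers import pipeline

# The pipeline API bundles a pre-trained model, its tokenizer, and all
# pre/post-processing behind a single call.
classifier = pipeline("sentiment-analysis")

result = classifier("Transformers are changing the world of NLP.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```

Swapping the task string (for example to `"translation_en_to_fr"` or `"summarization"`) swaps in a different pre-trained model with the same three-line interface.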

Transformer models are still in the early stages of development, but their capability has been widely recognized and will only become more and more powerful. Natural language understanding, natural language inference tasks such as disinformation detection, question answering, document summarization: transformer models have been tested on all of these, with reliably exceptional performance. As for future commercial use, an earlier article about generating images from text descriptions using GPT-3 is one example. It is believed that the NLP community will keep its focus on transformer models in pursuit of more breakthroughs.



Appears in
2021 - Spring - Issue 4