Navigating the GenAI Frontier: Transformers, GPT, and the Path to Accelerated Innovation

In the vast world of artificial intelligence, where machines aim to understand and talk like humans, there’s one exciting area: Neural Machine Translation (NMT). Imagine a place where language barriers vanish, conversations flow smoothly across borders, and words become bridges between cultures. NMT is our passport to this linguistic adventure.

In this blog, we’ll explore NMT and its history, the Transformer architecture, and how models like GPT-1 learn. So, let’s begin!

First, let's look at the technical details of two influential papers: “Sequence to Sequence Learning with Neural Networks” (often referred to as Seq2Seq) and “Neural Machine Translation by Jointly Learning to Align and Translate.”

1. Sequence to Sequence Learning with Neural Networks (Seq2Seq):

Authors: Ilya Sutskever, Oriol Vinyals, Quoc V. Le
Objective: Seq2Seq addresses the challenge of mapping sequences (e.g., sentences) to other sequences (e.g., translations).
Architecture:
- Utilizes a multilayered Long Short-Term Memory (LSTM) network.
- The encoder processes the input sequence (e.g., English sentence) and encodes it into a fixed-dimensional vector.
- The decoder generates the target sequence (e.g., translation) from this vector.
Achievements:
- Achieved a BLEU score of 34.8 on the WMT’14 English-to-French translation task.
- Outperformed a phrase-based statistical machine translation baseline, which scored 33.3 BLEU on the same dataset.
- Learned sensible phrase and sentence representations that are sensitive to word order.
- Reversing the order of the words in the source sentences (but not the target sentences) improved performance by introducing many short-term dependencies between source and target.
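The source-reversal trick is simple enough to show as a preprocessing step. The snippet below is only a sketch: the function name `reverse_source` and the toy sentence pair are made up for illustration and do not come from the paper.

```python
def reverse_source(pairs):
    """Reverse the token order of each source sentence; targets stay as they are."""
    return [(list(reversed(src)), tgt) for src, tgt in pairs]


# Hypothetical toy pair (source tokens, target tokens).
pairs = [(["the", "cat", "sat"], ["le", "chat", "s'est", "assis"])]
reversed_pairs = reverse_source(pairs)
print(reversed_pairs[0][0])  # ['sat', 'cat', 'the']
```

After reversal, the first source word ("the") sits right next to the start of the target sentence, which is the kind of short-term dependency the paper refers to.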

2. Neural Machine Translation by Jointly Learning to Align and Translate:

Authors: Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio
Objective: Proposes a novel approach to neural machine translation (NMT) that jointly learns alignment and translation.
Key Insights:
- Challenges the use of a fixed-length vector in the basic encoder-decoder architecture.
- Introduces the attention mechanism, allowing the model to automatically search for relevant parts of the source sentence during translation.
- Achieves translation performance comparable to state-of-the-art phrase-based systems for English-to-French translation.
- Qualitative analysis confirms that the (soft-)alignments found by the model align well with human intuition.
Issues Addressed:
- The fixed-length vector limitation.
- Handling long sentences effectively.
- Capturing word order and context.
- Summarizing both the preceding and the following context of each source word, using a bidirectional RNN encoder.
Context: Accepted at ICLR 2015 as an oral presentation.
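To make the attention idea from this paper concrete, here is a minimal sketch of additive attention for a single decoder step. It scores each encoder state j as v · tanh(W s + U h_j), turns the scores into weights with a softmax, and builds a context vector as the weighted sum of encoder states. The matrices, vectors, and function names are toy assumptions chosen for illustration, not trained values or code from the paper.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def additive_attention(decoder_state, encoder_states, W, U, v):
    """Return (weights, context) for one decoder step."""
    projected_state = matvec(W, decoder_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(U, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)  # soft alignment over source positions
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


# Toy values, assumed purely for illustration.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

weights, context = additive_attention(decoder_state, encoder_states, W, U, v)
print(weights)
print(context)
```

The weights form the soft alignment mentioned above: one number per source position, summing to 1, recomputed for every target word.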

These papers significantly advanced the field of NMT, addressing critical issues and introducing attention mechanisms. They remain influential in shaping modern language models!

Introduction to Transformers: Decoding the “Attention is All You Need” Paper

What Is the Transformer?

In their paper, Vaswani and colleagues proposed the Transformer, a model architecture that boldly abandoned recurrence (no more RNNs!) and instead relied entirely on an attention mechanism to draw global dependencies between input and output.

The Core Components of the Transformer

Self-Attention Mechanism:
- The self-attention mechanism allows the model to weigh the importance of different words in a sentence when processing each word.
- Unlike RNNs, which process words sequentially, the Transformer considers all words simultaneously, capturing long-range dependencies efficiently.
- Self-attention computes a weighted sum of all input words, where the weights depend on their relevance to the current word. It’s like having a conversation where you pay attention to the most relevant parts.
Encoder and Decoder:
- The Transformer consists of an encoder and a decoder.
- The encoder processes the input sequence (e.g., English sentence), while the decoder generates the output sequence (e.g., translated sentence).
- Both the encoder and decoder are stacks of layers that combine attention sub-layers with feed-forward neural networks; the decoder additionally attends to the encoder’s output.
Multi-Head Attention:
- Each layer of the encoder and decoder employs multi-head attention.
- Multi-head attention allows the model to capture various types of information simultaneously by applying attention multiple times in parallel.
- It’s like having multiple friends who each focus on different aspects of the conversation.
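As a rough sketch of what "applying attention multiple times in parallel" means, the code below splits each token vector into equal slices, runs scaled dot-product attention on each slice separately, and concatenates the results. This is a deliberate simplification: a real Transformer also applies learned linear projections before the split and after the concatenation, which are left out here. The function names and toy input are assumptions for illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


def multi_head_self_attention(x, num_heads):
    """Split each vector into num_heads slices, attend per slice, concatenate."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        slice_h = [vec[lo:hi] for vec in x]
        head_outputs.append(attend(slice_h, slice_h, slice_h))
    # Concatenate the heads back into d_model-sized vectors, position by position.
    return [sum((head[pos] for head in head_outputs), [])
            for pos in range(len(x))]


# Toy input: 3 tokens with 4 features each (assumed values).
x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
out = multi_head_self_attention(x, num_heads=2)
print(out)
```

Each head sees only its own slice of the features, so the two heads can end up with different attention patterns over the same sentence.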
Position-Wise Feed-Forward Networks:
- These networks apply the same neural network transformation independently to each position in the sequence.
- Because the same weights are used at every position, these networks do not mix information between tokens; they add non-linearity and further transform each token’s representation after attention has gathered context from the rest of the sequence.
Residual Connections and Layer Normalization:
- Transformers use residual connections (skip connections) to mitigate the vanishing gradient problem during training.
- Layer normalization stabilizes the training process by normalizing the activations within each layer.
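The last two components are easier to see together. The sketch below applies a feed-forward network to each position, adds the input back (the residual connection), and then normalizes the result, i.e. LayerNorm(x + FFN(x)). It assumes toy weights and omits the learned gain and bias that layer normalization normally carries; all names are illustrative.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def feed_forward(vec, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between, applied to a single position."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vec)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def ffn_sublayer(sequence, w1, b1, w2, b2):
    """LayerNorm(x + FFN(x)), computed independently for every position."""
    return [layer_norm([x + y for x, y in zip(vec, feed_forward(vec, w1, b1, w2, b2))])
            for vec in sequence]


# Toy weights (assumed): model width 2, hidden width 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
b2 = [0.0, 0.0]

sequence = [[1.0, 2.0], [3.0, -1.0], [1.0, 2.0]]
output = ffn_sublayer(sequence, w1, b1, w2, b2)
print(output)
```

Positions 0 and 2 hold the same input vector and therefore get the same output, which is what "position-wise" means: the network has no idea where in the sequence a vector came from.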

Why We Need Transformers

Before the Transformer, sequence-to-sequence problems (such as neural machine translation) relied heavily on recurrent neural networks (RNNs) within an encoder-decoder architecture. However, RNNs faced limitations when dealing with long sequences. Their ability to retain information from the initial elements diminished as new elements were incorporated.

In the encoder, each hidden state mostly reflected the most recent words of the input sentence. Unfortunately, if the decoder only accessed the last hidden state, it lost crucial information about the sequence’s beginning. To address this limitation, the attention mechanism was introduced. Instead of relying solely on the last encoder state (as the basic encoder-decoder did), attention allowed the decoder to look at all encoder states, accessing information about every input element. This mechanism extracted information from the entire sequence, enabling the decoder to assign varying importance to different input elements for each output element. It learned to focus on the right input elements to predict the next output element.

However, this approach still had a significant limitation: each sequence had to be processed one element at a time. Both the encoder and decoder had to wait until the completion of previous steps to process the current step. For large corpora, this became computationally inefficient. The Transformer removes this bottleneck by dropping recurrence and processing all positions of a sequence in parallel.

How Each Transformer Component Works

Encoder-Decoder Architecture:
- The Transformer model comprises two main components: the encoder and the decoder.
- These components work together to handle sequence-to-sequence tasks, such as machine translation.
- The encoder processes the input sequence (source language) and encodes it into a set of context-aware representations.
- The decoder then generates the output sequence (target language) based on these representations.
Self-Attention Mechanism:
- The core innovation of Transformers lies in their use of self-attention.
- Self-attention allows each word in the input sequence to attend to all other words, capturing long-range dependencies and contextual relationships.
- Here’s how it works:
  - For each word in the input, the self-attention mechanism computes a weighted sum of the representations of all words in the sequence (including the word itself).
  - The weights are determined by the relevance of each word to the current word.
  - This process creates a context vector for each word, considering its interactions with the entire sequence.
  - The context vectors are then used to update the word representations.
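The four steps above can be sketched as scaled dot-product self-attention, the form used in the Transformer paper: softmax(Q Kᵀ / sqrt(d_k)) V. In this sketch the projection matrices `w_q`, `w_k`, and `w_v` are set to the identity purely as a toy assumption (in a trained model they are learned), and the helper names are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Return (weights, context) for a sequence of token vectors x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    # Relevance of every word to every other word, scaled by sqrt(d_k).
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    # One row of weights per word; each row sums to 1.
    weights = [softmax(row) for row in scores]
    # Context vector per word: weighted sum of the value vectors.
    context = matmul(weights, v)
    return weights, context


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy tokens, 2 features each
weights, context = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Row i of `weights` tells you how much token i draws from every token in the sequence. All rows come out of a few matrix products, which is why the whole sequence can be processed at once.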
Encoder:
- The encoder takes raw text as input and processes it step by step:
  - Tokenization: The input text is split into tokens (usually words or subwords).
  - Embedding Layer: Each token is converted into a dense vector representation (embedding).
  - Positional Encoding: Since Transformers lack inherent positional information (unlike RNNs), positional encodings are added to the embeddings before the first layer.
  - Self-Attention Layers: Each layer combines self-attention with a feed-forward network, allowing the encoder to capture context.
  - Stacking: The layers are stacked to create a deep representation of the input.
  - Output: The final encoder output contains context-aware representations for each token.
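One concrete way to build positional encodings is the fixed sinusoidal scheme from the Transformer paper: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The sketch below computes them and adds them to a toy embedding matrix; the function names and embedding values are assumptions for illustration, and learned position embeddings are an equally valid alternative.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: one d_model-sized vector per position."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


def add_positions(embeddings):
    """Add the positional encoding to each token embedding, element by element."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]


pe = positional_encoding(seq_len=3, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]

# Toy embeddings (assumed): three identical tokens become distinguishable by position.
embeddings = [[0.5, 0.5, 0.5, 0.5]] * 3
encoded = add_positions(embeddings)
```

Without this step, three identical tokens would look exactly the same to self-attention no matter where they appear in the sentence.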
Decoder:
- The decoder generates the output sequence based on the encoder’s representations:
  - Positional Encoding: Similar to the encoder, positional encodings are added to the decoder’s embeddings.
  - Masked Self-Attention: The decoder applies self-attention over the target tokens generated so far, with a mask that prevents each position from attending to future tokens during training.
  - Cross-Attention: In addition to self-attention, each decoder layer uses cross-attention to attend to the encoder’s output.
  - Stacking: The decoder layers are stacked to create a deep representation.
  - Output: The final decoder output provides predictions for the target sequence.
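Masking is the least intuitive part of the decoder, so here is a small sketch of it. Scores for future positions are set to negative infinity before the softmax, which gives them a weight of exactly zero. The score matrix is made up for illustration and the function names are assumptions.

```python
import math


def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, ignoring disallowed positions."""
    masked = [s if allowed else float("-inf") for s, allowed in zip(scores, mask_row)]
    m = max(masked)  # finite, because a position may always attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw attention scores for a 3-token target sequence.
scores = [[0.2, 1.5, -0.3],
          [0.9, 0.1, 2.0],
          [0.4, 0.4, 0.4]]
mask = causal_mask(3)
weights = [masked_softmax(row, mask_row) for row, mask_row in zip(scores, mask)]
print(weights[0])  # [1.0, 0.0, 0.0]
```

The first token can only look at itself, the second at the first two, and so on, so the model cannot peek at the word it is supposed to predict.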
Limitations:
- Transformers are still constrained by an input context window. Extremely long sequences may be truncated or split.
- Despite this limitation, Transformers have revolutionized NLP by enabling efficient training and transfer learning.

So far, we have seen that the Transformer introduced a paradigm shift in NLP by leveraging attention mechanisms, parallel computation, and global dependencies. Its architecture has become the foundation for various state-of-the-art models, including BERT, GPT, and T5.

Next, let's discuss how GPT-1 (Generative Pre-trained Transformer) is trained from scratch.

Pre-training:
- GPT-1 was released in June 2018.
- During pre-training, it learns the general patterns and rules of natural language.
- It uses a large corpus of unlabeled text; the GPT-1 paper used BooksCorpus, a collection of more than 7,000 unpublished books.
- The goal is to predict the next token given the preceding context.
- At each step, GPT-1 outputs a probability distribution over all tokens in the vocabulary for the next position.
- The context window size (parameter k) determines how many previous tokens it looks at.
- The training objective is the standard language modeling objective: maximize the log-likelihood of each token given its preceding context (equivalently, minimize the negative log-likelihood).
- Unlike BERT (Bidirectional Encoder Representations from Transformers), which predicts tokens based on context from both sides, GPT-1 only uses the previous context.
- The intuition behind this objective is that the model learns to generate text by understanding context.
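To see how the context window k and the log-likelihood fit together, here is a minimal sketch that averages the negative log-likelihood of each token given up to k previous tokens. The `predict` callable is hypothetical: it stands in for the model and returns a probability for every candidate next token. The uniform stand-in model and the token ids are assumptions for illustration. Note that the paper states the objective as a sum of log-likelihoods to maximize; averaging the negative version, as done here, leads to the same optimum.

```python
import math


def next_token_nll(token_ids, predict, k):
    """Average negative log-likelihood of each token given up to k previous tokens.

    `predict(context)` is a hypothetical callable returning a dict that maps
    candidate next-token ids to probabilities.
    """
    total = 0.0
    count = 0
    for i in range(1, len(token_ids)):
        context = token_ids[max(0, i - k):i]  # only previous tokens, at most k
        probs = predict(context)
        total += -math.log(probs[token_ids[i]])
        count += 1
    return total / count


def uniform_model(context, vocab_size=4):
    """Stand-in model that ignores the context and spreads probability evenly."""
    return {tok: 1.0 / vocab_size for tok in range(vocab_size)}


loss = next_token_nll([0, 2, 1, 3], uniform_model, k=2)
print(loss)  # equals log(4) for a uniform guess over 4 tokens
```

A model that has learned nothing scores log(vocabulary size); pre-training pushes this number down by putting more probability on the token that actually comes next.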
Fine-tuning:
- After pre-training, GPT-1 is fine-tuned on specific tasks.
- Fine-tuning involves training the model on labeled data for specific applications.
- For example, it can be fine-tuned for sentiment analysis, question answering, or text completion.
- During fine-tuning, the model adapts to the specific task by adjusting its parameters.
- Training passes over the labeled data in mini-batches for a few epochs; the GPT-1 paper reports that about three epochs were sufficient for most tasks.

GPT-1 starts by learning language patterns from a large amount of unlabeled text during pre-training. Then, it fine-tunes its knowledge for specific tasks using labeled data. It’s like learning the basics first and then specializing in a particular field!

References

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
